patternjavaMinor

Web Crawler in Java

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

crawlerwebjava

Problem

I've written a working web crawler in Java that finds the frequencies of words on web pages. I have two issues with it.

-
The organization of my code in WebCrawler.java is terrible. Is there a way I could refactor the class to make it more readable? Perhaps splitting it into multiple classes?

-
I want to find a way to group similar words, so that I don't have entries for every possible word(for example, page and pages).

This is WebCrawler.java:

```
public class WebCrawler {
private static final int MAX_URL_QUEUE = 3000;
private final Object lock = new Object();
private List threads;

private int currentURLNum = 0;
private final int MAXURL;
private Queue urlQqueue;

private URLGroup crawledURLs;
private WordExtractor words;

public WebCrawler(String start, int max, int threads) {
this.MAXURL = max;
urlQqueue = new LinkedList();
urlQqueue.offer(start);
this.threads = new ArrayList();

words = new WordExtractor();
for(int i = 0; i urls = new LinkExtractor().extractFromDocument(doc).getCollectedURLs();

synchronized(bot.lock){
bot.currentURLNum++;
group.add(url);
for(String u : urls){
if(bot.urlQqueue.size() > MAX_URL_QUEUE){
break;
}
bot.urlQqueue.offer(u);
}
System.out.println(bot.currentURLNum);
}
}
for(String u : group.urls){
bot.crawledURLs.add(u);
}
System.out.println(Thread.currentThread() + " is done!");
}

}

private Document getDocument(String url, int timeout) throws IOException {
return Jsoup.connect(url).timeout(timeout).get();
}
// The Main Entry-point
public static void main(String[] args) throws IOException {
String startingUrl = "http://en.wikipedia.org/

Solution

-
If I'm right you could eliminate most of the threading-logic with an ExecutorService and an awaitTermination call. The linked javadoc has some usage examples too. (Don't forget to check the return value of awaitTermination.)

(Effective Java, 2nd edition, Item 47: Know and use the libraries)

-
int max is a little bit mysterious. What does it maximize? I'd put that into the name.

-
urlQqueue is used from multiple threads without proper synchronization (urlQqueue.poll()), it's not thread-safe. Queue has other, thread-safe implementations, I'd use one of them instead of LinkedList.

You could use a BlockingQueue to avoid loops with Thread.yield() (which is rarely approprite, according to the javadoc):

while (bot.currentURLNum < bot.MAXURL) {
    String url = bot.urlQqueue.poll();
    if (url == null) {
        Thread.yield();
        continue;
    }

-
currentURLNum is also not properly synchronized, it is read by the while statement without synchronization.

[...] synchronization has no effect unless both read and write operations are synchronized.

From Effective Java, 2nd Edition, Item 66: Synchronize access to shared mutable data.

-
The reference type could be simply Map here:

private HashMap wordFrequency = new HashMap();

See also: Effective Java, 2nd edition, Item 52: Refer to objects by their interfaces

-
Actually, you could use a Multiset for counting the words. (Here is a similar example)

-
You need some more synchronization for WordExtractor, since the underlying HashMap is not thread safe. printTopWords might not see your latest data. foundWord is also public, so if anyone call he or she could get weird results too.

-
Gauava has a helper method (copyHighestCountFirst) for multisets which might be very similar to the algorithm in the printTopWords method.

(Effective Java, 2nd edition, Item 47: Know and use the libraries The author mentions only the JDK's built-in libraries but I think the reasoning could be true for other libraries too.)

Code Snippets

while (bot.currentURLNum < bot.MAXURL) {
    String url = bot.urlQqueue.poll();
    if (url == null) {
        Thread.yield();
        continue;
    }

private HashMap<String, Integer> wordFrequency = new HashMap<String, Integer>();

Context

StackExchange Code Review Q#51347, answer score: 3

Revisions (0)

No revisions yet.