
Multithreaded webcrawler


Problem

I've been trying to learn Java for the last day or two. This is the first project I am working on, so please bear with me. I worked on a multithreaded web crawler. It is fairly simple but I'd like to ask for some advice.

Purpose / Running

The program starts at one web address (in this code, http://google.com) and looks for all valid URLs inside the response. All URLs found in the response are added to the queue, and the crawler then continues crawling through the URLs in the queue. To stop the crawler, type exit into the console input.
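
For the "looks for all valid URLs inside the response" step, one minimal approach is a regex over href attributes. The class name and pattern below are illustrative only, and a real crawler is better served by an HTML parser such as jsoup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // naive href extraction; misses relative links and tolerates no malformed HTML
    private static final Pattern HREF = Pattern.compile("href=[\"'](http[^\"']+)[\"']");

    public static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the URL inside the quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("<a href=\"http://example.com/a\">x</a>"));
        // → [http://example.com/a]
    }
}
```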

Http.java

```
package com.janchr;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Created by Jan on 8/20/2016.
 */
public class Http {
    public static BufferedReader Get(URL url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");

        // pretend that we are a new-ish browser. current user agent is actually from 2015.
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
        con.setInstanceFollowRedirects(true);

        int statusCode = con.getResponseCode();

        // https://www.mkyong.com/java/java-httpurlconnection-follow-redirect-example/
        boolean redirect = false;
        if (statusCode != HttpURLConnection.HTTP_OK) {
            if (statusCode == HttpURLConnection.HTTP_MOVED_TEMP
                    || statusCode == HttpURLConnection.HTTP_MOVED_PERM
                    || statusCode == HttpURLConnection.HTTP_SEE_OTHER)
                redirect = true;
        }

        if (redirect) {
            // get redirect url from "location" header field
            String newUrl = con.getHeaderField("Location");
            // get the cookie if needed
            String cookies = con.getHeaderField("Set-Cookie");

            // open a new connection to the redirect target, carrying the cookie along
            con = (HttpURLConnection) new URL(newUrl).openConnection();
            con.setRequestProperty("Cookie", cookies);
        }

        return new BufferedReader(new InputStreamReader(con.getInputStream()));
    }
}
```

Solution

Nice work for someone with only a few days of Java experience! Now some improvements:

You are working on a derivative of the classic producer-consumer problem. This is a common problem, and there are well established patterns for solving it in Java.

The abstraction you want to use here is called an ExecutorService. Essentially, it allows you to submit Runnables to be executed by the ExecutorService. You can easily construct an ExecutorService using Executors#newFixedThreadPool. We can make a few modifications to your CrawlThread class to make it work in this new model:

```
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class Crawler implements Runnable {
  private final String url;
  private final Executor executor;
  private final Map<String, SeenUrl> seenUrls;

  public Crawler(
      String url,
      Executor executor,
      Map<String, SeenUrl> seenUrls) {
    this.url = url;
    this.executor = executor;
    this.seenUrls = seenUrls;
  }

  @Override
  public void run() {
    List<String> newUrls = parse(); // Very similar to your parse
    for (String newUrl : newUrls) {
      synchronized (seenUrls) {
        if (seenUrls.containsKey(newUrl)) {
          seenUrls.get(newUrl).timesSeen++;
        } else {
          seenUrls.put(newUrl, new SeenUrl(newUrl));
          executor.execute(new Crawler(newUrl, executor, seenUrls));
        }
      }
    }
  }
}

public class Main {
  public static void main(String[] args) throws InterruptedException {
    // Run with 5 threads, adjust as necessary.
    ExecutorService executorService = Executors.newFixedThreadPool(5);
    Map<String, SeenUrl> seenUrls = new LinkedHashMap<>();
    seenUrls
        .put("http://google.com", new SeenUrl("http://google.com"));
    executorService.submit(
        new Crawler("http://google.com", executorService, seenUrls));
    executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
  }
}
```


Now, there may be a few surprises in the above code snippets:

  • VisitedUrls became SeenUrls - I think that's what you were actually counting in your code, although I could be wrong. Additionally, it lets us formulate a simple invariant: only submit a new URL for crawling if it is not in seenUrls.

  • Crawler takes in an Executor, not an ExecutorService - ExecutorService implements Executor, and we don't need the full power of ExecutorService in Crawler; Executor#execute is enough to hand off new work.

  • The program never terminates - I'll let you implement that :). You will probably want to use ExecutorService#shutdown().

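The termination point deserves a sketch: shutdown() alone is not enough, because a crawler keeps submitting new tasks while running. One standard-library pattern is to count in-flight tasks and shut the pool down when the count hits zero. Everything here (ShutdownDemo, the in-memory LINKS graph) is hypothetical scaffolding standing in for real HTTP fetches:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownDemo {
    // Hypothetical stand-in for real HTTP fetching: a tiny in-memory link graph.
    static final Map<String, List<String>> LINKS = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c"),
            "c", List.of());

    static Set<String> crawl(String start) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger();

        class Task implements Runnable {
            final String url;
            Task(String url) { this.url = url; }
            @Override public void run() {
                try {
                    for (String next : LINKS.get(url)) {
                        if (seen.add(next)) {            // submit on first sighting only
                            pending.incrementAndGet();   // count BEFORE submitting
                            pool.execute(new Task(next));
                        }
                    }
                } finally {
                    // The last task to finish with nothing left queued shuts the pool down.
                    if (pending.decrementAndGet() == 0) pool.shutdown();
                }
            }
        }

        seen.add(start);
        pending.incrementAndGet();
        pool.execute(new Task(start));
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(crawl("a")); // visits a, b, c, then the pool terminates
    }
}
```

Incrementing the counter before submitting (never after) is what makes this safe: the count can only reach zero when no task is running and none is queued.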
The next improvement I would make would be to replace the map of SeenUrls with a Multiset. However, that is not included in the standard Collections library.
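
Guava provides a Multiset; with the standard library alone, ConcurrentHashMap#merge gives you the same atomic counting without the synchronized block. A sketch (CountDemo and the sample URLs are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CountDemo {
    public static void main(String[] args) {
        ConcurrentMap<String, Integer> timesSeen = new ConcurrentHashMap<>();
        String[] found = {"http://a", "http://b", "http://a"};
        for (String url : found) {
            // merge() atomically inserts 1 or adds 1 to the existing count
            int count = timesSeen.merge(url, 1, Integer::sum);
            if (count == 1) {
                // first sighting: this is where a new Crawler would be submitted
                System.out.println("first sighting: " + url);
            }
        }
        System.out.println(timesSeen.get("http://a")); // → 2
    }
}
```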


Context

StackExchange Code Review Q#139219, answer score: 4
