Multithreaded webcrawler
Tags: webcrawler, multithreaded, stackoverflow
Problem
I've been trying to learn Java for the last day or two. This is the first project I am working on, so please bear with me. I have been working on a multithreaded web crawler. It is fairly simple, but I'd like to ask for some advice.
Purpose / Running
The program starts at one web address (in this code, http://google.com) and looks for all valid URLs inside the response given. All URLs found in the response are added to the queue, and the crawler then continues crawling through the URLs in the queue. To stop the crawler, type exit in the input.
Http.java
```
package com.janchr;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Created by Jan on 8/20/2016.
 */
public class Http {
    public static BufferedReader Get(URL url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        // pretend that we are a new-ish browser. current user agent is actually from 2015.
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
        con.setInstanceFollowRedirects(true);

        int statusCode = con.getResponseCode();

        // https://www.mkyong.com/java/java-httpurlconnection-follow-redirect-example/
        boolean redirect = false;
        if (statusCode != HttpURLConnection.HTTP_OK) {
            if (statusCode == HttpURLConnection.HTTP_MOVED_TEMP
                    || statusCode == HttpURLConnection.HTTP_MOVED_PERM
                    || statusCode == HttpURLConnection.HTTP_SEE_OTHER)
                redirect = true;
        }

        if (redirect) {
            // get redirect url from "location" header field
            String newUrl = con.getHeaderField("Location");
            // get the cookie if needed
            String cookies = con.getHeaderField("Set-Cookie");
            // open the new connection again, forwarding the cookie
            // (as in the mkyong example linked above)
            con = (HttpURLConnection) new URL(newUrl).openConnection();
            con.setRequestProperty("Cookie", cookies);
        }

        return new BufferedReader(new InputStreamReader(con.getInputStream()));
    }
}
```
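For orientation, here is a minimal usage sketch of the Http.Get helper above; the HttpDemo class is hypothetical and not part of the original post:

```
package com.janchr;

import java.io.BufferedReader;
import java.io.IOException;
import java.net.URL;

public class HttpDemo {
    public static void main(String[] args) throws IOException {
        // Fetch the start page and print the first line of the response body.
        BufferedReader reader = Http.Get(new URL("http://google.com"));
        System.out.println(reader.readLine());
        reader.close();
    }
}
```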
Solution
Nice work for someone with only a few days of Java experience! Now some improvements:
You are working on a derivative of the classic producer-consumer problem. This is a common problem, and there are well established patterns for solving it in Java.
The abstraction you want to use here is called an ExecutorService. Essentially, it allows you to submit Runnables to be executed by the ExecutorService. You can easily construct an ExecutorService using Executors#newFixedThreadPool.
We can make a few modifications to your CrawlThread class to make it work in this new model:

```
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// SeenUrl and parse() come from the original code under review.
class Crawler implements Runnable {
    private final String url;
    private final Executor executor;
    private final Map<String, SeenUrl> seenUrls;

    public Crawler(
            String url,
            Executor executor,
            Map<String, SeenUrl> seenUrls) {
        this.url = url;
        this.executor = executor;
        this.seenUrls = seenUrls;
    }

    @Override
    public void run() {
        List<String> newUrls = parse(); // Very similar to your parse
        for (String newUrl : newUrls) {
            synchronized (seenUrls) {
                if (seenUrls.containsKey(newUrl)) {
                    seenUrls.get(newUrl).timesSeen++;
                } else {
                    seenUrls.put(newUrl, new SeenUrl(newUrl));
                    executor.submit(new Crawler(newUrl, executor, seenUrls));
                }
            }
        }
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // Run with 5 threads, adjust as necessary.
        ExecutorService executorService = Executors.newFixedThreadPool(5);
        Map<String, SeenUrl> seenUrls = new LinkedHashMap<>();
        seenUrls.put("http://google.com", new SeenUrl("http://google.com"));
        executorService.submit(
            new Crawler("http://google.com", executorService, seenUrls));
        // awaitTermination requires a timeout; wait (effectively) forever.
        executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    }
}
```

Now, there may be a few surprises in the above code snippets:
- VisitedUrls became SeenUrls - I think that's what you were actually counting in your code, although I could be wrong. Additionally, it lets us formulate a simple invariant: only submit a new URL for crawling if it is not already in seenUrls.
- Crawler takes in an Executor, not an ExecutorService - ExecutorService extends Executor, and we don't need the full power of ExecutorService in Crawler.
- The program never terminates - I'll let you implement that :). You will probably want to use ExecutorService#shutdown(); one possible direction is sketched after this list.
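What follows is not from the original answer, just a minimal sketch of one termination strategy, assuming you count in-flight tasks; the CrawlTracker class and its method names are hypothetical:

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

// Counts crawl tasks that have been submitted but not yet finished.
// When the count drops to zero, no task is running or queued any more,
// so the crawl is over and the pool can be shut down.
class CrawlTracker {
    private final AtomicInteger pending = new AtomicInteger();
    private final ExecutorService executor;

    CrawlTracker(ExecutorService executor) {
        this.executor = executor;
    }

    // Call right before every executor.submit(new Crawler(...)).
    void taskSubmitted() {
        pending.incrementAndGet();
    }

    // Call at the end of Crawler#run().
    void taskFinished() {
        if (pending.decrementAndGet() == 0) {
            executor.shutdown();
        }
    }
}
```

Because each increment happens before the matching submit, the count can only reach zero once every submitted Crawler has finished, at which point awaitTermination in main returns.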
The next improvement I would make would be to replace the map of SeenUrls with a Multiset. However, that is not included in the standard Collections library; Guava provides one, as the sketch below shows.
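As an illustration, here is a minimal sketch using Guava's Multiset; the MultisetSketch class name and the sample URL are just for the demo:

```
import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class MultisetSketch {
    public static void main(String[] args) {
        // A Multiset folds the "URL -> times seen" bookkeeping of
        // Map<String, SeenUrl> into a single thread-safe structure.
        Multiset<String> seenUrls = ConcurrentHashMultiset.create();

        // add(element, occurrences) returns the count *before* the add,
        // so a return of 0 means the URL is new and should be crawled.
        if (seenUrls.add("http://google.com", 1) == 0) {
            System.out.println("new URL, submit a Crawler for it");
        }

        System.out.println(seenUrls.count("http://google.com")); // prints 1
    }
}
```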
Context
StackExchange Code Review Q#139219, answer score: 4
Revisions (0)
No revisions yet.