Web crawler uses lots of memory

Submitted by: @import:stackexchange-codereview

Problem

I am developing a web crawler application. When I run the program for more than 3 hours, it runs out of memory. I need to run it non-stop for 2-3 days to get the results I need. Where is this program using memory inefficiently?

seeds.txt

http://www.stanford.edu
http://www.archive.org


WebCrawler.java

package pkg.crawler;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;
import org.joda.time.DateTime;

public class WebCrawler {

public static Queue queue = new PriorityBlockingQueue <> (); // priority queue
public static final int n_threads = 5; // amount of threads
private static Set processed = new LinkedHashSet <> (); // set of processed urls
private PrintWriter out; // output file
private PrintWriter err; // error file
private static Integer cntIntra = new Integer (0); // counters for intra- links in the queue
private static Integer cntInter = new Integer (0); // counters for inter- links in the queue
private static Integer dub = new Integer (0); // amount of skipped urls

public static void main(String[] args) throws Exception {
System.out.println("Running web crawler: " + new Date());

WebCrawler webCrawler = new WebCrawler();
webCrawler.createFiles();
try (Scanner in = new Scanner(new File ("seeds.txt"))) {
while (in.hasNext()) {
webCrawler.enque(new LinkNode (in.nextLine()

Solution

JProfiler is your friend here. After running the crawler under it for a minute, the biggest object on the heap is the queue in WebCrawler, and it keeps growing.

Put a System.out.println in enque or deque that prints the size of the queue, and you should see the size rocketing up much faster than the 5 threads can drain it: every processed page removes one link from the queue but adds all the links found on that page.
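
To make that check concrete, here is a minimal sketch of a queue wrapper that prints its size every few insertions. MonitoredQueue and reportEvery are hypothetical names, and it stores plain String URLs as an assumption, since the LinkNode class from the question is not shown here.

```java
import java.util.Queue;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of the original crawler: wraps the crawl
// queue and prints its size every `reportEvery` insertions.
class MonitoredQueue {
    // Assumption: URLs are plain strings; the original queue holds LinkNode objects.
    private final Queue<String> queue = new PriorityBlockingQueue<>();
    private final AtomicLong enqueued = new AtomicLong();
    private final AtomicLong dequeued = new AtomicLong();
    private final int reportEvery;

    MonitoredQueue(int reportEvery) {
        this.reportEvery = reportEvery;
    }

    void enque(String url) {
        queue.add(url);
        long n = enqueued.incrementAndGet();
        if (n % reportEvery == 0) {
            report();
        }
    }

    // Returns null when the queue is empty.
    String deque() {
        String url = queue.poll();
        if (url != null) {
            dequeued.incrementAndGet();
        }
        return url;
    }

    int size() {
        return queue.size();
    }

    private void report() {
        System.out.println("queue size=" + queue.size()
                + " enqueued=" + enqueued.get()
                + " dequeued=" + dequeued.get());
    }

    public static void main(String[] args) {
        MonitoredQueue q = new MonitoredQueue(2);
        q.enque("http://www.stanford.edu");
        q.enque("http://www.archive.org");
        q.deque();
        q.enque("http://www.stanford.edu/about");
        q.enque("http://www.stanford.edu/news");
    }
}
```

If the enqueued count pulls away from the dequeued count within the first minutes, the queue itself is the leak, which matches what the profiler shows.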

Maybe you should keep the queue in files instead of in memory:

let current_file = seed_file
let temp_file = new file
while should_continue_crawling:
  for link in current_file:
    dump all found links for link to temp_file
  current_file = temp_file
  temp_file = new file
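
The pseudocode above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: FileFrontier, crawl, maxRounds and fetchLinks are hypothetical names, fetchLinks stands in for whatever downloads a page and extracts its links, and the sketch does not filter duplicates the way the processed set in the question does.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a file-backed crawl frontier.
class FileFrontier {

    // Crawls level by level. Pending URLs live on disk; only the line being
    // read and the links of the page being processed are held in memory.
    // Returns the file holding the links found in the last round.
    static Path crawl(Path seedFile, int maxRounds,
                      Function<String, List<String>> fetchLinks) throws IOException {
        Path current = seedFile;
        for (int round = 0; round < maxRounds; round++) {
            Path next = Files.createTempFile("frontier-", ".txt");
            try (BufferedReader in = Files.newBufferedReader(current, StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(next, StandardCharsets.UTF_8)) {
                String url;
                while ((url = in.readLine()) != null) {
                    url = url.trim();
                    if (url.isEmpty()) {
                        continue; // skip blank lines in the input file
                    }
                    // Dump every link found on this page to the next file.
                    for (String link : fetchLinks.apply(url)) {
                        out.write(link);
                        out.newLine();
                    }
                }
            }
            if (round > 0) {
                Files.delete(current); // drop the consumed temp file, keep the seed file
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        Path seeds = Files.createTempFile("seeds-", ".txt");
        Files.write(seeds, List.of("http://www.stanford.edu", "http://www.archive.org"));
        // Stand-in link extractor: pretends every page links to two sub-pages.
        Path result = crawl(seeds, 2, url -> List.of(url + "/a", url + "/b"));
        Files.readAllLines(result).forEach(System.out::println);
    }
}
```

With this layout the heap no longer grows with the number of pending links, only disk usage does. Skipping already-visited URLs would still need a separate structure.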

Context

StackExchange Code Review Q#94138, answer score: 3
