Web crawler uses lots of memory
Problem
I am developing a web crawler application. When I run the program for more than 3 hours, it runs out of memory. I need to run it non-stop for 2-3 days to get the results I need. How is this program using memory inefficiently?
seeds.txt
http://www.stanford.edu
http://www.archive.org
WebCrawler.java
package pkg.crawler;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;
import org.joda.time.DateTime;

public class WebCrawler {
    public static Queue queue = new PriorityBlockingQueue <> (); // priority queue
    public static final int n_threads = 5; // amount of threads
    private static Set processed = new LinkedHashSet <> (); // set of processed urls
    private PrintWriter out; // output file
    private PrintWriter err; // error file
    private static Integer cntIntra = new Integer (0); // counters for intra- links in the queue
    private static Integer cntInter = new Integer (0); // counters for inter- links in the queue
    private static Integer dub = new Integer (0); // amount of skipped urls

    public static void main(String[] args) throws Exception {
        System.out.println("Running web crawler: " + new Date());
        WebCrawler webCrawler = new WebCrawler();
        webCrawler.createFiles();
        try (Scanner in = new Scanner(new File ("seeds.txt"))) {
            while (in.hasNext()) {
                webCrawler.enque(new LinkNode (in.nextLine()
Solution
JProfiler is your friend here. After running the crawler for a minute, the biggest object on the heap is the queue in WebCrawler, and it keeps growing.
Put a System.out.println in enque or deque that prints the size of the queue, and you should see the size rocketing up much faster than the 5 threads can deal with.
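As a rough illustration of that check, here is a minimal sketch of a queue wrapper that prints its size every so often. The class name `QueueSizeProbe`, the `reportEvery` parameter, and the use of plain `String` URLs instead of the question's `LinkNode` are all assumptions made for the example; only the `enque`/`deque` names and the `PriorityBlockingQueue` come from the question.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the crawler's queue that reports how large it gets. */
final class QueueSizeProbe {
    // String is used here because LinkNode is not shown in full in the question.
    private final PriorityBlockingQueue<String> queue = new PriorityBlockingQueue<>();
    private final AtomicLong enqueued = new AtomicLong();
    private final long reportEvery;

    QueueSizeProbe(long reportEvery) {
        this.reportEvery = reportEvery;
    }

    void enque(String url) {
        queue.add(url);
        // Print only every reportEvery-th insert so the log stays readable.
        if (enqueued.incrementAndGet() % reportEvery == 0) {
            System.out.println("queue size: " + queue.size());
        }
    }

    /** Returns the next URL, or null if the queue is currently empty. */
    String deque() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        QueueSizeProbe probe = new QueueSizeProbe(500);
        probe.enque("http://www.stanford.edu");
        // Simulate 100 page visits, each of which yields 20 new links.
        for (int page = 0; page < 100; page++) {
            String url = probe.deque();
            for (int i = 0; i < 20; i++) {
                probe.enque(url + "/" + i);
            }
        }
        System.out.println("final queue size: " + probe.size());
    }
}
```

Even in this toy run, every page removed from the queue puts many more back, which is the growth pattern the profiler shows in the real crawler.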
Maybe you should use files to maintain your queue:
    let current_file = seed_file
    let temp_file = new file
    while should_continue_crawling:
        for link in current_file:
            dump all found links for link to temp_file
        current_file = temp_file
        temp_file = new file
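The pseudocode above could be sketched in Java roughly as follows. This is only one possible reading of it: the class name `FileFrontierCrawler`, the `maxRounds` limit (standing in for `should_continue_crawling`), and the `fetchLinks` function (standing in for "download the page and extract its links") are hypothetical and not part of the original crawler.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

/** Keeps the crawl frontier on disk: one file per round instead of an in-memory queue. */
final class FileFrontierCrawler {

    /**
     * Reads links from the current file, writes every link found on those pages
     * to a fresh temp file, then swaps the two. Returns the last file written.
     */
    static Path crawl(Path seedFile, int maxRounds,
                      Function<String, List<String>> fetchLinks) throws IOException {
        Path current = seedFile;
        for (int round = 0; round < maxRounds; round++) {
            Path next = Files.createTempFile("frontier-", ".txt");
            try (BufferedReader in = Files.newBufferedReader(current, StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(next, StandardCharsets.UTF_8)) {
                String link;
                // Only one line is held in memory at a time.
                while ((link = in.readLine()) != null) {
                    if (link.trim().isEmpty()) {
                        continue;
                    }
                    for (String found : fetchLinks.apply(link)) {
                        out.write(found);
                        out.newLine();
                    }
                }
            }
            // Intermediate frontier files are no longer needed; the seed file is kept.
            if (!current.equals(seedFile)) {
                Files.delete(current);
            }
            current = next;
        }
        return current;
    }
}
```

With this shape, memory use depends on the links of a single page rather than on the whole frontier. Note that the sketch does not de-duplicate URLs and processes links sequentially, so it would still need to be combined with the crawler's existing `processed` check and its worker threads.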
Context
StackExchange Code Review Q#94138, answer score: 3