
Implementing a POC Async Web Crawler

Submitted by: @import:stackexchange-codereview
Tags: async, web-crawler, implementing, poc

Problem

I've created a small proof of concept web crawler to learn more about asynchrony in .NET.

Currently, when run, it crawls Stack Overflow with a fixed number of concurrent requests (workers).

I am interested in whether I have made any common mistakes, and whether there is more of the framework / TPL I could be using to reduce the amount of code.

To run it you would need to install the HtmlAgilityPack.

I am only really interested in feedback on program structure and the asynchronous aspects of it.

```
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var crawler = new WebCrawler();
        crawler.Start();
    }
}

public class WebCrawler
{
    private readonly HttpClient client;
    private const string SiteUrl = "http://stackoverflow.com";
    private CrawlList crawlList;
    private const int MaxWorkers = 5;
    private int workers;

    public WebCrawler()
    {
        client = new HttpClient();
        crawlList = new CrawlList();
    }

    public void Start()
    {
        crawlList = new CrawlList();
        crawlList.AddUrl(SiteUrl);

        do
        {
            if (workers >= MaxWorkers) continue;

            if (!crawlList.HasNext()) continue;

            Interlocked.Increment(ref workers);

            Debug.Write("Workers " + workers);
            ProcessUrl(crawlList.GetNext());

        } while (crawlList.HasNext() || workers > 0);
    }

    private async void ProcessUrl(string url)
    {
        Debug.Print("Processing " + url);
        await client.GetAsync(url).ContinueWith(ProcessResponse);
    }

    private async void ProcessResponse(Task<HttpResponseMessage> response)
    {
        Debug.Print("Processing response ");
        var html = await response.Result.Content.ReadAsStringAsync();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var internalLinks = doc.DocumentNode.SelectNodes("//a[@href]").Where(x => x.Attributes["href"].Value.StartsWith("/")).Select(x => SiteUrl + x.Attributes["href"].Value).ToList();
        internalLinks.ForEach(x => crawlList.AddUrl(x));
        Interlocked.Decrement(ref workers);
    }
}

// CrawlList reconstructed from the snippets quoted in the review below.
// The bodies of HasNext() and UrlAlreadyAdded() are not part of the original
// excerpt, so they are best guesses here.
public class CrawlList
{
    private readonly ConcurrentBag<string> urlsToCrawl = new ConcurrentBag<string>();
    private readonly ConcurrentBag<string> urlsCompleted = new ConcurrentBag<string>();

    public void AddUrl(string url)
    {
        if (!UrlAlreadyAdded(url))
        {
            urlsToCrawl.Add(url);
            Debug.Print("Adding Url " + url);
        }
    }

    public bool HasNext() => urlsToCrawl.Count > 0;

    public string GetNext()
    {
        string url;
        urlsToCrawl.TryTake(out url);
        urlsCompleted.Add(url);
        return url;
    }

    private bool UrlAlreadyAdded(string url) =>
        urlsToCrawl.Contains(url) || urlsCompleted.Contains(url);
}
```

Solution

crawler.Start();


This is very confusing naming. If a method is called Start(), I would expect it to, well, start the work and then return, not do all of the work. A better name would be something like Run(), or maybe even better, something more specific.

private const string SiteUrl = "http://stackoverflow.com";


Why is this here? I think this makes much more sense as a constructor parameter, or maybe a parameter to Start().

private const int MaxWorkers =5;


Again, this should probably be configurable. Also, if you're processing a single website, it doesn't make much sense to have more workers than ServicePointManager.DefaultConnectionLimit, which by default is set to 2.
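For illustration, that could look something like this (the parameter names are mine, not from the original code):

```
public class WebCrawler
{
    private readonly HttpClient client = new HttpClient();
    private readonly string siteUrl;
    private readonly int maxWorkers;

    // Both values are injected instead of being hard-coded constants;
    // the default of 2 mirrors ServicePointManager.DefaultConnectionLimit.
    public WebCrawler(string siteUrl, int maxWorkers = 2)
    {
        this.siteUrl = siteUrl;
        this.maxWorkers = maxWorkers;
    }
}
```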

public WebCrawler()
{
    client = new HttpClient();
    crawlList = new CrawlList();
}

public void Start()
{
    crawlList = new CrawlList();


If you're going to set crawlList every time Start() is called, you don't need to set it in the constructor.

do
{
    …
} while (crawlList.HasNext() || workers > 0);


So, while some download is in progress, you're going to loop around and waste a whole CPU core doing nothing useful? That's a terrible idea. mariosangiorgio's answer has one way of solving this.
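To illustrate the idea, here is one possible shape of a loop that does not spin; this is only a sketch (RunAsync and ProcessUrlAsync are hypothetical names), not necessarily what that answer does:

```
public async Task RunAsync()
{
    crawlList.AddUrl(SiteUrl);
    var pending = new List<Task>();

    while (crawlList.HasNext() || pending.Count > 0)
    {
        if (crawlList.HasNext() && pending.Count < MaxWorkers)
        {
            // Start another download while we're under the worker cap.
            pending.Add(ProcessUrlAsync(crawlList.GetNext()));
        }
        else
        {
            // Nothing to start right now: asynchronously wait for one download
            // to finish (it may enqueue new URLs) instead of burning a CPU core.
            var finished = await Task.WhenAny(pending);
            pending.Remove(finished);
        }
    }
}

private async Task ProcessUrlAsync(string url)
{
    var response = await client.GetAsync(url);
    await ProcessResponseAsync(response);
}
```

The key point is that await Task.WhenAny(...) yields the thread while downloads are in flight, instead of re-evaluating the same two checks millions of times.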

await client.GetAsync(url).ContinueWith(ProcessResponse);


You don't need ContinueWith() here, that's what await is for:

var response = await client.GetAsync(url);
await ProcessResponseAsync(response);


This assumes that you're going to change async void ProcessResponse(Task<HttpResponseMessage> response) to async Task ProcessResponseAsync(HttpResponseMessage response), following the Task-based Asynchronous Pattern.
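In other words, something along these lines, with the body trimmed to the parts shown earlier:

```
private async Task ProcessResponseAsync(HttpResponseMessage response)
{
    Debug.Print("Processing response ");
    var html = await response.Content.ReadAsStringAsync();
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // ... extract the links and add them to crawlList as before ...
}
```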

var internalLinks = doc.DocumentNode.SelectNodes("//a[@href]").Where(x => x.Attributes["href"].Value.StartsWith("/")).Select(x =>  SiteUrl + x.Attributes["href"].Value).ToList();


Why is this all a single line? You should break it into multiple lines, to make it more readable:

var internalLinks = doc.DocumentNode.SelectNodes("//a[@href]")
    .Where(x => x.Attributes["href"].Value.StartsWith("/"))
    .Select(x =>  SiteUrl + x.Attributes["href"].Value)
    .ToList();


And you could also use query syntax to avoid some repetition:

var internalLinks =
    from node in doc.DocumentNode.SelectNodes("//a[@href]")
    let href = node.Attributes["href"].Value
    where href.StartsWith("/")
    select SiteUrl + href;


Or maybe:

var internalLinks =
    from node in doc.DocumentNode.SelectNodes("//a[@href]")
    select node.Attributes["href"].Value into href
    where href.StartsWith("/")
    select SiteUrl + href;


The difference is that the second version is going to be slightly more efficient than the first one, but I also think it's less readable.

Also, I removed ToList(), see below.

internalLinks.ForEach(x => crawlList.AddUrl(x));


I don't think List.ForEach should be used. Instead use foreach, or add something like AddUrls() to your collection.
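For example (AddUrls() is a hypothetical helper, not something that exists in the original code):

```
// Option 1: a plain foreach over the query results.
foreach (var link in internalLinks)
{
    crawlList.AddUrl(link);
}

// Option 2: a bulk-add method on CrawlList.
public void AddUrls(IEnumerable<string> urls)
{
    foreach (var url in urls)
    {
        AddUrl(url);
    }
}
```

Note that once ToList() is gone, internalLinks is an IEnumerable<string>, which has no ForEach() anyway, so foreach is the natural fit.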

public string GetNext()
{
    string url;
    urlsToCrawl.TryTake(out url);
    urlsCompleted.Add(url);
    return url;
}


So, when urlsToCrawl is empty, you're going to add null to urlsCompleted and then return it? That doesn't sound like a good idea to me.

If this method is supposed to be called only when urlsToCrawl is not empty, then it should throw an exception when it's called incorrectly.
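A sketch of the fail-fast version, assuming urlsToCrawl is a concurrent collection exposing TryTake():

```
public string GetNext()
{
    string url;
    if (!urlsToCrawl.TryTake(out url))
    {
        // Calling GetNext() with nothing left to crawl is a programming error,
        // so surface it instead of silently passing null around.
        throw new InvalidOperationException("GetNext() called while the crawl list is empty.");
    }

    urlsCompleted.Add(url);
    return url;
}
```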

public void AddUrl(string url)
{
    if (!UrlAlreadyAdded(url))
    {
        urlsToCrawl.Add(url);
        Debug.Print("Adding Url "  + url);
    }
}


This is not thread-safe. Since this method can be called from multiple threads at the same time, you can end up downloading the same URL multiple times.
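One way to make the check-and-add atomic is to let a concurrent set do the deduplication. A sketch, assuming a ConcurrentDictionary is acceptable as the "already seen" set (seenUrls is an illustrative name):

```
private readonly ConcurrentDictionary<string, bool> seenUrls =
    new ConcurrentDictionary<string, bool>();

public void AddUrl(string url)
{
    // TryAdd succeeds for exactly one caller per URL, even when several
    // workers discover the same link at the same time.
    if (seenUrls.TryAdd(url, true))
    {
        urlsToCrawl.Add(url);
        Debug.Print("Adding Url " + url);
    }
}
```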


Context

StackExchange Code Review Q#64650, answer score: 5
