Implementing a POC Async Web Crawler
Problem
I've created a small proof of concept web crawler to learn more about asynchrony in .NET.
Currently, when run, it crawls Stack Overflow with a fixed number of concurrent requests (workers).
I was interested in whether I have made any common mistakes, and whether there is more of the framework / TPL I could be using to reduce the amount of code.
To run it you would need to install the HtmlAgilityPack.
I am only really interested in feedback on program structure and the asynchronous aspects of it.
```
using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var crawler = new WebCrawler();
        crawler.Start();
    }
}

public class WebCrawler
{
    private readonly HttpClient client;
    private const string SiteUrl = "http://stackoverflow.com";
    private CrawlList crawlList;
    private const int MaxWorkers = 5;
    private int workers;

    public WebCrawler()
    {
        client = new HttpClient();
        crawlList = new CrawlList();
    }

    public void Start()
    {
        crawlList = new CrawlList();
        crawlList.AddUrl(SiteUrl);
        do
        {
            if (workers >= MaxWorkers) continue;
            if (!crawlList.HasNext()) continue;
            Interlocked.Increment(ref workers);
            Debug.Write("Workers " + workers);
            ProcessUrl(crawlList.GetNext());
        } while (crawlList.HasNext() || workers > 0);
    }

    private async void ProcessUrl(string url)
    {
        Debug.Print("Processing " + url);
        await client.GetAsync(url).ContinueWith(ProcessResponse);
    }

    private async void ProcessResponse(Task<HttpResponseMessage> response)
    {
        Debug.Print("Processing response ");
        var html = await response.Result.Content.ReadAsStringAsync();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var internalLinks = doc.DocumentNode.SelectNodes("//a[@href]").Where(x => x.Attributes["href"].Value.StartsWith("/")).Select(x => SiteUrl + x.Attributes["href"].Value).ToList();
        internalLinks.ForEach(x => crawlList.AddUrl(x));

        // assumption: the worker count is decremented once a page has been
        // processed, so that the loop in Start() can eventually exit
        Interlocked.Decrement(ref workers);
    }
}
```
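The `CrawlList` listing was cut off above. A minimal sketch consistent with the members used elsewhere in this post (and reviewed in the answer below) might look like the following; the `ConcurrentBag<string>` fields and the body of `UrlAlreadyAdded` are assumptions, and `Contains` here is LINQ's `Enumerable.Contains`:

```
// Sketch of CrawlList; only the member names are taken from the original post.
public class CrawlList
{
    private readonly ConcurrentBag<string> urlsToCrawl = new ConcurrentBag<string>();
    private readonly ConcurrentBag<string> urlsCompleted = new ConcurrentBag<string>();

    public bool HasNext()
    {
        return !urlsToCrawl.IsEmpty;
    }

    public string GetNext()
    {
        string url;
        urlsToCrawl.TryTake(out url);
        urlsCompleted.Add(url);
        return url;
    }

    public void AddUrl(string url)
    {
        if (!UrlAlreadyAdded(url))
        {
            urlsToCrawl.Add(url);
            Debug.Print("Adding Url " + url);
        }
    }

    private bool UrlAlreadyAdded(string url)
    {
        // assumption: a URL counts as already added if it is queued or completed
        return urlsToCrawl.Contains(url) || urlsCompleted.Contains(url);
    }
}
```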
Solution
```
crawler.Start();
```

This is very confusing naming. If a method is called `Start()`, I would expect it to, well, start the work and then return, not do all of the work. A better name would be something like `Run()`, or maybe even better, something more specific.

```
private const string SiteUrl = "http://stackoverflow.com";
```

Why is this here? I think this makes much more sense as a constructor parameter, or maybe a parameter to `Start()`.

```
private const int MaxWorkers = 5;
```

Again, this should probably be configurable. Also, if you're processing a single website, it doesn't make much sense to have more workers than `ServicePointManager.DefaultConnectionLimit`, which by default is set to 2.
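As an illustration (not code from the answer), both values could become constructor parameters, with the connection limit raised to match; the `siteUrl` and `maxWorkers` fields here are made up for the sketch and would replace the constants:

```
// Illustrative sketch only: configurable site URL and worker count.
// Requires "using System.Net;" for ServicePointManager.
public WebCrawler(string siteUrl, int maxWorkers)
{
    this.siteUrl = siteUrl;        // hypothetical instance fields replacing the consts
    this.maxWorkers = maxWorkers;

    // On .NET Framework, HttpClient requests go through ServicePoint, so the
    // effective per-host concurrency is capped by DefaultConnectionLimit (2 by default).
    ServicePointManager.DefaultConnectionLimit = maxWorkers;

    client = new HttpClient();
    crawlList = new CrawlList();
}
```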
```
public WebCrawler()
{
    client = new HttpClient();
    crawlList = new CrawlList();
}

public void Start()
{
    crawlList = new CrawlList();
```

If you're going to set `crawlList` every time `Start()` is called, you don't need to set it in the constructor.

```
do
{
    …
} while (crawlList.HasNext() || workers > 0);
```

So, while some download is in progress, you're going to loop around and waste a whole CPU core doing nothing useful? That's a terrible idea. mariosangiorgio's answer has one way of solving this.
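One way to avoid the busy-wait (a sketch, not necessarily the approach from that answer, and assuming `ProcessUrl` is reworked into an `async Task ProcessUrlAsync(string url)` method) is to keep the in-flight downloads as tasks and `await` them:

```
// Sketch: wait on the pending downloads instead of spinning on a counter.
// Assumes the usual System.Collections.Generic and System.Threading.Tasks usings.
public async Task RunAsync()
{
    crawlList.AddUrl(SiteUrl);
    var pending = new List<Task>();

    while (crawlList.HasNext() || pending.Count > 0)
    {
        // Start new downloads while there is work and spare capacity.
        while (pending.Count < MaxWorkers && crawlList.HasNext())
        {
            pending.Add(ProcessUrlAsync(crawlList.GetNext()));
        }

        if (pending.Count == 0)
        {
            break;
        }

        // Yields until at least one download completes, instead of burning a CPU core.
        var finished = await Task.WhenAny(pending);
        pending.Remove(finished);
    }
}
```

`Main` would then block on the returned task, e.g. `crawler.RunAsync().Wait();`, since there was no async `Main` at the time.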
```
await client.GetAsync(url).ContinueWith(ProcessResponse);
```

You don't need `ContinueWith()` here, that's what `await` is for:

```
var response = await client.GetAsync(url);
await ProcessResponseAsync(response);
```

This assumes that you're going to change `async void ProcessResponse(Task<HttpResponseMessage> response)` to `async Task ProcessResponseAsync(HttpResponseMessage response)`, following the Task-based Asynchronous Pattern.

```
var internalLinks = doc.DocumentNode.SelectNodes("//a[@href]").Where(x => x.Attributes["href"].Value.StartsWith("/")).Select(x => SiteUrl + x.Attributes["href"].Value).ToList();
```

Why is this all a single line? You should break it into multiple lines, to make it more readable:
```
var internalLinks = doc.DocumentNode.SelectNodes("//a[@href]")
    .Where(x => x.Attributes["href"].Value.StartsWith("/"))
    .Select(x => SiteUrl + x.Attributes["href"].Value)
    .ToList();
```

And you could also use query syntax to avoid some repetition:
```
var internalLinks =
    from node in doc.DocumentNode.SelectNodes("//a[@href]")
    let href = node.Attributes["href"].Value
    where href.StartsWith("/")
    select SiteUrl + href;
```

Or maybe:
```
var internalLinks =
    from node in doc.DocumentNode.SelectNodes("//a[@href]")
    select node.Attributes["href"].Value into href
    where href.StartsWith("/")
    select SiteUrl + href;
```

The difference is that the second version is going to be slightly more efficient than the first one, but I also think it's less readable.

Also, I removed `ToList()`, see below.

```
internalLinks.ForEach(x => crawlList.AddUrl(x));
```

I don't think `List.ForEach` should be used. Instead use `foreach`, or add something like `AddUrls()` to your collection.
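An `AddUrls()` helper along those lines might look like this (a sketch; the method is not in the original code):

```
// Hypothetical AddUrls() helper on CrawlList: accepts any sequence of URLs,
// so it also works with the query-syntax version that no longer calls ToList().
public void AddUrls(IEnumerable<string> urls)
{
    foreach (var url in urls)
    {
        AddUrl(url);
    }
}
```

The call site then becomes `crawlList.AddUrls(internalLinks);`.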
```
public string GetNext()
{
    string url;
    urlsToCrawl.TryTake(out url);
    urlsCompleted.Add(url);
    return url;
}
```

So, when `urlsToCrawl` is empty, you're going to add `null` to `urlsCompleted` and then return it? That doesn't sound like a good idea to me.

If this method is supposed to be called only when `urlsToCrawl` is not empty, then it should throw an exception when it's called incorrectly.
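For example, the misuse could be made to fail loudly (a sketch, not code from the answer):

```
// Sketch: throw if GetNext() is called when there is nothing left to take.
public string GetNext()
{
    string url;
    if (!urlsToCrawl.TryTake(out url))
    {
        throw new InvalidOperationException("GetNext() called on an empty crawl list.");
    }
    urlsCompleted.Add(url);
    return url;
}
```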
```
public void AddUrl(string url)
{
    if (!UrlAlreadyAdded(url))
    {
        urlsToCrawl.Add(url);
        Debug.Print("Adding Url " + url);
    }
}
```

This is not thread-safe. Since this method can be called from multiple threads at the same time, you can end up downloading the same URL multiple times.
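One way to close that race (a sketch; the `seenUrls` field is made up here) is to replace the check-then-add with an atomic `TryAdd` on a `ConcurrentDictionary` used as a set:

```
// Sketch: ConcurrentDictionary.TryAdd is atomic, so two threads adding the
// same URL at the same time can no longer both enqueue it.
private readonly ConcurrentDictionary<string, bool> seenUrls =
    new ConcurrentDictionary<string, bool>();

public void AddUrl(string url)
{
    if (seenUrls.TryAdd(url, true))
    {
        urlsToCrawl.Add(url);
        Debug.Print("Adding Url " + url);
    }
}
```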
Context
StackExchange Code Review Q#64650, answer score: 5