HiveBrain v1.2.0
Get Started
← Back to all entries
patterncsharpMinor

Class to encapsulate and manage multiple background web crawlers

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
crawlersbackgroundmanagemultiplewebandclassencapsulate

Problem

I need to crawl web contents from some websites and then do some processing. Note that this is a small application, so the dataset is relatively small (need to crawl about 30,000 pages every time, once a week). The problem is that I can't start too many threads to crawl the pages at the same time. Otherwise my IP will be recognized unusual and will be blocked.

So, I create a class, called CrawlingService. It's designed to encapsulate these things:

  • Start some threads to crawl the web content



  • Control the waiting time after a page is crawled (a thread need to "have a rest" after a page is crawled to prevent the app from being blocked by the server)



  • Notify other classes that a webpage is crawled



  • Automatically retry N times when fail to crawl a page (most are "Timeout" errors)



  • When an error occurs (99% is timeout), all threads need to pause for a while. Because the "Timeout" is mostly caused by server busy.



The following is my implementation.

  • The main class is CrawlingService which is already mentioned above.



  • ITaskRestStrategy.Duration() method is used to return the information about how long a thread need to wait for after a page is crawled.



  • The AbstractHttpClient is used to make HTTP requests, not important.



  • Let's ignore argument null checking



What I want to know

  • Is the multi-threading implementation correct? (Not good at this)



  • Can I improve the multi-threading implementation by using a better approach?



  • Can I improve the design of these classes?



  • Better naming for the classes/methods/variables? (Not quite good at English)



WebResource.cs

/// 
/// Represents a webpage to be crawled.
/// 
public class WebResource
{
    public string Url { get; private set; }

    public Encoding Encoding { get; private set; }

    public string Content { get; set; }

    // Ignore the constructor
}


CrawlingEventArgs.cs

```
public class CrawlingEventArgs : EventArgs
{
public WebResource Resource { get; private set; }

Solution

Generally, very well written code. I just have a few suggestions:

  • In the Start method, I understand why, but I don't like that there are two calls to check if the service is running. It feels kind of redundant, and I don't think you'll get a speed increase by short-circuiting the lock.



-
I would rather create and throw customized Exception classes rather than using the built-in one.

throw new ServiceIsAlreadyRunningException()


is less confusing than:

InvalidOperationException("Cannot add new items after the service is started.")


  • I would change the IsRunning property to use an enum. This will allow the addition of more states in the future (Starting, Running, Stopped, ShuttingDown, ...)



  • I'm not sure what version of .Net you are using, but look into the await keyword. I found it makes my code flow so much better and makes it easier to follow when using threads.



-
In your constructors, check the injected classes for null. This will save you problems when trying to use them later:

public CralwingService(
    ITaskRestStrategy itemRestStrategy,
    ITaskRestStrategy errorRestStrategy,
    int maxDegreeOfParallelism,
    int maxRetriesForEachItem,
    AbstractHttpClient httpClient)
{

    if (itemRestStrategy == null) throw new ArgumentNullException("itemRestStrategy");
    // same for all other reference types injected

    // rest of constructor code
}


Overall, this was easy code to read, and very well written, which I like. The suggestions I've pointed out will take it from good code to excellent code, in my opinion.

Code Snippets

throw new ServiceIsAlreadyRunningException()
InvalidOperationException("Cannot add new items after the service is started.")
public CralwingService(
    ITaskRestStrategy itemRestStrategy,
    ITaskRestStrategy errorRestStrategy,
    int maxDegreeOfParallelism,
    int maxRetriesForEachItem,
    AbstractHttpClient httpClient)
{

    if (itemRestStrategy == null) throw new ArgumentNullException("itemRestStrategy");
    // same for all other reference types injected

    // rest of constructor code
}

Context

StackExchange Code Review Q#28517, answer score: 6

Revisions (0)

No revisions yet.