patterncsharpMinor
Class to encapsulate and manage multiple background web crawlers
Viewed 0 times
crawlersbackgroundmanagemultiplewebandclassencapsulate
Problem
I need to crawl web contents from some websites and then do some processing. Note that this is a small application, so the dataset is relatively small (need to crawl about 30,000 pages every time, once a week). The problem is that I can't start too many threads to crawl the pages at the same time. Otherwise my IP will be recognized unusual and will be blocked.
So, I create a class, called
The following is my implementation.
What I want to know
WebResource.cs
CrawlingEventArgs.cs
```
public class CrawlingEventArgs : EventArgs
{
public WebResource Resource { get; private set; }
So, I create a class, called
CrawlingService. It's designed to encapsulate these things: - Start some threads to crawl the web content
- Control the waiting time after a page is crawled (a thread need to "have a rest" after a page is crawled to prevent the app from being blocked by the server)
- Notify other classes that a webpage is crawled
- Automatically retry N times when fail to crawl a page (most are "Timeout" errors)
- When an error occurs (99% is timeout), all threads need to pause for a while. Because the "Timeout" is mostly caused by server busy.
The following is my implementation.
- The main class is
CrawlingServicewhich is already mentioned above.
ITaskRestStrategy.Duration()method is used to return the information about how long a thread need to wait for after a page is crawled.
- The
AbstractHttpClientis used to make HTTP requests, not important.
- Let's ignore argument null checking
What I want to know
- Is the multi-threading implementation correct? (Not good at this)
- Can I improve the multi-threading implementation by using a better approach?
- Can I improve the design of these classes?
- Better naming for the classes/methods/variables? (Not quite good at English)
WebResource.cs
///
/// Represents a webpage to be crawled.
///
public class WebResource
{
public string Url { get; private set; }
public Encoding Encoding { get; private set; }
public string Content { get; set; }
// Ignore the constructor
}CrawlingEventArgs.cs
```
public class CrawlingEventArgs : EventArgs
{
public WebResource Resource { get; private set; }
Solution
Generally, very well written code. I just have a few suggestions:
-
I would rather create and throw customized
is less confusing than:
-
In your constructors, check the injected classes for
Overall, this was easy code to read, and very well written, which I like. The suggestions I've pointed out will take it from good code to excellent code, in my opinion.
- In the
Startmethod, I understand why, but I don't like that there are two calls to check if the service is running. It feels kind of redundant, and I don't think you'll get a speed increase by short-circuiting the lock.
-
I would rather create and throw customized
Exception classes rather than using the built-in one.throw new ServiceIsAlreadyRunningException()is less confusing than:
InvalidOperationException("Cannot add new items after the service is started.")- I would change the IsRunning property to use an
enum. This will allow the addition of more states in the future (Starting,Running,Stopped,ShuttingDown, ...)
- I'm not sure what version of .Net you are using, but look into the
awaitkeyword. I found it makes my code flow so much better and makes it easier to follow when using threads.
-
In your constructors, check the injected classes for
null. This will save you problems when trying to use them later:public CralwingService(
ITaskRestStrategy itemRestStrategy,
ITaskRestStrategy errorRestStrategy,
int maxDegreeOfParallelism,
int maxRetriesForEachItem,
AbstractHttpClient httpClient)
{
if (itemRestStrategy == null) throw new ArgumentNullException("itemRestStrategy");
// same for all other reference types injected
// rest of constructor code
}Overall, this was easy code to read, and very well written, which I like. The suggestions I've pointed out will take it from good code to excellent code, in my opinion.
Code Snippets
throw new ServiceIsAlreadyRunningException()InvalidOperationException("Cannot add new items after the service is started.")public CralwingService(
ITaskRestStrategy itemRestStrategy,
ITaskRestStrategy errorRestStrategy,
int maxDegreeOfParallelism,
int maxRetriesForEachItem,
AbstractHttpClient httpClient)
{
if (itemRestStrategy == null) throw new ArgumentNullException("itemRestStrategy");
// same for all other reference types injected
// rest of constructor code
}Context
StackExchange Code Review Q#28517, answer score: 6
Revisions (0)
No revisions yet.