
Gaining better performance when screen scraping is required

Submitted by: @import:stackexchange-codereview

Problem

I have a website that scrapes other websites on every request in order to show accurate data. While it works well, it has a slight impact on performance.

```
public ActionResult Index(int id)
{
    Product product = pe.Products.Where(p => p.Id == id).First();
    foreach (var pricing in product.Retailer_Product_Prices)
    {
        switch (pricing.RetailerId)
        {
            case 1:
                pricing.Price = PriceCompareStatic.GetFlipKart(pricing.Url);
                if (pricing.Url.Contains("?"))
                    pricing.Url = pricing.Url + "&affid=pankajupad";
                else
                    pricing.Url = pricing.Url + "?affid=pankajupad";
                break;
            case 3:
                pricing.Price = PriceCompareStatic.GetHomeShop(pricing.Url, "//span[@class='pdp_details_hs18Price']");
                break;
            .
            .
            .  // Removed for brevity
            default:
                break;
        }
        product.BasePrice = product.Retailer_Product_Prices.Min(p => p.Price);
    }
    return View(product);
}
```


PriceCompareStatic

```
public static decimal GetFlipKart(string url)
{
    var baseUrl = new Uri(url);
    HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
    try
    {
        WebClient client = new WebClient();
        document.Load(client.OpenRead(baseUrl));
        var div = document.DocumentNode.SelectNodes("//meta[@itemprop='price']").FirstOrDefault();
        HtmlAttribute att = div.Attributes["content"];
        string modPrice = att.Value.Replace("Rs. ", "");
        return Convert.ToDecimal(modPrice);
    }
    catch (Exception)
    {
        return 123456789;
    }
}

public static decimal GetSpanFromWebSite(string url, string identification)
{
    var baseUrl = new Uri(url);
    HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
    try
    {
        WebClient client = new WebClient();
        document.Load(client.OpenRead(baseUrl));
        // ... (the rest of this method is cut off in the source)
```

Solution

It's hard to tell exactly how this is being used, since your code sample is only an example. However, a few notes.

Given the design of your use case, almost any code-level improvement is going to pale in comparison to the cost of pulling a web page live (especially several of them) on every request. If you want real improvement, cache the information from the other websites.
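
As a rough illustration of the caching idea, here is a minimal sketch of a per-URL price cache with a fixed time-to-live. The PriceCache class, its GetOrFetch method, and the injectable clock are hypothetical names made up for this example, not part of the original code; the scrape delegate stands in for a method such as PriceCompareStatic.GetFlipKart.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper (not from the original code): caches scraped
// prices per URL and re-scrapes only after the entry is older than the TTL.
public sealed class PriceCache
{
    private readonly ConcurrentDictionary<string, (decimal Price, DateTime FetchedUtc)> _entries
        = new ConcurrentDictionary<string, (decimal Price, DateTime FetchedUtc)>();
    private readonly TimeSpan _ttl;
    private readonly Func<DateTime> _utcNow;

    // 'utcNow' is injectable only so the cache can be tested without waiting.
    public PriceCache(TimeSpan ttl, Func<DateTime> utcNow = null)
    {
        _ttl = ttl;
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    // Returns the cached price while it is fresh; otherwise calls 'scrape'
    // (assumed to be something like PriceCompareStatic.GetFlipKart)
    // and stores the result with the current timestamp.
    public decimal GetOrFetch(string url, Func<string, decimal> scrape)
    {
        DateTime now = _utcNow();
        if (_entries.TryGetValue(url, out var entry) && now - entry.FetchedUtc < _ttl)
            return entry.Price;

        decimal price = scrape(url);
        _entries[url] = (price, now);
        return price;
    }
}
```

Two requests that miss the cache at the same moment may both scrape the same URL; for a price comparison page that is probably acceptable, but it is a trade-off of this simple design. How long a price may stay cached is a product decision the sketch leaves open.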

Regarding the code sample itself, the only thing I would recommend looking into is Parallel.ForEach. If you use it, make sure you have already pulled all the data you loop over from the database (rather than waiting on EF lazy loading). Otherwise you will run into EF threading issues.
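
A minimal sketch of that approach, assuming the pricing rows have first been copied into plain in-memory objects so that no Entity Framework work happens on worker threads. PricingRow and ParallelPricing are hypothetical stand-ins invented for this example; the scrape delegate is assumed to be thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical plain object standing in for a Retailer_Product_Prices row.
public sealed class PricingRow
{
    public int RetailerId { get; set; }
    public string Url { get; set; }
    public decimal Price { get; set; }
}

public static class ParallelPricing
{
    // 'rows' must already be materialized (for example with ToList())
    // before this is called, so nothing is lazily loaded inside the loop.
    // 'scrape' is any thread-safe function that fetches a price for a row.
    public static decimal RefreshPrices(IReadOnlyList<PricingRow> rows,
                                        Func<PricingRow, decimal> scrape)
    {
        // Each iteration writes only to its own row, so no locking is needed.
        Parallel.ForEach(rows, row => row.Price = scrape(row));

        // Compute the minimum once, after every price has been fetched.
        // Note: Min throws InvalidOperationException on an empty list.
        return rows.Min(r => r.Price);
    }
}
```

Computing the minimum after the loop, rather than on every iteration as in the question, also avoids reading prices that other threads are still writing.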

I would also recommend creating a static delegate method for each website provider and, when your application loads, putting those methods into a concurrent dictionary keyed by retailer. It will help clean up your code and make it more readable.
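
One way the delegate dictionary could look, as a sketch only: RetailerScrapers, Register, and TryGetPrice are hypothetical names chosen for this example, and the registry is assumed to be filled once at application start-up.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical registry (not from the original code) that maps a
// retailer ID to the function that scrapes a price from that retailer.
public static class RetailerScrapers
{
    private static readonly ConcurrentDictionary<int, Func<string, decimal>> Scrapers
        = new ConcurrentDictionary<int, Func<string, decimal>>();

    // Call once per retailer when the application starts.
    public static void Register(int retailerId, Func<string, decimal> scraper)
    {
        Scrapers[retailerId] = scraper;
    }

    // Returns false for an unregistered retailer, which plays the role
    // of the 'default: break;' branch in the original switch.
    public static bool TryGetPrice(int retailerId, string url, out decimal price)
    {
        if (Scrapers.TryGetValue(retailerId, out var scraper))
        {
            price = scraper(url);
            return true;
        }

        price = 0m;
        return false;
    }
}
```

With the methods from the question, registration might read RetailerScrapers.Register(1, PriceCompareStatic.GetFlipKart) and RetailerScrapers.Register(3, url => PriceCompareStatic.GetHomeShop(url, "//span[@class='pdp_details_hs18Price']")). The switch in the controller then shrinks to a single TryGetPrice call, with retailer-specific extras such as the affiliate suffix handled separately.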

Context

StackExchange Code Review Q#10358, answer score: 2
