
Scraping HTML via async controller & classes + HTML agility pack

Submitted by: @import:stackexchange-codereview

Problem

I've developed a simple application to grab golfer index scores from a website that has no API. The application works but is very slow: updating 6 users takes 60 seconds. I've tried making my web requests asynchronous to offset some of this lag, but that only improved performance by 15%.

Description of the code:

On my view I have an anchor tag that, when clicked, hides all the elements in the DOM and loads a preloader. After the preloader is appended to the DOM, an AJAX call is executed that calls the UpdateHandicap method on a controller in my project. From there we await a static method, GrabIndexValue. All the code works; it's just very slow.

Possible solution:

This website allows me to input multiple GHIN #s. However, the result set is in a table with strangely generated XPaths:

//*[@id="ctl00_bodyMP_gvLookupResults_ctl02_lblHI"] which returns index of: 10.5
//*[@id="ctl00_bodyMP_gvLookupResults_ctl03_lblHI"] which returns index of: 9
//*[@id="ctl00_bodyMP_gvLookupResults_ctl04_lblHI"] which returns index of: 13.5


I don't know how to dynamically grab those result sets and parse them properly. So I feel like I'm stuck doing 1 web request per index value.
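
One possible way to grab those cells dynamically is to match the parts of the id that stay constant (the `ctl00_bodyMP_gvLookupResults_ctl` prefix and the `_lblHI` suffix) instead of the full id. The sketch below is only an illustration: the `LookupResultsDemo` class, the `ExtractIndexes` helper, and the sample markup are hypothetical, and it uses `XDocument` on a well-formed stand-in rather than the real page. HTML Agility Pack's `SelectNodes` also takes an XPath string, so the same expression may work there, but that would need to be checked against the actual results page.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class LookupResultsDemo
{
    // XPath 1.0 expression: any element whose id starts with the grid prefix
    // and contains the "_lblHI" label suffix, whatever the ctlNN row number is.
    public const string IndexLabelXPath =
        "//*[starts-with(@id, 'ctl00_bodyMP_gvLookupResults_ctl') and contains(@id, '_lblHI')]";

    // Hypothetical helper: returns the trimmed text of every matching element.
    public static string[] ExtractIndexes(string markup)
    {
        return XDocument.Parse(markup)
            .XPathSelectElements(IndexLabelXPath)
            .Select(e => e.Value.Trim())
            .ToArray();
    }

    public static void Main()
    {
        // Assumed, simplified stand-in for the real results table.
        const string sample = @"<table>
  <tr><td><span id='ctl00_bodyMP_gvLookupResults_ctl02_lblHI'>10.5</span></td></tr>
  <tr><td><span id='ctl00_bodyMP_gvLookupResults_ctl03_lblHI'>9</span></td></tr>
  <tr><td><span id='ctl00_bodyMP_gvLookupResults_ctl04_lblHI'>13.5</span></td></tr>
</table>";

        Console.WriteLine(string.Join(", ", ExtractIndexes(sample)));
    }
}
```

If this approach holds up on the real page, a single request with several GHIN #s could replace one request per golfer, provided each value can still be tied back to its GHIN # (for example via a neighbouring cell in the same row).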

Async controller method:

public async Task<ActionResult> UpdateHandicap()
{
    //Fetch all the golfers
    var results = db.Users.ToList();
    //Iterate through the golfers and update their index value based off their GHIN #. We store this
    //value in the database to make our handicap calculation
    foreach (Users user in results)
    {
        user.Index = await Calculations.GrabIndexValue(user.GHID);
        db.Entry(user).State = EntityState.Modified;
        db.SaveChanges();
    }
    return RedirectToAction("Index", "Users");
}


Method to actually grab the data:

```
public static async Task GrabIndexValue(int ghin)
{
    string url = $"http://xxxxxxxxx/Widgets/HandicapLookupResults.aspx?entry=1&ghinno={ghin}&css=default&dynamic=&small=0&mode=&tab=0";
    HtmlWeb w
```

Solution

public async Task<ActionResult> UpdateHandicap()


We add the Async suffix to methods that are marked with the async keyword. This is a naming convention that, as you'll see in a moment, Entity Framework follows too.

var results = db.Users.ToList();


You can use db.Users directly in the loop; you don't have to call ToList first.

foreach (Users user in results)
{
    user.Index = await Calculations.GrabIndexValue(user.GHID);
    db.Entry(user).State = EntityState.Modified;
    db.SaveChanges();
}


The async calls are incomplete. SaveChanges would still block, so you also want to use its asynchronous counterpart:

await db.SaveChangesAsync();


But do you really need to call SaveChanges in a loop? This could be bad for performance. I think you should call it once, after the loop:

foreach (var user in db.Users)
{
    user.Index = await Calculations.GrabIndexValue(user.GHID);
    db.Entry(user).State = EntityState.Modified;
}
await db.SaveChangesAsync();
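
Note that awaiting each lookup inside the loop still runs the web requests one after another. A different technique, not part of the answer above, is to start all the lookups first and await them together with `Task.WhenAll`, so the waits overlap. The sketch below is a self-contained simulation under stated assumptions: `Golfer`, `FakeGrabIndexValueAsync`, and `UpdateAllAsync` are hypothetical names, the index is assumed to be a `double`, and `Task.Delay` stands in for the real request.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the Users entity.
public class Golfer
{
    public int GHID { get; set; }
    public double Index { get; set; }
}

public static class ConcurrentUpdateDemo
{
    // Hypothetical stand-in for Calculations.GrabIndexValue: waits briefly
    // instead of making a web request, then returns a made-up value.
    public static async Task<double> FakeGrabIndexValueAsync(int ghin)
    {
        await Task.Delay(200);
        return ghin / 10.0;
    }

    // Starts every lookup before awaiting any of them, then assigns the results.
    public static async Task UpdateAllAsync(IReadOnlyList<Golfer> golfers)
    {
        Task<double>[] lookups = golfers
            .Select(g => FakeGrabIndexValueAsync(g.GHID))
            .ToArray();

        // Results come back in the same order as the tasks were supplied.
        double[] indexes = await Task.WhenAll(lookups);

        for (int i = 0; i < golfers.Count; i++)
        {
            golfers[i].Index = indexes[i];
        }
        // In the controller, one "await db.SaveChangesAsync();" would follow here.
    }

    public static async Task Main()
    {
        var golfers = new List<Golfer>
        {
            new Golfer { GHID = 105 },
            new Golfer { GHID = 90 },
            new Golfer { GHID = 135 },
        };

        await UpdateAllAsync(golfers);

        foreach (var g in golfers)
        {
            Console.WriteLine($"{g.GHID}: {g.Index}");
        }
    }
}
```

Only the lookups overlap here; the entity updates and the single save happen afterwards on one logical flow, which matters because an Entity Framework context is not meant to be used from several operations at once. Whether the target site tolerates several simultaneous requests is a separate question to check.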

Code Snippets

public async Task<ActionResult> UpdateHandicap()

var results = db.Users.ToList();

foreach (Users user in results)
{
    user.Index = await Calculations.GrabIndexValue(user.GHID);
    db.Entry(user).State = EntityState.Modified;
    db.SaveChanges();
}

await db.SaveChangesAsync();

foreach (var user in db.Users)
{
    user.Index = await Calculations.GrabIndexValue(user.GHID);
    db.Entry(user).State = EntityState.Modified;
}
await db.SaveChangesAsync();

Context

StackExchange Code Review Q#152442, answer score: 5
