patterncsharpMinor
HtmlAgilityPack is being slow
Problem
HtmlAgilityPack is really slow at pulling back results. I have seen similar tools that get results a lot faster, but it's taking over a minute just to get the view counts on YouTube, and that's only the first page of results. Ideally I want to loop through multiple elements, but a nested loop wouldn't work for this code.
private void button1_Click(object sender, EventArgs e)
{
    //webBrowser1.Navigate("www.youtube.com/results?search_query=grindtime");
    StringBuilder output = new StringBuilder();
    string raw = "http://www.youtube.com/results?search_query=grindtime";
    HtmlWeb webGet = new HtmlWeb();
    webGet.UserAgent = "Mozilla/5.0 (Macintosh; I; Intel Mac OS X 11_7_9; de-LI; rv:1.9b4) Gecko/2012010317 Firefox/10.0a4";
    var document = webGet.Load(raw);
    var viewcount = document.DocumentNode.SelectNodes("//*[@class='viewcount']");
    //var videotitle = document.DocumentNode.SelectNodes("//a[@class='yt-uix-tile-link']");
    //var browser = document.DocumentNode.SelectNodes("//*[@class='viewcount']");
    // SelectNodes returns null when nothing matches
    if (viewcount != null)
    {
        foreach (var v in viewcount)
        {
            output.AppendLine(v.InnerHtml);
            ListViewItem lvi = new ListViewItem("#1");
            lvi.SubItems.Add("video title here");
            // was b.InnerHtml, but no variable b exists; v is the loop variable
            lvi.SubItems.Add(v.InnerHtml);
            //views
            //desc..
            //lid = link in desc yes/no
            listView1.Items.Add(lvi);
        }
    }
}
Solution
First, the web request itself will take some time.
To profile it, try loading a saved copy of the HTML from disk instead of fetching it, so you can confirm that parsing is the actual bottleneck. It may also help to comment out the lines that build the result UI while you measure.
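A minimal sketch of that measurement, assuming the page has already been saved to disk and read into a string. The class and method names here are made up for illustration; `HtmlDocument.LoadHtml` and `SelectNodes` are the library calls, and `Stopwatch` does the timing.

```csharp
using System;
using System.Diagnostics;
using HtmlAgilityPack;

static class ParseTiming
{
    // Times parsing and the XPath query separately for HTML that is already in memory,
    // so the network request is excluded from both numbers.
    public static (TimeSpan Parse, TimeSpan Query, int Matches) Measure(string html, string xpath)
    {
        var doc = new HtmlDocument();

        var sw = Stopwatch.StartNew();
        doc.LoadHtml(html);                               // parsing only
        TimeSpan parse = sw.Elapsed;

        sw.Restart();
        var nodes = doc.DocumentNode.SelectNodes(xpath);  // null when nothing matches
        TimeSpan query = sw.Elapsed;

        return (parse, query, nodes?.Count ?? 0);
    }
}
```

You could call it as `ParseTiming.Measure(File.ReadAllText("results.html"), "//*[@class='viewcount']")`, where `results.html` is a hypothetical saved copy of the search page. If both numbers come out small, the minute is being spent in the request or the UI rather than in the parser.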
I haven't used HTML Agility Pack much, but I see you are using an XPath selector that matches on a CSS class only. I'd assume that expression has to scan the entire document and check every element for the desired class.
So it may be a good idea to add some parents to the selector, so that it only looks for elements inside a given parent, ideally one selected by ID or tag name. Those are general rules for browsers; I don't expect them to work the same way in HTML Agility Pack, but they may still provide some gain.
Also, check whether the library has an option to match a single element. If so, find that element and walk its children until you reach the elements you want; this may improve things. And if there is a strict mode or any less tolerant parsing option, try turning it on, provided it doesn't break the parsing.
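As a rough sketch of narrowing the search, the following looks up one container by ID and then only walks that container's descendants. The `containerId` parameter is hypothetical (I don't know which ID wraps the results on the real page), and restricting the tag to `span` is also an assumption. Whether this is actually faster than the document-wide XPath is something to measure rather than take on faith.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ScopedSelect
{
    // containerId is a placeholder: pass whatever id wraps the result list on the real page.
    public static List<string> ViewCounts(HtmlDocument doc, string containerId)
    {
        HtmlNode container = doc.GetElementbyId(containerId);
        if (container == null)
            return new List<string>();

        // Only the container's subtree is visited.
        // An XPath equivalent would be: container.SelectNodes(".//span[@class='viewcount']")
        return container.Descendants("span")
            .Where(n => n.GetAttributeValue("class", "") == "viewcount")
            .Select(n => n.InnerText.Trim())
            .ToList();
    }
}
```

Like the original `@class='viewcount'` test, this matches only when the class attribute is exactly `viewcount`.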
If you expect the page to be well structured, valid HTML, and parsing speed doesn't improve however you refine the selector, you may consider falling back to plain old Regex matching.
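If you do go the Regex route, a sketch could look like the one below. It assumes the element's class attribute is exactly `viewcount`, that the attribute value is quoted, and that the element contains plain text with no nested tags; the pattern and the class name `ViewCountRegex` are my own illustration, and anything outside those assumptions will be missed or mismatched.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ViewCountRegex
{
    // Matches <tag ... class="viewcount" ...>text</tag> where text contains no nested tags.
    private static readonly Regex Pattern = new Regex(
        "<(?<tag>\\w+)[^>]*\\bclass\\s*=\\s*[\"']viewcount[\"'][^>]*>(?<text>[^<]*)</\\k<tag>>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(html))
            results.Add(m.Groups["text"].Value.Trim());
        return results;
    }
}
```

This trades robustness for speed: a small markup change on the page can silently break it, so it is worth keeping the parser-based version around to compare results.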
Also, I recommend separating the Windows Forms rendering from the loop. The bottleneck may be the drawing (unlikely, but possible), so try collecting all your items in a list and adding that list to the Windows Forms control once, outside the loop.
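One way to sketch that separation: build the `ListViewItem` objects first without touching the control, then add them in a single `AddRange` call wrapped in `BeginUpdate`/`EndUpdate` so the control repaints once. The helper names are hypothetical, and numbering the rows "#1", "#2", ... is my own choice (the question's code labels every row "#1").

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;

static class ListViewBatch
{
    // Builds the rows without touching any control, so this step can be timed on its own.
    public static ListViewItem[] BuildItems(IEnumerable<string> viewCounts)
    {
        return viewCounts.Select((views, i) =>
        {
            var item = new ListViewItem("#" + (i + 1));
            item.SubItems.Add("video title here"); // placeholder, as in the question
            item.SubItems.Add(views);
            return item;
        }).ToArray();
    }

    // Adds all rows in one call; BeginUpdate/EndUpdate suppresses repainting in between.
    public static void Fill(ListView listView, ListViewItem[] items)
    {
        listView.BeginUpdate();
        try { listView.Items.AddRange(items); }
        finally { listView.EndUpdate(); }
    }
}
```

In the click handler this would become something like `ListViewBatch.Fill(listView1, ListViewBatch.BuildItems(texts))`, where `texts` holds the strings collected in the parsing loop.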
Context
StackExchange Code Review Q#7827, answer score: 8