pattern csharp Minor
C# sitemap crawler
crawler sitemap stackoverflow
Problem
My code is a sitemap crawler. It opens a sitemap that lists sub-sitemaps, each covering a date span. It then opens each sub-sitemap and collects all of its URLs (most of them are SEO URLs, but not all). As each sub-sitemap is enumerated, all of its URLs are put in a list called list2. Code overview is like:

var dict = list2.ToDictionary(o => o, async o =>
    await new WebClient { Credentials = new NetworkCredential(user, pass) }
        .DownloadStringTaskAsync(new Uri(o.Replace(liveUrl, devUrl) + end)));

There is also a development site that contains the same data, but with some date propagation delay, and it does not have as hard a wall of cache as live does. Normally a page is 'heavy'; to mitigate the hit from that, a template page was created containing only the data I need. In addition, the dev site is viewable only on a particular domain and only after entering credentials.

Now, I could open the URLs one by one, but this would take an unreasonably long time. The problem is that I cannot just open them all asynchronously either. The server deals fine with the first and second packs of 500+ URLs, but after that it starts to choke, sending 504s (which effectively means your request is on hold for the next ~20 s).

How can I set up a reasonable batch size? (Rx-Linq?)
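
One way to read "batch size" literally is to split the URL list into fixed-size chunks and wait for each chunk to finish before starting the next. The following is only a sketch under assumptions: BatchedFetch, Batch, RunAsync, and the fetch delegate are hypothetical names (fetch stands in for the WebClient download above), and the batch size and pause would have to be tuned against the server.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchedFetch
{
    // Splits items into consecutive chunks of at most batchSize elements.
    public static IEnumerable<List<T>> Batch<T>(IReadOnlyList<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        for (int i = 0; i < items.Count; i += batchSize)
            yield return items.Skip(i).Take(batchSize).ToList();
    }

    // Downloads one batch at a time; requests inside a batch run concurrently.
    // fetch is a hypothetical stand-in for the credentialed download in the question.
    // Like the question's ToDictionary, this assumes nothing about duplicates:
    // a repeated URL simply overwrites its earlier entry.
    public static async Task<Dictionary<string, string>> RunAsync(
        IReadOnlyList<string> urls, Func<string, Task<string>> fetch,
        int batchSize, TimeSpan pause)
    {
        var pages = new Dictionary<string, string>();
        foreach (var batch in Batch(urls, batchSize))
        {
            // Task.WhenAll returns results in the same order as the input tasks.
            string[] bodies = await Task.WhenAll(batch.Select(fetch));
            for (int i = 0; i < batch.Count; i++)
                pages[batch[i]] = bodies[i];

            // Give the server a short breather before the next batch.
            await Task.Delay(pause);
        }
        return pages;
    }
}
```

With this shape, the slowest request in a batch holds up the whole batch, so a smaller batch size with a short pause is probably a safer starting point than one large pack.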

dict.Keys.ToList().ForEach(delegate(string o)
{
    try
    {
        // dict[o].Result blocks until the download task for that URL has finished
        result.Add(new PublicationLinkData
        {
            Url = o,
            Title = dict[o].Result.FindTagValue("", ""),
            Date = DateTime.Parse(dict[o].Result.FindTagValue("", ""))
        });
    }
    catch (Exception ex)
    {
        // log the sitemap and the URL that failed
        sw.WriteLine("{0}\t{1}", item.AbsoluteUri, o);
    }
});
dict.Clear();
result.Serialize(string.Format("d://sitemap/{0}-{1}.xml", @from, @to));

This basically tries to loop over each result and get {URL, data, date}. If there are errors, it logs the sitemap and the URL with the problem. Fina
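
The FindTagValue helper is called in the code above but its definition is not shown. For readers who want to run the code, here is a minimal sketch, assuming it returns the text between the first occurrence of a start marker and the next end marker. The class name StringTagExtensions and the choice to throw when a marker is missing are assumptions, not the author's implementation.

```csharp
using System;

public static class StringTagExtensions
{
    // Hypothetical stand-in for the FindTagValue helper used in the question.
    // Returns the text between the first startTag and the following endTag.
    public static string FindTagValue(this string source, string startTag, string endTag)
    {
        int start = source.IndexOf(startTag, StringComparison.Ordinal);
        if (start < 0)
            throw new FormatException("Start tag not found: " + startTag);

        // Move past the start marker so only the inner text is returned.
        start += startTag.Length;

        int end = source.IndexOf(endTag, start, StringComparison.Ordinal);
        if (end < 0)
            throw new FormatException("End tag not found: " + endTag);

        return source.Substring(start, end - start);
    }
}
```

Throwing on a missing marker fits the try/catch in the loop above: a page without the expected markup ends up in the error log instead of producing an empty title.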
Solution
I managed to solve my problem decently.

var dict = new Dictionary<string, string>();
Parallel.ForEach(list2,
    new ParallelOptions { MaxDegreeOfParallelism = 5 }, // at most 5 concurrent downloads
    o =>
    {
        try
        {
            var data = new WebClient { Credentials = new NetworkCredential(user, pass) }
                .DownloadString(new Uri(o.Replace(liveUrl, devUrl) + end));
            var link = new PublicationLinkData
            {
                Url = item.AbsoluteUri,
                Title = data.FindTagValue("<div id=\"title\">", "</div>"),
                Date = DateTime.Parse(data.FindTagValue("<div id=\"time\">", "</div>"))
            };
            // assuming result is a List<T>: Add is not thread-safe, so guard it with a lock
            lock (result) { result.Add(link); }
        }
        catch (Exception ex)
        {
            // the writer is shared by all workers, so guard it as well
            lock (sw) { sw.WriteLine("{0}\t{1}", item.AbsoluteUri, o); }
        }
    });

Basically, this runs at most 5 workers at a time, each of which downloads synchronously.
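
As a possible alternative to Parallel.ForEach with MaxDegreeOfParallelism, the same cap can be applied to asynchronous downloads with a SemaphoreSlim. This is a sketch, not the author's solution: ThrottledFetch, RunAsync, and the fetch delegate are hypothetical names, and fetch stands in for a call such as DownloadStringTaskAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledFetch
{
    // Runs fetch(url) for every URL, with at most maxConcurrent calls in flight.
    // fetch is a hypothetical stand-in for the credentialed download in the answer.
    // URLs are assumed to be unique; ToDictionary throws on duplicate keys.
    public static async Task<Dictionary<string, string>> RunAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch, int maxConcurrent)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            // ToList forces the lazy Select to run, so every task is created now;
            // each one then waits at the gate until a slot is free.
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();
                try
                {
                    return new KeyValuePair<string, string>(url, await fetch(url));
                }
                finally
                {
                    // Free the slot even if the download failed.
                    gate.Release();
                }
            }).ToList();

            // WhenAll completes only after every task has finished,
            // so the semaphore is not disposed while still in use.
            var pairs = await Task.WhenAll(tasks);
            return pairs.ToDictionary(p => p.Key, p => p.Value);
        }
    }
}
```

Unlike the blocking version, no thread is held while a request is pending, and a slot is reused as soon as one download finishes. A failed download makes the whole call throw, so per-URL error logging would need a try/catch inside fetch.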
Code Snippets
var dict = new Dictionary<string, string>();
Parallel.ForEach(list2,
    new ParallelOptions { MaxDegreeOfParallelism = 5 }, // at most 5 concurrent downloads
    o =>
    {
        try
        {
            var data = new WebClient { Credentials = new NetworkCredential(user, pass) }
                .DownloadString(new Uri(o.Replace(liveUrl, devUrl) + end));
            var link = new PublicationLinkData
            {
                Url = item.AbsoluteUri,
                Title = data.FindTagValue("<div id=\"title\">", "</div>"),
                Date = DateTime.Parse(data.FindTagValue("<div id=\"time\">", "</div>"))
            };
            // assuming result is a List<T>: Add is not thread-safe, so guard it with a lock
            lock (result) { result.Add(link); }
        }
        catch (Exception ex)
        {
            // the writer is shared by all workers, so guard it as well
            lock (sw) { sw.WriteLine("{0}\t{1}", item.AbsoluteUri, o); }
        }
    });
Context
StackExchange Code Review Q#80338, answer score: 4