Tags: debug, python
Minimal webcrawler - bad structure and error handling?
Problem
I wrote this code in one day as part of a job application, where they wanted a minimal webcrawler in any language. The purpose was to crawl a site, find all of the URLs on that page, and then crawl on to other pages. The code had to be able to resume operation if shut down (that's why I save the lists to text files). I made it in Python for simplicity, although I have never made anything more than Project Euler stuff in Python. I commented it excessively to make it highly readable.
I did not get the job, but I am interested in getting some input on how I can make the crawler better, and if there are some major flaws/structural errors. It was a fun task.
```
import urllib
import re
import os

# The parameter is the url the crawler is gonna crawl.
def crawler(url):
    # The crawler find links by utilizing pythons urllib and the href tag
    for new_url in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(url).read()):
        new_url = re.sub('\/*$', '', new_url)
        # A bit ugly, but this is to be sure that the new url is not in the urls already crawled,
        # or scheduled for crawling, a check for if it's a HTML element that have escape the regex,
        # or a "link" to a html class on the same page, and that the url ending is in the allowed list
        if not any(new_url in word for word in urls_Crawled) \
                and not any(new_url in word for word in urlsToCrawl_Parent) \
                and not any(new_url in word for word in urlsToCrawl_Child) \
                and '<' not in new_url and '#' not in new_url \
                and any(word in new_url.split('.')[-1] for word in allowedList) \
                and 'https://www' not in new_url \
                and 'mailto' not in new_url:
            # If the url is in the new url, f.ex http://www.telenor.com is inside http://www.telenor.com/no/personvern/feed
            # the latter should be appended in the parent list as i prioritize the local urls first.
            if url.replace("http://","").replace("www.","").split('/', 1)[0] in new_url:
                urlsToCrawl_Parent.append(new_url)
            # Another check, this is if we deal with a local link on the form /no/personvern/feed/
            elif new_url.startswith('/'):
                # To better store local links, f.ex /no/personvern/feed/ forward slash (/) is omitted
                new_url = re.sub('\/*$', '', new_url)
                # The /no/personvern/feed/ url is stored as www.telenor.com/no/personvern/feed/
                urlsToCrawl_Parent.append(url+new_url)
            # If we are not dealing with a local URL we are dealing with a child URL
            else:
                urlsToCrawl_Child.append(new_url)
    # We are done crawling this URL and thus we remove it from the list, and we add it to the list over urls crawled.
    urlsToCrawl_Parent.pop(0)
    urls_Crawled.append(url)
    # A fast implemention of a "blind" saving of state so the program can be resumed after an error
    writetofile(urls_Crawled, 'URLsCrawled')
    writetofile(urlsToCrawl_Parent, 'parentURLs')
    writetofile(urlsToCrawl_Child, 'childURLs')

# Here we write the list over urls crawled to a txt file
def writetofile(urls, filename):
    with open(filename, 'w') as file:
        for item in urls:
            print>>file, item

if __name__ == "__main__":
    # This is the 'input' parameters; max_urls gives how many different urls it's gonna crawl.
    # This is implemented to give better control of the runtime of this crawler.
    starturl = ""
    max_urls = 500
    if not os.path.exists('parentURLs') or not os.path.getsize('parentURLs') > 0 \
            or not os.path.exists('childURLs') or not os.path.getsize('childURLs') > 0:
        # These are simple lists to hold the urls crawled, the parent urls, i.e the 'local' urls on the current page,
        # while the child urls are urls for other web pages found on a parent page.
        urls_Crawled = []
        urlsToCrawl_Parent = []
        urlsToCrawl_Child = []
        # We start by appending the starturl
        urlsToCrawl_Parent.append(starturl)
    else:
        with open('URLsCrawled', 'r') as f:
            urls_Crawled = [line.rstrip('\n') for line in f]
        with open('parentURLs', 'r') as f:
            urlsToCrawl_Parent = [line.rstrip('\n') for line in f]
```
Solution
```
import urllib
import re
import os

# The parameter is the url the crawler is gonna crawl.
```
That's a pretty useless comment. The parameter is named `url`, so I already know it is the URL.
```
def crawler(url):
```
Seeing as this is a function, it should really be named as a verb. The function is an action, not a thing.
```
    # The crawler find links by utilizing pythons urllib and the href tag
    for new_url in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(url).read()):
```
Any sort of regex being applied to extracting data from a webpage is suspicious. Consider using Beautiful Soup or similar to properly parse the HTML.
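As a minimal sketch of what that could look like (Python 3 names used here, while the original code is Python 2; the helper name `find_links` is made up for illustration):
```
from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_links(url):
    # Parse the fetched page properly instead of pattern-matching the raw HTML
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    # href=True skips anchor tags that have no href attribute at all
    return [a['href'] for a in soup.find_all('a', href=True)]
```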
```
        new_url = re.sub('\/*$', '', new_url)
        if not any(new_url in word for word in urls_Crawled)
```
Why are you checking if the new url is a substring?
```
        and not any(new_url in word for word in urlsToCrawl_Parent)
        and not any(new_url in word for word in urlsToCrawl_Child)
```
Instead of checking all these places, have a single set that you add to for each instance. Then you don't need to check three times.
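A sketch of the idea, assuming a hypothetical `seen` set that every discovered URL goes into regardless of which queue it ends up in (the names `seen` and `schedule` are illustrative, not from the original code):
```
seen = set()            # every URL ever discovered, crawled or scheduled
to_crawl_parent = []    # still separate queues, if the ordering matters
to_crawl_child = []

def schedule(new_url, is_parent):
    # One exact membership test replaces the three substring scans
    if new_url in seen:
        return
    seen.add(new_url)
    (to_crawl_parent if is_parent else to_crawl_child).append(new_url)
```
Set membership is also an exact, constant-time check rather than a substring scan over every stored URL.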
```
        and '<' not in new_url and '#' not in new_url
```
`#` at least is used in various valid URLs.
```
        and any(word in new_url.split('.')[-1] for word in allowedList)
```
Why are you checking against any possible word in the URL?
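If the intent is "only follow URLs whose extension is in the allowed list", an exact comparison of the extension says that directly; a sketch (the contents of `allowedList` are assumed here, since they are not shown in the question):
```
allowedList = {'html', 'htm', 'php', 'asp'}   # assumed contents, for illustration

def has_allowed_extension(url):
    # Compare the extension itself instead of looking for substrings inside it
    return url.rsplit('.', 1)[-1].lower() in allowedList
```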
```
        and 'https://www' not in new_url
```
What do you have against https? Why only https://www?
```
        and 'mailto' not in new_url:
```
What if you had "mailto" somewhere else in the URL?
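Both of these checks (the blanket https://www exclusion and the mailto substring test) could instead be handled by parsing the URL and whitelisting schemes; a sketch using Python 3's urllib.parse:
```
from urllib.parse import urlparse

def is_crawlable(url):
    scheme = urlparse(url).scheme
    # Accept ordinary web links; '' covers relative links like /no/personvern/feed
    return scheme in ('http', 'https', '')
```
This rejects mailto:, javascript: and similar links no matter where the word appears in the URL, and treats http and https the same way.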
```
            # If the url is in the new url, f.ex http://www.telenor.com is inside http://www.telenor.com/no/personvern/feed
            # the latter should be appended in the parent list as i prioritize the local urls first.
            if url.replace("http://","").replace("www.","").split('/', 1)[0] in new_url:
                urlsToCrawl_Parent.append(new_url)
```
Your code and comment don't really seem to align. You seem to be checking whether it is on the same domain.
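If "same domain as the page being crawled" is what's intended, comparing the parsed hostnames states that directly; a sketch (again urllib.parse, Python 3):
```
from urllib.parse import urlparse

def is_same_domain(page_url, new_url):
    # netloc is the host (plus port), e.g. 'www.telenor.com'
    return urlparse(new_url).netloc == urlparse(page_url).netloc
```
Note that relative links parse to an empty netloc, so they would need to be resolved against the page URL first (see the urljoin sketch below).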
```
            # Another check, this is if we deal with a local link on the form /no/personvern/feed/
            elif new_url.startswith('/'):
```
You already stripped off the '/' so I don't see how this could happen.
```
                # To better store local links, f.ex /no/personvern/feed/ forward slash (/) is omitted
                new_url = re.sub('\/*$', '', new_url)
```
Modifying the URL here is dangerous. It makes it difficult to predict how it will work in the rest of the application.
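Instead of hand-editing the string and concatenating, the standard library can resolve a relative link against the page it was found on; a sketch with urljoin:
```
from urllib.parse import urljoin

page = 'http://www.telenor.com/no/'
link = '/no/personvern/feed/'
# urljoin handles leading/trailing slashes, '..' segments and absolute links for you
absolute = urljoin(page, link)   # -> 'http://www.telenor.com/no/personvern/feed/'
```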
```
                # The /no/personvern/feed/ url is stored as www.telenor.com/no/personvern/feed/
                urlsToCrawl_Parent.append(url+new_url)
            # If we are not dealing with a local URL we are dealing with a child URL
            else:
                urlsToCrawl_Child.append(new_url)
    # We are done crawling this URL and thus we remove it from the list, and we add it to the list over urls crawled.
    urlsToCrawl_Parent.pop(0)
    urls_Crawled.append(url)
```
```
    # A fast implemention of a "blind" saving of state so the program can be resumed after an error
    writetofile(urls_Crawled, 'URLsCrawled')
    writetofile(urlsToCrawl_Parent, 'parentURLs')
    writetofile(urlsToCrawl_Child, 'childURLs')
```
Writing three different files seems excessive.
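One way to keep the resume-after-crash behaviour with a single file would be to dump the whole state as JSON; a sketch (the file name and helper names here are made up for illustration):
```
import json

def save_state(filename, crawled, parents, children):
    # All three lists go into one file instead of three
    with open(filename, 'w') as f:
        json.dump({'crawled': crawled, 'parents': parents, 'children': children}, f)

def load_state(filename):
    with open(filename) as f:
        state = json.load(f)
    return state['crawled'], state['parents'], state['children']
```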
```
# Here we write the list over urls crawled to a txt file
def writetofile(urls, filename):
    with open(filename, 'w') as file:
        for item in urls:
            print>>file, item
```
```
if __name__ == "__main__":
    # This is the 'input' parameteres the max_url gives how many different url it's gonna crawl.
    # this is implemented to give a better control of the runtime of this crawler.
    starturl = ""
    max_urls = 500
    if not os.path.exists('parentURLs') or not os.path.getsize('parentURLs') > 0 or not os.path.exists('childURLs') or not os.path.getsize('childURLs') > 0:
```
It probably doesn't make sense to default to picking up where a previous run left off, because the user might not realize that they left those files around. Instead, I'd suggest a command line --recover flag or something.
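A sketch of what that could look like with argparse (the flag name --recover is just the suggestion above, not an existing option):
```
import argparse
import os

parser = argparse.ArgumentParser(description='Minimal webcrawler')
parser.add_argument('--recover', action='store_true',
                    help='resume from previously saved state files')
args = parser.parse_args()

if args.recover and os.path.exists('parentURLs'):
    print('resuming from saved state')   # load the saved lists here
else:
    print('starting a fresh crawl')      # start from starturl with empty lists
```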
```
        # This is simple lists to hold the urls crawled, the parent urls, i.e the 'local' urls on the current page, while the
        # child urls are urls for other web pages found on a parent page.
        urls_Crawled = []
        urlsToCrawl_Parent = []
        urlsToCrawl_Child = []
```
Don't store the program state in global variables. At least move this into a class.
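A rough outline of what that could look like; the class and method names are illustrative, not a drop-in rewrite:
```
class Crawler(object):
    def __init__(self, start_url, max_urls=500):
        # All state lives on the instance instead of in module-level globals
        self.max_urls = max_urls
        self.crawled = []
        self.parents = [start_url]
        self.children = []

    def crawl(self, url):
        # Find links on `url` and file them into self.parents / self.children
        pass

    def run(self):
        while self.parents and len(self.crawled) < self.max_urls:
            url = self.parents.pop(0)
            self.crawl(url)
            self.crawled.append(url)
```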
```
        # We start to append the starturl
        urlsToCrawl_Parent.append(starturl)
    else:
        with open('URLsCrawled', 'r') as f:
            urls_Crawled = [line.rstrip('\n') for line in f]
        with open('parentURLs', 'r') as f:
            urlsToCrawl_Parent = [line.rstrip('\n') for line in f]
```
Context
StackExchange Code Review Q#49480, answer score: 18
Revisions (0)
No revisions yet.