
Minimal webcrawler - bad structure and error handling?

Submitted by: @import:stackexchange-codereview

Problem

I wrote this code over one day as part of a job application, where they wanted me to make a minimal webcrawler in any language. The purpose was to crawl a site, find all of the URLs on a page, and crawl on to other pages. The code had to be able to resume operation if shut down (that's why I save the lists to text files). I made it in Python just for simplicity, although I have never written anything beyond Project Euler exercises in Python. I commented it excessively to make it highly readable.

I did not get the job, but I am interested in getting some input on how I can make the crawler better, and whether there are any major flaws or structural errors. It was a fun task.

```
import urllib
import re
import os

# The parameter is the url the crawler is gonna crawl.
def crawler(url):
    # The crawler finds links by utilizing python's urllib and the href tag
    for new_url in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(url).read()):
        new_url = re.sub('\/*$', '', new_url)

        # A bit ugly, but this is to be sure that the new url is not in the urls already
        # crawled or scheduled for crawling, that it's not an HTML element that has escaped
        # the regex or a "link" to an html class on the same page, and that the url ending
        # is in the allowed list
        if not any(new_url in word for word in urls_Crawled) \
                and not any(new_url in word for word in urlsToCrawl_Parent) \
                and not any(new_url in word for word in urlsToCrawl_Child) \
                and '<' not in new_url and '#' not in new_url \
                and any(word in new_url.split('.')[-1] for word in allowedList) \
                and 'https://www' not in new_url and 'mailto' not in new_url:

            # If the url is in the new url, f.ex http://www.telenor.com is inside
            # http://www.telenor.com/no/personvern/feed, the latter should be appended
            # to the parent list as I prioritize the local urls first.
            if url.replace("http://", "").replace("www.", "").split('/', 1)[0] in new_url:
                urlsToCrawl_Parent.append(new_url)

            # Another check, this is if we deal with a local link on the form /no/personvern/feed/
            elif new_url.startswith('/'):
                # To better store local links, the forward slash (/) is omitted, so
                # /no/personvern/feed/ is stored as www.telenor.com/no/personvern/feed
                new_url = re.sub('\/*$', '', new_url)
                urlsToCrawl_Parent.append(url + new_url)

            # If we are not dealing with a local URL we are dealing with a child URL
            else:
                urlsToCrawl_Child.append(new_url)

    # We are done crawling this URL, so we remove it from the list and add it to the list of urls crawled.
    urlsToCrawl_Parent.pop(0)
    urls_Crawled.append(url)

    # A fast implementation of a "blind" saving of state so the program can be resumed after an error
    writetofile(urls_Crawled, 'URLsCrawled')
    writetofile(urlsToCrawl_Parent, 'parentURLs')
    writetofile(urlsToCrawl_Child, 'childURLs')

# Here we write a list of urls to a txt file
def writetofile(urls, filename):
    with open(filename, 'w') as file:
        for item in urls:
            print>>file, item

if __name__ == "__main__":

    # These are the 'input' parameters; max_urls gives how many different urls it's gonna crawl.
    # This is implemented to give better control over the runtime of this crawler.
    starturl = ""
    max_urls = 500

    if not os.path.exists('parentURLs') or not os.path.getsize('parentURLs') > 0 \
            or not os.path.exists('childURLs') or not os.path.getsize('childURLs') > 0:

        # Simple lists to hold the urls crawled and the urls to crawl: the parent urls are
        # the 'local' urls on the current page, while the child urls are urls for other
        # web pages found on a parent page.
        urls_Crawled = []
        urlsToCrawl_Parent = []
        urlsToCrawl_Child = []

        # We start to append the starturl
        urlsToCrawl_Parent.append(starturl)

    else:
        with open('URLsCrawled', 'r') as f:
            urls_Crawled = [line.rstrip('\n') for line in f]
        with open('parentURLs', 'r') as f:
            urlsToCrawl_Parent = [line.rstrip('\n') for line in f]
        with open('childURLs', 'r') as f:
            urlsToCrawl_Child = [line.rstrip('\n') for line in f]
```

Solution

```
import urllib
import re
import os

# The parameter is the url the crawler is gonna crawl.
```


That's a pretty useless comment. The parameter is named url. So I already know it is the URL.

```
def crawler(url):
```


Seeing as this is a function, it should really be named as a verb. The function is an action, not a thing.

```
# The crawler find links by utilizing pythons urllib and the href tag
    for new_url in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(url).read()):
```


Any sort of regex being applied to extract data from a webpage is suspicious. Consider using Beautiful Soup or similar to properly parse the HTML.
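For instance, a minimal sketch of the link extraction with Beautiful Soup (this assumes Python 3 and the third-party bs4 package; your code is Python 2, but the same idea applies):

```
import urllib.request
from bs4 import BeautifulSoup

def find_links(url):
    # Let a real HTML parser worry about quoting, attribute order and malformed markup
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips <a> tags that have no href attribute at all
    return [a['href'] for a in soup.find_all('a', href=True)]
```

This also stops HTML fragments like '<' from ever reaching your URL lists, which removes the need for several of the checks below.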

```
        new_url = re.sub('\/*$', '', new_url)

        # A bit ugly, but this is to be sure that the new url is not in the urls already crawled, or scheduled for crawling, a check for if it's a HTML element that have escape the regex, or a "link" to a html class on the same page, and that the url ending is in the allowed list
        if not any(new_url in word for word in urls_Crawled)
```

Why are you checking if the new url is a substring?

```
            and not any(new_url in word for word in urlsToCrawl_Parent)
            and not any(new_url in word for word in urlsToCrawl_Child)
```


Instead of checking all these places, have a single set that you add to for each instance. Then you don't need to check three times.
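A sketch of that idea (seen_urls and schedule are hypothetical names, not from your code):

```
seen_urls = set()   # every URL ever crawled or queued, in one place

def schedule(new_url):
    # Set membership is an O(1) exact match, not a substring scan over three lists
    if new_url not in seen_urls:
        seen_urls.add(new_url)
        urlsToCrawl_Parent.append(new_url)
```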

```
            and '<' not in new_url and '#' not in new_url
```


'#' at least is used in various valid URLs.

```
            and any(word in new_url.split('.')[-1] for word in allowedList)
```


Why are you checking against any possible word in the URL?

```
            and 'https://www' not in new_url
```


What do you have against https? Why only https://www?

```
            and 'mailto' not in new_url:
```


What if you had "mailto" somewhere else in the URL?
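Parsing the URL and checking its scheme avoids both substring problems; a sketch using the standard library (urllib.parse in Python 3, urlparse in Python 2):

```
from urllib.parse import urlparse

def is_crawlable(new_url):
    scheme = urlparse(new_url).scheme
    # Accept http and https alike, and '' for relative links;
    # mailto:, ftp:, javascript: etc. are all rejected by scheme, not by substring.
    return scheme in ('http', 'https', '')
```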

```
            # If the url is in the new url, f.ex http://www.telenor.com is inside http://www.telenor.com/no/personvern/feed
            # the latter should be appended in the parent list as i prioritize the local urls first.
            if url.replace("http://","").replace("www.","").split('/', 1)[0] in new_url:
                urlsToCrawl_Parent.append(new_url)
```


Your code and comment don't really seem to align. You seem to be checking whether it is on the same domain.
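If same-domain really is the intent, comparing parsed hostnames says so directly; a sketch (again urllib.parse, with urljoin to resolve relative links first):

```
from urllib.parse import urljoin, urlparse

def same_domain(base_url, new_url):
    # Resolve relative links against the page they were found on, then compare hosts
    absolute = urljoin(base_url, new_url)
    return urlparse(absolute).netloc == urlparse(base_url).netloc
```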

```
            # Another check, this is if we deal with a local link on the form /no/personvern/feed/
            elif new_url.startswith('/'):
```


You already stripped off the '/' so I don't see how this could happen.

```
                # To better store local links, f.ex /no/personvern/feed/ forward slash (/) is omitted
                new_url = re.sub('\/*$', '', new_url)
```

Modifying the URL here is dangerous. It makes it difficult to predict how it will work in the rest of the application.
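Binding the normalized value to a fresh name keeps new_url predictable for the rest of the loop body; a sketch (store_local is a hypothetical helper, not from your code):

```
import re

def store_local(url, new_url):
    # Bind the normalized form to a new name; the caller's new_url is untouched
    stored_url = re.sub('/*$', '', new_url)
    return url + stored_url
```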

```
                # The /no/personvern/feed/ url is stored as www.telenor.com/no/personvern/feed/
                urlsToCrawl_Parent.append(url+new_url)

            # If we are not dealing with an local URL we are dealing with a child URL
            else:
                urlsToCrawl_Child.append(new_url)

    # We are done crawling this URL and thus we remove it from the list, and we add it to the list over urls crawled.
    urlsToCrawl_Parent.pop(0)
    urls_Crawled.append(url)

    # A fast implemention of a "blind" saving of state so the program can be resumed after an error
    writetofile(urls_Crawled, 'URLsCrawled')
    writetofile(urlsToCrawl_Parent, 'parentURLs')
    writetofile(urlsToCrawl_Child, 'childURLs')
```


Writing three different files seems excessive.
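A single snapshot file would do the same job; a sketch using json (the function and file names are made up):

```
import json

def save_state(crawled, parents, children, filename='crawler_state.json'):
    # One snapshot instead of three parallel text files that can drift out of sync
    with open(filename, 'w') as f:
        json.dump({'crawled': crawled, 'parents': parents, 'children': children}, f)
```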

```
# Here we write the list over urls crawled to a txt file
def writetofile(urls, filename):

    with open(filename,'w') as file:
        for item in urls:
            print>>file, item
```

```
if __name__ == "__main__":

    # This is the 'input' parameteres the max_url gives how many different url it's gonna crawl.
    # this is implemented to give a better control of the runtime of this crawler.
    starturl = ""
    max_urls = 500

    if not os.path.exists('parentURLs') or not os.path.getsize('parentURLs') > 0 or not os.path.exists('childURLs') or not os.path.getsize('childURLs') > 0:
```


It probably doesn't make sense to default to picking up where a previous run left off, because the user might not realize that they left those files around. Instead, I'd suggest a command line --recover flag or something.
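For example, with argparse (the flag and helper names are only suggestions):

```
import argparse

parser = argparse.ArgumentParser(description='Minimal web crawler')
parser.add_argument('--recover', action='store_true',
                    help='resume from the saved URL lists instead of starting fresh')
args = parser.parse_args()

if args.recover:
    load_saved_state()   # hypothetical: read the saved lists back in
else:
    start_fresh()        # hypothetical: initialize empty lists and queue starturl
```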

```
        # This is simple lists to hold the urls crawled, the parent urls, i.e the 'local' urls on the current page, while the
        # child urls are urls for other web pages found on a parent page.
        urls_Crawled = []
        urlsToCrawl_Parent = []
        urlsToCrawl_Child = []
```


Don't store the program state in global variables. At least move this into a class.
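A rough outline of what that restructuring might look like (names are illustrative):

```
class Crawler(object):
    """Owns all crawl state that used to live in module-level globals."""

    def __init__(self, start_url, max_urls=500):
        self.max_urls = max_urls
        self.crawled = []
        self.to_crawl_parent = [start_url]
        self.to_crawl_child = []

    def crawl_next(self):
        # Body of the old crawler() function goes here, reading and writing
        # self.crawled / self.to_crawl_parent / self.to_crawl_child
        pass
```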

```
        # We start to append the starturl
        urlsToCrawl_Parent.append(starturl)

    else:
        with open('URLsCrawled', 'r') as f:
            urls_Crawled = [line.rstrip('\n') for line in f]
        with open('parentURLs', 'r') as f:
            urlsToCrawl_Parent = [line.rstrip('\n') for line in f]
        with open('childURLs', 'r') as f:
            urlsToCrawl_Child = [line.rstrip('\n') for line in f]
```




Context

StackExchange Code Review Q#49480, answer score: 18
