
Crawl site getting URL and status code

Submitted by: @import:stackexchange-codereview

Problem

I wrote a crawler that collects the status code of every page it visits.
My solution is below. Can this code be improved?

import urllib

def getfromurl(url):
    start = urllib.urlopen(url)
    raw = ''
    for lines in start.readlines():
        raw += lines                   
    start.close()
    return raw

def dumbwork(start_link, start_url, text, pattern, counter):
    if counter < 2:
        counter = counter +1
        while start_link != -1:
            try:
                start_url = text.find('/', start_link) 
                end_url = text.find('"', start_url + 1)
                url = 'http:/' + text[start_url + 1 : end_url]
                page_status = str(urllib.urlopen(url).getcode())
                row = url + ', ' + page_status
                t.write(row + '\n')
                temp = str(getfromurl(url))
                print row
                dumbwork(temp.find(pattern), 0, temp, pattern, counter)
                start_link = text.find(pattern, end_url + 1) 
            except Exception, e:
                break
    else:
        pass

t = open('inout.txt', 'w')
text = str(getfromurl('http://www.site.it'))
pattern = '<a href="http:/'
start_link = text.find(pattern)
dumbwork(start_link, 0, text, pattern, 0)
t.close()

Solution

- You're taking for granted that a link will be '<a href="http:/'. Real pages also contain https:// links, relative links, single-quoted attributes and anchors with other attributes before href, so this hard-coded pattern will miss most of them.
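
For illustration (this helper is mine, not part of the original answer), a regular expression is a less fragile way to pull absolute URLs out of the HTML:

import re

def find_links(html_content):
    # Match double-quoted href values starting with http:// or https://.
    # Still a heuristic, but it catches the forms the fixed pattern misses.
    return re.findall(r'href="(https?://[^"]+)"', html_content)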

- Naming:

  • Usually a row is related to a database, while a line is related to a text file.

  • temp means nothing; it's the new content, so you should use something like new_html_content.

  • It takes a bit to understand that counter is actually the max depth that you want to follow, so why not call it depth?

  • Function names should explain what they do; dumbwork doesn't, so something like recurse_page may be better.

  • start_link is good for the first link (almost, see below), but the parameter to the function is actually the current link being parsed, so why not call it current_link?

  • You used snake case for start_link; you should keep using it, so get_from_url may be better than getfromurl.

  • start_link, start_url and end_url are not links or URLs; they're actually indexes into the string, so they should be start_link_index, start_url_index and end_url_index.

  • text is the HTML content, so just rename it to html_content.



- The lines doing something with row should be next to each other, or better yet, in a separate function (see append_to_file below).

- That 2 should be in a constant so that the first line of the function can be something like if depth < MAX_DEPTH:
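
A minimal sketch of that change (MAX_DEPTH is the name the suggested code below uses; the constant definition itself is my addition):

MAX_DEPTH = 2  # was the magic number 2

def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        # ... crawl as before ...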

- You're trapping exceptions but you're not doing anything with them; you should at least log somewhere what happened.
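
A sketch of one way to do that with the standard logging module (the fetch_status helper is hypothetical, not from the answer):

import logging
import urllib

def fetch_status(url):
    try:
        return urllib.urlopen(url).getcode()
    except Exception:
        # Log the failing URL together with the traceback instead of
        # silently swallowing the error.
        logging.exception('failed to fetch %s', url)
        return None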

- The text.find calls to get the URL are probably better off in a separate function, to improve readability, something like url = get_url(html_content, start_index).
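
The answer never spells get_url out; a minimal sketch, transplanting the find logic from the question, might be:

def get_url(html_content, start_index):
    # Find the '/' that ends the matched pattern, then read up to the
    # closing double quote; 'http:/' + the slice rebuilds the full URL,
    # exactly as the original code did.
    start_url_index = html_content.find('/', start_index)
    end_url_index = html_content.find('"', start_url_index + 1)
    return 'http:/' + html_content[start_url_index + 1 : end_url_index]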

- getfromurl already returns a string, so there is no need for the str() around it.

- You're always using the same file name, which, when opened with 'w', will overwrite the contents. You should at least check whether the file already exists.
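
One way to guard against that (a sketch; the answer doesn't prescribe a mechanism):

import os

output_path = 'inout.txt'
if os.path.exists(output_path):
    raise IOError('%s already exists, refusing to overwrite it' % output_path)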

- You're opening a file and leaving it open for the whole duration of the process. This is not bad in itself, but I'd probably write a function called append_to_file that opens the file with 'a' instead of 'w', writes the line and immediately closes it. Inside that function you would also convert the status code to a string.
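
A minimal sketch of that helper, following the description above (the file name comes from the question):

def append_to_file(url, page_status):
    # Append mode preserves earlier lines; str() handles the integer
    # status code; closing right away keeps the handle short-lived.
    f = open('inout.txt', 'a')
    f.write(url + ', ' + str(page_status) + '\n')
    f.close()

This also removes the need for the module-level t handle and the final t.close().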

In the end, your worker loop may look something like this:

def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        while current_link_index > -1:
            try:
                url = get_url(html_content, start_index)
                append_to_file(url, urllib.urlopen(url).getcode())
                new_html_content = get_from_url(url)
                recurse_page(new_html_content.find(pattern), 0, new_html_content, pattern, depth)
                current_link_index = html_content.find(pattern, end_url_index + 1) 
            except Exception, e:
                # TODO: Proper error handling
                break


It's not complete code, but it should give you an idea of what I mean.
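
For completeness (this glue is mine, not part of the answer), the module-level setup from the question would then shrink to something like:

import urllib

MAX_DEPTH = 2
pattern = '<a href="http:/'

html_content = get_from_url('http://www.site.it')
recurse_page(html_content.find(pattern), 0, html_content, pattern, 0)

Note that end_url_index is still undefined inside recurse_page; get_url would have to return it as well (or the loop be restructured) before the sketch actually runs.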


Context

StackExchange Code Review Q#135567, answer score: 4
