
Crawl site getting URL and status code

Submitted by: @import:stackexchange-codereview

Problem

I wrote a crawler that collects the status code of every page it visits.
My solution is below. Can this code be improved?

import urllib

def getfromurl(url):
    start = urllib.urlopen(url)
    raw = ''
    for lines in start.readlines():
        raw += lines                   
    start.close()
    return raw

def dumbwork(start_link, start_url, text, pattern, counter):
    if counter < 2:
        counter = counter +1
        while start_link != -1:
            try:
                start_url = text.find('/', start_link) 
                end_url = text.find('"', start_url + 1)
                url = 'http:/' + text[start_url + 1 : end_url]
                page_status = str(urllib.urlopen(url).getcode())
                row = url + ', ' + page_status
                t.write(row + '\n')
                temp = str(getfromurl(url))
                print row
                dumbwork(temp.find(pattern), 0, temp, pattern, counter)
                start_link = text.find(pattern, end_url + 1) 
            except Exception, e:
                break
    else:
        pass

t = open('inout.txt', 'w')
text = str(getfromurl('http://www.site.it'))
pattern = '<a href="http:/'
start_link = text.find(pattern)
dumbwork(start_link, 0, text, pattern, 0)
t.close()

Solution

- You're taking for granted that a link will be '<a href="http:/'. Real pages also contain https:// links, relative links, single-quoted attributes and anchors with other attributes before href, so this hard-coded pattern will miss most of them.
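
For illustration (this helper is mine, not part of the original answer), a regular expression is a less fragile way to pull absolute URLs out of the HTML:

import re

def find_links(html_content):
    # Match double-quoted href values starting with http:// or https://.
    # Still a heuristic, but it catches the forms the fixed pattern misses.
    return re.findall(r'href="(https?://[^"]+)"', html_content)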

- Naming:

  • Usually a row is related to a database, while a line is related to a text file.

  • temp means nothing; it's the new content, so you should use something like new_html_content.

  • It takes a bit to understand that counter is actually the max depth that you want to follow, so why not call it depth?

  • Function names should explain what they do; dumbwork doesn't, so something like recurse_page may be better.

  • start_link is good for the first link (almost, see below), but the parameter to the function is actually the current link being parsed, so why not call it current_link?

  • You used snake case for start_link; you should keep using it, so get_from_url may be better than getfromurl.

  • start_link, start_url and end_url are not links or URLs; they're actually indexes into the string, so they should be start_link_index, start_url_index and end_url_index.

  • text is the HTML content, so just rename it to html_content.



- The lines doing something with row should be next to each other, or better yet, in a separate function (see append_to_file below).

- That 2 should be in a constant so that the first line of the function can be something like if depth < MAX_DEPTH:
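
A minimal sketch of that change (MAX_DEPTH is the name the suggested code below uses; the constant definition itself is my addition):

MAX_DEPTH = 2  # was the magic number 2

def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        # ... crawl as before ...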

- You're trapping exceptions but you're not doing anything with them; you should at least log somewhere what happened.
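
A sketch of one way to do that with the standard logging module (the fetch_status helper is hypothetical, not from the answer):

import logging
import urllib

def fetch_status(url):
    try:
        return urllib.urlopen(url).getcode()
    except Exception:
        # Log the failing URL together with the traceback instead of
        # silently swallowing the error.
        logging.exception('failed to fetch %s', url)
        return None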

- The text.find calls to get the URL are probably better off in a separate function, to improve readability, something like url = get_url(html_content, start_index).
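
The answer never spells get_url out; a minimal sketch, transplanting the find logic from the question, might be:

def get_url(html_content, start_index):
    # Find the '/' that ends the matched pattern, then read up to the
    # closing double quote; 'http:/' + the slice rebuilds the full URL,
    # exactly as the original code did.
    start_url_index = html_content.find('/', start_index)
    end_url_index = html_content.find('"', start_url_index + 1)
    return 'http:/' + html_content[start_url_index + 1 : end_url_index]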

- getfromurl already returns a string, so there is no need for the str() around it.

- You're always using the same file name, which, when opened with 'w', will overwrite the contents. You should at least check whether the file already exists.
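
One way to guard against that (a sketch; the answer doesn't prescribe a mechanism):

import os

output_path = 'inout.txt'
if os.path.exists(output_path):
    raise IOError('%s already exists, refusing to overwrite it' % output_path)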

- You're opening a file and leaving it open for the whole duration of the process. This is not bad in itself, but I'd probably write a function called append_to_file that opens the file with 'a' instead of 'w', writes the line and immediately closes it. Inside that function you would also convert the status code to a string.
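
A minimal sketch of that helper, following the description above (the file name comes from the question):

def append_to_file(url, page_status):
    # Append mode preserves earlier lines; str() handles the integer
    # status code; closing right away keeps the handle short-lived.
    f = open('inout.txt', 'a')
    f.write(url + ', ' + str(page_status) + '\n')
    f.close()

This also removes the need for the module-level t handle and the final t.close().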

In the end, your worker loop may look something like this:

def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        while current_link_index > -1:
            try:
                url = get_url(html_content, start_index)
                append_to_file(url, urllib.urlopen(url).getcode())
                new_html_content = get_from_url(url)
                recurse_page(new_html_content.find(pattern), 0, new_html_content, pattern, depth)
                current_link_index = html_content.find(pattern, end_url_index + 1) 
            except Exception, e:
                # TODO: Proper error handling
                break


It's not complete code, but it should give you an idea of what I mean.
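
For completeness (this glue is mine, not part of the answer), the module-level setup from the question would then shrink to something like:

import urllib

MAX_DEPTH = 2
pattern = '<a href="http:/'

html_content = get_from_url('http://www.site.it')
recurse_page(html_content.find(pattern), 0, html_content, pattern, 0)

Note that end_url_index is still undefined inside recurse_page; get_url would have to return it as well (or the loop be restructured) before the sketch actually runs.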


Context

StackExchange Code Review Q#135567, answer score: 4
