Crawl site getting URL and status code
Problem
I wrote a crawler that collects the status code of every page it visits. Below is my solution. Can this code be optimized?
import urllib

def getfromurl(url):
    start = urllib.urlopen(url)
    raw = ''
    for lines in start.readlines():
        raw += lines
    start.close()
    return raw

def dumbwork(start_link, start_url, text, pattern, counter):
    if counter < 2:
        counter = counter + 1
        while start_link != -1:
            try:
                start_url = text.find('/', start_link)
                end_url = text.find('"', start_url + 1)
                url = 'http:/' + text[start_url + 1 : end_url]
                page_status = str(urllib.urlopen(url).getcode())
                row = url + ', ' + page_status
                t.write(row + '\n')
                temp = str(getfromurl(url))
                print row
                dumbwork(temp.find(pattern), 0, temp, pattern, counter)
                start_link = text.find(pattern, end_url + 1)
            except Exception, e:
                break
    else:
        pass

t = open('inout.txt', 'w')
text = str(getfromurl('http://www.site.it'))
pattern = '<a href="http:/'
start_link = text.find(pattern)
dumbwork(start_link, 0, text, pattern, 0)
t.close()

Solution
- You're taking for granted that a link will always look like '<a href="http:/'. That exact pattern misses, among others, https links, single-quoted href attributes and relative links.
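A small demonstration, in the question's own Python 2, of link forms that the exact-match pattern silently skips; the sample anchor tags are made up for illustration:

samples = [
    '<a href="https://example.com">',   # scheme is https, so 'http:/' never matches
    "<a href='http://example.com'>",    # single-quoted href attribute
    '<a href="/relative/path">',        # relative link, no scheme at all
]
pattern = '<a href="http:/'
for sample in samples:
    print sample, '->', sample.find(pattern)  # prints -1 for every sample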
- Naming:
  - Usually a row is related to a database, while a line is related to a text file.
  - temp means nothing; it's the new content, so you should use something like new_html_content.
  - It takes a bit to understand that counter is actually the max depth that you want to follow, so why not call it depth?
  - Function names should explain what they do; dumbwork doesn't, so something like recurse_page may be better.
  - start_link is good for the first link (almost, see below), but the parameter to the function is actually the current link being parsed, so why not call it current_link?
  - You used snake case for start_link; you should keep using it, so get_from_url may be better than getfromurl.
  - start_link, start_url and end_url are not links or URLs; they're actually indices into the string, so they should be start_link_index, start_url_index and end_url_index.
  - text is the HTML content, so just rename it to html_content.
- The lines doing something with row should be next to each other, or better yet, in a separate function.
- That 2 should be in a constant so that the first line of the function can be something like if depth < MAX_DEPTH: (a minimal sketch follows).
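A minimal sketch of that change, assuming a module-level constant (the name MAX_DEPTH matches the worker loop shown at the end):

MAX_DEPTH = 2  # maximum recursion depth to follow

def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        # ... rest of the worker loop ...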
- You're trapping exceptions but you're not doing anything with them; you should at least log somewhere what happened.
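One hedged way to do that with the standard logging module (still Python 2 syntax; the helper name fetch_status is hypothetical):

import logging
import urllib

def fetch_status(url):
    # Return the HTTP status code, or None on failure; the failure is
    # logged with its traceback instead of being silently swallowed.
    try:
        return urllib.urlopen(url).getcode()
    except Exception:
        logging.exception('Failed to fetch %s', url)
        return None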
- The text.find calls to get the URL are probably better off in a separate function, to improve readability, something like the get_url helper sketched below.
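A possible sketch of that helper, reusing the question's own find logic with the renamed index variables (get_url is the name the worker loop below assumes):

def get_url(html_content, start_link_index):
    # Locate the '/' that follows the matched pattern, then read up to
    # the closing double quote of the href attribute.
    start_url_index = html_content.find('/', start_link_index)
    end_url_index = html_content.find('"', start_url_index + 1)
    return 'http:/' + html_content[start_url_index + 1 : end_url_index]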
- getfromurl already returns a string, no need for the str().
- You're always using the same name for the file which, when opened with w, will overwrite the contents. You should at least check if the file already exists.
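A minimal sketch of such a guard, assuming the question's inout.txt path; how you react to an existing file (rename, abort, append) is up to you:

import os

if os.path.exists('inout.txt'):
    raise IOError('inout.txt already exists, refusing to overwrite it')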
- You're opening a file and leaving it open for the whole duration of the process. This is not bad in itself, but I'd probably write a function called append_to_file where I open the file with a instead of w, write the line and immediately close it. Inside that function you would also convert the status code to a string.
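A minimal sketch of that helper, again assuming the inout.txt output file (the signature matches the call in the worker loop below):

def append_to_file(url, status_code):
    # Open in append mode, write a single line, and close right away so
    # the handle never stays open between writes.
    out = open('inout.txt', 'a')
    out.write(url + ', ' + str(status_code) + '\n')
    out.close()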
In the end, your worker loop may look something like this:

def recurse_page(current_link_index, start_index, html_content, pattern, depth):
    if depth < MAX_DEPTH:
        depth += 1
        while current_link_index > -1:
            try:
                url = get_url(html_content, start_index)
                append_to_file(url, urllib.urlopen(url).getcode())
                new_html_content = get_from_url(url)
                recurse_page(new_html_content.find(pattern), 0, new_html_content, pattern, depth)
                current_link_index = html_content.find(pattern, end_url_index + 1)
            except Exception, e:
                # TODO: Proper error handling
                break

It's not complete code, but it should give you an idea of what I mean.
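For completeness, a hypothetical top-level driver that wires these sketches together, mirroring the question's main block with the renamed identifiers (the start URL is the question's own placeholder):

html_content = get_from_url('http://www.site.it')
pattern = '<a href="http:/'
recurse_page(html_content.find(pattern), 0, html_content, pattern, 0)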
Context
StackExchange Code Review Q#135567, answer score: 4