Python 3 Program to download Homestuck

Submitted by: @import:stackexchange-codereview

Problem

I've been working on a program over the last few days that should download a range of pages from the webcomic Homestuck. I've created a working version in Python 3, but it is horribly inefficient. Can anyone see ways to improve and shorten this code?

```
import urllib.request
range1 = int(input("Enter the 1st page you want: "))
range2 = int(input("Enter the last page you want: ")) + 1
current = range1 + 1900
final = range2 + 1900
page = ''
nextPage = ''
while current != final:
    page = str(current)
    nextPage = str(current+1)
    while len(page) != 6:
        page = '0'+ page
    while len(nextPage) != 6:
        nextPage = '0'+ nextPage
    html = 'http://www.mspaintadventures.com/?s=6&p='+page
    site = urllib.request.urlopen(html)
    s = site.read()
    s = s.decode("utf8")
    s = s.replace("", "")
    s = s.replace("http://cdn.mspaintadventures.com/storyfiles/hs2/", "")
    s = s.replace("?s=6&p=" + str(nextPage), str(int(nextPage))+".html")
    s = s.replace(page+"/"+page, page)
    a,b,c = s.split('')
    b = " Page " + page + "" + b
    t = open(str(current)+".html", 'w+')
    t.write(b)
    t.close()
    page = str((int(page)-1900))
    while len(page) != 5:
        page = '0'+ page

    t = open(str(current)+".html", 'a')
    swfname=page+".swf"
    t.write(" ")
    t.write("")
    t.write("")
    t.write("")
    t.close()
    try:
        img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+".gif"
        urllib.request.urlretrieve(img, page+".gif")
    except:
        try:
            img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"_1.gif"
            urllib.request.urlretrieve(img, page+"_1.gif")
            img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"_2.gif"
            urllib.request.urlretrieve(img, page+"_2.gif")
        except:
            try:
                img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"/"+page+".swf"
                urllib.request.urlretrieve(img, page+".swf")
```

Solution

The main performance issue is the blocking nature of your script: it does not start on the next URL until it has finished with the current one. Consider asynchronous tools such as the Scrapy web-scraping framework, which is built on Twisted, or something like grequests.
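
To picture the non-blocking idea without pulling in Scrapy or grequests, here is a minimal sketch that swaps in the standard library's concurrent.futures thread pool instead. The names page_url, fetch and fetch_all are hypothetical, the URL pattern is copied from the question, and whether the site tolerates parallel requests is an assumption you would need to check.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# URL pattern taken from the question; {:06d} zero-pads to six digits.
BASE_URL = "http://www.mspaintadventures.com/?s=6&p={:06d}"


def page_url(page_number):
    """Build the URL for a comic page; the question offsets page numbers by 1900."""
    return BASE_URL.format(page_number + 1900)


def fetch(url):
    """Download one URL and return its body as text (this call blocks)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf8")


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch several URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Something like fetch_all([page_url(n) for n in range(first, last + 1)]), with first and last read from the user, would then download the whole range in parallel. A thread pool fits here because the work is dominated by waiting on the network rather than by computation.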

Other notes:

  • If you stick with a synchronous approach, switch to requests: create a session (requests.Session) once and reuse it for every page. This should be faster than urllib.request, since the session can reuse the underlying connection.

  • When you generate the HTML files, define a template with placeholders up front and render it for each page by filling in the placeholders. You could use a template engine such as Mako or Jinja2, or the built-in str.format().

  • It also looks like you open each generated file twice: once in 'w+' mode for the initial write and again in 'a' mode to append. Build the full content first and write it in a single pass.
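
The three notes above might fit together roughly like this. It is a sketch, not the original poster's code: PAGE_TEMPLATE is a made-up placeholder layout (the real markup the script writes is not recoverable from the question), and fetch_body, render_page and save_page are hypothetical helper names. The session argument is assumed to be a requests.Session created once by the caller.

```python
from pathlib import Path

# Hypothetical template; the real page markup is not shown in the question.
PAGE_TEMPLATE = (
    "<html><head><title>Page {page}</title></head>\n"
    "<body>\n{body}\n<img src=\"{image}\">\n</body></html>\n"
)


def fetch_body(session, url):
    """Fetch a page through a shared session (assumed to be a requests.Session)."""
    response = session.get(url)
    response.raise_for_status()
    return response.text


def render_page(page, body):
    """Fill the template placeholders with str.format()."""
    image = "{:05d}.gif".format(page)  # five-digit image name, as in the question
    return PAGE_TEMPLATE.format(page=page, body=body, image=image)


def save_page(directory, page, body):
    """Render everything first, then open the output file once and write it."""
    target = Path(directory) / "{}.html".format(page + 1900)
    with open(target, "w", encoding="utf8") as handle:
        handle.write(render_page(page, body))
    return target
```

With requests installed, the caller would create requests.Session() once and pass it to every fetch_body call, so the connection can be reused across pages. Rendering the whole page before opening the file is what removes the second open in append mode.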

Context

StackExchange Code Review Q#155246, answer score: 3
