patternpythonMinor
Python 3 Program to download Homestuck
Viewed 0 times
downloadhomestuckprogrampython
Problem
I've been working for a program over the last few days which should download a range of pages from the webcomic Homestuck. I've created a working version in python 3, but it is horribly inefficient. Can anyone see ways to improve and shorten this code?
```
import urllib.request
range1 = int(input("Enter the 1st page you want: "))
range2 = int(input("Enter the last page you want: ")) + 1
current = range1 + 1900
final = range2 + 1900
page = ''
nextPage = ''
while current != final:
page = str(current)
nextPage = str(current+1)
while len(page) != 6:
page = '0'+ page
while len(nextPage) != 6:
nextPage = '0'+ nextPage
html = 'http://www.mspaintadventures.com/?s=6&p='+page
site = urllib.request.urlopen(html)
s = site.read()
s = s.decode("utf8")
s = s.replace("", "")
s = s.replace("http://cdn.mspaintadventures.com/storyfiles/hs2/", "")
s = s.replace("?s=6&p=" + str(nextPage), str(int(nextPage))+".html")
s = s.replace(page+"/"+page, page)
a,b,c = s.split('')
b = " Page " + page + "" + b
t = open(str(current)+".html", 'w+')
t.write(b)
t.close()
page = str((int(page)-1900))
while len(page) != 5:
page = '0'+ page
t = open(str(current)+".html", 'a')
swfname=page+".swf"
t.write(" ")
t.write("")
t.write("")
t.write("")
t.close()
try:
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+".gif"
urllib.request.urlretrieve(img, page+".gif")
except:
try:
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"_1.gif"
urllib.request.urlretrieve(img, page+"_1.gif")
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"_2.gif"
urllib.request.urlretrieve(img, page+"_2.gif")
except:
try:
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"/"+page+".swf"
urllib.request.urlretrieve(img, page+".swf")
```
import urllib.request
range1 = int(input("Enter the 1st page you want: "))
range2 = int(input("Enter the last page you want: ")) + 1
current = range1 + 1900
final = range2 + 1900
page = ''
nextPage = ''
while current != final:
page = str(current)
nextPage = str(current+1)
while len(page) != 6:
page = '0'+ page
while len(nextPage) != 6:
nextPage = '0'+ nextPage
html = 'http://www.mspaintadventures.com/?s=6&p='+page
site = urllib.request.urlopen(html)
s = site.read()
s = s.decode("utf8")
s = s.replace("", "")
s = s.replace("http://cdn.mspaintadventures.com/storyfiles/hs2/", "")
s = s.replace("?s=6&p=" + str(nextPage), str(int(nextPage))+".html")
s = s.replace(page+"/"+page, page)
a,b,c = s.split('')
b = " Page " + page + "" + b
t = open(str(current)+".html", 'w+')
t.write(b)
t.close()
page = str((int(page)-1900))
while len(page) != 5:
page = '0'+ page
t = open(str(current)+".html", 'a')
swfname=page+".swf"
t.write(" ")
t.write("")
t.write("")
t.write("")
t.close()
try:
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+".gif"
urllib.request.urlretrieve(img, page+".gif")
except:
try:
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"_1.gif"
urllib.request.urlretrieve(img, page+"_1.gif")
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"_2.gif"
urllib.request.urlretrieve(img, page+"_2.gif")
except:
try:
img = "http://cdn.mspaintadventures.com/storyfiles/hs2/"+page+"/"+page+".swf"
urllib.request.urlretrieve(img, page+".swf")
Solution
The main performance issue is the blocking nature of your script. You don't process the next url until you are done with the current. Think of using asynchronous tools like
Other notes:
Scrapy web-scraping framework which is based on twisted; or something like grequests.Other notes:
- if you would stick to synchronous approach, switch to
requests, initialize a session (requests.Session) once and reuse - this should be faster than usingurllib.request
- when you generate HTML files, pre-define a template with placeholders, render the template on the fly filling up the placeholders. You may use a template engine like
makoorJinja2, or use the built-instr.format()
- it also looks like you are reopening each of the generated files twice - once for the initial write in the
w+mode and then to append inamode
Context
StackExchange Code Review Q#155246, answer score: 3
Revisions (0)
No revisions yet.