HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Slow web-scraping geolocator

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
geolocatorscrapingslowweb

Problem

How do I make my Python program faster?

I have 3 suspects right now for it being so slow:

-
Maybe my computer is just slow

-
Maybe my Internet is too slow (sometimes my program has to download the html of web pages and then it searches through the html for a specific piece of text)

-
My code is slow (too many loops maybe? something else? I'm new to this so I wouldn't know!)

My code uses lots of loops I think...ALSO, another thing is that for the program to work you have to be logged in to this website: http://www.locationary.com/

```
from urllib import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib
import urllib2
import webbrowser
import mechanize
import time
from difflib import SequenceMatcher
import os

def download(url):
s = urlopen(url).read()
if s[:2] == '\x1f\x8b': # assume it's gzipped data
with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
s = ifh.read()
return s

s = download('http://www.locationary.com/place/en/US/Utah/Provo-page5/?ACTION_TOKEN=NumericAction')

findLoc = re.compile('http://www\.locationary\.com/place/en/US/.{1,50}/.{1,50}/.{1,100}\.jsp')

findLocL = re.findall(findLoc,s)

W = []

X = []

XA = []

Y = []

YA = []

Z = []

ZA = []

for i in range(0,25):

b = download(findLocL[i])

findYP = re.compile('http://www\.yellowpages\.com/')

findYPL = re.findall(findYP,b)

findTitle = re.compile('(.*) \(\d{1,10}.{1,100}\)')

getTitle = re.findall(findTitle,b)

findAddress = re.compile('.{1,100}\((.*), .{4,14}, United States\)')

getAddress = re.findall(findAddress,b)

if not findYPL:

if not getTitle:

print ""

else:

W.append(findLocL[i])

b = download(findLocL[i])

if not getTitle:

print ""

else:

X.append(getAddress)

b = download(findLocL[i])

if not getTitle:

print ""

else:

Y.append(getTit

Solution

from urllib import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib
import urllib2
import webbrowser
import mechanize
import time
from difflib import SequenceMatcher
import os

def download(url):
    s = urlopen(url).read()
    if s[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
            s = ifh.read()
    return s


I'd probably pick a more descriptive name then s

s = download('http://www.locationary.com/place/en/US/Utah/Provo-page5/?ACTION_TOKEN=NumericAction')

findLoc = re.compile('http://www\.locationary\.com/place/en/US/.{1,50}/.{1,50}/.{1,100}\.jsp')


For constants, its usually best practice to name them in ALL_CAPS and put them outside of the actual logic. The logic in your program should really be in a main function not at the module level.

findLocL = re.findall(findLoc,s)


Not a very descriptive name, perhaps you can come up with a better one.

W = []

X = []

XA = []

Y = []

YA = []

Z = []

ZA = []


Don't double-space your code. Also, these variables are either really badly name or shouldn't be seperate variables. Possibly both.

for i in range(0,25):


Why 25? Do you want everything in findLocL? Then you should use for loc in findLocL.

b = download(findLocL[i])

    findYP = re.compile('http://www\.yellowpages\.com/')

    findYPL = re.findall(findYP,b)


You don't have to call compile, you can just pass the regular expression as the first parameter. Compiling is only helpful if you use the compiled expression multiple times. Also, aren't you just searching for a simple string here? Why use regular expressions?

findTitle = re.compile('(.*) \(\d{1,10}.{1,100}\)')

    getTitle = re.findall(findTitle,b)

    findAddress = re.compile('.{1,100}\((.*), .{4,14}, United States\)')

    getAddress = re.findall(findAddress,b)

    if not findYPL:

        if not getTitle:


Why did you findall if you just endup checking for a single match?

print ""

        else:

            W.append(findLocL[i])


What the fried monkey is W? What you are searching for? You code is difficult to follow.

b = download(findLocL[i])

        if not getTitle:

            print ""

        else:

            X.append(getAddress)

        b = download(findLocL[i])


So you just downloaded this file a few lines before. Why are you downloading it again? And why don't you ever do anything with the b you get?

if not getTitle:

            print ""

        else:

            Y.append(getTitle)

sizeWXY = len(W)


Was that really helpful?

def XReplace(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)  
    XA.append(text)

def YReplace(text2, dic2):
    for k, l in dic2.iteritems():
        text2 = text2.replace(k, l)  
    YA.append(text2)


These functions do pretty much the same thing. The only difference is that they both modify global variables. You shouldn't really be modifying global variables anyways.

for d in range(0,sizeWXY):

    old = str(X[d])


So... sizeWXY is len(W) but here you are fetching from X, what the fried turkey?

reps = {' ':'-', ',':'', '\'':'', '[':'', ']':''}

    XReplace(old, reps)

    old2 = str(Y[d])

    YReplace(old2, reps)

count = 0

for e in range(0,sizeWXY):

    newYPL = "http://www.yellowpages.com/" + XA[e] + "/" + YA[e] + "?order=distance"

    v = download(newYPL)

    abc = str('\n<a href="')


Why are constructing a string from a string?

dfe = str('" class="no-tracks url "')

    findFinal = re.compile(abc + '(.*)' + dfe)

    getFinal = re.findall(findFinal, v)

    if not getFinal:

        W.remove(W[(e-count)])


Use del W[e-count]

X.remove(X[(e-count)])

        count = (count+1)


No need for those parens

else:

        for f in range(0,1):

            Z.append(getFinal[f])


So... that loop only executes once. Why is it a loop?

XA = []


Rather then reusing XA, I'd suggest working with a new list here.

for c in range(0,(len(X))):

    aGd = re.compile('(.*), .{1,50}')

    bGd = re.findall(aGd, str(X[c]))


bGd? seriously? What the fried turkey marshmallow duck is that supposed to be?

XA.append(bGd)

LenZ = len(Z)


Don't do this. You are probably doing this because if you don't in C, it'll recalculate the length constantly as you execute a loop. Python doesn't do that.

V = []

for i in range(0,(len(W))):


Here you do pass the length to range. But you don't need those parens around it.

if i == 0:

        countTwo = 0


Why don't you do this before the loop starts? That should be the same

gda = download(Z[i-(countTwo)])


No need for the parens around countTwo.

ab = str('"street-address">\n')

    cd = str('\n')

    ZAddress = re.compile(ab + '(.*)' + cd)

    ZAddress2 = re.findall(ZAddress, gda)

    for b in range(0,(len(ZAddress2))):


In python you s

Code Snippets

from urllib import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib
import urllib2
import webbrowser
import mechanize
import time
from difflib import SequenceMatcher
import os

def download(url):
    s = urlopen(url).read()
    if s[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
            s = ifh.read()
    return s
s = download('http://www.locationary.com/place/en/US/Utah/Provo-page5/?ACTION_TOKEN=NumericAction')

findLoc = re.compile('http://www\.locationary\.com/place/en/US/.{1,50}/.{1,50}/.{1,100}\.jsp')
findLocL = re.findall(findLoc,s)
W = []

X = []

XA = []

Y = []

YA = []

Z = []

ZA = []
for i in range(0,25):

Context

StackExchange Code Review Q#7136, answer score: 14

Revisions (0)

No revisions yet.