HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Inserting products scraped from the web into a database

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
theintodatabasescrapedwebinsertingfromproducts

Problem

I am trying to speed up the runtime of my program as the API can make calls. However after adding the inserts to a database 3 mins turns in 7 mins after 180 API calls, which can be one a second. How can I further improve the inserts into the DB?

I am using mysql database, but potentially could change if it will increase performance

url = getSignedUrl(params)
resp = requests.get(url)
responseSoup=BeautifulSoup(resp.text)

#EXTRACT FROM XML
quantity = ['' if product.amount is None else product.amount.text for product in responseSoup.findAll("offersummary")]
price = ['' if product.lowestnewprice is None else product.lowestnewprice.formattedprice.text for product in responseSoup.findAll("offersummary")]
prime = ['' if product.iseligibleforprime is None else product.iseligibleforprime.text for product in responseSoup("offer")]

#UPDATE DB
for zz in zip(asins.split(","), price,quantity,prime):
    cursor = conn.cursor()

    if zz[1] == "Too low to display":
        print "Scraping..."
        zz = scrapeTooLow(zz[0])

    dat = [zz[0], zz[1], zz[2], zz[3]]
    if zz[1] != "":
        priceFormat = float(zz[1].replace("$", "").replace(",", ""))
        mep = round(priceFormat*1.35, 2)
        dp = round(priceFormat*1.38, 2)
        fp = round(priceFormat*1.17, 2)
    else:
        priceFormat = ""
        mep = ""
        dp = ""
        fp = ""
    dat.extend([mep,dp,fp])
    print dat 
    try:
        cursor.execute("""INSERT INTO data VALUES (%s,%s, %s, %s, %s, %s, %s)""",dat)   
        conn.commit()
    except:
        conn.rollback()
print count, i

Solution

For performance questions you should really take a profiler to it and
make sure that the Python code isn't accidentally the culprit first.

The numbers you quote for this sound way too high, though of course we
can't see the actual data.

Now for the code.

Even though this is running Python 2 it should still be made compatible
with Python 3, in particular:

  • Use print as a function, that is, print(...), for consistency.



  • Avoid zip for big lists, instead use izip from itertools.



  • Use consistent syntax, also read


PEP8 regarding naming,
because Python names should be consistenly lower case with
underscores - since your names are at least consistent that is less of
an issue.

  • A program should have a if __name__ == "__main__": ... block,


assuming this not part of a bigger script where such a block is
already in place.

Apart from that more generally:

  • Use more descriptive names. zz, dp, fp, mep, dat tell the


reader absolutely nothing about the meaning.

  • Random float literals should be named, or create a separate



  • Use functions to reuse functionality and to decompose the whole


program into clear steps.

  • Related things should be place together, e.g. the cursor call can be


moved to the end where the database functionality is located instead
of at the start of the loop. Btw. are you sure that you acquire a new
transaction automatically? It would help to know which particular
library is being used.

Also:

  • The replace call can be done in one step, "$,".replace("", "").



-
The XML extraction looks pretty verbose. Unfortunately I'm not so
sure about the types, though I'd imagine this could be written a bit
more clearly with at least a helper function for two of the steps,
quantity and prime:

def empty_or_text(thing):
    return '' if thing is None else thing.text


It also makes sense to cache things instead of recomputing it all over
again, in this case that applies to the
findAll calls for
"offersummary".

  • dat is simply zz[0:4].



  • Since strings are immutable the assignment of the empty string can be


done in one line, e.g.
a = b = ""`.

And finally as has been said, it might make sense to group inserts into
larger batches unless you require the transactional behaviour for some
reason, or you can't simply rerun a batch.

Also take a look at
this Stackoverflow post
to deal with the insert statement a bit better.

That's it. I'm afraid without a bit more information about the data
structures it's harder to infer how it should be structured instead.

At this I point I have the following for the modified code:

url = getSignedUrl(params)
resp = requests.get(url)
responseSoup = BeautifulSoup(resp.text)

def empty_or_text(thing):
    return '' if thing is None else thing.text

# EXTRACT FROM XML
offerSummaries = responseSoup.findAll("offersummary")

quantity = [empty_or_text(product.amount)
            for product in offerSummaries]

prime = [empty_or_text(product.iseligibleforprime)
         for product in responseSoup("offer")]

price = [product.lowestnewprice is None else product.lowestnewprice.formattedprice.text
         for product in offerSummaries]

from itertools import izip

# UPDATE DB
for zz in izip(asins.split(","), price, quantity, prime):
    if zz[1] == "Too low to display":
        print("Scraping...")
        zz = scrapeTooLow(zz[0])

    priceFormat = mep = dp = fp = ""
    if zz[1] != "":
        priceFormat = float(zz[1].replace("$,", ""))
        mep = round(priceFormat * 1.35, 2)
        dp = round(priceFormat * 1.38, 2)
        fp = round(priceFormat * 1.17, 2)

    dat = zz[0:4] + [mep, dp, fp]
    print(dat)
    try:
        conn.cursor().execute("INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)", dat)
        conn.commit()
    except:
        conn.rollback()

print(count, i)

Code Snippets

def empty_or_text(thing):
    return '' if thing is None else thing.text
url = getSignedUrl(params)
resp = requests.get(url)
responseSoup = BeautifulSoup(resp.text)

def empty_or_text(thing):
    return '' if thing is None else thing.text

# EXTRACT FROM XML
offerSummaries = responseSoup.findAll("offersummary")

quantity = [empty_or_text(product.amount)
            for product in offerSummaries]

prime = [empty_or_text(product.iseligibleforprime)
         for product in responseSoup("offer")]

price = [product.lowestnewprice is None else product.lowestnewprice.formattedprice.text
         for product in offerSummaries]

from itertools import izip

# UPDATE DB
for zz in izip(asins.split(","), price, quantity, prime):
    if zz[1] == "Too low to display":
        print("Scraping...")
        zz = scrapeTooLow(zz[0])

    priceFormat = mep = dp = fp = ""
    if zz[1] != "":
        priceFormat = float(zz[1].replace("$,", ""))
        mep = round(priceFormat * 1.35, 2)
        dp = round(priceFormat * 1.38, 2)
        fp = round(priceFormat * 1.17, 2)

    dat = zz[0:4] + [mep, dp, fp]
    print(dat)
    try:
        conn.cursor().execute("INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)", dat)
        conn.commit()
    except:
        conn.rollback()

print(count, i)

Context

StackExchange Code Review Q#117323, answer score: 2

Revisions (0)

No revisions yet.