patternpythonMinor
Inserting products scraped from the web into a database
Viewed 0 times
theintodatabasescrapedwebinsertingfromproducts
Problem
I am trying to speed up the runtime of my program as the API can make calls. However after adding the inserts to a database 3 mins turns in 7 mins after 180 API calls, which can be one a second. How can I further improve the inserts into the DB?
I am using mysql database, but potentially could change if it will increase performance
I am using mysql database, but potentially could change if it will increase performance
url = getSignedUrl(params)
resp = requests.get(url)
responseSoup=BeautifulSoup(resp.text)
#EXTRACT FROM XML
quantity = ['' if product.amount is None else product.amount.text for product in responseSoup.findAll("offersummary")]
price = ['' if product.lowestnewprice is None else product.lowestnewprice.formattedprice.text for product in responseSoup.findAll("offersummary")]
prime = ['' if product.iseligibleforprime is None else product.iseligibleforprime.text for product in responseSoup("offer")]
#UPDATE DB
for zz in zip(asins.split(","), price,quantity,prime):
cursor = conn.cursor()
if zz[1] == "Too low to display":
print "Scraping..."
zz = scrapeTooLow(zz[0])
dat = [zz[0], zz[1], zz[2], zz[3]]
if zz[1] != "":
priceFormat = float(zz[1].replace("$", "").replace(",", ""))
mep = round(priceFormat*1.35, 2)
dp = round(priceFormat*1.38, 2)
fp = round(priceFormat*1.17, 2)
else:
priceFormat = ""
mep = ""
dp = ""
fp = ""
dat.extend([mep,dp,fp])
print dat
try:
cursor.execute("""INSERT INTO data VALUES (%s,%s, %s, %s, %s, %s, %s)""",dat)
conn.commit()
except:
conn.rollback()
print count, iSolution
For performance questions you should really take a profiler to it and
make sure that the Python code isn't accidentally the culprit first.
The numbers you quote for this sound way too high, though of course we
can't see the actual data.
Now for the code.
Even though this is running Python 2 it should still be made compatible
with Python 3, in particular:
PEP8 regarding naming,
because Python names should be consistenly lower case with
underscores - since your names are at least consistent that is less of
an issue.
assuming this not part of a bigger script where such a block is
already in place.
And finally as has been said, it might make sense to group inserts into
larger batches unless you require the transactional behaviour for some
reason, or you can't simply rerun a batch.
Also take a look at
this Stackoverflow post
to deal with the insert statement a bit better.
That's it. I'm afraid without a bit more information about the data
structures it's harder to infer how it should be structured instead.
At this I point I have the following for the modified code:
make sure that the Python code isn't accidentally the culprit first.
The numbers you quote for this sound way too high, though of course we
can't see the actual data.
Now for the code.
Even though this is running Python 2 it should still be made compatible
with Python 3, in particular:
- Use
printas a function, that is,print(...), for consistency.
- Avoid
zipfor big lists, instead useizipfromitertools.
- Use consistent syntax, also read
PEP8 regarding naming,
because Python names should be consistenly lower case with
underscores - since your names are at least consistent that is less of
an issue.
- A program should have a
if __name__ == "__main__": ...block,
assuming this not part of a bigger script where such a block is
already in place.
Apart from that more generally:
- Use more descriptive names.
zz, dp, fp, mep, dat tell the
reader absolutely nothing about the meaning.
- Random float literals should be named, or create a separate
- Use functions to reuse functionality and to decompose the whole
program into clear steps.
- Related things should be place together, e.g. the
cursor call can be
moved to the end where the database functionality is located instead
of at the start of the loop. Btw. are you sure that you acquire a new
transaction automatically? It would help to know which particular
library is being used.
Also:
- The
replace call can be done in one step, "$,".replace("", "").
-
The XML extraction looks pretty verbose. Unfortunately I'm not so
sure about the types, though I'd imagine this could be written a bit
more clearly with at least a helper function for two of the steps,
quantity and prime:
def empty_or_text(thing):
return '' if thing is None else thing.text
It also makes sense to cache things instead of recomputing it all over
again, in this case that applies to the findAll calls for
"offersummary".
dat is simply zz[0:4].
- Since strings are immutable the assignment of the empty string can be
done in one line, e.g. a = b = ""`.And finally as has been said, it might make sense to group inserts into
larger batches unless you require the transactional behaviour for some
reason, or you can't simply rerun a batch.
Also take a look at
this Stackoverflow post
to deal with the insert statement a bit better.
That's it. I'm afraid without a bit more information about the data
structures it's harder to infer how it should be structured instead.
At this I point I have the following for the modified code:
url = getSignedUrl(params)
resp = requests.get(url)
responseSoup = BeautifulSoup(resp.text)
def empty_or_text(thing):
return '' if thing is None else thing.text
# EXTRACT FROM XML
offerSummaries = responseSoup.findAll("offersummary")
quantity = [empty_or_text(product.amount)
for product in offerSummaries]
prime = [empty_or_text(product.iseligibleforprime)
for product in responseSoup("offer")]
price = [product.lowestnewprice is None else product.lowestnewprice.formattedprice.text
for product in offerSummaries]
from itertools import izip
# UPDATE DB
for zz in izip(asins.split(","), price, quantity, prime):
if zz[1] == "Too low to display":
print("Scraping...")
zz = scrapeTooLow(zz[0])
priceFormat = mep = dp = fp = ""
if zz[1] != "":
priceFormat = float(zz[1].replace("$,", ""))
mep = round(priceFormat * 1.35, 2)
dp = round(priceFormat * 1.38, 2)
fp = round(priceFormat * 1.17, 2)
dat = zz[0:4] + [mep, dp, fp]
print(dat)
try:
conn.cursor().execute("INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)", dat)
conn.commit()
except:
conn.rollback()
print(count, i)Code Snippets
def empty_or_text(thing):
return '' if thing is None else thing.texturl = getSignedUrl(params)
resp = requests.get(url)
responseSoup = BeautifulSoup(resp.text)
def empty_or_text(thing):
return '' if thing is None else thing.text
# EXTRACT FROM XML
offerSummaries = responseSoup.findAll("offersummary")
quantity = [empty_or_text(product.amount)
for product in offerSummaries]
prime = [empty_or_text(product.iseligibleforprime)
for product in responseSoup("offer")]
price = [product.lowestnewprice is None else product.lowestnewprice.formattedprice.text
for product in offerSummaries]
from itertools import izip
# UPDATE DB
for zz in izip(asins.split(","), price, quantity, prime):
if zz[1] == "Too low to display":
print("Scraping...")
zz = scrapeTooLow(zz[0])
priceFormat = mep = dp = fp = ""
if zz[1] != "":
priceFormat = float(zz[1].replace("$,", ""))
mep = round(priceFormat * 1.35, 2)
dp = round(priceFormat * 1.38, 2)
fp = round(priceFormat * 1.17, 2)
dat = zz[0:4] + [mep, dp, fp]
print(dat)
try:
conn.cursor().execute("INSERT INTO data VALUES (%s, %s, %s, %s, %s, %s, %s)", dat)
conn.commit()
except:
conn.rollback()
print(count, i)Context
StackExchange Code Review Q#117323, answer score: 2
Revisions (0)
No revisions yet.