How to optimize a pipeline for storing items in MySQL?
Problem
I am using the Scrapy framework to scrape data and dump the items into a MySQL database.
Here is my pipeline that inserts the output into MySQL, but it's taking a long time. Any suggestions on how to optimize this?
```
class MysqlOutputPipeline(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def connect(self):
        try:
            self.conn = MySQLdb.connect(
                host='some_host',
                user='user',
                passwd='pwd',
                db='my_db',
                port=22)
        except (AttributeError, MySQLdb.OperationalError), e:
            raise e

    def query(self, sql, params=()):
        try:
            cursor = self.conn.cursor()
            cursor.execute(sql, params)
        except (AttributeError, MySQLdb.OperationalError) as e:
            print 'exception generated during sql connection: ', e
            self.connect()
            cursor = self.conn.cursor()
            cursor.execute(sql, params)
        return cursor

    def spider_opened(self, spider):
        self.connect()

    def process_item(self, item, spider):
        # clean_name
        clean_name = ''.join(e for e in item['store'] if e.isalnum()).lower()
        # conditional insertion in store_meta
        sql = """SELECT * FROM store_meta WHERE clean_name = %s"""
        curr = self.query(sql, clean_name)
        if not curr.fetchone():
            sql = """INSERT INTO store_meta (clean_name) VALUES (%s)"""
            self.query(sql, clean_name)
            self.conn.commit()
        # getting clean_id
        sql = """SELECT clean_id FROM store_meta WHERE clean_name = %s"""
        curr = self.query(sql, clean_name)
        clean_id = curr.fetchone()
        # conditional insertion in all_stores
        sql = """SELECT * FROM all_stores WHERE store_name = %s"""
        curr = self.query(sql, item['store'])
        if not curr.fetchone():
            sql = """INSERT INTO all_stores (store_name,clean_id) VALUES (%s,%s)"""
            self.query(sql, (item['store'], clean_id[0]))
            self.conn.commit()
        # getting store_id
        sql =
```
Solution
It's hard to analyze the bottleneck without performance data showing where it occurs. Consider running the program in several configurations and comparing the results, i.e.:
- run the program as is
- run the spider only, with no db
This should tell you whether the problem is in your code, as you suspect. If not, you are crawling something that is slow to crawl. Your program can't control that, although you can look into why it's slow. Maybe you are hitting a tar pit (i.e. websites put in controls that allow normal users regular access but impede crawlers), which would affect how you crawl.
If it is your program:
- run the program as is
- run the spider with all your conditional code but no writes
My suspicion is that the writes are not the issue. You can then analyze your conditions and try to optimize them.
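One way to get that comparison without maintaining separate builds is to time the read and write stages inside a single run. The following is only a minimal sketch: `StageTimer` and the stage names `'read'` and `'write'` are hypothetical and not part of Scrapy or MySQLdb, and the `clock` parameter exists only so the timer can be exercised with a fake clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer(object):
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, clock=time.time):
        self.clock = clock                # injectable so it can be faked in tests
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of calls

    @contextmanager
    def measure(self, stage):
        start = self.clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raised.
            self.totals[stage] += self.clock() - start
            self.counts[stage] += 1

    def report(self):
        # (stage, seconds) pairs, slowest stage first.
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


timer = StageTimer()
with timer.measure('read'):
    time.sleep(0.01)  # stands in for one of the SELECT lookups
with timer.measure('write'):
    pass              # stands in for an INSERT followed by commit()

for stage, seconds in timer.report():
    print('%s: %.4fs over %d call(s)' % (stage, seconds, timer.counts[stage]))
```

In the pipeline, the `with timer.measure(...)` blocks would wrap the corresponding `self.query(...)` and `self.conn.commit()` calls, and the report could be printed when the spider closes.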
You are reading the db a lot before writing. Whether any of those reads can be cached depends on the data you are crawling. If it has any predictability (e.g. it appears you are crawling malls and stores, so I would assume a lot of predictability), then maybe save at least the most recent q/a for each condition in RAM so you don't need to go out to the db if you already have the answer. My guess is that caching just the most recent one will be sufficient, since the crawler most likely walks 'near' items next. But if that isn't enough, analyze the db you've already built from your slow runs. If 90% of the db results are for Macy's, then just cache Macy's condition results. Or you could put in more effort and write a full caching class.
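As a rough illustration of the caching idea, here is a small in-memory lookup cache. It is a sketch under assumptions, not the pipeline's actual code: `LookupCache`, `fetch_or_create_clean_id`, and `fake_table` are hypothetical names, and the fake table stands in for the `store_meta` SELECT/INSERT pair so the example runs without a database.

```python
class LookupCache(object):
    """Remembers key -> id answers so repeated items skip the database."""

    def __init__(self, fetch_or_create, max_size=10000):
        self.fetch_or_create = fetch_or_create  # callable that does the real db work
        self.max_size = max_size
        self.answers = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.answers:
            self.hits += 1
            return self.answers[key]
        self.misses += 1
        value = self.fetch_or_create(key)
        if len(self.answers) >= self.max_size:
            # Crude memory bound: start over. An LRU policy would be gentler.
            self.answers.clear()
        self.answers[key] = value
        return value


fake_table = {}  # stands in for the store_meta table


def fetch_or_create_clean_id(clean_name):
    # Stands in for "SELECT clean_id ..." followed by "INSERT ..." when missing.
    if clean_name not in fake_table:
        fake_table[clean_name] = len(fake_table) + 1
    return fake_table[clean_name]


cache = LookupCache(fetch_or_create_clean_id)
for name in ['macys', 'macys', 'gap', 'macys']:
    cache.get(name)

print('hits=%d misses=%d' % (cache.hits, cache.misses))
```

Only the misses reach `fetch_or_create`, so when the crawl repeats the same few stores the number of db round trips drops to roughly the number of distinct stores. This assumes the cached ids are not changed by another writer during the crawl.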
Context
StackExchange Code Review Q#15763, answer score: 3