patternpythonMinor
Writing a hiveserver2 query result to CSV with Python using pyhs2
Viewed 0 times
resultwithquerywritingcsvhiveserver2pythonusingpyhs2
Problem
I am trying to create a small application that will help me extract data from Hadoop via hiveserver2 by simply writing it to CSV. At the moment, it takes about 40 seconds to pull 10,000 records from a hs2 table consisting of 500k rows with 203 columns. I would like to think that there is a faster, more efficient way of doing this than the way I am doing it now. Ideally I would like to eventually be able to pull and write 1 million rows per minute. It doesn't necessarily have to be written to CSV. It probably isn't feasible with my current skill level, but I like to set goals for myself.
import pyhs2
import time
import csv
csv_out = open('data.csv', 'wb')
mywriter = csv.writer(csv_out)
def generic_user():
gu = 'xxxxxx'
return gu
def password():
pw = 'xxxxxxx'
return pw
with pyhs2.connect(host='xxxxxxxx',
port=10000,
authMechanism='PLAIN',
user=generic_user(),
password=password(),
database='xxxxxxxxxx') as conn:
with conn.cursor() as cur:
q = raw_input('Enter query: ').replace('csc', 'CSC')
print q
#timer start
start_time = time.time()
#Execute query
cur.execute(q)
col = []
for key in cur.getSchema():
col.append(key['columnName'])
header = []
header.append(col)
for rows in zip(header):
mywriter.writerows(rows)
records = cur.fetch()
# print records
for rows in zip(records):
mywriter.writerows(rows)
pull_time = time.time() - start_time
print pull_timeSolution
Speed improvements
Using a
to this:
As far as I can tell, your header is actually pointless. You're making it just a single item list that then gets iterated over to write to your file. This seems to be the same as just
Your use of
Other notes
You use context managers (
It's extra safety for your file the same way you have with the connections.
I'd also recommending refactoring your code into discrete functions rather than just one long script. It's better for readability and debugging purposes. Plus it makes it easier to change or reuse this code later.
Using a
for loop to build a list is quite slow, when Python has 'list comprehensions'. They're for loop like expressions that build lists. So your col could be changed from this:for key in cur.getSchema():
col.append(key['columnName'])to this:
col = [key['columnName'] for key in cur.getSchema()]As far as I can tell, your header is actually pointless. You're making it just a single item list that then gets iterated over to write to your file. This seems to be the same as just
mywriter.writerows(col).Your use of
zip is confusing. zip is used to join two or more lists when iterating. It's not normally used for a single list like you have, and doesn't really do much for you unless I'm missing something. It's a redundant drain on your time.Other notes
You use context managers (
with x as y) for the connection which is great, but you should do the same for your csv_out.with open('data.csv', 'wb') as csv_out:It's extra safety for your file the same way you have with the connections.
I'd also recommending refactoring your code into discrete functions rather than just one long script. It's better for readability and debugging purposes. Plus it makes it easier to change or reuse this code later.
Code Snippets
for key in cur.getSchema():
col.append(key['columnName'])col = [key['columnName'] for key in cur.getSchema()]with open('data.csv', 'wb') as csv_out:Context
StackExchange Code Review Q#129201, answer score: 2
Revisions (0)
No revisions yet.