
Reading a .gz file

Submitted by: @import:stackexchange-codereview

Problem

I am using Python 2.6.5 and am trying to find the fastest way to print out the contents of a .gz file. It's my understanding that prior to v2.5, zcat was much faster than gzip (see here). I guess that has changed, at least according to a comment in that post? I have decompressed a 2.4MB .gz file three ways, and they all seem to take about 17 minutes. Is there a faster way?

This takes 17 minutes in Python:

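import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
f = open('/2.4MB.gz', 'rb')
buffer = f.read(1024)
while buffer:
    # Decompress and print one 1 KB chunk of compressed input at a time
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(1024)
# Emit whatever is still held in the decompressor
outstr = d.flush()
print(outstr)
f.close()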


This also takes 17 minutes:

import gzip

# Read and decompress the whole file in one call
f = gzip.open('/2.4MB.gz', 'rb')
file_content = f.read()
print file_content
f.close()


Again, 17 minutes:

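from subprocess import Popen, PIPE

def gziplines(fname):
    # Let the external zcat decompress and stream its stdout line by line
    f = Popen(['zcat', fname], stdout=PIPE)
    for line in f.stdout:
        yield line

fname = '/2.4MB.gz'
for line in gziplines(fname):
    # Trailing comma: the line already ends with a newline
    print line,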


My eventual goal is to take the contents of the .gz file and dump them directly into a MySQL database without printing the lines. The unzipped file is 15.8MB.

When I gunzip the file and then use the CSV module to print out the contents to the screen, it takes 1 minute, versus 17 minutes for the methods above. Printing in and of itself doesn't seem to be the problem.

Solution

As it stands, you are measuring the speed of printing. Printing is one of the slowest things your program will ever have to do. Take the prints out and remeasure to find out how fast the decompression itself is.
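
One way to separate the two costs is to time the decompression on its own and then time the output step on data that is already in memory. The following is only a sketch: the helper names time_decompress and time_write are made up for illustration, and the explicit try/finally is used instead of a with block so the same code should also run on older interpreters such as 2.6.

```python
import gzip
import time


def time_decompress(path):
    """Return (seconds, byte_count) for decompressing path with no printing."""
    start = time.time()
    f = gzip.open(path, 'rb')
    try:
        data = f.read()  # decompress the whole file into memory
    finally:
        f.close()
    return time.time() - start, len(data)


def time_write(data, stream):
    """Return the seconds spent writing already-decompressed data to stream."""
    start = time.time()
    stream.write(data)
    stream.flush()
    return time.time() - start
```

Calling time_decompress('/2.4MB.gz') and comparing its result with time_write on the same data and the real output stream should show which step accounts for the 17 minutes.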

Your middle method (the one using gzip.open) will almost certainly be the fastest.
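
Since the stated goal is to load the contents into a database rather than print them, the gzip.open approach could be extended roughly as follows. This is a sketch under several assumptions: it is written for Python 3 (text-mode gzip.open is not available on 2.6), it uses the standard-library sqlite3 module as a stand-in for MySQL, and the function name load_gz_csv, the table name rows, and its three columns are all hypothetical.

```python
import csv
import gzip
import sqlite3


def load_gz_csv(path, conn, batch_size=1000):
    """Stream rows from a gzipped CSV into the table `rows` in batches.

    Assumes a hypothetical three-column table named `rows` already exists.
    Returns the number of rows inserted.
    """
    sql = 'INSERT INTO rows VALUES (?, ?, ?)'
    inserted = 0
    batch = []
    # 'rt' decompresses and decodes on the fly, so the file is never
    # fully unpacked on disk or printed to the screen.
    with gzip.open(path, 'rt', newline='') as f:
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                conn.executemany(sql, batch)
                inserted += len(batch)
                batch = []
    if batch:
        # Insert the final, partially filled batch
        conn.executemany(sql, batch)
        inserted += len(batch)
    conn.commit()
    return inserted
```

With a real MySQL driver, the connection object and the placeholder syntax would differ, but the idea of reading lines straight from the compressed file and inserting them in batches stays the same.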

Context

StackExchange Code Review Q#9492, answer score: 6
