HiveBrain v1.2.0
pattern · python · Minor

Improving gzip function for huge files

Submitted by: @import:stackexchange-codereview

Problem

I have created a Python system that runs Linux core files through the crash debugger with some Python extensions. This all works fine, but one part of it is problematic.

These files are sent to the system in gzip format and consist of a single huge data file. The compressed file can often be as big as 20G. The unzipping works, but it is very slow and uses huge amounts of memory. As an example, last night the system processed a 14G gzip file: it took 9.2 hours to uncompress (60G uncompressed), and memory utilisation hovered around 30G, peaking at 60G.

I am starting to think my code is the cause.

def chk_gzip_file(FILE):
    logger.info ("Will write uncompressed file to: "+ COREDIR)   
    if os.path.isdir(FILE) == False:
        inF = gzip.open(FILE, 'rb')
        s = inF.read()
        inF.close()
        gzip_fname = os.path.basename(FILE)
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(COREDIR, fname)
        open(uncompressed_path, 'w').write(s)
        uncompressedfile=COREDIR+"/"+fname
        return uncompressedfile
    else: 
        logger.critical ("No gz file found  : " + FILE) 
        sys.exit()


I am not a programmer, so I imagine this is fairly poor code. Can it be improved for huge files? I know that speed will remain an issue, as gzip decompression is single-threaded.

Solution

inF = gzip.open(FILE, 'rb')
s = inF.read()
inF.close()


That reads the whole uncompressed data into memory. Of course it takes 60GB of memory.
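The fix is to stream the data in fixed-size chunks instead of one giant `read()`, so memory use stays near the chunk size no matter how large the file is. A minimal, self-contained sketch of that idea (the file names are illustrative):

```python
import gzip

# Create a small sample .gz file so the sketch is self-contained.
with gzip.open('file.txt.gz', 'wb') as f:
    f.write(b'example data\n' * 100)

CHUNK = 1024 * 1024  # read at most 1 MiB at a time, so memory stays near this bound
with gzip.open('file.txt.gz', 'rb') as f_in, open('file.txt', 'wb') as f_out:
    while True:
        block = f_in.read(CHUNK)
        if not block:
            break
        f_out.write(block)
```

The standard library already provides this loop as `shutil.copyfileobj`, used below.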

Looking at the documentation for gzip, it has this example of compressing a file:

import gzip
import shutil
with open('file.txt', 'rb') as f_in, gzip.open('file.txt.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)


If you switch that round to:

import gzip
import shutil
with open('file.txt', 'wb') as f_out, gzip.open('file.txt.gz', 'rb') as f_in:
    shutil.copyfileobj(f_in, f_out)


then I think you'll find the memory usage is much lower.
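Putting that into the asker's function might look something like the sketch below. `COREDIR` and the error handling are stand-ins for the originals; note it also opens the output in binary mode (`'wb'`), since the decompressed data is bytes, where the original used text mode (`'w'`):

```python
import gzip
import os
import shutil
import sys

COREDIR = "/tmp"  # assumed destination directory; the original takes this from module scope

def chk_gzip_file(path):
    """Stream-decompress a .gz file into COREDIR without loading it into memory."""
    if os.path.isdir(path):
        # Mirrors the original's error path
        sys.exit("No gz file found: " + path)
    fname = os.path.basename(path)[:-3]  # drop the ".gz" suffix
    uncompressed_path = os.path.join(COREDIR, fname)
    # copyfileobj copies in fixed-size chunks, so memory use stays flat
    with gzip.open(path, 'rb') as f_in, open(uncompressed_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return uncompressed_path
```

This keeps the function's interface (it still returns the uncompressed path) while replacing the whole-file `read()` with the streaming copy.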


Context

StackExchange Code Review Q#156005, answer score: 9
