
Process zip files


Problem

I have a zip bundle, for example abcd.zip, that contains more zips such as 1.zip, 2.zip, etc. Inside each child zip there is a .jpg file such as 1.jpg, 2.jpg, etc. The child zips contain many other files, but I need only the .jpg.

I need to extract each .jpg and zip it again under the same name as its parent, like 1.zip.

This works fine, but I wanted to know if I can make it faster. There will be approximately 30,000 zips to process.

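import os
import zipfile as zp

# Assumption: `mode` was not defined in the original snippet; a compression
# constant such as zp.ZIP_DEFLATED is presumably what was intended.
mode = zp.ZIP_DEFLATED

def fjpeg(file):
    # Name of the .jpg expected inside a child zip, e.g. "1.zip" -> "1.jpg"
    base = os.path.basename(file)
    jp = base[:-4]+".jpg"
    return jp

def process(bundle):
    z1 = zp.ZipFile(bundle, 'r')
    for z1file in z1.namelist():
        if z1file[-4:] == '.zip':
            # Extract the child zip to tmp/, then extract its .jpg into the cwd
            z2 = zp.ZipFile(z1.extract(z1file, "tmp"), 'r')
            z3 = os.path.basename(z2.extract(fjpeg(z1file)))
            process_path = "processed" + os.path.sep + os.path.basename(z1file)
            # Write the .jpg into a new zip named after the child zip
            with zp.ZipFile(process_path, 'w', mode) as final:
                final.write(z3)
            z2.close()
            # Remove the temporary files created on disk
            os.unlink(os.path.join("tmp", z1file))
            os.unlink(z3)
        else:
            continue
    z1.close()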

Solution

It is not necessary to create temporary files on disk, as zipfile.ZipFile can work in-memory.

  • Use a cStringIO.StringIO instance (io.BytesIO in Python 3) to hold a zip file in memory.
  • Use ZipFile.read to read a jpeg file into a str variable (bytes in Python 3).
  • Use ZipFile.writestr to write the jpeg back.

This Stack Overflow question may be useful to you: Unzip nested zip files in python. As mentioned there, decompressing a zip file requires random access to the archive. If the "bundle" zip stores its contents uncompressed (which is an option), it should in theory be possible to have random access into the child zips without copying them out first.
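
The in-memory steps listed above could look roughly like the sketch below. It assumes Python 3 (so io.BytesIO stands in for cStringIO.StringIO), assumes each child zip holds its .jpg at the top level under the child's own base name, as in the question, and assumes the question's undefined `mode` was a compression constant. The names extract_jpeg_zip, process_in_memory, out_dir and compression are made up for illustration.

```python
import io
import os
import zipfile


def extract_jpeg_zip(child_name, child_bytes, compression=zipfile.ZIP_DEFLATED):
    """Return the bytes of a new zip holding only the child's .jpg, or None.

    child_name is the child zip's name inside the bundle (e.g. "1.zip") and
    child_bytes is its raw content. Nothing is written to disk here.
    """
    jpeg_name = os.path.splitext(os.path.basename(child_name))[0] + ".jpg"
    # Open the child zip directly from memory.
    with zipfile.ZipFile(io.BytesIO(child_bytes)) as child:
        if jpeg_name not in child.namelist():
            return None  # assumption: silently skip children without the .jpg
        jpeg_data = child.read(jpeg_name)
    # Build the output zip in memory and write the jpeg back with writestr.
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression) as final:
        final.writestr(jpeg_name, jpeg_data)
    return out.getvalue()


def process_in_memory(bundle, out_dir="processed"):
    """Rewrite every child zip in `bundle` as out_dir/<child>.zip with only its .jpg."""
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(bundle) as outer:
        for name in outer.namelist():
            if not name.lower().endswith(".zip"):
                continue
            result = extract_jpeg_zip(name, outer.read(name))
            if result is None:
                continue
            # The only disk write is the final, already-assembled zip.
            with open(os.path.join(out_dir, os.path.basename(name)), "wb") as f:
                f.write(result)
```

Calling process_in_memory("abcd.zip") would then replace the question's process("abcd.zip"). Since JPEG data is already compressed, passing zipfile.ZIP_STORED as compression may save some CPU time; whether it helps in practice would need measuring on the real 30,000 files.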

Context

StackExchange Code Review Q#73683, answer score: 2
