HiveBrain v1.2.0
pattern · python · Minor

Getting a hash string for a very large file

Submitted by: @import:stackexchange-codereview

Problem

After reading about large files and memory problems, I suspect that my code below may be inefficient because it reads the entire file into memory before applying the hash algorithm. Is there a better way?

import hashlib

# f is an already-open binary file object; path, agent and log come
# from the surrounding code.
chunk_size = 1024
hasher = hashlib.md5()
while True:
    try:
        data = f.read(chunk_size)
    except IOError as e:
        log.error('error hashing %s on Agent %s' % (path, agent.name))
        return {'error': '%s' % e}
    if not data:
        break
    hasher.update(data)
hash_string = hasher.hexdigest()

Solution

No, your code is essentially right, except that the chunk size should probably be larger: typically the page size, likely 4096 bytes. That figure is cargo-culted, though, so profiling would be better either way.
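To illustrate, here is a minimal sketch of the chunked-reading pattern with a 4096-byte chunk size; the function name md5_of_file is just for illustration, and iter() with a sentinel replaces the explicit while/break loop:

```python
import hashlib

def md5_of_file(path, chunk_size=4096):
    # Read in fixed-size chunks so memory use stays bounded
    # regardless of how large the file is.
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
```

As an aside, on Python 3.11+ hashlib.file_digest(f, 'md5') does this chunked reading for you.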

Also, it might be better to move the try/except block out of the loop, if only for readability. The return convention for errors is a bit unusual, but since we don't know the context I can't comment further, except that '%s' % e should probably be str(e), because it's shorter (and clearer IMO: string formatting should be used to format strings, not to convert things to strings, but YMMV).
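Restructured along those lines, it might look like the following sketch (the function name file_md5 and the module-level logger are assumptions, not part of the original code):

```python
import hashlib
import logging

log = logging.getLogger(__name__)

def file_md5(path, chunk_size=4096):
    # The try/except now wraps the whole read loop instead of
    # sitting inside it, and the error is converted with str(e).
    hasher = hashlib.md5()
    try:
        with open(path, 'rb') as f:
            for data in iter(lambda: f.read(chunk_size), b''):
                hasher.update(data)
    except IOError as e:
        log.error('error hashing %s', path)
        return {'error': str(e)}
    return hasher.hexdigest()
```

Note that in Python 3, IOError is an alias of OSError, so this also catches a missing file.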

That said, you could try shelling out to md5sum $FILE via subprocess and reading back the result; it might be faster.
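A rough sketch of that subprocess approach, assuming the md5sum utility (coreutils) is on PATH; the function name is hypothetical:

```python
import subprocess

def md5_via_md5sum(path):
    # md5sum prints "<hex digest>  <filename>"; take the first field.
    result = subprocess.run(['md5sum', path], check=True,
                            capture_output=True, text=True)
    return result.stdout.split()[0]
```

Whether this actually beats hashlib depends on file size and the cost of spawning a process, so again, profile before committing to it.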

Context

StackExchange Code Review Q#108330, answer score: 2

Revisions (0)

No revisions yet.