Getting a hash string for a very large file
Problem
After reading about large files and memory problems, I suspect that my code below may be inefficient because it reads the entire file into memory before applying the hash algorithm. Is there a better way?
# f, path, agent and log are assumed to be defined in the surrounding code.
import hashlib

chunk_size = 1024
hasher = hashlib.md5()
while True:
    try:
        data = f.read(chunk_size)
    except IOError as e:
        log.error('error hashing %s on Agent %s' % (path, agent.name))
        return {'error': '%s' % e}
    if not data:
        break
    hasher.update(data)
hash_string = hasher.hexdigest()

Solution
Nope, this is exactly right, except that the chunk size should probably be bigger: typically the page size, likely 4096 bytes. But that's cargo-culted, so profiling would be better either way.
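To make the profiling advice concrete, here is a rough timing sketch (my illustration, not part of the original answer). The path 'bigfile.bin' is a placeholder, and the first run warms the OS file cache, so repeat runs before trusting the numbers.

import hashlib
import time

def md5_chunked(path, chunk_size):
    # Hash the file in chunks of the given size, as in the question.
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

# Compare a few candidate chunk sizes on the same file.
for chunk_size in (1024, 4096, 64 * 1024, 1024 * 1024):
    start = time.time()
    md5_chunked('bigfile.bin', chunk_size)
    print('%8d-byte chunks: %.3fs' % (chunk_size, time.time() - start))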
Also, it might be better to move the try/except block out of the loop, if just for readability. The return convention for errors is a bit weird, but since we don't know the context I can't comment more on that, except that '%s' % e should probably be str(e), because it's a bit shorter (and clearer IMO; string formatting should be used to format strings, not to convert to string, but YMMV).
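As an illustration of those two points, the loop could be rewritten with the try/except hoisted out and str(e) for the error value. This sketch is mine, not the answerer's; path, agent and log are assumed to come from the same enclosing function as the original snippet.

# Sketch only: assumes the same enclosing function and context
# variables (path, agent, log) as the original snippet.
import hashlib

chunk_size = 4096  # page-sized chunks, per the suggestion above
hasher = hashlib.md5()
try:
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            hasher.update(data)
except IOError as e:
    # The try/except now wraps the whole loop instead of each read.
    log.error('error hashing %s on Agent %s' % (path, agent.name))
    return {'error': str(e)}
hash_string = hasher.hexdigest()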
That said, try shelling out to md5sum $FILE and retrieving the result; it might be faster, i.e. using subprocess.
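A minimal sketch of that idea, assuming a Unix-like system with md5sum on the PATH (this helper and the placeholder path are mine, not the answerer's):

import subprocess

def md5_via_md5sum(path):
    # md5sum prints '<hexdigest>  <filename>'; keep only the first field.
    # check_output raises CalledProcessError if md5sum exits non-zero.
    output = subprocess.check_output(['md5sum', path])
    return output.split()[0].decode('ascii')

print(md5_via_md5sum('bigfile.bin'))  # 'bigfile.bin' is a placeholder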
Context
StackExchange Code Review Q#108330, answer score: 2