Write MD5 hashes to file for all files in a directory tree
Problem
I'm ultimately trying to compare the MD5 hashes for files in two disparate directory trees to see if files are missing from one directory or the other.
More broadly, I'm pushing photos from my DSLR camera to Google Drive, and need to know which files need to be synced up. Pictures from my phone are automatically being synced with Google Drive, and I'd also like to detect which ones I need to sync down. I have separate scripts for getting the MD5 hashes from Google Drive and for comparing the MD5 hashes to see which files need to be synced, up or down.
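The comparison script itself isn't part of this question, but a minimal sketch of that step (assuming checksums_drive.tsv as the name of the Drive-side output; load_checksums is a hypothetical helper) could be:

import csv

def load_checksums(path):
    # Map each MD5 hash to its (directory, filename) pair.
    with open(path, 'rb') as fin:
        return {row[2]: (row[0], row[1]) for row in csv.reader(fin, delimiter='\t')}

archive = load_checksums('checksums_archive.tsv')
drive = load_checksums('checksums_drive.tsv')  # hypothetical Drive-side output

# A hash present on one side but not the other marks a file to sync.
to_upload = [archive[h] for h in archive if h not in drive]
to_download = [drive[h] for h in drive if h not in archive]

Here's the script that generates the hashes for the local tree: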
#!/usr/bin/python
import os
import sys
import hashlib
import csv
src_dir = '/Volumes/Archive/Pictures'
with open('checksums_archive.tsv', 'wb') as fout:
    writer = csv.writer(fout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for root, subdirs, files in os.walk(src_dir):
        for file in files:
            file_path = os.path.join(root, file)
            checksum = hashlib.md5(open(file_path, 'rb').read()).hexdigest()
            writer.writerow([root, file, checksum])

Solution
Four things:
1. You should put global constants in CAPITAL_LETTERS, according to PEP 8.
2. I would make the file name for the checksums a constant.
3. You never close the files you iterate over.
4. In general, doing all writes at the same time is faster.
So for 2, 3 & 4, maybe use:
for root, subdirs, files in os.walk(SRC_DIR):
    checksums = []
    for file in files:
        with open(os.path.join(root, file), 'rb') as _file:
            checksums.append([root, file, hashlib.md5(_file.read()).hexdigest()])
    writer.writerows(checksums)

This performs one write per subdirectory, which should be faster than a write after every file. If you have a lot of files, this takes more memory, which is somewhat mitigated by writing once per subdirectory. You can even pull the checksums list outside of the os.walk loop to have only one (possibly giant) write, as shown below.
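For example, that single-write variant would hoist the list out of the loop (same SRC_DIR and writer as above):

checksums = []
for root, subdirs, files in os.walk(SRC_DIR):
    for file in files:
        with open(os.path.join(root, file), 'rb') as _file:
            checksums.append([root, file, hashlib.md5(_file.read()).hexdigest()])
# One (possibly giant) write once the whole tree has been walked.
writer.writerows(checksums)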
Otherwise, clear and readable code!
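Putting all four points together, a revised version of the script might look like this (a sketch; the constant names SRC_DIR and CHECKSUM_FILE are assumptions, following PEP 8):

#!/usr/bin/python
import os
import hashlib
import csv

SRC_DIR = '/Volumes/Archive/Pictures'
CHECKSUM_FILE = 'checksums_archive.tsv'  # assumed constant name (point 2)

with open(CHECKSUM_FILE, 'wb') as fout:
    writer = csv.writer(fout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for root, subdirs, files in os.walk(SRC_DIR):
        checksums = []
        for file in files:
            # The context manager closes each file after hashing (point 3).
            with open(os.path.join(root, file), 'rb') as _file:
                checksums.append([root, file, hashlib.md5(_file.read()).hexdigest()])
        # One write per subdirectory instead of one per file (point 4).
        writer.writerows(checksums)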
Context
StackExchange Code Review Q#133859, answer score: 4