
Write MD5 hashes to file for all files in a directory tree

Submitted by: @import:stackexchange-codereview

Problem

I'm ultimately trying to compare the MD5 hashes for files in two disparate directory trees to see if files are missing from one directory or the other.

More broadly, I'm pushing photos from my DSLR camera to Google Drive, and need to know which files need to be synced up. Pictures from my phone are automatically being synced with Google Drive, and I'd also like to detect which ones I need to sync down. I have separate scripts for getting the MD5 hashes from Google Drive and for comparing the MD5 hashes to see which files need to be synced, up or down.
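The eventual comparison step described above could look something like the sketch below: load each TSV of checksums into a dict keyed by filename, then diff the key sets to find files missing on either side. The function names and the filename-keyed layout are assumptions for illustration, not code from the question.

```python
import csv


def load_checksums(path):
    """Load a TSV of (directory, filename, md5) rows into a
    dict mapping filename -> checksum."""
    with open(path, newline='') as f:
        return {row[1]: row[2] for row in csv.reader(f, delimiter='\t')}


def diff_checksums(local, remote):
    """Return (files missing remotely, files missing locally),
    i.e. what needs syncing up and what needs syncing down."""
    missing_remote = sorted(set(local) - set(remote))
    missing_local = sorted(set(remote) - set(local))
    return missing_remote, missing_local
```

Comparing on filename alone ignores renamed or moved files; matching on checksum instead (swap the dict key and value) would catch those, at the cost of treating duplicate files as one.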

#!/usr/bin/python

import os
import sys
import hashlib
import csv

src_dir = '/Volumes/Archive/Pictures'

with open('checksums_archive.tsv', 'wb') as fout:
    writer = csv.writer(fout, delimiter='\t', quotechar='\"', quoting=csv.QUOTE_MINIMAL)

    for root, subdirs, files in os.walk(src_dir):
        for file in files:
            file_path = os.path.join(root, file)
            checksum = hashlib.md5(open(file_path, 'rb').read()).hexdigest()

            writer.writerow([root, file, checksum])

Solution

Four things:

  • You should put global constants in CAPITAL_LETTERS, per PEP 8.

  • I would make the file name for the checksums a constant as well.

  • You never close the files you iterate over.

  • In general, doing all writes at the same time is faster.

So for 2, 3 & 4, maybe use:

for root, subdirs, files in os.walk(SRC_DIR):
    checksums = []
    for file in files:
        with open(os.path.join(root, file), 'rb') as _file:
            checksums.append([root, file, hashlib.md5(_file.read()).hexdigest()])
    writer.writerows(checksums)

This performs one write per subdirectory, which should be faster than a write after every file. With a lot of files it uses more memory, though batching per subdirectory keeps that bounded. You could even pull the checksums list outside of the os.walk loop and do a single (possibly giant) write.

Otherwise, clear and readable code!
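Putting all four suggestions together, a revised version of the full script might look like this. The paths are the illustrative ones from the question; the `file_md5` helper is my own naming, not from the answer.

```python
#!/usr/bin/env python3
import csv
import hashlib
import os

# Global constants in CAPITAL_LETTERS, per PEP 8; the output
# file name is now a constant too.
SRC_DIR = '/Volumes/Archive/Pictures'
CHECKSUM_FILE = 'checksums_archive.tsv'


def file_md5(path):
    """Return the MD5 hex digest of a file, closing it when done."""
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()


def write_checksums(src_dir, out_path):
    """Walk src_dir and write (root, filename, md5) rows as TSV,
    batching one writerows() call per subdirectory."""
    with open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for root, subdirs, files in os.walk(src_dir):
            rows = [[root, name, file_md5(os.path.join(root, name))]
                    for name in files]
            writer.writerows(rows)


if __name__ == '__main__':
    write_checksums(SRC_DIR, CHECKSUM_FILE)
```

Note the text mode plus `newline=''` when opening the output file: the question's `'wb'` is a Python 2 idiom, and Python 3's csv module expects a text-mode file opened this way.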


Context

StackExchange Code Review Q#133859, answer score: 4
