
Byte by byte directory comparison ignoring folder structures and file name differences

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

I haven't been able to find an existing tool that does this, so I'm attempting to create one. If anyone knows of one that already exists, I'd appreciate a pointer to it. I plan on using this primarily for cleaning up old backup copies and was hoping for a review of its correctness or suggestions for improvement. Part of my concern is whether or not filecmp.cmp(), as I've used it here with the third argument set to False, does a full byte by byte comparison. I'm also providing it here in the hope that someone else might find it useful. I have run it on Ubuntu 12.04 LTS with Python 2.7.3.

```
# Prints a list of paths to files that exist in dir_l but not dir_r. File name
# differences are ignored. Recursively scans subdirectories. Skips hidden files
# and folders by default. Files of the same size are compared byte by byte (?).
# Differences in folder structures are ignored. For example, if
# dir_l/subdir1/file1 and dir_r/subdir2/subdir3/file2 match byte for byte,
# then dir_l/subdir1/file1 exists in dir_r.

# Two primary data structures are used:
# (1) A list of all the paths to files in dir_l (recursively including
# subdirectories of dir_l and excluding hidden files and folders by default).
# (2) A hash mapping each unique file size in dir_r to a list of all the paths
# to files in dir_r of that size (recursively including subdirectories of dir_r
# and excluding hidden files and folders by default).

# For each file pointed to in (1), its size is checked for existence in (2).
# If its size does not exist in (2), the file path to it is stored as
# unmatched. If its size does exist in (2), a byte by byte comparison (?) is
# done between it and each file matching its size in (2) until a match is
# found, if any. If a match is not found, the file path to it is stored as
# unmatched. The stored list of unmatched file paths, if any, is then printed.

# Requires the progress bar library (2.2)
# https://pypi.python.org/pypi/progressbar/2.2
# http://code.google.
```

Solution

Yes. With shallow=False, filecmp.cmp compares the contents of the files rather than relying only on their os.stat signatures (file type, size, and modification time).
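A quick way to convince yourself is to compare files of identical size whose contents differ by one byte. The sketch below is only an illustration: the file names and contents are made up, and it writes to a temporary directory.

```python
import filecmp
import os
import tempfile


def write_file(path, data):
    """Write raw bytes to path."""
    with open(path, "wb") as f:
        f.write(data)


# Hypothetical test files, created in a throwaway directory.
tmpdir = tempfile.mkdtemp()
a = os.path.join(tmpdir, "a.bin")
b = os.path.join(tmpdir, "b.bin")
c = os.path.join(tmpdir, "c.bin")
write_file(a, b"hello world")
write_file(b, b"hello world")  # identical content
write_file(c, b"hello_world")  # same size, one byte differs

# shallow=False makes filecmp.cmp read and compare the file contents.
same_ab = filecmp.cmp(a, b, shallow=False)
same_ac = filecmp.cmp(a, c, shallow=False)
print("a == b: %s" % same_ab)
print("a == c: %s" % same_ac)
```

Files of different sizes are reported as different without their contents being read, so the size check in your own code is a convenience for grouping rather than a requirement for correctness.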

Breaking main into more functions would still give the program a better structure.

Making get_dir_file_paths a generator reduces memory use when building size_to_filepaths_r, and simplifies the function itself slightly.

Use collections.defaultdict(list) to avoid the "if size in size_to_filepaths_r" checks.

Use enumerate to keep a loop counter.

Get the file sizes while walking the directories to benefit from disk caching.

Compare all files of the same size consecutively for the same reason; the code below sorts filepaths_l by size to achieve that.
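To illustrate the defaultdict suggestion, here is a small sketch comparing it with an explicit membership check. The (size, path) pairs are hypothetical stand-ins for the output of a directory walk.

```python
import collections

# Hypothetical (size, path) pairs; real ones would come from walking dir_r.
pairs = [(10, "r/a.txt"), (20, "r/b.txt"), (10, "r/sub/c.txt")]

# Grouping with an explicit membership check.
plain = {}
for size, path in pairs:
    if size not in plain:
        plain[size] = []
    plain[size].append(path)

# Grouping with defaultdict: the empty list is created on first access.
grouped = collections.defaultdict(list)
for size, path in pairs:
    grouped[size].append(path)

same = dict(grouped) == plain

# Caveat: indexing a missing key inserts it into the defaultdict.
before = len(grouped)
missing = grouped[999]  # returns [] and stores 999 -> []
after = len(grouped)

# .get() looks a key up without inserting it.
safe = grouped.get(1234, [])
```

The insert-on-lookup behaviour is harmless for a one-off scan, but .get(size, []) is an option if you would rather keep the mapping unchanged while searching.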

I propose to rearrange the bulk of the work into these functions:

```
import collections
import filecmp
import os

# progressbar 2.2, the library named in the question
from progressbar import Bar, Percentage, ProgressBar


def dict_of_lists(items):
    """Group (key, value) pairs into a dict mapping each key to a list of values."""
    d = collections.defaultdict(list)
    for key, value in items:
        d[key].append(value)
    return d


def get_dir_file_paths(top, include_hidden):
    """Yield the path of every file under top, walking subdirectories."""
    for dirpath, dirnames, filenames in os.walk(top):
        if not include_hidden:
            # ignore hidden files and folders
            # http://stackoverflow.com/questions/13454164/os-walk-without-hidden-folders
            # Answer by Martijn Pieters
            filenames = [f for f in filenames if not f.startswith('.')]
            # assign in place so os.walk does not descend into hidden folders
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]

        for filename in filenames:
            yield os.path.join(dirpath, filename)


def sizes_paths(top, include_hidden):
    """Yield (size, path) for every file under top."""
    for filepath in get_dir_file_paths(top, include_hidden):
        size = os.path.getsize(filepath)
        yield size, filepath


def file_match(filepath_l, filepaths_r):
    """Return True if any file in filepaths_r has the same contents as filepath_l."""
    return any(filecmp.cmp(filepath_l, filepath_r, False)
               for filepath_r in filepaths_r)


def find_unmatched(dir_l, dir_r, include_hidden):
    """Return the paths of files in dir_l whose contents do not occur in dir_r."""
    # sorting the (size, path) tuples puts files of the same size next to each other
    filepaths_l = sorted(sizes_paths(dir_l, include_hidden))
    size_to_filepaths_r = dict_of_lists(sizes_paths(dir_r, include_hidden))

    # creates a progress bar
    pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(filepaths_l))
    pbar.start()

    unmatched = []

    for i, (size, filepath_l) in enumerate(filepaths_l):
        if not file_match(filepath_l, size_to_filepaths_r[size]):
            # either no files in dir_r exist that are the same size as the file
            # pointed to by filepath_l, or none of those that do are a
            # byte by byte match
            unmatched.append(filepath_l)
        pbar.update(i)
    pbar.finish()

    return unmatched
```
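
To try the approach without installing progressbar, something like the following condensed sketch could be used. It is not the code above: the helper names (walk_sizes, unmatched_files, make_file) and the test files are made up for illustration, the progress bar and hidden-file filtering are left out, and it assumes only the standard library.

```python
import collections
import filecmp
import os
import tempfile


def walk_sizes(top):
    """Yield (size, path) for every file under top (hidden files included)."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield os.path.getsize(path), path


def unmatched_files(dir_l, dir_r):
    """Return paths in dir_l whose contents match no file in dir_r."""
    by_size = collections.defaultdict(list)
    for size, path in walk_sizes(dir_r):
        by_size[size].append(path)

    result = []
    for size, path_l in sorted(walk_sizes(dir_l)):
        # .get avoids inserting sizes that only occur in dir_l
        candidates = by_size.get(size, [])
        if not any(filecmp.cmp(path_l, p, shallow=False) for p in candidates):
            result.append(path_l)
    return result


def make_file(root, relpath, data):
    """Create root/relpath (and parent folders) containing the given bytes."""
    path = os.path.join(root, relpath)
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    with open(path, "wb") as f:
        f.write(data)
    return path


# Hypothetical trees; every file is 10 bytes so contents must be compared.
left = tempfile.mkdtemp()
right = tempfile.mkdtemp()
make_file(left, os.path.join("subdir1", "file1"), b"same bytes")
only_left = make_file(left, "unique.txt", b"left only!")
make_file(right, os.path.join("subdir2", "subdir3", "file2"), b"same bytes")
make_file(right, "other.txt", b"right only")

result = unmatched_files(left, right)
print(result)
```

Here subdir1/file1 counts as present in the right-hand tree even though its name and location differ, so only unique.txt is reported.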

Context

StackExchange Code Review Q#41853, answer score: 4
