
Byte by byte directory comparison ignoring folder structures and file name differences

Submitted by: @import:stackexchange-codereview

Problem

I haven't been able to find an existing tool that does this, so I'm attempting to create one. If anyone knows of one that already exists, I'd appreciate a pointer to it. I plan on using this primarily for cleaning up old backup copies and was hoping for a review of its correctness or suggestions for improvement. Part of my concern is whether or not filecmp.cmp(), as I've used it here with the third argument set to False, does a full byte by byte comparison. I'm also providing it here in the hope that someone else might find it useful. I have run it on Ubuntu 12.04 LTS with Python 2.7.3.

```
# Prints a list of paths to files that exist in dir_l but not dir_r. File name
# differences are ignored. Recursively scans subdirectories. Skips hidden files
# and folders by default. Files of the same size are compared byte by byte (?).
# Differences in folder structures are ignored. For example, if
# dir_l/subdir1/file1 and dir_r/subdir2/subdir3/file2 match byte for byte,
# then dir_l/subdir1/file1 exists in dir_r.

# Two primary data structures are used:
# (1) A list of all the paths to files in dir_l (recursively including
# subdirectories of dir_l and excluding hidden files and folders by default).
# (2) A hash mapping each unique file size in dir_r to a list of all the paths
# to files in dir_r of that size (recursively including subdirectories of dir_r
# and excluding hidden files and folders by default).

# For each file pointed to in (1), its size is checked for existence in (2).
# If its size does not exist in (2), the file path to it is stored as
# unmatched. If its size does exist in (2), a byte by byte comparison (?) is
# done between it and each file matching its size in (2) until a match is
# found, if any. If a match is not found, the file path to it is stored as
# unmatched. The stored list of unmatched file paths, if any, is then printed.

# Requires the progress bar library (2.2)
# https://pypi.python.org/pypi/progressbar/2.2
# http://code.google.
```

Solution


  • Yes: with shallow=False, filecmp.cmp compares the contents of the files. It returns False early if the sizes differ; otherwise it reads both files and compares them chunk by chunk.
  • Breaking main into more functions would give the script a better structure.
  • Making get_dir_file_paths a generator reduces memory use when building size_to_filepaths_r, and simplifies the function itself slightly.
  • Use collections.defaultdict(list) to avoid the explicit if size in size_to_filepaths_r checks.
  • Use enumerate to keep a loop counter.
  • Get the file sizes while walking the directories to benefit from disk caching.
  • Compare all files of the same size consecutively for the same reason (the code below sorts filepaths_l for that).

I propose to rearrange the bulk of the work into these functions:

```
import collections
import filecmp
import os

# progressbar 2.2, the library the question's script already requires
from progressbar import ProgressBar, Percentage, Bar

def dict_of_lists(items):
    # group (key, value) pairs into a dict mapping key -> list of values
    d = collections.defaultdict(list)
    for key, value in items:
        d[key].append(value)
    return d

def get_dir_file_paths(top, include_hidden):
    for dirpath, dirnames, filenames in os.walk(top):
        if not include_hidden:
            # ignore hidden files and folders
            # http://stackoverflow.com/questions/13454164/os-walk-without-hidden-folders
            # Answer by Martijn Pieters
            filenames = [f for f in filenames if not f.startswith('.')]
            # in-place assignment so os.walk does not descend into hidden folders
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]

        for filename in filenames:
            yield os.path.join(dirpath, filename)

def sizes_paths(top, include_hidden):
    for filepath in get_dir_file_paths(top, include_hidden):
        size = os.path.getsize(filepath)
        yield size, filepath

def file_match(filepath_l, filepaths_r):
    return any(filecmp.cmp(filepath_l, filepath_r, False)
               for filepath_r in filepaths_r)

def find_unmatched(dir_l, dir_r, include_hidden):

    # sorting by (size, path) puts files of the same size next to each other
    filepaths_l = sorted(sizes_paths(dir_l, include_hidden))
    size_to_filepaths_r = dict_of_lists(sizes_paths(dir_r, include_hidden))

    # creates a progress bar
    pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(filepaths_l))
    pbar.start()

    unmatched = []

    for i, (size, filepath_l) in enumerate(filepaths_l):
        if not file_match(filepath_l, size_to_filepaths_r[size]):
            # either no files in dir_r exist that are the same size as the file
            # pointed to by filepath_l, or none of those that do are a
            # byte by byte match
            unmatched.append(filepath_l)
        pbar.update(i)
    pbar.finish()

    return unmatched
```
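
To try the approach without installing progressbar, something like the following stdlib-only sketch may help. It is a simplified variant, not the code above: it drops the progress bar and the hidden-file filter, and the names find_unmatched_simple, write_file and run_demo, as well as the temporary directory layout, are assumptions made for the demo. The layout mirrors the example from the question, where dir_l/subdir1/file1 matches dir_r/subdir2/subdir3/file2.

```python
import collections
import filecmp
import os
import shutil
import tempfile


def sizes_paths(top):
    """Yield (size, path) for every file below top (hidden files included)."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            yield os.path.getsize(path), path


def find_unmatched_simple(dir_l, dir_r):
    """Return paths in dir_l with no byte-identical counterpart in dir_r."""
    size_to_paths_r = collections.defaultdict(list)
    for size, path in sizes_paths(dir_r):
        size_to_paths_r[size].append(path)

    unmatched = []
    for size, path_l in sorted(sizes_paths(dir_l)):
        # .get() avoids creating empty entries for sizes only seen in dir_l.
        candidates = size_to_paths_r.get(size, [])
        if not any(filecmp.cmp(path_l, path_r, shallow=False)
                   for path_r in candidates):
            unmatched.append(path_l)
    return unmatched


def write_file(path, data):
    # Hypothetical helper: create parent folders, then write raw bytes.
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        os.makedirs(parent)
    with open(path, "wb") as handle:
        handle.write(data)


def run_demo():
    """Build two small trees and return unmatched paths relative to the left one."""
    root = tempfile.mkdtemp()
    try:
        dir_l = os.path.join(root, "left")
        dir_r = os.path.join(root, "right")
        # All four files are 9 bytes long, so every comparison reads contents.
        write_file(os.path.join(dir_l, "subdir1", "file1"), b"duplicate")
        write_file(os.path.join(dir_l, "only_left"), b"unique!!!")
        write_file(os.path.join(dir_r, "subdir2", "subdir3", "file2"), b"duplicate")
        write_file(os.path.join(dir_r, "same_size"), b"different")
        unmatched = find_unmatched_simple(dir_l, dir_r)
        return [os.path.relpath(path, dir_l) for path in unmatched]
    finally:
        shutil.rmtree(root)


if __name__ == "__main__":
    print(run_demo())
```

Only only_left is reported: file1 has a byte-identical copy under a different name and folder in the right tree, while only_left shares its size with files there but not its content.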

Context

StackExchange Code Review Q#41853, answer score: 4
