Byte by byte directory comparison ignoring folder structures and file name differences
Problem
I haven't been able to find an existing tool that does this, so I'm attempting to create one. If anyone knows of one that already exists, I'd appreciate a pointer to it. I plan to use this primarily for cleaning up old backup copies, and I'm hoping for a review of its correctness or suggestions for improvement. Part of my concern is whether `filecmp.cmp()`, as I've used it here with the third argument set to `False`, does a full byte-by-byte comparison. I'm also providing it here in the hope that someone else might find it useful. I have run it on Ubuntu 12.04 LTS with Python 2.7.3.
```
# Prints a list of paths to files that exist in dir_l but not dir_r. File name
# differences are ignored. Recursively scans subdirectories. Skips hidden files
# and folders by default. Files of the same size are compared byte by byte (?).
# Differences in folder structures are ignored. For example, if
# dir_l/subdir1/file1 and dir_r/subdir2/subdir3/file2 match byte for byte,
# then dir_l/subdir1/file1 exists in dir_r.
# Two primary data structures are used:
# (1) A list of all the paths to files in dir_l (recursively including
# subdirectories of dir_l and excluding hidden files and folders by default).
# (2) A hash mapping each unique file size in dir_r to a list of all the paths
# to files in dir_r of that size (recursively including subdirectories of dir_r
# and excluding hidden files and folders by default).
# For each file pointed to in (1), its size is checked for existence in (2).
# If its size does not exist in (2), the file path to it is stored as
# unmatched. If its size does exist in (2), a byte by byte comparison (?) is
# done between it and each file matching its size in (2) until a match is
# found, if any. If a match is not found, the file path to it is stored as
# unmatched. The stored list of unmatched file paths, if any, is then printed.
# Requires the progress bar library (2.2)
# https://pypi.python.org/pypi/progressbar/2.2
# http://code.google.
```
Solution
- Yes, `filecmp.cmp` compares the contents of the files with `shallow=False`.
- Breaking `main` into more functions still gives better structure.
- Making `get_dir_file_paths` a generator reduces memory use when building `size_to_filepaths_r`, and simplifies the function itself slightly.
- Use `collections.defaultdict(list)` to avoid `if size in size_to_filepaths_r` checks.
- Use `enumerate` to keep a loop counter.
- Get the file sizes while walking the directories to benefit from disk caching.
- Compare all files of the same size consecutively for the same reason (the code below sorts `filepaths_l` for that).
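The first point can be checked with a small experiment. The sketch below is written for Python 3 (it relies on `tempfile.TemporaryDirectory`), and the helper names `write_file` and `compare_both_modes` are made up for illustration. It writes two files with equal size and modification time and compares them in both modes.

```python
import filecmp
import os
import tempfile


def write_file(path, data, mtime):
    """Write bytes to path and pin its modification time (hypothetical helper)."""
    with open(path, "wb") as handle:
        handle.write(data)
    os.utime(path, (mtime, mtime))


def compare_both_modes(data_a, data_b):
    """Return (shallow_result, deep_result) for two files sharing one mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "a.bin")
        path_b = os.path.join(tmp, "b.bin")
        # Same timestamp on both files, so only size and content can differ.
        write_file(path_a, data_a, 1000000000)
        write_file(path_b, data_b, 1000000000)
        shallow = filecmp.cmp(path_a, path_b, shallow=True)
        deep = filecmp.cmp(path_a, path_b, shallow=False)
    return shallow, deep


if __name__ == "__main__":
    # Same size, different last byte.
    print(compare_both_modes(b"abcd", b"abcX"))
    # Identical content.
    print(compare_both_modes(b"abcd", b"abcd"))
```

When size and modification time agree, the shallow mode can report a match without reading the files, which is why `shallow=False` matters for a tool meant to justify deleting backups.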
I propose to rearrange the bulk of the work into these functions:

```
import collections
import filecmp
import os

from progressbar import Bar, Percentage, ProgressBar


def dict_of_lists(items):
    # group (key, value) pairs into a dict mapping key -> list of values
    d = collections.defaultdict(list)
    for key, value in items:
        d[key].append(value)
    return d


def get_dir_file_paths(top, include_hidden):
    for dirpath, dirnames, filenames in os.walk(top):
        if not include_hidden:
            # ignore hidden files and folders
            # http://stackoverflow.com/questions/13454164/os-walk-without-hidden-folders
            # Answer by Martijn Pieters
            filenames = [f for f in filenames if not f.startswith('.')]
            # in-place assignment so os.walk skips the pruned folders
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        for filename in filenames:
            yield os.path.join(dirpath, filename)


def sizes_paths(top, include_hidden):
    for filepath in get_dir_file_paths(top, include_hidden):
        size = os.path.getsize(filepath)
        yield size, filepath


def file_match(filepath_l, filepaths_r):
    # True as soon as any candidate matches; shallow=False compares contents
    return any(filecmp.cmp(filepath_l, filepath_r, False)
               for filepath_r in filepaths_r)


def find_unmatched(dir_l, dir_r, include_hidden):
    # sorting by (size, path) puts files of the same size next to each other
    filepaths_l = sorted(sizes_paths(dir_l, include_hidden))
    size_to_filepaths_r = dict_of_lists(sizes_paths(dir_r, include_hidden))

    # creates a progress bar
    pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(filepaths_l))
    pbar.start()

    unmatched = []
    for i, (size, filepath_l) in enumerate(filepaths_l):
        if not file_match(filepath_l, size_to_filepaths_r[size]):
            # either no files in dir_r exist that are the same size as the file
            # pointed to by filepath_l, or none of those that do are a
            # byte by byte match
            unmatched.append(filepath_l)
        pbar.update(i)
    pbar.finish()
    return unmatched
```
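To try the approach on a small tree without installing the progress bar library, here is a simplified sketch for Python 3. It is a stand-in for illustration, not the code proposed above: `find_unmatched_simple`, `make_file`, and `demo` are hypothetical names, hidden files are not filtered, and there is no progress reporting. The test tree mirrors the example from the question, where `subdir1/file1` on the left matches `subdir2/subdir3/file2` on the right.

```python
import collections
import filecmp
import os
import tempfile


def sizes_paths(top):
    """Yield (size, path) for every file below top (hidden files included)."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            yield os.path.getsize(path), path


def find_unmatched_simple(dir_l, dir_r):
    """Return paths in dir_l with no byte-identical file anywhere in dir_r."""
    size_to_paths_r = collections.defaultdict(list)
    for size, path in sizes_paths(dir_r):
        size_to_paths_r[size].append(path)
    unmatched = []
    for size, path_l in sorted(sizes_paths(dir_l)):
        # .get avoids inserting empty lists for sizes missing on the right.
        candidates = size_to_paths_r.get(size, [])
        if not any(filecmp.cmp(path_l, path_r, shallow=False)
                   for path_r in candidates):
            unmatched.append(path_l)
    return unmatched


def make_file(root, relative_path, data):
    """Create root/relative_path (and parent folders) containing data."""
    path = os.path.join(root, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


def demo():
    """Build two small trees; return names of left files missing on the right."""
    with tempfile.TemporaryDirectory() as tmp:
        dir_l = os.path.join(tmp, "left")
        dir_r = os.path.join(tmp, "right")
        # All four files are 10 bytes, so size alone cannot separate them.
        make_file(dir_l, os.path.join("subdir1", "file1"), b"same bytes")
        make_file(dir_l, os.path.join("subdir1", "only_left"), b"left only!")
        make_file(dir_r, os.path.join("subdir2", "subdir3", "file2"), b"same bytes")
        make_file(dir_r, os.path.join("subdir2", "decoy"), b"0123456789")
        found = find_unmatched_simple(dir_l, dir_r)
        return [os.path.basename(path) for path in found]


if __name__ == "__main__":
    print(demo())
```

Because every file in the demo has the same size, the size lookup narrows nothing and the result depends entirely on the content comparison, which is the case the question was worried about.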
Context
StackExchange Code Review Q#41853, answer score: 4