Finding duplicate files using md5sum Unix command
Problem
This is an exercise from Think Python: How to Think Like a Computer Scientist
Here's its description:
In a large collection of MP3 files, there may be more than one copy of
the same song, stored in different directories or with different
filenames. The goal of this exercise is to search for duplicates.
- Write a program that searches a directory and all of its subdirectories, recursively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating file and path names.
- To recognize duplicates, you can use md5sum to compute a “checksum” for each file. If two files have the same checksum, they probably have the same contents.
- To double-check, you can use the Unix command diff.
Here's my solution:
```
import os


def run_command(cmd):
    """Runs a command in a shell.
    cmd: a string specifies a Unix command.
    Returns: a string specifies the result
    of executing the command.
    """
    filepipe = os.popen(cmd)
    result = filepipe.read()
    status = filepipe.close()
    return result


def md5_checksum(filepath):
    """Returns a string specifies the MD5 checksum of
    a given file using md5sum Unix command.
    filepath: a string specifies a file.
    """
    command = 'md5sum ' + filepath
    return run_command(command)


def md5_checksum_table(dirname, suffix):
    """Searches a directory for files with a given
    file format (a suffix) and computes their
    MD5 checksums.
    dirname: a string specifies a directory.
    suffix: a file format (e.g. .pdf or .mp3).
    Returns: a dictionary mapping from string
    works as a MD5 checksum to list of strings
    work as pathes of files have this checksum.
    """
    table = {}
    for root, sub, files in os.walk(dirname):
        for file in files:
            if file.endswith(suffix):
                filepath = os.path.join(roo
```
Solution
Well this was a very satisfying problem, thanks for sharing!
First of all, calling external resources is expensive and therefore not optimal, and optimization is what you asked for. That said, calling an external resource can be preferable if it is something like the shell on a platform you control. That is why I removed the external calls and replaced them with Python's standard library. It is pretty much the only reason this code is slightly faster than yours.
I found one small error in your code: what if the path of a file you hash contains spaces? The problem occurs when you split the value returned by md5_checksum; it splits into as many pieces as there are runs of whitespace.
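To make the whitespace problem concrete, here is a minimal sketch. It assumes md5sum prints one line per file in the form checksum, whitespace, file name (text mode, no escaping). The names parse_md5sum_line and quoted_md5sum_command are hypothetical helpers for illustration, not functions from the question.

```python
import shlex


def parse_md5sum_line(line):
    """Split one line of md5sum output into (checksum, path).

    Assumption: the line looks like '<checksum>  <path>'. Splitting only
    once keeps a path that contains spaces in one piece.
    """
    checksum, path = line.rstrip("\n").split(maxsplit=1)
    return checksum, path


def quoted_md5sum_command(filepath):
    """Build the shell command with the path quoted for the shell."""
    return "md5sum " + shlex.quote(filepath)


# A made-up output line for a file whose name contains a space.
line = "d41d8cd98f00b204e9800998ecf8427e  my song.mp3\n"

print(line.split())                  # naive split: three pieces, not two
print(parse_md5sum_line(line))       # checksum and the full path
print(quoted_md5sum_command("my song.mp3"))
```

Quoting matters as well because an unquoted path with spaces is split by the shell before md5sum ever sees it.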
The most time-consuming function in both your code and mine is walk. It is easy to check where the CPU time went with a profiler. Python has many; the one I like is the built-in cProfile. Check my code for usage.
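As a rough sketch of how such a measurement can be taken without building a code string for cProfile.run, the standard library also offers cProfile.Profile and pstats. The helper profile_call and the function busy_work are hypothetical names used only for this illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the top entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


def busy_work(n):
    """Stand-in workload; replace with the function you want to measure."""
    return sum(range(n))


result, report = profile_call(busy_work, 100000)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run, such as a directory walk, at the top of the report.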
The biggest change was refactoring the function are_identical into
```
if any(cmp(x, y) for x in paths for y in paths if y != x):
    print('\nThey are identical\n')
```
They do the same thing, but the any() built-in is also faster than iterating over the lists by hand.
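One caveat, offered as a sketch rather than as part of the code in this answer: any() is true as soon as a single pair of files matches. If the goal is to confirm that every file in a group is identical, a variant built on all() might look like the following. The name all_identical is hypothetical, and passing shallow=False is an assumption made here to force a byte-by-byte comparison.

```python
from filecmp import cmp


def all_identical(paths):
    """Return True only if every file in paths has the same content.

    Equality is transitive, so comparing each file against the first
    one is enough. An empty or single-element list yields True.
    """
    if len(paths) < 2:
        return True
    first = paths[0]
    # shallow=False compares file contents instead of trusting os.stat() data.
    return all(cmp(first, other, shallow=False) for other in paths[1:])
```

Comparing against the first file needs one comparison per remaining file, instead of one per ordered pair.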
I removed your function comments, since good function names and annotations can take their place. Do you agree?
```
from os import walk
from os.path import join
from hashlib import md5
from filecmp import cmp
from base64 import b64encode
from time import time
import cProfile


def md5_checksum(file_path: str) -> (bytes, str):
    """Returns the raw MD5 digest of a file's content, plus the file's path."""
    with open(file_path, "rb") as f:
        file = f.read()
    m = md5()
    m.update(file)
    return m.digest(), file_path


def md5_checksum_table(dir_name: str, suffix: str) -> {bytes: [str]}:
    """
    Searches a directory for files with a given file format (a suffix) and
    computes their MD5 checksums.
    """
    table = {}
    for root, sub, files in walk(dir_name):
        for file in files:
            if file.endswith(suffix):
                checksum, filename = md5_checksum(join(root, file))
                # Group the paths of all files that share a checksum.
                table.setdefault(checksum, []).append(filename)
    return table


def print_duplicates(checksums: {bytes: [str]}):
    """Prints the paths of files that share an MD5 checksum and compares them."""
    for checksum, paths in checksums.items():
        if len(paths) > 1:
            print('Files have the checksum {0} are:\n {1}'.format(b64encode(checksum),
                                                                   "\n".join(paths)))
            if any(cmp(x, y) for x in paths for y in paths if y != x):
                print('\nThey are identical\n')


def main():
    start = time()
    table = md5_checksum_table('/media/sf_Shared/', '.pdf')
    print_duplicates(table)
    print("Time {:.3f}s".format(time()-start))

    # Profile both steps to see where the CPU time goes.
    cProfile.run("md5_checksum_table('/home/cly/', '.pdf')")
    cProfile.run("print_duplicates({})".format(table))


if __name__ == '__main__':
    main()
```
That being said, the problem statement seems foggy. For this kind of problem, MD5 will in practice not yield the same hash for two different sets of data; accidental collisions are astronomically unlikely, although deliberately crafted ones exist. That is what a hash function is designed for. So if the hashes are identical, the content is, for all practical purposes, identical.
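On the hashing step itself: md5_checksum reads each file into memory in one go. For large files, a chunked variant may be gentler on memory. The sketch below is one possible shape; the name md5_checksum_chunked and the 1 MiB chunk size are assumptions chosen for illustration, not measured choices.

```python
from hashlib import md5


def md5_checksum_chunked(file_path, chunk_size=1 << 20):
    """Return (digest, file_path), hashing the file chunk by chunk."""
    m = md5()
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.digest(), file_path
```

Feeding the data to update() in pieces gives the same digest as hashing the whole content at once, so it can stand in for md5_checksum without changing the resulting table.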
The last thing I will say is that even a very fast hash function like MD5 is slower than an efficient comparison of the content. So I am criticizing the problem, not your solution.
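One way to read "an efficient comparison of the content", sketched under assumptions and not benchmarked against the MD5 version: files of different sizes cannot be identical, so group the files by size first and compare bytes only within a group. The name find_duplicates_by_content is hypothetical.

```python
import os
from collections import defaultdict
from filecmp import cmp


def find_duplicates_by_content(dir_name, suffix):
    """Return a list of groups; each group lists paths with equal content."""
    # Step 1: bucket the matching files by size, which needs no file reads.
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(dir_name):
        for name in files:
            if name.endswith(suffix):
                path = os.path.join(root, name)
                by_size[os.path.getsize(path)].append(path)

    # Step 2: within each bucket, compare contents byte by byte.
    groups = []
    for paths in by_size.values():
        remaining = paths
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [p for p in rest if cmp(first, p, shallow=False)]
            if same:
                groups.append([first] + same)
            # Files equal to `first` are settled; keep checking the others.
            remaining = [p for p in rest if p not in same]
    return groups
```

Files with a unique size are never opened at all, which is where this approach can save work compared with hashing every file.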
Thanks! Good work.
Context
StackExchange Code Review Q#143367, answer score: 2