patternpythonMinor
Group duplicate files
Viewed 0 times
duplicatefilesgroup
Problem
Here is my source code which group duplicate files together, written in Python 2.7. Any advice on smarter ideas to group duplicate files together more efficient, it will be great. Any general advice on code bugs and code style are also appreciated.
Problem statement:
I am given a list of files e.g.
My major idea:
```
from collections import defaultdict
import itertools
def group_duplicate_files(files):
hash_buf = defaultdict(list)
for f in files:
hash_buf[hash_header(f)].append(f)
result = []
for file_list in hash_buf.values():
if len(file_list) 0:
found = True
break
if found:
n = n.union(result[i])
result.remove(result[i])
result.append(n)
else:
result.append(n)
return result
def hash_header(filename):
f = open(filename, "r")
header = f.read(10)
return hash(header)
def compare_full_content(filename1, filename2):
file1 = open(filename1, 'r')
file2 = open(filename2, 'r')
content1 = file1.read()
content2 = file2.read()
i = 0
j = 0
while i < len(content1) and j < len(content2):
if content1[i] != content2[j]:
return False
i += 1
Problem statement:
I am given a list of files e.g.
["1.txt", "2.txt", "3.txt", "4.txt", "5.txt", "6.txt"]. I want to group all files which have the exact same content. Suppose in this example, file "1.txt", "2.txt", "3.txt" are the same, file "4.txt", "5.txt", "6.txt" have common header, but "4.txt", "6.txt" are exactly the same whole content. Then, the output should be two groups "1.txt", "2.txt", "3.txt" and "4.txt", "6.txt".My major idea:
- To avoid reading full content for each file, I generate a hash code for a file header (in my example, I define the file header to be the first two bytes of a file)
- If more than 2 (>=2) files have the same header, I will read the full content and see if they are the exact same whole content
```
from collections import defaultdict
import itertools
def group_duplicate_files(files):
hash_buf = defaultdict(list)
for f in files:
hash_buf[hash_header(f)].append(f)
result = []
for file_list in hash_buf.values():
if len(file_list) 0:
found = True
break
if found:
n = n.union(result[i])
result.remove(result[i])
result.append(n)
else:
result.append(n)
return result
def hash_header(filename):
f = open(filename, "r")
header = f.read(10)
return hash(header)
def compare_full_content(filename1, filename2):
file1 = open(filename1, 'r')
file2 = open(filename2, 'r')
content1 = file1.read()
content2 = file2.read()
i = 0
j = 0
while i < len(content1) and j < len(content2):
if content1[i] != content2[j]:
return False
i += 1
Solution
As far as I can tell, your algorithm is something like \$\mathcal{O}(n!)\$, because you compare all possible pairs of files. It is possible to do this in \$\mathcal{O}(n)\$.
You just have to iterate over the files and order them into a hashmap (i.e. dictionary) based on their header hash. Then, resolve the possible conflicts of two files with the same header, but different content, which takes \$\mathcal{O}(k)\$, where \$k\$ is the number of files with the same header. So in the worst case (all files have the same header), this algorithm is \$\mathcal{O}(2n)\$.
This code has two caveats:
-
It assumes
-
This uses the fact that
You just have to iterate over the files and order them into a hashmap (i.e. dictionary) based on their header hash. Then, resolve the possible conflicts of two files with the same header, but different content, which takes \$\mathcal{O}(k)\$, where \$k\$ is the number of files with the same header. So in the worst case (all files have the same header), this algorithm is \$\mathcal{O}(2n)\$.
from collections import defaultdict
import sys
def hash_content(file_name, length=-1):
with open(file_name) as f:
return hash(f.read(length))
def group_by_header(*file_names):
file_groups = defaultdict(list)
for file_name in file_names:
file_groups[hash_content(file_name, length=10)].append(file_name)
return file_groups
def group_by_content(*file_names):
file_groups = group_by_header(*file_names)
for file_names in file_groups.values():
if len(file_names) > 1:
while file_names:
file_name = file_names.pop()
file_groups[hash_content(file_name)].append(file_name)
return filter(None, file_groups.values())
if __name__ == "__main__":
file_names = sys.argv[1:]
print group_by_content(*file_names)This code has two caveats:
-
It assumes
hash is collision-free (a fair assumption). We could pass the string content directly as a key and let python do the hashing. Then we would also get a strategy for the rare case that two strings have the same hash without being equal. But this way we avoid storing the possibly large file-contents as strings.-
This uses the fact that
dict.values() returns a list and not an iterator, so we can modify the dictionary in the loop here. This breaks in Python 3.x, but you specifically tagged this as Python 2.7. You can remedy this by introducing a new dictionary:def group_by_content(*file_names):
header_hashs = group_by_header(*file_names)
content_hashs = defaultdict(list)
for hash_, file_names in header_hashs.items():
if len(file_names) > 1:
while len(file_names):
file_name = file_names.pop()
content_hashs[hash_content(file_name)].append(file_name)
else:
content_hashs[hash_] = file_names
return (val for val in content_hashs.values() if val)Code Snippets
from collections import defaultdict
import sys
def hash_content(file_name, length=-1):
with open(file_name) as f:
return hash(f.read(length))
def group_by_header(*file_names):
file_groups = defaultdict(list)
for file_name in file_names:
file_groups[hash_content(file_name, length=10)].append(file_name)
return file_groups
def group_by_content(*file_names):
file_groups = group_by_header(*file_names)
for file_names in file_groups.values():
if len(file_names) > 1:
while file_names:
file_name = file_names.pop()
file_groups[hash_content(file_name)].append(file_name)
return filter(None, file_groups.values())
if __name__ == "__main__":
file_names = sys.argv[1:]
print group_by_content(*file_names)def group_by_content(*file_names):
header_hashs = group_by_header(*file_names)
content_hashs = defaultdict(list)
for hash_, file_names in header_hashs.items():
if len(file_names) > 1:
while len(file_names):
file_name = file_names.pop()
content_hashs[hash_content(file_name)].append(file_name)
else:
content_hashs[hash_] = file_names
return (val for val in content_hashs.values() if val)Context
StackExchange Code Review Q#158244, answer score: 3
Revisions (0)
No revisions yet.