Replacing duplicate files with hard links
Problem
I'm a photographer doing many backups. Over the years I found myself with a lot of hard drives. Now I bought a NAS and copied all my pictures on one 3TB RAID 1 using rsync. According to my script, about 1TB of those files are duplicates. That comes from doing multiple backups before deleting files on my laptop and being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messes things up.
Can you please have a look at my duplicate finder script and tell me if you think I can run it or not? I tried it on a test folder and it seems ok, but I don't want to mess things up on the NAS.
The script has three steps in three files. In this first part I find all image and metadata files and put them into a shelve database (datenbank) with their size as key.
If it's somehow important: It's a Synology 713+ and has an ext3 or ext4 filesystem.
import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__),"test")
path_to_search = "/volume1/backup_2tb_wd/"
file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]
walker = os.walk(path_to_search)

counter = 0
for dirpath, dirnames, filenames in walker:
    if filenames:
        for filename in filenames:
            counter += 1
            print str(counter)
            for file_ext in file_exts:
                if file_ext in filename:
                    filepath = os.path.join(dirpath, filename)
                    filesize = str(os.path.getsize(filepath))
                    if not filesize in datenbank:
                        datenbank[filesize] = []
                    tmp = datenbank[filesize]
                    if filepath not in tmp:
                        tmp.append(filepath)
                        datenbank[filesize] = tmp
datenbank.sync()
print "done"
datenbank.close()

This is the second part. Now I drop all file sizes which only have one file in their list and create another shelve database
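The second script is not reproduced here. As a rough sketch only, assuming the first database behaves like a dict that maps size strings to lists of paths (the function name `drop_unique_sizes` and the sample data are made up for illustration), the filtering step could look like this:

```python
def drop_unique_sizes(files_by_size):
    """Return only the size groups that contain two or more paths."""
    return {size: paths for size, paths in files_by_size.items() if len(paths) > 1}


# Hypothetical sample data standing in for the first shelve database.
step1 = {
    "1234": ["/a/summer.jpg", "/b/summer.jpg"],
    "5678": ["/a/winter.jpg"],
}
step2 = drop_unique_sizes(step1)
print(step2)
```

Only files that share their size with at least one other file can be duplicates, so everything dropped here never needs to be read or hashed.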
Solution
Your code is risky because it relies on a weak checksum (MD5; see Wikipedia for the details). Since a false match would be devastating here, please use SHA-256 instead.
Let me quote this:
I strongly question your use of MD5. You should be at least using
SHA1. Some people think that as long as you're not using MD5 for
'cryptographic' purposes, you're fine. But stuff has a tendency
to end up being broader in scope than you initially expect,
and your casual vulnerability analysis may prove completely
flawed. It's best to just get in the habit of using the right
algorithm out of the gate. It's just typing a different bunch of
letters is all. It's not that hard.
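To illustrate how small that change is, here is a minimal sketch (not the code from the question; the helper name `hash_file` and its defaults are my own assumptions, and it is written for Python 3) in which the algorithm is just a string handed to `hashlib.new`:

```python
import hashlib


def hash_file(path, algorithm="sha256", blocksize=2 * 1024 * 1024):
    """Hash a file in chunks so large images never have to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Calling `hash_file(some_path)` returns the SHA-256 hex digest; passing a different algorithm name is the only change needed to switch.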
Second, I have added an "inspection loop" to the main code, so that it creates a CSV file which you can play around with to check what the code will do (I checked the CSV data using a pivot table in Excel).
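If you would rather not use a spreadsheet, roughly the same check can be done in Python. This is only a sketch under assumptions: it expects the column names written by the script below (sha256, filename, keep, link, size, filepath), it is written for Python 3, and the function name `summarize_duplicates` is made up:

```python
import csv
from collections import defaultdict


def summarize_duplicates(csv_path):
    """Group CSV rows by sha256 and report copies and reclaimable bytes."""
    groups = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            groups[row["sha256"]].append(row)
    summary = {}
    for sha, rows in groups.items():
        # csv.writer stores booleans as the text "True" / "False".
        redundant = [r for r in rows if r["keep"] == "False" and r["link"] == "False"]
        summary[sha] = {
            "copies": len(rows),
            "reclaimable_bytes": sum(int(r["size"]) for r in redundant),
        }
    return summary
```

Every hash with more than one copy is a group the relinking step would touch, so those are the rows worth inspecting by hand first.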
So in summary I have rewritten your code as follows:
Python 2.7.4
```
import os
import os.path
import hashlib
import csv

"""
Recipe:
1. We identify all files on your system and store those with the wanted
   extensions in a table with this structure:

   sha256    | filename.ext | keep  | link  | size | filepath
   ----------+--------------+-------+-------+------+------
   23eadf3ed | summer.jpg   | True  | False | 1234 | /volume1/backup_2tb_wd/randomStuff/
   23eadf3ed | summer.jpg   | False | False | 1234 | /volume1/backup_2tb_wd/Stuff/
   23eadf3ed | summer.jpg   | False | False | 1234 | /volume1/backup_2tb_wd/Holiday/

   To spot a link:  os.path.islink('path+filename')    # returns True if link.
   To get filesize: os.path.getsize(join(root, name))  # returns bytes as integer.

   Why links? Because os.link doesn't like soft links. The hard links will
   survive, but any soft links will leave you in a mess.

   Then we select 1 record from the distinct list of sha256s and update the
   value for the column "Keep" to "Y". To make sure that we do not catch a
   symlink we check that it is not a link.

2. Now we cycle through the records in the following manner:
3. Now I would like to know how much space you saved. So we create a summary:
"""

def hashfile(afile, blocksize=2*1024*1024):  # read in 2 MB chunks
    with open(afile, 'rb') as f:
        buf = [1]
        shasum = hashlib.sha256()
        while len(buf) > 0:
            buf = f.read(blocksize)
            shasum.update(buf)
    return str(shasum.hexdigest())  # hashlib.sha256('foo').hexdigest()

def convert_to_a_lowercase_set(alist):
    return set(item.lower() for item in alist)

def get_the_data(path_to_search, file_exts):
    file_exts = convert_to_a_lowercase_set(file_exts)
    data = []
    shas = set()
    for root, dirs, files in os.walk(path_to_search):
        for name in files:
            # match on the real extension so that four-letter ones such as "tiff" work
            if os.path.splitext(name)[1][1:].lower() in file_exts:
                filepath = os.path.join(root, name)
                filename = name
                link = os.path.islink(filepath)  # returns True or False
                if link == False:
                    size = os.path.getsize(filepath)  # returns int
                    sha256 = hashfile(filepath)  # returns hexadecimal
                    if sha256 not in shas:
                        shas.add(sha256)
                        keep = True  # we keep the first found original file.
                    else:
                        keep = False  # duplicate: to be replaced by a hard link.
                else:
                    size = 0
                    sha256 = 'aaaaaaaaaaaaaaaaaaa'  # placeholder, symlinks are not hashed
                    keep = False
                data.append((sha256, filename, keep, link, size, filepath))  #! order matters!
    return data

def writeCSVfile(data, datafile):
    with open(datafile, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(('sha256', 'filename', 'keep', 'link', 'size', 'filepath'))
        writer.writerows(data)

def spaceSaved(data):
    return sum([row[4] for row in data if row[2]==False])

def relinkDuplicateFiles(data):
    sha256s = (row for row in data if row[2]==True)  # one kept row per unique sha256
    for sha in sha256s:
        original_file = sha[5]
        redundant_copies = [row[5] for row in data if row[0]==sha[0] and row[2]==False and row[3]==False]
        for record in redundant_copies:
            os.remove(record)
            os.link(original_file, record)

def main():
    # (0) Loading your starting values.
    path_to_search = r'/volume1/backup_2tb_wd/'
    datafile = path_to_search+'data.csv'
    file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]
    # (1) Get the data
    print "getting the data...\nThis might take a while..."
    data = get_the_data(path_to_search, file_exts)
    # (2) Hard link duplicates instead of having redundant files.
    msg = """
    --------------------
    Data captured. Initiate Relinking of redundant files...?
    Options:
    Press D + enter to view data file and exit
    Pres
```
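One caveat with `relinkDuplicateFiles`: it removes the duplicate before it creates the link, so if `os.link` fails in between (for example because the two paths sit on different filesystems) that file name is simply gone. A hedged sketch of a safer ordering, assuming Python 3.3+ for `os.replace` and a POSIX filesystem; the helper name and the `.linktmp` suffix are made up for illustration:

```python
import os


def replace_with_hard_link(original, duplicate):
    """Replace `duplicate` with a hard link to `original` without a gap."""
    if os.path.samefile(original, duplicate):
        return  # already the same inode, nothing to do
    tmp = duplicate + ".linktmp"  # hypothetical temporary name
    os.link(original, tmp)        # raises before anything has been deleted
    os.replace(tmp, duplicate)    # swaps the new link in over the old copy
```

The duplicate's name keeps pointing at valid data the whole time: either the old copy or the new hard link.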
Code Snippets
bjorn@EEEbox:~/ownCloud/Test$ ls -liR
.:
total 44
3541027 drwxr-xr-x 4 bjorn bjorn 4096 Jun 26 13:50 2001
3541474 drwxr-xr-x 2 bjorn bjorn 4096 Jun 26 17:25 2001b
3542165 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:35 data(after).csv
3542163 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:34 data(before).csv
3542168 -rw-rw-r-- 1 bjorn bjorn 8036 Jun 26 17:52 data.csv
3542164 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:27 data (org).csv
3542166 -rwxrw-r-- 1 bjorn bjorn 571 Jun 26 16:57 findhardlinks.sh
./2001:
total 944
3541401 -rw-r--r-- 1 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541320 -rw-r--r-- 1 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541261 -rw-r--r-- 1 bjorn bjorn 64209 Apr 23 18:10 05.jpg
3541234 -rw-r--r-- 1 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541454 -rw-r--r-- 1 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541694 -rw-r--r-- 1 bjorn bjorn 78251 Apr 23 18:10 08.jpg
3541393 -rw-r--r-- 1 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541737 -rw-r--r-- 1 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541790 -rw-r--r-- 1 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541086 -rw-r--r-- 1 bjorn bjorn 74453 Apr 23 18:11 12.jpg
3541028 drwxr-xr-x 3 bjorn bjorn 4096 Jun 26 17:26 2001
./2001/2001:
total 1216
3541920 -rw-r--r-- 1 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541854 -rw-r--r-- 1 bjorn bjorn 95391 Apr 23 18:10 01.jpg
3541415 -rw-r--r-- 1 bjorn bjorn 68238 Apr 23 18:11 02.jpg
3541196 -rw-r--r-- 1 bjorn bjorn 74282 Apr 23 18:11 03.jpg
3541834 -rw-r--r-- 1 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink4.jpg
3541871 -rw-r--r-- 1 bjorn bjorn 64209 Apr 23 18:10 05.jpg
3541461 -rw-r--r-- 1 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541560 -rw-r--r-- 1 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541670 -rw-r--r-- 1 bjorn bjorn 78251 Apr 23 18:11 08.jpg
3541441 -rw-r--r-- 1 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541863 -rw-r--r-- 1 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541836 -rw-r--r-- 1 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541841 -rw-r--r-- 1 bjorn bjorn 74453 Apr 23 18:10 12.jpg
./2001b:
total 312
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04hardlink.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541961 -rw-r--r-- 1 bjorn bjorn 1220 Jun 26 14:02 04.lnk
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink2.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink3.jpg
3541544 -rw-r--r-- 6 bjorn bjorn 33055 Apr 23 18:10 04pyoslink.jpg
3542167 lrwxrwxrwx 1 bjorn bjorn 14 Jun 26 17:16 04softlink.jpg -> ./2001b/04.jpg
3541475 -rw-r--r-- 2 bjorn bjorn 64209 Jun 26 17:20 05hardlink.jpg
3541475 -rw-r--r-- 2 bjorn bjorn 64209 Jun 26 17:20 05.jpg

bjorn@EEEbox:~/ownCloud/Test$ ls -liR
.:
total 44
3541027 drwxr-xr-x 4 bjorn bjorn 4096 Jun 26 18:04 2001
3541474 drwxr-xr-x 2 bjorn bjorn 4096 Jun 26 18:04 2001b
3542165 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:35 data(after).csv
3542163 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:34 data(before).csv
3542168 -rw-rw-r-- 1 bjorn bjorn 8036 Jun 26 17:52 data.csv
3542164 -rw-rw-r-- 1 bjorn bjorn 7054 Jun 26 16:27 data (org).csv
3542166 -rwxrw-r-- 1 bjorn bjorn 571 Jun 26 16:57 findhardlinks.sh
./2001:
total 944
3541401 -rw-r--r-- 2 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05.jpg
3541234 -rw-r--r-- 2 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541454 -rw-r--r-- 2 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541694 -rw-r--r-- 2 bjorn bjorn 78251 Apr 23 18:10 08.jpg
3541393 -rw-r--r-- 2 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541737 -rw-r--r-- 2 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541790 -rw-r--r-- 2 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541086 -rw-r--r-- 2 bjorn bjorn 74453 Apr 23 18:11 12.jpg
3541028 drwxr-xr-x 3 bjorn bjorn 4096 Jun 26 18:04 2001
./2001/2001:
total 1216
3541401 -rw-r--r-- 2 bjorn bjorn 347991 Apr 23 18:10 008_05a.jpg
3541854 -rw-r--r-- 1 bjorn bjorn 95391 Apr 23 18:10 01.jpg
3541415 -rw-r--r-- 1 bjorn bjorn 68238 Apr 23 18:11 02.jpg
3541196 -rw-r--r-- 1 bjorn bjorn 74282 Apr 23 18:11 03.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink4.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05.jpg
3541234 -rw-r--r-- 2 bjorn bjorn 70573 Apr 23 18:10 06.jpg
3541454 -rw-r--r-- 2 bjorn bjorn 70906 Apr 23 18:11 07.jpg
3541694 -rw-r--r-- 2 bjorn bjorn 78251 Apr 23 18:10 08.jpg
3541393 -rw-r--r-- 2 bjorn bjorn 61995 Apr 23 18:11 09.jpg
3541737 -rw-r--r-- 2 bjorn bjorn 67659 Apr 23 18:10 10.jpg
3541790 -rw-r--r-- 2 bjorn bjorn 68620 Apr 23 18:11 11.jpg
3541086 -rw-r--r-- 2 bjorn bjorn 74453 Apr 23 18:11 12.jpg
./2001b:
total 312
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04hardlink.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04.jpg
3541961 -rw-r--r-- 1 bjorn bjorn 1220 Jun 26 14:02 04.lnk
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink2.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink3.jpg
3541544 -rw-r--r-- 8 bjorn bjorn 33055 Apr 23 18:10 04pyoslink.jpg
3542167 lrwxrwxrwx 1 bjorn bjorn 14 Jun 26 17:16 04softlink.jpg -> ./2001b/04.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05hardlink.jpg
3541475 -rw-r--r-- 4 bjorn bjorn 64209 Jun 26 17:20 05.jpg

:~$ python3.4 /path/to/directory/root/that/needs/cleanup

def clean_up(root_path, dryrun=False, verbose=False):
    seen_files = {}
    for root, dirs, files in walk(root_path):
        for fname in files:
            fpath = path.join(root,fname)
            link = path.islink(fpath)
            if not link:
                s256 = sha256sum(fpath)
                if s256 not in seen_files:
                    seen_files[s256] = fpath # we've found a new file!
                else:
                    old_pointer = fpath # there's a new name for a known file.
                    new_pointer = seen_files[s256] # let's save the space by symlinking, but keep the name.

Context
StackExchange Code Review Q#27652, answer score: 4