patternpythonMinor
Looking for identical files (or directories) in a filesystem
Viewed 0 times
identicallookingfilesforfilesystemdirectories
Problem
Dilemma
The code below is in its infancy, but for now is an effort to crawl my entire file system and find duplicates whilst adhering to a variety of exclusion criteria. My current strategy for determining if two files are identical—comparing their checksums—seems unfeasible because it takes far too long to run (days, in fact).
Context (Optional Reading)
I have 6 drives in my machine from as early as 2008 to three months ago; collectively it's 7TB of storage that is littered in "Recovery" and "Stuff" directories; my computer-savvy has been steadily increasing since I was a kid, and now I'm a pro-am web-developer dabbling in real programming. Professionally I manage a cognitive psych laboratory that's has never thrown anything away, ever. Here I have still more TBs of data that is multiply redundant, and so I'm tentatively developing this to be worthy as an open source project.
Eventually, I want to:
Probably some software has already been written that will do lots of what I want to do; this is much an exercise in writing software with a larger scope than the practical matter of achieving my ends.
Ideas & Progress
It occurs to me that their might be heuristical approaches that could cut down the workload—possibly I could only create checksums for files that have been inferred to be identical. The two criterion I can think of:
-
comparing filenames: Since many files are identical but for the filenames, this is only helpful if fallbacks exist to catch other cases
-
comparing filesize: It occurs to me that this might always work, ie. if two files have the same size only create & compare checksums between them
Also, often entire directories are identical, say two copies of CakePHP that each have a few thousand library files. Is there a way t
The code below is in its infancy, but for now is an effort to crawl my entire file system and find duplicates whilst adhering to a variety of exclusion criteria. My current strategy for determining if two files are identical—comparing their checksums—seems unfeasible because it takes far too long to run (days, in fact).
Context (Optional Reading)
I have 6 drives in my machine from as early as 2008 to three months ago; collectively it's 7TB of storage that is littered in "Recovery" and "Stuff" directories; my computer-savvy has been steadily increasing since I was a kid, and now I'm a pro-am web-developer dabbling in real programming. Professionally I manage a cognitive psych laboratory that's has never thrown anything away, ever. Here I have still more TBs of data that is multiply redundant, and so I'm tentatively developing this to be worthy as an open source project.
Eventually, I want to:
- find & remove redundant file/directory
- programmatically 'tag' files (ie. the feature by the same name new to OS X as of 10.9)
- develop an API for manipulating and organizing my files as per my whims, essentially
Probably some software has already been written that will do lots of what I want to do; this is much an exercise in writing software with a larger scope than the practical matter of achieving my ends.
Ideas & Progress
It occurs to me that their might be heuristical approaches that could cut down the workload—possibly I could only create checksums for files that have been inferred to be identical. The two criterion I can think of:
-
comparing filenames: Since many files are identical but for the filenames, this is only helpful if fallbacks exist to catch other cases
-
comparing filesize: It occurs to me that this might always work, ie. if two files have the same size only create & compare checksums between them
Also, often entire directories are identical, say two copies of CakePHP that each have a few thousand library files. Is there a way t
Solution
Sort your imports
Your
Personally I'd be tempted to use
Your
Having
is not good. Decide on what type it accepts and stick with it.
Further, with
or the equivalent without - using
I don't get why
Your
The
This looks longer since there is no
You then have
which are unused (and look like a terrible idea).
You then have
I can live with enums, although only with justification. You also probably shouldn't be using integer enums, either. Try
But you only use this here:
and don't define
You also have an absurd default for invalid input. NEVER DO THIS! Bad input should never be silently ignored!
It looks to me like you should have
with
You don't use
either.
IMPORTANT:
Class variables are like static variables. Do not initialize data on the class. Doing so can only cause sadness. Especially don't initialize variables with incorrect data. That is a surefire way to cause bugs.
In
But none of these variables exist! You don't even use
Wrap your lines. These are way too long.
Also, don't reinvent JSON serialization. Use
Your
import datetime
import hashlib
import os
import pickle
import shutil
import sys
import timeYour
now function should probably use datetime.datetime.now, and can deduplicate its branches:def now(include_time=True):
format_as = '%Y-%m-%d %H:%M:%S' if include_time else '%Y-%m-%d'
datetime.datetime.now().strftime(format_as)Personally I'd be tempted to use
format instead of strftime - format is more general and seems nicer overall:def now(include_time=True): format_as = '%Y-%m-%d %H:%M:%S' if include_time else '%Y-%m-%d'
format(datetime.datetime.now(), format_as)Your
path_dir_list function suggests you should use pathlib. It supports Python 2, but it's in Python 3 by default from 3.4 onwards.def path_dir_list(path):
return PurePath(path).partsHaving
make_missing_dirs dotry:
os.makedirs(path)
except AttributeError:
os.makedirs(os.path.join(*path))is not good. Decide on what type it accepts and stick with it.
Further, with
pathlib the whole thing can be justdef make_missing_dirs(path):
path = Path(path)
for path in reversed(path.parents):
try:
path.mkdir()
except OSError:
passor the equivalent without - using
makedirs (or pathlib.Path(...).mkdir(parents=True)) just duplicates work.I don't get why
parents=True shouldn't be enough, though, so I recommend sticking withdef make_missing_dirs(path):
Path(path).mkdir(parents=True)Your
is_subpath seems to be broken:is_subpath("/home", ["/ho"])
#>>> TrueThe
pathlib way to do this would probably bedef is_subpath(path, of):
try:
path.relative_to(of)
except ValueError:
return False
else:
return True
def is_subpath_any(path, of_paths):
path = Path(path).resolve()
return any(is_subpath(path, Path(of).resolve()) for of in of_paths)This looks longer since there is no
is_relative_to in pathlib and because I use shorter lines, but at least it should be correct. However, this resolves with OS calls, so will fail for nonexistent paths.You then have
DIR = "d"
FILE = "f"which are unused (and look like a terrible idea).
You then have
LIST_MODE = 0
SAFE_BUILD_MODE = 1I can live with enums, although only with justification. You also probably shouldn't be using integer enums, either. Try
= object() or, even better, use an Enum backport.But you only use this here:
self.mode = mode if mode in [LIST_MODE, FAST_BUILD_MODE, SAFE_BUILD_MODE, COPY_MODE] else self.SAFE_BUILD_MODEand don't define
FAST_BUILD_MODE or COPY_MODE. This obviously can't run then, so remove those.You also have an absurd default for invalid input. NEVER DO THIS! Bad input should never be silently ignored!
It looks to me like you should have
self.safe_build_mode = safe_build_modewith
safe_build_mode=False in the argument list.You don't use
OS_FILES = [".DS_Store", ".", "..", ".localized"]either.
IMPORTANT:
FileInspection has loads of class variables that you don't use. This demonstrates a misunderstanding of how classes work in Python. Python is not Java.Class variables are like static variables. Do not initialize data on the class. Doing so can only cause sadness. Especially don't initialize variables with incorrect data. That is a surefire way to cause bugs.
In
log you havedef log(self):
json = self.json_template.format(self.__hash, self.__path, self.__size, self.__created, self.__last_mod, self.__has_duplicates,
self.__fetched, self.__checked)
return json.replace(":False", ":false").replace(":True",":true")But none of these variables exist! You don't even use
checked or is_original! And even if they did exist, there's no reason to use __mangled names anyway. Never use __mangled names - they're just a silly way to make it easier to do a particularly silly thing that you shouldn't ever want to do.Wrap your lines. These are way too long.
Also, don't reinvent JSON serialization. Use
json.dumps with a sensible dictionary:def log(self):
return json.dumps({
"hash": self.hash,
"path": self.path,
"filesize": self.size,
"created:": self.created,
"last_mod": self.last_mod,
"has_duplicates": self.has_duplicates,
"duplicates_moved": self.fetched,
"checked": self.checked,
})Your
hash_file is taking self where self is not fully initialized. I recommend you move it to a plain function and give it a normal argument. You also shouldn't open files in text mode when hashing.def hash_file(path, read_size=2**20):
checksum = hashlib.md5()
with open(path, "rb") as f:
for chunk in iter(lambda: f.read(read_size), b""):
checksum.update(data)
return checksum.hexdigest()Manifest make the same error with the class varCode Snippets
import datetime
import hashlib
import os
import pickle
import shutil
import sys
import timedef now(include_time=True):
format_as = '%Y-%m-%d %H:%M:%S' if include_time else '%Y-%m-%d'
datetime.datetime.now().strftime(format_as)def now(include_time=True): format_as = '%Y-%m-%d %H:%M:%S' if include_time else '%Y-%m-%d'
format(datetime.datetime.now(), format_as)def path_dir_list(path):
return PurePath(path).partstry:
os.makedirs(path)
except AttributeError:
os.makedirs(os.path.join(*path))Context
StackExchange Code Review Q#92849, answer score: 6
Revisions (0)
No revisions yet.