Short script to hash files in a directory
Problem
What improvements could be made to this script and can you explain why you would make those changes?
```python
import hashlib
import os
import time

time = time.strftime("%Y%m%d-%H%M%S")
Masterhash = open("Filehash" + time, "a")
# get a list of files from your directory
for filename in os.listdir('.'):
    # ignore directory names
    if not os.path.isdir(filename):
        # open the file for reading
        primingfiletohash = open(filename, "r")
        # read contents of file
        filetohash = primingfiletohash.read()
        # close the file
        primingfiletohash.close()
        # use hashlib to generate an md5 hash of the file contents
        Filehash = hashlib.md5(filetohash)
        Filehash = Filehash.hexdigest()
        # write file names and hashes to a file
        Masterhash.write(filename + "\t")
        Masterhash.write(Filehash + "\n")
# close the file
Masterhash.close()
```

Solution
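Before the review proper, one behavior worth noting (this note and snippet are an addition, not part of the original answer): in Python 3, `hashlib.md5()` accepts only bytes-like objects, so hashing the result of a text-mode `read()` raises a `TypeError`. This is one reason the reviewed code below opens files in binary mode:

```python
import hashlib

text = "some file contents"

# In Python 3, hashing a str raises TypeError: the data must be bytes.
try:
    hashlib.md5(text)
    raise AssertionError("expected a TypeError")
except TypeError:
    pass

# Encoding to bytes first (or reading the file with open(name, "rb")) works.
digest = hashlib.md5(text.encode("utf-8")).hexdigest()
assert len(digest) == 32  # an MD5 hex digest is always 32 hex characters
```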
First of all, I'd make a variable which will keep the current path. For that, I'd use `os.getcwd()`:

```python
current_path = getcwd()
```

From the docs:

> Return a string representing the current working directory.

After this, I'd normalize the path, so that your script will be able to run on multiple OSs:

```python
current_path = normpath(getcwd())
```

From the docs:

> Normalize a pathname by collapsing redundant separators and up-level references so that `A//B`, `A/B/`, `A/./B` and `A/foo/../B` all become `A/B`. This string manipulation may change the meaning of a path that contains symbolic links. On Windows, it converts forward slashes to backward slashes.

Then, I'd get a list of all the files in the directory using `listdir()`:

```python
only_files = [join(current_path, f) for f in listdir(current_path) if isfile(join(current_path, f))]
```

After getting the files, I'd wrap everything within a function:

```python
def get_files():
    current_path = normpath(getcwd())
    return [join(current_path, f) for f in listdir(current_path) if isfile(join(current_path, f))]
```

Then, I'd make another function which will return a list of hashes for all the files in the list:

```python
def get_hashes():
    files = get_files()
    list_of_hashes = []
    for each_file in files:
        hash_md5 = hashlib.md5()
        with open(each_file, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        list_of_hashes.append('Filename: {}\tHash: {}\n'.format(basename(each_file), hash_md5.hexdigest()))
    return list_of_hashes
```

After this, I'd just write this list of hashes in the desired file:

```python
def write_hashes():
    hashes = get_hashes()
    with open('list_of_hashes.txt', 'w') as f:
        for md5_hash in hashes:
            f.write(md5_hash)
```

Finally, the code would look like this:

```python
from os import listdir, getcwd
from os.path import isfile, join, normpath, basename
import hashlib


def get_files():
    current_path = normpath(getcwd())
    return [join(current_path, f) for f in listdir(current_path) if isfile(join(current_path, f))]


def get_hashes():
    files = get_files()
    list_of_hashes = []
    for each_file in files:
        hash_md5 = hashlib.md5()
        with open(each_file, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        list_of_hashes.append('Filename: {}\tHash: {}\n'.format(basename(each_file), hash_md5.hexdigest()))
    return list_of_hashes


def write_hashes():
    hashes = get_hashes()
    with open('list_of_hashes.txt', 'w') as f:
        for md5_hash in hashes:
            f.write(md5_hash)


if __name__ == '__main__':
    write_hashes()
```

Now, why I'd do it this way:
- Compare your code with this new one, and tell me which one is more readable.
- You should read PEP 8, which is the standard style guide for Python.
- You should use `with` when opening a file, as this will make sure the file is closed when needed.
- I've used `for chunk in iter(lambda: f.read(4096), b"")`, which is a better approach for hashing large files (sometimes you won't be able to fit the whole file in memory; in that case, you'll have to read chunks of 4096 bytes sequentially).
- I've split the logic of your main program into multiple functions, which I'll let you document using docstrings. E.g.:

  ```python
  def get_files():
      """Returns a list of all files in the current directory"""
      # ...
  ```

- You can see that I also added `if __name__ == '__main__'`. By doing the main check, you can have that code only execute when you want to run the module as a program, and not have it execute when someone just wants to import your module and call your functions themselves.
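The chunked-read idiom from the bullets above can be exercised in isolation; this standalone sketch (using an in-memory `BytesIO` buffer in place of a real file handle) shows how `iter()` with a sentinel keeps memory use bounded while producing the same digest as hashing everything at once:

```python
import hashlib
import io

# Simulate a file larger than one chunk; BytesIO stands in for open(path, "rb").
data = b"x" * 10000
f = io.BytesIO(data)

hash_md5 = hashlib.md5()
# iter(callable, sentinel) calls f.read(4096) until it returns b"" (EOF),
# so at most 4096 bytes are held in memory per iteration.
for chunk in iter(lambda: f.read(4096), b""):
    hash_md5.update(chunk)

# The incremental digest matches hashing the whole buffer in one call.
assert hash_md5.hexdigest() == hashlib.md5(data).hexdigest()
```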
To be considered:

MD5 is known broken and (IMHO) should come with scary deprecation warnings and be removed from the library, so you should actually do the hashing using `sha256` from the same module.

Remember:

Each statement that you write is important, and if changing the name of a local variable / method / anything to a descriptive name makes your code a lot clearer and more readable, then do it!
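The original answer doesn't show the `sha256` version, but the swap is mechanical. A minimal sketch of `get_hashes()` using `hashlib.sha256()` (same chunked loop, different constructor; I've also parameterized the file list here for testability, whereas the answer's version calls `get_files()` internally):

```python
import hashlib
from os.path import basename


def get_hashes(files):
    """Return 'Filename\tHash' lines using SHA-256 instead of MD5."""
    list_of_hashes = []
    for each_file in files:
        hash_sha256 = hashlib.sha256()  # the only change vs. hashlib.md5()
        with open(each_file, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_sha256.update(chunk)
        list_of_hashes.append('Filename: {}\tHash: {}\n'.format(
            basename(each_file), hash_sha256.hexdigest()))
    return list_of_hashes
```

The resulting hex digests are 64 characters long instead of MD5's 32, so any downstream parsing of the hash file should not assume a fixed digest width.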
Context
StackExchange Code Review Q#147056, answer score: 3