HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Cross-platform file hash sum generator in Python

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
platformfilecrosshashgeneratorpythonsum

Problem

I'm back for more community punishment review of my own scripts and code! This time, I'm looking for general code review of my approach to a Python way of getting one or more hash sums for a provided file path (unfortunately, it's one-file-at-a-time, and only accepts one file path at a time). Hash sum selection is done via argument flags, and if none are present, the default is to run MD5 and SHA1 hash sum functions and return only those.

Note that this script is designed to generate hashes, and not compare them with other hashes. It does not include the corresponding "compare" function that many hash summing tools provide, and it is not meant to - I'm writing something else for this comparison functionality.

Also note that this uses a third party module, but I am considering removing it in the future (I needed a 'failure' code status to process easily...)

I have confirmed this code works in Windows, Mac, and *nix variants, and generates valid hash sums, the same as any file hashing program that is distributed for operating systems. I have verified this, using built-ins on Linux/Unix, built-ins on Mac, and Windows-based programs such as WinMD5Sum or Rapid CRC Unicode to generate the hashes for a file, and comparing them to the outputs from my script. This is a good thing, so I know it works and correctly processes files and returns accurate hashes. I also permit specification of what hashes to run, by argparse and optional 'argument flag' arguments, and all that works - this is useful for people that need more than one hash function, and can get the file hash all in one program rather than

A few notes before we go into this about some fears I have, and what issues I think may / may not exist. While this can be used for some specific review focuses, please feel free to consider this as guideposts, and not an all-inclusive list of review points I need:

  • This script is designed to process files for hashing in chunks. If the provided file size is greater

Solution

Prefer a dictionary to if/elseif/else cascade:

digests = { "md5": hashlib.md5, "sha1": hashlib.sha1, ... }


then compute_hash streamlines to

try:
        filehash = digests[digest]()
        with open(filepath, "rb") as filename:
        ....
    except KeyError:
        raise TypeError("Invalid digest type ...")


Notice that you may reuse the same dictionary in your main:

for key in digests.keys():
        if key in SUMS_TO_RUN:
            print "Generating ", key, "sum, this could take some time"
            ....


Rewriting the file path is absolutely not necessary. More, testing for the file to exist is also not necessary: you may safely rely on open to fail if the file doesn't exist. In fact, testing beforehand is subject to TOC-TOU race.

There is no need to special-case "small" files. It is effectively the same thing as the last (incomplete) chunk of a "large" file. Consequently, there is no need to determine file size beforehand (which is yet another TOC-TOU suspect).

Parsing arguments is perfectly valid job for main. I don't see the reason to lift it out.

Code Snippets

digests = { "md5": hashlib.md5, "sha1": hashlib.sha1, ... }
try:
        filehash = digests[digest]()
        with open(filepath, "rb") as filename:
        ....
    except KeyError:
        raise TypeError("Invalid digest type ...")
for key in digests.keys():
        if key in SUMS_TO_RUN:
            print "Generating ", key, "sum, this could take some time"
            ....

Context

StackExchange Code Review Q#138394, answer score: 5

Revisions (0)

No revisions yet.