HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Fast hashing: combination of different techniques to identify changes in a file?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
fastcombinationfileidentifydifferentchangeshashingtechniques

Problem

I want to create a fast way to detect whether a file might or might not be the same. For almost 100% sureness I would use an existing hash algorithm, e.g. SHA256.
However, the files are expected to be huge video files with several GB, so calculating the SHA256 hash could take some time, especially over the network.

Therefore I want to combine different other techniques:

  • file size: if the file size has changed, the content has changed (sure)



  • head / tail hash



  • random hash



The latter 2 are part of my question:

My guess would be that in the header there are things like:

  • frame rates (e.g. Videos)



  • resolution (e.g. Videos, Images)



  • (file) length (e.g. in frames, pixels etc.)



  • last change date (e.g. Word documents, not specifically Videos)



Why I consider checking the tail is:

  • MP3 has the tag information there



  • EXIF adds custom data at the end if I'm right



Random hashes would select e.g. 126 regions at random positions in the file with a specific length, e.g. 64 kB and create a hash for them. Of course I remember the offsets for later comparison.
All in all I would use (1+126+1)*64 kB of data for my hash, so I need to read only 8 MB instead of several GB to get the hash.

Maybe it's more a Math question now, but: how likely is it to detect a change using the combination of file size, head, tail and random data to generate this quick hash sum?

I assume that the files are always legal files. There's no benefit in manipulating single bytes. The user would use a normal video editing tool to change the files.

UPDATE:
I unaccepted this answer which came from Crypto.StackExchange. I agree that my proposal is not cryptographic and not intended to be secure. I also agree that CRCing a file is fast, but in my case I really need a hash - I'll explain why:

  • My application is expected to save bookmarks in videos. My database is expected to save the video hash and the bookmarks.



  • Users sometimes move or rename files. My program will notice that a file doe

Solution

There are two sides to your coin:

  • if you want to do it secure, you will need to use a cryptographically secure hash like SHA256 (crypto-hashes are meant to be fast, but tend to be a bit slow due to security constraints),



  • things like CRCs are definitely quicker, but will never be able to offer the same kind of security (especially when we’re talking about .



Option 1: CRCs — Doing it quickly at the price of security:

If you're just after the detection of changes, go for a checksum instead of a hash. That's what checksums were made for: quickly detecting changes in a file or data-stream. But keep in mind that CRC was designed to prevent transmission errors, not malicious action!

Practically, CRC32 is the most obvious candidate (but even an additive CRC8 would do the job if you only want to detect if something has changed and don't expect anything else than that from the CRC.)

Option 2: Beyond CRCs — Doing it rather quickly while enhancing change-detection:

Other valid options (looking at @poncho's comment) are indeed to simply check the last-mod timestamp.

Or, you combine both (to prevent bottlenecks), using something like this pseudo-code shows:

if(LastMod != knownLastMod) { CreateNewCRCandCompare(FileName, knownCRC) };


But does this offer any real security? No. Same goes for your…


Why I consider checking the tail is:

- MP3 has the tag information there

- EXIF adds custom data at the end if I'm right

Again, it depends on how much security you expect. You have to realize that an adversary will surely manipulate the file to keep (or copy-and-paste) any old ID3 and EXIF data… as anyone (with appropriate RW file-access rights) can modify that. Same goes for the Last-Modification timestamp, frame rates, resolution, last change date, and even the (file) length. Depending on that “additional” and “modifiable” data — which can be modified and removed by anyone with enough file-access rights — would introduce a security flaw.

But you do expect security, don't you? After all, that's the reason why you're thinking about all this in the first place. Well, then there is not way around using crypto-secure hashes…

Option 3: Cryptographically Secure Hashes — Doing it securely at the price of speed:

If you expect real security, you will have to rely on hashing; to be more precise: cryptographically secure hashing (using a hash which is not known to produce collisions). It takes time (a few microsecs per MB) but it's worth it.

My 2 (personal) cents:

Try to live with the fact that hashing costs time and hash the whole files with a cryptographically secure hash. Because, when stuff starts hitting the fan… you're better off being slow, instead of being sorry.

EDIT based on your EDIT…

If cryptographic security isn’t your main focus, you could look at MD5 or SHA1. Both MD5 and SHA1 are “cryptographically broken” because collisions have been detected… yet for the change-detection purposes you describe (especially after your EDIT), the likely-ness of hitting such a collision should be minimal enough.

Looking at everything again (including your EDIT), I personally would most probably use MD5, because it offers a usable collision resistance (for change-detection purposes) while still being fast enough to completely hash multi-gigabyte files.

If that still doesn’t satisfy you in a “speed” sense or if your hardware resources are really that limited, you have to try to balance collision-resistance/change-detection with speed. Meaning…

Take the individual timestamp, the individual filename, and hash the header (length depends on media type and used file format) as well as a good chunk from the middle and a good chunk of the tail (= file end). Combine those 5 and you should be able to roughly filter out most


I'd be ok with a ~80% chance of having the correct bookmarks. How many hash pieces should I put together and where in the file would that be?

That’s more of a personal opinion, as it depends on a whole truckload of details (media type, file format, available resources, expected change-detection ratio, file similarity, etc.) so will have to balance that out yourself depending on your personal expectations, your implementations, and local results due to hardware and/or software bottlenecks.

Let me try to provide you with some guidance nevertheless:

If hashing the complete file isn’t an option for whatever reasons, I would – at least – take: the header (and maybe a few KBs more), a good chunk from the middle (at least the size of the “header & co.” part), and a good chunk from the file end (again, at least the size of the “header & co.” part).

The more resources you can invest (or are willing to invest), the more chunks you can take and/or the bigger those chunks can be. If you think your resources/feel/whatever still offers room for more, increase the size of the chunks you hash and/or increase the number of chunks you hash.

Increasing the number of chunks is easy: as all you

Code Snippets

if(LastMod != knownLastMod) { CreateNewCRCandCompare(FileName, knownCRC) };

Context

StackExchange Computer Science Q#19042, answer score: 8

Revisions (0)

No revisions yet.