Hash for verifying both compressed and uncompressed data?
Problem
Is it possible to have a single hash function output simultaneously verify a compressed block of data as well as its uncompressed counterpart?
Trivially one could just use a hash function twice (once on the uncompressed version and again on the compressed version) and concatenate the outputs, or use a hash function that internally does that, but I wonder if it's possible otherwise.
For example, imagine we've got these functions:
Hash getHash(byte[] data);
byte[] compress(byte[] data);
byte[] decompress(byte[] data);
bool verify(byte[] data, Hash hash);

...then I'm looking for these properties for a nontrivial byte array D (that is uncompressed):

- verify(compress(D), getHash(D)) == verify(D, getHash(compress(D))) == verify(D, getHash(D)) == true
- D == decompress(compress(D)) for all D, i.e. lossless compression
- verify(E, getHash(D)) == false "most of the time" when E != D, i.e. collision-resistant hashing
This could be useful when verifying persisted compressed data: you could verify the correctness of the compression algorithm and of the uncompressed data both at the same time without having to decompress it or store multiple hashes.
I'm not sure where to start looking for this. I think what I'm describing is some kind of homomorphism but apparently I need to level up my search skills. It seems like a Difficult Problem™; for example, the Deflate algorithm outputs data structures like a Huffman tree/table, and the mathematical relationships of serialized data structures being run through hash functions aren't immediately obvious to this OP.
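For concreteness, here is a minimal Python sketch of the trivial approach the question mentions: hash the data twice (once uncompressed, once compressed) and concatenate the digests. SHA-256 and zlib are stand-ins chosen for illustration; the function names are hypothetical, not from any library.

```python
import hashlib
import zlib

def get_hash_pair(data: bytes) -> bytes:
    """Trivial approach: hash the uncompressed data and its
    compressed form separately, then concatenate the digests."""
    h_plain = hashlib.sha256(data).digest()
    h_compressed = hashlib.sha256(zlib.compress(data)).digest()
    return h_plain + h_compressed

def verify_pair(data: bytes, tag: bytes) -> bool:
    """Accepts `data` in either form: the digest of `data` may
    match either half of the concatenated tag."""
    digest = hashlib.sha256(data).digest()
    return digest == tag[:32] or digest == tag[32:]

D = b"some nontrivial payload" * 100
tag = get_hash_pair(D)
assert verify_pair(D, tag)                  # uncompressed form verifies
assert verify_pair(zlib.compress(D), tag)   # compressed form verifies
assert not verify_pair(b"tampered", tag)    # unrelated data is rejected
```

The cost is storing two digests instead of one, which is exactly the overhead the question hopes to avoid.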
Solution
In most situations you probably don't want D and compress(D) to have the same hash value, as that leads to hash collisions. Usually hash functions are designed to avoid hash collisions -- and this is a good thing. For instance, in cryptographic contexts, any hash scheme that has the properties you ask for would automatically be considered cryptographically insecure.

Therefore, trying to design a hash function with this property is probably a bad idea. Instead, you should probably use the alternative solution: when you verify a hash value, hash both the data D and the compressed version compress(D) and see if either of those hash digests matches the expected value. That's probably a much better solution in practice.

If you absolutely must achieve this by artificially introducing collisions into your hash, here is a way you can do it. Disclaimer: this is totally theoretical; it is not useful in practice.

Define the verify procedure as follows:

Verify(D, Y):
1. If Hash(D)==Y or Hash(Compress(D))==Y or Hash(Uncompress(D))==Y, return true; otherwise, return false.

Yes, this is totally trivial.

No, I don't think you'll find anything faster than this that also satisfies the properties of a good hash function. Hash functions are specifically designed to avoid having any mathematical structure, while at the same time being very fast. Mathematical structure is considered a weakness in a hash function. In contrast, you are asking for a hash function with some structure. Inserting such artificial structure into the hash function might be possible, but the natural ways of doing it will probably either introduce lots of bias into the hash function (e.g., lots of collisions) or slow it down a lot.

To put it another way, right now hash functions are optimized to meet two criteria:

- Be fast.
- Be "random-looking" (few collisions, unbiased, etc.).

You are proposing to look for a hash function that meets three criteria:

- Be fast.
- Be "random-looking" (few collisions, unbiased, etc.).
- Also satisfy the additional properties you requested, namely that verify(compress(D), getHash(D)) == verify(D, getHash(compress(D))) == verify(D, getHash(D)) == true.

Obviously, it's harder to meet three criteria than to meet two. If you want to meet three criteria, something will have to give: in your situation, that means the resulting hash function will either be slower or less "random-looking" than standard hash functions (probably significantly so).

So, I don't expect there's any better solution for what you want. Instead, just use the natural solution you mention of hashing twice.
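The recommended approach -- keep a single, ordinary hash and try every interpretation of the data at verification time -- can be sketched in Python. SHA-256 and zlib are illustrative stand-ins for the abstract Hash/Compress/Uncompress of the answer; nothing here is specific to those choices.

```python
import hashlib
import zlib

def verify_either(data: bytes, expected: bytes) -> bool:
    """Return True if `data` matches the expected digest either
    as-is, after compressing it, or after decompressing it.
    The hash function itself stays a standard, structure-free hash."""
    if hashlib.sha256(data).digest() == expected:
        return True
    if hashlib.sha256(zlib.compress(data)).digest() == expected:
        return True
    try:
        # `data` may not be valid compressed input at all.
        decompressed = zlib.decompress(data)
    except zlib.error:
        return False
    return hashlib.sha256(decompressed).digest() == expected

D = b"payload" * 1000
tag = hashlib.sha256(D).digest()   # only one ordinary digest is stored
assert verify_either(D, tag)                 # raw data verifies
assert verify_either(zlib.compress(D), tag)  # compressed copy verifies too
assert not verify_either(b"garbage", tag)    # unrelated data is rejected
```

Note that all the "structure" lives in the verify procedure, not in the hash: the stored digest is an unmodified SHA-256 output, so nothing about the hash function's collision resistance is weakened.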
Code Snippets
Verify(D, Y):
1. If Hash(D)==Y or Hash(Compress(D))==Y or Hash(Uncompress(D))==Y, return true; otherwise, return false.

Context
StackExchange Computer Science Q#60715, answer score: 2
Revisions (0)
No revisions yet.