patternMinor
Hashing methods for validating downloaded files
Problem
The standard algorithm for generating hashes of downloadable files is MD5. For example, when Linux distributions offer ISO files, they usually also publish the MD5 sum so that you can check whether an error occurred while downloading the file.
I'm aware of the fact that MD5 should not be used for password hashing as it is fast to compute. However, in the case of downloaded files I think it is desirable if it is computed fast.
Hence my question: Which hashing algorithm is recommended? What are the relevant criteria? If it is easy to find collisions of a hashing algorithm, does that automatically imply that it should not be used in scenarios where one does not expect attacks, but only "random" errors?
Solution
There are several reasons why one might want to add some extra material to validate a downloaded file. Different reasons require different kinds of extra material.
Non-malicious corruption
A common type of corruption is that files get truncated. For this, the size is enough.
To detect accidental corruption of the downloaded file for hardware reasons (bits getting flipped, lost or reordered in transit), any kind of checksum helps. It doesn't have to involve cryptography: for this purpose, MD5 is a little overkill, but suitable. Against accidental corruption, a well-designed checksum that can take $n$ different values has a probability of $1-\frac{1}{n}$ of detecting the corruption.
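As a minimal sketch of such a non-cryptographic check, the following combines the size test with a CRC-32 checksum. The function names and the choice of CRC-32 are illustrative assumptions, not something the answer prescribes. With $n = 2^{32}$ possible values, the formula above gives a miss probability of about $2.3 \times 10^{-10}$ for random corruption.

```python
import os
import zlib


def crc32_of_file(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file, reading it in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # feed the running value back in
    return crc & 0xFFFFFFFF


def looks_intact(path, expected_size, expected_crc32):
    """Cheap check against accidental corruption only, not against an attacker."""
    if os.path.getsize(path) != expected_size:
        return False  # truncated (or padded) download
    return crc32_of_file(path) == expected_crc32
```

This offers no protection against deliberate tampering: anyone can craft a modified file with the same size and CRC-32.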
Malicious corruption, single-file scenario
Malicious corruption is another matter. Against malicious corruption, you need two things:
- A known secure channel. Without something to bootstrap security, the recipient has no way to know that what they're getting isn't wholly coming from an adversary.
- A cryptographic tie between the secure channel and the download channel.
The reason to have two separate channels is, to put it succinctly, bandwidth. The goal is to insert the downloaded file into a chain of trust that starts from a computer and its operating system, and may rely on additional assumptions, e.g. trust in a particular website.
An example of channel separation is when you trust some website (hopefully served over HTTPS), but you don't trust the channel through which the file is downloaded. In practice, these days, the file is often distributed on the same website, in which case a separate verification doesn't provide any additional security. However, separate verification can still be useful, for example if you downloaded the file in the past but aren't sure whether your local copy is pristine or whether it's the right version.
To establish a cryptographic tie, the basic method is to use a cryptographic hash function: calculate the hash of the large file distributed through the insecure channel, and distribute the hash through the secure channel. One of the properties of a cryptographic hash $H$ is that, given $x$, it's infeasible to find $x' \neq x$ such that $H(x') = H(x)$ (second preimage resistance); here this means that it's infeasible to generate a fake file that has the genuine hash.
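On the recipient's side, this could look roughly like the sketch below, which uses Python's standard `hashlib` with SHA-256. The function names are hypothetical, and `published_hex` stands for whatever hex digest you obtained through the channel you trust.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare the file's digest with the one received over the trusted channel."""
    actual = sha256_of_file(path)
    # Normalize whitespace and case; compare_digest avoids timing side channels,
    # which hardly matters for a public hash but costs nothing.
    return hmac.compare_digest(actual, published_hex.strip().lower())
```

The check is only as trustworthy as the source of `published_hex`: if the digest arrives over the same untrusted channel as the file, an attacker can replace both.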
MD5 was designed to be a cryptographic hash function, but it is now known not to be one. Its second preimage resistance is not broken yet, but due to the weaknesses that have been found in MD5, the cryptography community no longer trusts it for anything. What has been broken for MD5 is its collision resistance, i.e. there are cheap ways of generating $x \ne x'$ such that $\mathrm{MD5}(x) = \mathrm{MD5}(x')$. (Note the difference with second preimage resistance: here the adversary does not get to choose the hash value.)
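As a rough yardstick (these are the standard generic bounds for an ideal $n$-bit hash, not figures from the measurements of any particular attack), the brute-force costs of the two properties differ enormously:

$$\text{collision: } \approx 2^{n/2} \text{ evaluations (birthday bound)}, \qquad \text{second preimage: } \approx 2^{n} \text{ evaluations}.$$

For MD5, $n = 128$, so the generic costs would be about $2^{64}$ and $2^{128}$ respectively. The known collision attacks on MD5 are far cheaper than $2^{64}$, which is what makes its collision resistance broken, whereas no comparably practical shortcut is known for second preimages.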
In addition to the risk that an attack would be found on second preimage resistance, there's a reason not to use MD5 to verify that something is genuine. Because collisions can be generated, it's possible to perform a bait-and-switch attack:
- Generate a file with “good” contents (e.g. well-behaved software), and another file with “bad” contents (e.g. malware).
- Have people review the good file and give it a good reputation.
- Distribute the bad file.
This attack can be performed without the distributor being aware, if they're redistributing third-party content (e.g. a picture whose bad version triggers an exploit in an image viewer).
The state of the art of cryptographic hash functions is the SHA-2 family (generally SHA-256 or SHA-512). (There are a few alternatives but they don't get used much in the real world as of 2017.) Cryptographic hash functions are designed for speed, and there is no meaningful difference between MD5 and SHA-256 in practice in most scenarios — I/O is usually the limiting factor, not computation.
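To check the speed claim on your own machine, a quick and unscientific measurement along these lines may help. The helper name, buffer size and repeat count are arbitrary assumptions, and on some restricted (e.g. FIPS-mode) Python builds `hashlib.new("md5")` may be unavailable.

```python
import hashlib
import time


def throughput_mb_per_s(algorithm, data, repeats=3):
    """Best-of-N in-memory hashing throughput for one algorithm, in MB/s."""
    best = float("inf")
    for _ in range(repeats):
        h = hashlib.new(algorithm)
        start = time.perf_counter()
        h.update(data)
        h.digest()
        best = min(best, time.perf_counter() - start)
    return len(data) / best / 1e6


if __name__ == "__main__":
    data = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, already in memory
    for name in ("md5", "sha256", "sha512"):
        print(f"{name}: {throughput_mb_per_s(name, data):.0f} MB/s")
```

Because the data is already in memory, this measures computation only; comparing the printed figures with your disk or network throughput shows which one is the actual bottleneck for a download.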
Password hashes are a different kind of function (closely related to key stretching) which are used in a specific case: when the data to be hashed has to be secret and cannot be encrypted (in the case of a password, the objective is to make the password hard to guess even if the adversary obtains all the information that the legitimate verifier has, including any encryption keys). Password hashes need to be slow to limit the impact of brute force guess attempts. Hashes used in other scenarios don't need this and are designed for speed.
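For contrast with the file-hashing case, here is a minimal sketch of a deliberately slow password hash using PBKDF2 from Python's standard library. PBKDF2 is chosen here only as one readily available key-stretching function; the answer does not name a specific one, and the iteration count is an illustrative assumption to be tuned for your hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # illustrative work factor (assumption); tune for your hardware


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, iterations, digest) for storage alongside the account."""
    if salt is None:
        salt = os.urandom(16)  # random per-password salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected):
    """Recompute the slow hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

The iteration count is exactly the slowness that would be unwelcome when hashing a multi-gigabyte download, which is why the two kinds of function are kept separate.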
A cryptographic hash function is a way to expand integrity. Assuming the integrity of the channel through which you obtain the hash, you get a guarantee of integrity of the hashed data.
A hash does not guarantee authenticity: it doesn't say who made this file. Anybody can calculate the hash of a file. To know that the file is authentic, you need to trust the authenticity of the channel through which you receive the hash.
Malicious corruption: many-files scenario
There is another useful cryptographic primitive that is useful to distribute files: sig
Context
StackExchange Computer Science Q#75820, answer score: 2