Tutorial :Are fragments of hashes collision-resistent?


If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?


Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing get comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.


It is 256, not 255.

Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.

Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).

Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.

MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (american) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you shall use SHA-2 by default.


Assume we have a pre-determined message1. hash1 = md5(message1)

Now choose a message2 randomly, and set hash2 = md5(message2).

In theory there is a 1/255^4 chance that the first four characters of hash2 match the first four of pre-determined hash1.

It is also supposed to be very hard for an attacker that knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even with the full MD5, there are better than theoretical pre-image attacks.

MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.


If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the change of collisions so that in practice they should never occur.

If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:

fdadda221fd71619e6c0139730b012577dd4de90    fdadda221fd71619e6c/0139730b012577dd4de90    fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90  


Depends on the purpose of the hash.

Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Next Post »