Score:1

how high is the possibility of getting a hash collision in text files?

in flag

Just for an example, let's say I downloaded "the adventures of tom sawyer" from gutenberg in .txt file format and saved it to my usb thumb drive.

And as you can see, usb drive is not an ideal device for long term data retention. But if I insist on using it, there's possibility any files in my storage would finally be corrupted after long time without powering it up.

So what I will do now is to save hash of the file when first saving it and later I can easily compare the current hash value with the one when I originally saved it. if the two are different, it is highly likely corrupted(like added words of nonsense, or some part of the document is missing). I am planning to do this for all important files I save in the storage.

But the problem is, sometimes hash will be identical even if there's some minor changes as the number of hash outputs will always be smaller than data inputs. Should I be concerned about collision for my use case? And what about other file types including pdf, jpg, exe, zip etc..? Are these also vulerable for hash collision?

And lastly, I know there're plenty of hash algorithms for a single file from crc32 to md5 to sha1 etc, and for my purpose(just data validity check), what would you recommend, and why?

Thank you in advance!

Score:2
ng flag

When using a $n$-bit hash, the probability that an accidental change goes undetected is about $2^{-n}$ (for hashes that even mildly meet their design goals).

If one is using this technique once per second for 100 years, with a 128-bit hash like MD5, that probability is $36524\times86400\times2^{-128}\approx2^{31.6-128}=2^{-96.4}$.

We know of 44 craters on earth caused by collision with a celestial body large enough to be a major blow to our current civilization, that occurred within the last 2.3G years. Thus probability of a civilization-disrupting event within this 100 years time-frame is at the very least $44\times100/(2.3\times10^9)\approx2^{-19}$ (and I'm optimistic here: man-made nuclear obliteration is arguably more probable). Thus there's no point in bothering about a probability of only $2^{-96.4}$.


But in cryptography, we consider adversaries that actively try to defeat us. If we use a 128-bit hash (such as MD5), and make many files (say $2^{31.6}$ as above, which hashes fit a 64GB USB stick), and have powerful adversaries with the kind of resources wasted in bitcoin mining¹, then the possibility that they find a file with the same hash as one of ours becomes sizable (though not the the point I would be bothered).

The real and immediate danger comes if we assume the adversaries manage to penetrate the software we use to save our (say PDF) files, and we are silly enough to use MD5 or SHA-1, which chosen-prefix collision-resistance is broken. Now adversaries can effortlessly produce files with the same MD5 or SHA-1 as any of ours, that look exactly how adversaries see fit when viewed.


For my purpose (just data validity check), what would you recommend?

Ignoring the possibility of adversarial modifications is off-topic in a crypto group. If we do this, a CRC is enough. 64-bit is fine. About the only thing to fear is that the media could internally use a CRC, and they interfere. For lack of information, choosing a random 64-bit primitive CRC makes sense.

Back to cryptography and it's adversarial model: it should be used unbroken hashes like those of the SHA-2 or SHA-3 familly. Short of a breakthrough that few expect, SHA-256 is adequately secure for likely at least a decade, SHA-512 forever (at a human scale) even if we assume we ever get Cryptographically Relevant Quantum Computers.


¹ I'm talking about the overall electrical power and integrated circuits wasted. The bulk of it would however not be for massively parallel hashing with ASICs as in bitcoin mining. It would be for fast memory organized for search, since computing MD5 hashes has low cost compared to matching them with the $2^{\approx31.6}$ target hashes.

in flag
Thank you so much for your detailed answer! Btw I wouldn't mind too much about security because most of my files are contents for reference or entertainment purposes. So for the example above, if any of the words are added, removed or replaced with another one in the 300 pages paperback novel due to file corruption, the chance of it having identical hash value as original one is negligible to the extent of winning powerball 3 times in a row or even lower, is that right? Thank you again for the peace of mind :) Answer accepted.
fgrieu avatar
ng flag
@tadkov: yes, with a 128-bit hash, odds that any file hashed at 1 per second for 100 years _accidentally_ gets corrupted without detection are lower (by a factor 4000) than winning the powerball three times in three bets.
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.