Score:2

theoretical hash collisions vs random number collisions


I have a theoretical question about the probability of collisions of hashes versus random numbers. I'm not interested in the exact probabilities. The exact hash function is not relevant (we can assume it is perfectly uniform, cryptographically strong, etc.). The implementation of the random number generator is also not relevant (we can assume it is a perfectly random generator).

  1. If I have some pool of inputM values of length M bits (where M is half of N) that are known to be unique, does a hash of inputM, hashN(inputM), producing an N-bit hash have a lower probability of collision than generating an N-bit random number randN()? Intuitively, although the hash function cannot guarantee uniqueness, wouldn't its uniformity lower the chance of collision, given that it is performed on unique input values?

  2. Assuming #1 is true, is the probability of collision of hashN(inputM, randM()) lower than the probability of collision of randN()? In other words, will adding a random component to the hash input make the output completely random, so that I might as well have just generated an N-bit random number? Or will there still be a lower chance of collision than with a simple N-bit random number, because half of the input to the hash function comes from a pool of values known not to have collisions?

While this question involves math theory, it directly impacts a software design decision I need to make: if I have a filename and I want to identify it uniquely, but with an opaque identifier, I can hash the filename (SHA-256 or whatever) or I can just generate a random number. Would hashing the filename lower the chance of collision?
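
For concreteness, here is a minimal sketch (Python) of the two approaches I'm weighing; the filename and the 256-bit size are just placeholders:

    import hashlib
    import secrets

    filename = "reports/2024/summary.pdf"  # placeholder; stands in for any unique filename

    # Option A: derive the opaque identifier by hashing the filename (256-bit output).
    hashed_id = hashlib.sha256(filename.encode("utf-8")).hexdigest()

    # Option B: generate an independent 256-bit random identifier.
    random_id = secrets.token_hex(32)

    print("hashed:", hashed_id)
    print("random:", random_id)

Either way the identifier is 256 bits; the only question is whether deriving it from the filename changes the collision probability compared with drawing it at random.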

This is not a question about whether hash collisions would actually occur in real life, or whether SHA-256 is a good hashing algorithm, or which random number generator to use, etc. It is simply a comparison of the collision probability of hashes versus random numbers in general.

Score:5
  1. If I have some pool of inputM values of length M bits (where M is half of N) that are known to be unique, does a hash of inputM, hashN(inputM), producing an N-bit hash have a lower probability of collision than generating an N-bit random number randN()?

Nope; under the assumptions stated, they are precisely the same.

Where your intuition is misguiding you appears to be in the notion of 'uniformity' of a hash function. By that we don't mean that, over all possible inputs, each output is generated exactly the same number of times. What we mean is that the bias toward generating any particular output is no greater than what would be expected from a truly random function.

Because our model is that hashN(inputM) acts (for distinct inputs, which you stated was the case) like a random function, the outputs naturally look just like the output of a random generator...
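
To see this concretely, here is a rough simulation sketch of my own (SHA-256 truncated to a deliberately tiny N so that collisions become frequent enough to count, standing in for the ideal hash of your model):

    import hashlib
    import secrets

    N_BITS = 20                  # tiny output size so collisions actually occur
    K = 50_000                   # identifiers generated per trial
    MASK = (1 << N_BITS) - 1

    def hash_ids(k):
        # N-bit truncated hashes of k distinct inputs (the integers 0..k-1).
        return [int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big") & MASK
                for i in range(k)]

    def random_ids(k):
        # k independent uniform N-bit random values.
        return [secrets.randbits(N_BITS) for _ in range(k)]

    def collisions(values):
        return len(values) - len(set(values))

    print("hash collisions:  ", collisions(hash_ids(K)))
    print("random collisions:", collisions(random_ids(K)))

Both counts cluster around the same birthday estimate, roughly K*(K-1)/2^(N_BITS+1) (about 1,200 here); the truncation is only there to make collisions observable, since at the full 256 bits neither method would produce a collision in any feasible experiment.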

Garret Wilson
Thanks for clearing this up. That's exactly the sort of answer I was looking for. It's very useful to better understand this now.
