Is there a simple formula to calculate how many inputs generate same outputs in cryptographic hash functions, being the input larger than the input?

il flag

Let's suppose I hash a 384-bits message with a cryptographic hash function and generate a hash digest of 128-bits. I know that due to the pigeonhole principle, many inputs of same size will result in the same 128-bits hash digest.

How is the estimated number of 384-bits messages that will produce the same 128-bits digest in a cryptographic hash function? Is there a simple formula to estimate how many collisions I will have given an input larger than the output?

I found a question in this forum but I don't know if that formula applies here:


/\ Is this the formula to calculate what I ask in this question?

sa flag

The average number of inputs generating the same output is clearly the number of inputs divided by the number of outputs which is the quantity $$ 2^{384}/2^{128} $$ in this case. To further clarify, as in a comment, here "average" is over the collection of (2^{128})^(2^{384}) possible functions from 384-bit bitstrings to 128-bit bitstrings, since the assumption is that we draw one of those functions uniformly at random.

Since a good hash function can be modelled as a random mapping we can say more about this, based on the bins-in-balls paradigm.

To start, see Wikipedia for more details here. In their definition, we would have $m=2^{384}$ and $n=2^{128}.$ For example as soon as $m>n \ln n$ the probability that all outputs are reached by some input approaches 1. This is the so called coupon collector problem, look it up. We can also estimate the load of the bin with most balls, least balls etc. There is a paper by Raab online with a title like "balls in bins: a tight analysis" with more technical details.

As in a comment, the distribution of the number of number of inputs resulting in a fixed hash output value, the Poisson approximation can be used and then that approximated by a Gaussian with mean $\mu=2^{256}$ and standard deviation $\sigma=2^{128}.$

Note that this is only a model, though usually very accurate, since any given hash function is deterministic.


The paper by Raab and Steger available here also has results about how many inputs map to the most popular hash output. We have $m=n^3$ here with $n=2^{128}.$ Therefore we are in the last case of Theorem 1, where $m$ is large compared to $n$, namely $m\geq c n (\ln n)^3.$ (In fact even $m>2^{148}$ would put us into this case since $\log_2(\ln 2^{128})^3 \approx 19.4.$)

Applying this result we then see that the maximum load is with high probability equal to $$ \frac{m}{n}+\sqrt{\frac{2 m \ln n}{n}(1-o(1))} $$ where $o(1)$ goes to zero very rapidly. A quick calculation gives $$ 2^{256}+\sqrt{2^{385-128} \ln (2^{128})}\approx 2^{256}+\sqrt{2^{257+6.5}}=2^{256}+2^{131.5} $$ as the expected maximum load in a bin (i.e., a fixed hash output). It is interesting that this value is approximately at $2^{3.5} \sigma \approx 11.3 \sigma$ when compared to the approximation given by Ilmari Karonen in a comment.

fgrieu avatar
ng flag
(Addition) Here _"average"_ is over the ${(2^{128})}^{(2^{384})}$ functions from 384-bit bitstrings to 128-bit bitstrings.
ar flag
In particular, based on the [law of rare events](, the distribution of the number of inputs that generate any given output is _very_ well approximated by a Poisson distribution with rate parameter $\lambda=2^{384}/2^{128}=2^{256}$, which in turn (since $\lambda$ is large) is well approximated by a Gaussian distribution with mean and variance $\lambda$ (and thus a standard deviation of $\sqrt\lambda$). Thus the distribution will look like a bell curve peak at $2^{256} \pm 2^{128}$.

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.