Score:5

Maximum entropy of a hash function?

ng flag

Let $H(h,k)$ be the expected entropy of some random oracle $X:\left\{0,1\right\}^h \to \left\{0,1\right\}^k$, where $h$ does not necessarily equal $k$.

  1. Then, is it true that $\lim\limits_{h\to\infty}H(h,k) = k$ ? (for constant $k$)
  2. If so, is the above still true if $h = 2k$? In other words, does $\lim\limits_{h\to\infty} h - H(2h,h) = 0$ ? (for common hash functions like SHA-256/512, the input block size is twice the bits of the output)

For #1, my hunch is that it is true simply by the Law of Large Numbers, where $X$ is more likely to become uniformly distributed in $\left\{0,1\right\}^k$ as $h\to\infty$.

For #2, I am not so sure. A uniform distribution in $\left\{0,1\right\}^k$ has variance $\sigma^2 \sim 2^{2k} = 2^h$, which may "cancel out" this effect of the Law of Large Numbers as $h\to\infty$.


*Edit -- to calculate the expected entropy, I followed fgrieu's approach from The effect of truncated hash on entropy, where fgrieu derived the expected entropy of a random $h=k$ oracle as $$H = h - \frac{1}{2^h}\sum_{j=1}^{2^h}n_jj\lg\left(j\right) \approx h - 0.8272453$$

However, for a more "general" random oracle (where possibly $h\ne k$), I obtained $$n_j = 2^k\binom{2^h}{j}\left(\frac{1}{2^k}\right)^j\left(1-\frac{1}{2^k}\right)^{2^h-j}$$ and so $$H = h - 2^{k-h}\sum_{j=1}^{2^h}\binom{2^h}{j}\left(\frac{1}{2^k}\right)^j\left(1-\frac{1}{2^k}\right)^{2^h-j}j\lg\left(j\right)$$ which does not seem to have a nice simple closed-form or approximation like in the $h = k$ case there.
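
As a quick numerical sanity check of this formula, here is a minimal Python sketch (the helper name `expected_entropy` and the use of `math.lgamma` to evaluate the binomial PMF without overflow are my own illustrative choices, not part of the derivation):

```python
import math

def expected_entropy(h, k):
    # H(h,k) = h - 2^(k-h) * sum_j C(2^h, j) p^j (1-p)^(2^h - j) * j * lg(j),  with p = 2^-k
    N = 2 ** h        # number of inputs
    p = 2.0 ** (-k)   # probability that a given input lands on a fixed output
    total = 0.0
    for j in range(1, N + 1):
        # log of the binomial PMF, via lgamma, to avoid enormous binomial coefficients
        log_pmf = (math.lgamma(N + 1) - math.lgamma(j + 1) - math.lgamma(N - j + 1)
                   + j * math.log(p) + (N - j) * math.log1p(-p))
        total += math.exp(log_pmf) * j * math.log2(j)
    return h - 2.0 ** (k - h) * total

# h = k: the deficit h - H(h,h) should be close to the ~0.827 bit figure above
print([round(h - expected_entropy(h, h), 4) for h in (4, 8, 12)])
# fixed k = 2: H(h,2) should approach 2 as h grows
print([round(expected_entropy(h, 2), 4) for h in (4, 8, 12)])
```

For $h=k$ the printed deficits should hover around the $0.827$ value quoted above, and for fixed $k=2$ the entropy should creep up towards $2$.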


*Edit #2 -- it seems Wolfram Alpha also miscalculates the limits here. For example, in the simple case of a 1-bit random oracle $X:\{1,2,\ldots,n\}\to\{1,2\}$, the expected entropy would just be $$ H_n = -\frac{1}{2^n}\sum_{j=1}^{n-1}\binom{n}{j}\left[\left(\frac{j}{n}\right)\lg\left(\frac{j}{n}\right) + \left(1-\frac{j}{n}\right)\lg\left(1-\frac{j}{n}\right)\right]$$ (ignoring the two zero-entropy cases $j=0,n$, where $X$ maps everything to the same output value).

But, Wolfram Alpha seems to suggest that $\lim\limits_{n\to\infty} H_n = 0$ (and even hangs when substituting $h=n$), when in fact $\lim\limits_{n\to\infty} H_n = 1$. This can be demonstrated by showing that $H_n$ is already bounded from below by a (much simpler) sum $$B_n = \frac{1}{2^n}\sum_{j=1}^{n-1}\binom{n}{j}\left[4\times\frac{j}{n}\left(1-\frac{j}{n}\right)\right] = 1 - \frac{1}{n}$$ such that $0 < B_n < H_n < 1$ for all $n > 1$, thus implying that $\lim\limits_{n\to\infty}H_n = 1$.
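
(For what it's worth, both $H_n$ and the bound $B_n$ are easy to check numerically; a minimal Python sketch, with the helper names chosen just for illustration:)

```python
import math

def H_n(n):
    # expected entropy of a random 1-bit oracle on n inputs: j ~ Bin(n, 1/2)
    total = 0.0
    for j in range(1, n):            # j = 0 and j = n contribute zero entropy
        pmf = math.comb(n, j) / 2 ** n
        x = j / n
        total += pmf * (-x * math.log2(x) - (1 - x) * math.log2(1 - x))
    return total

def B_n(n):
    # the simpler lower-bound sum; should come out to exactly 1 - 1/n
    return sum(math.comb(n, j) / 2 ** n * 4 * (j / n) * (1 - j / n) for j in range(1, n))

for n in (4, 16, 64, 256):
    print(n, round(B_n(n), 6), round(1 - 1 / n, 6), round(H_n(n), 6))
```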

Paul Uszak avatar
cn flag
Unfortunately the linked approach flies in the face of a large body of work called the Leftover Hash Lemma, NIST's opinion, and that of commercial TRNG manufacturers. Imagine TRNGs that output << 1 bit of entropy per bit! Also do not be led astray by the h=2k thing. That's a direct result of the lemma I mentioned and only applies to 128-bit block sizes, e.g. MD5.
fgrieu avatar
ng flag
@Paul Uszak: I wouldn't say my approaches to TRNGs are entirely antagonistic to practice. But I do have things to say against some malpractice in TRNGs and their evaluation methods, including: taking as an argument of conformance that a black-box test of a conditioned RNG passes; overcomplexity (is it there to mask rigging?); prescribing unrealistically tight bounds for the monobit, poker, or similar tests (early FIPS 140, AIS31); using statistical approximations outside their applicability domain (AIS31); using references with errors (AIS31); not acknowledging or acting on reports of these (AIS31).
Paul Uszak avatar
cn flag
@fgrieu It sounds as if you’ve had some bad experiences with AIS31 :-( Yes, TRNG testing can be challenging, thus all the Qs herein. [ Flags wave & trumpets sound! ] I’m releasing `ent3` soon, as an upgrade of the venerable `ent` test program. It specifically targets DIY TRNG makers in the 25 kB to 1 MB sample-size range. In addition to the 6 original tests, there are some additional ones, and all have accurate $p$-values. I hope you comment on it...
Score:3
ng flag

fgrieu has shown that $\lim\limits_{h\to\infty} H(h,k) = k$ for any fixed $k$, so I will try to address the second question, that is, whether or not $$\lim\limits_{h\to\infty} h - H(2h,h) = 0$$ In particular, if we let $n = 2^h$, then we are asking whether $$\lim\limits_{n\to\infty} \lg(n) - n\sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\left[\left(\frac{j}{n^2}\right)\lg\left(\frac{n^2}{j}\right)\right] = 0$$ or equivalently, whether $$\lim\limits_{n\to\infty} \sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\left[\lg(n)+\left(\frac{j}{n}\right)\lg\left(\frac{j}{n^2}\right)\right] = 0$$ Note that as $n\to\infty$, the quantity inside the sum $$\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j} \approx \mathcal{N}(j \,\vert\, n,n-1)$$ where $\mathcal{N}(x \,\vert\, \mu,\sigma^2)$ is the PDF of the normal distribution with mean $\mu$ and variance $\sigma^2$.

We can then simply plug this into Wolfram Alpha to obtain: $$\lim\limits_{n\to\infty} \sum_{j=0}^{n^2}\mathcal{N}(j \,\vert\, n,n-1)\times\left[\lg(n)+\left(\frac{j}{n}\right)\lg\left(\frac{j}{n^2}\right)\right] = 0$$


**Edit -- using Wolfram Alpha to obtain an answer is not a very rigorous approach (as @kodlu has also pointed out in the comments). So, I'll attempt a more rigorous solution below...

In particular, where $n = 2^h$ just as above, we can express $$ H(2h, h) = \sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\left[\left(\frac{j}{n}\right)\lg\left(\frac{n^2}{j}\right)\right] $$ and observe that this quantity from inside the sum $$\left[\left(\frac{j}{n}\right)\lg\left(\frac{n^2}{j}\right)\right]$$ is bounded from below by the slightly different quantity $$\frac{1}{\ln 2}\times\left[-2\left(\frac{j}{n}\right)^2 + \left(3+\ln n\right)\left(\frac{j}{n}\right) - 1\right]$$ such that $$ H(2h,h) \ge \sum_{j=0}^{n^2}\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j}\times\frac{1}{\ln 2}\left[-2\left(\frac{j}{n}\right)^2 + \left(3+\ln n\right)\left(\frac{j}{n}\right) - 1\right] \\ = \lg(n) - 2\left(\frac{n - 1}{n^2\ln 2}\right) = h - 2\left(\frac{2^h - 1}{4^h\ln 2}\right)$$ and so $$\lim\limits_{h\to\infty} h - H(2h,h) \le \lim\limits_{h\to\infty} 2\left(\frac{2^h - 1}{4^h\ln 2}\right) = \boxed{0}$$
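
A quick numerical check of this bound (a minimal Python sketch, assuming the exact expression for $H(2h,h)$ above; the helper name and the `lgamma`-based PMF evaluation are illustrative):

```python
import math

def H_2h_h(h):
    # exact H(2h, h): j ~ Bin(n^2, 1/n) with n = 2^h, summed as E[(j/n) * lg(n^2/j)]
    n = 2 ** h
    total = 0.0
    for j in range(1, n * n + 1):
        log_pmf = (math.lgamma(n * n + 1) - math.lgamma(j + 1) - math.lgamma(n * n - j + 1)
                   + j * math.log(1.0 / n) + (n * n - j) * math.log1p(-1.0 / n))
        total += math.exp(log_pmf) * (j / n) * math.log2(n * n / j)
    return total

for h in (2, 3, 4, 5, 6):
    lower_bound = h - 2 * (2 ** h - 1) / (4 ** h * math.log(2))
    print(h, round(lower_bound, 6), round(H_2h_h(h), 6), round(h - H_2h_h(h), 8))
```

The exact values should sit between the lower bound and $h$, with the gap $h - H(2h,h)$ shrinking rapidly.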


**Edit #2 -- it seems the same "lower-bound" approach can also be used to answer the first question (demonstrating that $\lim\limits_{h\to\infty} H(h,k) = k$ for any fixed $k$)!

In particular, where $n = 2^h$ just as above, and $m = 2^k$, we can express $$H(h,k) = m\sum_{j=0}^{n}\binom{n}{j}\left(\frac{1}{m}\right)^j\left(1-\frac{1}{m}\right)^{n-j}\left[\left(\frac{j}{n}\right)\lg\left(\frac{n}{j}\right)\right]$$ and observe that this quantity from inside the sum $$\left[\left(\frac{j}{n}\right)\lg\left(\frac{n}{j}\right)\right]$$ is bounded from below by the slightly different quantity $$\frac{1}{\ln 2}\times\left(\frac{j}{n}\right)\left[-m\left(\frac{j}{n}\right) + \ln(m) + 1\right]$$ such that $$ H(h,k) \ge m\sum_{j=0}^{n}\binom{n}{j}\left(\frac{1}{m}\right)^j\left(1-\frac{1}{m}\right)^{n-j}\times\frac{1}{\ln 2}\left(\frac{j}{n}\right)\left[-m\left(\frac{j}{n}\right) + \ln(m) + 1\right] \\ = \lg(m) - \frac{m-1}{n\ln 2} = k - \frac{2^k-1}{2^h\ln 2}$$ and so $$\lim\limits_{h\to\infty} H(h,k) \ge \lim\limits_{h\to\infty} k - \frac{2^k-1}{2^h\ln 2} = \boxed{k}$$
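
The same kind of numerical check works here too (a minimal Python sketch assuming the exact expression for $H(h,k)$ above; the helper name is illustrative):

```python
import math

def H_h_k(h, k):
    # exact H(h,k): j ~ Bin(2^h, 2^-k), summed as m * E[(j/n) * lg(n/j)], n = 2^h, m = 2^k
    n, m = 2 ** h, 2 ** k
    total = 0.0
    for j in range(1, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * math.log(1.0 / m) + (n - j) * math.log1p(-1.0 / m))
        total += math.exp(log_pmf) * (j / n) * math.log2(n / j)
    return m * total

k = 3
for h in (4, 8, 12):
    lower_bound = k - (2 ** k - 1) / (2 ** h * math.log(2))
    print(h, round(lower_bound, 6), round(H_h_k(h, k), 6))
```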

kodlu avatar
sa flag
How do you rigorously justify the statement "quantity inside..."? And a demonstration by Wolfram Alpha is not rigorous either.
ng flag
@kodlu The "quantity inside" is a binomial PMF that, under certain conditions, can become [accurately approximated by the normal PDF](https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation)...
ng flag
@kodlu But yes -- getting an answer from "Wolfram Alpha" does not count as rigorous...
kodlu avatar
sa flag
I've modified my answer, but the lack of a rigorous proof for the last line should be addressed if possible.
ng flag
@kodlu I've added a more rigorous proof to my answer above**
kodlu avatar
sa flag
Sorry, I cannot follow the expression below the phrase "slightly different quantity". It seems you dropped a logarithm somewhere?
ng flag
@kodlu Good catch! Just fixed that expression there (for the lower bound on $H(2h,h)$)
ng flag
@kodlu Did that address your concerns?
kodlu avatar
sa flag
Can you explain how you obtain the expression right above "and so" at the end?
ng flag
@kodlu Sure! You can notice that there are three terms inside the bracketed expression $$\frac{1}{\ln 2}\times\left[-2\left(\frac{j}{n}\right)^2 + \left(3+\ln n\right)\left(\frac{j}{n}\right) - 1\right]$$ inside the sum (for the lower bound on $H(2h,h)$), each of which is just some power of $j$ (either 0, 1, or 2), multiplied by some coefficient.
ng flag
@kodlu Consequently, you can split the whole sum (for the lower bound on $H(2h,h)$) into *three different sums*, and then easily obtain a closed-form expression for *each one*, simply just by applying the following general formulae: [edited]$$ \begin{align*} & \sum_{k=0}^n\binom{n}{k}p^k\left(1-p\right)^{n-k}\times k^0 = 1 \\ & \sum_{k=0}^n\binom{n}{k}p^k\left(1-p\right)^{n-k}\times k^1 = np \\ & \sum_{k=0}^n\binom{n}{k}p^k\left(1-p\right)^{n-k}\times k^2 = np(p(n-1)+1) \end{align*}$$ (and, of course, with the necessary substitutions!)
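
(And for anyone who wants to double-check those three identities numerically, here is a tiny Python sketch with an arbitrarily chosen $n$ and $p$:)

```python
import math

# numeric check of the three binomial-moment identities, for some arbitrary n and p
n, p = 20, 0.3
pmf = [math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]
print(sum(pmf), 1)                                                           # k^0 moment
print(sum(w * j for j, w in enumerate(pmf)), n * p)                          # k^1 moment
print(sum(w * j * j for j, w in enumerate(pmf)), n * p * (p * (n - 1) + 1))  # k^2 moment
```
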
kodlu avatar
sa flag
Thanks for the explanation
Score:3
ng flag

Notations and preliminaries

The random oracle implements a random function $X:\{0,1\}^h\to\{0,1\}^k$. For a given such $X$

  • Let $j_i$ be the number of preimages in $\{0,1\}^h$ of a given element $i$ in $\{0,1\}^k$.

    It holds $0\le j_i\le 2^h\,$ and $$\sum_{i\in\{0,1\}^k}j_i\ =\ 2^h\label{fgrieu1}\tag{1}$$

    The entropy at the output of $X$ for uniformly random input is $$\begin{align}H&=\sum_{i\in\{0,1\}^k}\frac{j_i}{2^h}\log_2\left(\frac{2^h}{j_i}\right)\label{fgrieu2}\tag{2}\\ &=2^{-h}\sum_{i\in\{0,1\}^k}j_i\,\bigl(h-\log_2(j_i)\bigr)\label{fgrieu3}\tag{3}\\ \end{align}$$

  • Let $n_j$ be the number of $i$ with $j_i=j$.

    It holds $$\displaystyle\sum_{j=0}^{2^h}n_j\ =\ 2^k\label{fgrieu4}\tag{4}$$ and by rearranging the terms in $\ref{fgrieu3}$ we get $$H=2^{-h}\sum_{j=1}^{2^h}n_j\,j\,\bigl(h-\log_2(j)\bigr)\label{fgrieu5}\tag{5}$$

The distribution of $j_i$ over the $2^{2^h k}$ different $X$ is binomial, with mean $E(j_i)=\mu=2^{h-k}$ and standard deviation $\sigma=\sqrt{2^{h-k}(1-2^{-k})}$

The question is about the expected $H(h,k)=E(H)$ over the $2^{2^h k}$ different $X$. Expectation is linear, thus even though the $n_j$ are dependent, equations $\ref{fgrieu4}$ and $\ref{fgrieu5}$ rigorously yield $$\displaystyle\sum_{j=0}^{2^h}E(n_j)\ =\ 2^k\label{fgrieu6}\tag{6}$$ $$H(h,k)=2^{-h}\sum_{j=1}^{2^h}E(n_j)\,j\,\bigl(h-\log_2(j)\bigr)\label{fgrieu7}\tag{7}$$

This can lead to at least asymptotic formulas for $H(h,k)$. I used this approach there to derive $\lim\limits_{h\to\infty}h-H(h,h)=0.82724538915300508343173\ldots^+$ bit.
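
One way to reproduce that constant numerically: for $h=k$, $\ref{fgrieu7}$ with $E(n_j)=2^h\Pr[j_i=j]$ gives $h-H(h,h)=E\bigl(j_i\log_2(j_i)\bigr)$, and the binomial distribution of $j_i$ tends to $\mathrm{Poisson}(1)$ as $h\to\infty$ (assuming the limit may be taken inside the expectation). A minimal Python sketch of the resulting sum:

```python
import math

# entropy deficit h - H(h,h) in the Poisson(1) limit: E[j * lg(j)] with j ~ Poisson(1)
deficit = sum(math.exp(-1.0) / math.factorial(j) * j * math.log2(j) for j in range(2, 64))
print(deficit)   # ~0.827245389...
```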

Answer

  1. Yes, for fixed $k$, $\lim\limits_{h\to\infty}H(h,k)=k$.

    Intuitive argument: as $h$ goes to infinity, the binomial distribution of the $j_i$ degenerates to Gaussian with same mean $\mu$ and standard deviation $\sigma$ as above. $\lim\limits_{h\to\infty}\sigma/\mu=0$. This implies that in $\ref{fgrieu7}$, the terms of the sum that matter are those with $j$ close to $\mu=2^{h-k}$. Replacing $j$ with $2^{h-k}$ in $\ref{fgrieu7}$, then applying $\ref{fgrieu6}$, we get $$\begin{align}\lim\limits_{h\to\infty}H(h,k)&=2^{-h}\sum_{j=1}^{2^h}E(n_j)\,2^{h-k}\,\bigl(h-(h-k)\bigr)\\ &=2^{-k}\,k\sum_{j=1}^{2^h}E(n_j)\\ &=k\\ \end{align}$$ I wish I knew how to properly justify "This implies".

  2. Thinking about rigorously proving $\lim\limits_{h\to\infty}h-H(2h,h)=0^+$.

    But don't hold your breath.

Paul Uszak avatar
cn flag
Have a look at the proof of the Leftover Hash Lemma. It has your proof and demonstrates that your (1) and (2) are correct. Not the 0.827 thing though, as $e^{h-h} =1$.
fgrieu avatar
ng flag
@Paul Uszak: Are you suggesting [Leftover Hash Lemma, Revisited](https://eprint.iacr.org/2011/088), or [A Pseudorandom Generator from any One-way Function](https://doi.org/10.1137/S0097539793244708), or some other source? Note: that $\lim\limits_{h\to\infty}h-H(h,h)=0.827…$ is easily verified experimentally with usual hashes truncated to e.g. 16 to 35 bit, by determining the $n_j$ and applying (5) [which is slightly easier than applying (2)]. I report on such experiment [here](https://crypto.stackexchange.com/a/24672/555) and also link to a thesis that (among other things) did just that.
Score:2
sa flag

Edit: I was wrong below, as pointed out in the comments by @ManRow:

@ManRow, unfortunately you cannot use the normal approximation to the binomial for answering Question 2. This is because the usual rule of thumb (see these stats notes, scroll to the bottom) for using this approximation (based on the de Moivre–Laplace approximation, Wikipedia), namely $$ np(1-p)\geq 10,\quad \text{or} \quad np(1-p)\geq 5~(\text{less stringently}), $$ does not hold for $p=1/n.$

If we use the Poisson approximation instead (see David Pollard's notes at Yale here), we can replace $\mathrm{Bin}(n,p)$ by $\mathrm{Poi}(np)$, and Le Cam has famously shown that $$ \sum_{k=0}^\infty |P(X=k)-P(Y=k)|\leq 4p $$ if $X\sim \mathrm{Bin}(n,p)$ and $Y\sim \mathrm{Poi}(np)$, which is amazing since you are summing over all natural numbers; recall that for us $p=1/n$.

Another technique we can use is the so-called Poissonization. The balls-into-bins process is negatively correlated across distinct bins: if one bin has more balls, another one has fewer. This is quite technical and delicate (Mitzenmacher and Upfal's Probability and Computing book discusses it, but I will skip the details). It means that if we pretend that the numbers of balls falling into distinct bins arrive at some fixed Poisson rate $np,$ the approximations we need still hold in a very strong sense.

Therefore, we can approximate the binomial distribution in this problem with the relevant Poisson distribution. Recall that we throw $m=n^2$ balls into $n$ bins, so we have the rate $\lambda=mp=n^2(1/n)=n.$

The recent paper https://arxiv.org/pdf/1001.2897.pdf has tight upper and lower bounds on the entropy of the Poisson distribution. This essentially means that for any $\lambda>0,$ we have (bottom of page 2) the approximation to the Poisson entropy $$ H(\lambda)=\frac{1}{2}\ln(2\pi \lambda)+\frac{1}{2} -\frac{1}{12\lambda}-\frac{1}{24 \lambda^2} +O(\lambda^{-3}), $$ and for $\lambda=n$ large ($n=2^h$) we obtain the approximation $$ H\approx \frac{1}{2} \ln(2 \pi n)+\frac{1}{2}\approx \frac{1}{2}(\ln 2\pi + \ln n) \approx \frac{1}{2 \ln 2} \lg n \approx 0.72 \lg n=0.72 h $$ which does not quite get us to the desired $$ \lim\limits_{h\to\infty}h-H(2h,h)=0^+ $$ since we can only conclude that the limit is $0.28 h.$ Maybe this is the real answer.

ng flag
This seems like a misunderstanding of the variables involved. In the quantity below, $$\binom{\boxed{a^2}}{k}\left(\frac{1}{a}\right)^k\left(1-\frac{1}{a}\right)^{\boxed{a^2}-k}$$ we have that $n=\boxed{a^2}$ and $p = 1/a$, so therefore $$\\\mu = np = a^2 \times (1/a) = a \\\sigma^2 = np(1-p) = a^2 \times (1/a)(1-1/a) = a-1$$ both of which tend to infinity as $a\to\infty$.
ng flag
So, just by replacing $a$ with $n$, and $k$ with $j$, it seems the normal approximation for $$\binom{n^2}{j}\left(\frac{1}{n}\right)^j\left(1-\frac{1}{n}\right)^{n^2-j} = \mathcal{N}(j~\vert~ n, n-1)$$ (from inside the sum) may very well apply here as $\mu = n$ and $\sigma^2 = n - 1$ easily satisfy the [criteria for normal approximation](https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation) with both $\mu\gg 10$ and $\sigma^2\gg 10$ as $n\to\infty$ as well.
ng flag
The case you're referring to with $\mu = 1$ and $\sigma^2 = 1-1/n$ seems to apply to a [different question](https://crypto.stackexchange.com/questions/24660/the-effect-of-truncated-hash-on-entropy) answered by [fgrieu](https://crypto.stackexchange.com/questions/24660/the-effect-of-truncated-hash-on-entropy/24672#24672) regarding $\lim\limits_{h\to\infty}h - H(h,h)$, not the $\lim\limits_{h\to\infty}h - H(2h,h)$ case as over here...