Score:1

Are these both the probability of collision in birthday attack?

in flag
Tim

About birthday attack, book Cryptography Engineering says:

In general, if an element can take on N different values, then you can expect the first collision after choosing about $\sqrt{N}$ random elements. We're leaving out the exact details here, but $\sqrt{N}$ is fairly close. For the birthday paradox, we have N = 365 and $\sqrt{N} \approx 19$. The number of people required before the chance of a duplicate birthday exceeds 50% is in fact 23, but $\sqrt{N}$ is close enough for our purposes and is the approximation that cryptographers often use.

One way of looking at this is that if you choose $k$ elements, then there are $k(k - 1)/2$ pairs of elements, each of which has a $1/N$ chance of being a pair of equal values. So the chance of finding a collision is close to $k(k - 1)/2N$. When $k = \sqrt{N}$, this chance is close to 50 % .

and wikipedia says:

As an example, consider the scenario in which a teacher with a class of 30 students (n = 30) asks for everybody's birthday (for simplicity, ignore leap years) to determine whether any two students have the same birthday (corresponding to a hash collision as described further). Intuitively, this chance may seem small. Counter-intuitively, the probability that at least one student has the same birthday as any other student on any day is around 70% (for n = 30), from the formula ${\displaystyle 1-{\frac {365!}{(365-n)!\cdot 365^{n}}}}$.

which can be rephrased in terms of the language in Cryptography Engineering:

$$1 - \frac{N!}{(N-k)! * N^k}$$

Is it supposed to equal to the following from Cryptography Engineering:

$$ (k(k-1))/(2N) $$

Why?

kelalaka avatar
in flag
That is about approximation that you did not notice on Wikipedia page. and possible dupes are [What are the odds of collisions for a hash function with 256-bit output?](https://crypto.stackexchange.com/q/39641/18298) and [Birthday Attack Probability of Collision in Introduction to Modern Cryptography](https://crypto.stackexchange.com/a/72791/18298) and [On a lower bound for the birthday problem](https://crypto.stackexchange.com/q/64584/18298) and [What is the error in this collision probability approximation?](https://crypto.stackexchange.com/q/66328/18298) and some other not listed...
kelalaka avatar
in flag
Wiki page: [Birthday problem - Approximations](https://en.wikipedia.org/wiki/Birthday_problem#Approximations)
Tim avatar
in flag
Tim
Thanks. @kelalaka. I don't find (k(k−1))/(2N) in the links. Do I miss it?
kelalaka avatar
in flag
See `This is fine for low ` in the first link 101 answer...
Tim avatar
in flag
Tim
@kelalaka Thanks. The book in my post seems to explain (k(k−1))/(2N) in a different way. Are the pairs each having two equal values disjoint events?
kelalaka avatar
in flag
It first selects uniform random $k$ values then forms the pair where each has $1/N$ chance of collision....
fgrieu avatar
ng flag
A derivation of the approximation is in my [_Birthday problem for cryptographic hashing, 101_](https://crypto.stackexchange.com/a/39644/555). Beware that uses $(k,n)$ where the present question uses $(N,k)$.
Tim avatar
in flag
Tim
@fgrieu Thanks. I still have the same question. The book in my post seems to explain (k(k−1))/(2N) in a different way from your answer. How shall I understand it from 1/N for each pair? Are the pairs each having two equal values disjoint events? Which part in its derivation is approximation?
Score:2
ng flag

The question asks how we go from $\displaystyle p=1 - \frac{N!}{(N-k)!\,N^k}$ for the probability of collision of $k$ uniformly random values among $N$, to the approximation $\displaystyle p\approx\frac{k(k-1)}{2N}$ (which assumes $k$ is small compared to $\sqrt N$ ).

First we get back to $\displaystyle1-p=\prod_{j=0}^{k-1}{\left(1-\frac j N\right)}$, which is how $p$ was determined in the first place. Then we take the logarithm on both sides and use that $u>0,v>0\implies\ln(u\,v)=\ln(u)+\ln(v)$ to get $$\displaystyle\ln(1-p)=\sum_{j=0}^{k-1}{\ln\left(1-\frac j N\right)}$$

For small $|x|$, it hold $\ln(1+x)\approx x$. Applying this to $x=p$ on the left side and $\displaystyle x=\frac j N$ on the right side, we get $\displaystyle p\approx\sum_{j=0}^{k-1}\frac j N$. We rewrite this as $\displaystyle p\approx\frac 1 N\sum_{j=0}^{k-1}j$.

Now we use that the sum of non-negative integers less than $k$ is $\displaystyle\frac{k\,(k-1)}2$ and get the desired $\displaystyle p\approx\frac{k(k-1)}{2N}$.

Without proof: this approximation of $p$ is always by excess. It's off by less than $+28\%$ when $k\le\sqrt N$, less than $+14\%$ when $k\le\sqrt{2N}$, less than $+7\%$ when $k\le2\sqrt N$.


Most of the error is from the approximation $\ln(1-p)\approx-p$. A much better approximation is: $$p\approx1-e^{-\frac{k(k-1)}{2N}}$$ which assumes only $k\ll N$ rather than $k\ll\sqrt N$. However beware that this alternate formula is numerically instable for small $p$.


In comment it's additionally asked

How shall I understand (this formula) from $1/N$ for each pair? Are the pairs each having two equal values disjoint events? Which part in its derivation is approximation?

One easy way to derive the probability $p$ that there is a collision among $k$ uniform values among $N$ (for $0\le k\le N$) is as the complement of the probability that there is not collision.

For a fixed $N$, define $q_k$ as the probability that there is no collision after $k$ values. Obviously $q_0=q_1=1$. And for $k\ge2$, $q_k$ is the probability that there was no collision among the first $k-1$ values (that is, $q_{k-1}$), time the probability that there is no collision between the $k-1$ previous values and the last drawn one, which is $\displaystyle\frac{N-k}N$ (justification there a exactly $N-k$ values among $N$ that do not leak to collision for the last value drawn).

It follows $\displaystyle q_k=q_{k-1}\left(1-\frac k N\right)$, thus $\displaystyle q_k=\prod_{j=0}^{k-1}{\left(1-\frac j N\right)}$, thus $$p=1-\prod_{j=0}^{k-1}{\left(1-\frac j N\right)}=1-\frac{N!}{(N-k)!\,N^k}$$

This is exact. See first two sections of this answer for the derivation of approximations.


One source justified the approximation as:

One way of looking at this is that if you choose $k$ elements, then there are $k(k−1)/2$ pairs of elements, each of which has a $1/N$ chance of being a pair of equal values.

This hand-waving argument does not yield a mathematically exact derivation of $\displaystyle p=\frac{k(k-1)}{2N}$, since the events are not disjoint. As long as $p$ is small, we can get away with it, but that gets grossly wrong when $k>\sqrt{2N}$.

When $k = \sqrt{N}$, this chance is close to 50%.

That's true if 39.3% is close to 50%.

Tim avatar
in flag
Tim
Thanks. My comment at https://crypto.stackexchange.com/questions/99160/are-these-both-the-probability-of-collision-in-birthday-attack/99176#comment214408_99160 asked for different questions
Tim avatar
in flag
Tim
If you take a look at the first quote in my post, the book doesn't derive the probability the way your added to your reply
Tim avatar
in flag
Tim
"since the events are not independent." Additivity of probabilities of several events depends on the events being disjoint not independent
kelalaka avatar
in flag
The percentages need links...
fgrieu avatar
ng flag
@kelalaka: my evidence for +28% is only numerical (hence the _without proof_): I plotted $\frac{\left(1-\frac{{k^2}!}{(k^2-k)!\,k^{2k}}\right)\,2k^2}{k(k-1)}-1$; same for +14% +7%.
Tim avatar
in flag
Tim
Thanks. (1) Does your "First we get back to ... get the desired p≈k(k−1)/2N" justify (k(k−1))/(2N) as a reasonable approximation? (2) Does "This hand-waving argument does not yield a mathematically exact derivation of p=k(k−1)/2N, since the events are not disjoint. As long as p is small, we can get away with it, but that gets grossly wrong when k>sqrt(2N). When k=sqrt(N), this chance is close to 50%. That's true if 39.3% is close to 50%" mean that k(k−1)/2N is a bad approximation? (3) if you don't mean contradiction in (1) (2), could you explain, edit, or paraphrase what you actually mean?
fgrieu avatar
ng flag
(1) More precisely, it implies $k(k−1)/(2N)$ is a valid approximation _for $k$ or $p$ lower than some threshold._ (2) It means that $k(k−1)/(2N)$ is poorly justified by the argument given, comes without a statement of when it applies, and is a terrible approximation for $k$ above some thresold. As explained, it's already significantly off (by +24%) when applied to $k=\sqrt N$ and $N$ of cryptographic interest. (3) There's no contradiction.
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.