
Probability of a collision in the sum of hashed 64-bit values

gct asks:
I'm working on a problem where I need to track some state consisting of 64-bit integers. It turns out this state can be tracked by simply accumulating a sum of differences, which in my case naturally sums to zero:

$$s = \sum_k (b_k-a_k)$$

This is a "closure" relationship that relies on all the values being transitively accounted for, so I might have e.g. $(a-c)+(c-b)+(b-a) = 0$, but that will always hold.

Right now this works because I'm taking care to traverse things in an order that ensures it only cancels once I've seen all my data. I'd like to loosen that restriction so that I can accumulate the values in any order without worrying about getting a false cancellation before I've seen all the data.

A friend suggested that I define several hash functions $G_i(x)$ that map uint64 $\rightarrow$ uint64 and sum those differences instead:

$$s_i = \sum_{k}(G_i(b_k) - G_i(a_k))$$

For the example above, I now have $(G_i(a)-G_i(c))+(G_i(c)-G_i(b))+(G_i(b)-G_i(a)) = 0$ once I've seen all the $b_k-a_k$ terms. The sums may be zero at other times too, but it is vanishingly unlikely that all of them are erroneously zero at once.
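A minimal sketch of what such an order-independent accumulator might look like (the names `ClosureChecker` and `make_mixer` are mine, and the splitmix64-style finalizer is just a stand-in for whatever $G_i$ you choose):

```python
# Sketch only: ClosureChecker and make_mixer are hypothetical names,
# and the splitmix64-style finalizer below is a stand-in for the G_i.
MASK = (1 << 64) - 1

def make_mixer(salt):
    """Return a salted 64-bit mixer (splitmix64 finalizer over x + salt)."""
    def g(x):
        x = (x + salt) & MASK
        x ^= x >> 30
        x = (x * 0xBF58476D1CE4E5B9) & MASK
        x ^= x >> 27
        x = (x * 0x94D049BB133111EB) & MASK
        x ^= x >> 31
        return x
    return g

class ClosureChecker:
    def __init__(self, k=4):
        # k mixers G_1..G_k, each with its own salt
        self.mixers = [make_mixer((0x9E3779B97F4A7C15 * (i + 1)) & MASK)
                       for i in range(k)]
        self.sums = [0] * k

    def add(self, a, b):
        """Accumulate G_i(b) - G_i(a) modulo 2^64, in any order."""
        for i, g in enumerate(self.mixers):
            self.sums[i] = (self.sums[i] + g(b) - g(a)) & MASK

    def closed(self):
        """All k sums are zero only once every difference has been
        transitively accounted for (up to the tiny false-zero chance)."""
        return all(s == 0 for s in self.sums)
```

Feeding it the example cycle $(a-c)+(c-b)+(b-a)$ in any order drives all $k$ sums back to zero, while any strict prefix leaves them nonzero.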

I'd like to try to prove how unlikely this is, though. My intuition is that a particular sum is effectively equivalent to a random 64-bit integer (assuming a good hash function), so the probability of all the sums being zero by chance is $\approx 2^{-64k}$, where $k$ is the number of $G_i$ functions. For four functions this is $2^{-256}$, i.e. it doesn't happen.

Can anyone elaborate on anything I've missed probability-wise?

poncho answers:

> My intuition is that a particular sum is effectively equivalent to a random 64-bit integer (assuming a good hash function)

If all the values $a, b, c$ are distinct, then your intuition is correct; if the sum is done modulo $2^{64}$, then the probability is $2^{-64}$ that at any one step (other than the last one) the running sum just happens to be zero (and thus falsely signals an early end).

The caveat of $a, b$ being distinct is necessary; otherwise, consider this simple case: suppose $a=b$ and you include $G_i(a) - G_i(b)$ as your initial entry; if you test for "closure", the algorithm will signal it (and it doesn't matter how many $G_i$ functions you have).

Outline of how to show that the probability is $2^{-64}$ if all values are distinct: suppose we have a sum $(G_i(a) - G_i(b)) + (G_i(c) - G_i(d)) + \ldots + (G_i(y) - G_i(z))$. If we assume that we don't achieve closure and all the $a, b, c, d, \ldots, y, z$ values are distinct (note that the same $G_i(a)$ expression may still occur in two different subexpressions), then there must be a value $G_i(k)$ that appears only once. If we assume $G_i$ acts like a random oracle, the sum may be reduced to $\pm G_i(k) + \text{otherstuff}$, where $\text{otherstuff}$ is independent of $G_i(k)$. Because $G_i(k)$ is effectively a random value, the entire expression is, and that random value is 0 with probability $2^{-64}$.
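The last step of this argument can be checked by brute force with a toy modulus: whatever the fixed "otherstuff" is, exactly one of the equally likely values of $G_i(k)$ makes the sum zero (here $M = 256$ stands in for $2^{64}$ so we can enumerate):

```python
# Toy-modulus illustration (M = 256 instead of 2^64): for any fixed
# "otherstuff", exactly one of the M equally likely values of G_i(k)
# makes +/- G_i(k) + otherstuff equal zero, i.e. probability 1/M.
M = 256
for otherstuff in (0, 17, 255):            # arbitrary fixed remainders
    for sign in (1, -1):                   # the +/- in front of G_i(k)
        zeros = sum(1 for gk in range(M) if (sign * gk + otherstuff) % M == 0)
        assert zeros == 1                  # exactly one value of G_i(k) cancels
```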

And, if $G_i, G_j$ (for $i \ne j$) are independent functions (such as, $G_i(a) = Hash( i || a )$ for a strong hash function $Hash$), then multiplying the probabilities is appropriate.

gct: Fortunately my $a,b,c,\ldots$ are indeed distinct.
gct: Thanks for the excellent answer. My intention was to use the [murmurhash](http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html) finalizer with the first 4 values of the mixer as salts, so $G_1(k) = Hash(k+Hash(1)),\ G_2(k) = Hash(G_1(k) + Hash(2)), \ldots$ Do you foresee any issues there?
poncho: @gct: as long as the values $a, b, c$ are not chosen by an adversary (who wants you to fail), I don't see an issue.
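For concreteness, my reading of the construction gct describes might look like the following sketch. `fmix64` here is MurmurHash3's original 64-bit finalizer; the blog post linked above proposes alternative mixing constants, but the shape is identical, and `make_G` is a hypothetical name:

```python
# Sketch of the chained construction from the comment above:
#   G_1(k) = Hash(k + Hash(1)),  G_i(k) = Hash(G_{i-1}(k) + Hash(i)).
# fmix64 is MurmurHash3's 64-bit finalizer (original constants).
MASK = (1 << 64) - 1

def fmix64(x):
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK
    x ^= x >> 33
    return x

def make_G(n=4):
    """Return a function mapping x to [G_1(x), ..., G_n(x)] (my naming)."""
    salts = [fmix64(i) for i in range(1, n + 1)]   # Hash(1), ..., Hash(n)
    def G_all(x):
        out, prev = [], x
        for s in salts:
            prev = fmix64((prev + s) & MASK)       # each G_i chains on G_{i-1}
            out.append(prev)
        return out
    return G_all
```

One incidental property of this sketch: each stage (add a salt, then `fmix64`) is invertible on 64-bit values, so each $G_i$ is a bijection and distinct inputs never collide within a single $G_i$.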