Score:4

Many near collisions but no full collision

in flag

I read this question: Cracking $f(x) = Cx \oplus Dx$ Asking about finding collisions in a simple 64 bit hash, and I thought I will give it a go myself just for fun. I quickly wrote code to find collisions: https://gist.github.com/meirmaor/b0e59352eb73cacec47d0f95c25a25fc

And yet it finds many near collisions and no full collisions, this baffles me.

Algorithm description: I wanted to solve this using 8GB of Ram, so I allocate two Int arrays of length $2^{30}$ *(4 byte int) each. I populate them by hashing Int values, I take the lower 30 bits as the index into both arrays, and store the top 32 bits in the first array and the source int in the second array.

I populate using $2^{32}$ possible Int values(as byte arrays), and get as expected a 98% fill rate, vary close to the idealized $1-e^{-4}$ I would expect.

It's like a hash table but I don't deal with collisions, just keep a single value for each 30 bit hash key. It's essentially a mapping between truncated 62 bits hash to 32 bit origin.

I then try hashing longer values with an extra Int prefix, and search for collisions, again using lower 30 bits as index to array, check if top 32 match and if they do we found a near collision. However when verifying them I find no full collision, I've found over 60 near collisions so far, validated them separately they indeed match in 62 or 63 bits, but I was expecting 1/4 to be a full collision, I got 0.

I repeated the test twice first comparing hashes of 4 bytes to hashes of 8 bytes starting with the bytes {small number,0,0,0}. I then tried comparing hashes of equal length by pre-populating with hashes of data all beginning with the byte sequence {1,0,0,0} and comparing again with the prefix {2+,0,0,0}

How is this possible, something special in this hash function? A strange bug in my code allowing me to successfully find near collisions but no full collisions? Is there a reason near collisions found this way won't turn into full collisions.

An example of a near collision found(I have many):

Array(24, 0, 0, 0, 14, 103, 61, 80) vs Array(1, 0, 0, 0, -2, -81, 79, 79)

Meir Maor avatar
in flag
My next attempt will be a two finger algorithm O(1) memory, but I still don't know why the first attempt fails.
Score:5
ng flag

Late important addtion: I now realize the code attempts to find collisions for a 64-bit $\operatorname{hash}$ accepting 64-bit messages. If that $\operatorname{hash}$ was a bijection, it would not collide. There is a continuum between a bijection and a random function, and no insurance that $\operatorname{hash}$ behaves mostly like the later. On the contrary, it's internal function $f(x)=C\,x\oplus D\,x\bmod2^{64}$ has no right-diffusion. That is, $x\equiv x'\pmod{2^i}\implies f(x)\equiv f(x')\pmod{2^i}$, thus $f$ is a poor hash round function. This might explain at least in part the difficulty to find collisions by methods designed for random functions. In the later I assume one of:

  • the message space for $x$ is $b$-bit with $b$ sizably larger than the (64-bit output)
  • we try to find collisions $x,x'$ such that $\operatorname{hash}(p\mathbin\|x)=\operatorname{hash}(p'\mathbin\|x')$ where $p$ and $p'$ are fixed distinct prefixes, $x,x'$ are $b=64$-bit, $x\ne x'$.

I suspect another important issue is that in

I populate (the arrays) by hashing Int values

it is hashed incremental Int values. It is quite feasible to make a function such that incremental values in a large interval do not collide, and it's quite possible the function $\operatorname{hash}$ searched for collisions behaves like that, thus any attempt to find collisions among consecutive values fails.

As an example of a function with no collisions for input in a small interval, consider $H(x)=\left(263x+\left(\operatorname{MD5}(x)\bmod256\right)\right)\bmod2^{64}$. It holds $H(x)-H(x')\equiv263(x-x')+(r-r')\pmod{2^{64}}$, with $r,r'\in[0,255]$ since they are obtained as the last byte of an MD5; thus $\lvert r-r'\rvert<256$. Therefore if $x\ne x'$, the only way to get $H(x)=H(x')$ is that $\lvert x-x'\rvert$ is large, at least $\lfloor 2^{64}/263\rfloor$, which won't be the case for consecutive $x$ in a small interval.

When trying to find collisions for such insufficiently random hash function $H$, an easy fix is to find collisions for the a more random function $x\mapsto H(G(x))$, constructed using some pseudo-random auxiliary bijection $G$, e.g. $G(x)=G_2(G_1(G_0(x)))$ with $G_i(x)=k_i(x\oplus(x\gg\lceil b/3+1\rceil))\bmod2^b$ for $k_i$ haphazard odd $b$-bit constants [where $\gg$ is right shift, and $b$ is the bit size of $x$]. Once a collision $x,x'$ is found with $H(G(x))=H(G(x'))$ but $x\ne x'$, a collision for $H$ is $G(x),G(x')$.


One advantage of searching collisions with Pollard's rho with distinguished points (rather than the method in the question's code) is that it's iterative nature often solves the issue of an insufficiently random function searched for collisions, without an auxiliary $G$; or, a relatively simple $G$ will do (here I think a 1-bit rotation in the feedback of Pollard's rho should do, compensating the lack of right diffusion). Also, Pollard's rho uses less memory, thus works for larger hashes; and for fast hash functions, it's faster, because it's cache-friendly.

kodlu avatar
sa flag
nice. is there a deep algebraic reason in your MD5 related hash being collision averse for sequential integer values? other than 263 being relatively prime to the relevant modulus?cannot tell at a glance
Reppiz avatar
gb flag
i personally did not really understand how, respectively why the G fix works. Is there any further (or more in-depth) explanation? Also a link to a paper, blog, article, etc... which somehow describes this method would do.
fgrieu avatar
ng flag
@Reppiz: I now try to give the rationale. In essence, if we have a problem with $H$ because it's not random enough, we make it more random by introducing $G$.
fgrieu avatar
ng flag
@Meir Maor: make sure to read the new intro!
Meir Maor avatar
in flag
Yes, I was worried about that, and this is growing into a really great answer. There were several flaws in my attempt. Yet my success in finding near collisions at (somewhat less but not much less than) the expected rate threw me off.
Score:2
cn flag

Too unreputable to comment...

I expect it is an implementation issue - the high-level description of the method seems reasonable. Can it find a collision if you instead use the prefixes 0x01000099 and 0xDEADBD5C?

Spoiler: e.g. 0x010000992287FF50 vs. 0xDEADBD5C05F19159

The method used for finding this collision is essentially the same as the one you describe, except that I've also used that if we can find a a collision of the 56 most significant bytes of the hash (or, technically, the hash without the final application of f), then it is trivial to extend the byte sequences by one byte each to get a full (64 bit) collision.

fgrieu avatar
ng flag
That looks like legit as an answer (rather than comment) to me, even if it answers "why no collision found" by "there should be"; and I have an alternate explanation.
Meir Maor avatar
in flag
The plot thickens, with these prefixes my code finds 65 near collisions and 64 of them become full collisions. In an ideal function I would have expected to find a single near collision in each prefix pair (because I only store 1/4 of the values hashed in the first prefix)
Maarten Bodewes avatar
in flag
Something mod 127 or similar?
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.