Score:2

Why do universal hash functions prevent adversarys, but uniform hash functions don't?

cn flag

Before I state my actual question, let me first five some terminology so we are all on the same page:

Let $U=\{k_1,...,k_u\}$ the universe of possible keys, $|U|=u$. We use a hash table $T$ with $m$ cells, counting from $0$ to $m-1$. We use a family of hash functions $H$, such that each $h\in H$ has the likelihood $1/m$ that two distinct keys $k$ and $k'$ hash so the same value, i.e., $P(h(k)=h(k'))=1/m$.

Finally, universal hashing means that for hashing, a random hash function (satisfying the $1/m$ requirement mentioned above) is chosen from H. According to my research (and this seems to be in line with the well-known CLRS algorithms textbook), we always use only a single hash function over the entire runtime of our hash table. The other elements in the hash family are thus only relevant while the table is being created (i.e., on first program call), but one this function got chosen its fixed and all the others are not relevant anymore. This also makes sense, because if you would choose different hash functions for different keys, then you need again to have a mapping from key to the used function -- which wouldn't make much sense this this is the problem we are going to solve: knowing where to store the key; but the chosen hash function would be key-dependent, so we'd get a paradox. :) So it's pretty much logical that we only choose one function over the entire runtime.

Having clarified that, my question:

How does picking a random hash function help us then defending against an adversary, if it's still fixed over the lifetime of the table? I do not see any difference to having a single hash function fixed in advance.

The sole motivation mentioned in any single lecture and academic material is to make life harder for an adversary because for any fixed hash function we can devise a sequence of keys that hash to the same value thus causing worst-case behavior of the hash table. So the claim is that since we now choose a random hash function you don't know anymore which is being used, so you also can't devise such an attack key sequence. I just don't buy that argument.

After your hash table with universal hashing got created you also have one single hash function for which you could find such a sequence, so we really didn't solve the problem. I believe that the important question is from where an adversary should know which hash function is being used!

So, if we only use a single hash function and even make it public, then you can be attacked, of course. But why should that function be public? Code is almost never public, is it? So assuming that code is not public, an adversary doesn't know which hash function is used anyway -- whether you have one single one or one picked at random from a whole family.

Thus if we assume that the function is not public and the adversary can infer it "somehow" (is that even possible? by measuring runtimes for example?), then again it doesn't make a difference whether there is one that's fixed in advance or whether it only got fixed after the table got initialized.

Thus, either way, the adversary argument just seems not to work -- it seems to be wrong unless the used hash function is actually public, which I simply can't believe. So what's going on here?

Score:1
ng flag

Indeed, in application to hash tables, the rationale of picking a hash function at random in a universal hash function family is to insure adversaries using the hash table do not know which hash function we picked. And that, combined with the properties of a universal hash function family, makes adversaries unable to purposely create hash collisions.

Code is almost never public, is it?

Per Kerckhoffs's (second) principle, we must assume adversaries know the code. This reflects the fact that object code is very often accessible to adversaries, and (although it requires effort) its possible to reverse-engineer what it does (also, open-source software is increasingly common). So instead of assuming the code is secret, we make the much lesser assumption that the choice of hash function in the universal hash function family is random and secret. That's a reasonable assumption at least initially, because the choice is made at runtime, when the application is started (or, for a database, when the database is created), and thus analysys of the code alone won't help determine this choice.

Importantly, an implementation should try that this choice does not leak, e.g. thru a timing side channel. That's quite difficult, since we avoid collisions because they slow down things, thus collision are inherently detectable by timing.

Prof.Chaos avatar
cn flag
Ah, so it's indeed because the assumption is that the code is known. Who would have thought. :) Thank you!
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.