Score:5

Conditions on a cryptographic hash function for Randomness Extraction


Assume we want to transform semi-random $n$-bit inputs into shorter $k$-bit outputs computationally indistinguishable from uniformly random bit strings, and that there is (in some sense, to be specified) enough entropy in each semi-random input for this purpose.

Under what condition(s) on a hash† function (and as far as it's indispensable, on the semi-random input) can we just hash each semi-random input for that randomness extraction purpose? What's a standard name / statement / reference for such condition(s)?

What's the status when the hash is SHA-256 truncated to $k\le256$ bits?


† For some definition of hash including: an efficiently computable function with an $n$-bit input and $k$-bit output, public except perhaps in some limited sense for an auxiliary fixed input. Other conditions are part of the question.


Motivation: that's the question the OP wanted to ask in Can I use a cryptographic hash function such as sha256 for Randomness Extraction?. The body of the question wondered if collision resistance was enough (it's not), and the question got closed by user consensus as duplicate of Can we assume that a hash function with high collision resistance also means a highly uniform distribution?.

ONUR EREN ARPACI:
On page 174, section 6.1.3 of these lecture [notes](https://people.seas.harvard.edu/~salil/pseudorandomness/extractors.pdf), it says that you can't have deterministic extractors for general $k$-sources. Then there is a huge literature on deterministic extractors for specific source models (such as this [book](https://link.springer.com/book/10.1007/978-3-642-14903-0)), so even if the source is not adversarial (for example, radiation coming from the sun), I am still not sure that using a specific cryptographic hash function solves the problem.
Marc Ilunga:
@ONURERENARPACI, I think it's important to distinguish attacker models. Surely, we can't have a statistical extractor from a deterministic function (against an unbounded adversary). For cryptographic applications, it is common to restrict to bounded distinguishers, in which case a general impossibility result doesn't necessarily tell us much.
Score:3

Updates

As discussed in the answer (Caveats) and in the comments, a drawback of what is described below is the independence assumption between the extractor and the source. This is rather unrealistic in practice: for instance, SHA-256 may have been used to generate an ephemeral DH scalar via a PRNG, and used as well in (un-salted) HMAC to derive the shared secret. Two papers dealing with these considerations:


I will assume the question is about computational randomness extraction; in particular, we are not in the realm of the impossibility result of extracting randomness with a deterministic function. Additionally, I will assume that the entropy source is independent of the hash function.

With those two assumptions, a sufficient condition (more precisely, an assumption) on the hash function is that it "behaves" like a random oracle.
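Under those assumptions, the extractor itself is trivial to state: hash the input and truncate. A minimal sketch in Python (whole-byte output lengths only, for simplicity; the 1024-bit sample is a made-up stand-in for a semi-random input):

```python
import hashlib

def ro_extract(x: bytes, k: int) -> bytes:
    """If SHA-256 'behaves like a random oracle' on inputs independent
    of its definition, its truncated output on a high-min-entropy input
    is (under that assumption) computationally indistinguishable from
    k uniform bits, for k <= 256."""
    assert k <= 256 and k % 8 == 0  # whole bytes only, for simplicity
    return hashlib.sha256(x).digest()[: k // 8]

# 128-bit output from a hypothetical 1024-bit semi-random sample:
sample = bytes(range(128))  # stand-in for the semi-random input
out = ro_extract(sample, 128)
print(len(out))  # 16
```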

Now, it is well known that a single hash function cannot realize a random oracle. Therefore, the notion of "behaves like a random oracle" cannot be interpreted as indistinguishability from a random oracle. Instead, we can rely on the notion of "indifferentiability from a random oracle".

Indifferentiability: The indifferentiability framework introduced by Maurer, Renner and Holenstein generalizes indistinguishability. Summarized: let $S$ be a cryptosystem using an idealized but publicly available resource $R$, intended to realize an ideal cryptosystem $T$. $S$ is said to be indifferentiable from $T$ if there exists a simulator $\mathcal{S}$ that "translates" $R$ from the ideal resource $T$. Said differently, no (efficient) distinguisher $D$ can tell an interaction with $(R, S)$ from one with $(\mathcal{S}, T)$. The notion comes with a composition theorem that essentially says: if a cryptosystem $\mathcal{C}$ is proven secure when using $T$, and $S$ is indifferentiable from $T$, then $\mathcal{C}$ is also secure when using $S$ (with some caveats).

This is useful in this context since the framework has been applied to hash-function constructions (e.g., sponges) and to HMAC.

Hugo Krawczyk also discusses here the role of indifferentiability for the security of HKDF assuming a fixed, zero salt.

Caveats: Now, the arguments above can be contested for a number of reasons: 1) independence assumptions, 2) idealizations. Because we assume independence of the source and the underlying hash function (or compression function), the answer is not as general as in the case of HKDF with random salts. But the question doesn't state that this assumption is unacceptable. Indifferentiability still requires some idealization (of the underlying compression function or block cipher). However, this can be seen as a minimization of assumptions: idealizing a compression function is a weaker assumption than idealizing a full hash function with structure, etc.

To be as general as possible, I would still use a salted HKDF instead of a single hash function or even HKDF with a zero salt.

fgrieu:
Addition: The _"entropy source is independent of the hash function"_ condition is tricky. It means: some value in the definition of the hash function, acting as a key (the key of HMAC, or the IV before the first round of SHA-256), is ultimately part of the public definition of the hash, but whatever generates the $n$-bit inputs must not use it. And that must hold for _the whole_ of each input. If adversaries can choose one bit of each input with knowledge of the other bits of that input and of the randomness extraction function, they can introduce a discernible bias in the output.
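The bias described in this comment is easy to reproduce. A small Python experiment (parameters arbitrary): the adversary gets one bit of freedom, choosing the last byte of each input from two fixed candidates, knowing the rest of the input and the public extractor, and uses it to push the first output bit towards 0. The success probability per input is $3/4$ instead of $1/2$:

```python
import hashlib
import secrets

def extract(x: bytes, k_bytes: int = 16) -> bytes:
    """Truncated SHA-256 as the randomness extractor under discussion."""
    return hashlib.sha256(x).digest()[:k_bytes]

def adversarial_input(n_bytes: int = 32) -> bytes:
    """All bytes random except the last, which the adversary chooses
    from two candidates (one bit of freedom) so that the first output
    bit is 0 whenever possible."""
    prefix = secrets.token_bytes(n_bytes - 1)
    for b in (0, 1):
        candidate = prefix + bytes([b])
        if extract(candidate)[0] < 128:  # first output bit is 0
            return candidate
    return prefix + bytes([1])  # neither choice works: give up

# For honest inputs Pr[first output bit = 0] is about 1/2;
# for adversarial inputs it rises to about 3/4.
trials = 2000
hits = sum(extract(adversarial_input())[0] < 128 for _ in range(trials))
print(hits / trials)  # ≈ 0.75
```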
Paul Uszak:
Err, what's the condition requiring a cryptographic (salted) HKDF when plain von Neumann will do without any kind of hash?
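For readers unfamiliar with the extractor this comment refers to, a minimal sketch of the classic von Neumann debiaser (no hash function involved; it assumes independent input bits with a constant bias):

```python
def von_neumann(bits):
    """Classic von Neumann debiaser: read bits in pairs; emit 0 for
    (0,1), 1 for (1,0), and discard (0,0) and (1,1). The output is
    exactly uniform if the input bits are independent and identically
    biased, at the cost of a variable, lower output rate."""
    out = []
    it = iter(bits)
    for a, b in zip(it, it):
        if a != b:
            out.append(a)
    return out

print(von_neumann([0, 1, 1, 0, 1, 1, 0, 0, 1, 0]))  # [0, 1, 1]
```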
Marc Ilunga:
@fgrieu, indeed, that's an important caveat that limits the generality of the statement. However, I suppose it's not unreasonable to put such restrictions for the sake of a theoretical statement. Or perhaps even practically, considering that unsalted HMAC is pervasively used. I guess it comes down to which assumptions we are most comfortable with?
Marc Ilunga:
@PaulUszak, there's nothing wrong with a statistical extractor. But I am keeping in mind that the question is about using a hash function and not something else. Additionally, the setup here (to my understanding) is that we are considering what is possible in a computational setting, so it is interesting to see what statements can be made in this context.
Marc Ilunga:
It would be nice to have some more explanation instead of just a down vote.
Score:-1

Condition: pass a decent randomness test.

If it looks random downstream, we can be certain that it is truly random, provided the input driving the extractor function is truly random. You are correct in saying that information-theoretic security cannot be achieved just by satisfying output tests without reviewing the entire stack. Thus some understanding of the raw entropy source is necessary, as well as a physical thingie that you can juggle in your hand.

The leftover hash lemma (rearranged) is:-

$$k \le sn - 2\log_2\frac{1}{\epsilon}$$

where $n$ is the number of input bits at $s$ bits/bit of raw entropy from the source, and $k$ is the number of output bits from the extractor. $\epsilon$ is the bias away from a perfectly uniform $k$-bit string, i.e. $H(k) = 1 - \epsilon$ bits/bit.

This then allows setting the input/output ratio to choose an output bias $\epsilon$. I use $\epsilon = 2^{-128}$ with SHA-512 because I can. Other ratios for interest are:-

(table of example extractor input/output ratios, originally an image)
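The arithmetic behind such ratios is a one-liner; a small helper (the example numbers follow the SHA-512, $\epsilon = 2^{-128}$ setting above, with an assumed source rate of $s = 0.5$ bits/bit):

```python
import math

def required_input_bits(k: int, s: float, eps_log2: int) -> int:
    """Leftover hash lemma, rearranged: to get k output bits within
    bias epsilon = 2^-eps_log2 of uniform, the source must supply
    min-entropy k + 2*log2(1/epsilon), i.e. n = (k + 2*eps_log2)/s
    input bits at s bits of entropy per raw bit."""
    return math.ceil((k + 2 * eps_log2) / s)

# SHA-512 output (k = 512), epsilon = 2^-128, source at 0.5 bits/bit:
print(required_input_bits(512, 0.5, 128))  # 1536 input bits
```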

Having read one or two papers (!) and built many TRNGs that pass all randomness tests, I suggest the term you're looking for is randomness extraction. The quality metric is the leftover hash lemma. Cryptographic functions are not required.

Understand the raw entropy source, pass randomness tests and you're shiny.

fgrieu:
I added $n$ and $k$ in the question, I believe according to your notation. Kudos for bringing up the LHL, which I did not know, and which is very relevant to the question. Gentle remarks: since the question asks for a condition making a hash suitable for randomness extraction, I see "randomness extraction" as somewhat circular. I would like to minimize the amount of "understanding of the raw entropy source" that's necessary. I don't see "pass a decent randomness test" as a sufficient criterion (I believe we could craft a counterexample source with knowledge of the test).
ONUR EREN ARPACI:
I understand that this answer is saying that we don't need a hash function. But the motivation of the question is that I am working with Ethereum smart contracts, and it would be way more costly to use a self-implemented extractor than the built-in Keccak256 function, so I am trying to justify using Keccak256. I also found this [paper](https://ia.cr/2019/198), which seems to support the idea of using certain hash functions as extractors. There's too much terminology and math that I haven't been able to parse yet, but I am certain it is important for this conversation.
Paul Uszak:
@fgrieu What else can there possibly be? If there is (unsabotaged) entropy input, and the output is computationally indistinguishable from random, then it's a TRNG if the LHL numbers are correct. Remember that you can have a great TRNG without any hash functions at all. Ergo there cannot be any requirements for hash function criteria other than their output passing randomness tests.
forest:
@PaulUszak A CSPRNG with output that is computationally indistinguishable from random is not a TRNG.