Score:2

Given a hash function and a hash value, can you tell if it can produce such value?

tr flag

I came up with the following question:

Given a hash function H() and a hash value h that is in the codomain/range of outputs of H(), can you determine if h can be produced by H() (i.e. is h in the image of H())?

Can the question be answered? Does it contradict to the preimage resistance property?

Is there any benefit you can think of for a hash function having the above property (that you can/can't tell if h can be produced by the hash function)?

Maarten Bodewes avatar
in flag
Note that the title indicates a question that has [already been answered](https://crypto.stackexchange.com/a/41708/1172). However, the followup questions are interesting enough to keep it open, in my personal opinion.
Score:2
ng flag

Given a hash function $\mathcal H()$ and a hash value $H$ that is in the codomain/range of outputs of $\mathcal H()$, can you determine if $H$ can be produced by $\mathcal H()$ (i.e. is $H$ in the image of $\mathcal H()$)?

I'll assume "codomain/range of outputs" is defined without reference to what the hash actually outputs (rather than as the set of actual outputs of the hash, which would make it all reached by definition).

If a hash function $\mathcal H$ was such that for a sizable portion of given $H$ in it's codomain one can exhibit input $M$ such that $H(M)=H$, then that function would not be preimage-resistant. Thus said exhibition must be computationally impossible for a random $H$.

If we model a hash as a random function $\{0,1\}^*\to\{0,1\}^n$, then per the coupon collector problem, the expected number of hashes of random messages to reach all values is $E=2^n\,(n\,\ln(2)+\gamma)+1/2+o(1)$. In crypto practice $n\ge128$ thus $\log_2(E)\approx n+\log_2(n)-0.53$. Thus on average we'd need to hash less than all exactly 33-byte messages to reach all values for an ideal 256-bit hash, but need to hash more than all exactly 65-byte messages to reach all values for an ideal 512-bit hash. Making that many hashes is impossible.

For usual hash functions such as SHA-1, SHA-256, SHA-512, and I believe SHA-3, as stated in that other answer, we have no mathematical proof that every output value can be reached. The best we can say is it likely holds (even restricting to messages that fit a single block, and even more so with more), but it would be surprising if it could be either proven or disproven.


But I think we can construct a hash function that provably reaches all it's output space, yet has to a large degree the properties expected from a cryptographic hash. Here is a candidate hash of arbitrary bitstring demonstrably reaching the whole $\{0,1\}^{512}$.

I'll use a 3072-bit safe prime $p$, that is such that $q=(p-1)/2$ is also prime; and a generator $g$ of the multiplicative group $\mathbb Z_p^*$, that is $g\in[2,p-2]$ with $g^q\equiv-1\pmod p$. We can use $p=2^{3072}-2^{3008}+2^{64}\,(\left\lfloor2^{2942}\,\pi\right\rfloor+1690314)-1$ of the 3072-bit MODP group, and $g=\left\lfloor 2^{3070}\,e\right\rfloor$.

Compute the hash $H(M)\in\{0,1\}^{512}$ of input message $M\in\{0,1\}^*$ as follows:

  1. Convert the bitstring $M$ to integer $m$ per big-endian convention, and keep track of the bit length $\ell$ of $M$.
  2. Compute $$\begin{align} m_0&=m\bmod(p-1)\\ h_0&=(g^{m_0}\bmod p)-1\\ h_1&=\left\lfloor h_0/2^{1024}\right\rfloor\bmod2^{512}\\ h_2&=\left\lfloor h_0/2^{1664}\right\rfloor\bmod2^{512}\\ m_1&=\left\lfloor m/(p-1)\right\rfloor\\ \end{align}$$ Note: the constants 1024 and 1664 select the position of two arbitrary non-overlapping 512-bit segments in the binary representation of $h_0$.
  3. Convert $h_1$ to bitstring $H_1\in\{0,1\}^{512}$, $h_2$ to bitstring $H_2\in\{0,1\}^{512}$, and $m_1$ to bitstring $M_1\in\{0,1\}^{\ell}$ per big-endian convention.
  4. Compute and output $H=H_1\oplus H_2\oplus\operatorname{SHA3-512}(M_1)$.

The transformation between $m_0$ and $h_0$ is a bijection of $[0,p-2]$. If follows we could provably find a preimage $M$ of any $H\in\{0,1\}^{512}$ if we could solve the DLP in the multiplicative group $\mathbb Z_p^*$: we fix $M_1=0^{3072}$ (thus $\ell=3072$ and $m_0=m$), $h_0=2^{640}\,h_1$ (thus $H_2=0^{512}$), that lets us compute $H_1=H\oplus\operatorname{SHA3-512}(M_1)$, then $h_1$, then $h_0=2^{640}\,h_1$. We solve the DLP problem $(g^{m_0}\bmod p)=h_0+1$ to get $m_0$, then $m$, then the 3072-bit $M$.

My argument that the hash is collision-resistant and preimage-resistant is that $M\mapsto(M_1,m_0)$ is injective, $m_0\mapsto H_1\oplus H_2$ seems to be fairly hard to invert or collide, and XORing that with a good unrelated hash of $M_1$ less than halves the collision-resistance.


Is there any benefit you can think of ?

I don't see any actual technical benefit to a hash function that demonstrably reaches all it's codomain, since we can't experimentally tell the difference with a good standard hash function without this property. On the other hand, it would be intellectually satisfying. Problem is, anything I can think of (like the above candidate) if both slower and less safe at given output width than a standard hash is, and that practical consideration out weights the intangible benefit of being sure all output values can be reached.

I don't see any benefit to a hash function that demonstrably can not reach some of it's output space, and such that's it's computationally impossible to exhibit such output value, or impossible to tell if a given output space value is reachable.

I can imagine applications for hashes that probably can not reach a few known values in their output space (e.g. one such value can be reserved as indicator of a special case). On the other hand, we can easily construct such hashes from standard primitives. For example, for a 256-bit hash that can not reach $0^{256}$, we can use (with usual conversions between bitstrings and integers) $M\mapsto(\operatorname{SHAKE128}(M,416)\bmod(2^{256}-1))+1$. And in practice we could as well use any standard 256-bit hash.

kelalaka avatar
in flag
Isn't it XORing the input's first 512-bit to the output of the SHA3-512 is an easier way to guarantee the all of the range is the 512-bit space?
fgrieu avatar
ng flag
@kelalaka: if you are proposing $H(m_0\mathbin\|m_1)=m_0\oplus\operatorname{SHA3-512}(m_0\mathbin\|m_1)$ with $m_0\in\{0,1\}^{512}$, then no there is no proof that reaches all output value, and that's very likely false when we restrict to 512-bit messages. If I get the [coupon collector](https://en.wikipedia.org/wiki/Coupon_collector%27s_problem)'s math right, that becomes probable after we hash roughly $2^{512+8.5}$ messages. My function _demonstrably_ reaches all output values, but we'd need to solve a variant of the DLP problem to reverse it.
kelalaka avatar
in flag
Yes, that is not correct, I should say x-oring the message into 512-blocks, then x-or with the hash, still there is no proof, just expectation.
Score:2
in flag

Given a hash function H() and a hash value h that is in the codomain/range of outputs of $H()$, can you determine if $h$ can be produced by $H()$ (i.e. is $h$ in the image of $H()$)?

Generally, you cannot if you consider $H$ to be a black box. I would be surprised if there are values that cannot be reached by a hash function like SHA-2/3 though, because of the way they are constructed. As in the Q/A I pointed to, in the end you'll want to look at the inner workings of the hash function.

Can the question be answered? Does it contradict to the preimage resistance property?

Not directly. It would reduce the codomain of course, but not significantly. However, it could cast big doubts on the construction of the hash function. People would try and analyze it why the distribution is not perfect, and I'd guess they'd find other issues pretty soon unless the hash function was deliberately designed to shun certain values (e.g. "if output = 0 then perform another block of hashing").

Is there any benefit you can think of for a hash function having the above property (that you can/can't tell if $h$ can be produced by the hash function)?

You could e.g. use those values to indicate that something went wrong - if you somehow managed to find them. You could for instance think about covert channels using such values - but note that with pre-image resistance you could simply pick a few from the codomain as well.

If you manage to make the output shun certain values that apply to a common function then you could of course make sure that somebody never wins a lottery, for instance (although, for common lotteries, you may not need the effort). Usually you'd only take a partial hash output for that kind of thing, so e.g. you could make sure that the initial bits are never zero (but note that those hash functions won't make it through many test suites).

With usual constructions it could be tricky to hide these kind of special values from others that analyze the internal structure of the hash function, especially at the long term. That said, you can look at e.g. Dual_EC_DRBG to understand that you can do pretty nasty things with an algorithm, especially when it comes to the selection of constants.

tr flag
Thank you for your input! Wouldn't that exactly because H is a black box, I can always answer "yes" due to high probability that all images are mapped? Then given an artificial hash with some known values it will never output, I will answer "yes" if h is not in those values and "no" otherwise. I can always win?
Maarten Bodewes avatar
in flag
No because if it is a black box you wouldn't be able to tell which values would be excluded from the output. Generally you'd expect a hash output to be well distributed, but if you just leave out a few then this is negligible and indetectible.
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.