- Are the following two methods roughly equivalent ways of authenticating that the section of key material $k$ is associated with the section of message material $m$?
$$H_k(k \oplus m) \tag{a}$$
$$H_k(m) \tag{b}$$
The notion of "association" in this question isn't well-defined. It would be better to ask questions which clearly identify the problem to be solved, & whose answers correspond to some useful security notion. Let's try asking some better questions.
Preliminaries:
$$\epsilon_a = F_a(k, m) = H_k(k \oplus m)$$
$$\epsilon_b = F_b(k, m) = H_k(m)$$
$$\epsilon_{a, b} = F_{a, b}(k, m) = (\epsilon_a, \epsilon_b)$$
$\quad \boldsymbol\alpha$. In the locations where these procedures occur, can any other values $k'$ or $m'$, with $(k', m') \neq (k, m)$, make either $\epsilon_a = F_a(k', m')$ or $\epsilon_b = F_b(k', m')$ hold?
This question is better because, if the answer is no, then association is defined as a unique evaluation: $\epsilon_{a,b} = F_{a,b}(k', m')$ iff $(k', m') = (k, m)$, which means no other values for $k$ or $m$ will satisfy the association.
So, $(\textrm a)$ & $(\textrm b)$ could be considered "roughly equivalent" iff they both produce a unique $\epsilon_{a, b}$ for every $(k, m)$. But this cannot be true, because the pigeonhole principle(0) ensures a hash function will have output collisions over arbitrary inputs.
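To make the counting argument concrete, suppose $H$ has a 256-bit digest (as with the $\textrm{SHA-256}$ example further below), so there are at most $2^{512}$ distinct pairs $\epsilon_{a, b}$, while the set of admissible $(k, m)$ pairs is unbounded:
$$\left|\{(k, m)\}\right| \gt 2^{512} \ge \left|\{\epsilon_{a, b}\}\right| \implies \exists \space (k, m) \neq (k', m') : F_{a, b}(k, m) = F_{a, b}(k', m')$$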
The security notion provided by being able to answer question $\boldsymbol\alpha$ in the negative is not satisfied.
$\quad \boldsymbol\beta$. In the locations where these procedures occur, given $(k, m, \epsilon_{a, b})$, is finding any other values $k'$ or $m'$ such that either $\epsilon_a = F_a(k', m')$ or $\epsilon_b = F_b(k', m')$ holds as hard as finding a second-preimage(1) in $H$?
This question tries to capture a looser notion of unique association than $\boldsymbol\alpha$ does, by instead tying the difficulty of finding any $k'$ or $m'$ which produce a matching $\epsilon_a$ or $\epsilon_b$, given $(k, m, \epsilon_{a, b})$, to the difficulty of finding a second-preimage in the hash function.
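For reference, one standard phrasing of the second-preimage problem for $H$ is:
$$\textrm{given } x, \textrm{ find } x' \neq x \textrm{ such that } H(x') = H(x)$$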
Even this looser notion of association isn't generally satisfied for the $(\textrm a)$ & $(\textrm b)$ procedures. It depends on the hashing function, the keying & message ingestion procedures, as well as the data representations. For example, if the hash function & keying procedure is $\textrm{HMAC-SHA256}$, with as-is message ingestion & little-endian byte representations, then trivial second-preimages can be found for input tuples $(k', m') \neq (k, m)$ where $\epsilon'_{a, b} = \epsilon_{a, b}$.
This is essentially possible because xor erases the lengths of its inputs (trailing zero bytes vanish in the little-endian integer representation), & xor isn't collision resistant over multiple arbitrary inputs. $\textrm{HMAC-SHA256}$ also zero-pads the key to the block size before xoring it internally, allowing trivial second-preimages in the input $k$, described as follows:
\begin{equation}
\forall \space (k, m)
\quad H\textrm{.blocksize} \gt |k| \ge |m|
\end{equation}
\begin{equation}
\ell = H\textrm{.blocksize} - |k|
\end{equation}
\begin{equation}
F_a(k, m) =
F_a(k', m') =
H_{
k || \underbrace{0 \ldots 0}_{n \space \le \space \ell}
}(
(k || \overbrace{0 \ldots 0}^{n \space \le \space \ell})
\oplus
(m ||
\overbrace{0 \ldots\ldots\ldots 0}^{
n' \space\le\space |k| + n - |m| \space
}
)
)
\end{equation}
\begin{equation}
F_b(k, m) =
F_b(k', m) =
H_{k || \underbrace{0 \ldots 0}_{n}}(m)
\end{equation}
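For reference, recall the $\textrm{HMAC}$ construction (RFC 2104) for keys no longer than the block size: the key is zero-padded before being xored with the fixed $\textrm{ipad}$ & $\textrm{opad}$ constants, which is why the zero-padding of $k$ above is invisible to the keying procedure:
$$\textrm{HMAC}(k, m) = H\big((K' \oplus \textrm{opad}) \space || \space H((K' \oplus \textrm{ipad}) \space || \space m)\big) \qquad K' = k || \underbrace{0 \ldots 0}_{H\textrm{.blocksize} \space - \space |k|}$$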
The following Python code is a proof of concept, showing $(k', m') \neq (k, m)$ where $\epsilon_a = F_a(k', m')$ and $\epsilon_b = F_b(k', m)$.
import math
import hmac
import secrets
import hashlib

ZERO = b"\x00"
ORDER = "little"
H = hashlib.sha256
BLOCKSIZE = H().block_size

# As-is ingestion with little-endian integer representations of the inputs.
as_int = lambda x: int.from_bytes(x, ORDER)
as_bytes = lambda x: x.to_bytes(math.ceil(x.bit_length() / 8), ORDER)
xor = lambda x, y: as_bytes(as_int(x) ^ as_int(y))

F_a = lambda k, m: hmac.new(k, xor(k, m), H).digest()
F_b = lambda k, m: hmac.new(k, m, H).digest()

k = secrets.token_bytes(32); assert BLOCKSIZE > len(k)
m = b"message"; assert len(k) > len(m)
a = F_a(k, m)
b = F_b(k, m)

# Zero-padding both inputs leaves their little-endian integers, & therefore
# their xor, unchanged, & HMAC zero-pads the key to the block size anyway.
assert all(
    a == F_a(k + i*ZERO, m + i*ZERO)
    for i in range(len(k) - len(m))
)
# Zero-padding only the key is likewise invisible to HMAC's key padding.
assert all(
    b == F_b(k + i*ZERO, m)
    for i in range(BLOCKSIZE - len(k))
)
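As a further sanity check on the length-muddling claim above (assuming the snippet just shown has been run, since this reuses its `as_int`, `xor`, `k`, `m` & `ZERO` definitions), trailing zero bytes are high-order zeros of the little-endian integers, so they change neither the integer values nor their xor:

# Padding with e.g. 4 zero bytes changes the byte strings but not the
# little-endian integers, so the xor used by F_a can't see the difference.
assert as_int(m) == as_int(m + 4*ZERO)
assert xor(k, m) == xor(k + 4*ZERO, m + 4*ZERO)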
Clearly, the notion of association from $\boldsymbol\beta$, which relies on the second-preimage resistance of the collision-resistant $H$, cannot be achieved if the inputs aren't at least sufficiently canonically encoded. It's also clear that the answer to the first question is no: $(\textrm a)$ & $(\textrm b)$ are not roughly equivalent, as each procedure introduces distinct issues for the purposes of uniquely associating $k$ & $m$.
The code below offers an example of input canonicalization which could help in making $(\textrm a)$ & $(\textrm b)$ more interchangeable.
# Length-prefix (canonically encode) each input before keying & ingestion.
as_bytes = lambda x, size: x.to_bytes(size, ORDER)
measure = lambda x: len(x).to_bytes(8, ORDER)
measured = lambda x: measure(x) + x
measured_xor = lambda x, y: (
    measure(x)
    + measure(y)
    + as_bytes(as_int(x) ^ as_int(y), size=max([len(x), len(y)]))
)

F_a = lambda k, m: hmac.new(measured(k), measured_xor(k, m), H).digest()
F_b = lambda k, m: hmac.new(measured(k), measured(m), H).digest()

a = F_a(k, m)
b = F_b(k, m)

# The trivial zero-padded second-preimages from above now fail, since the
# length prefixes change whenever the inputs are padded.
assert not any(
    a == F_a(k, m + i*ZERO)
    for i in range(1, len(k) - len(m))
)
assert not any(
    b == F_b(k + i*ZERO, m)
    for i in range(1, BLOCKSIZE - len(k))
)
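A small follow-up check, again reusing the definitions above: with the length prefixes in place, the simultaneous zero-padding of both inputs that succeeded against the original $F_a$ & $F_b$ no longer collides:

# measured(k + ZERO) & measured_xor(k + ZERO, m + ZERO) both differ from the
# unpadded encodings, so the keys & messages fed to HMAC differ as well.
assert a != F_a(k + ZERO, m + ZERO)
assert b != F_b(k + ZERO, m + ZERO)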
- Is there anything wrong with choosing one method over the other?
The $(\textrm a)$ procedure requires a bit more effort to safely process its inputs, & $(\textrm b)$ is therefore a bit simpler to use. In neither case is safe input processing trivial, as the specifics of the keying procedure, the hash function, the message ingestion, & the data representation all impact what's required to associate the inputs $k$ & $m$ under useful notions.