- Are the following two methods roughly equivalent ways of authenticating that the section of key material $k$ is associated with the section of message material $m$?
$$H_k(k \oplus m) \tag{a}$$
$$H_k(m) \tag{b}$$
The notion of "association" in this question isn't well-defined. It would be better to ask questions which clearly identify the problem to be solved, & whose answers correspond to some useful security notion. Let's try asking some better questions.
Preliminaries:
$$\epsilon_a = F_a(k, m) = H_k(k \oplus m)$$
$$\epsilon_b = F_b(k, m) = H_k(m)$$
$$\epsilon_{a, b} = F_{a, b}(k, m) = (\epsilon_a, \epsilon_b)$$
$\quad \boldsymbol\alpha$. In the locations where these procedures occur, can any other values $k'$ or $m'$, with $(k', m') \neq (k, m)$, make either $\epsilon_a = F_a(k', m')$ or $\epsilon_b = F_b(k', m')$ hold?
This question is better because, if the answer is no, then association is defined as a unique evaluation: $\epsilon_{a,b} = F_{a,b}(k', m')$ iff $(k', m') = (k, m)$, which means no other values for $k$ or $m$ will satisfy the association.
So, $(\textrm a)$ & $(\textrm b)$ could be considered "roughly equivalent" iff they both produce a unique $\epsilon_{a, b}$ for every $(k, m)$. But this cannot be true, because the pigeonhole principle(0) ensures a hash function will have output collisions over arbitrary inputs.
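To make the counting argument concrete, suppose $H$ has a 256-bit digest (as with the $\textrm{SHA-256}$ example further below), so there are at most $2^{512}$ distinct pairs $\epsilon_{a, b}$, while the set of admissible $(k, m)$ pairs is unbounded:
$$\left|\{(k, m)\}\right| \gt 2^{512} \ge \left|\{\epsilon_{a, b}\}\right| \implies \exists \space (k, m) \neq (k', m') : F_{a, b}(k, m) = F_{a, b}(k', m')$$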
The security notion provided by being able to answer question $\boldsymbol\alpha$ in the negative is not satisfied.
$\quad \boldsymbol\beta$. In the locations where these procedures occur, given $(k, m, \epsilon_{a, b})$, is finding any other values $k'$ or $m'$ such that either $\epsilon_a = F_a(k', m')$ or $\epsilon_b = F_b(k', m')$ holds as hard as finding a second-preimage(1) in $H$?
This question tries to capture a looser notion of unique association than $\boldsymbol\alpha$ does, by instead tying the difficulty of finding any $k'$ or $m'$ which produce a matching $\epsilon_a$ or $\epsilon_b$, given $(k, m, \epsilon_{a, b})$, to the difficulty of finding a second-preimage in the hash function.
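For reference, one standard phrasing of the second-preimage problem for $H$ is:
$$\textrm{given } x, \textrm{ find } x' \neq x \textrm{ such that } H(x') = H(x)$$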
Even this looser notion of association isn't generally satisfied for the $(\textrm a)$ & $(\textrm b)$ procedures. It depends on the hashing function, the keying & message ingestion procedures, as well as the data representations. For example, if the hash function & keying procedure is $\textrm{HMAC-SHA256}$, with as-is message ingestion & little-endian byte representations, then trivial second-preimages can be found for input tuples $(k', m') \neq (k, m)$ where $\epsilon'_{a, b} = \epsilon_{a, b}$.
This is essentially possible because xor erases the lengths of its inputs (trailing zero bytes vanish in the little-endian integer representation), & xor isn't collision resistant over multiple arbitrary inputs. $\textrm{HMAC-SHA256}$ also zero-pads the key to the block size before xoring it internally, allowing trivial second-preimages in the input $k$, described as follows:
\begin{equation}
\forall \space (k, m)
\quad H\textrm{.blocksize} \gt |k| \ge |m|
\end{equation}
\begin{equation}
\ell = H\textrm{.blocksize} - |k|
\end{equation}
\begin{equation}
F_a(k, m) =
F_a(k', m') =
H_{
k || \underbrace{0 \ldots 0}_{n \space \le \space \ell}
}(
(k || \overbrace{0 \ldots 0}^{n \space \le \space \ell})
\oplus
(m ||
\overbrace{0 \ldots\ldots\ldots 0}^{
n' \space\le\space |k| + n - |m| \space
}
)
)
\end{equation}
\begin{equation}
F_b(k, m) =
F_b(k', m) =
H_{k || \underbrace{0 \ldots 0}_{n}}(m)
\end{equation}
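For reference, recall the $\textrm{HMAC}$ construction (RFC 2104) for keys no longer than the block size: the key is zero-padded before being xored with the fixed $\textrm{ipad}$ & $\textrm{opad}$ constants, which is why the zero-padding of $k$ above is invisible to the keying procedure:
$$\textrm{HMAC}(k, m) = H\big((K' \oplus \textrm{opad}) \space || \space H((K' \oplus \textrm{ipad}) \space || \space m)\big) \qquad K' = k || \underbrace{0 \ldots 0}_{H\textrm{.blocksize} \space - \space |k|}$$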
The following Python code is a proof of concept, showing $(k', m') \neq (k, m)$ where $\epsilon_a = F_a(k', m')$ and $\epsilon_b = F_b(k', m)$.
import math
import hmac
import secrets
import hashlib

ZERO = b"\x00"
ORDER = "little"
H = hashlib.sha256
BLOCKSIZE = H().block_size

# As-is ingestion with little-endian integer representations of the inputs.
as_int = lambda x: int.from_bytes(x, ORDER)
as_bytes = lambda x: x.to_bytes(math.ceil(x.bit_length() / 8), ORDER)
xor = lambda x, y: as_bytes(as_int(x) ^ as_int(y))

F_a = lambda k, m: hmac.new(k, xor(k, m), H).digest()
F_b = lambda k, m: hmac.new(k, m, H).digest()

k = secrets.token_bytes(32); assert BLOCKSIZE > len(k)
m = b"message"; assert len(k) > len(m)
a = F_a(k, m)
b = F_b(k, m)

# Zero-padding both inputs leaves their little-endian integers, & therefore
# their xor, unchanged, & HMAC zero-pads the key to the block size anyway.
assert all(
    a == F_a(k + i*ZERO, m + i*ZERO)
    for i in range(len(k) - len(m))
)
# Zero-padding only the key is likewise invisible to HMAC's key padding.
assert all(
    b == F_b(k + i*ZERO, m)
    for i in range(BLOCKSIZE - len(k))
)
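As a further sanity check on the length-muddling claim above (assuming the snippet just shown has been run, since this reuses its `as_int`, `xor`, `k`, `m` & `ZERO` definitions), trailing zero bytes are high-order zeros of the little-endian integers, so they change neither the integer values nor their xor:

# Padding with e.g. 4 zero bytes changes the byte strings but not the
# little-endian integers, so the xor used by F_a can't see the difference.
assert as_int(m) == as_int(m + 4*ZERO)
assert xor(k, m) == xor(k + 4*ZERO, m + 4*ZERO)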
Clearly, the notion of association from $\boldsymbol\beta$, which relies on the second-preimage resistance of the collision-resistant $H$, cannot be achieved if the inputs aren't at least sufficiently canonically encoded. It's also clear that the answer to the first question is no: $(\textrm a)$ & $(\textrm b)$ are not roughly equivalent, as each procedure introduces distinct issues for the purposes of uniquely associating $k$ & $m$.
The code below offers an example of input canonicalization which could help in making $(\textrm a)$ & $(\textrm b)$ more interchangeable.
# Length-prefix (canonically encode) each input before keying & ingestion.
as_bytes = lambda x, size: x.to_bytes(size, ORDER)
measure = lambda x: len(x).to_bytes(8, ORDER)
measured = lambda x: measure(x) + x
measured_xor = lambda x, y: (
    measure(x)
    + measure(y)
    + as_bytes(as_int(x) ^ as_int(y), size=max([len(x), len(y)]))
)

F_a = lambda k, m: hmac.new(measured(k), measured_xor(k, m), H).digest()
F_b = lambda k, m: hmac.new(measured(k), measured(m), H).digest()

a = F_a(k, m)
b = F_b(k, m)

# The trivial zero-padded second-preimages from above now fail, since the
# length prefixes change whenever the inputs are padded.
assert not any(
    a == F_a(k, m + i*ZERO)
    for i in range(1, len(k) - len(m))
)
assert not any(
    b == F_b(k + i*ZERO, m)
    for i in range(1, BLOCKSIZE - len(k))
)
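A small follow-up check, again reusing the definitions above: with the length prefixes in place, the simultaneous zero-padding of both inputs that succeeded against the original $F_a$ & $F_b$ no longer collides:

# measured(k + ZERO) & measured_xor(k + ZERO, m + ZERO) both differ from the
# unpadded encodings, so the keys & messages fed to HMAC differ as well.
assert a != F_a(k + ZERO, m + ZERO)
assert b != F_b(k + ZERO, m + ZERO)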
- Is there anything wrong with choosing one method over the other?
The $(\textrm a)$ procedure requires a bit more effort to safely process its inputs, & $(\textrm b)$ is therefore a bit simpler to use. In neither case is safe input processing trivial, as the specifics of the keying procedure, the hash function, the message ingestion, & the data representation all impact what's required to associate the inputs $k$ & $m$ under useful notions.