If the data I want to authenticate consists of multiple values and I compute a MAC over the plain concatenation of those values, an adversary can "shift" characters between adjacent values without invalidating the MAC. How is this issue commonly and best addressed?
I have found this existing question about MACing multiple messages, but I feel the proposed solutions do not generalize well for more than two messages.
Consider the following contrived example:
Suppose I have a server that stores authentic log entries for clients. The client writes a log entry, authenticates it using a MAC and sends it to the server. Later, when the client retrieves log entries from the server, it should be able to verify their authenticity.
Let's say log entries have the following structure:
{
createdAt: "1621012345",
message: "first entry"
}
Naively, I could create a MAC for a log entry l as
$$
mac = \text{HMAC}(K, l.createdAt \| l.message)
$$
where $\|$ denotes concatenation and $K$ is the secret key.
If I stored this log entry and its MAC on a server and retrieved it later, the server could return
{
createdAt: "1621012",
message: "345first entry",
mac: "<the MAC computed above>"
}
Since 1621012 || 345first entry is the same as 1621012345 || first entry, I would not notice the manipulation when checking the MAC.
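To make the problem concrete, here is a minimal sketch in Python. It assumes HMAC-SHA256 and a made-up key; neither is fixed by anything above, and the function name `naive_mac` is purely illustrative.

```python
import hashlib
import hmac

# Hypothetical key, for illustration only.
KEY = b"demo key, do not use"


def naive_mac(created_at: str, message: str) -> str:
    """HMAC-SHA256 over the plain concatenation createdAt || message."""
    data = (created_at + message).encode("utf-8")
    return hmac.new(KEY, data, hashlib.sha256).hexdigest()


original = naive_mac("1621012345", "first entry")
shifted = naive_mac("1621012", "345first entry")

# Both calls feed exactly the same bytes to HMAC, so the tags match.
print(hmac.compare_digest(original, shifted))  # prints True
```

The two entries differ, but the byte string that reaches HMAC is identical, so verification cannot tell them apart.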
Note that in this case I could actually detect the manipulation by validating the length of createdAt. But that only works if the length is fixed, and not if I had, say, authorName instead of the timestamp.
I can think of the following ways of dealing with this:
1. Intersperse a delimiter
If I calculated my MAC as $mac = \text{HMAC}(K, l.createdAt \| \text{':'} \| l.message)$ I believe this attack would not be possible anymore. At first glance it seems problematic that the delimiter character can appear in the message. But that only makes it impossible to unambiguously reconstruct the values from the concatenated string, which is irrelevant in this scenario. I cannot think of any way to make the calculation of the MAC ambiguous here. Is this simple solution secure?
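A minimal sketch of this variant, again assuming HMAC-SHA256 and a made-up key (the names `DELIMITER` and `delimited_mac` are illustrative only). It only checks the single manipulation from the example above; it is not an argument that the construction is unambiguous for values that themselves contain the delimiter.

```python
import hashlib
import hmac

# Hypothetical key, for illustration only.
KEY = b"demo key, do not use"
DELIMITER = ":"


def delimited_mac(created_at: str, message: str) -> str:
    """HMAC-SHA256 over createdAt || ':' || message."""
    data = (created_at + DELIMITER + message).encode("utf-8")
    return hmac.new(KEY, data, hashlib.sha256).hexdigest()


original = delimited_mac("1621012345", "first entry")
shifted = delimited_mac("1621012", "345first entry")

# The delimiter now sits at a different position in the two inputs,
# so the shifted entry from the example no longer verifies.
print(hmac.compare_digest(original, shifted))  # prints False
```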
2. Hash values before concatenating
I could calculate the MAC, for example, as $mac = \text{HMAC}(K, \text{SHA256}(l.createdAt) \| \text{SHA256}(l.message))$ (or any other cryptographic hash function). This ensures that an adversary cannot meaningfully manipulate the values I concatenate. It also ensures that the concatenated values always have a fixed length. Does the hashing add any value compared to the first idea?
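As a sketch of what I mean, assuming HMAC-SHA256 as the MAC and a made-up key (the function name `hashed_fields_mac` is illustrative only):

```python
import hashlib
import hmac

# Hypothetical key, for illustration only.
KEY = b"demo key, do not use"


def mac_input(created_at: str, message: str) -> bytes:
    """SHA256(createdAt) || SHA256(message): always 32 + 32 = 64 bytes."""
    h_created = hashlib.sha256(created_at.encode("utf-8")).digest()
    h_message = hashlib.sha256(message.encode("utf-8")).digest()
    return h_created + h_message


def hashed_fields_mac(created_at: str, message: str) -> str:
    """HMAC-SHA256 over the concatenated per-field digests."""
    return hmac.new(KEY, mac_input(created_at, message), hashlib.sha256).hexdigest()


original = hashed_fields_mac("1621012345", "first entry")
shifted = hashed_fields_mac("1621012", "345first entry")

# Each field is hashed on its own, so moving characters between fields
# changes both digests and therefore the MAC input.
print(hmac.compare_digest(original, shifted))  # prints False
```

Because every digest has the same length, the boundary between the two fields inside the MAC input is always at byte 32, regardless of how long the original values are.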
3. Authenticate the whole structured data
I could also calculate the MAC over the complete JSON object of the log entry. Effectively, this means I have more elaborate, meaningful delimiters (the keys and syntax). Basically, like a JWT.
Note that this approach also has some downsides, which mostly boil down to loss of API flexibility and the need to canonicalize the JSON. There's a great blog post about this at https://latacora.micro.blog/2019/07/24/how-not-to.html.
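A minimal sketch of this third idea, assuming HMAC-SHA256, a made-up key, and an ad hoc canonicalization (sorted keys, compact separators) via Python's `json.dumps`. That canonicalization is my own assumption for illustration, not a standardized canonical form, and both sides would have to agree on it exactly.

```python
import hashlib
import hmac
import json

# Hypothetical key, for illustration only.
KEY = b"demo key, do not use"


def canonical_json(entry: dict) -> bytes:
    """Serialize with sorted keys and no optional whitespace (ad hoc canonical form)."""
    return json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")


def entry_mac(entry: dict) -> str:
    """HMAC-SHA256 over the canonical JSON serialization of the whole entry."""
    return hmac.new(KEY, canonical_json(entry), hashlib.sha256).hexdigest()


original = entry_mac({"createdAt": "1621012345", "message": "first entry"})
reordered = entry_mac({"message": "first entry", "createdAt": "1621012345"})
shifted = entry_mac({"createdAt": "1621012", "message": "345first entry"})

print(hmac.compare_digest(original, reordered))  # prints True: key order is normalized
print(hmac.compare_digest(original, shifted))    # prints False: the values differ
```

The keys and the JSON quoting act as the delimiters here, which is why the serialization has to be byte-for-byte reproducible when the MAC is verified.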
Am I missing any good solutions for this problem? Is there a recommended way to deal with this?