Reverse engineering hardware crypto processor for modular multiplication

Hendi

7/25/24, 11:19 PM

I'm currently working with an undocumented crypto offload processor that is capable of accelerating modular multiplication in some fashion. I need to figure out what operation it is implementing exactly in order to emulate it in software.

The hardware has four big integer inputs:

Multiplicand $a$
Multiplier $b$
Modulus $p$
Unknown value, let's call it $q$

The output is a single big integer $r$ of the same or lower bit length as $p$.

In practice, the software that uses this hardware function always sets $q = p$.

Here are some example values that I tried:

For the following, assume p = q = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF (NIST P-256 modulus).

a = 1, b = 1, r = 0xFFFFFFFE00000003FFFFFFFD0000000200000001FFFFFFFE0000000300000000

a = 3, b = 3, r = 0xFFFFFFF60000001BFFFFFFE50000001200000009FFFFFFEE0000001B00000008

a = 123, b = 123, r = 0xFFFFC4E60000B14BFFFF4EB50000763200003B19FFFF89CE0000B14B00003B18

Interestingly, the least significant bits always contain the "correct" result minus one, and the middle of the integer contains the result I'd normally expect to see in the least significant bits (with the rest of the integer being 0).

I suspect that Montgomery Multiplication may in some form be playing into this. I noticed that when the software does computations in $\Bbb F_p$ for ECC, it first multiplies $a$ with some other number to produce $c$, and then it computes $r = cb$.

That other number is 0x00000004FFFFFFFDFFFFFFFFFFFFFFFEFFFFFFFBFFFFFFFF0000000000000003 in case of the NIST P-256 modulus. If I multiply by that as an intermediate step, I get the correct multiplication result from the hardware.

So my question is: does anyone have an intuition for what exact operation is implemented here, and whether a library like OpenSSL's libcrypto can be used to perform it?


cryptographic-hardware

modular-arithmetic

montgomery-multiplication

Score:6

Crypto

fgrieu

7/26/24, 6:00 AM

The numbers are consistent with the hypothesis that this hardware allows Montgomery modular multiplication of $a$ and $b$ modulo $p$ for (at least some) odd 256-bit $p$. Meaning it computes a 256-bit quantity $r$ with $$r\equiv k\cdot a\cdot b\pmod p\quad\text{where}\ k=2^{-256}\bmod p$$

With the given $p$, it holds $$k=\mathtt{fffffffe00000003fffffffd0000000200000001fffffffe0000000300000000}$$

In the three examples, it further holds $r=k\cdot a\cdot b\bmod p$ (that is $r<p$ on top of the above), but I can't tell if that always holds. That would require the hardware to perform more than the standard Montgomery modular multiplication. For a description of that see Handbook of Applied Cryptography algorithm 14.36, which I detail here.
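The hypothesis can be checked against the question's three examples with a few lines of Python. This is only a minimal sketch: `mont_mul_model` is a hypothetical name, and the full reduction to $r<p$ is an assumption that happens to match the examples, not a confirmed property of the device.

```python
# Requires Python 3.8+ (pow() with a negative exponent and a modulus).
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF  # NIST P-256 prime

def mont_mul_model(a: int, b: int, p: int = P, bits: int = 256) -> int:
    """Hypothetical model of the device: a * b * 2^(-bits) mod p, fully reduced."""
    k = pow(2, -bits, p)  # k = 2^(-256) mod p for the default width
    return (a * b * k) % p

# (a, b, r) triples copied from the question.
EXAMPLES = [
    (1, 1, 0xFFFFFFFE00000003FFFFFFFD0000000200000001FFFFFFFE0000000300000000),
    (3, 3, 0xFFFFFFF60000001BFFFFFFE50000001200000009FFFFFFEE0000001B00000008),
    (123, 123, 0xFFFFC4E60000B14BFFFF4EB50000763200003B19FFFF89CE0000B14B00003B18),
]

ALL_MATCH = all(mont_mul_model(a, b) == r for a, b, r in EXAMPLES)
print("model matches all examples:", ALL_MATCH)
```

Inputs that are not already reduced modulo $p$ would be the interesting test cases on real hardware, since that is where a device doing only the standard reduction could return some $r\ge p$.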

The question's other number is $2^{2\cdot 256}\bmod p$, so that the procedure described computes a Montgomery representative $r$ of $a$, that is a 256-bit $r$ with $r\equiv2^{256}\cdot a\pmod p$.
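As a sanity check of that reading, the sketch below confirms that the question's constant equals $2^{512}\bmod p$ and that pre-multiplying by it yields an ordinary product. It reuses the same hypothetical model of the device ($a\cdot b\cdot2^{-256}\bmod p$, fully reduced), which is an assumption rather than documented behaviour.

```python
# Requires Python 3.8+ (pow() with a negative exponent and a modulus).
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF  # NIST P-256 prime
R2 = 0x00000004FFFFFFFDFFFFFFFFFFFFFFFEFFFFFFFBFFFFFFFF0000000000000003  # "other number" from the question

def mont_mul_model(a: int, b: int, p: int = P, bits: int = 256) -> int:
    """Hypothetical model of the device: a * b * 2^(-bits) mod p."""
    return (a * b * pow(2, -bits, p)) % p

# The constant is 2^(2*256) mod p.
R2_MATCHES = (R2 == pow(2, 512, P))

a, b = 123, 456              # arbitrary illustrative operands
c = mont_mul_model(a, R2)    # c = a * 2^256 mod p, the Montgomery representative of a
r = mont_mul_model(c, b)     # the factor 2^256 cancels, leaving a * b mod p

print(R2_MATCHES, r == (a * b) % P)
```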

To convert back from Montgomery form to the regular representation, the device can be used with one of its inputs set to $1$. The output is then some $r$ congruent modulo $p$ to the desired integer. Or perhaps the device further ensures $r<p$, but again we can't tell from the examples available.
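A round trip through Montgomery form might look like the sketch below. The helpers `to_mont` and `from_mont` are hypothetical names, and the model again assumes the device returns the fully reduced value $a\cdot b\cdot2^{-256}\bmod p$.

```python
# Requires Python 3.8+ (pow() with a negative exponent and a modulus).
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF  # NIST P-256 prime
BITS = 256
R2 = pow(2, 2 * BITS, P)  # 2^512 mod p, used to enter Montgomery form

def mont_mul_model(a: int, b: int, p: int = P, bits: int = BITS) -> int:
    """Hypothetical model of the device: a * b * 2^(-bits) mod p."""
    return (a * b * pow(2, -bits, p)) % p

def to_mont(x: int) -> int:
    """Enter Montgomery form: x -> x * 2^256 mod p."""
    return mont_mul_model(x, R2)

def from_mont(x_mont: int) -> int:
    """Leave Montgomery form by multiplying with 1: x * 2^256 -> x mod p."""
    return mont_mul_model(x_mont, 1)

a = 0x1234567890ABCDEF            # arbitrary illustrative value
a_m = to_mont(a)
# Chained multiplications stay in Montgomery form; here a^3.
cube_m = mont_mul_model(mont_mul_model(a_m, a_m), a_m)
cube = from_mont(cube_m)

print(cube == pow(a, 3, P))
```

Only the first and last steps pay for a conversion, which is why software doing many field operations for ECC would keep its intermediate values in Montgomery form.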

I have no idea what $q$ could be, or what happens when $q\ne p$, or when $p$ is even, or if the device has further restrictions like requiring more than one low-order bit of $p$ to be set, or if it can be used with arguments of different width.

These details make it uncertain if and when OpenSSL's internal function bn_mul_mont (or its many variations) can be used to perform the same thing. Note that it needs extra code to compute n0, as $(-p)^{-1}\bmod 2^{256}$ in the question's context.
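To illustrate what that constant is for, here is a minimal sketch of textbook Montgomery reduction on whole 256-bit integers, with a final conditional subtraction. The names `mont_n0` and `mont_mul_redc` are hypothetical. This is not OpenSSL's code, which works word by word and therefore only needs the low word(s) of this constant.

```python
# Requires Python 3.8+ (pow() with a negative exponent and a modulus).
P = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF  # NIST P-256 prime

def mont_n0(p: int, bits: int = 256) -> int:
    """(-p)^(-1) mod 2^bits; p must be odd for the inverse to exist."""
    mask = (1 << bits) - 1
    return (-pow(p, -1, 1 << bits)) & mask

def mont_mul_redc(a: int, b: int, p: int = P, bits: int = 256) -> int:
    """Montgomery product a*b*2^(-bits) mod p for 0 <= a, b < p."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * mont_n0(p, bits)) & mask  # makes t + m*p divisible by 2^bits
    u = (t + m * p) >> bits                     # exact division; u < 2p here
    return u - p if u >= p else u               # final conditional subtraction

N0 = mont_n0(P)
r = mont_mul_redc(123, 123)
print(hex(N0), hex(r))
```

For the P-256 prime the low 96 bits of $p$ are all ones, so the low 64 bits of this constant are simply 1. Dropping the last line of `mont_mul_redc` gives a variant whose output is only congruent to the result and may exceed $p$, which is the ambiguity about the device discussed above.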

If the goal is virtualization to run unspecified software, then said details matter. If the goal is maintenance of an identified software, then instead I'd consider understanding the interface of the functions using that hardware at some higher level in the software stack, like point multiplication $x\cdot P$ (and similar: $x\cdot P+y\cdot Q$ per Shamir's trick), and perhaps a regular modular multiplication modulo the group order. Or at the next higher abstraction level, like ECDSA and ECDH.

