I'm currently working with an undocumented crypto offload processor that is capable of accelerating modular multiplication in some fashion. I need to figure out what operation it is implementing exactly in order to emulate it in software.
The hardware has four big integer inputs:
- Multiplicand $a$
- Multiplier $b$
- Modulus $p$
- Unknown value, let's call it $q$
The output is a single big integer $r$ of the same or lower bit length as $p$.
In practice, the software that uses this hardware function always sets $q = p$.
Here's some example values that I tried:
For the following, assume
p = q = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
(NIST P-256 modulus).
a = 1, b = 1, r = 0xFFFFFFFE00000003FFFFFFFD0000000200000001FFFFFFFE0000000300000000
a = 3, b = 3, r = 0xFFFFFFF60000001BFFFFFFE50000001200000009FFFFFFEE0000001B00000008
a = 123, b = 123, r = 0xFFFFC4E60000B14BFFFF4EB50000763200003B19FFFF89CE0000B14B00003B18
Interestingly the least significant bits always contain the "correct" result minus one, and in the middle of the integer there's the result I'd normally expect in the least significant bits (with the rest of the integer being 0).
I suspect that Montgomery Multiplication may in some form be playing into this. I noticed that when the software does computations in $\Bbb F_p$ for ECC, it first multiplies $a$ with some other number to produce $c$, and then it computes $r = cb$.
That other number is 0x00000004FFFFFFFDFFFFFFFFFFFFFFFEFFFFFFFBFFFFFFFF0000000000000003
in case of the NIST P-256 modulus. If I multiply by that as an intermediate step, I get the correct multiplication result from the hardware.
So my question would be, does anyone have an intuition what exact operation is implemented here and if a library like OpenSSL's libcrypto can be used to perform it?