In this answer we study modular multiplications using Montgomery arithmetic, illustrated with the example $7510\cdot 8431\cdot 2143\bmod9137$, working in base $\beta=10$ because the question does. Practice for large integers on modern computers would use $\beta=2^{32}$ or $\beta=2^{64}$, and in order to cover that we use the general term limb for our mere decimal digits.
We use a Positional Numeral System with little-endian convention when it comes to (zero-based) indexes, and low-order digit shown on the right because we are raised on that. E.g. a representation of $A=7510$ is $a_3=7$, $a_2=5$, $a_1=1$, $a_0=0$, so that
$$A=\sum_{0\le i<4}a_i\cdot\beta^i$$
We mostly use the notation in Modern Computer Arithmetic, algorithm 2.6. However we interleave multiplication and reduction per the Handbook of Applied Cryptography algorithm 14.36. We detail things more than either of these references. Roseta stone on notation:
$$\begin{array}{l|c}
\text{This answer}&\beta&\ell&\mu&q&\beta^\ell&N&X\cdot Y&R\\\hline
\text{MCA alg. 2.6}&\beta&n&\mu&q_i&\beta^n&N&C&R\\\hline
\text{HAC alg. 14.36}&b&n&m'&u_i&R&m&x\cdot y&A\end{array}$$
On the$\bmod$ notation: $A\equiv B\pmod N$ means that $A-B$ is multiple of $N>0$. $A=B\bmod N$ additionally means that $0\le A<N$. In that second notation only,$\bmod$ is an operator, and we could write $B\bmod N=A$. Similarly, $A\equiv B^{-1}\pmod N$ means that $A\cdot B\equiv1\pmod N$, and $A=B^{-1}\bmod N$ additionally means that $0\le A<N$.
Motivation of Montgomery arithmetic
Montgomery arithmetic is a method for computation modulo some integer $N$; that is in $\mathbb Z/N\mathbb Z$, the ring of integers modulo $N$. A Montgomery representation of integers allows a streamlined multiplication modulo $N$. Modular addition remains essentially as usual.
Montgomery arithmetic is often used in cryptographic hardware and software needing modular multiplication because:
- It simplifies the interleaving of (quadratic) multiplication and modular reduction steps. Interleaving saves a temporary buffer, memory accesses, and index calculations, leading to a speedup. Interleaving and closely the same speedup is also possible with conventional modular reduction, but the code is hairy.
- The guessing of quotient limbs in long division is replaced by a simple procedure with the only corner case outside the inner loops. That helps make code free from data-dependent timing dependency.
Preliminary checks, determination of $\mu$ and $\ell$
We need $N$ to be coprime with $\beta$, equivalently $\gcd(n_0,\beta)=1$. Otherwise, we'd have to get back to that case in order to use Montgomery arithmetic (when $\beta$ is a power of two, that's by breaking $N$ as the product of an odd integer and a power of two).
Said condition allows computation of a one-limb quantity $\mu$ that we'll use repeatedly: the multiplicative inverse modulo $\beta$ of $-n_0$, that is $\mu=(-n_0)^{-1}\bmod\beta$. For arbitrary $\beta$ we can compute the multiplicative inverse per the (half-)extended Euclidean algorithm, but for $\beta$ a power of two there is a simple, fast, constant-time algorithm.
We also need the integer $\ell$ such that $\beta^{\ell-1}\le N<\beta^\ell$, that is $N$ fits $\ell$ limbs with the high-order limb $n_{\ell-1}$ non-zero.
With our $N=9137$, $n_0=7$ is coprime with $10$. $\mu=7$ since $7\cdot(-7)\equiv1\bmod 10$. $\ell=4$ with $n_3=9\ne0$.
Montgomery representative
A Montgomery representative of an integer $A$ (implicitly: for given modulus $N$, base $\beta$, and $\ell$) is any integer $X$ with $0\le X<\beta^\ell$, expressed as $\ell$ limbs per PNS, such that $X\equiv A\cdot\beta^\ell\pmod N$.
Therefore $X=A\cdot\beta^\ell\bmod N$ is a representative of $A$ which we can find by Euclidean division per schoolbook long division. The $\ell$ low-order limbs of $A\cdot\beta$ are zero, and brought in as needed during the division. We deffer description of a more practical method to compute a Montgomery representative.
Our first integers to multiply $A=7510$, $B=8431$ thus become $X=2997$ (computed as $75100000\bmod9137$) and $Y=2901$ (computed as $84310000\bmod9137$). The $7414122$ and $200000$ in the original question would become $133$ and $062$ with full modular reduction modulo $313$, but partially reduced $446$ and $688$ would do as well. It turns out we won't need a Montgomery representation of the third integer.
Montgomery modular multiplication
Montgomery modular multiplication computes a Montgomery representative $R$ of the product $A\times B$, given Montgomery representatives $X$ and $Y$ of $A$ and $B$.
- Input: $X$, $Y$ and $N$ each as $\ell$ limbs in base $\beta$; and $\mu=-N^{-1}\bmod\beta$.
- Output: $R$ as $\ell$ limbs with $R\cdot\beta^\ell\equiv X\cdot Y\pmod N$.
We handle the limbs of $Y$ from low-order to high-order limb, keeping intermediary result $R$ on $\ell+1$ limbs with $r_\ell\in\{0,1\}$, plus a final reduction of $R$ to $\ell$ limbs:
- Set $R\gets0$, that is set to $0$ it's $\ell$ limbs $r_0$ to $r_{\ell-1}$, and an extra $r_\ell$ that won't exceed $1$ and won't be accessed with a variable index, thus can be a separate bit.
- For $i$ from $0$ to $\ell-1$, perform $R\gets(q\cdot N+y_i\cdot X+R)/\beta$ per standard arithmetic for one-limb $q$ chosen such that the Euclidean division yields no remainder. We detail how to perform that:
- Compute temporary quantity $T\gets y_i\cdot x_0+r_0$, which fits two limbs $t_1$ and $t_0$, plus an extra $t_2\gets0$ considered part of $T$.
- Compute $q$ as the low-order limb of $\mu\cdot t_0$. There remains to finish the computation $R\gets(q\cdot N+y_i\cdot X+R)/\beta$, as follows:
- For $j$ from $1$ to $\ell-1$
- Compute $T\gets t_2\cdot\beta+t_1+q\cdot p_j+y_i\cdot x_j+r_j$, which fits three limbs with $t_2\in\{0,1\}$.
- Set† $r_{j-1}\gets t_0$.
- Compute $T\gets t_2\cdot\beta+t_1+r_\ell$, which fits two limbs with $t_1\in\{0,1\}$
- Set† $r_{\ell-1}\gets t_0$ and $r_\ell\gets t_1$.
- If $r_\ell\ne0$, then compute $R\gets R-N$. It's not necessary to update $r_\ell$ which won't be needed and would be $0$.
- Output $R$ as $\ell$ limbs $r_0\ldots r_{\ell-1}$.
In our example $\beta=10$, $\ell=4$, $N=9137$, $\mu=7$, and the first two values to multiply are $X=2997$, $Y=2901$ (the Montgomery forms of $A=7510$, $B=8431$).
- We start with $R=00000$
- $i=0$: $y_0=1$, the first $T$ is $1\cdot7+0=007$, $q=7\cdot7\bmod10=9$, thus $R\gets(9\cdot9137+1\cdot2997+00000)/10=85230/10=08523$
- $i=1$: $y_1=0$, the first $T$ is $0\cdot7+3=003$, $q=7\cdot3\bmod10=1$, thus $R\gets(1\cdot9137+0\cdot2997+08523)/10=17660/10=01766$
- $i=2$: $y_2=9$, the first $T$ is $9\cdot7+6=069$, $q=7\cdot9\bmod10=3$, thus $R\gets(3\cdot9137+9\cdot2997+01766)/10=56150/10=05615$
- $i=3$: $y_3=2$, the first $T$ is $2\cdot7+5=019$, $q=7\cdot9\bmod10=3$, thus $R\gets(3\cdot9137+2\cdot2997+05615)/10=39020/10=03902$
- $r_4=0$ thus no adjustment of $R$ is needed
- We have $R=3902$. Indeed, $3902\cdot10000\equiv2997\cdot2901\pmod{9137}$
Similarly process all operands
In our example, our best avenue is to integrate our second and last multiplication with the production of the final result in regular RNS, per 2 below.
In a modular exponentiation with an exponent of thousands bits, we may make thousands similar Montgomery modular multiplications in sequence. Some of them will be modular squarings, which we can handle as regular modular multiplication with $Y=X$.
In the context of point multiplication by a scalar in Elliptic Curve Cryptography with prime field and an efficient coordinate system, there is also additions modulo $N$, which we can handle directly in Montgomery form as standard addition, followed by modular reduction only inasmuch as indispensable to keep the result on $\ell$ limbs.
Conversion back to regular PNS
The final result is some $F$ in PNS with $0\le F<N$ and without the $\beta^\ell$ factor of Montgomery form.
General case: We get a Montgomery representation $R$ of the desired $F$. Therefore, $F=\beta^{-\ell}\cdot R\bmod N$. We could compute $F$ as $M\gets=(\beta^\ell)^{-1}\bmod N$, then $F\gets M\cdot R\bmod N$, using standard algorithms. But the computation of $M$ is slightly cumbersome, and $F\gets M\cdot R\bmod N$ would require coding regular modular multiplication.
Instead, it's simpler to compute $F$ from $R$ using Montgomery modular multiplication with inputs $X=1$ and $Y=R$, yielding a new $R$, which only needs a simple final modular reduction $F\gets R\bmod N$. The algorithm simplifies: $t_2$ is not needed, the first $T$ is $y_0+r_0$, the next ones $t_1+q\cdot p_j+r_j$.
Ending in multiplication by an input: If the last operation in the computation is a Montgomery modular multiplication with one input given in standard PNS over $\ell$ limbs, the simplest is to make the last Montgomery modular multiplication with the argument in regular PNS, skipping it's conversion to Montgomery form. That yields $R$ with $F=R\bmod N$, which only needs a simple final modular reduction $F\gets R\bmod N$.
In our example, the second Montgomery modular multiplication would be with $X=3902$, $Y=2143$, which yields $R=1770=F$. Indeed, $1770=7510\cdot8431\cdot2143\bmod9137$.
Testing equality: If we need to test the result for equality to some quantity, we can perform that test in Montgomery form after a full reduction modulo $N$ of both. An alternative is to subtract with sign, reduce modulo $N$, and test for zero.
E.g. in a strong pseudoprime test, we repeatedly need to determine if $F\in\{1,N-1\}$. We can precompute $K=\beta^\ell\bmod N$ and $K'=N-K$, then test these conditions as $R\bmod N\,\in\{K,K'\}$
Practical method to compute a Montgomery representative
We can use Montgomery modular multiplication to compute a Montgomery representative $X$ from an $\ell$-limb integer $A$ in regular PNS. We perform Montgomery modular multiplication with one input set to sets to $\beta^{2\cdot\ell}\bmod p$ (here $100000000\bmod9137=4672$), the other set to $A$, and the output $R$ is the desired $X$.
Code optimization considerations
For sizable $\ell$, Montgomery modular multiplication spends most of the time in the inner loop on $j$. That uses a temporary variable $T$ with $t_0$ and $t_1$ limbs (best registers), and $t_2$ a bit (possibly part of a status register).
Each of the $\ell^2-\ell$ iterations per Montgomery modular multiplication computes $T\gets t_2\cdot\beta+t_1+q\cdot p_j+y_i\cdot x_j+r_j$, with $t_0$ of the outcome written to memory as a new $r_{j-1}$. That's $3$ limb reads; $1$ limb write; $2$ limb multiplications each yielding a high and a low limb; $4$ limb additions when we handle single-bit quantities as carries.
This arithmetic is a good fit for MULX, ADCX, ADOX instructions available in many x86-64 CPUs (starting 2013, mainstream since ≈2016, testable by BMI2 and ADX bits of CPUID EAX=7 ECX=0).
Getting subquadratic
Montgomery arithmetic can be combined with subquadratic multiplication algorithms to get the run time less than $\mathcal O(\ell^2)$. We loose the simplicity of interleaving, and need to widen $\mu$ to $\ell$ limbs. Refer to Modern Computer Arithmetic, second half of section 2.4.2.
† An optimizing compiler can integrate that in the previous step, noticing that the source limb in $T$ is no longer needed.