Score:2

Concrete example of Montgomery Multiplication

pa flag

I have read about Montgomery Multiplication on several sites, but I haven't found any examples on specific numbers that explain the algorithm to someone who doesn't have a PhD in number theory.

I know it involves converting the numbers into a special form called Montgomery Form, which makes modular operations easy and fast, and that converting to and from that form isn't exactly cheap, so we'd prefer to do a multitude of modular operations in Montgomery form before returning to normal form. That's about it.

So I give the following example for a good-hearted soul to work through for me:

(7414122 * 200000 * 2048) MOD 313

using Montgomery Multiplication, and explaining each step and where each auxilliary number and result came from, please. I'd be very grateful as I haven't been able to find any resource that explains it nicely.

Don't get me wrong, I was excellent in calculus and matrices and complex numbers and what not, but never done any of this Ring Theory nonsense most websites use while explaining this, so I want a dumbed-down explanation of this algorithm via this example please.

Or if anyone knows of a place where this algorithm is explained nicely and simply, do point me to it please.

kodlu avatar
sa flag
there is this link (you may have already seen it) https://cp-algorithms.com/algebra/montgomery_multiplication.html#fast-inverse-trick
kodlu avatar
sa flag
if you know matrices, linear algebra, calculus maybe you need to study modular arithmetic which this multiplication depends on (modulo a nonprime number, you get a ring instead of a field). but someone else may be able to answer your question
in flag
The wikipedia page, while not perfect, explains things well enough. You may also want to look at Barrett reduction. The point of Montgomery form or Barrett reduction is to to work with arithmetic modulo a power of 2 instead of the base in question, as reduction modulo $2^n$ is truncation and division by $2^n$ is a shift in binary representation.
Kevin Stefanov avatar
pa flag
I just want to see a concrete worked through example of it cuz I get things best that way
fgrieu avatar
ng flag
I have reservations about the example. The three multiplicands are much wider than the modulus, which is not typical in actual use of Montgomery modular multiplication (which typically involve chained operations modulo a common modulus, as in modular exponentiation or point multiplication on an Elliptic Curve in a prime field). Also the second multiplicand ends with many zeroes, and the third (assuming base 10) is a power of two. What about something more realistic, like $751⋅843⋅214\bmod937$ or $7510⋅8431⋅2143\bmod9137$?
Kevin Stefanov avatar
pa flag
@fgrieu-onstrike Yes, this example works too. Any example works for me, I just want to see the algorithm in action. I will be using Montgomery Multiplication for modular exponentiation, which is part of Rabin-Miller primality test, which I've already implemented. The base is a small number, but the power and the modulus are 3000-bit numbers, which produces 3000 squarings of 3000-bit numbers, each modulo M, and then 3000 multiplications of 3000-bit numbers, modulo M again. These thousands of modular operations I can do in Montgomery Form.
Kevin Stefanov avatar
pa flag
@kodlu I followed the guide in that link and I got stuck at the part where they do the Montgomery Reduction. It says ( (r . r^-1) + (n . n') ) = 1 to find r^-1 and then there's a strange algorithm for finding x.r^-1 mod n, which also involves r^-1, so where do we compute n' in order to use either of these two methods of finding r^-1? Im so confused. Also the strange long ass formula for x.r^-1, which also involves n' (which I can't see a way to find yet) and even some weird number l ? Bro im so confused
Score:8
ng flag

In this answer we study modular multiplications using Montgomery arithmetic, illustrated with the example $7510\cdot 8431\cdot 2143\bmod9137$, working in base $\beta=10$ because the question does. Practice for large integers on modern computers would use $\beta=2^{32}$ or $\beta=2^{64}$, and in order to cover that we use the general term limb for our mere decimal digits.

We use a Positional Numeral System with little-endian convention when it comes to (zero-based) indexes, and low-order digit shown on the right because we are raised on that. E.g. a representation of $A=7510$ is $a_3=7$, $a_2=5$, $a_1=1$, $a_0=0$, so that $$A=\sum_{0\le i<4}a_i\cdot\beta^i$$

We mostly use the notation in Modern Computer Arithmetic, algorithm 2.6. However we interleave multiplication and reduction per the Handbook of Applied Cryptography algorithm 14.36. We detail things more than either of these references. Roseta stone on notation: $$\begin{array}{l|c} \text{This answer}&\beta&\ell&\mu&q&\beta^\ell&N&X\cdot Y&R\\\hline \text{MCA alg. 2.6}&\beta&n&\mu&q_i&\beta^n&N&C&R\\\hline \text{HAC alg. 14.36}&b&n&m'&u_i&R&m&x\cdot y&A\end{array}$$

On the$\bmod$ notation: $A\equiv B\pmod N$ means that $A-B$ is multiple of $N>0$. $A=B\bmod N$ additionally means that $0\le A<N$. In that second notation only,$\bmod$ is an operator, and we could write $B\bmod N=A$. Similarly, $A\equiv B^{-1}\pmod N$ means that $A\cdot B\equiv1\pmod N$, and $A=B^{-1}\bmod N$ additionally means that $0\le A<N$.

Motivation of Montgomery arithmetic

Montgomery arithmetic is a method for computation modulo some integer $N$; that is in $\mathbb Z/N\mathbb Z$, the ring of integers modulo $N$. A Montgomery representation of integers allows a streamlined multiplication modulo $N$. Modular addition remains essentially as usual.

Montgomery arithmetic is often used in cryptographic hardware and software needing modular multiplication because:

  • It simplifies the interleaving of (quadratic) multiplication and modular reduction steps. Interleaving saves a temporary buffer, memory accesses, and index calculations, leading to a speedup. Interleaving and closely the same speedup is also possible with conventional modular reduction, but the code is hairy.
  • The guessing of quotient limbs in long division is replaced by a simple procedure with the only corner case outside the inner loops. That helps make code free from data-dependent timing dependency.

Preliminary checks, determination of $\mu$ and $\ell$

We need $N$ to be coprime with $\beta$, equivalently $\gcd(n_0,\beta)=1$. Otherwise, we'd have to get back to that case in order to use Montgomery arithmetic (when $\beta$ is a power of two, that's by breaking $N$ as the product of an odd integer and a power of two).

Said condition allows computation of a one-limb quantity $\mu$ that we'll use repeatedly: the multiplicative inverse modulo $\beta$ of $-n_0$, that is $\mu=(-n_0)^{-1}\bmod\beta$. For arbitrary $\beta$ we can compute the multiplicative inverse per the (half-)extended Euclidean algorithm, but for $\beta$ a power of two there is a simple, fast, constant-time algorithm.

We also need the integer $\ell$ such that $\beta^{\ell-1}\le N<\beta^\ell$, that is $N$ fits $\ell$ limbs with the high-order limb $n_{\ell-1}$ non-zero.

With our $N=9137$, $n_0=7$ is coprime with $10$. $\mu=7$ since $7\cdot(-7)\equiv1\bmod 10$. $\ell=4$ with $n_3=9\ne0$.

Montgomery representative

A Montgomery representative of an integer $A$ (implicitly: for given modulus $N$, base $\beta$, and $\ell$) is any integer $X$ with $0\le X<\beta^\ell$, expressed as $\ell$ limbs per PNS, such that $X\equiv A\cdot\beta^\ell\pmod N$.

Therefore $X=A\cdot\beta^\ell\bmod N$ is a representative of $A$ which we can find by Euclidean division per schoolbook long division. The $\ell$ low-order limbs of $A\cdot\beta$ are zero, and brought in as needed during the division. We deffer description of a more practical method to compute a Montgomery representative.

Our first integers to multiply $A=7510$, $B=8431$ thus become $X=2997$ (computed as $75100000\bmod9137$) and $Y=2901$ (computed as $84310000\bmod9137$). The $7414122$ and $200000$ in the original question would become $133$ and $062$ with full modular reduction modulo $313$, but partially reduced $446$ and $688$ would do as well. It turns out we won't need a Montgomery representation of the third integer.

Montgomery modular multiplication

Montgomery modular multiplication computes a Montgomery representative $R$ of the product $A\times B$, given Montgomery representatives $X$ and $Y$ of $A$ and $B$.

  • Input: $X$, $Y$ and $N$ each as $\ell$ limbs in base $\beta$; and $\mu=-N^{-1}\bmod\beta$.
  • Output: $R$ as $\ell$ limbs with $R\cdot\beta^\ell\equiv X\cdot Y\pmod N$.

We handle the limbs of $Y$ from low-order to high-order limb, keeping intermediary result $R$ on $\ell+1$ limbs with $r_\ell\in\{0,1\}$, plus a final reduction of $R$ to $\ell$ limbs:

  • Set $R\gets0$, that is set to $0$ it's $\ell$ limbs $r_0$ to $r_{\ell-1}$, and an extra $r_\ell$ that won't exceed $1$ and won't be accessed with a variable index, thus can be a separate bit.
  • For $i$ from $0$ to $\ell-1$, perform $R\gets(q\cdot N+y_i\cdot X+R)/\beta$ per standard arithmetic for one-limb $q$ chosen such that the Euclidean division yields no remainder. We detail how to perform that:
    • Compute temporary quantity $T\gets y_i\cdot x_0+r_0$, which fits two limbs $t_1$ and $t_0$, plus an extra $t_2\gets0$ considered part of $T$.
    • Compute $q$ as the low-order limb of $\mu\cdot t_0$. There remains to finish the computation $R\gets(q\cdot N+y_i\cdot X+R)/\beta$, as follows:
    • For $j$ from $1$ to $\ell-1$
      • Compute $T\gets t_2\cdot\beta+t_1+q\cdot p_j+y_i\cdot x_j+r_j$, which fits three limbs with $t_2\in\{0,1\}$.
      • Set† $r_{j-1}\gets t_0$.
    • Compute $T\gets t_2\cdot\beta+t_1+r_\ell$, which fits two limbs with $t_1\in\{0,1\}$
    • Set† $r_{\ell-1}\gets t_0$ and $r_\ell\gets t_1$.
  • If $r_\ell\ne0$, then compute $R\gets R-N$. It's not necessary to update $r_\ell$ which won't be needed and would be $0$.
  • Output $R$ as $\ell$ limbs $r_0\ldots r_{\ell-1}$.

In our example $\beta=10$, $\ell=4$, $N=9137$, $\mu=7$, and the first two values to multiply are $X=2997$, $Y=2901$ (the Montgomery forms of $A=7510$, $B=8431$).

  • We start with $R=00000$
  • $i=0$: $y_0=1$, the first $T$ is $1\cdot7+0=007$, $q=7\cdot7\bmod10=9$, thus $R\gets(9\cdot9137+1\cdot2997+00000)/10=85230/10=08523$
  • $i=1$: $y_1=0$, the first $T$ is $0\cdot7+3=003$, $q=7\cdot3\bmod10=1$, thus $R\gets(1\cdot9137+0\cdot2997+08523)/10=17660/10=01766$
  • $i=2$: $y_2=9$, the first $T$ is $9\cdot7+6=069$, $q=7\cdot9\bmod10=3$, thus $R\gets(3\cdot9137+9\cdot2997+01766)/10=56150/10=05615$
  • $i=3$: $y_3=2$, the first $T$ is $2\cdot7+5=019$, $q=7\cdot9\bmod10=3$, thus $R\gets(3\cdot9137+2\cdot2997+05615)/10=39020/10=03902$
  • $r_4=0$ thus no adjustment of $R$ is needed
  • We have $R=3902$. Indeed, $3902\cdot10000\equiv2997\cdot2901\pmod{9137}$

Similarly process all operands

In our example, our best avenue is to integrate our second and last multiplication with the production of the final result in regular RNS, per 2 below.

In a modular exponentiation with an exponent of thousands bits, we may make thousands similar Montgomery modular multiplications in sequence. Some of them will be modular squarings, which we can handle as regular modular multiplication with $Y=X$.

In the context of point multiplication by a scalar in Elliptic Curve Cryptography with prime field and an efficient coordinate system, there is also additions modulo $N$, which we can handle directly in Montgomery form as standard addition, followed by modular reduction only inasmuch as indispensable to keep the result on $\ell$ limbs.

Conversion back to regular PNS

The final result is some $F$ in PNS with $0\le F<N$ and without the $\beta^\ell$ factor of Montgomery form.

  1. General case: We get a Montgomery representation $R$ of the desired $F$. Therefore, $F=\beta^{-\ell}\cdot R\bmod N$. We could compute $F$ as $M\gets=(\beta^\ell)^{-1}\bmod N$, then $F\gets M\cdot R\bmod N$, using standard algorithms. But the computation of $M$ is slightly cumbersome, and $F\gets M\cdot R\bmod N$ would require coding regular modular multiplication.

    Instead, it's simpler to compute $F$ from $R$ using Montgomery modular multiplication with inputs $X=1$ and $Y=R$, yielding a new $R$, which only needs a simple final modular reduction $F\gets R\bmod N$. The algorithm simplifies: $t_2$ is not needed, the first $T$ is $y_0+r_0$, the next ones $t_1+q\cdot p_j+r_j$.

  2. Ending in multiplication by an input: If the last operation in the computation is a Montgomery modular multiplication with one input given in standard PNS over $\ell$ limbs, the simplest is to make the last Montgomery modular multiplication with the argument in regular PNS, skipping it's conversion to Montgomery form. That yields $R$ with $F=R\bmod N$, which only needs a simple final modular reduction $F\gets R\bmod N$.

    In our example, the second Montgomery modular multiplication would be with $X=3902$, $Y=2143$, which yields $R=1770=F$. Indeed, $1770=7510\cdot8431\cdot2143\bmod9137$.

  3. Testing equality: If we need to test the result for equality to some quantity, we can perform that test in Montgomery form after a full reduction modulo $N$ of both. An alternative is to subtract with sign, reduce modulo $N$, and test for zero.

    E.g. in a strong pseudoprime test, we repeatedly need to determine if $F\in\{1,N-1\}$. We can precompute $K=\beta^\ell\bmod N$ and $K'=N-K$, then test these conditions as $R\bmod N\,\in\{K,K'\}$

Practical method to compute a Montgomery representative

We can use Montgomery modular multiplication to compute a Montgomery representative $X$ from an $\ell$-limb integer $A$ in regular PNS. We perform Montgomery modular multiplication with one input set to sets to $\beta^{2\cdot\ell}\bmod p$ (here $100000000\bmod9137=4672$), the other set to $A$, and the output $R$ is the desired $X$.


Code optimization considerations

For sizable $\ell$, Montgomery modular multiplication spends most of the time in the inner loop on $j$. That uses a temporary variable $T$ with $t_0$ and $t_1$ limbs (best registers), and $t_2$ a bit (possibly part of a status register).

Each of the $\ell^2-\ell$ iterations per Montgomery modular multiplication computes $T\gets t_2\cdot\beta+t_1+q\cdot p_j+y_i\cdot x_j+r_j$, with $t_0$ of the outcome written to memory as a new $r_{j-1}$. That's $3$ limb reads; $1$ limb write; $2$ limb multiplications each yielding a high and a low limb; $4$ limb additions when we handle single-bit quantities as carries.

This arithmetic is a good fit for MULX, ADCX, ADOX instructions available in many x86-64 CPUs (starting 2013, mainstream since ≈2016, testable by BMI2 and ADX bits of CPUID EAX=7 ECX=0).

Getting subquadratic

Montgomery arithmetic can be combined with subquadratic multiplication algorithms to get the run time less than $\mathcal O(\ell^2)$. We loose the simplicity of interleaving, and need to widen $\mu$ to $\ell$ limbs. Refer to Modern Computer Arithmetic, second half of section 2.4.2.


† An optimizing compiler can integrate that in the previous step, noticing that the source limb in $T$ is no longer needed.

Kevin Stefanov avatar
pa flag
Currently I've implemented 3000 sequential modular multiplications of 3000-bit numbers mod N as simply: compute one MUL, get a 6000-bit result, divide it by N and get the remainder, which you multiply by the next 3000-bit number, get another 6000-bit result, divide this by N and get the result, and so on. Is doing those in Montgomery Space gonna be much faster? Because you mentioned the same speedup is possible with interleaving with the usual modular multiplication which concerned me a bit whether I even need to implement Montgomery Multiplication in the first place.
Kevin Stefanov avatar
pa flag
N is a 3000-bit number, that's why after each MUL's 6000-bit result when we divide by N and get the remainder, we get back to a 3000-bit number, to be multiplied by the next 3000-bit number in the chain of several thousand sequential modular multiplications mod N (each number in the chain is 3000-bits). Then get another 6000-bit MUL result, divide by N and take remainder, which gets us back to a 3000-bit number, and so on and so forth. This way I perform about 3000 modular multiplications on 3000-bit numbers.
Kevin Stefanov avatar
pa flag
So given this do you think it's worth it to implement Montgomery Multiplication to do those 3000 modular MUL operations with, instead of normal reduction after each MUL? If i want to make them run faster?
fgrieu avatar
ng flag
@KevinStefanov: Montgomery arithmetic does not by itself gives a large speedup. A factor of 3 would be surprising even if we include the benefit of interleaving. The essential thing for 3000-bit numbers is using wide limbs so that $\ell=\lceil3000/\log_2(\beta)\rceil$ is as low as possible, AND an algorithm doing in the order of $2\ell^2$ limb multiplications per 3000×3000 multiplication and modular reduction, or something subquadratic like Karatsuba (but the later is much more complex and seldom attempted at first).
fgrieu avatar
ng flag
**Comments have been [moved to chat](https://chat.stackexchange.com/rooms/147436/discussion-on-answer-by-fgrieu-onstrike-concrete-example-of-montgomery-multiplic); please do not continue the discussion here.** Before posting a comment below this one, please review the [purposes of comments](/help/privileges/comment). Comments that do not request clarification or suggest improvements usually belong as an [answer](/help/how-to-answer), on [meta], or in [chat]. Comments continuing discussion may be removed.
I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.