Score:1

Special algorithms for edge cases of binary arithmetic?


I have several mathematical operations on binary numbers that are special cases of more general arithmetic operations. I am wondering whether there exist more specialized algorithms purpose-made for these edge cases that are faster than the general algorithm. Could someone point me to such algorithms if they exist?

Please note I'm talking specifically about doing math on binary unsigned integers only: not binary floats or negative numbers, just binary positive integers.

Here are the specific edge cases I'm looking for specialized algorithms for:

  • Squaring a binary number. Currently I'm using the generic powering algorithm with an input power of 2 to square a given binary number. Is there a specialized algorithm for this that I can use instead, one that's faster than general powering?

  • Taking only the remainder of binary division without being interested in the quotient. Currently, if I'm only after the remainder of dividing two binary numbers, I do an entire division, store the quotient and the remainder, and never look at the quotient again. Is there a (faster) specialized algorithm that only gets you the remainder of a division?

  • Adding 1 to a binary number. Currently I do the literal addition ($N+1$) using the general binary addition algorithm. (For this one I've found a few, but I'm asking here because I don't know which is the fastest.)

poncho:
What is a "binary number"? Is it "an integer expressed in binary"? Are we talking about speed on a general CPU? If so, wouldn't the answer "use the standard algorithms, which are implemented based on digits which are a power of 2" be sufficient?
Kevin Stefanov:
Binary integers only, not floats. Even better, binary unsigned integers only.
Kevin Stefanov:
I'm open to algorithms that exploit the GPU's CUDA cores as well as general CPU algorithms.
poncho:
How big of integers are we talking about? It's hard to use a GPU to accelerate the incrementation of a 5 bit number...
Kevin Stefanov:
I am operating on 3000-bit and possibly 6000-bit numbers.
poncho:
Have you considered GMP (the GNU math library)? That's fairly well optimized, and has algorithms that work well in the range of values you've mentioned.
Kevin Stefanov:
I need the algorithms themselves because I'm writing my own BigInt library.
Score:3

The question is about arithmetic over non-negative (aka unsigned) integers. Formally, the adjective "binary" applies to the representation of these integers, and a "binary unsigned integer" could be construed as one restricted to the set $\{0,1\}$.

Actually, most big-integer arithmetic libraries do not use binary, but a Positional Numeral System with a larger base $\beta$. Then "binary" means $\beta=2^b$ for some $b$, and for efficiency we want $b\gg1$: e.g. $b=32$ in Java's BigInteger, $b=30$ in Python, $b=64$ in GMP on most modern CPUs/platforms. An integer $X$ in $[0,2^k)$ (that is, an at-most $k$-bit integer) is represented by at least $\lceil k/b\rceil$ integers $x_i\in[0,2^b)$. Each $x_i$ is called a limb, or a base-$\beta$ digit, or just a digit when there is no ambiguity. A limb $x_i$ is typically stored as x[i] in a vector or array x[] with $\ell=\lceil k/b\rceil$ entries.

When using the little-endian convention and no offset, the relation between $X$ and the $x_i$ is simply: $$X\,=\,\sum_{0\le i<\ell}x_i\,\beta^i\,=\,\sum_{0\le i<\ell}x_i\,2^{i\times b}$$ Note: the endianness convention among the elements of x[] is not to be confused with the endianness convention within an individual limb x[i], which is typically dictated by the CPU or platform.
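
For instance, with $b=64$, the 3000-bit integers mentioned in the comments take $\ell=\lceil3000/64\rceil=47$ limbs x[0] to x[46], and $X=x_0+x_1\,2^{64}+x_2\,2^{128}+\dots+x_{46}\,2^{2944}$.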

A word of caution: "writing my own BigInt library" and caring for performance is an excellent recipe to spend years on that low-level stuff and still end up with something that's unsafe in cryptographic applications and is trounced on both speed and security by \$1 in dedicated hardware, e.g. a bank card.


Squaring a binary number (…) I'm using the generic powering algorithm with input power of 2

A most obvious optimization is to compute $X^2$ as $X\times X$. For some implementations of the powering algorithm, that saves one multiplication of $1$ by $X$ or $X^2$. Also, that likely‡ saves the time needed to analyze the exponent and to pass it as a parameter.

Starting from a copy of existing code for multiplication of arbitrary $X$ and $Y$, we can remove the parameter(s) for $Y$ and reference the parameter(s) for $X$ instead. This will often‡ give some measurable performance saving, because the code generator will have more registers available and the calling sequence is shorter.

There are algorithms that compute $X\times X$ for arbitrary $X$ appreciably faster than a comparable algorithm computing $X\times Y$ for arbitrary $X$ and $Y$ (with $Y$ set to $X$). That includes long multiplication (the schoolbook algorithm when $\beta=10$). When multiplying two arbitrary $\ell$-limb integers $X$ and $Y$ by this method, we need $\ell^2$ limb multiplications to compute the $x_i\times y_j$ with $0\le i<\ell$ and $0\le j<\ell$; but when $X=Y$ we have $x_i\times y_j=x_j\times y_i$, and we can do with $\ell(\ell+1)/2$ limb multiplications, computing the $x_i\times x_j$ with $0\le i\le j<\ell$ and adding each product $x_i\times x_j$ at two positions in a provisional result when $i\ne j$.
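
For illustration, here is a minimal sketch of that squaring strategy (code of my own, not from any particular library), assuming 32-bit limbs ($\beta=2^{32}$) so that each limb product fits in a uint64_t: it accumulates each cross product $x_i\times x_j$ ($i<j$) once, doubles the running total, then adds the diagonal squares $x_i\times x_i$.

#include <stdint.h>
#include <string.h>

// Square the l-limb little-endian integer x[] into the 2*l-limb z[].
void bigint_sqr(uint32_t *z, const uint32_t *x, size_t l)
{
    memset(z, 0, 2 * l * sizeof *z);       // clear the result
    for (size_t i = 0; i < l; ++i) {       // cross products x_i*x_j, i<j, each once
        uint64_t c = 0;
        for (size_t j = i + 1; j < l; ++j) {
            uint64_t t = (uint64_t)x[i] * x[j] + z[i + j] + c;
            z[i + j] = (uint32_t)t;        // low 32 bits stay at position i+j
            c = t >> 32;                   // high 32 bits carry one position up
        }
        z[i + l] = (uint32_t)c;            // position i+l is untouched so far
    }
    uint32_t msb = 0;                      // double the sum of cross products
    for (size_t k = 0; k < 2 * l; ++k) {   // (left shift by one bit)
        uint32_t next = z[k] >> 31;
        z[k] = (z[k] << 1) | msb;
        msb = next;
    }
    uint64_t c = 0;                        // add the diagonal squares x_i*x_i
    for (size_t i = 0; i < l; ++i) {
        uint64_t t = (uint64_t)x[i] * x[i] + z[2 * i] + c;
        z[2 * i] = (uint32_t)t;
        t = (t >> 32) + z[2 * i + 1];
        z[2 * i + 1] = (uint32_t)t;
        c = t >> 32;                       // carry into the next even position
    }                                      // final c is 0: X*X fits in 2*l limbs
}

The nested loops perform $\ell(\ell-1)/2$ limb multiplications and the last loop $\ell$ more, for the $\ell(\ell+1)/2$ total noted above.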


Taking only the remainder of binary Euclidean division

I know no general method working for an isolated division, beyond removing the parameter(s) for the quotient array and the code storing into it. This will often‡ give some measurable performance saving, by freeing registers, taxing the cache subsystem less, and simplifying the calling sequence.

When doing a series of modular multiplications and/or additions with the same modulus $N$, which is very common in cryptography (e.g. modular exponentiation in RSA, multiplication of an Elliptic Curve point by a scalar), there are two useful optimizations, largely complementary:

  • Barrett reduction: precompute, once per modulus, a fixed-point approximation of $1/N$; each subsequent reduction modulo $N$ then costs a few multiplications by that constant instead of a full division.

  • Montgomery multiplication: keep operands in Montgomery form $\hat X=X\,R\bmod N$ for a fixed $R=\beta^\ell>N$; the reduction after each multiplication then needs only limb multiplications and discarding of the low limbs, with no division by $N$.

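To make Montgomery multiplication concrete, here is a toy worked example with deliberately small numbers (chosen here for illustration): take $N=17$ and $R=2^5=32$, and precompute $N'=-N^{-1}\bmod R=15$ (indeed $17\times15=255\equiv-1\pmod{32}$). To multiply $a=7$ by $b=5$ modulo $17$, first convert to Montgomery form: $\hat a=a\,R\bmod N=224\bmod17=3$ and $\hat b=b\,R\bmod N=160\bmod17=7$. Then compute $t=\hat a\times\hat b=21$, next $m=(t\bmod R)\times N'\bmod R=21\times15\bmod32=27$, and finally $(t+m\,N)/R=(21+459)/32=15$; the division is exact by construction, and since $15<N$ no final subtraction is needed. The result $15$ is the Montgomery form of $a\times b$, as $7\times5\times32\bmod17=15$. Throughout, only multiplications, additions, and divisions by the power of two $R$ were used, never a division by $N$.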

Adding $1$ to a binary number

For this it pays to take the code for the addition $X+Y$, remove the parameter(s) for $Y$, and replace $y_0$ with $1$ and the other $y_i$ with $0$.

If it's not a requirement that the code be free from data-dependent timing variation, we can use the fact that, in ripple-carry addition of $1$ (the schoolbook algorithm when $\beta=10$), as soon as a limb addition generates no carry, none of the following ones will, so we can stop early. This is particularly useful when doing $X\gets X+1$ in place, as sketched below.
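
A minimal sketch of that early exit (code of my own, assuming the same 64-bit little-endian limbs as the addition example further below):

#include <stddef.h>
#include <stdint.h>

// X <- X+1 in place; returns the carry out of the most significant limb.
// Exits as soon as a limb addition generates no carry, hence the timing
// depends on the data: do not use this where that could leak a secret.
uint64_t bigint_inc(uint64_t *x, size_t limbs)
{
    for (size_t i = 0; i < limbs; ++i)
        if (++x[i] != 0)    // this limb did not wrap: no carry to propagate
            return 0;
    return 1;               // every limb was all-ones and wrapped to 0
}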

More often than not, the addition of $1$ can be performed at almost no cost by incorporating it into the arithmetic operation that generates the other operand, or into the operation that uses the result. In particular, $Z\gets X+Y+1$ is most efficiently computed by nearly the same code for ripple-carry addition that would compute $Z\gets X+Y$, except that when adding $x_0$ and $y_0$ it uses $x_0+y_0+1$ (equivalently, addition with the carry preset to $1$) rather than $x_0+y_0$ (addition with no carry, or with the carry preset to $0$).


Addition per comment:

my BigInt library stores each bit of the big unsigned binary integers as a real bit in memory and uses bitwise operations to do arithmetic. I'm not doing any of that limb stuff in mine.

In at least some limited sense, that's using limbs the width of the bitwise operations. If, further, we use addition over that width, that's making use of limbs. Portable (C/C++) code for the addition $Z\gets X+Y$ with 64-bit limbs ($\beta=2^{64}$) could be:

uint64_t c = 0;              // start without carry
for(size_t i = 0; i < limbs; ++i) { // loop over limb indexes
    uint64_t d = x[i];       // first operand's limb
    c += d;                  // add the carry to it; may wrap around
    uint64_t e = (c < d);    // e = 1 iff that addition wrapped
    c += y[i];               // second operand's limb; may wrap too
    e |= (c < y[i]);         // at most one of the two additions can wrap
    z[i] = c;                // store limb result
    c = e;                   // carry into the next limb
}
// here c = 0 iff the result fits in the limbs of z[]
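
Presetting c = 1 instead of 0 before the loop turns this same code into the computation of $Z\gets X+Y+1$ described earlier.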

‡ There are exceptions, e.g. in C/C++ when the function is inlined, or when optimization occurs particularly late in code generation.

Kevin Stefanov:
Thank you for the extended answer. My BigInt library stores each bit of the big unsigned binary integers as a real bit in memory and uses bitwise operations to do arithmetic. I'm not doing any of that limb stuff in mine.
poncho:
@KevinStefanov: an integer expressed as a series of base $2^{64}$ values can be viewed as storing each bit in a bit of memory. Are you saying that your implementation stores each bit in a separate byte? Why? And CPUs have math operations (addition, subtraction, multiplication) that work on sequences of bits efficiently (e.g. take two 64-bit values and generate their 128-bit product); is there a reason to ignore that?
Kevin Stefanov:
No, I'm not saying each bit is stored as a byte. Each bit is stored as a real bit in memory, and each byte of the binary numbers is stored as a byte in memory. There is a 1-to-1 mapping between bits of the big binary unsigned integers and actual bits in memory. And no, I'm open to both GPU and CPU solutions.
poncho:
@KevinStefanov: unless you are doing operations on a large number of integers in parallel, a GPU is unlikely to be useful; each core on a GPU is slower than any decent CPU (the speed of a GPU comes from having a lot of cores that can work in parallel), and your integers are too small to allow you to spread the work over multiple cores.
Kevin Stefanov:
@poncho Thanks, I'm having a hard time understanding Montgomery multiplication now. Nowhere have I found an actual worked-through example with actual numbers. The definition of the algorithm itself is hard to swallow and understand.