The question is about arithmetic over non-negative (aka unsigned) integers. Formally, the adjective "binary" applies to the representation of these integers, and a "binary unsigned integer" could be construed as one restricted to the set $\{0,1\}$.
Actually, most big-integer arithmetic libraries do not use binary, but a Positional Numeral System with a larger base $\beta$. Then "binary" means $\beta=2^b$ for some $b$, and for efficiency we want $b\gg1$. E.g. $b=32$ in Java's BigInteger, $b=30$ in Python, $b=64$ in GMP on most modern CPUs/platforms. An integer $X$ in $[0,2^k)$ (that is, an at-most $k$-bit integer) is represented by at least $\lceil k/b\rceil$ integer(s) $x_i\in[0,2^b)$. Each $x_i$ is called a limb, or a base-$\beta$ digit, or just digit when there is no ambiguity. A limb $x_i$ is typically stored as x[i] in a vector or array x[] with $\ell=\lceil k/b\rceil$ entries. When using the little-endian convention and no offset, the relation between $X$ and the $x_i$ is simply:
$$X\,=\,\sum_{0\le i<\ell}x_i\,\beta^i\,=\,\sum_{0\le i<\ell}x_i\,2^{i\times b}$$
Note: the endianness convention among the elements of x[] is not to be confused with the endianness convention within an individual limb x[i], which is typically dictated by the CPU or platform.
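As a concrete illustration (a minimal sketch, not tied to any particular library), with $b=32$, thus $\beta=2^{32}$, an at-most 64-bit $X$ occupies $\ell=2$ limbs:

#include <stdint.h>

uint64_t X = 0x123456789ABCDEF0;   // an at-most 64-bit integer, k = 64
uint32_t x[2];                     // l = ceil(64/32) = 2 limbs, beta = 2^32
x[0] = (uint32_t)X;                // least significant limb, weight 2^0
x[1] = (uint32_t)(X >> 32);        // most significant limb, weight 2^32
// little-endian convention: X == x[0] + x[1] * 2^32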
A word of caution: "writing my own BigInt library" and caring for performance is an excellent recipe to spend years on that low-level stuff and still end up with something that's unsafe in cryptographic applications and is trounced on both speed and security by \$1 in dedicated hardware, e.g. a bank card.
Squaring a binary number (…) I'm using the generic powering algorithm with input power of 2
A most obvious optimization is to compute $X^2$ as $X\times X$. For some implementations of the powering algorithm, that saves one multiplication of $1$ by $X$ or $X^2$. Also, that likely‡ saves the time to analyze the exponent and to pass it as a parameter.
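For illustration (a sketch on machine words for brevity; a big-integer version has the same structure, and the function name is hypothetical), a generic left-to-right square-and-multiply looks like:

#include <stdint.h>

uint64_t power(uint64_t x, uint32_t e)   // computes x^e (here reduced mod 2^64 by wrapping)
{
    uint64_t r = 1;
    for (int i = 31; i >= 0; --i) {      // scan exponent bits, most significant first
        r = r * r;                       // always square
        if ((e >> i) & 1)
            r = r * x;                   // multiply in x when the bit is set
    }
    return r;
}
// With e = 2, the only useful work is one squaring; the generic algorithm still performs
// squarings of 1 and one multiplication of 1 by x, which computing x*x directly avoids.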
Starting from a copy of existing code for the multiplication of arbitrary $X$ and $Y$, we can remove the parameter(s) for $Y$ and reference the parameter(s) for $X$ instead. This will often‡ give some measurable performance saving, because the code generator has more registers available and the calling sequence is shorter.
There are algorithms that compute $X\times X$ for arbitrary $X$ appreciably faster than a comparable algorithm computing $X\times Y$ for arbitrary $X$ and $Y$ (with $Y$ set to $X$). That includes long multiplication (the schoolbook algorithm when $\beta=10$). When multiplying two arbitrary $\ell$-limb integers $X$ and $Y$ by this method, we need $\ell^2$ limb multiplications to compute the $x_i\times y_j$ with $0\le i<\ell$ and $0\le j<\ell$, but when $X=Y$ we have $x_i\times y_j=x_j\times y_i$, and we can do with $\ell(\ell+1)/2$ limb multiplications, computing the $x_i\times y_j$ with $0\le i\le j<\ell$ and adding each product $x_i\times y_j$ at two positions in a provisional result when $i\ne j$.
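A sketch of that squaring method (hypothetical code for illustration, 32-bit limbs so that limb products fit a uint64_t; z[] receives $2\ell$ limbs): the cross products $x_i\times x_j$ with $i<j$ are accumulated once, the partial result is doubled, then the $\ell$ diagonal squares $x_i^2$ are added, for $\ell(\ell+1)/2$ limb multiplications in total:

#include <stdint.h>
#include <string.h>

// Schoolbook squaring, little-endian 32-bit limbs (beta = 2^32).
// x[] holds l limbs, z[] receives 2*l limbs.
void sqr_schoolbook(uint32_t *z, const uint32_t *x, size_t l)
{
    memset(z, 0, 2 * l * sizeof z[0]);
    // 1. accumulate each cross product x[i]*x[j] with i < j, counted once
    for (size_t i = 0; i < l; ++i) {
        uint64_t carry = 0;
        for (size_t j = i + 1; j < l; ++j) {
            uint64_t t = (uint64_t)x[i] * x[j] + z[i + j] + carry;
            z[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        z[i + l] = (uint32_t)carry;   // this position is still zero here
    }
    // 2. double the accumulated cross products (they count twice in X*X)
    uint32_t top = 0;
    for (size_t i = 0; i < 2 * l; ++i) {
        uint32_t next = z[i] >> 31;
        z[i] = (z[i] << 1) | top;
        top = next;
    }
    // 3. add the l diagonal squares x[i]^2 at positions 2*i and 2*i+1
    uint64_t c = 0;
    for (size_t i = 0; i < l; ++i) {
        uint64_t sq = (uint64_t)x[i] * x[i];
        uint64_t t = (uint64_t)z[2 * i] + (uint32_t)sq + c;
        z[2 * i] = (uint32_t)t;
        t = (uint64_t)z[2 * i + 1] + (sq >> 32) + (t >> 32);
        z[2 * i + 1] = (uint32_t)t;
        c = t >> 32;
    }
    // c is 0 here: X*X always fits in 2*l limbs
}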
Taking only the remainder of binary Euclidean division
I know of no general method working for an isolated division, beyond removing the parameter(s) for the quotient array and the code that stores into it. This will often‡ give some measurable performance saving by freeing registers, taxing the cache subsystem less, and simplifying the calling sequence.
When doing a series of modular multiplications and/or additions with the same modulus $N$, which is very common in cryptography (e.g. modular exponentiation in RSA, multiplication of an Elliptic Curve point by a scalar), there are two useful optimizations, which are largely complementary.
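As one example of the kind of saving a fixed odd modulus allows (named plainly as an illustration, not necessarily what is alluded to above), Montgomery multiplication trades the division by $N$ for multiplications and shifts after a one-time precomputation depending only on $N$. A minimal single-limb sketch, assuming $N<2^{63}$ and the common unsigned __int128 compiler extension (GCC, Clang); real big-integer code applies the same idea limb by limb:

#include <stdint.h>

typedef unsigned __int128 u128;            // compiler extension, not standard C

// One-time precomputation: nprime = -N^{-1} mod 2^64, for odd N.
uint64_t mont_precompute(uint64_t N)
{
    uint64_t inv = N;                      // N is its own inverse mod 2^3 when N is odd
    for (int i = 0; i < 5; ++i)            // each Newton step doubles the correct bits
        inv *= 2 - N * inv;
    return 0 - inv;                        // negate modulo 2^64
}

// REDC: given T < N * 2^64, return T * 2^{-64} mod N.
uint64_t mont_redc(u128 T, uint64_t N, uint64_t nprime)
{
    uint64_t m = (uint64_t)T * nprime;     // m = (T mod 2^64) * nprime mod 2^64
    u128 t = (T + (u128)m * N) >> 64;      // T + m*N is divisible by 2^64
    return (t >= N) ? (uint64_t)(t - N) : (uint64_t)t;   // data-dependent branch, not constant-time
}

// Product of a and b (both < N, already in Montgomery form), result in Montgomery form.
// Conversion to/from Montgomery form (an extra multiplication by 2^64 mod N) is omitted.
uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t N, uint64_t nprime)
{
    return mont_redc((u128)a * b, N, nprime);
}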
Adding $1$ to a binary number
For this, it pays to take the code for the addition $X+Y$, remove the parameter(s) for $Y$, and replace $y_0$ with $1$ and the other $y_i$ with $0$.
If it's not a requirement that the code be free of data-dependent timing variation, we can use the fact that, in an addition of $1$ with ripple-carry (the schoolbook algorithm when $\beta=10$), as soon as a limb addition generates no carry, the following ones won't either. This is particularly useful when doing $X\gets X+1$.
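For example (a sketch with the same hypothetical 64-bit-limb layout as the addition code further below), an in-place $X\gets X+1$ that stops as soon as the carry dies out, hence with data-dependent timing:

uint64_t carry = 1;                        // the "+1" enters as an initial carry
for (size_t i = 0; i < limbs && carry; ++i) {
    x[i] += 1;                             // propagate the carry into this limb
    carry = (x[i] == 0);                   // carry out iff the limb wrapped around to 0
}
// if carry is still 1 here, X+1 did not fit in the available limbs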
More often than not, the addition of $1$ can be performed at almost no cost by incorporating it into the arithmetic operation that generates the other operand, or into the operation that uses the result. In particular, $Z\gets X+Y+1$ is most efficiently computed by nearly the same code for ripple-carry addition that would compute $Z\gets X+Y$, except that when adding $x_0$ and $y_0$ it uses $x_0+y_0+1$ (equivalently, addition with the carry preset to $1$) rather than $x_0+y_0$ (equivalently, addition with no carry, or with the carry preset to $0$).
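With the 64-bit-limb addition loop shown further below, the only change needed for $Z\gets X+Y+1$ is the initialization of the running carry:

uint64_t c = 1;   // carry preset to 1 realizes the "+1"
// ... followed by exactly the same ripple-carry loop as for Z <- X + Y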
Addition per comment:
my BigInt library stores each bit of the big unsigned binary integers as a real bit in memory and uses bitwise operations to do arithmetic. I'm not doing any of that limb stuff in mine.
In at least some limited sense, that's using limbs the width of the bitwise operations. If, further, we use addition over that width (even partially), that's making use of limbs. Portable (C/C++) code for the addition $Z\gets X+Y$ with 64-bit limbs ($\beta=2^{64}$) could be:
uint64_t c = 0;                      // start without carry
for (size_t i = 0; i < limbs; ++i) { // loop over limb indexes
    uint64_t d = x[i] + c;           // add carry to first operand's limb, may wrap
    c = (d < c);                     // c = 1 iff that wrapped (possible only when c was 1)
    d += y[i];                       // add second operand's limb, may wrap
    c += (d < y[i]);                 // count that wrap too; at most one of the two can occur
    z[i] = d;                        // store limb result
}
// here c = 0 iff the result is OK (no carry out of the most significant limb)
‡ There are exceptions, e.g. in C/C++ when the function is inline, or when optimization occurs particularly late in code generation.