Q1: Where are side-channel-resistant Base64 codecs needed?
Side-channel resistance includes but is not limited to avoiding Data-Dependent Timing Variations, which can result from code branches and indexed accesses. Other potentially damaging side channels include data-dependent power consumption, which can be detected as variations in power consumption and various emissions (electromagnetic including radio, light, heat, vibrations…).
Side-channel resistance is useful in a Base64 encoder/decoder only if (and typically, if) what's encoded is confidential. Therefore
- Side-channel resistance is usually NOT needed to encode a ciphertext, public key, or public signature towards integration into a text-based format like HTML, XML, JSON. Same when decoding this.
- Side-channel resistance is potentially useful when text-encoding something confidential that's NOT encrypted (yet). A common example is an RSA key not password-encrypted. Another is embedding a password in an https session. Or when it's functionally required to encode a signature that's still confidential. I've also seen Base64 used for binary data in databases (though usually reading the nice manual shows binary data can be stored with no more programming effort, and less space and wasted CPU).
Q2: How to implement Base64, or any binary-to-text codec in constant-time?
Strictly speaking, constant-time is not achievable on most modern CPUs, for two classes of reasons:
- Background activity: interrupts, DMA, other processes stealing cycles and accessing memory, thus flushing cache entries, thus changing the number of clock cycles for a given task.
- Clock frequency varies according to factors not under programmer control, including temperature of the CPU; thus even constant clock cycles would not imply constant time.
Fortunately, only timing variations that are correlated to confidential data being manipulated can be a security concern. That excludes 1 under standard hypothesis on process separation, and in practice 2 (it has been argued that if power consumption varies markedly on data, it can influence CPU temperature, and thru that timing; but that's not really a practical concern).
Thus our aim below shall be to avoid Data Dependent Timing Variations (DDTV), and that's achievable. We do NOT consider other side channels, like power.
In order to avoid DDTV, our code shall (as explained in this useful reference):
- Not branch according to confidential data. This does not preclude loops executing a constant number of time, or according to the data length. In high-level language that includes if/then/else, iterative constructs with a data-dependent end or start condition, switch/case, short-circuit boolean operators (
&&
and ||
in C), the selection operator (cond ? x : y
in C), and calling a data-dependent function (even identical functions according to a function pointer selected in constant time).
- Not use memory indexing with the index data-dependent (which can cause timing variations due to e.g. cache, or crossing some boundary). This precludes reading or writing in tables at an index computed from nonpublic data.
- Not use CPU instructions with execution cycle count depending on confidential data. Which instruction that is depends on CPU (that can vary including for CPUs with identical instruction set). Which instruction are used varies with compiler, settings, language, and runtime for code that's not statically compiled (typically including Java, Javascript). Potentially problematic CPU instructions include
- Shifting with the shift count confidential (including but not limited to CPUs without a barrel shifter)
- Multiplication or division/remainder with one of the operand confidential (see this). However multiplication, division or remainder with a multiplicand or divisor/modulus a fixed power of two can be performed without multiplication or division/remainder instruction, e.g. using shift.
Bitwise operations (AND, OR, XOR, NOT) and shift by a constant are typically free from DDTV, including in high-level languages compiled to machine instructions without a prior profiling step. They usually can be combined to arbitrary depth.
Addition and subtraction CPU instructions (including with borrow/carry) are typically free from DDTV, but addition or subtraction over width larger than the CPU word size often triggers compiler optimizations that use branching. That's especially likely when one of the variable is a small constant (including increment/decrement) or narrower than the other, and the CPU word size is narrower than the larger variable. I have yet to see arithmetic negation (NEG, -
with a single operand on it's right) exhibit DDTV when applied to a fixed-width unsigned type directly supported by a compiler.
CAUTION: always assess the generated code! Even minor source code or setting changes, toolchain update… can profoundly modify generated code. If an interpreter or Just In Time Compiler is used, DDTV could occur or not depending on exactly what it does.
An essential building block is generating without DDTV a $b$-bit quantity $Z$ with all bits at $1$ (that is $Z=2^b-1$) if $X<Y$, otherwise all bits at $0$ (that is $Z=0$), where $b$ is the bit width of some convenient unsigned data type (e.g. 8, 16, 32 bits), under the assumption $0\le X<2^{b-1}$, $0\le Y<2^{b-1}$. This can be with
$$Z\gets\left(-\left\lfloor\frac{(X-Y)\bmod2^b}{2^{b-1}}\right\rfloor\right)\bmod2^b$$
That works because given the range of $X$ and $Y$, bit $b-1$ of $(X-Y)\bmod2^b$ is $1$ if and only if $X<Y$. The rest of the expression copies that bit over all the other bits.
In C, this can be written Z = - ( ( X - Y ) >> ( b - 1 ) );
when the variables X
Y
Z
are of type unsigned
or an at least as wide unsigned integer type. See sample code below for automagic determination of b
from the type used, and casts that allow using a narrow unsigned type, which on 8-bit CPUs can speed up things and lower the risk of DDTV when using the result.
In Java with variables of type int
that would go Z = - ( ( X - Y ) >>> 31 );
but Java is often Just-In-Time compiled, and who knows what that does to DDTV.
On top of that building block, we can use bitwise operations and (most often, including nearly always when the selected type is the CPU word size) addition, to build the desired conversion. In the case of conversion to Base64, each character to produce encodes six bits. Representing these as variable $X\in[0,63]$, and assuming consecutive encoding for the three character ranges A…Z
a…z
0…9
(which holds in ASCII, EBCDIC…), the conversion to perform is adding to $X$ one of five values depending only on which sub-range $X$ belongs to: $[0,25]$, $[26,51]$, $[52,61]$, $[62,62]$, $[63,63]$. We can find which range with four invocations of our primitive above, with constants $Y$ set to $26$, $52$, $62$, $63$, and then building the appropriate additive value with bitwise AND and XOR.
Example C code doing that. Try it Online!
#include <stdint.h> // for portable known-width types
typedef uint8_t CT_t; // type for constant-time computation
// Note: The above type can be changed to any of
// - an unsigned type in <stdint.h>, e.g. uint8_t, uint16_t, uint32_t, uint64_t
// - a built-in unsigned type: unsigned char, unsigned short, unsigned...
// Using uint8_t or unsigned char typically generates the best code on 8-bit CPUs.
// On other CPUs, it's worth trying unsigned.
#define CT_MASK (CT_t)(~(CT_t)0) // mask for all bits that fit CT_t
// IMAX_BITS(m); give number of bits b in any m = (1<<b)-1 where 0 <= b <= 3.2E+10
// derived from this post:
// From: Hallvard B Furuseth <h.b.furuseth(nospam)@usit.uio(nospam).no>
// Newsgroups: comp.std.c
// Date: 03 Nov 2003 06:02:00 +0100
// Organization: University of Oslo, Norway
// Message-ID: <[email protected]>
// Modified to cause a division by 0 if m is not one less than a power of two
// This is intended for expressions evaluated at compile time only!
#define IMAX_BITS(m) \
((m)/((m)%0x3FFFFFFF+1)/0x3FFFFFFF%0x3FFFFFFF*30+(m)%0x3FFFFFFF/((m)%31+1)/31%31*5 \
+4-12/((m)%31+3/((m)>=0&&!((m)>>1&(((m)>>1)+1))&&((m)&1)|!(m))))
// automagically compute the bit width of CT_MASK using the above
enum { CT_BITS = IMAX_BITS(CT_MASK) };
#define CT_MAXI (CT_t)(CT_MASK>>1) // mask for one less bit than the type
// Return CT_MASK if X<Y, (CT_t)0 otherwise, without data-dependent timing dependency,
// under the assumption that (CT_t) X and (CT_t)Y are at most CT_MAXI,
// and evaluating X and Y exactly once.
#define CT_LESS(X,Y) (CT_t)(-(CT_t)((CT_t)((CT_t)(X)-(CT_t)(Y))>>(CT_BITS-1)))
// Note: The third of five casts above is essential when type CT_t is shorter
// than unsigned int, and the others intended to guide some 8 bit compilers
// to generate compact/fast code for these types.
// If they hard code generation for some other compilers, it's optimizer is lacking.
// The five casts should be unnecessary and harmless when type CT_t is at least
// as wide as unsigned int.
// Return (CT_t)Z if X<Y, (CT_t)0 otherwise, without data-dependent timing dependency,
// under the assumption that (CT_t) X and (CT_t)Y are at most CT_MAXI,
// and evaluating X, Y and Z exactly once.
// CAUTION: assumes X, Y are at most CT_MAXI after cast to CT_t
#define CT_IFLESS(X,Y,Z) (CT_LESS((X),(Y))&(CT_t)(Z))
#ifdef _MSC_VER // Microsoft-specific workaround
// When it's taken the negative of an unsigned quantity at least as wide as unsigned,
// Visual C++ by design defaults to stoping the build with:
// "error C4146: unary minus operator applied to unsigned type, result still unsigned"
// even though the C standard precisely defines and mandates the behavior.
#pragma warning(disable:4146) // Repair that
#endif
// Given x assumed to be in [0,63],
// encode it per Base64 without data-dependent timing dependency.
static inline char CT_encode_1_per_Base64(CT_t x) {
#define BASE64__0 'A' // first character for [ 0,25]
#define BASE64_26 'a' // first character for [26,51]
#define BASE64_52 '0' // first character for [52,61]
#define BASE64_62 '+' // character for [62,62]
#define BASE64_63 '/' // character for [63,63]
return (char)(CT_t)((
CT_IFLESS(x, 26, (BASE64__0 - 0) ^ (BASE64_26 - 26) ) ^
CT_IFLESS(x, 52, (BASE64_26 - 26) ^ (BASE64_52 - 52) ) ^
CT_IFLESS(x, 62, (BASE64_52 - 52) ^ (BASE64_62 - 62) ) ^
CT_IFLESS(x, 63, (BASE64_62 - 62) ^ (BASE64_63 - 63) ) ^
(CT_t)(BASE64_63 - 63) ) + x );
}
// Note: since this is a function, x is evaluated exactly once.
// Using inline possibly speeds up the next function.
#include <stdlib.h> // for malloc
// Given a byte vector of specified byte size, encode it per Base64 with no DDTV
// in a new mallocated C string.
// Return NULL if size is too large or if malloc otherwise fails.
char* CT_endode_per_Base64(uint8_t* data, size_t size) {
#define BASE64FIL '=' // filler character
char* p = NULL;
size_t outsize = ((size + 2) / 3) * 4 + 1; // size of output
if (outsize > size && (p = malloc(outsize)) != NULL) {
char* q = p;
while (size--) {
CT_t a, b;
*q++ = CT_encode_1_per_Base64((a = *data++) >> 2);
if (size--) {
*q++ = CT_encode_1_per_Base64(((a & 3) << 4) | ((b = *data++) >> 4));
if (size--) {
*q++ = CT_encode_1_per_Base64(((b & 15) << 2) | ((a = *data++) >> 6));
*q++ = CT_encode_1_per_Base64(a & 63);
continue;
}
*q++ = CT_encode_1_per_Base64((b & 15) << 2);
}
else {
*q++ = CT_encode_1_per_Base64((a & 3) << 4);
*q++ = BASE64FIL;
}
*q++ = BASE64FIL;
break;
}
*q = 0; // end of string
}
return p;
}
// test program converts it's inputs to Base64
#include <stdio.h> // for puts
#include <string.h> // for strlen0
int main(int argc, char* argv[]) {
for (int i = 1; i < argc; i++) {
char* p = CT_endode_per_Base64((uint8_t*)argv[i], strlen(argv[i]));
puts(p != NULL ? p : "## failure ##");
free(p);
}
return 0;
}
If the CPU is known to have a barrel-shifter free of DDTV, and we trust the code generation chain to use it, we can use that to build a short table accessible free of DDTV.
E.g. the C expression (V>>((j)*8))&0xFF
(with j
in $[0,7]$ and V
a 64-bit constant) implements an arbitrary 8-byte table defined by V
. And it's typically free of DDTV when running compiled on an x86-64. This can also work for $16$ values in $[0,15]$, or $21$ values in $[0,7]$.
We can combine this with the selection idiom in the previous section in order to extend the table, by replacing the constant V
with an expression selecting among several constants according to j
. This works fine for DES S-boxes.
The question also asks for decoding. That's markedly more complex than encoding, because strictly speaking we want to explicitly detect any error yet not leak it's position. The best I can think of is
- Testing that the input length is multiple of 4 characters; that needs not be free from DDTV.
- Allocating space for result (or checking there is enough), under the assumption that the output length will be a multiple of 3.
- Converting all but the last 2 characters, also maintaining a variable that tells if there was an error so far, all free of DDTV. We need 6 bound checks per characters.
- Converting the last 2 characters, always storing the corresponding 2 bytes even if that's with nonsensical values, maintaining the error variable, and creating a length adjustment variable (0,1,2), all free of DDTV. Afterwards, we can simply test the error variable, and apply the adjustment if there was no error (it may matter that the length does not leak in case of error, but it's pointless to try to hide there was an error, or the length when there was no error).
I strongly recommend minimizing encoding and decoding at the design stage, if only per the KISS principle, and to prevent the countless errors that can creep, thus do. That's especially true for confidential un-encrypted data because of a potential for side channel and padding oracle and fault injection attacks targeting the decoder. That's also important for large data, because encoding and decoding wastes appreciable CPU usage, memory, long term storage and/or bandwidth. Text encoding of email and many things web is a lasting and sizable resource waste.