Score:1

Crypto

Constant-Time Base64 Codec - Necessity and Implementation

DannyNiu

7/27/24, 6:39 AM

As we know, the popular PEM format for textualizing private key binary blobs uses base64 encoding. Typical base64 codecs use look-up tables to find characters and byte values and pack them together, and this represents a memory side-channel.

While in most cases, loading of private key is completely opaque to the clients of servers that loads them, there may be situations that necessitates side-channel resistance - one of them would be cloud computing where several (possibly untrusted) virtual machines running on shared hardware.

Q1: what are the other situations where side-channel-resistant base64 codecs are needed?

Q2: how to implement base64, or any binary-to-text codec in constant-time?

1 + 1

implementation

side-channel-attack

Myria

7/27/24, 11:38 PM

This might not be possible in the general case, depending on the algorithm and key storage format. If the lead 9 bits of some integer in the private key happens to be all zeros, it'll get omitted in DER format and thus the size of the base64 data is dependent on confidential information.

Score:1

Crypto

fgrieu

7/27/24, 9:23 AM

Q1: Where are side-channel-resistant Base64 codecs needed?

Side-channel resistance includes but is not limited to avoiding Data-Dependent Timing Variations, which can result from code branches and indexed accesses. Other potentially damaging side channels include data-dependent power consumption, which can be detected as variations in power consumption and various emissions (electromagnetic including radio, light, heat, vibrations…).

Side-channel resistance is useful in a Base64 encoder/decoder only if (and typically, if) what's encoded is confidential. Therefore

Side-channel resistance is usually NOT needed to encode a ciphertext, public key, or public signature towards integration into a text-based format like HTML, XML, JSON. Same when decoding this.
Side-channel resistance is potentially useful when text-encoding something confidential that's NOT encrypted (yet). A common example is an RSA key not password-encrypted. Another is embedding a password in an https session. Or when it's functionally required to encode a signature that's still confidential. I've also seen Base64 used for binary data in databases (though usually reading the nice manual shows binary data can be stored with no more programming effort, and less space and wasted CPU).

Q2: How to implement Base64, or any binary-to-text codec in constant-time?

Strictly speaking, constant-time is not achievable on most modern CPUs, for two classes of reasons:

Background activity: interrupts, DMA, other processes stealing cycles and accessing memory, thus flushing cache entries, thus changing the number of clock cycles for a given task.
Clock frequency varies according to factors not under programmer control, including temperature of the CPU; thus even constant clock cycles would not imply constant time.

Fortunately, only timing variations that are correlated to confidential data being manipulated can be a security concern. That excludes 1 under standard hypothesis on process separation, and in practice 2 (it has been argued that if power consumption varies markedly on data, it can influence CPU temperature, and thru that timing; but that's not really a practical concern).

Thus our aim below shall be to avoid Data Dependent Timing Variations (DDTV), and that's achievable. We do NOT consider other side channels, like power.

In order to avoid DDTV, our code shall (as explained in this useful reference):

Not branch according to confidential data. This does not preclude loops executing a constant number of time, or according to the data length. In high-level language that includes if/then/else, iterative constructs with a data-dependent end or start condition, switch/case, short-circuit boolean operators (&& and || in C), the selection operator (cond ? x : y in C), and calling a data-dependent function (even identical functions according to a function pointer selected in constant time).
Not use memory indexing with the index data-dependent (which can cause timing variations due to e.g. cache, or crossing some boundary). This precludes reading or writing in tables at an index computed from nonpublic data.
Not use CPU instructions with execution cycle count depending on confidential data. Which instruction that is depends on CPU (that can vary including for CPUs with identical instruction set). Which instruction are used varies with compiler, settings, language, and runtime for code that's not statically compiled (typically including Java, Javascript). Potentially problematic CPU instructions include
- Shifting with the shift count confidential (including but not limited to CPUs without a barrel shifter)
- Multiplication or division/remainder with one of the operand confidential (see this). However multiplication, division or remainder with a multiplicand or divisor/modulus a fixed power of two can be performed without multiplication or division/remainder instruction, e.g. using shift.

Bitwise operations (AND, OR, XOR, NOT) and shift by a constant are typically free from DDTV, including in high-level languages compiled to machine instructions without a prior profiling step. They usually can be combined to arbitrary depth.

Addition and subtraction CPU instructions (including with borrow/carry) are typically free from DDTV, but addition or subtraction over width larger than the CPU word size often triggers compiler optimizations that use branching. That's especially likely when one of the variable is a small constant (including increment/decrement) or narrower than the other, and the CPU word size is narrower than the larger variable. I have yet to see arithmetic negation (NEG, - with a single operand on it's right) exhibit DDTV when applied to a fixed-width unsigned type directly supported by a compiler.

CAUTION: always assess the generated code! Even minor source code or setting changes, toolchain update… can profoundly modify generated code. If an interpreter or Just In Time Compiler is used, DDTV could occur or not depending on exactly what it does.

An essential building block is generating without DDTV a $b$-bit quantity $Z$ with all bits at $1$ (that is $Z=2^b-1$) if $X<Y$, otherwise all bits at $0$ (that is $Z=0$), where $b$ is the bit width of some convenient unsigned data type (e.g. 8, 16, 32 bits), under the assumption $0\le X<2^{b-1}$, $0\le Y<2^{b-1}$. This can be with $$Z\gets\left(-\left\lfloor\frac{(X-Y)\bmod2^b}{2^{b-1}}\right\rfloor\right)\bmod2^b$$ That works because given the range of $X$ and $Y$, bit $b-1$ of $(X-Y)\bmod2^b$ is $1$ if and only if $X<Y$. The rest of the expression copies that bit over all the other bits.

In C, this can be written Z = - ( ( X - Y ) >> ( b - 1 ) ); when the variables X Y Z are of type unsigned or an at least as wide unsigned integer type. See sample code below for automagic determination of b from the type used, and casts that allow using a narrow unsigned type, which on 8-bit CPUs can speed up things and lower the risk of DDTV when using the result.

In Java with variables of type int that would go Z = - ( ( X - Y ) >>> 31 ); but Java is often Just-In-Time compiled, and who knows what that does to DDTV.

On top of that building block, we can use bitwise operations and (most often, including nearly always when the selected type is the CPU word size) addition, to build the desired conversion. In the case of conversion to Base64, each character to produce encodes six bits. Representing these as variable $X\in[0,63]$, and assuming consecutive encoding for the three character ranges A…Z a…z 0…9 (which holds in ASCII, EBCDIC…), the conversion to perform is adding to $X$ one of five values depending only on which sub-range $X$ belongs to: $[0,25]$, $[26,51]$, $[52,61]$, $[62,62]$, $[63,63]$. We can find which range with four invocations of our primitive above, with constants $Y$ set to $26$, $52$, $62$, $63$, and then building the appropriate additive value with bitwise AND and XOR.

Example C code doing that. Try it Online!

#include <stdint.h>                        // for portable known-width types

typedef uint8_t CT_t;                      // type for constant-time computation
// Note: The above type can be changed to any of
// - an unsigned type in <stdint.h>, e.g. uint8_t, uint16_t, uint32_t, uint64_t
// - a built-in unsigned type: unsigned char, unsigned short, unsigned...
// Using uint8_t or unsigned char typically generates the best code on 8-bit CPUs.
// On other CPUs, it's worth trying unsigned.

#define CT_MASK (CT_t)(~(CT_t)0)            // mask for all bits that fit CT_t

// IMAX_BITS(m); give number of bits  b  in any  m = (1<<b)-1  where 0 <= b <= 3.2E+10
// derived from this post:
//      From: Hallvard B Furuseth <h.b.furuseth(nospam)@usit.uio(nospam).no>
//      Newsgroups: comp.std.c
//      Date: 03 Nov 2003 06:02:00 +0100
//      Organization: University of Oslo, Norway
//      Message-ID: <[email protected]>
// Modified to cause a division by 0 if  m  is not one less than a power of two
// This is intended for expressions evaluated at compile time only!
#define IMAX_BITS(m) \
    ((m)/((m)%0x3FFFFFFF+1)/0x3FFFFFFF%0x3FFFFFFF*30+(m)%0x3FFFFFFF/((m)%31+1)/31%31*5 \
    +4-12/((m)%31+3/((m)>=0&&!((m)>>1&(((m)>>1)+1))&&((m)&1)|!(m))))

// automagically compute the bit width of CT_MASK using the above
enum { CT_BITS = IMAX_BITS(CT_MASK) };

#define CT_MAXI (CT_t)(CT_MASK>>1)          // mask for one less bit than the type

// Return CT_MASK if X<Y, (CT_t)0 otherwise, without data-dependent timing dependency,
// under the assumption that (CT_t) X and (CT_t)Y are at most CT_MAXI,
// and evaluating X and Y exactly once.
#define CT_LESS(X,Y) (CT_t)(-(CT_t)((CT_t)((CT_t)(X)-(CT_t)(Y))>>(CT_BITS-1)))
// Note: The third of five casts above is essential when type CT_t is shorter
// than unsigned int, and the others intended to guide some 8 bit compilers
// to generate compact/fast code for these types.
// If they hard code generation for some other compilers, it's optimizer is lacking. 
// The five casts should be unnecessary and harmless when type CT_t is at least
// as wide as unsigned int.


// Return (CT_t)Z if X<Y, (CT_t)0 otherwise, without data-dependent timing dependency,
// under the assumption that (CT_t) X and (CT_t)Y are at most CT_MAXI,
// and evaluating X, Y and Z exactly once.
// CAUTION: assumes X, Y are at most CT_MAXI after cast to CT_t
#define CT_IFLESS(X,Y,Z) (CT_LESS((X),(Y))&(CT_t)(Z))

#ifdef _MSC_VER // Microsoft-specific workaround
// When it's taken the negative of an unsigned quantity at least as wide as unsigned,
// Visual C++ by design defaults to stoping the build with:
// "error C4146: unary minus operator applied to unsigned type, result still unsigned"
// even though the C standard precisely defines and mandates the behavior.
#pragma warning(disable:4146) // Repair that
#endif

// Given x assumed to be in [0,63],
// encode it per Base64 without data-dependent timing dependency.
static inline char CT_encode_1_per_Base64(CT_t x) {
#define BASE64__0 'A'   // first character for [ 0,25]
#define BASE64_26 'a'   // first character for [26,51]
#define BASE64_52 '0'   // first character for [52,61]
#define BASE64_62 '+'   //       character for [62,62]
#define BASE64_63 '/'   //       character for [63,63]
    return (char)(CT_t)((
        CT_IFLESS(x, 26, (BASE64__0 -  0) ^ (BASE64_26 - 26) ) ^
        CT_IFLESS(x, 52, (BASE64_26 - 26) ^ (BASE64_52 - 52) ) ^
        CT_IFLESS(x, 62, (BASE64_52 - 52) ^ (BASE64_62 - 62) ) ^
        CT_IFLESS(x, 63, (BASE64_62 - 62) ^ (BASE64_63 - 63) ) ^
                   (CT_t)(BASE64_63 - 63) ) + x );
}
// Note: since this is a function, x is evaluated exactly once.
// Using inline possibly speeds up the next function.

#include <stdlib.h> // for malloc
// Given a byte vector of specified byte size, encode it per Base64 with no DDTV
// in a new mallocated C string.
// Return NULL if size is too large or if malloc otherwise fails.
char* CT_endode_per_Base64(uint8_t* data, size_t size) {
#define BASE64FIL '='   // filler character
    char* p = NULL;
    size_t outsize = ((size + 2) / 3) * 4 + 1;   // size of output
    if (outsize > size && (p = malloc(outsize)) != NULL) {
        char* q = p;
        while (size--) {
            CT_t a, b;
            *q++ = CT_encode_1_per_Base64((a = *data++) >> 2);
            if (size--) {
                *q++ = CT_encode_1_per_Base64(((a & 3) << 4) | ((b = *data++) >> 4));
                if (size--) {
                    *q++ = CT_encode_1_per_Base64(((b & 15) << 2) | ((a = *data++) >> 6));
                    *q++ = CT_encode_1_per_Base64(a & 63);
                    continue;
                }
                *q++ = CT_encode_1_per_Base64((b & 15) << 2);
            }
            else {
                *q++ = CT_encode_1_per_Base64((a & 3) << 4);
                *q++ = BASE64FIL;
            }
            *q++ = BASE64FIL;
            break;
        }
        *q = 0;     // end of string
    }
    return p;
}

// test program converts it's inputs to Base64
#include <stdio.h>   // for puts
#include <string.h>  // for strlen0
int main(int argc, char* argv[]) {
    for (int i = 1; i < argc; i++) {
        char* p = CT_endode_per_Base64((uint8_t*)argv[i], strlen(argv[i]));
        puts(p != NULL ? p : "## failure ##");
        free(p);
    }
    return 0;
}

If the CPU is known to have a barrel-shifter free of DDTV, and we trust the code generation chain to use it, we can use that to build a short table accessible free of DDTV.

E.g. the C expression (V>>((j)*8))&0xFF (with j in $[0,7]$ and V a 64-bit constant) implements an arbitrary 8-byte table defined by V. And it's typically free of DDTV when running compiled on an x86-64. This can also work for $16$ values in $[0,15]$, or $21$ values in $[0,7]$.

We can combine this with the selection idiom in the previous section in order to extend the table, by replacing the constant V with an expression selecting among several constants according to j. This works fine for DES S-boxes.

The question also asks for decoding. That's markedly more complex than encoding, because strictly speaking we want to explicitly detect any error yet not leak it's position. The best I can think of is

Testing that the input length is multiple of 4 characters; that needs not be free from DDTV.
Allocating space for result (or checking there is enough), under the assumption that the output length will be a multiple of 3.
Converting all but the last 2 characters, also maintaining a variable that tells if there was an error so far, all free of DDTV. We need 6 bound checks per characters.
Converting the last 2 characters, always storing the corresponding 2 bytes even if that's with nonsensical values, maintaining the error variable, and creating a length adjustment variable (0,1,2), all free of DDTV. Afterwards, we can simply test the error variable, and apply the adjustment if there was no error (it may matter that the length does not leak in case of error, but it's pointless to try to hide there was an error, or the length when there was no error).

I strongly recommend minimizing encoding and decoding at the design stage, if only per the KISS principle, and to prevent the countless errors that can creep, thus do. That's especially true for confidential un-encrypted data because of a potential for side channel and padding oracle and fault injection attacks targeting the decoder. That's also important for large data, because encoding and decoding wastes appreciable CPU usage, memory, long term storage and/or bandwidth. Text encoding of email and many things web is a lasting and sizable resource waste.

+ 0

Elon Musk

I sit in a Tesla and translated this thread with Ai:

EN: Constant-Time Base64 Codec - Necessity and Implementation

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.