Score:11

Is If/else vulnerable to timing side-channel attacks?

tf flag
Tom

I have a branch in C++:

if (x & 1)
{
    x = function_1(x);
}
else
{
    x = function_2(x);
}

If function_1 and function_2 are constant time and take the same time to compute, is such branching still vulnerable to side-channel attacks? Can an attacker somehow know which branch was executed?

Tom avatar
tf flag
Tom
@kelalaka ok, thanks. And how about ((x & 1) * function_1(x)) ^ ((~x & 1) * function_2(x))? Is it better or still bad? I suspect that it is not a very good solution either, but I'm not sure.
kelalaka avatar
in flag
I don't see a difference; besides, make sure that nothing is optimized away. This is why ASM blocks are common in side-channel implementations. In the above there were mistakes, corrected here (the other comment is deleted): $$x = (x \wedge 1)\cdot f_1(x) + ((x \wedge 1)\oplus 1)\cdot f_2(x)$$
Tom avatar
tf flag
Tom
@kelalaka yes, there was a mistake, that's why I thought it was something different. Now the result is the same as in my formula. But I think yours is still better, because in my formula there is "~x & 1", and this is probably not constant time (compared to x & 1), especially if x is a big number
kelalaka avatar
in flag
It is not about the non-constant-timeness of ~ or XOR. It is about always executing both methods regardless of $x$'s value and, in the final step, combining them with the masks. This doesn't even require relying on $f_1$ and $f_2$ having the same constant running time
kelalaka avatar
in flag
See [Squeamish Ossifrage](https://crypto.stackexchange.com/a/96634/18298)'s canonical answer...
Score:14
ng flag

Yes, if/else is vulnerable to timing attacks. So is selecting the function to call by an array index, as in that other answer.

The only next-to-reliable solution for constant-time execution in a high-level language, lacking indication about the target environment, is: Make sure the code execution path and the memory access patterns are independent of the data.

This precludes calling only one of function_1 or function_2 (as the question's code and said other answer do), and indexing arrays, based on data that changes under the control of, or in a manner observable by, an adversary, even partly and indirectly.

If the extra cost of calling both functions is tolerable, and the behavior of the functions is well-defined and acceptable regardless of the low-order bit of their input, then this will do with most current C compilers and hardware:

   m = -(x&1);  // mask variable; assumes  m  and  x  are int
   x = (function_1(x) & m) | (function_2(x) & ~m);

or equivalently

   x = function_1(x) & -(x&1) | function_2(x) & ~ -(x&1);

This calls both functions, and selects the result appropriately, using the fact that the mask -(x&1) has all bits set to 1 for odd x, and all bits set to 0 otherwise.
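
For concreteness, here is a minimal, self-contained sketch of that selection, where f1 and f2 are hypothetical stand-ins for function_1 and function_2:

#include <stdio.h>

/* hypothetical stand-ins for function_1 / function_2 */
static int f1(int x) { return x + 1; }
static int f2(int x) { return x + 2; }

int main(void) {
    for (int x = 0; x < 4; x++) {
        int m = -(x & 1);                     // all bits 1 if x is odd, all 0 otherwise
        int r = (f1(x) & m) | (f2(x) & ~m);   // both functions always execute
        printf("x=%d -> %d\n", x, r);         // f1(x) selected for odd x, f2(x) for even x
    }
    return 0;
}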

I do not know of any CPU + C compiler introduced since 1985 where that technique would introduce a timing dependency on x. Modern literature on constant-time implementation of crypto assumes there is none, without saying so. That's despite the fact that the execution time of modern high-performance CPUs is unspecified, and can change unpredictably from the perspective of an application programmer (e.g. due to compilers/linkers and options thereof, code and data alignment, interrupts, memory wait states and refresh cycles, caches, pipelines, speculative execution, branch predictors, other threads on the same core/CPU/linked CPUs, arbitration of resources between threads on the same core, virtualization, BIOS or VM settings, and the seemingly endless flow of flawed microcode changes compromising performance to mitigate the latest reported side-channel attack…)

Importantly, that unrealistic hypothesis in the question

function_1 and function_2 are constant time and it takes the same time to compute them

can be replaced by this plausible and falsifiable one:

Variations in the execution time of function_1 are uncorrelated to the value of x. function_2 has the same property.

Caveat 1: the C/C++ languages give assurance only about the results obtained, none about the execution time. I know of no compiler manual that cares to give such an assurance, and of some CPUs whose documentation does not either.

Caveat 2: -(x&1) to get a mask according to the low-order bit of x is a well-known idiom, and nowadays works on almost all CPUs. That's because they use the two's complement convention, with -1 represented by all ones. But I've lost track of whether C promises to hide whatever different behavior the hardware might have (I leave that to the useful comments). And that needs to be ascertained no matter what.

Caveat 3: Adaptation may be necessary according to the type of x due to C promotion rules. I usually cast 1 to the type of x, e.g. if x is a uint32_t, I use -(x&(uint32_t)1), and would live happily with it if there were no compiler warnings.

Some compilers complain when taking the negative of an unsigned expression. They are wrong, per the C standard. The easy route is to bow and use (0-(x&1)), with casting of the constants as above. Another is to disable that complaint with some pragma or option. Yet another, hopeless in my experience, is to send a problem report. Carefully made arguments go to NUL (the local equivalent of /dev/null), including: the construct is useful, common, and explicitly supported by ISO C; and warning about it could get all warnings indiscriminately silenced.
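
Combining Caveats 2 and 3, a sketch of the idiom for an unsigned type might look like this (uint32_t chosen purely for illustration):

#include <stdint.h>

/* All-ones mask if the low bit of x is set, all-zeros otherwise.
   Written 0-(...) rather than -(...) to appease compilers that warn
   about negating an unsigned expression; the constant 1 is cast to
   the type of x, per the promotion-rule caveat above. */
static inline uint32_t low_bit_mask(uint32_t x) {
    return 0 - (x & (uint32_t)1);
}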

Note: I assumed the return type of function_1 and function_2 is the same type as x.


Performing security / cryptography on standard CPUs where adversaries can also run code has been consistently brittle since the 1980s. There are useful advances: process isolation with Memory Management Units, security rings or enclaves. There are software heuristics to reduce vulnerability to acceptable levels. But, so far, no silver bullet.

So when the stakes are high, it's often best to offload security functions to trusted hardware (Smart Cards, security CPUs, HSMs, TPMs…) designed for security first, not performance first; and where adversaries can't run code (the simplest, thus best), or where that ability is designed in.

SAI Peregrinus avatar
si flag
C does not (yet) promise two's complement behavior. In the upcoming C23 standard that will be guaranteed, so some platforms will need to emulate two's complement to have C23 compilers. Also, the expression isn't unsigned, even if `x` is unsigned. `1` is a signed integer literal, so both `x` and `1` get promoted according to the rules in section 6.3.1.8. Then the `&` is applied, and the result is negated. The result is of the (promoted) integer type of `(x&1)`. You can use typed literals (e.g. `1ULL`) to make the `1` have the same rank as `x`.
in flag
@SAIPeregrinus: POSIX does [promise](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/stdint.h.html) two's complement right now, so if you're developing for a POSIX-compliant target (which is most of them including Windows), then you're good.
Lorenzo Donati support Ukraine avatar
ru flag
@SAIPeregrinus Nitpick: 1ULL is `unsigned long long` whose size is *at least* 64 bits, according to C standard, so it has higher rank than `x` (`uint32_t`). Reference: C standard (N1256 draft - C99) section **5.2.4.2.1 Sizes of integer types**: *maximum value for an object of type unsigned long long int* `ULLONG_MAX 18446744073709551615 // 2^64 − 1`
Tom avatar
tf flag
Tom
That's a great answer and a big topic, as I see. BTW I'm working with the __m128i data type, so I do something like: _mm_xor_si128(x,k) & -_mm_cvtsi128_si64(x & 1). To get the least significant bit I need to use _mm_cvtsi128_si64. The result of _mm_cvtsi128_si64 and __m128i are two different data types, but AND works properly anyway, because of __m128i properties. So it is not always the case that 1 has to be the same type as x, and even x&1 doesn't have to be the same type as the output of function_1 or function_2. But SSE2 is kind of a special case.
Tom avatar
tf flag
Tom
@fgrieu If I understand everything correctly, it does not matter if we replace the presented formula with x = function_1(x) & -(x&1) | function_2(x) & -(x&1^1)? So we use xor 1 instead of ~.
fgrieu avatar
ng flag
@Tom: Yes, that also works. However most compilers will notice that `~ -(x&1)` can be efficiently computed from `-(x&1)`, or for free if the `&~` operator has hardware support, which is common since it's used in many common and useful idioms, e.g. in `y&m | z&~m`. But I have some doubts many compilers will reuse `-(x&1)` in the evaluation of `& -(x&1^1)`.
fgrieu avatar
ng flag
@Tom: Thanks for clarifying. I deleted your comment since it's read, thus no longer needed. Upvoting a comment is thanks enough (and for this reason the present comment will be destructed RSN).
Score:4
us flag

This started as an edit to fgrieu's answer, but it was very long, so I decided to post it as a separate answer.

I agree with everything fgrieu wrote in his answer.

I will quote fgrieu's answer here for emphasis:

The only next-to-reliable solution for constant-time execution in a high-level language, lacking indication about the target environment, is: Make sure the code execution path and the memory access patterns are independent of the data.

If we don't consider memory access patterns (they are not in the scope of the question), the quoted goal can only theoretically be achieved with constant memory access time and constant-time execution of every instruction. Of course, these come at the cost of extra clock cycles and extra memory accesses.

Memory access patterns are the subject of another type of side-channel attack, which requires Oblivious Random Access Memory (ORAM) to defend against.


If function_1 and function_2 are constant time and it takes the same time to compute them, is such branching still vulnerable for side-channel attacks?

Most commonly yes, if/else is vulnerable to timing attacks, because when you use an if/else statement you can expect the compiler or the runtime environment to emit a branch. In any case, you should generally try to avoid it.

Selecting the function to call by an array index, as in mentallurg's answer (in Python, for example), isn't that bad of an approach: it has constant memory access time, and the code executed by Python to make the function call is constant time.


Before I begin:

  • In the case of memory accesses, note that since we are interested in timing attacks, we are only interested in memory access time and not in patterns that may leak information, as those would make our situation a bit more complicated.
  • I only consider caches at the end; until then we ignore them. As everyone knows, CPU caches can be a cryptographer's worst nightmare in such situations.
  • I assume that each instruction that doesn't include a memory access takes the same number of clock cycles to execute.

Analysis of mentallurg's answer:

Let's take this simple Python code for example

def f1(x):
    return x + 1

def f2(x):
    return x + 2

def main():
    functions = [f1, f2]
    x = int(input("Enter value:"))
    i = x % 2
    result = functions[i](x)
    print(result)

if __name__=='__main__':
    main()

This gets compiled to the following Python bytecode (output of pycdas for a .pyc compiled with Python 3.10):

        [Code]
            File Name: test.py
            Object Name: main
            Arg Count: 0
            Pos Only Arg Count: 0
            KW Only Arg Count: 0
            Locals: 4
            Stack Size: 3
            Flags: 0x00000043 (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE)
            [Names]
                'f1'
                'f2'
                'int'
                'input'
                'print'
            [Var Names]
                'functions'
                'x'
                'i'
                'result'
            [Free Vars]
            [Cell Vars]
            [Constants]
                None
                'Enter value:'
                2
            [Disassembly]
                0       LOAD_GLOBAL             0: f1
                2       LOAD_GLOBAL             1: f2
                4       BUILD_LIST              2
                6       STORE_FAST              0: functions
                8       LOAD_GLOBAL             2: int
                10      LOAD_GLOBAL             3: input
                12      LOAD_CONST              1: 'Enter value:'
                14      CALL_FUNCTION           1
                16      CALL_FUNCTION           1
                18      STORE_FAST              1: x
                20      LOAD_FAST               1: x
                22      LOAD_CONST              2: 2
                24      BINARY_MODULO           
                26      STORE_FAST              2: i
                28      LOAD_FAST               0: functions
                30      LOAD_FAST               2: i
                32      BINARY_SUBSCR           
                34      LOAD_FAST               1: x
                36      CALL_FUNCTION           1
                38      STORE_FAST              3: result
                40      LOAD_GLOBAL             4: print
                42      LOAD_FAST               3: result
                44      CALL_FUNCTION           1
                46      POP_TOP                 
                48      LOAD_CONST              0: None
                50      RETURN_VALUE

The lookup and the execution of the line result = functions[i](x) are done at byte offsets 26-36. Let's take a look at the BINARY_SUBSCR opcode. It takes two arguments from the stack (the ones that were previously loaded with LOAD_FAST): a list (it can also be a dict or a tuple, but let's not focus on that) and an index; it returns the value at that index and decreases the stack depth by 1.

Now, let's take a look at how BINARY_SUBSCR is implemented in CPython. The implementation can be found in CPython's ceval.c and is the following:

        TARGET(BINARY_SUBSCR) {
            PREDICTED(BINARY_SUBSCR);
            PyObject *sub = POP();
            PyObject *container = TOP();
            PyObject *res = PyObject_GetItem(container, sub);
            Py_DECREF(container);
            Py_DECREF(sub);
            SET_TOP(res);
            if (res == NULL)
                goto error;
            JUMPBY(INLINE_CACHE_ENTRIES_BINARY_SUBSCR);
            DISPATCH();
        }

Now the whole analysis can focus on PyObject *res = PyObject_GetItem(container, sub);. This is a generic function, and various other methods are called along the way before the item is retrieved. Of course we expect $O(1)$ complexity, but it remains to check. In the end PyList_GetItem is called, which is the following:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (!valid_index(i, Py_SIZE(op))) {
        _Py_DECLARE_STR(list_err, "list index out of range");
        PyErr_SetObject(PyExc_IndexError, &_Py_STR(list_err));
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}

As we can see from the last line, it has $O(1)$ complexity. Of course, due to the complexity of high-level languages, they are never used in practice for such applications. So let's try this code in a lower-level language, like C, to see what it produces.

#include <stdio.h>

int f1(int x) {
    return x + 1;
}

int f2(int x) {
    return x + 2;
}

int main() {
    int x;
    scanf("%d", &x);
    int (*(functions[2]))(int) = {f1, f2};
    int i;
    i = x % 2;
    int result = functions[i](x);
}

This is the MIPS assembly from Godbolt with the latest GCC without optimizations:

f1:
        addiu   $sp,$sp,-8
        sw      $fp,4($sp)
        move    $fp,$sp
        sw      $4,8($fp)
        lw      $2,8($fp)
        nop
        addiu   $2,$2,1
        move    $sp,$fp
        lw      $fp,4($sp)
        addiu   $sp,$sp,8
        jr      $31
        nop

f2:
        addiu   $sp,$sp,-8
        sw      $fp,4($sp)
        move    $fp,$sp
        sw      $4,8($fp)
        lw      $2,8($fp)
        nop
        addiu   $2,$2,2
        move    $sp,$fp
        lw      $fp,4($sp)
        addiu   $sp,$sp,8
        jr      $31
        nop

$LC0:
        .ascii  "%d\000"
main:
        addiu   $sp,$sp,-56
        sw      $31,52($sp)
        sw      $fp,48($sp)
        move    $fp,$sp
        addiu   $2,$fp,32
        move    $5,$2
        lui     $2,%hi($LC0)
        addiu   $4,$2,%lo($LC0)
        jal     __isoc99_scanf
        nop

        lui     $2,%hi(f1)
        addiu   $2,$2,%lo(f1)
        sw      $2,36($fp)
        lui     $2,%hi(f2)
        addiu   $2,$2,%lo(f2)
        sw      $2,40($fp)
        lw      $3,32($fp)
        li      $2,-2147483648                  # 0xffffffff80000000
        ori     $2,$2,0x1
        and     $2,$3,$2
        bgez    $2,$L6
        nop

        addiu   $2,$2,-1
        li      $3,-2                           # 0xfffffffffffffffe
        or      $2,$2,$3
        addiu   $2,$2,1
$L6:
        sw      $2,24($fp)
        lw      $2,24($fp)
        nop
        sll     $2,$2,2
        addiu   $3,$fp,24
        addu    $2,$3,$2
        lw      $2,12($2)
        lw      $3,32($fp)
        nop
        move    $4,$3
        move    $25,$2
        jalr    $25
        nop

        sw      $2,28($fp)
        move    $2,$0
        move    $sp,$fp
        lw      $31,52($sp)
        lw      $fp,48($sp)
        addiu   $sp,$sp,56
        jr      $31
        nop

More specifically, we are interested in these instructions, in which the call to the appropriate function is made:

        lw      $2,24($fp)
        nop
        sll     $2,$2,2
        addiu   $3,$fp,24
        addu    $2,$3,$2
        lw      $2,12($2)
        lw      $3,32($fp)
        nop
        move    $4,$3
        move    $25,$2
        jalr    $25
        nop

        sw      $2,28($fp)
        move    $2,$0

As we can see, no branches are taken, except for the jalr, which calls the appropriate function.


Analysis of fgrieu's answer:

Of course it is easy to see that this is constant time:

#include <stdio.h>

int function_1(int x);  /* prototypes; the definitions are assumed to live elsewhere */
int function_2(int x);

int f1(int x) {
    return x + 1;
}

int f2(int x) {
    return x + 2;
}

int main() {
    int x;
    scanf("%d", &x);
    int (*(functions[2]))(int) = {f1, f2};
    int m = -(x&1);  // mask variable  m  must have the same type as  x
    x = (function_1(x) & m) | (function_2(x) & ~m);
    printf("%d\n", x);
}

Again, Godbolt with the same options, this time targeting x86-64:

f1:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        add     eax, 1
        pop     rbp
        ret
f2:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        add     eax, 2
        pop     rbp
        ret
.LC0:
        .string "%d"
main:
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 40
        lea     rax, [rbp-24]
        mov     rsi, rax
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    __isoc99_scanf
        mov     QWORD PTR [rbp-48], OFFSET FLAT:f1
        mov     QWORD PTR [rbp-40], OFFSET FLAT:f2
        mov     eax, DWORD PTR [rbp-24]
        and     eax, 1
        neg     eax
        mov     DWORD PTR [rbp-20], eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_1
        and     eax, DWORD PTR [rbp-20]
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_2
        mov     edx, DWORD PTR [rbp-20]
        not     edx
        and     eax, edx
        or      eax, ebx
        mov     DWORD PTR [rbp-24], eax
        mov     eax, DWORD PTR [rbp-24]
        cdqe
        mov     rdi, rax
        mov     eax, 0
        call    printf
        mov     eax, 0
        mov     rbx, QWORD PTR [rbp-8]
        leave
        ret

We focus on this part, where the assignment to m is done and the branchless selection happens:

        mov     eax, DWORD PTR [rbp-24]
        and     eax, 1
        neg     eax
        mov     DWORD PTR [rbp-20], eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_1
        and     eax, DWORD PTR [rbp-20]
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-24]
        mov     edi, eax
        mov     eax, 0
        call    function_2
        mov     edx, DWORD PTR [rbp-20]
        not     edx
        and     eax, edx
        or      eax, ebx
        mov     DWORD PTR [rbp-24], eax

We can indeed see constant-time, branch-free execution.

So what is the difference between these solutions?

Both satisfy anti-timing-analysis measures, but the second one also hides the access patterns: it always calls both functions. Although we haven't brought caches into the picture so far, in their presence the second one seems more secure against timing side-channel attacks, simply because it needs the code of both functions to be in the cache in order to execute. In the first case, only the function that is actually called gets cached; if for some period of time f1 was called and f2 was evicted from the cache, there would be a difference in time when f2 is called again.
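
To make the measurement side concrete, here is a rough sketch (assuming POSIX clock_gettime and the toy f1/f2 from the examples above; a real attack would need many repetitions and a finer-grained timer) of how per-call latency could be sampled, with cache misses showing up as outliers:

#include <stdio.h>
#include <time.h>

static int f1(int x) { return x + 1; }
static int f2(int x) { return x + 2; }

/* Time a single call in nanoseconds. */
static long long time_call(int (*f)(int), int x) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile int r = f(x);  /* volatile keeps the call from being optimized away */
    (void)r;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    /* An observer would compare latency distributions over many such samples. */
    for (int i = 0; i < 8; i++)
        printf("f1: %lld ns, f2: %lld ns\n", time_call(f1, i), time_call(f2, i));
    return 0;
}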

For other solutions and further reading on the matter you can consider [1] and [2].

Score:-1
kr flag

You can put the functions into an array and refer to them by index.

In Python it would look like the following:

def f1( x ):
    ...

def f2( x ):
    ...

functions = [ f1, f2 ]

i = x % 2

result = functions[ i ]( x )
Tom avatar
tf flag
Tom
Thanks. This also seems to be smart. Of course I'm just trying to make my code better; this is not for professional use, for now.
kelalaka avatar
in flag
Are you sure this is not (or will not be) converted to an if-else in the background? Relying on the compiler is too risky as long as you are not working with [Thomas Pornin's T1](https://www.youtube.com/watch?v=IHbtK5Kwt6A)
kr flag
@kelalaka: I am sure that it works as expected. This is not a case that can be decided by the compiler or by the runtime environment on common platforms like C, C++, C#, Java, Python. What is often optimized are **boolean** expressions. E.g. if there is the expression `a && b` and the runtime knows `a == false` (or `a == 0`), then on many platforms it is not even delegated to the runtime: the language specification requires that the runtime *must* skip evaluation of the 2nd operand.
fgrieu avatar
ng flag
This might be an improvement. But **that does not fully solve the issue**. All kinds of things including caches, alignment… conspire to make it impossible to make portable code for modern computers constant-time while having code paths or data access patterns that are data dependent, as in the question. Achieving that is only possible with tight control of the execution environment (e.g. in assembly language on CPU with a clear model of the execution cycles), and high-level languages do not allow that.
kr flag
@fgrieu: Beginning at some point, you **cannot** control execution. Even if you implement it in assembler, you cannot control how the OS organizes access to memory, how memory is split, or how the OS uses different optimizations. At a lower level still, even the OS cannot control how the computer uses caches of different types. Furthermore, most modern CPUs in PCs employ **branch prediction**, so that some code can be executed **in advance** (before it really may be needed) and its result may be used later on, or may not be used and just thrown away. And there is no way to control it.
kr flag
@fgrieu: ... That's why it only makes sense to talk about reducing side-channel information leakage at some macro level. Some steps for reducing leakage at the low level may include using specialized OSs, specialized hardware, or tools that generate more "noise" during execution. I read the OP namely as a question about reducing side-channel leakage at the *macro* level.
fgrieu avatar
ng flag
@mentallurg: I fully agree on "you cannot control execution" and the rest of that comment. I take it as an argument that the solution the answer outlines, and any that leads to execution of a single function according to the parity of `x`, should only be expected to _reduce_ timing variations correlated to the parity of `x`, whereas the one in my answer, when and if practicable, _removes_ them on all architectures I know, no matter the OS, CPU, branch prediction, and status of microcode facing whatever was the latest attack.