This started as an edit to fgrieu's answer, but it was very long, so I decided to post it as a separate answer.
I agree with everything fgrieu wrote in his answer, and I will quote part of it here for emphasis:
The only next-to-reliable solution for constant-time execution in a high-level language, lacking indication about the target environment, is: Make sure the code execution path and the memory access patterns are independent of the data.
Let us set memory access patterns aside for a moment, since they are not in the scope of the question. The quoted advice can then only be achieved, in theory, with constant memory access time and constant-time execution of every instruction. Of course, these come at the cost of extra clock cycles and extra memory accesses.
Memory access patterns are the subject of another class of side-channel attacks; defending against them in general requires Oblivious Random Access Memory (ORAM).
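As a side note, for small tables a well-known, lighter-weight mitigation than full ORAM is to touch every entry and select the wanted one with a mask, so that the sequence of addresses read does not depend on the secret index. A minimal C sketch of that idea (the names are mine, not from any of the answers discussed here, and the generated assembly should be checked to confirm the comparison is not compiled into a branch):

#include <stdint.h>
#include <stddef.h>

/* Sketch: scan the whole table and keep only the entry at secret_idx.
   Every call reads the same addresses in the same order, regardless of
   the value of secret_idx. */
uint32_t ct_lookup(const uint32_t *table, size_t len, size_t secret_idx) {
    uint32_t result = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t mask = (uint32_t)0 - (uint32_t)(i == secret_idx); /* all-ones or all-zeros */
        result |= table[i] & mask;
    }
    return result;
}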
If function_1 and function_2 are constant time and it takes the same
time to compute them, is such branching still vulnerable for
side-channel attacks?
Most commonly yes: if/else is vulnerable to timing attacks, because an if/else statement typically makes the compiler or the runtime environment emit a branch, and which side is taken depends on the data. In any case, you generally try to avoid it, as sketched below.
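For concreteness, this is the kind of data-dependent branching the above refers to; a toy C sketch of my own, not code from the question:

int f1(int x) { return x + 1; }
int f2(int x) { return x + 2; }

/* Which path executes depends on the (possibly secret) value of x, so branch
   prediction and instruction-cache effects can leak x & 1 even if f1 and f2
   are individually constant time. */
int dispatch(int x) {
    if (x % 2 == 0)
        return f1(x); /* taken only for even x */
    return f2(x);     /* taken only for odd x */
}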
Selecting the function to call by an array index, as in mentallurg's answer (in Python), is not a bad approach: the array access takes constant time, and the code Python executes to make the function call is constant time.
Before I begin:
- Regarding memory accesses: since we are interested in timing attacks, we only care about memory access time, not about access patterns that may leak information, as those would complicate the situation considerably.
- I only consider caches at the end; until then we ignore them, even though CPU caches can be a cryptographer's worst nightmare in such situations.
- I assume that every instruction that does not involve a memory access takes the same number of clock cycles to execute.
Analysis of mentallurg's answer:
Let's take this simple Python code as an example:
def f1(x):
    return x + 1

def f2(x):
    return x + 2

def main():
    functions = [f1, f2]
    x = int(input("Enter value:"))
    i = x % 2
    result = functions[i](x)
    print(result)

if __name__ == '__main__':
    main()
This gets compiled to the following Python bytecode (output of pycdas for a .pyc compiled with Python 3.10):
[Code]
File Name: test.py
Object Name: main
Arg Count: 0
Pos Only Arg Count: 0
KW Only Arg Count: 0
Locals: 4
Stack Size: 3
Flags: 0x00000043 (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE)
[Names]
'f1'
'f2'
'int'
'input'
'print'
[Var Names]
'functions'
'x'
'i'
'result'
[Free Vars]
[Cell Vars]
[Constants]
None
'Enter value:'
2
[Disassembly]
0 LOAD_GLOBAL 0: f1
2 LOAD_GLOBAL 1: f2
4 BUILD_LIST 2
6 STORE_FAST 0: functions
8 LOAD_GLOBAL 2: int
10 LOAD_GLOBAL 3: input
12 LOAD_CONST 1: 'Enter value:'
14 CALL_FUNCTION 1
16 CALL_FUNCTION 1
18 STORE_FAST 1: x
20 LOAD_FAST 1: x
22 LOAD_CONST 2: 2
24 BINARY_MODULO
26 STORE_FAST 2: i
28 LOAD_FAST 0: functions
30 LOAD_FAST 2: i
32 BINARY_SUBSCR
34 LOAD_FAST 1: x
36 CALL_FUNCTION 1
38 STORE_FAST 3: result
40 LOAD_GLOBAL 4: print
42 LOAD_FAST 3: result
44 CALL_FUNCTION 1
46 POP_TOP
48 LOAD_CONST 0: None
50 RETURN_VALUE
The lookup and the execution of the line result = functions[i](x) happen at bytecode offsets 28-36. Let's take a look at the BINARY_SUBSCR opcode. It takes two operands from the stack (pushed by the preceding LOAD_FAST instructions): a container (a list here; it could also be a dict or a tuple, but let's not focus on that) and an index. It replaces them with the value at that index, shrinking the stack by one.
Now, let's take a look at how BINARY_SUBSCR is implemented in CPython. The implementation can be found here and is the following:
TARGET(BINARY_SUBSCR) {
    PREDICTED(BINARY_SUBSCR);
    PyObject *sub = POP();
    PyObject *container = TOP();
    PyObject *res = PyObject_GetItem(container, sub);
    Py_DECREF(container);
    Py_DECREF(sub);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    JUMPBY(INLINE_CACHE_ENTRIES_BINARY_SUBSCR);
    DISPATCH();
}
Now the whole analysis can be focused on PyObject *res = PyObject_GetItem(container, sub);. This is a generic function, and various other functions are called in between before the item is retrieved. Of course we expect $O(1)$ complexity, but it remains to be checked. In the end PyList_GetItem is called, which is the following:
PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (!valid_index(i, Py_SIZE(op))) {
        _Py_DECLARE_STR(list_err, "list index out of range");
        PyErr_SetObject(PyExc_IndexError, &_Py_STR(list_err));
        return NULL;
    }
    return ((PyListObject *)op)->ob_item[i];
}
As we can see from the last line, the access has $O(1)$ complexity. Of course, due to the complexity of high-level languages, they are never used in practice for such applications. So let's try the same code in a lower-level language, like C, and see what it produces.
#include <stdio.h>

int f1(int x) {
    return x + 1;
}

int f2(int x) {
    return x + 2;
}

int main() {
    int x;
    scanf("%d", &x);
    int (*functions[2])(int) = {f1, f2};
    int i = x % 2;
    int result = functions[i](x);
}
This is the assembly from Godbolt, compiled with GCC for MIPS without optimizations:
f1:
addiu $sp,$sp,-8
sw $fp,4($sp)
move $fp,$sp
sw $4,8($fp)
lw $2,8($fp)
nop
addiu $2,$2,1
move $sp,$fp
lw $fp,4($sp)
addiu $sp,$sp,8
jr $31
nop
f2:
addiu $sp,$sp,-8
sw $fp,4($sp)
move $fp,$sp
sw $4,8($fp)
lw $2,8($fp)
nop
addiu $2,$2,2
move $sp,$fp
lw $fp,4($sp)
addiu $sp,$sp,8
jr $31
nop
$LC0:
.ascii "%d\000"
main:
addiu $sp,$sp,-56
sw $31,52($sp)
sw $fp,48($sp)
move $fp,$sp
addiu $2,$fp,32
move $5,$2
lui $2,%hi($LC0)
addiu $4,$2,%lo($LC0)
jal __isoc99_scanf
nop
lui $2,%hi(f1)
addiu $2,$2,%lo(f1)
sw $2,36($fp)
lui $2,%hi(f2)
addiu $2,$2,%lo(f2)
sw $2,40($fp)
lw $3,32($fp)
li $2,-2147483648 # 0xffffffff80000000
ori $2,$2,0x1
and $2,$3,$2
bgez $2,$L6
nop
addiu $2,$2,-1
li $3,-2 # 0xfffffffffffffffe
or $2,$2,$3
addiu $2,$2,1
$L6:
sw $2,24($fp)
lw $2,24($fp)
nop
sll $2,$2,2
addiu $3,$fp,24
addu $2,$3,$2
lw $2,12($2)
lw $3,32($fp)
nop
move $4,$3
move $25,$2
jalr $25
nop
sw $2,28($fp)
move $2,$0
move $sp,$fp
lw $31,52($sp)
lw $fp,48($sp)
addiu $sp,$sp,56
jr $31
nop
More specifically, we are interested in these instructions, where the call to the appropriate function is made:
lw $2,24($fp)
nop
sll $2,$2,2
addiu $3,$fp,24
addu $2,$3,$2
lw $2,12($2)
lw $3,32($fp)
nop
move $4,$3
move $25,$2
jalr $25
nop
sw $2,28($fp)
move $2,$0
As we can see, no branches are taken in this sequence, except for the jalr, which calls the appropriate function through a register.
Analysis of fgrieu's answer
Of course, it is easy to see that this is constant time:
#include <stdio.h>

int function_1(int x) {
    return x + 1;
}

int function_2(int x) {
    return x + 2;
}

int main() {
    int x;
    scanf("%d", &x);
    int (*functions[2])(int) = {function_1, function_2};
    int m = -(x & 1); // mask variable m must have the same type as x
    x = (function_1(x) & m) | (function_2(x) & ~m);
    printf("%d\n", x);
}
Again from Godbolt, this time with x86-64 GCC, without optimizations:
function_1:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
add eax, 1
pop rbp
ret
function_2:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
add eax, 2
pop rbp
ret
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 40
lea rax, [rbp-24]
mov rsi, rax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call __isoc99_scanf
mov QWORD PTR [rbp-48], OFFSET FLAT:function_1
mov QWORD PTR [rbp-40], OFFSET FLAT:function_2
mov eax, DWORD PTR [rbp-24]
and eax, 1
neg eax
mov DWORD PTR [rbp-20], eax
mov eax, DWORD PTR [rbp-24]
mov edi, eax
mov eax, 0
call function_1
and eax, DWORD PTR [rbp-20]
mov ebx, eax
mov eax, DWORD PTR [rbp-24]
mov edi, eax
mov eax, 0
call function_2
mov edx, DWORD PTR [rbp-20]
not edx
and eax, edx
or eax, ebx
mov DWORD PTR [rbp-24], eax
mov eax, DWORD PTR [rbp-24]
cdqe
mov rdi, rax
mov eax, 0
call printf
mov eax, 0
mov rbx, QWORD PTR [rbp-8]
leave
ret
We focus on this part, where the assignment to m and the branchless selection are done:
mov eax, DWORD PTR [rbp-24]
and eax, 1
neg eax
mov DWORD PTR [rbp-20], eax
mov eax, DWORD PTR [rbp-24]
mov edi, eax
mov eax, 0
call function_1
and eax, DWORD PTR [rbp-20]
mov ebx, eax
mov eax, DWORD PTR [rbp-24]
mov edi, eax
mov eax, 0
call function_2
mov edx, DWORD PTR [rbp-20]
not edx
and eax, edx
or eax, ebx
mov DWORD PTR [rbp-24], eax
We can indeed see constant-time execution: both functions are always called, and the result is combined with plain arithmetic, with no conditional branch in sight.
So what is the difference between these solutions?
Both satisfy the anti-timing-analysis requirement, but the second one also hides the access pattern, because it always calls both functions. We have ignored caches so far; once caches enter the picture, the second solution looks more secure against timing side-channel attacks, simply because executing it requires the code of both functions to be in the instruction cache. With the first solution, only the function that is actually called gets cached: if for some period of time only f1 was called and f2 was evicted from the cache, there would be a measurable timing difference when f2 is called again.
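If one wants to reuse the mask trick from the second solution, it can be factored into a small helper that always evaluates both candidates and selects one with pure arithmetic. This is only a sketch with my own naming; as always, the generated assembly should be inspected to make sure the compiler did not reintroduce a branch:

#include <stdint.h>

/* Sketch of a generic constant-time select: returns a if the low bit of
   choose_a is 1, otherwise b. The choice drives only arithmetic, never a
   branch or a memory address; callers should compute both a and b
   unconditionally. */
static inline uint32_t ct_select_u32(uint32_t a, uint32_t b, uint32_t choose_a) {
    uint32_t mask = (uint32_t)0 - (choose_a & 1); /* all-ones if choose_a, else 0 */
    return (a & mask) | (b & ~mask);
}

/* Hypothetical usage, mirroring the examples above:
   result = ct_select_u32((uint32_t)f1(x), (uint32_t)f2(x), (x & 1) == 0); */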
For other solutions and further reading on the matter, you can consider [1] and [2].