Score:2

What are the performance reasons behind "xor-a-rotated-sum" instead of "add-a-rotated-xor" in Salsa20?

us flag

I'm currently reading the specification of Salsa20 (link). DJB on whether he chooses "xor-a-rotated-sum" instead of "add-a-rotated-xor" states the following :

Should there be modifications other than xor-a-rotated-sum? There are many plausible ways to modify each word in a column using other words in the same column. I settled on “xor a rotated sum” as bouncing back and forth between incompatible structures on the critical path. I chose “xor a rotated sum” over “add a rotated xor” for simple performance reasons: the x86 architecture has a three-operand addition (LEA) but not a three-operand xor.

First of all, I cannot understand why three-operand operations are mentioned since every operation whether it is xor or add is done in for a pair of words. Also, at first, from my little knowledge in embedded systems and a bit of research I did, a lot of "tricky hacks" can be done with the LEA instruction that is mentioned, see for example 1,2,3. But as seen in the last reference, three operand addition cannot be done in a single x86 instruction, although as mentioned in the second reference they can be parallelized. However, I still doubt 3 arguments addition will be faster than three arguments xor.

So the questions are, why we are bothered with 3 argument operations and is there evidence that 3 arguments addition is faster than 3 arguments xor?

fgrieu avatar
ng flag
Looking at the code [there](https://cr.yp.to/salsa20.html), I see no opportunity for three-arguments addition. Thus I'm just as puzzled as you are by the "simple performance reasons" that you quote, and won't make an answer.
JAAAY avatar
us flag
I was hoping maybe there is are advantages in SIMD implementation but after reading [this](https://crypto.stackexchange.com/questions/9019/fast-salsa20-in-java) I stopped hopping. If I don't get any answers I will try to ask in stackoverflow. Thanks you for your time and interest.
Score:11
cn flag

Three-operand means two source registers and one destination register. Most x86 instructions reuse one of the source registers as the destination, so you must use an extra MOV instruction to copy one of the sources if you need to save it. Other architectures (ARM, etc.) usually encode the destination register separately.

JAAAY avatar
us flag
Thanks you for your answer, can you elaborate further maybe with an small few lines example and reply to the specific questions so as it will be easier for me and other readers to understand.
benrg avatar
cn flag
@JAAAY I'm not sure I understand either question. I think they're based on a false premise, since they both ask about 3-argument operations ($a+b+c$), but djb was talking about 3-operand instructions where one operand is an output ($a\leftarrow b+c$). x86 has $a\leftarrow b+c$, but for xor it only has $a\leftarrow a\oplus b$ where $a$ appears on both sides.
fgrieu avatar
ng flag
@JAAAY: thanks to benrq's insight, I think that I now see it. When implementing the code [there](https://cr.yp.to/salsa20.html) that repeatedly does xor-of-rotated-add, we need to store the result of the addition in a temporary register distinct from the two operands (since these operands must be kept unchanged), before we rotate that, then xor the outcome. LEA (or is it some ADD allowing freedom in the destination register) makes the introduction of the temporary free. With add-of-rotated-xor, we would need to copy one of the arguments of xor to a temporary register, one extra instruction.
Score:3
co flag

The Salsa20 core consists primarily of many updates of the form:

x[i] ^= rol32(x[j] + x[k], N);

or

t = x[j] + x[k];
u = rol32(t, N);
x[i] = x[i] ^ u;

where rol32 is left-rotation of a 32bit word.

In x86 (AT&T syntax), suppose x[i], x[j], and x[k] are respectively in registers edi, esi, and edx. This can be computed, using eax as a temporary register, by:

lea (%esi,%edx),%eax
rol $N,%eax
xor %eax,%edi

Note that we're taking advantage of "three-operand LEA" here to add esi and edx, and put the result in a third register eax. In contrast, ROL and XOR are "two-operand" instructions: they reuse one of the source registers as a destination register, and they have no "three-operand" version.


If the sequence were instead

x[i] += rol32(x[j] ^ x[k], N);

or

t = x[j] ^ x[k];
u = rol32(t, N);
x[i] = x[i] + u;

then we would need an extra MOV instruction to compute the first XOR in a temporary register, because we will use x[j] and x[k] later so we can't destroy them right away:

mov %esi,%eax
xor %edx,%eax
rol $N,%eax
add %eax,%edi

It's a small difference, but the Salsa20 core is designed to be fast and performance-critical for processing data ideally at line rate on the network. Of course, for real performance, you would use the CPU's vector unit—SSE or AVX, on x86—so it's largely a moot point today.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.