The Salsa20 core consists primarily of many updates of the form:
x[i] ^= rol32(x[j] + x[k], N);
or
t = x[j] + x[k];
u = rol32(t, N);
x[i] = x[i] ^ u;
where rol32 is left-rotation of a 32-bit word by N bits.
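For reference, the update can be written as a small self-contained C sketch. The helper name `xor_rot_add` and its signature are made up for illustration, and masking the shift counts is just one common way to write a rotation that stays well defined when N is 0. None of this is taken from a reference implementation.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits. Masking both shift counts with 31
   keeps them in 0..31, so n = 0 (or n = 32) is well defined. */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

/* Hypothetical helper: x[i] ^= rol32(x[j] + x[k], n).
   The addition wraps modulo 2^32 because the operands are unsigned. */
static inline void xor_rot_add(uint32_t x[], int i, int j, int k, unsigned n)
{
    x[i] ^= rol32(x[j] + x[k], n);
}
```

For example, starting from x = {1, 0, 0, 0}, calling xor_rot_add(x, 1, 0, 3, 7) sets x[1] to 0x80.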
On x86, suppose x[i], x[j], and x[k] are in registers edi, esi, and edx, respectively.
The update can then be computed, using eax as a temporary register, by (AT&T syntax):
lea (%esi,%edx),%eax
rol $N,%eax
xor %eax,%edi
Note that we're taking advantage of "three-operand LEA" here to add esi and edx and put the result in a third register, eax, leaving both source registers intact.
In contrast, ROL and XOR are "two-operand" instructions: they reuse one of the source registers as a destination register, and they have no "three-operand" version.
If the sequence were instead
x[i] += rol32(x[j] ^ x[k], N);
or
t = x[j] ^ x[k];
u = rol32(t, N);
x[i] = x[i] + u;
then we would need an extra MOV instruction to compute the first XOR in a temporary register, because x[j] and x[k] are used again later, so we can't overwrite either of them yet:
mov %esi,%eax
xor %edx,%eax
rol $N,%eax
add %eax,%edi
It's a small difference, but the Salsa20 core is performance-critical: it is designed to process data fast, ideally at line rate on a network.
Of course, for real performance you would use the CPU's vector unit (SSE or AVX, on x86), so it's largely a moot point today.
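To illustrate the vector approach, here is a rough sketch of the same update applied to four 32-bit lanes at once with SSE2 intrinsics. SSE2 has no packed 32-bit rotate instruction, so the rotation is built from two shifts and an OR. The names `rol32_x4` and `xor_rot_add_x4` are hypothetical and n is assumed to be between 1 and 31. How the Salsa20 state would be laid out across vector registers is not addressed here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate each of the four 32-bit lanes left by n bits (1 <= n <= 31 assumed). */
static inline __m128i rol32_x4(__m128i v, int n)
{
    __m128i left  = _mm_sll_epi32(v, _mm_cvtsi32_si128(n));
    __m128i right = _mm_srl_epi32(v, _mm_cvtsi32_si128(32 - n));
    return _mm_or_si128(left, right);
}

/* Hypothetical helper: per lane, xi ^ rol32(xj + xk, n). */
static inline __m128i xor_rot_add_x4(__m128i xi, __m128i xj, __m128i xk, int n)
{
    return _mm_xor_si128(xi, rol32_x4(_mm_add_epi32(xj, xk), n));
}
```

Unlike the scalar two-operand instructions discussed above, each intrinsic here returns its result as a new value, so register allocation and any copies are left to the compiler.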