Score:2

Efficient way to pick an array index by using a, say, 64 bit random number?

in flag

Say, I have uint64_t rand = <some random number>, and char array[20] = .... My goal is to pick an element in array based on the content of rand.

  1. One slow way is to use the remainder: size_t i = rand % 20 then pick the element by array[i].
  2. Another way, which I guess is faster, is i = rand/UINT64_MAX * 20. Or, to avoid needing floating-point operations, its inverse counterpart 20/(UINT64_MAX/rand).
  3. A 3rd way is to use the random bits to branch to the index like a tree (but misses every 5th number):
size_t total_bytes = 20;
size_t mask = 1;
size_t i = 0;
while (total_bytes) {
  if (rand & mask) i += total_bytes / 2;  // branch right
  else i += 0;  // branch left
  mask <<= 1;
  total_bytes /= 2;
}

Is there any faster way on common hardware? E.g. laptop/desktop PCs?

The reason I care: I'm implementing a memory-hard key derivation function, and at some point I need to pick an array element based on the content of a calculated ciphertext. The random number is 64 bits.

Target language is C.

Meir Maor avatar
in flag
Have you actually checked %20 is too slow? On a modern PC? I would be shocked.
Maarten Bodewes avatar
in flag
@caveman Never mind, the question was slightly different than expected. Late night comments....
in flag
Cross posted: https://stackoverflow.com/questions/68809491/whats-the-fastest-method-in-c-for-converting-a-64bit-random-number-into-a-small with more detail in the comments, including that "20" is not a constant.
Score:4
ng flag

rand % 20 generates a result in $\{0,1,\ldots,18,19\}$ that is nearly uniform (assuming rand is): $\Pr(19)/\Pr(0)=1-1/922337203685477581$. That's often a tolerable bias.
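For concreteness, the exact per-residue counts behind that ratio can be computed directly; a minimal sketch (the helper name is mine), assuming rand is uniform over all 2^64 values:

#include <inttypes.h>
#include <stdio.h>

// Number of 64-bit inputs reducing to residue r under % 20:
// these are r, r+20, r+40, ..., i.e. (UINT64_MAX - r) / 20 + 1 values.
static uint64_t count_for_residue(unsigned r) {
    return (UINT64_MAX - (uint64_t)r) / 20 + 1;
}

int main(void) {
    printf("count(0)  = %" PRIu64 "\n", count_for_residue(0));   // 922337203685477581
    printf("count(19) = %" PRIu64 "\n", count_for_residue(19));  // 922337203685477580
    return 0;
}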

On a "laptop/desktop PC" with a modern 64-bit CPU, rand % 20 is reasonably fast, and it has the important virtues of being correct, simple, and easily adaptable. However, it's at least often (see the comments) possible to be faster using

(rand-((rand-(rand>>2))>>1))>>59

which has the same (optimum) ratio between the least and most probable outcomes, while using only shift and subtraction operations. I'm more confident that the generated code is constant-time, which can be important in crypto applications. And the mean is closer to $19/2$.

For an intuition of how that formula works: for any $x\in\mathbb R$ it holds that $(x-(x-x\,2^{-2})\,2^{-1})\,2^{-59}=20\,x\,2^{-64}$, thus we essentially evaluate what the expressions (uint64_t)floor(rand*(20/(UINT64_MAX+1.))) or (uint64_t)((rand*(uint128_t)20)>>64) attempt to evaluate. Notice that for some values, including rand=0xCCCCCCCCCCCCCCCC, the latter formula does not exactly coincide with the formula I propose; yet the distribution achieved by both is optimally uniform.
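As a minimal sketch (the function name is mine, not from the original), here is that expression wrapped as a C function, with spot checks matching the values discussed above:

#include <assert.h>
#include <stdint.h>

// Map a uniform 64-bit value to {0,...,19} using only shifts and subtractions:
// x - ((x - (x >> 2)) >> 1) is approximately x*5/8, and >> 59 scales that to roughly x*20/2^64.
static uint64_t index20_shift(uint64_t x) {
    return (x - ((x - (x >> 2)) >> 1)) >> 59;
}

int main(void) {
    assert(index20_shift(0) == 0);
    assert(index20_shift(UINT64_MAX) == 19);
    // For 0xCCCCCCCCCCCCCCCC this yields 16, whereas the 128-bit-product
    // formula yields 15 (the one-off disagreement mentioned above).
    assert(index20_shift(0xCCCCCCCCCCCCCCCCu) == 16);
    return 0;
}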

The method is not limited to the constant $m=20$ for the array size. It generalizes to any constant $m$ with moderate Hamming weight. Computing appropriate shift counts from the constants is nontrivial. I refer to this marvelous answer (note: the last shift count given there must be increased by 32 in the case at hand) for something that works, but is not quite always optimal. I have no other reference for the method, which I (re-?)invented for an ARM Cortex-M0, where it proved useful. Actually I only empirically found formulas for a few constants fitting my need, and Anders Kaseorg takes full credit for how to generate formulas systematically.


If we are willing to lose a little uniformity and the assurance that the code is constant-time, we can use

((rand>>3)*5)>>59

which is simpler, likely faster, and easier to adapt to constants $m$ other than $20$: we write $m$ as $r\,2^i$ with $i$ an integer and $r$ preferably odd, then find the integer $j$ with $2^{j-1}\le r<2^j$. We use ((rand>>j)*r)>>(64-i-j). The problem is that the lower $j$ bits of rand are not used, and the uniformity of the outcome is correspondingly reduced (except if $m$ is a power of two).
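A sketch of that recipe instantiated for m = 20, so r = 5, i = 2, j = 3 and the final shift count is 64-i-j = 59 (the function name is mine):

#include <stdint.h>

// m = 20 = 5 * 2^2, so r = 5, i = 2, and j = 3 since 2^2 <= 5 < 2^3.
// The low j = 3 bits of x are discarded, which slightly reduces uniformity.
static uint64_t index20_mulshift(uint64_t x) {
    return ((x >> 3) * 5) >> (64 - 2 - 3);  // same as ((x >> 3) * 5) >> 59
}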

When $m$ is $2^j$ for some integer $j$, we can use rand>>(64-j) or rand&(m-1). The latter is noted in that other answer. These methods lose no uniformity if all bits of rand are uniform and independent.

If $m$ changes at runtime with $m<2^j$ for some known constant $j$, we can use

((rand>>j)*m)>>(64-j)

However, the $j$ lower bits of rand are lost, and that reduces the uniformity of the outcome (except if $m$ is a power of two).
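A sketch for this runtime-m case, using the known bound m < 2^5 = 32 (so j = 5) since that covers the question's m = 20; the function name and the choice of j are mine:

#include <stdint.h>

// m may vary at runtime, but is known to satisfy m < 2^j with j = 5 here.
// (x >> j) < 2^(64-j) and m < 2^j, so the product fits in 64 bits.
// The low j bits of x are discarded, costing a little uniformity
// unless m is a power of two.
static uint64_t index_runtime_m(uint64_t x, uint64_t m) {
    const unsigned j = 5;
    return ((x >> j) * m) >> (64 - j);
}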


Off-topic:

  • (uint64_t)(floor(rand*(20/(UINT64_MAX+1.)))) would be OK if there were no rounding errors, but because they exist it's hard to tell if it can yield 20 for some input; also, on many compilers it's not optimally uniform.
  • (uint64_t)((rand*(uint128_t)20)>>64) is mathematically correct, and very close to what we evaluate, but uint128_t is an optional and still marginally supported C feature.
  • The question's rand/UINT64_MAX * 20 outputs in $\{0,20\}$ and thus is unfit. The problems are that the division rounds down to an integer, and (independently) that rand can be UINT64_MAX.
  • The question's 20/(UINT64_MAX/rand) outputs in $\{0,1,2,3,4,5,6,10,20\}$ and can cause a division by zero, thus it is unfit. The problems are that the division rounds down to an integer, and (independently) that rand can be 0.
  • The question's code fragment 3 always has i%5 != 4 on output, thus it is unfit. The problem is that the output i is constructed as 10+5+2+1 with some term(s) removed; the small harness sketched below checks this exhaustively.
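That last claim is easy to check exhaustively, since the loop only looks at the low bits of rand and adds a subset of {10, 5, 2, 1}; a small test harness (mine, not part of the answer):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// The question's code fragment 3, wrapped as a function.
static size_t index_tree(uint64_t r) {
    size_t total_bytes = 20;
    size_t mask = 1;
    size_t i = 0;
    while (total_bytes) {
        if (r & mask) i += total_bytes / 2;  // branch right
        else i += 0;                         // branch left
        mask <<= 1;
        total_bytes /= 2;
    }
    return i;
}

int main(void) {
    // The increments are 10, 5, 2, 1 (then 0), so trying the 32 low-bit
    // patterns covers every reachable output.
    for (uint64_t r = 0; r < 32; r++) {
        size_t i = index_tree(r);
        assert(i < 20);
        assert(i % 5 != 4);  // indices 4, 9, 14, 19 are never produced
    }
    return 0;
}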
Gilles 'SO- stop being evil' avatar
cn flag
When optimizing for speed on a typical 64-bit CPU, remainder or division by a constant is compiled to a multiplication by a constant plus some shifts and additions/subtractions. Hardware division is slow and compilers know it (though most will not do the compile-time math for a 64-bit division on a 32-bit CPU). The shifts you propose have about the same number of instructions, but no multiplication and the same number of memory accesses, so your shift method is very likely to be faster on any CPU except some designed for real-time with low-cycle-count mul/div. https://godbolt.org/z/z4PverffY
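For concreteness, a comparison like the one behind that Compiler Explorer link could be between these two functions (a sketch; I am not reproducing the exact code at the link):

#include <stdint.h>

// Remainder by the constant 20: compilers typically lower this to a multiply
// by a "magic" constant plus shifts, rather than a hardware DIV instruction.
uint64_t index20_mod(uint64_t x) {
    return x % 20;
}

// The shift-and-subtract formula from the answer: no multiplication at all.
uint64_t index20_sub(uint64_t x) {
    return (x - ((x - (x >> 2)) >> 1)) >> 59;
}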
fgrieu avatar
ng flag
@Gilles'SO-stopbeingevil' : I failed to find the appropriate info in [that mess](https://software.intel.com/content/dam/develop/external/us/en/documents-tps/325462-sdm-vol-1-2abcd-3abcd.pdf) to confirm that the optimization you mention is still worth it on the latest x64 CPUs. Update: I'm pointed [these](https://www.agner.org/optimize/#manuals) useful resources.
Gilles 'SO- stop being evil' avatar
cn flag
I think you need to find a model-specific manual for that. You linked to the generic architecture reference. The instruction set reference (volume 2) would be more relevant, but even that is only a functional description, it doesn't include cycle counts (which don't tell the complete performance story, but for this simple case there's no branching or parallelism so I think adding cycle counts would result in a meaningful comparison).
caveman avatar
in flag
Would it be worth it to generalise that shifting solution to any number other than 20 in order to achieve fewer cycles than using the `%` approach? Because 20 is not a constant, but a mere example that I chose.
fgrieu avatar
ng flag
@caveman: the answer now clarifies that yes, we can extend to other constants. [This](https://tinyurl.com/unicst) gives formulas for all constants up to 3 decimal digits (but make sure to add 32 to the last shift count). Again, that optimization makes sense only if the `%` operator is slow, and it's not going to be on modern laptop/desktop PCs.
Gilles 'SO- stop being evil' avatar
cn flag
@caveman I'm not an expert, but I think that in terms of performance, the calculations needed to determine the necessary shifts will cost more than one division instruction. However, the shift approach has benefits other than performance, mostly being guaranteed not to have a timing that depends on the secret data.
pe flag
This seems like a more complicated version of the [Lemire](https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/) `(rand() * 20) >> 64` approach.
fgrieu avatar
ng flag
@SamuelNeves: there are differences. (A) The expression `(rand() * 20) >> 64` needs the product evaluated on 69 bits, and that's not possible portably; the linked Lemire trick is with 32-bit `rand()` extended to 64-bit, and hits that wall for 64-bit `rand()`. (B) For some values of `rand()` including 0xCCCCCCCCCCCCCCCC, what I propose differs by one, yet still has an ideally uniform distribution.
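For reference, the 128-bit-product reduction under discussion can be written with the GCC/Clang extension unsigned __int128 (not portable C, as noted in the answer):

#include <stdint.h>

// High 64 bits of the full 128-bit product x*20, i.e. floor(x*20 / 2^64).
// Requires compiler support for unsigned __int128 (GCC or Clang on 64-bit targets).
static uint64_t index20_mul128(uint64_t x) {
    return (uint64_t)(((unsigned __int128)x * 20) >> 64);
}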
Score:3
in flag

Just do % 20

According to http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/, integer division costs 12-44 CPU cycles on a modern CPU (and in some cases fewer, thanks to pipelining when the ALU isn't otherwise busy). Consider that the next thing you want to do is a memory access, which at best will be an L1 read costing 3-4 cycles in itself, and you probably want to do something with this value afterwards.

I can't imagine a scenario where this is worth optimizing even if it is possible to reduce a clock tick or two.

Look for bottlenecks before optimizing.

fgrieu avatar
ng flag
The [image](http://ithare.com/wp-content/uploads/part101_infographics_v08.png) in your useful source states that integer division costs 15-40 cycles. The text cited a reference as giving "cost of 32/64-bit division (known as DIV/IDIV on x86/64) – at between 12-44 cycles". In my experience that's extremely dependent on platform and on the width of the arguments, and my intuition is the 15 or even 12 does not reflect the 2021 bleeding edge. Our (shared) initial intuition that on an x64 CPU `i%20` is fast enough and might be fastest still makes sense.
Meir Maor avatar
in flag
@fgrieu Indeed, I copied the wrong number; I've corrected it. It doesn't change the bottom line: this is fast.
Gilles 'SO- stop being evil' avatar
cn flag
If 20 is a constant and the numbers are no larger than one machine word, `% 20` will typically be optimized to a multiplication, which takes fewer cycles than a division, further reducing the difference. In any case I agree that even a division is negligible compared to memory accesses on any platform with a memory cache (especially if it's a constant-time table lookup that requires many loads). However, for cryptographic applications, it may be undesirable to use division or multiplication because it's common for them to have data-dependent timing.
Meir Maor avatar
in flag
Initially I gave the cycle count for multiplication and then edited it following the comment. Actual micro-optimization like this is tricky and depends on what else is going on, i.e. how well the CPU packs the instructions. Though I think I won't make my answer longer than it is.
Score:1
sk flag

Usually one would strive to make the array size a power of 2. Then the index can be calculated by bitwise AND:

char array[0x40];
uint64_t rand;
...
char c = array[rand & 0x3f];
id flag
That's sort of an "I can solve another problem really fast" answer. Sure, but it's not the question being asked. And in crypto, when the algorithm says to use 20, you don't substitute 32 just because that would be faster. That sort of programming is how you break crypto.
ThomasM avatar
sk flag
As I understood the question, the algorithm is not given but under construction. Otherwise there would likely be a prescribed way to calculate the index from the random number, and one could not try different methods to find the fastest.