Score:1

State recovery algorithm for Xorshift128 given modular outputs


I am researching the Xorshift128 PRNG. I am particularly interested in recovering the state given a set of outputs that have the remainder taken with different values.

A common way to take an unsigned 32-bit output from Xorshift128 and produce a value in the range 0 <= n < 50 is to take the remainder of the output divided by 50.

Say I have been given 25 consecutive outputs which have been reduced by known moduli in the range 30-50. I estimate the entropy leaked by these values at about 133 bits (25*log2(40)), which should be sufficient to recover the internal state.

I am interested in reducing the time complexity of the code/algorithm posted below. The "shifting" in my code can essentially be removed, because the shifts form a 1:1 mapping; the whole loop boils down to just the xor and the mod check. Assuming infinite memory, I could pre-calculate x' and w' and later xor them together in this algorithm.
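That factoring can be checked directly. Here is a minimal C sketch (the names g, h, and f are mine, not from the post) verifying that the loop body splits as f(x, w) = g(x) ^ h(w), so each half can be precomputed independently:

```c
#include <assert.h>
#include <stdint.h>

/* The state update factors into independent halves: f(x, w) = g(x) ^ h(w),
   so g and h can be tabulated once and combined with a single xor. */
uint32_t g(uint32_t x) { uint32_t t = x ^ (x << 11); return t ^ (t >> 8); }
uint32_t h(uint32_t w) { return w ^ (w >> 19); }

uint32_t f(uint32_t x, uint32_t w) {   /* the original loop body */
  uint32_t r = x;
  r ^= r << 11;
  r ^= r >> 8;
  r ^= w;
  r ^= w >> 19;
  return r;
}

int check_split(void) {
  uint32_t v = 1;
  for (int i = 0; i < 100000; i++) {
    uint32_t x = v = v * 1664525u + 1013904223u;  /* arbitrary test values */
    uint32_t w = v = v * 1664525u + 1013904223u;
    if ((g(x) ^ h(w)) != f(x, w)) return 0;
  }
  return 1;
}
```

The split works because the two shift-xor stages only read x and the last two xors only read w.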

Of course, we do not get the benefit of power-of-2 moduli, so all bits influence the output of the modulo operation. I have noticed that for certain sets of input values some bits remain the same, which could be exploited to skip some loops. Normally it has to do with certain combinations of even/odd parameters which result in the last bit being leaked directly or always being the same. Sometimes I see 4 fixed consecutive bits at the end of certain x / w inputs, which could be a major improvement. I have also wondered whether certain outputs have an "every other" pattern, where every other output is odd or even, which would yield another time save.

My complete algorithm, which is structured around this loop, basically forms a massive DAG and slowly culls invalid pathways. This approach becomes a problem because it requires on the order of 9000 TB of memory (haha). I have found that the rough expected number of outputs for this algorithm is (2^64)/(m^3), where m is x_mod. If the sets I build to construct this DAG take 4 bytes per parameter, i.e. 12 bytes per (x, w, r) triplet, then for m = 40 and assuming zero overhead you would need about 3145 TB per iteration - which is not ideal! The algorithm works as intended for m ≈ 300 and produces the correct results, even when the available entropy is barely above 128 bits, in a matter of 30 seconds with about 20 GB of RAM. This was done using a GPU for the heavy lifting and the CPU to arrange the DAG.

Also, I would like to mention that I have tried z3 and other solvers like it, but they do not do well with non-power-of-2 modulo values. I have intentionally left my algorithm un-optimized so as to show it in its simplest state; there are many micro-optimizations possible. I could also very easily run this on a GPU/FPGA to parallelize it heavily, but I am more interested in the meat of the problem than in just throwing hardware at it.

If anyone can point me in the right direction to any problems that resemble this or another community that might be able to help that would be appreciated. Thank you!

My search algorithm that I am looking to reduce the time complexity of:

s: internal states '[ ..., x, y, z, w, r, ... ]'

m: modulos

x: s[n-4], the state from 4 iterations ago

w: s[n-1], the previous state

r: s[n], the current state

/**
 * @param x_offset s[n-4] mod m[n-4]
 * @param x_mod m[n-4]
 * @param w_offset s[n-1] mod m[n-1]
 * @param w_mod m[n-1]
 * @param r_offset s[n] mod m[n]
 * @param r_mod m[n]
 * @return N( f(s[n-4], s[n-1]) = s[n] mod m[n] )
 */
size_t xorshift128_mod_count(
  uint32_t x_offset, uint32_t x_mod,
  uint32_t w_offset, uint32_t w_mod,
  uint32_t r_offset, uint32_t r_mod
) {
  // Simply the number of possible unique x / w that satisfies s[..] mod m[..]
  uint32_t x_iterations = (uint32_t) ( ((1ULL<<32) - x_offset + x_mod - 1) / x_mod );
  uint32_t w_iterations = (uint32_t) ( ((1ULL<<32) - w_offset + w_mod - 1) / w_mod );
  size_t result = 0;
  for (uint32_t x_i = 0; x_i != x_iterations; x_i++) {
    /** s[n-4] */ uint32_t x = x_offset + x_mod * x_i;
    for (uint32_t w_i = 0; w_i != w_iterations; w_i++) {
      /** s[n-1] */ uint32_t w = w_offset + w_mod * w_i;
      /** s[n] */ uint32_t r = x;
      r ^= r << 11;
      r ^= r >>  8;
      r ^= w;
      r ^= w >> 19;
      /// s[n] mod m[n]
      if ((r % r_mod) == r_offset) {
        // Rather than counting, one could perform some operation on the resulting values here.
        result++;
      }
    }
  }
  return result;
}

Xorshift128 algorithm:

/// Algorithm "xor128" from p. 5 of Marsaglia, "Xorshift RNGs"
uint32_t xorshift128(uint32_t* x, uint32_t* y, uint32_t* z, uint32_t* w) {
  /// Perform a contrived 32-bit shift.
  uint32_t r = *x;
  r ^=  r << 11;
  r ^=  r >>  8;
  r ^= *w;
  r ^= *w >> 19;
  *x = *y;
  *y = *z;
  *z = *w;
  *w =  r;
  return r;
}

One could "tear out" the actual code that produces the 'r' value:

uint32_t xorshift128_next(uint32_t x, uint32_t w) {
  uint32_t r = x;
  r ^= r << 11;
  r ^= r >>  8;
  r ^= w;
  r ^= w >> 19;
  return r;
}

You only need to do this to get your intended output:

uint32_t r_mod_m = xorshift128_next(x, w) % m;
Richard Thiessen:
Outputs whose modulus includes a factor of `2^n` leak `n` bits of the output directly. On average a randomly chosen modulus will leak 1 bit. If you can get 128 such leaked bits, the problem becomes completely linear and solving it is trivial.
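A quick sketch of why this holds: v = q*m + (v % m), and if 2^n divides m then q*m contributes nothing to the low n bits, so v % m carries them verbatim. A small self-contained check (the test values and moduli below are arbitrary, my own illustration):

```c
#include <assert.h>
#include <stdint.h>

/* If 2^n divides m, then v % m determines the low n bits of v:
   v = q*m + (v % m), and q*m is a multiple of 2^n. */
int leak_holds(uint32_t v, uint32_t m, int n) {
  uint32_t low = (1u << n) - 1;
  return ((v % m) & low) == (v & low);
}

int check_leak(void) {
  uint32_t v = 0x9E3779B9u;
  for (int i = 0; i < 100000; i++) {
    v = v * 1664525u + 1013904223u;        /* arbitrary test values */
    if (!leak_holds(v, 38, 1)) return 0;   /* 38 = 2 * 19 leaks 1 bit  */
    if (!leak_holds(v, 44, 2)) return 0;   /* 44 = 4 * 11 leaks 2 bits */
    if (!leak_holds(v, 48, 4)) return 0;   /* 48 = 16 * 3 leaks 4 bits */
  }
  return 1;
}
```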
joemelsha:
Thank you for the feedback! Unfortunately, due to the restrictions of the problem I get at most 40 outputs, and it's very difficult to get even that many - 25-30 is significantly more feasible. I wish I had that many consecutive outputs though! I could retrieve sets of 25 or so consecutive outputs with some work, but they would potentially be separated by millions of rounds.
Command Master:
@joemelsha They don't have to be consecutive for the problem to remain linear
joemelsha:
What if I don't know *how* consecutive they are? Like, I have 25 consecutive ones, split up by 1-100M rounds, then 25 consecutive, 1-100M rounds, etc.
Richard Thiessen:
The problem is nonlinear in the shift distance between values.
D.W.:
Is the modulus the same for all 25 outputs? Are the moduli random? Uniformly distributed at random over this range? Are you sure you can't get any other outputs?
joemelsha:
Modulus set is for example: 50, 49, 48, 47, ... (in order)
Score:1

This is a hard problem. Your solution is clever and possibly the best option. You may have to bite the bullet and use quite a lot of memory (although sorted data structures + an SSD might work).

Here are some approaches I've tried.

Algebraic analysis

The Xorshift128 PRNG is defined by a recurrence relation o[i] = f(o[i-1], o[i-4]), where f is the map xorshift128_next(x, w) -> r.

f is linear, which means we can take the current 128-bit state as a bit vector S and express any future or past bit as the dot product S·C[i] of the current state with some vector C[i] (i.e. all bits are just linear combinations of the current state bits). Likewise, we can stack C[i] rows into a matrix to compute a full output word. S·C[0:128] = S, because the linear combination of bits required to find the state vector from the state vector itself is the identity matrix.

If we have 128 total bits from any output words, we can set up a system of equations and solve for the state vector.
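As an illustration of this (my own sketch, not code from the answer; it assumes the GCC/Clang builtin `__builtin_popcountll`), one can propagate the C[i] masks symbolically through the recurrence and confirm that every output bit really is the stated dot product S·C[i]:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit GF(2) mask over the initial state S: mask bit k set means
   initial-state bit k participates in the xor for this output bit. */
typedef struct { uint64_t lo, hi; } mask_t;

static mask_t mxor(mask_t a, mask_t b) {
  mask_t r = { a.lo ^ b.lo, a.hi ^ b.hi };
  return r;
}

/* Symbolic state: 4 words (x, y, z, w) of 32 bits, each bit a C[i] vector. */
typedef struct { mask_t bit[4][32]; } sym_t;

static void sym_init(sym_t* s) {
  memset(s, 0, sizeof *s);
  for (int word = 0; word < 4; word++)
    for (int b = 0; b < 32; b++) {           /* identity: S.C[0:128] = S */
      int k = word * 32 + b;
      if (k < 64) s->bit[word][b].lo = 1ULL << k;
      else        s->bit[word][b].hi = 1ULL << (k - 64);
    }
}

/* One symbolic step, mirroring xorshift128 on masks instead of bits. */
static void sym_step(sym_t* s) {
  mask_t t[32], r[32];
  for (int i = 0; i < 32; i++)               /* t = x ^ (x << 11) */
    t[i] = i >= 11 ? mxor(s->bit[0][i], s->bit[0][i - 11]) : s->bit[0][i];
  for (int i = 0; i < 32; i++) {
    r[i] = i < 24 ? mxor(t[i], t[i + 8]) : t[i];      /* r = t ^ (t >> 8) */
    r[i] = mxor(r[i], s->bit[3][i]);                  /* r ^= w */
    if (i < 13) r[i] = mxor(r[i], s->bit[3][i + 19]); /* r ^= w >> 19 */
  }
  memmove(s->bit[0], s->bit[1], 3 * sizeof s->bit[0]); /* x=y; y=z; z=w */
  memcpy(s->bit[3], r, sizeof r);                      /* w = r */
}

/* Dot product S . C[i] over GF(2). */
static int mask_eval(mask_t m, uint64_t s_lo, uint64_t s_hi) {
  return (__builtin_popcountll(m.lo & s_lo) ^
          __builtin_popcountll(m.hi & s_hi)) & 1;
}

/* Every bit of every output word matches its predicted linear combination. */
int check_linearity(void) {
  uint32_t x = 123456789, y = 362436069, z = 521288629, w = 88675123;
  uint64_t s_lo = (uint64_t)x | ((uint64_t)y << 32);
  uint64_t s_hi = (uint64_t)z | ((uint64_t)w << 32);
  sym_t sym;
  sym_init(&sym);
  for (int step = 0; step < 100; step++) {
    uint32_t r = x;
    r ^= r << 11; r ^= r >> 8; r ^= w; r ^= w >> 19;
    x = y; y = z; z = w; w = r;
    sym_step(&sym);
    for (int b = 0; b < 32; b++)
      if ((int)((r >> b) & 1) != mask_eval(sym.bit[3][b], s_lo, s_hi))
        return 0;
  }
  return 1;
}
```

Collecting 128 linearly independent mask rows and Gaussian-eliminating over GF(2) then recovers S, which is the "completely linear and trivial" case described above.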

Not so productive analysis

Remainder constraints can be rewritten in two ways. First, they can be broken up into prime factors. Second, the bits of each output word can be viewed as multiplying values in GF(3), GF(5), GF(some_prime), whose sum for that output word must equal some total value. This is obviously not linear, and I have no idea how to solve it. In theory you can rewrite these as higher-degree boolean expressions, but that is not going to help. The usual algebraic-cryptanalysis answer would be to get more data and add more unknowns for all the nonlinear terms, but we can't get more data, so... *(shrugs)*. It's not surprising the solver didn't give you anything.

Folding in more constraints for a 96 bit search

Here's a completely different algorithm, perfect for an FPGA implementation since it's just a very big search. Easy to parallelise.

The recurrence relation o[i]=f(o[i-1],o[i-4]) is suggestive. If you can get, say, 5 of the larger modulus values into the search, pre-filtering cuts it down to 96 - log2(50*47*46*44*43) ≈ 68.4 bits. Not too many bits, I think.

For each valid o[3], find the set of valid o[4] = f(o[3], o[0]) and o[7] = f(o[6], o[3]) values. The product of both sets over all valid o[3]'s has size 2^96/m^5, which is big but not impossible to search, especially since you can just check the modulus of o[8] = f(o[7], o[4]) to get an additional 1/m of filtering, leaving a 2^96/m^6 result set. That should be about 63 bits.

There are 32 bits left to find, part of which can be calculated from the linear equations implied by the ~25-30 bits that leak directly. Getting the remaining bits from one of the moduli (e.g. 31) should be pretty easy. Then check against a few more outputs for confirmation.

It's just a little past what's practical. The ~2^69 search is very parallelisable since f(a,b) = g(a) ^ h(b): compute lists of g(a[i]) and h(b[j]) and check the grid of xored results. This might be doable on a big FPGA.

Data structure optimisations and SSD instead of DRAM

The DAG approach is clever. Really clever. However, 12 bytes isn't needed.

The f() function is also invertible. This means you can drop x from the (x, w, r) triples and recalculate it on the fly when going backwards.
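A minimal sketch of that inversion (helper names are mine): each xor-shift stage is a bijection on 32-bit words, so x can be recomputed from (r, w) alone:

```c
#include <assert.h>
#include <stdint.h>

uint32_t xorshift128_next(uint32_t x, uint32_t w) {
  uint32_t r = x;
  r ^= r << 11;
  r ^= r >> 8;
  r ^= w;
  r ^= w >> 19;
  return r;
}

/* Recover x from (r, w). Each xor-shift stage is invertible:
   y = t ^ (t >> 8)  has inverse  t = y ^ (y >> 8) ^ (y >> 16) ^ (y >> 24),
   y = t ^ (t << 11) has inverse  t = y ^ (y << 11) ^ (y << 22). */
uint32_t xorshift128_prev_x(uint32_t r, uint32_t w) {
  uint32_t t = r ^ w ^ (w >> 19);           /* undo the two w xors */
  t ^= (t >> 8) ^ (t >> 16) ^ (t >> 24);    /* undo r ^= r >> 8 */
  t ^= (t << 11) ^ (t << 22);               /* undo r ^= r << 11 */
  return t;
}

int check_inverse(void) {
  uint32_t v = 1;
  for (int i = 0; i < 100000; i++) {
    uint32_t x = v = v * 1664525u + 1013904223u;  /* arbitrary test values */
    uint32_t w = v = v * 1664525u + 1013904223u;
    if (xorshift128_prev_x(xorshift128_next(x, w), w) != x) return 0;
  }
  return 1;
}
```

The inverse formulas work because shifting by 8 (or 11) three (or two) more times pushes every leftover term past the 32-bit word boundary, where it vanishes.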

Assuming you plan on traversing the DAG links backwards, you'll want a hash table. Fortunately, you don't need the hash or the table: just use a list sorted on the key (r, x). Since the entries are sorted, store them in prefix-implied buckets with the shared prefix lopped off all elements in a bucket. That should get you down to ~2-3 bytes per element.

Generating this data structure will take a little time and perhaps an extra round trip to disk. Still, you'll be traversing it more than once, so reducing TB read makes sense. It might be worthwhile to tweak your loops and calculations so (r, x) pairs are generated in order, saving the trouble of sorting. Removing invalid paths may require generating (r, x) pairs directly from a previous (r', x') pair list and sorting them into the new order.

Finally, you can lop off a few extra bits by storing the quotient x // m[x] for the given modulus instead of x itself, since the remainder is already known. You could also store the pair as r*m[r] + x to save some fractional bits.
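A small sketch of both tricks (the function names are my own): store quotients rather than raw 32-bit values, and pack a pair of quotients in mixed radix:

```c
#include <assert.h>
#include <stdint.h>

/* x satisfies x % m == offset, so only its quotient carries new information;
   the quotient needs ~(32 - log2(m)) bits instead of the full 32. */
uint32_t quot_encode(uint32_t x, uint32_t m, uint32_t offset) {
  return (x - offset) / m;
}
uint32_t quot_decode(uint32_t q, uint32_t m, uint32_t offset) {
  return q * m + offset;
}

/* Pack a pair of quotients in mixed radix, in the spirit of r*m[r] + x;
   q_x_range is the number of possible q_x values. */
uint64_t pair_encode(uint64_t q_r, uint64_t q_x, uint64_t q_x_range) {
  return q_r * q_x_range + q_x;
}
void pair_decode(uint64_t packed, uint64_t q_x_range,
                 uint64_t* q_r, uint64_t* q_x) {
  *q_r = packed / q_x_range;
  *q_x = packed % q_x_range;
}
```

For m ≈ 40 each quotient fits in about 27 bits, so a packed pair fits comfortably in 7 bytes, and the shared-prefix bucketing above shaves it further.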

Once you have the data structure, a bitmap should be enough to track removals. Alternatively, stream from disk and write back with some entries removed.

joemelsha:
This is really great information. I did some quick back-of-the-napkin math assuming 1 bit of information per (x, w) pair (or (r, x) if going in reverse), and it comes to about 14 TB, which is feasible. As I traverse I can free up nodes as well. I have been considering representations requiring 2-3 bytes, as you suggest. This is very helpful. I have also thought of ways of speeding up the main loop on a GPU by using reciprocal / float math, which runs 2-8x faster and shouldn't produce any artifacts.
joemelsha:
I have gone back to the drawing board on the algorithm's memory storage. I am trying to achieve this without very expensive hardware. I think there might be a way to store the sets that requires only the square root of the needed memory. Still looking into that!
Score:0

I don't know how to solve this with only 25-40 outputs. But if you can obtain 80-100 outputs, this is doable. Let $S$ denote the initial 128-bit state before the first of those outputs.

You can obtain, on average, about one linear equation on the bits of $S$ per output. If we can observe 100 outputs, this gives about 100 linear equations over 128 unknowns. Thus you could narrow $S$ down to one of $2^{28}$ possibilities (in particular, it must lie in a known affine subspace of dimension 28). So, now you can do exhaustive search over all $2^{28}$ of these possibilities and test which ones are consistent with the observed outputs.

The outputs need not be consecutive, as long as they are at known indices in the stream.

I should explain why you get about one linear equation per output:

a. If the modulus is a multiple of 2 but not a multiple of 4, then the least significant bit of the output reveals one bit of the internal state, and this is a linear function of $S$.

b. If the modulus is a multiple of 4 but not a multiple of 8, then the two least significant bits of the output reveal two bits of the internal state, and these are linear functions of $S$.

c. If the modulus is a multiple of 8 but not a multiple of 16, then the three least significant bits of the output reveal three bits of the internal state, and these are linear functions of $S$.

And so on. Case a happens for 1/4 of the outputs, case b for 1/8 of the outputs, case c for 1/16 of the outputs, and so on, so on average the number of linear equations obtained per output is

$$1 \times \frac14 + 2 \times \frac18 + 3 \times \frac{1}{16} + \dots = 1.$$
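A quick empirical check of that series (my own sketch): the number of bits leaked by a modulus m is the exponent of the largest power of two dividing m, and its mean over uniformly chosen moduli converges to 1:

```c
#include <assert.h>
#include <stdint.h>

/* v2(m): exponent of the largest power of two dividing m, i.e. how many
   low output bits a modulus m leaks. */
static int v2(uint32_t m) {
  int k = 0;
  while ((m & 1u) == 0) { m >>= 1; k++; }
  return k;
}

/* Average leaked bits over moduli 1..n; the series
   1*(1/4) + 2*(1/8) + 3*(1/16) + ... predicts this tends to 1. */
double avg_leak(uint32_t n) {
  uint64_t total = 0;
  for (uint32_t m = 1; m <= n; m++) total += v2(m);
  return (double)total / n;
}
```

For n = 2^20 the average is (2^20 - 1)/2^20, already within one millionth of the predicted limit.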

joemelsha:
I have devised an algorithm which exploits the full entropy of each output. I will post the code in a bit; I need to clean it up and remove the parallel aspects. It works by initializing N sets, one for each output with its known modulus, each containing every possible state at that position, denoted S. Take x = S[0], w = S[3] and calculate all r = S'[4] values that can arise from each combination. Remove values that are not in our previously calculated S[4] and update S[4] to be that new set. Step this ahead one position at a time. Repeat this pass multiple times until nothing can be updated in any set.
joemelsha:
Then the crucial aspect needed at lower entropy per value is ensuring that an (x, y, z, w) combination is actually possible as you iterate (you start doing this at n=9). You can look at what is possible based on how the values were calculated and just expand their formula. You can memoize this as you calculate each iteration into sets of x and w per r. This is essentially a cheap way of creating a DAG.
joemelsha:
The complication is that for very low values of N it requires insane amounts of memory, although the algorithm runs extremely fast (relatively speaking - you might be looking at an entire day on an RTX 4090 for M=50, assuming infinite memory). But that's better than iterating all 2^128 states haha.
joemelsha:
Also, suppose I have 25 consecutive outputs, separated by an unknown gap, then 25 more consecutive outputs, and so on (repeated 10 times). Now I have 250 outputs, where every run of 25 is consecutive but separated by an unknown gap. Is this feasible given your solution?
D.W.:
@joemelsha, potentially. You might be able to guess the length of the gap. What is the distribution on the length of the gap?
joemelsha:
It is relatively constant but varies with time. For instance, 5 seconds could equate to 1 million rounds, 10 seconds to 2 million. But that varies depending on load, so if the underlying application is under heavy use it could be different. The distribution is even with respect to time. If you know the time gap precisely and the rate of use, the randomizing factor would be the latencies involved. This is mostly speculation, as I have not characterized any of this; it is just an assumption.