I am researching the Xorshift128 PRNG. I am particularly interested in recovering the internal state from a set of outputs that have each been reduced modulo a different (known) value.

A common way to turn an unsigned 32-bit output from Xorshift128 into a value n in the range 0<=n<50 is to take the remainder of the output modulo 50.

Say I have been given 25 consecutive outputs, each reduced by a known modulus in the range 30-50. I am assuming that the amount of entropy leaked by these values is about 133 bits (25*log2(40)). This should be sufficient to recover the 128-bit internal state.
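To make that estimate easy to recompute for a specific list of moduli, here is a minimal sketch that sums log2(m) over them. The name `leaked_bits` is only illustrative, and the figure is an upper bound that assumes each remainder is close to uniformly distributed.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Upper bound on the information (in bits) carried by one remainder per
// modulus, assuming each remainder is roughly uniform over [0, m).
double leaked_bits(const std::vector<uint32_t>& moduli) {
    double bits = 0.0;
    for (uint32_t m : moduli) {
        bits += std::log2(static_cast<double>(m));
    }
    return bits;
}
```

With 25 moduli of 40 this gives about 133 bits, but if all 25 moduli were 30 the bound would fall to about 122.7 bits, which is below the 128 bits of state, so the mix of moduli matters.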

**I am interested in reducing the time complexity of the code/algorithm posted below.** The "shifting" in my code can pretty much be factored out, because each shift-xor step is a 1:1 mapping (a bijection) on 32-bit values, so the whole loop boils down to just the xor and the mod check. This assumes I have infinite memory and can pre-calculate x' and w' (the shifted forms of x and w), and later xor them together in this algorithm.
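To illustrate the split into x' and w', here is a sketch with the two halves of the update pulled apart, together with their inverses to show that each half really is 1:1. The helper names (`mix_x`, `mix_w`, `undo_xor_rshift`, and so on) are made up for this example.

```cpp
#include <cstdint>

// x-dependent half of the update: x' = mix_x(x).
uint32_t mix_x(uint32_t x) {
    x ^= x << 11;
    x ^= x >> 8;
    return x;
}

// w-dependent half of the update: w' = w ^ (w >> 19).
uint32_t mix_w(uint32_t w) {
    return w ^ (w >> 19);
}

// Inverse of y = v ^ (v >> s) for 0 < s < 32:
// v = y ^ (y >> s) ^ (y >> 2s) ^ ... until the shift reaches 32 bits.
uint32_t undo_xor_rshift(uint32_t y, unsigned s) {
    uint32_t v = y;
    for (unsigned i = s; i < 32; i += s) v ^= y >> i;
    return v;
}

// Inverse of y = v ^ (v << s) for 0 < s < 32, by the same argument.
uint32_t undo_xor_lshift(uint32_t y, unsigned s) {
    uint32_t v = y;
    for (unsigned i = s; i < 32; i += s) v ^= y << i;
    return v;
}

// Recover x from x' by undoing the two steps of mix_x in reverse order.
uint32_t unmix_x(uint32_t xp) {
    return undo_xor_lshift(undo_xor_rshift(xp, 8), 11);
}

// Recover w from w'.
uint32_t unmix_w(uint32_t wp) {
    return undo_xor_rshift(wp, 19);
}
```

The next state is then `mix_x(x) ^ mix_w(w)`. Note that the residue constraints still apply to x and w themselves, not to x' and w', so a precomputed table would hold `mix_x` of every candidate x in its residue class rather than a residue class of x'.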

Of course, we do not get the benefit of having power-of-2 moduli, so all bits influence the result of the modulo. I have noticed that with certain sets of input values some bits remain the same, which could be exploited to skip some loops. Normally it has to do with certain combinations of even/odd parameters, which result in the last bit being leaked directly or always being the same. Sometimes I see 4 consecutive fixed bits at the end of certain x / w inputs, which could be a major improvement. I have also wondered whether certain outputs have an "every other" situation, where every other output is odd or even, which would result in another time save.
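One likely explanation for the fixed low bits: whenever 2^k divides the modulus m, reducing mod m preserves the value mod 2^k, so (r mod m) mod 2^k = r mod 2^k and the low k bits of the output are leaked directly. For m = 48 = 16 * 3 that is exactly 4 bits. The sketch below computes this; the function names are illustrative only.

```cpp
#include <cstdint>

// Number of low bits of r revealed by r % m: the largest k with 2^k | m.
unsigned leaked_low_bit_count(uint32_t m) {
    unsigned k = 0;
    while (m != 0 && (m & 1u) == 0) {
        m >>= 1;
        ++k;
    }
    return k;
}

// The value of those low bits of r, read straight off the observed remainder.
uint32_t leaked_low_bits(uint32_t remainder, uint32_t m) {
    uint32_t mask = (1u << leaked_low_bit_count(m)) - 1u;
    return remainder & mask;
}
```

Because the Xorshift128 update is linear over GF(2), every leaked output bit is a linear (xor) equation on the state bits, so these bits could in principle be used to prune candidates before any modular check is done. Odd moduli such as 45 leak no bits this way.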

My complete algorithm, which is structured around this loop, basically forms a massive DAG and slowly culls invalid pathways. This approach becomes a problem because it results in the need for 9000 TB of memory (haha). I have found that the rough expected number of outputs of this algorithm is *(2^64)/(m^3)*, where m is x_mod (taking all three moduli to be about equal). If you assume the sets I am building to construct this DAG take up 4 bytes for each parameter, making 12 bytes per (x,w,r) triplet, then for m=40 and assuming 0 overhead you would need 3145TB per iteration, which is not ideal! This algorithm works as intended for m=~300 and produces the correct results, even when the available entropy is 128.0001 bits, in a matter of 30 seconds with about 20GB of RAM. This was done using a GPU to do the hard work and the CPU to arrange the DAG.
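For reference, the arithmetic behind the per-iteration figure can be reproduced with the sketch below. It assumes all three moduli equal m, uniformly behaving remainders, and no container overhead; the function names are hypothetical. The 3145 figure comes out when the result is expressed in binary terabytes (TiB, 2^40 bytes).

```cpp
#include <cmath>

// Expected number of (x, w, r) triplets for one step: 2^64 / m^3.
double expected_triplets(double m) {
    return std::ldexp(1.0, 64) / (m * m * m);
}

// Storage for those triplets in TiB (2^40 bytes), with zero overhead.
double triplet_storage_tib(double m, double bytes_per_triplet = 12.0) {
    return expected_triplets(m) * bytes_per_triplet / std::ldexp(1.0, 40);
}
```

For m = 40 this gives roughly 2.9e14 triplets and about 3145.7 TiB.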

Also I would like to mention that I have tried using z3 and other solvers like it, but they do not do well when it comes to non-power-of-2 moduli. I have intentionally left my algorithm "un-optimized" so as to show its simplest form. There are many micro-optimizations possible. I could also very easily run this on a GPU/FPGA to highly parallelize it, but I am more interested in the meat of the problem than in just throwing hardware at it.

If anyone can point me in the right direction to any problems that resemble this or another community that might be able to help that would be appreciated. Thank you!

**My search algorithm that I am looking to reduce the time complexity of:**

s: the sequence of internal states '[ ..., x, y, z, w, r, ... ]'

m: the moduli, where m[n] is the modulus applied to s[n]

x: s[n-4], the state from 4 iterations ago

w: s[n-1], the previous state

r: s[n], the current state

```
#include <cstddef>
#include <cstdint>

/**
 * Counts the (s[n-4], s[n-1]) pairs that are consistent with all three
 * observed remainders.
 *
 * @param x_offset s[n-4] mod m[n-4]
 * @param x_mod m[n-4]
 * @param w_offset s[n-1] mod m[n-1]
 * @param w_mod m[n-1]
 * @param r_offset s[n] mod m[n]
 * @param r_mod m[n]
 * @return N( f(s[n-4], s[n-1]) = s[n] mod m[n] )
 */
size_t xorshift128_mod_count(
    uint32_t x_offset, uint32_t x_mod,
    uint32_t w_offset, uint32_t w_mod,
    uint32_t r_offset, uint32_t r_mod
) {
    // Simply the number of possible unique x / w that satisfy s[..] mod m[..].
    // (The cast would wrap to 0 for a modulus of 1; moduli here are 30-50.)
    uint32_t x_iterations = (uint32_t) ( ((1ULL<<32) - x_offset + x_mod - 1) / x_mod );
    uint32_t w_iterations = (uint32_t) ( ((1ULL<<32) - w_offset + w_mod - 1) / w_mod );
    size_t result = 0;
    for (uint32_t x_i = 0; x_i != x_iterations; x_i++) {
        /** s[n-4] */ uint32_t x = x_offset + x_mod * x_i;
        for (uint32_t w_i = 0; w_i != w_iterations; w_i++) {
            /** s[n-1] */ uint32_t w = w_offset + w_mod * w_i;
            /** s[n] */ uint32_t r = x;
            r ^= r << 11;
            r ^= r >> 8;
            r ^= w;
            r ^= w >> 19;
            /// s[n] mod m[n]
            if ((r % r_mod) == r_offset) {
                // Rather than just counting, one could perform some
                // operation on the matching (x, w, r) values here.
                result++;
            }
        }
    }
    return result;
}
```

**Xorshift128 algorithm:**

```
/// Algorithm "xor128" from p. 5 of Marsaglia, "Xorshift RNGs"
uint32_t xorshift128(uint32_t* x, uint32_t* y, uint32_t* z, uint32_t* w) {
    /// Perform a contrived 32-bit shift.
    uint32_t r = *x;
    r ^= r << 11;
    r ^= r >> 8;
    r ^= *w;
    r ^= *w >> 19;
    *x = *y;
    *y = *z;
    *z = *w;
    *w = r;
    return r;
}
```

**One could "tear out" the actual code that produces the 'r' value:**

```
uint32_t xorshift128_next(uint32_t x, uint32_t w) {
    uint32_t r = x;
    r ^= r << 11;
    r ^= r >> 8;
    r ^= w;
    r ^= w >> 19;
    return r;
}
```

**You only need to do this to get your intended output:**

```
uint32_t r_mod_m = xorshift128_next(x, w) % m;
```