Score:2

Good entropy from entropy test (SP 800-90B) but still fails NIST 800-22


I designed my TRNG on an FPGA. It has good entropy performance, with values of 0.99x over several test runs. But for NIST 800-22, over several runs, sometimes my sequence passes all the tests and sometimes it fails one or two. (My sequence length is 100,000,000 bits, with 100 runs for each test.) Checking the failed results, my sequence mostly fails the Non-overlapping Template Matching test (95/100). Can anyone help me explain how to improve the randomness test results, since my TRNG already has good entropy performance?

Score:5

The question says that "95/100" of the failed tests are Non-overlapping tests. I previously pointed out that this is a significant statistical anomaly†, and that a good generator could not have been correctly tested. Following these comments, I'm no longer sure of that: they make me wonder what proportion of the tests run are Non-overlapping tests!


How to improve the randomness test result?

The simplest route towards THAT goal (NOT towards the more righteous goal of improving the generator) is to add post-conditioning with an LFSR-based scrambler. Below is an example with polynomial $x^4+x+1$ (or $x^4+x^3+1$, depending on convention).

LFSR conditioning

Such a scrambler improves the results of most predefined statistical tests (and demonstrably can't weaken an already good TRNG). I think that for an otherwise passable generator, the excess probability of failure of the Non-Overlapping Template Matching Test (and of many other tests, with the Linear Complexity Test a likely outlier) decreases exponentially with the degree of the polynomial used (it must have a constant term, and it's better if it has some other coefficients). But that does not necessarily make the TRNG more suitable for cryptographic use; see why here.
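As an illustration only, here is a Python model of such a multiplicative (self-synchronizing) scrambler, using one common tap convention for $x^4+x+1$; in hardware this is just four flip-flops and a couple of XOR gates:

```python
def scramble(bits, taps=(1, 4)):
    """Multiplicative (self-synchronizing) LFSR scrambler.

    With taps (1, 4) -- one convention for x^4 + x + 1 -- each output
    bit is the input bit XORed with the scrambler's own earlier
    output: y[n] = x[n] ^ y[n-1] ^ y[n-4].
    """
    state = [0] * max(taps)          # delay line of past output bits
    out = []
    for x in bits:
        y = x
        for t in taps:
            y ^= state[t - 1]
        state = [y] + state[:-1]     # shift the new output bit in
        out.append(y)
    return out

def descramble(bits, taps=(1, 4)):
    """Inverse filter: x[n] = y[n] ^ y[n-1] ^ y[n-4]."""
    state = [0] * max(taps)          # delay line of received bits
    out = []
    for y in bits:
        x = y
        for t in taps:
            x ^= state[t - 1]
        state = [y] + state[:-1]
        out.append(x)
    return out
```

Note that the transformation is invertible (descrambling recovers the input exactly), which is why it cannot destroy entropy already present in the source.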

The fact that this solution is so easy and effective should be enough to convince anyone that trusting a TRNG on the sole basis that it passes NIST800-22 (or any other black-box test) is foolish.

A much better option from the standpoint of cryptographic security is to use a long LFSR scrambler, discard its initial output, and decimate its output (e.g. keep one bit out of two). An even better option is to use a CSPRNG instead of the LFSR for the conditioning.
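To illustrate cryptographic conditioning (a sketch only, not a vetted design; real designs should use the vetted conditioning components of SP 800-90B), here is hash-based compression of raw source bytes at a 2:1 ratio with SHA-256:

```python
import hashlib

def condition(raw: bytes, block: int = 64) -> bytes:
    """Condition raw TRNG bytes by hashing fixed-size input blocks.

    Each 64-byte raw block is compressed to 32 output bytes, so the
    output rate is half the input rate; the entropy per output bit
    goes up, assuming the raw source carries more than 0.5 bit of
    entropy per bit. Any trailing partial block is discarded.
    """
    out = bytearray()
    for i in range(0, len(raw) - block + 1, block):
        out += hashlib.sha256(raw[i:i + block]).digest()
    return bytes(out)
```

The 2:1 ratio is an assumption for the example; the right compression ratio depends on the assessed min-entropy of the raw source.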


A serious problem with NIST800-22 and other statistical tests for TRNGs is that they are so stringent that very few unconditioned sources can pass them. It's thus tempting to heavily post-condition sources, but then the test loses significance unless the cryptographic soundness of the post-conditioning is taken into account, which tests based on the output alone cannot do.

Independently, NIST800-22 does not test critical aspects of TRNGs:

  • Tendency to output the same thing at each cold reset.
  • Sensitivity to external conditions, like temperature, power supply voltage, adversarially applied stimulation (power variation, EMI) to get some internals in sync.
  • Presence and effectiveness of fail-safe mechanisms, such that the TRNG signals its failure rather than produce unsafe output.

In summary, no observation based on NIST800-22 (or any other statistical test) can ever be enough to conclude that a TRNG can be safely used in a cryptographic context (some knowledge of the TRNG design is required, as well as knowledge of how the samples tested have been obtained). The best we can hope to conclude is that the TRNG doesn't require improvement to pass tests of randomness.


† At least $14$ tests failed (because $12/13 \approx 0.923$ would have been rounded to 90/100); and at least $92.5\%$ of the failed tests are Non-overlapping tests. There are $15$ kinds of tests. By design, they have approximately the same failure rate of $1\%$ for truly random data. If $i\ge8$ kinds have been implemented, and an equal number of tests of each kind was run, there is an anomaly: the probability that among at least $14$ independent draws uniformly among $i$ values, at least $92.5\%$ are the same, is maximal for $14$ draws, in which case that probability is $p=i^{-11}(1+i^{-2})$. If the tests are run on the same data (which is common), NIST800-22 test failures are correlated, and that makes $p$ even lower. We thus conclude with overwhelming confidence (probability of the contrary less than one in billions for $i=8$) that a good TRNG was not correctly tested.
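For a sanity check on the order of magnitude, here is a direct computation of that tail probability using my own combinatorial sum (not the closed form above, which may use a different approximation): the chance that at least $13$ of $14$ i.i.d. uniform draws over $i$ values all coincide.

```python
from math import comb

def p_coincide(i: int, n: int = 14, k: int = 13) -> float:
    """P(at least k of n i.i.d. uniform draws over i values land on
    the same value). Since k > n/2, the majority value is unique, so
    we may sum over which of the i values it is."""
    total = 0.0
    for j in range(k, n + 1):    # j = draws hitting the common value
        total += i * comb(n, j) * (1 / i) ** j * (1 - 1 / i) ** (n - j)
    return total
```

For $i=8$ this gives roughly $2\times10^{-10}$, consistent with the "less than one in billions" claim.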

Paul Uszak:
Re. _"It's thus tempting to heavily post-condition sources"_ . Isn't that exactly what you're suggesting though by _"scrambling"_? That's why RRR doesn't advocate post processing at all.
Paul Uszak:
There are 161 tests in the suite (section 4.4)
fgrieu:
@Paul Uszak: I'm embarrassed! If indeed there are 161 tests, rather than 15 as the number of subsections in section 2 and of entries in the [STATISTICAL TESTS menu](https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-22r1a.pdf#page=100), and if 147 (or so) of them share the name "Non-overlapping test", then my conclusion is entirely wrong! Yes, I recommend post-conditioning with an LFSR scrambler towards the goal of passing the test, which is what's asked. My actual recommendation is to have a different goal entirely: making a sound generator. I'll clarify this.
Paul Uszak:
There are 188 lines of test outputs in my version (pretty recent). That's 188 actual test runs if you include the variants, block sizes, bit positions etc. Plus, out of interest, the run I just did over /dev/urandom failed on `RandomExcursionsVariant`.
Score:2

Welcome, and that's good :-) Ring oscillator perchance?

800-22 uses a critical value of $\alpha = 0.01$. That means one in a hundred tests will fail on average, even with the most randomnest of generators. That's just because randomness is pesky and stochastic.

It's also not surprising to get at least one or two failures given the critical value in combination with the large number of tests in the suite. There's probably over a hundred tests in there when you consider their variants and multiple parameter settings. $0.01 \times (100+) > 1$. That can also lead to correlation between the tests, which is further addressed here.
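To put numbers on that, treating the roughly 188 test runs as independent (they aren't exactly, since they run on the same data, so this is only a rough model):

```python
def expected_failures(n_tests: int, alpha: float = 0.01) -> float:
    """Expected number of failed tests on ideal random data."""
    return n_tests * alpha

def p_at_least_one_failure(n_tests: int, alpha: float = 0.01) -> float:
    """P(at least one test fails) if each of n_tests independent
    tests fails with probability alpha even on perfect input."""
    return 1 - (1 - alpha) ** n_tests
```

With 188 runs at $\alpha = 0.01$ you expect about 1.9 failures per pass through the suite, and the chance of a completely clean run is only about 15%.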

If your TRNG were very bad, many of the tests would fail with very low p-values. One simple way to improve the pass rate is to decimate the output from the raw generator. You could just drop every other sample, or XOR pairs of samples.
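Both decimation options fit in a few lines of Python (illustration only; in an FPGA this is a clock-enable or a single XOR gate):

```python
def drop_alternate(bits):
    """Keep every other sample (halves the output rate)."""
    return bits[::2]

def xor_pairs(bits):
    """XOR adjacent pairs: 2-to-1 parity compression, which raises
    the entropy per output bit when samples are independent."""
    return [a ^ b for a, b in zip(bits[::2], bits[1::2])]
```

Either way the output rate is halved; which one helps more depends on the bias and correlation structure of the raw source.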

Decimation will increase the entropy per output bit, but unfortunately at the cost of reducing the output rate. Since this is a cryptography forum and not motor sport, I think that's an acceptable compromise. I'd avoid post-processing the output (with encryption or scramblers), as this does not increase the output entropy rate, just adds in masking pseudo-entropy.

fgrieu:
I agree fully with all except the last paragraph. See my [answer](https://crypto.stackexchange.com/a/106816/555), whose theses are: 1) From information in the question, we can be very confident that the TRNG is defective or the tests are not conducted correctly [update: I'm no longer sure of that part!] 2) Whatever the NIST800-22 test results are, we can never conclude from them alone that a "TRNG doesn't require improvement" to be safely used in a cryptographic context.