Score:4

# Question about confidence interval in NIST SP 800-22

I want to ask about the confidence interval in NIST SP 800-22 to make sure I have understood it correctly. When evaluating the randomness of many binary strings, for example 1000 of them, section 4.2.1 of the NIST SP 800-22 document recommends computing the confidence interval given by the formula $$\hat{p} \pm 3 \sqrt{\frac{\hat{p}(1-\hat{p})}{m}},$$ where $$\hat{p}=1-\alpha$$ with $$\alpha=0.01$$, and $$m$$ is the sample size (the number of strings tested).

Suppose I test 1000 binary strings and all of them have $$P\text{-value} \geq 0.01$$. Then the proportion of passing strings is $$1000/1000=1$$, and the confidence interval is $$0.99 \pm 3 \sqrt{\frac{0.99 \times 0.01}{1000}} = 0.99 \pm 0.0094393,$$ which means the proportion should be above 0.9805607 and below 0.9994393. But the proportion 1 falls outside this interval, so these 1000 binary strings would be judged non-random.
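For concreteness, the interval can be reproduced in a few lines (a minimal sketch; the constants follow section 4.2.1):

```python
import math

alpha = 0.01        # significance level used by the NIST tests
p_hat = 1 - alpha   # expected proportion of sequences that pass
m = 1000            # number of sequences tested

half_width = 3 * math.sqrt(p_hat * (1 - p_hat) / m)
low, high = p_hat - half_width, p_hat + half_width

print(f"acceptable proportion range: [{low:.7f}, {high:.7f}]")
print(1.0 > high)  # an observed proportion of 1000/1000 = 1.0 lies above the interval
```

Running this gives the bounds quoted above, and confirms that a perfect pass rate of 1.0 sits outside them.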

Is there something wrong here, or am I misunderstanding it? All 1000 binary strings pass, yet the conclusion is that they are not random because the proportion is not in the confidence interval?

Score:4

Yes, this is how the document is intended to be read and interpreted.

If we perform 1000 tests looking for 1-in-100 events, we expect to find about 10, and it is actually quite surprising to see none at all (the opposite of the green jelly bean fallacy). This should only occur with probability roughly 0.000043, which should make us suspicious that some sort of outlier suppression is occurring.
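The 0.000043 figure falls straight out of the binomial model, assuming the 1000 tests are independent with failure probability $$\alpha = 0.01$$ each:

```python
# Probability that all 1000 independent tests pass when each fails
# with probability 0.01, i.e. the chance of seeing zero 1-in-100 events.
p_all_pass = (1 - 0.01) ** 1000
print(f"{p_all_pass:.6f}")  # prints 0.000043
```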

Score:3

Yes, but there is an important caveat to your $$\hat{p}$$ formula, which is actually addressed in § 4.4, Application of Multiple Tests.

The tests have to be truly independent of each other. Tests "such as the cusum test and the runs test, result in P-values that are likely to be correlated." A $$\chi^2$$ test and an $$n$$-dimensional random walk are totally correlated. In those circumstances the confidence-interval formula will not be accurate, and can thus cause mis-characterisation of the test samples.

NIST argues that it has investigated this effect, but you'll notice that it has not published the tests' dependency chart or covariance matrix; it just says that the correlation is "very small".

That's probably(?) okay, but it's a concern if you are writing your own test suite or expanding NIST's with other tests (à la Appendix C). I'm re-writing and expanding the ent test suite and have addressed such inter-test correlations by empirically determining the confidence interval through simulation.
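For the independent-test case, the empirical approach might look something like the sketch below (hypothetical code, not the actual rewritten ent suite; for correlated tests, the uniform draws inside the loop would be replaced by P-values from real test runs on simulated random data):

```python
import random

def empirical_interval(m=1000, alpha=0.01, trials=2000, seed=1):
    """Monte Carlo the pass-proportion distribution for m tests per trial."""
    rng = random.Random(seed)
    props = sorted(
        sum(rng.random() >= alpha for _ in range(m)) / m
        for _ in range(trials)
    )
    # Empirical analogue of the +/-3-sigma (99.73%) interval.
    lo = props[int(0.00135 * trials)]
    hi = props[min(int(0.99865 * trials), trials - 1)]
    return lo, hi

lo, hi = empirical_interval()
print(f"empirical pass-proportion range: [{lo:.4f}, {hi:.4f}]")
```

With independent tests the simulated bounds land close to the analytic $$3\sigma$$ interval; the point of doing it empirically is that the same machinery still works when the tests are correlated and the binomial formula does not.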

Score:2

There is some computational work studying the independence of the randomness tests in the NIST suite in the following paper, which comes up after a Google search.

The table below shows, in gray, the specific tests which are highly correlated (the sets of sequences they fail overlap to a large extent). Since the testing was exhaustive, the length of the sequences considered is quite short, e.g., $$n=20, 30$$.

The point of view of this paper is somewhat dual, in that they argue the following: if most sequences are random and should pass the tests, it is of interest not to duplicate tests that are dependent.

> We experimentally observe that frequency, overlapping template (with input template 111), longest run of ones, random walk height tests and maximum order complexity tests produce correlated results for short sequences. These correlations should be considered while analyzing generators using short sequences. The strength of these correlations is likely to decrease as the input lengths increase, but in case of testing longer sequences by level-2 version of these tests, correlations still exist whenever the input block size is small.

Another feature they tabulate is the number of sequences that fail only a given test while passing all the others; in some sense this is a measure of how different and independent that test is. For example, the test exposing the largest number of such sequences is Linear Complexity: 22436 sequences fail LC but pass all other tests considered, which is roughly $$2^{14.5}$$ sequences out of a total of $$2^{20}$$.

I haven't followed up the citations resulting from this work to see if any longer lengths were considered for exhaustive testing.
