Basic question: ENT seems to trip up generators that pass NIST 800-22 and maybe even dieharder. How do the latter two test suites miss such an obvious failure?
There are two things I want to mention about the well-known randomness test suite ENT, which is, as far as I understand, considered to be far less rigorous than test suites like NIST SP 800-22 and diehard(er).
I've applied ENT, the NIST test suite, and dieharder to my own TRNG throughout different stages of development, both with and without post-processing. Eventually I got to a stage where the TRNG consistently passed the NIST test suite, both the official implementation and a third-party one I found on GitHub. I was pretty rigorous with the testing, looking for any indication of non-randomness, plotting p-values, etc., but as far as I can tell the TRNG consistently passed with ease. As for dieharder, the large data requirements that have been discussed on this forum made testing difficult, but here too I was able to get the TRNG to pass at a rate similar to other "gold standard" PRNGs (in the words of the test suite's creators).
I was surprised, then, to see that the generator (at a point in development when it passed NIST) consistently failed the ENT Chi squared test, with a summary that the Chi squared statistic "would exceed this value 0.01 percent of the times," i.e. a p-value of 1e-4. This is the same generator that handily passed NIST 800-22 and nearly passed dieharder; dieharder did trip it up a bit, but not severely.
I noticed that Hotbits, whose methodology/results have been praised on this forum, displays what seems to be a failing Chi squared test with ENT on their statistics page. It is the same type of failure I mentioned earlier: a Chi squared test statistic with a p-value of 1e-4, if I understand correctly. Indeed, per the ENT website, "If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random." That wording seems a bit odd to me, since for an ideal generator we would expect a p-value > .99 or < .01 exactly 2% of the time, but the point stands, and a p-value of 1e-4 is far lower still.
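For reference, ENT's byte-level test is just Pearson's Chi squared over 256 bins. Here is a minimal stdlib-only sketch of it; note that the normal approximation for the tail probability is my own shortcut (reasonable at 255 degrees of freedom), not ENT's exact computation:

```python
import math

def chi_squared_bytes(data: bytes) -> tuple[float, float]:
    """Pearson Chi squared test of byte frequencies against uniform,
    over 256 bins as in ENT. The p-value (upper tail) uses the
    approximation chi2(df) ~ Normal(df, 2*df), which is rough but
    serviceable at df = 255; ENT computes the tail exactly."""
    k = 256
    expected = len(data) / k
    counts = [0] * k
    for b in data:
        counts[b] += 1
    stat = sum((c - expected) ** 2 / expected for c in counts)
    df = k - 1
    z = (stat - df) / math.sqrt(2 * df)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return stat, p

# Perfectly flat counts give a statistic of 0 and a p-value of ~1:
stat, p = chi_squared_bytes(bytes(range(256)) * 100)
print(stat, p)  # → 0.0 1.0
```

A p-value near 1e-4, as reported above, means the observed statistic sits far out in the upper tail of that distribution.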
So both my RNG and the Hotbits RNG seem to pass NIST and dieharder pretty easily, only to be tripped up by the ENT Chi squared tests.
My question: How would the NIST test suite let a generator off the hook that fails a basic Chi squared test as in ENT? Am I missing something or misunderstanding ENT's Chi squared test?
Side notes:
I linked to someone on the forum "praising" the results of Hotbits, and they themselves ran ENT on some of their data. The ENT results they presented were indeed passes, with a reasonable Chi squared test statistic. I haven't tested any of their data myself, I just noticed the 1e-4 p-value on their site front and center, hence the post.
I noticed that Fourmilab maintains both Hotbits and ENT. Not sure where this fact fits in.
Edit: I've since thought about this a little more and plotted the distribution of my RNG's output, and sure enough one byte value is typically quite a bit more likely than the others. Not a ton (the difference is small enough that the min-entropy is still upwards of 7.9 bits/byte), but it is noticeable. First, I imagine the reason it might not show up in NIST is that the tests are run on multiple bit streams (at least that's the way I've applied them), and this splitting up of the data reduces the effect on individual segments. As for dieharder I am not sure.
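To put a number on how mild such a bias can be: min-entropy depends only on the single most likely symbol, so (with made-up numbers, not measurements from my generator) one byte value can be roughly 7% over-represented while min-entropy stays around 7.9 bits/byte:

```python
import math

def min_entropy_bits(p_max: float) -> float:
    """Min-entropy per byte; depends only on the most probable value."""
    return -math.log2(p_max)

uniform = 1 / 256
# Hypothetical bias: one byte value 7% more likely than uniform,
# with the deficit spread over the other 255 values.
p_max = uniform * 1.07
print(min_entropy_bits(uniform))  # → 8.0
print(min_entropy_bits(p_max))    # → ~7.902
```

So a min-entropy estimate above 7.9 bits/byte is entirely compatible with a bias large enough for a frequency test to catch, given enough data.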
It seems like the phenomenon I observe, where one byte value is more likely than the others, is exactly the kind of thing that would drive the Chi squared p-value down to 1e-4. After all, the test statistic is a normalized sum of squares, so a single inflated empirical probability is precisely what throws it off. I wonder if Hotbits experiences something similar ...
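As a back-of-the-envelope check (my own sketch, with the same hypothetical 7% bias as above, not measured values): with one of 256 byte values over-represented by a relative fraction delta, Pearson's statistic acquires a non-centrality of roughly n * delta^2 / 255, so the amount of data needed to surface the bias scales as 1/delta^2:

```python
# Sketch: expected inflation of Pearson's Chi squared statistic from
# a single over-represented byte value. delta = 0.07 is a hypothetical
# bias consistent with ~7.9 bits/byte of min-entropy.
def noncentrality(n: int, delta: float, k: int = 256) -> float:
    """Non-centrality of Pearson's statistic over n samples when one of
    k symbols has probability (1+delta)/k and the deficit is spread
    evenly over the remaining k-1 symbols. Simplifies to ~n*delta^2/(k-1)."""
    p_hot = (1 + delta) / k
    p_other = (1 - delta / (k - 1)) / k
    return n * k * ((p_hot - 1 / k) ** 2 + (k - 1) * (p_other - 1 / k) ** 2)

delta = 0.07
# For chi2 with 255 df, p = 1e-4 corresponds to a statistic roughly
# 85-90 above the mean of 255 (normal approximation), so:
needed = 90 * 255 / delta ** 2
print(f"bytes needed to surface the bias: ~{needed:.2e}")
print(noncentrality(4_500_000, delta))  # ~86, i.e. well into the tail
```

If that arithmetic is right, a few million bytes through ENT would reliably expose a bias this size, which may explain why a file-level frequency test flags it while block-wise bit-stream tests do not.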