The question states that "95/100" of the failed tests are Nonoverlapping tests. I previously pointed out that this is a significant statistical anomaly†, and that a good generator could not have been correctly tested. Following these comments, I'm no longer sure, since they make me wonder what proportion of all the tests run are Nonoverlapping tests!
How to improve the randomness test result?
The simplest way towards THAT goal (NOT towards the more righteous goal of improving the generator) is to add post-conditioning with an LFSR-based scrambler. Below is an example with polynomial $x^4+x+1$ (or $x^4+x^3+1$, depending on convention).
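A minimal Python sketch of such a self-synchronizing (multiplicative) scrambler, using the $x^4+x+1$ convention; the bit-list interface and function names are illustrative, not from any particular implementation:

```python
# Self-synchronizing (multiplicative) LFSR scrambler for the
# polynomial x^4 + x + 1; taps (1, 4) are the nonzero exponents
# other than the constant term.

def scramble(bits, taps=(1, 4), degree=4):
    """Each output bit is the input bit XORed with previous *output*
    bits at the tap delays: y[n] = x[n] ^ y[n-1] ^ y[n-4]."""
    history = [0] * degree          # history[0] = most recent output bit
    out = []
    for b in bits:
        y = b
        for t in taps:
            y ^= history[t - 1]
        out.append(y)
        history = [y] + history[:-1]
    return out

def descramble(bits, taps=(1, 4), degree=4):
    """Mirror image: XOR with previous *received* bits. It recovers
    the input exactly, which is why the scrambler demonstrably cannot
    weaken an already good TRNG."""
    history = [0] * degree
    out = []
    for b in bits:
        y = b
        for t in taps:
            y ^= history[t - 1]
        out.append(y)
        history = [b] + history[:-1]
    return out
```

With matching (here all-zero) initial states, `descramble(scramble(x)) == x`, confirming the transformation is invertible and thus entropy-preserving.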
Such a scrambler improves the results of most predefined statistical tests (and demonstrably can't weaken an already good TRNG). I think that for an otherwise passable generator, the excess probability of failure of the Non-Overlapping Template Matching Test (and of many other tests, with the Linear Complexity Test a likely outlier) decreases exponentially with the degree of the polynomial used (it must have a constant term, and it's better if it has some other nonzero coefficients). But that does not necessarily make the TRNG more suitable for cryptographic use; see why here.
The fact that this solution is so easy and effective should be enough to convince anyone that trusting a TRNG on the sole basis that it passes NIST800-22 (or any other black-box test) is foolish.
A much better option from the standpoint of cryptographic security is to use a long LFSR scrambler, discard its initial output, and decimate its output (e.g. keep one bit out of two). An even better option is to use a CSPRNG instead of the LFSR for the conditioning.
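A hedged sketch of that stronger conditioning: a long self-synchronizing LFSR scrambler whose initial output is discarded and whose remaining output is decimated 2:1. The degree-32 polynomial $x^{32}+x^{22}+x^2+x+1$ and the discard/decimation parameters below are illustrative assumptions, not vetted design choices:

```python
# Long LFSR scrambler + discard of initial output + 2:1 decimation.
# Taps (1, 2, 22, 32) encode the assumed polynomial
# x^32 + x^22 + x^2 + x + 1; `discard` and `keep_every` are
# illustrative parameters.

def condition(bits, taps=(1, 2, 22, 32), discard=64, keep_every=2):
    degree = max(taps)
    history = [0] * degree          # history[0] = most recent output bit
    out = []
    for n, b in enumerate(bits):
        y = b
        for t in taps:              # feedback from previous output bits
            y ^= history[t - 1]
        history = [y] + history[:-1]
        if n >= discard and (n - discard) % keep_every == 0:
            out.append(y)           # keep one bit out of `keep_every`
    return out
```

Discarding the initial output hides the start-up state, and decimation destroys the linear invertibility that made the short scrambler trivially reversible, so an observer can no longer reconstruct the raw source bits from the output alone.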
A serious problem with NIST800-22 and other statistical tests for TRNGs is that they are so stringent that very few unconditioned sources can pass them. It's thus tempting to heavily post-condition sources, but then the test loses significance unless the cryptographic soundness of the post-conditioning is taken into account, which tests based on the output alone cannot do.
Independently, NIST800-22 does not test critical aspects of TRNGs:
- Tendency to output the same thing at each cold reset.
- Sensitivity to external conditions, like temperature, power supply voltage, adversarially applied stimulation (power variation, EMI) to get some internals in sync.
- Presence and effectiveness of fail-safe mechanisms such that the TRNG signals its failure rather than producing unsafe output.
In summary, no observation based on NIST800-22 (or any other statistical test) can ever be enough to conclude that a TRNG can be safely used in a cryptographic context (some knowledge of the TRNG design is required, as well as knowledge of how the samples tested have been obtained). The best we can hope to conclude is that the TRNG doesn't require improvement to pass tests of randomness.
† At least $14$ tests failed (because $12/13 ≈ 0.923$ would have been rounded to 90/100, assuming the reported proportion is rounded to the nearest 5%); and at least $92.5\%$ of the failed tests are Nonoverlapping tests. There are $15$ kinds of tests. By design, they have approximately the same failure rate of $1\%$ for truly random data. If $i\ge8$ kinds have been implemented, and an equal number of tests of each kind was run, there is an anomaly: the probability that, among at least $14$ independent draws uniformly among $i$ values, at least $92.5\%$ are the same, is maximal for $14$ draws, in which case that probability is $p=i^{-11}(1+i^{-2})$. If the tests are run on the same data (which is common), NIST800-22 test failures are correlated, and that makes $p$ even lower. We thus conclude with overwhelming confidence (probability of the contrary less than one in billions for $i=8$) that a good TRNG was not correctly tested.
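The footnote's bound can be checked numerically by evaluating its expression $p=i^{-11}(1+i^{-2})$ for plausible counts $i$ of implemented test kinds:

```python
# Evaluate the footnote's expression p = i^(-11) * (1 + i^(-2)) for
# several plausible counts i of implemented test kinds.
for i in range(8, 16):
    p = i**-11 * (1 + i**-2)
    print(f"i={i:2d}  p ≈ {p:.2e}  (about 1 in {1/p:,.0f})")
```

For $i=8$ this gives $p ≈ 1.2\cdot10^{-10}$, i.e. about one in 8.5 billion, consistent with the "less than one in billions" claim; larger $i$ only makes $p$ smaller.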