When we design a cryptographic PRNG, we want more than just statistical tests to pass. The gold-standard here is the next-bit test, which says that given all of the output up to a certain point, it is only negligibly easier than even to guess the next bit. This is not a statistical test, but a test which implies that the attacker has full knowledge of the algorithm and the ability to cryptanalyze it, and still cannot find a better solution than brute force.
To show an example of an insecure CSPRNG which could well pass any statistical test but won't pass the next-bit test, imagine a secret seed $ S $, a secure hash function $ H(x) $, and a PRNG that outputs $ H(S) || H(H(S)) || H(H(H(S))) || ... $. Because our hash function is secure, its output appears random, but it is in fact very easy to guess the next output given the existing output.
It is, of course, totally possible to make a CSPRNG that outputs single bytes at a time. RC4 is a stream cipher that has this design and has been used a CSPRNG in many systems, although it's no longer considered secure. We could well design other CSPRNGs that are secure and byte oriented. As a practical matter, though, it is usually more efficient to operate on larger chunks of data, which is why algorithms such as ChaCha20 or the NIST DRBGs tend to be used more frequently.
As to the benefits of a fast non-cryptographic PRNG, there is little reason to use them. ChaCha20 can output data at 3 GB/s on my system, and it is cryptographically secure. In the unlikely event this isn't fast enough, a smaller number of rounds could be used (e.g., ChaCha12) and the performance would be better while still being cryptographically secure. Random number generation with a suitable algorithm is almost never the bottleneck, so using a CSPRNG, which will always pass the next-bit test and is therefore of the highest possible quality, is sufficient in almost all cases.