There are more efficient implementations of TFHE than the one you quote.
In particular, the company Zama has an implementation they call Concrete, which includes
There are some benchmarks of the Rust code from roughly two years ago, where they claim ~30ms for multiplications.
Unfortunately, that document does not make clear to me how many message bits the benchmark uses. I believe it is $\geq 5$, but it may be exactly $5$. Either way, this is much faster than ~0.9s for 4-bit multiplications.
Note that you would still lose out on the SIMD-type qualities of BFV.
Despite this, you might end up faster in practice. Concrete appears to have a GPU-accelerated backend (the aforementioned benchmarks predate it), so one could plausibly recover a similar degree of parallelism that way.
Still, provided I am interpreting the benchmarks correctly, this is a roughly 30x or better speedup over what you quote (~0.9s versus ~30ms, for at least as many message bits), before doing anything GPU-related (which is now an option), so it is plausibly of interest.
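As a rough sanity check on that comparison, here is a small sketch of the arithmetic. The two timings are only the approximate figures quoted in this thread. The break-even calculation additionally assumes (hypothetically, I have not verified this for your setup) that the ~0.9s figure is the cost of one batched ciphertext multiplication whose cost is shared evenly across all packed slots. The function names are mine, for illustration only.

```python
# Approximate figures quoted in this thread (not new measurements).
BFV_MUL_MS = 900   # ~0.9 s per 4-bit multiplication, from the question
TFHE_MUL_MS = 30   # ~30 ms per multiplication, from the Concrete benchmarks


def speedup(slow_ms: int, fast_ms: int) -> float:
    """Per-operation latency ratio between the slow and fast scheme."""
    return slow_ms / fast_ms


def break_even_slots(batched_ms: int, single_ms: int) -> int:
    """Smallest slot count n such that batched_ms / n < single_ms.

    Assumption: one batched multiplication costs batched_ms in total and
    is amortized evenly over n packed values.
    """
    return batched_ms // single_ms + 1


if __name__ == "__main__":
    print("per-op speedup:", speedup(BFV_MUL_MS, TFHE_MUL_MS))
    print("slots needed for batching to win:",
          break_even_slots(BFV_MUL_MS, TFHE_MUL_MS))
```

Under those assumptions, the SIMD scheme only wins on amortized cost per value if you can actually keep more than about 30 slots busy with useful work, which is worth checking against your workload.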
Regarding BFV bootstrapping, the BGV cryptosystem has similar characteristics to BFV (they are both "fast arithmetic with SIMD + slow bootstrapping" schemes), and HElib's benchmarks contain sample code for BGV bootstrapping, which may be of interest to you.