While not an answer to your question, a little background on DFT/FFT/NTT. If you understand these well, you can write your own implementation without too much trouble.
DFT/FFT
In a field $\mathbb{F}$ with $n$th roots of unity (say $\mathbb{C}$), we have an isomorphism
$$
\mathbb{F}[x]/(x^n-1)\cong\prod_{i=0}^{n-1}\mathbb{F}[x]/(x-\zeta^i)\cong\mathbb{F}^n (\text{as rings!}),
$$
basically the chinese remainder theorem (here $\zeta$ is a primitive $n$th root of unity). The isomorphism is specifically given by evaluation (left to right)
$$
f(x)\mapsto(f(1),f(\zeta),\ldots,f(\zeta^{n-1}))
$$
for some primitive $n$th root of unity $\zeta$ (say $\zeta=e^{2\pi i/n}$). From right to left, we interpolate
$$
(z_0,\ldots,z_{n-1})\mapsto f \text{ s.t. } f(\zeta^i)=z_i.
$$
This isomorphism is linear over $\mathbb{F}$, given by the matrices
$$
(\zeta^{ij})_{i,j=0}^{n-1}=
\left(
\begin{array}{ccccc}
1&1&1&\cdots&1\\
1&\zeta&\zeta^2&\cdots&\zeta^{n-1}\\
\vdots&\vdots&\vdots&\cdots&\vdots\\
1&\zeta^{n-1}&\zeta^{2(n-1)}&\cdots&\zeta^{(n-1)(n-1)}
\end{array}
\right)
$$
$$
\frac{1}{n}\left(\zeta^{-ji}\right)_{i,j=0}^{n-1}=
\frac{1}{n}\left(
\begin{array}{ccccc}
1&1&1&\cdots&1\\
1&\zeta^{-1}&\zeta^{-2}&\cdots&\zeta^{-(n-1)}\\
\vdots&\vdots&\vdots&\cdots&\vdots\\
1&\zeta^{-(n-1)}&\zeta^{-2(n-1)}&\cdots&\zeta^{-(n-1)(n-1)}
\end{array}
\right)
$$
The FFT is (a pair of) algorithms to compute the DFT above quickly. The flavors are DIT (decimation in time, or Tukey-Cooley) and DIF (decimation in frequency, or Gentleman-Sande). You can think of these as nice factorizations of the above matrices or as recursive implementation of the matrix multiplication.
NTT
The NTT is a DFT over some other base field, say a finite field that has $n$th roots of unity, e.g. $\mathbb{F}_{8380417}$ has $512$th roots of unity (since the multiplicative group of a finite field of order $q$ is cyclic of order $q-1$ and is precisely the solutions to $x^{q-1}-1=0$ and $8380417-1=2^{13}\cdot3\cdot11\cdot31$). [For reference, this is the prime used in Dilithium.]
NTT in Kyber
Most lattice schemes work in $\mathbb{F}[x]/(x^n+1)$ for $n$ a power of 2. [Note that this is using half the roots of unity discussed above $x^{2^{k+1}}-1=(x^{2^k}+1)(x^{2^k}-1)$ and that $x^{2^k}+1$ is irreducible over $\mathbb{Q}$.] This doesn't change things all that much; basically replace $\zeta^i$ above with $\zeta^{2i+1}$.
To allow the use of a smaller modulus in Kyber ($q=3329$), we end up with a field that doesn't have $512$th roots of unity, but does have $256$th roots of unity, so that the chinese remainder theorem looks like
$$
\mathbb{F}_{3329}[x]/(x^{256}+1)\cong\prod_{i=0}^{255}\mathbb{F}_{3329}[x]/(x^2-\zeta^{2i+1}),
$$
again given by evaluation/interpolation. In other words a polynomial $f$ given by coefficients $f(x)=\sum_{i=0}^{255}f_ix^i$ is transformed into a list
$$
\hat{f}=(\hat{f}_{2i}+x\hat{f}_{2i+1})_{i=0}^{127}.
$$
To make a long story shorter, doing this evaluation quickly boils down to doing two length 128 FFTs.
One final complication with Kyber is the ordering of the roots of unity/indices in the NTT, namely
$$
\hat{f}=(\hat{f}_{2\mathsf{br}(i)}+x\hat{f}_{2\mathsf{br}(i)+1})_{i=0}^{127}
$$
where $\mathsf{br}(i)$ is the 7-bit bit reversal of the integer $0\leq i<127$, e.g.
$$
112_{10}=1110000_2\xrightarrow{\mathsf{br}}0000111_2=7_{10}.
$$
The reason for this is that DIT/DIF naturally reverse the bit-order of the indices and the implementors left out the extra work to (un)do this (or something about AVX instructions if you read the specs).
Exercises
- Work out/verify DFT.
- Show how DFT for $x^n+1$ and $x^{2n}-1$ are related.
- Work out DIT/DIF recursions.
- Show how to implement Kyber NTT with two length 128 DFTs.