From this source I extract a short hashcat benchmark for the RTX 4090:
```
CUDA API (CUDA 11.8)
* Device #1: NVIDIA GeForce RTX 4090, 23867/24252 MB, 128MCU

Benchmark relevant options:
* --optimized-kernel-enable
* --workload-profile=4

* Hash-Mode 1700 (SHA2-512)
Speed.#1.........: 7425.2 MH/s (288.57ms) @ Accel:64 Loops:1024 Thr:256 Vec:1

* Hash-Mode 7100 (macOS v10.8+ (PBKDF2-SHA512)) [Iterations: 1023]
Speed.#1.........: 2825.7 kH/s (221.31ms) @ Accel:32 Loops:511 Thr:512 Vec:1
```
The vastly different orders of magnitude show that the H in 7425.2 MH/s and the H in 2825.7 kH/s do not count the same work: in the second case, one H stands for many more SHA-512 computations.
By definition of PBKDF2, the iteration count is the number of iterations of a Pseudo-Random Function, which in practice is HMAC with the specified hash, here SHA-512; for the sizes considered, one HMAC-SHA-512 uses two SHA-512 round functions. Hence one PBKDF2-SHA512 with $1023$ iterations is expected to perform $2046$ SHA-512 round functions, while one SHA2-512 hash uses just $1$. This is roughly consistent with the $\approx2627$ times lower H/s value reported for the second data point.
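That consistency check can be verified directly; the two rates below are the ones from the benchmark quoted above:

```python
# Ratio between the raw SHA2-512 rate and the PBKDF2-SHA512 rate,
# compared with the expected SHA-512 round-function count per PBKDF2.
sha512_rate = 7425.2e6   # H/s, hash-mode 1700 (one SHA-512 per H)
pbkdf2_rate = 2825.7e3   # H/s, hash-mode 7100 (one full PBKDF2 per H)

measured_ratio = sha512_rate / pbkdf2_rate   # ~2627
expected_ratio = 2 * 1023                    # two SHA-512 rounds per HMAC iteration

print(f"measured {measured_ratio:.1f}, expected {expected_ratio}")
```

The measured ratio exceeds the expected $2046$ by about $28\%$, which is plausible per-PBKDF2 overhead.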
In the computation of PBKDF2 with more than a few rounds, assuming ideal parallelization, the average time should be of the form $(u\,c+v)n$, where $c$ is the iteration count, $n$ is the (assumed large) number of PBKDF2 evaluations, and $u$, $v$ are positive constants in seconds. Roughly, $u$ is the time per PRF iteration, and $v$ is an overhead per PBKDF2 evaluation, all assuming ideal parallelization. It follows that there are $1/(u\,c+v)$ PBKDF2 evaluations per second.
We can't tell $u$ or $v$ exactly, but knowing that they are positive is enough to extrapolate a minimum PBKDF2 rate for $c=600000$ iterations from the one for $c=1023$: that rate should be at least $2825700\times1023/600000\approx4817$ PBKDF2 per second. In practice, I believe this lower bound won't be far off, because hashcat is a reputable program, thus well optimized, thus hopefully $u\,c\gg v$ even for $c=1023$.
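A minimal sketch of that extrapolation (the positivity of $u$ and $v$ is exactly what makes scaling by $c_1/c_2$ a lower bound):

```python
# Lower bound on the PBKDF2-SHA512 rate at c2 iterations, extrapolated
# from the benchmarked rate at c1: since the time per evaluation is
# u*c + v with u, v > 0, we have rate(c2) >= rate(c1) * c1 / c2.
rate_c1 = 2825.7e3        # PBKDF2/s at c1 = 1023 (benchmark above)
c1, c2 = 1023, 600_000

min_rate_c2 = rate_c1 * c1 / c2
print(f"at least {int(min_rate_c2)} PBKDF2-SHA512 per second")  # → at least 4817
```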
> I've timed the AES(GCM) verification step and I'm treating it as negligible, and the implementation is (some library on the PC's CPU)
I trust that when PBKDF2 runs on the PC's CPU, the AES-GCM verification step uses a negligible fraction of the CPU in the password-cracking effort, assuming the (unspecified) amount of data entering AES-GCM is moderate. However, that may no longer hold when the PBKDF2 part is GPU-accelerated.
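For comparison, the CPU-side PBKDF2 rate is easy to measure with Python's standard library. This is a rough single-core sketch; the password, salt, and number of timed candidates below are illustrative, not from the source:

```python
import hashlib
import time

# Rough single-core measurement of the CPU PBKDF2-SHA512 rate at a
# high iteration count. hashlib.pbkdf2_hmac is OpenSSL-backed in CPython.
iterations = 600_000   # illustrative high iteration count
n_evals = 5            # number of password candidates to time

start = time.perf_counter()
for i in range(n_evals):
    hashlib.pbkdf2_hmac("sha512", b"candidate-%d" % i, b"fixed-salt", iterations)
elapsed = time.perf_counter() - start

print(f"{n_evals / elapsed:.2f} PBKDF2-SHA512 per second per core")
```

Comparing that figure with the $\approx4817$ per second extrapolated for the GPU gives a sense of how much the GPU acceleration is worth.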
- If we keep the same AES-GCM code, will it be able to perform the vastly higher $\approx4817$ AES-GCM tests per second required so that it does not become the bottleneck? Also, I can't tell for sure how much exporting the results of PBKDF2 to the PC's memory (rather than checking everything on the GPU, as I imagine the benchmarked code does) will slow down the GPU code, nor whether making the necessary change requires a day or a month of experience writing GPU code.
- If we move the AES-GCM code to the GPU, that's a significant change to the GPU code, and I can't forecast how efficient it will be, or even whether it will be faster than the previous option.