What is not realized is that the number of samples should be kept multiples of small primes, 2,3,5 etc. Calculating an FFT where the number of samples is a single large prime is the worse case.
Eg:
Plan for 999983 samples used: 99522862.000000 add's, 53927851.000000 multiplies, 42864846.000000 fused Multiply-Add
real 0m0.873s
Plan for 999984 samples used: 41417486.000000 add's, 28948213.000000 multiplies, 6618732.000000 fused Multiply-Add
real 0m0.312s
So in this sample code going from 999983, a prime number, to 999984, who's prime factors are 2 x 2 x 2 x 2 x 3 x 83 x 251, resulted in a performance improvement of about 3x.
Performance isn't always simple.