Flux HPC: Using FFTW For Optimum Performance

Monday, September 29, 2014

Using FFTW For Optimum Performance

FFTW is probably the most popular DFT library ever. What many don't realize is that there are many ways to calculate an FFT depending on the number of samples. FFTW addresses this by creating a plan and using dynamic code generation to try to quickly calculate an optimal method.

What is not realized is that the number of samples should be kept multiples of small primes, 2,3,5 etc. Calculating an FFT where the number of samples is a single large prime is the worse case.

Eg:

Plan for 999983 samples used: 99522862.000000 add's, 53927851.000000 multiplies, 42864846.000000 fused Multiply-Add

real    0m0.873s

Plan for 999984 samples used: 41417486.000000 add's, 28948213.000000 multiplies, 6618732.000000 fused Multiply-Add

real    0m0.312s

So in this sample code going from 999983, a prime number, to 999984, who's prime factors are 2 x 2 x 2 x 2 x 3 x 83 x 251, resulted in a performance improvement of about 3x.

Performance isn't always simple.