Before I go further: never use the Netlib reference versions of BLAS and LAPACK. They are not tuned, and the performance difference can be dramatic. When using these routines you want your CPU vendor's implementation; they all conform to the same API and are thus portable.
MKL, the Math Kernel Library, is Intel's implementation of BLAS, LAPACK, and many other solvers. Intel went further: MKL can automatically offload routines to Xeon Phi accelerators with no code modification.
module load mkl/11.2
icc dgemm_speed.cpp -mkl -DBLAS3 -DDIM=10000 -DMKL
qsub -I -l nodes=1:ppn=1:mics=1,pmem=20gb -A-q flux -l qos=flux -V
#CPU Only Result
[brockp@nyx5896 blas]$ ./a.out
Size of double: 8
Will use: 2288 MB
Matrix full in: 7 sec
MKL Runtime: 89 sec.
#Single Phi Result
[brockp@nyx5896 blas]$ module load xeon-mic
This module sets the following defaults:
MKL_MIC_ENABLE=1 enables auto MKL offload
MKL_MIC_WORKDIVISION=MIC_AUTO_WORKDIVISION
MKL_MIC_DISABLE_HOST_FALLBACK=1 unset if not always using MIC
MIC_OMP_NUM_THREADS=240 Sane values 240/244 (60 core * 4 threads)
OFFLOAD_REPORT=2 Gives lots of details about the device when running; setting to 1 gives less information, unset to suppress
If your application uses both Compiler Assisted Offload and Automatic Offload then it is strongly recommended to set OFFLOAD_ENABLE_ORSL=1. This env-variable enables the two offload modes to synchronize their accesses to coprocessors.
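If you are on a system without this module, the same defaults can be exported by hand (a sketch, using the values the module output above reports):

```shell
# Reproduce the xeon-mic module defaults manually.
export MKL_MIC_ENABLE=1                            # enable MKL automatic offload
export MKL_MIC_WORKDIVISION=MIC_AUTO_WORKDIVISION  # let MKL split work between host and MIC
export MKL_MIC_DISABLE_HOST_FALLBACK=1             # unset this if not always using a MIC
export MIC_OMP_NUM_THREADS=240                     # 60 cores * 4 threads
export OFFLOAD_REPORT=2                            # per-call offload details on stderr

# Sanity check what is set:
env | grep -E 'MKL_MIC|MIC_OMP|OFFLOAD_REPORT'
```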
[brockp@nyx5896 blas]$ ./a.out
Size of double: 8
Will use: 2288 MB
Matrix full in: 6 sec
[MKL] [MIC --] [AO Function] DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision] 0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time] 9.471196 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time] 6.818753 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1600000000 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 3200000000 bytes
MKL Runtime: 10 sec.
The key difference that enables auto offload is that the xeon-mic module sets MKL_MIC_ENABLE=1, which lets MKL use the Phi (also known as a MIC) to take part in the computation. In this case, with no code modification, the single matrix multiply was accelerated 8.9x (89 sec down to 10 sec).
You can also use more than one Phi, and you can see speedups much greater than this when the data are larger or the routine is more complex, e.g. a full DGESV(), which calls the DGEMM() benchmarked here many times.