Flux HPC: mic

Wednesday, April 1, 2015

Xeon Phis for Linear Algebra

Linear algebra is the backbone of many research codes, so much so that standard libraries were created to support both the primitives (matrix multiply, dot product) and the higher-level routines (LU factorization, QR factorization). These libraries are BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).

Before I go further: never use the Netlib versions of BLAS and LAPACK. They are not tuned, and the performance difference can be dramatic. When using these routines, you want the CPU vendor's own implementation. They all conform to the same API and thus are portable.

MKL, the Math Kernel Library, is Intel's implementation of BLAS, LAPACK, and many other solvers. Intel went further: MKL can automatically offload routines to Xeon Phi accelerators with no code modification.

module load mkl/11.2
icc dgemm_speed.cpp -mkl -DBLAS3 -DDIM=10000 -DMKL
qsub -I -l nodes=1:ppn=1:mics=1,pmem=20gb -A -q flux -l qos=flux -V

#CPU Only Result

[brockp@nyx5896 blas]$ ./a.out

Size of double: 8

Will use: 2288 MB

 Matrix full in: 7 sec

MKL Runtime: 89 sec.

#Single Phi

[brockp@nyx5896 blas]$ module load xeon-mic

This module sets the following defaults:

        MKL_MIC_ENABLE=1 enables auto MKL offload

        MKL_MIC_WORKDIVISION=MIC_AUTO_WORKDIVISION

        MKL_MIC_DISABLE_HOST_FALLBACK=1 unset if not always using MIC

        MIC_OMP_NUM_THREADS=240 Sane values 240/244 (60 core * 4 threads) 

        OFFLOAD_REPORT=2 Gives lots of details about the device when running, setting to 1 gives less information, unset to surpress

        If your application uses both Compiler Assisted Offload and Automatic Offload then it is strongly recommended to set OFFLOAD_ENABLE_ORSL=1. This env-variable enables the two offload modes to synchronize their accesses to coprocessors.

[brockp@nyx5896 blas]$ ./a.out 

Size of double: 8

Will use: 2288 MB

 Matrix full in: 6 sec

[MKL] [MIC --] [AO Function]    DGEMM

[MKL] [MIC --] [AO DGEMM Workdivision]  0.00 1.00

[MKL] [MIC 00] [AO DGEMM CPU Time]      9.471196 seconds

[MKL] [MIC 00] [AO DGEMM MIC Time]      6.818753 seconds

[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1600000000 bytes

[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 3200000000 bytes

MKL Runtime: 10 sec.

The key to getting automatic offload is that the xeon-mic module sets MKL_MIC_ENABLE=1, which lets MKL use the Phi (also known as a MIC) to take part in the computation. In this case, with no code modification, the single matrix multiply was accelerated 8.9x (89 seconds on the CPU alone versus 10 seconds with one Phi).
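The figures in the two runs above can be checked with a little shell arithmetic. This is only a sketch: the helper names matrix_mib and speedup are made up for illustration, and the footprint formula assumes the benchmark holds three DIM x DIM matrices of 8-byte doubles (A, B and C), which the source of dgemm_speed.cpp would have to confirm.

```shell
# Footprint in MiB of three DIM x DIM matrices of 8-byte doubles.
# Assumption: the benchmark allocates A, B and C at full size.
matrix_mib() {
    echo $(( 3 * $1 * $1 * 8 / 1048576 ))
}

# Speedup of the offloaded run over the CPU-only run, to one decimal place.
speedup() {
    awk -v cpu="$1" -v mic="$2" 'BEGIN { printf "%.1f\n", cpu / mic }'
}

matrix_mib 10000   # DIM=10000 as passed with -DDIM on the compile line
speedup 89 10      # MKL runtimes in seconds reported by the two runs
```

With DIM=10000 the first call agrees with the "Will use: 2288 MB" line, and the second reproduces the 8.9x figure.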

You can also use more than one Phi. Speedups can be much greater than the one shown here if the data are larger or the routine is more complex, e.g. a full DGESV(), which calls the DGEMM() benchmarked here many times.

Tuesday, March 17, 2015

Intel Xeon Phis Available on Flux

ARC-TS now has available, as a technology preview, 8 Intel Xeon Phi 5110P cards. These are also known as MICs (Many Integrated Core). Each is an accelerator card that fits into a slot on a Flux compute node, and a code can offload portions or all of its work to the card.

As a technology preview, there is no additional cost for using the Phis. All that is required to test them is an active Flux allocation. The only other requirement is that all Phi jobs must be less than 24 hours long.

The Phi cards support three modes of operation: Automatic Offload, Compiler Assisted Offload, and Native. The first two are well tested; the last works but is not as well tested. All intel-comp compiler and mkl math library modules on Flux support the Phi.

You can request Phis from PBS with:

qsub -I -l nodes=1:mics=2 -q flux -A account_flux -l qos=flux -V

This will provide two Phis and one CPU. Flux currently has one node with 8 Phi cards.
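The same request can also be written as a batch job instead of an interactive one. The sketch below writes a job script using the resource request shown above. The file name phi_job.pbs, the job name, the 12-hour walltime (chosen only to stay under the 24-hour limit) and the ./a.out binary are hypothetical placeholders, and account_flux stands in for your own allocation.

```shell
# Write a batch job script that mirrors the interactive qsub request.
# Everything marked "hypothetical" is an illustration, not a site default.
cat > phi_job.pbs <<'EOF'
#!/bin/bash
#PBS -N phi_test
#PBS -l nodes=1:mics=2
#PBS -l walltime=12:00:00
#PBS -l qos=flux
#PBS -q flux
#PBS -A account_flux
#PBS -V

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Load the Phi environment defaults, then run a hypothetical binary.
module load xeon-mic
./a.out
EOF

echo "Wrote phi_job.pbs; submit it with: qsub phi_job.pbs"
```

The quoted 'EOF' keeps $PBS_O_WORKDIR unexpanded, so it is resolved when the job runs rather than when the file is written.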

PBS will set two variables:

PBS_MICFILE -> list of hostnames of the assigned Phis, useful for native mode.
OFFLOAD_DEVICES -> CSV list of devices for controlling auto offload (MKL) or compiler assisted offload.
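A job can inspect these two variables before starting work. The following is a minimal sketch, not something PBS provides: count_devices is a made-up helper, the "0,1" fallback exists only so the snippet runs outside a job, and it assumes PBS_MICFILE names a file with one hostname per line.

```shell
# Count the entries in a comma-separated device list such as OFFLOAD_DEVICES.
count_devices() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# "0,1" is a hypothetical fallback so the sketch also runs outside a PBS job.
devices="${OFFLOAD_DEVICES:-0,1}"
echo "Offload devices: $devices ($(count_devices "$devices") cards)"

# PBS_MICFILE is only set inside a job; list the Phi hostnames if it exists.
# Assumption: the file holds one hostname per line.
if [ -n "${PBS_MICFILE:-}" ] && [ -r "$PBS_MICFILE" ]; then
    while read -r mic_host; do
        echo "Native-mode host: $mic_host"
    done < "$PBS_MICFILE"
fi
```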

$HOME is mounted on the card, as is all the software, but /scratch currently is not; it should be sometime in the future. This should only affect users running native Phi code.

For software, the Phi requires some environment changes to work. We created a module called xeon-mic. When loaded, it sets some sane defaults and prints what it set. Users are encouraged to experiment with the many settings available.

module load xeon-mic
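As one way to experiment, the defaults can be overridden in the shell after the module is loaded. The sketch below only touches variables that the module itself reports setting, using the alternative values its message mentions. Whether these choices help a given code is something to measure, not assume.

```shell
# Run these after "module load xeon-mic" to override its defaults.

# Less verbose offload report (the module default is 2).
export OFFLOAD_REPORT=1

# Use all 244 hardware threads instead of 240.
export MIC_OMP_NUM_THREADS=244

# Allow MKL to fall back to the host when the Phi is not used.
unset MKL_MIC_DISABLE_HOST_FALLBACK

echo "OFFLOAD_REPORT=$OFFLOAD_REPORT MIC_OMP_NUM_THREADS=$MIC_OMP_NUM_THREADS"
```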

In future posts we will show examples using the Phi.