Saturday, August 22, 2015

Flux High-Speed Data Transfer Service

Do you have a large data set on your own storage equipment that you would like to process on Flux? We can accommodate up to 40 gigabits per second of data transfer in and out of Flux via the campus Ethernet backbone. There is no additional cost to use this service, but you do need to contact us in order to set it up.

By default, network traffic between Flux compute nodes and other systems on campus takes place over standard one gigabit Ethernet connections. This is sufficient for modest amounts of traffic such as that generated by administrative tasks, monitoring, and home directory access.

Traffic between Flux and its high-speed /scratch filesystem runs over a separate 40 gigabit per second InfiniBand network within the datacenter, and data between Flux and off-campus systems on the Internet can be staged through our transfer server at up to 10 gigabits per second. This would seem to leave a gap though: what if you want direct high-speed connections between the Flux nodes and other systems on campus? We provide such connections using a Mellanox BX5020 InfiniBand/Ethernet gateway:

The Flux BX5020 Gateway

The gateway connects to the Flux InfiniBand network and to the campus Ethernet network and allows traffic to flow between the two networks. The InfiniBand network runs at 40 gigabits per second, and the gateway has four 10 gigabit links to the campus Ethernet network. This allows any Flux node to communicate with any system on campus at up to 40 gbit/s.

One of our customers has multiple petabytes of data on their own storage equipment and has been using Flux to process it. We mount this customer's NFS servers on Flux and route the traffic through the gateway. The customer is currently running jobs on Flux against two of their 10-gigabit-connected servers, and last weekend they reached a sustained transfer rate into Flux of 14.3 gigabits per second.

Gateway traffic for the week of 8/11/2015 - 8/18/2015
Although we have pushed more than 14 gbit/s through the gateway during testing, this is a new record for production traffic through the system.

Our gateway is currently connected to the Ethernet network at 40 gigabits per second, but it can be readily expanded to 80 and possibly 120 gigabits per second as needed. Additionally, we plan to replace the existing gateway in the near future with newer equipment. The planned initial bandwidth for the new equipment is 160 gbit/s, and there is room for growth even beyond that.

No changes to your network configuration are needed to use the gateway; those changes take place on our end only. All you have to do is export your storage to our IP ranges. If you want to discuss or get set up for this service, please let us know! Our email address is hpc-support@umich.edu and we will be happy to answer any questions you have.
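As a concrete sketch, exporting a directory to Flux usually amounts to a line in /etc/exports on your storage server followed by re-exporting. The directory path and network below are placeholders, not actual Flux IP ranges; we will provide the real ranges when you contact us:

# /etc/exports -- the path and network shown here are placeholders
/export/mydata   10.0.0.0/16(rw,sync,no_subtree_check)

# apply the change
exportfs -ra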

If you are interested in the technical details of how the gateway works, this presentation from Mellanox on the Ethernet over InfiniBand (EoIB) technology used by the system should prove informative. There is no need to know anything about EoIB in order to use the service; the link is provided strictly for the curious.

Sunday, April 26, 2015

File Transfer Tool Performance

On the HPC systems at ARC-TS we have two primary tools for transferring data: scp (secure copy) and Globus (GridFTP). Other tools such as rsync and sftp run over the same SSH transport as scp and thus will have comparable performance.

So which tool performs best? We will test two cases, each moving data to the XSEDE machine Gordon at SDSC: one moving a single large file, the other moving many small files.

Large File Case

For the large file we are moving a single 5GB file from Flux's scratch directory to the Gordon scratch directory.  Both filesystems can move data at GB/s rates so the network or tool will be the bottleneck.

scp / gsiscp

[brockp@flux-login2 stripe]$ gsiscp -c arcfour all-out.txt.bz2 \
                             gordon.sdsc.xsede.org:/oasis/scratch/brockp/temp_project
india-all-out.txt.bz2                     100% 5091MB  20.5MB/s  25.6MB/s   04:08   

Duration: 4m:08s

Globus 

Request Time            : 2015-04-26 22:41:04Z
Completion Time         : 2015-04-26 22:42:44Z
Effective MBits/sec     : 427.089

Duration: 1m:40s  2.5x faster than SCP

Many File Case

In this case the same 5GB file was split into 5093 1MB files. Many may not realize that every file transferred carries per-file overhead, and that moving many small files is much slower than moving one larger file of the same total size. How much impact does this have, and can Globus help? Read on below.
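For reference, a set of small files like this can be produced with the split utility; the file and directory names here are just illustrative:

mkdir iobench
split -b 1M -d -a 4 all-out.txt.bz2 iobench/chunk.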

scp / gsiscp

[brockp@flux-login2 stripe]$ time gsiscp -r -c arcfour iobench \
                             gordon.sdsc.xsede.org:/oasis/scratch/brockp/temp_project/
real 28m9.179s

Duration: 28m:09s

Globus

Request Time            : 2015-04-27 00:18:40Z
Completion Time         : 2015-04-27 00:25:30Z
Effective MBits/sec     : 104.423

Duration: 7m:50s  3.6x faster than SCP

Conclusion

Globus provides a significant speedup over scp for both single large files and many smaller files. The advantage grows as the files get smaller, because scp transfers one file at a time and pays the per-file overhead serially.
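If you prefer a command-line client over the Globus web interface, a GridFTP transfer with multiple parallel streams and several files in flight at once can be requested with globus-url-copy. The options are real, but the endpoint hostname and paths below are illustrative and not a verified Gordon GridFTP endpoint:

# -p: parallel TCP streams per file, -cc: number of files transferred concurrently
globus-url-copy -r -p 4 -cc 4 \
    file:///scratch/myjob/iobench/ \
    gsiftp://gordon.sdsc.xsede.org/oasis/scratch/brockp/temp_project/iobench/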

Wednesday, April 1, 2015

Xeon Phis for Linear Algebra

Linear algebra is the backbone of many research codes, so much so that standard libraries were created to support both the primitives (matrix multiply, dot product) and higher-level operations (LU factorization, QR factorization). These libraries are BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).

Before I go further: never use the reference Netlib versions of BLAS and LAPACK. They are not tuned, and the performance difference can be dramatic. When using these routines you want your CPU vendor's own implementation. They all conform to the same API, so your code stays portable.

MKL, the Math Kernel Library, is Intel's implementation of BLAS, LAPACK, and many other solvers. Intel went further: MKL can automatically offload routines to Xeon Phi accelerators with no code modification.


module load mkl/11.2
icc dgemm_speed.cpp -mkl -DBLAS3 -DDIM=10000 -DMKL
qsub -I -l nodes=1:ppn=1:mics=1,pmem=20gb -A -q flux -l qos=flux -V

#CPU Only Result
[brockp@nyx5896 blas]$ ./a.out
Size of double: 8
Will use: 2288 MB
 Matrix full in: 7 sec
MKL Runtime: 89 sec.

#Single Phi
[brockp@nyx5896 blas]$ module load xeon-mic

This module sets the following defaults:

        MKL_MIC_ENABLE=1 enables auto MKL offload
        MKL_MIC_WORKDIVISION=MIC_AUTO_WORKDIVISION
        MKL_MIC_DISABLE_HOST_FALLBACK=1 unset if not always using MIC
        MIC_OMP_NUM_THREADS=240 Sane values 240/244 (60 core * 4 threads) 
        OFFLOAD_REPORT=2 Gives lots of details about the device when running, setting to 1 gives less information, unset to surpress

        If your application uses both Compiler Assisted Offload and Automatic Offload then it is strongly recommended to set OFFLOAD_ENABLE_ORSL=1. This env-variable enables the two offload modes to synchronize their accesses to coprocessors.

[brockp@nyx5896 blas]$ ./a.out 
Size of double: 8
Will use: 2288 MB
 Matrix full in: 6 sec
[MKL] [MIC --] [AO Function]    DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]  0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]      9.471196 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      6.818753 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1600000000 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 3200000000 bytes
MKL Runtime: 10 sec.

The key difference for automatic offload is that the xeon-mic module sets MKL_MIC_ENABLE=1, which lets MKL use the Phi (also known as a MIC) to take part in the computation. In this case, with no code modification, the single matrix multiply was accelerated 8.9x.
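If you are not using the xeon-mic module, the same behavior can be enabled by exporting the variables yourself before running, using the values described in the module output above:

export MKL_MIC_ENABLE=1                             # allow MKL automatic offload to the Phi
export MKL_MIC_WORKDIVISION=MIC_AUTO_WORKDIVISION   # let MKL split work between host and Phi
export MIC_OMP_NUM_THREADS=240                      # 60 cores * 4 threads
export OFFLOAD_REPORT=2                             # print per-call offload details
./a.out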

You can also use more than one Phi, and you can see speedups much greater than this when the data are larger or the routine is more complex, e.g., a full DGESV(), which calls the DGEMM() benchmarked here many times.

Saturday, January 10, 2015

Using InfiniBand with the MATLAB Parallel Computing Toolbox

In High Performance Computing (HPC) a number of network types are commonly used, among them Ethernet, the common network found on virtually all computer equipment, and InfiniBand, a specialty high-performance, low-latency interconnect common on commodity clusters. There are also several proprietary types and a few other less common ones, but I will focus on Ethernet and InfiniBand.

Ethernet, together with its companion protocol TCP, is the most widely supported MPI network. Almost all computer platforms support it, and a setup can be as simple as your home network switch; it is ubiquitous and easy to support. Networks like InfiniBand, on the other hand, require special drivers and less common hardware, but the effort is normally worth it.

The MATLAB Parallel Computing Toolbox provides a collection of functions that allow MATLAB users to employ multiple compute nodes to work on larger problems. Many may not realize that MathWorks chose to use standard MPI routines to implement this toolbox. For ease of use MathWorks also chose to ship MATLAB with the MPICH MPI library, and the version they ship only supports Ethernet for communication between nodes.

As noted, Ethernet is about the slowest network commonly used for parallel applications. The question is how much this can impact performance.

Mmmmm Data:

The data were generated on 12 Xeon X5650 nodes, 144 cores in total. The code was the stock MATLAB paralleldemo_backslash_bench(1.25) from MATLAB 2013b. You can find my M-code at Gist.

The data show two trends. The first is independent of the network type: many parallel algorithms do not scale unless the amount of data each core works on is sufficiently large, and in this case, over Ethernet especially, peak performance is never reached. The second trend is that the network really matters: without InfiniBand, at many problem sizes over half of the performance of the nodes is lost.

How do you get MATLAB to use InfiniBand?

MathWorks does not ship an MPI library with the Parallel Computing Toolbox that can use InfiniBand by default. This is reasonable; I would be curious how large the average PCT cluster is, and how big the jobs run on the toolbox are. Luckily for us, MathWorks provides a way to introduce your own MPI library. Let me be the first to proclaim:
Thank you MathWorks for adding mpiLibConf.m as a feature. -- Brock Palen
In the above test we used Intel MPI for the InfiniBand runs and MPICH for the Ethernet runs. The choice of MPI is important: the MPI standard enforces a shared API but not a shared ABI, so the MPI library you substitute needs to be compatible with the one MATLAB is compiled against. Luckily for us they used MPICH, so any MPICH derivative should work: MVAPICH, Intel MPI, etc.

If you are using the MATLAB Parallel Computing Toolbox on more than one node, and your cluster has a network other than Ethernet/TCP (there are non-TCP Ethernet networks that perform very well), I highly encourage putting in the effort to make sure you use that network.

For Flux users we have this configured, but you have to do some setup yourself before you see the benefit. Please visit the ARC MATLAB documentation, or send us a question at hpc-support@umich.edu.

Monday, September 29, 2014

Using FFTW For Optimum Performance

FFTW is probably the most popular DFT library ever. What many don't realize is that there are many ways to calculate an FFT, depending on the number of samples. FFTW addresses this by creating a plan, quickly searching for an efficient method for your particular transform size.

What is often not realized is that the number of samples should be kept to products of small primes (2, 3, 5, etc.). Calculating an FFT where the number of samples is a large prime is the worst case.

For example:

Plan for 999983 samples used: 99522862.000000 add's, 53927851.000000 multiplies, 42864846.000000 fused Multiply-Add

real    0m0.873s

Plan for 999984 samples used: 41417486.000000 add's, 28948213.000000 multiplies, 6618732.000000 fused Multiply-Add

real    0m0.312s

So in this sample code, going from 999983, a prime number, to 999984, whose prime factors are 2 x 2 x 2 x 2 x 3 x 83 x 251, resulted in a performance improvement of about 3x.
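A quick way to check a candidate sample count before running is the coreutils factor command:

$ factor 999983
999983: 999983
$ factor 999984
999984: 2 2 2 2 3 83 251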

Performance isn't always simple.

Friday, September 12, 2014

3D Rendering using Blender and CUDA on HPC Clusters

Blender is a popular open-source 3D modeling system. Recently the question was asked whether one can use the Flux cluster for 3D rendering. We were curious about this, and we wanted to support our students. Clusters like Flux, though, are normally built with scientific use in mind, and we didn't know whether what we had would support Blender. It turns out this was all easier than we thought.

blender-batch.sh
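The linked blender-batch.sh is not reproduced here, but a minimal batch wrapper for a cluster like Flux might look something like the sketch below; the account name and resource requests are placeholders, not recommendations:

#!/bin/bash
#PBS -N blender-render
#PBS -l nodes=1:ppn=1,pmem=8gb,walltime=1:00:00
#PBS -q flux
#PBS -l qos=flux
#PBS -A example_flux    # placeholder allocation name

cd $PBS_O_WORKDIR
# Render frame 1 of the scene in background (no GUI) mode.
blender -b scene.blend -o //frame_ -F PNG -f 1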

GPU Rendering with CYCLES and CUDA

What we found, though, is that we wanted to use the CYCLES renderer, and not only that, we wanted to run it on the FluxG GPU service. Why GPUs? Let's use the current standard CYCLES benchmark file and compare GPU to CPU performance.

Blender Rendering Benchmark (mpan)


Hardware          Tile Settings (XxY)   Time      Speedup
1 CPU E5-2670     16x16                 10m:17s   1x
4 CPU E5-2670     16x16                 2m:48s    3.7x
16 CPU E5-2670    16x16                 46s       13.4x
1 K20X GPU        256x256               40s       15.4x
2 K20X GPU        256x256               24s       25.7x
4 K20X GPU        256x256               18s       34.3x

Running Blender Better

So we know GPUs are much faster, but when run in the normal batch mode shown above, Blender ignores any settings you try to pass in from a Python script. We want to be able to control all GPU/CPU settings on the cluster without opening the GUI each time. The trick is to open your blend file from the Blender Python API and then change settings from there. Look at the tooltips in Blender: this API is powerful, and everything can be controlled from Python.
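With that approach the blend file no longer goes on the command line at all; a hypothetical render_gpu.py would open the file with bpy.ops.wm.open_mainfile(), switch the CYCLES device to the GPU, and then render. The batch invocation is then simply:

# render_gpu.py (not shown) opens the .blend and sets the CYCLES device before rendering
blender -b -P render_gpu.py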