Monday, September 29, 2014

Using FFTW For Optimum Performance

FFTW is probably the most popular DFT library in use.  What many don't realize is that there are many ways to compute an FFT, depending on the number of samples.  FFTW addresses this by creating a plan: it searches over many pre-generated code fragments (codelets) and ways of decomposing the transform to quickly find a fast method for the requested size.

What is often not realized is that the number of samples should be kept to a product of small primes (2, 3, 5, etc.).  Computing an FFT where the number of samples is a single large prime is the worst case.

For example:

Plan for 999983 samples used: 99522862.000000 add's, 53927851.000000 multiplies, 42864846.000000 fused Multiply-Add

real    0m0.873s

Plan for 999984 samples used: 41417486.000000 add's, 28948213.000000 multiplies, 6618732.000000 fused Multiply-Add

real    0m0.312s

So in this sample code, going from 999983, a prime number, to 999984, whose prime factors are 2 x 2 x 2 x 2 x 3 x 83 x 251, resulted in a performance improvement of about 3x.
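The add/multiply/fused multiply-add counts above are the kind of numbers FFTW reports for a plan through its flop-counting interface; the timings are simple wall-clock measurements.  As a rough, minimal sketch of the same comparison in Python (assuming the pyfftw wrapper around FFTW is installed; the original test was presumably plain C):

import time
import numpy as np
import pyfftw

for n in (999983, 999984):  # a large prime vs. 2^4 * 3 * 83 * 251
    a = pyfftw.empty_aligned(n, dtype='complex128')
    b = pyfftw.empty_aligned(n, dtype='complex128')
    a[:] = np.random.standard_normal(n) + 1j * np.random.standard_normal(n)

    # Planning happens here: FFTW_MEASURE tries several decompositions
    # and keeps the fastest one it finds for this exact size.
    fft = pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))

    start = time.perf_counter()
    fft()  # execute the pre-planned transform
    print("%d samples: %.3f s" % (n, time.perf_counter() - start))

Because planning is done before the timed call, the difference you see reflects only the cost of executing the transform at each size.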

Performance isn't always simple.

Wednesday, September 24, 2014

4th Quarter XSEDE Allocations Research Proposal Dates

XSEDE allocation research request (XRAC) proposals are due October 15th. An overview of XSEDE is on the ARC website; in general, it is a large set of national resources for computational research. The resources are available at no cost, but do require a proposal for access. These proposals are simple, and most receive some sort of award/resource.

Brock Palen at ARC is the local XSEDE Campus Champion and is available for help learning, using, and accessing XSEDE resources.  He can be contacted at hpc-support@umich.edu.

Researchers who want to get started sooner and/or are new to XSEDE resources should use a Startup proposal; these are accepted year round and require significantly less documentation.

Tuesday, September 23, 2014

ARC Fladoop Data Platform (Hadoop)


ARC and CAEN-HPC recently announced our second generation Hadoop cluster. The first generation cluster ran for about a year and was used to validate the software and the usefulness of the platform.  This new generation platform represents a real investment in hardware to match the software and enable true data science at scale.
The Fladoop Cluster


Architecture
One of the core components of Hadoop is a distributed file system known as HDFS. The power of Hadoop is the ability to break work up and move it to the chunks of data, known as blocks, that sit on drives local to the machine doing the work.

Compared to a traditional compute-focused HPC cluster such as Flux, this feature of Hadoop requires nodes with many local hard drives and a fast TCP network for shuffling large quantities of data between nodes.

Fladoop, the ARC Hadoop cluster for data science, consists of 7 data nodes. Each of these nodes has 64 GB of main memory, 12 CPU cores, and 12 local hard drives.  The network tying them together is 40 Gbit Ethernet from Chelsio T580-LP-CR adapters, all attached to an Arista 7050Q switch.  This is 40 times the per-host bandwidth of standard gigabit Ethernet, and 4 times faster than standard high-performance 10 Gig-E.


HDFS -- Hadoop Distributed File System
HDFS has some unique features to help with both data integrity and performance.  HDFS controls each hard drive directly and independently.  Because hard drives fail, HDFS by default keeps 3 copies of the data, on at least 2 distinct nodes, and tracks how many copies are available at any time.  If a drive or node fails, HDFS automatically makes new copies from the remaining copies.

These copies can also improve performance, at the cost of total available space. Because Hadoop tries to move work to where the data are local, there are now 3 candidate locations for each block before spilling over to accessing the data over the network.

Getting Started
Use of the new cluster is at no cost. People interested in using it should contact hpc-support@umich.edu to get an account and look over the documentation.  ARC also provides consulting to explore whether the platform will work well for your application.

In future posts we will show how Hadoop and the packages that go with it (Hive, Pig, Spark, etc.) are powerful and easy to use.
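As a small preview, here is a minimal PySpark sketch (the file paths and application name are hypothetical) that counts words in a file stored on HDFS, letting Spark move the work to wherever the blocks live:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-preview")

counts = (sc.textFile("hdfs:///user/example/input.txt")   # read HDFS blocks in parallel
            .flatMap(lambda line: line.split())           # split each line into words
            .map(lambda word: (word, 1))                  # pair every word with a count of 1
            .reduceByKey(lambda a, b: a + b))             # sum the counts per word

counts.saveAsTextFile("hdfs:///user/example/wordcount-out")
sc.stop()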

Monday, September 22, 2014

ARC/Flux User Meetup

On Friday, September 26th, we will be hosting another ARC/Flux User Meetup. These are open get-togethers where the Flux operators and campus HPC support staff will be in one location. You can drop in, ask questions, learn new things, and generally get updated on ways you can further your computational work at Michigan.




Friday, September 12, 2014

3D Rendering using Blender and CUDA on HPC Clusters

Blender is a popular open source 3D modeling system.  Recently we were asked whether the Flux cluster could be used for 3D rendering. We were curious about this, and we wanted to support our students. Clusters like Flux, though, are normally built with scientific use in mind, and we didn't know if what we had would support Blender. It turns out this was all easier than we thought.

blender-batch.sh

GPU Rendering with CYCLES and CUDA

What we found, though, is that we wanted to use the CYCLES renderer, and not only that, we wanted to run it on the FluxG GPU service.  Why GPUs?  Let's use the current standard CYCLES benchmark file and compare GPU to CPU performance.

Blender Rendering Benchmark (mpan)


Hardware           Tile Settings (X x Y)   Time      Speedup
1 CPU E5-2670      16x16                   10m:17s   1x
4 CPU E5-2670      16x16                   2m:48s    3.7x
16 CPU E5-2670     16x16                   46s       13.4x
1 K20X GPU         256x256                 40s       15.4x
2 K20X GPU         256x256                 24s       25.7x
4 K20X GPU         256x256                 18s       34.3x

Running Blender Better

So we know GPUs are much faster, but when Blender is run in the normal batch mode above it ignores the settings you pass in a Python input. We want to be able to control all GPU/CPU settings on the cluster and not open the GUI each time.  The trick is to open your blend file through the Blender Python API and then change settings before rendering.  Look at the tooltips in Blender: this API is powerful, and everything can be controlled from Python.
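As a minimal sketch, a script along these lines can be passed to Blender in batch mode with blender -b scene.blend -P render_gpu.py (the file names here are hypothetical, and the property paths follow the 2.7x-era Cycles API, so they may differ in other Blender versions):

import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'

# Tell Cycles to use the CUDA devices instead of the CPU.
bpy.context.user_preferences.system.compute_device_type = 'CUDA'
scene.cycles.device = 'GPU'

# Larger tiles generally work better on GPUs (see the benchmark table above).
scene.render.tile_x = 256
scene.render.tile_y = 256

# Render the current frame and write the image to the output path set in the blend file.
bpy.ops.render.render(write_still=True)

Because the script runs after the blend file is loaded, it can override whatever device and tile settings were saved in the file, which is exactly the control we want on the cluster.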