Tuesday, September 23, 2014

ARC Fladoop Data Platform (Hadoop)

ARC and CAEN-HPC recently announced our second-generation Hadoop cluster. The first-generation cluster ran for about a year and served to validate the software and the usefulness of the platform. This new platform represents a real investment in hardware to match the software, enabling true data science at scale.
The Fladoop Cluster

One of the core components of Hadoop is a distributed file system known as HDFS. The power of Hadoop is its ability to break work up and move it to the chunks of the data, known as blocks, stored on drives local to the machines doing the work.
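As a rough sketch of how a file maps onto blocks (the 128 MB block size here is an assumption; the cluster's configured `dfs.blocksize` may differ):

```python
import math

def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return the number of HDFS blocks a file of the given size occupies.

    Assumes a 128 MB block size; the actual value is set by the
    cluster's dfs.blocksize configuration.
    """
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GB file splits into 8 blocks of 128 MB each,
# and each block can be processed on whichever node stores it.
print(split_into_blocks(1024 * 1024 * 1024))
```

Because each block can be read and processed independently, Hadoop can schedule one task per block on the node that already holds that block's data.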

Compared to a traditional compute-focused HPC cluster such as Flux, this feature of Hadoop requires nodes with many local hard drives and a fast TCP network for shuffling large quantities of data between nodes.

Fladoop, the ARC Hadoop cluster for data science, consists of 7 data nodes. Each node has 64 GB of main memory, 12 CPU cores, and 12 local hard drives. The network tying them together is 40 Gbit Ethernet, using Chelsio T580-LP-CR adapters all attached to an Arista 7050Q switch. This is 40 times the per-host bandwidth of standard gigabit Ethernet, and 4 times faster than high-performance 10 Gig-E.
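The bandwidth comparison works out as follows (a quick sanity check, taking standard gigabit Ethernet as the 1 Gbit baseline):

```python
fladoop_gbit = 40   # per-host link speed on Fladoop
standard_gbit = 1   # standard gigabit Ethernet
hpc_gbit = 10       # high-performance 10 Gig-E

print(fladoop_gbit / standard_gbit)  # 40 times standard Ethernet
print(fladoop_gbit / hpc_gbit)       # 4 times 10 Gig-E
```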

HDFS -- Hadoop Distributed File System
HDFS has some unique features that help with both data integrity and performance.  HDFS directly controls each hard drive independently.  Because hard drives fail, HDFS by default stores 3 copies of each block, on at least 2 unique nodes, and tracks how many copies are available at any time.  If a drive or node fails, HDFS automatically makes new copies from the remaining copies.
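The repair behavior can be sketched as a toy model (this is a simplified simulation of the idea, not HDFS's actual placement policy, which also accounts for racks and disk usage):

```python
import random

REPLICATION = 3  # HDFS default replication factor

def replicate(nodes, replication=REPLICATION):
    """Place copies of a block on distinct nodes (toy placement model)."""
    return set(random.sample(sorted(nodes), min(replication, len(nodes))))

def repair(replicas, failed_node, nodes, replication=REPLICATION):
    """After a node fails, re-copy from surviving replicas onto healthy nodes
    until the replication target is met again."""
    surviving = replicas - {failed_node}
    candidates = [n for n in sorted(nodes - {failed_node}) if n not in surviving]
    while len(surviving) < replication and candidates:
        surviving.add(candidates.pop())
    return surviving

nodes = {f"node{i}" for i in range(1, 8)}  # the 7 Fladoop data nodes
replicas = replicate(nodes)                # 3 copies on 3 distinct nodes
failed = next(iter(replicas))              # one of those nodes dies
replicas = repair(replicas, failed, nodes) # back to 3 copies, none on the dead node
print(len(replicas))
```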

These copies can also improve performance, at the cost of total available space. Because Hadoop tries to move work to where the data are local, there are now 3 candidate nodes for running a task locally before spilling over to accessing the data over the network.

Getting Started
Use of the new cluster is free of charge. People interested in using it should contact hpc-support@umich.edu to get an account, and look over the documentation.  ARC also provides consulting to explore whether the platform will work well for your application.

In future posts we will show how Hadoop and the packages that go with it, such as Hive, Pig, and Spark, are powerful and easy to use.