Flux HPC: 2014

Thursday, December 4, 2014

XSEDE Domain Champions

The Campus Champion program has been a successful part of helping local campus researchers reach out and use national HPC resources as part of the XSEDE (ARC Notes) project. Our Campus Champion is Brock Palen, a member of the campus HPC support staff, who can be reached at hpc-support@umich.edu.

Campus Champions have worked well for general help, but what about those who need more detailed help in their domain? XSEDE recently started a new type of Champion, the Domain Champion. The local champion, Brock Palen, can put you in contact with them.

The current list of domains is:

Bioinformatics/Genomics
Data Analytics
Economics
Digital Humanities
Humanities
Molecular Dynamics

XSEDE plans to expand this list as Champions become available and demand is expressed.

If you have any questions, feel free to ask us at hpc-support@umich.edu.

Tuesday, December 2, 2014

Finding the Data Network Bottleneck with perfSONAR and BWCTL

UPDATE: ITS page on perfSONAR
Java based (no software install required) bandwidth test.

Networks are famous for getting the blame when things are slow. It would be wonderful to use tools like IPerf3 at points along the network to pin down whether the network is the problem and, if so, where it goes bad. We all love data, so how can we collect it?

The problem with IPerf is that a server has to be started on the remote end; if you don't have access to a server on the other end, you can't run a test. Enter perfSONAR, a way of registering network tests that allows both authenticated and anonymous bandwidth, ping, and other tests.

perfSONAR publishes a list of tests and limits what an external anonymous tester can run against it. By using a perfSONAR node on the network along your data path, you can find out whether the network can hit the speeds you expect. In this example we will focus on the Bandwidth Test Controller, or BWCTL. BWCTL handles the communication with the perfSONAR box and then relies on popular tools such as IPerf, IPerf3, Nuttcp, etc. to run the actual tests.

To run your tests you will need two things. First, an install of BWCTL with the plugins supported by the endpoints you use; use IPerf3, as most endpoints support it. Most major distributions have packages for BWCTL; if yours does not, you can build it from the sites linked above.

You will also need a perfSONAR server to test against. At Michigan, as part of upgrading the backbone to 100Gig and other links, ITS has installed a series of perfSONAR boxes in each datacenter near the network core. This is where you should start: make sure you get good performance between your machine and the core servers.

There is a directory of perfSONAR deployments worldwide. As of this writing there are 850 BWCTL servers in the directory. For a list of boxes at umich.edu or another domain, you can filter directly to that result. An example server is ntap-dc-mdc-10g.umnet.umich.edu. This is the perfSONAR server in the Modular Data Center, the datacenter where Flux and the data transfer node flux-xfer.engin.umich.edu are located.

With BWCTL and IPerf3 installed and the hostname of the perfSONAR server in hand, we can run tests:

bwctl -c "ntap-dc-mdc-10g.umnet.umich.edu:4823" -T iperf3 -t 20

[ ID] Interval Transfer Bandwidth Retr
[ 17] 0.00-20.04 sec 14.9 GBytes 6.38 Gbits/sec 0 sender
[ 17] 0.00-20.04 sec 14.9 GBytes 6.37 Gbits/sec receiver

If we have 0 retries and a decent bandwidth, things are looking good. Next, test the network in the other direction using the -o and -s options in place of -c:

bwctl -o -s "ntap-dc-mdc-10g.umnet.umich.edu:4823" -T iperf3 -t 20

[ ID] Interval Transfer Bandwidth Retr
[ 17] 0.00-20.04 sec 9.85 GBytes 4.22 Gbits/sec 0 sender
[ 17] 0.00-20.04 sec 9.85 GBytes 4.22 Gbits/sec receiver

Where to go from here?

Choose servers from the directory along the path your data travels; you can find the path using tools like traceroute or tracepath. Work with network administrators if the network does appear to be slowing your data transfer. When the network is the problem because of errors, speeds normally fall very low. If you are getting around 50% of the network's rated speed, as in our tests above, things are probably OK on the network side.
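The bandwidth figures in the test output can be checked by hand and compared with the link speed. The sketch below is only an illustration: it assumes that iperf3's "GBytes" are binary (2^30 bytes) while "Gbits/sec" is decimal (10^9 bits per second), and that the path is limited by a 10 Gbit/s link, as the "10g" in the server hostname suggests. The function names are ours.

```python
# Sanity-check the iperf3 numbers reported by bwctl in this post.
# Assumptions (not stated in the output itself): "GBytes" means 2**30 bytes,
# "Gbits/sec" means 10**9 bits per second, and the link is 10 Gbit/s.

def gbits_per_sec(gbytes_transferred, seconds):
    """Convert an iperf3 transfer total and duration into Gbit/s."""
    bits = gbytes_transferred * 2**30 * 8
    return bits / seconds / 1e9

def link_utilization(measured_gbits, link_gbits=10.0):
    """Fraction of the assumed link rate that the test achieved."""
    return measured_gbits / link_gbits

first_test = gbits_per_sec(14.9, 20.04)    # run with -c
second_test = gbits_per_sec(9.85, 20.04)   # run with -o -s

for label, rate in (("first test", first_test), ("second test", second_test)):
    print(f"{label}: {rate:.2f} Gbit/s, "
          f"{link_utilization(rate):.0%} of the assumed 10 Gbit/s link")
```

Both results land close to the bandwidth column printed by iperf3; the small difference comes from the rounded transfer totals.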

If the network isn't the problem, the file transfer protocol is likely poor and should be replaced with tools like bbcp or our recommended Globus. Lastly, make sure the storage system can read or write data at the speed of the network.

Monday, December 1, 2014

Classroom HPC Resources

As the fall semester comes to a close, those teaching need to start thinking about what resources they will need for their winter offerings. Many may not realize that there are several options for using HPC resources in courses at no cost.

LSA budgets the use of Flux, the on campus cluster
Many Departments also provide Flux funding, contact ARC for details
XSEDE Provides education allocations at no cost
Amazon and Microsoft support teaching on their Clouds

Depending on what is being taught, most classes would be served by the above. Other organizations also provide classroom/teaching access, such as NSF Blue Waters and DOE NERSC.

Any questions about the different resources can be directed to ARC at hpc-support@umich.edu. We can provide consulting, guidance, and support. This includes guest demo and training lectures in your classroom.

Tuesday, November 25, 2014

Optical Networking and Flux

A student recently asked us what kinds of optical networking systems are used in Flux. Thinking this information might be of interest to the community, we decided to post the answer here as well.

Flux uses two main networking technologies: InfiniBand within the cluster, and Ethernet to the rest of campus and the Internet.

The type of InfiniBand (IB) network we use is called QDR 4x, the QDR standing for quad data rate. In QDR IB, each data lane has a raw transmission rate of 10 Gbit/s and there are four lanes per connection, so each link has a raw data rate of 40 Gbit/s. QDR IB can run over copper or fiber optic cables, and the vast majority of the IB cables used in Flux are fiber optic. Our fiber IB cables come pre-terminated with QSFP connectors, so it is not entirely obvious what kind of lasers are used. That said, my understanding is that there are actually eight fiber strands in a QDR IB cable: four 10 Gbit/s strands for each direction of data transfer.
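The link-rate arithmetic for QDR 4x can be written out as a short sketch. It also applies the 8b/10b line encoding used by QDR InfiniBand, a detail the paragraph above does not mention, to estimate the usable data rate. The function name is ours.

```python
# Raw and usable data rate of a QDR 4x InfiniBand link.
# The 8b/10b encoding factor (8 data bits per 10 line bits) is an addition
# here; the post only quotes raw signalling rates.

LANE_RAW_GBITS = 10          # raw rate per lane for QDR
LANES_PER_LINK = 4           # the "4x" link width
ENCODING_EFFICIENCY = 8 / 10

def link_rates(lane_raw=LANE_RAW_GBITS, lanes=LANES_PER_LINK,
               efficiency=ENCODING_EFFICIENCY):
    """Return (raw, usable) rates in Gbit/s for one direction of a link."""
    raw = lane_raw * lanes
    return raw, raw * efficiency

raw, usable = link_rates()
print(f"raw: {raw} Gbit/s, usable after encoding: {usable:.0f} Gbit/s")
```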

On the Ethernet side, we use multiple 10 Gbit/s 10GBASE-SR links between the Flux access switches and their serving distribution switches. There are two distribution switches serving Flux; each has a 100 Gbit/s Ethernet link to the campus backbone and a 100 Gbit/s Ethernet link to the other distribution switch.

The 100 Gbit/s link between the distribution switches is of the 100GBASE-SR10 type, and the links between the distribution switches and the campus backbone are 100GBASE-LR4/ER4.

The ARC Data Science Platform (a.k.a. Fladoop or Flux Hadoop) uses nine 40 Gbit/s Ethernet connections within the datacenter; each of these is a 40GBASE-SR link.

Friday, November 7, 2014

Citing XSEDE Resources

If you are using XSEDE resources in any way please cite this paper:

John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D. Peterson, Ralph Roskies, J. Ray Scott, Nancy Wilkins-Diehr, "XSEDE: Accelerating Scientific Discovery", Computing in Science & Engineering, vol. 16, no. 5, pp. 62-74, Sept.-Oct. 2014, doi:10.1109/MCSE.2014.80

This will help show NSF that XSEDE is a valuable resource and should continue to be funded.

Saturday, November 1, 2014

Optimizing Star Cluster on AWS

In a previous post we pointed to instructions and some common tasks for using Star Cluster to create an HPC cluster on the fly in Amazon Web Services. Now we will focus on some options for optimizing your use of Star Cluster.

Cutting the Fat - Your Head Node is Too Big

By default Star creates a head node and N compute nodes. You select the instance type with NODE_INSTANCE_TYPE, but this same type is used for both the head/master and the compute nodes. In most cases your head node need not be so big. Uncomment MASTER_INSTANCE_TYPE and set it to a more modest instance size to control costs. A c3.large for your head node at $0.105/hr is much better than paying $1.68/hr for a c3.8xlarge, but you probably still want the c3.8xlarge for compute nodes because its cost per unit of CPU performance is lower.
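To see what the smaller head node saves, the hourly prices can be turned into a monthly figure. This is a rough sketch: the prices are the 2014 on-demand rates quoted in this post, while the 30-day month and the always-on head node are assumptions made for illustration.

```python
# Compare the cost of running the head node as c3.8xlarge versus c3.large.
# Prices are the on-demand figures quoted in this post; the 30-day month
# and the always-on head node are assumptions for illustration.

PRICE_PER_HOUR = {"c3.large": 0.105, "c3.8xlarge": 1.68}
HOURS_PER_MONTH = 24 * 30

def monthly_cost(instance_type, hours=HOURS_PER_MONTH):
    """On-demand cost in dollars of one instance running for `hours`."""
    return PRICE_PER_HOUR[instance_type] * hours

big = monthly_cost("c3.8xlarge")
small = monthly_cost("c3.large")
print(f"c3.8xlarge head node: ${big:.2f}/month")
print(f"c3.large head node:   ${small:.2f}/month")
print(f"savings:              ${big - small:.2f}/month")
```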

Use Spot For Lower Costs

With spot instances you bid the price you are willing to pay for AWS's extra capacity. Star can use spot, but only for compute nodes. Those are your main cost anyway, and you don't want your head node killed if prices rise above your bid.

Almost every Star command that starts an instance can be run as a bid, using the -b <bid price> option.

Switch Regions and/or Availability Zone

AWS is made up of regions, which in turn are made up of availability zones. Different regions have their own pricing; my favorites are Virginia (us-east-1) and Oregon (us-west-2) for the lowest prices. By default Star will use us-east-1, but I mostly switch to us-west-2. Why? Lower spot prices! The AWS console spot price history graphs for c3.8xlarge, the fastest single node on AWS, show the difference between the two regions.

c3.8xlarge us-east-1 24h price history

c3.8xlarge us-west-2 24h price history

The price of compute power on spot in us-west-2 is on average much lower than in us-east-1. Be sure to really think about how spot works: you can bid high, and it is then possible to pay more than the on-demand rate for a few hours. But bidding high keeps your nodes from being killed, and the total spend, the area under the curve, should still be much lower than what you would have paid on demand.
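The "area under the curve" idea can be sketched with a toy model. The hourly price series below is made up for illustration and is not AWS data, and the billing model is simplified: you pay the spot price for each hour while it stays at or below your bid, and the node is killed once the price exceeds the bid.

```python
# Total spend is the area under the hourly price curve. While your bid is
# at or above the spot price you pay the spot price, not your bid.
# The price series below is made up for illustration; it is not AWS data.

ON_DEMAND = 1.68  # c3.8xlarge on-demand $/hr quoted earlier in this post

hypothetical_spot_prices = [0.30, 0.28, 0.35, 2.10, 1.90, 0.32, 0.29, 0.31]

def spot_spend(prices, bid):
    """Sum hourly charges until the price exceeds the bid (node is killed)."""
    total = 0.0
    hours = 0
    for price in prices:
        if price > bid:
            break
        total += price
        hours += 1
    return total, hours

high_total, high_hours = spot_spend(hypothetical_spot_prices, bid=3.00)
low_total, low_hours = spot_spend(hypothetical_spot_prices, bid=0.50)
on_demand_total = ON_DEMAND * len(hypothetical_spot_prices)

print(f"high bid: ran {high_hours} h for ${high_total:.2f}")
print(f"low bid:  ran {low_hours} h for ${low_total:.2f}")
print(f"on demand for the same window: ${on_demand_total:.2f}")
```

In this made-up series the high bid pays above the on-demand rate for two hours yet still spends less overall, while the low bid loses its nodes at the first price spike.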

Changing regions in Star Cluster:

Update the region name, the region host (the machine that accepts AWS API commands from Star), and the availability zone.

#get values from:
$ starcluster listregions
$ starcluster listzones

AWS_REGION_NAME = us-west-2
AWS_REGION_HOST = ec2.us-west-2.amazonaws.com
AVAILABILITY_ZONE = us-west-2c
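When switching regions it is easy to update one of these values and forget another. The following sketch checks that the three settings agree. It is not part of Star Cluster: the section names in the sample text and the ec2.<region>.amazonaws.com host pattern are assumptions based on the values shown above, and region_mismatches is a hypothetical helper.

```python
import configparser

# Check that the region settings in a StarCluster-style config agree.
# The section names and the sample text are illustrative assumptions, and the
# ec2.<region>.amazonaws.com host pattern is inferred from the values above.

SAMPLE_CONFIG = """
[aws info]
AWS_REGION_NAME = us-west-2
AWS_REGION_HOST = ec2.us-west-2.amazonaws.com

[cluster smallcluster]
AVAILABILITY_ZONE = us-west-2c
"""

def region_mismatches(config_text, cluster="smallcluster"):
    """Return a list of human-readable problems; empty if all is consistent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    region = parser["aws info"]["AWS_REGION_NAME"]
    host = parser["aws info"]["AWS_REGION_HOST"]
    zone = parser[f"cluster {cluster}"]["AVAILABILITY_ZONE"]

    problems = []
    if host != f"ec2.{region}.amazonaws.com":
        problems.append(f"region host {host} does not match region {region}")
    if not zone.startswith(region):
        problems.append(f"zone {zone} is not in region {region}")
    return problems

print(region_mismatches(SAMPLE_CONFIG) or "region settings are consistent")
```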

Create a new key pair for each region: repeat the key creation step from the Star Cluster Quick Start with a new name for the key, and update the [key key-name-here] section in your cluster config.

$ starcluster createkey mykey-us-west-2 -o ~/.ssh/mykey-us-west-2.rsa

[key mykey-us-west-2]
KEY_LOCATION=~/.ssh/mykey-us-west-2.rsa

The AMI names are also per region, so when you switch regions you need to update the name of the image to boot. In general, select an HVM image.

#get list of AMI images for region
$ starcluster listpublic

NODE_IMAGE_ID=ami-80bedfb0

Get a Reserved Instance for your Head/Master Node

If you know you are going to have your Star cluster up and running for a decent amount of time, look into Reserved Instances. Discounts of close to 50% can be had compared to on-demand pricing. There are also light, medium, and heavy reserved types, which match how often you expect your Star head/master node to be running. Discounts vary by instance type and term, so refer to AWS pricing to figure out if this makes sense for you. You can even buy the remaining reserved time from another AWS user, or sell your unused remaining reserved contract, on the Reserved Instance Marketplace. Be careful: reserved contracts are tied to regions and availability zones, so if you plan to move between these to chase lower spot costs your contract won't follow you.

Switch to Instance Store

By default Star uses EBS volumes for the disk of each node. While this is very simple and lets data stored on the worker nodes persist even when they are shut down, EBS has an extra cost. The cost of a few GB of EBS will be small compared to the compute costs, but if you plan to have a large cluster, it can add up to real money. Consider instance storage, which Star supports. With instance store, the compute node boots by copying the Star AMI image.

Most users' clusters will not be large enough or run long enough for this to matter; contact hpc-support@umich.edu if you want to switch. Just remember to terminate your cluster, not just stop it. If you only stop it, the EBS volumes remain.

Building Cloud HPC Clusters with Star Cluster and AWS

Researchers who have been traditional users of HPC clusters have been asking how they can make use of Amazon Web Services (AWS), aka the cloud, to run their workloads. While AWS gives you a great hardware infrastructure, it is really just renting you bare metal machines by the hour.

I should stress that using AWS is not magic. If you are new to cloud computing, there is a lot you need to know to avoid extra costs or the risk of losing data. Before you start, contact ARC at hpc-support@umich.edu.

Admins of HPC clusters know that it takes a lot more than metal to make a useful HPC service which is what researchers really want. Researchers don't want to spend time installing and configuring queueing systems, exporting shared storage, and building AMI images.

Luckily for the community, the nice folks at MIT created Star Cluster. Star Cluster is really a set of prebuilt AMIs and a set of Python codes that use the AWS API to create HPC clusters on the fly. Their AMIs also include many common packages such as MPI libraries, compilers, and Python packages.

There is a great Quick-Start guide from the Star team. Users can follow this, but HPC users at the University of Michigan can use the ARC cluster Flux, which has Star Cluster installed as an application. Users only need a user account with access to the login node to then create clusters on AWS.

module load starcluster/0.95.5

Following the rest of the Quick-Start guide will get your first cluster up and running.

Common Star Cluster Tasks

Switch Instance Type

AWS offers a number of instance types, each with its own features and costs. You switch your instance type with NODE_INSTANCE_TYPE.

NODE_INSTANCE_TYPE=m3.medium

Make a shared Disk on EBS

EBS is the persistent storage service on AWS. In Star you can create an EBS volume, attach it to your master node, and share it across your entire cluster. Be careful that you don't leave your volumecreator cluster running. Use the -s flag to createvolume and also check your running clusters with listclusters.

Add/Remove Nodes to Cluster

This is very important when using cloud services. Don't leave machines running that you don't need. Compute nodes should be nothing special and should be clones of each other. Using the commands addnode and removenode, clusters can be resized to current needs. In general, if you are not computing you should remove all your compute nodes, leaving only the master/head node with the shared disk to get data. You can still queue jobs in this state and then start nodes.

$ starcluster addnode -n <number> <clustername>
$ starcluster removenode -n 3 smallcluster

Monday, October 27, 2014

On Campus Research Computing Symposium

Details and Registration Links November 6th, Rackham Building.

The Flux staff attends these events, so if you come, say hello!

The ARC Research Computing Symposium is a twice-yearly conference held on campus that brings together more than 200 U-M researchers who rely on large-scale computational tools and methods for research. Formerly known as Cyberinfrastructure (CI) Days, the symposium is an opportunity to learn more about advanced computing technologies that are helping to spur new discoveries and breakthroughs in a wide range of disciplines.

Edward Seidel, Director of the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign
The Data-Enabled Revolution in Science and Society: A Need for National Data Services and Policy
Marc Snir, Director of the Mathematics and Computer Science Division at Argonne National Laboratory
High Performance Computing: Exascale and Beyond
Leslie Greengard, Director, Simons Center for Data Analysis, Simons Foundation; Professor, Courant Institute of Mathematical Sciences, New York University
Fast, Accurate Tools for Physical Modeling in Complex Geometry
Gonçalo Abecasis, Chair of the Biostatistics Department and Felix E. Moore Collegiate Professor of Biostatistics, U-M
Biostatistics: Bringing Big Data to Genetics, Biology and Medicine
Sharon Glotzer, Stuart W. Churchill Collegiate Professor of Chemical Engineering; Professor of Material Science and Engineering, Macromolecular Science and Engineering, and Physics, U-M
Discovery and Design of Digital Matter
Scott Page, Leonid Hurwicz Collegiate Professor of Complex Systems and Political Science; Professor of Economics; and Director, Center for the Study of Complex Systems, U-M.
Diversity + Ability

Wednesday, October 15, 2014

Running MATLAB on XSEDE Resources

MATLAB is one of the most popular prototyping systems available in research computing. It is a set of powerful tools all wrapped in a language similar to FORTRAN with object oriented features.

XSEDE is a set of national computing resources that researchers can apply for. It is one of a set of national programs, including DOE facilities, Blue Waters, and others, that power a large share of national research computing.

Currently it is very difficult to combine the two sets of resources. MATLAB is commercial and getting access to a license that you can use on XSEDE can be problematic due to license cost and technical complexity.

MCC - The MATLAB Compiler

MCC (Flux Docs) is a toolbox/add-on to MATLAB that allows the wrapping of mcode into a standalone executable. A feature of this executable is that it can be run anywhere, using any functionality that was available at the site where it was compiled. Thus MATLAB programs can be moved to a resource such as XSEDE within your license terms.

The downside is that you cannot modify your mcode on the XSEDE resource. You have to make any design changes where you have your MATLAB license and MCC license. This limitation is mitigated by the fact that MCC can compile functions, and arguments can be passed on the command line. Of course, MCC-compiled code can also read from any files that regular MATLAB can. So if your code is stable but you are running different inputs, that is no problem.

The How-To

This will be in two parts: what needs to be done on Flux or another machine with MATLAB and MCC installed, and what needs to be done to run the result on the XSEDE resource. In this example I am going to use Stampede at TACC.

The example code is implicitthreads.m which solves a system of equations using the \ operator in MATLAB. It is implemented as a function, taking the number of unknowns.

Compile on Flux

(GIST flux.sh)

Setup MCR on Stampede

(GIST stampede.sh)

Run on Stampede

(GIST run-mcc.sh)

Wednesday, October 1, 2014

GLCPC Blue Waters Allocations Call Open

The Great Lakes Consortium for Petascale Computation (GLCPC) is the easiest way for moderate users of HPC CPU/GPU resources to gain access to and use the Blue Waters supercomputer.

Proposals are due November 3rd, full details are available at the GLCPC website.

What is the difference between GLCPC Blue Waters allocations and the NSF Petascale Computing Resource Allocation (PRAC) program for Blue Waters?

The bar is higher for the NSF allocations; they expect users to have significant resource needs and to run at extreme scale. If your work falls into this area please apply to the NSF program, for which proposals are due November 14th. GLCPC is only granted a small portion of the Blue Waters system, about 3.5 million node hours. GLCPC allocations are also only available to GLCPC members, of which Michigan is one.

The availability of time on Blue Waters is broken down as:

80% or more NSF Petascale Computing Resource Allocation (PRAC) program
7% University of Illinois at Urbana-Champaign
2% Great Lakes Consortium for Petascale Computation
1% Teaching/Education

For most HPC users access via GLCPC is the most appropriate choice, and for a small subset the NSF Petascale program also applies.
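To get a feel for the scale of each share, the percentages can be combined with the GLCPC figure. This is only a rough estimate: it assumes the roughly 3.5 million node hours quoted for GLCPC corresponds exactly to its 2% share, and the implied total is our own derivation, not an official number.

```python
# Rough scale of the Blue Waters shares listed in this post.
# Assumption: the ~3.5 million node hours quoted for GLCPC corresponds to its
# 2% share, which implies a total; this is an estimate, not an official figure.

GLCPC_NODE_HOURS = 3.5e6
SHARES = {"PRAC": 0.80, "UIUC": 0.07, "GLCPC": 0.02, "Education": 0.01}

def implied_total(node_hours, share):
    """Total node hours implied by one program's hours and its share."""
    return node_hours / share

total = implied_total(GLCPC_NODE_HOURS, SHARES["GLCPC"])
for name, share in SHARES.items():
    print(f"{name:10s} {share:4.0%}  ~{total * share / 1e6:.1f} million node hours")
```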

How does GLCPC compare to XSEDE?
When should I pursue XSEDE or GLCPC?

In the number of hours it can give away, GLCPC is about on par with small to medium XSEDE allocations. Currently XSEDE also does not have a long-lived, large-scale GPU/CUDA-capable machine, while Blue Waters offers that.

As always, contact us at hpc-support@umich.edu if you have questions.

Monday, September 29, 2014

Using FFTW For Optimum Performance

FFTW is probably the most popular DFT library ever. What many don't realize is that there are many ways to calculate an FFT depending on the number of samples. FFTW addresses this by creating a plan and using dynamic code generation to try to quickly calculate an optimal method.

What is less often realized is that the number of samples should be kept a product of small primes (2, 3, 5, etc.). Calculating an FFT where the number of samples is a single large prime is the worst case.

Eg:

Plan for 999983 samples used: 99522862.000000 add's, 53927851.000000 multiplies, 42864846.000000 fused Multiply-Add

real    0m0.873s

Plan for 999984 samples used: 41417486.000000 add's, 28948213.000000 multiplies, 6618732.000000 fused Multiply-Add

real    0m0.312s

So in this sample code going from 999983, a prime number, to 999984, whose prime factors are 2 x 2 x 2 x 2 x 3 x 83 x 251, resulted in a performance improvement of about 3x.
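A quick way to check a sample count before planning an FFT is to factor it. The sketch below is plain Python, not part of FFTW: it factors the two sizes used above and searches upward for the next size whose only prime factors are 2, 3, and 5. The function names are ours, and whether padding the input to such a size is acceptable depends on your application.

```python
# Factor FFT sizes and look for a nearby size built only from small primes.
# Plain Python for illustration; none of this is FFTW API.

def prime_factors(n):
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def is_smooth(n, primes=(2, 3, 5)):
    """True if n has no prime factors other than those in `primes`."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def next_smooth(n, primes=(2, 3, 5)):
    """Smallest size >= n whose prime factors all come from `primes`."""
    while not is_smooth(n, primes):
        n += 1
    return n

print(prime_factors(999983))   # a single large prime
print(prime_factors(999984))   # mostly small factors, but also 83 and 251
print(next_smooth(999983))     # candidate size using only 2, 3, and 5
```

Note that 999984 still contains the factors 83 and 251, so a size built only from small primes may perform better still.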

Performance isn't always simple.

Wednesday, September 24, 2014

4th Quarter XSEDE Allocations Research Proposal Dates

XSEDE allocation research request (XRAC) proposals are due October 15th. An overview of XSEDE is on the ARC website; in general it is a large set of national resources for computational research. They are available at no cost but do require a proposal for access. These proposals are simple, and most receive some sort of award.

Brock Palen at ARC is the local XSEDE Campus Champion and is available for help learning, using, and accessing XSEDE resources. He can be contacted at hpc-support@umich.edu.

Researchers who want to get started sooner and/or are new to XSEDE resources should use a Startup proposal, which is accepted year round and requires significantly less documentation.

Tuesday, September 23, 2014

ARC Fladoop Data Platform (Hadoop)

ARC and CAEN-HPC recently announced our second generation Hadoop cluster. The first generation cluster lasted about a year and was used to validate the software and the usefulness of the platform. This new generation platform represents a real investment in hardware to match the software and enable true data science at scale.

The Fladoop Cluster

Architecture

One of the core components of Hadoop is a distributed file system known as HDFS. The power of Hadoop is the ability to break work up and move it to chunks of the data, known as blocks, on a drive local to the machine doing the work.

Compared to a traditional compute-focused HPC cluster such as Flux, this feature of Hadoop requires nodes with many local hard drives and a fast TCP network for shuffling large quantities of data between nodes.

Fladoop, the ARC Hadoop cluster for Data Science, consists of 7 data nodes. Each of these nodes has 64GB of main memory, 12 CPU cores, and 12 local hard drives. The network tying them together is 40Gbit Ethernet using Chelsio T580-LP-CR adapters, all attached to an Arista 7050q switch. This is 40 times higher bandwidth per host than standard Ethernet, and 4 times faster than standard high performance 10Gig-e.

HDFS -- Hadoop File System

HDFS has some unique features that help with both data integrity and performance. HDFS directly controls each hard drive independently. Because hard drives fail, HDFS by default copies the data 3 times, to at least 2 unique nodes, and tracks how many copies are available at any time. If a drive or node fails, HDFS automatically makes new copies from the remaining copies.

These copies also can lead to additional performance at the cost of total available space. Because Hadoop tries to move work to where the data are local, there are now 3 possibilities to do that before spilling over to accessing the data over the network.
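The space cost of replication is easy to estimate. In this sketch the node count, drives per node, and replication factor come from the description above, but the drive size is a made-up placeholder, since the post does not state it. The estimate also ignores space reserved for the operating system and temporary data.

```python
# Usable HDFS capacity with replication. Node and drive counts come from the
# Fladoop description in this post; the drive size is a made-up placeholder.

DATA_NODES = 7
DRIVES_PER_NODE = 12
HYPOTHETICAL_DRIVE_TB = 3.0   # assumption, not the real drive size
REPLICATION = 3               # HDFS default described above

def hdfs_capacity_tb(nodes, drives, drive_tb, replication):
    """Return (raw, usable) capacity in TB for the given layout."""
    raw = nodes * drives * drive_tb
    return raw, raw / replication

raw_tb, usable_tb = hdfs_capacity_tb(
    DATA_NODES, DRIVES_PER_NODE, HYPOTHETICAL_DRIVE_TB, REPLICATION)
print(f"raw: {raw_tb:.0f} TB, usable with {REPLICATION}x replication: "
      f"{usable_tb:.0f} TB")
```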

Getting Started

The use of the new cluster is at no cost. People interested in using it should contact hpc-support@umich.edu to get an account, and look over the documentation. ARC also provides consulting for exploring if the platform will work well for your application.

In future posts we will show how Hadoop and the packages that go with it, such as Hive, Pig, and Spark, are powerful and easy to use.
