Flux HPC: 2015

Tuesday, September 29, 2015

Flux 7 Node Final Configuration

For details please refer to: Next Flux Bulk Purchase

We have the final configuration for Flux 7. The nodes are:

2 x E5-2680V3 (24 total cores)
8 x 16GB DDR4 2133MHz (128GB)
4TB 7200RPM Drive
EDR ConnectX-4 100Gbps InfiniBand Adapter

For price contact hpc-support@umich.edu

Desired quantities must be reported by 5pm Tuesday, October 13th.

Thursday, September 17, 2015

Globus endpoint sharing available to UM researchers

We have described in a number of blog posts some features and benefits of using the Globus file transfer service. Now that UM is a Globus Provider, you have a new feature available to you: sharing directories and files with collaborators who are also Globus users.

There are two avenues of sharing available to you now. The first is via standard server endpoints that have sharing enabled, and the second is via "Globus Connect Personal" client endpoints. Today I will describe sharing for standard server endpoints only. Sharing for Globus Connect Personal endpoints is a bit more complicated due to differences between OS versions of the client and will be described later.

To see if the endpoint you use has sharing enabled, navigate to the endpoint in "Manage Endpoints" within the Globus web interface and click on the Sharing tab. Note that you may have to activate (log in to) a session on the endpoint first. If sharing is enabled you will be told so and will see an "Add Shared Endpoint" button in the panel. Shared endpoints are essentially sub-endpoints that you can create and to which you can give any other Globus user access.

Let's go ahead and make a shared endpoint from umich#flux by clicking on the button. You are presented with a web form asking for the required information:

Host Path ~/Test_share

You can either give a complete absolute path or use the Unix shorthand (~/) for your home directory, as I have done (make sure the shared directory exists first!).

New Endpoint Name traeker#Test_share

Description Let others know what this is about.

Clicking on the "Create and Manage permissions" button creates the shared endpoint and presents you with a new panel to manage permissions. It shows you the current access setting and clicking on the "Add Permission" button presents you with a number of options of how to share this endpoint with other Globus users.

Share With check which one to use among email, user, group, all users

Permissions check one or both of read, write

A couple of things to keep in mind as you set these parameters:

Be careful about choosing all users as this will allow all users logged into Globus to access this share.
By default only read permission is set. If you allow write permission you could get files containing viruses and also get yourself into trouble with any disk usage quotas.

One easy way to manage permissions to a large group of people is to create a Globus group and populate it with users. Be advised that the entire group will have the same permissions so if you need some users to have different permissions, you either create a different group or add each user to the share individually. Using groups comes in handy when you have multiple shared directories to similar sets of collaborators.

Once a directory is shared with another Globus user, he or she can find the endpoint name via the "shared with me" filter at the top of the endpoint list panel. With the name in hand, they can transfer files to and from that endpoint by typing the name into the "Transfer Files" screen, just like any other endpoint they have access to.

You can go back to this shared endpoint to add new or edit any access settings.

Globus endpoint sharing is very powerful, as it gives non-UM collaborators access to your research data without your having to create a UM "Sponsored account" for them to access your systems. This is very similar to other cloud file sharing services like Box and Dropbox. The big difference is that Globus does not store the data, so quotas are governed by your own system's policies.

Saturday, August 22, 2015

Flux High-Speed Data Transfer Service

Do you have a large data set on your own storage equipment that you would like to process on Flux? We can accommodate up to 40 gigabits per second of data transfer in and out of Flux via the campus Ethernet backbone. There is no additional cost to use this service, but you do need to contact us in order to set it up.

By default, network traffic between Flux compute nodes and other systems on campus takes place over standard one gigabit Ethernet connections. This is sufficient for modest amounts of traffic such as that generated by administrative tasks, monitoring, and home directory access.

Traffic between Flux and its high-speed /scratch filesystem runs over a separate 40 gigabit per second InfiniBand network within the datacenter, and data between Flux and off-campus systems on the Internet can be staged through our transfer server at up to 10 gigabits per second. This would seem to leave a gap though: what if you want direct high-speed connections between the Flux nodes and other systems on campus? We provide such connections using a Mellanox BX5020 InfiniBand/Ethernet gateway:

The Flux BX5020 Gateway

The gateway connects to the Flux InfiniBand network and to the campus Ethernet network and allows traffic to flow between the two networks. The InfiniBand network runs at 40 gigabits per second, and the gateway has four 10 gigabit links to the campus Ethernet network. This allows any Flux node to communicate with any system on campus at up to 40 gbit/s.

We have a customer that has multiple petabytes of data on their own storage equipment which they have been using Flux to process. We mount this customer's NFS servers on Flux and route the traffic through the gateway. The customer is currently running jobs on Flux against two of their 10-gigabit connected servers, and last weekend they reached a sustained data transfer rate into Flux of 14.3 gigabits per second.

Gateway traffic for the week of 8/11/2015 - 8/18/2015

Although we have pushed more than 14 gbit/s through the gateway during testing, this is a new record for production traffic through the system.

Our gateway is currently connected to the Ethernet network at 40 gigabits per second, but it can be readily expanded to 80 and possibly 120 gigabits per second as needed. Additionally, we plan to replace the existing gateway in the near future with newer equipment. The planned initial bandwidth for the new equipment is 160 gbit/s, and there is room for growth even beyond that.
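To put these link speeds in perspective, here is a minimal sketch of how long it would take to move one petabyte at the rates mentioned in this post. It assumes a decimal petabyte (10^15 bytes) and a perfectly sustained rate with no protocol overhead, so treat the numbers as lower bounds rather than predictions.

```python
def transfer_time_hours(size_bytes: float, rate_gbit_per_s: float) -> float:
    """Hours needed to move size_bytes at a sustained rate given in gigabits/s."""
    bits = size_bytes * 8
    seconds = bits / (rate_gbit_per_s * 1e9)
    return seconds / 3600


PETABYTE = 1e15  # assumption: decimal petabyte

# Rates taken from the post: default 1 GbE, a single 10 GbE link,
# the 14.3 Gbit/s production record, and the full 40 Gbit/s gateway.
for rate in (1, 10, 14.3, 40):
    hours = transfer_time_hours(PETABYTE, rate)
    print(f"{rate:>5} Gbit/s: {hours:8.1f} hours per PB")
```

At the default one gigabit rate a petabyte takes months to move, which is why the gateway matters for data sets of this size.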

No changes to your network configuration are needed to use the gateway; those changes take place on our end only. All you have to do is export your storage to our IP ranges. If you want to discuss or get set up for this service, please let us know! Our email address is hpc-support@umich.edu and we will be happy to answer any questions you have.

If you are interested in the technical details of how the gateway works, this presentation from Mellanox on the Ethernet over InfiniBand (EoIB) technology used by the system should prove informative. There is no need to know anything about EoIB in order to use the service; the link is provided strictly for the curious.

Next Flux Bulk Purchase and Flux Operating Environment

Update 9/29: Final quoting took longer than expected; see our post for details. Additions to the order must be placed by October 13th.

Update 8/28: The date for expressing interest was extended to Tuesday, September 8th. After September 8th, a final pricing proposal will be sent to the vendors.

Flux will be purchasing new cores for the Standard and Large Memory Service. Because we realize that not all funding sources allow for the purchase of a service like Flux, we provide the Flux Operating Environment (FOE).

FOE is the Flux service minus the hardware, thus a grant that provides only hardware (capital) funds is able to add nodes to Flux, where ARC-TS provides login, storage, network, support, power, etc.

More importantly, grants submitted from LSA, COE, and the Medical School incur no cost for placing grant nodes in FOE. Thus the only cost to the researcher is the node itself, and the researcher is granted dedicated access to it.

Because Flux is going to be making a larger (4000 core) purchase, any faculty with such funds are invited to join in our purchase process. If you are interested, email hpc-support@umich.edu by August 28th with your node requirements.

Flux 7 2 socket nodes:

128GB Ram
2 x E5-2680V3 CPU (24 Total Core)
3TB 7200 RPM HDD or 1TB 7200RPM HDD
EDR Infiniband (100Gbit ConnectX-4)

Flux 7 4 socket nodes:

1024 - 2048GB Ram
4x E5 class CPU (40-48 core)
3TB 7200 RPM HDD
EDR Infiniband (100Gbit ConnectX-4)

Faculty purchasing their own via FOE can modify the drive and memory types and quantity to match their need and likely still get the bulk purchasing power by purchasing with Flux. Researchers who wish to purchase other specialty nodes (GPU, Xeon-PHI, FPGA, Hadoop/Spark, etc.) are still encouraged to contact us.

Sunday, August 2, 2015

XSEDE15 Updates

We recently returned from XSEDE15, where we represented Michigan and learned about the new resources and features coming online at XSEDE. What follows are our notes. There will be a one-hour live-streamed webinar on August 6th at 10:30am; if you have questions, please attend:

Webinar: ARC-TS XSEDE[15] Faculty and Student Update
Location: http://univofmichigan.adobeconnect.com/flux/
Time: August 6th 10:30am-11:30am
Login: (Select Guest, use uniquename)

Champions Program Update

Michigan currently participates in the Campus Champions program via the staff at ARC-TS. There are two newer programs that faculty and students might take interest in:

Domain Champions

Domain Champions are XSEDE participants like Campus Champions but sorted by field. These Champions are available nationally to help researchers in their fields even if they do not use XSEDE resources:

Domain	Champion	Institution
Data Analysis	Rob Kooper	University of Illinois
Finance	Mao Ye	University of Illinois
Molecular Dynamics	Tom Cheatham	University of Utah
Genomics	Brian Couger	Oklahoma State University
Digital Humanities	Virginia Kuhn	University of Southern California
Digital Humanities	Michael Simeone	Arizona State University
Chemistry and Material Science	Sudhakar Pamidighantam	Indiana University

Student Champion

The Student Champions program is a way for graduate students (preferred but not required) to get more plugged into supporting researchers in research computing. Michigan does not currently have any student champions. If you are interested contact ARC-TS at hpc-support@umich.edu.

New Clusters and Clouds

Many of the new XSEDE resources coming online or already available are adding virtualization capability. This ability is sometimes called cloud but can have subtle differences depending what resources you are using. If you have questions about using any of the XSEDE resources contact ARC-TS at hpc-support@umich.edu.

NSF and XSEDE have recognized that data plays a much larger role than in the past. Many of the resources have added persistent storage options (file space that isn't purged) as well as database hosting and other resources normally not found on HPC clusters.

Wrangler

Wrangler is a new data focused computer and is in production. Notable features are:

iRODS Service Available and persistent storage options
Can host long running reservations for databases and other services if needed.
600TB of Flash storage directly attached. This storage can change its identity to provide different service types (GPFS, Object, HDFS, etc.). Sustains over 4.5TB/minute terasort benchmark.

Comet

Comet is a very large traditional HPC system recently in production. It provides over 2 petaflops of compute mostly in the form of 47,000+ cpu cores. Notable features are:

Host Virtual Clusters, these are customized cluster images when researchers need to make modifications that are not possible in the traditional batch hosting environment.
36 nodes with 2x Nvidia k80 GPUs (4 total GPU dies / node)
SSD in each local node for fast local IO.
4 nodes with 1.5TB of RAM

Bridges

Bridges is a large cluster that will support more interactive work, virtual machines, and database hosting along with traditional batch HPC processing. Bridges is not yet in production. Some notable features are:

Nodes with 128GB, 3TB, and 12TB of RAM
Reservations for long running database, web server and other services
Planned support for Docker containers

Jetstream

Jetstream is a cloud platform for science. It is OpenStack based and will give researchers great control over their exact computing environment. Jetstream is not yet in production. Notable features are:

Libraries of VM's will be created and hosted in Atmosphere, researchers will be able to contribute their own images, or use other images already configured for their needs.
Split across two national sites geographically distant

Chameleon

Chameleon is an experimental environment for large-scale cloud research. Chameleon will allow researchers to deploy their images not only as virtual machines but also on bare metal. Chameleon is now in production. Some notable features are:

Geographically separated OpenStack private cloud
Not allocated by XSEDE but allocated in a similar way

CloudLab

CloudLab is a unique environment where researchers can deploy their own cloud to do research about clouds or on clouds. It is in production. Some notable features are:

Able to prototype entire cloud stacks under researcher control, or bare metal
Geographically distributed across three sites
Support multiple network types (ethernet, infiniband)
Supports multiple CPU types (Intel/X86, ARM64)

XSEDE 2.0

XSEDE was a 5 year proposal, and we are wrapping up year 4. The XSEDE award doesn't actually provide any of the compute resources; these have their own awards and are only allocated through the XSEDE process. A solicitation for another 5 years was issued, and a response is currently under review by NSF. The next generation of XSEDE aims to be even more inclusive and to focus more on data intensive computing.

XSEDE Gateways, Get on the HPC Train With Less Effort

We have written about XSEDE (Arc Docs) before, a set of national computing resources for research.

XSEDE Gateways on the other hand are simple, normally web-based front ends to the XSEDE computers for specific areas of interest. They lower the barrier to getting started utilizing super computers in research, and are a great educational tool also.

List of current XSEDE Gateways.

One might want to use a gateway for the following reasons:

Not comfortable with using super computers at the command line
Don't need the power of a huge system but need more than their laptop
Are looking for an easy way to introduce new users to an area of simulation
Undergraduate work supplementing course material

A snapshot of some portals (over 30 at this writing):

The iPlant Collaborative Agave API	Integrative Biology and Neuroscience	Visit Portal
VLab - Virtual Laboratory for Earth and Planetary Materials	Materials Research	Visit Portal
NIST Digital Repository of Mathematical Formulae	Mathematical Sciences	Visit Portal
Integrated database and search engine for systems biology (IntegromeDB)	Molecular Biosciences	Visit Portal
ROBETTA: Automated Prediction of Protein Structure and Interactions	Molecular Biosciences	Visit Portal
Providing a Neuroscience Gateway	Neuroscience Biology	Visit Portal
General Automated Atomic Model Parameterization	Physical Chemistry	Visit Portal
SCEC Earthworks Project	Seismology	Visit Portal
Asteroseismic Modeling Portal	Stellar Astronomy and Astrophysics	Visit Portal
CIPRES Portal for inference of large phylogenetic trees	Systematic and Population Biology	Visit Portal
Computational Anatomy	Visualization, Graphics, and Image Processing	Visit Portal

Thursday, July 9, 2015

XSEDE, HPC, and BigData for classroom use

For those who teach, fall classes are not that far away. Resources are available to support courses with computational needs.

HPC

The ARC cluster Flux is available for course work. Some schools cover the cost or subsidize the cost.
XSEDE (ARC-TS Docs.) is a set of free NSF machines for research as well as teaching. Teaching allocations can be had very easily but some lead time is required. Contact hpc-support@umich.edu for help or questions getting your course up and running on XSEDE.

BigData

The ARC Hadoop/Spark cluster is still free for any use as an exploratory technology.
The XSEDE Blacklight machine is unique for having 24TB of shared memory. Wrangler is a new cluster built around SSD's and able to run Hadoop and large datasets.
Amazon Web Services (ARC-TS Docs.) supports classroom use for all their resources, including Elastic MapReduce and others.

Cloud

Any questions can be directed to ARC-TS at hpc-support@umich.edu .

Thursday, July 2, 2015

Sending Data to Amazon AWS S3 storage

Researchers at UM have numerous storage options available to them on and off campus. In this post we focus on moving data to S3, Amazon's AWS cloud storage. This storage is fast and easily accessible from other AWS resources as well as UM systems.

To use S3 you first need to have an account in AWS and create what are called S3 buckets. Buckets can be created via the AWS web console or AWS Command Line Interface (CLI) tools on your local systems. Installation and setup instructions are available in the provided link. Below we shall assume this has already been done.

Let's go through a sample use case: creating an S3 bucket and sending a large backup file to it. First, if you have configured the AWS CLI tools correctly, they know your account and have full access to your S3 resources.

Now create an S3 bucket called "mybackups":

$ aws s3 mb s3://mybackups

To confirm creation and check contents use:

$ aws s3 ls s3://mybackups

Now let's copy the file backup.tar to that bucket:

$ aws s3 cp backup.tar s3://mybackups

In this test case I got 107 MB/s from my laptop, which is pretty awesome. This speed is largely due to two things: 1) the aws s3 cp command can break the file into numerous parts and send them to the bucket simultaneously, and 2) the route from UM to AWS is via Internet2, which can be 1-10 Gb/s depending on your particular uplink speed to the UM backbone. I can confirm that doing this from my home computer is exceedingly slow!
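The idea of breaking a file into parts and sending them in parallel can be sketched as follows. This is only an illustration of the technique, not the AWS CLI's actual implementation: the 8 MiB part size, the worker count, and the `fake_upload` stand-in are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_PART_SIZE = 8 * 1024 * 1024  # assumption: 8 MiB parts


def part_ranges(file_size, part_size=DEFAULT_PART_SIZE):
    """Return (start, end) byte offsets, end exclusive, that cover the file."""
    if file_size < 0 or part_size <= 0:
        raise ValueError("file_size must be >= 0 and part_size must be > 0")
    return [(start, min(start + part_size, file_size))
            for start in range(0, file_size, part_size)]


def fake_upload(byte_range):
    """Hypothetical stand-in for sending one part; returns the bytes 'sent'."""
    start, end = byte_range
    return end - start


def upload_in_parts(file_size, part_size=DEFAULT_PART_SIZE, workers=4):
    """Send every part through a thread pool and return the total bytes sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(fake_upload, part_ranges(file_size, part_size)))


print(len(part_ranges(5 * 1024**3)), "parts for a 5 GiB file")
```

Because the parts are independent, several can be in flight at once, which helps keep a fast link busy even when a single stream cannot.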

Confirm the file is in the bucket via:

$ aws s3 ls s3://mybackups

Some among you might say, "I do not have enough space on my system to make a temporary backup tar file." Fear not: you can pipe Unix utilities together to avoid this.

$ tar -czf - raeker | aws s3 cp - s3://mybackups/raeker.tgz

Alternatively you can use the aws s3 sync command! This functions much like the traditional unix rsync command to sync files between a source and target:

$ aws s3 sync my_directory s3://mybackups

Be warned though that if there are lots of files to sync you likely will not get anywhere near the 100 MB/s I got above. Also be advised that AWS charges for operations as well as storage so each file cp/put incurs a request operation towards the $0.005 per 1,000 requests!
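As a rough sketch of what the per-request charge means in practice, the function below estimates the request cost of uploading a given number of files at the $0.005 per 1,000 requests quoted above. It assumes one request per file, which ignores listing requests and the extra requests made when a large file is uploaded in multiple parts, and prices may have changed since this was written.

```python
def put_request_cost(num_files, price_per_1000=0.005):
    """Estimated request charge in dollars, assuming one request per file."""
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    return num_files / 1000 * price_per_1000


for count in (1_000, 100_000, 10_000_000):
    print(f"{count:>10,} files: ${put_request_cost(count):,.2f}")
```

The charge is small per file but grows linearly, so directories with millions of tiny files are worth bundling into a tar file first.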

You can also use this if you simply need a copy of your local files in an S3 bucket for use in, say, EC2 instances for computing.

Normally, sync only copies missing or updated files or objects between the source and target. However, you may supply the --delete option to remove files or objects from the target not present in the source.
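The sync behavior described above can be sketched as a small planning function. This is a simplified model, not the AWS CLI's code: it assumes each side is described by a mapping of relative path to (size, modification time) and treats a file as updated when the sizes differ or the source is newer.

```python
def is_updated(source_meta, target_meta):
    """True if the source copy differs in size or is newer than the target."""
    source_size, source_mtime = source_meta
    target_size, target_mtime = target_meta
    return source_size != target_size or source_mtime > target_mtime


def plan_sync(source, target, delete=False):
    """Return (paths_to_copy, paths_to_delete) for a one-way sync.

    source and target map relative path -> (size_in_bytes, mtime).
    """
    to_copy = sorted(path for path, meta in source.items()
                     if path not in target or is_updated(meta, target[path]))
    # Only the --delete style behavior removes files missing from the source.
    to_delete = sorted(path for path in target if path not in source) if delete else []
    return to_copy, to_delete


local = {"a.txt": (10, 100), "b.txt": (20, 200), "c.txt": (5, 50)}
bucket = {"a.txt": (10, 100), "b.txt": (20, 150), "old.txt": (1, 1)}
print(plan_sync(local, bucket))
print(plan_sync(local, bucket, delete=True))
```

Note that the delete list is empty unless it is asked for explicitly, which mirrors the safer default.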

Of course you can reverse the data flow by making s3://mybackups the source and a local file or folder the target!

In another blog post I will show you how you can automatically archive your S3 objects to the considerably cheaper Glacier storage. Stay tuned.

Monday, May 18, 2015

Large-scale Visualization of Volumes from 2d Images

The Visible Human project has a series of high resolution CT or MRI scans of human bodies. These images can be stitched together to make volume renderings of the original subject. First Images!

These images were generated from high resolution CT scans available here at Michigan. The data in this case is over 5000 2d slices in TIFF format for total data of around 34GB.

On standard systems, working with input data of this size is difficult, let alone the derived 3d volume. Luckily for us, we can use the Visit imgvol format, which is designed specifically for this case.

In the above example, 32 cores with 25GB of memory each (800GB total) on the Flux Large Memory nodes were used, with the Visit viewer running on my personal Apple laptop over a home network connection (!!). Memory use in the creation of the above plots ranged from 3GB/core to 7.5GB/core. Rendering wasn't interactive; a plot change took 15-45 seconds to redraw.
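The memory figures above are easy to turn into totals. The short sketch below just multiplies out the numbers from this post; the slice count and data size are the approximate values quoted ("over 5000" slices, "around 34GB"), so the per-slice size is only a rough estimate.

```python
def total_memory_gb(cores, gb_per_core):
    """Aggregate memory across all cores in a job."""
    return cores * gb_per_core


cores = 32
requested = total_memory_gb(cores, 25)   # memory requested for the job
used_low = total_memory_gb(cores, 3)     # low end of observed use
used_high = total_memory_gb(cores, 7.5)  # high end of observed use

slices = 5000      # approximate slice count from the post
dataset_gb = 34    # approximate input size from the post
mb_per_slice = dataset_gb * 1000 / slices

print(f"requested {requested} GB, used {used_low}-{used_high} GB")
print(f"roughly {mb_per_slice:.1f} MB per TIFF slice")
```

Even the low end of actual use is several times the size of the input data, which is why a single workstation struggles with a data set like this.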

The imgvol format is very simple and allowed for us to create these sorts of plots very quickly. Most users don't have such huge data and can run this on their personal lab workstations. If your workstation isn't sufficient feel free to reach out to ARC-TS at hpc-support@umich.edu

Tuesday, May 5, 2015

Summer XSEDE Parallel Programming BootCamp

Interested in learning how to do parallel computer programming? On June 16-19, 2015 we will be hosting an XSEDE summer bootcamp to teach various aspects of parallel programming using MPI, OpenMP, OpenACC, and more. The event will be held in room 2255 NorthQuad. Registration is required and free at the XSEDE registration site.

Below is the planned schedule:

Tuesday, June 16

11:00 Welcome
11:15 Computing Environment
11:45 Intro to Parallel Computing
12:30 Intro to OpenMP
1:30 Lunch Break
2:30 Exercise 1
3:15 More OpenMP
4:30 Exercise 2
5:00 Adjourn

Wednesday, June 17
11:00 Intro to OpenACC
12:00 Exercise 1
12:30 Introduction to OpenACC (cont.)
1:00 Lunch Break
2:00 Exercise 2
2:45 Introduction to OpenACC (cont.)
3:00 Using OpenACC with CUDA Libraries
3:30 Advanced OpenACC
4:00 OpenMP 4.0 Sneak Peek
5:00 Adjourn

Thursday, June 18
11:00 Introduction to MPI
1:00 Lunch break
2:00 Intro Exercises
3:10 Intro Exercises Review
3:15 Scalable Programming: Laplace code
3:45 Laplace Exercise
5:00 Adjourn

Friday, June 19
11:00 Laplace Exercise Review
12:30 Laplace Solution
1:00 Lunch break
2:00 Advanced MPI
3:00 Outro to Parallel Computing
4:00 Hybrid Computing
4:30 Hybrid Competition
5:00 Adjourn

Sunday, April 26, 2015

Hive a high performance replacement for SQL databases

SQL is gaining popularity as more researchers work with structured data. Rather than reimporting data every session, keeping the data persistent in a relational database (RDBMS) and querying it with SQL is a significant improvement.

The problem with standard RDBMS systems is that their algorithms are often serial and hampered by the need to keep transactions consistent (think keeping bank deposits and debits in order). These transactional guarantees are known as ACID.

In many research cases though researchers do not need transactions, they have data and they just want to query, or their data is append only such as new measurements. By relaxing the transactions needs researchers can use a whole host of new methods that are very scalable.

Enter Apache Hive. Hive is a data warehouse tool that lets data on a Hadoop cluster (such as the cluster at ARC-TS) be queried using SQL syntax. Even for large tables, into the thousands of GBytes of data, performance is consistent.

In this example I have data in CSV format from a database. It has 12 columns and 1,487,169,693 rows, and the total size is about 880GB of raw data. Once the data is in Hadoop and I have created a table from it, I can use Hive to query it just like any other SQL table.

SELECT COUNT(*) FROM sample_table;

OK

1487169693

Time taken: 75.875 seconds, Fetched: 1 row(s)

This query requires a full table scan: Hive works on the raw text data and must read all of it. At 75.9 seconds, the ARC-TS Hadoop cluster scans the data at about 11GB/s. Hive maintains this performance for more complex queries as well.

SELECT AVG(sample_column) FROM sample_table;

OK

0.011386917827452752

Time taken: 81.488 seconds, Fetched: 1 row(s)
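The scan rates implied by the two timings above can be checked with a few lines of arithmetic. This sketch uses the approximate 880GB data size and the exact row count quoted in this post; since the size is approximate, so are the rates.

```python
def scan_rate(data_gb, rows, seconds):
    """Return (GB per second, rows per second) for a full table scan."""
    return data_gb / seconds, rows / seconds


DATA_GB = 880            # approximate raw data size from the post
ROWS = 1_487_169_693     # row count reported by the COUNT(*) query

count_gb_s, count_rows_s = scan_rate(DATA_GB, ROWS, 75.875)
avg_gb_s, avg_rows_s = scan_rate(DATA_GB, ROWS, 81.488)

print(f"COUNT: {count_gb_s:.1f} GB/s, {count_rows_s / 1e6:.1f} million rows/s")
print(f"AVG:   {avg_gb_s:.1f} GB/s, {avg_rows_s / 1e6:.1f} million rows/s")
```

Both queries land in the same range, which is what you would expect when reading the data dominates the runtime.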

Researchers who work with a lot of structured data will find SQL on Hive to be intuitive and very powerful. Because queries are spread across the cluster, Hive removes many of the query performance and data size limits imposed by a conventional database.

For many researchers, working with SQL or Hadoop is new and daunting, but both are part of the new BigData ecosystem. Please contact ARC-TS at hpc-support@umich.edu and one of our staff can help you with your data.

Filetransfer Tool Performance

On the HPC systems at ARC-TS we have two primary tools for transferring data: scp (secure copy) and Globus (GridFTP). Other tools like rsync and sftp operate over SSH, as scp does, and thus will have comparable performance.

So which tool performs the best? We are going to test two cases, each moving data to the XSEDE machine Gordon at SDSC. The first test moves a single large file; the second moves many small files.

Large file case.

For the large file we are moving a single 5GB file from Flux's scratch directory to the Gordon scratch directory. Both filesystems can move data at GB/s rates so the network or tool will be the bottleneck.

scp / gsiscp

[brockp@flux-login2 stripe]$ gsiscp -c arcfour all-out.txt.bz2 \
                             gordon.sdsc.xsede.org:/oasis/scratch/brockp/temp_project

india-all-out.txt.bz2                     100% 5091MB  20.5MB/s  25.6MB/s   04:08   

Duration: 4m:08s

Globus

Request Time : 2015-04-26 22:41:04Z

Completion Time : 2015-04-26 22:42:44Z

Effective MBits/sec : 427.089

Duration: 1m:40s, 2.5x faster than SCP

Many File Case

In this case the same 5GB file was split into 5093 1MB files. Every file carries overhead, and it is well known that moving many small files is much slower than moving one large file of the same total size. How large is the impact, and can Globus help with it? Read on.

scp / gsiscp

[brockp@flux-login2 stripe]$ time gsiscp -r -c arcfour iobench \
                             gordon.sdsc.xsede.org:/oasis/scratch/brockp/temp_project/

real 28m9.179s

Duration: 28m:09s

Globus

Request Time : 2015-04-27 00:18:40Z

Completion Time : 2015-04-27 00:25:30Z

Effective MBits/sec : 104.423

Duration: 6m:50s, 4.1x faster than SCP

Conclusion

Globus provides significant speedup both for single large files and many smaller files over scp. The result is even more significant the smaller the files because of the overhead in scp doing one file at a time.
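For anyone who wants to check the arithmetic, here is a small sketch that recomputes the Globus durations directly from the Request and Completion timestamps above and compares them with the scp timings. By the timestamps, the many-file Globus transfer spans 6m50s. The per-file overhead estimate at the end is a crude model that attributes all of the extra scp time in the many-file case to per-file costs.

```python
from datetime import datetime

STAMP = "%Y-%m-%d %H:%M:%SZ"  # format of the Globus timestamps


def elapsed_seconds(request, completion):
    """Seconds between a Globus Request Time and Completion Time."""
    delta = datetime.strptime(completion, STAMP) - datetime.strptime(request, STAMP)
    return delta.total_seconds()


# Timings taken from the output in this post.
scp_large = 4 * 60 + 8          # 04:08
scp_many = 28 * 60 + 9.179      # real 28m9.179s
globus_large = elapsed_seconds("2015-04-26 22:41:04Z", "2015-04-26 22:42:44Z")
globus_many = elapsed_seconds("2015-04-27 00:18:40Z", "2015-04-27 00:25:30Z")

print(f"large file: Globus {globus_large:.0f}s, {scp_large / globus_large:.2f}x faster")
print(f"many files: Globus {globus_many:.0f}s, {scp_many / globus_many:.2f}x faster")

# Crude estimate: extra scp time in the many-file case, spread over 5093 files.
scp_overhead_per_file = (scp_many - scp_large) / 5093
print(f"scp overhead per file: about {scp_overhead_per_file:.2f}s")
```

A fraction of a second per file sounds small, but multiplied by thousands of files it dominates the transfer.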

Wednesday, April 1, 2015

Xeon Phi's for Linear Algebra

Linear algebra is the backbone of many research codes. So much so that standard libraries were created to support both the primitives (matrix multiply, dot product) and higher-level routines (LU factorization, QR factorization). These libraries are BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).

Before I go further: never use the Netlib reference versions of BLAS and LAPACK. They are not tuned, and the performance difference can be dramatic. When using these routines you want the CPU vendor's own implementation. They all conform to the same API and thus are portable.

MKL, the Math Kernel Library, is the Intel implementation of BLAS, LAPACK, and many other solvers. Intel went further: MKL can automatically offload routines to Xeon Phi accelerators with no code modification.

module load mkl/11.2
icc dgemm_speed.cpp -mkl -DBLAS3 -DDIM=10000 -DMKL
qsub -I -l nodes=1:ppn=1:mics=1,pmem=20gb -A account_flux -q flux -l qos=flux -V

#CPU Only Result

[brockp@nyx5896 blas]$ ./a.out

Size of double: 8

Will use: 2288 MB

 Matrix full in: 7 sec

MKL Runtime: 89 sec.

#Single Phi

[brockp@nyx5896 blas]$ module load xeon-mic

This module sets the following defaults:

        MKL_MIC_ENABLE=1 enables auto MKL offload

        MKL_MIC_WORKDIVISION=MIC_AUTO_WORKDIVISION

        MKL_MIC_DISABLE_HOST_FALLBACK=1 unset if not always using MIC

        MIC_OMP_NUM_THREADS=240 Sane values 240/244 (60 core * 4 threads) 

        OFFLOAD_REPORT=2 Gives lots of details about the device when running, setting to 1 gives less information, unset to surpress

        If your application uses both Compiler Assisted Offload and Automatic Offload then it is strongly recommended to set OFFLOAD_ENABLE_ORSL=1. This env-variable enables the two offload modes to synchronize their accesses to coprocessors.

[brockp@nyx5896 blas]$ ./a.out 

Size of double: 8

Will use: 2288 MB

 Matrix full in: 6 sec

[MKL] [MIC --] [AO Function]    DGEMM

[MKL] [MIC --] [AO DGEMM Workdivision]  0.00 1.00

[MKL] [MIC 00] [AO DGEMM CPU Time]      9.471196 seconds

[MKL] [MIC 00] [AO DGEMM MIC Time]      6.818753 seconds

[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1600000000 bytes

[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 3200000000 bytes

MKL Runtime: 10 sec.

The key to getting auto offload is that the xeon-mic module sets MKL_MIC_ENABLE=1, which lets MKL use the Phi (also known as a MIC) in the computation. In this case, with no code modification, the single matrix multiply was accelerated 8.9x (89 seconds down to 10 seconds).
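The numbers in the output can be sanity checked with a little arithmetic. The benchmark source is not shown here, so this sketch assumes that -DDIM=10000 means three square 10000 x 10000 matrices of doubles (A, B, and C) and uses the usual 2n^3 operation count for a matrix multiply. The reported runtimes are whole seconds, so the rates are coarse.

```python
def dgemm_flops(n):
    """Floating point operations for an n x n matrix multiply (2 * n^3)."""
    return 2 * n ** 3


def matrices_mib(n, count=3, bytes_per_double=8):
    """Memory in MiB for `count` dense n x n matrices of doubles."""
    return count * n * n * bytes_per_double / 2 ** 20


def gflops(n, seconds):
    """Achieved rate in GFLOP/s for an n x n DGEMM taking `seconds`."""
    return dgemm_flops(n) / seconds / 1e9


N = 10000  # assumption: matches -DDIM=10000 on the compile line

print(f"memory:   {matrices_mib(N):.0f} MiB")
print(f"CPU only: {gflops(N, 89):.1f} GFLOP/s")
print(f"with Phi: {gflops(N, 10):.1f} GFLOP/s")
print(f"speedup:  {89 / 10:.1f}x")
```

Under these assumptions the memory estimate agrees with the "Will use: 2288 MB" line once truncated to a whole number.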

You can also use more than one Phi. Speedups can be much greater than shown here if the data are larger or the routine is more complex, e.g. a full DGESV(), which calls the DGEMM() benchmarked here many times.

XSEDE Research Allocations Due April 15th

The next quarterly round of XSEDE research allocation requests is due April 15th.

XSEDE provides a set of national HPC and Research CI resources available by proposal to the national research community.

ARC-TS provides support for XSEDE and other large HPC centers, including DOE, NASA and Cloud providers.

Friday, March 27, 2015

Sending large files to people

In a previous post we described the Globus file transfer service. It is designed to transfer single or multiple large data sets between two sites. Here we show an alternate approach using the Filesender service offered by Internet2 that is focused on file transfers of any size between individual’s desktop or laptop computers. It is a particularly convenient way to overcome email attachment file size limitations.

The process is really easy. Just follow these simple steps:

Point your web-browser at: https://filesender.internet2.edu
Login using "University of Michigan" as your Organization.

Then use your UMich Uniqname and Kerberos password

Fill in all the information fields of the form, especially the email address.
Upload your files.
Then click on the "Send" button. The recipient will be sent an email with a link and instructions on downloading the files. When the download is complete you will receive an email telling you so.

Does someone need to send you large files but is not from an Internet2 member Institution? No problem! After you log into the filesender site select "Guest Voucher" at the top of the page. You can then have an email sent to the other person with a link allowing them to send a file or multiple files back to you or anyone else. Be advised though that this is a one time use for each voucher.

As always, the HPC support staff on campus are available to help, simply send an email to hpc-support@umich.edu

Tuesday, March 17, 2015

Intel Xeon Phi's Available on Flux

ARC-TS now has available, as a technology preview, 8 Intel Xeon Phi (Wikipedia) 5110p cards. These are known as MICs, for Many Integrated Core. Each is an accelerator card that fits into a slot on a Flux compute node, and a code can offload portions or all of its work to the card.

As a technology preview, there is no additional cost for using the Phi's. All that is required is an active Flux allocation and users can test the Phi's. The only other requirement is all Phi jobs must be less than 24 hours long.

The Phi cards support three modes of operation, Automatic Offload, Compiler Assisted Offload, and Native. The first two are well tested, the last works but is not as well tested. All intel-comp compiler and mkl math library modules on Flux support the Phi.

You can request Phi cards through PBS with:

qsub -I -l nodes=1:mics=2 -q flux -A account_flux -l qos=flux -V

This will provide two Phi's and one CPU. Flux currently has one node with 8 Phi cards.

PBS will set two variables:

PBS_MICFILE -> list of hostnames of the assigned Phi's, useful for native mode.
OFFLOAD_DEVICES -> csv list of devices for controlling auto offload (MKL) or compiler assisted offload.
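
Inside a job, the two variables can be inspected with a few lines of shell. The following is a minimal sketch, not an official ARC-TS script: the helper names count_offload_devices and count_mic_hosts are made up for illustration, and it assumes that PBS_MICFILE names a file with one hostname per line, which is not spelled out above.

```shell
# Count the entries in a comma-separated device list such as "0,1".
count_offload_devices() {
    if [ -z "$1" ]; then
        echo 0
        return 0
    fi
    # Split on commas inside a subshell so the IFS change does not leak out.
    (
        IFS=','
        set -f          # disable globbing while splitting
        set -- $1
        echo "$#"
    )
}

# Count the non-empty lines in a host list file.
# ASSUMPTION: the file holds one hostname per line.
count_mic_hosts() {
    grep -c . "$1"
}

echo "Offload devices: $(count_offload_devices "${OFFLOAD_DEVICES:-}")"
if [ -n "${PBS_MICFILE:-}" ] && [ -r "$PBS_MICFILE" ]; then
    echo "Native-mode hosts: $(count_mic_hosts "$PBS_MICFILE")"
fi
```

Run outside of a Phi job, where neither variable is set, this simply reports zero offload devices.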

$HOME is mounted on the card, as is all the software, but /scratch currently is not; it should be at some point in the future. This should only affect users running native Phi code.

The Phi requires some environment changes for software to work, so we created a module called xeon-mic. When loaded, it sets some sane defaults and prints what it set. Users are encouraged to experiment with the many settings available.

module load xeon-mic

In future posts we will show examples using the Phi.

Monday, February 23, 2015

Flux and Value Storage outage March 28th 10:00pm

The ARC cluster Flux and Engineering cluster Nyx will be unavailable for jobs March 28th at 10:00pm. There is an emergency update to the ITS Value Storage systems on that date.
http://status.its.umich.edu/outage.php?id=93178

Flux and Nyx rely on Value Storage and thus will also not be available during that time. We expect the outage to be finished quickly and any queued jobs will run as expected once the service is completed.

At the start of the outage, login and transfer nodes will be rebooted. Users will be unable to login until after the service is restored.

Any jobs that request more walltime than remains until the start of the outage will be held and started after the systems return to service.

To find the maximum walltime you can request and still have your job start prior to the outage, use our walltime calculator:

module load flux-utils
maxwalltime
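
The idea behind the limit is simple arithmetic: the longest job that can still start is one whose walltime fits into the time remaining before the outage. The sketch below illustrates that calculation only. It is not how our walltime calculator is implemented, and the function name max_walltime_before is hypothetical.

```shell
# Print the time between "now" and the outage start as HH:MM:SS.
# $1: current time, $2: outage start (both in seconds since the epoch).
max_walltime_before() {
    left=$(( $2 - $1 ))
    # If the outage has already started, nothing can be requested.
    if [ "$left" -lt 0 ]; then
        left=0
    fi
    printf '%02d:%02d:%02d\n' $((left / 3600)) $((left % 3600 / 60)) $((left % 60))
}

# Example call (ASSUMPTION: GNU date is available for the -d option):
#   max_walltime_before "$(date +%s)" "$(date -d '2015-03-28 22:00' +%s)"
```

In practice you would request somewhat less than the printed value, since the job also has to wait in the queue before it starts.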

Allocations that are active on that date will be extended by one day at no cost.

If you have any questions feel free to ask us at hpc-support@umich.edu

For immediate updates watch: https://twitter.com/umcoecac

Wednesday, February 11, 2015

Data will be deleted from /scratch on Flux if unused for 90 days

Over the past several months, a huge amount of data (491 TB) has accumulated in the /scratch directory on the Flux computing cluster. /scratch is meant for data relating to currently running jobs, and the buildup of data is threatening the performance of Flux for all users.
Therefore, ARC will begin deleting data from /scratch that have not been accessed for 90 consecutive days.

Flux account owners with unused data have begun receiving emails warning that their data will be deleted.

Account owners in this situation can move their data to another system, such as ITS Value Storage or their own equipment, using the dedicated transfer nodes on Flux, which have high-speed network connections available for that purpose.
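
To see ahead of time which files might fall under the 90-day rule, find can list files by last access time. This is a rough sketch under assumptions: the function name list_stale_files is hypothetical, the directory argument is whatever path you own under /scratch, and the actual purge criteria may differ from a plain access-time test.

```shell
# List regular files under a directory that have not been accessed recently.
# $1: directory to scan
# $2: age threshold in days (defaults to 90)
list_stale_files() {
    # -atime +N matches files last accessed more than N days ago
    # (find counts whole 24-hour periods).
    find "$1" -type f -atime +"${2:-90}" -print
}

# Example call, with a placeholder path:
#   list_stale_files /scratch/path/to/your/data 90
```

Files that show up in this listing are candidates to move off /scratch through the transfer nodes.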

For more information on Value Storage, see the ITS website.
For more information on transfer nodes, see the ARC website.
If you have any questions, please contact hpc-support@umich.edu.

Saturday, January 10, 2015

Flux Adds New 20 core Nodes

Flux has been expanded to include 126 new nodes. These are IBM* NeXtScale based systems. Each chassis holds 12 nx360 M4 nodes.

Details are:

2 x 10 core 2.8 GHz Intel E5-2680v2
90+ GB of DDR 1866 MHz RAM
FDR** Infiniband

A sharp eye may notice that 126 nodes do not divide evenly into chassis that hold 12 nodes each. Under FoE we operate some additional nodes, totaling 156.
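
The arithmetic can be checked quickly in the shell. The helper name chassis_split is just for illustration; it prints the number of full chassis followed by the number of leftover nodes.

```shell
# Print "<full chassis> <leftover nodes>" for a given node count.
# $1: number of nodes, $2: nodes per chassis
chassis_split() {
    echo "$(( $1 / $2 )) $(( $1 % $2 ))"
}

chassis_split 126 12   # the new nodes on their own
chassis_split 156 12   # the 156 figure quoted above
```

With 12 nodes per chassis, 126 nodes fill 10 chassis with 6 nodes left over, while 156 nodes fill exactly 13.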

*IBM has since sold its x86 line to Lenovo.

** While the servers have FDR adaptors, the fabric they connect to is still QDR based and will perform as such.

Using Infiniband with MATLAB Parallel Computing Toolbox

In High Performance Computing (HPC) a number of network types are commonly used. Among these are Ethernet, the common network found on all computer equipment, and Infiniband, a specialty high-performance, low-latency interconnect common on commodity clusters. There are also several proprietary types and a few other less common ones, but I will focus on Ethernet and Infiniband.

Ethernet, and really its mate protocol TCP, is the most commonly supported MPI network. Almost all computer platforms support this network type, and it can be as simple as using your home network switch. It is ubiquitous and easy to support. Networks like Infiniband, though, require special drivers and uncommon hardware, but the effort is normally worth it.

The MATLAB Parallel Computing Toolbox provides a collection of functions that allow users of MATLAB to utilize multiple compute nodes to work on larger problems. Many may not realize that MathWorks chose to use the standard MPI routines to implement this toolbox. For ease of use, MathWorks also chose to ship MATLAB with the MPICH MPI library, and the version they use only supports Ethernet for communication between nodes.

As noted, Ethernet is about the slowest common network used in parallel applications. The question is how much this can impact performance.

Mmmmm Data:

The data was generated on 12 nodes of Xeon x5650, 144 cores in total. The code was the stock MATLAB paralleldemo_backslash_bench(1.25) from MATLAB 2013b. You can find my M-Code at Gist.

The data show two trends. The first is independent of the network type: many parallel algorithms do not scale unless the amount of data for each core to work on is sufficiently large. In this case, for Ethernet especially, the peak performance is never reached. The second trend is that the network really matters: without Infiniband, over half of the performance of the nodes is lost at many problem sizes.

How to have MATLAB use Infiniband?

MathWorks does not ship an MPI library with the Parallel Computing Toolbox that can use Infiniband by default. This is reasonable; I would be curious how large the average PCT cluster is, and how big the jobs run on the toolbox are. Luckily for us, MathWorks provides a way to introduce your own MPI library. Let me be the first to proclaim:

Thank you MathWorks for adding mpiLibConf.m as a feature. -- Brock Palen

In the above test we used Intel MPI for the Infiniband test and MPICH for the Ethernet test. The choice of MPI is important. The MPI standard enforces a shared API but not a shared ABI, so the MPI library you substitute needs to be ABI-compatible with the one MATLAB is compiled against. Luckily for us they used MPICH, so any MPICH derivative should work: MVAPICH, Intel MPI, etc.

If you are using the MATLAB Parallel Computing Toolbox on more than one node, and your cluster has a network other than Ethernet/TCP (there are non-TCP Ethernet networks that perform very well), I highly encourage putting in the effort to ensure you use that network.

For Flux users we have this set up, but you have to do some setup yourself before you see the benefit. Please visit the ARC MATLAB documentation, or send us a question at hpc-support@umich.edu.

Friday, January 2, 2015

Q1 XSEDE Research Proposal Deadline

The next XSEDE Research proposal deadline is January 15th. If you are looking to get more work done, or to scale to new levels, read more on the ARC site.