Thursday, August 29, 2013

Engineering New Faculty Orientation: Research Computing

Today Ken Powell, Andy Caird, and Amadi Nwankpa spoke to the new College of Engineering faculty about research computing.  The slides and handout are presented below.

Who's who on campus

IT Groups


CAEN

  • has operational responsibilities for College of Engineering IT services and facilities.
  • has five main groups: high-performance computing (HPC), web, applications, student computing environment, and instructional technology


Your departmental computing support

Research Computing Support

Michigan Institute for Computational Discovery and Engineering

Office of Advanced Research Computing (ARC)

  • coordinates campus-level research computing infrastructure and events
  • helps connect faculty in various colleges doing related work

What's What—HPC resources


Flux

  • Recommended HPC resource for most CoE faculty
  • Homogeneous collection of hardware owned by ARC and operated by CAEN HPC, with purchasing/business/HR support from ITS.
  • Run on an allocation model—faculty purchase by core-month allocation
  • About 10,000 InfiniBand-connected cores, all owned by ARC. Used by about 120 research groups and several courses.
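
The allocation model above is priced in core-months. A hypothetical example of the arithmetic (the core count and duration here are made up for illustration; this post quotes no pricing):

```python
# Hypothetical Flux allocation sizing: an allocation is purchased
# as cores multiplied by months of use.
cores = 16     # cores requested (illustrative)
months = 3     # how long the allocation runs (illustrative)
core_months = cores * months
print(core_months)  # 48 core-months to purchase
```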

Flux Operating Environment

Where to turn for help

  • Purchasing a desktop or laptop for you or a student/post-doc in your group:
    • your departmental IT group
  • Finding out whether a certain software package has already been licensed to the department, college or university:
  • Licensing a piece of software:
    • your department IT person, or Amadi Nwankpa, the CAEN faculty liaison and software guru
    • this is increasingly important to get correct
  • Purchasing storage:
  • Assessing your HPC needs (cluster computing, storage):
  • Getting your doctoral students enrolled in the Scientific Computing PhD or scientific computing certificate program:
  • General questions about using HPC in your research:
    • Andrew Caird and Ken Powell

More Information

Friday, August 9, 2013

Scratch Filesystem Details

There have been a lot of questions about how /scratch is built. As of 8/2013, it consists of 12 15,000RPM SAS drives in a raid 10 for metadata, and 300 3,000GB 7,200RPM SATA drives for data.

The filesystem is based on Lustre, a high-performance parallel filesystem. For details on how Lustre works, visit NICS's page.  If you want to implement any of the details NICS lists, please contact us first, as our system is different from theirs.

The 300 3,000GB drives are housed in an SFA10K-X, which has two active-active heads. Each head has two paths to each of the 5 disk arrays that hold 60 drives each, and each head can process data at a rate of 3-4GB/s.  Each head is also a backup for the other, and each head has two paths to each Lustre server. This allows the loss of a head or a path without disrupting access to data.

At the start of the last unplanned outage (8/2013), a head was removed for service and a SAS card failed in the remaining head, causing 60 of the 300 drives to disappear. Scratch continued to operate at this point, though at significant risk of data loss.

The 300 drives are broken into groups of 10 in a double-parity (raid6) configuration, with each of the 5 disk shelves holding two drives from each group. With raid6, two drives can fail in each group of 10 without losing data, and because the drives are spread across the shelves, an entire shelf can be lost without losing data.
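
The shelf-failure tolerance follows directly from the layout. A quick sketch using the drive counts from this post (the layout is modeled, not read from the hardware):

```python
# Drive layout from the post: 300 drives in 30 raid6 groups of 10,
# spread over 5 shelves of 60 drives, so each group has 2 drives per shelf.
SHELVES = 5
GROUP_SIZE = 10
PER_GROUP_PER_SHELF = GROUP_SIZE // SHELVES  # 2 drives of each group per shelf

def drives_lost_per_group(shelves_lost):
    """How many drives every raid6 group loses when whole shelves disappear."""
    return shelves_lost * PER_GROUP_PER_SHELF

# raid6 tolerates 2 failed drives per group, so losing one whole shelf
# (like the 60-drive SAS failure described below) leaves data intact,
# but with zero remaining redundancy.
print(drives_lost_per_group(1))       # 2
print(drives_lost_per_group(1) <= 2)  # True
```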

Each group provides 21.3TB of space in /scratch. These groups are also known as OSTs (object storage targets), the building blocks of Lustre/Scratch. As performance and space needs grow, OSTs can be added for capacity and performance.
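
The per-OST and total capacity can be sanity-checked from the numbers in this post (the gap between the 24,000GB raw figure and the 21.3TB usable figure is filesystem/formatting overhead, which is assumed here, not measured):

```python
# Back-of-the-envelope /scratch capacity from the numbers in this post.
DRIVE_GB = 3000            # capacity of each SATA drive
GROUP_SIZE = 10            # drives per raid6 group (one OST)
PARITY_DRIVES = 2          # raid6 stores two drives' worth of parity
OSTS = 300 // GROUP_SIZE   # 30 OSTs in total

raw_per_ost_gb = (GROUP_SIZE - PARITY_DRIVES) * DRIVE_GB  # before overhead
usable_per_ost_tb = 21.3   # usable figure quoted above
total_usable_tb = OSTS * usable_per_ost_tb

print(OSTS, raw_per_ost_gb, round(total_usable_tb))  # 30 24000 639
```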

Refer to the NICS page for details. By default, a file written to scratch has an entry on the metadata server and its data are stored on one of the 30 OSTs. For very large files, or when using MPI-IO (parallel IO), users should stripe files across OSTs, so that the data are distributed across all, or a subset, of the OSTs. Striping files lets users sum up the performance of the OSTs. This should only be done for large files; small files will actually be slower when striped.
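
Conceptually, striping deals a file out in fixed-size chunks, round-robin, across the chosen OSTs. A minimal model of that mapping (illustrative only — this is not the Lustre client code, and the 1MiB/4-OST parameters are made up):

```python
# Conceptual model of Lustre striping: a striped file is cut into
# stripe_size chunks dealt round-robin across stripe_count OSTs.
def ost_for_offset(offset, stripe_size, stripe_count):
    """Index of the OST that holds the byte at `offset` in a striped file."""
    chunk = offset // stripe_size
    return chunk % stripe_count

MiB = 1 << 20
# A file striped 1 MiB at a time across 4 OSTs:
print(ost_for_offset(0, MiB, 4))             # 0 (first chunk)
print(ost_for_offset(3 * MiB, MiB, 4))       # 3 (fourth chunk)
print(ost_for_offset(4 * MiB + 17, MiB, 4))  # 0 (wraps back to the first OST)
```

On a real Lustre system the stripe count and size are set per file or directory with the `lfs setstripe` command; this sketch only shows why a wide stripe lets large sequential IO draw bandwidth from many OSTs at once.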

In the event an OST is lost, only the data on that OST are gone: single-stripe files on the failed OST would be completely missing, and striped files would have holes where the data that resided on the failed OST used to be.

The largest known Lustre filesystem is for the LLNL Sequoia system, at 55,000TB and with performance over 1,024GB/s.  Here is a talk from the Lustre User Group on the Sequoia system.

Scratch in its old data center being installed. The 6 machines on top are the Lustre servers, followed by the metadata array, the two heads and the 5 disk shelves.

Back of scratch during installation

Scratch installed in the new Modular Data Center

Closeup of the back of the SFA10k-X Heads showing the 40 SAS connections to the disks. Each connection supports 3GB/s.

Scratch Back Online

Thursday night the checks of the /scratch disk volumes completed, letting the staff restore access to files on that system.  No problems were found. Parity rebuilds continued and performance was significantly degraded.  We resumed jobs on Flux around 11pm that night.

On Friday morning the first set of parity calculations (raid5) finished on all the /scratch volumes. The risk of data loss was significantly reduced at this point, as every volume could now survive a single disk failure. The staff then failed some of the volumes over to the other active head (which had been unavailable). This should let the second-level parity (raid6) calculation proceed quickly, as well as double the performance of /scratch for applications running on Flux.

All Flux allocations affected by the outage have been extended by 4 days.

Performance is still degraded compared to normal operation due to the impact of the remaining parity calculations. Data are now generally safe. The /scratch filesystem is for scratch data and is not backed up. For a listing of the scratch policies, visit our scratch page.

Wednesday, August 7, 2013

/scratch problems

While trying to swap out a head that was exhibiting problems we had a SAS card failure.

This failure caused 60 drives to disappear from the system.  Because of the ungraceful way the drives were removed, this took away all the redundancy in the raid 6 arrays.

We were able to bring the drives back up on the old head (the one that had been removed), but because they had been missing from the system for 10 minutes, the arrays forced themselves into a full rebuild.

Right now scratch has no parity --- none --- and we have 60 drives rebuilding on only one head. The other head is up but is not picking up the paths.  We have been working with the vendor, DDN, on this.

Right now the head is rebuilding only 30 of the drives (to get us up to raid5) and then will continue on to raid6.

With only one head working we are CPU-bound; the rebuild is going at 1%/hour. We are at risk of losing data until the end of the week, and it will take another week to get back to full raid6.
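
The "end of the week" estimate follows from the quoted rate. A quick check (the 1%/hour figure is from this post; assuming the rate holds steady, with no estimate attempted here for the second, raid6 pass):

```python
# Rebuild timeline from the quoted rate of 1% per hour.
rate_pct_per_hour = 1.0
hours_to_raid5 = 100.0 / rate_pct_per_hour  # full first parity pass
days_to_raid5 = hours_to_raid5 / 24

print(hours_to_raid5, round(days_to_raid5, 1))  # 100.0 4.2
```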