Friday, August 9, 2013

Scratch Filesystem Details

There have been a lot of questions about how /scratch is built. As of 8/2013, /scratch consists of 12 15,000RPM SAS drives in a RAID 10 for metadata, and 300 3,000GB 7,200RPM SATA drives for data.

The filesystem is based on Lustre, a high-performance parallel filesystem. For details on how Lustre works, visit the NICS page. If you want to implement any of the details NICS lists, please contact us first, as our system is different from theirs.

The 300 3,000GB drives are housed in an SFA10K-X, which has two active-active heads. Each head has two paths to each of the 5 disk shelves, which hold 60 drives each. Each head can process data at a rate of 3-4GB/s. Each head is also a backup for the other, and each head has two paths to each Lustre server. This allows a head or a path to be lost without disrupting access to data.

At the start of the last unplanned outage (8/2013), a head was removed for service and a SAS card failed in the remaining head, causing 60 of the 300 drives to disappear. Scratch continued to operate at this point, though at significant risk of data loss.

The 300 drives are divided into groups of 10, each in a double-parity (RAID6) configuration. Each of the 5 disk shelves holds two drives from each group. With RAID6, two drives can fail in each group of 10 without losing data. Because each group's drives are spread across the shelves, an entire shelf can be lost without losing data.
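The shelf-loss claim can be checked with a small simulation. This is only a sketch: the counts come from the description above, but the (shelf, slot) numbering and the function names are made up for illustration and do not reflect how the SFA10K-X actually addresses its drives.

```python
# Counts taken from the post; the (shelf, slot) numbering is hypothetical.
SHELVES = 5
DRIVES_PER_SHELF = 60
GROUPS = 30
DRIVES_PER_GROUP_PER_SHELF = 2
RAID6_TOLERANCE = 2  # RAID6 survives up to two failed drives per group


def build_layout():
    """Return {group: [(shelf, slot), ...]} with two drives per shelf per group."""
    layout = {}
    for group in range(GROUPS):
        members = []
        for shelf in range(SHELVES):
            for k in range(DRIVES_PER_GROUP_PER_SHELF):
                slot = group * DRIVES_PER_GROUP_PER_SHELF + k
                members.append((shelf, slot))
        layout[group] = members
    return layout


def groups_lost(layout, failed_drives):
    """Return the groups with more failed drives than RAID6 can tolerate."""
    failed = set(failed_drives)
    return [group for group, members in layout.items()
            if sum(m in failed for m in members) > RAID6_TOLERANCE]


layout = build_layout()
whole_shelf = [(0, slot) for slot in range(DRIVES_PER_SHELF)]

# Losing all 60 drives of one shelf costs every group exactly two drives.
print(groups_lost(layout, whole_shelf))             # []
# One more failure in a group that already lost two is fatal for that group.
print(groups_lost(layout, whole_shelf + [(1, 0)]))  # [0]
```

This also shows why running with a shelf missing is risky: every group is already at its failure limit, so any additional drive failure loses an entire group.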

Each group provides 21.3TB of space in /scratch. These groups are also known as OSTs (Object Storage Targets), and they are the building blocks of Lustre and /scratch. As performance and space needs grow, OSTs can be added to increase both.
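For a rough sense of where the per-group figure comes from, here is the arithmetic as a short sketch. The drive size, group size, and parity count are from the description above. The explanation for the gap between the raw number and the reported 21.3TB (binary units plus formatting overhead) is an assumption, not something stated here.

```python
# Figures from the post.
DRIVE_BYTES = 3000 * 10**9      # one 3,000GB drive
DRIVES_PER_GROUP = 10
PARITY_DRIVES = 2               # RAID6 uses two drives' worth of parity
GROUPS = 30
REPORTED_TB_PER_GROUP = 21.3    # usable space per group as reported above

data_drives = DRIVES_PER_GROUP - PARITY_DRIVES
raw_bytes = data_drives * DRIVE_BYTES

raw_tb = raw_bytes / 10**12     # decimal terabytes
raw_tib = raw_bytes / 2**40     # binary tebibytes

print(f"data drives per group: {data_drives}")
print(f"raw data capacity: {raw_tb:.1f} TB = {raw_tib:.2f} TiB")

# Assumption: the remaining difference from 21.3 is filesystem overhead.
total_reported = round(GROUPS * REPORTED_TB_PER_GROUP, 1)
print(f"total across {GROUPS} groups at the reported figure: {total_reported}")
```

At the reported figure, the 30 groups together come to roughly 639TB of usable space.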

Refer to the NICS page for details. By default, a file written to scratch has an entry in the metadata server and its data are stored on one of the 30 OSTs. For very large files, or when using MPI-IO (parallel IO), users should stripe files across OSTs so that data are distributed across all, or a subset, of the OSTs. Striping files allows users to combine the performance of multiple OSTs. This should only be done for large files; small files will actually be slower when striped.
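To make striping concrete, here is a minimal sketch of how a striped file's bytes are dealt out round-robin in fixed-size chunks. The 1 MiB stripe size and the function names are assumptions for illustration; the actual stripe size on /scratch is not stated here.

```python
MIB = 2**20


def stripe_index(offset, stripe_size, stripe_count):
    """Return which stripe (0-based) of the file holds the byte at `offset`."""
    return (offset // stripe_size) % stripe_count


def bytes_per_stripe(file_size, stripe_size, stripe_count):
    """Return how many bytes of the file land on each stripe."""
    totals = [0] * stripe_count
    offset = 0
    while offset < file_size:
        chunk = min(stripe_size, file_size - offset)
        totals[stripe_index(offset, stripe_size, stripe_count)] += chunk
        offset += chunk
    return totals


# A 10 MiB file striped over 4 OSTs (assumed 1 MiB stripe size).
large = bytes_per_stripe(10 * MIB, MIB, 4)
print([n // MIB for n in large])   # [3, 3, 2, 2]

# A 64 KiB file striped over 4 OSTs still lives entirely on the first one.
small = bytes_per_stripe(64 * 1024, MIB, 4)
print(small)                       # [65536, 0, 0, 0]
```

The large file spreads its I/O over four OSTs, while the small file gains nothing from the extra stripes and only pays the added bookkeeping cost.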

If an OST is lost, only the data on that OST is gone. Single-striped files on the failed OST would be completely missing. Striped files would have holes where their data resided on the failed OST.
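The difference between a missing file and a file with holes can be sketched by listing the byte ranges that lived on the failed OST. As before, the round-robin layout with a 1 MiB stripe size is an assumption, and `lost_ranges` is a hypothetical helper, not a Lustre tool.

```python
MIB = 2**20


def lost_ranges(file_size, stripe_size, stripe_count, failed_stripe):
    """Return merged (start, end) byte ranges stored on the failed stripe."""
    ranges = []
    offset = failed_stripe * stripe_size
    step = stripe_size * stripe_count
    while offset < file_size:
        end = min(offset + stripe_size, file_size)
        if ranges and ranges[-1][1] == offset:
            # Adjacent to the previous range: extend it instead of adding one.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((offset, end))
        offset += step
    return ranges


# Single-striped 10 MiB file on the failed OST: the whole file is gone.
print(lost_ranges(10 * MIB, MIB, 1, 0))

# Same file striped over 4 OSTs, second stripe's OST fails: three 1 MiB holes.
holes = lost_ranges(10 * MIB, MIB, 4, 1)
print([(start // MIB, end // MIB) for start, end in holes])  # [(1, 2), (5, 6), (9, 10)]
```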

The largest known Lustre filesystem is for the LLNL Sequoia system, at 55,000TB with performance over 1,024GB/s. Here is a talk from the Lustre User Group on the Sequoia system.

Scratch being installed in its old data center. The 6 machines on top are the Lustre servers, followed by the metadata array, the two heads, and the 5 disk shelves.

Back of scratch during installation

Scratch installed in the new Modular Data Center

Closeup of the back of the SFA10K-X heads showing the 40 SAS connections to the disks. Each connection supports 3GB/s.