On Thursday night the checks of the /scratch disk volumes completed, letting the staff restore access to files on that system. No problems were found. Parity rebuilds continued, and performance was significantly degraded. We resumed jobs on Flux around 11pm that night.
On Friday morning the first set of parity calculations (RAID 5) finished on all the /scratch volumes. Data loss risk was significantly reduced at this point, as every volume could now survive a single disk failure. The staff then failed some of the volumes over to the other active head (which had been unavailable). This should let the second-level parity (RAID 6) calculation proceed quickly, as well as double the performance of /scratch for applications running on Flux.
All Flux allocations affected by the outage have been extended by 4 days.
Performance is still degraded relative to normal operation due to the impact of the remaining parity calculations. Data are now generally safe. The /scratch filesystem is for scratch data and is not backed up. For a listing of the scratch policies, visit our scratch page.