In line with the University's holiday schedule, the CAEN HPC group will be on holiday from December 25th through January 1st. Nyx and Flux will be operational during this time and we will be monitoring these systems to ensure everything is operating appropriately.
Staff will be monitoring the ticket system but will, in general, respond only to critical or systems-related issues during the holiday break; non-critical issues will be addressed after the holiday. As a reminder, immediately following the holiday break is the 2013-14 Winter Outage, and there will be no access to the cluster or storage systems as of 6am on January 2nd.
As always, email any questions you may have to hpc-support@umich.edu and have a great holiday.
CAEN HPC Staff
Friday, December 20, 2013
Thursday, December 12, 2013
SPINEVOLUTION on Flux
SPINEVOLUTION is, according to its web site, a highly efficient computer program for the numerical simulation of NMR experiments and spin dynamics in general.
Its installation on Flux has more nuances than most software, but Flux is a general platform and SPINEVOLUTION can be installed and run on Flux.
The LSA Research Support and Advocacy group, and Mark Montague in particular, have documented the installation and use of SPINEVOLUTION on Flux at https://sites.google.com/a/umich.edu/flux-support/software/lsa/spinevolution
While SPINEVOLUTION is a narrowly focused software package, the details of its installation on Flux may be applicable to other narrowly focused software packages.
For more information, please send email to hpc-support@umich.edu.
Undergraduate Student job in web data visualization
The CAEN HPC group would like to improve the graphical reporting of much of the data available from the cluster.
In the past, we would run commands via scripts and parse the output and make graphs.
The most recent versions of the cluster management software present some (and increasingly more) of the information via a REST-ful interface that returns JSON-formatted results.
In addition, JavaScript graphing libraries are improving in usefulness and usability. Among these are d3.js, JavaScript InfoVis Toolkit, Chart.js, Google Charts and others.
Our current usage graphs (an example of which is below) do not differentiate different types of Flux products (regular nodes, larger memory nodes, GPU nodes, FOE nodes, etc.) and do not separate utilization by Flux project accounts or by Flux user accounts.
Figure 1: The current Flux usage graphs do not differentiate between Flux projects, do not offer different time scales, and are generally of limited use.
We would like:
- an overview page that improves on the current usage graphs
- a way to see daily, weekly, monthly, and yearly detail
- a way to see Flux products (as above) individually and stacked together
- a place for this to live (locally? MiServer? AWS? Google Sites?)
- a web site akin to http://flux-stats/?account-flux(|m|g) that provides per-Flux-project reports including:
- allocated cores over time
- running cores (by user) over time
- current resources in use out of the total:
- x running / x allocated cores
- y running / y allocated GB RAM
- the current queue represented as:
- running jobs: job owner, # cores in use, # GB RAM in use, times (start, running, total), job name, job ID
- queued jobs: job owner, # cores req’d, # GB RAM req’d, time req’d, job name, job ID
- some heuristic advice along the lines of:
- if you had X more cores, then Y more jobs would start
- if you had A more GB of RAM, then B more jobs would start
- “you would save money by switching the allocations in project G from standard Flux to larger memory Flux”, etc.
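The dashboard itself would use one of the JavaScript libraries listed above, but as a sketch of what consuming the REST/JSON interface looks like compared with the old run-a-command-and-parse-it approach, here is a minimal example. The endpoint URL and the JSON field names are invented for illustration; the real interface is whatever the cluster management software exposes.

```python
# Minimal sketch: pull usage records from a (hypothetical) REST endpoint that
# returns JSON, then stack core usage by Flux product over time -- the kind of
# breakdown the current graphs do not offer.
import json
import urllib.request

import matplotlib.pyplot as plt

STATS_URL = "https://flux-stats.example.umich.edu/api/usage?days=7"  # hypothetical endpoint

with urllib.request.urlopen(STATS_URL) as response:
    # e.g. [{"date": "2013-12-01", "product": "standard", "cores": 1234}, ...]
    usage = json.load(response)

dates = sorted({record["date"] for record in usage})
products = sorted({record["product"] for record in usage})
cores_by_product = {
    p: [sum(r["cores"] for r in usage if r["date"] == d and r["product"] == p) for d in dates]
    for p in products
}

plt.stackplot(range(len(dates)), *cores_by_product.values(), labels=products)
plt.xticks(range(len(dates)), dates, rotation=45)
plt.ylabel("cores in use")
plt.legend()
plt.title("Flux core usage by product (example data)")
plt.show()
```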
Email us at coe-hpc-jobs@umich.edu if you are interested.
Tuesday, November 26, 2013
Amazon Visiting Ann Arbor to talk about AWS for Research
On Thursday, December 5th, 2013 Steve Elliot and KD Singh of Amazon will be at the Hilton Garden Inn Ann Arbor (1401 Briarwood Circle, Ann Arbor, MI 48108).
Steve and KD will be talking about both compute and storage services for researchers. This will be a technical discussion and presentation including live demos of Gluster and StarCluster, among other technologies, possibly including Elastic MapReduce or Redshift (AWS's data warehouse service).
To register, please email Steve Elliott at: elliotts@amazon.com.
Thursday, November 14, 2013
Flux for Research Administrators
About This Talk
This talk was given at the 2013 University of Michigan CyberInfrastructure Days conference.
Administrative and related support activities are needed for researchers to successfully plan for and use Flux in their projects.
This presentation describes Flux and how the use of Flux by researchers is planned for, acquired, and managed.
The information presented here is intended to help you to better support the proposal or other planning process and manage or track Flux use.
What is Flux in Terms of Hardware?
Flux is a rate-based service that provides a Linux-based High Performance Computing (HPC) system to the University of Michigan community.
It is a fast system. Its CPUs, internal network, and storage are all fast in their own right and are designed to be fast together.
It is a large system on campus. Flux consists of 12,260 cores.
Flux Services and Costs
| | monthly rate |
|---|---|
| standard Flux | $11.72/core |
| larger memory Flux | $23.82/core |
| Flux Operating Env. | $113.25/node |
| GPU Flux | $107.10/GPU |
Planning to Use Flux
Planning for using Flux is done by estimating usage needs and considering the limits or availability of funding.
Using Flux is more flexible than purchasing hardware. Allocations can be adjusted up or down or kept the same over the duration of a project.
There are two approaches to planning for the use of Flux:
- Determine the amount of Flux resources your research will need and create a budget to meet that demand.
- Determine how much Flux time and cores you can afford on a given budget.
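As a rough worked example of the second approach (the budget figure below is invented for illustration; the rate is the standard Flux monthly rate from the table above):

```python
# Back-of-the-envelope estimate: how many standard Flux cores can a fixed
# annual budget sustain, and what does a fixed core count cost?
standard_rate = 11.72        # $ per core per month (standard Flux)
annual_budget = 10_000.00    # hypothetical yearly budget in dollars

cores_for_a_year = annual_budget / (standard_rate * 12)
print(f"~{cores_for_a_year:.0f} cores sustained for 12 months")  # ~71 cores

cores_needed = 48            # hypothetical requirement
total_cost = standard_rate * cores_needed * 12
print(f"${total_cost:,.2f} for {cores_needed} cores x 12 months")  # $6,750.72
```

Because allocations can be adjusted up or down on month boundaries, these estimates do not need to be exact up front.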
Understanding Flux Allocations, Accounts, and Projects is Important
A Flux project is a collection of Flux user accounts that are associated with one or more Flux allocations.
A Flux project can have as many allocations as you wish.
Instructions for Research and Other Administrators During the Planning Process
Administrators should confirm, as necessary, that the grant writer has done what he or she needs to do.
Grant writers need to make sure their computing needs are suitable for the use of Flux, estimate the Flux resources that are required for the project, describe Flux in the proposal, and prepare the information needed to complete the Flux PAF Supplement form.
The administrator sends the completed Flux PAF Supplement to coe-flux-paf-review@umich.edu, and attaches the returned and approved Flux PAF Supplement to the PAF packet.
The Flux PAF Supplement
The completion and the review of the Flux PAF Supplement are important steps in the Flux planning process.
Being able to fill out the Flux PAF Supplement is a good self-check for having completed a good planning process.
The review of the Flux PAF Supplement allows the Flux operators to do some system planning. In some cases you may be asked for some clarification.
Using Flux
A Flux User Account and a Flux Allocation are needed to use Flux.
A Flux user account is a Linux login ID and password (the same as your U-M uniqname and UMICH.EDU password).
Flux user accounts and allocations may be requested using email. (See http://arc.research.umich.edu/flux/managing-a-flux-project/)
Monitoring and Tracking Flux Allocations
Historical usage data for Flux allocations is available in MReports (http://mreports.umich.edu/mreports/pages/Flux.aspx).
Instructions for accessing data in MReports are available online (http://arc.research.umich.edu/flux/managing-a-flux-project/check-my-flux-allocation/).
Billing is done monthly by ITS.
Flux allocations can be started and ended (on month boundaries). Multiple allocations may be created.
More Information is Available
Email hpc-support@umich.edu.
Look at CAEN's High Performance Computing website: http://caen.engin.umich.edu/hpc/overview.
Look at ARC's Flux website: http://arc.research.umich.edu/flux/
Wednesday, November 13, 2013
Flux: The State of the Cluster
Last Year
What is Flux in Terms of Hardware?
Flux is a rate-based service that provides a Linux-based High Performance Computing (HPC) system to the University of Michigan community.
It is a fast system. Its CPUs, internal network, and storage are all fast in their own right and are designed to be fast together.
It is a large system on campus. Flux consists of 12,260 cores.
Flux was moved to the Modular Data Center from the MACC
Moving Flux to the MDC from the MACC resulted directly in the decrease in the rate and an accompanying change in service level.
Before the move Flux had generator-backed electrical power and could run for days during a utility power outage.
After the move Flux has battery-backed electrical power and can run for 5 minutes during a utility power outage.
The rate for all of the Flux services was reduced on October 1, 2013
| | old monthly rate | new monthly rate |
|---|---|---|
| standard Flux | $18.00/core | $11.72/core |
| larger memory Flux | $24.35/core | $23.82/core |
| Flux Operating Env. | $267.00/node | $113.25/node |
| GPU Flux | n/a | $107.10/GPU |
Flux has the newest GPUs from NVIDIA - the K20x
Flux has 40 K20x GPUs connected to 5 compute nodes.
Each GPU allocation comes with 2 compute cores and 8GB of CPU RAM.
| NVIDIA K20x | |
|---|---|
| Number and type of GPU | one Kepler GK110 |
| Peak double precision floating point perf. | 1.31 Tflops |
| Peak single precision floating point perf. | 3.95 Tflops |
| Memory bandwidth (ECC off) | 250 GB/sec |
| Memory size (GDDR5) | 6 GB |
| CUDA cores | 2688 |
Flux has Intel Phis as a technology preview
Flux has 8 Intel 5110P Phi co-processors connected to one compute node.
As a technology preview, there is no cost to use the Phis.
| Intel Phi 5110P | |
|---|---|
| Number and type of processor | one 5110P |
| Processor clock | 1.053GHz |
| Memory bandwidth (ECC off) | 320 GB/sec |
| Memory size (GDDR5) | 8 GB |
| Number of cores | 60 |
Flux has Hadoop as a technology preview
Flux has a Hadoop environment that offers 16TB of HDFS storage, soon expanding to more than 100TB.
The Hadoop environment is based on Apache Hadoop version 1.1.2.

| Component | Version |
|---|---|
| Hive | v0.9.0 |
| HBase | v0.94.7 |
| Sqoop | v1.4.3 |
| Pig | v0.11.1 |
| R + rhdfs + rmr2 | v3.0.1 |

The Hadoop environment is a technology preview and has no charge associated with it. For more information on using Hadoop on Flux, email hpc-support@umich.edu.
Next Year
The initial hardware will be replaced
Flux has a three-year hardware replacement cycle; we are in the process of replacing the initial 2,000 cores.
The new cores are likely to be Intel's 10-core Xeon CPUs, resulting in 20 cores per node.
We are planning on keeping the 4GB RAM per core ratio. The memory usage over the last three years has a profile that supports this direction.
Flux may offer an option without software
ARC is hoping to have a Flux product offering that does not include the availability, and thus cost, of most commercial software.
The "software-free" version of Flux will include
- the Intel compilers
- the Allinea debuggers and code profilers
- MathWorks MATLAB®
- other no- or low-cost software
A clearer software usage policy will be published
With changes in how software on Flux is presented will come guidance on appropriate use of the Flux software library.
In approximate terms, the Flux software library is
- licensed for academic research and education by faculty, students, and staff of the University of Michigan.
- not licensed for commercial work, work that yields proprietary or restricted results, or for people who are not at the University of Michigan.
Flux on Demand may be available
ARC continues to work on developing a Flux-on-Demand service.
We hope to have some variant of this available sometime in the Winter semester.
A High-Throughput Computing service will be piloted
ARC, CAEN, and ITS are working on a High-Throughput Computing service based on HTCondor (http://research.cs.wisc.edu/htcondor/).
This will allow for large quantities (1000s) of serial jobs to be run on either Windows or Linux.
ARC does not expect there to be any charge to the researchers for this.
Advanced Research Computing at the University of Michigan
Wednesday, November 6, 2013
Nyx/Flux Winter 2013-14 Outage
Nyx, Flux, and their storage systems (/home, /home2, /nobackup, and /scratch) will be unavailable starting at 6am Thursday January 2nd, returning to service on Saturday, January 4th.
During this time, CAEN will be making the following updates:
* The OS and system software will be upgraded. These should be minor updates provided by Red Hat
* Scheduling software updates, including the resource manager (PBS/Torque), job scheduler (Moab), and associated software
* PBS-generated mails related to job data will now be from hpc-support@umich.edu, rather than the current cac-support@umich.edu
* Transitioning some compute nodes to a more reliable machine room
* Software updates to the high speed storage systems (/nobackup and /scratch)
* The College of Engineering AFS cell is being retired (/afs/engin.umich.edu). Jobs using the Modules system should have no issue, but any PBS scripts which directly reference /afs/engin.umich.edu/ will be impacted; a short script to find affected job scripts is sketched after this list.
* Migrating /home from a retiring filesystem to Value Storage
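If you are not sure whether any of your job scripts still reference the retiring AFS cell, a quick scan along these lines will list them. This is a minimal sketch: the search location and file pattern are only examples, so point it at wherever you keep your PBS scripts.

```python
# Find job scripts that hard-code the retiring CoE AFS cell.
from pathlib import Path

SEARCH_ROOT = Path.home() / "jobs"    # example location; adjust to your own script directory
NEEDLE = "/afs/engin.umich.edu"

for script in SEARCH_ROOT.rglob("*.pbs"):  # example pattern; your scripts may use another suffix
    try:
        text = script.read_text(errors="ignore")
    except OSError:
        continue
    if NEEDLE in text:
        print(f"{script}: references {NEEDLE}")
```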
We will post status updates on our Twitter feed (https://twitter.com/UMCoECAC), which can also be found on the CAEN HPC website at http://cac.engin.umich.edu .
Wednesday, October 2, 2013
Expanded XSEDE Support at Michigan
The EXtreme Science and Engineering Discovery Environment (XSEDE) is a great source for free high-performance computing resources for the research community. Researchers who wish to use XSEDE can get support on campus from a number of people.
I (Brock Palen) am the XSEDE Campus Champion for the University of Michigan. As the Champion I have direct communication with a number of the XSEDE resource providers.
I can support you on XSEDE resources in much the same way you are supported on any local resource. I can benchmark, test, and debug issues. I can also support you in the XSEDE proposal writing process for review and resource selection.
I work closely with many of the research computing support staff in the College of LSA, the School of Public Health, and the Medical School, and they can also help you with any questions or problems you have with any part of the XSEDE process. You can get a hold of any of them and me at hpc-support@umich.edu.
As always, if you run into difficulty with any part of the XSEDE application process or when using XSEDE resources, please reach out to us locally for help at hpc-support@umich.edu.
Brock Palen
Thursday, August 29, 2013
Engineering New Faculty Orientation: Research Computing
Today Ken Powell, Andy Caird, and Amadi Nwankpa spoke to the new College of Engineering faculty about research computing. The slides and handout are reproduced below.
Who's who on campus
IT Groups
CAEN
- has operational responsibilities for College of Engineering IT services and facilities.
- has five main groups: high-performance computing (HPC), web, applications, student computing environment, instructional technology
- http://caen.engin.umich.edu
ITS
- has operational responsibilities for University IT services and facilities
- http://its.umich.edu
Your departmental computing support
Research Computing Support
Michigan Institute for Computational Discovery and Engineering
- administers the Rackham scientific computing PhD and scientific computing certificate
- helps point CoE faculty to appropriate HPC resources
- http://arc.research.umich.edu/micde/
Office of Advanced Research Computing (ARC)
- coordinates campus-level research computing infrastructure and events
- helps connect faculty in various colleges doing related work
- http://arc.research.umich.edu
What's What—HPC resources
Flux
- Recommended HPC resource for most CoE faculty
- Homogeneous collection of hardware owned by ARC and operated by CAEN HPC with purchasing/business/HR support by ITS.
- Run on an allocation model—faculty purchase by core-month allocation
- About 10,000 InfiniBand-connected cores, all owned by ARC. Used by about 120 research groups and several courses.
- http://caen.engin.umich.edu/hpc/overview
Flux Operating Environment
- For faculty who have hardware needs that differ from standard Flux
- Run on an ownership model—faculty buy their hardware, and it is incorporated and run for them in the Flux Operating Environment
- http://caen.engin.umich.edu/hpc/flux-operating-environment
Where to turn for help
- Purchasing a desktop or laptop for you or a student/post-doc in your group:
- your departmental IT group
- Finding out whether a certain software package has already been licensed to the department, college or university:
- your department IT person, or the CAEN software listing: http://caen.engin.umich.edu/software/overview
- Licensing a piece of software:
- your department IT person, or Amadi Nwankpa (amadi@umich.edu), the CAEN faculty liaison and software guru
- this is increasingly important to get correct
- Purchasing storage:
- ValueStorage: http://www.itcs.umich.edu/storage/value/
- Assessing your HPC needs (cluster computing, storage):
- Andrew Caird (acaird@umich.edu), the CAEN Director of HPC, or hpc-support@umich.edu
- Getting your doctoral students enrolled in the Scientific Computing PhD or scientific computing certificate program:
- Eric Michielssen (emichiel@umich.edu), the MICDE Director
- General questions about using HPC in your research:
- Andrew Caird and Ken Powell
More Information
- Visit http://arc.research.umich.edu to learn about upcoming workshops (including the CI days workshop November 13th and 14th)
- E-mail hpc-support@umich.edu to discuss your HPC needs
- E-mail caen@umich.edu for other help
Friday, August 9, 2013
Scratch Filesystem Details
There have been a lot of questions about how /scratch is built. As of 8/2013, /scratch consists of 12 15,000RPM SAS drives for metadata in a raid 10, and 300 3,000GB 7,200RPM SATA drives for data.
The filesystem is based on Lustre (http://www.whamcloud.com/), a high performance parallel filesystem. For details on how Lustre works, visit this NICS page. If you want to implement any of the details NICS lists, please contact us first, as our system is different from theirs.
The 300 3,000GB drives are housed in an SFA10K-X, which has two active-active heads. Each head has two paths to each of the 5 disk arrays that hold 60 drives each. Each head can process data at a rate of 3-4GB/s. Each head is also a backup for the other, and each head has two paths to each Lustre server. This allows the loss of a head or a path without disrupting access to data.
At the start of the last unplanned (8/2013) outage, a head was removed for service and a SAS card failed in the remaining head, causing 60 of the 300 drives to disappear; even so, scratch continued to operate, though at significant risk of data loss.
The 300 drives are broken into groups of 10 in a double-parity (raid6) configuration, and each of the 5 disk shelves holds two drives from each group. With raid6, two drives can die in each group of 10 without losing data, and because the drives are spread across the shelves, an entire shelf can be lost without losing data.
Each group provides 21.3TB of space in /scratch. These groups are also known as OSTs, the building blocks of Lustre/scratch. As performance and space needs grow, OSTs can be added for capacity and performance.
Refer to the NICS page for details. By default, a file written to /scratch has an entry in the metadata server and its data are stored on one of the 30 OSTs. For very large files, or when using MPI-IO (parallel I/O), users should stripe files across OSTs, so that the data are distributed across all, or a subset, of the OSTs. Striping files lets users sum the performance of the OSTs. This should only be done for large files; small files will actually be slower when striped.
If an OST is lost, only the data on that OST are gone: unstriped files on the failed OST would be completely missing, and striped files would have holes where their data resided on the failed OST.
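As a minimal sketch of how striping is set in practice, the `lfs` utility that ships with the Lustre client software sets the striping that new files in a directory will inherit. The directory path and stripe count below are only examples; if you are unsure what values suit your workload, ask us first.

```python
# Set a wider Lustre stripe on a directory so large files written into it are
# spread across multiple OSTs, then confirm the setting.  Requires the "lfs"
# Lustre client utility to be on the PATH.
import subprocess

big_output_dir = "/scratch/example_flux/myuser/large-output"  # hypothetical directory

# Stripe new files in this directory across 8 OSTs (use -1 to stripe across all OSTs).
subprocess.run(["lfs", "setstripe", "-c", "8", big_output_dir], check=True)

# Show the striping that new files in the directory will inherit.
subprocess.run(["lfs", "getstripe", big_output_dir], check=True)
```

Remember that striping pays off only for large files or parallel I/O; small files are slower when striped.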
The largest known Lustre filesystem is for the LLNL Sequoia system at 55,000TB and hitting performance over 1,024GB/s. Here is a talk from the Lustre User Group on the Sequoia system.
Scratch in its old data center being installed. The 6 machines on top are the Lustre servers, followed by the metadata array, the two heads, and the 5 disk shelves.
Back of scratch during installation.
Scratch installed in the new Modular Data Center.
Closeup of the back of the SFA10K-X heads showing the 40 SAS connections to the disks. Each connection supports 3GB/s.
Scratch Back Online
Thursday night the checks of the /scratch disk volumes completed, letting the staff restore access to files on that system. No problems were found. Parity rebuilds continued and performance was significantly degraded. We resumed jobs on Flux around 11pm that night.
On Friday morning the first set of parity calculations (raid5) finished on all the /scratch volumes. Data loss risk was significantly reduced at this point, as every volume could now survive a single disk failure. At this point the staff failed some of the volumes over to the other active head (which had been unavailable). This should let the second-level parity (raid6) calculation proceed quickly, as well as double the performance of /scratch for applications running on Flux.
All Flux allocations affected by the outage have been extended by 4 days.
Performance is still degraded over normal operation due to the impact of the remaining parity calculations. Data are now generally safe. The /scratch filesystem is for scratch data and is not backed up. For a listing of the scratch policies, visit our scratch page.
Wednesday, August 7, 2013
/scratch problems
While trying to swap out a head that was exhibiting problems we had a SAS card failure.
This failure caused 60 drives to disappear from the system. Because of the ungraceful way the drives were removed, this took away all the redundancy in the raid 6 arrays.
We were able to get the drives back up on the old head (that had been removed) but because they had been missing from the system for 10 minutes, the arrays forced themselves into full rebuild.
Right now scratch has no parity --- none --- and we have 60 drives trying to rebuild on only one head. The other head is up but is not picking up the paths. We have been working with the vendor, DDN, on this.
Right now the head is rebuilding only 30 of the drives (to get us up to raid 5) and will then continue on to raid6.
With only one head working we are CPU bound; the rebuild is going at 1% per hour. We are at risk of losing data until the end of the week, and it will take another week to get back to full raid 6.
Thursday, June 20, 2013
COMSOL no longer available on Nyx/Flux
Recently we have learned more about the license the University has with COMSOL for their software. This license precludes us from installing any COMSOL software where it can be accessed via the Internet. Because Nyx and Flux can be accessed via the Internet using the U-M VPN or one of the ITS or CAEN login hosts, we are not currently in compliance with the license.
To come into compliance with the COMSOL license, we will be removing COMSOL from Nyx and Flux on June 15, 2013. COMSOL will still be available in CAEN Computing Labs.
At this point, we are unable to provide access to COMSOL on Nyx, Flux, or machines that allow access from the internet at large, even at one or more removes.
You might want to contact the COMSOL representative, Siva Hariharan <siva@comsol.com>, and ask about this. I can't guarantee it, but if a license holder has a license server of their own and asks us to install the software for use with their own license, we may be able to work something out.
It appears that COMSOL thinks it is OK to run their software from Amazon, however, if that's an option for you. There is information about that on these pages: https://aws.amazon.com/marketplace/pp/B00A41KQUY/ and http://www.comsol.com/ec2_manual/
We have no experience with that, and it appears that you need to provide your own license server, as the bottom of the AWS page says: "Refund Policy: This is a BYOL product. See http://www.comsol.com/sla"
If you have access to CAEN lab machines, you may be able to use COMSOL on them; if so, the CAEN Hotline can direct you to the highest-powered ones.
If your group has its own network license and license server for COMSOL, please let us know.