News and updates about the Flux high performance computing cluster at the University of Michigan.
Wednesday, April 30, 2014
Using qselect to make job management easy
There is an easy way to manage many jobs at once: a combination of qselect, to filter your jobs, and subshells makes this a breeze. For the most recent qselect documentation, run man qselect on the login node.
Below is a selection of example qselect filters. All of the options can be combined, making job selection very flexible.
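A few illustrative filters as a sketch (uniqname, myjob, and the queue name are placeholders; see man qselect for the full option list):

qselect -u uniqname          # all of your jobs
qselect -u uniqname -s R     # only your running jobs
qselect -u uniqname -s Q     # only your queued (idle) jobs
qselect -u uniqname -N myjob # your jobs with a particular name
qselect -u uniqname -q flux  # your jobs in a particular queue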
Now that qselect has given us all the job IDs we want, we can use a subshell. A subshell evaluates the command inside it (our qselect) first and feeds that command's output to the outer command. To create a subshell, wrap the command you want to run first in $( ). The commands below cover some of our most common requests.
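For example, a sketch of some common combinations (again with placeholder selections):

qdel $(qselect -u uniqname -s Q)                         # delete all of your queued jobs
qhold $(qselect -u uniqname)                             # hold all of your jobs
qrls $(qselect -u uniqname -s H)                         # release all of your held jobs
qalter -l walltime=24:00:00 $(qselect -u uniqname -s Q)  # change the walltime on your queued jobs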
Users need not stop here: combine qselect with qhold to hold all of your jobs so that other jobs from your Flux account can run. In this case, assume you have jobs queued under two Flux accounts and only want to do this in one of them.
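A sketch of that scenario, with example1_flux standing in for the (hypothetical) account whose jobs you want to hold:

# hold only the jobs submitted under one account, leaving the other account's jobs alone
qhold $(qselect -u uniqname -A example1_flux)
# later, release them again
qrls $(qselect -u uniqname -A example1_flux)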
qselect should make your life much easier for mass job changes. There are a few commands (qmove, for one) that do not take a list of jobs, so this method won't work with them. Contact us if you need one of these commands.
Friday, April 25, 2014
Upcoming High Performance Computing Webinars
High Performance Computing — How Can It Improve My Research?
Where: https://univofmichigan.adobeconnect.com/flux (Select as a Guest)
When: April 28th, 10:30am-11:30am
Learn about what can be accomplished with HPC clusters, such as the ARC Flux cluster and NSF XSEDE machines. We will cover how HPC resources let you get more work done, or enable work that was previously unapproachable. We will use any remaining time to go into the details of how the Flux system on campus is constructed, as an example.
-----------------------
Basic Linux Commands and Remote File Transfer and Access
Where: https://univofmichigan.adobeconnect.com/flux (Select as a Guest)
When: April 30th, 10:30am-11:30am
Aimed at new Linux users, this session covers the basics of the command line. At the end of this session users should feel comfortable manipulating files and navigating the Linux filesystem. Particular focus will be given to connecting to remote Linux systems with SSH from Mac and Windows, using the Flux cluster as an example.
Commands covered: ssh, sftp, globus, ls, mv, pwd, cd, etc.
------------------------
Using Modules and the PBS Batch system
Where: https://univofmichigan.adobeconnect.com/flux (Select as a Guest)
When: May 2nd, 10:30am-11:30am
Users will learn how to use the extensive Flux software library via the Modules system. Modules provides a powerful and flexible way to manipulate a user's environment, easing the management of many software titles on a shared system.
In the remaining time, users will learn how to use PBS, the batch system on Flux, to submit workloads. All users of the cluster must use PBS, and at the end of this session users should feel comfortable submitting basic serial and parallel batch jobs to the cluster.
------------------------
Notes:
Free Flux account required: https://www.engin.umich.edu/form/cacaccountapplication
Sessions are drop-in; attendees need not attend all of them or follow any particular order.
Sessions will be recorded and posted on: https://www.youtube.com/user/UMCoECAC
Wednesday, April 16, 2014
Distributed Memory HFSS Simulations (DDM, DSO)
Rather than post everything here, we have updated our HFSS documentation page.
We are excited about this because enabling Domain Decomposition (DDM) for very large models has been requested many times; previously we were limited to the FluxM service just to get the 40 cores per node. Now users who don't need the large memory footprint can use the standard Flux services and realize savings.
The second group of users are those doing sweeps, or sweeps inside optimization problems. HFSS will now use multiple nodes, farming out different values in your parameter space to several nodes at once. This should be great for those users.
If you want to use HFSS and have questions, please contact hpc-support@umich.edu.
Tuesday, April 15, 2014
High-speed automated file transfers
Do you have a large amount of data that you need to move?
Do you need to move data reliably, even over unreliable networks?
Would you like to start moving data and ignore it until you get a notification that it is complete?
Do you move data often between UM and XSEDE or Flux and your laptop?
Globus can be a tremendous help in any of those scenarios, and we have good support for it on campus, especially on Flux!
The source and destination for file transfers using Globus are called endpoints, and Flux's endpoint supports access to /scratch, files in your Flux home directory, and files in any Value Storage shares that are available to Flux. The other endpoint could be your laptop, an XSEDE site, or even another location on Flux (for example, you can use Globus to move data between a Value Storage share on Flux and /scratch on Flux).
Using Globus is as simple as using a graphical file-transfer client; detailed instructions are at http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp. As always, the HPC support staff on campus are available to help; simply send an email to hpc-support@umich.edu.
Monday, April 14, 2014
The Efficiency of Compute Jobs
One measure of efficiency, the ratio of CPU time to wallclock time, is easily accessible to Flux users. The operating system (Linux) used by Flux reports statistics about compute jobs, and those statistics are in turn reported by the job management (PBS) and job scheduling (Moab) systems back to the owner of the job in the email sent when the job finishes.
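As a quick, made-up illustration of that ratio: a single-core job that used 2.5 hours of CPU time over 3 hours of wallclock time was roughly 83% efficient. From a shell, that works out to:

cput=9000       # CPU seconds used (2:30:00); hypothetical value
walltime=10800  # wallclock seconds (3:00:00); hypothetical value
echo "scale=2; $cput / $walltime" | bc   # prints .83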
- Be smart about improving efficiency. Unless you're developing code that will be run many times for many years, or you suspect you have a serious efficiency problem, the value of your time spent improving efficiency is probably greater than the value of the computer time you'll save.
- Start by looking at reads from and writes to storage. Storage is the slowest I/O on most systems, so minimizing reads and writes can often have a dramatic effect on efficiency and wall-clock time for your programs. If you must read and write data, use the fastest storage that you can: either the local /tmp space on every Flux node or the shared /scratch parallel filesystem (see the sketch after this list).
- Use well-regarded third-party libraries instead of inventing your own. For things like FFTs, matrix algebra, data storage formats, and other common components of scientific or engineering software, making use of third-party libraries can have a large positive effect on the performance and efficiency of your program. Some examples are FFTW for FFTs, MKL for matrix algebra, and HDF5 for data storage; all of these are available on Flux.
- Use a profiler to see where your code is spending its time. The Allinea code profiler MAP is available on Flux and can help guide you to the places in your code where changes will have the biggest effect. MAP will also show MPI network traffic, so you can make sure you aren't spending too much time sending small packets between ranks or blocking progress on some ranks while waiting for another rank to deliver updated data.
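As one sketch of the storage advice above (the program name, account, and paths are placeholders, not a prescription), a batch job can stage its input to node-local /tmp, run against the local copy, and copy results back to /scratch when it finishes:

#PBS -N tmp_staging_example
#PBS -l nodes=1:ppn=1,walltime=2:00:00,pmem=1gb
#PBS -A example_flux
#PBS -q flux
#PBS -l qos=flux
#PBS -V

# stage input from slower shared storage to fast node-local /tmp
mkdir -p /tmp/$USER/$PBS_JOBID
cp $HOME/project/input.dat /tmp/$USER/$PBS_JOBID/

# run against the local copy (my_program is a placeholder)
cd /tmp/$USER/$PBS_JOBID
$HOME/project/my_program input.dat > output.dat

# copy results back to shared storage and clean up the node-local space
cp output.dat /scratch/example_flux/$USER/
rm -rf /tmp/$USER/$PBS_JOBID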
Tuesday, April 1, 2014
VNC Remote Desktop
VNC (Virtual Network Computing) was recently installed on all the Flux cluster nodes, providing virtual desktop access to the cluster and improving the performance of jobs using graphics or a GUI.
While most users should still strive to make their codes work in batch without graphics or a GUI, sometimes you just need to make a plot or generate a mesh in a GUI-only tool, but you still need the horsepower of the cluster.
Traditionally users had to use X11 forwarding with the -X option to qsub. This required a working X server, and while every Linux and Mac user had one, it was slow, worked badly over slow connections, and some applications performed very poorly with it.
VNC is essentially your new Flux desktop. When started inside a batch or interactive job, VNC will start a desktop on the node you were assigned. You then need a VNC client and an SSH tunnel, and you can connect to that desktop.
The first step is to set your VNC password. (NOTE: Use a totally different password for VNC than for any other service. VNC authentication is very insecure and the password is easy to find and crack.) From a login node run:
$ vncpasswd
Now that we have a working password, we need to get a VNC session started on a compute node. You can use an interactive job, and start vncserver there, or you can submit a batch job.
#PBS -N vncjob
#PBS -l nodes=1:ppn=4,walltime=8:00:00,pmem=1gb
#PBS -A example_fluxg
#PBS -l qos=flux
#PBS -q fluxg
#PBS -M uniqname@umich.edu
#PBS -m b
#PBS -V

# vncserver -geometry XxY -fg
#   -geometry 1280x1024  <= default is 1024x768
#   -fg                  <= run in foreground, needed for PBS
# Be sure to set your e-mail in -M and use -m b (mail when the job starts);
# without it you will have to check manually for when the job starts.
vncserver -geometry 1280x1024 -fg
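If you prefer the interactive route mentioned above, a rough equivalent (using the same example account and queue as the batch script) is to request an interactive job and start the server by hand once you get a prompt on the compute node:

$ qsub -I -V -A example_fluxg -q fluxg -l qos=flux -l nodes=1:ppn=4,walltime=8:00:00,pmem=1gb
# ...once the job starts you get a shell on the compute node; then run:
$ vncserver -geometry 1280x1024 -fg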
Wait for the job to start, which you can check with qstat <jobid>; or, if you set -M uniqname@umich.edu -m b, PBS will e-mail you at the beginning of your job letting you know it started. At that point you have a remote desktop running on the first core of your cluster job. Don't leave it idle: if you leave the PBS job running while not using the VNC desktop, you are blocking resources from other users.
Now that a desktop is running (vncserver), we need to create an SSH tunnel to connect to it. You need to tunnel from your local machine via flux-xfer.arc-ts.umich.edu to the first CPU in your batch job. The script below explains how to find both of those and how to start an SSH tunnel from Linux, Mac, or Cygwin.
# Create an SSH tunnel via flux-xfer to the machine with your VNC display.
# This example works for Mac, Linux, and Cygwin.
#
# Find your VNC host and display number. If you have no VNC sessions running,
# you can make this easier by cleaning your .vnc folder first with
#   $ rm $HOME/.vnc/*.log
#   $ rm $HOME/.vnc/*.pid
# prior to submitting the job with your VNC session.
ls -rt $HOME/.vnc/
# e.g.: nyx5330.arc-ts.umich.edu:1.log
#       nyx5330.arc-ts.umich.edu  <== host running vnc
#       1                         <== display number
# The port number is 5900 + display number. The display number starts at 1 and
# increments if there are already VNC sessions running, in which case adjust
# the port in the template below accordingly.
# ssh -L 5901:<host running vnc>:5901 flux-xfer.arc-ts.umich.edu
ssh -L 5901:nyx5330.arc-ts.umich.edu:5901 flux-xfer.arc-ts.umich.edu
Windows users connecting by SSH with PuTTY can follow these instructions. Using the example above, the values would be:
- Source Port: 5901
- Destination: nyx5330.arc-ts.umich.edu:5901
- Host Name (or IP address): flux-login.arc-ts.umich.edu
At this point you should be able to connect a VNC client to localhost:5901, or, if your client uses display numbers, display 1.
Here is a list, certainly not exhaustive, of VNC clients:
- Linux: vncviewer
- Mac: Chicken of the VNC
- Windows/Java: TightVNC
When connecting from a VNC client/viewer, the host should be localhost and the port should be the port to which you forwarded in the prior step, in our example 5901. Some viewers want the display number instead; in that case use the last digit of your port number, in our case 1.
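For example, with the Linux vncviewer listed above, connecting through the tunnel from our example might look like this (display 1 corresponds to local port 5901):

$ vncviewer localhost:1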
At this point the viewer will either ask for your VNC password (set in step 1) before connecting or prompt you for it after you connect. You should now see a desktop with a terminal. You can run any GUI application we have on our nodes. You can even spawn parallel jobs with MPI, as the PBS environment is picked up by VNC.
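For instance, from the terminal on the VNC desktop you might load a module and launch a GUI application (the module name here is just an example; run module avail to see what is installed):

$ module load matlab
$ matlab &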
One of VNC's great features is that you can detach and reattach later. This makes it very useful if your connection might drop, or if you work from a laptop and need to change locations. Using the UMich VPN, you can even create the tunnel from home.
We hope you find that this is a powerful feature giving you access to more of the functionality of Flux's software, and we hope you will find your Flux allocation that much more useful for your research. Below is a video showing the entire process.