
Thursday, September 17, 2015

Globus endpoint sharing available to UM researchers

We have described in a number of blog posts some features and benefits of using the Globus File Transfer service.  Now that UM is a Globus Provider, you have a new feature available to you: sharing directories and files with collaborators who are also Globus users.

There are two avenues of sharing available to you now: via standard server endpoints that have sharing enabled, and via "Globus Connect Personal" client endpoints. Today I will describe sharing for standard server endpoints only. Sharing for Globus Connect Personal endpoints is a bit more complicated due to differences between OS versions of the client and will be described later.

To see if the endpoint you use has sharing enabled, navigate to the endpoint in "Manage Endpoints" within the Globus web interface and click on the Sharing tab; note that you may have to Activate (log in to) a session on the endpoint first.  If sharing is enabled you will be told so and will see an "Add Shared Endpoint" button in the panel.  Shared endpoints are essentially sub-endpoints that you can create and grant any other Globus user access to.

Let's go ahead and make a shared endpoint from umich#flux by clicking on the button.  You are presented with a web form asking for the required information:

Host Path  ~/Test_share

You can either give a complete absolute path or use the unix shorthand (~/) for your home directory, as I have done (make sure the shared directory exists first!).

New Endpoint Name   traeker#Test_share

Description   Let others know what this share is about.


Clicking on the "Create and Manage permissions" button creates the shared endpoint and presents you with a new panel to manage permissions. It shows the current access settings, and clicking on the "Add Permission" button presents a number of options for how to share this endpoint with other Globus users:

Share With     choose one of email, user, group, or all users

Permissions    check one or both of read and write


A couple of things to keep in mind as you set these parameters:
  • Be careful about choosing all users, as this will allow anyone logged into Globus to access this share.
  • By default only read permission is set. If you allow write permission, others could upload files containing viruses, and their uploads could also push you over any disk usage quotas.
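
If you script your data management, the Globus command-line interface can perform the same two steps. The commands below are only a rough sketch, assuming the globus-cli package is installed and using placeholder UUIDs, paths, and a placeholder collaborator address; check globus --help for the exact syntax of your version:

# Create a shared endpoint rooted at ~/Test_share on the host endpoint (UUIDs are placeholders)
$ globus endpoint create --shared HOST_ENDPOINT_UUID:~/Test_share "traeker#Test_share"

# Grant a single collaborator read access to the new shared endpoint
$ globus endpoint permission create SHARED_ENDPOINT_UUID:/ --permissions r --identity collaborator@example.edu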


One easy way to manage permissions for a large group of people is to create a Globus group and populate it with users.  Be advised that the entire group will have the same permissions, so if you need some users to have different permissions you will have to either create a separate group or add those users to the share individually.  Groups come in handy when you share multiple directories with similar sets of collaborators.

Once a directory is shared with another Globus user, he or she can find the endpoint name via the "shared with me" filter at the top of the endpoint list panel.  With the name in hand, they can transfer files to or from that endpoint by typing the name into the "Transfer Files" screen, just like any other endpoint they have access to.
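
Collaborators who prefer the command line can drive the same transfer with the Globus CLI. Again, this is only a sketch with placeholder UUIDs and paths, not a literal recipe:

$ globus transfer SHARED_ENDPOINT_UUID:/results.tar DEST_ENDPOINT_UUID:/home/user/results.tar --label "copy from shared endpoint"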

You can go back to this shared endpoint at any time to add new access permissions or edit existing ones.


Globus endpoint sharing is very powerful, as it gives non-UM collaborators access to your research data without you having to create a UM "Sponsored account" for them to access your systems.  In that respect it is similar to cloud file sharing services like Box and Dropbox.  The big difference is that Globus does not store the data itself, so quotas are governed by your own systems' policies.



Saturday, August 22, 2015

Flux High-Speed Data Transfer Service

Do you have a large data set on your own storage equipment that you would like to process on Flux? We can accommodate up to 40 gigabits per second of data transfer in and out of Flux via the campus Ethernet backbone. There is no additional cost to use this service, but you do need to contact us in order to set it up.

By default, network traffic between Flux compute nodes and other systems on campus takes place over standard one gigabit Ethernet connections. This is sufficient for modest amounts of traffic such as that generated by administrative tasks, monitoring, and home directory access.

Traffic between Flux and its high-speed /scratch filesystem runs over a separate 40 gigabit per second InfiniBand network within the datacenter, and data between Flux and off-campus systems on the Internet can be staged through our transfer server at up to 10 gigabits per second. This would seem to leave a gap though: what if you want direct high-speed connections between the Flux nodes and other systems on campus? We provide such connections using a Mellanox BX5020 InfiniBand/Ethernet gateway:

The Flux BX5020 Gateway

The gateway connects to the Flux InfiniBand network and to the campus Ethernet network and allows traffic to flow between the two networks. The InfiniBand network runs at 40 gigabits per second, and the gateway has four 10 gigabit links to the campus Ethernet network. This allows any Flux node to communicate with any system on campus at up to 40 gbit/s.

We have a customer with multiple petabytes of data on their own storage equipment who has been using Flux to process it. We mount this customer's NFS servers on Flux and route the traffic through the gateway. The customer is currently running jobs on Flux against two of their 10-gigabit connected servers, and last weekend they reached a sustained data transfer rate into Flux of 14.3 gigabits per second.

Gateway traffic for the week of 8/11/2015 - 8/18/2015
Although we have pushed more than 14 gbit/s through the gateway during testing, this is a new record for production traffic through the system.

Our gateway is currently connected to the Ethernet network at 40 gigabits per second, but it can be readily expanded to 80 and possibly 120 gigabits per second as needed. Additionally, we plan to replace the existing gateway in the near future with newer equipment. The planned initial bandwidth for the new equipment is 160 gbit/s, and there is room for growth even beyond that.

No changes to your network configuration are needed to use the gateway; those changes take place on our end only. All you have to do is export your storage to our IP ranges. If you want to discuss or get set up for this service, please let us know! Our email address is hpc-support@umich.edu and we will be happy to answer any questions you have.
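
For example, if your data lives on an NFS server, the change on your side usually amounts to one extra line in /etc/exports. The path and subnet below are placeholders only; we will supply the actual Flux IP ranges when you contact us:

# /etc/exports: export the data directory to the Flux IP range (placeholder subnet)
/export/mydata   192.0.2.0/24(rw,sync,no_subtree_check)

# then reload the export table
$ sudo exportfs -ra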

If you are interested in the technical details of how the gateway works, this presentation from Mellanox on the Ethernet over InfiniBand (EoIB) technology used by the system should prove informative. There is no need to know anything about EoIB in order to use the service; the link is provided strictly for the curious.

Thursday, July 2, 2015

Sending Data to Amazon AWS S3 storage

Researchers at UM have numerous storage options available to them on and off campus. In this post we focus on moving data to Amazon's AWS cloud storage service, S3.  This storage is fast and easily accessible from other AWS resources as well as UM systems.


To use S3 you first need an AWS account and must create what are called S3 buckets.  Buckets can be created via the AWS web console or the AWS Command Line Interface (CLI) tools on your local systems. Installation and setup instructions are available in the provided link. Below we shall assume this has already been done.

Let's go through a good sample use case: creating an S3 bucket and sending a large backup file to it. First, if you have configured the AWS CLI tools correctly, they know your account credentials and have full access to your S3 resources.
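
For reference, the one-time setup looks roughly like this (assuming a pip-based install; the access keys come from the IAM section of your AWS console):

$ pip install awscli
$ aws configure
AWS Access Key ID [None]: <your access key>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]: us-east-1
Default output format [None]: json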


Now create an S3 bucket called "mybackups":

$ aws s3 mb s3://mybackups


To confirm creation and check contents use:

$ aws s3 ls s3://mybackups

Now let's copy the file backup.tar to that bucket:

$ aws s3 cp backup.tar s3://mybackups

In this test case I got 107 MB/s from my laptop, which is pretty awesome.  This speed is largely due to two things: 1) the aws s3 cp command can break the file into numerous parts and send them to the bucket simultaneously, and 2) the route from UM to AWS is via Internet2, which can be 1-10 Gb/s depending on your particular uplink speed to the UM backbone.  I can confirm that doing this from my home computer is exceedingly slow!
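
If you want to experiment with that parallelism, the CLI lets you tune the multipart upload behavior with aws configure set. The values below are just illustrative; the defaults are usually fine:

$ aws configure set default.s3.max_concurrent_requests 20
$ aws configure set default.s3.multipart_chunksize 64MB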


Confirm the file is in the bucket via:

$ aws s3 ls s3://mybackups

Some among you might say, "I do not have enough space on my system to make a temporary backup tar file." Fear not: you can make nice use of unix piping to avoid this.

$ tar -czf - raeker | aws s3 cp - s3://mybackups/raeker.tgz
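
The same trick works in reverse when you need to restore: aws s3 cp can stream the object to stdout and tar unpacks it on the fly.

$ aws s3 cp s3://mybackups/raeker.tgz - | tar -xzf -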

Alternatively you can use the aws s3 sync command!  This functions much like the traditional unix rsync command to sync files between a source and target:

$ aws s3 sync my_directory s3://mybackups

Be warned, though, that if there are lots of files to sync you likely will not get anywhere near the 100 MB/s I got above. Also be advised that AWS charges for operations as well as storage, so each file copy (PUT) counts as a request toward the $0.005 per 1,000 requests charge!
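
For example, syncing a directory tree of 200,000 small files works out to roughly 200 × $0.005 = $1.00 in request charges on top of the storage itself, whereas the same data bundled into a single tar file is essentially one request.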

You can also use sync if you simply need a copy of your local files in an S3 bucket for use in, say, EC2 instances for computing.

Normally, sync only copies missing or updated files or objects between the source and target. However, you may supply the --delete option to remove files or objects from the target not present in the source.
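
For example, to make the bucket mirror the local directory exactly, deleting anything in the bucket that you have removed locally:

$ aws s3 sync my_directory s3://mybackups --delete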


Of course you can reverse the data flow by making s3://mybackups the source and a local file or folder the target!
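
For instance, to pull the contents of the bucket down into a local directory:

$ aws s3 sync s3://mybackups my_directory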

In another blog post I will show you how you can automatically archive your S3 objects to the considerably cheaper Glacier storage.  Stay tuned.