Thursday, July 2, 2015

Sending Data to Amazon AWS S3 storage

Researchers at UM have numerous storage options available to them on and off campus. In this post we focus on moving data to Amazons AWS cloud storage S3  .  This storage is fast and easily accessible from other AWS resources as well as UM systems.

To use S3 you first need to have an account in AWS and create what are called S3 buckets.  Buckets can be created via the AWS web console or AWS Command Line Interface (CLI) tools on your local systems. Installation and setup instructions are available in the provided link. Below we shall assume this has already been done.

Lets go through a good sample use case of creating a S3 bucket and sending a large backup file to that bucket. First, if you have configured the aws cli tools correctly, it knows your account name and has full access to your S3 resources.

Now create a S3 bucket called "mybackups":

$ aws s3 mb s3://mybackups

To confirm creation and check contents use:

$ aws s3 ls s3://mybackups

Now lets copy the file backup.tar to that bucket:

$ aws s3 cp backup.tar s3://mybackups

In this test case I got 107 MB/s from my laptop which is pretty awesome.  This speed is largely due to two things: 1) the aws s3 cp command can break the file into numerous parts and simultaneously send them to the bucket and 2) the route from UM to AWS is via Internet 2 which can be be 1-10 Gb/s depending on your particular uplink speed to the UM backbone.  I can confirm that doing this from my home computer is exceedingly slow!

Confirm the file is in the backup via

$ aws s3 ls s3://mybackups

Some among you might say I do not have enough space on my system to make a temporary backup tar file. Fear not, you can make nice use of piping unix utilities to avoid this.

$ tar -czf - raeker | aws s3 cp - s3://mybackups/raeker.tgz

Alternatively you can use the aws s3  sync command!  This functions much like the traditional unix rsync command to sync files between a source and target:

$ aws s3 sync my_directory s3://mybackups

Be warned though that if there are lots of files to sync you likely will not get anywhere near the 100 MB/s I got above. Also be advised that AWS charges for operations as well as storage so each file cp/put incurs a request operation towards the $0.005 per 1,000 requests!

You can also use this if you simply need a copy of your local files in a S3 bucket for use in say EC2 instances for computing.

Normally, sync only copies missing or updated files or objects between the source and target. However, you may supply the --delete option to remove files or objects from the target not present in the source.

Of course you can reverse the data flow by making s3://mybackups as the source and local file/folder as target!

In another blog post I will show you how you can automatically archive your s3 object to the considerably cheaper Glacier storage.  Stayed tuned.