Sunday, April 26, 2015

Hive: a high-performance replacement for SQL databases

SQL is gaining popularity as more researchers work with structured data.  Rather than reimporting the data every session, it is a significant improvement to load it into a relational database (RDBMS), where it stays persistent and can be queried with SQL.

The problem with standard RDBMS systems is that their algorithms are often serial and hampered by the need to keep transactions (think keeping bank deposits and debits in order) consistent.  These guarantees are known as ACID (atomicity, consistency, isolation, durability).

In many research cases, though, researchers do not need transactions: they have data they simply want to query, or their data is append-only, such as new measurements.  By relaxing the transactional requirements, researchers can use a whole host of new methods that are very scalable.

Enter Apache Hive.  Hive is a data warehouse tool that lets data stored on a Hadoop cluster (such as the cluster at ARC-TS) be queried using SQL syntax.  Even for large tables into the thousands of gigabytes, performance stays consistent.

In this example I have data in CSV format exported from a database.  It has 12 columns and 1,487,169,693 rows, about 880GB of raw data in total.  Once the data is in Hadoop and a Hive table is created over it, I can query it just like any other SQL table.
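Creating the table is a single DDL statement that points Hive at the files already in HDFS.  A minimal sketch, assuming a comma-delimited layout; the column names and the HDFS path here are illustrative, not the actual schema:

```sql
-- Hypothetical schema: the real column names, types and path will differ.
CREATE EXTERNAL TABLE sample_table (
    id          BIGINT,
    recorded_at STRING,
    measurement DOUBLE
    -- ... the remaining columns of the 12-column layout ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/example/sample_data';
```

Because the table is EXTERNAL, Hive reads the CSV files in place; dropping the table removes only the metadata, not the data.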

SELECT COUNT(*) FROM sample_table;

Time taken: 75.875 seconds, Fetched: 1 row(s)

A query like this is a full table scan, since Hive works on the raw text data and must read all of it.  At 75.9 seconds for 880GB, the ARC-TS Hadoop cluster scans the data at roughly 11.6GB/s.  Hive maintains this performance for more complex queries as well.
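That scan rate follows directly from the two numbers above; a quick sanity check on the arithmetic:

```python
# Full table scan: 880 GB of raw CSV read in 75.875 seconds.
data_gb = 880
seconds = 75.875

rate_gb_per_s = data_gb / seconds
print(f"{rate_gb_per_s:.1f} GB/s")  # 11.6 GB/s
```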

SELECT AVG(sample_column) FROM sample_table;
Time taken: 81.488 seconds, Fetched: 1 row(s)

Researchers who work with a lot of structured data will find SQL on Hive intuitive and very powerful, and it removes many of the limits on query performance and data size that a single-server database imposes.

To many researchers, working with SQL or Hadoop is new and daunting, but it is part of the growing big data ecosystem.  Please contact ARC-TS and one of our staff can help you with your data.

File Transfer Tool Performance

On the HPC systems at ARC-TS we have two primary tools for transferring data: scp (secure copy) and Globus (GridFTP).  Other tools like rsync and sftp operate over the same SSH transport and thus will have performance comparable to scp.

So which tool performs best?  We are going to test two cases, each moving data to the XSEDE machine Gordon at SDSC.  One test moves a single large file; the second moves many small files.

Large File Case

For the large file we are moving a single 5GB file from Flux's scratch directory to the Gordon scratch directory.  Both filesystems can move data at GB/s rates, so the network or the tool will be the bottleneck.

scp / gsiscp

[brockp@flux-login2 stripe]$ gsiscp -c arcfour all-out.txt.bz2
india-all-out.txt.bz2                     100% 5091MB  20.5MB/s  25.6MB/s   04:08   

Duration: 4m:08s


Globus

Request Time            : 2015-04-26 22:41:04Z
Completion Time         : 2015-04-26 22:42:44Z
Effective MBits/sec     : 427.089

Duration: 1m:40s  2.5x faster than SCP
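The 2.5x figure can be recovered from the two logs, assuming scp's 5091MB is mebibytes and Globus's Mbits are decimal megabits (the usual conventions for each tool):

```python
# Large-file case: compare scp wall time with the transfer time
# implied by Globus's reported effective throughput.
size_bits = 5091 * 2**20 * 8          # 5091 MiB, as reported by scp
scp_seconds = 4 * 60 + 8              # 4m:08s

globus_mbit_per_s = 427.089           # from the Globus log
globus_seconds = size_bits / (globus_mbit_per_s * 1e6)

speedup = scp_seconds / globus_seconds
print(f"Globus: {globus_seconds:.0f}s, speedup {speedup:.1f}x")
# Globus: 100s, speedup 2.5x
```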

Many File Case

In this case the same 5GB file was split into 5,093 1MB files.  Many may not know that every file carries overhead, and it is well known that moving many small files is much slower than moving one large file of the same total size.  How much impact does this have, and can Globus help?  Read on.
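The per-file cost is easy to demonstrate even locally.  A toy sketch (the sizes here are tiny stand-ins for the 5GB benchmark, and local disk understates the extra network round trips scp pays per file):

```python
import os
import tempfile
import time

payload = os.urandom(1 << 20)   # 1 MiB, the piece size used above
n_pieces = 64                   # small stand-in for the 5,093 pieces

with tempfile.TemporaryDirectory() as d:
    # Write the data as one large file.
    t0 = time.perf_counter()
    with open(os.path.join(d, "big.bin"), "wb") as f:
        for _ in range(n_pieces):
            f.write(payload)
    one_big = time.perf_counter() - t0
    big_size = os.path.getsize(os.path.join(d, "big.bin"))

    # Write the same bytes as many small files: one open/close per
    # file, which over a network becomes one round trip per file.
    t0 = time.perf_counter()
    for i in range(n_pieces):
        with open(os.path.join(d, f"part{i:04d}"), "wb") as f:
            f.write(payload)
    many_small = time.perf_counter() - t0
    small_total = sum(
        os.path.getsize(os.path.join(d, p))
        for p in os.listdir(d) if p.startswith("part")
    )

print(f"same {big_size} bytes: 1 file {one_big:.3f}s, "
      f"{n_pieces} files {many_small:.3f}s")
```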

scp / gsiscp

[brockp@flux-login2 stripe]$ time gsiscp -r -c arcfour iobench
real 28m9.179s

Duration: 28m:09s


Globus

Request Time            : 2015-04-27 00:18:40Z
Completion Time         : 2015-04-27 00:25:30Z
Effective MBits/sec     : 104.423

Duration: 6m:50s  4.1x faster than SCP


Globus provides a significant speedup over scp for both single large files and many smaller files.  The advantage grows as the files get smaller, because scp transfers one file at a time and pays the per-file overhead serially.

Wednesday, April 1, 2015

Xeon Phis for Linear Algebra

Linear algebra is the backbone of many research codes.  So much so that standard libraries were created to support both the primitives (matrix multiply, dot product) and higher-level operations (LU factorization, QR factorization).  These libraries are BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage).
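For concreteness, the workhorse level-3 BLAS routine DGEMM computes C = alpha*A*B + beta*C.  A naive pure-Python reference of that semantics (tuned BLAS computes exactly this, but blocked and vectorized for the cache hierarchy, orders of magnitude faster):

```python
# Naive reference for the BLAS level-3 routine DGEMM:
#   C <- alpha * A @ B + beta * C
def dgemm(alpha, A, B, beta, C):
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(dgemm(1.0, A, B, 0.0, C))   # [[19.0, 22.0], [43.0, 50.0]]
```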

Before I go further: never use the reference Netlib versions of BLAS and LAPACK.  They are not tuned, and the performance difference can be dramatic.  When using these routines you want the CPU vendor's implementation.  They all conform to the same API and are thus interchangeable.

MKL, the Math Kernel Library, is the Intel implementation of BLAS, LAPACK and many other solvers.  Intel went further: MKL can automatically offload routines to Xeon Phi accelerators with no code modification.

module load mkl/11.2
icc dgemm_speed.cpp -mkl -DBLAS3 -DDIM=10000 -DMKL
qsub -I -l nodes=1:ppn=1:mics=1,pmem=20gb -A -q flux -l qos=flux -V

#CPU Only Result
[brockp@nyx5896 blas]$ ./a.out
Size of double: 8
Will use: 2288 MB
 Matrix full in: 7 sec
MKL Runtime: 89 sec.

#Single Phi
[brockp@nyx5896 blas]$ module load xeon-mic

This module sets the following defaults:

        MKL_MIC_ENABLE=1 enables auto MKL offload
        MKL_MIC_DISABLE_HOST_FALLBACK=1 unset if not always using MIC
        MIC_OMP_NUM_THREADS=240 Sane values 240/244 (60 core * 4 threads) 
        OFFLOAD_REPORT=2 Gives lots of details about the device when running; setting to 1 gives less information, unset to suppress

        If your application uses both Compiler Assisted Offload and Automatic Offload then it is strongly recommended to set OFFLOAD_ENABLE_ORSL=1. This env-variable enables the two offload modes to synchronize their accesses to coprocessors.

[brockp@nyx5896 blas]$ ./a.out 
Size of double: 8
Will use: 2288 MB
 Matrix full in: 6 sec
[MKL] [MIC --] [AO Function]    DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]  0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]      9.471196 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      6.818753 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1600000000 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 3200000000 bytes
MKL Runtime: 10 sec.

The key difference to get auto offload is that the xeon-mic module sets MKL_MIC_ENABLE=1, which lets MKL use the Phi (also known as a MIC) to take part in the computation.  In this case, with no code modification, the single matrix multiply was accelerated 8.9x.
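The run's reported figures are internally consistent, which is a quick way to confirm the offload behaved as expected (DIM=10000 comes from the compile line above):

```python
dim = 10000          # -DDIM=10000 from the compile line
double = 8           # bytes per double

# "Will use: 2288 MB": three dense matrices (A, B and C) at once.
mb = 3 * dim * dim * double // 2**20
print(mb)                          # 2288

# "CPU->MIC Data": the two input matrices A and B sent to the card.
print(2 * dim * dim * double)      # 1600000000

# Speedup of the auto-offloaded run over the CPU-only run.
print(89 / 10)                     # 8.9
```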

You can also use more than one Phi, and you can see speedups much greater than this if the data are larger or the routine is more complex, e.g. a full DGESV(), which calls the DGEMM() benchmarked here many times.

XSEDE Research Allocations Due April 15th

The next quarterly round of XSEDE research allocations is due April 15th.

XSEDE provides a set of national HPC and Research CI resources available by proposal to the national research community.

ARC-TS provides support for XSEDE and other large HPC centers, including DOE, NASA and cloud providers.