The problem with standard RDBMSs is that their algorithms are often serial and hampered by the need to keep transactions consistent (think keeping bank deposits and debits in order); these guarantees are collectively known as ACID.
In many research cases, though, transactions are unnecessary: researchers have data and just want to query it, or the data is append-only, such as new measurements. By relaxing the transactional requirements, researchers can use a whole host of new methods that scale very well.
Enter Apache Hive. Hive is a data warehouse tool that lets data on a Hadoop cluster (such as the cluster at ARC-TS) be queried using SQL syntax. Even for large tables reaching into the thousands of gigabytes, performance is consistent.
In this example I have data in CSV format exported from a database. It has 12 columns and 1,487,169,693 rows, about 880GB of raw data. Once I have the data in Hadoop and create a Hive table over it, I can use Hive to query it just as any other SQL table.
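Creating the table is a one-time step. A minimal sketch of what it might look like for CSV files already sitting in an HDFS directory (the column names and path below are hypothetical; only the table name and the 12-column CSV layout come from the example above):

-- Map existing CSV files in HDFS to a Hive table without moving them.
-- EXTERNAL means Hive reads the files in place, and dropping the table
-- later leaves the raw data untouched.
CREATE EXTERNAL TABLE sample_table (
  record_id BIGINT,       -- hypothetical column
  sample_column DOUBLE,   -- the column averaged below
  measured_at STRING      -- hypothetical column; 9 more omitted here
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/path/to/csv/data';  -- hypothetical HDFS directory

With the table defined, ordinary SQL works immediately: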
SELECT COUNT(*) FROM sample_table;
OK
1487169693
Time taken: 75.875 seconds, Fetched: 1 row(s)
At 75.9 seconds for a full table scan (Hive works on the raw text data, so a query like this must read every row), the ARC-TS Hadoop cluster is scanning the data at about 11GB/s. Hive maintains this performance for more complex queries as well.
SELECT AVG(sample_column) FROM sample_table;
OK
0.011386917827452752
Time taken: 81.488 seconds, Fetched: 1 row(s)
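The same pattern extends to grouped aggregates, which Hive compiles into distributed jobs across the cluster just like the scans above. A hypothetical example (the group_column name is illustrative, not from the real table):

-- Hypothetical grouped aggregate over the same table; each group's
-- average is computed in parallel across the cluster's nodes.
SELECT group_column, AVG(sample_column)
FROM sample_table
GROUP BY group_column;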
Researchers who work with a lot of structured data will find SQL on Hive intuitive and very powerful, and it removes the practical limits on query performance and data size that single-machine solutions impose.
To many researchers, working with SQL or Hadoop is new and daunting, but it is part of the growing BigData ecosystem. Please contact ARC-TS at hpc-support@umich.edu and one of our staff can help you with your data.