Friday, September 23, 2016

Hive on Spark

I have been testing Hive on Spark. One of the main benefits of running Hive on Spark rather than on MapReduce is that queries can run much faster.

I wanted to run a fairly simple query in Hive against the Google Ngrams dataset. First, I ran it on MapReduce, Hive's default execution engine.

SELECT COUNT(*) FROM ngrams;

That query took 1 minute and 12 seconds to run. Next, I decided to try it with Spark, which only required typing:

set hive.execution.engine=spark;

I ran the same query again, but this time it took 3 minutes and 12 seconds, more than two and a half times as long as MapReduce.

It turned out the default settings for Spark on the cluster were pretty conservative, so they needed to be tuned. I wanted to find an “ideal” tuning for our cluster, and I referenced this blog post in order to do so. We needed more executors and more cores per executor so that more tasks could run at once. We also needed to change how much memory each executor was allocated, to take advantage of the cluster's large pool of resources without overwhelming it. Tuning Spark is more of an art than a science, but after testing and tweaking, I found that for the Flux Hadoop cluster, the settings should be 35 executors with 4 cores each and about 5g of memory per executor.

set spark.executor.instances=35;
set spark.executor.cores=4;
set spark.executor.memory=5g;
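
To get a feel for what these three settings ask of the cluster in aggregate, here is a small Python sketch. The function name `requested_resources` is made up for illustration, it assumes each task uses one core, and the memory figure counts executor heap only, ignoring any per-executor overhead the cluster manager adds on top.

```python
def requested_resources(executors, cores_per_executor, memory_gb_per_executor):
    """Return (concurrent task slots, total executor heap in GB)."""
    # Assumption: one task per core, so slots = executors * cores per executor.
    task_slots = executors * cores_per_executor
    # Heap only; any extra per-executor overhead is not counted here.
    total_heap_gb = executors * memory_gb_per_executor
    return task_slots, total_heap_gb


# The values chosen for the Flux Hadoop cluster in this post.
slots, heap_gb = requested_resources(
    executors=35, cores_per_executor=4, memory_gb_per_executor=5
)
print(f"{slots} task slots, {heap_gb} GB of executor heap")
# prints: 140 task slots, 175 GB of executor heap
```

Comparing those totals against what the cluster can actually spare is a quick sanity check before trying a new combination.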


That same query then took only 46 seconds to run. Spark, with all of the settings tuned, was about 36% faster than MapReduce in this example (that is, it took about 36% less time).
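
For reference, the comparison above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper names `to_seconds` and `percent_less_time` are hypothetical, and "faster" is taken to mean the percentage reduction in wall-clock time relative to MapReduce.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' string such as '1:12' to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_less_time(baseline, candidate):
    """Percentage reduction in run time of candidate relative to baseline."""
    base = to_seconds(baseline)
    return 100 * (base - to_seconds(candidate)) / base


# Timings for SELECT COUNT(*) FROM ngrams, as reported in this post.
mapreduce, spark_default, spark_tuned = "1:12", "3:12", "0:46"

slowdown = to_seconds(spark_default) / to_seconds(mapreduce)
print(f"untuned Spark took {slowdown:.2f}x as long as MapReduce")
print(f"tuned Spark took {percent_less_time(mapreduce, spark_tuned):.0f}% less time")
# prints:
# untuned Spark took 2.67x as long as MapReduce
# tuned Spark took 36% less time
```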

Some more examples of queries I ran and how Spark (with tuned settings) compared to MapReduce:


SELECT year, COUNT(ngram) FROM ngrams WHERE volumes = 1 GROUP BY year;


MapReduce: 1:06
Spark: 0:55

Spark was about 17% faster.


We also tested on another, larger dataset; the timings are shown below. It seemed that the larger the data, the greater the difference between MapReduce and Spark.


SELECT COUNT(*) FROM entries;


MapReduce: 1:53
Spark: 1:06

Spark was about 42% faster.


SELECT owner FROM entries WHERE gr_name = 'psych' GROUP BY owner;


MapReduce: 2:54
Spark: 1:01

Spark was about 65% faster.


Clearly, with the right tuning, Spark can be noticeably faster than MapReduce on Hive for large datasets: in these tests, the tuned Spark jobs finished in roughly 17% to 65% less time. Finding the ideal tuning does take some tweaking and testing, but ultimately leads to faster jobs.