Cloudera's New Impala "Drives" Reduced Query Times for Hadoop

Hive is most often relegated to batch analytical processing use cases. Cloudera's Impala SQL engine for Hadoop lets you run both transactional and analytical queries on your Hadoop cluster with a significant reduction in processing time over Hive with MapReduce.

What is Impala?



Impala is a SQL query engine created specifically for Hadoop. It currently supports most of the same SQL functionality as Hive. Unlike Hive, it does not rely on MapReduce to process and return results from the Hadoop data nodes where the Impala services run. The low-level query engine is written in C++ for performance.

Impala is Apache-licensed but not Apache-managed. The source is open, but because the project is not managed by the Apache Software Foundation, changes are controlled by Cloudera, with periodic releases to GitHub. There is an Apache incubator project called Drill, but it looks to be years away.


Impala is suitable for queries that run in seconds to hours, but not for longer-running queries, due to its lack of fault tolerance: in the current release of Impala, if one node fails, the whole query fails. For longer-running queries, it is best to use Hive.

Test Drive your Impala



What do I need?

 
>64-bit OS with at least 6GB of RAM


>Virtualization Software that supports 64-bit VMs

We used Windows 7 64-bit running VirtualBox 64-bit. Cloudera provides virtual machine images for several virtualization platforms.

>A CDH4 Hadoop installation with Hive and Impala (Impala does not run on CDH3.)
There is a demo VM available here from Cloudera that we will use.
As of this writing, it is CentOS 6.2 containing Hive 0.8.0 and Impala 0.6 Beta.

>On Windows you will need:

A tool like 7-Zip to unzip the Cloudera gzipped tar package.

A tool like WinMD5 to check the download's MD5 checksum.
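If your host runs Linux or macOS instead of Windows, you can sanity-check the 64-bit and RAM requirements above from a terminal (a minimal sketch; the exact commands and output format vary by OS):

```shell
# Report the machine architecture; a 64-bit host typically shows x86_64.
uname -m

# Total RAM: /proc/meminfo on Linux, sysctl on macOS.
grep MemTotal /proc/meminfo 2>/dev/null || sysctl -n hw.memsize
```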


How do I get started?



Step 1.  Instead of installing the components from scratch, download the appropriate Impala demo VM from Cloudera here for your virtualization software. Right now this is the beta version of Impala; the GA release is due Q1 2013.



Since the file is 1 GB, it's recommended to make a note of the MD5 listed for the file and check it after downloading.

Step 2.   Check the MD5 of the downloaded file against the MD5 on the Cloudera site.
Step 3.   Unzip the downloaded .gz and the resulting .tar file with 7-Zip or a tool of your choice.
You will attach the extracted virtual hard drive to a new VM in the next step.
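If you are on Linux or macOS rather than Windows, the MD5 check and extraction can be done from the terminal instead of with WinMD5 and 7-Zip (a sketch; the archive name below is a placeholder for the actual file you downloaded):

```shell
# Compute the MD5 of the download and compare it by eye to the value
# shown on the Cloudera download page (filename is a placeholder).
md5sum cloudera-demo-vm.tar.gz   # on macOS, use: md5 cloudera-demo-vm.tar.gz

# Extract the .gz and the inner .tar in one step.
tar xzf cloudera-demo-vm.tar.gz
```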

Step 4.  Create a VM following these steps.

In VirtualBox, select the Machine | New menu and make the following selections.




Select 4 GB of memory for the VM.



Navigate to the virtual hard drive file you extracted and select it.


Make sure the two performance-related options are selected. Press "OK" and you should be able to start the virtual machine.


How do I test drive the Cloudera Impala?


Step 1. Start the VM and log in with the username and password "cloudera".

Step 2. We will load the largest data set that Cloudera supplies, which is over 500 MB.

Go to the impalascripts directory and execute this script to download the data.


[cloudera@localhost impalascripts]$ tpcds-setup.sh

Step 3. We will compare a moderately complex query in Hive and in Impala.


Here is the query pre-loaded on the file system.



Step 3a. Start the query in Hive.

[cloudera@localhost impalascripts]$ hive -f tpcds_query.sql

You can see this query requires the coordination of 6 MapReduce jobs.



Hive history file=/tmp/cloudera/hive_job_log_cloudera_201304102047_1453607919.txt
Total MapReduce jobs = 6
Launching Job 1 out of 6
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1365639753530_0001, Tracking URL = http://localhost.localdomain:8088/proxy/application_1365639753530_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1365639753530_0001


.....


If you aren't addicted to watching console logging from Hive (you know who you are), find something to do for about 5 minutes.

Then come back and see the query results:

...


AAAAAAAAADHBAAAA TN 15.0 179.1300048828125 496.44000244140625 39.400001525878906
AAAAAAAAADHDAAAA TN 98.0 39.29999923706055 0.0 33.790000915527344
AAAAAAAAADIAAAAA TN 27.0 124.51000213623047 0.0 107.06999969482422

Time taken: 368.591 seconds


Step 3b. Run the query with Impala.

To run the query, we can invoke impala-shell directly from the command line:
 
impala-shell --impalad=127.0.0.1:21000 --query_file=/home/cloudera/impalascripts/tpcds_query.sql


Or execute the included script:

[cloudera@localhost impalascripts]$ tpcds-samplequery.sh


...


AAAAAAAAADHBAAAA TN 15 179.1300048828125 496.4400024414062 39.40000152587891
AAAAAAAAADHDAAAA TN 98 39.29999923706055 0 33.79000091552734
AAAAAAAAADIAAAAA TN 27 124.5100021362305 0 107.0699996948242

Returned 100 row(s) in 10.22s


For this particular query, Impala delivers roughly a 36-to-1 reduction in query time over Hive (368.6 seconds versus 10.2 seconds).
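For reference, the ratio comes straight from the two reported timings (368.591 seconds for Hive, 10.22 seconds for Impala):

```shell
# Speedup = Hive elapsed time / Impala elapsed time
awk 'BEGIN { printf "%.1fx\n", 368.591 / 10.22 }'
# → 36.1x
```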


Findings


Performance improvements with Impala will vary by hardware and query type, but in this case Impala reduces the query time for the data set considerably. Queries that require multiple MapReduce jobs in Hive will give Impala even more of an advantage, as shown with this Cloudera Impala demo VM. Impala should definitely be investigated for use cases that require transactional access to data in Hadoop.
