Storage Integration: Configuring HBase to Work with Hive


The NoSQL model provided by Apache HBase adds performance value to any system seeking storage optimizations within a distributed processing environment.  But when the end goal is a data analytics platform, systems built on HBase may lack the query engine they need to meet the analytical requirements.  Query capabilities can be supplemented by integrating with a SQL-modeled datastore such as Apache Hive.  Though this technique offers an array of advantages in an analytical context, newer versions of HBase sometimes lack the documentation needed to configure such integrations, which leads to extended use of developer and administrator time to iron out the kinks.  Here, we outline the procedure for configuring HBase (0.92+) to work with Apache Hive.

Note: These procedures have been tested on Cloudera Hadoop-0.20.22-cdh3u5 and u6, and on Hortonworks HDP 1.3.0, running on CentOS versions 5.8 and 6.3.

Matching Versions

First, different HBase releases bundle different versions of the Hadoop JAR in their lib directory.  Often, the bundled version differs from the Hadoop version actually running on the system.  Check the version of the JAR in the HBase lib directory.  If it does not match the JAR in the Hadoop home directory, copy the system's hadoop-core JAR into the HBase lib directory:

$ cp -r $HADOOP_HOME/hadoop-core{version}.jar $HBASE_HOME/lib
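Before copying, it can help to see which JAR each side actually has. The following is a minimal sketch, assuming the JARs follow the hadoop-core*.jar naming pattern and sit directly in $HADOOP_HOME and $HBASE_HOME/lib; the helper names hadoop_core_jars and jars_match are hypothetical and not part of HBase or Hadoop.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the hadoop-core JARs in two directories.

# Print the file names of the hadoop-core JARs found directly in directory $1.
hadoop_core_jars() {
    for jar in "$1"/hadoop-core*.jar; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$jar" ] && basename "$jar"
    done
    return 0
}

# Succeed only when both directories hold the same hadoop-core JAR names.
jars_match() {
    [ "$(hadoop_core_jars "$1")" = "$(hadoop_core_jars "$2")" ]
}

# Run the comparison only when both environment variables are set.
if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HBASE_HOME:-}" ]; then
    if jars_match "$HADOOP_HOME" "$HBASE_HOME/lib"; then
        echo "hadoop-core JAR names match"
    else
        echo "mismatch: Hadoop has [$(hadoop_core_jars "$HADOOP_HOME")]," \
             "HBase bundles [$(hadoop_core_jars "$HBASE_HOME/lib")]"
    fi
fi
```

The check compares file names only, so it relies on the version being part of the JAR name. When a mismatch is reported, the older JAR in the HBase lib directory would normally be moved aside so that only one hadoop-core JAR remains on the classpath.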

Declaring Proper Hostname

When working in distributed mode, it is important that HBase is aware of the FQDN of the master server.  Without this information, it is unclear which node holds the master role.  Declare it, along with the ZooKeeper quorum, in HBase's primary configuration file, hbase-site.xml.  Be sure HBase is also set to operate in distributed mode.
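As an illustration, an hbase-site.xml fragment covering these three settings might look like the sketch below. The hostnames are placeholders, and the use of the hbase.master property name here is an assumption based on the -hiveconf hbase.master option used later in this post; check the property names against the defaults shipped with your HBase release.

```xml
<configuration>
  <!-- Run HBase in distributed mode rather than standalone. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Placeholder FQDN and port of the master server (assumed property name). -->
  <property>
    <name>hbase.master</name>
    <value>hbase.example.com:60000</value>
  </property>
  <!-- Placeholder comma-separated list of ZooKeeper quorum hosts. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
</configuration>
```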



Leveraging HBaseStorageHandler and Executing Commands in Hive

For information on using the StorageHandler to work with HBase from within Hive's shell, see Integrating HBase with Hive.

When using Hive to create objects in HBase, Hive must be told which files to use in order to execute and complete the StorageHandler processes.  This is done by specifying the required JAR files in the Hive auxpath.  When starting the Hive shell, execute the following:

$HIVE_HOME/bin/hive --auxpath $HIVE_HOME/{$hive_version}/lib/hive-hbase-handler-{$version}.jar,$HIVE_HOME/lib/hbase-{$hbase_version}.jar,$HIVE_HOME/lib/zookeeper-{$version}.jar,$HIVE_HOME/lib/guava-r09.jar -hiveconf hbase.master=hbase.{$FQDN}.com:60000
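Because the auxpath is a single comma-separated string, a mistyped or missing JAR is easy to overlook. The following is a minimal sketch of a helper that joins JAR paths with commas and stops if one of them does not exist; the function name join_auxpath is hypothetical, and the paths in the usage comment are the same placeholders as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: join JAR paths into a comma-separated auxpath value.
# Prints the joined string, or returns 1 if any listed JAR is missing.
join_auxpath() {
    result=""
    for jar in "$@"; do
        if [ ! -f "$jar" ]; then
            echo "missing JAR: $jar" >&2
            return 1
        fi
        # Add a comma only between entries, never at the start.
        result="${result:+$result,}$jar"
    done
    printf '%s\n' "$result"
}

# Example usage (placeholder paths, fill in the versions for your install):
#   AUX=$(join_auxpath \
#       "$HIVE_HOME/lib/hbase-{$hbase_version}.jar" \
#       "$HIVE_HOME/lib/zookeeper-{$version}.jar" \
#       "$HIVE_HOME/lib/guava-r09.jar") &&
#   "$HIVE_HOME/bin/hive" --auxpath "$AUX"
```

Failing early keeps a bad path from surfacing later as a class-loading error inside the Hive shell.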

When navigating in and out of the shell, the command can be handled more conveniently through a startup script similar to:


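#!/bin/sh
if [ "$1" = "-a" ]
then
    # Start Hive with the HBase handler JARs on the auxpath.
    $HIVE_HOME/bin/hive --auxpath $HIVE_HOME/{$hive_version}/lib/hive-hbase-handler-{$version}.jar,$HIVE_HOME/lib/hbase-{$hbase_version}.jar,$HIVE_HOME/lib/zookeeper-{$version}.jar,$HIVE_HOME/lib/guava-r09.jar -hiveconf hbase.master=hbase.{$FQDN}.com:60000
elif [ "$1" = "-o" ]
then
    # The body of this branch is not shown in the original; starting Hive without the auxpath is assumed.
    $HIVE_HOME/bin/hive
fi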

Make the script executable and add it to the system's $PATH.  Hive can then be started with either option:
$ hive -a
$ hive -o
