Integrating HBase with Hive

Overview 


 Depending on a system's purpose, and general architecture, the induction of multiple structured
 datastores with a Hadoop-based environment may prove beneficial as the use case fluctuates.  As
 access and storage functionalities between datastores differ, the ability to leverage the proper
 capabilities of each when analytical requirements change establishes significant advantage.
 HBase, when serving as the primary datastore, boasts gigantic storage capacity while promoting
 greater overall system performance.  The benefits stretch afar, but with no internal querying
 capability, accessing data as requirements change can become a challenge.  Introduce Apache
 Hive, with a simple storage structure and a native query language, to account for instances in
 which system data needs to be quickly isolated and sampled.  With the proper integration techniques,
 both components may be successfully configured to work in unison, adding substantial analytical
 abilities to the overall platform.



Integration  

 

Internal 

 

This section assumes that the user understands basic Hive and HBase functionalities, may introduce
both packages to the system, and understands the installation processes.

At the most fundamental level, Hive may communicate with HBase via the HBaseStorageHandler.
Through this, specific commands within Hive may be implemented to create new tables within HBase. 

Assume there exists "customer" data, in which it becomes necessary to create an HBase table accessed through Hive, the following command:


CREATE TABLE hbase_customer (key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.
HBaseStorageHandler'
WITH SERDEPROPERTIES ("
hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("
hbase.table.name" = "customer");


will create an HBase table with the name "customer," but accessed through Hive using the alias
hbase_customer.  If it becomes necessary to execute and aggregate or sampling query on this
table, the operation may be performed with something like the following:


select count(*) from hbase_customer;

OR

select count(*) from hbase_customer  value LIKE ‘M%';
 

In general, using derivatives of the technique shown above, data structures may be generated in HBase, and accessed through Hive's internal functions.


External 

 

Data structures, specifically tables, pre-existing within HBase may also be accessed externally using
the following process:


CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES("hbase.table.name" = "some_existing_table");


Querying HBase - General Points 

 

When using Hive to access an HBase schema, performance is optimized using HBase filters 
in terms of SQL construction. Identifiers such as "<" and ">" are leveraged to minimize the scanning area in which Hive has to search for the result.  The idea becomes concrete through the following example in which "customer last_name" serves as a key and "customer first_name" as a column:



CREATE TABLE hbase_customer_by_last (key string, firstn string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:firstn")
TBLPROPERTIES ("hbase.table.name" = "customer_by_last");


INSERT OVERWRITE TABLE hbase_customer_by_last SELECT CONCAT ( c_last_name, '|', c_customer_sk ) as key, c_first_name from customer;


The query should leverage the  ">" and "<" operators on the row key to set up the start and stop rows in the table scan.


Again, a row count can be generated through Hive by running the following:


select count(*) from hbase_customer_by_last where key >= 'M' and key < 'N' and firstn LIKE 'C%';


 





No comments:

Post a Comment