Interacting with HBase: The HBase Shell

HBase is appealing as part of a Hadoop solution because of its fast read times, which come from efficient scanning and retrieval of the data. However, data has to be ingested into HBase before it can be read back, and if ingest is done inefficiently, it can be significantly slower than those reads.

The typical way to efficiently ingest data into HBase is to run one or more MapReduce jobs that transform the data into the required form and write it to HDFS as HFiles using HFileOutputFormat. Depending on the HBase version, one limitation might be that there can only be one column family per HFile. This limitation is easily overcome by running a separate MapReduce job for each column family that needs to be ingested.

Once the data is available in HDFS, it should be further prepared for bulk ingest by adjusting the permissions on the folder. This can be performed using the following command:

hadoop fs -chmod -R a+w <HFilePath>

Once the folder permissions are correct, the "completebulkload" tool must be run to ingest the data into HBase. This utility can only be run against one folder at a time, so if the HFileOutputFormat limitation applies, it should be repeated for each column family that has data. The command bulk ingests each of the HFiles in the folder as a single atomic operation, which is the most efficient way to ingest data into HBase. To invoke the utility, run the following command:

HADOOP_CLASSPATH=`/usr/lib/hbase/bin/hbase classpath` /usr/lib/hadoop/bin/hadoop jar /usr/lib/hbase/hbase-0.90.6-cdh3u6.jar completebulkload <HFilePath>/<columnFamilyDataPath> <tableName>
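
When the one-column-family-per-HFile limitation applies, the invocation above has to be repeated once per column family folder. The following is a minimal sketch of that loop, reusing the paths from the command above. The function name bulk_load_all and the variables DRY_RUN, HBASE_BIN, HADOOP_BIN and HBASE_JAR are illustrative assumptions, not part of HBase. With DRY_RUN left at its default of 1, the function only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: paths are copied from the command above and may differ per install.
HBASE_BIN="${HBASE_BIN:-/usr/lib/hbase/bin/hbase}"
HADOOP_BIN="${HADOOP_BIN:-/usr/lib/hadoop/bin/hadoop}"
HBASE_JAR="${HBASE_JAR:-/usr/lib/hbase/hbase-0.90.6-cdh3u6.jar}"

# Usage: bulk_load_all <HFilePath> <tableName> <columnFamilyDataPath>...
# (hypothetical helper) Runs completebulkload once per column family folder.
bulk_load_all() {
    hfile_path="$1"
    table="$2"
    shift 2
    for cf_dir in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: print the command instead of executing it.
            echo "HADOOP_CLASSPATH=\`$HBASE_BIN classpath\` $HADOOP_BIN jar $HBASE_JAR completebulkload $hfile_path/$cf_dir $table"
        else
            # Real run: stop at the first folder that fails to load.
            HADOOP_CLASSPATH="$("$HBASE_BIN" classpath)" \
                "$HADOOP_BIN" jar "$HBASE_JAR" completebulkload "$hfile_path/$cf_dir" "$table" || return 1
        fi
    done
}
```

Once the printed commands look right, setting DRY_RUN=0 before calling bulk_load_all with the HFile path, the table name and the column family folder names runs the loads for real.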

This command will fail if the schema of the HBase table does not include the column family that is being loaded. However, a failure will not consume the data in HDFS. If the task fails, the HBase schema can be modified, and the completebulkload utility can be run again against the same folder. On successful ingest, the HDFS data will be consumed: the HFiles are moved out of the folder and into HBase.

In order to verify that the data is loaded and accessible, it may be helpful to view the data in HBase. This can be performed using the following commands:

hbase shell
scan '<tableName>', {COLUMNS=>['<columnFamily>'], LIMIT=>3}

This will retrieve all column qualifiers for the specified column family in the first three rows of data.
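
The same check can be scripted rather than typed interactively. The sketch below assumes that the hbase shell accepts commands on standard input, which is worth confirming on the HBase version in use. The helper names build_scan and verify_load are hypothetical.

```shell
#!/bin/sh
# Usage: build_scan <tableName> <columnFamily> <limit>
# (hypothetical helper) Prints the scan command shown above with values filled in.
build_scan() {
    printf "scan '%s', {COLUMNS=>['%s'], LIMIT=>%s}\n" "$1" "$2" "$3"
}

# Usage: verify_load <tableName> <columnFamily> [limit]
# (hypothetical helper) Pipes the scan into the HBase shell; limit defaults to 3.
# Assumes `hbase` is on the PATH and reads commands from standard input.
verify_load() {
    build_scan "$1" "$2" "${3:-3}" | hbase shell
}
```

Calling build_scan on its own prints the command without contacting HBase, which is a convenient way to check the quoting before running verify_load.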
