Hadoop Fundamentals: Bulk Ingesting Data Into Apache HBase

The fastest way to get data into HBase is to bypass the normal write path: generate the cells to be stored as HFiles (probably through a MapReduce job), and then bulk ingest those files directly into the table.

To ensure that the output files created by the MapReduce job are ready for HBase ingestion, be sure to use the HFileOutputFormat.

One limitation of the HFileOutputFormat is that it can only write to one column family at a time, so you will need to run a separate MapReduce job for each column family stored in your table. Multiple column qualifiers, however, can be written in a single job, as the sketch below shows.
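To make that concrete, here is a minimal sketch of such a job. Everything schema-specific is a hypothetical stand-in: the comma-separated input layout, the mapper and class names, and the column family "cf". The load-bearing pieces are the ImmutableBytesWritable/KeyValue map output types and the call to HFileOutputFormat.configureIncrementalLoad, which wires in the partitioner, the sort reducer, and one reduce task per table region so the resulting HFiles line up with region boundaries.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkIngestJob {

  // Hypothetical mapper: parses "rowkey,qualifier,value" lines and emits
  // one KeyValue per line, all within a single column family.
  static class HFileMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private static final byte[] FAMILY = Bytes.toBytes("cf"); // hypothetical family name

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      byte[] row = Bytes.toBytes(fields[0]);
      KeyValue kv = new KeyValue(row, FAMILY,
          Bytes.toBytes(fields[1]), Bytes.toBytes(fields[2]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-bulk-ingest");
    job.setJarByClass(BulkIngestJob.class);

    job.setMapperClass(HFileMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw input data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output folder

    // Inspects the table's current region boundaries and configures the
    // TotalOrderPartitioner, sort reducer, and HFileOutputFormat to match.
    HTable table = new HTable(conf, args[2]);
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}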

Before ingesting the data using the completebulkload tool, you need to change the permissions of the folder containing your HFileOutputFormat output. The load step moves these files into HBase's own directories, and the region servers typically run as the hbase user, so that user needs write access:

hadoop fs -chmod -R a+w <folder>

To actually load these files into HBase, we need to run the completebulkload tool. The tool ships with HBase, takes the HFileOutputFormat output directory followed by the target table name, and can be executed in the following manner:

HADOOP_CLASSPATH=`/usr/lib/hbase/bin/hbase classpath` /usr/lib/hadoop/bin/hadoop jar /usr/lib/hbase/hbase-0.90.6-cdh3u6.jar completebulkload /tmp/ingested/route stop

Since the HFileOutputFormat is only able to write one column family at a time, the completebulkload step must be repeated for each column family's output directory. A programmatic sketch of that loop follows.
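If you prefer to drive the repeated loads from code rather than the shell, completebulkload is backed by the LoadIncrementalHFiles class, which can be invoked directly. Below is a minimal sketch; the directory paths are hypothetical stand-ins for your per-family job outputs, and it assumes the LoadIncrementalHFiles(Configuration) constructor and doBulkLoad(Path, HTable) method available in this HBase version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadAllFamilies {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, args[0]); // target table name

    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);

    // One HFileOutputFormat output directory per column family, each
    // produced by its own MapReduce job (paths are hypothetical).
    for (String dir : new String[] { "/tmp/ingested/cf1", "/tmp/ingested/cf2" }) {
      loader.doBulkLoad(new Path(dir), table);
    }
  }
}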
