Interacting with HBase: Complex Filtering Through the HBase Shell

The HBase shell provides direct, interactive access to the raw data stored in HBase. It covers the primary administrative and data operations, and all of the Scanner functionality available through the API is also available from the command line, although the syntax is specific and may differ from what you expect.



To show all tables available in HBase:

list

To drop a table:

disable '<tableName>'
drop '<tableName>'

To add a column family to a table:

disable '<tableName>'
alter '<tableName>', {NAME => '<columnFamily>'}
enable '<tableName>'

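To remove a column family from a table (if your HBase version requires it, disable the table first and enable it again afterwards, as when adding a column family):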

alter '<tableName>', {NAME => '<columnFamily>', METHOD => 'delete'}

To perform a simple scan of a table:

scan '<tableName>'

This will print all columns in all rows contained within the table. To limit the number of rows returned:

scan '<tableName>', {LIMIT=>1}

To also restrict the output to specific column families (each entry may also be given as '<columnFamily>:<qualifier>' to select a single column):

scan '<tableName>', {COLUMNS=>['<columnFamily>','<columnFamily>'], LIMIT=>1}

One primary benefit of using HBase as part of a Hadoop solution is the ability to scan efficiently using filters configured to retrieve only the data that is necessary. This functionality is also available through the HBase shell.

There are several filters available through the HBase API. The list is accessible here:
http://hbase.apache.org/book.html#client.filter

To use one of these filters in the HBase shell, every Java class involved, including helper classes whose static methods you call, must be referenced by its fully qualified name. For example, to use the ColumnPrefixFilter so that a scan returns only the columns whose names begin with a particular prefix:

scan '<tableName>', {FILTER => org.apache.hadoop.hbase.filter.ColumnPrefixFilter.new(org.apache.hadoop.hbase.util.Bytes.toBytes('<prefix>'))}
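As a hedged aside: HBase releases that include the filter language (the shell's built-in help for scan describes it) also accept the filter as a string, which avoids the fully qualified class names. The lines below are a sketch under that assumption, using the same placeholders as above. The second line combines two filters with AND and adds a LIMIT purely for illustration.

```ruby
# Run inside the HBase shell, not plain Ruby.
# String form of the same ColumnPrefixFilter scan (assumes filter-language support).
scan '<tableName>', {FILTER => "ColumnPrefixFilter('<prefix>')"}

# Combine with KeyOnlyFilter to list matching column names without their values.
scan '<tableName>', {FILTER => "ColumnPrefixFilter('<prefix>') AND KeyOnlyFilter()", LIMIT => 10}
```

If the string form is rejected by your version, the fully qualified form shown above remains the fallback.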

2 comments:

  1. Thanks for the info. Is there something like ColumnPrefixFilter one could use knowing only the suffix of the column name?

  2. Bruce, since HBase keeps rows, and the columns within each row, in lexicographic order, filtering by suffix will not be efficient. The ColumnPrefixFilter is desirable because it can skip over columns that aren't relevant.

    Of course, HBase supports custom filters, so it's possible to create your own ColumnSuffixFilter if that is something that your application requires.

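To illustrate the point made in the reply above, here is a toy model in plain Ruby (the language the HBase shell is built on). It is not HBase code: prefix_scan, suffix_scan, and the sample qualifiers are hypothetical names chosen for this sketch. It assumes only that the column qualifiers within a row are kept in sorted order, so a prefix match can seek to the first candidate and stop at the first non-match, while a suffix match has to examine every qualifier.

```ruby
# Toy model only: an Array of sorted strings stands in for the
# column qualifiers of one HBase row. Nothing here calls HBase.

# Returns [matching_qualifiers, number_of_qualifiers_examined].
def prefix_scan(sorted_qualifiers, prefix)
  # "Seek": binary search to the first qualifier >= prefix.
  start = sorted_qualifiers.bsearch_index { |q| q >= prefix } || sorted_qualifiers.length
  matches = []
  examined = 0
  sorted_qualifiers.drop(start).each do |q|
    examined += 1
    # Sorted order guarantees no later qualifier can match once one fails.
    break unless q.start_with?(prefix)
    matches << q
  end
  [matches, examined]
end

# A suffix gives no ordering guarantee, so every qualifier must be checked.
def suffix_scan(qualifiers, suffix)
  matches = qualifiers.select { |q| q.end_with?(suffix) }
  [matches, qualifiers.length]
end

qualifiers = %w[addr_city addr_zip name_first name_last phone_home phone_work].sort

p prefix_scan(qualifiers, 'name_')  # => [["name_first", "name_last"], 3]
p suffix_scan(qualifiers, '_home')  # => [["phone_home"], 6]
```

In this small example the prefix scan examines 3 of the 6 qualifiers and the suffix scan examines all 6. The gap widens as rows get wider, which is why a custom suffix filter would work but could not skip data in the same way.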