Astro logo

DATA POTION

Big Data and Artificial Intelligence 🤖

How to filter columns in HBase Shell

How to filter columns in HBase Shell

One of the most useful functions available in HBase are filters. In this article, I will show you how to use filters to get lookup columns with specific values. Every example is shown in Hbase Shell, but they are also available in the API, so you can use it in your ETL application as well.

First, let’s check which filters are available for us from Hbase Shell. To do this use show_filters command in Hbase Shell:

As you can see there are multiple filters available but we are focusing on Column Filters, so 3 that are useful for us are ColumnValueFilter, ValueFilter, and SingleColumnValueFilters. All of these filters have very similar properties and all of them can be used for filtering columns.

Let’s assume we have a table named myTable with the following contents.

Rowkey myColumnFamily:columnA myColumnFamily:columnB
row1 value1
row2 value2
row3 value3 valueX
row3_special value4
hbase(main):003:0> scan 'myTable'
ROW                             COLUMN+CELL
 row1                           column=myColumnFamily:columnA, timestamp=1656785481871, value=value1
 row2                           column=myColumnFamily:columnA, timestamp=1656785500397, value=value2
 row3                           column=myColumnFamily:columnA, timestamp=1656785504786, value=value3
 special_row3                   column=myColumnFamily:columnA, timestamp=1656785539020, value=value4

ColumnValueFilter

The first option to filter values is to use ColumnValueFilter. As an argument we pass column family and column name in which we want to search, and filtering argument in this example we search for a value that equals value1.

scan 'myTable', {FILTER => "ColumnValueFilter('myColumnFamily','columnA',=,'binary:value1')"}

ColumnValueFilter will return only columns that match our filtering statement.

hbase(main):010:0> scan 'myTable', {FILTER => "ColumnValueFilter('myColumnFamily','columnA',=,'binary:value1')"}
ROW                             COLUMN+CELL\n
 row1                           column=myColumnFamily:columnA,\n timestamp=1656785481871, value=value1\n
1 row(s)\n
Took 0.0188 seconds\n
hbase(main):011:0>

SingleColumnValueFilter

SingleColumnValueFilter returns an entire row when specified condition is matched. Besides that it works exactly like ColumnValueFilter.

This example also shows another form of filtering comparison - when only the substring of the value matches it returns this value.

hbase(main):022:0> scan 'myTable', {FILTER => "SingleColumnValueFilter('myColumnFamily','columnA',=,'substring:value')"}

ROW                             COLUMN+CELL\n
 row1                           column=myColumnFamily:columnA, timestamp=1656785481871, value=value1\n
 row1                           column=myColumnFamily:columnB, timestamp=1656802766312, value=valueX\n
 row2                           column=myColumnFamily:columnA, timestamp=1656785500397, value=value2\n
 row3                           column=myColumnFamily:columnA, timestamp=1656785504786, value=value3\n
 special_row3                   column=myColumnFamily:columnA, timestamp=1656785539020, value=value4\n
4 row(s)\n
Took 0.0253 seconds\n
hbase(main):023:0> scan 'myTable', {FILTER => "SingleColumnValueFilter('myColumnFamily','columnA',=,'substring:X')"}
ROW                             COLUMN+CELL
0 row(s)
Took 0.0047 seconds

⚠️WARNING ⚠️ If the column is not found every column from searched row will be filtered

ValueFilter

In ValueFilter we don’t provide the column family and column name, just the filter statement. Which makes it search in every column possible. ValueFilter return only rows that match statement.

From ValueFilter documentation:

To test the value of a single qualifier when scanning multiple qualifiers, use link SingleColumnValueFilter

hbase(main):004:0> scan 'myTable', {FILTER => "ValueFilter(=,'binary:valueX')"}
ROW                                   COLUMN+CELL
 row1                                 column=myColumnFamily:columnB, timestamp=1656802766312, value=valueX
1 row(s)
Took 0.0106 seconds

Conclusion

  • ValueFilters or specific column values in every column
  • SingleValueColumnFilter returns only matched row in a specified column
  • ColumnValueFilter returns every row in case it matched the row in a specified column