HBaseCon 2012 | HBase Filtering - Lars George, Cloudera

HBaseCon, May 2012

HBase Filters
Lars George, Solutions Architect

Agenda

1 Introduction
2 Comparison Filters
3 Dedicated Filters
4 Decorating Filters
5 Combining Filters
6 Custom Filters

©2012 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction
or redistribution without written permission is prohibited.

About Me

•  Solutions Architect @ Cloudera
•  Apache HBase & Whirr Committer
•  Author of
HBase – The Definitive Guide
•  Working with HBase since end
of 2007
•  Organizer of the Munich OpenHUG
•  Speaker at conferences (FOSDEM,
Hadoop World)


Introduction to Filters

•  Used in combination with get() and scan()
API calls
•  Steps:
–  Create Filter instance
–  Create Get or Scan instance
–  Assign Filter to Get or Scan
–  Call API and enjoy
•  More fine-grained control over what is
returned to the client


Filter Features

•  Allow client to further narrow down what is
retrieved
–  Not just per row or column key, or per column
family
•  Predicate Pushdown
–  Move filtering from client to server to reduce
network traffic
•  Varying performance implications,
dependent on the use-case


Filter Pushdown


Filter Features (cont.)

•  Filters have access to the entire row to
decide its fate
–  Access to KeyValue instances to check row keys,
column qualifiers, timestamps, or values
•  Scan batching might conflict with the above
and might trigger an “Incompatible Filter”
exception
–  Example: DependentColumnFilter
•  There is no cross invocation state
–  Cannot filter rows based on dependent rows


Available Filters

•  Many filters are supplied by HBase
–  Based on row key, column family, or column
qualifier
–  Paging through rows and columns
–  Based on dependencies

•  Write your own filters
–  Use FilterBase class to get a no-op
skeleton and fill in the gaps




Comparison Filters

•  Based on CompareFilter class
•  Adds the compare() method to
FilterBase
•  Takes operator that defines how the
comparison is performed
–  Predefined by client API
•  Also needs a comparator to do the actual
check
–  HBase supplies a large set


Comparison Operators


Comparators


Comparison Filters (cont.)

•  Not all combinations of operator and
comparator make sense
–  For example, the SubstringComparator
returns only 0 (match) or 1 (no match)
–  Only EQUAL and NOT_EQUAL are useful
–  Using other operators is allowed but will most
likely yield unexpected results
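The point about operator and comparator combinations can be sketched in plain Java. This is a simplified model, not the HBase API: the class SubstringOperatorDemo and its methods are hypothetical, and the sign convention in include() (the result is "comparator versus stored value", so LESS corresponds to a positive result) is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not HBase classes: all names here are hypothetical.
class SubstringOperatorDemo {
    enum Op { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER }

    // Mimics a substring comparator: 0 when the value contains the substring, 1 otherwise.
    static int substringCompare(String substring, String value) {
        return value.contains(substring) ? 0 : 1;
    }

    // Assumed convention: cmp is "comparator versus stored value", so LESS
    // (stored value sorts before the comparator) corresponds to cmp > 0.
    static boolean include(Op op, int cmp) {
        switch (op) {
            case LESS:             return cmp > 0;
            case LESS_OR_EQUAL:    return cmp >= 0;
            case EQUAL:            return cmp == 0;
            case NOT_EQUAL:        return cmp != 0;
            case GREATER_OR_EQUAL: return cmp <= 0;
            default:               return cmp < 0; // GREATER
        }
    }

    // Returns the values that the given operator lets through.
    static List<String> select(Op op, String substring, List<String> values) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (include(op, substringCompare(substring, v))) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("val-1", "other", "val-2");
        for (Op op : Op.values()) {
            System.out.println(op + " -> " + select(op, "val", values));
        }
    }
}
```

Because the comparator never returns a negative number, only EQUAL and NOT_EQUAL split the values in a meaningful way. Under the assumed convention, LESS_OR_EQUAL lets every value through and GREATER lets none through.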


Comparison Filters (cont.)

•  HBase filters usually filter data out
•  Comparison filters work in reverse as they
include matching data
–  Be mindful when selecting the comparison
operator!


Available Comparison Filters

•  Row Filter
–  Based on row key comparisons
•  Family Filter
–  Based on column family names
•  Qualifier Filter
–  Based on column names, aka qualifiers
•  Value Filter
–  Based on the actual value of a column


Available Comparison Filters (cont.)

•  Dependent Column Filter
–  Based on the timestamp of a reference column
–  Includes all columns that have the same
timestamp
–  Requires access to the entire row, since a
batched scan may not include the reference
column
•  No scanner batching allowed!
–  Useful for loading interdependent changes
within a row
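The timestamp matching described above can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions: Cell and DependentColumnDemo are hypothetical stand-ins rather than HBase classes, the reference column itself is kept in the result, and a row without the reference column yields nothing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model, not HBase classes: all names here are hypothetical.
class DependentColumnDemo {
    // Minimal stand-in for a cell: a column qualifier plus a timestamp.
    static final class Cell {
        final String qualifier;
        final long timestamp;
        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Keeps the cells whose timestamp equals a timestamp of the reference column.
    // The whole row is needed up front, because the reference timestamps must be
    // known before any other cell can be judged.
    static List<String> select(List<Cell> row, String referenceQualifier) {
        Set<Long> referenceStamps = new HashSet<>();
        for (Cell c : row) {
            if (c.qualifier.equals(referenceQualifier)) {
                referenceStamps.add(c.timestamp);
            }
        }
        List<String> out = new ArrayList<>();
        for (Cell c : row) {
            if (referenceStamps.contains(c.timestamp)) {
                out.add(c.qualifier + "@" + c.timestamp);
            }
        }
        return out;
    }
}
```

The first pass over the row is the reason batching does not fit: if a batch carried only part of the row, the reference column might not be in it.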


Example Code

// Assumes table is an open HTable; the classes used come from
// org.apache.hadoop.hbase.client, .filter, and .util.
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("colfam1"),
    Bytes.toBytes("col-0"));
// Keep rows whose key sorts at or before "row-22" (byte-wise).
Filter filter = new RowFilter(
    CompareFilter.CompareOp.LESS_OR_EQUAL,
    new BinaryComparator(Bytes.toBytes("row-22")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
    System.out.println(res);
}
scanner.close();


Example Output
keyvalues={row-1/colfam1:col-0/1301043190260/Put/vlen=7}
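A binary comparison of row keys is byte-wise rather than numeric, which affects which rows satisfy LESS_OR_EQUAL "row-22". The plain-Java sketch below illustrates that ordering. RowKeyOrderDemo is a hypothetical helper, and the sample keys are made up for the illustration; they are not taken from the table used in the example above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not HBase classes: all names here are hypothetical.
class RowKeyOrderDemo {
    // Unsigned byte-by-byte comparison; a shorter key sorts before a longer one
    // that starts with it.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the keys that sort at or before the bound.
    static List<String> lessOrEqual(List<String> rowKeys, String bound) {
        byte[] limit = bound.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (String key : rowKeys) {
            if (compareBytes(key.getBytes(StandardCharsets.UTF_8), limit) <= 0) {
                out.add(key);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "row-1", "row-100", "row-22", "row-23", "row-3", "row-9");
        // prints [row-1, row-100, row-22]
        System.out.println(lessOrEqual(keys, "row-22"));
    }
}
```

In this ordering "row-100" passes the check while "row-3" does not, so zero-padded row keys are the usual choice when numeric order matters.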




Dedicated Filters

•  Based directly on FilterBase class
•  Often less useful for get() calls, since
entire rows are filtered


Available Dedicated Filters

•  Single Column Value Filter
–  Filter rows based on one specific column
–  Extra features
•  “Filter if missing”
•  “Get latest version only”
–  Column must be part of the scan selection
•  Or else it is all or nothing
–  Also needs compare operation and an
optional comparator


Available Dedicated Filters (cont.)

•  Single Column Value Exclude Filter
–  Same as the one before but excludes the
selection column
•  Prefix Filter
–  Based on prefix of row keys
–  Can early out the scan!
•  Combine with start row for best performance


Available Dedicated Filters (cont.)

•  Page Filter
–  Allows pagination through rows
–  Needs to be combined with setting the start row on
subsequent scans
–  Can early out the scan when limit is reached
•  Key Only Filter
–  Drop the value for every column
•  First Key Only Filter
–  Return only the first column key
–  Useful for row counters or "get the newest post" type
applications
–  Can early out rest of row scan



Available Dedicated Filters (cont.)

•  Inclusive Stop Filter
–  As opposed to the exclusive stop row, this
filter will include the final row
•  Timestamp Filter
–  Takes list of timestamps to include in result
•  Column Count Get Filter
–  Used to limit number of columns returned by a
get() call



Available Dedicated Filters (cont.)

•  Column Pagination Filter
–  Allows pagination through the columns within a
row
–  Skips to the offset parameter and returns at most
limit columns
•  Column Prefix Filter
–  Analogous to PrefixFilter, but matches
column qualifiers
•  Random Row Filter
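The offset and limit behaviour of the Column Pagination Filter listed above amounts to taking a window over the columns of each row. A minimal plain-Java sketch, with the hypothetical helper ColumnPageDemo standing in for the filter:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model, not HBase classes: all names here are hypothetical.
class ColumnPageDemo {
    // Skips `offset` columns of a row, then returns at most `limit` columns.
    static List<String> page(List<String> columns, int limit, int offset) {
        if (offset >= columns.size()) {
            return new ArrayList<>();
        }
        int end = Math.min(columns.size(), offset + limit);
        return new ArrayList<>(columns.subList(offset, end));
    }
}
```

For a row with the columns c0 to c5, a limit of 2 and an offset of 3 yields c3 and c4. The same window is applied to every row of the scan.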




Decorating Filters

•  Extend filters to gain additional control
over the returned data
•  Skip Filter
–  Skip entire row when a column is filtered
–  Not all filters are compatible
•  While Match Filter
–  Aborts entire scan once the wrapped filter
indicates a row or column is omitted
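The difference between the two decorators is easiest to see next to the undecorated filter. The plain-Java sketch below models it with a predicate over integer cell values. DecoratorDemo is hypothetical, and dropping the row that contains the failing cell in whileMatch() is a choice of this sketch, not a statement about HBase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Simplified model, not HBase classes: all names here are hypothetical.
class DecoratorDemo {
    // Wrapped filter alone: drops failing cells, keeps the rest of each row.
    static Map<String, List<Integer>> plain(Map<String, List<Integer>> rows, IntPredicate keep) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> row : rows.entrySet()) {
            List<Integer> cells = new ArrayList<>();
            for (int value : row.getValue()) {
                if (keep.test(value)) {
                    cells.add(value);
                }
            }
            if (!cells.isEmpty()) {
                out.put(row.getKey(), cells);
            }
        }
        return out;
    }

    // Skip decoration: one dropped cell removes the entire row.
    static Map<String, List<Integer>> skip(Map<String, List<Integer>> rows, IntPredicate keep) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> row : rows.entrySet()) {
            boolean allKept = true;
            for (int value : row.getValue()) {
                if (!keep.test(value)) {
                    allKept = false;
                }
            }
            if (allKept) {
                out.put(row.getKey(), row.getValue());
            }
        }
        return out;
    }

    // While-match decoration: the scan ends at the first dropped cell.
    static Map<String, List<Integer>> whileMatch(Map<String, List<Integer>> rows, IntPredicate keep) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> row : rows.entrySet()) {
            for (int value : row.getValue()) {
                if (!keep.test(value)) {
                    return out; // abort: later rows are never looked at
                }
            }
            out.put(row.getKey(), row.getValue());
        }
        return out;
    }
}
```

With the rows r1=[1, 2], r2=[3, 0], r3=[4, 5] and a predicate that rejects 0, the plain filter returns all three rows with the 0 removed, the skip variant returns r1 and r3, and the while-match variant returns only r1.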




Combining Filters

•  Implemented by the FilterList class
–  Wraps list of filters into a Filter compatible
class
–  Takes optional operator to decide how to
handle the results of each wrapped filter
(default: MUST_PASS_ALL)


Combining Filters (cont.)

•  Filter lists can contain other filter lists
•  The operator is fixed per list, but nesting lists
allows mixed combinations
•  Using the proper List implementation
helps control the filter execution order


List<Filter> filters = new ArrayList<Filter>();
Filter filter1 = new RowFilter(
    CompareFilter.CompareOp.GREATER_OR_EQUAL,
    new BinaryComparator(Bytes.toBytes("row-03")));
filters.add(filter1);
Filter filter2 = new RowFilter(
    CompareFilter.CompareOp.LESS_OR_EQUAL,
    new BinaryComparator(Bytes.toBytes("row-06")));
filters.add(filter2);
Filter filter3 = new QualifierFilter(
    CompareFilter.CompareOp.EQUAL,
    new RegexStringComparator("col-0[03]"));
filters.add(filter3);
FilterList filterList1 = new FilterList(filters);
…
FilterList filterList2 = new FilterList(
    FilterList.Operator.MUST_PASS_ONE, filters);
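How the two operators change the outcome of the same three filters can be modelled in plain Java. The sketch below is not the HBase API: FilterListDemo is hypothetical, the filters are predicates over (row key, qualifier) pairs, and the 8 x 5 grid of rows and columns is made-up test data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Simplified model, not HBase classes: all names here are hypothetical.
class FilterListDemo {
    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    // Combines cell predicates in list order: all must accept, or one must accept.
    static BiPredicate<String, String> combine(
            Operator op, List<BiPredicate<String, String>> filters) {
        return (row, qualifier) -> {
            for (BiPredicate<String, String> filter : filters) {
                boolean accepted = filter.test(row, qualifier);
                if (op == Operator.MUST_PASS_ALL && !accepted) {
                    return false;
                }
                if (op == Operator.MUST_PASS_ONE && accepted) {
                    return true;
                }
            }
            return op == Operator.MUST_PASS_ALL;
        };
    }

    // The three conditions from the example: row >= "row-03", row <= "row-06",
    // and a qualifier matching col-0[03].
    static List<BiPredicate<String, String>> exampleFilters() {
        List<BiPredicate<String, String>> filters = new ArrayList<>();
        filters.add((row, qualifier) -> row.compareTo("row-03") >= 0);
        filters.add((row, qualifier) -> row.compareTo("row-06") <= 0);
        filters.add((row, qualifier) -> qualifier.matches("col-0[03]"));
        return filters;
    }

    // Counts accepted cells in a grid of rows row-01..row-08 and columns col-00..col-04.
    static int countAccepted(BiPredicate<String, String> filter) {
        int accepted = 0;
        for (int r = 1; r <= 8; r++) {
            for (int c = 0; c <= 4; c++) {
                if (filter.test(String.format("row-%02d", r), String.format("col-%02d", c))) {
                    accepted++;
                }
            }
        }
        return accepted;
    }
}
```

With MUST_PASS_ALL the model accepts 8 of the 40 cells (rows 03 to 06, columns 00 and 03). With MUST_PASS_ONE it accepts all 40, because every row key satisfies at least one of the two row conditions.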




Custom Filter

•  Allows users to add missing filters
•  Either implement Filter interface or use
FilterBase skeleton
•  Provides hooks called at different stages
of the read process


Filter Interface

public interface Filter extends Writable {
    public enum ReturnCode {
        INCLUDE, SKIP, NEXT_COL, NEXT_ROW,
        SEEK_NEXT_USING_HINT }
    public void reset();
    public boolean filterRowKey(byte[] buffer,
        int offset, int length);
    public boolean filterAllRemaining();
    public ReturnCode filterKeyValue(KeyValue v);
    public void filterRow(List<KeyValue> kvs);
    public boolean hasFilterRow();
    public boolean filterRow();
    public KeyValue getNextKeyHint(KeyValue
        currentKV);
}


Filter Return Codes


Merge Reads


Filter Flow

•  Filter hooks are called at
different stages
•  Seeks are done initially to
find the next KeyValue
–  Hint from previous filter
invocation might help
•  Early out checks improve
performance


Example Code
public class CustomFilter extends FilterBase {
    private byte[] value = null;
    // Drop the row unless one of its cells matches the value.
    private boolean filterRow = true;

    public CustomFilter() { super(); }
    public CustomFilter(byte[] value) { this.value = value; }

    // Start each row assuming it will be dropped.
    @Override
    public void reset() { this.filterRow = true; }

    // Remember a match, but include every cell for now.
    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        if (Bytes.compareTo(value, kv.getValue()) == 0) {
            filterRow = false;
        }
        return ReturnCode.INCLUDE;
    }

    // Returning true drops the whole row.
    @Override
    public boolean filterRow() { return filterRow; }
    ...
}
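The order in which the hooks of the custom filter are used can be sketched with a plain-Java stand-in. It is a deliberately reduced model: FilterFlowDemo and ValueMatchFilter are hypothetical, cells are plain strings instead of KeyValue instances, and only the reset, per-cell, and per-row hooks are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model, not HBase classes: all names here are hypothetical.
class FilterFlowDemo {
    // Reduced counterpart of the custom filter: a row survives only if one of
    // its cell values equals the configured value.
    static class ValueMatchFilter {
        private final String value;
        private boolean filterRow = true;

        ValueMatchFilter(String value) { this.value = value; }

        // Clears the per-row state.
        void reset() { filterRow = true; }

        // Inspects one cell; every cell is included, a match is only remembered.
        boolean filterKeyValue(String cellValue) {
            if (value.equals(cellValue)) {
                filterRow = false;
            }
            return true;
        }

        // True means the whole row is dropped.
        boolean filterRow() { return filterRow; }
    }

    // Simplified read loop: reset, inspect every cell, then decide the row's fate.
    static List<String> scan(Map<String, List<String>> rows, ValueMatchFilter filter) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            filter.reset();
            for (String cell : row.getValue()) {
                filter.filterKeyValue(cell);
            }
            if (!filter.filterRow()) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }
}
```

The reset() call at the top of the loop is what keeps a match in one row from leaking into the next. Without it, every row after the first match would be returned as well.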

Deploying Custom Filters

•  Need to provide JAR file with filter class
•  Deploy JAR to RegionServers
•  Add JAR to HBASE_CLASSPATH
•  Restart RegionServers

•  Tip: Testing on a cluster is more involved, so test
on a local machine first


Summary


Summary (cont.)

