HBase Secondary Indexing
Implementation and analysis of secondary index in HBase
Gino McCarty, Ranjan Kumar
What’s next
● HBase Introduction
● HBase Scan
● Secondary index
● CoProcessors
● Secondary Index using CoProcessors
● Testing Infrastructure
● Benchmarks
● Challenges and conclusions
HBase
● Open source, non-relational, distributed database built on top of Hadoop.
● Column-family-oriented data store, often described as columnar or tabular.
● Architecture - Hadoop, HDFS, Row keys, Column Families
Image source: http://blogs.igalia.com/dpino/2012/10/31/introduction-to-hbase-and-nosql-systems/
EmpId Lastname Firstname Salary
1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000
Row-oriented
1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;
Column-oriented
1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
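The two layouts above can be reproduced with a few lines of Java. This is only a toy serializer for the employee table on this slide, not an HBase API; the class and method names are made up for illustration. HBase itself groups cells by column family rather than storing each column separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Toy serializer for the employee table above; names are illustrative, not HBase API. */
final class LayoutDemo {

    /** One output line per record: all fields of a row are stored together. */
    static List<String> rowOriented(String[][] table) {
        List<String> lines = new ArrayList<>();
        for (String[] row : table) {
            lines.add(String.join(",", row) + ";");
        }
        return lines;
    }

    /** One output line per column: all values of a field are stored together. */
    static List<String> columnOriented(String[][] table) {
        List<String> lines = new ArrayList<>();
        int columns = table.length == 0 ? 0 : table[0].length;
        for (int c = 0; c < columns; c++) {
            StringJoiner line = new StringJoiner(",", "", ";");
            for (String[] row : table) {
                line.add(row[c]);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] employees = {
            {"1", "Smith", "Joe", "40000"},
            {"2", "Jones", "Mary", "50000"},
            {"3", "Johnson", "Cathy", "44000"}
        };
        rowOriented(employees).forEach(System.out::println);
        columnOriented(employees).forEach(System.out::println);
    }
}
```

Reading a single field such as Salary touches one line in the column-oriented form but every line in the row-oriented form.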
HBase
HBase Scan
Scan - API for data retrieval
● Scan scan = new Scan(startRow, stopRow);
● scan.addFamily(f);
● scan.addColumn(f, c);
Filters - control the amount of data returned to the client
● scan.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("3")));
Optimizations - caching and batching reduce client round trips, but a filtered scan still reads the entire table, so client timeouts and scanner lease expiry remain a risk
● scan.setCaching(1000);
● scan.setBatch(1000);
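Why a filter does not make a scan cheap can be seen in a small in-memory model. The sketch below is an assumption-laden stand-in, not the HBase client: a TreeMap plays the role of a table sorted by row key, and ScanModel, Result, and scan are hypothetical names. As with an HBase scan, the start row is inclusive and the stop row is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a filtered range scan over a table sorted by row key. */
final class ScanModel {

    /** Outcome of one scan: the matching row keys and how many rows were visited. */
    static final class Result {
        final List<String> matches = new ArrayList<>();
        int rowsExamined;
    }

    /**
     * Visits every row in [startRow, stopRow) and keeps those whose value
     * contains the given substring. The filter shrinks the result sent back,
     * but it does not reduce the number of rows that have to be read.
     */
    static Result scan(TreeMap<String, String> table,
                       String startRow, String stopRow, String substring) {
        Result result = new Result();
        for (Map.Entry<String, String> row : table.subMap(startRow, stopRow).entrySet()) {
            result.rowsExamined++;
            if (row.getValue().contains(substring)) {
                result.matches.add(row.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row1", "13");
        table.put("row2", "20");
        table.put("row3", "35");
        Result r = scan(table, "row1", "row4", "3");
        System.out.println(r.matches + " after examining " + r.rowsExamined + " rows");
    }
}
```

The gap between rows examined and rows returned is the cost that a secondary index is meant to remove.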
Secondary Index
● The way you design your row key affects everything: HBase can look up rows efficiently only by row key.
● A secondary index avoids full table scans for queries on other columns. Index entries can be stored in a separate index table.
● Create the index on a column and store, in that separate table, the row keys corresponding to each value of the column.
● RegionServers use the index table to perform a selective scan.
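The index-table idea can be sketched with two sorted maps. This is a minimal in-memory model under stated assumptions, not HBase or HIndex code: IndexTableModel and its methods are hypothetical, one column is indexed, and index keys are formed as value + NUL + row key so that all entries for one value are adjacent and can be read with a short range scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy data table plus a separate index table over a single column. */
final class IndexTableModel {
    // Separator that sorts before every other character; values must not contain it.
    private static final char SEP = '\u0000';

    private final TreeMap<String, String> dataTable = new TreeMap<>();  // rowKey -> value
    private final TreeMap<String, String> indexTable = new TreeMap<>(); // value+SEP+rowKey -> rowKey

    /** Writes a row and keeps the index in step, dropping any stale index entry. */
    void put(String rowKey, String value) {
        String previous = dataTable.put(rowKey, value);
        if (previous != null) {
            indexTable.remove(previous + SEP + rowKey);
        }
        indexTable.put(value + SEP + rowKey, rowKey);
    }

    /** Returns the row keys holding the value by scanning only the matching index range. */
    List<String> rowKeysFor(String value) {
        String from = value + SEP;              // first possible key for this value
        String to = value + (char) (SEP + 1);   // just past the last key for this value
        return new ArrayList<>(indexTable.subMap(from, to).values());
    }

    public static void main(String[] args) {
        IndexTableModel flights = new IndexTableModel();
        flights.put("2008-01-03|0335", "WN");
        flights.put("2008-01-03|0448", "AA");
        flights.put("2008-01-04|1200", "WN");
        System.out.println(flights.rowKeysFor("WN"));
    }
}
```

Every write now costs a second write to the index, which is the space and write overhead traded for faster reads.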
HBase and CoProcessor Architecture
Trigger-Based Observer System
● Region Observer
● Master Observer
● WAL (Log) Observer
Endpoints (Stored Procedures)
● Server-side execution
● Distributed
● Can be called by clients or observers
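The difference between the two coprocessor styles can be illustrated without HBase. The sketch below is a toy model: PutObserver, Region, and countWithValue are hypothetical names, not the coprocessor API. The observer is a callback fired after each write, in the spirit of a trigger, and the endpoint-style method runs a computation next to the data and returns only the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of observer (trigger) and endpoint (stored procedure) coprocessors. */
final class ObserverModel {

    /** Hypothetical hook invoked after a row has been written. */
    interface PutObserver {
        void postPut(String rowKey, String value);
    }

    /** Toy region: stores rows and notifies registered observers after each put. */
    static final class Region {
        private final TreeMap<String, String> rows = new TreeMap<>();
        private final List<PutObserver> observers = new ArrayList<>();

        void register(PutObserver observer) {
            observers.add(observer);
        }

        void put(String rowKey, String value) {
            rows.put(rowKey, value);
            for (PutObserver observer : observers) {
                observer.postPut(rowKey, value); // trigger-style callback
            }
        }

        /** Endpoint-style call: counts matching rows where the data lives. */
        long countWithValue(String value) {
            return rows.values().stream().filter(value::equals).count();
        }
    }

    public static void main(String[] args) {
        Region region = new Region();
        List<String> auditLog = new ArrayList<>();
        region.register((rowKey, value) -> auditLog.add(rowKey + "=" + value));
        region.put("row1", "WN");
        region.put("row2", "AA");
        System.out.println(auditLog + ", WN rows: " + region.countWithValue("WN"));
    }
}
```

An index-maintaining observer would do its index write inside the callback instead of appending to a log.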
Secondary Indexing utilizing CoProcessors
● For every index we create
o Regions have their own assigned index
o Indexes and regions are kept together
o Immune to splits and server outages
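Keeping an index with its region means a query on the indexed column fans out to every region, and each region answers from its own small index. The following is a toy sketch of that arrangement, assuming one indexed column and regions defined by row key ranges; LocalIndexModel and its members are hypothetical and do not reflect how any particular implementation stores its index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of per-region (local) secondary indexes. */
final class LocalIndexModel {

    /** Region owning row keys in [startKey, endKey) and an index over only those rows. */
    static final class Region {
        final String startKey;
        final String endKey; // exclusive; null means no upper bound
        final TreeMap<String, String> rows = new TreeMap<>();
        final TreeMap<String, String> index = new TreeMap<>(); // value+NUL+rowKey -> rowKey

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        boolean owns(String rowKey) {
            return rowKey.compareTo(startKey) >= 0
                && (endKey == null || rowKey.compareTo(endKey) < 0);
        }

        void put(String rowKey, String value) {
            String previous = rows.put(rowKey, value);
            if (previous != null) {
                index.remove(previous + '\u0000' + rowKey);
            }
            index.put(value + '\u0000' + rowKey, rowKey);
        }

        List<String> lookup(String value) {
            return new ArrayList<>(index.subMap(value + '\u0000', value + '\u0001').values());
        }
    }

    /** Routes a write to the region that owns the row key. */
    static void put(List<Region> regions, String rowKey, String value) {
        for (Region region : regions) {
            if (region.owns(rowKey)) {
                region.put(rowKey, value);
                return;
            }
        }
        throw new IllegalArgumentException("no region owns " + rowKey);
    }

    /** Asks every region to consult its own index and merges the answers. */
    static List<String> query(List<Region> regions, String value) {
        List<String> result = new ArrayList<>();
        for (Region region : regions) {
            result.addAll(region.lookup(value));
        }
        return result;
    }
}
```

Because a row and its index entry always live in the same region, a write never has to cross servers to keep the index current.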
Sample Code Implementation
● Optimizing for queries based on Cartesian products of key values
● Example: what was the air time of all carriers that flew on a Wednesday?
// Index-aware admin client provided by HIndex
HBaseAdmin admin = new IndexAdmin(conf);
// Describe the data table and its column family
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(tableName));
HColumnDescriptor hcd = new HColumnDescriptor(columnFamily);
htd.addFamily(hcd);
// Define an index on one column of that family
IndexSpecification iSpec = new IndexSpecification(indexName);
iSpec.addIndexColumn(hcd, indexColumnQualifier, ValueType.String, 10);
// Attach the index definition to the table descriptor, then create the table
TableIndices tableIndices = new TableIndices();
tableIndices.addIndex(iSpec);
htd.setValue(Constants.INDEX_SPEC_KEY, tableIndices.toByteArray());
admin.createTable(htd);
Testing Environment
● Testing Platforms:
o Hadoop-2.2.0
o HBase-0.98.8
o Single-laptop benchmark: HBase pseudo-distributed mode
o QEMU-based 3-node virtual cluster: 2 GB RAM and 2 Intel i7 cores per node
● Data Set: 7 million rows
- Airline On-Time Statistics and Delay Causes
- http://stat-computing.org/dataexpo/2009/the-data.html
- Roughly 80 GB datastore size per node
HIndex
● Secondary Index for HBase
● Implementation by Huawei developers
● Uses coprocessors to inject code into master and region
servers.
● Creates an additional index table for every column on which
an index is desired.
HIndex conclusions
● Limited to HBase version 0.94.
● Must be built together with the HBase source; it is not available independently as a JAR.
● Very little documentation and support.
● Unstable in many edge cases.
Benchmarks
Query1:
new RowFilter(CompareOp.EQUAL, new SubstringComparator("WN"));
38ms per record
Query2:
new ValueFilter(CompareOp.GREATER, new BinaryComparator(Bytes.toBytes(60)));
3.9ms per record
Average performance over 3 runs:
Query    Time taken on single node (ms)   Total records on single node   Time taken on cluster (ms)   Total records on cluster
Query1   50910                            346435                         62291                        689409
Query2   207757                           3807652                        125777                       7785038
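Since the single-node and cluster runs returned different numbers of records, one way to compare them is to divide each time in the table by the matching record count. The sketch below does only that arithmetic on the four table rows. BenchmarkMath and msPerRecord are illustrative names, and the result is not claimed to reproduce the per-record figures quoted next to the queries, whose derivation the slides do not state.

```java
/** Derives average milliseconds per returned record from the benchmark table. */
final class BenchmarkMath {

    /** Total elapsed time divided by the number of records returned. */
    static double msPerRecord(long totalMillis, long records) {
        return (double) totalMillis / records;
    }

    public static void main(String[] args) {
        // Values copied from the table: time taken (ms), total records.
        System.out.printf("Query1 single node: %.4f ms/record%n", msPerRecord(50910, 346435));
        System.out.printf("Query1 cluster:     %.4f ms/record%n", msPerRecord(62291, 689409));
        System.out.printf("Query2 single node: %.4f ms/record%n", msPerRecord(207757, 3807652));
        System.out.printf("Query2 cluster:     %.4f ms/record%n", msPerRecord(125777, 7785038));
    }
}
```

On this measure the cluster spends less time per returned record than the single node for both queries, even though its total time for Query1 is higher.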
Table Name: flight_data
Keys: year, month, dayOfMonth, dayOfWeek, Departure Time, Carrier
Column Families and Values:
(flight) Month, dayOfWeek, Carrier, Flight Number, Origin, Destination
(trip) Distance, AirTime, Arrival Delay
Challenges & Conclusions
Indexing data across regions
● Co-locate the index with its data in the same region
● Make the column family part of the index
Handling region splits
● Split the index table with a custom splitter that follows the row key distribution of the data table
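The split requirement can be made concrete with a toy region that carries its own index. This is a sketch of the idea only, assuming the index is divided at the same row key boundary as the data; SplitModel, Region, and split are hypothetical names and this is not the splitter used by HIndex or HBase.

```java
import java.util.TreeMap;

/** Toy model of splitting a region so that index entries follow their rows. */
final class SplitModel {

    static final class Region {
        final TreeMap<String, String> rows = new TreeMap<>();
        final TreeMap<String, String> index = new TreeMap<>(); // value+NUL+rowKey -> rowKey

        void put(String rowKey, String value) {
            String previous = rows.put(rowKey, value);
            if (previous != null) {
                index.remove(previous + '\u0000' + rowKey);
            }
            index.put(value + '\u0000' + rowKey, rowKey);
        }

        /**
         * Splits at splitKey. Rows with key >= splitKey move to the returned
         * daughter region together with their index entries; this region
         * keeps the rest. The index is divided by the row key each entry
         * points to, not by the indexed value.
         */
        Region split(String splitKey) {
            Region upper = new Region();
            // Copy the tail first so the live map is not modified while iterating it.
            TreeMap<String, String> moving = new TreeMap<>(rows.tailMap(splitKey, true));
            for (String rowKey : moving.keySet()) {
                String value = moving.get(rowKey);
                upper.put(rowKey, value);
                rows.remove(rowKey);
                index.remove(value + '\u0000' + rowKey);
            }
            return upper;
        }
    }

    public static void main(String[] args) {
        Region lower = new Region();
        lower.put("a1", "WN");
        lower.put("n5", "WN");
        Region upper = lower.split("m");
        System.out.println(lower.index.values() + " / " + upper.index.values());
    }
}
```

After the split, every index entry in a daughter region still points at a row stored in that same region.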
Secondary indexes improve query performance at the expense of extra storage space.
Coprocessors add overhead to the processing of every query.
Questions
References:
HBase - CoProcessors - https://blogs.apache.org/hbase/entry/coprocessor_introduction
HIndex - https://github.com/Huawei-Hadoop/hindex
HIndex Overview - http://www.slideshare.net/rajeshbabuchintaguntla/apache-con-hindex
