HBase Secondary Indexing

Implementation and analysis of secondary index in HBase
Gino McCarty, Ranjan Kumar

What’s next
● HBase Introduction
● HBase Scan
● Secondary index
● CoProcessors
● Secondary Index using CoProcessors
● Testing Infrastructure
● Benchmarks
● Challenges and conclusions

HBase
● Open source, non-relational, distributed database built on top of Hadoop.
● Columnar data store, also called Tabular data store.
● Architecture - Hadoop, HDFS, Row keys, Column Families
Image source: http://blogs.igalia.com/dpino/2012/10/31/introduction-to-hbase-and-nosql-systems/
EmpId Lastname Firstname Salary
1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000
Row-oriented
1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;
Column-oriented
1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
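The two layouts above can be reproduced with a small sketch. This is illustrative only: the class and method names are hypothetical, and it serializes an in-memory array rather than showing how HBase lays out data on disk (HBase groups cells by column family rather than by individual column).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: serializes the same records row-wise and column-wise.
class LayoutDemo {
    // Each record is {EmpId, Lastname, Firstname, Salary}.
    static final String[][] EMPLOYEES = {
        {"1", "Smith", "Joe", "40000"},
        {"2", "Jones", "Mary", "50000"},
        {"3", "Johnson", "Cathy", "44000"},
    };

    // Row-oriented: all fields of one record are stored together.
    static List<String> rowOriented(String[][] records) {
        List<String> out = new ArrayList<>();
        for (String[] record : records) {
            out.add(String.join(",", record) + ";");
        }
        return out;
    }

    // Column-oriented: all values of one field are stored together.
    static List<String> columnOriented(String[][] records) {
        List<String> out = new ArrayList<>();
        int fields = records[0].length;
        for (int f = 0; f < fields; f++) {
            List<String> column = new ArrayList<>();
            for (String[] record : records) {
                column.add(record[f]);
            }
            out.add(String.join(",", column) + ";");
        }
        return out;
    }

    public static void main(String[] args) {
        rowOriented(EMPLOYEES).forEach(System.out::println);
        columnOriented(EMPLOYEES).forEach(System.out::println);
    }
}
```

Reading one column (for example all salaries) touches a single line in the column-oriented form but every line in the row-oriented form.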

HBase Scan
Scan - API for data retrieval
● Scan scan = new Scan(startRow, stopRow);
● scan.addFamily(f);
● scan.addColumn(f, c);
Filters - control the amount of data returned to the client
● scan.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("3")));
Optimizations - caching, batching (a filtered scan still reads the entire table, risking client timeouts and scanner lease expiry)
● scan.setCaching(1000);
● scan.setBatch(1000);
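Why a value filter does not avoid a full table scan can be sketched with a toy model. The class below is a hypothetical in-memory stand-in for a table sorted by row key, not the HBase client API; it only counts how many rows each access path has to examine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory stand-in for a table sorted by row key; not the HBase client API.
class ScanCostDemo {
    final NavigableMap<String, String> rows = new TreeMap<>();
    int rowsExamined = 0;

    // Builds a five-row sample table (row key -> single cell value).
    static ScanCostDemo sample() {
        ScanCostDemo t = new ScanCostDemo();
        t.rows.put("row1", "40000");
        t.rows.put("row2", "50000");
        t.rows.put("row3", "44000");
        t.rows.put("row4", "31000");
        t.rows.put("row5", "52000");
        return t;
    }

    // Range scan on row keys: only rows in [startRow, stopRow) are visited.
    List<String> scanRange(String startRow, String stopRow) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e
                : rows.subMap(startRow, true, stopRow, false).entrySet()) {
            rowsExamined++;
            hits.add(e.getKey());
        }
        return hits;
    }

    // Value filter: every row is visited; non-matching rows are merely not returned.
    List<String> scanWithValueFilter(String substring) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.entrySet()) {
            rowsExamined++;
            if (e.getValue().contains(substring)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

In this model the range scan examines only the rows inside the key range, while the filtered scan examines all five rows to return one. The filter reduces what is sent to the client, not what is read.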

Secondary Index
● The way you design your row key affects everything.
● A secondary index avoids full table scans. Indexes can be stored in a separate index table.
● Create the index on a column and store the row keys corresponding to each value of that column in a separate table.
● RegionServers use the index table to perform a selective scan.
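The index-table idea can be sketched with two sorted maps. This is a minimal in-memory model under stated assumptions: the class name, the method names, and the use of a carrier code as the indexed column are all hypothetical, and no HBase API is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of a data table plus a separate index table; not HBase code.
class IndexTableDemo {
    // Data table: row key -> value of the indexed column (a carrier code here).
    final NavigableMap<String, String> dataTable = new TreeMap<>();
    // Index table: indexed value -> row keys of the data table holding that value.
    final NavigableMap<String, NavigableSet<String>> indexTable = new TreeMap<>();

    // Writes the row and keeps the index table in step with it.
    void put(String rowKey, String carrier) {
        String old = dataTable.put(rowKey, carrier);
        if (old != null && !old.equals(carrier)) {
            // The row changed value: drop its stale index entry.
            NavigableSet<String> stale = indexTable.get(old);
            stale.remove(rowKey);
            if (stale.isEmpty()) {
                indexTable.remove(old);
            }
        }
        indexTable.computeIfAbsent(carrier, k -> new TreeSet<>()).add(rowKey);
    }

    // Selective lookup: one index read instead of scanning every data row.
    List<String> rowKeysFor(String carrier) {
        NavigableSet<String> keys = indexTable.get(carrier);
        if (keys == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(keys);
    }
}
```

The extra map is the space cost of the index; the lookup no longer depends on the number of rows in the data table.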

HBase and CoProcessor Architecture
Trigger-Based Observer System
● Region Observer
● Master Observer
● WAL (log) Observer
EndPoints (Stored Procedures)
● Server-side execution
● Distributed
● Can be called by clients or observers
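The trigger-style observer idea can be sketched without HBase. Everything below is hypothetical: `ToyPutObserver`, `ToyRegion`, and `ToyIndexingObserver` are illustration-only names, and the real RegionObserver hooks have different signatures. The sketch shows a hook that fires after every write and uses it to maintain an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical callback; HBase's real observer hooks have different signatures.
interface ToyPutObserver {
    void postPut(String rowKey, String qualifier, String value);
}

// Toy region: stores cells and fires every registered observer after each write.
class ToyRegion {
    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    private final List<ToyPutObserver> observers = new ArrayList<>();

    void register(ToyPutObserver observer) {
        observers.add(observer);
    }

    void put(String rowKey, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(qualifier, value);
        for (ToyPutObserver observer : observers) {
            observer.postPut(rowKey, qualifier, value);
        }
    }
}

// Observer that keeps a value -> row keys index for a single qualifier.
// Simplification: overwritten values are not removed from the index.
class ToyIndexingObserver implements ToyPutObserver {
    private final String indexedQualifier;
    final Map<String, TreeSet<String>> index = new TreeMap<>();

    ToyIndexingObserver(String indexedQualifier) {
        this.indexedQualifier = indexedQualifier;
    }

    @Override
    public void postPut(String rowKey, String qualifier, String value) {
        if (!indexedQualifier.equals(qualifier)) {
            return; // writes to other columns do not touch this index
        }
        index.computeIfAbsent(value, k -> new TreeSet<>()).add(rowKey);
    }
}
```

The client only calls `put`; the index update happens on the server side of the call, which is the property the observer model provides.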

Secondary Indexing utilizing CoProcessors
● For every index we create
o Regions have their own assigned index
o Indexes and regions are kept together
o Immune to splits and server outages

Sample Code Implementation
● Optimizing for queries based on Cartesian products of key values
● Example: what was the air time of all carriers that flew on a Wednesday?
// IndexAdmin is the index-aware admin client used in place of a plain HBaseAdmin
HBaseAdmin admin = new IndexAdmin(conf);
// Describe the data table and its column family
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(tableName));
HColumnDescriptor hcd = new HColumnDescriptor(columnFamily);
htd.addFamily(hcd);
// Index one column qualifier as a String value (10 is the value-length argument)
IndexSpecification iSpec = new IndexSpecification(indexName);
iSpec.addIndexColumn(hcd, indexColumnQualifier, ValueType.String, 10);
// Attach the index specification to the table descriptor, then create the table
TableIndices tableIndices = new TableIndices();
tableIndices.addIndex(iSpec);
htd.setValue(Constants.INDEX_SPEC_KEY, tableIndices.toByteArray());
admin.createTable(htd);

Testing Environment
● Testing platforms:
o Hadoop-2.2.0
o HBase-0.98.8
● Single laptop benchmark
o HBase pseudo-distributed mode
● QEMU-based 3-node virtual cluster
o 2 GB RAM and 2-core Intel i7 per node
● Data set: 7 million rows
o Airline On-Time Statistics and Delay Causes
o http://stat-computing.org/dataexpo/2009/the-data.html
o Roughly 80 GB data store size per node

HIndex
● Secondary Index for HBase
● Implementation by Huawei developers
● Uses coprocessors to inject code into master and region
servers.
● Creates an additional index table for every column on which
an index is desired.

HIndex conclusions
● Limited to HBase version 0.94.
● The code must be built together with the HBase source; it is not available independently as a jar.
● Very little documentation and support.
● Unstable in many edge cases.

Benchmarks
Query1:
new RowFilter(CompareOp.EQUAL, new SubstringComparator("WN"));
38 ms per record
Query2:
new ValueFilter(CompareOp.GREATER, new BinaryComparator(60));
3.9 ms per record
Average performance over 3 runs:
Query     Time taken on      Total records     Time taken on    Total records
          single node (ms)   on single node    cluster (ms)     on cluster
Query1    50910              346435            62291            689409
Query2    207757             3807652           125777           7785038
Table name: flight_data
Keys: year, month, dayofMonth, dayOfWeek, Departure Time, Carrier
Column families and values:
(flight) Month, dayOfWeek, Carrier, Flight Number, Origin, Destination
(trip) Distance, AirTime, Arrival Delay

Challenges & Conclusions
Indexing data across regions
● Co-locating index with data in the same region
● Make column family a part of the index
Handling region split
● Split the index table with a custom splitter that follows the row key distribution of the data table
Secondary indexes improve query performance at the expense of extra space.
CoProcessors add overhead to the processing of each query.
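One way to picture co-location is an index row key that starts with the start key of the data region it describes, so the index entry sorts into an index region with the same boundaries. The sketch below assumes that layout for illustration; the separator, the field order, and all names are hypothetical and are not taken from the slides or from the HIndex source.

```java
// Sketch of a region-prefixed index row key; the layout is an assumption.
class ColocatedIndexKey {
    // NUL separator sorts before any printable character.
    static final char SEP = '\u0000';

    // Index key = region start key + index name + indexed value + data row key.
    static String indexRowKey(String regionStartKey, String indexName,
                              String value, String rowKey) {
        return regionStartKey + SEP + indexName + SEP + value + SEP + rowKey;
    }

    // True if key falls in [startKey, endKey); an empty endKey means no upper bound.
    static boolean inRegion(String key, String startKey, String endKey) {
        return key.compareTo(startKey) >= 0
                && (endKey.isEmpty() || key.compareTo(endKey) < 0);
    }
}
```

With split points mirrored between the data table and the index table, an index key built from start key "m" stays inside the region ["m", "t"), so a region-local lookup does not need to contact another region. After a split the prefix has to follow the new start keys, which is where a custom splitter comes in.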

Questions
References:
HBase - CoProcessors - https://blogs.apache.org/hbase/entry/coprocessor_introduction
HIndex - https://github.com/Huawei-Hadoop/hindex
HIndex Overview - http://www.slideshare.net/rajeshbabuchintaguntla/apache-con-hindex
