HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
WWW.NGDATA.COM The information herein is the property of NGDATA and is considered proprietary and confidential
HBase SEP & Indexer
Mining needles from massive haystacks
Steven Noels
HBaseCon, 2013-06-13, San Francisco
HBase is a great haystack
(but where are the needles?)
What HBase Offers
• rows of column family-contained columns containing timestamp-versioned cells
• rowkey-based random access through sorted row order
• get / put / delete / scan operations
• scale-out across region servers
What Most People Need
• sorted rows of column family-contained columns containing timestamp-versioned cells
• rowkey-based random access through sorted row order
• get / put / delete / scan operations
• scale-out across region servers
• fast (indexed) random access using secondary column keys
• index generation and maintenance
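The gap between what HBase offers and what most people need can be illustrated with a toy in-memory model. This is only a sketch: `TinyTable` and all of its methods are hypothetical names invented for illustration, not HBase API. It shows that a rowkey range scan is cheap because rows are sorted, that a lookup by column value needs either a full scan or a separate index, and that the index must be maintained on every put.

```python
import bisect


class TinyTable:
    """Toy stand-in for a table with rows sorted by rowkey (not HBase API)."""

    def __init__(self):
        self._keys = []    # rowkeys, kept sorted
        self._rows = {}    # rowkey -> {column: value}
        self._index = {}   # (column, value) -> set of rowkeys

    def put(self, rowkey, column, value):
        row = self._rows.get(rowkey)
        if row is None:
            bisect.insort(self._keys, rowkey)
            row = self._rows[rowkey] = {}
        old = row.get(column)
        if old is not None:
            # Index maintenance: drop the stale entry before adding the new one.
            self._index[(column, old)].discard(rowkey)
        row[column] = value
        self._index.setdefault((column, value), set()).add(rowkey)

    def get(self, rowkey):
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        """Rowkey range scan over [start, stop), using the sorted key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def find_by_value_full_scan(self, column, value):
        """Without an index, every row has to be inspected."""
        return sorted(k for k, row in self._rows.items() if row.get(column) == value)

    def find_by_value_indexed(self, column, value):
        """With a secondary index, the lookup is a single dictionary access."""
        return sorted(self._index.get((column, value), ()))
```

In this model the index lives inside the same object as the data, so it is updated on the write path. That coupling is a simplification of the sketch.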
• Lily RowLog
• hbase-solr-dataimport
Import HBase data into Solr using the DataImportHandler
https://code.google.com/p/hbase-solr-dataimport/
• HBasene
HBase as the backing store for the TF-IDF representations for Lucene
https://github.com/akkumar/hbasene
• hbase-secondary-index
https://github.com/mayanhui/hbase-secondary-index
• hbase-indexed
https://github.com/danix800/hbase-indexed
• Culvert
A Robust Framework for Secondary Indexing
https://github.com/jyates/culvert
• Co-processors
Earlier attempts
HBase Indexing and Search
1. many data prerequisites
2. leaky abstractions
3. no drop-in approach
• maintaining alternate data views
• aggregates
• counts
• general side-effects to updates
• keeping secondary systems in lock-step sync with updates
Indexing isn’t just about Search
1. HBase update
2. trigger
3. process
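The three steps (HBase update, trigger, process) can be sketched as a small event pipeline whose side effect is a counter rather than a search index. Everything here is an assumption made for illustration: `UpdateStream`, `record_update`, `process_pending`, and `count_updates` are hypothetical names and do not come from HBase or the SEP library.

```python
from collections import Counter, deque


class UpdateStream:
    """Hypothetical stand-in for a stream of HBase update events."""

    def __init__(self):
        self._pending = deque()
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def record_update(self, table, rowkey, column, value):
        # Steps 1 and 2: the update has happened; it leaves an event behind.
        self._pending.append(
            {"table": table, "row": rowkey, "column": column, "value": value}
        )

    def process_pending(self):
        # Step 3: hand each queued event to every listener, in order.
        handled = 0
        while self._pending:
            event = self._pending.popleft()
            for listener in self._listeners:
                listener(event)
            handled += 1
        return handled


updates_per_table = Counter()


def count_updates(event):
    """Example side effect: maintain a per-table update counter."""
    updates_per_table[event["table"]] += 1
```

Because listeners only run in `process_pending`, recording an update never waits on a side effect; the counter simply lags behind until the queue is drained.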
HBase ‘Side-Effect Processor’
• A mechanism for triggering and processing side-effect events, based upon HBase updates
Companion project: HBase Indexer
• Maps HBase row updates into Solr index updates
The Solution: HBase SEP + Indexer
Open Source, Apache License
http://github.com/NGDATA/hbase-sep
http://github.com/NGDATA/hbase-indexer
• structured ad-hoc search of HBase-backed Solr indexes
• faceted search
• auxiliary index or view structures
• observation matrices for CF-style recommendations
• maintenance of auxiliary cross-reference tables (link mgmt)
• computing data aggregates, counter maintenance
Use cases for HBase SEP & Indexing
What about co-processors? Sysadmins don’t like running application code on HBase region servers.
Use Case: Faceted Search in Lily
[Figure: Lily search screen showing facets, result set count, and facet counts, backed by HBase and SolrCloud]
Approach:
• SEP = fake HBase region servers, pass on update events to Indexer
• light-weight, embeddable process
• piggybacks on HBase replication mechanism
• Indexer = maps HBase HLog update events into Solr updates
• no impact on write path
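The approach above can be sketched as a consumer that poses as a replication target and reads the write-ahead log at its own pace. This is a simplified model under stated assumptions, not the SEP implementation: `WriteAheadLog`, `FakePeer`, `ship_once`, and the batch size are hypothetical, and the position lives in a plain attribute here, whereas a real deployment would have to persist it somewhere durable.

```python
class WriteAheadLog:
    """Toy append-only log; the write path only ever appends to it."""

    def __init__(self):
        self.entries = []

    def append(self, rowkey, column, value):
        # The writer returns immediately and never calls the indexer.
        self.entries.append((rowkey, column, value))


class FakePeer:
    """Hypothetical consumer that poses as a replication target."""

    def __init__(self, wal, handler, batch_size=2):
        self.wal = wal
        self.handler = handler        # e.g. a function that updates an index
        self.batch_size = batch_size
        self.position = 0             # next log entry to ship

    def ship_once(self):
        """Ship one batch; the position advances only if the handler succeeds."""
        batch = self.wal.entries[self.position:self.position + self.batch_size]
        if not batch:
            return 0
        self.handler(batch)           # may raise; position is then left untouched
        self.position += len(batch)
        return len(batch)
```

If the handler fails, the same batch is offered again on the next call, so in this sketch delivery is at-least-once and handlers should tolerate seeing an update twice.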
SEP / Trigger fundamentals
Using HBase replication for Indexing triggering
[Diagram: fake HBase ‘cluster’ (SEP + Indexer) feeding the index (Solr)]
• option 1: co-locate with HBase Region Servers
Deployment
HBase RS SEP+IDX Solr
ZooKeeper arbitration
• option 2: co-locate with Solr index engine nodes
Deployment
HBase RS SEP+IDX Solr
ZooKeeper arbitration
• row- and column-based mapping
HBase Indexer features
[Figure: an HBase table (rowkey, col1, col2, col3, col4) mapped to Solr(Cloud) documents (rowkey, row content), with either row: or column: mapping]
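The difference between row-based and column-based mapping can be sketched as two small functions. This is an illustration only: the function names, the document field names (`id`, `row`, `column`, `value`), and the `rowkey-column` id scheme are assumptions, not the HBase Indexer's actual output format.

```python
def map_row(rowkey, cells):
    """Row mapping: one index document per HBase row."""
    doc = {"id": rowkey}
    for column, value in cells.items():
        doc[column] = value
    return [doc]


def map_columns(rowkey, cells):
    """Column mapping: one index document per cell of the row."""
    return [
        {"id": f"{rowkey}-{column}", "row": rowkey, "column": column, "value": value}
        for column, value in sorted(cells.items())
    ]
```

For a row with three cells, `map_row` yields a single document carrying all three values, while `map_columns` yields three documents that each point back to the same rowkey.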
• configurable data extraction mechanisms
• HBase Bytes
• Tika / SolrCell (+ content extraction)
• optional formatters
• non-programmatic indexer configuration
• index mgmt CLI
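A rough sketch of what configurable extraction with optional formatters could look like when the mapping is data rather than code. The rule format, the field names, and the `extract` function are hypothetical and do not reflect the HBase Indexer's real configuration syntax; the decoders assume values were written as UTF-8 strings or as big-endian fixed-width integers.

```python
import struct

# Decoders for raw cell bytes (assumed encodings, see the note above).
DECODERS = {
    "string": lambda raw: raw.decode("utf-8"),
    "int": lambda raw: struct.unpack(">i", raw)[0],    # 4-byte big-endian
    "long": lambda raw: struct.unpack(">q", raw)[0],   # 8-byte big-endian
}

# Declarative rules: which column feeds which index field, and how.
INDEXER_CONFIG = [
    {"field": "name_s", "source": "info:name", "type": "string",
     "formatter": str.upper},
    {"field": "age_i", "source": "info:age", "type": "int"},
]


def extract(cells, config):
    """Build one index document from raw cell bytes according to the rules."""
    doc = {}
    for rule in config:
        raw = cells.get(rule["source"])
        if raw is None:
            continue  # column absent in this row: leave the field out
        value = DECODERS[rule["type"]](raw)
        formatter = rule.get("formatter")
        doc[rule["field"]] = formatter(value) if formatter else value
    return doc
```

Adding a field to the index then means adding a rule to the list, with no change to the extraction code.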
HBase Indexer features
• http://github.com/NGDATA/hbase-sep and hbase-indexer
• easy setup:
1. switch on HBase replication, and …
2. profit.
• few prerequisites on data model
• multiple approaches for mapping HBase rows to Solr
• can be used for other secondary operations
• open source, Apache license
Questions? stevenn@ngdata.com
Wrap-up
HBase SEP & Indexer
HBase SEP & Indexer are part of Cloudera Search
➜ joint development between Cloudera & NGDATA
➜ try it out: www.cloudera.com/downloads