WWW.NGDATA.COM The information herein is the property of NGDATA and is considered proprietary and confidential
HBase SEP & Indexer
Mining needles from massive haystacks
Steven Noels
HBaseCon, 2013-06-13, San Francisco
HBase is a great haystack
(but where are the needles?)
What HBase Offers
• rows of column family-contained columns containing timestamp-versioned cells
• rowkey-based random access through sorted row order
• get / put / delete / scan operations
• scale-out across region servers
What Most People Need
• sorted rows of column family-contained columns containing timestamp-versioned cells
• rowkey-based random access through sorted row order
• get / put / delete / scan operations
• scale-out across region servers
• fast (indexed) random access using secondary column keys
• index generation and maintenance
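The difference between the two lists can be illustrated with a toy model. The sketch below is plain Python, not HBase client code: a dict stands in for a table, and the names `put`, `find_by_scan`, and `find_by_index` are hypothetical. It only aims to show that a lookup by column value needs a full scan unless an index exists, and that such an index has to be maintained on every update.

```python
from collections import defaultdict

table = {}                  # rowkey -> {column: value}; stands in for an HBase table
index = defaultdict(set)    # (column, value) -> set of rowkeys; the secondary index


def put(rowkey, column, value):
    """Write one cell and keep the secondary index in step with it."""
    row = table.setdefault(rowkey, {})
    old = row.get(column)
    if old is not None:
        index[(column, old)].discard(rowkey)   # index maintenance: drop the stale entry
    row[column] = value
    index[(column, value)].add(rowkey)


def find_by_scan(column, value):
    """Without an index: visit every row in sorted rowkey order."""
    return [k for k in sorted(table) if table[k].get(column) == value]


def find_by_index(column, value):
    """With an index: one lookup on the secondary key."""
    return sorted(index[(column, value)])


put("user1", "city", "Ghent")
put("user2", "city", "Brussels")
put("user3", "city", "Ghent")
put("user1", "city", "Brussels")   # an update must also move the index entry

print(find_by_scan("city", "Ghent"))    # ['user3']
print(find_by_index("city", "Ghent"))   # ['user3']
```

Both lookups return the same rows; the scan touches every row, while the index lookup touches one entry at the cost of extra work on each write.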
Earlier attempts
HBase Indexing and Search
• Lily RowLog
• hbase-solr-dataimport
Import HBase data into Solr using the DataImportHandler
https://code.google.com/p/hbase-solr-dataimport/
• HBasene
HBase as the backing store for the TF-IDF representations for Lucene
https://github.com/akkumar/hbasene
• hbase-secondary-index
https://github.com/mayanhui/hbase-secondary-index
• hbase-indexed
https://github.com/danix800/hbase-indexed
• Culvert
A Robust Framework for Secondary Indexing
https://github.com/jyates/culvert
• Co-processors
1. many data prerequisites
2. leaky abstractions
3. no drop-in approach
Indexing isn’t just about Search
• maintaining alternate data views
• aggregates
• counts
• general side-effects to updates
• keeping secondary systems in lock-step sync with updates
1. HBase update
2. trigger
3. process
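The update, trigger, process sequence can be sketched with a small in-memory model. This is an illustration only, not the SEP API: the `events` queue, `update`, and `process_pending` are hypothetical names, and the side effect chosen here (a count of rows per `status` value) is just one of the derived views the slide lists.

```python
from collections import Counter, deque

events = deque()     # pending side-effect events (the "trigger" queue)
table = {}           # rowkey -> {column: value}
counts = Counter()   # derived view: how many rows hold each value of "status"


def update(rowkey, column, value):
    """Step 1: apply the update and record an event describing it."""
    old = table.setdefault(rowkey, {}).get(column)
    table[rowkey][column] = value
    events.append((rowkey, column, old, value))


def process_pending():
    """Steps 2 and 3: drain the triggered events and apply the side effect."""
    while events:
        _rowkey, column, old, new = events.popleft()
        if column != "status":
            continue
        if old is not None:
            counts[old] -= 1   # the row left its previous status
        counts[new] += 1


update("order1", "status", "open")
update("order2", "status", "open")
update("order1", "status", "shipped")
process_pending()
print(dict(+counts))   # {'open': 1, 'shipped': 1}
```

Because the event carries both the old and the new value, the derived count can be adjusted without re-reading the table.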
The Solution: HBase SEP + Indexer
HBase ‘Side-Effect Processor’
• A mechanism for triggering and processing side-effect events, based upon HBase updates
Companion project: HBase Indexer
• Maps HBase row updates into Solr index updates
Open Source, Apache License
http://github.com/NGDATA/hbase-sep
http://github.com/NGDATA/hbase-indexer
Use cases for HBase SEP & Indexing
• structured ad-hoc search of HBase-backed Solr indexes
• faceted search
• auxiliary index or view structures
• observation matrices for CF-style recommendations
• maintenance of auxiliary cross-reference tables (link mgmt)
• computing data aggregates, counter maintenance
What about co-processors? Sysadmins don’t like running application code on HBase region servers.
Use Case: Faceted Search in Lily
[Figure labels: facets; result set count; facet counts; HBase; Solr Cloud]
SEP / Trigger fundamentals
Using HBase replication for Indexing triggering
Approach:
• SEP = fake HBase region servers, pass on update events to Indexer
• light-weight, embeddable process
• piggybacks on HBase replication mechanism
• Indexer = maps HBase HLog update events into Solr updates
• no impact on write path
[Diagram labels: Fake HBase ‘Cluster’; SEP + Indexer; Index (Solr)]
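The idea of triggering from the replication stream rather than from the write path can be modelled in a few lines. The sketch below is a toy, not SEP code: `Table`, `LogConsumer`, and `index_update` are hypothetical names, a Python list stands in for the HLog, and a dict stands in for the Solr index. It assumes only what the slide states, namely that updates reach the indexer by reading the log, so a write never waits for indexing.

```python
class Table:
    """Toy primary store: every put is appended to a write-ahead log."""

    def __init__(self):
        self.rows = {}
        self.log = []   # stands in for the HLog

    def put(self, rowkey, column, value):
        self.log.append((rowkey, column, value))   # log first, as a WAL would
        self.rows.setdefault(rowkey, {})[column] = value
        # No indexing work happens here: the write path is untouched.


class LogConsumer:
    """Toy replication peer: reads log entries it has not yet seen."""

    def __init__(self, table, listener):
        self.table = table
        self.listener = listener
        self.offset = 0   # position of the next unread log entry

    def poll(self):
        """Deliver all new log entries to the listener; return how many."""
        batch = self.table.log[self.offset:]
        for entry in batch:
            self.listener(entry)
        self.offset += len(batch)
        return len(batch)


solr_docs = {}   # stands in for the Solr index


def index_update(entry):
    """Turn one logged cell update into an index document update."""
    rowkey, column, value = entry
    solr_docs.setdefault(rowkey, {"id": rowkey})[column] = value


table = Table()
consumer = LogConsumer(table, index_update)

table.put("row1", "info:name", "Alice")
table.put("row1", "info:city", "Ghent")
print(solr_docs)         # {} (the index lags until the consumer polls)
print(consumer.poll())   # 2
print(solr_docs)         # {'row1': {'id': 'row1', 'info:name': 'Alice', 'info:city': 'Ghent'}}
print(consumer.poll())   # 0 (nothing new to deliver)
```

The trade-off this design accepts is visible in the first `print`: the index is updated asynchronously, so it trails the table until the consumer has caught up.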
SEP & Indexer data flow anatomy
Deployment
• option 1: co-locate with HBase Region Servers
[Diagram labels: HBase RS; SEP+IDX; Solr; ZooKeeper arbitration]
Deployment
• option 2: co-locate with Solr index engine nodes
[Diagram labels: HBase RS; SEP+IDX; Solr; ZooKeeper arbitration]
HBase Indexer: two options
HBase Indexer features
• row- and column-based mapping
[Diagram: an HBase table (rowkey, col1, col2, col3, col4) mapped to Solr(Cloud) documents (rowkey, row content), in two modes labelled "row:" and "column:"]
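One way to read the two mapping modes is sketched below. The semantics are an assumption inferred from the "row:" and "column:" labels (one index document per row versus one per cell); the function names and the document id scheme are hypothetical and do not come from the HBase Indexer code.

```python
def map_row(rowkey, cells):
    """Row mapping: the whole row becomes a single index document."""
    doc = {"id": rowkey}
    doc.update(cells)
    return [doc]


def map_columns(rowkey, cells):
    """Column mapping: each cell becomes its own index document."""
    return [
        {"id": f"{rowkey}-{column}", "rowkey": rowkey, "column": column, "value": value}
        for column, value in sorted(cells.items())
    ]


cells = {"col1": "a", "col3": "b"}
print(map_row("r1", cells))       # one document for the row
print(map_columns("r1", cells))   # one document per cell
```

Row mapping suits tables where a row is one entity; column mapping suits wide rows where each cell is a separate item worth finding on its own.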
HBase Indexer features
• configurable data extraction mechanisms
• HBase Bytes
• Tika / SolrCell (+ content extraction)
• optional formatters
• non-programmatic indexer configuration
• index mgmt CLI
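A rough sketch of what non-programmatic configuration combined with byte-level extraction could look like follows. The configuration layout, field names, and `build_document` are hypothetical and are not the HBase Indexer's actual format; the only HBase-specific fact relied on is that the `Bytes` utility encodes an int as 4 big-endian bytes and a String as UTF-8.

```python
import struct

# Hypothetical declarative mapping: index field name -> (HBase column, value type).
INDEXER_CONFIG = {
    "firstname_s": ("info:firstname", "string"),
    "age_i": ("info:age", "int"),
}

# Decoders for raw cell bytes, mirroring the encodings of HBase's Bytes utility.
DECODERS = {
    "string": lambda raw: raw.decode("utf-8"),
    "int": lambda raw: struct.unpack(">i", raw)[0],   # 4-byte big-endian signed int
}


def build_document(rowkey, cells, config=INDEXER_CONFIG):
    """Extract the configured columns from raw cells into an index document."""
    doc = {"id": rowkey.decode("utf-8")}
    for field, (column, value_type) in config.items():
        raw = cells.get(column)
        if raw is not None:   # columns absent from the row are skipped
            doc[field] = DECODERS[value_type](raw)
    return doc


cells = {
    "info:firstname": b"Ada",
    "info:age": struct.pack(">i", 36),
    "info:notes": b"not configured, so not indexed",
}
print(build_document(b"user1", cells))
# {'id': 'user1', 'firstname_s': 'Ada', 'age_i': 36}
```

Changing what gets indexed then means editing the mapping, not the extraction code.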
Wrap-up
HBase SEP & Indexer
• http://github.com/NGDATA/hbase-sep and hbase-indexer
• easy setup:
1. switch on HBase replication, and …
2. profit.
• few prerequisites on data model
• multiple approaches for mapping HBase rows to Solr
• can be used for other secondary operations
• open source, Apache license
Questions? stevenn@ngdata.com
HBase SEP & Indexer are part of Cloudera Search
➜ joint development between Cloudera & NGDATA
➜ try it out: www.cloudera.com/downloads

HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
