HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBase Coprocessor to Index Columns into
ElasticSearch Cluster
Dibyendu Bhattacharya
Architect – Big Data Analytics
HappiestMinds
About HappiestMinds
• Next-generation IT consultancy launched in August 2011. The head office
is in Bangalore, India, with offices in the USA, UK, Canada, Australia, and
Singapore. Core focus is on disruptive technologies such as Big
Data/Analytics, Cloud, Mobile, and Social.
• Raised USD 45M in Series A funding from prominent VCs, Intel
Capital, Canaan Partners, and the founders.
• 45+ clients globally, 800+ employees.
About Myself
Dibyendu is a Big Data Architect at HappiestMinds, where he is involved
in architecting and developing solutions on a Hadoop-based analytics
and search platform. In the past few years, he has worked on complex
data analytics projects that use Hadoop, HBase, and real-time
analytics. Before HappiestMinds, he worked at EMC, FairIsaac,
Cisco, and IBM, among others.
This Presentation...
...explores the design and the challenges HappiestMinds faced
while implementing a storage and search infrastructure for a
library procurement system in which records for books, documents,
and artifacts are stored in Apache HBase. After book records are
bulk-inserted into HBase, the Elasticsearch index is built offline
using MapReduce, but in certain use cases the records need to be
re-indexed in Elasticsearch using RegionObserver coprocessors.
Storing and Indexing Book records from
Publishers and Libraries
[Architecture diagram: Publisher/Library Data, HDFS, HBase Cluster, ElasticSearch Cluster, User Search, and User Update, connected by the numbered steps below.]
1. Data Pre-Processing
• Data ingestion to Hadoop.
2. Data Loading: MapReduce
• Bulk data upload to HBase table.
3. Data Indexing: MapReduce
• Incremental data indexing to ElasticSearch.
• Part of the document is indexed.
4. User Search
• User searches the data.
• Search engine displays results.
• Full data access requests are fetched from HBase.
5. User Update (arrows 5a and 5b in the diagram)
• User updates the HBase record.
• The update propagates to the search cluster.
Here Come the Coprocessors
The idea of HBase coprocessors was inspired by Google's Bigtable
coprocessors.
• HBase coprocessors are an addition to the data-manipulation
toolset and were introduced as a feature in the HBase
0.92.0 release.
• With coprocessors, we can push arbitrary computation out to
the HBase nodes hosting the data.
• Coprocessors can be loaded globally, on all tables and regions
hosted by the region server, or the administrator can specify
on a per-table basis which coprocessors should be loaded on
all regions of that table.
Coprocessors Class and Interfaces
The Coprocessor Interface
• All user code must implement this interface.
The CoprocessorEnvironment Interface
• Retains state across invocations.
The CoprocessorHost Class
• Ties the environment state to the user code.
Observer Coprocessors
Two types of Coprocessor
• Observers, which are like triggers in conventional databases.
• Endpoints, which are dynamic RPC endpoints that resemble stored procedures.
Observer coprocessors provide callback functions (hooks) for every explicit
API method:
• MasterObserver: hooks into the HMaster API.
• RegionObserver: hooks into region-related operations.
• WALObserver: hooks into write-ahead log operations.
RegionObserver Coprocessor … Put ( )
RegionObserver: Provides hooks for data manipulation events, Get, Put,
Delete, Scan, and so on. There is an instance of a RegionObserver
coprocessor for every table region and the scope of the observations
they can make is constrained to that region.
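To make the trigger analogy concrete, here is a minimal plain-Java sketch of the hook pattern. It is not the HBase API: `PutHook`, `ToyRegion`, and `IndexingHook` are hypothetical stand-ins, and the real `RegionObserver` hooks take richer arguments than the two strings used here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an observer: called around every put on one region.
interface PutHook {
    void prePut(String row, String value);
    void postPut(String row, String value);
}

// Toy model of a single region; its hooks only see writes made to this region.
class ToyRegion {
    private final Map<String, String> store = new HashMap<>();
    private final List<PutHook> hooks = new ArrayList<>();

    void register(PutHook hook) { hooks.add(hook); }

    void put(String row, String value) {
        for (PutHook h : hooks) h.prePut(row, value);
        store.put(row, value);
        for (PutHook h : hooks) h.postPut(row, value);
    }

    String get(String row) { return store.get(row); }
}

// Hook that records which rows would be forwarded to the search index.
class IndexingHook implements PutHook {
    final List<String> indexed = new ArrayList<>();

    @Override public void prePut(String row, String value) {
        // Nothing to do before the write in this sketch.
    }

    @Override public void postPut(String row, String value) {
        indexed.add(row + "=" + value);
    }
}

class ObserverSketch {
    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        IndexingHook hook = new IndexingHook();
        region.register(hook);
        region.put("book-1", "Sample Title");
        System.out.println(hook.indexed);
    }
}
```

Because the hook is registered on one `ToyRegion`, it never sees writes to any other region, which mirrors the per-region scope described above.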
Distributed Search Engine : ElasticSearch
• Distributed
• Highly-available
• REST-based search engine built on top of Lucene
• Designed to speak JSON (JSON in, JSON out)
For each index you can specify:
• Number of shards
Each index has a fixed number of shards.
• Number of replicas
Each shard can have zero or more replicas; this number can be
changed dynamically.
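One way to see why the shard count is fixed: Elasticsearch picks a document's shard by hashing its routing value (the document ID by default) modulo the number of primary shards. The sketch below illustrates that rule under one loud assumption: `String.hashCode()` stands in for the hash function Elasticsearch actually uses, and `shardFor` and `movedDocs` are illustrative names only.

```java
import java.util.Arrays;
import java.util.List;

class ShardRoutingSketch {
    // Assumption: String.hashCode() replaces Elasticsearch's real routing hash.
    static int shardFor(String docId, int numPrimaryShards) {
        return Math.floorMod(docId.hashCode(), numPrimaryShards);
    }

    // Counts documents whose shard would change if the shard count changed.
    static int movedDocs(List<String> ids, int oldShards, int newShards) {
        int moved = 0;
        for (String id : ids) {
            if (shardFor(id, oldShards) != shardFor(id, newShards)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("book-1", "book-2", "book-3", "book-4", "book-5");
        for (String id : ids) {
            System.out.println(id + " -> shard " + shardFor(id, 5));
        }
        System.out.println("docs that would move going from 5 to 6 shards: "
                + movedDocs(ids, 5, 6));
    }
}
```

Any document counted by `movedDocs` would be looked up on the wrong shard after a resize, whereas replicas are copies of existing shards and can be added or removed without touching the routing.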
ElasticSearch : Automatic Discovery
The Discovery module is responsible for discovering nodes within the
cluster, as well as electing a master node.
The master node is responsible for maintaining the global cluster
state and for reassigning shards when nodes join or leave the cluster.
The idea is to perform Indexing into
ElasticSearch from HBase Coprocessors…..
We need a Java Client…
Use ElasticSearch Transport Client : The Transport Client connects
remotely to an ElasticSearch cluster. It does not join the cluster, but
simply gets one or more initial transport addresses and communicates
with them in round robin fashion on each action (though most actions
will probably be “two hop” operations).
But this approach has a problem..
• The client has no knowledge of the ElasticSearch
cluster.
• Indexing takes two hops.
• There is no fault-tolerance mechanism if a transport address is down.
• An HBase RegionServer can host hundreds of regions and
hence hundreds of transport clients.
Solution
• Use the ElasticSearch Node Client. A client node does not hold
any index data but has knowledge of the complete cluster.
• Use HBASE-6505 to share one Node Client across the regions of a
RegionServer.
HBASE-6505
RegionCoprocessorEnvironment provides a getSharedData()
method, which returns a ConcurrentMap. The map is held by
the RegionCoprocessorHost as a weak reference (in a special
map with strongly referenced keys and weakly referenced
values) and held strongly by the RegionEnvironment.
That way, if the coprocessor is blacklisted, the coprocessor's
environment is removed and any shared data is immediately
available for garbage collection. This shared data is per
RegionServer. As long as at least one region observer
or endpoint is active, the shared data is not garbage collected
and can be used to share state between the remaining
coprocessors of the same class.
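The sharing idea can be sketched in plain Java as follows. This is a simplified model, not HBase code: `ToySearchClient` and `ToyRegionCoprocessor` are hypothetical, the key name `es.node.client` is made up, and the `ConcurrentMap` passed to `start` merely plays the role of the map returned by `getSharedData()`. The weak-reference behavior is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive Elasticsearch node client.
class ToySearchClient {
    static final AtomicInteger CREATED = new AtomicInteger();

    ToySearchClient() {
        CREATED.incrementAndGet();
    }
}

// Stand-in for one coprocessor instance; there is one per region.
class ToyRegionCoprocessor {
    static final String CLIENT_KEY = "es.node.client"; // hypothetical key name
    private ToySearchClient client;

    // sharedData models the per-RegionServer map from getSharedData().
    void start(ConcurrentMap<String, Object> sharedData) {
        // Atomic on ConcurrentHashMap: only the first caller creates the client.
        client = (ToySearchClient) sharedData.computeIfAbsent(
                CLIENT_KEY, key -> new ToySearchClient());
    }

    ToySearchClient client() {
        return client;
    }
}

class SharedClientSketch {
    public static void main(String[] args) {
        ConcurrentMap<String, Object> sharedData = new ConcurrentHashMap<>();
        List<ToyRegionCoprocessor> regions = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            ToyRegionCoprocessor cp = new ToyRegionCoprocessor();
            cp.start(sharedData);
            regions.add(cp);
        }
        System.out.println("regions: " + regions.size()
                + ", clients created: " + ToySearchClient.CREATED.get());
    }
}
```

With a per-region client, the same loop would create 300 clients; with the shared map, every region reuses the first one.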
The Final Problem….
Concurrency Control …
HBase solves it using MVCC (Multi-Version Concurrency Control):
updates are implemented not by deleting an old piece of data and
overwriting it with a new one, but by marking the old data as
obsolete and adding a newer version.
ElasticSearch uses OCC (Optimistic Concurrency Control):
it assumes that multiple transactions can complete without affecting
each other, so transactions proceed without locking the data
resources they affect. Before committing, each transaction verifies
that no other transaction has modified its data. If the check reveals
conflicting modifications, the committing transaction rolls back.
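The optimistic check can be illustrated with a small, generic Java sketch. It shows the verify-before-commit idea only and makes no claim about how ElasticSearch implements it internally; `OccSketch` and `Versioned` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

class OccSketch {
    // Immutable snapshot of a record: a value plus the version it was written at.
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final AtomicReference<Versioned> current =
            new AtomicReference<>(new Versioned(1, "initial"));

    // Reading takes no lock; the caller just keeps the snapshot it saw.
    Versioned read() {
        return current.get();
    }

    // The commit succeeds only if nobody replaced the snapshot since it was read.
    boolean commit(Versioned readSnapshot, String newValue) {
        Versioned next = new Versioned(readSnapshot.version + 1, newValue);
        return current.compareAndSet(readSnapshot, next);
    }

    public static void main(String[] args) {
        OccSketch record = new OccSketch();
        Versioned seenByA = record.read();
        Versioned seenByB = record.read();
        System.out.println("A commits: " + record.commit(seenByA, "from A"));
        System.out.println("B commits: " + record.commit(seenByB, "from B"));
        System.out.println("final version: " + record.read().version);
    }
}
```

Both writers read the same snapshot; the first commit wins and the second is rejected, so the second writer has to re-read and retry rather than silently overwrite.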
Let's See a Conflict... Search and Update
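[Diagram: HBase and ES side by side with two clients, C1 and C2, shown before and after an update. Labels: V1 in HBase, V1 (M/R) in ES, V2 (update success), V2 (CP), Conflict against V1 (M/R).]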
One More Conflict.. Search and Update
[Diagram: HBase and ES side by side with two clients, C1 and C2, shown before and after an update. Labels: V1 in HBase, V1 (M/R) in ES, V2 (M/R), Conflict on both sides.]
The bottom line is:
Search and Update should succeed only when the version in
ElasticSearch and the version in HBase are the same at the
time of the update.
Solution..
1. Data load from the source to HBase inserts a document with a Put call.
2. The postPut coprocessor performs incrementColumnValue on a version
column.
3. The same version number is propagated to ElasticSearch during
MapReduce-based bulk indexing. ElasticSearch supports externally
supplied version numbers.
4. Steps 1-3 repeat for any new data upload.
5. During search and update, the client performs a checkAndPut()
call.
5i. The client performs a search and gets the version number from ElasticSearch.
5ii. The client constructs a Put with new version = old version + 1.
5iii. The client performs checkAndPut, which checks for the old version number
before doing the Put.
5iv. The postCheckAndPut coprocessor is invoked to propagate the successful Put
to the search cluster.
5v. After this step, the version number in the HBase column and the
ElasticSearch version are equal.
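The steps above can be sketched end to end in plain Java. This is a toy model under stated assumptions, not the presenters' code: `ToyRow` and `ToyIndex` are hypothetical stand-ins for an HBase row and an ElasticSearch document, the coprocessor hooks are folded into ordinary methods, and the index applies the external-versioning rule of accepting a write only when the supplied version is greater than the stored one.

```java
class VersionedSyncSketch {
    // Toy search document using external versioning.
    static final class ToyIndex {
        private long version = 0;
        private String doc;

        // Accept the write only if the supplied version is strictly newer.
        synchronized boolean index(String newDoc, long externalVersion) {
            if (externalVersion <= version) {
                return false;
            }
            version = externalVersion;
            doc = newDoc;
            return true;
        }

        synchronized long version() { return version; }
        synchronized String doc() { return doc; }
    }

    // Toy HBase row with a data column and a version column.
    static final class ToyRow {
        private long version = 0;
        private String data;
        private final ToyIndex index;

        ToyRow(ToyIndex index) { this.index = index; }

        // Steps 1-2: a plain put, then the "postPut" logic bumps the version column.
        synchronized void put(String newData) {
            data = newData;
            version++;
        }

        // Step 3: offline bulk indexing copies the data and its version to the index.
        synchronized void bulkIndex() {
            index.index(data, version);
        }

        // Steps 5iii-5iv: atomic check of the old version, then propagate on success.
        synchronized boolean checkAndPut(long expectedVersion, String newData) {
            if (version != expectedVersion) {
                return false; // someone else updated the row first
            }
            data = newData;
            version = expectedVersion + 1;
            index.index(data, version); // models the "postCheckAndPut" propagation
            return true;
        }

        synchronized long version() { return version; }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        ToyRow row = new ToyRow(index);
        row.put("original record");
        row.bulkIndex();

        // Step 5i: both clients search and read the same version from the index.
        long seenByC1 = index.version();
        long seenByC2 = index.version();

        System.out.println("C1 update: " + row.checkAndPut(seenByC1, "edit by C1"));
        System.out.println("C2 update: " + row.checkAndPut(seenByC2, "edit by C2"));
        System.out.println("row version " + row.version()
                + ", index version " + index.version());
    }
}
```

In this model the second client's update is rejected because the version it read is stale; it would have to search again, pick up the new version, and retry. After the successful update the row and the index hold the same version, which is the invariant stated in step 5v.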