2. Overview
• Why Full-text Indexing for HBase
• Organize Indices
• Index Building
• Index Splitting
• Index Searching
• Performance Results
3. Why Full-text Indexing for HBase?
• Fast retrieval by more than one column
– HBase: lack of indexing for non-key columns
• Effective search for non-exact matches: containing, starting-with, ending-with, etc.
– HBase: supports only byte-order indexing; no text structure awareness
4. Organize Indices
• One Lucene directory for one HBase region
[Diagram: the regions of HBase table “t1” (Region t1,,… / Region t1,aa,… / Region t1,bb,… / … / Region t1,xx,…) each map to their own Lucene index directory on HDFS (Dir: t1,,… / Dir: t1,aa,… / Dir: t1,bb,… / Dir: t1,xx,…)]
5. Organize Indices
• One Lucene document for one HBase record
– rowkey → document field “ID”
– indexed column(s) → a user-specified field
HBase record: r1 (rowkey) | v1 (f:c1) | v2 (f:c2) | v3 (f:c3) | v4 (f:c4)
Lucene document: ID | field_a | field_b
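The record-to-document mapping can be sketched in a few lines. This is only an illustration with plain dictionaries, not the Lucene API: the field names field_a and field_b come from the slide, while the assignment of columns to fields (INDEX_DEF) and the helper to_document are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical index definition: Lucene field name -> HBase columns feeding it.
# One or multiple columns may map to one field; this particular split is assumed.
INDEX_DEF: Dict[str, List[str]] = {
    "field_a": ["f:c1", "f:c2"],
    "field_b": ["f:c3"],
}


def to_document(rowkey: str, record: Dict[str, str]) -> Dict[str, str]:
    """Build one document (a plain dict here) for one HBase record."""
    doc = {"ID": rowkey}  # the rowkey always goes into the "ID" field
    for field, columns in INDEX_DEF.items():
        values = [record[c] for c in columns if c in record]
        if values:  # skip fields whose columns are all absent
            doc[field] = " ".join(values)
    return doc


doc = to_document("r1", {"f:c1": "v1", "f:c2": "v2", "f:c3": "v3", "f:c4": "v4"})
print(doc)
```

Column f:c4 is not part of the assumed index definition, so it does not appear in the document.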
6. Index Building
• Implemented with HBase Coprocessors (Region Observers)
• The main hooks are:
– Updates (Put / Delete)
– WAL restore
– Region open/close
– Memstore flush
– Region splitting
7. Index Building: updates and WAL restore
• In prePut(), postDelete() or preWALRestore(), a generated
Lucene document is added, updated or deleted accordingly.
Flow for a put request on “row1” (the record is also added into the memstore):
– Update-mode && missing columns?
– Y: look up the HBase table for those missing columns of “row1”, then compose the Lucene document with new & existing values
– N: compose the Lucene document, ignoring missing columns
– Add the Lucene document
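The put-side decision can be sketched as follows. This is a minimal model with dictionaries, not the coprocessor code: INDEXED_COLUMNS, compose_on_put and the lookup callback (standing in for a read of the HBase table) are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List

INDEXED_COLUMNS: List[str] = ["f:c1", "f:c2"]  # assumed index definition


def compose_on_put(
    rowkey: str,
    put_values: Dict[str, str],
    update_mode: bool,
    lookup: Callable[[str, List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Compose the document that would be added for one put request."""
    values = {c: put_values[c] for c in INDEXED_COLUMNS if c in put_values}
    missing = [c for c in INDEXED_COLUMNS if c not in put_values]
    if update_mode and missing:
        # Y branch: fetch the missing columns from the table and merge them in.
        existing = lookup(rowkey, missing)
        values.update({c: v for c, v in existing.items() if c in missing})
    # N branch: missing columns are simply ignored.
    return {"ID": rowkey, **values}


table = {"row1": {"f:c1": "old1", "f:c2": "old2"}}


def lookup(rowkey: str, columns: List[str]) -> Dict[str, str]:
    row = table.get(rowkey, {})
    return {c: row[c] for c in columns if c in row}


merged = compose_on_put("row1", {"f:c1": "new1"}, True, lookup)
partial = compose_on_put("row1", {"f:c1": "new1"}, False, lookup)
print(merged, partial)
```

The extra table read in update-mode is the cost of keeping the document complete when a put touches only some of the indexed columns.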
8. Index Building: updates and WAL restore
• For deletes, always look up the HBase table to see if all
columns for index building have been deleted.
Flow for a delete request on “row1” (a delete mark is also added into the memstore):
– Look up the HBase table for any index column of “row1”
– All columns for index deleted?
– Y: delete Lucene documents having field “ID” as “row1”
– N: compose the Lucene document with existing and deleted values, then update the Lucene document
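The delete-side decision can be modelled the same way. The sketch below assumes the lookup runs after the delete has been applied (as a postDelete hook would see it) and returns whatever indexed columns remain; on_post_delete and the dictionary-based index are illustrative stand-ins, not the actual implementation.

```python
from typing import Callable, Dict

Doc = Dict[str, str]


def on_post_delete(
    index: Dict[str, Doc],
    rowkey: str,
    lookup: Callable[[str], Dict[str, str]],
) -> str:
    """Update or delete the document for rowkey after a delete request."""
    remaining = lookup(rowkey)  # indexed columns still present in the table
    if not remaining:
        # All columns used for index building are gone: drop the document.
        index.pop(rowkey, None)
        return "deleted"
    # Some indexed columns survive: replace the document with what is left.
    index[rowkey] = {"ID": rowkey, **remaining}
    return "updated"


index = {"row1": {"ID": "row1", "f:c1": "v1", "f:c2": "v2"}}
table = {"row1": {"f:c2": "v2"}}  # table state after f:c1 was deleted

first = on_post_delete(index, "row1", lambda r: table.get(r, {}))
snapshot = dict(index["row1"])

table["row1"] = {}  # now f:c2 is deleted as well
second = on_post_delete(index, "row1", lambda r: table.get(r, {}))
print(first, snapshot, second)
```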
9. Index Building: memstore flush
• Fork a thread to do the Lucene index commit as the memstore flush starts; join this thread in postFlush()
Flow: memstore flush starts → (1) flush memstore to storefiles, in parallel with (2) commit Lucene index segments → memstore flush completes
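The fork/join pattern around the flush can be sketched with a plain thread. IndexFlushHook, pre_flush and post_flush are hypothetical names that mirror the flush hooks; the commit itself is passed in as a callback, so nothing here depends on Lucene.

```python
import threading
from typing import Callable, List, Optional


class IndexFlushHook:
    """Runs the index commit in parallel with the memstore flush."""

    def __init__(self, commit_index: Callable[[], None]) -> None:
        self._commit_index = commit_index
        self._thread: Optional[threading.Thread] = None

    def pre_flush(self) -> None:
        # Fork: start committing index segments as the flush starts.
        self._thread = threading.Thread(target=self._commit_index)
        self._thread.start()

    def post_flush(self) -> None:
        # Join: the flush is not reported complete until the commit is done.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


events: List[str] = []
hook = IndexFlushHook(lambda: events.append("index committed"))

hook.pre_flush()
events.append("memstore flushed")  # stands in for HBase writing storefiles
hook.post_flush()
events.append("flush complete")
print(events)
```

The relative order of the commit and the storefile write is not fixed; the join only guarantees that both have finished before the flush completes.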
10. Index Splitting
• Split Lucene indices on region splitting:
– Copy indices from parent dir to daughter dirs
• Problem: index splitting is time-consuming
– Should not block updates to new daughter regions
11. Index Splitting (optimized for non update-mode)
• Split indices in a background thread
• Temp directory for new updates
• Merge old and new dirs when copying is done
[Diagram: daughter region indices. The main index dir is under copying from the index dir of the parent region; HBase updates go to a temp index dir for new updates, which will be merged after index copying is done.]
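The temp-directory idea can be sketched with two in-memory maps standing in for the main and temp index dirs. DaughterIndex and its methods are hypothetical, and filtering the parent's documents by the daughter's key range is an assumption of this sketch; the slides only say that indices are copied from the parent dir. Because this models non update-mode, new writes are pure additions and the merge is a simple union.

```python
import threading
from typing import Callable, Dict

Doc = Dict[str, str]


class DaughterIndex:
    """Index of a daughter region that accepts updates while being copied."""

    def __init__(self, in_range: Callable[[str], bool]) -> None:
        self.in_range = in_range
        self.main: Dict[str, Doc] = {}  # main index dir (under copying)
        self.temp: Dict[str, Doc] = {}  # temp index dir for new updates
        self.copying = True
        self._lock = threading.Lock()

    def add(self, doc: Doc) -> None:
        # Updates never wait for the copy: they go to temp until it is done.
        with self._lock:
            target = self.temp if self.copying else self.main
            target[doc["ID"]] = doc

    def copy_from_parent(self, parent: Dict[str, Doc]) -> None:
        # The slow part runs without the lock so that add() is not blocked.
        copied = {k: d for k, d in parent.items() if self.in_range(k)}
        with self._lock:
            self.main.update(copied)
            self.main.update(self.temp)  # merge old and new
            self.temp.clear()
            self.copying = False


parent = {"aa1": {"ID": "aa1"}, "bb1": {"ID": "bb1"}}
daughter = DaughterIndex(lambda rowkey: rowkey < "b")

daughter.add({"ID": "ab9"})  # arrives while the copy is pending: goes to temp
worker = threading.Thread(target=daughter.copy_from_parent, args=(parent,))
worker.start()
worker.join()
daughter.add({"ID": "ab10"})  # arrives after the merge: goes to main
print(sorted(daughter.main))
```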
12. Index Searching
• Coprocessor endpoint (enhanced with local result combiner)
[Diagram: the index-search client calls each HRegionServer. On every HRegionServer an endpoint dispatcher & combiner fans out to the index searcher of each HRegion, and each index searcher reads its own index dir on HDFS.]
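The two-level result combining can be sketched as a top-k merge. The scoring model and the names Hit and combine are assumptions made for illustration; the slides do not say how hits are ranked or how many are returned. Each region server first combines the hits of its own regions, so the client only merges one short list per server.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, rowkey); the scoring model is assumed


def combine(hit_lists: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge several hit lists and keep the k best-scoring hits."""
    return heapq.nlargest(k, (hit for hits in hit_lists for hit in hits))


# Per-region search results on two region servers (made-up data).
server1_regions = [[(0.9, "r1"), (0.2, "r7")], [(0.5, "r3")]]
server2_regions = [[(0.8, "r4")], [(0.7, "r5"), (0.1, "r9")]]

# Local combiner on each region server, then the final merge on the client.
local1 = combine(server1_regions, 2)
local2 = combine(server2_regions, 2)
top = combine([local1, local2], 3)
print(top)
```

As long as each server returns at least as many hits as the client finally keeps, the result matches a single global top-k; with a smaller local k, as in this toy run, hits can in principle be lost.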
13. Performance Results
• Total record count: 500,000,000
• Record size: 1KB
• Memstore: 128MB
• 6 nodes: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (24 cores), memory 48G
• ~1s search time out of 10 billion records
Insertion throughput (records / sec):
– Without pre-split regions: W/O index 42735, W/ index 37537
– With pre-split regions: W/O index 63613, W/ index 55928
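The relative cost of indexing can be worked out from the chart values. The calculation below assumes that, in each chart, the first value belongs to the first label (without index) and the second value to the second label (with index); the helper overhead is introduced only for this example.

```python
def overhead(without_index: float, with_index: float) -> float:
    """Fraction of insertion throughput lost when indexing is enabled."""
    return (without_index - with_index) / without_index


no_presplit = overhead(42735, 37537)
presplit = overhead(63613, 55928)
print(f"without pre-split: {no_presplit:.1%}, with pre-split: {presplit:.1%}")
```

Under that reading, indexing costs roughly 12% of insertion throughput in both setups.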
Editor's Notes
Consistency, locality, high throughput
Define: one or multiple columns to one field. One doc for each record. Additional ID field.
Use ZooKeeper nodes to track index splitting status, so that when opening an indexed region we know whether it was undergoing index splitting before being moved. Parent directory cleanup is performed in HMaster by a chore thread, IndexSplitCleaner. More Lucene capabilities will be needed before this optimization can be enabled in update-mode.