HBaseCon 2013: Full-Text Indexing for Apache HBase
 


Presented by: Maryann Xue, Intel


  • Consistency, locality, high throughput
  • Define: one or multiple columns map to one field. One doc for each record. Additional ID field.
  • Use ZooKeeper nodes to keep track of index splitting status, so that when opening an indexed region we know whether it was undergoing index splitting before being moved. Parent directory cleanup is performed in HMaster by a chore thread, IndexSplitCleaner. More Lucene capabilities will be needed for this optimization to be enabled in update-mode.

Presentation Transcript

  • Full-text indexing for HBase Mary-Ann Xue wei.xue@intel.com
  • Overview
    • Why Full-text Indexing for HBase
    • Organize Indices
    • Index Building
    • Index Splitting
    • Index Searching
    • Performance Results
  • Why Full-text Indexing for HBase?
    • Fast retrieval by more than one column
      – HBase: lack of indexing for non-key columns
    • Effective search for non-exact matches: containing, starting-with, ending-with, etc.
      – HBase: supports only byte-order indexing; no text structure awareness
  • Organize Indices
    • One Lucene directory for one HBase region
    • [Diagram: HBase table “t1” with regions t1,,… / t1,aa,… / t1,bb,… / t1,xx,…, each mapped to its own Lucene index directory on HDFS: Dir: t1,,… / Dir: t1,aa,… / Dir: t1,bb,… / Dir: t1,xx,…]
  • Organize Indices
    • One Lucene document for one HBase record
      – rowkey → document field “ID”
      – indexed column(s) → a user-specified field
    • [Diagram: HBase record with r1 (rowkey), v1 (f:c1), v2 (f:c2), v3 (f:c3), v4 (f:c4) mapped to a Lucene document with fields ID, field_a, field_b]
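The record-to-document mapping on this slide can be sketched with plain dictionaries. This is only an illustration, not the presenter's code or the Lucene API: `build_document`, the `field_map` layout, and the choice to join several column values with a space are all assumptions.

```python
from typing import Dict, List


def build_document(rowkey: str,
                   record: Dict[str, str],
                   field_map: Dict[str, List[str]]) -> Dict[str, str]:
    """Map one HBase-style record to one Lucene-style document (plain dicts)."""
    # The rowkey always becomes the additional "ID" field.
    doc = {"ID": rowkey}
    for field, columns in field_map.items():
        # One or several columns feed one field; absent columns are skipped.
        values = [record[c] for c in columns if c in record]
        if values:
            # Assumption: multiple column values are joined with a space.
            doc[field] = " ".join(values)
    return doc


# Hypothetical mapping: two columns feed field_a, one column feeds field_b.
record = {"f:c1": "v1", "f:c2": "v2", "f:c3": "v3", "f:c4": "v4"}
field_map = {"field_a": ["f:c1", "f:c2"], "field_b": ["f:c3"]}
doc = build_document("r1", record, field_map)
```

Columns that are not listed in `field_map` (here `f:c4`) simply do not appear in the document.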
  • Index Building
    • Implemented with HBase Coprocessors (Region Observers)
    • The main hooks are:
      – Updates (Put / Delete)
      – WAL restore
      – Region open/close
      – Memstore flush
      – Region splitting
  • Index Building: updates and WAL restore
    • In prePut(), postDelete() or preWALRestore(), a generated Lucene document is added, updated or deleted accordingly.
    • [Flowchart for put request “row1”: check “update-mode && missing columns?”. If Y, look up HBase table for those missing columns of “row1” and compose Lucene document with new & existing values. If N, compose Lucene document ignoring missing columns. The Lucene document is added and the record is added into memstore.]
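The put branch of the flowchart can be modelled as below. This is a minimal sketch under stated assumptions: the table is a plain dictionary, `compose_for_put` is a hypothetical helper name, and "missing columns" is read as indexed columns that the put itself does not carry.

```python
def compose_for_put(rowkey, put_values, index_columns, table, update_mode):
    """Compose the document for a put, following the Y/N branch of the flowchart."""
    # Keep only the columns that take part in indexing.
    values = {c: v for c, v in put_values.items() if c in index_columns}
    missing = [c for c in index_columns if c not in values]
    if update_mode and missing:
        # Y branch: look up the table for the missing columns of this row.
        existing = table.get(rowkey, {})
        for c in missing:
            if c in existing:
                values[c] = existing[c]
    # N branch falls through: missing columns are simply ignored.
    doc = {"ID": rowkey}
    doc.update(values)
    return doc


# Hypothetical table state and put request.
table = {"row1": {"f:c1": "old1", "f:c2": "old2"}}
index_columns = ["f:c1", "f:c2"]
put_values = {"f:c1": "new1"}

doc_update = compose_for_put("row1", put_values, index_columns, table, True)
doc_plain = compose_for_put("row1", put_values, index_columns, table, False)
```

In update-mode the document keeps the stored value of `f:c2`; otherwise it only contains what the put supplied.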
  • Index Building: updates and WAL restore
    • For deletes, always look up the HBase table to see if all columns for index building have been deleted.
    • [Flowchart for delete request “row1”: add delete mark into memstore, then look up HBase table for any index column of “row1” and check “all columns for index deleted?”. If Y, delete Lucene documents having field “ID” as “row1”. If N, compose Lucene document with existing and deleted values and update Lucene document.]
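The delete branch can be sketched the same way. The names here (`handle_delete`, the dictionary-based `table` and `index`) are hypothetical, and the reading that the updated document keeps only the indexed columns that survive the delete is an assumption about the flowchart.

```python
def handle_delete(rowkey, deleted_columns, index_columns, table, index):
    """Apply a delete to the table model, then delete or update the document."""
    row = table.get(rowkey, {})
    for c in deleted_columns:
        # Stands in for the delete mark added into the memstore.
        row.pop(c, None)
    # Look up the table for any index column that still exists for this row.
    remaining = {c: row[c] for c in index_columns if c in row}
    if not remaining:
        # All columns for the index are deleted: drop documents with this ID.
        index.pop(rowkey, None)
        return "deleted"
    # Otherwise recompose the document from what is left and update it.
    index[rowkey] = {"ID": rowkey, **remaining}
    return "updated"


# Hypothetical starting state.
table = {"row1": {"f:c1": "a", "f:c2": "b"}}
index = {"row1": {"ID": "row1", "f:c1": "a", "f:c2": "b"}}
index_columns = ["f:c1", "f:c2"]

first = handle_delete("row1", ["f:c1"], index_columns, table, index)
after_first = dict(index["row1"])
second = handle_delete("row1", ["f:c2"], index_columns, table, index)
```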
  • Index Building: memstore flush
    • Fork a thread to do Lucene index commit as memstore flush starts; join this thread in postFlush()
    • [Diagram: memstore flush starts; (1) flush memstore to storefiles and (1) commit Lucene index segments run side by side; (2) memstore flush completes]
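The fork/join pattern on this slide can be illustrated with a standard thread. This sketch is not HBase code: `FlushCoordinator` and `pre_flush` are hypothetical names (the slide names only postFlush()), and the commit itself is a stand-in callback.

```python
import threading


class FlushCoordinator:
    """Runs an index commit in parallel with a memstore flush."""

    def __init__(self, commit_index):
        self._commit_index = commit_index
        self._thread = None

    def pre_flush(self):
        # Fork: start the index commit as the memstore flush starts.
        self._thread = threading.Thread(target=self._commit_index)
        self._thread.start()

    def post_flush(self):
        # Join: the flush is only complete once the commit has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


events = []
coordinator = FlushCoordinator(lambda: events.append("index committed"))
coordinator.pre_flush()
events.append("memstore flushed")  # the storefile write happens meanwhile
coordinator.post_flush()
events.append("flush complete")
```

The relative order of the first two events depends on scheduling; the join only guarantees that both precede "flush complete".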
  • Index Splitting
    • Split Lucene indices on region splitting: copy indices from parent dir to daughter dirs
    • Problem: index splitting is time-consuming, and it should not block updates to new daughter regions
  • Index Splitting (optimized for non update-mode)
    • Split indices in a background thread
    • Temp directory for new updates
    • Merge old and new dirs when copying is done
    • [Diagram: daughter region indices consist of a main index dir (under copying, fed from the index dir of the parent region) and a temp index dir for new updates (fed by HBase updates); the two will be merged after index copying is done]
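The background copy, temp directory, and final merge can be modelled roughly as follows. This is a sketch under assumptions: directories are plain dictionaries, `DaughterIndex` and the `in_range` predicate (which decides which parent documents belong to the daughter) are hypothetical, and only additions are handled, matching non update-mode.

```python
import threading


class DaughterIndex:
    """Daughter index that accepts new documents while the parent copy runs."""

    def __init__(self, parent_docs, in_range):
        self.main = {}   # main index dir (under copying)
        self.temp = {}   # temp index dir for new updates
        self._lock = threading.Lock()
        self._copying = True
        self._worker = threading.Thread(
            target=self._copy, args=(dict(parent_docs), in_range))
        self._worker.start()

    def _copy(self, parent_docs, in_range):
        # Copy the documents of the parent that fall into this daughter's range.
        copied = {k: d for k, d in parent_docs.items() if in_range(k)}
        with self._lock:
            self.main.update(copied)
            # Merge the temp dir into the main dir once copying is done.
            self.main.update(self.temp)
            self.temp.clear()
            self._copying = False

    def add(self, rowkey, doc):
        # New updates are never blocked by the copy itself.
        with self._lock:
            target = self.temp if self._copying else self.main
            target[rowkey] = doc

    def wait(self):
        self._worker.join()


parent = {"a1": {"ID": "a1"}, "m5": {"ID": "m5"}}
daughter = DaughterIndex(parent, lambda key: key < "m")
daughter.add("b2", {"ID": "b2"})
daughter.wait()
```

Whether the new document lands in the temp dir or directly in the main dir depends on timing, but after `wait()` everything is in the main dir.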
  • Index Searching
    • Coprocessor endpoint (enhanced with local result combiner)
    • [Diagram: an index-search client sends requests to each HRegionServer; on every HRegionServer an endpoint dispatcher & combiner fans out to the index-searcher of each HRegion, and each index-searcher reads its own index dir on HDFS]
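The two-level combine in the diagram can be sketched as a scatter-gather. The slide does not say how results are scored or combined, so the term-count score and the top-k merge below are assumptions, as are the function names.

```python
import heapq


def search_region(region_docs, term):
    """Per-region index-searcher stand-in: score is a simple term count."""
    hits = []
    for rowkey, text in region_docs.items():
        score = text.split().count(term)
        if score:
            hits.append((score, rowkey))
    return hits


def combine(hit_lists, k):
    """Keep the k best hits out of several hit lists."""
    merged = [hit for hits in hit_lists for hit in hits]
    return heapq.nlargest(k, merged)


def search_cluster(servers, term, k):
    # Local combiner: each server merges the hits of its own regions first,
    # so at most k hits per server travel back to the client.
    per_server = [
        combine([search_region(region, term) for region in regions], k)
        for regions in servers
    ]
    # Client side: merge the per-server results.
    return combine(per_server, k)


# Two hypothetical region servers with two regions each.
servers = [
    [{"r1": "hbase lucene hbase"}, {"r2": "lucene"}],
    [{"r3": "hbase"}, {"r4": "hbase hbase hbase"}],
]
top = search_cluster(servers, "hbase", 2)
```

Combining on the region server reduces what the client has to receive and merge, which is the apparent point of the local result combiner.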
  • Performance Results
    • Total record count: 500,000,000
    • Record size: 1KB
    • Memstore: 128MB
    • 6 nodes: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (24 cores), memory 48G
    • ~1s search time out of 10 billion records
    • [Chart: Insertion without pre-split regions (record / sec): W/O Index 42735, W/ Index 37537]
    • [Chart: Insertion with pre-split regions (record / sec): W/O Index 63613, W/ Index 55928]
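The relative insertion cost of indexing follows from the chart figures with simple arithmetic. The calculation below assumes that, in each chart, the larger number is the throughput without the index and the smaller one the throughput with it, as the legend order suggests; `indexing_overhead` is a helper introduced here for illustration.

```python
def indexing_overhead(without_index, with_index):
    """Throughput drop caused by indexing, as a percentage of the baseline."""
    return (without_index - with_index) / without_index * 100


no_presplit = indexing_overhead(42735, 37537)  # insertion without pre-split regions
presplit = indexing_overhead(63613, 55928)     # insertion with pre-split regions
print(f"without pre-split: {no_presplit:.1f}%, with pre-split: {presplit:.1f}%")
```

Under that reading, indexing costs roughly 12% of insertion throughput in both setups.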