HBaseCon 2013: Full-Text Indexing for Apache HBase


Cloudera, Inc.

Presented by: Maryann Xue, Intel


Full-text indexing for HBase
Mary-Ann Xue
wei.xue@intel.com

Overview
• Why Full-text Indexing for HBase
• Organize Indices
• Index Building
• Index Splitting
• Index Searching
• Performance Results

Why Full-text Indexing for HBase?
• Fast retrieval by more than one column
– HBase: lack of indexing for non-key columns
• Effective search for non-exact matches:
containing, starting-with, ending-with, etc.
– HBase: supports only byte-order indexing; no text
structure awareness

Organize Indices
• One Lucene directory for one HBase region
[Diagram: regions of HBase table “t1” mapped to Lucene indices on HDFS]
Region t1,,… → Dir: t1,,…
…
Region t1,aa,… → Dir: t1,aa,…
Region t1,bb,… → Dir: t1,bb,…
Region t1,xx,… → Dir: t1,xx,…
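The one-directory-per-region layout can be sketched as a simple name mapping. This is only an illustration, not the talk's implementation: the root path `/hbase-index` and the helper `index_dir_for_region` are hypothetical, and the region names are shortened placeholders for the truncated names on the slide.

```python
from posixpath import join

# Hypothetical root directory on HDFS for all Lucene indices (an assumption).
INDEX_ROOT = "/hbase-index"

def index_dir_for_region(table, region_name):
    """Return the index directory for one region: one directory per region."""
    return join(INDEX_ROOT, table, region_name)

# Shortened placeholder region names for table "t1".
regions = ["t1,,", "t1,aa,", "t1,bb,", "t1,xx,"]
dirs = {r: index_dir_for_region("t1", r) for r in regions}
```

Because each region owns exactly one directory, an index can be opened, closed, and split together with its region.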

Organize Indices
• One Lucene document for one HBase record
– rowkey → document field “ID”
– indexed column(s) → a user-specified field
HBase record: r1 (rowkey), v1 (f:c1), v2 (f:c2), v3 (f:c3), v4 (f:c4)
Lucene document: ID, field_a, field_b
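A minimal sketch of the record-to-document mapping, using a plain dict in place of a Lucene document. The function `build_document`, the choice of which columns feed `field_a` and `field_b`, and joining multiple column values with a space are all assumptions made for illustration; the slide does not specify them.

```python
def build_document(rowkey, row, field_spec):
    """Build one document (a plain dict here) for one HBase record.

    row:        maps "family:qualifier" -> value
    field_spec: maps a user-specified field name -> list of source columns
    """
    doc = {"ID": rowkey}  # the rowkey always goes into the "ID" field
    for field, columns in field_spec.items():
        values = [row[c] for c in columns if c in row]
        if values:
            # Assumption: several columns mapped to one field are concatenated.
            doc[field] = " ".join(values)
    return doc

row = {"f:c1": "v1", "f:c2": "v2", "f:c3": "v3", "f:c4": "v4"}
spec = {"field_a": ["f:c1", "f:c2"], "field_b": ["f:c3"]}  # hypothetical mapping
doc = build_document("r1", row, spec)
```

Column f:c4 is not listed in the spec, so it is stored in HBase but not indexed.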

Index Building
• Implemented with HBase Coprocessors (Region Observers)
• The main hooks are:
– Updates (Put / Delete)
– WAL restore
– Region open/close
– Memstore flush
– Region splitting

Index Building: updates and WAL restore
• In prePut(), postDelete() or preWALRestore(), a generated Lucene document is added, updated or deleted accordingly.
Flow for a put request on “row1”:
1. Check: update-mode && missing columns?
– Y: look up the HBase table for those missing columns of “row1”, then compose the Lucene document with new & existing values
– N: compose the Lucene document ignoring missing columns
2. Add the Lucene document
3. Add the record into the memstore
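The put flow can be sketched as follows. This is a simplified model rather than the coprocessor code: `compose_on_put` and `lookup` are hypothetical names, `lookup` stands in for a read of the HBase table, and documents are plain dicts.

```python
def compose_on_put(rowkey, put_columns, indexed_columns, update_mode, lookup):
    """Compose the document for a put.

    lookup(rowkey, columns) stands in for a read of the HBase table and
    returns a dict of the requested columns that currently exist.
    """
    missing = [c for c in indexed_columns if c not in put_columns]
    values = {c: put_columns[c] for c in indexed_columns if c in put_columns}
    if update_mode and missing:
        # Y branch: fetch existing values for the columns the put omitted.
        existing = lookup(rowkey, missing)
        values = {**existing, **values}
    # N branch: missing columns are simply ignored.
    return {"ID": rowkey, **values}

table = {"row1": {"f:c1": "old1", "f:c2": "old2"}}

def lookup(rowkey, columns):
    stored = table.get(rowkey, {})
    return {c: stored[c] for c in columns if c in stored}

indexed = ["f:c1", "f:c2"]
doc_update = compose_on_put("row1", {"f:c1": "new1"}, indexed, True, lookup)
doc_plain = compose_on_put("row1", {"f:c1": "new1"}, indexed, False, lookup)
```

In update-mode the extra table read keeps the document complete; outside update-mode the read is skipped, which is cheaper but indexes only the columns present in the put.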

Index Building: updates and WAL restore
• For deletes, always look up the HBase table to see if all columns for index building have been deleted.
Flow for a delete request on “row1”:
1. Add the delete mark into the memstore
2. Look up the HBase table for any index column of “row1”
3. Check: all columns for index deleted?
– Y: delete Lucene documents having field “ID” as “row1”
– N: compose the Lucene document with existing and deleted values, then update the Lucene document
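A rough sketch of the delete flow, with the index modelled as a dict keyed by the “ID” field. The names `apply_delete` and `lookup` are hypothetical, and recomposing the document from the columns that survive the delete is one interpretation of the slide, not a confirmed detail.

```python
def apply_delete(rowkey, index, lookup, indexed_columns):
    """Update the index (a dict keyed by ID) after a delete on rowkey.

    lookup(rowkey, columns) stands in for a read of the HBase table after
    the delete and returns the indexed columns that still exist.
    """
    remaining = lookup(rowkey, indexed_columns)
    if not remaining:
        # All indexed columns are gone: drop documents with ID == rowkey.
        index.pop(rowkey, None)
    else:
        # Some indexed columns survive: replace the document.
        index[rowkey] = {"ID": rowkey, **remaining}
    return index

indexed = ["f:c1", "f:c2"]
original = {"row1": {"ID": "row1", "f:c1": "a", "f:c2": "b"}}

# Only f:c1 was deleted, so f:c2 is still found by the lookup.
partial = apply_delete("row1", dict(original), lambda r, cols: {"f:c2": "b"}, indexed)

# Every indexed column was deleted, so the lookup finds nothing.
gone = apply_delete("row1", dict(original), lambda r, cols: {}, indexed)
```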

Index Building: memstore flush
• Fork a thread to do the Lucene index commit as the memstore flush starts; join this thread in postFlush()
[Diagram: Memstore flush starts → Flush memstore to storefiles, with Commit Lucene index segments running in a parallel thread → Memstore flush completes]
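The fork-and-join pattern can be sketched with a plain thread. `IndexedRegion` and its method names are hypothetical stand-ins for the region and its flush hooks; the commit and the flush are reduced to appending an event.

```python
import threading

class IndexedRegion:
    """Toy stand-in for a region whose index commit overlaps the flush."""

    def __init__(self):
        self.events = []
        self._commit_thread = None
        self._lock = threading.Lock()

    def _commit_index(self):
        with self._lock:
            self.events.append("index committed")

    def pre_flush(self):
        # Fork the index commit so it runs alongside the memstore flush.
        self._commit_thread = threading.Thread(target=self._commit_index)
        self._commit_thread.start()

    def flush_memstore(self):
        with self._lock:
            self.events.append("memstore flushed")

    def post_flush(self):
        # Join: the flush is not reported complete until the commit is done.
        self._commit_thread.join()
        self.events.append("flush complete")

region = IndexedRegion()
region.pre_flush()
region.flush_memstore()
region.post_flush()
```

The commit and the flush may finish in either order, but "flush complete" is always recorded last, so persisted index segments never lag behind persisted storefiles.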

Index Splitting
• Split Lucene indices on region splitting:
– Copy indices from parent dir to daughter dirs
• Problem: Index splitting is time-consuming
– Should not block updates to new daughter regions

Index Splitting (optimized for non update-mode)
• Split indices in a background thread
• Temp directory for new updates
• Merge old and new dirs when copying is done
[Diagram: daughter region indices. The index dir of the parent region is copied into the main index dir (under copying); HBase updates go to the temp index dir for new updates, which will be merged after index copying is done]
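A toy model of the optimized split, with directories represented as dicts. `DaughterIndex` and its methods are hypothetical; a real implementation would copy and merge Lucene index files rather than dict entries.

```python
import threading

class DaughterIndex:
    """Toy model of a daughter region's index during a split."""

    def __init__(self, start_key, end_key):
        self.start_key, self.end_key = start_key, end_key
        self.main = {}       # main index dir (filled by the background copy)
        self.temp = {}       # temp index dir for updates arriving meanwhile
        self.copying = True
        self._lock = threading.Lock()

    def in_range(self, rowkey):
        below_end = self.end_key is None or rowkey < self.end_key
        return self.start_key <= rowkey and below_end

    def add(self, doc):
        # New updates never wait for the copy: they go to the temp dir.
        with self._lock:
            target = self.temp if self.copying else self.main
            target[doc["ID"]] = doc

    def copy_from_parent(self, parent_docs):
        copied = {d["ID"]: d for d in parent_docs if self.in_range(d["ID"])}
        with self._lock:
            self.main.update(copied)
            # Merge the temp dir once copying is done.
            self.main.update(self.temp)
            self.temp.clear()
            self.copying = False

parent = [{"ID": "aa1"}, {"ID": "bb1"}, {"ID": "bb2"}]
daughter = DaughterIndex("bb", None)
daughter.add({"ID": "bb3"})  # arrives while the copy is still pending
worker = threading.Thread(target=daughter.copy_from_parent, args=(parent,))
worker.start()
worker.join()
```

Outside update-mode, new documents never replace copied ones, so the final merge is a pure append. That is presumably why the optimization is limited to non update-mode.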

Index Searching
• Coprocessor endpoint (enhanced with local result combiner)
[Diagram: the index-search client calls each HRegionServer; on each server an endpoint dispatcher & combiner fans the query out to one index searcher per HRegion, and each searcher reads its region's index dir on HDFS]
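The two-level combining can be sketched as a top-k merge. Ranking hits by a numeric score and all function names here are assumptions; the slide only states that results are combined locally on each region server before they reach the client.

```python
import heapq
from itertools import islice

def search_region(hits, k):
    """Per-region searcher: return the k best (score, rowkey) hits, best first."""
    return heapq.nlargest(k, hits)

def combine(result_lists, k):
    """Merge result lists that are already sorted best-first; keep the k best."""
    merged = heapq.merge(*result_lists, reverse=True)
    return list(islice(merged, k))

def search_server(regions, k):
    # Local combiner: merge region results on the server before replying.
    return combine([search_region(r, k) for r in regions], k)

def search_client(servers, k):
    return combine([search_server(s, k) for s in servers], k)

# Hypothetical hits: two region servers, each a list of per-region hit lists.
server1 = [[(0.9, "r1"), (0.2, "r2")], [(0.7, "r3")]]
server2 = [[(0.8, "r4"), (0.1, "r5")]]
top = search_client([server1, server2], 2)
```

With the local combiner, each server returns at most k hits instead of k hits per region, which reduces the data sent back to the client.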

Performance Results
• Total record count: 500,000,000
• Record size: 1KB
• Memstore: 128MB
• 6 nodes: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (24 cores), memory 48G
• ~1s search time out of 10 billion records
Insertion throughput (record / sec):
                            W/O Index   W/ Index
Without pre-split regions   42735       37537
With pre-split regions      63613       55928
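As a worked reading of the throughput figures on this slide, the relative cost of indexing can be computed directly. The helper `index_overhead` is hypothetical, and the pairing of each number with its bar is inferred from the order of the chart labels.

```python
def index_overhead(without_index, with_index):
    """Fraction of insertion throughput lost when indexing is enabled."""
    return (without_index - with_index) / without_index

no_presplit = index_overhead(42735, 37537)  # insertion without pre-split regions
presplit = index_overhead(63613, 55928)     # insertion with pre-split regions
```

Under that reading, indexing costs roughly 12% of insertion throughput in both configurations.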

Editor's Notes

Consistency, locality, high throughput
Define: one or multiple columns to one field. One doc for each record. Additional ID field.
Use ZooKeeper nodes to keep track of index splitting status, so that when opening an indexed region, we know if it was undergoing index splitting before being moved. Parent directory cleanup is performed in HMaster by a chore thread, IndexSplitCleaner. Will need more Lucene capabilities for this optimization to be enabled in update-mode.
