HBaseCon 2013: Full-Text Indexing for Apache HBase

Cloudera, Inc.

Presented by: Maryann Xue, Intel

Full-text indexing for HBase
Mary-Ann Xue
wei.xue@intel.com

Overview
• Why Full-text Indexing for HBase
• Organize Indices
• Index Building
• Index Splitting
• Index Searching
• Performance Results

Why Full-text Indexing for HBase?
• Fast retrieval by more than one column
– HBase: lack of indexing for non-key columns
• Effective search for non-exact matches:
containing, starting-with, ending-with, etc.
– HBase: supports only byte-order indexing; no text
structure awareness

Organize Indices
• One Lucene directory for one HBase region
[Figure: HBase table “t1” with regions t1,,… / t1,aa,… / t1,bb,… / … / t1,xx,…; each region has a matching Lucene index directory on HDFS (Dir: t1,,… / Dir: t1,aa,… / Dir: t1,bb,… / Dir: t1,xx,…)]
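
With one index directory per region, finding the index for a row key reduces to finding the region that owns the key. The following is a minimal sketch of that lookup, not the presenter's code. It assumes regions are identified by sorted start keys; the start keys are the illustrative ones from the figure, while `index_dir_for` and the `/index/...` path layout are hypothetical.

```python
from bisect import bisect_right

# Illustrative region start keys from the figure; "" is the first region's
# empty start key. Plain string order matches byte order for ASCII keys.
REGION_START_KEYS = ["", "aa", "bb", "xx"]


def region_start_key(rowkey: str) -> str:
    """Return the start key of the region whose key range contains rowkey."""
    # bisect_right finds the first start key greater than rowkey;
    # the owning region is the one just before that position.
    return REGION_START_KEYS[bisect_right(REGION_START_KEYS, rowkey) - 1]


def index_dir_for(table: str, rowkey: str) -> str:
    """Hypothetical naming scheme: one index directory per (table, start key)."""
    return f"/index/{table}/{region_start_key(rowkey)}"
```

For example, `index_dir_for("t1", "ab")` resolves to the directory of the region starting at "aa".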

Organize Indices
• One Lucene document for one HBase record
– rowkey → document field “ID”
– indexed column(s) → a user-specified field
[Figure: HBase record r1 (rowkey), v1 (f:c1), v2 (f:c2), v3 (f:c3), v4 (f:c4) mapped to a Lucene document with fields ID, field_a, field_b]
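
The record-to-document mapping can be sketched as below. This is an illustration under assumptions: a plain dict stands in for a Lucene document, and the `INDEX_FIELDS` definition (which columns feed `field_a` and `field_b`) is hypothetical, since the slide does not say which columns map to which field.

```python
from typing import Dict, List

# Hypothetical index definition: Lucene field name -> HBase columns feeding it.
INDEX_FIELDS: Dict[str, List[str]] = {
    "field_a": ["f:c1", "f:c2"],
    "field_b": ["f:c4"],
}


def to_document(rowkey: str, columns: Dict[str, str]) -> Dict[str, str]:
    """Build a dict standing in for the Lucene document of one HBase record."""
    doc = {"ID": rowkey}  # the rowkey always goes into the "ID" field
    for field, sources in INDEX_FIELDS.items():
        values = [columns[c] for c in sources if c in columns]
        if values:  # skip fields whose source columns are all absent
            doc[field] = " ".join(values)
    return doc
```

Columns outside the index definition (here f:c3) do not appear in the document at all.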

Index Building
• Implemented with HBase Coprocessors (Region Observers)
• The main hooks are:
– Updates (Put / Delete)
– WAL restore
– Region open/close
– Memstore flush
– Region splitting

Index Building: updates and WAL restore
• In prePut(), postDelete() or preWALRestore(), a generated Lucene document is added, updated or deleted accordingly.
• Flow for a put request on “row1”:
– Update-mode && missing columns?
– Y: look up HBase table for those missing columns of “row1”, then compose Lucene document with new & existing values
– N: compose Lucene document ignoring missing columns
– Add Lucene document
– Add record into memstore
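
The put path can be modeled roughly as follows. This is a sketch under assumptions rather than the actual coprocessor: dicts stand in for the HBase table and the Lucene index, and `INDEXED_COLUMNS` and `on_put` are hypothetical names.

```python
from typing import Dict, Set

INDEXED_COLUMNS: Set[str] = {"f:c1", "f:c2"}  # hypothetical index definition


def on_put(table: Dict[str, Dict[str, str]], index: Dict[str, Dict[str, str]],
           rowkey: str, put: Dict[str, str], update_mode: bool) -> None:
    """Model of the put path: index the row, then apply the put to the table."""
    values = {c: v for c, v in put.items() if c in INDEXED_COLUMNS}
    missing = INDEXED_COLUMNS - set(values)
    if update_mode and missing:
        # Look up the table for indexed columns this put does not carry.
        existing = table.get(rowkey, {})
        for column in missing:
            if column in existing:
                values[column] = existing[column]
    # In non update-mode the document is composed from the put alone.
    index[rowkey] = {"ID": rowkey, **values}  # add or replace the document
    table.setdefault(rowkey, {}).update(put)  # stands in for the memstore write
```

The extra table read happens only in update-mode, which is why that mode costs more per put.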

Index Building: updates and WAL restore
• For deletes, always look up the HBase table to see if all columns for index building have been deleted.
• Flow for a delete request on “row1”:
– Add delete mark into memstore
– Look up HBase table for any index column of “row1”
– All columns for index deleted?
– Y: delete Lucene documents having field “ID” as “row1”
– N: compose Lucene document with existing and deleted values, then update Lucene document
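
A comparable sketch for the delete path, again with dicts standing in for the table and the index, and with hypothetical names (`INDEXED_COLUMNS`, `on_delete`). It assumes the delete has already been applied to the table when the lookup runs, which matches the use of a post-delete hook.

```python
from typing import Dict, Iterable, Set

INDEXED_COLUMNS: Set[str] = {"f:c1", "f:c2"}  # hypothetical index definition


def on_delete(table: Dict[str, Dict[str, str]], index: Dict[str, Dict[str, str]],
              rowkey: str, deleted_columns: Iterable[str]) -> None:
    """Model of the delete path: apply the delete, then fix up the index."""
    # Apply the delete to the table first (stands in for the delete marker).
    row = table.get(rowkey, {})
    for column in deleted_columns:
        row.pop(column, None)
    # Always look up the table for indexed columns that still have values.
    remaining = {c: v for c, v in row.items() if c in INDEXED_COLUMNS}
    if not remaining:
        index.pop(rowkey, None)  # delete documents whose "ID" is rowkey
    else:
        index[rowkey] = {"ID": rowkey, **remaining}  # update the document
```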

Index Building: memstore flush
• Fork a thread to do the Lucene index commit as memstore flush starts; join this thread in postFlush()
[Figure: memstore flush starts; flush memstore to storefiles and commit Lucene index segments; memstore flush completes]
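
The fork/join pattern around the flush can be sketched like this. It is a generic threading illustration, not the coprocessor itself: `FlushCoordinator`, `pre_flush` and `post_flush` are hypothetical names modeled on the hooks, and `commit_index` is whatever callable commits the index segments.

```python
import threading
from typing import Callable, Optional


class FlushCoordinator:
    """Runs the index commit concurrently with a memstore flush."""

    def __init__(self, commit_index: Callable[[], None]) -> None:
        self._commit_index = commit_index
        self._thread: Optional[threading.Thread] = None

    def pre_flush(self) -> None:
        # Fork: start committing index segments as the flush begins.
        self._thread = threading.Thread(target=self._commit_index)
        self._thread.start()

    def post_flush(self) -> None:
        # Join: the flush is not reported complete until the commit is done.
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

Joining in the post-flush step means the index commit and the storefile write overlap instead of running back to back.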

Index Splitting
• Split Lucene indices on region splitting:
– Copy indices from parent dir to daughter dirs
• Problem: Index splitting is time-consuming
– Should not block updates to new daughter regions

Index Splitting (optimized for non update-mode)
• Split indices in a background thread
• Temp directory for new updates
• Merge old and new dirs when copying is done
[Figure: daughter region indices. The main index dir is under copying from the index dir of the parent region; HBase updates go to a temp index dir for new updates, which will be merged after index copying is done]
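
The optimized split can be modeled as below. This is a sketch under stated assumptions: dicts stand in for index directories, `DaughterIndex` is a hypothetical name, and on merge a document from the temp dir is assumed to replace a copied document with the same ID (the slides do not say how such conflicts are resolved).

```python
import threading
from typing import Dict, Optional

Doc = Dict[str, str]


class DaughterIndex:
    """Index of a daughter region that accepts updates while being split."""

    def __init__(self, parent: Dict[str, Doc], start: str, end: Optional[str]) -> None:
        self.main: Dict[str, Doc] = {}  # main index dir (under copying)
        self.temp: Dict[str, Doc] = {}  # temp index dir for new updates
        self._lock = threading.Lock()
        self._copying = True
        self._worker = threading.Thread(target=self._split, args=(parent, start, end))
        self._worker.start()

    def _split(self, parent: Dict[str, Doc], start: str, end: Optional[str]) -> None:
        # Copy the parent's documents that fall in this daughter's key range.
        for rowkey, doc in parent.items():
            if rowkey >= start and (end is None or rowkey < end):
                self.main[rowkey] = doc
        with self._lock:
            # Merge: assumed here that newer temp documents win on the same ID.
            self.main.update(self.temp)
            self.temp.clear()
            self._copying = False

    def add(self, doc: Doc) -> None:
        # Updates do not wait for the copy; they go to temp while it runs.
        with self._lock:
            target = self.temp if self._copying else self.main
            target[doc["ID"]] = doc

    def wait(self) -> None:
        self._worker.join()
```

An update arriving during the copy lands in `temp` and is folded into `main` once the copy finishes; an update arriving afterwards goes straight to `main`.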

Index Searching
• Coprocessor endpoint (enhanced with local result combiner)
[Figure: an index-search client calls the endpoint dispatcher & combiner on each HRegionServer; each HRegion has its own index searcher reading its index dir on HDFS]
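
The benefit of a local combiner can be illustrated with a two-level top-k merge. This is a sketch, assuming search results are ranked hits and that the combiner keeps the k best per region server; the slide only states that results are combined locally, and `combine` and `search` are hypothetical names.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, rowkey)


def combine(results: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge several hit lists into the k highest-scoring hits."""
    return heapq.nlargest(k, (hit for hits in results for hit in hits))


def search(servers: List[List[List[Hit]]], k: int) -> List[Hit]:
    """servers[i][j] holds the hits of region j on region server i."""
    # Each region server combines the hits of its own regions locally...
    per_server = [combine(regions, k) for regions in servers]
    # ...so the client merges at most k hits per server, not per region.
    return combine(per_server, k)
```

Combining on the server bounds what each server returns to the client at k hits, however many regions it hosts.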

Performance Results
• Total record count: 500,000,000
• Record size: 1KB
• Memstore: 128MB
• 6 nodes: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (24 cores), memory 48G
• ~1s search time out of 10 billion records
Insertion throughput (record / sec):
• Without pre-split regions: 42735 W/O Index, 37537 W/ Index
• With pre-split regions: 63613 W/O Index, 55928 W/ Index
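
The indexing overhead implied by these figures can be worked out directly. The sketch assumes that in each pair the first number is the throughput without the index and the second is with it, following the order of the chart labels; `overhead` is a hypothetical helper.

```python
def overhead(without_index: float, with_index: float) -> float:
    """Relative throughput loss caused by indexing, as a fraction."""
    return (without_index - with_index) / without_index


# Throughput in records per second, as reported on the slide.
no_presplit = overhead(42735, 37537)
presplit = overhead(63613, 55928)
print(f"{no_presplit:.1%} {presplit:.1%}")  # prints "12.2% 12.1%"
```

Under that reading, indexing costs about 12% of insertion throughput in both configurations.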

Editor's Notes

Consistency, locality, high throughput
Define: one or multiple columns to one field
One doc for each record
Additional ID field
Use ZooKeeper nodes to track index splitting status, so that when opening an indexed region we know whether it was undergoing index splitting before being moved.
Parent directory cleanup is performed in HMaster by a chore thread, IndexSplitCleaner.
More Lucene capabilities will be needed before this optimization can be enabled in update-mode.
