Local Secondary Indexes in Apache Phoenix
Rajeshbabu Chintaguntla
Apache HBase Committer at The Apache Software Foundation
Deep dive of local indexes in Apache Phoenix
1.
Local Secondary Indexes in Apache Phoenix
Rajeshbabu Chintaguntla, PhoenixCon 2017
2.
© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
- Local indexes introduction
- Local index design and data model
- Local index writes and reads
- Performance results
- Helpful tips and recommendations
3.
Secondary indexes in Phoenix
- The primary key columns of a Phoenix table form the HBase row key, which acts as the primary index, so filtering on primary key columns becomes a point lookup or range scan on the table.
- Filtering on a non-primary-key column turns the query into a full table scan, which consumes a lot of time and resources.
- Secondary indexes create alternative access paths that convert such queries into point lookups or range scans.
- Phoenix supports two kinds of indexes: GLOBAL and LOCAL.
- Phoenix supports functional indexes as well.
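The difference between a full scan and an indexed range scan can be sketched in a few lines of Python. This is a toy model, not Phoenix code: the `rows` dictionary, `full_scan` and `index_range_scan` are hypothetical names, and the index is simply a sorted list of (indexed value, primary key) pairs.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy table: primary key (order id) -> customer id.
rows = {100: 11, 101: 23, 102: 11, 103: 34, 104: 55, 105: 11, 106: 29}

def full_scan(customer_id):
    """Without an index: visit every row and filter."""
    return sorted(pk for pk, cust in rows.items() if cust == customer_id)

# Secondary index: (indexed value, primary key) pairs kept in sorted order.
index = sorted((cust, pk) for pk, cust in rows.items())

def index_range_scan(customer_id):
    """With an index: binary-search straight to the matching key range."""
    lo = bisect_left(index, (customer_id,))
    hi = bisect_right(index, (customer_id, float("inf")))
    return [pk for _, pk in index[lo:hi]]
```

Both functions return the same rows; the second touches only the slice of the index that matches, which is the effect a secondary index is meant to have.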
4.
Local Secondary Indexes - Introduction
- A local secondary index is LOCAL in the sense that each REGION of a table is treated as a unit: an index is created and maintained over that region's own data.
- The local index data is stored and maintained in shadow column families in the same table.
- The index therefore always resides on the same server that serves the actual data.
- Faster index building.
- Syntax: CREATE LOCAL INDEX <index name> ON <table name> (<columns>) [INCLUDE (<columns>)]
5.
Local Secondary Index - Introduction

CREATE TABLE IF NOT EXISTS ORDERS(
    ORDER_ID BIGINT NOT NULL PRIMARY KEY,
    CUSTOMER_ID BIGINT NOT NULL,
    ITEM_ID INTEGER NOT NULL,
    DATE DATE NOT NULL);
CREATE LOCAL INDEX IDX ON ORDERS(DATE)

BASE TABLE DATA (ORDER ID IS PRIMARY KEY)

Region [100,104)
Order Id   Customer ID   Item ID   Date
100        11            1111      06/10/2017
101        23            1231      06/01/2017
102        11            1332      05/31/2017
103        34            3221      06/01/2017

Region [104,107)
Order Id   Customer ID   Item ID   Date
104        55            1343      05/28/2017
105        11            2312      06/01/2017
106        29            1234      05/15/2017

INDEX ROW KEY

Index of Region [100,104)
REGION START KEY   IDX ID   DATE         Order ID
100                1        05/31/2017   102
100                1        06/01/2017   101
100                1        06/01/2017   103
100                1        06/10/2017   100

Index of Region [104,107)
REGION START KEY   IDX ID   DATE         Order ID
104                1        05/15/2017   106
104                1        05/28/2017   104
104                1        06/01/2017   105
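The index rows shown on this slide can be reproduced with a short Python sketch. It models each index row key as a tuple (region start key, index id, date, order id) and sorts the tuples, which is only an approximation of how HBase orders row keys; `build_local_index` and `orders_on` are hypothetical helper names.

```python
from datetime import date

# Base rows from the slide: order id -> order date (other columns omitted).
orders = {
    100: date(2017, 6, 10), 101: date(2017, 6, 1), 102: date(2017, 5, 31),
    103: date(2017, 6, 1), 104: date(2017, 5, 28), 105: date(2017, 6, 1),
    106: date(2017, 5, 15),
}
# Region boundaries from the slide, as [start, end) ranges over ORDER_ID.
regions = [(100, 104), (104, 107)]
INDEX_ID = 1

def build_local_index(rows, regions, index_id):
    """Return index row keys (region start, index id, date, order id), sorted."""
    keys = []
    for start, end in regions:
        for order_id, order_date in rows.items():
            if start <= order_id < end:
                keys.append((start, index_id, order_date, order_id))
    return sorted(keys)

local_index = build_local_index(orders, regions, INDEX_ID)

def orders_on(day):
    """Check every region's index slice, since any region may hold matches."""
    return [order_id for _, _, d, order_id in local_index if d == day]
```

Because the region start key leads the row key, the entries for 06/01/2017 are not contiguous across the whole index: they appear once inside each region's slice, so a lookup has to consult every region.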
6.
Data Model
Shadow column families to store the index data

CREATE TABLE IF NOT EXISTS WEB_STAT (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    STATS.ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN));

1) CREATE LOCAL INDEX IDX ON WEB_STAT(DATE)
2) CREATE LOCAL INDEX IDX2 ON WEB_STAT(STATS.ACTIVE_VISITOR) INCLUDE(DATE)
3) CREATE LOCAL INDEX IDX3 ON WEB_STAT(DATE) INCLUDE(STATS.ACTIVE_VISITOR)

[Diagram: the table's Region1 and Region2, showing the data column families 0 and STATS together with the shadow column families L#0 and L#STATS]
7.
Data Model
Local index row key format:
REGION START KEY | SALT NUMBER (empty for a non-salted table) | INDEX ID | TENANT_ID (empty for a non-multi-tenant table) | INDEXED COLUMN VALUE[S] | PRIMARY KEY COLUMN VALUE[S]
- REGION START KEY: Start key of the data region. For the first region it is an empty byte array with the length of the region end key. This helps to index the data region by region.
- SALT NUMBER: A byte value representing the salt bucket number calculated for the index row key.
- INDEX ID: A short number identifying the local index. This keeps the data of each index stored together.
- TENANT_ID: Tenant column value of the row key. It is empty if the table is not multi-tenant.
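The row key layout above can be sketched as a byte concatenation. This is an illustration of the field order only, not Phoenix's actual encoder: `region_prefix` and `local_index_row_key` are hypothetical names, and packing the index id as a big-endian 2-byte short and joining values with a zero-byte separator are assumptions made for the example.

```python
import struct

def region_prefix(start_key: bytes, end_key: bytes) -> bytes:
    """Prefix for a region's index rows. The first region has no start key,
    so zero bytes of the end key's length are used, as the slide describes."""
    return start_key if start_key else bytes(len(end_key))

def local_index_row_key(prefix, index_id, indexed_values, pk_values,
                        salt=b"", tenant_id=b""):
    """Concatenate the fields in the order given by the slide."""
    # Assumption: index id as a big-endian 2-byte short; salt and tenant id
    # are left empty for non-salted, non-multi-tenant tables.
    head = [prefix, salt, struct.pack(">h", index_id), tenant_id]
    # Assumption: variable-length values joined with a zero-byte separator.
    body = b"\x00".join(list(indexed_values) + list(pk_values))
    return b"".join(head) + body
```

For example, `local_index_row_key(b"F", 1, [b"findme"], [b"G1"])` yields a key that starts with the region start key `F`, so all index rows of a region sort inside that region's key range, and rows of the same index sort next to each other.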
8.
Write path
[Diagram: CLIENT; Region Server hosting a Region with Data cf and Index cf MemStores; WAL; index updates and data updates flowing into the region]
1. Write request
2. Batch call
(unnumbered in the diagram) Prepare index updates
4. Merge data and index updates
5. Write to MemStores
6. Write to WAL
100% ATOMIC and CONSISTENT local index updates with data updates
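A minimal Python sketch of the idea behind this write path: because the data cell and the index cell belong to the same region, they can be merged into one batch and recorded together. The `Region` class and its fields are hypothetical stand-ins, the step order follows the slide's numbering, and the sketch ignores details such as removing stale index entries when a row is updated.

```python
class Region:
    """Toy region holding a data column family and a shadow index family."""

    def __init__(self, start_key):
        self.start_key = start_key
        self.memstore = {"0": {}, "L#0": {}}  # column family -> cells
        self.wal = []                         # each entry is one batch

    def upsert(self, pk, indexed_value, index_id=1):
        # Prepare the index update on the server, next to the data update.
        data_update = ("0", pk, indexed_value)
        index_key = (self.start_key, index_id, indexed_value, pk)
        index_update = ("L#0", index_key, b"")
        # Merge both into a single batch so they are applied together.
        batch = [data_update, index_update]
        for family, key, value in batch:
            self.memstore[family][key] = value
        # One WAL entry covers both the data and the index mutation.
        self.wal.append(batch)
        return batch
```

A single WAL entry carrying both mutations is what makes the index update atomic with the data update in this model: there is no second server or second log that could fall out of step.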
9.
Read Path
SELECT COUNT(*) FROM T WHERE INDEXED_COL='findme'
[Diagram: Client; two Regionservers hosting Region ['',F), Region [F,L), Region [L,R) and Region [R,''), each with column families 0 and L#0; numbers shown: 2, 1, 0, 5]
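Since each region indexes only its own rows, a query on the indexed column has to consult every region. The sketch below models that fan-out in Python. It reads the numbers 2, 1, 0 and 5 on the slide as per-region partial counts, which is an interpretation of the diagram rather than something the slide states; the `regions` contents and `count_matching` are hypothetical.

```python
# Hypothetical per-region local index contents: indexed value -> row count.
regions = {
    ("", "F"): {"findme": 2},
    ("F", "L"): {"findme": 1, "other": 4},
    ("L", "R"): {"other": 3},
    ("R", ""): {"findme": 5},
}

def count_matching(regions, value):
    """Scan every region's local index and sum the partial counts."""
    partial = [index.get(value, 0) for index in regions.values()]
    return partial, sum(partial)
```

Calling `count_matching(regions, "findme")` visits all four regions, including the one that holds no match, and the client-side sum gives the final COUNT(*).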
10.
Read Path
SELECT INDEX_COL, NON_INDEX_COL FROM T WHERE INDEX_COL='findme'
Joining back missing columns from the data table
[Diagram: CLIENT and a Region with Index cf and Data cf MemStores]
1. SCAN, L#0, FILTER
2. Apply filter on index col
3. Get non-index cols on matching rows
4. Merge with index cols
5. Return combined results to client
6. Results
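The join-back steps can be sketched in Python as follows. `IndexedRegion` is a hypothetical toy class: it keeps the data rows and a sorted local index side by side, and counts how many get calls against the data column family a scan needs.

```python
class IndexedRegion:
    """Toy region with data rows and a local index on one column."""

    def __init__(self, data_rows, index_col):
        self.data = data_rows          # primary key -> {column: value}
        self.index_col = index_col
        # Local index: sorted (indexed value, primary key) pairs.
        self.index = sorted((row[index_col], pk) for pk, row in data_rows.items())
        self.gets = 0                  # get calls made against the data rows

    def scan(self, value, select_cols):
        results = []
        for indexed_value, pk in self.index:
            if indexed_value != value:         # apply filter on the index column
                continue
            row = {self.index_col: indexed_value}
            missing = [c for c in select_cols if c not in row]
            if missing:
                self.gets += 1                 # one get per matching row
                for col in missing:            # merge with the index columns
                    row[col] = self.data[pk][col]
            results.append(row)
        return results                         # combined results for the client
```

Selecting only the indexed column needs no gets at all, while selecting a non-indexed column costs one get per matching row. The gets stay inside the region, since index and data live in the same place.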
11.
Region Splits and Merges
- Since the indexes are also stored in the same table, splits and merges are taken care of by HBase automatically.
- A special mechanism separates the HFile into the child regions after a split: we scan through each key value, find the data row key in it, and write it to the corresponding child region.
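A rough Python sketch of that separation step, assuming index entries are modeled as (region start key, index id, indexed value, data row key) tuples. `split_index_entries` is a hypothetical helper, and rewriting the prefix of the upper child's entries to the split key is an assumption that follows from the row key format, not a detail taken from the slide.

```python
def split_index_entries(entries, split_key):
    """Route each index entry to a child region by the data row key it embeds."""
    lower, upper = [], []
    for region_start, index_id, indexed_value, data_row_key in entries:
        if data_row_key < split_key:
            # Stays in the lower child, which keeps the parent's start key.
            lower.append((region_start, index_id, indexed_value, data_row_key))
        else:
            # Assumption: the upper child's entries take the split key as prefix.
            upper.append((split_key, index_id, indexed_value, data_row_key))
    return lower, upper
```

The index row key alone does not say which child an entry belongs to, because it is ordered by indexed value first; the decision has to be made on the data row key at the end of the entry, which is why every key value is scanned.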
12.
Performance Results
- 4-node cluster.
- Tested with 5 local indexes on a base table of 25 columns with 10 regions.
- Ingested 50M rows.
- 3x faster upsert time compared to global indexes.
- 5x less network RX/TX utilization during writes compared to global indexes.
- Similar read performance compared to global indexes for queries such as aggregations, GROUP BY, LIMIT, etc.
13.
Performance results
[Chart: Write performance]
14.
Performance results
[Chart: Network Tx/Rx during write]
15.
Performance results
[Chart: Network Tx/Rx during write]
16.
Performance results
[Chart: Network Tx/Rx during write]
17.
Helpful Tips
Mutable vs. immutable rows table?
- Writes are much faster with local indexes on an immutable rows table than on a mutable one. So if a row is written once and never updated, it is better to create the table with the IMMUTABLE_ROWS property.
Online vs. offline index population?
- When a table has pre-existing data, index population time varies with the data size.
- Index population usually happens on the server, by reading the data table and writing the index to the same table. This normally works very fast. But if the data size is too big, it is better to use ASYNC population with IndexTool.
Covered index vs. non-covered index?
- When a query accesses columns that are not in the index, Phoenix joins the missing columns from the data table itself using get calls. If the number of matching rows is high, it is better to create a covered index to avoid the get calls.
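Whether a query needs get calls against the data table comes down to a set difference, sketched here in Python. `uncovered_columns` is a hypothetical helper, not a Phoenix API; it treats primary key columns as available because they are part of the local index row key.

```python
def uncovered_columns(query_cols, indexed_cols, included_cols, pk_cols):
    """Columns the query needs that the index cannot supply by itself."""
    available = set(indexed_cols) | set(included_cols) | set(pk_cols)
    return sorted(set(query_cols) - available)
```

With the WEB_STAT examples from the data model slide, a query on DATE and ACTIVE_VISITOR leaves ACTIVE_VISITOR uncovered under an index on DATE alone, and nothing uncovered once ACTIVE_VISITOR is listed in INCLUDE. An empty result means the index is covering for that query.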
18.
Thank You
Q & A?
rajeshbabu@apache.org
@rajeshhcu32