Local Secondary Indexes in Apache Phoenix

Rajeshbabu Chintaguntla
Apache HBase Committer at The Apache Software Foundation
PhoenixCon 2017
© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Local Indexes Introduction
Local indexes design and data model
Local index writes and reads
Performance Results
Helpful Tips or recommendations
Secondary indexes in Phoenix
• The primary key columns of a Phoenix table form the HBase row key, which acts
as the primary index, so filtering on primary key columns becomes a point
lookup or range scan on the table.
• Filtering on a non-primary-key column turns the query into a full table scan,
which consumes a lot of time and resources.
• With secondary indexes, we can create alternative access paths that convert
such queries into point lookups or range scans.
• Phoenix supports two kinds of indexes: GLOBAL and LOCAL.
• Phoenix supports functional indexes as well.
Local Secondary Indexes - Introduction
• A local secondary index is LOCAL in the sense that each REGION of a table is
treated as a unit, and an index of that region's data is created and maintained
within it.
• The local index data is stored and maintained in shadow column family(ies) in
the same table.
• So the index always resides on the same server that serves the actual data.
• Faster index building.
• Syntax: CREATE LOCAL INDEX <index_name> ON <table_name> (<columns>) [INCLUDE (<columns>)]
Local Secondary Index - Introduction

CREATE TABLE IF NOT EXISTS ORDERS(
    ORDER_ID BIGINT NOT NULL PRIMARY KEY,
    CUSTOMER_ID BIGINT NOT NULL,
    ITEM_ID INTEGER NOT NULL,
    DATE DATE NOT NULL);
CREATE LOCAL INDEX IDX ON ORDERS(DATE);

Base table data (ORDER ID is the primary key), Region [100,104):

Order ID   Customer ID   Item ID   Date
100        11            1111      06/10/2017
101        23            1231      06/01/2017
102        11            1332      05/31/2017
103        34            3221      06/01/2017

Index of Region [100,104) (index row key):

Region start key   IDX ID   Date         Order ID
100                1        05/31/2017   102
100                1        06/01/2017   101
100                1        06/01/2017   103
100                1        06/10/2017   100

Base table data, Region [104,107):

Order ID   Customer ID   Item ID   Date
104        55            1343      05/28/2017
105        11            2312      06/01/2017
106        29            1234      05/15/2017

Index of Region [104,107) (index row key):

Region start key   IDX ID   Date         Order ID
104                1        05/15/2017   106
104                1        05/28/2017   104
104                1        06/01/2017   105
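The ORDERS example can be reproduced with a small in-memory model. This is only a sketch of the idea, not Phoenix code: the function name `build_local_index`, the tuple layout, and the use of Python tuples in place of encoded row keys are all assumptions made for illustration.

```python
from datetime import date

# Rows from the slide: (order_id, customer_id, item_id, order_date)
ORDERS = [
    (100, 11, 1111, date(2017, 6, 10)),
    (101, 23, 1231, date(2017, 6, 1)),
    (102, 11, 1332, date(2017, 5, 31)),
    (103, 34, 3221, date(2017, 6, 1)),
    (104, 55, 1343, date(2017, 5, 28)),
    (105, 11, 2312, date(2017, 6, 1)),
    (106, 29, 1234, date(2017, 5, 15)),
]

# Region boundaries as [start, end) on ORDER_ID, as in the slide.
REGIONS = [(100, 104), (104, 107)]
INDEX_ID = 1


def build_local_index(rows, regions, index_id=INDEX_ID):
    """Return {region: sorted index keys} (hypothetical helper).

    Each index key is (region_start_key, index_id, indexed_date, order_id),
    so every region only indexes the rows it holds itself.
    """
    index = {}
    for start, end in regions:
        keys = [
            (start, index_id, row[3], row[0])
            for row in rows
            if start <= row[0] < end
        ]
        index[(start, end)] = sorted(keys)
    return index


local_index = build_local_index(ORDERS, REGIONS)
for region, keys in local_index.items():
    print(region, [(k[2].isoformat(), k[3]) for k in keys])
```

Sorting the tuples orders each region's index rows by date and then by order ID, which is the order shown in the two index tables.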
Data Model
Shadow column families to store the index data

CREATE TABLE IF NOT EXISTS WEB_STAT (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    STATS.ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN));

1) CREATE LOCAL INDEX IDX ON WEB_STAT(DATE)
2) CREATE LOCAL INDEX IDX2 ON WEB_STAT(STATS.ACTIVE_VISITOR) INCLUDE(DATE)
3) CREATE LOCAL INDEX IDX3 ON WEB_STAT(DATE) INCLUDE(STATS.ACTIVE_VISITOR)

Column families in each region of the table (Region1 and Region2):
First diagram:  0, STATS, L#0
Second diagram: 0, STATS, L#0, L#STATS
Data Model
Local index row key format:

REGION START KEY | SALT NUMBER (empty for non-salted table) | INDEX ID | TENANT_ID (empty for non-multi-tenant table) | INDEXED COLUMN VALUE[S] | PRIMARY KEY COLUMN VALUE[S]

• REGION START KEY: Start key of the data region. For the first region it is an
empty byte array of the same length as the region end key. This keeps the index
data grouped region by region.
• SALT NUMBER: A byte value representing the salt bucket number calculated for
the index row key.
• INDEX ID: A short number identifying the local index. This keeps the data of
each index stored together.
• TENANT_ID: Tenant column value of the row key. It is empty if the table is
not multi-tenant.
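The row key layout can be sketched as a simple byte concatenation. This is a toy illustration under stated assumptions: the function `local_index_row_key`, the 2-byte big-endian encoding of the index ID, and the 0x00 separator between column values are hypothetical and do not reproduce Phoenix's actual byte encoding.

```python
import struct


def local_index_row_key(region_start_key, region_end_key, index_id,
                        indexed_values, pk_values,
                        salt_byte=None, tenant_id=b""):
    """Compose a toy local index row key in the field order of the slide."""
    # First region: the empty start key becomes zero bytes of end-key length.
    prefix = region_start_key or bytes(len(region_end_key))
    # Salt byte is present only for salted tables.
    salt = b"" if salt_byte is None else bytes([salt_byte])
    # Assumption: the index ID is written as a 2-byte big-endian short.
    idx = struct.pack(">h", index_id)
    # Assumption: column values are joined with a 0x00 separator.
    body = b"\x00".join(list(indexed_values) + list(pk_values))
    return prefix + salt + idx + tenant_id + body


first_region_key = local_index_row_key(b"", b"F", 1, [b"findme"], [b"A1"])
second_region_key = local_index_row_key(b"F", b"L", 1, [b"findme"], [b"G7"])
print(first_region_key)
print(second_region_key)
```

Because the region start key comes first, all index rows of one region sort inside that region's key range, and within it the index ID keeps the rows of each index together.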
Write path
Participants: CLIENT, Region Server, Region (Data cf and Index cf, each with its
own MemStore), WAL.
1. Write request (from the client)
2. Batch call
3. Prepare index updates
4. Merge data and index updates
5. Write to MemStores
6. Write to WAL
Local index updates are 100% ATOMIC and CONSISTENT with data updates.
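The write steps can be modelled with a toy region object. This is a minimal sketch, not HBase or Phoenix code: the `Region` class, the tuple-based mutations, and the single WAL entry per batch are assumptions chosen to show why data and index updates succeed or fail together when they live in the same region.

```python
class Region:
    """Toy region with a data column family and a shadow index family."""

    def __init__(self, start_key):
        self.start_key = start_key
        # One MemStore per column family: "0" for data, "L#0" for the index.
        self.memstores = {"0": {}, "L#0": {}}
        self.wal = []

    def _index_key(self, value, row_key):
        return (self.start_key, value, row_key)

    def upsert(self, row_key, value):
        # Prepare index updates: remove the stale index row, add the new one.
        index_updates = []
        old = self.memstores["0"].get(row_key)
        if old is not None:
            index_updates.append(
                ("delete", "L#0", self._index_key(old, row_key), None))
        index_updates.append(
            ("put", "L#0", self._index_key(value, row_key), b""))
        # Merge data and index updates into one batch.
        batch = [("put", "0", row_key, value)] + index_updates
        # Write the whole batch to the MemStores.
        for op, family, key, val in batch:
            if op == "put":
                self.memstores[family][key] = val
            else:
                self.memstores[family].pop(key, None)
        # Record the batch as a single WAL entry (order follows the slide).
        self.wal.append(batch)


region = Region(start_key=100)
region.upsert(101, "06/01/2017")
region.upsert(101, "06/02/2017")  # update: the old index row is replaced
print(region.memstores)
```

Since the data mutation and its index mutations travel in one batch to one region, there is no window in which one is visible without the other in this model.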
Read Path
SELECT COUNT(*) FROM T WHERE INDEXED_COL='findme'
Diagram: the Client queries four regions spread over two region servers:
Region ['',F), Region [F,L), Region [L,R) and Region [R,''). Each region holds
column families 0 and L#0. The values 2, 1, 0 and 5 are shown alongside the
regions.
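The fan-out in this read path can be sketched as follows, assuming that the values 2, 1, 0 and 5 in the diagram are per-region partial counts. The data, the `REGION_INDEXES` mapping, and the `count_matches` helper are hypothetical; a real scan runs on the region servers rather than in the client.

```python
from bisect import bisect_left

# Toy shadow family L#0 per region: sorted (indexed_value, row_key) pairs.
REGION_INDEXES = {
    "['',F)": sorted([("findme", "A1"), ("findme", "C3"), ("other", "B2")]),
    "[F,L)": sorted([("findme", "G7"), ("zzz", "H1")]),
    "[L,R)": sorted([("other", "M4")]),
    "[R,'')": sorted(
        [("findme", k) for k in ("R1", "S2", "T3", "U4", "V5")]
        + [("abc", "W9")]
    ),
}


def count_matches(index_rows, value):
    """Range scan over one region's sorted index rows for an exact value."""
    lo = bisect_left(index_rows, (value,))
    # value + "\x00" is the smallest string greater than value.
    hi = bisect_left(index_rows, (value + "\x00",))
    return hi - lo


# Every region must be consulted, because each one indexes only its own rows.
partial_counts = [count_matches(rows, "findme")
                  for rows in REGION_INDEXES.values()]
total = sum(partial_counts)
print(partial_counts, total)
```

The query touches every region, but inside each region it is a short range scan over the index rows instead of a scan over all data rows.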
Read Path
SELECT INDEX_COL, NON_INDEX_COL FROM T WHERE INDEX_COL='findme'
Joining back missing columns from the data table (inside a Region holding the
Index cf and the Data cf, each with its own MemStore):
1. The CLIENT sends SCAN, L#0, FILTER
2. Apply filter on index col
3. Get non index cols on matching rows
4. Merge with index cols
5. Return combined results to client
6. Results
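The join-back steps can be sketched with two in-memory structures standing in for the index and data column families. This is an illustrative model only: `scan_with_join_back`, the dictionary-based data family, and the returned get counter are assumptions, not Phoenix internals.

```python
def scan_with_join_back(index_cf, data_cf, wanted_value, missing_columns):
    """Filter the index rows, then fetch the missing columns from the data rows.

    index_cf: iterable of (indexed_value, row_key) pairs (toy shadow family)
    data_cf:  dict mapping row_key -> {column_name: value} (toy data family)
    Returns (results, number_of_gets).
    """
    results = []
    gets = 0
    for indexed_value, row_key in sorted(index_cf):
        # Apply the filter on the indexed column.
        if indexed_value != wanted_value:
            continue
        # Get the non-indexed columns of the matching row from the same region.
        data_row = data_cf[row_key]
        gets += 1
        # Merge them with the indexed column.
        merged = {"INDEX_COL": indexed_value}
        merged.update({col: data_row[col] for col in missing_columns})
        results.append(merged)
    # Return the combined results to the caller.
    return results, gets


index_cf = [("findme", "r1"), ("other", "r2"), ("findme", "r3")]
data_cf = {
    "r1": {"NON_INDEX_COL": "a"},
    "r2": {"NON_INDEX_COL": "b"},
    "r3": {"NON_INDEX_COL": "c"},
}
rows, get_calls = scan_with_join_back(
    index_cf, data_cf, "findme", ["NON_INDEX_COL"])
print(rows, get_calls)
```

The number of gets grows with the number of matching rows, which is the cost a covered index avoids by storing the extra columns alongside the index rows.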
Region Splits and Merges
• Since the indexes are also stored in the same table, splits and merges are
taken care of by HBase automatically.
• We have a special mechanism to separate the HFiles into the child regions
after a split: we scan through each key value, find the data row key from it,
and write it to the corresponding child region.
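The split handling can be illustrated with a toy routine that routes each index row by the data row key it contains. The function `split_index_rows`, the tuple layout, and the choice to re-prefix rows with the child region's start key are assumptions made for this sketch, not a description of the actual Phoenix implementation.

```python
def split_index_rows(index_rows, parent_start, split_key):
    """Distribute a parent region's index rows over its two child regions.

    index_rows: (region_start_key, index_id, indexed_value, data_row_key)
    The data row key embedded in each index row decides the child region.
    """
    left, right = [], []
    for _, index_id, value, data_key in index_rows:
        if data_key < split_key:
            # Child [parent_start, split_key) keeps the parent's start key.
            left.append((parent_start, index_id, value, data_key))
        else:
            # Child [split_key, parent_end) starts at the split key.
            right.append((split_key, index_id, value, data_key))
    return sorted(left), sorted(right)


# Index rows of Region [100,104) from the ORDERS example, with ISO dates.
parent_rows = [
    (100, 1, "2017-05-31", 102),
    (100, 1, "2017-06-01", 101),
    (100, 1, "2017-06-01", 103),
    (100, 1, "2017-06-10", 100),
]
left_child, right_child = split_index_rows(parent_rows, 100, 102)
print(left_child)
print(right_child)
```

The index rows cannot simply be cut at the split key, because they are sorted by indexed value rather than by data row key, so each one has to be inspected individually.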
Performance Results
• 4-node cluster.
• Tested with 5 local indexes on a base table of 25 columns with 10 regions.
• Ingested 50M rows.
• 3x faster upsert time compared to global indexes.
• 5x less network RX/TX utilization during writes compared to global indexes.
• Similar read performance compared to global indexes for queries such as
aggregations, GROUP BY, LIMIT, etc.
Performance results
[Chart: Write performance]
[Charts on three slides: Network Tx/Rx during write]
Helpful Tips
• Mutable vs immutable rows table?
– Writes with local indexes are much faster on an immutable rows table than on a
mutable one. So if rows are written once and never updated, it is better to
create the table with the IMMUTABLE_ROWS property.
• Online vs offline index population?
– When a table has pre-existing data, index population time varies with the
data size.
– Usually index population happens on the server, by reading the data table and
writing the index to the same table. It normally works very fast. But if the
data size is too big, it is better to use ASYNC population with IndexTool.
• Covered index vs non-covered index?
– When a query accesses non-indexed columns, Phoenix joins the columns missing
from the index from the data table itself by using get calls. If the number of
matching rows is high, it is better to create a covered index to avoid the get
calls.
Thank You
Q & A?
rajeshbabu@apache.org
@rajeshhcu32
