SlideShare a Scribd company logo
Apache Drill Meetup Nov’18
Accelerating SQL queries in NoSQL Databases using Apache
Drill and Secondary Indexes
Aman Sinha
Apache Drill Meetup, November 2018
Apache Drill Meetup Nov’18
About me
• Apache Drill PMC and Apache Calcite PMC. Past PMC chair of Drill.
• Engineering @ MapR
• Main areas of interest: SQL query processing for RDBMS, NoSQL, Hadoop
• Contact: amansinha@apache.org, Github: https://github.com/amansinha100
Apache Drill Meetup Nov’18
Talk Outline
• Background
– Overview of Apache Drill
– Overview of Secondary Index in NoSQL databases
• Leveraging secondary index via Drill
• Index interfaces exposed to plugins
• Demo
Apache Drill Meetup Nov’18
Architecture Summary
• Schema-on-read
• No centralized metastore
• Supports wide range of data sources via ‘plugins’
• Fully Java based
• In-memory columnar processing
• Mostly off-heap memory management (negligible GC overhead)
• Code generation for run-time operators
• Optimistic, pipelined execution model
• Spill to disk for blocking operations under memory pressure
• Integrated with YARN for resource management
• Provides an extensible framework for UDFs
Apache Drill Meetup Nov’18
Background
● Drill overview
● Secondary Index overview
Apache Drill Meetup Nov’18
HDFS HBase
MapR-
DB
HDFS HBase
MapR-
DB
DRILLBIT (any
drillbit can be
Foreman)
HDFS HBase
MapR-
DB
DRILLBIT DRILLBIT
JDBC/ODBC
Client
Web
Console
Data Sources (exposed to Drill as
Storage/Format Plugins)
Zookeeper (could be
co-resident with
Drillbit nodes)
Drill: Distributed SQL Query Engine
BI Tools
Apache Drill Meetup Nov’18
NoSQL DB: Primary Index
• HBase and MapR-DB tables have primary key (row key) column
– Column values are sorted
– Efficient range pruning is done for rowkey predicates:
WHERE rowkey BETWEEN ‘user003’ AND ‘user007’
• Secondary columns (e.g ‘State’) values are not sorted
– Predicate WHERE state = ‘CA’ need full table scan !
user009 WA
user010 OR
user011 CA
user012 TX
user005 WA
user006 OR
user007 CA
user008 TX
user001 WA
user002 OR
user003 CA
user004 TX
Rowkey Rowkey Rowkey
Primary table regions
Apache Drill Meetup Nov’18
Solution: Secondary Index ‘Tables’
user001 WA
user002 OR
user003 CA
user004 TX
user005 WA
user006 OR
user007 CA
user008 TX
user009 WA
user010 OR
user011 CA
user012 TX
CA_user003
CA_user007
CA_user011
OR_user002
OR_user006
OR_user010
TX_user004
TX_user008
TX_user012
WA_user001
WA_user005
WA_user009
Primary table regions
Index table regions
Rowkey Rowkey Rowkey
Rowkey
Apache Drill Meetup Nov’18
Secondary Index
• NoSQL DBs supporting secondary index
– MongoDB
– MapR-DB JSON
– HBase + Phoenix
– Couchbase
– Cassandra
• What’s missing ?
– Other than Hbase + Phoenix, others don’t have an ANSI SQL interface
– There’s a need for a generalized cost-based index planning and execution framework
– A key requirement:
• Framework must be able to support ‘global’ non-covering indexes, not just
covering index
Apache Drill Meetup Nov’18
Terminology and concepts
• Indexed fields, Included fields
• Simple Index, Composite Index
• Range Index, Hash Index
• Covering Index: All columns referenced in the query are available in the
index
– Easier to handle by the optimizer. Generate an index-only plan.
• Non-Covering Index: Only a subset of the columns referenced in the query
are available in the index
– Needs more supporting infrastructure from optimizer and executor
Apache Drill Meetup Nov’18
Leveraging Secondary Index via Drill
Apache Drill Meetup Nov’18
Feature summary
• Designed for storage/format plugin whose backend supports secondary
indexing
– Reference implementation is with MapR-DB JSON
– NOTE: Drill does not create indexes; it uses them
• Index metadata is exposed to Drill planner through well defined interfaces
• Statistics (if available) are also exposed
• New run-time operators added for executing index plans
• Drill Planner extends Apache Calcite’s planner and does cost-based index
selection
• Feature will be available in upcoming Drill 1.15 release
Apache Drill Meetup Nov’18
Types of Queries Eligible for Index Planning
• WHERE clause with local filters
– <, >, =, BETWEEN
– IN, LIKE
– Eligible ANDed conditions
– Eligible ORed conditions
– Certain types of functions, e.g CAST(zipcode as BIGINT) = 12345 (only if data
source supports functional indexes)
• ORDER BY
• GROUP BY (using StreamingAggregate)
• JOIN (using MergeJoin)
Apache Drill Meetup Nov’18
Covering index plan
Index Scan
• SELECT zipcode FROM T WHERE state = ‘CA’ AND age < 30
• Composite key index on {state, age}, included field: {zipcode}
Filter
state = ‘CA’ AND age <
30
Project
UnionExchange
These are subsequently pushed
down by plugin-specific filter
pushdown rule and project
pushdown rule
Apache Drill Meetup Nov’18
Non-Covering Index Plan
• SELECT * FROM T WHERE state = ‘CA’ AND age < 30
• Composite key index on {state, age}
• How to produce the remaining (‘star’) columns ?
Index Scan
state = ‘CA’ AND
age < 30
RowKey Join
Range Partition
Exchange
Restricted (a.k.a
‘skip’ ) scan
Return rows
Row keys
Supply row
keys
Do ‘bucketing’ of row keys belonging
to the same region/tablet (this needs
knowledge of the tablet map)
Project all
columns
Apache Drill Meetup Nov’18
Example: ORDER BY queries
state county zipcode
SELECT zipcode
FROM T
WHERE state = ‘CA’
AND county = ‘Santa Clara’
ORDER BY zipcode
● Composite Index on {state, county,
zipcode}
● Planner will use this index and avoid sort
Index keys
Apache Drill Meetup Nov’18
Index Intersection
• SELECT * FROM T WHERE state IN (‘TX’, ‘CA’) AND county like ‘San%’
• Suppose single key index exists on ‘state’ and ‘county’
Index Scan
state in (‘TX’,
‘CA’)
Index Scan
county LIKE
‘San%’
Intersect HashJoin
Broadcast Exchange
RowKey Join
Range Partition
Exchange
Restricted Scan
Apache Drill Meetup Nov’18
Index Selection
● Drill planner in conjunction with Calcite’s Volcano planner provides
cost-based index selection. In addition, Drill planner employs a heuristic to
reduce the overall search space
○ Ranks the indexes based on few criteria: leading prefix selectivity, covering property,
collation property
○ Picks top 5 indexes (configurable) for further plan generation
○ Intersection of indexes is factored into selection
● Volcano planner compares cost of index plans with each other and the full
table scan plan and picks cheapest
Apache Drill Meetup Nov’18
Index Interfaces Exposed to Plugins
Apache Drill Meetup Nov’18
Sample interfaces to be implemented by plugin
● DbGroupScan
○ IndexCollection getSecondaryIndexCollection(RelNode scan)
○ DbGroupScan getRestrictedScan(List<SchemaPath> columns);
○ PartitionFunction getRangePartitionFunction(List<FieldReference> refList)
○ PluginCost getPluginCostModel()
● PluginCost
○ int getSequentialBlockReadCost(GroupScan scan)
○ int getRandomBlockReadCost(GroupScan scan)
● IndexDiscover
○ IndexCollection getTableIndex(String tableName)
● IndexDefinition
○ List<LogicalExpression> getRowKeyColumns()
○ List<LogicalExpression> getIndexColumns()
○ List<LogicalExpression> getNonIndexColumns()
○ Map<LogicalExpression, RelFieldCollation> getCollationMap()
Apache Drill Meetup Nov’18
Demo
• Sample queries on TPC-H
• Examine query profile : identify which index is picked
• Show the index definitions: indexed fields, included fields
Apache Drill Meetup Nov’18
Additional Resources
Apache Drill:
• http://drill.apache.org
• Mailing lists:
– user@drill.apache.org
– dev@drill.apache.org
Secondary Indexes in MapR-DB:
https://mapr.com/docs/61/MapR-DB/Indexes/Indexes.html
Thank you !

More Related Content

What's hot

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Apache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL DatabaseApache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL Database
DataWorks Summit
 

What's hot (20)

A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with...
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Spark etl
Spark etlSpark etl
Spark etl
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedIn
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Apache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL DatabaseApache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL Database
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 

Similar to Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary Indexes

Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Qbeast
 

Similar to Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary Indexes (20)

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Beyond EXPLAIN: Query Optimization From Theory To Code
Beyond EXPLAIN: Query Optimization From Theory To CodeBeyond EXPLAIN: Query Optimization From Theory To Code
Beyond EXPLAIN: Query Optimization From Theory To Code
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0
 
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
 
Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Pydata london meetup - RiakTS, PySpark and Python by Stephen EtheridgePydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloud
 
Riak TS
Riak TSRiak TS
Riak TS
 
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Introduction no sql solutions with couchbase and .net core
Introduction no sql solutions with couchbase and .net coreIntroduction no sql solutions with couchbase and .net core
Introduction no sql solutions with couchbase and .net core
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Advanced search and Top-K queries in Cassandra
Advanced search and Top-K queries in CassandraAdvanced search and Top-K queries in Cassandra
Advanced search and Top-K queries in Cassandra
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Recently uploaded

AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 

Recently uploaded (20)

Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024Top Mobile App Development Companies 2024
Top Mobile App Development Companies 2024
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
iGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by SkilrockiGaming Platform & Lottery Solutions by Skilrock
iGaming Platform & Lottery Solutions by Skilrock
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdf
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 

Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary Indexes

  • 1. Apache Drill Meetup Nov’18 Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary Indexes Aman Sinha Apache Drill Meetup, November 2018
  • 2. Apache Drill Meetup Nov’18 About me • Apache Drill PMC and Apache Calcite PMC. Past PMC chair of Drill. • Engineering @ MapR • Main areas of interest: SQL query processing for RDBMS, NoSQL, Hadoop • Contact: amansinha@apache.org, Github: https://github.com/amansinha100
  • 3. Apache Drill Meetup Nov’18 Talk Outline • Background – Overview of Apache Drill – Overview of Secondary Index in NoSQL databases • Leveraging secondary index via Drill • Index interfaces exposed to plugins • Demo
  • 4. Apache Drill Meetup Nov’18 Architecture Summary • Schema-on-read • No centralized metastore • Supports wide range of data sources via ‘plugins’ • Fully Java based • In-memory columnar processing • Mostly off-heap memory management (negligible GC overhead) • Code generation for run-time operators • Optimistic, pipelined execution model • Spill to disk for blocking operations under memory pressure • Integrated with YARN for resource management • Provides an extensible framework for UDFs
  • 5. Apache Drill Meetup Nov’18 Background ● Drill overview ● Secondary Index overview
  • 6. Apache Drill Meetup Nov’18 HDFS HBase MapR- DB HDFS HBase MapR- DB DRILLBIT (any drillbit can be Foreman) HDFS HBase MapR- DB DRILLBIT DRILLBIT JDBC/ODBC Client Web Console Data Sources (exposed to Drill as Storage/Format Plugins) Zookeeper (could be co-resident with Drillbit nodes) Drill: Distributed SQL Query Engine BI Tools
  • 7. Apache Drill Meetup Nov’18 NoSQL DB: Primary Index • HBase and MapR-DB tables have primary key (row key) column – Column values are sorted – Efficient range pruning is done for rowkey predicates: WHERE rowkey BETWEEN ‘user003’ AND ‘user007’ • Secondary columns (e.g ‘State’) values are not sorted – Predicate WHERE state = ‘CA’ need full table scan ! user009 WA user010 OR user011 CA user012 TX user005 WA user006 OR user007 CA user008 TX user001 WA user002 OR user003 CA user004 TX Rowkey Rowkey Rowkey Primary table regions
  • 8. Apache Drill Meetup Nov’18 Solution: Secondary Index ‘Tables’ user001 WA user002 OR user003 CA user004 TX user005 WA user006 OR user007 CA user008 TX user009 WA user010 OR user011 CA user012 TX CA_user003 CA_user007 CA_user011 OR_user002 OR_user006 OR_user010 TX_user004 TX_user008 TX_user012 WA_user001 WA_user005 WA_user009 Primary table regions Index table regions Rowkey Rowkey Rowkey Rowkey
  • 9. Apache Drill Meetup Nov’18 Secondary Index • NoSQL DBs supporting secondary index – MongoDB – MapR-DB JSON – HBase + Phoenix – Couchbase – Cassandra • What’s missing ? – Other than Hbase + Phoenix, others don’t have an ANSI SQL interface – There’s a need for a generalized cost-based index planning and execution framework – A key requirement: • Framework must be able to support ‘global’ non-covering indexes, not just covering index
  • 10. Apache Drill Meetup Nov’18 Terminology and concepts • Indexed fields, Included fields • Simple Index, Composite Index • Range Index, Hash Index • Covering Index: All columns referenced in the query are available in the index – Easier to handle by the optimizer. Generate an index-only plan. • Non-Covering Index: Only a subset of the columns referenced in the query are available in the index – Needs more supporting infrastructure from optimizer and executor
  • 11. Apache Drill Meetup Nov’18 Leveraging Secondary Index via Drill
  • 12. Apache Drill Meetup Nov’18 Feature summary • Designed for storage/format plugin whose backend supports secondary indexing – Reference implementation is with MapR-DB JSON – NOTE: Drill does not create indexes; it uses them • Index metadata is exposed to Drill planner through well defined interfaces • Statistics (if available) are also exposed • New run-time operators added for executing index plans • Drill Planner extends Apache Calcite’s planner and does cost-based index selection • Feature will be available in upcoming Drill 1.15 release
  • 13. Apache Drill Meetup Nov’18 Types of Queries Eligible for Index Planning • WHERE clause with local filters – <, >, =, BETWEEN – IN, LIKE – Eligible ANDed conditions – Eligible ORed conditions – Certain types of functions, e.g CAST(zipcode as BIGINT) = 12345 (only if data source supports functional indexes) • ORDER BY • GROUP BY (using StreamingAggregate) • JOIN (using MergeJoin)
  • 14. Apache Drill Meetup Nov’18 Covering index plan Index Scan • SELECT zipcode FROM T WHERE state = ‘CA’ AND age < 30 • Composite key index on {state, age}, included field: {zipcode} Filter state = ‘CA’ AND age < 30 Project UnionExchange These are subsequently pushed down by plugin-specific filter pushdown rule and project pushdown rule
  • 15. Apache Drill Meetup Nov’18 Non-Covering Index Plan • SELECT * FROM T WHERE state = ‘CA’ AND age < 30 • Composite key index on {state, age} • How to produce the remaining (‘star’) columns ? Index Scan state = ‘CA’ AND age < 30 RowKey Join Range Partition Exchange Restricted (a.k.a ‘skip’ ) scan Return rows Row keys Supply row keys Do ‘bucketing’ of row keys belonging to the same region/tablet (this needs knowledge of the tablet map) Project all columns
  • 16. Apache Drill Meetup Nov’18 Example: ORDER BY queries state county zipcode SELECT zipcode FROM T WHERE state = ‘CA’ AND county = ‘Santa Clara’ ORDER BY zipcode ● Composite Index on {state, county, zipcode} ● Planner will use this index and avoid sort Index keys
  • 17. Apache Drill Meetup Nov’18 Index Intersection • SELECT * FROM T WHERE state IN (‘TX’, ‘CA’) AND county like ‘San%’ • Suppose single key index exists on ‘state’ and ‘county’ Index Scan state in (‘TX’, ‘CA’) Index Scan county LIKE ‘San%’ Intersect HashJoin Broadcast Exchange RowKey Join Range Partition Exchange Restricted Scan
  • 18. Apache Drill Meetup Nov’18 Index Selection ● Drill planner in conjunction with Calcite’s Volcano planner provides cost-based index selection. In addition, Drill planner employs a heuristic to reduce the overall search space ○ Ranks the indexes based on few criteria: leading prefix selectivity, covering property, collation property ○ Picks top 5 indexes (configurable) for further plan generation ○ Intersection of indexes is factored into selection ● Volcano planner compares cost of index plans with each other and the full table scan plan and picks cheapest
  • 19. Apache Drill Meetup Nov’18 Index Interfaces Exposed to Plugins
  • 20. Apache Drill Meetup Nov’18 Sample interfaces to be implemented by plugin ● DbGroupScan ○ IndexCollection getSecondaryIndexCollection(RelNode scan) ○ DbGroupScan getRestrictedScan(List<SchemaPath> columns); ○ PartitionFunction getRangePartitionFunction(List<FieldReference> refList) ○ PluginCost getPluginCostModel() ● PluginCost ○ int getSequentialBlockReadCost(GroupScan scan) ○ int getRandomBlockReadCost(GroupScan scan) ● IndexDiscover ○ IndexCollection getTableIndex(String tableName) ● IndexDefinition ○ List<LogicalExpression> getRowKeyColumns() ○ List<LogicalExpression> getIndexColumns() ○ List<LogicalExpression> getNonIndexColumns() ○ Map<LogicalExpression, RelFieldCollation> getCollationMap()
  • 21. Apache Drill Meetup Nov’18 Demo • Sample queries on TPC-H • Examine query profile : identify which index is picked • Show the index definitions: indexed fields, included fields
  • 22. Apache Drill Meetup Nov’18 Additional Resources Apache Drill: • http://drill.apache.org • Mailing lists: – user@drill.apache.org – dev@drill.apache.org Secondary Indexes in MapR-DB: https://mapr.com/docs/61/MapR-DB/Indexes/Indexes.html Thank you !