SlideShare a Scribd company logo
1 of 23
Download to read offline
Enhancements on
Spark SQL Optimizer
Min Qiu
HuaweiTechnologies,Inc.
nuwabox.com
Agenda
• Review of Catalyst Architecture
• Rule-based Optimizations (RBO)
• Cost-based Optimizations (CBO)
• Future Work
• Q & A
Agenda
• Review of Catalyst Architecture
• Rule-based Optimizations
• Cost-based Optimizations
• Future Work
• Q & A
Catalyst Architecture
Spark optimizes
query plan here
Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a databricks engineering blog
Spark SQL Optimizer
• Optimizer is a RuleExecutor
• Individual optimization is defined as Rule
– Equivalent transformations on analyzed plan
• Reduce the data volume and cost of computation
i.e. Constant Folding, Filter/Project Pushdown…
• Well designed, and easy to extend
– New rules can easily be plugged in to optimizer
• Very good already, but still incomplete
– Need new optimization rules to cover more cases
Agenda
• Review of Catalyst optimizer
• Rule-based Optimizations (RBO)
• Cost-based Optimizations
• Future Work
• Q & A
Rule Based Optimizations
• Join Condition Push Down
- Predicate Rewrite
- Join Order Adjustment
• Data Volume Reduction
- Column Pruning Enhancement
Join-condition Push Down
• In case of push down failure
– No explicit equal-join condition recognized by EquiJoinSelection Strategy
– Expensive Join operator could be planned (i.e. CartesianProduct)
R1 R2
Join[predicate]
R2
Join
R1
Filter[predicate]
R1
Cartesian
Product
R2
Filter[predicate]
1. Predicate is empty
2. Predicate is notconjunctive
3. Predicate is conjunctive,butno split
componentmatches the attributes ofR1 and R2
Predicate Rewrite
• Issue: SPARK-12032
- Predicate is OR-expression
- Join condition cannot be associated with a Join node
• Reason
- Equality condition is hidden in DNF
- The equality condition matches a join but cannot be pushed down to Join node
Predicate Rewrite 2
• Solution and Performance gain
– Pull Request 10087
– Extract the predicate that is common in each disjunctive component
– OR expression is rewritten toAND expression
– Query runs several times faster
part lineitem
Join
Filter
Or-Expression
part
lineitem
Join
p_partkey = l_partkey && rest
Filter
l_shipmode in ( …) and
l_shipinstruct = “DELIVER IN PERSON”
Join Order Adjustment
• Issue: SPARK-12032
- Join condition isn’t pushed down to join relation due to mismatching join order
- Expensive Join operator is planned
OOM if CartesianProduct
Query runs slowly
Join Order Adjustment 2
• Reason
- Multi-table join order in query plan is the same as they appear in SQL statement
- Join condition isn’t associated with correct Join node
order nation
Join customer
Join
o.o_orderkey = c.c_custkey &&
c.c_nationkey = n.n_nationkey
c_nationkey is from customer table
n_nationkey is from nation table
They don’t match attributes from
relation order and relation nation
Join Order Adjustment 3
• Solution and Performance gain
- Pull Request #10258 (By me)
- Pull Request #10073 (By other engineer)
- Group join relations with matched join condition together
- No OOM, Query runs several to tens of times faster
order nation
Join customer
Join
o.o_orderkey = c.c_custkey &&
c.c_nationkey = n.n_nationkey
order customer
Join nation
Join
c.c_nationkey = n.n_nationkey
o.o_orderkey
= c.c_custkey
Column Pruning Enhancement
• Issue: SPARK-12114
- ColumnPruning Rule fails in the following case
- Unnecessary columns are read from disk and are involved in data shuffling
Expected referenced columns:
c_name,c_custkey,o_orderkey,o_orderdate,o_totalprice,
l_orderkey,l_quantity
In fact:
all columns from table customer,order andlineitem
a fewhundred GB shuffling volume for 100GB DB
OOM, long running time
Column Pruning Enhancement 2
• Reason
- The rule only considers the following patterns
Aggregate,Generate,Project<- Join,LeftSemiJoin
- Query plan for the example below has a unrecognized pattern
Project
Filter
Join
Column Pruning Enhancement 3
• Solution and performance gain
– Pull Request #10117
– Added a pattern to ColumnPruning rule
– Project nodes are inserted to the places where desired
– Data shuffling volume is down by 90%
– Query runs 6X faster
order lineitem
Join customer
Join
order lineitem
Join
customer
Join
Project
c_custkey, c_name
Project Project
l_orderkey, l_quantity
o_orderdate,
o_custkey,
o_totalprice,
o_orderkey
Agenda
• Review of Catalyst optimizer
• Rule-based Optimizations
• Cost-based Optimizations (CBO)
• Future Work
• Q & A
Cost Based Optimization
The more we know about the database objects,
the better the query plan could be generated
• Implemented statistics collection for table and columns
• Statistics are saved in Hive Metastore
• Implemented histogram to improve the estimation accuracy
• Estimate number of rows and size of intermediate results
Cost Based Optimization 2
• The optimizations we have done
- Dynamically determine the number of partitions, and the broadcast threshold
- Select the join type based on the cost estimation
i.e. Sort Merge join vs. Broadcast join vs. Hash join
- Adjust multi-table join order based on the cost estimation
- Choose appropriate build side of a hash join operator (Prior to 1.5)
Cost Based Optimization 3
• Configuration
- SF100 (100GB)
- 2x E5-2680 v3 (16 cores, 32 HT)
- 250GB memory
• Preliminary Results
– Performance evaluation shows up to five times speedup
– Small part of queries perform worse even with CBO (due to inaccurate
estimation)
Agenda
• Review of Catalyst optimizer
• Rule-based Optimizations
• Cost-based Optimizations
• Future Work
• Q & A
Future Work
• Enumerate Space of Query Plans
- The query plan picked by query planner is not always optimal from CBO perspective
- Our CBO essentially only makes adjustments on the plan picked by query planner
- Need to apply the cost model on set of equivalent query plans
• Make the cost model smarter
- The estimation algorithm isn’t accurate enough
i.e. Bad filter factor estimation for String
Bad estimation on the expressions/user-defined function
THANK YOU!
min.qiu@huawei.com

More Related Content

What's hot

Optimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex PlansOptimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex Plans
Databricks
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
Databricks
 

What's hot (20)

Optimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex PlansOptimizing the Catalyst Optimizer for Complex Plans
Optimizing the Catalyst Optimizer for Complex Plans
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Hundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario SpacagnaHundreds of queries in the time of one - Gianmario Spacagna
Hundreds of queries in the time of one - Gianmario Spacagna
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas Dinsmore
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
FlinkML - Big data application meetup
FlinkML - Big data application meetupFlinkML - Big data application meetup
FlinkML - Big data application meetup
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souzaSpark Summit EU talk by Sol Ackerman and Franklyn D'souza
Spark Summit EU talk by Sol Ackerman and Franklyn D'souza
 

Viewers also liked

Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Spark Summit
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Spark Summit
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Spark Summit
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Spark Summit
 

Viewers also liked (20)

Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas ChammasFlintrock: A Faster, Better spark-ec2 by Nicholas Chammas
Flintrock: A Faster, Better spark-ec2 by Nicholas Chammas
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & Catalyst
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache streams 2015
Apache streams 2015Apache streams 2015
Apache streams 2015
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
 
Spark in the BigData dark
Spark in the BigData darkSpark in the BigData dark
Spark in the BigData dark
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - Spark
 

Similar to Enhancements on Spark SQL optimizer by Min Qiu

Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
DataWorks Summit
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
DataWorks Summit
 
Oil & Gas Case Study
Oil & Gas Case StudyOil & Gas Case Study
Oil & Gas Case Study
Casey Lucas
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
Evans Ye
 

Similar to Enhancements on Spark SQL optimizer by Min Qiu (20)

Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
eBay Cloud CMS - QCon 2012 - http://yidb.org/
eBay Cloud CMS - QCon 2012 - http://yidb.org/eBay Cloud CMS - QCon 2012 - http://yidb.org/
eBay Cloud CMS - QCon 2012 - http://yidb.org/
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Performance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and UnderscoresPerformance Stability, Tips and Tricks and Underscores
Performance Stability, Tips and Tricks and Underscores
 
Sql parametrized queries
Sql parametrized queriesSql parametrized queries
Sql parametrized queries
 
ATAGTR2017 Batch Workload Modelling and Performance Optimization
ATAGTR2017 Batch Workload Modelling and Performance Optimization ATAGTR2017 Batch Workload Modelling and Performance Optimization
ATAGTR2017 Batch Workload Modelling and Performance Optimization
 
Part1 of SQL Tuning Workshop - Understanding the Optimizer
Part1 of SQL Tuning Workshop - Understanding the OptimizerPart1 of SQL Tuning Workshop - Understanding the Optimizer
Part1 of SQL Tuning Workshop - Understanding the Optimizer
 
(ARC310) Solving Amazon's Catalog Contention With Amazon Kinesis
(ARC310) Solving Amazon's Catalog Contention With Amazon Kinesis(ARC310) Solving Amazon's Catalog Contention With Amazon Kinesis
(ARC310) Solving Amazon's Catalog Contention With Amazon Kinesis
 
dd presentation.pdf
dd presentation.pdfdd presentation.pdf
dd presentation.pdf
 
Cloud dwh
Cloud dwhCloud dwh
Cloud dwh
 
Introduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrepIntroduction to GCP BigQuery and DataPrep
Introduction to GCP BigQuery and DataPrep
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Improve power bi performance
Improve power bi performanceImprove power bi performance
Improve power bi performance
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
 
Optimizer overviewoow2014
Optimizer overviewoow2014Optimizer overviewoow2014
Optimizer overviewoow2014
 
Oil & Gas Case Study
Oil & Gas Case StudyOil & Gas Case Study
Oil & Gas Case Study
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 
Creating a Centralized Consumer Profile Management Service with WebSphere Dat...
Creating a Centralized Consumer Profile Management Service with WebSphere Dat...Creating a Centralized Consumer Profile Management Service with WebSphere Dat...
Creating a Centralized Consumer Profile Management Service with WebSphere Dat...
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 

Recently uploaded (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 

Enhancements on Spark SQL optimizer by Min Qiu

  • 1. Enhancements on Spark SQL Optimizer Min Qiu HuaweiTechnologies,Inc. nuwabox.com
  • 2. Agenda • Review of Catalyst Architecture • Rule-based Optimizations (RBO) • Cost-based Optimizations (CBO) • Future Work • Q & A
  • 3. Agenda • Review of Catalyst Architecture • Rule-based Optimizations • Cost-based Optimizations • Future Work • Q & A
  • 4. Catalyst Architecture Spark optimizes query plan here Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a databricks engineering blog
  • 5. Spark SQL Optimizer • Optimizer is a RuleExecutor • Individual optimization is defined as Rule – Equivalent transformations on analyzed plan • Reduce the data volume and cost of computation i.e. Constant Folding, Filter/Project Pushdown… • Well designed, and easy to extend – New rules can easily be plugged in to optimizer • Very good already, but still incomplete – Need new optimization rules to cover more cases
  • 6. Agenda • Review of Catalyst optimizer • Rule-based Optimizations (RBO) • Cost-based Optimizations • Future Work • Q & A
  • 7. Rule Based Optimizations • Join Condition Push Down - Predicate Rewrite - Join Order Adjustment • Data Volume Reduction - Column Pruning Enhancement
  • 8. Join-condition Push Down • In case of push down failure – No explicit equal-join condition recognized by EquiJoinSelection Strategy – Expensive Join operator could be planned (i.e. CartesianProduct) R1 R2 Join[predicate] R2 Join R1 Filter[predicate] R1 Cartesian Product R2 Filter[predicate] 1. Predicate is empty 2. Predicate is notconjunctive 3. Predicate is conjunctive,butno split componentmatches the attributes ofR1 and R2
  • 9. Predicate Rewrite • Issue: SPARK-12032 - Predicate is OR-expression - Join condition cannot be associated with a Join node • Reason - Equality condition is hidden in DNF - The equality condition matches a join but cannot be pushed down to Join node
  • 10. Predicate Rewrite 2 • Solution and Performance gain – Pull Request 10087 – Extract the predicate that is common in each disjunctive component – OR expression is rewritten toAND expression – Query runs several times faster part lineitem Join Filter Or-Expression part lineitem Join p_partkey = l_partkey && rest Filter l_shipmode in ( …) and l_shipinstruct = “DELIVER IN PERSON”
  • 11. Join Order Adjustment • Issue: SPARK-12032 - Join condition isn’t pushed down to join relation due to mismatching join order - Expensive Join operator is planned OOM if CartesianProduct Query runs slowly
  • 12. Join Order Adjustment 2 • Reason - Multi-table join order in query plan is the same as they appear in SQL statement - Join condition isn’t associated with correct Join node order nation Join customer Join o.o_orderkey = c.c_custkey && c.c_nationkey = n.n_nationkey c_nationkey is from customer table n_nationkey is from nation table They don’t match attributes from relation order and relation nation
  • 13. Join Order Adjustment 3 • Solution and Performance gain - Pull Request #10258 (By me) - Pull Request #10073 (By other engineer) - Group join relations with matched join condition together - No OOM, Query runs several to tens of times faster order nation Join customer Join o.o_orderkey = c.c_custkey && c.c_nationkey = n.n_nationkey order customer Join nation Join c.c_nationkey = n.n_nationkey o.o_orderkey = c.c_custkey
  • 14. Column Pruning Enhancement • Issue: SPARK-12114 - ColumnPruning Rule fails in the following case - Unnecessary columns are read from disk and are involved in data shuffling Expected referenced columns: c_name,c_custkey,o_orderkey,o_orderdate,o_totalprice, l_orderkey,l_quantity In fact: all columns from table customer,order andlineitem a fewhundred GB shuffling volume for 100GB DB OOM, long running time
  • 15. Column Pruning Enhancement 2 • Reason - The rule only considers the following patterns Aggregate,Generate,Project<- Join,LeftSemiJoin - Query plan for the example below has a unrecognized pattern Project Filter Join
  • 16. Column Pruning Enhancement 3 • Solution and performance gain – Pull Request #10117 – Added a pattern to ColumnPruning rule – Project nodes are inserted to the places where desired – Data shuffling volume is down by 90% – Query runs 6X faster order lineitem Join customer Join order lineitem Join customer Join Project c_custkey, c_name Project Project l_orderkey, l_quantity o_orderdate, o_custkey, o_totalprice, o_orderkey
  • 17. Agenda • Review of Catalyst optimizer • Rule-based Optimizations • Cost-based Optimizations (CBO) • Future Work • Q & A
  • 18. Cost Based Optimization The more we know about the database objects, the better the query plan could be generated • Implemented statistics collection for table and columns • Statistics are saved in Hive Metastore • Implemented histogram to improve the estimation accuracy • Estimate number of rows and size of intermediate results
  • 19. Cost Based Optimization 2 • The optimizations we have done - Dynamically determine the number of partitions, and the broadcast threshold - Select the join type based on the cost estimation i.e. Sort Merge join vs. Broadcast join vs. Hash join - Adjust multi-table join order based on the cost estimation - Choose appropriate build side of a hash join operator (Prior to 1.5)
  • 20. Cost Based Optimization 3 • Configuration - SF100 (100GB) - 2x E5-2680 v3 (16 cores, 32 HT) - 250GB memory • Preliminary Results – Performance evaluation shows up to five times speedup – Small part of queries perform worse even with CBO (due to inaccurate estimation)
  • 21. Agenda • Review of Catalyst optimizer • Rule-based Optimizations • Cost-based Optimizations • Future Work • Q & A
  • 22. Future Work • Enumerate Space of Query Plans - The query plan picked by query planner is not always optimal from CBO perspective - Our CBO essentially only makes adjustments on the plan picked by query planner - Need to apply the cost model on set of equivalent query plans • Make the cost model smarter - The estimation algorithm isn’t accurate enough i.e. Bad filter factor estimation for String Bad estimation on the expressions/user-defined function