Enhancing Spark SQL Optimizer with
Reliable Statistics
Ron Hu, Fang Cao, Min Qiu*, Yizhen Liu
Huawei Technologies, Inc.
* Former Huawei employee
Agenda
• Review of Catalyst Architecture
• Rule-based optimizations
• Reliable statistics collected
• Cost-based rules
• Future Work
• Q & A
Page 2
Catalyst Architecture
[Figure: Catalyst architecture diagram, annotated "Spark optimizes query plan here"]
Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a Databricks engineering blog
Page 3
Rule-Based Optimizer in Spark SQL
• Most of the Spark SQL optimizer’s rules are heuristic rules.
– They do NOT consider the cost of each operator
– They do NOT consider the cost of the equivalent logical plans
• Join order is decided by the tables’ positions in the SQL query
• Join type is chosen based on some very simple system
assumptions
• The number of shuffle partitions is a fixed number.
• Our community work:
– Ex.: Fixed bugs in Spark.
– Spark Summit East 2016 talk, https://spark-summit.org/east-2016/events/enhancements-on-spark-sql-optimizer/
Page 4
Statistics Collected
• Collect Table Statistics information
• Collect Column Statistics information
• Only consider static system statistics (configuration
file: CPU, Storage, Network) at this stage.
• Goal:
– Calculate the cost for each database operator
• in terms of number of output rows, size of output rows, etc.
– Based on the cost calculation, adjust the query execution
plan
Page 5
Table Statistics Collected
• Use a modified Hive Analyze Table statement to
collect statistics of a table.
– Ex: Analyze Table lineitem compute statistics
• It collects table-level statistics and saves them into the
metastore.
– Number of rows
– Number of files
– Table size in bytes
Page 6
Column Statistics Collected
• Use the Analyze statement to collect column-level statistics
of individual columns.
– Ex: Analyze Table lineitem compute statistics for
columns l_orderkey, l_partkey, l_suppkey,
l_returnflag, l_linestatus, l_shipdate, ……..
• It collects column-level statistics and saves them into the
metastore.
– Minimal value, maximal value,
– Number of distinct values, number of null values
– Column maximal length, column average length
– Uniqueness of a column
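The column-level statistics listed above can be sketched in plain Python. This is only an illustration, not the Hive or Spark implementation: `ColumnStats`, `collect_column_stats`, and the 0.99 uniqueness threshold are hypothetical names and values chosen for the example, and lengths are measured on the string form of each value.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ColumnStats:
    min_value: Optional[Any]
    max_value: Optional[Any]
    ndv: int            # number of distinct non-null values
    null_count: int
    max_len: int
    avg_len: float
    is_unique: bool     # NDV / row count close to 1.0


def collect_column_stats(values: List[Any], unique_threshold: float = 0.99) -> ColumnStats:
    """Scan one column and compute the statistics named on the slide."""
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    ndv = len(set(non_null))
    return ColumnStats(
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
        ndv=ndv,
        null_count=len(values) - len(non_null),
        max_len=max(lengths) if lengths else 0,
        avg_len=sum(lengths) / len(lengths) if lengths else 0.0,
        # Assumed rule: a column counts as unique when NDV / rows >= threshold.
        is_unique=bool(values) and ndv / len(values) >= unique_threshold,
    )
```

A full scan like this is exact but touches every row, which is the cost the deck later proposes to reduce with sampling and approximate counting.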
Page 7
Column 1-D Histogram
Provided two kinds of histograms: Equi-Width and Equi-Depth
- Between buckets, data distribution is determined by histograms
- Within one bucket, still assume data is evenly distributed
Max number of buckets: 256
- If Number of Distinct Values <= 256, use equi-width
- If Number of Distinct Values > 256, use equi-depth
Used Hive Analyze Command and Hive Metastore API
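The difference between the two histogram kinds can be sketched as follows. This is a minimal illustration for numeric columns; the function names and the list-of-edges representation are assumptions made for the example, and only the 256-bucket rule comes from the slide.

```python
from typing import List, Sequence, Tuple

MAX_BUCKETS = 256  # limit stated on the slide


def choose_histogram_kind(ndv: int) -> str:
    """Rule from the slide: equi-width for few distinct values, else equi-depth."""
    return "equi-width" if ndv <= MAX_BUCKETS else "equi-depth"


def equi_width_histogram(values: Sequence[float], num_buckets: int) -> Tuple[List[float], List[int]]:
    """Buckets span equal value ranges; row counts per bucket vary."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width when all values are equal
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)  # max value goes in last bucket
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts


def equi_depth_histogram(values: Sequence[float], num_buckets: int) -> List[float]:
    """Buckets hold (roughly) equal row counts; value ranges vary."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[0]]
    for i in range(1, num_buckets + 1):
        end_idx = max(0, (i * n) // num_buckets - 1)  # last row of bucket i
        edges.append(ordered[end_idx])
    return edges
```

With skewed data such as `[1, 2, 3, 4, 100]`, two equi-width buckets hold 4 rows and 1 row, whereas equi-depth boundaries move so that each bucket holds about the same number of rows.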
Page 8
[Figure: Equi-Width and Equi-Depth histograms, each plotting frequency against column interval]
Column 2-D Histogram
• Developed 2-dimensional equi-depth histogram for the
column combination of (c1, c2)
– In a 2-dimensional histogram, there are 2 levels of buckets.
– B(c1) is the number of major buckets for column C1.
– Within each C1 bucket, B(c2) is the number of buckets for C2
• Lessons Learned:
– Users do not use 2-D histogram often as they do not know which 2
columns are correlated.
– What granularity to use? 256 buckets or 256x256 buckets?
– Difficult to extend to 3-D or more dimensions
– Can be replaced by hints
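The two-level bucket structure described above can be sketched like this. It is an illustrative reading of the slide, not the authors' code: `two_d_equi_depth`, `split_equal`, and the dictionary layout are hypothetical, with `b_c1` and `b_c2` standing for B(c1) and B(c2).

```python
from typing import Any, Dict, List, Sequence, Tuple


def split_equal(sorted_items: Sequence[Any], parts: int) -> List[Sequence[Any]]:
    """Cut a sorted sequence into `parts` chunks of (nearly) equal size."""
    n = len(sorted_items)
    return [sorted_items[(i * n) // parts:((i + 1) * n) // parts] for i in range(parts)]


def two_d_equi_depth(pairs: Sequence[Tuple[Any, Any]], b_c1: int, b_c2: int) -> List[Dict[str, Any]]:
    """Build B(c1) major buckets on c1, then B(c2) equi-depth buckets on c2 inside each."""
    by_c1 = sorted(pairs, key=lambda p: p[0])
    histogram = []
    for major in split_equal(by_c1, b_c1):
        if not major:
            continue
        c2_sorted = sorted(p[1] for p in major)
        minor = [
            (chunk[0], chunk[-1], len(chunk))  # (c2 low, c2 high, row count)
            for chunk in split_equal(c2_sorted, b_c2)
            if chunk
        ]
        histogram.append({"c1_range": (major[0][0], major[-1][0]), "c2_buckets": minor})
    return histogram
```

The total number of leaf buckets is B(c1) × B(c2), which is why the granularity question on the slide (256 versus 256x256) matters for storage and collection time.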
Page 9
Cost-Based Rules
• Optimizer is a RuleExecutor.
– Individual optimization is defined as Rule
• We added new rules to estimate number of output
rows and output size in bytes for each execution
operator:
– MetastoreRelation, Filter, Project, Join, Sort, Aggregate,
Exchange, Limit, Union, etc.
• The node’s cost = nominal scale of (output_rows,
output_size)
Page 10
Filter Operator Statistics
• Between Filter’s expressions: AND, OR, NOT
• In each Expression: =, <, <=, >, >=, like, in, etc
• Currently supported types in expressions
– For <, <=, >, >=: String, Integer, Double, etc.
– For =: String, Integer, Double, Date, User-Defined Types, etc.
• Sample: A <= B
– Based on the min/max/NDV values of A and B, decide the relationship
between A and B, then derive what the new min/max/NDV of A and B
should be after the expression is applied
– We use histograms to adjust min/max/NDV values
– If no histogram information is available, assume all the data is evenly distributed.
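How a histogram can refine a range predicate is sketched below for an equi-depth histogram. The representation (a sorted list of bucket boundaries, each bucket holding the same share of rows) and the function name are assumptions for illustration; within a bucket the sketch falls back on the even-distribution assumption stated in the slides.

```python
from typing import Sequence


def less_than_factor_equi_depth(edges: Sequence[float], value: float) -> float:
    """Estimate the fraction of rows with column < value.

    `edges` has num_buckets + 1 entries; every bucket is assumed to hold
    1 / num_buckets of the rows (equi-depth).
    """
    num_buckets = len(edges) - 1
    if value <= edges[0]:
        return 0.0
    if value > edges[-1]:
        return 1.0
    covered = 0.0
    for lo, hi in zip(edges, edges[1:]):
        if value >= hi:
            covered += 1.0                        # whole bucket qualifies
        elif value > lo:
            covered += (value - lo) / (hi - lo)   # uniform inside the bucket
    return covered / num_buckets
```

For boundaries `[0, 10, 20, 40, 100]` and the predicate `col < 30`, two full buckets and half of the third qualify, giving 2.5 / 4 = 0.625, whereas a plain min/max estimate would give 0.3.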
Page 11
Filter Operator Example
• Column A (op) Data B
– (op) can be "=", "<", "<=", ">", ">=", "like"
– For example, "l_orderkey = 3" or "l_shipdate <= '1995-03-21'"
– Column’s max/min/distinct should be updated
– Sample: Column A < value B
[Figure: position of value B relative to A's range from A.min to A.max, plus an example histogram of frequency (10 to 50) over the value intervals 1–5, 6–10, 11–15, 16–20, 21–25]
B below A's range:
Filtering Factor = 0%
no need to change A’s statistics
A will not appear in later work
B above A's range:
Filtering Factor = 100%
no need to change A’s statistics
B inside A's range, with histograms:
Filtering Factor = calculated using histograms
A.min = no change
A.max = B.value
A.ndv = A.ndv * Filtering Factor
B inside A's range, without histograms (suppose data is evenly distributed):
Filtering Factor = (B.value – A.min) / (A.max – A.min)
A.min = no change
A.max = B.value
A.ndv = A.ndv * Filtering Factor
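The no-histogram case for `Column A < value B` translates almost directly into code. The sketch below follows the formulas on the slide; the `ColStats` class and function name are hypothetical, and the boundary handling (which side gets 0% or 100%) is inferred from the meaning of the `<` predicate.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ColStats:
    min: float
    max: float
    ndv: float


def estimate_less_than(a: ColStats, value: float) -> Tuple[float, ColStats]:
    """Return (filtering factor, updated stats of A) for the predicate A < value."""
    if value <= a.min:
        return 0.0, a          # nothing passes; A's statistics are left unchanged
    if value > a.max:
        return 1.0, a          # everything passes; A's statistics are left unchanged
    # Even-distribution assumption: the factor is the covered share of A's range.
    factor = (value - a.min) / (a.max - a.min)
    return factor, ColStats(min=a.min, max=value, ndv=a.ndv * factor)
```

For example, with `A.min = 0`, `A.max = 100`, `A.ndv = 50` and the predicate `A < 25`, the factor is 0.25 and the updated statistics are `max = 25`, `ndv = 12.5`.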
Page 12
Filter Operator Example
• Column A (op) Column B
– Based on observation, this kind of expression actually appears in Project, but not in Filter
– Note: histograms are not yet supported for column-to-column comparison. We cannot assume the data is evenly
distributed, so an empirical filtering factor of 1/3 is used
– (op) can be “<”, “<=”, “>”, “>=”
– Need to adjust the A and B’s min/max/NDV after filtering
– Sample: Column A < Column B
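[Figure: four relative positions of the value ranges of A and B]
A's range entirely below B's range: A filtering = 100%, B filtering = 100%
A's range entirely above B's range: A filtering = 0%, B filtering = 0%
A's and B's ranges overlap (two cases shown): A filtering = 33.3%, B filtering = 33.3%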
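The column-to-column cases can be summarized in a few lines. This is a sketch of the decision logic only; the function name is hypothetical, and mapping the 100% and 0% cases to "A entirely below B" and "A entirely above B" is inferred from the meaning of `A < B`.

```python
EMPIRICAL_FACTOR = 1.0 / 3.0  # value given on the slide for overlapping ranges


def estimate_col_less_than_col(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Filtering factor for the predicate A < B using only min/max of both columns."""
    if a_max < b_min:
        return 1.0   # every A value is below every B value
    if a_min >= b_max:
        return 0.0   # no A value can be below any B value
    return EMPIRICAL_FACTOR  # ranges overlap: fall back to the empirical 1/3
```

After an overlapping comparison, the min/max/NDV of both columns would still need adjusting as the slide notes; that step is omitted here because the slides do not spell out the adjustment.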
Page 13
Join Order
• Only for two table joins
• We calculate the cost of Hash Join using the stats of left
and right nodes.
– Nominal Cost = <nominal-rows> × 0.7 + <nominal-size> × 0.3
• Choose lower-cost child as build side of hash join (Prior
to Spark 1.5).
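A small sketch of the build-side choice follows. The 0.7 and 0.3 weights come from the slide; how `<nominal-rows>` and `<nominal-size>` are derived is not stated, so the sketch assumes they are the raw row count and byte size scaled by the larger of the two children. `NodeStats`, `nominal_cost`, and `choose_build_side` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    rows: int
    size_bytes: int


def nominal_cost(stats: NodeStats, max_rows: int, max_size: int) -> float:
    """Weighted cost from the slide, on rows and size normalized to [0, 1] (assumed scaling)."""
    return 0.7 * (stats.rows / max_rows) + 0.3 * (stats.size_bytes / max_size)


def choose_build_side(left: NodeStats, right: NodeStats) -> str:
    """Pick the lower-cost child as the build side of the hash join."""
    max_rows = max(left.rows, right.rows, 1)
    max_size = max(left.size_bytes, right.size_bytes, 1)
    left_cost = nominal_cost(left, max_rows, max_size)
    right_cost = nominal_cost(right, max_rows, max_size)
    return "left" if left_cost <= right_cost else "right"
```

Normalizing first keeps the byte size, which is usually orders of magnitude larger than the row count, from dominating the weighted sum.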
Page 14
Multi-way Join Reorder
• Currently Spark SQL’s Join order is not decided
by the cost of multi-way join operations.
• We decide the join order based on the output
rows and output size of the intermediate table.
– The join with smaller output is performed first.
– Can benefit star join queries (like TPC-DS).
• Using dynamic programming for join order
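One way to picture dynamic programming over join orders is the left-deep sketch below. It is an assumption-laden illustration, not the authors' rule: the cost of a plan is taken to be the sum of estimated intermediate row counts, join selectivities are supplied by the caller, cross joins are skipped, and `reorder_joins` is a hypothetical name.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Tuple


def reorder_joins(
    rows: Dict[str, float],
    selectivity: Dict[FrozenSet[str], float],
) -> Optional[Tuple[float, List[str]]]:
    """Return (cost, join order) minimizing the total intermediate output rows."""
    tables = sorted(rows)

    def cardinality(subset: FrozenSet[str]) -> float:
        # Output rows of joining all tables in `subset` (independence assumed).
        result = 1.0
        for t in sorted(subset):
            result *= rows[t]
        for a, b in combinations(sorted(subset), 2):
            result *= selectivity.get(frozenset((a, b)), 1.0)
        return result

    # best[S] = cheapest left-deep plan that joins exactly the tables in S.
    best = {frozenset([t]): (0.0, [t]) for t in tables}
    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            out_rows = cardinality(subset)
            candidates = []
            for t in combo:
                prev = subset - {t}
                if prev not in best:
                    continue  # no cross-join-free plan for the remaining tables
                if not any(frozenset((t, o)) in selectivity for o in prev):
                    continue  # t has no join condition with the rest: skip cross join
                prev_cost, prev_order = best[prev]
                candidates.append((prev_cost + out_rows, prev_order + [t]))
            if candidates:
                best[subset] = min(candidates)
    return best.get(frozenset(tables))
```

In a star schema where a filtered dimension shrinks the fact table, this puts the join with the smaller output first, matching the rule stated above. The subset enumeration is exponential in the number of tables, so a real optimizer would bound the number of tables considered.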
Page 15
Sample: Q3, 3-table join + aggregate
• ParquetRelation node
• Filter node
• Project node
• Join node
• Aggregation
• Limit
Page 16
Build Right -> Build Left
Limitation without Key Information
• Spark SQL does not support indexes or primary keys.
– Without this information, the output of a primary/foreign
key join cannot be estimated properly.
• When estimating the number of GROUP BY
operator output records, we multiply the number of
distinct values for each GROUP BY column.
– This formula is valid only if every GROUP BY column is
independent.
Page 17
Column Uniqueness
• We treat a column as unique (or a primary key)
if the number of distinct values divided by the
number of records of a table is close to 1.0.
– We can set the size of the hash join table properly if one
join column is unique.
– When computing the number of GROUP BY output
records, if one GROUP BY column is unique, we do
NOT multiply in the non-unique columns.
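The effect of uniqueness on the GROUP BY estimate can be sketched as below. The names and the 0.99 threshold for "close to 1.0" are hypothetical; the sketch mirrors the two rules from the slides (multiply NDVs when columns are treated as independent, stop multiplying when a unique column is present) and deliberately applies no further cap.

```python
from typing import Dict, Set


def is_unique(ndv: int, row_count: int, threshold: float = 0.99) -> bool:
    """Treat a column as unique when NDV / row count is close to 1.0 (assumed threshold)."""
    return row_count > 0 and ndv / row_count >= threshold


def estimate_group_by_rows(column_ndv: Dict[str, int], unique_columns: Set[str]) -> int:
    """Estimate the number of GROUP BY output rows from per-column NDVs."""
    unique_in_group = [c for c in column_ndv if c in unique_columns]
    if unique_in_group:
        # A unique column already determines the group: ignore the other columns.
        return min(column_ndv[c] for c in unique_in_group)
    estimate = 1
    for ndv in column_ndv.values():
        estimate *= ndv  # valid only if the columns are independent
    return estimate
```

With NDVs of 100 and 20 the independent estimate is 2000 groups, but if the first column is known to be unique the estimate drops to 100.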
Page 18
Unique Column Example, tpc-h Q10
/* tpc-h Q10: c_custkey is unique */
SELECT c_custkey, c_name, sum(l_extendedprice * (1 - l_discount))
AS revenue, c_acctbal, n_name, c_address, c_phone, c_comment
FROM nation join customer join orders join lineitem
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
AND o_orderdate >= '1993-10-01' AND o_orderdate < '1994-01-01'
AND l_returnflag = 'R' AND c_nationkey = n_nationkey
GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
ORDER BY revenue DESC limit 20
Number of group-by outputs can be:
• 1708M if there is no unique column information,
• 82K if we know there is a unique group-by column
Page 19
SQL Hints
• Some information cannot be analyzed directly from the statistics of
tables/columns. Example, tpc-h Q13:
SELECT c_count, count(*) as custdist
FROM
(SELECT c_custkey, count(o_orderkey) c_count
FROM customer LEFT OUTER JOIN orders
ON c_custkey = o_custkey
and o_comment not like '%special%request%'
GROUP BY c_custkey
) c_orders
GROUP BY c_count
ORDER BY custdist desc, c_count desc
– Supported hints /*+ …. */: Like_FilterFactor,
NDV_Correlated_Columns, Join_Build, Join_Type, ……
Page 20
Actual vs Estimated Output Rows
Query   Actual   Estimated
Q1           4           6
Q2         460        1756
Q3       11621      496485
Q4           5           5
Q5           5          25
Q6           1           1
Q7           4           5
Q8           2           5
Q9         175         222
Q10      37967       81611
Q11      28574       32000
Q12          2           2
Q13         42         100
Q14          1           1
Q15          1           2
Q16      18314       14700
Q17          1           1
Q18         57        1621
Q19          1           1
Q20        186         558
Q21        411         558
Page 21
Wrong Output Rows Estimate for Q3
• We do not handle the correlated columns of
different tables.
TPC-H Q3:
select l_orderkey, sum(l_extendedprice *(1 - l_discount)) as revenue,
o_orderdate, o_shippriority
from customer, orders, lineitem
where c_mktsegment = 'BUILDING'
and c_custkey = o_custkey and l_orderkey = o_orderkey
and o_orderdate < date '1995-3-15'
and l_shipdate > date '1995-3-15'
group by l_orderkey, o_orderdate, o_shippriority
order by l_orderkey, revenue desc, o_orderdate
Page 22
Possible Future Work
• How to collect table histogram information quickly and correctly
– Full table scan: correct, but slow, especially for big data
– Possible method: Sampling Counting
• Linear, LogLog, Adaptive, HyperLogLog, HyperLogLog++, etc.
• Expression Statistics
– Currently only raw columns’ statistics are collected, not those of derived columns
– Derived columns come from the calculation of expressions
• Ex: Alias Column, Aggregation Expression, Arithmetic Expression, UDF
• Collect real-world runtime statistics for future
query plan optimization.
– Continuous feedback optimization
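Of the counting methods named in the future-work list, linear counting is the simplest to sketch: hash every value into a bitmap of m bits and estimate the number of distinct values as -m · ln(fraction of bits still zero). The code below is a minimal illustration under assumptions of its own (SHA-1 as the hash function, a 1024-bit map, the name `linear_counting_ndv`); it is not taken from the talk.

```python
import hashlib
import math
from typing import Any, Iterable


def linear_counting_ndv(values: Iterable[Any], num_bits: int = 1024) -> float:
    """Approximate the number of distinct values with a fixed-size bitmap."""
    bitmap = [False] * num_bits
    for v in values:
        digest = hashlib.sha1(repr(v).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[slot] = True
    zero_bits = bitmap.count(False)
    if zero_bits == 0:
        # Bitmap saturated: the estimator is undefined, so report a lower bound.
        return float(num_bits)
    return -num_bits * math.log(zero_bits / num_bits)
```

The estimate depends only on which bits are set, so duplicates cost nothing extra, and memory stays fixed no matter how many rows are scanned. Accuracy degrades as the bitmap fills up, which is the motivation for the LogLog family listed alongside it.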
Page 23
THANK YOU.
ron.hu@huawei.com fang.cao@huawei.com
yizhen.liu@huawei.com

More Related Content

What's hot

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScyllaDB
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Similar to Enhancing Spark SQL Optimizer with Reliable Statistics

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Databricks
 
Processes in Query Optimization in (ABMS) Advanced Database Management Systems
Processes in Query Optimization in (ABMS) Advanced Database Management Systems Processes in Query Optimization in (ABMS) Advanced Database Management Systems
Processes in Query Optimization in (ABMS) Advanced Database Management Systems gamemaker762
 
Presentation top tips for getting optimal sql execution
Presentation    top tips for getting optimal sql executionPresentation    top tips for getting optimal sql execution
Presentation top tips for getting optimal sql executionxKinAnx
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cRonald Francisco Vargas Quesada
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in ImpalaCloudera, Inc.
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsZohar Elkayam
 
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...Datavail
 
Managing Statistics for Optimal Query Performance
Managing Statistics for Optimal Query PerformanceManaging Statistics for Optimal Query Performance
Managing Statistics for Optimal Query PerformanceKaren Morton
 
Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007paulguerin
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdfKalyankumarVenkat1
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
II B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxII B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxsabithabanu83
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesPhilip Goddard
 

Similar to Enhancing Spark SQL Optimizer with Reliable Statistics (20)

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Processes in Query Optimization in (ABMS) Advanced Database Management Systems
Processes in Query Optimization in (ABMS) Advanced Database Management Systems Processes in Query Optimization in (ABMS) Advanced Database Management Systems
Processes in Query Optimization in (ABMS) Advanced Database Management Systems
 
Presentation top tips for getting optimal sql execution
Presentation    top tips for getting optimal sql executionPresentation    top tips for getting optimal sql execution
Presentation top tips for getting optimal sql execution
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
 
Ali upload
Ali uploadAli upload
Ali upload
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
 
Oracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic FunctionsOracle Advanced SQL and Analytic Functions
Oracle Advanced SQL and Analytic Functions
 
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
 
Managing Statistics for Optimal Query Performance
Managing Statistics for Optimal Query PerformanceManaging Statistics for Optimal Query Performance
Managing Statistics for Optimal Query Performance
 
Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007Myth busters - performance tuning 101 2007
Myth busters - performance tuning 101 2007
 
Oracle_Analytical_function.pdf
Oracle_Analytical_function.pdfOracle_Analytical_function.pdf
Oracle_Analytical_function.pdf
 
Query processing System
Query processing SystemQuery processing System
Query processing System
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
II B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptxII B.Sc IT DATA STRUCTURES.pptx
II B.Sc IT DATA STRUCTURES.pptx
 
DB
DBDB
DB
 
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkJen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 

Enhancing Spark SQL Optimizer with Reliable Statistics

  • 1. Enhancing Spark SQL Optimizer with Reliable Statistics Ron Hu, Fang Cao, Min Qiu*, Yizhen Liu Huawei Technologies, Inc. * Former Huawei employee
  • 2. Agenda • Review of Catalyst Architecture • Rule-based optimizations • Reliable statistics collected • Cost-based rules • Future Work • Q & A Page 2
  • 3. Catalyst Architecture [Figure annotation: Spark optimizes the query plan here] Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a Databricks engineering blog Page 3
  • 4. Rule-Based Optimizer in Spark SQL • Most of the Spark SQL optimizer’s rules are heuristic rules. – They do NOT consider the cost of each operator – They do NOT consider the cost of the equivalent logical plans • Join order is decided by table position in the SQL query • Join type is based on some very simple system assumptions • Number of shuffle partitions is a fixed number. • Our community work: – Ex.: Fixed bugs in Spark. – Spark Summit East 2016 talk, https://spark-summit.org/east-2016/events/enhancements-on-spark-sql-optimizer/ Page 4
  • 5. Statistics Collected • Collect Table Statistics information • Collect Column Statistics information • Only consider static system statistics (configuration file: CPU, Storage, Network) at this stage. • Goal: – Calculate the cost for each database operator • in terms of number of output rows, size of output rows, etc. – Based on the cost calculation, adjust the query execution plan Page 5
  • 6. Table Statistics Collected • Use a modified Hive Analyze Table statement to collect statistics of a table. – Ex: Analyze Table lineitem compute statistics • It collects table-level statistics and saves them into the metastore. – Number of rows – Number of files – Table size in bytes Page 6
  • 7. Column Statistics Collected • Use the Analyze statement to collect column-level statistics of individual columns. – Ex: Analyze Table lineitem compute statistics for columns l_orderkey, l_partkey, l_suppkey, l_returnflag, l_linestatus, l_shipdate, …….. • It collects column-level statistics and saves them into the metastore. – Minimal value, maximal value – Number of distinct values, number of null values – Column maximal length, column average length – Uniqueness of a column Page 7
  • 8. Column 1-D Histogram Provided two kinds of histograms: Equi-Width and Equi-Depth - Between buckets, data distribution is determined by histograms - Within one bucket, still assume data is evenly distributed Max number of buckets: 256 - If Number of Distinct Values <= 256, use equi-width - If Number of Distinct Values > 256, use equi-depth Used Hive Analyze Command and Hive Metastore API [Figures: Equi-Width and Equi-Depth histograms, each plotting Frequency against Column interval] Page 8
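
The bucket rule on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the histograms are built in memory from a plain list, and nothing here reflects how the authors' Hive Analyze extension actually stores buckets in the metastore.

```python
MAX_BUCKETS = 256  # bucket cap stated on the slide


def choose_histogram_kind(ndv: int) -> str:
    """Slide rule: equi-width when NDV <= 256, otherwise equi-depth."""
    return "equi-width" if ndv <= MAX_BUCKETS else "equi-depth"


def equi_width(values, num_buckets):
    """Split [min, max] into equal-length intervals and count the rows in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1  # fall back to 1 when all values are equal
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)  # max value goes in last bucket
        counts[idx] += 1
    return counts


def equi_depth(values, num_buckets):
    """Return bucket upper bounds so that each bucket holds roughly the same row count."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = []
    for i in range(1, num_buckets + 1):
        idx = max(0, min(n - 1, (i * n) // num_buckets - 1))
        bounds.append(ordered[idx])
    return bounds
```

With skewed data, the equi-depth bounds crowd together where the values are dense, which is why it is the better fit once a column has more distinct values than buckets.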
  • 9. Column 2-D Histogram • Developed 2-dimensional equi-depth histogram for the column combination of (c1, c2) – In a 2-dimensional histogram, there are 2 levels of buckets. – B(c1) is the number of major buckets for column C1. – Within each C1 bucket, B(c2) is the number of buckets for C2 • Lessons Learned: – Users do not use 2-D histogram often as they do not know which 2 columns are correlated. – What granularity to use? 256 buckets or 256x256 buckets? – Difficult to extend to 3-D or more dimensions – Can be replaced by hints Page 9
  • 10. Cost-Based Rules • Optimizer is a RuleExecutor. – Individual optimization is defined as Rule • We added new rules to estimate number of output rows and output size in bytes for each execution operator: – MetastoreRelation, Filter, Project, Join, Sort, Aggregate, Exchange, Limit, Union, etc. • The node’s cost = nominal scale of (output_rows, output_size) Page 10
  • 11. Filter Operator Statistics • Between a Filter’s expressions: AND, OR, NOT • In each expression: =, <, <=, >, >=, like, in, etc. • Currently supported types in expressions – For <, <=, >, >=: String, Integer, Double, etc. – For =: String, Integer, Double and Date types, User-Defined Types, etc. • Sample: A <= B – Based on the min/max/NDV values of A and B, decide the relationship between A and B, then determine what the new min/max/NDV of A and B should be after the expression is applied – We use histograms to adjust min/max/NDV values – Assume all the data is evenly distributed if there is no histogram information. Page 11
  • 12. Filter Operator Example • Column A (op) Data B – (op) can be “=”, “<”, “<=”, “>”, “>=”, “like” – Predicates of the form “l_orderkey = 3” or “l_shipdate <= '1995-03-21'” – The column’s max/min/distinct values should be updated – Sample: Column A < value B • B at or below A.min: Filtering Factor = 0%; no need to change A’s statistics, because no rows of A reach later operators • B above A.max: Filtering Factor = 100%; no need to change A’s statistics • B inside [A.min, A.max], with histograms: Filtering Factor = calculated from the histograms; A.min = no change; A.max = B.value; A.ndv = A.ndv * Filtering Factor • B inside [A.min, A.max], without histograms (data assumed evenly distributed): Filtering Factor = (B.value – A.min) / (A.max – A.min); A.min = no change; A.max = B.value; A.ndv = A.ndv * Filtering Factor [Figure: histogram of frequency per value range 1–5, 6–10, 11–15, 16–20, 21–25] Page 12
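
The no-histogram case on slide 12 is simple enough to write out. The sketch below assumes numeric columns and uses a hypothetical `ColumnStats` record; it follows the slide's formulas for `A < B.value` and is not the authors' Catalyst rule.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical holder for the per-column statistics named on the slides."""
    min: float
    max: float
    ndv: float


def filter_less_than(a: ColumnStats, b_value: float):
    """Return (filtering_factor, updated_stats) for `A < b_value`, assuming uniform data."""
    if b_value <= a.min:
        return 0.0, a  # nothing passes; statistics left unchanged as on the slide
    if b_value > a.max:
        return 1.0, a  # everything passes; statistics unchanged
    factor = (b_value - a.min) / (a.max - a.min)
    # A.min = no change, A.max = B.value, A.ndv = A.ndv * Filtering Factor
    return factor, replace(a, max=b_value, ndv=a.ndv * factor)
```

For a column with min 0, max 100 and 50 distinct values, the predicate `A < 25` gives a filtering factor of 0.25, a new max of 25, and an estimated 12.5 distinct values.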
  • 13. Filter Operator Example • Column A (op) Column B – Based on observation, this kind of expression appears in Project, but not in Filter – Note: histograms are currently not supported for column-to-column comparison. We cannot assume the data is evenly distributed, so the empirical filtering factor is set to 1/3 – (op) can be “<”, “<=”, “>”, “>=” – The min/max/NDV of A and B need to be adjusted after filtering – Sample: Column A < Column B [Figure: four diagrams of the relative value ranges of A and B, labeled: A filtering = 100%, B filtering = 100%; A filtering = 0%, B filtering = 0%; A filtering = 33.3%, B filtering = 33.3%; A filtering = 33.3%, B filtering = 33.3%] Page 13
  • 14. Join Order • Only for two table joins • We calculate the cost of Hash Join using the stats of left and right nodes. – Nominal Cost = <nominal-rows> × 0.7 + <nominal-size> × 0.3 • Choose lower-cost child as build side of hash join (Prior to Spark 1.5). Page 14
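
As a rough illustration of the slide 14 formula, the following sketch computes the nominal cost and picks a build side. The slides do not say how `<nominal-rows>` and `<nominal-size>` are scaled, so this code assumes they arrive already normalized; the function names are hypothetical.

```python
def nominal_cost(nominal_rows, nominal_size, row_weight=0.7, size_weight=0.3):
    """Nominal Cost = <nominal-rows> x 0.7 + <nominal-size> x 0.3 (weights from the slide)."""
    return nominal_rows * row_weight + nominal_size * size_weight


def pick_build_side(left, right):
    """left and right are (nominal_rows, nominal_size) pairs for the two join children.

    Returns "left" or "right": the lower-cost child becomes the hash join build side.
    """
    return "left" if nominal_cost(*left) <= nominal_cost(*right) else "right"
```

Weighting rows more heavily than bytes means a child with many narrow rows can still lose to one with fewer, wider rows.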
  • 15. Multi-way Join Reorder • Currently Spark SQL’s Join order is not decided by the cost of multi-way join operations. • We decide the join order based on the output rows and output size of the intermediate table. – The join with smaller output is performed first. – Can benefit star join queries (like TPC-DS). • Using dynamic programming for join order Page 15
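
Slide 15 mentions dynamic programming without giving details, so the following is a textbook-style sketch rather than the authors' algorithm. It assumes left-deep plans, a cost equal to the sum of intermediate output rows, and a hypothetical cardinality model (product of table row counts times pairwise join selectivities).

```python
from itertools import combinations


def best_join_order(rows, selectivity):
    """rows: {table: row_count}; selectivity: {frozenset({a, b}): join selectivity}.

    Returns (total_intermediate_rows, plan), where plan is a nested tuple.
    """
    tables = sorted(rows)

    def card(subset):
        # Assumed model: independent predicates, missing edges act as cross products.
        out = 1.0
        for t in subset:
            out *= rows[t]
        for a, b in combinations(sorted(subset), 2):
            out *= selectivity.get(frozenset((a, b)), 1.0)
        return out

    # best[subset] = (cost so far, plan) for the cheapest way to join that subset
    best = {frozenset([t]): (0.0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            out_rows = card(subset)
            candidates = []
            for t in combo:  # left-deep: add one table to the best smaller plan
                cost, plan = best[subset - {t}]
                candidates.append((cost + out_rows, (plan, t)))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(tables)]
```

With three tables where `customer`-`orders` joins down to 1,500 rows and `orders`-`lineitem` to 6,000, the search joins `customer` and `orders` first, matching the slide's rule that the join with the smaller output is performed first.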
  • 16. Sample: Q3, 3-table join + aggregate • ParquetRelation node • Filter node • Project node • Join node • Aggregation • Limit [Figure annotation: Build Right -> Build Left] Page 16
  • 17. Limitation without Key Information • Spark SQL does not support indexes or primary keys. – Without this information, the join output of a primary/foreign key join cannot be estimated properly. • When estimating the number of GROUP BY operator output records, we multiply the number of distinct values for each GROUP BY column. – This formula is valid only if every GROUP BY column is independent. Page 17
  • 18. Column Uniqueness • We know that a column is unique (or primary key) if the number of distinct values divided by the number of records of a table is close to 1.0. – We can set the size of hash join table properly if one join column is unique. – When computing the number of GROUP BY output records, if one GROUP BY column is unique, we do NOT multiply those non-unique columns. Page 18
  • 19. Unique Column Example, tpc-h Q10 • /* tpc-h Q10: c_custkey is unique */ • SELECT c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) • AS revenue, c_acctbal, n_name, c_address, c_phone, c_comment • FROM nation join customer join orders join lineitem • WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey • AND o_orderdate >= '1993-10-01' AND o_orderdate < '1994-01-01' • AND l_returnflag = 'R' AND c_nationkey = n_nationkey • GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment • ORDER BY revenue DESC limit 20 Number of group-by outputs can be: • 1708M if there is no unique column information, • 82K if we know there is a unique group-by column Page 19
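
The GROUP BY estimate from slides 17 to 19 can be sketched as follows. The 0.95 threshold for "close to 1.0" and the choice to take the largest NDV when several columns look unique are assumptions made for this illustration; the slides do not specify either.

```python
def is_unique(ndv, row_count, threshold=0.95):
    """Treat a column as unique when NDV / row count is close to 1.0 (threshold assumed)."""
    return row_count > 0 and ndv / row_count >= threshold


def estimate_group_by_rows(row_count, column_ndvs, threshold=0.95):
    """column_ndvs maps each GROUP BY column name to its number of distinct values."""
    unique_ndvs = [n for n in column_ndvs.values() if is_unique(n, row_count, threshold)]
    if unique_ndvs:
        # A unique column already determines the group, so skip the non-unique columns.
        return max(unique_ndvs)
    estimate = 1
    for ndv in column_ndvs.values():
        estimate *= ndv  # independence assumption from slide 17
    return estimate
```

This is the effect behind the Q10 numbers on slide 19: without uniqueness information the NDVs of every GROUP BY column are multiplied together, while knowing that `c_custkey` is unique collapses the estimate to that one column.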
  • 20. SQL Hints • Some information cannot be analyzed directly from the statistics of tables/columns. Example, tpc-h Q13: SELECT c_count, count(*) as custdist FROM (SELECT c_custkey, count(o_orderkey) c_count FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey and o_comment not like '%special%request%' GROUP BY c_custkey ) c_orders GROUP BY c_count ORDER BY custdist desc, c_count desc – Supported hints /*+ …. */: Like_FilterFactor, NDV_Correlated_Columns, Join_Build, Join_Type, …… Page 20
  • 21. Actual vs Estimated Output Rows
      Query   Actual   Estimated
      Q1      4        6
      Q2      460      1756
      Q3      11621    496485
      Q4      5        5
      Q5      5        25
      Q6      1        1
      Q7      4        5
      Q8      2        5
      Q9      175      222
      Q10     37967    81611
      Q11     28574    32000
      Q12     2        2
      Q13     42       100
      Q14     1        1
      Q15     1        2
      Q16     18314    14700
      Q17     1        1
      Q18     57       1621
      Q19     1        1
      Q20     186      558
      Q21     411      558
    Page 21
  • 22. Wrong Output Rows Estimate for Q3 • We do not handle the correlated columns of different tables. TPC-H Q3: select l_orderkey, sum(l_extendedprice *(1 - l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'BUILDING' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-3-15' and l_shipdate > date '1995-3-15' group by l_orderkey, o_orderdate, o_shippriority order by l_orderkey, revenue desc, o_orderdate Page 22
  • 23. Possible Future Work • How to collect table histograms information quickly and correctly – For full table scan – correct, but slow, especially for big data – Possible method – Sampling Counting • Linear, LogLog, Adaptive, Hyper LogLog, Hyper LogLog++, etc • Expression Statistics – Now only raw columns’ statistics are collected. Not for the derived columns – Derived columns from calculation of expressions • Ex: Alias Column, Aggregation Expression, Arithmetic Expression, UDF • Collecting the real-world running statistics information, for the future query plan optimization. – Continuous feedback optimization Page 23