Ron Hu, Zhenhua Wang
Huawei Technologies, Inc.
Cardinality Estimation through
Histogram in Apache Spark 2.3
#DevSAIS13
Agenda
• Catalyst Architecture
• Cost Based Optimizer in Spark 2.2
• Statistics Collected
• Histogram Support in Spark 2.3
• Configuration Parameters
• Q & A
Catalyst Architecture
Spark optimizes the query plan here.
Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a Databricks engineering blog
Query Optimizer in Spark SQL
• Spark SQL’s query optimizer is based on both
rules and cost.
• Most of the Spark SQL optimizer’s rules are
heuristic rules.
– PushDownPredicate, ColumnPruning,
ConstantFolding, …
• Cost based optimization (CBO) was added in
Spark 2.2.
Cost Based Optimizer in Spark 2.2
• It was a good, working CBO framework to start
with.
• Focused on
– Statistics collection,
– Cardinality estimation,
– Build side selection, broadcast vs. shuffled join, join
reordering, etc.
• Used a heuristic formula for the cost function, in terms
of the cardinality and data size of each operator.
Statistics Collected
• Collect Table Statistics information
• Collect Column Statistics information
• Goal:
– Calculate the cost for each operator in terms of
number of output rows, size of output, etc.
– Based on the cost calculation, adjust the query
execution plan
Table Statistics Collected
• Command to collect the statistics of a table:
– Ex: ANALYZE TABLE table-name COMPUTE
STATISTICS
• It collects table-level statistics and saves them into
the metastore:
– Number of rows
– Table size in bytes
Column Statistics Collected
• Command to collect column-level statistics of individual columns:
– Ex: ANALYZE TABLE table-name COMPUTE STATISTICS
FOR COLUMNS column-name1, column-name2, …
• It collects column-level statistics and saves them into the metastore.
String/Binary type
✓ Distinct count
✓ Null count
✓ Average length
✓ Max length
Numeric/Date/Timestamp type
✓ Distinct count
✓ Max
✓ Min
✓ Null count
✓ Average length (fixed length)
✓ Max length (fixed length)
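The two ANALYZE commands above can be generated for any table. The following is a minimal sketch; the helper name `analyze_statements` and the table and column names are hypothetical, and it only builds the SQL text. Assuming a SparkSession named `spark`, each string could then be passed to `spark.sql(...)`.

```python
def analyze_statements(table, columns=()):
    """Build the table-level and (optionally) column-level ANALYZE statements."""
    # Table-level statistics: number of rows and table size in bytes.
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        # Column-level statistics: distinct count, null count, min/max, lengths.
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Hypothetical table "customers" with columns "age" and "city".
for statement in analyze_statements("customers", ["age", "city"]):
    print(statement)
```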
Real World Data Are Often Skewed
#DevSAIS13 – Cardinality Estimation by Hu and Wang
Histogram Support in Spark 2.3
• A histogram is effective in handling
a skewed distribution.
• We developed the equi-height histogram
in Spark 2.3.
• An equi-height histogram is better than
an equi-width histogram for skewed data:
– An equi-height histogram can use multiple
buckets to represent a very skewed value.
– An equi-width histogram cannot give the right
frequency when a skewed value falls in the
same bucket as other values.
Figure: an equi-width histogram (frequency vs. column interval) next to an equi-height histogram (frequency density vs. column interval).
Histogram Algorithm
– Each histogram has a default of 254 buckets.
• The height of a histogram is the number of non-null values divided
by the number of buckets.
– Each histogram bucket contains
• The range of values of the bucket (its lower and upper end points)
• The number of distinct values in the bucket
– We use two table scans to generate the equi-height
histograms for all columns specified in the ANALYZE
command.
• The first scan uses the ApproximatePercentile class to get the end points of all
histogram buckets.
• The second scan uses the HyperLogLog++ algorithm to compute the number of
distinct values in each bucket.
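To make the bucket layout concrete, here is a small in-memory sketch of an equi-height histogram. It is a simplification, not Spark's implementation: it sorts the data and counts distinct values exactly, where Spark uses ApproximatePercentile and HyperLogLog++. The function name is hypothetical, it assumes at least as many non-null values as buckets, and the sample data is the 25-row age column used in the filter examples of this deck.

```python
def equi_height_histogram(values, num_buckets):
    """Return (height, buckets) for an exact equi-height histogram."""
    data = sorted(v for v in values if v is not None)
    height = len(data) / num_buckets  # non-null rows per bucket
    buckets = []
    lo = data[0]  # the first bucket starts at the column minimum
    for i in range(num_buckets):
        rows = data[round(i * height):round((i + 1) * height)]
        hi = rows[-1]  # upper end point of this bucket
        buckets.append({"lo": lo, "hi": hi, "ndv": len(set(rows))})
        lo = hi  # the next bucket starts where this one ends
    return height, buckets


ages = [20, 21, 23, 24, 25, 25, 27, 27, 27, 28, 28, 28, 28, 28, 28,
        29, 36, 36, 39, 40, 45, 47, 55, 63, 80]
height, buckets = equi_height_histogram(ages, 5)
print(height)
for bucket in buckets:
    print(bucket)
```

Note how the skewed value 28 occupies a bucket of its own (lo = hi = 28), which is what lets an equi-height histogram represent it accurately.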
Filter Cardinality Estimation
• Between logical expressions: AND, OR, NOT
• Within each logical expression: =, <, <=, >, >=, in, etc.
• Currently supported types in expressions
– For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc.
– For =, <=>: String, Integer, Double, Date, Timestamp, etc.
• Example: A <= B
– Based on the min/max/distinct count/null count values of A and B, decide
the relationship between A and B. After evaluating this
expression, we set the new min/max/distinct count/null count.
– Assume all the data is evenly distributed if no histogram
information is available.
Filter Operator without Histogram
• Column A (op) literal B
– (op) can be "=", "<", "<=", ">", ">=", "like"
– The column’s max/min/distinct count/null count should be updated
– Example: Column A < value B
• If B lies below A.min: Filtering Factor = 0%; A’s statistics need to change.
• If B lies above A.max: Filtering Factor = 100%; no need to change A’s statistics.
• If B lies between A.min and A.max:
Filtering Factor = (B.value - A.min) / (A.max - A.min)
A.min = no change
A.max = B.value
A.ndv = A.ndv * Filtering Factor
• Without histogram, we prorate over
the entire column range.
• This works well only if the data is evenly distributed.
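The prorating rule for a range predicate without a histogram can be sketched as follows. This is an illustration of the formula on this slide, not Spark's code: the class and function names are hypothetical, and setting ndv to 0 in the 0% case is an assumption about how the statistics "need to change".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnStats:
    min: float
    max: float
    ndv: float


def filter_less_than(stats, literal):
    """Return (filtering factor, updated stats) for `column < literal`."""
    if literal <= stats.min:
        return 0.0, replace(stats, ndv=0.0)  # nothing qualifies (assumed update)
    if literal > stats.max:
        return 1.0, stats  # everything qualifies, statistics unchanged
    factor = (literal - stats.min) / (stats.max - stats.min)
    return factor, replace(stats, max=literal, ndv=stats.ndv * factor)


def filter_greater_than(stats, literal):
    """Mirror image of filter_less_than for `column > literal`."""
    if literal >= stats.max:
        return 0.0, replace(stats, ndv=0.0)
    if literal < stats.min:
        return 1.0, stats
    factor = (stats.max - literal) / (stats.max - stats.min)
    return factor, replace(stats, min=literal, ndv=stats.ndv * factor)


age = ColumnStats(min=20, max=80, ndv=17)
factor, updated = filter_greater_than(age, 40)
print(round(25 * factor, 1), updated)  # estimated rows out of 25, new stats
```

With min = 20, max = 80 and 25 rows, `age > 40` gives 25 * (80 - 40) / (80 - 20), about 16.7 rows, which is far from exact when the data is skewed.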
Filter Operator with Histogram
• With histogram, we check the range values of each
bucket to see if it should be included in the
estimation.
• We prorate only the boundary bucket.
• This improves the accuracy of the
estimation, since we prorate (or guess) over only the
much smaller set of records in a single bucket.
Histogram for Filter Example 1
Age distribution of a restaurant:
• Estimate row count for
predicate “age > 40”. Correct
answer is 5.
• Without histogram, estimate:
25 * (80 – 40)/(80 – 20) = 16.7
• With histogram, estimate:
1.0 * // only 5th bucket
5 // 5 records per bucket
= 5
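Age histogram (5 buckets, 5 records per bucket):
Bucket 1 [20, 25]: 20, 21, 23, 24, 25 (ndv=5)
Bucket 2 [25, 28]: 25, 27, 27, 27, 28 (ndv=3)
Bucket 3 [28, 28]: 28, 28, 28, 28, 28 (ndv=1)
Bucket 4 [28, 40]: 29, 36, 36, 39, 40 (ndv=4)
Bucket 5 [40, 80]: 45, 47, 55, 63, 80 (ndv=5)
Bucket boundaries: 20, 25, 28, 28, 40, 80
Total row count: 25
age min = 20
age max = 80
age ndv = 17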
Histogram for Filter Example 2
Age distribution of a restaurant:
• Estimate row count for predicate
“age = 28”. Correct answer is 6.
• Without histogram, estimate:
25 * 1 / 17 = 1.47
• With histogram, estimate:
( 1/3 // prorate the 2nd bucket
+ 1.0 // for 3rd bucket
) * 5 // 5 records per bucket
= 6.67
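The arithmetic of both filter examples can be reproduced with a short sketch. The bucket tuples are the (lower bound, upper bound, ndv) values of the age histogram; the function names are hypothetical, and the rule for deciding which buckets contain a value (lower bound exclusive, upper bound inclusive, plus single-value buckets) is an assumption chosen to match the slide's calculation rather than a statement of Spark's exact logic.

```python
# (lo, hi, ndv) for each bucket of the age histogram; 5 records per bucket.
BUCKETS = [(20, 25, 5), (25, 28, 3), (28, 28, 1), (28, 40, 4), (40, 80, 5)]
HEIGHT = 5


def estimate_greater_than(value):
    """Estimated row count for `age > value`."""
    total = 0.0
    for lo, hi, _ndv in BUCKETS:
        if hi <= value:
            continue  # bucket lies entirely at or below the value
        if lo >= value:
            total += 1.0  # bucket lies entirely above the value
        else:
            total += (hi - value) / (hi - lo)  # prorate the boundary bucket
    return total * HEIGHT


def estimate_equal(value):
    """Estimated row count for `age = value`."""
    total = 0.0
    for index, (lo, hi, ndv) in enumerate(BUCKETS):
        single_value = lo == hi == value
        in_range = lo < value <= hi or (index == 0 and value == lo)
        if single_value or in_range:
            total += 1.0 / ndv  # assume the bucket's rows spread evenly over its ndv
    return total * HEIGHT


print(estimate_greater_than(40))     # Example 1
print(round(estimate_equal(28), 2))  # Example 2
```

For `age = 28` the sketch adds 1/3 of the second bucket and all of the third, single-value bucket, the same two terms as in the calculation above.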
Age histogram: the same 25 rows and five buckets as in Example 1 (total row count 25, age min = 20, age max = 80, age ndv = 17).
Join Cardinality without Histogram
• Inner-Join: The number of rows of “A join B on A.k1 = B.k1” is
estimated as:
num(A ⋈ B) = num(A) * num(B) / max(distinct(A.k1),
distinct(B.k1)),
– where num(A) is the number of records in table A, and distinct is the number of
distinct values of that column.
– The underlying assumption for this formula is that each value of the smaller
domain is included in the larger domain.
– It also assumes a uniform distribution over the entire range of both join columns.
• We similarly estimate cardinalities for Left-Outer Join, Right-Outer
Join and Full-Outer Join
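The inner-join formula is a one-liner; the sketch below (function name hypothetical) evaluates it for two tables with 25 and 20 rows whose join columns both have 17 distinct values.

```python
def inner_join_cardinality(rows_a, rows_b, ndv_a, ndv_b):
    """num(A) * num(B) / max(distinct(A.k1), distinct(B.k1))."""
    return rows_a * rows_b / max(ndv_a, ndv_b)


estimate = inner_join_cardinality(rows_a=25, rows_b=20, ndv_a=17, ndv_b=17)
print(round(estimate, 1))
```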
Join Cardinality without Histogram
Table A, join column k1:
Total row count: 25
k1 min = 20
k1 max = 80
k1 ndv = 17
Values: 20, 21, 23, 24, 25, 25, 27, 27, 27, 28, 28, 28, 28, 28, 28, 29, 36, 36, 39, 40, 45, 47, 55, 63, 80
Table B, join column k1:
Total row count: 20
k1 min = 20
k1 max = 90
k1 ndv = 17
Values: 20, 21, 21, 25, 26, 28, 28, 30, 36, 39, 45, 50, 55, 60, 65, 70, 75, 80, 90, 90
Without histogram, join cardinality estimate is 25 * 20 / 17 = 29.4
The correct answer is 20.
Join Cardinality with Histogram
• The number of rows of “A join B on A.k1 = B.k1” is estimated as:
num(A ⋈ B) = Σ_{i,j} num(A_i) * num(B_j) / max(ndv(A_i.k1), ndv(B_j.k1))
– where num(A_i) is the number of records in bucket i of table A, and ndv is the number
of distinct values of that column in the corresponding bucket.
– We compute the join cardinality bucket by bucket, and then add up the total
count.
• If the buckets of the two join tables do not align,
– We split the bucket on the boundary values into more than one bucket.
– In the split buckets, we prorate ndv and bucket height based on the boundary
values of the newly split buckets, assuming a uniform distribution within a given
bucket.
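The two steps above, prorating a split bucket and summing over aligned buckets, can be sketched as follows. All names and numbers are hypothetical illustrations; the sketch assumes the buckets of both tables have already been aligned to the same boundaries, so that position k in each list covers the same value range.

```python
def prorate(lo, hi, height, ndv, new_lo, new_hi):
    """Height and ndv of the slice [new_lo, new_hi] cut from bucket [lo, hi],
    assuming values are uniformly distributed within the bucket."""
    if hi == lo:
        return float(height), float(ndv)  # single-value bucket: nothing to split
    fraction = (new_hi - new_lo) / (hi - lo)
    return height * fraction, ndv * fraction


def join_cardinality(aligned_a, aligned_b):
    """Sum num(A_i) * num(B_i) / max(ndv(A_i), ndv(B_i)) over aligned buckets.
    Each list holds (row count, ndv) pairs for the same sequence of ranges."""
    total = 0.0
    for (rows_a, ndv_a), (rows_b, ndv_b) in zip(aligned_a, aligned_b):
        largest_ndv = max(ndv_a, ndv_b)
        if largest_ndv > 0:  # skip empty slices
            total += rows_a * rows_b / largest_ndv
    return total


# Hypothetical: a bucket [20, 40] of table A (10 rows, ndv 8) is split at 30
# to line up with table B's boundaries.
left = prorate(20, 40, 10, 8, 20, 30)
right = prorate(20, 40, 10, 8, 30, 40)
table_b = [(6.0, 3.0), (2.0, 2.0)]  # hypothetical buckets [20, 30] and [30, 40]
print(left, right, join_cardinality([left, right], table_b))
```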
Aligning Histogram Buckets for Join
• Form new buckets to align buckets properly
Figure: histogram buckets of Table A and Table B on join column k1, before and after alignment. The legend distinguishes original bucket boundaries from extra new bucket boundaries added to form additional buckets; one bucket is marked as excluded from the computation.
Table A, join column k1, aligned histogram buckets:
Total row count: 25
min = 20, max = 80
ndv = 17
[20, 25]: 20, 21, 23, 24, 25 (ndv=5)
[25, 28]: 25, 27, 27, 27, 28 (ndv=3)
[28, 28]: 28, 28, 28, 28, 28 (ndv=1)
[28, 30]: 29 (ndv=1)
[30, 40]: 36, 36, 39, 40 (ndv=3)
[40, 50]: 45, 47 (ndv=2)
[50, 70]: 55, 63 (ndv=2)
[70, 80]: 80 (ndv=1)
Table B, join column k1, aligned histogram buckets:
Total row count: 20
min = 20, max = 90
ndv = 17
[20, 25]: 20, 21, 21, 25 (ndv=3)
[25, 28]: 26 (ndv=1)
[28, 28]: 28, 28 (ndv=1)
[28, 30]: 30 (ndv=1)
[30, 40]: 36, 39 (ndv=2)
[40, 50]: 45, 50 (ndv=2)
[50, 70]: 55, 60, 65, 70 (ndv=4)
[70, 80]: 75, 80 (ndv=2)
[80, 90]: 90, 90 (ndv=1)
- With histogram, the join cardinality estimate is 21.8, obtained by
computing the cardinality of the aligned buckets one by one.
- Without histogram, the join cardinality estimate is 29.4.
- The correct answer is 20.
Other Operator Estimation
• Project: does not change row count
• Aggregate: consider uniqueness of group-by
columns
• Limit, Sample, etc.
Statistics Propagation
Diagram: Join (t1.a = t2.b) over Scan t1 and Scan t2
• Scan t1 provides a: min, max, ndv, …
• Scan t2 provides b: min, max, ndv, …
• The join outputs a: newMin, newMax, newNdv, … and b: newMin, newMax, newNdv, …
• Statistics requests go top-down; statistics propagate bottom-up.
Statistics inference
• Statistics collected:
– Number of records for a table
– Number of distinct values for a column
• Can make these inferences:
– If the above two numbers are close, we can infer that the
column is a unique key.
– Can infer if it is a primary-key to foreign-key join.
– Can detect if a star schema exists.
– Can help determine the output size of a group-by operator if
multiple columns of the same table appear in the group-by
expression.
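The first inference, treating a column as a unique key when its distinct count is close to the row count, could look like the sketch below. The function name and the 5% tolerance are assumptions made for illustration (the tolerance mirrors the default of spark.sql.statistics.ndv.maxError); the source does not state how "close" is defined.

```python
def looks_unique(row_count, ndv, tolerance=0.05):
    """True if ndv is within `tolerance` (relative) of the row count."""
    if row_count == 0:
        return False
    return abs(row_count - ndv) <= tolerance * row_count


print(looks_unique(1000, 980))  # ndv is close to the row count
print(looks_unique(25, 17))     # many repeated values
```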
Configuration Parameters
Configuration Parameter                        Default Value   Suggested Value
spark.sql.cbo.enabled                          False           True
spark.sql.cbo.joinReorder.enabled              False           True
spark.sql.cbo.joinReorder.dp.threshold         12              12
spark.sql.cbo.joinReorder.card.weight          0.7             0.7
spark.sql.statistics.size.autoUpdate.enabled   False           True
spark.sql.statistics.histogram.enabled         False           True
spark.sql.statistics.histogram.numBins         254             254
spark.sql.statistics.ndv.maxError              0.05            0.05
spark.sql.statistics.percentile.accuracy       10000           10000
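As a sketch of how the suggested values might be applied, the snippet below collects the four parameters whose suggested value differs from the default and renders them as `--conf key=value` arguments for spark-submit. The dictionary and helper names are hypothetical; the same keys could instead be set on a session, for example with `spark.conf.set(key, value)`, assuming a SparkSession named `spark`.

```python
# Parameters from the table whose suggested value differs from the default.
SUGGESTED_SETTINGS = {
    "spark.sql.cbo.enabled": "true",
    "spark.sql.cbo.joinReorder.enabled": "true",
    "spark.sql.statistics.size.autoUpdate.enabled": "true",
    "spark.sql.statistics.histogram.enabled": "true",
}


def to_submit_args(settings):
    """Render settings as a flat list of spark-submit `--conf` arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(SUGGESTED_SETTINGS)))
```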
Reference
• SPARK-16026: Cost-Based Optimizer
Framework
– https://issues.apache.org/jira/browse/SPARK-16026
– It has 45 sub-tasks.
• SPARK-21975: Histogram support in cost-based
optimizer
– https://issues.apache.org/jira/browse/SPARK-21975
– It has 10 sub-tasks.
Summary
• Cost Based Optimizer in Spark 2.2
• Statistics Collected
• Histogram Support in Spark 2.3
– Skewed data distributions are intrinsic in real world
data.
– Turn on histogram configuration parameter
“spark.sql.statistics.histogram.enabled” to deal with
skew.
Q & A
ron.hu@huawei.com
wangzhenhua@huawei.com

More Related Content

What's hot

Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
Databricks
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
Altinity Ltd
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Datastage Introduction To Data Warehousing
Datastage Introduction To Data WarehousingDatastage Introduction To Data Warehousing
Datastage Introduction To Data Warehousing
Vibrant Technologies & Computers
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 

What's hot (20)

Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
On Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQLOn Improving Broadcast Joins in Apache Spark SQL
On Improving Broadcast Joins in Apache Spark SQL
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Datastage Introduction To Data Warehousing
Datastage Introduction To Data WarehousingDatastage Introduction To Data Warehousing
Datastage Introduction To Data Warehousing
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 

Similar to Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and Zhenhua Wang

Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
Jen Aman
 
Time Series.pptx
Time Series.pptxTime Series.pptx
Time Series.pptx
Ramakrishna Reddy Bijjam
 
Summarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdfSummarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdf
JustynOwen
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Databricks
 
Monte Carlo Simulation for project estimates v1.0
Monte Carlo Simulation for project estimates v1.0Monte Carlo Simulation for project estimates v1.0
Monte Carlo Simulation for project estimates v1.0
PMILebanonChapter
 
final Line balancing slide12.ppt
final Line balancing slide12.pptfinal Line balancing slide12.ppt
final Line balancing slide12.ppt
xicohos114
 
IRJET- Wallace Tree Multiplier using MFA Counters
IRJET-  	  Wallace Tree Multiplier using MFA CountersIRJET-  	  Wallace Tree Multiplier using MFA Counters
IRJET- Wallace Tree Multiplier using MFA Counters
IRJET Journal
 
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
The Statistical and Applied Mathematical Sciences Institute
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning
AlexAman1
 
Matlab ch1 (4)
Matlab ch1 (4)Matlab ch1 (4)
Matlab ch1 (4)
mohsinggg
 
Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...
Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...
Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...
rahulmonikasharma
 
Ashish garg research paper 660_CamReady
Ashish garg research paper 660_CamReadyAshish garg research paper 660_CamReady
Ashish garg research paper 660_CamReadyAshish Garg
 
Reducing Structural Bias in Technology Mapping
Reducing Structural Bias in Technology MappingReducing Structural Bias in Technology Mapping
Reducing Structural Bias in Technology Mappingsatrajit
 
Final Project Report
Final Project ReportFinal Project Report
Final Project ReportRiddhi Shah
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
Andrii Gakhov
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
Evan Chan
 
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
Smarten Augmented Analytics
 
Cerebellar Model Articulation Controller
Cerebellar Model Articulation ControllerCerebellar Model Articulation Controller
Cerebellar Model Articulation Controller
Zahra Sadeghi
 
Matlab introduction
Matlab introductionMatlab introduction
Matlab introduction
Satish Gummadi
 
MariaDB 10.3 Optimizer - where does it stand
MariaDB 10.3 Optimizer - where does it standMariaDB 10.3 Optimizer - where does it stand
MariaDB 10.3 Optimizer - where does it stand
Sergey Petrunya
 

Similar to Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and Zhenhua Wang (20)

Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Time Series.pptx
Time Series.pptxTime Series.pptx
Time Series.pptx
 
Summarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdfSummarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdf
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Monte Carlo Simulation for project estimates v1.0
Monte Carlo Simulation for project estimates v1.0Monte Carlo Simulation for project estimates v1.0
Monte Carlo Simulation for project estimates v1.0
 
final Line balancing slide12.ppt
final Line balancing slide12.pptfinal Line balancing slide12.ppt
final Line balancing slide12.ppt
 
IRJET- Wallace Tree Multiplier using MFA Counters
IRJET-  	  Wallace Tree Multiplier using MFA CountersIRJET-  	  Wallace Tree Multiplier using MFA Counters
IRJET- Wallace Tree Multiplier using MFA Counters
 
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning
 
Matlab ch1 (4)
Matlab ch1 (4)Matlab ch1 (4)
Matlab ch1 (4)
 
Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...
Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...
Power Efficient and High Speed Carry Skip Adder using Binary to Excess One Co...
 
Ashish garg research paper 660_CamReady
Ashish garg research paper 660_CamReadyAshish garg research paper 660_CamReady
Ashish garg research paper 660_CamReady
 
Reducing Structural Bias in Technology Mapping
Reducing Structural Bias in Technology MappingReducing Structural Bias in Technology Mapping
Reducing Structural Bias in Technology Mapping
 
Final Project Report
Final Project ReportFinal Project Report
Final Project Report
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
What is the KMeans Clustering Algorithm and How Does an Enterprise Use it to ...
 
Cerebellar Model Articulation Controller
Cerebellar Model Articulation ControllerCerebellar Model Articulation Controller
Cerebellar Model Articulation Controller
 
Matlab introduction
Matlab introductionMatlab introduction
Matlab introduction
 
MariaDB 10.3 Optimizer - where does it stand
MariaDB 10.3 Optimizer - where does it standMariaDB 10.3 Optimizer - where does it stand
MariaDB 10.3 Optimizer - where does it stand
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and Zhenhua Wang

  • 1. Ron Hu, Zhenhua Wang Huawei Technologies, Inc. Cardinality Estimation through Histogram in Apache Spark 2.3 #DevSAIS13
  • 2. Agenda
      • Catalyst Architecture
      • Cost Based Optimizer in Spark 2.2
      • Statistics Collected
      • Histogram Support in Spark 2.3
      • Configuration Parameters
      • Q & A
  • 3. Catalyst Architecture
      [Figure: Catalyst architecture diagram, annotated "Spark optimizes query plan here"]
      Reference: Deep Dive into Spark SQL’s Catalyst Optimizer, a Databricks engineering blog
  • 4. Query Optimizer in Spark SQL
      • Spark SQL’s query optimizer is based on both rules and cost.
      • Most of the Spark SQL optimizer’s rules are heuristic rules.
          – PushDownPredicate, ColumnPruning, ConstantFolding, ….
      • Cost based optimization (CBO) was added in Spark 2.2.
  • 5. Cost Based Optimizer in Spark 2.2
      • It was a good, working CBO framework to start with.
      • Focused on
          – Statistics collection,
          – Cardinality estimation,
          – Build side selection, broadcast vs. shuffled join, join reordering, etc.
      • Used a heuristic formula for the cost function in terms of the cardinality and data size of each operator.
  • 6. Statistics Collected
      • Collect table statistics information
      • Collect column statistics information
      • Goal:
          – Calculate the cost of each operator in terms of number of output rows, size of output, etc.
          – Based on the cost calculation, adjust the query execution plan
  • 7. Table Statistics Collected
      • Command to collect statistics of a table.
          – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS
      • It collects table level statistics and saves them into the metastore.
          – Number of rows
          – Table size in bytes
  • 8. Column Statistics Collected
      • Command to collect column level statistics of individual columns.
          – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, ….
      • It collects column level statistics and saves them into the metastore.
      String/Binary type
          ✓ Distinct count
          ✓ Null count
          ✓ Average length
          ✓ Max length
      Numeric/Date/Timestamp type
          ✓ Distinct count
          ✓ Max
          ✓ Min
          ✓ Null count
          ✓ Average length (fixed length)
          ✓ Max length (fixed length)
  • 9. Real World Data Are Often Skewed
      #DevSAIS13 – Cardinality Estimation by Hu and Wang
  • 10. Histogram Support in Spark 2.3
      • A histogram is effective in handling skewed distributions.
      • We developed the equi-height histogram in Spark 2.3.
      • An equi-height histogram is better than an equi-width histogram.
      • An equi-height histogram can use multiple buckets to represent a very skewed value.
      • An equi-width histogram cannot give the right frequency when a skewed value falls in the same bucket as other values.
      [Figure: an equi-width and an equi-height histogram side by side; axis labels: column interval, frequency, density]
  • 11. Histogram Algorithm
      – Each histogram has a default of 254 buckets.
          • The height of a histogram is the number of non-null values divided by the number of buckets.
      – Each histogram bucket contains
          • The range values of the bucket
          • The number of distinct values in the bucket
      – We use two table scans to generate the equi-height histograms for all columns specified in the ANALYZE command.
          • Use the ApproximatePercentile class to get the end points of all histogram buckets.
          • Use the HyperLogLog++ algorithm to compute the number of distinct values in each bucket.
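The bucket layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it sorts the data and counts distinct values exactly, whereas the slides say Spark uses ApproximatePercentile and HyperLogLog++ over two table scans. The names `Bucket` and `equi_height_histogram` are hypothetical, and the input is the 25-row age column from the filter examples in this deck.

```python
from typing import List, NamedTuple


class Bucket(NamedTuple):
    lo: float  # lower end point of the bucket
    hi: float  # upper end point of the bucket
    ndv: int   # number of distinct values in the bucket


def equi_height_histogram(values: List[float], num_buckets: int) -> List[Bucket]:
    """Exact equi-height histogram over non-null values.

    Assumes len(values) is a multiple of num_buckets to keep the sketch short.
    """
    data = sorted(values)
    height = len(data) // num_buckets  # rows per bucket
    buckets = []
    lo = data[0]
    for i in range(num_buckets):
        chunk = data[i * height:(i + 1) * height]
        buckets.append(Bucket(lo, chunk[-1], len(set(chunk))))
        lo = chunk[-1]  # the next bucket starts at this end point
    return buckets


ages = [20, 21, 23, 24, 25, 25, 27, 27, 27, 28, 28, 28, 28, 28, 28,
        29, 36, 36, 39, 40, 45, 47, 55, 63, 80]
for bucket in equi_height_histogram(ages, 5):
    print(bucket)
```

With 5 buckets of height 5, the end points come out as 20, 25, 28, 28, 40, 80, so the skewed value 28 gets a bucket of its own.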
  • 12. Filter Cardinality Estimation
      • Between logical expressions: AND, OR, NOT
      • In each logical expression: =, <, <=, >, >=, in, etc.
      • Currently supported types in expressions
          – For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc.
          – For =, <=>: String, Integer, Double, Date, Timestamp, etc.
      • Example: A <= B
          – Based on A’s and B’s min/max/distinct count/null count values, decide the relationship between A and B. After evaluating this expression, we set the new min/max/distinct count/null count.
          – Assume all the data is evenly distributed if there is no histogram information.
  • 13. Filter Operator without Histogram
      • Column A (op) literal B
          – (op) can be “=”, “<”, “<=”, “>”, “>=”, “like”
          – The column’s max/min/distinct count/null count should be updated
          – Example: Column A < value B
              • B is below A’s range: Filtering Factor = 0%, need to change A’s statistics
              • B is above A’s range: Filtering Factor = 100%, no need to change A’s statistics
              • B falls within A’s range: Filtering Factor = (B.value – A.min) / (A.max – A.min); A.min = no change; A.max = B.value; A.ndv = A.ndv * Filtering Factor
      • Without a histogram, we prorate over the entire column range.
      • This works only if the data is evenly distributed.
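The three cases for "Column A < value B" can be written out as a small Python sketch. It is not Spark's implementation: the function names and the dictionary layout for column statistics are hypothetical, and setting row count and ndv to zero in the 0% case is an assumption about what "need to change A's statistics" means.

```python
def less_than_factor(col_min, col_max, literal):
    """Filtering factor for `column < literal`, assuming uniformly distributed data."""
    if literal <= col_min:
        return 0.0  # literal is below the column range: nothing qualifies
    if literal > col_max:
        return 1.0  # literal is above the column range: everything qualifies
    return (literal - col_min) / (col_max - col_min)


def apply_less_than(stats, literal):
    """Return updated statistics (rows, min, max, ndv) after `column < literal`."""
    factor = less_than_factor(stats["min"], stats["max"], literal)
    if factor == 1.0:
        return dict(stats)  # no need to change the statistics
    updated = dict(stats)
    updated["rows"] = stats["rows"] * factor
    updated["ndv"] = stats["ndv"] * factor
    if factor > 0.0:
        updated["max"] = literal  # min is unchanged, max becomes the literal
    return updated


age_stats = {"rows": 25, "min": 20, "max": 80, "ndv": 17}
print(apply_less_than(age_stats, 40))
```

For the age column used later in the deck, `age < 40` gets a filtering factor of (40 - 20) / (80 - 20) = 1/3, regardless of how the values are actually distributed inside the range.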
  • 14. Filter Operator with Histogram
      • With a histogram, we check the range values of each bucket to see whether it should be included in the estimate.
      • We prorate only the boundary bucket.
      • This improves the accuracy of the estimate, since we prorate (or guess) over only the much smaller set of records in one bucket.
  • 15. Histogram for Filter Example 1
      Age distribution of a restaurant:
          Bucket 1, range 20 to 25: 20 21 23 24 25 (ndv=5)
          Bucket 2, range 25 to 28: 25 27 27 27 28 (ndv=3)
          Bucket 3, range 28 to 28: 28 28 28 28 28 (ndv=1)
          Bucket 4, range 28 to 40: 29 36 36 39 40 (ndv=4)
          Bucket 5, range 40 to 80: 45 47 55 63 80 (ndv=5)
          Total row count: 25, age min = 20, age max = 80, age ndv = 17
      • Estimate row count for predicate “age > 40”. Correct answer is 5.
      • Without histogram, estimate: 25 * (80 – 40)/(80 – 20) = 16.7
      • With histogram, estimate: 1.0 (only the 5th bucket) * 5 (5 records per bucket) = 5
  • 16. Histogram for Filter Example 2
      Age distribution of a restaurant: same histogram as in Example 1.
      • Estimate row count for predicate “age = 28”. Correct answer is 6.
      • Without histogram, estimate: 25 * 1 / 17 = 1.47
      • With histogram, estimate: (1/3 (prorate the 2nd bucket) + 1.0 (for the 3rd bucket)) * 5 (5 records per bucket) = 6.67
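The arithmetic of the two filter examples can be reproduced with a short Python sketch. It is a simplification written to match the slides' numbers, not Spark's estimator: buckets are treated as (lower, upper] ranges except for single-value buckets and the start of the first bucket, which is an assumption, and the function names are hypothetical.

```python
# (lower end point, upper end point, ndv) for each bucket of the age histogram
AGE_HISTOGRAM = [(20, 25, 5), (25, 28, 3), (28, 28, 1), (28, 40, 4), (40, 80, 5)]
HEIGHT = 5  # records per bucket: 25 rows / 5 buckets


def greater_than_rows(histogram, height, value):
    """Estimate rows for `column > value`, prorating only the boundary bucket."""
    buckets = 0.0
    for lo, hi, _ndv in histogram:
        if hi <= value:
            continue  # the whole bucket fails the predicate
        if lo >= value:
            buckets += 1.0  # the whole bucket passes
        else:
            buckets += (hi - value) / (hi - lo)  # boundary bucket, assumed uniform
    return buckets * height


def equal_to_rows(histogram, height, value):
    """Estimate rows for `column = value` as 1/ndv of each bucket holding the value."""
    buckets = 0.0
    for index, (lo, hi, ndv) in enumerate(histogram):
        single_value_bucket = lo == hi == value
        starts_first_bucket = index == 0 and value == lo
        if single_value_bucket or starts_first_bucket or lo < value <= hi:
            buckets += 1.0 / ndv
    return buckets * height


print(greater_than_rows(AGE_HISTOGRAM, HEIGHT, 40))       # 5.0
print(round(equal_to_rows(AGE_HISTOGRAM, HEIGHT, 28), 2))  # 6.67
```

Both results are much closer to the true counts (5 and 6) than the uniform-range estimates of 16.7 and 1.47.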
  • 17. Join Cardinality without Histogram
      • Inner join: the number of rows of “A join B on A.k1 = B.k1” is estimated as:
          $\mathrm{num}(A \bowtie B) = \dfrac{\mathrm{num}(A) \cdot \mathrm{num}(B)}{\max(\mathrm{distinct}(A.k1),\ \mathrm{distinct}(B.k1))}$
          – where num(A) is the number of records in table A, and distinct is the number of distinct values of that column.
          – The underlying assumption for this formula is that each value of the smaller domain is included in the larger domain.
          – It also assumes a uniform distribution over the entire range of both join columns.
      • We similarly estimate cardinalities for Left-Outer Join, Right-Outer Join and Full-Outer Join.
  • 18. Join Cardinality without Histogram
      Table A, join column k1: 20 21 23 24 25 25 27 27 27 28 28 28 28 28 28 29 36 36 39 40 45 47 55 63 80
          Total row count: 25, k1 min = 20, k1 max = 80, k1 ndv = 17
      Table B, join column k1: 20 21 21 25 26 28 28 30 36 39 45 50 55 60 65 70 75 80 90 90
          Total row count: 20, k1 min = 20, k1 max = 90, k1 ndv = 17
      Without histogram, join cardinality estimate is 25 * 20 / 17 = 29.4
      The correct answer is 20.
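The no-histogram formula is a one-liner; the sketch below simply plugs in the table-level numbers from the example (row counts 25 and 20, ndv 17 on both sides). The function name is hypothetical.

```python
def inner_join_rows(rows_a, ndv_a, rows_b, ndv_b):
    """Inner-join estimate: num(A) * num(B) / max(distinct(A.k1), distinct(B.k1))."""
    return rows_a * rows_b / max(ndv_a, ndv_b)


estimate = inner_join_rows(rows_a=25, ndv_a=17, rows_b=20, ndv_b=17)
print(round(estimate, 1))  # 29.4
```

Only four numbers go into the estimate, so any skew inside the two columns is invisible to it.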
  • 19. Join Cardinality with Histogram
      • The number of rows of “A join B on A.k1 = B.k1” is estimated as:
          $\mathrm{num}(A \bowtie B) = \sum_{i,j} \dfrac{\mathrm{num}(A_i) \cdot \mathrm{num}(B_j)}{\max(\mathrm{ndv}(A_i.k1),\ \mathrm{ndv}(B_j.k1))}$
          – where num(A_i) is the number of records in bucket i of table A, and ndv is the number of distinct values of that column in the corresponding bucket.
          – We compute the join cardinality bucket by bucket, and then add up the total count.
      • If the buckets of the two join tables do not align,
          – We split the bucket on the boundary values into more than one bucket.
          – In the split buckets, we prorate ndv and bucket height based on the boundary values of the newly split buckets, assuming a uniform distribution within a given bucket.
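A minimal Python sketch of the bucket-by-bucket sum follows. It assumes the two histograms have already been aligned, so bucket i of A covers the same range as bucket i of B; the splitting and prorating step is not shown. The function name and the bucket values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the slides.

```python
def histogram_join_rows(buckets_a, buckets_b):
    """Sum num(A_i) * num(B_i) / max(ndv(A_i), ndv(B_i)) over aligned bucket pairs.

    Each bucket is a (row_count, ndv) pair; both lists must cover the same ranges.
    """
    total = 0.0
    for (rows_a, ndv_a), (rows_b, ndv_b) in zip(buckets_a, buckets_b):
        largest_ndv = max(ndv_a, ndv_b)
        if largest_ndv == 0:
            continue  # an empty bucket contributes no join rows
        total += rows_a * rows_b / largest_ndv
    return total


# Hypothetical aligned buckets: (row_count, ndv)
aligned_a = [(10, 5), (10, 2)]
aligned_b = [(4, 4), (6, 3)]
print(histogram_join_rows(aligned_a, aligned_b))  # 10*4/5 + 10*6/3 = 28.0
```

Because each term uses per-bucket counts, a bucket dominated by one heavy value contributes a large term on its own instead of being averaged over the whole column.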
  • 20. Aligning Histogram Buckets for Join
      • Form new buckets to align buckets properly
      [Figure: histogram buckets on join column k1 for Table A and Table B. Table A: original bucket boundaries 20, 25, 28, 28, 40, 80; extra new boundaries 30, 50, 70. Table B: original bucket boundaries 20, 25, 30, 50, 70, 90; extra new boundaries 28, 28, 40, 80. Legend: original bucket boundary; extra new bucket boundary to form additional buckets. One bucket is marked "This bucket is excluded in computation".]
  • 21. Table A, join column k1, histogram buckets (total row count: 25, min = 20, max = 80, ndv = 17):
          20 to 25: 20 21 23 24 25 (ndv=5)
          25 to 28: 25 27 27 27 28 (ndv=3)
          28 to 28: 28 28 28 28 28 (ndv=1)
          28 to 30: 29 (ndv=1)
          30 to 40: 36 36 39 40 (ndv=3)
          40 to 50: 45 47 (ndv=2)
          50 to 70: 55 63 (ndv=2)
          70 to 80: 80 (ndv=1)
      Table B, join column k1, histogram buckets (total row count: 20, min = 20, max = 90, ndv = 17):
          20 to 25: 20 21 21 25 (ndv=3)
          25 to 28: 26 (ndv=1)
          28 to 28: 28 28 (ndv=1)
          28 to 30: 30 (ndv=1)
          30 to 40: 36 39 (ndv=2)
          40 to 50: 45 50 (ndv=2)
          50 to 70: 55 60 65 70 (ndv=4)
          70 to 80: 75 80 (ndv=2)
          80 to 90: 90 90 (ndv=1)
      - With histogram, join cardinality estimate is 21.8 by computing the aligned buckets’ cardinality one-by-one.
      - Without histogram, join cardinality estimate is 29.4
      - The correct answer is 20.
  • 22. Other Operator Estimation
      • Project: does not change row count
      • Aggregate: consider uniqueness of group-by columns
      • Limit, Sample, etc.
  • 23. Statistics Propagation
      [Figure: plan tree with Join (t1.a = t2.b) above Scan t1 and Scan t2. Scan t1 carries a: min, max, ndv, …; Scan t2 carries b: min, max, ndv, …; the Join carries a: newMin, newMax, newNdv, … and b: newMin, newMax, newNdv, …. Arrows: top-down statistics requests, bottom-up statistics propagation.]
  • 24. Statistics inference
      • Statistics collected:
          – Number of records for a table
          – Number of distinct values for a column
      • Can make these inferences:
          – If the above two numbers are close, we can determine if a column is a unique key.
          – Can infer if it is a primary-key to foreign-key join.
          – Can detect if a star schema exists.
          – Can help determine the output size of a group-by operator if multiple columns of the same table appear in the group-by expression.
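The unique-key inference can be illustrated with a small Python check. The slides only say the two numbers should be "close"; the 5% tolerance used here is an assumption for illustration, as is the function name.

```python
def looks_like_unique_key(row_count, ndv, tolerance=0.05):
    """Guess whether a column is a unique key: its ndv is close to the table's row count.

    `tolerance` is a hypothetical relative threshold, not a value from the slides.
    """
    if row_count == 0:
        return False
    return abs(row_count - ndv) <= tolerance * row_count


print(looks_like_unique_key(row_count=1000, ndv=980))  # True
print(looks_like_unique_key(row_count=25, ndv=17))     # False
```

Some tolerance is needed because the distinct count is itself an approximation, so an exact equality test would be too strict.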
  • 25. Configuration Parameters
      Configuration Parameter | Default Value | Suggested Value
      spark.sql.cbo.enabled | False | True
      spark.sql.cbo.joinReorder.enabled | False | True
      spark.sql.cbo.joinReorder.dp.threshold | 12 | 12
      spark.sql.cbo.joinReorder.card.weight | 0.7 | 0.7
      spark.sql.statistics.size.autoUpdate.enabled | False | True
      spark.sql.statistics.histogram.enabled | False | True
      spark.sql.statistics.histogram.numBins | 254 | 254
      spark.sql.statistics.ndv.maxError | 0.05 | 0.05
      spark.sql.statistics.percentile.accuracy | 10000 | 10000
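One way to apply the suggested values is to pass them as `--conf key=value` arguments to spark-submit. The Python sketch below only builds that argument list for the four parameters whose suggested value differs from the default; the helper name `to_submit_args` is hypothetical, and how the arguments are finally handed to Spark depends on the deployment.

```python
# Parameters from the table whose suggested value differs from the default.
SUGGESTED_CBO_CONF = {
    "spark.sql.cbo.enabled": "true",
    "spark.sql.cbo.joinReorder.enabled": "true",
    "spark.sql.statistics.size.autoUpdate.enabled": "true",
    "spark.sql.statistics.histogram.enabled": "true",
}


def to_submit_args(conf):
    """Turn a config mapping into a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(SUGGESTED_CBO_CONF)))
```

Histograms are collected when ANALYZE TABLE ... FOR COLUMNS runs, so the histogram setting needs to be on at that point for the column statistics to include them.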
  • 26. Reference
      • SPARK-16026: Cost-Based Optimizer Framework
          – https://issues.apache.org/jira/browse/SPARK-16026
          – It has 45 sub-tasks.
      • SPARK-21975: Histogram support in cost-based optimizer
          – https://issues.apache.org/jira/browse/SPARK-21975
          – It has 10 sub-tasks.
  • 27. Summary
      • Cost Based Optimizer in Spark 2.2
      • Statistics Collected
      • Histogram Support in Spark 2.3
          – Skewed data distributions are intrinsic in real world data.
          – Turn on the histogram configuration parameter “spark.sql.statistics.histogram.enabled” to deal with skew.