SlideShare a Scribd company logo
Lightweight Graphical Models for
Selectivity Estimation without
Independence Assumptions
Kostas Tzoumas
Amol Deshpande
Christian S. Jensen
Query Optimization
• Need to decide (among others) the
“best” join plan.
• Very simplified picture:
– Cost of a plan = # of intermediate tuples
it produces.
– Use upper plan if |LO|<|OC|, lower plan
otherwise.
• Use size estimates of intermediate
relations.
• Cardinality estimation: Estimating
relation sizes at compile time.
– Typically using statistical summaries.
• Errors in estimates can lead to wrong
plan. Independence assumptions a
frequent factor of errors.
L O C
LO
cost = |LO|
L O C
OC
cost = |OC|
2
Cardinality Estimation
• Original system R optimizer assumes
– Uniform distribution of attributes
• Addressed by attribute-level synopsis (e.g., 1D
histogram and wavelets) accurate estimation of single➔
attributes.
– Independence among
• Attribute values of a table
– Addressed by table-level synopsis (multi-D histograms) to
estimate the joint probability distributions of two or many
variables.
• Join predicates between different tables
– Addressed by schema-level synopsis to capture the joint
probability distribution of attributes and join indicators.
Independence Assumption
|L’| = Pr(eprice in [e1,e2]) |L|
|O’| = Pr(tprice in [t1,t2]) |O|
|L’O’| = Pr(l_orderkey=o_orderkey)|L’||O’|
|L’O’| = Pr(l_orderkey=o_orderkey)
Pr(eprice in [e1,e2])
Pr(tprice in [t1,t2]) |L||O|
•Results to under-estimation of |L’O’| 
wrong query plan, nested loop join.
•Can result in orders of magnitude slower
execution.
•Solution: estimate joint probabilities: |L’O’|
= Pr (l_orderkey=o_orderkey,
eprice in [e1,e2],
tprice in [t1,t2]) |L||O|
L O C
HJ
NLJ
σL σΟ σC
L’ O’ C’
L’O’
4
select c_name,c_address
from lineitem,orders,customer
where l_orderkey=o_orderkey and
o_custkey=c_custkey and
o_totalprice in [t1,t2] and
l_extendedprice in [e1,e2]
and c_acctbal in [b1,b2]
Problem Setting
• Database + workload  schema graph
• Schema graph  Set of random variables (descriptive
attributes and join indicators)
• Goal: Approximate P(JLO, JLP, JLS, JSC, JOC, L.sdate, L.cdate,
L.rdate, O.odate, P.size, S.acctbal,C.acctbal)
sdate
cdate
rdate
lineitem
odate
orders
size
part
acctbal
supplier
acctbal
customer
JLO ≡
L.okey=O.okey
JOC ≡
O.ckey=C.ckey
JLS ≡
L.skey=S.skey
JLP ≡
L.pkey=P.pkey
JSC ≡
S.nkey=C.nkey
Join indicator
A binary random variable
Captures the “event” that
two random tuples join.
We don’t need “key”
attributes in the model
(hard to approximate).
Pr(JOC=T)=sel(O.ckey=C.ckey)
Problem:
Exponential blow-
up of storage space
and selectivity
estimation time.
5
Graphical Models
• Exploit independence and conditional independence
to factor the full joint distribution.
– X┴Y ↔P(X,Y)=P(X)P(Y)
– X┴Z|Y ↔ P(X,Y,Z)=P(X,Y)P(Y,Z)/P(Y)
• Bayesian networks
– Graphical representation of a set of cond. inds
– Express a factorization of the joint distribution
• Selectivity estimation is solving inference problem
(computing marginal distributions of the form of P(Xi|
Xj=xj)).
– High dimensional distributions  high overhead
– Construction algorithms do not scale.
X Y Z
6
Fixed Structure
• “Tailored” solution
– only 2D histograms to ensure scalable construction
• Relational independencies:
– L.rdate ┴ O.odate
σL.rdate=a,O.odate=b(LxO)=σL.rdate=a(L) x σO.odate=b(O)
– L.rdate ┴ JSC
– JLO ┴ JSC
• Model Design decisions
– Join indicator has at most two parents; one from
each relation
– BN within a relation as a directed tree 7
Bayesian Network
sdate
cdaterdate
size S.acctbal S.acctbal
odate
JLP JLS
JSC
JOC
JLO
lineitem
part supplier customer
orders
8
Moral Graph
sdate
cdaterdate
size S.acctbal C.acctbal
odate
JLP JLS
JSC
JOC
JLO
9
Chordal Graph
sdate
cdaterdate
size S.acctbal C.acctbal
odate
JLP JLS
JSC
JOC
JLO
10
Junction Tree
sdate
rdate
C.acctbal
odate
sdate
C.acctbal
S.acctbal
sdate
sdate
cdate
odate
cdate
odate
JLO
sdate
S.acctbal
JLS
sdate
size
JLP
odate
C.acctbal
JOC
S.acctbal
C.acctbal
JSC
sdate
sdate
sdate
odate
sdate
sdate
cdate
odate
odateacctbal
Distributions to be kept =
marginals of “cliques” and
“separators.”
Factorization achieved!
Storage space:
•(sdate,rdate)  One 2D histogram
•(odate,acctbal,JOC)  Two 2D histograms (JOC=false,true)
•(acctbal,odate,sdate)  Three 1D histograms
Only 2D histograms needed! 11
Efficient selectivity estimation
via junction tree propagation.
Selectivity Estimation
sdate
rdate
C.acctbal
odate
sdate
C.acctbal
S.acctbal
sdate
sdate
cdate
odate
cdate
odate
JLO
sdate
S.acctbal
JLS
sdate
size
JLP
odate
C.acctbal
JOC
S.acctbal
C.acctbal
JSC
sdate
sdate
sdate
odate
sdate
sdate
cdate
odate
odateacctbal
Estimate size of query:
select c_name,c_address
from lineitem,orders,customer
where l_orderkey=o_orderkey and
o_custkey=c_custkey and
l_sdate<=“25/7/2011” and
c_acctbal<=200000
Equivalent: Estimate
Pr(JLO=true,
JOC=true,
sdate<=“25/7/2011”,
C.acctbal<=200000)
Extract “Steiner tree”:
Minimal subtree that
contains JLO, JOC, sdate,
C.acctbal 12
Selectivity Estimation
sdate
cdate
odate
cdate
odate
JLO
odate
acctbal
JOC
cdate
odate
odate
Estimate
Pr(JLO=true,
JOC=true,
sdate<=“25/7/2011”,
acctbal<=200000)
φ2=P(cdate,odate,JLO)
φ3=P(odate,acctbal,JOC)
φ1=P(sdate,cdate,odate)
μ12=P(cdate,odate)
μ23=P(odate)
1. Substitute
φ1
*
(cdate,odate)=
φ1[sdate<=“25/7/2011”]
φ2
*
(cdate,odate)=
φ2[JLO=true]
φ3
*
(odate)=
φ3[JOC=true,acctbal<=200000]
2. Multiply
φ12
*
(cdate,odate)=
φ1
*
φ2
*
/μ12
3. Marginalize
φ12
**
(odate)=
Σcdateφ12
*
4. Multiply
φ123
*
(odate)=
φ12
**
φ3
*
/μ23
5. Return
Σodateφ123
*
13
Implementation
• Model construction outside DBMS. Create distributions
with SQL queries.
– Stores junction tree as tables in PostgreSQL catalog.
• Graphical model foundation as PostgreSQL “nodes”.
– Probability distributions stored as equi-width multidimensional
histograms.
– Cliques, multiplication, division, marginalization.
– Junction tree, selectivity estimation algorithms
• Integration with PostgreSQL query optimizer.
– Load Steiner tree from catalog tables.
– Bypass PostgreSQL selectivity estimation functions.
14
Impact of Capturing Correlations
select c_name,c_address
from lineitem,orders,customer
where l_orderkey=o_orderkey and o_custkey=c_custkey
and o_tprice in [t1,t2] and l_eprice in [e1,e2]
and c_acctbal in [b1,b2]
Avoid under-estimation
by capturing the price
correlation.
Estimates very close to
reality.
Optimizer picks different
plan than default PGSQL.
Huge impact in
execution time.
Can do selectivity
estimation efficiently.
Optimization time in
the range of 10s of
milliseconds
Cost of plans (intermediate tuples)
Execution & optimization times
15
Accuracy of Estimates
Multiplicative error:
max(real,estimate) / min(real,estimate)
Penalizes both under- and over- estimation.
Subset of TPC-H schema.
Tweaked data generator
to introduce correlations.
400 random queries.
Geometric average of
multiplicative error.
One order of magnitude
better estimates for 5-
join queries
Optimization time less
than 25 milliseconds
16
Conclusions and Future Work
• Conclusion
– Attribute value independence assumption unfounded
• Heuristic, not good enough
• Most frequent source of horrible plans
– Practical adaptation of graphical models implemented in
PostgreSQL
• Low overhead, good estimate quality
• Future work
– Specialized synopses
• Low-dimensional, minimize multiplicative error, error guarantees
after multiplication
– Incremental updates
• Building graphical model as a side-effect of query execution

More Related Content

Similar to lightweight graphical models for selectivity estimation without independance assumption

Linear regression by Kodebay
Linear regression by KodebayLinear regression by Kodebay
Linear regression by Kodebay
Kodebay
 
Clustering coefficients for correlation networks
Clustering coefficients for correlation networksClustering coefficients for correlation networks
Clustering coefficients for correlation networks
Naoki Masuda
 
Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)
Theodore Grammatikopoulos
 
Curve fitting and Optimization
Curve fitting and OptimizationCurve fitting and Optimization
Curve fitting and Optimization
Syahrul Senin
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
ANIRBANMAJUMDAR18
 
Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology
Amro Elfeki
 
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Atsushi Nitanda
 
Design and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsDesign and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation Algorithms
Ajay Bidyarthy
 
Compilation
CompilationCompilation
Compilation
magansandu
 
MATLABgraphPlotting.pptx
MATLABgraphPlotting.pptxMATLABgraphPlotting.pptx
MATLABgraphPlotting.pptx
PrabhakarSingh646829
 
Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...
Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...
Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...
Akos Hajdu
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
Ben Mabey
 
1. Regression_V1.pdf
1. Regression_V1.pdf1. Regression_V1.pdf
1. Regression_V1.pdf
ssuser4c50a9
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
eXascale Infolab
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
ChenYiHuang5
 
Survadapt-Webinar_2014_SLIDES
Survadapt-Webinar_2014_SLIDESSurvadapt-Webinar_2014_SLIDES
Survadapt-Webinar_2014_SLIDES
therealreverendbayes
 
theory of computation lecture 01
theory of computation lecture 01theory of computation lecture 01
theory of computation lecture 01
8threspecter
 
Passive network-redesign-ntua
Passive network-redesign-ntuaPassive network-redesign-ntua
Passive network-redesign-ntua
IEEE NTUA SB
 
S1 - Process product optimization using design experiments and response surfa...
S1 - Process product optimization using design experiments and response surfa...S1 - Process product optimization using design experiments and response surfa...
S1 - Process product optimization using design experiments and response surfa...
CAChemE
 
Seminar psu 20.10.2013
Seminar psu 20.10.2013Seminar psu 20.10.2013
Seminar psu 20.10.2013
Vyacheslav Arbuzov
 

Similar to lightweight graphical models for selectivity estimation without independance assumption (20)

Linear regression by Kodebay
Linear regression by KodebayLinear regression by Kodebay
Linear regression by Kodebay
 
Clustering coefficients for correlation networks
Clustering coefficients for correlation networksClustering coefficients for correlation networks
Clustering coefficients for correlation networks
 
Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)Linear Model Selection and Regularization (Article 6 - Practical exercises)
Linear Model Selection and Regularization (Article 6 - Practical exercises)
 
Curve fitting and Optimization
Curve fitting and OptimizationCurve fitting and Optimization
Curve fitting and Optimization
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
 
Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology
 
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Cl...
 
Design and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation AlgorithmsDesign and Implementation of Parallel and Randomized Approximation Algorithms
Design and Implementation of Parallel and Randomized Approximation Algorithms
 
Compilation
CompilationCompilation
Compilation
 
MATLABgraphPlotting.pptx
MATLABgraphPlotting.pptxMATLABgraphPlotting.pptx
MATLABgraphPlotting.pptx
 
Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...
Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...
Exploiting Hierarchy in the Abstraction-Based Verification of Statecharts Usi...
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
 
1. Regression_V1.pdf
1. Regression_V1.pdf1. Regression_V1.pdf
1. Regression_V1.pdf
 
Realtime Analytics
Realtime AnalyticsRealtime Analytics
Realtime Analytics
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
 
Survadapt-Webinar_2014_SLIDES
Survadapt-Webinar_2014_SLIDESSurvadapt-Webinar_2014_SLIDES
Survadapt-Webinar_2014_SLIDES
 
theory of computation lecture 01
theory of computation lecture 01theory of computation lecture 01
theory of computation lecture 01
 
Passive network-redesign-ntua
Passive network-redesign-ntuaPassive network-redesign-ntua
Passive network-redesign-ntua
 
S1 - Process product optimization using design experiments and response surfa...
S1 - Process product optimization using design experiments and response surfa...S1 - Process product optimization using design experiments and response surfa...
S1 - Process product optimization using design experiments and response surfa...
 
Seminar psu 20.10.2013
Seminar psu 20.10.2013Seminar psu 20.10.2013
Seminar psu 20.10.2013
 

More from Soheila Dehghanzadeh

Performance comparison between Linux Containers and Virtual Machines
Performance comparison between Linux Containers and Virtual MachinesPerformance comparison between Linux Containers and Virtual Machines
Performance comparison between Linux Containers and Virtual Machines
Soheila Dehghanzadeh
 
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ...
 Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ... Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ...
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ...
Soheila Dehghanzadeh
 
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Soheila Dehghanzadeh
 
addressing tim/quality trade-off in view maintenance
addressing tim/quality trade-off in view maintenanceaddressing tim/quality trade-off in view maintenance
addressing tim/quality trade-off in view maintenance
Soheila Dehghanzadeh
 
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...
Soheila Dehghanzadeh
 
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data SetsApproximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
Soheila Dehghanzadeh
 

More from Soheila Dehghanzadeh (6)

Performance comparison between Linux Containers and Virtual Machines
Performance comparison between Linux Containers and Virtual MachinesPerformance comparison between Linux Containers and Virtual Machines
Performance comparison between Linux Containers and Virtual Machines
 
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ...
 Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ... Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ...
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data ...
 
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine L...
 
addressing tim/quality trade-off in view maintenance
addressing tim/quality trade-off in view maintenanceaddressing tim/quality trade-off in view maintenance
addressing tim/quality trade-off in view maintenance
 
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query ...
 
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data SetsApproximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets
 

Recently uploaded

BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
Katrina Pritchard
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
National Information Standards Organization (NISO)
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
simonomuemu
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
Celine George
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
Celine George
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
adhitya5119
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
Celine George
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
PECB
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
RAHUL
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 

Recently uploaded (20)

BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
Advanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docxAdvanced Java[Extra Concepts, Not Difficult].docx
Advanced Java[Extra Concepts, Not Difficult].docx
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
 

lightweight graphical models for selectivity estimation without independance assumption

  • 1. Lightweight Graphical Models for Selectivity Estimation without Independence Assumptions Kostas Tzoumas Amol Deshpande Christian S. Jensen
  • 2. Query Optimization • Need to decide (among others) the “best” join plan. • Very simplified picture: – Cost of a plan = # of intermediate tuples it produces. – Use upper plan if |LO|<|OC|, lower plan otherwise. • Use size estimates of intermediate relations. • Cardinality estimation: Estimating relation sizes at compile time. – Typically using statistical summaries. • Errors in estimates can lead to wrong plan. Independence assumptions a frequent factor of errors. L O C LO cost = |LO| L O C OC cost = |OC| 2
  • 3. Cardinality Estimation • Original system R optimizer assumes – Uniform distribution of attributes • Addressed by attribute-level synopsis (e.g., 1D histogram and wavelets) accurate estimation of single➔ attributes. – Independence among • Attribute values of a table – Addressed by table-level synopsis (multi-D histograms) to estimate the joint probability distributions of two or many variables. • Join predicates between different tables – Addressed by schema-level synopsis to capture the joint probability distribution of attributes and join indicators.
  • 4. Independence Assumption |L’| = Pr(eprice in [e1,e2]) |L| |O’| = Pr(tprice in [t1,t2]) |O| |L’O’| = Pr(l_orderkey=o_orderkey)|L’||O’| |L’O’| = Pr(l_orderkey=o_orderkey) Pr(eprice in [e1,e2]) Pr(tprice in [t1,t2]) |L||O| •Results to under-estimation of |L’O’|  wrong query plan, nested loop join. •Can result in orders of magnitude slower execution. •Solution: estimate joint probabilities: |L’O’| = Pr (l_orderkey=o_orderkey, eprice in [e1,e2], tprice in [t1,t2]) |L||O| L O C HJ NLJ σL σΟ σC L’ O’ C’ L’O’ 4 select c_name,c_address from lineitem,orders,customer where l_orderkey=o_orderkey and o_custkey=c_custkey and o_totalprice in [t1,t2] and l_extendedprice in [e1,e2] and c_acctbal in [b1,b2]
  • 5. Problem Setting • Database + workload  schema graph • Schema graph  Set of random variables (descriptive attributes and join indicators) • Goal: Approximate P(JLO, JLP, JLS, JSC, JOC, L.sdate, L.cdate, L.rdate, O.odate, P.size, S.acctbal,C.acctbal) sdate cdate rdate lineitem odate orders size part acctbal supplier acctbal customer JLO ≡ L.okey=O.okey JOC ≡ O.ckey=C.ckey JLS ≡ L.skey=S.skey JLP ≡ L.pkey=P.pkey JSC ≡ S.nkey=C.nkey Join indicator A binary random variable Captures the “event” that two random tuples join. We don’t need “key” attributes in the model (hard to approximate). Pr(JOC=T)=sel(O.ckey=C.ckey) Problem: Exponential blow- up of storage space and selectivity estimation time. 5
  • 6. Graphical Models • Exploit independence and conditional independence to factor the full joint distribution. – X┴Y ↔P(X,Y)=P(X)P(Y) – X┴Z|Y ↔ P(X,Y,Z)=P(X,Y)P(Y,Z)/P(Y) • Bayesian networks – Graphical representation of a set of cond. inds – Express a factorization of the joint distribution • Selectivity estimation is solving inference problem (computing marginal distributions of the form of P(Xi| Xj=xj)). – High dimensional distributions  high overhead – Construction algorithms do not scale. X Y Z 6
  • 7. Fixed Structure • “Tailored” solution – only 2D histograms to ensure scalable construction • Relational independencies: – L.rdate ┴ O.odate σL.rdate=a,O.odate=b(LxO)=σL.rdate=a(L) x σO.odate=b(O) – L.rdate ┴ JSC – JLO ┴ JSC • Model Design decisions – Join indicator has at most two parents; one from each relation – BN within a relation as a directed tree 7
  • 8. Bayesian Network sdate cdaterdate size S.acctbal S.acctbal odate JLP JLS JSC JOC JLO lineitem part supplier customer orders 8
  • 9. Moral Graph sdate cdaterdate size S.acctbal C.acctbal odate JLP JLS JSC JOC JLO 9
  • 10. Chordal Graph sdate cdaterdate size S.acctbal C.acctbal odate JLP JLS JSC JOC JLO 10
  • 11. Junction Tree sdate rdate C.acctbal odate sdate C.acctbal S.acctbal sdate sdate cdate odate cdate odate JLO sdate S.acctbal JLS sdate size JLP odate C.acctbal JOC S.acctbal C.acctbal JSC sdate sdate sdate odate sdate sdate cdate odate odateacctbal Distributions to be kept = marginals of “cliques” and “separators.” Factorization achieved! Storage space: •(sdate,rdate)  One 2D histogram •(odate,acctbal,JOC)  Two 2D histograms (JOC=false,true) •(acctbal,odate,sdate)  Three 1D histograms Only 2D histograms needed! 11 Efficient selectivity estimation via junction tree propagation.
  • 12. Selectivity Estimation sdate rdate C.acctbal odate sdate C.acctbal S.acctbal sdate sdate cdate odate cdate odate JLO sdate S.acctbal JLS sdate size JLP odate C.acctbal JOC S.acctbal C.acctbal JSC sdate sdate sdate odate sdate sdate cdate odate odateacctbal Estimate size of query: select c_name,c_address from lineitem,orders,customer where l_orderkey=o_orderkey and o_custkey=c_custkey and l_sdate<=“25/7/2011” and c_acctbal<=200000 Equivalent: Estimate Pr(JLO=true, JOC=true, sdate<=“25/7/2011”, C.acctbal<=200000) Extract “Steiner tree”: Minimal subtree that contains JLO, JOC, sdate, C.acctbal 12
  • 14. Implementation • Model construction outside DBMS. Create distributions with SQL queries. – Stores junction tree as tables in PostgreSQL catalog. • Graphical model foundation as PostgreSQL “nodes”. – Probability distributions stored as equi-width multidimensional histograms. – Cliques, multiplication, division, marginalization. – Junction tree, selectivity estimation algorithms • Integration with PostgreSQL query optimizer. – Load Steiner tree from catalog tables. – Bypass PostgreSQL selectivity estimation functions. 14
  • 15. Impact of Capturing Correlations select c_name,c_address from lineitem,orders,customer where l_orderkey=o_orderkey and o_custkey=c_custkey and o_tprice in [t1,t2] and l_eprice in [e1,e2] and c_acctbal in [b1,b2] Avoid under-estimation by capturing the price correlation. Estimates very close to reality. Optimizer picks different plan than default PGSQL. Huge impact in execution time. Can do selectivity estimation efficiently. Optimization time in the range of 10s of milliseconds Cost of plans (intermediate tuples) Execution & optimization times 15
  • 16. Accuracy of Estimates Multiplicative error: max(real,estimate) / min(real,estimate) Penalizes both under- and over- estimation. Subset of TPC-H schema. Tweaked data generator to introduce correlations. 400 random queries. Geometric average of multiplicative error. One order of magnitude better estimates for 5- join queries Optimization time less than 25 milliseconds 16
  • 17. Conclusions and Future Work • Conclusion – Attribute value independence assumption unfounded • Heuristic, not good enough • Most frequent source of horrible plans – Practical adaptation of graphical models implemented in PostgreSQL • Low overhead, good estimate quality • Future work – Specialized synopses • Low-dimensional, minimize multiplicative error, error guarantees after multiplication – Incremental updates • Building graphical model as a side-effect of query execution

Editor's Notes

  1. Selectivity estimation errors is caused by missed correlations between attributes Their selectivity estimation approach doesn’t make the independence assumption by factoring the joint probability distribution of all attributes into 2-dimentional distributions.
  2. To elaborate more on the independence assumption, we provide an example
  3. They assumes a set of independencies to focus on the most important dependencies. Allowing dependencies will violate the relational independencies and won’t lead to missed correlations. (leads to missed dependencies but they believe that the error is negligible if the specified dependencies can be captured correctly)
  4. And based on those assumptions they schema graph boils down to the following Bayesian network. Now to use this BN for inference, they used the junction tree solution which is proven to be more efficient that variable elimination algorithm in this case. To build the junction tree they add edges among nodes that share the same child to build the moral graph and to prevent cycles with more than 4 nodes with out cycle they run the triangulation process to add chords in such cycles.
  5. To avoid distributions with more than 3 arguments they have to avoid cycles with more than 3 nodes.