The document discusses query compilation and optimization. It covers parsing a SQL query into a parse tree, converting the parse tree into a logical query plan using relational algebra, and estimating the costs of different logical and physical query plans to select the most efficient plan. The key steps are parsing, rewriting the logical query plan using algebraic laws to improve it, estimating sizes of intermediate results, and selecting a physical query plan based on cost estimates.
• Process for heuristic optimization
1. The parser generates an initial internal representation of the high-level query.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations, based on the access paths available for the files involved in the query.
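Step 2, applying heuristic rules to the internal representation, can be sketched on a toy tuple-based plan representation (all names here are illustrative; the rule shown is the classic push-selection-below-product rewrite):

```python
# Toy plan nodes: ("scan", name, attrs), ("product", l, r), ("select", pred_attrs, child)

def attrs(plan):
    """Set of attributes produced by a plan node."""
    op = plan[0]
    if op == "scan":
        return plan[2]
    if op == "product":
        return attrs(plan[1]) | attrs(plan[2])
    if op == "select":
        return attrs(plan[2])

def push_selection(plan):
    """Heuristic rule: sigma_p(R x S) -> sigma_p(R) x S
    when predicate p references only attributes of R (symmetrically for S)."""
    if plan[0] == "select" and plan[2][0] == "product":
        pred_attrs, (_, left, right) = plan[1], plan[2]
        if pred_attrs <= attrs(left):
            return ("product", ("select", pred_attrs, left), right)
        if pred_attrs <= attrs(right):
            return ("product", left, ("select", pred_attrs, right))
    return plan

plan = ("select", {"A"}, ("product",
                          ("scan", "R", {"A", "B"}),
                          ("scan", "S", {"B", "C"})))
print(push_selection(plan))  # selection is pushed onto R
```

A real rewriter applies a set of such rules repeatedly until no rule fires.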
[Figure: DBMS architecture. Users, web forms, applications, and the DBA issue queries and transactions. The query path runs Query Parser → Query Rewriter → Query Optimizer → Query Executor, over Files & Access Methods, the Buffer Manager, and the Storage Manager down to Storage. Alongside, the Transaction Manager, the Lock Manager (with its lock tables), and Logging & Recovery share buffers in main memory. Past lectures covered the lower layers; today's lecture covers the query path.]
Many query plans to execute a SQL query
• Compute the join of R(A,B) ⋈ S(B,C) ⋈ T(C,D) ⋈ U(D,E)
• Many join trees are possible, e.g. the left-deep tree ((R ⋈ S) ⋈ T) ⋈ U or the bushy tree (R ⋈ S) ⋈ (T ⋈ U).
• Even more plans: multiple algorithms to execute each operation, e.g. sort-merge join vs. hash join for each join, and table-scan vs. index-scan for each base relation.
Explain command in PostgreSQL

EXPLAIN
SELECT C.STATE, SUM(O.NETAMOUNT), SUM(O.TOTALAMOUNT)
FROM CUSTOMERS C
JOIN CUST_HIST CH ON C.CUSTOMERID = CH.CUSTOMERID
JOIN ORDERS O ON CH.ORDERID = O.ORDERID
GROUP BY C.STATE;
Communication between operators:
iterator model
• Each physical operator implements three functions:
– Open: initializes the data structures.
– GetNext: returns the next tuple in the result.
– Close: ends the operation and frees the resources.
• It enables pipelining
• Other option: compute the result of the operator in full and store it on disk or in memory: inefficient.
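The three-function interface can be sketched as follows (a minimal in-memory model, not a real engine's API; class and method names are illustrative):

```python
# A minimal sketch of the Open/GetNext/Close iterator model.

class TableScan:
    def __init__(self, rows):
        self.rows = rows
    def open(self):                 # initialize the data structures
        self.pos = 0
    def get_next(self):             # return the next tuple, or None when done
        if self.pos < len(self.rows):
            row = self.rows[self.pos]
            self.pos += 1
            return row
        return None
    def close(self):                # end the operation, free resources
        del self.pos

class Select:
    def __init__(self, pred, child):
        self.pred, self.child = pred, child
    def open(self):
        self.child.open()
    def get_next(self):
        # Pull tuples from the child until one satisfies the predicate:
        # this is pipelining; no intermediate result is materialized.
        while (row := self.child.get_next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self):
        self.child.close()

# Usage: sigma_{a=1}(R) over an in-memory relation
plan = Select(lambda r: r["a"] == 1, TableScan([{"a": 1}, {"a": 2}, {"a": 1}]))
plan.open()
out = []
while (t := plan.get_next()) is not None:
    out.append(t)
plan.close()
print(out)   # [{'a': 1}, {'a': 1}]
```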
Query optimization: picking the fastest plan
• Optimal approach
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What’s wrong?
• Rule-based optimization
– Use a set of pre-defined rules to generate a fast plan
• e.g., If there is an index over a table, use it for scan and join.
Review Definitions
• Statistics on table R:
– T(R): Number of tuples in R
– B(R): Number of blocks needed to hold all tuples of R
• B(R) ≈ T(R) / (tuples per block)
– V(R,A): Number of distinct values of attribute A in R
Plans to select tuples from R: σA=a(R)
• We have an unclustered index on R
• Plans:
– (Unclustered) index-based scan
– Table-scan (sequential access)
• Statistics on R
– B(R)=5000, T(R)=200,000
– V(R,A) = 2, one value appears in 95% of tuples.
• Unclustered indexed scan vs. table-scan ?
(Presenter note) Unclustered index scan vs. table-scan? The table-scan wins: it reads B(R) = 5,000 blocks, while the unclustered index scan performs roughly one disk I/O per matching tuple, i.e. about 0.95 × 200,000 = 190,000 I/Os for the frequent value.
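The comparison in the note can be reproduced with quick arithmetic (assuming the usual simplification of one disk I/O per matching tuple for an unclustered index):

```python
# Back-of-the-envelope I/O costs for the statistics above:
# B(R) = 5000, T(R) = 200,000, and the selected value matches 95% of tuples.
B, T = 5000, 200_000
table_scan_cost = B                      # read every block once
matching = T * 95 // 100                 # tuples satisfying A = a
index_scan_cost = matching               # unclustered index: ~1 I/O per matching tuple
print(table_scan_cost, index_scan_cost)  # 5000 190000 -> table-scan wins
```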
Query optimization methods
• Rule-based optimizer fails
– It uses static rules
– The rules do not consider the distribution of the data.
• Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently
• Optimization itself should be fast!
Cost-based optimization
• Plan space
– which plans to consider?
– it is time consuming to explore all alternatives.
• Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like to have accurate estimation
• Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans
Space of query plans: basic operators
• Selection
– algorithms: sequential, index-based
– Ordering
• Projection
– Ordering
• Join
– algorithms: nested loop, sort-merge, hash
– Ordering
4. Query Compilation
Outline
• Parsing
• Algebraic laws for improving query plans
• From parse trees to logical query plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing an order for joins
• Completing the physical query plan selection
5. Overview of Query Compilation
• SQL query → parse tree → logical query plan tree → physical query plan tree
• Three major steps
– Parsing
• Query and its structure is constructed
– Query rewrite (Selecting a logical query plan)
• Parse tree is converted to initial query plan
• Algebraic representation of a query
– Physical plan generation (Selecting a physical plan)
• Selecting an algorithm for each operation
• Selecting the order of operations
• Query rewrite and physical plan generation together form the query optimizer.
• Choices depend on metadata about the data.
6. Query compilation flow

SQL query
→ parse → parse tree
→ convert → logical query plan
→ apply laws (using statistics) → “improved” l.q.p.
→ estimate result sizes → l.q.p. + sizes
→ consider physical plans → {P1, P2, …}
→ estimate costs → {(P1,C1), (P2,C2), …}
→ pick best → Pi
→ execute → answer
7. Query to Logical Query Plan

query → Parser → Preprocessor → Logical query plan generator → Query rewriter → preferred logical query plan
8. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan selection
9. Parsing
• Syntax analysis
– The SQL query is converted into a parse tree whose nodes correspond to:
• Atoms:
– key words, attributes, constants, parentheses, operators
• Syntactic categories:
– names of subparts, enclosed in angle brackets < >
» e.g., <SFW> or <Condition>
– If the node is an atom, there are no children
– If the node is a syntactic category, the children are
described by one of the rules of the grammar.
• Parsing turns query into parse tree
– Subject of compilation
10. Example: SQL query
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’
);
(Find the movies with stars born in 1960)
11. Example: Parse Tree
<Query>
└─ <SFW>
   ├─ SELECT <SelList>: <Attribute> title
   ├─ FROM <FromList>: <RelName> StarsIn
   └─ WHERE <Condition>: <Tuple> IN ( <Query> )
      ├─ <Tuple>: <Attribute> starName
      └─ <Query>
         └─ <SFW>
            ├─ SELECT <SelList>: <Attribute> name
            ├─ FROM <FromList>: <RelName> MovieStar
            └─ WHERE <Condition>: <Attribute> birthDate LIKE <Pattern> ‘%1960’
12. Preprocessor
• Responsible for semantic checking
– Check relation uses
– Check and resolve attribute uses
– Check types
• If the parse tree satisfies all the above tests, it
is valid
• Otherwise, processing stops.
13. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan selection
14. Algebraic laws for Improving Query Plans
• Commutative and associative laws
– Commutative law: the order of the operands does not matter (e.g., + and * are commutative).
– Associative law: the operands may be grouped either from left to right or from right to left (e.g., + and * are associative).
15. Rules: Natural joins & cross products & union
R S = S R
(R S) T =R (S T)
RxS=SxR
(R x S) x T = R x (S x T)
RUS=SUR
R U (S U T) = (R U S) U T
Similarly for intersection also 15
16. Laws Involving Selection

σp1∧p2(R) = σp1[σp2(R)]
σp1∨p2(R) = [σp1(R)] ∪ [σp2(R)]

• The order of p1 and p2 can be changed.
• The second law does not hold for bags.
17. Selection and Union
• For a union, the selection must be pushed to
both arguments.
• For a difference, the selection must be pushed
to the first argument and optionally pushed to
the second.
• For other operators, it is only required that the
selection be pushed to one argument.
18. Rules: σ, ∪ combined:
σp(R ∪ S) = σp(R) ∪ σp(S)
σp(R − S) = σp(R) − S
σp(R ∩ S) = σp(R) ∩ S
σp(R − S) = σp(R) − σp(S)
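These set-based identities are easy to sanity-check with Python sets standing in for relations (a toy check on one instance, not a proof; the relations and predicate are made up):

```python
# Verify the sigma/union/difference/intersection laws on small set relations.
R = {1, 2, 3, 4}
S = {3, 4, 5, 6}
p = lambda t: t % 2 == 0                  # the selection predicate
sel = lambda rel: {t for t in rel if p(t)}  # sigma_p

assert sel(R | S) == sel(R) | sel(S)      # sigma over union
assert sel(R - S) == sel(R) - S           # sigma over difference (first argument)
assert sel(R - S) == sel(R) - sel(S)      # ... or pushed to both arguments
assert sel(R & S) == sel(R) & S           # sigma over intersection
print("all four laws hold on this instance")
```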
19. Pushing Selections
• One of the most powerful tools of the query optimizer
– Replace the left side of one of these rules by its right side
20. Laws Involving projections
• Projections, like selections, can be pushed down
through many operators.
• Pushing selections reduces number of tuples.
• Pushing projections keeps the same number of tuples (it only makes each tuple smaller).
• Principle
– We may introduce a projection anywhere in an expression
tree, as long as it eliminates only attributes that are never
used by any of the operators above, and are not in the result
of the entire expression.
21. Rules: π, σ combined

Let x = subset of R's attributes
    z = attributes mentioned in predicate p (a subset of R's attributes)

πx[σp(R)] = πx{σp[πxz(R)]}

where xz denotes the union of the attribute sets x and z.
24. Rules: π, ⋈ combined

Let x = subset of R's attributes
    y = subset of S's attributes
    z = the attributes common to R and S

πxy(R ⋈ S) = πxy{[πxz(R)] ⋈ [πyz(S)]}
31. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan selection
32. From Parse Trees to
Logical Query Plan
• First step:
– Replace the nodes and structures of the parse tree by operators of relational algebra.
• Second step:
– Take the relational-algebra expression produced in the first step and turn it into the preferred logical query plan.
33. Conversion into relational algebra
• The input is an SFW query; the output is a relational-algebra expression:
– take the product of all relations in the FROM list,
– apply the selection σC, where C is the <Condition> expression, and
– apply the projection πL, where L is the SELECT list.
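The three-step translation can be sketched with a toy expression representation (nested tuples; the function is illustrative, and the condition shown is a decorrelated single-block form of the earlier example, used purely for illustration):

```python
# Translate an SFW block into pi_L( sigma_C( product of FROM relations ) ).

def sfw_to_algebra(select_list, from_list, condition):
    expr = from_list[0]
    for rel in from_list[1:]:
        expr = ("product", expr, rel)     # product of all FROM relations
    expr = ("select", condition, expr)    # sigma_C
    return ("project", select_list, expr) # pi_L

plan = sfw_to_algebra(["title"], ["StarsIn", "MovieStar"],
                      "starName = name AND birthdate LIKE '%1960'")
print(plan)
```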
35. Improving the Logical query plan
• Push selections down as far as they will go.
• Push projections down, or add new projections.
• Remove duplicate elimination, or move it to a more convenient position.
• Combine certain selections with a product below them to turn the pair of operations into an equijoin.
36. Grouping Associative/Commutative
Operators
• Conventional parsers do not produce trees with many children; nodes are unary or binary.
• Associative and commutative operators, however, can have many operands.
• Therefore, group adjacent associative-and-commutative operators into single nodes with many children.
37. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan selection
38. Estimating the cost of Operations
• We enumerate the possible physical plans for a given logical plan.
• For each physical plan, we select:
– an order and grouping for associative-and-commutative operations like joins, unions, and intersections;
– an algorithm for each operator (e.g., nested-loop join vs. hash join);
– additional operators (scanning, sorting, and so on);
– how arguments are passed from one operator to the next: by storing the intermediate result on disk, or by using iterators that pass one main-memory buffer at a time.
• Estimating costs is an important step in this process.
39. Estimating the Sizes of Intermediate
Relations
• Physical plan cost = number of disk I/Os
– In some cases, processor time and communication time also matter.
• Rules for estimating the number of tuples in an
intermediate relation.
– Give accurate estimates
– Are easy to compute
– Are logically consistent
• Independent of the order of computation.
40. Notation
• B(R): number of blocks needed to hold all tuples of R
• T(R): number of tuples of R
• V(R,a): number of distinct values of attribute “a” in R
41. Example
R:  A    B  C   D
    cat  1  10  a
    cat  1  20  b
    dog  1  30  a
    dog  1  40  c
    bat  1  50  d

A: 20-byte string, B: 4-byte integer, C: 8-byte date, D: 5-byte string

T(R) = 5
V(R,A) = 3   V(R,B) = 1   V(R,C) = 5   V(R,D) = 4
42. Estimating the size of projection
• A projection reduces the size of each tuple.
• Sometimes the size can increase, when the projection list adds new (computed) attributes.
44. Estimating the Size of Selection

Example: the relation R from the earlier example, with T(R) = 5 and
V(R,A) = 3, V(R,B) = 1, V(R,C) = 5, V(R,D) = 4.

For W = σZ=val(R), the estimate is T(W) = T(R) / V(R,Z).
45. Selection
• Size estimate is more problematic when the selection
involves inequality comparison.
– Example S=σa<10(R)
– We assume that typical inequality will return one third of
tuples
– T(S)=T(R)/3.
– For not equal, T(S)=T(R)
• For OR
– S = σC1 OR C2(R)
– If R has n tuples, m1 of which satisfy C1 and m2 of which
satisfy C2, the estimate is
– n(1 − (1 − m1/n)(1 − m2/n))
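The selection-size rules above can be sketched as follows (the helper names are illustrative):

```python
def est_equality(t_r, v_r_a):
    """sigma_{a=c}(R): estimate T(R) / V(R, a)."""
    return t_r / v_r_a

def est_inequality(t_r):
    """sigma_{a<c}(R): assume one third of the tuples qualify."""
    return t_r / 3

def est_or(n, m1, m2):
    """sigma_{C1 OR C2}(R): n(1 - (1 - m1/n)(1 - m2/n))."""
    return n * (1 - (1 - m1 / n) * (1 - m2 / n))

# Example: n = 1000 tuples, 100 satisfy C1, 200 satisfy C2.
print(est_or(1000, 100, 200))  # 280.0
```

The OR formula assumes C1 and C2 select tuples independently, so it is always at most m1 + m2 and at most n.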
46. Estimating the Size of a Join
• We consider only natural join
• Estimates for other joins follow this outline
– An equijoin can be estimated like a natural join.
– A theta-join can be estimated as a selection
following a product.
• Number of tuples = that of the product
• An equality comparison is handled like a natural join
• An inequality comparison is handled like a selection.
47. Natural join
• R(X,Y) ⋈ S(Y,Z)
• If the two relations have disjoint sets of Y-values, T(R ⋈ S) = 0
• Y might be the key of S and a foreign key of R
– T(R ⋈ S) = T(R)
• All the tuples of R and S could have the same Y-value
– T(R ⋈ S) = T(R)T(S)
• The usual estimate rests on two assumptions
– Containment of value sets: if V(R,Y) <= V(S,Y), then every Y-value of R
is also a Y-value of S.
– Preservation of value sets: if R is joined with another relation, we do
not lose values from the set of values of a non-join attribute.
• V(S ⋈ R, A) = V(R,A) for an attribute A of R
• These assumptions may be violated in practice, but the estimate is
– T(R ⋈ S) = T(R)T(S) / max(V(R,Y), V(S,Y))
48. Natural joins with multiple join
attributes
• R(x, y1, y2) ⋈ S(y1, y2, z)
• T(R ⋈ S) = T(R)T(S) / (max(V(R,y1), V(S,y1)) · max(V(R,y2), V(S,y2)))
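The join-size formula, generalized to any number of shared attributes, can be sketched like this (the helper name and example numbers are illustrative):

```python
def est_join(t_r, t_s, v_pairs):
    """Estimate T(R join S) = T(R)*T(S) / product over join attributes y
    of max(V(R,y), V(S,y)).

    v_pairs: list of (V(R,y), V(S,y)) pairs, one per join attribute y.
    """
    est = t_r * t_s
    for v_r, v_s in v_pairs:
        est /= max(v_r, v_s)
    return est

# Single join attribute Y: T(R)=1000, T(S)=2000, V(R,Y)=20, V(S,Y)=50.
print(est_join(1000, 2000, [(20, 50)]))            # 40000.0
# Two join attributes y1, y2.
print(est_join(1000, 2000, [(20, 50), (10, 10)]))  # 4000.0
```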
50. Estimating the Sizes of Other
Operations
• Projections do not change the number of tuples
• Products= product of number of tuples
• Union
– Bag union: the sum of the sizes (exact)
– Set union: between the size of the larger relation and the sum
of the sizes
• The middle of this range is chosen
• Intersection
– Between 0 and the size of the smaller relation
– Take the average of the extremes: min(T(R), T(S))/2
• Difference
– R − S has between T(R) − T(S) and T(R) tuples. Average
estimate: T(R) − T(S)/2
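A minimal sketch of these range-midpoint estimates (illustrative helper names):

```python
def est_bag_union(t_r, t_s):
    return t_r + t_s                   # exact for a bag union

def est_set_union(t_r, t_s):
    # True size lies between max(T(R), T(S)) and T(R)+T(S); take the middle.
    lo, hi = max(t_r, t_s), t_r + t_s
    return (lo + hi) / 2

def est_intersection(t_r, t_s):
    # True size lies between 0 and min(T(R), T(S)); average the extremes.
    return min(t_r, t_s) / 2

def est_difference(t_r, t_s):
    # R - S has between T(R) - T(S) and T(R) tuples; average the extremes.
    return t_r - t_s / 2

print(est_set_union(1000, 500))     # 1250.0
print(est_intersection(1000, 500))  # 250.0
print(est_difference(1000, 500))    # 750.0
```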
51. Estimating size of duplicate
elimination
• Duplicate elimination δ(R)
– At most, the size equals T(R) (no duplicates at all)
– It cannot exceed the number of distinct tuples possible
– Estimate: min(T(R)/2, product of all the V(R,a))
• Grouping and aggregation
– Estimated similarly to duplicate elimination: one tuple per group.
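The duplicate-elimination estimate above can be written as a one-liner (illustrative names; `v_values` is the list of V(R,a) over all attributes a):

```python
from math import prod

def est_distinct(t_r, v_values):
    """delta(R): estimate min(T(R)/2, product of all V(R, a))."""
    return min(t_r / 2, prod(v_values))

# T(R) = 10000, with V(R,a) = 50 and V(R,b) = 4 for the two attributes:
# at most 50 * 4 = 200 distinct tuples are even possible.
print(est_distinct(10000, [50, 4]))  # 200
```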
52. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan selection
53. Cost-based Plan Selection
• The cost of an expression is approximated by the number of
disk I/Os, which in turn is influenced by
– Logical operators chosen to implement the query
– Size of intermediate relations
– Physical operators used to implement the logical operators
– Ordering of joins
– The method of passing arguments from one physical
operator to the next.
54. Obtaining Estimates for Size
Parameters
• DBMS can compute the following by scanning R
– T(R), V(R, a) for each a
• The DBMS can compute a histogram of the values of each
attribute (the number of tuples per value or per range)
– Equal width
– Equal height
– Most frequent values
• With histograms, the size of joins can be estimated
more accurately.
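To make the histogram idea concrete, here is a sketch of an equal-width histogram and how it refines a selection estimate compared with the uniform T(R)/V(R,a) rule (the data and function names are illustrative):

```python
from collections import Counter

def equal_width_histogram(values, lo, hi, buckets):
    """Count how many values fall into each of `buckets` equal-width ranges."""
    width = (hi - lo) / buckets
    hist = Counter()
    for v in values:
        b = min(int((v - lo) / width), buckets - 1)
        hist[b] += 1
    return hist, width

# Attribute "age" of some relation, with values clustered at the low end.
ages = [21, 22, 22, 23, 35, 36, 58, 59, 60, 61]
hist, width = equal_width_histogram(ages, 20, 70, 5)

# Estimate sigma_{20 <= age < 30}(R): read the first bucket's count
# directly instead of assuming the 10 tuples spread uniformly over [20, 70).
print(hist[0])  # 4
```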
57. Incremental Computation of Statistics
• Statistics are updated whenever a table is updated.
• T(R): add one whenever a tuple is inserted.
• If there is a B-tree index, we can estimate by counting
number of blocks (3/4 full).
• If there is an index on “a”, we can estimate V(R,a)
– Increment the value when the tuple is inserted
• If a is a key for R, V(R,a)=R
• Sampling can be used to estimate V(R,a).
58. Approaches to Enumerating Physical Plans
• Each possible physical plan is assigned an estimated
cost, and the one with the smallest cost is selected.
• Two approaches
– Top-down: we work down from the root of the tree. For
each possible implementation of the operation at the root,
we consider each possible way to evaluate its arguments,
and compute the cost of each combination, taking the best.
– Bottom-up: For each sub-expression E of the logical query
plan tree, we compute the costs of all possible ways to
compute the sub-expression. Similarly, sub expressions of
E are computed.
• In practice there is not much difference between the two
• Techniques for the bottom-up strategy are well
developed (e.g., System R)
59. Heuristic selection
• Greedy heuristic for joins
– Join the pair of relations whose result has the smallest size, and
repeat the process.
• Some common heuristics
– If the logical plan calls for σA=c(R), and the stored relation R has an
index on attribute A, perform an index-scan.
– If the selection involves A=c and other conditions, implement it by an
index-scan on A followed by a filter for the remaining conditions.
– If an argument of a join has an index on the join attributes, use an
index-join with that relation in the inner loop.
– If one argument of a join is sorted on the join attributes, prefer a sort-
join to a hash-join.
– When computing the union or intersection of three or more relations,
group the smallest relations first.
60. Branch and bound plan enumeration
• Use a good heuristic to find a good physical plan for
the entire logical plan. Let the cost of this plan be C.
– When we consider other plans for sub-queries, we can
eliminate any plan whose cost exceeds C.
– If we construct a complete plan with cost less than C, we
replace C with that cost.
• If C is already small, we can stop searching for more plans.
• If C is large, keep searching for alternative plans.
61. Hill Climbing
• Start with a heuristically selected physical plan
– Make small changes to the plan, e.g., replacing one
method for an operator by another, or reordering
joins using the commutative and associative laws,
to find nearby plans with lower cost.
62. Dynamic Programming and Selinger Style
• Dynamic programming
– Keep for each sub-expression only the plan of least
cost. (We will discuss this later.)
• Selinger-style optimization
– Improves upon dynamic programming
– It keeps the plan of least cost, and also plans of
higher cost that produce their result in a sorted
order, since that order may pay off in a later join or sort.
63. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan selection
64. Choosing an Order of Joins
• Selecting a join order of two or more relations
• One-pass join
– The left argument (the build relation) fits in main
memory; the right argument (the probe relation) is
read from disk one block at a time
• Nested-loop join: the left argument is the relation
of the outer loop.
• Index-join: the right argument has the index.
65. Join Trees
• These algorithms work better when the left argument is
the smaller relation.
• When a join involves more than two relations,
the number of possible join trees grows rapidly:
• with n relations there are n! orderings of the leaves alone.
66. Advantages of left-deep Join Trees
• The number of left-deep trees with a given number
of leaves is large, but not nearly as large as the
number of all trees.
• Left-deep joins interact well with common join
algorithms.
– Nested-loop joins and one pass join algorithms.
67. Advantage of left deep join trees…
• If one-pass joins are used, and the build
relation is on the left, the amount of main
memory needed at any one time tends to be
smaller than if we used a right-deep tree or a
bushy tree over the same relations.
• If we use nested-loop joins, implemented by
iterators, then we avoid having to construct
any intermediate relation more than once.
68. Example
• Left-deep tree
– Left arguments are kept in memory
– We compute R ⋈ S and keep the result in main
memory
• We need B(R) + B(R ⋈ S) buffers. If R is small, this is no
problem.
• After computing R ⋈ S, the buffers used by R are no longer
needed and can be used to hold the result of (R ⋈ S) ⋈ T.
• Right-deep tree
– Load R into buffers
• S, T and U must also be in memory for their join computations.
– So we must hold "n-1" relations in main memory at once.
• With iterators, a right-deep join requires more disk I/Os than
a left-deep join.
69. Dynamic Programming to select a join
order
• We have three choices:
– Consider all join orders
– Consider a subset of them
– Use a heuristic to pick one.
70. Consider them all
• Suppose we want to join R1, R2,…,Rn.
• Estimate size of join of these relations.
• Compute the least cost of computing the join of these
relations.
– Estimate the cost of disk I/Os.
• Compute the expression that yields least cost.
– The expression joins the set of relations in question, with
some grouping.
• We can restrict to left-deep expressions.
71. Consider them all
• One relation: the formula is R itself; the cost is zero, and there are no
intermediate relations.
• Two relations {Ri, Rj}: also easy to compute. The cost is 0 (no intermediate
relations).
– The formula is Ri ⋈ Rj or Rj ⋈ Ri.
– The size estimate follows the join-size rule applied to Ri and Rj.
– Take the smaller of Ri and Rj as the left argument.
• More than two relations (a set r of relations)
– Consider computing r − {R} and then joining with R
– The cost of the join is the cost for r − {R} plus the size of that intermediate
result.
– The best expression for r has the best join expression for r − {R} as the left
argument of the final join, and R as the right argument.
• More generally, we consider all ways to partition r into disjoint sets R1 and
R2. For each partition, we consider the sum of
– The best costs of R1 and R2
– The sizes of the intermediate results for R1 and R2.
• See the Example in the book.
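The dynamic program above, restricted to left-deep trees, can be sketched as follows. The relation names, their sizes, and the toy "keep 1% of the product" size model are all illustrative, as is charging each join the size of its result:

```python
from itertools import combinations

sizes = {"R1": 1000, "R2": 100, "R3": 500}   # T(Ri) for each relation

def join_size(t_left, rel):
    # Toy size model: assume each join keeps 1% of the product.
    return t_left * sizes[rel] * 0.01

# best[S] = (cost, result size, left-deep order) for joining the set S.
# Base case: a single relation has cost 0 and no intermediate result.
best = {frozenset([r]): (0, sizes[r], (r,)) for r in sizes}

for k in range(2, len(sizes) + 1):
    for subset in combinations(sizes, k):
        s = frozenset(subset)
        cands = []
        for r in subset:                     # r is the rightmost relation
            cost, size, order = best[s - {r}]
            new_size = join_size(size, r)
            # Cost = best cost of the left part + size of this join's result.
            cands.append((cost + new_size, new_size, order + (r,)))
        best[s] = min(cands)

print(best[frozenset(sizes)][2])  # ('R2', 'R3', 'R1')
```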
72. Dynamic programming with more
Detailed Cost Functions
• Selinger-style optimization
– Compute the least cost of joining R1 and R2 by
considering the best available algorithms:
• one-pass or multi-pass
• hash-join
• index-join
73. Greedy Algorithm for
selecting join order
• Dynamic programming requires an exponential
number of calculations as a function of the number of
relations joined.
• To reduce the time, use a join-order heuristic.
– Build a left-deep tree
– Make one decision at a time, and never backtrack
– Among all the relations not yet included in the current
tree, choose the one that yields the least cost.
• This may not find the optimum.
– Main heuristic:
• Keep the intermediate relations small.
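The greedy heuristic above can be sketched in a few lines: always extend the left-deep tree with the relation whose join gives the smallest intermediate result. The relation names and the toy size model are illustrative:

```python
sizes = {"R1": 1000, "R2": 100, "R3": 500, "R4": 50}   # T(Ri)

def join_size(t_left, rel):
    return t_left * sizes[rel] * 0.01   # toy model: each join keeps 1%

remaining = set(sizes)
# Start with the smallest relation as the leftmost leaf.
current = min(remaining, key=sizes.get)
remaining.remove(current)
order, size = [current], sizes[current]

while remaining:
    # Greedy step: pick the relation whose join with the current
    # intermediate result is smallest; never backtrack.
    nxt = min(remaining, key=lambda r: join_size(size, r))
    size = join_size(size, nxt)
    remaining.remove(nxt)
    order.append(nxt)

print(order)  # ['R4', 'R2', 'R3', 'R1']
```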
74. Completing the physical plan selection
• Use techniques similar to those for joins for union, intersection,
or any other associative/commutative operations.
• So far we have completed the following steps
– Input: a query
– An initial logical query plan
– An improved logical query plan, via transformations
– Selected join orders
– More steps are still required.
75. Further Steps to Complete
Physical Query Plan
• Select algorithms to implement the operations
of the query plan.
• Decide which intermediate results will be
materialized (created and stored on disk), and which
will be pipelined (created only in main memory).
• Use a notation for physical-plan operators that includes
the details of access methods for stored relations
and of the algorithms implementing the relational-
algebra operators.
76. Outline
• Parsing
• Algebraic laws for improving query plan
• From Parse Trees to Logical Query Plans
• Estimating the cost of operations
• Cost-based plan selection
• Choosing the order of joins
• Completing the physical query plan
selection
77. Choosing a selection method
• Choosing a selection operator
– The cost of the table-scan algorithm with a filter step is
• B(R) if the table is clustered
• T(R) if the table is not clustered
– The cost of an index-scan for an equality selection is
• B(R)/V(R,a) if the index is clustering
• T(R)/V(R,a) if the index is not clustering
– The cost for an inequality term such as
b<20 is
• B(R)/3 if the index is clustering
• T(R)/3 if the index is not clustering
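These cost rules can be collected into a small sketch (illustrative helper names; B(R), T(R), V(R,a) as defined earlier):

```python
def cost_table_scan(b_r, t_r, clustered):
    """Table scan with a filter: B(R) if clustered, T(R) if not."""
    return b_r if clustered else t_r

def cost_index_equality(b_r, t_r, v_r_a, clustering_index):
    # Clustering index: the matching tuples occupy about B(R)/V(R,a) blocks.
    # Non-clustering index: roughly one disk I/O per matching tuple, T(R)/V(R,a).
    return b_r / v_r_a if clustering_index else t_r / v_r_a

def cost_index_inequality(b_r, t_r, clustering_index):
    # Assume one third of the tuples qualify (e.g. b < 20).
    return b_r / 3 if clustering_index else t_r / 3

# Example: B(R) = 1000 blocks, T(R) = 20000 tuples, V(R,a) = 100.
print(cost_index_equality(1000, 20000, 100, clustering_index=True))   # 10.0
print(cost_index_equality(1000, 20000, 100, clustering_index=False))  # 200.0
```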
78. Choosing a Join Method
• If the size estimates or the number of available buffers is unknown
– Call a one-pass join, hoping that the buffer manager has enough buffers.
Alternatively, choose a nested-loop join, which still behaves reasonably
if the left argument cannot be granted enough buffers.
– A sort-join is a good choice when
• one argument is already sorted on its join attributes, or there are two or
more joins on the same attribute:
– (R(a,b) ⋈ S(a,c)) ⋈ T(a,d): reuse the sorted result for the next join
– If there is an opportunity to use an index, use an index-join
– If there is no opportunity to use already-sorted relations or indexes, and
a multi-pass join is needed, hashing is the best choice, because the
number of passes depends on the size of the smaller argument.
79. Pipelining versus materialization
• Naïve way (materialization)
– Compute each operation only after completing the
underlying operation.
• The intermediate result is materialized to disk.
• Pipelining
– The tuples produced by an operator are passed directly to the
next higher-level operator.
– No disk I/O for intermediate results, but thrashing might occur
because several operations share main memory.
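Pipelining is naturally expressed with iterators: each operator pulls tuples from its child one at a time, so no intermediate relation is ever fully built. A minimal sketch, using Python generators to stand in for GetNext (the operator names and data are illustrative):

```python
def table_scan(rows):
    for row in rows:          # in a real system: read blocks from disk
        yield row

def select(child, pred):
    for row in child:
        if pred(row):
            yield row         # pass the tuple up without materializing

def project(child, cols):
    for row in child:
        yield tuple(row[c] for c in cols)

R = [("cat", 10), ("dog", 30), ("bat", 50)]
# sigma_{b >= 30} then pi_{a}, fully pipelined: no intermediate
# result is stored; tuples flow one at a time through the plan.
plan = project(select(table_scan(R), lambda r: r[1] >= 30), [0])
print(list(plan))  # [('dog',), ('bat',)]
```

Materialization would instead be `list(...)` at each level, writing each intermediate result out before the parent starts consuming it.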
80. Notations for Physical Query Plan
• Operators for leaves
– TableScan(R): all blocks holding tuples of R are read in
arbitrary order.
– SortScan(R,L): tuples of R are read in order, sorted
according to the attribute list L.
– IndexScan(R,C): C is a condition of the form Aθc, where θ is a
comparison operator; a hash or B-tree index on A is used.
– IndexScan(R,A): A is an attribute of R. The relation R is
retrieved via an index on R.A.
• Physical operators for selection
– Replace σC(R) by Filter(C), which uses TableScan(R) or
SortScan(R,L) to access R.
81. Notations..
• Physical sort operators
– SortScan(R,L): already introduced.
– Sort(L): sort R based on the attribute list L.
• All operations are replaced by suitable physical operators.
Other operations can be designated as follows.
– The operation: join, grouping, etc.
– The necessary parameters, e.g., the condition of a theta-join or the list
of elements in a grouping
– The general strategy of the algorithm: sort-based, hash-based, or, for
some joins, index-based
– One-pass, two-pass, or multi-pass
– The anticipated number of buffers the operation requires.
82. Ordering of Physical Plan Operators
• The physical query plan is represented as a tree
– Break the tree into subtrees at each edge that represents
materialization. The subtrees will be executed one at a
time.
– Execute the operations left to right
• Perform a preorder traversal of the tree.
– Execute the nodes within each subtree as a network of iterators.
The GetNext calls determine the order of events.
• The result is executable code.