The document discusses query optimization in database management systems. It covers converting SQL queries to logical and physical query plans, improving logical plans through algebraic transformations, and choosing the optimal physical query plan by considering the order of operations and join trees. The goal is to select the most efficient physical plan by estimating the size of relations and intermediate results.
The document discusses various steps and algorithms for processing database queries. It covers parsing and optimizing queries, estimating query costs, and algorithms for operations like selection, sorting, and joins. Selection algorithms include linear scans, binary searches, and using indexes. Sorting can use indexes or external merge sort. Join algorithms include nested loops, merge join, and hash join.
The document discusses Oracle system catalogs which contain metadata about database objects like tables and indexes. System catalogs allow accessing information through views with prefixes like USER, ALL, and DBA. Examples show how to query system catalog views to get information on tables, columns, indexes and views. Query optimization and evaluation are also covered, explaining how queries are parsed, an execution plan is generated, and the least cost plan is chosen.
Hey friends, here is my "query tree" assignment. :-) I searched a lot to put this piece together, and I can guarantee that, In Sha ALLAH, it will help you more than any other document on the subject. Have a good day :-)
The document discusses cost estimation in query optimization. It explains that the query optimizer should estimate the cost of different execution strategies and choose the strategy with the minimum estimated cost. The cost functions used are estimates and depend on factors like selectivity. The main cost components include access cost to storage, storage cost, computation cost, memory use cost, and communication cost. For different types and sizes of databases, the emphasis may be on minimizing different cost components, such as access cost for large databases. The document provides examples of cost functions for select and join operations that consider factors like index levels, block sizes, and selectivity.
Query processing and Query Optimization – Niraj Gandha
This presentation on query processing and query optimization was prepared with considerable effort, using the most basic and fundamental examples and topics for the explanation.
The document discusses query optimization in databases. Query optimization is the process of selecting the most efficient query evaluation plan to minimize costs and maximize performance. An optimized query will be executed faster using less system resources. Key factors considered during optimization include join size estimation, estimating the number of distinct values, and catalog information about relations.
The document discusses query processing and optimization. It describes several key activities in query processing including translating queries to a format executable by the database, applying optimization techniques, and evaluating the queries. It then provides details on three specific operations: selection using linear searches and indices, sorting, and join operations. It explains different algorithms for implementing each operation and factors to consider when choosing algorithms such as indexing and data sizes.
The document discusses algorithms and techniques for query processing and optimization in relational database management systems. It covers translating SQL queries into relational algebra, algorithms for operations like selection, projection, join and sorting, using heuristics and cost estimates for optimization, and an overview of query optimization in Oracle databases.
The document outlines the key phases and concepts in query optimization: 1) Parsing the SQL query into an internal representation like a query tree, 2) Applying transformation rules to put the query in canonical form, 3) Estimating the costs of different execution plans, and 4) Selecting the lowest cost plan. Key topics covered include relational algebra trees, transformation rules, heuristic strategies like pushing down selections, and using statistics and cost models to choose the most efficient query execution plan.
The document discusses query optimization in database management systems. It describes the steps in cost-based query optimization including parsing, transformation, implementation, and plan selection based on cost estimates. It provides an example of projections and how the estimated storage requirements would change based on eliminating a column. It also discusses how queries interact with a DBMS and the differences between interactive users and embedded queries.
The document discusses distributed query processing and optimization in distributed database systems. It covers topics like query decomposition, distributed query optimization techniques including cost models, statistics collection and use, and algorithms for query optimization. Specifically, it describes the process of optimizing queries distributed across multiple database fragments or sites including generating the search space of possible query execution plans, using cost functions and statistics to pick the best plan, and examples of deterministic and randomized search strategies used.
This document outlines algorithms for query processing and optimization in database systems. It discusses translating SQL queries to relational algebra, algorithms for sorting and joining large datasets that exceed available memory, including nested loop joins, sort-merge joins, and hash joins. It also describes query optimization techniques and factors that influence query performance.
The document discusses query optimization by describing how a database system estimates the cost of different query evaluation plans using statistical information about relations. It covers topics like estimating the size of selections, joins, aggregations and other operations to choose the lowest cost plan using transformations and equivalence rules.
How does the query planner in PostgreSQL work? Index access methods, join execution types, aggregation and pipelining. Optimizing queries with WHERE conditions, ORDER BY and GROUP BY. Composite indexes, partial and expression indexes. Exploiting assumptions about the data, and denormalization.
The document discusses query processing and optimization. It defines query processing as translating a query into low-level activities like evaluation and data extraction. Query optimization aims to select the most efficient query evaluation plan. The key steps in query processing are parsing, translating to relational algebra, creating evaluation plans, optimization to find the best plan, and executing the plan. Optimization techniques include heuristic-based and cost-based approaches. Heuristic rules are used to modify the query representation to improve performance. Cost-based optimization estimates the costs of different plans and selects the lowest cost plan.
This document discusses distributed database and distributed query processing. It covers topics like distributed database, query processing, distributed query processing methodology including query decomposition, data localization, and global query optimization. Query decomposition involves normalizing, analyzing, eliminating redundancy, and rewriting queries. Data localization applies data distribution to algebraic operations to determine involved fragments. Global query optimization finds the best global schedule to minimize costs and uses techniques like join ordering and semi joins. Local query optimization applies centralized optimization techniques to the best global execution schedule.
This document discusses distributed query processing and optimization. It covers query processing methodology which includes query decomposition, data localization, and global query optimization. Query decomposition takes a high-level query and breaks it down into fragments. Data localization determines which data fragments are involved. Global query optimization finds the most efficient execution plan by considering costs of operations and minimizing communication. The goal is to optimize queries running across distributed data in a network.
The document discusses query optimization in databases. It explains that the goal of query optimization is to determine the most efficient execution plan for a query to minimize the time needed. It outlines the typical steps in query optimization, including parsing/translation, applying relational algebra, and optimizing the query plan. It also discusses techniques like generating alternative execution plans using equivalence rules, estimating plan costs based on statistical data, and using heuristics or dynamic programming to choose the optimal plan.
Query Decomposition and data localization – Hafiz Faiz
This document discusses query processing in distributed databases. It describes query decomposition, which transforms a high-level query into an equivalent lower-level algebraic query. The main steps in query decomposition are normalization, analysis, redundancy elimination, and rewriting the query in relational algebra. Data localization then translates the algebraic query on global relations into a query on physical database fragments using fragmentation rules.
The document discusses query optimization in database systems. It covers generating logically equivalent query expressions using equivalence rules, estimating the cost of different query evaluation plans using statistical information about relations, and using dynamic programming to choose the lowest-cost evaluation plan through cost-based optimization. The goal of query optimization is to select the most efficient way to evaluate a given query by considering different algorithms and join orders.
The document discusses query processing and optimization. It describes the basic concepts including query processing, query optimization, and the phases of query processing. It also explains relational algebra operations like selection, projection, joins, and additional operations. The document then covers topics like query decomposition, analysis, normalization, simplification, and restructuring during query optimization. It discusses cost estimation and algorithms for implementing relational algebra operations and file organization.
Different algorithms can be used to implement joins in a database, including nested loop, block nested loop, indexed nested loop, merge, and hash joins. The optimal algorithm depends on factors like whether indexes are available on the joined attributes and the relative sizes and block distributions of the relations. Database tuning involves monitoring performance and adjusting aspects like indexes, queries, and design to improve response times and throughput.
The document discusses techniques used by a database management system (DBMS) to process, optimize, and execute high-level queries. It describes the phases of query processing which include syntax checking, translating the SQL query into an algebraic expression, optimization to choose an efficient execution plan, and running the optimized plan. Query optimization aims to minimize resources like disk I/O and CPU time by selecting the best execution strategy. Techniques for optimization include heuristic rules, cost-based methods, and semantic query optimization using constraints.
This document discusses query processing in a database system. It covers parsing queries, optimization to choose the most efficient evaluation plan, and executing the plan. Query optimization aims to minimize costs like I/O by choosing plans with the lowest estimated execution time. The document describes different algorithms for operations like selection, sorting, joins, and expression evaluation, and how equivalence rules and heuristics can transform queries into more efficient forms.
Query Processing: Query Processing Problem, Layers of Query Processing. Query Processing in Centralized Systems – Parsing & Translation, Optimization, Code Generation, Example. Query Processing in Distributed Systems – Mapping Global Query to Local, Optimization.
This document discusses query processing in a database system. It describes the basic steps of query processing as parsing and translation, optimization, and evaluation. For optimization, it explains that a relational algebra expression can be evaluated in many ways and the goal is to choose the plan with the lowest estimated cost. It then covers algorithms for common relational operations like selection, sorting, and join and how they are implemented, including using indexes. The overall focus is on analyzing the costs of different algorithms and implementations.
This document provides an overview of data structures and algorithms. It discusses key concepts like interfaces, implementations, time complexity, space complexity, asymptotic analysis, and common control structures. Some key points:
- A data structure organizes data to allow for efficient operations. It has an interface defining operations and an implementation defining internal representation.
- Algorithm analysis considers best, average, and worst case time complexities using asymptotic notations like Big O. Space complexity also measures memory usage.
- Common control structures include sequential, conditional (if/else), and repetitive (loops) structures that control program flow based on conditions.
Query Processing and Optimisation - Lecture 10 - Introduction to Databases (1... – Beat Signer
This document discusses query processing and optimization in databases. It covers the basic steps of query processing including parsing, optimization, and evaluation. It also describes different algorithms for query operations like selection, join, and sorting that are used to process queries efficiently. The goals of query optimization are to select the most efficient query execution plan based on the given data and minimize the number of disk accesses.
The document discusses different ways to represent hierarchical data structures including classical node-link diagrams, nested sets, layered "icicle" diagrams, outlines, tree views, and nested parenthesis notation. Examples of each method are given showing a family tree with Thess-Haydee at the top and relationships between Resty, Allysa, Dharell, Myla and Mikan.
SQL (Structured Query Language) is a standard language for database access and management, according to the American National Standards Institute.
(by QATestLab)
Query Optimization with MySQL 5.6: Old and New Tricks - Percona Live London 2013 – Jaime Crespo
Tutorial delivered at Percona MySQL Conference Live London 2013.
It doesn't matter what new SSD technologies appear or what the latest breakthroughs in flushing algorithms are: the number one cause of slow MySQL applications is poor execution plans for SQL queries. While the latest GA version provides a large number of transparent optimizations, especially for JOINs and subqueries, it is still the developer's responsibility to take advantage of all the new MySQL 5.6 features.
In this tutorial we present attendees with a sample PHP application that has poor response time. Through practical examples, we suggest step-by-step strategies to improve its performance, including:
* Checking MySQL & InnoDB configuration
* Internal (performance_schema) and external tools for profiling (pt-query-digest)
* New EXPLAIN tools
* Simple and multiple column indexing
* Covering index technique
* Index condition pushdown
* Batch key access
* Subquery optimization
This document discusses database performance tuning and query optimization. It covers basic concepts like minimizing I/O operations to improve performance. The key phases of query processing are parsing, execution, and fetching. During parsing, the optimizer determines the most efficient execution plan. Indexes are also important for performance, as they allow more efficient data access than scanning rows. The document provides an overview of how a database management system processes queries and some common techniques for performance tuning.
What Your Database Query is Really Doing – Dave Stokes
Do you ever wonder what your database server is REALLY doing with that query you just wrote? This is a high-level overview of the process of running a query.
Database performance tuning and query optimization – Dhani Ahmad
Database performance tuning involves activities to ensure queries are processed in the minimum amount of time. A DBMS processes queries in three phases - parsing, execution, and fetching. Indexes are crucial for speeding up data access by facilitating operations like searching and sorting. Query optimization involves the DBMS choosing the most efficient plan for accessing data, such as which indexes to use.
The document provides an overview of database architecture and basic concepts such as what a database is, structured query language (SQL), and stored procedures. A database allows for structured storage and retrieval of complex data. SQL is used to manipulate and retrieve data from databases. Stored procedures are programs stored in databases that perform specific tasks like validating arguments. They provide benefits like improved performance and protection of database integrity.
This document provides an overview of data modeling and SQL. It introduces key concepts in relational databases including relations, schemas, tuples, domains, keys, and referential integrity. It also describes the relational data model including the structure of relations, attributes, and relation instances. Finally, it covers the relational algebra including operations like select, project, join, union, difference, and rename that form the basis for SQL queries. The document uses examples from a banking domain to illustrate these concepts.
relational algebra and calculus queries.ppt – ShahidSultan24
This document provides an overview of key concepts in the relational model of databases, including:
- Relations are represented as tables with rows (tuples) and columns (attributes). The order of tuples and attributes is not important.
- A relational database contains multiple relations that each store a part of the overall database information. Keys are used to identify unique tuples.
- Relational algebra defines operations like select, project, join, and set operations that can be used to query and manipulate relations. Operations take relations as input and produce new relations as output.
The document discusses the relational model for databases including:
1) The structure of relational databases including relations, tuples, attributes, domains, relation schemas, and relation instances.
2) Relational algebra which is a procedural query language using operators like select, project, join, and set operations.
3) Additional concepts like keys, normalization, and an example banking schema to demonstrate relational queries.
The document discusses the process of query compilation in a database management system. It involves 6 main steps: 1) Parsing the SQL query into a parse tree, 2) Converting the parse tree into a logical query plan, 3) Optimizing the logical query plan by applying transformation rules, 4) Estimating the sizes of results from operations in the logical query plan, 5) Generating multiple physical query plans from the logical plan, and 6) Estimating the costs of physical plans and selecting the most efficient plan to execute. The document focuses on relational algebra rules for optimization and techniques for estimating result sizes of operations like selections, joins, and projections.
The document discusses query optimization in database systems. It covers topics like:
- Estimating the cost of different query evaluation plans using statistical information about relations.
- Transforming relational expressions using equivalence rules to generate logically equivalent expressions with different evaluation orders.
- Choosing the lowest cost plan based on cost estimates to optimize query evaluation.
The document describes the relational model for relational databases. It discusses the structure of relational databases including relations, tuples, attributes, domains, keys and relation schemas. It also describes the relational algebra query language including operators like select, project, join, union and set differences. Examples are provided to illustrate how to write queries using these operators to retrieve and manipulate data from relations that model real-world entities and relationships, like customers, accounts and loans in a banking example.
The document provides an overview of the relational model and relational algebra used in relational databases. It defines key concepts like relations, tuples, attributes, domains, schemas, instances, keys, and normal forms. It also explains the six basic relational algebra operations - select, project, union, difference, cartesian product, and rename - and how they can be composed to form complex queries. Examples of relations and queries involving operations like selection, projection, joins are provided to illustrate relational algebra.
What is the Relational model
Characteristics
Relational constraints
Representation of schemas
Characteristics and constraints of the Relational model with proper examples
Updates and dealing with constraint violations in the Relational model
Lecture 06 relational algebra and calculus – emailharmeet
The document discusses data manipulation languages (DML) for databases. There are two main types of DML: navigational/procedural and non-navigational/non-procedural. Relational algebra is a non-navigational DML defined by Codd that uses algebraic operations like selection, projection, join, etc. on tables. Relational calculus is also a non-navigational DML that defines new relations in terms of predicates on tuple variables ranging over named relations.
The document outlines various statistical and data analysis techniques that can be performed in R including importing data, data visualization, correlation and regression, and provides code examples for functions to conduct t-tests, ANOVA, PCA, clustering, time series analysis, and producing publication-quality output. It also reviews basic R syntax and functions for computing summary statistics, transforming data, and performing vector and matrix operations.
This document provides an overview of the R programming language and environment. It discusses why R is useful, outlines its interface and workspace, describes how to access help and tutorials, install packages, and input/output data. The interactive nature of R is highlighted, where results from one function can be used as input for another.
relational model in Database Management.ppt – Roshni814224
This document provides an overview of the relational model used in database management systems. It discusses key concepts such as:
- Relations, which are sets of tuples that represent entities and relationships between entities.
- Relation schemas that define the structure of relations, including the attributes and their domains.
- Keys such as candidate keys and foreign keys that uniquely identify tuples and define relationships between relations.
- Relational algebra, which consists of operators like select, project, join, and set operations to manipulate and query relations.
- An example banking schema is presented to demonstrate these concepts.
The document provides an overview of the relational model used in database management systems. It defines key concepts like relations, attributes, tuples, domains, schemas, keys, and foreign keys. It also describes common relational algebra operations like select, project, join, union, and set differences. Examples are provided to illustrate how these operations work on relations. Additional topics covered include query languages, normalization, and modeling a banking example database using these concepts.
This document discusses query languages and relational algebra operations. It introduces relational algebra as a procedural query language. The basic relational algebra operations are selection, projection, union, set difference, cartesian product, and rename. Examples are provided to illustrate each operation. Additional operations like join, outer join, division and aggregation are also discussed. The document concludes with a discussion of database modification operations like deletion, insertion and updating.
This document discusses the architecture and optimization of database management systems (DBMS). It covers:
1) The main components of a DBMS architecture including the query executor, buffer manager, storage manager, transaction manager, and more.
2) Query optimization techniques including rule-based optimization, cost-based optimization using a dynamic programming algorithm to search the plan space, and reducing the plan space.
3) Cost estimation including estimating selectivity factors, output sizes, and costs of different query execution plans without executing them.
This document summarizes key concepts relating to the relational model in database systems, including:
- The structure of relational databases and examples of relations and tuples.
- Relational algebra as a procedural query language consisting of operations like selection, projection, join, union and set differences.
- Nonprocedural query languages like tuple and domain relational calculus.
- Examples of relational algebra operations like selection, projection, join, union and set differences applied to relations.
The document proposes an automated approach for ranking tuples in the results of SQL queries over databases. It computes global and conditional scores for tuples based on attribute correlations learned from past query workloads and data statistics. At query time, it merges pre-computed ranked lists corresponding to the query attributes to efficiently retrieve the top-k results without a full table scan. Experiments on real datasets show the approach is efficient and provides high quality rankings preferred by users over alternative methods.
This document discusses query languages in database management systems. It covers the main categories of query languages: procedural languages like relational algebra, and non-procedural languages like tuple and domain relational calculus. Relational algebra operators like selection, projection, union, and join are defined. Example queries are provided in both relational algebra and relational calculus formats. Functional dependencies, candidate keys, and the closure of attribute sets under a set of functional dependencies are also explained.
OpenLSH - a framework for locality sensitive hashing – J Singh
The document discusses limitations of the k-means clustering algorithm and proposes alternatives like locality-sensitive hashing (LSH) for clustering large document collections. LSH hashes documents into "buckets" based on similarity so that similar documents are hashed to the same buckets, allowing efficient retrieval of nearest neighbors. The document demonstrates LSH using minhashing, which represents documents as sets of "shingles" or fragments, and hashes the minimum value found. It also describes an open-source implementation of LSH called OpenLSH that works with large-scale databases like Cassandra.
Analytics methods for big data have two requirements above and beyond analytics methods for normal-sized data. First, the analytics can not assume that all the data will fit in memory, or even fit on one server. Second, the choice of analysis methods must avoid high-order algorithms. We illustrate the point with one algorithm: Locality Sensitive Hashing
This document discusses using locality sensitive hashing (LSH) to enable large-scale similarity searches of massive datasets. LSH works by hashing similar objects into the same "buckets", allowing efficient discovery of similar items by only comparing objects within a small number of buckets. The document outlines how LSH could be used to find similar users on Facebook based on shared interests, and describes OpenLSH, an open-source Python framework for implementing LSH on Google App Engine using a MapReduce architecture.
The document discusses Google App Engine, a platform as a service (PaaS) that allows developers to focus on development rather than operations. It presents Google App Engine as having a "virtual raised floor" with the IDE above the floor and website/deployment system below. It provides an overview of Google App Engine's history and supported languages/data stores. It also summarizes code samples for a guestbook application and MapReduce workflow on Google App Engine.
Mining of massive datasets using locality sensitive hashing (LSH) – J Singh
This document discusses using locality sensitive hashing (LSH) to solve large-scale search problems by clustering similar data points together. It presents an example of using LSH to find Facebook friends with similar interests. The key steps are: (1) representing each user as a vector of interests and computing minhashes, (2) clustering users into buckets based on minhash similarity, and (3) comparing a candidate to others in their bucket to find nearest neighbors. The performance of LSH involves tuning parameters like the number of minhashes and bands to balance false positives and negatives. Implementing LSH on MapReduce can make it scalable to large datasets.
Data Analytic Technology Platforms: Options and Tradeoffs – J Singh
This document discusses options for data analytic technology platforms to address big data problems. It begins by distinguishing between problems that truly involve big data versus just large data problems. Examples of big data problems include recommendations, financial analysis, internet security monitoring, social media network analysis, genomics, and sensor data. The key characteristics of big data problems are that the data sets are too large to download, data is generated rapidly requiring near real-time analysis, and the problems involve diverse data types. The document then outlines the governing principle for choosing a platform as processing needing to be close to the data due to data size. Examples of platforms used for different applications are discussed to illustrate this principle. The decision making process for choosing a platform is described as
Map Reduce is a simple programming model that is well-suited for distributed computing. Hadoop is an open-source implementation of MapReduce that can run on large clusters of commodity hardware. Amazon Elastic MapReduce (EMR) provides a hosted Hadoop service that simplifies using MapReduce without needing to deploy and manage your own Hadoop cluster. The document discusses using EMR to analyze Facebook data at scale through examples like word counting and analyzing likes.
Does it make sense to use Google App Engine as a quick prototyping environment for Big Data use cases? It would avoid all the hassles of setting up Hadoop and its bestiary.
The answer is a definite "maybe".
The document provides an overview of the Hadoop ecosystem. It introduces Hadoop and its core components, including MapReduce and HDFS. It describes other related projects like HBase, Pig, Hive, Mahout, Sqoop, Flume and Nutch that provide data access, algorithms, and data import capabilities to Hadoop. The document also discusses hosted Hadoop frameworks and the major Hadoop providers.
Receiving data from a source that produces 5-10 GBytes per hour, and presenting analysis results as the data streams in has some interesting challenges.
We used MongoDB running on Amazon EC2 to house the data, map reduce to analyze it and Django-non-rel to present the results in near-real-time.
(Slides from my presentation at MongoDB Boston)
The document discusses NoSQL databases and MapReduce. It provides historical context on how databases were not adequate for the large amounts of data being accumulated from the web. It describes Brewer's Conjecture and CAP Theorem, which contributed to the rise of NoSQL databases. It then defines what NoSQL databases are, provides examples of different types, and discusses some large-scale implementations like Amazon SimpleDB, Google Datastore, and Hadoop MapReduce.
The document summarizes topics discussed in a database management systems lecture, including concurrency control techniques like intention locks, index locking, optimistic concurrency control using validation, and timestamp ordering algorithms. It also discusses multi-version concurrency control and challenges with commit in distributed databases using two phase commit and the Paxos algorithm. The lecture covers lock-based and optimistic approaches to concurrency control and managing concurrent transactions in a database system.
This document discusses database recovery techniques including undo logging, redo logging, and undo/redo logging.
Undo logging involves writing enough information to the log to allow rolling back uncommitted transactions after a failure. Redo logging writes log records to allow reapplying committed transactions not yet written to disk.
Undo/redo logging combines these approaches by writing both old and new values to the log, allowing flexible flushing of data pages before or after commit. It uses a two-pass recovery procedure of undoing uncommitted transactions followed by redoing committed ones.
Checkpoints are used to limit the portion of the log that needs to be processed during recovery by bracketing active transactions. Various checkpointing techniques, such as quiescent checkpointing, are also described.
The document discusses query execution in database management systems. It begins with an example query on a City, Country database and represents it in relational algebra. It then discusses different query execution strategies like table scan, nested loop join, sort merge join, and hash join. The strategies are compared based on their memory and disk I/O requirements. The document emphasizes that query execution plans can be optimized for parallelism and pipelining to improve performance.
CS 542 Putting it all together -- Storage Management – J Singh
The document provides an overview and plan for a lecture on database management systems. Key points include:
- By the second break, the lecture will cover storage hierarchies, secondary storage management, and system catalogs.
- After the second break, the topics will include data modeling and storage hierarchies.
- Storage hierarchies involve multiple storage levels from main memory to disk and beyond. The cost and performance of each level differs.
- Techniques like caching aim to keep frequently used data in faster storage levels like memory.
This document provides an overview of topics to be covered in a database management systems course, including parallel and distributed databases, NoSQL databases, and MapReduce. It discusses parallel databases and different architectures for distributed databases. It introduces several NoSQL databases like Amazon SimpleDB, Google BigTable, and HBase and describes their data models and implementations. It also provides details about MapReduce, including its programming model, implementation, optimizations, and statistics on its usage at Google. The next class meetings will include a mid-term exam, student presentations on assigned topics, and a proposal for each student's final project.
The document summarizes key topics in database integrity and performance, including:
- Primary and foreign key constraints to prevent duplicate and dangling tuples
- Attribute and tuple constraints to enforce data integrity
- Views to provide virtual subsets and joins of database relations
- Indexes to enable fast search through tables
The document discusses these concepts over multiple pages and provides examples to illustrate primary keys, foreign keys, constraints, views and indexing. It concludes by offering feedback on students' report proposals, emphasizing depth over breadth and a focus on design over implementation.
CS 542 Controlling Database Integrity and Performance – J Singh
This document summarizes a lecture on database integrity and performance. It discusses various techniques for ensuring database integrity, including primary key constraints to prevent duplicate tuples, foreign key constraints to prevent dangling references, and attribute constraints to prevent inconsistent attribute values. It also covers views, which allow querying virtual tables, and indexes to improve query performance by enabling faster searching. The document proposes discussing index structures and report topics at the next meeting.
This document discusses SQL queries and Datalog rules. It begins with examples of simple SQL queries on a BROWSER_TABLE relation. More complex queries are demonstrated using joins, subqueries, aggregation, and set operations. Transaction processing and ensuring isolation levels are covered. The document then introduces Datalog, a logical query language, and how its rules can extend SQL with recursion to express queries not possible in SQL alone. Key concepts in Datalog like the distinction between extensional and intensional databases, computing rules bottom-up and top-down, and ensuring safe rules are explained. Finally, examples are given of expressing Datalog rules and recursive queries using the SQL WITH clause.
1. CS 542 Database Management Systems Query Optimization J Singh March 28, 2011
2. Outline
- Convert SQL query to a parse tree: semantic checking (attributes, relation names, types)
- Convert to a logical query plan (relational algebra expression): deal with subqueries
- Improve the logical query plan: use algebraic transformations, group together certain operators, evaluate the logical plan based on estimated size of relations
- Convert to a physical query plan: search the space of physical plans, choose the order of operations, complete the physical query plan
3. Desired Endpoint: example physical query plans for σx=1 AND y=2 AND z<5(R) and for R ⋈ S ⋈ U. For the selection: IndexScan(R, y=2) followed by Filter(x=1 AND z<5). For the join: TableScan(R) and TableScan(S) combined by a two-pass hash-join (101 buffers), the result materialized and then joined with TableScan(U) by another two-pass hash-join (101 buffers).
4. Physical Plan Selection. Plan cost is governed by disk I/O, which in turn is governed by: the particular operation being performed; the size of intermediate results, as derived last week (sec 16.4 of the book); the physical operator implementation used, e.g., one- or two-pass; operation ordering, especially join ordering; and whether operation output is materialized or pipelined.
5. Index-based physical plans (p1). Selection example: what is the cost of σa=v(R), assuming B(R) = 2,000, T(R) = 100,000, V(R, a) = 20? Table scan (assuming R is clustered): B(R) = 2,000 I/Os. Index-based selection: if the index is clustering, B(R) / V(R,a) = 100 I/Os; if the index is unclustered, T(R) / V(R,a) = 5,000 I/Os. For small V(R, a), a table scan can be faster than an unclustered index, so heuristics that always pick indexed over not-indexed can lead you astray: determine the cost of both methods and let the algorithm decide.
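A minimal sketch of this cost comparison (not from the slides; the function name is illustrative) that plugs the statistics above into the three formulas:

```python
def selection_costs(B, T, V_a):
    """Estimated disk I/Os for computing sigma_{a=v}(R).

    B   -- B(R), number of blocks of R
    T   -- T(R), number of tuples of R
    V_a -- V(R, a), number of distinct values of attribute a
    """
    return {
        "table scan (R clustered)": B,
        "clustering index on a": B // V_a,
        "unclustered index on a": T // V_a,
    }

# Statistics from the slide: B(R) = 2,000, T(R) = 100,000, V(R, a) = 20.
print(selection_costs(2000, 100_000, 20))
# {'table scan (R clustered)': 2000, 'clustering index on a': 100, 'unclustered index on a': 5000}
```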
6. Index-based physical plans (p2). Example: join when S has an index on the join attribute. For each tuple in R, fetch the corresponding tuple(s) from S. Assume R is clustered. Cost: if the index on S is clustering, B(R) + T(R) · B(S) / V(S,a); if the index on S is unclustered, B(R) + T(R) · T(S) / V(S,a). Another case: when R is the output of another iterator, B(R) is accounted for in the iterator, so the cost is T(R) · B(S) / V(S,a) if the index on S is clustering, T(R) · T(S) / V(S,a) if it is unclustered, and B(S) if S is not indexed but fits in memory. There are a number of other cases.
7. Index-based physical plans (p3). Index-based join: if both R and S have a sorted index (B+ tree) on the join attribute, then perform a merge join, called a zig-zag join. Cost: B(R) + B(S).
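A simplified sketch of the merge logic behind such a join, assuming both inputs are already streamed in join-key order from their B+ tree indexes (a real zig-zag join additionally uses the indexes to skip over non-matching keys; names are illustrative):

```python
def merge_join(r_sorted, s_sorted):
    """Join two inputs already sorted on the join key.

    Each input is a list of (key, payload) pairs in ascending key order,
    as delivered by a scan of a B+ tree index on the join attribute.
    """
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        rk, sk = r_sorted[i][0], s_sorted[j][0]
        if rk < sk:
            i += 1                      # advance R past keys with no match
        elif rk > sk:
            j += 1                      # advance S past keys with no match
        else:
            # emit all pairs for this key, then move both cursors past it
            j_start = j
            while i < len(r_sorted) and r_sorted[i][0] == rk:
                j = j_start
                while j < len(s_sorted) and s_sorted[j][0] == rk:
                    out.append((r_sorted[i][1], s_sorted[j][1]))
                    j += 1
                i += 1
    return out

print(merge_join([(1, "r1"), (2, "r2")], [(2, "s1"), (2, "s2"), (3, "s3")]))
# [('r2', 's1'), ('r2', 's2')]
```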
8. Grand Summary of Physical Plans (p1): Scans and Selects. Index: N = none, C = clustering, NC = non-clustered.
9. Grand Summary of Physical Plans (p2): Joins. Index: N = none, C = clustering, NC = non-clustered. Relation fits in memory: F = yes, NF = no.
10. Physical plans at non-leaf operators (p1). What if the input of the operator is from another operator? For Select, the cost = 0: the cost of pipelining is assumed to be zero, and the number of tuples emitted is reduced. For Join, when R comes from an operator and S from a table: B(R) is accounted for in the iterator; if the index on S is clustering, the cost is T(R) · B(S) / V(S,a); if the index on S is unclustered, T(R) · T(S) / V(S,a); if S is not indexed but fits in memory, B(S); if S is not indexed and doesn't fit, k·B(S) for k chunks, or 3·B(S) for a sort- or hash-join.
11. Physical plans at non-leaf operators (p2). For Join, when R and S both come from operators, the cost depends on whether the results are sorted by the join attribute(s). If yes, we use the zig-zag algorithm and the cost is zero. Why? If either relation will fit in memory, the cost is zero. Why? At most, the cost is 2·(B(R) + B(S)). Why?
12. Example (787). Product(pname, maker), Company(cname, city). SELECT Product.pname FROM Product, Company WHERE Product.maker = Company.cname AND Company.city = “Seattle”. How do we execute this query?
13. Example (787), continued. Same query: SELECT Product.pname FROM Product, Company WHERE Product.maker = Company.cname AND Company.city = “Seattle”. Logical plan: Product(pname, maker) ⋈maker=cname σcity=“Seattle”(Company(cname, city)). Available indexes: clustering indices on Product.pname and Company.cname; unclustered indices on Product.maker and Company.city.
14. Example (787): physical plans. Physical Plan 1: index-based selection σcity=“Seattle”(Company), followed by an index-based join with Product(pname, maker) on maker = cname. Physical Plans 2a and 2b: merge-join on cname = maker of σcity=“Seattle”(Company(cname, city)), read via index-scan, with Product(pname, maker) obtained either by scan-and-sort (2a) or by an index scan (2b).
16. Final Evaluation Plan. Costs: Plan 1: T(Company) / V(Company, city) × T(Product) / V(Product, maker). Plan 2a: B(Company) + 3·B(Product). Plan 2b: B(Company) + T(Product). Which is better? It depends on the data.
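A small sketch that plugs hypothetical statistics into the three cost formulas above (the numbers are made up for illustration; only the formulas come from the slides):

```python
def plan_costs(T_comp, B_comp, V_comp_city, T_prod, B_prod, V_prod_maker):
    """Estimated I/O cost of the three physical plans for the Seattle query."""
    return {
        # Plan 1: index selection on Company.city, then index join via Product.maker
        "plan 1": (T_comp / V_comp_city) * (T_prod / V_prod_maker),
        # Plan 2a: read Company, sort Product externally, merge-join
        "plan 2a": B_comp + 3 * B_prod,
        # Plan 2b: read Company, scan Product via the unclustered maker index, merge-join
        "plan 2b": B_comp + T_prod,
    }

# Hypothetical statistics, chosen only to show that the winner depends on the data.
print(plan_costs(T_comp=5_000, B_comp=500, V_comp_city=50,
                 T_prod=100_000, B_prod=1_000, V_prod_maker=1_000))
# {'plan 1': 10000.0, 'plan 2a': 3500, 'plan 2b': 100500}  -> plan 2a wins for these numbers
```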
20. Query Optimization We have a SQL query Q. Create a plan P. Find equivalent plans P = P' = P'' = … Choose the "cheapest". HOW??
21. Logical Query Plan SELECT P.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city='seattle' AND Q.phone > '5430000' Plan: π_buyer( σ_city='seattle' ∧ phone>'5430000'( Purchase ⋈_buyer=name Person ) ). In class: find a "better" plan P' (e.g., by pushing the selections on Person below the join).
22. CS 542 Database Management Systems Query Optimization – Choosing the Order of Operations J Singh March 28, 2011
23. Outline Convert SQL query to a parse tree Semantic checking: attributes, relation names, types Convert to a logical query plan (relational algebra expression) deal with subqueries Improve the logical query plan use algebraic transformations group together certain operators evaluate logical plan based on estimated size of relations Convert to a physical query plan search the space of physical plans choose order of operations complete the physical query plan
28. But they are not equivalent from an execution viewpoint. Considerable research has gone into picking the best order for joins.
29. Join Trees R1 ⋈ R2 ⋈ … ⋈ Rn. Definitions: a plan = a join tree; a partial plan = a subtree of a join tree. (The slide shows an example join tree over R1, R2, R3, R4.)
30. Left & Right Join Arguments The argument relations in joins determine the cost of the join. In physical query plans, the left argument of the join is called the build relation; it is assumed to be smaller and is stored in main memory.
31. Left & Right Join Arguments The right argument of the join is called the probe relation; it is read a block at a time and its tuples are matched with those of the build relation. The join algorithms that distinguish between the arguments are: one-pass join, nested-loop join, index join.
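A minimal one-pass hash-join sketch in Python (illustrative; relation and key names are made up): the left (build) relation is hashed in memory, the right (probe) relation is streamed a block at a time.

from collections import defaultdict

def one_pass_join(build_rel, probe_blocks, build_key, probe_key):
    table = defaultdict(list)
    for tup in build_rel:                       # build: hash the smaller relation in memory
        table[tup[build_key]].append(tup)
    for block in probe_blocks:                  # probe: one block of the larger relation at a time
        for tup in block:
            for match in table.get(tup[probe_key], []):
                yield {**match, **tup}

R = [{"x": 1, "w": "a"}, {"x": 2, "w": "b"}]
S_blocks = [[{"x": 1, "y": 10}], [{"x": 2, "y": 20}]]
print(list(one_pass_join(R, S_blocks, "x", "x")))   # [{'x': 1, 'w': 'a', 'y': 10}, {'x': 2, 'w': 'b', 'y': 20}]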
34. Dynamic Programming Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn. Assume we have a function cost() that gives us the cost of a join tree. Find the best join tree for the query.
35. Dynamic Programming Problem Statement Given: a query R1 ⋈ R2 ⋈ … ⋈ Rn. Assume we have a function cost() that gives us the cost of a join tree. Find the best join tree for the query. Idea: for each subset of {R1, …, Rn}, compute the best plan for that subset. Algorithm: in increasing order of set cardinality, compute the cost for Step 1: {R1}, {R2}, …, {Rn}; Step 2: {R1,R2}, {R1,R3}, …, {Rn-1,Rn}; … Step n: {R1, …, Rn}. It is a bottom-up strategy. Skipping further details of the algorithm; read the book if interested. Will not be on the exam.
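A bottom-up sketch of the dynamic program (the slide skips the details; cost() is a placeholder supplied by the caller, plans are nested tuples, and no pruning of cross products is attempted):

from itertools import combinations

def best_join_tree(relations, cost):
    # best[subset] = (cheapest cost, join tree) for that subset of relations
    best = {frozenset([r]): (cost(r), r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            candidates = []
            for k in range(1, size):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    tree = (best[left][1], best[right][1])
                    candidates.append((cost(tree), tree))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(relations)]

# usage with a trivial placeholder cost function:
# print(best_join_tree(["R1", "R2", "R3"], cost=lambda tree: len(str(tree))))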
40. Outline Convert SQL query to a parse tree Semantic checking: attributes, relation names, types Convert to a logical query plan (relational algebra expression) deal with subqueries Improve the logical query plan use algebraic transformations group together certain operators evaluate logical plan based on estimated size of relations Convert to a physical query plan search the space of physical plans choose order of operations complete the physical query plan Three topics Choosing the physical implementations (e.g., select and join methods) Decisions regarding materialized vs pipelined Notation for physical query plans
41. Choosing a Selection Method Algorithm for each selection operator: 1. Can we use an existing index on an attribute? If yes, index-scan; otherwise table-scan. 2. After retrieving all tuples satisfying the condition in (1), filter them with the remaining selection conditions. In other words, when computing σ_{C1 ∧ C2 ∧ … ∧ Cn}(R), we index-scan on some Ci, then filter the result on all other Cj, where j ≠ i. The next two pages show an example where we examine several options and pick the best one.
42. Selection Method Example (p1) Selection: σ_{x=1 ∧ y=2 ∧ z<5}(R), where the parameters of R are: T(R) = 5,000, B(R) = 200, V(R, x) = 100, V(R, y) = 500. Relation R is clustered; x and y have non-clustering indices; z has a clustering index.
43. Selection Method Example (p2) Selection options: 1. Table-scan, filter on x, y, z. Cost is B(R) = 200 since R is clustered. 2. Use index on x = 1, filter on y, z. Cost is 50 since T(R) / V(R, x) = 5,000/100 = 50 tuples and x is not clustering. 3. Use index on y = 2, filter on x, z. Cost is 10 since T(R) / V(R, y) = 5,000/500 = 10 tuples and y is not clustering. 4. Index-scan on the clustering index with z < 5, filter on x, y. Cost is about B(R)/3 ≈ 67. Therefore: first retrieve all tuples with y = 2 (option 3), then filter for x and z.
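Plugging the slide's statistics into the four options (a small sketch; the B(R)/3 estimate for z < 5 is the slide's own rough selectivity guess):

T_R, B_R = 5_000, 200
V_x, V_y = 100, 500

costs = {
    "table scan, filter x,y,z": B_R,              # 200, R is clustered
    "index on x=1, filter y,z": T_R // V_x,       # 50 tuples, unclustered index on x
    "index on y=2, filter x,z": T_R // V_y,       # 10 tuples, unclustered index on y
    "clustering index on z<5":  round(B_R / 3),   # ~67 blocks
}
print(min(costs.items(), key=lambda kv: kv[1]))   # ('index on y=2, filter x,z', 10)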
44. Outline Convert SQL query to a parse tree Semantic checking: attributes, relation names, types Convert to a logical query plan (relational algebra expression) deal with subqueries Improve the logical query plan use algebraic transformations group together certain operators evaluate logical plan based on estimated size of relations Convert to a physical query plan search the space of physical plans choose order of operations complete the physical query plan Three topics Choosing the physical implementations (e.g., select and join methods) Decisions regarding materialized vs pipelined Notation for physical query plans
45. Pipelining Versus Materialization Materialization: store the (intermediate) result of each operation on disk. Pipelining: interleave the execution of several operations; the tuples produced by one operation are passed directly to the operation that uses them, and the (intermediate) result of each operation is held in buffers in main memory. Prefer pipelining where possible; it is sometimes not possible, as the following example shows. The next few pages give a fully worked-out example.
46. R⋈S⋈U Example (p1) Consider a physical query plan for the expression (R(w, x) ⋈ S(x, y)) ⋈ U(y, z). Assumptions: R occupies 5,000 blocks, S and U each 10,000 blocks; the intermediate result R ⋈ S occupies k blocks for some k; both joins will be implemented as hash-joins, either one-pass or two-pass depending on k; there are 101 buffers available.
47. R⋈S⋈U Example (p2) When joining R ⋈ S, neither relation fits in the buffers, so we need a two-pass hash-join and must partition R. How many hash buckets for R? 100 at most. The second pass of the hash-join uses 51 buffers, leaving 50 buffers for joining the result of R ⋈ S with U. Why 51? (Roughly: with 100 buckets, each bucket of R is about 5,000/100 = 50 blocks, so the second pass holds one bucket of R in 50 buffers plus one buffer to stream the matching bucket of S.)
48. R⋈S⋈U Example (p3) Case 1: suppose k ≤ 49, i.e., the result of R ⋈ S occupies at most 49 blocks. Steps: pipeline R ⋈ S into 49 buffers; organize them for lookup as a hash table; use the one remaining buffer to read each block of U in turn; execute the second join as a one-pass join. The total number of I/Os is 55,000: 45,000 for the two-pass hash join of R and S, plus 10,000 to read U for the one-pass hash join of (R ⋈ S) ⋈ U.
49. R⋈S⋈U Example (p4) Case 2: suppose k > 49 but k < 5,000. We can still pipeline, but we need another strategy, in which the intermediate result joins with U in a 50-bucket, two-pass hash-join. Steps: before starting on R ⋈ S, hash U into 50 buckets of 200 blocks each; perform the two-pass hash join of R and S using 51 buffers as in case 1, placing the results in the 50 remaining buffers to form 50 buckets for the join of R ⋈ S with U; finally, join R ⋈ S with U bucket by bucket. The number of disk I/Os is: 20,000 to read U and write its tuples into buckets; 45,000 for the two-pass hash-join of R and S; k to write out the buckets of R ⋈ S; k + 10,000 to read the buckets of R ⋈ S and U in the final join. The total cost is 75,000 + 2k.
50. R⋈S⋈U Example (p5) Case 3: k > 5,000. We cannot perform a two-pass join in the 50 available buffers if the result of R ⋈ S is pipelined, so we are forced to materialize R ⋈ S. The number of disk I/Os is: 45,000 for the two-pass hash-join of R and S; k to store R ⋈ S on disk; 30,000 + 3k for the two-pass join of R ⋈ S with U. The total cost is 75,000 + 4k.
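The three cases collapse into a single cost function of k; a sketch (assuming B(R) = 5,000, B(S) = B(U) = 10,000 and 101 buffers, as on these slides):

def rsu_total_io(k):
    if k <= 49:                # case 1: pipeline, one-pass join with U
        return 45_000 + 10_000
    if k < 5_000:              # case 2: pipeline into a 50-bucket two-pass join with U
        return 75_000 + 2 * k
    return 75_000 + 4 * k      # case 3: materialize R join S

for k in (45, 55, 4_500, 5_500):
    print(k, rsu_total_io(k))  # 55000, 75110, 84000, 97000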
51. R⋈S⋈U Example (p6) In summary, the cost of the physical plan is a function of the size of R ⋈ S. Pause and reflect: it is all about the expected size of the intermediate result R ⋈ S. What would have happened if we guessed 45 but had 55? Guessed 55 but only had 45? Guessed 4,500 but had 5,500? Guessed 5,500 but only had 4,500?
52. Outline Convert SQL query to a parse tree Semantic checking: attributes, relation names, types Convert to a logical query plan (relational algebra expression) deal with subqueries Improve the logical query plan use algebraic transformations group together certain operators evaluate logical plan based on estimated size of relations Convert to a physical query plan search the space of physical plans choose order of operations complete the physical query plan Three topics Choosing the physical implementations (e.g., select and join methods) Decisions regarding materialized vs pipelined Notation for physical query plans
53. Notation for Physical Query Plans Several types of operators: operators for leaves, (physical) operators for selection, (physical) sort operators, and operators for the other relational-algebra operations. In practice, each DBMS uses its own internal notation for physical query plans.
54. PQP Notation Leaves: replace a leaf in an LQP by TableScan(R): read all blocks; SortScan(R, L): read in the order given by L; IndexScan(R, C): scan R using an index on attribute A, where C is a condition of the form A θ c; IndexScan(R, A): scan all of R using an index on attribute A. Selects: replace a Select in an LQP by one of the leaf operators plus Filter(D) for the remaining condition D. Sorts: replace a leaf-level sort as shown above; for other operations, Sort(L): sort a relation that is not stored. Other operators: operation- and algorithm-specific (e.g., Hash-Join); also need to specify the number of passes, buffer sizes, etc.
55. We have Arrived at the Desired Endpoint Example physical query plans. For σ_{x=1 ∧ y=2 ∧ z<5}(R): Filter(x=1 AND z<5) applied to IndexScan(R, y=2). For the R ⋈ S ⋈ U example: a two-pass hash-join (101 buffers) of TableScan(R) and TableScan(S), materialized, then a two-pass hash-join (101 buffers) of that result with TableScan(U).
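One possible way (an illustration, not the book's or any DBMS's actual representation) to write these two plans as nested operator nodes:

from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                  # e.g. "TableScan", "IndexScan", "Filter", "HashJoin"
    args: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)

# sigma_{x=1 AND y=2 AND z<5}(R)
select_plan = Op("Filter", {"cond": "x=1 AND z<5"},
                 [Op("IndexScan", {"rel": "R", "cond": "y=2"})])

# (R join S) join U with the intermediate result materialized
rsu_plan = Op("HashJoin", {"passes": 2, "buffers": 101},
              [Op("Materialize", inputs=[
                   Op("HashJoin", {"passes": 2, "buffers": 101},
                      [Op("TableScan", {"rel": "R"}), Op("TableScan", {"rel": "S"})])]),
               Op("TableScan", {"rel": "U"})])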
56. Outline Convert SQL query to a parse tree Semantic checking: attributes, relation names, types Convert to a logical query plan (relational algebra expression) deal with subqueries Improve the logical query plan use algebraic transformations group together certain operators evaluate logical plan based on estimated size of relations Convert to a physical query plan search the space of physical plans choose order of operations complete the physical query plan
57. Optimization Issues and Proposals The "fuzz" in estimation of sizes. Parametric query optimization: specify alternatives to the execution engine so it may respond to conditions at runtime. Multiple-query optimization: take concurrent execution of several queries into account. Combinatoric explosion of options when doing an n-way join: becomes really expensive around n > 15; alternative optimizations have been proposed for special situations, but there is no general framework. Rule-based optimizers. Randomized plan generation.
58. CS 542 Database Management Systems Distributed Query Execution Source: Carsten Binnig, Univ of Zurich, 2006 J Singh March 28, 2011
59. Motivation Algorithms based on semi-joins have been proposed as techniques for query optimization. They shine in distributed and parallel databases, so this is a good opportunity to explore them in that context. Semi-join formal definition: R ⋉_C S = π_{attributes of R}(R ⋈_C S), i.e., the tuples of R that have at least one join partner in S.
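The slide's "semi-join by example" was a figure; a tiny stand-in with toy data (assumed, not from the slides):

A = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
B = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]

ids_in_B = {t["id"] for t in B}
A_semijoin_B = [t for t in A if t["id"] in ids_in_B]   # A semi-join B: A-tuples that have a join partner
print(A_semijoin_B)                                    # [{'id': 2, 'a': 'y'}, {'id': 3, 'a': 'z'}]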
60. Distributed / Parallel Join Processing Scenario: how do we compute A ⋈ B when Table A resides on Node 1 and Table B resides on Node 2?
61. Naïve approach (1) Idea: use a standard join and fetch the table page-wise from the remote node if necessary (send- and receive-operators). Example: the join is executed on node 2 using a nested-loop join. Outer loop: request a page of table A from node 1 (remote). Inner loop: for each page, iterate over table B and produce output. => Random access of pages on node 1 is expensive (due to network delay).
63. Naïve Approach: Implications Problems: high cost for shipping data; network cost is roughly the same as I/O cost for a hard disk (or even worse because of the unpredictability of network delay); shipping A is roughly equivalent to a full table scan. (Trivial) optimizations: always ship the smaller table to the other side; if the query contains a selection, apply the selection before sending A. Note: the bigger table may become the smaller table after the selection.
64. Semi-join Approach (p1) Idea: before shipping a table, reduce the data that is shipped to only those tuples that are relevant for the join. Example: join on A.id = B.id, and table A should be shipped to node 2.
65. Semi-join Approach (p2) (1) Compute the projection B.id of table B on node 2. (2) Ship column B.id to node 1.
66. Semi-join Approach (p3) (3) Execute the semi-join of B.id and table A on A.id = B.id, to select only the relevant tuples of table A => table A'. (4) Send the result of the semi-join (table A') to node 2.
67. Semi-join Approach (p4) (5) Join the shipped table A' locally on node 2 with table B. => Optimization of this approach: if node 1 holds a join index (e.g., type 1 with A.id -> {B.RID}), we can start with step (3).
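An end-to-end sketch of steps (1)-(5), run in a single process with toy data (node boundaries are only comments; names are illustrative):

def distributed_semi_join(table_A, table_B, key="id"):
    # (1)+(2) node 2 projects B.id and "ships" it to node 1
    b_keys = {t[key] for t in table_B}
    # (3) node 1 computes the semi-join: A' = tuples of A with a partner in B
    A_prime = [t for t in table_A if t[key] in b_keys]
    # (4)+(5) A' is "shipped" to node 2 and joined with B locally
    by_key = {}
    for t in table_B:
        by_key.setdefault(t[key], []).append(t)
    return [{**a, **b} for a in A_prime for b in by_key.get(a[key], [])]

A = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
B = [{"id": 2, "b": "p"}]
print(distributed_semi_join(A, B))   # [{'id': 2, 'a': 'y', 'b': 'p'}]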
68. Semi-join Approach Discussion This strategy works well if the semi-join reduces the size of the table that needs to be shipped. Assume all rows of table A are needed anyway, so none of the rows of table A can be discarded; then this approach is more costly than shipping the entire table A in the first place! Consequence: we need to decide whether this method makes sense based on the semi-join selectivity => cost-based optimization must decide this.
69. Bloom-join Approach (p1) The algorithm is the same as the semi-join approach, but we ship a Bloom filter instead of the (foreign) key column, using the Bloom-filter technique to compress the data. Goal: send only a small bit list (to reduce network I/O) instead of all keys of the column. Problems: a superset of the tuples that might join will be sent back (the same problem as with Bloom filters for bitmap indexes) => more tuples must be sent over the network, so the net gain depends on a good hash function.
70. Bloom-join Approach (p2) (1) Compute a Bloom filter BL of size n for column B.id of table B on node 2, with n << |B.id| (e.g., by hashing B.id % n). (2) Ship the Bloom filter BL to node 1.
71. Bloom-join Approach (p3) (3) Probe the Bloom filter BL with the tuples of table A to get a superset of the possible join candidates => table A'. (4) Send the result (table A') to node 2; table A' might contain join candidates that do not have a partner in table B. (5) Join the shipped table A' locally on node 2 with table B.
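The same flow with a Bloom filter in place of the key column, as a sketch (single hash function id % n as on the slide, and integer keys assumed; real systems use several hash functions and tune n):

def bloom_join(table_A, table_B, key="id", n=8):
    # (1)+(2) node 2 builds an n-bit filter over B.id and "ships" it to node 1
    bits = [False] * n
    for t in table_B:
        bits[t[key] % n] = True
    # (3)+(4) node 1 ships only A-tuples whose key hashes into the filter;
    #         this is a superset of the true join candidates (false positives possible)
    A_prime = [t for t in table_A if bits[t[key] % n]]
    # (5) node 2 joins A' with B locally, discarding the false positives
    by_key = {}
    for t in table_B:
        by_key.setdefault(t[key], []).append(t)
    return [{**a, **b} for a in A_prime for b in by_key.get(a[key], [])]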
72. Bloom-join Approach Discussion Communication cost is much reduced, but we have to deal with false positives. Widely used in NoSQL databases.