This document discusses query optimization in database systems. It begins by describing the components of a database management system and how queries are processed. It then explains that the goal of query optimization is to reduce the execution cost of a query by choosing efficient access methods and ordering operations. The document outlines different query plans involving table scans, index scans, and joins. It also introduces concepts like filter factors, statistics about tables and indexes, and how these are used to estimate the cost of alternative query execution plans.

Query Optimization
Fall 2001 Database Systems

System Structure Revisited
[Diagram: DBMS system structure — naïve users, application programmers, casual users, and the database administrator reach the system through forms, application front ends, a DML interface / CLI, and DDL. Inside the DBMS, SQL commands pass through the DDL compiler and the query evaluation engine to the file & access methods layer, the buffer manager, and the disk space manager, with the transaction & lock manager and the recovery manager alongside; underneath sit the indexes, the system catalog, and the data files.]

Disk Access Process (Overly Simplified)
• Some DBMS component indicates it wants to read record R
• File Manager
  – Does a security check
  – Uses access structures to determine the page the record is on
  – Asks the buffer manager to find that page
• Buffer Manager
  – Checks to see if the page is already in the buffer
  – If so, gives the buffer address to the requestor
  – If not, allocates a buffer frame
  – Asks the Disk Manager to get the page
• Disk Manager
  – Determines the physical address(es) of the page
  – Asks the disk controller to get the appropriate block of data from the physical address
• Disk controller instructs the disk driver to do the dirty job
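
The request flow above is essentially a cache lookup with a fallback to disk. The following Python sketch is illustrative only — the class and method names (BufferManager, get_page, read_block) are hypothetical, not the API of any real DBMS — and it shows how a buffer manager might serve a page from its pool before asking the disk manager.

class DiskManager:
    """Stub standing in for the disk controller / driver layers."""
    def read_block(self, page_id):
        return f"<contents of page {page_id}>"

class BufferManager:
    """Minimal sketch of the page-request flow described above (hypothetical API)."""
    def __init__(self, disk_manager, capacity=100):
        self.disk_manager = disk_manager
        self.capacity = capacity
        self.pool = {}                        # page_id -> page contents (buffer frames)

    def get_page(self, page_id):
        if page_id in self.pool:              # page already buffered:
            return self.pool[page_id]         # hand the frame to the requestor
        if len(self.pool) >= self.capacity:   # otherwise free a frame (naive eviction)
            self.pool.pop(next(iter(self.pool)))
        page = self.disk_manager.read_block(page_id)   # ask the disk manager for it
        self.pool[page_id] = page
        return page

bm = BufferManager(DiskManager())
print(bm.get_page(42))   # first access goes to "disk"
print(bm.get_page(42))   # second access is served from the buffer pool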

Storage Hierarchy
[Diagram: the storage hierarchy as seen by programs and the DBMS — registers (2-5 ns), cache (3-10 ns), main memory / virtual memory (80-400 ns), secondary storage / file system (~5,000,000 ns), and tertiary storage. Capacity grows toward the bottom of the hierarchy, while cost and speed grow toward the top.]

Storage Mechanisms
• Primary access methods
  – Heap
  – Cluster
  – Hashing
• Secondary access methods
  – B-tree indices
  – Bitmap indices
  – R-trees / Quadtrees (for multi-dimensional range queries)

Query Optimization
• The goal of query optimization is to reduce the execution cost of a query
• It involves:
  – checking syntax
  – simplifying the query
  – execution:
    • choosing a set of algorithms to execute
    • choosing a set of access methods for relations
    • ordering the query steps
  – creating executable code to process the query

Query Resource Utilization
• An optimizer must have an objective function
• Optimize the use of resources:
  – CPU time
  – I/O time
  – number of remote calls (amount of remote data transfer)
• Define an objective function:
  – c1 * cost_IO(execution plan) + c2 * cost_CPU(execution plan)
  – many systems assume CPU costs are directly proportional to I/O costs and optimize for cost_IO only
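
As a worked example, the objective function can be evaluated directly for a candidate plan. The weights and cost figures below are made-up numbers chosen only to show the shape of the formula, not values from the slides.

def plan_cost(io_cost, cpu_cost, c1=1.0, c2=0.0):
    # Objective function: c1 * cost_IO(plan) + c2 * cost_CPU(plan).
    # With c2 = 0 this reduces to the common "optimize for I/O only" assumption.
    return c1 * io_cost + c2 * cpu_cost

print(plan_cost(io_cost=100_000, cpu_cost=5_000))            # I/O-only objective
print(plan_cost(io_cost=100_000, cpu_cost=5_000, c2=0.01))   # weighted objective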

Query Plan
• A query plan consists of
  – methods to access relations (sequential scan, index scan)
  – methods to perform basic operations (hash join / merge-sort join)
  – ordering of these operations
  – other considerations (writing temporary results to disk, remote calls, sorting, etc.)
Fall 2001 Database Systems 8
Main query operations
• SELECT (WHERE C)
– Scan a relation and find tuples that satisfy a condition C
– Use indices to find which tuples will satisfy C
• JOIN
– Join multiple relations into a single relation
• SORT/GROUP BY/DISTINCT/UNION
– Order tuples with respect to some criteria
• Note that projection can usually be performed on the fly.
5. Query Optimization
5
Fall 2001 Database Systems 9
Table Scans
• A table scan consists of reading all the disk pages in a relation
• For example:
SELECT A.StageName FROM movies.actors A
WHERE A.age < 25
Plan: read all pages in the relation one by one [I/O]
for all tuples check if A.age < 25 is true [CPU]
if it is true, output the tuple to some output buffer
Assume I/O is the bottleneck; the key question is how fast the whole relation can be read.
Fall 2001 Database Systems 10
Table Scan
• A fast disk read:
– seek time: 4.9 ms
– rotational latency: 2.99 ms
– transfer rate: 300 Mbits/sec, so a 4K page is transferred in about 0.1 ms
• A relation with 1 million tuples and 10 tuples per page
has:
1,000,000 / 10 = 100,000 disk pages
• A random read assumes the disk head moves to a
random location on the disk at each read:
100,000 * (4.9 + 2.99 + 0.1) ms = 799 sec ≈ 13 min
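A quick check of this arithmetic as a Python sketch; the timing constants are the assumed values above:

    SEEK_MS = 4.9
    ROTATION_MS = 2.99
    TRANSFER_MS = 0.1             # per 4K page

    def random_scan_seconds(pages: int) -> float:
        # Every page read pays a full seek + rotational latency + transfer
        return pages * (SEEK_MS + ROTATION_MS + TRANSFER_MS) / 1000.0

    pages = 1_000_000 // 10       # 1M tuples, 10 tuples per page
    print(random_scan_seconds(pages))   # 799.0 seconds, about 13 minutes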
6. Query Optimization
6
Fall 2001 Database Systems 11
I/O Parallelism
• Distribute the data to multiple disks (striping)
– distribute uniformly to allow all disk heads to work equally
hard
– introduce fault tolerance
• In the best case, n disks may give a speed up factor of n
– but the total load is the same
– the cost of the system may have increased!
[Figure: round-robin striping over four disks — disk 1 holds pages 1, 5, 9, 13, ...; disk 2 holds pages 2, 6, 10, ...; disk 3 holds pages 3, 7, 11, ...; disk 4 holds pages 4, 8, 12, ...]
Fall 2001 Database Systems 12
Other Scan Speed-Ups
• Read multiple pages at a time and then perform the filtering logic
– sequential prefetch (read k consecutive pages at once)
– list prefetch (read k pages in a list at once, let the disk arm
scheduler find the optimal way of reading them)
• Example (sequential prefetch)
– read 32 pages at once and pay seek time and rotational latency
only once
4.9 + 2.99 + 0.1*32 = 11.09 ms
– to read 100,000 disk pages, make 100,000 / 32 read rounds
(each takes 11.09 ms = 11.09/1000 sec)
– total read time is then 11.09/1000 * 100,000/32 = 34.6 sec
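The sequential-prefetch version of the same computation, as a sketch under the same timing assumptions:

    import math

    def prefetch_scan_seconds(pages: int, prefetch: int = 32) -> float:
        # One seek + rotational latency per round of `prefetch` consecutive pages,
        # plus a transfer cost per page
        round_ms = 4.9 + 2.99 + 0.1 * prefetch   # 11.09 ms for a 32-page round
        return math.ceil(pages / prefetch) * round_ms / 1000.0

    print(prefetch_scan_seconds(100_000))   # ~34.7 seconds instead of ~13 minutes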
7. Query Optimization
7
Fall 2001 Database Systems 13
SELECT * FROM T WHERE P
• Table scan methods
– read the entire table and select tuples that satisfy
the predicate P [sequential scan]
– prefetching is used to reduce the read time (read
blocks of N pages at once from the same track)
[sequential scan with prefetch]
• Index scan methods
– use indices to find tuples that satisfy all of P and
then read the tuples from disk [index scan]
– use indices to find tuples that satisfy part of P and
then read the tuples from disk and check the rest
of P [index scan+select]
Fall 2001 Database Systems 14
SELECT * FROM T WHERE P
• Index scan methods (continued)
– use indices to find tuples that satisfy all of P and output the
indexed attributes [index-only scan]
– use indices to find tuples that satisfy part of P and then find the
intersection of different sets of tuples [multi-index scan]
8. Query Optimization
8
Fall 2001 Database Systems 15
Statistics in Oracle
ANALYZE TABLE employee
COMPUTE STATISTICS FOR COLUMNS dept, name
• For each relation:
– CARD: total number of tuples in the relation (cardinality)
– NPAGES: total number of disk pages for the relation
• For each column:
– COLCARD: number of distinct values for that column
– HIGHKEY, LOWKEY: the highest and the lowest stored
value for that column
In addition we will use: CARD(R WHERE C) to denote the
number of tuples in R that satisfy the condition C
Fall 2001 Database Systems 16
Statistics in Oracle
ANALYZE TABLE employee
COMPUTE STATISTICS FOR COLUMNS dept,
name
• For each index:
– NLEVELS: number of levels of the B+-tree
– NLEAF: total number of leaf pages
– FULLKEYCARD: total number of distinct values for
the index column
– CLUSTER-RATIO: percentage of rows in the table
clustered with respect to the index column
9. Query Optimization
9
Fall 2001 Database Systems 17
Find R.A=20 and R.B between (1,50)
Alternative plans for relation R (estimated cost of each step in brackets):
– Plan 1: read all of R [NPAGES(R)], then check R.A = 20 AND R.B between (1,50)
– Plan 2: use index I on R.A [NLEVELS(I) + NLEAF(I, R.A=20)], read the R tuples with R.A = 20 [CARD(R.A=20)], then check R.B between (1,50)
– Plan 3: use index I2 on R.B [NLEVELS(I2) + NLEAF(I2, R.B in (1,50))], read the R tuples with R.B between (1,50) [CARD(R.B in (1,50))], then check R.A = 20
– Plan 4: use both indexes and intersect the resulting rowid sets before reading any tuples
Fall 2001 Database Systems 18
Index Scan
• Read one node at each intermediate level
• Read leaf nodes by following sibling pointers until no matching entry is found
[Figure: B+-tree index on Location; leaf entries in order: Albany, Anchorage, Boston, Boston, Cape Cod, Denver, Denver, Detroit, Detroit]
11. Query Optimization
11
Fall 2001 Database Systems 21
Filter factors
• Filter factors assume uniform distribution of values and no
correlation between attributes
• Suppose that we are storing the transactions of customers
at different Hollywood Video stores.
• Attributes: store_zipcode, movieid, customer_name,
customer_zipcode, date_rented
– 40,000 store_zipcodes between 10,000 and 50,000
– 10,000 movies ids between 1 and 10,000
– 100,000 customer_names between 1 and 100,000
– 40,000 customer_zipcodes between 10,000 and 50,000
– 364 dates (between 1 and 364)
– Total cardinality: 300 billion tuples
Fall 2001 Database Systems 22
Filter Factors
• What are the filter factors of the following conditions?
– All tuples for the customers named “John Smith”: 1/COLCARD = 1/100,000 = 0.00001
– All tuples for the customers living in 12180: 1/COLCARD = 1/40,000 = 0.000025
– All tuples for the stores located in 12180: 1/COLCARD = 1/40,000 = 0.000025
– All tuples for the rentals on day 200: 1/COLCARD = 1/364 ≈ 0.0027
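Under the uniformity assumption, the filter factor of an equality predicate is just 1/COLCARD. A minimal Python sketch of these four answers, using the cardinalities assumed above:

    # Column cardinalities assumed in the Hollywood Video example
    COLCARD = {
        "customer_name": 100_000,
        "customer_zipcode": 40_000,
        "store_zipcode": 40_000,
        "date_rented": 364,
    }

    def ff_equals(column: str) -> float:
        # Uniformity assumption: an equality predicate selects 1/COLCARD of the tuples
        return 1.0 / COLCARD[column]

    for col in ("customer_name", "customer_zipcode", "store_zipcode", "date_rented"):
        print(col, ff_equals(col))   # 0.00001, 0.000025, 0.000025, ~0.0027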
12. Query Optimization
12
Fall 2001 Database Systems 23
Filter Factors
– All tuples for the rentals on days (200,210,220) AND by a customer named “John Smith”:
3/364 * 1/100,000 ≈ .0082 * .00001 ≈ .000000082
– All tuples for the rentals on day 200 AND in a store with zipcode between 12000 and 14000:
.0027 * 2000/40,000 = .0027 * .05 = .000135
– All tuples for the rentals on day 200 OR by a customer living in zipcode 12180:
.0027 + .000025 – (.0027)(.000025) ≈ .0027249
– All tuples for a customer NOT living in 12180:
1 – FF(customer in 12180) = 1 – .000025 = .999975
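The combination rules behind these numbers (product for AND, inclusion-exclusion for OR, complement for NOT) can be sketched as follows; the independence assumption is the same one the slide makes:

    def ff_and(*factors: float) -> float:
        # Independence assumption: multiply the individual filter factors
        result = 1.0
        for f in factors:
            result *= f
        return result

    def ff_or(f1: float, f2: float) -> float:
        # Inclusion-exclusion for two predicates
        return f1 + f2 - f1 * f2

    def ff_not(f: float) -> float:
        return 1.0 - f

    print(ff_and(3 / 364, 1 / 100_000))      # days (200,210,220) AND "John Smith": ~8.2e-8
    print(ff_and(1 / 364, 2_000 / 40_000))   # day 200 AND store zipcode in [12000,14000]: ~1.4e-4
    print(ff_or(1 / 364, 1 / 40_000))        # day 200 OR customer zipcode 12180: ~0.0028
    print(ff_not(1 / 40_000))                # NOT living in 12180: 0.999975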
Fall 2001 Database Systems 24
Matching Index Scan
SELECT I.name FROM items I WHERE I.location = ‘Boston’
• Assume B+-tree index ILoc on items.location
• Algorithm:
scan index for leftmost leaf where location = ‘Boston’
for all rowids R found in the leaf
retrieve tuple from items using R
find next leaf node with location = ‘Boston’ and repeat
• Cost: reading from B+-tree + reading the tuples from items
NLEVELS(ILoc) + NLEAF(ILoc, I.location=‘Boston’) + CARD(I.location=‘Boston’)
• Assume non-leaf nodes of B+-tree are already in memory and leaf
nodes store at most 400 rowids
• To retrieve n tuples, we need n / 400 + n disk accesses in the average
case
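A sketch of this cost count in Python, under the slide's assumptions (non-leaf nodes cached, 400 rowids per leaf, one page read per retrieved tuple):

    import math

    def matching_index_scan_pages(n_tuples: int, rowids_per_leaf: int = 400) -> int:
        # Leaf pages scanned plus one data-page read per qualifying tuple;
        # the non-leaf levels of the B+-tree are assumed to be in memory already
        leaf_pages = math.ceil(n_tuples / rowids_per_leaf)
        return leaf_pages + n_tuples

    print(matching_index_scan_pages(20_000))   # e.g. 20,000 qualifying tuples -> 20,050 page reads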
13. Query Optimization
13
Fall 2001 Database Systems 25
Partial-Matching Index Scan
SELECT I.name FROM items I
WHERE I.location = ‘Boston’ AND I.name like ‘Antique%’
• Assume B+-tree index on items.location
• Algorithm:
scan index for leftmost leaf where location = ‘Boston’
for all rowids R found in the leaf
retrieve the tuple from items using R
check if the name is like ‘Antique%’
find next leaf node with location = ‘Boston’ and repeat
• Except for some additional CPU cost, the cost of this scan is
identical to the previous one
Fall 2001 Database Systems 26
Matching Index Scan
SELECT I.name FROM items I
WHERE I.location = ‘Boston’ AND I.name like ‘Antique%’
• Assume B+-tree index IL on items.location, index IN on items.name, and
index ILN on items.location+items.name
• Options:
PLAN1: Use index IL, read the items tuples and filter on items.name
(previous slide)
PLAN2: Use index IN, read the items tuples and filter on items.location
PLAN3: Use index IL to find tuple ids SL, use index IN to find tuple ids
SN, compute intersection of SL and SN, and read the items tuples
from disk that are in this intersection
PLAN4: Use index ILN to find tuples with values Boston+Antique%.
Return the name value of all tuples from ILN that match the criteria
(Index only scan)
14. Query Optimization
14
Fall 2001 Database Systems 27
Comparing Costs (1)
• Assume items contains 1 million tuples, 50 different
cities and 100,000 different names for items
• Assume B+-trees can store at most 400 duplicate
values per node at the leaf level
• The items table can store about 20 tuples in a single
disk page
• If we assume uniform distribution, there are
– 1M / 50 = 20,000 items in Boston
– 1M / 100,000 = 10 items of each different name
– assume 100 names start with Antique so that 1000
items have a name like ‘Antique%’
Fall 2001 Database Systems 28
Comparing Costs (2)
• B+-tree indices
– Index IL: items from Boston are stored in 20,000 / 400
= 50 disk pages
– Index IN: items with names that start with ‘Antique’ are stored in 1000/400 = 3 disk pages
• Assume that only leaf nodes of a B+-Tree index are read
from disk during query execution
• PLAN 1: To read all tuples for ‘Boston’ requires 50 index
pages + 20,000 pages from items = 20,050 disk reads
15. Query Optimization
15
Fall 2001 Database Systems 29
Comparing Costs (3)
• PLAN 2: To read all tuples with name like ‘Antique%’
requires 3 index pages + 1000 pages from items = 1003
disk reads
• PLAN 4: How big is the B+-tree for ILN? Assume 150
rowids at most fit in a leaf of ILN
– Assume 1 tuple for each city and name combination
– the 100 item names of the form ‘Antique%’ are stored
consecutively for a given location and fit in a single
page
– cost: read 1 B+-tree page with Boston+Antique% and
find all 100 names, so 1 disk read
Fall 2001 Database Systems 30
Indices Not Always Best
• Assume seek time 4.9 ms, latency 3.0 ms, transfer 0.1
ms/page
• Suppose we want to find all items in ‘Boston’
• Use IL index on items.location:
– 20,000 items per city, 50 index pages per city
– total cost is 20,050 disk page reads (assuming no
clustering on location)
– 20,050 * 8 ms ≈ 160 sec ≈ 2.7 min
• Sequential scan with prefetch = 32
– 1M tuples, 1M / 20 = 50,000 disk pages
– 50,000 / 32 = 1563 rounds
– 1563 * (4.9 + 3 + 3.2) ms ≈ 17.35 sec
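A sketch of this comparison with the same timing assumptions (8 ms per random page read, one seek + rotation per 32-page prefetch round):

    import math

    RANDOM_READ_MS = 4.9 + 3.0 + 0.1             # one random 4K page read

    def prefetch_scan_seconds(pages: int, prefetch: int = 32) -> float:
        round_ms = 4.9 + 3.0 + 0.1 * prefetch    # 11.1 ms per prefetch round
        return math.ceil(pages / prefetch) * round_ms / 1000.0

    index_scan_secs = (50 + 20_000) * RANDOM_READ_MS / 1000.0   # ~160 s via index IL
    table_scan_secs = prefetch_scan_seconds(50_000)             # ~17.3 s sequential scan
    print(index_scan_secs, table_scan_secs)      # the "dumb" sequential scan wins here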
16. Query Optimization
16
Fall 2001 Database Systems 31
Clustering
• Remember, clustering means that the tuples of a relation are stored in
groups with respect to a set of attributes
• Assume BIDS(bidid,itemid,buyid,date,amount) is clustered on itemid,
buyid
– all bids for the same item are on consecutive disk pages
– all bids for the same item by the same buyer are on the same disk
page
• It is very fast to find
– all bids on a specific item
– all bids on a specific item by a specific buyer
• It is not very fast to find
– all bids by a specific buyer
– all bids of some amount
Fall 2001 Database Systems 32
Clustering
• Assume that there are 20 bids per item in general,
20 million tuples in the bids relation, and a total of
10,000 buyers
– Suppose 40 bids tuples fit on a single page
– B+-tree index IIB on itemid, buyid stores 200
rowids per page
– B+-tree index IB on buyid stores 400 rowids per
page
17. Query Optimization
17
Fall 2001 Database Systems 33
Clustering
• What is the cost of finding all bids for items I1 through
I1000?
– How many bids do we expect? 1000 items * 20 bids per item = 20,000 bids
– 20,000 / 200 = 100 index pages using index IIB
– 20,000 / 40 = 500 bids pages
– with prefetch = 32, 500 / 32 = 16 rounds
– 16 * (4.9 + 3 + 3.2) ms + 100 * 8 ms ≈ 0.97 sec
– We might be able to use prefetch for the index as well
Fall 2001 Database Systems 34
Clustering
• What is the cost of finding all bids for items I1
through I1000 by buyer B5?
– how many bids do we expect? 20 bids per
item, 10,000 buyers, so .002 bids per item by a
given buyer
– 1000 items, so 20,000 bids on these items in total, of which only 1000 * .002 = 2 are expected to be by buyer B5
– each bid for the same item and buyer are
stored consecutively in index IIB and on disk
– 1000 index accesses + (1000 * .002) bids
pages for buyer B5
– cost = 1000 * 8 ms + 2 * 8 ms ≈ 8 sec
18. Query Optimization
18
Fall 2001 Database Systems 35
Clustering
• What is the cost of finding all bids by buyer B5?
– 20 bids per item / 10,000 buyers = .002 bids per
buyer on each item
– .002 bids per buyer per item * 1M items = 2000 bids
per buyer
– bids by the same buyer for different items are stored
on different pages
• if we use index IB, we need to access 2000 pages and 2000/400 = 5 index pages
• cost = 2000 * 8 ms + 5 * 8 ms ≈ 16.04 sec
– sequential scan: 20M / 40 = 500,000 disk pages
• use prefetch = 32: 500,000 / 32 = 15,625 rounds
• 15,625 * (4.9 + 3 + 3.2) ms ≈ 2.9 minutes
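The same arithmetic as a small Python sketch (8 ms per random read, 11.1 ms per 32-page prefetch round, counts as assumed above):

    import math

    RANDOM_READ_MS = 8.0

    def prefetch_scan_seconds(pages: int, prefetch: int = 32) -> float:
        return math.ceil(pages / prefetch) * (4.9 + 3.0 + 0.1 * prefetch) / 1000.0

    bids_by_b5 = 2_000                                  # expected bids by one buyer
    index_leaf_pages = math.ceil(bids_by_b5 / 400)      # 5 leaf pages of index IB
    via_index = (bids_by_b5 + index_leaf_pages) * RANDOM_READ_MS / 1000.0   # ~16 s
    via_sequential = prefetch_scan_seconds(20_000_000 // 40)                # ~173 s (~2.9 min)
    print(via_index, via_sequential)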
Fall 2001 Database Systems 36
Design Process - Physical Design
Conceptual Design → Conceptual Schema (ER Model) → Logical Design → Logical Schema (Relational Model) → Physical Design → Physical Schema
19. Query Optimization
19
Fall 2001 Database Systems 37
Physical Design
• Choice of indexes
• Clustering of data
• May have to revisit and refine the conceptual
and external schemas to meet performance
goals.
• Most important is to understand the workload
– The most important queries and their frequency.
– The most important updates and their frequency.
– The desired performance for these queries and
updates.
Fall 2001 Database Systems 38
Workload Modeling
• For each query in the workload:
– Which relations does it access?
– Which attributes are retrieved?
– Which attributes are involved in selection/join
conditions? How selective are these conditions
likely to be?
• For each update in the workload:
– Which attributes are involved in selection/join
conditions? How selective are these conditions
likely to be?
– The type of update (INSERT/DELETE/UPDATE),
and the attributes that are affected.
20. Query Optimization
20
Fall 2001 Database Systems 39
Physical Design Decisions
• What indexes should be created?
– Relations to index
– Field(s) to be used as the search key
– Perhaps multiple indexes?
– For each index, what kind of an index should it be?
• Clustered? Hash/tree? Dynamic/static? Dense/sparse?
• Should changes be made to the conceptual schema?
– Alternative normalized schemas
– Denormalization
– Partitioning (vertical / horizontal)
– New view definitions
• Should the frequently executed queries be rewritten
to run faster?
Fall 2001 Database Systems 40
Choice of Indexes
• Consider the most important queries one-by-one
– Consider the best plan using the current indexes
– See if a better plan is possible with an additional
index
– If so, create it.
• Consider the impact on updates in the workload
– Indexes can make queries run faster, but they make updates slower
– Indexes require disk space, too.
21. Query Optimization
21
Fall 2001 Database Systems 41
Index Selection Guidelines
• Don’t index unless it contributes to performance.
• Attributes mentioned in a WHERE clause are candidates
for index search keys.
– Exact match condition suggests hash index.
– Range query suggests tree index.
• Clustering is especially useful for range queries, although it can help
on equality queries as well in the presence of duplicates.
• Multi-attribute search keys should be considered when a
WHERE clause contains several conditions.
– If range selections are involved, order of attributes should be
carefully chosen to match the range ordering.
– Such indexes can sometimes enable index-only strategies for
important queries.
• For index-only strategies, clustering is not important!
Fall 2001 Database Systems 42
Index Selection Guidelines (cont’d.)
• Try to choose indexes that benefit as many
queries as possible. Since only one index can
be clustered per relation, choose it based on
important queries that would benefit the most
from clustering.
22. Query Optimization
22
Fall 2001 Database Systems 43
Matching composite index scans
[Figure: Relation R stored in extents (an extent is normally read with a sequential prefetch), with a B+-tree for Relation R on columns C1, C2, C3, C4]
• Entries for the same C1, C2, C3 values (but different C4
values) are located in consecutive leaf pages
• Same for C1,C2 entries but with different C3 or C4 entries
• Matching index scan is a search for consecutively stored
leaf pages
Fall 2001 Database Systems 44
Matching composite index scans
• Hollywood Video relation: (store_zipcode, movieid,
customer_name, customer_zipcode, date_rented)
• Suppose we have a B+-tree index on store_zipcode,
customer_zipcode, movieid, date_rented, in which each
leaf node stores 200 rowids
SELECT S.customer_name
FROM Store S
WHERE S.store_zipcode = 12180 AND
S.customer_zipcode = 12180 AND
S.movie_id between 1000 and 2000
– find the leftmost leaf node with 12180, 12180, and
movie-id = 1000
– read all leaf nodes from left to right until a movie with id > 2000 is found.
23. Query Optimization
23
Fall 2001 Database Systems 45
Matching composite index scans
• How many tuples do we expect in the result?
• Expected number of tuples:
FF = 1/40,000 * 1/40,000 * 1000/10,000 = 6.25 x 10^-11
N = 300 billion * FF = 3 x 10^11 * 6.25 x 10^-11 ≈ 19
• How many disk pages do we read?
N / 200 B+-tree leaf nodes + N pages of the relation ≈ 20
(this assumes no clustering of the STORE relation on the store_zipcode, customer_zipcode, movie_id attributes)
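A sketch of this estimate; the cardinalities and the 200-rowids-per-leaf figure are the ones assumed on this slide:

    import math

    CARD = 300e9                                          # tuples in the relation
    ff = (1 / 40_000) * (1 / 40_000) * (1_000 / 10_000)   # composite filter factor = 6.25e-11
    n = CARD * ff                                         # ~18.75 qualifying tuples
    pages = math.ceil(n / 200) + math.ceil(n)             # ~1 index leaf page + 19 data pages
    print(ff, n, pages)                                   # 6.25e-11, ~19, 20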
Fall 2001 Database Systems 46
Matching composite index scans
• Hollywood Video relation: (store_zipcode, movieid,
customer_name, customer_zipcode, date_rented)
• Suppose we have a B+-tree index on store_zipcode,
customer_zipcode, movieid, date_rented, in which each
leaf node stores 200 rowids
SELECT S.customer_name
FROM Store S
WHERE S.store_zipcode between 12180 and 42180 AND
S.customer_zipcode = 12180 AND
S.movie_id = 20
• Not a matching scan, B+-tree nodes with different
store_zipcode values for customer_zipcode 12180 are not
in consecutive leaf-nodes
24. Query Optimization
24
Fall 2001 Database Systems 47
Matching composite index scans
• How many disk pages do we expect to read if we scanned for
S.customer_zipcode = 12180 AND S.movie_id = 20 ?
– Note that these tuples are not consecutive on disk, we cannot perform
a matching index scan
• We can read the B+-tree index at the leaf level for S.store_zipcode between 12180 and 42180, reading ¾ of the leaf entries.
– Use sequential prefetch = 32:
300 billion / 200 = 1.5 billion leaf nodes total in the B+-tree
1.5 billion * 3/4 = 1.125 billion nodes read for the range search
1.125 billion / 32 ≈ 35 million rounds
35 million * 11.1 ms ≈ 388,500 seconds (roughly 4.5 days)
Fall 2001 Database Systems 48
Multiple Index Access
SELECT T.A, T.B, T.D, T.E FROM T
WHERE (T.A = 4 AND (T.B < 4 OR T.C > 5)) OR T.E = 10
• Assume indices on columns A, B, C, E, individually.
• Plan:
– Find the set SA of all rowids with T.A = 4
– Find the set SB of all rowids with T.B < 4
– Find the set SC of all rowids with T.C > 5
– Find the set SE of all rowids with T.E = 10
– Compute: (SA ∩ (SB ∪ SC)) ∪ SE
– Sort the rowids in the result into lists and prefetch these lists to read the tuples of T from disk
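A toy sketch of the rowid-set combination step; the four sets below merely stand in for the results of the four index scans:

    def combine_rowids(sa: set[int], sb: set[int], sc: set[int], se: set[int]) -> list[int]:
        # (SA ∩ (SB ∪ SC)) ∪ SE, sorted so the matching data pages can be list-prefetched
        return sorted((sa & (sb | sc)) | se)

    # illustrative rowid sets, not real index output
    print(combine_rowids({1, 2, 3}, {2, 9}, {3, 7}, {40}))   # [2, 3, 40]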
25. Query Optimization
25
Fall 2001 Database Systems 49
Multiple Index Access
• Which indices should we use and in which order?
– order the indices with respect to their filter factors
– for each index being considered
• get size of input relation and compute time t1 to read it
• compute expected time ti to use index, i.e. how many index pages
will be read and time to read them
• compute time t2 required to read tuples identified by the index
• if (t1 - t2) >> ti, i.e., if the time gained reading the tuples is much larger than the index read time, then use this index and proceed to the next index
• otherwise break out of the loop and do not use this index (a sketch of this loop follows)
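A Python sketch of this greedy loop; the IndexChoice structure is illustrative, and the numbers passed in are the t1/ti/t2 values worked out on the following slides, not the output of a real optimizer:

    from dataclasses import dataclass

    @dataclass
    class IndexChoice:
        name: str
        index_read_secs: float    # ti: time to read the index leaf pages
        result_read_secs: float   # t2: time to read the tuples it identifies

    def pick_indexes(table_read_secs: float, candidates: list[IndexChoice]) -> list[str]:
        # Greedily add indexes (assumed pre-sorted by filter factor) while the I/O
        # saved on the data pages exceeds the cost of reading the index itself
        chosen, t1 = [], table_read_secs
        for idx in candidates:
            gain = t1 - idx.result_read_secs
            if gain > idx.index_read_secs:
                chosen.append(idx.name)
                t1 = idx.result_read_secs   # the next index filters this smaller set
            else:
                break
        return chosen

    print(pick_indexes(156.0, [
        IndexChoice("T.A", 0.02, 80.0),
        IndexChoice("T.B", 0.12, 1.2),
        IndexChoice("T.C", 5.86, 0.9),
    ]))   # ['T.A', 'T.B'] -- the index on T.C does not pay off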
Fall 2001 Database Systems 50
Multiple Index Access
SELECT T.A, T.B, T.D, T.E FROM T
WHERE T.A = 4 AND T.B < 4 AND T.C > 5
• Suppose T contains 10 million tuples stored on 500,000 disk
pages (i.e., 20 tuples per page)
– FF(T.A) = 1/1000, and there are 50,000 leaf nodes in the
B+-tree index for T.A (i.e., 200 per leaf page)
– FF(T.B) = 1/200, and there are 25,000 leaf nodes in the
B+-tree index for T.B (i.e., 400 per leaf page)
– FF(T.C) = 1/20, and there are 25,000 leaf nodes in the
B+-tree index for T.C (i.e., 400 per leaf page)
• First consider using the index on T.A
26. Query Optimization
26
Fall 2001 Database Systems 51
Do We Use T.A Index?
• Without using index on T.A, we have to read 500,000 pages
– with a sequential prefetch of 0.01 sec per 32 pages, it will
take about 156 secs
• If index is used to find tuples with T.A = 4, we need to retrieve
50,000/1000 = 50 pages
– with a sequential prefetch of 0.01 sec per 32 pages, it will
take = 0.02 secs to read the index
– the result will have an estimated 10 million/1000 = 10,000
tuples in 10,000 different pages in the worst case
– with random I/O of 0.008 sec/page, it will take 80 secs to
read the tuples
• We gain 76 seconds by using the index
Fall 2001 Database Systems 52
Do We Use Index on T.B?
• Without using index on T.B, we have to read 10,000 pages
– with random I/O of 0.008 sec/page, it will take 80 secs
• If index is used to find tuples with T.B < 4, we need to retrieve 3 *
(25,000 / 200) = 375 pages
– with a sequential prefetch of 0.01 sec per 32 pages, it will
take 0.12 secs
• The result will have an estimated 3*(10,000/200) = 150 tuples
which are on 150 different pages in the worst case
– with random I/O of 0.008 sec/page, it will take 1.2 secs to
read them
• We gain 78.7 seconds using the index for T.B
27. Query Optimization
27
Fall 2001 Database Systems 53
Do We Use Index for T.C?
• Without using index on T.C, we have to read 150 pages in
1.2 secs
• If index is used to find tuples with T.C > 5, we have to read
15 * (25,000 / 20) = 18,750 pages
– with a sequential prefetch of 0.01 sec per 32 pages, it will
take = 5.86 secs
• The result will have an estimated 15 * 150 / 20 = 112.5
tuples which are on 113 different pages in the worst case
– with random I/O of 0.008 sec/page, it will take 0.904 secs
• We gain 1.2 - 0.9 = 0.3 seconds by paying 5.86 seconds to
use the index. Hence, using T.C will not pay off