DBMSArchitecture_QueryProcessingandOptimization.pdf

DBMS Architecture, Query Processing and
Optimization
Christalin Nelson | SOCS
1 of 42

At a Glance
• DBMS Architecture
• Introduction to Query Processing
• Translating SQL Queries into Relational Algebra
• Algorithms for External Sorting
• Algorithms for SELECT and JOIN Operations
• Algorithms for PROJECT and Set Operations
• Implementing Aggregate Operations and OUTER JOIN
7-Apr-24
2 of 42

DBMS Instance (1/5)
• A single occurrence of a DBMS running on a server
– It includes the memory structures and processes necessary to manage the DB
– Multiple DBs can be managed by a single DBMS instance, each identified by a unique
name
• Serves as the runtime environment for
– Managing DB operations
– Providing essential services
– Maintaining the overall integrity & performance of DB system
3 of 42

DBMS Instance (2/5)
• Memory Allocation
– Allocates memory for various purposes
• For caching frequently accessed data, storing execution plans, maintaining connection
information, and managing internal structures like locks and latches
– Controlled by parameters specified during the DBMS instance startup or through
configuration settings
• Process Management
– Manage various processes
• Processes can be responsible for handling client requests, executing queries, managing
transactions, and performing administrative tasks
– Processes: Include listener processes, user processes, server processes, and
background processes
4 of 42

DBMS Instance (3/5)
• Session Management
– Track user sessions (represent connections established by clients to interact with DB)
– Involves: allocating resources to each session, maintaining session state, and enforcing
security and access controls
• Resource Management
– Oversee resource allocation & usage to ensure fair distribution & optimal performance
– Resources (such as CPU, memory, disk I/O, and network bandwidth) are managed to
prioritize critical tasks and prevent resource contention
5 of 42

DBMS Instance (4/5)
• Configuration and Parameter Settings
– Allows administrators to configure various settings and parameters to tailor the
system behavior according to specific requirements
– Configuration options: Includes memory allocation settings, parallel processing
degree, caching mechanisms, logging levels, and security policies
• Database Startup and Shutdown
– Control the startup and shutdown processes of the associated DBs
– During startup: Instance initializes data structures, allocates memory, opens DB files,
and starts background processes
– During shutdown: Instance ensures data integrity, flushes dirty pages to disk, releases
resources, and terminates active connections gracefully
6 of 42

DBMS Instance (5/5)
• Monitoring and Diagnostics
– Provides tools and utilities for monitoring system performance, diagnosing issues, and
troubleshooting errors
– Monitoring capabilities: Include real-time performance metrics, database activity logs,
system health checks, and alerting mechanisms
• High Availability and Failover
– Supports high availability configurations to ensure continuous availability of DBs in
case of hardware failures, software crashes, or network outages
– High availability features: Include clustering, replication, automated failover, and
disaster recovery solutions
7 of 42

DBMS Internal Memory Structure
• Components
– Buffer Pool: Stores data pages temporarily in memory, facilitating faster access to
frequently accessed data
– Data Cache: Stores frequently accessed data rows or indexes in memory to minimize
disk I/O
– Transaction Log Buffer: Stores transaction log records temporarily before writing them
to the transaction log file on disk
– Execution Stack: Stores information about currently executing transactions, queries, or
stored procedures
– Metadata Cache: Stores metadata information such as table structures, indexes, and
permissions for quick access
8 of 42

Background Processes
• Background processes are employed by DBMS to manage various tasks
• Tasks include
– Checkpoint Process: Writes modified data pages from buffer pool to disk periodically
to ensure data consistency and recovery
– Log Writer Process: Writes transaction log records from log buffer to transaction log
file on disk
– Archiver Process: Archives transaction log files to provide point-in-time recovery
– Lock Manager: Manages concurrency control by handling lock requests from
concurrent transactions
– Backup & Restore Processes: Handle DB backup and restoration tasks
9 of 42

Data Types
• DBMS supports various data types to represent different kinds of data
• Types
– Numeric: Integer, Floating Point, Decimal
– Character: Char, Varchar, Text
– Date and Time: Date, Time, Timestamp
– Binary: Blob, Binary Large Object
– Spatial: Geometry, Geography
10 of 42

Roles & Privileges
• DBMS provides mechanisms for managing access control through roles and
privileges
– Roles
• Predefined sets of privileges that can be assigned to users or groups of users.
– Privileges
• Permissions granted to users or roles to perform specific operations on database objects
such as tables, views, procedures, and sequences.
• Common privileges: SELECT, INSERT, UPDATE, DELETE, EXECUTE, CREATE, ALTER, and DROP
11 of 42

Introduction to Query Processing (1/3)
• A query is expressed in a high-level query language (such as SQL)
• DBMS uses special techniques internally to process, optimize, and execute high-
level queries
• Typical steps when processing a high-level query
– Scanner
• Identifies query tokens (i.e. SQL keywords, attribute names, and
relation names)
– Parser
• Checks the query syntax to determine whether it is formulated
according to syntax rules (rules of grammar) of query language.
– Validation
• Checks that all attribute and relation names are valid and
semantically meaningful names in the schema of the particular
database being queried.
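The scanner and validation steps can be illustrated with a minimal sketch. The schema catalog, the keyword set, and the function names below are assumptions made for illustration only; the parser (grammar check) is left out, and a real DBMS does considerably more at each step.

```python
import re

# Hypothetical catalog: relation name -> set of attribute names.
SCHEMA = {"EMPLOYEE": {"LNAME", "FNAME", "SALARY", "DNO"}}
KEYWORDS = {"SELECT", "FROM", "WHERE"}

def scan(query):
    """Scanner: split the query text into keyword, name, number and symbol tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>=,*();]", query)

def unknown_names(tokens):
    """Validation: return every identifier that is neither a keyword nor in the schema."""
    known = set(SCHEMA) | set().union(*SCHEMA.values())
    return [t for t in tokens
            if re.match(r"[A-Za-z_]", t)
            and t.upper() not in KEYWORDS
            and t.upper() not in known]

tokens = scan("SELECT LNAME FROM EMPLOYEE WHERE DNO = 5")
```

An empty list from `unknown_names(tokens)` means every attribute and relation name was found in the (assumed) schema.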
12 of 42

Introduction to Query Processing (2/3)
• Internal Representation of Query
– Query tree
– Query graph
• Query Optimization
– Process of choosing a suitable execution strategy (or Query plan) for Query processing
– 2 main Techniques employed
• (1) Order the operations in a Query execution strategy based on heuristic rules (The rules
typically reorder the operations in a Query tree)
– A heuristic works well in most cases but is not guaranteed to work well in every case
• (2) Systematically estimate the cost of different execution strategies and choose the
execution plan with the lowest cost estimate
• Code Generator
– Generates the code to execute that plan
13 of 42

Introduction to Query Processing (3/3)
• Runtime database processor
– Execute query code (compiled/interpreted mode) to produce query result
• Executed directly (interpreted mode)
• Stored and executed later whenever needed (compiled mode)
– Runtime error results in an error message
14 of 42

Translating SQL Queries (1/2)
• SQL Query → Query Blocks → Relational algebra exp. (Query Tree) → Optimized
• Query block
– The basic unit that can be translated into algebraic operators
– Can contain a single SELECT-FROM-WHERE expression
– Can include GROUP BY and HAVING clauses if these are part of the block
– Nested queries within a query → Identified as separate query blocks
– Aggregate operators (E.g. MAX, MIN, SUM, COUNT) → Included in extended algebra
15 of 42

Translating SQL Queries (2/2)
• SQL Query
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > (SELECT MAX (SALARY)
                FROM EMPLOYEE
                WHERE DNO = 5);
• The query is decomposed into 2 query blocks
– Inner block
SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5
• Extended relational algebra expression: ℱ MAX SALARY (σ DNO=5 (EMPLOYEE))
– Outer block (X stands for the result of the inner block)
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > X
• Relational algebra expression: π LNAME, FNAME (σ SALARY > X (EMPLOYEE))
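The decomposition into two query blocks can be checked with a small sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration; the point is only that evaluating the inner block first and substituting its result X into the outer block gives the same answer as the nested query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, FNAME TEXT, SALARY REAL, DNO INTEGER)")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)", [
    ("Smith", "John", 30000, 5),      # made-up sample rows
    ("Wong", "Franklin", 40000, 5),
    ("Borg", "James", 55000, 1),
    ("Zelaya", "Alicia", 25000, 4),
])

# Inner block: evaluated once, yields the constant X.
(x,) = conn.execute("SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5").fetchone()

# Outer block: uses X in place of the nested query.
two_step = conn.execute(
    "SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > ?", (x,)).fetchall()

# Original nested query, for comparison.
nested = conn.execute(
    "SELECT LNAME, FNAME FROM EMPLOYEE "
    "WHERE SALARY > (SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5)").fetchall()
```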
16 of 42

Algorithms for External Sorting (1/3)
• External sorting
– Sorting algorithms that are suitable for large files of records that do not fit entirely in
main memory and are stored on disk (E.g. most database files)
• Sort-Merge strategy
– Requires Buffer space in main memory (part of DBMS cache)
• Divided into individual buffers (each buffer can hold the contents of exactly one disk block)
– (1) Sorting Phase
• Read small subfiles (runs) of the main file into buffer → Sort using internal sort algorithm
→ Write back to disk as temporary sorted subfiles
– (2) Merging Phase
• Merge sorted runs with one or more Merge passes
• The larger sorted subfiles are merged in turn
17 of 42

Algorithms for External Sorting (2/3)
• Number of Initial runs (nR)
– nR = ⌈b / nB⌉
• Here, b – Size of disk file (no. of file blocks), nB – no. of available main memory buffers
• Number of Merge passes (nP)
– nP = ⌈log_dM (nR)⌉
– Degree of merging (dM) = No. of sorted runs that can be merged per merge step = Min (nB-1, nR)
• Example:
– (Q) Size of disk file is 1024 file blocks. Number of main memory buffers available is 5. Find the
number of Initial Runs and Number of Merge passes
• nR = ceiling [1024/5] = 205 (i.e. 204 runs of size 5 blocks each & the last run of size 4 blocks)
• nP = ceiling[log4(205)] = 4
– 205 runs → 52 larger sorted subfiles → 13 → 4 → Final Sorted File
• Cost incurred
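The worked example (1024 blocks, 5 buffers) can be reproduced with a short sketch. The function name is hypothetical, and the pass count is obtained by repeated ceiling division instead of a floating-point logarithm, which should give the same result as the formula for nP.

```python
import math

def sort_merge_counts(b, n_b):
    """Return (nR, dM, nP, run counts after each pass) for b blocks and n_b buffers."""
    if n_b < 3:
        raise ValueError("at least 3 buffers are assumed (2 inputs + 1 output)")
    n_r = math.ceil(b / n_b)          # number of initial runs
    d_m = min(n_b - 1, n_r)           # degree of merging
    counts = [n_r]
    while counts[-1] > 1:             # each pass merges d_m runs into one
        counts.append(math.ceil(counts[-1] / d_m))
    return n_r, d_m, len(counts) - 1, counts

n_r, d_m, n_p, counts = sort_merge_counts(1024, 5)
```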
18 of 42

Algorithms for SELECT Operations (1/7)
• Implementing SELECT operation (Relational Algebra)
– Sample Operation taken for consideration
• (OP1): σ SSN='123456789' (EMPLOYEE)
• (OP2): σ DNUMBER>5 (DEPARTMENT)
• (OP3): σ DNO=5 (EMPLOYEE)
• (OP4): σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
• (OP5): σ ESSN='123456789' AND PNO=10 (WORKS_ON)
20 of 42

• Search Methods for simple Selection (1/3)
– S1. Linear search (brute force)
• Records are grouped into disk blocks → Copy each disk block into a main memory buffer → Search the
records inside the block (i.e. test whether their attribute values satisfy the selection condition)
– S2. Binary search
• Selection condition involves an equality comparison on a key attribute on which the file is
ordered.
– Example: SSN is the ordering attribute in σ SSN='123456789' (EMPLOYEE)
– S3. Using a primary index (or) hash key to retrieve a single record
• If the selection condition involves an equality comparison on a key attribute with a primary
index (or a hash key), use the primary index (or the hash key) to retrieve the record.
– Example: σ SSN='123456789' (EMPLOYEE)
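Methods S1 to S3 can be compared in a minimal sketch. The EMPLOYEE records are made up, a sorted Python list stands in for a file ordered on SSN, and a dictionary stands in for a hash key; none of this reflects how a particular DBMS stores its files.

```python
from bisect import bisect_left

# Made-up EMPLOYEE file, ordered on the key attribute SSN.
EMPLOYEE = [
    {"SSN": "123456789", "LNAME": "Smith", "DNO": 5},
    {"SSN": "333445555", "LNAME": "Wong", "DNO": 5},
    {"SSN": "888665555", "LNAME": "Borg", "DNO": 1},
]
SSN_KEYS = [r["SSN"] for r in EMPLOYEE]           # ordering attribute values
HASH_ON_SSN = {r["SSN"]: r for r in EMPLOYEE}     # stand-in for a hash key (S3)

def linear_search(records, condition):
    """S1: test every record against the selection condition."""
    return [r for r in records if condition(r)]

def binary_search(ssn):
    """S2: equality on the ordering key attribute of the file."""
    i = bisect_left(SSN_KEYS, ssn)
    return EMPLOYEE[i] if i < len(SSN_KEYS) and SSN_KEYS[i] == ssn else None

def hash_lookup(ssn):
    """S3: equality on a key attribute that has a hash key."""
    return HASH_ON_SSN.get(ssn)
```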
21 of 42

• Search Methods for simple Selection (2/3)
– S4. Using a primary index to retrieve multiple records
• If the comparison condition is >, ≥, <, or ≤ on a key attribute with a primary index, use the
index to find the record satisfying the corresponding equality condition, then retrieve all
subsequent records in the (ordered) file.
– S5. Using a clustering index to retrieve multiple records
• If the selection condition involves an equality comparison on a non-key attribute with a
clustering index, use the clustering index to retrieve all the records satisfying the selection
condition.
– Example: σ DNO=5 (EMPLOYEE)
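Method S4 can be sketched for a condition such as DNUMBER > 5. The DEPARTMENT rows are made up, and a sorted list of key values plays the role of the primary index; the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Made-up DEPARTMENT file ordered on the key DNUMBER.
DEPARTMENT = [(1, "Headquarters"), (4, "Administration"), (5, "Research"),
              (7, "Sales"), (9, "Support")]
DNUMBERS = [d[0] for d in DEPARTMENT]   # stand-in for the primary index

def select_greater_than(value):
    """DNUMBER > value: find the boundary via the index, then read the rest of the file."""
    return DEPARTMENT[bisect_right(DNUMBERS, value):]

def select_at_least(value):
    """DNUMBER >= value: same idea, but the boundary includes the equal key."""
    return DEPARTMENT[bisect_left(DNUMBERS, value):]
```

Because the file is ordered on the indexed key, all qualifying records are contiguous, so only one index lookup is needed.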
22 of 42

• Search Methods for Simple Selection (3/3)
– S6. Using a secondary (B+ tree) index on an equality comparison
• Retrieve a single record if the indexing field is a key
(Or)
Retrieve multiple records if the indexing field is not a key
• Retrieve records on conditions involving >,>=, <, or <= (for range queries)
– A Search operation can be
• File scan (S1, S2)
• Index scan (S3, S4, S5, and S6)
• S4 and S6 apply to range queries
23 of 42

• Search Methods for Complex Selection (1/2)
– Conjunctive selection?
• Contains many simple select conditions connected with AND.
• E.g. σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
– S7. Conjunctive selection using an individual index
• If an attribute involved in any single simple condition in the conjunctive condition has an access
path that permits the use of one of the methods S2 to S6: (1) Use that condition to retrieve the
records → (2) Check whether each retrieved record satisfies the remaining simple conditions in
the conjunctive condition.
– Example: σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
– S8. Conjunctive selection using a composite index
• If two or more attributes are involved in equality conditions in the conjunctive condition & a
composite index (or hash) exists on the combined field, we can use the index directly.
– Example: If an index has been created on the composite key (ESSN, PNO) of the WORKS_ON table → Use
the index directly
» σ ESSN='123456789' AND PNO=10 (WORKS_ON)
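Methods S7 and S8 can be sketched as follows. The data and both indexes are made up: a dictionary from DNO to record positions stands in for an index on DNO, and a dictionary keyed on (ESSN, PNO) stands in for a composite index.

```python
from collections import defaultdict

EMPLOYEE = [   # made-up records
    {"SSN": "1", "DNO": 5, "SALARY": 38000, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SALARY": 25000, "SEX": "F"},
    {"SSN": "3", "DNO": 5, "SALARY": 40000, "SEX": "M"},
    {"SSN": "4", "DNO": 4, "SALARY": 43000, "SEX": "F"},
]

# Hypothetical index on DNO: value -> positions of matching records.
DNO_INDEX = defaultdict(list)
for pos, rec in enumerate(EMPLOYEE):
    DNO_INDEX[rec["DNO"]].append(pos)

def select_op4():
    """S7: retrieve by DNO=5 through the index, then test the remaining conditions."""
    candidates = (EMPLOYEE[p] for p in DNO_INDEX.get(5, []))
    return [r for r in candidates if r["SALARY"] > 30000 and r["SEX"] == "F"]

# S8: composite index on (ESSN, PNO) answers both equality conditions at once.
WORKS_ON = [("123456789", 10, 32.5), ("123456789", 20, 7.5), ("453453453", 10, 20.0)]
COMPOSITE_INDEX = {(row[0], row[1]): row for row in WORKS_ON}

def select_op5():
    return COMPOSITE_INDEX.get(("123456789", 10))
```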
24 of 42

• Search Methods for Complex Selection (2/2)
– S9. Conjunctive selection by intersection of record pointers
• If secondary indexes are available on all (or more than one) of the fields involved in the simple
conditions of the conjunctive condition & if the indexes include record pointers (rather than block pointers)
→ Each index can be used to retrieve the record pointers that satisfy the individual
condition
– Intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive
condition, which are then used to retrieve those records directly
• If only some of the conditions have secondary indexes, each retrieved record is further
tested to determine whether it satisfies the remaining conditions
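Method S9 can be sketched with sets of record pointers. The records are made up, list positions play the role of record pointers, and secondary indexes are assumed to exist on DNO and SEX but not on SALARY, so the SALARY condition is tested on the retrieved records.

```python
from collections import defaultdict

# Made-up EMPLOYEE records; the list position acts as the record pointer.
EMPLOYEE = [
    {"LNAME": "English", "DNO": 5, "SALARY": 38000, "SEX": "F"},
    {"LNAME": "Smith", "DNO": 5, "SALARY": 30000, "SEX": "M"},
    {"LNAME": "Zelaya", "DNO": 4, "SALARY": 25000, "SEX": "F"},
    {"LNAME": "Narayan", "DNO": 5, "SALARY": 38000, "SEX": "M"},
]

def build_secondary_index(records, attr):
    """Map each attribute value to the set of record pointers holding it."""
    index = defaultdict(set)
    for pointer, rec in enumerate(records):
        index[rec[attr]].add(pointer)
    return index

DNO_INDEX = build_secondary_index(EMPLOYEE, "DNO")
SEX_INDEX = build_secondary_index(EMPLOYEE, "SEX")

# Intersect the pointer sets, then fetch only the surviving records.
pointers = DNO_INDEX[5] & SEX_INDEX["F"]
result = [EMPLOYEE[p] for p in sorted(pointers)
          if EMPLOYEE[p]["SALARY"] > 30000]   # remaining condition, no index
```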
25 of 42

• Search Methods
– Whenever a single condition specifies the selection, we can only check whether an
access path exists on the attribute involved in that condition. If an access path exists,
the method corresponding to that access path is used; otherwise, the “brute force”
linear search approach of method S1 is used. (OP1, OP2 and OP3)
– For conjunctive selection conditions, whenever more than one of the attributes
involved in the conditions has an access path, query optimization should be done to
choose the access path that retrieves the fewest records most efficiently.
– Disjunctive selection conditions (Uses OR)
26 of 42

Algorithms on Join Operations (1/8)
• Join (EQUIJOIN, NATURAL JOIN)
– Two-way join: Join on 2 files
• e.g. R ⋈ A=B S
– Multi-way joins: Joins involving >2 files
• e.g. R ⋈ A=B S ⋈ C=D T
– Examples
• (OP6): EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT
• (OP7): DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
• Factors affecting JOIN performance
– Available buffer space
– Join selection factor
– Choice of inner VS outer relation
27 of 42

• Methods for implementing joins
– J1. Nested-loop join (brute force)
• For each record t in R (outer loop), retrieve every record s from S (inner loop) and test
whether the two records satisfy the join condition t[A] = s[B].
– J2. Single-loop join (Using an access structure to retrieve the matching records)
• If an index (or hash key) exists for one of the two join attributes (say, B of S) – retrieve each
record t in R, one at a time, and then use the access structure to retrieve directly all
matching records s from S that satisfy s[B] = t[A].
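J1 and J2 can be sketched on tuples, with `a` and `b` giving the positions of the join attributes. The sample rows are made up, and the dictionary built inside `single_loop_join` stands in for an index that, in a DBMS, would already exist on attribute B of S.

```python
EMPLOYEE = [("Smith", 5), ("Wong", 5), ("Borg", 1), ("Zelaya", 4)]           # (LNAME, DNO)
DEPARTMENT = [(1, "Headquarters"), (4, "Administration"), (5, "Research")]   # (DNUMBER, DNAME)

def nested_loop_join(r, s, a, b):
    """J1: for each t in R (outer loop), test every u in S (inner loop)."""
    return [t + u for t in r for u in s if t[a] == u[b]]

def single_loop_join(r, s, a, b):
    """J2: one loop over R, probing an access structure on S.B."""
    index = {}                       # stand-in for an existing index on S.B
    for u in s:
        index.setdefault(u[b], []).append(u)
    return [t + u for t in r for u in index.get(t[a], [])]

op6_nested = nested_loop_join(EMPLOYEE, DEPARTMENT, 1, 0)
op6_indexed = single_loop_join(EMPLOYEE, DEPARTMENT, 1, 0)
```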
28 of 42

• Methods for implementing joins
– J3. Sort-merge join
• If the records of R and S are physically sorted by the value of the join attributes A and B
– Both files are scanned in order of the join attributes, matching the records that have the same
values for A and B.
» Here, the records of each file are scanned only once each for matching with the other file
– If both A and B are non-key attributes, the method needs to be modified slightly.
– J4. Hash-join
• The records of R and S are both hashed to the same hash file, using the same hashing
function on the join attributes A of R and B of S as hash keys
– A single pass through the file with fewer records (say, R) hashes its records to hash file buckets
– A single pass through the other file (S) then hashes each of its records to the appropriate bucket,
where the record is combined with all matching records from R.
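J3 and J4 can be sketched as below. Both functions are simplified in-memory illustrations with hypothetical names: `sort_merge_join` sorts its inputs itself (the slide assumes the files are already sorted) and handles repeated join values on both sides, and `hash_join` uses a dictionary as the hash file.

```python
def sort_merge_join(r, s, a, b):
    """J3: scan both inputs in join-attribute order, matching equal values."""
    r = sorted(r, key=lambda t: t[a])
    s = sorted(s, key=lambda u: u[b])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][a] < s[j][b]:
            i += 1
        elif r[i][a] > s[j][b]:
            j += 1
        else:
            # Non-key case: pair r[i] with the whole group of equal values in s.
            k = j
            while k < len(s) and s[k][b] == r[i][a]:
                out.append(r[i] + s[k])
                k += 1
            i += 1
    return out

def hash_join(r, s, a, b):
    """J4: hash the smaller input R into buckets, then probe with each record of S."""
    buckets = {}
    for t in r:                                   # single pass over R
        buckets.setdefault(t[a], []).append(t)
    return [t + u for u in s for t in buckets.get(u[b], [])]   # single pass over S

DEPARTMENT = [(1, "Headquarters"), (5, "Research")]                      # made-up rows
EMPLOYEE = [("Smith", 5), ("Borg", 1), ("Wong", 5), ("Zelaya", 4)]
merged = sort_merge_join(DEPARTMENT, EMPLOYEE, 0, 1)
hashed = hash_join(DEPARTMENT, EMPLOYEE, 0, 1)
```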
29 of 42

• Partition hash join (1/2)
– Partitioning phase
• Each file (R and S) is first partitioned into M partitions using a partitioning hash function on
the join attributes: R1, R2, R3, ...... Rm and S1, S2, S3, ...... Sm
• Min. In-memory buffers needed: M+1
• A disk sub-file is created per partition to store the tuples for that partition.
– Joining (or) Probing phase
• Involves M iterations (one per partitioned file)
• Iteration ‘i’ involves joining partitions Ri and Si
32 of 42

• Partition hash join (2/2)
– Procedure
• Assume Ri is smaller than Si
• Copy the records of Ri into memory buffers
• Read all blocks of Si, one at a time, and use each record from Si to probe for
matching record(s) in partition Ri
• Join each matching record from Ri with the record from Si and write the result to the result file
– Cost
• 3* (bR + bS) + bRES
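The partitioning and probing phases, together with the cost formula, can be sketched as follows. Lists stand in for the disk sub-files, Python's built-in `hash` stands in for the partitioning hash function, and the function names and sample rows are assumptions for illustration.

```python
def partition(records, attr, m):
    """Partitioning phase: split a file into m partitions on the join attribute."""
    parts = [[] for _ in range(m)]
    for rec in records:
        parts[hash(rec[attr]) % m].append(rec)
    return parts

def partition_hash_join(r, s, a, b, m=3):
    """Probing phase: iteration i joins only partitions Ri and Si."""
    result = []
    for r_i, s_i in zip(partition(r, a, m), partition(s, b, m)):
        table = {}
        for t in r_i:                            # load Ri into memory
            table.setdefault(t[a], []).append(t)
        for u in s_i:                            # probe with each record of Si
            result.extend(t + u for t in table.get(u[b], []))
    return result

def join_cost(b_r, b_s, b_res):
    """Block accesses: 3 * (bR + bS) + bRES."""
    return 3 * (b_r + b_s) + b_res

R = [(1, "Headquarters"), (5, "Research"), (4, "Administration")]
S = [("Smith", 5), ("Borg", 1), ("Wong", 5), ("Jabbar", 7)]
joined = partition_hash_join(R, S, 0, 1)
```

The factor 3 reflects that each file is read once for partitioning, written once as partitions, and read once more during probing.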
33 of 42

• Hybrid Hash join
– Same as a Partition Hash join except the following
• Difference: Joining phase of one of the partitions is included during the partitioning phase
– Partitioning phase
• Allocate buffers for smaller relation (one block for each of the M-1 partitions, remaining
blocks to partition 1)
• Repeat for the larger relation in the pass-through S
– Joining phase
• M-1 iterations are needed for the partitions R2, R3, R4 , ......Rm and S2, S3, S4, ......Sm
• R1 and S1 are joined during the partitioning pass over S, so the results of joining R1 and S1 are
already written to the disk by the end of the partitioning phase
34 of 42

Algorithm for Project Operations
• π <attribute list>(R)
– If <attribute list> includes a key of relation R, extract all tuples from R with only the values
for the attributes in <attribute list>
– If <attribute list> does NOT include a key of relation R, duplicated tuples must be
removed from the results
• Methods to remove duplicate tuples
– Sorting
– Hashing
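Both duplicate-elimination methods can be sketched in a few lines. The records and function names are made up; a Python set plays the role of the hash table.

```python
def project_by_hashing(records, attrs):
    """PROJECT with duplicate removal via a hash set (keeps first occurrences)."""
    seen, out = set(), []
    for rec in records:
        t = tuple(rec[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def project_by_sorting(records, attrs):
    """PROJECT with duplicate removal via sorting: duplicates become adjacent."""
    projected = sorted(tuple(rec[a] for a in attrs) for rec in records)
    return [t for i, t in enumerate(projected) if i == 0 or t != projected[i - 1]]

EMPLOYEE = [   # made-up records; SSN is the key
    {"SSN": "1", "DNO": 5, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SEX": "F"},
    {"SSN": "3", "DNO": 4, "SEX": "M"},
]
```

Projecting on a list that includes SSN cannot produce duplicates, so the elimination step could be skipped in that case.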
35 of 42

Algorithm for Set Operation (1/2)
• Set operations
– CARTESIAN PRODUCT, UNION, INTERSECTION, SET DIFFERENCE
• CARTESIAN PRODUCT of relations R and S
– Includes all possible combinations of records from R and S. The attributes of the result
include all attributes of R and S.
– Cost analysis
• If R has n records and j attributes & S has m records and k attributes, the result relation will
have n*m records and j+k attributes.
• CARTESIAN PRODUCT operation is very expensive and should be avoided if possible
36 of 42

Algorithm for Set Operation (2/2)
• UNION
– Sort the two relations on the same attributes.
– Scan and merge both sorted files concurrently; whenever the same tuple exists in both
relations, only one copy is kept in the merged result.
• INTERSECTION
– Sort the two relations on the same attributes.
– Scan and merge both sorted files concurrently, and keep in the merged results only
those tuples that appear in both relations.
• SET DIFFERENCE R-S
– Keep in the merged results only those tuples that appear in relation R but not in
relation S
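The three operations share one sort-and-merge scan and differ only in which tuples they keep. The sketch below assumes union-compatible inputs without internal duplicates; the function name and the `op` parameter are hypothetical.

```python
def merge_set_op(r, s, op):
    """Sort both relations, then scan them together; op is 'union', 'intersection' or 'difference'."""
    r, s = sorted(r), sorted(s)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i] < s[j]:                 # tuple only in R
            if op in ("union", "difference"):
                out.append(r[i])
            i += 1
        elif r[i] > s[j]:               # tuple only in S
            if op == "union":
                out.append(s[j])
            j += 1
        else:                           # tuple in both: keep a single copy
            if op in ("union", "intersection"):
                out.append(r[i])
            i += 1
            j += 1
    if op in ("union", "difference"):
        out.extend(r[i:])               # leftover tuples of R
    if op == "union":
        out.extend(s[j:])               # leftover tuples of S
    return out

R = [(1,), (3,), (5,)]
S = [(3,), (4,)]
```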
37 of 42

Implementing Aggregate Operations (1/2)
• Aggregate operators
– MAX, MIN, SUM, COUNT and AVG
• Options to implement aggregate operators
– Table Scan
– Index
• Example: SELECT MAX (SALARY) FROM EMPLOYEE;
– If an (ascending) index on SALARY exists for the employee relation, then the optimizer
could decide on traversing the index for the largest value, which would entail
following the rightmost pointer in each index node from the root to a leaf.
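The rightmost-pointer traversal can be sketched on a toy index. The nested-dictionary tree below is a made-up stand-in for a B+-tree on SALARY; only its shape matters, not how a real index node is laid out.

```python
# Hypothetical two-level index on SALARY (ascending).
leaf1 = {"leaf": True, "keys": [25000, 30000]}
leaf2 = {"leaf": True, "keys": [38000, 40000]}
leaf3 = {"leaf": True, "keys": [43000, 55000]}
root = {"leaf": False, "keys": [38000, 43000], "children": [leaf1, leaf2, leaf3]}

def index_max(node):
    """MAX: follow the rightmost pointer from the root down to a leaf."""
    while not node["leaf"]:
        node = node["children"][-1]
    return node["keys"][-1]

def index_min(node):
    """MIN: the mirror image, following the leftmost pointer."""
    while not node["leaf"]:
        node = node["children"][0]
    return node["keys"][0]
```

Either traversal touches one node per index level, so no table scan is needed.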
38 of 42

Implementing Aggregate Operations (2/2)
• SUM, COUNT and AVG
– For a dense index (each record has one index entry)
• Apply the associated computation to the values in the index.
– For a non-dense index (actual no. of records associated with each index entry must be
accounted for)
• With GROUP BY
– The aggregate operator must be applied separately to each group of tuples
• Use sorting or hashing on group attributes to partition the file into the appropriate groups
• Computes the aggregate function for the tuples in each group.
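Grouping by hashing can be sketched as follows: a dictionary keyed on the grouping attribute partitions the records, and the aggregates are then computed per group. The records and the function name are made up for illustration.

```python
from collections import defaultdict

def group_aggregate(records, group_attr, agg_attr):
    """Partition records by hashing on group_attr, then aggregate agg_attr per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_attr]].append(rec[agg_attr])
    return {
        g: {"COUNT": len(v), "SUM": sum(v), "AVG": sum(v) / len(v),
            "MAX": max(v), "MIN": min(v)}
        for g, v in groups.items()
    }

EMPLOYEE = [   # made-up records
    {"DNO": 5, "SALARY": 30000},
    {"DNO": 5, "SALARY": 40000},
    {"DNO": 4, "SALARY": 25000},
]
per_department = group_aggregate(EMPLOYEE, "DNO", "SALARY")
```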
39 of 42

Implementing Outer Joins (1/2)
• Outer Join Operators
– FULL OUTER JOIN, LEFT OUTER JOIN, and RIGHT OUTER JOIN
• Full outer join
– Result: Equivalent to Union of results of the left and right outer joins.
– Example (left outer join)
• SELECT FNAME, DNAME FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO =
DNUMBER);
– Note: The result of this query is a table of employee names and their associated departments.
» It is similar to a regular join result, with the exception that if an employee does not have an
associated department, the employee's name will still appear in the resulting table,
although the department name would be indicated as null.
40 of 42

Implementing Outer Joins (2/2)
• Modifying Join Algorithms
– Nested Loop or Sort-Merge joins can be modified to implement Outer Join
• E.g. for a left outer join, use the left relation as the outer relation and construct the result from
every tuple in the left relation.
– If there is a match, the concatenated tuple is saved in the result
– However, if an outer tuple does not match, the tuple is still included in the result but is
padded with null value(s)
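The modified nested-loop join can be sketched as below. The rows are made up, `None` stands in for SQL NULL, and `right_width` (the number of attributes in the right relation) is a hypothetical parameter used only to size the padding.

```python
def left_outer_join(left, right, a, b, right_width):
    """Nested-loop join that keeps unmatched left tuples, padded with None."""
    out = []
    for t in left:                                   # left relation is the outer loop
        matches = [u for u in right if t[a] == u[b]]
        if matches:
            out.extend(t + u for u in matches)       # regular join result
        else:
            out.append(t + (None,) * right_width)    # pad with nulls
    return out

EMPLOYEE = [("John", 5), ("Alicia", 9)]     # (FNAME, DNO); department 9 does not exist
DEPARTMENT = [(5, "Research")]              # (DNUMBER, DNAME)
rows = left_outer_join(EMPLOYEE, DEPARTMENT, 1, 0, right_width=2)
```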
41 of 42

Thank You
42 of 42