2. OUTLINE
RAID
File Organization – Organization of Records in Files
Indexing and Hashing
Ordered Indices
B+ tree Index Files – B tree Index Files
Static Hashing – Dynamic Hashing
Query Processing Overview
Algorithms for SELECT and JOIN operations
Query optimization using Heuristics and Cost Estimation
Prepared by R.Arthy, AP/IT, KCET
4. CLASSIFICATION OF PHYSICAL STORAGE
MEDIA
Can differentiate storage into:
volatile storage: loses contents when power is switched off
non-volatile storage:
Contents persist even when power is switched off.
Includes secondary and tertiary storage, as well as battery-backed-up
main memory.
Factors affecting choice of storage media include
Speed with which data can be accessed
Cost per unit of data
Reliability
6. [CONTD…]
primary storage: Fastest media but volatile (cache, main
memory).
secondary storage: next level in hierarchy, non-volatile,
moderately fast access time
Also called on-line storage
E.g., flash memory, magnetic disks
tertiary storage: lowest level in hierarchy, non-volatile, slow
access time
also called off-line storage and used for archival storage
e.g., magnetic tape, optical storage
Magnetic tape
Sequential access, 1 to 12 TB capacity
A few drives with many tapes
Juke boxes with petabytes (1000’s of TB) of storage
7. RAID
RAID: Redundant Arrays of Independent Disks
Disk organization techniques that manage a large number of
disks, providing the view of a single disk of
high capacity and high speed by using multiple disks in parallel,
high reliability by storing data redundantly, so that data can be
recovered even if a disk fails
The chance that some disk out of a set of N disks will fail
is much higher than the chance that a specific single disk
will fail.
E.g., a system with 100 disks, each with MTTF of 100,000
hours (approx. 11 years), will have a system MTTF of 1000
hours (approx. 41 days)
Techniques for using redundancy to avoid data loss are critical
with large numbers of disks
8. IMPROVEMENT OF RELIABILITY VIA REDUNDANCY
Redundancy – store extra information that can be used to
rebuild information lost in a disk failure
E.g., Mirroring (or shadowing)
Duplicate every disk. Logical disk consists of two physical disks.
Every write is carried out on both disks
Reads can take place from either disk
If one disk in a pair fails, data still available in the other
Data loss would occur only if a disk fails, and its mirror disk also fails
before the system is repaired
Probability of combined event is very small
Except for dependent failure modes such as fire or building collapse or
electrical power surges
Mean time to data loss depends on mean time to failure,
and mean time to repair
E.g., MTTF of 100,000 hours, mean time to repair of 10 hours gives
mean time to data loss of 500 × 10^6 hours (or about 57,000 years) for a
mirrored pair of disks (ignoring dependent failure modes)
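The arithmetic behind this figure can be checked with the standard independent-failure approximation, MTDL ≈ MTTF² / (2 · MTTR), using the slide's numbers:

```python
# Mean time to data loss for a mirrored pair, assuming independent failures:
# a second failure must occur within the repair window of the first, giving
# MTDL ~= MTTF^2 / (2 * MTTR).
mttf = 100_000          # hours, per-disk mean time to failure
mttr = 10               # hours, mean time to repair
mtdl = mttf ** 2 / (2 * mttr)
print(mtdl)             # 5e8 hours, i.e. 500 * 10^6
print(mtdl / 8760)      # hours -> years: roughly 57,000 years
```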
9. IMPROVEMENT IN PERFORMANCE VIA PARALLELISM
Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time.
Improve transfer rate by striping data across multiple disks.
Bit-level striping – split the bits of each byte across multiple
disks
In an array of eight disks, write bit i of each byte to disk i.
Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more
Block-level striping – with n disks, block i of a file goes to
disk (i mod n) + 1
Requests for different blocks can run in parallel if the blocks reside on
different disks
A request for a long sequence of blocks can utilize all disks in parallel
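The block-to-disk mapping above is a one-liner; a minimal sketch (disk numbering 1..n as on the slide):

```python
def disk_for_block(i: int, n: int) -> int:
    """With n disks numbered 1..n, block i of a file goes to disk (i mod n) + 1."""
    return (i % n) + 1

# consecutive blocks land on different disks, so they can be read in parallel
placement = [disk_for_block(i, 4) for i in range(8)]
print(placement)        # [1, 2, 3, 4, 1, 2, 3, 4]
```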
10. RAID LEVELS
Schemes to provide redundancy at lower cost by using disk
striping combined with parity bits
Different RAID organizations, or RAID levels, have differing cost,
performance and reliability characteristics
RAID Level 0: Block striping; non-redundant.
Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping
Offers best write performance.
Popular for applications such as storing log files in a database system.
11. [CONTD…]
RAID Level 2: Memory-Style Error-Correcting-Codes
(ECC) with bit striping.
RAID Level 3: Bit-Interleaved Parity
a single parity bit is enough for error correction, not just
detection, since we know which disk has failed
When writing data, corresponding parity bits must also be computed
and written to a parity bit disk
To recover data in a damaged disk, compute XOR of bits from other
disks (including parity bit disk)
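The parity computation and the recovery step are both plain bytewise XOR; a minimal sketch with three illustrative data blocks:

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (parity computation and recovery)."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

data = [b"\x0f\xa0", b"\x33\x5c", b"\xf0\x01"]   # blocks on three data disks
p = xor_blocks(data)                              # block on the parity disk
# disk 2 fails: XOR of the surviving blocks and the parity block restores it,
# since p = d0 ^ d1 ^ d2 implies d0 ^ d2 ^ p = d1
recovered = xor_blocks([data[0], data[2], p])
assert recovered == data[1]
```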
12. [CONTD…]
RAID Level 3 (Cont.)
Faster data transfer than with a single disk, but fewer I/Os per
second since every disk has to participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost).
RAID Level 4: Block-Interleaved Parity; uses block-level
striping, and keeps a parity block on a separate disk for
corresponding blocks from N other disks.
When writing data block, corresponding block of parity bits must
also be computed and written to parity disk
To find value of a damaged block, compute XOR of bits from
corresponding blocks (including parity block) from other disks.
13. [CONTD…]
RAID Level 4 (Cont.)
Provides higher I/O rates for independent block reads than
Level 3
block read goes to a single disk, so blocks stored on different disks
can be read in parallel
Provides higher transfer rates for reads of multiple blocks than
a non-striped system
Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and
new value of current block (2 block reads + 2 block writes)
Or by recomputing the parity value using the new values of blocks
corresponding to the parity block
More efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block
writes since every block write also writes to the parity disk
14. [CONTD…]
RAID Level 5: Block-Interleaved Distributed Parity;
partitions data and parity among all N + 1 disks, rather than
storing data in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is stored on disk
(n mod 5) + 1, with the data blocks stored on the other 4 disks.
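The rotating parity placement from the example can be sketched directly (disks numbered 1..5 as on the slide):

```python
def parity_disk(n: int, ndisks: int = 5) -> int:
    """Disk (numbered 1..ndisks) holding the parity block of the n-th block set."""
    return (n % ndisks) + 1

# parity rotates across all disks instead of living on one dedicated disk
print([parity_disk(n) for n in range(6)])   # [1, 2, 3, 4, 5, 1]
```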
15. [CONTD…]
RAID Level 5 (Cont.)
Block writes occur in parallel if the blocks and their parity blocks are
on different disks.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5,
but stores two error correction blocks (P, Q) instead of single
parity block to guard against multiple disk failures.
Better reliability than Level 5 at a higher cost
Becoming more important as storage sizes increase
16. CHOICE OF RAID LEVEL
Factors in choosing RAID level
Monetary cost
Performance: Number of I/O operations per second, and bandwidth
during normal operation
Performance during failure
Performance during rebuild of failed disk
Including time taken to rebuild failed disk
RAID 0 is used only when data safety is not important
E.g., data can be recovered quickly from other sources
Levels 2 and 4 are never used since they are subsumed by Levels 3 and 5
Level 3 is not used anymore since bit-striping forces single
block reads to access all disks, wasting disk arm movement,
which block striping (level 5) avoids
Level 6 is rarely used since levels 1 and 5 offer adequate safety
for most applications
17. [CONTD…]
Level 1 provides much better write performance than level 5
Level 5 requires at least 2 block reads and 2 block writes to write a
single block, whereas Level 1 only requires 2 block writes
Level 1 has higher storage cost than level 5
Level 5 is preferred for applications where writes are sequential
and large (many blocks), and need large amounts of data storage
RAID 1 is preferred for applications with many random/small
updates
Level 6 gives better data protection than RAID 5 since it can
tolerate two disk (or disk block) failures
Increasing in importance since latent block failures on one disk,
coupled with a failure of another disk can result in data loss with
RAID 1 and RAID 5.
18. HARDWARE ISSUES
Software RAID: RAID implementations done entirely in
software, with no special hardware support
Hardware RAID: RAID implementations with special
hardware
Use non-volatile RAM to record writes that are being executed
Beware: power failure during write can result in corrupted
disk
E.g., failure after writing one block but before writing the second in a
mirrored system
Such corrupted data must be detected when power is restored
Recovery from corruption is similar to recovery from failed disk
NV-RAM helps to efficiently detect potentially corrupted blocks
Otherwise all blocks of disk must be read and compared with
mirror/parity block
19. [CONTD…]
Latent failures: data successfully written earlier gets damaged
can result in data loss even if only one disk fails
Data scrubbing:
continually scan for latent failures, and recover from copy/parity
Hot swapping: replacement of disk while system is running,
without power down
Supported by some hardware RAID systems,
reduces time to recovery, and improves availability greatly
Many systems maintain spare disks which are kept online, and
used as replacements for failed disks immediately on detection
of failure
Reduces time to recovery greatly
Many hardware RAID systems ensure that a single point of
failure will not stop the functioning of the system by using
Redundant power supplies with battery backup
Multiple controllers and multiple interconnections to guard against
controller/interconnection failures
20. OPTIMIZATION OF DISK-BLOCK ACCESS
Buffering: in-memory buffer to cache disk blocks
Read-ahead: Read extra blocks from a track in
anticipation that they will be requested soon
Disk-arm-scheduling algorithms re-order block requests
so that disk arm movement is minimized
elevator algorithm
22. INTRODUCTION
The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.
One approach:
assume record size is fixed
each file has records of one particular type only
different files are used for different relations
This case is easiest to implement; will consider variable
length records later.
23. FIXED LENGTH RECORD
Simple approach:
Store record i starting from byte n ∗ (i − 1), where n is the
size of each record.
Record access is simple but records may cross blocks.
Deletion of record i — alternatives:
move records i + 1,...,n to i, . . . , n − 1
move record n to i
Link all free records on a free list
24. EXAMPLE
type instructor = record
ID varchar(5);
name varchar(20);
dept_name varchar(20);
salary numeric(8,2);
end
instructor record is 53 bytes long.
Two problems:
Unless the block size happens to be a multiple of 53 (which is
unlikely), some records will cross block boundaries.
It is difficult to delete a record from this structure.
27. FILE HEADER AND FREE LIST
Store the address of the first record whose contents are
deleted in the file header.
Use this first record to store the address of the second
available record, and so on.
Can think of these stored addresses as pointers since they
“point” to the location of a record.
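The header-plus-free-list scheme can be sketched with a Python list standing in for the file of fixed-length records (slot contents are illustrative):

```python
# Free-list bookkeeping over a fixed-length record file: the file header
# points at the first deleted slot, and each deleted slot stores the next.
class RecordFile:
    def __init__(self, records):
        self.slots = list(records)
        self.free_head = None          # header field: first deleted slot

    def delete(self, i):
        # reuse the deleted record's own space to hold the next-free pointer
        self.slots[i] = ("FREE", self.free_head)
        self.free_head = i

    def insert(self, rec):
        if self.free_head is None:     # no holes: append at the end
            self.slots.append(rec)
            return len(self.slots) - 1
        i = self.free_head             # pop the head of the free list
        self.free_head = self.slots[i][1]
        self.slots[i] = rec
        return i

f = RecordFile(["r0", "r1", "r2"])
f.delete(1); f.delete(0)               # free list: header -> 0 -> 1
assert f.insert("new") == 0            # most recently freed slot reused first
assert f.insert("newer") == 1
```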
28. [CONTD…]
More space efficient representation: reuse space for
normal attributes of free records to store pointers. (No
pointers stored in in-use records.)
Dangling pointers occur if we move or delete a record to
which another record contains a pointer; that pointer no
longer points to the desired record.
Avoid moving or deleting records that are pointed to by
other records; such records are pinned.
29. VARIABLE LENGTH RECORD
Variable-length records arise in database systems in
several ways:
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more
fields.
Record types that allow repeating fields (used in some older
data models).
Byte string representation
Attach an end-of-record (⊥) control character to the end of
each record
Difficulty with deletion
Difficulty with growth
30. SLOTTED PAGE STRUCTURE
Header contains:
number of record entries
end of free space in the block
location and size of each record
Records can be moved around within a page to keep them
contiguous with no empty space between them; entry in the header
must then be updated.
Pointers should not point directly to record — instead they should
point to the entry for the record in header.
31. ORGANIZATION OF RECORDS IN
FILES
Heap – a record can be placed anywhere in the file where
there is space
Sequential – store records in sequential order, based on
the value of the search key of each record
Hashing – a hash function is computed on some attribute
of each record; the result specifies in which block of the
file the record should be placed
Clustering – records of several different relations can be
stored in the same file; related records are stored on the
same block
32. SEQUENTIAL FILE ORGANIZATION
Suitable for applications that require sequential
processing of the entire file
The records in the file are ordered by a search-key
33. [CONTD…]
Deletion – use pointer chains
Insertion – must locate the position in the file where the
record is to be inserted
if there is free space insert there
if no free space, insert the record in an overflow block
In either case, pointer chain must be updated
Need to reorganize the file from time to time to restore
sequential order
34. CLUSTERING FILE ORGANIZATION
Simple file structure stores each relation in a separate file
Can instead store several relations in one file using a
clustering file organization
E.g., clustering organization of department and employee:
36. INTRODUCTION
Indexing mechanisms used to speed up access to desired
data.
E.g., author catalog in library
Search Key - attribute or set of attributes used to look up
records in a file.
An index file consists of records (called index entries) of
the form (search-key, pointer)
Index files are typically much smaller than the original
file
Two basic kinds of indices:
Ordered indices: search keys are stored in some sorted order
Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
37. INDEX EVALUATION METRICS
Access types: The types of access that are supported
efficiently.
Access time: The time it takes to find a particular data
item, or set of items, using the technique in question.
Insertion time: The time it takes to insert a new data item.
Deletion time: The time it takes to delete a data item.
Space overhead: The additional space occupied by an
index structure.
39. INTRODUCTION
In an ordered index, index entries are stored sorted on the
search key value.
E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose
search key specifies the sequential order of the file.
Also called clustering index
The search key of a primary index is usually but not necessarily the
primary key.
Secondary index: an index whose search key specifies an
order different from the sequential order of the file. Also
called non-clustering index.
Index-sequential file: ordered sequential file with a primary
index.
44. INDEX UPDATE
Insertion
First, the system performs a lookup using the search-key value
that appears in the record to be inserted. The actions the
system takes next depend on whether the index is dense or
sparse:
Dense indices:
1. If the search-key value does not appear in the index, the
system inserts an index entry with the search-key value in the
index at the appropriate position.
2. Otherwise the following actions are taken:
If the index entry stores pointers to all records with the same search
key value, the system adds a pointer to the new record in the index
entry.
Otherwise, the index entry stores a pointer to only the first record
with the search-key value. The system then places the record being
inserted after the other records with the same search-key values.
45. [CONTD…] DELETION
Dense indices:
1. If the deleted record was the only record with its
particular search-key value, then the system deletes the
corresponding index entry from the index.
2. Otherwise the following actions are taken:
If the index entry stores pointers to all records with the same
search key value, the system deletes the pointer to the deleted
record from the index entry.
Otherwise, the index entry stores a pointer to only the first
record with the search-key value. In this case, if the deleted
record was the first record with the search-key value, the
system updates the index entry to point to the next record.
46. [CONTD…]
Sparse indices:
1. If the index does not contain an index entry with the
search-key value of the deleted record, nothing needs to
be done to the index.
2. Otherwise the system takes the following actions:
If the deleted record was the only record with its search key,
the system replaces the corresponding index record with an
index record for the next search-key value (in search-key
order). If the next search-key value already has an index
entry, the entry is deleted instead of being replaced.
Otherwise, if the index entry for the search-key value points
to the record being deleted, the system updates the index
entry to point to the next record with the same search-key
value.
48. SELECT OPERATIONS
File scan – search algorithms that locate and retrieve
records that fulfill a selection condition.
Algorithm A1 (linear search). Scan each file block and
test all records to see whether they satisfy the selection
condition.
Cost estimate = br block transfers + 1 seek
br denotes number of blocks containing records from relation r
If selection is on a key attribute, can stop on finding record
cost = (br/2) block transfers + 1 seek, on average
Linear search can be applied regardless of
selection condition or
ordering of records in the file, or
availability of indices
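The A1 cost model is simple enough to evaluate directly; a minimal sketch with an illustrative (assumed) relation size:

```python
# Linear-search (A1) cost under the slide's model; br is an assumed figure.
br = 400                       # blocks holding records of relation r
cost_any = br                  # general case: scan every block (+ 1 seek)
cost_key_avg = br / 2          # equality on a key: stop at the match, on average
print(cost_any, cost_key_avg)  # 400 200.0
```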
49. [CONTD…]
A2 (binary search). Applicable if selection is an
equality comparison on the attribute on which file is
ordered.
Assume that the blocks of a relation are stored contiguously
Cost estimate (number of disk blocks to be scanned):
cost of locating the first tuple by a binary search on the blocks
⌈log2(br)⌉ ∗ (tT + tS)
If there are multiple records satisfying selection
Add transfer cost of the number of blocks containing records that
satisfy selection condition
50. [CONTD…]
Index scan – search algorithms that use an index
selection condition must be on search-key of index.
A3 (primary index on candidate key, equality). Retrieve a
single record that satisfies the corresponding equality
condition
Cost = (hi + 1) * (tT + tS)
A4 (primary index on nonkey, equality) Retrieve multiple
records.
Records will be on consecutive blocks
Let b = number of blocks containing matching records
Cost = hi * (tT + tS) + tS + tT * b
A5 (equality on search-key of secondary index).
Retrieve a single record if the search-key is a candidate key
Cost = (hi + 1) * (tT + tS)
Retrieve multiple records if search-key is not a candidate key
each of n matching records may be on a different block
Cost = (hi + n) * (tT + tS)
Can be very expensive!
51. [CONTD…]
Can implement selections of the form σA≤v(r) or σA≥v(r) by
using
a linear file scan or binary search,
or by using indices in the following ways:
A6 (primary index, comparison). (Relation is sorted on A)
For σA≥v(r) use index to find first tuple ≥ v and scan relation
sequentially from there
For σA≤v(r) just scan relation sequentially till first tuple > v; do not use
index
A7 (secondary index, comparison).
For σA≥v(r) use index to find first index entry ≥ v and scan index
sequentially from there, to find pointers to records.
For σA≤v(r) just scan leaf pages of index finding pointers to records, till
first entry > v
In either case, retrieve records that are pointed to
requires an I/O for each record
Linear file scan may be cheaper
52. [CONTD…]
Conjunction: σθ1 ∧ θ2 ∧ ... ∧ θn(r)
A8 (conjunctive selection using one index).
Select a combination of θi and algorithms A1 through A7 that results
in the least cost for σθi(r).
Test other conditions on tuple after fetching it into memory buffer.
A9 (conjunctive selection using multiple-key index).
Use appropriate composite (multiple-key) index if available.
A10 (conjunctive selection by intersection of identifiers).
Requires indices with record pointers.
Use corresponding index for each condition, and take intersection of
all the obtained sets of record pointers.
Then fetch records from file
If some conditions do not have appropriate indices, apply test in
memory.
53. [CONTD…]
Disjunction: σθ1 ∨ θ2 ∨ ... ∨ θn(r).
A11 (disjunctive selection by union of identifiers).
Applicable if all conditions have available indices.
Otherwise use linear scan.
Use corresponding index for each condition, and take union
of all the obtained sets of record pointers.
Then fetch records from file
Negation: σ¬θ(r)
Use linear scan on file
If very few records satisfy ¬θ, and an index is applicable to ¬θ
Find satisfying records using index and fetch from file
54. JOIN OPERATIONS
Several different algorithms to implement joins
Nested-loop join
Block nested-loop join
Indexed nested-loop join
Merge-join
Hash-join
Choice based on cost estimate
Examples use the following information
Number of records of customer: 10,000 depositor: 5000
Number of blocks of customer: 400 depositor: 100
55. NESTED – LOOP JOIN
To compute the theta join r ⋈θ s
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr,ts) to see if they satisfy the join condition
if they do, add tr • ts to the result.
end
end
r is called the outer relation and s the inner relation of the
join.
Requires no indices and can be used with any kind of join
condition.
Expensive since it examines every pair of tuples in the two
relations.
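The loop structure above can be sketched directly; a minimal version over in-memory relations (relation and attribute names are illustrative):

```python
def nested_loop_join(r, s, theta):
    """Tuple-at-a-time nested-loop join; r is the outer relation."""
    result = []
    for tr in r:                       # outer loop over r
        for ts in s:                   # inner loop over s
            if theta(tr, ts):          # test the join condition
                result.append({**tr, **ts})
    return result

depositor = [{"cust": "A", "acct": 1}, {"cust": "B", "acct": 2}]
customer = [{"cust": "A", "city": "X"}, {"cust": "C", "city": "Y"}]
out = nested_loop_join(depositor, customer,
                       lambda tr, ts: tr["cust"] == ts["cust"])
print(out)    # [{'cust': 'A', 'acct': 1, 'city': 'X'}]
```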
56. [CONTD…]
In the worst case, if there is enough memory only to hold one block of each relation,
the estimated cost is
nr ∗ bs + br
block transfers, plus
nr + br
seeks
If the smaller relation fits entirely in memory, use that as the inner relation.
Reduces cost to br + bs block transfers and 2 seeks
Assuming worst case memory availability cost estimate is
with depositor as outer relation:
5000 ∗ 400 + 100 = 2,000,100 block transfers,
5000 + 100 = 5100 seeks
with customer as the outer relation
10000 ∗ 100 + 400 = 1,000,400 block transfers and 10,400 seeks
If smaller relation (depositor) fits entirely in memory, the cost estimate will be 500
block transfers.
Block nested-loops algorithm is preferable.
57. BLOCK NESTED - LOOP JOIN
Variant of nested-loop join in which every block of inner
relation is paired with every block of outer relation.
for each block Br of r do begin
for each block Bs of s do begin
for each tuple tr in Br do begin
for each tuple ts in Bs do begin
Check if (tr,ts) satisfy the join condition
if they do, add tr • ts to the result.
end
end
end
end
58. [CONTD…]
Worst case estimate: br ∗ bs + br block transfers + 2 ∗ br seeks
Each block in the inner relation s is read once for each block in the outer
relation (instead of once for each tuple in the outer relation)
Best case: br + bs block transfers + 2 seeks.
Improvements to nested loop and block nested loop algorithms:
In block nested-loop, use M − 2 disk blocks as blocking unit for the outer
relation, where M = memory size in blocks; use remaining two blocks
to buffer inner relation and output
Cost = ⌈br / (M − 2)⌉ ∗ bs + br block transfers + 2 ∗ ⌈br / (M − 2)⌉ seeks
If equi-join attribute forms a key on inner relation, stop inner loop on
first match
Scan inner loop forward and backward alternately, to make use of the
blocks remaining in buffer (with LRU replacement)
Use index on inner relation if available
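The worst-case cost formula above can be evaluated as a small function; with M = 3 it reduces to the basic algorithm and reproduces the figures in the example on the next slide:

```python
import math

def block_nlj_cost(br, bs, M):
    """Worst-case block transfers and seeks for block nested-loop join,
    buffering the outer relation M - 2 blocks at a time."""
    outer_chunks = math.ceil(br / (M - 2))
    transfers = outer_chunks * bs + br   # inner relation read once per chunk
    seeks = 2 * outer_chunks             # one seek per relation per chunk
    return transfers, seeks

# depositor (100 blocks) as outer, customer (400 blocks) as inner:
print(block_nlj_cost(100, 400, M=3))     # (40100, 200)
print(block_nlj_cost(100, 400, M=22))    # (2100, 10) -- more memory helps
```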
59. INDEXED NESTED - LOOP JOIN
Index lookups can replace file scans if
join is an equi-join or natural join and
an index is available on the inner relation’s join attribute
Can construct an index just to compute a join.
For each tuple tr in the outer relation r, use the index to look up
tuples in s that satisfy the join condition with tuple tr.
Worst case: buffer has space for only one page of r, and, for each
tuple in r, we perform an index lookup on s.
Cost of the join: br ∗ (tT + tS) + nr ∗ c
Where c is the cost of traversing index and fetching all matching s
tuples for one tuple of r
c can be estimated as cost of a single selection on s using the join
condition.
If indices are available on join attributes of both r and s,
use the relation with fewer tuples as the outer relation.
60. EXAMPLE
Compute depositor ⋈ customer, with depositor as the outer relation.
Let customer have a primary B+-tree index on the join attribute
customer-name, which contains 20 entries in each index node.
Since customer has 10,000 tuples, the height of the tree is 4, and one
more access is needed to find the actual data
depositor has 5000 tuples
Cost of block nested loops join
400*100 + 100 = 40,100 block transfers + 2 * 100 = 200 seeks
assuming worst case memory
may be significantly less with more memory
Cost of indexed nested loops join
100 + 5000 * 5 = 25,100 block transfers and seeks.
CPU cost likely to be less than that for block nested loops join
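The indexed nested-loops estimate in this example is a one-line computation, counting each (tT + tS) as one cost unit:

```python
# Slide example: depositor (100 blocks, 5000 tuples) as outer relation;
# B+-tree of height 4 on customer-name, plus 1 access for the data => c = 5.
br, nr, c = 100, 5000, 5
cost = br + nr * c      # br * (tT + tS) + nr * c, in (tT + tS) units
print(cost)             # 25100
```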
61. MERGE JOIN
1. Sort both relations on their join
attribute (if not already sorted on
the join attributes).
2. Merge the sorted relations to join
them
1. Join step is similar to the
merge stage of the sort-merge
algorithm.
2. Main difference is handling of
duplicate values in join
attribute — every pair with
same value on join attribute
must be matched
62. [CONTD…]
Can be used only for equi-joins and natural joins
Each block needs to be read only once (assuming all tuples for any
given value of the join attributes fit in memory)
Thus the cost of merge join is:
br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
+ the cost of sorting if relations are unsorted.
hybrid merge-join: If one relation is sorted, and the other has a
secondary B+-tree index on the join attribute
Merge the sorted relation with the leaf entries of the B+-tree .
Sort the result on the addresses of the unsorted relation’s tuples
Scan the unsorted relation in physical address order and merge with
previous result, to replace addresses by the actual tuples
Sequential scan more efficient than random lookup
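The merge step, including the duplicate handling noted above, can be sketched over in-memory relations (attribute names are illustrative; sorting stands in for the external sort phase):

```python
def merge_join(r, s, key):
    """Equi-join by merging; both inputs are sorted on `key` first, and
    duplicate join-attribute values are paired exhaustively."""
    r = sorted(r, key=lambda t: t[key])
    s = sorted(s, key=lambda t: t[key])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][key] < s[j][key]:
            i += 1
        elif r[i][key] > s[j][key]:
            j += 1
        else:
            v = r[i][key]
            j_end = j                         # find the run of v-tuples in s
            while j_end < len(s) and s[j_end][key] == v:
                j_end += 1
            while i < len(r) and r[i][key] == v:
                for jj in range(j, j_end):    # every pair with the same value
                    out.append({**r[i], **s[jj]})
                i += 1
            j = j_end
    return out

r = [{"k": 1, "a": 10}, {"k": 2, "a": 20}, {"k": 2, "a": 30}]
s = [{"k": 2, "b": 1}, {"k": 3, "b": 2}]
print(merge_join(r, s, "k"))   # both k=2 tuples of r match the one in s
```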
63. HASH JOIN
Applicable for equi-joins and natural joins.
A hash function h is used to partition tuples of both relations
Intuition: partitions fit in memory
h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs
denotes the common attributes of r and s used in the natural
join.
r0, r1, . . ., rn denote partitions of r tuples
Each tuple tr ∈ r is put in partition ri where i = h(tr[JoinAttrs]).
s0, s1, . . ., sn denote partitions of s tuples
Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
Note: In book, ri is denoted as Hri, si is denoted as Hsi and
n is denoted as nh.
64. [CONTD…]
r tuples in ri need only to be
compared with s tuples in si
Need not be compared with s
tuples in any other partition,
since:
an r tuple and an s tuple that
satisfy the join condition will
have the same value for the join
attributes.
If that value is hashed to some
value i, the r tuple has to be in
ri and the s tuple in si.
65. [CONTD…]
The hash-join of r and s is computed as follows.
1. Partition the relation s using hashing function h.
1. When partitioning a relation, one block of memory is
reserved as the output buffer for each partition, and one
block for input
2. If extra memory is available, allocate bb blocks as buffer for
input and each output
2.Partition r similarly.
66. [CONTD…]
3. For each partition i:
(a) Load si into memory and build an in-memory hash index on
it using the join attribute.
This hash index uses a different hash function than the earlier one
h.
(b) Read the tuples in ri from the disk one by one.
For each tuple tr probe the in-memory hash index to find all
matching tuples ts in si
For each matching tuple ts in si
output the concatenation of the attributes of tr and ts
Relation s is called the build input and
r is called the probe input.
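Steps 1–3 can be sketched in a few lines over in-memory relations; here the in-memory index is keyed directly on the join value, standing in for the "different hash function" of step 3(a):

```python
from collections import defaultdict

def hash_join(r, s, key, n=4):
    """Hash join sketch: partition both relations with h, then per partition
    build an in-memory index on s_i (build input) and probe it with r_i."""
    h = lambda t: hash(t[key]) % n
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[h(t)].append(t)
    for t in s:
        s_parts[h(t)].append(t)
    out = []
    for i in range(n):
        index = defaultdict(list)          # build phase: hash index on s_i
        for ts in s_parts[i]:
            index[ts[key]].append(ts)
        for tr in r_parts[i]:              # probe phase: one lookup per r_i tuple
            for ts in index.get(tr[key], []):
                out.append({**tr, **ts})
    return out

depositor = [{"cust": "A"}, {"cust": "B"}]
customer = [{"cust": "A", "city": "X"}]
print(hash_join(depositor, customer, "cust"))   # [{'cust': 'A', 'city': 'X'}]
```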
67. [CONTD…]
The value n and the hash function h are chosen such that each si
should fit in memory.
Typically n is chosen as ⌈bs/M⌉ ∗ f, where f is a “fudge factor”,
typically around 1.2
The probe relation partitions ri need not fit in memory
Recursive partitioning required if number of partitions n is greater
than number of pages M of memory.
instead of partitioning n ways, use M – 1 partitions for s
Further partition the M – 1 partitions using a different hash
function
Use same partitioning method on r
Rarely required: e.g., recursive partitioning not needed for
relations of 1GB or less with memory size of 2MB, with block
size of 4KB.
68. HANDLING OVERFLOW
Partitioning is said to be skewed if some partitions have
significantly more tuples than some others
Hash-table overflow occurs in partition si if si does not fit in
memory. Reasons could be
Many tuples in s with same value for join attributes
Bad hash function
Overflow resolution can be done in build phase
Partition si is further partitioned using different hash function.
Partition ri must be similarly partitioned.
Overflow avoidance performs partitioning carefully to avoid
overflows during build phase
E.g. partition build relation into many partitions, then combine them
Both approaches fail with large numbers of duplicates
Fallback option: use block nested loops join on overflowed partitions
69. [CONTD…]
If recursive partitioning is not required: cost of hash join is
3(br + bs) + 4 ∗ nh block transfers +
2(⌈br / bb⌉ + ⌈bs / bb⌉) seeks
If recursive partitioning required:
number of passes required for partitioning build relation
s is ⌈logM−1(bs)⌉ − 1
best to choose the smaller relation as the build relation.
Total cost estimate is:
2(br + bs)(⌈logM−1(bs)⌉ − 1) + br + bs block transfers +
2(⌈br / bb⌉ + ⌈bs / bb⌉)(⌈logM−1(bs)⌉ − 1) seeks
If the entire build input can be kept in main memory no partitioning
is required
Cost estimate goes down to br + bs.
70. EXAMPLE
Assume that memory size is 20 blocks
bdepositor= 100 and bcustomer = 400.
depositor is to be used as build input. Partition it into
five partitions, each of size 20 blocks. This partitioning
can be done in one pass.
Similarly, partition customer into five partitions,each of
size 80. This is also done in one pass.
Therefore total cost, ignoring cost of writing partially
filled blocks:
3(100 + 400) = 1500 block transfers +
2(⌈100/3⌉ + ⌈400/3⌉) = 336 seeks
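The two totals above can be checked directly (bb = 3 buffer blocks, as the seek figure implies):

```python
import math

br, bs, bb = 100, 400, 3
transfers = 3 * (br + bs)     # read + write during partitioning, read to probe
seeks = 2 * (math.ceil(br / bb) + math.ceil(bs / bb))
print(transfers, seeks)       # 1500 336
```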
71. HYBRID HASH JOIN
Useful when memory sizes are relatively large, and the build input is
bigger than memory.
Main feature of hybrid hash join:
Keep the first partition of the build relation in memory.
E.g. With memory size of 25 blocks, depositor can be partitioned
into five partitions, each of size 20 blocks.
Division of memory:
The first partition occupies 20 blocks of memory
1 block is used for input, and 1 block each for buffering the other 4 partitions.
customer is similarly partitioned into five partitions each of size 80
the first is used right away for probing, instead of being written out
Cost of 3(80 + 320) + 20 + 80 = 1300 block transfers for
hybrid hash join, instead of 1500 with plain hash join.
Hybrid hash join is most useful if M >> √bs
73. INTRODUCTION
Alternative ways of evaluating a given query
Equivalent expressions
Different algorithms for each operation
74. [CONTD…]
An evaluation plan defines exactly what algorithm is
used for each operation, and how the execution of the
operations is coordinated.
75. [CONTD…]
Cost difference between evaluation plans for a query can be
enormous
E.g. seconds vs. days in some cases
Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
Estimation of plan cost based on:
Statistical information about relations. Examples:
number of tuples, number of distinct values for an attribute
Statistics estimation for intermediate results
to compute cost of complex expressions
Cost formulae for algorithms, computed using statistics
76. GENERATING EQUIVALENT EXPRESSIONS –
TRANSFORMATION OF RELATIONAL EXPRESSIONS
Two relational algebra expressions are said to be equivalent if the
two expressions generate the same set of tuples on every legal
database instance
Note: order of tuples is irrelevant
In SQL, inputs and outputs are multisets of tuples
Two expressions in the multiset version of the relational algebra
are said to be equivalent if the two expressions generate the same
multiset of tuples on every legal database instance.
An equivalence rule says that expressions of two forms are
equivalent
Can replace expression of first form by second, or vice versa
77. GENERATING EQUIVALENT EXPRESSIONS –
EQUIVALENCE RULE
1. Conjunctive selection operations can be deconstructed into a
sequence of individual selections:
σθ1∧θ2(E) = σθ1(σθ2(E))
2. Selection operations are commutative:
σθ1(σθ2(E)) = σθ2(σθ1(E))
3. Only the last in a sequence of projection operations is
needed; the others can be omitted:
ΠL1(ΠL2(…(ΠLn(E))…)) = ΠL1(E)
4. Selections can be combined with Cartesian products and
theta joins:
a. σθ(E1 × E2) = E1 ⋈θ E2
b. σθ1(E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2
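Rules 1, 2 and 4a can be checked mechanically on toy relations. A minimal Python sketch, modelling relations as lists of tuples and predicates as functions (all names here are illustrative):

```python
from itertools import product

def select(pred, rel):
    # sigma_pred(rel): keep the tuples satisfying the predicate
    return [t for t in rel if pred(t)]

def cartesian(r, s):
    # r x s: concatenate each pair of tuples
    return [a + b for a, b in product(r, s)]

r = [("A", 1), ("B", 2)]
s = [(1, "p"), (2, "q")]
t1 = lambda t: t[1] == 2        # theta1
t2 = lambda t: t[0] == "B"      # theta2

# Rule 1: conjunctive selection deconstructs into a sequence of selections
assert select(lambda t: t1(t) and t2(t), r) == select(t1, select(t2, r))
# Rule 2: selections commute
assert select(t1, select(t2, r)) == select(t2, select(t1, r))
# Rule 4a: sigma_theta(E1 x E2) equals the theta-join of E1 and E2
theta = lambda t: t[1] == t[2]  # join condition: r's 2nd col = s's 1st col
assert select(theta, cartesian(r, s)) == \
       [a + b for a in r for b in s if a[1] == b[0]]
print("all equivalences hold")
```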
78. [CONTD…]
5. Theta-join operations (and natural joins) are
commutative:
E1 ⋈θ E2 = E2 ⋈θ E1
6. (a) Natural join operations are associative:
(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
(b) Theta joins are associative in the following manner:
(E1 ⋈θ1 E2) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3)
where θ2 involves attributes from only E2 and E3.
80. [CONTD…]
7. The selection operation distributes over the theta-join
operation under the following two conditions:
(a) When all the attributes in θ0 involve only the attributes
of one of the expressions (E1) being joined:
σθ0(E1 ⋈θ E2) = (σθ0(E1)) ⋈θ E2
(b) When θ1 involves only the attributes of E1 and θ2
involves only the attributes of E2:
σθ1∧θ2(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2))
81. [CONTD…]
8. The projection operation distributes over the theta-join
operation as follows:
(a) if θ involves only attributes from L1 ∪ L2:
ΠL1∪L2(E1 ⋈θ E2) = (ΠL1(E1)) ⋈θ (ΠL2(E2))
(b) Consider a join E1 ⋈θ E2.
Let L1 and L2 be sets of attributes from E1 and E2,
respectively.
Let L3 be attributes of E1 that are involved in join condition θ,
but are not in L1 ∪ L2, and
let L4 be attributes of E2 that are involved in join condition θ,
but are not in L1 ∪ L2. Then:
ΠL1∪L2(E1 ⋈θ E2) = ΠL1∪L2((ΠL1∪L3(E1)) ⋈θ (ΠL2∪L4(E2)))
82. [CONTD…]
9. The set operations union and intersection are commutative:
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
(set difference is not commutative).
10. Set union and intersection are associative:
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
11. The selection operation distributes over ∪, ∩ and –:
σθ(E1 – E2) = σθ(E1) – σθ(E2)
and similarly for ∪ and ∩ in place of –
Also: σθ(E1 – E2) = σθ(E1) – E2
and similarly for ∩ in place of –, but not for ∪
12. The projection operation distributes over union:
ΠL(E1 ∪ E2) = (ΠL(E1)) ∪ (ΠL(E2))
83. EXAMPLE
Query: Find the names of all customers with an account at a
Brooklyn branch whose account balance is over $1000.
Πcustomer_name(σbranch_city = “Brooklyn” ∧ balance > 1000
(branch ⋈ (account ⋈ depositor)))
Transformation using join associativity (Rule 6a):
Πcustomer_name(σbranch_city = “Brooklyn” ∧ balance > 1000
((branch ⋈ account) ⋈ depositor))
Second form provides an opportunity to apply the “perform
selections early” rule, resulting in the subexpression
σbranch_city = “Brooklyn”(branch) ⋈ σbalance > 1000(account)
Thus a sequence of transformations can be useful
85. TRANSFORMATION EXAMPLE: PUSHING PROJECTIONS
When we compute
(σbranch_city = “Brooklyn”(branch)) ⋈ account
we obtain a relation whose schema is:
(branch_name, branch_city, assets, account_number, balance)
Push projections using equivalence rules 8a and 8b; eliminate
unneeded attributes from intermediate results to rewrite
Πcustomer_name((σbranch_city = “Brooklyn”(branch) ⋈ account) ⋈ depositor)
as
Πcustomer_name((Πaccount_number(σbranch_city = “Brooklyn”(branch) ⋈ account)) ⋈ depositor)
Performing the projection as early as possible reduces the size of the
relation to be joined.
86. JOIN ORDERING EXAMPLE
For all relations r1, r2, and r3,
(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
(Join Associativity)
If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose
(r1 ⋈ r2) ⋈ r3
so that we compute and store a smaller temporary relation.
87. JOIN ORDERING EXAMPLE (CONT.)
Consider the expression
Πcustomer_name((σbranch_city = “Brooklyn”(branch)) ⋈ (account ⋈ depositor))
Could compute account ⋈ depositor first, and join the result with
σbranch_city = “Brooklyn”(branch)
but account ⋈ depositor is likely to be a large relation.
Only a small fraction of the bank’s customers are likely to
have accounts in branches located in Brooklyn
it is better to compute
σbranch_city = “Brooklyn”(branch) ⋈ account
first.
88. GENERATING EQUIVALENT EXPRESSIONS –
ENUMERATION OF EQUIVALENT EXPRESSIONS
Query optimizers use equivalence rules to systematically
generate expressions equivalent to the given expression
Can generate all equivalent expressions as follows:
Repeat
apply all applicable equivalence rules on every equivalent expression
found so far
add newly generated expressions to the set of equivalent expressions
Until no new equivalent expressions are generated
The above approach is very expensive in space and time
Two approaches
Optimized plan generation based on transformation rules
Special case approach for queries with only selections, projections and
joins
89. GENERATING EQUIVALENT EXPRESSIONS – IMPLEMENTING
TRANSFORMATION BASED OPTIMIZATION
Space requirements reduced by sharing common sub-expressions:
when E1 is generated from E2 by an equivalence rule, usually only the top
level of the two are different, subtrees below are the same and can be shared
using pointers
E.g. when applying join commutativity
Same sub-expression may get generated multiple times
Detect duplicate sub-expressions and share one copy
Time requirements are reduced by not generating all expressions
Dynamic programming
We will study only the special case of dynamic programming for join order
optimization
90. COST ESTIMATION
Cost of each operator computed as described in Chapter 13
Need statistics of input relations
E.g. number of tuples, sizes of tuples
Inputs can be results of sub-expressions
Need to estimate statistics of expression results
To do so, we require additional statistics
E.g. number of distinct values for an attribute
More on cost estimation later
91. CHOICE OF EVALUATION PLANS
Must consider the interaction of evaluation techniques when
choosing evaluation plans
choosing the cheapest algorithm for each operation independently
may not yield best overall algorithm. E.g.
merge-join may be costlier than hash-join, but may provide a sorted
output which reduces the cost for an outer level aggregation.
nested-loop join may provide opportunity for pipelining
Practical query optimizers incorporate elements of the
following two broad approaches:
1. Search all the plans and choose the best plan in a
cost-based fashion.
2. Use heuristics to choose a plan.
92. COST-BASED OPTIMIZATION
Consider finding the best join-order for r1 ⋈ r2 ⋈ . . . ⋈ rn.
There are (2(n – 1))!/(n – 1)! different join orders for the above
expression. With n = 7, the number is 665,280; with n = 10,
the number is about 17.6 billion!
No need to generate all the join orders. Using dynamic
programming, the least-cost join order for any subset of
{r1, r2, . . . rn} is computed only once and stored for future
use.
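The formula can be evaluated directly (a quick sketch using only the standard library):

```python
from math import factorial

def num_join_orders(n):
    # (2(n-1))! / (n-1)! complete join orders for n relations:
    # n! leaf orderings times the Catalan number of binary tree shapes
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(num_join_orders(7))   # 665280
print(num_join_orders(10))  # 17643225600
```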
93. DYNAMIC PROGRAMMING IN OPTIMIZATION
To find best join tree for a set of n relations:
To find best plan for a set S of n relations, consider all possible
plans of the form: S1 ⋈ (S – S1) where S1 is any non-empty
subset of S.
Recursively compute costs for joining subsets of S to find the
cost of each plan. Choose the cheapest of the 2^n – 1
alternatives.
Base case for recursion: single relation access plan
Apply all selections on Ri using best choice of indices on Ri
When plan for any subset is computed, store it and reuse it
when it is required again, instead of recomputing it
Dynamic programming
94. JOIN ORDER OPTIMIZATION ALGORITHM
procedure findbestplan(S)
if (bestplan[S].cost ≠ ∞)
return bestplan[S]
// else bestplan[S] has not been computed earlier, compute it now
if (S contains only 1 relation)
set bestplan[S].plan and bestplan[S].cost based on the best way
of accessing S /* using selections on S and indices on S */
else for each non-empty subset S1 of S such that S1 ≠ S
P1 = findbestplan(S1)
P2 = findbestplan(S – S1)
A = best algorithm for joining results of P1 and P2
cost = P1.cost + P2.cost + cost of A
if cost < bestplan[S].cost
bestplan[S].cost = cost
bestplan[S].plan = “execute P1.plan; execute P2.plan;
join results of P1 and P2 using A”
return bestplan[S]
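A runnable version of the procedure, under a deliberately simplified cost model (joining two plans is assumed to cost, and produce, the product of their sizes; the relation statistics are invented — a real optimizer substitutes catalog-based estimates):

```python
from itertools import combinations

def find_best_plan(S, sizes, best=None):
    """Dynamic-programming join ordering over a set S of relation names.
    Returns (plan tree, cost, result size)."""
    if best is None:
        best = {}                      # memo table: bestplan[S]
    S = frozenset(S)
    if S in best:                      # computed earlier: reuse it
        return best[S]
    if len(S) == 1:
        r = next(iter(S))
        plan = (r, 0, sizes[r])        # base case: single relation access
    else:
        plan = None
        for k in range(1, len(S)):     # every non-empty proper subset S1
            for S1 in map(frozenset, combinations(S, k)):
                p1 = find_best_plan(S1, sizes, best)
                p2 = find_best_plan(S - S1, sizes, best)
                cost = p1[1] + p2[1] + p1[2] * p2[2]
                if plan is None or cost < plan[1]:
                    plan = ((p1[0], p2[0]), cost, p1[2] * p2[2])
    best[S] = plan
    return plan

tree, cost, size = find_best_plan({"r1", "r2", "r3"},
                                  {"r1": 10, "r2": 1000, "r3": 20})
print(cost)  # 200200 — join the two small relations first
```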
95. LEFT DEEP JOIN TREES
In left-deep join trees, the right-hand-side input for
each join is a relation, not the result of an
intermediate join.
96. COST OF OPTIMIZATION
With dynamic programming, the time complexity of optimization with bushy
trees is O(3^n).
With n = 10, this number is about 59,000 instead of about 17.6 billion!
Space complexity is O(2^n)
To find best left-deep join tree for a set of n relations:
Consider n alternatives with one relation as right-hand-side input and the
other relations as left-hand-side input.
Modify optimization algorithm:
Replace “for each non-empty subset S1 of S such that S1 ≠ S”
By: for each relation r in S
let S1 = S – r.
If only left-deep trees are considered, time complexity of finding best
join order is O(n 2^n)
Space complexity remains at O(2^n)
Cost-based optimization is expensive, but worthwhile for queries on
large datasets (typical queries have small n, generally < 10)
97. INTERESTING SORT ORDERS
Consider the expression (r1 ⋈ r2) ⋈ r3 (with A as the common attribute)
An interesting sort order is a particular sort order of tuples that could
be useful for a later operation
Using merge-join to compute r1 ⋈ r2 may be costlier than hash join but
generates a result sorted on A
This in turn may make merge-join with r3 cheaper, reducing the
overall cost of the query
Sort order may also be useful for order by and for grouping
Not sufficient to find the best join order for each subset of the set of n
given relations
must find the best join order for each subset, for each interesting sort order
Simple extension of earlier dynamic programming algorithms
Usually, number of interesting orders is quite small and doesn’t affect
time/space complexity significantly
98. HEURISTIC OPTIMIZATION
Cost-based optimization is expensive, even with dynamic programming.
Systems may use heuristics to reduce the number of choices that must
be made in a cost-based fashion.
Heuristic optimization transforms the query-tree by using a set of rules
that typically (but not in all cases) improve execution performance:
Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e. with smallest
result size) before other similar operations.
Some systems use only heuristics, others combine heuristics with partial
cost-based optimization.
99. STRUCTURE OF QUERY OPTIMIZERS
Many optimizers consider only left-deep join orders.
Plus heuristics to push selections and projections down the query
tree
Reduces optimization complexity and generates plans amenable
to pipelined evaluation.
Heuristic optimization used in some versions of Oracle:
Repeatedly pick “best” relation to join next
Starting from each of n starting points. Pick best among these
Intricacies of SQL complicate query optimization
E.g. nested subqueries
100. [CONTD…]
Some query optimizers integrate heuristic selection and the generation
of alternative access plans.
Frequently used approach
heuristic rewriting of nested block structure and aggregation
followed by cost-based join-order optimization for each block
Some optimizers (e.g. SQL Server) apply transformations to entire
query and do not depend on block structure
Even with the use of heuristics, cost-based query optimization imposes
a substantial overhead.
But is worth it for expensive queries
Optimizers often use simple heuristics for very cheap queries, and
perform exhaustive enumeration for more expensive queries
101. STATISTICS FOR COST ESTIMATION - STATISTICAL
INFORMATION FOR COST ESTIMATION
nr: number of tuples in a relation r.
br: number of blocks containing tuples of r.
lr: size of a tuple of r.
fr: blocking factor of r — i.e., the number of tuples of r that fit into
one block.
V(A, r): number of distinct values that appear in r for attribute A;
same as the size of ΠA(r).
If tuples of r are stored together physically in a file, then:
br = ⌈nr / fr⌉
102. STATISTICS FOR COST ESTIMATION -
HISTOGRAMS
Histogram on attribute age of relation person
Equi-width histograms
Equi-depth histograms
103. STATISTICS FOR COST ESTIMATION - SELECTION SIZE
ESTIMATION
σA=v(r)
nr / V(A,r): number of records that will satisfy the selection
Equality condition on a key attribute: size estimate = 1
σA≤v(r) (case of σA≥v(r) is symmetric)
Let c denote the estimated number of tuples satisfying the
condition.
If min(A,r) and max(A,r) are available in catalog
c = 0 if v < min(A,r)
c = nr · (v – min(A,r)) / (max(A,r) – min(A,r)) otherwise
If histograms available, can refine above estimate
In absence of statistical information c is assumed to be nr / 2.
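The interpolation estimate for σA≤v can be written as a small helper (a sketch; the catalog values passed in are invented):

```python
def estimate_le(v, n_r, a_min, a_max):
    """Estimated number of tuples satisfying A <= v, by linear
    interpolation between min(A,r) and max(A,r)."""
    if v < a_min:
        return 0
    if v >= a_max:
        return n_r
    return n_r * (v - a_min) / (a_max - a_min)

# 10,000 tuples, A assumed uniform in [0, 100]
print(estimate_le(25, 10_000, 0, 100))  # 2500.0
print(estimate_le(-5, 10_000, 0, 100))  # 0
```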
104. STATISTICS FOR COST ESTIMATION - SIZE ESTIMATION
OF COMPLEX SELECTIONS
The selectivity of a condition θi is the probability that a tuple in
the relation r satisfies θi.
If si is the number of satisfying tuples in r, the selectivity of θi is
given by si / nr.
Conjunction: σθ1∧θ2∧…∧θn(r). Assuming independence, estimate of
tuples in the result is:
nr · (s1 · s2 · … · sn) / nr^n
Disjunction: σθ1∨θ2∨…∨θn(r). Estimated number of tuples:
nr · (1 – (1 – s1/nr) · (1 – s2/nr) · … · (1 – sn/nr))
Negation: σ¬θ(r). Estimated number of tuples:
nr – size(σθ(r))
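Under the independence assumption, both estimates are one-liners (a sketch with made-up match counts si):

```python
def conjunction(n_r, counts):
    """sigma_theta1 AND ... AND theta_n: nr * (s1/nr) * ... * (sn/nr)."""
    est = n_r
    for s in counts:
        est *= s / n_r
    return est

def disjunction(n_r, counts):
    """sigma_theta1 OR ... OR theta_n: nr * (1 - prod(1 - si/nr))."""
    miss = 1.0
    for s in counts:
        miss *= 1 - s / n_r
    return n_r * (1 - miss)

# nr = 1,000; theta1 matches 500 tuples, theta2 matches 250
print(conjunction(1_000, [500, 250]))  # 125.0
print(disjunction(1_000, [500, 250]))  # 625.0
```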
105. STATISTICS FOR COST ESTIMATION - JOIN
OPERATION: RUNNING EXAMPLE
Running example:
depositor ⋈ customer
Catalog information for join examples:
ncustomer = 10,000.
fcustomer = 25, which implies that
bcustomer =10000/25 = 400.
ndepositor = 5000.
fdepositor = 50, which implies that
bdepositor = 5000/50 = 100.
V(customer_name, depositor) = 2500, which implies that , on
average, each customer has two accounts.
Also assume that customer_name in depositor is a foreign key on
customer.
V(customer_name, customer) = 10000 (primary key!)
106. STATISTICS FOR COST ESTIMATION - ESTIMATION
OF THE SIZE OF JOINS
The Cartesian product r × s contains nr · ns tuples; each
tuple occupies sr + ss bytes.
If R ∩ S = ∅, then r ⋈ s is the same as r × s.
If R ∩ S is a key for R, then a tuple of s will join with at
most one tuple from r
therefore, the number of tuples in r ⋈ s is no greater than the
number of tuples in s.
If R ∩ S is a foreign key in S referencing R, then the
number of tuples in r ⋈ s is exactly the same as the
number of tuples in s.
The case for R ∩ S being a foreign key referencing S is symmetric.
In the example query depositor ⋈ customer,
customer_name in depositor is a foreign key referencing customer
hence, the result has exactly ndepositor tuples, which is 5000
107. STATISTICS FOR COST ESTIMATION - ESTIMATION OF
THE SIZE OF JOINS (CONT.)
If R ∩ S = {A} is not a key for R or S:
If we assume that every tuple t in R produces tuples in R ⋈ S, the
number of tuples in R ⋈ S is estimated to be:
nr · ns / V(A,s)
If the reverse is true, the estimate obtained will be:
nr · ns / V(A,r)
The lower of these two estimates is probably the more accurate one.
Can improve on above if histograms are available
Use formula similar to above, for each cell of histograms on the
two relations
108. [CONTD…]
Compute the size estimates for depositor ⋈ customer
without using information about foreign keys:
V(customer_name, depositor) = 2500, and
V(customer_name, customer) = 10000
The two estimates are 5000 * 10000/2500 = 20,000 and 5000
* 10000/10000 = 5000
We choose the lower estimate, which in this case, is the same
as our earlier computation using foreign keys.
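A quick check of the two estimates (the helper is our own; integer division is used since sizes are tuple counts):

```python
def join_size_estimates(n_r, n_s, v_a_r, v_a_s):
    """Size estimates for r ⋈ s on common attribute A when A is a key
    for neither relation: nr*ns/V(A,s) and nr*ns/V(A,r).
    The lower estimate is usually the more accurate one."""
    return n_r * n_s // v_a_s, n_r * n_s // v_a_r

est = join_size_estimates(5000, 10000, 2500, 10000)
print(est, min(est))  # (5000, 20000) 5000
```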
109. STATISTICS FOR COST ESTIMATION - SIZE
ESTIMATION FOR OTHER OPERATIONS
Projection: estimated size of ΠA(r) = V(A,r)
Aggregation: estimated size of AγF(r) = V(A,r)
Set operations
For unions/intersections of selections on the same relation:
rewrite and use size estimate for selections
E.g. σθ1(r) ∪ σθ2(r) can be rewritten as σθ1∨θ2(r)
For operations on different relations:
estimated size of r ∪ s = size of r + size of s.
estimated size of r ∩ s = minimum of size of r and size of s.
estimated size of r – s = size of r.
All three estimates may be quite inaccurate, but provide upper
bounds on the sizes.
110. [CONTD…]
Outer join:
Estimated size of r ⟕ s = size of r ⋈ s + size of r
Case of right outer join is symmetric
Estimated size of r ⟗ s = size of r ⋈ s + size of r + size of s
111. STATISTICS FOR COST ESTIMATION - ESTIMATION OF
NUMBER OF DISTINCT VALUES
Selections: σθ(r)
If θ forces A to take a specified value: V(A, σθ(r)) = 1.
e.g., A = 3
If θ forces A to take on one of a specified set of values:
V(A, σθ(r)) = number of specified values.
(e.g., (A = 1 ∨ A = 3 ∨ A = 4))
If the selection condition θ is of the form A op v
estimated V(A, σθ(r)) = V(A,r) * s
where s is the selectivity of the selection.
In all other cases: use approximate estimate of
min(V(A,r), nσθ(r))
More accurate estimates can be obtained using probability theory,
but this one works fine generally
112. [CONTD…]
Joins: r ⋈ s
If all attributes in A are from r
estimated V(A, r ⋈ s) = min(V(A,r), nr⋈s)
If A contains attributes A1 from r and A2 from s, then
estimated V(A, r ⋈ s) =
min(V(A1,r) · V(A2 – A1,s), V(A1 – A2,r) · V(A2,s), nr⋈s)
More accurate estimates can be obtained using probability theory,
but this one works fine generally
113. [CONTD…]
Estimation of distinct values is straightforward for
projections.
They are the same in ΠA(r) as in r.
The same holds for grouping attributes of aggregation.
For aggregated values
For min(A) and max(A), the number of distinct values can be
estimated as min(V(A,r), V(G,r)) where G denotes grouping
attributes
For other aggregates, assume all values are distinct, and use
V(G,r)
114. OPTIMIZING NESTED SUBQUERIES
Nested query example:
select customer_name
from borrower
where exists (select *
from depositor
where depositor.customer_name =
borrower.customer_name)
SQL conceptually treats nested subqueries in the where clause as functions
that take parameters and return a single value or set of values
Parameters are variables from outer level query that are used in the nested
subquery; such variables are called correlation variables
Conceptually, nested subquery is executed once for each tuple in the
cross-product generated by the outer level from clause
Such evaluation is called correlated evaluation
Note: other conditions in where clause may be used to compute a join (instead
of a cross-product) before executing the nested subquery
115. [CONTD…]
Correlated evaluation may be quite inefficient since
a large number of calls may be made to the nested query
there may be unnecessary random I/O as a result
SQL optimizers attempt to transform nested subqueries to joins where
possible, enabling use of efficient join techniques
E.g.: earlier nested query can be rewritten as
select customer_name
from borrower, depositor
where depositor.customer_name = borrower.customer_name
Note: the two queries generate different numbers of duplicates (why?)
Borrower can have duplicate customer-names
Can be modified to handle duplicates correctly as we will see
In general, it is not possible/straightforward to move the entire nested
subquery from clause into the outer level query from clause
A temporary relation is created instead, and used in body of outer level query
116. [CONTD…]
In general, SQL queries of the form below can be rewritten as shown
Rewrite: select …
from L1
where P1 and exists (select *
from L2
where P2)
To: create table t1 as
select distinct V
from L2
where P2¹
select …
from L1, t1
where P1 and P2²
P2¹ contains predicates in P2 that do not involve any correlation
variables
P2² reintroduces predicates involving correlation variables, with
relations renamed appropriately
V contains all attributes used in predicates with correlation variables
117. [CONTD…]
In our example, the original nested query would be transformed to
create table t1 as
select distinct customer_name
from depositor
select customer_name
from borrower, t1
where t1.customer_name = borrower.customer_name
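The rewrite can be exercised end to end with sqlite3 (the table contents are invented). Building t1 with select distinct is what preserves the duplicate counts of the original exists query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table borrower(customer_name text);
    create table depositor(customer_name text);
    insert into borrower values ('Ann'), ('Ann'), ('Bob'), ('Carl');
    insert into depositor values ('Ann'), ('Ann'), ('Carl');
""")

# original correlated form
nested = con.execute("""
    select customer_name from borrower b
    where exists (select * from depositor d
                  where d.customer_name = b.customer_name)
""").fetchall()

# decorrelated form: temporary relation t1, then a plain join
con.execute("create table t1 as select distinct customer_name from depositor")
decorrelated = con.execute("""
    select b.customer_name from borrower b, t1
    where t1.customer_name = b.customer_name
""").fetchall()

print(sorted(nested) == sorted(decorrelated))  # True
```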
The process of replacing a nested query by a query with a join (possibly
with a temporary relation) is called decorrelation.
Decorrelation is more complicated when
the nested subquery uses aggregation, or
when the result of the nested subquery is used to test for equality, or
when the condition linking the nested subquery to the other
query is not exists,
and so on.
118. MATERIALIZED VIEWS
A materialized view is a view whose contents are computed
and stored.
Consider the view
create view branch_total_loan(branch_name, total_loan) as
select branch_name, sum(amount)
from loan
group by branch_name
Materializing the above view would be very useful if the total
loan amount is required frequently
Saves the effort of finding multiple tuples and adding up
their amounts
119. MATERIALIZED VIEW MAINTENANCE
The task of keeping a materialized view up-to-date with the
underlying data is known as materialized view maintenance
Materialized views can be maintained by recomputation on every
update
A better option is to use incremental view maintenance
Changes to database relations are used to compute changes to the
materialized view, which is then updated
View maintenance can be done by
Manually defining triggers on insert, delete, and update of each relation
in the view definition
Manually written code to update the view whenever database relations
are updated
Periodic recomputation (e.g. nightly)
Above methods are directly supported by many database systems
Avoids manual effort/correctness issues
120. INCREMENTAL VIEW MAINTENANCE
The changes (inserts and deletes) to a relation or expressions are
referred to as its differential
Set of tuples inserted to and deleted from r are denoted ir and dr
To simplify our description, we only consider inserts and deletes
We replace updates to a tuple by deletion of the tuple followed
by insertion of the updated tuple
We describe how to compute the change to the result of each
relational operation, given changes to its inputs
We then outline how to handle relational algebra expressions
121. JOIN OPERATION
Consider the materialized view v = r ⋈ s and an update to r
Let rold and rnew denote the old and new states of relation r
Consider the case of an insert to r:
We can write rnew ⋈ s as (rold ∪ ir) ⋈ s
And rewrite the above to (rold ⋈ s) ∪ (ir ⋈ s)
But (rold ⋈ s) is simply the old value of the materialized view, so
the incremental change to the view is just ir ⋈ s
Thus, for inserts vnew = vold ∪ (ir ⋈ s)
Similarly for deletes vnew = vold – (dr ⋈ s)
E.g., if r = {(A,1), (B,2)} and s = {(1,p), (2,r), (2,s)}, then
v = r ⋈ s = {(A,1,p), (B,2,r), (B,2,s)}; inserting (C,2) into r adds
(C,2,r) and (C,2,s) to the view.
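The insert and delete rules can be sketched with relations as Python sets and the equijoin from the example (set semantics, so subtraction is safe; bag semantics would need counts):

```python
def join(r, s):
    # equijoin on r's 2nd column = s's 1st column
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

r_old = {("A", 1), ("B", 2)}
s = {(1, "p"), (2, "r"), (2, "s")}
v = join(r_old, s)            # materialized view v = r ⋈ s

i_r = {("C", 2)}              # tuples inserted into r
v = v | join(i_r, s)          # v_new = v_old ∪ (i_r ⋈ s)
print(sorted(v))

d_r = {("B", 2)}              # tuples deleted from r
v = v - join(d_r, s)          # v_new = v_old − (d_r ⋈ s)
print(sorted(v))
```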
122. SELECTION AND PROJECTION OPERATIONS
Selection: Consider a view v = σθ(r).
vnew = vold ∪ σθ(ir)
vnew = vold – σθ(dr)
Projection is a more difficult operation
R = (A,B), and r(R) = {(a,2), (a,3)}
ΠA(r) has a single tuple (a).
If we delete the tuple (a,2) from r, we should not delete the tuple (a)
from ΠA(r), but if we then delete (a,3) as well, we should delete the
tuple
For each tuple in a projection ΠA(r), we will keep a count of how many
times it was derived
On insert of a tuple to r, if the resultant tuple is already in ΠA(r) we
increment its count, else we add a new tuple with count = 1
On delete of a tuple from r, we decrement the count of the
corresponding tuple in ΠA(r)
if the count becomes 0, we delete the tuple from ΠA(r)
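The count-based scheme can be sketched as a map from projected value to derivation count (an illustration, not any system's API):

```python
from collections import Counter

class CountedProjection:
    """Maintain Pi_A(r) incrementally: each projected tuple carries the
    number of r-tuples it was derived from."""
    def __init__(self, attr_index):
        self.attr_index = attr_index
        self.counts = Counter()

    def insert(self, tup):
        self.counts[tup[self.attr_index]] += 1

    def delete(self, tup):
        key = tup[self.attr_index]
        self.counts[key] -= 1
        if self.counts[key] == 0:   # last derivation gone: drop the tuple
            del self.counts[key]

    def result(self):
        return set(self.counts)

p = CountedProjection(0)            # project onto attribute A
for t in [("a", 2), ("a", 3)]:
    p.insert(t)
p.delete(("a", 2))
print(p.result())                   # {'a'} — (a) survives, count is 1
p.delete(("a", 3))
print(p.result())                   # set() — now (a) is removed
```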
123. AGGREGATION OPERATIONS
count: v = Aγcount(B)(r).
When a set of tuples ir is inserted
For each tuple t in ir, if the corresponding group is already present in v, we
increment its count, else we add a new tuple with count = 1
When a set of tuples dr is deleted
for each tuple t in dr, we look for the group t.A in v, and subtract 1 from the count
for the group.
If the count becomes 0, we delete from v the tuple for the group t.A
sum: v = Aγsum(B)(r)
We maintain the sum in a manner similar to count, except we add/subtract the B
value instead of adding/subtracting 1 for the count
Additionally we maintain the count in order to detect groups with no tuples. Such
groups are deleted from v
Cannot simply test for sum = 0 (why?)
To handle the case of avg, we maintain the sum and count
aggregate values separately, and divide at the end
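A sketch of maintaining Aγsum(B)(r) with a (sum, count) pair per group; the example shows why testing sum = 0 would wrongly drop a non-empty group:

```python
class SumView:
    """Incrementally maintain A gamma sum(B) (r): per group keep
    [sum, count]; a group is dropped when its count reaches 0,
    not when its sum does."""
    def __init__(self):
        self.groups = {}                # group key -> [sum, count]

    def insert(self, a, b):
        g = self.groups.setdefault(a, [0, 0])
        g[0] += b
        g[1] += 1

    def delete(self, a, b):
        g = self.groups[a]
        g[0] -= b
        g[1] -= 1
        if g[1] == 0:                   # no tuples left in the group
            del self.groups[a]

    def sums(self):
        return {a: s for a, (s, _) in self.groups.items()}

v = SumView()
v.insert("g1", 5)
v.insert("g1", -5)                      # sum is 0 but the group is non-empty
print(v.sums())                         # {'g1': 0}
v.delete("g1", 5)
print(v.sums())                         # {'g1': -5}
v.delete("g1", -5)
print(v.sums())                         # {}
```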
124. [CONTD…]
min, max: v = Aγmin(B)(r).
Handling insertions on r is straightforward.
Maintaining the aggregate values min and max on deletions
may be more expensive. We have to look at the other tuples
of r that are in the same group to find the new minimum
125. OTHER OPERATIONS
Set intersection: v = r ∩ s
when a tuple is inserted in r we check if it is present in s, and
if so we add it to v.
If the tuple is deleted from r, we delete it from the
intersection if it is present.
Updates to s are symmetric
The other set operations, union and set difference are handled
in a similar fashion.
Outer joins are handled in much the same way as joins
but with some extra work
we leave details to you.
126. HANDLING EXPRESSIONS
To handle an entire expression, we derive expressions for
computing the incremental change to the result of each
sub-expression, starting from the smallest sub-expressions.
E.g. consider E1 ⋈ E2 where each of E1 and E2 may be a
complex expression
Suppose the set of tuples to be inserted into E1 is given by D1
Computed earlier, since smaller sub-expressions are handled first
Then the set of tuples to be inserted into E1 ⋈ E2 is given by
D1 ⋈ E2
This is just the usual way of maintaining joins
127. QUERY OPTIMIZATION AND MATERIALIZED VIEWS
Rewriting queries to use materialized views:
A materialized view v = r ⋈ s is available
A user submits a query r ⋈ s ⋈ t
We can rewrite the query as v ⋈ t
Whether to do so depends on cost estimates for the two alternatives
Replacing a use of a materialized view by the view definition:
A materialized view v = r ⋈ s is available, but without any index
on it
User submits a query σA=10(v).
Suppose also that s has an index on the common attribute B, and r
has an index on attribute A.
The best plan for this query may be to replace v by r ⋈ s, which can
lead to the query plan σA=10(r) ⋈ s
Query optimizer should be extended to consider all the above
alternatives and choose the best overall plan
128. MATERIALIZED VIEW SELECTION
Materialized view selection: “What is the best set of views to
materialize?”.
Index selection: “What is the best set of indices to create?”
closely related to materialized view selection
but simpler
Materialized view selection and index selection based on
typical system workload (queries and updates)
Typical goal: minimize time to execute workload , subject to
constraints on space and time taken for some critical queries/updates
One of the steps in database tuning
more on tuning in later chapters
Commercial database systems provide tools (called “tuning
assistants” or “wizards”) to help the database administrator
choose what indices and materialized views to create
132. OUTLINE
Distributed Database Systems – Introduction
Distributed Data Storage
Distributed Transaction
Commit Protocol
133. I. DISTRIBUTED DATABASE SYSTEM
A distributed database system consists of loosely coupled sites
that share no physical component
Database systems that run on each site are independent of each
other
Transactions may access data at one or more sites
134. TYPES OF DISTRIBUTED DATABASES
In a homogeneous distributed database
All sites have identical software
Are aware of each other and agree to cooperate in processing
user requests.
Each site surrenders part of its autonomy in terms of right to
change schemas or software
Appears to user as a single system
In a heterogeneous distributed database
Different sites may use different schemas and software
Difference in schema is a major problem for query processing
Difference in software is a major problem for transaction processing
Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
135. II. DISTRIBUTED DATA STORAGE
There are two approaches to store the relation in the
distributed database:
Replication: The system maintains several identical replicas
(copies) of the relation, and stores each replica at a different
site. The alternative to replication is to store only one copy of
relation r.
Fragmentation: The system partitions the relation into
several fragments, and stores each fragment at a different site.
136. 1. DATA REPLICATION
A relation or fragment of a relation is replicated if it is
stored redundantly in two or more sites.
Full replication of a relation is the case where the relation
is stored at all sites.
Fully redundant databases are those in which every site
contains a copy of the entire database.
137. [CONTD…]
Advantages of Replication
Availability: failure of a site containing relation r does not result in
unavailability of r if replicas exist.
Parallelism: queries on r may be processed by several nodes in parallel.
Reduced data transfer: relation r is available locally at each site
containing a replica of r.
Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to
distinct replicas may lead to inconsistent data unless special concurrency
control mechanisms are implemented.
One solution: choose one copy as primary copy and apply concurrency control
operations on primary copy
138. 2. DATA FRAGMENTATION
Division of relation r into fragments r1, r2, …, rn which contain
sufficient information to reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or more
fragments
Vertical fragmentation: the schema for relation r is split into several
smaller schemas
All schemas must contain a common candidate key (or superkey) to
ensure lossless join property.
A special attribute, the tuple-id attribute, may be added to each
schema to serve as a candidate key.
Example: relation account with the following schema:
Account = (account_number, branch_name, balance)
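The two fragmentation schemes for the Account relation can be sketched as follows. This is a minimal, hypothetical illustration using plain Python dicts as tuples; the sample account data and helper names are invented, not from the slides.

```python
# Hypothetical sample data for the Account relation from the example.
accounts = [
    {"account_number": "A-101", "branch_name": "Hillside", "balance": 500},
    {"account_number": "A-215", "branch_name": "Hillside", "balance": 700},
    {"account_number": "A-305", "branch_name": "Valleyview", "balance": 350},
]

# Horizontal fragmentation: each tuple is assigned to the fragment
# (site) for its branch; the union of fragments reconstructs r.
def horizontal_fragment(rel, branch):
    return [t for t in rel if t["branch_name"] == branch]

hillside = horizontal_fragment(accounts, "Hillside")

# Vertical fragmentation: split the schema, adding a tuple-id
# attribute so the fragments can be joined back losslessly.
def vertical_fragment(rel, attrs):
    return [{**{"tid": i}, **{a: t[a] for a in attrs}}
            for i, t in enumerate(rel)]

frag1 = vertical_fragment(accounts, ["account_number", "branch_name"])
frag2 = vertical_fragment(accounts, ["balance"])

# Lossless reconstruction: natural join of the fragments on tuple-id.
def rejoin(f1, f2):
    by_tid = {t["tid"]: t for t in f2}
    return [{k: v for k, v in {**t, **by_tid[t["tid"]]}.items()
             if k != "tid"}
            for t in f1]

assert rejoin(frag1, frag2) == accounts
```

The assertion at the end demonstrates the lossless-join property that the common candidate key (here the tuple-id) guarantees.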
141. ADVANTAGES OF FRAGMENTATION
Horizontal:
allows parallel processing on fragments of a relation
allows a relation to be split so that tuples are located where they are most
frequently accessed
Vertical:
allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
tuple-id attribute allows efficient joining of vertical fragments
Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an arbitrary depth.
Replication and fragmentation can be combined
Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.
142. 3. DATA TRANSPARENCY
Data transparency: Degree to which system user may
remain unaware of the details of how and where the data
items are stored in a distributed system
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Naming of data items: criteria
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items
transparently.
4. Each site should be able to create new data items autonomously.
143. CENTRALIZED SCHEME - NAME SERVER
Structure:
name server assigns all names
each site maintains a record of local data items
sites ask name server to locate non-local data items
Advantages:
satisfies naming criteria 1-3
Disadvantages:
does not satisfy naming criterion 4
name server is a potential performance bottleneck
name server is a single point of failure
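The centralized naming scheme can be sketched as a small mapping service. This is a hypothetical sketch (class and method names are invented) showing how criteria 1-3 are met while criterion 4 fails, since every registration must go through the server.

```python
# Hypothetical sketch of a centralized name server: it assigns all
# names and records which site holds each data item.
class NameServer:
    def __init__(self):
        self.catalog = {}  # name -> site (criteria 1 and 2)

    def register(self, name, site):
        # Criterion 4 fails: no site can create an item without
        # contacting this server, which is also a bottleneck and a
        # single point of failure.
        if name in self.catalog:
            raise ValueError("name already taken")
        self.catalog[name] = site

    def locate(self, name):
        return self.catalog[name]

    def move(self, name, new_site):
        # Criterion 3: relocation is transparent to clients, who
        # keep using the same system-wide name.
        self.catalog[name] = new_site

ns = NameServer()
ns.register("account", "site_17")
assert ns.locate("account") == "site_17"
ns.move("account", "site_3")
assert ns.locate("account") == "site_3"
```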
144. USE OF ALIASES
Alternative to centralized scheme: each site prefixes its
own site identifier to any name that it generates, e.g.,
site17.account.
Fulfills having a unique identifier, and avoids problems associated
with central control.
However, fails to achieve network transparency.
Solution: Create a set of aliases for data items; Store the
mapping of aliases to the real names at each site.
The user can be unaware of the physical location of a data
item, and is unaffected if the data item is moved from one
site to another.
145. III. DISTRIBUTED TRANSACTIONS
SYSTEM ARCHITECTURE
Transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the
transactions executing at that site.
Each site has a transaction coordinator, which is responsible for:
Starting the execution of transactions that originate at the site.
Distributing subtransactions to appropriate sites for execution.
Coordinating the termination of each transaction that originates
at the site, which may result in the transaction being committed
at all sites or aborted at all sites.
147. SYSTEM FAILURE MODES
Failures unique to distributed systems:
Failure of a site.
Loss of messages
Handled by network transmission control protocols such as TCP/IP
Failure of a communication link
Handled by network protocols, by routing messages via alternative
links
Network partition
A network is said to be partitioned when it has been split into two or
more subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally
indistinguishable.
148. COMMIT PROTOCOLS
Commit protocols are used to ensure atomicity across
sites
a transaction which executes at multiple sites must either be
committed at all the sites, or aborted at all the sites.
not acceptable to have a transaction committed at one site and
aborted at another
The two-phase commit (2PC) protocol is widely used
The three-phase commit (3PC) protocol is more
complicated and more expensive, but avoids some
drawbacks of two-phase commit protocol. This protocol
is not used in practice.
149. TWO PHASE COMMIT PROTOCOL (2PC)
Assumes fail-stop model – failed sites simply stop
working, and do not cause any other harm, such as
sending incorrect messages to other sites.
Execution of the protocol is initiated by the coordinator
after the last step of the transaction has been reached.
The protocol involves all the local sites at which the
transaction executed
Let T be a transaction initiated at site Si, and let the
transaction coordinator at Si be Ci
150. PHASE 1: OBTAINING A DECISION
Coordinator asks all participants to prepare to commit
transaction T.
Ci adds the records <prepare T> to the log and forces log to
stable storage
sends prepare T messages to all sites at which T executed
Upon receiving message, transaction manager at site
determines if it can commit the transaction
if not, add a record <no T> to the log and send abort T
message to Ci
if the transaction can be committed, then:
add the record <ready T> to the log
force all records for T to stable storage
send ready T message to Ci
151. PHASE 2: RECORDING THE DECISION
T can be committed if Ci received a ready T message
from all the participating sites; otherwise T must be
aborted.
Coordinator adds a decision record, <commit T> or
<abort T>, to the log and forces record onto stable
storage. Once the record reaches stable storage, it is
irrevocable (even if failures occur).
Coordinator sends a message to each participant
informing it of the decision (commit or abort)
Participants take appropriate action locally.
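The two phases above can be sketched as a single coordinator-side function. This is a hypothetical simulation (function names and the vote strings are invented): each participant is modeled as a callable returning its vote, and the coordinator commits only if every vote is ready.

```python
# Hypothetical sketch of 2PC from the coordinator's point of view.
def two_phase_commit(participants):
    log = ["<prepare T>"]                # phase 1: force prepare record
    votes = [p() for p in participants]  # send prepare T, collect votes
    if all(v == "ready" for v in votes):
        log.append("<commit T>")         # phase 2: decision is forced to
        decision = "commit"              # stable storage, then broadcast
    else:
        log.append("<abort T>")          # any abort (or missing) vote
        decision = "abort"               # aborts T at all sites
    return decision, log

# Participants simulated as vote functions.
ok = lambda: "ready"
refuses = lambda: "abort"

assert two_phase_commit([ok, ok, ok])[0] == "commit"
assert two_phase_commit([ok, refuses, ok])[0] == "abort"
```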
152. HANDLING OF FAILURES - SITE FAILURE
When site Sk recovers, it examines its log to determine the
fate of transactions active at the time of the failure.
Log contains <commit T> record: site executes redo (T)
Log contains <abort T> record: site executes undo (T)
Log contains <ready T> record: site must consult Ci to
determine the fate of T.
If T committed, redo (T)
If T aborted, undo (T)
If the log contains no control records concerning T, then
Sk failed before responding to the prepare T message
from Ci
Sk must execute undo (T)
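The recovery rules above reduce to a short case analysis over the log. A minimal sketch, with invented function names; `consult_coordinator` stands in for asking Ci (or another site) for T's fate when the site is in doubt.

```python
# Hypothetical sketch: decide the recovery action for T from the
# control records found in a recovering site's log.
def recover(log, consult_coordinator):
    if "<commit T>" in log:
        return "redo(T)"                 # decision is known: redo
    if "<abort T>" in log:
        return "undo(T)"                 # decision is known: undo
    if "<ready T>" in log:
        # In-doubt case: the site voted ready but never learned the
        # outcome, so it must ask the coordinator.
        fate = consult_coordinator()
        return "redo(T)" if fate == "commit" else "undo(T)"
    # No control records: the site failed before voting, so the
    # coordinator cannot have committed T; undo is safe.
    return "undo(T)"

assert recover(["<ready T>", "<commit T>"], lambda: None) == "redo(T)"
assert recover(["<ready T>"], lambda: "commit") == "redo(T)"
assert recover([], lambda: None) == "undo(T)"
```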
153. HANDLING OF FAILURES- COORDINATOR FAILURE
If coordinator fails while the commit protocol for T is executing then
participating sites must decide on T’s fate:
1. If an active site contains a <commit T> record in its log, then T must
be committed.
2. If an active site contains an <abort T> record in its log, then T must
be aborted.
3. If some active participating site does not contain a <ready T> record
in its log, then the failed coordinator Ci cannot have decided to
commit T.
Ci can therefore abort T.
4. If none of the above cases holds, then all active sites must have a
<ready T> record in their logs, but no additional control records
(such as <abort T> or <commit T>).
In this case active sites must wait for Ci to recover, to find decision.
Blocking problem: active sites may have to wait for failed coordinator
to recover.
154. HANDLING OF FAILURES - NETWORK PARTITION
If the coordinator and all its participants remain in one partition, the
failure has no effect on the commit protocol.
If the coordinator and its participants belong to several partitions:
Sites that are not in the partition containing the coordinator think
the coordinator has failed, and execute the protocol to deal with
failure of the coordinator.
No harm results, but sites may still have to wait for decision from
coordinator.
The coordinator and the sites that are in the same partition as
the coordinator think that the sites in the other partition have
failed, and follow the usual commit protocol.
Again, no harm results
156. OUTLINE
Object-based Databases: Object Database Concepts
Object-Relational features
Object Data Management Group (ODMG) Object Model
Object Definition Language (ODL)
Object Query Language (OQL)
157. I. OBJECT ORIENTED CONCEPTS
Extend the relational data model by including object
orientation and constructs to deal with added data types.
Allow attributes of tuples to have complex types,
including non-atomic values such as nested relations.
Preserve relational foundations, in particular the
declarative access to data, while extending modeling
power.
Upward compatibility with existing relational languages.
158. COMPLEX DATA TYPES
Motivation:
Permit non-atomic domains (atomic = indivisible)
Example of a non-atomic domain: set of integers, or set of
tuples
Allows more intuitive modeling for applications with
complex data
Intuitive definition:
allow relations whenever we allow atomic (scalar) values —
relations within relations
Retains mathematical foundation of relational model
Violates first normal form.
159. EXAMPLE OF A NESTED RELATION
Example: library information system
Each book has
title,
a set of authors,
Publisher, and
a set of keywords
Non-1NF relation books
160. [CONTD…]
4NF DECOMPOSITION OF NESTED RELATION
Remove awkwardness of flat-books by assuming that the
following multivalued dependencies hold:
title →→ author
title →→ keyword
title →→ pub-name, pub-branch
Decompose flat-books into 4NF using the schemas:
(title, author )
(title, keyword )
(title, pub-name, pub-branch )
162. PROBLEM WITH 4NF SCHEME
4NF design requires users to include joins in their
queries.
1NF relational view flat-books defined by join of 4NF
relations:
eliminates the need for users to perform joins,
but loses the one-to-one correspondence between tuples and
documents.
And has a large amount of redundancy
Nested relations representation is much more natural
here.
163. II. OBJECT-RELATIONAL FEATURES
Structured types can be declared and used in SQL
create type Name as
(firstname varchar(20),
lastname varchar(20))
final
create type Address as
(street varchar(20),
city varchar(20),
zipcode varchar(20))
not final
Note: final and not final indicate whether subtypes can be created
Structured types can be used to create tables with composite attributes
create table customer (
name Name,
address Address,
dateOfBirth date)
Dot notation used to reference components: name.firstname
164. [CONTD…]
User-defined row types
create type CustomerType as (
name Name,
address Address,
dateOfBirth date)
not final
Can then create a table whose rows are a user-defined
type
create table customer of CustomerType
165. [CONTD…]
Alternative way of defining composite attributes in SQL
is to use unnamed row types.
create table person_r (
name row (firstname varchar(20),
lastname varchar(20)),
address row (street varchar(20),
city varchar(20),
zipcode varchar(9)),
dateOfBirth date);
The following query finds the last name and city of each person:
select name.lastname, address.city from person_r;
166. [CONTD…]
Methods
Can add a method declaration with a structured type.
method ageOnDate (onDate date)
returns interval year
Method body is given separately.
create instance method ageOnDate (onDate date)
returns interval year
for CustomerType
begin
return onDate - self.dateOfBirth;
end
We can now find the age of each customer:
select name.lastname, ageOnDate (current_date) from customer
167. [CONTD…]
Constructor
create function Name (firstname varchar(20), lastname
varchar(20))
returns Name
begin
set self.firstname = firstname;
set self.lastname = lastname;
end
Inserting
insert into Person values (new Name(’John’, ’Smith’), new
Address(’20 Main St’, ’New York’, ’11001’), date ’1960-8-
22’);
168. [CONTD…]
Inheritance
Suppose that we have the following type definition for
people:
create type Person
(name varchar(20),
address varchar(20))
Using inheritance to define the student and teacher types
create type Student under Person
(degree varchar(20),
department varchar(20))
create type Teacher under Person
(salary integer,
department varchar(20))
Subtypes can redefine methods by using overriding method
in place of method in the method declaration
169. [CONTD…]
Multiple Inheritance
SQL:1999 and SQL:2003 do not support multiple inheritance
If our type system supports multiple inheritance, we can define a type
for teaching assistant as follows:
create type Teaching Assistant
under Student, Teacher
To avoid a conflict between the two occurrences of department we can
rename them
create type Teaching Assistant under
Student with (department as student_dept ),
Teacher with (department as teacher_dept )
170. [CONTD…]
Array and Multiset Types in SQL
Example of array and multiset declaration:
create type Publisher as
(name varchar(20),
branch varchar(20))
create type Book as
(title varchar(20),
author-array varchar(20) array [10],
pub-date date,
publisher Publisher,
keyword-set varchar(20) multiset )
create table books of Book
Similar to the nested relation books, but with array of authors
instead of set
171. [CONTD…]
Array construction
array ['Silberschatz', 'Korth', 'Sudarshan']
Multisets
multiset ['computer', 'database', 'SQL']
To create a tuple of the type defined by the books relation:
('Compilers', array['Smith', 'Jones'],
Publisher('McGraw-Hill', 'New York'),
multiset ['parsing', 'analysis'])
To insert the preceding tuple into the relation books
insert into books
values ('Compilers', array['Smith', 'Jones'],
Publisher('McGraw-Hill', 'New York'),
multiset ['parsing', 'analysis'])
172. UNNESTING
The transformation of a nested relation into a form with
fewer (or no) relation-valued attributes is called
unnesting.
E.g.
select title, A.author, publisher.name as pub_name,
publisher.branch as pub_branch, K.keyword
from books as B, unnest(B.author_array) as A(author),
unnest(B.keyword_set) as K(keyword)
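The effect of the unnest query can be sketched procedurally: one output tuple per (author, keyword) combination of each book. A hypothetical illustration with invented sample data, not the slides' SQL.

```python
# Hypothetical nested books relation: set- and array-valued attributes.
books = [
    {"title": "Compilers",
     "author_array": ["Smith", "Jones"],
     "publisher": {"name": "McGraw-Hill", "branch": "New York"},
     "keyword_set": {"parsing", "analysis"}},
]

def unnest(rel):
    # Flatten each nested tuple into 1NF rows: the cross product of
    # its authors and keywords, with scalar attributes repeated.
    flat = []
    for b in rel:
        for author in b["author_array"]:
            for keyword in sorted(b["keyword_set"]):
                flat.append((b["title"], author,
                             b["publisher"]["name"],
                             b["publisher"]["branch"],
                             keyword))
    return flat

flat_books = unnest(books)
assert len(flat_books) == 4  # 2 authors x 2 keywords
```

Note how the scalar attributes (title, publisher) are duplicated across the flattened rows, which is exactly the redundancy the nested representation avoids.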
173. NESTING
Nesting is the opposite of unnesting, creating a collection-valued
attribute
NOTE: SQL:1999 does not support nesting
Nesting can be done in a manner similar to aggregation, but using
the function collect() in place of an aggregation operation, to create a
multiset
To nest the flat-books relation on the attribute keyword:
select title, author, Publisher(pub_name, pub_branch) as publisher,
collect(keyword) as keyword_set
from flat-books
group by title, author, publisher
To nest on both authors and keywords:
select title, collect(author) as author_set,
Publisher(pub_name, pub_branch) as publisher,
collect(keyword) as keyword_set
from flat-books
group by title, publisher
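The role of collect() in the queries above, grouping flat rows back into a collection-valued attribute, can be sketched as follows. A hypothetical illustration; the sample rows and function name are invented.

```python
from collections import defaultdict

# Hypothetical flat 1NF rows: (title, author, publisher, keyword).
flat_books = [
    ("Compilers", "Smith", "McGraw-Hill", "parsing"),
    ("Compilers", "Smith", "McGraw-Hill", "analysis"),
    ("Networks", "Jones", "Pearson", "routing"),
]

def nest_on_keyword(rows):
    # Group by the remaining attributes and collect() the keywords
    # of each group into a set-valued attribute.
    groups = defaultdict(set)
    for title, author, publisher, keyword in rows:
        groups[(title, author, publisher)].add(keyword)
    return [{"title": t, "author": a, "publisher": p,
             "keyword_set": ks}
            for (t, a, p), ks in groups.items()]

nested = nest_on_keyword(flat_books)
assert len(nested) == 2  # one nested tuple per (title, author, publisher)
```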
174. III. OBJECT DATA MANAGEMENT GROUP
(ODMG) OBJECT MODEL
Provides a standard model for object databases
Supports object definition via ODL
Supports object querying via OQL
Supports a variety of data types and type constructors
175. ODMG OBJECTS AND LITERALS
The basic building blocks of the object model are
Objects
Literals
An object has four characteristics
Identifier: Unique system-wide identifier
Name: Unique within a particular database and/or program; it
is optional
Lifetime: persistent vs transient
Structure: specifies how object is constructed by the type
constructor and whether it is an atomic object
176. [CONTD…]
A literal has a current value but not an identifier
Three types of literals
Atomic: predefined; basic data type values (e.g. short, float,
boolean, char)
Structured: values that are constructed by type constructors
(e.g. date, struct variables)
Collection: a collection (e.g. array) of values or objects
177. [CONTD…]
ODMG supports two concepts for specifying object
types:
Interface
Class
There are similarities and differences between interfaces
and classes
Both have behaviors (operations) and state (attributes
and relationships)
178. ODMG INTERFACE
An interface is a specification of the abstract behavior of
an object type
State properties of an interface (i.e. its attributes and
relationships) cannot be inherited from
Objects cannot be instantiated from an interface
179. ODMG INTERFACE DEFINITION
interface Date:Object {
enum weekday{sun, mon, tue, wed, thu, fri, sat};
enum month{jan, feb, mar, …, dec};
unsigned short year();
unsigned short month();
unsigned short day();
boolean is_equal(in Date other_date);
};
180. BUILT-IN INTERFACES FOR COLLECTION
OBJECTS
A collection object inherits the basic collection
interface, for example:
cardinality()
is_empty()
insert_element()
remove_element()
contains_element()
create_iterator()
181. COLLECTION TYPES
Collection objects are further specialized into types such as
set, list, bag, array, and dictionary
Each collection type may provide additional interfaces,
for example, a set provides:
create_union()
create_difference()
is_subset_of()
is_superset_of()
is_proper_subset_of()
183. ODMG CLASS
A class is a specification of abstract behavior and state of
an object type
A class is instantiable
Supports “extends” inheritance to allow both state and
behavior inheritance among classes
Multiple inheritance via “extends” is not allowed
184. [CONTD…]
Atomic objects are user defined objects and are defined
via keyword class
An example:
class Employee (extent all_employees key ssn) {
attribute string name;
attribute string ssn;
attribute short age;
relationship dept works_for;
void reassign(in string new_name);
}
185. IV. OBJECT DEFINITION LANGUAGE (ODL)
ODL supports semantics constructs of ODMG
ODL is independent of any programming language
ODL is used to create object specification (classes and
interfaces)
ODL is not used for database manipulation
186. EXAMPLE 1: A VERY SIMPLE CLASS
A very simple, straightforward class definition
class Degree {
attribute string college;
attribute string degree;
attribute string year;
};
187. EXAMPLE 2: A CLASS WITH KEY AND EXTENT
class Person (extent persons key ssn) {
attribute struct Pname {string fname …} name;
attribute string ssn;
attribute date birthdate;
short age();
};
188. EXAMPLE 3: A CLASS WITH RELATIONSHIPS
class Faculty extends Person (extent faculty) {
attribute string rank;
attribute float salary;
attribute string phone;
relationship dept works_in inverse
dept :: has_faculty;
relationship set<GradStu> advises inverse
GradStu :: advisor;
void give_raise (in float raise);
void promise (in string new_rank);
};
189. EXAMPLE 4: INHERITANCE
interface Shape {
attribute struct point {…}
reference_point;
float perimeter();
};
class Triangle : Shape (extent triangles) {
attribute short side_1;
attribute short side_2;
};
190. V. OBJECT QUERY LANGUAGE (OQL)
OQL is ODMG's query language
OQL works closely with programming languages such as
C++
Embedded OQL statements return objects that are
compatible with the type system of the host language
OQL's syntax is similar to SQL, with additional features
for objects
191. SIMPLE OQL QUERIES
Basic syntax: select … from … where …
select d.name from d in departments where d.college =
‘engineering’;
An entry point to the database is needed for each query
An extent name may serve as an entry point
192. ITERATOR VARIABLES
Iterator variables are defined whenever a collection is
referenced in an OQL query
The variable d in the previous example serves as an iterator,
ranging over each object in the collection
Syntactical options for specifying an iterator:
d in departments
departments d
departments as d
193. DATA TYPE OF QUERY RESULTS
The data type of a query result can be any type defined in
the ODMG model
A query does not have to follow the select … from …
where … format
A persistent name on its own can serve as a query whose
result is a reference to the persistent object.
For example,
departments, whose type is set<Department>
194. PATH EXPRESSIONS
A path expression is used to specify a path to attributes
and objects in an entry point
A path expression starts at a persistent object name
The name will be followed by zero or more dot
connected relationship or attribute names
For example: departments.chair;
195. VIEWS AS NAMED OBJECTS
The define keyword in OQL is used to specify an
identifier for a named query
The name should be unique; if not, the new definition will
replace the existing named query
Once a query definition is created, it will persist until
deleted or redefined
A view definition can include parameters
196. EXAMPLE
A view to include students in a department who have a
minor
define has_minor(dept_name) as select s from s in
students where s.minor_in.dname = dept_name
197. SINGLE ELEMENTS FROM COLLECTIONS
An OQL query returns a collection
OQL's element operator can be used to return a single
element from a singleton collection:
element (select d from d in departments where
d.name = ‘Web Programming’);
If the result is empty or has more than one element, an
exception is raised
198. COLLECTION OPERATORS
OQL supports a number of aggregate operators that can
be applied to query results
The aggregate operators operate over a collection and include:
Min
Max
Count
Sum
Avg
For example:
avg (select s.gpa from s in students where s.class =
‘senior’ and s.majors_in.dname = ‘business’);