CS8492 – DATABASE
MANAGEMENT SYSTEMS
UNIT IV - IMPLEMENTATION
TECHNIQUES
OUTLINE
 RAID
 File Organization –Organization of Records in Files
 Indexing and Hashing
 Ordered Indices
 B+ tree Index Files – B tree Index Files
 Static Hashing –Dynamic Hashing
 Query Processing Overview
 Algorithms for SELECT and JOIN operations
 Query optimization using Heuristics and Cost Estimation
RAID
REDUNDANT ARRAYS OF INDEPENDENT
DISKS
CLASSIFICATION OF PHYSICAL STORAGE
MEDIA
 Can differentiate storage into:
 volatile storage: loses contents when power is switched off
 non-volatile storage:
 Contents persist even when power is switched off.
 Includes secondary and tertiary storage, as well as battery-backed-up
main memory.
 Factors affecting choice of storage media include
 Speed with which data can be accessed
 Cost per unit of data
 Reliability
STORAGE HIERARCHY
[CONTD…]
 primary storage: Fastest media but volatile (cache, main
memory).
 secondary storage: next level in hierarchy, non-volatile,
moderately fast access time
 Also called on-line storage
 E.g., flash memory, magnetic disks
 tertiary storage: lowest level in hierarchy, non-volatile, slow
access time
 also called off-line storage and used for archival storage
 e.g., magnetic tape, optical storage
 Magnetic tape
 Sequential access, 1 to 12 TB capacity
 A few drives with many tapes
 Juke boxes with petabytes (1000’s of TB) of storage
RAID
 RAID: Redundant Arrays of Independent Disks
 Disk organization techniques that manage a large number of
disks, providing a view of a single disk of
 high capacity and high speed by using multiple disks in parallel,
 high reliability by storing data redundantly, so that data can be
recovered even if a disk fails
 The chance that some disk out of a set of N disks will fail
is much higher than the chance that a specific single disk
will fail.
 E.g., a system with 100 disks, each with MTTF of 100,000
hours (approx. 11 years), will have a system MTTF of 1000
hours (approx. 41 days)
 Techniques for using redundancy to avoid data loss are critical
with large numbers of disks
IMPROVEMENT OF RELIABILITY VIA REDUNDANCY
 Redundancy – store extra information that can be used to
rebuild information lost in a disk failure
 E.g., Mirroring (or shadowing)
 Duplicate every disk. Logical disk consists of two physical disks.
 Every write is carried out on both disks
 Reads can take place from either disk
 If one disk in a pair fails, data still available in the other
 Data loss would occur only if a disk fails, and its mirror disk also fails
before the system is repaired
 Probability of combined event is very small
 Except for dependent failure modes such as fire or building collapse or
electrical power surges
 Mean time to data loss depends on mean time to failure,
and mean time to repair
 E.g., MTTF of 100,000 hours, mean time to repair of 10 hours gives
mean time to data loss of 500 × 10⁶ hours (or 57,000 years) for a
mirrored pair of disks (ignoring dependent failure modes)
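As a quick check of that figure, the standard approximation for a mirrored pair with independent failures is:
mean time to data loss ≈ MTTF² / (2 × MTTR) = 100,000² / (2 × 10) = 5 × 10⁸ hours ≈ 57,000 years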
IMPROVEMENT IN PERFORMANCE VIA PARALLELISM
 Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput
2. Parallelize large accesses to reduce response time.
 Improve transfer rate by striping data across multiple disks.
 Bit-level striping – split the bits of each byte across multiple
disks
 In an array of eight disks, write bit i of each byte to disk i.
 Each access can read data at eight times the rate of a single disk.
 But seek/access time worse than for a single disk
 Bit level striping is not used much any more
 Block-level striping – with n disks, block i of a file goes to
disk (i mod n) + 1
 Requests for different blocks can run in parallel if the blocks reside on
different disks
 A request for a long sequence of blocks can utilize all disks in parallel
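A minimal Python sketch (an illustration, not from the slides) of the block-striping address map just stated:

def locate_block(i, n_disks):
    # Block-level striping: block i of a file goes to disk (i mod n) + 1;
    # i div n is the block's offset within that disk.
    return (i % n_disks) + 1, i // n_disks

# e.g. with n_disks = 4, blocks 0..3 land on disks 1..4, block 4 wraps to disk 1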
RAID LEVELS
 Schemes to provide redundancy at lower cost by using disk
striping combined with parity bits
 Different RAID organizations, or RAID levels, have differing cost,
performance and reliability characteristics
 RAID Level 0: Block striping; non-redundant.
 Used in high-performance applications where data loss is not critical.
 RAID Level 1: Mirrored disks with block striping
 Offers best write performance.
 Popular for applications such as storing log files in a database system.
[CONTD…]
 RAID Level 2: Memory-Style Error-Correcting-Codes
(ECC) with bit striping.
 RAID Level 3: Bit-Interleaved Parity
 a single parity bit is enough for error correction, not just
detection, since we know which disk has failed
 When writing data, corresponding parity bits must also be computed
and written to a parity bit disk
 To recover data in a damaged disk, compute XOR of bits from other
disks (including parity bit disk)
[CONTD…]
 RAID Level 3 (Cont.)
 Faster data transfer than with a single disk, but fewer I/Os per
second since every disk has to participate in every I/O.
 Subsumes Level 2 (provides all its benefits, at lower cost).
 RAID Level 4: Block-Interleaved Parity; uses block-level
striping, and keeps a parity block on a separate disk for
corresponding blocks from N other disks.
 When writing data block, corresponding block of parity bits must
also be computed and written to parity disk
 To find value of a damaged block, compute XOR of bits from
corresponding blocks (including parity block) from other disks.
[CONTD…]
 RAID Level 4 (Cont.)
 Provides higher I/O rates for independent block reads than
Level 3
 block read goes to a single disk, so blocks stored on different disks
can be read in parallel
 Provides higher transfer rates for reads of multiple blocks than
no striping
 Before writing a block, parity data must be computed
 Can be done by using old parity block, old value of current block and
new value of current block (2 block reads + 2 block writes)
 Or by recomputing the parity value using the new values of blocks
corresponding to the parity block
 More efficient for writing large amounts of data sequentially
 Parity block becomes a bottleneck for independent block
writes since every block write also writes to parity disk
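The parity update described above (old parity, old block value, new block value: 2 block reads + 2 block writes) can be sketched in Python as follows; a hypothetical helper, not from the slides:

def update_parity(p_old: bytes, b_old: bytes, b_new: bytes) -> bytes:
    # New parity = old parity XOR old data block XOR new data block, bytewise.
    # This avoids reading the other N-1 data blocks on a small write.
    return bytes(p ^ o ^ n for p, o, n in zip(p_old, b_old, b_new))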
[CONTD…]
 RAID Level 5: Block-Interleaved Distributed Parity;
partitions data and parity among all N + 1 disks, rather than
storing data in N disks and parity in 1 disk.
 E.g., with 5 disks, parity block for nth set of blocks is stored on disk
(n mod 5) + 1, with the data blocks stored on the other 4 disks.
[CONTD…]
 RAID Level 5 (Cont.)
 Block writes occur in parallel if the blocks and their parity blocks are
on different disks.
 RAID Level 6: P+Q Redundancy scheme; similar to Level 5,
but stores two error correction blocks (P, Q) instead of single
parity block to guard against multiple disk failures.
 Better reliability than Level 5 at a higher cost
 Becoming more important as storage sizes increase
CHOICE OF RAID LEVEL
 Factors in choosing RAID level
 Monetary cost
 Performance: Number of I/O operations per second, and bandwidth
during normal operation
 Performance during failure
 Performance during rebuild of failed disk
 Including time taken to rebuild failed disk
 RAID 0 is used only when data safety is not important
 E.g., data can be recovered quickly from other sources
 Level 2 and 4 never used since they are subsumed by 3 and 5
 Level 3 is not used anymore since bit-striping forces single
block reads to access all disks, wasting disk arm movement,
which block striping (level 5) avoids
 Level 6 is rarely used since levels 1 and 5 offer adequate safety
for most applications
[CONTD…]
 Level 1 provides much better write performance than level 5
 Level 5 requires at least 2 block reads and 2 block writes to write a
single block, whereas Level 1 only requires 2 block writes
 Level 1 has higher storage cost than level 5
 Level 5 is preferred for applications where writes are sequential
and large (many blocks), and need large amounts of data storage
 RAID 1 is preferred for applications with many random/small
updates
 Level 6 gives better data protection than RAID 5 since it can
tolerate two disk (or disk block) failures
 Increasing in importance since latent block failures on one disk,
coupled with a failure of another disk can result in data loss with
RAID 1 and RAID 5.
HARDWARE ISSUES
 Software RAID: RAID implementations done entirely in
software, with no special hardware support
 Hardware RAID: RAID implementations with special
hardware
 Use non-volatile RAM to record writes that are being executed
 Beware: power failure during write can result in corrupted
disk
 E.g., failure after writing one block but before writing the second in a
mirrored system
 Such corrupted data must be detected when power is restored
 Recovery from corruption is similar to recovery from failed disk
 NV-RAM helps to efficiently detect potentially corrupted blocks
 Otherwise all blocks of disk must be read and compared with
mirror/parity block
[CONTD…]
 Latent failures: data successfully written earlier gets damaged
 can result in data loss even if only one disk fails
 Data scrubbing:
 continually scan for latent failures, and recover from copy/parity
 Hot swapping: replacement of disk while system is running,
without power down
 Supported by some hardware RAID systems,
 reduces time to recovery, and improves availability greatly
 Many systems maintain spare disks which are kept online, and
used as replacements for failed disks immediately on detection
of failure
 Reduces time to recovery greatly
 Many hardware RAID systems ensure that a single point of
failure will not stop the functioning of the system by using
 Redundant power supplies with battery backup
 Multiple controllers and multiple interconnections to guard against
controller/interconnection failures
OPTIMIZATION OF DISK-BLOCK ACCESS
 Buffering: in-memory buffer to cache disk blocks
 Read-ahead: Read extra blocks from a track in
anticipation that they will be requested soon
 Disk-arm-scheduling algorithms re-order block requests
so that disk arm movement is minimized
 elevator algorithm
FILE ORGANIZATION –ORGANIZATION OF
RECORDS IN FILES
INTRODUCTION
 The database is stored as a collection of files.
 Each file is a sequence of records.
 A record is a sequence of fields.
 One approach:
 assume record size is fixed
 each file has records of one particular type only
 different files are used for different relations
 This case is easiest to implement; will consider variable
length records later.
FIXED LENGTH RECORD
 Simple approach:
 Store record i starting from byte n ∗ (i − 1), where n is the
size of each record.
 Record access is simple but records may cross blocks.
 Deletion of record i — alternatives:
 move records i + 1,...,n to i, . . . , n − 1
 move record n to i
 Link all free records on a free list
EXAMPLE
 type instructor = record
    ID varchar(5);
    name varchar(20);
    dept_name varchar(20);
    salary numeric(8,2);
 end
 instructor record is 53 bytes long.
 Two problems:
 Unless the block size happens to be a multiple of 53 (which is
unlikely), some records will cross block boundaries.
 It is difficult to delete a record from this structure.
[CONTD…]
INSERTING RECORD
[CONTD…]
DELETING RECORD
FILE HEADER AND FREE LIST
 Store the address of the first record whose contents are
deleted in the file header.
 Use this first record to store the address of the second
available record, and so on.
 Can think of these stored addresses as pointers since they
“point” to the location of a record.
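A minimal Python sketch of this header-anchored free list for fixed-length records (names and structure assumed for illustration):

class FixedLengthFile:
    def __init__(self):
        self.records = []       # fixed-length record slots
        self.free_head = None   # file header: address of first deleted record

    def delete(self, i):
        # Reuse the deleted record's space to store the next free address.
        self.records[i] = ("FREE", self.free_head)
        self.free_head = i

    def insert(self, rec):
        if self.free_head is not None:      # pop the free list if non-empty
            i = self.free_head
            self.free_head = self.records[i][1]
            self.records[i] = rec
            return i
        self.records.append(rec)            # else extend the file
        return len(self.records) - 1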
[CONTD…]
 More space efficient representation: reuse space for
normal attributes of free records to store pointers. (No
pointers stored in in-use records.)
 Dangling pointers occur if we move or delete a record to
which another record contains a pointer; that pointer no
longer points to the desired record.
 Avoid moving or deleting records that are pointed to by
other records; such records are pinned.
VARIABLE LENGTH RECORD
 Variable-length records arise in database systems in
several ways:
 Storage of multiple record types in a file.
 Record types that allow variable lengths for one or more
fields.
 Record types that allow repeating fields (used in some older
data models).
 Byte string representation
 Attach an end-of-record (⊥) control character to the end of
each record
 Difficulty with deletion
 Difficulty with growth
SLOTTED PAGE STRUCTURE
 Header contains:
 number of record entries
 end of free space in the block
 location and size of each record
 Records can be moved around within a page to keep them
contiguous with no empty space between them; entry in the header
must then be updated.
 Pointers should not point directly to record — instead they should
point to the entry for the record in header.
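A minimal Python sketch of such a slotted-page header (field names assumed for illustration):

class SlottedPage:
    def __init__(self, block_size=4096):
        self.slots = []               # (offset, length) of each record
        self.free_end = block_size    # end of free space; records grow downward

    def insert(self, length):
        self.free_end -= length       # place the record at the end of free space
        self.slots.append((self.free_end, length))
        return len(self.slots) - 1    # callers keep the slot number, not the offset

Because external pointers name a slot number rather than a byte offset, records can be compacted within the page without invalidating those pointers.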
ORGANIZATION OF RECORDS IN
FILES
 Heap – a record can be placed anywhere in the file where
there is space
 Sequential – store records in sequential order, based on
the value of the search key of each record
 Hashing – a hash function is computed on some attribute
of each record; the result specifies in which block of the
file the record should be placed
 Clustering – records of several different relations can be
stored in the same file; related records are stored on the
same block
SEQUENTIAL FILE ORGANIZATION
 Suitable for applications that require sequential
processing of the entire file
 The records in the file are ordered by a search-key
[CONTD…]
 Deletion – use pointer chains
 Insertion – must locate the position in the file where the
record is to be inserted
 if there is free space insert there
 if no free space, insert the record in an overflow block
 In either case, pointer chain must be updated
 Need to reorganize the file from time to time to restore
sequential order
CLUSTERING FILE ORGANIZATION
 Simple file structure stores each relation in a separate file
 Can instead store several relations in one file using a
clustering file organization
 E.g., clustering organization of department and employee:
INDEXING AND HASHING
INTRODUCTION
 Indexing mechanisms used to speed up access to desired
data.
 E.g., author catalog in library
 Search Key - attribute or set of attributes used to look up
records in a file.
 An index file consists of records (called index entries) of
the form (search-key, pointer)
 Index files are typically much smaller than the original
file
 Two basic kinds of indices:
 Ordered indices: search keys are stored in some sorted order
 Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
INDEX EVALUATION METRICS
 Access types: The types of access that are supported
efficiently.
 Access time: The time it takes to find a particular data
item, or set of items, using the technique in question.
 Insertion time: The time it takes to insert a new data item.
 Deletion time: The time it takes to delete a data item.
 Space overhead: The additional space occupied by an
index structure.
ORDERED INDICES
INTRODUCTION
 In an ordered index, index entries are stored sorted on the
search key value.
 E.g., author catalog in library.
 Primary index: in a sequentially ordered file, the index whose
search key specifies the sequential order of the file.
 Also called clustering index
 The search key of a primary index is usually but not necessarily the
primary key.
 Secondary index: an index whose search key specifies an
order different from the sequential order of the file. Also
called non-clustering index.
 Index-sequential file: ordered sequential file with a primary
index.
DENSE INDEX
[CONTD…]
SPARSE INDEX
MULTILEVEL INDEXING
INDEX UPDATE
Insertion
 First, the system performs a lookup using the search-key value
that appears in the record to be inserted. The actions the
system takes next depend on whether the index is dense or
sparse:
 Dense indices:
1. If the search-key value does not appear in the index, the
system inserts an index entry with the search-key value in the
index at the appropriate position.
2. Otherwise the following actions are taken:
 If the index entry stores pointers to all records with the same search
key value, the system adds a pointer to the new record in the index
entry.
 Otherwise, the index entry stores a pointer to only the first record
with the search-key value. The system then places the record being
inserted after the other records with the same search-key values.
[CONTD…] DELETION
 Dense indices:
1. If the deleted record was the only record with its
particular search-key value, then the system deletes the
corresponding index entry from the index.
2. Otherwise the following actions are taken:
 If the index entry stores pointers to all records with the same
search key value, the system deletes the pointer to the deleted
record from the index entry.
 Otherwise, the index entry stores a pointer to only the first
record with the search-key value. In this case, if the deleted
record was the first record with the search-key value, the
system updates the index entry to point to the next record.
[CONTD…]
 Sparse indices:
1. If the index does not contain an index entry with the
search-key value of the deleted record, nothing needs to
be done to the index.
2. Otherwise the system takes the following actions:
 If the deleted record was the only record with its search key,
the system replaces the corresponding index record with an
index record for the next search-key value (in search-key
order). If the next search-key value already has an index
entry, the entry is deleted instead of being replaced.
 Otherwise, if the index entry for the search-key value points
to the record being deleted, the system updates the index
entry to point to the next record with the same search-key
value.
ALGORITHMS FOR SELECT
AND JOIN OPERATIONS
SELECT OPERATIONS
 File scan – search algorithms that locate and retrieve
records that fulfill a selection condition.
 Algorithm A1 (linear search). Scan each file block and
test all records to see whether they satisfy the selection
condition.
 Cost estimate = br block transfers + 1 seek
 br denotes number of blocks containing records from relation r
 If selection is on a key attribute, can stop on finding record
 cost = (br /2) block transfers + 1 seek
 Linear search can be applied regardless of
 selection condition or
 ordering of records in the file, or
 availability of indices
[CONTD…]
 A2 (binary search). Applicable if selection is an
equality comparison on the attribute on which file is
ordered.
 Assume that the blocks of a relation are stored contiguously
 Cost estimate (number of disk blocks to be scanned):
 cost of locating the first tuple by a binary search on the blocks
 ⌈log₂(br)⌉ ∗ (tT + tS)
 If there are multiple records satisfying selection
 Add transfer cost of the number of blocks containing records that
satisfy selection condition
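The A1/A2 cost formulas above can be sketched in Python as follows (tT = time per block transfer, tS = time per seek, as in the slides; function names are assumptions for illustration):

import math

def linear_search_cost(br, tT, tS, key_equality=False):
    # A1: br transfers + 1 seek; on a key attribute, stop at the match (br/2 average).
    transfers = br / 2 if key_equality else br
    return transfers * tT + tS

def binary_search_cost(br, tT, tS):
    # A2: one seek + one transfer per probe, ceil(log2(br)) probes.
    return math.ceil(math.log2(br)) * (tT + tS)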
[CONTD…]
 Index scan – search algorithms that use an index
 selection condition must be on search-key of index.
 A3 (primary index on candidate key, equality). Retrieve a
single record that satisfies the corresponding equality
condition
 Cost = (hi + 1) * (tT + tS)
 A4 (primary index on nonkey, equality) Retrieve multiple
records.
 Records will be on consecutive blocks
 Let b = number of blocks containing matching records
 Cost = hi * (tT + tS) + tS + tT * b
 A5 (equality on search-key of secondary index).
 Retrieve a single record if the search-key is a candidate key
 Cost = (hi + 1) * (tT + tS)
 Retrieve multiple records if search-key is not a candidate key
 each of n matching records may be on a different block
 Cost = (hi + n) * (tT + tS)
 Can be very expensive!
[CONTD…]
 Can implement selections of the form σA≤V(r) or σA≥V(r) by
using
 a linear file scan or binary search,
 or by using indices in the following ways:
 A6 (primary index, comparison). (Relation is sorted on A)
 For σA≥V(r) use index to find first tuple ≥ v and scan relation
sequentially from there
 For σA≤V(r) just scan relation sequentially till first tuple > v; do not use
index
 A7 (secondary index, comparison).
 For σA≥V(r) use index to find first index entry ≥ v and scan index
sequentially from there, to find pointers to records.
 For σA≤V(r) just scan leaf pages of index finding pointers to records, till
first entry > v
 In either case, retrieve records that are pointed to
 requires an I/O for each record
 Linear file scan may be cheaper
[CONTD…]
 Conjunction: 1 2. . . n(r)
 A8 (conjunctive selection using one index).
 Select a combination of i and algorithms A1 through A7 that results
in the least cost for i (r).
 Test other conditions on tuple after fetching it into memory buffer.
 A9 (conjunctive selection using multiple-key index).
 Use appropriate composite (multiple-key) index if available.
 A10 (conjunctive selection by intersection of identifiers).
 Requires indices with record pointers.
 Use corresponding index for each condition, and take intersection of
all the obtained sets of record pointers.
 Then fetch records from file
 If some conditions do not have appropriate indices, apply test in
memory. 52
[CONTD…]
 Disjunction:1 2 . . . n (r).
 A11 (disjunctive selection by union of identifiers).
 Applicable if all conditions have available indices.
 Otherwise use linear scan.
 Use corresponding index for each condition, and take union
of all the obtained sets of record pointers.
 Then fetch records from file
 Negation: σ¬θ(r)
 Use linear scan on file
 If very few records satisfy ¬θ, and an index is applicable to ¬θ
 Find satisfying records using index and fetch from file
JOIN OPERATIONS
 Several different algorithms to implement joins
 Nested-loop join
 Block nested-loop join
 Indexed nested-loop join
 Merge-join
 Hash-join
 Choice based on cost estimate
 Examples use the following information
 Number of records of customer: 10,000 depositor: 5000
 Number of blocks of customer: 400 depositor: 100
NESTED – LOOP JOIN
 To compute the theta join r ⋈θ s
for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see if they satisfy the join condition θ
        if they do, add tr • ts to the result.
    end
end
 r is called the outer relation and s the inner relation of the
join.
 Requires no indices and can be used with any kind of join
condition.
 Expensive since it examines every pair of tuples in the two
relations.
[CONTD…]
 In the worst case, if there is enough memory only to hold one block of each relation,
the estimated cost is
nr ∗ bs + br block transfers, plus nr + br seeks
 If the smaller relation fits entirely in memory, use that as the inner relation.
 Reduces cost to br + bs block transfers and 2 seeks
 Assuming worst case memory availability cost estimate is
 with depositor as outer relation:
 5000 ∗ 400 + 100 = 2,000,100 block transfers,
 5000 + 100 = 5100 seeks
 with customer as the outer relation
 10000 ∗ 100 + 400 = 1,000,400 block transfers and 10,400 seeks
 If smaller relation (depositor) fits entirely in memory, the cost estimate will be 500
block transfers.
 Block nested-loops algorithm is preferable.
BLOCK NESTED - LOOP JOIN
 Variant of nested-loop join in which every block of inner
relation is paired with every block of outer relation.
for each block Br of r do begin
    for each block Bs of s do begin
        for each tuple tr in Br do begin
            for each tuple ts in Bs do begin
                check if (tr, ts) satisfy the join condition
                if they do, add tr • ts to the result.
            end
        end
    end
end
[CONTD…]
 Worst case estimate: br ∗ bs + br block transfers + 2 ∗ br seeks
 Each block in the inner relation s is read once for each block in the outer
relation (instead of once for each tuple in the outer relation)
 Best case: br + bs block transfers + 2 seeks.
 Improvements to nested loop and block nested loop algorithms:
 In block nested-loop, use M − 2 disk blocks as blocking unit for outer
relations, where M = memory size in blocks; use remaining two blocks
to buffer inner relation and output
 Cost = ⌈br / (M−2)⌉ ∗ bs + br block transfers + 2⌈br / (M−2)⌉ seeks
 If equi-join attribute forms a key on inner relation, stop inner loop on
first match
 Scan inner loop forward and backward alternately, to make use of the
blocks remaining in buffer (with LRU replacement)
 Use index on inner relation if available
INDEXED NESTED - LOOP JOIN
 Index lookups can replace file scans if
 join is an equi-join or natural join and
 an index is available on the inner relation’s join attribute
 Can construct an index just to compute a join.
 For each tuple tr in the outer relation r, use the index to look up
tuples in s that satisfy the join condition with tuple tr.
 Worst case: buffer has space for only one page of r, and, for each
tuple in r, we perform an index lookup on s.
 Cost of the join: br ∗ (tT + tS) + nr ∗ c
 Where c is the cost of traversing index and fetching all matching s
tuples for one tuple of r
 c can be estimated as cost of a single selection on s using the join
condition.
 If indices are available on join attributes of both r and s,
use the relation with fewer tuples as the outer relation.
EXAMPLE
 Compute depositor ⋈ customer, with depositor as the outer relation.
 Let customer have a primary B+-tree index on the join attribute
customer-name, which contains 20 entries in each index node.
 Since customer has 10,000 tuples, the height of the tree is 4, and one
more access is needed to find the actual data
 depositor has 5000 tuples
 Cost of block nested loops join
 400*100 + 100 = 40,100 block transfers + 2 * 100 = 200 seeks
 assuming worst case memory
 may be significantly less with more memory
 Cost of indexed nested loops join
 100 + 5000 ∗ 5 = 25,100 block transfers and seeks.
 CPU cost likely to be less than that for block nested loops join
MERGE JOIN
1. Sort both relations on their join
attribute (if not already sorted on
the join attributes).
2. Merge the sorted relations to join
them
1. Join step is similar to the
merge stage of the sort-merge
algorithm.
2. Main difference is handling of
duplicate values in join
attribute — every pair with
same value on join attribute
must be matched
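A minimal Python sketch of the merge step, including the duplicate handling noted in point 2 (an illustration, not the textbook's code):

def merge_join(r, s, key):
    # Both inputs must already be sorted on key(t).
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if key(r[i]) < key(s[j]):
            i += 1
        elif key(r[i]) > key(s[j]):
            j += 1
        else:
            # Every pair with the same join-attribute value must be matched:
            # collect the run of equal s-tuples, pair it with each equal r-tuple.
            k, j0 = key(r[i]), j
            while j < len(s) and key(s[j]) == k:
                j += 1
            while i < len(r) and key(r[i]) == k:
                out.extend((r[i], ts) for ts in s[j0:j])
                i += 1
    return out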
[CONTD…]
 Can be used only for equi-joins and natural joins
 Each block needs to be read only once (assuming all tuples for any
given value of the join attributes fit in memory)
 Thus the cost of merge join is:
br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
 + the cost of sorting if relations are unsorted.
 hybrid merge-join: If one relation is sorted, and the other has a
secondary B+-tree index on the join attribute
 Merge the sorted relation with the leaf entries of the B+-tree .
 Sort the result on the addresses of the unsorted relation’s tuples
 Scan the unsorted relation in physical address order and merge with
previous result, to replace addresses by the actual tuples
 Sequential scan more efficient than random lookup
HASH JOIN
 Applicable for equi-joins and natural joins.
 A hash function h is used to partition tuples of both relations
 Intuition: partitions fit in memory
 h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs
denotes the common attributes of r and s used in the natural
join.
 r0, r1, . . ., rn denote partitions of r tuples
 Each tuple tr ∈ r is put in partition ri where i = h(tr[JoinAttrs]).
 s0, s1, . . ., sn denote partitions of s tuples
 Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
 Note: In book, ri is denoted as Hri, si is denoted as Hsi and
n is denoted as nh.
[CONTD…]
 r tuples in ri need only to be
compared with s tuples in si;
they need not be compared with s
tuples in any other partition,
since:
 an r tuple and an s tuple that
satisfy the join condition will
have the same value for the join
attributes.
 If that value is hashed to some
value i, the r tuple has to be in
ri and the s tuple in si.
[CONTD…]
 The hash-join of r and s is computed as follows.
1. Partition the relation s using hashing function h.
1. When partitioning a relation, one block of memory is
reserved as the output buffer for each partition, and one
block for input
2. If extra memory is available, allocate bb blocks as buffer for
input and each output
2. Partition r similarly.
[CONTD…]
3. For each partition i:
(a) Load si into memory and build an in-memory hash index on
it using the join attribute.
 This hash index uses a different hash function than the earlier one
h.
(b) Read the tuples in ri from the disk one by one.
 For each tuple tr probe the in-memory hash index to find all
matching tuples ts in si
 For each matching tuple ts in si
 output the concatenation of the attributes of tr and ts
 Relation s is called the build input and
r is called the probe input.
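A minimal in-memory Python sketch of the partition/build/probe steps above (the per-partition dict plays the role of the in-memory hash index; names are assumptions for illustration):

from collections import defaultdict

def hash_join(r, s, key, n=8):
    h = lambda t: hash(key(t)) % n
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[h(t)].append(t)          # partition probe input r
    for t in s:
        s_parts[h(t)].append(t)          # partition build input s
    out = []
    for i in range(n):
        index = defaultdict(list)        # build phase: hash index on s_i
        for ts in s_parts[i]:
            index[key(ts)].append(ts)
        for tr in r_parts[i]:            # probe phase: one lookup per r_i tuple
            out.extend((tr, ts) for ts in index[key(tr)])
    return out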
[CONTD…]
 The value n and the hash function h is chosen such that each si
should fit in memory.
 Typically n is chosen as ⌈bs/M⌉ ∗ f where f is a “fudge factor”,
typically around 1.2
 The probe relation partitions si need not fit in memory
 Recursive partitioning required if number of partitions n is greater
than number of pages M of memory.
 instead of partitioning n ways, use M – 1 partitions for s
 Further partition the M – 1 partitions using a different hash
function
 Use same partitioning method on r
 Rarely required: e.g., recursive partitioning not needed for
relations of 1GB or less with memory size of 2MB, with block
size of 4KB.
HANDLING OVERFLOW
 Partitioning is said to be skewed if some partitions have
significantly more tuples than some others
 Hash-table overflow occurs in partition si if si does not fit in
memory. Reasons could be
 Many tuples in s with same value for join attributes
 Bad hash function
 Overflow resolution can be done in build phase
 Partition si is further partitioned using different hash function.
 Partition ri must be similarly partitioned.
 Overflow avoidance performs partitioning carefully to avoid
overflows during build phase
 E.g. partition build relation into many partitions, then combine them
 Both approaches fail with large numbers of duplicates
 Fallback option: use block nested loops join on overflowed partitions
[CONTD…]
 If recursive partitioning is not required: cost of hash join is
3(br + bs) + 4 ∗ nh block transfers +
2(⌈br / bb⌉ + ⌈bs / bb⌉) seeks
 If recursive partitioning required:
 number of passes required for partitioning build relation
s is ⌈logM–1(bs)⌉ – 1
 best to choose the smaller relation as the build relation.
 Total cost estimate is:
2(br + bs)(⌈logM–1(bs)⌉ – 1) + br + bs block transfers +
2(⌈br / bb⌉ + ⌈bs / bb⌉)(⌈logM–1(bs)⌉ – 1) seeks
 If the entire build input can be kept in main memory no partitioning
is required
 Cost estimate goes down to br + bs.
EXAMPLE
 Assume that memory size is 20 blocks
 bdepositor= 100 and bcustomer = 400.
 depositor is to be used as build input. Partition it into
five partitions, each of size 20 blocks. This partitioning
can be done in one pass.
 Similarly, partition customer into five partitions,each of
size 80. This is also done in one pass.
 Therefore total cost, ignoring cost of writing partially
filled blocks:
 3(100 + 400) = 1500 block transfers +
2(⌈100/3⌉ + ⌈400/3⌉) = 336 seeks
HYBRID HASH JOIN
 Useful when memory sizes are relatively large, and the build input is
bigger than memory.
 Main feature of hybrid hash join:
Keep the first partition of the build relation in memory.
 E.g. With memory size of 25 blocks, depositor can be partitioned
into five partitions, each of size 20 blocks.
 Division of memory:
 The first partition occupies 20 blocks of memory
 1 block is used for input, and 1 block each for buffering the other 4 partitions.
 customer is similarly partitioned into five partitions each of size 80
 the first is used right away for probing, instead of being written out
 Cost of 3(80 + 320) + 20 + 80 = 1300 block transfers for
hybrid hash join, instead of 1500 with plain hash-join.
 Hybrid hash-join most useful if M >> √bs
QUERY OPTIMIZATION USING
HEURISTICS AND COST
ESTIMATION
INTRODUCTION
 Alternative ways of evaluating a given query
 Equivalent expressions
 Different algorithms for each operation
[CONTD…]
 An evaluation plan defines exactly what algorithm is
used for each operation, and how the execution of the
operations is coordinated.
[CONTD…]
 Cost difference between evaluation plans for a query can be
enormous
 E.g. seconds vs. days in some cases
 Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
 Estimation of plan cost based on:
 Statistical information about relations. Examples:
 number of tuples, number of distinct values for an attribute
 Statistics estimation for intermediate results
 to compute cost of complex expressions
 Cost formulae for algorithms, computed using statistics
GENERATING EQUIVALENT EXPRESSIONS –
TRANSFORMATION OF RELATIONAL EXPRESSIONS
 Two relational algebra expressions are said to be equivalent if the
two expressions generate the same set of tuples on every legal
database instance
 Note: order of tuples is irrelevant
 In SQL, inputs and outputs are multisets of tuples
 Two expressions in the multiset version of the relational algebra
are said to be equivalent if the two expressions generate the same
multiset of tuples on every legal database instance.
 An equivalence rule says that expressions of two forms are
equivalent
 Can replace expression of first form by second, or vice versa
GENERATING EQUIVALENT EXPRESSIONS –
EQUIVALENCE RULE
1. Conjunctive selection operations can be deconstructed into a
sequence of individual selections.
   σθ1∧θ2(E) = σθ1(σθ2(E))
2. Selection operations are commutative.
   σθ1(σθ2(E)) = σθ2(σθ1(E))
3. Only the last in a sequence of projection operations is
needed, the others can be omitted.
   ΠL1(ΠL2(. . . (ΠLn(E)) . . .)) = ΠL1(E)
4. Selections can be combined with Cartesian products and
theta joins.
   a. σθ(E1 × E2) = E1 ⋈θ E2
   b. σθ1(E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2
[CONTD…]
5. Theta-join operations (and natural joins) are
commutative.
   E1 ⋈θ E2 = E2 ⋈θ E1
6. (a) Natural join operations are associative:
   (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
(b) Theta joins are associative in the following manner:
   (E1 ⋈θ1 E2) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3)
where θ2 involves attributes from only E2 and E3.
[CONTD…]
7. The selection operation distributes over the theta join
operation under the following two conditions:
(a) When all the attributes in θ0 involve only the attributes of one
of the expressions (E1) being joined:
   σθ0(E1 ⋈θ E2) = (σθ0(E1)) ⋈θ E2
(b) When θ1 involves only the attributes of E1 and θ2 involves
only the attributes of E2:
   σθ1∧θ2(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2))
[CONTD…]
8. The projection operation distributes over the theta join
operation as follows:
(a) if θ involves only attributes from L1 ∪ L2:
   ΠL1∪L2(E1 ⋈θ E2) = (ΠL1(E1)) ⋈θ (ΠL2(E2))
(b) Consider a join E1 ⋈θ E2.
 Let L1 and L2 be sets of attributes from E1 and E2,
respectively.
 Let L3 be attributes of E1 that are involved in join condition θ,
but are not in L1 ∪ L2, and
 let L4 be attributes of E2 that are involved in join condition θ,
but are not in L1 ∪ L2. Then:
   ΠL1∪L2(E1 ⋈θ E2) = ΠL1∪L2((ΠL1∪L3(E1)) ⋈θ (ΠL2∪L4(E2)))
[CONTD…]
9. The set operations union and intersection are commutative
   E1 ∪ E2 = E2 ∪ E1
   E1 ∩ E2 = E2 ∩ E1
 (set difference is not commutative).
10. Set union and intersection are associative.
   (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
   (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
11. The selection operation distributes over ∪, ∩ and –.
   σθ(E1 – E2) = σθ(E1) – σθ(E2)
and similarly for ∪ and ∩ in place of –
Also: σθ(E1 – E2) = σθ(E1) – E2
and similarly for ∩ in place of –, but not for ∪
12. The projection operation distributes over union
   ΠL(E1 ∪ E2) = (ΠL(E1)) ∪ (ΠL(E2))
EXAMPLE
 Query: Find the names of all customers with an account at a
Brooklyn branch whose account balance is over $1000.
customer_name((branch_city = “Brooklyn”  balance > 1000
(branch (account depositor)))
 Transformation using join associatively (Rule 6a):
customer_name((branch_city = “Brooklyn”  balance > 1000
(branch account)) depositor)
 Second form provides an opportunity to apply the “perform
selections early” rule, resulting in the subexpression
branch_city = “Brooklyn” (branch)  balance > 1000
(account)
 Thus a sequence of transformations can be useful
TRANSFORMATION EXAMPLE: PUSHING PROJECTIONS
 When we compute
σbranch_city = “Brooklyn”(branch) ⋈ account
we obtain a relation whose schema is:
(branch_name, branch_city, assets, account_number, balance)
 Push projections using equivalence rules 8a and 8b; eliminate
unneeded attributes from intermediate results to rewrite
Πcustomer_name((σbranch_city = “Brooklyn”(branch) ⋈ account) ⋈ depositor)
as
Πcustomer_name((Πaccount_number(σbranch_city = “Brooklyn”(branch) ⋈ account)) ⋈ depositor)
 Performing the projection as early as possible reduces the size of the
relation to be joined.
JOIN ORDERING EXAMPLE
 For all relations r1, r2, and r3,
(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
(Join Associativity)
 If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose
(r1 ⋈ r2) ⋈ r3
so that we compute and store a smaller temporary relation.
JOIN ORDERING EXAMPLE (CONT.)
 Consider the expression
Πcustomer_name((σbranch_city = “Brooklyn”(branch)) ⋈ (account ⋈ depositor))
 Could compute account ⋈ depositor first, and join
result with
σbranch_city = “Brooklyn”(branch)
but account ⋈ depositor is likely to be a large relation.
 Only a small fraction of the bank’s customers are likely to
have accounts in branches located in Brooklyn
 it is better to compute
σbranch_city = “Brooklyn”(branch) ⋈ account
first.
GENERATING EQUIVALENT EXPRESSIONS –
ENUMERATION OF EQUIVALENT EXPRESSIONS
 Query optimizers use equivalence rules to systematically
generate expressions equivalent to the given expression
 Can generate all equivalent expressions as follows:
 Repeat
 apply all applicable equivalence rules on every equivalent expression
found so far
 add newly generated expressions to the set of equivalent expressions
Until no new equivalent expressions are generated above
 The above approach is very expensive in space and time
 Two approaches
 Optimized plan generation based on transformation rules
 Special case approach for queries with only selections, projections and
joins
GENERATING EQUIVALENT EXPRESSIONS – IMPLEMENTING
TRANSFORMATION BASED OPTIMIZATION
 Space requirements reduced by sharing common sub-expressions:
 when E1 is generated from E2 by an equivalence rule, usually only the top
level of the two are different, subtrees below are the same and can be shared
using pointers
 E.g. when applying join commutativity
 Same sub-expression may get generated multiple times
 Detect duplicate sub-expressions and share one copy
 Time requirements are reduced by not generating all expressions
 Dynamic programming
 We will study only the special case of dynamic programming for join order
optimization
COST ESTIMATION
 Cost of each operator computed as described in Chapter 13
 Need statistics of input relations
 E.g. number of tuples, sizes of tuples
 Inputs can be results of sub-expressions
 Need to estimate statistics of expression results
 To do so, we require additional statistics
 E.g. number of distinct values for an attribute
 More on cost estimation later
CHOICE OF EVALUATION PLANS
 Must consider the interaction of evaluation techniques when
choosing evaluation plans
 choosing the cheapest algorithm for each operation independently
may not yield best overall algorithm. E.g.
 merge-join may be costlier than hash-join, but may provide a sorted
output which reduces the cost for an outer level aggregation.
 nested-loop join may provide opportunity for pipelining
 Practical query optimizers incorporate elements of the
following two broad approaches:
1. Search all the plans and choose the best plan in a
cost-based fashion.
2. Uses heuristics to choose a plan.
COST-BASED OPTIMIZATION
 Consider finding the best join-order for r1 ⋈ r2 ⋈ . . . ⋈ rn.
 There are (2(n – 1))!/(n – 1)! different join orders for above
expression. With n = 7, the number is 665280, with n = 10,
the number is greater than 176 billion!
 No need to generate all the join orders. Using dynamic
programming, the least-cost join order for any subset of
{r1, r2, . . . rn} is computed only once and stored for future
use.
DYNAMIC PROGRAMMING IN OPTIMIZATION
 To find best join tree for a set of n relations:
 To find best plan for a set S of n relations, consider all possible
plans of the form: S1 ⋈ (S – S1) where S1 is any non-empty
subset of S.
 Recursively compute costs for joining subsets of S to find the
cost of each plan. Choose the cheapest of the 2ⁿ – 1
alternatives.
 Base case for recursion: single relation access plan
 Apply all selections on Ri using best choice of indices on Ri
 When plan for any subset is computed, store it and reuse it
when it is required again, instead of recomputing it
 Dynamic programming
JOIN ORDER OPTIMIZATION ALGORITHM
procedure findbestplan(S)
    if (bestplan[S].cost ≠ ∞)
        return bestplan[S]
    // else bestplan[S] has not been computed earlier, compute it now
    if (S contains only 1 relation)
        set bestplan[S].plan and bestplan[S].cost based on the best way
            of accessing S /* using selections on S and indices on S */
    else for each non-empty subset S1 of S such that S1 ≠ S
        P1 = findbestplan(S1)
        P2 = findbestplan(S − S1)
        A = best algorithm for joining results of P1 and P2
        cost = P1.cost + P2.cost + cost of A
        if cost < bestplan[S].cost
            bestplan[S].cost = cost
            bestplan[S].plan = “execute P1.plan; execute P2.plan;
                join results of P1 and P2 using A”
    return bestplan[S]
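A runnable Python rendering of findbestplan under a toy cost model (the relation sizes and the join-cost formula are assumptions for illustration; a real optimizer would use the statistics discussed later):

from functools import lru_cache
from itertools import combinations

sizes = {"r1": 1000, "r2": 10, "r3": 100}    # hypothetical base-relation sizes

@lru_cache(maxsize=None)                     # the cache plays the role of bestplan[]
def findbestplan(S):
    if len(S) == 1:
        (r,) = S
        return 0, sizes[r], r                # (cost, estimated size, plan)
    best = (float("inf"), 0, None)
    for k in range(1, len(S)):
        for S1 in map(frozenset, combinations(S, k)):
            c1, n1, p1 = findbestplan(S1)
            c2, n2, p2 = findbestplan(S - S1)
            cost = c1 + c2 + n1 * n2         # toy join cost: product of input sizes
            if cost < best[0]:
                best = (cost, min(n1, n2), f"({p1} JOIN {p2})")
    return best

print(findbestplan(frozenset(sizes)))        # cheapest bushy join order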
LEFT DEEP JOIN TREES
 In left-deep join trees, the right-hand-side input for
each join is a relation, not the result of an
intermediate join.
COST OF OPTIMIZATION
 With dynamic programming time complexity of optimization with bushy
trees is O(3ⁿ).
 With n = 10, this number is 59000 instead of 176 billion!
 Space complexity is O(2ⁿ)
 To find best left-deep join tree for a set of n relations:
 Consider n alternatives with one relation as right-hand side input and the
other relations as left-hand side input.
 Modify optimization algorithm:
 Replace “for each non-empty subset S1 of S such that S1 ≠ S”
 By: for each relation r in S
let S1 = S – r.
 If only left-deep trees are considered, time complexity of finding best
join order is O(n 2ⁿ)
 Space complexity remains at O(2ⁿ)
 Cost-based optimization is expensive, but worthwhile for queries on
large datasets (typical queries have small n, generally < 10)
INTERESTING SORT ORDERS
 Consider the expression (r1 ⋈ r2) ⋈ r3 (with A as common attribute)
 An interesting sort order is a particular sort order of tuples that could
be useful for a later operation
 Using merge-join to compute r1 ⋈ r2 may be costlier than hash join but
generates result sorted on A
 Which in turn may make merge-join with r3 cheaper, reducing the
overall cost of the query
 Sort order may also be useful for order by and for grouping
 Not sufficient to find the best join order for each subset of the set of n
given relations
 must find the best join order for each subset, for each interesting sort order
 Simple extension of earlier dynamic programming algorithms
 Usually, number of interesting orders is quite small and doesn’t affect
time/space complexity significantly
HEURISTIC OPTIMIZATION
 Cost-based optimization is expensive, even with dynamic programming.
 Systems may use heuristics to reduce the number of choices that must
be made in a cost-based fashion.
 Heuristic optimization transforms the query-tree by using a set of rules
that typically (but not in all cases) improve execution performance:
 Perform selection early (reduces the number of tuples)
 Perform projection early (reduces the number of attributes)
 Perform most restrictive selection and join operations (i.e. with smallest
result size) before other similar operations.
 Some systems use only heuristics, others combine heuristics with partial
cost-based optimization.
STRUCTURE OF QUERY OPTIMIZERS
 Many optimizers consider only left-deep join orders.
 Plus heuristics to push selections and projections down the query
tree
 Reduces optimization complexity and generates plans amenable
to pipelined evaluation.
 Heuristic optimization used in some versions of Oracle:
 Repeatedly pick “best” relation to join next
 Starting from each of n starting points. Pick best among these
 Intricacies of SQL complicate query optimization
 E.g. nested subqueries
[CONTD…]
 Some query optimizers integrate heuristic selection and the generation
of alternative access plans.
 Frequently used approach
 heuristic rewriting of nested block structure and aggregation
 followed by cost-based join-order optimization for each block
 Some optimizers (e.g. SQL Server) apply transformations to entire
query and do not depend on block structure
 Even with the use of heuristics, cost-based query optimization imposes
a substantial overhead.
 But is worth it for expensive queries
 Optimizers often use simple heuristics for very cheap queries, and
perform exhaustive enumeration for more expensive queries
STATISTICS FOR COST ESTIMATION - STATISTICAL
INFORMATION FOR COST ESTIMATION
 nr: number of tuples in a relation r.
 br: number of blocks containing tuples of r.
 lr: size of a tuple of r.
 fr: blocking factor of r — i.e., the number of tuples of r that fit into
one block.
 V(A, r): number of distinct values that appear in r for attribute A;
same as the size of ΠA(r).
 If tuples of r are stored together physically in a file, then:
   br = ⌈nr / fr⌉
STATISTICS FOR COST ESTIMATION -
HISTOGRAMS
 Histogram on attribute age of relation person
 Equi-width histograms
 Equi-depth histograms
STATISTICS FOR COST ESTIMATION - SELECTION SIZE
ESTIMATION
 σA=v(r)
 nr / V(A,r): number of records that will satisfy the selection
 Equality condition on a key attribute: size estimate = 1
 σA≤v(r) (case of σA≥v(r) is symmetric)
 Let c denote the estimated number of tuples satisfying the
condition.
 If min(A,r) and max(A,r) are available in catalog
 c = 0 if v < min(A,r)
 c = nr ∗ (v − min(A,r)) / (max(A,r) − min(A,r)) otherwise
 If histograms available, can refine above estimate
 In absence of statistical information c is assumed to be nr / 2.
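For instance (numbers assumed for illustration): with nr = 10,000 and V(A,r) = 50, the estimate for σA=v(r) is 10,000 / 50 = 200 tuples; and with min(A,r) = 0, max(A,r) = 100, v = 25, the estimate for σA≤25(r) is c = 10,000 ∗ (25 − 0) / (100 − 0) = 2,500 tuples.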
STATISTICS FOR COST ESTIMATION - SIZE ESTIMATION
OF COMPLEX SELECTIONS
 The selectivity of a condition θi is the probability that a tuple in
the relation r satisfies θi.
 If si is the number of satisfying tuples in r, the selectivity of θi is
given by si / nr.
 Conjunction: σθ1∧θ2∧. . .∧θn(r). Assuming independence, estimate
of tuples in the result is:
   nr ∗ (s1 ∗ s2 ∗ . . . ∗ sn) / nrⁿ
 Disjunction: σθ1∨θ2∨. . .∨θn(r). Estimated number of tuples:
   nr ∗ (1 − (1 − s1/nr) ∗ (1 − s2/nr) ∗ . . . ∗ (1 − sn/nr))
 Negation: σ¬θ(r). Estimated number of tuples:
   nr − size(σθ(r))
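For instance (numbers assumed for illustration): with nr = 10,000, s1 = 1,000 and s2 = 500, the conjunction estimate is 10,000 ∗ (1,000 ∗ 500) / 10,000² = 50 tuples, while the disjunction estimate is 10,000 ∗ (1 − (1 − 0.1) ∗ (1 − 0.05)) = 1,450 tuples.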
STATISTICS FOR COST ESTIMATION - JOIN
OPERATION: RUNNING EXAMPLE
Running example:
depositor ⋈ customer
Catalog information for join examples:
 ncustomer = 10,000.
 fcustomer = 25, which implies that
bcustomer =10000/25 = 400.
 ndepositor = 5000.
 fdepositor = 50, which implies that
bdepositor = 5000/50 = 100.
 V(customer_name, depositor) = 2500, which implies that , on
average, each customer has two accounts.
 Also assume that customer_name in depositor is a foreign key on
customer.
 V(customer_name, customer) = 10000 (primary key!)
STATISTICS FOR COST ESTIMATION - ESTIMATION
OF THE SIZE OF JOINS
 The Cartesian product r × s contains nr ∗ ns tuples; each
tuple occupies sr + ss bytes.
 If R ∩ S = ∅, then r ⋈ s is the same as r × s.
 If R ∩ S is a key for R, then a tuple of s will join with at
most one tuple from r
 therefore, the number of tuples in r ⋈ s is no greater than the
number of tuples in s.
 If R ∩ S in S is a foreign key in S referencing R, then the
number of tuples in r ⋈ s is exactly the same as the
number of tuples in s.
 The case for R ∩ S being a foreign key referencing S is symmetric.
 In the example query depositor ⋈ customer,
customer_name in depositor is a foreign key of customer
 hence, the result has exactly ndepositor tuples, which is 5000
STATISTICS FOR COST ESTIMATION - ESTIMATION OF
THE SIZE OF JOINS (CONT.)
 If R  S = {A} is not a key for R or S.
If we assume that every tuple t in R produces tuples in R S, the
number of tuples in R S is estimated to be:
If the reverse is true, the estimate obtained will be:
The lower of these two estimates is probably the more accurate one.
 Can improve on above if histograms are available
 Use formula similar to above, for each cell of histograms on the
two relations
),( sAV
nn sr 
),( rAV
nn sr 
[CONTD…]
 Compute the size estimates for depositor ⋈ customer
without using information about foreign keys:
 V(customer_name, depositor) = 2500, and
V(customer_name, customer) = 10000
 The two estimates are 5000 ∗ 10000/2500 = 20,000 and 5000
∗ 10000/10000 = 5000
 We choose the lower estimate, which in this case, is the same
as our earlier computation using foreign keys.
STATISTICS FOR COST ESTIMATION - SIZE
ESTIMATION FOR OTHER OPERATIONS
 Projection: estimated size of ΠA(r) = V(A,r)
 Aggregation: estimated size of AgF(r) = V(A,r)
 Set operations
 For unions/intersections of selections on the same relation:
rewrite and use size estimate for selections
 E.g. σθ1(r) ∪ σθ2(r) can be rewritten as σθ1∨θ2(r)
 For operations on different relations:
 estimated size of r ∪ s = size of r + size of s.
 estimated size of r ∩ s = minimum of size of r and size of s.
 estimated size of r – s = size of r.
 All the three estimates may be quite inaccurate, but provide upper
bounds on the sizes.
[CONTD…]
 Outer join:
 Estimated size of r ⟕ s = size of r ⋈ s + size of r
 Case of right outer join is symmetric
 Estimated size of r ⟗ s = size of r ⋈ s + size of r + size
of s
STATISTICS FOR COST ESTIMATION - ESTIMATION OF
NUMBER OF DISTINCT VALUES
Selections: σθ(r)
 If θ forces A to take a specified value: V(A, σθ(r)) = 1.
 e.g., A = 3
 If θ forces A to take on one of a specified set of values:
V(A, σθ(r)) = number of specified values.
 (e.g., (A = 1 ∨ A = 3 ∨ A = 4)),
 If the selection condition θ is of the form A op v:
estimated V(A, σθ(r)) = V(A,r) ∗ s
 where s is the selectivity of the selection.
 In all the other cases: use approximate estimate of
min(V(A,r), nσθ(r))
 More accurate estimate can be got using probability theory,
but this one works fine generally
[CONTD…]
Joins: r ⋈ s
 If all attributes in A are from r:
estimated V(A, r ⋈ s) = min(V(A,r), nr⋈s)
 If A contains attributes A1 from r and A2 from s, then
estimated
V(A, r ⋈ s) =
min(V(A1,r) ∗ V(A2 – A1,s), V(A1 – A2,r) ∗ V(A2,s), nr⋈s)
 More accurate estimate can be got using probability theory,
but this one works fine generally
[CONTD…]
 Estimation of distinct values is straightforward for
projections.
 They are the same in ΠA(r) as in r.
 For aggregated values
 For min(A) and max(A), the number of distinct values can be
estimated as min(V(A,r), V(G,r)) where G denotes grouping
attributes
 For other aggregates, assume all values are distinct, and use
V(G,r)
OPTIMIZING NESTED SUBQUERIES
 Nested query example:
select customer_name
from borrower
where exists (select *
from depositor
where depositor.customer_name =
borrower.customer_name)
 SQL conceptually treats nested subqueries in the where clause as functions
that take parameters and return a single value or set of values
 Parameters are variables from outer level query that are used in the nested
subquery; such variables are called correlation variables
 Conceptually, nested subquery is executed once for each tuple in the
cross-product generated by the outer level from clause
 Such evaluation is called correlated evaluation
 Note: other conditions in where clause may be used to compute a join (instead
of a cross-product) before executing the nested subquery
[CONTD…]
 Correlated evaluation may be quite inefficient since
 a large number of calls may be made to the nested query
 there may be unnecessary random I/O as a result
 SQL optimizers attempt to transform nested subqueries to joins where
possible, enabling use of efficient join techniques
 E.g.: earlier nested query can be rewritten as
select customer_name
from borrower, depositor
where depositor.customer_name = borrower.customer_name
 Note: the two queries generate different numbers of duplicates (why?)
 Borrower can have duplicate customer-names
 Can be modified to handle duplicates correctly as we will see
 In general, it is not possible/straightforward to move the entire nested
subquery from clause into the outer level query from clause
 A temporary relation is created instead, and used in body of outer level query
[CONTD…]
In general, SQL queries of the form below can be rewritten as shown
 Rewrite: select …
          from L1
          where P1 and exists (select *
                               from L2
                               where P2)
 To: create table t1 as
        select distinct V
        from L2
        where P2¹
     select …
     from L1, t1
     where P1 and P2²
 P2¹ contains predicates in P2 that do not involve any correlation
variables
 P2² reintroduces predicates involving correlation variables, with
relations renamed appropriately
 V contains all attributes used in predicates with correlation variables
[CONTD…]
 In our example, the original nested query would be transformed to
   create table t1 as
       select distinct customer_name
       from depositor;
   select customer_name
   from borrower, t1
   where t1.customer_name = borrower.customer_name
 The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation.
 Decorrelation is more complicated when
  the nested subquery uses aggregation, or
  the result of the nested subquery is used to test for equality, or
  the condition linking the nested subquery to the outer query is not exists,
  and so on.
MATERIALIZED VIEWS
 A materialized view is a view whose contents are computed
and stored.
 Consider the view
create view branch_total_loan(branch_name, total_loan) as
select branch_name, sum(amount)
from loan
group by branch_name
 Materializing the above view would be very useful if the total
loan amount is required frequently
 Saves the effort of finding multiple tuples and adding up
their amounts
MATERIALIZED VIEW MAINTENANCE
 The task of keeping a materialized view up-to-date with the
underlying data is known as materialized view maintenance
 Materialized views can be maintained by recomputation on every
update
 A better option is to use incremental view maintenance
 Changes to database relations are used to compute changes to the
materialized view, which is then updated
 View maintenance can be done by
 Manually defining triggers on insert, delete, and update of each relation
in the view definition
 Manually written code to update the view whenever database relations
are updated
 Periodic recomputation (e.g. nightly)
 Many database systems provide direct support for incremental view maintenance, which avoids the manual effort and correctness issues of the approaches above
INCREMENTAL VIEW MAINTENANCE
 The changes (inserts and deletes) to a relation or expression are referred to as its differential
  The sets of tuples inserted into and deleted from r are denoted i_r and d_r
 To simplify our description, we only consider inserts and deletes
  We replace an update to a tuple by a deletion of the tuple followed by an insertion of the updated tuple
 We describe how to compute the change to the result of each
relational operation, given changes to its inputs
 We then outline how to handle relational algebra expressions
JOIN OPERATION
 Consider the materialized view v = r ⋈ s and an update to r
 Let r_old and r_new denote the old and new states of relation r
 Consider the case of an insert to r:
  We can write r_new ⋈ s as (r_old ∪ i_r) ⋈ s
  And rewrite the above to (r_old ⋈ s) ∪ (i_r ⋈ s)
  But (r_old ⋈ s) is simply the old value of the materialized view, so the incremental change to the view is just i_r ⋈ s
 Thus, for inserts: v_new = v_old ∪ (i_r ⋈ s)
 Similarly, for deletes: v_new = v_old – (d_r ⋈ s)
Example: if r = {(A,1), (B,2)} and s = {(1,p), (2,r), (2,s)}, then v = r ⋈ s = {(A,1,p), (B,2,r), (B,2,s)}; inserting i_r = {(C,2)} into r adds i_r ⋈ s = {(C,2,r), (C,2,s)} to the view.
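A minimal Python sketch of this rule on the example above, representing relations as sets of tuples (the join here matches the second attribute of r with the first attribute of s, as in the example):

    # Sketch of incremental join maintenance: v = r join s.
    def join(r, s):
        """Join the last attribute of an r-tuple with the first of an s-tuple."""
        return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

    r = {("A", 1), ("B", 2)}
    s = {(1, "p"), (2, "r"), (2, "s")}
    v = join(r, s)                  # old view

    i_r = {("C", 2)}                # tuples inserted into r
    v |= join(i_r, s)               # v_new = v_old U (i_r join s)
    r |= i_r

    d_r = {("B", 2)}                # tuples deleted from r
    v -= join(d_r, s)               # v_new = v_old - (d_r join s)
    r -= d_r

    print(sorted(v))                # [('A', 1, 'p'), ('C', 2, 'r'), ('C', 2, 's')]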
SELECTION AND PROJECTION OPERATIONS
 Selection: consider a view v = σθ(r).
  v_new = v_old ∪ σθ(i_r)
  v_new = v_old – σθ(d_r)
 Projection is a more difficult operation
  Let R = (A, B), and r(R) = {(a,2), (a,3)}
  ΠA(r) has a single tuple (a).
  If we delete the tuple (a,2) from r, we should not delete the tuple (a) from ΠA(r); but if we then delete (a,3) as well, we should delete the tuple (a)
 For each tuple in a projection ΠA(r), we keep a count of how many times it was derived
  On insert of a tuple into r, if the resultant tuple is already in ΠA(r) we increment its count, else we add a new tuple with count = 1
  On delete of a tuple from r, we decrement the count of the corresponding tuple in ΠA(r)
   if the count becomes 0, we delete the tuple from ΠA(r)
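The counting scheme can be sketched as follows; tuples are projected onto their first attribute, standing in for ΠA(r) (a minimal sketch using the (a,2), (a,3) example above):

    # Sketch of count-based maintenance for a projection view Pi_A(r).
    from collections import Counter

    counts = Counter()               # derivation count per projected tuple

    def project(t):
        return t[0]                  # A is the first attribute, by assumption

    def on_insert(t):
        counts[project(t)] += 1      # tuple appears when count goes 0 -> 1

    def on_delete(t):
        key = project(t)
        counts[key] -= 1
        if counts[key] == 0:         # last derivation gone: remove from view
            del counts[key]

    for t in [("a", 2), ("a", 3)]:
        on_insert(t)
    on_delete(("a", 2))
    print(set(counts))               # {'a'}  -- (a) survives
    on_delete(("a", 3))
    print(set(counts))               # set()  -- now (a) is removed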
AGGREGATION OPERATIONS
 count: v = Aγcount(B)(r), where A denotes the grouping attributes
  When a set of tuples i_r is inserted:
   for each tuple t in i_r, if the corresponding group is already present in v, we increment its count, else we add a new tuple with count = 1
  When a set of tuples d_r is deleted:
   for each tuple t in d_r, we look for the group t.A in v, and subtract 1 from the count for the group
   if the count becomes 0, we delete from v the tuple for the group t.A
 sum: v = Aγsum(B)(r)
  We maintain the sum in a manner similar to count, except we add/subtract the B value instead of adding/subtracting 1 for the count
  Additionally, we maintain the count in order to detect groups with no tuples; such groups are deleted from v
   Cannot simply test for sum = 0 (why?)
 To handle the case of avg, we maintain the sum and count aggregate values separately, and divide at the end
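A sketch of this bookkeeping for count, sum, and avg, keeping a (count, sum) pair per group (the group names and values are invented):

    # Sketch of incremental maintenance for count/sum/avg aggregates:
    # keep (count, sum) per group; a group disappears when its count hits 0.
    groups = {}                      # group key -> [count, sum]

    def on_insert(group, b):
        cnt_sum = groups.setdefault(group, [0, 0])
        cnt_sum[0] += 1
        cnt_sum[1] += b

    def on_delete(group, b):
        cnt_sum = groups[group]
        cnt_sum[0] -= 1
        cnt_sum[1] -= b
        if cnt_sum[0] == 0:          # count, not sum, detects empty groups
            del groups[group]

    def avg(group):
        cnt, total = groups[group]
        return total / cnt           # divide at the end

    on_insert("Hillside", 500)
    on_insert("Hillside", 336)
    on_delete("Hillside", 500)
    print(avg("Hillside"))           # 336.0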
[CONTD…]
 min, max: v = Aγmin(B)(r)
  Handling insertions on r is straightforward.
  Maintaining the aggregate values min and max on deletions may be more expensive: we have to look at the other tuples of r that are in the same group to find the new minimum
OTHER OPERATIONS
 Set intersection: v = r ∩ s
  When a tuple is inserted into r, we check if it is present in s, and if so we add it to v.
  If a tuple is deleted from r, we delete it from the intersection if it is present.
  Updates to s are symmetric.
 The other set operations, union and set difference, are handled in a similar fashion.
 Outer joins are handled in much the same way as joins, but with some extra work; we leave the details to you.
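For example, here is a minimal sketch of maintaining v = r ∩ s under updates to r (sets of integers, invented values):

    # Sketch of incremental maintenance for v = r intersect s.
    r, s = {1, 2, 3}, {2, 3, 4}
    v = r & s                        # old view: {2, 3}

    def insert_into_r(t):
        r.add(t)
        if t in s:                   # joins the intersection only if in s
            v.add(t)

    def delete_from_r(t):
        r.discard(t)
        v.discard(t)                 # leaves v unchanged if t was not there

    insert_into_r(4)                 # v becomes {2, 3, 4}
    delete_from_r(2)                 # v becomes {3, 4}
    print(sorted(v))                 # [3, 4]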
HANDLING EXPRESSIONS
 To handle an entire expression, we derive expressions for computing the incremental change to the result of each sub-expression, starting from the smallest sub-expressions.
 E.g., consider E1 ⋈ E2, where each of E1 and E2 may be a complex expression
  Suppose the set of tuples to be inserted into E1 is given by D1
   Computed earlier, since smaller sub-expressions are handled first
  Then the set of tuples to be inserted into E1 ⋈ E2 is given by D1 ⋈ E2
   This is just the usual way of maintaining joins
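A simplified, insert-only Python sketch of this bottom-up propagation: each node returns its new value together with its differential, and a join node combines its child's differential with its other input (the join convention, matching last attribute to first, is an assumption of the example):

    # Insert-only sketch: each node computes (new_value, differential)
    # from its children's results, smallest sub-expressions first.

    def base(rel, inserted):
        """Leaf: a stored relation plus the tuples just inserted into it."""
        return rel | inserted, inserted

    def join(left, right_value):
        """Join with an unchanged input: the differential is D1 join E2."""
        lv, ld = left
        def j(a, b):
            return {x + y[1:] for x in a for y in b if x[-1] == y[0]}
        return j(lv, right_value), j(ld, right_value)

    E2 = {(1, "p"), (2, "q")}
    E1 = base({("A", 1)}, inserted={("B", 2)})   # D1 = {("B", 2)}
    value, delta = join(E1, E2)
    print(sorted(delta))                         # [('B', 2, 'q')] = D1 join E2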
QUERY OPTIMIZATION AND MATERIALIZED VIEWS
 Rewriting queries to use materialized views:
  A materialized view v = r ⋈ s is available
  A user submits a query r ⋈ s ⋈ t
  We can rewrite the query as v ⋈ t
   Whether to do so depends on cost estimates for the two alternatives
 Replacing a use of a materialized view by the view definition:
  A materialized view v = r ⋈ s is available, but without any index on it
  A user submits a query σA=10(v)
  Suppose also that s has an index on the common attribute B, and r has an index on attribute A
  The best plan for this query may be to replace v by r ⋈ s, which can lead to the query plan σA=10(r) ⋈ s
 The query optimizer should be extended to consider all of the above alternatives and choose the best overall plan
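To make the point about cost estimates concrete, here is a toy Python sketch of how an optimizer might pick between the two alternatives (the cost numbers are invented):

    # Toy cost-based choice between equivalent plans (invented costs).
    plans = {
        "v join t":        1200,   # reuse the materialized view v = r join s
        "r join s join t": 3500,   # expand the view definition instead
    }
    best = min(plans, key=plans.get)
    print(best, plans[best])       # v join t 1200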
MATERIALIZED VIEW SELECTION
 Materialized view selection: “What is the best set of views to materialize?”
 Index selection: “What is the best set of indices to create?”
  closely related to materialized view selection, but simpler
 Materialized view selection and index selection are based on a typical system workload (queries and updates)
  Typical goal: minimize the time to execute the workload, subject to constraints on space and on the time taken for some critical queries/updates
  One of the steps in database tuning; more on tuning in later chapters
 Commercial database systems provide tools (called “tuning assistants” or “wizards”) to help the database administrator choose what indices and materialized views to create
CS8492 – DATABASE
MANAGEMENT SYSTEMS
UNIT V - ADVANCED
TOPICS
OUTLINE
 Distributed Databases: Architecture
 Data Storage, Transaction Processing
 Object-based Databases: Object Database Concepts
 Object-Relational features
 ODMG Object Model, ODL, OQL
 XML Databases: XML Hierarchical Model
 DTD, XML Schema
 XQuery
 Information Retrieval: IR Concepts, Retrieval Models,
Queries in IR systems.
DISTRIBUTED
DATABASES
OUTLINE
 Distributed Database Systems – Introduction
 Distributed Data Storage
 Distributed Transaction
 Commit Protocol
I. DISTRIBUTED DATABASE SYSTEM
 A distributed database system consists of loosely coupled sites
that share no physical component
 Database systems that run on each site are independent of each
other
 Transactions may access data at one or more sites
TYPES OF DISTRIBUTED DATABASES
 In a homogeneous distributed database
 All sites have identical software
 Are aware of each other and agree to cooperate in processing
user requests.
 Each site surrenders part of its autonomy in terms of right to
change schemas or software
 Appears to user as a single system
 In a heterogeneous distributed database
 Different sites may use different schemas and software
 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing
 Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
II. DISTRIBUTED DATA STORAGE
 There are two approaches to store the relation in the
distributed database:
 Replication: The system maintains several identical replicas
(copies) of the relation, and stores each replica at a different
site. The alternative to replication is to store only one copy of
relation r.
 Fragmentation: The system partitions the relation into
several fragments, and stores each fragment at a different site.
1. DATA REPLICATION
 A relation or fragment of a relation is replicated if it is
stored redundantly in two or more sites.
 Full replication of a relation is the case where the relation
is stored at all sites.
 Fully redundant databases are those in which every site
contains a copy of the entire database.
[CONTD…]
 Advantages of Replication
  Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
  Parallelism: queries on r may be processed by several nodes in parallel.
  Reduced data transfer: relation r is available locally at each site containing a replica of r.
 Disadvantages of Replication
  Increased cost of updates: each replica of relation r must be updated.
  Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.
   One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy
2. DATA FRAGMENTATION
 Division of relation r into fragments r1, r2, …, rn which contain
sufficient information to reconstruct relation r.
 Horizontal fragmentation: each tuple of r is assigned to one or more
fragments
 Vertical fragmentation: the schema for relation r is split into several
smaller schemas
 All schemas must contain a common candidate key (or superkey) to
ensure lossless join property.
 A special attribute, the tuple-id attribute may be added to each
schema to serve as a candidate key.
 Example : relation account with following schema
 Account = (account_number, branch_name , balance )
HORIZONTAL FRAGMENTATION OF ACCOUNT RELATION
account1 = σ branch_name=“Hillside” (account):

   account_number   branch_name   balance
   A-305            Hillside      500
   A-226            Hillside      336
   A-155            Hillside      62

account2 = σ branch_name=“Valleyview” (account):

   account_number   branch_name   balance
   A-177            Valleyview    205
   A-402            Valleyview    10000
   A-408            Valleyview    1123
   A-639            Valleyview    750
VERTICAL FRAGMENTATION OF EMPLOYEE_INFO RELATION
deposit1 = Π branch_name, customer_name, tuple_id (employee_info):

   branch_name   customer_name   tuple_id
   Hillside      Lowman          1
   Hillside      Camp            2
   Valleyview    Camp            3
   Valleyview    Kahn            4
   Hillside      Kahn            5
   Valleyview    Kahn            6
   Valleyview    Green           7

deposit2 = Π account_number, balance, tuple_id (employee_info):

   account_number   balance   tuple_id
   A-305            500       1
   A-226            336       2
   A-177            205       3
   A-402            10000     4
   A-155            62        5
   A-408            1123      6
   A-639            750       7
ADVANTAGES OF FRAGMENTATION
 Horizontal:
 allows parallel processing on fragments of a relation
 allows a relation to be split so that tuples are located where they are most
frequently accessed
 Vertical:
 allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
 tuple-id attribute allows efficient joining of vertical fragments
 Vertical and horizontal fragmentation can be mixed.
 Fragments may be successively fragmented to an arbitrary depth.
 Replication and fragmentation can be combined
 Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.
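As a small illustration of these ideas, the following Python sketch fragments the account relation horizontally by branch_name and checks that the fragments union back to the original relation (a toy sketch; the tuples and predicates follow the example above):

    # Toy sketch: horizontal fragmentation of account by branch_name.
    account = [
        ("A-305", "Hillside",   500),
        ("A-226", "Hillside",   336),
        ("A-177", "Valleyview", 205),
    ]

    def fragment(rel, pred):
        """A horizontal fragment is just a selection over the relation."""
        return [t for t in rel if pred(t)]

    account1 = fragment(account, lambda t: t[1] == "Hillside")    # site 1
    account2 = fragment(account, lambda t: t[1] == "Valleyview")  # site 2

    # Reconstruction: the union of the fragments restores the relation.
    assert sorted(account1 + account2) == sorted(account)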
3. DATA TRANSPARENCY
 Data transparency: Degree to which system user may
remain unaware of the details of how and where the data
items are stored in a distributed system
 Consider transparency issues in relation to:
 Fragmentation transparency
 Replication transparency
 Location transparency
 Naming of data items: criteria
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items
transparently.
4. Each site should be able to create new data items autonomously.
CENTRALIZED SCHEME - NAME SERVER
 Structure:
 name server assigns all names
 each site maintains a record of local data items
 sites ask name server to locate non-local data items
 Advantages:
 satisfies naming criteria 1-3
 Disadvantages:
 does not satisfy naming criterion 4
 name server is a potential performance bottleneck
 name server is a single point of failure
USE OF ALIASES
 Alternative to the centralized scheme: each site prefixes its own site identifier to any name that it generates, e.g., site17.account.
  Fulfills having a unique identifier, and avoids problems associated with central control.
  However, it fails to achieve network transparency.
 Solution: Create a set of aliases for data items; Store the
mapping of aliases to the real names at each site.
 The user can be unaware of the physical location of a data
item, and is unaffected if the data item is moved from one
site to another.
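A minimal sketch of the alias idea: each site keeps a mapping from local aliases to site-prefixed global names, so users are unaffected when an item moves (the names follow the site17.account example above; site22 is an invented second site):

    # Sketch: per-site alias table mapping local aliases to global names.
    alias_table = {"account": "site17.account"}

    def resolve(name):
        # Fall back to the name itself if no alias is defined.
        return alias_table.get(name, name)

    print(resolve("account"))        # site17.account

    # If the item moves, only the mapping changes; users still say "account".
    alias_table["account"] = "site22.account"   # invented new location
    print(resolve("account"))        # site22.account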
III. DISTRIBUTED TRANSACTIONS
SYSTEM ARCHITECTURE
 Transaction may access data at several sites.
 Each site has a local transaction manager responsible for:
 Maintaining a log for recovery purposes
 Participating in coordinating the concurrent execution of the
transactions executing at that site.
 Each site has a transaction coordinator, which is responsible for:
 Starting the execution of transactions that originate at the site.
 Distributing subtransactions at appropriate sites for execution.
 Coordinating the termination of each transaction that originates
at the site, which may result in the transaction being committed
at all sites or aborted at all sites.
(figure omitted: system architecture showing the transaction manager and transaction coordinator at each site)
SYSTEM FAILURE MODES
 Failures unique to distributed systems:
  Failure of a site.
  Loss of messages
   Handled by network transmission control protocols such as TCP/IP
  Failure of a communication link
   Handled by network protocols, by routing messages via alternative links
  Network partition
   A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them
    Note: a subsystem may consist of a single node
 Network partitioning and site failures are generally indistinguishable.
COMMIT PROTOCOLS
 Commit protocols are used to ensure atomicity across
sites
 a transaction which executes at multiple sites must either be
committed at all the sites, or aborted at all the sites.
 not acceptable to have a transaction committed at one site and
aborted at another
 The two-phase commit (2PC) protocol is widely used
 The three-phase commit (3PC) protocol is more
complicated and more expensive, but avoids some
drawbacks of two-phase commit protocol. This protocol
is not used in practice.
TWO PHASE COMMIT PROTOCOL (2PC)
 Assumes fail-stop model – failed sites simply stop
working, and do not cause any other harm, such as
sending incorrect messages to other sites.
 Execution of the protocol is initiated by the coordinator
after the last step of the transaction has been reached.
 The protocol involves all the local sites at which the
transaction executed
 Let T be a transaction initiated at site Si, and let the
transaction coordinator at Si be Ci
PHASE 1: OBTAINING A DECISION
 Coordinator asks all participants to prepare to commit transaction T.
  Ci adds the record <prepare T> to the log and forces the log to stable storage
 sends prepare T messages to all sites at which T executed
 Upon receiving message, transaction manager at site
determines if it can commit the transaction
 if not, add a record <no T> to the log and send abort T
message to Ci
 if the transaction can be committed, then:
 add the record <ready T> to the log
 force all records for T to stable storage
 send ready T message to Ci
PHASE 2: RECORDING THE DECISION
 T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
 The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record reaches stable storage, the decision is irrevocable (even if failures occur)
 The coordinator sends a message to each participant informing it of the decision (commit or abort)
 Participants take appropriate action locally.
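The decision rule of the two phases can be sketched in a few lines of Python; this is a simplified, failure-free simulation with no real logging or networking, just to show how the votes determine the outcome:

    # Simplified 2PC simulation: the coordinator commits T only if every
    # participant votes "ready T"; any "no T" vote forces a global abort.

    def participant_vote(can_commit):
        # On <prepare T>: log <ready T> (or <no T>) and reply accordingly.
        return "ready" if can_commit else "no"

    def two_phase_commit(participants):
        # Phase 1: coordinator logs <prepare T> and collects the votes.
        votes = [participant_vote(ok) for ok in participants]
        # Phase 2: coordinator logs the decision (irrevocable), informs all.
        return "commit" if all(v == "ready" for v in votes) else "abort"

    print(two_phase_commit([True, True, True]))   # commit
    print(two_phase_commit([True, False, True]))  # abort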
HANDLING OF FAILURES - SITE FAILURE
 When site Sk recovers, it examines its log to determine the fate of transactions active at the time of the failure.
  Log contains <commit T> record: site executes redo (T)
  Log contains <abort T> record: site executes undo (T)
  Log contains <ready T> record: site must consult Ci to determine the fate of T.
   If T committed, redo (T)
   If T aborted, undo (T)
  The log contains no control records concerning T:
   implies that Sk failed before responding to the prepare T message from Ci
   Sk must execute undo (T)
HANDLING OF FAILURES- COORDINATOR FAILURE
 If the coordinator fails while the commit protocol for T is executing, then participating sites must decide on T’s fate:
 1. If an active site contains a <commit T> record in its log, then T must be committed.
 2. If an active site contains an <abort T> record in its log, then T must be aborted.
 3. If some active participating site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T.
    Can therefore abort T.
 4. If none of the above cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>).
    In this case active sites must wait for Ci to recover, to find the decision.
 Blocking problem: active sites may have to wait for the failed coordinator to recover.
HANDLING OF FAILURES - NETWORK PARTITION
 If the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol.
 If the coordinator and its participants belong to several partitions:
  Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with failure of the coordinator.
   No harm results, but sites may still have to wait for a decision from the coordinator.
  The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partition have failed, and follow the usual commit protocol.
   Again, no harm results
OBJECT-BASED
DATABASES
OUTLINE
 Object-based Databases: Object Database Concepts
 Object-Relational features
 Object Data Management Group (ODMG) Object Model
 Object Definition Language (ODL)
 Object Query Language (OQL)
I. OBJECT ORIENTED CONCEPTS
 Extend the relational data model by including object
orientation and constructs to deal with added data types.
 Allow attributes of tuples to have complex types,
including non-atomic values such as nested relations.
 Preserve relational foundations, in particular the
declarative access to data, while extending modeling
power.
 Upward compatibility with existing relational languages.
COMPLEX DATA TYPES
 Motivation:
  Permit non-atomic domains (atomic = indivisible)
  Example of a non-atomic domain: a set of integers, or a set of tuples
  Allows more intuitive modeling for applications with complex data
 Intuitive definition:
  allow relations wherever we allow atomic (scalar) values: relations within relations
  Retains the mathematical foundation of the relational model
  Violates first normal form.
EXAMPLE OF A NESTED RELATION
 Example: library information system
 Each book has
 title,
 a set of authors,
 Publisher, and
 a set of keywords
 Non-1NF relation books
(figure omitted: the non-1NF relation books)
4NF DECOMPOSITION OF NESTED RELATION
 Remove awkwardness of flat-books by assuming that the following multivalued dependencies hold:
  title ↠ author
  title ↠ keyword
  title ↠ pub-name, pub-branch
 Decompose flat-books into 4NF using the schemas:
  (title, author)
  (title, keyword)
  (title, pub-name, pub-branch)
(figure omitted: the 4NF decomposition of flat-books)
PROBLEM WITH 4NF SCHEME
 4NF design requires users to include joins in their
queries.
 1NF relational view flat-books defined by join of 4NF
relations:
 eliminates the need for users to perform joins,
 but loses the one-to-one correspondence between tuples and
documents.
 And has a large amount of redundancy
 Nested relations representation is much more natural
here.
II. OBJECT-RELATIONAL FEATURES
 Structured types can be declared and used in SQL
create type Name as
(firstname varchar(20),
lastname varchar(20))
final
create type Address as
(street varchar(20),
city varchar(20),
zipcode varchar(20))
not final
 Note: final and not final indicate whether subtypes can be created
 Structured types can be used to create tables with composite attributes
create table customer (
name Name,
address Address,
dateOfBirth date)
 Dot notation used to reference components: name.firstname
[CONTD…]
 User-defined row types
create type CustomerType as (
name Name,
address Address,
dateOfBirth date)
not final
 Can then create a table whose rows are a user-defined
type
create table customer of CustomerType
[CONTD…]
 Alternative way of defining composite attributes in SQL
is to use unnamed row types.
create table person_r (
name row (firstname varchar(20),
lastname varchar(20)),
address row (street varchar(20),
city varchar(20),
zipcode varchar(9)),
dateOfBirth date);
 The following query finds the last name and city of each person:
   select name.lastname, address.city from person_r;
[CONTD…]
 Methods
 Can add a method declaration with a structured type.
method ageOnDate (onDate date)
returns interval year
 Method body is given separately.
create instance method ageOnDate (onDate date)
returns interval year
for CustomerType
begin
return onDate - self.dateOfBirth;
end
 We can now find the age of each customer:
select name.lastname, ageOnDate (current_date) from customer
[CONTD…]
 Constructor
create function Name (firstname varchar(20), lastname
varchar(20))
returns Name
begin
set self.firstname = firstname;
set self.lastname = lastname;
end
 Inserting
   insert into Person values
     (new Name(’John’, ’Smith’),
      new Address(’20 Main St’, ’New York’, ’11001’),
      date ’1960-8-22’);
[CONTD…]
 Inheritance
 Suppose that we have the following type definition for
people:
create type Person
(name varchar(20),
address varchar(20))
 Using inheritance to define the student and teacher types
create type Student under Person
(degree varchar(20),
department varchar(20))
create type Teacher under Person
(salary integer,
department varchar(20))
 Subtypes can redefine methods by using overriding method
in place of method in the method declaration
[CONTD…]
 Multiple Inheritance
 SQL:1999 and SQL:2003 do not support multiple inheritance
 If our type system supports multiple inheritance, we can define a type
for teaching assistant as follows:
create type Teaching Assistant
under Student, Teacher
 To avoid a conflict between the two occurrences of department we can
rename them
create type Teaching Assistant under
Student with (department as student_dept ),
Teacher with (department as teacher_dept )
[CONTD…]
 Array and Multiset Types in SQL
 Example of array and multiset declaration:
create type Publisher as
(name varchar(20),
branch varchar(20))
create type Book as
(title varchar(20),
author-array varchar(20) array [10],
pub-date date,
publisher Publisher,
keyword-set varchar(20) multiset )
create table books of Book
 Similar to the nested relation books, but with an array of authors instead of a set
[CONTD…]
 Array construction
   array [‘Silberschatz’, ‘Korth’, ‘Sudarshan’]
 Multisets
   multiset [‘computer’, ‘database’, ‘SQL’]
 To create a tuple of the type defined by the books relation:
   (‘Compilers’, array[‘Smith’, ‘Jones’],
    Publisher(‘McGraw-Hill’, ‘New York’),
    multiset[‘parsing’, ‘analysis’])
 To insert the preceding tuple into the relation books:
   insert into books
   values (‘Compilers’, array[‘Smith’, ‘Jones’],
           Publisher(‘McGraw-Hill’, ‘New York’),
           multiset[‘parsing’, ‘analysis’])
UNNESTING
 The transformation of a nested relation into a form with fewer (or no) relation-valued attributes is called unnesting.
 E.g.:
   select title, A as author, publisher.name as pub_name,
          publisher.branch as pub_branch, K.keyword
   from books as B,
        unnest(B.author_array) as A(author),
        unnest(B.keyword_set) as K(keyword)
NESTING
 Nesting is the opposite of unnesting: it creates a collection-valued attribute
  NOTE: SQL:1999 does not support nesting
 Nesting can be done in a manner similar to aggregation, but using the function collect() in place of an aggregation operation, to create a multiset
 To nest the flat-books relation on the attribute keyword:
   select title, author, Publisher(pub_name, pub_branch) as publisher,
          collect(keyword) as keyword_set
   from flat-books
   group by title, author, publisher
 To nest on both authors and keywords:
   select title, collect(author) as author_set,
          Publisher(pub_name, pub_branch) as publisher,
          collect(keyword) as keyword_set
   from flat-books
   group by title, publisher
III. OBJECT DATA MANAGEMENT GROUP
(ODMG) OBJECT MODEL
 Provides a standard model for object databases
 Supports object definition via ODL
 Supports object querying via OQL
 Supports a variety of data types and type constructors
ODMG OBJECTS AND LITERALS
 The basic building blocks of the object model are
 Objects
 Literals
 An object has four characteristics
 Identifier: Unique system-wide identifier
 Name: Unique within a particular database and/or program; it
is optional
 Lifetime: persistent vs transient
 Structure: specifies how object is constructed by the type
constructor and whether it is an atomic object
[CONTD…]
 A literal has a current value but not an identifier
 Three types of literals
 Atomic: predefined; basic data type values (e.g. short, float,
boolean, char)
 Structured: values that are constructed by type constructors
(e.g. date, struct variables)
 Collection: a collection (e.g. array) of values or objects
[CONTD…]
 ODMG supports two concepts for specifying object
types:
 Interface
 Class
 There are similarities and differences between interfaces
and classes
 Both have behaviors (operations) and state (attributes
and relationships)
ODMG INTERFACE
 An interface is a specification of the abstract behavior of
an object type
 State properties of an interface (i.e., its attributes and relationships) cannot be inherited from it
 Objects cannot be instantiated from an interface
ODMG INTERFACE DEFINITION
interface Date:Object {
enum weekday{sun, mon, tue, wed, thu, fri, sat};
enum month{jan, feb, mar, …, dec};
unsigned short year();
unsigned short month();
unsigned short day();
boolean is_equal(in Date other_date);
};
BUILT-IN INTERFACES FOR COLLECTION OBJECTS
 A collection object inherits the basic collection
interface, for example:
 cardinality()
 is_empty()
 insert_element()
 remove_element()
 contains_element()
 create_iterator()
COLLECTION TYPES
 Collection objects are further specialized into types like a
set, list, bag, array, and dictionary
 Each collection type may provide additional interfaces,
for example, a set provides:
 create_union()
 create_difference()
 is_subset_of()
 is_superset_of()
 is_proper_subset_of()
OBJECT INHERITANCE HIERARCHY
(figure omitted: inheritance hierarchy of the ODMG built-in interfaces)
ODMG CLASS
 A class is a specification of the abstract behavior and state of an object type
 A class is instantiable
 Supports “extends” inheritance to allow both state and behavior inheritance among classes
 Multiple inheritance via “extends” is not allowed
[CONTD…]
 Atomic objects are user-defined objects and are defined via the keyword class
 An example:
   class Employee (extent all_employees key ssn) {
     attribute string name;
     attribute string ssn;
     attribute short age;
     relationship dept works_for;
     void reassign(in string new_name);
   }
IV. OBJECT DEFINITION LANGUAGE (ODL)
 ODL supports semantics constructs of ODMG
 ODL is independent of any programming language
 ODL is used to create object specification (classes and
interfaces)
 ODL is not used for database manipulation
EXAMPLE 1: A VERY SIMPLE CLASS
 A very simple, straightforward class definition
class Degree {
attribute string college;
attribute string degree;
attribute string year;
};
EXAMPLE 2: A CLASS WITH KEY AND EXTENT
class Person (extent persons key ssn) {
attribute struct Pname {string fname …} name;
attribute string ssn;
attribute date birthdate;
short age();
};
EXAMPLE 3: A CLASS WITH RELATIONSHIPS
class Faculty extends Person (extent faculty) {
attribute string rank;
attribute float salary;
attribute string phone;
relationship dept works_in inverse
dept :: has_faculty;
relationship set<GradStu> advises inverse
GradStu :: advisor;
void give_raise (in float raise);
void promote (in string new_rank);
};
EXAMPLE 4: INHERITANCE
interface Shape {
attribute struct point {…}
reference_point;
float perimeter();
};
class Triangle : Shape (extent triangles) {
attribute short side_1;
attribute short side_2;
};
V. OBJECT QUERY LANGUAGE (OQL)
 OQL is ODMG’s query language
 OQL works closely with programming languages such as C++
 Embedded OQL statements return objects that are compatible with the type system of the host language
 OQL’s syntax is similar to SQL, with additional features for objects
SIMPLE OQL QUERIES
 Basic syntax: select … from … where …
select d.name from d in departments where d.college =
‘engineering’;
 An entry point to the database is needed for each query
 An extent name may serve as an entry point
ITERATOR VARIABLES
 Iterator variables are defined whenever a collection is
referenced in an OQL query
 Iterator d in the previous example serves as an iterator
and ranges over each object in the collection
 Syntactical options for specifying an iterator:
 d in departments
 departments d
 departments as d
DATA TYPE OF QUERY RESULTS
 The data type of a query result can be any type defined in
the ODMG model
 A query does not have to follow the select … from …
where … format
 A persistent name on its own can serve as a query whose
result is a reference to the persistent object.
 For example,
departments: whose type is set<Departments>
PATH EXPRESSIONS
 A path expression is used to specify a path to attributes
and objects in an entry point
 A path expression starts at a persistent object name
 The name will be followed by zero or more dot
connected relationship or attribute names
 For example: departments.chair;
VIEWS AS NAMED OBJECTS
 The define keyword in OQL is used to specify an
identifier for a named query
 The name should be unique; if not, the new definition will replace the existing named query with that name
 Once a query definition is created, it will persist until
deleted or redefined
 A view definition can include parameters
EXAMPLE
 A view to include students in a department who have a
minor
define has_minor(dept_name) as select s from s in
students where s.minor_in.dname = dept_name
SINGLE ELEMENTS FROM COLLECTIONS
 An OQL query returns a collection
 OQL’s element operator can be used to return a single element from a singleton collection:
   element (select d from d in departments where d.name = ‘Web Programming’);
 If the query result is empty or has more than one element, an exception is raised
COLLECTION OPERATORS
 OQL supports a number of aggregate operators that can be applied to query results
 The aggregate operators operate over a collection and include
  min
  max
  count
  sum
  avg
 For example:
   avg (select s.gpa from s in students
        where s.class = ‘senior’ and s.majors_in.dname = ‘business’);
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material
DBMS Unit IV and V Material

More Related Content

What's hot (19)

Mass storage structure
Mass storage structureMass storage structure
Mass storage structure
 
Raid- Redundant Array of Inexpensive Disks
Raid- Redundant Array of Inexpensive DisksRaid- Redundant Array of Inexpensive Disks
Raid- Redundant Array of Inexpensive Disks
 
ch11
ch11ch11
ch11
 
Ch11 - Silberschatz
Ch11 - SilberschatzCh11 - Silberschatz
Ch11 - Silberschatz
 
RAID
RAIDRAID
RAID
 
OSCh14
OSCh14OSCh14
OSCh14
 
Mass Storage Structure
Mass Storage StructureMass Storage Structure
Mass Storage Structure
 
Sistemas operacionais raid
Sistemas operacionais   raidSistemas operacionais   raid
Sistemas operacionais raid
 
Storage Devices And Backup Media
Storage Devices And Backup MediaStorage Devices And Backup Media
Storage Devices And Backup Media
 
DB_ch11
DB_ch11DB_ch11
DB_ch11
 
RAID LEVELS
RAID LEVELSRAID LEVELS
RAID LEVELS
 
Disk structure
Disk structureDisk structure
Disk structure
 
RAID
RAIDRAID
RAID
 
Ch11
Ch11Ch11
Ch11
 
RDBMS
RDBMSRDBMS
RDBMS
 
Coal presentationt
Coal presentationtCoal presentationt
Coal presentationt
 
Introduction to Storage technologies
Introduction to Storage technologiesIntroduction to Storage technologies
Introduction to Storage technologies
 
storage and file structure
storage and file structurestorage and file structure
storage and file structure
 
SEMINAR
SEMINARSEMINAR
SEMINAR
 

Similar to DBMS Unit IV and V Material

Ch14 OS
Ch14 OSCh14 OS
Ch14 OSC.U
 
disk structure and multiple RAID levels .ppt
disk structure and multiple  RAID levels .pptdisk structure and multiple  RAID levels .ppt
disk structure and multiple RAID levels .pptRAJASEKHARV10
 
Asif Jamal disk (it)
Asif Jamal disk (it)Asif Jamal disk (it)
Asif Jamal disk (it)Asif Jamal
 
Data center core elements, Data center virtualization
Data center core elements, Data center virtualizationData center core elements, Data center virtualization
Data center core elements, Data center virtualizationMadhuraNK
 
Chapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organizationChapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organizationJafar Nesargi
 
Chapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organizationChapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organizationJafar Nesargi
 
RAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageRAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageUğur Tılıkoğlu
 

Similar to DBMS Unit IV and V Material (20)

UNIT III.pptx
UNIT III.pptxUNIT III.pptx
UNIT III.pptx
 
Storage memory
Storage memoryStorage memory
Storage memory
 
Raid
Raid Raid
Raid
 
Raid technology
Raid technologyRaid technology
Raid technology
 
Magnetic disk - Krishna Geetha.ppt
Magnetic disk  - Krishna Geetha.pptMagnetic disk  - Krishna Geetha.ppt
Magnetic disk - Krishna Geetha.ppt
 
OS_Ch14
OS_Ch14OS_Ch14
OS_Ch14
 
Ch14 OS
Ch14 OSCh14 OS
Ch14 OS
 
Raid
RaidRaid
Raid
 
RAID Levels
RAID LevelsRAID Levels
RAID Levels
 
disk structure and multiple RAID levels .ppt
disk structure and multiple  RAID levels .pptdisk structure and multiple  RAID levels .ppt
disk structure and multiple RAID levels .ppt
 
data recovery-raid
data recovery-raiddata recovery-raid
data recovery-raid
 
Asif Jamal disk (it)
Asif Jamal disk (it)Asif Jamal disk (it)
Asif Jamal disk (it)
 
Data center core elements, Data center virtualization
Data center core elements, Data center virtualizationData center core elements, Data center virtualization
Data center core elements, Data center virtualization
 
Chapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organizationChapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organization
 
Chapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organizationChapter 4 record storage and primary file organization
Chapter 4 record storage and primary file organization
 
Raid 1 3
Raid 1 3Raid 1 3
Raid 1 3
 
DAS RAID NAS SAN
DAS RAID NAS SANDAS RAID NAS SAN
DAS RAID NAS SAN
 
Os
OsOs
Os
 
RAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageRAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary Storage
 
os
osos
os
 

More from ArthyR3

Unit IV Knowledge and Hybrid Recommendation System.pdf
Unit IV Knowledge and Hybrid Recommendation System.pdfUnit IV Knowledge and Hybrid Recommendation System.pdf
Unit IV Knowledge and Hybrid Recommendation System.pdfArthyR3
 
VIT336 – Recommender System - Unit 3.pdf
VIT336 – Recommender System - Unit 3.pdfVIT336 – Recommender System - Unit 3.pdf
VIT336 – Recommender System - Unit 3.pdfArthyR3
 
OOPs - JAVA Quick Reference.pdf
OOPs - JAVA Quick Reference.pdfOOPs - JAVA Quick Reference.pdf
OOPs - JAVA Quick Reference.pdfArthyR3
 
NodeJS and ExpressJS.pdf
NodeJS and ExpressJS.pdfNodeJS and ExpressJS.pdf
NodeJS and ExpressJS.pdfArthyR3
 
MongoDB.pdf
MongoDB.pdfMongoDB.pdf
MongoDB.pdfArthyR3
 
REACTJS.pdf
REACTJS.pdfREACTJS.pdf
REACTJS.pdfArthyR3
 
ANGULARJS.pdf
ANGULARJS.pdfANGULARJS.pdf
ANGULARJS.pdfArthyR3
 
JQUERY.pdf
JQUERY.pdfJQUERY.pdf
JQUERY.pdfArthyR3
 
Qb it1301
Qb   it1301Qb   it1301
Qb it1301ArthyR3
 
CNS - Unit v
CNS - Unit vCNS - Unit v
CNS - Unit vArthyR3
 
Cs8792 cns - unit v
Cs8792   cns - unit vCs8792   cns - unit v
Cs8792 cns - unit vArthyR3
 
Cs8792 cns - unit iv
Cs8792   cns - unit ivCs8792   cns - unit iv
Cs8792 cns - unit ivArthyR3
 
Cs8792 cns - unit iv
Cs8792   cns - unit ivCs8792   cns - unit iv
Cs8792 cns - unit ivArthyR3
 
Cs8792 cns - unit i
Cs8792   cns - unit iCs8792   cns - unit i
Cs8792 cns - unit iArthyR3
 
Java quick reference
Java quick referenceJava quick reference
Java quick referenceArthyR3
 
Cs8792 cns - Public key cryptosystem (Unit III)
Cs8792   cns - Public key cryptosystem (Unit III)Cs8792   cns - Public key cryptosystem (Unit III)
Cs8792 cns - Public key cryptosystem (Unit III)ArthyR3
 
Cryptography Workbook
Cryptography WorkbookCryptography Workbook
Cryptography WorkbookArthyR3
 
Cs6701 cryptography and network security
Cs6701 cryptography and network securityCs6701 cryptography and network security
Cs6701 cryptography and network securityArthyR3
 
Compiler question bank
Compiler question bankCompiler question bank
Compiler question bankArthyR3
 

More from ArthyR3 (20)

Unit IV Knowledge and Hybrid Recommendation System.pdf
Unit IV Knowledge and Hybrid Recommendation System.pdfUnit IV Knowledge and Hybrid Recommendation System.pdf
Unit IV Knowledge and Hybrid Recommendation System.pdf
 
VIT336 – Recommender System - Unit 3.pdf
VIT336 – Recommender System - Unit 3.pdfVIT336 – Recommender System - Unit 3.pdf
VIT336 – Recommender System - Unit 3.pdf
 
OOPs - JAVA Quick Reference.pdf
OOPs - JAVA Quick Reference.pdfOOPs - JAVA Quick Reference.pdf
OOPs - JAVA Quick Reference.pdf
 
NodeJS and ExpressJS.pdf
NodeJS and ExpressJS.pdfNodeJS and ExpressJS.pdf
NodeJS and ExpressJS.pdf
 
MongoDB.pdf
MongoDB.pdfMongoDB.pdf
MongoDB.pdf
 
REACTJS.pdf
REACTJS.pdfREACTJS.pdf
REACTJS.pdf
 
ANGULARJS.pdf
ANGULARJS.pdfANGULARJS.pdf
ANGULARJS.pdf
 
JQUERY.pdf
JQUERY.pdfJQUERY.pdf
JQUERY.pdf
 
Qb it1301
Qb   it1301Qb   it1301
Qb it1301
 
CNS - Unit v
CNS - Unit vCNS - Unit v
CNS - Unit v
 
Cs8792 cns - unit v
Cs8792   cns - unit vCs8792   cns - unit v
Cs8792 cns - unit v
 
Cs8792 cns - unit iv
Cs8792   cns - unit ivCs8792   cns - unit iv
Cs8792 cns - unit iv
 
Cs8792 cns - unit iv
Cs8792   cns - unit ivCs8792   cns - unit iv
Cs8792 cns - unit iv
 
Cs8792 cns - unit i
Cs8792   cns - unit iCs8792   cns - unit i
Cs8792 cns - unit i
 
Java quick reference
Java quick referenceJava quick reference
Java quick reference
 
Cs8792 cns - Public key cryptosystem (Unit III)
Cs8792   cns - Public key cryptosystem (Unit III)Cs8792   cns - Public key cryptosystem (Unit III)
Cs8792 cns - Public key cryptosystem (Unit III)
 
Cryptography Workbook
Cryptography WorkbookCryptography Workbook
Cryptography Workbook
 
Cns
CnsCns
Cns
 
Cs6701 cryptography and network security
Cs6701 cryptography and network securityCs6701 cryptography and network security
Cs6701 cryptography and network security
 
Compiler question bank
Compiler question bankCompiler question bank
Compiler question bank
 

Recently uploaded

Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIkoyaldeepu123
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage examplePragyanshuParadkar1
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 

Recently uploaded (20)

POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
EduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AIEduAI - E learning Platform integrated with AI
EduAI - E learning Platform integrated with AI
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage example
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 

DBMS Unit IV and V Material

  • 1. CS8492 – DATABASE MANAGEMENT SYSTEMS UNIT IV - IMPLEMENTATION TECHNIQUES
  • 2. OUTLINE  RAID  File Organization –Organization of Records in Files  Indexing and Hashing  Ordered Indices  B+ tree Index Files – B tree Index Files  Static Hashing –Dynamic Hashing  Query Processing Overview  Algorithms for SELECT and JOIN operations  Query optimization using Heuristics and Cost Estimation 2 PreparedbyR.Arthy,AP/IT,KCET
  • 3. RAID REDUNDANT ARRAYS OF INDEPENDENT DISKS
  • 4. CLASSIFICATION OF PHYSICAL STORAGE MEDIA  Can differentiate storage into:  volatile storage: loses contents when power is switched off  non-volatile storage:  Contents persist even when power is switched off.  Includes secondary and tertiary storage, as well as batter-backed up main-memory.  Factors affecting choice of storage media include  Speed with which data can be accessed  Cost per unit of data  Reliability 4 PreparedbyR.Arthy,AP/IT,KCET
  • 6. [CONTD…]  primary storage: Fastest media but volatile (cache, main memory).  secondary storage: next level in hierarchy, non-volatile, moderately fast access time  Also called on-line storage  E.g., flash memory, magnetic disks  tertiary storage: lowest level in hierarchy, non-volatile, slow access time  also called off-line storage and used for archival storage  e.g., magnetic tape, optical storage  Magnetic tape  Sequential access, 1 to 12 TB capacity  A few drives with many tapes  Juke boxes with petabytes (1000’s of TB) of storage 6 PreparedbyR.Arthy,AP/IT,KCET
  • 7. RAID  RAID: Redundant Arrays of Independent Disks  Disk organization techniques that manage a large numbers of disks, providing a view of a single disk of  high capacity and high speed by using multiple disks in parallel,  high reliability by storing data redundantly, so that data can be recovered even if a disk fails  The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail.  E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days)  Techniques for using redundancy to avoid data loss are critical with large numbers of disks 7 PreparedbyR.Arthy,AP/IT,KCET
  • 8. IMPROVEMENT OF RELIABILITY VIA REDUNDANCY  Redundancy – store extra information that can be used to rebuild information lost in a disk failure  E.g., Mirroring (or shadowing)  Duplicate every disk. Logical disk consists of two physical disks.  Every write is carried out on both disks  Reads can take place from either disk  If one disk in a pair fails, data still available in the other  Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired  Probability of combined event is very small  Except for dependent failure modes such as fire or building collapse or electrical power surges  Mean time to data loss depends on mean time to failure, and mean time to repair  E.g., MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of 500*106 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes) 8 PreparedbyR.Arthy,AP/IT,KCET
 Bit-level striping – split the bits of each byte across multiple disks
 In an array of eight disks, write bit i of each byte to disk i.
 Each access can read data at eight times the rate of a single disk.
 But seek/access time is worse than for a single disk
 Bit-level striping is not used much any more
 Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1
 Requests for different blocks can run in parallel if the blocks reside on different disks
 A request for a long sequence of blocks can utilize all disks in parallel
RAID LEVELS
 Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
 Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
 RAID Level 0: Block striping; non-redundant.
 Used in high-performance applications where data loss is not critical.
 RAID Level 1: Mirrored disks with block striping
 Offers best write performance.
 Popular for applications such as storing log files in a database system.

[CONTD…]
 RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.
 RAID Level 3: Bit-Interleaved Parity
 a single parity bit is enough for error correction, not just detection, since we know which disk has failed
 When writing data, corresponding parity bits must also be computed and written to a parity bit disk
 To recover the data of a damaged disk, compute the XOR of the bits from the other disks (including the parity bit disk); a sketch follows
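The XOR parity used by RAID 3 (and, block-wise, by RAID 4/5) is easy to demonstrate. The following Python fragment is only an illustration of the arithmetic, not how a controller is implemented; the disk contents are made-up byte strings.

    from functools import reduce

    def parity(blocks):
        # Parity block: bitwise XOR of the corresponding bytes of all data blocks.
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    def recover(surviving_blocks, parity_block):
        # A lost block is the XOR of the parity block with all surviving blocks.
        return parity(surviving_blocks + [parity_block])

    disks = [b"\x0f\xaa", b"\xf0\x55", b"\x33\xcc"]
    p = parity(disks)
    assert recover([disks[0], disks[2]], p) == disks[1]  # disk 1 failed and is rebuilt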
[CONTD…]
 RAID Level 3 (Cont.)
 Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O.
 Subsumes Level 2 (provides all its benefits, at lower cost).
 RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from N other disks.
 When writing a data block, the corresponding block of parity bits must also be computed and written to the parity disk
 To find the value of a damaged block, compute the XOR of bits from the corresponding blocks (including the parity block) of the other disks.

[CONTD…]
 RAID Level 4 (Cont.)
 Provides higher I/O rates for independent block reads than Level 3
 a block read goes to a single disk, so blocks stored on different disks can be read in parallel
 Provides higher transfer rates for reads of multiple blocks than no-striping
 Before writing a block, parity data must be computed
 Can be done by using the old parity block, the old value of the current block and the new value of the current block (2 block reads + 2 block writes)
 Or by recomputing the parity value using the new values of the blocks corresponding to the parity block
 More efficient for writing large amounts of data sequentially
 Parity block becomes a bottleneck for independent block writes since every block write also writes to the parity disk

[CONTD…]
 RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
 E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.

[CONTD…]
 RAID Level 5 (Cont.)
 Block writes occur in parallel if the blocks and their parity blocks are on different disks.
 RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores two error-correction blocks (P, Q) instead of a single parity block to guard against multiple disk failures.
 Better reliability than Level 5 at a higher cost
 Becoming more important as storage sizes increase
CHOICE OF RAID LEVEL
 Factors in choosing a RAID level
 Monetary cost
 Performance: number of I/O operations per second, and bandwidth during normal operation
 Performance during failure
 Performance during rebuild of a failed disk
 Including the time taken to rebuild the failed disk
 RAID 0 is used only when data safety is not important
 E.g., data can be recovered quickly from other sources
 Levels 2 and 4 are never used since they are subsumed by 3 and 5
 Level 3 is not used anymore since bit-striping forces single block reads to access all disks, wasting disk arm movement, which block striping (level 5) avoids
 Level 6 is rarely used since levels 1 and 5 offer adequate safety for most applications

[CONTD…]
 Level 1 provides much better write performance than level 5
 Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes
 Level 1 has a higher storage cost than level 5
 Level 5 is preferred for applications where writes are sequential and large (many blocks), and which need large amounts of data storage
 RAID 1 is preferred for applications with many random/small updates
 Level 6 gives better data protection than RAID 5 since it can tolerate two disk (or disk block) failures
 Increasing in importance since latent block failures on one disk, coupled with a failure of another disk, can result in data loss with RAID 1 and RAID 5.
HARDWARE ISSUES
 Software RAID: RAID implementations done entirely in software, with no special hardware support
 Hardware RAID: RAID implementations with special hardware
 Use non-volatile RAM to record writes that are being executed
 Beware: a power failure during a write can result in a corrupted disk
 E.g., failure after writing one block but before writing the second in a mirrored system
 Such corrupted data must be detected when power is restored
 Recovery from corruption is similar to recovery from a failed disk
 NV-RAM helps to efficiently detect potentially corrupted blocks
 Otherwise all blocks of the disk must be read and compared with the mirror/parity block

[CONTD…]
 Latent failures: data successfully written earlier gets damaged
 can result in data loss even if only one disk fails
 Data scrubbing:
 continually scan for latent failures, and recover from copy/parity
 Hot swapping: replacement of a disk while the system is running, without powering down
 Supported by some hardware RAID systems
 reduces time to recovery, and improves availability greatly
 Many systems maintain spare disks which are kept online, and used as replacements for failed disks immediately on detection of failure
 Reduces time to recovery greatly
 Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using
 Redundant power supplies with battery backup
 Multiple controllers and multiple interconnections to guard against controller/interconnection failures

OPTIMIZATION OF DISK-BLOCK ACCESS
 Buffering: in-memory buffer to cache disk blocks
 Read-ahead: read extra blocks from a track in anticipation that they will be requested soon
 Disk-arm-scheduling algorithms re-order block requests so that disk arm movement is minimized
 elevator algorithm (sketched below)
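A minimal sketch of the elevator (SCAN) idea: serve pending cylinder requests in one direction of arm travel, then reverse. This is illustrative Python, not any particular controller's scheduler; the request queue and head position are made-up inputs.

    def elevator(requests, head, direction=1):
        # Serve all requests at cylinders >= head while moving outward,
        # then reverse and serve the remaining ones (SCAN order).
        up = sorted(r for r in requests if r >= head)
        down = sorted((r for r in requests if r < head), reverse=True)
        return (up + down) if direction == 1 else (down + up)

    print(elevator([98, 183, 37, 122, 14, 124, 65, 67], head=53))
    # [65, 67, 98, 122, 124, 183, 37, 14]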
FILE ORGANIZATION – ORGANIZATION OF RECORDS IN FILES

INTRODUCTION
 The database is stored as a collection of files.
 Each file is a sequence of records.
 A record is a sequence of fields.
 One approach:
 assume record size is fixed
 each file has records of one particular type only
 different files are used for different relations
 This case is easiest to implement; we will consider variable-length records later.

FIXED LENGTH RECORD
 Simple approach:
 Store record i starting from byte n ∗ (i − 1), where n is the size of each record.
 Record access is simple but records may cross blocks (a short offset calculation follows).
 Deletion of record i – alternatives:
 move records i + 1, . . . , n to i, . . . , n − 1
 move record n to i
 link all free records on a free list
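The byte-offset arithmetic is worth making concrete. A minimal sketch, assuming the 53-byte instructor record of the next slide and a hypothetical 4 KB block size:

    RECORD_SIZE = 53  # size of one instructor record, in bytes

    def record_offset(i):
        # Record numbering starts at 1; record i begins at byte n * (i - 1).
        return RECORD_SIZE * (i - 1)

    def records_per_block(block_size=4096):
        # 53 does not divide 4096, so either space is wasted at the end of
        # each block or records cross block boundaries.
        return block_size // RECORD_SIZE

    print(record_offset(3))       # 106
    print(records_per_block())    # 77 records; 4096 - 77*53 = 15 bytes unused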
EXAMPLE
 type instructor = record
ID varchar(5);
name varchar(20);
dept_name varchar(20);
salary numeric(8,2);
end
 An instructor record is 53 bytes long.
 Two problems:
 Unless the block size happens to be a multiple of 53 (which is unlikely), some records will cross block boundaries.
 It is difficult to delete a record from this structure.
FILE HEADER AND FREE LIST
 Store the address of the first record whose contents are deleted in the file header.
 Use this first record to store the address of the second available record, and so on.
 Can think of these stored addresses as pointers since they “point” to the location of a record.

[CONTD…]
 More space-efficient representation: reuse the space for normal attributes of free records to store pointers. (No pointers are stored in in-use records.)
 Dangling pointers occur if we move or delete a record to which another record contains a pointer; that pointer no longer points to the desired record.
 Avoid moving or deleting records that are pointed to by other records; such records are pinned.

VARIABLE LENGTH RECORD
 Variable-length records arise in database systems in several ways:
 Storage of multiple record types in a file.
 Record types that allow variable lengths for one or more fields.
 Record types that allow repeating fields (used in some older data models).
 Byte-string representation
 Attach an end-of-record (┴) control character to the end of each record
 Difficulty with deletion
 Difficulty with growth
SLOTTED PAGE STRUCTURE
 Header contains:
 number of record entries
 end of free space in the block
 location and size of each record
 Records can be moved around within a page to keep them contiguous with no empty space between them; the entry in the header must then be updated.
 Pointers should not point directly to the record – instead they should point to the entry for the record in the header (a toy model follows).
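A toy model of a slotted page can make the indirection clearer. This is a simplified sketch, not an actual storage engine's layout; slot entries here are (offset, length) pairs, and records grow from the end of the page toward the header.

    class SlottedPage:
        def __init__(self, size=4096):
            self.data = bytearray(size)
            self.slots = []          # header: (offset, length) per record
            self.free_end = size     # records are placed from the end backwards

        def insert(self, record: bytes) -> int:
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.slots.append((self.free_end, len(record)))
            return len(self.slots) - 1   # external pointers hold this slot number

        def read(self, slot: int) -> bytes:
            off, length = self.slots[slot]
            return bytes(self.data[off:off + length])

    page = SlottedPage()
    s = page.insert(b"Srinivasan|Comp.Sci.|65000")
    print(page.read(s))

Because external pointers refer to the slot number, records can be compacted within the page without invalidating those pointers.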
ORGANIZATION OF RECORDS IN FILES
 Heap – a record can be placed anywhere in the file where there is space
 Sequential – store records in sequential order, based on the value of the search key of each record
 Hashing – a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed
 Clustering – records of several different relations can be stored in the same file; related records are stored on the same block

SEQUENTIAL FILE ORGANIZATION
 Suitable for applications that require sequential processing of the entire file
 The records in the file are ordered by a search key

[CONTD…]
 Deletion – use pointer chains
 Insertion – must locate the position in the file where the record is to be inserted
 if there is free space, insert there
 if there is no free space, insert the record in an overflow block
 In either case, the pointer chain must be updated
 Need to reorganize the file from time to time to restore sequential order

CLUSTERING FILE ORGANIZATION
 A simple file structure stores each relation in a separate file
 Can instead store several relations in one file using a clustering file organization
 E.g., a clustering organization of department and employee stores each department record together with the employee records of that department, on the same block
INDEXING AND HASHING

INTRODUCTION
 Indexing mechanisms are used to speed up access to desired data.
 E.g., author catalog in a library
 Search key – attribute or set of attributes used to look up records in a file.
 An index file consists of records (called index entries) of the form (search-key, pointer)
 Index files are typically much smaller than the original file
 Two basic kinds of indices:
 Ordered indices: search keys are stored in some sorted order
 Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

INDEX EVALUATION METRICS
 Access types: the types of access that are supported efficiently.
 Access time: the time it takes to find a particular data item, or set of items, using the technique in question.
 Insertion time: the time it takes to insert a new data item.
 Deletion time: the time it takes to delete a data item.
 Space overhead: the additional space occupied by an index structure.

ORDERED INDICES
 In an ordered index, index entries are stored sorted on the search-key value.
 E.g., author catalog in a library.
 Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
 Also called clustering index
 The search key of a primary index is usually but not necessarily the primary key.
 Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
 Index-sequential file: an ordered sequential file with a primary index.
INDEX UPDATE – INSERTION
 First, the system performs a lookup using the search-key value that appears in the record to be inserted. The actions the system takes next depend on whether the index is dense or sparse:
 Dense indices:
1. If the search-key value does not appear in the index, the system inserts an index entry with the search-key value in the index at the appropriate position.
2. Otherwise the following actions are taken:
 If the index entry stores pointers to all records with the same search-key value, the system adds a pointer to the new record in the index entry.
 Otherwise, the index entry stores a pointer to only the first record with the search-key value. The system then places the record being inserted after the other records with the same search-key values.

[CONTD…] DELETION
 Dense indices:
1. If the deleted record was the only record with its particular search-key value, then the system deletes the corresponding index entry from the index.
2. Otherwise the following actions are taken:
 If the index entry stores pointers to all records with the same search-key value, the system deletes the pointer to the deleted record from the index entry.
 Otherwise, the index entry stores a pointer to only the first record with the search-key value. In this case, if the deleted record was the first record with the search-key value, the system updates the index entry to point to the next record.

[CONTD…]
 Sparse indices:
1. If the index does not contain an index entry with the search-key value of the deleted record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions:
 If the deleted record was the only record with its search key, the system replaces the corresponding index record with an index record for the next search-key value (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
 Otherwise, if the index entry for the search-key value points to the record being deleted, the system updates the index entry to point to the next record with the same search-key value.
ALGORITHMS FOR SELECT AND JOIN OPERATIONS

SELECT OPERATIONS
 File scan – search algorithms that locate and retrieve records that fulfill a selection condition.
 Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
 Cost estimate = br block transfers + 1 seek
 br denotes the number of blocks containing records from relation r
 If the selection is on a key attribute, the scan can stop on finding the record
 cost = (br / 2) block transfers + 1 seek
 Linear search can be applied regardless of
 selection condition, or
 ordering of records in the file, or
 availability of indices

[CONTD…]
 A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered.
 Assume that the blocks of a relation are stored contiguously
 Cost estimate (number of disk blocks to be scanned):
 cost of locating the first tuple by a binary search on the blocks: ⌈log2(br)⌉ ∗ (tT + tS)
 If there are multiple records satisfying the selection
 add the transfer cost of the number of blocks containing records that satisfy the selection condition

[CONTD…]
 Index scan – search algorithms that use an index
 the selection condition must be on the search key of the index.
 A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding equality condition
 Cost = (hi + 1) ∗ (tT + tS)
 A4 (primary index on nonkey, equality). Retrieve multiple records.
 Records will be on consecutive blocks
 Let b = number of blocks containing matching records
 Cost = hi ∗ (tT + tS) + tS + tT ∗ b
 A5 (equality on search key of secondary index).
 Retrieve a single record if the search key is a candidate key
 Cost = (hi + 1) ∗ (tT + tS)
 Retrieve multiple records if the search key is not a candidate key
 each of n matching records may be on a different block
 Cost = (hi + n) ∗ (tT + tS)
 Can be very expensive!
[CONTD…]
 Can implement selections of the form σA≤v(r) or σA≥v(r) by using
 a linear file scan or binary search,
 or by using indices in the following ways:
 A6 (primary index, comparison). (Relation is sorted on A)
 For σA≥v(r), use the index to find the first tuple ≥ v and scan the relation sequentially from there
 For σA≤v(r), just scan the relation sequentially till the first tuple > v; do not use the index
 A7 (secondary index, comparison).
 For σA≥v(r), use the index to find the first index entry ≥ v and scan the index sequentially from there, to find pointers to records.
 For σA≤v(r), just scan the leaf pages of the index finding pointers to records, till the first entry > v
 In either case, retrieve the records that are pointed to
 requires an I/O for each record
 a linear file scan may be cheaper

[CONTD…]
 Conjunction: σθ1∧θ2∧...∧θn(r)
 A8 (conjunctive selection using one index).
 Select a combination of θi and algorithms A1 through A7 that results in the least cost for σθi(r).
 Test the other conditions on each tuple after fetching it into the memory buffer.
 A9 (conjunctive selection using multiple-key index).
 Use an appropriate composite (multiple-key) index if available.
 A10 (conjunctive selection by intersection of identifiers).
 Requires indices with record pointers.
 Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
 Then fetch the records from the file
 If some conditions do not have appropriate indices, apply the test in memory.

[CONTD…]
 Disjunction: σθ1∨θ2∨...∨θn(r)
 A11 (disjunctive selection by union of identifiers).
 Applicable if all conditions have available indices.
 Otherwise use a linear scan.
 Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
 Then fetch the records from the file
 Negation: σ¬θ(r)
 Use a linear scan on the file
 If very few records satisfy ¬θ, and an index is applicable to ¬θ
 find the satisfying records using the index and fetch them from the file
JOIN OPERATIONS
 Several different algorithms to implement joins
 Nested-loop join
 Block nested-loop join
 Indexed nested-loop join
 Merge join
 Hash join
 Choice based on cost estimate
 Examples use the following information
 Number of records of customer: 10,000; depositor: 5000
 Number of blocks of customer: 400; depositor: 100

NESTED-LOOP JOIN
 To compute the theta join r ⋈θ s:
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr, ts) to see if they satisfy the join condition θ
if they do, add tr · ts to the result.
end
end
 r is called the outer relation and s the inner relation of the join.
 Requires no indices and can be used with any kind of join condition.
 Expensive since it examines every pair of tuples in the two relations.

[CONTD…]
 In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is nr ∗ bs + br block transfers, plus nr + br seeks
 If the smaller relation fits entirely in memory, use that as the inner relation.
 Reduces cost to br + bs block transfers and 2 seeks
 Assuming worst-case memory availability, the cost estimate is
 with depositor as the outer relation:
 5000 ∗ 400 + 100 = 2,000,100 block transfers,
 5000 + 100 = 5100 seeks
 with customer as the outer relation:
 10000 ∗ 100 + 400 = 1,000,400 block transfers and 10,400 seeks
 If the smaller relation (depositor) fits entirely in memory, the cost estimate will be 500 block transfers.
 The block nested-loops algorithm is preferable.
BLOCK NESTED-LOOP JOIN
 Variant of nested-loop join in which every block of the inner relation is paired with every block of the outer relation (a runnable sketch follows the cost analysis below).
for each block Br of r do begin
for each block Bs of s do begin
for each tuple tr in Br do begin
for each tuple ts in Bs do begin
check if (tr, ts) satisfy the join condition
if they do, add tr · ts to the result.
end
end
end
end

[CONTD…]
 Worst-case estimate: br ∗ bs + br block transfers + 2 ∗ br seeks
 Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation)
 Best case: br + bs block transfers + 2 seeks.
 Improvements to nested-loop and block nested-loop algorithms:
 In block nested-loop, use M − 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
 Cost = ⌈br / (M − 2)⌉ ∗ bs + br block transfers + 2⌈br / (M − 2)⌉ seeks
 If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match
 Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement)
 Use an index on the inner relation if available
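As a concrete illustration, here is a minimal in-memory sketch of block nested-loop join in Python. Blocks are modelled as lists of tuples and the join condition is passed as a function; a real system would operate on buffered disk pages, so only the loop structure carries over.

    def block_nested_loop_join(r_blocks, s_blocks, theta):
        # r_blocks, s_blocks: lists of blocks, each block a list of tuples.
        # Every block of the inner relation s is paired with every block of r.
        result = []
        for Br in r_blocks:
            for Bs in s_blocks:      # inner relation scanned once per outer block
                for tr in Br:
                    for ts in Bs:
                        if theta(tr, ts):
                            result.append(tr + ts)
        return result

    depositor = [[("A-101", "Johnson")], [("A-215", "Smith")]]
    customer  = [[("Johnson", "Brooklyn"), ("Smith", "Harrison")]]
    print(block_nested_loop_join(depositor, customer,
                                 lambda tr, ts: tr[1] == ts[0]))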
INDEXED NESTED-LOOP JOIN
 Index lookups can replace file scans if
 the join is an equi-join or natural join, and
 an index is available on the inner relation’s join attribute
 Can construct an index just to compute a join.
 For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.
 Worst case: the buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
 Cost of the join: br ∗ (tT + tS) + nr ∗ c
 where c is the cost of traversing the index and fetching all matching s tuples for one tuple of r
 c can be estimated as the cost of a single selection on s using the join condition.
 If indices are available on the join attributes of both r and s, use the relation with fewer tuples as the outer relation.

EXAMPLE
 Compute depositor ⋈ customer, with depositor as the outer relation.
 Let customer have a primary B+-tree index on the join attribute customer_name, which contains 20 entries in each index node.
 Since customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data
 depositor has 5000 tuples
 Cost of block nested-loops join
 400 ∗ 100 + 100 = 40,100 block transfers + 2 ∗ 100 = 200 seeks
 assuming worst-case memory
 may be significantly less with more memory
 Cost of indexed nested-loops join
 100 + 5000 ∗ 5 = 25,100 block transfers and seeks
 CPU cost likely to be less than that for block nested-loops join
MERGE JOIN
1. Sort both relations on their join attribute (if not already sorted on the join attributes).
2. Merge the sorted relations to join them
 The join step is similar to the merge stage of the sort-merge algorithm.
 The main difference is the handling of duplicate values in the join attribute – every pair with the same value on the join attribute must be matched

[CONTD…]
 Can be used only for equi-joins and natural joins
 Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)
 Thus the cost of merge join is: br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
 + the cost of sorting if the relations are unsorted.
 Hybrid merge-join: if one relation is sorted, and the other has a secondary B+-tree index on the join attribute
 Merge the sorted relation with the leaf entries of the B+-tree.
 Sort the result on the addresses of the unsorted relation’s tuples
 Scan the unsorted relation in physical address order and merge with the previous result, to replace addresses by the actual tuples
 Sequential scan is more efficient than random lookup
HASH JOIN
 Applicable for equi-joins and natural joins.
 A hash function h is used to partition the tuples of both relations
 Intuition: partitions fit in memory
 h maps JoinAttrs values to {0, 1, . . . , n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
 r0, r1, . . . , rn denote partitions of r tuples
 Each tuple tr ∈ r is put in partition ri, where i = h(tr[JoinAttrs]).
 s0, s1, . . . , sn denote partitions of s tuples
 Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
 Note: in the book, ri is denoted as Hri, si is denoted as Hsi and n is denoted as nh.

[CONTD…]
 r tuples in ri need only be compared with s tuples in si; they need not be compared with s tuples in any other partition, since:
 an r tuple and an s tuple that satisfy the join condition will have the same value for the join attributes.
 If that value is hashed to some value i, the r tuple has to be in ri and the s tuple in si.

[CONTD…]
 The hash join of r and s is computed as follows.
1. Partition the relation s using hash function h.
 When partitioning a relation, one block of memory is reserved as the output buffer for each partition, and one block for input
 If extra memory is available, allocate bb blocks as buffer for the input and for each output
2. Partition r similarly.

[CONTD…]
3. For each partition i:
(a) Load si into memory and build an in-memory hash index on it using the join attribute.
 This hash index uses a different hash function than the earlier one, h.
(b) Read the tuples in ri from the disk one by one.
 For each tuple tr, probe the in-memory hash index to find all matching tuples ts in si
 For each matching tuple ts in si
 output the concatenation of the attributes of tr and ts
 Relation s is called the build input and r is called the probe input. (A runnable sketch of the partition/build/probe phases follows.)
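The following Python sketch mirrors the partition, build and probe phases on in-memory lists. The number of partitions and the join-attribute positions are made-up for the example; a real implementation writes partitions to disk and sizes n so each si fits in memory.

    def hash_join(r, s, r_key, s_key, n=4):
        # Phase 1: partition both inputs on a common hash function h.
        h = lambda v: hash(v) % n
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for tr in r:
            r_parts[h(tr[r_key])].append(tr)
        for ts in s:
            s_parts[h(ts[s_key])].append(ts)

        result = []
        for i in range(n):
            # Phase 2: build an in-memory hash index on partition s_i
            # (a different "hash function": Python's dict hashing).
            index = {}
            for ts in s_parts[i]:
                index.setdefault(ts[s_key], []).append(ts)
            # Phase 3: probe the index with the tuples of r_i.
            for tr in r_parts[i]:
                for ts in index.get(tr[r_key], []):
                    result.append(tr + ts)
        return result

    depositor = [("Johnson", "A-101"), ("Smith", "A-215")]
    customer  = [("Johnson", "Brooklyn"), ("Smith", "Harrison")]
    print(hash_join(depositor, customer, r_key=0, s_key=0))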
[CONTD…]
 The value n and the hash function h are chosen such that each si fits in memory.
 Typically n is chosen as ⌈bs / M⌉ ∗ f, where f is a “fudge factor”, typically around 1.2
 The probe relation partitions ri need not fit in memory
 Recursive partitioning is required if the number of partitions n is greater than the number of pages M of memory.
 instead of partitioning n ways, use M − 1 partitions for s
 further partition the M − 1 partitions using a different hash function
 use the same partitioning method on r
 Rarely required: e.g., recursive partitioning is not needed for relations of 1 GB or less with a memory size of 2 MB and a block size of 4 KB.

HANDLING OVERFLOW
 Partitioning is said to be skewed if some partitions have significantly more tuples than others
 Hash-table overflow occurs in partition si if si does not fit in memory. Reasons could be
 many tuples in s with the same value for the join attributes
 a bad hash function
 Overflow resolution can be done in the build phase
 Partition si is further partitioned using a different hash function.
 Partition ri must be similarly partitioned.
 Overflow avoidance performs partitioning carefully to avoid overflows during the build phase
 E.g., partition the build relation into many partitions, then combine them
 Both approaches fail with large numbers of duplicates
 Fallback option: use block nested-loops join on the overflowed partitions

[CONTD…]
 If recursive partitioning is not required, the cost of hash join is
3(br + bs) + 4 ∗ nh block transfers + 2(⌈br / bb⌉ + ⌈bs / bb⌉) seeks
 If recursive partitioning is required:
 the number of passes required for partitioning the build relation s is ⌈logM−1(bs)⌉ − 1
 best to choose the smaller relation as the build relation
 Total cost estimate is:
2(br + bs)(⌈logM−1(bs)⌉ − 1) + br + bs block transfers + 2(⌈br / bb⌉ + ⌈bs / bb⌉)(⌈logM−1(bs)⌉ − 1) seeks
 If the entire build input can be kept in main memory, no partitioning is required
 Cost estimate goes down to br + bs.
EXAMPLE
 Assume that the memory size is 20 blocks
 bdepositor = 100 and bcustomer = 400.
 depositor is to be used as the build input. Partition it into five partitions, each of size 20 blocks. This partitioning can be done in one pass.
 Similarly, partition customer into five partitions, each of size 80. This is also done in one pass.
 Therefore the total cost, ignoring the cost of writing partially filled blocks:
 3(100 + 400) = 1500 block transfers + 2(⌈100/3⌉ + ⌈400/3⌉) = 336 seeks

HYBRID HASH JOIN
 Useful when memory sizes are relatively large, and the build input is bigger than memory.
 Main feature of hybrid hash join: keep the first partition of the build relation in memory.
 E.g., with a memory size of 25 blocks, depositor can be partitioned into five partitions, each of size 20 blocks.
 Division of memory:
 The first partition occupies 20 blocks of memory
 1 block is used for input, and 1 block each for buffering the other 4 partitions.
 customer is similarly partitioned into five partitions, each of size 80
 the first is used right away for probing, instead of being written out
 Cost of 3(80 + 320) + 20 + 80 = 1300 block transfers for hybrid hash join, instead of 1500 with plain hash join.
 Hybrid hash join is most useful if M >> √bs
QUERY OPTIMIZATION USING HEURISTICS AND COST ESTIMATION

INTRODUCTION
 Alternative ways of evaluating a given query
 Equivalent expressions
 Different algorithms for each operation

[CONTD…]
 An evaluation plan defines exactly what algorithm is used for each operation, and how the execution of the operations is coordinated.

[CONTD…]
 The cost difference between evaluation plans for a query can be enormous
 E.g., seconds vs. days in some cases
 Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate the resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
 Estimation of plan cost is based on:
 Statistical information about relations. Examples:
 number of tuples, number of distinct values for an attribute
 Statistics estimation for intermediate results
 to compute the cost of complex expressions
 Cost formulae for algorithms, computed using statistics
GENERATING EQUIVALENT EXPRESSIONS – TRANSFORMATION OF RELATIONAL EXPRESSIONS
 Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance
 Note: the order of tuples is irrelevant
 In SQL, inputs and outputs are multisets of tuples
 Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance.
 An equivalence rule says that expressions of two forms are equivalent
 Can replace an expression of the first form by the second, or vice versa
GENERATING EQUIVALENT EXPRESSIONS – EQUIVALENCE RULES
1. Conjunctive selection operations can be deconstructed into a sequence of individual selections:
σθ1∧θ2(E) = σθ1(σθ2(E))
2. Selection operations are commutative:
σθ1(σθ2(E)) = σθ2(σθ1(E))
3. Only the last in a sequence of projection operations is needed, the others can be omitted:
ΠL1(ΠL2(. . . (ΠLn(E)) . . .)) = ΠL1(E)
4. Selections can be combined with Cartesian products and theta joins:
a. σθ(E1 × E2) = E1 ⋈θ E2
b. σθ1(E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2

[CONTD…]
5. Theta-join operations (and natural joins) are commutative:
E1 ⋈θ E2 = E2 ⋈θ E1
6. (a) Natural join operations are associative:
(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
(b) Theta joins are associative in the following manner:
(E1 ⋈θ1 E2) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3)
where θ2 involves attributes from only E2 and E3.

[CONTD…]
7. The selection operation distributes over the theta-join operation under the following two conditions:
(a) When all the attributes in θ0 involve only the attributes of one of the expressions (E1) being joined:
σθ0(E1 ⋈θ E2) = (σθ0(E1)) ⋈θ E2
(b) When θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2:
σθ1∧θ2(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2))

[CONTD…]
8. The projection operation distributes over the theta-join operation as follows:
(a) if θ involves only attributes from L1 ∪ L2:
ΠL1∪L2(E1 ⋈θ E2) = (ΠL1(E1)) ⋈θ (ΠL2(E2))
(b) Consider a join E1 ⋈θ E2.
 Let L1 and L2 be sets of attributes from E1 and E2, respectively.
 Let L3 be attributes of E1 that are involved in the join condition θ, but are not in L1 ∪ L2, and
 let L4 be attributes of E2 that are involved in the join condition θ, but are not in L1 ∪ L2. Then:
ΠL1∪L2(E1 ⋈θ E2) = ΠL1∪L2((ΠL1∪L3(E1)) ⋈θ (ΠL2∪L4(E2)))

[CONTD…]
9. The set operations union and intersection are commutative:
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
 (set difference is not commutative).
10. Set union and intersection are associative:
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
11. The selection operation distributes over ∪, ∩ and −:
σθ(E1 − E2) = σθ(E1) − σθ(E2), and similarly for ∪ and ∩ in place of −
Also: σθ(E1 − E2) = σθ(E1) − E2, and similarly for ∩ in place of −, but not for ∪
12. The projection operation distributes over union:
ΠL(E1 ∪ E2) = (ΠL(E1)) ∪ (ΠL(E2))
EXAMPLE
 Query: find the names of all customers with an account at a Brooklyn branch whose account balance is over $1000.
Πcustomer_name(σbranch_city = “Brooklyn” ∧ balance > 1000(branch ⋈ (account ⋈ depositor)))
 Transformation using join associativity (Rule 6a):
Πcustomer_name((σbranch_city = “Brooklyn” ∧ balance > 1000(branch ⋈ account)) ⋈ depositor)
 The second form provides an opportunity to apply the “perform selections early” rule, resulting in the subexpression
σbranch_city = “Brooklyn”(branch) ⋈ σbalance > 1000(account)
 Thus a sequence of transformations can be useful

TRANSFORMATION EXAMPLE: PUSHING PROJECTIONS
 When we compute (σbranch_city = “Brooklyn”(branch) ⋈ account) we obtain a relation whose schema is:
(branch_name, branch_city, assets, account_number, balance)
 Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get:
Πcustomer_name((Πaccount_number(σbranch_city = “Brooklyn”(branch) ⋈ account)) ⋈ depositor)
 Performing the projection as early as possible reduces the size of the relation to be joined, compared with
Πcustomer_name((σbranch_city = “Brooklyn”(branch) ⋈ account) ⋈ depositor)
JOIN ORDERING EXAMPLE
 For all relations r1, r2, and r3,
(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3) (join associativity)
 If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose
(r1 ⋈ r2) ⋈ r3
so that we compute and store a smaller temporary relation.

JOIN ORDERING EXAMPLE (CONT.)
 Consider the expression
Πcustomer_name((σbranch_city = “Brooklyn”(branch)) ⋈ (account ⋈ depositor))
 Could compute account ⋈ depositor first, and join the result with σbranch_city = “Brooklyn”(branch), but account ⋈ depositor is likely to be a large relation.
 Only a small fraction of the bank’s customers are likely to have accounts in branches located in Brooklyn
 it is better to compute σbranch_city = “Brooklyn”(branch) ⋈ account first.
GENERATING EQUIVALENT EXPRESSIONS – ENUMERATION OF EQUIVALENT EXPRESSIONS
 Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression
 Can generate all equivalent expressions as follows:
 Repeat
 apply all applicable equivalence rules on every equivalent expression found so far
 add newly generated expressions to the set of equivalent expressions
until no new equivalent expressions are generated
 The above approach is very expensive in space and time
 Two approaches
 Optimized plan generation based on transformation rules
 Special-case approach for queries with only selections, projections and joins

GENERATING EQUIVALENT EXPRESSIONS – IMPLEMENTING TRANSFORMATION-BASED OPTIMIZATION
 Space requirements are reduced by sharing common sub-expressions:
 when E1 is generated from E2 by an equivalence rule, usually only the top level of the two is different; the subtrees below are the same and can be shared using pointers
 E.g., when applying join commutativity
 The same sub-expression may get generated multiple times
 Detect duplicate sub-expressions and share one copy
 Time requirements are reduced by not generating all expressions
 Dynamic programming
 We will study only the special case of dynamic programming for join order optimization
COST ESTIMATION
 Cost of each operator is computed as described in Chapter 13
 Needs statistics of the input relations
 E.g., number of tuples, sizes of tuples
 Inputs can be the results of sub-expressions
 Need to estimate the statistics of expression results
 To do so, we require additional statistics
 E.g., number of distinct values for an attribute
 More on cost estimation later

CHOICE OF EVALUATION PLANS
 Must consider the interaction of evaluation techniques when choosing evaluation plans
 choosing the cheapest algorithm for each operation independently may not yield the best overall algorithm. E.g.,
 merge join may be costlier than hash join, but may provide a sorted output which reduces the cost of an outer-level aggregation.
 nested-loop join may provide an opportunity for pipelining
 Practical query optimizers incorporate elements of the following two broad approaches:
1. Search all the plans and choose the best plan in a cost-based fashion.
2. Use heuristics to choose a plan.
COST-BASED OPTIMIZATION
 Consider finding the best join order for r1 ⋈ r2 ⋈ . . . ⋈ rn.
 There are (2(n − 1))!/(n − 1)! different join orders for the above expression. With n = 7, the number is 665,280; with n = 10, it is greater than 17.6 billion! (A quick check of these counts appears below.)
 No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, . . . , rn} is computed only once and stored for future use.
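The growth of the join-order count is easy to verify directly from the formula:

    from math import factorial

    def join_orders(n):
        # Number of different join orders for r1 ⋈ r2 ⋈ ... ⋈ rn:
        # (2(n-1))! / (n-1)!
        return factorial(2 * (n - 1)) // factorial(n - 1)

    print(join_orders(7))    # 665280
    print(join_orders(10))   # 17643225600, i.e. about 17.6 billion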
DYNAMIC PROGRAMMING IN OPTIMIZATION
 To find the best join tree for a set of n relations:
 To find the best plan for a set S of n relations, consider all possible plans of the form S1 ⋈ (S − S1), where S1 is any non-empty subset of S.
 Recursively compute the costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2^n − 1 alternatives.
 Base case for the recursion: single relation access plan
 Apply all selections on Ri using the best choice of indices on Ri
 When the plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it
 Dynamic programming

JOIN ORDER OPTIMIZATION ALGORITHM
procedure findbestplan(S)
if (bestplan[S].cost ≠ ∞)
return bestplan[S]
// else bestplan[S] has not been computed earlier, compute it now
if (S contains only 1 relation)
set bestplan[S].plan and bestplan[S].cost based on the best way of accessing S
/* using selections on S and indices on S */
else for each non-empty subset S1 of S such that S1 ≠ S
P1 = findbestplan(S1)
P2 = findbestplan(S − S1)
A = best algorithm for joining the results of P1 and P2
cost = P1.cost + P2.cost + cost of A
if cost < bestplan[S].cost
bestplan[S].cost = cost
bestplan[S].plan = “execute P1.plan; execute P2.plan; join results of P1 and P2 using A”
return bestplan[S]
(A runnable Python rendering of this procedure follows.)
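A compact executable version of findbestplan, with the storage details abstracted away. The cost model is deliberately a toy (cost of a join = product of the input size estimates, result size likewise), just to show the memoized recursion; the relation names and cardinalities are made up.

    from functools import lru_cache

    sizes = {"r1": 1000, "r2": 50, "r3": 200}   # made-up cardinalities

    @lru_cache(maxsize=None)
    def find_best_plan(S: frozenset):
        # Returns (cost, plan string, estimated result size) for joining set S.
        if len(S) == 1:
            (r,) = S
            return (0, r, sizes[r])
        best = None
        members = sorted(S)
        for mask in range(1, 2 ** len(members) - 1):  # non-empty proper subsets S1
            S1 = frozenset(m for i, m in enumerate(members) if mask >> i & 1)
            c1, p1, n1 = find_best_plan(S1)
            c2, p2, n2 = find_best_plan(S - S1)
            cost = c1 + c2 + n1 * n2                  # toy join cost
            if best is None or cost < best[0]:
                best = (cost, f"({p1} JOIN {p2})", n1 * n2)
        return best

    print(find_best_plan(frozenset(sizes)))

The lru_cache plays the role of the bestplan[] array: each subset's plan is computed once and reused on every later lookup.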
LEFT DEEP JOIN TREES
 In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

COST OF OPTIMIZATION
 With dynamic programming, the time complexity of optimization with bushy trees is O(3^n).
 With n = 10, this number is about 59,000 instead of 17.6 billion!
 Space complexity is O(2^n)
 To find the best left-deep join tree for a set of n relations:
 Consider n alternatives with one relation as the right-hand-side input and the other relations as the left-hand-side input.
 Modify the optimization algorithm:
 Replace “for each non-empty subset S1 of S such that S1 ≠ S”
 by: for each relation r in S, let S1 = S − r.
 If only left-deep trees are considered, the time complexity of finding the best join order is O(n 2^n)
 Space complexity remains at O(2^n)
 Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10)

INTERESTING SORT ORDERS
 Consider the expression (r1 ⋈ r2) ⋈ r3 (with A as the common attribute)
 An interesting sort order is a particular sort order of tuples that could be useful for a later operation
 Using merge join to compute r1 ⋈ r2 may be costlier than hash join, but generates the result sorted on A
 which in turn may make merge join with r3 cheaper, reducing the cost of the join with r3 and minimizing the overall cost
 Sort order may also be useful for order by and for grouping
 Not sufficient to find the best join order for each subset of the set of n given relations
 must find the best join order for each subset, for each interesting sort order
 Simple extension of the earlier dynamic programming algorithm
 Usually, the number of interesting orders is quite small and doesn’t affect time/space complexity significantly
HEURISTIC OPTIMIZATION
 Cost-based optimization is expensive, even with dynamic programming.
 Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
 Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
 Perform selection early (reduces the number of tuples)
 Perform projection early (reduces the number of attributes)
 Perform the most restrictive selection and join operations (i.e., those with the smallest result size) before other similar operations.
 Some systems use only heuristics, others combine heuristics with partial cost-based optimization.

STRUCTURE OF QUERY OPTIMIZERS
 Many optimizers consider only left-deep join orders.
 Plus heuristics to push selections and projections down the query tree
 Reduces optimization complexity and generates plans amenable to pipelined evaluation.
 Heuristic optimization used in some versions of Oracle:
 Repeatedly pick the “best” relation to join next
 Starting from each of n starting points. Pick the best among these
 Intricacies of SQL complicate query optimization
 E.g., nested subqueries

[CONTD…]
 Some query optimizers integrate heuristic selection and the generation of alternative access plans.
 Frequently used approach
 heuristic rewriting of nested block structure and aggregation
 followed by cost-based join-order optimization for each block
 Some optimizers (e.g., SQL Server) apply transformations to the entire query and do not depend on block structure
 Even with the use of heuristics, cost-based query optimization imposes a substantial overhead.
 But it is worth it for expensive queries
 Optimizers often use simple heuristics for very cheap queries, and perform exhaustive enumeration for more expensive queries
STATISTICS FOR COST ESTIMATION – STATISTICAL INFORMATION FOR COST ESTIMATION
 nr: number of tuples in a relation r.
 br: number of blocks containing tuples of r.
 lr: size of a tuple of r.
 fr: blocking factor of r – i.e., the number of tuples of r that fit into one block.
 V(A, r): number of distinct values that appear in r for attribute A; same as the size of ΠA(r).
 If tuples of r are stored together physically in a file, then (worked example below):
br = ⌈nr / fr⌉
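For instance, with the depositor numbers used in the running example later in this section (ndepositor = 5000, fdepositor = 50), the block count works out as follows:

    from math import ceil

    n_r, f_r = 5000, 50      # tuples in depositor, tuples per block
    b_r = ceil(n_r / f_r)    # b_r = ceil(n_r / f_r)
    print(b_r)               # 100 blocks, matching b_depositor below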
STATISTICS FOR COST ESTIMATION – HISTOGRAMS
 Histogram on attribute age of relation person (figure omitted)
 Equi-width histograms
 Equi-depth histograms
STATISTICS FOR COST ESTIMATION – SELECTION SIZE ESTIMATION
 σA=v(r)
 nr / V(A, r): number of records that will satisfy the selection
 Equality condition on a key attribute: size estimate = 1
 σA≤v(r) (the case of σA≥v(r) is symmetric)
 Let c denote the estimated number of tuples satisfying the condition.
 If min(A, r) and max(A, r) are available in the catalog
 c = 0 if v < min(A, r)
 otherwise c = nr ∗ (v − min(A, r)) / (max(A, r) − min(A, r))
 If histograms are available, can refine the above estimate
 In the absence of statistical information, c is assumed to be nr / 2.

STATISTICS FOR COST ESTIMATION – SIZE ESTIMATION OF COMPLEX SELECTIONS
 The selectivity of a condition θi is the probability that a tuple in the relation r satisfies θi.
 If si is the number of satisfying tuples in r, the selectivity of θi is given by si / nr.
 Conjunction: σθ1∧θ2∧...∧θn(r). Assuming independence (see the sketch below), the estimated number of tuples in the result is:
nr ∗ (s1 ∗ s2 ∗ . . . ∗ sn) / nr^n
 Disjunction: σθ1∨θ2∨...∨θn(r). Estimated number of tuples:
nr ∗ (1 − (1 − s1/nr) ∗ (1 − s2/nr) ∗ . . . ∗ (1 − sn/nr))
 Negation: σ¬θ(r). Estimated number of tuples:
nr − size(σθ(r))
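These estimators translate directly into code. A small sketch under the same independence assumption, with made-up satisfying-tuple counts:

    def conjunction_estimate(n_r, sizes):
        # n_r * (s1 * s2 * ... * sk) / n_r^k
        est = n_r
        for s in sizes:
            est *= s / n_r
        return est

    def disjunction_estimate(n_r, sizes):
        # n_r * (1 - (1 - s1/n_r)(1 - s2/n_r)...(1 - sk/n_r))
        p_none = 1.0
        for s in sizes:
            p_none *= 1 - s / n_r
        return n_r * (1 - p_none)

    # e.g., two conditions individually selecting 1000 and 500 of 10000 tuples
    print(conjunction_estimate(10000, [1000, 500]))  # 50.0
    print(disjunction_estimate(10000, [1000, 500]))  # 1450.0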
STATISTICS FOR COST ESTIMATION – JOIN OPERATION: RUNNING EXAMPLE
 Running example: depositor ⋈ customer
 Catalog information for join examples:
 ncustomer = 10,000.
 fcustomer = 25, which implies that bcustomer = 10000/25 = 400.
 ndepositor = 5000.
 fdepositor = 50, which implies that bdepositor = 5000/50 = 100.
 V(customer_name, depositor) = 2500, which implies that, on average, each customer has two accounts.
 Also assume that customer_name in depositor is a foreign key on customer.
 V(customer_name, customer) = 10000 (primary key!)

STATISTICS FOR COST ESTIMATION – ESTIMATION OF THE SIZE OF JOINS
 The Cartesian product r × s contains nr ∗ ns tuples; each tuple occupies sr + ss bytes.
 If R ∩ S = ∅, then r ⋈ s is the same as r × s.
 If R ∩ S is a key for R, then a tuple of s will join with at most one tuple from r
 therefore, the number of tuples in r ⋈ s is no greater than the number of tuples in s.
 If R ∩ S in S is a foreign key in S referencing R, then the number of tuples in r ⋈ s is exactly the same as the number of tuples in s.
 The case for R ∩ S being a foreign key referencing S is symmetric.
 In the example query depositor ⋈ customer, customer_name in depositor is a foreign key of customer
 hence, the result has exactly ndepositor tuples, which is 5000

STATISTICS FOR COST ESTIMATION – ESTIMATION OF THE SIZE OF JOINS (CONT.)
 If R ∩ S = {A} is not a key for R or S, and we assume that every tuple t in R produces tuples in R ⋈ S, the number of tuples in R ⋈ S is estimated to be:
nr ∗ ns / V(A, s)
 If the reverse is true, the estimate obtained will be:
nr ∗ ns / V(A, r)
 The lower of these two estimates is probably the more accurate one.
 Can improve on the above if histograms are available
 Use a formula similar to the above, for each cell of the histograms on the two relations

[CONTD…]
 Compute the size estimates for depositor ⋈ customer without using information about foreign keys:
 V(customer_name, depositor) = 2500, and V(customer_name, customer) = 10000
 The two estimates are 5000 ∗ 10000/2500 = 20,000 and 5000 ∗ 10000/10000 = 5000
 We choose the lower estimate, which in this case is the same as our earlier computation using foreign keys.
STATISTICS FOR COST ESTIMATION – SIZE ESTIMATION FOR OTHER OPERATIONS
 Projection: estimated size of ΠA(r) = V(A, r)
 Aggregation: estimated size of AGF(r) = V(A, r), where A is the set of grouping attributes
 Set operations
 For unions/intersections of selections on the same relation: rewrite and use the size estimate for selections
 E.g., σθ1(r) ∪ σθ2(r) can be rewritten as σθ1∨θ2(r)
 For operations on different relations:
 estimated size of r ∪ s = size of r + size of s.
 estimated size of r ∩ s = minimum of the size of r and the size of s.
 estimated size of r − s = size of r.
 All three estimates may be quite inaccurate, but provide upper bounds on the sizes.

[CONTD…]
 Outer join:
 Estimated size of r ⟕ s (left outer join) = size of r ⋈ s + size of r
 The case of the right outer join is symmetric
 Estimated size of r ⟗ s (full outer join) = size of r ⋈ s + size of r + size of s

STATISTICS FOR COST ESTIMATION – ESTIMATION OF THE NUMBER OF DISTINCT VALUES
Selections: σθ(r)
 If θ forces A to take a specified value: V(A, σθ(r)) = 1.
 e.g., A = 3
 If θ forces A to take on one of a specified set of values: V(A, σθ(r)) = number of specified values.
 (e.g., (A = 1 ∨ A = 3 ∨ A = 4))
 If the selection condition θ is of the form A op v:
estimated V(A, σθ(r)) = V(A, r) ∗ s
 where s is the selectivity of the selection.
 In all other cases: use the approximate estimate of min(V(A, r), nσθ(r))
 A more accurate estimate can be obtained using probability theory, but this one works fine generally
[CONTD…]
Joins: r ⋈ s
 If all attributes in A are from r:
estimated V(A, r ⋈ s) = min(V(A, r), nr⋈s)
 If A contains attributes A1 from r and A2 from s, then:
estimated V(A, r ⋈ s) = min(V(A1, r) ∗ V(A2 − A1, s), V(A1 − A2, r) ∗ V(A2, s), nr⋈s)
 A more accurate estimate can be obtained using probability theory, but this one works fine generally

[CONTD…]
 Estimation of distinct values is straightforward for projections.
 They are the same in ΠA(r) as in r.
 The same holds for grouping attributes of aggregation.
 For aggregated values
 For min(A) and max(A), the number of distinct values can be estimated as min(V(A, r), V(G, r)), where G denotes the grouping attributes
 For other aggregates, assume all values are distinct, and use V(G, r)
OPTIMIZING NESTED SUBQUERIES
 Nested query example:
select customer_name
from borrower
where exists (select *
from depositor
where depositor.customer_name = borrower.customer_name)
 SQL conceptually treats nested subqueries in the where clause as functions that take parameters and return a single value or set of values
 Parameters are variables from the outer-level query that are used in the nested subquery; such variables are called correlation variables
 Conceptually, the nested subquery is executed once for each tuple in the cross-product generated by the outer-level from clause
 Such evaluation is called correlated evaluation
 Note: other conditions in the where clause may be used to compute a join (instead of a cross-product) before executing the nested subquery

[CONTD…]
 Correlated evaluation may be quite inefficient since
 a large number of calls may be made to the nested query
 there may be unnecessary random I/O as a result
 SQL optimizers attempt to transform nested subqueries to joins where possible, enabling the use of efficient join techniques
 E.g., the earlier nested query can be rewritten as
select customer_name
from borrower, depositor
where depositor.customer_name = borrower.customer_name
 Note: the two queries generate different numbers of duplicates (why?)
 borrower can have duplicate customer names
 Can be modified to handle duplicates correctly, as we will see
 In general, it is not possible/straightforward to move the entire nested subquery from clause into the outer-level query from clause
 A temporary relation is created instead, and used in the body of the outer-level query

[CONTD…]
In general, SQL queries of the form below can be rewritten as shown
 Rewrite:
select … from L1 where P1 and exists (select * from L2 where P2)
 To:
create table t1 as
select distinct V
from L2
where P2^1
select … from L1, t1 where P1 and P2^2
 P2^1 contains the predicates in P2 that do not involve any correlation variables
 P2^2 reintroduces the predicates involving correlation variables, with relations renamed appropriately
 V contains all attributes used in predicates with correlation variables

[CONTD…]
 In our example, the original nested query would be transformed to
create table t1 as
select distinct customer_name
from depositor
select customer_name
from borrower, t1
where t1.customer_name = borrower.customer_name
 The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation.
 Decorrelation is more complicated when
 the nested subquery uses aggregation, or
 the result of the nested subquery is used to test for equality, or
 the condition linking the nested subquery to the other query is not exists,
 and so on.
MATERIALIZED VIEWS
 A materialized view is a view whose contents are computed and stored.
 Consider the view
create view branch_total_loan(branch_name, total_loan) as
select branch_name, sum(amount)
from loan
group by branch_name
 Materializing the above view would be very useful if the total loan amount is required frequently
 Saves the effort of finding multiple tuples and adding up their amounts

MATERIALIZED VIEW MAINTENANCE
 The task of keeping a materialized view up-to-date with the underlying data is known as materialized view maintenance
 Materialized views can be maintained by recomputation on every update
 A better option is to use incremental view maintenance
 Changes to the database relations are used to compute changes to the materialized view, which is then updated
 View maintenance can be done by
 Manually defining triggers on insert, delete, and update of each relation in the view definition
 Manually writing code to update the view whenever database relations are updated
 Periodic recomputation (e.g., nightly)
 The above methods are directly supported by many database systems
 Avoids manual effort/correctness issues

INCREMENTAL VIEW MAINTENANCE
 The changes (inserts and deletes) to a relation or expression are referred to as its differential
 The sets of tuples inserted into and deleted from r are denoted ir and dr
 To simplify our description, we only consider inserts and deletes
 We replace an update to a tuple by a deletion of the tuple followed by an insertion of the updated tuple
 We describe how to compute the change to the result of each relational operation, given changes to its inputs
 We then outline how to handle relational algebra expressions
JOIN OPERATION
 Consider the materialized view v = r ⋈ s and an update to r
 Let rold and rnew denote the old and new states of relation r
 Consider the case of an insert to r:
 We can write rnew ⋈ s as (rold ∪ ir) ⋈ s
 and rewrite the above to (rold ⋈ s) ∪ (ir ⋈ s)
 But (rold ⋈ s) is simply the old value of the materialized view, so the incremental change to the view is just ir ⋈ s
 Thus, for inserts: vnew = vold ∪ (ir ⋈ s)
 Similarly, for deletes: vnew = vold − (dr ⋈ s)
 E.g., with r = {(A, 1), (B, 2)} and s = {(1, p), (2, r), (2, s)}, inserting (C, 2) into r adds (C, 2, r) and (C, 2, s) to the view (a short sketch follows)
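A direct transcription of these two identities, treating relations as in-memory sets of tuples; the joining attribute positions (second of r, first of s) are an assumption for this example:

    def join(r, s):
        # Natural-join-like step: match r's 2nd attribute with s's 1st.
        return {tr + ts[1:] for tr in r for ts in s if tr[1] == ts[0]}

    r = {("A", 1), ("B", 2)}
    s = {(1, "p"), (2, "r"), (2, "s")}
    v = join(r, s)                  # materialized view v = r join s

    i_r = {("C", 2)}                # tuples inserted into r
    v = v | join(i_r, s)            # v_new = v_old union (i_r join s)

    d_r = {("A", 1)}                # tuples deleted from r
    v = v - join(d_r, s)            # v_new = v_old minus (d_r join s)
    print(sorted(v))  # [('B', 2, 'r'), ('B', 2, 's'), ('C', 2, 'r'), ('C', 2, 's')]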
  • 122. SELECTION AND PROJECTION OPERATIONS  Selection: Consider a view v = σθ(r)  v_new = v_old ∪ σθ(i_r)  v_new = v_old − σθ(d_r)  Projection is a more difficult operation  R = (A,B), and r(R) = { (a,2), (a,3) }  ΠA(r) has a single tuple (a).  If we delete the tuple (a,2) from r, we should not delete the tuple (a) from ΠA(r), but if we then delete (a,3) as well, we should delete the tuple  For each tuple in a projection ΠA(r), we keep a count of how many times it was derived  On insert of a tuple into r, if the resultant tuple is already in ΠA(r) we increment its count, else we add a new tuple with count = 1  On delete of a tuple from r, we decrement the count of the corresponding tuple in ΠA(r)  if the count becomes 0, we delete the tuple from ΠA(r) 122 PreparedbyR.Arthy,AP/IT,KCET
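A sketch of the derivation-count scheme for ΠA(r); the dict-based counter and the function names are illustrative assumptions, not a DBMS API:

    from collections import Counter

    counts = Counter()   # counts[t] = number of r-tuples projecting to t

    def on_insert(t_projected):
        counts[t_projected] += 1         # one more derivation of t

    def on_delete(t_projected):
        counts[t_projected] -= 1         # one derivation gone
        if counts[t_projected] == 0:
            del counts[t_projected]      # last derivation: drop t from the view

    # The slide's example: r = {(a,2), (a,3)}, view = ΠA(r) = {(a)}
    on_insert('a'); on_insert('a')   # both r-tuples project to (a): count = 2
    on_delete('a')                   # delete (a,2): (a) survives with count 1
    on_delete('a')                   # delete (a,3): count hits 0, (a) is removed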
  • 123. AGGREGATION OPERATIONS  count: v = A γ count(B) (r), i.e., group on A and count the B values per group  When a set of tuples i_r is inserted  For each tuple t in i_r, if the corresponding group t.A is already present in v, we increment its count, else we add a new tuple with count = 1  When a set of tuples d_r is deleted  for each tuple t in d_r we look for the group t.A in v, and subtract 1 from the count for the group.  If the count becomes 0, we delete from v the tuple for the group t.A  sum: v = A γ sum(B) (r)  We maintain the sum in a manner similar to count, except we add/subtract the B value instead of adding/subtracting 1 for the count  Additionally we maintain the count in order to detect groups with no tuples; such groups are deleted from v  We cannot simply test for sum = 0, since a non-empty group's B values may happen to sum to 0  To handle avg, we maintain the sum and count aggregate values separately, and divide at the end 123 PreparedbyR.Arthy,AP/IT,KCET
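A combined sketch for count, sum, and avg maintenance under the same illustrative assumptions (group key t.A, aggregated value t.B):

    groups = {}   # group key -> [count, sum]

    def on_insert(key, b):
        entry = groups.setdefault(key, [0, 0])
        entry[0] += 1          # maintain count
        entry[1] += b          # maintain sum

    def on_delete(key, b):
        entry = groups[key]
        entry[0] -= 1
        entry[1] -= b
        if entry[0] == 0:      # test the count, not the sum: B values that
            del groups[key]    # happen to sum to 0 must not delete the group

    def avg(key):
        count, total = groups[key]
        return total / count   # avg is derived from the two maintained values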
  • 124. [CONTD…]  min, max: v = Agmin (B) (r).  Handling insertions on r is straightforward.  Maintaining the aggregate values min and max on deletions may be more expensive. We have to look at the other tuples of r that are in the same group to find the new minimum 124 PreparedbyR.Arthy,AP/IT,KCET
  • 125. OTHER OPERATIONS  Set intersection: v = r ∩ s  when a tuple is inserted into r we check if it is present in s, and if so we add it to v.  If the tuple is deleted from r, we delete it from the intersection if it is present.  Updates to s are symmetric  The other set operations, union and set difference, are handled in a similar fashion.  Outer joins are handled in much the same way as joins, but with some extra work; we leave the details to you (a small sketch of intersection maintenance follows). 125 PreparedbyR.Arthy,AP/IT,KCET
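A minimal sketch of intersection maintenance, with v, r, s as Python sets (names and representation are illustrative only):

    # v = r ∩ s, maintained under updates to r (updates to s are symmetric)
    def on_insert_into_r(v, t, s):
        if t in s:             # the new r-tuple joins the intersection
            v.add(t)           # only if it already appears in s

    def on_delete_from_r(v, t):
        v.discard(t)           # remove t from v if it is present there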
  • 126. HANDLING EXPRESSIONS  To handle an entire expression, we derive expressions for computing the incremental change to the result of each sub-expression, starting from the smallest sub-expressions.  E.g. consider E1 ⋈ E2 where each of E1 and E2 may be a complex expression  Suppose the set of tuples to be inserted into E1 is given by D1  Computed earlier, since smaller sub-expressions are handled first  Then the set of tuples to be inserted into E1 ⋈ E2 is given by D1 ⋈ E2  This is just the usual way of maintaining joins 126 PreparedbyR.Arthy,AP/IT,KCET
  • 127. QUERY OPTIMIZATION AND MATERIALIZED VIEWS  Rewriting queries to use materialized views:  A materialized view v = r ⋈ s is available  A user submits a query r ⋈ s ⋈ t  We can rewrite the query as v ⋈ t  Whether to do so depends on cost estimates for the two alternatives  Replacing a use of a materialized view by the view definition:  A materialized view v = r ⋈ s is available, but without any index on it  User submits a query σA=10(v).  Suppose also that s has an index on the common attribute B, and r has an index on attribute A.  The best plan for this query may be to replace v by r ⋈ s, which can lead to the query plan σA=10(r) ⋈ s  The query optimizer should be extended to consider all the above alternatives and choose the best overall plan 127 PreparedbyR.Arthy,AP/IT,KCET
  • 128. MATERIALIZED VIEW SELECTION  Materialized view selection: “What is the best set of views to materialize?”  Index selection: “What is the best set of indices to create?”  closely related to materialized view selection  but simpler  Materialized view selection and index selection are based on the typical system workload (queries and updates)  Typical goal: minimize the time to execute the workload, subject to constraints on space and on the time taken for some critical queries/updates  One of the steps in database tuning  more on tuning in later chapters  Commercial database systems provide tools (called “tuning assistants” or “wizards”) to help the database administrator choose what indices and materialized views to create 128 PreparedbyR.Arthy,AP/IT,KCET
  • 129. CS8492 – DATABASE MANAGEMENT SYSTEMS UNIT V - ADVANCED TOPICS
  • 130. OUTLINE  Distributed Databases: Architecture  Data Storage, Transaction Processing  Object-based Databases: Object Database Concepts  Object-Relational features  ODMG Object Model, ODL, OQL  XML Databases: XML Hierarchical Model  DTD, XML Schema  XQuery  Information Retrieval: IR Concepts, Retrieval Models, Queries in IR systems. 130 PreparedbyR.Arthy,AP/IT,KCET
  • 132. OUTLINE  Distributed Database Systems – Introduction  Distributed Data Storage  Distributed Transaction  Commit Protocol 132 PreparedbyR.Arthy,AP/IT,KCET
  • 133. I. DISTRIBUTED DATABASE SYSTEM  A distributed database system consists of loosely coupled sites that share no physical component  Database systems that run on each site are independent of each other  Transactions may access data at one or more sites 133 PreparedbyR.Arthy,AP/IT,KCET
  • 134. TYPES OF DISTRIBUTED DATABASES  In a homogeneous distributed database  All sites have identical software  Are aware of each other and agree to cooperate in processing user requests.  Each site surrenders part of its autonomy in terms of right to change schemas or software  Appears to user as a single system  In a heterogeneous distributed database  Different sites may use different schemas and software  Difference in schema is a major problem for query processing  Difference in software is a major problem for transaction processing  Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing 134 PreparedbyR.Arthy,AP/IT,KCET
  • 135. II. DISTRIBUTED DATA STORAGE  There are two approaches to storing a relation in a distributed database:  Replication: The system maintains several identical replicas (copies) of the relation, and stores each replica at a different site. The alternative to replication is to store only one copy of the relation.  Fragmentation: The system partitions the relation into several fragments, and stores each fragment at a different site. 135 PreparedbyR.Arthy,AP/IT,KCET
  • 136. 1. DATA REPLICATION  A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites.  Full replication of a relation is the case where the relation is stored at all sites.  Fully redundant databases are those in which every site contains a copy of the entire database. 136 PreparedbyR.Arthy,AP/IT,KCET
  • 137. [CONTD…]  Advantages of Replication  Availability: failure of site containing relation r does not result in unavailability of r is replicas exist.  Parallelism: queries on r may be processed by several nodes in parallel.  Reduced data transfer: relation r is available locally at each site containing a replica of r.  Disadvantages of Replication  Increased cost of updates: each replica of relation r must be updated.  Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.  One solution: choose one copy as primary copy and apply concurrency control operations on primary copy 137 PreparedbyR.Arthy,AP/IT,KCET
  • 138. 2. DATA FRAGMENTATION  Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.  Horizontal fragmentation: each tuple of r is assigned to one or more fragments  Vertical fragmentation: the schema for relation r is split into several smaller schemas  All schemas must contain a common candidate key (or superkey) to ensure lossless join property.  A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate key.  Example : relation account with following schema  Account = (account_number, branch_name , balance ) 138 PreparedbyR.Arthy,AP/IT,KCET
  • 139. HORIZONTAL FRAGMENTATION OF ACCOUNT RELATION
account1 = σ branch_name=“Hillside” (account):
    branch_name   account_number   balance
    Hillside      A-305            500
    Hillside      A-226            336
    Hillside      A-155            62
account2 = σ branch_name=“Valleyview” (account):
    branch_name   account_number   balance
    Valleyview    A-177            205
    Valleyview    A-402            10000
    Valleyview    A-408            1123
    Valleyview    A-639            750
139 PreparedbyR.Arthy,AP/IT,KCET
  • 140. VERTICAL FRAGMENTATION OF EMPLOYEE_INFO RELATION
deposit1 = Π branch_name, customer_name, tuple_id (employee_info):
    branch_name   customer_name   tuple_id
    Hillside      Lowman          1
    Hillside      Camp            2
    Valleyview    Camp            3
    Valleyview    Kahn            4
    Hillside      Kahn            5
    Valleyview    Kahn            6
    Valleyview    Green           7
deposit2 = Π account_number, balance, tuple_id (employee_info):
    account_number   balance   tuple_id
    A-305            500       1
    A-226            336       2
    A-177            205       3
    A-402            10000     4
    A-155            62        5
    A-408            1123      6
    A-639            750       7
140 PreparedbyR.Arthy,AP/IT,KCET
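Both fragmentation styles and their reconstruction properties can be sketched in a few lines of Python; the dict-based rows and three sample accounts follow the slides, while everything else is an illustrative assumption:

    # account = (account_number, branch_name, balance)
    account = [
        {'account_number': 'A-305', 'branch_name': 'Hillside',   'balance': 500},
        {'account_number': 'A-226', 'branch_name': 'Hillside',   'balance': 336},
        {'account_number': 'A-177', 'branch_name': 'Valleyview', 'balance': 205},
    ]

    # Horizontal fragmentation: selection; reconstruction is union
    account1 = [t for t in account if t['branch_name'] == 'Hillside']
    account2 = [t for t in account if t['branch_name'] == 'Valleyview']
    assert account1 + account2 == account      # lossless: union of fragments

    # Vertical fragmentation: projections plus tuple_id; reconstruction is a join
    with_id = [dict(t, tuple_id=i) for i, t in enumerate(account, start=1)]
    frag1 = [{'branch_name': t['branch_name'], 'tuple_id': t['tuple_id']}
             for t in with_id]
    frag2 = [{'account_number': t['account_number'], 'balance': t['balance'],
              'tuple_id': t['tuple_id']} for t in with_id]
    rejoined = [dict(a, **b) for a in frag1 for b in frag2
                if a['tuple_id'] == b['tuple_id']]   # recovers every row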
  • 141. ADVANTAGES OF FRAGMENTATION  Horizontal:  allows parallel processing on fragments of a relation  allows a relation to be split so that tuples are located where they are most frequently accessed  Vertical:  allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed  tuple-id attribute allows efficient joining of vertical fragments  Vertical and horizontal fragmentation can be mixed.  Fragments may be successively fragmented to an arbitrary depth.  Replication and fragmentation can be combined  Relation is partitioned into several fragments: system maintains several identical replicas of each such fragment. 141 PreparedbyR.Arthy,AP/IT,KCET
  • 142. 3. DATA TRANSPARENCY  Data transparency: Degree to which system user may remain unaware of the details of how and where the data items are stored in a distributed system  Consider transparency issues in relation to:  Fragmentation transparency  Replication transparency  Location transparency  Naming of data items: criteria 1. Every data item must have a system-wide unique name. 2. It should be possible to find the location of data items efficiently. 3. It should be possible to change the location of data items transparently. 4. Each site should be able to create new data items autonomously. 142 PreparedbyR.Arthy,AP/IT,KCET
  • 143. CENTRALIZED SCHEME - NAME SERVER  Structure:  name server assigns all names  each site maintains a record of local data items  sites ask name server to locate non-local data items  Advantages:  satisfies naming criteria 1-3  Disadvantages:  does not satisfy naming criterion 4  name server is a potential performance bottleneck  name server is a single point of failure 143 PreparedbyR.Arthy,AP/IT,KCET
  • 144. USE OF ALIASES  Alternative to centralized scheme: each site prefixes its own site identifier to any name that it generates, e.g., site17.account.  Fulfills having a unique identifier, and avoids problems associated with central control.  However, fails to achieve network transparency.  Solution: Create a set of aliases for data items; store the mapping of aliases to the real names at each site.  The user can be unaware of the physical location of a data item, and is unaffected if the data item is moved from one site to another. 144 PreparedbyR.Arthy,AP/IT,KCET
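A toy sketch of the alias idea: user programs name data items through a local alias table, so relocating an item only changes the mapping (the names and table layout here are assumptions for illustration):

    # per-site alias table: local alias -> globally unique, site-prefixed name
    aliases = {'account': 'site17.account'}

    def resolve(alias):
        return aliases.get(alias, alias)   # fall through for unaliased names

    resolve('account')                     # 'site17.account'
    aliases['account'] = 'site22.account'  # data item moved to another site
    resolve('account')                     # user programs are unaffected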
  • 145. III. DISTRIBUTED TRANSACTIONS SYSTEM ARCHITECTURE  Transaction may access data at several sites.  Each site has a local transaction manager responsible for:  Maintaining a log for recovery purposes  Participating in coordinating the concurrent execution of the transactions executing at that site.  Each site has a transaction coordinator, which is responsible for:  Starting the execution of transactions that originate at the site.  Distributing subtransactions at appropriate sites for execution.  Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites. 145 PreparedbyR.Arthy,AP/IT,KCET
  • 147. SYSTEM FAILURE MODES  Failures unique to distributed systems:  Failure of a site.  Loss of messages  Handled by network transmission control protocols such as TCP/IP  Failure of a communication link  Handled by network protocols, by routing messages via alternative links  Network partition  A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them  Note: a subsystem may consist of a single node  Network partitioning and site failures are generally indistinguishable. 147 PreparedbyR.Arthy,AP/IT,KCET
  • 148. COMMIT PROTOCOLS  Commit protocols are used to ensure atomicity across sites  a transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites.  not acceptable to have a transaction committed at one site and aborted at another  The two-phase commit (2PC) protocol is widely used  The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of two-phase commit protocol. This protocol is not used in practice. 148 PreparedbyR.Arthy,AP/IT,KCET
  • 149. TWO PHASE COMMIT PROTOCOL (2PC)  Assumes fail-stop model – failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites.  Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached.  The protocol involves all the local sites at which the transaction executed  Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci 149 PreparedbyR.Arthy,AP/IT,KCET
  • 150. PHASE 1: OBTAINING A DECISION  Coordinator asks all participants to prepare to commit transaction T.  Ci adds the record <prepare T> to the log and forces the log to stable storage  sends prepare T messages to all sites at which T executed  Upon receiving the message, the transaction manager at a site determines if it can commit the transaction  if not, add a record <no T> to the log and send an abort T message to Ci  if the transaction can be committed, then:  add the record <ready T> to the log  force all records for T to stable storage  send a ready T message to Ci 150 PreparedbyR.Arthy,AP/IT,KCET
  • 151. PHASE 2: RECORDING THE DECISION  T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.  Coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur)  Coordinator sends a message to each participant informing it of the decision (commit or abort)  Participants take appropriate action locally (a toy sketch of both phases follows). 151 PreparedbyR.Arthy,AP/IT,KCET
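A deliberately simplified, runnable Python sketch of both phases from the coordinator's side; logging is reduced to list appends, messages to method calls, and every name here is an assumption of this sketch rather than a real DBMS API:

    class Participant:
        def __init__(self, can_commit=True):
            self.can_commit = can_commit
            self.log = []

        def prepare(self, t):
            # Phase 1 at a participant: force <ready T> or <no T> to its log
            vote = 'ready' if self.can_commit else 'no'
            self.log.append((vote, t))
            return vote

        def notify(self, t, decision):
            # Phase 2 at a participant: record and obey the global decision
            self.log.append((decision, t))

    def two_phase_commit(t, coordinator_log, participants):
        coordinator_log.append(('prepare', t))        # <prepare T>, forced to log
        votes = [p.prepare(t) for p in participants]  # collect ready/no votes
        decision = 'commit' if all(v == 'ready' for v in votes) else 'abort'
        coordinator_log.append((decision, t))         # irrevocable once logged
        for p in participants:
            p.notify(t, decision)
        return decision

    two_phase_commit('T1', [], [Participant(), Participant()])        # 'commit'
    two_phase_commit('T2', [], [Participant(), Participant(False)])   # 'abort'

A real implementation must also force each log record to stable storage before sending the corresponding message, and must handle the timeouts and recovery cases described next.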
  • 152. HANDLING OF FAILURES - SITE FAILURE  When a participating site Sk recovers, it examines its log to determine the fate of transactions active at the time of the failure.  Log contains <commit T> record: site executes redo (T)  Log contains <abort T> record: site executes undo (T)  Log contains <ready T> record: site must consult Ci to determine the fate of T.  If T committed, redo (T)  If T aborted, undo (T)  The log contains no control records concerning T  implies that Sk failed before responding to the prepare T message from Ci  Sk must execute undo (T) 152 PreparedbyR.Arthy,AP/IT,KCET
  • 153. HANDLING OF FAILURES - COORDINATOR FAILURE  If the coordinator fails while the commit protocol for T is executing, then the participating sites must decide on T’s fate: 1. If an active site contains a <commit T> record in its log, then T must be committed. 2. If an active site contains an <abort T> record in its log, then T must be aborted. 3. If some active participating site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T.  Can therefore abort T. 4. If none of the above cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>).  In this case the active sites must wait for Ci to recover, to find the decision.  Blocking problem: active sites may have to wait for the failed coordinator to recover. 153 PreparedbyR.Arthy,AP/IT,KCET
  • 154. HANDLING OF FAILURES - NETWORK PARTITION  If the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol.  If the coordinator and its participants belong to several partitions:  Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with failure of the coordinator.  No harm results, but sites may still have to wait for the decision from the coordinator.  The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partitions have failed, and follow the usual commit protocol.  Again, no harm results 154 PreparedbyR.Arthy,AP/IT,KCET
  • 156. OUTLINE  Object-based Databases: Object Database Concepts  Object-Relational features  Object Data Management Group (ODMG) Object Model  Object Definition Language (ODL)  Object Query Language (OQL) 156 PreparedbyR.Arthy,AP/IT,KCET
  • 157. I. OBJECT ORIENTED CONCEPTS  Extend the relational data model by including object orientation and constructs to deal with added data types.  Allow attributes of tuples to have complex types, including non-atomic values such as nested relations.  Preserve relational foundations, in particular the declarative access to data, while extending modeling power.  Upward compatibility with existing relational languages. 157 PreparedbyR.Arthy,AP/IT,KCET
  • 158. COMPLEX DATA TYPES  Motivation:  Permit non-atomic domains (atomic ≡ indivisible)  Example of a non-atomic domain: a set of integers, or a set of tuples  Allows more intuitive modeling for applications with complex data  Intuitive definition:  allow relations wherever we allow atomic (scalar) values: relations within relations  Retains the mathematical foundation of the relational model  Violates first normal form. 158 PreparedbyR.Arthy,AP/IT,KCET
  • 159. EXAMPLE OF A NESTED RELATION  Example: library information system  Each book has  title,  a set of authors,  Publisher, and  a set of keywords  Non-1NF relation books 159 PreparedbyR.Arthy,AP/IT,KCET
  • 160. [CONTD…] 4NF DECOMPOSITION OF NESTED RELATION  Remove awkwardness of flat-books by assuming that the following multivalued dependencies hold:  title ↠ author  title ↠ keyword  title ↠ pub-name, pub-branch  Decompose flat-books into 4NF using the schemas:  (title, author )  (title, keyword )  (title, pub-name, pub-branch ) 160 PreparedbyR.Arthy,AP/IT,KCET
  • 162. PROBLEM WITH 4NF SCHEME  4NF design requires users to include joins in their queries.  1NF relational view flat-books defined by join of 4NF relations:  eliminates the need for users to perform joins,  but loses the one-to-one correspondence between tuples and documents.  And has a large amount of redundancy  Nested relations representation is much more natural here. 162 PreparedbyR.Arthy,AP/IT,KCET
  • 163. II. OBJECT-RELATIONAL FEATURES  Structured types can be declared and used in SQL create type Name as (firstname varchar(20), lastname varchar(20)) final create type Address as (street varchar(20), city varchar(20), zipcode varchar(20)) not final  Note: final and not final indicate whether subtypes can be created  Structured types can be used to create tables with composite attributes create table customer ( name Name, address Address, dateOfBirth date)  Dot notation used to reference components: name.firstname 163 PreparedbyR.Arthy,AP/IT,KCET
  • 164. [CONTD…]  User-defined row types create type CustomerType as ( name Name, address Address, dateOfBirth date) not final  Can then create a table whose rows are a user-defined type create table customer of CustomerType 164 PreparedbyR.Arthy,AP/IT,KCET
  • 165. [CONTD…]  Alternative way of defining composite attributes in SQL is to use unnamed row types. create table person_r ( name row (firstname varchar(20), lastname varchar(20)), address row (street varchar(20), city varchar(20), zipcode varchar(9)), dateOfBirth date);  The query finds the last name and city of each person. select name.lastname, address.city from person; 165 PreparedbyR.Arthy,AP/IT,KCET
  • 166. [CONTD…]  Methods  Can add a method declaration with a structured type. method ageOnDate (onDate date) returns interval year  Method body is given separately. create instance method ageOnDate (onDate date) returns interval year for CustomerType begin return onDate - self.dateOfBirth; end  We can now find the age of each customer: select name.lastname, ageOnDate (current_date) from customer 166 PreparedbyR.Arthy,AP/IT,KCET
  • 167. [CONTD…]  Constructor create function Name (firstname varchar(20), lastname varchar(20)) returns Name begin set self.firstname = firstname; set self.lastname = lastname; end  Inserting insert into Person values (new Name(’John’, ’Smith’), new Address(’20 Main St’, ’New York’, ’11001’), date ’1960-8- 22’); 167 PreparedbyR.Arthy,AP/IT,KCET
  • 168. [CONTD…]  Inheritance  Suppose that we have the following type definition for people: create type Person (name varchar(20), address varchar(20))  Using inheritance to define the student and teacher types create type Student under Person (degree varchar(20), department varchar(20)) create type Teacher under Person (salary integer, department varchar(20))  Subtypes can redefine methods by using overriding method in place of method in the method declaration 168 PreparedbyR.Arthy,AP/IT,KCET
  • 169. [CONTD…]  Multiple Inheritance  SQL:1999 and SQL:2003 do not support multiple inheritance  If our type system supports multiple inheritance, we can define a type for teaching assistant as follows: create type Teaching Assistant under Student, Teacher  To avoid a conflict between the two occurrences of department we can rename them create type Teaching Assistant under Student with (department as student_dept ), Teacher with (department as teacher_dept ) 169 PreparedbyR.Arthy,AP/IT,KCET
  • 170. [CONTD…]  Array and Multiset Types in SQL  Example of array and multiset declaration: create type Publisher as (name varchar(20), branch varchar(20)) create type Book as (title varchar(20), author-array varchar(20) array [10], pub-date date, publisher Publisher, keyword-set varchar(20) multiset ) create table books of Book  Similar to the nested relation books, but with array of authors instead of set 170 PreparedbyR.Arthy,AP/IT,KCET
  • 171. [CONTD…]  Array construction array [‘Silberschatz’,`Korth’,`Sudarshan’]  Multisets  multisetset [‘computer’, ‘database’, ‘SQL’]  To create a tuple of the type defined by the books relation: (‘Compilers’, array[`Smith’,`Jones’], Publisher (`McGraw-Hill’,`New York’), multiset [`parsing’,`analysis’ ])  To insert the preceding tuple into the relation books insert into books values(‘Compilers’, array[`Smith’,`Jones’], Publisher (`McGraw-Hill’,`New York’), multiset [`parsing’,`analysis’ ]) 171 PreparedbyR.Arthy,AP/IT,KCET
  • 172. UNNESTING  The transformation of a nested relation into a form with fewer (or no) relation-valued attributes is called unnesting.  E.g. select title, A as author, publisher.name as pub_name, publisher.branch as pub_branch, K.keyword from books as B, unnest(B.author_array) as A (author), unnest(B.keyword_set) as K (keyword) 172 PreparedbyR.Arthy,AP/IT,KCET
  • 173. NESTING  Nesting is the opposite of unnesting, creating a collection-valued attribute  NOTE: SQL:1999 does not support nesting  Nesting can be done in a manner similar to aggregation, but using the function collect() in place of an aggregation operation, to create a multiset  To nest the flat-books relation on the attribute keyword: select title, author, Publisher (pub_name, pub_branch) as publisher, collect (keyword) as keyword_set from flat-books group by title, author, publisher  To nest on both authors and keywords: select title, collect (author) as author_set, Publisher (pub_name, pub_branch) as publisher, collect (keyword) as keyword_set from flat-books group by title, publisher 173 PreparedbyR.Arthy,AP/IT,KCET
  • 174. III. OBJECT DATA MANAGEMENT GROUP (ODMG) OBJECT MODEL  Provides a standard model for object databases  Supports object definition via ODL  Supports object querying via OQL  Supports a variety of data types and type constructors 174 PreparedbyR.Arthy,AP/IT,KCET
  • 175. ODMG OBJECTS AND LITERALS  The basic building blocks of the object model are  Objects  Literals  An object has four characteristics  Identifier: Unique system-wide identifier  Name: Unique within a particular database and/or program; it is optional  Lifetime: persistent vs transient  Structure: specifies how object is constructed by the type constructor and whether it is an atomic object 175 PreparedbyR.Arthy,AP/IT,KCET
  • 176. [CONTD…]  A literal has a current value but not an identifier  Three types of literals  Atomic: predefined; basic data type values (e.g. short, float, boolean, char)  Structured: values that are constructed by type constructors (e.g. date, struct variables)  Collection: a collection (e.g. array) of values or objects 176 PreparedbyR.Arthy,AP/IT,KCET
  • 177. [CONTD…]  ODMG supports two concepts for specifying object types:  Interface  Class  There are similarities and differences between interfaces and classes  Both have behaviors (operations) and state (attributes and relationships) 177 PreparedbyR.Arthy,AP/IT,KCET
  • 178. ODMG INTERFACE  An interface is a specification of the abstract behavior of an object type  State properties of an interface (i.e. its attributes and relationships) cannot be inherited from it  Objects cannot be instantiated from an interface 178 PreparedbyR.Arthy,AP/IT,KCET
  • 179. ODMG INTERFACE DEFINITION interface Date:Object { enum weekday{sun, mon, tue, wed, thu, fri, sat}; enum month{jan, feb, mar, …, dec}; unsigned short year(); unsigned short month(); unsigned short day(); boolean is_equal(in Date other_date); }; 179 PreparedbyR.Arthy,AP/IT,KCET
  • 180. BUILT-IN INTERFACES FOR COLLECTION OBJECTS  A collection object inherits the basic collection interface, for example:  cardinality()  is_empty()  insert_element()  remove_element()  contains_element()  create_iterator() 180 PreparedbyR.Arthy,AP/IT,KCET
  • 181. COLLECTION TYPES  Collection objects are further specialized into types like a set, list, bag, array, and dictionary  Each collection type may provide additional interfaces, for example, a set provides:  create_union()  create_difference()  is_subset_of()  is_superset_of()  is_proper_subset_of() 181 PreparedbyR.Arthy,AP/IT,KCET
  • 183. ODMG CLASS  A class is a specification of the abstract behavior and state of an object type  A class is instantiable  Supports “extends” inheritance to allow both state and behavior inheritance among classes  Multiple inheritance via “extends” is not allowed 183 PreparedbyR.Arthy,AP/IT,KCET
  • 184. [CONTD…]  Atomic objects are user defined objects and are defined via keyword class  An example: class Employee(extend all_employees key ssn) { attribute string name; attribute string ssn; attribute short age; relationship dept works_for; void reassign(in string new_name); } 184 PreparedbyR.Arthy,AP/IT,KCET
  • 185. IV. OBJECT DEFINITION LANGUAGE (ODL)  ODL supports semantics constructs of ODMG  ODL is independent of any programming language  ODL is used to create object specification (classes and interfaces)  ODL is not used for database manipulation 185 PreparedbyR.Arthy,AP/IT,KCET
  • 186. EXAMPLE 1: A VERY SIMPLE CLASS  A very simple, straightforward class definition class Degree { attribute string college; attribute string degree; attribute string year; }; 186 PreparedbyR.Arthy,AP/IT,KCET
  • 187. EXAMPLE 2: A CLASS WITH KEY AND EXTENT class Person (extent persons key ssn) { attribute struct Pname {string fname …} name; attribute string ssn; attribute date birthdate; short age(); }; 187 PreparedbyR.Arthy,AP/IT,KCET
  • 188. EXAMPLE 3: A CLASS WITH RELATIONSHIPS class Faculty extends Person (extent faculty) { attribute string rank; attribute float salary; attribute string phone; relationship dept works_in inverse dept :: has_faculty; relationship set<GradStu> advises inverse GradStu :: advisor; void give_raise (in float raise); void promote (in string new_rank); }; 188 PreparedbyR.Arthy,AP/IT,KCET
  • 189. EXAMPLE 4: INHERITANCE interface Shape { attribute struct point {…} reference_point; float perimeter(); }; class Triangle : Shape (extent triangles) { attribute short side_1; attribute short side_2; }; 189 PreparedbyR.Arthy,AP/IT,KCET
  • 190. V. OBJECT QUERY LANGUAGE (OQL)  OQL is ODMG’s query language  OQL works closely with programming languages such as C++  Embedded OQL statements return objects that are compatible with the type system of the host language  OQL’s syntax is similar to SQL, with additional features for objects 190 PreparedbyR.Arthy,AP/IT,KCET
  • 191. SIMPLE OQL QUERIES  Basic syntax: select … from … where … select d.name from d in departments where d.college = ‘engineering’;  An entry point to the database is needed for each query  An extent name may serve as an entry point 191 PreparedbyR.Arthy,AP/IT,KCET
  • 192. ITERATOR VARIABLES  Iterator variables are defined whenever a collection is referenced in an OQL query  Iterator d in the previous example serves as an iterator and ranges over each object in the collection  Syntactical options for specifying an iterator:  d in departments  departments d  departments as d 192 PreparedbyR.Arthy,AP/IT,KCET
  • 193. DATA TYPE OF QUERY RESULTS  The data type of a query result can be any type defined in the ODMG model  A query does not have to follow the select … from … where … format  A persistent name on its own can serve as a query whose result is a reference to the persistent object.  For example, the query departments returns the set of all department objects; its type is set<Department> 193 PreparedbyR.Arthy,AP/IT,KCET
  • 194. PATH EXPRESSIONS  A path expression is used to specify a path to attributes and objects in an entry point  A path expression starts at a persistent object name  The name will be followed by zero or more dot connected relationship or attribute names  For example: departments.chair; 194 PreparedbyR.Arthy,AP/IT,KCET
  • 195. VIEWS AS NAMED OBJECTS  The define keyword in OQL is used to specify an identifier for a named query  The name should be unique; if not, the results will replace an existing named query  Once a query definition is created, it will persist until deleted or redefined  A view definition can include parameters 195 PreparedbyR.Arthy,AP/IT,KCET
  • 196. EXAMPLE  A view to include students in a department who have a minor define has_minor(dept_name) as select s from s in students where s.minor_in.dname = dept_name 196 PreparedbyR.Arthy,AP/IT,KCET
  • 197. SINGLE ELEMENTS FROM COLLECTIONS  An OQL query returns a collection  OQL’s element operator can be used to return a single element from a singleton collection that contains one element: element (select d from d in departments where d.name = ‘Web Programming’);  If the collection is empty or has more than one element, an exception is raised 197 PreparedbyR.Arthy,AP/IT,KCET
  • 198. COLLECTION OPERATORS  OQL supports a number of aggregate operators that can be applied to query results  The aggregate operators operate over a collection and include  Min  Max  Count  Sum  Avg  For example: avg (select s.gpa from s in students where s.class = ‘senior’ and s.majors_in.dname = ‘business’); 198 PreparedbyR.Arthy,AP/IT,KCET