Access Methods - Lecture 9 - Introduction to Databases (1007156ANR)
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Access Methods - Lecture 9 - Introduction to Databases (1007156ANR)

on

  • 4,789 views

This lecture is part of an Introduction to Databases course given at the Vrije Universiteit Brussel.

This lecture is part of an Introduction to Databases course given at the Vrije Universiteit Brussel.

Statistics

Views

Total Views
4,789
Views on SlideShare
4,772
Embed Views
17

Actions

Likes
1
Downloads
193
Comments
0

3 Embeds 17

http://www.inf.ethz.ch 7
http://www.slideshare.net 5
http://wise.vub.ac.be 5

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Access Methods - Lecture 9 - Introduction to Databases (1007156ANR) Presentation Transcript

  • 1. 2 December 2005 Introduction to Databases Access Methods Prof. Beat Signer Department of Computer Science Vrije Universiteit Brussel http://www.beatsigner.com
  • 2. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 2April 25, 2014 Context of Today's Lecture Access Methods System Buffers Authorisation Control Integrity Checker Command Processor Program Object Code DDL Compiler File Manager Buffer Manager Recovery Manager Scheduler Query Optimiser Transaction Manager Query Compiler Queries Catalogue Manager DML Preprocessor Database Schema Application Programs Database Manager Data Manager DBMS Programmers Users DB Admins Based on 'Components of a DBMS', Database Systems, T. Connolly and C. Begg, Addison-Wesley 2010 Data, Indices and System Catalogue
  • 3. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 3April 25, 2014 Basic Index Concepts  An index is used to efficiently access specific data  e.g. index in a textbook  Often database queries reference only a small number of records in a file  indices can be used to enhance the search for these records  most DBMSs automatically create an index for primary keys  A search key is an attribute (or set of attributes) that is used to lookup records in a file  An index file consists of (searchKey, pointer) index entries  normally much smaller than the original file  pointer identifies a disk block and a record offset within the block
  • 4. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 4April 25, 2014 Basic Index Concepts ...  There are two basic types of indices  ordered indices - based on a sorted ordering of search keys  hash indices - search keys are uniformely distributed to different buckets based on a hash function  Indexing techniques have to be evaluated based on  access types - e.g. records with a given attribute value or an attribute value in a specific range  access time  insertion and deletion time - if the underlying data is changed, all indices have to be updated which may result in a significant overhead for modifications  space overhead
  • 5. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 5April 25, 2014 Ordered Indices  An ordered index stores the search key values in sorted order  e.g. author catalogue in a library  A primary index (clustering index) is an index whose search key also defines the sequential order of the file  the primary key is often used as a search key of a primary index  files with a clustering index on the search key are called index-sequential-files  A secondary index (non-clustering index) is an index whose search key has a different order than the sequential file order
  • 6. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 6April 25, 2014 Dense Index  A dense index contains an entry for each search key in the file  An index record of a dense primary index points to the first file record with the given search key  An index record of a dense secondary index stores a list of pointers to the records with the same search key value de Rover Frei Jones Meier Michelin 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Parisdense index index-sequential file
  • 7. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 7April 25, 2014 Dense Index Insertion find the search key of the record to be inserted in the index if (index contains search key) { if (index stores pointers to all records) { add a pointer to the new record to the index entry } else { // index stores pointer to the first record only ensure that the record is inserted after all other records with the same search key } } else { // index does not contain the search key insert an index record with the search key at the appropriate position }
  • 8. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 8April 25, 2014 Dense Index Deletion D  Update can be modelled as a delete followed by an insert look up the record to be deleted if (deleted record is the only one with the search key value) { delete index record } else { // there are other records with the same search key if (index stores pointers to all records) { delete the corresponding pointer from the index record } else { // index stores pointer to the first record only if (deleted record was the first record with the search key value) { update the index record to point to the next record } } }
  • 9. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 9April 25, 2014 Sparse Index  Contains only some of the search key values  can only be used for primary indices!  To search a given record in a sparse index the following steps have to be performed 1) find the index record ri with the largest search key value that is smaller or equal to the search key value to be found 2) start a linear search at record ri until the record is found de Rover Jones Michelin 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Paris sparse index
  • 10. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 10April 25, 2014 Sparse Index  A sparse index consumes less space and produces less maintenance overhead for insertions and deletions  On the other hand it takes more time to locate a record based on a sparse index  A good trade-off is a sparse index with one index record for each block ... sparse index ... block 1 block 2
  • 11. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 11April 25, 2014 Sparse Index Insertion // assumption: one index entry for each block if (new block is created) { insert first search key value in the new block into the index } else { // no new block if (new record has the smallest search key value in the block) { update the index record pointing to the block } }
  • 12. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 12April 25, 2014 Sparse Index Deletion  Update can be modelled as a delete followed by an insert if (index contains the search key value of deleted record) { if (deleted record is the only one with the search key value) { if (search key value of the next record is not in the index { update the index record to point to the next record } else { // search key value of next record is already in the index delete index entry } } else { // there are other records with the same search key if (deleted record was the first record with the search key value) { update the index entry to point to the next record with the same search key value } }
  • 13. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 13April 25, 2014 Multilevel Index  A multilevel index can be used if the index grows too large to be kept in memory  inner index froms the primary index file  outer index is a sparse index on the inner index ... inner index ... data block 1 data block 2 ... ... outer index
  • 14. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 14April 25, 2014 Multilevel Index ...  If even the outer index grows too large to be kept in memory, the process can be repeated  note that instead of this type of multilevel index it is preferred to use B+-trees  The update of the multilevel index (on insertion and deletion) is an extension of the updates described for the single level indices
  • 15. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 15April 25, 2014 Secondary Index  A secondary index has to be a dense index and must point to all the records with the same search key value  e.g. indirection via buckets that contain the pointers  The use of a secondary index normally requires more I/O operations than a primary index Brussels Paris Zurich 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Paris secondary index buckets
  • 16. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 16April 25, 2014 Index-Sequential File Problems  Index-sequential files have a number of disadvantages  degradation and loss of performance when the file grows since many overflow blocks have to be created  the entire file has to be periodically reorganised  Alternative index structures that do not show a loss of performance after updates can be used  e.g. the B+-tree index file structure is a widely used multilevel index structure that is based on a B+-tree (balanced and sorted )
  • 17. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 17April 25, 2014 B+-Tree  Properties of a balanced B+-tree  every path from the root to a leaf node has the same length  for a given n, each leaf node has between n-1)/2 and n-1 values - leaf nodes must always be at least half full  each non-leaf (and non-root) node has between n/2 and n children (fanout)  typical B+-tree node - K1,..., Km-1 are the ordered search key values (Ki < Ki+1) - a non-leaf node has pointers P1,..., Pm to child nodes - a leaf node has pointers P1,..., Pm to file records or to buckets of pointers - typically the size of a disk block  if the root node is not a leaf node it has at least 2 children; otherwise it has between 0 and n-1 values P1 K1 P2 ... Pm-1 Km-1 Pm ...
  • 18. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 18April 25, 2014 B+-Tree Leaf Nodes  Leaf node example (n=3)  for 1  i < n the pointers Pi point to a file record with the search key value Ki or to a bucket of pointers for search keys that are non-ordered non-candidate keys  the pointer Pn points to the next leaf node in search key order de Rover Frei 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich ... next leaf node index-sequential file
  • 19. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 19April 25, 2014 B+-Tree Non-Leaf (Internal) Nodes  The non-leaf nodes form a multilevel sparse index on the leaf nodes  For a non-leaf node with m pointers the following holds  the values of the search keys in the subtree that P1 points to are all smaller than K1  for 2  i < m the search key values in the subtree to which Pi points to have values greater than or equal to Ki-1 and smaller than Ki  the search keys in the subtree to which Pm points to have values greather than or equal to Km-1 P1 K1 P2 ... Pm-1 Km-1 Pm ...
  • 20. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 20April 25, 2014 B+-Tree Example  Example B+-tree for n=3  root node must have at least two children  non-leaf nodes must have between 2 and 3 children (for n=3)  leaf nodes must have between 1 and 2 values (for n=3)  note that the node size is normally defined by the block size and the fanout is typically in the range of 50-100 (e.g. n=100) Bush Frei Jones Meier Otlet Price Jones Otlet Meier
  • 21. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 21April 25, 2014 B+-Tree Properties  Logically close blocks do not have to be physically close since the inter-node connections are realised via pointers  B+-trees generally have a small height  if there are k search key values, the B+-tree tree height is not greater than log n/2 (k) - e.g. for 5'000'000 search keys and n = 100 we get a height not greater than 4 (note that the height would be 23 for a binary tree)  search operations can be performed efficiently (without many block accesses)
  • 22. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 22April 25, 2014 B+-Tree Lookup  Find all records for a given search key v Bush Frei Jones Meier Otlet Price Jones Otlet Meier c = root node while (c is not a leaf node) { let Ki = smallest search key value in c greater than v if (Ki exists) { c = node pointed to by Pi
  • 23. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 23April 25, 2014 B+-Tree Lookup ...  The lookup requires O(lognk) I/O operations in worst case } else { // no search key greater than v c = node pointed to by Pm // where m is the number of pointers } } if (a key value Ki in c with the value v exists) { pointer Pi leads to the desired record or bucket } else { a record with the given search key value v does not exist }
  • 24. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 24April 25, 2014 B+-Tree Insertion (1)Find the leaf node in which the search key value would appear using the lookup algorithm presented before and add the new record to the file (2)If the search key is already present in the leaf node  if necessary add a new pointer to the bucket (3)If the search key value is not yet present  if there is room in the leaf node then insert a new index record  if there is no room then split the node - take the n index records (inlcuding the one to be inserted) in sorted order and place the first n/2 records in the original node and the rest in a new node - let the new node be p and the smallest value in p be k. Insert (p,k) in the parent node (if there is no space in the parent node, the split is propagated upwards)
  • 25. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 25April 25, 2014 B+-Tree Insertion ...  In worst case, the root node has to be split and the height of the tree is increased by 1  The insertion requires O(lognk) I/O operations in worst case
  • 26. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 26April 25, 2014 B+-Tree Insertion Example  insertion of Codd Bush Frei Jones Meier Otlet Price Jones Otlet Meier Bush Codd Meier Otlet Price Frei Jones Otlet Meier JonesFrei
  • 27. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 27April 25, 2014 B+-Tree Deletion (1)Find the record to be deleted by using the lookup algorithm and delete it from the file (2)Remove the search key and value from the leaf index node if there is no bucket or if the bucket is empty (3)If the node has no longer enough entries due to the deleted index record and the entries in the node and a sibling fit into a single node then merge the siblings  insert the search key values of the two nodes into the left node and delete the right node  recursively delete the entry for the deleted node from the parent node (propagation)
  • 28. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 28April 25, 2014 B+-Tree Deletion ... (4)If the node has no longer enough entries due to the deleted index record but the entries in the node and a sibling do not fit into a single node then redistribute the pointers  redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries  update the search key values in the parent node  If the root node has only one pointer after deletion, it is deleted and its child becomes the root  The deletion of a previously located record requires O(lognk) I/O operations in worst case  note that there are variants where leaf nodes are not merged
  • 29. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 29April 25, 2014 B+-Tree Deletion Example  deletion of Frei Bush Codd Meier Otlet Price Frei Jones Otlet Meier JonesFrei Bush Codd Meier Otlet Price Jones Otlet Meier Jones
  • 30. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 30April 25, 2014 B+-Tree Deletion Example ...  deletion of Meier Bush Codd Otlet PriceJones Bush Codd Meier Otlet Price Jones Otlet Meier Jones Jones Otlet
  • 31. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 31April 25, 2014 B+-Tree Deletion Example ...  deletion of Meier Bush Codd Meier Otlet Price Frei Jones Otlet Meier JonesFrei Bush Codd Otlet Price Frei Otlet Jones JonesFrei
  • 32. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 32April 25, 2014 Features of B+-Tree Index Files  Advantages  automatic reorganisation with small local changes for insertions and deletions  no reorganisation of the file is required to maintain the performance (solves index-sequential file problem)  Disadvantages  insertion and deletion overhead  space overhead  The advantages of B+-trees outweigh the disadvantages which makes them an extensively used data structure
  • 33. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 33April 25, 2014 B+-Tree File Organisation  We can use the same B+-tree structure to organise the records in a file and thereby solve the problem of data file degradation  In a B+-tree file organisation, the leaves of the tree store records instead of pointers to records  number of records in the leaf nodes normally smaller than the number of pointers on non-leave nodes  leaf nodes still have to be at least half full  two siblings can be used in splits/merges for space optimisation  adjacent nodes in the tree may not be continuous on the disk
  • 34. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 34April 25, 2014 Indexing Strings  The use of strings as search keys introduces two major problems  the strings can be of variable length which leads to different fanouts  the split and merge operations should no longer be based on the number of search keys but on the fraction of the used space  The fanout of nodes can be increased by using a prefix compression technique  we no longer store the entire search key in non-leaf nodes but only the part that is necessary to distinguish between the key values in the subtrees
  • 35. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 35April 25, 2014 B+-Trees vs. B-Trees  B-trees store every search key value only once (no redundancy)  additional pointers to the file records have to be added for non- leaf nodes  Advantages of B-trees  may use less nodes than the corresponding B+-trees  may find the search key before reaching the leaf node  Disadvantages of B-trees  non-leaf nodes get larger which results in a deeper trees  insertion and deletion gets more complicated  Normally the advantages of B-trees do not outweigh their disadvantages
  • 36. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 36April 25, 2014 Multiple-Key Access  For queries that need fast search on the combination of multiple attributes we can define composite search keys  e.g. search key (name, city) for customers  The ordering of composite search keys is the lexicographic ordering  e.g. (a1, a2) < (b1, b2) if - a1 < b1, or - a1 = b1 and a2 < b2  An index with composite search keys can be used to efficiently answer queries of the form  WHERE name = 'Max Frisch' AND city = 'Zurich'  WHERE name = 'Max Frisch' AND city < 'Brussels'
  • 37. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 37April 25, 2014 Static Hashing  A bucket is a storage unit containing one or more records  typically a disk block  In a hash file organisation we get the bucket in a file by using a hash function on the search key value  A hash function h is a function from the set of search key values K to the set of buckets B  distribution should be uniform and random  The hash function can be used to locate, insert and delete records  A bucket may contain records with different search keys  bucket has to be sequentially searched
  • 38. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 38April 25, 2014 Hash File Organisation Example  example of a hash function: binary representation of search key modulo 6 14 Jones Andrew Paris2 Rover Pieter Brussels 53 Meier Beat Zurich 1 Frei Urs Zurich 8 Frei Urs Zurich 5 Michelin Robert Paris bucket 0 bucket 1 bucket 2 bucket 3 bucket 4 bucket 5
  • 39. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 39April 25, 2014 Bucket Overflows  Buckets can overflow due to different reasons  insufficent number of buckets  a skew in the distribution of records - many records with the same search key - non-uniform hash function  The possibility for bucket overflows can be reduced but we cannot prevent overflows  bucket overflows are handled by a linked list of additional overflow buckets  this form of hashing with overflow buckets is called closed hashing  Open hashing does not allow overflow buckets  other existing buckets have to be "misused" (linear probing)
  • 40. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 40April 25, 2014 Hash Index  Hashing cannot only be used for file organisation but also for the creation of an index  A hash index manages the search keys with their record pointers in a hash file structure
  • 41. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 41April 25, 2014 1 53 Hash Index Example 2 de Rover Pieter Brussels 1 Frei Urs Zurich 8 Frei Urs Zurich 14 Jones Andrew Paris 53 Meier Beat Zurich 5 Michelin Robert Paris 5 bucket 0 bucket 1 14 bucket 2 2 bucket 3 8 overflow bucket
  • 42. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 42April 25, 2014 Static Hashing Problems  The hash function maps the search key values to a fixed number of buckets B but the database will grow or shrink over time  if the file grows and the initial number of buckets is too small, the performance decreases since overflow buckets are necessary  if space is preallocated for future growth, some space will be wasted initially  A possible solution would be to reorganise the file with a new hash function from time to time  too expensive and has to be performed exclusively  An alternative is to dynamically modify the number of buckets (dynamic hashing)
  • 43. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 43April 25, 2014 Ordered Indexing vs. Hashing  The database designer has to choose which form of index should be used based on different criteria  cost of periodic reorganisation of the index  frequency of update operations  optimised average or worst case access time  expected types of queries - hashing is generally better for retrieving records with a specific attribute value • constant average lookup time • worst case lookup time proportional to the number of attribute values! - ordered indices are preferred for range queries where the attribute value lies within a range of values (e.g. WHERE A > C1 AND A < C2)  In practice  most DBMSs support an ordered B+-tree index but not all of them support hash indices
  • 44. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 44April 25, 2014 Bitmap Index  A bitmap index can be used to efficiently query on multiple keys/attributes  The following assumptions must hold  records in a relation must be numbered sequentially (e.g. starting from 0)  given a number i, it must be easy to find the ith record - easy if records have fixed size  Bitmaps are applicable on attributes with a relatively small number of distinct values  e.g. gender, city, ...  A bitmap is an array with as many bits as records  the ith bit is 1 if record i has the desired attribute value
  • 45. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 45April 25, 2014 Bitmap Index ...  Queries can be answered by using bitmap operations  and (intersection), or (union) and not (complementation) - e.g. find all males in Zurich: 101101 AND 011010 = 001000 - e.g. find females not living in Paris: 010010 AND 111010 = 010010  Retrieve records based on the resulting bit array  can also be used to count the number of records with a given condition 2 de Rover Pieter m Brussels 21 Brown Kathy f Zurich 8 Frei Urs m Zurich 14 Jones Andrew m Paris 9 Jones Lea f Zurich 5 Michelin Robert m Paris 101101 010010 100000 000101 011010 m f Brussel Paris Zurich bitmaps for gender bitmaps for location
  • 46. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 46April 25, 2014 Index Definition in SQL  The create index command can be used to create an index on a relation  different DBMSs support different types of indices (e.g. B+-tree, hash index, etc.)  Example CREATE INDEX nameIndex ON Customer USING BTREE (name);
  • 47. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 47April 25, 2014 Summary  Different types of indices can be used to speed up the search of specific records in a file  The number and types of indices have to be designed carefully since each index adds additional maintenance costs on update operations  The structures that are used for the indices (e.g. B+-tree or hash index) can also be used to manage the records in a file and reduce data file degradation
  • 48. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 48April 25, 2014 Homework  Study the following chapter of the Database System Concepts book  chapter 11 - sections 11.1-11.11 - Indexing and Hashing
  • 49. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 49April 25, 2014 Exercise 9  Functional Dependencies and Normalisation
  • 50. Beat Signer - Department of Computer Science - bsigner@vub.ac.be 50April 25, 2014 References  A. Silberschatz, H. Korth and S. Sudarshan, Database System Concepts (Sixth Edition), McGraw-Hill, 2010
  • 51. 2 December 2005 Next Lecture Query Processing and Optimisation