Introduction to Database Access Methods

2 December 2005
Introduction to Databases
Access Methods
Prof. Beat Signer
Department of Computer Science
Vrije Universiteit Brussel
http://www.beatsigner.com

Beat Signer - Department of Computer Science - bsigner@vub.ac.be 2April 28, 2017
Context of Today's Lecture
Access
Methods
System
Buffers
Authorisation
Control
Integrity
Checker
Command
Processor
Program
Object Code
DDL
Compiler
File
Manager
Buffer
Manager
Recovery
Manager
Scheduler
Query
Optimiser
Transaction
Manager
Query
Compiler
Queries
Catalogue
Manager
DML
Preprocessor
Database
Schema
Application
Programs
Database
Manager
Data
Manager
DBMS
Programmers Users DB Admins
Based on 'Components of a DBMS', Database Systems,
T. Connolly and C. Begg, Addison-Wesley 2010
Data, Indices and
System Catalogue

Basic Index Concepts
 An index is used to efficiently access specific data
 e.g. index in a textbook
 Often database queries reference only a small number of
records in a file
 indices can be used to enhance the search for these records
 most DBMSs automatically create an index for primary keys
 A search key is an attribute (or set of attributes) that is
used to lookup records in a file
 An index file consists of (searchKey, pointer) index
entries
 normally much smaller than the original file
 pointer identifies a disk block and a record offset within the block

Basic Index Concepts ...
 There are two basic types of indices
 ordered indices
- based on a sorted ordering of search keys
 hash indices
- search keys are uniformely distributed to different buckets based on a hash
function
 Indexing techniques have to be evaluated based on
 access types
- e.g. records with a given attribute value or an attribute value in a specific range
 access time
 insertion and deletion time
- if the underlying data is changed, all indices have to be updated which may
result in a significant overhead for modifications
 space overhead

Ordered Indices
 An ordered index stores the search key values in sorted
order
 e.g. author catalogue in a library
 A primary index (clustering index) is an index whose
search key also defines the sequential order of the file
 the primary key is often used as a search key of a primary index
 files with a clustering index on the search key are called
index-sequential-files
 A secondary index (non-clustering index) is an index
whose search key has a different order than the
sequential file order

Dense Index
 A dense index contains an entry for each search key in
the file
 An index record of a dense primary index points to the
first file record with the given search key
 An index record of a dense secondary index stores a list
of pointers to the records with the same search key value
de Rover
Frei
Jones
Meier
Michelin
2 de Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
14 Jones Andrew Paris
53 Meier Beat Zurich
5 Michelin Robert Parisdense index
index-sequential file

Dense Index Insertion
find the search key of the record to be inserted in the index
if (index contains search key) {
if (index stores pointers to all records) {
add a pointer to the new record to the index entry
}
else { // index stores pointer to the first record only
ensure that the record is inserted after the first record with the
same search key
}
}
else { // index does not contain the search key
insert an index record with the search key at the appropriate position
}

Dense Index Deletion
D
 Update can be modelled as a delete followed by an insert
look up the record to be deleted
if (deleted record is the only one with the search key value) {
delete index record
}
else { // there are other records with the same search key
if (index stores pointers to all records) {
delete the corresponding pointer from the index record
}
else { // index stores pointer to the first record only
if (deleted record was the first record with the search key value) {
update the index record to point to the next record
}
}
}

Sparse Index
 Contains only some of the search key values
 can only be used for primary indices!
 To search a given record in a sparse index the following
steps have to be performed
1) find the index record ri with the largest search key value that is
smaller or equal to the search key value to be found
2) start a linear search at record ri until the record is found
de Rover
Jones
Michelin
1 Frei Urs Zurich
8 Frei Urs Zurich
5 Michelin Robert Paris
sparse index

Sparse Index
 A sparse index consumes less space and produces less
maintenance overhead for insertions and deletions
 On the other hand it takes more time to locate a record
based on a sparse index
 A good trade-off is a sparse index with one index record
for each block
...
sparse index
...
block 1
block 2

Sparse Index Insertion
// assumption: one index entry for each block
if (new block is created) {
insert first search key value in the new block into the index
}
else { // no new block
if (new record has the smallest search key value in the block) {
update the index record pointing to the block
}
}

Sparse Index Deletion
 Update can be modelled as a delete followed by an insert
if (index contains the search key value of deleted record) {
if (deleted record is the only one with the search key value) {
if (search key value of the next record is not in the index {
update the index record to point to the next record
}
else { // search key value of next record is already in the index
delete index entry
}
}
else { // there are other records with the same search key
if (deleted record was the first record with the search key value) {
update the index entry to point to the next record with the same
search key value
}
}

Multilevel Index
 A multilevel index can be used if the
index grows too large to be kept in memory
 inner index froms the primary index file
 outer index is a sparse index on the inner index
...
inner index
...
data
block 1
data
block 2
...
...
outer index

Multilevel Index ...
 If even the outer index grows too large to be kept in
memory, the process can be repeated
 note that instead of this type of multilevel index it is preferred to
use B+-trees
 The update of the multilevel index (on insertion and
deletion) is an extension of the updates described for the
single level indices

Secondary Index
 A secondary index has to be a dense index and must
point to all the records with the same search key value
 e.g. indirection via buckets that contain the pointers
 The use of a secondary index normally requires more
I/O operations than a primary index
Brussels
Paris
Zurich
1 Frei Urs Zurich
8 Frei Urs Zurich
secondary index
buckets

Index-Sequential File Problems
 Index-sequential files have a number of disadvantages
 degradation and loss of performance when the file grows since
many overflow blocks have to be created
 the entire file has to be periodically reorganised
 Alternative index structures that do not show a loss of
performance after updates can be used
 e.g. the B+-tree index file structure is a widely used multilevel
index structure that is based on a B+-tree (balanced and sorted )

B+-Tree
 Properties of a balanced B+-tree
 every path from the root to a leaf node has the same length
 for a given n, each leaf node has between n-1)/2 and n-1 values
- leaf nodes must always be at least half full
 each non-leaf (and non-root) node has between n/2 and n
children (fanout)
 typical B+-tree node
- K1,..., Km-1 are the ordered search key values (Ki < Ki+1)
- a non-leaf node has pointers P1,..., Pm to child nodes
- a leaf node has pointers P1,..., Pm to file records or to buckets of pointers
- typically the size of a disk block
 if the root node is not a leaf node it has at least 2 children;
otherwise it has between 0 and n-1 values
P1 K1 P2 ... Pm-1 Km-1 Pm ...

B+-Tree Leaf Nodes
 Leaf node example (n=3)
 for 1 i <n the pointers Pi
point to a file record with the
search key value Ki or to a bucket
of pointers for search keys that are
non-ordered non-candidate keys
 the pointer Pn points to the next leaf node in search key order
de Rover Frei
1 Frei Urs Zurich
8 Frei Urs Zurich
...
next leaf node
index-sequential file

B+-Tree Non-Leaf (Internal) Nodes
 The non-leaf nodes form a multilevel sparse index on the
leaf nodes
 For a non-leaf node with m pointers the following holds
 the values of the search keys in the subtree that P1 points to are
all smaller than K1
 for 2 i <m the search key values in the subtree to which Pi points
to have values greater or equal than Ki-1 and smaller than Ki
 the search keys in the subtree to which Pm points to have values
greather than or equal to Km-1
P1 K1 P2 ... Pm-1 Km-1 Pm ...

B+-Tree Example
 Example B+-tree for n=3
 root node must have at least two children
 non-leaf nodes must have between 2 and 3 children (for n=3)
 leaf nodes must have between 1 and 2 values (for n=3)
 note that the node size is normally defined by the block size and
the fanout is typically in the range of 50-100 (e.g. n=100)
Bush Frei Jones Meier Otlet Price
Jones Otlet
Meier

B+-Tree Properties
 Logically close blocks do not have to be physically close
since the inter-node connections are realised via pointers
 B+-trees generally have a small height
 if there are k search key values, the B+-tree tree height is not
greater than log n/2 (k)
- e.g. for 5'000'000 search keys and n = 100 we get a height not greater than 4
(note that the height would be 23 for a binary tree)
 search operations can be performed efficiently (without many
block accesses)

B+-Tree Lookup
 Find all records for a given search key v
Jones Otlet
Meier
c = root node
while (c is not a leaf node) {
let Ki = smallest search key value in c greater than v
if (Ki exists) {
c = node pointed to by Pi

B+-Tree Lookup ...
 The lookup requires O(lognk) I/O operations in worst case
}
else { // no search key greater than v
c = node pointed to by Pm // where m is the number of pointers
}
}
if (a key value Ki in c with the value v exists) {
pointer Pi leads to the desired record or bucket
}
else {
a record with the given search key value v does not exist
}

B+-Tree Insertion
(1)Find the leaf node in which the search key value would
appear using the lookup algorithm presented before and
add the new record to the file
(2)If the search key is already present in the leaf node
 if necessary add a new pointer to the bucket
(3)If the search key value is not yet present
 if there is room in the leaf node then insert a new index record
 if there is no room then split the node
- take the n index records (inlcuding the one to be inserted) in sorted order and
place the first n/2records in the original node and the rest in a new node
- let the new node be p and the smallest value in p be k. Insert (p,k) in the parent
node (if there is no space in the parent node, the split is propagated upwards)

B+-Tree Insertion ...
 In worst case, the root node has to be split and the
height of the tree is increased by 1
 The insertion requires O(lognk) I/O operations in worst
case

B+-Tree Insertion Example
 insertion of Codd
Jones Otlet
Meier
Bush Codd Meier Otlet Price
Frei Jones Otlet
Meier
JonesFrei

B+-Tree Deletion
(1)Find the record to be deleted by using the lookup
algorithm and delete it from the file
(2)Remove the search key and value from the leaf index
node if there is no bucket or if the bucket is empty
(3)If the node has no longer enough entries due to the
deleted index record and the entries in the node and a
sibling fit into a single node then merge the siblings
 insert the search key values of the two nodes into the left node
and delete the right node
 recursively delete the entry for the deleted node from the parent
node (propagation)

B+-Tree Deletion ...
(4)If the node has no longer enough entries due to the
deleted index record but the entries in the node and a
sibling do not fit into a single node then redistribute the
pointers
 redistribute the pointers between the node and a sibling such that
both have more than the minimum number of entries
 update the search key values in the parent node
 If the root node has only one pointer after deletion, it is
deleted and its child becomes the root
 The deletion of a previously located record requires
O(lognk) I/O operations in worst case
 note that there are variants where leaf nodes are not merged

B+-Tree Deletion Example
 deletion of Frei
Frei Jones Otlet
Meier
JonesFrei
Jones Otlet
Meier
Jones

B+-Tree Deletion Example ...
 deletion of Meier
Bush Codd Otlet PriceJones
Jones Otlet
Meier
Jones
Jones Otlet

B+-Tree Deletion Example ...
 deletion of Meier
Frei Jones Otlet
Meier
JonesFrei
Bush Codd Otlet Price
Frei Otlet
Jones
JonesFrei

Features of B+-Tree Index Files
 Advantages
 automatic reorganisation with small local changes for insertions
and deletions
 no reorganisation of the file is required to maintain the
performance (solves index-sequential file problem)
 Disadvantages
 insertion and deletion overhead
 space overhead
 The advantages of B+-trees outweigh the disadvantages
which makes them an extensively used data structure

B+-Tree File Organisation
 We can use the same B+-tree structure to organise the
records in a file and thereby solve the problem of data
file degradation
 In a B+-tree file organisation, the leaves of the tree store
records instead of pointers to records
 number of records in the leaf nodes normally smaller than the
number of pointers in non-leave nodes
 leaf nodes still have to be at least half full
 two siblings can be used in splits/merges for space optimisation
 adjacent nodes in the tree may not be continuous on the disk

Indexing Strings
 The use of strings as search keys introduces two major
problems
 the strings can be of variable length which leads to different
fanouts
 the split and merge operations should no longer be based on the
number of search keys but on the fraction of the used space
 The fanout of nodes can be increased by using a prefix
compression technique
 we no longer store the entire search key in non-leaf nodes but
only the part that is necessary to distinguish between the key
values in the subtrees

B+-Trees vs. B-Trees
 B-trees store every search key value only once
(no redundancy)
 additional pointers to the file records have to be added for non-
leaf nodes
 Advantages of B-trees
 may use less nodes than the corresponding B+-trees
 may find the search key before reaching the leaf node
 Disadvantages of B-trees
 non-leaf nodes get larger which results in a deeper trees
 insertion and deletion gets more complicated
 Normally the advantages of B-trees do not outweigh their
disadvantages

Multiple-Key Access
 For queries that need fast search on the combination of
multiple attributes we can define composite search keys
 e.g. search key (name, city) for customers
 The ordering of composite search keys is the
lexicographic ordering
 e.g. (a1, a2) < (b1, b2) if
- a1 <b1, or
- a1 = b1 and a2 <b2
 An index with composite search keys can be used to
efficiently answer queries of the form
 WHERE name = 'Max Frisch' AND city = 'Zurich'
 WHERE name = 'Max Frisch' AND city < 'Brussels'

Static Hashing
 A bucket is a storage unit containing one or more records
 typically a disk block
 In a hash file organisation we get the bucket in a file by
using a hash function on the search key value
 A hash function h is a function from the set of search key
values K to the set of buckets B
 distribution should be uniform and random
 The hash function can be used to locate, insert and
delete records
 A bucket may contain records with different search keys
 bucket has to be sequentially searched

Hash File Organisation Example
 example of a hash function: binary representation of search key
modulo 6
14 Jones Andrew Paris2 Rover Pieter Brussels
1 Frei Urs Zurich
8 Frei Urs Zurich
bucket 0 bucket 1
bucket 2 bucket 3
bucket 4 bucket 5

Bucket Overflows
 Buckets can overflow due to different reasons
 insufficent number of buckets
 a skew in the distribution of records
- many records with the same search key
- non-uniform hash function
 The possibility for bucket overflows can be reduced but
we cannot prevent overflows
 bucket overflows are handled by a linked list of additional
overflow buckets
 this form of hashing with overflow buckets is called closed
hashing
 Open hashing does not allow overflow buckets
 other existing buckets have to be "misused" (linear probing)

Hash Index
 Hashing cannot only be used for file organisation but
also for the creation of an index
 A hash index manages the search keys with their record
pointers in a hash file structure

1
53
Hash Index Example
1 Frei Urs Zurich
8 Frei Urs Zurich
5
bucket 0
bucket 1
14
bucket 2
2
bucket 3
8
overflow bucket

Static Hashing Problems
 The hash function maps the search key values to a fixed
number of buckets B but the database will grow or shrink
over time
 if the file grows and the initial number of buckets is too small, the
performance decreases since overflow buckets are necessary
 if space is preallocated for future growth, some space will be
wasted initially
 A possible solution would be to reorganise the file with a
new hash function from time to time
 too expensive and has to be performed exclusively
 An alternative is to dynamically modify the number of
buckets (dynamic hashing)

Ordered Indexing vs. Hashing
 The database designer has to choose which form of
index should be used based on different criteria
 cost of periodic reorganisation of the index
 frequency of update operations
 optimised average or worst case access time
 expected types of queries
- hashing is generally better for retrieving records with a specific attribute value
• constant average lookup time
• worst case lookup time proportional to the number of attribute values!
- ordered indices are preferred for range queries where the attribute value lies
within a range of values (e.g. WHERE A > C1 AND A < C2)
 In practice
 most DBMSs support an ordered B+-tree index but not all of them
support hash indices

Bitmap Index
 A bitmap index can be used to efficiently query on
multiple keys/attributes
 The following assumptions must hold
 records in a relation must be numbered sequentially (e.g. starting
from 0)
 given a number i, it must be easy to find the ith record
- easy if records have fixed size
 Bitmaps are applicable on attributes with a relatively
small number of distinct values
 e.g. gender, city, ...
 A bitmap is an array with as many bits as records
 the ith bit is 1 if record i has the desired attribute value

Bitmap Index ...
 Queries can be answered by using bitmap operations
 and (intersection), or (union) and not (complementation)
- e.g. find all males in Zurich: 101101 AND 011010 = 001000
- e.g. find females not living in Paris: 010010 AND 111010 = 010010
 Retrieve records based on the resulting bit array
 can also be used to count the number of records with a given condition
2 de Rover Pieter m Brussels
21 Brown Kathy f Zurich
8 Frei Urs m Zurich
14 Jones Andrew m Paris
9 Jones Lea f Zurich
5 Michelin Robert m Paris
101101
010010
100000
000101
011010
m
f
Brussel
Paris
Zurich
bitmaps for gender
bitmaps for location

Index Definition in SQL
 The create index command can be used to create an
index on a relation
 different DBMSs support different types of indices (e.g. B+-tree,
hash index, etc.)
 Example
CREATE INDEX nameIndex ON Customer USING BTREE (name);

Summary
 Different types of indices can be used to speed up the
search of specific records in a file
 The number and types of indices have to be designed
carefully since each index adds additional maintenance
costs on update operations
 The structures that are used for the indices (e.g. B+-tree
or hash index) can also be used to manage the records
in a file and reduce data file degradation

Homework
 Study the following chapter of the
Database System Concepts book
 chapter 11
- sections 11.1-11.11
- Indexing and Hashing

Exercise 9
 Storage, Indexing and Hashing

References
 A. Silberschatz, H. Korth and S. Sudarshan,
Database System Concepts (Sixth Edition),
McGraw-Hill, 2010

2 December 2005
Next Lecture
Query Processing and Optimisation

Introduction to Database Access Methods

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Viewers also liked

Viewers also liked (20)

Similar to Introduction to Database Access Methods

Similar to Introduction to Database Access Methods (20)

More from Beat Signer

More from Beat Signer (20)

Recently uploaded

Recently uploaded (20)

Introduction to Database Access Methods