files,indexing,hashing,linear and non linear hashing

File Organization & Indexing

DBMS stores data on hard disks
• This means that data needs to be
– read from the hard disk into memory (RAM)
– written from memory onto the hard disk
• Because disk I/O operations are slow, query
performance depends on how data is stored
on hard disks
• The lowest component of the DBMS performs
storage management activities
• Other DBMS components need not know how
these low level activities are performed

Basics of Data storage on hard
disk
• A disk is organized into a number of
blocks or pages
• A page is the unit of exchange between
the disk and the main memory
• A collection of pages is known as a file
• DBMS stores data in one or more files
on the hard disk

File Organization
• The physical arrangement of data in a file into records and
pages on the disk
• File organization determines the set of access methods for
– Storing and retrieving records from a file
• We study three types of file organization
– Unordered or Heap files
– Ordered or sequential files
– Hash files
• We examine each of them in terms of the operations we
perform on the database
– Insert a new record
– Search for a record (or update a record)
– Delete a record

Organization of Records in Files
• Heap – a record can be placed anywhere in the file where there
is space
• Sequential – records are stored in sequential order, based on the
value of the search key of each record
• Hashing – a hash function is computed on some attribute of each
record; the result specifies the block of the file in which the
record is placed. The term "hash" refers to chopping the key into pieces.
• Records of each relation may be stored in a separate file

Unordered Or Heap File
• Records are stored in the same order in which they
are created
• Insert operation
– Fast – because the incoming record is written at the end of
the last page of the file
• Search (or update) operation
– Slow – because linear search is performed on pages
• Delete Operation
– Slow – because the record to be deleted is first searched
– Deleting the record creates a hole in the page
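
The three heap-file operations can be sketched roughly as follows. This is a minimal in-memory model, not a real storage engine: the class name `HeapFile`, the constant `PAGE_CAPACITY`, and the (key, payload) record layout are all assumptions made for illustration.

```python
PAGE_CAPACITY = 3  # assumed number of records per page


class HeapFile:
    """Toy heap file: a list of pages, each page a list of records."""

    def __init__(self):
        self.pages = [[]]

    def insert(self, record):
        # Fast: write at the end of the last page, adding a page if it is full.
        if len(self.pages[-1]) >= PAGE_CAPACITY:
            self.pages.append([])
        self.pages[-1].append(record)

    def search(self, key):
        # Slow: linear search, page by page.
        # Returns (page_no, slot, pages_read) or None.
        for page_no, page in enumerate(self.pages):
            for slot, record in enumerate(page):
                if record is not None and record[0] == key:
                    return page_no, slot, page_no + 1
        return None

    def delete(self, key):
        # Slow: the record is searched first, then a hole (None) is left behind.
        hit = self.search(key)
        if hit is None:
            return False
        page_no, slot, _ = hit
        self.pages[page_no][slot] = None
        return True


heap = HeapFile()
for key in [42, 7, 19, 88, 3]:
    heap.insert((key, f"record-{key}"))
```

In this sketch a search for key 88 has to read both pages, while the insert touched only the last page.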

Ordered or Sequential File
• Records are sorted on the values of one or more fields
– Ordering field – the field on which the records are sorted
• Search (or update) Operation
– Fast – because binary search is performed on sorted records
• Delete Operation
– Fast – because searching the record is fast
• Insert Operation
– Poor – because inserting the new record in its correct
position means shifting, on average, half the records in
the file
– Alternatively an ‘overflow file’ is created which contains all
the new records as a heap
– Periodically the overflow file is merged with the main file
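
The overflow-file idea can be sketched like this. It is a simplified model that keeps only keys in Python lists; `SortedFile` and its method names are hypothetical.

```python
import bisect


class SortedFile:
    """Toy ordered file with an unordered overflow area."""

    def __init__(self, keys):
        self.main = sorted(keys)  # ordered main file
        self.overflow = []        # new records, kept as a heap

    def insert(self, key):
        # Cheap: no records in the main file are shifted.
        self.overflow.append(key)

    def search(self, key):
        # Binary search on the main file first.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        # Fall back to a linear scan of the overflow file.
        return key in self.overflow

    def merge(self):
        # Periodic reorganization: fold the overflow file into the main file.
        self.main = sorted(self.main + self.overflow)
        self.overflow = []


f = SortedFile([30, 10, 20])
f.insert(15)  # lands in the overflow file, not in the main file
```

The longer the overflow file grows between merges, the more each search degrades toward a linear scan.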

Sequential access vs random
access
• Sequential access means
that a group of elements is
accessed in a predetermined,
ordered sequence
• Random access files may
be split into pieces that
are stored wherever
space is available
• A sequential file may load
faster, while random access
files may take more time

Hash File
• Is an array of buckets
– Given a record with key k, a hash function h(k) computes the
index of the bucket to which the record belongs
– h uses one or more fields in the record called hash fields
– Hash key – the key of the file when it is used by the hash
function
– A common choice is h(K) = K mod M, where M is the number of
buckets
• Example hash function
– Assume that the staff last name is used as the hash field
– Assume also that the hash file size is 26 buckets - each
bucket corresponding to each of the letters from the
alphabet
– Then a hash function can be defined which computes the
bucket address (index) based on the first letter in the last
name.

A bucket is a unit of storage containing one or more records
(a bucket is typically a disk block).
The hash function is used to locate records for access, insertion
as well as deletion.
Hashing is an effective technique for calculating the location
of a data record on disk directly, without using an index structure.
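
The two hash functions mentioned above can be written down in a few lines. This is only a sketch: the function names are made up, and the last-name version assumes names start with an ASCII letter.

```python
M = 26  # number of buckets, as in the last-name example


def h_mod(key: int, m: int = M) -> int:
    """h(K) = K mod M for integer keys."""
    return key % m


def h_last_name(last_name: str) -> int:
    """Bucket 0 for 'A' through bucket 25 for 'Z', based on the first letter."""
    return ord(last_name[0].upper()) - ord("A")
```

A first-letter hash is easy to follow but distributes real names unevenly, since some initial letters are far more common than others.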

Hash File
• Insert Operation
– Fast – because the hash function computes the
index of the bucket to which the record belongs
– If that bucket is full, you go to the next free one
• Search Operation
– Fast – because the hash function computes the
index of the bucket
• Delete Operation
– Fast – once again because the hash function
locates the record quickly

Internal Hashing:
• Open addressing
– Proceeding from the occupied position specified by the hash address,
the program checks the subsequent positions in order until an unused
(empty) position is found.
• Chaining
– Various overflow locations are kept, usually by extending the array
with a number of overflow positions.
– A pointer field is added to each record location.
• Multiple hashing
– The program applies a second hash function if the first results in a
collision.
External Hashing:
– Hashing for disk files is called external hashing.
– The goal of a good hash function is to distribute the records
uniformly over the address space so as to minimize collisions.
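
Two of the collision-resolution schemes for internal hashing can be sketched as follows. This is a simplified illustration with integer keys and h(K) = K mod M. The chaining version uses a Python list per position rather than explicit overflow positions and pointer fields, which is a substitution made for brevity. Both function names are hypothetical.

```python
def insert_open_addressing(table, key):
    """Open addressing: probe subsequent positions until an empty one is found."""
    m = len(table)
    start = key % m
    for step in range(m):
        pos = (start + step) % m
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


def insert_chaining(table, key):
    """Chaining: colliding keys are linked from the same position."""
    table[key % len(table)].append(key)


probe_table = [None] * 7
positions = [insert_open_addressing(probe_table, k) for k in (10, 17, 24)]

chain_table = [[] for _ in range(7)]
for k in (10, 17, 24):
    insert_chaining(chain_table, k)
```

All three keys hash to position 3. Open addressing spreads them over positions 3, 4 and 5, while chaining keeps them together at position 3.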

Static Hashing
• The problem with static hashing
is that it does not expand or
shrink dynamically as the size of
the database grows or shrinks.
Dynamic Hashing
• Dynamic hashing provides a
mechanism in which data buckets are
added and removed dynamically and
on demand (for example, extendable hashing).

Overflow Chaining: When a bucket is
full, a new bucket is allocated for the
same hash result and is linked after the
previous one.
This mechanism is called closed
hashing.
Linear Probing: When the hash function
generates an address at which data is
already stored, the next free bucket is
allocated to it.
This mechanism is called open hashing.
(Some texts use the terms open and closed
hashing the other way round.)
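
Overflow chaining at the bucket level might look like the sketch below. It assumes integer keys, h(K) = K mod (number of buckets), and a bucket capacity of 2. The names `Bucket`, `insert` and `search` are illustrative only.

```python
BUCKET_CAPACITY = 2  # assumed records per bucket


class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None  # link to the next bucket in the chain


def insert(buckets, key):
    bucket = buckets[key % len(buckets)]
    # Walk the chain until a bucket with free space is found.
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()  # allocate and link an overflow bucket
        bucket = bucket.overflow
    bucket.records.append(key)


def search(buckets, key):
    bucket = buckets[key % len(buckets)]
    while bucket is not None:
        if key in bucket.records:
            return True
        bucket = bucket.overflow
    return False


buckets = [Bucket() for _ in range(4)]
for k in (1, 5, 9):  # all three keys hash to bucket 1
    insert(buckets, k)
```

Long overflow chains are the cost of static hashing: the number of primary buckets never changes, so lookups slow down as the file grows.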

Hash file organization of account file, using branch_name as key
For a string search-key, the binary representations of all the characters in the
string could be added and the sum modulo the number of buckets could be
returned.
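
A minimal sketch of that string hash is shown below. It takes the "binary representation" of a character to be its UTF-8 byte values, which is an assumption, and the function name is made up.

```python
def string_hash(key: str, num_buckets: int) -> int:
    """Add up the byte values of the string and reduce modulo the bucket count."""
    return sum(key.encode("utf-8")) % num_buckets
```

Because addition ignores character order, anagrams such as "ab" and "ba" always land in the same bucket, which is one weakness of this simple scheme.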
Use of Extendable Hash Structure: Example
Initial Hash structure, bucket size = 2
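
Since the figure for this example is not reproduced here, the following is a rough sketch of how an extendable hash structure with bucket size 2 behaves. It makes several assumptions: keys are distinct non-negative integers, the key itself serves as the hash value, and the directory is indexed by the low-order bits (textbook presentations often use a prefix of the hash value instead). All class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []


class ExtendableHash:
    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 0
        self.directory = [Bucket(0)]  # 2**global_depth directory entries

    def _index(self, key):
        # Use the low-order global_depth bits of the key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._index(key)]
            if len(bucket.keys) < self.bucket_size:
                bucket.keys.append(key)
                return
            self._split(bucket)  # then retry the insert

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; both halves point at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Entries whose new distinguishing bit is 1 now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the old keys between the bucket and its sibling.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)


table = ExtendableHash(bucket_size=2)
for k in (1, 2, 3):
    table.insert(k)
```

Inserting 3 overflows the single initial bucket, so the directory doubles to two entries and the keys are split by their lowest bit. Only the overflowing bucket is split, and the rest of the file is left untouched.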

Indexing
• Index file (same idea as a textbook index): auxiliary structure designed to
speed up access to desired data.
• Indexing field: field on which the index file is defined.
• The index file stores each value of the indexing field along with pointer(s)
(e.g. page numbers) to the block(s) that contain record(s) with that field value,
or a pointer to the record with that field value: <Indexing Field, Pointer>
• To find a record in the data file based on a selection criterion on an
indexing field, we first access the index file, which leads us to
the record in the data file.
• The index file is much smaller than the data file => searching will be fast.
• Indexing is important for file systems and DBMSs.

Choosing Indexing Technique
• Five Factors involved when choosing the
indexing technique:
• access type
• access time
• insertion time
• deletion time
• space overhead

Two Types of Indices
• Ordered index – based on a sorted ordering of
the search-key values; it may be a primary
(clustering) index or a secondary (non-clustering)
index.
• Hash index – based on search-key values
distributed uniformly across a range of buckets.

Single-Level Ordered Index : Primary Index
A primary index file is an index that is constructed using the
sorting attribute of the main file.
• Physical records may be kept ordered on the primary key.
• The index is ordered but has only one entry for each block.
• Each index entry has the value of the primary key field for
the first record (or the last record) in a block and a pointer to
that block.

Procedure:
First perform a binary search on the primary index file to find the
address of the block that holds the record, then read that block.
Performance: Very fast!
Problem: The primary index works only if the main file is a sorted file.
Solution:
New records are inserted into an unordered (heap) overflow file for the
table. Periodically, the ordered and overflow files are merged; at this time,
the main file is sorted again, and the primary index file is updated accordingly.
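
The lookup procedure can be sketched as below. The data is made up: `blocks` stands in for the sorted data file, `index_keys` holds the first key of each block, and a block's position in the list plays the role of the block pointer.

```python
import bisect

# Assumed sorted data file, three records per block.
blocks = [[5, 8, 12], [15, 21, 27], [30, 33, 41]]

# Primary index: one entry per block, keyed by the block's first record.
index_keys = [block[0] for block in blocks]


def lookup(key):
    """Return the block number holding key, or None if it is absent."""
    # Binary search for the last index entry whose key is <= the search key.
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # One block read, then a search inside that block.
    return pos if key in blocks[pos] else None
```

The binary search runs over one entry per block rather than one entry per record, which is why the index stays small compared with the data file.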

Dense and Sparse Indices
There are two types of ordered indices:
Dense Index:
• An index record appears for every search-key value in the file.
• This record contains the search-key value and a pointer to the actual
record.
Sparse Index:
• Index records are created only for some of the records.
• To locate a record, we find the index record with the largest search-key
value less than or equal to the value we are looking for.
• We start at the record pointed to by that index record, and proceed
along the pointers in the file (that is, sequentially) until we find the
desired record.
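
The difference between the two lookups can be sketched as follows. The records are illustrative values, not the contents of the deposit file from the figures. List positions stand in for record pointers, and the sparse index is built by keeping every second distinct key, which is an arbitrary choice.

```python
import bisect

# Assumed data file, sorted on branch name: (branch_name, balance).
records = [("Brighton", 750), ("Downtown", 500), ("Downtown", 600),
           ("Mianus", 700), ("Perryridge", 400), ("Perryridge", 900),
           ("Redwood", 700)]

# Dense index: one entry per distinct search-key value (first occurrence).
dense = {}
for pos, (name, _) in enumerate(records):
    dense.setdefault(name, pos)

# Sparse index: entries for only some of the search-key values.
sparse = [(name, dense[name]) for name in list(dense)[::2]]
sparse_keys = [name for name, _ in sparse]


def scan_from(pos, name):
    """Read sequentially from pos, collecting records that match name."""
    found = []
    while pos < len(records) and records[pos][0] <= name:
        if records[pos][0] == name:
            found.append(records[pos])
        pos += 1
    return found


def find_dense(name):
    return scan_from(dense[name], name) if name in dense else []


def find_sparse(name):
    # Largest index entry whose key is <= the search key.
    i = bisect.bisect_right(sparse_keys, name) - 1
    return scan_from(sparse[i][1], name) if i >= 0 else []
```

Both functions return the same records. The dense lookup jumps straight to the first match, while the sparse lookup starts at an earlier entry (here "Mianus") and scans forward.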

Figures 1 and 2 show dense and sparse indices for the deposit file.
Figure 1: Dense index.
Figure 2: Sparse index.
• Notice how we would find records for the Perryridge branch using both methods.

Index Choice
• A dense index requires more space overhead and more
memory.
• Data can be located in a shorter time using a dense
index.
• It is preferable to use a dense index when the file is
accessed through a secondary index, or when the index file is
small compared to the size of the memory.

Single-Level Ordered Index: Clustering Index
• Records physically ordered by a non-key field
• Same general structure as ordered file index
– <Clustering field, Block pointer>
• One entry in the index for each distinct value of the clustering field, with
a pointer to the first block in the data file that has a record with that value
for its clustering field.
– Possibly many records for one index entry (non-dense)
• Sometimes entire blocks reserved for each distinct clustering field value
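
A clustering index over a small file might be sketched like this. The data is made up: records are (department_number, name) pairs ordered on the non-key department number, and block positions stand in for block pointers.

```python
# Assumed data file: two records per block, ordered on a non-key field.
blocks = [[(1, "Ann"), (1, "Bob")],
          [(1, "Cy"), (2, "Dee")],
          [(2, "Eve"), (3, "Fay")]]

# <Clustering field, Block pointer>: first block containing each distinct value.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for value, _ in block:
        clustering_index.setdefault(value, block_no)


def find_all(value):
    """Start at the first block for value and read on while matches continue."""
    start = clustering_index.get(value)
    if start is None:
        return []
    found = []
    for block in blocks[start:]:
        matches = [record for record in block if record[0] == value]
        if not matches:
            break  # the file is ordered, so no later block can match
        found.extend(matches)
    return found
```

The index has three entries for six records, and one entry can lead to several blocks of matching records.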

Secondary Indexes
• A secondary index must contain pointers to all the records.
• An index entry does not point directly to the file but to a
bucket that contains pointers to the file.
• Secondary indices must be dense, with an index entry for
every search-key value, and a pointer to every record in
the file.
• Secondary indices improve the performance of
queries on non-primary keys.
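
The bucket-of-pointers layout can be sketched as follows. The account records are hypothetical, list positions stand in for record pointers, and a `defaultdict` of lists plays the role of the buckets.

```python
from collections import defaultdict

# Assumed data file, not ordered on balance: (account_number, balance).
records = [("A-101", 500), ("A-215", 700), ("A-102", 400), ("A-305", 700)]

# Secondary index on balance: each entry leads to a bucket of record pointers.
secondary = defaultdict(list)
for pos, (_, balance) in enumerate(records):
    secondary[balance].append(pos)


def find_by_balance(balance):
    """Follow every pointer in the bucket for this balance."""
    return [records[pos] for pos in secondary.get(balance, [])]
```

Because the file is not sorted on balance, the index must hold a pointer to every record. A sparse index would not work here.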

Choosing Multi-Level Index
• In some cases an index may be too large for efficient
processing.
• In that case use multi-level indexing.
• In multi-level indexing, the primary index is treated as a
sequential file and a sparse index is created on it.
• The outer index is a sparse index of the primary index, whereas
the inner index is the primary index itself.
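
A two-level lookup can be sketched as below. The keys, the block labels such as "data-block-4", and the figure of three entries per index block are all made up for illustration.

```python
import bisect

ENTRIES_PER_INDEX_BLOCK = 3  # assumed

# Inner (primary) index: sorted (key, data block pointer) entries.
inner_entries = [(key, f"data-block-{n}")
                 for n, key in enumerate(range(10, 100, 10))]

# The inner index is itself stored in blocks, like a sequential file.
inner_blocks = [inner_entries[i:i + ENTRIES_PER_INDEX_BLOCK]
                for i in range(0, len(inner_entries), ENTRIES_PER_INDEX_BLOCK)]

# Outer index: sparse, one entry (the first key) per inner index block.
outer_keys = [block[0][0] for block in inner_blocks]


def lookup(key):
    """Return the pointer of the inner entry with the largest key <= key."""
    o = bisect.bisect_right(outer_keys, key) - 1
    if o < 0:
        return None
    block = inner_blocks[o]  # one index block read
    keys = [k for k, _ in block]
    i = bisect.bisect_right(keys, key) - 1
    return block[i][1]
```

Only one block of the inner index is read per lookup. The outer index is small enough to search first and, ideally, to keep in memory.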

B-Tree Index
• The B-tree is one of the most commonly used data
structures for indexing.
• It is fully dynamic, that is, it can grow
and shrink.

Three Types B-Tree Nodes
• Root node - contains node pointers to
branch nodes.
• Branch node - contains pointers to leaf
nodes or other branch nodes.
• Leaf node - contains index items and
horizontal pointers to other leaf nodes.

Dynamic Multilevel Indexes
– Retain the benefits of using multilevel indexing while reducing index
insertion & deletion problems
– Dynamic multilevel indexes are implemented as B-trees and often as
B+-trees.
• B-tree:
. Allows an indexing field value to appear only once, at some level in the tree
. A pointer to data is stored at each node
• B+-tree:
. Pointers to data are stored only at the leaf nodes of the tree
. Leaf nodes have an entry for every indexing field value
. The leaf nodes are usually linked together to provide ordered access on the
indexing field to the records
. All the leaf nodes of the tree are at the same depth: retrieval of any record
takes the same time

• In a B-tree, search keys and data are stored in both internal and leaf nodes;
in a B+-tree, data is stored only in the leaf nodes.
• Searching a B+-tree is uniform because every search ends at a leaf node,
whereas in a B-tree the data may be found in a leaf or a non-leaf node.
• Deleting from a non-leaf node is complicated. In a B+-tree the data is always
in a leaf node, so deletion is easier.
• Insertion into a B-tree is more complicated than into a B+-tree.
• A B+-tree stores redundant search keys, but a B-tree has no redundant values.
• In a B+-tree the leaf nodes are ordered in a sequential linked list; in a
B-tree the leaf nodes are not linked in this way.
• Many database system implementers prefer the structural simplicity of a B+-tree.
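
The B+-tree properties listed above (data only in the leaves, redundant separator keys, linked leaves) can be seen in a small hand-built tree. This sketch covers search and range scan only, with no insertion or node splitting. The node classes and the convention that a separator equals the smallest key of its right subtree are assumptions.

```python
import bisect


class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        self.next = None  # link to the next leaf, for ordered access


class Internal:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children


def find_leaf(node, key):
    # Every search descends all the way to a leaf.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node


def search(root, key):
    leaf = find_leaf(root, key)
    i = bisect.bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_scan(root, low, high):
    # Find the starting leaf, then follow the leaf links.
    leaf, out = find_leaf(root, low), []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return out
            if k >= low:
                out.append((k, v))
        leaf = leaf.next
    return out


leaves = [Leaf([5, 10], ["r5", "r10"]),
          Leaf([20, 30], ["r20", "r30"]),
          Leaf([40, 50], ["r40", "r50"])]
leaves[0].next, leaves[1].next = leaves[1], leaves[2]
root = Internal([20, 40], leaves)  # keys 20 and 40 also appear in the leaves
```

The range scan descends the tree once and then only walks along the leaf level, which is the ordered access that the leaf links provide.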

ISAM (indexed sequential access method) is an advanced
sequential file organization method. Records
are stored in the file in primary-key order.
For each primary key, an index entry is created and mapped
to the record. This index entry contains the address of the
record in the file.
When a record has to be fetched by its key,
the data block’s address is looked up in the index, and the record is
read from that block.

• Pros of ISAM
– Because each index entry holds the address of its data block, finding a record in
a large database is rapid and simple.
– Range retrieval and partial-key retrieval are both supported by this approach. We may obtain
data for a specific range of values because the index is ordered on primary key values. Similarly, a
partial value can be found easily, for example, student names that begin with the letters ‘JA’.
• Cons of ISAM
– This approach needs additional disk space to hold the index.
– When new records are added, these files must be reorganized in order to keep the sequence.
– When a record is deleted, the space it occupied must be freed up. Otherwise, the database’s
performance will suffer.
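
Range and partial-key retrieval over an ordered index can be sketched as follows. The names and record addresses are made up, and the prefix search relies on "\uffff" sorting after any character that occurs in the keys, which is an assumption about the data.

```python
import bisect

# Hypothetical ISAM-style index: sorted keys with parallel record addresses.
index_keys = ["JACK", "JAMES", "JANE", "JOHN", "MARY"]
addresses = [104, 101, 103, 100, 102]


def range_lookup(low, high):
    """All (key, address) entries with low <= key <= high."""
    lo = bisect.bisect_left(index_keys, low)
    hi = bisect.bisect_right(index_keys, high)
    return list(zip(index_keys[lo:hi], addresses[lo:hi]))


def prefix_lookup(prefix):
    """Partial-key retrieval, expressed as a range from prefix to prefix + sentinel."""
    return range_lookup(prefix, prefix + "\uffff")
```

Both lookups depend on the index being kept in key order, which is why insertions force the reorganization listed under the cons.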
