4. 4
Data on External Storage
Disks: Can retrieve random page at fixed cost
But reading several consecutive pages is much cheaper than
reading them in random order
Tapes: Can only read pages in sequence
Cheaper than disks; used for archival storage
Page: Unit of information read from or written to disk
Size of page: DBMS parameter, 4KB, 8KB
Disk space manager:
Abstraction: a collection of pages.
Allocate/de-allocate a page.
Read/write a page.
Page I/O:
Pages read from disk and pages written to disk
Dominant cost of database operations
5. 5
Buffer Management
Architecture:
Data is read into memory for processing
Data is written to disk for persistent storage
Buffer manager stages pages between external
storage and main memory buffer pool.
Access method layer makes calls to the buffer
manager.
6. 6
Access Methods
Access methods: routines to manage various disk-based
data structures.
Files of records
Various kinds of indexes
File of records:
Important abstraction of external storage in a DBMS!
Record id (rid) is sufficient to physically locate a record
Indexes:
Auxiliary data structures
Given values in index search key fields, find the record ids of
records with those values
7. 7
File organizations & access methods
Many alternatives exist, each ideal for some
situations, and not so good in others:
Heap (unordered) files: Suitable when typical
access is a file scan retrieving all records.
Sorted Files: Best if records must be retrieved in
some order, or only a `range’ of records is needed.
Indexes: Data structures to organize records via
trees or hashing.
• Like sorted files, they speed up searches for a subset of
records, based on values in certain (“search key”) fields
• Updates are much faster than in sorted files.
9. 9
Disks and DBMS Design
DBMS stores information on disks.
This has major implications for DBMS design!
READ: transfer data from disk to main memory (RAM)
for data processing.
WRITE: transfer data from RAM to disk for persistent
storage.
Both are high-cost operations, relative to in-memory
operations, so must be planned carefully!
10. 10
Why Not Store Everything in Main Memory?
Main memory is volatile. We want data to be
saved between runs. (Obviously!)
Costs too much. $100 will buy you either 4GB
of RAM or 1TB of disk today.
32-bit addressing limitation.
232 bytes can be directly addressed in memory.
Number of objects cannot exceed this number.
11. 11
Basics of Disks
Unit of storage and retrieval: disk block or page.
A disk block/page is a contiguous sequence of bytes.
Size of a DBMS parameter, 4KB or 8KB.
Disks support direct access to a page.
Unlike RAM, time to retrieve a page varies!
It depends upon the location on disk.
Therefore, relative placement of pages on disk has
major impact on DBMS performance!
12. 12
Components of a Disk
Track
Platters
Platters spin (say, 7200rpm).
Spindle
Arm assembly is moved in
or out to position a head on a
desired track.
Disk head
Arm movement
Arm
assembly
Only one head reads/
writes at any one time.
Tracks under heads make a
cylinder (imaginary!).
Each track is divided into
sectors (whose size is fixed).
Block size is a multiple
of sector size.
Sector
13. 13
Accessing a Disk Page
Time to access (read/write) a disk block:
seek time (moving arms to position disk head on track)
rotational delay (waiting for block to rotate under head)
transfer time (actually moving data to/from disk surface)
Seek time and rotational delay dominate.
Seek time varies from about 2 to 30 msec
Rotational delay varies from 0 to 11 msec
Transfer rate between 25 to 100 MB/s
Key to lower I/O cost: reduce seek/rotation
delays!
14. 14
Arranging Pages on Disk
`Next’ block concept:
blocks on same track, followed by
blocks on same cylinder, followed by
blocks on adjacent cylinder
Blocks in a file should be arranged
sequentially on disk (by `next’), to minimize
seek and rotational delay.
For a sequential scan, pre-fetching several
pages at a time is a big win!
16. 16
Buffer Management in a DBMS
Page Requests from Higher Levels
BUFFER POOL
DB
disk page
free frame
MAIN MEMORY
DISK
choice of frame dictated
by replacement policy
Data must be in RAM for DBMS to operate on it!
Table of <frame#, pageid> pairs is maintained.
17. 17
More on Buffer Management
Requestor of page must unpin it, and indicate
whether page has been modified:
dirty bit is used for this.
Page in pool may be requested many times,
a pin count is used. A page is a candidate for
replacement iff pin count = 0.
CC & recovery may entail additional I/O
when a frame is chosen for replacement.
(Write-Ahead Log protocol; more later.)
18. 18
When a Page is Requested ...
If requested page is not in pool:
Choose a frame for replacement
If frame is dirty, write it to disk
Read requested page into chosen frame
Pin the page and return its address.
If requests can be predicted (e.g., sequential scans)
pages can be pre-fetched several pages at a time!
19. 19
Buffer Replacement Policy
Frame is chosen for replacement by a
replacement policy:
Least-recently-used (LRU), Clock, MRU etc.
Policy can have big impact on # of I/O’s;
depends on the access pattern.
Sequential flooding: Nasty situation caused by
LRU + repeated sequential scans.
# buffer frames < # pages in file means each page
request causes an I/O. MRU much better in this
situation (but not in all situations, of course).
20. 20
DBMS vs. OS File System
OS does disk space & buffer mgmt: why not let
OS manage these tasks?
Differences in OS support: portability issues
Some limitations, e.g., files can’t span disks.
Buffer management in DBMS requires ability to:
pin a page in buffer pool, force a page to disk
(important for implementing CC & recovery),
adjust replacement policy, and pre-fetch pages based
on access patterns in typical DB operations.
22. 22
Files
Access method layer offers an abstraction of
data on disk: a file of records residing on
multiple pages
A number of fields are organized in a record
A collection of records are organized in a page
A collection of pages are organized in a file
23. 23
Files of Records
Page or block is OK when doing I/O, but
higher levels of DBMS operate on records and
files of records.
FILE: A collection of pages, each containing a
collection of records. Must support:
insert/delete/modify record
read a particular record (specified using record id)
scan all records (possibly with some conditions on
the records to be retrieved)
24. 24
Unordered (Heap) Files
Simplest file structure contains records in no
particular order.
As file grows and shrinks, disk pages are
allocated and de-allocated.
To support record level operations, we must:
keep track of the pages in a file
keep track of free space on pages
keep track of the records on a page
There are many alternatives for keeping track
of this.
25. 25
Heap File Implemented as a List
Header
Page
Data
Page
Data
Page
Data
Page
Data
Page
Data
Page
Full Pages
Data
Page Pages with
Free Space
(heap file name, header page id) stored in a known place.
Two doubly linked lists, for full pages & pages with space.
Each page contains 2 `pointers’ plus data.
Upon insertion, scan the list of pages with space, or ask
disk space manager to allocate a new page
26. 26
Heap File Using a Page Directory
Data
Page 1
Data
Page 2
Data
Page N
Header
Page
DIRECTORY
A directory entry per page; it can include the
number of free bytes on the page.
The directory is a collection of pages; linked
list implementation is just one alternative.
Much smaller than linked list of all HF pages!
27. 27
Page Format
How to store a collection of records on a page?
Consider a page as a collection of slots, one for
each record.
A record is identified by rid = <page id, slot #>
Record ids (rids) are used in indexes
(Alternatives 2 and 3).
28. 28
Page Format: Fixed Length Records
Slot 1
Slot 2
Slot N
. . .
PACKED
N
Free
Space
number
of records
. . .
1 1
. . . 0 1M
M ... 3 2 1
Slot 1
Slot 2
Slot N
Slot M
UNPACKED, BITMAP
number
of slots
Moving records for free space management changes
rid! May not be acceptable.
29. 20 16 24 N Pointer
to start
of free
space
29
Page Format: Variable Length Records
Page i
Rid = (i,N)
Rid = (i,2)
Rid = (i,1)
N . . . 2 1 # slots
SLOT DIRECTORY
Rid of record is <page id, slot #>
Can move records on page without changing rid; so, attractive
for fixed-length records too. (level of indirection)
30. 30
Record Format: Fixed Length
F1 F2 F3 F4
L1 L2 L3 L4
Base address (B)
Address = B+L1+L2
Information of a record type e.g., the number of fields
and field types is stored in the system catalog.
Fixed length record: (1) the number of fields is fixed,
(2) each field has a fixed length.
Store fields consecutively in a record.
Finding i’th field does not require scan of record.
31. 31
Record Format: Variable Length
Variable length record: (1) number of fields is fixed, (2)
some fields are variable length
Two alternatives:
F1 F2 F3 F4
$ $ $ $
Scan
Fields Delimited
by Special Symbols
F1 F2 F3 F4
S1 S2 S3 S4 E4 Array of Field
Offsets
Second offers direct access to i’th field, efficient storage
of NULLs ; small directory overhead.