Overview of Storage and Indexing


Speaker notes
  • Slide 19: What are other reasons? Persistence – we want databases to stay around; Size – 32-bit addressing is insufficient for many databases.
  • Slide 21: 120 rps = 120 r/s x 60 s/min = 7200 rpm. Reference: http://www.seagate.com/staticfiles/support/disc/manuals/enterprise/Barracuda%20ES/SATA/100424667b.pdf
  • Slide 22: Reference: http://www.seagate.com/staticfiles/maxtor/en_us/documentation/manuals/diamondmax_16_manual.pdf, page 3-2. The file holds 2^33 = 2^3 x 2^30 bytes; each track holds 2^18 = 2^8 x 2^10 bytes, so the file spans 2^33 / 2^18 = 2^15 = 32K tracks. At 4 tracks per cylinder, that is 8K cylinders. Rotation: 1/7200 min/rotation x 60 s/min = 1/120 s/rotation; average rotational delay is half a rotation, or 1/240 s ≈ 4.2 ms.
  • Slide 24: Page size is set by the OS because of the virtual memory system’s importance. Most server-class OSes support larger size pages, up to megabytes in size.

    1. 1. Chapter 9, Disks and Files <ul><li>The Storage Hierarchy </li></ul><ul><li>Disks </li></ul><ul><ul><li>Mechanics </li></ul></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><li>RAID </li></ul></ul><ul><li>Disk Space Management </li></ul><ul><li>Buffer Management </li></ul><ul><li>Files of Records </li></ul><ul><ul><li>Format of a Heap File </li></ul></ul><ul><ul><li>Format of a Data Page </li></ul></ul><ul><ul><li>Format of Records </li></ul></ul>
    2. 2. Learning objectives <ul><li>Given disk parameters, compute storage needs and read times </li></ul><ul><li>Given a reminder about what each level means, be able to derive any figures on the RAID performance slide </li></ul><ul><li>Describe the pros and cons of alternative structures for files, pages and records </li></ul>
    3. 3. A (Very) Simple Hardware Model (Diagram: a CPU chip containing an ALU, register file, and bus interface sits on the system bus; an I/O bridge connects the system bus to the memory bus (main memory) and to the I/O bus, which carries the disk controller and disk, the graphics adapter and monitor, a USB controller with mouse and keyboard, and expansion slots for other devices such as network adapters.)
    4. 4. Storage Options (Tc = CPU clock cycles)

        Level               Capacity            Access Time       Cost
        Registers           1k-2k bytes         1 Tc              Way Expensive
        Caches              10s-1000s KBytes    2-20 Tc           $10 / MByte
        Main Memory         GBytes              300-1000 Tc       $0.03 / MB (eBay)
        Hard Disk / Flash   100s GBytes         10 ms = 30M Tc    $0.10 / GB (eBay)
        Tape                Infinite            Forever           Way Cheap
    5. 5. Memory “Hierarchy” (same capacities, access times, and costs as the previous slide, annotated with what moves between levels; upper levels are faster, lower levels are larger)

        Levels: Registers ↔ Cache (SDRAM, may be multiple levels!) ↔ Memory (DRAM) ↔ Disk ↔ Tape
        Staging transfer unit and size between levels:
          Registers ↔ Cache: instructions/operands, managed by program/compiler, 1-8 bytes
          Cache ↔ Memory: blocks, managed by cache controller, 8-128 bytes
          Memory ↔ Disk: pages, managed by OS, 4K+ bytes
          Disk ↔ Tape: files, managed by user/operator, Gbytes
    6. 6. Why Does “Hierarchy” Work? <ul><li>Locality: </li></ul><ul><ul><li>Program access a relatively small portion of the address space at any instant of time </li></ul></ul><ul><li>Two Different Types </li></ul><ul><ul><li>Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) </li></ul></ul><ul><ul><li>Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access) </li></ul></ul>
    7. 7. 9.1 The Memory Hierarchy <ul><li>Typical storage hierarchy as used by a RDBMS: </li></ul><ul><ul><li>Primary storage: Main memory (RAM) for currently used data </li></ul></ul><ul><ul><li>Secondary storage: Disk, Flash memory for the main database </li></ul></ul><ul><ul><ul><li>http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf </li></ul></ul></ul><ul><ul><ul><li>What are other reasons besides cost to use disk? </li></ul></ul></ul><ul><ul><li>Tertiary storage: Tapes, DVDs for archiving older versions of the data </li></ul></ul><ul><li>Other factors </li></ul><ul><ul><li>Caches at every level </li></ul></ul><ul><ul><li>Controllers, protocols </li></ul></ul><ul><ul><li>Network connections </li></ul></ul>
    8. 8. What is FLASH Memory, Anyway? <ul><li>Floating gate transistor </li></ul><ul><ul><li>Presence of charge => “0” </li></ul></ul><ul><ul><li>Erase electrically or with UV (EPROM) </li></ul></ul><ul><li>Performance </li></ul><ul><ul><li>Reads like DRAM (~ns) </li></ul></ul><ul><ul><li>Writes like DISK (~ms). Write is a complex operation </li></ul></ul>
    9. 9. Components of a Disk Platters <ul><li>platters are always spinning (say, 120rps). </li></ul><ul><li>one head reads/writes at any one time. </li></ul><ul><li>to read a record: </li></ul><ul><ul><li>position arm (seek) </li></ul></ul><ul><ul><li>engage head </li></ul></ul><ul><ul><li>wait for data to spin by </li></ul></ul><ul><ul><li>read (transfer data) </li></ul></ul>Spindle Disk head Arm movement Arm assembly Tracks Sector
    10. 10. More terminology <ul><li>Each track is made up of fixed size sectors. </li></ul><ul><li>Page size is a multiple of sector size . </li></ul><ul><li>A platter typically has data on </li></ul><ul><li>both surfaces. </li></ul><ul><li>All the tracks that you can reach from one position of the arm is called a cylinder (imaginary!). </li></ul>Platters Spindle Disk head Arm movement Arm assembly Tracks Sector
    11. 11. Disk Technology Background <ul><li>Seagate 373453, 2003 </li></ul><ul><li>15000 RPM (4X) </li></ul><ul><li>73.4 GBytes (2500X) </li></ul><ul><li>Tracks/Inch: 64000 (80X) </li></ul><ul><li>Bits/Inch: 533,000 (60X) </li></ul><ul><li>Four 2.5” platters (in 3.5” form factor) </li></ul><ul><li>Bandwidth: 86 MBytes/sec (140X) </li></ul><ul><li>Latency: 5.7 ms (8X) </li></ul><ul><li>Cache: 8 MBytes </li></ul><ul><li>CDC Wren I, 1983 </li></ul><ul><li>3600 RPM </li></ul><ul><li>0.03 GBytes capacity </li></ul><ul><li>Tracks/Inch: 800 </li></ul><ul><li>Bits/Inch: 9550 </li></ul><ul><li>Three 5.25” platters </li></ul><ul><li>Bandwidth: 0.6 MBytes/sec </li></ul><ul><li>Latency: 48.3 ms </li></ul><ul><li>Cache: none </li></ul>
    12. 12. Typical Disk Drive Statistics (2008)

        Sector size: 512 bytes
        Seek time: average 4-10 ms; track to track 0.6-1.0 ms
        Average rotational delay: 3 to 5 ms (rotational speed 10,000 RPM down to 5,400 RPM)
        Transfer time: sustained data rate 0.1-0.3 msec per 8K page, or 25-75 MB/second
        Density: 12-18 GB/in²
    13. 13. Disk Capacity <ul><li>Capacity: maximum number of bits that can be stored. </li></ul><ul><ul><li>Expressed in units of gigabytes (GB), where 1 GB = 10^9 bytes </li></ul></ul><ul><li>Capacity is determined by: </li></ul><ul><ul><li>Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track. </li></ul></ul><ul><ul><li>Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment. </li></ul></ul><ul><ul><li>Areal density (bits/in²): product of recording and track density. </li></ul></ul><ul><li>Modern disks partition tracks into disjoint subsets called recording zones </li></ul><ul><ul><li>Each track in a zone has the same number of sectors, determined by the circumference of the innermost track in the zone. </li></ul></ul><ul><ul><li>Each zone has a different number of sectors/track </li></ul></ul>
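As a rough sketch of how these densities multiply out, here is a back-of-envelope calculation. The density figures are taken from the Seagate drive on the earlier "Disk Technology Background" slide; the usable recording area per surface and the surface count are made-up assumptions for illustration:

```python
# Capacity from densities: areal density = recording density x track density.
recording_density = 533_000      # bits/inch along a track (Seagate 373453)
track_density = 64_000           # tracks/inch, radially (Seagate 373453)
areal_density = recording_density * track_density   # bits/in^2

# Assumed: ~4 in^2 of usable recording area per surface, 8 surfaces.
surface_in2, surfaces = 4, 8
capacity_gb = areal_density * surface_in2 * surfaces / 8 / 1e9   # bits -> GB
print(round(capacity_gb, 1))
```

The result overshoots the drive's actual 73.4 GB because real drives lose area to zoning, servo data, and formatting; this is an order-of-magnitude sketch only.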
    14. 14. Cost of Accessing Data on Disk <ul><li>Time to access (read/write) a disk block: </li></ul><ul><ul><li>Taccess = Tavg seek + Tavg rotation + Tavg transfer </li></ul></ul><ul><ul><li>seek time (moving arms to position disk head on track) </li></ul></ul><ul><ul><li>rotational delay (waiting for block to rotate under head) </li></ul></ul><ul><ul><ul><li>Half a rotation , on average </li></ul></ul></ul><ul><ul><li>transfer time (actually moving data to/from disk surface) </li></ul></ul><ul><li>Key to lower I/O cost: reduce seek/rotation delays! </li></ul><ul><ul><li>No way to avoid transfer time… </li></ul></ul><ul><li>Textbook measures query cost by NUMBER of page I/Os </li></ul><ul><ul><li>Implies all I/Os have the same cost, and that CPU time is free </li></ul></ul><ul><ul><ul><li>This is a common simplification. </li></ul></ul></ul><ul><ul><li>Real DBMSs (in the optimizer) would consider sequential vs. random disk reads </li></ul></ul><ul><ul><ul><li>Because sequential reads are much faster </li></ul></ul></ul><ul><ul><ul><li>and would count CPU time. </li></ul></ul></ul>
    15. 15. Disk Parameters Practice <ul><li>A 2-platter disk rotates at 7,200 rpm. Each track contains 256KB. </li></ul><ul><ul><li>How many cylinders are required to store an 8 Gigabyte file? </li></ul></ul><ul><ul><li>What is the average rotational delay, in milliseconds? </li></ul></ul>
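A worked solution to the practice problem above, as a sketch; it assumes both sides of both platters hold data, so there are 4 tracks per cylinder:

```python
FILE_BYTES = 8 * 2**30          # 8 GB file = 2**33 bytes
TRACK_BYTES = 256 * 2**10       # 256 KB per track = 2**18 bytes
SURFACES = 2 * 2                # 2 platters, both sides used

tracks_needed = FILE_BYTES // TRACK_BYTES      # 2**33 / 2**18 = 2**15 tracks
cylinders_needed = tracks_needed // SURFACES   # 4 tracks per cylinder

RPM = 7200
rotation_ms = 60_000 / RPM                     # one full rotation, in ms
avg_rotational_delay_ms = rotation_ms / 2      # half a rotation on average

print(tracks_needed, cylinders_needed, round(avg_rotational_delay_ms, 2))
```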
    16. 16. Disk Access Time Example <ul><li>Given: </li></ul><ul><ul><li>Rotational rate = 7,200 RPM </li></ul></ul><ul><ul><li>Average seek time = 9 ms. </li></ul></ul><ul><ul><li>Avg # sectors/track = 400. </li></ul></ul><ul><li>Derived: </li></ul><ul><ul><li>Tavg rotation = 1/2 x (60 secs/7200 rotations) x 1000 ms/sec = 4 ms. </li></ul></ul><ul><ul><li>Tavg transfer = (60 secs/7200 rotations) x (1 rotation/400 sectors) x 1000 ms/sec = 0.02 ms </li></ul></ul><ul><ul><li>Taccess = 9 ms + 4 ms + 0.02 ms </li></ul></ul><ul><li>Important points: </li></ul><ul><ul><li>Access time dominated by seek time and rotational latency. </li></ul></ul><ul><ul><li>First bit in a sector is the most expensive, the rest are free. </li></ul></ul><ul><ul><li>SRAM access time is about 4 ns/doubleword, DRAM about 60 ns </li></ul></ul><ul><ul><ul><li>Disk is about 40,000 times slower than SRAM, </li></ul></ul></ul><ul><ul><ul><li>2,500 times slower than DRAM. </li></ul></ul></ul>
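The same derivation in Python, kept unrounded (the slide rounds the rotational term to 4 ms):

```python
RPM = 7200
AVG_SEEK_MS = 9.0
SECTORS_PER_TRACK = 400

t_rotation = 0.5 * (60 / RPM) * 1000                  # half a rotation, ms
t_transfer = (60 / RPM) / SECTORS_PER_TRACK * 1000    # one sector, ms
t_access = AVG_SEEK_MS + t_rotation + t_transfer      # seek + rotate + transfer

print(round(t_rotation, 2), round(t_transfer, 3), round(t_access, 2))
```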
    17. 17. So, How far away is the data? From http://research.microsoft.com/~gray/papers/AlphaSortSigmod.doc
    18. 18. Block, page and record sizes <ul><li>Block – According to text, smallest unit of I/O. </li></ul><ul><li>Page – often used in place of block . </li></ul><ul><li>“typical” record size: commonly hundreds, sometimes thousands of bytes </li></ul><ul><ul><li>Unlike the toy records in textbooks </li></ul></ul><ul><li>“typical” page size 4K, 8K </li></ul>
    19. 19. Effect of page size on read time <ul><li>Suppose rotational delay is 4ms, average seek time 6 ms, transfer speed .5msec/8K. </li></ul><ul><li>This graph shows the time required to read 1Gig of data for different page sizes. </li></ul>
    20. 20. Why the difference? <ul><li>What accounts for the difference, in times to read one Gigabyte, on the previous graph? </li></ul><ul><li>Assume: rotational delay 4 ms, average seek time 6 ms, transfer speed .5 msec/8K </li></ul><ul><li>Transfer time </li></ul><ul><ul><li>(2^30/2^13 = 128K 8K-blocks) x (.5 msec/8K) = 66 secs ~= one minute </li></ul></ul><ul><li>How many reads? </li></ul><ul><ul><li>Page size 8K: there are 2^30/2^13 = 2^17 = 128K reads </li></ul></ul><ul><ul><li>Page size 64K: there are 1/8th that many reads = 16K reads </li></ul></ul><ul><li>Time taken by rotational delays and seeks </li></ul><ul><ul><li>Each read requires a rotational delay and a seek, totalling 10 msec. </li></ul></ul><ul><ul><li>8K: (128K reads) x (10 msec/read) = 1,311 secs ~= 22 minutes </li></ul></ul><ul><ul><li>64K: 1/8 of that, or 164 secs ~= 3 minutes </li></ul></ul>
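The arithmetic above can be checked with a short script, assuming (as the slide does) that every read pays the full 10 ms of seek plus rotational delay:

```python
DATA_BYTES = 2**30               # read 1 GB total
TRANSFER_MS_PER_8K = 0.5
OVERHEAD_MS = 4 + 6              # rotational delay + seek, per read

def read_time_s(page_bytes):
    n_reads = DATA_BYTES // page_bytes
    # Total transfer time depends only on total bytes, not page size.
    transfer_ms = (DATA_BYTES // 2**13) * TRANSFER_MS_PER_8K
    return (n_reads * OVERHEAD_MS + transfer_ms) / 1000

t_8k = read_time_s(8 * 2**10)    # 128K reads
t_64k = read_time_s(64 * 2**10)  # 16K reads
print(round(t_8k), round(t_64k))
```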
    21. 21. Moral of the Story <ul><li>As page size increases, read (and write) time reduces to transfer time, a big savings. </li></ul><ul><li>So why not use a huge page size? </li></ul><ul><ul><li>Wastes memory space if you don’t need all that is read </li></ul></ul><ul><ul><li>Wastes read time if you don’t need all that is read </li></ul></ul><ul><li>What applications could use a large page size? </li></ul><ul><ul><li>Those that sequentially access data </li></ul></ul><ul><li>The problem with a small page size is that pages get scattered across the disk. Turn the page…. </li></ul>
    22. 22. Faster I/O, even with a small page size <ul><li>Even if the page size is small, you can achieve fast I/O by storing a file’s data as follows: </li></ul><ul><ul><li>Consecutive pages on same track, followed by </li></ul></ul><ul><ul><li>Consecutive tracks on same cylinder, followed by </li></ul></ul><ul><ul><li>Consecutive cylinders adjacent to each other </li></ul></ul><ul><ul><li>First two incur no seek time or rotational delay, seek for third is only one-track. </li></ul></ul><ul><li>What is saved with this storage pattern? </li></ul><ul><li>How is this storage pattern obtained? </li></ul><ul><ul><li>Disk defragmenter and its relatives/predecessors </li></ul></ul><ul><ul><ul><li>Also places frequently used files near the spindle </li></ul></ul></ul><ul><li>When data is in this storage pattern, the application can do sequential I/O </li></ul><ul><ul><li>Otherwise it must do random I/O </li></ul></ul>
    23. 23. More Hardware Issues <ul><li>Disk Controllers </li></ul><ul><ul><li>Interface from Disks to bus </li></ul></ul><ul><ul><li>Checksums, remap bad sectors, driver mgt, etc </li></ul></ul><ul><li>Interface Protocols and MB per second xfer rates </li></ul><ul><ul><li>IDE/EIDE/ATA/PATA, SATA -133 </li></ul></ul><ul><ul><li>SCSI -640 </li></ul></ul><ul><ul><ul><li>BUT for a single device, SCSI is inferior </li></ul></ul></ul><ul><ul><li>Faster network technologies such as Fibre Channel </li></ul></ul><ul><li>Storage Area Networks (SANs) </li></ul><ul><ul><li>Disk farm networked to servers </li></ul></ul><ul><ul><li>Servers can be heterogeneous – a primary advantage </li></ul></ul><ul><ul><li>Centralized management </li></ul></ul>9. Disks
    24. 24. Dependability <ul><li>Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics </li></ul><ul><li>Mean Time To Failure ( MTTF ) measures Reliability </li></ul><ul><li>Failures In Time ( FIT ) = 1/MTTF, the rate of failures </li></ul><ul><ul><li>Traditionally reported as failures per billion hours of operation </li></ul></ul><ul><li>Mean Time To Repair ( MTTR ) measures Service Interruption </li></ul><ul><ul><li>Mean Time Between Failures ( MTBF ) = MTTF+MTTR </li></ul></ul><ul><li>Module availability measures service as alternation between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9) </li></ul><ul><li>Module availability = MTTF / ( MTTF + MTTR) </li></ul>
    25. 25. Example calculating reliability <ul><li>If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules </li></ul><ul><li>Example: Calculate FIT and MTTF for </li></ul><ul><ul><li>10 disks (1M hour MTTF per disk) </li></ul></ul><ul><ul><li>1 disk controller (0.5M hour MTTF) </li></ul></ul><ul><ul><li>and 1 power supply (0.2M hour MTTF) </li></ul></ul>
    26. 26. Example calculating reliability <ul><li>Calculate FIT and MTTF for </li></ul><ul><ul><li>10 disks (1M hour MTTF per disk) </li></ul></ul><ul><ul><li>1 disk controller (0.5M hour MTTF) </li></ul></ul><ul><ul><li>and 1 power supply (0.2M hour MTTF): </li></ul></ul>
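A sketch of the calculation, assuming exponentially distributed lifetimes so that component failure rates simply add:

```python
HOURS_PER_BILLION = 1e9

mttf_hours = {"disk": 1e6, "controller": 0.5e6, "power_supply": 0.2e6}
counts = {"disk": 10, "controller": 1, "power_supply": 1}

# Overall failure rate = sum of each component's rate (1/MTTF), per hour.
failure_rate = sum(counts[c] / mttf_hours[c] for c in counts)
fit = failure_rate * HOURS_PER_BILLION   # failures per 10**9 hours
system_mttf = 1 / failure_rate           # hours

print(fit, round(system_mttf))
```

The system MTTF of about 58,824 hours is roughly 6.7 years, far below the 1M-hour MTTF of any single disk.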
    27. 27. 9.2 RAID [587] <ul><li>Disk Array: Arrangement of several disks that gives abstraction of a single, large disk. </li></ul><ul><li>Goals: Increase performance and reliability. </li></ul><ul><li>Two main techniques: </li></ul><ul><ul><li>Data striping: Data is partitioned; the size of a partition is called the striping unit. Partitions are distributed over several disks. </li></ul></ul><ul><ul><li>Redundancy: More disks => more failures. Redundant information allows reconstruction of data if a disk fails. </li></ul></ul>9.Disks
    28. 28. Data Striping <ul><li>CPUs go fast, disks don’t. How can disks keep up? </li></ul><ul><li>CPUs do work in parallel. Can disks? </li></ul><ul><li>Answer: Partition data across D disks (see next slide). </li></ul><ul><li>If Partition unit is a page: </li></ul><ul><ul><li>A single page I/O request is no faster </li></ul></ul><ul><ul><li>Multiple I/O requests can run at aggregated bandwidth </li></ul></ul><ul><li>Number of pages in a partition unit called the depth of the partition. </li></ul><ul><li>Contrary to text, partition units of a bit are almost never used and partition units of a byte are rare. </li></ul>
    29. 29. Data Striping (RAID Level 0) (Diagram: blocks laid out round-robin; Disk k holds blocks k, D+k, 2D+k, …, for k = 0 … D-1.)
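The round-robin layout on this slide can be sketched as a simple mapping (a toy model, not a real controller): logical block i lands on disk i mod D at offset i div D.

```python
def place(block_no: int, num_disks: int):
    """Map a logical block number to (disk, offset) under RAID-0 striping."""
    return block_no % num_disks, block_no // num_disks

D = 4
layout = [place(i, D) for i in range(8)]
print(layout)
```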
    30. 30. Redundancy <ul><li>Striping is seductive, but remember reliability! </li></ul><ul><ul><li>MTTF of a disk is about 6 years </li></ul></ul><ul><ul><li>If we stripe over 24 disks, what is MTTF? </li></ul></ul><ul><li>Solution: redundancy </li></ul><ul><ul><li>Parity: corrects single failures </li></ul></ul><ul><ul><li>Other codes detect where the failure is, and correct multiple failures </li></ul></ul><ul><ul><li>But failure location is provided by the controller </li></ul></ul><ul><ul><li>Redundancy may require more than one check bit </li></ul></ul><ul><li>Redundancy makes writes slower – why? </li></ul>
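How parity corrects a single failure can be demonstrated with XOR over byte strings (a toy sketch; real arrays work on whole disk blocks): XOR all the data together to get parity, and any one lost "disk" is rebuilt by XORing the survivors with the parity.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

blocks = [b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]   # data on 3 "disks"
parity = reduce(xor_bytes, blocks)                 # check "disk"

# Disk 1 fails; reconstruct its contents from parity plus the survivors.
rebuilt = reduce(xor_bytes, [parity, blocks[0], blocks[2]])
print(rebuilt == blocks[1])
```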
    31. 31. RAID Levels <ul><li>Standardized by SNIA ( www.snia.org ) </li></ul><ul><li>Vary in practice </li></ul><ul><li>For each level, decide (assume single user) </li></ul><ul><ul><li>Number of disks required to hold D disks of data. </li></ul></ul><ul><ul><li>Speedup s (compared to 1 disk) for </li></ul></ul><ul><ul><ul><li>S/R (Sequential/Random) R/W (Reads/Writes) </li></ul></ul></ul><ul><ul><ul><ul><li>Random: each I/O is one block </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Sequential: Each I/O is one stripe </li></ul></ul></ul></ul><ul><ul><li>Number of disks/blocks that can fail w/o data loss </li></ul></ul><ul><li>Level 0 : Block Striped, No redundancy </li></ul><ul><ul><li>Picture is 2 slides back </li></ul></ul>
    32. 32. JBOD, RAID Level 1 <ul><li>JBOD: Just a Bunch of Disks </li></ul><ul><li>Level 1: Mirrored (two identical JBODs – no striping) </li></ul> (Diagram: each disk 0 … D-1 holds blocks 0, 1, 2, 3, … in order and is paired with an identical mirror copy.)
    33. 33. RAID Level 0+1: Stripe + Mirror (Diagram: disks 0 … D-1 hold the RAID-0 striped layout, with blocks k, D+k, 2D+k, … on disk k; disks D … 2D-1 hold an identical mirrored copy of that striped layout.)
    34. 34. RAID Level 4 <ul><li>Block-Interleaved Parity (not common) </li></ul><ul><ul><li>One check disk, uses one bit of parity. </li></ul></ul><ul><ul><li>How to tell if there is a failure, or which disk failed? </li></ul></ul><ul><ul><li>Read-modify-write </li></ul></ul><ul><ul><li>Disk D is a bottleneck </li></ul></ul> (Diagram: data blocks striped across disks 0 … D-1 as in RAID 0; a dedicated check disk D holds the parity blocks P.)
    35. 35. RAID Level 5 <ul><li>Level 5: Block-Interleaved Distributed Parity </li></ul> (Diagram: as Level 4, but the parity blocks P rotate across all D+1 disks instead of living on a dedicated check disk.) <ul><li>Level 6: Like 5, but 2 parity bits/disks </li></ul><ul><ul><li>Can survive loss of 2 disks/blocks </li></ul></ul>
    36. 36. Notation on the next slide <ul><li>#Disks </li></ul><ul><ul><li>Number of disks required to hold D disks worth of data using this RAID level </li></ul></ul><ul><li>Reads/Write speedup of blocks in a single file: </li></ul><ul><ul><li>SR: Sequential Read </li></ul></ul><ul><ul><li>RR: Random read </li></ul></ul><ul><ul><li>SW: Sequential write </li></ul></ul><ul><ul><li>RW: Random write </li></ul></ul><ul><li>Failure Tolerance </li></ul><ul><ul><li>How many disks can fail without loss of data </li></ul></ul><ul><li>Internal Data </li></ul><ul><ul><li>s = Blocks transferred in the time it takes to transfer one block of data from one disk. </li></ul></ul><ul><ul><li>These numbers are theoretical! </li></ul></ul><ul><ul><ul><li>YMMV…and vary significantly! </li></ul></ul></ul>
    37. 37. RAID Performance

        Level   #Disks   SR speedup   RR speedup     SW speedup   RW speedup     Failure Tolerance
        0       D        s = D        1 ≤ s ≤ D      s = D        1 ≤ s ≤ D      0
        1       2D       s = 2        s = 2          s = 1**      s = 1**        D*
        0+1     2D       s = 2D       2 ≤ s ≤ 2D     s = D**      1 ≤ s ≤ D**    D*
        5       D+1      s = D        1 ≤ s ≤ D      s = D        Varies         1

        * If no two failed disks are copies of each other
        ** Note – can’t write both mirrors at once – why?
    38. 38. Small Writes on Levels 4 and 5 <ul><li>Levels 4 and 5 require a read-modify-write cycle for all writes, since the parity block must be read and modified. </li></ul><ul><li>On small writes this can be very expensive </li></ul><ul><li>This is another justification for Log Based File Systems (see your OS course) </li></ul>
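The read-modify-write cycle has a useful property worth making concrete: the new parity can be computed from just the old data block and the old parity, with no need to read the other D-1 data disks. A toy sketch with single-byte "blocks":

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

disks = [b"\x01", b"\x02", b"\x04"]
parity = xor_bytes(xor_bytes(disks[0], disks[1]), disks[2])

new_block = b"\xff"
# Small write to disk 1: two reads (old data, old parity), two writes.
new_parity = xor_bytes(xor_bytes(parity, disks[1]), new_block)
disks[1] = new_block

# Sanity check: matches parity recomputed from all data disks.
full_recompute = xor_bytes(xor_bytes(disks[0], disks[1]), disks[2])
print(new_parity == full_recompute)
```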
    39. 39. Which RAID Level is best? <ul><li>If data loss is not a problem </li></ul><ul><ul><li>Level 0 </li></ul></ul><ul><li>If storage cost is not a problem </li></ul><ul><ul><li>Level 0+1 </li></ul></ul><ul><li>Else </li></ul><ul><ul><li>Level 5 </li></ul></ul><ul><li>Software Support </li></ul><ul><ul><li>Linux: 0,1,4,5 ( http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html ) </li></ul></ul><ul><ul><li>Windows: 0,1,5 ( http://www.techimo.com/articles/index.pl?photo=149 ) </li></ul></ul>
    40. 40. 9.3, 9.4.1: Covered earlier 9.Disks
    41. 41. 9.4.2 DBMS vs. OS File System <ul><li>OS does disk space & buffer mgmt: why not let OS manage these tasks? [715] </li></ul><ul><li>Differences in OS support: portability issues </li></ul><ul><li>Some limitations, e.g., files can’t span disks. </li></ul><ul><li>Buffer management in DBMS requires ability to: </li></ul><ul><ul><li>pin a page in buffer pool, force a page to disk (important for implementing CC & recovery), </li></ul></ul><ul><ul><li>adjust replacement policy, and pre-fetch pages based on access patterns in typical DB operations. </li></ul></ul><ul><ul><ul><li>Sometimes MRU is the best replacement policy: For example, for a scan or a loop that does not fit. </li></ul></ul></ul>9.Disks
    42. 42. 9.5 Files of Records <ul><li>Page or block is OK when doing I/O, but higher levels of DBMS operate on records , and files of records . </li></ul><ul><li>FILE : A collection of pages, each containing a collection of records. Must support: </li></ul><ul><ul><li>insert/delete/modify record </li></ul></ul><ul><ul><li>read a particular record (specified using record id ) </li></ul></ul><ul><ul><li>scan all records (possibly with some conditions on the records to be retrieved) </li></ul></ul>9.Disks
    43. 43. 9.5.1 Unordered (Heap) Files <ul><li>Simplest file structure contains records in no particular order. </li></ul><ul><li>As file grows and shrinks, disk pages are allocated and de-allocated. </li></ul><ul><li>To support record level operations, we must: </li></ul><ul><ul><li>keep track of the pages in a file </li></ul></ul><ul><ul><li>keep track of free space on pages </li></ul></ul><ul><ul><li>keep track of the records on a page </li></ul></ul><ul><li>There are at least two alternatives for keeping track of heap files. </li></ul>9.Disks
    44. 44. Heap File Implemented as a List <ul><li>The header page id and Heap file name must be stored someplace. </li></ul><ul><li>Each page contains 2 `pointers’ plus data. </li></ul>Header Page Data Page Data Page Data Page Data Page Data Page Data Page Pages with Free Space Full Pages 9.Disks
    45. 45. Heap File Using a Page Directory <ul><li>The entry for a page can include the number of free bytes on the page. </li></ul><ul><li>The directory is a collection of pages; linked list implementation is just one alternative. </li></ul><ul><ul><li>Much smaller than linked list of all HF pages ! </li></ul></ul>Data Page 1 Data Page 2 Data Page N Header Page DIRECTORY 9.Disks
    46. 46. Comparing Heap File Implementations <ul><li>Assume </li></ul><ul><ul><li>100 directory entries per page. </li></ul></ul><ul><ul><li>U full pages, E pages with free space </li></ul></ul><ul><ul><li>D directory pages </li></ul></ul><ul><ul><li>Then D = ⌈(U+E)/100⌉ </li></ul></ul><ul><ul><li>Note that D is two orders of magnitude less than U or E </li></ul></ul><ul><li>Cost to find a page with enough free space </li></ul><ul><ul><li>List: E/2 Directory: (D/2) + 1 </li></ul></ul><ul><li>Cost to Move a page from Full to Free (e.g., when a record is deleted) </li></ul><ul><ul><li>List: 3, Directory: 1 </li></ul></ul><ul><li>Can you think of some other operations? </li></ul>
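The directory-size formula and the two average search costs, as a quick sketch; U and E here are made-up example values:

```python
import math

def directory_pages(U: int, E: int, entries_per_page: int = 100) -> int:
    """D = ceil((U + E) / entries_per_page): one entry per heap-file page."""
    return math.ceil((U + E) / entries_per_page)

U, E = 5_000, 3_000
D = directory_pages(U, E)

list_find_cost = E / 2       # scan half the free-space list, on average
dir_find_cost = D / 2 + 1    # scan half the directory, then read the page
print(D, list_find_cost, dir_find_cost)
```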
    47. 47. 9.6 Page Formats: Fixed Length Records (Diagram: a PACKED page stores records in slots 1 … N contiguously, followed by free space, with the number of records N in the page footer; an UNPACKED, BITMAP page has M slots, some empty, with a bitmap of M bits marking occupied slots and the number of slots M in the footer.) 9.Disks
    48. 48. Packed vs Unpacked Page Formats <ul><li>Record ID (RID, TID) = (page#, slot#) , in all page formats </li></ul><ul><ul><li>Note that indexes are filled with RIDs </li></ul></ul><ul><ul><li>Data entries in alternatives 2 and 3 are (key, RID..) </li></ul></ul><ul><li>Packed </li></ul><ul><ul><li>stores more records </li></ul></ul><ul><ul><li>RIDs change when a record is deleted </li></ul></ul><ul><ul><ul><li>This may not be acceptable. </li></ul></ul></ul><ul><li>Unpacked </li></ul><ul><ul><li>RID does not change </li></ul></ul><ul><ul><li>Less data movement when deleting </li></ul></ul>
    49. 49. Page Formats: Variable Length Records (Diagram: on page i, records are stored from the front and named by Rid = (i, slot#), e.g. (i,1), (i,2), (i,N); the SLOT DIRECTORY at the end of the page holds one entry per slot giving each record’s offset (e.g. 20, 16, 24), plus a pointer to the start of free space and the slot count N.) 9.Disks
    50. 50. Slotted Page Format <ul><li>Intergalactic Standard, for fixed length records also. </li></ul><ul><li>How to deal with free space fragmentation? </li></ul><ul><ul><li>Pack records lazily. </li></ul></ul><ul><li>Note that RIDs don’t change </li></ul><ul><li>How are updates handled which expand the size of a record? </li></ul><ul><ul><li>Forwarding flag to new location </li></ul></ul><ul><li>http://www.postgresql.org/docs/8.3/interactive/storage-page-layout.html </li></ul><ul><li>postgresql-8.3.1/src/include/storage/bufpage.h </li></ul>
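A toy slotted page may make the idea concrete. This is a sketch only, not PostgreSQL's actual layout: records grow from the front, the slot directory is a Python list standing in for the on-page directory, and free-space compaction is ignored. The key property shown is that RIDs (page#, slot#) survive deletion of other records.

```python
class SlottedPage:
    def __init__(self, page_no: int):
        self.page_no = page_no
        self.data = bytearray()   # record storage area
        self.slots = []           # slot# -> (offset, length), or None if deleted

    def insert(self, record: bytes):
        offset = len(self.data)
        self.data += record
        self.slots.append((offset, len(record)))
        return (self.page_no, len(self.slots) - 1)   # RID

    def delete(self, rid):
        # The slot entry is kept (emptied), so other RIDs stay valid.
        self.slots[rid[1]] = None

    def fetch(self, rid) -> bytes:
        off, length = self.slots[rid[1]]
        return bytes(self.data[off:off + length])

page = SlottedPage(7)
r1 = page.insert(b"alice")
r2 = page.insert(b"bob")
page.delete(r1)
print(page.fetch(r2))
```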
    51. 51. 9.7 Record Formats: Fixed Length <ul><li>Information about field types same for all records in a file; stored in system catalogs. </li></ul><ul><li>Finding i’th field does not require scan of record. </li></ul> (Diagram: fields F1 … F4 of lengths L1 … L4 stored consecutively from base address B; the address of F3 is B+L1+L2.) 9.Disks
    52. 52. Record Formats: Variable Length <ul><li>Two alternative formats (# fields is fixed): </li></ul><ul><li>Second offers direct access to i’th field, efficient storage </li></ul><ul><li>of nulls (special don’t know value); small directory overhead. </li></ul> (Diagram: format 1 stores a field count, e.g. 4, with fields F1 … F4 delimited by a special symbol such as $; format 2 stores an array of field offsets followed by the fields F1 … F4.) 9.Disks
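The second (offset-array) format can be sketched as follows. The 2-byte offsets and the encode/decode helpers are illustrative assumptions, not a real DBMS format; a null field is encoded by making its end offset equal to the previous one (here it decodes as empty bytes, a simplification that conflates null with an empty string).

```python
import struct

def encode(fields):
    """Pack a record as [end-offset array][field bytes]; None means NULL."""
    body = b"".join(f or b"" for f in fields)
    ends, pos = [], 0
    for f in fields:
        pos += len(f) if f is not None else 0
        ends.append(pos)
    return struct.pack(f"<{len(ends)}H", *ends) + body

def field(record: bytes, num_fields: int, i: int) -> bytes:
    """Direct access to the i'th field: no scan past earlier fields needed."""
    ends = struct.unpack_from(f"<{num_fields}H", record)
    start = ends[i - 1] if i > 0 else 0
    header = 2 * num_fields
    return record[header + start:header + ends[i]]

rec = encode([b"ann", None, b"cs", b"3.9"])
print(field(rec, 4, 0), field(rec, 4, 1), field(rec, 4, 2))
```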