Ch 9. Storing Data: Disks and Files
- Heap File Structure -
Sang-Won Lee
2 SKKU VLDB Lab.Ch 9. Storing Disk
9.0 Overview
9.1 Memory Hierarchy
9.2 RAID (Redundant Array of Independent Disks)
9.3 Disk Space Management
9.4 Buffer Manager
9.5 Files of Records
9.6 Page Format
9.7 Record Format
Memory Hierarchy
Smaller, Faster, Expensive, Volatile
Bigger, Slower, Cheaper, Non-Volatile
 Main memory (RAM) for currently used data.
 Disk for the main database (secondary storage).
 Tapes for archiving older versions of the data (tertiary storage)
 What if an ideal storage appears? Fast, cheap, large, non-volatile: PCM, MRAM, FeRAM?
Jim Gray’s Storage Latency Analogy:
How Far Away is the Data?
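[Figure: each storage level is paired with how far away the data would be at human scale. Recoverable labels: My Head (1 min), This Room (On Chip Cache), This Hotel (On Board Cache, 10 min), 1.5 hr, 2 Years (10^6), 2,000 Years (Tape / Optical, 10^9).]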
Disks and Files
 DBMS stores information on (hard) disks.
− Electronic (CPU, DRAM) vs. Mechanical (harddisk)
 This has major implications for DBMS design!
− READ: transfer data from disk to main memory (RAM).
− WRITE: transfer data from RAM to disk.
− Both are expensive operations, relative to in-memory operations, so
must be planned carefully!
 DRAM: ~ 10 ns
 Harddisk: ~ 10ms
 SSD: 80us ~ 10ms
 Secondary storage device of choice.
 Main advantage over tapes: random access vs. sequential.
− Tapes deteriorate over time
 Data is stored and retrieved in units called disk blocks or pages.
 Unlike RAM, time to retrieve a disk page varies depending upon
location on disk.
− Thus, relative placement of pages on disk has a big impact on DB performance
 e.g. adjacent allocation of the pages from the same table.
− We need to optimize both data placement and access
 e.g. elevator disk scheduling algorithm
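The elevator idea mentioned above can be sketched in a few lines. This is a minimal illustration, not any particular OS scheduler: it uses the LOOK variant of the elevator algorithm (the arm reverses at the last pending request rather than at the edge of the disk), and the function names and the request list are assumptions made for the example.

```python
def elevator_order(pending, head, direction=1):
    """Order in which an elevator (LOOK-style) scheduler serves requests.

    pending   -- track (cylinder) numbers waiting to be served
    head      -- track where the disk head currently is
    direction -- +1 if the arm is moving toward higher tracks, -1 otherwise
    """
    # Requests that lie in the current direction of travel, in ascending order.
    ahead = sorted(t for t in pending if (t - head) * direction >= 0)
    # Requests that lie behind the head, in ascending order.
    behind = sorted(t for t in pending if (t - head) * direction < 0)
    if direction > 0:
        # Sweep up through 'ahead', then come back down through 'behind'.
        return ahead + behind[::-1]
    # Sweep down through 'ahead', then go back up through 'behind'.
    return ahead[::-1] + behind


def total_seek_distance(order, head):
    """Total number of tracks the arm travels when serving 'order'."""
    distance = 0
    for track in order:
        distance += abs(track - head)
        head = track
    return distance


requests = [10, 95, 52, 60, 20]   # hypothetical arrival order
print(elevator_order(requests, head=50))
print(total_seek_distance(requests, 50),
      total_seek_distance(elevator_order(requests, 50), 50))
```

For this made-up request list, serving in arrival order moves the arm 216 tracks, while the elevator order moves it 130 tracks.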
Anatomy of a Disk
Arm assembly
 The platters spin
 e.g. 5400 / 7200 / 15K rpm
 The arm assembly is moved in or out to
position a head on a desired track. Tracks
under heads make a cylinder
 Mechanical storage -> low IOPS
 Only one head reads/writes at any one time
 Parallelism degree: 1
 Block size is a multiple of sector size
 Update-in-place: poisoned apple
 No atomic write
 fsync for ordering / durability
Accessing a Disk Page
 Time to access (read/write) a disk block:
− seek time (moving arms to position disk head on track)
− rotational delay (waiting for block to rotate under head)
− transfer time (actually moving data to/from disk surface)
 Seek time and rotational delay dominate.
− Seek time: about 1 to 20msec
− Rotational delay: from 0 to 10msec
− Transfer time: about 1 msec per 4KB page
 Key to lower I/O cost: reduce seek/rotation delays!
− E.g. disk scheduling algorithms in the OS (the four Linux I/O schedulers)
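The three components of access time can be combined into a rough cost model. The sketch below is a back-of-the-envelope calculation under stated assumptions: the average rotational delay is taken as half a revolution, and the 9 ms average seek and 7200 rpm used in the example are illustrative values, not measurements from the slides.

```python
def access_time_ms(avg_seek_ms, rpm, transfer_ms_per_page, pages=1, sequential=False):
    """Estimated time (ms) to read 'pages' pages from a hard disk.

    sequential=False: every page pays seek + rotational delay + transfer.
    sequential=True:  the head is positioned once, then pages stream off the track.
    """
    rotation_ms = 60_000 / rpm            # time for one full revolution
    avg_rot_delay_ms = rotation_ms / 2    # on average we wait half a revolution
    positioning_ms = avg_seek_ms + avg_rot_delay_ms
    if sequential:
        return positioning_ms + pages * transfer_ms_per_page
    return pages * (positioning_ms + transfer_ms_per_page)


one_page = access_time_ms(avg_seek_ms=9, rpm=7200, transfer_ms_per_page=1)
print(f"one random 4KB page: {one_page:.2f} ms, about {1000 / one_page:.0f} IOPS")
print(f"100 random pages:     {access_time_ms(9, 7200, 1, pages=100):.0f} ms")
print(f"100 sequential pages: {access_time_ms(9, 7200, 1, pages=100, sequential=True):.0f} ms")
```

With these assumed parameters the positioning cost (seek plus rotation) is about 13 ms against 1 ms of transfer, which is why sequential placement pays off.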
Arranging Pages on Disk
 'Next' block concept:
− Blocks on same track, followed by
− Blocks on same cylinder, followed by
− Blocks on adjacent cylinder
 Blocks in a file should be arranged sequentially on disk (by 'next'),
to minimize seek and rotational delay.
 Disk fragmentation problem
− Is this still problematic in flash storage?
Table, Insertions, Heap Files
CREATE TABLE TEST (a int, b int, c varchar2(650));
/* Insert 1M tuples into TEST table (approximately 664 bytes per tuple) */
BEGIN
  FOR i IN 1..1000000 LOOP
    INSERT INTO TEST (a, b, c) VALUES (i, i, rpad('X', 650, 'X'));
  END LOOP;
  COMMIT;
END;
/
Page = 8KB
10 tuples / page
100,000 pages in total
TEST table = 800MB
[Figure: the heap file as a sequence of pages: Page 1, Page 2, …, Page i]
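The sizing figures above follow from simple arithmetic, sketched below. The constants come from the slide; the explanation for why only 10 tuples (rather than 12) land in each 8KB page is an assumption on my part (page header plus free space reserved for updates), since the slide does not say.

```python
PAGE_SIZE = 8 * 1024      # bytes ("Page = 8KB")
TUPLE_SIZE = 664          # bytes per tuple, approximate
NUM_TUPLES = 1_000_000

# Upper bound if a page held nothing but tuples.
raw_fit = PAGE_SIZE // TUPLE_SIZE

# The slide reports 10 tuples per page; the gap to raw_fit is presumably
# page header and reserved free space (assumption, not stated on the slide).
TUPLES_PER_PAGE = 10

pages = -(-NUM_TUPLES // TUPLES_PER_PAGE)   # ceiling division
table_bytes = pages * PAGE_SIZE

print(raw_fit, pages, table_bytes // 1024, "KB")
```

100,000 pages of 8KB is 800,000 KB, which the slide rounds to 800MB.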
 On-Line Analytical vs. Transactional Processing
Execution Plan
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 1 | 5 | 22053 (1)| 00:04:25 |
| 1 | SORT AGGREGATE | | 1 | 5 | | |
| 2 | TABLE ACCESS FULL| TEST | 996K| 4865K| 22053 (1)| 00:04:25 |
179 recursive calls
0 db block gets
100152 consistent gets
100112 physical reads
1 rows processed
 Point or Range Query
SELECT B FROM TEST /* point query */
WHERE A = 500000;
SELECT B FROM TEST /* range query */
WHERE A BETWEEN 50001 AND 50101;
B-tree Index on TEST(A)
SEARCH KEY: 500000
Index entry: (500000, (50000, 10)), i.e. Block #: 50000, Slot #: 10
Tuple stored at that slot: (500000, 500000, 'XX…XX')
 Full Table Scan: 100,000 Block Accesses
 Index: (3~4) + Data Block Access
− Point query: 1
− Range queries: depending on range
(Heap File)
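The block-access comparison above can be written out as a small cost model. This is a sketch under assumptions: the index height of 3 is taken from the "(3~4)" range on the slide, leaf-to-leaf traversal for range scans is ignored, and the helper names are made up for the example.

```python
def full_scan_blocks(table_pages):
    """A full table scan reads every page of the heap file."""
    return table_pages


def index_access_blocks(index_height, matching_rows, rows_per_block=1):
    """Index traversal plus data block accesses.

    rows_per_block=1 models the worst case (every matching row in a
    different block); a larger value models rows stored next to each other.
    """
    data_blocks = -(-matching_rows // rows_per_block)   # ceiling division
    return index_height + data_blocks


print("full scan:          ", full_scan_blocks(100_000))
print("point query:        ", index_access_blocks(3, 1))
print("range, 101 rows:    ", index_access_blocks(3, 101, rows_per_block=10))
print("range, worst case:  ", index_access_blocks(3, 101, rows_per_block=1))
```

The range query A BETWEEN 50001 AND 50101 matches 101 rows; since the rows were inserted in key order, 10 per page, the clustered estimate is the more realistic one here.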
Execution Plan
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 1 | 10 | 4 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST | 1 | 10 | 4 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TEST_A | 1 | | 3 (0)| 00:00:01 |
5 consistent gets
4 physical reads
1 rows processed
 Index-based Table Access
OLTP: TPC-A/B/C Benchmark
exec sql begin declare section;
long Aid, Bid, Tid, delta, Abalance;
exec sql end declare section;
{ read input msg;
exec sql begin work;
exec sql update accounts set Abalance = Abalance + :delta where Aid = :Aid;
exec sql select Abalance into :Abalance from accounts where Aid = :Aid;
exec sql update tellers set Tbalance = Tbalance + :delta where Tid = :Tid;
exec sql update branches set Bbalance = Bbalance + :delta where Bid = :Bid;
exec sql insert into history(Tid, Bid, Aid, delta, time) values (:Tid, :Bid, :Aid, :delta, CURRENT);
send output msg;
exec sql commit work; }
From Gray’s Presentation
Transaction =
Sequence of Reads and Writes
OLTP System: Architecture
[Figure: transactions 1 … N issue Select / Insert / Update / Delete / Commit operations, which are logical reads/writes against the database buffer; the database buffer in turn issues physical reads/writes.]
Figure 1.3 Architecture of a DBMS
[Figure: unsophisticated users (customers, travel agents, etc.) work through Web Forms and Application Front Ends; sophisticated users, application programmers, and DB administrators use the SQL Interface. Components shown: Plan Executor, Operator Evaluator, Files and Access Methods, Buffer Manager, Disk Space Manager, and on disk the Index Files, Data Files, and System Catalog. Arrow legend: command flow, interaction, references.]
9.4 Buffer Management in a DBMS
 Data must be in RAM
for DBMS to operate on
[Figure: page requests from higher levels arrive at the buffer pool in main memory; each frame is either free or holds a disk page, and the choice of frame is dictated by the replacement policy.]
When a Page is Requested ...
 Buffer pool information table: <frame#, pageid, pin_cnt, dirty>
− In big systems, it is not trivial to just check whether a page is in pool
 If requested page is not in pool:
− Choose a frame for replacement
− If frame is dirty, write it to disk
− Read requested page into chosen frame
 Pin the page and return its address.
 If requests can be predicted (e.g., sequential scans) pages can be
pre-fetched several pages at a time!
More on Buffer Management
 Requestor of page must unpin it, and indicate whether page has
been modified:
− dirty bit is used for this.
 Page in pool may be requested many times,
− a pin count is used.
− a page is a candidate for replacement iff pin count = 0.
 Concurrency control (CC) and recovery may entail additional I/O when a frame is
chosen for replacement (e.g. the Write-Ahead Log protocol).
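The request, pin, unpin, and dirty-bit rules above can be put together in a small sketch. This is not the pseudo code from the cited slide: the class and method names are hypothetical, a dict stands in for the disk, the victim choice is a placeholder (first unpinned frame) rather than a real replacement policy, and WAL and latching are left out.

```python
class Frame:
    def __init__(self):
        self.page_id = None    # which disk page this frame holds, if any
        self.data = None
        self.pin_count = 0
        self.dirty = False


class BufferPool:
    def __init__(self, num_frames, disk):
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}   # page_id -> frame index ("is the page in the pool?")
        self.disk = disk       # dict page_id -> data, a stand-in for the disk
        self.reads = 0
        self.writes = 0

    def _choose_victim(self):
        # Placeholder policy: only frames with pin_count == 0 are candidates.
        for i, frame in enumerate(self.frames):
            if frame.pin_count == 0:
                return i
        raise RuntimeError("all frames are pinned")

    def pin(self, page_id):
        """Return the frame holding page_id, reading it from disk on a miss."""
        idx = self.page_table.get(page_id)
        if idx is None:                              # miss
            idx = self._choose_victim()
            frame = self.frames[idx]
            if frame.page_id is not None:
                if frame.dirty:                      # write back before reuse
                    self.disk[frame.page_id] = frame.data
                    self.writes += 1
                del self.page_table[frame.page_id]
            frame.page_id = page_id
            frame.data = self.disk[page_id]
            frame.dirty = False
            self.reads += 1
            self.page_table[page_id] = idx
        frame = self.frames[idx]
        frame.pin_count += 1
        return frame

    def unpin(self, page_id, dirty):
        """The requestor releases the page and says whether it modified it."""
        frame = self.frames[self.page_table[page_id]]
        if frame.pin_count == 0:
            raise ValueError("page is not pinned")
        frame.pin_count -= 1
        frame.dirty = frame.dirty or dirty
```

A caller would pin a page, work on `frame.data`, then call `unpin(page_id, dirty=True)` if it changed the contents; the write to disk only happens later, when the frame is chosen as a victim.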
Buffer Manager Pseudo Code
[Source: Uwe Röhm’s Slide]
Buffer Replacement Policy
 Hit vs. miss
 Hit ratio = # of hits / ( # of page requests to buffer cache)
− One miss incurs one (or two) physical IO. Hit saves IO.
− Rule of thumb: at least 80 ~ 90%
 Problem: for the given (future) references, which victim should be
chosen for highest hit ratio (i.e. least # of IOs)?
− Numerous policies
− Does one policy win over the others?
− One policy does not fit all reference patterns!
Buffer Replacement Policy
 Frame is chosen for replacement by a replacement policy:
− Random, FIFO, LRU, MRU, LFU, Clock etc.
− Replacement policy can have big impact on # of I/O’s; depends on the
access pattern
 For a given workload, one replacement policy, A, achieves 90% hit
ratio and the other, B, does 91%.
− How much improvement? 1% or 10%?
− We need to interpret its impact in terms of miss ratio, not hit ratio
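The 90% versus 91% question can be checked numerically. The helper below is a small sketch (the function name is made up); it assumes each miss costs the same number of physical I/Os under both policies.

```python
def io_reduction(hit_ratio_a, hit_ratio_b):
    """Fraction of physical I/Os saved by moving from policy A to policy B."""
    miss_a = 1 - hit_ratio_a
    miss_b = 1 - hit_ratio_b
    return (miss_a - miss_b) / miss_a


print(f"{io_reduction(0.90, 0.91):.0%} fewer I/Os")
```

Going from 90% to 91% hits lowers the miss ratio from 10% to 9%, so policy B does about 10% fewer I/Os, not 1%.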
Buffer Replacement Policy (2)
 Least Recently Used (LRU)
− For each page in buffer pool, keep track of time last unpinned
− Replace the frame that has the oldest (earliest) time
− Very common policy: intuitive and simple
− Why does it work?
 “Principle of (temporal) locality” (of references)
 Why temporal locality in database?
− The correct implementation is not trivial
 Especially in large scale systems: e.g. time stamp
 Variants
− Linked list of buffer frames, LRU-K, 2Q, midpoint-insertion and touch count
algorithm(Oracle), Clock, ARC …
− Implication of big memory: “random” > “LRU”??
LRU and Sequential Flooding
 Problem of LRU - sequential flooding
− caused by LRU + repeated sequential scans.
− # buffer frames < # pages in file means each page request causes an I/O. MRU
much better in this situation (but not in all situations, of course).
[Figure: File A has 5 blocks (1 2 3 4 5); the buffer pool has 4 frames (holding 1 2 3 4).]
 Assume repeated sequential scans of file A
 What happens when reading the 5th block? And when reading the 1st block again? ….
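The question on this slide can be answered by simulation. The sketch below counts misses for LRU and MRU on repeated scans of a 5-block file with 4 buffer frames; it is a simplified model (no pin counts, every page is immediately unpinned), and the function name is an assumption for the example.

```python
def count_misses(references, num_frames, policy):
    """Count buffer misses for a reference string under 'LRU' or 'MRU'."""
    frames = []          # ordered from least to most recently used
    misses = 0
    for page in references:
        if page in frames:
            frames.remove(page)          # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(frames) == num_frames:
                if policy == "LRU":
                    frames.pop(0)        # evict the least recently used page
                elif policy == "MRU":
                    frames.pop()         # evict the most recently used page
                else:
                    raise ValueError(f"unknown policy: {policy}")
        frames.append(page)
    return misses


scans = [1, 2, 3, 4, 5] * 4              # four sequential scans of file A
print("LRU misses:", count_misses(scans, 4, "LRU"))
print("MRU misses:", count_misses(scans, 4, "MRU"))
```

Under LRU all 20 requests miss: reading block 5 evicts block 1, which is exactly the block needed next. Under MRU only 8 of the 20 requests miss.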
“Clock” Replacement Policy
 An approximation of LRU
 Arrange frames into a cycle, store one reference bit per frame
 When pin count reduces to 0, turn on reference bit
 When replacement necessary
do for each page in cycle {
    if (pincount == 0 && ref bit is on)
        turn off ref bit;
    else if (pincount == 0 && ref bit is off)
        choose this page for replacement;
} until a page is chosen;
 Second chance algorithm / bit
 Generalized Clock (GCLOCK): ref. bit -> ref. counter
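The clock loop above translates fairly directly into runnable form. The sketch below covers only victim selection and the reference bit; the class name and the bounded two-sweep loop are choices made for this example (two sweeps are enough because the first sweep clears the bit of any unpinned frame).

```python
class ClockPool:
    def __init__(self, num_frames):
        self.pin_count = [0] * num_frames
        self.ref_bit = [False] * num_frames
        self.hand = 0                      # current position of the clock hand

    def unpin(self, frame):
        """When the pin count drops to 0, turn the reference bit on."""
        self.pin_count[frame] -= 1
        if self.pin_count[frame] == 0:
            self.ref_bit[frame] = True

    def choose_victim(self):
        n = len(self.pin_count)
        for _ in range(2 * n):
            i = self.hand
            self.hand = (self.hand + 1) % n
            if self.pin_count[i] == 0:
                if self.ref_bit[i]:
                    self.ref_bit[i] = False    # second chance
                else:
                    return i                   # unpinned and not recently used
        raise RuntimeError("all frames are pinned")


pool = ClockPool(3)
pool.pin_count = [1, 1, 1]     # three pinned frames
pool.unpin(0)
pool.unpin(1)                  # frames 0 and 1 are now candidates with ref bit on
print(pool.choose_victim())
```

In this run the hand clears the bits of frames 0 and 1, skips the pinned frame 2, and picks frame 0 on its second pass.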
Classification of Replacement Policies
[Source: Uwe Röhm’s Slide]
Why Not Store It All in Main Memory?
 Cost!: $20 / 1 GB of DRAM vs. $50 / 150 GB of disk (EIDE/ATA) vs.
$100 / 30 GB (SCSI).
− High-end databases today in the 10-100 TB range.
− Approx. 60% of the cost of a production system is in the disks.
 Some specialized systems (e.g. Main Memory(MM) DBMS) store
entire database in main memory.
− Vendors claim 10x speed up vs. traditional DBMS in main memory.
− SAP HANA, MS Hekaton, Oracle In-memory, Altibase ..
 Main memory is volatile: data should be saved between runs.
− Disk write is inevitable: log write for recovery and periodic checkpoint
[Figure: factors that determine TPS (Transactions Per Second): CPU (Dual, Quad, …), hit ratio (data size vs. buffer size), and multi-threading (CPU-IO overlapping).]
CPU-IO Overlapping
 3 States: CPU Bound, IO Bound, Balanced
For perfect CPU-IO overlapping,
IOPS Matters!
IOPS Crisis in OLTP
IBM for TPC-C (2008 Dec.)
 6M tpmC
 Total cost: 35M $
− Server HW: 12M $
− Server SW: 2M $
− Storage: 20M $
− Client HW/SW: 1M $
 They are buying IOPS, not capacity
IOPS Crisis in OLTP(2)
 For balanced systems, OLTP systems pay huge $$$ on disks for high IOPS
− Starting point: CPU + Server (300 GIPS, 12M $) with 10,000 disks (20M $): a balanced state (Amdahl’s law)
− After 18 months (Moore’s law): CPU + Server (600 GIPS, 12M $) with the same 10,000 disks: a balanced state??? 50% CPU utilization; same TPS
− To rebalance: 20,000 disks (short stroking), 40M $: a balanced state
Some Techniques to Hide IO Bottlenecks
 Pre-fetching: For a sequential scan, pre-fetching several pages at a
time is a big win! Even cache / disk controller support prefetching
 Caching: modern disk controllers do their own caching.
 IO overlapping: CPU works while IO is performing
− Double buffering, asynchronous IO
 Multiple threads
 And, don’t do IOs, avoid IOs
MMDBMS vs. All-Flash DBMS: Personal Thoughts
 Why has MMDBMS become popular recently?
− SAP HANA, MS Hekaton, Oracle In-memory, Altibase, ….
− Disk IOPS was too expensive
− The price of DRAM has kept dropping: DRAM_Cost << Disk_IOPS_Cost
− The overhead of disk-based DBMS is not negligible
− Applications with extreme performance requirements
 Flash storage
− Low $/IOPS
− DRAM_Cost > Flash_IOPS_Cost
 MMDBMS vs. All-Flash DBMS (with some optimizations)
− Winner? Time will tell
Power Consumption Issue in Big Memory
 Why exponential?
 1KWh = 15 ~ 50 cents, 1 year = 1,752$
Source: SIGMOD ’17 keynote by Anastasia Ailamaki
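The yearly figure above can be reproduced with one multiplication. The slide does not state the load or the exact rate behind 1,752$; the sketch below assumes a constant draw of 1 kW at 20 cents per kWh, which happens to give that number, and the function name is made up for the example.

```python
HOURS_PER_YEAR = 24 * 365    # 8,760 hours


def yearly_cost_usd(kilowatts, cents_per_kwh):
    """Electricity cost of drawing 'kilowatts' continuously for one year."""
    return kilowatts * HOURS_PER_YEAR * cents_per_kwh / 100


print(yearly_cost_usd(1, 20))    # assumed: 1 kW at 20 cents/kWh
print(yearly_cost_usd(1, 15), yearly_cost_usd(1, 50))   # the slide's price range
```

Across the 15 ~ 50 cent range quoted on the slide, the same 1 kW costs between 1,314$ and 4,380$ per year.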
Jim Gray Flash Disk Opportunity for Server Applications
(MS-TR 2007)
 “My tests and those of several others suggest that FLASH disks can deliver about 3K random 8KB reads/second and with
some re-engineering about 1,100 random 8KB writes per second. Indeed, it appears that a single FLASH chip could deliver
nearly that performance and there are many chips inside the “box” – so the actual limit could be 4x or more. But, even the
current performance would be VERY attractive for many enterprise applications. For example, in the TPC-C benchmark, has
approximately equal reads and writes. Using the graphs above, and doing a weighted average of the 4-deep 8 KB random
read rate (2,804 IOps), and 4-deep 8 KB sequential write rate (1233 IOps) gives harmonic average of 1713 (1-deep gives
1,624 IOps). TPC-C systems are configured with ~50 disks per cpu. For example the most recent Dell TPC-C system has
ninety 15Krpm 36GB SCSI disks costing 45k$ (with 10k$ extra for maintenance that gets “discounted”). Those disks are
68% of the system cost. They deliver about 18,000 IO/s. That is comparable to the requests/second of ten FLASH disks. So
we could replace those 90 disks with ten NSSD if the data would fit on 320GB (it does not). That would save a lot of money
and a lot of power (1.3Kw of power and 1.3Kw of cooling).” (excerpts from “Flash disk opportunity for server-applications”)
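The harmonic average in the excerpt can be checked directly. This is a small sketch using only the numbers quoted above; an equal mix of reads and writes is what the excerpt states for TPC-C, and the function name is an assumption for the example.

```python
def harmonic_mean(rates):
    """Average rate when equal numbers of operations run at each rate."""
    return len(rates) / sum(1 / r for r in rates)


read_iops = 2804     # 4-deep 8 KB random reads
write_iops = 1233    # 4-deep 8 KB sequential writes
mixed_iops = harmonic_mean([read_iops, write_iops])

print(round(mixed_iops))
print(18_000 / mixed_iops)   # flash disks needed to match the 90-disk array
```

The harmonic mean is the right average here because time per operation, not operations per second, is what adds up; the result rounds to the 1713 IOps in the excerpt, and 18,000 IO/s divided by it gives roughly ten flash disks.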
Flash SSD vs. HDD: Non-Volatile Storage
|            | Flash SSD         | HDD               |
| Role       | A new challenger! | 60 years champion |
| Device     | Electronic        | Mechanical        |
| Read/Write | Asymmetric        | Symmetric         |
| Update     | No Overwrite      | Overwrite         |
Flash SSD: Characteristics
 No overwrite, addr. mapping table
 Asymmetric read/write
 No mechanics (Seq. RD ~ Rand. RD)
 Seq. WR >> Rand. WR
 Parallelism (SSD Architecture)
 New Interface & Computing (SSD Architecture)
Storage Device Metrics
 Capacity ($/GB) : Harddisk >> Flash SSD
 Bandwidth (MB/sec): Harddisk < Flash SSD
 Latency (IOPS): Harddisk << Flash SSD
− e.g. Harddisk
 Commodity hdd (7200rpm): 50$ / 1TB / 100MB/s / 100 IOPS
 Enterprise hdd (15Krpm): 250$ / 72GB / 200MB/s / 500 IOPS
 The price of harddisks is said to be proportional to IOPS, not capacity.
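The claim about price tracking IOPS can be checked against the two example drives. The sketch below just divides the numbers quoted on this slide; the dictionary layout and names are made up for the example.

```python
drives = {
    "commodity":  {"price_usd": 50,  "capacity_gb": 1000, "iops": 100},
    "enterprise": {"price_usd": 250, "capacity_gb": 72,   "iops": 500},
}


def usd_per_gb(drive):
    return drive["price_usd"] / drive["capacity_gb"]


def usd_per_iops(drive):
    return drive["price_usd"] / drive["iops"]


for name, drive in drives.items():
    print(f"{name}: {usd_per_gb(drive):.2f} $/GB, {usd_per_iops(drive):.2f} $/IOPS")
```

The two drives differ by a factor of about 70 in $/GB but come out at the same 0.50 $/IOPS.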
Storage Device Metrics(2): HDD vs. Flash SSDs
 Other Metrics: Weight/shock resistance/heat & cooling,
power(watt) , IOPS/watt, IOPS/$ ….
− Harddisk << Flash SSD
[Source: Rethinking Flash In the Data Center, IEEE Micro 2010]
Evolution of DRAM, HDD, and SSD (1987 – 2017)
 Raja et al., The Five-minute Rule Thirty Years Later and its Impact on the Storage Hierarchy, ADMS ’17
Technology RATIOS Matter
[ Source: Jim Gray’s PPT ]
 Technology ratio change: 1980s vs. 2010s
 If everything changes in the same way, then nothing really changes.[See next slide]
 If some things get much cheaper/faster than others, then that is real change.
 Some things are not changing much: e.g. cost of people, speed of light
 And some things are changing a LOT: e.g. Moore’s law, disk capacity
 Latency lags behind bandwidth
 Bandwidth lags behind capacity
 Flash memory/NVRAMs and their role in the memory hierarchy? Disruptive
technology ratio change -> new disruptive solution!!
 Disk improvements [‘88 – ’04]:
− Capacity: 60%/y
− Bandwidth: 40%/y
− Access time: 16%/y
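Compounding the yearly rates above shows how far the ratios drift apart. The sketch assumes each rate holds constant over the whole 1988 to 2004 span (16 years); the function name is made up for the example.

```python
def cumulative_gain(rate_per_year, years):
    """Total improvement factor after compounding a yearly rate."""
    return (1 + rate_per_year) ** years


YEARS = 2004 - 1988   # 16

capacity = cumulative_gain(0.60, YEARS)
bandwidth = cumulative_gain(0.40, YEARS)
access_time = cumulative_gain(0.16, YEARS)

print(f"capacity x{capacity:.0f}, bandwidth x{bandwidth:.0f}, access time x{access_time:.1f}")
print(f"capacity / access time: x{capacity / access_time:.0f}")
```

Under that assumption capacity grows by three orders of magnitude while access time improves by about one, which is the ratio change the slide is pointing at.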
Latency Gap in Memory Hierarchy
Typical access latency & granularity
We need a 'gap filler'!
[Source: Uwe Röhm’s Slide]
 Latency lags behind bandwidth [David Patterson, CACM Oct. 2004]
 Bandwidth problem can be cured with money, but latency problem is harder
Our Message in SIGMOD ’09 Paper
One FlashSSD can beat Ten Harddisks in OLTP
- Performance, Price, Capacity, Power – (in 2008)
 In 2015, one FlashSSD can beat several tens of harddisks in OLTP
Flash-based TPC-C @ 2013 September
 Oracle + Sun Flash Storage
− 8.5M tpmC
 Total cost: 4.7M $
− Server HW: .6M $
− Server SW: 1.9M $
− Storage: 1.8M $
 216 400GB Flash Modules: 1.1M $
 86 3TB 7.2K HDD: 0.07M $
− Client HW/SW: 0.1M $
− Others: 0.1M$
 Implications
− More vertical stacks (by SW vendor )
− Harddisk vendors (e.g. Seagate)
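Putting this flash-based result next to the disk-based IBM result from the IOPS Crisis slide makes the shift concrete. The sketch uses only the rounded figures quoted in this deck (6M tpmC, 35M $ total, 20M $ storage for IBM; 8.5M tpmC, 4.7M $ total, 1.8M $ storage here), so the ratios are approximate and are not official TPC price/performance numbers.

```python
systems = {
    "IBM, disk (2008)":            {"tpmc": 6.0e6, "total_usd": 35e6,  "storage_usd": 20e6},
    "Oracle + Sun, flash (2013)":  {"tpmc": 8.5e6, "total_usd": 4.7e6, "storage_usd": 1.8e6},
}


def usd_per_tpmc(system):
    return system["total_usd"] / system["tpmc"]


def storage_share(system):
    return system["storage_usd"] / system["total_usd"]


for name, system in systems.items():
    print(f"{name}: {usd_per_tpmc(system):.2f} $/tpmC, "
          f"storage is {storage_share(system):.0%} of total cost")
```

On these figures the cost per tpmC drops by roughly a factor of ten, and storage shrinks from more than half of the system cost to well under half.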
HDD vs. SSD [Patterson 2016]
Page Size
 Default page size in Linux and Windows: 4KB (512B sector)
 Oracle
− “Oracle recommends smaller Oracle Database block sizes (2 KB or 4 KB) for online
transaction processing or mixed workload environments and larger block sizes (8 KB,16
KB, or 32 KB) for decision support system workload environments.” (see chapter “IO
Config. And Design” in “Database Performance Tuning Guide” book )
Why Smaller Page in SSD?
 Small page size advantages – Better throughput
LinkBench: Page Size
 Benefits of small page
− Better read/write IOPS
 Exploit internal parallelism
− Better buffer-pool hit ratio

Cherries 32 collection of colorful paintingsCherries 32 collection of colorful paintings
Cherries 32 collection of colorful paintings
sandamichaela *
My storyboard for a sword fight scene with lightsabers
My storyboard for a sword fight scene with lightsabersMy storyboard for a sword fight scene with lightsabers
My storyboard for a sword fight scene with lightsabers
Ealing London Independent Photography meeting - June 2024
Ealing London Independent Photography meeting - June 2024Ealing London Independent Photography meeting - June 2024
Ealing London Independent Photography meeting - June 2024
Sean McDonnell
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
Dino Ranch Storyboard / Kids TV Advertising
Dino Ranch Storyboard / Kids TV AdvertisingDino Ranch Storyboard / Kids TV Advertising
Dino Ranch Storyboard / Kids TV Advertising
Alessandro Occhipinti
➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...
➒➌➎➏➑➐➋➑➐➐Dpboss Matka Guessing Satta Matka Kalyan Chart Indian Matka
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...
❼❷⓿❺❻❷❽❷❼❽  Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...❼❷⓿❺❻❷❽❷❼❽  Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...
➒➌➎➏➑➐➋➑➐➐Dpboss Matka Guessing Satta Matka Kalyan Chart Indian Matka
In Focus_ The Evolution of Boudoir Photography in NYC.pdf
In Focus_ The Evolution of Boudoir Photography in NYC.pdfIn Focus_ The Evolution of Boudoir Photography in NYC.pdf
In Focus_ The Evolution of Boudoir Photography in NYC.pdf
Boudoir Photography by Your Hollywood Portrait
All the images mentioned in 'See What You're Missing'
All the images mentioned in 'See What You're Missing'All the images mentioned in 'See What You're Missing'
All the images mentioned in 'See What You're Missing'
Dave Boyle
Domino Express Storyboard - TV Adv Toys 30"
Domino Express Storyboard - TV Adv Toys 30"Domino Express Storyboard - TV Adv Toys 30"
Domino Express Storyboard - TV Adv Toys 30"
Alessandro Occhipinti
Tibbetts_HappyAwesome_NewArc Sketch to AI
Tibbetts_HappyAwesome_NewArc Sketch to AITibbetts_HappyAwesome_NewArc Sketch to AI
Tibbetts_HappyAwesome_NewArc Sketch to AI
Todd Tibbetts
My storyboard for the short film "Maatla".
My storyboard for the short film "Maatla".My storyboard for the short film "Maatla".
My storyboard for the short film "Maatla".
Portfolio of my work as my passion and skills
Portfolio of my work as my passion and skillsPortfolio of my work as my passion and skills
Portfolio of my work as my passion and skills
HOW TO USE PINTEREST_by: Clarissa Credito
HOW TO USE PINTEREST_by: Clarissa CreditoHOW TO USE PINTEREST_by: Clarissa Credito
HOW TO USE PINTEREST_by: Clarissa Credito
This is a test powerpoint!!!!!!!!!!!!!!!
This is a test powerpoint!!!!!!!!!!!!!!!This is a test powerpoint!!!!!!!!!!!!!!!
This is a test powerpoint!!!!!!!!!!!!!!!
Barbie Made To Move Skin Tones Matches.pptx
Barbie Made To Move Skin Tones Matches.pptxBarbie Made To Move Skin Tones Matches.pptx
Barbie Made To Move Skin Tones Matches.pptx

Recently uploaded (20)

Cherries 32 collection of colorful paintings
Cherries 32 collection of colorful paintingsCherries 32 collection of colorful paintings
Cherries 32 collection of colorful paintings
My storyboard for a sword fight scene with lightsabers
My storyboard for a sword fight scene with lightsabersMy storyboard for a sword fight scene with lightsabers
My storyboard for a sword fight scene with lightsabers
Ealing London Independent Photography meeting - June 2024
Ealing London Independent Photography meeting - June 2024Ealing London Independent Photography meeting - June 2024
Ealing London Independent Photography meeting - June 2024
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
❼❷⓿❺❻❷❽❷❼❽ Dpboss Kalyan Satta Matka Guessing Matka Result Main Bazar chart
Dino Ranch Storyboard / Kids TV Advertising
Dino Ranch Storyboard / Kids TV AdvertisingDino Ranch Storyboard / Kids TV Advertising
Dino Ranch Storyboard / Kids TV Advertising
➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Satta Matka Matka Guessing Kalyan Chart Indian Matka Satta ...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...
❼❷⓿❺❻❷❽❷❼❽  Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...❼❷⓿❺❻❷❽❷❼❽  Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...
❼❷⓿❺❻❷❽❷❼❽ Dpboss Matka ! Fix Satta Matka ! Matka Result ! Matka Guessing ! ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...
➒➌➎➏➑➐➋➑➐➐ Dpboss Matka Guessing Satta Matka Kalyan panel Chart Indian Matka ...
In Focus_ The Evolution of Boudoir Photography in NYC.pdf
In Focus_ The Evolution of Boudoir Photography in NYC.pdfIn Focus_ The Evolution of Boudoir Photography in NYC.pdf
In Focus_ The Evolution of Boudoir Photography in NYC.pdf
All the images mentioned in 'See What You're Missing'
All the images mentioned in 'See What You're Missing'All the images mentioned in 'See What You're Missing'
All the images mentioned in 'See What You're Missing'
Domino Express Storyboard - TV Adv Toys 30"
Domino Express Storyboard - TV Adv Toys 30"Domino Express Storyboard - TV Adv Toys 30"
Domino Express Storyboard - TV Adv Toys 30"
Tibbetts_HappyAwesome_NewArc Sketch to AI
Tibbetts_HappyAwesome_NewArc Sketch to AITibbetts_HappyAwesome_NewArc Sketch to AI
Tibbetts_HappyAwesome_NewArc Sketch to AI
My storyboard for the short film "Maatla".
My storyboard for the short film "Maatla".My storyboard for the short film "Maatla".
My storyboard for the short film "Maatla".
Portfolio of my work as my passion and skills
Portfolio of my work as my passion and skillsPortfolio of my work as my passion and skills
Portfolio of my work as my passion and skills
HOW TO USE PINTEREST_by: Clarissa Credito
HOW TO USE PINTEREST_by: Clarissa CreditoHOW TO USE PINTEREST_by: Clarissa Credito
HOW TO USE PINTEREST_by: Clarissa Credito
This is a test powerpoint!!!!!!!!!!!!!!!
This is a test powerpoint!!!!!!!!!!!!!!!This is a test powerpoint!!!!!!!!!!!!!!!
This is a test powerpoint!!!!!!!!!!!!!!!
Barbie Made To Move Skin Tones Matches.pptx
Barbie Made To Move Skin Tones Matches.pptxBarbie Made To Move Skin Tones Matches.pptx
Barbie Made To Move Skin Tones Matches.pptx

W1.1 i os in database

  • 1. Ch 9. Storing Data: Disks and Files - Heap File Structure - Sang-Won Lee SKKU VLDB Lab. & SOS ( )
  • 2. Contents
      - 9.0 Overview
      - 9.1 Memory Hierarchy
      - 9.2 RAID (Redundant Array of Independent Disks)
      - 9.3 Disk Space Management
      - 9.4 Buffer Manager
      - 9.5 Files of Records
      - 9.6 Page Format
      - 9.7 Record Format
  • 3. Memory Hierarchy
      - Trade-off across levels: smaller, faster, expensive, volatile vs. bigger, slower, cheaper, non-volatile.
      - Main memory (RAM) for currently used data.
      - Disk for the main database (secondary storage).
      - Tapes for archiving older versions of the data (tertiary storage).
      - Why a memory hierarchy?
      - What if an ideal storage device appears (fast, cheap, large, non-volatile), e.g. PCM, MRAM, FeRAM?
  • 4. Jim Gray's Storage Latency Analogy: How Far Away is the Data?
      - Registers: relative latency 1; "My Head"; 1 min
      - On-chip cache: relative latency 2; "This Room"
      - On-board cache: relative latency 10; "This Hotel"; 10 min
      - Memory: relative latency 100; "Sacramento"; 1.5 hr
      - Disk: relative latency 10^6; "Pluto"; 2 years
      - Tape / optical robot: relative latency 10^9; "Andromeda"; 2,000 years
  • 5. Disks and Files
      - A DBMS stores information on (hard) disks.
          - Electronic (CPU, DRAM) vs. mechanical (hard disk)
      - This has major implications for DBMS design!
          - READ: transfer data from disk to main memory (RAM).
          - WRITE: transfer data from RAM to disk.
          - Both are expensive relative to in-memory operations, so they must be planned carefully!
      - DRAM: ~10 ns
      - Hard disk: ~10 ms
      - SSD: 80 us ~ 10 ms
  • 6. Disks
      - Secondary storage device of choice.
      - Main advantage over tapes: random access vs. sequential access.
          - Tapes deteriorate over time.
      - Data is stored and retrieved in units called disk blocks or pages.
      - Unlike RAM, the time to retrieve a disk page varies depending on its location on disk.
          - Thus, the relative placement of pages on disk has a big impact on DB performance!
              - e.g. adjacent allocation of the pages of the same table.
          - We need to optimize both data placement and access.
              - e.g. the elevator disk scheduling algorithm
  • 7. Anatomy of a Disk
      - The platters spin, e.g. at 5400 / 7200 / 15K rpm.
      - The arm assembly is moved in or out to position a head on a desired track. The tracks under the heads make a cylinder.
      - Mechanical storage -> low IOPS
      - Only one head reads/writes at any one time.
          - Degree of parallelism: 1
      - Block size is a multiple of sector size.
      - Update-in-place: poisoned apple
          - No atomic write
          - fsync for ordering / durability
  • 8. Accessing a Disk Page
      - Time to access (read/write) a disk block:
          - seek time (moving the arm to position the disk head on the track)
          - rotational delay (waiting for the block to rotate under the head)
          - transfer time (actually moving data to/from the disk surface)
      - Seek time and rotational delay dominate.
          - Seek time: about 1 to 20 ms
          - Rotational delay: from 0 to 10 ms
          - Transfer time: about 1 ms per 4KB page
      - Key to lower I/O cost: reduce seek/rotational delays!
          - e.g. disk scheduling algorithms in the OS (Linux: 4 I/O schedulers)
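The cost model on this slide can be sketched in a few lines of Python. This is only an illustration: the 10 ms seek and 5 ms rotational delay are assumed values picked from inside the ranges the slide quotes, the function names are hypothetical, and the sequential case assumes the pages are laid out contiguously so that only one seek and one rotational delay are paid.

```python
def access_time_ms(seek_ms: float, rotation_ms: float, transfer_ms: float) -> float:
    """Time to access one disk block: seek + rotational delay + transfer."""
    return seek_ms + rotation_ms + transfer_ms


def read_pages_ms(pages: int, seek_ms: float, rotation_ms: float,
                  transfer_ms_per_page: float, sequential: bool) -> float:
    """Estimated time to read `pages` pages.

    Random access pays seek + rotation for every page; sequential access
    (assumed contiguous layout) pays them once and then only transfers.
    """
    if pages == 0:
        return 0.0
    if sequential:
        return seek_ms + rotation_ms + pages * transfer_ms_per_page
    return pages * access_time_ms(seek_ms, rotation_ms, transfer_ms_per_page)


SEEK_MS = 10.0      # assumption: inside the slide's 1-20 ms range
ROTATION_MS = 5.0   # assumption: midpoint of the slide's 0-10 ms range
TRANSFER_MS = 1.0   # from the slide: about 1 ms per 4KB page

random_ms = read_pages_ms(1000, SEEK_MS, ROTATION_MS, TRANSFER_MS, sequential=False)
sequential_ms = read_pages_ms(1000, SEEK_MS, ROTATION_MS, TRANSFER_MS, sequential=True)
print(f"1000 random pages:     {random_ms:.0f} ms")
print(f"1000 sequential pages: {sequential_ms:.0f} ms")
```

Under these assumed numbers the transfer time is a small fraction of each random access, which is why the slide targets seek and rotational delay.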
  • 9. Arranging Pages on Disk
      - "Next" block concept:
          - blocks on the same track, followed by
          - blocks on the same cylinder, followed by
          - blocks on an adjacent cylinder
      - Blocks in a file should be arranged sequentially on disk (by "next") to minimize seek and rotational delay.
      - Disk fragmentation problem
          - Is this still a problem with flash storage?
  • 10. Table, Insertions, Heap Files
        CREATE TABLE TEST (a int, b int, c varchar2(650));

        /* Insert 1M tuples into TEST table (approximately 664 bytes per tuple) */
        BEGIN
          FOR i IN 1..1000000 LOOP
            INSERT INTO TEST (a, b, c) VALUES (i, i, rpad('X', 650, 'X'));
          END LOOP;
        END;

        /* Page = 8KB
           10 tuples / page
           100,000 pages in total
           TEST table = 800MB */
      - Figure: TEST segment (heap file) consisting of Data Page 1, Data Page 2, ..., Data Page i, ..., Data Page 100,000
  • 11. OLAP vs. OLTP
      - On-Line Analytical Processing vs. On-Line Transactional Processing
        SQL> SELECT SUM(b) FROM TEST;

            SUM(B)
        ----------
        5.0000E+11

        Execution Plan
        ---------------------------------------------------------------------------
        | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
        ---------------------------------------------------------------------------
        |   0 | SELECT STATEMENT   |      |     1 |     5 | 22053   (1)| 00:04:25 |
        |   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
        |   2 |   TABLE ACCESS FULL| TEST |   996K|  4865K| 22053   (1)| 00:04:25 |
        ---------------------------------------------------------------------------

        Statistics
        ----------------------------------------------------------
            179  recursive calls
              0  db block gets
         100152  consistent gets
         100112  physical reads
          .....
              1  rows processed
      - Figure: TEST segment (heap file) consisting of Data Page 1, Data Page 2, ..., Data Page i, ..., Data Page 100,000
  • 12. OLAP vs. OLTP
      - Point or range query
        SQL> SELECT B FROM TEST WHERE A = 500000;

                 B
        ----------
            500000

        Execution Plan
        ---------------------------------------------------------------------------
        | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
        ---------------------------------------------------------------------------
        |   0 | SELECT STATEMENT   |      |     1 |     5 | 22053   (1)| 00:04:25 |
        |   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
        |   2 |   TABLE ACCESS FULL| TEST |   996K|  4865K| 22053   (1)| 00:04:25 |
        ---------------------------------------------------------------------------

        Statistics
        ----------------------------------------------------------
            179  recursive calls
              0  db block gets
         100152  consistent gets
         100112  physical reads
          .....
              1  rows processed
      - Figure: TEST segment (heap file) consisting of Data Page 1, Data Page 2, ..., Data Page i, ..., Data Page 100,000
  • 13. OLAP vs. OLTP
        CREATE INDEX TEST_A ON TEST(A);

        SELECT B /* point query */
        FROM TEST
        WHERE A = 500000;

        SELECT B /* range query */
        FROM TEST
        WHERE A BETWEEN 50001 AND 50101;
      - Figure: B-tree index on TEST(A). The search key 500000 leads to the index entry (500000, (50000, 10)), i.e. block #50000, slot #10, which holds the tuple (500000, 500000, 'XX...XX') in the TEST segment (heap file: Data Page 1, Data Page 2, ..., Data Page 100000).
      - Cost:
          - Full table scan: 100,000 block accesses
          - Index: (3~4) + data block accesses
              - Point query: 1
              - Range queries: depends on the range
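The page counts and access costs on slides 10 to 13 follow from simple arithmetic, sketched below in Python. The helper names are hypothetical, the index height of 3 is taken from the low end of the slide's "3~4", and the range-query estimate assumes tuples sit in the heap file in insertion order (tuple i on page (i - 1) // 10), which the slides suggest but do not state. It also ignores extra index leaf blocks read during the range scan.

```python
import math

PAGE_KB = 8                # from the slides: page = 8KB
TUPLES = 1_000_000         # from the slides: 1M tuples
TUPLES_PER_PAGE = 10       # from the slides: 10 tuples / page

pages = math.ceil(TUPLES / TUPLES_PER_PAGE)
table_mb = pages * PAGE_KB // 1000   # the slide's round figure (1 MB taken as 1000 KB)


def full_scan_cost(num_pages: int) -> int:
    """A full table scan touches every data page once."""
    return num_pages


def index_access_cost(index_height: int, data_blocks: int) -> int:
    """Index traversal (root to leaf) plus the data blocks fetched."""
    return index_height + data_blocks


INDEX_HEIGHT = 3  # assumption: low end of the slide's "3~4"

point_cost = index_access_cost(INDEX_HEIGHT, data_blocks=1)

# Range query A BETWEEN 50001 AND 50101, assuming insertion-order placement.
range_pages = {(a - 1) // TUPLES_PER_PAGE for a in range(50001, 50102)}
range_cost = index_access_cost(INDEX_HEIGHT, data_blocks=len(range_pages))

print(f"pages={pages}, table~{table_mb}MB")
print(f"full scan={full_scan_cost(pages)}, point={point_cost}, range={range_cost}")
```

The gap between 100,000 block accesses and a handful is the whole argument for index-based access in OLTP; the range estimate grows with the range and degrades if matching tuples are scattered across pages.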
  • 14. OLAP vs. OLTP
      - Index-based table access
        SQL> SELECT B FROM TEST WHERE A = 500000;

                 B
        ----------
            500000

        Execution Plan
        --------------------------------------------------------------------------------------
        | Id  | Operation                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
        --------------------------------------------------------------------------------------
        |   0 | SELECT STATEMENT            |        |     1 |    10 |     4   (0)| 00:00:01 |
        |   1 |  TABLE ACCESS BY INDEX ROWID| TEST   |     1 |    10 |     4   (0)| 00:00:01 |
        |*  2 |   INDEX RANGE SCAN          | TEST_A |     1 |       |     3   (0)| 00:00:01 |
        --------------------------------------------------------------------------------------

        Statistics
        ----------------------------------------------------------
          .....
              5  consistent gets
              4  physical reads
          .....
              1  rows processed
  • 15. OLTP: TPC-A/B/C Benchmark
        exec sql begin declare section;
        long Aid, Bid, Tid, delta, Abalance;
        exec sql end declare section;

        DCApplication()
        {
            read input msg;
            exec sql begin work;
            exec sql update accounts
                     set Abalance = Abalance + :delta
                     where Aid = :Aid;
            exec sql select Abalance into :Abalance
                     from accounts
                     where Aid = :Aid;
            exec sql update tellers
                     set Tbalance = Tbalance + :delta
                     where Tid = :Tid;
            exec sql update branches
                     set Bbalance = Bbalance + :delta
                     where Bid = :Bid;
            exec sql insert into history(Tid, Bid, Aid, delta, time)
                     values (:Tid, :Bid, :Aid, :delta, CURRENT);
            send output msg;
            exec sql commit work;
        }
      - From Gray's presentation
      - Transaction = sequence of reads and writes
  • 16. OLTP System: Architecture
      - Figure: transactions 1 ... N issue SQL operations (Select / Insert / Update / Delete / Commit); these are logical reads/writes against the database buffer, which performs physical reads/writes against the database.
  • 17. Figure 1.3 Architecture of a DBMS
      - Users: unsophisticated users (customers, travel agents, etc.) through web forms and application front ends; sophisticated users, application programmers, and DB administrators through the SQL interface. Both send SQL commands to the DBMS.
      - Query evaluation engine: parser, optimizer, operator evaluator, plan executor
      - Files and access methods, buffer manager, disk space manager
      - Concurrency control: transaction manager, lock manager; recovery manager
      - Database: index files, data files, system catalog
      - Legend: arrows show command flow, interaction, and references
      - Annotations: shared components vs. per-connection components
  • 18. 9.4 Buffer Management in a DBMS
      - Data must be in RAM for the DBMS to operate on it!
      - Figure: page requests from higher levels arrive at the buffer pool in main memory, whose frames either hold a disk page or are free; pages are read from the DB on disk, and the choice of frame is dictated by the replacement policy.
  • 19. When a Page is Requested ...
      - Buffer pool information table: <frame#, pageid, pin_cnt, dirty>
          - In big systems, even checking whether a page is in the pool is not trivial.
      - If the requested page is not in the pool:
          - Choose a frame for replacement.
          - If the frame is dirty, write it to disk.
          - Read the requested page into the chosen frame.
      - Pin the page and return its address.
      - If requests can be predicted (e.g., sequential scans), several pages can be pre-fetched at a time!
  • 20. More on Buffer Management
      - The requestor of a page must unpin it and indicate whether the page has been modified:
          - a dirty bit is used for this.
      - A page in the pool may be requested many times:
          - a pin count is used.
          - a page is a candidate for replacement iff pin count = 0.
      - Concurrency control (CC) and recovery may entail additional I/O when a frame is chosen for replacement (e.g. the Write-Ahead Log protocol).
  • 21. Buffer Manager Pseudo Code [Source: Uwe Röhm's Slide]
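The pseudo code from the original slide is an image and did not survive in this transcript. As a stand-in, here is a minimal Python sketch of the pin/unpin protocol described on slides 19 and 20. It is not the code from the cited slide: the class and method names are hypothetical, the "disk" is just a dictionary, and the victim choice (an empty frame if any, otherwise the first unpinned frame) is a placeholder for a real replacement policy.

```python
class Frame:
    """One buffer frame; mirrors the <frame#, pageid, pin_cnt, dirty> table."""
    def __init__(self):
        self.page_id = None
        self.data = None
        self.pin_cnt = 0
        self.dirty = False


class BufferPool:
    def __init__(self, num_frames, disk):
        self.frames = [Frame() for _ in range(num_frames)]
        self.page_table = {}   # page_id -> frame index
        self.disk = disk       # hypothetical stand-in: dict of page_id -> bytes
        self.reads = 0         # physical reads issued
        self.writes = 0        # physical writes issued

    def pin_page(self, page_id):
        idx = self.page_table.get(page_id)
        if idx is None:                         # miss: bring the page in
            idx = self._choose_victim()
            frame = self.frames[idx]
            if frame.page_id is not None:
                if frame.dirty:                 # write back before reuse
                    self.disk[frame.page_id] = frame.data
                    self.writes += 1
                del self.page_table[frame.page_id]
            frame.page_id = page_id
            frame.data = self.disk[page_id]
            frame.dirty = False
            self.reads += 1
            self.page_table[page_id] = idx
        frame = self.frames[idx]
        frame.pin_cnt += 1                      # pin and hand back the frame
        return frame

    def unpin_page(self, page_id, dirty):
        frame = self.frames[self.page_table[page_id]]
        if frame.pin_cnt == 0:
            raise ValueError("page is not pinned")
        frame.pin_cnt -= 1
        frame.dirty = frame.dirty or dirty      # requestor reports modification

    def _choose_victim(self):
        candidates = [i for i, f in enumerate(self.frames) if f.pin_cnt == 0]
        if not candidates:
            raise RuntimeError("all frames are pinned")
        # Placeholder policy: empty frames first, then the first unpinned frame.
        return min(candidates, key=lambda i: self.frames[i].page_id is not None)
```

A caller pins a page, works on `frame.data`, and then calls `unpin_page(page_id, dirty=True)` if it changed the contents; the write to disk happens later, when that frame is picked for replacement. A real system would also have to honor the Write-Ahead Log protocol at that point, which this sketch leaves out.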
  • 22. Buffer Replacement Policy
      - Hit vs. miss
      - Hit ratio = # of hits / (# of page requests to the buffer cache)
          - One miss incurs one (or two) physical I/Os; a hit saves I/O.
          - Rule of thumb: at least 80 ~ 90%
      - Problem: for the given (future) references, which victim should be chosen for the highest hit ratio (i.e. the fewest I/Os)?
          - Numerous policies
          - Does one policy win over the others?
          - One policy does not fit all reference patterns!
  • 23. Buffer Replacement Policy
      - A frame is chosen for replacement by a replacement policy:
          - Random, FIFO, LRU, MRU, LFU, Clock, etc.
          - The replacement policy can have a big impact on the # of I/Os; it depends on the access pattern.
      - For a given workload, replacement policy A achieves a 90% hit ratio and policy B achieves 91%.
          - How much improvement? 1% or 10%?
          - We need to interpret the impact in terms of miss ratio, not hit ratio.
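The 90% vs. 91% question can be worked through numerically. The sketch below assumes, for simplicity, that each miss costs exactly one physical I/O and uses an arbitrary workload of one million page requests; the function name is hypothetical.

```python
def misses(requests: int, hit_pct: int) -> int:
    """Number of buffer misses for a workload, given the hit ratio in percent."""
    return requests * (100 - hit_pct) // 100


REQUESTS = 1_000_000           # assumption: arbitrary workload size

ios_a = misses(REQUESTS, 90)   # policy A: 10% miss ratio
ios_b = misses(REQUESTS, 91)   # policy B:  9% miss ratio
reduction = (ios_a - ios_b) / ios_a

print(f"policy A: {ios_a} I/Os, policy B: {ios_b} I/Os")
print(f"I/O reduction: {reduction:.0%}")
```

The hit ratio moves by one percentage point, but the miss ratio drops from 10% to 9%, so policy B issues 10% fewer physical I/Os.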
  • 24. Buffer Replacement Policy (2)
      - Least Recently Used (LRU)
          - For each page in the buffer pool, keep track of the time it was last unpinned.
          - Replace the frame that has the oldest (earliest) time.
          - Very common policy: intuitive and simple
          - Why does it work?
              - "Principle of (temporal) locality" (of references)
              - Why is there temporal locality in databases?
          - A correct implementation is not trivial.
              - Especially in large-scale systems: e.g. time stamps
      - Variants
          - Linked list of buffer frames, LRU-K, 2Q, midpoint insertion and touch count algorithm (Oracle), Clock, ARC, ...
          - Implication of big memory: "random" > "LRU"??
  • 25. LRU and Sequential Flooding
      - Problem of LRU: sequential flooding
          - caused by LRU + repeated sequential scans.
          - # buffer frames < # pages in file means each page request causes an I/O. MRU is much better in this situation (but not in all situations, of course).
      - Figure: DB buffer pool of size 4 blocks; file A with blocks 1 2 3 4 5, scanned again as 1 2 3 4 ....
      - Assume repeated sequential scans of file A.
      - What happens when reading the 5th block? And when reading the 1st block again?
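The questions on this slide can be answered by simulation. The following Python sketch replays repeated scans of a 5-block file through a 4-frame pool under LRU and MRU and counts misses. It is a simplified model: pages are never pinned, every reference counts as a "use", and the number of scans (10) is an arbitrary choice.

```python
from collections import OrderedDict


def count_misses(refs, num_frames, policy):
    """Count buffer misses for a reference string under 'LRU' or 'MRU'."""
    pool = OrderedDict()   # keys ordered from least to most recently used
    misses = 0
    for page in refs:
        if page in pool:
            pool.move_to_end(page)        # hit: now the most recently used
            continue
        misses += 1
        if len(pool) >= num_frames:
            # LRU evicts the front (oldest); MRU evicts the back (newest).
            pool.popitem(last=(policy == "MRU"))
        pool[page] = None
    return misses


FILE_BLOCKS = [1, 2, 3, 4, 5]    # file A from the slide
SCANS = 10                       # assumption: number of repeated scans
refs = FILE_BLOCKS * SCANS

lru_misses = count_misses(refs, num_frames=4, policy="LRU")
mru_misses = count_misses(refs, num_frames=4, policy="MRU")
print(f"requests={len(refs)}, LRU misses={lru_misses}, MRU misses={mru_misses}")
```

Under LRU, reading block 5 evicts block 1, which is exactly the block needed next, so every request misses. MRU evicts the block just read and keeps most of the file resident across scans.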
  • 26. "Clock" Replacement Policy
      - An approximation of LRU
      - Arrange frames into a cycle; store one reference bit per frame.
      - When the pin count drops to 0, turn on the reference bit.
      - When replacement is necessary:
            do for each page in cycle {
                if (pincount == 0 && ref bit is on)
                    turn off ref bit;
                else if (pincount == 0 && ref bit is off)
                    choose this page for replacement;
            } until a page is chosen;
      - Second chance algorithm / bit
      - Generalized Clock (GCLOCK): ref. bit -> ref. counter
      - Figure: frames A(1), B(p), C(1), D(1) arranged in a cycle
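The sweep described on this slide might look as follows in Python. This is a sketch with hypothetical names, not a production implementation; it adds one guard the slide's pseudo code leaves implicit, namely failing fast when every frame is pinned, since the loop would otherwise never terminate.

```python
class ClockPool:
    """Frame bookkeeping for the Clock (second chance) replacement policy."""

    def __init__(self, num_frames):
        self.pin_cnt = [0] * num_frames
        self.ref_bit = [False] * num_frames
        self.hand = 0                       # current position of the clock hand

    def pin(self, frame):
        self.pin_cnt[frame] += 1

    def unpin(self, frame):
        self.pin_cnt[frame] -= 1
        if self.pin_cnt[frame] == 0:
            self.ref_bit[frame] = True      # recently used: gets a second chance

    def choose_victim(self):
        if all(c > 0 for c in self.pin_cnt):
            raise RuntimeError("all frames are pinned")
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % len(self.pin_cnt)
            if self.pin_cnt[i] > 0:
                continue                    # pinned frames are skipped
            if self.ref_bit[i]:
                self.ref_bit[i] = False     # use up the second chance
            else:
                return i                    # unpinned and not recently used


# The slide's figure, reading "(1)" as ref bit on and "(p)" as pinned
# (an interpretation, not stated in the transcript).
pool = ClockPool(4)
pool.pin_cnt = [0, 1, 0, 0]                 # A, B (pinned), C, D
pool.ref_bit = [True, False, True, True]
victim = pool.choose_victim()
print("victim frame:", "ABCD"[victim])
```

Starting at A, the hand clears the reference bits of A, C, and D, skips the pinned B, and picks A on the second pass. GCLOCK would replace the boolean with a counter that is decremented on each pass.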
  • 27. Classification of Replacement Policies [Source: Uwe Röhm's Slide]
  • 28. Why Not Store It All in Main Memory?
      - Cost! $20 / 1 GB of DRAM vs. $50 / 150 GB of disk (EIDE/ATA) vs. $100 / 30 GB (SCSI).
          - High-end databases today are in the 10-100 TB range.
          - Approx. 60% of the cost of a production system is in the disks.
      - Some specialized systems (e.g. main memory (MM) DBMSs) store the entire database in main memory.
          - Vendors claim a 10x speedup vs. a traditional DBMS running in main memory.
          - SAP HANA, MS Hekaton, Oracle In-Memory, Altibase, ...
      - Main memory is volatile: data should be saved between runs.
          - Disk writes are inevitable: log writes for recovery and periodic checkpoints.
  • 29. TPS (Transactions Per Second)
      - Figure: factors behind TPS: CPU (dual, quad, ...), IOPS, hit ratio, data size, buffer size, multi-threading (CPU-IO overlapping)
      - Context switching + CPU-IO overlapping
      - 3 states: CPU bound, IO bound, balanced
      - For perfect CPU-IO overlapping, IOPS matters!
  • 30. IOPS Crisis in OLTP
      - IBM for TPC-C (Dec. 2008)
      - 6M tpmC
      - Total cost: $35M
          - Server HW: $12M
          - Server SW: $2M
          - Storage: $20M
          - Client HW/SW: $1M
      - They are buying IOPS, not capacity.
  • 31. IOPS Crisis in OLTP (2)
      - To stay balanced, OLTP systems pay huge $$$ on disks for high IOPS.
      - Figure (three configurations):
          - CPU + server at 300 GIPS ($12M) with 10,000 disks for IOPS ($20M): a balanced state.
          - 18 months later (Moore's law): CPU + server at 600 GIPS ($12M) with the same 10,000 disks: a balanced state??? 50% CPU utilization; same TPS (Amdahl's law).
          - CPU + server at 600 GIPS with 20,000 disks (short stroking, $40M): a balanced state; 2x TPS.
  • 32. Some Techniques to Hide IO Bottlenecks
      - Pre-fetching: for a sequential scan, pre-fetching several pages at a time is a big win! Even caches and disk controllers support prefetching.
      - Caching: modern disk controllers do their own caching.
      - IO overlapping: the CPU works while IO is in progress.
          - Double buffering, asynchronous IO
      - Multiple threads
      - And, above all, don't do IOs: avoid them.
  • 33. MMDBMS vs. All-Flash DBMS: Personal Thoughts
      - Why have MMDBMSs recently become popular?
          - SAP HANA, MS Hekaton, Oracle In-Memory, Altibase, ...
          - Disk IOPS was too expensive.
          - The price of DRAM has kept dropping: DRAM_Cost << Disk_IOPS_Cost
          - The overhead of a disk-based DBMS is not negligible.
          - Applications with extreme performance requirements
      - Flash storage
          - Low $/IOPS
          - DRAM_Cost > Flash_IOPS_Cost
      - MMDBMS vs. all-flash DBMS (with some optimizations)
          - Winner? Time will tell.
  • 34. Power Consumption Issue in Big Memory
      - Why exponential?
      - 1 kWh = 15 ~ 50 cents, 1 year = $1,752
      - Source: SIGMOD '17 keynote by Anastasia Ailamaki
  • 35. Jim Gray, Flash Disk Opportunity for Server Applications (MS-TR 2007)
      - “My tests and those of several others suggest that FLASH disks can deliver about 3K random 8KB reads/second and with some re-engineering about 1,100 random 8KB writes per second. Indeed, it appears that a single FLASH chip could deliver nearly that performance and there are many chips inside the “box” – so the actual limit could be 4x or more. But, even the current performance would be VERY attractive for many enterprise applications. For example, in the TPC-C benchmark, has approximately equal reads and writes. Using the graphs above, and doing a weighted average of the 4-deep 8 KB random read rate (2,804 IOps), and 4-deep 8 KB sequential write rate (1233 IOps) gives harmonic average of 1713 (1-deep gives 1,624 IOps). TPC-C systems are configured with ~50 disks per cpu. For example the most recent Dell TPC-C system has ninety 15Krpm 36GB SCSI disks costing 45k$ (with 10k$ extra for maintenance that gets “discounted”). Those disks are 68% of the system cost. They deliver about 18,000 IO/s. That is comparable to the requests/second of ten FLASH disks. So we could replace those 90 disks with ten NSSD if the data would fit on 320GB (it does not). That would save a lot of money and a lot of power (1.3Kw of power and 1.3Kw of cooling).” (excerpts from “Flash disk opportunity for server-applications”)
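The harmonic average in the quote can be reproduced directly. With equal numbers of reads and writes, one read plus one write takes 1/r + 1/w seconds, so the combined rate is 2 / (1/r + 1/w). The sketch below uses only the figures from the quote; the function and variable names are hypothetical.

```python
def harmonic_mean(rates):
    """Combined operations/second when each rate serves an equal share of operations."""
    return len(rates) / sum(1.0 / r for r in rates)


READ_IOPS = 2804     # from the quote: 4-deep 8 KB random read rate
WRITE_IOPS = 1233    # from the quote: 4-deep 8 KB write rate

mixed_iops = harmonic_mean([READ_IOPS, WRITE_IOPS])
print(f"mixed read/write rate: {mixed_iops:.0f} IOps")

# Disk side of the comparison: 90 disks delivering about 18,000 IO/s in total.
per_disk_iops = 18_000 / 90
ten_flash_iops = 10 * mixed_iops
print(f"per-disk rate: {per_disk_iops:.0f} IO/s, ten flash disks: {ten_flash_iops:.0f} IO/s")
```

The arithmetic mean of the two rates (about 2,019) would overstate throughput, because the slower writes take up most of the elapsed time.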
  • 36. Flash SSD vs. HDD: Non-Volatile Storage
      - HDD, the 60-year champion, vs. flash SSD, a new challenger! Identical interface.
      - Flash SSD: electronic; asymmetric read/write; no overwrite
      - HDD: mechanical; symmetric read/write; overwrite
  • 37. Flash SSD: Characteristics
      - No overwrite; address mapping table
      - Asymmetric read/write
      - No mechanics (seq. RD ~ rand. RD)
      - Seq. WR >> rand. WR
      - Parallelism (SSD architecture)
      - New interface & computing (SSD architecture)
  • 38. Storage Device Metrics
      - Capacity ($/GB): hard disk >> flash SSD
      - Bandwidth (MB/sec): hard disk < flash SSD
      - Latency (IOPS): hard disk << flash SSD
          - e.g. hard disks
              - Commodity HDD (7200 rpm): $50 / 1 TB / 100 MB/s / 100 IOPS
              - Enterprise HDD (15K rpm): $250 / 72 GB / 200 MB/s / 500 IOPS
      - The price of hard disks is said to be proportional to IOPS, not capacity.
  • 39. Storage Device Metrics (2): HDD vs. Flash SSDs
      - Other metrics: weight, shock resistance, heat & cooling, power (watts), IOPS/watt, IOPS/$, ...
          - Hard disk << flash SSD
      - [Source: Rethinking Flash in the Data Center, IEEE Micro 2010]
  • 40. Evolution of DRAM, HDD, and SSD (1987 – 2017)
      - Raja et al., The Five-minute Rule Thirty Years Later and its Impact on the Storage Hierarchy, ADMS '17
  • 41. Technology RATIOS Matter [Source: Jim Gray's PPT]
      - Technology ratio change: 1980s vs. 2010s
      - If everything changes in the same way, then nothing really changes. [See next slide]
      - If some things get much cheaper/faster than others, then that is real change.
      - Some things are not changing much: e.g. cost of people, speed of light.
      - And some things are changing a LOT: e.g. Moore's law, disk capacity.
      - Latency lags behind bandwidth.
      - Bandwidth lags behind capacity.
      - Flash memory/NVRAMs and their role in the memory hierarchy?
      - Disruptive technology ratio change -> new disruptive solution!!
      - Disk improvements ['88 – '04]:
          - Capacity: 60%/y
          - Bandwidth: 40%/y
          - Access time: 16%/y
  • 42. Latency Gap in Memory Hierarchy
      - Figure: typical access latency & granularity. We need a "gap filler"! [Source: Uwe Röhm's Slide]
      - Latency lags behind bandwidth [David Patterson, CACM Oct. 2004].
      - The bandwidth problem can be cured with money, but the latency problem is harder.
  • 43. Our Message in SIGMOD '09 Paper
      - One flash SSD can beat ten hard disks in OLTP: performance, price, capacity, power (in 2008).
      - In 2015, one flash SSD can beat several tens of hard disks in OLTP.
      - Supporting excerpt: Jim Gray, "Flash disk opportunity for server-applications" (quoted in full on slide 35).
  • 44. Flash-based TPC-C @ September 2013
      - Oracle + Sun flash storage
          - 8.5M tpmC
      - Total cost: $4.7M
          - Server HW: $0.6M
          - Server SW: $1.9M
          - Storage: $1.8M
              - 216 x 400GB flash modules: $1.1M
              - 86 x 3TB 7.2K rpm HDDs: $0.07M
          - Client HW/SW: $0.1M
          - Others: $0.1M
      - Implications
          - More vertical stacks (by SW vendors)
          - Hard disk vendors (e.g. Seagate)
  • 45. HDD vs. SSD [Patterson 2016]
  • 46. Page Size
      - Default page size in Linux and Windows: 4KB (512B sector)
      - Oracle
          - "Oracle recommends smaller Oracle Database block sizes (2 KB or 4 KB) for online transaction processing or mixed workload environments and larger block sizes (8 KB, 16 KB, or 32 KB) for decision support system workload environments." (see the chapter "I/O Configuration and Design" in the "Database Performance Tuning Guide")
  • 47. Why Smaller Page in SSD?
      - Small page size advantages
          - Better throughput
  • 48. LinkBench: Page Size
      - Benefits of small pages
          - Better read/write IOPS
              - Exploit internal parallelism
          - Better buffer-pool hit ratio

Editor's Notes

  9. Next, we measured the impact of page size on overall performance. In this experiment, we disabled WRITE_BARRIER and the double write buffer. As the left graph shows, smaller pages deliver better performance with LinkBench. A small page makes data transfer more effective and is likely to exploit the internal parallelism of the SSD better. We also monitored the buffer pool hit ratio while varying the page size. As the right graph shows, smaller pages deliver better hit ratios, and the gap becomes wider with a larger buffer pool. Again, smaller pages reduce the amount of unnecessary data in the buffer pool and use the bandwidth better by reducing unnecessary data transfer between the host and the storage device.
     ----------------------------------------------------------
     In this slide, I will explain the advantages of a small page size using performance results. As the left graph shows, when the write-barrier option was off, the performance gain from a small page size was significant: reducing the page size from 16KB to 4KB more than doubled transaction throughput. This is because a flash memory SSD can exploit its internal parallelism maximally when the write barrier is off. Another benefit of a small page is a better buffer pool cache effect. The right graph shows the buffer pool hit ratio observed while running the LinkBench workload with different page sizes. As the graph shows, the buffer hit ratio increased more quickly with the 4KB page size than with the others. This higher hit ratio, combined with the higher IOPS of a small page, contributes to the widening throughput gap among the three page sizes as the buffer pool grows. Due to limited time, we did not include the graph of transaction throughput for varying buffer pool sizes. Please refer to the paper.