ECE7130: Advanced Computer Architecture: Storage Systems Dr. Xubin He http://iweb.tntech.edu/hexb Email: firstname.lastname@example.org Tel: 931-3723462, Brown Hall 319
Quick overview: storage systems
I/O Benchmarks, Performance and Dependability
Storage Architectures 06/16/10
Disk Figure of Merit: Areal Density
Bits recorded along a track
Metric is Bits Per Inch ( BPI )
Number of tracks per surface
Metric is Tracks Per Inch ( TPI )
Disk Designs Brag about bit density per unit area
Metric is Bits Per Square Inch : Areal Density = BPI x TPI
1956 IBM Ramac — early 1970s Winchester
Developed for mainframe computers, proprietary interfaces
Steady shrink in form factor: 27 in. to 14 in.
Form factor and capacity drives market more than performance
5.25 inch floppy disk formfactor (microcode into mainframe)
Emergence of industry standard disk interfaces
Early 1980s: PCs and first generation workstations
Mid 1980s: Client/server computing
Centralized storage on file server
accelerates disk downsizing: 8 inch to 5.25
Mass market disk drives become a reality
industry standards: SCSI, IPI, IDE
5.25 inch to 3.5 inch drives for PCs, End of proprietary interfaces
1900s: Laptops => 2.5 inch drives
2000s: What new devices leading to new drives?
Solid State Semiconductor:
Significantly cheaper than DRAM but much faster than traditional HDD, high density
Tape: sequential accessed
Driving forces Source: Anderson & Whittington, Seagate, FAST’07 06/16/10
HDD Market Source: Anderson & Whittington, Seagate, FAST’07 06/16/10
HDD Technology Trend Source: Anderson & Whittington, Seagate, FAST’07 06/16/10
Media cause: bit flip, noise…
Data detection and decoding logic, ECC
Right HDD for right application Source: Anderson & Whittington, Seagate, FAST’07 06/16/10
SCSI: wide SCSI, Ultra wide SCSI…
IDE ATA: Parallel ATA (PATA): PATA 66/100/133
Serial ATA (SATA): SATA 150, SATA II 300
Serial Attached SCSI (SAS)
eSATA: external SATA
SATA/SAS compatibility Source: Anderson & Whittington, Seagate, FAST’07 06/16/10
Top HDD manufactures
Seagate (also own Maxtor)
IBM used to make hard disk drives but sold that division to Hitachi.
Top companies to provide storage solutions
Fastest growing storage companies (07/2008) Source: storagesearch.com 06/16/10 Company Yearly growth Main product Coraid 370% NAS ExaGrid Systems 337% Disk to disk backup RELDATA 300% iSCSI Voltaire 194% InfiniBand Transcend Information 166% SSD OnStor 157% NAS Bluepoint Data Storage 140% Online Backup and Storage Compellent 124% NAS Alacritech 100% iSCSI Intransa 100% iSCSI
Solid State Drive
A data storage device that uses solid-state memory to store persistent data.
A SSD either uses Flash non-volatile memory (Flash SSD) or DRAM volatile memory (RAM SSD)
Advantages of SSD
Faster startup: no spin-up
Fast random access for read
Extremely low read/write latency: much smaller seek time
Quiet: no moving
High mechanical reliability: endure shock, vibration
Balanced performance across entire storage device
Disadvantages of SSD
Price: unit price of SSD is 20x of HDD
Limited write cycles (Flash SSD): 30-50 millions
Slower write speed (Flash SSD): erase blocks
Lower storage density
Vulnerable to some effects: abrupt power loss (RAM SSD), magnetic fields and electric/static charges
HPC systems are comprised of 3 major subsystems: processing, networking and storage. In different applications, any one of these subsystems can limit the overall system performance. The HPC Storage Challenge is a competition showcasing effective approaches using the storage subsystem, which is often the limiting system, with actual applications.
Participants must describe their implementations and present measurements of performance, scalability, and storage subsystem utilization. Judging will be based on these measurements as well as innovation and effectiveness; maximum size and peak performance are not the sole criteria.
Finalists will be chosen on the basis of submissions which are in the form of a proposal; submissions are encouraged to include reports of work in progress. Participants with access to either large or small HPC systems are encouraged to enter this challenge.
Energy Efficient Storage: CMU, UIUC
Scalable Storages Meets Petaflops: IBM, UCSC
High availability and Reliable Storage Systems: UC Berkeley, CMU, TTU, UCSC
Object Storage (OSD): CMU, HUST, UCSC
Storage virtualization: CMU
Intelligent Storage Systems: UC Berkeley, CMU
Energy Efficient Storage
disk level: SSD
I/O and file system level: efficient memory, cache, networked storage
Application level: data layout
Green storage initiative (GSI): SNIA
Energy efficient storage networking solutions
Storage administrator and infrastructure
Peta-scale scalable storage
Challenges to using storages that can meet the ever increasing speed, multi-core designs, and petaflop computing capabilities.
Latencies to disks are not keeping up with peta-scale computing.
Incentive approaches are needed.
My research in high performance and reliable storage systems
Active/Active Service for High Availability Computing
Active/Active Metadata Service
Networked Storage Systems
A Unified Multiple-Level Cache for High Performance Storage Systems
Performance-Adaptive UDP for Bulk Data Transfer over Dedicated Network Links: PA-UDP
Improving disk performance.
Use large sectors to improve bandwidth
Use track caches and read ahead;
Read entire track into on-controller cache
Exploit locality (improves both latency and BW)
Design file systems to maximize locality
Allocate files sequentially on disks (exploit track cache)
Locate similar files in same cylinder (reduce seeks)
Locate simlar files in near-by cylinders (reduce seek distance)
Pack bits closer together to improve transfer rate and density.
Use a collection of small disks to form a large, high performance one--->disk array
Stripping data across multiple disks to allow parallel I/O, thus improving performance.
Use Arrays of Small Disks? 14” 10” 5.25” 3.5” 3.5” Disk Array: 1 disk design Conventional: 4 disk designs Low End High End
Katz and Patterson asked in 1987:
Can smaller disks be used to close gap in performance between disks and CPUs?
Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks) Capacity Volume Power Data Rate I/O Rate MTTF Cost IBM 3390K 20 GBytes 97 cu. ft. 3 KW 15 MB/s 600 I/Os/s 250 KHrs $250K IBM 3.5" 0061 320 MBytes 0.1 cu. ft. 11 W 1.5 MB/s 55 I/Os/s 50 KHrs $2K x70 23 GBytes 11 cu. ft. 1 KW 110 MB/s 3900 IOs/s ??? Hrs $150K Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability? 9X 3X 8X 6X
MTTF: Mean Time To Failure: average time that a non - repairable component will operate before experiencing failure.
Reliability of N disks = Reliability of 1 Disk ÷N
50,000 Hours ÷ 70 disks = 700 hours
Disk system MTTF: Drops from 6 years to 1 month!
Arrays without redundancy too unreliable to be useful!
Redundant Arrays of (Inexpensive) Disks
Replicate data over several disks so that no data will be lost if one disk fails.
Redundancy yields high data availability
Availability : service still provided to user, even if some components failed
Disks will still fail
Contents reconstructed from data redundantly stored in the array
Capacity penalty to store redundant info
Bandwidth penalty to update redundant info
Levels of RAID
Original RAID paper described five categories (RAID levels 1-5). (Patterson et al, “A case for redundant arrays of inexpensive disks (RAID)”, ACM SIGMOD, 1988)
Disk striping with no redundant now is called RAID0 or JBOD(Just a bunch of disks).
Other kinds have been proposed in literature,
Level 6 (P+Q Redundancy), Level 10, RAID53, etc.
Except RAID0, all the RAID levels trade disk capacity for reliability, and the extra reliability makes parallism a practical way to improve performance.
RAID 0: Nonredundant (JBOD)
High I/O performance.
Data is not save redundantly.
Single copy of data is striped across multiple disks.
Lack of redundancy.
Least reliable: single disk failure leads to data loss.
file data block 1 block 0 block 2 block 3 Disk 1 Disk 0 Disk 2 Disk 3
Redundant Arrays of Inexpensive Disks RAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its “ mirror ”
Very high availability can be achieved
• Bandwidth sacrifice on write:
Logical write = two physical writes
• Reads may be optimized, minimize the queue and disk search time
• Most expensive solution: 100% capacity overhead
recovery group Targeted for high I/O rate , high availability environments
RAID 2: Memory-Style ECC Data Disks Multiple ECC Disks and a Parity Disk
Multiple disks record the ECC information to determine which disk is in fault
A parity disk is then used to reconstruct corrupted or lost data
Needs log 2 (number of disks) redundancy disks
f 0 (b) b 2 b 1 b 0 b 3 f 1 (b) P(b)
RAID 3: Bit (Byte) Interleaved Parity
Only need one parity disk
Write/Read accesses all disks
Only one request can be serviced at a time
Easy to implement
Provides high bandwidth but not high I/O rates
Targeted for high bandwidth applications: Multimedia, Image Processing 10010011 11001101 10010011 . . . Logical record 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 Striped physical records P Physical record
Sum computed across recovery group to protect against hard disk failures, stored in P disk
Logically, a single high capacity, high transfer rate disk: good for large transfers
Wider arrays reduce capacity costs, but decreases availability
12.5% capacity cost for parity in this configuration
RAID 3 relies on parity disk to discover errors on Read
But every sector has an error detection field
Rely on error detection field to catch errors on read, not on the parity disk
Allows independent reads to different disks simultaneously
Inspiration for RAID 4
RAID 4: Block Interleaved Parity
Blocks: striping units
Allow for parallel access by multiple I/O requests, high I/O rates
Doing multiple small reads is now faster than before. (allows small read requests to be restricted to a single disk).
Large writes(full stripe), update the parity:
P’ = d0’ + d1’ + d2’ + d3’;
Small writes(eg. write on d0), update the parity:
P = d0 + d1 + d2 + d3
P’ = d0’ + d1 + d2 + d3 = P + d0’ + d0;
However, writes are still very slow since the parity
Problems of Disk Arrays: Small Writes (read-modify-write procedure) D0 D1 D2 D3 P D0' + + D0' D1 D2 D3 P' new data old data old parity XOR XOR (1. Read) (2. Read) (3. Write) (4. Write) RAID-5: Small Write Algorithm 1 Logical Write = 2 Physical Reads + 2 Physical Writes
Inspiration for RAID 5
RAID 4 works well for small reads
Small writes (write to one disk):
Option 1: read other data disks, create new sum and write to Parity Disk
Option 2: since P has old sum, compare old data to new data, add the difference to P
Small writes are limited by Parity Disk: Write to D0, D5 both also write to P disk. Parity disk must be updated for every write operation!
D0 D1 D2 D3 P D4 D5 D6 P D7
Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity Independent writes possible because of interleaved parity D0 D1 D2 D3 P D4 D5 D6 P D7 D8 D9 P D10 D11 D12 P D13 D14 D15 P D16 D17 D18 D19 D20 D21 D22 D23 P . . . . . . . . . . . . . . . Disk Columns Increasing Logical Disk Addresses Example: write to D0, D5 uses disks 0, 1, 3, 4
Comparison of RAID Levels (N disks, each with capacity of C) Small write problem High performance, reliability (N-1)xC Interleave parity 5 Write-related bottleneck High redundancy and better performance (N-1)xC Block-level parity 4 Low performance Easy to implement, high error recoverability (N-1)xC Bit(byte)-level parity 3 cost High performance, fastest write (N/2)xC Mirroring 1 No redundancy Maximum data transfer rate and sizes NxC Striping 0 Disadvantage Advantage Capacity Technique RAID level
RAID 6: Recovering from 2 failures
Why > 1 failure recovery?
operator accidentally replaces the wrong disk during a failure
since disk bandwidth is growing more slowly than disk capacity, the MTT Repair a disk in a RAID system is increasing increases the chances of a 2nd failure during repair since takes longer
reading much more data during reconstruction meant increasing the chance of an uncorrectable media failure, which would result in data loss
RAID 6: Recovering from 2 failures
Network Appliance’s row-diagonal parity or RAID-DP
Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
Since it is protecting against a double failure, it adds two check blocks per stripe of data.
If p+1 disks total, p-1 disks have data; assume p=5
Row parity disk is just like in RAID 4
Even parity across the other 4 data blocks in its stripe
Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
Example p = 5
Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are only missing 1 block
Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
Process continues until two failed disks are restored
0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 Diagonal Parity Row Parity Data Disk 3 Data Disk 2 Data Disk 1 Data Disk 0
Summary: RAID Techniques: Goal was performance, popularity due to reliability of storage • Disk Mirroring, Shadowing (RAID 1) Each disk is fully duplicated onto its "shadow" Logical write = two physical writes 100% capacity overhead • Parity Data Bandwidth Array (RAID 3) Parity computed horizontally Logically a single high data bw disk • High I/O Rate Parity Array (RAID 5) Interleaved parity blocks Independent reads and writes Logical write = 2 reads + 2 writes 1 0 0 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 0 0 1 1