  • ECE7130: Advanced Computer Architecture: Storage Systems Dr. Xubin He http://iweb.tntech.edu/hexb Email: hexb@tntech.edu Tel: 931-3723462, Brown Hall 319
  • Outline
    • Quick overview: storage systems
    • RAID
    • Advanced Dependability/Reliability/Availability
    • I/O Benchmarks, Performance and Dependability
  • Storage Architectures
  • Disk Figure of Merit: Areal Density
    • Bits recorded along a track
      • Metric is Bits Per Inch (BPI)
    • Number of tracks per surface
      • Metric is Tracks Per Inch (TPI)
    • Disk designs brag about bit density per unit area
      • Metric is Bits Per Square Inch: Areal Density = BPI x TPI (a short worked example follows below)
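The bullet above defines areal density as BPI x TPI. Below is a minimal back-of-the-envelope sketch of that formula in Python; the BPI and TPI values are hypothetical placeholders, not figures from the slides.

```python
# Back-of-the-envelope areal density calculation (illustrative values only).

def areal_density(bpi: float, tpi: float) -> float:
    """Areal density in bits per square inch = BPI (bits/inch) x TPI (tracks/inch)."""
    return bpi * tpi

if __name__ == "__main__":
    bpi = 1_500_000   # assumed bits per inch along a track
    tpi = 250_000     # assumed tracks per inch across the surface
    print(f"Areal density: {areal_density(bpi, tpi):.2e} bits per square inch")
```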
  • Historical Perspective
    • 1956 IBM Ramac — early 1970s Winchester
      • Developed for mainframe computers, proprietary interfaces
      • Steady shrink in form factor: 27 in. to 14 in.
    • Form factor and capacity drive the market more than performance
    • 1970s developments
      • 5.25 inch floppy disk form factor (microcode into mainframe)
      • Emergence of industry standard disk interfaces
    • Early 1980s: PCs and first generation workstations
    • Mid 1980s: Client/server computing
      • Centralized storage on file server
        • accelerates disk downsizing: 8 inch to 5.25 inch
      • Mass market disk drives become a reality
        • industry standards: SCSI, IPI, IDE
        • 5.25 inch to 3.5 inch drives for PCs, End of proprietary interfaces
    • 1990s: Laptops => 2.5 inch drives
    • 2000s: What new devices leading to new drives?
  • Storage Media
    • Magnetic:
      • Hard drive
    • Optical:
      • CD
    • Solid State Semiconductor:
      • Flash SSD
      • RAM SSD
    • MEMS-based
      • Significantly cheaper than DRAM but much faster than traditional HDD, high density
    • Tape: sequentially accessed
  • Driving forces (Source: Anderson & Whittington, Seagate, FAST’07)
  • HDD Market (Source: Anderson & Whittington, Seagate, FAST’07)
  • HDD Technology Trend (Source: Anderson & Whittington, Seagate, FAST’07)
  • Errors
    • Errors happen:
      • Media causes: bit flips, noise…
    • Data detection and decoding logic, ECC
  • Right HDD for right application (Source: Anderson & Whittington, Seagate, FAST’07)
  • Interfaces
    • Internal
      • SCSI: wide SCSI, Ultra wide SCSI…
      • IDE ATA: Parallel ATA (PATA): PATA 66/100/133
      • Serial ATA (SATA): SATA 150, SATA II 300
      • Serial Attached SCSI (SAS)
    • External
      • USB (1.0/2.0)
      • Firewire (400/800)
      • eSATA: external SATA
  • SATA/SAS compatibility (Source: Anderson & Whittington, Seagate, FAST’07)
  • Top HDD manufacturers
    • Western Digital
    • Seagate (also owns Maxtor)
    • Fujitsu
    • Samsung
    • Hitachi
    • IBM used to make hard disk drives but sold that division to Hitachi.
  • Top companies providing storage solutions
    • Adaptec
    • EMC
    • Qlogic
    • IBM
    • Hitachi
    • Brocade
    • Cisco
    • HP
    • Network Appliance
    • Emulex
  • Fastest growing storage companies (07/2008) (Source: storagesearch.com)
    • Coraid: 370% yearly growth, NAS
    • ExaGrid Systems: 337%, Disk to disk backup
    • RELDATA: 300%, iSCSI
    • Voltaire: 194%, InfiniBand
    • Transcend Information: 166%, SSD
    • OnStor: 157%, NAS
    • Bluepoint Data Storage: 140%, Online Backup and Storage
    • Compellent: 124%, NAS
    • Alacritech: 100%, iSCSI
    • Intransa: 100%, iSCSI
  • Solid State Drive
    • A data storage device that uses solid-state memory to store persistent data.
    • An SSD uses either Flash non-volatile memory (Flash SSD) or DRAM volatile memory (RAM SSD)
  • Advantages of SSD
    • Faster startup: no spin-up
    • Fast random access for read
    • Extremely low read/write latency: much smaller seek time
    • Quiet: no moving parts
    • High mechanical reliability: endure shock, vibration
    • Balanced performance across entire storage device
  • Disadvantages of SSD
    • Price: unit price of SSD is about 20x that of HDD
    • Limited write cycles (Flash SSD): 30-50 million
    • Slower write speed (Flash SSD): erase blocks
    • Lower storage density
    • Vulnerable to some effects: abrupt power loss (RAM SSD), magnetic fields and electric/static charges
  • Top SSD manufacturers (as of 1Q 2008, source: storagesearch.com)
    • 1. BitMICRO Networks: Flash SSD
    • 2. STEC: Flash SSD
    • 3. Mtron: Flash SSD
    • 4. Memoright: Flash SSD
    • 5. SanDisk: Flash SSD
    • 6. Samsung: Flash SSD
    • 7. Adtron: Flash SSD
    • 8. Texas Memory Systems: RAM/Flash SSD
    • 9. Toshiba: Flash SSD
    • 10. Violin Memory: RAM SSD
  • Networked Storage
    • Network Attached Storage (NAS)
    • Storage accessed over TCP/IP, using industry standard file sharing protocols like NFS, HTTP, Windows Networking
      • Provides File System Functionality
      • Takes LAN bandwidth from servers
    • Storage Area Network (SAN)
    • Storage accessed over a Fibre Channel switching fabric, using encapsulated SCSI.
      • Block level storage system
      • Fibre-Channel SAN
      • IP SAN
        • Implementing SAN over well-known TCP/IP
        • iSCSI: Cost-effective, SCSI and TCP/IP
  • Advantages
    • Consolidation
    • Centralized Data Management
    • Scalability
    • Fault Resiliency
  • NAS and SAN shortcomings
    • SAN Shortcomings
      • Data to desktop
      • Sharing between NT and UNIX
      • Lack of standards for file access and locking
    • NAS Shortcomings
      • Shared tape resources
      • Number of drives
      • Distance to tapes/disks
    • NAS: focuses on applications, users, and the files and data that they share
    • SAN: focuses on disks, tapes, and a scalable, reliable infrastructure to connect them
    • NAS plus SAN: the complete solution, from desktop to data center to storage device
  • Organizations
    • IEEE Computer Society Mass Storage Systems Technical Committee (MSSTC or TCMS): http://www.msstc.org
    • IETF (Internet Engineering Task Force): www.ietf.org
      • IPS, IMSS
    • International Committee for Information Technology Standards: www.incits.org
      • Storage: B11, T10, T11, T13
    • Storage mailing list: storage [email_address]
    • INSIC: Information Storage Industry Consortium: www.insic.org
    • SNIA: Storage Networking Industry Association: http://www.snia.org
      • Technical work groups
  • Conferences
    • FAST: Usenix: http://www.usenix.org/event/fast09/
    • SC: IEEE/ACM: http://sc08.supercomputing.org/
    • MSST: IEEE: http://storageconference.org/
      • SNAPI, SISW, CMPD, DAPS
    • NAS: IEEE: http://www.eece.maine.edu/iwnas/
    • SNW: SNIA: Storage Networking World
    • SNIA Storage Developer Conference: http://www.snia.org/events/storage-developer2008/
    • Other conferences with storage components:
      • IPDPS, ICDCS, ICPP, AReS, PDCS, Mascots, HPDC, IPCCC, ccGRID…
  • Awards in Storage Systems
    • IEEE Reynold B. Johnson Information Storage Systems Award
      • Sponsored by: IBM Almaden Research Center
    • 2008 Recipient:
      • 2008 - ALAN J. SMITH, Professor, University of California at Berkeley, Berkeley, CA, USA
    • 2007- Co-Recipients
      • DAVE HITZ Executive Vice President and Co-Founder Network Appliance Sunnyvale, CA
      • JAMES LAU Chief Strategy Officer, Executive Vice President and Co-Founder Network Appliance Sunnyvale, CA
    • 2006 Recipient
      • JAISHANKAR M. MENON Director of Storage Systems Architecture and Design IBM, San Jose, CA
      • "For pioneering work in the theory and application of RAID storage systems."
    • More: http://www.ieee.org/portal/pages/about/awards/sums/johnson.html
  • HPC storage challenge: SC
    • HPC systems consist of three major subsystems: processing, networking, and storage. In different applications, any one of these subsystems can limit the overall system performance. The HPC Storage Challenge is a competition showcasing effective approaches using the storage subsystem, which is often the limiting system, with actual applications.
    • Participants must describe their implementations and present measurements of performance, scalability, and storage subsystem utilization. Judging will be based on these measurements as well as innovation and effectiveness; maximum size and peak performance are not the sole criteria.
    • Finalists are chosen on the basis of submissions, which take the form of a proposal; submissions are encouraged to include reports of work in progress. Participants with access to either large or small HPC systems are encouraged to enter this challenge.
  • Research hotspots
    • Energy Efficient Storage: CMU, UIUC
    • Scalable Storage Meets Petaflops: IBM, UCSC
    • Highly Available and Reliable Storage Systems: UC Berkeley, CMU, TTU, UCSC
    • Object Storage (OSD): CMU, HUST, UCSC
    • Storage virtualization: CMU
    • Intelligent Storage Systems: UC Berkeley, CMU
  • Energy Efficient Storage
    • Energy aware:
      • disk level: SSD
      • I/O and file system level: efficient memory, cache, networked storage
      • Application level: data layout
    • Green storage initiative (GSI): SNIA
      • Energy efficient storage networking solutions
      • Storage administrator and infrastructure
  • Peta-scale scalable storage
    • Challenges in building storage that can keep pace with ever-increasing speeds, multi-core designs, and petaflop computing capabilities.
    • Latencies to disks are not keeping up with peta-scale computing.
    • Innovative approaches are needed.
  • My research in high performance and reliable storage systems
    • Active/Active Service for High Availability Computing
      • Active/Active Metadata Service
    • Networked Storage Systems
      • STICS
      • iRAID
    • A Unified Multiple-Level Cache for High Performance Storage Systems
      • iPVFS
      • iCache
    • Performance-Adaptive UDP for Bulk Data Transfer over Dedicated Network Links: PA-UDP
  • Improving disk performance.
    • Use large sectors to improve bandwidth
    • Use track caches and read ahead;
      • Read entire track into on-controller cache
      • Exploit locality (improves both latency and BW)
    • Design file systems to maximize locality
      • Allocate files sequentially on disks (exploit track cache)
      • Locate similar files in same cylinder (reduce seeks)
      • Locate similar files in nearby cylinders (reduce seek distance)
    • Pack bits closer together to improve transfer rate and density.
    • Use a collection of small disks to form a large, high performance one ---> disk array
    • Stripe data across multiple disks to allow parallel I/O, thus improving performance.
  • Use Arrays of Small Disks? [Figure: conventional designs span four disk form factors (14", 10", 5.25", 3.5") from low end to high end, while a disk array uses a single 3.5" disk design]
    • Katz and Patterson asked in 1987:
      • Can smaller disks be used to close gap in performance between disks and CPUs?
  • Replace a Small Number of Large Disks with a Large Number of Small Disks! (1988 disks)
    • IBM 3390K: 20 GBytes, 97 cu. ft., 3 KW, 15 MB/s, 600 I/Os/s, 250 KHrs MTTF, $250K
    • IBM 3.5" 0061: 320 MBytes, 0.1 cu. ft., 11 W, 1.5 MB/s, 55 I/Os/s, 50 KHrs MTTF, $2K
    • x70 array (70 small disks): 23 GBytes, 11 cu. ft., 1 KW, 110 MB/s, 3900 I/Os/s, ??? Hrs MTTF, $150K
    • Compared with the single large disk, the array offers roughly 9X less volume, 3X less power, 8X the data rate, and 6X the I/O rate
    • Disk arrays have potential for large data and I/O rates, high MB per cu. ft., and high MB per KW, but what about reliability?
  • Array Reliability
    • MTTF: Mean Time To Failure: average time that a non-repairable component will operate before experiencing failure.
    • Reliability of N disks = Reliability of 1 Disk ÷ N (see the quick calculation after this slide)
      • 50,000 Hours ÷ 70 disks ≈ 700 hours
    • Disk system MTTF: Drops from 6 years to 1 month!
    • Arrays without redundancy too unreliable to be useful!
    • Solution: redundancy.
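As a quick check of the numbers on this slide, here is a minimal sketch of the MTTF-scaling rule, assuming independent disk failures so that the array's MTTF is the single-disk MTTF divided by the number of disks.

```python
# Array MTTF under the slide's rule: MTTF(N disks) = MTTF(1 disk) / N,
# assuming independent failures. Numbers below are the slide's 70-disk example.

HOURS_PER_YEAR = 24 * 365

def array_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    return disk_mttf_hours / n_disks

if __name__ == "__main__":
    single = 50_000                                   # hours, ~5.7 years ("about 6 years")
    print(f"Single disk: {single / HOURS_PER_YEAR:.1f} years")
    print(f"70-disk array: {array_mttf(single, 70):.0f} hours (~1 month)")
```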
  • Redundant Arrays of (Inexpensive) Disks
    • Replicate data over several disks so that no data will be lost if one disk fails.
    • Redundancy yields high data availability
      • Availability : service still provided to user, even if some components failed
    • Disks will still fail
    • Contents reconstructed from data redundantly stored in the array
      • Capacity penalty to store redundant info
      • Bandwidth penalty to update redundant info
  • Levels of RAID
    • Original RAID paper described five categories (RAID levels 1-5). (Patterson et al, “A case for redundant arrays of inexpensive disks (RAID)”, ACM SIGMOD, 1988)
    • Disk striping with no redundancy is now called RAID 0 or JBOD (just a bunch of disks).
    • Other kinds have been proposed in the literature:
    • Level 6 (P+Q Redundancy), Level 10, RAID 53, etc.
    • Except for RAID 0, all RAID levels trade disk capacity for reliability, and the extra reliability makes parallelism a practical way to improve performance.
  • RAID 0: Nonredundant (JBOD)
    • High I/O performance.
      • Data is not saved redundantly.
      • A single copy of data is striped across multiple disks (see the address-mapping sketch after this slide).
    • Low cost.
      • Lack of redundancy.
    • Least reliable: single disk failure leads to data loss.
    [Figure: file data blocks 0–3 striped across Disk 0 through Disk 3]
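The following is a minimal sketch of the striping idea pictured above: logical blocks are spread round-robin across the disks. The round-robin mapping is a common convention assumed here, not a detail taken from the slide.

```python
# RAID 0 striping sketch: logical block i goes to disk (i mod N), at offset (i // N).

def raid0_map(logical_block: int, n_disks: int) -> tuple[int, int]:
    """Return (disk index, block offset on that disk) for a round-robin stripe."""
    return logical_block % n_disks, logical_block // n_disks

if __name__ == "__main__":
    for blk in range(8):
        disk, offset = raid0_map(blk, n_disks=4)
        print(f"logical block {blk} -> disk {disk}, offset {offset}")
```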
  • Redundant Arrays of Inexpensive Disks RAID 1: Disk Mirroring/Shadowing
    • Each disk is fully duplicated onto its “mirror”
    • Very high availability can be achieved
    • Bandwidth sacrifice on write:
    • Logical write = two physical writes
      • Reads may be optimized to minimize queueing and disk seek time
    • Most expensive solution: 100% capacity overhead
    [Figure: mirrored pairs form a recovery group] Targeted for high I/O rate, high availability environments (a small mirroring sketch follows)
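Below is a toy sketch of the mirroring behavior described above: a logical write becomes two physical writes, and a read can be steered to the less busy copy. The queue-length heuristic is an illustrative assumption, not a mechanism specified on the slide.

```python
# RAID 1 (mirroring) toy model: duplicate every write, steer reads to the
# mirror with the shorter request queue. Purely illustrative.

class MirroredPair:
    def __init__(self) -> None:
        self.queue_len = [0, 0]   # outstanding requests per physical disk

    def write(self, block: int) -> None:
        # Logical write = two physical writes, one per copy.
        for disk in (0, 1):
            self.queue_len[disk] += 1

    def read(self, block: int) -> int:
        # Serve the read from whichever copy currently has the shorter queue.
        disk = 0 if self.queue_len[0] <= self.queue_len[1] else 1
        self.queue_len[disk] += 1
        return disk

if __name__ == "__main__":
    pair = MirroredPair()
    pair.write(7)
    print("read of block 7 served by disk", pair.read(7))
```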
  • RAID 2: Memory-Style ECC [Figure: data disks plus multiple ECC disks and a parity disk]
    • Multiple disks record the ECC information to determine which disk is in fault
    • A parity disk is then used to reconstruct corrupted or lost data
    • Needs log2(number of disks) redundancy disks (see the short sketch after this slide)
    [Figure: data bits b0–b3 protected by ECC functions f0(b), f1(b) and parity P(b)]
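As a quick illustration of the log2 rule of thumb on this slide, the snippet below computes the number of ECC disks it implies for a few array sizes. This just evaluates the slide's rule; a full Hamming-code layout is not shown.

```python
# Number of ECC (check) disks implied by the slide's log2(number of disks) rule.

import math

def ecc_disks(data_disks: int) -> int:
    return max(1, math.ceil(math.log2(data_disks)))

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(f"{n} data disks -> about {ecc_disks(n)} ECC disks")
```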
  • RAID 3: Bit (Byte) Interleaved Parity
    • Only need one parity disk
    • Write/Read accesses all disks
    • Only one request can be serviced at a time
    • Easy to implement
    • Provides high bandwidth but not high I/O rates
    Targeted for high bandwidth applications: Multimedia, Image Processing
    [Figure: a logical record (e.g., 10010011 11001101 10010011 …) is bit-striped across physical records on the data disks, with a parity record P on the parity disk]
  • RAID 3
    • Sum computed across recovery group to protect against hard disk failures, stored in P disk
    • Logically, a single high capacity, high transfer rate disk: good for large transfers
    • Wider arrays reduce capacity costs, but decrease availability
    • 12.5% capacity cost for parity in this configuration
    Inspiration for RAID 4:
    • RAID 3 relies on the parity disk to discover errors on a read
    • But every sector has an error detection field
    • Rely on the error detection field to catch errors on read, not on the parity disk
    • This allows independent reads to different disks simultaneously
  • RAID 4: Block Interleaved Parity
    • Blocks: striping units
    • Allow for parallel access by multiple I/O requests, high I/O rates
    • Doing multiple small reads is now faster than before (small read requests can be served by a single disk).
    • Large writes (full stripe) compute the parity directly:
      • P’ = d0’ ⊕ d1’ ⊕ d2’ ⊕ d3’  (⊕ denotes XOR)
    • Small writes (e.g., a write to d0) update the parity from the old data and the old parity:
      • P = d0 ⊕ d1 ⊕ d2 ⊕ d3
      • P’ = d0’ ⊕ d1 ⊕ d2 ⊕ d3 = P ⊕ d0’ ⊕ d0
    • However, writes are still slow because the parity disk is the bottleneck.
    [Figure: blocks 0–15 striped across four data disks, with parity blocks P(0-3), P(4-7), P(8-11), P(12-15) on a dedicated parity disk]
  • Problems of Disk Arrays: Small Writes (read-modify-write procedure)
    • RAID 5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes
    • Steps: (1) read old data D0, (2) read old parity P, (3) write new data D0’, (4) write new parity P’ = D0 ⊕ D0’ ⊕ P (XOR of old data, new data, and old parity)
    (see the parity-update sketch below)
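Here is a minimal sketch of the read-modify-write parity update described above, using byte-wise XOR. The block contents are made-up example data.

```python
# RAID 4/5 small-write parity update: P' = P xor old_data xor new_data.
# Blocks are modeled as equal-length byte strings; contents are arbitrary examples.

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(blocks[0])
    for blk in blocks[1:]:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

if __name__ == "__main__":
    d0, d1, d2, d3 = (bytes([v] * 4) for v in (1, 2, 3, 4))
    p = xor_blocks(d0, d1, d2, d3)            # full-stripe parity
    d0_new = bytes([9] * 4)
    # Small write: 2 reads (old d0, old P) + 2 writes (new d0, new P').
    p_new = xor_blocks(p, d0, d0_new)
    assert p_new == xor_blocks(d0_new, d1, d2, d3)
    print("small-write parity update is consistent")
```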
  • Inspiration for RAID 5
    • RAID 4 works well for small reads
    • Small writes (write to one disk):
      • Option 1: read other data disks, create new sum and write to Parity Disk
      • Option 2: since P has old sum, compare old data to new data, add the difference to P
    • Small writes are limited by Parity Disk: Write to D0, D5 both also write to P disk. Parity disk must be updated for every write operation!
    [Figure: stripes D0–D3 and D4–D7 with their parity blocks P; every small write must also update the parity disk]
  • Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity
    • Independent writes are possible because of interleaved parity
    [Figure: data blocks D0–D23 and parity blocks P spread across 5 disk columns, with logical disk addresses increasing down the columns and the parity block rotating to a different disk on each stripe]
    • Example: a write to D0 and D5 uses disks 0, 1, 3, 4 (a small layout-mapping sketch follows)
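The sketch below maps a logical block number to a (stripe, data disk, parity disk) triple for one simple parity-rotation convention. Real RAID 5 implementations use several different rotation schemes; this particular one is an assumption chosen because it reproduces the slide's example (writes to D0 and D5 touch disks 0, 1, 3, 4).

```python
# One possible rotated-parity (RAID 5) layout for n_disks columns.
# The rotation convention here is an assumption, not "the" layout from the slide.

def raid5_map(logical_block: int, n_disks: int) -> tuple[int, int, int]:
    """Return (stripe, data disk, parity disk) for a simple rotation scheme."""
    data_per_stripe = n_disks - 1
    stripe = logical_block // data_per_stripe
    parity_disk = (n_disks - 1 - stripe) % n_disks        # parity moves left each stripe
    slot = logical_block % data_per_stripe
    data_disk = slot if slot < parity_disk else slot + 1  # skip over the parity column
    return stripe, data_disk, parity_disk

if __name__ == "__main__":
    for blk in (0, 5):   # D0 and D5 from the slide's example
        stripe, data_disk, parity_disk = raid5_map(blk, n_disks=5)
        print(f"D{blk}: data on disk {data_disk}, parity on disk {parity_disk}")
    # Together the two writes touch disks 0, 1, 3 and 4.
```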
  • Comparison of RAID Levels (N disks, each with capacity C)
    • RAID 0 (striping): capacity N x C; advantage: maximum data transfer rate and sizes; disadvantage: no redundancy
    • RAID 1 (mirroring): capacity (N/2) x C; advantage: high performance, fastest write; disadvantage: cost
    • RAID 3 (bit/byte-level parity): capacity (N-1) x C; advantage: easy to implement, high error recoverability; disadvantage: low performance
    • RAID 4 (block-level parity): capacity (N-1) x C; advantage: high redundancy and better performance; disadvantage: write-related bottleneck
    • RAID 5 (interleaved parity): capacity (N-1) x C; advantage: high performance and reliability; disadvantage: small write problem
    (the capacity formulas are encoded in the short sketch below)
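For convenience, the capacity column of the table above can be written down directly; the snippet below just encodes those formulas and is not a general-purpose RAID calculator.

```python
# Usable capacity for N disks of capacity C each, per the comparison table above.

def usable_capacity(level: int, n: int, c: float) -> float:
    formulas = {
        0: n * c,         # striping: no redundancy
        1: (n / 2) * c,   # mirroring: half the disks hold copies
        3: (n - 1) * c,   # bit/byte-level parity: one disk's worth of parity
        4: (n - 1) * c,   # block-level parity
        5: (n - 1) * c,   # interleaved parity
    }
    return formulas[level]

if __name__ == "__main__":
    for lvl in (0, 1, 3, 4, 5):
        print(f"RAID {lvl}: {usable_capacity(lvl, n=8, c=1.0):g} disks of usable capacity")
```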
  • RAID 6: Recovering from 2 failures
    • Why > 1 failure recovery?
      • operator accidentally replaces the wrong disk during a failure
      • since disk bandwidth is growing more slowly than disk capacity, the mean time to repair (MTTR) a disk in a RAID system is increasing, which raises the chance of a 2nd failure during the (now longer) repair
      • reading much more data during reconstruction means a greater chance of an uncorrectable media failure, which would result in data loss
  • RAID 6: Recovering from 2 failures
    • Network Appliance’s row-diagonal parity or RAID-DP
    • Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
    • Since it is protecting against a double failure, it adds two check blocks per stripe of data.
      • If p+1 disks total, p-1 disks have data; assume p=5
    • Row parity disk is just like in RAID 4
      • Even parity across the other 4 data blocks in its stripe
    • Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
  • Example p = 5
    • Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
      • Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are only missing 1 block
    • Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
    • Process continues until two failed disks are restored
    [Figure: p = 5 example with four data disks, a row parity disk, and a diagonal parity disk; each block is labeled with its diagonal number 0–4, and each diagonal misses exactly one disk (a toy enumeration of the diagonals follows)]
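The toy enumeration below illustrates the property this recovery argument depends on. It assumes diagonals are labeled (row + disk) mod p over the data disks plus the row-parity disk, which is one common way to present row-diagonal parity; the exact construction is in the RAID-DP paper. The check confirms that every diagonal misses exactly one disk, so after a double failure two diagonals are short only a single block and can be rebuilt first.

```python
# Diagonal labeling for row-diagonal parity, assuming diagonal = (row + disk) mod p
# over disks 0..p-1 (p-1 data disks plus the row parity disk), with p-1 rows per stripe.
# Verifies that each diagonal misses exactly one disk.

p = 5                          # p = 5, as in the slide's example
rows = range(p - 1)            # p - 1 rows per stripe group
disks = range(p)               # data disks 0..p-2 plus the row parity disk p-1

for diag in range(p):
    touched = {disk for row in rows for disk in disks if (row + disk) % p == diag}
    missed = sorted(set(disks) - touched)
    print(f"diagonal {diag} misses disk(s): {missed}")
```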
  • Summary: RAID Techniques: goal was performance, popularity due to reliability of storage
    • Disk Mirroring, Shadowing (RAID 1): each disk is fully duplicated onto its "shadow"; logical write = two physical writes; 100% capacity overhead
    • Parity Data Bandwidth Array (RAID 3): parity computed horizontally; logically a single high data bandwidth disk
    • High I/O Rate Parity Array (RAID 5): interleaved parity blocks; independent reads and writes; logical write = 2 reads + 2 writes