Transcript

  • 1. ECE7130: Advanced Computer Architecture: Storage Systems Dr. Xubin He http://iweb.tntech.edu/hexb Email: hexb@tntech.edu Tel: 931-3723462, Brown Hall 319
  • 2. Outline
    • Quick overview: storage systems
    • RAID
    • Advanced Dependability/Reliability/Availability
    • I/O Benchmarks, Performance and Dependability
  • 3. Storage Architectures
  • 4. Disk Figure of Merit: Areal Density
    • Bits recorded along a track
      • Metric is Bits Per Inch ( BPI )
    • Number of tracks per surface
      • Metric is Tracks Per Inch ( TPI )
    • Disk designs brag about bit density per unit area
      • Metric is Bits Per Square Inch: Areal Density = BPI x TPI (see the sketch below)
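A minimal sketch of the areal-density figure of merit above (the BPI and TPI values are illustrative placeholders, not vendor specifications):

```python
# Minimal sketch: areal density = BPI x TPI.
# The BPI/TPI values below are illustrative placeholders, not real drive specs.

def areal_density(bpi: float, tpi: float) -> float:
    """Return areal density in bits per square inch."""
    return bpi * tpi

if __name__ == "__main__":
    bpi = 1_500_000   # bits per inch along a track (assumed)
    tpi = 250_000     # tracks per inch across the surface (assumed)
    print(f"Areal density: {areal_density(bpi, tpi):.3e} bits/sq.in.")
```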
  • 5. Historical Perspective
    • 1956 IBM Ramac — early 1970s Winchester
      • Developed for mainframe computers, proprietary interfaces
      • Steady shrink in form factor: 27 in. to 14 in.
    • Form factor and capacity drive the market more than performance
    • 1970s developments
      • 5.25 inch floppy disk form factor (microcode into mainframe)
      • Emergence of industry standard disk interfaces
    • Early 1980s: PCs and first generation workstations
    • Mid 1980s: Client/server computing
      • Centralized storage on file server
        • accelerates disk downsizing: 8 inch to 5.25
      • Mass market disk drives become a reality
        • industry standards: SCSI, IPI, IDE
        • 5.25 inch to 3.5 inch drives for PCs, End of proprietary interfaces
    • 1990s: Laptops => 2.5 inch drives
    • 2000s: What new devices leading to new drives?
  • 6. Storage Media
    • Magnetic:
      • Hard drive
    • Optical:
      • CD
    • Solid State Semiconductor:
      • Flash SSD
      • RAM SSD
    • MEMS-based
      • Significantly cheaper than DRAM but much faster than traditional HDD, high density
    • Tape: sequentially accessed
  • 7. Driving forces. Source: Anderson & Whittington, Seagate, FAST’07
  • 8. HDD Market. Source: Anderson & Whittington, Seagate, FAST’07
  • 9. HDD Technology Trend. Source: Anderson & Whittington, Seagate, FAST’07
  • 10. Errors
    • Errors happen:
      • Media causes: bit flips, noise…
    • Data detection and decoding logic, ECC
  • 11. Right HDD for the right application. Source: Anderson & Whittington, Seagate, FAST’07
  • 12. Interfaces
    • Internal
      • SCSI: wide SCSI, Ultra wide SCSI…
      • IDE ATA: Parallel ATA (PATA): PATA 66/100/133
      • Serial ATA (SATA): SATA 150, SATA II 300
      • Serial Attached SCSI (SAS)
    • External
      • USB (1.0/2.0)
      • Firewire (400/800)
      • eSATA: external SATA
  • 13. SATA/SAS compatibility. Source: Anderson & Whittington, Seagate, FAST’07
  • 14. Top HDD manufacturers
    • Western Digital
    • Seagate (also owns Maxtor)
    • Fujitsu
    • Samsung
    • Hitachi
    • IBM used to make hard disk drives but sold that division to Hitachi.
  • 15. Top companies providing storage solutions
    • Adaptec
    • EMC
    • Qlogic
    • IBM
    • Hitachi
    • Brocade
    • Cisco
    • HP
    • Network Appliance
    • Emulex
  • 16. Fastest growing storage companies (07/2008). Source: storagesearch.com
    Company                  Yearly growth   Main product
    Coraid                   370%            NAS
    ExaGrid Systems          337%            Disk to disk backup
    RELDATA                  300%            iSCSI
    Voltaire                 194%            InfiniBand
    Transcend Information    166%            SSD
    OnStor                   157%            NAS
    Bluepoint Data Storage   140%            Online backup and storage
    Compellent               124%            NAS
    Alacritech               100%            iSCSI
    Intransa                 100%            iSCSI
  • 17. Solid State Drive
    • A data storage device that uses solid-state memory to store persistent data.
    • An SSD uses either non-volatile Flash memory (Flash SSD) or volatile DRAM (RAM SSD)
  • 18. Advantages of SSD
    • Faster startup: no spin-up
    • Fast random access for read
    • Extremely low read/write latency: much smaller seek time
    • Quiet: no moving parts
    • High mechanical reliability: endure shock, vibration
    • Balanced performance across entire storage device
  • 19. Disadvantages of SSD
    • Price: the unit price of an SSD is roughly 20x that of an HDD
    • Limited write cycles (Flash SSD): 30-50 million
    • Slower write speed (Flash SSD): blocks must be erased before rewriting
    • Lower storage density
    • Vulnerable to some effects: abrupt power loss (RAM SSD), magnetic fields and electric/static charges
  • 20. Top SSD manufacturers (as of 1Q 2008, source: storagesearch.com)
    Rank   Manufacturer           SSD Technology
    1      BitMICRO Networks      Flash SSD
    2      STEC                   Flash SSD
    3      Mtron                  Flash SSD
    4      Memoright              Flash SSD
    5      SanDisk                Flash SSD
    6      Samsung                Flash SSD
    7      Adtron                 Flash SSD
    8      Texas Memory Systems   RAM/Flash SSD
    9      Toshiba                Flash SSD
    10     Violin Memory          RAM SSD
  • 21. Networked Storage
    • Network Attached Storage (NAS)
    • Storage accessed over TCP/IP, using industry-standard file sharing protocols such as NFS, HTTP, and Windows networking
      • Provides file system functionality
      • Consumes LAN bandwidth of the servers
    • Storage Area Network (SAN)
    • Storage accessed over a Fibre Channel switching fabric, using encapsulated SCSI.
      • Block level storage system
      • Fibre-Channel SAN
      • IP SAN
        • Implementing SAN over well-known TCP/IP
        • iSCSI: Cost-effective, SCSI and TCP/IP
  • 22. Advantages
    • Consolidation
    • Centralized Data Management
    • Scalability
    • Fault Resiliency
  • 23. NAS and SAN shortcomings
    • SAN shortcomings:
      • Data to desktop
      • Sharing between NT and UNIX
      • Lack of standards for file access and locking
    • NAS shortcomings:
      • Shared tape resources
      • Number of drives
      • Distance to tapes/disks
    • NAS: focuses on applications, users, and the files and data that they share
    • SAN: focuses on disks, tapes, and a scalable, reliable infrastructure to connect them
    • NAS plus SAN: the complete solution, from desktop to data center to storage device
  • 24. Organizations
    • IEEE Computer Society Mass Storage Systems Technical Committee (MSSTC or TCMS): http://www.msstc.org
    • IETF: www.ietf.org
      • IPS, IMSS
    • International Committee for Information Technology Standards: www.incits.org
      • Storage: B11, T10, T11, T13
    • Storage mailing list: storage [email_address]
    • INSIC: Information Storage Industry Consortium: www.insic.org
    • SNIA: Storage Networking Industry Association: http://www.snia.org
      • Technical work groups
  • 25. Conferences
    • FAST: Usenix: http://www.usenix.org/event/fast09/
    • SC: IEEE/ACM: http://sc08.supercomputing.org/
    • MSST: IEEE: http://storageconference.org/
      • SNAPI, SISW, CMPD, DAPS
    • NAS: IEEE: http://www.eece.maine.edu/iwnas/
    • SNW: SNIA: Storage Networking World
    • SNIA Storage Developer Conference: http://www.snia.org/events/storage-developer2008/
    • Other conferences with storage components:
      • IPDPS, ICDCS, ICPP, AReS, PDCS, Mascots, HPDC, IPCCC, ccGRID…
  • 26. Awards in Storage Systems
    • IEEE Reynold B. Johnson Information Storage Systems Award
      • Sponsored by: IBM Almaden Research Center
    • 2008 Recipient:
      • ALAN J. SMITH, Professor, University of California at Berkeley, Berkeley, CA, USA
    • 2007- Co-Recipients
      • DAVE HITZ Executive Vice President and Co-Founder Network Appliance Sunnyvale, CA
      • JAMES LAU Chief Strategy Officer, Executive Vice President and Co-Founder Network Appliance Sunnyvale, CA
    • 2006 Recipient
      • JAISHANKAR M. MENON Director of Storage Systems Architecture and Design IBM, San Jose, CA
      • "For pioneering work in the theory and application of RAID storage systems."
    • More: http://www.ieee.org/portal/pages/about/awards/sums/johnson.html
  • 27. HPC storage challenge: SC
    • HPC systems are composed of three major subsystems: processing, networking, and storage. In different applications, any one of these subsystems can limit overall system performance. The HPC Storage Challenge is a competition showcasing effective approaches that use the storage subsystem, which is often the limiting subsystem, with actual applications.
    • Participants must describe their implementations and present measurements of performance, scalability, and storage subsystem utilization. Judging will be based on these measurements as well as innovation and effectiveness; maximum size and peak performance are not the sole criteria.
    • Finalists will be chosen on the basis of submissions which are in the form of a proposal; submissions are encouraged to include reports of work in progress. Participants with access to either large or small HPC systems are encouraged to enter this challenge.
  • 28. Research hotspots
    • Energy Efficient Storage: CMU, UIUC
    • Scalable Storage Meets Petaflops: IBM, UCSC
    • High availability and Reliable Storage Systems: UC Berkeley, CMU, TTU, UCSC
    • Object Storage (OSD): CMU, HUST, UCSC
    • Storage virtualization: CMU
    • Intelligent Storage Systems: UC Berkeley, CMU
  • 29. Energy Efficient Storage
    • Energy aware:
      • disk level: SSD
      • I/O and file system level: efficient memory, cache, networked storage
      • Application level: data layout
    • Green storage initiative (GSI): SNIA
      • Energy efficient storage networking solutions
      • Storage administrator and infrastructure
  • 30. Peta-scale scalable storage
    • The challenge is building storage that can keep pace with ever-increasing processor speeds, multi-core designs, and petaflop computing capabilities.
    • Latencies to disks are not keeping up with peta-scale computing.
    • Innovative approaches are needed.
  • 31. My research in high performance and reliable storage systems
    • Active/Active Service for High Availability Computing
      • Active/Active Metadata Service
    • Networked Storage Systems
      • STICS
      • iRAID
    • A Unified Multiple-Level Cache for High Performance Storage Systems
      • iPVFS
      • iCache
    • Performance-Adaptive UDP for Bulk Data Transfer over Dedicated Network Links: PA-UDP
  • 32. Improving disk performance.
    • Use large sectors to improve bandwidth
    • Use track caches and read ahead;
      • Read entire track into on-controller cache
      • Exploit locality (improves both latency and BW)
    • Design file systems to maximize locality
      • Allocate files sequentially on disks (exploit track cache)
      • Locate similar files in same cylinder (reduce seeks)
      • Locate similar files in nearby cylinders (reduce seek distance)
    • Pack bits closer together to improve transfer rate and density.
    • Use a collection of small disks to form a large, high-performance one: a disk array.
    • Striping data across multiple disks allows parallel I/O, thus improving performance.
  • 33. Use Arrays of Small Disks? (figure: conventional designs span four disk form factors, 14", 10", 5.25", and 3.5", from high end to low end; a disk array needs only one 3.5" disk design)
    • Katz and Patterson asked in 1987:
      • Can smaller disks be used to close gap in performance between disks and CPUs?
  • 34. Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)
                  IBM 3390K     IBM 3.5" 0061    x70 disk array
    Capacity      20 GBytes     320 MBytes       23 GBytes
    Volume        97 cu. ft.    0.1 cu. ft.      11 cu. ft.     (9X vs. 3390K)
    Power         3 KW          11 W             1 KW           (3X vs. 3390K)
    Data Rate     15 MB/s       1.5 MB/s         110 MB/s       (8X vs. 3390K)
    I/O Rate      600 I/Os/s    55 I/Os/s        3900 I/Os/s    (6X vs. 3390K)
    MTTF          250 KHrs      50 KHrs          ??? Hrs
    Cost          $250K         $2K              $150K
    Disk arrays have the potential for large data and I/O rates, high MB per cu. ft., and high MB per KW, but what about reliability?
  • 35. Array Reliability
    • MTTF: Mean Time To Failure: average time that a non-repairable component will operate before experiencing failure.
    • MTTF of an array of N disks = MTTF of 1 disk ÷ N
      • 50,000 hours ÷ 70 disks ≈ 700 hours (worked out in the sketch below)
    • Disk system MTTF: Drops from 6 years to 1 month!
    • Arrays without redundancy too unreliable to be useful!
    • Solution: redundancy.
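A minimal sketch of the array-MTTF arithmetic above (it assumes independent, identically distributed disk failures; the 50,000-hour MTTF and 70-disk count are the slide's example numbers):

```python
# Sketch of the array-MTTF arithmetic from the slide.
# Assumes independent, identically distributed disk failures.

def array_mttf(single_disk_mttf_hours: float, num_disks: int) -> float:
    """MTTF of an unprotected array: any single disk failure loses data."""
    return single_disk_mttf_hours / num_disks

if __name__ == "__main__":
    mttf = array_mttf(50_000, 70)
    print(f"Array MTTF: {mttf:.0f} hours (~{mttf / 24 / 30:.1f} months)")
    # ~714 hours, i.e. roughly one month -- vs ~6 years for a single disk.
```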
  • 36. Redundant Arrays of (Inexpensive) Disks
    • Replicate data over several disks so that no data will be lost if one disk fails.
    • Redundancy yields high data availability
      • Availability: service still provided to user, even if some components failed
    • Disks will still fail
    • Contents reconstructed from data redundantly stored in the array
      • ⇒ Capacity penalty to store redundant info
      • ⇒ Bandwidth penalty to update redundant info
  • 37. (figure only; no text on this slide)
  • 38. Levels of RAID
    • Original RAID paper described five categories (RAID levels 1-5). (Patterson et al, “A case for redundant arrays of inexpensive disks (RAID)”, ACM SIGMOD, 1988)
    • Disk striping with no redundancy is now called RAID 0 or JBOD (just a bunch of disks).
    • Other kinds have been proposed in the literature: Level 6 (P+Q redundancy), Level 10, RAID 53, etc.
    • Except for RAID 0, all RAID levels trade disk capacity for reliability, and the extra reliability makes parallelism a practical way to improve performance.
  • 39. RAID 0: Nonredundant (JBOD)
    • High I/O performance.
      • Data is not saved redundantly.
      • Single copy of data is striped across multiple disks.
    • Low cost.
      • Lack of redundancy.
    • Least reliable: single disk failure leads to data loss.
    (figure: file data blocks 0-3 striped round-robin across Disks 0-3; see the mapping sketch below)
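A minimal sketch of the round-robin block-to-disk mapping implied by the striping figure above (the 4-disk count matches the figure; everything else is illustrative):

```python
# Sketch of RAID 0 block-to-disk mapping for a 4-disk array.
# Round-robin striping: consecutive logical blocks land on consecutive disks.

NUM_DISKS = 4  # matches the figure; any N works

def raid0_map(logical_block: int, num_disks: int = NUM_DISKS) -> tuple[int, int]:
    """Return (disk index, block offset within that disk)."""
    return logical_block % num_disks, logical_block // num_disks

if __name__ == "__main__":
    for lb in range(8):
        disk, offset = raid0_map(lb)
        print(f"logical block {lb} -> disk {disk}, offset {offset}")
```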
  • 40. Redundant Arrays of Inexpensive Disks RAID 1: Disk Mirroring/Shadowing
    • Each disk is fully duplicated onto its “mirror”
    • Very high availability can be achieved
    • Bandwidth sacrifice on write:
      • Logical write = two physical writes
    • Reads may be optimized: serve each read from whichever copy has the shorter queue and seek time
    • Most expensive solution: 100% capacity overhead
    (figure: recovery group) Targeted for high I/O rate, high availability environments; see the sketch below
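A minimal sketch of the mirroring semantics described above, with in-memory lists standing in for the two physical disks (real RAID 1 lives in a controller or the block layer; the queue-depth read policy is just one possible optimization):

```python
# Sketch of RAID 1 semantics: every logical write goes to both copies,
# and a read can be served by either copy (here: the one with the shorter
# queue, modelled as a simple pending-request counter).

class Raid1:
    def __init__(self, num_blocks: int):
        self.copies = [[None] * num_blocks, [None] * num_blocks]
        self.pending = [0, 0]  # toy stand-in for per-disk queue depth

    def write(self, block: int, data: bytes) -> None:
        # 1 logical write = 2 physical writes
        for copy in self.copies:
            copy[block] = data

    def read(self, block: int) -> bytes:
        # Pick the less-loaded mirror for the read
        disk = 0 if self.pending[0] <= self.pending[1] else 1
        return self.copies[disk][block]

if __name__ == "__main__":
    r = Raid1(num_blocks=8)
    r.write(3, b"hello")
    assert r.read(3) == b"hello"
```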
  • 41. RAID 2: Memory-Style ECC (figure: data disks plus multiple ECC disks and a parity disk)
    • Multiple disks record the ECC information needed to determine which disk is at fault
    • A parity disk is then used to reconstruct corrupted or lost data
    • Needs log2(number of disks) redundancy disks
    (figure: data bits b0-b3 protected by check functions f0(b), f1(b) and parity P(b))
  • 42. RAID 3: Bit (Byte) Interleaved Parity
    • Only need one parity disk
    • Write/Read accesses all disks
    • Only one request can be serviced at a time
    • Easy to implement
    • Provides high bandwidth but not high I/O rates
    Targeted for high bandwidth applications: multimedia, image processing. (figure: a logical record, e.g. 10010011 11001101 10010011 ..., is striped bit/byte-wise across the data disks as physical records, with a parity record stored on disk P)
  • 43. RAID 3
    • Sum computed across recovery group to protect against hard disk failures, stored in P disk
    • Logically, a single high capacity, high transfer rate disk: good for large transfers
    • Wider arrays reduce capacity costs but decrease availability
    • 12.5% capacity cost for parity in this configuration
    • RAID 3 relies on parity disk to discover errors on Read
    • But every sector has an error detection field
    • Rely on error detection field to catch errors on read, not on the parity disk
    • Allows independent reads to different disks simultaneously
    Inspiration for RAID 4
  • 44. RAID 4: Block Interleaved Parity
    • Blocks: striping units
    • Allow for parallel access by multiple I/O requests, high I/O rates
    • Doing multiple small reads is now faster than before (small read requests can be serviced by a single disk).
    • Large writes (full stripe) recompute the parity directly:
      • P’ = d0’ ⊕ d1’ ⊕ d2’ ⊕ d3’
    • Small writes (e.g., a write to d0) update the parity incrementally:
      • P = d0 ⊕ d1 ⊕ d2 ⊕ d3
      • P’ = d0’ ⊕ d1 ⊕ d2 ⊕ d3 = P ⊕ d0 ⊕ d0’   (⊕ denotes XOR)
    • However, writes are still slow since the parity disk is the bottleneck.
    (figure: blocks 0-15 striped across four data disks, with parity blocks P(0-3), P(4-7), P(8-11), P(12-15) on a dedicated parity disk; the parity arithmetic is sketched below)
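A minimal sketch of the two parity-update paths from the slide, using XOR over byte strings (the four-data-disk stripe matches the example; the data values are arbitrary placeholders):

```python
# Sketch of RAID 4/5 parity updates with XOR (the "+" in the slide's formulas).

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe with four data blocks (one per data disk), 4 bytes each for brevity.
d = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04",
     b"\xaa\xbb\xcc\xdd", b"\x00\xff\x00\xff"]

# Full-stripe (large) write: parity is recomputed from all new data blocks.
P = d[0]
for blk in d[1:]:
    P = xor(P, blk)

# Small write to d0: new parity = old parity XOR old d0 XOR new d0
# (2 physical reads + 2 physical writes on a real array).
new_d0 = b"\xde\xad\xbe\xef"
P_new = xor(xor(P, d[0]), new_d0)
d[0] = new_d0

# Check: the incremental update matches recomputing parity from scratch.
P_check = d[0]
for blk in d[1:]:
    P_check = xor(P_check, blk)
assert P_new == P_check
```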
  • 45. Problems of Disk Arrays: Small Writes (read-modify-write procedure). (figure: to write new data D0’ into a stripe D0-D3 with parity P, the array (1) reads old data D0, (2) reads old parity P, XORs them with the new data to form new parity P’, then (3) writes D0’ and (4) writes P’.) RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes.
  • 46. Inspiration for RAID 5
    • RAID 4 works well for small reads
    • Small writes (write to one disk):
      • Option 1: read other data disks, create new sum and write to Parity Disk
      • Option 2: since P has old sum, compare old data to new data, add the difference to P
    • Small writes are limited by the parity disk: writes to D0 and D5 must each also update the P disk. The parity disk must be updated on every write operation!
    (figure: data blocks D0-D7 and their parity blocks P laid out across five disks)
  • 47. Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity. Independent writes are possible because parity is interleaved (rotated) across the disks. (figure: data blocks D0-D23 and parity blocks P distributed across five disk columns, with logical disk addresses increasing down the columns; the layout is reproduced by the sketch below.) Example: a write to D0 and D5 uses disks 0, 1, 3, and 4.
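A minimal sketch that prints one rotated-parity placement consistent with the layout above (the specific rotation convention is an assumption; real controllers may rotate parity differently):

```python
# Sketch of a RAID 5 rotated-parity layout: for each stripe, one disk holds
# parity and the data blocks occupy the remaining disks. The rotation below is
# one common convention (parity walks backwards across the disks per stripe).

NUM_DISKS = 5

def parity_disk(stripe: int, num_disks: int = NUM_DISKS) -> int:
    return (num_disks - 1 - stripe) % num_disks

def layout(stripes: int, num_disks: int = NUM_DISKS) -> None:
    block = 0
    for s in range(stripes):
        p = parity_disk(s, num_disks)
        row = []
        for disk in range(num_disks):
            if disk == p:
                row.append(" P ")
            else:
                row.append(f"D{block:<2d}")
                block += 1
        print(" ".join(row))

if __name__ == "__main__":
    layout(5)   # prints D0 D1 D2 D3 P / D4 D5 D6 P D7 / ... as in the figure
```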
  • 48. Comparison of RAID Levels (N disks, each with capacity C)
    RAID level   Technique               Capacity    Advantage                                       Disadvantage
    0            Striping                N x C       Maximum data transfer rate and sizes            No redundancy
    1            Mirroring               (N/2) x C   High performance, fastest write                 Cost
    3            Bit/byte-level parity   (N-1) x C   Easy to implement, high error recoverability    Low performance
    4            Block-level parity      (N-1) x C   High redundancy and better performance          Write-related bottleneck
    5            Interleaved parity      (N-1) x C   High performance, reliability                   Small write problem
  • 49. RAID 6: Recovering from 2 failures
    • Why > 1 failure recovery?
      • operator accidentally replaces the wrong disk during a failure
      • since disk bandwidth is growing more slowly than disk capacity, the mean time to repair (MTTR) a disk in a RAID system is increasing, which raises the chance of a second failure during the now-longer repair window (a rough estimate is sketched below)
      • reading much more data during reconstruction also increases the chance of hitting an uncorrectable media error, which would result in data loss
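A rough back-of-the-envelope sketch of the second-failure risk discussed above, under a simplified independent-failure model (the group size, per-disk MTTF, and rebuild times are assumptions for illustration, not measurements):

```python
# Rough sketch: chance that a second disk fails while the first is being rebuilt.
# Simplified model with independent failures; numbers are assumed for illustration.

def p_second_failure(num_disks: int, mttf_hours: float, rebuild_hours: float) -> float:
    """Approximate probability that one of the remaining disks fails during rebuild."""
    return (num_disks - 1) * rebuild_hours / mttf_hours

if __name__ == "__main__":
    # e.g. an 8-disk group with a 1,000,000-hour MTTF per disk (assumed)
    for rebuild in (6, 24, 72):  # hours; larger disks => longer rebuilds
        p = p_second_failure(8, 1_000_000, rebuild)
        print(f"rebuild {rebuild:3d} h -> P(second failure) ~ {p:.2e}")
```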
  • 50. RAID 6: Recovering from 2 failures
    • Network Appliance’s row-diagonal parity or RAID-DP
    • Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
    • Since it is protecting against a double failure, it adds two check blocks per stripe of data.
      • If p+1 disks total, p-1 disks have data; assume p=5
    • Row parity disk is just like in RAID 4
      • Even parity across the other 4 data blocks in its stripe
    • Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
  • 51. Example p = 5
    • Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
      • Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are only missing 1 block
    • Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
    • Process continues until two failed disks are restored
    (figure: for p = 5, each block on data disks 0-3, the row parity disk, and the diagonal parity disk is labeled with its diagonal number 0-4, showing how the diagonals wrap around the array; the construction is sketched below)
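A minimal sketch of the row and diagonal parity construction for the p = 5 case described above (toy integer values stand in for blocks; the recovery procedure itself is omitted for brevity):

```python
# Sketch of row-diagonal parity construction for p = 5:
# p - 1 = 4 data disks, one row-parity disk, one diagonal-parity disk,
# each contributing p - 1 = 4 blocks per stripe. Toy integers are XORed
# together; real arrays operate on whole blocks.

p = 5
data = [[(d * 10 + r) for r in range(p - 1)] for d in range(p - 1)]  # data[disk][row]

# Row parity: XOR across the data disks in each row (just like RAID 4).
row_parity = [0] * (p - 1)
for r in range(p - 1):
    for d in range(p - 1):
        row_parity[r] ^= data[d][r]

# Diagonal parity: the diagonal number of block (row r, column c) is (r + c) mod p,
# taken over the data disks and the row-parity disk. Diagonals 0 .. p-2 are stored;
# diagonal p-1 is the unstored "missing" diagonal.
diag_parity = [0] * (p - 1)
columns = data + [row_parity]            # p columns in total
for c, col in enumerate(columns):
    for r in range(p - 1):
        diag = (r + c) % p
        if diag != p - 1:
            diag_parity[diag] ^= col[r]

print("row parity:     ", row_parity)
print("diagonal parity:", diag_parity)
```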
  • 52. Summary: RAID Techniques: Goal was performance, popularity due to reliability of storage • Disk Mirroring, Shadowing (RAID 1) Each disk is fully duplicated onto its "shadow" Logical write = two physical writes 100% capacity overhead • Parity Data Bandwidth Array (RAID 3) Parity computed horizontally Logically a single high data bw disk • High I/O Rate Parity Array (RAID 5) Interleaved parity blocks Independent reads and writes Logical write = 2 reads + 2 writes 1 0 0 1 0 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 0 0 1 1