    • 1. ECE7130: Advanced Computer Architecture: Storage Systems Dr. Xubin He http://iweb.tntech.edu/hexb Email: hexb@tntech.edu Tel: 931-3723462, Brown Hall 319
    • 2. Outline
      • Quick overview: storage systems
      • RAID
      • Advanced Dependability/Reliability/Availability
      • I/O Benchmarks, Performance and Dependability
    • 3. Storage Architectures
    • 4. Disk Figure of Merit: Areal Density
      • Bits recorded along a track
        • Metric is Bits Per Inch ( BPI )
      • Number of tracks per surface
        • Metric is Tracks Per Inch ( TPI )
      • Disk designs brag about bit density per unit area
        • Metric is Bits Per Square Inch : Areal Density = BPI x TPI
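      As a quick worked example (a minimal sketch; the BPI and TPI values below are illustrative assumptions, not figures from the slides), areal density is simply the product of linear density and track density:

        # Areal density = (bits per inch along a track) x (tracks per inch across the surface)
        def areal_density(bpi, tpi):
            """Return areal density in bits per square inch."""
            return bpi * tpi

        if __name__ == "__main__":
            bpi = 1_500_000   # assumed linear density: 1.5 Mbit per inch of track
            tpi = 200_000     # assumed track density: 200 Ktracks per inch
            bits = areal_density(bpi, tpi)
            print(f"Areal density: {bits:.2e} bits/in^2 ({bits / 1e9:.0f} Gbit/in^2)")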
    • 5. Historical Perspective
      • 1956 IBM RAMAC to early 1970s Winchester
        • Developed for mainframe computers, proprietary interfaces
        • Steady shrink in form factor: 27 in. to 14 in.
      • Form factor and capacity drives market more than performance
      • 1970s developments
        • 5.25 inch floppy disk form factor (microcode into mainframe)
        • Emergence of industry standard disk interfaces
      • Early 1980s: PCs and first generation workstations
      • Mid 1980s: Client/server computing
        • Centralized storage on file server
          • accelerates disk downsizing: 8 inch to 5.25
        • Mass market disk drives become a reality
          • industry standards: SCSI, IPI, IDE
          • 5.25 inch to 3.5 inch drives for PCs, End of proprietary interfaces
      • 1990s: Laptops => 2.5 inch drives
      • 2000s: What new devices leading to new drives?
    • 6. Storage Media
      • Magnetic:
        • Hard drive
      • Optical:
        • CD
      • Solid State Semiconductor:
        • Flash SSD
        • RAM SSD
      • MEMS-based
        • Significantly cheaper than DRAM but much faster than traditional HDD, high density
      • Tape: sequentially accessed
    • 7. Driving forces Source: Anderson & Whittington, Seagate, FAST’07
    • 8. HDD Market Source: Anderson & Whittington, Seagate, FAST’07
    • 9. HDD Technology Trend Source: Anderson & Whittington, Seagate, FAST’07
    • 10. Errors
      • Errors happen:
        • Media causes: bit flips, noise, …
      • Data detection and decoding logic, ECC
    • 11. Right HDD for right application Source: Anderson & Whittington, Seagate, FAST’07
    • 12. Interfaces
      • Internal
        • SCSI: wide SCSI, Ultra wide SCSI…
        • IDE ATA: Parallel ATA (PATA): PATA 66/100/133
        • Serial ATA (SATA): SATA 150, SATA II 300
        • Serial Attached SCSI (SAS)
      • External
        • USB (1.0/2.0)
        • Firewire (400/800)
        • eSATA: external SATA
    • 13. SATA/SAS compatibility Source: Anderson & Whittington, Seagate, FAST’07
    • 14. Top HDD manufacturers
      • Western Digital
      • Seagate (also owns Maxtor)
      • Fujitsu
      • Samsung
      • Hitachi
      • IBM used to make hard disk drives but sold that division to Hitachi.
    • 15. Top companies providing storage solutions
      • Adaptec
      • EMC
      • Qlogic
      • IBM
      • Hitachi
      • Brocade
      • Cisco
      • HP
      • Network Appliance
      • Emulex
    • 16. Fastest growing storage companies (07/2008, source: storagesearch.com)
      • Coraid: 370% yearly growth, NAS
      • ExaGrid Systems: 337% yearly growth, disk-to-disk backup
      • RELDATA: 300% yearly growth, iSCSI
      • Voltaire: 194% yearly growth, InfiniBand
      • Transcend Information: 166% yearly growth, SSD
      • OnStor: 157% yearly growth, NAS
      • Bluepoint Data Storage: 140% yearly growth, online backup and storage
      • Compellent: 124% yearly growth, NAS
      • Alacritech: 100% yearly growth, iSCSI
      • Intransa: 100% yearly growth, iSCSI
    • 17. Solid State Drive
      • A data storage device that uses solid-state memory to store persistent data.
      • An SSD uses either Flash non-volatile memory (Flash SSD) or DRAM volatile memory (RAM SSD)
    • 18. Advantages of SSD
      • Faster startup: no spin-up
      • Fast random access for read
      • Extremely low read/write latency: much smaller seek time
      • Quiet: no moving parts
      • High mechanical reliability: endure shock, vibration
      • Balanced performance across entire storage device
    • 19. Disadvantages of SSD
      • Price: unit price of an SSD is roughly 20x that of an HDD
      • Limited write cycles (Flash SSD): 30-50 million
      • Slower write speed (Flash SSD): blocks must be erased before they are rewritten
      • Lower storage density
      • Vulnerable to some effects: abrupt power loss (RAM SSD), magnetic fields and electric/static charges
    • 20. Top SSD manufacturers (as of 1Q 2008, source: storagesearch.com)
      • 1. BitMICRO Networks: Flash SSD
      • 2. STEC: Flash SSD
      • 3. Mtron: Flash SSD
      • 4. Memoright: Flash SSD
      • 5. SanDisk: Flash SSD
      • 6. Samsung: Flash SSD
      • 7. Adtron: Flash SSD
      • 8. Texas Memory Systems: RAM/Flash SSD
      • 9. Toshiba: Flash SSD
      • 10. Violin Memory: RAM SSD
    • 21. Networked Storage
      • Network Attached Storage(NAS)
      • Storage accessed over TCP/IP, using industry standard file sharing protocols like NFS, HTTP, Windows Networking
        • Provide File System Functionality
        • Consumes LAN bandwidth of servers
      • Storage Area Network(SAN)
      • Storage accessed over a Fibre Channel switching fabric, using encapsulated SCSI.
        • Block level storage system
        • Fibre-Channel SAN
        • IP SAN
          • Implementing SAN over well-known TCP/IP
          • iSCSI: Cost-effective, SCSI and TCP/IP
    • 22. Advantages
      • Consolidation
      • Centralized Data Management
      • Scalability
      • Fault Resiliency
    • 23. NAS and SAN shortcomings
      • SAN shortcomings:
        • Data to desktop
        • Sharing between NT and UNIX
        • Lack of standards for file access and locking
      • NAS shortcomings:
        • Shared tape resources
        • Number of drives
        • Distance to tapes/disks
      • NAS focuses on applications, users, and the files and data that they share
      • SAN focuses on disks, tapes, and a scalable, reliable infrastructure to connect them
      • NAS plus SAN: the complete solution, from desktop to data center to storage device
    • 24. Organizations
      • IEEE Computer Society Mass Storage Systems Technical Committee (MSSTC or TCMS): http://www.msstc.org
      • IETF (Internet Engineering Task Force): www.ietf.org
        • IPS, IMSS
      • International Committee for Information Technology Standards: www.incits.org
        • Storage: B11, T10, T11, T13
      • Storage mailing list: storage [email_address]
      • INSIC: Information Storage Industry Consortium: www.insic.org
      • SNIA: Storage Networking Industry Association: http://www.snia.org
        • Technical work groups
    • 25. Conferences
      • FAST: Usenix: http://www.usenix.org/event/fast09/
      • SC: IEEE/ACM: http://sc08.supercomputing.org/
      • MSST: IEEE: http://storageconference.org/
        • SNAPI, SISW, CMPD, DAPS
      • NAS: IEEE: http://www.eece.maine.edu/iwnas/
      • SNW: SNIA: Storage Networking World
      • SNIA Storage Developer Conference: http://www.snia.org/events/storage-developer2008/
      • Other conferences with storage components:
        • IPDPS, ICDCS, ICPP, AReS, PDCS, Mascots, HPDC, IPCCC, ccGRID…
    • 26. Awards in Storage Systems
      • IEEE Reynold B. Johnson Information Storage Systems Award
        • Sponsored by: IBM Almaden Research Center
      • 2008 Recipient:
        • 2008: ALAN J. SMITH, Professor, University of California at Berkeley, Berkeley, CA, USA
      • 2007 Co-Recipients:
        • DAVE HITZ, Executive Vice President and Co-Founder, Network Appliance, Sunnyvale, CA
        • JAMES LAU, Chief Strategy Officer, Executive Vice President and Co-Founder, Network Appliance, Sunnyvale, CA
      • 2006 Recipient:
        • JAISHANKAR M. MENON, Director of Storage Systems Architecture and Design, IBM, San Jose, CA
        • "For pioneering work in the theory and application of RAID storage systems."
      • More: http://www.ieee.org/portal/pages/about/awards/sums/johnson.html
    • 27. HPC storage challenge: SC
      • HPC systems are comprised of 3 major subsystems: processing, networking and storage. In different applications, any one of these subsystems can limit the overall system performance. The HPC Storage Challenge is a competition showcasing effective approaches using the storage subsystem, which is often the limiting system, with actual applications.
      • Participants must describe their implementations and present measurements of performance, scalability, and storage subsystem utilization. Judging will be based on these measurements as well as innovation and effectiveness; maximum size and peak performance are not the sole criteria.
      • Finalists will be chosen on the basis of submissions which are in the form of a proposal; submissions are encouraged to include reports of work in progress. Participants with access to either large or small HPC systems are encouraged to enter this challenge.
    • 28. Research hotspots
      • Energy Efficient Storage: CMU, UIUC
      • Scalable Storage Meets Petaflops: IBM, UCSC
      • High availability and Reliable Storage Systems: UC Berkeley, CMU, TTU, UCSC
      • Object Storage (OSD): CMU, HUST, UCSC
      • Storage virtualization: CMU
      • Intelligent Storage Systems: UC Berkeley, CMU
    • 29. Energy Efficient Storage
      • Energy aware:
        • disk level: SSD
        • I/O and file system level: efficient memory, cache, networked storage
        • Application level: data layout
      • Green storage initiative (GSI): SNIA
        • Energy efficient storage networking solutions
        • Storage administrator and infrastructure
    • 30. Peta-scale scalable storage
      • Challenge: building storage that can keep pace with ever-increasing processor speeds, multi-core designs, and petaflop computing capabilities.
      • Latencies to disks are not keeping up with peta-scale computing.
      • Innovative approaches are needed.
    • 31. My research in high performance and reliable storage systems
      • Active/Active Service for High Availability Computing
        • Active/Active Metadata Service
      • Networked Storage Systems
        • STICS
        • iRAID
      • A Unified Multiple-Level Cache for High Performance Storage Systems
        • iPVFS
        • iCache
      • Performance-Adaptive UDP for Bulk Data Transfer over Dedicated Network Links: PA-UDP
    • 32. Improving disk performance.
      • Use large sectors to improve bandwidth
      • Use track caches and read ahead;
        • Read entire track into on-controller cache
        • Exploit locality (improves both latency and BW)
      • Design file systems to maximize locality
        • Allocate files sequentially on disks (exploit track cache)
        • Locate similar files in same cylinder (reduce seeks)
        • Locate similar files in nearby cylinders (reduce seek distance)
      • Pack bits closer together to improve transfer rate and density.
      • Use a collection of small disks to form a large, high-performance one: a disk array
      • Striping data across multiple disks allows parallel I/O, thus improving performance.
    • 33. Use Arrays of Small Disks? [Figure: conventional drives span 4 disk designs from low end to high end (14”, 10”, 5.25”, 3.5”); a disk array uses a single 3.5” disk design]
      • Katz and Patterson asked in 1987:
        • Can smaller disks be used to close gap in performance between disks and CPUs?
    • 34. Replace Small Number of Large Disks with Large Number of Small Disks! (1988 disks)
      • IBM 3390K: 20 GBytes capacity, 97 cu. ft., 3 KW, 15 MB/s data rate, 600 I/Os/s, 250 KHrs MTTF, $250K
      • IBM 3.5" 0061: 320 MBytes capacity, 0.1 cu. ft., 11 W, 1.5 MB/s data rate, 55 I/Os/s, 50 KHrs MTTF, $2K
      • x70 array: 23 GBytes capacity, 11 cu. ft. (9X less volume), 1 KW (3X less power), 110 MB/s data rate (8X), 3900 I/Os/s (6X), ??? Hrs MTTF, $150K
      • Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
    • 35. Array Reliability
      • MTTF: Mean Time To Failure: average time that a non-repairable component will operate before experiencing failure.
      • Reliability of N disks = Reliability of 1 Disk ÷N
        • Example: 50,000 hours ÷ 70 disks ≈ 700 hours (see the sketch after this slide)
      • Disk system MTTF: Drops from 6 years to 1 month!
      • Arrays without redundancy too unreliable to be useful!
      • Solution: redundancy.
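      A minimal sketch reproducing the slide's arithmetic (assuming independent disk failures, so the array MTTF is the single-disk MTTF divided by the number of disks):

        # Array MTTF for N disks with no redundancy, per slide 35.
        def array_mttf(disk_mttf_hours, n_disks):
            """MTTF of the whole array, assuming independent failures."""
            return disk_mttf_hours / n_disks

        if __name__ == "__main__":
            disk_mttf = 50_000   # hours per disk (the slide's figure)
            n = 70               # disks in the array
            mttf = array_mttf(disk_mttf, n)
            print(f"One disk: {disk_mttf / (24 * 365):.1f} years; "
                  f"{n}-disk array: {mttf:.0f} hours (~{mttf / (24 * 30):.1f} months)")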
    • 36. Redundant Arrays of (Inexpensive) Disks
      • Replicate data over several disks so that no data will be lost if one disk fails.
      • Redundancy yields high data availability
        • Availability: service still provided to user, even if some components failed
      • Disks will still fail
      • Contents reconstructed from data redundantly stored in the array
        • Capacity penalty to store redundant info
        • Bandwidth penalty to update redundant info
    • 38. Levels of RAID
      • Original RAID paper described five categories (RAID levels 1-5). (Patterson et al, “A case for redundant arrays of inexpensive disks (RAID)”, ACM SIGMOD, 1988)
      • Disk striping with no redundancy is now called RAID 0 or JBOD (just a bunch of disks).
      • Other kinds have been proposed in the literature: Level 6 (P+Q redundancy), Level 10, RAID 53, etc.
      • Except for RAID 0, all RAID levels trade disk capacity for reliability, and the extra reliability makes parallelism a practical way to improve performance.
    • 39. RAID 0: Nonredundant (JBOD)
      • High I/O performance.
        • Data is not saved redundantly.
        • Single copy of data is striped across multiple disks.
      • Low cost.
        • Lack of redundancy.
      • Least reliable: single disk failure leads to data loss.
      [Figure: a file's data is striped as blocks 0-3 across disks 0-3; a striping sketch follows this slide]
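      As a concrete illustration of striping (a minimal sketch; the block numbering and four-disk array are illustrative assumptions consistent with the figure), logical block i simply maps to disk i mod N:

        # RAID 0 striping: logical block i lives on disk (i mod N), at row (i // N).
        # No redundancy -- losing any one disk loses data.
        def raid0_location(logical_block, n_disks):
            """Return (disk, row_within_disk) for a logical block number."""
            return logical_block % n_disks, logical_block // n_disks

        if __name__ == "__main__":
            N = 4
            for blk in range(8):
                disk, row = raid0_location(blk, N)
                print(f"block {blk} -> disk {disk}, row {row}")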
    • 40. Redundant Arrays of Inexpensive Disks RAID 1: Disk Mirroring/Shadowing
      • Each disk is fully duplicated onto its “mirror”
      • Very high availability can be achieved
      • Bandwidth sacrifice on write: logical write = two physical writes
      • Reads may be optimized, minimizing queue and disk search time
      • Most expensive solution: 100% capacity overhead
      [Figure: each primary disk and its mirror form a recovery group]
      Targeted for high I/O rate, high availability environments
    • 41. RAID 2: Memory-Style ECC [Figure: data disks plus multiple ECC disks and a parity disk]
      • Multiple disks record the ECC information to determine which disk is in fault
      • A parity disk is then used to reconstruct corrupted or lost data
      • Needs log2(number of disks) redundant disks
    • 42. RAID 3: Bit (Byte) Interleaved Parity
      • Only need one parity disk
      • Write/Read accesses all disks
      • Only one request can be serviced at a time
      • Easy to implement
      • Provides high bandwidth but not high I/O rates
      Targeted for high bandwidth applications: multimedia, image processing. [Figure: a logical record is bit/byte-interleaved into striped physical records across the data disks, with a parity record on disk P]
    • 43. RAID 3
      • Sum computed across recovery group to protect against hard disk failures, stored in P disk
      • Logically, a single high capacity, high transfer rate disk: good for large transfers
      • Wider arrays reduce capacity costs, but decrease availability
      • 12.5% capacity cost for parity in this configuration
      Inspiration for RAID 4:
      • RAID 3 relies on the parity disk to discover errors on read
      • But every sector has an error detection field
      • Rely on the error detection field to catch errors on read, not on the parity disk
      • This allows independent reads to different disks simultaneously
    • 44. RAID 4: Block Interleaved Parity
      • Blocks: striping units
      • Allow for parallel access by multiple I/O requests, high I/O rates
      • Doing multiple small reads is now faster than before, since a small read request can be serviced by a single disk.
      • Large writes (full stripe) recompute the parity from the new data (⊕ denotes bitwise XOR): P’ = d0’ ⊕ d1’ ⊕ d2’ ⊕ d3’
      • Small writes (e.g., a write to d0) update the parity from the old data and the old parity: since P = d0 ⊕ d1 ⊕ d2 ⊕ d3, the new parity is P’ = d0’ ⊕ d1 ⊕ d2 ⊕ d3 = P ⊕ d0 ⊕ d0’ (see the sketch after this slide)
      • However, writes are still slow since the single parity disk is the bottleneck.
      [Figure: blocks 0-15 striped across four data disks, with parity blocks P(0-3), P(4-7), P(8-11), P(12-15) on a dedicated parity disk]
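      A minimal sketch of the two parity-update rules above, modeling blocks as equal-length byte strings (the toy data values are illustrative assumptions, not from the slides):

        # "+" on the slide is bitwise XOR; both update rules must agree.
        def xor_blocks(*blocks):
            """Bytewise XOR of equal-length blocks."""
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, b in enumerate(block):
                    out[i] ^= b
            return bytes(out)

        # Full-stripe (large) write: recompute parity from all new data blocks.
        def full_stripe_parity(new_blocks):
            return xor_blocks(*new_blocks)

        # Small write to one block: P' = P xor d_old xor d_new
        # (two reads -- old data and old parity -- plus two writes).
        def small_write_parity(old_parity, old_block, new_block):
            return xor_blocks(old_parity, old_block, new_block)

        if __name__ == "__main__":
            d = [bytes([i] * 4) for i in range(4)]   # toy blocks d0..d3
            p = full_stripe_parity(d)
            d0_new = bytes([0xFF] * 4)
            p_small = small_write_parity(p, d[0], d0_new)
            p_full = full_stripe_parity([d0_new, d[1], d[2], d[3]])
            assert p_small == p_full                 # both rules give the same parity
            print("new parity:", p_small.hex())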
    • 45. Problems of Disk Arrays: Small Writes (read-modify-write procedure). [Figure: (1) read old data D0, (2) read old parity P, XOR the new data D0’ with the old data and the old parity to form the new parity P’, then (3) write D0’ and (4) write P’.] RAID-5 small write algorithm: 1 logical write = 2 physical reads + 2 physical writes
    • 46. Inspiration for RAID 5
      • RAID 4 works well for small reads
      • Small writes (write to one disk):
        • Option 1: read other data disks, create new sum and write to Parity Disk
        • Option 2: since P has old sum, compare old data to new data, add the difference to P
      • Small writes are limited by the parity disk: writes to D0 and D5 both also write to the P disk; the parity disk must be updated on every write!
      [Figure: stripes D0-D3 and D4-D7 with their parity blocks; writes to D0 and D5 each require a parity update]
    • 47. Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity. Independent writes are possible because parity is interleaved across the disks. [Figure: five disk columns with logical disk addresses increasing down the stripes; the parity block rotates across the disks: D0 D1 D2 D3 P, D4 D5 D6 P D7, D8 D9 P D10 D11, D12 P D13 D14 D15, P D16 D17 D18 D19, D20 D21 D22 D23 P, ...] Example: writes to D0 and D5 use disks 0, 1, 3, 4 (see the sketch after this slide)
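      A minimal sketch of a rotating-parity placement consistent with the figure above (5 disks; the exact mapping convention is an assumption chosen to reproduce the slide's example):

        # Map a logical block to its data disk and the parity disk of its stripe.
        def raid5_layout(logical_block, n_disks=5):
            """Return (stripe, data_disk, parity_disk) for a logical block."""
            blocks_per_stripe = n_disks - 1
            stripe = logical_block // blocks_per_stripe
            offset = logical_block % blocks_per_stripe
            parity_disk = (n_disks - 1 - stripe) % n_disks   # parity rotates each stripe
            data_disks = [d for d in range(n_disks) if d != parity_disk]
            return stripe, data_disks[offset], parity_disk

        if __name__ == "__main__":
            for blk in (0, 5):   # the slide's example: small writes to D0 and D5
                stripe, disk, pdisk = raid5_layout(blk)
                print(f"D{blk}: stripe {stripe}, data on disk {disk}, parity on disk {pdisk}")
            # D0 touches disks 0 and 4; D5 touches disks 1 and 3 -- four distinct
            # disks, so the two small writes can proceed in parallel, unlike RAID 4.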
    • 48. Comparison of RAID Levels (N disks, each with capacity C)
      • RAID 0 (striping): capacity N x C; advantage: maximum data transfer rate and size; disadvantage: no redundancy
      • RAID 1 (mirroring): capacity (N/2) x C; advantage: high performance, fastest write; disadvantage: cost
      • RAID 3 (bit/byte-level parity): capacity (N-1) x C; advantage: easy to implement, high error recoverability; disadvantage: low performance
      • RAID 4 (block-level parity): capacity (N-1) x C; advantage: high redundancy and better performance; disadvantage: write-related bottleneck
      • RAID 5 (interleaved parity): capacity (N-1) x C; advantage: high performance and reliability; disadvantage: small write problem
    • 49. RAID 6: Recovering from 2 failures
      • Why > 1 failure recovery?
        • operator accidentally replaces the wrong disk during a failure
        • since disk bandwidth is growing more slowly than disk capacity, the mean time to repair (MTTR) a disk in a RAID system is increasing, which raises the chance of a second failure during the longer repair window
        • reading much more data during reconstruction raises the chance of hitting an uncorrectable media error, which would result in data loss
    • 50. RAID 6: Recovering from 2 failures
      • Network Appliance’s row-diagonal parity or RAID-DP
      • Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
      • Since it is protecting against a double failure, it adds two check blocks per stripe of data.
        • If p+1 disks total, p-1 disks have data; assume p=5
      • Row parity disk is just like in RAID 4
        • Even parity across the other 4 data blocks in its stripe
      • Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
    • 51. Example p = 5
      • Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
        • Since each diagonal misses one disk, and all diagonals miss a different disk, 2 diagonals are only missing 1 block
      • Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
      • Process continues until two failed disks are restored
      [Figure: the p = 5 stripe with data disks 0-3, the row parity disk, and the diagonal parity disk; each block is labeled with its diagonal number 0-4. A parity-construction sketch follows this slide.]
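      A minimal construction sketch for the p = 5 layout described above. The diagonal numbering and the choice of which diagonal is left unstored follow the usual RAID-DP description and are assumptions, not details taken from the slides:

        # Row-diagonal parity for p = 5: four data disks, one row-parity disk,
        # one diagonal-parity disk, and p - 1 = 4 rows per stripe.
        P = 5
        ROWS = P - 1
        DATA_DISKS = P - 1        # data disks 0..3; "disk 4" is row parity

        def build_parity(data):
            """data[r][c]: byte at row r of data disk c. Returns (row_parity, diag_parity)."""
            # Row parity: XOR across the data disks in each row (exactly as in RAID 4).
            row_parity = [0] * ROWS
            for r in range(ROWS):
                for c in range(DATA_DISKS):
                    row_parity[r] ^= data[r][c]

            # Diagonal parity: block (r, c) -- including the row-parity column c = 4 --
            # belongs to diagonal (r + c) mod p; diagonal p - 1 is the one left unstored.
            diag_parity = [0] * ROWS
            for r in range(ROWS):
                for c in range(P):
                    value = data[r][c] if c < DATA_DISKS else row_parity[r]
                    d = (r + c) % P
                    if d != P - 1:
                        diag_parity[d] ^= value
            return row_parity, diag_parity

        if __name__ == "__main__":
            stripe = [[(r * 4 + c) & 0xFF for c in range(DATA_DISKS)] for r in range(ROWS)]
            row_p, diag_p = build_parity(stripe)
            print("row parity:     ", row_p)
            print("diagonal parity:", diag_p)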
    • 52. Summary: RAID Techniques: goal was performance, popularity due to reliability of storage
      • Disk mirroring / shadowing (RAID 1): each disk is fully duplicated onto its "shadow"; logical write = two physical writes; 100% capacity overhead
      • Parity data bandwidth array (RAID 3): parity computed horizontally; logically a single high-data-bandwidth disk
      • High I/O rate parity array (RAID 5): interleaved parity blocks; independent reads and writes; logical write = 2 reads + 2 writes