SSD and Enterprise Storage

  • 1.
    SSD and its Application in Enterprise Storage
    Frank Zhao
  • 2.
    What’s SSD?
    SSD (Solid State Drive): a semiconductor-based block storage device that appears to host devices like a disk drive (IDC, 2007)
    Other terms:
    - EFD (Enterprise Flash Drive, EMC)
    - SCM (Storage Class Memory, IBM)
    Types:
    - DRAM + battery
    - Flash based: NOR flash, NAND flash
  • 3.
    Why SSD?
    The disk drive has become the performance bottleneck!
    - Multiple cores (Intel Nehalem)
    - DDR3-1333: 10GB/s
    - QPI: 25.6GB/s
    - PCIe x16: 8GB/s
    - FC disk: 150MB/s!
  • 4.
    Flash instead of spindle/platter
    NAND flash: high density, long endurance
    Behaves like a disk:
    - Sector/page-based and well suited to sequential data (pictures, audio, and files)
    - Bad blocks
    Unlike a disk:
    - Cells wear out
    Two NAND types:
    - SLC (single-level cell): 3X faster and 10X longer endurance than MLC; for enterprise
    - MLC (multi-level cell): 2~3X more capacity than SLC; for consumer
  • 5.
    NAND Flash internal
    Data is grouped into blocks and pages
    - Page: 2KB/4KB (plus spare area: 64B/128B)
    - Block: 64/128 pages, up to 512KB
    Special ops:
    - Erase: set all bits within a block to “1”. The smallest erasable entity is a block
    - Program: change a bit from “1” to “0”
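The erase and program rules above can be illustrated with a toy model. This is a minimal sketch, not how any real controller is written: the class name `NandBlock` and its methods are hypothetical, and the tiny page and block sizes are chosen only to keep the example readable.

```python
class NandBlock:
    """Toy model of one NAND block (hypothetical; sizes are illustrative)."""

    def __init__(self, pages_per_block=64, page_size=2048):
        self.pages_per_block = pages_per_block
        self.page_size = page_size
        self.erase_count = 0
        # A fresh block starts in the erased state: every bit is 1.
        self.pages = [bytearray(b"\xff" * page_size)
                      for _ in range(pages_per_block)]

    def erase(self):
        """Erase works on the whole block: every bit goes back to 1."""
        self.pages = [bytearray(b"\xff" * self.page_size)
                      for _ in range(self.pages_per_block)]
        self.erase_count += 1

    def program(self, page_no, data):
        """Program can only clear bits (1 -> 0), so data is ANDed in."""
        if len(data) != self.page_size:
            raise ValueError("data must fill exactly one page")
        page = self.pages[page_no]
        for i, byte in enumerate(data):
            page[i] &= byte

    def read(self, page_no):
        return bytes(self.pages[page_no])


block = NandBlock(pages_per_block=4, page_size=4)
block.program(0, b"\x0f\x0f\x0f\x0f")
# Programming the same page again without an erase can only clear more bits,
# so the page does not end up holding the new data.
block.program(0, b"\xf0\xff\xff\xff")
without_erase = block.read(0)

block.erase()                 # the only way to get bits back to 1
after_erase = block.read(0)
```

The second `program` call shows why in-place overwrite is not possible: a page must sit in an erased block before it can take new data, which is what makes block-level erase the costly step.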
  • 6.
    SSD product
    Interface: FC/SATA/SAS/PCIe/USB …
    Properties:
    - Performance
    - Endurance
    - Capacity
    - Cost
  • 7.
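    SSD - Performance
    STEC ZEUS IOPS: media type SLC; capacity 73 GB (user space rate 73/128 = 57%); cache 512 MB; access time 0.02~0.12 ms; sustained read 220 MB/s; sustained write 115 MB/s; random read 46,000 IOPS; random write 16,000 IOPS
    Intel X25-E: media type SLC; capacity 64 GB (user space rate 59.6/80 = 74.5%); cache 16 MB; access time R 0.075 ms, W 0.085 ms; sustained read 250 MB/s; sustained write 170 MB/s; random read 35,000 IOPS; random write 3,300 IOPS
    HDD 15K rpm: media type HDD; capacity 600 GB (user space rate -); cache 16 MB; access time 3.4 ms; sustained read 147 MB/s; sustained write 134 MB/s; random read 250 IOPS; random write 250 IOPS
    Other labels on the slide: Cache rate; 260X; 30X ~ 170X; 180X; 60X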
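As a rough check on the slide's multiplier callouts, the STEC ZEUS IOPS figures can be divided by the 15K rpm HDD figures. The pairing of each number with its column is a reading of the flattened slide table, so treat the inputs as an assumption.

```python
# Figures as read from the performance slide (column pairing is assumed).
hdd_15k = {"access_ms": 3.4, "rand_read_iops": 250, "rand_write_iops": 250}
zeus_iops = {
    "access_ms_best": 0.02,
    "access_ms_worst": 0.12,
    "rand_read_iops": 46_000,
    "rand_write_iops": 16_000,
}

ratios = {
    # More IOPS is better, so SSD / HDD.
    "random_read": zeus_iops["rand_read_iops"] / hdd_15k["rand_read_iops"],
    "random_write": zeus_iops["rand_write_iops"] / hdd_15k["rand_write_iops"],
    # Lower access time is better, so HDD / SSD.
    "access_time_best": hdd_15k["access_ms"] / zeus_iops["access_ms_best"],
    "access_time_worst": hdd_15k["access_ms"] / zeus_iops["access_ms_worst"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.0f}X")
```

Under that reading, the results land close to the slide's 180X, 60X and 30X ~ 170X labels; the 260X label is not reproduced by any of these ratios.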
  • 8.
    SSD: Endurance
    Data retention: 10 years without power
    MTBF: 2 million hours (vs. HDD: 1.6 million hours)
    Wear out:
    - Limited cell write cycles: SLC 100K~1M writes, MLC 10K writes
    - As available space decreases, write performance degrades when the disk is nearly full
    Solutions:
    - Additional reserved space
    - Wear leveling algorithm to spread writes across the whole disk
    - Bad block management to map out bad blocks
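Wear leveling and bad block mapping can be sketched together. This is a deliberately simplified illustration, assuming a policy of always writing to the least-erased usable block; the class `WearLeveler`, its methods, and the cycle limit of 3 are hypothetical, and real controllers use far more elaborate schemes.

```python
class WearLeveler:
    """Toy allocator: least-erased block first, worn blocks mapped out."""

    def __init__(self, num_blocks, max_erase_cycles):
        self.erase_counts = [0] * num_blocks
        self.bad_blocks = set()
        self.max_erase_cycles = max_erase_cycles

    def pick_block(self):
        """Return the usable block with the fewest erases so far."""
        candidates = [b for b in range(len(self.erase_counts))
                      if b not in self.bad_blocks]
        if not candidates:
            raise RuntimeError("no usable blocks left")
        return min(candidates, key=lambda b: self.erase_counts[b])

    def erase_and_write(self):
        """Erase the chosen block, count the wear, retire it when worn out."""
        block = self.pick_block()
        self.erase_counts[block] += 1
        if self.erase_counts[block] >= self.max_erase_cycles:
            self.bad_blocks.add(block)   # bad block management: map it out
        return block


leveler = WearLeveler(num_blocks=4, max_erase_cycles=3)
history = [leveler.erase_and_write() for _ in range(8)]
print(history)               # writes rotate across all four blocks
print(leveler.erase_counts)  # wear stays even
```

Without leveling, eight rewrites of the same logical location would all hit one block and exhaust its cycle limit while the other blocks sat unused.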
  • 9.
    SSD: Capacity
    Amazing potential on capacity
    Could be huge and costly:
    - 2.5 inch: 512 GB
    - 3.5 inch: 1 TB (BitMacro)
    Could be extremely small and inexpensive:
    - 4GB SSD for E-PC. Can you imagine a $10 HDD?
  • 10.
    SSD: Cost
    $/GB: 8X
    $/IOPS: 83X
  • 11.
    SSD Cost: trend
    Chart: cost/GB vs. high-performance FC, by quarter from 1Q08 to 1Q09; callouts on the chart: ≈ 40x, ≈ 8x, ≈ 22x
  • 12.
    A brief summary
    SLC NAND flash is dominant in the enterprise environment
    Advantages:
    - Excellent performance for read IO, random IO, and small IO (<32KB) workloads
    - Low power consumption
    - Small size/weight
    Limitations:
    - Cost
    - Endurance
    - Performance degradation
  • 13.
    SSD in Enterprise Storage
    Alternate routes:
    - SSD as cache in server
    - Standalone appliance
    - Tiered storage
    Diagram labels: Tiered Storage, Cache, Standalone SSD array
  • 14.
    SSD Application in Enterprise
    Timeline chart, 2008 to 2010 (ticks at months 3, 6, 9, 12). Products shown: BladeCenter HS21; DS8000; ProLiant Blade, EVA; Blade/Server; Unified Storage 7000; ZFS upgrade; Storage F5100; USP V/VM; PAM (DRAM); RamSan on V3170; DMX-4; CX-4; 2nd Gen SSD; FAST; X/Blade/Power; SVC
  • 15.
    EMC
    Strategic: Tiered Storage System (EFD, FC, SATA)
    Products: V-Max/DMX-4, CX-4, Celerra
    SSD Tuned Arrays will Totally Change the Game!
  • 16.
    EMC-FAST (Fully Automated Storage Tiering)
    FAST: LUN-level automatic data movement, policy-based
    Management tools: Ionix (ECC), Navi, and RF
    Diagram: V-LUNs moving across Flash, Fibre Channel, and SATA tiers
    New in Q4 2009: V-Max, CX-4, Celerra
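Policy-based, LUN-level tiering can be sketched as a placement plan. This is not EMC's FAST algorithm, whose policies the slides do not describe; it assumes one simple hypothetical policy (hottest LUNs by IOPS per GB go to the fastest tier that has room), and the function `plan_tiers`, the LUN names, and the capacities are all made up for illustration.

```python
def plan_tiers(luns, tier_capacity_gb):
    """Place whole LUNs on tiers, hottest first.

    luns: list of (name, size_gb, iops) tuples.
    tier_capacity_gb: dict of tier name -> free GB, ordered fastest first.
    """
    remaining = dict(tier_capacity_gb)
    placement = {}
    # Rank by IOPS density so small, busy LUNs win the flash tier.
    hottest_first = sorted(luns, key=lambda lun: lun[2] / lun[1], reverse=True)
    for name, size_gb, _iops in hottest_first:
        for tier in remaining:
            if remaining[tier] >= size_gb:
                remaining[tier] -= size_gb
                placement[name] = tier
                break
        else:
            raise ValueError(f"no tier has room for {name}")
    return placement


luns = [("oltp", 100, 20_000), ("mail", 300, 3_000), ("archive", 1_000, 100)]
tiers = {"flash": 146, "fc": 600, "sata": 4_000}
placement = plan_tiers(luns, tiers)
print(placement)
```

Rerunning such a plan as workload statistics change, and migrating only the LUNs whose tier differs from the previous plan, is one way to read "automatic data movement".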
  • 17.
    NetApp
    PAM (Performance Acceleration Module)
    - PCIe card for read cache
    - SW: ONTAP7G, FlexScale, and Predictive Cache Stat.
    - PAM1: DRAM, up to 16GB/card * 10
    - PAM2: SSD, up to 512GB/card * 8
    Add-on SSD array from TMS (Texas Memory Systems)
    - Scale out the read performance for V-series/SAN
    - SW: FlexCache
    - TMS RamSan 500: 2 TB, DRAM + flash array (RAID)
  • 18.
    SUN/Solaris “Hybrid Storage Pool”
    - Write IO: write-optimized SSDs (ZFS intent log)
    - Read IO: commodity SSDs
    - Second-level flash cache (L2ARC) behind the primary DRAM cache
    - Smart replacement algorithm
    Configuration labels on the slide: 18GB * 8, 100GB * 6
  • 19.
    What’s Next in/after SSD?
    Views from array providers:
    - EMC: lower cost
    - IBM: MLC; revolutionary RAM tech? (Racetrack Memory, Phase Change Memory)
    - SUN: MLC; software to be fault tolerant; distinguish write from read IO
    Goal: good performance + endurance + capacity at an affordable cost
    Chip (Micron, Samsung, Intel, Hynix):
    - 34 nm
    - 3-bit MLC
    - Other NVRAM
    Controller (Intel, JMicron, SandForce, STEC):
    - Wear leveling algorithm
    - ECC/RAID/SMART/…
    - Bad block map-out
    Drive (STEC, Intel, Samsung, Pliant, Seagate, Fusion-IO):
    - SLC+MLC+RAM
  • 20.
    What’s Next in/after SSD?
    How to make best use of SSD?

Editor's Notes

  • #5 So we need a better medium: fast performance; behaves like a disk for non-volatile data storage; large capacity and long endurance.
  • #6 Read: ~25 µs/page. Write: erase 2 ms/block; program ~300 µs/page, 12X slower than read.
    In NAND flash, data in a block can only be written sequentially. Number of Operations (NOPs) is the number of times the sectors can be programmed; so far this number is always one for MLC flash and four for SLC flash. While NAND cannot inherently perform random access, it is possible at the system level through shadowing.
    Trends:
    - More advanced semiconductor technology: 5X nm → 4X nm → 3X nm (X stands for a digit)
    - Larger page, block and spare area per 512 bytes: page size * 2, pages per block * 2, spare area per 512 bytes: 16 bytes → 28 bytes
    - Larger capacity: 2GB → 4GB → 8GB → 16GB → …
    - Faster interface: ONFI 1.0 → ONFI 2.0 → ONFI 2.1 → …
  • #8 Seagate: http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=null&vgnextoid=cb5dcfc7e21de110VgnVCM100000f5ee0a0aRCRD
    STEC 3nd: 5.4W in idle mode and 8.4W in operation. Seagate: 11.6W idle and 16.3W working.
  • #9 Factors for wear out: wear-leveling efficiency, write amplification, NAND cycles, SSD densities
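How these factors combine can be shown with a back-of-the-envelope estimate. The formula below is a common first-order approximation rather than anything stated in the slides, and the function name, the 64 GB capacity, and the write amplification of 2.0 are hypothetical inputs; only the 100K and 10K cycle figures come from the endurance slide.

```python
def lifetime_host_writes_tb(capacity_gb, pe_cycles, write_amplification,
                            leveling_efficiency=1.0):
    """First-order estimate of total host data (TB) writable before wear-out.

    capacity_gb * pe_cycles is the raw NAND write budget; imperfect wear
    leveling shrinks it, and write amplification divides what the host gets.
    """
    nand_budget_gb = capacity_gb * pe_cycles * leveling_efficiency
    return nand_budget_gb / write_amplification / 1000


# Same hypothetical 64 GB drive built from SLC (100K cycles) or MLC (10K).
slc_tb = lifetime_host_writes_tb(64, 100_000, write_amplification=2.0)
mlc_tb = lifetime_host_writes_tb(64, 10_000, write_amplification=2.0)
print(f"SLC: {slc_tb:.0f} TB, MLC: {mlc_tb:.0f} TB")
```

The estimate makes the trade-off visible: at equal capacity the cycle count dominates, while higher density or lower write amplification raises the budget for either cell type.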
  • #10 Performance
  • #11 FC disk: $620/600GB = $1.1/GB. ST3600057FC Seagate http://www.provantage.com/seagate-st3600057fc~7SEGS20E.htm
    Intel X25-E: $550/64GB = $8.6/GB
    Intel X25-E IOPS: random 4KB reads: >35,000 IOPS; random 4KB writes: >3,300 IOPS. Power: active 2.4W typical (server workload); idle (DIPM) 0.06W typical
    Chip price: SLC: $6/GB, ~4X that of MLC; MLC: $1.6/GB, 20X that of SATA disk
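The $/GB comparison can be recomputed from the prices in this note. A small sketch, assuming the X25-E capacity is the 64 GB listed on the performance slide; the variable names are made up, and the slide's 83X $/IOPS callout depends on an IOPS basis the notes do not state, so it is not reproduced here.

```python
# Prices and capacities from the note; IOPS from the performance slide.
fc_disk = {"price_usd": 620, "capacity_gb": 600, "rand_read_iops": 250}
x25e = {"price_usd": 550, "capacity_gb": 64, "rand_read_iops": 35_000}

fc_usd_per_gb = fc_disk["price_usd"] / fc_disk["capacity_gb"]
ssd_usd_per_gb = x25e["price_usd"] / x25e["capacity_gb"]
gb_cost_ratio = ssd_usd_per_gb / fc_usd_per_gb   # SSD costs more per GB

# Cost per random-read IOPS, where the comparison flips in the SSD's favor.
fc_usd_per_iops = fc_disk["price_usd"] / fc_disk["rand_read_iops"]
ssd_usd_per_iops = x25e["price_usd"] / x25e["rand_read_iops"]

print(f"$/GB: FC {fc_usd_per_gb:.2f}, SSD {ssd_usd_per_gb:.2f}, "
      f"ratio {gb_cost_ratio:.1f}X")
print(f"$/IOPS (random read): FC {fc_usd_per_iops:.2f}, "
      f"SSD {ssd_usd_per_iops:.4f}")
```

With these inputs the $/GB ratio comes out a little above 8, in line with the 8X on the cost slide.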
  • #14 SSD tuned array: SSD found; workload identification; automatic data migration. Get the right data to the right place at the right time, automatically.
  • #15 IBM: In July 2007, IBM already supported SanDisk’s SSD on BladeCenter HS21, but its follow-up has been slow or isolated; it seems that IBM was not satisfied with flash-based NVRAM. Right now, IBM X/Blade/Power servers support SSDs from STEC (SAS or SATA); the “SSD Data Balancer” tool is available for Power. DS8000 and SVCV5 support SSDs from STEC (FC or SAS), with DFSMS (“Data Facility Storage Management Subsystem”) for DS8000.
    SUN: SSD is “an even bigger impact on datacenter economics than virtualization”, said Jonathan.
  • #16 The #1 interest and winning tech in storage. EMC takes ~60% of the enterprise SSD market share in the US.
  • #18 PAM: http://www.netapp.com/us/communities/tech-ontap/pam.html http://media.netapp.com/documents/wp-7061.pdf
    A second-layer cache that holds blocks evicted from the WAFL buffer cache. ONTAP maintains a set of cache tags in system memory to determine whether or not a block resides in PAM without accessing the card. The aim is to reduce average read latency (especially for small random read IO).
    PCIe card with dual-channel DMA, up to 16GB/card, with an integrated FPGA for onboard intelligence.
    Modes: default mode (both metadata and data); metadata mode; low-priority mode (metadata/data + low-priority data).
    FlexCache with TMS+V. FlexCache arch: http://media.netapp.com/documents/tr-3669.pdf http://www.netapp.com/us/products/platform-os/flexcache/ NFS client only.
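The second-layer cache idea in this note can be sketched as a victim cache behind an LRU primary cache. This is an illustration under stated assumptions, not NetApp's implementation: the class `TwoLevelCache`, the LRU policy at both levels, and the choice to move a block back to the primary cache on a second-level hit are all hypothetical.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy read cache: LRU primary (DRAM) plus a victim cache (flash card)."""

    def __init__(self, l1_size, l2_size, backend):
        self.l1 = OrderedDict()   # primary buffer cache, oldest entry first
        self.l2 = OrderedDict()   # second layer; its keys act as in-memory tags
        self.l1_size = l1_size
        self.l2_size = l2_size
        self.backend = backend    # callable that reads a block from disk
        self.stats = {"l1": 0, "l2": 0, "disk": 0}

    def read(self, block_id):
        if block_id in self.l1:
            self.l1.move_to_end(block_id)      # refresh LRU position
            self.stats["l1"] += 1
            return self.l1[block_id]
        if block_id in self.l2:                # tag check, no card access needed
            data = self.l2.pop(block_id)
            self.stats["l2"] += 1
        else:
            data = self.backend(block_id)
            self.stats["disk"] += 1
        self._insert_l1(block_id, data)
        return data

    def _insert_l1(self, block_id, data):
        self.l1[block_id] = data
        if len(self.l1) > self.l1_size:
            # The block evicted from the primary cache drops into layer two.
            victim, victim_data = self.l1.popitem(last=False)
            self.l2[victim] = victim_data
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)


cache = TwoLevelCache(l1_size=2, l2_size=2, backend=lambda b: f"block-{b}")
for block_id in [1, 2, 3, 1]:
    cache.read(block_id)
print(cache.stats)
```

In the run above, block 1 is evicted from the primary cache by block 3 and is then served from the second layer instead of going back to disk, which is the latency saving the note describes.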
  • #19 Hybrid Storage Pool: https://www.sun.com/offers/docs/820-5881.pdf
  • #20 SUN: As SUN observed, NAND vendors don’t address the needs of the enterprise market; instead, they put more effort into commodity or consumer parts (MLC NAND). As a result, enterprise SSD (SLC NAND) remains quite expensive. Given this, SUN is preparing to use MLC NAND to cache huge datasets and ride the commodity trend to deliver more cost-effective systems. On the data reliability issue of MLC NAND, SUN insists that errors should only affect performance, not correctness, so software should take responsibility for managing data reliability. The story is just beginning!
  • #21 From Sorin Faibish’s Slides