Moving to PCI Express based SSD with NVM Express


A very good presentation introducing NVM Express, the technology that will surely be the (near-)future interface for SSD "drives". Goodbye SAS and SATA; welcome PCI Express in servers (and client machines).



  1. 1. Moving to PCI Express* Based Solid-State Drive with NVM Express Jack Zhang Sr. SSD Application Engineer, Intel Corporation SSDS002
  2. 2. 2 Agenda • Why PCI Express* (PCIe) for SSDs? – PCIe SSD in Client – PCIe SSD in Data Center • Why NVM Express (NVMe) for PCIe SSDs? – Overview NVMe – Driver ecosystem update – NVMe technology developments • Deploying PCIe SSD with NVMe
  4. 4. 4 More than ten exabytes of NAND-based compute SSDs shipped in 2013 [Chart: SSD Capacity Growth by Market Segment (MGB), 2011-2017, Enterprise vs. Client; Solid-State Drive Market Growth] Source: Forward Insights Q4’13
  5. 5. 5 PCI Express* Bandwidth PCI Express* (PCIe) provides a scalable, high bandwidth interconnect, unleashing SSD performance possibilities Source: www.pcisig.com, www.sata-io.org www.usb.org
  7. 7. 7 Motherboard PCIe SAS SATA Translation Queue NVMe File System Software SAS SATA PCI Express* (PCIe) removes controller latency NVM Express (NVMe) reduces software latency SSD Technology Evolution
  8. 8. 8 Source: Forward Insights* PCI Express* SSD starts ramping this year Enterprise SSD Interface Trends PCI Express* Interface SSD Grows Faster
  9. 9. 9 Why PCI Express* for SSDs? Added PCI Express* SSD Benefits • Even better performance • Increased Data Center CPU I/O: 40 PCI Express Lanes per CPU • Even lower latency • No external IOC means Lower power (~10W) & cost (~$15)
  11. 11. 11 Client PCI Express* SSD Considerations • Form Factors? • Attach to CPU or PCH? • PCI Express* x2 or x4? • Path to NVM Express? • What about battery life? • Thermal concerns? Trending well, but hurdles remain
  13. 13. 13 Card-based PCI Express* SSD Options M.2 Socket 2 M.2 Socket 3 SATA Yes, Shared Yes, Shared PCIe x2 PCIe x4 No Yes Comms Support? Yes No Ref Clock Required Required Max “Up to” Performance 2 GB/s 4 GB/s Bottom Line Flexibility Performance Host Socket 2 Host Socket 3 Device w/ B&M Slots 22x80mm DS recommended for capacity 22x42mm SS recommended for size & weight M.2 defines: single or double sided SSDs in 5 lengths, and 2 SSD host sockets Industry alignment for M.2 length will lower costs and accelerate transitions
  15. 15. 15 PCI Express* SSD Connector Options SATA Express* SFF-8639 SATA* Yes Yes PCIe x2 x2 or x4 Host Mux Yes No Ref Clock Optional Required EMI SRIS Shielding Height 7mm 15mm Max “Up to” Performance 2 GB/s 4 GB/s Bottom Line Flexibility & Cost Performance SATA Express*: flexibility for HDD Alignments on connectors for PCI Express* SSDs will lower costs and accelerate transitions Separate Refclk Independent SSC (SRIS) removes clocks from cables, reducing emissions & costs of shielding SFF-8639: Best performance Use an M.2 interface without cables for x4 PCI Express* performance, and lower cost
  20. 20. 20 Many Options to Connect PCI Express* SSDs • SSD can attach to Processor (Gen 3.0) or Chipset (Gen 2.0 today, Gen 3.0 in future) • SSD uses PCIe x1, x2 or x4 • Driver interface can be AHCI or NVM Express. Chipset-attached PCI Express* Gen 2.0 x2 SSDs provide ~2x SATA 6Gbps performance today; PCI Express* Gen 3.0 x4 SSDs with NVM Express provide even better SSD performance tomorrow
  21. 21. 21 Intel® Rapid Storage Technology 13.x Intel® RST driver support for PCI Express Storage coming in 2014 PCI Express* Storage + Intel® RST driver delivers power, performance and responsiveness across innovative form-factors in 2014 Platforms Detachables, Convertibles, All-in-Ones Mainstream & Performance Intel® Rapid Storage Technology (Intel® RST)
  22. 22. 22 Client SATA* vs. PCI Express* SSD Power Management • Active: SATA/AHCI Active state (~500mW); PCIe L0, I/O ready in ~60 µs • Light Active: SATA Partial (I/O ready in 10 µs, ~450mW); Idle: SATA Slumber (10 ms, ~350mW); PCIe L1.2 (~5mW, register read < 150 µs, I/O ready ~5ms) • Pervasive Idle / Lid down: D3_hot DevSlp (I/O ready in 50 - 200 ms, ~15mW); PCIe resume register read < 500 µs, I/O ready ~100ms • Off: D3_cold / RTD3 (< 1 s to ready, 0W); PCIe L3 (register read ~100ms, I/O ready ~300 ms). D3_cold/off, L1.2, autonomous transitions & two-step resume improve PCI Express* SSD battery life
  23. 23. 23 Client PCI Express* (PCIe) SSD Peak Power Challenges • Max Power: 100% Sequential Writes • SATA*: ~3.5W @ ~400MB/s • x2 PCIe 2.0: up to 2x (7W) • x4 PCIe 3.0: up to ~15W2 [Chart: average power (Watts) of five drives, SATA 128K Sequential Write, Compressible Data, QD=321] 1. Data collected using Agilent* DC Power Analyzer N6705B. System configuration: Intel® Core™ i7-3960X (15MB L3 Cache, 3.3GHz) on Intel Desktop Board DX79SI, AMD* Radeon HD 6990 and driver 8.881.0.0, BIOS SIX791OJ.86A.0193.2011.0809.1137, Intel INF 9.1.2.1007, Memory 16GB (4X4GB) Triple-channel Samsung DDR3-1600, Microsoft* Windows* 7 MSAHCI storage driver, Microsoft Windows 7 Ultimate 64-bit Build 7600 with SP1, Various SSDs. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. For more information go to http://www.intel.com/performance 2. M.2 Socket 3 has nine 3.3V supply pins, each capable of 0.5A, for a total power capability of 14.85W. Attention needed for power supply, thermals, and benchmarking. Source: Intel. [Diagram: motherboard M.2 SSD thermal interface material]
  24. 24. 24 Client PCI Express* SSD Accelerators • The client ecosystem is ready: Implement PCI Express* SSDs now! • Use 42mm & 80mm length M.2 for client PCIe SSD • Implement L1.2 and extend RTD3 software support for optimal battery life • Use careful power supply & thermal design • High performance desktop and workstations can consider SFF-8639 data center SSDs for PCI Express* x4 performance today Drive PCI Express* client adoption with specification alignment and careful design
  26. 26. 26 2.5” Enterprise SFF-8639 PCI Express* SSDs The path to mainstream: innovators begin shipping 2.5” enterprise PCI Express* SSDs! Image sources: Samsung*, Micron*, and Dell*
  27. 27. 27 Datacenter PCI Express* SSD Considerations • Form Factor? • Implementation options? • Hot plug or remove? • Traditional RAID? • Thermal/peak power? • Management? Developments are under way
  28. 28. 28 PCI Express* Enterprise SSD Form Factor • SFF-8639 supports 4 pluggable device types • Host slots can be designed to accept more than one type of device • Use PRSNT#, IfDet#, and DualPortEn# pins for device Presence Detect and device type decoding SFF-8639 enables multi-capable hosts
  29. 29. 29 SFF-8639 Connection Topologies • Interconnect standards currently in progress • 2 & 3 connector designs • “beyond the scope of this specification” is a common phrase in standards currently in development Source: “PCI Express SFF-8639 Module Specification”, Rev. 0.3 Meeting PCI Express 3.0* jitter budgets for 3 connector designs is non-trivial. Consider active signal conditioning to accelerate adoption.
  30. 30. 30 Solution Example – 5 Connectors PCI Express* (PCIe) signal retimers & switches are available from multiple sources Images: Dell* Poweredge* R720* PCIe drive interconnect. Contact PLX* or IDT* for more information on retimers or switches 4 5 3 Retimer or Switch Active signal conditioning enables SFF-8639 solutions with more connectors
  31. 31. 31 Hot-Plug Use Cases • Hot Add & Remove are software managed events • During boot, the system must prepare for hot-plug: – Configure PCI Express* Slot Capability registers – Enable and register for hot plug events to higher level storage software (e.g., RAID or tiering software) – Pre-allocate slot resources (Bus IDs, interrupts, memory regions) using ACPI* tables Existing BIOS and Windows*/Linux* OS are prepared to support PCI Express* Hot-Plug today
  32. 32. 32 Surprise Hot-Remove • Random device failure or operator error can result in surprise removal during I/O • Storage controller driver and the software stack are required to be robust for such cases • Storage controller driver must check for Master Abort – On all reads to the device, the driver checks register for FFFF_FFFFh – If data is FFFF_FFFFh, then driver reads another register expected to have a value that includes zeroes to verify device is still present • Time order of removal notification is unknown (e.g. Storage controller driver via Master Abort, or PCI Bus driver via Presence Change interrupt, or RAID software may signal removal first) Surprise Hot-Remove requires careful software design
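The Master Abort check described on this slide can be sketched in a few lines. This is a toy model, not driver code: `read32`, the register offsets, and the simulated devices are all hypothetical stand-ins for MMIO register reads.

```python
# Sketch of the surprise-removal Master Abort check (hypothetical helper;
# a real driver performs MMIO reads of controller registers).
ALL_ONES = 0xFFFF_FFFF

def device_present(read32, data_reg, known_reg):
    """Return False if the device appears to have been surprise-removed.

    read32(reg)  -- reads a 32-bit register; a Master Abort on PCIe
                    completes the read with all 1s (FFFF_FFFFh).
    data_reg     -- the register the driver actually wanted to read.
    known_reg    -- a register whose legal values always include zero bits.
    """
    if read32(data_reg) != ALL_ONES:
        return True            # normal data, device still present
    # FFFF_FFFFh may be legitimate data; double-check against a register
    # that can never legally read as all 1s.
    return read32(known_reg) != ALL_ONES

# Simulated device yanked mid-I/O: every read completes as all 1s.
removed = lambda reg: ALL_ONES
# Simulated present device whose data register happens to read all 1s.
present = {0x10: ALL_ONES, 0x1C: 0x0000_0003}.get

print(device_present(removed, 0x10, 0x1C))   # False
print(device_present(present, 0x10, 0x1C))   # True
```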
  33. 33. 33 RAID for PCI Express* SSDs? • Software RAID is a redundancy solution that enables Highly Available (HA) systems today with PCI Express* (PCIe) SSDs • Multiple copies of application images (redundant resources) • Open cloud infrastructure that supports data redundancy with software implementations, such as Ceph* object storage. Hardware RAID for PCIe SSD is under development [Diagram: storage pool with data striped across Row A and data replicated across Row B]
  34. 34. 34 Data Center PCI Express* (PCIe) SSD Peak Power Challenges • Max Power: 100% Sequential Writes • Larger capacities have high concurrency and consume the most power (up to 25W!2) • Power varies >40% depending on capacity and workload • Consider UL touch safety standards when planning airflow designs or slot power limits3 [Chart: Modeled PCI Express* SSD Power1 (W) for large vs. small capacities across 100% Seq Write, 50/50 Seq Read/Write, 70/30 Seq Read/Write, and 100% Seq Read workloads] 1. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. For more information go to http://www.intel.com/performance 2. PCI Express* “Enterprise SSD Form Factor” specification requires 2.5” SSD maximum continuous power of <25W 3. See PCI Express* Base Specification, Revision 3.0, Section 6.9 for more details on Slot Power Limit Control. Attention needed for power supply, thermals, and SAFETY. Source: Intel
  35. 35. 35 PCI Express* SSD Enclosure Management • SSD Form Factor Specification (www.ssdformfactor.org) defines hot plug indicator uses and Out-of-Band management • PCI Express* Base Specification Rev. 3.0 defines enclosure indicators and registers intended for Hot-Plug management support (Registers: Device Capabilities, Slot Capabilities, Slot Control, Slot Status) • SFF-8485 standard defines the SGPIO enclosure management interface. Standardize PCI Express* SSD enclosure management
  36. 36. 36 Data Center PCI Express* (PCIe) SSD Accelerators • The data center ecosystem is capable: Implement PCI Express* SSDs now! • Prove out system implementations of design-in 2.5” PCIe SSDs • Understand Hot-Plug capabilities of your device, system and OS • Design thermal solutions with safety in mind • Collaborate on PCI Express SSD enclosure management standards. Drive PCI Express* data center adoption through education, collaboration, and careful software design
  38. 38. 38 PCI Express* for Data Center/Enterprise SSDs • PCI Express* (PCIe) is a great interface for SSDs – Stunning performance 1 GB/s per lane (PCIe Gen3 x1) – With PCIe scalability 8 GB/s per device (PCIe Gen3 x8) or more – Lower latency Platform+Adapter: 10 µsec down to 3 µsec – Lower power No external SAS IOC saves 7-10 W – Lower cost No external SAS IOC saves ~ $15 – PCIe lanes off the CPU 40 Gen3 (80 in dual socket) • HOWEVER, there is NO standard driver Fusion-io* Micron* LSI* Virident* Marvell* Intel OCZ* PCIe SSDs are emerging in Data Center/Enterprise, co-existing with SAS & SATA depending on application
  39. 39. 39 Next Generation NVM Technology Family Defining Switching Characteristics Phase Change Memory Energy (heat) converts material between crystalline (conductive) and amorphous (resistive) phases Magnetic Tunnel Junction (MTJ) Switching of magnetic resistive layer by spin-polarized electrons Electrochemical Cells (ECM) Formation / dissolution of “nano-bridge” by electrochemistry Binary Oxide Filament Cells Reversible filament formation by Oxidation-Reduction Interfacial Switching Oxygen vacancy drift diffusion induced barrier modulation Scalable Resistive Memory Element Resistive RAM NVM Options Cross Point Array in Backend Layers ~4λ² Cell Wordlines Memory Element Selector Device Many candidate next generation NVM technologies offer ~1000x speed-up over NAND.
  40. 40. 40 Fully Exploiting Next Generation NVM • With Next Generation NVM, the NVM is no longer the bottleneck – Need optimized platform storage interconnect – Need optimized software storage access methods * NVM Express is the interface architected for NAND today and next generation NVM
  42. 42. 42 Technical Basics • All parameters for 4KB command in single 64B command • Supports deep queues (64K commands per queue, up to 64K queues) • Supports MSI-X and interrupt steering • Streamlined & simple command set (13 required commands) • Optional features to address target segment (Client, Enterprise, etc.) – Enterprise: End-to-end data protection, reservations, etc. – Client: Autonomous power state transitions, etc. • Designed to scale for next generation NVM, agnostic to NVM type used http://www.nvmexpress.org/
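To illustrate "all parameters for a 4KB command in a single 64B command", here is a hedged sketch that packs an NVMe Read submission-queue entry. The field offsets follow the author's reading of the NVMe 1.1 specification (opcode at byte 0, command identifier at byte 2, NSID at byte 4, PRP1 at byte 24, starting LBA in CDW10-11, 0-based block count in CDW12) and should be checked against the spec before any real use.

```python
import struct

def build_read_cmd(cid, nsid, slba, nlb, prp1):
    """Pack a 64-byte NVMe Read submission-queue entry (a sketch; field
    offsets per the author's reading of the NVMe 1.1 spec)."""
    cmd = bytearray(64)
    cmd[0] = 0x02                              # opcode: Read
    struct.pack_into('<H', cmd, 2, cid)        # command identifier
    struct.pack_into('<I', cmd, 4, nsid)       # namespace ID
    struct.pack_into('<Q', cmd, 24, prp1)      # PRP entry 1 (data buffer)
    struct.pack_into('<Q', cmd, 40, slba)      # CDW10-11: starting LBA
    struct.pack_into('<H', cmd, 48, nlb - 1)   # CDW12: number of LBAs, 0-based
    return bytes(cmd)

# Everything needed for a 4KB read (8 x 512B LBAs) fits in one 64B entry.
cmd = build_read_cmd(cid=7, nsid=1, slba=0x1000, nlb=8, prp1=0xDEAD_B000)
assert len(cmd) == 64
```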
  43. 43. 43 Queuing Interface Command Submission & Processing Submission Queue Host Memory Completion Queue Host NVMe Controller Head Tail 1 Submission Queue Tail Doorbell Completion Queue Head Doorbell 2 3 4 Tail Head 5 6 7 8 Queue Command Ring Doorbell New Tail Fetch Command Process Command Queue Completion Generate Interrupt Process Completion Ring Doorbell New Head Command Submission 1. Host writes command to Submission Queue 2. Host writes updated Submission Queue tail pointer to doorbell Command Processing 3. Controller fetches command 4. Controller processes command *
  44. 44. 44 Queuing Interface Command Completion Submission Queue Host Memory Completion Queue Host NVMe Controller Head Tail 1 Submission Queue Tail Doorbell Completion Queue Head Doorbell 2 3 4 Tail Head 5 6 7 8 Queue Command Ring Doorbell New Tail Fetch Command Process Command Queue Completion Generate Interrupt Process Completion Ring Doorbell New Head Command Completion 5. Controller writes completion to Completion Queue 6. Controller generates MSI-X interrupt 7. Host processes completion 8. Host writes updated Completion Queue head pointer to doorbell *
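The eight steps across these two slides can be simulated with a toy model. This is pure software with a hypothetical class and queue size: in hardware the doorbells are MMIO registers and step 6 is an MSI-X interrupt, not a return value.

```python
# Toy model of the NVMe submission/completion flow (simulation only):
# host and controller share ring buffers and doorbell values.
from collections import deque

class ToyNvme:
    def __init__(self, qsize=16):
        self.sq = [None] * qsize     # submission queue in "host memory"
        self.cq = deque()            # completion queue
        self.sq_tail_db = 0          # tail doorbell, written by host
        self.sq_head = 0             # controller's fetch pointer
        self.qsize = qsize

    # Host side: steps 1-2
    def submit(self, cmd):
        tail = self.sq_tail_db
        self.sq[tail] = cmd                          # 1. write command to SQ
        self.sq_tail_db = (tail + 1) % self.qsize    # 2. ring tail doorbell

    # Controller side: steps 3-6
    def controller_run(self):
        while self.sq_head != self.sq_tail_db:
            cmd = self.sq[self.sq_head]              # 3. fetch command
            self.sq_head = (self.sq_head + 1) % self.qsize
            result = ('done', cmd)                   # 4. process command
            self.cq.append(result)                   # 5. queue completion
        return len(self.cq)                          # 6. (generate interrupt)

    # Host side: steps 7-8
    def reap(self):
        done = list(self.cq)                         # 7. process completions
        self.cq.clear()                              # 8. ring head doorbell
        return done

q = ToyNvme()
q.submit('read LBA 0')
q.submit('write LBA 8')
q.controller_run()
print(q.reap())   # [('done', 'read LBA 0'), ('done', 'write LBA 8')]
```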
  45. 45. 45 Simple Command Set – Optimized for NVM Admin Commands Create I/O Submission Queue Delete I/O Submission Queue Create I/O Completion Queue Delete I/O Completion Queue Get Log Page Identify Abort Set Features Get Features Asynchronous Event Request Firmware Activate (optional) Firmware Image Download (opt) Format NVM (optional) Security Send (optional) Security Receive (optional) NVM I/O Commands Read Write Flush Write Uncorrectable (optional) Compare (optional) Dataset Management (optional) Write Zeros (optional) Reservation Register (optional) Reservation Report (optional) Reservation Acquire (optional) Reservation Release (optional) Only 10 Admin and 3 I/O commands required
  47. 47. 47 Driver Development on Major OSes • Windows*: Windows* 8.1 and Windows* Server 2012 R2 include a native driver; open source driver in collaboration with OFA • Linux*: stable OS driver since Linux* kernel 3.10 • Unix: FreeBSD driver upstream • Solaris*: Solaris driver will ship in S12 • VMware*: vmklinux driver certified release in 1H 2014 • UEFI: open source driver available on SourceForge. Native OS drivers already available, with more coming!
  48. 48. 48 Windows* Open Source Driver Update • Release 1 (Q2 2012): 64-bit support on Windows* 7 and Windows Server 2008 R2; mandatory features • Release 1.1 (Q4 2012): added 64-bit support on Windows 8; public IOCTLs and Windows 8 Storport updates • Release 1.2 (Aug 2013): added 64-bit support on Windows Server 2012; signed executable drivers • Release 1.3 (March 2014): hibernation on boot drive; NUMA group support in core enumeration • Release 1.4 (Oct 2014): WHQL certification; Drive Trace feature, WVI command processing; migrate to VS2013, WDK8.1. Four major open source releases since 2012. Contributors include Huawei*, PMC-Sierra*, Intel, LSI* & SanDisk* https://www.openfabrics.org/resources/developer-tools/nvme-windows-development.html
  49. 49. 49 Linux* Driver Update Recent Features • Stable since Linux* 3.10; latest driver in 3.14 • Surprise hot plug/remove • Dynamic partitioning • Deallocate (i.e., Trim support) • 4KB sector support (in addition to 512B) • MSI support (in addition to MSI-X) • Disk I/O statistics Linux OS distributors’ support • RHEL 6.5 and Ubuntu 13.10 have native drivers • RHEL 7.0, Ubuntu 14.04LTS and SLES 12 will have the latest native drivers • SuSE is testing an external driver package for SLES11 SP3 Work in progress: power management, end-to-end data protection, sysfs manageability & NUMA /dev/nvme0n1
  50. 50. 50 FreeBSD Driver Update • NVM Express* (NVMe) support is upstream in the head and stable/9 branches • FreeBSD 9.2, released in September, is the first official release with NVMe support. FreeBSD NVMe modules: nvme (core NVMe driver), nvd (NVMe/block layer shim), nvmecontrol (user space utility, including firmware update)
  51. 51. 51 Solaris* Driver Update • Current Status from Oracle* team - Fully compliant with 1.0e spec - Direct block interfaces bypassing complex SCSI code path - NUMA optimized queue/interrupt allocation - Support x86 and SPARC platform - A command line tool to monitor and configure the controller - Delivered to S12 and S11 Update 2 • Future Development Plans - Boot & install on SPARC and X86 - Surprise removal support - Shared hosts and multi-pathing
  52. 52. 52 VMware Driver Update • vmklinux-based driver development is complete – First release in mid-Oct 2013 – Public release will be 1H 2014 • A native VMware* NVMe driver is available for end user evaluations • VMware’s I/O Vendor Partner Program (IOVP) offers members a comprehensive set of tools, resources and processes needed to develop, certify and release software modules, including device drivers and utility libraries for VMware ESXi
  53. 53. 53 UEFI Driver Update • The UEFI 2.4 specification available at www.UEFI.org contains updates for NVM Express* (NVMe) • An open source version of an NVMe driver for UEFI is available at nvmexpress.org/resources “AMI is working with vendors of NVMe devices and plans for full BIOS support of NVMe in 2014.” Sandip Datta Roy VP BIOS R&D, AMI NVMe boot support with UEFI will start percolating releases from Independent BIOS Vendors in 2014
  55. 55. 55 NVMe Promoters “Board of Directors” Technical Workgroup Queueing Interface Admin Command Set NVMe I/O Command Set Driver Based Management Current spec version: NVMe 1.1 Management Interface Workgroup In-Band (PCIe) and Out-of-Band (SMBus) PCIe SSD Management First specification will be Q3, 2014 NVM Express Organization Architected for Performance
  59. 59. 59 NVM Express 1.1 Overview • The NVM Express 1.1 specification, published in October of 2012, adds additional optional client and Enterprise features Power Optimizations • Autonomous Power State Transitions Command Enhancements • Scatter Gather List support • Active Namespace Reporting • Persistent Features Across Power States • Write Zeros Command Multi-path Support • Reservations • Unique Identifier per Namespace • Subsystem Reset
  60. 60. 60 Multi-path Support • Multi-path includes the traditional dual port model • With PCI Express*, it extends further with switches
  61. 61. 61 Reservations • In some multi-host environments, like Windows* clusters, reservations may be used to coordinate host access • NVMe 1.1 includes a simplified reservations mechanism that is compatible with implementations that use SCSI reservations • What is a reservation? Enables two or more hosts to coordinate access to a shared namespace. – A reservation may allow Host A and Host B access, but disallow Host C Namespace NSID 1 NVM Express Controller 1 Host ID = A NSID 1 NVM Express Controller 2 Host ID = A NSID 1 NVM Express Controller 3 Host ID = B NSID 1 Host A Host B Host C NVM Subsystem NVM Express Controller 4 Host ID = C
  62. 62. 62 Power Optimizations • NVMe 1.1 added the Autonomous Power State Transition feature for client power focused implementations • Without software intervention, the NVMe controller transitions to a lower power state after a certain idle period – Idle period prior to transition programmed by software. Example Power States: State 0 (operational, 4 W max power, 10 µs entrance latency, 10 µs exit latency); State 1 (non-operational, 10 mW, 10 ms entrance, 5 ms exit); State 2 (non-operational, 1 mW, 15 ms entrance, 30 ms exit). Transitions: Power State 0, then Power State 1 after 50 ms idle, then Power State 2 after 500 ms idle
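The example table can be expressed as a small policy model. This is a sketch using the slide's example thresholds only; in a real device the idle periods are programmed by the host (via Set Features) and the controller then transitions on its own.

```python
# Sketch of the autonomous power state transition policy described above:
# the host programs idle thresholds once, then the controller drops to
# lower states by itself, with no further software intervention.
POWER_STATES = [  # (state, operational, max_power_mW, idle_threshold_ms)
    (0, True,  4000,    0),   # active
    (1, False,   10,   50),   # entered after 50 ms idle (example)
    (2, False,    1,  500),   # entered after 500 ms idle (example)
]

def state_after_idle(idle_ms):
    """Deepest power state whose idle threshold has been met."""
    eligible = [s for s, _, _, thr in POWER_STATES if idle_ms >= thr]
    return max(eligible)

assert state_after_idle(0) == 0       # busy: stay in PS0 (4 W)
assert state_after_idle(60) == 1      # past 50 ms idle: PS1 (10 mW)
assert state_after_idle(2000) == 2    # past 500 ms idle: PS2 (1 mW)
```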
  63. 63. 63 Continuing to Advance NVM Express • NVM Express continues to add features to meet the needs of client and Enterprise market segments as they evolve • The Workgroup is defining features for the next revision of the specification, expected ~ middle of 2014 Features for Next Revision Namespace Management Management Interface Live Firmware Update Power Optimizations Enhanced Status Reporting Events for Namespace Changes … Get involved – join the NVMe Workgroup nvmexpress.org
  65. 65. 65 Considerations for PCI Express* SSD with NVM Express (NVMe SSD) • NVMe driver assistance? • S.M.A.R.T/Management? • Performance scalability? • PCIe SSD vs SATA SSDs? • PCIe SSD grades? • Software optimizations? NVMe SSDs are on the way to the Data Center
  66. 66. 66 PCI Express* SSD vs Multi SATA* SSDs SATA SSD advantages • Mature hardware RAID/adapter for management of SSDs • Mature technology/ecosystem for SSDs • Cost & performance balance Quick Performance Comparison • Random WRITE IOPS: 6 x S3700 = one PCIe SSD 1.6T (4 lanes, Gen3) • Random READ IOPS: ~8 x S3700 = 1 x PCIe SSD Mixed Use of PCIe and SATA SSDs • Hot-pluggable 2.5” PCIe SSD has the same maintenance advantage as SATA SSD • TCO: balance performance and cost. Performance of 6~8 Intel S3700 SSDs is close to 1x PCIe SSD on 4K random workloads (IOPS) [Chart: IOPS at 100% read, 50% read, and 0% read for 6x800GB Intel S3700 vs 1x NVMe 1600GB] Measurements made on Hanlan Creek (Intel S5520HC) system with two Intel Xeon X5560 @ 2.93GHz and 12GB (per CPU) Mem running RHEL6.4 O/S; Intel S3700 SATA Gen3 SSDs are connected to LSI* HBA 9211; NVMe SSD is under development; data collected by FIO* tool
  67. 67. 67 Example, PCIe/SATA SSDs in one system 1U 4x 2.5” PCIe SSDs + 4xSATA SSDs
  68. 68. 68 Selections of PCI Express* SSD with NVM Express (NVMe SSD) • High Endurance Technology (HET) PCIe SSD: applications with intensive random write workloads, typically a high percentage of small block random writes, such as critical databases, OLTP… • Middle Tier PCIe SSD: applications that need random write performance and endurance, but much lower than HET PCIe SSD; typical workloads are <70% random writes • Low cost PCIe SSD: same read performance as above, but with 1/10th of HET write performance and endurance; for applications with intensive read workloads, such as search engines etc. The application determines cost and performance
  75. 75. 75 Optimizations of PCI Express* SSD with NVM Express, NVMe SSD NVMe Administration  Controller capability/identify  NVMe features  Asynchronous Event  NVMe logs Optional IO Command  Data Set management (Trim) NVMe IO Threaded structure  Understand number of CPU logic cores in your system  Write multi-Thread application programs  No need for handling rq_affinity Write NVMe friendly applications
  76. 76. 76 Optimizations of PCI Express* SSD with NVM Express (cont.) IOPS performance • Choose a higher number of threads ( < min(number of system CPU cores, SSD controller maximum allocated queues)) • Choose a low queue depth for each thread (asynchronous IO) • Avoid using a single thread with a much higher Queue Depth (QD), especially for small transfer blocks • Example: for 4K random read on one drive in a system with 8 CPU cores, use 8 threads with Queue Depth (QD)=16 per thread instead of a single thread with QD=128. Latency • Lower QD for better latency • For intensive random writes, there is a sweet spot of threads & QD that balances performance and latency • Example: 4K random write in an 8-core system, threads=8, sweet QD is 4 to 6. Sequential vs Random workload • Multi-threaded sequential workloads may turn into random workloads at the SSD side. Use Multiple Threads with Low Queue Depth
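The thread/queue-depth rule of thumb above can be captured in a tiny helper. This is a heuristic sketch: the function name and default depths are the author's, taken from the slide's examples, and real tuning should always be done by benchmarking the actual drive.

```python
def pick_parallelism(cpu_cores, ctrl_queues, workload='randread'):
    """Starting-point thread count and per-thread queue depth, per the
    guidance above (a heuristic sketch, not a measured optimum)."""
    # One thread per CPU core, bounded by the controller's queue allocation.
    threads = min(cpu_cores, ctrl_queues)
    if workload == 'randwrite':
        qd = 4     # low-QD "sweet spot" for intensive random writes
    else:
        qd = 16    # deeper per-thread QD is fine for random reads
    return threads, qd

# 8-core host, controller exposing 31 I/O queues:
print(pick_parallelism(8, 31))                # (8, 16)
print(pick_parallelism(8, 31, 'randwrite'))   # (8, 4)
```

With the FIO tool mentioned in this deck, the first example corresponds roughly to `--rw=randread --bs=4k --numjobs=8 --iodepth=16 --ioengine=libaio`.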
  77. 77. 77 NVM Express (NVMe) Driver beyond NVMe Specification NVMe Linux driver is open source LBA0……………………..LBA255 LBA256…………..…..LBA511 LBA512……………….LBA767 LBA768……………..LBA1023 LBA1024…………….………..etc. …etc… Core 0 Core 1 • Driver Assisted Striping – Dual core NVMe controller each core maintains separate NAND array and striped LBA ranges (like RAID 0) – Driver can enforce all commands fall within KB stripe, ensuring maximum performance • Contribute to NVMe driver
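Driver-assisted striping as drawn above (LBA ranges alternating between two controller cores, like RAID 0) comes down to simple modular arithmetic. This sketch assumes the 256-LBA stripe size suggested by the figure's LBA0..LBA255 ranges; the real stripe size is device-specific.

```python
# Sketch of driver-assisted striping: which core owns an LBA, and whether
# a command straddles a stripe boundary (assumed 256-LBA stripes, 2 cores).
STRIPE_LBAS = 256
CORES = 2

def owning_core(lba):
    """Controller core that services this LBA under the striping above."""
    return (lba // STRIPE_LBAS) % CORES

def crosses_stripe(slba, nlb):
    """True if a command straddles a stripe boundary. A driver enforcing
    stripe alignment would split such a command into two aligned ones."""
    return (slba // STRIPE_LBAS) != ((slba + nlb - 1) // STRIPE_LBAS)

assert owning_core(0) == 0 and owning_core(256) == 1 and owning_core(512) == 0
assert not crosses_stripe(0, 256)   # exactly one stripe: maximum performance
assert crosses_stripe(250, 16)      # LBAs 250..265 span two stripes: split
```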
  78. 78. 78 S.M.A.R.T and Management • Use PCIe in-band commands to get the SSD SMART log (NVMe log): statistical data, status, warnings, temperature, endurance indicator • Use Out-of-Band SMBus to access VPD EEPROM, vendor information • Use Out-of-Band SMBus temperature sensor for closed-loop thermal control (fan speed). NVMe standardizes S.M.A.R.T. on PCIe SSD
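A minimal parser for the start of the SMART / Health log retrieved in-band could look like the sketch below. The offsets follow the author's reading of the NVMe spec (byte 0: critical warning bits; bytes 1-2: composite temperature in Kelvin; byte 3: available spare percentage) and should be verified against the specification.

```python
import struct

def parse_smart_head(log):
    """Decode the first fields of the NVMe SMART / Health log page
    (offsets per the author's reading of the spec; a sketch only)."""
    crit = log[0]                                # critical warning bitmask
    temp_k, = struct.unpack_from('<H', log, 1)   # composite temp, Kelvin
    spare = log[3]                               # available spare, percent
    return {
        'critical_warning': crit,
        'temp_c': temp_k - 273,
        'available_spare_pct': spare,
    }

# Fabricated 512-byte log: no warnings, 311 K (38 C), 100% spare.
log = bytearray(512)
struct.pack_into('<H', log, 1, 311)
log[3] = 100
print(parse_smart_head(log))
# {'critical_warning': 0, 'temp_c': 38, 'available_spare_pct': 100}
```

On Linux, the driver exposes the raw log through an ioctl on the character device (e.g. /dev/nvme0); this parser only illustrates the layout of the first few bytes.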
  79. 79. 79 Scalability of Multi-PCI Express* SSDs with NVM Express Performance on 4 PCIe SSDs = Performance on 1 PCIe SSD X 4 Advantage of NVM Express threaded and MSI-X structure! 100% random read 0.00 2.00 4.00 6.00 8.00 10.00 12.00 4K 8K 16K 64k 1xNVMe 1600GB 2xNVMe 1600GB 4xNVMe 1600GB GB/s 0 0.5 1 1.5 2 2.5 3 3.5 4K 8K 16K 64k 1xNVMe 1600GB 2xNVMe 1600GB 4xNVMe 1600GB GB/s 100% random write Measurements made on Intel system with two Intel Xeon™ CPU E5-2680 v2@ 2.80GHz and 32GB Mem running RHEL6.5 O/S, NVMe SSD is under development, data collected by FIO* tool, numJob=30, queue depth (QD)=4 (read), QD=1 (write), libaio. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  80. 80. 80 PCI Express* SSD with NVM Express (NVMe SSD) deployments
     SSDs are a disruptive technology, approaching "The Chasm"
     Adoption success relies on clear benefit, simplification, and ease of use
     Source: Geoffrey Moore, Crossing the Chasm
  81. 81. 81 Summary
     • PCI Express* SSD enables lower latency and further alleviates the I/O bottleneck
     • NVM Express is the interface architected for PCI Express* SSD, the NAND flash of today, and the next-generation NVM of tomorrow
     • Promote and adopt PCIe SSD with NVMe as a mainstream technology, and get ready for the next generation of NVM
  82. 82. 82 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Intel, Xeon, Look Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright ©2014 Intel Corporation.
  83. 83. 83 Risk Factors The above statements and any others in this document that refer to plans and expectations for the first quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. 
Revenue and the gross margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to technological developments and to incorporate new features into its products. The gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. 
Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release. Rev. 1/16/14
