
The Consequences of Infinite Storage Bandwidth: Allen Samuels, SanDisk


Audience: Beginner to Intermediate

About: Overall increases in CPU and DRAM processing power are falling behind the massive acceleration in available storage and network bandwidth. Storage management services are emerging as a serious bottleneck. What does this imply for the datacenter of the future? How will it affect the physical network and storage topologies? And how will storage software need to change to meet these new realities?

Speaker Bio: Allen joined SanDisk in 2013 as an Engineering Fellow and is responsible for directing software development for SanDisk’s system-level products. He previously served as Chief Architect at Weitek Corp. and Citrix, and founded several companies, including AMKAR Consulting, Orbital Data Corporation, and Cirtas Systems. Allen holds a Bachelor of Science in Electrical Engineering from Rice University.

http://www.australiaday.openstack.org.au

OpenStack Australia Day - Sydney 2016
http://australiaday.openstack.org.au/sydney-2016/


The Consequences of Infinite Storage Bandwidth: Allen Samuels, SanDisk

  1. The Consequences of Infinite Storage Bandwidth. Allen Samuels, Engineering Fellow, Systems and Software Solutions. May 5, 2016
  2. Disclaimer: During the presentation today, we may make forward-looking statements. Any statement that refers to expectations, projections, or other characterizations of future events or circumstances is a forward-looking statement, including those relating to industry predictions and trends, future products and their projected availability, and evolution of product capacities. Actual results may differ materially from those expressed in these forward-looking statements due to a number of risks and uncertainties, including, among others: industry predictions may not occur as expected, products may not become available as expected, and products may not evolve as expected; and the factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including, but not limited to, our annual report on Form 10-K for the year ended January 3, 2016. This presentation contains information from third parties, which reflects their projections as of the date of issuance. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof or the date of issuance by a third party.
  3. What Do I Mean By Infinite Bandwidth?
  4. Network, Storage and DRAM Trends (log scale) • Use DRAM bandwidth as a proxy for CPU throughput • A reasonable approximation for DMA-heavy and/or poor cache-hit workloads (e.g. storage) • Big difference in slope! Data is for informational purposes only and may contain errors
  5. Network, Storage and DRAM Trends (linear scale; chart annotated “Infinite Storage Bandwidth”) • Same data as the last slide, but for the log-impaired • Storage bandwidth is not literally infinite • But the ratio of network and storage bandwidth to CPU throughput is widening very quickly. Data is for informational purposes only and may contain errors
  6. Chart: SSDs / CPU Socket vs. Year (1990-2025). Data is for informational purposes only and may contain errors
  7. Chart: SSDs / CPU Socket @ 20% Max BW vs. Year (1995-2025). Data is for informational purposes only and may contain errors
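The ratio these two charts plot is just available CPU-side bandwidth divided by per-SSD bandwidth. A minimal back-of-envelope sketch of that calculation; the input figures are illustrative assumptions, not the data behind the slides:

```python
# Back-of-envelope version of the "SSDs / CPU socket" ratio the charts plot.
# Both inputs are illustrative assumptions, not SanDisk's data: a socket with
# ~75 GB/s of usable DRAM bandwidth (the talk's proxy for CPU throughput) and
# an NVMe SSD able to stream ~3 GB/s.

DRAM_BW_GBS = 75.0   # assumed usable DRAM bandwidth per socket, GB/s
SSD_BW_GBS = 3.0     # assumed max bandwidth of one SSD, GB/s
UTILIZATION = 0.20   # run each SSD at only 20% of its max bandwidth

ssds_full = DRAM_BW_GBS / SSD_BW_GBS
ssds_20pct = DRAM_BW_GBS / (SSD_BW_GBS * UTILIZATION)

print(f"SSDs one socket can keep fed at full speed: {ssds_full:.0f}")
print(f"SSDs one socket can keep fed at 20% of max BW: {ssds_20pct:.0f}")
```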
  8. What happens as we get closer to the limit?
  9. Let’s Get Small! • New, denser server form factors – Blades – Sleds • Good short-term solutions
  10. Effects of the CPU/DRAM Bottleneck • Storage cost = media + access + management • Shared-nothing architecture conflates access and management • Storage costs will become dominated by management cost • Storage costs become CPU/DRAM costs
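A toy version of that cost equation, with invented dollar figures purely to illustrate why the management term comes to dominate as media gets cheaper:

```python
# Toy model of the slide's equation: storage cost = media + access + management.
# Every dollar figure is invented for illustration only; the point is the trend,
# not the values: media $/TB falls fast, the CPU/DRAM-bound management share
# barely moves, so management ends up dominating total cost.

media_per_tb  = {2016: 300.0, 2020: 120.0, 2024: 50.0}   # assumed flash media $/TB
access_per_tb = {2016: 30.0,  2020: 20.0,  2024: 15.0}   # assumed HBA/NIC $/TB
mgmt_per_tb   = {2016: 60.0,  2020: 55.0,  2024: 50.0}   # assumed CPU/DRAM $/TB

for year in sorted(media_per_tb):
    total = media_per_tb[year] + access_per_tb[year] + mgmt_per_tb[year]
    share = 100.0 * mgmt_per_tb[year] / total
    print(f"{year}: management is {share:.0f}% of ${total:.0f}/TB total")
```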
  11. Embracing the CPU/DRAM Bottleneck • Move management to upper layers, where CPU can be right-sized by the client • What kind of media access do I want? – Simple enough functionality to be done directly in drive hardware – NO CPU – Allow direct access throughout the compute cluster over a network – Just enough machinery to enable coarse-grained sharing • In short, you really want a SAN! – Or, more technically, fabric-connected storage
  12. Not Your Father’s SAN • Three problems with the current SAN: – Fibre Channel transport – SCSI access protocol – Drive-oriented storage allocation • All of these want to be updated: – Fibre Channel is brittle and costly – SCSI initiators have long code paths catering to seldom-used configurations – Storage allocation needs to be robust and sub-drive granular
  13. SAN 2.0 • NVMe over Fabrics • 1.0 spec is out for review, hopefully done in May • Simple enough for direct hardware execution of data-path ops • Minimal initiator code-path lengths improve performance • Namespaces allow sub-drive allocations • Not mature enough for enterprise deployment – yet
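The “namespaces allow sub-drive allocations” point is the structural change versus drive-oriented SANs. A minimal sketch of that idea follows; it is simplified bookkeeping only, not the NVMe over Fabrics wire protocol, and the class and NQN strings are hypothetical:

```python
# Simplified sketch of namespace-based sub-drive allocation: one fabric-attached
# drive is carved into slices that individual hosts attach directly, so sharing
# is coarse-grained and needs no storage-side CPU in the data path.
# Illustrative bookkeeping only, not the NVMe over Fabrics protocol.

class FabricDrive:
    def __init__(self, capacity_gb: int):
        self.capacity_gb = capacity_gb
        self.free_gb = capacity_gb
        self.namespaces = {}      # nsid -> (size_gb, host NQN allowed to attach)
        self._next_nsid = 1

    def create_namespace(self, size_gb: int, host_nqn: str) -> int:
        """Carve a sub-drive slice out of free capacity and assign it to one host."""
        if size_gb > self.free_gb:
            raise ValueError("not enough free capacity on this drive")
        nsid = self._next_nsid
        self._next_nsid += 1
        self.namespaces[nsid] = (size_gb, host_nqn)
        self.free_gb -= size_gb
        return nsid

# Two hosts share one 8 TB drive without ever sharing a namespace (NQNs hypothetical).
drive = FabricDrive(capacity_gb=8000)
drive.create_namespace(2000, "nqn.2016-05.io.example:host-a")
drive.create_namespace(2000, "nqn.2016-05.io.example:host-b")
print(drive.namespaces, "free:", drive.free_gb, "GB")
```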
  14. SAN 2.0 • What storage network? – Current candidates are FC, InfiniBand and Ethernet • Ethernet has the best economics – if you can make it work • RoCE is easy on the edge, but hard on the interior – Only controlled environments have shown multi-switch scalability – General scalability in a multi-vendor environment is likely to be difficult – Wonderful for intra-rack storage networking • iWARP is hard on the edge, but easy on the interior – Scarcity of implementations inhibits deployment • Storage over IP will see limited cross-rack deployment until this is resolved
  15. First Generation of SAN 2.0 • Implementations using off-the-shelf parts are in progress • Server-side implementations look pretty conventional too • 4-5 MIOPS have been shown • Seems like 10 MIOPS isn’t unreasonable to expect (Diagram: NIC, CPU, DRAM and SSD connected over PCIe)
  16. Second Generation SAN 2.0 • Soon, NICs will forward NVMe operations to local PCIe devices • CPU removed from the software part of the data path • CPU is still needed for the hardware part of the data path • IOPS improve, BW is unchanged • Significant CPU freed for application processing • Getting closer to the wall!
  17. Third Generation SAN 2.0, Imagined • New generation of combined SSD controller and NIC – Rethink of interfaces eliminates DRAM buffering • Network goes right into the drive • No CPU to be found • Works well with rack-scale architecture
  18. Let’s Get Really Small • Disaggregated / rack-scale architecture – Fabric connected – Independently scale compute, networking and storage
  19. Call to Action • Fabric-connected storage isn’t well managed by existing FOSS • Lots of upper-layer management software is available – OpenStack, Ceph, Gluster, Cassandra, MongoDB, Sheepdog, etc. • Lower-layer cluster management is still primitive
  20. What’s It All Mean? • New form factors are in everybody's future • The coming avalanche of storage bandwidth wants to be free – Not imprisoned by a CPU • Rack-scale architecture allows new storage/compute configs • Storage will be increasingly “software defined” as the HW evolves
  21. Product Pitch!
  22. Software-Defined All-Flash Storage: the disaggregated model for scale • Old model – Monolithic, large upfront investments, and fork-lift upgrades – Proprietary storage OS – Costly: $$$$$ • New SD-AFS model – Disaggregate storage, compute, and software for better scaling and costs – Best-in-class solution components – Open-source software, no vendor lock-in – Cost-efficient: $
  23. InfiniFlash™ Storage Platform • Capacity: 512TB raw, all flash, in a 3U JBOD of Flash (JBOF); up to 64 x 8TB SAS drive cards (4TB cards also available soon) • Scalable raw performance: 2M IOPS, 1-3 ms latency, 12-15 GB/s throughput • 8TB flash-card innovations: enterprise-grade power-fail safety, alerts & monitoring, integrated & monitored latching, direct air-temperature sampling, form factor enables lowest-cost SSD • Operational efficiency & resilience: hot-swappable architecture, easy FRU; low power – typical workload 400-500W, 150W (idle) to 750W (max); MTBF 1.5+ million hours; hot-swappable fans, SAS expander boards, power supplies and flash cards • Host connectivity: connect up to 8 servers through 8 SAS ports, multi-path enabled
  24. InfiniFlash IF500 All-Flash Storage System: Block and Object Storage Powered by Ceph • Ultra-dense, high-capacity flash storage – 512TB in 3U; scale-out software for PB-scale capacity • Highly scalable performance – industry-leading IOPS/TB • Cinder, Glance and Swift storage – Add/remove servers & capacity on demand • Enterprise-class storage features – Automatic rebalancing – Hot software upgrade – Snapshots, replication, thin provisioning – Fully hot swappable, redundant • Ceph optimized for SanDisk flash – Tuned & hardened for InfiniFlash
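Since the IF500 exposes block storage through Ceph, the client-side view is plain librbd. A minimal sketch using the Ceph Python bindings, assuming python-rados/python-rbd are installed, the cluster described by /etc/ceph/ceph.conf is reachable, and a pool named 'rbd' exists; the pool and image names are placeholders, not part of the product:

```python
# Minimal librbd sketch: create and touch a block image on a Ceph-backed pool.
# Assumes python-rados / python-rbd are installed, /etc/ceph/ceph.conf points
# at a reachable cluster, and a pool named 'rbd' exists. Names are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                 # pool backed by the flash OSDs
try:
    rbd.RBD().create(ioctx, 'demo', 4 * 1024**3)  # 4 GiB thin-provisioned image
    image = rbd.Image(ioctx, 'demo')
    image.write(b'hello flash', 0)                # write 11 bytes at offset 0
    print(image.read(0, 11))
    image.close()
finally:
    ioctx.close()
    cluster.shutdown()
```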
  25. InfiniFlash SW + HW Advantage • Software tuned for hardware – Ceph modifications for flash – Both Ceph and the host OS tuned for InfiniFlash – SW defects that impact flash identified & mitigated • Hardware configured for software – Right balance of CPU, RAM and storage – Rack-level designs for optimal performance & cost • Software designed for all systems does not work well with any system – Ceph has over 50 tuning parameters that result in a 5x-6x performance improvement – Fixed-CPU, fixed-RAM hyperconverged nodes do not work well for all workloads
  26. InfiniFlash for OpenStack with Disaggregation • Compute & storage disaggregation enables optimal resource utilization – Allows for the higher CPU usage required by OSDs with small-block workloads – Allows for higher bandwidth provisioning as required for large-object workloads • Independent scaling of compute and storage – Higher storage capacity needs don’t force you to add more compute, and vice versa – Leads to optimal ROI for PB-scale OpenStack deployments (Diagram: a compute farm running Nova with Cinder & Glance via QEMU/KVM and librbd, a Swift object store via RGW, and iSCSI targets via krbd, connected to a storage farm of InfiniFlash HSEBs running OSDs over SAS)
  27. IF500 – Enhancing Ceph for Enterprise Consumption • IF500 provides usability and performance utilities without sacrificing open-source principles – The SanDisk Ceph distro ensures packaging with stable, production-ready code of consistent quality – All Ceph performance improvements developed by SanDisk are contributed back to the community • Available with the SanDisk distribution or the community distribution: – Out-of-the-box configurations tuned for performance with flash – Sizing & planning tool – InfiniFlash drive management integrated into Ceph management (coming soon) – Ceph installer built specifically for InfiniFlash – High-performance iSCSI storage – Better diagnostics with a log-collection tool – Enterprise-hardened SW + HW QA
  28. InfiniFlash Performance Advantage: 900K random-read IOPS with 384TB of storage • Flash performance unleashed – Out-of-the-box configurations tuned for performance with flash – Read & write data-path changes for flash – 3x-12x block performance improvement, depending on workload – Almost linear performance scaling as InfiniFlash nodes are added – Write performance is work in progress with NV-RAM journals • Test setup – Measured with 3 InfiniFlash nodes of 128TB each – Average latency with 4K blocks is ~2ms; 99.9th-percentile latency is under 10ms – For lower block sizes, performance is CPU-bound at the storage node – Maximum bandwidth of 12.2 GB/s measured toward 64KB blocks
  29. InfiniFlash Ceph Performance Advantage • Single InfiniFlash unit performance – 1 x 512TB InfiniFlash unit connected to 8 nodes – 4K random-read IOPS: ~1 million, 85% of bare-metal performance (the corresponding bare-metal IF100 figure is 1.1 million IOPS) – All 8 host CPUs saturate for 4K random reads; more performance potential with more CPU cycles – With 64K IO size we are able to utilize the full IF150 bandwidth of over 12 GB/s – librbd and krbd performance are comparable – Write performance is on a 3x-copy configuration; the more common 2x copy will result in a 33% improvement • Random read block performance (librbd): 4K 1,123,175 IOPS; 64K 349,247 IOPS; 256K 87,369 IOPS • Random write (librbd): 4K 54K IOPS; 64K 34K IOPS; 256K 11.3K IOPS
  30. InfiniFlash Ceph Performance Advantage • Linear scaling with 2 InfiniFlash units – 2 x 512TB InfiniFlash units connected to 16 nodes – 1.8M 4K IOPS, 80% of the bare-metal performance – Performance scales almost linearly, nearly doubling the single-IF150 Ceph result – Write performance with the 16-node cluster is 2x that of the 8-node cluster • Random read (librbd): 4K 1800K IOPS (7194 MB/s); 64K 225K IOPS (14,412 MB/s); 256K 53K IOPS (13,366 MB/s)
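As a quick sanity check on the numbers above, bandwidth should be roughly IOPS times block size. A small sketch, assuming binary KiB block sizes and decimal MB/s, reproduces the stated bandwidth column to within a few percent (the gap comes from the rounded IOPS values):

```python
# Cross-check the slide's random-read figures: bandwidth ~= IOPS x block size.
# Block sizes are assumed to be binary KiB and bandwidth decimal MB/s.

stated = {            # block size -> (IOPS, stated MB/s) from the slide
    "4k":   (1800000, 7194),
    "64k":  (225000, 14412),
    "256k": (53000, 13366),
}
block_bytes = {"4k": 4 * 1024, "64k": 64 * 1024, "256k": 256 * 1024}

for bs, (iops, stated_mbps) in stated.items():
    derived_mbps = iops * block_bytes[bs] / 1e6
    print(f"{bs:>4}: derived {derived_mbps:,.0f} MB/s vs stated {stated_mbps:,} MB/s")
```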
  31. InfiniFlash OS – Hardened, Enterprise-Class Ceph • Hardened and tested for hyperscale deployments and workloads • Platform-focused testing enables us to deliver a complete and hardened storage solution • Single-vendor support for both hardware & software • Enterprise-level hardening: 9,000 hours of cumulative IO tests; 1,100+ unique test cases; 1,000 hours of cluster-rebalancing tests; 1,000 hours of IO on iSCSI • Testing at scale: clusters of over 100 server nodes; over 4PB of flash storage • Failure testing: 2,000-cycle node reboot; 1,000 abrupt node power cycles; 1,000 storage failures; 1,000 network failures; IO for 250 hours at a stretch
  32. IF500 Reference Configurations
  – Model: Entry / Mid / High
  – InfiniFlash: 128TB / 256TB / 512TB
  – Servers(1): 2 x Dell R630 2U / 4 x Dell R630 2U / 4 x Dell R630 2U(2)
  – Processor per server: dual-socket Intel Xeon E5-2690 v3 (all models)
  – Memory per server: 128GB RAM (all models)
  – HBA per server: (1) LSI 9300-8e PCIe 12Gbps (all models)
  – Network per server: (1) Mellanox ConnectX-3 dual-port 40GbE (all models)
  – Boot drives per server: (2) SATA 120GB SSD (all models)
  Notes: (1) For larger-block or less CPU-intensive workloads, the OSD node could use a single-socket server; Dell servers can be substituted with other vendors' servers that match the specs. (2) For small-block workloads, 8 servers are recommended.
  33. InfiniFlash TCO Advantage (Charts: 3-year TCO comparison* of TCA plus 3-year opex, and total racks, for: traditional object store on HDD; IF500 object store with 3 full replicas on flash; IF500 with EC, all flash; IF500 with flash primary & HDD copies) • Reduce the replica count with the higher reliability of flash – 2 copies on InfiniFlash vs. 3 copies on HDD • InfiniFlash’s disaggregated architecture reduces compute usage, thereby reducing HW & SW costs – Flash allows the use of an erasure-coded storage pool without performance limitations – Protection equivalent to 2x storage with only 1.2x storage • Power, real-estate and maintenance cost savings over a 5-year TCO *TCO analysis based on a US customer’s OPEX & cost data for a 100PB deployment
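The protection-overhead claim above is simple arithmetic. A small sketch of the raw capacity each scheme implies for the 100PB deployment mentioned in the footnote; the schemes and overhead factors are the ones named on the slide:

```python
# Raw-capacity arithmetic behind the TCO comparison: how much raw storage a
# 100 PB usable deployment needs under each protection scheme the slide names
# (3x replicas on HDD, 2x replicas on flash, ~1.2x erasure coding on flash).

usable_pb = 100
overhead = {
    "3 full replicas on HDD": 3.0,
    "2 full replicas on InfiniFlash": 2.0,
    "erasure coding on flash": 1.2,
}

for scheme, factor in overhead.items():
    print(f"{scheme}: {usable_pb * factor:.0f} PB raw for {usable_pb} PB usable")
```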
  34. ©2016 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. Other brands mentioned herein are for identification purposes only and may be the trademarks of their holder(s).
