
Ceph Day Beijing - SPDK for Ceph

Ziye Yang, Senior Software Engineer, Intel


  1. 1. Ziye Yang, Senior Software Engineer
  2. 2. Notices and Disclaimers Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. No computer system can be absolutely secure. For more complete information visit http://www.intel.com/performance. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.
  3. 3. • SPDK introduction and status update • Current SPDK support in BlueStore • Case study: Accelerate iSCSI service exported by Ceph • SPDK support for Ceph in 2017 • Summary
  4. 4. The Problem: Software is becoming the bottleneck. The Opportunity: Use Intel software ingredients to unlock the potential of new media.
     Media latency and I/O performance: HDD (<500 IO/s, >2 ms) → SATA NAND SSD (>25,000 IO/s, <100 µs) → NVMe* NAND SSD (>400,000 IO/s, <100 µs) → Intel® Optane™ SSD.
  5. 5. Storage Performance Development Kit
     Scalable and Efficient Software Ingredients: • User space, lockless, polled-mode components • Up to millions of IOPS per core • Designed for Intel Optane™ technology latencies
     Intel® Platform Storage Reference Architecture: • Optimized for Intel platform characteristics • Open source building blocks (BSD licensed) • Available via spdk.io
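A minimal sketch of what slide 5's "user space, lockless, polled-mode" model looks like in code (not from the deck): one thread submits a read through SPDK's user-space NVMe driver and polls its own queue pair for the completion. It assumes a controller and namespace were already attached via spdk_nvme_probe(), and omits environment setup and error handling.

```c
#include <stdbool.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static bool io_done;

static void read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    io_done = true;                 /* completion callback runs from the polling loop below */
}

static void read_first_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    /* One I/O queue pair per thread: no locks, no interrupts. */
    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL);

    /* Submit a 1-LBA read at LBA 0; the callback fires while we poll. */
    spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* LBA count */,
                          read_complete, NULL, 0);

    /* Polled mode: the application drives completions instead of taking IRQs. */
    while (!io_done) {
        spdk_nvme_qpair_process_completions(qpair, 0 /* no limit */);
    }

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
}
```

Because the submitting thread also reaps completions from its private queue pair, there are no interrupts, context switches, or shared locks on the I/O path, which is where the "millions of IOPS per core" headroom comes from.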
  6. 6. Architecture (each component marked in the original diagram as either released in Q2’17 or pathfinding):
     Storage Protocols: iSCSI Target; NVMe-oF* Target; SCSI; vhost-scsi Target; vhost-blk Target
     Storage Services: Block Device Abstraction (BDEV); Blobstore; BlobFS; Object
     Drivers: NVMe* PCIe Driver; NVMe-oF* Initiator; Intel® QuickData Technology Driver
     Block device (bdev) backends: NVMe Devices; Ceph RBD; Linux Async IO; Blob bdev; 3rd Party NVMe
     Integration: RocksDB; Ceph
     Core: Application Framework
  7. 7. Benefits of using SPDK: SPDK delivers more performance from Intel CPUs, non-volatile media, and networking.
     • Up to 10X more IOPS/core for NVMe-oF* vs. Linux kernel
     • Up to 8X more IOPS/core for NVMe vs. Linux kernel
     • Up to 350% better tail latency for RocksDB workloads
     • Faster TTM / fewer resources than developing components from scratch
     • Future proofing as NVM technologies increase in performance
     Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
  8. 8. SPDK Updates: 17.03 Release (Mar 2017)
     New components (broader set of use cases for SPDK libraries & ingredients):
     • Blobstore: block allocator for applications; variable granularity, defaults to 4KB
     • BlobFS: lightweight, non-POSIX filesystem; page caching & prefetch; initially limited to DB file semantic requirements (e.g. file name and size)
     • RocksDB SPDK environment: implement RocksDB using BlobFS
     • QEMU vhost-scsi target: simplified I/O path to local QEMU guest VMs with unmodified apps
     Existing components (feature and hardening improvements):
     • NVMe over Fabrics improvements: read latency improvement; NVMe-oF host (initiator) zero-copy; discovery code simplification; quality, performance & hardening fixes
  9. 9. Current status: Fully realizing new media performance requires software optimizations. SPDK is positioned to enable developers to realize this performance. SPDK is available today via http://spdk.io. Help us build SPDK as an open source community!
  10. 10. Current SPDK support in BlueStore
     New features:
     • Support multiple threads doing I/O on NVMe SSDs via the SPDK user-space NVMe driver
     • Support running SPDK I/O threads on designated CPU cores via the configuration file
     Upgrades in Ceph (now 17.03):
     • Upgraded SPDK to 16.11 in Dec 2016
     • Upgraded SPDK to 17.03 in April 2017
     Stability:
     • Fixed several compilation issues and runtime bugs when using SPDK
     In total, 16 SPDK-related patches have been merged into BlueStore (mainly in the NVMEDEVICE module).
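For context, BlueStore switches to the SPDK user-space NVMe driver through its block path setting. The ceph.conf sketch below is illustrative only: the "spdk:" prefix plus the NVMe serial number is how BlueStore of this era selected an SPDK device, while the core-mask and memory options are assumptions whose exact names vary by Ceph release.

```ini
# Hedged example: point one OSD's BlueStore data device at an SPDK-managed NVMe SSD.
# The serial number is a placeholder; verify option names against your Ceph release.
[osd]
bluestore_block_path = spdk:55cd2e404bd73932   # "spdk:" + NVMe serial selects the SPDK driver
bluestore_spdk_coremask = 0x3                  # run SPDK I/O threads on designated CPU cores (assumed name)
bluestore_spdk_mem = 2048                      # hugepage memory for the SPDK environment, in MB (assumed name)
```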
  11. 11. (From iStaury’s talk in SPDK PRC meetup 2016)
  12. 12. Block service exported by Ceph via the iSCSI protocol
     • Cloud service providers that provision VM services can consume block storage over iSCSI.
     • If Ceph can export block service with good performance, it becomes easy to attach those providers to a Ceph cluster solution.
     Topology: the client runs an application over multipath (dm-1) on top of two iSCSI initiator paths (sdx, sdy); each path terminates at an iSCSI gateway running an iSCSI target backed by RBD, and the gateways talk to the Ceph OSD cluster.
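Not from the deck: on the client side, attaching a LUN exported by such a gateway is standard open-iscsi plus dm-multipath; the portal address below is a placeholder.

```sh
# Discover and log in to the gateway's iSCSI target (placeholder portal address)
iscsiadm -m discovery -t sendtargets -p 192.168.1.10:3260
iscsiadm -m node -p 192.168.1.10:3260 --login
# The LUN appears as /dev/sdX; with two gateway portals logged in,
# dm-multipath aggregates the paths into one dm device, as in the slide's topology.
```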
  13. 13. iSCSI + RBD Gateway test setup
     Ceph server: CPU: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz; four Intel P3700 SSDs; one OSD on each SSD (4 OSDs in total); 4 pools with PG number 512, one 10G image per pool.
     iSCSI target server (librbd+SPDK / librbd+tgt): CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz; only one core enabled.
     iSCSI initiator: CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
     Data path: iSCSI initiator → iSCSI target server (iSCSI target + librbd) → Ceph server (OSD0–OSD3).
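The FIO workloads behind the next two slides are 4 KiB random reads and writes against the iSCSI-attached images (one FIO instance per image). The job file below is an illustrative reconstruction rather than the presenters' actual configuration; the device name, queue depth, and runtime are assumptions.

```ini
; Hypothetical fio job for the 4k_randread case
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32          ; assumed queue depth
runtime=120
time_based=1

[rbd-via-iscsi]
filename=/dev/sdx    ; iSCSI-attached RBD image (placeholder device)
rw=randread          ; use rw=randwrite for the 4k_randwrite case
```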
  14. 14. iSCSI + RBD Gateway, one CPU core (IOPS):
     Configuration | 1 FIO + 1 img | 2 FIO + 2 img | 3 FIO + 3 img
     TGT + 4k_randread | 10K | 20K | 20K
     SPDK iSCSI tgt + 4k_randread | 20K | 24K | 28K
     (SPDK iSCSI tgt / TGT ratio for 4k_randread: 140%)
     TGT + 4k_randwrite | 6.5K | 9.5K | 18K
     SPDK iSCSI tgt + 4k_randwrite | 14K | 19K | 24K
     (SPDK iSCSI tgt / TGT ratio for 4k_randwrite: 133%)
  15. 15. iSCSI + RBD Gateway, two CPU cores (IOPS):
     Configuration | 1 FIO + 1 img | 2 FIO + 2 img | 3 FIO + 3 img | 4 FIO + 4 img
     TGT + 4k_randread | 12K | 24K | 26K | 26K
     SPDK iSCSI tgt + 4k_randread | 37K | 47K | 47K | 47K
     (SPDK iSCSI tgt / TGT ratio for 4k_randread: 181%)
     TGT + 4k_randwrite | 9.5K | 13.5K | 19K | 22K
     SPDK iSCSI tgt + 4k_randwrite | 16K | 24K | 25K | 27K
     (SPDK iSCSI tgt / TGT ratio for 4k_randwrite: 123%)
  16. 16. Reading Comparison: 4K_randread IOPS (K) by number of streams (1 / 2 / 3):
     One core, TGT: 10 / 20 / 20; One core, SPDK-iSCSI: 20 / 24 / 28; Two cores, TGT: 12 / 24 / 26; Two cores, SPDK-iSCSI: 37 / 47 / 47.
  17. 17. Writing Comparison: 4K_randwrite IOPS (K) by number of streams (1 / 2 / 3 / 4):
     One core, TGT: 6.5 / 9.5 / 18; One core, SPDK-iSCSI: 14 / 19 / 24; Two cores, TGT: 9.5 / 13.5 / 19 / 22; Two cores, SPDK-iSCSI: 16 / 24 / 25 / 27.
  18. 18. SPDK support for Ceph in 2017. To make SPDK genuinely useful in Ceph, we will continue the following work with partners:
     • Stability maintenance – version upgrades, fixing compile-time and run-time bugs.
     • Performance enhancement – continue optimizing the NVMEDEVICE module according to customer and partner feedback.
     • New feature development – occasionally pick up common requirements/feedback from the community and upstream those features into the NVMEDEVICE module.
  19. 19. Proposals/opportunities for better leveraging SPDK
     Multiple OSDs on the same NVMe device by using SPDK:
     • Leverage the multi-process feature of SPDK's user-space NVMe driver.
     • Risk: as with the kernel driver, all OSDs on the device fail if the device fails.
     Enhanced cache support in NVMEDEVICE via SPDK:
     • Needs a better cache/buffer strategy for read/write performance improvement.
     Optimize RocksDB usage in BlueStore via SPDK's BlobFS/Blobstore:
     • Make RocksDB use SPDK's BlobFS/Blobstore instead of a kernel file system for metadata management.
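The multi-process idea above hinges on SPDK's shared-memory environment: processes initialized with the same shm_id join one DPDK memory domain, so several OSD processes can each own queue pairs on the same NVMe controller. A hedged sketch against the current SPDK API (field and return types have shifted across releases):

```c
/* Hedged sketch: initialize an SPDK environment that can be shared by
 * multiple OSD-like processes. The first process with a given shm_id becomes
 * the DPDK primary; later ones attach as secondaries. */
#include "spdk/env.h"

int init_shared_env(const char *name, int shm_id)
{
    struct spdk_env_opts opts;

    spdk_env_opts_init(&opts);
    opts.name = name;        /* e.g. "osd.0", "osd.1" (placeholders) */
    opts.shm_id = shm_id;    /* same value in every cooperating process */

    /* Each process then probes the controller and allocates its own I/O
     * queue pairs, so OSDs never share a lock on the device. */
    return spdk_env_init(&opts);
}
```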
  20. 20. Leverage SPDK to accelerate the block service exported by Ceph
     Optimization in front of Ceph:
     • Use an optimized block service daemon, e.g., the SPDK iSCSI target or NVMe-oF target.
     • Introduce a cache policy in the block service daemon.
     Store optimization inside Ceph:
     • Use SPDK's user-space NVMe driver instead of the kernel NVMe driver (already available).
     • Possibly replace “BlueRocksEnv + BlueFS” with “BlobfsEnv + BlobFS/Blobstore”.
  21. 21. Accelerate block service exported by Ceph via SPDK (proposed architecture)
     In front of Ceph, optimized modules to be developed (TBD in the SPDK roadmap): an SPDK-optimized iSCSI target and NVMe-oF target exporting the block service; an SPDK Ceph RBD bdev module (leveraging librbd/librados); an SPDK cache module. Existing pieces: the Ceph RBD service and existing SPDK apps/modules.
     Inside Ceph: today BlueStore (alongside FileStore and KVStore) keeps metadata in RocksDB on top of BlueRocksEnv + BlueFS over a kernel or SPDK NVMe driver; the proposal keeps RocksDB but runs it on an SPDK BlobfsEnv + BlobFS/Blobstore over the SPDK NVMe driver — even replace RocksDB?
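The "SPDK Ceph RBD bdev module" box amounts to wrapping librbd's asynchronous I/O so an SPDK reactor can poll for completions instead of blocking. The sketch below shows only the librbd side and is illustrative: the pool, image, and conf path are placeholders, and the real bdev registration/poller plumbing is omitted.

```c
/* Illustrative: issue an async read on an RBD image and poll for completion,
 * roughly the shape an SPDK bdev module would use (error handling omitted). */
#include <stddef.h>
#include <stdint.h>
#include <rados/librados.h>
#include <rbd/librbd.h>

static void rbd_read_done(rbd_completion_t comp, void *arg)
{
    *(volatile int *)arg = 1;   /* a bdev module would complete the spdk_bdev_io here */
}

int read_from_rbd(char *buf, size_t len, uint64_t off)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rbd_image_t image;
    rbd_completion_t comp;
    volatile int done = 0;

    rados_create(&cluster, NULL);                          /* default client id */
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");  /* placeholder conf path */
    rados_connect(cluster);
    rados_ioctx_create(cluster, "rbd", &ioctx);            /* placeholder pool name */
    rbd_open(ioctx, "test-image", &image, NULL);           /* placeholder image name */

    rbd_aio_create_completion((void *)&done, rbd_read_done, &comp);
    rbd_aio_read(image, off, len, buf, comp);

    /* Polled completion: a real bdev module registers an SPDK poller that checks
     * finished librbd I/Os each reactor iteration instead of spinning here. */
    while (!done) { }

    rbd_aio_release(comp);
    rbd_close(image);
    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return 0;
}
```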
  22. 22. Summary
     SPDK has proven useful for exploiting the capability of fast storage devices (e.g., NVMe SSDs), but considerable development work is still needed to make SPDK usable for BlueStore at production quality.
     Call to action:
     • Contribute code in the SPDK community.
     • Leverage SPDK for Ceph optimization — contact the SPDK dev team for help and collaboration.
  24. 24. Vhost-scsi Performance: SPDK provides 1 million IOPS with 1 core and 8x VM performance vs. kernel!
     Features → realized benefits: high-performance storage virtualization → increased VM density; reduced VM exits → reduced tail latencies.
     System configuration: Target system: 2x Intel® Xeon® E5-2695v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, 8x Intel® P3700 NVMe SSD (800GB), 4x per CPU socket, FW 8DV10102; Network: Mellanox* ConnectX-4 100Gb RDMA, direct connection between initiator and target; Initiator OS: CentOS* Linux* 7.2, Linux kernel 4.7.0-rc2; Target OS (SPDK): CentOS Linux 7.2, Linux kernel 3.10.0-327.el7.x86_64; Target OS (Linux kernel): CentOS Linux 7.2, Linux kernel 4.7.0-rc2. Performance as measured by fio, 4KB random read I/O, 2 RDMA QP per remote SSD, numjobs=4 per SSD, queue depth 32/job.
     (Charts: VM cores vs. I/O processing cores, and I/Os handled per I/O processing core, for QEMU virtio-scsi, kernel vhost-scsi, and SPDK vhost-scsi; SPDK vhost-scsi approaches 1,000,000 I/Os per I/O processing core.)
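For reference (not shown in the deck), a guest reaches an SPDK vhost-scsi target over QEMU's vhost-user transport; guest memory must live on shared hugepages so the target process can access it. The socket path and sizes below are placeholders.

```sh
# Illustrative QEMU flags for attaching a guest to an SPDK vhost-user-scsi device
qemu-system-x86_64 \
    -m 4G \
    -object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=vhost_scsi0,path=/var/tmp/vhost.0 \
    -device vhost-user-scsi-pci,id=scsi0,chardev=vhost_scsi0 \
    ...   # remaining VM options unchanged
```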
  25. 25. Alibaba* Cloud ECS Case Study: Write Performance. Ali Cloud sees 300% improvement in IOPS and latency using SPDK. Source: http://mt.sohu.com/20170228/n481925423.shtml. *Other names and brands may be claimed as the property of others.
     (Charts: random write latency (µs) and random write 4K IOPS vs. queue depth 1–32, comparing the general virtualization infrastructure with the Ali Cloud high-performance storage infrastructure with SPDK.)
  26. 26. Alibaba* Cloud ECS Case Study: MySQL Sysbench. Sysbench Update sees 4.6X QPS at 10% of the latency! Source: http://mt.sohu.com/20170228/n481925423.shtml. *Other names and brands may be claimed as the property of others.
     (Charts: MySQL Sysbench latency (ms) and TPS/QPS for Select and Update, comparing the general virtualization infrastructure with high-performance virtualization with SPDK.)
  27. 27. SPDK Blobstore vs. Kernel: Key Tail Latency. SPDK Blobstore reduces tail latency by 3.7X.
     db_bench 99.99th percentile latency in µs (lower is better):
     Workload | Kernel (256KB sync) | SPDK Blobstore (20GB cache + readahead)
     Insert | 366 | 444
     Randread | 6444 | 3607
     Overwrite | 1675 | 1200
     Readwrite | 122500 | 33052
     (Chart annotations: 21% for Insert, 44% for Randread, 28% for Overwrite, 372% for Readwrite.)
  28. 28. SPDK Blobstore vs. Kernel: Key Transactions per Second. SPDK Blobstore improves insert throughput by 85%.
     db_bench key transactions in keys per second (higher is better):
     Workload | Kernel (256KB sync) | SPDK Blobstore (20GB cache + readahead) | Improvement
     Insert | 547046 | 1011245 | 85%
     Randread | 92582 | 99918 | 8%
     Overwrite | 51421 | 53495 | 4%
     Readwrite | 30273 | 29804 | ~0%
