
CEPH DAY BERLIN - CEPH ON THE BRAIN!


How does Ceph perform when used in high-performance computing? This talk will cover a year of running Ceph on a (small) Cray supercomputer. I will describe how Ceph was configured to perform in an all-NVMe configuration, and the process of analysis and optimisation of the configuration. I'll also give details on the efforts underway to adapt Ceph's messaging to run over high-performance network fabrics, and how this work could become the next frontier in the battle against storage performance bottlenecks.


  1. StackHPC Ceph on the Brain! Stig Telfer, CTO, StackHPC Ltd. Ceph Days Berlin 2018
  2. StackHPC 1992-2018: The story so far... HPC on the Alpha processor, software-defined networking for HPC, OpenStack for HPC, HPC on OpenStack
  3. StackHPC Human Brain Project • The Human Brain Project is a flagship EU FET project • Significant effort into massively parallel applications in neuro-simulation and analysis techniques • Research and development of platforms to enable these applications • A 10-year project - 2013-2023
  4. HBP Pre-Commercial Procurement • EU vehicle for funding R&D activities in public institutions • Jülich Supercomputing Centre and HBP ran three phases of competition • Phase III winners were Cray + IBM & NVIDIA • Technical objectives: • Dense memory integration • Interactive supercomputing
  5. JULIA pilot system • Cray CS400 base system • 60 KNL nodes • 4 visualisation nodes • 4 data nodes • 2 x 1.6TB Fultondale SSDs • Intel Omnipath interconnect • Highly diverse software stack • Diverse memory / storage system
  6. Why Ceph? • Primarily to study a novel storage / hot-tier object store • However, also need a POSIX-compliant production filesystem (CephFS) • Excellent support and engagement from a diverse community • Interesting set of interactions with cloud software (OpenStack etc.) • Is it performant?
  7. Ceph’s Performance Record. Source: Intel white paper “Using Intel® Optane™ Technology with Ceph to Build High-Performance Cloud Storage Solutions on Intel® Purley Platform”
  8. StackHPC JULIA Cluster Fabric [Diagram: login-1, login-2, data-0001 to data-0004, viz-0001 to viz-0004 and prod-0001 to prod-0060, all attached to a 100G Omni-Path fabric]
  9. StackHPC JULIA Data Node Architecture [Diagram: dual-socket Broadwell E5-2680 linked by QPI, 64GB RAM per socket, 2 x 1.6TB Intel P3600 “Fultondale” NVMe, 100G OPA adapter]
  10. StackHPC JULIA Ceph Cluster Architecture • Monitors, MDSs, MGRs previously freestanding, now co-hosted • 4 OSD processes per NVMe device • 32 OSDs in total • Using the OPA IPoIB interface for both front-side and replication networks [Diagram: data-0001 to data-0004 on 100G OPA]
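  A minimal ceph.conf sketch of that arrangement, assuming 10.10.0.0/24 as the IPoIB subnet (a placeholder, not the JULIA value):

      # /etc/ceph/ceph.conf (excerpt)
      [global]
      # front-side (client) traffic, carried over the OPA IPoIB interface
      public network = 10.10.0.0/24
      # OSD replication and recovery traffic, on the same IPoIB fabric here
      cluster network = 10.10.0.0/24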
  11. Source: Intel white paper “Using Intel® Optane™ Technology with Ceph to Build High-Performance Cloud Storage Solutions on Intel® Purley Platform”
  12. StackHPC Data Node - Raw Read • 64K reads using fio • 4 jobs per OSD partition (32 total) • Aggregate performance across all partitions approx. 5200 MB/s [Chart: per-partition read bandwidth on data-0001, 0-500 MB/s scale]
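  A fio invocation in the spirit of this test might look as follows; the device path, I/O engine, queue depth and runtime are assumptions, only the 64K block size and 4 jobs per partition come from the slide:

      # sequential 64K reads against one raw OSD partition, 4 jobs
      fio --name=rawread --filename=/dev/nvme0n1p1 \
          --rw=read --bs=64k --direct=1 --ioengine=libaio \
          --iodepth=32 --numjobs=4 --group_reporting \
          --time_based --runtime=60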
  13. StackHPC A Non-Uniform Network Fabric • Single TCP stream performance (using iperf3) • IP encapsulation on Omnipath • KNL appears to struggle with the performance of sequential activity • High variability between other classes of node also [Matrix of single-stream bandwidths between viz, KNL and data nodes: paths involving KNL nodes measured 5.95-8.43 Gbits/sec; viz/data (Xeon) paths measured 25.9-51.7 Gbits/sec]
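  These point-to-point figures come from plain single-stream runs; for example (hostname and duration are illustrative):

      # on the receiving node
      iperf3 -s
      # on the sending node: one TCP stream for 30 seconds
      iperf3 -c data-0001 -P 1 -t 30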
  14. StackHPC Network and I/O Compared [Chart, 0-6500 MB/s scale: best and worst single-stream TCP/IP on Xeon, best and worst TCP/IP on KNL, and data node NVMe read bandwidth with its variation]
  15. StackHPC First Ceph Upgrade • Jewel to Luminous release • Filestore to Bluestore backend • Use ceph-ansible playbooks (mostly) • ceph-ansible doesn’t support multiple OSDs per block device • Manual creation of OSDs in partitions using Ceph tools
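  One possible by-hand approach is sketched below; the partition size and the use of Luminous-era ceph-disk are assumptions, not necessarily the exact commands used:

      # split a 1.6TB NVMe into four roughly equal partitions
      for i in 1 2 3 4; do sgdisk --new=$i:0:+370G /dev/nvme0n1; done
      # prepare a Bluestore OSD on each partition
      for i in 1 2 3 4; do ceph-disk prepare --bluestore /dev/nvme0n1p$i; done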
  16. StackHPC Jewel to Luminous - Filestore
  17. StackHPC Luminous - Filestore to Bluestore (IPoIB)
  18. StackHPC Ceph RADOS Raw devices Write Amplification - Filestore
  19. StackHPC Ceph RADOS Raw devices Write Amplification - Bluestore
  20. StackHPC Write Degradation Issue
  21. Source: Intel presentation “Accelerate Ceph with Optane and 3D NAND”
  22. StackHPC Luminous - Scaling Out
  23. StackHPC Storage Nodes and Processor Sleep: 76% read efficiency
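  Deep processor sleep states add wake-up latency on the I/O path; one common way to restrict them, purely as an illustration (not necessarily the setting used on JULIA):

      # disable CPU idle states with an exit latency above 10 microseconds
      cpupower idle-set -D 10
      # or persistently via kernel boot parameters:
      #   intel_idle.max_cstate=1 processor.max_cstate=1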
  24. StackHPC Second Ceph Upgrade • Luminous to Mimic release • Bluestore backend • Use ceph-ansible playbooks • Use ceph-volume to create LVM OSDs
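  A sketch of how one NVMe can be split into several LVM-backed Bluestore OSDs with ceph-volume; the volume group name and sizes are placeholders:

      # carve one NVMe device into four logical volumes, one Bluestore OSD each
      pvcreate /dev/nvme0n1
      vgcreate ceph-nvme0 /dev/nvme0n1
      for i in 0 1 2 3; do
          lvcreate -L 370G -n osd-$i ceph-nvme0
          ceph-volume lvm create --bluestore --data ceph-nvme0/osd-$i
      done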
  25. Spectre/Meltdown Mitigations • Storage I/O performance benchmarked 15.5% slower: from 5200 MB/s down to 4400 MB/s • Network performance also adversely affected
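  The mitigation state on each node can be read from sysfs; the kernel parameters mentioned in the comment are only for controlled before/after benchmarking:

      # show which Spectre/Meltdown mitigations are active
      grep . /sys/devices/system/cpu/vulnerabilities/*
      # for comparison runs, mitigations can be disabled at boot with parameters
      # such as: nopti nospectre_v2 spectre_v2=off (isolated test systems only)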
  26. Mimic - Scaling Out (Reads, LVM)
  27. Mimic - Scaling Out (Reads, Partitions): 90% read efficiency
  28. Ceph OSD under Load…
  29. 53% CPU in Networking Stack!
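  A breakdown like this can be gathered with perf against a loaded OSD; an illustrative recipe (process selection and duration are arbitrary):

      # sample the oldest ceph-osd process for 30 seconds, with call graphs
      perf record -g -p $(pgrep -o ceph-osd) -- sleep 30
      perf report --sort=symbol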
  30. StackHPC Third Ceph Upgrade • Mimic to Nautilus (development) • RDMA testing
  31. StackHPC A brief detour away from HBP…
  32. StackHPC Cambridge Data Accelerator
  33. StackHPC Accelerator Data Node [Diagram: dual-socket Skylake Gold 6142 linked by QPI, 96GB RAM per socket, 12 x Intel P4600 NVMe (6 per socket), 2 x 100G OPA adapters]
  34. StackHPC Burst Buffer Workflows • Stage in / Stage out • Transparent caching • Checkpoint / Restart • Background data movement • Journaling • Swap memory. Storage volumes (namespaces) can persist longer than the jobs, and can be shared with multiple users or kept private and ephemeral. Access is POSIX or object (this can also be at a flash block load/store interface).
  35. StackHPC Hero Performance Numbers: 96% write efficiency, 84% read efficiency
  36. StackHPC Ceph and RDMA • RoCE 25G - Yes! • iWARP - not tested • Omnipath 100G - Nearly… • Infiniband 100G - Looking good… • Integrated since Luminous Ceph RPMs • (Mellanox have a bugfix tree)

      /etc/ceph/ceph.conf:
          ms_type = async+rdma            (or ms_cluster_type = async+rdma)
          ms_async_rdma_device_name = hfi1_0   (or mlx5_0)
          ms_async_rdma_polling_us = 0

      /etc/security/limits.conf:
          #<domain> <type> <item>    <value>
          *         -      memlock   unlimited

      /usr/lib/systemd/system/ceph-*@.service:
          [Service]
          LimitMEMLOCK=infinity
          PrivateDevices=no
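  The RDMA device name to put in ms_async_rdma_device_name can be confirmed with the standard libibverbs utilities (not Ceph-specific):

      ibv_devices            # lists RDMA-capable devices, e.g. hfi1_0 or mlx5_0
      ibv_devinfo -d hfi1_0  # port state and capabilities for one device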
  37. StackHPC New Developments in Ceph-RDMA • Intel: RDMA Connection Manager and iWARP support • https://github.com/tanghaodong25/ceph/tree/rdma-cm - merged • Nothing iWARP-specific here; (almost) works on IB and OPA too • Mellanox: new RDMA Messenger based on UCX • https://github.com/Mellanox/ceph/tree/vasily-ucx - 2019? • How about libfabric?
  38. StackHPC Closing Comments • Ceph is getting there, fast… • RDMA performance is not currently low-hanging fruit on most setups • Intel’s benchmarking claims TCP messaging consumes 25% of CPU in high-end configurations - I see over 50% on our setup! • New approaches to RDMA should help in key areas: performance, portability, flexibility, stability
  39. Thank You! stig@stackhpc.com https://www.stackhpc.com @oneswig Cray EMEA Research Lab: Adrian Tate, Nina Mujkanovic, Alessandro Rigazzi Forschungszentrum Jülich: Lena Oden, Bastian Tweddell, Dirk Pleiter
