
What's new in Luminous and Beyond


Ceph is an open source distributed storage system that provides scalable object, block, and file interfaces on commodity hardware. Luminous, the latest stable release of Ceph, was just released in August. This talk covers what is new in Luminous (there is a lot!) and provides a sneak peek at the roadmap for Mimic, which is due out in the spring.



  1. 1. WHAT’S NEW IN LUMINOUS AND BEYOND SAGE WEIL – RED HAT 2017.11.06
  2. 2. 2 UPSTREAM RELEASES [timeline: Jewel (LTS, 10.2.z) Spring 2016; Kraken (11.2.z) Fall 2016; Luminous (12.2.z) Summer 2017, WE ARE HERE; Mimic (13.2.z) Spring 2018; Nautilus (14.2.z) Winter 2019] ● New release cadence ● Named release every 9 months ● Backports for 2 releases ● Upgrade up to 2 releases at a time (e.g., Luminous → Nautilus)
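As a hedged illustration of the upgrade path above, the commands below show the kind of check and finalization step used during a rolling upgrade to Luminous; only the release name changes for later upgrades.

```bash
# Show which Ceph release each daemon type is currently running (available from Luminous on).
ceph versions

# Once every OSD has been restarted on the new release, forbid older OSDs from
# joining so release-specific features (e.g., the balancer) can be enabled.
ceph osd require-osd-release luminous
```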
  3. 3. 3 LUMINOUS
  4. 4. 4 BLUESTORE: STABLE AND DEFAULT ● New OSD backend – consumes raw block device(s) – no more XFS – embeds rocksdb for metadata ● Fast on both HDDs (~2x) and SSDs (~1.5x) – Similar to FileStore on NVMe, where the device is not the bottleneck ● Smaller journals – happily uses fast SSD partition(s) for internal metadata, or NVRAM for journal ● Full data checksums (crc32c, xxhash, etc.) ● Inline compression (zlib, snappy) – policy driven by global or per-pool config, and/or client hints ● Stable and default
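A minimal sketch of the policy-driven inline compression mentioned above; the pool name is a placeholder, and cluster-wide defaults can be set instead of per-pool options.

```bash
# Enable inline compression for one pool; BlueStore OSDs honor these settings.
ceph osd pool set mypool compression_algorithm snappy
ceph osd pool set mypool compression_mode aggressive   # none | passive | aggressive | force

# Cluster-wide defaults can instead go in ceph.conf, e.g.:
#   bluestore_compression_algorithm = snappy
#   bluestore_compression_mode = passive
```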
  5. 5. 5 HDD: RANDOM WRITE [charts: Bluestore vs Filestore HDD random write throughput (MB/s) and IOPS across IO sizes 4 to 4096; series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw)]
  6. 6. 6 HDD: MIXED READ/WRITE [charts: Bluestore vs Filestore HDD random read/write throughput (MB/s) and IOPS across IO sizes 4 to 4096; same series as the previous slide]
  7. 7. 7 RGW ON HDD+NVME, EC 4+2 [chart: 4+2 erasure coding RadosGW write tests, 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; compares Filestore and Bluestore with 512KB and 4MB chunks across 1/4/128/512 buckets and 1 or 4 RGW servers, plus a rados bench baseline; throughput in MB/s]
  8. 8. 8 RBD OVER ERASURE CODED POOLS ● aka erasure code overwrites ● requires BlueStore to perform reasonably ● significant improvement in efficiency over 3x replication – 2+2 → 2x 4+2 → 1.5x ● small writes slower than replication – early testing showed 4+2 is about half as fast as 3x replication ● large writes faster than replication – less IO to device ● implementation still does the “simple” thing – all writes update a full stripe
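A minimal sketch of wiring RBD to an erasure-coded data pool as described above; the pool and image names are illustrative and the PG counts are arbitrary.

```bash
# Create an EC pool and allow partial overwrites (requires BlueStore OSDs).
ceph osd pool create ecpool 64 64 erasure
ceph osd pool set ecpool allow_ec_overwrites true

# Image metadata stays in a replicated pool; the data objects land in the EC pool.
rbd create --size 100G --data-pool ecpool rbd/myimage
```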
  9. 9. 9 CEPH-MGR ● ceph-mgr – new management daemon to supplement ceph-mon (monitor) – easier integration point for python management logic – integrated metrics ● ceph-mon scaling – offload pg stats from mon to mgr – validated 10K OSD deployment (“Big Bang III” @ CERN) ● restful: new REST API ● prometheus, influx, zabbix ● dashboard: built-in web dashboard – webby equivalent of 'ceph -s' [icons: M, G, ??? (time for new iconography)]
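For reference, the mgr modules named above are turned on through the module interface; a hedged sketch (the restful certificate and key steps may vary by deployment, and "admin" is just a placeholder user).

```bash
# See which manager modules are available, then enable the dashboard and the
# Prometheus exporter.
ceph mgr module ls
ceph mgr module enable dashboard
ceph mgr module enable prometheus

# The REST API module needs a certificate and an API key.
ceph mgr module enable restful
ceph restful create-self-signed-cert
ceph restful create-key admin
```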
  10. 10. 10
  11. 11. 11 CEPH-METRICS
  12. 12. 12 ASYNCMESSENGER ● new network Messenger implementation – event driven – fixed-size thread pool ● RDMA backend (ibverbs) – built by default – limited testing, but seems stable! ● DPDK backend – prototype!
  13. 13. 13 PERFECTLY BALANCED OSDS (FINALLY!) ● CRUSH choose_args – alternate weight sets for individual rules – fixes two problems ● imbalance – run numeric optimization to adjust weights to balance PG distribution for a pool (or cluster) ● multipick anomaly – adjust weights per position to correct for low- weighted devices (e.g., mostly empty rack) – backward compatible ‘compat weight-set’ for imbalance only ● pg up-map – explicitly map individual PGs to specific devices in OSDMap – requires luminous+ clients ● Automagic, courtesy of ceph-mgr ‘balancer’ module – ceph balancer mode crush-compat – ceph balancer on
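Putting the commands from the slide together into a hedged end-to-end sequence: upmap mode requires the Luminous client floor mentioned above, while crush-compat works with older clients.

```bash
# Allow pg-upmap entries by requiring Luminous-or-newer clients.
ceph osd set-require-min-compat-client luminous

# Let the mgr 'balancer' module optimize placement automatically.
ceph balancer mode upmap          # or: ceph balancer mode crush-compat
ceph balancer on
ceph balancer status
```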
  14. 14. 14 RADOS MISC ● CRUSH device classes – mark OSDs with class (hdd, ssd, etc) – out-of-box rules to map to specific class of devices within the same hierarchy ● streamlined disk replacement ● require_min_compat_client – simpler, safer configuration ● annotated/documented config options ● client backoff on stuck PGs or objects ● better EIO handling ● peering and recovery speedups ● fast OSD failure detection
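A short sketch of the device-class rules mentioned above; the OSD id, rule name, and pool name are placeholders.

```bash
# Classes (hdd/ssd/nvme) are detected automatically, but can be set by hand.
ceph osd crush set-device-class ssd osd.12

# One CRUSH rule per class, without maintaining a separate hierarchy.
ceph osd crush rule create-replicated fast-rule default host ssd
ceph osd pool set mypool crush_rule fast-rule
```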
  15. 15. 15 RADOSGW [feature map: S3, Swift, erasure coding, multisite federation, multisite replication, NFS, encryption, tiering, deduplication, compression]
  16. 16. 16 RGW METADATA SEARCH [diagram: RGW zones A, B, and C backed by clusters A, B, and C, each running RADOSGW on LIBRADOS; clients reach the zones over REST]
  17. 17. 17 RGW MISC ● NFS gateway – NFSv4 and v3 – full object access (not general purpose!) ● dynamic bucket index sharding – automatic (finally!) ● inline compression ● encryption – follows S3 encryption APIs ● S3 and Swift API odds and ends RADOSGW LIBRADOS
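Dynamic bucket index resharding is automatic in Luminous (rgw_dynamic_resharding defaults to on), but it can also be driven and inspected by hand; the bucket name and shard count below are illustrative.

```bash
# Queue a bucket for resharding, inspect the queue, and run the resharder.
radosgw-admin reshard add --bucket=mybucket --num-shards=32
radosgw-admin reshard list
radosgw-admin reshard process
```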
  18. 18. 18 RBD [feature map: erasure coding, multisite mirroring, persistent client cache, consistency groups, encryption, iSCSI, trash]
  19. 19. 19 RBD ● RBD over erasure coded pool – rbd create --data-pool <ecpoolname> ... ● RBD mirroring improvements – cooperative HA daemons – improved Cinder integration ● iSCSI – LIO tcmu-runner, librbd (full feature set) ● Kernel RBD improvements – exclusive locking, object map
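A hedged sketch of enabling journaling-based mirroring for a single image, per the mirroring improvements above; pool, image, and peer names are placeholders, and an rbd-mirror daemon must be running on the remote cluster.

```bash
# Mirror selected images in the 'rbd' pool (per-image mode) and register a peer.
rbd mirror pool enable rbd image
rbd mirror pool peer add rbd client.remote@backup-cluster

# Mirroring requires the journaling feature on the image.
rbd feature enable rbd/myimage journaling
rbd mirror image enable rbd/myimage
```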
  20. 20. 20 CEPHFS ● multiple active MDS daemons (finally!) ● subtree pinning to specific daemon ● directory fragmentation on by default – (snapshots still off by default) ● so many tests ● so many bugs fixed ● kernel client improvements
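A minimal sketch of the multi-MDS and subtree pinning features above; the filesystem name and mount path are illustrative, and allow_multimds is the Luminous-era safety switch.

```bash
# Allow more than one active MDS and raise the active count to two.
ceph fs set cephfs allow_multimds true
ceph fs set cephfs max_mds 2

# Pin a hot subtree to MDS rank 1 via a virtual extended attribute.
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/busy-subtree
```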
  21. 21. 21 MIMIC
  22. 22. 22 [Mimic themes: OSD refactor, tiering, client-side caches, metrics, dedup, QoS, self-management, multi-site federation]
  23. 23. 23 CONTAINERS ● ceph-container (https://github.com/ceph/ceph-container) – multi-purpose container image – plan to publish upstream releases as containers on download.ceph.com ● ceph-helm (https://github.com/ceph/ceph-helm) – Helm charts to deploy Ceph in kubernetes ● openshift-ansible – deployment of Ceph in OpenShift ● mgr-based dashboard/GUI – management and metrics (based on ceph-metrics, https://github.com/ceph/ceph-metrics) ● leverage kubernetes for daemon scheduling – MDS, RGW, mgr, NFS, iSCSI, CIFS, rbd-mirror, … ● streamlined provisioning and management – adding nodes, replacing failed disks, …
  24. 24. 24 RADOS ● The future is – NVMe – DPDK/SPDK – futures-based programming model for OSD – painful but necessary ● BlueStore and rocksdb optimization – rocksdb level0 compaction – alternative KV store? ● RedStore? – considering NVMe focused backend beyond BlueStore ● Erasure coding plugin API improvements – new codes with less IO for single-OSD failures [diagram: OSD stack (Messenger, OSD, ObjectStore, Device)]
  25. 25. 25 EASE OF USE ODDS AND ENDS ● centralized config management – all config on mons – no more ceph.conf – ceph config … ● PG autoscaling – less reliance on operator to pick right values ● progress bars – recovery, rebalancing, etc. ● async recovery – fewer high-latency outliers during recovery
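These are roadmap items, so the exact interface may shift, but a sketch of the planned mon-backed 'ceph config' workflow looks roughly like this (option and daemon names are illustrative).

```bash
# Options live in the monitors' configuration database rather than ceph.conf.
ceph config set osd osd_max_backfills 2      # applies to all OSDs
ceph config get osd.3 osd_max_backfills      # effective value for one daemon
ceph config dump                             # everything stored centrally
```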
  26. 26. 26 QUALITY OF SERVICE ● Ongoing background development – dmclock distributed QoS queuing – minimum reservations and priority weighting ● Range of policies – IO type (background, client) – pool-based – client-based ● Theory is complex ● Prototype is promising, despite simplicity ● Basics will land in Mimic
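The dmclock queue already ships as an experimental option in Luminous, which is the basis for the Mimic work; a hedged ceph.conf sketch follows (option names may change as the feature matures, and switching the op queue requires an OSD restart).

```bash
# Append experimental dmclock op-queue settings to the OSD section of ceph.conf.
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
osd_op_queue = mclock_opclass
osd_op_queue_cut_off = high
EOF
```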
  27. 27. 27
  28. 28. 28 TIERING ● new RADOS ‘redirect’ primitive – basically a symlink, transparent to librados – replace “sparse” cache tier with base pool “index” [diagrams: today the application writes through a replicated cache pool to an HDD and/or erasure base pool; with redirects, the application writes to a replicated SSD base pool whose objects point at one or more slow (e.g., EC) pools]
  29. 29. 29 DEDUPLICATION WIP ● Generalize redirect to a “manifest” – map of offsets to object “fragments” (vs a full object copy) ● Break objects into chunks – fixed size, or content fingerprint ● Store chunks in content-addressable pool – name object by sha256(content) – reference count chunks ● TBD – inline or post? – policy (when to promote inline, etc.) – agent home (embed in OSD, …) [diagram: application writes to a replicated SSD base pool; chunks are stored in a slow (EC) pool and a content-addressable dedup pool]
  30. 30. 30 CephFS ● Integrated NFS gateway management – HA nfs-ganesha service – API configurable, integrated with Manila ● snapshots! ● quota for kernel client
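For context on the snapshot and quota items above, both are driven from the client side; a hedged sketch with an illustrative mount point (kernel-client quota enforcement is the new part).

```bash
# Take a CephFS snapshot by creating a directory under the hidden .snap directory.
mkdir /mnt/cephfs/projects/.snap/before-cleanup

# Quotas are virtual extended attributes on a directory (100 GiB here).
setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/projects
getfattr -n ceph.quota.max_bytes /mnt/cephfs/projects
```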
  31. 31. 31 CLIENT CACHES! ● RGW – persistent read-only cache on NVMe – fully consistent (only caches immutable “tail” rados objects) – Mass Open Cloud ● RBD – persistent read-only cache of immutable clone parent images – writeback cache for improving write latency ● cluster image remains crash-consistent if client cache is lost ● CephFS – kernel client already uses kernel fscache facility
  32. 32. 32 GROWING DEVELOPER COMMUNITY
  33. 33. 33 ● Red Hat ● Mirantis ● SUSE ● ZTE ● China Mobile ● XSky ● Digiware ● Intel ● Kylin Cloud ● Easystack ● Istuary Innovation Group ● Quantum ● Mellanox ● H3C ● Quantum ● UnitedStack ● Deutsche Telekom ● Reliance Jio Infocomm ● OVH ● Alibaba ● DreamHost ● CERN GROWING DEVELOPER COMMUNITY
  34. 34. 34 ● Red Hat ● Mirantis ● SUSE ● ZTE ● China Mobile ● XSky ● Digiware ● Intel ● Kylin Cloud ● Easystack ● Istuary Innovation Group ● Quantum ● Mellanox ● H3C ● Quantum ● UnitedStack ● Deutsche Telekom ● Reliance Jio Infocomm ● OVH ● Alibaba ● DreamHost ● CERN OPENSTACK VENDORS
  35. 35. 35 ● Red Hat ● Mirantis ● SUSE ● ZTE ● China Mobile ● XSky ● Digiware ● Intel ● Kylin Cloud ● Easystack ● Istuary Innovation Group ● Quantum ● Mellanox ● H3C ● Quantum ● UnitedStack ● Deutsche Telekom ● Reliance Jio Infocomm ● OVH ● Alibaba ● DreamHost ● CERN CLOUD OPERATORS
  36. 36. 36 ● Red Hat ● Mirantis ● SUSE ● ZTE ● China Mobile ● XSky ● Digiware ● Intel ● Kylin Cloud ● Easystack ● Istuary Innovation Group ● Quantum ● Mellanox ● H3C ● Quantum ● UnitedStack ● Deutsche Telekom ● Reliance Jio Infocomm ● OVH ● Alibaba ● DreamHost ● CERN HARDWARE AND SOLUTION VENDORS
  37. 37. 37 ● Red Hat ● Mirantis ● SUSE ● ZTE ● China Mobile ● XSky ● Digiware ● Intel ● Kylin Cloud ● Easystack ● Istuary Innovation Group ● Quantum ● Mellanox ● H3C ● Quantum ● UnitedStack ● Deutsche Telekom ● Reliance Jio Infocomm ● OVH ● Alibaba ● DreamHost ● CERN ???
  38. 38. 38 ● Mailing list and IRC – http://ceph.com/IRC ● Ceph Developer Monthly – first Weds of every month – video conference (Bluejeans) – alternating APAC- and EMEA- friendly times ● Github – https://github.com/ceph/ ● Ceph Days – http://ceph.com/cephdays/ ● Meetups – http://ceph.com/meetups ● Ceph Tech Talks – http://ceph.com/ceph-tech-talks/ ● ‘Ceph’ Youtube channel – (google it) ● Twitter – @ceph GET INVOLVED
  39. 39. CEPHALOCON APAC 2018.03.22 and 23 BEIJING, CHINA
  40. 40. THANK YOU! Sage Weil CEPH ARCHITECT sage@redhat.com @liewegas
