1
WHAT’S NEW IN CEPH
PACIFIC
2021.02.25
2
The buzzwords
● “Software defined storage”
● “Unified storage system”
● “Scalable distributed storage”
● “The future of storage”
● “The Linux of storage”
WHAT IS CEPH?
The substance
● Ceph is open source software
● Runs on commodity hardware
○ Commodity servers
○ IP networks
○ HDDs, SSDs, NVMe, NV-DIMMs, ...
● A single cluster can serve object,
block, and file workloads
3
● Freedom to use (free as in beer)
● Freedom to introspect, modify,
and share (free as in speech)
● Freedom from vendor lock-in
● Freedom to innovate
CEPH IS FREE AND OPEN SOURCE
4
● Reliable storage service out of unreliable components
○ No single point of failure
○ Data durability via replication or erasure coding
○ No interruption of service from rolling upgrades, online expansion, etc.
● Favor consistency and correctness over performance
CEPH IS RELIABLE
5
● Ceph is elastic storage infrastructure
○ Storage cluster may grow or shrink
○ Add or remove hardware while system is
online and under load
● Scale up with bigger, faster hardware
● Scale out within a single cluster for
capacity and performance
● Federate multiple clusters across
sites with asynchronous replication
and disaster recovery capabilities
CEPH IS SCALABLE
6
CEPH IS A UNIFIED STORAGE SYSTEM
● OBJECT: RGW (S3 and Swift object storage)
● BLOCK: RBD (virtual block device)
● FILE: CEPHFS (distributed network file system)
● LIBRADOS: low-level storage API
● RADOS: reliable, elastic, distributed storage layer with replication and erasure coding
7
RELEASE SCHEDULE
● Nautilus (14.2.z): Mar 2019
● Octopus (15.2.z): Mar 2020 ← WE ARE HERE
● Pacific (16.2.z): Mar 2021
● Quincy (17.2.z): Mar 2022
● Stable, named release every 12 months
● Backports for 2 releases
○ Nautilus reaches EOL shortly after Pacific is released
● Upgrade up to 2 releases at a time
○ Nautilus → Pacific, Octopus → Quincy
8
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
9
New Features
● Automated upgrade from Octopus
○ (for clusters deployed with cephadm)
● Automated log-in to private registries
● iSCSI and NFS are now stable
● Automated HA for RGW
○ haproxy and keepalived
● Host maintenance mode (commands below)
● cephadm exporter/agent for increased
performance/scalability
Robustness
● Lots of small usability improvements
● Lots of bug fixes
○ Backported into Octopus already
● Ongoing cleanup of docs.ceph.com
○ Removed ceph-deploy
CEPHADM
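A rough sketch of the cephadm workflows above; the hostname, registry details, and version string are placeholders, so check ‘ceph orch --help’ on your cluster for the exact syntax:
  # Log in to a private registry before pulling images
  ceph cephadm registry-login registry.example.com myuser mypassword
  # Start an automated rolling upgrade to Pacific and watch its progress
  ceph orch upgrade start --ceph-version 16.2.0
  ceph orch upgrade status
  # Put a host into maintenance mode, then bring it back
  ceph orch host maintenance enter host-01
  ceph orch host maintenance exit host-01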
10
● Robust and responsive management GUI for cluster operations
○ All core Ceph services (object, block, file) and extensions (iSCSI, NFS Ganesha)
○ Monitoring, metrics, management
● Full OSD management
○ Bulk creation with DriveGroups (filter by host, device properties: size/type/model)
○ Disk replacement and SMART diagnostics
● Multisite capabilities
○ RBD mirroring
○ RGW multisite sync monitoring
● Orchestrator/cephadm integration
● Official Management REST API for Ceph
○ Stable, versioned, and fully documented (auth example below)
● Production-ready security
○ RBAC, account policies (including account lock-out), secure cookies, sanitized logs, …
DASHBOARD
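As an illustration of the REST API, a hedged sketch of token-based access; the host, port, credentials, and endpoint path are placeholders, and the versioned Accept header reflects the Pacific API’s versioning scheme (consult the dashboard’s built-in API documentation for the authoritative reference):
  # Obtain a token
  curl -k -X POST https://mgr-host:8443/api/auth \
       -H 'Accept: application/vnd.ceph.api.v1.0+json' \
       -H 'Content-Type: application/json' \
       -d '{"username": "admin", "password": "secret"}'
  # Use the returned token on subsequent requests
  curl -k https://mgr-host:8443/api/health/minimal \
       -H 'Accept: application/vnd.ceph.api.v1.0+json' \
       -H 'Authorization: Bearer <token>'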
11
● Improved hands-off defaults
○ Upmap balancer on by default
○ PG autoscaler has improved out-of-the-box experience
● Automatically detect and report daemon version mismatches
○ Associated health alert
○ Can be muted during upgrades and on demand
● Ability to cancel ongoing scrubs
● ceph -s simplified
○ Recovery progress shown as a single progress bar; use ‘ceph progress’ for details (commands below)
● Framework for distributed tracing in the OSD (work in progress)
○ OpenTracing tracepoints in the OSD I/O path
○ Can be collected and viewed via Jaeger's web UI
○ To help with end-to-end performance analysis
RADOS USABILITY
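A few of the commands behind the items above, as a sketch; the DAEMON_OLD_VERSION health code and the one-week TTL are assumptions for illustration:
  # Confirm the upmap balancer is active
  ceph balancer status
  # Mute the version-mismatch alert, e.g. for the duration of an upgrade
  ceph health mute DAEMON_OLD_VERSION 1w
  # Show the detailed events behind the single 'ceph -s' progress bar
  ceph progress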
12
● MultiFS is marked stable!
○ Automated file system creation: use ‘ceph fs volume create NAME’
○ MDS automatically deployed with cephadm
● MDS autoscaler (start/stop MDS based on file system max_mds, standby count)
● cephfs-top (preview)
○ See client sessions and performance of the file system
● Continued improvements to cephfs-shell
● Scheduled snapshots via the new snap_schedule mgr module (example below)
● First class NFS gateway support
○ active/active configurations
○ automatically deployed via the Ceph orchestrator (Rook and cephadm)
● MDS-side encrypted file support (kernel-side development ongoing)
CEPHFS USABILITY
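A sketch of the volume and snapshot-schedule workflow; the file system name, path, schedule, and retention spec are placeholders (see the snap_schedule module docs for the full syntax):
  # Create a file system; MDS daemons are deployed automatically by the orchestrator
  ceph fs volume create cephfs01
  # Snapshot the root every hour and keep 24 hourly snapshots
  ceph fs snap-schedule add / 1h
  ceph fs snap-schedule retention add / h 24
  ceph fs snap-schedule status /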
13
RBD
● “Instant” clone/recover from external
(file/HTTP/S3) data source
● Built-in support for LUKS1/LUKS2 encryption (example below)
● Native Windows driver
○ Signed, prebuilt driver available soon
● Restartable rbd-nbd daemon support
OTHER FEATURES AND USABILITY
RGW
● S3Select MVP (CSV-only)
● Lua scripting, RGW request path
● D3N (*)
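For the RBD LUKS support above, a minimal sketch; the pool, image, size, and passphrase file are placeholders, and client-side support for opening encrypted images varies:
  # Create an image and format it with LUKS2 encryption
  rbd create --size 10G mypool/secure-image
  rbd encryption format mypool/secure-image luks2 passphrase.txt
  # Clients built on librbd can then load the LUKS layer with the same passphrase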
14
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
15
● Improved PG deletion performance
● More controlled osdmap trimming in the monitor
● Msgr2.1
○ New wire format for msgr2 (both crc and secure modes)
● More efficient manager modules
○ Ability to turn off progress module
○ Efficient use of large C++ structures in the codebase
● Monitor/display SSD wear levels
○ ‘ceph device ls’ output (see example below)
RADOS ROBUSTNESS
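Device health and wear data can be inspected roughly as follows; the device ID comes from the first command's output:
  # List known devices, the daemons using them, and life-expectancy/wear information
  ceph device ls
  # Dump the collected SMART/health metrics for one device
  ceph device get-health-metrics <devid>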
16
● Feature bit support for turning on/off required file system features
○ Clients not supporting features will be rejected
● FS scrub (online integrity check) now works with multiple active MDS daemons
● Kernel client (and mount.ceph) support for msgr2[.1]
○ Kernel mount option -o ms_mode=crc|secure|prefer-crc|prefer-secure (mount example below)
● Support for recovering mounts from blocklisting
○ Kernel reconnects with -o recover_session=clean
○ ceph-fuse reconnects with --client_reconnect_stale=1; page cache should be disabled
● Improved test coverage (doubled test matrix, 2500 -> 5000)
CEPHFS ROBUSTNESS
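A sketch of a kernel mount using the options above; monitor discovery via ceph.conf, the client name, and the mount point are placeholders:
  # Prefer msgr2.1 crc mode and clean up the session after a blocklist event
  mount -t ceph :/ /mnt/cephfs \
        -o name=myclient,ms_mode=prefer-crc,recover_session=clean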
17
TELEMETRY AND CRASH REPORTS
● Public dashboards!
○ https://telemetry-public.ceph.com/
○ Clusters, devices
● Opt-in (commands below)
○ Will require re-opt-in if telemetry content
is expanded in the future
○ Explicitly acknowledge data sharing
license
● Telemetry channels
○ basic - cluster size, version, etc.
○ crash - anonymized crash metadata
○ device - device health (SMART) data
○ ident - contact info (off by default!)
● Initial focus on crash reports
○ Integration with bug tracker
○ Daily reports on top crashes in the wild
○ Fancy (internal) dashboard
● Extensive device dashboard
○ See which HDD and SSD models ceph
users are deploying
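Opting in and selecting channels looks roughly like this; the license string follows the opt-in prompt, and the ident channel stays off unless explicitly enabled:
  # Preview exactly what would be reported, then opt in
  ceph telemetry show
  ceph telemetry on --license sharing-1-0
  # Channels can be toggled individually, e.g. contact info (off by default)
  ceph config set mgr mgr/telemetry/channel_ident true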
18
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
19
RADOS: BLUESTORE
● RocksDB sharding
○ Reduced disk space requirements
● Hybrid allocator
○ Lower memory use and disk fragmentation
● Better space utilization for small objects
○ 4K min_alloc_size for SSDs and HDDs (config check below)
● More efficient caching
○ Better use of available memory
● Finer-grained memory tracking
○ Improved accounting of current usage
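The new small-object defaults can be checked as below; note that min_alloc_size only takes effect when an OSD is created, so existing OSDs keep their original value (a hedged sketch):
  # Default allocation unit applied to newly created OSDs
  ceph config get osd bluestore_min_alloc_size_hdd
  ceph config get osd bluestore_min_alloc_size_ssd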
20
● Phase 1: QoS between recovery and client I/O using mclock scheduler
○ Different profiles to prioritize client I/O, recovery, and background tasks (config sketch below)
■ Config sets to hide complexity of tuning dmclock and recovery parameters
○ Better default values for Ceph parameters to get improved performance out of the system
based on extensive testing on SSDs
○ Pacific!
● Phase 2: Quincy
○ Optimize performance for HDDs
○ Account for background activities like scrubbing, PG deletion etc.
○ Further testing across different types of workloads
● Phase 3: client vs client QoS
RADOS: QoS
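Switching an OSD to the mclock scheduler and picking a profile is roughly as follows; the scheduler may already be the default on your build, and the profile names follow the mclock documentation:
  # Use the mclock scheduler and favor client I/O over background work
  ceph config set osd osd_op_queue mclock_scheduler        # takes effect on OSD restart
  ceph config set osd osd_mclock_profile high_client_ops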
21
● High-performance rewrite of the OSD
● Recovery/backfill implemented
● Scrub state machine added, laying the groundwork for a scrub implementation in the
Crimson OSD
● Initial prototype of SeaStore in place
○ Targets both ZNS (zone-based) and traditional SSDs
○ Onode Tree implementation
○ Omap
○ LBA mappings
● Ability to run simple RBD workloads today
● Compatibility layer to run legacy BlueStore code
CRIMSON PROJECT
22
● Ephemeral pinning (policy based subtree pinning)
○ Distributed pins automatically shard sub-directories (think: /home); see the setfattr example below
○ Random pins shard descendant directories probabilistically
● Improved capability/cache management by MDS for large clusters
○ Cap recall defaults improved based on larger production clusters (CERN)
○ Capability acquisition throttling for some client workloads
● Asynchronous unlink/create (partial MDS support since Octopus)
○ Miscellaneous fixes and added testing
○ Kernel v5.7 and downstream in RHEL 8 / CentOS 8 (Stream)
○ libcephfs/ceph-fuse support in-progress
CEPHFS PERFORMANCE
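Ephemeral pins are set with extended attributes on a mounted file system; the paths and the 1% random-pin probability below are placeholders:
  # Shard the immediate children of /home across the active MDS ranks
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home
  # Probabilistically pin descendant directories (here roughly 1% of them)
  setfattr -n ceph.dir.pin.random -v 0.01 /mnt/cephfs/tmp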
23
RGW
● Avoid omap where unnecessary
○ FIFO queues for garbage collection
○ FIFO queues for data sync log
● Negative caching for bucket metadata
○ significant reduction in request latency for
many workloads
● Sync process performance improvements
○ Better state tracking and performance
when bucket sharding is enabled
● nginx authenticated HTTP front-cache
○ dramatically accelerate read-mostly
workloads
MISC PERFORMANCE
RBD
● librbd migration to boost::asio reactor
○ Event driven; uses neorados
○ May eventually allow tighter integration
with SPDK
24
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
25
● Replication targets (remote clusters) configured on any directory (commands sketched below)
● New cephfs-mirror daemon to replicate data
○ Managed by Rook or cephadm
● Snapshot-based
○ When snapshot is created on source cluster, it is replicated to remote cluster
● Initial implementation in Pacific
○ Single daemon
○ Inefficient incremental updates
○ Improvements will be backported
CEPHFS: SNAPSHOT-BASED MIRRORING
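Enabling mirroring for a file system and a directory looks roughly like this; the file system names, peer specification, and path are placeholders, and peers can also be added via the bootstrap commands:
  # Deploy the mirror daemon and enable mirroring on the source file system
  ceph orch apply cephfs-mirror
  ceph fs snapshot mirror enable cephfs01
  # Register the remote cluster/file system and pick a directory to mirror
  ceph fs snapshot mirror peer_add cephfs01 client.mirror_remote@site-b cephfs01
  ceph fs snapshot mirror add cephfs01 /projects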
26
● Current multi-site supports
○ Federate multiple sites
○ Global bucket/user namespace
○ Async data replication at site/zone granularity
○ Bucket granularity replication
● Pacific adds:
○ Testing and QA to move bucket-granularity replication out of experimental status (see sketch below)
○ Foundation to support bucket resharding in multi-site environment
● Large-scale refactoring
○ Extensive multisite roadmap… lots of goodness should land in Quincy
RGW: PER-BUCKET REPLICATION
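Per-bucket replication is expressed through the multisite sync-policy commands; a heavily simplified sketch in which the group/pipe IDs and bucket name are placeholders and a zonegroup-level policy with status 'allowed' is assumed to exist already:
  # Enable replication for a single bucket on top of the zonegroup policy
  radosgw-admin sync group create --bucket=mybucket \
        --group-id=mybucket-group --status=enabled
  radosgw-admin sync group pipe create --bucket=mybucket \
        --group-id=mybucket-group --pipe-id=pipe1 \
        --source-zones='*' --dest-zones='*'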
27
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
28
ROOK
● Stretch clusters
○ Configure storage in two datacenters, with a mon in a third location with higher latency
○ 5 mons
○ Pools with replication 4 (2 replicas in each datacenter)
● CephFS mirroring
○ Manage CephFS mirroring via CRDs
○ New, simpler snapshot-based mirroring
29
CSI / OPENSTACK MANILA
● RWX/ROX -- CephFS via mgr/volumes
○ New PV snapshot stabilization
■ Limits in place for snapshots on subvolumes.
○ New authorization API support (Manila)
○ New ephemeral pinning for volumes
● RWO/RWX/ROX -- RBD
○ dm-crypt encryption with Vault key management
○ PV snapshots and clones
○ Topology-aware provisioning
○ Integration with snapshot-based mirroring in-progress
30
● Removing instances of racially charged terms
○ blacklist/whitelist
○ master/slave
○ Some librados APIs/CLI affected: deprecated old calls, will remove in future
● Ongoing documentation improvements
● https://ceph.io website redesign
○ Static site generator, built from git (no more WordPress)
○ Should launch this spring
MISC
31
● (Consistent) CI and release builds
○ Thank you to Ampere for donated build hardware
● Container builds
● Limited QA/regression testing coverage
ARM: AARCH64
32
NEXT UP IS
QUINCY
33
● Ceph Developer Summit - March or April
○ Quincy planning
○ Traditional format (scheduled topical sessions, video + chat, recorded)
● Ceph Month - April or May
○ Topic per week (Core, RGW, RBD, CephFS)
○ 2-3 scheduled talks spread over the week
■ Including developers presenting what is new, what’s coming
○ Each talk followed by semi-/un-structured discussion meetup
■ Etherpad agenda
■ Open discussion
■ Opportunity for new/existing users/operators to compare notes
○ Lightning talks
○ Fully virtual (video + chat, recorded)
VIRTUAL EVENTS THIS SPRING
34
● https://ceph.io/
● Twitter: @ceph
● Docs: http://docs.ceph.com/
● Mailing lists: http://lists.ceph.io/
○ ceph-announce@ceph.io → announcements
○ ceph-users@ceph.io → user discussion
○ dev@ceph.io → developer discussion
● IRC: irc.oftc.net
○ #ceph, #ceph-devel
● GitHub: https://github.com/ceph/
● YouTube ‘Ceph’ channel
FOR MORE INFORMATION
35
36
● RBD NVMe-oF target support
● Cephadm resource-aware scheduling, CIFS gateway
QUINCY SNEAK PEEK
37
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
38
● cephadm improvements
○ CIFS/SMB support
○ Resource-aware service placement (memory, CPU)
○ Moving services away from failed hosts
○ Improved scalability/responsiveness for large clusters
● Rook improvements
○ Better integration with orchestrator API
○ Parity with cephadm
ORCHESTRATION
39
RGW
● Deduplicated storage
CephFS
● ‘fs top’
● NFS and SMB support via orchestrator
MISC USABILITY AND FEATURES
RBD
● Expose snapshots via RGW (object)
● Expose RBD via NVMe-oF target gateway
● Improved rbd-nbd support
○ Expose kernel block device with full librbd
feature set
○ Improved integration with ceph-csi for
Kubernetes environments
40
● Multi-site monitoring & management (RBD, RGW, CephFS) and multi-cluster
support (single dashboard managing multiple Ceph clusters).
● RGW advanced features management (bucket policies, lifecycle, encryption,
notifications…)
● High-level user workflows:
○ Cluster installation wizard
○ Cluster upgrades
● Improved observability
○ Log aggregation
DASHBOARD
41
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
42
RADOS
● Enable ‘upmap’ balancer by default
○ More precise than ‘crush-compat’ mode
○ Hands-off by default
○ Improve balancing of ‘primary’ role
● Dynamically adjust recovery priority
based on load
● Automatic periodic security key rotation
● Distributed tracing framework
○ For end-to-end performance analysis
STABILITY AND ROBUSTNESS
CephFS
● MultiMDS metadata balancing
improvements
● Minor version upgrade improvements
43
● Work continues on backend analysis of telemetry data
○ Tools for developers to use crash reports to identify and prioritize bug fixes
● Adjustments in collected data
○ Adjust what data is collected for Pacific
○ Periodic backport to Octopus (with re-opt-in)
○ e.g., which orchestrator module is in use (if any)
● Drive failure prediction
○ Building improved models for predictive drive failures
○ Expanding data set via Ceph collector, standalone collector, and other data sources
TELEMETRY
44
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
45
CephFS
● Async metadata operations
○ Support in both libcephfs and the kernel client
○ Async rmdir/mkdir
● Ceph-fuse performance
○ Take advantage of recent libfuse changes
MISC PERFORMANCE
RGW
● Data sync optimizations, sync fairness
● Sync metadata improvements
○ omap -> cls_fifo
○ Bucket index, metadata+data logs
● Ongoing async refactoring of RGW
○ Based on boost::asio
46
● Sharded RocksDB
○ Improve compaction performance
○ Reduce disk space requirements
● In-memory cache improvements
● SMR
○ Support for host-managed SMR HDDs
○ Targeting cold-storage workloads (e.g., RGW) only
RADOS: BLUESTORE
47
PROJECT CRIMSON
What
● Rewrite the IO path using Seastar
○ Preallocate cores
○ One thread per core
○ Explicitly shard all data structures
and work over cores
○ No locks and no blocking
○ Message passing between cores
○ Polling for IO
● DPDK, SPDK
○ Kernel bypass for network and
storage IO
● Goal: Working prototype for Pacific
Why
● Not just about how many IOPS we do…
● More about IOPS per CPU core
● Current Ceph is based on traditional
multi-threaded programming model
● Context switching is too expensive when
storage is almost as fast as memory
● New hardware devices coming
○ DIMM form-factor persistent memory
○ ZNS - zone-based SSDs
48
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
49
CEPHFS MULTI-SITE REPLICATION
● Automate periodic snapshot + sync to remote cluster
○ Arbitrary source tree, destination in remote cluster
○ Sync snapshots via rsync
○ May support non-CephFS targets
● Discussing more sophisticated models
○ Bidirectional, loosely/eventually consistent sync
○ Simple conflict resolution behavior?
50
● Nodes scale up (faster, bigger)
● Clusters scale out
○ Bigger clusters within a site
● Organizations scale globally
○ Multiple sites, data centers
○ Multiple public and private clouds
○ Multiple units within an organization
MOTIVATION: OBJECT
● Universal, global connectivity
○ Access your data from anywhere
● API consistency
○ Write apps to a single object API (e.g., S3)
regardless of which site or cloud it is
deployed on
● Disaster recovery
○ Replicate object data across sites
○ Synchronously or asynchronously
○ Failover application and reattach
○ Active/passive and active/active
● Migration
○ Migrate data set between sites, tiers
○ While it is being used
● Edge scenarios (caching and buffering)
○ Cache remote bucket locally
○ Buffer new data locally
51
● Project Zipper
○ Internal abstractions to allow alternate
storage backends (e.g., store data in an
external object store)
○ Policy layer based on Lua
○ Initial targets: database and file-based
stores, tiering to cloud (e.g., S3)
● Dynamic bucket resharding in multisite deployments
● Sync from external sources
○ AWS
● Lifecycle transition to cloud
RGW MULTISITE FOR QUINCY
52
RBD
● Consistency group support
MISC MULTI-SITE
53
Usability
Performance
Ecosystem
Multi-site
Quality
FIVE THEMES
54
Windows
● Windows port for RBD is underway
● Lightweight kernel pass-through to librbd
● CephFS to follow (based on Dokan)
Performance testing hardware
● Intel test cluster: officianalis
● AMD / Samsung / Mellanox cluster
● High-end ARM-based system?
OTHER ECOSYSTEM EFFORTS
ARM (aarch64)
● Loads of new build and test hardware
arriving in the lab
● CI and release builds for aarch64
IBM Z
● Collaboration with IBM Z team
● Build and test
