4. CEPHALOCON
● Cephalocon APAC
○ Mar 2018
○ Community organized (not us!)
○ ~1000 attendees
○ Overwhelming amount of technical content
○ Highlighted huge opportunity to build developer community in APAC
● Next Cephalocon
○ Europe in the spring!
○ 2 days, 3-4 tracks
○ Finalizing plans (venue and timing…)
5. CEPH DAYS
● One-day, regional events
○ https://ceph.com/cephday
● Upcoming
○ Ceph Day Berlin - Nov 12 (day before OpenStack Summit)
7. AUTOMATION AND MANAGEMENT
● Focus on “hands off” operation
● Hidden/automated pg_num selection
○ Enable pg_num decreases as well as increases
○ Automated, hands-off management of pool pg_num based on utilization, workload, etc.
● Automated tuning
○ Manage cache sizes, configurables, etc based on user-provided memory envelope
○ Conditional defaults of performance-related functions based on device types
● Additional guard rails
○ ‘ceph osd safe-to-destroy’, ‘ok-to-stop’ checks
○ Include safe-to-destroy check in ‘ceph osd destroy/purge’
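For example, the guard rails in practice (OSD id and output are illustrative):
$ ceph osd safe-to-destroy 12
OSD(s) 12 are safe to destroy without reducing data durability.
$ ceph osd ok-to-stop 12
$ ceph osd purge 12 --yes-i-really-mean-it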
8. TELEMETRY AND INSIGHTS
● Phone-home via upstream telemetry or downstream Insights
● Centralized collection of crash reports
○ Alerting for transient failures (daemon crash + restart)
○ Phoned home to track failures and bugs in the wild, prioritize bugs, etc.
● Enablement for proactive/preemptive support
● Disk failure prediction
○ Preemptive evacuation of failing devices
○ Self-contained prediction or higher-quality prediction via SaaS gRPC - thanks to Rick Chen @ProphetStor
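For the upstream path, enabling phone-home is a couple of commands (a sketch using the Nautilus-era mgr telemetry module; the report can be reviewed before opting in):
$ ceph mgr module enable telemetry
$ ceph telemetry show    # preview the anonymized report
$ ceph telemetry on      # opt in to periodic reporting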
10. NEW IN MIMIC
● Centralized Config
○ Stored on monitors - one place to update, validate, see history
○ Manageable by the Dashboard in future
● Ceph-volume
○ Replacement for ceph-disk - no udev, predictable, no longer prone to race conditions
● Asynchronous Recovery
○ No blocking I/O for recovery
○ Better client I/O performance during recovery
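Centralized config in practice (a minimal sketch; option and values are illustrative):
$ ceph config set osd osd_max_backfills 2    # stored on the mons, applied cluster-wide
$ ceph config get osd.3 osd_max_backfills    # what one daemon will actually use
$ ceph config dump                           # current central configuration
$ ceph config log 10                         # recent changes (history/audit)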
12. HARDWARE IS CHANGING
● The world is moving to flash and NVMe
○ Substantially lower latencies and higher throughputs
○ Capacities rivaling HDDs
○ Massively lower $/IOPS
○ $/bit parity with HDDs is still a few years away
● HDD-based storage is becoming niche space
○ WD just shut down a factory this week
● Storage software must adapt to survive
○ Not only top-line performance (max IOPS); also IOPS per CPU core
13. PROJECT CRIMSON
● Reimplementing Ceph OSD data path
○ Kefu, Casey (Red Hat)
○ Chunmei, Lisa, Yingxin (Intel)
● Seastar (from Scylla community)
○ run to completion model
○ explicit sharding of data and processing across CPU cores
● DPDK, SPDK bring network and storage drivers into userspace
● Current status
○ Working messenger, various infrastructure pieces (e.g., config mgmt, auth)
○ Shared caches and simplified data path coming next
○ Initial prototypes will be against MemStore (non-blocking)
● Kefu is presenting progress at Scylla Summit 2018 (Nov 6-7 in SF)
14. OTHER PERFORMANCE
● New logging infrastructure (lttng-based)
● Sample-based tracing
○ OpenTracing, Jaeger
● OpTracker improvements
● Mutex -> std::mutex
○ Compile out lockdep for production builds
● Auth signature check optimizations
● BlueStore allocator improvements
15. MSGR2
● New on-wire protocol
○ Improved protocol feature negotiation
○ Multiplexing (maybe, eventually)
○ IANA-assigned port number for mons (3300, hex 0xce4 = “ce4h”)
● Encryption over the wire
● Probably no signature support?
○ fast mode (with no cryptographic integrity checks) or
○ secure mode (full encryption)
● Dual-stack support (IPv4 + IPv6)
● Kerberos authentication
○ Use kerberos credentials to map to Ceph roles, issue Ceph CLI commands
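A sketch of dual-protocol addressing and mode selection (address format and option name assumed from the current msgr2 proposal):
# ceph.conf: advertise both the new v2 port (3300) and the legacy v1 port
mon_host = [v2:10.0.0.1:3300,v1:10.0.0.1:6789]
$ ceph config set global ms_cluster_mode secure   # assumed option: full encryption between daemons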
16. MISC
● OSD memory stability
○ PG log length limited
○ OSD internally adjusts caches to stay within bounds
● QoS
○ min, max, and priority-based limits and reservations
● Improved introspection, utilization metrics (Nautilus)
● Future: ‘ceph top’
○ Sampling-based real-time view of client workload
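The OSD memory envelope is a single knob (value illustrative; caches shrink or grow to stay inside it):
$ ceph config set osd osd_memory_target 4294967296   # ~4 GiB per OSD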
20. RGW / OBJECT
● Ongoing performance and scalability improvements
● Security (Nautilus)
○ Cloud-based (STS, keystone) and enterprise (OPA, kerberos/AD)
● pub/sub API (Nautilus)
○ Recently prototyped for OpenWhisk project; targets Nautilus
● Sync from public cloud (Nautilus)
○ RGW sync to S3 added in Mimic; this adds other direction
● Tiering
○ Push individual objects to external cloud (Nautilus)
○ Push entire buckets to external cloud
21. MULTI- AND HYBRID CLOUD
● Next generation of applications will primarily consume object storage
○ Block is great for backing VMs and containers,
○ and file will serve legacy workloads and even scale to huge data sets,
○ but most cat pictures/videos/etc will land in objects
● Modern IT infrastructure spans multiple data centers, public/private clouds
● In the public cloud, it will be hard to beat native storage pricing (e.g., S3)
● RGW should expand to encompass “data services”
○ Data portability (especially paired with application portability)
○ Data placement (vs capacity, bandwidth, compliance/regulatory regimes)
○ Lifecycle management
○ Introspection (what am I storing and where?)
○ Policy and automation for all of the above
23. MULTI-MDS ROBUSTNESS
● Multi-MDS stable and supported since Luminous
○ Mimic makes adding/removing MDS easier.
● Snapshots stable and supported since Mimic.
○ Requires v4.17+ kernel for kclient.
● Lots of work remaining to make dynamic load balancer more robust
○ Generate realistic workloads at scale (many MDS daemons, lots of RAM for each)
○ Balancer tuning
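Growing or shrinking the active MDS count is just a matter of max_mds (file system name illustrative):
$ ceph fs set cephfs max_mds 2    # add a second active metadata server
$ ceph fs set cephfs max_mds 1    # shrink back; Mimic deactivates the extra rank automatically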
24. SNAPSHOTS STABLE
● Directory (subtree) snapshots anywhere in the file system hierarchy.
● Note: for the kernel client, use the latest kernel.
● Mimic changes the default allow_new_snaps to true; existing file systems must turn the flag on after upgrade.
Credit to Zheng Yan (Red Hat).
$ cat foo/file1
Hello
$ mkdir /cephfs/foo/.snap/2017-10-06
$ echo "World!" >> foo/file1
$ cat foo/file1
Hello
World!
$ cat foo/.snap/2017-10-06/file1
Hello
[diagram: directory tree with / containing foo/file1 and bar/file2]
25. (KERNEL) QUOTA SUPPORT
● Byte limit:
$ setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
● File limit:
$ setfattr -n ceph.quota.max_files -v 10000 /some/dir
● Caveat: limits are enforced eventually.
● Kernel changes not yet merged upstream
Cooperatively fixed by Luis Henriques (SUSE) and Zheng Yan (Red Hat).
[diagram: directory tree with per-directory quota configs, e.g. dir1 with max_files=200, max_bytes=10485760 and dir2 with max_files=20, max_bytes=1048576]
26. VOLUME MANAGEMENT & NFS GATEWAYS
● Bring volume/subvolume management into ceph-mgr
○ Change ceph-volume-client.py to simply wrap new functions
○ “Formalize” volume/subvolume concepts
○ Modify Rook, Kubernetes provisioners, Manila to all consume/share the same interface
● Scale-out NFS
○ Cluster-managed ganesha gateways with active/active
○ Robust cluster-coherent NFSv4 recovery
○ Protect Ceph from untrusted clients
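The mgr-side CLI is still being defined; a hedged sketch of how it might look (command and names are assumptions):
$ ceph fs volume create myvol               # hypothetical: file system + pools in one step
$ ceph fs subvolume create myvol share1     # hypothetical: carve out a subvolume for a share/tenant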
27. cephfs-shell
● New alternative client that doesn’t require mounting CephFS
● Outreachy project, Summer 2018, by Pavani Rajula
$ cephfs-shell
CephFS:~/>>> mkdir foo
CephFS:~/>>> cd foo
CephFS:~/foo>>> put /etc/hosts hosts
CephFS:~/foo>>> cat hosts
127.0.0.1 localhost.localdomain localhost
...
28. PERFORMANCE
● Buffered create/unlink by clients
○ Use write/buffer capability on directory
○ Pass out allocated inodes to each session for creating files
○ Asynchronously flush create/unlink to MDS
● Sharding metadata within an MDS (“sub-ranks”) to scale across cores
○ Mostly lock-less operation is a design goal
○ Fast zero-copy metadata/cap export between sub-ranks
30. ROOK
● Native, robust operator for kubernetes and openshift
● Intelligent deployment of Ceph daemons
○ e.g., add/remove/move mon daemons while maintaining quorum
○ e.g., intelligently schedule RGW/iSCSI/NFS gateways across nodes
● Integration with SDN functions
○ e.g., schedule Ganesha NFS gateways and attach them to tenant Neutron networks via Kuryr
● Upgrade orchestration
○ Update Rook operator pod (triggered via CLI or dashboard?)
○ Rook updates Ceph daemons in prescribed order, gated with health, safety, availability checks
○ Rook manages any release-specific steps (like forcing scrubs, declaring upgrade “done”, etc.)
● Existing and happy user community, CNCF member project
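A minimal sketch of the Rook flow (manifest names follow the examples in the Rook repo; paths may vary):
$ kubectl create -f operator.yaml    # start the Rook-Ceph operator
$ kubectl create -f cluster.yaml     # declare the desired cluster; the operator deploys mons, mgr, OSDs
$ kubectl -n rook-ceph get pods      # watch the daemons come up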
31. K8S STRATEGY
● Align Ceph with kubernetes community interest/adoption (Rook, CNCF)
● Enable early kubernetes/openshift adopters (IT users, not IT)
○ Provision Rook/Ceph clusters layered over existing infrastructure
● Displace legacy storage on-prem for kubernetes
○ Allow multiple kube clusters, Rook instances to share an external Ceph cluster
● Enable kubernetes as a service (e.g., provided by IT)
○ Enable multi-tenant workflows for Rook storage classes (e.g., Pools)
● Maximize consistency of experience on public cloud
● Expose underlying Ceph federation capabilities (especially object)
● Also: use kubernetes “under the hood” for standalone Ceph
33. DASHBOARD
● Converged community investment on built-in web dashboard
○ Hybrid of openATTIC and John’s original dashboard proof of concept
○ Self-hosted by ceph-mgr with easy, tight integration with other cluster management and automation functions
● Currently mostly “logical” cluster functions
○ Management of Ceph services (pools, RBD images, file systems, configuration, etc.)
○ Subsuming ceph-metrics and openATTIC grafana metrics
● Orchestrator abstraction
○ Allow ceph-mgr and dashboard (or CLI, APIs) to drive ansible, Rook, DeepSea, etc
○ Provision or deprovision Ceph daemons, add/remove nodes, replace disks, etc.
○ Abstracts/hides choice of orchestration layer, enabling generalized automation, GUI, docs, UX
● Indirectly laying foundation for stable and versioned management API
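Enabling the built-in dashboard today takes a couple of commands (Mimic-era syntax; credentials are illustrative):
$ ceph mgr module enable dashboard
$ ceph dashboard create-self-signed-cert
$ ceph dashboard set-login-credentials admin s3cr3t
$ ceph mgr services    # prints the dashboard URL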
36. CEPH-CSI
● Replace upstream Kubernetes and Rook flexvol with ceph-csi
● Development driven by CERN, Cisco, with help from Huamin (Red Hat)
● Stretch goal is to replace Rook’s flexvol with ceph-csi in next 0.9 release
37. MULTI-CLUSTER CEPHFS
● Geo-replication
○ Loosely-consistent and point-in-time consistent DR replication (Nautilus)
○ Active/active async replication, with associated consistency caveats?
● Sync and share (NextCloud) integration
○ Concurrent access via usual POSIX (kcephfs, NFS, etc.) mounts and NextCloud to the same files, with revisions