What's New with Ceph - Ceph Day Silicon Valley


Presented by Neha Ojha and Patrick Donnelly

Published in: Software

  1. 1. WHAT’S NEW IN CEPH Ceph Day Silicon Valley at The University of California Santa Cruz Silicon Valley Campus Neha Ojha, Patrick Donnelly
  2. 2. PRIORITIES ● Community ● Management and usability ● Performance ● Core Ceph ○ RADOS ○ RBD ○ RGW ○ CephFS ● Container platforms ● Dashboard
  4. 4. CEPHALOCON ● Cephalocon APAC ○ Mar 2018 ○ Community organized (not us!) ○ ~1000 attendees ○ Overwhelming amount of technical content ○ Highlighted huge opportunity to build developer community in APAC ● Next Cephalocon ○ Europe in the spring! ○ 2 days, 3-4 tracks ○ Finalizing plans (venue and timing…)
  5. 5. CEPH DAYS ● One day, regional events ● Upcoming ○ Ceph Day Berlin - Nov 12 (day before OpenStack Summit)
  7. 7. AUTOMATION AND MANAGEMENT ● Focus on “hands off” operation ● Hidden/automated pg_num selection ○ Enable pg_num decreases as well as increases ○ Automated, hands-off management of pool pg_num based on utilization, workload, etc. ● Automated tuning ○ Manage cache sizes, configurables, etc based on user-provided memory envelope ○ Conditional defaults of performance-related functions based on device types ● Additional guard rails ○ ‘ceph osd safe-to-destroy’, ‘ok-to-stop’ checks ○ Include safe-to-destroy check in ‘ceph osd destroy/purge’
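The slide does not say how an automated pg_num would be chosen. As a rough sketch only, the following assumes a target of about 100 PGs per OSD and rounding to the nearest power of two; both are assumptions taken from a common rule of thumb, not from the talk, and `suggest_pg_num` is a hypothetical name rather than a Ceph API. It shows why supporting decreases matters: when a pool's share of the cluster shrinks, the suggested value shrinks too.

```python
def suggest_pg_num(num_osds, pool_capacity_ratio, replica_count,
                   target_pgs_per_osd=100, min_pg_num=8):
    """Hypothetical heuristic for a pool's pg_num (not Ceph's actual algorithm).

    pool_capacity_ratio is the fraction of cluster capacity the pool is
    expected to use (0.0 to 1.0).
    """
    if num_osds < 1 or replica_count < 1:
        raise ValueError("num_osds and replica_count must be >= 1")
    if not 0.0 <= pool_capacity_ratio <= 1.0:
        raise ValueError("pool_capacity_ratio must be between 0 and 1")

    # Each PG is stored replica_count times, so divide the PG budget by it.
    raw = num_osds * target_pgs_per_osd * pool_capacity_ratio / replica_count
    if raw <= min_pg_num:
        return min_pg_num

    # Round to the nearest power of two.
    lower = 1 << (int(raw).bit_length() - 1)
    upper = lower * 2
    return lower if (raw - lower) < (upper - raw) else upper


# The same 10-OSD cluster with 3x replication: the suggestion drops as the
# pool's share of the data drops from 50% to 10%.
print(suggest_pg_num(10, 0.5, 3))
print(suggest_pg_num(10, 0.1, 3))
```

A real implementation would also need hysteresis so that small utilization changes do not trigger repeated PG splits and merges.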
  8. 8. TELEMETRY AND INSIGHTS ● Phone-home via upstream telemetry or downstream Insights ● Centralized collection of crash reports ○ Alerting for transient failures (daemon crash + restart) ○ Phoned home to track failures and bugs in the wild, prioritize bugs, etc. ● Enablement for proactive/preemptive support ● Disk failure prediction ○ Preemptive evacuation of failing devices ○ Self-contained prediction or higher quality prediction via SaaS GRPC - thanks to Rick Chen @ProphetStor
  9. 9. RADOS
  10. 10. NEW IN MIMIC ● Centralized Config ○ Stored on monitors - one place to update, validate, see history ○ Manageable by the Dashboard in future ● Ceph-volume ○ Replacement for ceph-disk - no udev, predictable, no longer race condition prone ● Asynchronous Recovery ○ No blocking I/O for recovery ○ Better client I/O performance during recovery
  12. 12. HARDWARE IS CHANGING ● The world is moving to flash and NVMe ○ Substantially lower latencies and higher throughputs ○ Capacities rivaling HDDs ○ Massively lower $/IOPS ○ $/bit is still a few years away ● HDD-based storage is becoming niche space ○ WD just shut down a factory this week ● Storage software must adapt to survive ○ Not only top-line performance (max IOPS); also IOPS per CPU core
  13. 13. PROJECT CRIMSON ● Reimplementing Ceph OSD data path ○ Kefu, Casey (Red Hat) ○ Chunmei, Lisa, Yingxin (Intel) ● Seastar (from Scylla community) ○ run to completion model ○ explicit sharding of data and processing across CPU cores ● DPDK, SPDK bring network and storage drivers into userspace ● Current status ○ Working messenger, various infrastructure pieces (e.g., config mgmt, auth) ○ Shared caches and simplified data path coming next ○ Initial prototypes will be against MemStore (non-blocking) ● Kefu is presenting progress at Scylla Summit 2018 (Nov 6-7 in SF)
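To illustrate what "explicit sharding of data and processing across CPU cores" can look like, here is a minimal sketch. It assumes objects are assigned to a shard by hashing their name; the slide does not state how Crimson assigns work to cores, and `shard_for` and `partition` are hypothetical helpers, not Seastar or Ceph functions.

```python
import zlib
from collections import defaultdict


def shard_for(object_name, num_shards):
    """Map an object name to a shard (one shard per CPU core, by assumption)."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(object_name.encode("utf-8")) % num_shards


def partition(object_names, num_shards):
    """Group object names by owning shard.

    In a run-to-completion design each shard would then process only its
    own group, so no locks are needed between shards.
    """
    shards = defaultdict(list)
    for name in object_names:
        shards[shard_for(name, num_shards)].append(name)
    return dict(shards)


names = ["obj%d" % i for i in range(100)]
parts = partition(names, 4)
for shard_id in sorted(parts):
    print(shard_id, len(parts[shard_id]))
```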
  14. 14. OTHER PERFORMANCE ● New logging infrastructure (lttng-based) ● Sample-based tracing ○ OpenTracing, Jaeger ● OpTracker improvements ● Mutex -> std::mutex ○ Compile out lockdep for production builds ● Auth signature check optimizations ● BlueStore allocator improvements
  15. 15. MSGR2 ● New on-wire protocol ○ Improved protocol feature negotiation ○ Multiplexing (maybe, eventually) ○ IANA-assigned port number for mons (ce4h = 3300) ● Encryption over the wire ● Probably no signature support? ○ fast mode (with no cryptographic integrity checks) or ○ secure mode (full encryption) ● Dual-stack support (IPv4 + IPv6) ● Kerberos authentication ○ Use kerberos credentials to map to Ceph roles, issue Ceph CLI commands
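The "ce4h = 3300" note on the slide is a hexadecimal pun: reading `ce4` as a base-16 number gives the monitor port. A quick check:

```python
# "ce4h" means the hexadecimal value ce4; convert it to decimal.
port = int("ce4", 16)
print(port)        # 3300
print(hex(3300))   # 0xce4
```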
  16. 16. MISC ● OSD memory stability ○ PG log length limited ○ OSD internally adjusts caches to stay within bounds ● QoS ○ min, max, and priority-based limits and reservations ● Improved introspection, utilization metrics (Nautilus) ● Future: ‘ceph top’ ○ Sampling-based real-time view of client workload
  17. 17. RBD
  18. 18. BLOCK ● Improved orchestration/management of async mirroring ○ Point-in-time consistent DR with failover/failback, mirrored snapshots, etc. ● Security namespaces for RBD ○ Simple RBAC-style CephX cap profiles ○ Basic provisioned space quotas ● Client-side caching - PR in progress ● Transparent live image migration ● Simplified pool- and image-level configuration overrides ● Future: ‘rbd top’ for real-time per-image workload
  19. 19. RGW
  20. 20. RGW / OBJECT ● Ongoing performance and scalability improvements ● Security (Nautilus) ○ Cloud-based (STS, keystone) and enterprise (OPA, kerberos/AD) ● pub/sub API (Nautilus) ○ Recently prototyped for OpenWhisk project; targets Nautilus ● Sync from public cloud (Nautilus) ○ RGW sync to S3 added in Mimic; this adds other direction ● Tiering ○ Push individual objects to external cloud (Nautilus) ○ Push entire buckets to external cloud
  21. 21. MULTI- AND HYBRID CLOUD ● Next generation of applications will primarily consume object storage ○ Block is great for backing VMs and containers, ○ and file will serve legacy workloads and even scale to huge data sets, ○ but most cat pictures/videos/etc will land in objects ● Modern IT infrastructure spans multiple data centers, public/private clouds ● In the public cloud, it will be hard to beat native storage pricing (e.g., S3) ● RGW should expand to encompass “data services” ○ Data portability (especially paired with application portability) ○ Data placement (vs capacity, bandwidth, compliance/regulatory regimes) ○ Lifecycle management ○ Introspection (what am I storing and where?) ○ Policy and automation for all of the above
  22. 22. CEPHFS
  23. 23. MULTI-MDS ROBUSTNESS ● Multi-MDS stable and supported since Luminous ○ Mimic makes adding/removing MDS easier. ● Snapshots stable and supported since Mimic. ○ Requires v4.17+ kernel for kclient. ● Lots of work remaining to make dynamic load balancer more robust ○ Generate realistic workloads at scale (many MDS daemons, lots of RAM for each) ○ Balancer tuning
  24. 24. SNAPSHOTS STABLE ● Directory (subtree) snapshots anywhere in the file system hierarchy. ● Note: for the kernel client, use the latest kernel. ● Mimic changes the default allow_new_snaps to true. Existing file systems must turn the flag on after upgrade. ● Credit to Zheng Yan (Red Hat).
      $ cat foo/file1
      Hello
      $ mkdir /cephfs/foo/.snap/2017-10-06
      $ echo "world!" >> foo/file1
      $ cat foo/file1
      Hello
      world!
      $ cat foo/.snap/2017-10-06/file1
      Hello
      [Diagram: directory tree with / containing foo (file1) and bar (file2)]
  25. 25. (KERNEL) QUOTA SUPPORT ● Byte limit: $ setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir ● File limit: $ setfattr -n ceph.quota.max_files -v 10000 /some/dir ● Caveat: limits are enforced eventually. ● Kernel changes not yet merged upstream ● Cooperatively fixed by Luis Henriques (SUSE) and Zheng Yan (Red Hat).
      [Diagram: / with dir1 (file1) and dir2 (file2); quota configs shown: max_files=20,max_bytes=1048576 and max_files=200,max_bytes=10485760]
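The same quota attributes can be set programmatically. The sketch below assumes a Linux host with the directory on a mounted CephFS; it uses Python's `os.setxattr` with the attribute names from the slide. `quota_xattrs` and `apply_quota` are hypothetical helper names introduced for illustration.

```python
import os


def quota_xattrs(max_bytes=None, max_files=None):
    """Build the extended attributes for a CephFS directory quota.

    Values are encoded as decimal byte strings, matching what
    `setfattr -v <number>` writes.
    """
    attrs = {}
    if max_bytes is not None:
        attrs["ceph.quota.max_bytes"] = str(int(max_bytes)).encode("ascii")
    if max_files is not None:
        attrs["ceph.quota.max_files"] = str(int(max_files)).encode("ascii")
    return attrs


def apply_quota(path, max_bytes=None, max_files=None):
    """Set the quota on `path`; only meaningful on a CephFS mount (Linux only)."""
    for name, value in quota_xattrs(max_bytes, max_files).items():
        os.setxattr(path, name, value)


# Equivalent to the two setfattr commands on the slide:
#   apply_quota("/some/dir", max_bytes=100000000, max_files=10000)
print(quota_xattrs(max_bytes=100000000, max_files=10000))
```

Because limits are enforced eventually, a writer can briefly exceed the configured value before the client notices, so the quota should not be treated as a hard security boundary.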
  26. 26. VOLUME MANAGEMENT & NFS GATEWAYS ● Bring volume/subvolume management into ceph-mgr ○ Change to simply wrap new functions ○ “Formalize” volume/subvolume concepts ○ Modify Rook, Kubernetes provisioners, Manila to all consume/share same interface ● Scale-out NFS ○ Cluster-managed ganesha gateways with active/active ○ Robust cluster-coherent NFSv4 recovery ○ Protect Ceph from untrusted clients
  27. 27. cephfs-shell ● New alternative client that doesn’t require mounting CephFS ● Outreachy project Summer 2018 by Pavani Rajula.
      $ cephfs-shell
      CephFS:~/>>> mkdir foo
      CephFS:~/>>> cd foo
      CephFS:~/foo>>> put /etc/hosts hosts
      CephFS:~/foo>>> cat hosts
      localhost.localdomain localhost ... ...
  28. 28. PERFORMANCE ● Buffered create/unlink by clients ○ Use write/buffer capability on directory ○ Pass out allocated inodes to each session for creating files ○ Asynchronously flush create/unlink to MDS ● Sharding metadata within an MDS (“sub-ranks”) to scale across cores ○ Mostly lock-less is a design goal ○ Fast zero-copy metadata/cap export between sub-ranks
  30. 30. ROOK ● Native, robust operator for kubernetes and openshift ● Intelligent deployment of Ceph daemons ○ e.g., add/remove/move mon daemons while maintaining quorum ○ e.g., intelligently schedule RGW/iSCSI/NFS gateways across nodes ● Integration with SDN functions ○ e.g., schedule Ganesha NFS gateways and attach them to tenant Neutron networks via Kuryr ● Upgrade orchestration ○ Update Rook operator pod (triggered via CLI or dashboard?) ○ Rook updates Ceph daemons in prescribed order, gated with health, safety, availability checks ○ Rook manages any release-specific steps (like forcing scrubs, declaring upgrade “done”, etc.) ● Existing and happy user community, CNCF member project
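The upgrade flow on this slide (daemons updated in a prescribed order, each step gated on health checks) can be sketched as a simple loop. The slide does not spell out the order; the mon, mgr, osd, mds, rgw sequence below is an assumption based on the usual Ceph upgrade guidance, and `run_upgrade` and its callbacks are hypothetical, not Rook APIs.

```python
# Assumed order; the slide only says "prescribed order".
UPGRADE_ORDER = ["mon", "mgr", "osd", "mds", "rgw"]


def run_upgrade(daemons, is_healthy, upgrade):
    """Upgrade daemons type by type, stopping as soon as a health gate fails.

    daemons:    list of (daemon_type, name) tuples
    is_healthy: callable returning True when it is safe to continue
    upgrade:    callable(daemon_type, name) that performs one upgrade

    Returns (names_upgraded, completed). Daemon types not listed in
    UPGRADE_ORDER are left untouched.
    """
    upgraded = []
    for daemon_type in UPGRADE_ORDER:
        for dtype, name in daemons:
            if dtype != daemon_type:
                continue
            if not is_healthy():
                return upgraded, False
            upgrade(dtype, name)
            upgraded.append(name)
    return upgraded, True


cluster = [("osd", "osd.0"), ("mon", "mon.a"), ("mgr", "mgr.x"), ("osd", "osd.1")]
done, ok = run_upgrade(cluster, lambda: True, lambda t, n: None)
print(done, ok)
```

Stopping at the first failed gate leaves the cluster in a mixed but known state, which an operator can resume from once health is restored.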
  31. 31. K8S STRATEGY ● Align Ceph with kubernetes community interest/adoption (Rook, CNCF) ● Enable early kubernetes/openshift adopters (IT users, not IT) ○ Provision Rook/Ceph clusters layered over existing infrastructure ● Displace legacy storage on prem for kubernetes ○ Allow multiple kube clusters, Rook instances to share an external Ceph cluster ● Enable kubernetes as a service (e.g., provided by IT) ○ Enable multi-tenant workflows for Rook storage classes (e.g., Pools) ● Maximize consistency of experience on public cloud ● Expose underlying Ceph federation capabilities (especially object) ● Also: use kubernetes “under the hood” for standalone Ceph
  33. 33. DASHBOARD ● Converged community investment on built-in web dashboard ○ Hybrid of openATTIC and John’s original dashboard proof of concept ○ Self-hosted by ceph-mgr with easy, tight integration with other cluster management and automation functions ● Currently mostly “logical” cluster functions ○ Management of Ceph services (pools, RBD images, file systems, configuration, etc.) ○ Subsuming ceph-metrics and openATTIC grafana metrics ● Orchestrator abstraction ○ Allow ceph-mgr and dashboard (or CLI, APIs) to drive ansible, Rook, DeepSea, etc ○ Provision or deprovision Ceph daemons, add/remove nodes, replace disks, etc. ○ Abstracts/hides choice of orchestration layer, enabling generalized automation, GUI, docs, UX ● Indirectly laying foundation for stable and versioned management API
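The orchestrator abstraction described on this slide amounts to one interface with several interchangeable backends. The sketch below is illustrative only: the class and method names are hypothetical and do not reflect the actual ceph-mgr orchestrator interface, and the recording backend stands in for a real ansible, Rook, or DeepSea driver.

```python
from abc import ABC, abstractmethod


class Orchestrator(ABC):
    """Hypothetical interface a dashboard or CLI would call."""

    @abstractmethod
    def add_daemon(self, daemon_type, host):
        """Provision one daemon of the given type on a host."""

    @abstractmethod
    def remove_daemon(self, daemon_type, host):
        """Deprovision one daemon of the given type from a host."""


class RecordingOrchestrator(Orchestrator):
    """Stand-in backend that only records the requested actions."""

    def __init__(self):
        self.calls = []

    def add_daemon(self, daemon_type, host):
        self.calls.append(("add", daemon_type, host))

    def remove_daemon(self, daemon_type, host):
        self.calls.append(("remove", daemon_type, host))


def add_osds(orchestrator, hosts):
    """Caller code depends only on the interface, not on the backend."""
    for host in hosts:
        orchestrator.add_daemon("osd", host)


backend = RecordingOrchestrator()
add_osds(backend, ["node1", "node2"])
print(backend.calls)
```

Because `add_osds` never names a backend, the same automation, GUI, and documentation can sit on top of whichever orchestration layer a site has chosen.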
  34. 34. Questions? Neha Ojha - Patrick Donnelly -
  35. 35. Appendix
  36. 36. CEPH-CSI ● Replace upstream Kubernetes and Rook flexvol with ceph-csi ● Development driven by CERN, Cisco, with help from Huamin (Red Hat) ● Stretch goal is to replace Rook’s flexvol with ceph-csi in next 0.9 release
  37. 37. MULTI-CLUSTER CEPHFS ● Geo-replication ○ Loosely-consistent and point-in-time consistent DR replication (Nautilus) ○ Active/active async replication, with associated consistency caveats? ● Sync and share (NextCloud) integration ○ Concurrent access via usual POSIX (kcephfs, NFS, etc) mounts and NextCloud to same files, with revisions