
What's new in Jewel and Beyond

Ceph project status update presented at Ceph Day Suisse at CERN.

  1. WHAT'S NEW IN JEWEL AND BEYOND – SAGE WEIL – CEPH DAY CERN – 2016.06.14
  2. AGENDA
     ● New in Jewel
     ● BlueStore
     ● Kraken and Luminous
  3. RELEASES
     ● Hammer v0.94.x (LTS) – March '15
     ● Infernalis v9.2.x – November '15
     ● Jewel v10.2.x (LTS) – April '16
     ● Kraken v11.2.x – November '16
     ● Luminous v12.2.x (LTS) – April '17
  4. JEWEL v10.2.x – APRIL 2016
  5. [Architecture diagram: the Ceph stack]
     ● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
     ● LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (used by APPs)
     ● RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift (used by APPs)
     ● RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver (used by HOSTs/VMs)
     ● CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (used by CLIENTs)
     ● Status labels from the diagram: CephFS marked NEARLY AWESOME, the rest AWESOME
  6. 2016 = FULLY AWESOME
     ● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
     ● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
     ● RGW (OBJECT) – a web services gateway for object storage, compatible with S3 and Swift
     ● RBD (BLOCK) – a reliable, fully-distributed block device with cloud platform integration
     ● CEPHFS (FILE) – a distributed file system with POSIX semantics and scale-out metadata management
  7. CEPHFS
  8. CEPHFS: STABLE AT LAST
     ● Jewel recommendations
       – single active MDS (+ many standbys)
       – snapshots disabled
     ● Repair and disaster recovery tools
     ● CephFSVolumeManager and Manila driver
     ● Authorization improvements (confine client to a directory)
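     A minimal sketch of the directory confinement, assuming a hypothetical client name (client.webapp), directory (/webapp), data pool (cephfs_data), and namespace (webapp); the exact capability syntax has shifted between releases:

       $ ceph auth get-or-create client.webapp \
           mon 'allow r' \
           mds 'allow rw path=/webapp' \
           osd 'allow rw pool=cephfs_data namespace=webapp'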
  9. [Diagram: Linux host with the kernel module (or ceph-fuse, Samba, Ganesha) talking to a RADOS cluster – separate data and metadata paths, monitors (M)]
  10. SCALING FILE PERFORMANCE
      ● Data path is direct to RADOS
        – scale the IO path by adding OSDs
        – or use SSDs, etc.
      ● No restrictions on file count or file system size
        – MDS cache performance is related to the size of the active set, not the total file count
      ● Metadata performance
        – provide lots of RAM for MDS daemons (no local on-disk state needed)
        – use SSDs for the RADOS metadata pool
      ● Metadata path is scaled independently
        – up to 128 active metadata servers tested; 256 possible
        – in Jewel, only 1 is recommended
        – stable multi-active MDS coming in Kraken or Luminous
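      For reference, one plausible way to raise the active MDS count once multi-active is supportable (the file system name cephfs is assumed; flag spellings differ between releases and may require an extra confirmation flag, and Jewel still recommends a single active MDS):

        $ ceph fs set cephfs allow_multimds true
        $ ceph fs set cephfs max_mds 2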
  11. DYNAMIC SUBTREE PARTITIONING (diagram built up across slides 11–15)
  16. POSIX AND CONSISTENCY
      ● CephFS has “consistent caching”
        – clients can cache data
        – caches are coherent
        – MDS invalidates data that is changed
        – complex locking/leasing protocol
      ● This means clients never see stale data of any kind
        – consistency is much stronger than, say, NFS
      ● file locks are fully supported
        – flock and fcntl locks
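      Because flock and fcntl locks are fully supported, standard tools work unchanged; a trivial sketch using flock(1) against a hypothetical lock file and command:

        $ flock -x /cephfs/shared/app.lock -c "update-shared-state"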
  17. RSTATS
      # ext4 reports dirs as 4K
      $ ls -lhd /ext4/data
      drwxrwxr-x. 2 john john 4.0K Jun 25 14:58 /home/john/data
      # cephfs reports dir size from contents
      $ ls -lhd /cephfs/mydata
      drwxrwxr-x. 1 john john 16M Jun 25 14:57 ./mydata
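      The same recursive statistics are also exposed as virtual extended attributes; for example, on the hypothetical directory from the slide:

        $ getfattr -n ceph.dir.rbytes /cephfs/mydata     # recursive byte count
        $ getfattr -n ceph.dir.rentries /cephfs/mydata   # recursive file + dir count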
  18. OTHER GOOD STUFF
      ● Directory fragmentation
        – shard directories for scaling, performance
        – disabled by default in Jewel; on by default in Kraken
      ● Snapshots
        – create a snapshot on any directory:
          mkdir some/random/dir/.snap/mysnapshot
          ls some/random/dir/.snap/mysnapshot
        – disabled by default in Jewel; hopefully on by default in Luminous
      ● Security authorization model
        – confine a client mount to a directory and to a RADOS pool namespace
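      One plausible way to opt in to the two disabled-by-default features above (file system name cephfs assumed; the toggle names and whether they live under ceph fs set or ceph mds set have moved around between releases, so treat this purely as a sketch):

        $ ceph fs set cephfs allow_dirfrags true
        $ ceph fs set cephfs allow_new_snaps true --yes-i-really-mean-it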
  19. FSCK AND RECOVERY
      ● metadata scrubbing
        – online operation
        – manually triggered in Jewel
        – automatic background scrubbing coming in Kraken, Luminous
      ● disaster recovery tools
        – rebuild the file system namespace from scratch if RADOS loses it or something corrupts it
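      A sketch of the manually triggered scrub and the offline rebuild path, assuming an MDS named mds.a and a data pool named cephfs_data:

        # online forward scrub via the MDS admin socket
        $ ceph daemon mds.a scrub_path / recursive
        # offline namespace rebuild from the data pool (disaster recovery)
        $ cephfs-data-scan scan_extents cephfs_data
        $ cephfs-data-scan scan_inodes cephfs_data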
  20. OPENSTACK MANILA FSaaS
      ● CephFS native
        – Jewel and Mitaka
        – CephFSVolumeManager to orchestrate shares
          ● CephFS directories
          ● with quota
          ● backed by a RADOS pool + namespace
          ● and clients locked into the directory
        – VM mounts the CephFS directory (ceph-fuse, kernel client, …)
        – tenant VM talks directly to the Ceph cluster; deploy with caution
  21. OTHER JEWEL STUFF
  22. GENERAL
      ● daemons run as the ceph user
        – except upgraded clusters that don't want to chown -R
      ● SELinux support
      ● all systemd
      ● ceph-ansible deployment
      ● ceph CLI bash completion
      ● “calamari on mons”
  23. BUILDS
      ● aarch64 builds
        – centos7, ubuntu xenial
      ● armv7l builds
        – debian jessie
        – http://ceph.com/community/500-osd-ceph-cluster/
  24. RBD
  25. RBD IMAGE MIRRORING
      ● image mirroring
        – asynchronous replication to another cluster
        – replica(s) crash consistent
        – replication is per-image
        – each image has a data journal
        – rbd-mirror daemon does the work
      [Diagram: QEMU/librbd writing an image + journal in cluster A; QEMU/librbd with the image in cluster B; rbd-mirror between the two clusters]
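      Roughly, per-image journaling plus a pool-level peer drives this; a sketch using an assumed pool rbd, image myimage, and peer cluster name remote:

        $ rbd feature enable rbd/myimage journaling       # per-image data journal
        $ rbd mirror pool enable rbd pool                 # mirror all journaled images in the pool
        $ rbd mirror pool peer add rbd client.mirror@remote
        # an rbd-mirror daemon running against the backup cluster replays the journals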
  26. OTHER RBD STUFF
      ● fast-diff
        – use the object-map to do an O(1)-time diff
      ● deep flatten
        – separate a clone from its parent while retaining snapshot history
      ● dynamic features
        – turn on/off: exclusive-lock, object-map, fast-diff, journaling
        – turn off: deep-flatten
        – useful for compatibility with the kernel client, which lacks some new features
      ● new default features
        – layering, exclusive-lock, object-map, fast-diff, deep-flatten
      ● snapshot rename
      ● rbd du
      ● improved/rewritten CLI (with dynamic usage/help)
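      For example, toggling features and using the object-map-backed accounting on an assumed image rbd/myimage:

        $ rbd feature enable rbd/myimage exclusive-lock object-map fast-diff
        $ rbd feature disable rbd/myimage deep-flatten    # deep-flatten can only be turned off
        $ rbd du rbd/myimage                              # provisioned vs used; fast when fast-diff is enabled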
  27. RGW
  28. NEW IN RGW
      ● Rewritten multi-site capability
        – N zones, N-way sync
        – fail-over and fail-back
        – simpler configuration
      ● NFS interface
        – export a bucket over NFSv4
        – designed for import/export of data
        – not a general-purpose file system!
        – based on nfs-ganesha
      ● Indexless buckets
        – bypass the RGW index for certain buckets (that don't need enumeration, quota, ...)
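      The rewritten multi-site setup is configured through radosgw-admin; a much-abbreviated sketch with assumed realm/zonegroup/zone names and endpoint (a real setup also needs system users and secondary zones):

        $ radosgw-admin realm create --rgw-realm=gold --default
        $ radosgw-admin zonegroup create --rgw-zonegroup=us --endpoints=http://rgw1:80 --master --default
        $ radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east --endpoints=http://rgw1:80 --master --default
        $ radosgw-admin period update --commit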
  29. RGW API UPDATES
      ● S3
        – AWS4 authentication support
        – LDAP and AD/LDAP support
        – static website
        – RGW STS (coming shortly)
          ● Kerberos, AD integration
      ● Swift
        – Keystone V3
        – multi-tenancy
        – object expiration
        – Static Large Object (SLO)
        – bulk delete
        – object versioning
        – refcore compliance
  30. RADOS
  31. RADOS
      ● queuing improvements
        – new IO scheduler “wpq” (weighted priority queue) stabilizing
        – (more) unified queue (client IO, scrub, snaptrim, most of recovery)
        – somewhat better client vs recovery/rebalance isolation
      ● mon scalability and performance improvements
        – thanks to testing here @ CERN
      ● optimizations, performance improvements (faster on SSDs)
      ● AsyncMessenger
        – new implementation of the networking layer
        – fewer threads, friendlier to the allocator (especially tcmalloc)
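      Both the wpq scheduler and the AsyncMessenger can be selected via configuration; an assumed ceph.conf excerpt (option names as I understand them, a daemon restart is needed):

        [global]
        ms type = async          # AsyncMessenger networking implementation
        [osd]
        osd op queue = wpq       # weighted priority queue IO scheduler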
  32. MORE RADOS
      ● no more ext4
      ● cache tiering improvements
        – proxy write support
        – promotion throttling
        – better, but still not good enough for RBD and EC base
      ● SHEC erasure code (thanks to Fujitsu)
        – trade some extra storage for recovery performance
      ● [test-]reweight-by-utilization improvements
        – better data distribution optimization
      ● BlueStore
        – new experimental backend
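      The reweighting keeps its dry-run form; for example, previewing and then applying a reweight of OSDs more than 10% above the mean utilization:

        $ ceph osd test-reweight-by-utilization 110   # dry run, report proposed changes
        $ ceph osd reweight-by-utilization 110        # apply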
  33. BLUESTORE
  34. BLUESTORE: NEW BACKEND GOALS
      ● no more double writes to a full data journal
      ● efficient object enumeration
      ● efficient clone operation
      ● efficient splice (“move these bytes from object X to object Y”)
        – (will enable EC overwrites for RBD over EC)
      ● efficient IO pattern for HDDs, SSDs, NVMe
      ● minimal locking, maximum parallelism (between PGs)
      ● full data and metadata checksums
      ● inline compression (zlib, snappy, etc.)
  35. BLUESTORE
      ● BlueStore = Block + NewStore
        – consume raw block device(s)
        – key/value database (RocksDB) for metadata
        – data written directly to the block device
        – pluggable block Allocator
      ● We must share the block device with RocksDB
        – implement our own rocksdb::Env (BlueRocksEnv)
        – implement a tiny “file system”, BlueFS
        – make BlueStore and BlueFS share the device
      [Diagram: ObjectStore → BlueStore; data goes straight to the BlockDevice, metadata goes through RocksDB → BlueRocksEnv → BlueFS onto BlockDevices, with a shared Allocator]
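      In Jewel, BlueStore is experimental and has to be explicitly allowed before OSDs can be created on it; an assumed ceph.conf excerpt plus the ceph-disk invocation as I recall it (device name is hypothetical):

        # ceph.conf (excerpt)
        [global]
        enable experimental unrecoverable data corrupting features = bluestore
        [osd]
        osd objectstore = bluestore

        # then prepare an OSD on a raw device
        $ ceph-disk prepare --bluestore /dev/sdb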
  36. WE WANT FANCY STUFF (built up across slides 36–37)
      Full data checksums
      ● We scrub... periodically
      ● We want to validate the checksum on every read
      ● More metadata in the blobs
        – 4KB of 32-bit csum metadata for a 4MB object and 4KB blocks
        – larger csum blocks?
        – smaller csums (8 or 16 bits)?
      Compression
      ● 3x replication is expensive
      ● Any scale-out cluster is expensive
      ● Need largish extents to get a compression benefit (64 KB, 128 KB)
        – overwrites occlude/obscure compressed blobs
        – compacted (rewritten) when >N layers deep
  38. HDD: SEQUENTIAL WRITE
      [Chart: Ceph 10.1.0 BlueStore vs FileStore, sequential writes – throughput (MB/s) vs IO size, FS HDD vs BS HDD]
  39. HDD: RANDOM WRITE
      [Charts: Ceph 10.1.0 BlueStore vs FileStore, random writes – throughput (MB/s) and IOPS vs IO size, FS HDD vs BS HDD]
  40. HDD: SEQUENTIAL READ
      [Chart: Ceph 10.1.0 BlueStore vs FileStore, sequential reads – throughput (MB/s) vs IO size, FS HDD vs BS HDD]
  41. KRAKEN AND LUMINOUS
  42. RADOS
      ● BlueStore!
      ● erasure code overwrites (RBD + EC)
      ● ceph-mgr
        – new mon-like daemon
        – management API endpoint (Calamari)
        – metrics
      ● config management in the mons
      ● on-the-wire encryption
      ● OSD IO path optimization
      ● faster peering
      ● QoS
      ● ceph-disk support for dm-cache/bcache/FlashCache/...
  43. RGW
      ● AWS STS (Kerberos support)
      ● pluggable full-zone syncing
        – tiering to tape
        – tiering to cloud
        – metadata indexing (elasticsearch?)
      ● S3 encryption API
      ● compression
      ● performance
  44. RBD
      ● RBD mirroring improvements
        – cooperative daemons
        – Cinder integration
      ● RBD client-side persistent cache
        – write-through and write-back cache
        – ordered writeback → crash consistent on loss of cache
      ● RBD consistency groups
      ● client-side encryption
      ● kernel RBD improvements
  45. CEPHFS
      ● multi-active MDS and/or snapshots
      ● Manila hypervisor-mediated FSaaS
        – NFS over VSOCK → libvirt-managed Ganesha server → libcephfs FSAL → CephFS cluster
        – new Manila driver
        – new Nova API to attach shares to VMs
      ● Samba and Ganesha integration improvements
      ● richacl (ACL coherency between NFS and CIFS)
  46. OTHER COOL STUFF
      ● librados backend for RocksDB
        – and RocksDB is now a backend for MySQL...
      ● PMStore
        – Intel OSD backend for 3D XPoint
      ● multi-hosting on IPv4 and IPv6
      ● ceph-ansible
      ● ceph-docker
  47. GROWING DEVELOPMENT COMMUNITY
      ● SUSE ● Mirantis ● XSKY ● Intel ● Fujitsu ● DreamHost ● ZTE ● SanDisk ● Gentoo ● Samsung ● LETV ● Igalia ● Deutsche Telekom ● Kylin Cloud ● Endurance International Group ● H3C ● Johannes Gutenberg-Universität Mainz ● Reliance Jio Infocomm ● Ebay ● Tencent
  48. THANK YOU! – Sage Weil, Ceph Principal Architect – sage@redhat.com – @liewegas
