
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage


Ceph is a highly scalable open source distributed storage system that provides object, block, and file interfaces on a single platform. Although Ceph RBD block storage has dominated OpenStack deployments for several years, maturing object (S3, Swift, and librados) interfaces and stable CephFS (file) interfaces now make Ceph the only fully open source unified storage platform.

This talk will cover Ceph's architectural vision and project mission and how our approach differs from alternative approaches to storage in the OpenStack ecosystem. In particular, we will look at how our open development model dovetails well with OpenStack, how major contributors are advancing Ceph capabilities and performance at a rapid pace to adapt to new hardware types and deployment models, and what major features we are prioritizing for the next few years to meet the needs of expanding cloud workloads.


Ceph, Now and Later: Our Plan for Open Unified Cloud Storage

  1. 1. OUR VISION FOR OPEN UNIFIED CLOUD STORAGE SAGE WEIL – RED HAT OPENSTACK SUMMIT BARCELONA – 2016.10.27
  2. 2. 2 ● Now – why we develop ceph – why it's open – why it's unified – hardware – openstack – why not ceph ● Later – roadmap highlights – the road ahead – community OUTLINE
  3. 3. 3 JEWEL
  4. 4. 4 RADOS – A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes LIBRADOS – A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP RBD – A reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver RADOSGW – A bucket-based REST gateway, compatible with S3 and Swift CEPH FS – A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE [diagram: APP, APP, HOST/VM, and CLIENT consumers on top; RADOS, LIBRADOS, RBD, and RADOSGW marked AWESOME, CEPH FS marked NEARLY AWESOME]
  5. 5. 5 2016 = FULLY AWESOME ● RGW – S3 and Swift compatible object storage with object versioning, multi-site federation, and replication ● LIBRADOS – A library allowing apps direct access to RADOS (C, C++, Java, Python, Ruby, PHP) ● RADOS – A software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons) ● RBD – A virtual block device with snapshots, copy-on-write clones, and multi-site replication ● CEPHFS – A distributed POSIX file system with coherent caches and snapshots on any directory (OBJECT / BLOCK / FILE)
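As a concrete taste of the LIBRADOS layer in the stack above, here is a minimal sketch using the Python binding; the ceph.conf path and the pool name 'rbd' are assumptions about a local test cluster.

    import rados

    # Connect to the cluster described by the local ceph.conf (path assumed).
    with rados.Rados(conffile='/etc/ceph/ceph.conf') as cluster:
        # 'rbd' is just an example pool name; any existing pool works.
        with cluster.open_ioctx('rbd') as ioctx:
            ioctx.write_full('hello-object', b'stored directly in RADOS')
            print(ioctx.read('hello-object'))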
  6. 6. 6 WHAT IS CEPH ALL ABOUT ● Distributed storage ● All components scale horizontally ● No single point of failure ● Software ● Hardware agnostic, commodity hardware ● Object, block, and file in a single cluster ● Self-manage whenever possible ● Open source (LGPL)
  7. 7. 7 WHY OPEN SOURCE ● Avoid software vendor lock-in ● Avoid hardware lock-in – and enable hybrid deployments ● Lower total cost of ownership ● Transparency ● Self-support ● Add, improve, or extend
  8. 8. 8 WHY UNIFIED STORAGE ● Simplicity of deployment – single cluster serving all APIs ● Efficient utilization of storage ● Simplicity of management – single set of management skills, staff, tools ...is that really true?
  9. 9. 9 WHY UNIFIED STORAGE ● Simplicity of deployment – single cluster serving all APIs ● Efficient utilization of space ● Only really true for small clusters – Large deployments optimize for workload ● Simplicity of management – single set of skills, staff, tools ● Functionality in RADOS exploited by all – Erasure coding, tiering, performance, monitoring
  10. 10. 10 SPECIALIZED CLUSTERS – Multiple Ceph clusters, one per use case (HDD, SSD, NVMe) [diagram: several separate RADOS clusters, each with its own monitors]
  11. 11. 11 HYBRID CLUSTER – Single Ceph cluster – Separate RADOS pools for different storage hardware types (HDD, SSD, NVMe) [diagram: one RADOS cluster with monitors spanning mixed device types]
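One way to set up such per-device-type pools programmatically is librados' monitor command interface, sketched below; the pool names and pg_num are illustrative, and the CRUSH rules that actually pin each pool to a hardware class are not shown.

    import json
    import rados

    # Hedged sketch: create one pool per device class via the monitor
    # command interface. Mapping each pool to its hardware is done with
    # CRUSH rules, which are omitted here.
    with rados.Rados(conffile='/etc/ceph/ceph.conf') as cluster:
        for pool in ('hdd-pool', 'ssd-pool', 'nvme-pool'):
            cmd = json.dumps({'prefix': 'osd pool create',
                              'pool': pool, 'pg_num': 64})
            ret, outbuf, outs = cluster.mon_command(cmd, b'')
            print(pool, ret, outs)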
  12. 12. 12 CEPH ON FLASH ● Samsung – up to 153 TB in 2u – 700K IOPS, 30 GB/s ● SanDisk – up to 512 TB in 3u – 780K IOPS, 7 GB/s
  13. 13. 13 CEPH ON SSD+HDD ● SuperMicro ● Dell ● HP ● Quanta / QCT ● Fujitsu (Eternus) ● Penguin OCP
  14. 14. 14 CEPH ON HDD (LITERALLY) ● Microserver on 8 TB WD helium HDD ● 1 GB RAM ● 2-core 1.3 GHz Cortex-A9 ● 2x 1GbE SGMII instead of SATA ● 504-OSD, 4 PB cluster – SuperMicro 1048-RT chassis ● Next gen will be aarch64
  15. 15. 15 FOSS → FAST AND OPEN INNOVATION ● Free and open source software (FOSS) – enables innovation in software – ideal testbed for new hardware architectures ● Proprietary software platforms are full of friction – harder to modify, experiment with closed-source code – require business partnerships and NDAs to be negotiated – dual investment (of time and/or $$$) from both parties
  16. 16. 16 PERSISTENT MEMORY ● Persistent (non-volatile) RAM is coming – e.g., 3D XPoint from Intel/Micron any year now – (and NV-DIMMs are a bit $$$ but already here) ● These are going to be – fast (~1000x), dense (~10x), high-endurance (~1000x) – expensive (~1/2 the cost of DRAM) – disruptive ● Intel – PMStore – OSD prototype backend targeting 3D XPoint – NVM-L (pmem.io/nvml)
  17. 17. 17 CREATE THE ECOSYSTEM TO BECOME THE LINUX OF DISTRIBUTED STORAGE OPEN COLLABORATIVE GENERAL PURPOSE
  18. 18. 18 CEPH + OPENSTACK [diagram: OpenStack services – Keystone, Swift, Cinder, Glance, Nova (KVM), and Manila – consuming a RADOS cluster through RADOSGW, LIBRBD, and CEPH FS]
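Under the hood, the Glance/Cinder style of integration leans on RBD snapshots and copy-on-write clones: a base image snapshot is protected once and then cloned into new volumes. A rough sketch with the Python rbd binding follows; the pool names ('images', 'volumes') and image names are assumptions, and real drivers add configuration, error handling, and many more features.

    import rados
    import rbd

    # Rough sketch of the copy-on-write clone pattern; names are assumed.
    with rados.Rados(conffile='/etc/ceph/ceph.conf') as cluster:
        with cluster.open_ioctx('images') as images, \
             cluster.open_ioctx('volumes') as volumes:
            # Snapshot and protect a base image once...
            with rbd.Image(images, 'cirros') as base:
                base.create_snap('base-snap')
                base.protect_snap('base-snap')
            # ...then clone it into the volume pool; the clone is COW,
            # so creation is nearly instant and space-efficient.
            rbd.RBD().clone(images, 'cirros', 'base-snap',
                            volumes, 'volume-0001')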
  19. 19. 19 CINDER DRIVER USAGE
  20. 20. 20 WHY NOT CEPH? ● inertia ● performance ● functionality ● stability ● ease of use
  21. 21. 21 WHAT ARE WE DOING ABOUT IT?
  22. 22. 22 RELEASE CADENCE – Hammer (LTS, 0.94.z) Spring 2015 → Infernalis Fall 2015 → Jewel (LTS, 10.2.z) Spring 2016 → WE ARE HERE → Kraken Fall 2016 → Luminous (LTS, 12.2.z) Spring 2017
  23. 23. 23 KRAKEN
  24. 24. 24 BLUESTORE ● BlueStore = Block + NewStore – key/value database (RocksDB) for metadata – all data written directly to raw block device(s) – can combine HDD, SSD, NVMe, NVRAM ● Full data checksums (crc32c, xxhash) ● Inline compression (zlib, snappy, zstd) ● ~2x faster than FileStore – better parallelism, efficiency on fast devices – no double writes for data – performs well with very small SSD journals [diagram: OSD → ObjectStore → BlueStore, with RocksDB on BlueFS (via BlueRocksEnv), an Allocator, a Compressor, and raw block devices for data and metadata]
  25. 25. 25 BLUESTORE – WHEN CAN HAZ? ● Current master – nearly finalized disk format, good performance ● Kraken – stable code, stable disk format – probably still flagged 'experimental' ● Luminous – fully stable and ready for broad usage – maybe the default... we will see how it goes ● Migrating from FileStore – evacuate and fail old OSDs – reprovision with BlueStore – let Ceph recover normally
  26. 26. 26 ASYNCMESSENGER ● New implementation of network layer – replaces aging SimpleMessenger – fixed size thread pool (vs 2 threads per socket) – scales better to larger clusters – more healthy relationship with tcmalloc – now the default! ● Pluggable backends – PosixStack – Linux sockets, TCP (default, supported) – Two experimental backends!
  27. 27. 27 TCP OVERHEAD IS SIGNIFICANT [charts: writes, reads]
  28. 28. 28 DPDK STACK
  29. 29. 29 RDMA STACK ● RDMA for data path (ibverbs) ● Keep TCP for control path ● Working prototype in ~1 month
  30. 30. 30 LUMINOUS
  31. 31. 31 MULTI-MDS CEPHFS
  32. 32. 32 ERASURE CODE OVERWRITES ● Current EC RADOS pools are append only – simple, stable, suitable for RGW or behind a cache tier ● EC overwrites will allow RBD and CephFS to consume EC pools directly ● It's hard – implementation requires two-phase commit – complex to avoid full-stripe update for small writes – relies on efficient implementation of “move ranges” operation by BlueStore ● It will be huge – tremendous impact on TCO – make RBD great again
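To see why overwrites are hard, consider a toy single-parity (RAID-4 style) stripe instead of Ceph's real erasure-code plugins: changing one data chunk means the parity chunk must change too, and both updates have to land atomically, which is exactly where the two-phase commit comes in. A simplified illustration:

    # Toy single-parity (XOR) stripe, only to show why a small overwrite
    # touches more than the chunk being written. Ceph's EC plugins use
    # more general codes but face the same read-modify-write problem.
    def parity(chunks):
        p = bytes(len(chunks[0]))
        for c in chunks:
            p = bytes(a ^ b for a, b in zip(p, c))
        return p

    data = [b'AAAA', b'BBBB', b'CCCC']   # three data chunks on three OSDs
    p_old = parity(data)

    # Overwrite chunk 1: the new parity depends on the OLD chunk and OLD
    # parity, so both chunk and parity must be read and then rewritten
    # atomically; a crash between the two writes corrupts the stripe.
    new = b'XXXX'
    p_new = bytes(a ^ b ^ c for a, b, c in zip(p_old, data[1], new))
    data[1] = new
    assert p_new == parity(data)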
  33. 33. 33 CEPH-MGR ● ceph-mon monitor daemons currently do a lot – more than they need to (PG stats to support things like 'df') – this limits cluster scalability ● ceph-mgr moves non-critical metrics into a separate daemon – that is more efficient – that can stream to graphite, influxdb – that can efficiently integrate with external modules (even Python!) ● Good host for – integrations, like Calamari REST API endpoint – coming features like 'ceph top' or 'rbd top' – high-level management functions and policy M M ??? (time for new iconography)
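The "even Python!" hook above refers to mgr modules, Python classes loaded into the ceph-mgr daemon. Below is a hedged sketch of what such a module can look like; the class name, the specific self.get() keys, and the fields pulled out of them are illustrative, the interface shown (serve(), self.get(), self.log) is simplified, and the code only runs inside ceph-mgr.

    import time

    from mgr_module import MgrModule


    class Reporter(MgrModule):
        """Illustrative module that streams basic utilization numbers."""

        def serve(self):
            # ceph-mgr runs this in its own thread once the module loads.
            while True:
                df = self.get('df')            # cluster usage, like 'ceph df'
                osd_map = self.get('osd_map')  # current OSD map
                self.log.info('used bytes: %s, osds: %s',
                              df['stats']['total_used_bytes'],
                              len(osd_map['osds']))
                # ...push to graphite/influxdb here...
                time.sleep(60)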
  34. 34. 34 QUALITY OF SERVICE ● Set policy for both – reserved/minimum IOPS – proportional sharing of excess capacity ● by – type of IO (client, scrub, recovery) – pool – client (e.g., VM) ● Based on mClock paper from OSDI'10 – IO scheduler – distributed enforcement with cooperating clients
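For reference, the mClock scheduler cited above gives every request three tags (reservation, limit, and proportional share) and dispatches in two phases. The sketch below shows just that tag arithmetic and dispatch rule in Python; it is not Ceph's implementation and omits details from the paper such as idle-client tag adjustment and the reservation-tag correction during weight-based scheduling.

    class Client:
        def __init__(self, reservation, weight, limit):
            self.reservation = reservation   # minimum IOPS
            self.weight = weight             # share of spare capacity
            self.limit = limit               # maximum IOPS
            self.r_tag = self.l_tag = self.p_tag = 0.0
            self.queue = []                  # pending (r, l, p, request) tuples

        def submit(self, request, now):
            # Each request gets three tags, spaced 1/rate apart per control.
            self.r_tag = max(self.r_tag + 1.0 / self.reservation, now)
            self.l_tag = max(self.l_tag + 1.0 / self.limit, now)
            self.p_tag = max(self.p_tag + 1.0 / self.weight, now)
            self.queue.append((self.r_tag, self.l_tag, self.p_tag, request))


    def pick_next(clients, now):
        """Return (client, request) to dispatch, or None if nothing is eligible."""
        backlogged = [c for c in clients if c.queue]
        # Constraint-based phase: honor any reservation tag that has come due.
        due = [c for c in backlogged if c.queue[0][0] <= now]
        if due:
            c = min(due, key=lambda cl: cl.queue[0][0])
            return c, c.queue.pop(0)[3]
        # Weight-based phase: among clients still under their limit, pick the
        # smallest proportional-share tag.
        eligible = [c for c in backlogged if c.queue[0][1] <= now]
        if eligible:
            c = min(eligible, key=lambda cl: cl.queue[0][2])
            return c, c.queue.pop(0)[3]
        return None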
  35. 35. 35 SHARED STORAGE → LATENCY [diagram: a VM issues writes A B C D E F A' B' G H over Ethernet to shared storage]
  36. 36. 36 LOCAL SSD → LOW LATENCY [diagram: the same VM writes to a local SATA SSD instead, avoiding the network round trip]
  37. 37. 37 LOCAL SSD → FAILURE → SADNESS [diagram: when the local SSD or host fails, the data on it is lost]
  38. 38. 38 WRITEBACK IS UNORDERED [diagram: the SSD cache writes A' B' C D back to the cluster in no particular order]
  39. 39. 39 WRITEBACK IS UNORDERED [diagram: after a failure the cluster holds A' B' C D, a state that never happened on the VM]
  40. 40. 40 RBD ORDERED WRITEBACK CACHE [diagram: with some cleverness, writeback preserves order, so the cluster holds A B C D → CRASH CONSISTENT!]
  41. 41. 41 RBD ORDERED WRITEBACK CACHE ● Low latency writes to local SSD ● Persistent cache (across reboots etc) ● Fast reads from cache ● Ordered writeback to the cluster, with batching, etc. ● RBD image is always point-in-time consistent ● Trade-offs – Local SSD: low latency, SPoF, no data after a failure – Ceph RBD: higher latency, no SPoF, perfect durability – RBD + ordered writeback cache: low latency, SPoF only on recent writes, stale but crash consistent
  42. 42. 42 IN THE FUTURE MOST BYTES WILL BE STORED IN OBJECTS
  43. 43. 43 RADOSGW – S3, Swift, Erasure coding, Multisite federation, Multisite replication, WORM, Encryption, Tiering, Deduplication, Compression
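Because RGW speaks the S3 protocol, any standard S3 SDK can talk to it directly. A small example with boto3 follows; the endpoint URL and credentials are placeholders for a local gateway and user.

    import boto3

    # Point the standard AWS SDK at a RADOSGW endpoint (placeholder values).
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )
    s3.create_bucket(Bucket='demo')
    s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored in RADOS')
    print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())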
  44. 44. 44 RADOSGW INDEXING [diagram: zones A, B, and C – each a Ceph cluster with monitors, fronted by RADOSGW instances over LIBRADOS and reached via REST]
  45. 45. 45 GROWING DEVELOPMENT COMMUNITY
  46. 46. 46 ● Red Hat ● Mirantis ● SUSE ● SanDisk ● XSKY ● ZTE ● LETV ● Quantum ● EasyStack ● H3C ● UnitedStack ● Digiware ● Mellanox ● Intel ● Walmart Labs ● DreamHost ● Tencent ● Deutsche Telekom ● Igalia ● Fujitsu ● DigitalOcean ● University of Toronto GROWING DEVELOPMENT COMMUNITY
  52. 52. 52 ● Nobody should have to use proprietary storage systems out of necessity ● The best storage solution should be an open source solution ● Ceph is growing up, but we're not done yet – performance, scalability, features, ease of use ● We are highly motivated TAKEAWAYS
  53. 53. 53 Operator ● File bugs http://tracker.ceph.com/ ● Document http://github.com/ceph/ceph ● Blog ● Build a relationship with the core team Developer ● Fix bugs ● Help design missing functionality ● Implement missing functionality ● Integrate ● Help make it easier to use ● Participate in monthly Ceph Developer Meetings – video chat, EMEA- and APAC- friendly times HOW TO HELP
  54. 54. THANK YOU! Sage Weil CEPH PRINCIPAL ARCHITECT sage@redhat.com @liewegas
