Community Update at OpenStack Summit Boston
Community update for Luminous and beyond.

COMMUNITY UPDATE
SAGE WEIL – RED HAT
2017.05.11

UPSTREAM RELEASES
● Jewel (LTS) – Spring 2016 – 10.2.z
● Kraken – Fall 2016
● Luminous (LTS) – Spring 2017 – 12.2.z ← WE ARE HERE
● Mimic – Fall 2017?
● Nautilus? (LTS) – Spring 2018? – 14.2.z

LUMINOUS

BLUESTORE: STABLE AND DEFAULT
● New OSD backend
  – consumes raw block device(s) – no more XFS
  – embeds rocksdb for metadata
● Fast on both HDDs (~2x) and SSDs (~1.5x)
  – similar to FileStore on NVMe, where the device is not the bottleneck
● Smaller journals
  – happily uses fast SSD partition(s) for internal metadata, or NVRAM for journal
● Full data checksums (crc32c, xxhash, etc.)
● Inline compression (zlib, snappy)
  – policy driven by global or per-pool config, and/or client hints (example below)
● Stable and default

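As a minimal sketch of the per-pool compression policy mentioned above, assuming a pool named mypool (option names as documented for Luminous-era BlueStore):

    # enable inline compression for data landing on BlueStore OSDs in this pool
    ceph osd pool set mypool compression_algorithm snappy
    ceph osd pool set mypool compression_mode aggressive   # other modes: none, passive, force
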
HDD: RANDOM WRITE
(charts: “Bluestore vs Filestore HDD Random Write Throughput” in MB/s and “Bluestore vs Filestore HDD Random Write IOPS”, both as a function of IO size, comparing Filestore against several BlueStore builds: wip-bitmap-alloc-perf, master a07452d, master d62a4948, wip-bluestore-dw)

HDD: MIXED READ/WRITE
(charts: “Bluestore vs Filestore HDD Random RW Throughput” in MB/s and “Bluestore vs Filestore HDD Random RW IOPS”, both as a function of IO size, same Filestore and BlueStore builds as the previous slide)

RGW ON HDD+NVME, EC 4+2
(chart: “4+2 Erasure Coding RadosGW Write Tests” – 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; throughput in MB/s for Filestore and Bluestore with 512KB and 4MB chunks, across 1/4/128/512 buckets with 1 or 4 RGW servers, plus a rados bench baseline)

RBD OVER ERASURE CODED POOLS
● aka erasure code overwrites (setup sketch below)
● requires BlueStore to perform reasonably
● significant improvement in efficiency over 3x replication
  – 2+2 → 2x, 4+2 → 1.5x
● small writes slower than replication
  – early testing showed 4+2 is about half as fast as 3x replication
● large writes faster than replication
  – less IO to device
● implementation still does the “simple” thing
  – all writes update a full stripe

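An end-to-end sketch of the setup, with assumed pool and image names (EC overwrites require Luminous and BlueStore OSDs):

    # create an EC data pool and allow overwrites on it
    ceph osd pool create ecpool 64 64 erasure
    ceph osd pool set ecpool allow_ec_overwrites true
    # image metadata stays in the replicated 'rbd' pool; data goes to the EC pool
    rbd create --size 10G --data-pool ecpool rbd/myimage
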
CEPH-MGR
● ceph-mgr
  – new management daemon to supplement ceph-mon (monitor)
  – easier integration point for python management logic
  – integrated metrics
● make ceph-mon scalable again
  – offload pg stats from mon to mgr
  – push to 10K OSDs (planned “big bang 3” @ CERN)
● new REST API
  – pecan
  – based on previous Calamari API
● built-in web dashboard
  – webby equivalent of 'ceph -s' (module example below)
(diagram: M, G, ??? – time for new iconography)

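mgr functionality ships as python modules that are toggled at runtime; a hedged sketch, assuming the Luminous module names:

    ceph mgr module enable dashboard   # built-in web dashboard
    ceph mgr module enable restful     # REST API module
    ceph mgr services                  # list the endpoints the enabled modules serve
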
ASYNCMESSENGER
● new network Messenger implementation
  – event driven
  – fixed-size thread pool
● RDMA backend (ibverbs)
  – built by default
  – limited testing, but seems stable! (config sketch below)
● DPDK backend
  – prototype!

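Selecting a messenger backend is a ceph.conf setting; a hedged sketch, where the device name is an assumption for illustration:

    [global]
    ms_type = async+rdma                  # default is async+posix
    ms_async_rdma_device_name = mlx5_0    # assumed name of the RDMA-capable NIC
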
PERFECTLY BALANCED OSDS (FINALLY)
● CRUSH choose_args
  – alternate weight sets for individual rules
  – complete flexibility to optimize weights etc.
  – fixes two problems
    ● imbalance – run numeric optimization to adjust weights to balance PG distribution for a pool (or cluster)
    ● multipick anomaly – adjust weights per position to correct for low-weighted devices (e.g., mostly empty rack)
  – backward compatible with pre-luminous clients for the imbalance case
● pg upmap
  – explicitly map individual PGs to specific devices in the OSDMap
  – simple offline optimizer to balance PGs – by pool or by cluster
  – requires luminous+ clients (example below)

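A hedged sketch of the upmap workflow; the PG id, OSD ids, and output filename are made up for illustration:

    ceph osd set-require-min-compat-client luminous   # upmap entries need luminous+ clients
    ceph osd pg-upmap-items 1.7 12 37                 # remap PG 1.7: move the copy on osd.12 to osd.37
    # or let the offline optimizer generate a batch of such commands
    ceph osd getmap -o om
    osdmaptool om --upmap upmap.sh
    bash upmap.sh
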
RADOS MISC
● CRUSH device classes
  – mark OSDs with a class (hdd, ssd, etc.)
  – out-of-box rules to map to a specific class of devices within the same hierarchy (example below)
● streamlined disk replacement
● require_min_compat_client
  – simpler, safer configuration
● annotated/documented config options
● client backoff on stuck PGs or objects
● better EIO handling
● peering and recovery speedups
● fast OSD failure detection

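A hedged sketch of device classes in practice; the rule and pool names are assumptions, and classes are normally detected automatically:

    ceph osd crush set-device-class ssd osd.3                      # tag an OSD by hand if needed
    ceph osd crush rule create-replicated fast default host ssd    # rule restricted to the ssd class
    ceph osd pool set mypool crush_rule fast                       # point a pool at the new rule
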
RADOSGW
● S3 ● Swift ● Erasure coding ● Multisite federation ● Multisite replication ● NFS ● Encryption ● Tiering ● Deduplication ● Compression

RGW METADATA SEARCH
(diagram: three Ceph clusters A, B, and C, each with its own monitors and RADOSGW instances on LIBRADOS, serving zones A, B, and C and connected to one another over REST)

RGW MISC
● NFS gateway
  – NFSv4 and v3
  – full object access (not general purpose!)
● dynamic bucket index sharding
  – automatic (finally!) (example below)
● inline compression
● encryption
  – follows S3 encryption APIs
● S3 and Swift API odds and ends

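Resharding can also be driven by hand with radosgw-admin; a hedged sketch with an assumed bucket name and shard count:

    radosgw-admin bucket reshard --bucket=mybucket --num-shards=31   # manual reshard
    radosgw-admin reshard list                                       # inspect the resharding queue
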
RBD
● Erasure coding ● Multisite mirroring ● Persistent client cache ● Consistency groups ● Encryption ● iSCSI ● Trash

RBD
● RBD over erasure coded pool
  – rbd create --data-pool <ecpoolname> ...
● RBD mirroring improvements
  – cooperative HA daemons
  – improved Cinder integration
● iSCSI
  – LIO tcmu-runner, librbd (full feature set)
● Kernel RBD improvements
  – exclusive locking, object map

(Ceph architecture overview diagram)
● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
● LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
● CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
(consumers shown as APP, HOST/VM, and CLIENT; every component is labeled AWESOME except CEPH FS, labeled NEARLY AWESOME)

2017 =
● RGW – a web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD – a reliable, fully-distributed block device with cloud platform integration
● CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management
OBJECT – BLOCK – FILE: FULLY AWESOME

CEPHFS
● multiple active MDS daemons (finally!)
● subtree pinning to a specific daemon (example below)
● directory fragmentation on by default
  – (snapshots still off by default)
● so many tests
● so many bugs fixed
● kernel client improvements

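A hedged sketch of enabling a second active MDS and pinning a subtree; the filesystem name, mount point, and rank are assumptions:

    ceph fs set cephfs allow_multimds true               # required on Luminous before raising max_mds
    ceph fs set cephfs max_mds 2                         # two active MDS ranks
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects   # pin this subtree to rank 1
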
AFTER LUMINOUS?
● OSD refactor ● Tiering ● Client-side caches ● Metrics ● Dedup ● QoS ● Self-management ● Multi-site federation

MIMIC

RADOS
● Peering optimizations
● IO path refactor and optimization
  – async, state-driven, futures
  – painful but necessary
● BlueStore and rocksdb optimization
  – rocksdb level0 compaction
  – alternative KV store?
● Erasure coding plugin API improvements
  – new codes with less IO for single-OSD failures
(diagram: MESSENGER / OSD / OBJECTSTORE / DEVICE stack)

QUALITY OF SERVICE
● Ongoing background development
  – dmclock distributed QoS queuing
  – minimum reservations and priority weighting
● Range of policies
  – IO type (background, client)
  – pool-based
  – client-based
● Theory is complex
● Prototype is promising, despite simplicity
● Missing management framework

TIERING
● new RADOS ‘redirect’ primitive
  – basically a symlink, transparent to librados
  – replace “sparse” cache tier with base pool “index”
(diagram: APPLICATION → BASE POOL (REPLICATED, SSD) → SLOW 1 (EC), alongside the current APPLICATION → CACHE POOL (REPLICATED) → BASE POOL (HDD AND/OR ERASURE) layout)

DEDUPLICATION (WIP)
● Generalize redirect to a “manifest”
  – map of offsets to object “fragments” (vs a full object copy)
● Break objects into chunks
  – fixed size, or content fingerprint
● Store chunks in a content-addressable pool
  – name object by sha256(content)
  – reference count chunks
● TBD
  – inline or post?
  – policy (when to promote inline, etc.)
  – agent home (embed in OSD, …)
(diagram: APPLICATION → BASE POOL (REPLICATED, SSD) → SLOW 1 (EC) and a CAS DEDUP POOL)

CEPH-MGR: METRICS, MGMT
● Metrics aggregation
  – short-term time series out of the box
  – no persistent state
  – streaming to external platform (Prometheus, …)
● Host self-management functions
  – automatic CRUSH optimization
  – identification of slow devices
  – steer IO away from busy devices
  – device failure prediction
(diagram: mon (M) and mgr (G) – consistency vs. metrics summary)

ARM
● aarch64 builds
  – centos7, ubuntu xenial
  – have some, but awaiting more build hardware in the community lab
● thank you to partners!
● ppc64
● armv7l
  – http://ceph.com/community/500-osd-ceph-cluster/

WDLABS MICROSERVER UPDATE
Gen 2 → Gen 3:
● ARM 32-bit → ARM 64-bit
● uBoot → UEFI
● 1GB DDR → up to 4GB DDR
● 2x 1GbE SGMII → 2x 2.5GbE SGMII
● Basic security → enhanced security
● No flash option → flash option
● I2C management → integrated management
Planning Gen 3 availability to PoC partners – contact james.wilshire@wdc.com

CLIENT CACHES!
● RGW
  – persistent read-only cache on NVMe
  – fully consistent (only caches immutable “tail” rados objects)
  – Mass Open Cloud
● RBD
  – persistent read-only cache of immutable clone parent images
  – writeback cache for improving write latency
    ● cluster image remains crash-consistent if the client cache is lost
● CephFS
  – kernel client already uses the kernel fscache facility

GROWING DEVELOPER COMMUNITY

GROWING DEVELOPER COMMUNITY
● Red Hat ● Mirantis ● SUSE ● ZTE ● China Mobile ● XSky ● Digiware ● Intel ● Kylin Cloud ● Easystack ● Istuary Innovation Group ● Quantum ● Mellanox ● H3C ● UnitedStack ● Deutsche Telekom ● Reliance Jio Infocomm ● OVH ● Alibaba ● DreamHost ● CERN
(the same list appears again on the following slides under the headings OPENSTACK VENDORS, CLOUD OPERATORS, HARDWARE AND SOLUTION VENDORS, and ???)

GET INVOLVED
● Mailing list and IRC
  – http://ceph.com/IRC
● Ceph Developer Monthly
  – first Weds of every month
  – video conference (Bluejeans)
  – alternating APAC- and EMEA-friendly times
● Github
  – https://github.com/ceph/
● Ceph Days
  – http://ceph.com/cephdays/
● Meetups
  – http://ceph.com/meetups
● Ceph Tech Talks
  – http://ceph.com/ceph-tech-talks/
● ‘Ceph’ YouTube channel
  – (google it)
● Twitter
  – @ceph

THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
sage@redhat.com
@liewegas
