PLNOG16: Ceph distributed storage in OVH
Paweł Sadowski
March 1st, 2016
Agenda
1 Ceph distributed storage
2 Architecture
3 Pool types
4 Client usage
5 Ceph in OVH
6 Questions
Main features
distributed, fault-tolerant storage
self-healing/rebalancing after failures/changes
online adding/removing of nodes
data can be replicated or erasure coded
data balanced according to hardware (disk size)
periodic data consistency verification (deep scrub)
online (rolling) upgrades
Ceph stack
RADOS – Reliable Autonomic Distributed Object Store
librados – library for direct RADOS access
RBD – block device (via librbd or krbd)
radosGW – S3/Swift-compatible REST interface for object storage
CephFS – POSIX-compliant filesystem
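As a quick illustration of direct RADOS access, the rados CLI (a thin wrapper around librados) can store and fetch objects in a pool. A minimal sketch; the pool and object names are made up for the example:

mon-01:~ # rados -p testpool put hello-object /etc/hosts   # store a file as an object
mon-01:~ # rados -p testpool get hello-object /tmp/hello   # read it back
mon-01:~ # rados -p testpool ls                            # list objects in the pool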
Single-node cluster
[diagram: one host running a mon and an osd]
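For experimentation, a single-node cluster needs the placement defaults relaxed so CRUSH can satisfy placement with one host. A minimal ceph.conf sketch under that assumption (test setups only):

[global]
# replicate across OSDs instead of hosts, so one host is enough
osd crush chooseleaf type = 0
# keep a single copy of each object
osd pool default size = 1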
Data placement – CRUSH
data is stored in objects
objects are stored in pools
pools contain placement groups
an object's name is hashed to assign it to one placement group
placement groups are assigned to OSDs using CRUSH
the crushmap defines how data is balanced between nodes
the client calculates the target OSD based on the crushmap
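This mapping can be inspected from the command line: ceph osd map shows the placement group and the OSD set an object name resolves to (pool and object names here are illustrative):

mon-01:~ # ceph osd map testpool hello-object
# prints the PG id and the up/acting OSD set for that object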
crushmap in action
[diagram: client 1 talking to a cluster of one mon and osd 1–3]
1 client fetches the crushmap from the mon
2 client calculates the PG and OSD for data #1, then pushes/fetches data #1 directly to/from that OSD
3 client calculates the PG and OSD for data #2, then pushes/fetches data #2 directly to/from that OSD
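The crushmap the client fetches can also be pulled and decompiled by an operator; a standard sketch (file paths are illustrative):

mon-01:~ # ceph osd getcrushmap -o /tmp/crushmap.bin             # fetch the compiled map
mon-01:~ # crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt   # decompile to text
mon-01:~ # crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new   # recompile after editing
mon-01:~ # ceph osd setcrushmap -i /tmp/crushmap.new             # inject the new map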
ceph osd tree example
mon-01:~ # ceph osd tree
# id   weight  type name                  up/down  reweight
-1     455.3   root default
-8     151.8       rack ABC20
-2     16.26           host cephhost-123456
0      5.42                osd.0          up       1
5      5.42                osd.5          up       1
6      5.42                osd.6          up       1
-3     65.04           host cephhost-123457
10     5.42                osd.10         up       1
13     5.42                osd.13         up       1
15     5.42                osd.15         up       1
19     5.42                osd.19         up       1
20     5.42                osd.20         up       1
24     5.42                osd.24         up       1
27     5.42                osd.27         up       1
30     5.42                osd.30         up       1
...
Replicated pool
data is replicated n times – the pool's "size"
operations on an object are allowed only while at least min_size replicas exist
can use its own CRUSH rule for data placement
deep scrub can detect object inconsistencies
reads are served by the primary acting OSD
writes are propagated by the primary acting OSD
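A sketch of creating a replicated pool and tuning those two parameters (pool name and PG count are illustrative):

mon-01:~ # ceph osd pool create mypool 1024 1024 replicated
mon-01:~ # ceph osd pool set mypool size 3      # keep 3 replicas
mon-01:~ # ceph osd pool set mypool min_size 2  # allow I/O while >= 2 replicas exist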
Erasure coded pool
k + m (e.g. 10 + 4 => 40% overhead, up to 4 chunks can be lost)
space efficient (similar to RAID5/RAID6)
higher CPU requirements
checks object consistency on read [1]
objects can't be modified in place, they must be written at once
multiple plugins (e.g. from Intel with Xeon optimizations, from Fujitsu to speed up recovery)
[1] http://tracker.ceph.com/issues/12000
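Creating an EC pool goes through an erasure-code profile; a sketch matching the 10 + 4 example above (profile and pool names are illustrative):

mon-01:~ # ceph osd erasure-code-profile set ec-10-4 k=10 m=4
mon-01:~ # ceph osd pool create ecpool 1024 1024 erasure ec-10-4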
Cache tiering
a pool can have another layer (another pool) on top of it
usually the top (hot) pool is stored on fast disks (SSD, NVMe) and the bottom (cold) pool on slow disks (HDD, SMR)
RBD can be used with cache tiering when the hot pool is replicated – the cold pool can be EC
evicting objects from the hot pool is a costly process, so make sure your hot data set fits in the hot pool
the number of hits before an object is promoted is configurable (see the sketch below)
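A sketch of attaching a replicated hot pool on top of a cold pool in writeback mode; pool names are illustrative, and the last two commands are one way to require multiple recent hits before promotion:

mon-01:~ # ceph osd tier add coldpool hotpool
mon-01:~ # ceph osd tier cache-mode hotpool writeback
mon-01:~ # ceph osd tier set-overlay coldpool hotpool
mon-01:~ # ceph osd pool set hotpool hit_set_type bloom
mon-01:~ # ceph osd pool set hotpool min_read_recency_for_promote 2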
Rados Block Device
provides access to a raw block device
RBD images are groups of objects of a fixed size (specified at creation time)
snapshots, clones, copy-on-write
full/differential import/export
librbd (the fastest one), krbd, fuse
QEMU uses librbd to access Ceph
OpenStack supports Ceph for VM disks, snapshots and volumes
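A sketch of the basic image lifecycle with the rbd CLI (pool, image and file names are illustrative):

mon-01:~ # rbd create mypool/myimage --size 10240             # 10 GiB image (size in MB)
mon-01:~ # rbd map mypool/myimage                             # attach via krbd as /dev/rbd*
mon-01:~ # rbd snap create mypool/myimage@snap1               # point-in-time snapshot
mon-01:~ # rbd export-diff mypool/myimage@snap1 backup.diff   # differential export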
Object Storage – radosGW
S3/Swift-compatible API (via radosGW)
stores only whole objects
can use EC pools
multi-region asynchronous replication
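Users for the S3/Swift API are managed with radosgw-admin; a minimal sketch (uid and display name are illustrative):

mon-01:~ # radosgw-admin user create --uid=demo --display-name="Demo user"
# prints the generated S3 access/secret keys for the new user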
CephFS
a good starting point, but not yet production ready (as of early 2016)
like NFS, but with a distributed server side
POSIX-compatible filesystem
since Red Hat acquired Inktank, development has been more focused on RBD performance
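For completeness, a sketch of mounting CephFS with the kernel client (monitor address and paths are placeholders):

mon-01:~ # mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret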
Problems
quite difficult to set up at the beginning
hard to predict performance levels without experience
performance degradation during recovery/rebalance or deep scrub
high CPU usage when doing small IOs – CPU bound on flash storage – big improvements in Hammer and then Infernalis
corner-case bugs in new features
Why and how we use Ceph?
HA backend for OVH Public Cloud
we are planning to use it for backups (with EC pools)
15 PB of raw storage available in running clusters (and growing)
mixed and flash-only clusters
we are using multiple Ceph clusters ...
Ceph-as-a-Service
create a cluster in a selected DC with one API request
configure users, pools and network access via the API
multiple versions available: Firefly, Hammer, Infernalis
full RBD support; radosGW support is work in progress
public beta available (ask on ceph@ml.ovh.net)
Questions?
ceph@ml.ovh.net
https://www.ovh.pl/careers/