Datacenter Storage with Ceph
John Spray
john.spray@redhat.com
jcsp on #ceph-devel
Agenda
● What is Ceph?
● How does Ceph store your data?
● Interfaces to Ceph: RBD, RGW, CephFS
● Latest development updates
What is Ceph?
● Highly available, resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image service and an S3-compatible object storage service
A general purpose storage system
● You feed it commodity disks and Ethernet
● In return, it gives your apps a storage service
● It doesn't lose your data
● It doesn't need babysitting
● It's portable
Interfaces to storage
● OBJECT STORAGE (RGW): S3 & Swift, Multi-tenant, Keystone, Geo-Replication, Native API
● BLOCK STORAGE (RBD): OpenStack, Linux Kernel, iSCSI, Clones, Snapshots
● FILE SYSTEM (CephFS): POSIX, Linux Kernel, CIFS/NFS, HDFS, Distributed Metadata
Ceph Architecture
(how your data is stored)
Components
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
RADOS: Reliable, Autonomous, Distributed Object Store
RADOS Components
OSDs:
● 10s to 10,000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● These do not serve stored objects to clients
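Both components are easy to see on a live cluster; a minimal sketch using the standard status commands:
# ceph -s         # cluster health, monitor quorum, OSD up/in counts
# ceph mon stat   # monitor membership
# ceph osd tree   # OSDs arranged by the CRUSH hierarchy, with up/down state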
Object Storage Daemons
(diagram: each OSD daemon serves one DISK through a local filesystem - xfs, ext4, or btrfs - alongside the cluster's monitors (M))

RADOS Cluster
(diagram: an APPLICATION talks directly to the RADOS CLUSTER of OSDs and monitors (M))
Where do objects live?
(diagram: an APPLICATION holds an OBJECT - which node in the cluster should store it?)

A Metadata Server?
(diagram: option 1 - (1) ask a central metadata service where the object lives, then (2) go there; the lookup is an extra hop and a scaling bottleneck)

Calculated placement
(diagram: option 2 - the APPLICATION computes the location itself, e.g. name ranges A-G, H-N, O-T, U-Z mapped to servers; no lookup required)
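Ceph exposes this calculation directly; a sketch, where 'mypool' and 'myobject' are hypothetical names:
# ceph osd map mypool myobject   # prints the placement group and the OSDs CRUSH selects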
CRUSH: Dynamic data placement
Pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
● Limited data migration on change
● Rule-based configuration
● Infrastructure topology aware
● Adjustable replication
● Weighting
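The same calculation can be inspected and rehearsed offline with crushtool; a sketch, with arbitrary file names:
# ceph osd getcrushmap -o crush.bin     # extract the compiled CRUSH map
# crushtool -d crush.bin -o crush.txt   # decompile it into the rule language
# crushtool --test -i crush.bin --num-rep 3 --show-mappings   # simulate placements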
CRUSH: Replication
(diagram: incoming DATA is placed by CRUSH onto several OSDs in the RADOS CLUSTER, one copy per replica)
CRUSH: Topology-aware placement
(diagram: replicas placed in separate failure domains, RACK A and RACK B, within the RADOS CLUSTER)
CRUSH is a quick calculation
(diagram: the client computes placement for DATA directly against the RADOS CLUSTER - no central lookup in the data path)
CRUSH rules
● Simple language
● Specify which copies go where (across racks, servers, datacenters, disk types)
rule <rule name> {
  ruleset <ruleset name>
  type <replicated | erasure>
  step take <bucket-type>
  step [choose|chooseleaf] [firstn|indep] <N> <type>
  step emit
}
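For example, a sketch of a replicated rule that spreads copies across racks, assuming the default root bucket and a 'rack' level in your hierarchy:
rule across_racks {
  ruleset 1
  type replicated
  step take default                    # start at the root of the topology
  step chooseleaf firstn 0 type rack   # 0 = as many leaf OSDs as the pool's size, one per rack
  step emit
}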
Pools and Placement Groups
● Trick: apply CRUSH placement to a fixed number of placement groups instead of to N objects.
● Manage recovery/backfill at PG granularity: less per-object metadata.
● Typically a few hundred PGs per OSD
● A pool is a logical collection of PGs, using a particular CRUSH rule
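In practice this means choosing a PG count when the pool is created; a sketch with illustrative numbers:
# ceph osd pool create mypool 256 256   # 256 PGs (pg_num and pgp_num)
# ceph osd pool set mypool size 3       # keep three replicas of each PG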
Recovering from failures
● OSDs notice when their peers stop responding and report this to the monitors
● The monitors decide that an OSD is now “down”
● Its peers continue to serve data, but in a degraded state
● After some time, the monitors mark the OSD “out”
● New peers are selected by CRUSH, and data is re-replicated across the whole cluster
● Faster than a RAID rebuild, because the load is shared
● Does not require administrator intervention
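The whole sequence is observable with real commands (the down-to-out delay is the 'mon osd down out interval' option, 300 seconds by default):
# ceph -w              # stream cluster events: OSDs marked down/out, PGs recovering
# ceph health detail   # which PGs are degraded and why
# ceph osd tree        # up/down and in/out state of each OSD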
RADOS advanced features
● Not just puts and gets!
● More feature-rich than typical object stores
● Partial object updates & appends
● Key-value stores (OMAPs) within objects
● Copy-on-write snapshots
● Watch/notify for pushing events
● Extensible with Object Classes: perform arbitrary transactions on OSDs
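Several of these features are visible from the rados CLI; a sketch, with hypothetical pool and object names:
# rados -p mypool put greeting ./hello.txt          # whole-object write
# rados -p mypool append greeting ./more.txt        # append to the object
# rados -p mypool setomapval greeting colour blue   # key-value pair in the object's OMAP
# rados -p mypool listomapvals greeting             # read the OMAP back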
Choosing hardware
● Cheap hardware mitigates the cost of replication
● OSD data journalling: a separate SSD is useful. Approx. 1 SSD for every 4 OSD disks.
● OSDs are more CPU/RAM intensive than legacy storage: approx. 8 or so OSDs per host.
● Many cheaper servers beat a few expensive ones: distribute the load of rebalancing.
● Consider your bandwidth/capacity ratio and your read/write ratio
Interfaces to applications: RGW,
RBD, and CephFS
RBD: Virtual disks in Ceph
RADOS Block Device:
● Storage of disk images in RADOS
● Decouples VMs from host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
Support in:
● Mainline Linux kernel (2.6.39+)
● Qemu/KVM
● OpenStack, CloudStack, OpenNebula, Proxmox
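A typical image lifecycle from the rbd CLI; a sketch, with illustrative names and sizes:
# rbd create mypool/vm1 --size 10240       # 10 GiB image
# rbd snap create mypool/vm1@golden        # point-in-time snapshot
# rbd snap protect mypool/vm1@golden       # protect it so it can be cloned
# rbd clone mypool/vm1@golden mypool/vm2   # copy-on-write clone
# rbd map mypool/vm1                       # attach via the kernel client as a /dev/rbd device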
Storing virtual disks
(diagram: a VM on a HYPERVISOR does its disk I/O through LIBRBD, straight to the RADOS CLUSTER)
RGW: HTTP object store
RADOSGW:
● REST-based object storage proxy
● Compatible with S3 and Swift applications
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
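Accounts and usage data are managed with radosgw-admin; the uid here is hypothetical:
# radosgw-admin user create --uid=demo --display-name="Demo User"   # returns S3 access & secret keys
# radosgw-admin usage show --uid=demo                               # per-user usage, e.g. for billing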
RADOS Gateway
(diagram: APPLICATIONs speak REST to RADOSGW instances, each of which uses LIBRADOS to store objects in the RADOS CLUSTER; multiple gateways run side by side)
Ceph Filesystem (CephFS)
POSIX-compliant shared filesystem
Client:
● Userspace (FUSE) or kernel
● Looks like a local filesystem
● Sends data directly to RADOS
Metadata server:
● Manages filesystem metadata:
– Directory hierarchy
– Inode metadata (owner, timestamps, mode)
● Stores metadata in RADOS
● Does not serve file data to clients
Storing Data and Metadata
(diagram: a LINUX HOST's KERNEL MODULE client exchanges data and metadata with the RADOS CLUSTER over separate paths)
CephFS
● Advanced features:
– Subdirectory snapshots
– Recursive statistics
– Multiple metadata servers
● Coming soon:
– Online consistency checking
– Scalable repair/recovery tools
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 256
ceph osd pool create fs_metadata 256
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
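On clients without a recent kernel, the userspace (FUSE) client mounts the same filesystem:
# ceph-fuse -m x.x.x.x:6789 /mnt/ceph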
Beyond Replication: Erasure Coding
and Cache Tiering
Cache Tiering
● Use one Ceph pool as a cache for another:
– e.g. a flash pool caching a spinning-disk pool
● Configurable policies for eviction based on capacity, object count, lifetime
● Configurable mode:
– writeback: all client I/O goes to the cache
– readonly: client writes go to the backing pool, reads come from the cache
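Eviction policy is configured per cache pool; the option names are real, the values illustrative:
# ceph osd pool set cache target_max_bytes 100000000000   # begin flush/evict around 100 GB
# ceph osd pool set cache target_max_objects 1000000      # ...or around 1M objects
# ceph osd pool set cache cache_min_flush_age 600         # don't flush objects younger than 10 minutes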
Erasure Coding
● Split objects into M data chunks and K parity chunks, with configurable M and K
● An alternative to replication, providing a different set of tradeoffs:
– Consumes less storage capacity
– Consumes less write bandwidth
– Reads scattered across OSDs
– Modifications are expensive
● Plugin interface for encoding schemes
Cache Tiering + EC
(diagram: client I/O lands on a replicated pool 'cache' - replicas 1-3, 200% overhead - with a writeback cache policy; data is flushed to an erasure-coded pool 'cold' storing chunks M1 M2 M3 + K1 K2, 66% overhead)
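The overhead figures follow directly from the layout: three replicas store two extra copies of every byte (200% overhead), while a 3+2 erasure code stores 5 chunks for every 3 chunks of data, i.e. (5-3)/3 ≈ 66% overhead.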
Example: Cache Tiering & EC
# ceph osd erasure-code-profile set ecdemo k=3 m=2
# ceph osd pool create cold 384 384 erasure ecdemo
# ceph osd pool create cache 384 384
# ceph osd tier add cold cache
# ceph osd tier cache-mode cache writeback
# ceph osd tier set-overlay cold cache
What's new?
Ceph 0.94 "Hammer"
Release timeline: Emperor → Firefly → Giant → Hammer → Infernalis → Jewel
RADOS
● Performance:
– more IOPS
– exploit flash backends
– exploit many-cored machines
● CRUSH straw2 algorithm:
– reduced data migration on changes
● Cache tiering:
– read performance, reduce unnecessary promotions
RBD
● Object maps:
– per-image metadata identifying which extents are allocated
– an optimisation for clone/export/delete
● Mandatory locking:
– prevent multiple clients writing to same image
● Copy-on-read:
– improve performance for some workloads
RGW
● S3 object versioning API
– when enabled, all objects maintain history
– GET ID+version to see an old version
● Bucket sharding
– spread bucket index across multiple RADOS objects
– avoid oversized OMAPs
– avoid hotspots
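The shard count for newly created buckets is a gateway option (the option name is real, the value and section name illustrative):
[client.radosgw.gateway]
rgw override bucket index max shards = 16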
CephFS
● Diagnostics & health checks
● Journal recovery tools
● Initial online metadata scrub
● Refined ENOSPC handling
● Soft client quotas
● General hardening and resilience
Finally...
Get involved
Evaluate the latest releases:
http://ceph.com/resources/downloads/
Mailing list, IRC:
http://ceph.com/resources/mailing-list-irc/
Bugs:
http://tracker.ceph.com/projects/ceph/issues
Online developer summits:
https://wiki.ceph.com/Planning/CDS
Ceph Days
Ceph Day Berlin is next Tuesday!
http://ceph.com/cephdays/ceph-day-berlin/
Axica Convention Center
April 28 2015