Ceph Fundamentals
Ross Turk
VP Community, Inktank
ME ME ME ME ME ME.
2
Ross Turk
VP Community, Inktank
ross@inktank.com
@rossturk
inktank.com | ceph.com
Ceph Architectural Overview
Ah! Finally, 32 slides in and he gets to the nerdy stuff.
3
4
RADOS: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
[Diagram: APP → LIBRADOS, APP → RADOSGW, HOST/VM → RBD, CLIENT → CEPH FS]
5
[Ceph stack diagram, repeated from slide 4]
6
[Diagram: each OSD runs atop a filesystem (btrfs, xfs, or ext4) on its own DISK, alongside the monitors (M)]
7
[Diagram: three monitors (M) and a human administrator]
8
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• Do not serve stored objects to clients

OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer to perform replication and recovery tasks
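Since the monitors hold the authoritative cluster maps, any client can ask them for the current state. A minimal sketch using the python-rados bindings; the conffile path is a placeholder, and the exact output keys vary by release:

    import json
    import rados

    # Connect to the cluster; the conffile path is a placeholder.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for cluster status. They answer from the
    # membership and state maps they maintain by consensus.
    cmd = json.dumps({"prefix": "status", "format": "json"})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    if ret == 0:
        status = json.loads(outbuf)
        print(sorted(status.keys()))  # e.g. health, monmap, osdmap, pgmap

    cluster.shutdown()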
9
[Ceph stack diagram (see slide 4), LIBRADOS highlighted]
10
[Diagram: APP → socket → LIBRADOS → RADOS cluster (M)]

LIBRADOS:
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
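As a sketch of what that direct access looks like, here is the Python binding writing and reading a single object; the pool name 'data' and conffile path are placeholders:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # An I/O context is bound to one pool; 'data' is a placeholder pool name.
    ioctx = cluster.open_ioctx('data')
    try:
        ioctx.write_full('hello-object', b'hello from librados')  # create/replace the object
        print(ioctx.read('hello-object'))                         # read it straight from the OSDs
        ioctx.set_xattr('hello-object', 'lang', b'python')        # objects can carry attributes too
    finally:
        ioctx.close()
        cluster.shutdown()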
12
[Ceph stack diagram (see slide 4), RADOSGW highlighted]
13
[Diagram: APP → REST → RADOSGW → socket → LIBRADOS → RADOS cluster (M)]
14
RADOS Gateway:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
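Because the gateway is S3-compatible, stock S3 tooling works against it unchanged. A sketch with boto3; the endpoint, port, and credentials are placeholders (in practice you would create them with radosgw-admin user create):

    import boto3

    # Point an ordinary S3 client at the gateway instead of AWS.
    # Endpoint and keys are placeholders for your own deployment.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo')
    s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored as RADOS objects')
    print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())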
15
[Diagram: two LOAD BALANCERs in front of multiple gateway instances, backed by monitors (M) and racks of OSD nodes (x12)]
16
[Ceph stack diagram (see slide 4), RBD highlighted]
17
[Diagram: a VM on a HYPERVISOR, which uses LIBRBD on top of LIBRADOS to reach the cluster (M)]
18
[Diagram: two hypervisors, each with its own LIBRBD/LIBRADOS stack; the VM can run from either because its image lives in the cluster (M)]
19
[Diagram: a HOST using KRBD (kernel module) to reach the cluster (M)]
20
RADOS Block Device:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux kernel (2.6.39+)
  • QEMU/KVM; native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
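A sketch of the snapshot and copy-on-write clone bullets using the python-rbd bindings; the pool and image names are placeholders:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # 'rbd' is the conventional pool name

    try:
        # Create a 1 GiB image; its objects are striped across the pool.
        # Layering is the feature that makes copy-on-write clones possible.
        rbd.RBD().create(ioctx, 'golden', 1024 ** 3,
                         old_format=False, features=rbd.RBD_FEATURE_LAYERING)

        image = rbd.Image(ioctx, 'golden')
        image.create_snap('base')     # point-in-time snapshot
        image.protect_snap('base')    # clones require a protected snapshot
        image.close()

        # The clone is instant: no data is copied until the clone is written to.
        rbd.RBD().clone(ioctx, 'golden', 'base', ioctx, 'vm-0001',
                        features=rbd.RBD_FEATURE_LAYERING)
    finally:
        ioctx.close()
        cluster.shutdown()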
21
[Ceph stack diagram (see slide 4), CEPH FS highlighted]
22
[Diagram: a CLIENT sends data (01 10) to the OSDs and metadata operations to the metadata servers (M)]
23
Metadata Server:
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for the shared filesystem
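A sketch of talking to the shared filesystem through the python-cephfs bindings (the kernel client and FUSE are the usual routes; the paths and conffile here are placeholders):

    import os
    import cephfs

    # Attach to the filesystem; the conffile path is a placeholder.
    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    # Directory and file operations go through the MDS for metadata;
    # the file contents themselves are written to and read from the OSDs.
    fs.mkdir('/demo', 0o755)
    fd = fs.open('/demo/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
    fs.write(fd, b'hello from cephfs', 0)
    fs.close(fd)

    fs.shutdown()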
What Makes Ceph Unique?
Part one: it never, ever remembers where it puts stuff.
24
25
[Diagram: an APP (??) facing a wall of storage nodes (DC), with no idea which one holds its data]
How Long Did It Take You To Find Your Keys This Morning?
azmeen, Flickr / CC BY 2.0
26
27
[Diagram: an APP finding its data among the storage nodes (DC) by consulting a central record of where everything was put]
Dear Diary: Today I Put My Keys on the Kitchen Counter
Barnaby, Flickr / CC BY 2.0
28
29
[Diagram: storage nodes (DC) grouped by name ranges A-G, H-N, O-T, U-Z; the APP computes that "F*" belongs in the A-G group]
I Always Put My Keys on the Hook By the Door
vitamindave, Flickr / CC BY 2.0
30
HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?
31
The Answer: CRUSH!!!!!
pasukaru76, Flickr / CC SA 2.0
32
33
[Diagram: an OBJECT (10 10 01 01 10 10 01 11 01 10) is mapped to a placement group by hash(object name) % num_pg, and CRUSH(pg, cluster state, rule set) maps that group to OSDs]
34
35
CRUSH:
• Pseudo-random placement algorithm
  • Fast calculation, no lookup
  • Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
  • Limited data migration on change
• Rule-based configuration
  • Infrastructure topology aware
  • Adjustable replication
  • Weighting
36
[Diagram: a CLIENT wondering (??) where to find an object]
37
NAME: "foo"   POOL: "bar"
hash("foo") % 256 = 0x23   pool "bar" = 3
OBJECT → PLACEMENT GROUP 3.23
PLACEMENT GROUP 3.23 → CRUSH → TARGET OSDs 24, 3, 12
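The two-step mapping above can be sketched in a few lines. This toy keeps the shape of the calculation but not the real math (Ceph uses rjenkins hashing and the full CRUSH algorithm, so the numbers it prints are illustrative only):

    import hashlib
    import random

    def place(object_name, pool_id, pg_num, osds, replicas=3):
        """Toy stand-in for Ceph's object -> PG -> OSD mapping."""
        # Step 1: hash the object name onto a placement group. Stable,
        # repeatable, and computed by any client with no lookup.
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        pgid = "%d.%x" % (pool_id, h % pg_num)

        # Step 2: CRUSH maps the PG to an ordered set of OSDs using only
        # the cluster map and placement rules; modeled here as a seeded choice.
        rng = random.Random(pgid)
        return pgid, rng.sample(osds, replicas)

    # The slide's example: object "foo" in pool "bar" (id 3, 256 PGs)
    # lands in PG 3.23 and on OSDs 24, 3, 12 in the real calculation.
    print(place("foo", pool_id=3, pg_num=256, osds=list(range(40))))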
38
39
40
What Makes Ceph Unique?
Part two: it has smart block devices for all those impatient, selfish VMs.
41
42
[Diagram: a VM on a HYPERVISOR, reaching the cluster (M) through LIBRBD and LIBRADOS]
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
43
[Diagram: a source disk image (= 144)]
44
[Diagram: instant copy; the clone starts with no data of its own (0 0 0 0) yet still presents the source's contents (= 144)]
45
[Diagram: a CLIENT writes to the clone; the new data lands in the clone's own objects (= 148)]
46
[Diagram: a CLIENT reads from the clone; extents it never wrote are served from the source image (= 148)]
What Makes Ceph Unique?
Part three: it has powerful friends, ones you probably already know.
47
48
[Diagram: APACHE CLOUDSTACK; HYPERVISORs use the cluster (M) as a PRIMARY STORAGE POOL, with a SECONDARY STORAGE POOL for snapshots, templates, and images]
49
[Diagram: OPENSTACK (KEYSTONE, SWIFT, CINDER, GLANCE, and NOVA APIs) with RADOSGW and the HYPERVISOR connecting it to the cluster (M)]
What Makes Ceph Unique?
Part four: clustered metadata
50
51
[Diagram: a CephFS CLIENT exchanging data (01 10) with the cluster (M)]
52
[Diagram: three metadata servers]
53
one tree
three metadata servers
??
54
55
56
57
58
DYNAMIC SUBTREE PARTITIONING
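As the editor's notes for slides 55 through 59 describe, one MDS starts out owning the whole tree, and each newcomer takes over a hot subtree. A deliberately tiny toy of that idea (the real MDS balances on measured load and migrates subtrees live):

    def partition(load_by_subtree, num_mds):
        """Toy: each new MDS rank adopts the hottest subtree; rank 0 keeps the rest."""
        hot = sorted(load_by_subtree, key=load_by_subtree.get, reverse=True)
        assignment = {path: 0 for path in load_by_subtree}  # rank 0 starts with everything
        for rank, path in enumerate(hot[:num_mds - 1], start=1):
            assignment[path] = rank                         # hand off one hot subtree per rank
        return assignment

    print(partition({'/home': 60, '/var': 25, '/tmp': 10}, num_mds=3))
    # {'/home': 1, '/var': 2, '/tmp': 0}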
Now What?
59
60
8 years & 26,000 commits
Getting Started With Ceph
Read about the latest version of Ceph.
• The latest stuff is always at http://ceph.com/get
Deploy a test cluster using ceph-deploy.
• Read the quick-start guide at http://ceph.com/qsg
Deploy a test cluster on the AWS free-tier using Juju.
• Read the guide at http://ceph.com/juju
Read the rest of the docs!
• Find docs for the latest release at http://ceph.com/docs
61
Have a working cluster up quickly.
Getting Involved With Ceph
Most project discussion happens on the mailing list.
• Join or view archives at http://ceph.com/list
IRC is a great place to get help (or help others!)
• Find details and historical logs at http://ceph.com/irc
The tracker manages our bugs and feature requests.
• Register and start looking around at http://ceph.com/tracker
Doc updates and suggestions are always welcome.
• Learn how to contribute docs at http://ceph.com/docwriting
62
Help build the best storage system around!
Questions?
63
Ross Turk
VP Community, Inktank
ross@inktank.com
@rossturk
inktank.com | ceph.com

Ceph Day NYC: Ceph Fundamentals

Editor's Notes

  • #5 RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
  • #6 But that’s a lot to digest all at once. Let’s start with RADOS.
  • #52 MDSs store all of their data within RADOS itself, but there’s still a problem…
  • #53 There are multiple MDSs!
  • #54 So how do you have one tree and multiple servers?
  • #55 If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  • #56 When the second one comes along, it will intelligently partition the work by taking a subtree.
  • #57 When the third MDS arrives, it will attempt to split the tree again.
  • #58 Same with the fourth.
  • #59 An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.