What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao
Index
What is Ceph?
Ceph architecture
Ceph and OpenStack
Wrap-up
What is Ceph?
Ceph
The name "Ceph" is a
common nickname given to
pet octopuses, short for
cephalopod.
Cephalopod?
Ceph is...
Open-source, massively scalable, software-defined object storage and file system
History of Ceph
2003 Project born at UCSC
2006 Open sourced
Papers published
2012 Inktank founded
“Argonaut” released
In April 2014
Red Hat announced it would acquire Inktank
Red Hat acquires me
Yesterday I joined Red Hat as an architect of storage systems.
This is just a coincidence.
Ceph releases
Major release every 3 months
Argonaut
Bobtail
Cuttlefish
Dumpling
Emperor
Firefly
Giant (coming in July)
Ceph architecture
Ceph at a glance
Layers in Ceph
RADOS = /dev/sda
Ceph FS = ext4
RADOS
Reliable: replicated to avoid data loss
Autonomic: OSDs communicate with each other to detect failures; replication is done transparently
Distributed
Object Store
RADOS (2)
Fundamentals of Ceph
Everything is stored in RADOS, including Ceph FS metadata (see the sketch after this list)
Two components: mon, osd
CRUSH algorithm
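A minimal sketch of what "everything is stored in RADOS" looks like from an application, using the Python librados bindings (introduced later in the stack diagram); the pool name "data" and the object name are assumptions for illustration.

import rados

# Connect to the cluster described by ceph.conf (path is an assumption)
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("data")       # open an I/O context on pool "data"
    ioctx.write_full("hello-object", b"hi")  # store an object by name
    print(ioctx.read("hello-object"))        # read it back: b'hi'
    ioctx.close()
finally:
    cluster.shutdown()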
OSD
Object storage daemon
One OSD per disk
Uses xfs/btrfs as backend
Btrfs is experimental!
Write-ahead journal for integrity and performance
From 3 to tens of thousands of OSDs in a cluster
OSD (2)
[Diagram: one OSD daemon per disk, each on top of a local filesystem (btrfs, xfs, or ext4)]
MON
Monitoring daemon
Maintains the cluster map and cluster state
Run a small, odd number of monitors
Locating objects
RADOS uses the CRUSH algorithm to locate objects
Locations are decided by pure calculation
No central metadata server
No SPoF
Massive scalability
CRUSH
1. Assign a placement group
pg = Hash(object name) % num pg
2. CRUSH(pg, cluster map, rule)
Cluster map
Hierarchical OSD map
Replicating across failure domains
Avoiding network congestion
Object locations are computed
Name: abc, Pool: test
Hash("abc") % 256 = 0x23
Pool "test" has id 3
Placement Group: 3.23
PG to OSD
Placement Group: 3.23
CRUSH(PG 3.23, Cluster Map, Rule)
→ osd.1, osd.5, osd.9
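A minimal sketch of step 1 (placement-group assignment), assuming pool "test" has id 3 and 256 PGs as in the example above. Ceph's real hash is rjenkins, so zlib.crc32 here is only a stand-in and will not reproduce the exact value 0x23; step 2, CRUSH itself, is not shown.

import zlib

def assign_pg(object_name, pool_id, pg_num):
    # pg = Hash(object name) % num pg
    pg = zlib.crc32(object_name.encode()) % pg_num
    # A PG id is written as <pool id>.<pg in hex>, e.g. "3.23"
    return "{}.{:x}".format(pool_id, pg)

print(assign_pg("abc", pool_id=3, pg_num=256))
# Step 2 (not shown): CRUSH(pg, cluster map, rule) -> e.g. [osd.1, osd.5, osd.9]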
Synchronous Replication
Replication is synchronous to maintain strong consistency
When OSD fails
The OSD is marked "down"
5 minutes later (by default), it is marked "out"
The cluster map is updated
CRUSH(PG 3.23, Cluster Map #1, Rule)
→ osd.1, osd.5, osd.9
CRUSH(PG 3.23, Cluster Map #2, Rule)
→ osd.1, osd.3, osd.9
Wrap-up: CRUSH
Object name + cluster map → object locations
Deterministic
No metadata at all
Calculation done on clients
Cluster map reflects the network hierarchy
RADOSGW
RADOS: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
RADOSGW
S3 / Swift compatible gateway to RADOS
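A hedged sketch of using the S3-compatible API with the boto library; the endpoint and credentials below are placeholders, not values from this talk.

import boto
import boto.s3.connection

# Connect to RADOSGW as if it were S3 (host and keys are placeholders)
conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="radosgw.example.com",
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket("my-bucket")            # bucket-based, like S3
key = bucket.new_key("hello.txt")
key.set_contents_from_string("Hello from RADOSGW")
print(key.get_contents_as_string())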
RBD
RADOS Block Devices
RBD
Directly mountable
rbd map foo --pool rbd
mkfs -t ext4 /dev/rbd/rbd/foo
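The same kind of thing can be done programmatically through the librbd Python bindings; a minimal sketch, assuming pool "rbd" and a new 10 GiB image named "foo" to mirror the CLI example above.

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")
    rbd.RBD().create(ioctx, "foo", 10 * 1024 ** 3)  # create a 10 GiB image "foo"
    ioctx.close()
finally:
    cluster.shutdown()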
OpenStack integration
Cinder & Glance
Covered later in this talk
Ceph FS
POSIX-compliant file system built on top of RADOS
Can be mounted with the Linux native kernel driver (cephfs) or FUSE
Metadata servers (mds) manage the metadata of the file system tree
Ceph FS is reliable
The MDS writes its journal to RADOS, so metadata is not lost when an MDS fails
Multiple MDSes can run for HA and load balancing
Ceph FS and OSD
[Diagram: clients send POSIX metadata operations (directory, time, owner, etc.) to the MDS and do data I/O directly against the OSDs; the MDS holds metadata in memory and writes its metadata journal to the OSDs]
Dynamic subtree partitioning
Ceph FS is experimental
Other features
Rolling upgrades
Erasure Coding
Cache tiering
Key-value OSD backend
Separate backend network
Rolling upgrades
No interruption to the service when upgrading
Stop/start daemons one by one:
mon → osd → mds → radosgw
Erasure coding
Use erasure coding instead of replication for data durability
Suitable for rarely modified or rarely accessed objects
                                        Erasure coding    Replication
Space overhead (to survive 2 failures)  approx. 40%       200%
CPU                                     High              Low
Latency                                 High              Low
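A rough illustration of where those space-overhead numbers come from; the k and m values below are assumptions, not parameters from this talk. With k data chunks and m coding chunks, erasure coding adds m/k extra space and survives m failures; N-way replication adds (N-1) x 100%.

# Illustrative arithmetic only (k=5, m=2 is an assumed profile)
def ec_overhead(k, m):
    return m / k                        # extra space as a fraction of the data

print("EC k=5, m=2:", ec_overhead(5, 2))   # 0.4 -> ~40%, survives 2 failures
print("3-way replication:", 2 / 1)         # 2.0 -> 200%, survives 2 failures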
Cache tiering
A cache tier (e.g. SSD) sits in front of a base tier (e.g. HDD, possibly erasure coded), transparently to librados clients
Reads and writes go to the cache tier
On a read miss, the object is fetched from the base tier
Objects are flushed from the cache tier back to the base tier
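A toy read-through / write-back sketch of that flow, with two Python dicts standing in for the tiers; this only illustrates the idea and is not Ceph's implementation.

class TieredStore:
    def __init__(self):
        self.cache = {}    # cache tier, e.g. SSD pool
        self.base = {}     # base tier, e.g. erasure-coded HDD pool
        self.dirty = set()

    def read(self, name):
        if name not in self.cache:               # read miss
            self.cache[name] = self.base[name]   # fetch from the base tier
        return self.cache[name]

    def write(self, name, data):
        self.cache[name] = data                  # writes land in the cache tier
        self.dirty.add(name)

    def flush(self):
        for name in self.dirty:                  # flush to the base tier
            self.base[name] = self.cache[name]
        self.dirty.clear()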
Key-value OSD backend
Use LevelDB as the OSD backend (instead of xfs)
Better performance, especially for small objects
Plans to support RocksDB, NVMKV, etc.
Separate backend network
Clients talk to OSDs over a frontend network for service; OSDs replicate over a separate backend network
1. The client writes over the frontend network
2. OSDs replicate over the backend network
OpenStack Integration
OpenStack with Ceph
RADOSGW and Keystone
Clients obtain a token from the Keystone server and access the RADOSGW RESTful object store with that token; RADOSGW queries Keystone to validate tokens, and Keystone grants or revokes access
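Since RADOSGW also speaks the Swift API, a Keystone-authenticated client can use it like any Swift endpoint; a hedged sketch with python-swiftclient, where the auth URL, tenant, user and password are placeholders.

from swiftclient.client import Connection

# Authenticate against Keystone, then talk to RADOSGW with the issued token
conn = Connection(
    authurl="http://keystone.example.com:5000/v2.0",
    user="demo",
    key="password",
    tenant_name="demo",
    auth_version="2.0",
)
conn.put_container("my-container")
conn.put_object("my-container", "hello.txt", contents=b"Hello via Keystone")
headers, body = conn.get_object("my-container", "hello.txt")
print(body)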
Glance Integration
The Glance server stores and downloads images to/from RBD
/etc/glance/glance-api.conf:
default_store=rbd
rbd_store_user=glance
rbd_store_pool=images
Need just 3 lines!
Cinder/Nova Integration
The Cinder server manages RBD volumes; on the compute node, nova-compute and qemu access them through librbd
A VM can boot from a volume that is a copy-on-write clone of an image
Benefits of using Ceph with OpenStack
Unified storage for both images and volumes
Copy-on-write cloning and snapshot support (see the sketch below)
Native qemu / KVM support for better performance
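A hedged sketch of copy-on-write cloning through the librbd Python bindings; the pool, image and snapshot names are assumptions, and Glance/Cinder do the equivalent internally.

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    images = cluster.open_ioctx("images")     # e.g. the Glance pool
    volumes = cluster.open_ioctx("volumes")   # e.g. the Cinder pool

    # Assumes "base-image" is a format-2 image with layering enabled
    with rbd.Image(images, "base-image") as img:
        img.create_snap("snap")               # snapshot the image
        img.protect_snap("snap")              # required before cloning

    # Copy-on-write clone of images/base-image@snap into volumes/my-volume
    rbd.RBD().clone(images, "base-image", "snap", volumes, "my-volume",
                    features=rbd.RBD_FEATURE_LAYERING)

    images.close()
    volumes.close()
finally:
    cluster.shutdown()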
Wrap-up
Ceph is
Massively scalable storage
Unified architecture for object / block / POSIX FS
OpenStack integration is ready to use & awesome
Ceph and GlusterFS
Distribution: Ceph is object based; GlusterFS is file based
File location: Ceph uses a deterministic algorithm (CRUSH); GlusterFS uses a distributed hash table stored in xattrs
Replication: Ceph replicates on the server side; GlusterFS on the client side
Primary usage: Ceph for object / block storage; GlusterFS as a POSIX-like file system
Challenge: Ceph's POSIX file system needs improvement; GlusterFS's object / block storage needs improvement
Further readings
Ceph documentation
https://ceph.com/docs/master/
Well documented.
Sébastien Han
http://www.sebastien-han.fr/blog/
An awesome blog.
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
http://ceph.com/papers/weil-crush-sc06.pdf
CRUSH algorithm paper
Ceph: A Scalable, High-Performance Distributed File System
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
Ceph paper
Ceph の覚え書きのインデックス (Index of notes on Ceph)
http://www.nminoru.jp/~nminoru/unix/ceph/
Well written introduction in Japanese
One more thing
Calamari will be open sourced
"Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced."
http://ceph.com/community/red-hat-to-acquire-inktank/
Calamari screens
Thank you!
