What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao
Index
What is Ceph?
Ceph architecture
Ceph and OpenStack
Wrap-up
What is Ceph?
Ceph
The name "Ceph" is a
common nickname given to
pet octopuses, short for
cephalopod.
Cephalopod?
Ceph is...
Open-source, massively scalable, software-defined object storage and file system
History of Ceph
2003 Project born at UCSC
2006 Open sourced
Papers published
2012 Inktank founded
“Argonaut” released
In April 2014
Red Hat announced it would acquire Inktank
Red Hat acquires me
Yesterday I joined Red Hat as an architect of storage systems.
This is just a coincidence.
Ceph releases
Major release every 3 months
Argonaut
Bobtail
Cuttlefish
Dumpling
Emperor
Firefly
Giant (coming in July)
Ceph architecture
Ceph at a glance
Layers in Ceph
RADOS = /dev/sda
Ceph FS = ext4
RADOS
Reliable: replicated to avoid data loss
Autonomic: OSDs communicate with each other to detect failures; replication is done transparently
Distributed
Object Store
RADOS (2)
Fundamentals of Ceph
Everything is stored in RADOS, including Ceph FS metadata (see the sketch after this list)
Two components: mon, osd
CRUSH algorithm
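A minimal sketch of what "everything is stored in RADOS" looks like from an application, using the Python librados bindings (introduced later in the stack diagram); the pool name "data" and the object name are assumptions for illustration.

import rados

# Connect to the cluster described by ceph.conf (path is an assumption)
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("data")       # open an I/O context on pool "data"
    ioctx.write_full("hello-object", b"hi")  # store an object by name
    print(ioctx.read("hello-object"))        # read it back: b'hi'
    ioctx.close()
finally:
    cluster.shutdown()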
OSD
Object storage daemon
One OSD per disk
Uses xfs/btrfs as backend
Btrfs is experimental!
Write-ahead journal for integrity and performance
From 3 to tens of thousands of OSDs in a cluster
OSD (2)
[Diagram: one OSD daemon per disk, each on top of a local filesystem (btrfs, xfs, or ext4)]
MON
Monitoring daemon
Maintains the cluster map and cluster state
Run a small, odd number of monitors
Locating objects
RADOS uses the CRUSH algorithm to locate objects
Locations are decided by pure calculation
No central metadata server
No SPoF
Massive scalability
CRUSH
1. Assign a placement group
pg = Hash(object name) % num pg
2. CRUSH(pg, cluster map, rule)
Cluster map
Hierarchical OSD map
Replicating across failure domains
Avoiding network congestion
Object locations are computed
Name: abc, Pool: test
Hash("abc") % 256 = 0x23
Pool "test" has id 3
Placement Group: 3.23
PG to OSD
Placement Group: 3.23
CRUSH(PG 3.23, Cluster Map, Rule)
→ osd.1, osd.5, osd.9
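A minimal sketch of step 1 (placement-group assignment), assuming pool "test" has id 3 and 256 PGs as in the example above. Ceph's real hash is rjenkins, so zlib.crc32 here is only a stand-in and will not reproduce the exact value 0x23; step 2, CRUSH itself, is not shown.

import zlib

def assign_pg(object_name, pool_id, pg_num):
    # pg = Hash(object name) % num pg
    pg = zlib.crc32(object_name.encode()) % pg_num
    # A PG id is written as <pool id>.<pg in hex>, e.g. "3.23"
    return "{}.{:x}".format(pool_id, pg)

print(assign_pg("abc", pool_id=3, pg_num=256))
# Step 2 (not shown): CRUSH(pg, cluster map, rule) -> e.g. [osd.1, osd.5, osd.9]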
Synchronous Replication
Replication is synchronous to maintain strong consistency
When OSD fails
The OSD is marked "down"
5 minutes later (by default), it is marked "out"
The cluster map is updated
CRUSH(PG 3.23, Cluster Map #1, Rule)
→ osd.1, osd.5, osd.9
CRUSH(PG 3.23, Cluster Map #2, Rule)
→ osd.1, osd.3, osd.9
Wrap-up: CRUSH
Object name + cluster map → object locations
Deterministic
No metadata at all
Calculation done on clients
Cluster map reflects the network hierarchy
RADOSGW
RADOS: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
RADOSGW
S3 / Swift compatible gateway to RADOS
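A hedged sketch of using the S3-compatible API with the boto library; the endpoint and credentials below are placeholders, not values from this talk.

import boto
import boto.s3.connection

# Connect to RADOSGW as if it were S3 (host and keys are placeholders)
conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="radosgw.example.com",
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket("my-bucket")            # bucket-based, like S3
key = bucket.new_key("hello.txt")
key.set_contents_from_string("Hello from RADOSGW")
print(key.get_contents_as_string())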
RBD
RADOS Block Devices
RBD
Directly mountable
rbd map foo --pool rbd
mkfs -t ext4 /dev/rbd/rbd/foo
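The same kind of thing can be done programmatically through the librbd Python bindings; a minimal sketch, assuming pool "rbd" and a new 10 GiB image named "foo" to mirror the CLI example above.

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")
    rbd.RBD().create(ioctx, "foo", 10 * 1024 ** 3)  # create a 10 GiB image "foo"
    ioctx.close()
finally:
    cluster.shutdown()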
OpenStack integration
Cinder & Glance
Covered later in this talk
Ceph FS
POSIX-compliant file system built on top of RADOS
Can be mounted with the Linux native kernel driver (cephfs) or FUSE
Metadata servers (mds) manage the metadata of the file system tree
Ceph FS is reliable
The MDS writes its journal to RADOS, so metadata is not lost when an MDS fails
Multiple MDSes can run for HA and load balancing
Ceph FS and OSD
[Diagram: clients send POSIX metadata operations (directory, time, owner, etc.) to the MDS and do data I/O directly against the OSDs; the MDS holds metadata in memory and writes its metadata journal to the OSDs]
Dynamic subtree partitioning
Ceph FS is experimental
Other features
Rolling upgrades
Erasure Coding
Cache tiering
Key-value OSD backend
Separate backend network
Rolling upgrades
No interruption to the service when upgrading
Stop/start daemons one by one:
mon → osd → mds → radosgw
Erasure coding
Use erasure coding instead of replication for data durability
Suitable for rarely modified or rarely accessed objects
                                        Erasure coding    Replication
Space overhead (to survive 2 failures)  approx. 40%       200%
CPU                                     High              Low
Latency                                 High              Low
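A rough illustration of where those space-overhead numbers come from; the k and m values below are assumptions, not parameters from this talk. With k data chunks and m coding chunks, erasure coding adds m/k extra space and survives m failures; N-way replication adds (N-1) x 100%.

# Illustrative arithmetic only (k=5, m=2 is an assumed profile)
def ec_overhead(k, m):
    return m / k                        # extra space as a fraction of the data

print("EC k=5, m=2:", ec_overhead(5, 2))   # 0.4 -> ~40%, survives 2 failures
print("3-way replication:", 2 / 1)         # 2.0 -> 200%, survives 2 failures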
Cache tiering
A cache tier (e.g. SSD) sits in front of a base tier (e.g. HDD, possibly erasure coded), transparently to librados clients
Reads and writes go to the cache tier
On a read miss, the object is fetched from the base tier
Objects are flushed from the cache tier back to the base tier
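A toy read-through / write-back sketch of that flow, with two Python dicts standing in for the tiers; this only illustrates the idea and is not Ceph's implementation.

class TieredStore:
    def __init__(self):
        self.cache = {}    # cache tier, e.g. SSD pool
        self.base = {}     # base tier, e.g. erasure-coded HDD pool
        self.dirty = set()

    def read(self, name):
        if name not in self.cache:               # read miss
            self.cache[name] = self.base[name]   # fetch from the base tier
        return self.cache[name]

    def write(self, name, data):
        self.cache[name] = data                  # writes land in the cache tier
        self.dirty.add(name)

    def flush(self):
        for name in self.dirty:                  # flush to the base tier
            self.base[name] = self.cache[name]
        self.dirty.clear()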
Key-value OSD backend
Use LevelDB as the OSD backend (instead of xfs)
Better performance, especially for small objects
Plans to support RocksDB, NVMKV, etc.
Separate backend network
Clients talk to OSDs over a frontend network for service; OSDs replicate over a separate backend network
1. The client writes over the frontend network
2. OSDs replicate over the backend network
OpenStack Integration
OpenStack with Ceph
RADOSGW and Keystone
Clients obtain a token from the Keystone server and access the RADOSGW RESTful object store with that token; RADOSGW queries Keystone to validate tokens, and Keystone grants or revokes access
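Since RADOSGW also speaks the Swift API, a Keystone-authenticated client can use it like any Swift endpoint; a hedged sketch with python-swiftclient, where the auth URL, tenant, user and password are placeholders.

from swiftclient.client import Connection

# Authenticate against Keystone, then talk to RADOSGW with the issued token
conn = Connection(
    authurl="http://keystone.example.com:5000/v2.0",
    user="demo",
    key="password",
    tenant_name="demo",
    auth_version="2.0",
)
conn.put_container("my-container")
conn.put_object("my-container", "hello.txt", contents=b"Hello via Keystone")
headers, body = conn.get_object("my-container", "hello.txt")
print(body)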
Glance Integration
The Glance server stores and downloads images to/from RBD
/etc/glance/glance-api.conf:
default_store=rbd
rbd_store_user=glance
rbd_store_pool=images
Need just 3 lines!
Cinder/Nova Integration
The Cinder server manages RBD volumes; on the compute node, nova-compute and qemu access them through librbd
A VM can boot from a volume that is a copy-on-write clone of an image
Benefits of using Ceph with OpenStack
Unified storage for both images and volumes
Copy-on-write cloning and snapshot support (see the sketch below)
Native qemu / KVM support for better performance
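A hedged sketch of copy-on-write cloning through the librbd Python bindings; the pool, image and snapshot names are assumptions, and Glance/Cinder do the equivalent internally.

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    images = cluster.open_ioctx("images")     # e.g. the Glance pool
    volumes = cluster.open_ioctx("volumes")   # e.g. the Cinder pool

    # Assumes "base-image" is a format-2 image with layering enabled
    with rbd.Image(images, "base-image") as img:
        img.create_snap("snap")               # snapshot the image
        img.protect_snap("snap")              # required before cloning

    # Copy-on-write clone of images/base-image@snap into volumes/my-volume
    rbd.RBD().clone(images, "base-image", "snap", volumes, "my-volume",
                    features=rbd.RBD_FEATURE_LAYERING)

    images.close()
    volumes.close()
finally:
    cluster.shutdown()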
Wrap-up
Ceph is
Massively scalable storage
Unified architecture for object / block / POSIX FS
OpenStack integration is ready to use & awesome
Ceph and GlusterFS
Distribution: Ceph is object based; GlusterFS is file based
File location: Ceph uses a deterministic algorithm (CRUSH); GlusterFS uses a distributed hash table stored in xattrs
Replication: Ceph replicates on the server side; GlusterFS on the client side
Primary usage: Ceph for object / block storage; GlusterFS as a POSIX-like file system
Challenge: Ceph's POSIX file system needs improvement; GlusterFS's object / block storage needs improvement
Further readings
Ceph documentation
https://ceph.com/docs/master/
Well documented.
Sébastien Han
http://www.sebastien-han.fr/blog/
An awesome blog.
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
http://ceph.com/papers/weil-crush-sc06.pdf
CRUSH algorithm paper
Ceph: A Scalable, High-Performance Distributed File System
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
Ceph paper
Ceph の覚え書きのインデックス (Index of notes on Ceph)
http://www.nminoru.jp/~nminoru/unix/ceph/
Well written introduction in Japanese
One more thing
Calamari will be open sourced
"Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced."
http://ceph.com/community/red-hat-to-acquire-inktank/
Calamari screens
Thank you!
