DEVIEW 2013

Slides from my talk at DEVIEW 2013

  1. Ceph: Massively Scalable Distributed Storage. Patrick McGarry, Director of Community, Inktank
  2. Who am I? ● Patrick McGarry ● Director of Community, Inktank ● /. → ALU → P4 → Inktank ● @scuttlemonkey ● patrick@inktank.com
  3. Outline ● What is Ceph? ● Ceph History ● How does Ceph work? ● CRUSH Fundamentals ● Applications ● Testing ● Performance ● Moving forward ● Questions
  4. What is Ceph?
  5. Ceph at a glance ● Software: Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster. ● All-in-1: object, block, and file. ● CRUSH: awesomesauce; an infrastructure-aware placement algorithm allows you to do really cool stuff. ● Scale: huge and beyond; designed for exabytes, current implementations in the multi-petabyte range. HPC, Big Data, Cloud, raw storage. ● On commodity hardware: low overhead doesn't mean just hardware, it means people too!
  6. Unified storage system ● objects: native, RESTful ● block: thin provisioning, snapshots, cloning ● file: strong consistency, snapshots
  7. Distributed storage system ● data center scale: 10s to 10,000s of machines, terabytes to exabytes ● fault tolerant: no single point of failure, commodity hardware, self-managing, self-healing
  8. Where did Ceph come from?
  9. UCSC research grant ● "Petascale object storage" ● DOE: LANL, LLNL, Sandia ● Scalability ● Reliability ● Performance: raw IO bandwidth, metadata ops/sec ● HPC file system workloads: thousands of clients writing to the same file or directory
  10. Distributed metadata management ● Innovative design: subtree-based partitioning for locality and efficiency; dynamically adapt to current workload; embedded inodes ● Prototype simulator in Java (2004) ● First line of Ceph code: summer internship at LLNL; high-security national lab environment; could write anything, as long as it was OSS
  11. The rest of Ceph ● RADOS – distributed object storage cluster (2005) ● EBOFS – local object storage (2004/2006) ● CRUSH – hashing for the real world (2005) ● Paxos monitors – cluster consensus (2006) → emphasis on consistent, reliable storage → scale by pushing intelligence to the edges → a different but compelling architecture
  12. Industry black hole ● Many large storage vendors: proprietary solutions that don't scale well ● Few open source alternatives (2006): very limited scale, or limited community and architecture (Lustre); no enterprise feature sets (snapshots, quotas) ● PhD grads all built interesting systems... and then went to work for NetApp, DDN, EMC, Veritas ● They want you, not your project
  13. A different path ● Change the world with open source: do what Linux did to Solaris, Irix, Ultrix, etc. What could go wrong? ● License: GPL, BSD... LGPL (share changes, okay to link to proprietary code) ● Avoid unsavory practices: dual licensing, copyright assignment
  14. How does Ceph work?
  15. (Architecture diagram) ● LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (used by APP) ● RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift (used by APP) ● RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver (used by HOST/VM) ● CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (used by CLIENT) ● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
  16. Why start with objects? ● more useful than (disk) blocks: variable size; names in a single flat namespace; simple API with rich semantics ● more scalable than files: no hard-to-distribute hierarchy; update semantics do not span objects; workload is trivially parallel
  17. Ceph object model ● pools: independent namespaces or object collections; 1s to 100s; replication level, placement policy ● objects: bazillions; blob of data (bytes to gigabytes); attributes (e.g., "version=12"; bytes to kilobytes); key/value bundle (bytes to gigabytes)
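
As a minimal sketch of this object model, the snippet below uses the python-rados binding to write an object and tag it with an attribute. The config path and the pool name "data" are assumptions about the local setup, not anything fixed by Ceph.

    # Write an object and tag it with an attribute via librados (python-rados).
    # Assumes a reachable cluster, /etc/ceph/ceph.conf, and an existing pool "data".
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')                 # a pool: independent namespace
        try:
            ioctx.write_full('greeting', b'hello ceph')    # blob of data
            ioctx.set_xattr('greeting', 'version', b'12')  # attribute, e.g. "version=12"
            print(ioctx.read('greeting'))                  # b'hello ceph'
            print(ioctx.get_xattr('greeting', 'version'))  # b'12'
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
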
  18. (Diagram: OSDs, each running on a local filesystem (btrfs, xfs, or ext4) on its own disk, alongside a small set of monitors, M.)
  19. Object Storage Daemons (OSDs) ● 10s to 10,000s in a cluster ● one per disk, SSD, or RAID group, or ... (hardware agnostic) ● serve stored objects to clients ● intelligently peer to perform replication and recovery tasks. Monitors (M) ● maintain cluster membership and state ● provide consensus for distributed decision-making ● small, odd number ● do not serve stored objects to clients
  20. Data distribution ● all objects are replicated N times ● objects are automatically placed, balanced, migrated in a dynamic cluster ● must consider physical infrastructure: ceph-osds on hosts, in racks, in rows, in data centers ● three approaches: pick a spot and remember where you put it; pick a spot and write down where you put it; calculate where to put it and where to find it
  21. CRUSH fundamentals
  22. CRUSH ● pseudo-random placement algorithm: fast calculation, no lookup; repeatable, deterministic ● statistically uniform distribution ● stable mapping: limited data migration on change ● rule-based configuration: infrastructure topology aware; adjustable replication; allows weighting
  23. [Mapping] [Buckets] [Rules] [Map]
  24. (Diagram: objects are hashed into placement groups, which CRUSH maps to OSDs.) hash(object name) % num_pg → CRUSH(pg, cluster state, policy)
  25. (Diagram: the same object → placement group → OSD mapping, with placement groups spread across the cluster.)
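
To make the two-step mapping above concrete, here is a toy, compute-only placement sketch in Python. It is not the real CRUSH algorithm (no buckets, rules, or weights); it only illustrates the idea that any client holding the same cluster map can calculate, rather than look up, where an object lives. The PG count and OSD list are invented for the example.

    # Toy illustration of compute-only placement (not the actual CRUSH algorithm):
    # hash the object name into a placement group, then deterministically rank
    # OSDs for that PG. Every client with the same map gets the same answer.
    import hashlib

    NUM_PGS = 128                                  # illustrative pool setting
    OSDS = ['osd.%d' % i for i in range(12)]       # illustrative cluster map
    REPLICAS = 3

    def pg_for_object(name):
        """hash(object name) % num_pg"""
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PGS

    def osds_for_pg(pg):
        """Rendezvous-style ranking: stable, so a map change only moves
        the data that actually has to move."""
        score = lambda osd: hashlib.md5(('%d:%s' % (pg, osd)).encode()).hexdigest()
        return sorted(OSDS, key=score)[:REPLICAS]

    pg = pg_for_object('greeting')
    print(pg, osds_for_pg(pg))                     # deterministic on every client
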
  26. (Diagram: CLIENT, with a question mark: where does an object live?)
  27. (Diagram: CLIENT and question mark, continued.)
  28. Applications
  29. (Architecture diagram revisited: LIBRADOS, RADOSGW, RBD, and CEPH FS, all layered on RADOS, as on slide 15.)
  30. RADOS Gateway ● REST-based object storage proxy ● uses RADOS to store objects ● API supports buckets, accounting ● usage accounting for billing purposes ● compatible with S3, Swift APIs
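
Because the gateway speaks the S3 API, a stock S3 client such as boto can talk to it directly. In the sketch below the endpoint, credentials, and bucket name are placeholders for whatever your radosgw deployment actually uses.

    # Use a stock S3 client (boto) against radosgw. Host, keys, and bucket
    # name are placeholders for your own gateway and credentials.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('my-bucket')           # create a bucket via the gateway
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw') # upload an object
    print([k.name for k in bucket.list()])             # ['hello.txt']
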
  31. RADOS Block Device ● storage of disk images in RADOS ● decouple VM from host ● images striped across entire cluster (pool) ● snapshots ● copy-on-write clones ● support in: QEMU/KVM, Xen; mainline Linux kernel (2.6.34+); OpenStack, CloudStack; Ganeti, OpenNebula
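
A small sketch of the block device API using the python-rbd bindings: create an image, write to it, and take a snapshot. The pool name "rbd", the image name, and the size are assumptions for the example.

    # Create an RBD image, write to it, and snapshot it (python-rbd).
    # Assumes a reachable cluster and an existing pool named "rbd".
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    try:
        rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)  # 10 GiB, thin provisioned
        image = rbd.Image(ioctx, 'vm-disk-1')
        try:
            image.write(b'boot data', 0)                    # write at offset 0
            image.create_snap('before-install')             # point-in-time snapshot
            # protecting the snapshot would then allow copy-on-write clones
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()
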
  32. Metadata Server (MDS) ● manages metadata for POSIX shared file system ● directory hierarchy ● file metadata (size, owner, timestamps) ● stores metadata in RADOS ● does not serve file data to clients ● only required for the shared file system
  33. Multiple protocols, implementations ● Linux kernel client: mount -t ceph 1.2.3.4:/ /mnt; re-export via NFS or Samba (CIFS) ● ceph-fuse ● libcephfs.so: link it into your app; used by Samba (CIFS), Ganesha (NFS), and Hadoop (map/reduce)
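
Because CephFS is POSIX-compliant, once it is mounted (via the kernel client or ceph-fuse) ordinary file code works unchanged; nothing Ceph-specific leaks into the application. The mount point below is a placeholder.

    # Plain POSIX file I/O on a mounted CephFS; only the mount point is assumed.
    import os

    MNT = '/mnt/cephfs'
    os.makedirs(os.path.join(MNT, 'projects/deview'), exist_ok=True)
    path = os.path.join(MNT, 'projects/deview/notes.txt')

    with open(path, 'w') as f:
        f.write('CephFS looks like any other filesystem to applications.\n')

    print(os.stat(path).st_size)                   # regular stat(), regular size
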
  34. Testing
  35. Package Building ● GitBuilder: very basic tool, but it works well; push to git → .deb and .rpm for all platforms; builds packages and tarballs, auto-signed ● Pbuilder: used for major releases; clean-room type of build; signed using a different key; as many distros as possible (all Debian, Ubuntu, Fedora 17 & 18, CentOS, SLES, openSUSE, tarball)
  36. Teuthology ● Our own custom test framework ● Allocates the machines you define to the cluster you define ● Runs tasks against that environment: set up Ceph, run workloads ● Example: pull from the Samba gitbuilder → mount Ceph → run an NFS client → mount it → run a workload ● Makes it easy to stack things up, map them to hosts, and run ● Suite of tests on our GitHub ● Active community development underway!
  37. Performance
  38. 4K Relative Performance (chart). All workloads: QEMU/KVM, BTRFS, or XFS. Read-heavy workloads only: kernel RBD, EXT4, or XFS.
  39. 128K Relative Performance (chart). All workloads: QEMU/KVM, or XFS. Read-heavy workloads only: kernel RBD, BTRFS, or EXT4.
  40. 4M Relative Performance (chart). All workloads: QEMU/KVM, or XFS. Read-heavy workloads only: kernel RBD, BTRFS, or EXT4.
  41. What does this mean? ● Anecdotal results are great! ● Concrete evidence coming soon ● Performance is complicated: workload, hardware (especially network!), lots and lots of tunables ● Mark Nelson (nhm on #ceph) is the performance geek ● Lots of reading on the ceph.com blog ● We welcome testing and comparison help ● Anyone want to help with an EMC / NetApp showdown?
  42. Moving forward
  43. Future plans ● Geo-Replication (an ongoing process): while the first pass for disaster recovery is done, we want to get to built-in, world-wide replication. ● Erasure Coding (reception efficiency): currently underway in the community! ● Tiering (headed to dynamic): can already do this in a static pool-based setup; looking to get to a use-based migration. ● Governance (making it open-er): been talking about it forever. The time is coming!
  44. Get Involved! ● CDS (quarterly online summit): puts the core devs together with the Ceph community. ● Ceph Day (not just for NYC): more planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/ ● IRC (Geek-on-Duty): during the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph ● Lists (e-mail makes the world go): our mailing lists are very active; check out ceph.com for details on how to join in!
  45. Questions? ● E-mail: patrick@inktank.com ● Website: ceph.com ● Social: @scuttlemonkey, @ceph, facebook.com/cephstorage
