CEPH: A MASSIVELY SCALABLE
DISTRIBUTED STORAGE SYSTEM
Ken Dreyer
Software Engineer
Apr 23 2015
Hitler finds out about software-defined storage
If you are a proprietary storage vendor...
THE FUTURE OF STORAGE
Traditional Storage
Complex proprietary silos
Open Software-Defined Storage
Standardized, unified, open platforms
[Diagram: traditional storage stacks pair a custom GUI and proprietary software with proprietary hardware, one silo per vendor; open software-defined storage runs open-source software (Ceph) and a single control plane (API, GUI) for admins and users on standard, commodity computers and disks.]
THE JOURNEY
Open Software-Defined Storage is a fundamental reimagining of how storage infrastructure works. It provides substantial economic and operational advantages, and it is quickly becoming well suited to a growing number of use cases.
TODAY: Cloud Infrastructure
EMERGING: Cloud Native Apps, Analytics, Hyper-Convergence, Containers
FUTURE: ???
HISTORICAL TIMELINE
2004: Project starts at UCSC
2006: Open source
2010: Mainline Linux kernel
2011: OpenStack integration
MAY 2012: Launch of Inktank
2012: CloudStack integration
SEPT 2012: Production-ready Ceph
2013: Xen integration
OCT 2013: Inktank Ceph Enterprise launch
FEB 2014: RHEL-OSP certification
APR 2014: Inktank acquired by Red Hat
10 years in the making
ARCHITECTURE
ARCHITECTURAL COMPONENTS
RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RGW: A web services gateway for object storage, compatible with S3 and Swift
RBD: A reliable, fully-distributed block device with cloud platform integration
CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management
[Diagram: apps reach RADOS through LIBRADOS or RGW, hosts/VMs through RBD, and clients through CEPHFS.]
OBJECT STORAGE DAEMONS
[Diagram: each OSD (object storage daemon) runs on top of a local filesystem (btrfs, xfs, ext4, zfs?) on its disk; monitors (M) run alongside the OSDs.]
RADOS CLUSTER
[Diagram: an application talks directly to a RADOS cluster made up of many OSDs and a small number of monitors (M).]
RADOS COMPONENTS
OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer for replication & recovery

Monitors (M):
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• Do not serve stored objects to clients
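The monitors answer questions about cluster membership rather than serving data. A minimal, hypothetical sketch with the python-rados bindings (assuming /etc/ceph/ceph.conf and a client keyring are readable) asks them for quorum status:

    # Hypothetical sketch: query monitor quorum status through librados.
    # Assumes python-rados, /etc/ceph/ceph.conf, and a valid client keyring.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    cmd = json.dumps({"prefix": "quorum_status", "format": "json"})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    status = json.loads(outbuf)
    print("monitors in quorum:", status.get("quorum_names"))

    cluster.shutdown()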
WHERE DO OBJECTS LIVE?
[Diagram: an application holds an object and must decide which node in the cluster it should live on.]
A METADATA SERVER?
[Diagram: one option is a central metadata server: the application asks the lookup service first (1), then contacts the right node (2).]
CALCULATED PLACEMENT
[Diagram: another option is calculated placement: the application runs the object name through a function F that maps it to fixed ranges (A-G, H-N, O-T, U-Z) across nodes, with no lookup.]
EVEN BETTER: CRUSH!
[Diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster.]
CRUSH IS A QUICK CALCULATION
[Diagram: the client computes an object's location in the RADOS cluster with a quick calculation; there is no lookup service in the data path.]
CRUSH: DYNAMIC DATA PLACEMENT
CRUSH:
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
• Limited data migration on change
• Rule-based configuration
• Infrastructure topology aware
• Adjustable replication
• Weighting

Placement is a two-step calculation:
pg = hash(object name) % num_pg
osds = CRUSH(pg, cluster state, rule set)
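To make the two-step mapping concrete, here is a toy Python sketch: the PG comes from a hash of the object name, and a stand-in pick function plays the role of CRUSH (the real algorithm walks a weighted hierarchy of racks and hosts described by the CRUSH map; the PG count, replica count, and OSD names below are made up).

    # Toy sketch of the two-step placement calculation. pick_osds() is only a
    # stand-in for CRUSH, which really walks a weighted cluster hierarchy.
    import hashlib

    NUM_PG = 128                                  # assumed pool pg_num
    OSDS = ["osd.%d" % i for i in range(12)]      # assumed 12-OSD cluster
    REPLICAS = 3

    def object_to_pg(name):
        # pg = hash(object name) % num_pg
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PG

    def pick_osds(pg, osds, replicas=REPLICAS):
        # Deterministic, repeatable choice of distinct OSDs for a PG.
        chosen, seed = [], pg
        while len(chosen) < replicas:
            seed = int(hashlib.md5(str(seed).encode()).hexdigest(), 16)
            osd = osds[seed % len(osds)]
            if osd not in chosen:
                chosen.append(osd)
        return chosen

    pg = object_to_pg("my-object")
    print(pg, pick_osds(pg, OSDS))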
[Animation: clients locate objects with CRUSH and talk to the OSDs directly.]
ARCHITECTURAL COMPONENTS (recap of the component stack above; next: LIBRADOS)
ACCESSING A RADOS CLUSTER
[Diagram: an application links LIBRADOS and talks to the RADOS cluster directly over a socket to store and retrieve objects.]
LIBRADOS: RADOS ACCESS FOR APPS
LIBRADOS:
• Direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
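As a minimal sketch of the Python bindings (assuming python-rados is installed, /etc/ceph/ceph.conf and a client keyring are readable, and a pool named "data" exists; the pool and object names are placeholders):

    # Minimal librados sketch: connect, write an object, read it back.
    # Pool "data" and the object name are assumptions for this example.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')
        try:
            ioctx.write_full('hello-object', b'hello from librados')
            print(ioctx.read('hello-object'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()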
ARCHITECTURAL COMPONENTS (recap; next: RGW)
THE RADOS GATEWAY
[Diagram: applications speak REST to one or more RADOSGW instances; each gateway uses LIBRADOS over a socket to store the objects in the RADOS cluster.]
RADOSGW MAKES RADOS WEBBY
RADOSGW:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
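An S3 application only needs an endpoint override to point at the gateway. A hedged sketch using boto3; the endpoint, port, credentials, and bucket name are placeholders:

    # Hypothetical sketch: talk to RADOSGW through its S3-compatible API.
    # Endpoint, credentials, and bucket name are placeholders.
    import boto3

    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )
    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello from rgw')
    print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())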
ARCHITECTURAL COMPONENTS (recap; next: RBD)
STORING VIRTUAL DISKS
[Diagram: a hypervisor uses LIBRBD to store a VM's disk image in the RADOS cluster.]
SEPARATE COMPUTE FROM STORAGE
[Diagram: because the image lives in RADOS, any hypervisor with LIBRBD can attach it; compute and storage scale independently.]
KRBD - KERNEL MODULE
[Diagram: a plain Linux host maps the same RADOS-backed block device through the in-kernel KRBD driver instead of LIBRBD.]
RBD STORES VIRTUAL DISKS
RADOS BLOCK DEVICE:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux kernel (2.6.39+) and RHEL 7
  • Qemu/KVM; native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
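A minimal sketch with the python-rbd bindings; the pool name "rbd", the image name, and the size are placeholders:

    # Hypothetical sketch: create an RBD image, write a block, snapshot it.
    # Pool "rbd" and image "vm-disk-0" are assumptions for this example.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd.RBD().create(ioctx, 'vm-disk-0', 1024 ** 3)   # 1 GiB image
    image = rbd.Image(ioctx, 'vm-disk-0')
    image.write(b'boot data goes here', 0)            # write at offset 0
    image.create_snap('first-snap')                   # point-in-time snapshot
    image.close()

    ioctx.close()
    cluster.shutdown()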
RBD SNAPSHOTS
• Export snapshots to geographically dispersed data centers
  • Institute disaster recovery
• Export incremental snapshots
  • Minimize network bandwidth by only sending changes
ARCHITECTURAL COMPONENTS (recap; next: CEPHFS)
SEPARATE METADATA SERVER
[Diagram: a Linux host's kernel client writes file data straight to the RADOS cluster, while metadata operations go to a separate metadata server.]
SCALABLE METADATA SERVERS
METADATA SERVER:
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for the shared filesystem
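A minimal sketch with the python-cephfs (libcephfs) bindings, assuming a running MDS and an existing CephFS filesystem; the directory name is a placeholder:

    # Hypothetical sketch: talk to CephFS through libcephfs (python-cephfs).
    # Requires a running MDS and an existing CephFS filesystem.
    import cephfs

    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    fs.mkdir('/demo', 0o755)     # metadata operation, handled by the MDS
    print(fs.stat('/demo'))      # file metadata comes back via the MDS

    fs.unmount()
    fs.shutdown()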
CALAMARI
CALAMARI ARCHITECTURE
[Diagram: the Calamari admin node runs the master; every node in the Ceph storage cluster, including the monitors (M), runs a minion that reports to it.]
USE CASES
WEB APPLICATION STORAGE
[Diagram: the web application's app servers speak S3/Swift to a pair of Ceph Object Gateways (RGW), which store the data in the Ceph storage cluster (RADOS).]
MULTI-SITE OBJECT STORAGE
[Diagram: two sites, each with its own web application, app server, and Ceph Object Gateway (RGW), backed by separate Ceph storage clusters in US-EAST and EU-WEST.]
ARCHIVE / COLD STORAGE
[Diagram: the application writes into a replicated cache pool that fronts an erasure-coded backing pool in the same Ceph storage cluster.]
ERASURE CODING
Replicated pool: full copies of stored objects
• Very high durability
• Quicker recovery

Erasure-coded pool: one copy plus parity
• Cost-effective durability
• Expensive recovery

[Diagram: a replicated pool keeps whole copies of an object; an erasure-coded pool splits it into data chunks (1-4) plus coding chunks (X, Y).]
ERASURE CODING: HOW DOES IT WORK?
[Diagram: the object is split into data chunks (1-4) and coding chunks (X, Y), and each chunk lands on a different OSD in the erasure-coded pool.]
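As a toy illustration of the data-plus-coding-chunk idea (not Ceph's actual erasure-code plugins, which use Reed-Solomon-style codes and more than one coding chunk): a single XOR parity chunk is enough to rebuild any one lost chunk from the survivors.

    # Toy illustration of erasure coding: split an object into k data chunks
    # and add one XOR parity chunk; any single lost chunk can be rebuilt.
    from functools import reduce

    def split_chunks(data, k):
        size = -(-len(data) // k)                      # ceiling division
        return [data[i*size:(i+1)*size].ljust(size, b'\0') for i in range(k)]

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    obj = b'archived object payload'
    chunks = split_chunks(obj, 4)                      # data chunks 1-4
    parity = reduce(xor, chunks)                       # coding chunk "X"

    # Simulate losing chunk 3, then rebuild it from the rest plus parity.
    lost = chunks[2]
    rebuilt = reduce(xor, [c for i, c in enumerate(chunks) if i != 2] + [parity])
    assert rebuilt == lost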
CACHE TIERING
[Diagram: the Ceph client reads and writes through a cache tier in writeback mode, which sits in front of a replicated backing pool in the Ceph storage cluster.]
WEBSCALE APPLICATIONS
[Diagram: the web application's app servers use the native protocol (LIBRADOS) to talk directly to the Ceph storage cluster (RADOS).]
ARCHIVE / COLD STORAGE
[Diagram: the application writes into a replicated cache pool that fronts an erasure-coded backing pool in the Ceph storage cluster.]
DATABASES
[Diagram: MySQL / MariaDB runs on a Linux host whose in-kernel Ceph Block Device (RBD) client speaks the native protocol to the Ceph storage cluster (RADOS).]
Future Ceph Roadmap
CEPH ROADMAP
Hammer (current release): Object Versioning, Performance Improvements
Infernalis: NewStore, Object Expiration, Alternative Web Server for RGW, Performance Improvements
J-Release: Stable CephFS?, ???, Performance Improvements
NEXT STEPS
WHAT NOW?

Getting Started with Ceph
• Read about the latest version of Ceph: http://ceph.com/docs
• Deploy a test cluster using ceph-deploy: http://ceph.com/qsg
• Deploy a test cluster on the AWS free tier using Juju: http://ceph.com/juju
• Ansible playbooks for Ceph: https://www.github.com/alfredodeza/ceph-ansible

Getting Involved with Ceph
• Most discussion happens on the ceph-devel and ceph-users mailing lists. Join or view archives at http://ceph.com/list
• IRC is a great place to get help (or help others!): #ceph and #ceph-devel. Details and logs at http://ceph.com/irc
• Download the code: http://www.github.com/ceph
• The tracker manages bugs and feature requests. Register and start looking around at http://tracker.ceph.com
• Doc updates and suggestions are always welcome. Learn how to contribute docs at http://ceph.com/docwriting
Thank You
extras
● metrics.ceph.com
● http://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at
