Ceph Fundamentals
Ross Turk
VP Community, Inktank
ME ME ME ME ME ME.
Ross Turk
VP Community, Inktank
ross@inktank.com
@rossturk
inktank.com | ceph.com
Ceph Architectural Overview
Ah! Finally, 32 slides in and he gets to the nerdy stuff.
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes.

Four interfaces sit on top of RADOS, each serving a different consumer (apps use LIBRADOS and RADOSGW, hosts/VMs use RBD, clients use CEPH FS):
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
[The architecture diagram again, starting at the bottom with RADOS itself.]
[Diagram: each OSD runs on top of a local filesystem (btrfs, xfs, or ext4) on its own DISK.]
[Diagram: three monitors (M), with a human admin interacting with the cluster through them.]
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients

OSDs:
• 10s to 10,000s in a cluster
• One per disk (or one per SSD, RAID group…)
• Serve stored objects to clients
• Intelligently peer to perform replication and recovery tasks
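To make the monitors' role concrete, here is a minimal sketch using the librados Python binding to ask them for cluster status. It assumes a running cluster and a readable /etc/ceph/ceph.conf, and the exact layout of the returned health keys varies by release:

    import json
    import rados

    # Connect using the local cluster configuration (assumed path).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for cluster status -- the same consensus-backed
    # view that `ceph status` reports.
    cmd = json.dumps({"prefix": "status", "format": "json"})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    if ret == 0:
        print(json.loads(outbuf).get('health'))

    cluster.shutdown()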
[The architecture diagram again, now highlighting LIBRADOS.]
[Diagram: an APP links LIBRADOS (L) and talks to the monitors and OSDs directly over a socket.]
LIBRADOS:
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java, Erlang
• Direct access to storage nodes
• No HTTP overhead
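A minimal sketch of that direct path, using the Python binding; the pool name "data" and the config path are assumptions, not from the slides:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a pool and talk to the OSDs directly --
    # no gateway, no HTTP.
    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello', b'stored straight into RADOS')
    print(ioctx.read('hello'))

    ioctx.close()
    cluster.shutdown()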
[The architecture diagram again, now highlighting RADOSGW.]
[Diagram: an APP speaks REST to RADOSGW, which uses LIBRADOS over a socket to store objects in the cluster.]
RADOS Gateway:
• REST-based object storage proxy
• Uses RADOS to store objects
• API supports buckets, accounts
• Usage accounting for billing
• Compatible with S3 and Swift applications
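Because the gateway is S3-compatible, a stock S3 client works unchanged. A sketch using boto3, where the endpoint and credentials are placeholders for your own RGW deployment:

    import boto3

    # Point an ordinary S3 client at RADOS Gateway instead of AWS.
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',  # placeholder endpoint
        aws_access_key_id='ACCESS_KEY',              # placeholder credentials
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo')
    s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored in RADOS')
    print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())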
[Diagram: multiple RADOSGW instances behind LOAD BALANCERs, all fronting the same RADOS cluster.]
[The architecture diagram again, now highlighting RBD.]
[Diagram: a VM's virtual disk is provided by its HYPERVISOR through LIBRBD on top of LIBRADOS.]
[Diagram: because images live in the cluster, the same VM can come up on another hypervisor running LIBRBD.]
[Diagram: a HOST maps an RBD image directly through KRBD, the kernel module.]
RADOS Block Device:
• Storage of disk images in RADOS
• Decouples VMs from host
• Images are striped across the cluster (pool)
• Snapshots
• Copy-on-write clones
• Support in:
  • Mainline Linux kernel (2.6.39+)
  • QEMU/KVM; native Xen coming soon
  • OpenStack, CloudStack, Nebula, Proxmox
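A sketch of creating and writing an image with the rbd Python binding; the pool and image names are illustrative:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # Create a 1 GiB image; its data is striped across the pool's OSDs.
    rbd.RBD().create(ioctx, 'vm-disk', 1024 ** 3)

    with rbd.Image(ioctx, 'vm-disk') as image:
        image.write(b'first blocks of a VM disk', 0)  # write at offset 0

    ioctx.close()
    cluster.shutdown()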
[The architecture diagram again, now highlighting CEPH FS.]
[Diagram: a CephFS CLIENT sends metadata operations to the metadata servers and file data directly to the OSDs.]
Metadata Server:
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
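A sketch with the cephfs Python binding (libcephfs). Note that only the mkdir and open calls involve the MDS; the write goes straight to the OSDs. Paths and config are assumptions:

    import cephfs

    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount()

    fs.mkdir(b'/demo', 0o755)                     # metadata op: handled by an MDS
    fd = fs.open(b'/demo/notes.txt', 'w', 0o644)  # create: also a metadata op
    fs.write(fd, b'file data flows to the OSDs', 0)  # data op: bypasses the MDS
    fs.close(fd)

    fs.unmount()
    fs.shutdown()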
What Makes Ceph Unique?
Part one: it never, ever remembers where it puts stuff.
[Diagram: an APP facing a wall of identical storage nodes, with no way to know where its data should go.]
How Long Did It Take You To Find Your Keys This Morning?
azmeen, Flickr / CC BY 2.0
[Diagram: the APP now keeps a lookup table recording which node holds each object.]
Dear Diary: Today I Put My Keys on the Kitchen Counter
Barnaby, Flickr / CC BY 2.0
[Diagram: the APP picks a node by static name ranges (A-G, H-N, O-T, U-Z); an object named "F*" lands on the A-G node.]
I Always Put My Keys on the Hook By the Door
vitamindave, Flickr / CC BY 2.0
HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?
The Answer: CRUSH!!!!!
pasukaru76, Flickr / CC SA 2.0
[Diagram: placing an OBJECT is a two-step calculation, with no lookup table anywhere:]
1. hash(object name) % num_pg → a placement group
2. CRUSH(pg, cluster state, rule set) → a set of OSDs
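An illustrative toy version of those two steps in Python. The real system uses Ceph's rjenkins hash and the actual CRUSH algorithm; the stand-ins below only mimic the shape of the calculation:

    import hashlib

    NUM_PG = 256              # placement groups in the pool (example value)
    OSDS = list(range(12))    # cluster state: twelve healthy OSDs

    def object_to_pg(name):
        """Step 1: hash(object name) % num_pg (stand-in hash)."""
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PG

    def pg_to_osds(pg, osds, replicas=3):
        """Step 2: CRUSH(pg, cluster state, rule set) -- here a
        deterministic rendezvous-style pick, not real CRUSH."""
        ranked = sorted(osds,
                        key=lambda o: hashlib.md5(f'{pg}:{o}'.encode()).hexdigest())
        return ranked[:replicas]

    pg = object_to_pg('foo')
    print(pg, pg_to_osds(pg, OSDS))  # same inputs, same answer, no lookup table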
CRUSH:
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Statistically uniform distribution
• Stable mapping
• Limited data migration on change (a toy demonstration follows below)
• Rule-based configuration
• Infrastructure topology aware
• Adjustable replication
• Weighting
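The toy demonstration promised above: with a rendezvous-style stand-in for CRUSH, adding one OSD re-homes only a small fraction of placement groups rather than reshuffling everything. Again, this is illustrative, not the real algorithm:

    import hashlib

    def pg_to_osds(pg, osds, replicas=3):
        # Deterministic rendezvous-style placement (stand-in for CRUSH).
        ranked = sorted(osds,
                        key=lambda o: hashlib.md5(f'{pg}:{o}'.encode()).hexdigest())
        return set(ranked[:replicas])

    before = [pg_to_osds(pg, range(12)) for pg in range(256)]
    after = [pg_to_osds(pg, range(13)) for pg in range(256)]  # one OSD added

    moved = sum(b != a for b, a in zip(before, after))
    print(f'{moved} of 256 placement groups changed')  # a minority, not a reshuffle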
A worked example: an object NAME:"foo" in POOL:"bar".
• The pool "bar" has pool ID 3.
• hash("foo") % 256 = 0x23, so the OBJECT maps to PLACEMENT GROUP 3.23.
• CRUSH maps placement group 3.23 to its TARGET OSDs: 24, 3, and 12.
What Makes Ceph Unique?
Part two: it has smart block devices for all those impatient, selfish VMs.
[Diagram: a VM backed by LIBRBD and LIBRADOS inside the HYPERVISOR.]
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
[Diagram: a 144-unit image is cloned instantly; the clone adds no data of its own (0 + 144 = 144). When a CLIENT writes, only the new data lands in the clone (4 + 144 = 148); reads are served from the clone where it has data and from the parent image otherwise.]
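A sketch of that flow with the rbd Python binding: snapshot a golden image, protect the snapshot, and clone it instantly; only the clone's own writes consume new space. Names are illustrative, and the parent must be a format 2 image:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    with rbd.Image(ioctx, 'golden-image') as parent:
        parent.create_snap('base')    # freeze the golden image
        parent.protect_snap('base')   # clones require a protected snapshot

    # Instant copy: the clone references the parent, no data is copied.
    rbd.RBD().clone(ioctx, 'golden-image', 'base', ioctx, 'vm-0001')

    with rbd.Image(ioctx, 'vm-0001') as clone:
        clone.write(b'per-VM changes only', 0)  # COW: just this write is new data

    ioctx.close()
    cluster.shutdown()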
What Makes Ceph Unique?
Part three: it has powerful friends, ones you probably already know.
[Diagram: APACHE CLOUDSTACK using Ceph as the PRIMARY STORAGE POOL for hypervisors and as the SECONDARY STORAGE POOL for snapshots, templates, and images.]
[Diagram: OPENSTACK integration: RADOSGW behind the KEYSTONE and SWIFT APIs, with the cluster also backing the CINDER, GLANCE, and NOVA APIs and the hypervisor.]
What Makes Ceph Unique?
Part four: clustered metadata
[Diagram: the CephFS CLIENT and the cluster again, now focusing on the metadata path.]
[Diagram: one tree, three metadata servers. Which server owns which part of the tree?]
DYNAMIC SUBTREE PARTITIONING
[Diagram sequence: one MDS manages the whole tree; as more MDSs arrive, each takes over a subtree, or even a single hot directory, based on load.]
Questions?
Ross Turk
VP Community, Inktank
ross@inktank.com
@rossturk
inktank.com | ceph.com

Editor's Notes

  1. RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
  2. But that’s a lot to digest all at once. Let’s start with RADOS.
  3. MDSs store all of their data within RADOS itself, but there’s still a problem…
  4. There are multiple MDSs!
  5. So how do you have one tree and multiple servers?
  6. If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  7. When the second one comes along, it will intelligently partition the work by taking a subtree.
  8. When the third MDS arrives, it will attempt to split the tree again.
  9. Same with the fourth.
  10. An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically, based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.