Ceph Block Devices:
A Deep Dive
Josh Durgin
RBD Lead
June 24, 2015
Ceph Motivating Principles
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage wherever possible
● Open Source (LGPL)
● Move beyond legacy approaches
– client/cluster instead of client/server
– built-in HA instead of ad hoc HA
Ceph Components
● RGW – A web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD – A reliable, fully distributed block device with cloud platform integration
● CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management
(Diagram: apps sit on RGW and LIBRADOS, hosts/VMs on RBD, clients on CEPHFS; everything is layered over RADOS)
Storing Virtual Disks
(Diagram: a VM runs on a hypervisor; LIBRBD in the hypervisor talks directly to the RADOS cluster and its monitors)
Kernel Module
(Diagram: a Linux host maps images through the KRBD kernel module, again talking directly to the RADOS cluster)
RBD
● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
– QEMU, libvirt
– Linux kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox, oVirt
● Incremental backup (relative to snapshots)
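In CLI terms, a minimal session exercising these features might look like the following (pool and image names are invented; the commands are the stock rbd tool):

rbd create rbd/myimage --size 10240 --image-format 2    # 10 GB image; format 2 supports clones
rbd snap create rbd/myimage@base                        # read-only snapshot
rbd snap protect rbd/myimage@base
rbd clone rbd/myimage@base rbd/myclone                  # copy-on-write clone
rbd export-diff --from-snap base rbd/myimage incr.diff  # incremental backup since the snap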
RADOS
● Flat namespace within a pool
● Rich object API
– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound atomic operations
– RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
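Much of this API is visible through the rados CLI; a small sketch (pool and object names invented; compound operations and RADOS classes are reachable only through librados):

echo hello > data.bin
rados -p mypool put myobject data.bin            # object bytes
rados -p mypool setxattr myobject owner josh     # attribute
rados -p mypool setomapval myobject key1 val1    # key/value data
rados -p mypool listomapvals myobject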
RADOS Components
● OSDs: 10s to 1000s in a cluster
– One per disk (or one per SSD, RAID group…)
– Serve stored objects to clients
– Intelligently peer for replication & recovery
● Monitors
– Maintain cluster membership and state
– Provide consensus for distributed decision-making
– Small, odd number (e.g., 5)
– Not part of the data path
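On a live cluster, these components are visible with standard commands:

ceph -s          # overall health, monitor quorum, OSD counts
ceph mon stat    # monitor membership
ceph osd tree    # OSDs laid out in the CRUSH hierarchy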
Metadata
● rbd_directory
– Maps image name to id, and vice versa
● rbd_children
– Lists clones in a pool, indexed by parent
● rbd_id.$image_name
– The internal id, locatable using only the user-specified image name
● rbd_header.$image_id
– Per-image metadata
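These objects can be inspected directly with the rados tool; a sketch assuming an image named myimage in pool rbd (the image id shown is invented):

rados -p rbd listomapvals rbd_directory             # name <-> id mappings
rados -p rbd get rbd_id.myimage /tmp/id
strings /tmp/id                                     # internal id, e.g. 101a6b8b4567
rados -p rbd listomapvals rbd_header.101a6b8b4567   # size, order, features, snaps, ...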
Data
● rbd_data.* objects
– Named based on offset in image
– Non-existent to start with
– Plain data in each object
– Snapshots handled by RADOS
– Often sparse
Striping
● Objects are uniformly sized
– Default is simple 4MB divisions of device
● Randomly distributed among OSDs by CRUSH
● Parallel work is spread across many spindles
● No single set of servers responsible for image
● Small objects lower OSD usage variance
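Concretely, with the default 4MB objects the object holding a given byte offset is just offset / 4MB, named with the object number as 16 hex digits (image id invented):

rbd info rbd/myimage                           # shows block_name_prefix, e.g. rbd_data.101a6b8b4567
# offset 9 MB -> object number 2 -> rbd_data.101a6b8b4567.0000000000000002
rados -p rbd ls | grep rbd_data.101a6b8b4567   # only objects that have been written exist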
I/O
(Diagram: the VM's I/O flows through LIBRBD and its cache in the hypervisor, directly to the OSDs of the RADOS cluster; there is no intermediate server)
Snapshots
● Object granularity
● Snapshot context [list of snap ids, latest snap id]
– Stored in rbd image header (self-managed)
– Sent with every write
● Snapshot ids managed by monitors
● Deleted asynchronously
● RADOS keeps per-object overwrite stats, so diffs are easy
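For example, diffing between snapshots is cheap because the OSDs already track per-object overwrites:

rbd snap create rbd/myimage@s1
# ... guest writes ...
rbd snap create rbd/myimage@s2
rbd diff --from-snap s1 rbd/myimage@s2                  # changed extents only
rbd export-diff --from-snap s1 rbd/myimage@s2 incr.diff
rbd snap rm rbd/myimage@s1                              # space reclaimed asynchronously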
Snapshots – example sequence
(Diagrams: LIBRBD writing to objects as the snap context changes)
1. Writes initially carry snap context ([], 7): no snaps, latest snap id 7.
2. The client creates snap 8; the monitors allocate the new snap id.
3. The snap context in the image header becomes ([8], 8).
4. Subsequent writes carry ([8], 8).
5. On the first write to an object after the snap, the OSD preserves the old data for snap 8 before applying the write.
Watch/Notify
● establish stateful 'watch' on an object
– client interest persistently registered with object
– client keeps connection to OSD open
● send 'notify' messages to all watchers
– notify message (and payload) sent to all watchers
– notification (and reply payloads) on completion
● strictly time-bounded liveness check on watch
– no notifier can falsely believe a dead watcher received its message
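The watchers registered on an image's header can be listed with the rados tool (image id invented); header updates such as snapshot creation are what trigger the notifies:

rados -p rbd listwatchers rbd_header.101a6b8b4567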
Watch/Notify – example sequence
(Diagrams: two hypervisors running LIBRBD, plus an administrative /usr/bin/rbd process, all attached to one image)
1. Each librbd client establishes a watch on the image header object.
2. /usr/bin/rbd performs an administrative operation, e.g. adding a snapshot.
3. A notify is sent to all watchers of the header.
4. Each watcher acknowledges; the notifier gets a notify complete with the replies.
5. The clients re-read the image metadata to pick up the change.
Clones
● Copy-on-write (and, since Hammer, optionally copy-on-read)
● Object granularity
● Independent settings
– striping, feature bits, object size can differ
– can be in different pools
● Clones are based on protected snapshots
– 'protected' means they can't be deleted
● Can be flattened
– Copy all data from parent
– Remove parent relationship
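The full clone lifecycle in CLI form (pool and image names invented; the parent must be a format 2 image):

rbd snap create rbd/golden@base
rbd snap protect rbd/golden@base          # protected snaps can't be deleted
rbd clone rbd/golden@base vms/vm1-disk    # clone can live in a different pool
rbd flatten vms/vm1-disk                  # copy parent data, drop the relationship
rbd snap unprotect rbd/golden@base        # allowed once nothing depends on the snap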
Clones – read
(Diagram: reads of objects not yet in the clone are redirected to the parent snapshot)
Clones – write
(Diagram: a write to an object missing from the clone first reads it from the parent, copies it up into the clone, then applies the write)
Ceph and OpenStack
Virtualized Setup
● Secret key stored by libvirt
● XML defining the VM is fed in, including:
– Monitor addresses
– Client name
– QEMU block device cache setting
● Writeback recommended
– Bus on which to attach the block device
● virtio-blk/virtio-scsi recommended
● IDE OK for legacy systems
– Discard options (IDE/SCSI only), I/O throttling if desired
(Diagram: libvirt feeds vm.xml to QEMU, launched as "qemu -enable-kvm ..."; the VM's disk I/O goes through LIBRBD and its cache)
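A sketch of the disk element such a VM XML might carry (monitor host, secret UUID, pool/image, and domain name below are placeholders; the element syntax is libvirt's standard RBD disk configuration):

cat > rbd-disk.xml <<'EOF'
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <source protocol='rbd' name='rbd/myimage'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
EOF
# attach to a running domain (hypothetical domain name 'myvm')
virsh attach-device myvm rbd-disk.xml --persistent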
Kernel RBD
● rbd map sets everything up
● /etc/ceph/rbdmap is like /etc/fstab
● udev adds handy symlinks:
– /dev/rbd/$pool/$image[@snap]
● striping v2 and later feature bits not supported yet
● Can be used to back LIO, NFS, SMB, etc.
● No specialized cache, page cache used by filesystem on top
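For example (image name and mount point invented; rbd map and the rbdmap format are standard):

rbd map rbd/myimage --id admin       # loads krbd and maps the image, e.g. to /dev/rbd0
ls -l /dev/rbd/rbd/myimage           # udev symlink to the mapped device
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt                 # the filesystem's page cache provides caching
# /etc/ceph/rbdmap entry, fstab-style, to map at boot:
#   rbd/myimage  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring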
What's new in Hammer
● Copy-on-read
● librbd cache enabled in safe mode by default (writethrough until the guest flushes)
● Readahead during boot
● LTTng tracepoints
● Allocation hints
● Cache hints
● Exclusive locking (off by default for now)
● Object map (off by default for now)
Infernalis
● Easier space tracking (rbd du)
● Faster differential backups
● Per-image metadata
– Can persist rbd options
● RBD journaling
● Enabling new features on the fly
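A sketch of the corresponding commands (image name invented; subcommands per the Infernalis rbd tool):

rbd du rbd/myimage                                   # space usage per image/snapshot
rbd image-meta set rbd/myimage conf_rbd_cache false  # persist an rbd option with the image
rbd image-meta list rbd/myimage
rbd feature enable rbd/myimage exclusive-lock        # turn a feature on after creation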
Future Work
● Kernel client catch up
● RBD mirroring
● Consistency groups
● QoS for rados (policy at rbd level)
● Active/Active iSCSI
● Performance improvements
– NewStore OSD backend
– Improved cache tiering
– Finer-grained caching
– Many more
Questions?
Josh Durgin
jdurgin@redhat.com