4. What is Ceph?
● Highly available resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image
service and an S3-compatible object storage service
5. Interfaces to storage
[Diagram: three interfaces layered on RADOS]
● OBJECT STORAGE (RGW): S3 & Swift APIs, multi-tenant, Keystone, geo-replication, native API
● BLOCK STORAGE (RBD): OpenStack, Linux kernel, iSCSI, snapshots, clones
● FILE SYSTEM (CephFS): POSIX, Linux kernel, CIFS/NFS, HDFS, distributed metadata
7. Architectural Components
[Diagram: apps use RGW or LIBRADOS, hosts/VMs use RBD, clients use CephFS; all sit on RADOS]
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
8. Object Storage Daemons
[Diagram: each OSD daemon manages one local filesystem (btrfs, xfs or ext4) on one disk; monitors (M) run alongside the OSDs]
9. RADOS Components
OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● These do not serve stored objects to clients
15. CRUSH is a quick calculation
[Diagram: an OBJECT being placed onto OSDs in the RADOS CLUSTER by a CRUSH calculation]
16. CRUSH: Dynamic data placement
CRUSH:
● Pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
● Limited data migration on change
● Rule-based configuration
● Infrastructure topology aware
● Adjustable replication
● Weighting
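To see the calculation in action, a minimal illustration (pool name "rbd" and object name "foo" are placeholders): `ceph osd map` asks the cluster to run CRUSH for a given object and reports the resulting placement, with no lookup table involved.

# where would CRUSH place object "foo" in pool "rbd"?
ceph osd map rbd foo
# the output names the placement group and the up/acting OSD set chosen by CRUSH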
17. Architectural Components
[Diagram: apps use RGW or LIBRADOS, hosts/VMs use RBD, clients use CephFS; all sit on RADOS]
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
18. RBD: Virtual disks in Ceph
RADOS BLOCK DEVICE:
● Storage of disk images in RADOS
● Decouples VMs from the host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in:
  ● Mainline Linux kernel (2.6.39+)
  ● Qemu/KVM
  ● OpenStack, CloudStack, OpenNebula, Proxmox
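As a minimal sketch of the RBD workflow (pool and image names are placeholders; `--size` is in megabytes here):

rbd create mypool/vm-disk --size 10240          # 10 GB image, striped across the pool
rbd snap create mypool/vm-disk@base             # point-in-time snapshot
rbd snap protect mypool/vm-disk@base            # protect it so it can be cloned
rbd clone mypool/vm-disk@base mypool/vm-clone   # copy-on-write clone for a new VM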
19. Storing virtual disks
[Diagram: the VM's hypervisor links against LIBRBD, which stores the virtual disk as objects in the RADOS cluster (monitors shown as M)]
21. Storage in OpenNebula deployments
OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)
22. RBD and libvirt/qemu
● librbd (user space) client integration with libvirt/qemu
● Support for live migration, thin clones
● Get recent versions!
● Directly supported in OpenNebula since 4.0 with the
Ceph Datastore (wraps `rbd` CLI)
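As a sketch of what the integration looks like from the qemu side (pool and image names are placeholders), images can be created and attached through the rbd: protocol using librbd, with no kernel mapping on the host:

# create a raw image directly in the cluster via librbd
qemu-img create -f raw rbd:mypool/vm-disk 10G
# attach it to a guest; qemu talks to RADOS through librbd
qemu-system-x86_64 -m 1024 -drive format=raw,file=rbd:mypool/vm-disk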
More info online:
http://ceph.com/docs/master/rbd/libvirt/
http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
23. Other hypervisors
● OpenNebula is flexible, so can we also use Ceph with
non-libvirt/qemu hypervisors?
● Kernel RBD: can present RBD images in /dev/ on
hypervisor host for software unaware of librbd
● Docker: can exploit RBD volumes with a local
filesystem for use as data volumes – maybe CephFS
in future...?
● For unsupported hypervisors, can adapt to Ceph using
e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports
carefully!)
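A minimal kernel RBD sketch (pool, image and mount point are placeholders): the image is mapped on the hypervisor host, after which software unaware of librbd just sees an ordinary block device:

rbd map mypool/vm-disk        # appears as /dev/rbd0 (and /dev/rbd/mypool/vm-disk)
mkfs.xfs /dev/rbd0            # format and use like any local disk
mount /dev/rbd0 /mnt/vm-disk
rbd unmap /dev/rbd0           # release the device when finished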
24. Choosing hardware
Testing/benchmarking/expert advice is needed, but there
are general guidelines:
● Prefer many cheap nodes to few expensive nodes (10
is better than 3)
● Include small but fast SSDs for OSD journals
● Don't simply buy biggest drives: consider
IOPs/capacity ratio
● Provision network and IO capacity sufficient for your
workload plus recovery bandwidth from node failure.
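As one concrete illustration of the SSD-journal point (host and device names are placeholders; ceph-deploy syntax of that era), each OSD can be given a separate journal device:

# data on /dev/sdb, OSD journal on the SSD partition /dev/sdf1
ceph-deploy osd create node1:sdb:/dev/sdf1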
34. CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file
system." Proceedings of the 7th symposium on Operating systems
design and implementation. USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf
36. Components
[Diagram: a Linux host using the ceph.ko kernel client sends metadata and data requests to the Ceph server daemons, including the monitors (M)]
37. From application to disk
[Diagram: Application → ceph-fuse / libcephfs / kernel client → client network protocol → ceph-mds and RADOS → disk]
38. Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
40. Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read heavy workloads by replicating non-authoritative
copies (cached with capabilities just like
clients do)
● In practice work at directory fragment level in order to
handle large dirs
41. Data placement
● Stripe file contents across RADOS objects
● Get the full RADOS cluster bandwidth from clients
● Fairly tolerant of object losses: reads of lost objects return zeros
● Control striping with layout virtual xattrs (see the example below)
● Layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files as
'stray'; the RADOS delete ops are sent by the MDS
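For example (mount point, paths and pool name are placeholders), striping and data-pool selection are driven by the ceph.* virtual xattrs:

# inspect the layout of an existing file
getfattr -n ceph.file.layout /mnt/ceph/somefile
# new files under this directory: 4 MB stripe unit, 4 objects wide, in pool "fs_data_ssd"
setfattr -n ceph.dir.layout.stripe_unit  -v 4194304     /mnt/ceph/fastdir
setfattr -n ceph.dir.layout.stripe_count -v 4           /mnt/ceph/fastdir
setfattr -n ceph.dir.layout.pool         -v fs_data_ssd /mnt/ceph/fastdir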
42. Clients
● Two implementations:
● ceph-fuse/libcephfs
● kclient
● Interplay with the VFS page cache; efficiency is harder with
FUSE (extraneous stat calls etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it is hogging metadata
locks: include clients in troubleshooting
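For reference (monitor address, secret file and mount point are placeholders), the two clients are mounted like this:

# FUSE client (ceph-fuse on top of libcephfs)
ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs
# kernel client (kclient)
mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret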
43. Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file"
in the metadata pool.
– I/O latency on metadata ops is sum of network
latency and journal commit latency.
– Metadata remains pinned in in-memory cache
until expired from journal.
44. Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in
cache; in others it's more of a stream.
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also
warms up cache. Use standby replay to keep that
cache warm.
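A rough sketch of the knobs involved (daemon ids and the value are placeholders; option names as of this era of Ceph, so check your version):

# adjust the MDS cache size (counted in inodes) at runtime via the admin socket
ceph daemon mds.a config set mds_cache_size 200000
# in ceph.conf, run a second MDS in standby-replay so it tails the journal
# and keeps its cache warm, ready to take over:
#   [mds.b]
#   mds standby replay   = true
#   mds standby for rank = 0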
45. Lookup by inode
● Sometimes we need inode → path mapping:
● Hard links
● NFS handles
● Costly to store this: mitigate by piggybacking paths
(backtraces) onto data objects
● Con: storing metadata to data pool
● Con: extra IOs to set backtraces
● Pro: disaster recovery from data pool
● Future: improve backtrace writing latency
46. CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 64
ceph osd pool create fs_metadata 64
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
(64 here is just an example pg_num; pool creation requires one)
47. Managing CephFS clients
● New in giant: see hostnames of connected clients
● Client eviction is sometimes important:
● Skip the wait during reconnect phase on MDS restart
● Allow others to access files locked by crashed client
● Use OpTracker to inspect ongoing operations
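For example (MDS id and session id are placeholders), these all live on the MDS admin socket:

ceph daemon mds.a session ls            # list connected clients and their metadata
ceph daemon mds.a session evict 4123    # forcibly evict a client session by id
ceph daemon mds.a dump_ops_in_flight    # OpTracker: show in-flight metadata operations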
48. CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and recent kernel
● Use a conservative configuration:
● Single active MDS, plus one standby
● Dedicated MDS server
● Kernel client
● No snapshots, no inline data
49. Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS
configuration
50. Giant->Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
51. FSCK and repair
● Recover from damage:
● Loss of data objects (which files are damaged?)
● Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
● Are recursive stats consistent?
● Does metadata on disk match cache?
● Does file size metadata match data on disk?
● Repair:
● Automatic where possible
● Manual tools to enable support
52. Client management
● Current eviction is not 100% safe against rogue clients
● Update to client protocol to wait for OSD blacklist
● Client metadata
● Initially domain name, mount point
● Extension to other identifiers?
53. Online diagnostics
● Bugs exposed relate to failures of one client to release
resources for another client: “my filesystem is frozen”.
Introduce new health messages:
● “client xyz is failing to respond to cache pressure”
● “client xyz is ignoring capability release messages”
● Add client metadata to allow us to give domain names
instead of IP addrs in messages.
● Opaque behavior in the face of dead clients. Introduce
`session ls`
● Which clients does MDS think are stale?
● Identify clients to evict with `session evict`
54. Journal resilience
● Bad journal prevents MDS recovery: “my MDS crashes
on startup”:
● Data loss
● Software bugs
● Updated on-disk format to make recovery from
damage easier
● New tool: cephfs-journal-tool
● Inspect the journal, search/filter
● Chop out unwanted entries/regions
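A hedged sketch of cephfs-journal-tool usage (subcommands as introduced around Giant; consult the tool's help on your version):

cephfs-journal-tool journal inspect                  # check the journal for damage
cephfs-journal-tool journal export backup.bin        # back it up before any surgery
cephfs-journal-tool event recover_dentries summary   # salvage metadata from readable entries
cephfs-journal-tool journal reset                    # last resort: discard the damaged journal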
55. Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
● Require some free memory to make progress
● Require client cooperation to unpin cache objects
● Anticipate tuning required for cache behaviour: what
should we evict?
● Full OSD cluster
● Require explicit handling to abort with -ENOSPC
● MDS → RADOS flow control:
● Contention between I/O to flush cache and I/O to journal
56. Test, QA, bug fixes
● The answer to “Is CephFS production ready?”
● teuthology test framework:
● Long running/thrashing test
● Third party FS correctness tests
● Python functional tests
● We dogfood CephFS internally
● Various kclient fixes discovered
● Motivation for new health monitoring metrics
● Third party testing is extremely valuable
57. What's next?
● You tell us!
● Recent survey highlighted:
● FSCK hardening
● Multi-MDS hardening
● Quota support
● Which use cases will community test with?
● General purpose
● Backup
● Hadoop
58. Reporting bugs
● Does the most recent development release or kernel
fix your issue?
● What is your configuration? MDS config, Ceph
version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
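For instance (the daemon id is a placeholder), MDS debug logging can be raised at runtime before reproducing the problem:

# raise MDS and messenger debug levels via injectargs
ceph tell mds.a injectargs '--debug-mds 20 --debug-ms 1'
# or persistently in ceph.conf under [mds]: debug mds = 20, debug ms = 1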
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
59. Future
● Ceph Developer Summit:
● When: 8 October
● Where: online
● Post-Hammer work:
● Recent survey highlighted multi-MDS, quota support
● Testing with clustered Samba/NFS?