4. What is Ceph?
● Highly available resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image
service and an S3-compatible object storage service
5. Interfaces to storage
[Diagram: three interfaces layered on RADOS]
● OBJECT STORAGE (RGW): S3 & Swift APIs, multi-tenant, Keystone, geo-replication, native API
● BLOCK STORAGE (RBD): OpenStack, Linux kernel, iSCSI, snapshots, clones
● FILE SYSTEM (CephFS): POSIX, Linux kernel, CIFS/NFS, HDFS, distributed metadata
7. Architectural Components
[Diagram: apps use RGW or LIBRADOS, hosts/VMs use RBD, clients use CephFS; all sit on RADOS]
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
8. Object Storage Daemons
[Diagram: each OSD daemon manages one local filesystem (btrfs, xfs or ext4) on one disk; monitors (M) run alongside the OSDs]
9. RADOS Components
OSDs:
● 10s to 10000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● These do not serve stored objects to clients
15. CRUSH is a quick calculation
[Diagram: an OBJECT being placed onto OSDs in the RADOS CLUSTER by a CRUSH calculation]
16. CRUSH: Dynamic data placement
CRUSH:
● Pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
● Limited data migration on change
● Rule-based configuration
● Infrastructure topology aware
● Adjustable replication
● Weighting
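To see the calculation in action, a minimal illustration (pool name "rbd" and object name "foo" are placeholders): `ceph osd map` asks the cluster to run CRUSH for a given object and reports the resulting placement, with no lookup table involved.

# where would CRUSH place object "foo" in pool "rbd"?
ceph osd map rbd foo
# the output names the placement group and the up/acting OSD set chosen by CRUSH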
17. Architectural Components
[Diagram: apps use RGW or LIBRADOS, hosts/VMs use RBD, clients use CephFS; all sit on RADOS]
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
18. RBD: Virtual disks in Ceph
RADOS BLOCK DEVICE:
● Storage of disk images in RADOS
● Decouples VMs from the host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in:
  ● Mainline Linux kernel (2.6.39+)
  ● Qemu/KVM
  ● OpenStack, CloudStack, OpenNebula, Proxmox
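As a minimal sketch of the RBD workflow (pool and image names are placeholders; `--size` is in megabytes here):

rbd create mypool/vm-disk --size 10240          # 10 GB image, striped across the pool
rbd snap create mypool/vm-disk@base             # point-in-time snapshot
rbd snap protect mypool/vm-disk@base            # protect it so it can be cloned
rbd clone mypool/vm-disk@base mypool/vm-clone   # copy-on-write clone for a new VM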
19. Storing virtual disks
[Diagram: the VM's hypervisor links against LIBRBD, which stores the virtual disk as objects in the RADOS cluster (monitors shown as M)]
21. Storage in OpenNebula deployments
OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)
22. RBD and libvirt/qemu
● librbd (user space) client integration with libvirt/qemu
● Support for live migration, thin clones
● Get recent versions!
● Directly supported in OpenNebula since 4.0 with the
Ceph Datastore (wraps `rbd` CLI)
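As a sketch of what the integration looks like from the qemu side (pool and image names are placeholders), images can be created and attached through the rbd: protocol using librbd, with no kernel mapping on the host:

# create a raw image directly in the cluster via librbd
qemu-img create -f raw rbd:mypool/vm-disk 10G
# attach it to a guest; qemu talks to RADOS through librbd
qemu-system-x86_64 -m 1024 -drive format=raw,file=rbd:mypool/vm-disk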
More info online:
http://ceph.com/docs/master/rbd/libvirt/
http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
23. Other hypervisors
● OpenNebula is flexible, so can we also use Ceph with
non-libvirt/qemu hypervisors?
● Kernel RBD: can present RBD images in /dev/ on
hypervisor host for software unaware of librbd
● Docker: can exploit RBD volumes with a local
filesystem for use as data volumes – maybe CephFS
in future...?
● For unsupported hypervisors, can adapt to Ceph using
e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports
carefully!)
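A minimal kernel RBD sketch (pool, image and mount point are placeholders): the image is mapped on the hypervisor host, after which software unaware of librbd just sees an ordinary block device:

rbd map mypool/vm-disk        # appears as /dev/rbd0 (and /dev/rbd/mypool/vm-disk)
mkfs.xfs /dev/rbd0            # format and use like any local disk
mount /dev/rbd0 /mnt/vm-disk
rbd unmap /dev/rbd0           # release the device when finished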
24. Choosing hardware
Testing/benchmarking/expert advice is needed, but there
are general guidelines:
● Prefer many cheap nodes to few expensive nodes (10
is better than 3)
● Include small but fast SSDs for OSD journals
● Don't simply buy biggest drives: consider
IOPs/capacity ratio
● Provision network and IO capacity sufficient for your
workload plus recovery bandwidth from node failure.
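As one concrete illustration of the SSD-journal point (host and device names are placeholders; ceph-deploy syntax of that era), each OSD can be given a separate journal device:

# data on /dev/sdb, OSD journal on the SSD partition /dev/sdf1
ceph-deploy osd create node1:sdb:/dev/sdf1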
34. CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file
system." Proceedings of the 7th symposium on Operating systems
design and implementation. USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf
36. Components
[Diagram: a Linux host using the ceph.ko kernel client sends metadata and data requests to the Ceph server daemons, including the monitors (M)]
37. From application to disk
[Diagram: Application → ceph-fuse / libcephfs / kernel client → client network protocol → ceph-mds and RADOS → disk]
38. Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
40. Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read heavy workloads by replicating non-authoritative
copies (cached with capabilities just like
clients do)
● In practice work at directory fragment level in order to
handle large dirs
41. Data placement
● Stripe file contents across RADOS objects
● Get the full RADOS cluster bandwidth from clients
● Fairly tolerant of object losses: reads of lost objects return zeros
● Control striping with layout virtual xattrs (see the example below)
● Layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files as
'stray'; the RADOS delete ops are sent by the MDS
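For example (mount point, paths and pool name are placeholders), striping and data-pool selection are driven by the ceph.* virtual xattrs:

# inspect the layout of an existing file
getfattr -n ceph.file.layout /mnt/ceph/somefile
# new files under this directory: 4 MB stripe unit, 4 objects wide, in pool "fs_data_ssd"
setfattr -n ceph.dir.layout.stripe_unit  -v 4194304     /mnt/ceph/fastdir
setfattr -n ceph.dir.layout.stripe_count -v 4           /mnt/ceph/fastdir
setfattr -n ceph.dir.layout.pool         -v fs_data_ssd /mnt/ceph/fastdir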
42. Clients
● Two implementations:
● ceph-fuse/libcephfs
● kclient
● Interplay with the VFS page cache; efficiency is harder with
FUSE (extraneous stat calls etc.)
● Client performance matters for single-client workloads
● A slow client can hold up others if it is hogging metadata
locks: include clients in troubleshooting
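For reference (monitor address, secret file and mount point are placeholders), the two clients are mounted like this:

# FUSE client (ceph-fuse on top of libcephfs)
ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs
# kernel client (kclient)
mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret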
43. Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file"
in the metadata pool.
– I/O latency on metadata ops is sum of network
latency and journal commit latency.
– Metadata remains pinned in in-memory cache
until expired from journal.
44. Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in
cache; in others it's more of a stream.
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also
warms up cache. Use standby replay to keep that
cache warm.
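A rough sketch of the knobs involved (daemon ids and the value are placeholders; option names as of this era of Ceph, so check your version):

# adjust the MDS cache size (counted in inodes) at runtime via the admin socket
ceph daemon mds.a config set mds_cache_size 200000
# in ceph.conf, run a second MDS in standby-replay so it tails the journal
# and keeps its cache warm, ready to take over:
#   [mds.b]
#   mds standby replay   = true
#   mds standby for rank = 0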
45. Lookup by inode
● Sometimes we need inode → path mapping:
● Hard links
● NFS handles
● Costly to store this: mitigate by piggybacking paths
(backtraces) onto data objects
● Con: storing metadata to data pool
● Con: extra IOs to set backtraces
● Pro: disaster recovery from data pool
● Future: improve backtrace writing latency
46. CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 64
ceph osd pool create fs_metadata 64
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
(64 here is just an example pg_num; pool creation requires one)
47. Managing CephFS clients
● New in giant: see hostnames of connected clients
● Client eviction is sometimes important:
● Skip the wait during reconnect phase on MDS restart
● Allow others to access files locked by crashed client
● Use OpTracker to inspect ongoing operations
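For example (MDS id and session id are placeholders), these all live on the MDS admin socket:

ceph daemon mds.a session ls            # list connected clients and their metadata
ceph daemon mds.a session evict 4123    # forcibly evict a client session by id
ceph daemon mds.a dump_ops_in_flight    # OpTracker: show in-flight metadata operations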
48. CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and recent kernel
● Use a conservative configuration:
● Single active MDS, plus one standby
● Dedicated MDS server
● Kernel client
● No snapshots, no inline data
49. Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS
configuration
50. Giant->Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
51. FSCK and repair
● Recover from damage:
● Loss of data objects (which files are damaged?)
● Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
● Are recursive stats consistent?
● Does metadata on disk match cache?
● Does file size metadata match data on disk?
● Repair:
● Automatic where possible
● Manual tools to enable support
52. Client management
● Current eviction is not 100% safe against rogue clients
● Update to client protocol to wait for OSD blacklist
● Client metadata
● Initially domain name, mount point
● Extension to other identifiers?
53. Online diagnostics
● Bugs exposed relate to failures of one client to release
resources for another client: “my filesystem is frozen”.
Introduce new health messages:
● “client xyz is failing to respond to cache pressure”
● “client xyz is ignoring capability release messages”
● Add client metadata to allow us to give domain names
instead of IP addrs in messages.
● Opaque behavior in the face of dead clients. Introduce
`session ls`
● Which clients does MDS think are stale?
● Identify clients to evict with `session evict`
54. Journal resilience
● Bad journal prevents MDS recovery: “my MDS crashes
on startup”:
● Data loss
● Software bugs
● Updated on-disk format to make recovery from
damage easier
● New tool: cephfs-journal-tool
● Inspect the journal, search/filter
● Chop out unwanted entries/regions
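A hedged sketch of cephfs-journal-tool usage (subcommands as introduced around Giant; consult the tool's help on your version):

cephfs-journal-tool journal inspect                  # check the journal for damage
cephfs-journal-tool journal export backup.bin        # back it up before any surgery
cephfs-journal-tool event recover_dentries summary   # salvage metadata from readable entries
cephfs-journal-tool journal reset                    # last resort: discard the damaged journal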
55. Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
● Require some free memory to make progress
● Require client cooperation to unpin cache objects
● Anticipate tuning required for cache behaviour: what
should we evict?
● Full OSD cluster
● Require explicit handling to abort with -ENOSPC
● MDS → RADOS flow control:
● Contention between I/O to flush cache and I/O to journal
56. Test, QA, bug fixes
● The answer to “Is CephFS production ready?”
● teuthology test framework:
● Long running/thrashing test
● Third party FS correctness tests
● Python functional tests
● We dogfood CephFS internally
● Various kclient fixes discovered
● Motivation for new health monitoring metrics
● Third party testing is extremely valuable
57. What's next?
● You tell us!
● Recent survey highlighted:
● FSCK hardening
● Multi-MDS hardening
● Quota support
● Which use cases will community test with?
● General purpose
● Backup
● Hadoop
58. Reporting bugs
● Does the most recent development release or kernel
fix your issue?
● What is your configuration? MDS config, Ceph
version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
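For instance (the daemon id is a placeholder), MDS debug logging can be raised at runtime before reproducing the problem:

# raise MDS and messenger debug levels via injectargs
ceph tell mds.a injectargs '--debug-mds 20 --debug-ms 1'
# or persistently in ceph.conf under [mds]: debug mds = 20, debug ms = 1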
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
59. Future
● Ceph Developer Summit:
● When: 8 October
● Where: online
● Post-Hammer work:
● Recent survey highlighted multi-MDS, quota support
● Testing with clustered Samba/NFS?