CephFS Update 
John Spray 
john.spray@redhat.com 
Ceph Day London
Agenda 
● Introduction to distributed filesystems 
● Architectural overview 
● Recent development 
● Test & QA 
2 Ceph Day London - CephFS Update
Distributed filesystems 
...and why they are hard. 
3 Ceph Day London – CephFS Update
Interfaces to storage 
● Object 
● Ceph RGW, S3, Swift 
● Block (aka SAN) 
● Ceph RBD, iSCSI, FC, SAS 
● File (aka scale-out NAS) 
● Ceph, GlusterFS, Lustre, proprietary filers 
4 Ceph Day London - CephFS Update
Interfaces to storage 
S3 & Swift 
Multi-tenant 
Snapshots 
Clones 
5 Ceph Day London - CephFS Update 
FILE 
SYSTEM 
CephFS 
BLOCK 
STORAGE 
RBD 
OBJECT 
STORAGE 
RGW 
Keystone 
Geo-Replication 
Native API 
OpenStack 
Linux Kernel 
iSCSI 
POSIX 
Linux Kernel 
CIFS/NFS 
HDFS 
Distributed Metadata
Object stores scale out well 
● Last-writer-wins consistency 
● Consistency rules only apply to one object at a time 
● Clients are stateless (unless explicitly doing lock ops) 
● No relationships exist between objects 
● Objects have exactly one name 
● Scale-out accomplished by mapping objects to nodes 
● Single objects may be lost without affecting others 
6 Ceph Day London - CephFS Update
POSIX filesystems are hard to scale out 
● Extents written from multiple clients must win or lose on an all-or-nothing basis → locking 
● Inodes depend on one another (directory hierarchy) 
● Clients are stateful: holding files open 
● Users have local-filesystem latency expectations: 
applications assume FS client will do lots of metadata 
caching for them. 
● Scale-out requires spanning inode/dentry relationships 
across servers 
● Loss of data can damage whole subtrees 
7 Ceph Day London - CephFS Update
Failure cases increase complexity further 
● What should we do when... ? 
● Filesystem is full 
● Client goes dark 
● An MDS goes dark 
● Memory is running low 
● Clients are competing for the same files 
● Clients misbehave 
● Hard problems in distributed systems generally, 
especially hard when we have to uphold POSIX 
semantics designed for local systems. 
8 Ceph Day London - CephFS Update
Terminology 
● inode: a file. Has unique ID, may be referenced by 
one or more dentries. 
● dentry: a link between an inode and a directory 
● directory: special type of inode that has 0 or more child 
dentries 
● hard link: many dentries referring to the same inode 
● Terms originate from the original (local disk) filesystems, 
where these structures were how a filesystem was 
represented on disk. 
9 Ceph Day London - CephFS Update
Architectural overview 
10 Ceph Day London – CephFS Update
CephFS architecture 
● Dynamically balanced scale-out metadata 
● Inherit flexibility/scalability of RADOS for data 
● POSIX compatibility 
● Beyond POSIX: Subtree snapshots, recursive statistics 
Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file 
system." Proceedings of the 7th symposium on Operating systems 
design and implementation. USENIX Association, 2006. 
http://ceph.com/papers/weil-ceph-osdi06.pdf 
11 Ceph Day London - CephFS Update
Components 
● Client: kernel, fuse, libcephfs 
● Server: MDS daemon 
● Storage: RADOS cluster (mons & OSDs) 
12 Ceph Day London – CephFS Update
Components 
Linux host 
ceph.ko 
metadata data 
M M M 
Ceph server daemons 
13 Ceph Day London – CephFS Update
From application to disk 
Application 
ceph-fuse libcephfs Kernel client 
ceph-mds 
Client network protocol 
RADOS 
Disk 
14 Ceph Day London - CephFS Update
Scaling out FS metadata 
● Options for distributing metadata? 
– by static subvolume 
– by path hash 
– by dynamic subtree 
● Consider performance, ease of implementation 
15 Ceph Day London – CephFS Update
DYNAMIC SUBTREE PARTITIONING 
16 Ceph Day London – CephFS Update
Dynamic subtree placement 
● Locality: get the dentries in a dir from one MDS 
● Support read-heavy workloads by replicating non-authoritative 
copies (cached with capabilities, just like clients do) 
● In practice work at directory fragment level in order to 
handle large dirs 
17 Ceph Day London - CephFS Update
Data placement 
● Stripe file contents across RADOS objects 
● get full rados cluster bandwidth from clients 
● delegate all placement/balancing to RADOS 
● Control striping with layout vxattrs 
● layouts also select between multiple data pools 
● Deletion is a special case: client deletions mark files 
'stray', RADOS delete ops sent by MDS 
18 Ceph Day London - CephFS Update
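Layouts are exposed as virtual extended attributes; a minimal sketch, assuming a kernel or FUSE mount at /mnt/ceph (file, directory and pool names are examples): 

getfattr -n ceph.file.layout /mnt/ceph/somefile                     # show stripe_unit, stripe_count, object_size and pool 
setfattr -n ceph.dir.layout.pool -v fs_data_ssd /mnt/ceph/fastdir   # new files under this dir go to a different data pool 
# (the target pool must already have been added to the filesystem as a data pool) 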
Clients 
● Two implementations: 
● ceph-fuse/libcephfs 
● kclient 
● Interplay with VFS page cache, efficiency harder with 
fuse (extraneous stats etc) 
● Client perf. matters, for single-client workloads 
● Slow client can hold up others if it's hogging metadata 
locks: include clients in troubleshooting 
19 Ceph Day London - CephFS Update
Journaling and caching in MDS 
● Metadata ops initially journaled to striped journal "file" 
in the metadata pool. 
● I/O latency on metadata ops is sum of network latency 
and journal commit latency. 
● Metadata remains pinned in in-memory cache until 
expired from journal. 
20 Ceph Day London - CephFS Update
Journaling and caching in MDS 
● In some workloads we expect almost all metadata to stay 
in cache; in others it's more of a stream. 
● Control cache size with mds_cache_size 
● Cache eviction relies on client cooperation 
● MDS journal replay not only recovers data but also 
warms up cache. Use standby replay to keep that 
cache warm. 
21 Ceph Day London - CephFS Update
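A minimal ceph.conf sketch for the two knobs mentioned above (the values are illustrative, not recommendations): 

[mds] 
    mds cache size = 200000       # inodes to keep pinned in the in-memory cache 
    mds standby replay = true     # standby MDS tails the active journal so its cache is warm at failover 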
Lookup by inode 
● Sometimes we need inode → path mapping: 
● Hard links 
● NFS handles 
● Costly to store this: mitigate by piggybacking paths 
(backtraces) onto data objects 
● Con: storing metadata to data pool 
● Con: extra IOs to set backtraces 
● Pro: disaster recovery from data pool 
● Future: improve backtrace writing latency? 
22 Ceph Day London - CephFS Update
Extra features 
● Snapshots: 
● Exploit RADOS snapshotting for file data 
● … plus some clever code in the MDS 
● Fast petabyte snapshots 
● Recursive statistics 
● Lazily updated 
● Access via vxattr 
● Avoid spurious client I/O for df 
23 Ceph Day London - CephFS Update
Extra features 
● Snapshots: 
● Exploit RADOS snapshotting for file data 
● … plus some clever code in the MDS 
● Fast petabyte snapshots 
● Recursive statistics 
● Lazily updated 
● Access via vxattr 
● Avoid spurious client I/O for df 
24 Ceph Day London - CephFS Update
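The recursive statistics are read through virtual extended attributes on directories; a quick sketch (paths are examples): 

getfattr -n ceph.dir.rbytes /mnt/ceph/mydir     # total bytes under mydir, lazily updated 
getfattr -n ceph.dir.rfiles /mnt/ceph/mydir     # total number of files under mydir 
getfattr -n ceph.dir.rctime /mnt/ceph/mydir     # most recent change time anywhere in the subtree 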
CephFS in practice 
ceph-deploy mds create myserver 
ceph osd pool create fs_data 
ceph osd pool create fs_metadata 
ceph fs new myfs fs_metadata fs_data 
mount -t ceph x.x.x.x:6789:/ /mnt/ceph 
25 Ceph Day London - CephFS Update
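A slightly fuller sketch of the same steps, with the arguments the tools expect and client authentication included (monitor address, pg counts and key path are examples): 

ceph-deploy mds create myserver 
ceph osd pool create fs_data 64            # pg_num is a required argument 
ceph osd pool create fs_metadata 64 
ceph fs new myfs fs_metadata fs_data 
# kernel client (note the trailing ":/" for the root of the filesystem): 
mount -t ceph x.x.x.x:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret 
# or the FUSE client: 
ceph-fuse -m x.x.x.x:6789 /mnt/ceph 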
Managing CephFS clients 
● New in giant: see hostnames of connected clients 
● Client eviction is sometimes important: 
● Skip the wait during reconnect phase on MDS restart 
● Allow others to access files locked by crashed client 
● Use OpTracker to inspect ongoing operations 
26 Ceph Day London - CephFS Update
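Both of these live on the MDS admin socket; a sketch, assuming a single MDS daemon named "a" and an example client id: 

ceph daemon mds.a session ls           # connected clients, including the new hostname/mount metadata 
ceph daemon mds.a ops                  # OpTracker: metadata operations currently in flight 
ceph daemon mds.a session evict 4123   # forcibly drop a client session that is stuck or misbehaving 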
CephFS tips 
● Choose MDS servers with lots of RAM 
● Investigate clients when diagnosing stuck/slow access 
● Use recent Ceph and recent kernel 
● Use a conservative configuration: 
● Single active MDS, plus one standby 
● Dedicated MDS server 
● Kernel client 
● No snapshots, no inline data 
27 Ceph Day London - CephFS Update
Development update 
28 Ceph Day London – CephFS Update
APP APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP 
RADOSGW 
A bucket-based REST gateway, compatible with S3 and Swift 
RBD 
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver 
CEPH FS 
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE 
RADOS 
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes 
LIBRADOS, RADOSGW, RBD, RADOS: AWESOME 
CEPH FS: NEARLY AWESOME 
29 Ceph Day London – CephFS Update
Towards a production-ready CephFS 
● Focus on resilience: 
1. Don't corrupt things 
2. Stay up 
3. Handle the corner cases 
4. When something is wrong, tell me 
5. Provide the tools to diagnose and fix problems 
● Achieve this first within a conservative single-MDS 
configuration 
30 Ceph Day London - CephFS Update
Giant → Hammer timeframe 
● Initial online fsck (a.k.a. forward scrub) 
● Online diagnostics (`session ls`, MDS health alerts) 
● Journal resilience & tools (cephfs-journal-tool) 
● flock in the FUSE client 
● Initial soft quota support 
● General resilience: full OSDs, full metadata cache 
31 Ceph Day London - CephFS Update
FSCK and repair 
● Recover from damage: 
● Loss of data objects (which files are damaged?) 
● Loss of metadata objects (what subtree is damaged?) 
● Continuous verification: 
● Are recursive stats consistent? 
● Does metadata on disk match cache? 
● Does file size metadata match data on disk? 
● Repair: 
● Automatic where possible 
● Manual tools to enable support 
32 Ceph Day London - CephFS Update
Client management 
● Current eviction is not 100% safe against rogue clients 
● Update to client protocol to wait for OSD blacklist 
● Client metadata 
● Initially domain name, mount point 
● Extension to other identifiers? 
33 Ceph Day London - CephFS Update
Online diagnostics 
● Bugs exposed relate to failures of one client to release 
resources for another client: “my filesystem is frozen”. 
Introduce new health messages: 
● “client xyz is failing to respond to cache pressure” 
● “client xyz is ignoring capability release messages” 
● Add client metadata to allow us to give domain names 
instead of IP addrs in messages. 
● Opaque behavior in the face of dead clients. Introduce 
`session ls` 
● Which clients does MDS think are stale? 
● Identify clients to evict with `session evict` 
34 Ceph Day London - CephFS Update
Journal resilience 
● Bad journal prevents MDS recovery: “my MDS crashes 
on startup”: 
● Data loss 
● Software bugs 
● Updated on-disk format to make recovery from 
damage easier 
● New tool: cephfs-journal-tool 
● Inspect the journal, search/filter 
● Chop out unwanted entries/regions 
35 Ceph Day London - CephFS Update
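The tool runs offline against the metadata pool; a sketch of typical invocations (take an export before any surgery): 

cephfs-journal-tool journal inspect             # look for damage/corruption in the journal 
cephfs-journal-tool journal export backup.bin   # keep a copy before modifying anything 
cephfs-journal-tool event get list              # list journal events; filters allow narrowing by path/inode/type 
# damaged or unwanted regions can then be removed with the 'event splice' subcommand 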
Handling resource limits 
● Write a test, see what breaks! 
● Full MDS cache: 
● Require some free memory to make progress 
● Require client cooperation to unpin cache objects 
● Anticipate tuning required for cache behaviour: what 
should we evict? 
● Full OSD cluster 
● Require explicit handling to abort with -ENOSPC 
● MDS → RADOS flow control: 
● Contention between I/O to flush cache and I/O to journal 
36 Ceph Day London - CephFS Update
Test, QA, bug fixes 
● The answer to “Is CephFS production ready?” 
● teuthology test framework: 
● Long running/thrashing test 
● Third party FS correctness tests 
● Python functional tests 
● We dogfood CephFS internally 
● Various kclient fixes discovered 
● Motivation for new health monitoring metrics 
● Third party testing is extremely valuable 
37 Ceph Day London - CephFS Update
What's next? 
● You tell us! 
● Recent survey highlighted: 
● FSCK hardening 
● Multi-MDS hardening 
● Quota support 
● Which use cases will matter to community? 
● Backup 
● Hadoop 
● NFS/Samba gateway 
● Other? 
38 Ceph Day London - CephFS Update
Reporting bugs 
● Does the most recent development release or kernel 
fix your issue? 
● What is your configuration? MDS config, Ceph 
version, client version, kclient or fuse 
● What is your workload? 
● Can you reproduce with debug logging enabled? 
http://ceph.com/resources/mailing-list-irc/ 
http://tracker.ceph.com/projects/ceph/issues 
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/ 
39 Ceph Day London - CephFS Update
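Debug logging can be raised on a running daemon; a sketch, assuming an MDS named "a" (levels are examples, see the log-and-debug link above): 

ceph daemon mds.a config set debug_mds 20   # verbose MDS logging 
ceph daemon mds.a config set debug_ms 1     # log messenger (network) traffic 
# remember to turn the levels back down afterwards, logs grow quickly 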
Future 
● Ceph Developer Summit: 
● When: 8 October 
● Where: online 
● Post-Hammer work: 
● Recent survey highlighted multi-MDS, quota support 
● Testing with clustered Samba/NFS? 
40 Ceph Day London - CephFS Update
Questions? 
41 Ceph Day London – CephFS Update
42 Ceph Day London – CephFS Update
Body slide design guidelines 
● > 15 words per bullet 
● If your slide is text-only, reserve at least 1/3 of the slide 
for white space. 
● If you use a graphic, make sure text is readable. 
43 Ceph Day London - CephFS Update
Body slide design guidelines 
● > 15 words per bullet 
● If your slide is text-only, reserve at least 1/3 of the slide 
for white space. 
● If you use a graphic, make sure text is readable. 
44 Ceph Day London - CephFS Update
Introduce Red Hat 
● Create an agenda slide for every presentation. 
● Outline what you’re going to tell the audience. 
● Prepare them for a call to action after the presentation. 
● If this is a confidential presentation, use the 
confidential presentation template located on the Corporate > 
Templates > Presentation templates page of the PNT Portal. 
45 Ceph Day London - CephFS Update
Introduce Red Hat solutions and services 
● Provide product details that specifically solve the 
customer pain point you’re addressing. 
● These slides explain how Red Hat solutions work, what 
makes them unique and valuable. 
46 Ceph Day London - CephFS Update
Learn more 
● End with a call to action. 
● Let the audience know what can be done next, how 
you or Red Hat can help them. 
47 Ceph Day London - CephFS Update
Divider slide 
48 Ceph Day London – CephFS Update
Divider slide 
49 Ceph Day London – CephFS Update
Divider slide 
50 Ceph Day London – CephFS Update
Divider slide 
51 Ceph Day London – CephFS Update
Divider slide 
52 Ceph Day London – CephFS Update
Divider slide 
53 Ceph Day London – CephFS Update
Divider slide 
54 Ceph Day London – CephFS Update
Divider slide 
55 Ceph Day London – CephFS Update
Divider Slide 
56 Ceph Day London – CephFS Update
A STORAGE REVOLUTION 
SUPPORT & 
MAINTENANCE 
PROPRIETARY 
SOFTWARE 
PROPRIETARY 
HARDWARE 
COMPUTER DISK 
COMPUTER DISK 
COMPUTER DISK 
ENTERPRISE 
PRODUCTS & 
SERVICES 
OPEN SOURCE 
SOFTWARE 
STANDARD 
HARDWARE 
COMPUTER DISK 
COMPUTER DISK 
COMPUTER DISK
ARCHITECTURAL COMPONENTS 
APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) 
RGW 
A web services gateway for object storage, compatible with S3 and Swift 
RBD 
A reliable, fully-distributed block device with cloud platform integration 
CEPHFS 
A distributed file system with POSIX semantics and scale-out metadata management 
RADOS 
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 
Copyright © 2014 by Inktank | Private and Confidential 
58
ARCHITECTURAL COMPONENTS 
APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) 
RGW 
A web services gateway for object storage, compatible with S3 and Swift 
RBD 
A reliable, fully-distributed block device with cloud platform integration 
CEPHFS 
A distributed file system with POSIX semantics and scale-out metadata management 
RADOS 
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 
Copyright © 2014 by Inktank | Private and Confidential 
59
OBJECT STORAGE DAEMONS 
60 
OSD 
FS 
DISK 
OSD 
FS 
DISK 
OSD 
FS 
DISK 
OSD 
FS 
DISK 
btrfs 
xfs 
ext4 
M 
M 
M
RADOS CLUSTER 
61 
APPLICATION 
M M 
M M 
M 
RADOS CLUSTER
RADOS COMPONENTS 
62 
OSDs: 
 10s to 10000s in a cluster 
 One per disk (or one per SSD, RAID group…) 
 Serve stored objects to clients 
 Intelligently peer for replication & recovery 
Monitors: 
 Maintain cluster membership and state 
 Provide consensus for distributed decision-making 
 Small, odd number 
 These do not serve stored objects to clients 
M
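A few read-only commands that show this division of labour on a running cluster: 

ceph -s            # overall health, monitor quorum and OSD up/in counts 
ceph mon stat      # the monitors and their quorum 
ceph osd stat      # how many OSDs exist, and how many are up/in 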
WHERE DO OBJECTS LIVE? 
63 
?? 
APPLICATION 
M 
M 
M 
OBJECT
A METADATA SERVER? 
64 
1 
APPLICATION 
M 
M 
M 
2
CALCULATED PLACEMENT 
65 
APPLICATION F 
M 
M 
M 
A-G 
H-N 
O-T 
U-Z
EVEN BETTER: CRUSH! 
66 
OBJECT 
RADOS CLUSTER 
(diagram: the object is hashed by CRUSH directly onto OSDs in the cluster)
CRUSH IS A QUICK CALCULATION 
67 
OBJECT 
RADOS CLUSTER 
(diagram: the client computes the object's location itself; no lookup table is consulted)
CRUSH: DYNAMIC DATA 
PLACEMENT 
68 
CRUSH: 
 Pseudo-random placement algorithm 
 Fast calculation, no lookup 
 Repeatable, deterministic 
 Statistically uniform distribution 
 Stable mapping 
 Limited data migration on change 
 Rule-based configuration 
 Infrastructure topology aware 
 Adjustable replication 
 Weighting
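Because placement is a pure calculation, the mapping for any object can be queried directly; a sketch (pool and object names are examples): 

ceph osd tree                     # the CRUSH hierarchy the calculation runs against 
ceph osd map fs_data someobject   # the placement group and OSD set CRUSH chooses for that object 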
ARCHITECTURAL COMPONENTS 
APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) 
RGW 
A web services gateway for object storage, compatible with S3 and Swift 
RBD 
A reliable, fully-distributed block device with cloud platform integration 
CEPHFS 
A distributed file system with POSIX semantics and scale-out metadata management 
RADOS 
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 
Copyright © 2014 by Inktank | Private and Confidential 
69
ACCESSING A RADOS CLUSTER 
70 
APPLICATION 
LIBRADOS 
OBJECT 
socket 
M M 
M 
RADOS CLUSTER
LIBRADOS: RADOS ACCESS FOR 
APPS 
L 
71 
LIBRADOS: 
 Direct access to RADOS for applications 
 C, C++, Python, PHP, Java, Erlang 
 Direct access to storage nodes 
 No HTTP overhead
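The rados CLI is a thin wrapper over librados and is a convenient way to exercise the same interface; a sketch (pool and object names are examples): 

rados lspools 
rados -p mypool put hello_object ./hello.txt    # store a local file as an object 
rados -p mypool get hello_object /tmp/out.txt   # read it back 
rados -p mypool ls                              # list objects in the pool 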
ARCHITECTURAL COMPONENTS 
APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) 
RGW 
A web services gateway for object storage, compatible with S3 and Swift 
RBD 
A reliable, fully-distributed block device with cloud platform integration 
CEPHFS 
A distributed file system with POSIX semantics and scale-out metadata management 
RADOS 
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 
Copyright © 2014 by Inktank | Private and Confidential 
72
THE RADOS GATEWAY 
73 
APPLICATION APPLICATION 
REST 
RADOSGW 
LIBRADOS 
M M 
M 
RADOS CLUSTER 
RADOSGW 
LIBRADOS 
socket
RADOSGW MAKES RADOS WEBBY 
74 
RADOSGW: 
 REST-based object storage proxy 
 Uses RADOS to store objects 
 API supports buckets, accounts 
 Usage accounting for billing 
 Compatible with S3 and Swift applications
ARCHITECTURAL COMPONENTS 
APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) 
RGW 
A web services gateway for object storage, compatible with S3 and Swift 
RBD 
A reliable, fully-distributed block device with cloud platform integration 
CEPHFS 
A distributed file system with POSIX semantics and scale-out metadata management 
RADOS 
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 
Copyright © 2014 by Inktank | Private and Confidential 
75
STORING VIRTUAL DISKS 
76 
VM 
HYPERVISOR 
LIBRBD 
M M 
RADOS CLUSTER
SEPARATE COMPUTE FROM 
STORAGE 
77 
M M 
RADOS CLUSTER 
HYPERVISOR 
LIBRBD 
VM HYPERVISOR 
LIBRBD
KERNEL MODULE FOR MAX 
FLEXIBLE! 
78 
LINUX HOST 
KRBD 
M M 
RADOS CLUSTER
RBD STORES VIRTUAL DISKS 
79 
RADOS BLOCK DEVICE: 
 Storage of disk images in RADOS 
 Decouples VMs from host 
 Images are striped across the cluster (pool) 
 Snapshots 
 Copy-on-write clones 
 Support in: 
 Mainline Linux Kernel (2.6.39+) 
 Qemu/KVM, native Xen coming soon 
 OpenStack, CloudStack, Nebula, Proxmox
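A sketch of the corresponding rbd commands (pool, image and snapshot names are examples; cloning needs format 2 images): 

rbd create mypool/myimage --size 10240 --image-format 2   # 10 GB image striped across the pool 
rbd map mypool/myimage                                    # expose it through the kernel module as /dev/rbdX 
rbd snap create mypool/myimage@snap1                      # point-in-time snapshot 
rbd snap protect mypool/myimage@snap1 
rbd clone mypool/myimage@snap1 mypool/myclone             # copy-on-write clone 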
ARCHITECTURAL COMPONENTS 
APP HOST/VM CLIENT 
LIBRADOS 
A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) 
RGW 
A web services gateway for object storage, compatible with S3 and Swift 
RBD 
A reliable, fully-distributed block device with cloud platform integration 
CEPHFS 
A distributed file system with POSIX semantics and scale-out metadata management 
RADOS 
A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors 
Copyright © 2014 by Inktank | Private and Confidential 
80
SEPARATE METADATA SERVER 
81 
LINUX HOST 
KERNEL MODULE 
metadata data 
M M M 
RADOS CLUSTER
SCALABLE METADATA SERVERS 
82 
METADATA SERVER 
 Manages metadata for a POSIX-compliant 
shared filesystem 
 Directory hierarchy 
 File metadata (owner, timestamps, mode, 
etc.) 
 Stores metadata in RADOS 
 Does not serve file data to clients 
 Only required for shared filesystem
CEPH AND OPENSTACK 
83 
OPENSTACK 
KEYSTONE SWIFT CINDER GLANCE NOVA 
RADOSGW 
LIBRADOS 
LIBRBD LIBRBD 
HYPERVISOR 
LIBRBD
GETTING STARTED WITH CEPH 
 Read about the latest version of Ceph. 
 The latest stuff is always at http://ceph.com/get 
 Deploy a test cluster using ceph-deploy. 
 Read the quick-start guide at http://ceph.com/qsg 
 Read the rest of the docs! 
 Find docs for the latest release at http://ceph.com/docs 
 Ask for help when you get stuck! 
 Community volunteers are waiting for you at 
http://ceph.com/help 
Copyright © 2014 by Inktank | Private and Confidential 
84
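A sketch of the quick-start flow with ceph-deploy (hostnames and the disk device are examples; the quick-start guide above is authoritative): 

ceph-deploy new node1                                  # write an initial ceph.conf and monitor keyring 
ceph-deploy install node1 node2 node3                  # install the Ceph packages 
ceph-deploy mon create-initial                         # bring up the initial monitor(s) and gather keys 
ceph-deploy osd create node2:/dev/sdb node3:/dev/sdb   # prepare and activate OSDs on whole disks 
ceph -s                                                # confirm the cluster reaches a healthy state 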
85 Ceph Day London – CephFS Update
86 Ceph Day London – CephFS Update
