Future of CephFS
London Ceph Day
Sage Weil, Creator of Ceph, Founder & CTO, Inktank

  • {"5":"If you aren’t running Ceph FS, you don’t need to deploy metadata servers.\n","11":"If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.\n","12":"When the second one comes along, it will intelligently partition the work by taking a subtree.\n","1":"<number>\n","13":"When the third MDS arrives, it will attempt to split the tree again.\n","2":"Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).\n","14":"Same with the fourth.\n","3":"Remember all that meta-data we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.\n","9":"So how do you have one tree and multiple servers?\n","26":"Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.\n","15":"A MDS can actually even just take a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.\n","4":"There are multiple MDSs!\n"}

Slide-by-slide transcript

Slide 1. Future of CephFS (Sage Weil)

Slide 2. [Architecture diagram]
  ● LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
  ● RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
  ● RBD: a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
  ● CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
  ● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
  (Consumers: APP on LIBRADOS and RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS)

Slide 3. [Diagram: the client exchanges metadata with the MDS cluster (M) and reads/writes data objects (01/10) directly on the OSDs]

Slide 4. [Diagram: the metadata server (MDS) cluster]

Slide 5. Metadata Server
  ● Manages metadata for a POSIX-compliant shared filesystem
    – Directory hierarchy
    – File metadata (owner, timestamps, mode, etc.)
  ● Stores metadata in RADOS
  ● Does not serve file data to clients
  ● Only required for shared filesystem (see the sketch below)

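Not on the slide: a quick way to see whether a cluster is running metadata servers at all, since only CephFS deployments need them. A minimal sketch assuming an admin keyring is available; output formats differ between releases.

  $ ceph -s          # cluster summary; an mds line appears only if MDSs are deployed
  $ ceph mds stat    # compact MDS map: which ranks are active, which daemons are standby
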
Slide 6. legacy metadata storage
  ● a scaling disaster
  ● name → inode → block list → data (illustrated below)
  ● no inode table locality
  ● fragmentation
    – inode table
    – directory
  ● many seeks
  ● difficult to partition
  [Diagram: a conventional directory tree (etc, home, usr, var, vmlinuz, …; hosts, mtab, passwd, …; bin, include, lib, …)]

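Not on the slide: the legacy chain is easy to see on any local ext-style filesystem. An illustrative sketch only; the device, mount point, and inode number are hypothetical.

  $ ls -i /mnt/ext4/etc/hosts               # name → inode number (say, 131075)
  $ debugfs -R 'stat <131075>' /dev/sdb1    # inode → owner, times, block/extent list
  # name, inode table, and block list are three separate on-disk structures,
  # which is why this layout seeks a lot and is hard to partition
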
Slide 7. ceph fs metadata storage
  ● block lists unnecessary
  ● inode table mostly useless
    – APIs are path-based, not inode-based
    – no random table access, sloppy caching
  ● embed inodes inside directories
    – good locality, prefetching
    – leverage key/value object (see the sketch below)
  [Diagram: directory objects (1, 100, 102) each holding its entries and inodes: etc, home, usr, var, vmlinuz, …; hosts, mtab, passwd, …; bin, include, lib, …]

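Not on the slide: a minimal sketch of what "embed inodes inside directories" and "leverage key/value object" look like on disk. It assumes an old-style metadata pool named "metadata" and that the root directory's dirfrag object is "1.00000000"; pool and object names vary by cluster and release.

  $ rados -p metadata ls | head                       # dirfrag, journal, and table objects
  $ rados -p metadata listomapkeys 1.00000000         # one key per dentry, e.g. "etc_head"
  $ rados -p metadata listomapvals 1.00000000 | head  # values carry the embedded inodes
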
Slide 8. controlling metadata io
  ● view ceph-mds as cache
    – reduce reads
      ● dir+inode prefetching
    – reduce writes
      ● consolidate multiple writes
  ● large journal or log
    – stripe over objects (see the sketch below)
  ● two tiers
    – journal for short term
    – per-directory for long term
  ● fast failure recovery
  [Diagram: the journal and the per-directory objects]

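Not on the slide: the journal itself is just a log striped across objects in the metadata pool. A sketch assuming the conventional naming where rank 0's journal is inode 0x200 and the pool is called "metadata"; both are assumptions about a particular cluster.

  $ rados -p metadata ls | grep '^200\.' | sort | head   # journal segments 200.00000000, 200.00000001, …
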
Slide 9. [Diagram: one tree, three metadata servers ... ??]

Slide 10. load distribution
  ● coarse (static subtree)
    – preserve locality
    – high management overhead
  ● fine (hash)
    – always balanced
    – less vulnerable to hot spots
    – destroy hierarchy, locality
  ● can a dynamic approach capture benefits of both extremes?
  [Spectrum diagram: static subtree (good locality) … hash directories (good balance) … hash files]

Slide 11. DYNAMIC SUBTREE PARTITIONING

Slide 12. dynamic subtree partitioning
  ● scalable
    – arbitrarily partition metadata
  ● adaptive
    – move work from busy to idle servers
    – replicate hot metadata
  ● efficient
    – hierarchical partition preserve locality
  ● dynamic
    – daemons can join/leave (see the sketch below)
    – take over for failed nodes

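Not on the slide: partitioning only has something to do once more than one MDS rank is active. A minimal sketch; the command syntax has changed across releases (dumpling-era clusters used set_max_mds, newer ones use "ceph fs set <name> max_mds").

  $ ceph mds set_max_mds 3   # allow up to three active ranks; standbys take the new ranks
  $ ceph mds stat            # ranks 0-2 should become active, each owning a subtree
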
Slide 13. Dynamic partitioning
  [Diagram panels: load spread across many directories vs. load on the same directory]

Slide 14. Failure recovery

Slide 15. Metadata replication and availability

Slide 16. Metadata cluster scaling

Slide 17. client protocol
  ● highly stateful
    – consistent, fine-grained caching
  ● seamless hand-off between ceph-mds daemons
    – when client traverses hierarchy
    – when metadata is migrated between servers
  ● direct access to OSDs for file I/O

Slide 18. an example
  ● mount -t ceph 1.2.3.4:/ /mnt
    – 3 ceph-mon RT
    – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  ● cd /mnt/foo/bar
    – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
  ● ls -al
    – open
    – readdir
      ● 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
    – stat each file
    – close
  ● cp * /tmp
    – N ceph-osd RT
  [Diagram: client round trips (RT) to ceph-mon, ceph-mds, ceph-osd]

Slide 19. recursive accounting
  ● ceph-mds tracks recursive directory stats
    – file sizes
    – file and directory counts
    – modification time
  ● virtual xattrs present full stats (see the sketch below)
  ● efficient

  $ ls -alSh | head
  total 0
  drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
  drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
  drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
  drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
  drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
  drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
  drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
  drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
  drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph

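Not on the slide: the same recursive stats are exposed directly as virtual extended attributes on any directory. A minimal sketch, assuming a CephFS mount at /mnt and the pomceph directory from the listing; attribute names can vary slightly between releases.

  $ getfattr -n ceph.dir.rbytes   /mnt/pomceph   # recursive total size in bytes
  $ getfattr -n ceph.dir.rfiles   /mnt/pomceph   # recursive file count
  $ getfattr -n ceph.dir.rsubdirs /mnt/pomceph   # recursive subdirectory count
  $ getfattr -n ceph.dir.rctime   /mnt/pomceph   # most recent change time underneath
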
Slide 20. snapshots
  ● volume or subvolume snapshots unusable at petabyte scale
    – snapshot arbitrary subdirectories
  ● simple interface
    – hidden '.snap' directory
    – no special tools

  $ mkdir foo/.snap/one     # create snapshot
  $ ls foo/.snap
  one
  $ ls foo/bar/.snap
  _one_1099511627776        # parent's snap name is mangled
  $ rm foo/myfile
  $ ls -F foo
  bar/
  $ ls -F foo/.snap/one
  myfile bar/
  $ rmdir foo/.snap/one     # remove snapshot

Slide 21. multiple client implementations
  ● Linux kernel client
    – mount -t ceph 1.2.3.4:/ /mnt (fuller sketch below)
    – export (NFS), Samba (CIFS)
  ● ceph-fuse
  ● libcephfs.so
    – your app
    – Samba (CIFS)
    – Ganesha (NFS)
    – Hadoop (map/reduce)
  [Diagram: Ganesha, Samba, Hadoop, and your app each linking libcephfs; NFS and SMB/CIFS re-exports; the kernel ceph client and ceph-fuse over FUSE]

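Not on the slide: a minimal mount sketch for the two common clients. The monitor address from the slide (1.2.3.4), port, credentials, and mount points are placeholders.

  # kernel client (requires the ceph kernel module)
  $ sudo mount -t ceph 1.2.3.4:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret

  # userspace FUSE client (links libcephfs)
  $ sudo ceph-fuse -m 1.2.3.4:6789 /mnt/ceph-fuse
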
Slide 22. [The architecture diagram from slide 2, revisited: LIBRADOS, RADOSGW, RBD, and RADOS are labeled AWESOME; CEPH FS is labeled NEARLY AWESOME]

Slide 23. Path forward
  ● Testing
    – Various workloads
    – Multiple active MDSs
  ● Test automation
    – Simple workload generator scripts (sketch below)
    – Bug reproducers
  ● Hacking
    – Bug squashing
    – Long-tail features
  ● Integrations
    – Ganesha, Samba, *stacks

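Not on the slide: a hypothetical example of the kind of "simple workload generator script" the slide refers to, not one of the project's actual tools. It hammers metadata operations (mkdir, create, readdir, stat, rename, unlink) under a CephFS mount; the path and counts are placeholders.

  #!/bin/bash
  # crude CephFS metadata workload generator (illustrative sketch)
  ROOT=${1:-/mnt/ceph/mdstest}    # placeholder CephFS mount point
  DIRS=${2:-100}
  FILES=${3:-100}

  mkdir -p "$ROOT"
  for d in $(seq "$DIRS"); do
      dir="$ROOT/dir.$d"
      mkdir -p "$dir"
      for f in $(seq "$FILES"); do
          echo payload > "$dir/file.$f"       # create + small write
      done
      ls -l "$dir" > /dev/null                # readdir + stat every entry
      mv "$dir/file.1" "$dir/file.renamed"    # rename
      rm "$dir/file.2"                        # unlink
  done
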
Slide 24. hard links?
  ● rare
  ● useful locality properties
    – intra-directory
    – parallel inter-directory
  ● on miss, file objects provide per-file backpointers
    – degenerates to log(n) lookups
    – optimistic read complexity

Slide 25. what is journaled
  ● lots of state
    – journaling is expensive up-front, cheap to recover
    – non-journaled state is cheap, but complex (and somewhat expensive) to recover
  ● yes
    – client sessions
    – actual fs metadata modifications
  ● no
    – cache provenance
    – open files
  ● lazy flush
    – client modifications may not be durable until fsync() or visible by another client (see the dd sketch below)

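Not on the slide: a one-line illustration of the lazy-flush point above. With dd, conv=fsync issues an fsync after the copy, so the write is durable before dd exits; the destination path is a placeholder.

  $ dd if=/dev/zero of=/mnt/ceph/important.dat bs=4k count=1 conv=fsync
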
