Ceph Day Santa Clara: The Future of CephFS + Developing with Librados

Sage Weil, Creator of Ceph, Founder & CTO, Inktank

CephFS is a distributed filesystem built on RADOS, offering POSIX semantics and a true scale-out architecture. While production deployments of CephFS do exist, it still needs lots of testing and hardening before it can be used in the most challenging (and interesting) scenarios. In this session, Sage will discuss the future of CephFS, including the areas where it still needs work and ways the community can help.

RADOS is a surprisingly flexible object store. To take advantage of its rich feature set, developers can build with its programmable library, librados. Librados is available in many languages, and offers access to key/value stores, object classes, cluster health and status, and other useful RADOS internals. This session will cover how to use librados, discuss situations where librados is the right choice, and share a list of lesser-known RADOS features that developers can tap into.

Speaker notes:
  • Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
  • Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  • There are multiple MDSs!
  • If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
  • So how do you have one tree and multiple servers?
  • If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  • When the second one comes along, it will intelligently partition the work by taking a subtree.
  • When the third MDS arrives, it will attempt to split the tree again.
  • Same with the fourth.
  • An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.
  • Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.

    1. Future of CephFS (Sage Weil)
    2. (Ceph architecture diagram)
       ● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
       ● LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
       ● RBD: a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
       ● CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
       ● RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
       (APP, HOST/VM, and CLIENT boxes sit on top of the corresponding interfaces)
    3. (diagram: a client exchanges metadata with the MDS cluster and reads/writes data directly to the OSDs)
    4. (diagram: the metadata server cluster: multiple MDS daemons)
    5. Metadata Server
       • Manages metadata for a POSIX-compliant shared filesystem
       • Directory hierarchy
       • File metadata (owner, timestamps, mode, etc.)
       • Stores metadata in RADOS
       • Does not serve file data to clients
       • Only required for shared filesystem
    6. legacy metadata storage
       ● a scaling disaster
       ● name → inode → block list → data
       ● no inode table locality
       ● fragmentation
         – inode table
         – directory
       ● many seeks
       ● difficult to partition
       (diagram: a conventional directory tree: usr, etc, var, home, ... with files such as vmlinuz, passwd, mtab, hosts)
    7. ceph fs metadata storage
       ● block lists unnecessary
       ● inode table mostly useless
         – APIs are path-based, not inode-based
         – no random table access, sloppy caching
       ● embed inodes inside directories
         – good locality, prefetching
         – leverage key/value object
       (diagram: the same directory tree, with inodes embedded alongside the directory entries)
    8. controlling metadata io
       ● view ceph-mds as cache
       ● reduce reads
         – dir+inode prefetching
       ● reduce writes
         – consolidate multiple writes
       ● large journal or log
         – stripe over objects
       ● two tiers
         – journal for short term
         – per-directory for long term
       ● fast failure recovery
       (diagram: journal and per-directory storage)
    9. one tree, three metadata servers ... ??
    10. load distribution
        ● coarse (static subtree)
          – preserve locality
          – high management overhead
        ● fine (hash)
          – always balanced
          – less vulnerable to hot spots
          – destroy hierarchy, locality
        ● can a dynamic approach capture benefits of both extremes?
        (diagram: a spectrum from static subtree (good locality) through hash directories to hash files (good balance))
    11. DYNAMIC SUBTREE PARTITIONING
    12. dynamic subtree partitioning
        ● scalable
          – arbitrarily partition metadata
        ● adaptive
          – move work from busy to idle servers
          – replicate hot metadata
        ● efficient
          – hierarchical partition preserves locality
        ● dynamic
          – daemons can join/leave
          – take over for failed nodes
    13. Dynamic partitioning
        (diagram: workload spread across many directories vs. concentrated in the same directory)
    14. Failure recovery
    15. Metadata replication and availability
    16. Metadata cluster scaling
    17. client protocol
        ● highly stateful
          – consistent, fine-grained caching
        ● seamless hand-off between ceph-mds daemons
          – when client traverses hierarchy
          – when metadata is migrated between servers
        ● direct access to OSDs for file I/O
    18. an example
        ● mount -t ceph 1.2.3.4:/ /mnt
          – 3 ceph-mon RT
          – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
        ● cd /mnt/foo/bar
          – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
        ● ls -al
          – open
          – readdir: 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
          – stat each file
          – close
        ● cp * /tmp
          – N ceph-osd RT
        (RT = round trip; the diagram legend lists ceph-mon, ceph-mds, ceph-osd)
    19. recursive accounting
        ● ceph-mds tracks recursive directory stats
          – file sizes
          – file and directory counts
          – modification time
        ● virtual xattrs present full stats
        ● efficient
        $ ls -alSh | head
        total 0
        drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
        drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
        drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
        drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
        drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
        drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
        drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
        drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
        drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
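
        The recursive stats are exposed to clients as virtual extended attributes on directories. As a rough, hedged illustration (not from the slides), the Python sketch below reads a few of them through a kernel-client mount; the mount path is hypothetical and the exact vxattr names (ceph.dir.rbytes and friends) can vary between Ceph releases.

        # Hedged sketch: read CephFS recursive-accounting vxattrs with plain
        # os.getxattr (Linux, Python 3.3+). /mnt/foo is an assumed CephFS mount.
        import os

        MOUNT_DIR = "/mnt/foo"   # hypothetical kernel-client mount point

        for name in ("ceph.dir.rbytes",     # recursive byte count
                     "ceph.dir.rfiles",     # recursive file count
                     "ceph.dir.rsubdirs"):  # recursive subdirectory count
            try:
                value = os.getxattr(MOUNT_DIR, name)
                print(f"{name} = {value.decode()}")
            except OSError as exc:
                # vxattr names differ across releases; treat as best-effort
                print(f"{name}: not available ({exc})")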
    20. snapshots
        ● volume or subvolume snapshots unusable at petabyte scale
        ● snapshot arbitrary subdirectories
        ● simple interface
          – hidden '.snap' directory
          – no special tools
        $ mkdir foo/.snap/one          # create snapshot
        $ ls foo/.snap
        one
        $ ls foo/bar/.snap
        _one_1099511627776             # parent's snap name is mangled
        $ rm foo/myfile
        $ ls -F foo
        bar/
        $ ls -F foo/.snap/one
        myfile bar/
        $ rmdir foo/.snap/one          # remove snapshot
    21. multiple client implementations
        ● Linux kernel client
          – mount -t ceph 1.2.3.4:/ /mnt
          – export (NFS), Samba (CIFS)
        ● ceph-fuse
        ● libcephfs.so
          – your app
          – Samba (CIFS)
          – Ganesha (NFS)
          – Hadoop (map/reduce)
        (diagram: the kernel client, plus ceph-fuse, your app, Samba, Ganesha, and Hadoop built on libcephfs, exporting NFS and SMB/CIFS)
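
        The libcephfs route above can also be exercised directly from Python via the python-cephfs bindings. The sketch below is a hedged illustration only: the config path and directory name are assumptions, and the exact binding signatures can differ between Ceph releases.

        # Hedged sketch: talk to CephFS through libcephfs (python-cephfs bindings)
        # instead of a kernel or FUSE mount.
        import cephfs

        fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")  # assumed config path
        fs.mount()                      # attach to the filesystem

        fs.mkdir("/demo", 0o755)        # metadata ops go to a ceph-mds
        print(fs.stat("/demo"))         # stat is answered by the MDS, not the OSDs
        fs.rmdir("/demo")

        fs.unmount()
        fs.shutdown()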
    22. (the same architecture diagram as slide 2, revisited: RADOS, LIBRADOS, RBD, and RADOSGW are labeled AWESOME; CEPH FS is labeled NEARLY AWESOME)
    23. Path forward
        ● Testing
          – Various workloads
          – Multiple active MDSs
        ● Test automation
          – Simple workload generator scripts
          – Bug reproducers
        ● Hacking
          – Bug squashing
          – Long-tail features
        ● Integrations
          – Ganesha, Samba, *stacks
    24. librados
    25. object model
        ● pools
          – 1s to 100s
          – independent namespaces or object collections
          – replication level, placement policy
        ● objects
          – bazillions
          – blob of data (bytes to gigabytes)
          – attributes (e.g., “version=12”; bytes to kilobytes)
          – key/value bundle (bytes to gigabytes)
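
        To make the object model concrete, here is a minimal, hedged librados sketch using the python-rados bindings; the pool name "data", the object name, and the ceph.conf path are assumptions for illustration.

        # Hedged sketch: connect to RADOS, write an object's byte payload,
        # attach a small attribute, and read the data back.
        import rados

        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()

        ioctx = cluster.open_ioctx("data")             # a pool: one object namespace
        ioctx.write_full("greeting", b"hello rados")   # blob of data
        ioctx.set_xattr("greeting", "version", b"12")  # small attribute
        print(ioctx.read("greeting"))                  # -> b'hello rados'

        ioctx.close()
        cluster.shutdown()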
    26. atomic transactions
        ● client operations sent to the OSD cluster
        ● operate on a single object
        ● can contain a sequence of operations, e.g.
          – truncate object
          – write new object data
          – set attribute
        ● atomicity
          – all operations commit or do not commit atomically
        ● conditional
          – 'guard' operations can control whether the operation is performed
          – verify xattr has specific value
          – assert object is a specific version
        ● allows atomic compare-and-swap etc.
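
        As a hedged illustration of batching a sequence of mutations into one atomic operation, the sketch below uses python-rados WriteOpCtx; it assumes a reasonably recent python-rados (older bindings expose compound write ops mainly through the C/C++ API), and the guard operations mentioned on the slide live in the C/C++ API, so they only appear here as a comment.

        # Hedged sketch: several mutations submitted as one write op, which the
        # OSD applies atomically (all-or-nothing).
        import rados

        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        ioctx = cluster.open_ioctx("data")               # assumed pool name

        with rados.WriteOpCtx() as op:
            op.truncate(0)                               # truncate object
            op.write(b"new contents", 0)                 # write new object data
            ioctx.set_omap(op, ("state",), (b"clean",))  # set a key/value pair
            # (guard ops such as "compare xattr, then apply" are in the C/C++ API)
            ioctx.operate_write_op(op, "myobject")       # everything above commits together

        ioctx.close()
        cluster.shutdown()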
    27. key/value storage
        ● store key/value pairs in an object
          – independent from object attrs or byte data payload
        ● based on google's leveldb
          – efficient random and range insert/query/removal
          – based on BigTable SSTable design
        ● exposed via key/value API
          – insert, update, remove
          – individual keys or ranges of keys
        ● avoid read/modify/write cycle for updating complex objects
          – e.g., file system directory objects
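
        The key/value bundle is reachable from python-rados through the omap calls. A hedged sketch follows; the object and pool names are illustrative, and signatures may vary slightly by release.

        # Hedged sketch: insert a few key/value pairs into one object, then do a
        # range query without ever reading the whole object back.
        import rados

        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        ioctx = cluster.open_ioctx("data")

        # insert/update keys on the object "directory.0"
        with rados.WriteOpCtx() as op:
            ioctx.set_omap(op, ("alice", "bob"), (b"uid=1", b"uid=2"))
            ioctx.operate_write_op(op, "directory.0")

        # range query: up to 100 keys from the beginning, no prefix filter
        with rados.ReadOpCtx() as op:
            entries, _ = ioctx.get_omap_vals(op, "", "", 100)
            ioctx.operate_read_op(op, "directory.0")
            for key, value in entries:
                print(key, value)

        ioctx.close()
        cluster.shutdown()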
    28. watch/notify
        ● establish stateful 'watch' on an object
          – client interest persistently registered with object
          – client keeps session to OSD open
        ● send 'notify' messages to all watchers
          – notify message (and payload) is distributed to all watchers
          – variable timeout
          – notification on completion: all watchers got and acknowledged the notify
        ● use any object as a communication/synchronization channel
          – locking, distributed coordination (ala ZooKeeper), etc.
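
        Newer python-rados releases expose watch/notify directly; older ones only have it in the C/C++ API. The sketch below is therefore a loosely hedged illustration: the object and pool names are made up, and the callback signature is an assumption based on those newer bindings.

        # Hedged sketch: use an ordinary object as a notification channel.
        import time
        import rados

        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        ioctx = cluster.open_ioctx("data")
        ioctx.write_full("channel", b"")            # any object can be the channel

        def on_notify(notify_id, notifier_id, watch_id, data):
            # e.g. invalidate a local cache, re-read shared state, ...
            print("got notify:", data)

        watch = ioctx.watch("channel", on_notify)   # persistent registration with the object
        ioctx.notify("channel", "cache is stale")   # fanned out to every watcher
        time.sleep(1)                               # give the callback a moment to run

        watch.close()
        ioctx.close()
        cluster.shutdown()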
    29. (sequence diagram: clients #1, #2, and #3 each establish a watch on an object at the OSD and get an ack/commit; a notify is then fanned out to all three watchers, each acks it, and the notifier receives a completion)
    30. watch/notify example
        ● radosgw cache consistency
          – radosgw instances watch a single object (.rgw/notify)
          – locally cache bucket metadata
        ● on bucket metadata changes (removal, ACL changes)
          – write change to relevant bucket object
          – send notify with bucket name to other radosgw instances
        ● on receipt of notify
          – invalidate relevant portion of cache
    31. rados classes
        ● dynamically loaded .so
          – /var/lib/rados-classes/*
          – implement new object “methods” using existing methods
          – part of I/O pipeline
          – simple internal API
        ● reads
          – can call existing native or class methods
          – do whatever processing is appropriate
          – return data
        ● writes
          – can call existing native or class methods
          – do whatever processing is appropriate
          – generates a resulting transaction to be applied atomically
    32. class examples
        ● grep
          – read an object, filter out individual records, and return those
        ● sha1
          – read object, generate fingerprint, return that
        ● images
          – rotate, resize, crop image stored in object
          – remove red-eye
        ● crypto
          – encrypt/decrypt object data with provided key
    33. ideas
        ● distributed key/value table
          – aggregate many k/v objects into one big 'table'
          – working prototype exists (thanks, Eleanor!)
    34. ideas
        ● lua rados class
          – embed lua interpreter in a rados class
          – ship semi-arbitrary code for operations
        ● json class
          – parse, manipulate json structures
    35. ideas
        ● rados mailbox (RMB?)
        ● plug librados backend into dovecot, postfix, etc.
        ● key/value object for each mailbox
          – key = message id
          – value = headers
        ● object for each message or attachment
        ● watch/notify for delivery notification
    36. hard links?
        ● rare
        ● useful locality properties
          – intra-directory
          – parallel inter-directory
        ● on miss, file objects provide per-file backpointers
          – degenerates to log(n) lookups
          – optimistic read complexity
