An intro to Ceph and big data - CERN Big Data Workshop


Presentation materials for the CERN Big Data Workshop on 27JUN2013.

Published in: Technology, Education
  • RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
  • Let’s start with RADOS, the Reliable Autonomic Distributed Object Store. In this example, you’ve got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it’s stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
  • With CRUSH, the data is first split into a certain number of sections. These are called “placement groups”. The number of placement groups is configurable. Then, the CRUSH algorithm runs, having received the latest cluster map and a set of placement rules, and it determines where the placement group belongs in the cluster. This is a pseudo-random calculation, but it’s also repeatable; given the same cluster state and rule set, it will always return the same results.
  • Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of a placement group, and no two nodes contain the same information? That’s important.
  • When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
  • What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
  • The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
  • Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
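
The two-step placement calculation from the notes above (hash the object name into a placement group, then map the placement group onto the cluster) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the second step uses highest-random-weight (rendezvous) hashing as a stand-in, not the real CRUSH algorithm, and the names `stable_hash`, `object_to_pg`, `pg_to_osds`, and the host and OSD labels are all hypothetical.

```python
import hashlib
from typing import Dict, List


def stable_hash(*parts: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, num_pgs: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash("object", object_name) % num_pgs


def pg_to_osds(pg: int, osds_by_host: Dict[str, List[str]], replicas: int) -> List[str]:
    """Step 2 stand-in: choose one OSD on each of `replicas` distinct hosts.

    NOT real CRUSH: this is rendezvous hashing, used only because it is also
    pseudo-random yet repeatable for a given cluster state.
    """
    live_hosts = [host for host, osds in osds_by_host.items() if osds]
    if replicas > len(live_hosts):
        raise ValueError("not enough hosts for the requested replica count")
    # Rank hosts by a per-PG score; every client computes the same ranking.
    ranked = sorted(
        live_hosts,
        key=lambda host: stable_hash("host", str(pg), host),
        reverse=True,
    )
    # Within each chosen host, pick the OSD with the highest per-PG score.
    return [
        max(osds_by_host[host], key=lambda osd: stable_hash("osd", str(pg), osd))
        for host in ranked[:replicas]
    ]


# Hypothetical four-node cluster with two OSDs per node.
cluster = {
    "node-a": ["osd.0", "osd.1"],
    "node-b": ["osd.2", "osd.3"],
    "node-c": ["osd.4", "osd.5"],
    "node-d": ["osd.6", "osd.7"],
}
pg = object_to_pg("my-object", num_pgs=64)
print(pg, pg_to_osds(pg, cluster, replicas=3))
```

Because every client and OSD can run the same calculation against the same cluster state, nobody has to ask a central lookup table where an object lives. In this sketch, dropping a node from `cluster` only changes the placement of groups that had a replica on that node; the others stay where they were.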
  • Most people will default to discussions about CephFS when confronted with either Big Data or HPC applications. This can mean using CephFS by itself, or perhaps as a drop-in replacement for HDFS. [NOT READY ARGUMENT] There are a couple of other options, however. You can use librados to talk directly to the object store. One user I know actually plugged Hadoop in at this level, instead of using CephFS. Ceph also has a pretty decent key-value store proof-of-concept done by an intern last year. It's based on a B-tree structure but uses a fixed height of two levels instead of a true tree structure. This draws from both a normal B-tree and Google's Bigtable. Would love to see someone do more with it.
  • I mentioned librados. This is the low-level library that allows you to directly access a RADOS cluster from your application. It has native language bindings for C, C++, Python, etc. This is obviously the fastest way to get at your data, and it comes with no inherent overhead or translation layer.
  • For most object systems an object is just a bunch of bytes, maybe with some extended attributes. With Ceph you can store a lot more than that. You can store key/value pairs inside an object (think Berkeley DB or SQL, where each object is a logical container). It supports atomic transactions, so you can do things like atomic compare-and-swap: update the bytes and the keys/values together, and the change will be consistently distributed and replicated across the cluster in a safe way. There is object-granularity snapshotting, and inter-client communication for locking and whatnot. The really exciting part about this is the ability to implement your own functionality on the OSD.
  • These embedded object classes allow you to send an object method call to the cluster, and it will actually perform that computation without having to pull the data over the network. The downside to using these object classes is the injection of new functionality into the system. A compiled C++ plugin has to be delivered and dynamically loaded into each OSD process. This becomes more complicated if a cluster is composed of multiple architecture targets, and makes it difficult to update functionality on the fly. One approach to addressing these problems is to embed a language run-time within the OSD. Noah Watkins, one of our engineers, tackled this with some Lua bindings, which are available.
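
For the basic librados access path described in the notes above, here is a minimal sketch using the Python bindings (the `rados` module that ships with Ceph). It assumes a reachable cluster, a readable config file at the path shown, and an existing pool; the function name `store_and_fetch`, the default config path, and the attribute name are illustrative choices, not anything from the talk.

```python
def store_and_fetch(pool: str, name: str, payload: bytes,
                    conffile: str = "/etc/ceph/ceph.conf") -> bytes:
    """Write one object straight into RADOS, tag it, and read it back.

    Assumes the `rados` Python bindings are installed and the cluster is
    reachable; the import is done lazily so this file loads without them.
    """
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            # Replace the object's byte payload in one call.
            ioctx.write_full(name, payload)
            # Attach an extended attribute to the same object.
            ioctx.set_xattr(name, "content-type", b"application/octet-stream")
            # Read the payload back directly from the storage nodes.
            return ioctx.read(name, len(payload))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()


# Example call (needs a live cluster and an existing pool named "data"):
# store_and_fetch("data", "hello-object", b"hello from librados")
```

There is no gateway or HTTP layer in this path: the client library talks to the monitors for cluster state and then to the OSDs that hold the object.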
  • One of the more contentious assertions that Sage likes to make is that as we move towards exascale computing and beyond, we'll need to transcend or replace POSIX. The hierarchical model just doesn't scale well beyond a certain level. Future models are going to have to start blurring the line between compute and storage, recognizing when data is local enough to operate on in place versus when you need to gather data from multiple sources for an operation. And finally, fault tolerance needs to become a first-class property of these architectures. As we push the scale of our existing architectures, building things like burst buffers to deal with huge checkpoints across millions of cores just doesn't make a whole lot of sense.
  • Having said all that, there are too many things (both people and code) built using a POSIX mentality to ditch it any time soon. CephFS is designed to provide that POSIX layer on top of RADOS. [read slide] Now, as we've said, there is certainly some work to be done on CephFS, but I want to share a bit about how it works, since it (and similar thinking) will play a big part in Ceph's HPC and Big Data applications going forward.
  • CephFS adds a metadata server (or MDS) to the list of node types in your Ceph cluster. Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Clients accessing CephFS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  • There are multiple MDSs!
  • So how do you have one tree and multiple servers?
  • If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  • When the second one comes along, it will intelligently partition the work by taking a subtree.
  • When the third MDS arrives, it will attempt to split the tree again.
  • Same with the fourth.
  • An MDS can actually even just take a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”. This is done as a periodic load balance exchange. The transfer just ships the cache contents between MDSs and lets the clients continue transparently.
  • CephFS has some neat features that you don't find in most file systems. Because they built the filesystem namespace from the ground up, they were able to build these features into the infrastructure. One of these features is recursive accounting. The MDSs keep track of directory stats for every directory in the file system. For instance, when you do an 'ls -al', the size shown for a directory is actually the total number of bytes stored under that directory, recursively. It's the same thing you can get from 'du', but in real time.
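
The idea behind recursive accounting can be shown with a toy in-memory model: each directory keeps running totals for everything beneath it, so reading a directory's recursive size is a lookup rather than a tree walk. This is a sketch under stated assumptions, not how ceph-mds is implemented; the `Dir` class, its field names, and the eager update on every write are all hypothetical simplifications.

```python
from typing import Dict, Optional


class Dir:
    """Toy directory that maintains recursive totals for its whole subtree."""

    def __init__(self, name: str, parent: Optional["Dir"] = None) -> None:
        self.name = name
        self.parent = parent
        self.files: Dict[str, int] = {}  # file name -> size in bytes
        self.rbytes = 0    # bytes stored anywhere under this directory
        self.rfiles = 0    # files stored anywhere under this directory
        self.rsubdirs = 0  # directories anywhere under this directory
        if parent is not None:
            parent._propagate(0, 0, 1)

    def _propagate(self, dbytes: int, dfiles: int, dsubdirs: int) -> None:
        """Apply a delta to this directory and every ancestor up to the root."""
        node: Optional["Dir"] = self
        while node is not None:
            node.rbytes += dbytes
            node.rfiles += dfiles
            node.rsubdirs += dsubdirs
            node = node.parent

    def write_file(self, name: str, size: int) -> None:
        """Create or overwrite a file and push the size change upward."""
        old_size = self.files.get(name)
        self.files[name] = size
        if old_size is None:
            self._propagate(size, 1, 0)
        else:
            self._propagate(size - old_size, 0, 0)


root = Dir("/")
home = Dir("home", root)
alice = Dir("alice", home)
alice.write_file("results.dat", 100)
home.write_file("notes.txt", 50)
print(root.rbytes, root.rfiles, root.rsubdirs)
```

The trade-off in this toy is that every write pays a cost proportional to the directory depth so that reads are constant time, which is the opposite of what 'du' does.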
  • [Provides snapshots] The motivation here is that once you start talking about petabytes and exabytes, it doesn't make much sense to try to snapshot the entire tree. You need to be able to snapshot different directories and different data sets. You can add and remove snapshots for any directory with standard shell commands (mkdir and rmdir inside the hidden '.snap' directory); no special tools are needed.
  • Also, next Ceph developer summit coming soon to plan for the Emperor release. Would love to see some blueprints submitted for CephFS work.

    1. an intro to ceph and big data
       patrick mcgarry – inktank
       Big Data Workshop – 27 JUN 2013
    2. what is ceph?
       • distributed storage system
         − reliable system built with unreliable components
         − fault tolerant, no SPoF
       • commodity hardware
         − expensive arrays, controllers, specialized networks not required
       • large scale (10s to 10,000s of nodes)
         − heterogenous hardware (no fork-lift upgrades)
         − incremental expansion (or contraction)
       • dynamic cluster
    3. what is ceph?
       • unified storage platform
         − scalable object + compute storage platform
         − RESTful object storage (e.g., S3, Swift)
         − block storage
         − distributed file system
       • open source
         − LGPL server-side
         − client support in mainline Linux kernel
    4. RADOS – the Ceph object store: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
       LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
       RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
       CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
       RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
       (diagram labels: APP, APP, HOST/VM, CLIENT)
    6. (diagram)
       hash(object name) % num pg
       CRUSH(pg, cluster state, policy)
    7. (diagram)
    8. (diagram) CLIENT ??
    9. (diagram) CLIENT ??
    10. So what about big data?
       • CephFS
       • s/HDFS/CephFS/g
       • Object Storage
       • Key-value store
    11. librados
       • direct access to RADOS from applications
       • C, C++, Python, PHP, Java, Erlang
       • direct access to storage nodes
       • no HTTP overhead
    12. rich librados API
       • efficient key/value storage inside an object
       • atomic single-object transactions
         − update data, attr, keys together
         − atomic compare-and-swap
       • object-granularity snapshot infrastructure
       • inter-client communication via object
       • embed code in ceph-osd daemon via plugin API
         − arbitrary atomic object mutations, processing
    13. Data and compute
       • RADOS Embedded Object Classes
       • Moves compute directly adjacent to data
       • C++ by default
       • Lua bindings available
    14. die, POSIX, die
       • successful exascale architectures will replace or transcend POSIX
         − hierarchical model does not distribute
       • line between compute and storage will blur
         − some processes is data-local, some is not
       • fault tolerance will be first-class property of architecture
         − for both computation and storage
    15. POSIX – I'm not dead yet!
       • CephFS builds POSIX namespace on top of RADOS
         − metadata managed by ceph-mds daemons
         − stored in objects
       • strong consistency, stateful client protocol
         − heavy prefetching, embedded inodes
       • architected for HPC workloads
         − distribute namespace across cluster of MDSs
         − mitigate bursty workloads
         − adapt distribution as workloads shift over time
    16. (diagram labels: M, M, M, CLIENT, data, metadata)
    17. (diagram labels: M, M, M)
    18. one tree
       three metadata servers
       ??
    20. recursive accounting
       • ceph-mds tracks recursive directory stats
         − file sizes
         − file and directory counts
         − modification time
       • efficient
       $ ls -alSh | head
       total 0
       drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
       drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
       drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
       drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
       drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
       drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
       drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
       drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
       drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
    21. snapshots
       • snapshot arbitrary subdirectories
       • simple interface
         − hidden '.snap' directory
         − no special tools
       $ mkdir foo/.snap/one # create snapshot
       $ ls foo/.snap
       one
       $ ls foo/bar/.snap
       _one_1099511627776 # parent's snap name is mangled
       $ rm foo/myfile
       $ ls -F foo
       bar/
       $ ls -F foo/.snap/one
       myfile bar/
       $ rmdir foo/.snap/one # remove snapshot
    22. how can you help?
       • try ceph and tell us what you think
         −
         − ask if you need help
       • ask your organization to start dedicating resources to the project
       • find a bug ( and fix it
       • participate in our ceph developer summit
         −
    23. questions?
    24. thanks
       patrick mcgarry
       @scuttlemonkey