Webinar - Advanced Ceph Features
Slides from our advanced Ceph topics webinar, hosted by Ceph creator Sage Weil. This webinar was aired on February 12, 2013.

Notes
  • It was at this point that computers had to learn how to consolidate the storage available in multiple hard drives and present it in a unified way to their human patrons.
  • This causes a whole new set of problems. How do you prevent two people from writing the same file at the same time? How do you make sure that one person can’t see data intended for someone else?
  • And, as you might imagine, these solutions led to other problems. Notice a bottleneck in this picture?
  • Blam!! The invention of distributed storage was another very important point in the history of information storage. For the first time, every part of the system could be scaled.
  • RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
  • Let’s start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you’ve got five disks in a computer. You have initialized each disk with a filesystem (btrfs will eventually be the right filesystem to use, but until it is stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
  • Applications wanting to store objects into RADOS interact with the cluster as a single entity.
  • The way CRUSH is configured is somewhat unique. Instead of defining pools for different data types, workgroups, subnets, or applications, CRUSH is configured with the physical topology of your storage network. You tell it how many buildings, rooms, shelves, racks, and nodes you have, and you tell it how you want data placed. For example, you could tell CRUSH that it’s okay to have two replicas in the same building, but not on the same power circuit. You also tell it how many copies to keep.
  • With CRUSH, the data is first split into a certain number of sections. These are called “placement groups”. The number of placement groups is configurable. Then, the CRUSH algorithm runs, having received the latest cluster map and a set of placement rules, and it determines where the placement group belongs in the cluster. This is a pseudo-random calculation, but it’s also repeatable; given the same cluster state and rule set, it will always return the same results.
  • Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of a placement group, and no two nodes contain the same information? That’s important.
  • When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
  • What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
  • The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
  • Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
  • Next, let’s talk about librados. Librados is a native C library that allows applications to work with RADOS. There are similar libraries available for C++, Java, Python, Ruby, and PHP.
  • So applications link with librados, allowing them to interact with RADOS through a native protocol.
  • The radosgw component is a REST-based interface to RADOS. It allows developers to build applications that work with Ceph through standard web services.
  • So, for example, an application can use a REST-based API to work with radosgw, and radosgw talks to RADOS using a native protocol. You can deploy as many gateways as you need, and you can use standard HTTP load balancers. User authentication and S3-style buckets are also supported, and applications written to work with Amazon S3 or OpenStack Swift will automatically work with radosgw by just changing their endpoint.
  • The RADOS Block Device (RBD) allows users to store virtual disks inside RADOS.
  • OK. This gets a little bit abstract now. Sometimes, people don’t want to store files or objects, they want to store entire disks. Virtual disks. Collections of data that can be assembled together and presented to a computer, which would see it as a real hard drive with platters and sectors.
  • … or the more modern way, which is to boot multiple virtual machines using disks from a storage cluster.
  • For example, you can use a virtualization container like KVM or QEMU to boot virtual machines from images that have been stored in RADOS. Images are striped across the entire cluster, which allows for simultaneous read access from different cluster nodes.
  • Separating a virtual computer from its storage also lets you do really neat things, like migrate a virtual machine from one server to another without rebooting it.
  • As an alternative, machines (even those running on bare metal) can mount an RBD image using native Linux kernel drivers.
  • With Ceph, copying an RBD image four times gives you five total copies…but only takes the space of one. It also happens instantly.
  • When they read, though, they read through to the original copy if there’s no newer data.
  • Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
  • Remember all that meta-data we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  • There are multiple MDSs!
  • If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
  • So how do you have one tree and multiple servers?
  • If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  • When the second one comes along, it will intelligently partition the work by taking a subtree.
  • When the third MDS arrives, it will attempt to split the tree again.
  • Same with the fourth.
  • An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.
  • Which components of Ceph are you using today? Which components of Ceph are you interested in using in the future? What additional functionality would you like to see in Ceph? (free form)
  • Technical Overview: we will explain the architecture, functionality, best practices, and typical use cases; we’ll take you through code review and the technology road map, and review your business goals. Infrastructure Assessment: our team will conduct an in-depth, on-site assessment of your current storage environment to fully understand your architecture, application, and workload; Inktank engineers will customize a solution for your business needs. Proof of Concept: the best way to determine whether Ceph is the right solution for you is to set up your own test cluster; with our configuration, benchmarking tools and services, we can guide you to the right solution for your particular application. Implementation Support: reduce project risk by adding Ceph experts to your team; we will work with you to build the most reliable, secure, and robust storage system for your application, and we can help you adapt your applications to fully leverage Ceph or extend the Ceph core to provide the features you need. Performance Tuning: through detailed measurements, we can tune the components in a production Ceph cluster to achieve the optimum configuration for your actual workload. Inktank's Pre-Production Subscription is ideal for architects and administrators setting up a Ceph environment and needing support during the installation and configuration process. Features: email support channel, one named company contact, SLA-backed response, access to the support ticketing system, unlimited support requests. Inktank Production Support: Silver & Gold subscriptions.

    1. Inktank: Delivering the Future of Storage. Advanced Features of the Ceph Distributed Storage System. February 12, 2013
    2. outline: why you should care; what is it, what it's for; how it works (architecture); how you can use it (librados, radosgw, RBD, the ceph block device, distributed file system); roadmap; why we do this, who we are
    3. Inktank: a company that provides professional services and support for Ceph; founded in 2011; funded by DreamHost; Mark Shuttleworth invested $1M; Sage Weil, CTO and creator of Ceph. Ceph: a distributed unified object, block and file storage platform; created by storage experts; open source; in the Linux kernel; integrated into cloud platforms
    4. why should you care about another storage system?
    5. requirements. diverse storage needs: object storage; block devices (for VMs) with snapshots, cloning; shared file system with POSIX, coherent caches; structured data... files, block devices, or objects? scale: terabytes, petabytes, exabytes; heterogeneous hardware; reliability and fault tolerance
    6. time. ease of administration; no manual data migration, load balancing; painless scaling: expansion and contraction, seamless migration
    7. cost. linear function of size or performance; incremental expansion: no fork-lift upgrades; no vendor lock-in: choice of hardware, choice of software; open
    8. what is ceph?
    9. unified storage system. objects: native, RESTful. block: thin provisioning, snapshots, cloning. file: strong consistency, snapshots
    10. distributed storage system. data center scale: 10s to 10,000s of machines, terabytes to exabytes. fault tolerant: no single point of failure, commodity hardware. self-managing, self-healing
    11. how do you design a storage system that (really) scales?
    12. [diagram: a human with a computer that pools many disks]
    13. [diagram: several humans sharing one computer full of disks]
    14. [diagram: many humans crowding a single (computer), "actually more like this...": the bottleneck]
    15. [diagram: humans talking to many computers, each with its own disks: distributed storage]
    16. [architecture diagram] RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes. On top of it: LIBRADOS, a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP; RADOSGW, a bucket-based REST gateway, compatible with S3 and Swift; RBD, a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver; and CEPH FS, a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE.
    17. why start with objects? more useful than (disk) blocks: names in a single flat namespace; variable size; simple API with rich semantics. more scalable than files: no hard-to-distribute hierarchy; update semantics do not span objects; workload is trivially parallel
    18. ceph object model. pools: 1s to 100s; independent namespaces or object collections; replication level, placement policy. objects: bazillions; blob of data (bytes to gigabytes); attributes (e.g., "version=12"; bytes to kilobytes); key/value bundle (bytes to gigabytes). (see the sketch below)
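
    The pool/object model above can be restated as a couple of tiny data structures. The sketch below is purely illustrative (it is not Ceph code, and the class and field names are invented); it just mirrors the slide: a pool is an independent namespace with a replication level, and each object carries a data blob, attributes, and a key/value bundle.

        # Illustrative only: a plain-Python restatement of the RADOS object model.
        from dataclasses import dataclass, field

        @dataclass
        class RadosObject:
            name: str
            data: bytes = b""                            # blob of data (bytes to gigabytes)
            xattrs: dict = field(default_factory=dict)   # attributes, e.g. {"version": b"12"}
            omap: dict = field(default_factory=dict)     # key/value bundle

        @dataclass
        class Pool:
            name: str
            replication: int = 3                         # replication level
            objects: dict = field(default_factory=dict)  # independent namespace

        pool = Pool("mypool")
        pool.objects["greeting"] = RadosObject("greeting", b"hello", {"version": b"12"})
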
    19. [diagram: five disks, each with a filesystem (btrfs, xfs, or ext4) and an OSD running on top, alongside monitor daemons (M)]
    20. Object Storage Daemons (OSDs): 10s to 10000s in a cluster; one per disk, SSD, or RAID group, or ...; hardware agnostic; serve stored objects to clients; intelligently peer to perform replication and recovery tasks. Monitors (M): maintain cluster membership and state; provide consensus for distributed decision-making; small, odd number; these do not serve stored objects to clients
    21. [diagram: a human administrator and the cluster's monitors (M)]
    22. data distribution. all objects are replicated N times; objects are automatically placed, balanced, migrated in a dynamic cluster; must consider physical infrastructure (ceph-osds on hosts in racks in rows in data centers). three approaches: pick a spot and remember where you put it; pick a spot and write down where you put it; calculate where to put it, where to find it
    23. CRUSH: pseudo-random placement algorithm (fast calculation, no lookup; repeatable, deterministic); statistically uniform distribution; stable mapping (limited data migration on change); rule-based configuration (infrastructure topology aware; adjustable replication; allows weighting)
    24. [diagram: objects are hashed into placement groups, hash(object name) % num_pg, and each placement group is mapped onto OSDs by CRUSH(pg, cluster state, policy); see the sketch below]
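
    A hedged sketch of the two-step mapping on the slide above, in Python. This is not the real CRUSH algorithm (CRUSH is hierarchical, weighted, and rule-based); it only demonstrates the property the slides emphasize: placement is computed from the object name and the current cluster map, deterministically and without any lookup table. The PG count, OSD names, and map epoch are made-up example values.

        # Not CRUSH itself: a toy, lookup-free placement calculation with the same shape.
        import hashlib

        def pg_for_object(name, num_pgs):
            # hash(object name) % num_pg
            return int(hashlib.md5(name.encode()).hexdigest(), 16) % num_pgs

        def osds_for_pg(pg, osds, map_epoch, replicas=3):
            # Repeatable pseudo-random mapping seeded by (pg, cluster map epoch): the same
            # inputs always give the same placement, so any client can compute where data
            # lives instead of asking a central directory.
            ranked = sorted(
                osds,
                key=lambda osd: hashlib.md5(("%d:%d:%s" % (pg, map_epoch, osd)).encode()).hexdigest(),
            )
            return ranked[:replicas]

        pg = pg_for_object("my-object", num_pgs=128)
        print(pg, osds_for_pg(pg, ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"], map_epoch=42))
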
    25. [diagram: the placement groups mapped onto OSDs across the cluster]
    26. [diagram: a client needs to find an object in the cluster]
    27. [diagram: the client computes the object's location itself, with no lookup service]
    28. [architecture diagram, highlighting LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP]
    29. [diagram: applications linking librados and speaking the native protocol to the cluster]
    30. librados: direct access to RADOS from applications; C, C++, Python, PHP, Java, Erlang; direct access to storage nodes; no HTTP overhead
    31. rich librados API: atomic single-object transactions (update data, attr together; compare-and-swap); efficient key/value storage inside an object; object-granularity snapshot primitives; embed code in ceph-osd daemon via plugin API; inter-client communication via object. (see the sketch below)
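
    A minimal python-rados sketch of what "direct access to RADOS from applications" looks like in practice. It assumes the python-rados bindings are installed, a ceph.conf at the usual path, and an existing pool; the pool and object names are placeholders. The richer primitives listed above (compound transactions, object classes, inter-client communication) are exposed through further librados calls not shown here.

        # Minimal python-rados example: store an object, attach an attribute, read both back.
        import rados

        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        try:
            ioctx = cluster.open_ioctx("mypool")                    # pool must already exist
            try:
                ioctx.write_full("hello-object", b"hello world")    # object data
                ioctx.set_xattr("hello-object", "version", b"12")   # object attribute
                print(ioctx.read("hello-object"))
                print(ioctx.get_xattr("hello-object", "version"))
            finally:
                ioctx.close()
        finally:
            cluster.shutdown()
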
    32. [architecture diagram, highlighting RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift]
    33. [diagram: applications speaking REST to radosgw instances, which use librados to talk natively to RADOS]
    34. RADOS Gateway: REST-based object storage proxy; uses RADOS to store objects; API supports buckets, accounting; usage accounting for billing purposes; compatible with S3, Swift APIs. (see the sketch below)
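
    Because radosgw exposes an S3-compatible API, a stock S3 client library can talk to it once it is pointed at the gateway. Below is a hedged sketch using the classic boto (2.x) library; the endpoint, access key, and secret key are placeholders for values provisioned on your gateway (for example with radosgw-admin).

        # Talking to radosgw through its S3-compatible API with boto.
        import boto
        import boto.s3.connection

        conn = boto.connect_s3(
            aws_access_key_id="ACCESS_KEY",                  # placeholder credentials
            aws_secret_access_key="SECRET_KEY",
            host="radosgw.example.com",                      # your gateway endpoint
            is_secure=False,
            calling_format=boto.s3.connection.OrdinaryCallingFormat(),
        )

        bucket = conn.create_bucket("my-bucket")
        key = bucket.new_key("hello.txt")
        key.set_contents_from_string("hello world")
        print(key.get_contents_as_string())
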
    35. [architecture diagram, highlighting RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver]
    36. [diagram: a grid of computers, each with its own disks]
    37. [diagram: virtual machines booting from disk images stored across the cluster]
    38. [diagram: a VM whose virtualization container uses LIBRBD and LIBRADOS to reach the cluster]
    39. [diagram: two virtualization containers reaching the same RBD images, so a VM can migrate between hosts]
    40. [diagram: a bare-metal host mapping an RBD image through KRBD, the kernel module]
    41. RADOS Block Device: storage of disk images in RADOS; decouple VM from host; images striped across entire cluster (pool); snapshots; copy-on-write clones; support in Qemu/KVM, mainline Linux kernel (2.6.39+), OpenStack, CloudStack. (see the sketch below)
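
    A sketch of the snapshot and copy-on-write clone workflow through the python-rbd bindings. The pool and image names are placeholders, cloning assumes format-2 images with the layering feature enabled, and exact keyword arguments have shifted a little between Ceph releases, so treat this as illustrative rather than definitive.

        # python-rbd sketch: create an image, snapshot it, protect the snapshot, clone it.
        import rados
        import rbd

        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        ioctx = cluster.open_ioctx("rbd")                      # placeholder pool name

        rbd_inst = rbd.RBD()
        rbd_inst.create(ioctx, "base-image", 10 * 1024 ** 3,   # 10 GiB image
                        old_format=False, features=rbd.RBD_FEATURE_LAYERING)

        with rbd.Image(ioctx, "base-image") as image:
            image.create_snap("golden")                        # point-in-time snapshot
            image.protect_snap("golden")                       # required before cloning

        # Copy-on-write clone: created instantly, storing only data that diverges from the parent.
        rbd_inst.clone(ioctx, "base-image", "golden", ioctx, "vm-disk-1",
                       features=rbd.RBD_FEATURE_LAYERING)

        ioctx.close()
        cluster.shutdown()
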
    42. [diagram: instant copy: an image of size 144 is cloned four times, and each clone initially occupies 0 additional space]
    43. [diagram: reads from a clone fall through to the parent for anything the clone has not overwritten; once a clone holds 4 units of its own data, 144 + 4 = 148 units are stored in total (see the arithmetic sketch below)]
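
    The arithmetic behind the two slides above, spelled out (the sizes are illustrative; the point, per the speaker notes, is that clones cost nothing until they diverge and read through to the parent otherwise):

        # Back-of-the-envelope copy-on-write accounting for the preceding slides.
        base_image = 144        # size of the original image
        clones = 4              # instant copies made from it

        full_copies_would_use = base_image * (1 + clones)   # 720 if every copy were a real copy
        cow_clones_use = base_image + clones * 0            # 144: clones start out empty

        # Once one clone has written 4 units of its own data, the cluster stores
        # 144 + 4 = 148 units in total; reads of unmodified blocks still come from the parent.
        stored_after_writes = base_image + 4
        print(full_copies_would_use, cow_clones_use, stored_after_writes)
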
    44. [architecture diagram, highlighting CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE]
    45. [diagram: a CephFS client fetching metadata from the metadata servers and exchanging file data directly with the OSDs]
    46. [diagram: metadata servers and monitors]
    47. Metadata Server (MDS): manages metadata for the POSIX shared file system (directory hierarchy; file metadata: size, owner, timestamps); stores metadata in RADOS; does not serve file data to clients; only required for the shared file system
    48. one tree, three metadata servers?
    49. [diagram: a single MDS manages metadata for the entire tree]
    50. [diagram: a second MDS joins and takes over a subtree]
    51. [diagram: a third MDS splits the tree again]
    52. [diagram: a fourth MDS takes another subtree]
    53. DYNAMIC SUBTREE PARTITIONING
    54. recursive accounting: ceph-mds tracks recursive directory stats (file sizes; file and directory counts; modification time); efficient.
        $ ls -alSh | head
        total 0
        drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .
        drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..
        drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph
        drwxr-xr-x 1 mcg_test1 pg2419992 23G 2011-02-02 08:57 mcg_test1
        drwx--x--- 1 luko adm 19G 2011-01-21 12:17 luko
        drwx--x--- 1 eest adm 14G 2011-02-04 16:29 eest
        drwxr-xr-x 1 mcg_test2 pg2419992 3.0G 2011-02-02 09:34 mcg_test2
        drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph
        drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
    55. snapshots: snapshot arbitrary subdirectories; simple interface (hidden .snap directory, no special tools).
        $ mkdir foo/.snap/one       # create snapshot
        $ ls foo/.snap
        one
        $ ls foo/bar/.snap
        _one_1099511627776          # parent's snap name is mangled
        $ rm foo/myfile
        $ ls -F foo
        bar/
        $ ls -F foo/.snap/one
        myfile bar/
        $ rmdir foo/.snap/one       # remove snapshot
    56. multiple protocols, implementations. Linux kernel client: mount -t ceph 1.2.3.4:/ /mnt; re-export via NFS or Samba (CIFS). ceph-fuse. libcephfs.so: link it into your app; Samba (CIFS); Ganesha (NFS); Hadoop (map/reduce). [diagram: NFS and SMB/CIFS gateways, Hadoop, and your app all built on libcephfs, alongside ceph-fuse and the kernel client]
    57. Current Status
    58. current status. argonaut stable release v0.48.x: rados, RBD, radosgw. bobtail stable release v0.56.x: RBD cloning; improved performance, scaling, failure behavior; radosgw API and performance improvements; just released
    59. cuttlefish roadmap. RBD: Xen integration, iSCSI. radosgw: multi-site federation, disaster recovery. RADOS: improved data integrity checking, disaster recovery; ongoing performance improvements. file system: guiding ongoing community development; robust failure recovery, stability, fsck
    60. Contact Sage. Email: sage@inktank.com. Follow Sage on Twitter: @Liewegas
    61. Voting Questions
    62. Inktank's Professional Services. Consulting Services: Technical Overview, Infrastructure Assessment, Proof of Concept, Implementation Support, Performance Tuning. Support Subscriptions: Pre-Production Support, Production Support. A full description of our services can be found at the following: Consulting Services: http://www.inktank.com/consulting-services/; Support Subscriptions: http://www.inktank.com/support-services/
    63. Check out our on-demand webinars: Getting Started with Ceph; Introduction to Ceph with OpenStack; DreamHost Case Study: DreamObjects with Ceph; Advanced Features of Ceph Distributed Storage (delivered by Sage Weil, creator of Ceph). All can be watched at: http://www.inktank.com/news-events/webinars/
    64. Contact Us. Info@inktank.com, 1-855-INKTANK. Don't forget to follow us on: Twitter: https://twitter.com/inktank; Facebook: http://www.facebook.com/inktank; YouTube: http://www.youtube.com/inktankstorage
    65. Thank you!
