SlideShare a Scribd company logo
Inktank
Delivering the Future of Storage


Advanced Features of the Ceph Distributed Storage System
February 12, 2013
outline


    why you should care

    what is it, what it's for

    how it works
     
       architecture

    how you can use it
     
       librados
     
       radosgw
     
       RBD, the ceph block device
     
       distributed file system

    roadmap

    why we do this, who we are
•   company that provides       •   distributed unified object,
    professional services and       block and file storage
    support for Ceph
                                    platform
•   founded in 2011
                                •   created by storage experts
•   funded by DreamHost
                                •   open source
•   Mark Shuttleworth
    invested $1M                •   in the Linux Kernel

•   Sage Weil, CTO and          •   integrated into Cloud
    creator of Ceph                 Platforms
why should you care about another storage
system?




                                            4
requirements

diverse storage needs
   
      object storage
   
      block devices (for VMs) with snapshots, cloning
   
      shared file system with POSIX, coherent caches
   
      structured data... files, block devices, or objects?

scale
    
        terabytes, petabytes, exabytes
    
        heterogeneous hardware
    
        reliability and fault tolerance
time

ease of administration

no manual data migration, load balancing

painless scaling
   
      expansion and contraction
   
      seamless migration
cost

linear function of size or performance

incremental expansion
    
      no fork-lift upgrades

no vendor lock-in
   
      choice of hardware
   
      choice of software

open
what is ceph?




                8
unified storage system

objects
   
      native
   
      RESTful

block
       
           thin provisioning, snapshots, cloning

file
       
           strong consistency, snapshots
distributed storage system

data center scale
   
      10s to 10,000s of machines
   
      terabytes to exabytes

fault tolerant
    
       no single point of failure
    
       commodity hardware

self-managing, self-healing
how do you design a storage system that
(really) scales?




                                          11
DISK
                   DISK

                   DISK
                   DISK

                   DISK
                   DISK

HUMAN
HUMAN   COMPUTER
        COMPUTER   DISK
                   DISK

                   DISK
                   DISK

                   DISK
                   DISK

                   DISK
                   DISK
DISK
                   DISK

                   DISK
                   DISK
HUMAN
HUMAN
                   DISK
                   DISK

HUMAN
HUMAN   COMPUTER
        COMPUTER   DISK
                   DISK

                   DISK
                   DISK
HUMAN
HUMAN
                   DISK
                   DISK

                   DISK
                   DISK
HUMAN
   HUMAN      HUMAN
               HUMAN

                       HUMAN
                        HUMAN
 HUMAN
  HUMAN                                                      DISK
                                                              DISK
              HUMAN
               HUMAN
HUMAN
 HUMAN                                                       DISK
                                                              DISK

 HUMAN
  HUMAN                                                      DISK
                                                              DISK
               HUMAN
                HUMAN

                                                             DISK
                                                              DISK
     HUMAN
      HUMAN
                                                             DISK
                                                              DISK
           HUMAN
            HUMAN
HUMAN                                                        DISK
                                                              DISK
 HUMAN                           (COMPUTER) )
                                   (COMPUTER
            HUMAN
             HUMAN
                                                             DISK
                                                              DISK
  HUMAN
   HUMAN           HUMAN
                    HUMAN                                    DISK
                                                              DISK
             HUMAN
              HUMAN
 HUMAN
  HUMAN                                                      DISK
                                                              DISK

              HUMAN
               HUMAN                                         DISK
                                                              DISK

  HUMAN
   HUMAN                                                     DISK
                                                              DISK
                    HUMAN
                     HUMAN
     HUMAN
      HUMAN                                                  DISK
                                                              DISK
                    HUMAN
                     HUMAN

          HUMAN
           HUMAN                (actually more like this…)
COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK
HUMAN
HUMAN
        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK
HUMAN
HUMAN
        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK
HUMAN
HUMAN
        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK

        COMPUTER
         COMPUTER   DISK
                     DISK
APP
         APP                    APP
                                 APP                   HOST/VM
                                                        HOST/VM                  CLIENT
                                                                                  CLIENT




                        RADOSGW
                         RADOSGW                RBD
                                                 RBD                      CEPH FS
                                                                           CEPH FS
  LIBRADOS
   LIBRADOS
                        AA bucket-based REST AA reliable and fully-
                          bucket-based REST    reliable and fully-        AA POSIX-compliant
                                                                             POSIX-compliant
  AA library allowing
    library allowing    gateway, compatible
                         gateway, compatible distributed block
                                              distributed block           distributed file system,
                                                                           distributed file system,
  apps to directly
   apps to directly     with S3 and Swift
                         with S3 and Swift   device, with a a Linux
                                              device, with Linux          with a a Linux kernel
                                                                           with Linux kernel
  access RADOS,
   access RADOS,                             kernel client and a a
                                              kernel client and           client and support for
                                                                           client and support for
  with support for
   with support for                          QEMU/KVM driver
                                              QEMU/KVM driver             FUSE
                                                                           FUSE
  C, C++, Java,
   C, C++, Java,
  Python, Ruby,
   Python, Ruby,
  and PHP
   and PHP




RADOS
 RADOS

AA reliable, autonomous, distributed object store comprised of self-healing, self-managing,
   reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
  intelligent storage nodes
why start with objects?

more useful than (disk) blocks
   
     names in a single flat namespace
   
     variable size
   
     simple API with rich semantics

more scalable than files
   
     no hard-to-distribute hierarchy
   
     update semantics do not span objects
   
     workload is trivially parallel
ceph object model

 pools
     
         1s to 100s
     
         independent namespaces or object collections
     
         replication level, placement policy

 objects
    
       bazillions
    
       blob of data (bytes to gigabytes)
    
       attributes (e.g., “version=12”; bytes to kilobytes)
    
       key/value bundle (bytes to gigabytes)
OSD    OSD    OSD    OSD    OSD




                                   btrfs
 FS     FS    FS     FS     FS     xfs
                                   ext4

DISK   DISK   DISK   DISK   DISK




  M            M            M
Object Storage Daemons (OSDs)
    
      10s to 10000s in a cluster
    
      one per disk, SSD, or RAID group, or ...
       
           hardware agnostic
    
      serve stored objects to clients
    
      intelligently peer to perform replication
      and recovery tasks




    Monitors
    
      maintain cluster membership and state



M
    
      provide consensus for distributed
      decision-making
    
      small, odd number
    
      these do not served stored objects to
      clients
HUMAN




        M




M           M
data distribution
all objects are replicated N times

objects are automatically placed, balanced, migrated in a
dynamic cluster

must consider physical infrastructure
  
      ceph-osds on hosts in racks in rows in data centers

three approaches
    
      pick a spot; remember where you put it
    
      pick a spot; write down where you put it
    
      calculate where to put it, where to find it
CRUSH

  pseudo-random placement algorithm
   
      fast calculation, no lookup
   
      repeatable, deterministic

  statistically uniform distribution

  stable mapping
   
      limited data migration on change

  rule-based configuration
   
      infrastructure topology aware
   
      adjustable replication
   
      allows weighting
10 10 01 01 10 10 01 11 01 10
               10 10 01 01 10 10 01 11 01 10

                                           hash(object name) % num pg



1010   1010    0101   0101   1010   1010   0101     1111   0101   1010




                                                  CRUSH(pg, cluster state, policy)
10 10 01 01 10 10 01 11 01 10
               10 10 01 01 10 10 01 11 01 10




1010   1010    0101   0101   1010   1010   0101   1111   0101   1010
CLIENT
CLIENT

         ??
CLIENT

         ??
APP
         APP                    APP
                                 APP                   HOST/VM
                                                        HOST/VM                  CLIENT
                                                                                  CLIENT




                       RADOSGW
                        RADOSGW                 RBD
                                                 RBD                      CEPH FS
                                                                           CEPH FS
  LIBRADOS
                       AA bucket-based REST AA reliable and fully-
                         bucket-based REST    reliable and fully-         AA POSIX-compliant
                                                                             POSIX-compliant
  A library allowing   gateway, compatible
                        gateway, compatible distributed block
                                             distributed block            distributed file system,
                                                                           distributed file system,
  apps to directly     with S3 and Swift
                        with S3 and Swift   device, with a a Linux
                                             device, with Linux           with a a Linux kernel
                                                                           with Linux kernel
  access RADOS,                             kernel client and a a
                                             kernel client and            client and support for
                                                                           client and support for
  with support for                          QEMU/KVM driver
                                             QEMU/KVM driver              FUSE
                                                                           FUSE
  C, C++, Java,
  Python, Ruby,
  and PHP




RADOS
 RADOS

AA reliable, autonomous, distributed object store comprised of self-healing, self-managing,
   reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
  intelligent storage nodes
APP
       APP
     LIBRADOS
      LIBRADOS
                 native




     MM

MM                MM
librados
    
        direct access to RADOS from



L
        applications
    
        C, C++, Python, PHP, Java, Erlang
    
        direct access to storage nodes
    
        no HTTP overhead
rich librados API


    atomic single-object transactions
     
       update data, attr together
     
       compare-and-swap


    efficient key/value storage inside an object


    object-granularity snapshot primitives


    embed code in ceph-osd daemon via plugin API


    inter-client communication via object
APP
         APP                    APP
                                 APP                   HOST/VM
                                                        HOST/VM                  CLIENT
                                                                                  CLIENT




                        RADOSGW                 RBD
                                                 RBD                      CEPH FS
                                                                           CEPH FS
  LIBRADOS
   LIBRADOS
                        A bucket-based REST     AA reliable and fully-
                                                  reliable and fully-     AA POSIX-compliant
                                                                             POSIX-compliant
  AA library allowing
    library allowing    gateway, compatible     distributed block
                                                 distributed block        distributed file system,
                                                                           distributed file system,
  apps to directly
   apps to directly     with S3 and Swift       device, with a a Linux
                                                 device, with Linux       with a a Linux kernel
                                                                           with Linux kernel
  access RADOS,
   access RADOS,                                kernel client and a a
                                                 kernel client and        client and support for
                                                                           client and support for
  with support for
   with support for                             QEMU/KVM driver
                                                 QEMU/KVM driver          FUSE
                                                                           FUSE
  C, C++, Java,
   C, C++, Java,
  Python, Ruby,
   Python, Ruby,
  and PHP
   and PHP




RADOS
 RADOS

AA reliable, autonomous, distributed object store comprised of self-healing, self-managing,
   reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
  intelligent storage nodes
APP
  APP                    APP
                          APP
                                      REST




RADOSGW
RADOSGW             RADOSGW
                    RADOSGW
  LIBRADOS
    LIBRADOS             LIBRADOS
                           LIBRADOS


                                             native




               MM

        MM          MM
RADOS Gateway

  REST-based object storage proxy

  uses RADOS to store objects

  API supports buckets, accounting

  usage accounting for billing purposes

  compatible with S3, Swift APIs
APP
         APP                    APP
                                 APP                   HOST/VM
                                                        HOST/VM                  CLIENT
                                                                                  CLIENT




                        RADOSGW
                         RADOSGW                RBD                       CEPH FS
                                                                           CEPH FS
  LIBRADOS
   LIBRADOS
                        AA bucket-based REST A reliable and fully-
                          bucket-based REST                               AA POSIX-compliant
                                                                             POSIX-compliant
  AA library allowing
    library allowing    gateway, compatible
                         gateway, compatible distributed block            distributed file system,
                                                                           distributed file system,
  apps to directly
   apps to directly     with S3 and Swift
                         with S3 and Swift   device, with a Linux         with a a Linux kernel
                                                                           with Linux kernel
  access RADOS,
   access RADOS,                             kernel client and a          client and support for
                                                                           client and support for
  with support for
   with support for                          QEMU/KVM driver              FUSE
                                                                           FUSE
  C, C++, Java,
   C, C++, Java,
  Python, Ruby,
   Python, Ruby,
  and PHP
   and PHP




RADOS
 RADOS

AA reliable, autonomous, distributed object store comprised of self-healing, self-managing,
   reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
  intelligent storage nodes
COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK
COMPUTER
COMPUTER           COMPUTER
                    COMPUTER   DISK
                                DISK
           DISK
            DISK
                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK

                   COMPUTER
                    COMPUTER   DISK
                                DISK
COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK

VM
 VM   COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK
VM
 VM   COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK
VM
 VM
      COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK

      COMPUTER
       COMPUTER   DISK
                   DISK
VM
               VM




VIRTUALIZATION CONTAINER
 VIRTUALIZATION CONTAINER
              LIBRBD
                LIBRBD
          LIBRADOS
            LIBRADOS




         MM

    MM                   MM
CONTAINER
CONTAINER            VM
                      VM        CONTAINER
                                CONTAINER
   LIBRBD
     LIBRBD                        LIBRBD
                                     LIBRBD
  LIBRADOS
    LIBRADOS                      LIBRADOS
                                    LIBRADOS




                    MM

               MM          MM
HOST
         HOST
     KRBD (KERNEL MODULE)
      KRBD (KERNEL MODULE)
          LIBRADOS
            LIBRADOS




        MM

MM                       MM
RADOS Block Device

  storage of disk images in RADOS

  decouple VM from host

  images striped across entire cluster
  (pool)

  snapshots

  copy-on-write clones

  support in
   
      Qemu/KVM
   
      mainline Linux kernel (2.6.39+)
   
      OpenStack, CloudStack
instant copy




144
      0         0        0   0
read



              read




                  read




144   4   = 148
APP
         APP                    APP
                                 APP                   HOST/VM
                                                        HOST/VM                  CLIENT
                                                                                  CLIENT




                        RADOSGW
                         RADOSGW                RBD
                                                 RBD                      CEPH FS
  LIBRADOS
   LIBRADOS
                        AA bucket-based REST AA reliable and fully-
                          bucket-based REST    reliable and fully-        A POSIX-compliant
  AA library allowing
    library allowing    gateway, compatible
                         gateway, compatible distributed block
                                              distributed block           distributed file system,
  apps to directly
   apps to directly     with S3 and Swift
                         with S3 and Swift   device, with a a Linux
                                              device, with Linux          with a Linux kernel
  access RADOS,
   access RADOS,                             kernel client and a a
                                              kernel client and           client and support for
  with support for
   with support for                          QEMU/KVM driver
                                              QEMU/KVM driver             FUSE
  C, C++, Java,
   C, C++, Java,
  Python, Ruby,
   Python, Ruby,
  and PHP
   and PHP




RADOS
 RADOS

AA reliable, autonomous, distributed object store comprised of self-healing, self-managing,
   reliable, autonomous, distributed object store comprised of self-healing, self-managing,
intelligent storage nodes
  intelligent storage nodes
CLIENT
                CLIENT


metadata
                     01
                      01   data
                     10
                      10




                MM

           MM              MM
MM

MM        MM
Metadata Server (MDS)

  manages metadata for POSIX shared file
  system
   
      directory hierarchy
   
      file metadata (size, owner,
      timestamps)

  stores metadata in RADOS

  does not serve file data to clients

  only required for the shared file system
one tree




three metadata servers




                                    ??
                                         51
52
53
54
55
DYNAMIC SUBTREE PARTITIONING

                               56
recursive accounting


    ceph-mds tracks recursive directory stats
     
         file sizes
     
         file and directory counts
     
         modification time

    efficient


 $ ls -alSh   | head
 total 0
 drwxr-xr-x   1   root         root        9.7T   2011-02-04   15:51   .
 drwxr-xr-x   1   root         root        9.7T   2010-12-16   15:06   ..
 drwxr-xr-x   1   pomceph      pg4194980   9.6T   2011-02-24   08:25   pomceph
 drwxr-xr-x   1   mcg_test1    pg2419992    23G   2011-02-02   08:57   mcg_test1
 drwx--x---   1   luko         adm          19G   2011-01-21   12:17   luko
 drwx--x---   1   eest         adm          14G   2011-02-04   16:29   eest
 drwxr-xr-x   1   mcg_test2    pg2419992   3.0G   2011-02-02   09:34   mcg_test2
 drwx--x---   1   fuzyceph     adm         1.5G   2011-01-18   10:46   fuzyceph
 drwxr-xr-x   1   dallasceph   pg275       596M   2011-01-14   10:06   dallasceph
snapshots

 
     snapshot arbitrary subdirectories
 
     simple interface
      
        hidden '.snap' directory
      
        no special tools



$ mkdir foo/.snap/one   # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776      # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one   # remove snapshot
multiple protocols, implementations

    Linux kernel client
      
         mount -t ceph 1.2.3.4:/ /mnt
      
         export (NFS), Samba (CIFS)

    ceph-fuse
                                         NFS                  SMB/CIFS

    libcephfs.so
      
         your app
      
         Samba (CIFS)                    Ganesha                   Samba
      
         Ganesha (NFS)                     libcephfs                libcephfs
      
         Hadoop (map/reduce)

                                          Hadoop                  your app
                                           libcephfs                libcephfs


                                                  ceph-fuse
                                        ceph           fuse
                                                       kernel
Current Status




                 60
current status

argonaut stable release v0.48.x
   
      rados, RBD, radosgw

bobtail stable release v0.56.x
   
       RBD cloning
   
       improved performance, scaling, failure behavior
   
       radosgw API and performance improvements
   
       just released
cuttlefish roadmap


    RBD
      
         Xen integration, iSCSI

    radosgw
      
         multi-side federation, disaster recovery

    RADOS
      
         improved data integrity checking, disaster recovery
      
         ongoing performance improvements

    file system
      
         guiding ongoing community development
      
         robust failure recovery, stability, fsck
Contact Sage
Email: sage@inktank.com

Follow Sage on Twitter:
@Liewegas
Voting Questions




                   64
Inktank’s Professional Services
Consulting Services:
 •   Technical Overview
 •   Infrastructure Assessment
 •   Proof of Concept
 •   Implementation Support
 •   Performance Tuning

Support Subscriptions:
 •   Pre-Production Support
 •   Production Support

A full description of our services can be found at the following:
Consulting Services: http://www.inktank.com/consulting-services/
Support Subscriptions: http://www.inktank.com/support-services/
Check out our on-demand webinars
•   Getting Started with Ceph

•   Introduction to Ceph with OpenStack

•   DreamHost Case Study: DreamObjects with Ceph

•   Advanced Features of Ceph Distributed Storage
    (delivered by Sage Weil, creator of Ceph)

All can be watched at:
http://www.inktank.com/news-events/webinars/
Contact Us
Info@inktank.com
1-855-INKTANK

Don’t forget to follow us on:

Twitter: https://twitter.com/inktank

Facebook: http://www.facebook.com/inktank

YouTube: http://www.youtube.com/inktankstorage
Thank you!




             68

More Related Content

Recently uploaded

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 

Recently uploaded (20)

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 

Featured

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
Marius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
Expeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
Pixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
marketingartwork
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
Skeleton Technologies
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 

Featured (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Webinar - Advance Ceph Features

  • 1. Inktank Delivering the Future of Storage Advanced Features of the Ceph Distributed Storage System February 12, 2013
  • 2. outline  why you should care  what is it, what it's for  how it works  architecture  how you can use it  librados  radosgw  RBD, the ceph block device  distributed file system  roadmap  why we do this, who we are
  • 3. company that provides • distributed unified object, professional services and block and file storage support for Ceph platform • founded in 2011 • created by storage experts • funded by DreamHost • open source • Mark Shuttleworth invested $1M • in the Linux Kernel • Sage Weil, CTO and • integrated into Cloud creator of Ceph Platforms
  • 4. why should you care about another storage system? 4
  • 5. requirements diverse storage needs  object storage  block devices (for VMs) with snapshots, cloning  shared file system with POSIX, coherent caches  structured data... files, block devices, or objects? scale  terabytes, petabytes, exabytes  heterogeneous hardware  reliability and fault tolerance
  • 6. time ease of administration no manual data migration, load balancing painless scaling  expansion and contraction  seamless migration
  • 7. cost linear function of size or performance incremental expansion  no fork-lift upgrades no vendor lock-in  choice of hardware  choice of software open
  • 9. unified storage system objects  native  RESTful block  thin provisioning, snapshots, cloning file  strong consistency, snapshots
  • 10. distributed storage system data center scale  10s to 10,000s of machines  terabytes to exabytes fault tolerant  no single point of failure  commodity hardware self-managing, self-healing
  • 11. how do you design a storage system that (really) scales? 11
  • 12. DISK DISK DISK DISK DISK DISK HUMAN HUMAN COMPUTER COMPUTER DISK DISK DISK DISK DISK DISK DISK DISK
  • 13. DISK DISK DISK DISK HUMAN HUMAN DISK DISK HUMAN HUMAN COMPUTER COMPUTER DISK DISK DISK DISK HUMAN HUMAN DISK DISK DISK DISK
  • 14. HUMAN HUMAN HUMAN HUMAN HUMAN HUMAN HUMAN HUMAN DISK DISK HUMAN HUMAN HUMAN HUMAN DISK DISK HUMAN HUMAN DISK DISK HUMAN HUMAN DISK DISK HUMAN HUMAN DISK DISK HUMAN HUMAN HUMAN DISK DISK HUMAN (COMPUTER) ) (COMPUTER HUMAN HUMAN DISK DISK HUMAN HUMAN HUMAN HUMAN DISK DISK HUMAN HUMAN HUMAN HUMAN DISK DISK HUMAN HUMAN DISK DISK HUMAN HUMAN DISK DISK HUMAN HUMAN HUMAN HUMAN DISK DISK HUMAN HUMAN HUMAN HUMAN (actually more like this…)
  • 15. COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK HUMAN HUMAN COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK HUMAN HUMAN COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK HUMAN HUMAN COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK
  • 16. APP APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RADOSGW RBD RBD CEPH FS CEPH FS LIBRADOS LIBRADOS AA bucket-based REST AA reliable and fully- bucket-based REST reliable and fully- AA POSIX-compliant POSIX-compliant AA library allowing library allowing gateway, compatible gateway, compatible distributed block distributed block distributed file system, distributed file system, apps to directly apps to directly with S3 and Swift with S3 and Swift device, with a a Linux device, with Linux with a a Linux kernel with Linux kernel access RADOS, access RADOS, kernel client and a a kernel client and client and support for client and support for with support for with support for QEMU/KVM driver QEMU/KVM driver FUSE FUSE C, C++, Java, C, C++, Java, Python, Ruby, Python, Ruby, and PHP and PHP RADOS RADOS AA reliable, autonomous, distributed object store comprised of self-healing, self-managing, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes
  • 17. why start with objects? more useful than (disk) blocks  names in a single flat namespace  variable size  simple API with rich semantics more scalable than files  no hard-to-distribute hierarchy  update semantics do not span objects  workload is trivially parallel
  • 18. ceph object model pools  1s to 100s  independent namespaces or object collections  replication level, placement policy objects  bazillions  blob of data (bytes to gigabytes)  attributes (e.g., “version=12”; bytes to kilobytes)  key/value bundle (bytes to gigabytes)
  • 19. OSD OSD OSD OSD OSD btrfs FS FS FS FS FS xfs ext4 DISK DISK DISK DISK DISK M M M
  • 20. Object Storage Daemons (OSDs)  10s to 10000s in a cluster  one per disk, SSD, or RAID group, or ...  hardware agnostic  serve stored objects to clients  intelligently peer to perform replication and recovery tasks Monitors  maintain cluster membership and state M  provide consensus for distributed decision-making  small, odd number  these do not served stored objects to clients
  • 21. HUMAN M M M
  • 22. data distribution all objects are replicated N times objects are automatically placed, balanced, migrated in a dynamic cluster must consider physical infrastructure  ceph-osds on hosts in racks in rows in data centers three approaches  pick a spot; remember where you put it  pick a spot; write down where you put it  calculate where to put it, where to find it
  • 23. CRUSH  pseudo-random placement algorithm  fast calculation, no lookup  repeatable, deterministic  statistically uniform distribution  stable mapping  limited data migration on change  rule-based configuration  infrastructure topology aware  adjustable replication  allows weighting
  • 24. 10 10 01 01 10 10 01 11 01 10 10 10 01 01 10 10 01 11 01 10 hash(object name) % num pg 1010 1010 0101 0101 1010 1010 0101 1111 0101 1010 CRUSH(pg, cluster state, policy)
  • 25. 10 10 01 01 10 10 01 11 01 10 10 10 01 01 10 10 01 11 01 10 1010 1010 0101 0101 1010 1010 0101 1111 0101 1010
  • 27.
  • 28.
  • 29. CLIENT ??
  • 30. APP APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RADOSGW RBD RBD CEPH FS CEPH FS LIBRADOS AA bucket-based REST AA reliable and fully- bucket-based REST reliable and fully- AA POSIX-compliant POSIX-compliant A library allowing gateway, compatible gateway, compatible distributed block distributed block distributed file system, distributed file system, apps to directly with S3 and Swift with S3 and Swift device, with a a Linux device, with Linux with a a Linux kernel with Linux kernel access RADOS, kernel client and a a kernel client and client and support for client and support for with support for QEMU/KVM driver QEMU/KVM driver FUSE FUSE C, C++, Java, Python, Ruby, and PHP RADOS RADOS AA reliable, autonomous, distributed object store comprised of self-healing, self-managing, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes
  • 31. APP APP LIBRADOS LIBRADOS native MM MM MM
  • 32. librados  direct access to RADOS from L applications  C, C++, Python, PHP, Java, Erlang  direct access to storage nodes  no HTTP overhead
  • 33. rich librados API  atomic single-object transactions  update data, attr together  compare-and-swap  efficient key/value storage inside an object  object-granularity snapshot primitives  embed code in ceph-osd daemon via plugin API  inter-client communication via object
  • 34. APP APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RBD RBD CEPH FS CEPH FS LIBRADOS LIBRADOS A bucket-based REST AA reliable and fully- reliable and fully- AA POSIX-compliant POSIX-compliant AA library allowing library allowing gateway, compatible distributed block distributed block distributed file system, distributed file system, apps to directly apps to directly with S3 and Swift device, with a a Linux device, with Linux with a a Linux kernel with Linux kernel access RADOS, access RADOS, kernel client and a a kernel client and client and support for client and support for with support for with support for QEMU/KVM driver QEMU/KVM driver FUSE FUSE C, C++, Java, C, C++, Java, Python, Ruby, Python, Ruby, and PHP and PHP RADOS RADOS AA reliable, autonomous, distributed object store comprised of self-healing, self-managing, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes
  • 35. APP APP APP APP REST RADOSGW RADOSGW RADOSGW RADOSGW LIBRADOS LIBRADOS LIBRADOS LIBRADOS native MM MM MM
  • 36. RADOS Gateway  REST-based object storage proxy  uses RADOS to store objects  API supports buckets, accounting  usage accounting for billing purposes  compatible with S3, Swift APIs
  • 37. APP APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RADOSGW RBD CEPH FS CEPH FS LIBRADOS LIBRADOS AA bucket-based REST A reliable and fully- bucket-based REST AA POSIX-compliant POSIX-compliant AA library allowing library allowing gateway, compatible gateway, compatible distributed block distributed file system, distributed file system, apps to directly apps to directly with S3 and Swift with S3 and Swift device, with a Linux with a a Linux kernel with Linux kernel access RADOS, access RADOS, kernel client and a client and support for client and support for with support for with support for QEMU/KVM driver FUSE FUSE C, C++, Java, C, C++, Java, Python, Ruby, Python, Ruby, and PHP and PHP RADOS RADOS AA reliable, autonomous, distributed object store comprised of self-healing, self-managing, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes
  • 38. COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER COMPUTER COMPUTER DISK DISK DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK
  • 39. COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK VM VM COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK VM VM COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK VM VM COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK COMPUTER COMPUTER DISK DISK
  • 40. VM VM VIRTUALIZATION CONTAINER VIRTUALIZATION CONTAINER LIBRBD LIBRBD LIBRADOS LIBRADOS MM MM MM
  • 41. CONTAINER CONTAINER VM VM CONTAINER CONTAINER LIBRBD LIBRBD LIBRBD LIBRBD LIBRADOS LIBRADOS LIBRADOS LIBRADOS MM MM MM
  • 42. HOST HOST KRBD (KERNEL MODULE) KRBD (KERNEL MODULE) LIBRADOS LIBRADOS MM MM MM
  • 43. RADOS Block Device  storage of disk images in RADOS  decouple VM from host  images striped across entire cluster (pool)  snapshots  copy-on-write clones  support in  Qemu/KVM  mainline Linux kernel (2.6.39+)  OpenStack, CloudStack
  • 44. instant copy 144 0 0 0 0
  • 45.
  • 46. read read read 144 4 = 148
  • 47. APP APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RADOSGW RBD RBD CEPH FS LIBRADOS LIBRADOS AA bucket-based REST AA reliable and fully- bucket-based REST reliable and fully- A POSIX-compliant AA library allowing library allowing gateway, compatible gateway, compatible distributed block distributed block distributed file system, apps to directly apps to directly with S3 and Swift with S3 and Swift device, with a a Linux device, with Linux with a Linux kernel access RADOS, access RADOS, kernel client and a a kernel client and client and support for with support for with support for QEMU/KVM driver QEMU/KVM driver FUSE C, C++, Java, C, C++, Java, Python, Ruby, Python, Ruby, and PHP and PHP RADOS RADOS AA reliable, autonomous, distributed object store comprised of self-healing, self-managing, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes
  • 48. CLIENT CLIENT metadata 01 01 data 10 10 MM MM MM
  • 49. MM MM MM
  • 50. Metadata Server (MDS)  manages metadata for POSIX shared file system  directory hierarchy  file metadata (size, owner, timestamps)  stores metadata in RADOS  does not serve file data to clients  only required for the shared file system
  • 51. one tree three metadata servers ?? 51
  • 52. 52
  • 53. 53
  • 54. 54
  • 55. 55
  • 57. recursive accounting  ceph-mds tracks recursive directory stats  file sizes  file and directory counts  modification time  efficient $ ls -alSh | head total 0 drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 . drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 .. drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph drwxr-xr-x 1 mcg_test1 pg2419992 23G 2011-02-02 08:57 mcg_test1 drwx--x--- 1 luko adm 19G 2011-01-21 12:17 luko drwx--x--- 1 eest adm 14G 2011-02-04 16:29 eest drwxr-xr-x 1 mcg_test2 pg2419992 3.0G 2011-02-02 09:34 mcg_test2 drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
  • 58. snapshots  snapshot arbitrary subdirectories  simple interface  hidden '.snap' directory  no special tools $ mkdir foo/.snap/one # create snapshot $ ls foo/.snap one $ ls foo/bar/.snap _one_1099511627776 # parent's snap name is mangled $ rm foo/myfile $ ls -F foo bar/ $ ls -F foo/.snap/one myfile bar/ $ rmdir foo/.snap/one # remove snapshot
  • 59. multiple protocols, implementations  Linux kernel client  mount -t ceph 1.2.3.4:/ /mnt  export (NFS), Samba (CIFS)  ceph-fuse NFS SMB/CIFS  libcephfs.so  your app  Samba (CIFS) Ganesha Samba  Ganesha (NFS) libcephfs libcephfs  Hadoop (map/reduce) Hadoop your app libcephfs libcephfs ceph-fuse ceph fuse kernel
  • 61. current status argonaut stable release v0.48.x  rados, RBD, radosgw bobtail stable release v0.56.x  RBD cloning  improved performance, scaling, failure behavior  radosgw API and performance improvements  just released
  • 62. cuttlefish roadmap  RBD  Xen integration, iSCSI  radosgw  multi-side federation, disaster recovery  RADOS  improved data integrity checking, disaster recovery  ongoing performance improvements  file system  guiding ongoing community development  robust failure recovery, stability, fsck
  • 63. Contact Sage Email: sage@inktank.com Follow Sage on Twitter: @Liewegas
  • 65. Inktank’s Professional Services Consulting Services: • Technical Overview • Infrastructure Assessment • Proof of Concept • Implementation Support • Performance Tuning Support Subscriptions: • Pre-Production Support • Production Support A full description of our services can be found at the following: Consulting Services: http://www.inktank.com/consulting-services/ Support Subscriptions: http://www.inktank.com/support-services/
  • 66. Check out our on-demand webinars • Getting Started with Ceph • Introduction to Ceph with OpenStack • DreamHost Case Study: DreamObjects with Ceph • Advanced Features of Ceph Distributed Storage (delivered by Sage Weil, creator of Ceph) All can be watched at: http://www.inktank.com/news-events/webinars/
  • 67. Contact Us Info@inktank.com 1-855-INKTANK Don’t forget to follow us on: Twitter: https://twitter.com/inktank Facebook: http://www.facebook.com/inktank YouTube: http://www.youtube.com/inktankstorage

Editor's Notes

  1. 20121129
  2. It was at this point that computers had to learn how to consolidate the storage available in multiple hard drives and present it in a unified way to their human patrons.
  3. This causes a whole new set of problems. How do you prevent two people from writing the same file at the same time? How do you make sure that one person can’t see data intended for someone else?
  4. And, as you might imagine, these solutions led to other problems. Notice a bottleneck in this picture?
  5. Blam!! The invention of distributed storage was another very important point in the history of information storage. For the first time, every part of the system could be scaled.
  6. RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
  7. Let’s start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you’ve got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it’s stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
  8. Applications wanting to store objects into RADOS interact with the cluster as a single entity.
  9. The way CRUSH is configured is somewhat unique. Instead of defining pools for different data types, workgroups, subnets, or applications, CRUSH is configured with the physical topology of your storage network. You tell it how many buildings, rooms, shelves, racks, and nodes you have, and you tell it how you want data placed. For example, you could tell CRUSH that it’s okay to have two replicas in the same building, but not on the same power circuit. You also tell it how many copies to keep.
  10. With CRUSH, the data is first split into a certain number of sections. These are called “placement groups”. The number of placement groups is configurable. Then, the CRUSH algorithm runs, having received the latest cluster map and a set of placement rules, and it determines where the placement group belongs in the cluster. This is a pseudo-random calculation, but it’s also repeatable; given the same cluster state and rule set, it will always return the same results.
  11. Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of a placement group, and no two nodes contain the same information? That’s important.
  12. When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
  13. What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
  14. The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
  15. Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
  16. Next, let’s talk about librados. Librados is a native C library that allows applications to work with RADOS. There are similar libraries available for C++, Java, Python, Ruby, and PHP.
  17. So applications link with librados, allowing them to interact with RADOS through a native protocol.
  18. The radosgw component is a REST-based interface to RADOS. It allows developers to build applications that work with Ceph through standard web services.
  19. So, for example, an application can use a REST-based API to work with radosgw, and radosgw talks to RADOS using a native protocol. You can deploy as many gateways as you need, and you can use standard HTTP load balancers. User authentication and S3-style buckets are also supported, and applications written to work with Amazon S3 or OpenStack Swift will automatically work with radosgw by just changing their endpoint.
  20. The RADOS Block Device (RBD) allows users to store virtual disks inside RADOS.
  21. OK. This gets a little bit abstract now. Sometimes, people don’t want to store files or objects, they want to store entire disks. Virtual disks. Collections of data that can be assembled together and presented to a computer, which would see it as a real hard drive with platters and sectors.
  22. … or the more modern way, which is to boot multiple virtual machines using disks from a storage cluster.
  23. For example, you can use a virtualization container like KVM or QEMU to boot virtual machines from images that have been stored in RADOS. Images are striped across the entire cluster, which allows for simultaneous read access from different cluster nodes.
  24. Separating a virtual computer from its storage also lets you do really neat things, like migrate a virtual machine from one server to another without rebooting it.
  25. As an alternative, machines (even those running on bare metal) can mount an RBD image using native Linux kernel drivers.
  26. With Ceph, copying an RBD image four times gives you five total copies…but only takes the space of one. It also happens instantly.
  27. When they read, though, they read through to the original copy if there’s no newer data.
  28. Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
  29. Remember all that meta-data we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  30. There are multiple MDSs!
  31. If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
  32. So how do you have one tree and multiple servers?
  33. If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  34. When the second one comes along, it will intelligently partition the work by taking a subtree.
  35. When the third MDS arrives, it will attempt to split the tree again.
  36. Same with the fourth.
  37. A MDS can actually even just take a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.
  38. Which components of Ceph are you using today? Which components of Ceph are you interested in using in the future? What additional functionality would you like to see in Ceph? Free form
  39. Technical Overview We will explain the architecture, functionality, best practices, and typical use cases. We’ll take you through code review, the technology road map and review your business goals. Infrastructure Assessment Our team will conduct an in-depth, on-site assessment of your current storage environment to fully understand your architecture, application, and workload. Inktank engineers will customize a solution for your business needs. Proof of Concept The best way to determine whether Ceph is the right solution for you is to set up your own test cluster. With our configuration, benchmarking tools and services, we can guide you to the right solution for your particular application. Implementation Support Reduce project risk by adding Ceph experts to your team. We will work with you to build the most reliable, secure, and robust storage system for your application. We can help you adapt your applications to fully leverage Ceph or extend the Ceph core to provide the features you need. Performance Tuning Through detailed measurements, we can tune the components in a production Ceph cluster to achieve the optimum configuration for your actual workload. Inktank's Pre-Production Subscriptio n is ideal for architects and administrators setting up a Ceph environment and needing support during the installation and configuration process.  Features Email support channel One named company contact SLA-backed response Access to support ticketing system Unlimited support requests Inktank Production Support Silver & Gold Subscriptions