Inktank
Delivering the Future of Storage


The End of RAID as you know it with Ceph Replication
March 28, 2013
Agenda
l    Intank and Ceph Introduction

l    Ceph Technology

l    Challenges of Raid

l    Ceph Advantages

l    Q&A

l    Resources and Moving Forward
Inktank
•  Company that provides professional services and support for Ceph
•  Founded in 2011
•  Funded by DreamHost
•  Mark Shuttleworth invested $1M
•  Sage Weil, CTO and creator of Ceph

Ceph
•  Distributed unified object, block and file storage platform
•  Created by storage experts
•  Open source
•  In the Linux kernel
•  Integrated into cloud platforms
Ceph Technology Overview
Ceph Technological Foundations

Ceph was built with the following goals:

•  Every component must scale

•  There can be no single points of failure

•  The solution must be software-based, not an appliance

•  Must be open source

•  Should run on readily-available, commodity hardware

•  Everything must self-manage wherever possible



Ceph Innovations
CRUSH data placement algorithm
Algorithm is infrastructure aware and quickly adjusts to failures
Data location is computed rather than looked up
Enables clients to communicate directly with the servers that store their data
Enables clients to perform parallel I/O for greatly enhanced throughput


Reliable Autonomic Distributed Object Store
Storage devices assume complete responsibility for data integrity
They operate independently, in parallel, without central choreography
Very efficient. Very fast. Very scalable.


CephFS Distributed Metadata Server
Highly scalable to large numbers of active/active metadata servers and high throughput
Highly reliable and available, with full POSIX semantics and consistency guarantees
Has both a FUSE client and a client fully integrated into the Linux kernel


Advanced Virtual Block Device
Enterprise storage capabilities from utility server hardware
Thin Provisioned, Allocate-on-Write Snapshots, LUN cloning
In the Linux kernel and integrated with OpenStack components
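To make the block-device capabilities concrete, here is a minimal sketch using the python-rbd bindings. The pool name 'rbd', the image name, and the sizes are placeholders, and it assumes a reachable cluster with a standard client keyring:

```python
import rados
import rbd

# Connect to the cluster; assumes a standard /etc/ceph/ceph.conf and keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')            # placeholder pool for images

# Thin provisioning: a 10 GiB image consumes no space until it is written.
rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)

image = rbd.Image(ioctx, 'vm-disk-1')
image.write(b'boot sector', 0)               # only written extents use space
image.create_snap('golden')                  # allocate-on-write snapshot
image.close()

ioctx.close()
cluster.shutdown()
```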
Unified Storage Platform
Object
   •     Archival and backup storage
   •     Primary data storage
   •     S3-like storage
   •     Web services and platforms
   •     Application development
Block
   •  SAN replacement
   •  Virtual block device, VM images
File
    •  HPC
    •  POSIX-compatible applications
Ceph Unified Storage Platform

OBJECTS: CEPH GATEWAY
   A powerful S3- and Swift-compatible gateway that brings the power of the
   Ceph Object Store to modern applications

VIRTUAL DISKS: CEPH BLOCK DEVICE
   A distributed virtual block device that delivers high-performance,
   cost-effective storage for virtual machines and legacy applications

FILES & DIRECTORIES: CEPH FILE SYSTEM
   A distributed, scale-out filesystem with POSIX semantics that provides
   storage for legacy and modern applications




                            CEPH OBJECT STORE
 A reliable, easy to manage, next-generation distributed object
store that provides storage of unstructured data for applications
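As a concrete example of the object interface, a minimal python-rados sketch; the pool name 'data' and the object name are placeholders, and a reachable cluster with a standard client keyring is assumed:

```python
import rados

# Connect using a standard client configuration.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')            # placeholder pool name

# Write, read back, and attach metadata to an object. RADOS replicates the
# object across OSDs according to the pool's rules; no client-side RAID.
ioctx.write_full('greeting', b'hello ceph')
print(ioctx.read('greeting'))                 # b'hello ceph'
ioctx.set_xattr('greeting', 'lang', b'en')

ioctx.close()
cluster.shutdown()
```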
RADOS Cluster Makeup

[Diagram] Each RADOS node hosts several OSDs, one per disk, each backed by a
local filesystem (btrfs, xfs, or ext4). A RADOS cluster is made up of many
such nodes plus a small, odd number of monitors (M).
RADOS Object Storage Daemons
                     Intelligent Storage Servers

                •    Serve stored objects to clients

                •    OSD is primary for some objects
                      •  Responsible for replication
                      •  Responsible for coherency
                      •  Responsible for re-balancing
                      •  Responsible for recovery

                •    OSD is secondary for some objects
                      •  Under control of primary
                      •  Capable of becoming primary

                •    Supports extended object classes
                      •  Atomic transactions
                      •  Synchronization and notifications
                      •  Send computation to the data

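A conceptual sketch, not Ceph source, of the primary/secondary write flow described above; the class and OSD names are invented for illustration. The primary owns ordering and does not acknowledge until every replica has stored the update, which is what keeps the copies coherent:

```python
class OSD:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write_primary(self, obj, data, secondaries):
        self.store[obj] = data           # persist locally first
        for peer in secondaries:         # fan out, under primary control
            peer.write_replica(obj, data)
        return 'ack'                     # only after all copies exist

    def write_replica(self, obj, data):
        self.store[obj] = data

primary, s1, s2 = OSD('osd.0'), OSD('osd.1'), OSD('osd.2')
print(primary.write_primary('obj-A', b'payload', [s1, s2]))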
CRUSH
 Pseudo-random placement
 algorithm
     •  Deterministic function of
        inputs
     •  Clients can compute data
        location

 Rule-based configuration
     •  Desired/required replica
        count
     •  Affinity/distribution rules
     •  Infrastructure topology
     •  Weighting

 Excellent data distribution
     •  Declustered placement
     •  Excellent data re-distribution
     •  Migration proportional to
        change
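A toy illustration of computed placement. Real CRUSH adds hierarchical buckets, weights, and failure-domain rules; this sketch uses plain rendezvous hashing to show the two properties that matter here: any client can compute locations with no lookup table, and removing a device changes only the placements that involved it:

```python
import hashlib

def place(obj_name, osds, replicas=3):
    # Deterministic function of (object name, cluster map): every client
    # that knows the OSD list computes the same answer.
    def score(osd):
        h = hashlib.md5(f'{obj_name}:{osd}'.encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f'osd.{i}' for i in range(10)]
print(place('my-object', osds))          # same answer on every client

# Dropping a failed OSD from the map only remaps placements that used it;
# everything else stays put (migration proportional to change).
print(place('my-object', [o for o in osds if o != 'osd.3']))
```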
[Diagram] A client computes the location of its data with CRUSH and reads and
writes directly to the OSDs that hold it; no lookup service sits in the data
path.
RADOS Monitors
                Stewards of the Cluster



     M
            •  Distributed consensus (Paxos)

            •  Odd number required (quorum)

            •  Maintain/distribute cluster map

            •  Authentication/key servers

            •  Monitors are not in the data path

            •  Clients talk directly to OSDs




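A one-function sketch of why an odd monitor count is the rule: Paxos makes progress only with a strict majority, so an even-numbered member raises the quorum size without letting the cluster tolerate any additional failures:

```python
def tolerable_failures(n_monitors):
    quorum = n_monitors // 2 + 1         # strict majority
    return n_monitors - quorum

for n in range(1, 8):
    print(n, 'monitors -> survives', tolerable_failures(n), 'failure(s)')
# 3 and 4 monitors both survive only 1 failure; 5 survives 2, and so on.
```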
RAID and its Challenges




Redundant Array of Inexpensive Disks
              Enhanced Reliability
                 •  RAID-1 mirroring
                 •  RAID-5/6 parity (reduced overhead)
                 •  Automated recovery

              Enhanced Performance
                 •    RAID-0 striping
                 •    SAN interconnects
                 •    Enterprise SAS drives
                 •    Proprietary H/W RAID controllers

              Economical Storage Solutions
                 •  Software RAID implementations
                 •  iSCSI and JBODs

              Enhanced Capacity
                 •  Logical volume concatenation

RAID Challenges: Capacity/Speed
•  Storage economies in disks come from more GB per spindle

•  NRE rates are flat (typically estimated at 10^-15 per bit read)
       •  ~4% chance of an NRE while recovering a 4+1 RAID-5 set,
          and the odds rise with the number of volumes in the set
       •  Many RAID controllers fail the whole recovery after an NRE

•  Access speed has not kept up with density increases
       •  27 hours to rebuild a 4+1 RAID-5 set at 20 MB/s,
          during which time a second drive can fail

•  Managing the risk of second failures requires hot-spares
       •  Defeating some of the savings from parity
          redundancy
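A back-of-envelope check of these figures; the 1 TB drive size is an assumption (the slide does not state one), but with it the computed probability lands near the ~4% quoted above:

```python
import math

nre_per_bit = 1e-15                     # typical quoted NRE rate
disk_bytes  = 1e12                      # assumed 1 TB data disks
data_disks  = 4                         # surviving disks read in a 4+1 rebuild

bits_read = data_disks * disk_bytes * 8
p_nre = 1 - math.exp(-nre_per_bit * bits_read)
print(f'P(NRE during rebuild) = {p_nre:.1%}')       # ~3%, rising with set size

# Rebuild time for a 2 TB volume at 20 MB/s:
print(f'rebuild hours = {2e12 / 20e6 / 3600:.1f}')  # ~27.8 h
```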
RAID Challenges: Expansion

•  The next generation of disks will be larger and cost less per
   GB. We would like to use these as we expand

•  Most RAID replication schemes require identical disks: new disks cannot
   be added to an old set, and failed disks must be replaced with identical
   units

•  Proprietary appliances may require replacements from
   manufacturer (at much higher than commodity prices)

•  Many storage systems reach a limit beyond which they cannot be further
   expanded, forcing a fork-lift upgrade

•  Re-balancing existing data over new volumes is non-trivial
RAID Challenges: Reliability/Availability

•  RAID-5 can only survive a single disk failure
   •    The odds of an NRE during recovery are significant
   •    Odds of a second failure during recovery are non-negligible
   •    Annual petabyte durability for RAID-5 is only 3 nines

•  RAID-6 redundancy protects against two disk failures
   •    Odds of an NRE during recovery are still significant
   •    Client data access will be starved out during recovery
   •    Throttling recovery increases the risk of data loss

•  Even RAID-6 can't protect against:
   •    Server failures
   •    NIC failures
   •    Switch failures
   •    OS crashes
   •    Facility or regional disasters
RAID Challenges: Expense

Capital Expenses … good RAID costs
   •    Significant mark-up for enterprise hardware
   •    High performance RAID controllers can add $50-100/disk
   •    SANs further increase the cost
   •    Expensive equipment, much of which is often poorly used
   •    Software RAID is much less expensive, and much slower

Operating Expenses … RAID doesn't manage itself
   •    RAID group, LUN and pool management
   •    Lots of application-specific tunable parameters
   •    Difficult expansion and migration
   •    When a recovery goes bad, it goes very bad
   •    Don't even think about putting off replacing a failed drive
Ceph Advantages
Ceph VALUE PROPOSITION

                        •  Open source
                        •  Runs on commodity hardware
    SAVES MONEY         •  Runs in heterogeneous
                           environments


                        •  Self-managing
     SAVES TIME         •  OK to batch drive replacements
                        •  Emerging platform integration


                        •  Object, block, & filesystem storage
INCREASES FLEXIBILITY   •  Highly adaptable software solution
                        •  Easier deployment of new services


                        •  No vendor lock-in
     LOWERS RISK        •  Rule-configurable failure zones
                        •  Improved reliability and availability
Ceph Advantage: Declustered Placement
•    Consider a failed 2TB RAID mirror
      •  We must copy 2TB from the survivor to the successor
      •  Survivor and successor are likely in same failure zone

•    Consider two RADOS objects clustered on the same primary
      •  Surviving copies are declustered (on different secondaries)
      •  New copies will be declustered (on different successors)
      •  Copy 10GB from each of 200 survivors to 200 successors
      •  Survivors and successors are in different failure zones

•    Benefits
      •  Recovery is parallel and 200x faster
      •  Service can continue during the recovery process
      •  Exposure to 2nd failures is reduced by 200x
      •  Zone aware placement protects against higher level failures
      •  Recovery is automatic and does not await new drives
      •  No idle hot-spares are required
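The arithmetic behind the 200x claim, with an assumed per-node recovery rate of 50 MB/s (any sustained rate gives the same ratio):

```python
failed_tb   = 2.0                        # capacity of the failed OSD
peers       = 200                        # survivors holding its replicas
per_node_mb = 50.0                       # assumed recovery rate per node, MB/s

serial_hours   = failed_tb * 1e6 / per_node_mb / 3600
parallel_hours = serial_hours / peers    # 200 nodes each move ~10 GB
print(f'{serial_hours:.1f} h serially vs {parallel_hours:.2f} h declustered')
```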
[Diagram] After an OSD fails, the surviving replicas of its objects
re-replicate in parallel to many OSDs in other failure zones while clients
continue to read and write.
Ceph Advantage: Object Granularity
•    Consider a failed 2TB RAID mirror
      •  To recover it we must read and write (at least) 2TB
      •  Successor must be same size as failed volume
      •  An error in recovery will probably lose the file system

•    Consider a failed RADOS OSD
      •  To recover it we must read and write thousands of objects
      •  Successor OSDs must each have some free space
      •  An error in recovery will probably lose one object

•    Benefits
      •  Heterogeneous commodity disks are easily supported
      •  Better and more uniform space utilization
      •  Per-object updates always preserve causality ordering
      •  Object updates are more easily replicated over WAN links
      •  Greatly reduced data loss if errors do occur
Ceph Advantage: Intelligent Storage

•  Intelligent OSDs automatically rebalance data
     •  When new nodes are added
     •  When old nodes fail or are decommissioned
     •  When placement policies are changed

•  The resulting rebalancing is very good:
    •  Even distribution of data across all OSDs
    •  Uniform mix of old and new data across all OSDs
    •  Moves only as much data as required

•  Intelligent OSDs continuously scrub their objects
     •  To detect and correct silent write errors before another failure

•  This architecture scales from petabytes to exabytes
    •  A single pool of thin provisioned, self-managing storage
    •  Serving a wide range of block, object, and file clients
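A conceptual sketch of scrubbing, not Ceph's implementation: replicas compare checksums of the objects they hold and repair a copy that has silently diverged (real scrubbing is coordinated by the primary within each placement group):

```python
import hashlib

def scrub(replicas):                     # replicas: list of {obj: bytes} stores
    for obj in replicas[0]:
        digests = [hashlib.sha256(r[obj]).hexdigest() for r in replicas]
        if len(set(digests)) > 1:
            # Majority wins in this toy version.
            good = max(set(digests), key=digests.count)
            src = replicas[digests.index(good)][obj]
            for r in replicas:
                r[obj] = src
            print(f'repaired {obj}')

a = {'obj-1': b'data'}; b = {'obj-1': b'data'}; c = {'obj-1': b'dat#'}  # bit rot
scrub([a, b, c])
print(c['obj-1'])                        # b'data'
```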
Ceph Advantage: Price

•  Can leverage commodity hardware for lowest costs
•  Not locked in to single vendor; get best deal over time
•  RAID not required, leading to lower component costs



                         Enterprise RAID         Ceph Replication

      Raw $/GB           $3.00                   $0.50
      Protected $/GB     $4.00 (RAID-6 6+2)      $1.50 (3 copies)
      Usable (90% full)  $4.44                   $1.67
      Replicated $/GB    $8.88 (main + backup)   $1.67 (copies included)
      Relative expense   533%                    Baseline (100%)
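The table follows directly from its inputs; a short worked version:

```python
raid_raw, ceph_raw = 3.00, 0.50          # raw $/GB

raid_protected = raid_raw * 8 / 6        # RAID-6 6+2: 8 disks hold 6 of data
ceph_protected = ceph_raw * 3            # three full copies

raid_usable = raid_protected / 0.90      # 90% fill factor
ceph_usable = ceph_protected / 0.90

raid_replicated = raid_usable * 2        # main site + backup site
ceph_replicated = ceph_usable            # the 3 copies are the protection

print(f'${raid_replicated:.2f} vs ${ceph_replicated:.2f} per GB '
      f'({raid_replicated / ceph_replicated:.0%} of baseline)')
```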
Q&A
Leverage great online resources

Documentation on the Ceph web site:
   •  http://ceph.com/docs/master/
Blogs from Inktank and the Ceph community:
   •  http://www.inktank.com/news-events/blog/
   •  http://ceph.com/community/blog/
Developer resources:
   •  http://ceph.com/resources/development/
   •  http://ceph.com/resources/mailing-list-irc/
   •  http://dir.gmane.org/gmane.comp.file-systems.ceph.devel




     Leverage Ceph Expert Support
     Inktank will partner with you for complex deployments
         •  Solution design and Proof-of-Concept
         •  Solution customization
         •  Capacity planning
         •  Performance optimization

     Having access to expert support is a production best practice
        •  Troubleshooting
        •  Debugging

     A full description of our services can be found at the following:

     Consulting Services: http://www.inktank.com/consulting-services/

     Support Subscriptions: http://www.inktank.com/support-services/
Check out our upcoming webinars
Ceph Unified Storage for OpenStack
 •  April 4, 2013
 •  10:00AM PT, 12:00PM CT, 1:00PM ET

Technical Deep Dive Into Ceph Object Storage
 •  April 10, 2013
 •  10:00AM PT, 12:00PM CT, 1:00PM ET

Register today at:
http://www.inktank.com/news-events/webinars/
Contact Us
Info@inktank.com
1-855-INKTANK

Don’t forget to follow us on:

   Twitter: https://twitter.com/inktank

   Facebook: http://www.facebook.com/inktank

   YouTube: http://www.youtube.com/inktankstorage
Thank you for joining!
