virtual machine block storage with
the ceph distributed storage system


sage weil
xensummit – august 28, 2012
outline
●   why you should care
●   what is it, what it does
●   how it works, how you can use it
    ●   architecture
    ●   objects, recovery
●   rados block device
    ●   integration
    ●   path forward
●   who we are, why we do this
why should you care about another storage system?

      requirements, time, cost
requirements
●   diverse storage needs
    ●   object storage
    ●   block devices (for VMs) with snapshots, cloning
    ●   shared file system with POSIX, coherent caches
    ●   structured data... files, block devices, or objects?
●   scale
    ●   terabytes, petabytes, exabytes
    ●   heterogeneous hardware
    ●   reliability and fault tolerance
time
●   ease of administration
●   no manual data migration, load balancing
●   painless scaling
    ●   expansion and contraction
    ●   seamless migration
cost
●   low cost per gigabyte
●   no vendor lock-in
●   software solution
    ●   run on commodity hardware
●   open source
what is ceph?
[Architecture diagram: APP, APP, HOST/VM, and CLIENT sit atop LIBRADOS, RADOSGW, RBD, and CEPH FS respectively, all built on RADOS.]

LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
open source
●   LGPLv2
    ●   copyleft
    ●   free to link to proprietary code
●   no copyright assignment
    ●   no dual licensing
    ●   no “enterprise-only” feature set
●   active community
●   commercial support available
distributed storage system
●   data center (not geo) scale
    ●   10s to 10,000s of machines
    ●   terabytes to exabytes
●   fault tolerant
    ●   no SPoF
    ●   commodity hardware
        –   ethernet, SATA/SAS, HDD/SSD
        –   RAID, SAN probably a waste of time, power, and money
object storage model
●   pools
    ●   1s to 100s
    ●   independent namespaces or object collections
    ●   replication level, placement policy
●   objects
    ●   trillions
    ●   blob of data (bytes to gigabytes)
    ●   attributes (e.g., “version=12”; bytes to kilobytes)
    ●   key/value bundle (bytes to gigabytes)
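
The pool/object model maps directly onto the librados API. A minimal sketch using the Python librados bindings; the pool name, object name, and contents are made up for illustration:

```python
import rados

# Connect to the cluster using the usual config file (assumed path).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# A pool is an independent namespace; 'data' is a hypothetical pool name.
ioctx = cluster.open_ioctx('data')

# An object is a blob of data plus attributes.
ioctx.write_full('greeting', b'hello world')    # the data blob
ioctx.set_xattr('greeting', 'version', b'12')   # an attribute, as on the slide

print(ioctx.read('greeting'))                   # b'hello world'
print(ioctx.get_xattr('greeting', 'version'))   # b'12'

ioctx.close()
cluster.shutdown()
```
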
object storage cluster
●   conventional client/server model doesn't scale
    ●   server(s) become bottlenecks; proxies are inefficient
    ●   if storage devices don't coordinate, clients must
●   ceph-osds are intelligent storage daemons
    ●   coordinate with peers
    ●   sensible, cluster-aware protocols
    ●   sit on local file system
         –   btrfs, xfs, ext4, etc.
         –   leveldb
[Diagram: five OSDs, each a daemon sitting on a local file system (btrfs, xfs, or ext4) on its own disk, plus a small set of monitors (M).]
Monitors:
•   Maintain cluster state
•   Provide consensus for distributed decision-making
•   Small, odd number
•   Do not serve stored objects to clients

OSDs:
•   One per disk or RAID group
•   At least three in a cluster
•   Serve stored objects to clients
•   Intelligently peer to perform replication tasks
[Diagram: a HUMAN administrator interacts with the cluster through the monitors (M).]
data distribution
●   all objects are replicated N times
●   objects are automatically placed, balanced, and migrated in a dynamic cluster
●   must consider physical infrastructure
    ●   ceph-osds on hosts in racks in rows in data centers
●   three approaches
    ●   pick a spot; remember where you put it
    ●   pick a spot; write down where you put it
    ●   calculate where to put it, where to find it
CRUSH
•   Pseudo-random placement algorithm
•   Fast calculation, no lookup
•   Ensures even distribution
•   Repeatable, deterministic
•   Rule-based configuration
    •   specifiable replication
    •   infrastructure topology aware
    •   allows weighting
•   Stable mapping
    •   Limited data migration
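
CRUSH itself is hierarchy- and weight-aware, but the core "calculate, don't look up" idea can be shown with a much simpler stand-in. The toy function below uses rendezvous (highest-random-weight) hashing; it is not CRUSH, just an illustration of deterministic, lookup-free placement:

```python
import hashlib

def place(obj, osds, replicas=3):
    """Toy placement: rank every OSD by a hash of (object, osd) and keep
    the top `replicas`. Any client computes the same answer with no
    central lookup, and removing one OSD only remaps the objects that
    had chosen it (a stable mapping, like CRUSH's)."""
    score = lambda osd: hashlib.sha1(f'{obj}:{osd}'.encode()).hexdigest()
    return sorted(osds, key=score)[:replicas]

osds = [f'osd.{i}' for i in range(10)]
print(place('rbd_data.1234.0000000000000001', osds))
```
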
distributed object storage
●   CRUSH tells us where data should go
    ●   small “osd map” records cluster state at a point in time
    ●   ceph-osd node status (up/down, weight, IP)
    ●   CRUSH function specifying desired data distribution
●   object storage daemons (RADOS)
    ●   store it there
    ●   migrate it as the cluster changes
●   decentralized, distributed approach allows
    ●   massive scales (10,000s of servers or more)
    ●   efficient data access
    ●   the illusion of a single copy with consistent behavior
large clusters aren't static
●   dynamic cluster
    ●   nodes are added, removed; nodes reboot, fail, recover
    ●   recovery is the norm
●   osd maps are versioned
    ●   shared via gossip
●   any map update potentially triggers data migration
    ●   ceph-osds monitor peers for failure
    ●   new nodes register with monitor
    ●   administrator adjusts weights, marks out old hardware, etc.
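
To see why a map update only triggers limited migration, extend the toy placement function from the CRUSH sketch above: recompute placements after marking one OSD out and count how many objects actually move (illustrative only, not real osd map handling):

```python
import hashlib

def place(obj, osds, replicas=3):
    score = lambda osd: hashlib.sha1(f'{obj}:{osd}'.encode()).hexdigest()
    return sorted(osds, key=score)[:replicas]

objects = [f'obj-{i}' for i in range(10000)]
osds = [f'osd.{i}' for i in range(10)]

before = {o: set(place(o, osds)) for o in objects}
osds.remove('osd.3')                     # simulate a failed or marked-out OSD
after = {o: set(place(o, osds)) for o in objects}

moved = sum(1 for o in objects if before[o] != after[o])
print(f'{moved / len(objects):.0%} of objects had a replica remapped')
# Roughly the share that had osd.3 among their 3 replicas (about 3 in 10
# here); everything else stays put.
```
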
[Diagram: a CLIENT faces the cluster and asks: where do I read and write my data?]
what does this mean for my cloud?
●   virtual disks
    ●   reliable
    ●   accessible from many hosts
●   appliances
    ●   great for small clouds
    ●   not viable for public or (large) private clouds
●   avoid single server bottlenecks
●   efficient management
[Diagram: a VM whose VIRTUALIZATION CONTAINER links LIBRBD and LIBRADOS to reach the cluster and its monitors (M).]

[Diagram: multiple CONTAINERs and a VM, each linking LIBRBD and LIBRADOS, all sharing one cluster.]

[Diagram: a HOST using KRBD (kernel module), which speaks the RADOS protocol directly, to expose a block device.]
RBD: RADOS Block Device
•   Replicated, reliable, high-performance virtual disk
•   Allows decoupling of VMs and containers
    •   Live migration!
•   Images are striped across the cluster
•   Snapshots!
•   Native support in the Linux kernel
    •   /dev/rbd1
•   librbd allows easy integration
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
instant copy

[Diagram: a clone initially references all 144 of its parent's objects and allocates none of its own: 144 + 0 + 0 + 0 + 0 = 144. Creating the copy is effectively free.]

write

[Diagram: a CLIENT writes to the clone; each write allocates a new object in the clone instead of touching the parent: 144 + 4 = 148.]

read

[Diagram: a CLIENT reads from the clone; unwritten ranges are served from the parent's objects, written ranges from the clone's own 4 objects: 144 + 4 = 148.]
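
This copy-on-write behavior is what RBD layering exposes: snapshot a golden image, protect the snapshot, and clone it; the clone shares the parent's objects until written. A sketch with the Python rbd bindings (pool, image, and snapshot names are made up; layered cloning requires format-2 images, which were only just landing around the time of this talk):

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Snapshot and protect the golden image so clones can hang off it.
with rbd.Image(ioctx, 'golden-vm') as parent:
    parent.create_snap('base')
    parent.protect_snap('base')

# "Instant copy": the clone only references the parent's objects; no data moves.
rbd.RBD().clone(ioctx, 'golden-vm', 'base', ioctx, 'vm-0001',
                features=rbd.RBD_FEATURE_LAYERING)

# Writes allocate new objects in the clone (the 144 + 4 = 148 on the slides);
# reads of untouched ranges fall through to the parent's objects.
with rbd.Image(ioctx, 'vm-0001') as clone:
    clone.write(b'\x01' * 4096, 0)   # copy-on-write: new object in the clone
    clone.read(8192, 4096)           # unwritten range: served from the parent

ioctx.close()
cluster.shutdown()
```
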
current RBD integration
●   native Linux kernel support
    ●   /dev/rbd0, /dev/rbd/<poolname>/<imagename>
●   librbd
    ●   user-level library
●   Qemu/KVM
    ●   links to librbd user-level library
●   libvirt
    ●   librbd-based storage pool
    ●   understands RBD images
    ●   can only start KVM VMs... :-(
●   CloudStack, OpenStack
what about Xen?
●   Linux kernel driver (i.e., /dev/rbd0)
    ●   easy fit into existing stacks
    ●   works today
    ●   need recent Linux kernel for dom0
●   blktap
    ●   generic kernel driver, userland process
    ●   easy integration with librbd
    ●   more featureful (cloning, caching), maybe faster
    ●   doesn't exist yet!
●   rbd-fuse
    ●   coming soon!
libvirt
●   CloudStack, OpenStack
●   libvirt understands rbd images, storage pools
    ●   xml specifies cluster, pool, image name, auth
●   currently only usable with KVM
●   could configure /dev/rbd devices for VMs
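
For KVM guests, the libvirt disk XML names the protocol, monitor host, pool/image, and auth secret. A hedged sketch using the libvirt Python bindings to hot-attach such a disk; the hostnames, pool/image/domain names, and secret UUID are all placeholders:

```python
import libvirt

# Hypothetical pool/image, monitor host, and cephx secret UUID.
disk_xml = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/vm-0001'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

conn = libvirt.open('qemu:///system')   # currently KVM-only, per the slide
dom = conn.lookupByName('guest1')       # hypothetical running domain
dom.attachDevice(disk_xml)              # hot-attach the RBD-backed disk
```
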
librbd
●   management
    ●   create, destroy, list, describe images
    ●   resize, snapshot, clone
●   I/O
    ●   open, read, write, discard, close
●   C, C++, Python bindings
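
Those management and I/O calls map one-to-one onto the bindings. A minimal end-to-end sketch in Python; the pool and image names are illustrative:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

mgr = rbd.RBD()
mgr.create(ioctx, 'scratch', 1 * 1024**3)   # create a 1 GiB image
print(mgr.list(ioctx))                      # list images in the pool

with rbd.Image(ioctx, 'scratch') as img:    # open
    img.write(b'\x00' * 4096, 0)            # write
    img.read(0, 4096)                       # read
    img.discard(0, 4096)                    # discard
    img.resize(2 * 1024**3)                 # resize to 2 GiB
    img.create_snap('before-upgrade')       # snapshot
    img.remove_snap('before-upgrade')       # snaps must go before removal
# close happens on context exit

mgr.remove(ioctx, 'scratch')                # destroy
ioctx.close()
cluster.shutdown()
```
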
RBD roadmap
●   locking
    ●   fence failed VM hosts
●   clone performance
●   KSM (kernel same-page merging) hints
●   caching
    ●   improved librbd caching
    ●   kernel RBD + bcache to local SSD/disk
why
●   limited options for scalable open source storage
●   proprietary solutions
    ●   marry hardware and software
    ●   expensive
    ●   don't scale (out)
●   industry needs to change
who we are
●   Ceph created at UC Santa Cruz (2007)
●   supported by DreamHost (2008-2011)
●   Inktank (2012)
●   growing user and developer community
●   we are hiring
    ●   C/C++/Python developers
    ●   sysadmins, testing engineers
    ●   Los Angeles, San Francisco, Sunnyvale, remote


http://ceph.com/
[Closing slide: repeat of the Ceph architecture overview (LIBRADOS, RADOSGW, RBD, CEPH FS atop RADOS) from earlier in the deck.]
