the ceph distributed storage system
sage weil
scale 10x – january 22, 2012
hello
●   why you should care
●   what is it, what it does
●   how it works, how you can use it
    ●   architecture
    ●   objects
    ●   recovery
    ●   block devices
    ●   file system
●   who we are, why we do this
why should you care about another storage system?
requirements, time, money
requirements
●   diverse storage needs
    ●   object storage (RESTful or low-level)
    ●   block devices (for VMs) with snapshots, cloning
    ●   shared file system with POSIX, coherent caches
●   scale
    ●   terabytes, petabytes. exabytes?
    ●   heterogeneous hardware
    ●   reliability and fault tolerance
time
●   ease of administration
●   no manual data migration, load balancing
●   painless scaling
    ●   expansion and contraction
    ●   seamless migration
money
●   low cost per gigabyte
●   no vendor lock-in

●   software solution
    ●   run on commodity hardware
●   open source
what is ceph?
unified storage system
●   objects
    ●   small or large
    ●   multi-protocol
●   block devices
    ●   snapshots, cloning
[diagram: radosgw, RBD, and the Ceph DFS all layered on RADOS]
●   files
    ●   cache coherent
    ●   snapshots
    ●   usage accounting
open source
●   LGPLv2
    ●   copyleft
    ●   free to link to proprietary code
●   no copyright assignment
    ●   no dual licensing
    ●   no “enterprise-only” feature set
●   active community
●   commercial support
architecture
●   monitors (ceph-mon)
    ●   1s-10s, paxos
    ●   lightweight process
    ●   authentication, cluster membership,
        critical cluster state
●   object storage daemons (ceph-osd)
    ●   1s-10,000s
    ●   smart, coordinate with peers
●   clients (librados, librbd)
    ●   zillions
    ●   authenticate with monitors, talk directly
        to ceph-osds
●   metadata servers (ceph-mds)
    ●   1s-10s
    ●   build POSIX file system on top of objects
rados object storage model
●   pools
    ●   1s to 100s
    ●   independent namespaces or object collections
    ●   replication level, placement policy
●   objects
    ●   trillions
    ●   blob of data (bytes to gigabytes)
    ●   attributes (e.g., “version=12”; bytes to kilobytes)
    ●   key/value bundle (bytes to gigabytes)
rados object API
●   librados.so
    ●   C, C++, Python, Java, shell
●   read/write (extent), truncate, remove; get/set/remove xattr or key
    ●   like a file or .db file
●   efficient copy-on-write clone
●   atomic compound operations/transactions
    ●   read + getxattr, write + setxattr
    ●   compare xattr value, if match write + setxattr
●   classes
    ●   load new code into cluster to implement new methods
    ●   calc sha1, grep/filter, generate thumbnail
    ●   encrypt, increment, rotate image
●   watch/notify
    ●   use object as communication channel between clients (locking primitive)
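a minimal sketch of the API above using the librados Python binding; the pool name, object name, and xattr value are invented for illustration and error handling is omitted:

    import rados

    # connect with the cluster configuration, open an I/O context on a pool
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')               # 'data' is a placeholder pool name

    ioctx.write_full('greeting', b'hello ceph')      # write a whole object
    ioctx.set_xattr('greeting', 'version', b'12')    # attach an attribute
    print(ioctx.read('greeting'))                    # read the object back
    print(ioctx.get_xattr('greeting', 'version'))

    ioctx.close()
    cluster.shutdown()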
object storage
●   client/server, host/device paradigm doesn't scale
    ●   idle servers are wasted servers
    ●   if servers don't coordinate, clients must
●   ceph-osds are intelligent storage daemons
    ●   coordinate with peers over TCP/IP
    ●   intelligent protocols; no 'IP fail over' or similar hacks
●   flexible deployment
    ●   one per disk
    ●   one per host
    ●   one per RAID volume
●   sit on local file system
    ●   btrfs, xfs, ext4, etc.
why we like btrfs
●   pervasive checksumming
●   snapshots, copy-on-write
●   efficient metadata (xattrs)
●   inline data for small files
●   transparent compression
●   integrated volume management
    ●   software RAID, mirroring, error recovery
    ●   SSD-aware
●   online fsck
●   active development community
data distribution
●   all objects are replicated N times
●   objects are automatically placed, balanced, migrated
    in a dynamic cluster
●   must consider physical infrastructure
    ●   ceph-osds on hosts in racks in rows in data centers

●   three approaches
    ●   pick a spot; remember where you put it
    ●   pick a spot; write down where you put it
    ●   calculate where to put it, where to find it
CRUSH
●   pseudo-random placement algorithm
    ●   uniform, weighted distribution
    ●   fast calculation, no lookup
●   placement rules
    ●   in terms of physical infrastructure
        –   “3 replicas, same row, different racks”
●   predictable, bounded migration on changes
    ●   N → N + 1 ceph-osds means a bit over 1/Nth of
        data moves
object placement
●   object → placement group: hash(object name) % num_pg = pg
●   placement group (PG) → OSDs: CRUSH(pg, cluster state, rule) = [A, B]
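a toy sketch of the two-stage mapping in Python; this is not the real CRUSH code (the pg count, osd names, and hashes are placeholders, and a simple rank-by-hash stands in for CRUSH's hierarchy-aware rules), but it shows how any client can compute placement without a lookup table:

    import hashlib

    def object_to_pg(name, num_pgs):
        # stage 1: hash the object name into a placement group
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % num_pgs

    def pg_to_osds(pg, osds, replicas=2):
        # stage 2: stand-in for CRUSH; rank OSDs by a per-pg pseudo-random
        # score and keep the top `replicas`.  an existing OSD's score never
        # changes, so adding an OSD only moves the PGs it now wins.
        score = lambda osd: hashlib.sha1(('%d-%s' % (pg, osd)).encode()).hexdigest()
        return sorted(osds, key=score)[:replicas]

    pg = object_to_pg('myobject', num_pgs=1024)
    print(pg, pg_to_osds(pg, ['osd.0', 'osd.1', 'osd.2', 'osd.3']))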
replication
●   all data replicated N times
●   ceph-osd cluster handles replication
    ●   client writes to first replica
    ●   reduce client bandwidth
    ●   “only once” semantics
    ●   dual in-memory vs on-disk acks on request
recovery
●   dynamic cluster
    ●   nodes are added, removed
    ●   nodes reboot, fail, recover
●   “recovery” is the norm
    ●   “map” records cluster state at point in time
        –   ceph-osd node status (up/down, weight, IP)
        –   CRUSH function specifying desired data distribution
    ●   ceph-osds cooperatively migrate data to achieve that
●   any map update potentially triggers data migration
    ●   ceph-osds monitor peers for failure
    ●   new nodes register with monitor
    ●   administrator adjusts weights, marks out old hardware, etc.
librados, radosgw
●   librados
    ●   direct parallel access to cluster
    ●   rich API
●   radosgw
    ●   RESTful object storage
        –   S3, Swift APIs
    ●   proxy HTTP to rados
    ●   ACL-based security for the big bad internet
[diagram: HTTP clients → haproxy → multiple radosgw instances, each on librados; your app on librados directly]
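since radosgw speaks the S3 dialect, a stock S3 client can talk to it; a hedged sketch with the boto library, where the gateway host, credentials, and bucket name are all placeholders:

    import boto
    import boto.s3.connection

    # point an S3 connection at a radosgw endpoint instead of Amazon
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('demo-bucket')          # create a bucket through the gateway
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('stored via radosgw')  # object lands in rados
    print([b.name for b in conn.get_all_buckets()])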
rbd – rados block device
●   replicated, reliable, high-performance virtual disk
    ●   striped over objects across entire cluster
    ●   thinly provisioned, snapshots
    ●   image cloning (real soon now)
●   well integrated
    ●   Linux kernel driver (/dev/rbd0)
    ●   qemu/KVM + librbd
    ●   libvirt, OpenStack
[diagram: a KVM guest using librbd/librados, and a KVM/Xen host mounting ext4 on /dev/rbd via the rbd kernel driver]
●   sever link between virtual machine and host
    ●   fail-over, live migration
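a sketch of driving RBD from Python through the librbd binding; the pool, image name, and size are made up, and the resulting image could then be attached to a guest via qemu/KVM or the kernel driver:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                    # 'rbd' is the conventional pool

    rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)   # thinly provisioned 10 GB image
    image = rbd.Image(ioctx, 'vm-disk-1')
    image.create_snap('base')                            # point-in-time snapshot
    image.close()

    ioctx.close()
    cluster.shutdown()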
ceph distributed file system
●   shared cluster-coherent file system
●   separate metadata and data paths
    ●   avoid “server” bottleneck inherent in NFS etc
●   ceph-mds cluster
    ●   manages file system hierarchy
    ●   redistributes load based on workload
    ●   ultimately stores everything in objects
●   highly stateful client sessions
    ●   lots of caching, prefetching, locks and leases
an example
●   mount -t ceph 1.2.3.4:/ /mnt
    ●   3 ceph-mon RT
    ●   2 ceph-mds RT (1 ceph-mds to -osd RT)
●   cd /mnt/foo/bar
    ●   2 ceph-mds RT (2 ceph-mds to -osd RT)
●   ls -al
    ●   open
    ●   readdir
         –   1 ceph-mds RT (1 ceph-mds to -osd RT)
    ●   stat each file
    ●   close
●   cp * /tmp
    ●   N ceph-osd RT
dynamic subtree partitioning
[diagram: the directory hierarchy, rooted at Root, carved into subtrees distributed across the ceph-mds daemons]
●   efficient
    ●   hierarchical partition preserves locality
●   dynamic
    ●   daemons can join/leave
    ●   take over for failed nodes
●   scalable
    ●   arbitrarily partition metadata
●   adaptive
    ●   move work from busy to idle servers
    ●   replicate hot metadata
recursive accounting
●   ceph-mds tracks recursive directory stats
    ●   file sizes
    ●   file and directory counts
    ●   modification time
●   virtual xattrs present full stats
●   efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
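the recursive stats are exposed as virtual extended attributes on each directory, so they can be read straight off a CephFS mount; a small sketch (the mount path is an assumption; ceph.dir.rbytes and ceph.dir.rfiles are the recursive byte and file counts):

    import os

    path = '/mnt/ceph/pomceph'                           # placeholder directory on a CephFS mount

    # recursive accounting via virtual xattrs, no tree walk needed
    rbytes = int(os.getxattr(path, 'ceph.dir.rbytes'))   # total bytes under this directory
    rfiles = int(os.getxattr(path, 'ceph.dir.rfiles'))   # total files under this directory
    print('%s: %d files, %.1f TB' % (path, rfiles, rbytes / 1e12))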
snapshots
●   volume- or subvolume-granularity snapshots are unusable at petabyte scale
    ●   snapshot arbitrary subdirectories
●   simple interface
    ●   hidden '.snap' directory
    ●   no special tools

$ mkdir foo/.snap/one      # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776         # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one      # remove snapshot
multiple protocols, implementations
●   Linux kernel client
    ●   mount -t ceph 1.2.3.4:/ /mnt
    ●   export (NFS), Samba (CIFS)
●   ceph-fuse
●   libcephfs.so
    ●   your app
    ●   Samba (CIFS)
    ●   Ganesha (NFS)
    ●   Hadoop (map/reduce)
[diagram: NFS (Ganesha) and SMB/CIFS (Samba) gateways on libcephfs, your app on libcephfs, ceph-fuse over FUSE, and the ceph kernel client]
can I deploy it already?
●   rados object store is stable
    ●   librados
    ●   radosgw (RESTful APIs)
    ●   rbd rados block device
    ●   commercial support in 1-3 months
●   file system is not ready
    ●   feature complete
    ●   suitable for testing, PoC, benchmarking
    ●   needs testing, deliberate qa effort for production
why we do this
●   limited options for scalable open source storage
    ●   nexenta
    ●   orangefs, lustre
    ●   glusterfs
●   proprietary solutions
    ●   marry hardware and software
    ●   expensive
    ●   don't scale (well or out)
●   we can change the industry
who we are
●   created at UC Santa Cruz (2007)
●   supported by DreamHost (2008-2011)
●   spun off as new company (2012)
    ●   downtown Los Angeles, downtown San Francisco
●   growing user and developer community
    ●   Silicon Valley, Asia, Europe
    ●   Debian, SuSE, Canonical, RedHat
    ●   cloud computing stacks
●   we are hiring
    ●   C/C++/Python developers
    ●   sysadmins, testing engineers


                                 http://ceph.newdream.net/
