the ceph distributed storage system


                sage weil
         strata – march 1, 2012
hello
●   why you should care
●   what is it, what it does
●   how it works, how you can use it
    ●   architecture
    ●   objects
    ●   recovery
    ●   file system
    ●   hadoop integration
●   distributed computation
    ●   object classes
●   who we are, why we do this
why should you care about another
        storage system?

     requirements, time, money
requirements
●   diverse storage needs
    ●   object storage
    ●   block devices (for VMs) with snapshots, cloning
    ●   shared file system with POSIX, coherent caches
    ●   structured data... files, block devices, or objects?
●   scale
    ●   terabytes, petabytes, exabytes
    ●   heterogeneous hardware
    ●   reliability and fault tolerance
time
●   ease of administration
●   no manual data migration, load balancing
●   painless scaling
    ●   expansion and contraction
    ●   seamless migration
money
●   low cost per gigabyte
●   no vendor lock-in

●   software solution
    ●   run on commodity hardware
●   open source
what is ceph?
unified storage system
●   objects
    ●   small or large
    ●   multi-protocol
●   block devices
    ●   snapshots, cloning
●   files
    ●   cache coherent
    ●   snapshots
    ●   usage accounting

    [diagram: Netflix, a VM, and Hadoop as example clients of radosgw, RBD, and the Ceph DFS, all layered on RADOS]
open source
●   LGPLv2
    ●   copyleft
    ●   free to link to proprietary code
●   no copyright assignment
    ●   no dual licensing
    ●   no “enterprise-only” feature set
●   active community
●   commercial support
distributed storage system
●   data center (not geo) scale
    ●   10s to 10,000s of machines
    ●   terabytes to exabytes
●   fault tolerant
    ●   no SPoF
    ●   commodity hardware
        –   ethernet, SATA/SAS, HDD/SSD
        –   RAID, SAN probably a waste of time, power, and money
architecture
●   monitors (ceph-mon)
    ●   1s-10s, paxos
    ●   lightweight process
    ●   authentication, cluster membership,
        critical cluster state
●   object storage daemons (ceph-osd)
    ●   1s-10,000s
    ●   smart, coordinate with peers
●   clients (librados, librbd)
    ●   zillions
    ●   authenticate with monitors, talk directly
        to ceph-osds
●   metadata servers (ceph-mds)
    ●   1s-10s
    ●   build POSIX file system on top of objects
rados object storage model
●   pools
    ●   1s to 100s
    ●   independent namespaces or object collections
    ●   replication level, placement policy
●   objects
    ●   trillions
    ●   blob of data (bytes to gigabytes)
    ●   attributes (e.g., “version=12”; bytes to kilobytes)
    ●   key/value bundle (bytes to gigabytes)
rados object API
●   librados.so
    ●   C, C++, Python, Java, shell
●   read/write (extent), truncate, remove; get/set/remove xattr or key
    ●   like a file or .db file
●   efficient copy-on-write clone
●   atomic compound operations/transactions
    ●   read + getxattr, write + setxattr
    ●   compare xattr value, if match write + setxattr
●   classes
    ●   load new code into cluster to implement new methods
    ●   calc sha1, grep/filter, generate thumbnail
    ●   encrypt, increment, rotate image
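
A minimal sketch of this object API through the Python librados bindings – illustrative only, assuming a running cluster, a readable /etc/ceph/ceph.conf, and an existing pool named 'data':

        import rados

        # connect using the local ceph.conf and default credentials
        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('data')        # the 'data' pool is assumed to exist

        # write a whole object, then read part of it back
        ioctx.write_full('greeting', b'hello ceph')
        print(ioctx.read('greeting', length=5, offset=0))   # b'hello'

        # per-object attributes
        ioctx.set_xattr('greeting', 'version', b'12')
        print(ioctx.get_xattr('greeting', 'version'))        # b'12'

        ioctx.remove_object('greeting')
        ioctx.close()
        cluster.shutdown()
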
object storage
●   client/server, host/device paradigm doesn't scale
    ●   idle servers are wasted servers
    ●   if storage devices don't coordinate, clients must
●   ceph-osds are intelligent storage daemons
    ●   coordinate with peers
    ●   sensible, cluster-aware protocols
●   flexible deployment
    ●   one per disk
    ●   one per host
    ●   one per RAID volume
●   sit on local file system
    ●   btrfs, xfs, ext4, etc.
data distribution
●   all objects are replicated N times
●   objects are automatically placed, balanced, migrated
    in a dynamic cluster
●   must consider physical infrastructure
    ●   ceph-osds on hosts in racks in rows in data centers

●   three approaches
    ●   pick a spot; remember where you put it
    ●   pick a spot; write down where you put it
    ●   calculate where to put it, where to find it
CRUSH
●   pseudo-random placement algorithm
    ●   uniform, weighted distribution
    ●   fast calculation, no lookup
●   placement rules
    ●   in terms of physical infrastructure
        –   “3 replicas, same row, different racks”
●   predictable, bounded migration on changes
    ●   N → N + 1 ceph-osds means a bit over 1/Nth of
        data moves
object placement
●   objects in a pool map to a placement group (PG)
    ●   hash(object name) % num_pg = pg
●   each PG maps to an ordered list of OSDs
    ●   CRUSH(pg, cluster state, rule) = [A, B]
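
The two-step mapping above can be illustrated with a toy Python sketch; the ranking step below is a simple stand-in for CRUSH, not the real algorithm:

        import hashlib

        def place(obj_name, num_pg, osds, replicas=2):
            # step 1: hash the object name onto a placement group
            pg = int(hashlib.md5(obj_name.encode()).hexdigest(), 16) % num_pg
            # step 2 (stand-in for CRUSH): rank OSDs by a hash of (pg, osd)
            # and take the first `replicas` – deterministic, no lookup table
            key = lambda osd: hashlib.md5(('%d:%s' % (pg, osd)).encode()).hexdigest()
            return pg, sorted(osds, key=key)[:replicas]

        print(place('myobject', num_pg=256, osds=['osd.0', 'osd.1', 'osd.2', 'osd.3']))
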
replication
●   all data replicated N times
●   ceph-osd cluster handles replication
    ●   client writes to first replica
    ●   reduce client bandwidth
    ●   “only once” semantics
    ●   cluster maintains strict consistency
recovery
●   dynamic cluster
    ●   nodes are added, removed
    ●   nodes reboot, fail, recover
●   “recovery” is the norm
    ●   “map” records cluster state at point in time
        –   ceph-osd node status (up/down, weight, IP)
        –   CRUSH function specifying desired data distribution
    ●   ceph-osds cooperatively migrate data to achieve that
●   any map update potentially triggers data migration
    ●   ceph-osds monitor peers for failure
    ●   new nodes register with monitor
    ●   administrator adjusts weights, marks out old hardware, etc.
rbd – rados block device
●   replicated, reliable, high-performance virtual disk
    ●   striped over objects across entire cluster
    ●   thinly provisioned, snapshots
    ●   image cloning (real soon now)
●   well integrated
    ●   Linux kernel driver (/dev/rbd0)
    ●   qemu/KVM + librbd
    ●   libvirt, OpenStack
●   sever link between virtual machine and host
    ●   fail-over, live migration

    [diagram: a KVM guest using librbd and librados, alongside a KVM/Xen host mounting ext4 on the rbd kernel driver]
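
A sketch of creating and writing an RBD image from Python via the rbd and rados bindings; the 'rbd' pool and image name are assumptions:

        import rados, rbd

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('rbd')                   # assumes an 'rbd' pool exists

        rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)  # 10 GiB, thinly provisioned

        image = rbd.Image(ioctx, 'vm-disk-1')
        image.write(b'guest data', 0)                       # striped over RADOS objects
        image.create_snap('before-upgrade')                 # cheap point-in-time snapshot
        image.close()

        ioctx.close()
        cluster.shutdown()
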
ceph distributed file system
●   shared cluster-coherent file system
●   separate metadata and data paths
    ●   avoid “server” bottleneck inherent in NFS etc
●   ceph-mds cluster
    ●   manages file system hierarchy
    ●   redistributes load based on workload
    ●   ultimately stores everything in objects
●   highly stateful client sessions
    ●   lots of caching, prefetching, locks and leases
an example
●   mount -t ceph 1.2.3.4:/ /mnt
    ●   3 ceph-mon RT
    ●   2 ceph-mds RT (1 ceph-mds to -osd RT)
●   cd /mnt/foo/bar
    ●   2 ceph-mds RT (2 ceph-mds to -osd RT)
●   ls -al
    ●   open
    ●   readdir
         –   1 ceph-mds RT (1 ceph-mds to -osd RT)
    ●   stat each file
    ●   close
●   cp * /tmp
    ●   N ceph-osd RT

    [diagram: client exchanging round trips (RT) with ceph-mon, ceph-mds, and ceph-osd daemons]
dynamic subtree partitioning
●   efficient
    ●   hierarchical partition preserves locality
●   dynamic
    ●   daemons can join/leave
    ●   take over for failed nodes
●   scalable
    ●   arbitrarily partition metadata
●   adaptive
    ●   move work from busy to idle servers
    ●   replicate hot metadata

    [diagram: a directory tree rooted at Root, with subtrees assigned to different ceph-mds daemons]
recursive accounting
●   ceph-mds tracks recursive directory stats
    ●   file sizes
    ●   file and directory counts
    ●   modification time
●   virtual xattrs present full stats
●   efficient

        $ ls -alSh | head
        total 0
        drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
        drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
        drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
        drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
        drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
        drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
        drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
        drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
        drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
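
On a mounted client the recursive stats are readable as virtual xattrs; a small sketch, assuming a mount under /mnt and the ceph.dir.* attribute names used by the clients:

        import os

        path = '/mnt/pomceph'        # any directory on a mounted ceph file system
        for name in ('ceph.dir.rbytes', 'ceph.dir.rfiles', 'ceph.dir.rsubdirs'):
            # os.getxattr is available on Linux with Python 3.3+
            print(name, os.getxattr(path, name).decode())
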
snapshots
●   volume or subvolume snapshots unusable at petabyte scale
    ●   snapshot arbitrary subdirectories
●   simple interface
    ●   hidden '.snap' directory
    ●   no special tools


        $ mkdir foo/.snap/one      # create snapshot
        $ ls foo/.snap
        one
        $ ls foo/bar/.snap
        _one_1099511627776         # parent's snap name is mangled
        $ rm foo/myfile
        $ ls -F foo
        bar/
        $ ls -F foo/.snap/one
        myfile bar/
        $ rmdir foo/.snap/one      # remove snapshot
multiple protocols, implementations
●   Linux kernel client
    ●   mount -t ceph 1.2.3.4:/ /mnt
    ●   export (NFS), Samba (CIFS)
●   ceph-fuse
●   libcephfs.so
    ●   your app
    ●   Samba (CIFS)
    ●   Ganesha (NFS)
    ●   Hadoop (map/reduce)

    [diagram: NFS via Ganesha, SMB/CIFS via Samba, Hadoop, and custom apps all linking libcephfs; ceph-fuse and the ceph kernel client provide mounts]
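
A sketch of going straight through libcephfs from an application, assuming the python-cephfs bindings and their LibCephFS class:

        import os
        import cephfs                    # python wrapper over libcephfs.so

        fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
        fs.mount()                       # talk to the cluster; no kernel mount needed

        fd = fs.open('/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
        fs.write(fd, b'written through libcephfs\n', 0)
        fs.close(fd)

        print(fs.stat('/hello.txt'))
        fs.unmount()
        fs.shutdown()
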
hadoop
●   seamless integration
    ●   Java libcephfs wrapper
    ●   Hadoop CephFileSystem
    ●   drop-in replacement for HDFS
●   locality
    ●   exposes data layout
    ●   reads from local replica
    ●   first write does not go to local node
●   can interact “normally” with Hadoop data
    ●   kernel mount
    ●   ceph-fuse
    ●   NFS/CIFS
●   can colocate Hadoop with “normal” storage
    ●   avoid staging/destaging
distributed computation models
●   object classes
    ●   tightly couple computation with data
    ●   carefully sandboxed
    ●   part of I/O pipeline
    ●   atomic transactions
    ●   rich data abstraction
         –   blob of bytes (file)
         –   xattrs
         –   key/value bundle
●   map/reduce
    ●   colocation of computation and data is optimization only
    ●   more loosely sandboxed
    ●   orchestrated data flow between files, nodes
    ●   job scheduling
    ●   limited storage abstraction
structured data
●   data types
    ●   record streams
    ●   key/value maps
    ●   queues
    ●   images
    ●   matrices
●   operations
    ●   filter
    ●   fingerprint
●   mutations
    ●   rotate/resize
    ●   substitute
    ●   translate
●   avoid read/modify/write
size vs (intra-object) smarts

    [chart: object size plotted against intra-object smarts, positioning S3, HDFS, RADOS, a single RADOS object, hbase, cassandra, redis, and riak]
best tool for the job
●   key/value stores
    ●   cassandra, riak, redis
●   object store
    ●   RADOS (ceph)
●   map/reduce “filesystems”
    ●   GFS, HDFS
●   POSIX filesystems
    ●   ceph, lustre, gluster
●   hbase/bigtable
    ●   tablets, logs
●   map/reduce
    ●   data flows
●   percolator
    ●   triggers, transactions
can I deploy it already?
●   rados object store is stable
    ●   librados
    ●   radosgw (RESTful APIs)
    ●   rbd (rados block device)
    ●   commercial support
●   file system is almost ready
    ●   feature complete
    ●   suitable for testing, PoC, benchmarking
    ●   needs testing, deliberate QA effort for production
why we do this
●   limited options for scalable open source storage
    ●   orangefs, lustre
    ●   glusterfs
    ●   HDFS
●   proprietary solutions
    ●   marry hardware and software
    ●   expensive
    ●   don't scale (well or out)
●   industry needs to change
who we are
●   created at UC Santa Cruz (2007)
●   supported by DreamHost (2008-2011)
●   spun off as new company (2012)
    ●   downtown Los Angeles, downtown San Francisco
●   growing user and developer community
    ●   Silicon Valley, Asia, Europe
    ●   Debian, SuSE, Canonical, RedHat
    ●   cloud computing stacks
●   we are hiring
    ●   C/C++/Python developers
    ●   sysadmins, testing engineers


                                       http://ceph.com/
librados, radosgw
●   librados
    ●   direct parallel access to cluster
    ●   rich API
●   radosgw
    ●   RESTful object storage
        –   S3, Swift APIs
    ●   proxy HTTP to rados
    ●   ACL-based security for the big bad internet

    [diagram: HTTP clients hit haproxy, which balances across radosgw instances; radosgw and custom apps access the cluster through librados]
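
Because radosgw speaks the S3 API, any stock S3 client can talk to it; a sketch using boto3 with a placeholder endpoint and placeholder credentials for a gateway user:

        import boto3

        s3 = boto3.client(
            's3',
            endpoint_url='http://radosgw.example.com:7480',   # placeholder gateway endpoint
            aws_access_key_id='ACCESS_KEY',                   # placeholder radosgw user keys
            aws_secret_access_key='SECRET_KEY',
        )

        s3.create_bucket(Bucket='demo')
        s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored in RADOS via the gateway')
        print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())
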
why we like btrfs
●   pervasive checksumming
●   snapshots, copy-on-write
●   efficient metadata (xattrs)
●   inline data for small files
●   transparent compression
●   integrated volume management
    ●   software RAID, mirroring, error recovery
    ●   SSD-aware
●   online fsck
●   active development community
