Ceph, Xen, and CloudStack: Semper Melior


Slides from my presentation at the Xen User Summit 2013 in New Orleans.

Published in: Technology

  • CRUSH is configured in an unusual way. Instead of defining pools for different data types, workgroups, subnets, or applications, you describe the physical topology of your storage network: how many buildings, rooms, shelves, racks, and nodes you have. You then tell it how you want data placed and how many copies to keep. For example, you could tell CRUSH that it’s okay to have two replicas in the same building, but not on the same power circuit.
  • With CRUSH, the data is first split into sections called “placement groups”; the number of placement groups is configurable. The CRUSH algorithm is then invoked with the latest cluster map and a set of placement rules, and it determines where each placement group belongs in the cluster. The calculation is pseudo-random but repeatable: given the same cluster state and rule set, it always returns the same result.
  • Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of a placement group, and no two nodes contain the same information? That’s important.
  • When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
  • What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
  • The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
  • Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
  • 4.2 ready (working on RBD Java bindings). QEMU and libvirt are creating images in format 1; hacky stuff to make format 2. RBD for Primary and RGW S3 for Secondary (templates, backups, ISOs).
  • You can have a management server that communicates with all of your agents (hypervisors). Management servers can be clustered for HA/failover or performance.
  • Client -> XenAPI -> Domain manager -> Xen control library -> standard Xen libraries && “upstream” QEMU. Storage plugins -> libvirt support (experimental build) -> Ceph && OCFS2.
  • Ceph, Xen, and CloudStack: Semper Melior
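
The two-step lookup described in the notes above (and written on slide 10 as hash(object name) % num pg, then CRUSH(pg, cluster state, rule set)) can be sketched in Python. This is only an illustration under stated assumptions: the real CRUSH algorithm is not reproduced here, and a simple highest-score ("rendezvous") hash stands in for it. The names `stable_hash`, `object_to_pg`, `place_pg`, `locate`, and the three-rack `CLUSTER` map are hypothetical and do not come from Ceph.

```python
import hashlib


def stable_hash(*parts):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name, num_pgs):
    """Step 1: hash(object name) % num pg."""
    return stable_hash("obj", object_name) % num_pgs


def place_pg(pg, cluster_map, replicas):
    """Step 2: a stand-in for CRUSH(pg, cluster state, rule set).

    cluster_map maps a failure domain (here: a rack) to the OSDs in it.
    The only rule modelled is "every replica goes to a different rack".
    """
    if replicas > len(cluster_map):
        raise ValueError("not enough racks for the requested replica count")
    # Rank racks by a per-PG pseudo-random score and keep the best ones.
    racks = sorted(
        cluster_map,
        key=lambda rack: stable_hash("rack", str(pg), rack),
        reverse=True,
    )[:replicas]
    # Inside each chosen rack, pick the OSD with the highest per-PG score.
    return [
        max(cluster_map[rack], key=lambda osd: stable_hash("osd", str(pg), osd))
        for rack in racks
    ]


# Hypothetical topology: three racks with two OSDs each.
CLUSTER = {
    "rack-a": ["osd.0", "osd.1"],
    "rack-b": ["osd.2", "osd.3"],
    "rack-c": ["osd.4", "osd.5"],
}


def locate(object_name, num_pgs=64, replicas=2):
    """What a client computes on its own: object -> placement group -> OSDs."""
    pg = object_to_pg(object_name, num_pgs)
    return pg, place_pg(pg, CLUSTER, replicas)
```

Calling `locate("vm-disk-0001")` twice returns the same placement group and the same OSDs, which is why a client can find its data without asking a central lookup service.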

    1. Ceph, Xen, and CloudStack: Semper Melior Xen User Summit | New Orleans, LA | 18 SEP 2013
    2. •Patrick McGarry •Community monkey •Inktank / Ceph •/. > ALU > P4 •@scuttlemonkey •patrick@inktank.com Accept no substitutes C’est Moi
    3. •Ceph in <30s •Ceph, a little bit more •Ceph in the wild •Orchestration •Community status •What’s Next? •Questions The plan, Stan Welcome!
    4. On commodity hardware Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster. Object, block, and file Low overhead doesn’t mean just hardware, it means people too! Awesomesauce Infrastructure-aware placement algorithm allows you to do really cool stuff. Huge and beyond Designed for exabyte, current implementations in the multi-petabyte. HPC, Big Data, Cloud, raw storage. …besides wicked-awesome? What is Ceph? Software All-in-1 CRUSH Scale
    5. Find out more! Ceph.com …but you can find out more Use it today Dreamhost.com/cloud/DreamObjects Get Support Inktank.com That WAS fast
    6. OBJECTS VIRTUAL DISKS FILES & DIRECTORIES CEPH FILE SYSTEM A distributed, scale-out filesystem with POSIX semantics that provides storage for legacy and modern applications CEPH GATEWAY A powerful S3- and Swift-compatible gateway that brings the power of the Ceph Object Store to modern applications CEPH BLOCK DEVICE A distributed virtual block device that delivers high-performance, cost-effective storage for virtual machines and legacy applications CEPH OBJECT STORE A reliable, easy to manage, next-generation distributed object store that provides storage of unstructured data for applications
    7. [no text on slide]
    8. [no text on slide]
    9. • CRUSH – Pseudo-random placement algorithm – Ensures even distribution – Repeatable, deterministic – Rule-based configuration • Replica count • Infrastructure topology • Weighting
    10. hash(object name) % num pg, then CRUSH(pg, cluster state, rule set) [figure: object data shown as binary digits]
    11. [figure: object data shown as binary digits]
    12. CLIENT
    13. [no text on slide]
    14. [no text on slide]
    15. [no text on slide]
    16. CLIENT ??
    17. …with Marty Stouffer Ceph in the Wild
    18. No incendiary devices please… Linux Distros
    19. Object && Block Via RBD and RGW (Swift API) Our BFF Identity Via Keystone More coming! Work continues with updates in Havana and Icehouse. OpenStack
    20. Block Alternate primary, and secondary Community maintained Community Wido from 42on.com More coming in 4.2! Snapshot & backup support Cloning (layering) support No NFS for system VMs Secondary/Backup storage (S3) CloudStack
    21. A blatant ripoff! Primary Storage Flow •The mgmt server never talks to the Ceph cluster •One mgmt server can manage 1000s of hypervisors •Mgmt server can be clustered •Multiple Ceph clusters/pools can be added to CloudStack cluster
    22. A pretty package A commercially packaged OpenStack solution backed by Ceph. RADOS for Archipelago Virtual server management software tool on top of Xen or KVM. RBD backed Complete virtualization management with KVM and containers. BBC territory Talk next week in Berlin So many delicious flavors Other Cloud SUSE Cloud Ganeti Proxmox OpenNebula
    23. Since 2.6.35 Kernel clients for RBD and CephFS. Active development as a Linux file system. iSCSI ahoy! One of the Linux iSCSI target frameworks. Emulates: SBC (disk), SMC (jukebox), MMC (CD/DVD), SSC (tape), OSD. Getting creative Creative community member used Ceph to back their VMware infrastructure via Fibre Channel. You can always use more friends Project Intersection Kernel STGT VMware Love me! Slightly out-of-date. Some work has been done, but could use some love. Wireshark
    24. CephFS CephFS can serve as a drop-in replacement for HDFS. Upstream Ceph vfs module upstream samba. CephFS or RBD Reexporting CephFS or RBD for NFS/CIFS. MOAR projects Project Intersection Hadoop Samba Ganesha Recently Open Source Commercially supported product from Citrix. Recently Open Sourced. Still a bit of a tech preview. XenServer
    25. Support for libvirt XenServer can manipulate Ceph! Don’t let the naming fool you, it’s easy Blktap{2,3,asplode} Qemu; new boss, same as the old boss (but not really) What’s in a name? Ceph :: XenServer :: Libvirt Block device :: VDI :: storage vol Pool :: Storage Repo :: storage pool Doing it with Xen*
    26. Thanks David Scott! XenServer host arch Xapi, XenAPI xenopsd SM adapters libvirt libxl ceph ocfs2 libxenguest libxc qemu xen Client (CloudStack, OpenStack, XenDesktop)
    27. Come for the block Stay for the object and file No matter what you use! Reduced Overhead Easier to manage one cluster “Other Stuff” CephFS prototypes fast development profile ceph-devel lots of partner action Gateway Drug
    28. Squash Hotspots Multiple hosts = parallel workload But what does that mean? Instant Clones No time to boot for many images Live migration Shared storage allows you to move instances between compute nodes transparently. Blocks are delicious
    29. Flexible APIs Native support for Swift and S3 And less filling! Secondary Storage Coming with 4.2 Horizontal Scaling Easy with HAProxy or others Objects can juggle
    30. Neat prototypes Image distribution to hypervisors You can dress them up, but you can’t take them anywhere Still early You can fix that! Outside uses Great way to combine resources. Files are tricksy
    31. Where the metal meets the…software Deploying this stuff
    32. Procedural, Ruby Written in Ruby, this is more of the dev-side of DevOps. Once you get past the learning curve it’s powerful though. Model-driven Aimed more at the sysadmin, this procedural tool has a very wide penetration (even on Windows!). Agentless, whole stack Using the built-in OpenSSH in your OS, this super easy tool goes further up the stack than most. Fast, 0MQ Using ZeroMQ this tool is designed for massive scale and fast, fast, fast. Unfortunately 0MQ has no built-in encryption. The new hotness Orchestration Chef Puppet Ansible Salt
    33. Canonical Unleashed Being language agnostic, this tool can completely encapsulate a service. Can also handle provisioning all the way down to hardware. Dell has skin in the game Complete operations platform that can dive all the way down to BIOS/RAID level. Others are joining in Custom provisioning and orchestration, just one example of how busy this corner of the market is. Doing it w/o a tool If you prefer not to use a tool, Ceph gives you an easy way to deploy your cluster by hand. MOAR HOTNESS Orchestration Cont’d Juju Crowbar ComodIT Ceph-deploy
    34. All your space are belong to us Ceph Community
    35. [no text on slide]
    36. Up and to the right! Code Contributions
    37. Up and to the right! Commits
    38. Up and to the right! List Participation
    39. This Ceph thing sounds hot. What’s Next?
    40. An ongoing process While the first pass for disaster recovery is done, we want to get to built-in, worldwide replication. Reception efficiency Currently underway in the community! Headed to dynamic Can already do this in a static pool-based setup. Looking to get to a use-based migration. Making it open-er Been talking about it forever. The time is coming! Hop on board! The Ceph Train Geo-Replication Erasure Coding Tiering Governance
    41. Quarterly Online Summit Online summit puts the core devs together with the Ceph community. Not just for NYC More planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/ Geek-on-duty During the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph Email makes the world go Our mailing lists are very active, check out ceph.com for details on how to join in! Open Source is Open! Get Involved! CDS Ceph Day IRC Lists
    42. http://wiki.ceph.com/04Development/Project_Ideas Lists, blueprints, sideboard, paper cuts, etc. http://tracker.ceph.com/ All the things! New #ceph-devel Splitting off developer chatter to make it easier to filter discussions. http://ceph.com/resources/mailing-list-irc/ Our mailing lists are very active, check out ceph.com for details on how to join in! Patches welcome Projects Wiki Redmine IRC Lists
    43. Comments? Anything for the good of the cause? Questions? E-MAIL patrick@inktank.com WEBSITE Ceph.com SOCIAL @scuttlemonkey @ceph Facebook.com/cephstorage
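
As a footnote to the speaker notes on node failure: because placement is calculated rather than looked up, losing one OSD only changes the placement groups that had a replica on it, and the surviving replicas stay where they are. The Python sketch below illustrates that property under loud assumptions. It uses rendezvous (highest-score) hashing as a stand-in, not the real CRUSH algorithm, and the names `score`, `place`, and `remap_after_failure` are hypothetical.

```python
import hashlib


def score(pg, osd):
    """Deterministic per-(placement group, OSD) pseudo-random score."""
    digest = hashlib.sha256(f"{pg}/{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg, osds, replicas=2):
    """Pick the `replicas` highest-scoring OSDs for this placement group."""
    return sorted(osds, key=lambda osd: score(pg, osd), reverse=True)[:replicas]


def remap_after_failure(num_pgs, osds, failed, replicas=2):
    """Return {pg: (old_osds, new_osds)} for every PG whose placement changes."""
    survivors = [osd for osd in osds if osd != failed]
    changes = {}
    for pg in range(num_pgs):
        old = place(pg, osds, replicas)
        new = place(pg, survivors, replicas)
        if old != new:
            changes[pg] = (old, new)
    return changes


if __name__ == "__main__":
    cluster = [f"osd.{i}" for i in range(10)]
    moved = remap_after_failure(128, cluster, "osd.7")
    for pg, (old, new) in sorted(moved.items()):
        print(f"pg {pg}: {old} -> {new}")
```

Every entry that `remap_after_failure` reports had the failed OSD in its old placement; all other placement groups are untouched, so clients that recompute placement against the new cluster state arrive at the same answer as the OSDs that moved the data.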