XenSummit - 08/28/2012

Sage Weil's slides from the XenSummit in August 2012.


Published in: Technology


Transcript

  • 1. virtual machine block storage with the ceph distributed storage system – sage weil – xensummit, august 28, 2012
  • 2. outline ● why you should care ● what is it, what it does ● how it works, how you can use it ● architecture ● objects, recovery ● rados block device ● integration ● path forward ● who we are, why we do this
  • 3. why should you care about another storage system? requirements, time, cost
  • 4. requirements ● diverse storage needs ● object storage ● block devices (for VMs) with snapshots, cloning ● shared file system with POSIX, coherent caches ● structured data... files, block devices, or objects? ● scale ● terabytes, petabytes, exabytes ● heterogeneous hardware ● reliability and fault tolerance
  • 5. time ● ease of administration ● no manual data migration, load balancing ● painless scaling ● expansion and contraction ● seamless migration
  • 6. cost ● low cost per gigabyte ● no vendor lock-in ● software solution ● run on commodity hardware ● open source
  • 7. what is ceph?
  • 8. [Diagram: the Ceph stack]
       RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
       LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (used by apps)
       RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift (used by apps)
       RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver (used by hosts/VMs)
       CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (used by clients)
  • 9. open source ● LGPLv2 ● copyleft ● free to link to proprietary code ● no copyright assignment ● no dual licensing ● no “enterprise-only” feature set ● active community ● commercial support available
  • 10. distributed storage system ● data center (not geo) scale ● 10s to 10,000s of machines ● terabytes to exabytes ● fault tolerant ● no SPoF ● commodity hardware – ethernet, SATA/SAS, HDD/SSD – RAID, SAN probably a waste of time, power, and money
  • 11. object storage model ● pools ● 1s to 100s ● independent namespaces or object collections ● replication level, placement policy ● objects ● trillions ● blob of data (bytes to gigabytes) ● attributes (e.g., “version=12”; bytes to kilobytes) ● key/value bundle (bytes to gigabytes)
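
For illustration, a minimal sketch of this object model using the python-rados bindings: a pool as a namespace, an object as a blob of bytes, and an attribute on it. The pool name "rbd", the object name, and the ceph.conf path are placeholder assumptions, and a reachable cluster is assumed.

    # Sketch: pools, objects, and attributes via python-rados (the librados bindings).
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')                  # a pool: an independent object namespace
        try:
            ioctx.write_full('greeting', b'hello ceph')    # object data: an opaque blob of bytes
            ioctx.set_xattr('greeting', 'version', b'12')  # attribute, e.g. "version=12"
            print(ioctx.read('greeting'))
            print(ioctx.get_xattr('greeting', 'version'))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()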
  • 12. object storage cluster ● conventional client/server model doesn't scale ● server(s) become bottlenecks; proxies are inefficient ● if storage devices don't coordinate, clients must ● ceph-osds are intelligent storage daemons ● coordinate with peers ● sensible, cluster-aware protocols ● sit on local file system – btrfs, xfs, ext4, etc. – leveldb
  • 13. [Diagram: five OSDs, each a daemon sitting on a local file system (btrfs, xfs, or ext4) on its own disk, alongside three monitors (M)]
  • 14. Monitors (M): • maintain cluster state • provide consensus for distributed decision-making • small, odd number • do not serve stored objects to clients. OSDs: • one per disk or RAID group • at least three in a cluster • serve stored objects to clients • intelligently peer to perform replication tasks
  • 15. [Diagram: an administrator (HUMAN) interacting with the monitor cluster (M, M, M)]
  • 16. data distribution ● all objects are replicated N times ● objects are automatically placed, balanced, migrated in a dynamic cluster ● must consider physical infrastructure ● ceph-osds on hosts in racks in rows in data centers ● three approaches ● pick a spot; remember where you put it ● pick a spot; write down where you put it ● calculate where to put it, where to find it
  • 17. CRUSH • Pseudo-random placement algorithm • Fast calculation, no lookup • Ensures even distribution • Repeatable, deterministic • Rule-based configuration • specifiable replication • infrastructure topology aware • allows weighting • Stable mapping • Limited data migration
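
To make "calculate where to put it, where to find it" concrete, here is a toy placement sketch in Python. It is NOT the real CRUSH algorithm (no hierarchy, rules, or weights); it is plain rendezvous hashing, which shares the properties the slide lists: fast calculation with no lookup table, deterministic and repeatable, and limited data movement when OSDs come and go. The object and OSD names are placeholders.

    # Toy, deterministic placement: every client computes the same mapping
    # from object name to OSDs, so there is nothing to look up or keep in sync.
    import hashlib

    def toy_place(obj_name, osds, replicas=3):
        """Pick `replicas` distinct OSDs for an object, deterministically."""
        def score(osd):
            digest = hashlib.md5(f"{obj_name}:{osd}".encode()).hexdigest()
            return int(digest, 16)
        # Highest score wins; removing one OSD only moves the objects it held.
        return sorted(osds, key=score, reverse=True)[:replicas]

    print(toy_place("rb.0.6b8b4567.000000000001",
                    ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]))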
  • 18. distributed object storage ● CRUSH tells us where data should go ● small “osd map” records cluster state at point in time ● ceph-osd node status (up/down, weight, IP) ● CRUSH function specifying desired data distribution ● object storage daemons (RADOS) ● store it there ● migrate it as the cluster changes ● decentralized, distributed approach allows ● massive scales (10,000s of servers or more) ● efficient data access ● the illusion of a single copy with consistent behavior
  • 19. large clusters aren't static ● dynamic cluster ● nodes are added, removed; nodes reboot, fail, recover ● recovery is the norm ● osd maps are versioned ● shared via gossip ● any map update potentially triggers data migration ● ceph-osds monitor peers for failure ● new nodes register with monitor ● administrator adjusts weights, marks out old hardware, etc.
  • 20. [Diagram: a client (CLIENT) facing the cluster, unsure where its data should go (??)]
  • 21. what does this mean for my cloud? ● virtual disks ● reliable ● accessible from many hosts ● appliances ● great for small clouds ● not viable for public or (large) private clouds ● avoid single server bottlenecks ● efficient management
  • 22. [Diagram: a VM running in a virtualization container; the container uses LIBRBD on top of LIBRADOS to reach the cluster (M, M, M)]
  • 23. [Diagram: two virtualization containers, each with LIBRBD on LIBRADOS, attached to the same cluster (M, M, M), so a VM can move between them]
  • 24. [Diagram: a host using the KRBD kernel module (with LIBRADOS) to access the cluster (M, M, M)]
  • 25. RBD: RADOS Block Device • Replicated, reliable, high-performance virtual disk • Allows decoupling of VMs and containers • Live migration! • Images are striped across the cluster • Snapshots! • Native support in the Linux kernel • /dev/rbd1 • librbd allows easy integration
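
A minimal sketch of creating an image and taking a snapshot through the python-rbd bindings; the pool "rbd" and image name "vm-disk0" are placeholders, and a reachable cluster is assumed.

    # Sketch: create a 10 GiB RBD image and snapshot it via librbd's Python bindings.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd.RBD().create(ioctx, 'vm-disk0', 10 * 1024**3)      # image data is striped over RADOS objects
    image = rbd.Image(ioctx, 'vm-disk0')
    try:
        image.create_snap('base')                          # point-in-time snapshot
        print([snap['name'] for snap in image.list_snaps()])
    finally:
        image.close()

    ioctx.close()
    cluster.shutdown()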
  • 26. HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
  • 27. instant copy: [Diagram: a new image cloned from a 144-object parent adds no data of its own (144 + 0 + 0 + 0 + 0 = 144)]
  • 28. write: [Diagram: a client's writes go only to the clone, adding 4 new objects (144 + 4 = 148)]
  • 29. read: [Diagram: reads hit the clone's new objects where they exist and fall back to the parent otherwise (144 + 4 = 148)]
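
The "instant copy" above is copy-on-write cloning. A minimal sketch via the python-rbd bindings, assuming a parent image "golden" with a snapshot "base" already exists in pool "rbd" and that the installed librbd supports layering (those names are placeholders):

    # Sketch: clone a protected snapshot; the clone shares the parent's objects,
    # and only writes allocate new ones, so creation is effectively instant.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    parent = rbd.Image(ioctx, 'golden')
    try:
        if not parent.is_protected_snap('base'):
            parent.protect_snap('base')                    # clones require a protected snapshot
    finally:
        parent.close()

    rbd.RBD().clone(ioctx, 'golden', 'base',               # parent pool / image / snapshot
                    ioctx, 'vm-0042',                      # child pool / image
                    features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()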
  • 30. current RBD integration ● native Linux kernel support ● /dev/rbd0, /dev/rbd/<poolname>/<imagename> ● librbd ● user-level library ● Qemu/KVM ● links to librbd user-level library ● libvirt ● librbd-based storage pool ● understands RBD images ● can only start KVM VMs... :-( ● CloudStack, OpenStack
  • 31. what about Xen? ● Linux kernel driver (i.e. /dev/rbd0) ● easy fit into existing stacks ● works today ● need recent Linux kernel for dom0 ● blktap ● generic kernel driver, userland process ● easy integration with librbd ● more featureful (cloning, caching), maybe faster ● doesn't exist yet! ● rbd-fuse ● coming soon!
  • 32. libvirt ● CloudStack, OpenStack ● libvirt understands rbd images, storage pools ● xml specifies cluster, pool, image name, auth ● currently only usable with KVM ● could configure /dev/rbd devices for VMs
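
For reference, a hedged sketch of what such a libvirt disk definition can look like; the monitor host, the pool/image name "rbd/vm-0042", the cephx user, and the secret UUID are placeholders:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='rbd' name='rbd/vm-0042'>
        <host name='mon1.example.com' port='6789'/>
      </source>
      <auth username='admin'>
        <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
      </auth>
      <target dev='vda' bus='virtio'/>
    </disk>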
  • 33. librbd ● management ● create, destroy, list, describe images ● resize, snapshot, clone ● I/O ● open, read, write, discard, close ● C, C++, Python bindings
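
A minimal sketch of the I/O calls listed above, via the Python bindings; the image "vm-disk0" in pool "rbd" is a placeholder and assumed to exist.

    # Sketch: open an RBD image and exercise write / read / discard / resize.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    image = rbd.Image(ioctx, 'vm-disk0')                   # open
    try:
        image.write(b'\x00' * 4096, 0)                     # write 4 KiB at offset 0
        data = image.read(0, 4096)                         # read it back
        image.discard(0, 4096)                             # trim the range
        image.resize(20 * 1024**3)                         # grow the image to 20 GiB
        print(len(data), image.size())
    finally:
        image.close()                                      # close

    ioctx.close()
    cluster.shutdown()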
  • 34. RBD roadmap ● locking ● fence failed VM hosts ● clone performance ● KSM (kernel same-page merging) hints ● caching ● improved librbd caching ● kernel RBD + bcache to local SSD/disk
  • 35. why ● limited options for scalable open source storage ● proprietary solutions ● marry hardware and software ● expensive ● don't scale (out) ● industry needs to change
  • 36. who we are ● Ceph created at UC Santa Cruz (2007) ● supported by DreamHost (2008-2011) ● Inktank (2012) ● growing user and developer community ● we are hiring ● C/C++/Python developers ● sysadmins, testing engineers ● Los Angeles, San Francisco, Sunnyvale, remote ● http://ceph.com/
  • 37. [Diagram: recap of the Ceph stack from slide 8: LIBRADOS, RADOSGW, RBD, and CEPH FS on top of RADOS, the reliable, autonomous, distributed object store of self-healing, self-managing, intelligent storage nodes]