Block Storage For VMs With Ceph
 

  • So what *is* Ceph? Ceph is a massively scalable and flexible object store with tightly-integrated applications that provide REST access to objects, a distributed virtual block device, and a parallel filesystem.
  • Let’s start with RADOS, Reliable Autonomic Distributed Object Storage. In this example, you’ve got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it’s stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
  • Applications wanting to store objects into RADOS interact with the cluster as a single entity.
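    As a rough illustration of that single-entity view, here is a minimal librados sketch using the Python bindings; it assumes a running cluster, a readable /etc/ceph/ceph.conf, and a hypothetical pool named 'data'.

      import rados

      # Connect to whatever cluster the local ceph.conf describes (assumed path).
      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()

      # Open an I/O context on a pool; 'data' is a made-up pool name.
      ioctx = cluster.open_ioctx('data')

      # Store an object and read it back; where it lands is the cluster's problem.
      ioctx.write_full('hello-object', b'hello ceph')
      print(ioctx.read('hello-object'))

      ioctx.close()
      cluster.shutdown()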
  • The way CRUSH is configured is somewhat unique. Instead of defining pools for different data types, workgroups, subnets, or applications, CRUSH is configured with the physical topology of your storage network. You tell it how many buildings, rooms, shelves, racks, and nodes you have, and you tell it how you want data placed. For example, you could tell CRUSH that it’s okay to have two replicas in the same building, but not on the same power circuit. You also tell it how many copies to keep.
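    To make that concrete, here is a hedged sketch in the decompiled CRUSH map text format; the bucket names, IDs, weights, and the rule itself are invented for illustration, not taken from the talk.

      # Hypothetical physical topology: two OSDs in a host, the host in a rack,
      # the rack under a root bucket named "default".
      # (Only one rack is shown; a real map would list several.)
      host node-a {
          id -2
          alg straw
          hash 0
          item osd.0 weight 1.00
          item osd.1 weight 1.00
      }
      rack rack-1 {
          id -3
          alg straw
          hash 0
          item node-a weight 2.00
      }
      root default {
          id -1
          alg straw
          hash 0
          item rack-1 weight 2.00
      }

      # Hypothetical placement rule: put each replica in a different rack.
      rule rbd-replicated {
          ruleset 2
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
      }

    The number of copies itself is a per-pool setting (the pool's replication level) rather than part of the rule.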
  • What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
  • The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
  • Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
  • The RADOS Block Device (RBD) allows users to store virtual disks inside RADOS. For example, you can use a virtualization container like KVM or QEMU to boot virtual machines from images that have been stored in RADOS. Images are striped across the entire cluster, which allows for simultaneous read access from different cluster nodes.
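    As a rough sketch of how such an image might be created through the Python librbd bindings (the pool and image names are hypothetical):

      import rados
      import rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')   # 'rbd' is the conventional default pool

      # Create a 10 GiB image; its data is striped over objects across the cluster.
      rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024 ** 3)

      ioctx.close()
      cluster.shutdown()

    A virtualization host can then point QEMU/KVM at that image (for example through libvirt, as the later slides describe) and boot from it.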
  • Separating a virtual computer from its storage also lets you do really neat things, like migrate a virtual machine from one server to another without rebooting it.
  • As an alternative, machines (even those running on bare metal) can mount an RBD image using native Linux kernel drivers.
  • With Ceph, copying an RBD image four times gives you five total copies…but only takes the space of one. It also happens instantly.
  • When clients mount one of the copied images and begin writing, they write to their copy.
  • When they read, though, they read through to the original copy if there’s no newer data.
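    A minimal sketch of that copy-on-write behavior using the Python librbd bindings; it assumes an existing format-2 base image, and every name below is made up.

      import rados
      import rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')

      # Snapshot and protect the base image so clones can layer on top of it.
      base = rbd.Image(ioctx, 'golden-image')
      base.create_snap('base')
      base.protect_snap('base')
      base.close()

      # The clone appears instantly and shares the parent's data; reads fall
      # through to the parent until the clone has written its own copy.
      rbd.RBD().clone(ioctx, 'golden-image', 'base',
                      ioctx, 'vm-clone-1',
                      features=rbd.RBD_FEATURE_LAYERING)

      ioctx.close()
      cluster.shutdown()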

Presentation Transcript

  • virtual machine block storage with the ceph distributed storage system
    sage weil – xensummit, august 28, 2012
  • outline
    ● why you should care
    ● what is it, what it does
    ● how it works, how you can use it
      ● architecture
      ● objects, recovery
    ● rados block device
      ● integration
      ● path forward
    ● who we are, why we do this
  • why should you care about another storage system? requirements, time, cost
  • requirements
    ● diverse storage needs
      ● object storage
      ● block devices (for VMs) with snapshots, cloning
      ● shared file system with POSIX, coherent caches
      ● structured data... files, block devices, or objects?
    ● scale
      ● terabytes, petabytes, exabytes
      ● heterogeneous hardware
      ● reliability and fault tolerance
  • time
    ● ease of administration
    ● no manual data migration, load balancing
    ● painless scaling
      ● expansion and contraction
      ● seamless migration
  • cost
    ● low cost per gigabyte
    ● no vendor lock-in
    ● software solution
      ● run on commodity hardware
    ● open source
  • what is ceph?
  • (architecture diagram) applications (APP), hosts/VMs, and clients access the cluster through four interfaces:
    ● LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
    ● RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
    ● RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
    ● CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
    all built on RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
  • open source
    ● LGPLv2
      ● copyleft
      ● free to link to proprietary code
    ● no copyright assignment
      ● no dual licensing
      ● no “enterprise-only” feature set
    ● active community
    ● commercial support available
  • distributed storage system
    ● data center (not geo) scale
      ● 10s to 10,000s of machines
      ● terabytes to exabytes
    ● fault tolerant
      ● no SPoF
      ● commodity hardware
        – ethernet, SATA/SAS, HDD/SSD
        – RAID, SAN probably a waste of time, power, and money
  • object storage model
    ● pools
      ● 1s to 100s
      ● independent namespaces or object collections
      ● replication level, placement policy
    ● objects
      ● trillions
      ● blob of data (bytes to gigabytes)
      ● attributes (e.g., “version=12”; bytes to kilobytes)
      ● key/value bundle (bytes to gigabytes)
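    To make the pool/object/attribute model on this slide concrete, a small hedged example with the Python librados bindings (the pool, object, and attribute names are invented):

      import rados

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()

      # Each pool is an independent namespace with its own replication and
      # placement policy; 'images' is a hypothetical pool.
      ioctx = cluster.open_ioctx('images')

      # An object is a blob of data plus named attributes.
      ioctx.write_full('machine-image-123', b'...image bytes...')
      ioctx.set_xattr('machine-image-123', 'version', b'12')
      print(ioctx.get_xattr('machine-image-123', 'version'))

      ioctx.close()
      cluster.shutdown()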
  • object storage cluster
    ● conventional client/server model doesn't scale
      ● server(s) become bottlenecks; proxies are inefficient
      ● if storage devices don't coordinate, clients must
    ● ceph-osds are intelligent storage daemons
      ● coordinate with peers
      ● sensible, cluster-aware protocols
      ● sit on local file system
        – btrfs, xfs, ext4, etc.
        – leveldb
  • (diagram: five OSDs, each on its own file system – btrfs, xfs, or ext4 – on its own disk, with three monitors alongside)
  • Monitors:
    ● Maintain cluster state
    ● Provide consensus for distributed decision-making
    ● Small, odd number
    ● These do not serve stored objects to clients
    OSDs:
    ● One per disk or RAID group
    ● At least three in a cluster
    ● Serve stored objects to clients
    ● Intelligently peer to perform replication tasks
  • (diagram: a human administrator interacting with the monitor cluster)
  • data distribution
    ● all objects are replicated N times
    ● objects are automatically placed, balanced, migrated in a dynamic cluster
    ● must consider physical infrastructure
      ● ceph-osds on hosts in racks in rows in data centers
    ● three approaches
      ● pick a spot; remember where you put it
      ● pick a spot; write down where you put it
      ● calculate where to put it, where to find it
  • CRUSH
    ● Pseudo-random placement algorithm
    ● Fast calculation, no lookup
    ● Ensures even distribution
    ● Repeatable, deterministic
    ● Rule-based configuration
      ● specifiable replication
      ● infrastructure topology aware
      ● allows weighting
    ● Stable mapping
      ● Limited data migration
  • distributed object storage
    ● CRUSH tells us where data should go
      ● small “osd map” records cluster state at point in time
      ● ceph-osd node status (up/down, weight, IP)
      ● CRUSH function specifying desired data distribution
    ● object storage daemons (RADOS)
      ● store it there
      ● migrate it as the cluster changes
    ● decentralized, distributed approach allows
      ● massive scales (10,000s of servers or more)
      ● efficient data access
      ● the illusion of a single copy with consistent behavior
  • large clusters aren't static
    ● dynamic cluster
      ● nodes are added, removed; nodes reboot, fail, recover
      ● recovery is the norm
    ● osd maps are versioned
      ● shared via gossip
    ● any map update potentially triggers data migration
      ● ceph-osds monitor peers for failure
      ● new nodes register with monitor
      ● administrator adjusts weights, marks out old hardware, etc.
  • (diagram: a client wondering which OSDs hold its data)
  • what does this mean for my cloud?
    ● virtual disks
      ● reliable
      ● accessible from many hosts
    ● appliances
      ● great for small clouds
      ● not viable for public or (large) private clouds
    ● avoid single server bottlenecks
    ● efficient management
  • (diagram: a VM in a virtualization container, using librbd on top of librados to reach the monitors and OSDs)
  • (diagram: one VM between two virtualization containers, each with its own librbd and librados, so the VM can move between hosts)
  • (diagram: a bare-metal host accessing the cluster through KRBD, the kernel module, and librados)
  • RBD: RADOS Block Device
    ● Replicated, reliable, high-performance virtual disk
    ● Allows decoupling of VMs and containers
      ● Live migration!
    ● Images are striped across the cluster
    ● Snapshots!
    ● Native support in the Linux kernel
      ● /dev/rbd1
    ● librbd allows easy integration
  • HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
  • (diagram: instant copy – the original image occupies 144 units of storage; its four new copies occupy 0, for 144 in total)
  • (diagram: writes – a client writes to its own copy; the original still occupies 144, the copy grows by 4, for 148 in total)
  • (diagram: reads – the client reads its 4 units of new data from its copy and falls through to the original 144 for the rest; still 148 in total)
  • current RBD integration
    ● native Linux kernel support
      ● /dev/rbd0, /dev/rbd/<poolname>/<imagename>
    ● librbd
      ● user-level library
    ● Qemu/KVM
      ● links to librbd user-level library
    ● libvirt
      ● librbd-based storage pool
      ● understands RBD images
      ● can only start KVM VMs... :-(
    ● CloudStack, OpenStack
  • what about Xen?
    ● Linux kernel driver (i.e. /dev/rbd0)
      ● easy fit into existing stacks
      ● works today
      ● need recent Linux kernel for dom0
    ● blktap
      ● generic kernel driver, userland process
      ● easy integration with librbd
      ● more featureful (cloning, caching), maybe faster
      ● doesn't exist yet!
    ● rbd-fuse
      ● coming soon!
  • libvirt
    ● CloudStack, OpenStack
    ● libvirt understands rbd images, storage pools
      ● xml specifies cluster, pool, image name, auth
    ● currently only usable with KVM
    ● could configure /dev/rbd devices for VMs
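    As an illustration of the XML the slide refers to, a hedged sketch of a libvirt disk element backed by an RBD image (pool, image, and monitor hostnames are placeholders, and the auth element is omitted):

      <disk type='network' device='disk'>
        <!-- 'rbd/vm-disk-1' and the monitor hostname are placeholders -->
        <driver name='qemu' type='raw'/>
        <source protocol='rbd' name='rbd/vm-disk-1'>
          <host name='mon1.example.com' port='6789'/>
        </source>
        <target dev='vda' bus='virtio'/>
      </disk>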
  • librbd
    ● management
      ● create, destroy, list, describe images
      ● resize, snapshot, clone
    ● I/O
      ● open, read, write, discard, close
    ● C, C++, Python bindings
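    A short sketch of the management and I/O calls listed above, through the Python bindings (the image name is hypothetical and the sizes arbitrary):

      import rados
      import rbd

      cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
      cluster.connect()
      ioctx = cluster.open_ioctx('rbd')

      image = rbd.Image(ioctx, 'vm-disk-1')   # open
      image.write(b'\x00' * 4096, 0)          # write 4 KiB at offset 0
      head = image.read(0, 4096)              # read it back
      image.discard(0, 4096)                  # discard (punch a hole)
      image.resize(20 * 1024 ** 3)            # grow the image to 20 GiB
      image.close()                           # close

      ioctx.close()
      cluster.shutdown()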
  • RBD roadmap
    ● locking
      ● fence failed VM hosts
    ● clone performance
    ● KSM (kernel same-page merging) hints
    ● caching
      ● improved librbd caching
      ● kernel RBD + bcache to local SSD/disk
  • why
    ● limited options for scalable open source storage
    ● proprietary solutions
      ● marry hardware and software
      ● expensive
      ● don't scale (out)
    ● industry needs to change
  • who we are
    ● Ceph created at UC Santa Cruz (2007)
    ● supported by DreamHost (2008-2011)
    ● Inktank (2012)
    ● growing user and developer community
    ● we are hiring
      ● C/C++/Python developers
      ● sysadmins, testing engineers
      ● Los Angeles, San Francisco, Sunnyvale, remote
    http://ceph.com/