
Overview of Sheepdog




  1. Sheepdog Overview
     Liu Yuan, 2013.4.27
  2. Sheepdog – Distributed Object Storage
     ● Replicated shared storage for VMs
     ● Aims to be the most self-managing storage in OSS
       – Self-healing
       – Self-managing
       – No configuration file
       – One-liner setup
     ● Scales out to 1000+ nodes
     ● Integrates well with QEMU/Libvirt/OpenStack
  3. Agenda
     ● Background knowledge
     ● Node management
     ● Data management
     ● Thin-provisioning
     ● Sheepfs
     ● Features from the future
  4. Background Knowledge
     ● VM storage stack
     ● QEMU/KVM stack
     ● Virtual disk
     ● IO request types
     ● Write cache
     ● QEMU snapshot
  5. VM Storage Stack
     ● Guest file system
     ● Guest block driver
     ● QEMU image format
     ● QEMU disk emulation
     ● QEMU format protocol: POSIX file, raw device, Sheepdog, Ceph
     The Sheepdog block driver in QEMU is implemented at the protocol layer:
     ● Supports all of QEMU's image formats
     ● Raw format as default – best performance
     ● Snapshots are supported by the Sheepdog protocol
  6. QEMU/KVM Stack (diagram)
     VCPUs of the VM run on physical CPUs; KVM in the kernel handles
     VM_ENTRY/VM_EXIT. Guest IO requests are signalled to QEMU via eventfd,
     reach the virtual disk emulation, and QEMU's Sheepdog driver sends them
     over the network.
  7. Virtual Disk
     ● Transports
       – ATA, SCSI, Virtio
       – Virtio: designed for VMs – simpler interface, better performance
       – Virtio-scsi: an enhancement of virtio-blk with advanced DISCARD
         support
     ● Write cache
       – Essential for distributed backend storage to boost performance
  8. IO Request Types of the Virtual Disk
     ● Read/Write
     ● Discard
       – The VM's file system (ext4, XFS) transparently tells the underlying
         storage backend to release blocks
     ● FLUSH
       – Ensures dirty data reaches the underlying backend storage
     ● Write Cache Enable (WCE)
       – Lets the VM change the virtual disk's cache mode on the fly
  9. Write Cache
     ● Not a memory cache like the page cache
       – Direct IO (O_DIRECT) bypasses the page cache but not the write cache
       – O_SYNC or fsync(2) flushes the write cache
     ● All modern disks have one, and it is well supported by the OS
     ● Most virtual devices emulate a write cache
       – As safe as a well-behaved hard-disk cache
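The cache layers above can be seen from userspace: a plain write(2) may linger in the page cache and the device's volatile write cache, while fsync(2) or an O_SYNC open forces it to stable storage. A minimal Python sketch of the two flushing styles (illustrative only; it uses a throwaway temporary file):

```python
import os, tempfile

path = tempfile.mktemp()

# Buffered write: the data may sit in the page cache (and, on a real
# disk, in the device's write cache) until the kernel flushes it.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"dirty data")

# fsync(2) pushes the data to the device and asks the device to flush
# its volatile write cache as well.
os.fsync(fd)
os.close(fd)

# O_SYNC makes every write synchronous: os.write() returns only after
# the data, write cache included, is on stable storage.
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"sync data!")
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
print(data)  # b'sync data!'
os.unlink(path)
```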
  10. QEMU Snapshot
     ● Two types of state
       – Memory state (VM state) and disk state
     ● Users can optionally save
       – VM state only
       – VM state + disk state
       – Disk state only
     ● Internal snapshot vs. external snapshot
       – Sheepdog uses external snapshots
  11. Node Management
     ● Node add/delete
     ● Dual NIC
  12. Node Add/Delete
     ● One-liner to add or delete a node
       – Add a node:
         $ sheep /store              # uses corosync, or
         $ sheep /store -c zookeeper:IP
       – Delete a node:
         $ kill sheep
       – Group add/kill is supported
     ● Relies on Corosync or ZooKeeper for
       – Membership change events
       – Cluster-wide ordered messages
  13. (picture slide)
  14. Dual NIC
     ● One NIC for control messages (heart-beat), the other for data transfer
       – If the data NIC goes down, data transfer falls back to the control NIC
       – If the control NIC goes down, the node is considered dead
     ● Single NIC
       – Control and data share it
  15. Data Management
     ● Object management
     ● VM request management
     ● Auto-weighting
     ● Multi-disk
     ● Object cache
     ● Journaling
  16. Object Management
     ● Data are stored as replicated objects
       – An object is a plain fixed-size POSIX file
     ● Objects are auto-rebalanced on node add/delete/crash events
     ● Replicas are auto-recovered
     ● Each VDI can have its own copy (replica) count
     ● Supports SAN-like, SAN-less, or even mixed architectures
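Because objects are fixed-size, a guest IO at a given disk offset maps directly to an object index plus an offset inside that object. A tiny sketch of that mapping, assuming the 4 MB object size commonly cited as Sheepdog's default:

```python
OBJECT_SIZE = 4 << 20  # fixed object size; 4 MB assumed as the default

def offset_to_object(offset: int) -> tuple[int, int]:
    """Map a byte offset in the virtual disk to (object index,
    offset inside that object)."""
    return offset // OBJECT_SIZE, offset % OBJECT_SIZE

# A read 10 MB into the disk lands 2 MB into the third object (index 2).
idx, off = offset_to_object(10 << 20)
print(idx, off)  # 2 2097152
```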
  17. (picture slide)
  18. VM Request Management
     ● Parallel request handling
       – Every node can handle requests concurrently
     ● Requests are served even during node-change events
       – VM requests are prioritized over replica-recovery requests
       – A VM request is retried until it succeeds across node-change events
  19. Auto-weighting
     ● Node storage is auto-weighted
       – Nodes of different sizes store only their proportional share
     ● Uses consistent hashing + virtual nodes
     ● Users can specify the exported space
       – All free space is used by default
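The "consistent hashing + virtual nodes" scheme above can be sketched like this: each node gets virtual nodes in proportion to its capacity, so a node twice as big ends up holding roughly twice the objects. A toy Python version (not Sheepdog's actual hash or placement code):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Any well-mixed hash works for the sketch; md5 is just convenient.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class WeightedRing:
    """Consistent-hash ring with capacity-weighted virtual nodes."""

    def __init__(self, nodes: dict[str, int]):
        # One virtual node per unit of capacity, placed on the ring.
        self.ring = sorted(
            (_hash(f"{name}-{i}"), name)
            for name, capacity in nodes.items()
            for i in range(capacity)
        )
        self.keys = [h for h, _ in self.ring]

    def lookup(self, obj_id: str) -> str:
        # An object belongs to the first virtual node at or after its hash.
        i = bisect.bisect(self.keys, _hash(obj_id)) % len(self.ring)
        return self.ring[i][1]

# sheep-b has twice sheep-a's capacity, hence twice the virtual nodes.
ring = WeightedRing({"sheep-a": 100, "sheep-b": 200})
counts = {"sheep-a": 0, "sheep-b": 0}
for n in range(10000):
    counts[ring.lookup(f"obj-{n}")] += 1
print(counts)  # sheep-b ends up with roughly twice sheep-a's share
```

Virtual nodes also smooth out rebalancing: when a node joins or leaves, only the objects falling in its ring segments move.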
  20. Multi-disk (MD)
     ● A single daemon manages multiple disks
       – $ sheep /disk1,/disk2{,/disk3...}
       – Auto-weighting
       – Auto-rebalance
       – Objects are recovered from other sheep
     ● Simply put, MD = RAID0 + auto-recovery
     ● Eliminates the need for hardware RAID
       – Supports hot-plug/unplug
  21. Object Cache
     ● Sheepdog's write cache for the virtual disk
       – $ sheep -w size=100G /store
       – $ qemu -drive cache={writeback|writethrough|off}
     ● Supports writeback, writethrough, and directio modes
     ● LRU algorithm for reclaiming
     ● Objects are shared between VMs cloned from the same base
     ● Use an SSD for the object cache to get a boost
  22. Object Cache (diagram)
     The VM issues read/write and FLUSH requests to the virtual disk; they go
     through the object cache, which pushes dirty objects to and pulls missing
     objects from the Sheepdog cluster.
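The LRU reclaim mentioned above can be sketched with a toy in-memory cache; Sheepdog's real object cache keeps objects as files and pushes/pulls them against the cluster, but the reclaim policy is the same idea:

```python
from collections import OrderedDict

class ObjectCache:
    """Toy LRU object cache: when full, the least recently used
    object is reclaimed. (Illustrative only.)"""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.objects = OrderedDict()  # obj_id -> data, oldest first

    def access(self, obj_id: str, data=None):
        if obj_id in self.objects:
            self.objects.move_to_end(obj_id)   # mark recently used
            if data is not None:
                self.objects[obj_id] = data    # write hit
            return self.objects[obj_id]
        if data is None:
            data = f"pulled {obj_id}"          # simulate PULL from cluster
        self.objects[obj_id] = data
        if len(self.objects) > self.capacity:
            self.objects.popitem(last=False)   # reclaim the LRU object
        return data

cache = ObjectCache(capacity=2)
cache.access("obj-1")
cache.access("obj-2")
cache.access("obj-1")       # obj-1 becomes most recently used
cache.access("obj-3")       # evicts obj-2, the LRU entry
print(list(cache.objects))  # ['obj-1', 'obj-3']
```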
  23. Journaling
     ● $ sheep -j dir=/path/to/journal /store
     ● Sheepdog uses O_SYNC writes by default
     ● Object writes are fairly random
     ● Log all write operations as append writes to a rotated log file
       – Transforms random writes into sequential writes
       – Object writes can then drop O_SYNC
     ● Boosts performance and avoids partial writes
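The journaling idea, turning random synchronous object writes into one sequential synced append, can be sketched as follows (a toy model: Sheepdog logs raw writes to a rotated log file, not JSON records):

```python
import json
import os
import tempfile

class Journal:
    """Toy write-ahead journal: every object write is first appended
    sequentially to a log file; replaying the log after a crash
    restores the object store. (Sketch of the idea only.)"""

    def __init__(self, path: str):
        self.path = path

    def log_write(self, obj_id: str, data: str):
        # Sequential append + one fsync: the only synchronous IO needed;
        # the actual object write can then drop O_SYNC.
        with open(self.path, "a") as f:
            f.write(json.dumps({"obj": obj_id, "data": data}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self) -> dict:
        # Rebuild object state from the log, e.g. after a crash lost
        # the asynchronous object writes. Later records win.
        store = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                store[rec["obj"]] = rec["data"]
        return store

path = tempfile.mktemp()
j = Journal(path)
j.log_write("obj-1", "v1")
j.log_write("obj-2", "hello")
j.log_write("obj-1", "v2")  # later write wins on replay
recovered = j.replay()
print(recovered)            # {'obj-1': 'v2', 'obj-2': 'hello'}
os.unlink(path)
```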
  24. Thin-provisioning
     ● Sparse volumes
     ● Discard operation
     ● COW snapshots
  25. Sparse Volume
     ● Only one inode object is allocated for a new VDI by default
       – Instant creation of a new VDI
     ● Data objects are created on demand
     ● Users can preallocate data objects
       – Not recommended; the performance gain is very limited
  26. Discard Operation
     ● Releases objects when users delete files inside the VM
     ● Only supported on IDE and virtio-scsi devices
       – CentOS 6.3+, or an OS running a vanilla 3.4+ kernel
       – Requires QEMU 1.5+
  27. Snapshot
     ● Live snapshot (VM state + vdisk)
       – Save the snapshot in Sheepdog:
         QEMU monitor > savevm tag
       – Restore the snapshot on the fly:
         QEMU monitor > loadvm tag
       – Restore the snapshot at boot:
         $ qemu -hda sheepdog -loadvm tag
     ● Live or off-line snapshot (vdisk only)
       – $ qemu-img snapshot sheepdog:disk
  28. Snapshot cont.
     ● Tree-structured snapshots, rooted at a base image
     ● Roll back to any snapshot and make your own branch
  29. Snapshot cont.
     ● All snapshots are COW-based
       – Only an inode object is created for the snapshot
       – Taken instantly
     ● Supports incremental snapshot backup
     ● Read a snapshot from outside the cluster
       – $ collie vdi read -s tag disk
     ● Snapshots are stored in Sheepdog storage, so they are shared by all
       nodes
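Why a COW snapshot is instant: it copies only the inode (the map from object index to data object), never the data objects themselves; a later write then allocates a fresh data object for the writer alone. A toy sketch (not Sheepdog's on-disk format):

```python
class Inode:
    """Toy COW inode: maps object index -> data-object id.
    A snapshot duplicates only this map. (Illustrative only.)"""

    _next_oid = 0  # simple global allocator for fresh data objects

    def __init__(self, table=None, parent=None):
        self.table = dict(table or {})  # obj index -> data object id
        self.parent = parent            # link in the snapshot tree

    def snapshot(self) -> "Inode":
        # Instant: the new inode shares every data object with its parent.
        return Inode(self.table, parent=self)

    def write(self, idx: int, objects: dict, data: str):
        # Copy-on-write: allocate a fresh data object for this inode only,
        # leaving the snapshot's view untouched.
        oid = Inode._next_oid
        Inode._next_oid += 1
        objects[oid] = data
        self.table[idx] = oid

objects = {}
base = Inode()
base.write(0, objects, "original")
snap = base.snapshot()            # instant: only the inode is copied
base.write(0, objects, "changed") # COW: snap still sees the old object
print(objects[snap.table[0]], objects[base.table[0]])  # original changed
```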
  30. Sheepfs
     ● FUSE-based pseudo file system that exports Sheepdog's virtual disks
       – $ sheepfs /mountpoint
     ● Mounts a vdisk into the local file system hierarchy as a block file
       – $ echo vdisk > /mountpoint/vdi/mount
       – Then /mountpoint/volume/vdisk shows up
  31. Features from the Future
     ● Cluster-wide snapshot
       – Useful for backup and inter-cluster VDI migration/sharing
       – Dedup, compression, incremental snapshots
     ● QEMU–SD connection auto-restart
       – Useful for upgrading sheep without stopping the VM
     ● QEMU–SD multi-connection
       – Higher-availability VMs
  32. Thank You