Overview of sheepdog


  1. Sheepdog Overview
     Liu Yuan, 2013.4.27
  2. Sheepdog – Distributed Object Storage
     ● Replicated shared storage for VMs
     ● One of the most autonomous storage systems in open source
       – Self-healing
       – Self-managing
       – No configuration file
       – One-liner setup
     ● Scale-out (1000+ nodes)
     ● Integrates well with QEMU/Libvirt/OpenStack
  3. Agenda
     ● Background knowledge
     ● Node management
     ● Data management
     ● Thin-provisioning
     ● Sheepfs
     ● Features from the future
  4. Background Knowledge
     ● VM storage stack
     ● QEMU/KVM stack
     ● Virtual disk
     ● IO request types
     ● Write cache
     ● QEMU snapshot
  5. VM Storage Stack
     Guest file system → guest block driver → QEMU image format → QEMU disk
     emulation → QEMU format protocol → backend (POSIX file, raw device,
     Sheepdog, Ceph)
     ● The Sheepdog block driver in QEMU is implemented at the protocol layer
       – Supports all QEMU image formats
       – Raw format by default, for the best performance
       – Snapshots are supported by the Sheepdog protocol itself
  6. QEMU/KVM Stack
     (Diagram: guest VCPUs run on physical CPUs via KVM VM_ENTRY/VM_EXIT;
     IO requests are handed to QEMU through eventfd, and the virtual disk
     talks to Sheepdog over the network)
  7. Virtual Disk
     ● Transports
       – ATA, SCSI, Virtio
       – Virtio: designed for VMs (simpler interface, better performance)
       – Virtio-scsi: an enhancement of virtio-blk with advanced DISCARD
         support
     ● Write cache
       – Essential for a distributed storage backend to get good performance
  8. IO Request Types of a VD
     ● Read/Write
     ● Discard
       – The VM's file system (ext4, XFS) transparently tells the underlying
         storage backend to release blocks
     ● Flush
       – Ensures dirty data reaches the underlying backend storage
     ● Write Cache Enable (WCE)
       – Lets the VM change the VD cache mode on the fly
  9. Write Cache
     ● Not a memory cache like the page cache
       – Direct IO (O_DIRECT) bypasses the page cache but not the write cache
       – O_SYNC or fsync(2) flushes the write cache
     ● All modern disks have one, and it is well supported by the OS
     ● Most virtual devices emulate a write cache
       – As safe as a well-behaved hard-disk cache
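The flush semantics on this slide can be sketched in a few lines: a plain write() may sit in a cache, while fsync(2) forces the data to stable storage, which is what a virtual disk's FLUSH request maps to. A minimal illustration, not Sheepdog code:

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)   # may only reach a write cache
        os.fsync(fd)         # flush the cache: data is now durable
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "obj")
durable_write(path, b"hello")
assert open(path, "rb").read() == b"hello"
```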
  10. QEMU Snapshot
      ● Two types of state
        – Memory state (VM state) and disk state
      ● Users can optionally save
        – VM state only
        – VM state + disk state
        – Disk state only
      ● Internal vs. external snapshots
        – Sheepdog chooses external snapshots
  11. Node Management
      ● Node add/delete
      ● Dual NIC
  12. Node Add/Delete
      ● One-liner to add or delete a node
        – Add a node:
          $ sheep /store                  # use corosync, or
          $ sheep /store -c zookeeper:IP
        – Delete a node:
          $ kill sheep
        – Group add/kill is supported
      ● Relies on Corosync or ZooKeeper for
        – Membership change events
        – Cluster-wide ordered messages
  13. (Pic. from http://www.osrg.net/sheepdog/)
  14. Dual NIC
      ● One NIC for control messages (heartbeat), the other for data transfer
        – If the data NIC goes down, data transfer falls back to the control
          NIC
        – But if the control NIC goes down, the node is considered dead
      ● Single NIC
        – Control and data traffic share it
  15. Data Management
      ● Object management
      ● VM request management
      ● Auto-weighting
      ● Multi-disk
      ● Object cache
      ● Journaling
  16. Object Management
      ● Data is stored as replicated objects
        – An object is a plain fixed-size POSIX file
      ● Objects are auto-rebalanced on node add/delete/crash events
      ● Replicas are auto-recovered
      ● Each VDI can use a different copy count
      ● Supports SAN-like, SAN-less, or even mixed architectures
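The object scheme above can be sketched as follows. The 4 MB object size, the id layout, and the placement function are illustrative assumptions, not Sheepdog's actual code; the point is only that a byte offset in a VDI maps to a fixed-size object, which in turn maps to a configurable number of replica nodes:

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # assumed fixed object size (4 MB)

def locate(vdi_id: int, offset: int, nodes: list, copies: int):
    """Map a byte offset in a VDI to its object id and replica nodes."""
    obj_idx = offset // OBJECT_SIZE               # which fixed-size object
    obj_id = (vdi_id << 32) | obj_idx             # toy object-id layout
    h = int.from_bytes(hashlib.sha1(str(obj_id).encode()).digest()[:8], "big")
    start = h % len(nodes)
    # `copies` distinct nodes, so each VDI can use a different copy count
    replicas = [nodes[(start + i) % len(nodes)] for i in range(copies)]
    return obj_id, replicas

obj_id, replicas = locate(vdi_id=7, offset=9 * 1024 * 1024,
                          nodes=["A", "B", "C", "D"], copies=3)
assert obj_id & 0xffffffff == 2   # 9 MB falls in the third 4 MB object
assert len(set(replicas)) == 3    # three distinct replica nodes
```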
  17. (Pic. from http://www.osrg.net/sheepdog/)
  18. VM Request Management
      ● Parallel request handling
        – Every node can handle requests concurrently
      ● Requests are served even during node change events
        – VM requests are prioritized over replica recovery requests
        – VM requests are retried until they succeed across node change
          events
  19. Auto-weighting
      ● Node storage is auto-weighted
        – Nodes of different sizes store only their proportional share
      ● Uses consistent hashing with virtual nodes
      ● Users can specify how much space to export
        – All free space is used by default
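The auto-weighting idea can be illustrated with a toy consistent-hash ring: a bigger node gets proportionally more virtual nodes, so it owns a proportionally larger share of the hash space. This is a sketch of the technique the slide names, not Sheepdog's actual hashing code:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class WeightedRing:
    """Consistent-hash ring with capacity-weighted virtual nodes."""
    def __init__(self, capacities: dict, vnodes_per_unit: int = 64):
        # Each node contributes virtual nodes in proportion to its capacity.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node, cap in capacities.items()
            for i in range(cap * vnodes_per_unit)
        )
        self.keys = [h for h, _ in self.ring]

    def lookup(self, obj_id: str) -> str:
        # An object lands on the first virtual node clockwise from its hash.
        i = bisect.bisect(self.keys, _hash(obj_id)) % len(self.ring)
        return self.ring[i][1]

# A node with twice the capacity should hold roughly twice the objects.
ring = WeightedRing({"small": 1, "big": 2})
hits = {"small": 0, "big": 0}
for n in range(10000):
    hits[ring.lookup(f"obj-{n}")] += 1
assert hits["big"] > hits["small"]
```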
  20. Multi-disk
      ● A single daemon manages multiple disks
        – $ sheep /disk1,/disk2{,/disk3...}
        – Auto-weighting
        – Auto-rebalance
        – Recovers objects from other sheep
      ● Simply put, MD = RAID 0 + auto-recovery
      ● Eliminates the need for hardware RAID
        – Supports hot-plug/unplug
  21. Object Cache
      ● Sheepdog's write cache for the virtual disk
        – $ sheep -w size=100G /store
        – $ qemu -drive cache={writeback|writethrough|off}
      ● Supports writeback, writethrough, and directio modes
      ● LRU algorithm for reclaiming
      ● Objects are shared between VMs cloned from the same base
      ● Use an SSD for the object cache to get a boost
  22. Object Cache (cont.)
      (Diagram: the VM issues reads, writes, and FLUSH requests against the
      virtual disk through the object cache, which pushes and pulls objects
      to and from the Sheepdog cluster)
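The PUSH/PULL flow in the two slides above can be sketched as a small LRU writeback cache: writes land in the cache and are marked dirty, FLUSH pushes dirty objects to the cluster, and reads pull missing objects in. An illustration of the idea only, with a plain dict standing in for the cluster; this is not Sheepdog's implementation:

```python
from collections import OrderedDict

class ObjectCache:
    """Minimal LRU writeback object cache."""
    def __init__(self, cluster: dict, capacity: int):
        self.cluster, self.capacity = cluster, capacity
        self.cache = OrderedDict()   # obj_id -> (data, dirty)

    def write(self, obj_id, data):
        self.cache[obj_id] = (data, True)   # dirty until flushed
        self.cache.move_to_end(obj_id)
        self._reclaim()

    def read(self, obj_id):
        if obj_id not in self.cache:        # miss: PULL from the cluster
            self.cache[obj_id] = (self.cluster[obj_id], False)
        self.cache.move_to_end(obj_id)      # LRU bookkeeping
        self._reclaim()
        return self.cache[obj_id][0]

    def flush(self):                        # handle a FLUSH request
        for obj_id, (data, dirty) in self.cache.items():
            if dirty:
                self.cluster[obj_id] = data # PUSH to the cluster
                self.cache[obj_id] = (data, False)

    def _reclaim(self):
        while len(self.cache) > self.capacity:
            obj_id, (data, dirty) = self.cache.popitem(last=False)
            if dirty:                       # write back before eviction
                self.cluster[obj_id] = data

cluster = {}
cache = ObjectCache(cluster, capacity=2)
cache.write("a", b"1")
cache.write("b", b"2")
cache.write("c", b"3")          # evicts "a", writing it back first
assert cluster["a"] == b"1"
cache.flush()
assert cluster == {"a": b"1", "b": b"2", "c": b"3"}
```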
  23. Journaling
      ● $ sheep -j dir=/path/to/journal /store
      ● Sheepdog uses O_SYNC writes by default
      ● Object writes are fairly random
      ● All write operations are logged as appends to a rotated log file
        – Transforms random writes into sequential writes
        – Object writes can then drop O_SYNC
      ● Boosts performance and avoids partial writes
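The transformation on this slide can be sketched as an append-only log: each random object write becomes one sequential record, and replaying the log reconstructs the writes after a crash, so torn object writes are never observed. The record format and function names here are invented for illustration; this is not Sheepdog's journal layout:

```python
import os
import struct
import tempfile

def journal_append(log, obj_id: int, offset: int, data: bytes):
    # One sequential append per random object write.
    rec = struct.pack("<QQI", obj_id, offset, len(data)) + data
    log.write(rec)
    log.flush()
    os.fsync(log.fileno())   # one sync cost on the sequential log

def journal_replay(path):
    # After a crash, replay the log to reapply the object writes.
    ops = []
    with open(path, "rb") as log:
        while header := log.read(20):
            obj_id, offset, length = struct.unpack("<QQI", header)
            ops.append((obj_id, offset, log.read(length)))
    return ops

log_path = os.path.join(tempfile.mkdtemp(), "journal")
with open(log_path, "ab") as log:
    journal_append(log, obj_id=1, offset=4096, data=b"xx")
    journal_append(log, obj_id=2, offset=0, data=b"yyy")
assert journal_replay(log_path) == [(1, 4096, b"xx"), (2, 0, b"yyy")]
```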
  24. Thin-provisioning
      ● Sparse volume
      ● Discard operation
      ● COW snapshot
  25. Sparse Volume
      ● By default, only one inode object is allocated for a new VDI
        – Instant creation of new VDIs
      ● Data objects are created on demand
      ● Users can preallocate data objects
        – Not recommended; the performance gain is very limited
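The on-demand allocation described above can be sketched like this: the inode is just a map from object index to data object, creating a volume allocates nothing else, a write allocates only the object it touches, and unallocated ranges read back as zeros. Names and layout are illustrative, not Sheepdog's on-disk format:

```python
OBJECT_SIZE = 4 * 1024 * 1024   # assumed fixed object size

class SparseVolume:
    """Toy sparse VDI: data objects are created on first write."""
    def __init__(self, size: int):
        self.size = size
        self.inode = {}          # object index -> data object (bytearray)

    def write(self, offset: int, data: bytes):
        for i, b in enumerate(data):
            idx, off = divmod(offset + i, OBJECT_SIZE)
            obj = self.inode.setdefault(idx, bytearray(OBJECT_SIZE))
            obj[off] = b         # the object is allocated on demand

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray(length)  # unallocated ranges read back as zeros
        for i in range(length):
            idx, off = divmod(offset + i, OBJECT_SIZE)
            if idx in self.inode:
                out[i] = self.inode[idx][off]
        return bytes(out)

vol = SparseVolume(size=10 * 1024 ** 3)   # a "10 GB" volume, instantly
assert len(vol.inode) == 0                # no data objects allocated yet
vol.write(5 * 1024 ** 2, b"hi")
assert len(vol.inode) == 1                # exactly one object, on demand
assert vol.read(5 * 1024 ** 2, 2) == b"hi"
assert vol.read(0, 4) == b"\x00" * 4
```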
  26. Discard Operation
      ● Releases objects when users delete files inside the VM
      ● Only IDE and virtio-scsi devices are supported
        – CentOS 6.3+, or an OS running a vanilla 3.4+ kernel
        – QEMU 1.5+ is required
  27. Snapshot
      ● Live snapshot (VM state + vdisk)
        – Save the snapshot in Sheepdog:
          QEMU monitor> savevm tag
        – Restore the snapshot on the fly:
          QEMU monitor> loadvm tag
        – Restore the snapshot at boot:
          $ qemu -hda sheepdog -loadvm tag
      ● Live or offline snapshot (vdisk only)
        – $ qemu-img snapshot sheepdog:disk
  28. Snapshot (cont.)
      ● Tree-structured snapshots
        (Diagram: a snapshot tree growing from a base image)
      ● Roll back to any snapshot and make your own branch
  29. Snapshot (cont.)
      ● All snapshots are COW-based
        – Only an inode object is created for the snapshot
        – Taken instantly
      ● Supports incremental snapshot backup
      ● Snapshots can be read from outside the cluster
        – $ collie vdi read -s tag disk
      ● Snapshots live in Sheepdog storage, so they are shared by all nodes
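The COW property on these two slides can be sketched in a toy model: taking a snapshot copies only the inode (the object map), so it is instant, and data objects stay shared until a branch overwrites a block, at which point the branch gets its own object. A deliberately simplified illustration, not Sheepdog's data model:

```python
class VDI:
    """Toy COW snapshot tree over a shared object store."""
    _store = {}                  # object id -> data (shared by all VDIs)
    _next_oid = 0

    def __init__(self, inode=None, parent=None):
        self.inode = dict(inode or {})   # block index -> object id
        self.parent = parent

    def write(self, idx, data):
        # Copy-on-write: a write always allocates a fresh object,
        # leaving any snapshot that shares the old object untouched.
        VDI._next_oid += 1
        VDI._store[VDI._next_oid] = data
        self.inode[idx] = VDI._next_oid

    def read(self, idx):
        return VDI._store.get(self.inode.get(idx))

    def snapshot(self):
        # Instant: clone only the inode; all data objects are shared.
        return VDI(inode=self.inode, parent=self)

base = VDI()
base.write(0, b"v1")
snap = base.snapshot()      # instant, shares the object for block 0
snap.write(0, b"v2")        # COW: the branch gets its own object
assert base.read(0) == b"v1"
assert snap.read(0) == b"v2"
```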
  30. Sheepfs
      ● A FUSE-based pseudo file system that exports Sheepdog's virtual disks
        – $ sheepfs /mountpoint
      ● Mounts a vdisk into the local file system hierarchy as a block file
        – $ echo vdisk > /mountpoint/vdi/mount
        – Then /mountpoint/volume/vdisk shows up
  31. Features from the Future
      ● Cluster-wide snapshot
        – Useful for backup and inter-cluster VDI migration/sharing
        – Dedup, compression, incremental snapshots
      ● QEMU-SD connection auto-restart
        – Useful for upgrading sheep without stopping the VM
      ● QEMU-SD multi-connection
        – Higher-availability VMs
  32. Thank You
