vmfs intro


Published in: Technology, Business

  1. 1. VMFS Introduction. Bergwolf@linuxfb.org
  2. 2. Agenda: ESX Introduction; VMFS Design Goals; VMFS Architecture; SAN Impact; Conclusion
  3. 3. ESX System Setup
  4. 4. Guest Memory Layers: Shadow page tables (VA-MA). Page sharing (BA-MA).
  5. 5. ESX IO Stack: An average IO request involves only offset remapping.
  6. 6. Agenda: ESX Introduction; VMFS Design Goals; VMFS Architecture; SAN Influence and Impact; Conclusion
  7. 7. Use Case: A small number of files (30~100 per VM). Files are either very small (a few KB) or very large (many GB). SAN storage is the underlying substrate. All storage exported by these storage systems is shared among all ESX servers.
  8. 8. Design Goals: Metadata overhead should be very low. VM IO throughput and latency should be as good as on a directly attached raw device. A clustered lock manager moderates access to files among ESX servers. Help VMs react deterministically to transient and non-transient SAN events and error conditions.
  9. 9. Agenda: ESX Introduction; VMFS Design Goals; VMFS Architecture; SAN Influence and Impact; Conclusion
  10. 10. VMFS Architecture: A volume is an aggregation of resources and on-disk locks. A resource is either an inode, a file block, a sub-block or an indirect block. Each lock moderates access to a subset of resources. Hosts negotiate access to resources by acquiring the relevant locks. VMFS = a clustered lock manager + a resource manager + a journaling module + a data mover + a VM IO manager + a POSIX system call frontend.
  11. 11. VMKernel Logical Volume: VMFS volumes are by default created inside VMKernel logical volumes. A VMKernel logical volume can span multiple devices.
  12. 12. VMFS on disk Layout
  13. 13. Four Resources: file blocks, sub-blocks, pointer blocks, file descriptors. Resources are grouped together into collections called clusters, and clusters are further grouped together into cluster groups.
  14. 14. Block Mapping: Packed inside the inode; sub-block addressing; file block addressing; pointer block addressing. The addressing mode can upgrade automatically.
  15. 15. System Files: System files are created at file system format time, and each manages one type of resource.
  16. 16. System Files: Use file blocks. Same read/write method as regular files. Checking file data consistency essentially provides metadata consistency.
  17. 17. Cluster Groups: Cluster groups are repeated to create a file system. An existing VMFS volume grows over unused space on the disk or spans new disks by laying out new cluster groups that refer to the newly added space. The VMFS resource manager makes hosts operate on different and distant cluster groups within a system file. This reduces the possibility of multiple hosts contending on the same lock(s) and increases the efficiency of the clustered lock manager.
  18. 18. On-disk Lock: A single-sector data structure. Locking is lease-based. Atomic disk operations (SCSI reserve, read, modify, write, SCSI release).
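
The reserve, read, modify, write, release sequence on this slide can be sketched as follows. This is only a minimal illustration under stated assumptions: `FakeDisk` is a hypothetical in-memory stand-in for a shared LUN, a `threading.Lock` models the SCSI reservation, and the lease aspect is not modeled.

```python
import threading


class FakeDisk:
    """In-memory stand-in for a shared LUN (hypothetical, for illustration)."""

    def __init__(self, num_sectors):
        self.sectors = [None] * num_sectors   # None models a free lock
        self._reservation = threading.Lock()  # models the SCSI reservation
        self.io_count = 0                     # counts commands sent to the disk

    def reserve(self):
        self._reservation.acquire()
        self.io_count += 1

    def release(self):
        self.io_count += 1
        self._reservation.release()

    def read(self, sector):
        self.io_count += 1
        return self.sectors[sector]

    def write(self, sector, value):
        self.io_count += 1
        self.sectors[sector] = value


def acquire_lock(disk, sector, host_id):
    """Reserve, read, modify, write, release. Returns True if the lock was taken."""
    disk.reserve()
    try:
        if disk.read(sector) is not None:
            return False  # another host already owns the lock
        disk.write(sector, host_id)
        return True
    finally:
        disk.release()
```

A successful acquisition in this model costs four disk commands, which is the baseline that the later atomic_test_and_set slide compares against.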
  19. 19. On-disk Lock Data Structure: HostID: a 128-bit unique identifier for the ESX host that owns the lock at a given point in time. All zeros means no owner. Mode: a set of non-zero values that indicate whether a lock is free, held exclusively, held by multiple hosts for shared read access, or held by multiple hosts for shared read and write access. Generation: a monotonically increasing counter, updated every time a lock is acquired, released or broken. While the hostID field sufficiently disambiguates operations on a lock from different hosts, this field disambiguates multiple operations on a lock by the same host. HBregion: for each valid hostID (if any) currently using the lock, a pointer to the on-disk heartbeat region of the host. HBgen: a generation number to validate the HBregion reference as being current or stale. It disambiguates locks held by a given host before and after a host crash and before and after a storage outage.
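
As a rough illustration of these five fields, the sketch below packs a lock record into one sector. The sector size, the field widths other than the 128-bit hostID, the numeric mode values, and the single-owner simplification (the slide allows one HBregion per owning host) are all assumptions, not the real on-disk format.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum

SECTOR_SIZE = 512     # assumption: the slides only say "a single sector"
NO_OWNER = bytes(16)  # all-zero 128-bit hostID means "no owner"


class LockMode(IntEnum):
    # Hypothetical encodings; the slides only say the values are non-zero.
    FREE = 1
    EXCLUSIVE = 2
    SHARED_READ = 3
    SHARED_READ_WRITE = 4


# hostID (16 bytes), mode, generation, HBregion, HBgen; widths are assumed.
_LAYOUT = struct.Struct("<16sIQQQ")


@dataclass
class LockRecord:
    host_id: bytes = NO_OWNER
    mode: LockMode = LockMode.FREE
    generation: int = 0
    hb_region: int = 0  # on-disk address of the owner's heartbeat slot
    hb_gen: int = 0

    def to_sector(self) -> bytes:
        """Serialize the record and zero-pad it to one full sector."""
        packed = _LAYOUT.pack(self.host_id, int(self.mode), self.generation,
                              self.hb_region, self.hb_gen)
        return packed.ljust(SECTOR_SIZE, b"\x00")

    @classmethod
    def from_sector(cls, sector: bytes) -> "LockRecord":
        """Parse a record from the start of a sector image."""
        host_id, mode, gen, hb_region, hb_gen = _LAYOUT.unpack_from(sector)
        return cls(host_id, LockMode(mode), gen, hb_region, hb_gen)
```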
  20. 20. On-disk Heartbeat: A single-sector data structure. Every host accessing a VMFS volume acquires a heartbeat on disk to declare liveness to other hosts. Heartbeats are allocated from a 1MB reserved region of the volume, which allows 2048 hosts to access the volume concurrently.
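
The 2048-host figure is consistent with one heartbeat per sector if a sector is 512 bytes. The sector size is an assumption (the slide does not state it), and `hb_slot_offset` is a hypothetical helper name.

```python
HB_REGION_BYTES = 1 * 1024 * 1024  # 1MB reserved region, from the slide
SECTOR_SIZE = 512                  # assumption: not stated on the slide

# One single-sector heartbeat per host.
HB_SLOTS = HB_REGION_BYTES // SECTOR_SIZE


def hb_slot_offset(region_start, slot):
    """Byte offset of a heartbeat slot inside the reserved region."""
    if not 0 <= slot < HB_SLOTS:
        raise ValueError("heartbeat slot out of range")
    return region_start + slot * SECTOR_SIZE
```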
  21. 21. HB Failure Handling: Hosts are free to break a lock if the owner's heartbeat timestamp does not change for 20 seconds. A host should replay the journal when taking a stale lock. If a host fails to update its heartbeat timestamp within five HB periods (about 15 sec and 40 HB IO tries), it fences itself and aborts all in-flight IOs. The lock manager tries to rejoin the cluster if the IO error is not permanent, and reclaims its HB slot.
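
The two timeouts on this slide can be modeled as below. The 20 second and five-period figures come from the slide; the 3 second period is derived from "five HB periods (about 15 sec)", and the class and function names are hypothetical. Times are plain numbers so the logic can be exercised without a real clock.

```python
LOCK_BREAK_TIMEOUT_S = 20.0  # from the slide
MAX_MISSED_PERIODS = 5       # from the slide
HB_PERIOD_S = 3.0            # derived: five periods are "about 15 sec"


class HeartbeatObserver:
    """Tracks a remote host's heartbeat as seen on the local clock."""

    def __init__(self, first_timestamp, now):
        self.last_timestamp = first_timestamp
        self.last_change = now

    def observe(self, timestamp, now):
        """Record a freshly read heartbeat timestamp."""
        if timestamp != self.last_timestamp:
            self.last_timestamp = timestamp
            self.last_change = now

    def may_break_lock(self, now):
        """True once the heartbeat has been unchanged for the full timeout."""
        return now - self.last_change >= LOCK_BREAK_TIMEOUT_S


def must_self_fence(last_successful_update, now):
    """True once the local host has missed five heartbeat periods."""
    return now - last_successful_update >= MAX_MISSED_PERIODS * HB_PERIOD_S
```

Note that in this model a host fences itself after about 15 seconds, before other hosts are allowed to break its locks at 20 seconds.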
  22. 22. On-disk Lock & HB: Each host can join a cluster by acquiring an on-disk HB. It can also hold thousands of on-disk locks.
  23. 23. Journaling: Each host maintains its own journal on the volume. The HB region on disk stores the journal location.
  24. 24. Transaction State Machine
  25. 25. Optimistic Locking: All hosts in a VMFS cluster generally operate on mutually exclusive subsets of locks on the volume. A host that is interested in acquiring a given lock will typically find it to be free on disk. Instead of acquiring all locks up front, a host first reads all the locks; if they are free, it modifies the metadata in memory, then upgrades the locks and commits.
  26. 26. Transaction State Machine w/ op lock
  27. 27. Transaction State Machine w/ op lock, Upgrade Lock: 1: reserve disk; 2: issue asynchronous (async) reads of all required locks; 3: if any lock is acquired by a remote host, abort and fall back to the normal TSM; 4: issue async writes of all required locks; 5: wait for all async writes to complete; 6: release disk.
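
The six upgrade steps can be sketched with a thread pool standing in for asynchronous disk IO. `FakeDisk`, `upgrade_locks` and the use of a `threading.Lock` as the disk reservation are assumptions made for illustration; the sketch treats any recorded owner as a remote host.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDisk:
    """In-memory stand-in for a shared LUN (hypothetical)."""

    def __init__(self):
        self.locks = {}  # sector -> owning host ID; absent means free
        self.reservation = threading.Lock()

    def read(self, sector):
        return self.locks.get(sector)

    def write(self, sector, owner):
        self.locks[sector] = owner


def upgrade_locks(disk, sectors, host_id, pool):
    """Return True if every lock was free and is now owned by host_id."""
    # Steps 1 and 6: the with block reserves the disk and releases it on exit.
    with disk.reservation:
        # Step 2: issue async reads of all required locks.
        reads = [pool.submit(disk.read, s) for s in sectors]
        owners = [f.result() for f in reads]
        # Step 3: abort if any lock is already held.
        if any(owner is not None for owner in owners):
            return False  # caller falls back to the normal TSM
        # Step 4: issue async writes of all required locks.
        writes = [pool.submit(disk.write, s, host_id) for s in sectors]
        # Step 5: wait for all async writes to complete.
        for f in writes:
            f.result()
        return True
```

A caller would create one `ThreadPoolExecutor` and pass it in, for example `upgrade_locks(disk, [1, 2, 3], "host-A", pool)`.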
  28. 28. Agenda: ESX Introduction; VMFS Design Goals; VMFS Architecture; SAN Influence and Impact; Conclusion
  29. 29. Adaptive SAN-aware retries: For some SAN errors, instead of letting the guest OS retry the IO, VMkernel retries the IO after an optimal time.
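
The slide does not say which errors are retried or how the wait is chosen, so the following is only a generic sketch of the idea: the layer below the guest catches a transient error and retries after a delay tied to that error. `TransientSanError`, its `retry_after_s` field and `issue_with_retries` are hypothetical names.

```python
import time


class TransientSanError(Exception):
    """Hypothetical error type carrying a suggested wait before retrying."""

    def __init__(self, message, retry_after_s):
        super().__init__(message)
        self.retry_after_s = retry_after_s


def issue_with_retries(issue_io, max_tries=5, sleep=time.sleep):
    """Call issue_io(), retrying transient SAN errors after their suggested wait.

    The error from the final attempt is re-raised, so the caller (the guest,
    in the slide's terms) only sees failures that outlast the retries.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return issue_io()
        except TransientSanError as err:
            if attempt == max_tries:
                raise
            sleep(err.retry_after_s)
```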
  30. 30. Adaptive SAN-aware retries
  31. 31. Data Mover: clone(srcFileHandle, srcFileOffset, dstFileHandle, dstFileOffset, length, policies)
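
To make the shape of this call concrete, here is a toy version that copies a byte range between two seekable Python file objects. It mirrors the parameter order on the slide; the meaning of `policies` is not given there, so the `chunk_size` key used below is purely an assumption.

```python
import io


def clone(src, src_offset, dst, dst_offset, length, policies=None):
    """Copy up to `length` bytes from src to dst; returns the bytes copied."""
    # Assumed policy: how much data to move per read/write pair.
    chunk_size = (policies or {}).get("chunk_size", 1024 * 1024)
    copied = 0
    while copied < length:
        src.seek(src_offset + copied)
        chunk = src.read(min(chunk_size, length - copied))
        if not chunk:
            break  # source ended before `length` bytes were available
        dst.seek(dst_offset + copied)
        dst.write(chunk)
        copied += len(chunk)
    return copied
```

For example, with `src = io.BytesIO(b"0123456789")`, calling `clone(src, 2, dst, 5, 4)` copies the bytes `2345` to offset 5 of `dst`.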
  32. 32. Data Mover
  33. 33. Directive SCSI CMD: operator(VMID, source_blocklist, destination_blocklist). Zero, clone, delete.
  34. 34. Directive SCSI CMD: atomic_test_and_set(block_number, old_image, new_image). For the VMFS lock manager, the new lock algorithm reads a lock image from disk and, if the lock is free, issues an atomic_test_and_set with a new_image containing the host-specific hostID, generation and heartbeat information. 4 IOs -> 2 IOs.
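
The new lock algorithm can be sketched as one read followed by one atomic_test_and_set. `FakeDisk` is a hypothetical in-memory model, and the lock image is simplified to a tuple of (hostID, generation, heartbeat slot); the real image layout is not shown on the slide.

```python
FREE = (None, 0, None)  # hypothetical image of a lock that was never taken


class FakeDisk:
    """In-memory stand-in for a LUN that supports atomic_test_and_set."""

    def __init__(self):
        self.blocks = {}
        self.io_count = 0  # counts commands sent to the disk

    def read(self, block_number):
        self.io_count += 1
        return self.blocks.get(block_number, FREE)

    def atomic_test_and_set(self, block_number, old_image, new_image):
        """Write new_image only if the block still equals old_image."""
        self.io_count += 1
        if self.blocks.get(block_number, FREE) != old_image:
            return False
        self.blocks[block_number] = new_image
        return True


def acquire_lock(disk, block_number, host_id, hb_slot):
    """Read the lock image, then try to swap in an image naming this host."""
    old_image = disk.read(block_number)
    owner, generation, _ = old_image
    if owner is not None:
        return False  # lock is held by some host
    new_image = (host_id, generation + 1, hb_slot)
    return disk.atomic_test_and_set(block_number, old_image, new_image)
```

In this model an acquisition costs two commands, which presumably corresponds to the slide's "4 IOs -> 2 IOs" when compared with reserve, read, write, release.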
  35. 35. Agenda: ESX Introduction; VMFS Design Goals; VMFS Architecture; SAN Influence and Impact; Conclusion
  36. 36. Performance