VMFS Introduction


Bergwolf@linuxfb.org
Agenda

ESX Introduction
VMFS Design Goals
VMFS Architecture
SAN Impact
Conclusion
ESX System Setup
Guest Memory Layers


Shadow page tables (VA-MA).
Page sharing (BA-MA).
ESX IO Stack

An average IO request involves only offset remapping.
Use Case

Small number of files (30~100 per VM)
Files either very small (~a few KBs), or very
large (many GBs)
SAN storage is the underlying substrate.
All storage exported by these storage systems
is shared among all ESX servers
Design Goals

Metadata overhead should be very low.
VM IO throughput and latency should be as good as on a directly attached raw device.
A clustered lock manager moderates access to files among ESX servers.
Help VMs react deterministically to transient and non-transient SAN events and error conditions.
VMFS Architecture
A volume is an aggregation of resources and on-disk locks.
A resource is either an inode, a file block, a sub-block or an indirect block.
Each lock moderates access to a subset of resources.
Hosts negotiate access to resources by acquiring the relevant locks.
VMFS = a clustered lock manager + a resource manager + a journaling module + a data mover + a VM IO manager + a POSIX system call front end
VMkernel Logical Volume

VMFS volumes are created inside VMkernel logical volumes by default. A VMkernel logical volume can span multiple devices.
VMFS On-disk Layout
Four Resources

  file blocks
  sub-blocks
  pointer blocks
  file descriptors

Resources are grouped together into collections called
  CLUSTERs and clusters are further grouped together
  into CLUSTER GROUPS.
Block Mapping

 Packed inside inode
 Sub block addressing
 File block addressing
 Pointer block addressing

A file's addressing mode can be upgraded automatically.
System Files

System files are created at file system format time, and each manages one type of resource.
System Files

Use file blocks.
Same read/write method as regular files.
Checking file data consistency essentially
provides metadata consistency.
Cluster Groups
Cluster groups are repeated to create a file system.
An existing VMFS volume grows over unused space on the disk or spans new disks by laying out new cluster groups that refer to the newly added space.
The VMFS resource manager makes hosts operate on different and distant cluster groups within a system file. This reduces the chance of multiple hosts contending on the same lock(s) and increases the efficiency of the clustered lock manager.
On-disk Lock

A single-sector data structure.
Locking is based on leases.
Atomic disk operations (SCSI reserve, read-modify-write, SCSI release).
On-disk Lock Data Structure
HostID: A 128-bit unique identifier of the ESX host that owns the lock at a given point in time. All zeros means no owner.
Mode: A set of non-zero values that indicate whether a lock is free, held exclusively, held by multiple hosts for shared read access, or held by multiple hosts for shared read and write access.
Generation: A monotonically increasing counter, updated every time a lock is acquired, released or broken. While the hostID field sufficiently disambiguates operations on a lock from different hosts, this field disambiguates multiple operations on a lock by the same host.
HBregion: For each valid hostID (if any) currently using the lock, a pointer to the on-disk heartbeat region of the host.
HBgen: A generation number to validate the HBregion reference as being current or stale. It disambiguates locks held by a given host before and after a host crash and before and after a storage outage.
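
The fields above can be pictured as a fixed-size record padded to one sector. The following is a minimal sketch, not the real VMFS on-disk format: the 512-byte sector size, the field widths other than the 128-bit hostID, the field order and the numeric mode values are all assumptions made for illustration. It also models a single owner, whereas the slide allows several hostIDs for shared modes.

```python
import struct
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumed; the slides only say "a single sector"

# Hypothetical layout: 16-byte hostID, 4-byte mode, 8-byte generation,
# 8-byte heartbeat region pointer, 8-byte heartbeat generation.
_LOCK_FORMAT = "<16sIQQQ"

# Hypothetical non-zero mode values.
MODE_FREE = 1
MODE_EXCLUSIVE = 2
MODE_SHARED_READ = 3
MODE_SHARED_READ_WRITE = 4

NO_OWNER = bytes(16)  # all zeros means no owner (from the slide)


@dataclass
class OnDiskLock:
    host_id: bytes = NO_OWNER
    mode: int = MODE_FREE
    generation: int = 0
    hb_region: int = 0
    hb_gen: int = 0

    def pack(self) -> bytes:
        """Serialize the lock and pad it to exactly one sector."""
        body = struct.pack(_LOCK_FORMAT, self.host_id, self.mode,
                           self.generation, self.hb_region, self.hb_gen)
        return body.ljust(SECTOR_SIZE, b"\0")

    @classmethod
    def unpack(cls, sector: bytes) -> "OnDiskLock":
        """Rebuild a lock from the first bytes of a sector image."""
        size = struct.calcsize(_LOCK_FORMAT)
        return cls(*struct.unpack(_LOCK_FORMAT, sector[:size]))

    def is_free(self) -> bool:
        return self.mode == MODE_FREE and self.host_id == NO_OWNER
```

Keeping the whole lock inside one sector is what lets a single-sector read or write observe or replace the lock state in one IO.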
On-disk Heartbeat

A single-sector data structure.
Every host accessing a VMFS volume acquires a heartbeat on disk to declare liveness to other hosts.
Heartbeats are allocated from a 1MB reserved region of the volume. Up to 2048 hosts can access the volume concurrently.
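
The 2048-host figure is consistent with one heartbeat per sector in the 1MB region. A small sketch of that arithmetic, assuming 512-byte sectors (the slides do not state the sector size) and hypothetical helper names:

```python
HB_REGION_BYTES = 1 * 1024 * 1024  # 1MB reserved region (from the slide)
SECTOR_SIZE = 512                  # assumed sector size


def heartbeat_slot_count(region_bytes=HB_REGION_BYTES, sector_size=SECTOR_SIZE):
    """Number of single-sector heartbeat slots that fit in the region."""
    return region_bytes // sector_size


def heartbeat_slot_offset(slot, region_start=0, sector_size=SECTOR_SIZE):
    """Byte offset of a heartbeat slot, given where the region starts."""
    if not 0 <= slot < heartbeat_slot_count(sector_size=sector_size):
        raise ValueError("slot out of range")
    return region_start + slot * sector_size
```

With these assumptions, `heartbeat_slot_count()` evaluates to 2048, matching the slide.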
HB Failure Handling

Hosts are free to break a lock if the owner's heartbeat timestamp does not change for 20 seconds. A host should replay the journal when taking over a stale lock.
If a host fails to update its heartbeat timestamp within five HB periods (about 15 sec and 40 HB IO tries), it fences itself and aborts all in-flight IOs.
The lock manager tries to rejoin the cluster and reclaim its HB slot if the IO error is not permanent.
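
The two timeouts above can be sketched as a pair of checks: one run by a host that watches another host's heartbeat, and one run by a host on its own heartbeat. This is an illustration only. The function and class names are hypothetical, and the 3-second heartbeat period is inferred from "five HB periods (about 15 sec)" rather than stated in the slides.

```python
from dataclasses import dataclass

LEASE_TIMEOUT_SEC = 20.0  # from the slide
HB_PERIOD_SEC = 3.0       # assumed: five periods are "about 15 sec"
MAX_MISSED_PERIODS = 5    # from the slide


@dataclass
class HeartbeatObservation:
    timestamp: int      # heartbeat timestamp value read from disk
    observed_at: float  # local clock when that value was first seen


def observe(prev, current_timestamp, now):
    """Track when the remote heartbeat value last changed."""
    if prev is None or prev.timestamp != current_timestamp:
        return HeartbeatObservation(current_timestamp, now)
    return prev


def may_break_lock(obs, now, timeout=LEASE_TIMEOUT_SEC):
    """True once the remote heartbeat has been unchanged for the lease time."""
    return now - obs.observed_at >= timeout


def should_self_fence(last_successful_update, now):
    """True once this host has missed five heartbeat periods in a row."""
    return now - last_successful_update >= HB_PERIOD_SEC * MAX_MISSED_PERIODS
```

Because the self-fencing window (about 15 sec) is shorter than the lock-breaking window (20 sec), a host that has lost storage access stops issuing IO before another host is allowed to break its locks.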
On-disk Lock & HB

Each host can join a cluster by acquiring an on-disk HB.
It can also hold thousands of on-disk locks.
Journaling

Each host maintains its own journal on the
volume.
HB region on disk stores journal location.
Transaction State Machine
Optimistic Locking

All hosts in a VMFS cluster generally operate on mutually exclusive subsets of locks on the volume.
A host that is interested in acquiring a given lock will typically find it to be free on disk.
Instead of acquiring all locks up front, a host first reads all the locks it needs; if they are free, it modifies metadata in memory, then upgrades the locks and commits.
Transaction State Machine w/ op lock
Upgrade Lock
1: reserve disk;
2: issue asynchronous (async) reads of all
required locks;
3: if any lock is acquired by remote host,
abort and fall back to normal TSM;
4: issue async writes of all required locks;
5: wait for all async writes to complete;
6: release disk;
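
The six steps above can be sketched against an in-memory stand-in for the shared disk. This is a simplified illustration, not VMFS code: `FakeDisk` and `upgrade_locks` are hypothetical names, the SCSI reservation is modeled with a mutex, and the asynchronous reads and writes of steps 2, 4 and 5 are done synchronously.

```python
import threading


class FakeDisk:
    """In-memory stand-in for a shared LUN (hypothetical)."""

    def __init__(self, lock_owners):
        self.lock_owners = dict(lock_owners)  # lock name -> owner or None
        self._reservation = threading.Lock()  # models the SCSI reservation

    def reserve(self):
        self._reservation.acquire()

    def release(self):
        self._reservation.release()


def upgrade_locks(disk, host, required):
    """Return True if all required locks were upgraded, False to fall back."""
    disk.reserve()                                             # step 1
    try:
        owners = {n: disk.lock_owners[n] for n in required}    # step 2
        if any(o not in (None, host) for o in owners.values()):
            return False                                       # step 3
        for name in required:                                  # steps 4-5
            disk.lock_owners[name] = host
        return True
    finally:
        disk.release()                                         # step 6
```

A `False` return corresponds to step 3: the caller would abort the optimistic transaction and fall back to the normal transaction state machine.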
Adaptive SAN-aware retries

For some SAN errors, instead of letting the guest OS retry the IO, the VMkernel retries the IO itself after an optimal delay.
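
A rough sketch of the idea: the layer below the guest catches errors it knows to be transient and retries after a delay chosen per error kind. The error kinds, the delay values and all names here are hypothetical; the slides do not say which SAN errors are retried or how the delay is chosen.

```python
import time


class TransientSanError(Exception):
    """Hypothetical error carrying a SAN condition name."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


# Hypothetical per-condition retry delays in seconds.
RETRY_DELAY_SEC = {"device_busy": 0.05, "path_failover": 1.0}


def issue_with_retries(do_io, max_attempts=5, sleep=time.sleep):
    """Run do_io(), retrying known transient errors after a fixed delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_io()
        except TransientSanError as err:
            delay = RETRY_DELAY_SEC.get(err.kind)
            if delay is None or attempt == max_attempts:
                raise  # unknown condition or out of attempts: surface it
            sleep(delay)
```

The guest only sees an error when the condition is unknown or the retries are exhausted.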
Data Mover

clone(srcFileHandle, srcFileOffset,
dstFileHandle, dstFileOffset, length, policies)
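
To make the signature concrete, here is a minimal sketch of a copy routine with the same parameters, written against ordinary seekable file objects. It is not the VMFS data mover: the chunked loop and the `chunk_size` key in `policies` are assumptions, since the slides do not describe what policies contain.

```python
import io


def clone(src, src_offset, dst, dst_offset, length, policies=None):
    """Copy length bytes from src to dst; return the number of bytes copied."""
    chunk = (policies or {}).get("chunk_size", 64 * 1024)  # hypothetical policy
    copied = 0
    while copied < length:
        src.seek(src_offset + copied)
        data = src.read(min(chunk, length - copied))
        if not data:
            break  # source ended before length bytes were available
        dst.seek(dst_offset + copied)
        dst.write(data)
        copied += len(data)
    return copied


# Usage with in-memory files standing in for file handles.
source = io.BytesIO(b"0123456789")
target = io.BytesIO(b"xxxxxxxxxx")
clone(source, 2, target, 5, 4, {"chunk_size": 3})
```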
Directive SCSI CMD

operator(VMID, source_blocklist, destination_blocklist)
Operators: zero, clone, delete
Directive SCSI CMD

atomic_test_and_set(block_number, old_image, new_image)
New lock algorithm for the VMFS lock manager: read a lock image from disk and, if the lock is free, issue an atomic_test_and_set with a new_image containing the host-specific hostID, generation and heartbeat information.
This cuts a lock acquisition from 4 IOs (reserve, read, write, release) to 2 IOs (read, atomic_test_and_set).
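
The two-IO acquisition can be sketched with a compare-and-write primitive on a simulated device. `FakeLun` and `try_acquire` are hypothetical names; on real hardware the comparison and the write would be performed atomically by the storage array, which the mutex only imitates here.

```python
import threading


class FakeLun:
    """In-memory block device with an atomic compare-and-write (hypothetical)."""

    def __init__(self, nblocks, block_size=512):
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
        self._mutex = threading.Lock()

    def read(self, block_number):
        return self.blocks[block_number]

    def atomic_test_and_set(self, block_number, old_image, new_image):
        """Write new_image only if the block still equals old_image."""
        with self._mutex:
            if self.blocks[block_number] != old_image:
                return False
            self.blocks[block_number] = new_image
            return True


def try_acquire(lun, block_number, free_image, new_image):
    """Acquire a lock in two IOs: one read, one atomic_test_and_set."""
    current = lun.read(block_number)                  # IO 1
    if current != free_image:
        return False                                  # lock is not free
    return lun.atomic_test_and_set(block_number, current, new_image)  # IO 2
```

If another host takes the lock between the read and the atomic_test_and_set, the comparison fails and the caller sees `False` without needing a disk reservation.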
Performance
