1. STO1479BU: vSAN Beyond the Basics
Sumit Lahiri – Product Line Manager
Eric Knauft – Staff Engineer
2. Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
3. Agenda
1 The World of Objects
2 Life of a vSAN Component
3 The 4 Rs of vSAN
4 Multi-Level Fault Domains
5 All-Flash I/O Flow
5. Disk layout in a host
Disk groups contribute to a single vSAN datastore in the vSphere cluster.
• Max 64 nodes
• Min 2 nodes (ROBO)
• Max 5 disk groups per host
• 2 tiers per disk group: cache and capacity
[Diagram: multiple disk groups, each with a cache tier and a capacity tier, pooled into one vSAN datastore.]
6. Creating a VM creates several objects in the background
• Virtual disk (VMDK)
• VM home namespace: VMX, log files
• Virtual memory swap objects
7. From VM to components
[Diagram: a VMDK object is split into components (max size 255 GB each), which in turn consist of blocks (in the low MBs).]
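To make the object-to-component split concrete, here is a minimal sketch. Only the 255 GB maximum component size is taken from the slide; real vSAN placement also weighs stripe width, free capacity, and fault domains, none of which is modeled here.

```python
# Sketch: splitting a VMDK object into fixed-maximum-size components.
# The 255 GB cap comes from the slide; everything else is illustrative.
MAX_COMPONENT_GB = 255

def split_into_components(vmdk_size_gb: int) -> list:
    """Return the sizes (in GB) of the components backing one object."""
    components = []
    remaining = vmdk_size_gb
    while remaining > 0:
        chunk = min(remaining, MAX_COMPONENT_GB)
        components.append(chunk)
        remaining -= chunk
    return components

# A 600 GB VMDK needs three components: 255 + 255 + 90.
print(split_into_components(600))  # [255, 255, 90]
```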
9. Failures to Tolerate (FTT)
FTT is always defined in the context of fault domains: hosts, racks, or sites.
[Diagram: Failures to Tolerate applied at the host, rack, and site level.]
10. Failures to Tolerate (FTT)
If no fault domain is specified, FTT implies host failures to tolerate.
[Diagram: clusters configured with FTT=1, FTT=2, and FTT=3.]
11. Failures to Tolerate (FTT) can be nested
Example: survive one site failure and one host failure on the other site.
[Diagram: nested fault domains across hosts, racks, and sites.]
25. Quorum: in the event of a cluster partition, which partition shall proceed?
[Diagram: the cluster splits into partition-01 (N hosts) and partition-02 (M hosts).]
26. Quorum: the partition with the higher vote count proceeds
Cluster members participate in voting.
[Diagram: partition-01 with N hosts casting N votes; partition-02 with M hosts casting M votes.]
27. If M > N, partition-02 proceeds
Cluster members participate in voting; with M votes against partition-01's N votes, partition-02 proceeds.
[Diagram: partition-02 wins the vote and proceeds.]
29. Quorum is calculated on a per-object basis
• Each component participates in voting
• With two components and no witness, this sums to an even number of votes
[Diagram: a VMDK as a RAID-1 object with two components of 1 vote each and no witness.]
30. Add a witness for the tie-breaker vote
• The witness is added as a tie-breaker vote (see the sketch below)
• It acts as an observer of which component has the latest data
[Diagram: the RAID-1 VMDK now has two components plus a witness, 1 vote each.]
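A hedged sketch of the per-object vote counting described above; the partition names and dictionary representation are illustrative, not vSAN's internal bookkeeping:

```python
# Sketch: per-object quorum with a witness tie-breaker. Each component
# and the witness carries one vote; a partition wins for this object
# only if it holds a strict majority of the object's total votes.
def winning_partition(votes_by_partition: dict, total_votes: int):
    """Return the partition holding > 50% of the object's votes, or None."""
    for partition, votes in votes_by_partition.items():
        if votes > total_votes / 2:
            return partition
    return None  # no partition has quorum for this object

# Mirrored VMDK: component C1 in partition-01; C2 plus the witness W
# in partition-02. Partition-02 holds 2 of 3 votes and proceeds.
print(winning_partition({"partition-01": 1, "partition-02": 2}, 3))
```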
31. For VMDK-A, partition-02 has the higher vote count
[Diagram: VMDK-A's RAID-1 tree spans the partition; partition-02 holds one component plus the witness (2 votes) against partition-01's single component (1 vote), so partition-02 proceeds.]
32. General case: different objects proceed in different partitions
[Diagram: VMDK-A's witness lands in partition-02, so partition-02 proceeds for VMDK-A; VMDK-B's witness lands in partition-01, so partition-01 proceeds for VMDK-B.]
33. Components can be classified as data components and witness components
[Diagram: a RAID-1 VMDK with two data components (no striping) and one witness component, each carrying 1 vote.]
34. What is the minimum number of hosts required to survive N host failures?
35. A minimum of 2N+1 hosts is required to survive N host failures
• If each host represents an equal share of the vote
• The winning partition requires a minimum of N+1 hosts
• Minimum cluster size = 2N+1 hosts to survive N host failures
[Diagram: partition-01 with N hosts (N shares of votes) against the winning partition-02 with N+1 hosts (N+1 shares).]
36. Minimum cluster size is determined by meeting the liveness requirement
• Liveness = (Quorum) && (Availability)
• Min hosts in cluster = max(min hosts for quorum, min hosts for availability)
37. Examples (see the sketch after this list)
• FTT = 1, FTM = RAID-1
– Min hosts for availability = 2
– Min hosts for quorum = 2N+1 = 3
– Min cluster size = 3
• FTT = 2, FTM = RAID-1
– Min hosts for availability = 3
– Min hosts for quorum = 2N+1 = 5
– Min cluster size = 5
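Under the liveness rule on the previous slide, these examples reduce to one line for RAID-1. A minimal sketch, assuming (as the examples imply) that RAID-1 needs FTT+1 replica hosts for availability:

```python
# Sketch: minimum cluster size for RAID-1 mirroring, per the liveness
# rule above: the cluster must satisfy the larger of the two minimums.
def min_cluster_size_raid1(ftt: int) -> int:
    min_for_availability = ftt + 1   # one replica per tolerated failure, plus one
    min_for_quorum = 2 * ftt + 1     # 2N+1 hosts to survive N host failures
    return max(min_for_quorum, min_for_availability)

assert min_cluster_size_raid1(1) == 3  # FTT=1 -> 3 hosts
assert min_cluster_size_raid1(2) == 5  # FTT=2 -> 5 hosts
```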
43. It is possible to have quorum but no availability
[Diagram: a RAID-1 over RAID-0 VMDK split by a partition; the winning side holds a majority of votes (quorum ✓) but not one complete RAID-0 copy of the data, so the object has quorum without availability.]
48. Object states: an object can be "not compliant" but accessible
• Compliance status: are all replicas good?
• Operational status: is the object accessible?
[Diagram: a RAID-1 over RAID-0 VMDK spread across esxi-01, esxi-02, and esxi-03, with a witness.]
49. Object states: an object can be "not compliant" but accessible
• Compliance status: are all replicas good?
• Operational status: is the object accessible? Accessible implies liveness
Component states (see the sketch after this list):
• Active = known good
• Degraded = known bad, rebuild now
• Absent = known bad, cause not known, repair after 60 minutes
• Stale = active, but needs an update
[Diagram: the same RAID tree as the previous slide.]
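The four component states and their repair policies map naturally to a small table. A minimal sketch; the names are illustrative, and only the 60-minute delay and the immediate-rebuild rule come from the slide:

```python
# Sketch: component states and their repair policy, as described above.
from enum import Enum

class ComponentState(Enum):
    ACTIVE = "known good"
    DEGRADED = "known bad, rebuild now"
    ABSENT = "known bad, cause unknown, repair after delay"
    STALE = "active but needs an update (resync)"

REPAIR_DELAY_MIN = {
    ComponentState.DEGRADED: 0,   # repair immediately
    ComponentState.ABSENT: 60,    # wait: the component may come back
}

def rebuild_delay_minutes(state: ComponentState):
    """Minutes to wait before rebuilding, or None if no rebuild is needed."""
    return REPAIR_DELAY_MIN.get(state)

assert rebuild_delay_minutes(ComponentState.ABSENT) == 60
assert rebuild_delay_minutes(ComponentState.DEGRADED) == 0
```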
50. The 4 Rs: Resync, Rebuild, Repair, and Reconfiguration
• A VMDK is divided into components
• Components consist of data blocks
• Each component is on a different host
• Each data block is of fixed size
Partial resync (sketched below)
• Copy data to stale components
• Used when a component comes back from being absent
Repair / reconfigure
• Build a fresh component (a full resync)
[Diagram: a RAID-1 VMDK with components C1…C4; a partial resync copies only stale blocks to an active-stale component, while a repair builds out a whole component to replace a degraded one.]
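A sketch of how the partial-resync vs. full-rebuild split above could be expressed. Tracking dirty blocks with a plain set is purely illustrative:

```python
# Sketch: partial resync vs. full rebuild. A stale component already
# holds most blocks, so only the blocks written while it was away are
# copied; a degraded component is rebuilt from scratch on another host.
def blocks_to_copy(state: str, all_blocks: set, stale_blocks: set) -> set:
    if state == "active-stale":
        return stale_blocks          # partial resync: only what changed
    if state == "degraded":
        return all_blocks            # full resync: build a fresh component
    return set()                     # active components need nothing

blocks = {f"blk{i}" for i in range(1024)}
dirty = {"blk7", "blk42"}            # written while the component was absent
assert blocks_to_copy("active-stale", blocks, dirty) == dirty
assert blocks_to_copy("degraded", blocks, dirty) == blocks
```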
51. Resync / reconfiguration triggers
• A partition resolves: components are active-stale, and only some blocks are resynced / rebuilt
• Storage policies change: components are rebuilt
[Diagram: a disk group with cache and capacity tiers; a partial block resync for an active-stale component versus a full build-out of a degraded one.]
53. Begin: all components and elements are in the Active state
Tolerate one host failure with RAID-1.
[Diagram: a RAID-1 object with components C1 and C2 on separate hosts plus a witness, all in the Active state.]
54. The cluster partitions with unknown cause; components go Absent
• Cluster partition, cause unknown: do not repair immediately
• Absent = known bad, but the cause is not known
• The object is not compliant but accessible
[Diagram: the cluster splits into partition-1 and partition-2; the components cut off from the object's owner are marked Absent.]
55. The partition with both availability and quorum proceeds
The VM fails over (vSphere HA) to partition-2, which has both quorum and availability; partition-1 has availability of its local component but no quorum.
[Diagram: partition-2 proceeds; the component stranded in partition-1 is Absent from the winning side's view.]
56. The partition is resolved; the component is resynced
• The returning component is marked Active-Stale; the object is not compliant
• The Active-Stale component is resynced
[Diagram: stale blocks are copied from the up-to-date replica to the Active-Stale component.]
57. All components and elements are back in the Active state
All components are Active; the object is compliant and accessible.
[Diagram: components C1, C2, and the witness, all Active again.]
59. Absent components repair after 60 minutes
If the components stay Absent, a resync is started after 60 minutes from partition-2, which holds the most recent data.
[Diagram: partition-1 holds the Absent components; partition-2 holds the current replica and witness.]
60. Degraded components repair immediately
A hardware failure causes the Degraded state: known bad, resync now.
[Diagram: the failed host's components are marked Degraded; the surviving replica and witness stay Active.]
61. Fresh components are resynced from existing components
• Find another host and begin the resync; the new components are in the Reconfiguring state
• The object state is not compliant but accessible
[Diagram: fresh components are built on another host from the surviving replica while the Degraded components remain.]
62. The object is compliant again
Once the resync completes, the Degraded components are marked for deletion and removed.
[Diagram: all components Active again; the Degraded components are removed.]
69. Failures to Tolerate (FTT) can be nested
Example: survive one site failure and one host failure on the other site.
[Diagram: nested fault domains across hosts, racks, and sites.]
70. Stretched cluster deployment with local fault protection
• In the prior examples, the host was the fault domain
• 2 levels of fault domains
– Site and host
• Failures to tolerate at each level
[Diagram: two sites connected by a 5 ms RTT, 10GbE link; RAID-1 across the sites, RAID-5 within each site, and a 3rd site for the witness.]
71. RAID tree for a stretched cluster with local fault protection
[Diagram: a RAID-1 node at the top mirrors across Site-1 and Site-2; each site holds a RAID-5 node with data components D1, D2, D3 and parity P1.]
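The nested tree above can be sketched as plain data. The node shapes here are illustrative, not vSAN's actual object layout format:

```python
# Sketch: the stretched-cluster RAID tree as nested tuples
# (raid_level, children). Leaves are (component, site) pairs.
raid_tree = ("RAID-1", [
    ("RAID-5", [("D1", "site-1"), ("D2", "site-1"),
                ("D3", "site-1"), ("P1", "site-1")]),
    ("RAID-5", [("D1", "site-2"), ("D2", "site-2"),
                ("D3", "site-2"), ("P1", "site-2")]),
])

def leaves(node):
    """Yield every component in the tree, depth-first."""
    kind, children = node
    for child in children:
        if isinstance(child[1], list):
            yield from leaves(child)   # inner RAID node: recurse
        else:
            yield child                # (component, site) leaf

print(list(leaves(raid_tree)))  # 8 components, 4 per site
```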
74. Anatomy of a write: from site-1 to site-2
1. The write is issued
2a. Update local data and parity in the site-1 RAID-5 leg
2b. Send only the data across sites to the remote helper RAID tree (proxy owner)
3. The remote side calculates its own parity
[Diagram: the RAID-1 object mirrors across two RAID-5 legs (D1–D3 plus P1 per site); only data, not parity, crosses the inter-site link.]
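Step 2a relies on the standard RAID-5 XOR identity, which is also why only the new data needs to cross the link in step 2b: each site can recompute its own parity locally. A minimal sketch with byte strings (general RAID-5 math, not vSAN-specific code):

```python
# Sketch: RAID-5 parity update when overwriting one data element:
#   new_parity = old_parity XOR old_data XOR new_data
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    return xor(xor(old_parity, old_data), new_data)

d1, d2, d3 = b"\x01", b"\x02", b"\x04"
parity = xor(xor(d1, d2), d3)                 # initial parity = D1^D2^D3
new_d2 = b"\x08"
parity = update_parity(parity, d2, new_d2)    # update after overwriting D2
assert parity == xor(xor(d1, new_d2), d3)     # parity is still consistent
```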
76. 5 votes per site
• 3 voting entities at the first level: Site-1, Site-2, and the witness
• 4 components at the second level within each site
• Total of 5 votes per site (an odd number of votes)
• The witness has an equal share of votes as the other 2 entities (the sites)
[Diagram: the components of each site's RAID-5 leg (D1–D3, P1) carry 5 votes in total.]
77. The witness is assigned the same voting rights as the sites
• 3 voting entities at the first level: Site-1, Site-2, and the witness
• 4 components at the second level
• The witness has an equal share of votes as the other 2 entities (the sites): 5 votes each
[Diagram: Site-1 (5 votes), Site-2 (5 votes), and the witness (5 votes).]
79. Anatomy of an all-flash write
Pretty much the same as hybrid (a sketch of the flow follows):
• VM running on host H1
• H1 is the owner of the virtual disk object; Number of Failures To Tolerate = 1
• The object has 2 replicas, on H1 and H2
1. Guest OS issues a write op to the virtual disk
2. Owner clones the write op
3. In parallel: sends a "prepare" op to H1 (locally) and H2
4. H1 and H2 persist the op to flash (log)
5. H1 and H2 ACK the prepare op to the owner
6. Owner waits for the ACK from both prepares and completes the I/O
7. Later, the owner commits the batch of writes
[Diagram: hosts H1–H3 under vSphere / Virtual SAN, with the numbered steps flowing between the owner and the replicas.]
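Steps 2–7 form a classic prepare/commit pattern. This sketch models the owner's side with plain method calls; the class and method names are illustrative, not vSAN's internal RPC names:

```python
# Sketch: the mirrored write on slide 79. The guest's write completes
# once every replica has ACKed "prepare" (the op is persisted in each
# replica's flash log); the "commit" is batched and issued later.
class Replica:
    def __init__(self, host: str):
        self.host, self.log, self.committed = host, [], []

    def prepare(self, op) -> bool:
        self.log.append(op)          # step 4: persist op to the flash log
        return True                  # step 5: ACK the prepare

    def commit(self):
        self.committed += self.log   # step 7: commit the batch later
        self.log = []

def owner_write(op, replicas) -> bool:
    acks = [r.prepare(op) for r in replicas]   # steps 2-3: clone + prepare
    return all(acks)                           # step 6: complete the I/O

replicas = [Replica("H1"), Replica("H2")]      # FTT=1: two replicas
assert owner_write({"lba": 100, "data": b"x"}, replicas)
for r in replicas:
    r.commit()
```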
80. All-flash: destaging cache to capacity
• Data from committed writes accumulates in the flash cache (write buffer), from different VMs / virtual disks
• In all-flash, blocks that are written most often (hot) stay in the write cache
• Blocks that are infrequently accessed (cold) are destaged to the flash capacity layer
[Diagram: hot blocks remain in the cache tier; cold blocks are destaged to the capacity tier.]
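A hedged sketch of the hot/cold split above. The write-count heuristic and its threshold are invented for illustration; the slide does not describe vSAN's actual destaging algorithm:

```python
# Sketch: deciding which buffered blocks to destage from the write
# cache to the capacity tier. Frequently rewritten ("hot") blocks stay
# in cache so repeated overwrites never touch capacity flash; rarely
# written ("cold") blocks are destaged.
HOT_WRITE_THRESHOLD = 4  # assumed value, not from the slide

def pick_destage_candidates(write_counts: dict) -> list:
    """Return the cold blocks (least-written first) to destage."""
    cold = [blk for blk, n in write_counts.items() if n < HOT_WRITE_THRESHOLD]
    return sorted(cold, key=lambda blk: write_counts[blk])

buffer = {"blk1": 9, "blk2": 1, "blk3": 6, "blk4": 2}
print(pick_destage_candidates(buffer))  # ['blk2', 'blk4']; hot blocks stay
```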
81. Nerd Out With These Key vSAN Activities at VMworld
#HitRefresh on your current data center and discover the possibilities!
Become a vSAN Specialist
• Earn VMware digital badges to showcase your skills
• New 2017 vSAN Specialist Badge
• Education & Certification Lounge: VM Village
• Certification Exam Center: Jasmine EFG, Level 3
Practice with Hands-on Labs
• Learn from self-paced and expert-led hands-on labs
• vSAN Getting Started Workshop (expert-led)
• VxRail Getting Started (self-paced)
• Self-paced lab available online 24x7
Visit the SDDC Assessment Lounge
• Discover how to assess whether your IT is a good fit for HCI
• Four Seasons Willow Room, 2nd floor
• Open 11am – 5pm Sun, Mon, and Tue
• Learn more at Assessing & Sizing in STO1500BU
82. 3 Easy Ways to Learn More about vSAN
Hands-On Lab
• Live at VMworld
• Practical learning of vSAN, VxRail, and more
• 24x7 availability online, for free
• Test drive vSAN for free today!
New vSAN Tools
• vSAN Sizer
• vSAN Assessment
Storage Hub Technical Library
• StorageHub.vmware.com
• Reference architectures, offline demos, and more
• Easy search function
• And more!