1. STO1479BU: vSAN Beyond the Basics
Sumit Lahiri – Product Line Manager
Eric Knauft – Staff Engineer
2. Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
3. Agenda
1 The World of Objects
2 Life of a vSAN Component
3 The 4 Rs of vSAN
4 Multi-Level Fault Domains
5 All-Flash I/O Flow
5. Disk layout in a host
Disk groups contribute to a single vSAN datastore in the vSphere cluster.
• Max 64 nodes
• Min 2 nodes (ROBO)
• Max 5 disk groups per host
• 2 tiers per disk group: cache and capacity
[Diagram: multiple disk groups, each with a cache tier and a capacity tier, pooled into one vSAN datastore.]
6. Creating a VM creates several objects in the background
• Virtual disk (VMDK)
• VM home namespace: VMX, log files
• Virtual memory swap objects
7. From VM to components
[Diagram: a VMDK object is split into components (max size 255 GB each), which in turn consist of blocks (in the low MBs).]
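To make the object-to-component split concrete, here is a minimal sketch. Only the 255 GB maximum component size is taken from the slide; real vSAN placement also weighs stripe width, free capacity, and fault domains, none of which is modeled here.

```python
# Sketch: splitting a VMDK object into fixed-maximum-size components.
# The 255 GB cap comes from the slide; everything else is illustrative.
MAX_COMPONENT_GB = 255

def split_into_components(vmdk_size_gb: int) -> list:
    """Return the sizes (in GB) of the components backing one object."""
    components = []
    remaining = vmdk_size_gb
    while remaining > 0:
        chunk = min(remaining, MAX_COMPONENT_GB)
        components.append(chunk)
        remaining -= chunk
    return components

# A 600 GB VMDK needs three components: 255 + 255 + 90.
print(split_into_components(600))  # [255, 255, 90]
```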
9. Failures to Tolerate (FTT)
FTT is always defined in the context of fault domains: hosts, racks, or sites.
[Diagram: Failures to Tolerate applied at the host, rack, and site level.]
10. Failures to Tolerate (FTT)
If no fault domain is specified, FTT implies host failures to tolerate.
[Diagram: clusters configured with FTT=1, FTT=2, and FTT=3.]
11. Failures to Tolerate (FTT) can be nested
Example: survive one site failure and one host failure on the other site.
[Diagram: nested fault domains across hosts, racks, and sites.]
25. Quorum: in the event of a cluster partition, which partition shall proceed?
[Diagram: the cluster splits into partition-01 (N hosts) and partition-02 (M hosts).]
26. Quorum: the partition with the higher vote count proceeds
Cluster members participate in voting.
[Diagram: partition-01 with N hosts casting N votes; partition-02 with M hosts casting M votes.]
27. If M > N, partition-02 proceeds
Cluster members participate in voting; with M votes against partition-01's N votes, partition-02 proceeds.
[Diagram: partition-02 wins the vote and proceeds.]
29. Quorum is calculated on a per-object basis
• Each component participates in voting
• With two components and no witness, this sums to an even number of votes
[Diagram: a VMDK as a RAID-1 object with two components of 1 vote each and no witness.]
30. Add a witness for the tie-breaker vote
• The witness is added as a tie-breaker vote (see the sketch below)
• It acts as an observer of which component has the latest data
[Diagram: the RAID-1 VMDK now has two components plus a witness, 1 vote each.]
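A hedged sketch of the per-object vote counting described above; the partition names and dictionary representation are illustrative, not vSAN's internal bookkeeping:

```python
# Sketch: per-object quorum with a witness tie-breaker. Each component
# and the witness carries one vote; a partition wins for this object
# only if it holds a strict majority of the object's total votes.
def winning_partition(votes_by_partition: dict, total_votes: int):
    """Return the partition holding > 50% of the object's votes, or None."""
    for partition, votes in votes_by_partition.items():
        if votes > total_votes / 2:
            return partition
    return None  # no partition has quorum for this object

# Mirrored VMDK: component C1 in partition-01; C2 plus the witness W
# in partition-02. Partition-02 holds 2 of 3 votes and proceeds.
print(winning_partition({"partition-01": 1, "partition-02": 2}, 3))
```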
31. For VMDK-A, partition-02 has the higher vote count
[Diagram: VMDK-A's RAID-1 tree spans the partition; partition-02 holds one component plus the witness (2 votes) against partition-01's single component (1 vote), so partition-02 proceeds.]
32. General case: different objects proceed in different partitions
[Diagram: VMDK-A's witness lands in partition-02, so partition-02 proceeds for VMDK-A; VMDK-B's witness lands in partition-01, so partition-01 proceeds for VMDK-B.]
33. Components can be classified as data components and witness components
[Diagram: a RAID-1 VMDK with two data components (no striping) and one witness component, each carrying 1 vote.]
34. What is the minimum number of hosts required to survive N host failures?
35. A minimum of 2N+1 hosts is required to survive N host failures
• If each host represents an equal share of the vote
• The winning partition requires a minimum of N+1 hosts
• Minimum cluster size = 2N+1 hosts to survive N host failures
[Diagram: partition-01 with N hosts (N shares of votes) against the winning partition-02 with N+1 hosts (N+1 shares).]
36. Minimum cluster size is determined by meeting the liveness requirement
• Liveness = (Quorum) && (Availability)
• Min hosts in cluster = max(min hosts for quorum, min hosts for availability)
37. Examples (see the sketch after this list)
• FTT = 1, FTM = RAID-1
– Min hosts for availability = 2
– Min hosts for quorum = 2N+1 = 3
– Min cluster size = 3
• FTT = 2, FTM = RAID-1
– Min hosts for availability = 3
– Min hosts for quorum = 2N+1 = 5
– Min cluster size = 5
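Under the liveness rule on the previous slide, these examples reduce to one line for RAID-1. A minimal sketch, assuming (as the examples imply) that RAID-1 needs FTT+1 replica hosts for availability:

```python
# Sketch: minimum cluster size for RAID-1 mirroring, per the liveness
# rule above: the cluster must satisfy the larger of the two minimums.
def min_cluster_size_raid1(ftt: int) -> int:
    min_for_availability = ftt + 1   # one replica per tolerated failure, plus one
    min_for_quorum = 2 * ftt + 1     # 2N+1 hosts to survive N host failures
    return max(min_for_quorum, min_for_availability)

assert min_cluster_size_raid1(1) == 3  # FTT=1 -> 3 hosts
assert min_cluster_size_raid1(2) == 5  # FTT=2 -> 5 hosts
```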
43. It is possible to have quorum but no availability
[Diagram: a RAID-1 over RAID-0 VMDK split by a partition; the winning side holds a majority of votes (quorum ✓) but not one complete RAID-0 copy of the data, so the object has quorum without availability.]
48. Object states: an object can be "not compliant" but accessible
• Compliance status: are all replicas good?
• Operational status: is the object accessible?
[Diagram: a RAID-1 over RAID-0 VMDK spread across esxi-01, esxi-02, and esxi-03, with a witness.]
49. Object states: an object can be "not compliant" but accessible
• Compliance status: are all replicas good?
• Operational status: is the object accessible? Accessible implies liveness
Component states (see the sketch after this list):
• Active = known good
• Degraded = known bad, rebuild now
• Absent = known bad, cause not known, repair after 60 minutes
• Stale = active, but needs an update
[Diagram: the same RAID tree as the previous slide.]
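The four component states and their repair policies map naturally to a small table. A minimal sketch; the names are illustrative, and only the 60-minute delay and the immediate-rebuild rule come from the slide:

```python
# Sketch: component states and their repair policy, as described above.
from enum import Enum

class ComponentState(Enum):
    ACTIVE = "known good"
    DEGRADED = "known bad, rebuild now"
    ABSENT = "known bad, cause unknown, repair after delay"
    STALE = "active but needs an update (resync)"

REPAIR_DELAY_MIN = {
    ComponentState.DEGRADED: 0,   # repair immediately
    ComponentState.ABSENT: 60,    # wait: the component may come back
}

def rebuild_delay_minutes(state: ComponentState):
    """Minutes to wait before rebuilding, or None if no rebuild is needed."""
    return REPAIR_DELAY_MIN.get(state)

assert rebuild_delay_minutes(ComponentState.ABSENT) == 60
assert rebuild_delay_minutes(ComponentState.DEGRADED) == 0
```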
50. The 4 Rs: Resync, Rebuild, Repair, and Reconfiguration
• A VMDK is divided into components
• Components consist of data blocks
• Each component is on a different host
• Each data block is of fixed size
Partial resync (sketched below)
• Copy data to stale components
• Used when a component comes back from being absent
Repair / reconfigure
• Build a fresh component (a full resync)
[Diagram: a RAID-1 VMDK with components C1…C4; a partial resync copies only stale blocks to an active-stale component, while a repair builds out a whole component to replace a degraded one.]
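A sketch of how the partial-resync vs. full-rebuild split above could be expressed. Tracking dirty blocks with a plain set is purely illustrative:

```python
# Sketch: partial resync vs. full rebuild. A stale component already
# holds most blocks, so only the blocks written while it was away are
# copied; a degraded component is rebuilt from scratch on another host.
def blocks_to_copy(state: str, all_blocks: set, stale_blocks: set) -> set:
    if state == "active-stale":
        return stale_blocks          # partial resync: only what changed
    if state == "degraded":
        return all_blocks            # full resync: build a fresh component
    return set()                     # active components need nothing

blocks = {f"blk{i}" for i in range(1024)}
dirty = {"blk7", "blk42"}            # written while the component was absent
assert blocks_to_copy("active-stale", blocks, dirty) == dirty
assert blocks_to_copy("degraded", blocks, dirty) == blocks
```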
51. Resync / reconfiguration triggers
• A partition resolves: components are active-stale, and only some blocks are resynced / rebuilt
• Storage policies change: components are rebuilt
[Diagram: a disk group with cache and capacity tiers; a partial block resync for an active-stale component versus a full build-out of a degraded one.]
53. Begin: all components and elements are in the Active state
Tolerate one host failure with RAID-1.
[Diagram: a RAID-1 object with components C1 and C2 on separate hosts plus a witness, all in the Active state.]
54. The cluster partitions with unknown cause; components go Absent
• Cluster partition, cause unknown: do not repair immediately
• Absent = known bad, but the cause is not known
• The object is not compliant but accessible
[Diagram: the cluster splits into partition-1 and partition-2; the components cut off from the object's owner are marked Absent.]
55. The partition with both availability and quorum proceeds
The VM fails over (vSphere HA) to partition-2, which has both quorum and availability; partition-1 has availability of its local component but no quorum.
[Diagram: partition-2 proceeds; the component stranded in partition-1 is Absent from the winning side's view.]
56. The partition is resolved; the component is resynced
• The returning component is marked Active-Stale; the object is not compliant
• The Active-Stale component is resynced
[Diagram: stale blocks are copied from the up-to-date replica to the Active-Stale component.]
57. All components and elements are back in the Active state
All components are Active; the object is compliant and accessible.
[Diagram: components C1, C2, and the witness, all Active again.]
59. Absent components repair after 60 minutes
If the components stay Absent, a resync is started after 60 minutes from partition-2, which holds the most recent data.
[Diagram: partition-1 holds the Absent components; partition-2 holds the current replica and witness.]
60. Degraded components repair immediately
A hardware failure causes the Degraded state: known bad, resync now.
[Diagram: the failed host's components are marked Degraded; the surviving replica and witness stay Active.]
61. Fresh components are resynced from existing components
• Find another host and begin the resync; the new components are in the Reconfiguring state
• The object state is not compliant but accessible
[Diagram: fresh components are built on another host from the surviving replica while the Degraded components remain.]
62. The object is compliant again
Once the resync completes, the Degraded components are marked for deletion and removed.
[Diagram: all components Active again; the Degraded components are removed.]
69. Failures to Tolerate (FTT) can be nested
Example: survive one site failure and one host failure on the other site.
[Diagram: nested fault domains across hosts, racks, and sites.]
70. Stretched cluster deployment with local fault protection
• In the prior examples, the host was the fault domain
• 2 levels of fault domains
– Site and host
• Failures to tolerate at each level
[Diagram: two sites connected by a 5 ms RTT, 10GbE link; RAID-1 across the sites, RAID-5 within each site, and a 3rd site for the witness.]
71. RAID tree for a stretched cluster with local fault protection
[Diagram: a RAID-1 node at the top mirrors across Site-1 and Site-2; each site holds a RAID-5 node with data components D1, D2, D3 and parity P1.]
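The nested tree above can be sketched as plain data. The node shapes here are illustrative, not vSAN's actual object layout format:

```python
# Sketch: the stretched-cluster RAID tree as nested tuples
# (raid_level, children). Leaves are (component, site) pairs.
raid_tree = ("RAID-1", [
    ("RAID-5", [("D1", "site-1"), ("D2", "site-1"),
                ("D3", "site-1"), ("P1", "site-1")]),
    ("RAID-5", [("D1", "site-2"), ("D2", "site-2"),
                ("D3", "site-2"), ("P1", "site-2")]),
])

def leaves(node):
    """Yield every component in the tree, depth-first."""
    kind, children = node
    for child in children:
        if isinstance(child[1], list):
            yield from leaves(child)   # inner RAID node: recurse
        else:
            yield child                # (component, site) leaf

print(list(leaves(raid_tree)))  # 8 components, 4 per site
```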
74. Anatomy of a write: from site-1 to site-2
1. The write is issued
2a. Update local data and parity in the site-1 RAID-5 leg
2b. Send only the data across sites to the remote helper RAID tree (proxy owner)
3. The remote side calculates its own parity
[Diagram: the RAID-1 object mirrors across two RAID-5 legs (D1–D3 plus P1 per site); only data, not parity, crosses the inter-site link.]
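Step 2a relies on the standard RAID-5 XOR identity, which is also why only the new data needs to cross the link in step 2b: each site can recompute its own parity locally. A minimal sketch with byte strings (general RAID-5 math, not vSAN-specific code):

```python
# Sketch: RAID-5 parity update when overwriting one data element:
#   new_parity = old_parity XOR old_data XOR new_data
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    return xor(xor(old_parity, old_data), new_data)

d1, d2, d3 = b"\x01", b"\x02", b"\x04"
parity = xor(xor(d1, d2), d3)                 # initial parity = D1^D2^D3
new_d2 = b"\x08"
parity = update_parity(parity, d2, new_d2)    # update after overwriting D2
assert parity == xor(xor(d1, new_d2), d3)     # parity is still consistent
```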
76. 5 votes per site
• 3 voting entities at the first level: Site-1, Site-2, and the witness
• 4 components at the second level within each site
• Total of 5 votes per site (an odd number of votes)
• The witness has an equal share of votes as the other 2 entities (the sites)
[Diagram: the components of each site's RAID-5 leg (D1–D3, P1) carry 5 votes in total.]
77. The witness is assigned the same voting rights as the sites
• 3 voting entities at the first level: Site-1, Site-2, and the witness
• 4 components at the second level
• The witness has an equal share of votes as the other 2 entities (the sites): 5 votes each
[Diagram: Site-1 (5 votes), Site-2 (5 votes), and the witness (5 votes).]
79. Anatomy of an all-flash write
Pretty much the same as hybrid (a sketch of the flow follows):
• VM running on host H1
• H1 is the owner of the virtual disk object; Number of Failures To Tolerate = 1
• The object has 2 replicas, on H1 and H2
1. Guest OS issues a write op to the virtual disk
2. Owner clones the write op
3. In parallel: sends a "prepare" op to H1 (locally) and H2
4. H1 and H2 persist the op to flash (log)
5. H1 and H2 ACK the prepare op to the owner
6. Owner waits for the ACK from both prepares and completes the I/O
7. Later, the owner commits the batch of writes
[Diagram: hosts H1–H3 under vSphere / Virtual SAN, with the numbered steps flowing between the owner and the replicas.]
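Steps 2–7 form a classic prepare/commit pattern. This sketch models the owner's side with plain method calls; the class and method names are illustrative, not vSAN's internal RPC names:

```python
# Sketch: the mirrored write on slide 79. The guest's write completes
# once every replica has ACKed "prepare" (the op is persisted in each
# replica's flash log); the "commit" is batched and issued later.
class Replica:
    def __init__(self, host: str):
        self.host, self.log, self.committed = host, [], []

    def prepare(self, op) -> bool:
        self.log.append(op)          # step 4: persist op to the flash log
        return True                  # step 5: ACK the prepare

    def commit(self):
        self.committed += self.log   # step 7: commit the batch later
        self.log = []

def owner_write(op, replicas) -> bool:
    acks = [r.prepare(op) for r in replicas]   # steps 2-3: clone + prepare
    return all(acks)                           # step 6: complete the I/O

replicas = [Replica("H1"), Replica("H2")]      # FTT=1: two replicas
assert owner_write({"lba": 100, "data": b"x"}, replicas)
for r in replicas:
    r.commit()
```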
80. All-flash: destaging cache to capacity
• Data from committed writes accumulates in the flash cache (write buffer), from different VMs / virtual disks
• In all-flash, blocks that are written most often (hot) stay in the write cache
• Blocks that are infrequently accessed (cold) are destaged to the flash capacity layer
[Diagram: hot blocks remain in the cache tier; cold blocks are destaged to the capacity tier.]
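A hedged sketch of the hot/cold split above. The write-count heuristic and its threshold are invented for illustration; the slide does not describe vSAN's actual destaging algorithm:

```python
# Sketch: deciding which buffered blocks to destage from the write
# cache to the capacity tier. Frequently rewritten ("hot") blocks stay
# in cache so repeated overwrites never touch capacity flash; rarely
# written ("cold") blocks are destaged.
HOT_WRITE_THRESHOLD = 4  # assumed value, not from the slide

def pick_destage_candidates(write_counts: dict) -> list:
    """Return the cold blocks (least-written first) to destage."""
    cold = [blk for blk, n in write_counts.items() if n < HOT_WRITE_THRESHOLD]
    return sorted(cold, key=lambda blk: write_counts[blk])

buffer = {"blk1": 9, "blk2": 1, "blk3": 6, "blk4": 2}
print(pick_destage_candidates(buffer))  # ['blk2', 'blk4']; hot blocks stay
```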
81. Nerd Out With These Key vSAN Activities at VMworld
#HitRefresh on your current data center and discover the possibilities!
Become a vSAN Specialist
• Earn VMware digital badges to showcase your skills
• New 2017 vSAN Specialist Badge
• Education & Certification Lounge: VM Village
• Certification Exam Center: Jasmine EFG, Level 3
Practice with Hands-on Labs
• Learn from self-paced and expert-led hands-on labs
• vSAN Getting Started Workshop (expert-led)
• VxRail Getting Started (self-paced)
• Self-paced lab available online 24x7
Visit the SDDC Assessment Lounge
• Discover how to assess whether your IT is a good fit for HCI
• Four Seasons Willow Room, 2nd floor
• Open 11am – 5pm Sun, Mon, and Tue
• Learn more at Assessing & Sizing in STO1500BU
82. 3 Easy Ways to Learn More about vSAN
Hands-On Lab
• Live at VMworld
• Practical learning of vSAN, VxRail, and more
• 24x7 availability online, for free
• Test drive vSAN for free today!
New vSAN Tools
• vSAN Sizer
• vSAN Assessment
Storage Hub Technical Library
• StorageHub.vmware.com
• Reference architectures, offline demos, and more
• Easy search function
• And more!