
A day in the life of a VSAN I/O - STO7875

Title: A day in the life of a VSAN I/O - STO7875
Presenters: John Nicholson, Duncan Epping

Published in: Technology

  1. A day in the life of a VSAN I/O
      Diving into the I/O flow of Virtual SAN
      Duncan Epping (@DuncanYB), John Nicholson (@lost_signal)
      VMworld session: #SDDC7875
  2. Agenda
      1. Introduction (Duncan)
      2. Virtual SAN, what is it? (Duncan)
      3. Virtual SAN, a bit of a deeper dive (Duncan)
      4. What about failures? (John)
      5. IO Deep Dive (John)
      6. Wrapping up (John)
  3. The Software Defined Data Center (compute, networking, storage, management)
      • All infrastructure services virtualized: compute, networking, storage
      • Underlying hardware abstracted, resources are pooled
      • Control of data center automated by software (management, security)
      • Virtual Machines are first class citizens of the SDDC
      • Today’s session will focus on one aspect of the SDDC: storage
  4. Hardware evolution started the infrastructure revolution
  5. Hyper-Converged Infrastructure: new IT model
      “A lazy admin is the best admin”
  6. Simplicity: Operational / Management
  7. The Hypervisor is the Strategic High Ground
      (Diagram labels: VMware vSphere; SAN/NAS; x86 - HCI; Object Storage; Cloud Storage)
  8. Storage Policy-Based Management – App centric automation
      Overview
      • Intelligent placement
      • Fine control of services at VM level
      • Automation at scale through policy
      • Need new services for VM?
      • Change current policy on-the-fly
      • Attach new policy on-the-fly
      Example Virtual Machine Storage policy:
      – Reserve Capacity: 40GB
      – Availability: 2 failures to tolerate
      – Read Cache: 50%
      – Stripe Width: 6
      (Diagram labels: Storage Policy-Based Management; vSphere; Virtual SAN; Virtual Volumes; Virtual Datastore)
  9. Virtual SAN Primer
      So that we are all on the same page
  10. Virtual SAN, what is it?
      • Hyper-Converged Infrastructure
      • Distributed, Scale-out Architecture
      • Integrated with vSphere platform
      • Ready for today’s vSphere use cases
      • Software-Defined Storage
      (Diagram label: vSphere & Virtual SAN)
  11. But what does that really mean?
      • Generic x86 hardware
      • Integrated with your Hypervisor
      • Leveraging local storage resources
      • Exposing a single shared datastore
      (Diagram labels: VSAN network; VMware vSphere & Virtual SAN)
  12. VSAN is the Most Widely Adopted HCI Product
      Simplicity is key: on an oil platform there are no virtualization, storage or network admins. The infrastructure is managed over a satellite link via a centralized vCenter Server. Reliability, availability and predictability are key.
  13. Virtual SAN Use Cases (VMware vSphere + Virtual SAN)
      • End User Computing
      • Test/Dev
      • ROBO
      • Staging
      • Management
      • DMZ
      • Business Critical Apps
      • DR / DA
  14. Tiered Hybrid vs All-Flash
      All-Flash: 100K IOPS per host + sub-millisecond latency
      – Caching tier (SSD / PCIe / NVMe): writes cached first, reads from capacity tier
      – Capacity tier (data persistence): flash devices; reads go directly to capacity tier
      Hybrid: 40K IOPS per host
      – Caching tier (SSD / PCIe / NVMe): read and write cache
      – Capacity tier: SAS / NL-SAS / SATA
  15. Flash Devices
      All writes and the vast majority of reads are served by flash storage.
      1. Write-back Buffer (30%, or 100% in all-flash)
         – Writes acknowledged as soon as they are persisted on flash (on all replicas)
      2. Read Cache (70%)
         – Active data set always in flash, hot data replaces cold data
         – Cache miss: read data from HDD and put it in cache
      A performance tier tuned for virtualized workloads
      – High IOPS, low $/IOPS
      – Low, predictable latency
  16. Virtual SAN, a bit of a deeper dive
  17. Virtual Machine as a set of Objects on VSAN
      • VM Home Namespace
      • VM Swap Object
      • Virtual Disk (VMDK) Object
      • Snapshot (delta) Object
      • Snapshot (delta) Memory Object
  18. Define a policy first…
      • Virtual SAN currently surfaces multiple storage capabilities to vCenter Server
      • Determines layout of components!
  19. Virtual SAN Objects and Components
      VSAN is an object store!
      • Object Tree with Branches
      • Each Object has multiple Components
        – This allows you to meet availability and performance requirements
      • Here is one example of “Distributed RAID” using 2 techniques:
        – Striping (RAID-0)
        – Mirroring (RAID-1)
      • Data is distributed based on VM Storage Policy
      (Diagram: a VMDK object as a RAID-1 tree with two mirror copies, each a RAID-0 of two stripes (stripe-1a/stripe-1b and stripe-2a/stripe-2b), plus a witness, spread across three ESXi hosts)
  20. Number of Failures to Tolerate (How many copies of your data?)
      • Defines the number of host, disk or network failures a storage object can tolerate.
      • RAID-1 Mirroring is used when Failure Tolerance Method is set to Performance (default).
      • For “n” failures tolerated, “n+1” copies of the object are created and “2n+1” hosts contributing storage are required!
      (Diagram: Virtual SAN Policy “Number of failures to tolerate = 1” on hosts esxi-01 to esxi-04: two vmdk replicas in RAID-1, each serving ~50% of I/O, plus a witness)
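      The n+1 / 2n+1 rule on this slide can be written out as a small calculation. This is only a sketch of the arithmetic for RAID-1 mirroring; the function name `ftt_requirements` is made up for illustration and is not a VSAN API.

```python
def ftt_requirements(failures_to_tolerate: int) -> dict:
    """Replica and host counts for RAID-1 mirroring, per the slide's rule.

    For n failures to tolerate: n + 1 copies, 2n + 1 hosts contributing storage.
    """
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be non-negative")
    n = failures_to_tolerate
    return {"replicas": n + 1, "min_hosts": 2 * n + 1}


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, ftt_requirements(n))
```

      With the policy shown on the slide (n = 1) this gives two replicas and a three-host minimum, which matches the two vmdk copies plus a witness in the diagram.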
  21. Number of Disk Stripes Per Object (on how many devices?)
      • Number of disk stripes per object
        – The number of HDDs across which each replica of a storage object is distributed. Higher values may result in better performance.
      (Diagram: Virtual SAN Policy “Number of failures to tolerate = 1” on hosts esxi-01 to esxi-04: a RAID-1 of two RAID-0 sets, each striped across two vmdk components, plus a witness)
  22. Fault Domains, increasing availability through rack awareness
      • Create fault domains to increase availability
      • 8 node cluster with 4 defined fault domains (2 nodes in each)
        – FD1 = esxi-01, esxi-02
        – FD2 = esxi-03, esxi-04
        – FD3 = esxi-05, esxi-06
        – FD4 = esxi-07, esxi-08
      • To protect against one rack failure, only 2 replicas and a witness are required, spread across 3 fault domains!
      (Diagram: vmdk, vmdk and witness components in RAID-1, placed in separate fault domains)
  23. What about failures?
  24. VSAN 1 host isolated – HA restart
      • HA detects an isolation
        – ESXi-01 cannot ping master
        – Master receives no pings
        – ESXi-01 cannot ping Gateway
        – Isolation declared!
      • HA kills VM on ESXi-01
        – Note that the Isolation Response needs to be configured!
        – Shutdown / Power Off / Disabled
      • VM can now be restarted on any of the remaining hosts
      (Diagram: esxi-01 isolated; vmdk, vmdk and witness in RAID-1 on the remaining hosts; HA restart)
  25. VSAN 2 hosts partitioned – HA restart
      • This is not an isolation, but rather a partition
      • ESXi-01 can ping ESXi-02
      • ESXi-01 cannot ping the rest of the cluster
      • VSAN kills VM on ESXi-01
        – It does this as all components are inaccessible
        – AutoTerminateGhostVm
      • HA detects that VM is missing
      • HA sees that no host is accessing the components
      • HA restarts the VM!
      (Diagram: FD1 (esxi-01, esxi-02) partitioned from FD2, FD3 and FD4; vmdk, vmdk and witness in RAID-1; HA restart)
  26. VSAN 4 hosts partitioned – HA restart
      • Double partition scenario!
      • Again, VSAN kills VM on ESXi-01
        – AutoTerminateGhostVm
      • HA detects that VM is missing
      • HA sees that no host is accessing the components
      • HA restarts the VM in either FD2 or FD3!
        – They have majority
      (Diagram: fault domains FD1 to FD4 with two partitions; vmdk, vmdk and witness in RAID-1; HA restart)
  27. VSAN 4 hosts partitioned – HA restart
      • Double partition scenario!
      • Note that VM remains running in FD1
      • VM runs headless, cannot write to disk!
      • HA sees that access to storage is lost
      • HA restarts the VM in either FD3 or FD4!
        – They have majority
      • As soon as partition is lifted VM is killed in FD1 as it lost its lock!
      (Diagram: fault domains FD1 to FD4 with two partitions; vmdk, vmdk and witness in RAID-1; HA restart)
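      The "they have majority" reasoning in the partition slides can be illustrated with a simple vote count. The sketch below assumes one vote per component (two replicas and a witness) and a hypothetical placement of those components in FD2, FD3 and FD4; the slides do not spell out the placement, and real vote assignment may differ. The function name `partition_with_quorum` is invented for this example.

```python
def partition_with_quorum(component_votes, partitions):
    """Return the partition holding more than 50% of an object's votes, or None.

    component_votes: dict mapping fault domain name -> votes held there.
    partitions: list of sets of fault domain names that can still talk to each other.
    """
    total = sum(component_votes.values())
    for part in partitions:
        votes = sum(component_votes.get(fd, 0) for fd in part)
        if 2 * votes > total:  # strict majority
            return part
    return None


# Assumed placement: replica in FD2, replica in FD3, witness in FD4.
votes = {"FD2": 1, "FD3": 1, "FD4": 1}

if __name__ == "__main__":
    # FD1+FD2 cut off from FD3+FD4: only the second side can reach 2 of 3 votes.
    print(partition_with_quorum(votes, [{"FD1", "FD2"}, {"FD3", "FD4"}]))
    # Every fault domain on its own: nobody has a majority.
    print(partition_with_quorum(votes, [{"FD1"}, {"FD2"}, {"FD3"}, {"FD4"}]))
```

      Under these assumptions a VM left running in FD1 has no access to its object, while hosts in the majority partition can restart it, which is the behaviour the slides describe.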
  28. IO Deep Dive
  29. VSAN IO flow – Write Acknowledgement
      • VSAN mirrors write IOs to all active mirrors
      • These are acknowledged when they hit the write buffer!
      • The write buffer is flash based and persistent, to avoid data loss
      • Writes will be de-staged to the capacity tier
        – VSAN takes locality into account when destaging to spindles
        – Optimizes IO pattern
      (Diagram: hosts esxi-01 to esxi-04 with vmdk, vmdk and witness in RAID-1)
  30. Anatomy of a Hybrid Read
      1. Guest OS issues a read on virtual disk
      2. Owner chooses replica to read from
         • Load balance across replicas
         • Not necessarily local replica (if one)
         • A block is always read from the same replica
      3. At chosen replica (esxi-03): read data from flash Read Cache or client cache, if present
      4. Otherwise, read from HDD and place data in flash Read Cache
         • Replace ‘cold’ data
      5. Return data to owner
      6. Complete read and return data to VM
      (Diagram: vSphere & Virtual SAN hosts esxi-01, esxi-02 and esxi-03 with two vmdk replicas; steps 1 to 6 marked)
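      The hybrid read steps above can be sketched as a toy model. Everything here is an assumption made for illustration: the `Replica` class, the modulo rule used so that a block always maps to the same replica, and the unbounded dictionary standing in for the flash read cache (eviction of cold data is left out).

```python
class Replica:
    """Toy replica: a dict stands in for the HDD, another for the flash read cache."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk          # block number -> data (capacity tier)
        self.read_cache = {}      # block number -> data (flash read cache)
        self.hdd_reads = 0

    def read(self, block):
        if block in self.read_cache:      # step 3: cache hit
            return self.read_cache[block]
        self.hdd_reads += 1               # step 4: cache miss, go to HDD
        data = self.disk[block]
        self.read_cache[block] = data     # populate the cache for next time
        return data


def choose_replica(block, replicas):
    # Assumed rule: a deterministic mapping, so the same block always
    # lands on the same replica and is cached in only one place.
    return replicas[block % len(replicas)]


def owner_read(block, replicas):
    # Steps 2, 5 and 6: pick the replica, fetch, hand the data back.
    return choose_replica(block, replicas).read(block)


if __name__ == "__main__":
    disk = {n: f"block-{n}".encode() for n in range(8)}
    replicas = [Replica("esxi-01", dict(disk)), Replica("esxi-03", dict(disk))]
    owner_read(5, replicas)
    owner_read(5, replicas)
    print([(r.name, r.hdd_reads) for r in replicas])
```

      Reading a block from a fixed replica means its cached copy lives in one read cache only, so the cache capacity across replicas is not spent on duplicates.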
  31. Anatomy of an All-Flash Read
      1. Guest OS issues a read on virtual disk
      2. Owner chooses replica to read from
         – Load balance across replicas
         – Not necessarily local replica (if one)
         – A block is always read from the same replica
      3. At chosen replica (esxi-03): read data from (write) Flash Cache or client cache, if present
      4. Otherwise, read from capacity flash device
      5. Return data to owner
      6. Complete read and return data to VM
      (Diagram: vSphere & Virtual SAN hosts esxi-01, esxi-02 and esxi-03 with two vmdk replicas; steps 1 to 6 marked)
  32. Client Cache
      • Always Local
      • Up to 1GB of memory per Host
      • Memory Latency < Network Latency
      • Horizon 7 Testing: 75% fewer Read IOPS, 25% better latency
      • Complements CBRC
      • Enabled by default in 6.2
      (Diagram: hosts esxi-01 and esxi-02 with two vmdk replicas and a witness)
  33. Anatomy of Checksum
      1. Guest OS issues a write on virtual disk
      2. Host generates checksum before the data leaves the host
      3. Data is transferred over the network
      4. Checksum is verified on the host where the data will be written to disk
      5. ACK is returned to the virtual machine
      6. On read, the checksum is verified by the host running the VM. If any component fails verification, it is repaired from the other copy or parity.
      7. Scrubs of cold data are performed
      (Diagram: vSphere & Virtual SAN hosts esxi-01, esxi-02 and esxi-03 with two vmdk replicas; steps 1 to 7 marked)
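      The checksum flow above can be sketched end to end. This is a simplified model, not VSAN code: Python's `zlib.crc32` is used as a stand-in checksum (the slide does not name the algorithm), each replica is a plain dict, and the helper names are invented for this example.

```python
import zlib


def send_write(data: bytes):
    """Step 2: compute the checksum on the source host before the data leaves it."""
    return data, zlib.crc32(data)


def receive_write(data: bytes, checksum: int) -> dict:
    """Step 4: verify on the destination host before persisting."""
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch in transit")
    return {"data": data, "checksum": checksum}


def read_with_repair(replicas: list) -> bytes:
    """Step 6: verify on read and repair a bad copy from a good one."""
    good = next(
        (r for r in replicas if zlib.crc32(r["data"]) == r["checksum"]), None
    )
    if good is None:
        raise IOError("no replica passed checksum verification")
    for r in replicas:
        if zlib.crc32(r["data"]) != r["checksum"]:
            r["data"], r["checksum"] = good["data"], good["checksum"]
    return good["data"]


if __name__ == "__main__":
    stored = [receive_write(*send_write(b"hello")) for _ in range(2)]
    stored[0]["data"] = b"hellx"           # simulate silent corruption on disk
    print(read_with_repair(stored))        # data comes from the intact copy
    print(stored[0]["data"])               # corrupted copy has been repaired
```

      Only mirrored copies are modelled here; repair from parity, which the slide also mentions, is not shown.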
  34. Deduplication and Compression for Space Efficiency (Beta, All-Flash Only)
      • Deduplication and compression per disk group level
        – Enabled on a cluster level
        – Fixed block length deduplication (4KB blocks)
      • Compression after deduplication
        – LZ4 is used, low CPU!
        – Single feature, no schedules required!
        – File system stripes all IO across disk group
      (Diagram: vSphere & Virtual SAN hosts esxi-01, esxi-02 and esxi-03 with vmdk components)
  35. Deduplication and Compression Disk Group Stripes (Beta)
      • Deduplication and compression per disk group level
        – Data stripes across the disk group
      • Fault domain isolated to disk group
        – Fault of device leads to rebuild of disk group
        – Stripes reduce hotspots
        – Endurance/Throughput Impact
  36. Costs of Deduplication (nothing is free)
      • CPU overhead
      • Metadata and Memory overhead
        – Overhead for Metadata?
      • IO Overhead (metadata lookup)
      • IO Overhead (data movement from WB)
      • IO Overhead (fragmentation)
      • Endurance Overhead
  37. Costs of Compression (nothing is free)
      • CPU overhead
      • Capacity overhead
      • Memory overhead
      • IO overhead
  38. Deduplication and Compression (I/O Path) (All-Flash Only)
      • Avoids inline or post-process downsides
      • Performed at disk group level
      • 4KB fixed block
      • LZ4 compression after deduplication
      I/O path:
      1. VM issues write
      2. Write acknowledged by cache
      3. Cold data to memory
      4. Deduplication
      5. Compression
      6. Data written to capacity
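      Steps 4 to 6 of this I/O path (deduplicate fixed 4KB blocks, then compress, then write to capacity) can be sketched as follows. The sketch makes several substitutions that are assumptions, not statements about VSAN internals: SHA-1 from `hashlib` is used as the block fingerprint, `zlib` stands in for LZ4 (which is not in the Python standard library), and a dict plays the role of the capacity tier.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # 4KB fixed block, as on the slide


def destage(data: bytes, store: dict) -> list:
    """Deduplicate, then compress, then 'write' each block to the store.

    Returns the list of fingerprints needed to reconstruct the data.
    """
    refs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()    # fingerprint (assumed hash)
        if digest not in store:                      # only unseen blocks are stored
            store[digest] = zlib.compress(block)     # compression after dedup
        refs.append(digest)
    return refs


def read_back(refs: list, store: dict) -> bytes:
    """Rebuild the original data from fingerprints."""
    return b"".join(zlib.decompress(store[d]) for d in refs)


if __name__ == "__main__":
    capacity_tier = {}
    payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
    refs = destage(payload, capacity_tier)
    print(len(refs), "logical blocks,", len(capacity_tier), "unique blocks stored")
    assert read_back(refs, capacity_tier) == payload
```

      Because this work happens when data leaves the write buffer, the VM's write has already been acknowledged (step 2) before any hashing or compression takes place.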
  39. RAID 5/6 (All-Flash Only)
      • RAID 5 and RAID 6 are available on all-flash configurations only
      • SPBM Policy: set per object
      (Diagram: Virtual SAN Policy “Number of failures to tolerate = 1” on hosts esxi-01 to esxi-04: four vmdk components in RAID-5)
  40. RAID-5 Inline Erasure Coding
      • When Number of Failures to Tolerate = 1 and Failure Tolerance Method = Capacity → RAID-5
        – 3+1 (4 host minimum)
        – 1.33x instead of 2x overhead
      • 20GB disk consumes 40GB with RAID-1, now consumes ~27GB with RAID-5
      (Diagram: RAID-5 across four ESXi hosts, each holding three data components and one parity component, with parity rotated across hosts)
  41. RAID-6 Inline Erasure Coding (All-Flash Only)
      • When Number of Failures to Tolerate = 2 and Failure Tolerance Method = Capacity → RAID-6
        – 4+2 (6 host minimum)
        – 1.5x instead of 3x overhead
      • 20GB disk consumes 60GB with RAID-1, now consumes ~30GB with RAID-6
      (Diagram: RAID-6 across six ESXi hosts, each holding data and parity components, with parity rotated across hosts)
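      The capacity figures on the RAID-5 and RAID-6 slides follow from the data-plus-parity layouts. The sketch below reproduces that arithmetic; the function name and the "performance" / "capacity" strings are chosen for this example to mirror the Failure Tolerance Method setting and are not a VSAN API.

```python
def raw_capacity_gb(vmdk_gb: float, ftt: int, method: str) -> float:
    """Raw capacity consumed by a VMDK under the policies shown on the slides."""
    if method == "performance":          # RAID-1: n + 1 full copies
        return vmdk_gb * (ftt + 1)
    if method == "capacity":             # erasure coding
        layouts = {1: (3, 1), 2: (4, 2)}  # FTT -> (data, parity): RAID-5, RAID-6
        if ftt not in layouts:
            raise ValueError("erasure coding is shown only for FTT 1 and 2")
        data, parity = layouts[ftt]
        return vmdk_gb * (data + parity) / data
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    for ftt in (1, 2):
        mirrored = raw_capacity_gb(20, ftt, "performance")
        erasure = raw_capacity_gb(20, ftt, "capacity")
        print(f"FTT={ftt}: RAID-1 {mirrored:.0f}GB, erasure coding {erasure:.1f}GB")
```

      For a 20GB disk this gives 40GB versus roughly 26.7GB at FTT=1 and 60GB versus 30GB at FTT=2, in line with the ~27GB and ~30GB quoted on the slides.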
  42. Swap Placement? Sparse Swap
      • Reclaim space used by memory swap
      • Host advanced option enables the setting
      • How to set it? esxcfg-advcfg -g /VSAN/SwapThickProvisionDisabled reads the current value; use -s 1 in place of -g to enable sparse swap
      • https://github.com/jasemccarty/SparseSwap
  43. Snapshots for VSAN
      • Not using VMFS Redo Logs
      • Writes allocated into 4MB allocations
      • Snapshot metadata cache (avoids read amplification)
      • Performs pre-fetch of metadata cache
      • Maximum 31
  44. Wrapping up
  45. Three Ways to Get Started with Virtual SAN Today
      1. Online Hands-on Lab
         • Test-drive Virtual SAN right from your browser, with an instant Hands-on Lab
         • Register and your free, self-paced lab is up and running in minutes
      2. Download Evaluation (vmware.com/go/try-vsan-en)
         • 60-day Free Virtual SAN Evaluation
         • VMUG members get a 6-month EVAL or 1-year EVALExperience for $200
      3. VSAN Assessment
         • Reach out to your VMware Partner, SEs or Rep for a FREE VSAN Assessment
         • Results in just 1 week!
         • The VSAN Assessment tool collects and analyzes data from your vSphere storage environment and provides technical and business recommendations.
      Learn more… vmware.com/go/virtual-san
      • Virtual SAN Product Overview Video
      • Virtual SAN Datasheet
      • Virtual SAN Customer References
      • Virtual SAN Assessment
      • VMware Storage Blog
      • @vmwarevsan