A day in the life of a VSAN I/O - STO7875

  1. A day in the life of a VSAN I/O Duncan Epping (@DuncanYB) John Nicholson (@lost_signal) Diving into the I/O flow of Virtual SAN VMworld session: #SDDC7875
  2. Agenda 1 Introduction (Duncan) 2 Virtual SAN, what is it? (Duncan) 3 Virtual SAN, a bit of a deeper dive (Duncan) 4 What about failures? (John) 5 IO Deep Dive (John) 6 Wrapping up (John)
  3. The Software Defined Data Center Compute Networking Storage Management • All infrastructure services virtualized: compute, networking, storage • Underlying hardware abstracted, resources are pooled • Control of data center automated by software (management, security) • Virtual Machines are first class citizens of the SDDC • Today’s session will focus on one aspect of the SDDC - storage
  4. Hardware evolution started the infrastructure revolution
  5. Hyper-Converged Infrastructure: new IT model “A lazy admin is the best admin”
  6. Simplicity: Operational / Management
  7. The Hypervisor is the Strategic High Ground SAN/NAS x86 - HCI Object Storage VMware vSphere Cloud Storage
  8. Storage Policy-Based Management – App centric automation Overview • Intelligent placement • Fine control of services at VM level • Automation at scale through policy • Need new services for VM? • Change current policy on-the-fly • Attach new policy on-the-fly Virtual Machine Storage policy Reserve Capacity 40GB Availability 2 Failures to tolerate Read Cache 50% Stripe Width 6 Storage Policy-Based Management vSphere Virtual SAN Virtual Volumes Virtual Datastore
  9. Virtual SAN Primer So that we are all on the same page
  10. Virtual SAN, what is it? Hyper-Converged Infrastructure Distributed, Scale-out Architecture Integrated with vSphere platform Ready for today’s vSphere use cases Software-Defined Storage vSphere & Virtual SAN
  11. But what does that really mean? VSAN network Generic x86 hardware VMware vSphere & Virtual SAN Integrated with your Hypervisor Leveraging local storage resources Exposing a single shared datastore Virtual SAN
  12. VSAN is the Most Widely Adopted HCI Product Simplicity is key: on an oil platform there are no virtualization, storage or network admins. The infrastructure is managed over a satellite link via a centralized vCenter Server. Reliability, availability and predictability are key.
  13. Virtual SAN Use Cases VMware vSphere + Virtual SAN End User Computing Test/Dev ROBO Staging Management DMZ Business Critical Apps DR / DA
  14. Tiered Hybrid vs All-Flash • Hybrid: 40K IOPS per host; flash devices (SSD, PCIe) serve as read and write cache; capacity tier of SAS / NL-SAS / SATA for data persistence • All-Flash: 100K IOPS per host + sub-millisecond latency; writes cached first, reads served from capacity tier; capacity tier of flash devices (SSD, PCIe, NVMe)
  15. Flash Devices All writes and the vast majority of reads are served by flash storage 1. Write-back Buffer (30%) (or 100% in all-flash) – Writes acknowledged as soon as they are persisted on flash (on all replicas) 2. Read Cache (70%) – Active data set always in flash, hot data replaces cold data – Cache miss – read data from HDD and put it in cache A performance tier tuned for virtualized workloads – High IOPS, low $/IOPS – Low, predictable latency
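The 30/70 split described above is easy to sanity-check. A minimal sketch (the helper name and integer-GB simplification are mine, not a VMware API):

```python
def hybrid_cache_split(flash_gb):
    """Split a hybrid VSAN caching device per the slide's ratios:
    30% write-back buffer, 70% read cache (all-flash uses 100% for writes)."""
    write_buffer_gb = flash_gb * 30 // 100
    return {"write_buffer_gb": write_buffer_gb,
            "read_cache_gb": flash_gb - write_buffer_gb}

# A 400GB cache SSD yields a 120GB write buffer and a 280GB read cache.
print(hybrid_cache_split(400))
```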
  16. Virtual SAN, a bit of a deeper dive
  17. Virtual Machine as a set of Objects on VSAN • VM Home Namespace • VM Swap Object • Virtual Disk (VMDK) Object • Snapshot (delta) Object • Snapshot (delta) Memory Object VM Home VM Swap VMDK Snap delta Snap memory Snapshot
  18. Define a policy first… Virtual SAN currently surfaces multiple storage capabilities to vCenter Server Determines layout of components!
  19. Virtual SAN Objects and Components VSAN is an object store! • Object Tree with Branches • Each Object has multiple Components – This allows you to meet availability and performance requirements • Here is one example of “Distributed RAID” using 2 techniques: – Striping (RAID-0) – Mirroring (RAID-1) • Data is distributed based on VM Storage Policy ESXi Host ESXi Host ESXi Host Mirror Copy stripe-2b stripe-2a RAID-0 Mirror Copy stripe-1b stripe-1a RAID-0 witness VMDK Object RAID-1
  20. Number of Failures to Tolerate (How many copies of your data?) • Defines the number of host, disk or network failures a storage object can tolerate. • RAID-1 Mirroring is used when Failure Tolerance Method is set to Performance (default). • For “n” failures tolerated, “n+1” copies of the object are created and “2n+1” hosts contributing storage are required! esxi-01 esxi-02 esxi-03 esxi-04 Virtual SAN Policy: “Number of failures to tolerate = 1” vmdk ~50% of I/O vmdk witness ~50% of I/O RAID-1
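The n+1 / 2n+1 arithmetic on this slide can be expressed directly. A small illustrative helper (not part of any VMware API):

```python
def ftt_requirements(failures_to_tolerate):
    """For RAID-1 mirroring with FTT = n: n+1 replicas of the object,
    and 2n+1 hosts contributing storage (replicas plus witnesses)."""
    n = failures_to_tolerate
    return {"replicas": n + 1, "min_hosts": 2 * n + 1}

# FTT=1 -> 2 replicas across 3 hosts; FTT=2 -> 3 replicas across 5 hosts.
print(ftt_requirements(1), ftt_requirements(2))
```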
  21. Number of Disk Stripes Per Object (on how many devices?) • Number of disk stripes per object – The number of HDDs across which each replica of a storage object is distributed. Higher values may result in better performance. esxi-01 esxi-02 esxi-03 esxi-04 Virtual SAN Policy: “Number of failures to tolerate = 1” vmdk vmdk witness RAID-1 vmdk vmdk RAID-0 RAID-0
  22. Fault Domains, increasing availability through rack awareness • Create fault domains to increase availability • 8 node cluster with 4 defined fault domains (2 nodes in each) FD1 = esxi-01, esxi-02 FD2 = esxi-03, esxi-04 FD3 = esxi-05, esxi-06 FD4 = esxi-07, esxi-08 • To protect against one rack failure only 2 replicas are required and a witness across 3 fault domains! FD1 FD2 FD3 FD4 esxi-01 esxi-02 esxi-03 esxi-04 esxi-05 esxi-06 esxi-07 esxi-08 vmdk vmdk witness RAID-1
  23. What about failures?
  24. VSAN 1 host isolated – HA restart • HA detects an isolation – ESXi-01 cannot ping master – Master receives no pings – ESXi-01 cannot ping Gateway – Isolation declared! • HA kills VM on ESXi-01 – Note that the Isolation Response needs to be configured! – Shutdown / Power Off / Disabled • VM can now be restarted on any of the remaining hosts Isolated! esxi-01 esxi-03 esxi-05 esxi-07 vmdk vmdk witness RAID-1 HA restart
  25. VSAN 2 hosts partitioned – HA restart • This is not an isolation, but rather a partition • ESXi-01 can ping ESXi-02 • ESXi-01 cannot ping the rest of the cluster • VSAN kills VM on ESXi-01 – It does this as all components are inaccessible – AutoTerminateGhostVm • HA detects that VM is missing • HA sees no host is accessing components • HA restarts the VM! Partitioned esxi-01 esxi-03 esxi-05 esxi-07 vmdk vmdk witness RAID-1 HA restart esxi-02 esxi-04 esxi-06 esxi-08 FD1 FD2 FD3 FD4
  26. VSAN 4 hosts partitioned – HA restart • Double partition scenario! • Again, VSAN kills VM on ESXi-01 – AutoTerminateGhostVm • HA detects that VM is missing • HA sees no host is accessing components • HA restarts the VM in either FD2 or FD3! – They have majority Partitioned esxi-01 esxi-03 esxi-05 esxi-07 vmdk vmdk witness RAID-1 HA restart esxi-02 esxi-04 esxi-06 esxi-08 FD1 FD2 FD3 FD4 Partitioned
  27. VSAN 4 hosts partitioned – HA restart • Double partition scenario! • Note that VM remains running in FD1 • VM runs headless, cannot write to disk! • HA sees that access to storage is lost • HA restarts the VM in either FD3 or FD4! – They have majority • As soon as the partition is lifted the VM is killed in FD1 as it lost its lock! Partitioned esxi-01 esxi-03 esxi-05 esxi-07 vmdk vmdk witness RAID-1 HA restart esxi-02 esxi-04 esxi-06 esxi-08 FD1 FD2 FD3 FD4 Partitioned
  28. IO Deep Dive
  29. VSAN IO flow – Write Acknowledgement • VSAN mirrors write IOs to all active mirrors • These are acknowledged when they hit the write buffer! • The write buffer is flash based, persistent to avoid data loss • Writes will be de-staged to the capacity tier – VSAN takes locality into account when destaging for spindles – Optimizes IO pattern esxi-01 esxi-02 esxi-03 esxi-04 vmdk vmdk witness RAID-1
  30. vSphere & Virtual SAN Anatomy of a Hybrid Read 1. Guest OS issues a read on virtual disk 2. Owner chooses replica to read from • Load balance across replicas • Not necessarily local replica (if one) • A block is always read from the same replica 3. At chosen replica (esxi-03): read data from flash Read Cache or client cache, if present 4. Otherwise, read from HDD and place data in flash Read Cache • Replace ‘cold’ data 5. Return data to owner 6. Complete read and return data to VM vmdk vmdk 1 2 3 4 5 6 esxi-01 esxi-02 esxi-03
  31. vSphere & Virtual SAN Anatomy of an All-Flash Read 1. Guest OS issues a read on virtual disk 2. Owner chooses replica to read from – Load balance across replicas – Not necessarily local replica (if one) – A block is always read from the same replica 3. At chosen replica (esxi-03): read data from (write) Flash Cache or client cache, if present 4. Otherwise, read from capacity flash device 5. Return data to owner 6. Complete read and return data to VM vmdk vmdk 1 2 3 4 5 6 esxi-01 esxi-02 esxi-03
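Step 2 in both read paths relies on a deterministic block-to-replica mapping, so a given block is only ever cached on one host. A toy sketch of such a mapping (the 1MB granularity follows the speaker notes; the actual VSAN algorithm is not shown here):

```python
MB = 1024 * 1024

def replica_for_read(offset_bytes, replicas):
    """Map each 1MB-aligned region of a vmdk to a fixed replica, so a
    block is always read from (and cached on) the same host."""
    region = offset_bytes // MB
    return replicas[region % len(replicas)]

hosts = ["esxi-01", "esxi-03"]
# The same offset always resolves to the same host...
assert replica_for_read(4096, hosts) == replica_for_read(4096, hosts)
# ...while different 1MB regions spread the load across replicas.
print(replica_for_read(0, hosts), replica_for_read(MB, hosts))
```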
  32. Client Cache • Always Local • Up to 1GB of memory per Host • Memory Latency < Network Latency • Horizon 7 Testing - 75% fewer Read IOPS, 25% better latency • Complements CBRC • Enabled by default in 6.2 vmdk vmdk esxi-01 esxi-02 Witness
  33. vSphere & Virtual SAN Anatomy of Checksum 1. Guest OS issues a write on virtual disk 2. Host generates Checksum before it leaves the host 3. Transferred over network 4. Checksum verified on the host where it will be written to disk 5. ACK is returned to the virtual machine 6. On Read the checksum is verified by the host with the VM. If any component fails it is repaired from the other copy or parity. 7. Scrubs of cold data performed vmdk vmdk 1 2 3 4 5 6 7 esxi-01 esxi-02 esxi-03
  34. Deduplication and Compression for Space Efficiency (All-Flash Only, Beta) • Deduplication and compression at the disk group level – Enabled at the cluster level – Fixed block length deduplication (4KB blocks) • Compression after deduplication – LZ4 is used, low CPU! – Single feature, no schedules required! – File system stripes all IO across the disk group esxi-01 esxi-02 esxi-03 vmdk vmdk vmdk vSphere & Virtual SAN
  35. Deduplication and Compression Disk Group Stripes (Beta) • Deduplication and compression at the disk group level – Data stripes across the disk group • Fault domain isolated to disk group – Failure of a device leads to rebuild of the disk group – Stripes reduce hotspots – Endurance/Throughput Impact vmdk vmdk
  36. Costs of Deduplication (nothing is free) • CPU overhead • Metadata and Memory overhead – Overhead for Metadata? • IO Overhead (metadata lookup) • IO Overhead (data movement from write buffer) • IO Overhead (fragmentation) • Endurance Overhead vmdk Deduplication vmdk Compression
  37. Costs of Compression (nothing is free) • CPU overhead • Capacity overhead • Memory overhead • IO overhead vmdk Deduplication vmdk Compression
  38. Deduplication and Compression (I/O Path) (All-Flash Only) • Avoids Inline or post process downsides • Performed at disk group level • 4KB fixed block • LZ4 compression after deduplication 1. VM issues write 2. Write acknowledged by cache 3. Cold data to memory 4. Deduplication 5. Compression 6. Data written to capacity SSD SSD
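Steps 4-6 of this destage path can be sketched in a few lines. Per the speaker notes, dedup hashing uses SHA-1 and a compressed block is kept only if it fits in 2KB; `zlib` stands in for LZ4 here (LZ4 is not in the Python stdlib), and the helper name is mine:

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # 4KB fixed-block deduplication

def destage_block(block, dedup_table):
    """Sketch of the destage path: dedup a 4KB block by hash, compress the
    unique ones, and keep the compressed form only if it fits in 2KB."""
    assert len(block) == BLOCK_SIZE
    digest = hashlib.sha1(block).hexdigest()
    if digest in dedup_table:          # duplicate: just bump the refcount
        dedup_table[digest] += 1
        return ("dedup-hit", digest)
    dedup_table[digest] = 1
    compressed = zlib.compress(block)
    if len(compressed) <= 2048:        # worth storing compressed
        return ("compressed", compressed)
    return ("raw", block)              # poor ratio: store uncompressed

table = {}
first = destage_block(bytes(BLOCK_SIZE), table)   # zero-filled: compresses well
again = destage_block(bytes(BLOCK_SIZE), table)   # same content: deduplicated
print(first[0], again[0])  # compressed dedup-hit
```

Storing a poorly compressing block raw avoids both block-alignment issues and the CPU cost of decompressing data that barely shrank, matching the rationale in the notes.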
  39. RAID 5/6 • All-Flash enables RAID 5 and RAID 6 • SPBM Policy – Set per Object esxi-01 esxi-02 esxi-03 esxi-04 Virtual SAN Policy: “Number of failures to tolerate = 1” vmdk vmdk vmdk vmdk RAID-5 All-Flash Only
  40. RAID-5 Inline Erasure Coding • When Number of Failures to Tolerate = 1 and Failure Tolerance Method = Capacity → RAID-5 – 3+1 (4 host minimum) – 1.33x instead of 2x overhead • A 20GB disk consumes 40GB with RAID-1, now consumes ~27GB with RAID-5 RAID-5 ESXi Host parity data data data ESXi Host data parity data data ESXi Host data data parity data ESXi Host data data data parity
  41. RAID-6 Inline Erasure Coding (All Flash Only) • When Number of Failures to Tolerate = 2 and Failure Tolerance Method = Capacity → RAID-6 – 4+2 (6 host minimum) – 1.5x instead of 3x overhead • A 20GB disk consumes 60GB with RAID-1, now consumes ~30GB with RAID-6 RAID-6 ESXi Host parity data data ESXi Host parity data data ESXi Host data parity data ESXi Host data parity data ESXi Host data data parity ESXi Host data data parity
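The overhead figures on these two slides follow from simple stripe arithmetic; a quick check (illustrative helper, not a VMware API):

```python
def capacity_overhead(data_blocks, parity_blocks):
    """Capacity multiplier of a stripe: (data + parity) / data."""
    return (data_blocks + parity_blocks) / data_blocks

# RAID-1 costs 2x at FTT=1 and 3x at FTT=2; erasure coding shrinks that:
raid5 = capacity_overhead(3, 1)   # 3+1 -> 1.33x
raid6 = capacity_overhead(4, 2)   # 4+2 -> 1.5x
print(round(20 * raid5, 1), 20 * raid6)  # a 20GB disk -> ~26.7GB and 30.0GB
```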
  42. Swap Placement? Sparse Swap • Reclaim space used by the memory swap object • Host advanced option enables the setting • How to check it? esxcfg-advcfg -g /VSAN/SwapThickProvisionDisabled https://github.com/jasemccarty/SparseSwap
  43. Snapshots for VSAN • Not using VMFS Redo Logs • Writes allocated into 4MB allocations • Snapshot metadata cache (avoids read amplification) • Performs pre-fetch of metadata cache • Maximum of 31
  44. Wrapping up
  45. Three Ways to Get Started with Virtual SAN Today 1 Online Hands-on Lab • Test-drive Virtual SAN right from your browser—with an instant Hands-on Lab • Register and your free, self-paced lab is up and running in minutes 2 Download Evaluation • 60-day Free Virtual SAN Evaluation • VMUG members get a 6-month EVAL or a 1-year EVALExperience for $200 • vmware.com/go/try-vsan-en 3 VSAN Assessment • Reach out to your VMware Partner, SEs or Rep for a FREE VSAN Assessment • Results in just 1 week! • The VSAN Assessment tool collects and analyzes data from your vSphere storage environment and provides technical and business recommendations. Learn more… vmware.com/go/virtual-san • Virtual SAN Product Overview Video • Virtual SAN Datasheet • Virtual SAN Customer References • Virtual SAN Assessment • VMware Storage Blog • @vmwarevsan

Editor's Notes

  • The Software Defined Data Center

    In the SDDC, all three core infrastructure components (compute, storage and networking) are virtualized.
    Virtualization software abstracts underlying hardware, while pooling compute, network and storage resources to deliver better utilization, faster provisioning and simpler operations.
    The VM becomes the centerpiece of the operational model, providing automation and agility to repurpose infrastructure according to business needs.

    Today we will focus on Storage, which has been growing at an extremely rapid pace and is a fast changing aspect of the datacenter!
  • When it comes to “software defined storage” people typically talk about “software”. Hardware is an often overlooked factor, but I think it is fair to say that the hardware evolution started this new infrastructure revolution. The changes we are seeing today, with hyper-converged, new all-flash storage systems or even hybrid systems, are mostly made possible by hardware evolving. Where in the past we needed dozens of spindles to get 1000s of IOPS, we now only need 1 flash device to provide 10s if not 100s of thousands of IOPS; not just extremely high IOPS, but also delivered with ultra-low latency.

    SSDs, PCIe-based flash, NVMe and NVDIMM deliver much more than just performance. They enabled storage and infrastructure vendors to revolutionize their platforms by removing a lot of the complexity needed to provide the availability and performance required by virtualized workloads. RAID groups, disk groups, wide striping… none of them is needed any longer to give your business critical applications what they need.
  • What we are trying to achieve is to simplify datacenter operations, and our primary focus will be storage and availability. Storage, as we all know, has traditionally been a pain point in many data centers: high cost, and it usually does not provide the performance and scalability one would want. By offering our customers choice we aim to change the world of IT and start a new revolution. But we cannot do this by ourselves; we need the help of you, the consultant / admin / architect.
  • vSphere is perfectly positioned for this as it abstracts physical resources and can provide them as a shared pooled construct to the administrator.

    Because it sits directly in the I/O path, the hypervisor (through the notion of policies associated with virtual machines) has the unique ability to make optimal decisions around matching the demands of virtualized applications with the supply of underlying physical infrastructure.

    On top of that the platform provides you the ability to assign service level agreements to workloads which will reduce the operational complexity and as such significantly reduces the chances of making mistakes.
  • This is where it all starts; without Storage Policy Based Management many of the products and features we are about to talk about would not be possible! If there is one thing you need to remember when you walk away today, it is Storage Policy Based Management. It is the key enabler for Software Defined Storage and Availability!

    Storage Policy Based Management is composed of the following:
    Common Policy framework Across Virtual Volumes, Virtual SAN and VMFS-based Storage
    Common API Layer for Cloud Management Frameworks (vRealize Automation, OpenStack), Scripting users (PowerShell, JavaScript, Python, etc.) and Orchestration Platforms (vCO)
    Represents Application and VM Level Requirements
    Consumes Capabilities Published via VASA

    SPBM provides the following benefits for customers:
    Stable, Robust Automation Platform
    Intelligent placement and fine control of services at the VM level
    Shields Automation and Orchestration Platforms from infrastructure changes by abstracting the Underlying Storage Implementation
  • What is VSAN in a nutshell…

    So, it follows a hyper-converged architecture for easy, streamlined management and scaling of both compute and storage. Hyper-converged represents a system architecture – one where compute and persistence are co-located. This system architecture is enabled by software.

    It is a SDS product. A layer of software that runs on every ESXi host. It aggregates the local storage devices on ESX hosts (SSD and magnetic disks) and makes them look like a single pool of shared storage across all the hosts.

    VSAN has a distributed architecture with no single point of failure.

    VSAN goes a step further than other HCI products – VMware owns the most popular hypervisor in the industry. Strong integration of VSAN in the hypervisor means that we can optimize the data path and we ensure optimal resource scheduling (compute, network, storage) according to the needs of each application. At the end, better resource utilization means better consolidation ratios, more bang for your buck! Resource utilization is one part of the story. The other part is the Operational aspects of the product.

    VSAN has been designed as a storage product to be used primarily by vSphere admins. So, we put a lot of effort in packaging the product in a way that is ideal for today’s use cases of virtualized environments. Specifically, the VSAN configuration and management workflows have been designed as extensions of the existing host and cluster management features of vSphere. That means easy, intuitive operational experience for vSphere admins. It also means native integration with key vSphere features unlike any other storage product out there, HCI or not.
  • VSAN is widely adopted, with over 3000 customers since launch and some very interesting use cases ranging from oil platforms to trains, and it is now being planned for deployment on submarines and mobile deployment units out in the field.

    The Oil Platform scenario is a “ROBO” deployment managed through a central vCenter Server leveraging a satellite connection.

    As for the submarine and mobile deployment unit story, I can’t reveal who this is, but it is very real. Dual datacenter setups in a ship are not uncommon and Virtual SAN is a natural fit here.
  • We were very conservative when we initially launched VSAN – after all, this was customers data we were talking about.
    However, even though we were conservative, our customers were not.
    There are plenty of other use cases. The ones listed on the slide are the most commonly used. It is fair to say that Virtual SAN fits in most scenarios:
    Of course customers started with the test/dev workloads, just like they did when virtualization was first introduced
    Business Critical Apps – We have customers running Exchange / SQL / SAP and billing systems on Virtual SAN
    Virtual SAN is included in the Horizon Suite Advanced and Enterprise, so VDI/EUC is a natural fit.
    As a DR destination VSAN is also commonly used as you can scale out and the cost is relatively low compared to a traditional storage system
    Isolation workloads also something that VSAN is often used for, both DMZ and Management clusters fit this bill
    Of course there is also ROBO; VSAN can start small and grow when desired, both scale-out and scale-up, and with 6.1 we even made things better by introducing a 2-node configuration, but we will get back to that!
  • Virtual SAN enables both hybrid and all-flash architectures.
    Irrespective of the architecture, there is a flash-based caching tier which can be configured out of flash devices like SSDs, PCIe cards, Ultra DIMMs etc. The flash caching tier acts as the read cache/write buffer that dramatically improves the performance of storage operations.

    In the hybrid architecture, server-attached magnetic disks are pooled to create a distributed shared datastore that persists the data. In this type of architecture, you can get up to 40K IOPS per server host.

    In the All-Flash architecture, the flash-based caching tier is intelligently used as a write buffer only, while another set of SSDs forms the persistence tier to store data. Since this architecture utilizes only flash devices, it delivers extremely high IOPS of up to 90K per host, with predictable low latencies.
  • Each flash device is configured with two partitions: 30% used as write-back buffer and 70% as read cache. All writes go to the flash device and over time will be destaged to disk. The active data set of the aggregate workload reaching the disk group lives in the cache. With a 1:10 ratio of flash to HDD, and for all realistic workloads, more than 90% of reads are served from flash. With the majority of our customers this percentage is even higher; 98 / 99% is not uncommon.

    As you realize virtually all IOs are served from the flash and that can be achieved with a modest flash capacity, because in most practical cases the active data set is a fraction of the total stored volume of data!
  • Objects are divided and distributed into components based on policies. Components and policies will be covered shortly. VMs are no longer based on a set of files, like we have on traditional storage.
  • The first thing you do before you deploy a VM is define a policy. VSAN has what-if APIs, so it will show what the “result” would be of having such a policy applied to a VM of a certain size. Very useful, as it gives you an idea of what the “cost” is of certain attributes.

    Also note that a number of new capabilities were introduced in VSAN 6.2, and these will be discussed in more detail later on.
  • RAID-0 and RAID-1 were the only distributed RAID options up to and including version 6.1.
    New techniques introduced in VSAN 6.2 will be discussed shortly.
  • RAID-5/6 used when Fault Tolerance Method set to Capacity
  • Note that in order to protect against a rack failure the minimum required number of failure domains is 3, this is similar to protecting against a host failure using FTT=1 where the minimum number of hosts is 3.
  • It should be noted that in an All Flash environment the Cache Tier is 100% devoted to writes; in a hybrid it is only 30%.
  • Stress the point here that when data is read from the replica it is placed in the flash cache. We also use “spatial data locality”, which means that we will place a full 1MB block in cache instead of just the 4KB or 8KB which is requested, as it is very likely that blocks from the same region will be read. Also note that reads for a given block will always be served from the SAME host. This is to ensure we only have 1 copy in cache, and we always read from the same host for that block to optimize for performance and cost!

    To be clear, a 4KB read will come from 1 host, per 1MB block a single host will serve the read. If you read 32MB in total then this can come from different hosts, depending on who serves those specific blocks.

    More in-depth about spatial and temporal locality: https://www.vmware.com/files/pdf/products/vsan/VMware-Virtual-SAN-Data-Locality.pdf
  • Now that we saw how writes work, let’s take a look at reads in an all-flash.

    Same example here: a VM with a vmdk that has two replicas on H1 and H2.

    Stress: Major difference – read cache misses do not cause any serious performance degradation. Reads from a flash capacity device should be almost as quick as reading from cache. Another major difference is that there is no need to move the block from the capacity layer to the cache layer, which is what we do in hybrid configurations.
  • Client cache is a memory cache that uses 0.4% of host memory, up to 1GB of RAM. Note this is per host, not per virtual machine.
    In Horizon 7 testing it reduced read IOPS by 75% and improved latency by 25%.
    It will follow the virtual machine and stay local, because the overhead to re-warm it is low (it is small) and memory latency is lower than network latency (unlike disk latency).

    Extends CBRC caching benefits to:
    Linked Clones
    App volumes
    Non-replica components

  • VSAN checksums are performed in such a way as to identify and repair corruption of data at rest, as well as in flight. The checksum is generated as soon as possible, and verified as late as possible, to remove any opportunities for corruption.
    A host-side checksum cache assists with reducing IO amplification
    CRC32 is used as modern processors have offload functions to reduce CPU overhead.
    Scrubs of non-read data are performed yearly (but this is adjustable).
    Note this requires VSAN FSv3 or higher (introduced in 6.2, and supports hybrid)
  • All Flash Only.

    “High level description”
    Dedupe and compression happen during destaging from the caching tier to the capacity tier. You enable it at the cluster level, and deduplication/compression happens on a per disk group basis. Bigger disk groups will result in a higher deduplication ratio. After the blocks are deduplicated they are compressed. Compression alone is a significant saving already; combined with deduplication the results achieved can be up to 7x space reduction, of course fully dependent on the workload and type of VMs.

    “Lower level description”
    Compression (LZ4) would be performed during destaging from the caching tier to the capacity tier. 4KB is the block size for deduplication. For each unique 4k block compression would be performed and if the output block size is less than or equal to 2KB, a compressed block would be saved in place of the 4K block. If the output block size is greater than 2KB, the block would be written uncompressed and tracked as such. The reason is to avoid block alignment issues, as well as reduce the CPU hit for decompressing the data which is greater than compression for data with low compression ratios. All of this data reduction is after the write acknowledgement.

    Deduplication domains are within each disk group. This avoids needing a global lookup table (significant resource overhead), and allows us to put those resources towards tracking a smaller and more meaningful block size. We purposefully avoid deduping “write hot” data in the cache or compressing incompressible data, so significant CPU/memory resources avoid being wasted.

    Note: Feature is supported with stretch clusters, ROBO edition
  • So historically in VSAN you’ve known exactly where data is.

    In 6.2 with deduplication and compression, you know which disk group, but you do lose some granularity as you cannot control where a deduplication hash will “match”.
  • CPU overhead is low. SHA-1 is practically free as Intel has really fast offload.
    Memory overhead is a big deal with deduplication. Most vendors get stuck with a bad choice.
    Aim for memory efficiency and use a REALLY large block size (8KB or 16KB fixed, or worse, switch to variable block sizes, which have all kinds of terrible secondary impacts)?
    The downsides are obvious: you get awful dedupe. It’s a feature checkbox, and when you read their guidance it’s weird unicorn cases where they recommend it (full-clone VDI, when the moon is red).
    It still burns CPU and memory with little benefit.

    Aim for capacity efficiency, use a 4KB fixed block and stick everything in memory? This has some nasty downsides and it hasn’t really been done before in HCI, but we see these problems in scale-out and flash storage arrays.
    You can’t do HCI. Seriously, the one vendor who comes to mind who does 4KB fixed block dedupe in scale-out spends all their memory on lookup tables.
    You get stuck with small SSDs as memory MUST scale linearly. One AFA vendor even forced customers to evacuate and do a full data-destructive swing migration to move to 8KB because they couldn’t scale memory and CPU to handle it!


    Memory Overhead is low as this data already had to be in memory for the destage.
    Also we don’t run an in memory database for hash lookups (just a smart compressed cache)


    IO overheads
    Metadata is stored on disk, but we do deploy memory caches, and a DRAM cache (Client Cache) to keep the need for “double reads” to a minimum.
    This is not post process, so we don’t have the overhead of having to read or write data that was not already “in flight” aging out of the write buffer.
    This is not a post-process deduplication, so we don’t have issues of “holes in the layout” and fragmentation as the dedupe engine crawls the on-disk data and finds duplicates.

    We avoid a LOT of writes to the capacity tier, and this extends the lifespan of the capacity tier drives enabling us to use lower cost TLC drives.
  • CPU overhead is very low
    LZ4 is insanely fast and optimized for CPU utilization.
    We don’t compress duplicate data
    We don’t store data compressed if it compresses poorly. This means not wasting CPU on re-hydration of blocks that only went from 4K to 3K when compressed.

    Capacity overhead is great as LZ4 is a very efficient compression system.

    Memory overhead is low, as this is data that was already in memory (for de-stage and dedupe)

    IO overhead is low as we use a fixed block (no fragmentation, no defrag process)
  • “Near-line”, not in-line - Avoids performance hit
    Performed at disk group level - Balances efficiency and utilization
    4KB fixed block - More granular than many competitors
    LZ4 compression after deduplication - Only if the block compresses down to 2K or less
  • RAID 5/6 happens in the write buffer tier and migrates down to the capacity layer.
    So you will have a read of the copy of the data and the parity, a calculation made, and an update of both.
    This read-then-write adds marginal latency to the IO path and IOPS, but modern 10Gbps networks with very low port-to-port latency, as well as incredibly fast flash devices (and NVMe devices that can run parallel IO), make this trade-off acceptable for many workloads.

    You can set this per object, so an OLAP database could have RAID-5 for the data volume, and RAID-1 for the log that is write heavy.
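The read-then-write described above is the classic RAID-5 small-write path: new parity can be computed from the old data, new data and old parity alone, without touching the other stripe members. A minimal XOR sketch (illustrative, not VSAN's implementation):

```python
def update_parity(old_data, new_data, old_parity):
    """RAID-5 small write: new_parity = old_parity XOR old_data XOR new_data.
    Only the updated member and the parity need to be read and rewritten."""
    return bytes(p ^ od ^ nd
                 for p, od, nd in zip(old_parity, old_data, new_data))

a, b = bytes([1, 2, 3]), bytes([4, 5, 6])
parity = bytes(x ^ y for x, y in zip(a, b))  # full-stripe parity of a and b
b_new = bytes([7, 8, 9])
# The incremental update matches a full parity recomputation:
assert update_parity(b, b_new, parity) == bytes(x ^ y for x, y in zip(a, b_new))
```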
  • Sometimes RAID 5 and RAID 6 over the network are also referred to as erasure coding. This is done inline; there is no post-processing required.
    Since VMware has a design goal of not relying on data locality, this implementation of erasure coding does not bring any negative results by distributing the RAID-5/6 stripe across multiple hosts.

    In this case RAID-5 requires 4 hosts at a minimum as it uses 3+1 logic. With 4 hosts, 1 can fail without data loss. This results in a significant reduction of required disk capacity. Normally a 20GB disk would require 40GB of disk capacity, but in the case of RAID-5 over the network the requirement is only ~27GB. There is another option if higher availability is desired.

    Use case Information:
    Erasure coding offers guaranteed capacity reduction, unlike deduplication and compression. For customers who have “no thin provisioning” policies, have data that is already compressed and deduplicated, or have encrypted data, this offers “known/fixed” capacity gains.
    This can be applied on a granular basis (Per VMDK) using the Storage Policy Based Management system.


    30% Savings.
    Note: All Flash VSAN only.
    Note: Not supported with stretched clusters
    Note: this does not require the cluster size to be a multiple of 4, just 4 or more.
  • With RAID-6 two host failures can be tolerated, similar to FTT=2 using RAID-1.
    In the traditional scenario for a 20GB disk the required disk capacity would be 60GB, but with RAID-6 over the network this is just 30GB.
    Note that the parity is distributed across all hosts and there is no dedicated parity host or anything like that.
    Since VMware has a design goal of not relying on data locality, this implementation of erasure coding does not bring any negative results by distributing the RAID-5/6 stripe across multiple hosts.
    Again, this is sometimes referred to by others as erasure coding. In this case a 4+2 configuration is used, which means that 6 hosts is the minimum to be able to use this configuration.

    Use case Information:
    Erasure coding offers guaranteed capacity reduction, unlike deduplication and compression. For customers who have “no thin provisioning” policies, have data that is already compressed and deduplicated, or have encrypted data, this offers “known/fixed” capacity gains.
    This can be applied on a granular basis (Per VMDK) using the Storage Policy Based Management system.


    50% savings
    Note: All Flash VSAN only
    Note: this does not require the cluster size be a multiple of six, just six or more.
    Not supported with stretched clusters
  • Swap by default has a policy to be thick and reserve space. In order to disable this, there is an advanced setting, and changing it takes effect the next time a virtual machine is restarted.
    Helpful in memory-dense environments without significant overcommitment.
    Jase McCarty’s GitHub has a handy PowerCLI script for setting this cluster-wide.
  • The VSAN snapshot system is a significant improvement on traditional redo logs. By leveraging the sparse file system, new snapshot objects are created for writes to redirect into. Writes are directed into 4MB allocations.
    Writes always go into the newest snapshot object (no write amplification); reads must read through the chain, but a small in-memory cache tracking the most current data significantly reduces this.
    In real-world workloads a 5% overhead isn’t uncommon with several snapshots open. Traditional VMFS redo logs have significantly higher overhead.
  • With that I would (click) like to thank you and open the floor for questions
