Accurate network latency measurements between VMs in the cloud can paint a surprisingly detailed picture of the underlying infrastructure, in terms of both physical topology and network load. Drawing on data from thousands of clusters in all major clouds, this session distills the most interesting findings and highlights their implications for cloud system performance.
In particular, VM colocation, where multiple virtual machines are hosted on the same physical machine, causes bandwidth and latency bottlenecks.
Cutting Through the Fog of Virtualization
1. Brought to you by
Cutting Through the Fog of Virtualization
Bernd Bandemer
Head of Data Science at Clockwork.io
2. Bernd Bandemer
Head of Data Science at Clockwork.io
■ Accurate, scalable, and stable clock sync is a game changer in distributed computing
■ My prior lives: Aruba Networks, Stanford, originally from Germany
3. The Promise of Virtualization
■ Dynamic demand automatically drives scale-up / scale-down
■ Resource guarantees
● Every VM behaves the same, independent of location, date, time of day
● Every network link between VMs behaves the same
■ Resource isolation
● Your neighbors won't bother you
● Your own VMs won't affect each other
4. VM Colocation
[Diagram: fabric switches, top-of-rack switches, physical machines (hosts)]
5. VM Colocation
[Diagram: same topology, with virtual machines shown on the physical machines (hosts)]
6. VM Colocation
[Diagram: same topology, with YOUR virtual machines marked among the virtual machines]
7. VM Colocation
[Diagram: same topology, with YOUR virtual machines marked among the virtual machines]
8. VM Colocation
[Diagram: same topology, with some of YOUR virtual machines colocated on the same host]
9. VM Colocation vs. Shared Tenancy
VM Colocation is much worse than shared tenancy
■ Shared Tenancy
● Load on VMs is independent of each other
● Load spikes average out across the VMs
■ VM Colocation
● VMs participate in the same workload at the same time
● Simultaneous load spikes, averaging does not help
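The averaging argument on this slide can be illustrated with a small sketch. It assumes a simplified statistical model that is not from the talk: every VM's load has the same standard deviation, and loads are either fully independent (shared tenancy) or perfectly correlated (colocated VMs in the same workload). The function name and the numbers are hypothetical.

```python
import math

def aggregate_load_std(n_vms: int, per_vm_std: float, correlated: bool) -> float:
    """Standard deviation of the summed load of n_vms VMs on one host.

    Independent loads: variances add, so the spread grows like sqrt(n).
    Perfectly correlated loads: spikes coincide, so the spread grows like n.
    """
    if correlated:
        return n_vms * per_vm_std
    return math.sqrt(n_vms) * per_vm_std

# Hypothetical example: 16 VMs, each with a load spread of 1.0 (arbitrary units).
shared_tenancy = aggregate_load_std(16, 1.0, correlated=False)
colocation = aggregate_load_std(16, 1.0, correlated=True)
print(shared_tenancy, colocation)
```

Under this model, the host sees a spread of 4 units with independent tenants and 16 units when all 16 VMs spike together, which is why averaging does not help in the colocated case.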
10. Potential Effects of VM Colocation
■ CPU and memory are well isolated between VMs
■ Networking is not well isolated
● Bandwidth: colocated VMs share a physical network interface (NIC)
● Latency: packets between colocated VMs don't travel through the network
● Packet drops: packets between colocated VMs can't get dropped in the network
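As a rough illustration of the bandwidth point, the sketch below estimates an upper bound on per-VM throughput when several colocated VMs transmit at once. It assumes an equal split of the shared NIC and a per-VM cap, neither of which is stated in the talk; real hypervisors may schedule traffic differently. All names and numbers are hypothetical.

```python
def per_vm_bandwidth_gbps(nic_gbps: float, active_colocated_vms: int,
                          vm_cap_gbps: float) -> float:
    """Upper bound on per-VM throughput under an assumed equal NIC split.

    nic_gbps: capacity of the shared physical NIC (assumed value).
    active_colocated_vms: VMs on the host sending at the same time.
    vm_cap_gbps: per-VM limit advertised for the instance type (assumed value).
    """
    if active_colocated_vms < 1:
        raise ValueError("need at least one active VM")
    fair_share = nic_gbps / active_colocated_vms
    return min(vm_cap_gbps, fair_share)

# Hypothetical numbers: 25 Gbps NIC, 10 Gbps per-VM cap.
print(per_vm_bandwidth_gbps(25.0, 1, 10.0))  # one VM alone is limited by its cap
print(per_vm_bandwidth_gbps(25.0, 5, 10.0))  # five VMs are limited by the NIC share
```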
12. Real-world Data
■ ~3,000 test cluster instances
■ Amazon EKS, Google GKE, Azure AKS
■ Three geographic regions: Eastern US (Virginia), Western Europe (London), Southeast Asia (Singapore)
■ Each cluster
● Bring up 50 virtual machine instances
● Instrumented with Clockwork's clock sync system
● Run a Latency Sensei audit, which includes several phases of network load
13. Determining VM Colocation
■ Clock sync system measures relative clock offsets and clock drifts
● This is done by exchanging small UDP packets in a probe mesh and measuring their transit times
■ Colocated VMs share a physical system clock
● This can be detected and used to reverse-engineer the colocation
● Validated by many experiments, including sole-tenant hosts
[Figure: example colocation structure]
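The idea that a shared physical clock reveals colocation can be sketched as a simple grouping step. This is not Clockwork's algorithm; it is a minimal illustration that assumes each VM already has a clock drift estimate (in ppm, relative to a common reference) and treats VMs whose drifts agree within a hypothetical tolerance as sharing a host. A real system would likely also compare offsets over time.

```python
def group_by_clock_drift(drift_ppm: dict[str, float],
                         tol_ppm: float = 0.01) -> list[list[str]]:
    """Group VMs whose measured clock drifts are nearly identical.

    drift_ppm: VM name -> estimated drift vs. a common reference (assumed input).
    tol_ppm: hypothetical tolerance for calling two drifts "the same clock".
    VMs are sorted by drift; a new group starts whenever the gap to the
    previous VM exceeds the tolerance.
    """
    groups: list[list[str]] = []
    prev_drift = None
    for vm, drift in sorted(drift_ppm.items(), key=lambda kv: kv[1]):
        if prev_drift is not None and drift - prev_drift <= tol_ppm:
            groups[-1].append(vm)
        else:
            groups.append([vm])
        prev_drift = drift
    return groups

# Hypothetical drift estimates: vm-a and vm-c track each other closely.
drifts = {"vm-a": 12.301, "vm-b": -3.750, "vm-c": 12.302, "vm-d": 5.120}
print(group_by_clock_drift(drifts))
```

With these made-up values, vm-a and vm-c end up in one group, which would flag them as candidates for colocation on the same host.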
21. Network Latency
■ We measure two-way delay between any two VMs with high accuracy
● Two-way delay is the sum of two one-way delays
● Exclude the effect of ACK turn-around time and sender/receiver stack delays
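A minimal sketch of the two-way delay computation, using the usual four-timestamp exchange. The timestamp names t1 to t4 are assumptions, not taken from the talk. Subtracting the responder's hold time removes the turn-around time, and any constant clock offset between the two VMs cancels because it enters the two one-way terms with opposite signs.

```python
def two_way_delay(t1: int, t2: int, t3: int, t4: int) -> int:
    """Two-way network delay from four timestamps (same unit, e.g. microseconds).

    t1: probe sent by A     (A's clock)
    t2: probe received by B (B's clock)
    t3: reply sent by B     (B's clock)
    t4: reply received by A (A's clock)
    """
    forward = t2 - t1   # one-way delay A->B plus the clock offset of B vs. A
    reverse = t4 - t3   # one-way delay B->A minus the same clock offset
    return forward + reverse  # equals (t4 - t1) - (t3 - t2)

# Hypothetical exchange: 40 us A->B, 60 us B->A, B's clock 1000 us ahead,
# and B holds the probe for 500 us before replying.
t1 = 0
t2 = t1 + 40 + 1000
t3 = t2 + 500
t4 = (t3 - 1000) + 60
print(two_way_delay(t1, t2, t3, t4))
```

The result is 100 us, the sum of the two one-way delays, regardless of the 1000 us clock offset and the 500 us turn-around time in this made-up exchange.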
22. Network Latency on AWS
AWS virtual networking hides any potential latency benefit
■ No visible difference between colocated and non-colocated pairs of VMs
23. Network Latency on AWS
AWS virtual networking hides any potential latency benefit
■ Two distinct modes, probably explained by different generations of networking implementation
24. Network Latency on Azure
In Azure, colocated VMs have higher latency
■ Azure's accelerated networking optimizes the typical case
■ VMs on the same physical host raise an exception that is handled in software
25. Network Latency on Google Cloud
In Google Cloud, each region/instance type behaves differently
27. Network Packet Drops
■ In each measurement run, we send tens of millions of probe packets
■ UDP packets may get lost on the way
            Non-colocated VMs    Colocated VMs
Azure       68 ppm               60 ppm
AWS         220 ppm              213 ppm
Google      60 ppm               62 ppm
Packet drop rates are independent of VM colocation
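For reference, a drop rate in parts per million (ppm) can be computed from probe counts as in the sketch below. The function name and the packet counts are hypothetical and chosen only to show the arithmetic; they are not the measured counts behind the figures above.

```python
def drop_rate_ppm(sent: int, received: int) -> float:
    """Packet drop rate in parts per million.

    sent: number of probe packets transmitted.
    received: number of those probes that arrived.
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return (sent - received) * 1_000_000 / sent

# Hypothetical run: 10 million probes sent, 600 lost.
print(drop_rate_ppm(10_000_000, 9_999_400))
```

In this made-up run, 600 lost probes out of 10 million correspond to 60 ppm.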
28. Conclusion
■ VM colocation has a performance impact and no upside
● Highly colocated VMs have lower network bandwidth
● Colocation has no latency or reliability benefit
■ For optimal cloud system performance, VM colocation should be avoided
■ Clockwork Latency Sensei provides visibility into your cloud system
● Accurate clock sync makes VM colocation visible
● Latency Sensei audit reports highlight the impact on YOUR cloud system
29. Brought to you by
Bernd Bandemer
bernd@clockwork.io
www.clockwork.io
Thank you!