Cutting Through the Fog of Virtualization

Bernd Bandemer
Head of Data Science at Clockwork.io

■ Accurate, scalable, and stable clock sync is a game changer in distributed computing
■ My prior lives: Aruba Networks, Stanford, originally from Germany

The Promise of Virtualization
■ Dynamic demand automatically drives scale-up / scale-down
■ Resource guarantees
● Every VM behaves the same, independent of location, date, time of day
● Every network link between VMs behaves the same
■ Resource isolation
● Your neighbors won't bother you
● Your own VMs won't affect each other

VM Colocation
[Figure: data center topology. Fabric switches connect the top-of-rack switches; each top-of-rack switch serves a rack of physical machines (hosts); each host runs several virtual machines. Some of YOUR virtual machines are colocated on the same host.]

VM Colocation vs. Shared Tenancy
■ Shared Tenancy
● Load on VMs is independent of each other
● Load spikes average out across the VMs
■ VM Colocation
● VMs participate in the same workload at the same time
● Simultaneous load spikes, averaging does not help
VM Colocation is much worse than shared tenancy
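The averaging argument can be illustrated with a small simulation. This is only a sketch under assumed conditions, not a model from the talk: each VM's load is a hypothetical on/off spike, and the function name `peak_to_mean` and all parameter values are made up for illustration. It compares the peak-to-mean ratio of the aggregate host load when VMs spike independently (shared tenancy) against when they all spike together (colocated VMs in one workload).

```python
import random
import statistics


def peak_to_mean(n_vms, n_steps, spike_prob, correlated, seed=0):
    """Return the peak/mean ratio of the aggregate load on one host.

    Hypothetical model: a VM contributes load 1.0 while spiking, else 0.0.
    correlated=False: every VM spikes independently (shared tenancy).
    correlated=True:  all VMs spike at the same time (VM colocation).
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_steps):
        if correlated:
            # One coin flip decides for all VMs at once.
            spiking = n_vms if rng.random() < spike_prob else 0
        else:
            # One coin flip per VM.
            spiking = sum(1 for _ in range(n_vms) if rng.random() < spike_prob)
        totals.append(float(spiking))
    return max(totals) / statistics.mean(totals)


if __name__ == "__main__":
    for label, corr in (("independent", False), ("simultaneous", True)):
        ratio = peak_to_mean(n_vms=50, n_steps=10_000, spike_prob=0.1, correlated=corr)
        print(f"{label:>12}: peak/mean = {ratio:.1f}")
```

In this toy model the simultaneous case has a noticeably higher peak-to-mean ratio, which is the sense in which averaging "does not help" when the VMs on a host belong to the same workload.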

Potential Effects of VM Colocation
■ CPU and memory are well isolated between VMs
■ Networking is not well isolated
● Bandwidth: colocated VMs share a physical network interface (NIC)
● Latency: packets between colocated VMs don't travel through the network
● Packet drops: packets between colocated VMs can't get dropped in the network
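To make the bandwidth point concrete, here is a minimal sketch of what sharing a NIC could mean for each VM. It assumes an equal fair share among simultaneously active VMs plus a per-VM cap; both assumptions, the function name `per_vm_egress_gbps`, and the numbers in the usage example are hypothetical and are not taken from any provider's documentation.

```python
def per_vm_egress_gbps(nic_capacity_gbps, per_vm_cap_gbps, n_colocated_active):
    """Estimate per-VM egress bandwidth on one host.

    Assumption (not from the talk): the host NIC is split equally among
    the colocated VMs that are sending at the same time, and each VM is
    additionally limited by a per-VM cap.
    """
    if n_colocated_active < 1:
        raise ValueError("need at least one active VM")
    fair_share = nic_capacity_gbps / n_colocated_active
    return min(per_vm_cap_gbps, fair_share)


if __name__ == "__main__":
    # Hypothetical host: 25 Gbps NIC, 10 Gbps cap per VM.
    for k in range(1, 6):
        print(k, "colocated VMs ->", round(per_vm_egress_gbps(25.0, 10.0, k), 2), "Gbps each")
```

Under these assumptions the per-VM cap dominates for low colocation counts, and the shared NIC becomes the bottleneck only once enough colocated VMs send at the same time.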

How do these effects play out in practice?

Real-world Data
■ ~3,000 test cluster instances
■ Amazon EKS, Google GKE, Azure AKS
■ Three geographic regions:
Eastern US (Virginia), Western Europe (London), Southeast Asia (Singapore)
■ Each cluster
● Brings up 50 virtual machine instances
● Is instrumented with Clockwork's clock sync system
● Runs a Latency Sensei audit, which includes several phases of network load

Determining VM Colocation
■ Clock sync system measures relative clock offsets and clock drifts
● This is done by exchanging small UDP packets in a probe mesh and measuring their transit times
■ Colocated VMs share a physical system clock
● This can be detected and used to reverse-engineer the colocation
● Validated by many experiments, including sole-tenant hosts
[Figure: example colocation structure]
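One way the reverse-engineering step might be sketched: if two VMs share a physical clock, their measured relative drift should be close to zero, so VMs can be grouped by linking every pair whose drift falls under a threshold. This is an illustration only, not Clockwork's actual algorithm; the input format, the function name `colocation_groups`, and the threshold value are assumptions.

```python
def colocation_groups(drift_ppm, threshold_ppm=0.01):
    """Group VMs whose pairwise relative clock drift is near zero.

    drift_ppm: dict mapping (vm_a, vm_b) -> relative drift in ppm
               (hypothetical input; the real measurement format is unknown).
    Returns a list of groups (sorted lists of VM names), largest first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), drift in drift_ppm.items():
        root_a, root_b = find(a), find(b)
        if abs(drift) <= threshold_ppm and root_a != root_b:
            parent[root_a] = root_b  # same physical clock: merge groups

    groups = {}
    for vm in list(parent):
        groups.setdefault(find(vm), []).append(vm)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


if __name__ == "__main__":
    # Made-up drift measurements for four VMs.
    measured = {
        ("a", "b"): 0.001,
        ("a", "c"): 3.2,
        ("b", "c"): -3.2,
        ("c", "d"): 0.0,
    }
    print(colocation_groups(measured))
```

A real system would also have to deal with measurement noise and with unrelated hosts whose clocks happen to drift at similar rates, which this sketch ignores.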

Network Bandwidth
■ We measure network bandwidth by sending long TCP flows between the VMs
[Figure: colocation structure alongside per-VM egress bandwidth, on a scale of 0 to 10 Gbps]
Colocated VMs have severely reduced network bandwidth

Network Bandwidth on Google Cloud
[Figure: panels for n1-standard-4 and n2-standard-4]
Bandwidth is impacted when 3 or more VMs are colocated

Network Bandwidth on AWS
[Figure: panels for m4.xlarge and m5.xlarge]
Colocation is purposely limited; no bandwidth impact

Network Bandwidth on Azure
[Figure: panels for Standard_D4s_v3 and Standard_D4s_v4]
Bandwidth impact appears when more than 4 VMs are colocated

Network Bandwidth on Azure
During low-load times, Azure lifts the speed limit
[Figure: egress bandwidth on a scale of 0 to 20 Gbps]

Network Latency
■ We measure two-way delay between any two VMs with high accuracy
● Two-way delay is the sum of two one-way delays
● Exclude the effect of ACK turn-around time and sender/receiver stack delays
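A common way to obtain a two-way delay that excludes the remote turn-around time is the four-timestamp exchange used by NTP-style protocols. The sketch below assumes that approach; the talk does not state the exact method, and the function name and timestamp values are hypothetical. Excluding stack delays additionally requires timestamps taken close to the wire, which is outside the scope of this sketch.

```python
def two_way_delay(t1_send, t2_recv, t3_reply, t4_recv):
    """Two-way network delay between VM A and VM B.

    t1_send:  A sends the probe        (A's clock)
    t2_recv:  B receives the probe     (B's clock)
    t3_reply: B sends the reply        (B's clock)
    t4_recv:  A receives the reply     (A's clock)

    (t2 - t1) is the forward one-way delay plus the clock offset;
    (t4 - t3) is the reverse one-way delay minus the clock offset.
    Their sum is the two-way delay: the offset cancels, and B's
    turn-around time (t3 - t2) is never included.
    """
    return (t2_recv - t1_send) + (t4_recv - t3_reply)


if __name__ == "__main__":
    # Made-up example in microseconds: B's clock runs 500000 us ahead of A's,
    # forward delay 40 us, turn-around 300 us, reverse delay 60 us.
    print(two_way_delay(1_000, 501_040, 501_340, 1_400))
```

In the usage example the result equals the sum of the forward and reverse delays, regardless of the clock offset and of how long B waited before replying.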

Network Latency on AWS
AWS virtual networking hides any potential latency benefit
■ No visible difference between colocated and non-colocated pairs of VMs
■ Two distinct modes, probably explained by different generations of networking implementation

Network Latency on Azure
In Azure, colocated VMs have higher latency
■ Azure's accelerated networking optimizes the typical case
■ VMs on the same physical host raise an exception that is handled in software

Network Latency on Google Cloud
In Google Cloud, each region/instance type behaves differently

Network Packet Drops
■ In each measurement run, we send tens of millions of probe packets
■ UDP packets may get lost on the way

          Non-colocated VMs   Colocated VMs
Azure     68 ppm              60 ppm
AWS       220 ppm             213 ppm
Google    60 ppm              62 ppm

Packet drop rates are independent of VM colocation
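For readers less used to the unit, parts per million (ppm) is simply the number of lost packets per million sent. A minimal sketch of the conversion follows; the function name `drop_rate_ppm` and the packet counts in the usage example are hypothetical and are not measurements from the talk.

```python
def drop_rate_ppm(sent, received):
    """Packet drop rate in parts per million (lost packets per 1,000,000 sent)."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return (sent - received) / sent * 1_000_000


if __name__ == "__main__":
    # Hypothetical run: 20 million probes sent, 1,200 of them lost.
    print(round(drop_rate_ppm(20_000_000, 20_000_000 - 1_200), 1), "ppm")
```

At tens of millions of probes per run, rates in the tens to hundreds of ppm correspond to thousands of lost packets, so differences of a few ppm between two groups are small.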

Conclusion
■ VM Colocation has a performance impact and no upside
● Highly colocated VMs have lower network bandwidth
● Colocation has no latency or reliability benefit
■ For optimal cloud system performance, VM colocation should be avoided
■ Clockwork Latency Sensei provides visibility into your cloud system
● Accurate clock sync makes VM colocation visible
● Latency Sensei audit reports highlight the impact on YOUR cloud system

Bernd Bandemer
bernd@clockwork.io
www.clockwork.io
Thank you!
