Networking Challenges
for the Next Decade
Amin Vahdat
On behalf of Google Technical Infrastructure and Google Cloud Platform
APRIL 4, 2017
Google Global Cache edge nodes
FASTER (US, JP, TW) 2016
Unity (US, JP) 2010
SJC (JP, HK, SG) 2013
Points of presence >100
Network fiber
Google Network
More than a collection of data centers
[Map: current and future regions, each marked with its number of zones. Regions shown: Frankfurt, Singapore, S Carolina, N Virginia, Belgium, London, Taiwan, Mumbai, Sydney, Oregon, Iowa, São Paulo, Finland, Tokyo, Montreal, California, Netherlands.]
Google Cloud Regions
Adding 11 new regions
Ubiquitous Cloud...10x Scaling
• Datacenter (10x): Next-gen disaggregation of storage, memory and compute
• Campus & Metro (10x): Cloud regions and campus expansion driving DC interconnect
• WAN (10x): Cloud replication and bandwidth-intensive cloud services (e.g., turnkey video, IoT)
Step Function Disruptions: Bandwidth, Latency, Availability, Predictability
The Pillars of SDN @ Google
• B4: WAN interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter networking
• Espresso: SDN for public Internet
B4: Google's Software Defined WAN
B4: From Copy Network to Business Critical
[Chart: B4 traffic, 2012 to 2016]
B4: [Jain et al., SIGCOMM 13] BwE: [Kumar et al., SIGCOMM 15]
Andromeda
[Diagram: virtual networks (VNET: 5.4/16, VNET: 192.168.32/24, VNET: 10.1.1/24) layered over an internal network of ToR switches serving 10.1.1/24, 10.1.2/24, 10.1.3/24 and 10.1.4/24, alongside Google Infrastructure Services. NFV functions shown: Load Balancing, DoS, ACLs, VPN.]
Google Datacenter Network Innovation
And hardware scale that we could not buy
[Chart: capacity over time across datacenter network generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
1.3 Pb/s clusters in 2013
The Pillars of SDN @ Google
• B4: WAN interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter networking
• Public Internet? Espresso: SDN for public Internet
Espresso in Context
[Diagram, built up over three slides: Google's Jupiter data centers, the B4 and B2 backbones, and the peering metro, where Espresso sits between Google and the Internet and its users.]
Espresso: Before and After
Before: Router-centric protocols
• Local view
• Connectivity first
• Coarse fault recovery
After: Espresso SDN peering
• Per-metro and global view
• Application signals
• Real-time optimization
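
The contrast between a connectivity-first, local view and a view driven by application signals can be illustrated with a minimal sketch. This is not Espresso's algorithm. The peer names, the `PeerLink` fields and the selection rule are all hypothetical, chosen only to show how application-reported latency and link headroom could override a router-local preference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerLink:
    name: str              # hypothetical peer identifier
    bgp_preference: int    # higher is preferred, as a router-local policy sees it
    capacity_gbps: float
    load_gbps: float
    app_latency_ms: float  # hypothetical end-to-end latency reported by applications


def pick_by_bgp(links):
    """Connectivity-first choice: highest preference wins, load is ignored."""
    return max(links, key=lambda link: link.bgp_preference)


def pick_by_signals(links, demand_gbps):
    """Choose the lowest-latency link that still has headroom for the demand."""
    fits = [l for l in links if l.capacity_gbps - l.load_gbps >= demand_gbps]
    candidates = fits or links  # fall back to every link if none has headroom
    return min(candidates, key=lambda link: link.app_latency_ms)


links = [
    PeerLink("peer-a", bgp_preference=200, capacity_gbps=100,
             load_gbps=98, app_latency_ms=40),
    PeerLink("peer-b", bgp_preference=100, capacity_gbps=100,
             load_gbps=30, app_latency_ms=25),
]

print(pick_by_bgp(links).name)
print(pick_by_signals(links, demand_gbps=5).name)
```

In this made-up scenario the preference-only choice keeps sending traffic to a nearly full link, while the signal-driven choice moves it to the link with headroom and lower measured latency.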
Espresso Architecture Overview
[Diagram, built up over three slides: an Espresso metro]
• Peering Fabric: label-switched fabric
• BGP speaker: eBGP peering with the external peer
• Hosts with packet processors: labeled packets specify egress
• Local Control
• Global Controller, fed by application signals
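
The idea that labeled packets specify egress can be sketched as a host-side lookup. This is a minimal illustration under stated assumptions, not Espresso's data path: the table contents, the label values and the `encapsulate` helper are hypothetical, and the prefixes come from the documentation address ranges.

```python
import ipaddress

# Hypothetical table that a controller might program into a host's packet
# processor: destination prefix -> label naming an egress on the peering fabric.
EGRESS_TABLE = {
    ipaddress.ip_network("203.0.113.0/24"): 1001,
    ipaddress.ip_network("203.0.113.128/25"): 1002,
    ipaddress.ip_network("0.0.0.0/0"): 1000,  # default egress
}


def egress_label(dst: str) -> int:
    """Longest-prefix match over the programmed table."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in EGRESS_TABLE if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return EGRESS_TABLE[best]


def encapsulate(dst: str, payload: bytes) -> dict:
    """Attach the label so a label-switched fabric can forward on the label alone."""
    return {"label": egress_label(dst), "dst": dst, "payload": payload}


packet = encapsulate("203.0.113.200", b"hello")
print(packet["label"])
```

The design point the sketch tries to capture is that the routing decision is made at the host, so the fabric only needs to map a small set of labels to egress ports.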
Next Decade Challenges in Networking
The next wave in computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general-purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
The Third Wave of Cloud Computing
Cloud 1.0 (last decade)
• Virtualization delivers capex savings to enterprise DCs
Cloud 2.0 (now): HW on demand
• Public cloud frees enterprise from private HW infrastructure
• Scheduling, load balancing primitives, “big data” query processing
Cloud 3.0: Compute, not servers
• Serverless compute, real-time intelligence, and machine learning
• Not data placement, load balancing, OS configuration and patching
Networking should be aiming for Cloud 3.0
Networking and Cloud 3.0
• Storage disaggregation: the datacenter is the storage appliance
• Seamless telemetry and scale up/down
• Transparent live migration
• Open marketplace of services, securely placed and accessed
Networking and Cloud 3.0
• Applications + functions, not VMs
• Policy, not middleboxes
• Actionable intelligence, not data processing
• SLOs, not placement/load balancing/scheduling
Next Decade Challenges in Networking
• The network will enable next-generation compute infrastructure
• The network can define next-generation storage infrastructure
• The right network infrastructure can deliver fundamental new capability
How we Prioritize
Infrastructure Work
Availability
Manageability
Velocity
Stranding
Performance
Availability is Paramount
• First things first: an insecure infrastructure is an unavailable infrastructure
• Stability is more important than efficiency
• Network management is critical
• Configuration is hard
• Automation matters but can be counter to availability
“Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure.” SIGCOMM 2016.
Build for Velocity
• Velocity is the speed of iteration
• Retrospective on “Tussle in Cyberspace: Defining Tomorrow’s Internet”
• Build for hitless upgrades and self-validation
• Debugging and tracing matter
○ Without visibility, performance does not matter
• Network fabrics built for expansion and evolution
• Launch and iterate
Isolation is Critical; Stranding is Terrible
Isolation with reservations is easy but leads to huge resource stranding
● General-purpose, shared infrastructure must approximate custom-built, reserved infrastructure
Isolation has many components
● Latency and bandwidth, but also the control plane
● Accounting and chargeback are big missing pieces
Congestion control is still really hard
● Rationalizing multiple control loops: flow, endpoint, flow group, traffic engineering
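
The stranding argument can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not a method described in the talk: the tenant numbers are invented, and max-min fair sharing (water-filling) is used only as one familiar stand-in for "general-purpose, shared infrastructure".

```python
def stranded_with_reservations(reserved, used):
    """Capacity that is reserved by a tenant but idle, and unusable by others."""
    return sum(max(r - u, 0.0) for r, u in zip(reserved, used))


def max_min_fair(capacity, demands):
    """Water-filling: split remaining capacity equally among unsatisfied demands."""
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)
        satisfied = [i for i in active if demands[i] - alloc[i] <= share]
        if not satisfied:
            # Nobody can be fully satisfied: everyone gets an equal share.
            for i in active:
                alloc[i] += share
            remaining = 0.0
            break
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active = [i for i in active if i not in satisfied]
    return alloc


# Hypothetical 100-unit link shared by three tenants.
reserved = [40, 40, 20]
demands = [10, 60, 20]
# Under hard reservations each tenant is capped at its reservation.
used = [min(r, d) for r, d in zip(reserved, demands)]

print(stranded_with_reservations(reserved, used))
print(max_min_fair(100, demands))
```

With hard reservations the first tenant idles most of its share while the second is capped below its demand. With sharing, the same link carries every demand, which is the behaviour a shared fabric has to deliver while still providing isolation when tenants do contend.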
Performance Only Matters if End to End
Amdahl’s law applies, so an incredible, localized optimization that takes any effort to adopt will be ignored
1. Scale
2. Jitter
3. Storage disaggregation
Must optimize from the application all the way to the end user
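
The Amdahl's law point can be worked through numerically. The function below is the standard formula. The 10% and 10x figures in the usage lines are invented for illustration and do not come from the talk.

```python
def amdahl_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Overall speedup when only `fraction_improved` of the time gets faster.

    Amdahl's law: S = 1 / ((1 - p) + p / s)
    """
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical request where one network hop is 10% of end-to-end time.
# Making that hop 10x faster improves the whole request by under 10%.
print(round(amdahl_speedup(0.10, 10.0), 3))

# Even an infinitely fast hop is bounded by 1 / (1 - p).
print(round(1.0 / (1.0 - 0.10), 3))
```

This is why a localized gain, however large, has to be weighed against the adoption effort: the end-to-end improvement is capped by the fraction of time the optimized component accounts for.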
How we Prioritize
Infrastructure Work
Availability
Manageability
Velocity
Stranding
Performance
Next Decade Challenges in Networking
The next wave of computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general-purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
Thank You!
Open Source
• Google MapReduce
• Google Bigtable
• Google Borg
• Google Dremel
Open Source
• TCP BBR
• gRPC
• OpenConfig
• QUIC ...
