SlideShare a Scribd company logo
Telemetry in Large Data Center
Networks
Sriram Natarajan
Google Networking
on behalf of Google Technical Infrastructure
❯ Rich and deep history of building world-class infrastructure
❯ Common infrastructure across Google apps and Google Cloud Platform
❯ This infrastructure is available to you via GCP
Google Technical Infrastructure
2
Over the past 15+ years Google has built one
of the fastest, most capable
network infrastructures on the planet.
2002 2004 2006 2008 2010 2012 2013 2014
Colossus
Spanner
Dremel
Dataflow
Kubernetes
2003 2005 2007 2009 2011
Efficient Server Power
GFS
MapReduce
On-Board UPS
Warehouse Scale
Datacenter
Carbon Neutral
Bigtable
Containerized Datacenter
First Public
PUE Reporting
Containers Software Defined
Networking
Flume
Megastore
B4
2006
2008
2010
2012
2014
Google
Global
Cache
BwE
Jupiter
gRPC
FreeDome
Watchtower
QUIC
Andromeda
Google Innovations in Networking
2016
Google Network Infrastructure
•GGC: edge presence
•B4: WAN interconnect
•Jupiter: building scale data center network
•Freedome: campus-level interconnect
•Andromeda: virtualize the physical network
❯ Tiny round trip times
❯ Massive multi-path
❯ More plentiful and more uniform bandwidth
❯ Little buffering
❯ Latency, tail latency as important as bandwidth
❯ Homogeneity, protocol modification much easier
❯ Aggregate bandwidth of entire Internet in one building
What’s different about Data Center Networks
Five Generations of Data Center Fabrics
Year
Datacenter
Generation
Fabric
Speed
Host
Speed
Aggr
BW
2004 Four-Post CRs 10G 1G 2T
2005 Firehose 1.0 10G 1G 10T
2006 Firehose 1.1 10G 1G 10T
2008 Watchtower 10G 1G 82T
2009 Saturn 10G 10G 207T
2012 Jupiter 10/40G 10G/40G 1.3P
Aggregatetraffic
Traffic generated by servers in our data centers
Jul
‘08
Jun
‘09
May
‘10
Apr
‘11
Mar
‘12
Feb
‘13
Nov
‘14
50x
1x
Dec
‘13
8
So, why do we need Telemetry & Analytics in DC Fabrics?
9
5+
10+
20+
# of fabric
architectures in
production
kinds of switches
as part of fabrics in
production
# consumers per
production fabric
}Need Sensors, Software and Systems that help
• perform network design and modeling
• perform topology, configuration and routing verification
• perform smart analytics for root cause isolation
Data Center Fabrics are Complex
10
n stage
host stacks:
virtualization
transport
application
switch stack
1
2
3
4
State is distributed across various
elements inside and out of the fabric
Complex interaction between the states
Multiple uncoordinated writers of
state under various control loops
Large: Impossible to humanly observe
state and react to ambiguity or faults
controller software
Challenges are similar for SDN-centric or traditional protocol-based networks
11
Life of a typical Data Center Fabric
OPERATE (Apps and SLAs)
• Define application SLAs
• Measure SLAs and traffic characteristics
• Feed stats to TE, enforcers, PCR schedulers
OPERATE (Apps & SLA)
CONNECT (Routing)
• Design connectivity policies
• Create the intended configuration
• Push configuration to devices
CONNECT (routing, reachability)
BUILD (Initial Topology)
• Design the network topology
• Model the intended topology
• Populate (deploy, wire-up) DC floor
BUILD (init topology)
12
Safety, Correctness and Visibility in a DC Fabric
• Define application SLAs
• Measure SLAs and traffic characteristics
• Feed stats to TE, enforcers, PCR schedulers
OPERATE (Apps & SLA)
• Create connectivity policies & config
• Push configuration to devices
• Verify routing consistency
CONNECT (routing, reachability)
BUILD (Initial Topology)
• Design & model the network topology
• Populate (deploy, wire-up) DC floor
• Verify deployed topology against intent
BUILD (init topology)
13
Compounded by Scale
10,000+
switches per fabric**
3.5 Billion
searches per day
30,000 burst
updates within 1 min**
300 hours
of video uploaded every minute
10 Million +
routing rules per fabric**
0.25Million+
links per fabric**
SLA
& app visibility
BUILD
& topology
CONNECTIVITY
& routing
**Typical numbers seen in large data center fabrics
14
Lots of sensors and data sources
fabric data plane application plane
TCP, UDP / IP stats
virtual switch state
RPC sensors
application sensors
host enforcers
stats | counters
queue depths
switch probes
host probes
routing state
topology models
configuration models
Challenge: To use these effectively to simplify operational complexities in Data Center Fabrics
control & management plane
forwarding state
Systems to Enable Safety, Correctness & Visibility
15
Topology Verification
To continually verify that what is
deployed is what was intended
01.
Route Consistency
To verify routing state
consistency between
controllers and data plane
02.
Traffic Characteristics
**
To verify host-granular
reachability and measure traffic
characteristics
03.
SLA
& app visibility
BUILD
& topology
CONNECTIVITY
& routing
**The analytics system described here focuses primarily on the host-level reachability
and packet-loss characterization of app2app communication
01. Topology Verification at Scale
16
How do we verify that what has been
deployed and wired up matches
intended topology
in a 10,000+ node / 250,000+ links fabric
n stage
17
1
2
3
4
Generate probe traffic from Hosts
This is not switched like production traffic
Don’t rely on just destination-based routing
Source Route: Ensures targeted full coverage
Analytics App: Takes all this data generated and
localize connectivity problems
Read in the intended model of the fabric
Generate a topology to verify against
01. Topology Verification at Scale
Simultaneous Detection of Topological faults within a minute of occurrence
18
How do we verify consistency between
configured policy, routing state and
forwarding state
in a fabric with 10M+ rules and detect & isolate
loops & black holes quickly
02. Routing Consistency Verification at Scale
19
1
2
3
4
Map: Create a network slice by destination subnet by
picking rules relevant to that subnet against a shard
of the full set
Reduce: Construct a directed forwarding graph and
verify properties such as loop freedom and reachability
At every subsequent routing update, only
analyze incremental updates
Generate snapshot of the routing state by
recording route change events
SDN Controller
Route Snapshot
}
Rule Shard 1 Rule Shard n…
M M M…
R R R
M
R …
Report Subnet 1 Report Subnet m…
02. Libra: Routing Consistency Verification at Scale
Detection of Loops & Holes within 1ms of occurrence in a 10K Node Network
20
How do we measure host-level
reachability and app-level traffic
characteristics comprehensively
across all host-pairs and all traffic classes with
varying SLA & TE needs
03. App-level SLA measurements at Scale
21
1
2
3
4
Exercise src2dest and dest2src paths through
entire host-fabric-host SW stack
Structured rotation of probes across all hosts &
traffic queues for full coverage
Correlate probe loss & latency in fwd & rev directions
across a number of probe sets to localize issues
Randomly pick a subset of hosts and generate
probes to and from the hosts.
03. App-level SLA measurements at Scale
22
That IS a lot of data...
Now that your network infrastructure is richly
instrumented...how do you extract this information?
We use a RPC framework optimized for encrypted,
streaming, and multiplexed connections.
...and so can you.
http://grpc.io
23
Network equipment
● We’ve discussed a lot of external software telemetry sources...what about data from
network devices themselves?
● Today, SNMP is often the de facto network telemetry protocol. Time to upgrade!
○ legacy implementations -- designed for limited processing and bandwidth
○ expensive discoverability -- re-walk MIBs to discover new elements
○ no capability advertisement -- test OIDs to determine support
○ rigid structure -- limited extensibility to add new data
○ proprietary data -- require vendor-specific mappings and multiple requests to
reassemble data
○ protocol stagnation -- no absorption of current data modeling and transmission
techniques
24
Next generation network telemetry:
● network elements stream data to collectors (push model)
● data populated based on vendor-neutral models whenever possible
● utilize a publish/subscribe API to select desired data
● scale for next 10 years of density growth with high data freshness
○ other protocols distribute load to hardware, so should telemetry
● utilize modern transport mechanisms with active development communities
○ e.g., gRPC (HTTP/2), Thrift, protobuf over UDP
www.openconfig.net
25
Building Safe, Correct and Transparent DC Fabrics
Detect host reachability issues and SLA violations
within minutes of these events
OPERATE (Apps & SLA)
Verify routing consistency and detect black
holes, loops within a minute of a routing or
configuration change
CONNECT (routing, reachability)
Explore Topology Design Space & Synthesize
Verify deployed topology against intent within
minutes of turn-up
BUILD (init topology)
❯ Think outside the BOX
❯ Sensors and DATA ANALYTICS are key for building data center networks
❯ HUMANS are (almost) useless
Key Take-Aways
26
Data Center Networking @ Google
• An SDN centric DevOps organization
• Responsible for the architecture, design, and build of Google’s data center networks
• Responsible for development of software & systems for modeling, automation, telemetry, analytics
Network Engineers Network Systems Engineers Technical Program ManagersSREArchitects
Join Data Center Networking @ Google
Sriram Natarajan
snr@google.com
References THANK YOU
1. “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s
Datacenter Network,” SIGCOMM 2015.
2. “Network Detective: Finding network blackholes”, Israel Networking Day 2014.
3. “Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks”, NSDI 2014.
4. “B4: Experience With a Globally-Deployed Software Defined WAN,” SIGCOMM 2013.
5. “Bandwidth Enforcer: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed
Computing,” SIGCOMM 2015.

More Related Content

What's hot

Software Define Networking (SDN)
Software Define Networking (SDN)Software Define Networking (SDN)
Software Define Networking (SDN)
Pradeep Kumar TS
 
Prototype Implementation of a Demand Driven Network Monitoring Architecture
Prototype Implementation of a Demand Driven Network Monitoring ArchitecturePrototype Implementation of a Demand Driven Network Monitoring Architecture
Prototype Implementation of a Demand Driven Network Monitoring ArchitectureAugusto Ciuffoletti
 
Distributed Clouds and Software Defined Networking
Distributed Clouds and Software Defined NetworkingDistributed Clouds and Software Defined Networking
Distributed Clouds and Software Defined Networking
US-Ignite
 
20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)
20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)
20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)
Will Shen
 
SDN overview 2014
SDN overview 2014SDN overview 2014
SDN overview 2014
Dave Michels
 
Multipath Load Balancing for SDN Data Plane
Multipath Load Balancing for SDN Data Plane Multipath Load Balancing for SDN Data Plane
Multipath Load Balancing for SDN Data Plane
Sabelo Dlamini
 
NetBrain CE 5.0
NetBrain CE 5.0NetBrain CE 5.0
NetBrain CE 5.0
NetBrain Technologies
 
Networking Project(FINAL)
Networking Project(FINAL)Networking Project(FINAL)
Networking Project(FINAL)Priyojit Das
 
Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...
Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...
Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...
Tal Lavian Ph.D.
 
Near rt ric tc
Near rt ric tcNear rt ric tc
Near rt ric tc
NickHuang49
 
Network entry success rate
Network entry success rateNetwork entry success rate
Network entry success rate
Shiraz316
 
OpenDayLight Load Balanced Switching
OpenDayLight Load Balanced SwitchingOpenDayLight Load Balanced Switching
OpenDayLight Load Balanced Switching
ManasaKulkarni3
 
AndreaPetrucci_ACAT_2007
AndreaPetrucci_ACAT_2007AndreaPetrucci_ACAT_2007
AndreaPetrucci_ACAT_2007Andrea PETRUCCI
 
Light Reading BTE_SDNtoolbox_June_2015
Light Reading BTE_SDNtoolbox_June_2015Light Reading BTE_SDNtoolbox_June_2015
Light Reading BTE_SDNtoolbox_June_2015
Deborah Porchivina
 
Implementation ans analysis_of_quic_for_mqtt
Implementation ans analysis_of_quic_for_mqttImplementation ans analysis_of_quic_for_mqtt
Implementation ans analysis_of_quic_for_mqtt
Puneet Kumar
 
German Sviridov - PhD defense
German Sviridov - PhD defense German Sviridov - PhD defense
German Sviridov - PhD defense
German Sviridov
 
[White paper] Maintain-Accurate-Network-Diagrams
[White paper] Maintain-Accurate-Network-Diagrams[White paper] Maintain-Accurate-Network-Diagrams
[White paper] Maintain-Accurate-Network-Diagrams
NetBrain Technologies
 
IEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detection
IEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detectionIEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detection
IEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detection
madhucharis
 
Dynamic Classification in a Silicon-Based Forwarding Engine
Dynamic Classification in a Silicon-Based Forwarding EngineDynamic Classification in a Silicon-Based Forwarding Engine
Dynamic Classification in a Silicon-Based Forwarding Engine
Tal Lavian Ph.D.
 

What's hot (20)

Software Define Networking (SDN)
Software Define Networking (SDN)Software Define Networking (SDN)
Software Define Networking (SDN)
 
Prototype Implementation of a Demand Driven Network Monitoring Architecture
Prototype Implementation of a Demand Driven Network Monitoring ArchitecturePrototype Implementation of a Demand Driven Network Monitoring Architecture
Prototype Implementation of a Demand Driven Network Monitoring Architecture
 
Distributed Clouds and Software Defined Networking
Distributed Clouds and Software Defined NetworkingDistributed Clouds and Software Defined Networking
Distributed Clouds and Software Defined Networking
 
20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)
20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)
20180717 Introduction of Seamless BLE Connection Migration System (SeamBlue)
 
SDN overview 2014
SDN overview 2014SDN overview 2014
SDN overview 2014
 
Multipath Load Balancing for SDN Data Plane
Multipath Load Balancing for SDN Data Plane Multipath Load Balancing for SDN Data Plane
Multipath Load Balancing for SDN Data Plane
 
NetBrain CE 5.0
NetBrain CE 5.0NetBrain CE 5.0
NetBrain CE 5.0
 
Networking Project(FINAL)
Networking Project(FINAL)Networking Project(FINAL)
Networking Project(FINAL)
 
Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...
Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...
Enabling Active Flow Manipulation (AFM) in Silicon-based Network Forwarding E...
 
Near rt ric tc
Near rt ric tcNear rt ric tc
Near rt ric tc
 
Network entry success rate
Network entry success rateNetwork entry success rate
Network entry success rate
 
OpenDayLight Load Balanced Switching
OpenDayLight Load Balanced SwitchingOpenDayLight Load Balanced Switching
OpenDayLight Load Balanced Switching
 
Ithings2012 20nov
Ithings2012 20novIthings2012 20nov
Ithings2012 20nov
 
AndreaPetrucci_ACAT_2007
AndreaPetrucci_ACAT_2007AndreaPetrucci_ACAT_2007
AndreaPetrucci_ACAT_2007
 
Light Reading BTE_SDNtoolbox_June_2015
Light Reading BTE_SDNtoolbox_June_2015Light Reading BTE_SDNtoolbox_June_2015
Light Reading BTE_SDNtoolbox_June_2015
 
Implementation ans analysis_of_quic_for_mqtt
Implementation ans analysis_of_quic_for_mqttImplementation ans analysis_of_quic_for_mqtt
Implementation ans analysis_of_quic_for_mqtt
 
German Sviridov - PhD defense
German Sviridov - PhD defense German Sviridov - PhD defense
German Sviridov - PhD defense
 
[White paper] Maintain-Accurate-Network-Diagrams
[White paper] Maintain-Accurate-Network-Diagrams[White paper] Maintain-Accurate-Network-Diagrams
[White paper] Maintain-Accurate-Network-Diagrams
 
IEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detection
IEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detectionIEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detection
IEEE Connect 2020 Novel TLS signature extraction for Encrypted malware detection
 
Dynamic Classification in a Silicon-Based Forwarding Engine
Dynamic Classification in a Silicon-Based Forwarding EngineDynamic Classification in a Silicon-Based Forwarding Engine
Dynamic Classification in a Silicon-Based Forwarding Engine
 

Viewers also liked

SD - A peer to peer issue tracking system
SD - A peer to peer issue tracking systemSD - A peer to peer issue tracking system
SD - A peer to peer issue tracking systemJesse Vincent
 
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
Alex Henthorn-Iwane
 
Devtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFVDevtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFV
Alex Henthorn-Iwane
 
Swarm - A Docker Clustering System
Swarm - A Docker Clustering SystemSwarm - A Docker Clustering System
Swarm - A Docker Clustering System
snrism
 
Next-Gen DDoS Detection
Next-Gen DDoS DetectionNext-Gen DDoS Detection
Next-Gen DDoS Detection
Alex Henthorn-Iwane
 
Swarm sec
Swarm secSwarm sec
Swarm sec
snrism
 
Cloud Aware Network Management
Cloud Aware Network ManagementCloud Aware Network Management
Cloud Aware Network Management
Alex Henthorn-Iwane
 
Cloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisCloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow Analysis
Alex Henthorn-Iwane
 
Standard measurements
Standard measurementsStandard measurements
Standard measurements
netvis
 
SDN Controller - Programming Challenges
SDN Controller - Programming ChallengesSDN Controller - Programming Challenges
SDN Controller - Programming Challengessnrism
 
垂直互联网站点的技术改造
垂直互联网站点的技术改造垂直互联网站点的技术改造
垂直互联网站点的技术改造
Dahui Feng
 
Docker-OVS
Docker-OVSDocker-OVS
Docker-OVS
snrism
 
Docker networking Tutorial 101
Docker networking Tutorial 101Docker networking Tutorial 101
Docker networking Tutorial 101
LorisPack Project
 

Viewers also liked (14)

SD - A peer to peer issue tracking system
SD - A peer to peer issue tracking systemSD - A peer to peer issue tracking system
SD - A peer to peer issue tracking system
 
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
 
Devtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFVDevtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFV
 
Swarm - A Docker Clustering System
Swarm - A Docker Clustering SystemSwarm - A Docker Clustering System
Swarm - A Docker Clustering System
 
Next-Gen DDoS Detection
Next-Gen DDoS DetectionNext-Gen DDoS Detection
Next-Gen DDoS Detection
 
Swarm sec
Swarm secSwarm sec
Swarm sec
 
Cloud Aware Network Management
Cloud Aware Network ManagementCloud Aware Network Management
Cloud Aware Network Management
 
Cloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisCloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow Analysis
 
Standard measurements
Standard measurementsStandard measurements
Standard measurements
 
SDN Controller - Programming Challenges
SDN Controller - Programming ChallengesSDN Controller - Programming Challenges
SDN Controller - Programming Challenges
 
垂直互联网站点的技术改造
垂直互联网站点的技术改造垂直互联网站点的技术改造
垂直互联网站点的技术改造
 
Docker-OVS
Docker-OVSDocker-OVS
Docker-OVS
 
Docker networking Tutorial 101
Docker networking Tutorial 101Docker networking Tutorial 101
Docker networking Tutorial 101
 
Edge architecture ieee international conference on cloud engineering
Edge architecture   ieee international conference on cloud engineeringEdge architecture   ieee international conference on cloud engineering
Edge architecture ieee international conference on cloud engineering
 

Similar to 5G-USA-Telemetry

Introduction to Software Defined Networking (SDN) presentation by Warren Finc...
Introduction to Software Defined Networking (SDN) presentation by Warren Finc...Introduction to Software Defined Networking (SDN) presentation by Warren Finc...
Introduction to Software Defined Networking (SDN) presentation by Warren Finc...
APNIC
 
Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)
Bangladesh Network Operators Group
 
Edge Computing Platforms and Protocols - Ph.D. thesis
Edge Computing Platforms and Protocols - Ph.D. thesisEdge Computing Platforms and Protocols - Ph.D. thesis
Edge Computing Platforms and Protocols - Ph.D. thesis
Nitinder Mohan
 
Software Defined Networks
Software Defined NetworksSoftware Defined Networks
Software Defined Networks
Shreeya Shah
 
Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...
Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...
Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...
Radisys Corporation
 
Why sdn
Why sdnWhy sdn
Why sdn
lz1dsb
 
Foundation of Modern Network- william stalling
Foundation of Modern Network- william stallingFoundation of Modern Network- william stalling
Foundation of Modern Network- william stalling
JonathanWallace46
 
Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...
Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...
Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...
Nikolaos Georgantas
 
Zou Layered VO PDCAT2008 V0.5 Concise
Zou Layered VO PDCAT2008 V0.5 ConciseZou Layered VO PDCAT2008 V0.5 Concise
Zou Layered VO PDCAT2008 V0.5 Concise
yongqiangzou
 
CloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX Boxes
CloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX BoxesCloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX Boxes
CloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX Boxes
GIST (Gwangju Institute of Science and Technology)
 
Cloud computing and Software defined networking
Cloud computing and Software defined networkingCloud computing and Software defined networking
Cloud computing and Software defined networking
saigandham1
 
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKINGENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
IJCNCJournal
 
netconf, restconf, grpc_basic
netconf, restconf, grpc_basicnetconf, restconf, grpc_basic
netconf, restconf, grpc_basic
Gyewan An
 
4th SDN Interest Group Seminar-Session 2-3(130313)
4th SDN Interest Group Seminar-Session 2-3(130313)4th SDN Interest Group Seminar-Session 2-3(130313)
4th SDN Interest Group Seminar-Session 2-3(130313)
NAIM Networks, Inc.
 
Software Defined Networking(SDN) and practical implementation_trupti
Software Defined Networking(SDN) and practical implementation_truptiSoftware Defined Networking(SDN) and practical implementation_trupti
Software Defined Networking(SDN) and practical implementation_trupti
trups7778
 
btNOG 9 presentation Introduction to Software Defined Networking
btNOG 9 presentation Introduction to Software Defined NetworkingbtNOG 9 presentation Introduction to Software Defined Networking
btNOG 9 presentation Introduction to Software Defined Networking
APNIC
 
Cloud data management
Cloud data managementCloud data management
Cloud data managementambitlick
 
Network Telemetry
Network TelemetryNetwork Telemetry
Network Telemetry
Aalok Shah
 

Similar to 5G-USA-Telemetry (20)

Introduction to Software Defined Networking (SDN) presentation by Warren Finc...
Introduction to Software Defined Networking (SDN) presentation by Warren Finc...Introduction to Software Defined Networking (SDN) presentation by Warren Finc...
Introduction to Software Defined Networking (SDN) presentation by Warren Finc...
 
Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)
 
Final_Report
Final_ReportFinal_Report
Final_Report
 
Edge Computing Platforms and Protocols - Ph.D. thesis
Edge Computing Platforms and Protocols - Ph.D. thesisEdge Computing Platforms and Protocols - Ph.D. thesis
Edge Computing Platforms and Protocols - Ph.D. thesis
 
Software Defined Networks
Software Defined NetworksSoftware Defined Networks
Software Defined Networks
 
Cloud Migration
Cloud MigrationCloud Migration
Cloud Migration
 
Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...
Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...
Radisys/Wind River: The Telcom Cloud - Deployment Strategies: SDN/NFV and Vir...
 
Why sdn
Why sdnWhy sdn
Why sdn
 
Foundation of Modern Network- william stalling
Foundation of Modern Network- william stallingFoundation of Modern Network- william stalling
Foundation of Modern Network- william stalling
 
Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...
Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...
Mobile IoT Middleware Interoperability & QoS Analysis - Eclipse IoT Day Paris...
 
Zou Layered VO PDCAT2008 V0.5 Concise
Zou Layered VO PDCAT2008 V0.5 ConciseZou Layered VO PDCAT2008 V0.5 Concise
Zou Layered VO PDCAT2008 V0.5 Concise
 
CloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX Boxes
CloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX BoxesCloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX Boxes
CloudComp 2015 - SDN-Cloud Testbed with Hyper-convergent SmartX Boxes
 
Cloud computing and Software defined networking
Cloud computing and Software defined networkingCloud computing and Software defined networking
Cloud computing and Software defined networking
 
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKINGENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
 
netconf, restconf, grpc_basic
netconf, restconf, grpc_basicnetconf, restconf, grpc_basic
netconf, restconf, grpc_basic
 
4th SDN Interest Group Seminar-Session 2-3(130313)
4th SDN Interest Group Seminar-Session 2-3(130313)4th SDN Interest Group Seminar-Session 2-3(130313)
4th SDN Interest Group Seminar-Session 2-3(130313)
 
Software Defined Networking(SDN) and practical implementation_trupti
Software Defined Networking(SDN) and practical implementation_truptiSoftware Defined Networking(SDN) and practical implementation_trupti
Software Defined Networking(SDN) and practical implementation_trupti
 
btNOG 9 presentation Introduction to Software Defined Networking
btNOG 9 presentation Introduction to Software Defined NetworkingbtNOG 9 presentation Introduction to Software Defined Networking
btNOG 9 presentation Introduction to Software Defined Networking
 
Cloud data management
Cloud data managementCloud data management
Cloud data management
 
Network Telemetry
Network TelemetryNetwork Telemetry
Network Telemetry
 

5G-USA-Telemetry

  • 1. Telemetry in Large Data Center Networks Sriram Natarajan Google Networking on behalf of Google Technical Infrastructure
  • 2. ❯ Rich and deep history of building world-class infrastructure ❯ Common infrastructure across Google apps and Google Cloud Platform ❯ This infrastructure is available to you via GCP Google Technical Infrastructure 2
  • 3. Over the past 15+ years Google has built one of the fastest, most capable network infrastructures on the planet.
  • 4. 2002 2004 2006 2008 2010 2012 2013 2014 Colossus Spanner Dremel Dataflow Kubernetes 2003 2005 2007 2009 2011 Efficient Server Power GFS MapReduce On-Board UPS Warehouse Scale Datacenter Carbon Neutral Bigtable Containerized Datacenter First Public PUE Reporting Containers Software Defined Networking Flume Megastore
  • 6. Google Network Infrastructure •GGC: edge presence •B4: WAN interconnect •Jupiter: building scale data center network •Freedome: campus-level interconnect •Andromeda: virtualize the physical network
  • 7. ❯ Tiny round trip times ❯ Massive multi-path ❯ More plentiful and more uniform bandwidth ❯ Little buffering ❯ Latency, tail latency as important as bandwidth ❯ Homogeneity, protocol modification much easier ❯ Aggregate bandwidth of entire Internet in one building What’s different about Data Center Networks
  • 8. Five Generations of Data Center Fabrics Year Datacenter Generation Fabric Speed Host Speed Aggr BW 2004 Four-Post CRs 10G 1G 2T 2005 Firehose 1.0 10G 1G 10T 2006 Firehose 1.1 10G 1G 10T 2008 Watchtower 10G 1G 82T 2009 Saturn 10G 10G 207T 2012 Jupiter 10/40G 10G/40G 1.3P Aggregatetraffic Traffic generated by servers in our data centers Jul ‘08 Jun ‘09 May ‘10 Apr ‘11 Mar ‘12 Feb ‘13 Nov ‘14 50x 1x Dec ‘13 8
  • 9. So, why do we need Telemetry & Analytics in DC Fabrics? 9 5+ 10+ 20+ # of fabric architectures in production kinds of switches as part of fabrics in production # consumers per production fabric }Need Sensors, Software and Systems that help • perform network design and modeling • perform topology, configuration and routing verification • perform smart analytics for root cause isolation
  • 10. Data Center Fabrics are Complex 10 n stage host stacks: virtualization transport application switch stack 1 2 3 4 State is distributed across various elements inside and out of the fabric Complex interaction between the states Multiple uncoordinated writers of state under various control loops Large: Impossible to humanly observe state and react to ambiguity or faults controller software Challenges are similar for SDN-centric or traditional protocol-based networks
  • 11. 11 Life of a typical Data Center Fabric OPERATE (Apps and SLAs) • Define application SLAs • Measure SLAs and traffic characteristics • Feed stats to TE, enforcers, PCR schedulers OPERATE (Apps & SLA) CONNECT (Routing) • Design connectivity policies • Create the intended configuration • Push configuration to devices CONNECT (routing, reachability) BUILD (Initial Topology) • Design the network topology • Model the intended topology • Populate (deploy, wire-up) DC floor BUILD (init topology)
  • 12. 12 Safety, Correctness and Visibility in a DC Fabric • Define application SLAs • Measure SLAs and traffic characteristics • Feed stats to TE, enforcers, PCR schedulers OPERATE (Apps & SLA) • Create connectivity policies & config • Push configuration to devices • Verify routing consistency CONNECT (routing, reachability) BUILD (Initial Topology) • Design & model the network topology • Populate (deploy, wire-up) DC floor • Verify deployed topology against intent BUILD (init topology)
  • 13. 13 Compounded by Scale 10,000+ switches per fabric** 3.5 Billion searches per day 30,000 burst updates within 1 min** 300 hours of video uploaded every minute 10 Million + routing rules per fabric** 0.25Million+ links per fabric** SLA & app visibility BUILD & topology CONNECTIVITY & routing **Typical numbers seen in large data center fabrics
  • 14. 14 Lots of sensors and data sources fabric data plane application plane TCP, UDP / IP stats virtual switch state RPC sensors application sensors host enforcers stats | counters queue depths switch probes host probes routing state topology models configuration models Challenge: To use these effectively to simplify operational complexities in Data Center Fabrics control & management plane forwarding state
  • 15. Systems to Enable Safety, Correctness & Visibility 15 Topology Verification To continually verify that what is deployed is what was intended 01. Route Consistency To verify routing state consistency between controllers and data plane 02. Traffic Characteristics ** To verify host-granular reachability and measure traffic characteristics 03. SLA & app visibility BUILD & topology CONNECTIVITY & routing **The analytics system described here focuses primarily on the host-level reachability and packet-loss characterization of app2app communication
  • 16. 01. Topology Verification at Scale 16 How do we verify that what has been deployed and wired up matches intended topology in a 10,000+ node / 250,000+ links fabric
  • 17. n stage 17 1 2 3 4 Generate probe traffic from Hosts This is not switched like production traffic Don’t rely on just destination-based routing Source Route: Ensures targeted full coverage Analytics App: Takes all this data generated and localize connectivity problems Read in the intended model of the fabric Generate a topology to verify against 01. Topology Verification at Scale Simultaneous Detection of Topological faults within a minute of occurrence
  • 18. 18 How do we verify consistency between configured policy, routing state and forwarding state in a fabric with 10M+ rules and detect & isolate loops & black holes quickly 02. Routing Consistency Verification at Scale
  • 19. 19 1 2 3 4 Map: Create a network slice by destination subnet by picking rules relevant to that subnet against a shard of the full set Reduce: Construct a directed forwarding graph and verify properties such as loop freedom and reachability At every subsequent routing update, only analyze incremental updates Generate snapshot of the routing state by recording route change events SDN Controller Route Snapshot } Rule Shard 1 Rule Shard n… M M M… R R R M R … Report Subnet 1 Report Subnet m… 02. Libra: Routing Consistency Verification at Scale Detection of Loops & Holes within 1ms of occurrence in a 10K Node Network
  • 20. 20 How do we measure host-level reachability and app-level traffic characteristics comprehensively across all host-pairs and all traffic classes with varying SLA & TE needs 03. App-level SLA measurements at Scale
  • 21. 21 1 2 3 4 Exercise src2dest and dest2src paths through entire host-fabric-host SW stack Structured rotation of probes across all hosts & traffic queues for full coverage Correlate probe loss & latency in fwd & rev directions across a number of probe sets to localize issues Randomly pick a subset of hosts and generate probes to and from the hosts. 03. App-level SLA measurements at Scale
  • 22. 22 That IS a lot of data... Now that your network infrastructure is richly instrumented...how do you extract this information? We use a RPC framework optimized for encrypted, streaming, and multiplexed connections. ...and so can you. http://grpc.io
  • 23. 23 Network equipment ● We’ve discussed a lot of external software telemetry sources...what about data from network devices themselves? ● Today, SNMP is often the de facto network telemetry protocol. Time to upgrade! ○ legacy implementations -- designed for limited processing and bandwidth ○ expensive discoverability -- re-walk MIBs to discover new elements ○ no capability advertisement -- test OIDs to determine support ○ rigid structure -- limited extensibility to add new data ○ proprietary data -- require vendor-specific mappings and multiple requests to reassemble data ○ protocol stagnation -- no absorption of current data modeling and transmission techniques
  • 24. 24 Next generation network telemetry: ● network elements stream data to collectors (push model) ● data populated based on vendor-neutral models whenever possible ● utilize a publish/subscribe API to select desired data ● scale for next 10 years of density growth with high data freshness ○ other protocols distribute load to hardware, so should telemetry ● utilize modern transport mechanisms with active development communities ○ e.g., gRPC (HTTP/2), Thrift, protobuf over UDP www.openconfig.net
  • 25. 25 Building Safe, Correct and Transparent DC Fabrics Detect host reachability issues and SLA violations within minutes of these events OPERATE (Apps & SLA) Verify routing consistency and detect black holes, loops within a minute of a routing or configuration change CONNECT (routing, reachability) Explore Topology Design Space & Synthesize Verify deployed topology against intent within minutes of turn-up BUILD (init topology)
  • 26. ❯ Think outside the BOX ❯ Sensors and DATA ANALYTICS are key for building data center networks ❯ HUMANS are (almost) useless Key Take-Aways 26
  • 27. Data Center Networking @ Google • An SDN centric DevOps organization • Responsible for the architecture, design, and build of Google’s data center networks • Responsible for development of software & systems for modeling, automation, telemetry, analytics Network Engineers Network Systems Engineers Technical Program ManagersSREArchitects Join Data Center Networking @ Google
  • 28. Sriram Natarajan snr@google.com References THANK YOU 1. “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network,” SIGCOMM 2015. 2. “Network Detective: Finding network blackholes”, Israel Networking Day 2014. 3. “Libra: Divide and Conquer to Verify Forwarding Tables in Huge Networks”, NSDI 2014. 4. “B4: Experience With a Globally-Deployed Software Defined WAN,” SIGCOMM 2013. 5. “Bandwidth Enforcer: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing,” SIGCOMM 2015.