The Age of Data-Driven Network Operations

Data-Driven Network Operations
APRICOT 2017
Avi Freedman
avi at kentik.com

Summary
It’s hard to run infrastructure!
… but there’s hope
What’s needed: Data-Driven Network Operations
How-to: Get the data (what data)?
How-to: Fuse and store the data
Use cases: Network Nerds: Planning, peering, DDoS, performance
Use cases: Business nerds: Cost, revenue, security posture
Take-aways: Next steps

But all the views are disparate!
How many operational tools+scripts do you run?
• Active Testing (ping/traceroute)
• APM
• BI
• BGP Hijack detection
• Config Management
• Event Correlation
• Forensics
• Flow Tools
• Logging
• Metric (App/SNMP)
• NPM
• Policy Analysis
• Routing Analytics
• Traffic Engineering
• Threat Intelligence
And how many instances of each?

With all those tools, can you:
See if it’s the network, or the application, and where?
See the whole network – customer, peering, WAN, LAN, DC?
Answer ops questions around planning, peering, security, and performance?
Let other tech groups understand the network’s impact (or not)?
Give biz folks the answers to business questions around revenue, cost, and risk?
Automatically detect traffic anomalies, attacks, and shifts?

The network is key to delivering revenue
Applications generate traffic…
But networks deliver it!
So why is the network view left out of cross-enterprise visibility stacks?
1) A bit of a chicken-and-the-egg problem.
2) It’s hard to get some network concepts without being hands-on.
3) General lack of vendor innovation from 2003-2013.
4) Immaturity of backend and distributed systems methods.

Why the limitations like…
• Not scaling to handle large amounts of data (space/IO/CPU limited).
• Not understanding network concepts (classic BI tools).
• Limited scope (data source, functionality, target users).
• Typically storing only aggregates or pre-filtered data.
• Very limited fusing, only 1 or 2 data types per tool.
• Limited dimensions and filtering depth, often slow.
• Needing lots of tuning and configuration.

Most tools are pre-big data architectures!
But with a more modern approach, it’s possible to get:
• Large scale, billions of records per hour, no aggregation, complex fusing.
• Distributed micro-service architecture. Can scale very “wide” by adding hardware.
• Speed with granularity. Real-time ingest and < 5s queries.
• Big challenge is adding fusion, network savvy-ness and speed of back-ends.
• Modern data bus (Kafka) and streaming analytics options have gone from 0
to great in the last 5 years.
• Whether for DIY, or from a new wave of vendor options.

And people are getting that the network is important!
In net-centric enterprises like service providers and web companies,
there is an understanding that the network is not just plumbing, but
can actually see the business.
And with API partners, customers, and cloud back-ends often on remote
networks, understanding the network is increasingly important.
Tools aren’t there yet but we’re seeing support for efforts to innovate
internally and by supporting a new wave of vendors!

And yes - DIY is hard…
Required areas of expertise
(because every presentation needs a Venn diagram)
[Venn diagram of overlapping roles, with “Unicorn” at the center]
Roles:
• Distributed systems engineers
• Network engineers
• SREs
• Low-level network developers
Tasks:
• Resilience / reliability
• Geo-distributed ingest
• Flow-friendly data-store
• BGP daemon
• Flow inspection & conversion
• Network protocols hacking
• Make all of the above work reliably
• Train all the other teams on the involved network protocols and their usage

But systems like this are no longer (as) exotic.
[Diagram: data sources → Ingest & Fusion layer → Storage layer → Query layer (query engine and UI) → query interfaces (SQL, WWW, REST) → clients]
Each layer has separate and different scaling characteristics.
Example SQL client query: SELECT flow FROM router WHERE …

And the OSS and SaaS options are growing!
So yes, it’s hard to find instantly qualified tooling folks, but…
Distributed systems, devops, and big data systems and skills are
becoming more widespread – and at a faster rate than ever before.
And recent vendors have built multi-tenant, open SaaS and
on-prem big data options from scratch.

What’s Needed:
Data Driven
Network Operations

What is Data-Driven Network Operations?
Getting network traffic intelligence (netops + network-savvy BI) by…
Using data to drive your technical and business operations!
Most content companies and enterprises are data and analytics driven.
Devops is as well (logs, APM, metrics-at-scale).
But the network world has some catchup to do.
We can have nice things too!
And share with our tech peers (systems/apps) and the business side?

Data-driven operations + business use cases
• Network Planning
• Peering Analytics and Abuse
• Congestion detection
• Is it the network?
• Where on the network?
• Proactive alerting
• Distributed DDoS Detection
• What Changed Post Deploy?
• Security and Breach Detection
• Cost Analytics
• Revenue Identification (New + @ Risk)
• Enabling Internal Groups

Key network operator requirements
Key requirements for modern Data-Driven Network Operations:
• No data aggregation or pre-filtering.
• Correlation (fusing) between data types.
• Full resolution searchable and stored for months.
• FAST. Less than 10s for results. Cannot wait minutes while spelunking.
• Network-savvy UIs and APIs (understanding routing and prefixes).
• Detect anomalies. Should not have to watch graphs manually.
• Data and alerts available across the company.
• “0” to usable in minutes to weeks, not months to years.

How To:
Get the data.
Fuse the data.
Store the data.
Use the data.
Share the data.

How To:
Get the Data
(What Data?)

Where to find this data?
[Diagram: data sources around the NETWORK, added together (+ … =) to become “Combinatorially Useful!”]
• Flow data: NetFlow, sFlow, IPFIX
• SNMP, streaming telemetry
• Sys/event logs: TACACS & Syslog
• BGP, IGP path info
• TCP stats data / app-specific data
• App Server, Logs, Metrics
• Collection points shown: routers, PCAP agent
• + User tags, Threat Intel, SDN Control, DNS, ping/trace

A broader view of “NetFlow”
You can ALSO get performance data from the infrastructure:
• Queue Depth
• Retransmits per flow
• TCP latency
• Application Latency
From:
• Host software (nProbe)
• Sensors / Taps
• Webserver logs (Nginx)
• Cisco AVC supported routers

Fusing data for richer traffic analytics
Flow or BGP or SNMP or DNS or logs alone are not enough.
This becomes much richer when combined with:
• Performance and layer 7 information
• BGP attributes
• Geography
• Tags (rack, department, customer…)
• Config changes and software versions
• Threat intelligence and known-bad IPs
Fusing should be near real-time, performed at ingest, and data-specific.
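To make the idea of fusing at ingest concrete, here is a minimal sketch, assuming small in-memory lookup tables. The table names, field names, prefixes, and ASNs are all hypothetical; a production system would feed these tables from a BGP daemon, a GeoIP database, and operator-maintained tags, and would likely use a radix trie rather than a linear scan.

```python
import ipaddress

# Hypothetical enrichment tables (documentation prefixes, private-use ASNs).
ASN_BY_PREFIX = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}
GEO_BY_PREFIX = {
    ipaddress.ip_network("192.0.2.0/24"): "SG",
    ipaddress.ip_network("198.51.100.0/24"): "VN",
}
TAGS_BY_PREFIX = {
    ipaddress.ip_network("198.51.100.0/25"): {"customer": "example-co"},
}

def longest_match(table, addr):
    """Return the value for the most specific prefix containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in table if ip in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]

def fuse(flow):
    """Return a copy of the flow record with ASN, geo, and tag columns added."""
    row = dict(flow)
    for side in ("src", "dst"):
        addr = flow[f"{side}_ip"]
        row[f"{side}_asn"] = longest_match(ASN_BY_PREFIX, addr)
        row[f"{side}_geo"] = longest_match(GEO_BY_PREFIX, addr)
        row[f"{side}_tags"] = longest_match(TAGS_BY_PREFIX, addr) or {}
    return row

# One raw flow record in, one fused row out, ready to send to storage.
flow = {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.20", "bytes": 1500}
fused = fuse(flow)
```

Enriching each record once, as it arrives, means later queries can filter and group on ASN, geography, or tags without re-joining against tables that may have changed since the traffic was seen.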

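DATA FUSION
[Diagram: inputs pass through decoder modules, are enriched from in-memory tables, and each single fused flow row is sent to storage in a traffic-savvy datastore]
• Router inputs: NetFlow v5, NetFlow v9, IPFIX, sFlow, BGP RIB, SNMP
• Other inputs: PCAP agent, PCAP proxy, custom tags
• Components: decoder modules, mem tables, SNMP poller, BGP daemons, enrichment DB
• Enrichment: Geo ←→ IP, ASN ←→ IP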

Store the data: Yes, back-ends are still tricky.
You can’t keep enough granularity on a VM- or relational-based backend.
FOSS big data has limited network-savviness and no support for query
rate-limiting, which is key to multi-tenancy, and query fragment caching,
which is key for efficiency.
You can do a lot with ELK, but it’s not very efficient or super network-savvy –
and has basically no multi-tenanted security.
Column store systems are still rapidly developing.

Use Cases:
For Network Nerds…

Use case: Traffic debugging and inspection
Why did the interface just double its traffic, now saturated?
Is it an attack? No, it’s a mis-config! No, it’s an attack…
Where is the traffic leaving my network?
Is a peer sending me traffic they shouldn’t be? Are my peers balanced with
me?
Did a content provider shift their traffic path to me?
Are other networks seeing what I’m seeing?

Fusing data for richer traffic analytics
Data in a “lake” is not useful!
A modern data-driven network operations system should have:
• A flexible and spelunk-able UI
• Proactive alerting with links to detailed history and trends
• Dashboards that instantly link to detailed history and exploration
• Complete API availability for bi-directional integration

Use Case: Traffic debugging and inspection

Traffic annotated with multiple events

Anomaly detection: DDoS detection and characteristics

Use case: Anomaly detection for peering
Traffic from an individual top-20 ASN over transit is unusually high. The
operator is notified at the red line.
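One simple way to draw such a red line is a per-ASN baseline plus a few standard deviations. The sketch below assumes that approach; the slide does not say how the threshold is computed, and the function names, the multiplier k, and the sample traffic values are all hypothetical.

```python
from statistics import mean, pstdev

def red_line(history, k=3.0):
    """Alert threshold: baseline mean plus k standard deviations."""
    return mean(history) + k * pstdev(history)

def asns_over_red_line(history_by_asn, current_by_asn, k=3.0):
    """Return the ASNs whose current transit traffic exceeds their red line."""
    return sorted(
        asn
        for asn, current in current_by_asn.items()
        if asn in history_by_asn and current > red_line(history_by_asn[asn], k)
    )

# Hypothetical Mbps samples over transit for two private-use ASNs.
history_mbps = {
    64500: [100, 110, 90, 105, 95],
    64501: [400, 420, 380, 410, 390],
}
current_mbps = {64500: 130, 64501: 405}

alerts = asns_over_red_line(history_mbps, current_mbps)
```

A real detector would also account for time-of-day and day-of-week patterns; a flat mean is only a starting point.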

Traffic, anomaly detection and annotation

Use case: Network planning
Flow-based traffic + BGP can be used to help show:
• Path, neighbor, transit, origin, and country of traffic.
• Strategic peering and transit changes that can improve perf and costs.
• Potential new peers and locations to peer.
• The potential of new peering exchanges or facilities.
• Transit relationships that are of high or little value.
• ROI before extending backbone links or capacity.
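As a rough illustration of combining flow with BGP for planning, the sketch below sums traffic by the ASN at a given hop of each flow's AS path. It assumes flow records that already carry an AS path from fusion; the record layout and field names are hypothetical.

```python
from collections import Counter

def bytes_by_hop(flows, hop):
    """Sum bytes per ASN at position `hop` (1 = direct neighbor) of each AS path."""
    totals = Counter()
    for f in flows:
        path = f["as_path"]
        if len(path) >= hop:
            totals[path[hop - 1]] += f["bytes"]
    return totals

# Hypothetical fused flow records with private-use ASNs.
flows = [
    {"as_path": [64500, 64510, 64520], "bytes": 1000},
    {"as_path": [64500, 64511], "bytes": 500},
    {"as_path": [64501, 64510, 64520], "bytes": 2000},
]

neighbors = bytes_by_hop(flows, 1)
second_hop = bytes_by_hop(flows, 2)
```

A large second-hop ASN reached through more than one neighbor, like 64510 here, is the kind of candidate one might evaluate for direct peering.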

Use Case: Network planning, traffic by BGP HOP

Use case: Network security analytics
Flow-based traffic +/- threat intelligence can show:
• Compromised servers, desktops, and IoT devices.
• From threat intel and anomaly detection.
• To help your own, or downstream/internal customers.
• And feed DDoS response if there are local sources/sinks.
• And find BCP38 violations on-net or on peers, even with
simple “how many source /8s” heuristics.
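The “how many source /8s” heuristic can be sketched as follows: count the distinct IPv4 source /8s seen on each input interface and flag interfaces that exceed a threshold. The field names, interface names, and threshold are hypothetical, and what counts as "too many" depends on what sits behind each port.

```python
import ipaddress
from collections import defaultdict

def source_slash8s(flows):
    """Map each input interface to the set of distinct IPv4 source /8s seen on it."""
    seen = defaultdict(set)
    for f in flows:
        ip = ipaddress.ip_address(f["src_ip"])
        if ip.version == 4:
            seen[f["in_if"]].add(int(ip) >> 24)  # first octet identifies the /8
    return seen

def suspicious_interfaces(flows, max_slash8s):
    """Return interfaces sourcing traffic from more than max_slash8s distinct /8s."""
    return sorted(
        iface
        for iface, slash8s in source_slash8s(flows).items()
        if len(slash8s) > max_slash8s
    )

# Hypothetical flows: cust-1 sources from four /8s, cust-2 from one.
flows = [
    {"in_if": "cust-1", "src_ip": "192.0.2.1"},
    {"in_if": "cust-1", "src_ip": "198.51.100.7"},
    {"in_if": "cust-1", "src_ip": "203.0.113.9"},
    {"in_if": "cust-1", "src_ip": "10.1.1.1"},
    {"in_if": "cust-2", "src_ip": "192.0.2.1"},
    {"in_if": "cust-2", "src_ip": "192.0.2.2"},
]

flagged = suspicious_interfaces(flows, max_slash8s=2)
```

A customer port that should only originate a handful of prefixes but shows sources spread across many /8s is worth a closer look for spoofed traffic.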

Use case: Network performance analytics
Flow-based traffic + BGP + network performance data can show:
• Whether issues are in the application or network layer, and where
• And in a way expose-able to internal dev + app operations
• And to pinpoint performance issues by peer or remote AS path
• Or prefix
• Or data center
• Or provably not in the network :)
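Pinpointing performance by ASN, prefix, or data center comes down to grouping a per-flow latency measurement by a fused dimension. A minimal sketch, assuming each flow record carries a TCP latency value; the field names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_latency_by(flows, key):
    """Group flows by the given field and return the median TCP latency per group."""
    groups = defaultdict(list)
    for f in flows:
        groups[f[key]].append(f["tcp_latency_ms"])
    return {k: median(v) for k, v in groups.items()}

# Hypothetical performance-enhanced flow records.
flows = [
    {"dst_asn": 64500, "dst_prefix": "192.0.2.0/24", "tcp_latency_ms": 20},
    {"dst_asn": 64500, "dst_prefix": "192.0.2.0/24", "tcp_latency_ms": 30},
    {"dst_asn": 64500, "dst_prefix": "192.0.2.0/24", "tcp_latency_ms": 40},
    {"dst_asn": 64501, "dst_prefix": "198.51.100.0/24", "tcp_latency_ms": 200},
    {"dst_asn": 64501, "dst_prefix": "198.51.100.0/24", "tcp_latency_ms": 220},
]

by_asn = median_latency_by(flows, "dst_asn")
by_prefix = median_latency_by(flows, "dst_prefix")
```

The same function works for any dimension added at fusion time, which is why the grouping key is a parameter rather than hard-coded.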

Perf-enhanced flow: TCP latency / ASN

Perf-enhanced flow: TCP latency / Prefix

What For?
For Business Nerds…

Use case: Customer cost analysis
Whether an enterprise or SP…
Customers (external or internal) cost money.
Most critical for SPs (customer packet-mile cost)…
But also important for many enterprises:
How much of our international backbone does this service use?

Use case: Security posture and risk
The same analytics that can power operational knowledge and fixes can inform
the business about the risk and support security groups as they assess and fix
not just the production but also the corporate networks.

Use case: Revenue enhancement and retention
Sometimes not discussed in polite company, but great traffic-based analytics
can help with the top line as well as margin:
• Offering high-margin customers lower rates to attract more traffic.
• Identifying large 2nd and 3rd hop AS sinks or sources behind peers,
to convert to customers.
• Find large customers that look small (on-boarding/testing) or large
customers starting to migrate to competitors.
• Create service revenue around security, DDoS, and performance.

Summary
• It’s not just “flow tools” any more (or just BGP or just SNMP or …).
• Networks can produce a lot of (very diverse) data.
• You can capture and use it with modern streaming, fusing, and storage
components.
• Enterprises are looking for and funding turnkey but open tools that
integrate and provide cross-group access.
• Vendors are starting to innovate.
• And DIY options are getting better (but still require a lot of training).

Take-Aways
• You deserve, and can have, nice things!
• And so can your tech and business peers.
• And answer business as well as tech ops questions.
• It’s possible (with some work) to see not just traffic flow but performance.
• There is a new wave of SaaS and big data-based vendors that integrate.
• And if you intend to DIY, start cross-training and hiring now.

Questions?
(Happy to answer offline as well...)
Avi Freedman
email: avi at kentik.com
