Data-Driven	Network	Operations
APRICOT	2017
Avi	Freedman
avi at	kentik.com
Summary
It’s	hard	to	run	infrastructure!
… but	there’s	hope
What’s	needed:	Data-Driven	Network	Operations
How-to:	Get	the	data	(what	data)?
How-to:	Fuse	and	store	the	data
Use cases: Network Nerds: Planning, peering, DDoS, performance
Use cases: Business Nerds: Cost, revenue, security posture
Take-aways:	Next	steps
SNAFU:
Life’s	hard
Tools, tools, everywhere…
But	all	the	views	are	disparate!
How	many	operational	tools+scripts do	you	run?
• Active Testing (ping/traceroute)
• APM
• BI
• BGP Hijack detection
• Config Management
• Event Correlation
• Forensics
• Flow Tools
• Logging
• Metric (App/SNMP)
• NPM
• Policy Analysis
• Routing Analytics
• Traffic Engineering
• Threat Intelligence
And	how	many	instances	of	each?
With	all	those	tools,	can	you:
See	if	it’s	the	network,	or	the	application,	and	where?
See	the	whole	network	– customer,	peering,	WAN,	LAN,	DC?
Answer	ops	questions	around	planning,	peering,	security,	and	performance?
Let	other	tech	groups	understand	the	network’s	impact	(or	not)?
Give	biz	folks	the	answers	to	business	questions	around	revenue,	cost,	and	risk?
Automatically	detect	traffic	anomalies,	attacks,	and	shifts?
But	There’s	Hope…
The	network	is	key	to	delivering	revenue
Applications	generate	traffic…
But	networks	deliver	it!
So	why	is	the	network	view	left	out	of	cross-enterprise	visibility	stacks?
1) A	bit	of	a	chicken-and-the-egg	problem.
2) It’s	hard	to	get	some	network	concepts	without	being	hands-on.
3) General	lack	of	vendor	innovation	from	2003-2013.
4) Immaturity	of	backend	and	distributed	systems	methods.
Why	the	limitations	like…
• Not	scaling	to	handle	large	amounts	of	data	(space/IO/CPU	limited).
• Not	understanding	network	concepts	(classic	BI	tools).
• Limited	scope	(data	source,	functionality,	target	users).
• Typically	storing	only	aggregates	or	pre-filtered	data.
• Very	limited	fusing,	only	1	or	2	data	types	per	tool.
• Limited	dimensions	and	filtering	depth,	often	slow.
• Needing	lots	of	tuning	and	configuration.
Most	tools	are	pre-big	data	architectures!
But	with	a	more	modern	approach,	it’s	possible	to	get:
• Large	scale,	billions	of	records	per	hour,	no	aggregation,	complex	fusing.
• Distributed	micro-service	architecture.	Can	scale	very	“wide”.	++Hardware.
• Speed	with granularity.	Real-time	ingest	and	<	5s	queries.
• Big	challenge	is	adding	fusion,	network	savvy-ness	and	speed	of	back-ends.
• Modern	data	bus	(Kafka)	and	streaming	analytics	options	have	gone	from	0	
to	great	in	the	last	5	years.
• Whether	for	DIY,	or	from	a	new	wave	of	vendor	options.
And people are getting that the network’s important!
In	net-centric	enterprises	like	service	providers	and	web	companies,
there	is	an	understanding	that	the	network	is	not	just	plumbing,	but
can	actually	see the	business.
And	with	API	partners,	customers,	and	cloud	back-ends	often	on	remote	
networks,	understanding	the	network	is	increasingly	important.
Tools aren’t there yet, but we’re seeing support for efforts to innovate,
both internally and by backing a new wave of vendors!
And	yes	- DIY	is	hard…
Required areas of expertise
(because every presentation needs a Venn diagram)
[Venn diagram: four overlapping roles, with "Unicorn" at the intersection]
Roles:
• Distributed systems engineers
• Network engineers
• SREs
• Low-level network developers
Work shown in the diagram:
• Resilience / reliability
• Geo-distributed ingest
• Flow-friendly data store
• BGP daemon
• Flow inspection & conversion
• Network protocols hacking
• Make all of the above work reliably
• Train all the other teams on the involved network protocols and their usage
But systems like this are no longer (as) exotic.
[Architecture diagram: data sources → Ingest & fusion layer → Storage layer → Query layer (query engine and UI) → Query interfaces (SQL, WWW, REST) → clients]
Each layer has separate and different scaling characteristics
Example query shown: SELECT flow FROM router WHERE …
And	the	OSS	and	SaaS options	are	growing!
So	yes,	it’s	hard	to	find	instantly	qualified	tooling	folks,	but…
Distributed systems, devops, and big data systems and skills are
becoming more widespread, and at a faster rate than ever before.
And recent vendors have built multi-tenant, open SaaS and
on-prem big data options from scratch.
What’s	Needed:
Data	Driven	
Network	Operations
What	is	Data-Driven	Network	Operations?
Getting	network	traffic	intelligence	(netops +	network-savvy	BI)	by…
Using	data	to	drive	your	technical	and	business	operations!
Most	content	companies	and	enterprises	are	data	and	analytics	driven.
Devops is	as	well	(logs,	APM,	metrics-at-scale).
But the network world has some catching up to do.
We	can	have	nice	things	too!
And	share	with	our	tech	peers	(systems/apps)	and	the	business	side?
Data-driven	operations	+	business	use	cases
• Network	Planning
• Peering	Analytics	and	Abuse
• Congestion	detection
• Is	it	the	network?
• Where	on	the	network?
• Proactive	alerting
• Distributed	DDoS	Detection
• What	Changed	Post	Deploy?
• Security	and	Breach	Detection
• Cost	Analytics
• Revenue	Identification	
(New	+	@	Risk)
• Enabling	Internal	Groups
Key	network	operator	requirements
Key	requirements	for	modern	Data-Driven	Network	Operations:
• No	data	aggregation	or	pre-filtering.
• Correlation	(fusing)	between	data	types.
• Full-resolution data, searchable and stored for months.
• FAST.	Less	than	10s	for	results.		Cannot	wait	minutes	while	spelunking.
• Network-savvy	UIs	and	APIs	(understanding	routing	and	prefixes).
• Detect	anomalies.	Should	not	have	to	watch	graphs	manually.
• Data	and	alerts	available	across	the	company.
• “0”	to	usable	in	minutes	to	weeks,	not	months	to	years.
How	To:
Get	the	data.
Fuse	the	data.
Store	the	data.
Use	the	data.
Share	the	data.
How	To:
Get	the	Data
(What	Data?)
Where to find this data?
[Diagram: data sources around the NETWORK, collected from routers and a PCAP agent]
• Flow data: NetFlow, sFlow, IPFIX
• SNMP, streaming telemetry
• Sys/event logs: TACACS & syslog
• App, server, logs, metrics
• BGP, IGP path info
• TCP stats data / app-specific data
• + User tags, threat intel, SDN control, DNS, ping/trace
Added together (+ + + =): Combinatorially Useful!
A	broader	view	of	“NetFlow”
You	can	ALSO	get	performance	data	from	the	infrastructure:
• Queue	Depth
• Retransmits	per	flow
• TCP	latency
• Application	Latency
From:
• Host	software	(nProbe)
• Sensors	/	Taps
• Webserver	logs	(Nginx)
• Cisco	AVC	supported	routers
How	To:
Fuse	the	Data
Fusing	data	for	richer	traffic	analytics
Flow	or	BGP	or	SNMP	or	DNS	or	logs	alone	are	not	enough.
This	becomes	much	richer	when	combined	with:
• Performance	and	layer	7	information
• BGP	attributes
• Geography
• Tags	(rack,	department,	customer…)
• Config changes	and	software	versions
• Threat	intelligence	and	known-bad	IPs
Fusing should be near real-time, performed at ingest, and data-specific.
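The idea of fusing at ingest can be sketched in a few lines. This is only an illustration under stated assumptions: the tables `BGP_RIB`, `GEO`, and `TAGS`, the field names, and the helper functions are all hypothetical, the addresses come from documentation ranges, and a linear scan stands in for the prefix trie a real system would probably use.

```python
import ipaddress

# Hypothetical enrichment tables. In practice these might be fed by a BGP
# daemon, a geo-IP database, and an operator-maintained tag list.
BGP_RIB = {
    ipaddress.ip_network("203.0.113.0/24"): {"dst_as": 64500, "as_path": [64496, 64500]},
    ipaddress.ip_network("203.0.0.0/16"): {"dst_as": 64501, "as_path": [64501]},
}
GEO = {ipaddress.ip_network("203.0.113.0/24"): "AU"}
TAGS = {ipaddress.ip_network("198.51.100.0/24"): {"customer": "example-co", "rack": "r12"}}


def longest_match(table, addr):
    """Return the value of the most specific prefix containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    best = None
    for net, value in table.items():
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None


def fuse(flow):
    """Return one wide row: the raw flow plus BGP, geo, and tag columns."""
    row = dict(flow)
    bgp = longest_match(BGP_RIB, flow["dst_ip"]) or {}
    row["dst_as"] = bgp.get("dst_as")
    row["as_path"] = bgp.get("as_path")
    row["dst_country"] = longest_match(GEO, flow["dst_ip"])
    row.update(longest_match(TAGS, flow["src_ip"]) or {})
    return row


flow = {"src_ip": "198.51.100.7", "dst_ip": "203.0.113.9", "bytes": 1500}
fused = fuse(flow)
```

Because each flow is enriched before it is stored, later queries can filter or group on AS, country, or customer without joining against separate systems.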
DATA FUSION
[Diagram: router data passing through data fusion into a traffic-savvy datastore]
• From the ROUTER: NetFlow v5, NetFlow v9, IPFIX, sFlow
• PCAP, PCAP agent, proxy
• Decoder modules
• Mem table
• BGP RIB (BGP daemons)
• SNMP poller
• Custom tags
• Enrichment DB: Geo ←→ IP, ASN ←→ IP
• Single flow fused row sent to storage
• TRAFFIC-SAVVY DATASTORE
Store	the	data:	Yes,	back-ends	are	still	tricky.
You	can’t	keep	enough	granularity	on	a	VM- or	relational-based	backend.
FOSS big data has limited network-savviness and no support for query
rate-limiting, which is key to multi-tenancy, or for query fragment caching,
which is key for efficiency.
You	can	do	a	lot	with	ELK,	but	it’s	not	very	efficient	or	super	network-savvy	–
and	has	basically	no	multi-tenanted	security.
Column	store	systems	are	still	rapidly	developing.
Use	Cases:
For	Network	Nerds…
Use	case:	Traffic	debugging	and	inspection
Why did the interface’s traffic just double, leaving it saturated?
Is	it	an	attack?		No,	it’s	a	mis-config!		No,	it’s	an	attack…
Where	is	the	traffic	leaving	my	network?		
Is	a	peer	sending	me	traffic	they	shouldn’t	be?	Are	my	peers	balanced	with	
me?
Did	a	content	provider	shift	their	traffic	path	to	me?
Are other networks seeing what I’m seeing?
Fusing	data	for	richer	traffic	analytics
Data	in	a	“lake”	is	not	useful!
A	modern	data-driven	network	operations	system	should	have:
• A	flexible	and	spelunk-able	UI
• Proactive	alerting	with	links	to	detailed	history	and	trends
• Dashboards	that	instantly	link	to	detailed	history	and	exploration
• Complete	API	availability	for	bi-directional	integration
Use	Case:	Traffic	debugging	and	inspection
Traffic	annotated	with	multiple	events
Anomaly	detection:	DDoS	detection	and	characteristics
Use	case:	Anomaly	detection	for	peering
Traffic	from	individual	top-20	ASN	over	transit	unusually	high.	Operator	
notified	at	red	line.
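One simple way to draw such a red line is a per-ASN baseline plus a margin. The sketch below assumes a threshold of mean plus `k` standard deviations over recent samples; that rule, the function names, and the sample figures are illustrative assumptions, not the method used in the slide.

```python
from statistics import mean, pstdev


def red_line(history_mbps, k=3.0):
    """Threshold: mean plus k standard deviations of the past samples."""
    return mean(history_mbps) + k * pstdev(history_mbps)


def check_transit(history, current, k=3.0):
    """Return (asn, current, limit) for ASNs above their own red line."""
    alerts = []
    for asn, samples in history.items():
        limit = red_line(samples, k)
        if current.get(asn, 0.0) > limit:
            alerts.append((asn, current[asn], round(limit, 1)))
    return alerts


# Made-up transit traffic (Mbps) per source ASN: past samples, then the latest.
history = {
    64500: [100, 110, 90, 105, 95],
    64501: [400, 420, 380, 410, 390],
}
current = {64500: 300, 64501: 405}
alerts = check_transit(history, current)
```

Here only AS 64500 is flagged: its latest sample is far above its own history, while AS 64501 carries more traffic in absolute terms but stays within its usual range.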
Traffic,	anomaly	detection	and	annotation
Use	case:	Network	planning
Flow-based	traffic	+	BGP	can	be	used	to	help	show:
• Path,	neighbor,	transit,	origin,	and	country	of	traffic.
• Strategic	peering	and	transit	changes	that	can	improve	perf and	costs.
• Potential	new	peers	and	locations	to	peer.
• Evaluate	the	potential	of	new	peering	exchanges	or	facilities.
• Transit	relationships	that	are	of	high	or	little	value.
• Understand	ROI	before	extending	backbone	links	or	capacity.
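As a rough illustration of the planning views listed above, flow records that already carry a BGP AS path can be summed by hop. The record layout and the function name `traffic_by_hop` are assumptions made for this sketch, and the AS numbers are from the documentation range.

```python
from collections import Counter


def traffic_by_hop(flows, hop):
    """Sum bytes by the ASN at position `hop` of each AS path (0 = neighbor)."""
    totals = Counter()
    for f in flows:
        path = f["as_path"]
        if len(path) > hop:
            totals[path[hop]] += f["bytes"]
    return totals


# Made-up flows, each already fused with the AS path toward its destination.
flows = [
    {"as_path": [64496, 64500], "bytes": 700},
    {"as_path": [64496, 64501], "bytes": 200},
    {"as_path": [64497, 64500], "bytes": 100},
]
by_neighbor = traffic_by_hop(flows, 0)
by_second_hop = traffic_by_hop(flows, 1)
```

In this toy data AS 64500 is the largest second-hop destination and is reached through two different neighbors, which is the kind of pattern that might suggest a direct peering candidate.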
Use	Case:	Network	planning,	traffic	by	BGP	HOP
Use	case:	Network	security	analytics
Flow-based	traffic	+/- threat	intelligence	can	show:
• Compromised	servers,	desktops,	and	IoT devices.
• From	threat	intel and	anomaly	detection.
• To	help	your	own,	or	downstream/internal	customers.
• And	feed	DDoS response	if	there	are	local	sources/sinks.
• And	find	BCP38	violations	on-net	or	on	peers,	even	with
simple	“how	many	source	/8s”	heuristics.
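The “how many source /8s” heuristic could look something like the sketch below: a customer-facing port that sources traffic from many different /8s is a candidate for spoofing. The field names, the IPv4-only handling, and the threshold of 2 are assumptions for illustration; a real check would need tuning and would also treat multihomed customers differently.

```python
import ipaddress
from collections import defaultdict


def source_slash8s(flows):
    """Map each ingress port to the set of source /8s (first octets) seen on it."""
    seen = defaultdict(set)
    for f in flows:
        first_octet = int(ipaddress.IPv4Address(f["src_ip"])) >> 24
        seen[f["ingress"]].add(first_octet)
    return seen


def bcp38_suspects(flows, max_slash8s=2):
    """Return ports whose traffic comes from more than max_slash8s source /8s."""
    return sorted(
        port for port, nets in source_slash8s(flows).items()
        if len(nets) > max_slash8s
    )


# Made-up flows: cust-1 stays in one /8, cust-2 sources from four.
flows = [
    {"ingress": "cust-1", "src_ip": "198.51.100.10"},
    {"ingress": "cust-1", "src_ip": "198.51.100.77"},
    {"ingress": "cust-2", "src_ip": "10.1.2.3"},
    {"ingress": "cust-2", "src_ip": "45.0.0.9"},
    {"ingress": "cust-2", "src_ip": "77.8.8.8"},
    {"ingress": "cust-2", "src_ip": "203.0.113.5"},
]
suspects = bcp38_suspects(flows)
```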
Use	case:	Network	performance	analytics
Flow-based	traffic	+	BGP	+	network	performance	data	can	show:
• Whether issues are in the application or network layer, and where
• And	in	a	way	expose-able	to	internal	dev +	app	operations
• And	to	pinpoint	performance	issues	by	peer	or	remote	AS	path
• Or	prefix
• Or	data	center
• Or provably not in the network :)
Perf-enhanced flow: TCP latency / ASN
Perf-enhanced flow: TCP latency / Prefix
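Views like the two above amount to grouping a per-flow latency measurement by a fused dimension. A minimal sketch, assuming each flow row already carries hypothetical `tcp_latency_ms`, `dst_as`, and `dst_prefix` fields, and using the median as one reasonable summary:

```python
from collections import defaultdict
from statistics import median


def latency_by(flows, key):
    """Median TCP latency (ms) grouped by the given flow field."""
    groups = defaultdict(list)
    for f in flows:
        groups[f[key]].append(f["tcp_latency_ms"])
    return {k: median(v) for k, v in groups.items()}


# Made-up performance-enhanced flow rows.
flows = [
    {"dst_as": 64500, "dst_prefix": "203.0.113.0/24", "tcp_latency_ms": 20},
    {"dst_as": 64500, "dst_prefix": "203.0.113.0/24", "tcp_latency_ms": 22},
    {"dst_as": 64500, "dst_prefix": "203.0.113.0/24", "tcp_latency_ms": 21},
    {"dst_as": 64501, "dst_prefix": "198.51.100.0/24", "tcp_latency_ms": 180},
    {"dst_as": 64501, "dst_prefix": "198.51.100.0/24", "tcp_latency_ms": 200},
    {"dst_as": 64501, "dst_prefix": "198.51.100.0/24", "tcp_latency_ms": 190},
]
per_asn = latency_by(flows, "dst_as")
per_prefix = latency_by(flows, "dst_prefix")
```

The same function serves both views because the dimension is just a column on the fused row, which is the practical payoff of enriching at ingest.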
What	For?
For	Business	Nerds…
Use	case:	Customer	cost	analysis
Whether	an	enterprise	or	SP…
Customers	(external	or	internal)	cost	money.
Most	critical	for	SPs	(customer	packet-mile	cost)…
But also important for many enterprises:
How much of our international backbone does this service use?
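A packet-mile style figure can be approximated by weighting each flow's bytes by the distance it travels on the backbone. The sketch below is an assumption-heavy illustration: the link names, link lengths, and the idea that each flow row lists the backbone links it crossed are all hypothetical.

```python
from collections import defaultdict

# Hypothetical backbone links with made-up lengths in miles.
LINK_MILES = {"link-a": 1000, "link-b": 500}


def byte_miles_by_customer(flows):
    """Sum bytes multiplied by backbone miles traversed, per customer tag."""
    totals = defaultdict(int)
    for f in flows:
        distance = sum(LINK_MILES[link] for link in f["links"])
        totals[f["customer"]] += f["bytes"] * distance
    return dict(totals)


# Made-up flows tagged with a customer and the backbone links they crossed.
flows = [
    {"customer": "A", "bytes": 100, "links": ["link-a", "link-b"]},
    {"customer": "B", "bytes": 1000, "links": ["link-b"]},
    {"customer": "A", "bytes": 50, "links": []},
]
usage = byte_miles_by_customer(flows)
```

Traffic that never touches the backbone contributes nothing, so a customer with high volume but local delivery can turn out cheaper to carry than a smaller one whose traffic crosses long-haul links.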
Use	case:	Security	posture	and	risk
The	same	analytics	that	can	power	operational	knowledge	and	fixes	can	inform	
the	business	about	the	risk	and	support	security	groups	as	they	assess	and	fix	
not	just	the	production but	also	the	corporate networks.
Use	case:	Revenue	enhancement	and	retention
Sometimes	not	discussed	in	polite	company,	but	great	traffic-based	analytics	
can	help	with	the	top	line	as	well	as	margin:
• Offering	high-margin	customers	lower	rates	to	attract	more	traffic.
• Identifying	large	2nd and	3rd hop	AS	sinks	or	sources	behind	peers,
to	convert	to	customers.
• Find	large	customers	that	look	small	(on-boarding/testing)	or	large
customers	starting	to	migrate	to	competitors.
• Create	service	revenue	around	security,	DDoS,	and	performance.
Summary
• It’s	not	just	“flow	tools”	any	more	(or	just	BGP	or	just	SNMP	or	…).
• Networks	can	produce	a	lot	of	(very	diverse)	data.
• You	can	capture	and	use	it	with	modern	streaming,	fusing,	and	storage	
components.
• Enterprises	are	looking	for	and	funding	turnkey	but	open	tools	that	
integrate	and	provide	cross-group	access.
• Vendors	are	starting	to	innovate.
• And	DIY	options	are	getting	better	(but	still	require	a	lot	of	training).
Take-Aways
• You	deserve,	and	can	have,	nice	things!
• And	so	can	your	tech	and	business	peers.
• And	answer	business	as	well	as	tech	ops	questions.
• It’s	possible	(with	some	work)	to	see	not	just	traffic	flow	but	performance.
• There is a new wave of SaaS and big data-based vendors that integrate.
• And	if	you	intend	to	DIY,	start	cross-training	and	hiring	now.
Questions?
(Happy	to	answer	offline	as	well...)
Avi	Freedman
email:	avi at	kentik.com

The Age of Data-Driven Network Operations