Kentik Detect Engine - Network Field Day 2017

Kentik Data Engine
Dan Ellis
CTO

KDE Quick Stats
(Kentik Data Engine)
NetFlow in the Cloud
• 125+ Billion Flows/Day stored
• 1,000,000+ FPS
• 50 “Large” Queries/s, thousands of sub-qps
• 75+ TB flow data stored/day
(25+ TB compressed)
SNMP, BGP, network performance too!

KDE High-Level
• KDE is a hybrid system:
○ Fusing / Ingest Layer
○ Distributed column store db / query engine
○ Realtime stream processing for anomaly detection
• We evaluated various existing engines: ES, Hadoop,
Cassandra, Storm, Spark, SiLK, Druid, Kafka...
• Couldn't find one with the performance, multi-tenancy, and
network savvy we needed, so we wrote our own

KDE architecture
[Diagram: data sources; ingest & fusion layer; storage layer (flow specific); query layer (query engine and UI); query interfaces (SQL, WWW, REST); clients. Example query shown: SELECT flow FROM router WHERE …]
• Each layer has separate and different scaling characteristics

KDE Architecture
[Diagram: the KDE ingest layer contains the BGP VIP, relays, proxies, clients (C), and the enKryptor. Labeled flows: NetFlow (UDP), kFlow (HTTPS), kFlow (HTTP). Also shown: storage layer and streaming layer.]

VIP + Relay
[Diagram: KDE architecture, as above]
• One IP bound to multiple servers
• Sharded by source IP
• Validate sender as a Kentik customer
• Pass flow on (raw UDP socket) to the correct proxy
• Relay handles load balancing (Kentik specific, UDP+TCP)
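The relay behaviour listed on this slide (validate the sender, then shard by source IP to a proxy) could be sketched roughly as follows. This is only an illustration under stated assumptions: the proxy names, the allow-list, and the CRC32 hash are hypothetical, and the slides do not say how Kentik actually maps a source IP to a proxy.

```python
import ipaddress
import zlib

# Hypothetical proxy pool; the real pool is not described in the slides.
PROXIES = ["proxy-0", "proxy-1", "proxy-2"]

# Hypothetical allow-list of registered customer exporters.
KNOWN_SENDERS = {"192.0.2.10", "198.51.100.7"}


def pick_proxy(source_ip: str, proxies=PROXIES) -> str:
    """Map a source IP to one proxy so a device always lands on the same one."""
    packed = ipaddress.ip_address(source_ip).packed
    # CRC32 is an assumed stand-in for whatever sharding function is really used.
    return proxies[zlib.crc32(packed) % len(proxies)]


def relay(source_ip: str, datagram: bytes):
    """Return (proxy, datagram) for known senders, or None to drop the packet."""
    if source_ip not in KNOWN_SENDERS:
        return None
    return pick_proxy(source_ip), datagram
```

Because the mapping depends only on the source IP, every datagram from one device reaches the same proxy, which is what lets a per-device client keep state.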

Proxy
[Diagram: KDE architecture, as above]
• Inspect flow & determine type:
NetFlow v5, NetFlow v9, IPFIX, sFlow, kFlow
• Need to resample?
• Configured sample rate
• Launch a client process for each device
• Poll for device changes
• Monitor health
• Relaunch on client crash
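One way a proxy might "inspect flow & determine type" is to look at the version field at the start of each datagram. The sketch below relies on the published header layouts (NetFlow v5 and v9 and IPFIX start with a 16-bit version of 5, 9, or 10; sFlow v5 starts with a 32-bit version of 5). The function name is hypothetical and this is not taken from Kentik's code; kFlow arrives over HTTPS and is not covered here.

```python
import struct


def detect_flow_type(datagram: bytes) -> str:
    """Guess the export protocol from the first four bytes of a UDP payload."""
    if len(datagram) < 4:
        return "unknown"
    # Read two big-endian 16-bit words from the start of the datagram.
    first, second = struct.unpack("!HH", datagram[:4])
    if first == 5:
        return "netflow_v5"
    if first == 9:
        return "netflow_v9"
    if first == 10:
        return "ipfix"
    # sFlow v5 encodes its version as a 32-bit integer: 0x00000005.
    if first == 0 and second == 5:
        return "sflow_v5"
    return "unknown"
```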

Client
(where the magic happens)
[Diagram: KDE architecture, as above; a client (C) takes NetFlow, sFlow, or IPFIX in and emits kFlow]
• One per device configured to send flow
• * goes in, kFlow comes out

Client Processing
is a key enabler to useful data

Step 1: Normalization
• Separate code paths for each type expected
• CGO callouts
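To illustrate "separate code paths for each type", here is a minimal sketch that maps already-decoded NetFlow v5 and IPFIX records onto one common shape. The input field names follow the NetFlow v5 record layout and the IANA IPFIX information elements; the `NormalizedFlow` type and the dispatch table are hypothetical and say nothing about the real kFlow schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedFlow:
    """Hypothetical common record; the real kFlow schema is not in the slides."""
    src_ip: str
    dst_ip: str
    packets: int
    octets: int


def from_netflow_v5(rec: dict) -> NormalizedFlow:
    # Field names as used in the NetFlow v5 flow record.
    return NormalizedFlow(rec["srcaddr"], rec["dstaddr"], rec["dPkts"], rec["dOctets"])


def from_ipfix(rec: dict) -> NormalizedFlow:
    # Field names as registered IPFIX information elements.
    return NormalizedFlow(
        rec["sourceIPv4Address"],
        rec["destinationIPv4Address"],
        rec["packetDeltaCount"],
        rec["octetDeltaCount"],
    )


# One code path per input type, selected by the detected flow type.
NORMALIZERS = {"netflow_v5": from_netflow_v5, "ipfix": from_ipfix}


def normalize(flow_type: str, rec: dict) -> NormalizedFlow:
    return NORMALIZERS[flow_type](rec)
```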

Step 2: Enrichment
• BGP - Route data for xxx
• GeoIP - Where does my traffic start and end
• SNMP - Interface names and descriptions
• Tagging - business classification: cost-centers,
user-info, peering info
• App Specific Data - URL/DNS requests, MySQL
queries
• Performance data (NPM) - Retransmits, network latency,
application latency
• coming soon:
• Timestamped event Data (syslog)
• Threat feeds
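The enrichment step can be pictured as a series of lookups keyed on fields of the flow. The sketch below adds an ASN and a country code using a longest-prefix match over small in-memory tables. The table contents, field names, and the linear scan are all assumptions for illustration; a production system would more likely use a trie or radix tree fed by the BGP and GeoIP sources.

```python
import ipaddress

# Hypothetical in-memory tables (documentation prefixes and reserved ASNs).
ASN_TABLE = {"192.0.2.0/24": 64500, "192.0.0.0/16": 64501}
GEO_TABLE = {"192.0.2.0/24": "US"}


def longest_prefix_match(ip: str, table: dict):
    """Return the value of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, value in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, value)
    return None if best is None else best[1]


def enrich(flow: dict) -> dict:
    """Return a copy of the flow with ASN and geo fields attached."""
    fused = dict(flow)
    for side in ("src", "dst"):
        ip = flow[side + "_ip"]
        fused[side + "_asn"] = longest_prefix_match(ip, ASN_TABLE)
        fused[side + "_geo"] = longest_prefix_match(ip, GEO_TABLE)
    return fused
```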

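DATA FUSION in CLIENT
[Diagram: a router sends NetFlow v5, NetFlow v9, IPFIX, and sFlow to the decoder modules; a PCAP agent and PCAP proxy are also shown. Mem tables hold the BGP RIB, custom tags, and Geo ←→ IP and ASN ←→ IP mappings; other inputs are the SNMP poller, the BGP daemon, and the enrichment DB. Data fusion produces a single fused flow row that is sent to storage, the flow-friendly datastore.]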

Step 3: Resampling & Unification
• Long term (>1 Month)
• What a process (device) said over an hour
• Two tricks:
• Flow Unification
• Resampling
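The slide names the two tricks without describing them, so the following is only one plausible reading, not Kentik's method. Here "unification" merges records that share a key after scaling each by its own sample rate, and "resampling" keeps every k-th record while raising its sample rate to compensate. All field names are hypothetical.

```python
from collections import defaultdict

KEY_FIELDS = ("src_ip", "dst_ip", "dst_port", "proto")


def unify(flows):
    """Merge records sharing a key, scaling each by its own sample rate first."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for f in flows:
        key = tuple(f[name] for name in KEY_FIELDS)
        totals[key]["bytes"] += f["bytes"] * f["sample_rate"]
        totals[key]["packets"] += f["packets"] * f["sample_rate"]
    # Counters are already scaled up, so the merged rows carry sample_rate 1.
    return [dict(zip(KEY_FIELDS, key), sample_rate=1, **counts)
            for key, counts in totals.items()]


def resample(flows, factor):
    """Keep every factor-th record and multiply its sample rate by factor."""
    return [dict(f, sample_rate=f["sample_rate"] * factor)
            for i, f in enumerate(flows) if i % factor == 0]
```

Either step shrinks the number of stored rows, which is why they would matter most for the long-term (>1 month) data mentioned on the slide.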

Query+Storage layers
achieving ‘à la carte’
data consumption

Storage Layer
• Fused kFlow as input, encoded with Cap'n Proto (similar to
Protocol Buffers)
• Shard data into small chunks
• HTTP to N distributed storage nodes
• Metadata supervisor DB handles shard locations
• Row Oriented to Column Oriented
• Compressed using ZFS
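The "row oriented to column oriented" step can be shown with a small pivot: each fused row is split so that all values of one field are stored together. The sketch assumes every row has the same fields, and it uses zlib purely as a stand-in to show per-column compression; in the system described, compression is left to ZFS.

```python
import json
import zlib


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def compress_column(values) -> bytes:
    """Serialize one column and compress it (zlib is an illustrative stand-in)."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decompress_column(blob: bytes):
    """Inverse of compress_column."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

A query that touches only a few fields then reads only those columns, and values within a column tend to repeat, which helps compression.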

Multi-Tenancy DB
Needed multi-tenancy for a large-scale SaaS product
Could not find other DBs that offer it at scale
We succeeded by building in:
● Fairness
queries are chopped into small chunks, users are rate limited and
prioritized
● Security
data is isolated between “users” down to the thread level
● Multiuser caching with fairness
Built a cache that cannot be monopolized by any one user
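The fairness idea (chop queries into small chunks, then share capacity between users) can be sketched with a round-robin scheduler. The class and method names are hypothetical, and rate limiting and prioritization are left out; this only shows why chunking stops one large query from starving the others.

```python
from collections import OrderedDict, deque


class FairScheduler:
    """Round-robin over users; each turn runs one chunk of that user's work."""

    def __init__(self):
        self._queues = OrderedDict()  # user -> deque of (query_id, chunk_index)

    def submit(self, user, query_id, n_chunks):
        """Split a query into n_chunks units and queue them for the user."""
        queue = self._queues.setdefault(user, deque())
        queue.extend((query_id, i) for i in range(n_chunks))

    def run_order(self):
        """Return the order in which chunks would execute, draining the queues."""
        order = []
        while self._queues:
            for user in list(self._queues):
                queue = self._queues[user]
                order.append((user, *queue.popleft()))
                if not queue:
                    del self._queues[user]
        return order
```

With this policy a one-chunk query from user "b" runs second even when user "a" submitted a three-chunk query first.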

[Diagram: KDE architecture, as above, with the query interfaces called out]
● SQL interface: PSQL FDW
● UI/UX: feat. advanced data-viz
● REST API based interface: build your own
SELECT flow
FROM router
WHERE …
SQL

Anomaly Detection and
Streaming Databases

Anomaly Detection
● Network + NPM specific
● Policy based, customizable
● Granular itemization and metrics
○ Look at top-100 country, IP, port, ASN, site, path, ...
○ Unique senders, bps, pps, retransmits, latency
● Over/under static thresholds
● Over/under what’s “normal” (baselining)
● Perform actions
○ E-mail, Slack, JSON, PagerDuty
○ Mitigation (A10, Radware, BGP)
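The two kinds of check listed above, static thresholds and deviation from what is "normal", might look like this in miniature. The slides do not say how baselines are computed, so the mean and standard deviation test here is an assumed, simple choice, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def over_static(value: float, threshold: float) -> bool:
    """Static check: fire when the metric exceeds a fixed limit."""
    return value > threshold


def deviates_from_baseline(value: float, history, n_sigma: float = 3.0) -> bool:
    """Baseline check: fire when value is far from the historical mean.

    history is a non-empty list of past values for the same metric and key,
    for example bps for one destination IP at the same time of day.
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > n_sigma * sigma
```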

Detecting Anomalies
• DDoS is a simple use case of anomaly detection
• V1 anomaly detection relied on KDE queries, which was abusive
• V2 needed stream processing and in-RAM baseline
storage
• Typically avoided streaming DBs due to aggregation
• Streaming DBs for anomaly detection + our long-term
flow storage is a powerful combination
• Evaluated Spark, Storm, Samza, PipelineDB; none worked for us

[Diagram: KDE architecture, as above, showing the streaming layer alongside the storage layer]

[Diagram: streaming pipeline. kFlow (multiple kFPS) is filtered per policy (Policy #1, Policy #2) and summed (Σ) through Aggregation Layers #1, #2, and #3, with windows labeled 1s, 1min, and 1hour. Policy thresholds & actions: a threshold comparator feeds action triggers.]
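The tiered aggregation in the diagram can be approximated by rolling per-second sums up into minutes and then hours, comparing each window against a threshold. How the windows map onto the numbered layers is an assumption, as are the function names; the sketch works on a plain list of per-second values for one policy.

```python
def roll_up(samples, window):
    """Sum consecutive groups of window samples; a trailing partial group is dropped."""
    full = len(samples) - len(samples) % window
    return [sum(samples[i:i + window]) for i in range(0, full, window)]


def triggered_windows(values, threshold):
    """Threshold comparator: indices of the windows whose sum exceeds threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def aggregate(per_second):
    """Build the 1 s -> 1 min -> 1 hour tiers from per-second values."""
    per_minute = roll_up(per_second, 60)
    per_hour = roll_up(per_minute, 60)
    return per_minute, per_hour
```

Each tier only consumes the output of the tier below it, so the coarse windows never rescan raw flow.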
