Monitoring OpenConnect CDN
Sergey Fedorov, Netflix
Monitorama 2015
Sergey Fedorov, Netflix, Monitorama 2015
What is OpenConnect
36.5%
US downstream traffic *
* 2015 Sandvine report
OpenConnect Cache Appliance
Space/Power optimized
10/40 Gbps network interface
FreeBSD OS
NGINX server
BIRD routing proxy
Gizmodo, “This box can hold an entire Netflix” http://gizmodo.com/this-box-can-hold-an-entire-netflix-1592590450
Network
Transit
Internet Exchange
ISP embedded
Intelligent clients
Control Plane
end-user content request router
client location
network conditions
server utilization
content distribution
Who we are
Sergey Fedorov, Stefan Praszalowicz
Monitoring challenge
Testing in prod*
Network changes
Firmware deployments
App pushes
Updating content
...
Data sources
Caches
Clients
Control Plane
Microservices
Network
Capacity
Config
Content
Telemetry (Atlas)
Logs (Elasticsearch)
METRICS
Something breaks all the time
Big problems start small
Context matters
Small SRE team
Elastic
How we do it
[Diagram, built up over three slides]
Data sources: Netflix Clients, Caches, Network, Config, ...
Data processing: stream processors, pollers
State processing: FSM
Orchestration
MAINTENANCE
start fixing, end fixing
threshold=75%
[Animated state diagram. Input actions shown on successive frames, in order:]
action: ok, from: cpu (4 frames)
action: silence, from: config
action: ok, from: cpu
action: silence, from: config
action: break, from: cpu (6 frames)
action: ok, from: cpu
action: unsilence, from: config
action: ok, from: cpu (4 frames)
action: break, from: cpu (2 frames)
action: start_fix, from: user
action: break, from: cpu (7 frames)
action: ok, from: cpu (2 frames)
action: end_fix, from: user
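The frames above suggest a per-metric state machine driven by actions from a poller (cpu), from configuration (config) and from operators (user). A minimal Python sketch of that idea follows. The state names, the transition table and the `action_for` helper are assumptions made for illustration; the slides show only the action names, the MAINTENANCE label and the 75% threshold.

```python
from enum import Enum


class State(Enum):
    OK = "ok"
    BROKEN = "broken"
    SILENCED = "silenced"        # assumed: alerts muted by config
    MAINTENANCE = "maintenance"  # assumed: an operator is fixing the cache


THRESHOLD = 75.0  # percent, taken from the slides


def action_for(cpu_percent: float) -> str:
    """Hypothetical poller rule: map a CPU sample to an input action."""
    return "break" if cpu_percent > THRESHOLD else "ok"


# Assumed transition table: (current state, action) -> next state.
# Pairs that are not listed leave the state unchanged.
TRANSITIONS = {
    (State.OK, "break"): State.BROKEN,
    (State.BROKEN, "ok"): State.OK,
    (State.OK, "silence"): State.SILENCED,
    (State.BROKEN, "silence"): State.SILENCED,
    (State.SILENCED, "unsilence"): State.OK,
    (State.OK, "start_fix"): State.MAINTENANCE,
    (State.BROKEN, "start_fix"): State.MAINTENANCE,
    (State.MAINTENANCE, "end_fix"): State.OK,
}


def step(state: State, action: str) -> State:
    """Apply one input action and return the next state."""
    return TRANSITIONS.get((state, action), state)


def run(actions, state=State.OK):
    """Feed a sequence of actions through the FSM, returning each state visited."""
    history = []
    for action in actions:
        state = step(state, action)
        history.append(state)
    return history
```

In this sketch, ok and break actions are ignored while the metric is SILENCED or in MAINTENANCE, which is one way to keep planned work from producing alerts. Whether the real system behaves this way is not stated in the slides.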
[Diagram, extended with events processing]
Data sources: Netflix Clients, Caches, Network, Config, ...
Data processing: stream processors, pollers
State processing: FSM
Events processing: Event handlers
Orchestration
STATE TRANSITION
Triggers an event
EVENT
● OLD STATE
● NEW STATE
● Input action
● Metric name
● Action metadata
○ metric value
○ comments
○ tags
○ timestamp
○ ...
RULES
Event handlers
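The event structure and rule-based handlers described above could be sketched as follows. The `Event` fields mirror the bullet list on the slide. The `Dispatcher` class, the `on_transition` helper and the representation of a rule as a predicate function are hypothetical names chosen for this example, not the design used by the OpenConnect team.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    old_state: str
    new_state: str
    action: str   # input action that caused the transition
    metric: str   # metric name
    # metric value, comments, tags, timestamp, ...
    metadata: Dict[str, object] = field(default_factory=dict)


Rule = Callable[[Event], bool]
Handler = Callable[[Event], None]


class Dispatcher:
    """Hypothetical rule table: each rule that matches an event runs its handler."""

    def __init__(self) -> None:
        self._rules: List[Tuple[Rule, Handler]] = []

    def register(self, rule: Rule, handler: Handler) -> None:
        self._rules.append((rule, handler))

    def dispatch(self, event: Event) -> int:
        """Run every matching handler and return how many ran."""
        matched = 0
        for rule, handler in self._rules:
            if rule(event):
                handler(event)
                matched += 1
        return matched


def on_transition(old: str, new: str, action: str, metric: str,
                  dispatcher: Dispatcher, **metadata) -> int:
    """Emit an event only when the state actually changed."""
    if old == new:
        return 0
    return dispatcher.dispatch(Event(old, new, action, metric, dict(metadata)))
```

A handler could then be registered with, for example, `dispatcher.register(lambda e: e.new_state == "BROKEN", notify)`, where `notify` is any function that accepts an `Event`.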
Events priority
[Chart: Severity (Info, Notice, Warning, Critical) against Escalation (0, 1, 2, 3)]
Regions: Do Never, Do Next, Do Last, Do Now
Notifications
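One way to turn the two axes of the priority chart into a single decision is to combine severity and escalation into a score and bucket it. The sketch below does that. The additive score and the bucket boundaries are assumptions made for illustration, since the exact region boundaries cannot be recovered from the slide text.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    NOTICE = 1
    WARNING = 2
    CRITICAL = 3


def priority(severity: Severity, escalation: int) -> str:
    """Assumed mapping: bucket events by severity + escalation level (0..3)."""
    if not 0 <= escalation <= 3:
        raise ValueError("escalation must be between 0 and 3")
    score = int(severity) + escalation  # ranges from 0 to 6
    if score >= 5:
        return "Do Now"
    if score >= 3:
        return "Do Next"
    if score >= 1:
        return "Do Last"
    return "Do Never"
```

Under these assumed boundaries, an unescalated Info event is never acted on, while a Critical event at the highest escalation level is handled immediately.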
Aggregation
Cache state = aggregation of states of its metrics
Cluster state = aggregation of states of its caches
OK: all OK
DEGRADED: some BROKEN or DEGRADED
BROKEN: most BROKEN
Examples:
All caches are OK → cluster state is OK
2/12 caches are BROKEN → cluster state is DEGRADED
7/12 caches are BROKEN → cluster state is BROKEN
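The aggregation rule above can be sketched as one function that is applied twice: once to roll metric states up into a cache state, and again to roll cache states up into a cluster state. Two details are assumptions, because the slides do not define them: "most" is read as strictly more than half, and an empty group is reported as OK.

```python
from collections import Counter
from typing import Iterable

OK, DEGRADED, BROKEN = "OK", "DEGRADED", "BROKEN"


def aggregate(states: Iterable[str]) -> str:
    """Roll child states up into one parent state.

    Assumption: "most" means strictly more than half of the children.
    """
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return OK  # assumed: nothing to report means nothing is wrong
    if counts[BROKEN] * 2 > total:
        return BROKEN
    if counts[BROKEN] or counts[DEGRADED]:
        return DEGRADED
    return OK


# The three examples from the slides, for a cluster of 12 caches.
all_ok = aggregate([OK] * 12)
two_broken = aggregate([BROKEN] * 2 + [OK] * 10)
seven_broken = aggregate([BROKEN] * 7 + [OK] * 5)
```

Because the same function works at both levels, a cluster state can be computed as `aggregate(aggregate(metrics) for metrics in caches)`, where `caches` is a list of per-cache metric state lists.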
Challenges
Setup
Predefined groupings
UI
Issues correlation
Failure forecasting
OSS
Feedback
jobs.netflix.com/jobs/1693/
jobs.netflix.com/jobs/2240/
Sergey Fedorov
OpenConnect, Netflix
sfedorov@netflix.com
