Netflix strives to provide an amazing experience to each member. To accomplish this, Netflix must maintain very high availability across its systems. At a certain scale, however, humans can no longer monitor the status of all systems themselves, so it is critical for Netflix to build tools and platforms that automatically monitor production environments and make intelligent real-time operational decisions to remedy the problems they identify. In this session, we discuss how Netflix uses data mining and machine learning techniques to automate these decisions in real time, with the goal of supporting operational availability, reliability, and consistency. We review how we arrived at the current state, the lessons we learned, and the future of real-time analytics at Netflix. While Netflix operates at a larger scale than most companies, we believe the approaches and technologies we discuss are highly relevant to other production environments, and audience members should come away with actionable ideas that can be implemented in, and benefit, most other environments.
18. MMO: Most Memorable Outage
• One device (out of ~10³)
• One test cell (out of ~10¹)
• One test (out of ~10⁴)
• Couldn’t view House of Cards S3E1
• For a week
19. Scale At Scale
We have weird, device-specific problems all the time, and interactions with A/B tests only make them more complicated, so I'm not sure we have a pat moral of the story except that we really like alerting and fast responses.
- Matt McCarthy
20. Bad News About the Cloud
• Infrastructure no longer the bottleneck
• Before: Weeks to change infrastructure
• After: API call
• TTD expectations vastly higher
• AWS makes us the lameness bottleneck
21. Good News About the Cloud
• Infrastructure no longer the bottleneck
• Before: Weeks to change infrastructure
• After: API call
• Rapid recovery, automated response possible
• AWS: Enabling productive laziness for 9 years and counting
22. Don’t Forget to Bring a Towel!
Monitoring Capabilities You’ll Find Useful
• Time series
• Event Streaming
• Dependency Discovery and Inspection
26. Predictive Scaling
Auto Scaling is reactive.
• SCALE UP by 10%
• WHEN Requests Per Second > 120
• FOR 10 consecutive minutes
• FOLLOWED-BY a cool-down of 15 minutes
35. Predictive Scaling
Predictive-reactive Auto Scaling
• A hybrid approach
• Predict the workload of a cluster in advance and proactively scale.
• Use auto scaling to handle unexpected surges in workload.
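One minimal way to sketch the hybrid idea: forecast a cluster's workload for a time slot from historical observations of the same slot, size the cluster proactively with some headroom, and keep the reactive policy as a floor-raising safety net for surges. All names and parameters here (per_server_rps, headroom) are illustrative assumptions; real predictive scalers use far richer models of daily and weekly traffic patterns.

```python
import math

def predicted_capacity(history, slot, per_server_rps=10.0, headroom=1.3):
    """Forecast the workload for `slot` as the mean of past observations at
    that slot (e.g. the same hour on previous days), then size the cluster
    with headroom. A deliberately simple stand-in for a real predictor."""
    samples = [day[slot] for day in history]
    forecast = sum(samples) / len(samples)
    return math.ceil(forecast * headroom / per_server_rps)

def hybrid_min_size(history, slot, reactive_min, **kw):
    """Hybrid policy: the predictive plan proactively sets the cluster size,
    while the reactive auto scaling minimum still guards against
    unexpected surges the forecast did not anticipate."""
    return max(predicted_capacity(history, slot, **kw), reactive_min)
```

For example, with two days of history `[[100, 200], [140, 220]]`, slot 0 forecasts 120 RPS and sizes the cluster at 16 servers; if the reactive policy currently demands 20, the hybrid takes the larger value.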
55. Server Outlier Detection
Netflix runs on thousands of servers
• A small percentage of servers become unhealthy.
• Customer experience may be degraded.
• Time wasted looking for evidence.
58. Server Outlier Detection
Cluster Analysis
• Unsupervised machine learning.
• If a server belongs to a group, it should be near lots of other points as measured by some distance function.
Assumption
• Servers running the same hardware and software should behave similarly.
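The assumption above can be turned into a simple distance-based detector: represent each server as a vector of metrics and flag any server whose mean distance to the rest of the cluster is far above the norm. This is a sketch of the idea, not Netflix's detector; the threshold factor is an arbitrary assumption, and density-based algorithms such as DBSCAN apply the same intuition more robustly.

```python
def outlier_servers(metrics, factor=2.0):
    """Flag servers far from the rest of the cluster.

    `metrics` maps server name -> tuple of metric values (e.g. CPU,
    latency, error rate). Per the slide's assumption, healthy servers on
    identical hardware/software should sit near many other points; a
    server whose mean distance to the others greatly exceeds the cluster's
    typical (median) mean distance is reported as an outlier."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    names = list(metrics)
    avg = {n: sum(dist(metrics[n], metrics[m]) for m in names if m != n)
              / (len(names) - 1)
           for n in names}
    typical = sorted(avg.values())[len(avg) // 2]  # median mean-distance
    return [n for n in names if avg[n] > factor * typical]
```

A server reporting (95, 40) among peers clustered around (50, 10) is flagged; a tight, healthy cluster yields no outliers. The flagged names would then feed the remediation actions on the next slide.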
61. Server Outlier Detection
Actions / Remediation
• Send e-mail
• Page service owner
• Terminate instance
• Remove from service
• Detach from a load balancer
63. Automated Canary Analysis
Canary Release Process
• A change is gradually rolled out to production.
• Checkpoints are performed along the way.
• A decision is made at each checkpoint.
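The three bullets above amount to a loop over rollout stages with a go/no-go decision at each checkpoint. A minimal sketch, assuming an `evaluate` callback that returns a canary score in [0, 1] (the stages, the threshold, and the return shape are all illustrative):

```python
def canary_rollout(stages, evaluate, pass_threshold=0.95):
    """Gradually shift traffic through `stages` (e.g. [1, 5, 25, 100] percent).
    At each checkpoint, `evaluate(pct)` scores the canary against the
    baseline; a failing score aborts the rollout, otherwise we proceed."""
    for pct in stages:
        score = evaluate(pct)           # checkpoint: compare canary vs. baseline
        if score < pass_threshold:
            return ("rollback", pct)    # decision: abort at this stage
    return ("promote", stages[-1])      # decision: complete the rollout
```

The interesting engineering lives inside `evaluate`: that is the automated analysis step described a few slides below, where metric distributions from the two versions are compared statistically.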
64. Automated Canary Analysis
Advantages
• Better degree of trust and safety in deployments.
• Faster deployment cadence.
• Lower investment in simulation engineering.
65. Automated Canary Analysis
Canary Process
[Diagram: the load balancer splits traffic 95% to the current version (v1.0, 100 servers) and 5% to the new version (v1.1, 5 servers); metrics are collected from both groups.]
66. Automated Canary Analysis
Canary Process
[Diagram: after promotion, the load balancer sends 100% of traffic to the new version (v1.1, now 100 servers); the current version (v1.0) is drained to 0 servers; metrics continue to be collected.]
68. Automated Canary Analysis
Automated Analysis
• Identify a set of metrics to compare.
• Use a statistical test to identify differences between v1.0 and v1.1:
• Mann–Whitney
• Kolmogorov-Smirnov
• Generate a score that indicates overall similarity.
• Percentage of metrics that match in performance.
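The scoring scheme above can be sketched end to end with a hand-rolled two-sample Kolmogorov–Smirnov statistic (in practice you would reach for `scipy.stats.ks_2samp` or `scipy.stats.mannwhitneyu`). The pass threshold `max_gap` is an illustrative assumption, not a value from the talk.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully disjoint)."""
    points = sorted(set(a) | set(b))
    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

def canary_score(baseline, canary, max_gap=0.3):
    """Compare each metric's distribution between baseline (v1.0) and
    canary (v1.1). The overall score is the percentage of metrics whose
    KS gap is small enough to count as matching in performance."""
    matches = sum(ks_statistic(baseline[m], canary[m]) <= max_gap
                  for m in baseline)
    return matches / len(baseline)
```

For example, a canary whose latency distribution matches the baseline but whose CPU usage has shifted sharply scores 0.5 (one of two metrics matching), which a deployment pipeline could compare against a promote/rollback threshold.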