Netflix strives to provide an amazing experience to each member. To accomplish this, Netflix must maintain very high availability across its systems. At a certain scale, however, humans can no longer monitor the status of all systems themselves, so it is critical for Netflix to build tools and platforms that automatically monitor production environments and make intelligent real-time operational decisions to remedy the problems they identify. In this session, we discuss how Netflix uses data mining and machine learning techniques to automate these decisions in real time, with the goal of supporting operational availability, reliability, and consistency. We review how we arrived at the current state, the lessons we learned, and the future of real-time analytics at Netflix. While Netflix operates at a larger scale than most companies, we believe the approaches and technologies we discuss are highly relevant to other production environments, and audience members should come away with actionable ideas that can be implemented in, and benefit, most other environments.
18. MMO: Most Memorable Outage
• One device (out of ~10³)
• One test cell (out of ~10¹)
• One test (out of ~10⁴)
• Couldn’t view House of Cards S3E1
• For a week
19. Scale At Scale
We have weird, device-specific problems all the time, and interactions with A/B tests only make them more complicated, so I'm not sure we have a pat moral of the story except that we really like alerting and fast responses.
- Matt McCarthy
20. Bad News About the Cloud
• Infrastructure no longer the bottleneck
• Before: Weeks to change infrastructure
• After: API call
• TTD expectations vastly higher
• AWS makes us the lameness bottleneck
21. Good News About the Cloud
• Infrastructure no longer the bottleneck
• Before: Weeks to change infrastructure
• After: API call
• Rapid recovery, automated response possible
• AWS: Enabling productive laziness for 9 years and counting
22. Don’t Forget to Bring a Towel!
Monitoring Capabilities You’ll Find Useful
• Time series
• Event Streaming
• Dependency Discovery and Inspection
26. Predictive Scaling
Auto Scaling is reactive.
• SCALE UP by 10%
• WHEN Requests Per Second > 120
• FOR 10 consecutive minutes
• FOLLOWED-BY a cool-down of 15 minutes
35. Predictive Scaling
Predictive-reactive Auto Scaling
• A hybrid approach
• Predict the workload of a cluster in advance and proactively scale.
• Use auto scaling to handle unexpected surges in workload.
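One minimal way to sketch the hybrid idea: forecast a cluster's workload for a time slot from historical observations of the same slot, size the cluster proactively with some headroom, and keep the reactive policy as a floor-raising safety net for surges. All names and parameters here (per_server_rps, headroom) are illustrative assumptions; real predictive scalers use far richer models of daily and weekly traffic patterns.

```python
import math

def predicted_capacity(history, slot, per_server_rps=10.0, headroom=1.3):
    """Forecast the workload for `slot` as the mean of past observations at
    that slot (e.g. the same hour on previous days), then size the cluster
    with headroom. A deliberately simple stand-in for a real predictor."""
    samples = [day[slot] for day in history]
    forecast = sum(samples) / len(samples)
    return math.ceil(forecast * headroom / per_server_rps)

def hybrid_min_size(history, slot, reactive_min, **kw):
    """Hybrid policy: the predictive plan proactively sets the cluster size,
    while the reactive auto scaling minimum still guards against
    unexpected surges the forecast did not anticipate."""
    return max(predicted_capacity(history, slot, **kw), reactive_min)
```

For example, with two days of history `[[100, 200], [140, 220]]`, slot 0 forecasts 120 RPS and sizes the cluster at 16 servers; if the reactive policy currently demands 20, the hybrid takes the larger value.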
55. Server Outlier Detection
Netflix runs on thousands of servers
• A small percentage of servers become unhealthy.
• Customer experience may be degraded.
• Time wasted looking for evidence.
58. Server Outlier Detection
Cluster Analysis
• Unsupervised machine learning.
• If a server belongs to a group, it should be near lots of other points as measured by some distance function.
Assumption
• Servers running the same hardware and software should behave similarly.
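The assumption above can be turned into a simple distance-based detector: represent each server as a vector of metrics and flag any server whose mean distance to the rest of the cluster is far above the norm. This is a sketch of the idea, not Netflix's detector; the threshold factor is an arbitrary assumption, and density-based algorithms such as DBSCAN apply the same intuition more robustly.

```python
def outlier_servers(metrics, factor=2.0):
    """Flag servers far from the rest of the cluster.

    `metrics` maps server name -> tuple of metric values (e.g. CPU,
    latency, error rate). Per the slide's assumption, healthy servers on
    identical hardware/software should sit near many other points; a
    server whose mean distance to the others greatly exceeds the cluster's
    typical (median) mean distance is reported as an outlier."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    names = list(metrics)
    avg = {n: sum(dist(metrics[n], metrics[m]) for m in names if m != n)
              / (len(names) - 1)
           for n in names}
    typical = sorted(avg.values())[len(avg) // 2]  # median mean-distance
    return [n for n in names if avg[n] > factor * typical]
```

A server reporting (95, 40) among peers clustered around (50, 10) is flagged; a tight, healthy cluster yields no outliers. The flagged names would then feed the remediation actions on the next slide.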
61. Server Outlier Detection
Actions / Remediation
• Send e-mail
• Page service owner
• Terminate instance
• Remove from service
• Detach from a load balancer
63. Automated Canary Analysis
Canary Release Process
• A change is gradually rolled out to production.
• Checkpoints are performed along the way.
• A decision is made at each checkpoint.
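The three bullets above amount to a loop over rollout stages with a go/no-go decision at each checkpoint. A minimal sketch, assuming an `evaluate` callback that returns a canary score in [0, 1] (the stages, the threshold, and the return shape are all illustrative):

```python
def canary_rollout(stages, evaluate, pass_threshold=0.95):
    """Gradually shift traffic through `stages` (e.g. [1, 5, 25, 100] percent).
    At each checkpoint, `evaluate(pct)` scores the canary against the
    baseline; a failing score aborts the rollout, otherwise we proceed."""
    for pct in stages:
        score = evaluate(pct)           # checkpoint: compare canary vs. baseline
        if score < pass_threshold:
            return ("rollback", pct)    # decision: abort at this stage
    return ("promote", stages[-1])      # decision: complete the rollout
```

The interesting engineering lives inside `evaluate`: that is the automated analysis step described a few slides below, where metric distributions from the two versions are compared statistically.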
64. Automated Canary Analysis
Advantages
• Better degree of trust and safety in deployments.
• Faster deployment cadence.
• Lower investment in simulation engineering.
65. Automated Canary Analysis
Canary Process
[Diagram: the load balancer splits traffic 95% to the current version (v1.0, 100 servers) and 5% to the new version (v1.1, 5 servers); metrics are collected from both groups.]
66. Automated Canary Analysis
Canary Process
[Diagram: after promotion, the load balancer sends 100% of traffic to the new version (v1.1, now 100 servers); the current version (v1.0) is drained to 0 servers; metrics continue to be collected.]
68. Automated Canary Analysis
Automated Analysis
• Identify a set of metrics to compare.
• Use a statistical test to identify differences between v1.0 and v1.1:
• Mann–Whitney
• Kolmogorov-Smirnov
• Generate a score that indicates overall similarity.
• Percentage of metrics that match in performance.
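The scoring scheme above can be sketched end to end with a hand-rolled two-sample Kolmogorov–Smirnov statistic (in practice you would reach for `scipy.stats.ks_2samp` or `scipy.stats.mannwhitneyu`). The pass threshold `max_gap` is an illustrative assumption, not a value from the talk.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully disjoint)."""
    points = sorted(set(a) | set(b))
    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

def canary_score(baseline, canary, max_gap=0.3):
    """Compare each metric's distribution between baseline (v1.0) and
    canary (v1.1). The overall score is the percentage of metrics whose
    KS gap is small enough to count as matching in performance."""
    matches = sum(ks_statistic(baseline[m], canary[m]) <= max_gap
                  for m in baseline)
    return matches / len(baseline)
```

For example, a canary whose latency distribution matches the baseline but whose CPU usage has shifted sharply scores 0.5 (one of two metrics matching), which a deployment pipeline could compare against a promote/rollback threshold.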