SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at LinkedIn
LinkedIn’s production stack is made up of over 900 applications, 2,200 internal APIs and hundreds of databases. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner.

In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service.

We’ll discuss the approach we used in building the correlation engine and how it has been used at LinkedIn to reduce incident impact and provide a better quality of life to LinkedIn’s on-call engineers.

1. Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Jeff Weiner, Chief Executive Officer; Michael Kehoe, Staff SRE; Rusty Wickell, Sr. Operations Engineer

2. False Escalations
• Have you ever?
• Been woken because your service is unhealthy because of a dependency
• Been woken because someone believes your service is responsible
• Spent hours trying to work out why your service is broken

3. Today’s agenda: 1 Introductions, 2 The Problem Statement, 3 Architecture Considerations, 4 Platform Overview, 5 Ecosystem Integration, 6 Key Takeaways, 7 Q&A
4. Introduction

5. Who are we? PRODUCTION-SRE TEAM AT LINKEDIN
• Assist in restoring stability to services during site-critical issues
• Develop applications to improve MTTD and MTTR
• Provide direction and guidelines for site monitoring
• Build tools for efficient site-issue detection, correlation & troubleshooting

6. Problem Statement

7. Problem Statement: Service Complexity, Learning Curve, Reliability, MTTR

8. Problem Statement
• Learning Curve: understanding services is harder
• High MTTR: complexity delays identification of cause
• False Escalations: lack of understanding results in false escalations
9. Project Goals

10. Project Goals
• Unified API: internal application shows high latency/errors
• Web Frontend: external monitoring shows high latency/errors

11. Project Goals
• Reduce MTTR: reduce impact on members
• Reduce False Escalations: fewer disruptions to on-call SREs

12. Project Goals
• Applicable use-cases: internal application shows high latency/errors
• Non-applicable use-cases: external monitoring shows high latency/errors

13. Architecture Considerations
14. Architecture Considerations
• Real-Time Metrics Analysis: running metric correlation via stream processing
• Ad-Hoc Metric Analytics: metric correlation on demand
• Alert Correlation: processing alerts and performing alert correlation

15. Architecture Considerations: REAL-TIME METRIC ANALYTICS
• Pros: fast response time; ability to do advanced analytics in real time
• Cons: resource intensive = expensive

16. Architecture Considerations: AD-HOC METRIC ANALYTICS
• Pros: smaller resource footprint
• Cons: analysis time is slow

17. Architecture Considerations: ALERT CORRELATION
• Pros: leverages already existing alerts; strong signal-to-noise ratio
• Cons: analysis constrained to alerts only (boolean state)

18. Architecture Considerations: EVALUATION
• Alert correlation gives us a strong signal
• Real-time analytics is expensive, but useful
• Ad-hoc metric analytics is slower, but cheaper
19. Platform Overview

20. Platform Overview
• Call Graph: understanding how services depend on each other
• Ad-Hoc Metric Correlation: K-Means analysis
• Alert Correlation: using alerts to confirm performance
• Recommendations Engine: collating and decorating data

21. Correlation Engine Overview: Architecture (components: Callgraph-api, Callgraph-be, correlate-fe, drilldown, invisualize, site-stabilizer)

22. Problem Statement: Service Complexity, Learning Curve, Reliability, MTTR

23. Learning Curve
• Scattered Knowledge
• Outdated Documentation
• Poor Dependency Understanding

24. Callgraph
• Stores: callcount, latency and error rates
• Created: programmatically
• Interface: an API and a user interface
• Lookup: service/API
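To make the Callgraph interface concrete, here is a minimal sketch of what a lookup against a callgraph API could look like. The endpoint, parameter names and response shape are assumptions for illustration only; the talk states just that the callgraph stores callcount, latency and error rates per service/API and exposes an API and a user interface.

```python
import requests

# Hypothetical callgraph API endpoint; the real service name and schema
# are not given in the talk, so this is an illustrative sketch only.
CALLGRAPH_API = "https://callgraph-api.example.internal/v1/edges"

def get_downstreams(service, colo, window_minutes=15):
    """Fetch downstream edges (callcount, latency, error rate) for a service."""
    resp = requests.get(
        CALLGRAPH_API,
        params={"source": service, "colo": colo, "window": window_minutes},
        timeout=5,
    )
    resp.raise_for_status()
    # Assumed response shape: one record per (destination, endpoint, protocol).
    return [
        {
            "destination": e["destination"],
            "endpoint": e["endpoint"],
            "protocol": e["protocol"],
            "callcount": e["callcount"],
            "avg_latency_ms": e["avg_latency_ms"],
            "error_rate": e["error_rate"],
        }
        for e in resp.json()["edges"]
    ]
```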
25. How do we map services?
• Service Discovery: services, APIs, protocols
• Metrics: destination service, endpoint, protocol

26. Site Stabilizer | Real-Time and Ad-Hoc Metrics Analysis

27. Approaches that we tried
• Threshold. Challenge: not all metrics had thresholds
• Statistical Approaches. Challenge: expensive real-time processing, tuning based on the individual metric’s behaviour
• Machine Learning. Challenge: expensive real-time processing, tuning based on the individual metric’s behaviour

28. Clustering Algorithm: K-Means
• Partitions n observations into k clusters
• The model can be trained and saved
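As an illustration of this k-means step, the sketch below trains a k-means model on per-metric feature vectors and persists it for later scoring. scikit-learn and joblib, and the toy features, are assumptions here; the talk does not name the libraries or features used.

```python
import numpy as np
from sklearn.cluster import KMeans
import joblib

# Toy feature matrix: one row per time-series window
# (e.g. mean, stddev, max deviation) -- illustrative features only.
X = np.array([
    [0.20, 0.05, 0.30],
    [0.21, 0.04, 0.28],
    [0.90, 0.40, 1.20],   # an outlying window
    [0.19, 0.06, 0.31],
])

# Partition the n observations into k clusters.
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# The trained model can be saved and reloaded for later use.
joblib.dump(model, "kmeans_metric_model.joblib")
model = joblib.load("kmeans_metric_model.joblib")

# Distance to the nearest cluster center can serve as an anomaly signal.
distances = np.min(model.transform(X), axis=1)
print(model.labels_, distances)
```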
29. K-Means: how it works (diagram of clusters 1-3 and their cluster centers)

30. Predict score (cluster diagram with clusters 1-3 and their cluster centers)

31. Ranking Using K-Means
• Predict score
• Trend score: based on the trend of the time series
• WoW: leverages week-on-week data
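One possible way to combine these three signals into a single ranking is sketched below; the weights and the specific formulas for the trend and week-on-week scores are illustrative assumptions, not the scoring used at LinkedIn.

```python
import numpy as np

def predict_score(model, features):
    """Distance from the nearest k-means cluster center (higher = more anomalous)."""
    features = np.asarray(features).reshape(1, -1)
    return float(np.min(model.transform(features)))

def trend_score(series):
    """Magnitude of the slope of a simple linear fit over the recent time series."""
    x = np.arange(len(series))
    slope = np.polyfit(x, series, 1)[0]
    return abs(float(slope))

def wow_score(current, last_week):
    """Relative change of the current window versus the same window last week."""
    baseline = max(abs(float(np.mean(last_week))), 1e-9)
    return abs(float(np.mean(current)) - float(np.mean(last_week))) / baseline

def rank_metric(model, features, series, last_week_series, weights=(0.5, 0.25, 0.25)):
    """Weighted combination of the three signals; the weights are illustrative."""
    scores = (
        predict_score(model, features),
        trend_score(series),
        wow_score(series, last_week_series),
    )
    return sum(w * s for w, s in zip(weights, scores))
```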
32. Typical Workflow
• Identify: identify the critical metrics using the k-means method
• Drilldown: drill down to the corresponding critical services

33. inVisualize | Alert Correlation and Visualization

34. inVisualize | Alert Correlation and Visualization
• Polls the monitoring system continuously for alerts
• Ingests and represents callcount, average latency and error rate from callgraph
• Correlates downstream alerts using Callgraph

35. inVisualize assumptions
• The more alerts a service has, the more likely it is affected or broken
• The higher the change in latency/error rate to a downstream, the more likely that downstream is broken
• The higher the callcount to a downstream, the more valuable it is
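These assumptions can be turned into a simple per-downstream score. The sketch below is one hedged interpretation with made-up weights and normalization, not the actual inVisualize scoring; it only reuses the fact from the talk that the final score is normalized to 0-100.

```python
def score_service(alert_count, latency_change_pct, error_change_pct,
                  callcount, max_callcount, weights=(0.4, 0.3, 0.2, 0.1)):
    """Score a downstream using the three stated assumptions.

    - more alerts on the downstream          -> more likely affected/broken
    - bigger latency/error change downstream -> more likely broken
    - higher callcount to the downstream     -> more valuable edge
    The weights and saturation points are illustrative assumptions.
    """
    w_alert, w_latency, w_error, w_call = weights
    alert_signal = min(alert_count / 5.0, 1.0)            # saturate at 5 alerts
    latency_signal = min(abs(latency_change_pct) / 100.0, 1.0)
    error_signal = min(abs(error_change_pct) / 100.0, 1.0)
    call_signal = callcount / max_callcount if max_callcount else 0.0
    raw = (w_alert * alert_signal + w_latency * latency_signal
           + w_error * error_signal + w_call * call_signal)
    return round(100 * raw)  # normalized to 0-100, as in the talk

# Example: a heavily-called downstream with several alerts and a latency spike.
print(score_service(alert_count=4, latency_change_pct=150, error_change_pct=20,
                    callcount=8000, max_callcount=10000))
```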
36. inVisualize

37. inVisualize | Alert Correlation and Visualization
• Saves the states continuously for replay
• Ranks the services based on a score, accessible via an API
• The score is normalized between 0 and 100

38. Recommendation Engine

39. Recommendation Engine
• Input: service, colo, duration
• Collate: collates the outputs from Site Stabilizer and inVisualize
• User Interface: responsible service, SRE team, correlation confidence score
• Decorate: with information such as scheduled changes, deployments and A/B experiments
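The recommendation step can be pictured as a small collation function. The structure below is an assumption for illustration, using only the inputs and outputs named on the slide (service, colo and duration in; responsible service, SRE team and confidence out); the `site_stabilizer`, `invisualize` and `change_log` clients and their methods are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Recommendation:
    responsible_service: str
    sre_team: str
    confidence: int                               # 0-100 correlation confidence score
    context: dict = field(default_factory=dict)   # deployments, scheduled changes, A/B experiments

def recommend(service, colo, duration_minutes, site_stabilizer, invisualize, change_log):
    """Collate Site Stabilizer and inVisualize outputs and decorate with recent changes."""
    metric_suspects = site_stabilizer.critical_downstreams(service, colo, duration_minutes)
    alert_suspects = invisualize.ranked_downstreams(service, colo, duration_minutes)

    # Naive collation: average the scores of downstreams both systems flagged,
    # falling back to whichever system flagged the service at all.
    combined = {}
    for name in {**metric_suspects, **alert_suspects}:
        scores = [s[name] for s in (metric_suspects, alert_suspects) if name in s]
        combined[name] = sum(scores) / len(scores)

    top, confidence = max(combined.items(), key=lambda kv: kv[1])
    return Recommendation(
        responsible_service=top,
        sre_team=invisualize.owner_of(top),
        confidence=round(confidence),
        context={"recent_changes": change_log.recent(top, duration_minutes)},
    )
```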
40. Ecosystem Integration

41. Ecosystem Integration
Nurse plan arguments:
• service-name: my-frontend
• req_confidence = 85
• escalate = true
Query: find what’s wrong with ‘my-frontend’ in DatacenterB.
Result: Service: Service-C; Confidence: 91%; Reason: ‘Service-C’ has high latency after a deploy; Service Owner: SRE.
Escalate to the correct SRE.
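A hedged sketch of how such a plan-argument set could drive an automated escalation decision is shown below. Nurse’s real plan format and APIs are not described in the talk, so the dictionary shape, the `pager` client and the function are assumptions; only the argument names and example values come from the slide.

```python
def maybe_escalate(plan_args, recommendation, pager):
    """Escalate to the owning SRE team only when confidence clears the plan's bar.

    `plan_args` mirrors the example on the slide:
        {"service-name": "my-frontend", "req_confidence": 85, "escalate": True}
    `recommendation` is the Recommendation object from the previous sketch, and
    `pager` stands in for whatever paging/escalation client is in use.
    """
    if not plan_args.get("escalate", False):
        return "escalation disabled in plan"
    if recommendation.confidence < plan_args.get("req_confidence", 100):
        return "confidence below threshold; leaving escalation to the original on-call"
    pager.page(
        team=recommendation.sre_team,
        summary=(f"{plan_args['service-name']} is unhealthy; correlation points to "
                 f"{recommendation.responsible_service} "
                 f"({recommendation.confidence}% confidence)"),
    )
    return f"escalated to {recommendation.sre_team}"
```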
42. Key Takeaways

43. Key Takeaways
• Approach: understand what correlation infrastructure makes sense
• Dependencies: understand dependencies

44. Key Takeaways: Feedback Loops
• Important to get some feedback on accuracy
• Provides a means to do reporting: system effectiveness; engineers saved from escalations
• Use feedback data to train the system = improved results
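One lightweight way to close this feedback loop is to record, per incident, whether the correlation pointed at the right service and whether an escalation was avoided. The sketch below is an illustrative assumption, not LinkedIn’s reporting system.

```python
class FeedbackLog:
    """Track whether each correlation result was confirmed correct by the responder."""

    def __init__(self):
        self.records = []

    def record(self, incident_id, recommended_service, was_correct, escalation_avoided):
        self.records.append({
            "incident": incident_id,
            "recommended": recommended_service,
            "correct": was_correct,
            "escalation_avoided": escalation_avoided,
        })

    def effectiveness(self):
        """Fraction of incidents where the recommendation was correct."""
        if not self.records:
            return 0.0
        return sum(r["correct"] for r in self.records) / len(self.records)

    def engineers_saved(self):
        """Count of escalations avoided thanks to correct correlation."""
        return sum(r["escalation_avoided"] for r in self.records)
```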
45. Team: Michael Kehoe, Rusty Wickell, Reynold PJ, Govindaluri Kishore, Renjith Rejan

46. Q&A
