Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Reducing MTTR and False Escalations: Event Correlation at LinkedIn

516 views

Published on

LinkedIn’s production stack is made up of over 900 applications and over 2200 internal API’s. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner.

In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SRE’s who own the unhealthy service.

We’ll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIn’s oncall engineers.

Published in: Internet
  • Be the first to comment

  • Be the first to like this

Reducing MTTR and False Escalations: Event Correlation at LinkedIn

  1. 1. Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability Engineer LinkedIn
  2. 2. Have you ever? 2 False Escalations • Been woken because your service is unhealthy because of a dependency? • Been woken because someone believes your service is responsible? • Spent hours trying to work out why your service is broken?
  3. 3. 3 Agenda • Project Problem Statement • Project Goals • Architecture Considerations • Correlation Engine Overview • Results & Takeaways • Questions
  4. 4. $ whoami 4 Michael Kehoe • Staff Site Reliability Engineer (SRE) @ LinkedIn • Production-SRE team • Funny accent = Australian + 3 years American
  5. 5. $ whatis PROD-SRE 5 Michael Kehoe • Production-SRE • Develop applications to improve MTTD and MTTR • Build tools for efficient site issue troubleshooting, issue detection & correlation • Provide direction on site monitoring • Assist in restoring stability to services during site critical issues
  6. 6. 6 Problem Statement Service Complexity Learning Curve MTTR Reliability
  7. 7. Project Technical Goal 7 Problem Statement Find problem with a service between a given time period (or ongoing) using: Unified API Web Frontend
  8. 8. Project Success Criteria 8 Problem Statement • Reduce MTTR on incidents • Reduce false/ needless escalations
  9. 9. Expected Use-Cases 9 Problem Statement Applicable use-cases: • A service has high latency or error rates • Find the problematic service(s) Non-applicable use-cases: • External monitoring services show slow page-load times
  10. 10. 10 Architecture Considerations Real-Time metrics analytics (stream processing) Ad-Hoc metrics Analytics Alert Correlation
  11. 11. Evaluation 11 Architecture Considerations • Real-Time metrics analytics (stream processing) • Pros • Fast response time • Ability to do advanced analytics in real-time • Cons • Resource intensive (especially at LinkedIn scale)
  12. 12. Evaluation 12 Architecture Considerations • Ad-Hoc metric analytics • Pros • Smaller resource footprint • Cons • Analysis time is slow
  13. 13. Evaluation 13 Architecture Considerations • Alert Correlation • Pros • Leverage already existing alerts • Strong signal-to-noise ratio • Cons • Analysis constrained to alerts only (boolean state)
  14. 14. Evaluation 14 Architecture Considerations • Real-time analytics is expensive, but useful • Ad-Hoc metric analytics is slower, but cheaper • Alert Correlation gives us strong signal
  15. 15. 15 Correlation Engine Overview At LinkedIn, we had two smaller projects that we could leverage Drilldown + Site-Stabilizer Near-Time metric analytics & event correlation Invisualize Alert Correlation Existing knowledge available
  16. 16. Where to get started 16 Correlation Engine Overview The ability to correlate is great! But you need to understand dependencies Build a callgraph!
  17. 17. Callgraph 17 Correlation Engine Overview LinkedIn applications emit metrics on a per-API and per-dependency basis Map metrics to understand dependencies Simple to build callgraph platform!
  18. 18. Callgraph 18 Correlation Engine Overview Callgraph-be Voldemort (RO Datastore) Espresso (RW Datastore) Collect: ● Call count ● Latency
  19. 19. drilldown (Near-Time analytics) 19 Correlation Engine Overview Using callgraph, identifies high-value dependencies (and the associated metrics) In 5min chunks, analyses high-value metrics Using a k-means unsupervised algorithm, find similar trends between service metrics Queryable API Outputs correlation confidence scores Normalised between 0-100
  20. 20. inVisualize (Alert Correlation) 20 Correlation Engine Overview inVisualize analyses alerts (in realtime) from each service Use callgraph to calculate the unhealthy service and affected services Queryable API Results normalised between 0-100 Visualizes impact
  21. 21. inVisualize 21 Correlation Engine Overview
  22. 22. Site-Stabilizer 22 Correlation Engine Overview Backend service Collates recommendations from Drilldown & inVisualize Decorates recommendations with: Scheduled changes Deployment events A/B experiment changes
  23. 23. Architecture 23 Correlation Engine Overview Callgraph-api Callgraph-be drilldown invisualize site-stabilizer
  24. 24. Correlate-fe 24 Correlation Engine Overview API for automation Auto-remediation Alert suppressing UI for manual introspection
  25. 25. Correlate-fe 25 Correlation Engine Overview User Interfaces gives Responsible service Correlation Confidence Root cause SRE team Analysis
  26. 26. Architecture 26 Correlation Engine Overview Callgraph-api Callgraph-be correlate-fe drilldown invisualize site-stabilizer
  27. 27. Latency Alert NURSE Nurse Plan arguments • service-name: my-frontend • req_confidence = 85 • escalate = True Escalate to correct SRE Find what’s wrong with ‘my-frontend’ in DatacenterB IrisAlert Correlation API Service: Service-C Confidence: 91% Reason: ‘Service-C’ has high latency after a deploy Service Owner: SRE
  28. 28. 28 Early Results Siteops (NOC) has greater visibility on the site Reducing MTTR Reducing false escalations
  29. 29. 29 Conclusion Understand what correlation approach makes sense for you Understand your dependencies Build, Integrate and benefit!
  30. 30. 30 Team Govindaluri Kishore Renjith Rajan Reynold Perumpilly Rusty Wickell Michael Kehoe
  31. 31. 31 Questions?
  32. 32. ©2014 LinkedIn Corporation. All Rights Reserved.©2014 LinkedIn Corporation. All Rights Reserved.
  33. 33. Callgraph 33 Correlation Engine Overview Callgraph-be RestLi (Internal API’s) Voldemort (RO Datastore) Espresso (RW Datastore) Call count Latency
  34. 34. Architecture 34 Correlation Engine Overview Callgraph-api Callgraph-be correlate-fe drilldown invisualize site-stabilizer

×