Making Observability Actionable At Scale - DBS DevConnect 2019

Many organisations already possess a vast amount of existing data about production systems. As customer expectations evolve, organisations are often challenged to find more proactive ways of dealing with traditionally reactive incident response activity. In this talk, we discuss approaches to unlock value from this data by making it truly actionable. Understanding production failure modes better, enriching technical and business context effectively, decomposing response activity into shared primitives, actions and workflows, and overall, sharing and augmenting this active knowledge repository on a continuous basis are key takeaways. Through case studies, we'll discuss how we can accomplish this by engineering your observability processes and tooling to work for human-in-the-loop interpretation and response rather than a purely human-reliant strategy.

Published in: Technology

  1. 1. Making Observability Actionable At Scale. Sisir Koppaka, CTO, Squadcast. DevConnect Conference 2019, DBS Asia Hub, Academy Singapore.
  2. 2. Hi there! Squadcast: building a simple and free incident response tool to help increase adoption of Site Reliability Engineering (SRE). Built real-time data science pipelines at two different startups in NYC and BLR. Research at MIT: can disease diagnosis and tracking be automated with ultrasound? Studied Reliability & Production Engineering at IIT Kharagpur. What I’m definitely NOT: an expert in Banking! Building reliable software at scale is really hard (School of Hard Knocks).
  3. 3. Squadcast: a system of engagement for managing reliability end-to-end, combining human + machine data. Democratize SRE! Service Definitions; Service Level Objectives (SLOs); Service Level Indicators (SLIs); Error Budgets; and ACTIONS, which is what we’ll focus on for this talk.
  4. 4. What Service Level Objectives (SLOs) Look Like. SLOs, SLIs, Error Budgets, and SRE best practices like generic mitigation help limit toil and help you turn the vicious cycle into a virtuous cycle.
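As a rough illustration of the error-budget idea on this slide, here is a minimal sketch of the usual arithmetic: the budget is the share of events that may fail while the SLO is still met. The function names and the example figures are assumptions made for illustration; they are not taken from the talk or from Squadcast.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events that may fail while still meeting the SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 0.0 if failed_events else 1.0
    return 1.0 - failed_events / budget


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% availability SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A remaining fraction near zero is the usual signal to slow down risky changes; a comfortable remainder leaves room to ship faster.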
  5. 5. 5
  6. 6. What does Observability really mean? How well are you able to infer a system’s internal state given its output? (Diagram: Input, System, Output.) Excellent observability: Proactive Customer Success; High TBF; Low TTA, TTR; Transparency; Predictable Change Velocity; Low Toil; Sticky Customers. Not-so-good observability: Reactive Customer Success; Low TBF; High TTA, TTR; Lack of Transparency; Unpredictable Change Velocity; High Toil; Meh Customers.
  7. 7. Pillars of Observability*: Logs, Metrics, Traces. (*An apolitical rendition.) What does Observability really mean? How well are you able to infer a system’s internal state given its output? (Diagram: Input, System, Output.)
  8. 8. What the data tells us at Squadcast! Time-to-act (TTA) and Time-to-resolve (TTR) are, on average, larger and more variable outside the main working shift.* Incident response globally could be more consistent, transferable, and scalable within organizations. Response patterns cannot be versioned or programmed against, similar to CI/CD circa 2005. Are we at peak observability as a community? No. If we can’t act effectively, we cannot claim peak observability. *Normalized across three 8-hour shifts across the world. Data is not representative of any individual customer.
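To make the per-shift comparison on this slide concrete, here is a minimal sketch of how TTA and TTR could be summarized by 8-hour shift from incident timestamps. The `Incident` fields, the shift boundaries (00:00, 08:00, 16:00), and the sample data are all assumptions for illustration; the talk does not describe how Squadcast computes its figures.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, pstdev


@dataclass
class Incident:
    triggered: datetime   # when the incident was raised
    acted: datetime       # first human or automated action (hypothetical field)
    resolved: datetime    # when the incident was closed


def shift_of(ts: datetime) -> int:
    """Assumed shifts: 0 = 00:00-08:00, 1 = 08:00-16:00, 2 = 16:00-24:00."""
    return ts.hour // 8


def summarize(incidents: list[Incident]) -> dict[int, dict[str, float]]:
    """Mean and population standard deviation of TTA/TTR (seconds) per shift."""
    by_shift: dict[int, list[tuple[float, float]]] = defaultdict(list)
    for inc in incidents:
        tta = (inc.acted - inc.triggered).total_seconds()
        ttr = (inc.resolved - inc.triggered).total_seconds()
        by_shift[shift_of(inc.triggered)].append((tta, ttr))

    summary = {}
    for shift, pairs in by_shift.items():
        ttas = [p[0] for p in pairs]
        ttrs = [p[1] for p in pairs]
        summary[shift] = {
            "tta_mean": mean(ttas), "tta_stdev": pstdev(ttas),
            "ttr_mean": mean(ttrs), "ttr_stdev": pstdev(ttrs),
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data, not real measurements.
    sample = [
        Incident(datetime(2019, 5, 1, 9, 0), datetime(2019, 5, 1, 9, 1), datetime(2019, 5, 1, 9, 10)),
        Incident(datetime(2019, 5, 1, 9, 30), datetime(2019, 5, 1, 9, 32), datetime(2019, 5, 1, 9, 50)),
        Incident(datetime(2019, 5, 2, 2, 0), datetime(2019, 5, 2, 2, 15), datetime(2019, 5, 2, 3, 0)),
    ]
    for shift, stats in sorted(summarize(sample).items()):
        print(shift, stats)
```

Comparing both the mean and the spread per shift is what surfaces the "larger and more variable" pattern the slide describes.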
  9. 9. A Deeper Look (SRE teams at 72 companies). A majority of respondents considered themselves SREs (at well-known companies). 56% were managing between 50-500 services, and 32% were managing 10-50 services.
  10. 10. 10
  11. 11. We may need a fourth pillar to optimize for peak observability by building an active knowledge repository of Actions. Pillars of Observability*: Logs, Metrics, Traces, Actions. (Diagram labels: Data, Impact.)
  12. 12. What are Squadcast Actions? Quick Demo. Actions (Primitives): squadctl circleci:rebuild platform-js/master/latest, in the general form squadctl namespace:action :repo/branch/tag. Runbooks for the long tail of response activity: Markdown-supported active runbooks in a language of your choice.
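The command form on this slide, namespace:action followed by repo/branch/tag, can be sketched as a small parser. This is an illustration only: the grammar is inferred from the single example shown on the slide, and `ActionInvocation` and `parse_invocation` are hypothetical names, not part of the real squadctl tool.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionInvocation:
    namespace: str  # e.g. "circleci"
    action: str     # e.g. "rebuild"
    repo: str
    branch: str
    tag: str


def parse_invocation(command: str) -> ActionInvocation:
    """Parse 'squadctl namespace:action repo/branch/tag' (assumed grammar)."""
    parts = shlex.split(command)
    if len(parts) != 3 or parts[0] != "squadctl":
        raise ValueError("expected: squadctl namespace:action repo/branch/tag")

    namespace, sep, action = parts[1].partition(":")
    if not (namespace and sep and action):
        raise ValueError("expected namespace:action, got %r" % parts[1])

    # Simplification: assumes no '/' inside the repo, branch, or tag names.
    target = parts[2].split("/")
    if len(target) != 3 or not all(target):
        raise ValueError("expected repo/branch/tag, got %r" % parts[2])

    return ActionInvocation(namespace, action, *target)


if __name__ == "__main__":
    print(parse_invocation("squadctl circleci:rebuild platform-js/master/latest"))
```

Treating each primitive as structured data like this is one way to make response patterns something that can be versioned, audited, and composed into larger workflows.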
  13. 13. Building Actions: a few things we learnt along the way. Don’t Repeat Yourself (DRY); Audit Trails with immutable log; Continuous Security; Composing Action Primitives into Workflows; Continuous Feedback in the SDLC; Heterogeneous Workloads become easier to support; Hybrid Cloud; and many more...
  14. 14. Let’s look at a real example. A Fortune 100 enterprise has over 100 TB of release artifacts, growing at a double-digit percentage every year. They have different Engineering teams for each product line, a NOC that routes production incidents to the appropriate team, a SOC, and so on. Can we unlock additional value by taking more actions during incident response that improve observability and, thereby, change velocity? Use Cases: ➔ Automatically flagging build artifacts for telemetry spikes, and rolling back ➔ Flagging build artifacts for new vulnerabilities and automated rollbacks ➔ Scaling production environment based on external events such as traffic spikes ➔ And many more
  15. 15. Release Promotion and the SRE Loop, for a simple workload. Diagram labels (stages 1, 2, 3): VCS; Dev Artifacts; Quality Gate; Staging Artifacts; Quality Gate; GA Artifacts; Quality Gate; Production Artifacts; Quality Gate; Triage; Generic Remediation; SLO Breach; Incident Routing; Root Cause Analysis.
  16. 16. Release Promotion and the SRE Loop. Diagram labels (stages 1, 2, 3): VCS; Dev Artifacts; Quality Gate; Staging Artifacts; Quality Gate; GA Artifacts; Quality Gate; Production Artifacts; Quality Gate; Triage; Generic Remediation; SLO Breach; Incident Routing; Root Cause Analysis. Motivation: improving Observability can reduce the drag force on change velocity.
  17. 17. Drag Force Reduction At Scale With Superior Traceability - Backpropagate accurate and real-time metadata associated with releases to JFrog Artifactory (example used here) or Sonatype Nexus - Use metadata to programmatically drive incident response using Artifactory Query Language in Squadcast Runbooks. Quick Demo
  18. 18. How Squadcast Works - Squadcast Actions and Runbooks trigger programmatic response during incident response - Human-in-the-loop, machine-assisted - Primitives can be composed: primitives to snippets to more complicated workflows - Functional from all interfaces including mobile, ‘coz incidents happen anytime, anywhere.
  19. 19. Understanding Failure Modes. Known Knowns (ex: that telemetry spike): Automate. Known Unknowns (ex: external traffic spikes): Prepare, then human-in-loop. Unknown Knowns (ex: vulnerabilities): Prepare, then human-in-loop. Unknown Unknowns: Convert to others. Let’s start the clock!
  20. 20. Responding to Failure Modes. Known Knowns (ex: telemetry spike): Automate. Known Unknowns (ex: external traffic spikes): Prepare, then human-in-loop. Unknown Knowns (ex: vulnerabilities): Prepare, then human-in-loop. Unknown Unknowns: Convert to other 3 types. Let’s start the clock! What we’ll take a look at in the Demo.
  21. 21. DEMO. 1. Improving traceability by building a loop between release metadata / change requests and incident response. 2. Enrich production context by annotating Actions more comprehensively in your visualization tool, like Grafana. 3. Try at home: improve and automate response to vulnerabilities on a real-time basis (you can start with automating response to vulnerabilities from Snyk). Known Knowns (ex: telemetry spike): Automate. Unknown Knowns (ex: vulnerabilities): Prepare, then human-in-loop. Known Unknowns (ex: external traffic spikes): Prepare, then human-in-loop.
  22. 22. Known Knowns (ex: telemetry spike): Automate.
  23. 23. Known Knowns (ex: telemetry spike): Automate.
  24. 24. Here’s one more idea... Actions help make your system more Observable.
  25. 25. What does the modern enterprise gain from the fourth pillar of Observability? Top 3 Priorities of the Modern Enterprise*: 88% Revenue Acceleration; 71% Improved Agility and faster Time to Market; 47% Cost Reduction; 29% Better Management of Regulatory and Compliance Risks; 29% Increased CSAT; 41% Other (Brand, Strategic, Financial). *McKinsey Digital Survey of CIOs/CTOs at 52 enterprises. 78 percent work at orgs with 5,000+ employees, and 44 percent work at companies with annual revenues of $10 billion+.
  26. 26. Thank you! t: @sisirkoppaka / @squadcastHQ e: sisir@squadcast.com
