SlideShare a Scribd company logo
1 of 40
Download to read offline
Three Pillars with Zero Answers
A New Observability Scorecard
December 6, 2018
First, a Critique
Observing microservices is hard
Google and Facebook solved this (right???)
They used Metrics, Logging, and Distributed Tracing…
So we should, too.
The Conventional Wisdom
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics!
Logging!
Tracing!
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain variance
in timeseries data (aka “metrics”) …
… but cardinality
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
The Life of Transaction Data: Dapper
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 000.10%
Flushed out of process App 000.10%
Centralized regionally Regional network + storage 000.10%
Centralized globally WAN + storage 000.01%
Logs Metrics Dist. Traces
TCO scales gracefully
– ✓ ✓
Accounts for all data
(i.e., unsampled) ✓ ✓ –
Immune to cardinality
✓ – ✓
Fatal Flaws: A Review
Data vs UI
Data vs UI
Data vs UI
Metrics
Logs
Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
A New Scorecard
for Observability
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that
a service’s consumers would care about.
… not an indicator of its inner workings
Observability: Quick Vocab Refresher
Observability: Two Fundamental Goals
- Gradually improving an SLI
- Rapidly restoring an SLI
Reminder: “SLI” = “Service Level Indicator”
NOW!!!!
days, weeks, months…
1. Detection: perfect SLI capture
2. Refinement: reduce the search space
Observability: Two Fundamental Activities
An interlude about stats frequency
Specificity:
- Arbitrary dimensionality and cardinality
- Any layer of the stack, including mobile+web!
Fidelity:
- Correct stats!!!
- High stats frequency (i.e., “beware smoothing”!)
Freshness: ≤ 5 second lag
Scorecard >> Detection
# of things your users
actually care about
# of microservices
# of failure modes
Must reduce
the search space!
Scorecard >> Refinement
Scorecard >> Refinement
Identify Variance
Explain Variance
Deploy
Fix
An interlude about variance and “p99”
Scorecard >> Refinement
Identifying Variance:
- Cardinality: understand which tag changed
- Robust stats: histograms (see prev slide)
- Data retention: always “Know What’s Normal”
Explaining variance:
- Correct stats!!!
- “Suppress the messengers” of microservice failures
Wrapping up…
(first, a hint at my perspective)
The Life of Transaction Data: Dapper
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 000.10%
Flushed out of process App 000.10%
Centralized regionally Regional network + storage 000.10%
Centralized globally WAN + storage 000.01%
(Review)
The Life of Transaction Data: Dapper LightStep
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 100.00%
Flushed out of process App 100.00%
Centralized regionally Regional network + storage 100.00%
Centralized globally WAN + storage on-demand
Refinement
- Identifying variance:
unlimited cardinality, hi-fi
histograms, data retention
- “Suppress the messengers”
Detection
- Specificity: unlimited
cardinality, across the
entire stack
- Fidelity: correct stats,
high stats frequency
- Freshness: ≤ 5 seconds
An Observability Scorecard
Thank you!
Ben Sigelman, Co-founder and CEO
twitter: @el_bhs
email: bhs@lightstep.com
Extra slides
Ideal Measurement: Robust
Ideal Measurement: High-Dimensional
Ideal Refinement: Real-time
Must be able to test and eliminate hypotheses quickly
- Actual data must be ≤10s fresh
- UI / API latency must be very low
Ideal Refinement: Global
Ideal Refinement: Context-Rich
We can’t expect humans to know what’s normal
Thank you / Q&A

More Related Content

What's hot

Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
Krishnaram Kenthapadi
 
Orchestration & provisioning
Orchestration & provisioningOrchestration & provisioning
Orchestration & provisioning
buildacloud
 

What's hot (20)

Hands-on Introduction to Machine Learning
Hands-on Introduction to Machine LearningHands-on Introduction to Machine Learning
Hands-on Introduction to Machine Learning
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Tutorial on Deep Learning
Tutorial on Deep LearningTutorial on Deep Learning
Tutorial on Deep Learning
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
Remote-first Team Interactions for Business and Technology Teams @ Berlin CTO...
Remote-first Team Interactions for Business and Technology Teams @ Berlin CTO...Remote-first Team Interactions for Business and Technology Teams @ Berlin CTO...
Remote-first Team Interactions for Business and Technology Teams @ Berlin CTO...
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 
Modern Convolutional Neural Network techniques for image segmentation
Modern Convolutional Neural Network techniques for image segmentationModern Convolutional Neural Network techniques for image segmentation
Modern Convolutional Neural Network techniques for image segmentation
 
Generative AI - The New Reality: How Key Players Are Progressing
Generative AI - The New Reality: How Key Players Are Progressing Generative AI - The New Reality: How Key Players Are Progressing
Generative AI - The New Reality: How Key Players Are Progressing
 
Marketing Analytics using R/Python
Marketing Analytics using R/PythonMarketing Analytics using R/Python
Marketing Analytics using R/Python
 
Gan seminar
Gan seminarGan seminar
Gan seminar
 
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
 
Big data landscape version 2.0
Big data landscape version 2.0Big data landscape version 2.0
Big data landscape version 2.0
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?
 
Network embedding
Network embeddingNetwork embedding
Network embedding
 
FRACTAL GRIDDING: A Visual Technique for Managing Personal, Enterprise, and N...
FRACTAL GRIDDING: A Visual Technique for Managing Personal, Enterprise, and N...FRACTAL GRIDDING: A Visual Technique for Managing Personal, Enterprise, and N...
FRACTAL GRIDDING: A Visual Technique for Managing Personal, Enterprise, and N...
 
Orchestration & provisioning
Orchestration & provisioningOrchestration & provisioning
Orchestration & provisioning
 
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
PPT4: Frameworks & Libraries of Machine Learning & Deep Learning
 
Large Language Models Bootcamp
Large Language Models BootcampLarge Language Models Bootcamp
Large Language Models Bootcamp
 
Communication with Artificial intelligence
Communication with Artificial intelligenceCommunication with Artificial intelligence
Communication with Artificial intelligence
 

Similar to Three Pillars with Zero Answers: A New Observability Scorecard

The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
confluent
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming app
Neil Avery
 
The State of Stream Processing
The State of Stream ProcessingThe State of Stream Processing
The State of Stream Processing
confluent
 
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportMaking Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
QAware GmbH
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Flink Forward
 
How to Quantify the Value of Kafka in Your Organization
How to Quantify the Value of Kafka in Your Organization How to Quantify the Value of Kafka in Your Organization
How to Quantify the Value of Kafka in Your Organization
confluent
 
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
confluent
 

Similar to Three Pillars with Zero Answers: A New Observability Scorecard (20)

Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking Observability
 
Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...
Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...
Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Pr...
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
 
Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming app
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
CQRS and Event Sourcing: A DevOps perspective
CQRS and Event Sourcing: A DevOps perspectiveCQRS and Event Sourcing: A DevOps perspective
CQRS and Event Sourcing: A DevOps perspective
 
The State of Stream Processing
The State of Stream ProcessingThe State of Stream Processing
The State of Stream Processing
 
Modern Operations at Scale within Viasat – How to Structure Teams and Build A...
Modern Operations at Scale within Viasat – How to Structure Teams and Build A...Modern Operations at Scale within Viasat – How to Structure Teams and Build A...
Modern Operations at Scale within Viasat – How to Structure Teams and Build A...
 
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience ReportMaking Runtime Data Useful for Incident Diagnosis: An Experience Report
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Observability at Spotify
Observability at SpotifyObservability at Spotify
Observability at Spotify
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
 
Artificial intelligence - A Teaser to the Topic.
Artificial intelligence - A Teaser to the Topic.Artificial intelligence - A Teaser to the Topic.
Artificial intelligence - A Teaser to the Topic.
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
How to Quantify the Value of Kafka in Your Organization
How to Quantify the Value of Kafka in Your Organization How to Quantify the Value of Kafka in Your Organization
How to Quantify the Value of Kafka in Your Organization
 
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
Connecting Kafka to Cash (CKC) (Lyndon Hedderly, Confluent) Kafka Summit Lond...
 
Streaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamStreaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data Stream
 

More from DevOps.com

Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
DevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
DevOps.com
 

More from DevOps.com (20)

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source Software
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and Snyk
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware Resolution
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident Response
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with Datadog
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or Privately
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid final
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call Culture
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift Environments
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Recently uploaded (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 

Three Pillars with Zero Answers: A New Observability Scorecard

  • 1. Three Pillars with Zero Answers A New Observability Scorecard December 6, 2018
  • 3. Observing microservices is hard Google and Facebook solved this (right???) They used Metrics, Logging, and Distributed Tracing… So we should, too. The Conventional Wisdom
  • 4. The Three Pillars of Observability - Metrics - Logging - Distributed Tracing
  • 8.
  • 10. A word nobody knew in 2015… Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”) … … but cardinality
  • 11. Logging Data Volume: a reality check transaction rate x all microservices x cost of net+storage x weeks of retention ----------------------- way too much $$$$
  • 12. The Life of Transaction Data: Dapper Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 000.10% Flushed out of process App 000.10% Centralized regionally Regional network + storage 000.10% Centralized globally WAN + storage 000.01%
  • 13. Logs Metrics Dist. Traces TCO scales gracefully – ✓ ✓ Accounts for all data (i.e., unsampled) ✓ ✓ – Immune to cardinality ✓ – ✓ Fatal Flaws: A Review
  • 17. Metrics, Logs, and Traces are Just Data, … not a feature or use case.
  • 18. A New Scorecard for Observability
  • 19. “SLI” = “Service Level Indicator” TL;DR: An SLI is an indicator of health that a service’s consumers would care about. … not an indicator of its inner workings Observability: Quick Vocab Refresher
  • 20. Observability: Two Fundamental Goals - Gradually improving an SLI - Rapidly restoring an SLI Reminder: “SLI” = “Service Level Indicator” NOW!!!! days, weeks, months…
  • 21. 1. Detection: perfect SLI capture 2. Refinement: reduce the search space Observability: Two Fundamental Activities
  • 22. An interlude about stats frequency
  • 23. Specificity: - Arbitrary dimensionality and cardinality - Any layer of the stack, including mobile+web! Fidelity: - Correct stats!!! - High stats frequency (i.e., “beware smoothing”!) Freshness: ≤ 5 second lag Scorecard >> Detection
  • 24. # of things your users actually care about # of microservices # of failure modes Must reduce the search space! Scorecard >> Refinement
  • 25. Scorecard >> Refinement Identify Variance Explain Variance Deploy Fix
  • 26. An interlude about variance and “p99”
  • 27. Scorecard >> Refinement Identifying Variance: - Cardinality: understand which tag changed - Robust stats: histograms (see prev slide) - Data retention: always “Know What’s Normal” Explaining variance: - Correct stats!!! - “Suppress the messengers” of microservice failures
  • 29. (first, a hint at my perspective)
  • 30. The Life of Transaction Data: Dapper Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 000.10% Flushed out of process App 000.10% Centralized regionally Regional network + storage 000.10% Centralized globally WAN + storage 000.01% (Review)
  • 31. The Life of Transaction Data: Dapper LightStep Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 100.00% Flushed out of process App 100.00% Centralized regionally Regional network + storage 100.00% Centralized globally WAN + storage on-demand
  • 32. Refinement - Identifying variance: unlimited cardinality, hi-fi histograms, data retention - “Suppress the messengers” Detection - Specificity: unlimited cardinality, across the entire stack - Fidelity: correct stats, high stats frequency - Freshness: ≤ 5 seconds An Observability Scorecard
  • 33. Thank you! Ben Sigelman, Co-founder and CEO twitter: @el_bhs email: bhs@lightstep.com
  • 37. Ideal Refinement: Real-time Must be able to test and eliminate hypotheses quickly - Actual data must be ≤10s fresh - UI / API latency must be very low
  • 39. Ideal Refinement: Context-Rich We can’t expect humans to know what’s normal
  • 40. Thank you / Q&A