SlideShare a Scribd company logo
1 of 44
Download to read offline
Three Pillars, No Answers: Helping Platform
Teams Solve Real Observability Problems
Austin Parker, Principal Developer Advocate at Lightstep
Who Am I?
Austin Parker
Principal Developer Advocate
@austinlparker
austin@lightstep.com✉
Part 1: A
Critique
The Conventional Wisdom
● Observing microservices is hard
● Google and Facebook solved this (right???)
● They used Metrics, Logging, and Distributed Tracing…
● So we should, too.
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics
Logging
Tracing
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain
variance in timeseries data (aka “metrics”)
…… but cardinality
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
The Life of Transaction Data: Dapper
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 000.10%
Flushed out of process App 000.10%
Centralized regionally Regional network + storage 000.10%
Centralized globally WAN + storage 000.01%
Fatal Flaws: A Review
Logs Metrics Dist. Traces
TCO scales gracefully
– ✓ ✓
Accounts for all data
(i.e., unsampled) ✓ ✓ –
Immune to cardinality
✓ – ✓
Data vs UI
Data vs UI
Data vs UI
Metrics
Logs
Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
Part 2: A New
Scorecard for
Observability
Mental Model: Goals and Activities
● Goals: how our services perform in the eyes of
their consumers
● Activities: what we (as operators) actually do
to further our goals
Quick Vocab Refresher: SLIs
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that a
service’s consumers would care about.
… not an indicator of its inner workings
Observability: 2 Fundamental Goals
Gradually improving an SLI
Rapidly restoring an SLI
Reminder: “SLI” = “Service Level Indicator”
NOW!!!!
days, weeks, months…
Observability: 2 Fundamental Activities
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search
space for plausible explanations
An interlude about stats frequency
Scorecard: Detection
1. Specificity:
- Cost of cardinality ($ per tag value)
- Stack support (mobile/web platforms, managed services, “black-box
OSS infra” like Kafka/Cassandra)
2. Fidelity:
- Correct stats!!! (global p95, p99)
- High stats frequency (stats sampling frequency, in seconds)
3. Freshness (lag from real-time, in seconds)
Why “Refinement”?
# of things your users
actually care about
# of microservices
# of failure modes
Must reduce
the search space!
The Refinement Process
Discover Variance
Explain Variance
Deploy
Fix
Histograms vs “p99”
Scorecard: Refinement
Identifying Variance:
- Cardinality ($ per tag value)
- Robust stats (histograms (see prev slide))
- Retention horizons for plausible queries (time duration)
Explaining variance:
- Correct stats!!! (global p95, p99)
- “Suppress the messengers” of microservice failures
Wrapping Up...
(first, a hint at my
perspective)
A fun game! (“Observability Whack-a-Mole”)
Design your own observability system:
❏ High-throughput
❏ High-cardinality
❏ Lengthy retention window
❏ Unsampled
Choose three
The Life of Trace Data:
Dapper
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 000.10%
Flushed out of process App 000.10%
Centralized regionally Regional network + storage 000.10%
Centralized globally WAN + storage 000.01%
The Life of Trace Data:
Dapper Other Approaches
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 100.00%
Flushed out of process App 100.00%
Centralized regionally Regional network + storage 100.00%
Centralized globally WAN + storage “fancy”
An Observability Scorecard
Detection
- Specificity: cardinality cost,
stack coverage
- Fidelity: correct stats, high stats
frequency
- Freshness: ≤ 5 seconds
Refinement
- Identifying variance: cardinality
cost, correct stats, hi-fi
histograms, retention horizons
- “Suppress the messengers”
LightStep: Observability with context
Automatic deployment and regression detection
System and service diagrams
Real-time and historical root cause analysis
Correlations
Custom alerting
Easy Setup with no vendor lock-in
No cardinality limitations, really
Q&A
Get Started Today
go.lightstep.com/trial
Extra Slides
Ideal Measurement: Robust
Ideal Measurement: High-Dimensional
Ideal Refinement: Real-time
Must be able to test and eliminate hypotheses
quickly
Actual data must be ≤10s fresh
UI / API latency must be very low
Ideal Refinement: Global
Ideal Refinement: Context-Rich
We can’t expect humans to know what’s normal

More Related Content

What's hot

Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?Splunk
 
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOpsSplunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOpsSplunk
 
Splunk Overview
Splunk OverviewSplunk Overview
Splunk OverviewSplunk
 
.conf21 - The Best of
.conf21 - The Best of.conf21 - The Best of
.conf21 - The Best ofSplunk
 
Security Automation & Orchestration
Security Automation & OrchestrationSecurity Automation & Orchestration
Security Automation & OrchestrationSplunk
 
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...Splunk
 
Manufacturing Webinar AMS
Manufacturing Webinar AMSManufacturing Webinar AMS
Manufacturing Webinar AMSSplunk
 
SplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk OverviewSplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk OverviewSplunk
 
The Risks and Rewards of AI
The Risks and  Rewards of AIThe Risks and  Rewards of AI
The Risks and Rewards of AISplunk
 
IoT Analytics @ splunk
IoT Analytics @ splunkIoT Analytics @ splunk
IoT Analytics @ splunkSplunk
 
Catch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf OnlineCatch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf OnlineSplunk
 
Introduction into Security Analytics Methods
Introduction into Security Analytics Methods Introduction into Security Analytics Methods
Introduction into Security Analytics Methods Splunk
 
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning WebinarSplunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning WebinarSplunk
 
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business OutcomesSplunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business OutcomesSplunk
 
Best Practices for Forwarder Hierarchies
Best Practices for Forwarder HierarchiesBest Practices for Forwarder Hierarchies
Best Practices for Forwarder HierarchiesSplunk
 
SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS Splunk
 
Monitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data ScienceMonitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data ScienceC4Media
 
Splunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout SessionSplunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout SessionSplunk
 
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk EnterpriseSplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk EnterpriseSplunk
 

What's hot (20)

Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
 
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOpsSplunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
 
Splunk Overview
Splunk OverviewSplunk Overview
Splunk Overview
 
.conf21 - The Best of
.conf21 - The Best of.conf21 - The Best of
.conf21 - The Best of
 
Security Automation & Orchestration
Security Automation & OrchestrationSecurity Automation & Orchestration
Security Automation & Orchestration
 
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
 
Manufacturing Webinar AMS
Manufacturing Webinar AMSManufacturing Webinar AMS
Manufacturing Webinar AMS
 
SplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk OverviewSplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk Overview
 
The Risks and Rewards of AI
The Risks and  Rewards of AIThe Risks and  Rewards of AI
The Risks and Rewards of AI
 
IoT Analytics @ splunk
IoT Analytics @ splunkIoT Analytics @ splunk
IoT Analytics @ splunk
 
Catch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf OnlineCatch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf Online
 
Introduction into Security Analytics Methods
Introduction into Security Analytics Methods Introduction into Security Analytics Methods
Introduction into Security Analytics Methods
 
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning WebinarSplunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning Webinar
 
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business OutcomesSplunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
 
Best Practices for Forwarder Hierarchies
Best Practices for Forwarder HierarchiesBest Practices for Forwarder Hierarchies
Best Practices for Forwarder Hierarchies
 
Observe 2020-d mc
Observe 2020-d mcObserve 2020-d mc
Observe 2020-d mc
 
SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS
 
Monitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data ScienceMonitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data Science
 
Splunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout SessionSplunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout Session
 
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk EnterpriseSplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
 

Similar to Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems

Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityDevOps.com
 
Three Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability ScorecardThree Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability ScorecardDevOps.com
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemC4Media
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputinginside-BigData.com
 
Big Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of AdviceBig Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of AdviceVenu Vasudevan
 
Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfAlbert Wong
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)Dinis Cruz
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemDanny Yuan
 
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Amazon Web Services
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
No specimen (software) left behind
No specimen (software) left behindNo specimen (software) left behind
No specimen (software) left behindVince Smith
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfAlbert Wong
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure ExecutionDatabricks
 
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ..."Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...DevClub_lv
 

Similar to Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems (20)

Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking Observability
 
Three Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability ScorecardThree Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability Scorecard
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
 
Big Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of AdviceBig Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of Advice
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdf
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
No specimen (software) left behind
No specimen (software) left behindNo specimen (software) left behind
No specimen (software) left behind
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
 
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ..."Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
 

More from DevOps.com

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareDevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...DevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...DevOps.com
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykDevOps.com
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudDevOps.com
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and PredictionsDevOps.com
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionDevOps.com
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)DevOps.com
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDevOps.com
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureDevOps.com
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportDevOps.com
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogDevOps.com
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDevOps.com
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid finalDevOps.com
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureDevOps.com
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021DevOps.com
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?DevOps.com
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsDevOps.com
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...DevOps.com
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...DevOps.com
 

More from DevOps.com (20)

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source Software
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and Snyk
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware Resolution
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident Response
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with Datadog
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or Privately
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid final
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call Culture
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift Environments
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
 

Recently uploaded

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Recently uploaded (20)

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems

  • 1. Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems Austin Parker, Principal Developer Advocate at Lightstep
  • 2. Who Am I? Austin Parker Principal Developer Advocate @austinlparker austin@lightstep.com✉
  • 4. The Conventional Wisdom ● Observing microservices is hard ● Google and Facebook solved this (right???) ● They used Metrics, Logging, and Distributed Tracing… ● So we should, too.
  • 5. The Three Pillars of Observability - Metrics - Logging - Distributed Tracing
  • 9.
  • 11. A word nobody knew in 2015… Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”) …… but cardinality
  • 12. Logging Data Volume: a reality check transaction rate x all microservices x cost of net+storage x weeks of retention ----------------------- way too much $$$$
  • 13. The Life of Transaction Data: Dapper Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 000.10% Flushed out of process App 000.10% Centralized regionally Regional network + storage 000.10% Centralized globally WAN + storage 000.01%
  • 14. Fatal Flaws: A Review Logs Metrics Dist. Traces TCO scales gracefully – ✓ ✓ Accounts for all data (i.e., unsampled) ✓ ✓ – Immune to cardinality ✓ – ✓
  • 15.
  • 19. Metrics, Logs, and Traces are Just Data, … not a feature or use case.
  • 20. Part 2: A New Scorecard for Observability
  • 21. Mental Model: Goals and Activities ● Goals: how our services perform in the eyes of their consumers ● Activities: what we (as operators) actually do to further our goals
  • 22. Quick Vocab Refresher: SLIs “SLI” = “Service Level Indicator” TL;DR: An SLI is an indicator of health that a service’s consumers would care about. … not an indicator of its inner workings
  • 23. Observability: 2 Fundamental Goals Gradually improving an SLI Rapidly restoring an SLI Reminder: “SLI” = “Service Level Indicator” NOW!!!! days, weeks, months…
  • 24. Observability: 2 Fundamental Activities 1. Detection: measuring SLIs precisely 2. Refinement: reducing the search space for plausible explanations
  • 25. An interlude about stats frequency
  • 26. Scorecard: Detection 1. Specificity: - Cost of cardinality ($ per tag value) - Stack support (mobile/web platforms, managed services, “black-box OSS infra” like Kafka/Cassandra) 2. Fidelity: - Correct stats!!! (global p95, p99) - High stats frequency (stats sampling frequency, in seconds) 3. Freshness (lag from real-time, in seconds)
  • 27. Why “Refinement”? # of things your users actually care about # of microservices # of failure modes Must reduce the search space!
  • 28. The Refinement Process Discover Variance Explain Variance Deploy Fix
  • 30. Scorecard: Refinement Identifying Variance: - Cardinality ($ per tag value) - Robust stats (histograms (see prev slide)) - Retention horizons for plausible queries (time duration) Explaining variance: - Correct stats!!! (global p95, p99) - “Suppress the messengers” of microservice failures
  • 32. (first, a hint at my perspective)
  • 33. A fun game! (“Observability Whack-a-Mole”) Design your own observability system: ❏ High-throughput ❏ High-cardinality ❏ Lengthy retention window ❏ Unsampled Choose three
  • 34. The Life of Trace Data: Dapper Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 000.10% Flushed out of process App 000.10% Centralized regionally Regional network + storage 000.10% Centralized globally WAN + storage 000.01%
  • 35. The Life of Trace Data: Dapper Other Approaches Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 100.00% Flushed out of process App 100.00% Centralized regionally Regional network + storage 100.00% Centralized globally WAN + storage “fancy”
  • 36. An Observability Scorecard Detection - Specificity: cardinality cost, stack coverage - Fidelity: correct stats, high stats frequency - Freshness: ≤ 5 seconds Refinement - Identifying variance: cardinality cost, correct stats, hi-fi histograms, retention horizons - “Suppress the messengers”
  • 37. LightStep: Observability with context Automatic deployment and regression detection System and service diagrams Real-time and historical root cause analysis Correlations Custom alerting Easy Setup with no vendor lock-in No cardinality limitations, really
  • 42. Ideal Refinement: Real-time Must be able to test and eliminate hypotheses quickly Actual data must be ≤10s fresh UI / API latency must be very low
  • 44. Ideal Refinement: Context-Rich We can’t expect humans to know what’s normal