SlideShare a Scribd company logo
Three Pillars, No Answers: Helping Platform
Teams Solve Real Observability Problems
Austin Parker, Principal Developer Advocate at Lightstep
Who Am I?
Austin Parker
Principal Developer Advocate
@austinlparker
austin@lightstep.com✉
Part 1: A
Critique
The Conventional Wisdom
● Observing microservices is hard
● Google and Facebook solved this (right???)
● They used Metrics, Logging, and Distributed Tracing…
● So we should, too.
The Three Pillars of Observability
- Metrics
- Logging
- Distributed Tracing
Metrics
Logging
Tracing
Fatal Flaws
A word nobody knew in 2015…
Dimensions (aka “tags”) can explain
variance in timeseries data (aka “metrics”)
…… but cardinality
Logging Data Volume: a reality check
transaction rate
x all microservices
x cost of net+storage
x weeks of retention
-----------------------
way too much $$$$
The Life of Transaction Data: Dapper
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 000.10%
Flushed out of process App 000.10%
Centralized regionally Regional network + storage 000.10%
Centralized globally WAN + storage 000.01%
Fatal Flaws: A Review
Logs Metrics Dist. Traces
TCO scales gracefully
– ✓ ✓
Accounts for all data
(i.e., unsampled) ✓ ✓ –
Immune to cardinality
✓ – ✓
Data vs UI
Data vs UI
Data vs UI
Metrics
Logs
Traces
Metrics, Logs, and Traces are
Just Data,
… not a feature or use case.
Part 2: A New
Scorecard for
Observability
Mental Model: Goals and Activities
● Goals: how our services perform in the eyes of
their consumers
● Activities: what we (as operators) actually do
to further our goals
Quick Vocab Refresher: SLIs
“SLI” = “Service Level Indicator”
TL;DR: An SLI is an indicator of health that a
service’s consumers would care about.
… not an indicator of its inner workings
Observability: 2 Fundamental Goals
Gradually improving an SLI
Rapidly restoring an SLI
Reminder: “SLI” = “Service Level Indicator”
NOW!!!!
days, weeks, months…
Observability: 2 Fundamental Activities
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search
space for plausible explanations
An interlude about stats frequency
Scorecard: Detection
1. Specificity:
- Cost of cardinality ($ per tag value)
- Stack support (mobile/web platforms, managed services, “black-box
OSS infra” like Kafka/Cassandra)
2. Fidelity:
- Correct stats!!! (global p95, p99)
- High stats frequency (stats sampling frequency, in seconds)
3. Freshness (lag from real-time, in seconds)
Why “Refinement”?
# of things your users
actually care about
# of microservices
# of failure modes
Must reduce
the search space!
The Refinement Process
Discover Variance
Explain Variance
Deploy
Fix
Histograms vs “p99”
Scorecard: Refinement
Identifying Variance:
- Cardinality ($ per tag value)
- Robust stats (histograms (see prev slide))
- Retention horizons for plausible queries (time duration)
Explaining variance:
- Correct stats!!! (global p95, p99)
- “Suppress the messengers” of microservice failures
Wrapping Up...
(first, a hint at my
perspective)
A fun game! (“Observability Whack-a-Mole”)
Design your own observability system:
❏ High-throughput
❏ High-cardinality
❏ Lengthy retention window
❏ Unsampled
Choose three
The Life of Trace Data:
Dapper
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 000.10%
Flushed out of process App 000.10%
Centralized regionally Regional network + storage 000.10%
Centralized globally WAN + storage 000.01%
The Life of Trace Data:
Dapper Other Approaches
Stage Overhead affects… Retained
Instrumentation Executed App 100.00%
Buffered within app process App 100.00%
Flushed out of process App 100.00%
Centralized regionally Regional network + storage 100.00%
Centralized globally WAN + storage “fancy”
An Observability Scorecard
Detection
- Specificity: cardinality cost,
stack coverage
- Fidelity: correct stats, high stats
frequency
- Freshness: ≤ 5 seconds
Refinement
- Identifying variance: cardinality
cost, correct stats, hi-fi
histograms, retention horizons
- “Suppress the messengers”
LightStep: Observability with context
Automatic deployment and regression detection
System and service diagrams
Real-time and historical root cause analysis
Correlations
Custom alerting
Easy Setup with no vendor lock-in
No cardinality limitations, really
Q&A
Get Started Today
go.lightstep.com/trial
Extra Slides
Ideal Measurement: Robust
Ideal Measurement: High-Dimensional
Ideal Refinement: Real-time
Must be able to test and eliminate hypotheses
quickly
Actual data must be ≤10s fresh
UI / API latency must be very low
Ideal Refinement: Global
Ideal Refinement: Context-Rich
We can’t expect humans to know what’s normal

More Related Content

What's hot

Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Splunk
 
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOpsSplunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk
 
Splunk Overview
Splunk OverviewSplunk Overview
Splunk Overview
Splunk
 
.conf21 - The Best of
.conf21 - The Best of.conf21 - The Best of
.conf21 - The Best of
Splunk
 
Security Automation & Orchestration
Security Automation & OrchestrationSecurity Automation & Orchestration
Security Automation & Orchestration
Splunk
 
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Splunk
 
Manufacturing Webinar AMS
Manufacturing Webinar AMSManufacturing Webinar AMS
Manufacturing Webinar AMS
Splunk
 
SplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk OverviewSplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk Overview
Splunk
 
The Risks and Rewards of AI
The Risks and  Rewards of AIThe Risks and  Rewards of AI
The Risks and Rewards of AI
Splunk
 
IoT Analytics @ splunk
IoT Analytics @ splunkIoT Analytics @ splunk
IoT Analytics @ splunk
Splunk
 
Catch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf OnlineCatch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf Online
Splunk
 
Introduction into Security Analytics Methods
Introduction into Security Analytics Methods Introduction into Security Analytics Methods
Introduction into Security Analytics Methods
Splunk
 
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning WebinarSplunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk
 
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business OutcomesSplunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk
 
Best Practices for Forwarder Hierarchies
Best Practices for Forwarder HierarchiesBest Practices for Forwarder Hierarchies
Best Practices for Forwarder Hierarchies
Splunk
 
Observe 2020-d mc
Observe 2020-d mcObserve 2020-d mc
Observe 2020-d mc
Dave McAllister
 
SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS
Splunk
 
Monitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data ScienceMonitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data Science
C4Media
 
Splunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout SessionSplunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout Session
Splunk
 
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk EnterpriseSplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
Splunk
 

What's hot (20)

Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
Wie erkenne ich die Auswirkungen von IT Ausfallen auf meine Produktion?
 
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOpsSplunk Discovery Köln - 17-01-2020 - Splunk for ITOps
Splunk Discovery Köln - 17-01-2020 - Splunk for ITOps
 
Splunk Overview
Splunk OverviewSplunk Overview
Splunk Overview
 
.conf21 - The Best of
.conf21 - The Best of.conf21 - The Best of
.conf21 - The Best of
 
Security Automation & Orchestration
Security Automation & OrchestrationSecurity Automation & Orchestration
Security Automation & Orchestration
 
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
Better Threat Analytics: From Getting Started to Cloud Security Analytics and...
 
Manufacturing Webinar AMS
Manufacturing Webinar AMSManufacturing Webinar AMS
Manufacturing Webinar AMS
 
SplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk OverviewSplunkLive! Paris 2018: Splunk Overview
SplunkLive! Paris 2018: Splunk Overview
 
The Risks and Rewards of AI
The Risks and  Rewards of AIThe Risks and  Rewards of AI
The Risks and Rewards of AI
 
IoT Analytics @ splunk
IoT Analytics @ splunkIoT Analytics @ splunk
IoT Analytics @ splunk
 
Catch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf OnlineCatch these Sessions on-demand at .conf Online
Catch these Sessions on-demand at .conf Online
 
Introduction into Security Analytics Methods
Introduction into Security Analytics Methods Introduction into Security Analytics Methods
Introduction into Security Analytics Methods
 
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning WebinarSplunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning Webinar
 
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business OutcomesSplunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
Splunk Discovery Köln - 17-01-2020 - Turning Data Into Business Outcomes
 
Best Practices for Forwarder Hierarchies
Best Practices for Forwarder HierarchiesBest Practices for Forwarder Hierarchies
Best Practices for Forwarder Hierarchies
 
Observe 2020-d mc
Observe 2020-d mcObserve 2020-d mc
Observe 2020-d mc
 
SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS SplunkLive! Stockholm 2019 - Customer presentation: ISS
SplunkLive! Stockholm 2019 - Customer presentation: ISS
 
Monitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data ScienceMonitoring Modern Architectures with Data Science
Monitoring Modern Architectures with Data Science
 
Splunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout SessionSplunk for Monitoring and Diagnostics Breakout Session
Splunk for Monitoring and Diagnostics Breakout Session
 
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk EnterpriseSplunkLive! Munich 2018: Getting Started with Splunk Enterprise
SplunkLive! Munich 2018: Getting Started with Splunk Enterprise
 

Similar to Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems

Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking Observability
DevOps.com
 
Three Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability ScorecardThree Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability Scorecard
DevOps.com
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
C4Media
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
inside-BigData.com
 
Big Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of AdviceBig Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of Advice
Venu Vasudevan
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
Theo Schlossnagle
 
Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdf
Albert Wong
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
Intelie
 
Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2
Value Amplify Consulting
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
Doug Needham
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
Mark Nichols, P.E.
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
Dinis Cruz
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
Danny Yuan
 
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Amazon Web Services
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
Abhishek M Shivalingaiah
 
No specimen (software) left behind
No specimen (software) left behindNo specimen (software) left behind
No specimen (software) left behind
Vince Smith
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Albert Wong
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Databricks
 
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ..."Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
DevClub_lv
 

Similar to Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems (20)

Three Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking ObservabilityThree Pillars, Zero Answers: Rethinking Observability
Three Pillars, Zero Answers: Rethinking Observability
 
Three Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability ScorecardThree Pillars with Zero Answers: A New Observability Scorecard
Three Pillars with Zero Answers: A New Observability Scorecard
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
 
Big Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of AdviceBig Data : Bits of History, Words of Advice
Big Data : Bits of History, Words of Advice
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
 
Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdf
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2Machine Learning Impact on IoT - Part 2
Machine Learning Impact on IoT - Part 2
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)How to not fail at security data analytics (by CxOSidekick)
How to not fail at security data analytics (by CxOSidekick)
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
Tooling Up for Efficiency: DIY Solutions @ Netflix - ABD319 - re:Invent 2017
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
No specimen (software) left behind
No specimen (software) left behindNo specimen (software) left behind
No specimen (software) left behind
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
 
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ..."Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
 

More from DevOps.com

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source Software
DevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
DevOps.com
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
DevOps.com
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and Snyk
DevOps.com
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
DevOps.com
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions
DevOps.com
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware Resolution
DevOps.com
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
DevOps.com
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident Response
DevOps.com
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
DevOps.com
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
DevOps.com
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with Datadog
DevOps.com
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or Privately
DevOps.com
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid final
DevOps.com
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call Culture
DevOps.com
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021
DevOps.com
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?
DevOps.com
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift Environments
DevOps.com
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
DevOps.com
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
DevOps.com
 

More from DevOps.com (20)

Modernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source SoftwareModernizing on IBM Z Made Easier With Open Source Software
Modernizing on IBM Z Made Easier With Open Source Software
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
Comparing Microsoft SQL Server 2019 Performance Across Various Kubernetes Pla...
 
Next Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and SnykNext Generation Vulnerability Assessment Using Datadog and Snyk
Next Generation Vulnerability Assessment Using Datadog and Snyk
 
Vulnerability Discovery in the Cloud
Vulnerability Discovery in the CloudVulnerability Discovery in the Cloud
Vulnerability Discovery in the Cloud
 
2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions2021 Open Source Governance: Top Ten Trends and Predictions
2021 Open Source Governance: Top Ten Trends and Predictions
 
A New Year’s Ransomware Resolution
A New Year’s Ransomware ResolutionA New Year’s Ransomware Resolution
A New Year’s Ransomware Resolution
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
 
Don't Panic! Effective Incident Response
Don't Panic! Effective Incident ResponseDon't Panic! Effective Incident Response
Don't Panic! Effective Incident Response
 
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's CultureCreating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
Creating a Culture of Chaos: Chaos Engineering Is Not Just Tools, It's Culture
 
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with TeleportRole Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
Role Based Access Controls (RBAC) for SSH and Kubernetes Access with Teleport
 
Monitoring Serverless Applications with Datadog
Monitoring Serverless Applications with DatadogMonitoring Serverless Applications with Datadog
Monitoring Serverless Applications with Datadog
 
Deliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or PrivatelyDeliver your App Anywhere … Publicly or Privately
Deliver your App Anywhere … Publicly or Privately
 
Securing medical apps in the age of covid final
Securing medical apps in the age of covid finalSecuring medical apps in the age of covid final
Securing medical apps in the age of covid final
 
How to Build a Healthy On-Call Culture
How to Build a Healthy On-Call CultureHow to Build a Healthy On-Call Culture
How to Build a Healthy On-Call Culture
 
The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021The Evolving Role of the Developer in 2021
The Evolving Role of the Developer in 2021
 
Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?Service Mesh: Two Big Words But Do You Need It?
Service Mesh: Two Big Words But Do You Need It?
 
Secure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift EnvironmentsSecure Data Sharing in OpenShift Environments
Secure Data Sharing in OpenShift Environments
 
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
How to Govern Identities and Access in Cloud Infrastructure: AppsFlyer Case S...
 
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
Elevate Your Enterprise Python and R AI, ML Software Strategy with Anaconda T...
 

Recently uploaded

Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 

Recently uploaded (20)

Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 

Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems

  • 1. Three Pillars, No Answers: Helping Platform Teams Solve Real Observability Problems Austin Parker, Principal Developer Advocate at Lightstep
  • 2. Who Am I? Austin Parker Principal Developer Advocate @austinlparker austin@lightstep.com✉
  • 4. The Conventional Wisdom ● Observing microservices is hard ● Google and Facebook solved this (right???) ● They used Metrics, Logging, and Distributed Tracing… ● So we should, too.
  • 5. The Three Pillars of Observability - Metrics - Logging - Distributed Tracing
  • 9.
  • 11. A word nobody knew in 2015… Dimensions (aka “tags”) can explain variance in timeseries data (aka “metrics”) …… but cardinality
  • 12. Logging Data Volume: a reality check transaction rate x all microservices x cost of net+storage x weeks of retention ----------------------- way too much $$$$
  • 13. The Life of Transaction Data: Dapper Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 000.10% Flushed out of process App 000.10% Centralized regionally Regional network + storage 000.10% Centralized globally WAN + storage 000.01%
  • 14. Fatal Flaws: A Review Logs Metrics Dist. Traces TCO scales gracefully – ✓ ✓ Accounts for all data (i.e., unsampled) ✓ ✓ – Immune to cardinality ✓ – ✓
  • 15.
  • 19. Metrics, Logs, and Traces are Just Data, … not a feature or use case.
  • 20. Part 2: A New Scorecard for Observability
  • 21. Mental Model: Goals and Activities ● Goals: how our services perform in the eyes of their consumers ● Activities: what we (as operators) actually do to further our goals
  • 22. Quick Vocab Refresher: SLIs “SLI” = “Service Level Indicator” TL;DR: An SLI is an indicator of health that a service’s consumers would care about. … not an indicator of its inner workings
  • 23. Observability: 2 Fundamental Goals Gradually improving an SLI Rapidly restoring an SLI Reminder: “SLI” = “Service Level Indicator” NOW!!!! days, weeks, months…
  • 24. Observability: 2 Fundamental Activities 1. Detection: measuring SLIs precisely 2. Refinement: reducing the search space for plausible explanations
  • 25. An interlude about stats frequency
  • 26. Scorecard: Detection 1. Specificity: - Cost of cardinality ($ per tag value) - Stack support (mobile/web platforms, managed services, “black-box OSS infra” like Kafka/Cassandra) 2. Fidelity: - Correct stats!!! (global p95, p99) - High stats frequency (stats sampling frequency, in seconds) 3. Freshness (lag from real-time, in seconds)
  • 27. Why “Refinement”? # of things your users actually care about # of microservices # of failure modes Must reduce the search space!
  • 28. The Refinement Process Discover Variance Explain Variance Deploy Fix
  • 30. Scorecard: Refinement Identifying Variance: - Cardinality ($ per tag value) - Robust stats (histograms (see prev slide)) - Retention horizons for plausible queries (time duration) Explaining variance: - Correct stats!!! (global p95, p99) - “Suppress the messengers” of microservice failures
  • 32. (first, a hint at my perspective)
  • 33. A fun game! (“Observability Whack-a-Mole”) Design your own observability system: ❏ High-throughput ❏ High-cardinality ❏ Lengthy retention window ❏ Unsampled Choose three
  • 34. The Life of Trace Data: Dapper Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 000.10% Flushed out of process App 000.10% Centralized regionally Regional network + storage 000.10% Centralized globally WAN + storage 000.01%
  • 35. The Life of Trace Data: Dapper Other Approaches Stage Overhead affects… Retained Instrumentation Executed App 100.00% Buffered within app process App 100.00% Flushed out of process App 100.00% Centralized regionally Regional network + storage 100.00% Centralized globally WAN + storage “fancy”
  • 36. An Observability Scorecard Detection - Specificity: cardinality cost, stack coverage - Fidelity: correct stats, high stats frequency - Freshness: ≤ 5 seconds Refinement - Identifying variance: cardinality cost, correct stats, hi-fi histograms, retention horizons - “Suppress the messengers”
  • 37. LightStep: Observability with context Automatic deployment and regression detection System and service diagrams Real-time and historical root cause analysis Correlations Custom alerting Easy Setup with no vendor lock-in No cardinality limitations, really
  • 42. Ideal Refinement: Real-time Must be able to test and eliminate hypotheses quickly Actual data must be ≤10s fresh UI / API latency must be very low
  • 44. Ideal Refinement: Context-Rich We can’t expect humans to know what’s normal