SlideShare a Scribd company logo
From Grief to Growth: The 7 Stages of
Observability
Alex Gervais
ConFoo Montreal
Feb. 26, 2020
Bonjour-Hi!
Outdoorsy, data-driven, eternal student, not so geeky creative
mind and traveler. Alex is a curious, introverted and humble
character. Working by day as a Senior Software Developer at
Datawire he has many years of savoir-faire building full-stack
systems from cloud infrastructures to backend services and
DevOps tools. Alex is a Kubernetes early adopter who thrives on
collaboration and contributor to many Cloud-Native projects.
@alex_gervais on twitter
alexgervais on github
Case Study
Grief
Confusion
Envy
Excitement
Satisfaction
Fear
Growth
Lessons learned
Table of Contents
Prediction time...
Serverless will take over the world.
The story
only the names have been changed to protect the innocent
Monoliths and Distributed Applications
my-app
web-frontend
the-relational-db
this-other-service
oh-that-third-party
big-data-warehouse
infosec-robot
some-cache
this-other-service
-v2
email-relay
In the Monolith era, we were equipped with
● Log files and tail, grep, and more
● curl, tcpdump, dig, and more
● IDE debuggers
● Some form of APM
● Classic monitoring such as Nagios, Pingdom, and more
Business-critical operations
Who’s monitoring the user experience and satisfaction?
Are we equipped to face traffic bursts?
How is our release process and QA culture impacting release
frequency AND quality?
Who owns each project and component?
Grief
deep sorrow, especially that caused by someone's death
This is soooo complicated...
● We have downtimes, failures, unstable environments and the
infrastructure footprint is increasing
● Everything comes up as a SURPRISE!
● Failures are everywhere: in infrastructure, in applications, in
configuration
● Failures are visible: in degraded performance, in bugs, in
downtime
● Fatigue
To delight our customers with our SaaS offering,
our problem was…
How could we reach operational excellence and
share ownership effectively?
Objectives
● Increase our global comprehension of the system
● Improve the platform stability
● Find owners for each component and involve them in
operational work
● State of the art observability
Confusion
lack of understanding; uncertainty
Blind operations
● We have all these metrics and data sources, but we are still
operating in the dark
● Too many tools, no trustable source of truth
● No runbooks
● Clients report errors, slowness and downtimes before we
catch them… That’s not good!
● An easy prey for vendors
Concepts
Service-Level Indicator - SLI
Service-Level Objective - SLO
Service-Level Agreement - SLA
Error budgets (P99)
Distinguish the types of error:
bugs vs performance; app vs infra vs config
Envy
a feeling of discontented or resentful longing aroused by someone else's possessions,
qualities, or luck
Keep on dreaming
● Case studies, blogs, and vendor pitches
● Conferences and learning from mature organizations
● Is this a miracle?
● Chaos engineering
● No free cycles to improve
Toil
Manual, repetitive, automatable, interrupt-driven and reactive
work with no enduring value.
Unplanned Work
Work In Progress
Adopting a vision
● Business sponsors
● Invest in our platform
● Identified key enablers to reduce toil
Excitement
a feeling of great enthusiasm and eagerness
We set out to improve our tooling
● Introduce new bleeding-edge technologies
● Custom tools
● Many integrations and trials
Satisfaction
fulfillment of one's wishes, expectations, or needs, or the pleasure derived from this
Good is the enemy of great
● We did a lot of custom code and integrations
● We need to define internal standards
● We tried pretty much everything, all tools and solutions had
their chance
● Definition of Done?
Increase in the number of tools...
● Too many tools and solutions
● We are not reducing complexity
● We now need operational knowledge of more tools!
Friction
● (Intrusive) code instrumentation multiple stacks
● Bumping heads with competing standards
● Change of mentality
● Change of habits
Fear
an unpleasant emotion caused by the belief that someone or something is dangerous,
likely to cause pain, or a threat
Living in fear
● No adoption
● Are we covered?
● How’s ownership going?
Unknown unknowns
● Practice… practice… incident management & chaos
engineering
● Distributed tracing data had a lot of potential, but we failed at
extracting any value out of the data we collected
Grief Iterate
Double down on distributed tracing
● Vendor partnership
● Instrument our code
● Attach business metrics and logs to spans
● Collect the data
● Infer metrics from the collected data
● Render in a comprehensive single pane of glass
● No data retention; No sampling
Live service graph
Single trace view
Drill down analytics
Distributed traces
Helped define our SLIs
Visibility on our SLOs
Identify root-causes
Understand failure modes
Go-to dashboard
Trace Context
Growth
1. the process of developing or maturing physically, mentally, or spiritually
2. the process of increasing in amount, value, or importance
Addressing the problem
● Find, assign and route alert to owners
● Reduce the number of moving parts (environments)
● Start at the edge
● “3 pillars”
● Aligned with business objectives
● The power of habits
Thanks!
Credits
[1] The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win,
Gene Kim, Kevin Behr, George Spafford, 2013
[2] https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
[3] https://www.datadoghq.com/blog/monitoring-101-alerting/
[4] https://xkcd.com/2259/
[5] https://landing.google.com/sre/sre-book/chapters/eliminating-toil/
[6]
https://imgix.datadoghq.com/img/blog/net-monitoring-apm/dotnet-monitoring-service-map
v3.png?fit=max
[7] https://vital-leaf.cloudvent.net/paths/gs-resolve-path/step-three
[8] https://github.com/honeycombio/examples/tree/master/kubernetes-envoy-tracing

More Related Content

Similar to [Confoo Montreal 2020] From Grief to Growth: The 7 Stages of Observability - Alex Gervais

DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck
VictorOps
 
HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software
jamieayre
 
Security for Humans
Security for HumansSecurity for Humans
Security for Humans
Dustin Collins
 
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
Amazon Web Services
 
Rsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AIRsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AI
Sanjana Chowdhury
 
DevSecOps What Why and How
DevSecOps What Why and HowDevSecOps What Why and How
DevSecOps What Why and How
NotSoSecure Global Services
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
IT Arena
 
TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?
Haggai Philip Zagury
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
iguazio
 
Security For Humans
Security For HumansSecurity For Humans
Security For Humans
conjur_inc
 
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
Splunk
 
SplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to SplunkSplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to Splunk
Splunk
 
Not my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureNot my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructure
Yshay Yaacobi
 
Real world dev ops
Real world dev opsReal world dev ops
Real world dev ops
Mohan Krishnan
 
Making Onboarding Suck Less
Making Onboarding Suck LessMaking Onboarding Suck Less
Making Onboarding Suck Less
VMware Tanzu
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
Rob Winters
 
Agile Software and DevOps Essentials
Agile Software and DevOps EssentialsAgile Software and DevOps Essentials
Agile Software and DevOps Essentials
Narayanan Subramaniam
 
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk
 
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
Splunk
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Sagi Brody
 

Similar to [Confoo Montreal 2020] From Grief to Growth: The 7 Stages of Observability - Alex Gervais (20)

DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck
 
HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software
 
Security for Humans
Security for HumansSecurity for Humans
Security for Humans
 
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
 
Rsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AIRsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AI
 
DevSecOps What Why and How
DevSecOps What Why and HowDevSecOps What Why and How
DevSecOps What Why and How
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
 
TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
Security For Humans
Security For HumansSecurity For Humans
Security For Humans
 
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
 
SplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to SplunkSplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to Splunk
 
Not my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureNot my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructure
 
Real world dev ops
Real world dev opsReal world dev ops
Real world dev ops
 
Making Onboarding Suck Less
Making Onboarding Suck LessMaking Onboarding Suck Less
Making Onboarding Suck Less
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Agile Software and DevOps Essentials
Agile Software and DevOps EssentialsAgile Software and DevOps Essentials
Agile Software and DevOps Essentials
 
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
 
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
 

More from Ambassador Labs

Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Ambassador Labs
 
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Labs
 
Cloud native development without the toil
Cloud native development without the toilCloud native development without the toil
Cloud native development without the toil
Ambassador Labs
 
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Ambassador Labs
 
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
Ambassador Labs
 
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
Ambassador Labs
 
What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0? What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0?
Ambassador Labs
 
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Ambassador Labs
 
Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy
Ambassador Labs
 
Telepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for KubernetesTelepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for Kubernetes
Ambassador Labs
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
Ambassador Labs
 
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
Ambassador Labs
 
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
Ambassador Labs
 
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYCThe Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
Ambassador Labs
 
Ambassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API GatewayAmbassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API Gateway
Ambassador Labs
 
Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh? Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh?
Ambassador Labs
 
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
Ambassador Labs
 
Webinar: Code Faster on Kubernetes
Webinar: Code Faster on KubernetesWebinar: Code Faster on Kubernetes
Webinar: Code Faster on Kubernetes
Ambassador Labs
 
QCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented DevelopmentQCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented Development
Ambassador Labs
 
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Ambassador Labs
 

More from Ambassador Labs (20)

Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
 
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
 
Cloud native development without the toil
Cloud native development without the toilCloud native development without the toil
Cloud native development without the toil
 
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
 
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
 
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
 
What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0? What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0?
 
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
 
Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy
 
Telepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for KubernetesTelepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for Kubernetes
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
 
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
 
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
 
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYCThe Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
 
Ambassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API GatewayAmbassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API Gateway
 
Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh? Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh?
 
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
 
Webinar: Code Faster on Kubernetes
Webinar: Code Faster on KubernetesWebinar: Code Faster on Kubernetes
Webinar: Code Faster on Kubernetes
 
QCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented DevelopmentQCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented Development
 
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
 

Recently uploaded

openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 

Recently uploaded (20)

openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 

[Confoo Montreal 2020] From Grief to Growth: The 7 Stages of Observability - Alex Gervais

  • 1. From Grief to Growth: The 7 Stages of Observability Alex Gervais ConFoo Montreal Feb. 26, 2020
  • 2. Bonjour-Hi! Outdoorsy, data-driven, eternal student, not so geeky creative mind and traveler. Alex is a curious, introverted and humble character. Working by day as a Senior Software Developer at Datawire he has many years of savoir-faire building full-stack systems from cloud infrastructures to backend services and DevOps tools. Alex is a Kubernetes early adopter who thrives on collaboration and contributor to many Cloud-Native projects. @alex_gervais on twitter alexgervais on github
  • 4. Prediction time... Serverless will take over the world.
  • 5. The story only the names have been changed to protect the innocent
  • 6. Monoliths and Distributed Applications my-app web-frontend the-relational-db this-other-service oh-that-third-party big-data-warehouse infosec-robot some-cache this-other-service -v2 email-relay
  • 7.
  • 8. In the Monolith era, we were equipped with ● Log files and tail, grep, and more ● curl, tcpdump, dig, and more ● IDE debuggers ● Some form of APM ● Classic monitoring such as Nagios, Pingdom, and more
  • 9. Business-critical operations Who’s monitoring the user experience and satisfaction? Are we equipped to face traffic bursts? How is our release process and QA culture impacting release frequency AND quality? Who owns each project and component?
  • 10. Grief deep sorrow, especially that caused by someone's death
  • 11. This is soooo complicated... ● We have downtimes, failures, unstable environments and the infrastructure footprint is increasing ● Everything comes up as a SURPRISE! ● Failures are everywhere: in infrastructure, in applications, in configuration ● Failures are visible: in degraded performance, in bugs, in downtime ● Fatigue
  • 12. To delight our customers with our SaaS offering, our problem was… How could we reach operational excellence and share ownership effectively?
  • 13. Objectives ● Increase our global comprehension of the system ● Improve the platform stability ● Find owners for each component and involve them in operational work ● State of the art observability
  • 15. Blind operations ● We have all these metrics and data sources, but we are still operating in the dark ● Too many tools, no trustable source of truth ● No runbooks ● Clients report errors, slowness and downtimes before we catch them… That’s not good! ● An easy prey for vendors
  • 16. Concepts Service-Level Indicator - SLI Service-Level Objective - SLO Service-Level Agreement - SLA Error budgets (P99) Distinguish the types of error: bugs vs performance; app vs infra vs config
  • 17.
  • 18. Envy a feeling of discontented or resentful longing aroused by someone else's possessions, qualities, or luck
  • 19. Keep on dreaming ● Case studies, blogs, and vendor pitches ● Conferences and learning from mature organizations ● Is this a miracle? ● Chaos engineering ● No free cycles to improve
  • 20. Toil Manual, repetitive, automatable, interrupt-driven and reactive work with no enduring value. Unplanned Work Work In Progress
  • 21. Adopting a vision ● Business sponsors ● Invest in our platform ● Identified key enablers to reduce toil
  • 22. Excitement a feeling of great enthusiasm and eagerness
  • 23. We set out to improve our tooling ● Introduce new bleeding-edge technologies ● Custom tools ● Many integrations and trials
  • 24. Satisfaction fulfillment of one's wishes, expectations, or needs, or the pleasure derived from this
  • 25. Good is the enemy of great ● We did a lot of custom code and integrations ● We need to define internal standards ● We tried pretty much everything, all tools and solutions had their chance ● Definition of Done?
  • 26. Increase in the number of tools... ● Too many tools and solutions ● We are not reducing complexity ● We now need operational knowledge of more tools!
  • 27. Friction ● (Intrusive) code instrumentation multiple stacks ● Bumping heads with competing standards ● Change of mentality ● Change of habits
  • 28. Fear an unpleasant emotion caused by the belief that someone or something is dangerous, likely to cause pain, or a threat
  • 29. Living in fear ● No adoption ● Are we covered? ● How’s ownership going?
  • 30. Unknown unknowns ● Practice… practice… incident management & chaos engineering ● Distributed tracing data had a lot of potential, but we failed at extracting any value out of the data we collected
  • 32. Double down on distributed tracing ● Vendor partnership ● Instrument our code ● Attach business metrics and logs to spans ● Collect the data ● Infer metrics from the collected data ● Render in a comprehensive single pane of glass ● No data retention; No sampling
  • 36. Distributed traces Helped define our SLIs Visibility on our SLOs Identify root-causes Understand failure modes Go-to dashboard
  • 38. Growth 1. the process of developing or maturing physically, mentally, or spiritually 2. the process of increasing in amount, value, or importance
  • 39. Addressing the problem ● Find, assign and route alert to owners ● Reduce the number of moving parts (environments) ● Start at the edge ● “3 pillars” ● Aligned with business objectives ● The power of habits
  • 41. Credits [1] The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, Gene Kim, Kevin Behr, George Spafford, 2013 [2] https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c [3] https://www.datadoghq.com/blog/monitoring-101-alerting/ [4] https://xkcd.com/2259/ [5] https://landing.google.com/sre/sre-book/chapters/eliminating-toil/ [6] https://imgix.datadoghq.com/img/blog/net-monitoring-apm/dotnet-monitoring-service-map v3.png?fit=max [7] https://vital-leaf.cloudvent.net/paths/gs-resolve-path/step-three [8] https://github.com/honeycombio/examples/tree/master/kubernetes-envoy-tracing