Observability for Emerging Infra
The Future of Observability in Complex Systems
@mipsytipsy
Operations Engineer
“the only good diff is a red diff”
@grepory, Monitorama 2016
“Monitoring is dead.”
“Monitoring systems have not changed significantly in 20 years and have fallen behind the way we
build software. Our software is now large distributed systems made up of many non-uniform
interacting components while the core functionality of monitoring systems has stagnated.”
@mipsytipsy
hates monitoring
… not a monitoring company
We don’t *know* what the questions are, all
we have are unreliable symptoms or reports.
Complexity is exploding everywhere,
but our tools are designed for
a predictable world.
As soon as we know the question, we usually
know the answer too.
@grepory, 2016
This is an outdated model for complex systems.
Many catastrophic states exist at any given time.
Your system is never entirely ‘up’
Observability
Observability is about the unknown-unknowns. A system is
observable when you can ask any arbitrary question about it and dive
deep, explore, follow bread crumbs. Observability is about being
able to trace the inner workings, just by observing the outside.
Monitoring
Observability
Observability is a superset of …
How you operate a service.
Actionable checks, outputs
within parameters, alerts, dashboards.
Monitoring
How you develop a service
to be monitorable & observable.
Timers, gauges, counters, events etc.
Instrumentation
Observability is about making systems …
How you track down
operational failures and
programmatic bugs
Debuggable
How you answer questions
about your systems and
understand trends … unrelated to
outages or operational issues.
Understandable
Complexity:
Can be inherent or incidental.
Arises out of (software * architecture * human process)
Annoyingly subjective
Complexity is increasing
# of contributors, LoC, rate of change
Halstead volume. cyclomatic complexity.
size, distribution, users. # of types of components.
storage formats. reliability requirements. features.
algorithmic complexity. novelty vs maturity of components.
Architectural complexity
Parse, 2015
LAMP stack, 2005
monitoring => observability
known unknowns => unknown unknowns
The app tier capacity is exceeded. Maybe we
rolled out a build with a perf regression, or
maybe some app instances are down.
DB queries are slower than normal. Maybe
we deployed a bad new query, or there is lock
contention.
Errors or latency are high. We will look at
several dashboards that reflect common root
causes, and one of them will show us why.
“Photos are loading slowly for some people. Why?”
Monitoring
(LAMP stack)
monitor these things
Characteristics
Monitoring
• Known-unknowns predominate
• Intuition-friendly
• Dashboards are valuable.
• Monolithic app, single data source.
• The health of the system more or less accurately
represents the experience of the individual users.
(LAMP stack)
Best Practices
Monitoring
• Lots of actionable active checks and alerts
• Proactively notify engineers of failures and warnings
• Maintain a runbook for stable production systems
“Photos are loading slowly for some people. Why?”
(microservices)
Any microservice running on c2.4xlarge
instances and PIOPS storage in us-east-1b has a
1/20 chance of running on degraded hardware,
and will take 20x longer to complete requests
that hit the disk with a blocking call. This
disproportionately impacts people looking at
older archives due to our fanout model.
Canadian users who are using the French
language pack on an iPad running iOS 9 are
hitting a firmware condition that makes it fail
to save to the local cache … which is why it FEELS
like photos are loading slowly.
Our newest SDK makes db queries
sequentially if the developer has enabled an
optional feature flag. Working as intended;
the reporters all had debug mode enabled.
But the flag should be renamed for clarity's sake.
wtf do i ‘monitor’ for?!
Monitoring?!?
Problems Symptoms
“I have twenty microservices and a sharded
db and three other data stores across three
regions, and everything seems to be getting a
little bit slower over the past two weeks but
nothing has changed that we know of, and
oddly, latency is usually back to the historical
norm on Tuesdays.”
“All twenty app microservices have 10% of
available nodes enter a simultaneous crash
loop cycle, about five times a day, at
unpredictable intervals. They have nothing in
common afaik and it doesn’t seem to impact
the stateful services. It clears up before we
can debug it, every time.”
“Our users can compose their own queries that
we execute server-side, and we don’t surface it
to them when they are accidentally doing full
table scans or even multiple full table scans, so
they blame us.”
Observability
(microservices)
Still More Symptoms
“Several users in Romania and Eastern
Europe are complaining that all push
notifications have been down for them … for
days.”
“Disney is complaining that once in a while,
but not always, they don’t see the photo they
expected to see — they see someone else’s
photo! When they refresh, it’s fixed. Actually,
we’ve had a few other people report this too,
we just didn’t believe them.”
“Sometimes a bot takes off, or an app is
featured on the iTunes store, and it takes us a
long long time to track down which app or user
is generating disproportionate pressure on
shared components of our system (esp
databases). It’s different every time.”
Observability
“We run a platform, and it’s hard to
programmatically distinguish between
problems that users are inflicting themselves
and problems in our own code, since they all
manifest as the same errors or timeouts.”
(microservices)
These are all unknown-unknowns
that may never have happened before, and may never happen again.
They are also the overwhelming majority of what you have to
care about for the rest of your life.
Characteristics
• Unknown-unknowns are most of the problems
• “Many” components and storage systems
• You cannot model the entire system in your head.
Dashboards may be actively misleading.
• The hardest problem is often identifying which
component(s) to debug or trace.
• The health of the system is irrelevant. The health of
each individual request is of supreme consequence.
(microservices/complex systems)
Observability
Best Practices
• Rich instrumentation.
• Events, not metrics.
• Sampling, not write-time aggregation.
• Few (if any) dashboards.
• Test in production.. a lot.
• Very few paging alerts.
Observability
(microservices/complex systems)
Why:
Instrumentation?
Events, not metrics?
No dashboards?
Sampling, not time series aggregation?
Test in production?
Fewer alerts?
Instrumentation?
Start at the edge and work down
Internal state from software you didn’t write, too
Wrap every network call, every data call
Structured data only
`gem install` magic will only get you so far
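As a rough illustration of "wrap every network call, every data call" with structured data only, here is a minimal Python sketch. Everything in it is an assumption made for illustration and none of it comes from the talk: the `instrumented` helper, the `EMITTED` list standing in for an event pipeline, and the `fetch_photo` call with its field names are all hypothetical.

```python
import json
import time
import uuid
from contextlib import contextmanager

# Hypothetical sink: a real system would ship events to an event store.
EMITTED = []


@contextmanager
def instrumented(operation, **context):
    """Wrap one network or data call and emit one structured event for it."""
    event = {"event_id": str(uuid.uuid4()), "operation": operation, **context}
    start = time.perf_counter()
    try:
        yield event  # the caller can attach more fields while the call runs
        event["ok"] = True
    except Exception as exc:
        event["ok"] = False
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        # Structured output (JSON), not a free-form log string.
        EMITTED.append(json.dumps(event, sort_keys=True))


def fetch_photo(photo_id, user_id):
    # Hypothetical data call; the sleep stands in for real I/O.
    with instrumented("db.fetch_photo", user_id=user_id, photo_id=photo_id) as ev:
        time.sleep(0.001)
        ev["rows_returned"] = 1
        return {"photo_id": photo_id}


fetch_photo("p-42", "u-1001")
print(EMITTED[-1])
```

Each call produces one wide event that carries its identifiers (user, photo) alongside timing and outcome, so later questions can be asked against fields rather than parsed out of strings.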
Events, not metrics?
(trick question.. you’ll need both
but you’ll rely on events more and more)
Cardinality
Context
Structured data
Metrics are cheap, but terribly limited in context or cardinality.
Metrics
(+tags)
UUIDs
db raw queries
normalized queries
comments
firstname, lastname
PID/PPID
app ID
device ID
HTTP header type
build ID
IP:port
shopping cart ID
userid
... etc
Some of these …
might be …
useful …
YA THINK??!
High cardinality will save your ass.
Metrics
(cardinality)
You must be able to break down by 1/millions and
THEN by anything/everything else
High cardinality is not a nice-to-have
‘Platform problems’ are now everybody’s problems
Looking for a needle in
your haystack? Be
descriptive, add unique
identifiers.
Structured Data
Read-time aggregation
lets you compute
percentiles,
derived columns.
Events tell stories.
Arbitrarily wide events mean you can amass more and more context
over time. Use sampling to control costs and bandwidth.
Structure your data at the source to reap
massive efficiencies over strings.
Events
(“Logs” are just a transport mechanism for events)
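To make "read-time aggregation lets you compute percentiles, derived columns" concrete, here is a small Python sketch over a handful of raw events. The event fields (`app_id`, `build_id`, `duration_ms`, `bytes`), the `breakdown` helper, and the nearest-rank percentile definition are assumptions chosen for illustration, not something the slides specify.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical raw events, kept unaggregated.
events = [
    {"app_id": "a1", "build_id": "b7", "duration_ms": 12, "bytes": 2000},
    {"app_id": "a1", "build_id": "b7", "duration_ms": 15, "bytes": 3000},
    {"app_id": "a1", "build_id": "b8", "duration_ms": 900, "bytes": 1000},
    {"app_id": "a2", "build_id": "b8", "duration_ms": 20, "bytes": 4000},
    {"app_id": "a2", "build_id": "b8", "duration_ms": 22, "bytes": 4400},
]


def breakdown(events, group_by, field, pct):
    """Group raw events by any field at read time and compute a percentile per group."""
    groups = defaultdict(list)
    for e in events:
        groups[e[group_by]].append(e[field])
    return {key: percentile(vals, pct) for key, vals in groups.items()}


# A derived column, computed at read time from fields already on the event.
for e in events:
    e["ms_per_kb"] = e["duration_ms"] / (e["bytes"] / 1000)

p99_by_app = breakdown(events, "app_id", "duration_ms", 99)
p50_by_build = breakdown(events, "build_id", "duration_ms", 50)
print(p99_by_app)    # {'a1': 900, 'a2': 22}
print(p50_by_build)  # {'b7': 12, 'b8': 22}
```

Because the grouping key is chosen when the question is asked, the same events answer "by app" and "by build" without either breakdown having been planned when the data was written.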
Dashboards??
Dashboards
Jumps to an answer, instead of starting with a question
You don’t know what you don’t know.
Dashboards
Artifacts of past failures.
Raw
Fast
Iterative
Interactive
Exploratory
Dashboards
must die
Unknown-unknowns
demand explorability
and an open mind.
sampling, not aggregation
Raw requests:
Aggregation is a one-way trip
Destroying raw events eliminates your ability to ask new questions.
Forever.
Aggregates are the devil
Aggregates destroy your precious details.
You need MORE detail and MORE context.
Aggregates
Raw Requests
You can’t hunt needles if your tools don’t handle extreme outliers, aggregation
by arbitrary values in a high-cardinality dimension, super-wide rich context…
Black swans are the norm
you must care about max/min, 99%, 99.9th, 99.99th, 99.999th …
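One way to read "sampling, not aggregation" is sketched below in Python: keep every error and every slow request, keep only one in N of the fast successes, and record the sample rate on each kept event so counts can be re-weighted at read time. The `Sampler` class, its thresholds, and the event fields are hypothetical. The sketch uses deterministic 1-in-N selection so the result is reproducible; a production sampler would more likely choose randomly.

```python
from itertools import count


class Sampler:
    """Keeps all interesting events; keeps 1 in `rate` of the fast, successful ones."""

    def __init__(self, rate_for_fast_ok=10, slow_ms=500):
        self.rate = rate_for_fast_ok
        self.slow_ms = slow_ms
        self._seen = count()  # counts only the boring (fast, successful) events

    def maybe_keep(self, event):
        interesting = event["status"] >= 500 or event["duration_ms"] >= self.slow_ms
        if interesting:
            return {**event, "sample_rate": 1}  # outliers are never dropped
        if next(self._seen) % self.rate == 0:
            return {**event, "sample_rate": self.rate}  # stands in for `rate` events
        return None


# Hypothetical traffic: 100 fast successes, 3 errors, 1 very slow request.
traffic = [{"status": 200, "duration_ms": 10, "user_id": f"u{i}"} for i in range(100)]
traffic += [{"status": 503, "duration_ms": 30, "user_id": "u7"}] * 3
traffic += [{"status": 200, "duration_ms": 2000, "user_id": "u9"}]

sampler = Sampler()
kept = [e for e in map(sampler.maybe_keep, traffic) if e is not None]

estimated_total = sum(e["sample_rate"] for e in kept)  # re-weighted request count
errors_kept = sum(1 for e in kept if e["status"] >= 500)
max_latency = max(e["duration_ms"] for e in kept)
print(len(kept), estimated_total, errors_kept, max_latency)  # 14 104 3 2000
```

Only 14 of 104 events are stored, yet the kept events are still raw: the errors, the maximum latency, and every field on each event survive, which a write-time average would have discarded.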
Raw data examples
“Sum up all the time spent holding the user.*
table lock by INSERT queries, broken down by
user id and the size of the object written, and
show me any users using more than 30% of
the overall row lock.”
“Latency seems elevated for HTTP requests.
Requests can loop recursively back into the
API multiple times; are requests getting
progressively slower as the iteration stack
gets deeper? What is the MAX recursive call
depth, and max latency over the past day? Is
it still growing? What do the 100 slowest have
in common?”
“Show me all the 50x errors broken down by
user id or app id. Show me all the abandoned
carts with the most items in them. Show me
the users rate limited in the past hour, broken
down by browser type or mobile device type
and release version string.”
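Two of the questions above can be sketched against raw events in a few lines of Python. The event shape (`user_id`, `app_id`, `status`, `query_type`, `lock_held_ms`) and both helper functions are assumptions made for illustration; "50x" is read here as status codes 500 through 509.

```python
from collections import Counter, defaultdict

# Hypothetical raw events, one per request.
events = [
    {"user_id": "u1", "app_id": "a1", "status": 200, "query_type": "INSERT", "lock_held_ms": 70},
    {"user_id": "u2", "app_id": "a1", "status": 500, "query_type": "INSERT", "lock_held_ms": 20},
    {"user_id": "u3", "app_id": "a2", "status": 200, "query_type": "INSERT", "lock_held_ms": 10},
    {"user_id": "u2", "app_id": "a2", "status": 503, "query_type": "SELECT", "lock_held_ms": 0},
    {"user_id": "u2", "app_id": "a1", "status": 502, "query_type": "SELECT", "lock_held_ms": 0},
]


def errors_by(events, dimension):
    """Count 50x errors broken down by any field on the event."""
    return Counter(e[dimension] for e in events if 500 <= e["status"] <= 509)


def lock_hogs(events, threshold=0.30):
    """Users whose INSERT queries hold more than `threshold` of total INSERT lock time."""
    held = defaultdict(int)
    for e in events:
        if e["query_type"] == "INSERT":
            held[e["user_id"]] += e["lock_held_ms"]
    total = sum(held.values())
    return {user: ms / total for user, ms in held.items() if ms / total > threshold}


print(errors_by(events, "user_id"))  # Counter({'u2': 3})
print(errors_by(events, "app_id"))   # Counter({'a1': 2, 'a2': 1})
print(lock_hogs(events))             # {'u1': 0.7}
```

Neither question needed a metric to be defined in advance; both are answered by filtering and grouping on fields that were already attached to each event.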
Zero users care what the “system” health is
All users care about THEIR experience.
Nines don’t matter if users aren’t happy.
Nines don’t matter if users aren’t happy.
Nines don’t matter if users aren’t happy.
Nines don’t matter if users aren’t happy.
Nines don’t matter if users aren’t happy.
Raw Requests
Test in production
SWEs own their own services
Services need owners, not operators.
Engineers
Observability:
must be designed for generalist SWEs.
SaaS, APIs, SDKs.
Ops lives on the other side of an API
Operations skills are not optional for software engineers
in 2016. They are not “nice-to-have”,
they are table stakes.
Engineers
Monitoring Instrumentation is part of building software
Engineers
Watch it run in production.
Accept no substitute.
Get used to observing your systems when they AREN’T on fire.
Engineers
Engineers
You win …
Drastically fewer paging alerts!
Monitoring
Dashboards
Metrics
Aggregates
Operators
• Write checks, define thresholds
• Craft dashboard views
• Tag metrics
• Tend to health of the system
• Quickly identify which component is
the problem
• Unknown-unknowns are rare, known-
unknowns the norm
• Failures well understood
Observability
Exploratory
Structured Events
Raw Requests
Engineers
• A vanishingly long tail of black swan
or nearly-impossible events
• Platforms; user-defined server-side
functions or queries; flexibility, outliers
• System may be perfectly functional
yet useless for many users
• Requests may loop back in multiple
times.
• Microservices, polyglot persistence,
IoT, mobile, high variance experience
Old          New
Monitoring   Observability
Dashboards   Exploratory
Metrics      Structured Events
Aggregates   Raw Requests
Operators    Engineers
Thank you!
Observability for Emerging Infra (what got you here won't get you there)

Observability for Emerging Infra (what got you here won't get you there)

  • 1. Observability for Emerging Infra The Future of Observability in Complex Systems
  • 2. @mipsytipsy Operations Engineer “the only good diff is a red diff”
  • 3. @grepory, Monitorama 2016 “Monitoring is dead.” “Monitoring systems have not changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”
  • 5. We don’t *know* what the questions are, all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.
  • 6. @grepory, 2016 This is a outdated model for complex systems.
  • 7. Many catastrophic states exist at any given time. Your system is never entirely ‘up’
  • 8. Observability Observability is about the unknown-unknowns. A system is observable when you can ask any arbitrary question about it and dive deep, explore, follow bread crumbs. Observability is about being able to trace the inner workings, just by observing the outside.
  • 10. Observability is a superset of … How you operate a service. Actionable checks, outputs within parameters, alerts, dashboards. Monitoring How you develop a service to be monitorable & observable. Timers, gauges, counters, events etc. Instrumentation
  • 11. Observability is about making systems … How you track down operational failures and programmatic bugs Debuggable How you answer questions about your systems and understand trends … unrelated to outages or operational issues. Understandable
  • 12. Complexity: Can be inherent or incidental. Arises out of (software * architecture * human process) Annoyingly subjective
  • 14. # of contributors, LoC, rate of change Halstead volume. cyclomatic complexity. size, distribution, users. # of types of components. storage formats. reliability requirements. features. algorithmic complexity. novelty vs maturity of components.
  • 16. monitoring => observability known unknowns => unknown unknowns
  • 17. The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down. DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. “Photos are loading slowly for some people. Why?” Monitoring (LAMP stack) monitor these things
  • 18. Characteristics Monitoring • Known-unknowns predominate • Intuition-friendly • Dashboards are valuable. • Monolithic app, single data source. • The health of the system more or less accurately represents the experience of the individual users. (LAMP stack)
  • 19. Best Practices Monitoring • Lots of actionable active checks and alerts • Proactively notify engineers of failures and warnings • Maintain a runbook for stable production systems
  • 20. “Photos are loading slowly for some people. Why?” (microservices) Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model. Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly. Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity’s sake. wtf do i ‘monitor’ for?! Monitoring?!?
  • 21. Problems Symptoms “I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays.” “All twenty app microservices have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.” “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.” Observability (microservices)
  • 22. Still More Symptoms “Several users in Romania and Eastern Europe are complaining that all push notifications have been down for them … for days.” “Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see; they see someone else’s photo! When they refresh, it’s fixed. Actually, we’ve had a few other people report this too, we just didn’t believe them.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). It’s different every time.” “We run a platform, and it’s hard to programmatically distinguish between problems that users are inflicting themselves and problems in our own code, since they all manifest as the same errors or timeouts.” Observability (microservices)
  • 23. These are all unknown-unknowns that may never have happened before, and may never happen again. They are also the overwhelming majority of what you have to care about for the rest of your life.
  • 24. Characteristics • Unknown-unknowns are most of the problems • “Many” components and storage systems • You cannot model the entire system in your head. Dashboards may be actively misleading. • The hardest problem is often identifying which component(s) to debug or trace. • The health of the system is irrelevant. The health of each individual request is of supreme consequence. (microservices/complex systems) Observability
  • 25. Best Practices • Rich instrumentation. • Events, not metrics. • Sampling, not write-time aggregation. • Few (if any) dashboards. • Test in production.. a lot. • Very few paging alerts. Observability (microservices/complex systems)
  • 26. Why: Instrumentation? Events, not metrics? No dashboards? Sampling, not time series aggregation? Test in production? Fewer alerts?
  • 27. Instrumentation? Start at the edge and work down Internal state from software you didn’t write, too Wrap every network call, every data call Structured data only `gem install` magic will only get you so far
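As a rough illustration of "wrap every network call, every data call" with structured data only, here is a minimal Python sketch. The `instrumented` helper, the `fetch_photo` function, and all field names are hypothetical, made up for this example; they are not from the talk or from any particular library. The idea is simply that each wrapped call produces one structured event carrying its own timing and context.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def instrumented(name, sink, **fields):
    """Time a block and append one structured event describing it to `sink`."""
    event = {"name": name, **fields}
    start = time.perf_counter()
    try:
        yield event  # the caller may attach more context while the call runs
        event["ok"] = True
    except Exception as exc:
        event["ok"] = False
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000.0
        sink.append(event)


def fetch_photo(photo_id, sink):
    # Hypothetical data call; a real one would hit a database or remote service.
    with instrumented("db.fetch_photo", sink, photo_id=photo_id) as ev:
        row = {"id": photo_id, "bytes": 2048}
        ev["rows_returned"] = 1
        return row


events = []
fetch_photo(42, events)
print(json.dumps(events[0], sort_keys=True))
```

Because the event is a dictionary rather than a formatted string, any field added later (user ID, build ID, and so on) becomes something you can filter or group by without parsing.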
  • 28. Events, not metrics? (trick question.. you’ll need both but you’ll rely on events more and more) Cardinality Context Structured data
  • 29. Metrics are cheap, but terribly limited in context or cardinality. Metrics (+tags)
  • 31. UUIDs db raw queries normalized queries comments firstname, lastname PID/PPID app ID device ID HTTP header type build ID IP:port shopping cart ID userid ... etc Some of these … might be … useful … YA THINK??! High cardinality will save your ass. Metrics (cardinality)
  • 32. You must be able to break down by 1/millions and THEN by anything/everything else High cardinality is not a nice-to-have ‘Platform problems’ are now everybody’s problems
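To make "break down by 1/millions and THEN by anything/everything else" concrete, the sketch below groups a handful of made-up events first by a high-cardinality field (`user_id`) and then by other fields for the one user that stands out. The event fields and the `break_down` helper are assumptions for illustration only; a real event store would do this over far more rows.

```python
from collections import defaultdict

# Made-up events; field names are illustrative, not from the talk.
events = [
    {"user_id": "u-1001", "build_id": "b7", "device": "ipad", "duration_ms": 40},
    {"user_id": "u-1002", "build_id": "b7", "device": "android", "duration_ms": 35},
    {"user_id": "u-1003", "build_id": "b8", "device": "ipad", "duration_ms": 900},
    {"user_id": "u-1003", "build_id": "b8", "device": "ipad", "duration_ms": 1100},
    {"user_id": "u-1003", "build_id": "b7", "device": "ipad", "duration_ms": 45},
]


def break_down(rows, *keys):
    """Group rows by any combination of fields and report max latency per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["duration_ms"])
    return {key: max(values) for key, values in groups.items()}


# Step 1: break down by a field with (potentially) millions of distinct values.
by_user = break_down(events, "user_id")
slowest_user = max(by_user, key=by_user.get)[0]

# Step 2: for that one user, break down by anything else.
user_rows = [e for e in events if e["user_id"] == slowest_user]
by_build_and_device = break_down(user_rows, "build_id", "device")

print(slowest_user, by_build_and_device)
```

The second step is only possible because the raw events still carry every field; a pre-tagged metric would have had to anticipate the `user_id` by `build_id` combination in advance.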
  • 33. Looking for a needle in your haystack? Be descriptive, add unique identifiers. Structured Data Read-time aggregation lets you compute percentiles, derived columns.
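A small sketch of what read-time aggregation can look like, assuming raw events are still available when the question is asked. The event fields, the derived `ms_per_kb` column, and the nearest-rank `percentile` helper are illustrative choices, not something the slides prescribe.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Made-up raw events, each with a unique identifier.
events = [
    {"request_id": f"r-{i}", "bytes": 1000 * (i % 5 + 1), "duration_ms": d}
    for i, d in enumerate([12, 15, 11, 14, 13, 16, 12, 18, 950, 17])
]

# Derived column, computed at read time from fields already on the event.
for e in events:
    e["ms_per_kb"] = e["duration_ms"] / (e["bytes"] / 1000)

durations = [e["duration_ms"] for e in events]
p50 = percentile(durations, 50)
p99 = percentile(durations, 99)
worst = max(events, key=lambda e: e["ms_per_kb"])

print(p50, p99, worst["request_id"])
```

Neither the percentile nor the derived column had to be decided when the events were written, and the unique `request_id` leads straight to the one outlier.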
  • 34. Events tell stories. Arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth. Structure your data at the source to reap massive efficiencies over strings. Events (“Logs” are just a transport mechanism for events)
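One way sampling can control cost without losing the ability to count is to record the sample rate on every kept event and weight by it at read time. The sketch below assumes a simple policy (keep every error, keep 1 in 10 successes) and uses a deterministic counter so the result is reproducible; production samplers usually choose randomly, and the policy and field names here are assumptions for illustration.

```python
def sample(events, success_rate=10):
    """Keep every error and 1 in `success_rate` successes, recording the rate on each kept event."""
    kept = []
    seen_ok = 0
    for e in events:
        if e["status"] >= 500:
            kept.append({**e, "sample_rate": 1})
        else:
            seen_ok += 1
            if seen_ok % success_rate == 0:
                kept.append({**e, "sample_rate": success_rate})
    return kept


# 1000 made-up requests, of which every 50th is a server error.
raw = [
    {"status": 500 if i % 50 == 0 else 200, "duration_ms": 10 + i % 7}
    for i in range(1000)
]

kept = sample(raw)

# Each kept event stands in for `sample_rate` original events.
estimated_total = sum(e["sample_rate"] for e in kept)
estimated_errors = sum(e["sample_rate"] for e in kept if e["status"] >= 500)

print(len(kept), estimated_total, estimated_errors)
```

Unlike write-time aggregation, the kept events are still full-width, so they can be broken down by any field afterwards.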
  • 37. Jumps to an answer, instead of starting with a question You don’t know what you don’t know. Dashboards Artifacts of past failures.
  • 41. Aggregation is a one-way trip Destroying raw events eliminates your ability to ask new questions. Forever. Aggregates are the devil
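A tiny worked example of why aggregation is a one-way trip, using made-up latency numbers: two minutes of traffic have identical averages, so once only the average is stored there is no way to tell them apart or to ask a new question about the tail.

```python
from statistics import mean

# Made-up per-request latencies (ms) for two different minutes.
minute_a = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
minute_b = [10, 10, 10, 10, 10, 10, 10, 10, 10, 910]

# Write-time aggregation keeps one number per minute and discards the rest.
avg_a = mean(minute_a)
avg_b = mean(minute_b)

# With the raw values still around, new questions remain answerable.
max_a, max_b = max(minute_a), max(minute_b)
over_500_a = sum(1 for v in minute_a if v > 500)
over_500_b = sum(1 for v in minute_b if v > 500)

print(avg_a, avg_b, max_a, max_b, over_500_a, over_500_b)
```

From the stored averages alone, the 910 ms request in the second minute cannot be recovered.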
  • 42. Aggregates destroy your precious details. You need MORE detail and MORE context. Aggregates
  • 43. Raw Requests You can’t hunt needles if your tools don’t handle extreme outliers, aggregation by arbitrary values in a high-cardinality dimension, super-wide rich context… Black swans are the norm. You must care about max/min, 99th, 99.9th, 99.99th, 99.999th …
  • 44. Raw data examples “Sum up all the time spent holding the user.* table lock by INSERT queries, broken down by user id and the size of the object written, and show me any users using more than 30% of the overall row lock.” “Latency seems elevated for HTTP requests. Requests can loop recursively back into the API multiple times; are requests getting progressively slower as the iteration stack gets deeper? What is the MAX recursive call depth, and max latency over the past day? Is it still growing? What do the 100 slowest have in common?” “Show me all the 50x errors broken down by user id or app id. Show me all the abandoned carts with the most items in them. Show me the users rate limited in the past hour, broken down by browser type or mobile device type and release version string.”
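As a sketch of the third kind of question above ("show me all the 50x errors broken down by user id or app id"), the following assumes raw request events are kept as dictionaries. The field names and the `errors_by` helper are hypothetical, and "50x" is read here as status codes 500 to 509.

```python
from collections import Counter

# Made-up raw request events.
events = [
    {"status": 200, "user_id": "u1", "app_id": "a1"},
    {"status": 502, "user_id": "u2", "app_id": "a1"},
    {"status": 503, "user_id": "u2", "app_id": "a2"},
    {"status": 500, "user_id": "u3", "app_id": "a2"},
    {"status": 404, "user_id": "u3", "app_id": "a2"},
    {"status": 504, "user_id": "u2", "app_id": "a2"},
]


def errors_by(rows, field):
    """Count 50x responses, grouped by whichever field the question calls for."""
    return Counter(r[field] for r in rows if 500 <= r["status"] <= 509)


by_user = errors_by(events, "user_id")
by_app = errors_by(events, "app_id")

print(by_user.most_common(1), by_app.most_common(1))
```

The same function answers both breakdowns because the grouping field is chosen at query time rather than baked in when the data was written.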
  • 45. Zero users care what the “system” health is. All users care about THEIR experience. Nines don’t matter if users aren’t happy. Raw Requests
  • 46. Test in production SWEs own their own services
  • 47. Services need owners, not operators. Observability must be designed for generalist SWEs: SaaS, APIs, SDKs. Ops lives on the other side of an API. Engineers
  • 48. Operations skills are not optional for software engineers in 2016. They are not “nice-to-have”, they are table stakes. Engineers
  • 49. Monitoring Instrumentation is part of building software Engineers
  • 50. Watch it run in production. Accept no substitute. Get used to observing your systems when they AREN’T on fire. Engineers
  • 52. You win … Drastically fewer paging alerts!
  • 53. Monitoring Dashboards Metrics Aggregates Operators • Write checks, define thresholds • Craft dashboard views • Tag metrics • Tend to health of the system • Quickly identify which component is the problem • Unknown-unknowns are rare, known-unknowns the norm • Failures well understood
  • 54. Observability Exploratory Structured Events Raw Requests Engineers • A vanishingly long tail of black swan or nearly-impossible events • Platforms; user-defined server-side functions or queries; flexibility, outliers • System may be perfectly functional yet useless for many users • Requests may loop back in multiple times. • Microservices, polyglot persistence, IoT, mobile, high variance experience
  • 55. Old: Monitoring, Dashboards, Metrics, Aggregates, Operators. New: Observability, Exploratory, Structured Events, Raw Requests, Engineers.