SlideShare a Scribd company logo
1 of 52
Download to read offline
Observability and the Glorious Future
The Future of Observability in Complex Systems **
** Otherwise Known As Your Systems
Observability and the Glorious Future
The Future of Observability in Complex Systems **
** Otherwise Known As Your Systems
@mipsytipsy
engineer, cofounder, CEO
@mipsytipsy
Hates monitoring
Not a monitoring company
refactor slides
Monitoring
Observability
What’s changed?
Complexity.
We don’t *know* what the questions are, all
we have are unreliable symptoms or reports.
Complexity is exploding everywhere,
but our tools are designed for
a predictable world.
As soon as we know the question, we usually
know the answer too.
DBAs and Ops
Full stack instrumentation
you need strace for systems
The app tier capacity is exceeded. There was
a big traffic spike, or maybe we rolled out a
performance degradation, or maybe some
app instances are down.
Connections to the database are slower than
normal, causing connections to timeout and
latency to rise. Maybe we deployed a bad
query, or the RAID array is degraded, or there
is lock contention on a critical row.
Errors or latency are high. We will run through
many dashboards built to surface a large
number of possible causes that we have
predicted.
“Photos are loading slowly for some people. Why?”
(LAMP stack edition)
“Photos are loading slowly for some people. Why?”
(microservices edition)
On one of our 50 microservices, one node is
running on degraded hardware, causing every
request to take 50 seconds to complete but
without generating a timeout error. This is just
1 of 10k nodes, but disproportionately
impacts people looking at older archives.
They aren’t. But Canadian users running a
French language pack on a particular version
of iPhone hardware are hitting a firmware
condition which makes them unable to save
local cache, which is why it FEELS like photos
are loading slowly
Our newest SDK makes additional sequential
db queries if the developer has enabled an
optional feature. Working as intended, but
sucks; needs refactoring.
wtf do i ‘monitor’ for?
Problems Symptoms
"I have twenty microservices and a sharded
db and three other data stores in three
regions, and everything seems to be getting a
little bit slower but nothing changed that we
know of, and latency is usually fine on
Tuesdays.
“All twenty app micro services have 10% of
available nodes enter a simultaneous crash
loop cycle, about five times a day, at
unpredictable intervals. They have nothing in
common afaik and it doesn’t seem to impact
the stateful services. It clears up before we
can debug it, every time.”
“Our users can compose their own queries that
we execute server-side, and we don’t surface it
to them when they are accidentally doing full
table scans or even multiple full table scans, so
they blame us.”
Your system is never entirely ‘up’
Many catastrophic states exist at any given time.
there are no more easy problems in the future,
there are only hard problems.
(Duh … you fixed the easy ones. :) )
Monitoring
Observability
must be exploratory and open-ended.
Observability:
not dashboard-centric or prescriptive.
you don’t know what you don’t know.
If there’s a schema or an index involved, it’s not futureproof.
Gather everything.
exploratory, interrogatory, questioning:
Debug by asking questions, not by muscle memory
Can you ask arbitrary open-ended questions
and play with them?
Context is *everything*, preserve it.
Quit debugging with your eyeballs,
start debugging with data
It will make you a better engineer!
Replaceable!!
must be people-first and consumer-quality
Observability:
tools must draw on your intuition and habits
rich history, sharing, social features
Don’t make everyone be an expert.
(they won’t be, and that’s ok)
Debugging is a social act.
solving new problems is cognitively expensive. sharing is not.
Our tools must tap into our sense of joy, play,
performance, community, solidarity.
Bring everyone up to the level of the best debuggers.
must be event-driven, not pre-aggregated.
Observability:
High cardinality is a must.
Structured data is absolutely assumed.
Get used to sampling at scale.
Events tell stories.
Arbitrarily wide events mean you can amass more and more context
over time. Use sampling to control costs and bandwidth.
“Logs” are just a transport mechanism for events!
Aggregates destroy your precious details.
You need MORE detail and MORE context.
Tags: not good enough
(Yes, you can have aggregates for percentiles; you just
have to do read-time aggregation.)
You can’t hunt needles if your tools don’t handle extreme outliers, aggregation
by arbitrary values in a high-cardinality dimension, super-wide rich context…
(they don’t)
You must be able to break down by 1/millions and
THEN by anything/everything else
High cardinality is not a nice-to-have
‘Platform problems’ are now everybody’s problems
Black swans are the norm
you must care about max/min, 99%, 99.9th, 99.99th, 99.999th …
Structure your god damn events like it’s 2017
Structure them at the *source*
must be a lingua franca, spanning teams
Observability:
no boundaries between vendor software and your code
don’t create yet another silo
share tools across the stack.
can your android devs debug Redis and vice versa?
If your tools don’t give you the ability to correlate across
disparate systems, vendor and application data alike, whether
you have control over the underlying software or not …
they’re broken
must be designed for generalist SWEs.
Observability:
SaaS, APIs, SDKs.
not designed for ops.
Ops lives on the other side of an API
Operations skills are not optional for software engineers
in 2016. They are not “nice-to-have”,
they are table stakes.
Cultivate a team of software engineers who
value operational excellence.
Watch it run in production.
Accept no substitute.
Get used to observing your systems when they AREN’T on fire.
Your reward:
Drastically fewer paging alerts
Do you really need more than end to end
checks of your SLAs? Really?
Wake up a human only when customers
are impacted
there are no more easy problems in the future,
there are only hard problems.
(Duh … you fixed the easy ones. :) )
~@grepory, Monitorama 2016, paraphrased
“Just get used to thinking about your
system like it’s a distributed system,
and you’ll mostly be okay.”
high cardinality
high dimensionality
event-driven
structured
ad hoc
social
fun.
Glorious Future™
“Monitoring” is dead and good riddance
“Observability” is TDD for production
Don’t ship without it.
i sketched up a @honeycombio redis
connector in a few lines of shell
OLD: static dashboards & monitoring
NEW: exploratory debugging & observability
Charity Majors
@mipsytipsy
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future
RedisConf17 - Observability and the Glorious Future

More Related Content

What's hot

Sensu at brightpearl
Sensu at brightpearlSensu at brightpearl
Sensu at brightpearl
David Tibbs
 

What's hot (20)

Scalabe MySQL Infrastructure
Scalabe MySQL InfrastructureScalabe MySQL Infrastructure
Scalabe MySQL Infrastructure
 
Give A Great Tech Talk 2013
Give A Great Tech Talk 2013Give A Great Tech Talk 2013
Give A Great Tech Talk 2013
 
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
 
Happy Browser, Happy User! WordSesh 2019
Happy Browser, Happy User! WordSesh 2019Happy Browser, Happy User! WordSesh 2019
Happy Browser, Happy User! WordSesh 2019
 
NoSQL isn't NO SQL
NoSQL isn't NO SQLNoSQL isn't NO SQL
NoSQL isn't NO SQL
 
Scylla’s Journey Towards Being an Elastic Cloud Native Database
Scylla’s Journey Towards Being an Elastic Cloud Native DatabaseScylla’s Journey Towards Being an Elastic Cloud Native Database
Scylla’s Journey Towards Being an Elastic Cloud Native Database
 
How We Made Scylla Maintenance Easier, Safer and Faster
How We Made Scylla Maintenance Easier, Safer and FasterHow We Made Scylla Maintenance Easier, Safer and Faster
How We Made Scylla Maintenance Easier, Safer and Faster
 
Sensu at brightpearl
Sensu at brightpearlSensu at brightpearl
Sensu at brightpearl
 
Apcera Case Study: The selection of the Go language
Apcera Case Study: The selection of the Go languageApcera Case Study: The selection of the Go language
Apcera Case Study: The selection of the Go language
 
Scylla Cloud on Display: Functionality, Performance and Demos
Scylla Cloud on Display: Functionality, Performance and DemosScylla Cloud on Display: Functionality, Performance and Demos
Scylla Cloud on Display: Functionality, Performance and Demos
 
Mobile 3: Launch Like a Boss!
Mobile 3: Launch Like a Boss!Mobile 3: Launch Like a Boss!
Mobile 3: Launch Like a Boss!
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
 
Elassandra
ElassandraElassandra
Elassandra
 
Care and feeding notes
Care and feeding notesCare and feeding notes
Care and feeding notes
 
Open Source Monitoring Tools Shootout
Open Source Monitoring Tools ShootoutOpen Source Monitoring Tools Shootout
Open Source Monitoring Tools Shootout
 
Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)
 
The Open-Source Monitoring Landscape
The Open-Source Monitoring LandscapeThe Open-Source Monitoring Landscape
The Open-Source Monitoring Landscape
 
Ensuring Consistency in a Replicated World
Ensuring Consistency in a Replicated WorldEnsuring Consistency in a Replicated World
Ensuring Consistency in a Replicated World
 
Scylla Summit 2018: OLAP or OLTP? Why Not Both?
Scylla Summit 2018: OLAP or OLTP? Why Not Both?Scylla Summit 2018: OLAP or OLTP? Why Not Both?
Scylla Summit 2018: OLAP or OLTP? Why Not Both?
 
Migrating big data
Migrating big dataMigrating big data
Migrating big data
 

Similar to RedisConf17 - Observability and the Glorious Future

Continuous Delivery
Continuous DeliveryContinuous Delivery
Continuous Delivery
Stein Inge Morisbak
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013
Aron Ahmadia
 
Reactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and GrailsReactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and Grails
Steve Pember
 

Similar to RedisConf17 - Observability and the Glorious Future (20)

Chaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just Chaos
 
Observability for Emerging Infra (what got you here won't get you there)
Observability for Emerging Infra (what got you here won't get you there)Observability for Emerging Infra (what got you here won't get you there)
Observability for Emerging Infra (what got you here won't get you there)
 
Moving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed TracesMoving to Microservices with the Help of Distributed Traces
Moving to Microservices with the Help of Distributed Traces
 
From 🤦 to 🐿️
From 🤦 to 🐿️From 🤦 to 🐿️
From 🤦 to 🐿️
 
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
 
From DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed ApidaysFrom DevOps to NoOps how not to get Equifaxed Apidays
From DevOps to NoOps how not to get Equifaxed Apidays
 
How to be an inefficient developer? - Petar Ducic (Infobip)
How to be an inefficient developer? - Petar Ducic (Infobip)How to be an inefficient developer? - Petar Ducic (Infobip)
How to be an inefficient developer? - Petar Ducic (Infobip)
 
SAD15 - Maintenance
SAD15 - MaintenanceSAD15 - Maintenance
SAD15 - Maintenance
 
SAD01 - An Introduction to Systems Analysis and Design
SAD01 - An Introduction to Systems Analysis and DesignSAD01 - An Introduction to Systems Analysis and Design
SAD01 - An Introduction to Systems Analysis and Design
 
How to fix bug or defects in software
How to fix bug or defects in software How to fix bug or defects in software
How to fix bug or defects in software
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
When Things Go Bump in the Night
When Things Go Bump in the NightWhen Things Go Bump in the Night
When Things Go Bump in the Night
 
From SLO to GOTY
From SLO to GOTYFrom SLO to GOTY
From SLO to GOTY
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Continuous Delivery
Continuous DeliveryContinuous Delivery
Continuous Delivery
 
ROOTS2011 Continuous Delivery
ROOTS2011 Continuous DeliveryROOTS2011 Continuous Delivery
ROOTS2011 Continuous Delivery
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
 
Humane assessment on cards
Humane assessment on cardsHumane assessment on cards
Humane assessment on cards
 
Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013Software Carpentry and the Hydrological Sciences @ AGU 2013
Software Carpentry and the Hydrological Sciences @ AGU 2013
 
Reactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and GrailsReactive Microservice Architecture with Groovy and Grails
Reactive Microservice Architecture with Groovy and Grails
 

More from Redis Labs

SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
Redis Labs
 
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Redis Labs
 
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
Redis Labs
 
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
Redis Labs
 

More from Redis Labs (20)

Redis Day Bangalore 2020 - Session state caching with redis
Redis Day Bangalore 2020 - Session state caching with redisRedis Day Bangalore 2020 - Session state caching with redis
Redis Day Bangalore 2020 - Session state caching with redis
 
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
Protecting Your API with Redis by Jane Paek - Redis Day Seattle 2020
 
The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...
The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...
The Happy Marriage of Redis and Protobuf by Scott Haines of Twilio - Redis Da...
 
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
SQL, Redis and Kubernetes by Paul Stanton of Windocks - Redis Day Seattle 2020
 
Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...
Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...
Rust and Redis - Solving Problems for Kubernetes by Ravi Jagannathan of VMwar...
 
Redis for Data Science and Engineering by Dmitry Polyakovsky of Oracle
Redis for Data Science and Engineering by Dmitry Polyakovsky of OracleRedis for Data Science and Engineering by Dmitry Polyakovsky of Oracle
Redis for Data Science and Engineering by Dmitry Polyakovsky of Oracle
 
Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020
Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020
Practical Use Cases for ACLs in Redis 6 by Jamie Scott - Redis Day Seattle 2020
 
Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020
Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020
Moving Beyond Cache by Yiftach Shoolman Redis Labs - Redis Day Seattle 2020
 
Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...
Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...
Leveraging Redis for System Monitoring by Adam McCormick of SBG - Redis Day S...
 
JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...
JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...
JSON in Redis - When to use RedisJSON by Jay Won of Coupang - Redis Day Seatt...
 
Highly Available Persistent Session Management Service by Mohamed Elmergawi o...
Highly Available Persistent Session Management Service by Mohamed Elmergawi o...Highly Available Persistent Session Management Service by Mohamed Elmergawi o...
Highly Available Persistent Session Management Service by Mohamed Elmergawi o...
 
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
Anatomy of a Redis Command by Madelyn Olson of Amazon Web Services - Redis Da...
 
Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...
Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...
Building a Multi-dimensional Analytics Engine with RedisGraph by Matthew Goos...
 
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
RediSearch 1.6 by Pieter Cailliau - Redis Day Bangalore 2020
 
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
RedisGraph 2.0 by Pieter Cailliau - Redis Day Bangalore 2020
 
RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020
RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020
RedisTimeSeries 1.2 by Pieter Cailliau - Redis Day Bangalore 2020
 
RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020
RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020
RedisAI 0.9 by Sherin Thomas of Tensorwerk - Redis Day Bangalore 2020
 
Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...
Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...
Rate-Limiting 30 Million requests by Vijay Lakshminarayanan and Girish Koundi...
 
Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...
Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...
Three Pillars of Observability by Rajalakshmi Raji Srinivasan of Site24x7 Zoh...
 
Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...
Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...
Solving Complex Scaling Problems by Prashant Kumar and Abhishek Jain of Myntr...
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

RedisConf17 - Observability and the Glorious Future

  • 1. Observability and the Glorious Future The Future of Observability in Complex Systems ** ** Otherwise Known As Your Systems
  • 2. Observability and the Glorious Future The Future of Observability in Complex Systems ** ** Otherwise Known As Your Systems
  • 4. @mipsytipsy Hates monitoring Not a monitoring company refactor slides
  • 7.
  • 8. We don’t *know* what the questions are, all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.
  • 9. DBAs and Ops Full stack instrumentation you need strace for systems
  • 10. The app tier capacity is exceeded. There was a big traffic spike, or maybe we rolled out a performance degradation, or maybe some app instances are down. Connections to the database are slower than normal, causing connections to timeout and latency to rise. Maybe we deployed a bad query, or the RAID array is degraded, or there is lock contention on a critical row. Errors or latency are high. We will run through many dashboards built to surface a large number of possible causes that we have predicted. “Photos are loading slowly for some people. Why?” (LAMP stack edition)
  • 11.
  • 12. “Photos are loading slowly for some people. Why?” (microservices edition) On one of our 50 microservices, one node is running on degraded hardware, causing every request to take 50 seconds to complete but without generating a timeout error. This is just 1 of 10k nodes, but disproportionately impacts people looking at older archives. They aren’t. But Canadian users running a French language pack on a particular version of iPhone hardware are hitting a firmware condition which makes them unable to save local cache, which is why it FEELS like photos are loading slowly Our newest SDK makes additional sequential db queries if the developer has enabled an optional feature. Working as intended, but sucks; needs refactoring. wtf do i ‘monitor’ for?
  • 13. Problems Symptoms "I have twenty microservices and a sharded db and three other data stores in three regions, and everything seems to be getting a little bit slower but nothing changed that we know of, and latency is usually fine on Tuesdays. “All twenty app micro services have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.” “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”
  • 14. Your system is never entirely ‘up’ Many catastrophic states exist at any given time.
  • 15. there are no more easy problems in the future, there are only hard problems. (Duh … you fixed the easy ones. :) )
  • 17. must be exploratory and open-ended. Observability: not dashboard-centric or prescriptive. you don’t know what you don’t know. If there’s a schema or an index involved, it’s not futureproof. Gather everything.
  • 18. exploratory, interrogatory, questioning: Debug by asking questions, not by muscle memory Can you ask arbitrary open-ended questions and play with them? Context is *everything*, preserve it.
  • 19. Quit debugging with your eyeballs, start debugging with data It will make you a better engineer! Replaceable!!
  • 20. must be people-first and consumer-quality Observability: tools must draw on your intuition and habits rich history, sharing, social features
  • 21. Don’t make everyone be an expert. (they won’t be, and that’s ok)
  • 22. Debugging is a social act. solving new problems is cognitively expensive. sharing is not. Our tools must tap into our sense of joy, play, performance, community, solidarity. Bring everyone up to the level of the best debuggers.
  • 23. must be event-driven, not pre-aggregated. Observability: High cardinality is a must. Structured data is absolutely assumed. Get used to sampling at scale.
  • 24. Events tell stories. Arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth. “Logs” are just a transport mechanism for events!
  • 25. Aggregates destroy your precious details. You need MORE detail and MORE context. Tags: not good enough (Yes, you can have aggregates for percentiles; you just have to do read-time aggregation.)
  • 26.
  • 27. You can’t hunt needles if your tools don’t handle extreme outliers, aggregation by arbitrary values in a high-cardinality dimension, super-wide rich context… (they don’t)
  • 28. You must be able to break down by 1/millions and THEN by anything/everything else High cardinality is not a nice-to-have ‘Platform problems’ are now everybody’s problems
  • 29. Black swans are the norm you must care about max/min, 99%, 99.9th, 99.99th, 99.999th …
  • 30. Structure your god damn events like it’s 2017 Structure them at the *source*
  • 31. must be a lingua franca, spanning teams Observability: no boundaries between vendor software and your code don’t create yet another silo share tools across the stack. can your android devs debug Redis and vice versa?
  • 32. If your tools don’t give you the ability to correlate across disparate systems, vendor and application data alike, whether you have control over the underlying software or not … they’re broken
  • 33. must be designed for generalist SWEs. Observability: SaaS, APIs, SDKs. not designed for ops. Ops lives on the other side of an API
  • 34. Operations skills are not optional for software engineers in 2016. They are not “nice-to-have”, they are table stakes.
  • 35. Cultivate a team of software engineers who value operational excellence.
  • 36. Watch it run in production. Accept no substitute. Get used to observing your systems when they AREN’T on fire.
  • 37. Your reward: Drastically fewer paging alerts Do you really need more than end to end checks of your SLAs? Really? Wake up a human only when customers are impacted
  • 38. there are no more easy problems in the future, there are only hard problems. (Duh … you fixed the easy ones. :) )
  • 39. ~@grepory, Monitorama 2016, paraphrased “Just get used to thinking about your system like it’s a distributed system, and you’ll mostly be okay.”
  • 41. “Monitoring” is dead and good riddance “Observability” is TDD for production Don’t ship without it.
  • 42. i sketched up a @honeycombio redis connector in a few lines of shell
  • 43.
  • 44. OLD: static dashboards & monitoring
  • 45. NEW: exploratory debugging & observability