Monitoring 101

This presentation covers three topics related to monitoring. The first is a brief history of monitoring trends and a forecast of where they are heading. The second is a closer look at the inputs, outputs, and techniques for setting SLOs. The third sets out some basic tenets to follow when monitoring systems.

Monitoring 101
THE BASICS
http://l42.org/JQE

Theo Schlossnagle CEO @Circonus
Twitter: @postwait

Agenda
Navigation Skills
Tenets
Service Level Objective
Overview
History

“Monitoring is the action of observing
and checking static and dynamic
properties of a system.”
- HEINRICH HARTMANN (http://l42.org/GwE)

Your System
Is Larger Than Your “Systems”
Technology
Engineering
Operations
HR
Finance
Sales
Marketing

Evolution of Web Monitoring
1995:
Synthetic web page
loads every 15
minutes
2000:
Watching every web
request to understand
“real users”
2015:
Deep analysis on every
transaction to ensure no user left
behind

Evolution of Database Monitoring
1995:
Synthetic queries to
test performance
2005:
Watching every query
request to understand
“real performance”
2015:
Deep analysis of every transaction
to understand overall system
behavior

Evolution of Systems Monitoring
1995:
Looked at avwait
lat (disks)
2015:
Track the latency of every
disk I/O operation
performed

Monitoring Is Sophisticated
Increased
Telemetry
Volume
Advances in Time Series
Databases to store
trillions of samples in a
billion streams.
Advances in Stream
Analytics to handle
velocity at scale for
real-time analysis and
alerting.
More Valuable
Operational
Questions
Data Science is the
future.
Increased volume
mandates computer
assistance where “ops
dashboards” once
worked.
Most sophisticated
modeling: stats,
machine learning, AI,
etc.
Increased
Organizational
Velocity
Systems are decoupled,
distributed and
changing faster.
Understanding overall
systems behavior is like
looking at sand dunes.

Service Level Objectives
SLOs ARE WHAT DRIVE SREs

SLO: usually based on percentiles
• E.g. 95th percentile less than 10ms
• “Simply”: 95% of all samples should be 10ms or less; the other 5% can be arbitrarily bad
• Not “simple”
• Calculated over what period of time (or worse, number of samples)?
• Why 95% and not 99% or 99.9% or 99.34860943%?
• Why 10ms?
• The tragedy of the not-a-histogram histogram:
• There are no right answers, and rarely good ones.
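The "95% of samples at 10ms or less" check above can be sketched as follows. This is a minimal illustration, assuming latencies arrive as a plain list of milliseconds; the function names and the sample data are hypothetical and not from the talk.

```python
def fraction_within(samples_ms, threshold_ms):
    """Return the fraction of samples at or below threshold_ms."""
    if not samples_ms:
        raise ValueError("no samples to assess")
    return sum(1 for s in samples_ms if s <= threshold_ms) / len(samples_ms)


def meets_slo(samples_ms, threshold_ms=10.0, target=0.95):
    """True when at least `target` of the samples are within the threshold."""
    return fraction_within(samples_ms, threshold_ms) >= target


# Hypothetical data: 96 fast requests and 4 arbitrarily slow ones.
latencies = [2.0] * 96 + [5000.0] * 4
print(meets_slo(latencies))  # True: the slow 4% do not count against the SLO
```

Note that the result says nothing about how bad the slow requests were, and it changes with the set of samples passed in, which is why the period of assessment matters.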

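Median Latency
Over 5m Stepping Window

Summary Histogram
30 days and 36 million samples

Time-series Histogram
30 days and 36 million samples

Average Latency
Over 5m Stepping Window

Summary Histogram
2 days and 1.6 million samples

p(95) Latency
Over 5m Stepping Window

p⁻¹(10ms) Latency
Over 5m Stepping Window

p⁻¹(50ms) Latency
Over 5m Stepping Window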
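The last two chart titles appear to refer to an inverse percentile: rather than asking which latency 95% of requests beat, it asks what share of requests beat a given latency. The sketch below contrasts the two under that reading; the nearest-rank percentile method and the sample window are assumptions chosen for illustration.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def inverse_percentile(samples, threshold):
    """Percentage of samples at or below threshold."""
    return 100.0 * sum(1 for s in samples if s <= threshold) / len(samples)


# Hypothetical latencies (ms) from a single 5-minute window.
window = [3, 4, 4, 6, 7, 8, 9, 12, 40, 55]
print(percentile(window, 95))          # 55
print(inverse_percentile(window, 10))  # 70.0
print(inverse_percentile(window, 50))  # 90.0
```

The inverse form maps directly onto an SLO stated as "N% of requests within X ms", since its output is the N to compare against the target.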

Time Matters
The time quantum you use to assess
is your minimum window of failure.
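To illustrate why the time quantum matters, the sketch below assesses the same hypothetical hour of traffic in one 60-minute window and in twelve 5-minute windows. The data, the 10ms threshold, and the 95% target are assumptions made for the example.

```python
def window_compliance(timed_samples, window_s, threshold_ms=10.0, target=0.95):
    """Map each window start (seconds) to whether that window met the target."""
    buckets = {}
    for ts, latency_ms in timed_samples:
        start = int(ts // window_s) * window_s
        good, total = buckets.get(start, (0, 0))
        buckets[start] = (good + (latency_ms <= threshold_ms), total + 1)
    return {
        start: good / total >= target
        for start, (good, total) in sorted(buckets.items())
    }


# Hypothetical hour: one request per second, with a 2-minute slowdown at 600s.
samples = [(t, 500.0 if 600 <= t < 720 else 3.0) for t in range(3600)]

hourly = window_compliance(samples, 3600)
five_min = window_compliance(samples, 300)
print(hourly)                                               # {0: True}
print([start for start, ok in five_min.items() if not ok])  # [600]
```

Assessed hourly, the two-minute slowdown disappears into a passing window; assessed every five minutes, it is a failure.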

Uncertainty Matters
You will certainly want to revise your goals,
likely in all parametric space.

Histograms Matter
You cannot manage percentile-based SLOs at scale without
histograms.

Do not measure rates.
You can derive the rate of change over time at query time.
#1
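One way to read this tenet: store the raw cumulative counter and compute the rate for whatever interval a question needs. The sketch below assumes counter samples sorted by timestamp and no counter resets; the function name and data are hypothetical.

```python
def rate_per_second(counter_samples, start_ts, end_ts):
    """Average rate between the first and last samples inside [start_ts, end_ts]."""
    inside = [(ts, v) for ts, v in counter_samples if start_ts <= ts <= end_ts]
    if len(inside) < 2:
        raise ValueError("need at least two samples in the interval")
    (t0, v0), (t1, v1) = inside[0], inside[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical cumulative request counter, sampled every 60 seconds.
counter = [(0, 1000), (60, 1600), (120, 2800), (180, 3100)]
print(rate_per_second(counter, 0, 60))   # 10.0 requests/s over the first minute
print(rate_per_second(counter, 0, 180))  # about 11.67 requests/s over three minutes
```

Because the counter itself is stored, the same data answers both questions; a pre-computed per-minute rate could not be re-derived for a different interval without loss.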

Monitor outside the tech stack.
Your tech stack would not exist without happy customers and a sales pipeline.
Monitor that which is important to the health of your organization.
#2

Do not silo data.
The behavior of the parts must be put in context.
Correlating disparate systems and even business outcomes is critical.
#3

Value observation of real work
over the measurement of synthesized work.
#4

Synthesize work to ensure function
for business critical, low-volume events.
#5

Percentiles are not histograms.
For robust SLO management
you need to store histograms for post-processing.
#6
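A small sketch of why stored percentiles cannot stand in for histograms: per-window percentiles cannot be combined into a correct overall percentile, while histograms merge by adding bin counts. The fixed 1ms bin width and the two windows of data are assumptions for illustration, not the binning scheme used in the talk.

```python
import math
from collections import Counter


def histogram(samples_ms, bin_ms=1.0):
    """Count samples per fixed-width bin, keyed by the bin's lower edge."""
    return Counter(math.floor(s / bin_ms) * bin_ms for s in samples_ms)


def percentile_from_histogram(hist, q, bin_ms=1.0):
    """Upper edge of the bin that contains the q-th percentile."""
    needed = math.ceil(q * sum(hist.values()) / 100)
    seen = 0
    for lower in sorted(hist):
        seen += hist[lower]
        if seen >= needed:
            return lower + bin_ms
    raise ValueError("empty histogram")


# Hypothetical windows: a quiet one and a busy one with a slow tail.
quiet = histogram([2.5] * 100)
busy = histogram([2.5] * 800 + [80.5] * 200)

averaged = (percentile_from_histogram(quiet, 95)
            + percentile_from_histogram(busy, 95)) / 2
merged = quiet + busy  # Counter addition sums the counts bin by bin

print(averaged)                               # 42.0
print(percentile_from_histogram(merged, 95))  # 81.0
```

Averaging the two stored p95 values gives 42ms, while the p95 of the combined traffic is about 81ms; only the merged histogram recovers the latter.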

History is critical;
not weeks or months, but years of detailed history.
Capacity planning, retrospectives, comparative analysis,
and modelling all rely on accurate, high-fidelity history.
#7

Alerts require documentation.
No ruleset should trigger an alert without:
human-readable explanation
business impact description
remediation procedure
escalation documentation
#8
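This tenet can be enforced mechanically, for example by refusing to load any alert rule that lacks the four pieces of documentation listed above. The sketch below is one hypothetical shape for such a check; the class, field names, and sample rule are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    condition: str
    explanation: str      # human-readable explanation
    business_impact: str  # business impact description
    remediation: str      # remediation procedure
    escalation: str       # escalation documentation


def missing_documentation(rule):
    """Return the names of documentation fields that are empty or blank."""
    required = ("explanation", "business_impact", "remediation", "escalation")
    return [name for name in required if not getattr(rule, name).strip()]


# Hypothetical rule with two documentation fields left empty.
rule = AlertRule(
    name="checkout-latency",
    condition="p95 > 10ms for 5m",
    explanation="Checkout requests are slower than the SLO allows.",
    business_impact="",
    remediation="Follow the checkout latency runbook.",
    escalation="",
)
print(missing_documentation(rule))  # ['business_impact', 'escalation']
```

A deployment step could reject any ruleset for which this list is non-empty.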

Be outside the blast radius.
The purpose of monitoring is to detect changes in behavior and assist in
answering operational questions.
#9

Something is better than nothing.
Don’t let perfect be the enemy of good.
You have to start somewhere.
#10
