Monitoring & Observability
Getting off the starting blocks.
Wednesday, June 19, 13
THE MANY FACES OF THEO
FUN WITH BEARDS AND HAIR
Photo captions: FUCK IT ALL, VENDETTA, SCARY, DETERMINED, CAREFREE, NO-FLY ZONE
Agenda
Define stuff.
Set some tenets.
Discuss and implement some tenets.
Answer a lot of questions.
Monitoring... what it is.
We’ll get to that.
Observability
Being able to measure “things” or
witness state changes.
Not useful if doing so alters behavior (significantly).
Development & Production
For the rest of this talk...
There is only production.
Data & Information Terms
Measurement: a single measurement of something
a value on which numerical operations make sense:
1, -110, 1.234123, 9.886e-19, 0, null
or a text value, on which they do not:
“200”, “304”, “v1.234”, “happy”, null
Data & Information Terms
Metric: something that you are measuring
The version of deployed code
Total cost on Amazon services
total bugs filed, bug backlog
Total queries executed
Notice no rates
DO NOT STORE RATES.
Measurement Velocity
The rate at which new measurements are taken.
Perspective
Sometimes perspective matters
page load times, DNS queries,
consider RUM (real user monitoring)
Usually it does not
total requests made against a web server
Visualization
The assimilation of
multiple measurements into
a visual representation.
Trending
Understanding the
“direction” of series of measurements on a metric.
Here direction is loose and means “pattern within.”
Alerting
To bring something to one’s attention.
Anomaly Detection
The determination that a
specific measurement is
not within reason.
Monitoring... what it is.
All of that.
Review
Measurement
Measurement Velocity
Metric
Perspective
Visualization
Trending
Alerting
Anomaly Detection
Observability
Monitoring
Some Tenets
Most people suck at monitoring.
They monitor all the wrong things (somewhat bad)
They don’t monitor the important things (awful)
Do not collect rates of things
Rates are like trees making sounds falling in the forest.
Direct measurement of rates leads to data loss
and ultimately ignorance.
Prefer high level telemetry
1. Business drivers via KPIs,
2. Team KPIs,
3. Staff KPIs,
4. ... then telemetry from everything else.
Implementation
Herein it gets tricky.
Only because of the tools.
I could show you how to use tool X, or Y or Z.
But I wrote Reconnoiter and founded Circonus
because X, Y and Z didn’t meet my needs.
Reconnoiter is open.
Circonus is a service.
Methodology
I’m going to focus on methodology
that can be applied across whatever toolset you have.
Pull vs. Push
Anyone who says one is better than the other is...
WRONG.
They both have their uses.
Reasons for pull
1. Synthesized observation is desirable.
2. Observable activity is infrequent.
3. Alterations in observation frequency are useful.
Reasons for push
Direct observation is desirable.
Discrete observed actions are useful.
Discrete observed actions are frequent.
False reasons.
Polling doesn’t scale.
Protocol Soup
The great thing about standards is...
there are so many to choose from.
Protocol Soup
SNMP(v1,v2,v3) both push(trap) and pull(query)
collectd(v4,v5) push only
statsd push only
JMX, JDBC, ICMP, DHCP, NTP, SSH, TCP, UDP, barf.
Color me RESTy
Use JSON.
HTTP(s) PUT/POST somewhere for push
HTTP(s) GET something for pull
High-volume Data
Occasionally, data velocity is beyond what’s reasonable
for individual HTTP PUT/POST for each observation.
1. You can fall back to UDP (try statsd)
2. I prefer to batch them and continue to use REST
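One way to picture the batching option is the sketch below. It is only an illustration under stated assumptions: `createBatcher`, the `send` callback, the payload shape and the metric name are all hypothetical, and `send` stands in for whatever HTTP PUT/POST call your toolset provides.

```javascript
// Sketch: accumulate observations and ship them as one JSON payload per flush.
// `send` is a hypothetical callback; in practice it might wrap an HTTP POST.
function createBatcher(maxBatchSize, send) {
  var pending = [];

  // Send everything collected so far as a single JSON array.
  function flush() {
    if (pending.length === 0) return;
    var batch = pending;
    pending = [];
    send(JSON.stringify(batch));
  }

  // Record one observation; flush automatically when the batch is full.
  function observe(metric, value, timestampMs) {
    pending.push({ metric: metric, value: value, ts: timestampMs });
    if (pending.length >= maxBatchSize) flush();
  }

  return { observe: observe, flush: flush };
}

// Usage: collect payloads in memory instead of sending them anywhere.
var sentPayloads = [];
var batcher = createBatcher(3, function (body) { sentPayloads.push(body); });
batcher.observe("api.latency_ms", 12, 1000);
batcher.observe("api.latency_ms", 15, 1001);
batcher.observe("api.latency_ms", 9, 1002);  // third observation triggers a flush
batcher.observe("api.latency_ms", 30, 1003);
batcher.flush();                              // flush the remainder explicitly
```

A real batcher would probably also flush on a timer so that a quiet period does not strand observations in memory.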
nad
nad is great. use nad.
https://github.com/circonus-labs/nad
Think of it like an SNMP that’s
actually Simple
Monitoring not Management
and trivially extended to suit your needs
nad online example
To the Internet ➥
But wait...
nad isn’t methodology...
it’s technology.
Correct...
Back to the topic.
I talked about nad briefly to provide a
super simple tool to erase the question:
“but how?”
The real question is: “what?”
What should I be monitoring?
This is the best question you can ask yourself.
Before you start.
While you’re implementing.
After you’re done.
The industry answer:
MONITOR ALL THE THINGS!
I’ll tell you this too, in fact.
But we have put the cart ahead of the horse.
Question?
If I could monitor one thing, what would it be?
hint: CPU utilization on your web server ain’t it.
Answer:
It depends on your business.
If you don’t know the answer to this,
I suggest you stop worrying about monitoring
and start worrying about WTF your company does.
Here, we can’t continue.
Unless I make stuff up...
So, here I go makin’ stuff up.
Let us assume
we run a web site where customers buy products
Monitoring purchases.
So, we should monitor how many purchases were
made and ensure it is within acceptable levels.
Not so fast.
Actually.
We want to make sure customers
can purchase from the site and
are purchasing from the site.
This semantic difference is critically important.
And choosing which comes down to velocity.
What is this velocity thing?
Displacement / time
(i.e. purchases/second or $/second)
BUT WAIT! You said:
“Do not collect rates of things.”
Correct...
collect the displacement,
visualize and alert on the rate.
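To make "collect the displacement" concrete, here is a minimal sketch that stores only a cumulative counter and derives the rate at read time. The sample shape (`ts` in seconds, `total` as a running count) and the function name are assumptions for illustration, not any particular tool's format.

```javascript
// Sketch: derive per-second rates from stored cumulative counter samples.
// Each sample is { ts: seconds, total: cumulative count } (a hypothetical shape).
function ratesFromCounter(samples) {
  var rates = [];
  for (var i = 1; i < samples.length; i++) {
    var dt = samples[i].ts - samples[i - 1].ts;
    var delta = samples[i].total - samples[i - 1].total;
    // A negative delta suggests the counter reset (e.g. a restart); skip that interval.
    if (dt <= 0 || delta < 0) continue;
    rates.push({ ts: samples[i].ts, perSecond: delta / dt });
  }
  return rates;
}

// Usage: total purchases sampled once a minute, with one counter reset.
var purchases = [
  { ts: 0, total: 100 },
  { ts: 60, total: 220 },
  { ts: 120, total: 250 },
  { ts: 180, total: 5 },   // counter reset
  { ts: 240, total: 65 }
];
var purchaseRates = ratesFromCounter(purchases);
```

Because the totals are what is stored, a missed sample only widens one interval; the purchases made during the gap still show up in the next delta.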
So which?
High velocity w/ predictably smooth trends:
velocity is more important
Low velocity or uneven arrival rates:
measuring capability is more important
To rephrase
If you have sufficient real data,
observing that data works best;
otherwise, you must
synthesize data and monitor that.
As a tenet.
Always synthesize.
additionally observe real data when possible
More demonstrable
(in a short session)
I’ve got a web site that my customers need to visit.
The business understands that we need to serve
customers with at least a basic level of QoS:
no page loads over 4s
Active checks.
A first attempt
curl http://surge.omniti.com/
extract the HTTP response code
if 200, we’re super good!
Admittedly not so good.
A wealth of data.
Synthesizing an HTTPS GET could provide:
SSL Subject, validity, expiration
HTTP code, Headers and Content
Timings on TCP connection, first byte, full payload
Still, this is highly imperfect.
Don’t get me wrong, they are useful.
We use them all over the place... they are cheap.
But, ideally, you want to load the page closer to the
way a user does (all assets, javascript, etc.)
Enter phantomjs
var page = require('webpage').create();
page.viewportSize = { width: 1024, height: 768 };
page.onError = function(err) { stats.errors++; };
page.onInitialized =
function() { start = new Date(); };
page.onLoadStarted =
function() { stats.load_started = new Date() - start; };
page.onLoadFinished =
function() { stats.load_finished = new Date() - start; };
page.onResourceRequested = function() { stats.resources++; };
page.onResourceError = function(err) { stats.resource_errors++; };
page.onUrlChanged = function() { stats.url_redirects++; };
page.open('http://surge.omniti.com/', function(status) {
stats.status = status;
stats.duration = new Date() - start;
console.log(JSON.stringify(stats));
phantom.exit();
});
var start, stats = {
status: null
, errors: 0
, load_started: null
, load_finished: null
, resources: 0
, resource_errors: 0
, url_redirects: 0
};
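The JSON line that the check prints still has to be judged against the QoS target of no page loads over 4s. A minimal sketch of that step follows. `evaluateCheck` and the wording of the problem messages are hypothetical; the field names come from the `stats` object, and `"success"` is the status string PhantomJS passes to the `page.open` callback.

```javascript
// Sketch: judge one JSON line printed by the PhantomJS check against the QoS target.
var MAX_LOAD_MS = 4000; // "no page loads over 4s"

function evaluateCheck(jsonLine) {
  var stats = JSON.parse(jsonLine);
  var problems = [];
  if (stats.status !== "success") problems.push("page did not load");
  if (stats.duration > MAX_LOAD_MS) problems.push("load took " + stats.duration + "ms");
  if (stats.errors > 0) problems.push(stats.errors + " JavaScript error(s)");
  if (stats.resource_errors > 0) problems.push(stats.resource_errors + " resource error(s)");
  return { ok: problems.length === 0, problems: problems };
}

// Usage with two made-up results: one healthy, one too slow.
var healthy = evaluateCheck(JSON.stringify(
  { status: "success", errors: 0, resource_errors: 0, duration: 1200 }));
var slow = evaluateCheck(JSON.stringify(
  { status: "success", errors: 0, resource_errors: 0, duration: 5300 }));
```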
Passive checks.
Now for the passive stuff
Some examples are Google Analytics, Omniture, etc.
Statsd (out-of-the-box) and Metrics
are mediocre approaches.
If we have a lot of observable data N,
N̅ isn’t so useful,
𝜎, |N|, q(0.5), q(0.95), q(0.99), q(0), q(1), add a lot.
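As a rough illustration of those aggregates, the sketch below computes the mean, standard deviation, count and the listed quantiles over a small sample. The function names and the sample latencies are made up; the quantile uses linear interpolation between neighbors, which is only one of several common definitions, and the standard deviation is the population form.

```javascript
// Sketch: quantile by linear interpolation between the two nearest ranks.
function quantile(sorted, q) {
  var pos = (sorted.length - 1) * q;
  var lo = Math.floor(pos), hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Summarize a non-empty array of numbers into the eight aggregates.
function summarize(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var n = sorted.length;
  var mean = sorted.reduce(function (s, v) { return s + v; }, 0) / n;
  var variance = sorted.reduce(function (s, v) {
    return s + (v - mean) * (v - mean);
  }, 0) / n; // population variance
  return {
    count: n, mean: mean, stddev: Math.sqrt(variance),
    min: quantile(sorted, 0), median: quantile(sorted, 0.5),
    q95: quantile(sorted, 0.95), q99: quantile(sorted, 0.99),
    max: quantile(sorted, 1)
  };
}

// Usage: nine ordinary latencies and one outlier (values are made up).
var latenciesMs = [10, 11, 12, 12, 13, 14, 15, 16, 18, 500];
var summary = summarize(latenciesMs);
```

In this sample the single 500 ms outlier pulls the mean to 62.1 ms while the median stays at 13.5 ms, which is why the mean alone says so little.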
Still... we can do better.
N̅, 𝜎, |N|, q(0,0.5,0.95,0.99,1) is 8 statistical aggregates
Let’s look at API latencies...
say we do 1000/s,
that’s 60k/minute.
Over a minute of time, 60k points to 8 represents...
a lot of information loss.
First 60k/minute, how?
statsd
http puts
logs
etc.
Histograms
A line graph of data.
A heatmap of data.
Zoomed in on a heatmap.
Unfolding to a histogram.
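For readers without the pictures, a histogram here just means counting how many observations fall into each value range during a time window. The sketch below uses fixed-width buckets purely for simplicity; that bucketing scheme, the function name and the sample values are assumptions and not necessarily what any given tool does.

```javascript
// Sketch: count observations into fixed-width buckets keyed by each bucket's lower bound.
function buildHistogram(values, bucketWidth) {
  var counts = {};
  values.forEach(function (v) {
    var lower = Math.floor(v / bucketWidth) * bucketWidth;
    counts[lower] = (counts[lower] || 0) + 1;
  });
  return counts;
}

// Usage: one window of made-up latencies in 10 ms buckets.
var windowHistogram = buildHistogram([3, 7, 12, 14, 18, 95], 10);
```

Each window keeps the shape of its distribution in a handful of bucket counts instead of one averaged number, and a heatmap is a sequence of such windows drawn side by side.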
Observability
I don’t want to launch into a tutorial on DTrace
despite the fact that you can simply spin up an
OmniOS AMI in Amazon and have it now.
Instead let’s talk about what shouldn’t happen.
The production questions:
I wonder if that queue is backed up...
Performance like that should only happen if our binary
tree is badly imbalanced (replace with countless other
pathologically bad precipitates of failure); I wonder if it
is...
It’s almost like some requests are super slow; I wonder
if they are.
STOP WONDERING.
Instrument your software
Instrument your software and systems
and stop the wonder
Do it for the kids
This is simple with DTrace & a bit more work otherwise
Avoiding work is not an excuse for ignorance
A tour through our Sauna
We have this software that stores data...
happens to store all data visualized in Circonus.
We have to get data into the system.
We have to get data out of the system.
I don’t wonder... here’s why.
A small background
with metrics intermingled.
To the Internet ➥
Bad habits.
While monitoring all things is a good approach,
alerting on things that do not have specific remediation
requirements is horribly damaging.
Data tenet.
Do not collect data twice.
That which you collect for visualization
should be the same data on which you alert.
Alerting tenet.
A ruleset against metrics in the system should never
produce an alert without documentation:
the failure condition in plain English,
the business impact of the failure condition,
a concise and repeatable remediation procedure,
an escalation path up the chain.
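This tenet can be enforced mechanically by refusing any rule that arrives without its documentation. The sketch below assumes a hypothetical rule object with a `docs` section; the field names, the registry array and the example rule are all invented for illustration.

```javascript
// Sketch: the four pieces of documentation every alerting rule must carry.
var REQUIRED_DOCS = ["failureCondition", "businessImpact", "remediation", "escalationPath"];

// Return the names of documentation fields that are absent or blank.
function missingDocs(rule) {
  var docs = rule.docs || {};
  return REQUIRED_DOCS.filter(function (field) {
    return typeof docs[field] !== "string" || docs[field].trim() === "";
  });
}

// Add the rule to the registry only if it is fully documented.
function registerRule(registry, rule) {
  var missing = missingDocs(rule);
  if (missing.length > 0) {
    throw new Error("rule '" + rule.name + "' lacks documentation: " + missing.join(", "));
  }
  registry.push(rule);
  return registry;
}

// Usage: a fully documented (made-up) rule is accepted.
var rules = registerRule([], {
  name: "checkout-page-load",
  docs: {
    failureCondition: "Checkout page takes longer than 4s to load.",
    businessImpact: "Customers abandon purchases.",
    remediation: "Follow the checkout latency runbook.",
    escalationPath: "On-call engineer, then the web team lead."
  }
});
```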
Alerting post mortems
Try this out:
for each alert, run a post mortem exercise
understand why it alerted, what was done to fix
rehash who the stakeholders are
have them in the meeting
have the stakeholder speak to the business impact
Jeffrey Haguewood
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 

Recently uploaded (20)

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

Monitoring and observability

  • 1. Monitoring & Observability Getting off the starting blocks. Wednesday, June 19, 13
  • 2. THE MANY FACES OF THEO: FUN WITH BEARDS AND HAIR
  • 3. THE MANY FACES OF THEO: FUN WITH BEARDS AND HAIR (FUCK IT ALL, VENDETTA, SCARY, DETERMINED, CAREFREE, NO-FLY ZONE)
  • 4. Agenda: Define stuff. Set some tenets. Discuss and implement some tenets. Answer a lot of questions.
  • 5. Monitoring... what it is. We’ll get to that.
  • 6. Observability: Being able to measure “things” or witness state changes. Not useful if doing so alters behavior (significantly).
  • 7. Development & Production: For the rest of this talk... There is only production.
  • 8. Data & Information Terms. Measurement: a single measurement of something. Either a value on which numerical operations make sense: 1, -110, 1.234123, 9.886e-19, 0, null; or a text value: “200”, “304”, “v1.234”, “happy”, null.
  • 9. Data & Information Terms. Metric: something that you are measuring. The version of deployed code; total cost on Amazon services; total bugs filed, bug backlog; total queries executed.
  • 10. Notice no rates. DO NOT STORE RATES.
  • 11. Measurement Velocity: The rate at which new measurements are taken.
  • 12. Perspective: Sometimes perspective matters (page load times, DNS queries; consider RUM, real user monitoring). Usually it does not (total requests made against a web server).
  • 13. Visualization: The assimilation of multiple measurements into a visual representation.
  • 14. Trending: Understanding the “direction” of a series of measurements on a metric. Here direction is loose and means “pattern within.”
  • 15. Alerting: To bring something to one’s attention.
  • 16. Anomaly Detection: The determination that a specific measurement is not within reason.
  • 17. Monitoring... what it is. All of that.
  • 19. Some Tenets: Most people suck at monitoring. They monitor all the wrong things (somewhat bad). They don’t monitor the important things (awful).
  • 20. Do not collect rates of things. Rates are like trees making sounds falling in the forest. Direct measurement of rates leads to data loss and ultimately ignorance.
  • 21. Prefer high level telemetry: 1. Business drivers via KPIs, 2. Team KPIs, 3. Staff KPIs, 4. ... then telemetry from everything else.
  • 22. Implementation: Herein it gets tricky.
  • 23. Only because of the tools. I could show you how to use tool X, or Y or Z. But I wrote Reconnoiter and founded Circonus because X, Y and Z didn’t meet my needs. Reconnoiter is open. Circonus is a service.
  • 24. Methodology: I’m going to focus on methodology that can be applied across whatever toolset you have.
  • 25. Pull vs. Push: Anyone who says one is better than the other is... WRONG. They both have their uses.
  • 26. Reasons for pull: 1. Synthesized observation is desirable. 2. Observable activity is infrequent. 3. Alterations in observation frequency are useful.
  • 27. Reasons for push: 1. Direct observation is desirable. 2. Discrete observed actions are useful. 3. Discrete observed actions are frequent.
  • 28. False reasons: Polling doesn’t scale.
  • 29. Protocol Soup: The great thing about standards is... there are so many to choose from.
  • 30. Protocol Soup: SNMP (v1, v2, v3), both push (trap) and pull (query); collectd (v4, v5), push only; statsd, push only; JMX, JDBC, ICMP, DHCP, NTP, SSH, TCP, UDP, barf.
  • 31. Color me RESTy: Use JSON. HTTP(s) PUT/POST somewhere for push. HTTP(s) GET something for pull.
  • 32. High-volume Data: Occasionally, data velocity is beyond what’s reasonable for individual HTTP PUT/POST for each observation. 1. You can fall back to UDP (try statsd). 2. I prefer to batch them and continue to use REST.
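The batching preference on slide 32 could look roughly like the sketch below. It is not nad's or Circonus's API: `createBatcher`, the payload shape, the metric name and the flush threshold are all hypothetical, and the `send` callback stands in for whatever HTTP PUT/POST your collector accepts.

```javascript
// Hypothetical batcher: buffers observations and hands them to `send`
// as one JSON document instead of one HTTP request per observation.
function createBatcher(send, maxBatch) {
  var buffer = [];

  function flush() {
    if (buffer.length === 0) return;
    var payload = JSON.stringify(buffer);
    buffer = [];
    send(payload); // in real use: one HTTP PUT/POST carrying the whole batch
  }

  return {
    record: function (metric, value, timestampMs) {
      buffer.push({ metric: metric, value: value, ts: timestampMs });
      if (buffer.length >= maxBatch) flush();
    },
    flush: flush
  };
}

// Usage: three observations with maxBatch = 2 become two sends
// once the final partial batch is flushed.
var sent = [];
var batcher = createBatcher(function (body) { sent.push(body); }, 2);
batcher.record('api.latency_ms', 12, 1000);
batcher.record('api.latency_ms', 15, 1001);
batcher.record('api.latency_ms', 9, 1002);
batcher.flush();
```

A real deployment would probably also flush on a timer so that a quiet period does not strand observations in the buffer.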
  • 33. nad: nad is great. Use nad. https://github.com/circonus-labs/nad Think of it like an SNMP that’s actually Simple Monitoring, not Management, and trivially extended to suit your needs.
  • 34. nad online example: To the Internet ➥
  • 35. But wait... nad isn’t methodology... it’s technology.
  • 36. Correct... Back to the topic. I talked about nad briefly to provide a super simple tool to erase the question: “but how?”
  • 37. The real question is: “what?” What should I be monitoring? This is the best question you can ask yourself. Before you start. While you’re implementing. After you’re done.
  • 38. The industry answer: MONITOR ALL THE THINGS! I’ll tell you this too, in fact. But we have put the cart ahead of the horse.
  • 39. Question? If I could monitor one thing, what would it be? Hint: CPU utilization on your web server ain’t it.
  • 40. Answer: It depends on your business. If you don’t know the answer to this, I suggest you stop worrying about monitoring and start worrying about WTF your company does.
  • 41. Here, we can’t continue. Unless I make stuff up... So, here I go makin’ stuff up.
  • 42. Let us assume we run a web site where customers buy products.
  • 43. Monitoring purchases: So, we should monitor how many purchases were made and ensure it is within acceptable levels. Not so fast.
  • 44. Actually: We want to make sure customers can purchase from the site and are purchasing from the site. This semantic difference is critically important. And choosing which comes down to velocity.
  • 45. What is this velocity thing? Displacement / time (i.e. purchases/second or $/second). BUT WAIT! You said: “Do not collect rates of things.” Correct... collect the displacement, visualize and alert on the rate.
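One way to read “collect the displacement, visualize and alert on the rate” is to store only a cumulative total and derive the rate at read time. The sketch below assumes samples shaped as `{ ts, total }` with `ts` in milliseconds; the function names and the sample data are hypothetical.

```javascript
// Derive a per-second rate from two samples of a stored cumulative total.
// Each sample is { ts: <milliseconds>, total: <counter value> }.
function rateBetween(prev, curr) {
  var seconds = (curr.ts - prev.ts) / 1000;
  if (seconds <= 0) return null;   // duplicate or out-of-order sample
  var delta = curr.total - prev.total;
  if (delta < 0) return null;      // counter was reset; skip this interval
  return delta / seconds;
}

// Rates for every consecutive pair in a series of stored totals.
function ratesFor(samples) {
  var rates = [];
  for (var i = 1; i < samples.length; i++) {
    rates.push(rateBetween(samples[i - 1], samples[i]));
  }
  return rates;
}

// Stored data: total purchases so far, sampled once a minute.
var purchases = [
  { ts: 0,      total: 100 },
  { ts: 60000,  total: 160 },  // 60 purchases in 60 s
  { ts: 120000, total: 190 }   // 30 purchases in 60 s
];
var purchaseRates = ratesFor(purchases); // [1, 0.5] purchases/second
```

Because the totals are what is stored, a missed sample only widens the interval over which the rate is averaged; the purchases in between are still counted.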
  • 46. So which? High velocity w/ predictably smooth trends: velocity is more important. Low velocity or uneven arrival rates: measuring capability is more important.
  • 47. To rephrase: If you have sufficient real data, observing that data works best; otherwise, you must synthesize data and monitor that.
  • 48. As a tenet: Always synthesize. Additionally observe real data when possible.
  • 49. More demonstrable (in a short session): I’ve got a web site that my customers need to visit. The business understands that we need to serve customers with at least a basic level of QoS: no page loads over 4s.
  • 51. A first attempt: curl http://surge.omniti.com/, extract the HTTP response code; if 200, we’re super good! Admittedly not so good.
  • 52. A wealth of data. Synthesizing an HTTPS GET could provide: SSL subject, validity, expiration; HTTP code, headers and content; timings on TCP connection, first byte, full payload.
  • 53. Still, this is highly imperfect. Don’t get me wrong, they are useful. We use them all over the place... they are cheap. But, ideally, you want to load the page closer to the way a user does (all assets, javascript, etc.). Enter phantomjs.
  • 54. PhantomJS page-load script (start and stats are declared on slide 55):
      var page = require('webpage').create();
      page.viewportSize = { width: 1024, height: 768 };
      // Count script errors and record timings relative to start.
      page.onError = function(err) { stats.errors++; };
      page.onInitialized = function() { start = new Date(); };
      page.onLoadStarted = function() { stats.load_started = new Date() - start; };
      page.onLoadFinished = function() { stats.load_finished = new Date() - start; };
      // Counter names match the fields of the stats object.
      page.onResourceRequested = function() { stats.resources++; };
      page.onResourceError = function(err) { stats.resource_errors++; };
      page.onUrlChanged = function() { stats.url_redirects++; };
      page.open('http://surge.omniti.com/', function(status) {
        stats.status = status;
        stats.duration = new Date() - start;
        // Emit everything as one JSON document, then quit.
        console.log(JSON.stringify(stats));
        phantom.exit();
      });
  • 55. Declarations used by the script on slide 54:
      var start, stats = {
          status: null
        , errors: 0
        , load_started: null
        , load_finished: null
        , resources: 0
        , resource_errors: 0
        , url_redirects: 0
      };
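The JSON document printed by the script can then be judged against the QoS statement from slide 49 (“no page loads over 4s”). The following is only a sketch: `evaluatePageLoad` and the extra checks on errors are assumptions, while the field names and the 4 s limit come from the slides.

```javascript
var MAX_LOAD_MS = 4000; // "no page loads over 4s" from the QoS statement

// Takes the JSON line printed by the page-load script and lists every
// way in which this synthetic page load fell short.
function evaluatePageLoad(statsJson) {
  var result = JSON.parse(statsJson);
  var problems = [];
  if (result.status !== 'success') problems.push('page failed to load');
  if (result.duration > MAX_LOAD_MS) problems.push('load took ' + result.duration + 'ms');
  if (result.errors > 0) problems.push(result.errors + ' script error(s)');
  if (result.resource_errors > 0) problems.push(result.resource_errors + ' resource error(s)');
  return { ok: problems.length === 0, problems: problems };
}

var good = evaluatePageLoad('{"status":"success","duration":1800,"errors":0,"resource_errors":0}');
var slow = evaluatePageLoad('{"status":"success","duration":5200,"errors":0,"resource_errors":0}');
```

In keeping with the data tenet later in the talk, the raw durations should still be stored for visualization; the pass/fail verdict is derived from them rather than collected separately.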
  • 57. Now for the passive stuff: Some examples are Google Analytics, Omniture, etc. Statsd (out-of-the-box) and Metrics are a mediocre approach. If we have a lot of observable data N, N̅ isn’t so useful; 𝜎, |N|, q(0.5), q(0.95), q(0.99), q(0), q(1) add a lot.
  • 58. Still... we can do better. N̅, 𝜎, |N|, q(0,0.5,0.95,0.99,1) is 8 statistical aggregates. Let’s look at API latencies... say we do 1000/s, that’s 60k/minute. Over a minute of time, 60k points to 8 represents... a lot of information loss.
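For concreteness, the eight aggregates named above (mean, standard deviation, count, and the quantiles at 0, 0.5, 0.95, 0.99 and 1) can be computed as in the sketch below. The nearest-rank quantile and the population standard deviation are choices made here for brevity, not something the slides specify, and the sample data is made up.

```javascript
// Nearest-rank quantile on an ascending array (one of several definitions).
function quantile(sorted, q) {
  if (q <= 0) return sorted[0];
  var rank = Math.ceil(q * sorted.length);
  return sorted[rank - 1];
}

// Reduce a window of raw observations to the eight aggregates.
function summarize(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var n = sorted.length, sum = 0, squares = 0, i;
  for (i = 0; i < n; i++) sum += sorted[i];
  var mean = sum / n;
  for (i = 0; i < n; i++) squares += (sorted[i] - mean) * (sorted[i] - mean);
  return {
    mean: mean,
    stddev: Math.sqrt(squares / n), // population standard deviation
    count: n,
    q0: quantile(sorted, 0),        // minimum
    q50: quantile(sorted, 0.5),
    q95: quantile(sorted, 0.95),
    q99: quantile(sorted, 0.99),
    q100: quantile(sorted, 1)       // maximum
  };
}

// Made-up latencies of 1..100 ms stand in for a minute of API calls.
var latencies = [];
for (var ms = 1; ms <= 100; ms++) latencies.push(ms);
var summary = summarize(latencies);
```

Whatever the window holds, the result is always these eight numbers, which is exactly the information loss the slide complains about: the shape of the distribution between the quantiles is gone.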
  • 59. First 60k/minute, how? statsd, http puts, logs, etc.
  • 61. A line graph of data.
  • 62. A heatmap of data.
  • 63. Zoomed in on a heatmap.
  • 64. Unfolding to a histogram.
  • 65. Observability: I don’t want to launch into a tutorial on DTrace despite the fact that you can simply spin up an OmniOS AMI in Amazon and have it now. Instead let’s talk about what shouldn’t happen.
  • 66. The production questions: I wonder if that queue is backed up... Performance like that should only happen if our binary tree is badly imbalanced (replace with countless other pathologically bad precipitates of failure); I wonder if it is... It’s almost like some requests are super slow; I wonder if they are. STOP WONDERING.
  • 67. Instrument your software: Instrument your software and systems and stop the wonder. Do it for the kids. This is simple with DTrace & a bit more work otherwise. Avoiding work is not an excuse for ignorance.
  • 68. A tour through our Sauna: We have this software that stores data... happens to store all data visualized in Circonus. We have to get data into the system. We have to get data out of the system. I don’t wonder... here’s why.
  • 69. A small background with metrics intermingled. To the Internet ➥
  • 70. Bad habits: While monitoring all things is a good approach, alerting on things that do not have specific remediation requirements is horribly damaging.
  • 71. Data tenet: Do not collect data twice. That which you collect for visualization should be the same data on which you alert.
  • 72. Alerting tenet: A ruleset against metrics in the system should never produce an alert without documentation: the failure condition in plain English, the business impact of the failure condition, a concise and repeatable remediation procedure, an escalation path up the chain.
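The alerting tenet can be enforced mechanically by refusing to activate any rule that lacks the four pieces of documentation. The rule shape, field names and example rules below are hypothetical and not taken from any particular alerting product.

```javascript
// The four documentation items the tenet requires (field names are made up).
var REQUIRED_DOCS = ['failureCondition', 'businessImpact', 'remediation', 'escalationPath'];

// Returns the names of the documentation fields a rule is missing.
function missingDocs(rule) {
  var docs = rule.docs || {};
  return REQUIRED_DOCS.filter(function (field) {
    return typeof docs[field] !== 'string' || docs[field].trim() === '';
  });
}

// A rule may alert only when nothing is missing.
function canAlert(rule) {
  return missingDocs(rule).length === 0;
}

var documented = {
  metric: 'purchases.total',
  docs: {
    failureCondition: 'No purchases completed in the last 10 minutes.',
    businessImpact: 'Customers cannot buy; revenue stops.',
    remediation: 'Follow the checkout runbook: verify payment gateway, then roll back.',
    escalationPath: 'On-call engineer, then engineering manager, then CTO.'
  }
};
var undocumented = { metric: 'web.cpu', docs: { failureCondition: 'CPU over 90%.' } };

var documentedOk = canAlert(documented);        // true
var stillMissing = missingDocs(undocumented);   // the three absent fields
```

A check like this fits naturally into whatever review or deploy step introduces new alert rules.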
  • 73. Alerting post mortems: Try this out: for each alert, run a post mortem exercise. Understand why it alerted and what was done to fix it; rehash who the stakeholders are; have them in the meeting; have the stakeholder speak to the business impact.