BRING THE NOISE!
MAKING SENSE OF A HAILSTORM OF METRICS
Abe Stanway (@abestanway)
Jon Cowie (@jonlives)
Tuesday, 9 July 13
Ninety minutes is a long time.
This talk:
- motivations (~10 min)
- skyline (~25 min)
- oculus (~30 min)
- demo! (~10 min)
- questions (~15 min)
But we have some
sweet stuff to show
you.
Background and Motivations
1.5 billion page views
$117 million of goods sold
950 thousand users
(in december ‘12)
We practice continuous
deployment.
de•ploy /diˈploi/
Verb
To release your code for the
world to see, hopefully without
breaking the Internet
Everyone deploys.
250+ committers.
Day one:
DEPLOY
30+ DEPLOYS A DAY
(~8 commits per deploy!)
“30 deploys a day? Is that safe?”
We optimize for quick recovery
by anticipating problems...
...instead of fearing human error.
Can’t fix what you
don’t measure!
- W. Edwards Deming
homemade!
StatsD
Skyline
Oculus
Supergrep
not homemade
graphite
Nagios
Ganglia
Real time error logging
“Not all things that
break throw errors.”
- Oscar Wilde
StatsD
StatsD::increment("foo.bar")
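A client call like this one boils down to a single small UDP datagram. The following is a minimal sketch of that idea, assuming StatsD's plain-text counter format (`name:value|c`); the function names `format_counter` and `increment` and the host and port defaults are assumptions for illustration, not part of any client library.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter line, e.g. 'foo.bar:1|c'."""
    return f"{name}:{value}|c"


def increment(name, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP (host and port are assumed values)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP is fire-and-forget: no reply is expected and a lost packet is ignored.
        sock.sendto(format_counter(name).encode("ascii"), (host, port))
    finally:
        sock.close()
```

Because the send never blocks on a response, instrumenting a code path this way is cheap enough to do everywhere.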
If it moves,
graph it!
we would graph them ➞
If it doesn’t move,
graph it anyway
(it might make a run for it)
DASHBOARDS!
[1358731200,20]
[1358731200,20]
[1358731200,20]
[1358731200,20]
[1358731200,20]
[1358731200,20]
[1358731200,20]
[1358731200,20]
[1358731200,60]
[1358731200,20]
[1358731200,20]
DASHBOARDS! x 250,000
lol nagios
“...but there are also
unknown unknowns -
there are things we do
not know we don’t
know.”
Unknown
anomalies
Unknown
correlations
Kale.
Kale:
- leaves
- green stuff
OCULUS
SKYLINE
Q). How do you analyze a
timeseries for anomalies
in real time?
A). Lots of HTTP requests
to Graphite’s API!
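For a single metric, polling could look like the sketch below. It only builds the request URL for Graphite's render API with JSON output; the base URL is a placeholder and the helper name `render_url` is an assumption.

```python
from urllib.parse import urlencode


def render_url(base, target, window="-1h"):
    """Build a Graphite render API URL that returns one metric as JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base}/render?{query}"


# Placeholder host; one request like this is needed per metric per polling pass.
url = render_url("http://graphite.example.com", "statsd.numStats")
```

One HTTP round trip per metric is fine for a handful of timeseries, which is why the next question changes the answer.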
Q). How do you analyze a
quarter million timeseries
for anomalies in real time?
SKYLINE
A real time
anomaly detection
system
Real time?
Kinda.
StatsD
Ten second resolution
Ganglia
One minute resolution
Best case:
~10s (StatsD)
~1min (Ganglia)
Takes about 90 seconds
with our throughput.
Still faster than you would
have discovered it otherwise.
Memory > Disk
Q). How do you get a
quarter million timeseries
into Redis on time?
STREAM IT!
Graphite’s relay agent
➞ original graphite
➞ backup graphite
pickles:
[statsd.numStats, [1365603412, 73421]]
[statsd.numStats, [1365603422, 82345]]
[statsd.numStats, [1365603432, 80611]]
Graphite’s relay agent
➞ original graphite
➞ skyline
pickles:
[statsd.numStats, [1365603412, 73421]]
[statsd.numStats, [1365603422, 82345]]
[statsd.numStats, [1365603432, 80611]]
We import from Ganglia too.
Storing timeseries
Minimize I/O
Minimize memory
redis.append()
- Strings
- Constant time
- One operation per update
JSON?
=> get statsD.numStats
----------------------------
“[1358711400, 51],
[1358711410, 23],
[1358711420, 45],”
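The append-only JSON layout could be sketched as follows. `FakeRedis` is an in-memory stand-in written for this example so it runs without a server; it mimics only a string append and get. How the trailing comma is handled on read is an assumption.

```python
import json


class FakeRedis:
    """In-memory stand-in for the two string operations used here."""

    def __init__(self):
        self.store = {}

    def append(self, key, value):
        self.store[key] = self.store.get(key, "") + value
        return len(self.store[key])

    def get(self, key):
        return self.store.get(key)


def append_datapoint(client, metric, timestamp, value):
    """One append per update: a JSON pair plus a trailing comma."""
    client.append(metric, json.dumps([timestamp, value]) + ",")


def read_series(client, metric):
    """Strip the trailing comma and wrap in brackets to get valid JSON."""
    raw = client.get(metric) or ""
    return json.loads("[" + raw.rstrip(",") + "]")


client = FakeRedis()
for ts, value in [(1358711400, 51), (1358711410, 23), (1358711420, 45)]:
    append_datapoint(client, "statsD.numStats", ts, value)
```

Writes stay cheap, but every read has to parse the whole string again, which is where the decoding cost comes from.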
OVER HALF
CPU time spent
decoding JSON
[ 1 , 2 ]
Stuff we care about
Extra junk
MESSAGEPACK
A binary-based
serialization protocol
\x92\x01\x02
\x92\x02\x03
Array header: \x92 (type and size in one byte for short arrays; a 16 or 32 bit big endian integer for long ones)
Things we care about: \x01\x02, \x02\x03
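A hand-rolled sketch of the byte layout, using only the standard library: in MessagePack, a two-element array of small integers such as [1, 2] packs to the three bytes 92 01 02. The second half assumes each datapoint is stored as a uint32 timestamp and a float64 value, which is an assumption made for this example; real code would use a MessagePack library rather than `struct`.

```python
import struct


def pack_small(a, b):
    """Pack two ints in 0..127 as a fixarray: one header byte, one byte each."""
    if not (0 <= a <= 127 and 0 <= b <= 127):
        raise ValueError("only positive fixints are handled in this sketch")
    return bytes([0x92, a, b])


# fixarray(2) header, uint32 marker + timestamp, float64 marker + value.
RECORD = struct.Struct(">BBIBd")


def pack_point(timestamp, value):
    """Encode one [timestamp, value] pair as 15 bytes of valid MessagePack."""
    return RECORD.pack(0x92, 0xCE, timestamp, 0xCB, float(value))


def unpack_stream(blob):
    """Decode a string of appended records without any delimiters."""
    points = []
    for offset in range(0, len(blob), RECORD.size):
        _, _, timestamp, _, value = RECORD.unpack_from(blob, offset)
        points.append((timestamp, value))
    return points


blob = pack_point(1358711400, 51) + pack_point(1358711410, 23)
```

Records are self-delimiting, so appended values can be concatenated directly with no commas or brackets to strip.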
CUT IN HALF
Run Time + Memory Used
ROOMBA.PY
CLEANS THE DATA
“Wait...you wrote this in Python?”
Great statistics libraries
Not fun for parallelism
The Analyzer
Assign Redis keys to each process
Process decodes and analyzes
Anomalous metrics written as JSON
Front end retrieves them with setInterval()
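Splitting the key space across worker processes might look like this sketch. The contiguous chunking scheme and the name `assign_keys` are assumptions; the talk only states that each process is assigned a set of Redis keys.

```python
import math


def assign_keys(keys, process_count):
    """Split keys into contiguous chunks, one chunk per worker process."""
    per_process = math.ceil(len(keys) / process_count)
    return [
        keys[i * per_process:(i + 1) * per_process]
        for i in range(process_count)
    ]


# Hypothetical metric names; each chunk would be handed to its own process.
chunks = assign_keys(["a", "b", "c", "d", "e"], 2)
```

Each chunk can then be passed to a separate `multiprocessing.Process`, so decoding and analysis run on separate cores instead of contending for one interpreter lock.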
What does it mean
to be anomalous?
Consensus model
Implement everything you
can get your hands on
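One way to read "consensus model" is as a vote: run several detectors and flag a metric only when enough of them agree. The sketch below assumes that reading; the threshold parameter and both toy detectors are invented for illustration and are not the project's algorithms.

```python
def consensus_vote(series, algorithms, consensus):
    """Return True when at least `consensus` algorithms flag the series."""
    votes = sum(1 for algorithm in algorithms if algorithm(series))
    return votes >= consensus


# Two toy detectors, purely illustrative.
def above_hundred(series):
    return series[-1] > 100


def doubled(series):
    return len(series) > 1 and series[-1] > 2 * series[-2]


detectors = [above_hundred, doubled]
```

Requiring agreement trades a little sensitivity for fewer false alarms from any single brittle algorithm.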
Basic algorithm:
“A metric is anomalous if its
latest datapoint is over three
standard deviations above
its moving average.”
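The basic algorithm translates almost directly into code. This is a minimal sketch; the window length of 30 datapoints is an assumption, and the latest datapoint is included in the window for simplicity.

```python
from statistics import mean, pstdev


def three_sigma(series, window=30):
    """Flag the latest datapoint if it exceeds the moving average by 3 sigma."""
    recent = series[-window:]
    if len(recent) < 2:
        return False
    average = mean(recent)
    deviation = pstdev(recent)
    return recent[-1] > average + 3 * deviation


steady = [20] * 30
spiking = [20] * 29 + [60]
```

Here `three_sigma(spiking)` flags the jump from 20 to 60, while `three_sigma(steady)` does not.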
Grubbs’ test, ordinary least squares
Histogram binning
Four horsemen of the modelpocalypse
1. Seasonality
2. Spike influence
3. Normality
4. Parameters
Anomaly?
Nope.
Spikes artificially raise the moving average
Anomaly
detected (yay!)
Anomaly missed :(
Bigger moving average
Real world data doesn’t
necessarily follow a perfect
normal distribution.
Too many metrics to fit
parameters for them all!
A robust set of algorithms is the
current focus of this project.
Q). How do you analyze a
quarter million timeseries
for correlations?
OCULUS
Image comparison is expensive and slow
Use raw timeseries instead of raw graphs
“[[975, 1365528530],
[643, 1365528540],
[750, 1365528550],
[992, 1365528560],
[580, 1365528570],
[586, 1365528580],
[649, 1365528590],
[548, 1365528600],
[901, 1365528610],
[633, 1365528620]]”
HARD PROBLEMS
Naming Things
Cache Invalidation
Numerical Comparison?
Euclidean Distance
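For two timeseries of equal length, the distance is the square root of the summed squared differences between corresponding datapoints. A minimal sketch, with the function name chosen for illustration:

```python
import math


def euclidean_distance(a, b):
    """Pointwise distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

It is cheap, but it compares datapoints strictly index by index, so two identical shapes offset by one step already look different.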
Dynamic Time Warping
(helps with phase shifts)
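The textbook dynamic programming form of DTW is sketched below, using absolute difference as the per-point cost (an assumption; other cost functions work too). It fills a full table, which is what makes it quadratic.

```python
def dtw_distance(a, b):
    """Classic DTW: cheapest monotonic alignment between two series."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


bump = [0, 0, 1, 2, 1, 0, 0]
shifted_bump = [0, 1, 2, 1, 0, 0, 0]
```

The two example series are the same bump one step apart: DTW aligns them at zero cost, whereas a pointwise comparison would report a difference.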
We’ve solved it!
O(N²)
O(N²) x 250k
Too slow!
No need to run DTW on all 250k.
Discard obviously dissimilar metrics.
Shape Description Alphabet
“975 643 643 750 992 992 992 580”
(normalization step)
“24 4 4 11 25 25 25 0 1”
“sharpdecrement flat increment
sharpincrement flat flat
sharpdecrement”
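A sketch of how such a fingerprint could be produced: scale the values to a small integer range, then name each step between neighbours. The 0 to 25 scale, the rounding rule, and the `sharp` threshold of 10 are assumptions chosen so the example series yields the tokens on the slide; the actual cut-offs are not given in the talk.

```python
def normalize(values, scale=25):
    """Rescale values to integers between 0 and `scale`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [round((v - lo) * scale / (hi - lo)) for v in values]


def fingerprint(values, sharp=10):
    """Describe each step between neighbours with a word from the alphabet."""
    levels = normalize(values)
    tokens = []
    for previous, current in zip(levels, levels[1:]):
        delta = current - previous
        if delta == 0:
            tokens.append("flat")
        else:
            direction = "increment" if delta > 0 else "decrement"
            prefix = "sharp" if abs(delta) >= sharp else ""
            tokens.append(prefix + direction)
    return " ".join(tokens)


series = [975, 643, 643, 750, 992, 992, 992, 580]
```

Turning numbers into words is what lets a text search engine do the first, cheap pass over every metric.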
Search for shape description
fingerprint in Elasticsearch
Run DTW on results
as final polish
O(N²) on ~10k metrics
Still too slow.
Fast DTW - O(N)
coarsen
project
refine
Elasticsearch Details
Phrase search for first pass scores
across shape description fingerprints
Custom FastDTW and Euclidean
distance plugins to score across the
remaining filtered timeseries
Elasticsearch Structure
{
  :id => "statsd.numStats",
  :fingerprint => "sdec inc sinc sdec",
  :values => "10 1 2 15 4"
}
Mappings
Specify tokenizers
“Untouched” fields
First pass query
:match => {
  :fingerprint => {
    :query => "sdec inc sinc sdec inc",  # shape description fingerprint
    :type => "phrase",
    :slop => 20
  }
}
Refinement query
{
  :custom_score => {
    :query => <first_pass_query>,
    :script => "oculus_dtw",
    :params => {
      :query_value => "10 20 20 10 30",  # raw timeseries
      :query_field => "values.untouched"
    }
  }
}
KALE
Skyline
Elasticsearch
Resque
Sinatra
Ganglia
Graphite
StatsD
Flask
Populating Elasticsearch
resque workers ➞ ES Index
Too slow to
update and search
New Index
Last Index
Webapp
Sinatra frontend
Queries ES
Renders results
Collections
devops <3
Special thanks to:
Dr. Neil Gunther, PerfDynamics
Dr. Brian Whitman, Echonest
Burc Arpat, Facebook
Seth Walker, Etsy
Rafe Colburn, Etsy
Mike Rembetsy, Etsy
John Allspaw, Etsy
@abestanway @jonlives
Thanks!
github.com/etsy/skyline
github.com/etsy/oculus