Practical Statistics for Finding Anomalies in Load Testing and Production
Anton Lebedevich
A Lot of Graphs
Contents
● Real Data and Ideal Models
● Load Testing (Tuning)
● Production Monitoring
● Correlation
● Tools
Real Data vs. Ideal Models
● noise (human actions)
● outliers
● missing data
● different resolutions
● counter update frequencies
● quantization
● not Gaussian and not random walk
● what is normal for the system?
Outlier vs. Changepoint
Resolution
● >=5min
● 1min
● 10s
● <=1s
Load Testing (Tuning)
● goal
● beware of transient response
● find failure
● filter data
● find bottleneck and fix
● rinse and repeat
Transient Response
Failure on Target Metric
Filtration
● constants
● coefficient of variation (sd/mean)
● apply system knowledge
  – tasks migrated by scheduler
  – dependent (disk used/free)
  – interface traffic < 10 packets/s
  – load average < 0.5
  – …
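The first two filters above (constants and low sd/mean) can be sketched in base R as follows. This is a minimal illustration, not the author's code: the helper name `filter.metrics`, the `min.cv` threshold of 0.01, and the synthetic metrics are all assumptions made for the example.

```r
# Hypothetical helper: drop metrics that cannot carry a signal.
# `m` is a numeric matrix with one column per metric.
filter.metrics <- function(m, min.cv = 0.01) {  # min.cv is an assumed threshold
  keep <- apply(m, 2, function(x) {
    x <- x[!is.na(x)]
    if (length(x) < 2) return(FALSE)   # missing data
    s <- sd(x)
    if (s == 0) return(FALSE)          # constant
    mu <- abs(mean(x))
    if (mu == 0) return(TRUE)          # sd/mean undefined, series still varies
    s / mu >= min.cv                   # drop series that barely move
  })
  m[, keep, drop = FALSE]
}

set.seed(1)
m <- cbind(constant = rep(5, 100),
           quiet    = 1000 + rnorm(100, sd = 0.1),
           busy     = 100 + rnorm(100, sd = 20))
colnames(filter.metrics(m))
```

Rules based on system knowledge (for example the packets/s and load average cut-offs) would be extra per-metric conditions applied before or after this step.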
Missing or Constant
Changed Mean
Nonlinear

ndiffs: difference the series until the KPSS test reports it as stationary
Production Classics
● Control charts
  – fixed window moving average (MA)
  – exponentially weighted moving average (EWMA)
● Holt-Winters
Test Subject
Moving Average
Exponentially-Weighted Moving Average
Control Charts
● stationary
● Gaussian/Poisson
● outliers
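An EWMA control chart can be sketched in base R as below. This is a textbook-style illustration rather than the charts shown in the talk: the function name `ewma.chart`, the training window of 50 points, `lambda = 0.2`, and the 3-sigma width `L` are assumptions. The chart estimates a baseline from an initial window assumed to be in control, which is where the stationarity and distribution caveats above come in.

```r
# Hypothetical EWMA control chart; baseline comes from the first n.train points.
ewma.chart <- function(x, n.train = 50, lambda = 0.2, L = 3) {
  mu    <- mean(x[1:n.train])
  sigma <- sd(x[1:n.train])
  z <- numeric(length(x))
  prev <- mu                                   # start the EWMA at the baseline
  for (i in seq_along(x)) {
    prev <- lambda * x[i] + (1 - lambda) * prev
    z[i] <- prev
  }
  idx <- seq_along(x)
  # Time-varying control limits of the EWMA statistic.
  width <- L * sigma *
    sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * idx)))
  data.frame(value = x, ewma = z,
             lower = mu - width, upper = mu + width,
             alarm = abs(z - mu) > width)
}

set.seed(42)
# Synthetic series with a mean shift after point 100.
x <- c(rnorm(100, mean = 10, sd = 1), rnorm(50, mean = 13, sd = 1))
chart <- ewma.chart(x)
which(chart$alarm)[1]   # index of the first alarm
```

A smaller `lambda` smooths more and reacts more slowly; a single large outlier inside the training window inflates `sigma` and widens the limits.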
Two Weeks
Holt-Winters
triple exponential smoothing
● needs a lot of data
● sensitive to outliers
● can't handle 3 seasons + holidays
● overfitting
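A minimal sketch of Holt-Winters used as an anomaly detector, with `HoltWinters()` from base R's stats package. Everything about the data is assumed: hourly samples, one daily season, two weeks of synthetic load, and a spike injected on the last day. The smoothing parameters are fixed by hand here to keep the example deterministic; leaving them out lets `HoltWinters()` fit them.

```r
set.seed(7)
period <- 24                                  # assumed: hourly samples, daily season
n.days <- 14                                  # two weeks of data
hours  <- seq_len(period * n.days)
load   <- 100 + 30 * sin(2 * pi * hours / period) +
  rnorm(length(hours), sd = 5)
spike  <- 13 * period + 12
load[spike] <- load[spike] + 80               # injected anomaly on day 14
series <- ts(load, frequency = period)

# Fit on the first 13 days, predict day 14 with 99% prediction bands.
train <- window(series, end = c(13, period))
fit   <- HoltWinters(train, alpha = 0.2, beta = 0.01, gamma = 0.2)
bands <- predict(fit, n.ahead = period,
                 prediction.interval = TRUE, level = 0.99)

actual  <- as.numeric(window(series, start = c(14, 1)))
outside <- actual < as.numeric(bands[, "lwr"]) |
  actual > as.numeric(bands[, "upr"])
which(outside)                                # hours of day 14 outside the bands
```

The limitations listed above show up directly: the fit needs several full seasons, an outlier in the training window distorts the seasonal component, and the model has a single `frequency`.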
Time Shifting
Production Experimental
● autocorrelation
● non-parametric 2 sample tests
Autocorrelation
Ljung-Box Test
● non-stationary
● mean shift
● trends
● seasonal
● periodic (cron jobs, sampling)
● aggregated (MA, EWMA)
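The Ljung-Box test is available in base R as `Box.test(..., type = "Ljung-Box")`. The sketch below applies it to synthetic series matching two of the cases listed above, a mean shift and a moving-average aggregate; the wrapper name `ljung.p`, the lag of 20, and the data are assumptions for illustration.

```r
# Hypothetical wrapper returning only the p-value of the Ljung-Box test.
ljung.p <- function(x, lag = 20) {
  Box.test(x, lag = lag, type = "Ljung-Box")$p.value
}

set.seed(3)
noise    <- rnorm(300)                              # independent samples
shifted  <- c(rnorm(150), rnorm(150, mean = 3))     # mean shift halfway
# 10-point trailing moving average of the same noise (leading NAs dropped).
smoothed <- as.numeric(na.omit(stats::filter(noise, rep(1 / 10, 10), sides = 1)))

ljung.p(noise)      # independent noise
ljung.p(shifted)    # mean shift: strongly autocorrelated
ljung.p(smoothed)   # aggregation alone introduces autocorrelation
```

A small p-value only says the samples are not independent; it does not say which of the listed causes is responsible, so already-aggregated metrics trigger it even when nothing is wrong.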
Distribution Change
2-Sample Tests: Good
Kolmogorov–Smirnov, Cramér–von Mises
● good for request size and latency (unaggregated)
● work on periodic data
● outlier resistant
● good for data exploration
2-Sample Tests: Bad
Kolmogorov–Smirnov, Cramér–von Mises
● false positives on trends and seasonal changes
● need many unique values
● computational complexity
● bad for alerting
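A two-sample Kolmogorov-Smirnov comparison can be sketched with `ks.test()` from base R (Cramér-von Mises is not in base R and is not shown). The latency samples are synthetic log-normal draws, chosen only as an assumed stand-in for unaggregated request latencies.

```r
set.seed(11)
# Assumed stand-ins for per-request latencies in two time windows.
before <- rlnorm(500, meanlog = 0, sdlog = 0.5)
after  <- rlnorm(500, meanlog = 1, sdlog = 0.5)   # distribution has shifted
same   <- rlnorm(500, meanlog = 0, sdlog = 0.5)   # same distribution as `before`

shifted.test   <- ks.test(before, after)
unchanged.test <- ks.test(before, same)

shifted.test$statistic   # D: largest gap between the two empirical CDFs
shifted.test$p.value
unchanged.test$p.value
```

Because the statistic depends only on the empirical CDFs, a few extreme values move it very little, which matches the "outlier resistant" point. Heavily quantized metrics produce many tied values, which is where the "need many unique values" caveat applies.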
Finding Similar Graphs
● correlation (Pearson, Spearman)
● Euclidean distance
● dynamic time warping (DTW)
● discrete Fourier transform (DFT)
● discrete wavelet transform (DWT)
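The first two similarity measures above can be sketched in base R. The series are synthetic and the helpers `znorm` and `euclid` are hypothetical names; DTW, DFT, and DWT are not covered here.

```r
set.seed(5)
i <- 1:200
a <- sin(i / 10) + rnorm(200, sd = 0.1)
b <- 50 + 10 * sin(i / 10) + rnorm(200, sd = 1)   # same shape, different scale
unrelated <- rnorm(200)

# Hypothetical helpers: z-normalize, then take the Euclidean distance,
# so that offset and scale do not dominate the comparison.
znorm  <- function(x) (x - mean(x)) / sd(x)
euclid <- function(x, y) sqrt(sum((znorm(x) - znorm(y))^2))

cor(a, b, method = "pearson")
cor(a, b, method = "spearman")   # rank-based, less sensitive to outliers
euclid(a, b)
euclid(a, unrelated)
```

For z-normalized series of length n the two measures carry the same information, since the squared Euclidean distance equals 2(n - 1)(1 - r), where r is the Pearson correlation.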
Cluster Centers
Cluster Members
Clustering
● non-euclidean (ultrametric) space
● many small clusters
● local clustering around events
● false positives
  – cron jobs (log rotation)
  – human actions (restarts, reconfigurations)
  – cache expirations
  – …
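One way to group similar graphs is hierarchical clustering on a correlation-based dissimilarity, sketched below with `hclust()` from base R. This is an assumed approach for illustration, not necessarily the method behind the cluster slides; the metric names and data are synthetic. Since 1 - r is a dissimilarity rather than a Euclidean distance, the sketch uses a method that only needs a dissimilarity matrix.

```r
set.seed(9)
i <- 1:300
wave <- sin(i / 15)                 # periodic shape
step <- as.numeric(i > 150)         # level change halfway

# Three noisy copies of each shape, one column per metric.
m <- cbind(sapply(1:3, function(k) wave + rnorm(300, sd = 0.2)),
           sapply(1:3, function(k) step + rnorm(300, sd = 0.2)))
colnames(m) <- c(paste0("wave", 1:3), paste0("step", 1:3))

d      <- as.dist(1 - cor(m))       # dissimilarity: 0 for identical shapes
tree   <- hclust(d, method = "average")
groups <- cutree(tree, k = 2)       # k = 2 is known here only by construction
split(names(groups), groups)
```

On real metrics the number of clusters is unknown, and events such as restarts or cron jobs make otherwise unrelated series look correlated for a short time, which is the source of the false positives listed above.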
Tools
● collectd
● statsd
● graphite
● whisper-fetch
● R
R
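library(zoo)  # coredata() and index() come from zoo

# Add a bucket-averaged copy of every column of a zoo matrix and return the
# result in long format (one row per timestamp and series), e.g. for plotting.
add.smooth <- function(m) {
  r <- nrow(m)
  # Bucket width: at least 3 samples; about 150 buckets for long series.
  bucket <- seq.int(r) %/% max(3, r %/% 150)
  # Replace each value with the mean of its bucket, column by column.
  ms <- sapply(m, function(y) {
    ave(coredata(y), bucket, FUN = function(x) mean(x, na.rm = TRUE))
  })
  df <- data.frame(index(m)[rep.int(1:r, ncol(m))],
                   factor(rep(1:ncol(m), each = r), levels = 1:ncol(m)),
                   as.vector(coredata(m)),
                   as.vector(coredata(ms)))
  names(df) <- c("Index", "Series", "Value", "Smooth")
  df
}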
Kale Stack
● github.com/etsy/skyline
● github.com/etsy/oculus
Skyline
image from github.com/etsy/skyline
Skyline Internals
● Horizon agent
● Redis
● Analyzer agent
● Flask (Python) Web App
Skyline Algorithms
● median absolute deviation
● mean subtraction cumulation
● grubbs
● least squares
● first hour average
● histogram bins
● stddev from average
● ks test
● stddev from moving average
● second order anomalies
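To make the first algorithm in the list concrete, here is a sketch of a median-absolute-deviation check in R. It is an illustrative re-implementation, not Skyline's code (Skyline itself is written in Python): the function name `mad.anomaly` and the threshold of 6 are assumptions.

```r
# Hypothetical check: is the latest point far from the median, measured in
# units of the median absolute deviation of the whole window?
mad.anomaly <- function(x, threshold = 6) {   # threshold is an assumed value
  dev     <- abs(x - median(x))
  med.dev <- median(dev)
  if (med.dev == 0) return(FALSE)             # constant window: nothing to compare
  dev[length(x)] / med.dev > threshold
}

set.seed(2)
steady <- rnorm(200, mean = 50, sd = 2)       # synthetic metric
mad.anomaly(steady)
mad.anomaly(c(steady, 90))                    # same window plus one extreme point
```

Medians make the check robust to earlier outliers in the window, unlike the mean- and stddev-based algorithms in the same list.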
Oculus
image from github.com/etsy/oculus
Oculus Internals
● Skyline Import Script and Cronjob
● Resque workers
● ElasticSearch
● Sinatra (Ruby) Web App
Q&A
Anton Lebedevich
mabrek@gmail.com
twitter.com/widdoc
github.com/mabrek
