Practical Statistics for Anomaly Detection in Load Testing and Production
Anton Lebedevich

Contents
●

Real Data and Ideal Models

●

Load Testing (Tuning)

●

Production Monitoring

●

Correlation

●

Tools

Real Data vs. Ideal Models
●

noise (human actions)

●

outliers

●

missing data

●

different resolutions

●

counter update frequencies

●

quantization

●

not Gaussian and not random walk

●

what is normal for the system?

Resolution
●

>=5min

●

1min

●

10s

●

<=1s
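A small sketch of why resolution matters, using made-up numbers: a 3-second latency spike is obvious at 1 s resolution and fades as samples are averaged into wider buckets. The `downsample` helper and the data are hypothetical, and averaging is only one possible aggregation function.

```r
# Hypothetical data: 5 minutes of 1 s latency samples, flat at 10 ms,
# with a 3-second spike to 500 ms.
per_second <- rep(10, 300)
per_second[121:123] <- 500

# Average consecutive samples into buckets of `width` seconds.
downsample <- function(x, width) {
  groups <- (seq_along(x) - 1) %/% width
  as.numeric(tapply(x, groups, mean))
}

max(per_second)                    # 500  at 1 s
max(downsample(per_second, 10))    # 157  at 10 s
max(downsample(per_second, 60))    # 34.5 at 1 min
max(downsample(per_second, 300))   # 14.9 at 5 min
```

At 5 min resolution the spike is barely distinguishable from the 10 ms baseline.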

Load Testing (Tuning)
●

goal

●

beware of transient response

●

find failure

●

filter data

●

find bottleneck and fix

●

rinse and repeat

Filtration
●

constants

●

coefficient of variation (sd/mean)

●

apply system knowledge
–

tasks migrated by scheduler

–

dependent (disk used/free)

–

interface traffic < 10 packets/s

–

load average < 0.5

–

…
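The first two filters above can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the talk: the `filter_metrics` name, the 0.01 cutoff, and the column names are assumptions. Rules based on system knowledge (scheduler migrations, dependent counters, idle interfaces) would need per-metric logic that is not shown here.

```r
# Drop metrics that cannot carry a signal:
#   - constants (sd == 0)
#   - series whose sd/mean ratio is below min_cv (an assumed cutoff)
filter_metrics <- function(df, min_cv = 0.01) {
  keep <- vapply(df, function(x) {
    x <- x[!is.na(x)]
    if (length(x) < 2) return(FALSE)
    s <- sd(x)
    if (s == 0) return(FALSE)      # constant
    m <- abs(mean(x))
    if (m == 0) return(TRUE)       # varies around zero: keep
    (s / m) >= min_cv
  }, logical(1))
  df[, keep, drop = FALSE]
}

# Hypothetical metrics, one column per counter.
set.seed(1)
metrics <- data.frame(
  cpu_user  = 50 + rnorm(100, sd = 10),     # varies: kept
  mem_total = rep(16384, 100),              # constant: dropped
  fan_rpm   = 1000 + rnorm(100, sd = 0.1)   # almost flat: dropped
)
names(filter_metrics(metrics))
```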

Nonlinear

ndiffs: difference the series until the KPSS test says it is stationary

Production Classics
●

Control charts
–

fixed window moving average (MA)

–

exponentially weighted moving average (EWMA)

●

Holt-Winters

Exponentially-Weighted Moving Average

Control Charts
●

stationary

●

Gaussian/Poisson

●

outliers
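An EWMA control chart can be sketched in base R as follows. It assumes what the slide lists: a stationary metric with roughly Gaussian noise and a baseline free of outliers. The function name `ewma_chart`, the smoothing weight `lambda = 0.2`, and the width `L = 3` are illustrative choices, and the limits use the asymptotic EWMA variance.

```r
# EWMA control chart: smooth the series and raise an alarm when the
# smoothed value leaves mu +/- L * sigma * sqrt(lambda / (2 - lambda)).
ewma_chart <- function(x, mu, sigma, lambda = 0.2, L = 3) {
  z <- numeric(length(x))
  prev <- mu                       # start the EWMA at the baseline mean
  for (i in seq_along(x)) {
    prev <- lambda * x[i] + (1 - lambda) * prev
    z[i] <- prev
  }
  half_width <- L * sigma * sqrt(lambda / (2 - lambda))
  data.frame(value = x, ewma = z,
             lower = mu - half_width, upper = mu + half_width,
             alarm = abs(z - mu) > half_width)
}

# Hypothetical data: a clean baseline, then a mean shift halfway through.
set.seed(1)
baseline <- rnorm(200, mean = 100, sd = 5)
incoming <- c(rnorm(50, mean = 100, sd = 5), rnorm(50, mean = 115, sd = 5))
chart <- ewma_chart(incoming, mu = mean(baseline), sigma = sd(baseline))
which(chart$alarm)[1]              # index of the first alarm
```

A smaller `lambda` smooths more and reacts more slowly; a larger one behaves closer to a plain threshold on raw values.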

Holt-Winters
triple exponential smoothing
●

needs a lot of data

●

sensitive to outliers

●

can't handle 3 seasons + holidays

●

overfitting
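For reference, base R ships `HoltWinters()` in the `stats` package. The sketch below fits it to synthetic hourly data with one daily season and flags points outside the prediction interval. The data, the fixed smoothing parameters, and the 99% level are assumptions for illustration; in practice the parameters are usually left for the function to estimate, which is where the data requirements and overfitting concerns above come in.

```r
# Hypothetical hourly metric: 8 days with a daily cycle plus noise.
set.seed(42)
hours <- 1:(24 * 8)
load <- 100 + 10 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 1)

train  <- ts(load[1:(24 * 7)], frequency = 24)   # first 7 days
actual <- load[(24 * 7 + 1):(24 * 8)]            # day 8
actual[10] <- actual[10] + 100                   # injected spike

# Fixed smoothing parameters (assumed values) skip the optimisation step.
fit  <- HoltWinters(train, alpha = 0.3, beta = 0.1, gamma = 0.3)
pred <- predict(fit, n.ahead = 24, prediction.interval = TRUE, level = 0.99)

# Points outside the prediction interval are candidate anomalies.
upper <- as.numeric(pred[, "upr"])
lower <- as.numeric(pred[, "lwr"])
anomalous <- which(actual > upper | actual < lower)
anomalous
```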

Production Experimental
●

autocorrelation

●

non-parametric 2 sample tests

Autocorrelation
Ljung-Box Test
●

non-stationary

●

mean shift

●

trends

●

seasonal

●

periodic (cron jobs, sampling)

●

aggregated (MA, EWMA)
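The Ljung-Box test is available in base R as `Box.test(..., type = "Ljung-Box")`. The sketch below, on synthetic data, shows the behaviour the list describes: white noise yields a small statistic, while a periodic series (standing in for a cron job or sampling artifact) yields a very large one. The series and the lag of 20 are assumptions.

```r
set.seed(7)
n <- 300
noise    <- rnorm(n)                                   # no structure
periodic <- sin(2 * pi * seq_len(n) / 60) + rnorm(n, sd = 0.3)

# Null hypothesis: no autocorrelation up to the given lag.
lb_noise    <- Box.test(noise,    lag = 20, type = "Ljung-Box")
lb_periodic <- Box.test(periodic, lag = 20, type = "Ljung-Box")

c(noise = lb_noise$p.value, periodic = lb_periodic$p.value)
```

A low p-value only says that the series is autocorrelated; it does not say which of the listed causes is responsible.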

2-Sample Tests: Good
Kolmogorov–Smirnov, Cramér–von Mises
●

good for request size and latency (unaggregated)

●

work on periodic data

●

outlier resistant

●

good for data exploration

2-Sample Tests: Bad
Kolmogorov–Smirnov, Cramér–von Mises
●

false positives on trends and seasonal changes

●

need many unique values

●

computational complexity

●

bad for alerting
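A two-sample Kolmogorov-Smirnov comparison can be sketched with base R's `ks.test()`. The latency samples below are synthetic (log-normal, an assumption), representing unaggregated request latencies from two time windows. Cramér-von Mises is not part of base R and is not shown.

```r
set.seed(3)
before     <- rlnorm(1000, meanlog = 3,   sdlog = 0.5)  # reference window
after_same <- rlnorm(1000, meanlog = 3,   sdlog = 0.5)  # nothing changed
after_slow <- rlnorm(1000, meanlog = 3.5, sdlog = 0.5)  # latency regression

ks_same <- ks.test(before, after_same)
ks_slow <- ks.test(before, after_slow)

# D is the largest gap between the two empirical CDFs.
c(same = unname(ks_same$statistic), slow = unname(ks_slow$statistic))
c(same = ks_same$p.value, slow = ks_slow$p.value)
```

With coarse or quantized metrics the samples contain many ties, which is one reason the list above asks for many unique values.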

Finding Similar Graphs
●

correlation (Pearson, Spearman)

●

Euclidean distance

●

dynamic time warping (DTW)

●

discrete Fourier transform (DFT)

●

discrete wavelet transform (DWT)
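The first two similarity measures in the list can be sketched in base R. The three series are synthetic and their names are hypothetical. Euclidean distance is computed on z-scored values here (a choice made for this example) so that metrics with different units are comparable. DTW, DFT, and DWT are not shown.

```r
set.seed(11)
tt <- 1:200
cpu      <- 50   + 20  * sin(2 * pi * tt / 50) + rnorm(200, sd = 2)
requests <- 1000 + 400 * sin(2 * pi * tt / 50) + rnorm(200, sd = 40)  # same shape, other scale
disk     <- cumsum(rnorm(200))                                         # unrelated random walk

pearson  <- cor(cpu, requests, method = "pearson")
spearman <- cor(cpu, requests, method = "spearman")

# Euclidean distance after standardising each series.
z <- function(x) (x - mean(x)) / sd(x)
euclid <- function(a, b) sqrt(sum((z(a) - z(b))^2))

c(pearson = pearson, spearman = spearman)
c(similar = euclid(cpu, requests), unrelated = euclid(cpu, disk))
```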

Clustering
●

non-euclidean (ultrametric) space

●

many small clusters

●

local clustering around events

●

false positives
–

cron jobs (log rotation)

–

human actions (restarts, reconfigurations)

–

cache expirations

–

…
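One way to cluster metrics by similarity is hierarchical clustering on a correlation-based distance, sketched below with base R's `hclust()`. The distance `1 - |cor|`, the average linkage, and the six synthetic series are assumptions for illustration, not necessarily the approach used in the talk. The merge heights of the resulting tree, available through `cophenetic()`, form an ultrametric.

```r
set.seed(5)
tt  <- 1:240
web <- sin(2 * pi * tt / 60)   # shape shared by hypothetical web hosts
db  <- cos(2 * pi * tt / 60)   # shape shared by hypothetical db hosts

noisy <- function(x) x + rnorm(length(x), sd = 0.1)
m <- cbind(web1 = noisy(web), web2 = noisy(web), web3 = noisy(web),
           db1  = noisy(db),  db2  = noisy(db),  db3  = noisy(db))

# Distance: strongly correlated (or anti-correlated) series are close.
d    <- as.dist(1 - abs(cor(m)))
tree <- hclust(d, method = "average")
groups <- cutree(tree, k = 2)
groups
```

On real metrics, the false positives listed above (cron jobs, restarts, cache expirations) tend to pull otherwise unrelated series into the same cluster.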

Tools
●

collectd

●

statsd

●

graphite

●

whisper-fetch

●

R

R
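library(zoo)  # provides coredata() and index()

# Reshape a multi-column zoo series into a long data frame and add a
# smoothed copy of each series (block means, roughly 150 blocks).
add.smooth <- function(m) {
  r <- nrow(m)
  # Replace every value with the mean of its block of rows.
  ms <- sapply(m, function(y) {
    ave(coredata(y),
        seq.int(r) %/% max(3, r %/% 150),
        FUN = function(x) mean(x, na.rm = TRUE))
  })
  # One row per (timestamp, series) pair.
  df <- data.frame(index(m)[rep.int(1:r, ncol(m))],
                   factor(rep(1:ncol(m), each = r), levels = 1:ncol(m)),
                   as.vector(coredata(m)),
                   as.vector(coredata(ms)))
  names(df) <- c("Index", "Series", "Value", "Smooth")
  df
}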

Kale Stack
●

github.com/etsy/skyline

●

github.com/etsy/oculus

Skyline
image from github.com/etsy/skyline

Skyline Internals
●

Horizon agent

●

Redis

●

Analyzer agent

●

Flask (Python) Web App

Skyline Algorithms
●

median absolute deviation

●

mean subtraction cumulation

●

grubbs

●

least squares

●

first hour average

●

histogram bins

●

stddev from average

●

ks test

●

stddev from moving average

●

second order anomalies
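To give a feel for the first check in the list, here is a sketch of a median-absolute-deviation test in R. It is an illustration of the idea, not Skyline's code (Skyline is written in Python). The function name `mad_anomaly` and the threshold of 6 are assumptions.

```r
# Flag the latest point if it is further from the median than
# `threshold` times the median absolute deviation of the window.
mad_anomaly <- function(x, threshold = 6) {
  x <- x[!is.na(x)]
  deviations <- abs(x - median(x))
  mad_raw <- median(deviations)
  if (mad_raw == 0) return(FALSE)   # flat window: nothing to compare against
  tail(deviations, 1) / mad_raw > threshold
}

mad_anomaly(c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50))   # TRUE
mad_anomaly(c(10, 11, 9, 10, 12, 10, 11, 9, 10, 11))   # FALSE
```

Because it relies on medians, the check is not thrown off by earlier outliers in the window, unlike the checks based on mean and standard deviation.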

Oculus
image from github.com/etsy/oculus

Oculus Internals
●

Skyline Import Script and Cronjob

●

Resque workers

●

ElasticSearch

●

Sinatra (Ruby) Web App

Q&A
Anton Lebedevich
mabrek@gmail.com
twitter.com/widdoc
github.com/mabrek
