Anton Lebedevich

HighLoad++ 2013

Published in: Technology

  1. Statistics in Practice for Anomaly Detection in Load Testing and Production. Anton Lebedevich
  2. A Lot of Graphs
  3. Contents
       ● Real Data and Ideal Models
       ● Load Testing (Tuning)
       ● Production Monitoring
       ● Correlation
       ● Tools
  4. Real Data vs. Ideal Models
       ● noise (human actions)
       ● outliers
       ● missing data
       ● different resolutions
       ● counter update frequencies
       ● quantization
       ● not Gaussian and not random walk
       ● what is normal for the system?
  5. Outlier vs. Changepoint
  6. Outlier vs. Changepoint
  7. Outlier vs. Changepoint
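
The distinction on the slides above can be sketched in a few lines: an outlier is an isolated deviation, while a changepoint is a shift that persists. This is a minimal illustration, not the speaker's method. The function name `classify_point`, the window size, and the threshold `k` are all assumptions, and the check needs `window` points after the candidate, so it can only run with a delay.

```python
from statistics import median


def classify_point(series, i, window=10, k=5.0, min_scale=1e-9):
    """Label series[i] as 'outlier', 'changepoint' or 'normal' (hypothetical helper)."""
    before = series[max(0, i - window):i]
    after = series[i + 1:i + 1 + window]
    if len(before) < 3 or len(after) < 3:
        raise ValueError("need at least 3 points on each side of index i")
    base = median(before)
    # Robust noise scale: median absolute deviation of the 'before' window.
    scale = max(median(abs(x - base) for x in before), min_scale)
    if abs(median(after) - base) > k * scale:
        return "changepoint"  # the level stays shifted after the point
    if abs(series[i] - base) > k * scale:
        return "outlier"      # only this single point deviates
    return "normal"


spike = [10] * 20 + [50] + [10] * 20
step = [10] * 20 + [50] * 20
print(classify_point(spike, 20), classify_point(step, 20))
```

Medians are used instead of means so that the spike itself does not drag the baseline.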
  8. Resolution
       ● >=5min
       ● 1min
       ● 10s
       ● <=1s
  9. Load Testing (Tuning)
       ● goal
       ● beware of transient response
       ● find failure
       ● filter data
       ● find bottleneck and fix
       ● rinse and repeat
  10. Transient Response
  11. Failure on Target Metric
  12. Failure on Target Metric
  13. Failure on Target Metric
  14. Filtration
       ● constants
       ● dispersion relative to the mean (sd/mean, the coefficient of variation)
       ● apply system knowledge
           – tasks migrated by scheduler
           – dependent (disk used/free)
           – interface traffic < 10 packets/s
           – load average < 0.5
           – …
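
The first two filtration rules can be sketched as follows: drop metrics that are missing or constant, then drop metrics whose sd/mean ratio is too small to be interesting. The function name `filter_metrics`, the `min_cv` threshold, and the convention that `None` marks a missing sample are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def filter_metrics(metrics, min_cv=0.01):
    """Keep only metrics that vary enough to be worth inspecting (hypothetical helper)."""
    kept = {}
    for name, values in metrics.items():
        present = [v for v in values if v is not None]
        if len(present) < 2:
            continue  # missing data
        m = mean(present)
        sd = pstdev(present)
        if sd == 0:
            continue  # constant
        if m != 0 and sd / abs(m) < min_cv:
            continue  # almost constant relative to its level
        kept[name] = present
    return kept


metrics = {
    "const": [5, 5, 5, 5],
    "flat": [100, 100.1, 99.9, 100],
    "busy": [1, 5, 2, 9],
    "missing": [None, None, None],
}
print(sorted(filter_metrics(metrics)))
```

The system-knowledge rules on the slide (scheduler migrations, dependent metrics, low-traffic interfaces) depend on the specific system and are not covered by this sketch.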
  15. Missing or Constant
  16. Changed Mean
  17. Nonlinear
       ● ndiffs: difference the series until the KPSS test says it is stationary
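
The "difference until KPSS says stationary" idea can be sketched without any libraries. The code below computes the level-stationarity KPSS statistic with Bartlett-weighted long-run variance and differences the series while the statistic exceeds the 5% critical value of 0.463. The lag rule `4 * (n / 100) ** 0.25` and the cap `max_d=2` are assumptions; the R function named on the slide may choose them differently.

```python
KPSS_CRIT_5PCT = 0.463  # 5% critical value for level stationarity


def kpss_level_statistic(values, lags=None):
    """KPSS statistic for the null hypothesis 'stationary around a constant level'."""
    n = len(values)
    level = sum(values) / n
    resid = [v - level for v in values]
    if lags is None:
        lags = int(4 * (n / 100.0) ** 0.25)  # assumed truncation-lag rule
    # Long-run variance estimate with Bartlett weights.
    lrv = sum(e * e for e in resid) / n
    for s in range(1, lags + 1):
        weight = 1.0 - s / (lags + 1.0)
        autocov = sum(resid[t] * resid[t - s] for t in range(s, n)) / n
        lrv += 2.0 * weight * autocov
    if lrv <= 0:
        return 0.0  # constant series: treat as stationary
    partial = 0.0
    total = 0.0
    for e in resid:
        partial += e              # running sum of residuals
        total += partial * partial
    return total / (n * n * lrv)


def ndiffs(values, max_d=2):
    """Number of differences needed before the KPSS statistic drops below 0.463."""
    series = list(values)
    d = 0
    while d < max_d and len(series) > 3 and kpss_level_statistic(series) > KPSS_CRIT_5PCT:
        series = [b - a for a, b in zip(series, series[1:])]
        d += 1
    return d


print(ndiffs(range(100)), ndiffs([(-1) ** t for t in range(100)]))
```

A metric that needs one or more differences is a candidate for a trend or a level change during the test run.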
  18. Production Classics
       ● Control charts
           – fixed window moving average (MA)
           – exponentially weighted moving average (EWMA)
       ● Holt-Winters
  19. Test Subject
  20. Test Subject
  21. Test Subject
  22. Moving Average
  23. Exponentially-Weighted Moving Average
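
Both smoothers from the two slides above fit in a few lines. This is a generic sketch; the function names and the choice to emit `None` until the first full window is available are assumptions, not something the slides specify.

```python
def moving_average(values, window):
    """Fixed-window moving average; None until a full window is available."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value that left the window
        out.append(total / window if i >= window - 1 else None)
    return out


def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    out = []
    smoothed = None
    for v in values:
        smoothed = v if smoothed is None else alpha * v + (1 - alpha) * smoothed
        out.append(smoothed)
    return out


print(moving_average([1, 2, 3, 4, 5], 3))
print(ewma([10, 20, 30], 0.5))
```

A larger `alpha` makes the EWMA follow recent values more closely; a smaller one gives a smoother but slower baseline.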
  24. Control Charts
       ● stationary
       ● Gaussian/Poisson
       ● outliers
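
A basic control chart derives limits from a baseline period and flags points outside them. The sketch below uses the conventional mean ± 3 standard deviations. It relies on the series being stationary and roughly Gaussian, which is what the slide's first two bullets refer to, and an outlier in the baseline will widen the limits. The function names and the sample data are illustrative assumptions.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Lower and upper control limits: baseline mean -/+ k sample standard deviations."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center + k * sigma


def out_of_control(values, lower, upper):
    """Indices of the points that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


baseline = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
lower, upper = control_limits(baseline)
print(out_of_control([10, 14, 9, 6, 11], lower, upper))
```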
  25. Two Weeks
  26. Holt-Winters triple exponential smoothing
       ● needs a lot of data
       ● sensitive to outliers
       ● can't handle 3 seasons + holidays
       ● overfitting
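
For reference, here is a minimal additive Holt-Winters sketch that produces one-step-ahead forecasts. It handles a single season, initializes from the first two seasons, and uses arbitrary smoothing parameters; all of these are assumptions for illustration. The need for at least two full seasons before anything can be forecast is one concrete form of the "needs a lot of data" bullet.

```python
def holt_winters_additive(series, season_len, alpha=0.5, beta=0.1, gamma=0.1):
    """One-step-ahead forecasts for series[season_len:] (additive seasonality)."""
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    first, second = series[:m], series[m:2 * m]
    level = sum(first) / m
    trend = (sum(second) / m - level) / m      # average per-step change
    seasonal = [y - level for y in first]      # initial seasonal offsets
    forecasts = []
    for t in range(m, len(series)):
        s = seasonal[t % m]
        forecasts.append(level + trend + s)    # forecast made before seeing series[t]
        y = series[t]
        new_level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts


pattern = [1, 3, 2, 5]
series = pattern * 5
forecasts = holt_winters_additive(series, season_len=4)
print(max(abs(f - y) for f, y in zip(forecasts, series[4:])))
```

An alert would compare each new value with its forecast; a single large outlier is absorbed into the level and the seasonal offsets, which is why the method is sensitive to outliers.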
  27. Time Shifting
  28. Production Experimental
       ● autocorrelation
       ● non-parametric 2 sample tests
  29. Autocorrelation
  30. Autocorrelation: Ljung-Box Test
       ● non-stationary
       ● mean shift
       ● trends
       ● seasonal
       ● periodic (cron jobs, sampling)
       ● aggregated (MA, EWMA)
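
The Ljung-Box statistic is Q = n(n+2) Σ r_k² / (n − k), summed over lags k = 1..h, where r_k is the sample autocorrelation at lag k. The sketch below computes it with the standard library only. Since that library has no chi-square distribution, it compares Q against the tabulated 5% critical value for 10 degrees of freedom (18.307) instead of computing a p-value. The function names and the default of 10 lags are assumptions.

```python
CHI2_CRIT_5PCT_DF10 = 18.307  # chi-square 5% critical value, 10 degrees of freedom


def acf(values, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(values, max_lag=10):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(values)
    r = acf(values, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


trend = list(range(100))
print(ljung_box_q(trend) > CHI2_CRIT_5PCT_DF10)
```

A Q above the critical value means the series is not white noise, which is the case for every kind of series listed on the slide.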
  31. Distribution Change
  32. Distribution Change
  33. Distribution Change
  34. Distribution Change
  35. 2-Sample Tests: Good (Kolmogorov–Smirnov, Cramér–von Mises)
       ● good for request size and latency (unaggregated)
       ● work on periodic data
       ● outlier resistant
       ● good for data exploration
  36. 2-Sample Tests: Bad (Kolmogorov–Smirnov, Cramér–von Mises)
       ● false positives on trends and seasonal changes
       ● need many unique values
       ● computational complexity
       ● bad for alerting
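
The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two empirical distribution functions. The sketch below computes it directly and applies the large-sample rejection rule D > c(α)·sqrt((n+m)/(n·m)) with c(0.05) ≈ 1.358. The function names are assumptions, and the asymptotic threshold is only approximate for small samples or data with many ties.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    return max(
        abs(bisect_right(a, v) / n - bisect_right(b, v) / m)
        for v in a + b
    )


def ks_reject(a, b, c_alpha=1.358):
    """True if the samples differ at roughly the 5% level (asymptotic rule)."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) > c_alpha * sqrt((n + m) / (n * m))


last_hour = list(range(100))
this_hour = list(range(100, 200))
print(ks_statistic(last_hour, this_hour), ks_reject(last_hour, this_hour))
```

Sorting both samples and evaluating every pooled value is part of the computational cost mentioned on the "Bad" slide.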
  37. Finding Similar Graphs
       ● correlation (Pearson, Spearman)
       ● Euclidean distance
       ● dynamic time warping (DTW)
       ● discrete Fourier transform (DFT)
       ● discrete wavelet transform (DWT)
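
Two of the similarity measures listed above are short enough to sketch: Pearson correlation, which ignores scale and offset, and dynamic time warping, which tolerates shifts and stretches in time. The DTW version here is the textbook quadratic dynamic program with absolute difference as the local cost and no window constraint; those choices and the function names are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def dtw_distance(x, y):
    """Dynamic time warping distance with |a - b| as the local cost."""
    inf = float("inf")
    n, m = len(x), len(y)
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: step in x, step in y, step in both.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


print(pearson([1, 2, 3], [3, 5, 7]))
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

DTW accepts series of different lengths, while Pearson correlation and Euclidean distance need the two series aligned sample by sample.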
  38. Cluster Centers
  39. Cluster Members
  40. Cluster Members
  41. Clustering
       ● non-euclidean (ultrametric) space
       ● many small clusters
       ● local clustering around events
       ● false positives
           – cron jobs (log rotation)
           – human actions (restarts, reconfigurations)
           – cache expirations
           – …
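
The slides do not say which clustering algorithm was used, so the following is only one possible sketch: single-linkage clustering cut at a fixed threshold, with 1 − Pearson correlation as a non-Euclidean dissimilarity. Cutting single linkage at a threshold is the same as taking connected components of the "closer than threshold" graph, which a small union-find structure handles. The function names and the threshold are assumptions.

```python
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated series, 2 for opposite."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / sqrt(vx * vy)


def threshold_clusters(series_by_name, threshold=0.1):
    """Group series whose correlation distance chains stay below the threshold."""
    names = sorted(series_by_name)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if correlation_distance(series_by_name[a], series_by_name[b]) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


graphs = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1], "d": [8, 6, 4, 2]}
print(threshold_clusters(graphs))
```

The pairwise loop is quadratic in the number of metrics, so in practice it would be applied to a short time window around an event, not to the full history.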
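  42. Tools
       ● collectd
       ● statsd
       ● graphite
       ● whisper-fetch
       ● R
  43. R

       library(zoo)  # provides coredata() and index()

       # Convert a multi-column zoo series into a long data frame with a
       # smoothed copy of every column, ready for plotting.
       add.smooth <- function(m) {
         r <- nrow(m)
         # Replace each value with the mean of its bucket; buckets hold
         # max(3, r %/% 150) consecutive rows, so roughly 150 buckets at most.
         ms <- sapply(m, function(y) {
           ave(coredata(y), seq.int(r) %/% max(3, r %/% 150),
               FUN = function(x) mean(x, na.rm = TRUE))
         })
         # One row per (timestamp, series) pair.
         df <- data.frame(index(m)[rep.int(1:r, ncol(m))],
                          factor(rep(1:ncol(m), each = r), levels = 1:ncol(m)),
                          as.vector(coredata(m)),
                          as.vector(coredata(ms)))
         names(df) <- c("Index", "Series", "Value", "Smooth")
         df
       }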
  44. Kale Stack
       ● github.com/etsy/skyline
       ● github.com/etsy/oculus
  45. Skyline (image from github.com/etsy/skyline)
  46. Skyline Internals
       ● Horizon agent
       ● Redis
       ● Analyzer agent
       ● Flask (Python) Web App
  47. Skyline Algorithms
       ● median absolute deviation
       ● mean subtraction cumulation
       ● grubbs
       ● least squares
       ● first hour average
       ● histogram bins
       ● stddev from average
       ● ks test
       ● stddev from moving average
       ● second order anomalies
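
To give a flavour of the first algorithm in the list, here is a generic median-absolute-deviation check on the latest point of a series. It is a sketch of the general technique, not Skyline's code; the function name and the threshold of 6 are assumptions.

```python
from statistics import median


def mad_anomalous(values, threshold=6.0):
    """True if the last value is more than `threshold` MADs away from the median."""
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        return False  # no spread to compare against
    return deviations[-1] / mad > threshold


print(mad_anomalous([10, 11, 9, 10, 12, 8, 10, 11, 9, 50]))
print(mad_anomalous([10, 11, 9, 10, 12, 8, 10, 11, 9, 10]))
```

Because both the center and the scale are medians, a single extreme value barely moves the baseline it is compared against.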
  48. Oculus (image from github.com/etsy/oculus)
  49. Oculus Internals
       ● Skyline Import Script and Cronjob
       ● Resque workers
       ● ElasticSearch
       ● Sinatra (Ruby) Web App
  50. Q&A
       Anton Lebedevich
       mabrek@gmail.com
       twitter.com/widdoc
       github.com/mabrek
