Adaptive Fault Detection

Published in: Technology

  1. Adaptive Fault Detection: Baron Schwartz • Percona Live NYC 2012
  2. Me ✤ Author of High Performance MySQL ✤ Creator of some tools that you might use ✤ I love hearing from people just like you: @xaprb on Twitter, http://www.linkedin.com/in/xaprb [Book cover: High Performance MySQL, 3rd Edition, Covers Version 5.5. Optimization, Backups, Replication, and more. Baron Schwartz, Peter Zaitsev & Vadim Tkachenko]
  3. Conventional Fault Detection: Metrics, Thresholds, and Actions
  4. Nagios and Thresholds: Is there a right answer?
  5. Motivations ✤ Detect unknown failure mode ✤ Capture diagnostic data automatically ✤ Surface relevant information
  6. Six Sigmas: 99.7% of measurements fall within ±3 sigmas of mean in a normal distribution
  7. Abnormality Detection: Statistical process control, operations research, and intuition
  8. Shewhart Control Charts: Metrics that fall ± too many standard deviations from the mean are out of bounds
  9. Holt-Winters Forecasting: Predict the future based on history, trend, and seasonality
  10. Brownian Motion: A random walk shouldn’t go the same way for long
  11. Coin Tossing: QPS increases and decreases with ~equal probability [Chart: Probability of Increase/Decrease. Increase 49.88%, Same 0.21%, Decrease 49.91%]
  12. I Feel Normal Today: Long runs of QPS increases/decreases behave like a coin toss [Chart: Length of QPS Runs. Run lengths from -9 to 9, counts from 0 to 100000]
  13. Oddly Even: How long is an unusually long random walk? [Table: Run Improbability by run length, two columns. 1: 50.10%, 75.09%; 2: 75.10%, 93.79%; 3: 87.57%, 98.46%; 4: 93.80%, 99.62%; 5: 96.91%, 99.90%; 6: 98.46%, 99.98%; 7: 99.23%, 99.99%; 8: 99.62%; 9: 99.90%; 10: 99.95%]
  14. Oddly Even: How long is an unusually long random walk? [Table: Run Improbability by run length, two columns. 1: 50.10%, 75.09%; 2: 75.10%, 93.79%; 3: 87.57%, 98.46%; 4: 93.80%, 99.62%; 5: 96.91%, 99.90%; 6: 98.46%, 99.98%; 7: 99.23%, 99.99%; 8: 99.62%; 9: 99.90%; 10: 99.95%. Annotation after the length-4 row: “Two!”]
  15. Oddly Even: How long is an unusually long random walk? [Table: Run Improbability by run length, two columns. 1: 50.10%, 75.09%; 2: 75.10%, 93.79%; 3: 87.57%, 98.46%; 4: 93.80%, 99.62%; 5: 96.91%, 99.90%; 6: 98.46%, 99.98%; 7: 99.23%, 99.99%; 8: 99.62%; 9: 99.90%; 10: 99.95%. Annotations: “Two!” after the length-4 row, “Three!” after the length-8 row]
  16. Oddly Even: How long is an unusually long random walk? [Table: Run Improbability by run length, two columns. 1: 50.10%, 75.09%; 2: 75.10%, 93.79%; 3: 87.57%, 98.46%; 4: 93.80%, 99.62%; 5: 96.91%, 99.90%; 6: 98.46%, 99.98%; 7: 99.23%, 99.99%; 8: 99.62%; 9: 99.90%; 10: 99.95%. Annotations: “Two!” after the length-4 row, “Three!” after the length-8 row, and “Same thing, but two variables”]
  17. Houston, We Have an Opportunity: These techniques fall short of what’s needed
  18. God hath chosen the foolish things of the world to confound the wise [1 Cor 1:27]
  19. Unexpected Things Happen... but who says abnormal is bad?
  20. Bottleneck Detection: Find abnormalities, then determine whether they are system faults
  21. Metrics That Matter: Throughput, concurrency, and change, but not response time (why?)
  22. Algorithms: Various combinations of severity, directionality, run length, duration, and more
  23. Out of Bounds: How often each algorithm detected abnormalities
  24. Drilling Down: One of the algorithms triggered at the gray line
  25. Another Example: This one from a more selective algorithm
  26. Brownian Commotion: Just because it’s long and aimed the right way doesn’t mean it’s scary
  27. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup [Chart legend: QPS, Cxn, Runs]
  28. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup [Chart legend: QPS, Cxn, Runs]
  29. A New Approach: Combinations of algorithms to avoid run-based false positives
  30. I’d Like Fries With That: And supersize my Threads_running, please
  31. Why Not Response Time? We are legion and we want your carat patch
  32. Mass Application: Does it generalize to many different settings?
  33. Dataset #2: Looks reasonable on a different workload, so far
  34. Stall Detection In Action: Sorry, no witty comment here
  35. Oh Crud. What does “bad” mean on a workload like this?
  36. He taketh the wise in their own craftiness. [1 Cor 3:19]
  37. Different Like Everybody Else: Variance-to-mean ratio / index of dispersion to the rescue?
  38. Still Life With Purple Line: If you think this is art, I am happy to sell it to you
  39. When In Doubt, Get Empirical: Measure the 99.7th percentile of V:M on a well-behaved dataset. Good enough for government work.
  40. Workload #1 Redux: Still finds lots of the same “bad spots” with the V:M ratio filter
  41. Workload #2 With New Filter: If you can read this, flip me over!
  42. Conclusions ✤ Workloads differ. GIGO. ✤ Unlikely events are not necessarily bad. ✤ Naive techniques fail; more sophisticated methods are required. ✤ Thresholds are too simplistic, but reappear. ✤ Common sense beats elaborate math. ✤ Automated problem detection is a good thing?
  43. Jobs ✤ Want to work and live in America’s #1 City? ✤ I am hiring DBAs and developers. Talk to me.
  44. Questions? @xaprb • http://www.linkedin.com/in/xaprb
  45. Image Credits ✤ http://www.flickr.com/photos/katej/853418592/ ✤ http://www.flickr.com/photos/domesticat/2963393184/ ✤ http://www.flickr.com/photos/markho/481969187/ ✤ http://www.flickr.com/photos/exquisitur/3502317741/ ✤ http://www.flickr.com/photos/calleephoto/4952091078/ ✤ http://www.flickr.com/photos/paperpariah/4150220583/ ✤ http://www.flickr.com/photos/stevewall/6057281066/ ✤ http://www.flickr.com/photos/mybloodyself/5879425774/ ✤ http://www.flickr.com/photos/josephrobertson/92849605/ ✤ http://en.wikipedia.org/
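
Slide 6 states that 99.7% of measurements fall within ±3 sigmas of the mean in a normal distribution. That figure can be checked with the standard normal error function. This is only a quick numerical check; the function name `fraction_within` is made up for illustration and is not from the talk.

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within +/- k standard deviations."""
    # For a normal distribution, P(|X - mean| <= k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.2%}")
```

For k = 3 the result is about 99.73%, which rounds to the 99.7% on the slide.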
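
Slide 8 describes Shewhart control charts: a metric that strays too many standard deviations from the mean is out of bounds. A minimal sketch of that rule follows, assuming limits are computed from a separate baseline sample and that "too many" means three. The function names, the baseline values, and the choice of k are illustrative assumptions, not details from the talk.

```python
from statistics import mean, pstdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits: baseline mean +/- k standard deviations."""
    center = mean(baseline)
    sigma = pstdev(baseline)  # population standard deviation of the baseline
    return center - k * sigma, center + k * sigma


def out_of_bounds(samples, limits):
    """Indices of samples that fall outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Made-up QPS readings, for illustration only.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
limits = control_limits(baseline)
print(out_of_bounds([100, 99, 140, 101, 60], limits))  # prints [2, 4]
```

As slide 19 points out, a point outside the limits is only abnormal; whether it is bad is a separate question.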
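
Slide 9 mentions Holt-Winters forecasting, which predicts the next value from a smoothed level, trend, and seasonal component. The sketch below is a textbook additive variant with a simple initialisation from the first two seasons. The smoothing parameters, the initialisation, and the function name are assumptions for illustration; the talk does not say which variant or settings were used.

```python
def holt_winters_additive(series, period, alpha=0.5, beta=0.1, gamma=0.1):
    """One-step-ahead forecasts for series[period:] using additive Holt-Winters."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    # Initialise level and trend from the averages of the first two seasons.
    first = sum(series[:period]) / period
    second = sum(series[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    # Seasonal offsets: how far each slot of the first season sits from its mean.
    seasonal = [x - first for x in series[:period]]
    forecasts = []
    for t in range(period, len(series)):
        s = seasonal[t % period]
        forecasts.append(level + trend + s)  # forecast made before seeing series[t]
        x = series[t]
        prev_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (x - level) + (1 - gamma) * s
    return forecasts


# A perfectly periodic, trend-free series is forecast exactly.
series = [10, 20, 30, 20] * 3
print(holt_winters_additive(series, period=4))
```

A detector built on this would compare each observation with its forecast and flag large deviations; how large is "large" is the same threshold question the talk raises.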
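
Slides 10 through 16 treat QPS changes as coin tosses and ask how long a run of same-direction changes must be before it is unusual. The sketch below extracts signed run lengths from a series and computes an idealised improbability, 1 - 0.5^n for one variable and 1 - 0.25^n for two moving together. The slide's percentages appear to be measured and differ slightly from these ideal values, and reading the second column as the two-variable case is an interpretation of the slide 16 annotation. The function names are hypothetical.

```python
from itertools import groupby


def signed_runs(values):
    """Run lengths of consecutive increases (+n) and decreases (-n).

    An unchanged value ends the current run and is not counted itself.
    """
    signs = [(b > a) - (b < a) for a, b in zip(values, values[1:])]
    return [sign * len(list(group)) for sign, group in groupby(signs) if sign != 0]


def run_improbability(length, variables=1):
    """1 - (1/2)^(length * variables), assuming independent fair coin tosses."""
    return 1 - 0.5 ** (length * variables)


print(signed_runs([1, 2, 3, 2, 1, 0, 5]))  # prints [2, -3, 1]
for n in range(1, 11):
    print(n, f"{run_improbability(n):.2%}", f"{run_improbability(n, variables=2):.2%}")
```

Under this idealisation a single-variable run crosses the three-sigma level of about 99.73% between lengths 8 and 9, which is consistent with where the “Three!” annotation appears on the slide.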
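
Slides 37 through 39 suggest the variance-to-mean ratio (index of dispersion) as a filter, with a cutoff taken empirically as the 99.7th percentile on a well-behaved dataset. A rough sketch under several assumptions: the ratio is computed over fixed-size sliding windows, the percentile uses the nearest-rank method, and the window size and function names are invented for illustration. The talk does not specify any of these, nor how the cutoff is combined with the other algorithms.

```python
import math
from statistics import mean, pvariance


def index_of_dispersion(window):
    """Variance-to-mean ratio of one window of samples."""
    m = mean(window)
    if m == 0:
        # Assumption: samples are non-negative counts, so a zero mean means no variation.
        return 0.0
    return pvariance(window) / m


def rolling_vmr(samples, size):
    """Variance-to-mean ratio for every full window of `size` consecutive samples."""
    return [index_of_dispersion(samples[i:i + size])
            for i in range(len(samples) - size + 1)]


def empirical_threshold(ratios, pct=99.7):
    """Nearest-rank percentile of the observed ratios."""
    ordered = sorted(ratios)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up, well-behaved QPS samples, for illustration only.
calm = [100, 101, 99, 102, 98, 100, 101, 99, 100, 102]
cutoff = empirical_threshold(rolling_vmr(calm, size=4))
print(cutoff)
```

Windows on another workload whose ratio exceeds `cutoff` would then be candidates for closer inspection; slide 35's question of what "bad" means on a given workload still applies.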
