Adaptive Fault Detection
Baron Schwartz • Percona Live NYC 2012
  • Speaker note for slides 13-16: seq 1 15 | awk '{printf "%.2f%%\n", 100-(.499**$1*100)}'
  • Adaptive Fault Detection

    1. Adaptive Fault Detection: Baron Schwartz • Percona Live NYC 2012
    2. Me
        ✤ Author of High Performance MySQL
        ✤ Creator of some tools that you might use
        ✤ I love hearing from people just like you:
            ✤ @xaprb on Twitter
            ✤ http://www.linkedin.com/in/xaprb
        [Book cover: High Performance MySQL: Optimization, Backups, Replication, and more, 3rd Edition, Covers Version 5.5. Baron Schwartz, Peter Zaitsev & Vadim Tkachenko]
    3. Conventional Fault Detection: Metrics, Thresholds, and Actions
    4. Nagios and Thresholds: Is there a right answer?
    5. Motivations
        ✤ Detect unknown failure modes
        ✤ Capture diagnostic data automatically
        ✤ Surface relevant information
    6. Six Sigmas: 99.7% of measurements fall within ±3 sigmas of the mean in a normal distribution
    7. Abnormality Detection: Statistical process control, operations research, and intuition
    8. Shewhart Control Charts: Metrics that fall too many standard deviations above or below the mean are out of bounds
    9. Holt-Winters Forecasting: Predict the future based on history, trend, and seasonality
    10. Brownian Motion: A random walk shouldn’t go the same way for long
    11. Coin Tossing: QPS increases and decreases with ~equal probability
        [Chart: Probability of Increase/Decrease. Increase 49.88%, Same 0.21%, Decrease 49.91%]
    12. I Feel Normal Today: Long runs of QPS increases/decreases behave like a coin toss
        [Chart: Length of QPS Runs. Horizontal axis from -9 to 9, vertical axis from 0 to 100000]
    13. Oddly Even: How long is an unusually long random walk?

        Run Length    Improbability
        1             50.10%    75.09%
        2             75.10%    93.79%
        3             87.57%    98.46%
        4             93.80%    99.62%
        5             96.91%    99.90%
        6             98.46%    99.98%
        7             99.23%    99.99%
        8             99.62%
        9             99.90%
        10            99.95%

    14. Oddly Even: the same table as slide 13, with the callout "Two !" added
    15. Oddly Even: the same table, with the callouts "Two !" and "Three !"
    16. Oddly Even: the same table, with the callouts "Two !", "Three !", and "Same thing, but two variables"
    17. Houston, We Have an Opportunity: These techniques fall short of what’s needed
    18. God hath chosen the foolish things of the world to confound the wise [1 Cor 1:27]
    19. Unexpected Things Happen... but who says abnormal is bad?
    20. Bottleneck Detection: Find abnormalities, then determine whether they are system faults
    21. Metrics That Matter: Throughput, concurrency, and change, but not response time (why?)
    22. Algorithms: Various combinations of severity, directionality, run length, duration, and more
    23. Out of Bounds: How often each algorithm detected abnormalities
    24. Drilling Down: One of the algorithms triggered at the gray line
    25. Another Example: This one from a more selective algorithm
    26. Brownian Commotion: Just because it’s long and aimed the right way doesn’t mean it’s scary
    27. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup
        [Chart legend: QPS, Cxn, Runs]
    28. See Spots Run: There’s a clear run of decreasing QPS and increasing connections, but no stall/lockup
        [Chart legend: QPS, Cxn, Runs]
    29. A New Approach: Combinations of algorithms to avoid run-based false positives
    30. I’d Like Fries With That: And supersize my Threads_running, please
    31. Why Not Response Time? We are legion and we want your carrot patch
    32. Mass Application: Does it generalize to many different settings?
    33. Dataset #2: Looks reasonable on a different workload, so far
    34. Stall Detection In Action: Sorry, no witty comment here
    35. Oh Crud. What does “bad” mean on a workload like this?
    36. He taketh the wise in their own craftiness. [1 Cor 3:19]
    37. Different Like Everybody Else: Variance-to-mean ratio / index of dispersion to the rescue?
    38. Still Life With Purple Line: If you think this is art, I am happy to sell it to you
    39. When In Doubt, Get Empirical: Measure the 99.7th percentile of V:M on a well-behaved dataset. Good enough for government work.
    40. Workload #1 Redux: Still finds lots of the same “bad spots” with the V:M ratio filter
    41. If you can read this, flip me over! Workload #2 With New Filter
    42. Conclusions
        ✤ Workloads differ. GIGO.
        ✤ Unlikely events are not necessarily bad.
        ✤ Naive techniques fail; more sophisticated methods are required.
        ✤ Thresholds are too simplistic, but reappear.
        ✤ Common sense beats elaborate math.
        ✤ Automated problem detection is a good thing?
    43. Jobs
        ✤ Want to work and live in America’s #1 City?
        ✤ I am hiring DBAs and developers. Talk to me.
    44. Questions? @xaprb • http://www.linkedin.com/in/xaprb
    45. Image Credits
        ✤ http://www.flickr.com/photos/katej/853418592/
        ✤ http://www.flickr.com/photos/domesticat/2963393184/
        ✤ http://www.flickr.com/photos/markho/481969187/
        ✤ http://www.flickr.com/photos/exquisitur/3502317741/
        ✤ http://www.flickr.com/photos/calleephoto/4952091078/
        ✤ http://www.flickr.com/photos/paperpariah/4150220583/
        ✤ http://www.flickr.com/photos/stevewall/6057281066/
        ✤ http://www.flickr.com/photos/mybloodyself/5879425774/
        ✤ http://www.flickr.com/photos/josephrobertson/92849605/
        ✤ http://en.wikipedia.org/
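
The control-chart idea on slides 6 and 8 (treat a measurement as out of bounds when it lies too many standard deviations from the mean) can be sketched in a few lines of Python. This is only an illustration, not the talk's implementation: the function name `out_of_bounds`, the choice of a separate baseline window, the default of k = 3, and the sample QPS readings are all assumptions.

```python
from statistics import mean, pstdev


def out_of_bounds(baseline, samples, k=3.0):
    """Return (index, value) pairs from `samples` that lie more than
    k standard deviations from the mean of `baseline`."""
    center = mean(baseline)
    sigma = pstdev(baseline)  # population standard deviation of the baseline
    lower, upper = center - k * sigma, center + k * sigma
    return [(i, x) for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical QPS readings, invented for this example.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
samples = [101, 99, 140, 100, 60]

flagged = out_of_bounds(baseline, samples)
print(flagged)
```

A fixed k is itself a threshold, so this sketch inherits the weakness slide 17 points at: it reports what is statistically unusual, not what is actually a fault.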
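
The run-length reasoning on slides 10 to 16 can be sketched as follows: split a series into runs of consecutive increases or decreases, then score a run of length n as 1 - p^n, the same arithmetic as the awk one-liner in the speaker notes (p = 0.499). The function names and the sample series are assumptions made for illustration, and the formula treats successive steps as independent, which is the coin-toss model from the slides rather than a property of any particular workload.

```python
def run_lengths(series):
    """Return signed run lengths: +n for n consecutive increases,
    -n for n consecutive decreases. An unchanged value ends the run."""
    runs = []
    current = 0
    for prev, cur in zip(series, series[1:]):
        step = (cur > prev) - (cur < prev)  # +1, -1, or 0
        if step != 0 and current * step > 0:
            current += step  # the run continues in the same direction
        else:
            if current:
                runs.append(current)
            current = step  # start a new run (or none, if step == 0)
    if current:
        runs.append(current)
    return runs


def improbability(n, p=0.499):
    """1 - p**n: one minus the chance of n same-direction steps in a row,
    assuming independent steps that each go that way with probability p."""
    return 1 - p ** n


# Hypothetical QPS samples, invented for this example.
qps = [10, 11, 12, 13, 12, 11, 11, 12]
for run in run_lengths(qps):
    print(f"run {run:+d}: improbability {improbability(abs(run)):.2%}")
```

With p = 0.499 this gives 50.10% for a run of length 1 and 96.91% for a run of length 5, matching those entries in the slide 13 table.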

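
Slides 37 and 39 suggest filtering on the variance-to-mean ratio (index of dispersion), with a cutoff taken empirically as the 99.7th percentile of that ratio on a well-behaved dataset. The sketch below shows one way that could look. The window size, the nearest-rank percentile method, the function names, and the sample data are all assumptions, not details from the talk.

```python
import math
from statistics import mean, pvariance


def vmr(window):
    """Variance-to-mean ratio (index of dispersion) of one window."""
    m = mean(window)
    var = pvariance(window)
    if m == 0:
        return 0.0 if var == 0 else math.inf
    return var / m


def rolling_vmr(series, size):
    """V:M ratio of every full window of `size` consecutive samples."""
    return [vmr(series[i:i + size]) for i in range(len(series) - size + 1)]


def percentile(values, pct):
    """Nearest-rank percentile of `values` (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical data: a calm workload used to calibrate the cutoff,
# and one window from a workload that swings wildly.
well_behaved = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100]
threshold = percentile(rolling_vmr(well_behaved, 5), 99.7)

suspect = [100, 101, 40, 160, 100]
print(vmr(suspect) > threshold)
```

Because the cutoff comes from measured data rather than a formula, it has to be recalibrated per workload, which fits the "Workloads differ" conclusion on slide 42.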