Database Stalls, From the Ordinary to the Obscure

VividCortex monitors lots of production database servers, which means we get to see lots of different database problems. One specific type of problem that we like to focus on is database stalls. We define stalls as short periods of time, typically one second, when work isn’t getting done. It’s easy to see when a database isn’t performing its work as usual, but trying to find the cause is much more difficult. Preetam talked about what kinds of metrics and instrumentation he relies on to diagnose obscure stalls, and how to develop a “work-centric” monitoring process to solve problems. Preetam discussed the basics of back pressure, and how applications should properly react to stalls to avoid query stampedes and cascading failures.

Published in: Software

  1. Database Stalls, From the Ordinary to the Obscure. Preetam Jinka (@PreetamJinka), Software Engineer. Percona Live 2017
  2. VividCortex’s database monitoring application is the best way to improve your database performance, efficiency, and uptime. Supporting MySQL, PostgreSQL, Redis, MongoDB, and Amazon Aurora, VividCortex uses patented algorithms to reveal key insights, helping users fix performance problems before they impact customers. Say hello and see a demo, Booth #205. We’re hiring!
  3. This talk isn’t about the math. Come to the O’Reilly booth after the talk to pick up a free copy of our book!
  4. What is a stall?
  5. Stalls
     ● Short periods when work isn’t being done
     ● We’re detecting stalls as short as 1 second
     ● We do this with zero configuration and no fixed thresholds
       ○ The secret sauce: we have a model.
  6. We’re trying to catch small problems before they turn into bigger ones.
  7. Little’s Law
     ● L = λ × W
     ● Concurrency = Throughput × Latency
     ● Little’s Law provides a model to relate throughput and concurrency
     In MySQL:
     ● Concurrency: threads_running
       ○ There’s one thread per query.
       ○ From SHOW STATUS
     ● Throughput: queries completed per second
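A minimal sketch of the Little’s Law relationship from slide 7, not taken from the talk itself. The function names and the sample numbers are illustrative assumptions; in practice the inputs would come from `threads_running` and a queries-completed-per-second counter.

```python
def expected_concurrency(throughput_qps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W (concurrency = throughput * latency)."""
    return throughput_qps * latency_s


def implied_latency(concurrency: float, throughput_qps: float) -> float:
    """Rearranged form: W = L / lambda. Returns infinity when nothing completes."""
    if throughput_qps <= 0:
        return float("inf")
    return concurrency / throughput_qps


# Hypothetical numbers: 2,000 queries/s at 5 ms each keeps about 10 threads busy.
healthy = expected_concurrency(2000, 0.005)

# If threads_running jumps to 40 while only 100 queries/s complete,
# the average time each query spends in the server has grown to 0.4 s.
stalled = implied_latency(40, 100)
```

The rearranged form shows why the two metrics are read together: concurrency rising while throughput falls means each query is spending far longer in the server.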
  8. MySQL Server Stall Example: More queries in progress. Fewer being completed.
  9. MySQL Server Stall Example: All of the stalled queries are completing after the fault ends.
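The pattern on slides 8 and 9 (concurrency up, completions down) can be sketched as a simple per-second check. This is an illustration only, not VividCortex’s detector: the talk says their approach uses a model with no fixed thresholds, whereas `window`, `conc_factor`, and `tput_factor` below are hypothetical fixed parameters chosen to keep the example short.

```python
from statistics import median


def find_stalls(samples, window=5, conc_factor=2.0, tput_factor=0.5):
    """Return indexes of samples that look like stalls.

    samples: list of (threads_running, queries_completed_per_second),
             one tuple per second.
    A sample is flagged when concurrency is well above the recent median
    while throughput is well below the recent median.
    """
    stalls = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        base_conc = median(c for c, _ in history)
        base_tput = median(t for _, t in history)
        conc, tput = samples[i]
        # max(..., 1) avoids flagging noise when the baseline is near zero.
        if conc > conc_factor * max(base_conc, 1) and tput < tput_factor * base_tput:
            stalls.append(i)
    return stalls


# Hypothetical trace: steady state, a one-second stall, then a burst of
# completions as the stalled queries finish.
trace = [(4, 1000)] * 6 + [(40, 100)] + [(4, 1800)] + [(4, 1000)] * 3
stall_seconds = find_stalls(trace)
```

Using the median of a short trailing window keeps a single stalled second from distorting the baseline for the seconds that follow it.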
  10. Where do stalls come from?
      ● Running out of credits on EBS volumes
      ● MySQL query cache
      ● Lock contention
      ● A bad network cable!
      ● Transparent huge pages (THP)
        ○ “If a transparent huge page isn’t available, the application will stall to let memory compaction run to free a page.”
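For the THP item, one quick check on a Linux host is to read the kernel’s current setting. The sketch below assumes the standard sysfs location `/sys/kernel/mm/transparent_hugepage/enabled`, whose contents mark the active mode in square brackets (for example `always [madvise] never`). The helper names are illustrative and not from the talk.

```python
import re
from pathlib import Path

# Standard sysfs location on Linux; may be absent on other systems.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_setting(text):
    """Return the bracketed (active) mode, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_setting(path=THP_PATH):
    """Read the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_setting(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_setting())
```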
  11. But we don’t really care about any of those things. We’re focused on the work your database is doing.
  12. Work-centric monitoring
  13. Work-centric monitoring in one slide
      ● Focus on the work your systems are doing
      ● Find relationships between metrics (maybe using a model)
      ● Monitor what you want to optimize
      ● Focus on heavy hitters
      ● Automatically detect changes
  14. How to respond to database stalls
  15. Slowness is about spending time on something. Things spend time doing work or waiting.
  16. Work
      ● CPU
      ● Disk I/O
      ● Various storage engine metrics
      ● Slow queries
        ○ Large scans
      Waiting
      ● Lock contention
      ● Disk I/O
      ● Memory compaction
  17. Walkthrough
  18-22. (Image-only slides; no text in the transcript.)
  23. Be careful about causality.
  24. Thread states
  25. Back pressure
  26. Back pressure is about systems receiving more work than they can process.
  27. (Image-only slide; no text in the transcript.)
  28. It’s much better to handle back pressure higher up the stack.
  29. Clients, APIs, Database System
  30. Low-level back pressure can cause unfair slowdowns higher up the stack.* *Totally untested hypothesis. :)
  31. (Image-only slide; no text in the transcript.)
  32. 50 ms shift
  33. 50 ms shift
      ● ~1 sec queries stay ~1 sec queries (1x)
      ● ~1 ms queries become ~50 ms queries (50x)
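The arithmetic behind slide 33 can be worked through directly. This sketch assumes the shift is a fixed delay added to every query, which is one reading of the slide; `slowdown` is an illustrative name. The slide’s figures are rounded: 1 ms plus 50 ms is 51 ms.

```python
def slowdown(base_ms: float, shift_ms: float = 50.0) -> float:
    """Relative slowdown when a fixed delay is added to a query's latency."""
    return (base_ms + shift_ms) / base_ms


slow_query = slowdown(1000.0)  # 1050 ms / 1000 ms: barely noticeable
fast_query = slowdown(1.0)     # 51 ms / 1 ms: roughly the slide's 50x
```

The same absolute delay is negligible for long queries and dominant for short ones, which is why the hypothesis on slide 30 calls the slowdown unfair.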
  34. Ways to deal with back pressure
      ● Rate limiting / throttling
      ● Use a queue to contain requests at a higher level
      ● Somehow prioritize some requests over others
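The first two options on slide 34 can be sketched in a few lines. This is an illustration under assumptions, not code from the talk: `TokenBucket` and `submit` are hypothetical names, the token bucket is one common way to rate limit, and the bounded queue simply rejects work when full so that the caller can shed load or retry later.

```python
import queue
import time


class TokenBucket:
    """Rate limiter: allows `rate_per_s` requests per second with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; never blocks."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def submit(work_queue, request):
    """Enqueue a request, or return False when the bounded queue is full."""
    try:
        work_queue.put_nowait(request)
        return True
    except queue.Full:
        return False


# Usage: reject early at the API layer instead of piling queries onto the database.
limiter = TokenBucket(rate_per_s=100, burst=20)
pending = queue.Queue(maxsize=50)
accepted = limiter.try_acquire() and submit(pending, "SELECT 1")
```

Both mechanisms fail fast rather than wait, which keeps a stalled database from accumulating a backlog that stampedes once the stall ends.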
  35. Can you eliminate stalls? Probably not all. Most? Perhaps!
  36. Questions? Come find me at the O’Reilly booth! Twitter: @PreetamJinka Email: preetam@vividcortex.com