Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

When it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber

12 views

Published on

You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…

Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!

In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

When it all Goes Wrong |Nordic PGDay 2019 | Will Leinweber

  1. 1. When it all Goes Wrong Nordic PGDay 2019 — March 19 — Copenhagen
  2. 2. @leinweber Will Leinweber @leinweber Citus Data (Microsoft) bitfission.com
 (warning autoplays midi)
  3. 3. @leinweber coming from citus cloud heroku postgres
  4. 4. @leinweber special thanks citus cloud
 — dan farina (@danfarina) heroku postgres
 — maciek sakrejda (@uhoh_itsmaciek)
  5. 5. @leinweber same sorts of problems from pages & alerts from support tickets
  6. 6. @leinweber this talk more app dev who uses postgres
 rather than dba
  7. 7. @leinweber the problem with Postgres it’s pretty good you don’t get experience with how it breaks
  8. 8. @leinweber what to do for a problem
  9. 9. @leinweber what to do for a problem
  10. 10. @leinweber complicated system network hardware o/s postgres
  11. 11. @leinweber using the database (too much) 95% application 4% auto vacuum 1% everything else
  12. 12. @leinweber hard to convince all the graphs saying DB is slow and nothing has changed …must be the database!
  13. 13. @leinweber https://upload.wikimedia.org/wikipedia/commons/9/98/Survivorship-bias.png
  14. 14. @leinweber “but I didn’t change anything” no deploys! no database migrations! no scaling!
  15. 15. @leinweber “but I didn’t change anything” https://upload.wikimedia.org/wikipedia/commons/0/09/Redherring.gif
  16. 16. @leinweber “but I didn’t change anything” more traffic? change in access patterns? one big user logged in?
  17. 17. @leinweber run out of a resource
  18. 18. @leinweber snowball
  19. 19. @leinweber example manageable user 1s query => 2x expensive frequent, small queries 3ms => 12ms
  20. 20. @leinweber assumptions hardware maintenance app
  21. 21. @leinweber assumptions postgres should not crash …with overcommit off and no containers large extensions increase chance
  22. 22. @leinweber if not postgres, then what
  23. 23. @leinweber system resources cpu memory disk parallelism / backends locks
  24. 24. @leinweber cpu mem disk parallelismcpu mem disk parallelism
  25. 25. @leinweber cpu mem disk parallelism credentials wrong networking broken locking issue, check pg_locks idle in transaction
  26. 26. @leinweber cpu mem disk parallelism application submitting backlogged workload connection leak pool sizes set too large pg_lock issue + application backlog
  27. 27. @leinweber cpu mem disk parallelism workload skew causing thrashing unusual sequential scan workload failover or restart => no cache pg_prewarm
  28. 28. @leinweber cpu mem disk parallelism same as just disk, but also the application is piling on
  29. 29. @leinweber cpu mem disk parallelism large GROUP BYs high disk latency due to unusual page dispersion pattern in the workload
  30. 30. @leinweber cpu mem disk parallelism workload has high mem (GROUP BY)
 + app adding backlog lock contention slowing mem release
  31. 31. @leinweber cpu mem disk parallelism large GROUP BYs + paging in unusual data
  32. 32. @leinweber cpu mem disk parallelism Look for what is causing disk access
  33. 33. @leinweber cpu mem disk parallelism small, in-memory workload lots of seq scans on small table index scan w/ filter dropping lots
  34. 34. @leinweber cpu mem disk parallelism app backlog 
 + too much processing on small data simply a lot of work
  35. 35. @leinweber cpu mem disk parallelism large seq scans
  36. 36. @leinweber cpu mem disk parallelism loading cold data + application backlog
  37. 37. @leinweber cpu mem disk parallelism small # of backends doing a lot more work
  38. 38. @leinweber cpu mem disk parallelism entity, workload, entity*workload soft deletes and non-conditional indexes
  39. 39. @leinweber cpu mem disk parallelism reporting query
  40. 40. @leinweber cpu mem disk parallelism app backlog, but with CPU/mem problems
  41. 41. @leinweber tools of the trade
  42. 42. @leinweber tools of the trade C symbols
  43. 43. @leinweber tools of the trade: perf perf record -p <pid> && perf report
  44. 44. @leinweber tools of the trade: perf perf top
  45. 45. @leinweber tools of the trade: perf www.brendangregg.com/perf.html
  46. 46. @leinweber tools of the trade: gdb gdb -batch -ex 'bt' -p <pid>
  47. 47. @leinweber
  48. 48. @leinweber
  49. 49. @leinweber tools of the trade: iostat iostat -xm 10
  50. 50. @leinweber tools of the trade: iotop
  51. 51. @leinweber tools of the trade: htop
  52. 52. @leinweber Tools of the trade: bwm-ng
  53. 53. @leinweber tools of the trade: backends pgrep -lf postgres + grep + wc select * from pg_stat_activity
  54. 54. @leinweber tools of the trade: pg_s_s select * from pg_stat_statements
  55. 55. @leinweber tools of the trade: summary cpu mem disk parallelism network perf x gdb x iostat x iotop x htop x x bwm x pgrep x
  56. 56. @leinweber what to do
  57. 57. @leinweber what to do configuration change
  58. 58. @leinweber what to do db change
  59. 59. @leinweber what to do code change
  60. 60. @leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience In Complex Adaptive Systems” Jens Rasmussen:
 Risk management in a dynamic society: a modeling problem
  61. 61. @leinweber flirting with disaster economic boundary
  62. 62. @leinweber flirting with disaster economic boundary workload boundary
  63. 63. @leinweber flirting with disaster economic boundary workload boundary performance boundary
  64. 64. @leinweber flirting with disaster economic boundary workload boundary performance boundary error margin
  65. 65. @leinweber flirting with disaster economic boundary workload boundary performance boundary
  66. 66. @leinweber flirting with disaster economic boundary workload boundary performance boundary error margin
  67. 67. @leinweber flirting with disaster economic boundary workload boundary performance boundary error margin
  68. 68. @leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience In Complex Adaptive Systems” Jens Rasmussen: 
 Risk management in a dynamic society: a modeling problem
  69. 69. @leinweber thank youWill Leinweber @leinweber citusdata.com

×