Database Health Check


  1. Database Server Health Check - Josh Berkus, PostgreSQL Experts Inc., pgCon 2010
  2. DATABASE SERVER HELP 5¢
  3. Program of Treatment
     ● What is a Healthy Database?
     ● Know Your Application
     ● Load Testing
     ● Doing a database server checkup
        ● hardware
        ● OS & FS
        ● PostgreSQL
        ● application
     ● Common Ailments of the Database Server
  4. What is a Healthy Database Server?
  5. What is a Healthy Database Server?
     ● Response Times
  6. What is a Healthy Database Server?
     ● Response Times
        ● lower than required
        ● consistent & predictable
     ● Capacity for more
        ● CPU and I/O headroom
        ● low server load
  7. [Chart: median and max response time vs. number of clients (25–250), with the expected load marked]
  8. What is an Unhealthy Database Server?
     ● Slow response times
     ● Inconsistent response times
     ● High server load
     ● No capacity for growth
  9. [Chart: the same axes, with response times climbing sharply before the expected load is reached]
  10. A healthy database server is able to maintain consistent and acceptable response times under expected loads, with margin for error.
  11. [Chart: median response time vs. number of clients (25–250)]
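The health criteria on these slides (median and max response time against a required limit) are easy to compute from raw load-test samples. A minimal sketch; the sample values and the 100 ms limit are invented for illustration:

```python
import statistics

def response_time_report(samples_ms, sla_ms):
    """Summarize load-test response times against a required limit."""
    return {
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        # healthy = consistently under the required limit
        "healthy": max(samples_ms) <= sla_ms,
    }

report = response_time_report([12, 14, 13, 15, 90], sla_ms=100)
```

Feed it the per-request timings collected during a test run; tracking how "max_ms" moves as client count rises shows how close you are to the wall.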
  12. Hitting The Wall
  13. CPUs Floored
      CPU    %user   %system  %iowait  %idle
      all    69.36    0.13     24.87    5.77
      0      88.96    0.09     10.03    1.11
      1      12.09    0.02     86.98    0.00
      2      98.90    0.00      0.00   10.10
      3      77.52    0.44      1.70   20.34
      16:38:29 up 13 days, 22:10, 3 users, load average: 11.05, 9.08, 8.13
  15. IO Saturated
      Device:   tps      MB_read/s  MB_wrtn/s
      sde       414.33   0.40       38.15
      sdf       1452.00  99.14      29.00
      CPU    %user   %system  %iowait  %idle
      all    34.75    0.13     58.75    6.37
      0       8.96    0.09     90.03    1.11
      1      12.09    0.02     86.98    0.00
      2      91.90    0.00      7.00   10.10
      3      27.52    0.44     51.70   20.34
  16. Out of Connections
      FATAL: connection limit exceeded for non-superusers
  17. How close are you to the wall?
  18. The Checkup (full physical)
      1. Analyze application
      2. Analyze platform
      3. Correct anything obviously wrong
      4. Set up load test
      5. Monitor load test
      6. Analyze results
      7. Correct issues
  19. The Checkup (semi-annual)
      1. Check response times
      2. Check system load
      3. Check previous issues
      4. Check for signs of illness
      5. Fix new issues
  20. Know your application!
  21. Application database usage: which does your application do?
      ✔ small reads
      ✔ large sequential reads
      ✔ small writes
      ✔ large writes
      ✔ long-running procedures/transactions
      ✔ bulk loads and/or ETL
  22. What Color Is My Application?
      ● Web Application (Web)
      ● Online Transaction Processing (OLTP)
      ● Data Warehousing (DW)
  23. What Color Is My Application?
      ● Web Application (Web)
         ● DB much smaller than RAM
         ● 90% or more simple queries
      ● Online Transaction Processing (OLTP)
      ● Data Warehousing (DW)
  24. What Color Is My Application?
      ● Web Application (Web)
         ● DB smaller than RAM
         ● 90% or more simple queries
      ● Online Transaction Processing (OLTP)
         ● DB slightly larger than RAM, up to 1TB
         ● 20–40% small data write queries
         ● some long transactions and complex read queries
      ● Data Warehousing (DW)
  25. What Color Is My Application?
      ● Web Application (Web)
         ● DB smaller than RAM
         ● 90% or more simple queries
      ● Online Transaction Processing (OLTP)
         ● DB slightly larger than RAM, up to 1TB
         ● 20–40% small data write queries
         ● some long transactions and complex read queries
      ● Data Warehousing (DW)
         ● large to huge databases (100GB to 100TB)
         ● large complex reporting queries
         ● large bulk loads of data
         ● also called "Decision Support" or "Business Intelligence"
  26. What Color Is My Application?
      ● Web Application (Web)
         ● CPU-bound
         ● ailments: idle connections/transactions, too many queries
      ● Online Transaction Processing (OLTP)
         ● CPU- or I/O-bound
         ● ailments: locks, database growth, idle transactions, database bloat
      ● Data Warehousing (DW)
         ● I/O- or RAM-bound
         ● ailments: database growth, longer-running queries, memory usage growth
  27. Special features required?
      ● GIS
         ● heavy CPU for GIS functions
         ● lots of RAM for GIS indexes
      ● TSearch
         ● lots of RAM for indexes
         ● slow response time on writes
      ● SSL
         ● response time lag on connections
  28. Load Testing
  29. [Chart: requests per second (0–80) over a 24-hour day]
  30. [The same chart, with the period where load exceeds capacity marked DOWNTIME]
  31. When preventing downtime, it is not average load which matters, it is peak load.
  32. What to load test
      ● Load should be as similar as possible to your production traffic
      ● You should be able to create your target level of traffic
         ● better: incremental increases
      ● Test the whole application as well
         ● the database server may not be your weak point
  33. How to Load Test
      1. Set up a load testing tool: you'll need test servers for this
      2. Turn on PostgreSQL, HW, and application monitoring: all monitoring should start at the same time
      3. Run the test for a defined time: 1 hour is usually good
      4. Collect and analyze data
      5. Re-run at a higher level of traffic
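The loop behind steps 3–5 can be sketched in a few lines. This is a deliberately tiny, single-threaded driver, not a replacement for Siege or tsung; the request function is a placeholder you would swap for a real HTTP or SQL call:

```python
import time

def run_load_test(request_fn, clients, duration_s):
    """Single-threaded sketch of a load-test driver: issue requests for a
    fixed window and record per-request latency in milliseconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        request_fn()                       # one application request
        samples.append((time.monotonic() - start) * 1000.0)
    return {"clients": clients,
            "requests": len(samples),
            "max_ms": max(samples)}

# usage sketch: swap the no-op for a real HTTP request or SQL statement
result = run_load_test(lambda: None, clients=50, duration_s=0.05)
```

Re-running with increasing client counts (step 5) and keeping each run's summary gives you the response-time-vs-clients curve from the earlier slides.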
  34. Test Servers
      ● Must be as close as reasonable to production servers
         ● otherwise you don't know how production will be different
         ● there is no predictable multiplier
      ● Double them up as your development/staging or failover servers
      ● If your test server is much smaller, then you need to do a same-load comparison
  35. Tools for Load Testing
  36. Production Test
      1. Determine the peak load hour on the production servers
      2. Turn on lots of monitoring during that peak load hour
      3. Analyze results
      Pretty much your only choice without a test server.
  37. Issues with Production Test
      ● Not repeatable
         − load won't be exactly the same ever again
      ● Cannot test target load
         − just whatever happens to occur during that hour
         − can't test incremental increases either
      ● Monitoring may hurt production performance
      ● Cannot test experimental changes
  38. The Ad-Hoc Test
      ● Get 10 to 50 coworkers to open several sessions each
      ● Have them go crazy using the application
  39. Problems with Ad-Hoc Testing
      ● Not repeatable
         ● minor changes in response times may be due to changes in worker activity
      ● Labor-intensive
         ● each test run shuts down the office
      ● Can't reach target levels of load
         ● unless you have a lot of coworkers
  40. Siege
      ● HTTP traffic generator
         ● all test interfaces must be addressable as URLs
         ● useless for non-web applications
      ● Simple to use
         ● create a simple load test in a few hours
      ● Tests the whole web application
         ● cannot test the database separately
      ● http://www.joedog.org/index/siege-home
  41. pgReplay
      ● Replays your activity logs at variable speed
         ● get exactly the traffic you get in production
      ● Good for testing just the database server
      ● Can take time to set up
         ● need a database snapshot, collect activity logs
         ● must already have production traffic
         ● http://pgreplay.projects.postgresql.org/
  42. tsung
      ● Generic load generator in Erlang
         ● a load testing kit rather than a tool
         ● generate a tsung file from your activity logs using pgFouine and test the database
         ● generate load for a web application using custom scripts
      ● Can be time-consuming to set up
         ● but highly configurable and advanced
         ● very scalable: a cluster of load testing clients
         ● http://tsung.erlang-projects.org/
  43. pgBench
      ● Simple micro-benchmark
         ● not like any real application
      ● Version 9.0 adds multi-threading, customization
         ● write custom pgBench scripts
         ● run against a real database
      ● Fairly ad-hoc compared to other tools
         ● but easy to set up
         ● ships with PostgreSQL
  44. Benchmarks
      ● Many "real" benchmarks available
         ● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.
      ● Useful for testing your hardware
         ● not useful for testing your application
      ● Often time-consuming and complex
  45. Platform-specific
      ● Web framework or platform tests
         ● Rails: ActionController::PerformanceTest
         ● J2EE: OpenDemand, Grinder, many more
            – JBoss, BEA have their own tools
         ● Zend Framework Performance Test
      ● Useful for testing specific application performance
         ● such as performance of specific features, modules
      ● Not all platforms have them
  46. Flight-Check
      ● Attend the tutorial tomorrow!
  47. Monitoring PostgreSQL during the load test:
      logging_collector = on
      log_destination = csvlog
      log_filename = load_test_1_%h
      log_rotation_age = 60min
      log_rotation_size = 1GB
      log_min_duration_statement = 0
      log_connections = on
      log_disconnections = on
      log_temp_files = 100kB
      log_lock_waits = on
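With log_min_duration_statement set, each timed statement's log message begins "duration: &lt;ms&gt; ms", which makes post-test analysis scriptable. A small sketch that pulls out durations above a threshold; the sample log lines are invented:

```python
import re

# matches the duration prefix PostgreSQL puts on timed-statement messages
DURATION_RE = re.compile(r"duration: ([\d.]+) ms")

def slow_statements(messages, threshold_ms):
    """Return the durations (ms) of logged statements over the threshold."""
    hits = []
    for msg in messages:
        m = DURATION_RE.search(msg)
        if m and float(m.group(1)) > threshold_ms:
            hits.append(float(m.group(1)))
    return hits

hits = slow_statements(
    ["duration: 3.210 ms  statement: SELECT 1",
     "duration: 5400.100 ms  statement: SELECT * FROM big_table"],
    threshold_ms=1000)
```

In practice you would feed it the message column of the csvlog files collected during the test run; tools like pgFouine do a more thorough version of the same thing.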
  48. Monitoring hardware during the load test:
      sar -A -o load_test_1.sar 30 240
      iostat, or fsstat, or zfs iostat
  49. Monitoring the application during the load test
      ● Collect response times
         ● with timestamp
         ● with activity
      ● Monitor hardware and utilization
         ● activity
         ● memory & CPU usage
      ● Record errors & timeouts
  50. Checking Hardware
  51. Checking Hardware
      ● CPUs and Cores
      ● RAM
      ● I/O & disk support
      ● Network
  52. CPUs and Cores
      ● Pretty simple:
         ● number
         ● type
         ● speed
         ● L1/L2 cache
      ● Rules of thumb:
         ● fewer faster CPUs is usually better than more slower ones
         ● core != cpu
         ● thread != core
         ● virtual core != core
  53. CPU calculations
      ● ½ to 1 core for OS
      ● ½ to 1 core for software RAID or ZFS
      ● 1 core for postmaster and bgwriter
      ● 1 core per:
         ● DW: 1 to 3 concurrent users
         ● OLTP: 10 to 50 concurrent users
         ● Web: 100 to 1000 concurrent users
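The rule of thumb above can be turned into a quick sizing function. The users-per-core figures below are assumed values picked from inside the slide's ranges; tune them for your workload:

```python
import math

# assumed midpoint-ish values from the slide's ranges
USERS_PER_CORE = {"web": 500, "oltp": 25, "dw": 2}

def cores_needed(workload, concurrent_users, software_raid=False):
    """Rough core count per the slide's rule of thumb."""
    cores = 1                     # OS
    cores += 1                    # postmaster and bgwriter
    if software_raid:
        cores += 1                # software RAID or ZFS
    cores += math.ceil(concurrent_users / USERS_PER_CORE[workload])
    return cores

n = cores_needed("oltp", concurrent_users=100, software_raid=True)
```

So an OLTP system with 100 concurrent users on software RAID pencils out to about 7 cores; a starting point for hardware review, not a hard requirement.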
  54. CPU tools
      ● sar
      ● mpstat
      ● pgTop
  55. In praise of sar
      ● collects data about all aspects of HW usage
      ● available on most OSes
         ● but output is slightly different
      ● easiest tool for collecting basic information
         ● often enough for server-checking purposes
      ● BUT: does not report all data on all platforms
  56. sar
      CPUs:    sar -P ALL and sar -u
      Memory:  sar -r and sar -R
      I/O:     sar -b and sar -d
      Network: sar -n
  57. sar CPU output
      Linux:
      06:05:01 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
      06:15:01 AM  all  14.26  0.09   6.01     1.32     0.00    78.32
      06:15:01 AM  0    14.26  0.09   6.01     1.32     0.00    78.32
      Solaris:
      15:08:56  %usr  %sys  %wio  %idle
      15:09:26    10     5     0     85
      15:09:56     9     7     0     84
      15:10:26    15     6     0     80
      15:10:56    14     7     0     79
      15:11:26    15     5     0     80
      15:11:56    14     5     0     81
  58. Memory
      ● Only one statistic: how much?
      ● Not generally an issue on its own
         ● low memory can cause more I/O
         ● low memory can cause more CPU time
  59. [Diagram: memory sizing — shared buffers vs. filesystem cache vs. work_mem/maintenance_work_mem, and where data lives: in buffer, in cache, or on disk]
  60. Figure out Memory Sizing
      ● What is the active portion of your database?
         ● i.e. gets queried frequently
      ● How large is it?
      ● Where does it fit into the size categories?
      ● How large is the inactive portion of your database?
         ● how frequently does it get hit? (remember backups)
  61. Memory Sizing
      ● Other needs for RAM, i.e. work_mem:
         ● sorts and aggregates: do you do a lot of big ones?
         ● GIN/GiST indexes: these can be huge
         ● hashes: for joins and aggregates
         ● VACUUM
  62. I/O Considerations
      ● Throughput
         ● how fast can you get data off disk?
      ● Latency
         ● how long does it take to respond to requests?
      ● Seek time
         ● how long does it take to find random disk pages?
  63. I/O Considerations
      ● Throughput
         ● important for large databases
         ● important for bulk loads
      ● Latency
         ● huge effect on small writes & reads
         ● not so much on large scans
      ● Seek time
         ● important for small writes & reads
         ● very important for index lookups
  64. I/O Considerations
      ● Web
         ● concerned about read latency & seek time
      ● OLTP
         ● concerned about write latency & seek time
      ● DW/BI
         ● concerned about throughput & seek time
  65. [Bonnie++ output for two systems: sequential output/input (per-char, block, rewrite) and random seeks; the first shows 240548 K/s block writes, 185634 K/s block reads, and 1140 seeks/s]
  66. Common I/O Types
      ● Software RAID & ZFS
      ● Hardware RAID Array
      ● NAS/SAN
      ● SSD
  67. Hardware RAID Sanity Check
      ● RAID 1 / 10, not 5
      ● Battery-backed write cache?
         ● otherwise, turn the write cache off
      ● SATA < SCSI/SAS
         ● about ½ real throughput
      ● Enough drives?
         ● 4–14 for an OLTP application
         ● 8–48 for DW/BI
  68. SW RAID / ZFS Sanity Check
      ● Enough CPUs?
         ● will need one for the RAID
      ● Enough disks?
         ● same as hardware RAID
      ● Extra configuration?
         ● caching
         ● block size
  69. NAS/SAN Sanity Check
      ● Check latency!
      ● Check real throughput
         ● drivers often a problem
      ● Enough network bandwidth?
         ● multipath or fiber required to get HW RAID performance
  70. SSD Sanity Check
      ● 1 SSD = 4 drives
         ● relative performance
      ● Check write cache configuration
         ● make sure data is safe
      ● Test real throughput, seek times
         ● drivers often a problem
      ● Research durability stats
  71. I/O Tools
      ● I/O tests:
         ● dd test
         ● Bonnie++
         ● IOZone
         ● filebench
      ● Monitoring tools:
         ● sar
         ● mpstat iowait
         ● iostat
         ● on ZFS: fsstat, zfs iostat
         ● EXPLAIN ANALYZE
  72. Network
      ● Throughput
         ● not usually an issue, except:
            – iSCSI / NAS / SAN
            – ELT & bulk load processes
         ● remember that gigabit is only 100MB/s!
      ● Latency
         ● real issue for Web / OLTP
         ● consider putting app ↔ database on a private network
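The "gigabit is only 100MB/s" point is worth checking with arithmetic before planning a bulk load or base backup over the network. A sketch; the 0.8 efficiency factor is an assumption covering protocol overhead (1 Gbit/s is 125 MB/s raw):

```python
def transfer_time_s(data_gb, link_gbit=1.0, efficiency=0.8):
    """Back-of-the-envelope wire time for moving data_gb gigabytes."""
    bytes_per_s = link_gbit * 1e9 / 8 * efficiency   # ~100 MB/s for 1 Gbit
    return data_gb * 1e9 / bytes_per_s

t = transfer_time_s(100)   # pushing a 100 GB bulk load over gigabit
```

So a 100 GB load needs roughly 1000 seconds of pure wire time on gigabit; a reason bulk-load paths often justify a dedicated or faster link.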
  73. Checkups for the Cloud
  74. Just like real HW, except ...
      ● Low ceiling on #cpus, RAM
      ● Virtual Core < Real Core
         ● "CPU stealing"
         ● last-generation hardware
         ● calculate 50% more cores
  75. Cloud I/O Hell
      ● I/O tends to be very slow, erratic
         ● comparable to a USB thumb drive
         ● horrible latency, up to ½ second
         ● erratic: speeds go up and down
         ● RAID together several volumes on EBS
         ● use asynchronous commit
            – or at least commit_siblings
  76. #1 Cloud Rule: if your database doesn't fit in RAM, don't host it on a public cloud.
  77. Checking Operating System and Filesystem
  78. OS Basics
      ● Use recent versions
         ● large performance and scaling improvements in Linux & Solaris in the last 2 years
      ● Check OS tuning advice for databases
         ● advice for Oracle is usually good for PostgreSQL
      ● Keep up with information about issues & patches
         ● frequently specific releases have major issues
         ● especially check HW drivers
  79. OS Basics
      ● Use Linux, BSD or Solaris!
         ● Windows has poor performance and weak diagnostic tools
         ● OSX is optimized for desktop and has poor hardware support
         ● AIX and HPUX require expertise just to install, and lack tools
  80. Filesystem Layout
      ● One array / one big pool
      ● Two arrays / partitions
         ● OS and transaction log
         ● database
      ● Three arrays
         ● OS & stats file
         ● transaction log
         ● database
  81. Linux Tuning
      ● XFS > Ext3 (but not by that much)
         ● Ext3 tuning: data=writeback,noatime,nodiratime
         ● XFS tuning: noatime,nodiratime
            – for transaction log: nobarrier
      ● "deadline" I/O scheduler
      ● Increase SHMMAX and SHMALL
         ● to ½ of RAM
      ● Cluster filesystems also a possibility
         ● OCFS, RHCFS
  82. Solaris Tuning
      ● Use ZFS
         ● no advantage to UFS anymore
         ● mixed filesystems cause caching issues
         ● set recordsize
            – 8K small databases
            – 128K large databases
            – check for throughput/latency issues
  83. Solaris Tuning
      ● Set OS parameters via "projects"
      ● For all databases:
         ● project.max-shm-memory=(priv,12GB,deny)
      ● For high-connection databases:
         ● use libumem
         ● project.max-shm-ids=(priv,32768,deny)
         ● project.max-sem-ids=(priv,4096,deny)
         ● project.max-msg-ids=(priv,4096,deny)
  84. FreeBSD Tuning
      ● ZFS: same as Solaris
         ● definite win for very large databases
         ● not so much for small databases
      ● Other tuning per docs
  85. PostgreSQL Checkup
  86. postgresql.conf formulae:
      shared_buffers = available RAM / 4
  87. postgresql.conf formulae:
      max_connections =
         web: 100 to 200
         OLTP: 50 to 100
         DW/BI: 5 to 20
      if you need more, use pooling!
  88. postgresql.conf formulae:
      Web/OLTP: work_mem = AvRAM * 2 / max_connections
      DW/BI:    work_mem = AvRAM / max_connections
  89. postgresql.conf formulae:
      Web/OLTP: maintenance_work_mem = AvRAM / 16
      DW/BI:    maintenance_work_mem = AvRAM / 8
  90. postgresql.conf formulae:
      autovacuum = on
      DW/BI & bulk loads: autovacuum = off
      autovacuum_max_workers = 1/2
  91. postgresql.conf formulae:
      checkpoint_segments =
         web: 8 to 16
         OLTP: 32 to 64
         BI/DW: 128 to 256
  92. postgresql.conf formulae:
      wal_buffers = 8MB
      effective_cache_size = AvRAM * 0.75
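The formulae on slides 86–92 are easy to apply as a small calculator. A sketch implementing those rules of thumb (results in MB); they are starting points from the talk, not hard limits:

```python
def suggested_settings(av_ram_mb, max_connections, workload):
    """Starting-point postgresql.conf values per the slides' formulae."""
    s = {"shared_buffers": av_ram_mb / 4,
         "effective_cache_size": av_ram_mb * 0.75,
         "wal_buffers": 8}
    if workload in ("web", "oltp"):
        s["work_mem"] = av_ram_mb * 2 / max_connections
        s["maintenance_work_mem"] = av_ram_mb / 16
    else:                                    # DW/BI
        s["work_mem"] = av_ram_mb / max_connections
        s["maintenance_work_mem"] = av_ram_mb / 8
    return s

cfg = suggested_settings(av_ram_mb=16384, max_connections=100, workload="web")
```

For a 16 GB web box with 100 connections this suggests shared_buffers of 4 GB and maintenance_work_mem of 1 GB; sanity-check the result against actual workload before committing it.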
  93. How much recoverability do you need?
      ● None:
         ● fsync = off
         ● full_page_writes = off
         ● consider using a ramdrive
      ● Some loss OK:
         ● synchronous_commit = off
         ● wal_buffers = 16MB to 32MB
      ● Data integrity critical:
         ● keep everything on
  94. File Locations
      ● Database
      ● Transaction log
      ● Activity log
      ● Stats file
      ● Tablespaces?
  95. Database Checks: Indexes
      select relname, seq_scan, seq_tup_read,
             pg_size_pretty(pg_relation_size(relid)) as size,
             coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0)
               + coalesce(n_tup_del,0) as update_activity
      from pg_stat_user_tables
      where seq_scan > 1000 and pg_relation_size(relid) > 1000000
      order by seq_scan desc limit 10;

         relname   | seq_scan | seq_tup_read |  size   | update_activity
      -------------+----------+--------------+---------+-----------------
       permissions |    12264 |        53703 | 2696 kB |             365
       users       |    11697 |       351635 | 17 MB   |             741
       test_set    |     9150 |  18492353300 | 275 MB  |           27643
       test_pool   |     5143 |   3141630847 | 212 MB  |           77755
  96. Database Checks: Indexes
      SELECT indexrelid::regclass as index, relid::regclass as table
      FROM pg_stat_user_indexes JOIN pg_index USING (indexrelid)
      WHERE idx_scan < 100 AND indisunique IS FALSE;

               index          |    table
      ------------------------+--------------
       acct_acctdom_idx       | accounts
       hitlist_acct_idx       | hitlist
       hitlist_number_idx     | hitlist
       custom_field_acct_idx  | custom_field
       user_log_accstrt_idx   | user_log
       user_log_idn_idx       | user_log
       user_log_feed_idx      | user_log
       user_log_inbdstart_idx | user_log
       user_log_lead_idx      | user_log
  97. Database Checks: Large Tables
          relname       | total_size | table_size
      ------------------+------------+------------
       operations_2008  | 9776 MB    | 3396 MB
       operations_2009  | 9399 MB    | 3855 MB
       request_by_second| 7387 MB    | 5254 MB
       request_archive  | 6975 MB    | 3349 MB
       events           | 92 MB      | 66 MB
       event_edits      | 82 MB      | 68 MB
       2009_ops_eoy     | 33 MB      | 19 MB
  98. Database Checks: Heavily-Used Tables
      select relname, pg_size_pretty(pg_relation_size(relid)) as size,
             coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0)
               + coalesce(n_tup_del,0) as update_activity
      from pg_stat_user_tables
      order by update_activity desc limit 10;

            relname        |  size   | update_activity
      ---------------------+---------+-----------------
       session_log         | 344 GB  |         4811814
       feature             | 279 MB  |         1012565
       daily_feature       | 28 GB   |          984406
       cache_queue_2010_05 | 2578 MB |          981812
       user_log            | 30 GB   |          796043
       vendor_feed         | 29 GB   |          479392
       vendor_info         | 23 GB   |          348355
       error_log           | 239 MB  |          214376
       test_log            | 945 MB  |          185785
       settings            | 215 MB  |          117480
  99. Database Unit Tests
      ● You need them!
         ● you will be changing database objects and rewriting queries
         ● find bugs in testing … or in production
      ● Various tools
         ● pgTAP
         ● framework-level tests
            – Rails, Django, Catalyst, JBoss, etc.
  100. Application Stack Checkup
  101. The Layer Cake
       Application:      Queries, Transactions
       Middleware:       Drivers, Connections, Caching
       PostgreSQL:       Schema, Config
       Operating System: Filesystem, Kernel
       Hardware:         Storage, RAM/CPU, Network
  103. The Funnel: Application → Middleware → PostgreSQL → OS → HW
  104. Check PostgreSQL Drivers
       ● Does the driver version match the PostgreSQL version?
       ● Have you applied all updates?
       ● Are you using the best driver?
          ● there are several Python, C++ drivers
          ● don't use ODBC if you can avoid it
       ● Does the driver support cached plans & binary data?
          ● if so, are they being used?
  105. Check Caching
  106. Check Caching
       ● Does the application use data caching?
          ● what kind?
          ● could it be used more?
          ● what is the cache invalidation strategy?
          ● is there protection from "cache refresh storms"?
       ● Does the application use HTTP caching?
          ● could they be using it more?
  107. Check Connection Pooling
       ● Is the application using connection pooling?
          ● all web applications should, and most OLTP
          ● external or built into the application server?
       ● Is it configured correctly?
          ● max efficiency: transaction / statement mode
          ● make sure timeouts match
  108. Check Query Design
       ● PostgreSQL does better with fewer, bigger statements
       ● Check for common query mistakes
          ● joins in the application layer
          ● pulling too much data and discarding it
          ● huge OFFSETs
          ● unanchored text searches
  109. Check Transaction Management
       ● Are transactions being used for loops?
          ● batches of inserts or updates can be 75% faster if wrapped in a transaction
       ● Are transactions aborted properly?
          ● on error
          ● on timeout
          ● transactions being held open while non-database activity runs
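The batching point above is easy to demonstrate. This sketch uses the stdlib sqlite3 driver so it runs anywhere; against PostgreSQL the same pattern applies through your driver (e.g. psycopg2). The table and rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, "event-%d" % i) for i in range(1000)]

# one transaction around the whole batch instead of a commit per INSERT
with conn:                # the context manager commits once on success
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

count = conn.execute("SELECT count(*) FROM events").fetchone()[0]
```

The win comes from paying the commit (and, on PostgreSQL, the WAL flush) once per batch rather than once per row; the error-handling half of the slide matters too, since an exception inside the block rolls the whole batch back.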
  110. Common Ailments of the Database Server
  111. Check for them, monitor for them
       ● ailments could throw off your response time targets
          ● the database could even "hit the wall"
       ● check for them during the health check
          ● and during each checkup
       ● add daily/continuous monitors for them
          ● Nagios check_postgres.pl has checks for many of these things
  112. Database Growth
       ● Checkup:
          ● check both total database size and largest table(s) size daily or weekly
       ● Symptoms:
          ● database grows faster than expected
          ● some tables grow continuously and rapidly
  113. Database Growth
       ● Caused by:
          ● faster than expected increase in usage
          ● "append forever" tables
          ● database bloat
       ● Leads to:
          ● slower seq scans and index scans
          ● swapping & temp files
          ● slower backups
  114. Database Growth
       ● Treatment:
          ● check for bloat
          ● find the largest tables and make them smaller
             – expire data
             – partitioning
          ● horizontal scaling (if possible)
          ● get better storage & more RAM, sooner
  115. Database Bloat
       schemaname   | public
       tablename    | user_log
       tbloat       | 3.4
       wastedpages  | 2356903
       wastedbytes  | 19307749376
       wastedsize   | 18 GB
       iname        | user_log_accttime_idx
       ituples      | 941451584
       ipages       | 9743581
       iotta        | 40130146
       ibloat       | 0.2
       wastedipages | 0
       wastedibytes | 0
       wastedisize  | 0 bytes
  116. Database Bloat
       ● Caused by:
          ● autovacuum not keeping up
             – or not enough manual vacuum
             – often on specific tables only
          ● FSM set wrong (before 8.4)
          ● Idle In Transaction
       ● Leads to:
          ● slow response times
          ● unpredictable response times
          ● heavy I/O
  117. Database Bloat
       ● Treatment:
          ● make autovacuum more aggressive
             – on specific tables with bloat
          ● fix max_fsm_relations / max_fsm_pages
          ● check when tables are getting vacuumed
          ● check for Idle In Transaction
  118. Memory Usage Growth
       00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
       01:00:00        0        0      100        0        0      100        0        0
       02:00:00        0        0      100        0        0      100        0        0
       03:00:00        0        0      100        0        0      100        0        0
       04:00:00        0        0      100        0        0      100        0        0

       00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
       01:00:00     3788      115       98        0        0      100        0        0
       02:00:00    21566      420       78        0        0      100        0        0
       03:00:00   455721     1791       59        0        0      100        0        0
       04:00:00      908        6       96        0        0      100        0        0
  119. Memory Usage Growth
       ● Caused by:
          ● database growth or bloat
          ● work_mem limit too high
          ● bad queries
       ● Leads to:
          ● database out of cache
             – slow response times
          ● OOM errors (OOM killer)
  120. Memory Usage Growth
       ● Treatment:
          ● look at ways to shrink queries, DB
             – partitioning
             – data expiration
          ● lower the work_mem limit
          ● refactor bad queries
          ● or just buy more RAM
  121. Idle Connections
       select datname, usename, count(*)
       from pg_stat_activity
       where current_query = '<IDLE>'
       group by datname, usename;

        datname | usename | count
       ---------+---------+-------
        track   | www     |   318
  122. Idle Connections
       ● Caused by:
          ● poor session management in the application
          ● wrong connection pool settings
       ● Leads to:
          ● memory usage for connections
          ● slower response times
          ● out-of-connections at peak load
  123. Idle Connections
       ● Treatment:
          ● refactor the application
          ● reconfigure the connection pool
             – or add one
  124. Idle In Transaction
       select datname, usename, max(now() - xact_start) as max_time, count(*)
       from pg_stat_activity
       where current_query ~* '<IDLE> in transaction'
       group by datname, usename;

        datname | usename |   max_time    | count
       ---------+---------+---------------+-------
        track   | admin   | 00:00:00.0217 |     1
        track   | www     | 01:03:06.0709 |     7
  125. Idle In Transaction
       ● Caused by:
          ● poor transaction control by the application
          ● abandoned sessions not being terminated fast enough
       ● Leads to:
          ● locking problems
          ● database bloat
          ● out of connections
  126. Idle In Transaction
       ● Treatment:
          ● refactor the application
          ● change driver/ORM settings for transactions
          ● change session timeouts & keepalives on pool, driver, database
  127. Longer-Running Queries
       ● Detection:
          ● log slow queries to the PostgreSQL log
          ● do a daily or weekly report (pgFouine)
       ● Symptoms:
          ● number of long-running queries in the log increasing
          ● slowest queries getting slower
  128. Longer-Running Queries
       ● Caused by:
          ● database growth
          ● poorly-written queries
          ● wrong indexes
          ● out-of-date stats
       ● Leads to:
          ● out-of-CPU
          ● out-of-connections
  129. Longer-Running Queries
       ● Treatments:
          ● refactor queries
          ● update indexes
          ● make autoanalyze more aggressive
          ● control database growth
  130. Too Many Queries
  131. Too Many Queries
       ● Caused by:
          ● joins in middleware
          ● not caching
          ● poll cycles without delays
          ● other application code issues
       ● Leads to:
          ● out-of-CPU
          ● out-of-connections
  132. Too Many Queries
       ● Treatment:
          ● characterize queries using logging
          ● refactor the application
  133. Locking
       ● Detection:
          ● log_lock_waits
          ● scan the activity log for deadlock warnings
          ● query pg_stat_activity and pg_locks
       ● Symptoms:
          ● deadlock error messages
          ● number and duration of lock waits getting larger
  134. Locking
       ● Caused by:
          ● long-running operations with exclusive locks
          ● inconsistent foreign key updates
          ● poorly planned runtime DDL
       ● Leads to:
          ● poor response times
          ● timeouts
          ● deadlock errors
  135. Locking
       ● Treatment:
          ● analyze locks
          ● refactor operations taking locks
             – establish a canonical order of updates for long transactions
             – use pessimistic locks with NOWAIT
          ● rely on cascade for FK updates
             – not on middleware code
  136. Temp File Usage
       ● Detection:
          ● log_temp_files = 100kB
          ● scan the logs for temp files weekly or daily
       ● Symptoms:
          ● temp file usage getting more frequent
          ● queries using temp files getting longer
  137. Temp File Usage
       ● Caused by:
          ● sorts, hashes & aggregates too big for work_mem
       ● Leads to:
          ● slow response times
          ● timeouts
  138. Temp File Usage
       ● Treatment:
          ● find the swapping queries via the logs
          ● set work_mem higher for that ROLE, or
          ● refactor them to need less memory, or
          ● buy more RAM
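The weekly log scan the slides recommend is a one-liner's worth of work. With log_temp_files on, PostgreSQL logs a line per temp file of the form `LOG:  temporary file: path "...", size <bytes>`; summing those sizes shows whether usage is growing. A sketch; the sample lines are invented:

```python
import re

# matches PostgreSQL's log_temp_files output line
TEMP_RE = re.compile(r'temporary file: path "[^"]+", size (\d+)')

def temp_file_bytes(log_lines):
    """Total bytes of temp files reported in the given activity-log lines."""
    return sum(int(m.group(1))
               for m in (TEMP_RE.search(line) for line in log_lines)
               if m)

total = temp_file_bytes([
    'LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp9.0", size 2097152',
    'LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp9.1", size 1048576',
])
```

Running this daily and charting the total makes the "getting more frequent" symptom visible long before timeouts start.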
  139. All healthy now? See you in six months!
  140. Q&A
       ● Josh Berkus
          ● josh@pgexperts.com
          ● it.toolbox.com/blogs/database-soup
       ● PostgreSQL Experts
          ● www.pgexperts.com
          ● pgCon sponsor
       ● Also see:
          ● Load Testing (tomorrow)
          ● Testing BOF (Friday)
       Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons attribution license, except for 3rd-party images, which are property of their respective owners.
