Transcript

  • 1. Database Server Health Check. Josh Berkus, PostgreSQL Experts Inc., pgCon 2010
  • 2. DATABASE SERVER HELP 5¢
  • 3. Program of Treatment
    ● What is a Healthy Database?
    ● Know Your Application
    ● Load Testing
    ● Doing a database server checkup
      ● hardware
      ● OS & FS
      ● PostgreSQL
      ● application
    ● Common Ailments of the Database Server
  • 4. What is a Healthy Database Server?
  • 5. What is a Healthy Database Server? ● Response Times
  • 6. What is a Healthy Database Server?
    ● Response Times
      ● lower than required
      ● consistent & predictable
    ● Capacity for more
      ● CPU and I/O headroom
      ● low server load
  • 7. [Chart: Median Response Time and Max Response Time vs. Number of Clients (25 to 250), with the Expected Load point marked]
  • 8. What is an Unhealthy Database Server?
    ● Slow response times
    ● Inconsistent response times
    ● High server load
    ● No capacity for growth
  • 9. [Chart: Median Response Time and Max Response Time vs. Number of Clients, with the Expected Load point marked]
  • 10. A healthy database server is able to maintain consistent and acceptable response times under expected loads, with margin for error.
  • 11. [Chart: Median Response Time vs. Number of Clients (25 to 250)]
  • 12. Hitting The Wall
  • 13. CPUs Floored
    Average: CPU  %user  %system  %iowait  %idle
    Average: all  69.36   0.13    24.87     5.77
             0    88.96   0.09    10.03     1.11
             1    12.09   0.02    86.98     0.00
             2    98.90   0.00     0.00    10.10
             3    77.52   0.44     1.70    20.34
    16:38:29 up 13 days, 22:10, 3 users, load average: 11.05, 9.08, 8.13
  • 14. CPUs Floored [repeat of slide 13]
  • 15. IO Saturated
    Device:   tps      MB_read/s  MB_wrtn/s
    sde        414.33    0.40       38.15
    sdf       1452.00   99.14       29.00
    Average: CPU  %user  %system  %iowait  %idle
    Average: all  34.75   0.13    58.75     6.37
             0     8.96   0.09    90.03     1.11
             1    12.09   0.02    86.98     0.00
             2    91.90   0.00     7.00    10.10
             3    27.52   0.44    51.70    20.34
  • 16. Out of Connections
    FATAL: connection limit exceeded for non-superusers
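A quick way to see how close you are to that wall is to compare the current backend count against the configured limit; a minimal sketch:

    select count(*) as current_connections,
           current_setting('max_connections')::int as max_connections
    from pg_stat_activity;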
  • 17. How close are you to the wall?
  • 18. The Checkup (full physical)
    1. Analyze application
    2. Analyze platform
    3. Correct anything obviously wrong
    4. Set up load test
    5. Monitor load test
    6. Analyze results
    7. Correct issues
  • 19. The Checkup (semi-annual)
    1. Check response times
    2. Check system load
    3. Check previous issues
    4. Check for signs of illness
    5. Fix new issues
  • 20. Know your application!
  • 21. Application database usage
    Which of these does your application do?
    ✔ small reads
    ✔ large sequential reads
    ✔ small writes
    ✔ large writes
    ✔ long-running procedures/transactions
    ✔ bulk loads and/or ETL
  • 22. What Color Is My Application?
    W ● Web Application (Web)
    O ● Online Transaction Processing (OLTP)
    D ● Data Warehousing (DW)
  • 23. What Color Is My Application?
    W ● Web Application (Web)
        ● DB much smaller than RAM
        ● 90% or more simple queries
    O ● Online Transaction Processing (OLTP)
    D ● Data Warehousing (DW)
  • 24. What Color Is My Application?
    W ● Web Application (Web)
        ● DB smaller than RAM
        ● 90% or more simple queries
    O ● Online Transaction Processing (OLTP)
        ● DB slightly larger than RAM to 1TB
        ● 20-40% small data write queries
        ● some long transactions and complex read queries
    D ● Data Warehousing (DW)
  • 25. What Color Is My Application?
    W ● Web Application (Web)
        ● DB smaller than RAM
        ● 90% or more simple queries
    O ● Online Transaction Processing (OLTP)
        ● DB slightly larger than RAM to 1TB
        ● 20-40% small data write queries
        ● some long transactions and complex read queries
    D ● Data Warehousing (DW)
        ● large to huge databases (100GB to 100TB)
        ● large complex reporting queries
        ● large bulk loads of data
        ● also called "Decision Support" or "Business Intelligence"
  • 26. What Color Is My Application?
    W ● Web Application (Web)
        ● CPU-bound
        ● Ailments: idle connections/transactions, too many queries
    O ● Online Transaction Processing (OLTP)
        ● CPU or I/O bound
        ● Ailments: locks, database growth, idle transactions, database bloat
    D ● Data Warehousing (DW)
        ● I/O or RAM bound
        ● Ailments: database growth, longer running queries, memory usage growth
  • 27. Special features required?
    ● GIS
      ● heavy CPU for GIS functions
      ● lots of RAM for GIS indexes
    ● TSearch
      ● lots of RAM for indexes
      ● slow response time on writes
    ● SSL
      ● response time lag on connections
  • 28. Load Testing
  • 29. [Chart: Requests Per Second over a 24-hour period]
  • 30. [Chart: Requests Per Second over a 24-hour period, with the period above capacity labeled DOWNTIME]
  • 31. When preventing downtime, it is not average load that matters, it is peak load.
  • 32. What to load test
    ● Load should be as similar as possible to your production traffic
    ● You should be able to create your target level of traffic
      ● better: incremental increases
    ● Test the whole application as well
      ● the database server may not be your weak point
  • 33. How to Load Test
    1. Set up a load testing tool (you'll need test servers for this)
    2. Turn on PostgreSQL, HW, and application monitoring; all monitoring should start at the same time
    3. Run the test for a defined time (1 hour is usually good)
    4. Collect and analyze data
    5. Re-run at a higher level of traffic
  • 34. Test Servers
    ● Must be as close as reasonable to production servers
      ● otherwise you don't know how production will be different
      ● there is no predictable multiplier
    ● Double them up as your development/staging or failover servers
    ● If your test server is much smaller, then you need to do a same-load comparison
  • 35. Tools for Load Testing
  • 36. Production Test
    1. Determine the peak load hour on the production servers
    2. Turn on lots of monitoring during that peak load hour
    3. Analyze results
    Pretty much your only choice without a test server.
  • 37. Issues with Production Test
    ● Not repeatable
      – load won't be exactly the same ever again
    ● Cannot test target load
      – just whatever happens to occur during that hour
      – can't test incremental increases either
    ● Monitoring may hurt production performance
    ● Cannot test experimental changes
  • 38. The Ad-Hoc Test
    ● Get 10 to 50 coworkers to open several sessions each
    ● Have them go crazy using the application
  • 39. Problems with Ad-Hoc Testing
    ● Not repeatable
      ● minor changes in response times may be due to changes in worker activity
    ● Labor intensive
      ● each test run shuts down the office
    ● Can't reach target levels of load
      ● unless you have a lot of coworkers
  • 40. Siege
    ● HTTP traffic generator
      ● all test interfaces must be addressable as URLs
      ● useless for non-web applications
    ● Simple to use
      ● create a simple load test in a few hours
    ● Tests the whole web application
      ● cannot test the database separately
    ● http://www.joedog.org/index/siege-home
  • 41. pgReplay
    ● Replays your activity logs at variable speed
      ● get exactly the traffic you get in production
    ● Good for testing just the database server
    ● Can take time to set up
      ● need a database snapshot and collected activity logs
      ● must already have production traffic
    ● http://pgreplay.projects.postgresql.org/
  • 42. tsung
    ● Generic load generator in Erlang
      ● a load testing kit rather than a tool
      ● generate a tsung file from your activity logs using pgFouine and test the database
      ● generate load for a web application using custom scripts
    ● Can be time-consuming to set up
      ● but highly configurable and advanced
      ● very scalable: cluster of load testing clients
      ● http://tsung.erlang-projects.org/
  • 43. pgBench
    ● Simple micro-benchmark
      ● not like any real application
    ● Version 9.0 adds multi-threading and customization
      ● write custom pgBench scripts
      ● run against a real database
    ● Fairly ad-hoc compared to other tools
      ● but easy to set up
      ● ships with PostgreSQL
  • 44. Benchmarks
    ● Many “real” benchmarks available
      ● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.
    ● Useful for testing your hardware
      ● not useful for testing your application
    ● Often time-consuming and complex
  • 45. Platform-specific
    ● Web framework or platform tests
      ● Rails: ActionController::PerformanceTest
      ● J2EE: OpenDemand, Grinder, many more
        – JBoss, BEA have their own tools
      ● Zend Framework Performance Test
    ● Useful for testing specific application performance
      ● such as performance of specific features or modules
    ● Not all platforms have them
  • 46. Flight-Check
    ● Attend the tutorial tomorrow!
  • 47. monitoring PostgreSQL during load test
    logging_collector = on
    log_destination = csvlog
    log_filename = load_test_1_%h
    log_rotation_age = 60min
    log_rotation_size = 1GB
    log_min_duration_statement = 0
    log_connections = on
    log_disconnections = on
    log_temp_files = 100kB
    log_lock_waits = on
  • 48. monitoring hardware during load tests
    sar -A -o load_test_1.sar 30 240
    iostat (on ZFS: fsstat, zpool iostat)
  • 49. monitoring application during load test
    ● Collect response times
      ● with timestamp
      ● with activity
    ● Monitor hardware and utilization
      ● activity
      ● memory & CPU usage
    ● Record errors & timeouts
  • 50. Checking Hardware
  • 51. Checking Hardware
    ● CPUs and Cores
    ● RAM
    ● I/O & disk support
    ● Network
  • 52. CPUs and Cores
    ● Pretty simple:
      ● number
      ● type
      ● speed
      ● L1/L2 cache
    ● Rules of thumb:
      ● fewer faster CPUs is usually better than more slower ones
      ● core != cpu
      ● thread != core
      ● virtual core != core
  • 53. CPU calculations
    ● ½ to 1 core for OS
    ● ½ to 1 core for software RAID or ZFS
    ● 1 core for postmaster and bgwriter
    ● 1 core per:
      ● DW: 1 to 3 concurrent users
      ● OLTP: 10 to 50 concurrent users
      ● Web: 100 to 1000 concurrent users
  • 54. CPU tools
    ● sar
    ● mpstat
    ● pgTop
  • 55. in praise of sar
    ● collects data about all aspects of HW usage
    ● available on most OSes
      ● but output is slightly different
    ● easiest tool for collecting basic information
      ● often enough for server-checking purposes
    ● BUT: does not report all data on all platforms
  • 56. sar
    CPUs:    sar -P ALL and sar -u
    Memory:  sar -r and sar -R
    I/O:     sar -b and sar -d
    Network: sar -n
  • 57. sar CPU output
    Linux:
    06:05:01 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
    06:15:01 AM  all  14.26   0.09    6.01     1.32     0.00   78.32
    06:15:01 AM    0  14.26   0.09    6.01     1.32     0.00   78.32
    Solaris:
    15:08:56  %usr  %sys  %wio  %idle
    15:09:26    10     5     0    85
    15:09:56     9     7     0    84
    15:10:26    15     6     0    80
    15:10:56    14     7     0    79
    15:11:26    15     5     0    80
    15:11:56    14     5     0    81
  • 58. Memory
    ● Only one statistic: how much?
    ● Not generally an issue on its own
      ● low memory can cause more I/O
      ● low memory can cause more CPU time
  • 59. [Diagram: memory sizing. Columns: Shared Buffers (In Buffer), Filesystem Cache (In Cache), work_mem / maint_mem, On Disk]
  • 60. Figure out Memory Sizing (see the size-check sketch below)
    ● What is the active portion of your database?
      ● i.e. gets queried frequently
    ● How large is it?
    ● Where does it fit into the size categories?
    ● How large is the inactive portion of your database?
      ● how frequently does it get hit? (remember backups)
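The "how large" questions can be answered from SQL with the built-in size functions; a minimal sketch:

    select pg_size_pretty(pg_database_size(current_database())) as db_size;

    select relname,
           pg_size_pretty(pg_total_relation_size(relid)) as total_size
    from pg_stat_user_tables
    order by pg_total_relation_size(relid) desc
    limit 10;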
  • 61. Memory Sizing
    ● Other needs for RAM – work_mem:
      ● sorts and aggregates: do you do a lot of big ones?
      ● GIN/GiST indexes: these can be huge
      ● hashes: for joins and aggregates
      ● VACUUM
  • 62. I/O Considerations
    ● Throughput
      ● how fast can you get data off disk?
    ● Latency
      ● how long does it take to respond to requests?
    ● Seek Time
      ● how long does it take to find random disk pages?
  • 63. I/O Considerations
    ● Throughput
      ● important for large databases
      ● important for bulk loads
    ● Latency
      ● huge effect on small writes & reads
      ● not so much on large scans
    ● Seek Time
      ● important for small writes & reads
      ● very important for index lookups
  • 64. I/O Considerations
    ● Web
      ● concerned about read latency & seek time
    ● OLTP
      ● concerned about write latency & seek time
    ● DW/BI
      ● concerned about throughput & seek time
  • 65. [bonnie++ output for two test volumes (32096M and 24G): Sequential Output (per-char, block, rewrite), Sequential Input (per-char, block), and Random Seeks, with K/sec, %CP, and latency figures]
  • 66. Common I/O Types
    ● Software RAID & ZFS
    ● Hardware RAID Array
    ● NAS/SAN
    ● SSD
  • 67. Hardware RAID Sanity Check
    ● RAID 1 / 10, not 5
    ● Battery-backed write cache?
      ● otherwise, turn write cache off
    ● SATA < SCSI/SAS
      ● about ½ real throughput
    ● Enough drives?
      ● 4-14 for OLTP application
      ● 8-48 for DW/BI
  • 68. SW RAID / ZFS Sanity Check
    ● Enough CPUs?
      ● will need one for the RAID
    ● Enough disks?
      ● same as hardware RAID
    ● Extra configuration?
      ● caching
      ● block size
  • 69. NAS/SAN Sanity Check
    ● Check latency!
    ● Check real throughput
      ● drivers often a problem
    ● Enough network bandwidth?
      ● multipath or fiber required to get HW RAID performance
  • 70. SSD Sanity Check
    ● 1 SSD = 4 drives
      ● relative performance
    ● Check write cache configuration
      ● make sure data is safe
    ● Test real throughput and seek times
      ● drivers often a problem
    ● Research durability stats
  • 71. IO Tools
    ● I/O Tests
      ● dd test
      ● Bonnie++
      ● IOZone
      ● filebench
    ● Monitoring Tools
      ● sar
      ● mpstat iowait
      ● iostat
      ● on ZFS: fsstat, zpool iostat
      ● EXPLAIN ANALYZE (see the sketch below)
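EXPLAIN ANALYZE doubles as a crude I/O probe: comparing runtime on a cold cache versus a warm cache shows how disk-bound a query is. A sketch against the user_log table from the later slides (the column names here are illustrative assumptions):

    explain analyze
    select *
    from user_log
    where acct_id = 42
      and start_time > now() - interval '1 day';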
  • 72. Network
    ● Throughput
      ● not usually an issue, except:
        – iSCSI / NAS / SAN
        – ELT & bulk load processes
      ● remember that gigabit is only 100MB/s!
    ● Latency
      ● real issue for Web / OLTP
      ● consider putting app ↔ database on a private network
  • 73. Checkups for the Cloud
  • 74. Just like real HW, except ...
    ● Low ceiling on #CPUs, RAM
    ● Virtual Core < Real Core
      ● “CPU Stealing”
      ● last-generation hardware
      ● calculate 50% more cores
  • 75. Cloud I/O Hell
    ● I/O tends to be very slow and erratic
      ● comparable to a USB thumb drive
      ● horrible latency, up to ½ second
      ● erratic: speeds go up and down
    ● RAID together several volumes on EBS
    ● use asynchronous commit
      – or at least commit_siblings
  • 76. #1 Cloud Rule: If your database doesn't fit in RAM, don't host it on a public cloud.
  • 77. Checking Operating System and Filesystem
  • 78. OS Basics
    ● Use recent versions
      ● large performance and scaling improvements in Linux & Solaris in the last 2 years
    ● Check OS tuning advice for databases
      ● advice for Oracle is usually good for PostgreSQL
    ● Keep up with information about issues & patches
      ● frequently specific releases have major issues
      ● especially check HW drivers
  • 79. OS Basics
    ● Use Linux, BSD or Solaris!
      ● Windows has poor performance and weak diagnostic tools
      ● OSX is optimized for desktop and has poor hardware support
      ● AIX and HPUX require expertise just to install, and lack tools
  • 80. Filesystem Layout
    ● One array / one big pool
    ● Two arrays / partitions
      ● OS and transaction log
      ● Database
    ● Three arrays
      ● OS & stats file
      ● Transaction log
      ● Database
  • 81. Linux Tuning
    ● XFS > Ext3 (but not by that much)
      ● Ext3 tuning: data=writeback,noatime,nodiratime
      ● XFS tuning: noatime,nodiratime
        – for transaction log: nobarrier
    ● “deadline” I/O scheduler
    ● Increase SHMMAX and SHMALL
      ● to ½ of RAM
    ● Cluster filesystems also a possibility
      ● OCFS, RHCFS
  • 82. Solaris Tuning
    ● Use ZFS
      ● no advantage to UFS anymore
      ● mixed filesystems cause caching issues
      ● set recordsize
        – 8K small databases
        – 128K large databases
        – check for throughput/latency issues
  • 83. Solaris Tuning
    ● Set OS parameters via “projects”
    ● For all databases:
      ● project.max-shm-memory=(priv,12GB,deny)
    ● For high-connection databases:
      ● use libumem
      ● project.max-shm-ids=(priv,32768,deny)
      ● project.max-sem-ids=(priv,4096,deny)
      ● project.max-msg-ids=(priv,4096,deny)
  • 84. FreeBSD Tuning
    ● ZFS: same as Solaris
      ● definite win for very large databases
      ● not so much for small databases
    ● Other tuning per docs
  • 85. PostgreSQL Checkup
  • 86. postgresql.conf: formulae
    shared_buffers = available RAM / 4
  • 87. postgresql.conf: formulae
    max_connections =
      web:   100 to 200
      OLTP:  50 to 100
      DW/BI: 5 to 20
    if you need more, use pooling!
  • 88. postgresql.conf: formulae
    Web/OLTP: work_mem = AvRAM * 2 / max_connections
    DW/BI:    work_mem = AvRAM / max_connections
  • 89. postgresql.conf: formulae
    Web/OLTP: maintenance_work_mem = AvRAM / 16
    DW/BI:    maintenance_work_mem = AvRAM / 8
  • 90. postgresql.conf: formulae
    autovacuum = on
    DW/BI & bulk loads: autovacuum = off
    autovacuum_max_workers = 1/2
  • 91. postgresql.conf: formulae
    checkpoint_segments =
      web:   8 to 16
      OLTP:  32 to 64
      DW/BI: 128 to 256
  • 92. postgresql.conf: formulae
    wal_buffers = 8MB
    effective_cache_size = AvRAM * 0.75
  • 93. How much recoverability do you need?
    ● None:
      ● fsync = off
      ● full_page_writes = off
      ● consider using a ramdrive
    ● Some loss OK:
      ● synchronous_commit = off
      ● wal_buffers = 16MB to 32MB
    ● Data integrity critical:
      ● keep everything on
  • 94. File Locations
    ● Database
    ● Transaction Log
    ● Activity Log
    ● Stats File
    ● Tablespaces?
  • 95. Database Checks: Indexes
    select relname, seq_scan, seq_tup_read,
           pg_size_pretty(pg_relation_size(relid)) as size,
           coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0)
             + coalesce(n_tup_del,0) as update_activity
    from pg_stat_user_tables
    where seq_scan > 1000
      and pg_relation_size(relid) > 1000000
    order by seq_scan desc limit 10;

        relname     | seq_scan | seq_tup_read |  size   | update_activity
    ----------------+----------+--------------+---------+-----------------
     permissions    |    12264 |        53703 | 2696 kB |             365
     users          |    11697 |       351635 | 17 MB   |             741
     test_set       |     9150 |  18492353300 | 275 MB  |           27643
     test_pool      |     5143 |   3141630847 | 212 MB  |           77755
  • 96. Database Checks: Indexes
    SELECT indexrelid::regclass as index,
           relid::regclass as table
    FROM pg_stat_user_indexes
    JOIN pg_index USING (indexrelid)
    WHERE idx_scan < 100
      AND indisunique IS FALSE;

             index          |    table
    ------------------------+--------------
     acct_acctdom_idx       | accounts
     hitlist_acct_idx       | hitlist
     hitlist_number_idx     | hitlist
     custom_field_acct_idx  | custom_field
     user_log_accstrt_idx   | user_log
     user_log_idn_idx       | user_log
     user_log_feed_idx      | user_log
     user_log_inbdstart_idx | user_log
     user_log_lead_idx      | user_log
  • 97. Database Checks: Large Tables
          relname          | total_size | table_size
    -----------------------+------------+------------
     operations_2008       | 9776 MB    | 3396 MB
     operations_2009       | 9399 MB    | 3855 MB
     request_by_second     | 7387 MB    | 5254 MB
     request_archive       | 6975 MB    | 3349 MB
     events                | 92 MB      | 66 MB
     event_edits            | 82 MB      | 68 MB
     2009_ops_eoy          | 33 MB      | 19 MB
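The slide shows only the output; a query along these lines (a sketch, not necessarily the one used) produces the same columns, with total_size including indexes and TOAST:

    select relname,
           pg_size_pretty(pg_total_relation_size(relid)) as total_size,
           pg_size_pretty(pg_relation_size(relid)) as table_size
    from pg_stat_user_tables
    order by pg_total_relation_size(relid) desc
    limit 10;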
  • 98. Database Checks: Heavily-Used Tables
    select relname,
           pg_size_pretty(pg_relation_size(relid)) as size,
           coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0)
             + coalesce(n_tup_del,0) as update_activity
    from pg_stat_user_tables
    order by update_activity desc limit 10;

          relname         |  size   | update_activity
    ----------------------+---------+-----------------
     session_log          | 344 GB  |         4811814
     feature              | 279 MB  |         1012565
     daily_feature        | 28 GB   |          984406
     cache_queue_2010_05  | 2578 MB |          981812
     user_log             | 30 GB   |          796043
     vendor_feed          | 29 GB   |          479392
     vendor_info          | 23 GB   |          348355
     error_log            | 239 MB  |          214376
     test_log             | 945 MB  |          185785
     settings             | 215 MB  |          117480
  • 99. Database Unit Tests
    ● You need them!
      ● you will be changing database objects and rewriting queries
      ● find bugs in testing ... or in production
    ● Various tools (see the pgTAP sketch below)
      ● pgTAP
      ● Framework-level tests
        – Rails, Django, Catalyst, JBoss, etc.
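For example, a minimal pgTAP test file, assuming the pgTAP functions are installed in the test database (exact signatures vary slightly by pgTAP version; the table and index names are reused from earlier slides for illustration):

    begin;
    select plan(2);
    -- schema objects we expect to exist
    select has_table('public', 'user_log', 'user_log table exists');
    select has_index('public', 'user_log', 'user_log_accttime_idx',
                     'index on user_log exists');
    select * from finish();
    rollback;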
  • 100. Application Stack Checkup
  • 101. The Layer Cake
    Application:      Queries, Transactions
    Middleware:       Drivers, Connections, Caching
    PostgreSQL:       Schema, Config
    Operating System: Filesystem, Kernel
    Hardware:         Storage, RAM/CPU, Network
  • 102. The Layer Cake [repeat of slide 101]
  • 103. The Funnel
    Application → Middleware → PostgreSQL → OS → HW
  • 104. Check PostgreSQL Drivers
    ● Does the driver version match the PostgreSQL version?
    ● Have you applied all updates?
    ● Are you using the best driver?
      ● There are several Python and C++ drivers
      ● Don't use ODBC if you can avoid it.
    ● Does the driver support cached plans & binary data?
      ● If so, are they being used?
  • 105. Check Caching
  • 106. Check Caching
    ● Does the application use data caching?
      ● what kind?
      ● could it be used more?
      ● what is the cache invalidation strategy?
      ● is there protection from “cache refresh storms”?
    ● Does the application use HTTP caching?
      ● could it be used more?
  • 107. Check Connection Pooling
    ● Is the application using connection pooling?
      ● all web applications should, and most OLTP
      ● external or built into the application server?
    ● Is it configured correctly?
      ● max efficiency: transaction / statement mode
      ● make sure timeouts match
  • 108. Check Query Design
    ● PostgreSQL does better with fewer, bigger statements
    ● Check for common query mistakes (see the pagination sketch below)
      ● joins in the application layer
      ● pulling too much data and discarding it
      ● huge OFFSETs
      ● unanchored text searches
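The OFFSET item deserves an example: a large OFFSET forces the server to generate and throw away all the skipped rows. A keyset-pagination sketch, using a hypothetical items table:

    -- slow once the offset gets large:
    --   select * from items order by id limit 50 offset 100000;

    -- keyset pagination: remember the last id shown and seek past it
    select *
    from items
    where id > 100050          -- last id from the previous page
    order by id
    limit 50;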
  • 109. Check Transaction Management
    ● Are transactions being used for loops? (see the sketch below)
      ● batches of inserts or updates can be 75% faster if wrapped in a transaction
    ● Are transactions aborted properly?
      ● on error
      ● on timeout
      ● transactions being held open while non-database activity runs
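A minimal sketch of the batching point: one explicit transaction around the whole loop instead of one implicit commit per statement (audit_log is a hypothetical table):

    begin;
    insert into audit_log (msg) values ('row 1');
    insert into audit_log (msg) values ('row 2');
    -- ... the rest of the batch ...
    commit;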
  • 110. Common Ailments of the Database Server
  • 111. Check for them, monitor for them
    ● ailments could throw off your response time targets
      ● database could even “hit the wall”
    ● check for them during the health check
      ● and during each checkup
    ● add daily/continuous monitors for them
      ● Nagios check_postgres.pl has checks for many of these things
  • 112. Database Growth
    ● Checkup:
      ● check both total database size and largest table(s) size daily or weekly (a snapshot sketch follows)
    ● Symptoms:
      ● database grows faster than expected
      ● some tables grow continuously and rapidly
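One way to run that check is to record a size snapshot on a schedule and compare week over week; a sketch with a hypothetical size_history table:

    -- run once: a table to hold the snapshots
    create table size_history (
        sampled_at timestamptz default now(),
        db_size    bigint
    );

    -- run daily or weekly (e.g. from cron)
    insert into size_history (db_size)
    select pg_database_size(current_database());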
  • 113. Database Growth
    ● Caused by:
      ● faster than expected increase in usage
      ● “append forever” tables
      ● Database Bloat
    ● Leads to:
      ● slower seq scans and index scans
      ● swapping & temp files
      ● slower backups
  • 114. Database Growth
    ● Treatment:
      ● check for Bloat
      ● find largest tables and make them smaller
        – expire data
        – partitioning
      ● horizontal scaling (if possible)
      ● get better storage & more RAM, sooner
  • 115. Database Bloat
    -[ RECORD 1 ]--------------------------
    schemaname   | public
    tablename    | user_log
    tbloat       | 3.4
    wastedpages  | 2356903
    wastedbytes  | 19307749376
    wastedsize   | 18 GB
    iname        | user_log_accttime_idx
    ituples      | 941451584
    ipages       | 9743581
    iotta        | 40130146
    ibloat       | 0.2
    wastedipages | 0
    wastedibytes | 0
    wastedisize  | 0 bytes
  • 116. Database Bloat
    ● Caused by:
      ● Autovacuum not keeping up
        – or not enough manual vacuum
        – often on specific tables only
      ● FSM set wrong (before 8.4)
      ● Idle In Transaction
    ● Leads to:
      ● slow response times
      ● unpredictable response times
      ● heavy I/O
  • 117. Database Bloat
    ● Treatment:
      ● make autovacuum more aggressive
        – on specific tables with bloat (see the sketch below)
      ● fix FSM_relations/FSM_pages
      ● check when tables are getting vacuumed
      ● check for Idle In Transaction
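Since 8.4, autovacuum can be tightened per table with storage parameters; a sketch for the bloated user_log table from the previous slide:

    alter table user_log set (
        autovacuum_vacuum_scale_factor  = 0.02,  -- vacuum after ~2% dead rows
        autovacuum_analyze_scale_factor = 0.01
    );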
  • 118. Memory Usage Growth
    00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
    01:00:00        0        0      100        0        0      100        0        0
    02:00:00        0        0      100        0        0      100        0        0
    03:00:00        0        0      100        0        0      100        0        0
    04:00:00        0        0      100        0        0      100        0        0

    00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
    01:00:00     3788      115       98        0        0      100        0        0
    02:00:00    21566      420       78        0        0      100        0        0
    03:00:00   455721     1791       59        0        0      100        0        0
    04:00:00      908        6       96        0        0      100        0        0
  • 119. Memory Usage Growth
    ● Caused by:
      ● Database Growth or Bloat
      ● work_mem limit too high
      ● bad queries
    ● Leads to:
      ● database out of cache
        – slow response times
      ● OOM errors (OOM Killer)
  • 120. Memory Usage Growth
    ● Treatment:
      ● look at ways to shrink queries, DB
        – partitioning
        – data expiration
      ● lower work_mem limit
      ● refactor bad queries
      ● or just buy more RAM
  • 121. Idle Connections
    select datname, usename, count(*)
    from pg_stat_activity
    where current_query = '<IDLE>'
    group by datname, usename;

     datname | usename | count
    ---------+---------+-------
     track   | www     |   318
  • 122. Idle Connections
    ● Caused by:
      ● poor session management in application
      ● wrong connection pool settings
    ● Leads to:
      ● memory usage for connections
      ● slower response times
      ● out-of-connections at peak load
  • 123. Idle Connections
    ● Treatment:
      ● refactor application
      ● reconfigure connection pool
        – or add one
  • 124. Idle In Transaction
    select datname, usename,
           max(now() - xact_start) as max_time, count(*)
    from pg_stat_activity
    where current_query ~* '<IDLE> in transaction'
    group by datname, usename;

     datname | usename |   max_time    | count
    ---------+---------+---------------+-------
     track   | admin   | 00:00:00.0217 |     1
     track   | www     | 01:03:06.0709 |     7
  • 125. Idle In Transaction
    ● Caused by:
      ● poor transaction control by application
      ● abandoned sessions not being terminated fast enough
    ● Leads to:
      ● locking problems
      ● database bloat
      ● out of connections
  • 126. Idle In Transaction
    ● Treatment:
      ● refactor application
      ● change driver/ORM settings for transactions
      ● change session timeouts & keepalives on pool, driver, database (see the cleanup sketch below)
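While the refactoring happens, stuck sessions can be cleared by hand. A sketch using the pre-9.2 pg_stat_activity column names that match these slides (procpid, current_query); the 10-minute cutoff is an arbitrary example:

    select pg_terminate_backend(procpid)
    from pg_stat_activity
    where current_query = '<IDLE> in transaction'
      and now() - xact_start > interval '10 minutes';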
  • 127. Longer Running Queries
    ● Detection:
      ● log slow queries to the PostgreSQL log
      ● do daily or weekly reports (pgFouine)
    ● Symptoms:
      ● number of long-running queries in the log increasing
      ● slowest queries getting slower
  • 128. Longer Running Queries
    ● Caused by:
      ● database growth
      ● poorly-written queries
      ● wrong indexes
      ● out-of-date stats
    ● Leads to:
      ● out-of-CPU
      ● out-of-connections
  • 129. Longer Running Queries
    ● Treatments:
      ● refactor queries
      ● update indexes
      ● make autoanalyze more aggressive
      ● control database growth
  • 130. Too Many Queries
  • 131. Too Many Queries
    ● Caused by:
      ● joins in middleware
      ● not caching
      ● poll cycles without delays
      ● other application code issues
    ● Leads to:
      ● out-of-CPU
      ● out-of-connections
  • 132. Too Many Queries
    ● Treatment:
      ● characterize queries using logging
      ● refactor application
  • 133. Locking
    ● Detection:
      ● log_lock_waits
      ● scan activity log for deadlock warnings
      ● query pg_stat_activity and pg_locks (see the sketch below)
    ● Symptoms:
      ● deadlock error messages
      ● number and time of lock_waits getting larger
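A simplified sketch of the pg_locks check: pair each ungranted lock with a granted lock on the same object to see who is blocking whom (a production version would usually join in query text from pg_stat_activity as well):

    select waiting.pid  as waiting_pid,
           holding.pid  as holding_pid,
           waiting.locktype,
           waiting.relation::regclass as locked_relation
    from pg_locks waiting
    join pg_locks holding
      on  waiting.locktype = holding.locktype
      and waiting.database      is not distinct from holding.database
      and waiting.relation      is not distinct from holding.relation
      and waiting.transactionid is not distinct from holding.transactionid
      and waiting.pid <> holding.pid
    where not waiting.granted
      and holding.granted;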
  • 134. Locking
    ● Caused by:
      ● long-running operations with exclusive locks
      ● inconsistent foreign key updates
      ● poorly planned runtime DDL
    ● Leads to:
      ● poor response times
      ● timeouts
      ● deadlock errors
  • 135. Locking
    ● Treatment:
      ● analyze locks
      ● refactor operations taking locks
        – establish a canonical order of updates for long transactions
        – use pessimistic locks with NOWAIT
      ● rely on cascade for FK updates
        – not on middleware code
  • 136. Temp File Usage
    ● Detection:
      ● log_temp_files = 100kB
      ● scan logs for temp files weekly or daily
    ● Symptoms:
      ● temp file usage getting more frequent
      ● queries using temp files getting longer
  • 137. Temp File Usage
    ● Caused by:
      ● sorts, hashes & aggregates too big for work_mem
    ● Leads to:
      ● slow response times
      ● timeouts
  • 138. Temp File Usage
    ● Treatment:
      ● find swapping queries via logs
      ● set work_mem higher for that ROLE (see the sketch below), or
      ● refactor them to need less memory, or
      ● buy more RAM
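Per-role work_mem is set with ALTER ROLE and takes effect at the role's next login; a sketch using a hypothetical reporting role:

    alter role reporting set work_mem = '256MB';

    -- verify the per-role setting
    select rolname, rolconfig
    from pg_roles
    where rolname = 'reporting';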
  • 139. All healthy now? See you in six months!
  • 140. Q&A
    ● Josh Berkus
      ● josh@pgexperts.com
      ● it.toolbox.com/blogs/database-soup
    ● PostgreSQL Experts
      ● www.pgexperts.com
      ● pgCon Sponsor
    ● Also see:
      ● Load Testing (tomorrow)
      ● Testing BOF (Friday)
    Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons attribution license, except for 3rd-party images, which are the property of their respective owners.