Database Health Check
    Database Health Check: Presentation Transcript

    • Database Server Health Check Josh Berkus PostgreSQL Experts Inc. pgCon 2010
    • DATABASE SERVER HELP 5¢
    • Program of Treatment ● What is a Healthy Database? ● Know Your Application ● Load Testing ● Doing a database server checkup ● hardware ● OS & FS ● PostgreSQL ● application ● Common Ailments of the Database Server
    • What is a Healthy Database Server?
    • What is a Healthy Database Server? ● Response Times
    • What is a Healthy Database Server? ● Response Times ● lower than required ● consistent & predictable ● Capacity for more ● CPU and I/O headroom ● low server load
    • [Chart: median and max response time vs. number of clients, with expected load marked]
    • What is an Unhealthy Database Server? ● Slow response times ● Inconsistent response times ● High server load ● No capacity for growth
    • [Chart: median and max response time vs. number of clients, with expected load marked (unhealthy server)]
    • A healthy database server is able to maintain consistent and acceptable response times under expected loads with margin for error.
    • [Chart: median response time vs. number of clients]
    • Hitting The Wall
    • CPUs Floored
               CPU   %user  %system  %iowait   %idle
      Average: all   69.36     0.13    24.87    5.77
               0     88.96     0.09    10.03    1.11
               1     12.09     0.02    86.98    0.00
               2     98.90     0.00     0.00   10.10
               3     77.52     0.44     1.70   20.34
      16:38:29 up 13 days, 22:10, 3 users, load average: 11.05, 9.08, 8.13
    • IO Saturated
      Device:      tps   MB_read/s  MB_wrtn/s
      sde       414.33        0.40      38.15
      sdf      1452.00       99.14      29.00
               CPU   %user  %system  %iowait   %idle
      Average: all   34.75     0.13    58.75    6.37
               0      8.96     0.09    90.03    1.11
               1     12.09     0.02    86.98    0.00
               2     91.90     0.00     7.00   10.10
               3     27.52     0.44    51.70   20.34
    • Out of Connections: FATAL: connection limit exceeded for non-superusers
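      A quick way to see how much headroom remains is to compare current backends against the configured ceiling; a minimal sketch, assuming you can query pg_stat_activity:

      -- connections currently in use vs. the configured limit
      SELECT count(*) AS connections_in_use,
             (SELECT setting::int FROM pg_settings
               WHERE name = 'max_connections') AS max_connections
      FROM pg_stat_activity;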
    • How close are you to the wall?
    • The Checkup (full physical): 1. Analyze application 2. Analyze platform 3. Correct anything obviously wrong 4. Set up load test 5. Monitor load test 6. Analyze results 7. Correct issues
    • The Checkup (semi-annual): 1. Check response times 2. Check system load 3. Check previous issues 4. Check for signs of illness 5. Fix new issues
    • Know your application!
    • Application database usage: Which does your application do? ✔ small reads ✔ large sequential reads ✔ small writes ✔ large writes ✔ long-running procedures/transactions ✔ bulk loads and/or ETL
    • What Color Is My Application? W ● Web Application (Web) O ● Online Transaction Processing (OLTP) D ● Data Warehousing (DW)
    • What Color Is My Application? W ● Web Application (Web) ● DB much smaller than RAM ● 90% or more simple queries O ● Online Transaction Processing (OLTP) D ● Data Warehousing (DW)
    • What Color Is My Application? W ● Web Application (Web) ● DB smaller than RAM ● 90% or more simple queries O ● Online Transaction Processing (OLTP) ● DB slightly larger than RAM, up to 1TB ● 20-40% small data write queries ● Some long transactions and complex read queries D ● Data Warehousing (DW)
    • What Color Is My Application? W ● Web Application (Web) ● DB smaller than RAM ● 90% or more simple queries O ● Online Transaction Processing (OLTP) ● DB slightly larger than RAM, up to 1TB ● 20-40% small data write queries ● Some long transactions and complex read queries D ● Data Warehousing (DW) ● Large to huge databases (100GB to 100TB) ● Large complex reporting queries ● Large bulk loads of data ● Also called "Decision Support" or "Business Intelligence"
    • What Color Is My Application? W ● Web Application (Web) ● CPU-bound ● Ailments: idle connections/transactions, too many queries O ● Online Transaction Processing (OLTP) ● CPU or I/O bound ● Ailments: locks, database growth, idle transactions, database bloat D ● Data Warehousing (DW) ● I/O or RAM bound ● Ailments: database growth, longer running queries, memory usage growth
    • Special features required? ● GIS ● heavy CPU for GIS functions ● lots of RAM for GIS indexes ● TSearch ● lots of RAM for indexes ● slow response time on writes ● SSL ● response time lag on connections
    • Load Testing
    • [Chart: requests per second over a 24-hour period]
    • [Chart: requests per second over a 24-hour period, with the level at which downtime occurs marked]
    • When preventing downtime, it is not average load which matters, it is peak load.
    • What to load test ● Load should be as similar as possible to your production traffic ● You should be able to create your target level of traffic ● better: incremental increases ● Test the whole application as well ● the database server may not be your weak point
    • How to Load Test: 1. Set up a load testing tool (you'll need test servers for this) 2. Turn on PostgreSQL, HW, and application monitoring (all monitoring should start at the same time) 3. Run the test for a defined time (1 hour is usually good) 4. Collect and analyze data 5. Re-run at a higher level of traffic
    • Test Servers ● Must be as close as reasonable to production servers ● otherwise you don't know how production will be different ● there is no predictable multiplier ● Double them up as your development/staging or failover servers ● If your test server is much smaller, then you need to do a same-load comparison
    • Tools for Load Testing
    • Production Test: 1. Determine the peak load hour on the production servers 2. Turn on lots of monitoring during that peak load hour 3. Analyze results. Pretty much your only choice without a test server.
    • Issues with Production Test ● Not repeatable − load won't be exactly the same ever again ● Cannot test target load − just whatever happens to occur during that hour − can't test incremental increases either ● Monitoring may hurt production performance ● Cannot test experimental changes
    • The Ad-Hoc Test ● Get 10 to 50 coworkers to open several sessions each ● Have them go crazy using the application
    • Problems with Ad-Hoc Testing ● Not repeatable ● minor changes in response times may be due to changes in worker activity ● Labor intensive ● each test run shuts down the office ● Can't reach target levels of load ● unless you have a lot of coworkers
    • Siege ● HTTP traffic generator ● all test interfaces must be addressable as URLs ● useless for non-web applications ● Simple to use ● create a simple load test in a few hours ● Tests the whole web application ● cannot test database separately ● http://www.joedog.org/index/siege-home
    • pgReplay ● Replays your activity logs at variable speed ● get exactly the traffic you get in production ● Good for testing just the database server ● Can take time to set up ● need database snapshot, collect activity logs ● must already have production traffic ● http://pgreplay.projects.postgresql.org/
    • tsung ● Generic load generator in Erlang ● a load testing kit rather than a tool ● Generate a tsung file from your activity logs using pgFouine and test the database ● Generate load for a web application using custom scripts ● Can be time consuming to set up ● but highly configurable and advanced ● very scalable - cluster of load testing clients ● http://tsung.erlang-projects.org/
    • pgBench ● Simple micro-benchmark ● not like any real application ● Version 9.0 adds multi-threading, customization ● write custom pgBench scripts ● run against real database ● Fairly ad-hoc compared to other tools ● but easy to set up ● ships with PostgreSQL
    • Benchmarks ● Many “real” benchmarks available ● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc. ● Useful for testing your hardware ● not useful for testing your application ● Often time-consuming and complex
    • Platform-specific ● Web framework or platform tests ● Rails: ActionController::PerformanceTest ● J2EE: OpenDemand, Grinder, many more − JBoss, BEA have their own tools ● Zend Framework Performance Test ● Useful for testing specific application performance ● such as performance of specific features, modules ● Not all platforms have them
    • Flight-Check ● Attend the tutorial tomorrow!
    • monitoring PostgreSQL during load test:
      logging_collector = on
      log_destination = 'csvlog'
      log_filename = 'load_test_1_%h'
      log_rotation_age = 60min
      log_rotation_size = 1GB
      log_min_duration_statement = 0
      log_connections = on
      log_disconnections = on
      log_temp_files = 100kB
      log_lock_waits = on
    • monitoring hardware during load test ● sar -A -o load_test_1.sar 30 240 ● iostat ● or fsstat / zfs iostat on ZFS
    • monitoring application during load test ● Collect response times ● with timestamp ● with activity ● Monitor hardware and utilization ● activity ● memory & CPU usage ● Record errors & timeouts
    • Checking Hardware
    • Checking Hardware ● CPUs and Cores ● RAM ● I/O & disk support ● Network
    • CPUs and Cores ● Pretty simple: number, type, speed, L1/L2 cache ● Rules of thumb: ● fewer faster CPUs is usually better than more slower ones ● core != cpu ● thread != core ● virtual core != core
    • CPU calculations ● ½ to 1 core for OS ● ½ to 1 core for software raid or ZFS ● 1 core for postmaster and bgwriter ● 1 core per: ● DW: 1 to 3 concurrent users ● OLTP: 10 to 50 concurrent users ● Web: 100 to 1000 concurrent users
    • CPU tools ● sar ● mpstat ● pgTop
    • in praise of sar ● collects data about all aspects of HW usage ● available on most OSes ● but output is slightly different ● easiest tool for collecting basic information ● often enough for server-checking purposes ● BUT: does not report all data on all platforms
    • sar ● CPUs: sar -P ALL and sar -u ● Memory: sar -r and sar -R ● I/O: sar -b and sar -d ● network: sar -n
    • sar CPU output
      Linux:
      06:05:01 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
      06:15:01 AM  all  14.26   0.09     6.01     1.32    0.00  78.32
      06:15:01 AM    0  14.26   0.09     6.01     1.32    0.00  78.32
      Solaris:
      15:08:56  %usr  %sys  %wio  %idle
      15:09:26    10     5     0     85
      15:09:56     9     7     0     84
      15:10:26    15     6     0     80
      15:10:56    14     7     0     79
      15:11:26    15     5     0     80
      15:11:56    14     5     0     81
    • Memory ● Only one statistic: how much? ● Not generally an issue on its own ● low memory can cause more I/O ● low memory can cause more CPU time
    • memory sizing [diagram: shared buffers / filesystem cache / work_mem & maint_mem, mapped against data in buffer / in cache / on disk]
    • Figure out Memory Sizing ● What is the active portion of your database? ● i.e. gets queried frequently ● How large is it? ● Where does it fit into the size categories? ● How large is the inactive portion of your database? ● how frequently does it get hit? (remember backups)
    • Memory Sizing ● Other needs for RAM − work_mem: ● sorts and aggregates: do you do a lot of big ones? ● GIN/GiST indexes: these can be huge ● hashes: for joins and aggregates ● VACUUM
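      To put rough numbers on those sizing questions, a sketch using the standard size functions (treating the largest tables as a proxy for the active portion is an assumption; actual hot data depends on your query patterns):

      -- total database size
      SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;

      -- ten largest tables, including their indexes and TOAST data
      SELECT relname,
             pg_size_pretty(pg_total_relation_size(relid)) AS total_size
      FROM pg_stat_user_tables
      ORDER BY pg_total_relation_size(relid) DESC
      LIMIT 10;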
    • I/O Considerations ● Throughput ● how fast can you get data off disk? ● Latency ● how long does it take to respond to requests? ● Seek Time ● how long does it take to find random disk pages?
    • I/O Considerations ● Throughput ● important for large databases ● important for bulk loads ● Latency ● huge effect on small writes & reads ● not so much on large scans ● Seek Time ● important for small writes & reads ● very important for index lookups
    • I/O Considerations ● Web ● concerned about read latency & seek time ● OLTP ● concerned about write latency & seek time ● DW/BI ● concerned about throughput & seek time
    • [bonnie++ output from two example arrays]
      Array 1 (32096M): Sequential Output 79553 K/sec per-chr, 240548 K/sec block, 50646 K/sec rewrite; Sequential Input 72471 K/sec per-chr, 185634 K/sec block; Random Seeks 1140/sec
      Array 2 (24G): Sequential Output 260044 K/sec block, 62110 K/sec rewrite; Sequential Input 89914 K/sec block; Random Seeks 1167/sec; latencies 6549ms, 4882ms, 3395ms, 107ms
    • Common I/O Types ● Software RAID & ZFS ● Hardware RAID Array ● NAS/SAN ● SSD
    • Hardware RAID Sanity Check ● RAID 1 / 10, not 5 ● Battery-backed write cache? ● otherwise, turn write cache off ● SATA < SCSI/SAS ● about ½ real throughput ● Enough drives? ● 4-14 for OLTP application ● 8-48 for DW/BI
    • SW RAID / ZFS Sanity Check ● Enough CPUs? ● will need one for the RAID ● Enough disks? ● same as hardware raid ● Extra configuration? ● caching ● block size
    • NAS/SAN Sanity Check ● Check latency! ● Check real throughput ● drivers often a problem ● Enough network bandwidth? ● multipath or fiber required to get HW RAID performance
    • SSD Sanity Check ● 1 SSD = 4 drives ● relative performance ● Check write cache configuration ● make sure data is safe ● Test real throughput, seek times ● drivers often a problem ● Research durability stats
    • IO Tools ● I/O Tests: dd test, Bonnie++, IOZone, filebench, EXPLAIN ANALYZE ● Monitoring Tools: sar, mpstat iowait, iostat, on zfs: fsstat, zfs iostat
    • Network ● Throughput ● not usually an issue, except: − iSCSI / NAS / SAN − ELT & Bulk Load Processes ● remember that gigabit is only 100MB/s! ● Latency ● real issue for Web / OLTP ● consider putting app ↔ database on private network
    • Checkups for the Cloud
    • Just like real HW, except... ● Low ceiling on #cpus, RAM ● Virtual Core < Real Core ● “CPU Stealing” ● last-generation hardware ● calculate 50% more cores
    • Cloud I/O Hell ● I/O tends to be very slow, erratic ● comparable to a USB thumb drive ● horrible latency, up to ½ second ● erratic, speeds go up and down ● RAID together several volumes on EBS ● use asynchronous commit − or at least commit_siblings
    • #1 Cloud Rule: If your database doesn't fit in RAM, don't host it on a public cloud
    • Checking Operating System and Filesystem
    • OS Basics ● Use recent versions ● large performance, scaling improvements in Linux & Solaris in last 2 years ● Check OS tuning advice for databases ● advice for Oracle is usually good for PostgreSQL ● Keep up with information about issues & patches ● frequently specific releases have major issues ● especially check HW drivers
    • OS Basics ● Use Linux, BSD or Solaris! ● Windows has poor performance and weak diagnostic tools ● OSX is optimized for desktop and has poor hardware support ● AIX and HPUX require expertise just to install, and lack tools
    • Filesystem Layout ● One array / one big pool ● Two arrays / partitions ● OS and transaction log ● Database ● Three arrays ● OS & stats file ● Transaction log ● Database
    • Linux Tuning ● XFS > Ext3 (but not that much) ● Ext3 Tuning: data=writeback,noatime,nodiratime ● XFS Tuning: noatime,nodiratime − for transaction log: nobarrier ● “deadline” I/O scheduler ● Increase SHMMAX and SHMALL ● to ½ of RAM ● Cluster filesystems also a possibility ● OCFS, RHCFS
    • Solaris Tuning ● Use ZFS ● no advantage to UFS anymore ● mixed filesystems cause caching issues ● set recordsize − 8K small databases − 128K large databases − check for throughput/latency issues
    • Solaris Tuning ● Set OS parameters via “projects” ● For all databases: ● project.max-shm-memory=(priv,12GB,deny) ● For high-connection databases: ● use libumem ● project.max-shm-ids=(priv,32768,deny) ● project.max-sem-ids=(priv,4096,deny) ● project.max-msg-ids=(priv,4096,deny)
    • FreeBSD Tuning ● ZFS: same as Solaris ● definite win for very large databases ● not so much for small databases ● Other tuning per docs
    • PostgreSQL Checkup
    • postgresql.conf: formulae ● shared_buffers = available RAM / 4
    • postgresql.conf: formulae ● max_connections = web: 100 to 200; OLTP: 50 to 100; DW/BI: 5 to 20; if you need more, use pooling!
    • postgresql.conf: formulae ● Web/OLTP: work_mem = AvRAM * 2 / max_connections ● DW/BI: work_mem = AvRAM / max_connections
    • postgresql.conf: formulae ● Web/OLTP: maintenance_work_mem = AvRAM / 16 ● DW/BI: maintenance_work_mem = AvRAM / 8
    • postgresql.conf: formulae ● autovacuum = on ● DW/BI & bulk loads: autovacuum = off ● autovacuum_max_workers = 1/2
    • postgresql.conf: formulae ● checkpoint_segments = web: 8 to 16; OLTP: 32 to 64; BI/DW: 128 to 256
    • postgresql.conf: formulae ● wal_buffers = 8MB ● effective_cache_size = AvRAM * 0.75
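      To compare a server's running settings against these formulae, one option is to query pg_settings; the parameter list below is simply the set discussed on these slides:

      SELECT name, setting, unit, source
      FROM pg_settings
      WHERE name IN ('shared_buffers', 'max_connections', 'work_mem',
                     'maintenance_work_mem', 'autovacuum', 'checkpoint_segments',
                     'wal_buffers', 'effective_cache_size');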
    • How much recoverability do you need? ● None: ● fsync = off ● full_page_writes = off ● consider using ramdrive ● Some loss OK: ● synchronous_commit = off ● wal_buffers = 16MB to 32MB ● Data integrity critical: ● keep everything on
    • File Locations ● Database ● Transaction Log ● Activity Log ● Stats File ● Tablespaces?
    • Database Checks: Indexes
      select relname, seq_scan, seq_tup_read,
             pg_size_pretty(pg_relation_size(relid)) as size,
             coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0) + coalesce(n_tup_del,0) as update_activity
      from pg_stat_user_tables
      where seq_scan > 1000 and pg_relation_size(relid) > 1000000
      order by seq_scan desc limit 10;

         relname     | seq_scan | seq_tup_read |  size   | update_activity
      ---------------+----------+--------------+---------+-----------------
       permissions   |    12264 |        53703 | 2696 kB |             365
       users         |    11697 |       351635 | 17 MB   |             741
       test_set      |     9150 |  18492353300 | 275 MB  |           27643
       test_pool     |     5143 |   3141630847 | 212 MB  |           77755
    • Database Checks: Indexes
      SELECT indexrelid::regclass as index, relid::regclass as table
      FROM pg_stat_user_indexes JOIN pg_index USING (indexrelid)
      WHERE idx_scan < 100 AND indisunique IS FALSE;

               index          |    table
      ------------------------+--------------
       acct_acctdom_idx       | accounts
       hitlist_acct_idx       | hitlist
       hitlist_number_idx     | hitlist
       custom_field_acct_idx  | custom_field
       user_log_accstrt_idx   | user_log
       user_log_idn_idx       | user_log
       user_log_feed_idx      | user_log
       user_log_inbdstart_idx | user_log
       user_log_lead_idx      | user_log
    • Database Checks: Large Tables
              relname          | total_size | table_size
      -------------------------+------------+------------
       operations_2008         | 9776 MB    | 3396 MB
       operations_2009         | 9399 MB    | 3855 MB
       request_by_second       | 7387 MB    | 5254 MB
       request_archive         | 6975 MB    | 3349 MB
       events                  | 92 MB      | 66 MB
       event_edits             | 82 MB      | 68 MB
       2009_ops_eoy            | 33 MB      | 19 MB
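      This slide shows only the output; a query along these lines (not necessarily the exact one used in the talk) would produce something similar:

      -- largest tables: total_size includes indexes and TOAST, table_size is the heap alone
      SELECT relname,
             pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
             pg_size_pretty(pg_relation_size(relid))       AS table_size
      FROM pg_stat_user_tables
      ORDER BY pg_total_relation_size(relid) DESC
      LIMIT 10;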
    • Database Checks: Heavily-Used Tables
      select relname, pg_size_pretty(pg_relation_size(relid)) as size,
             coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0) + coalesce(n_tup_del,0) as update_activity
      from pg_stat_user_tables
      order by update_activity desc limit 10;

             relname         |  size   | update_activity
      -----------------------+---------+-----------------
       session_log           | 344 GB  |         4811814
       feature               | 279 MB  |         1012565
       daily_feature         | 28 GB   |          984406
       cache_queue_2010_05   | 2578 MB |          981812
       user_log              | 30 GB   |          796043
       vendor_feed           | 29 GB   |          479392
       vendor_info           | 23 GB   |          348355
       error_log             | 239 MB  |          214376
       test_log              | 945 MB  |          185785
       settings              | 215 MB  |          117480
    • Database Unit Tests ● You need them! ● you will be changing database objects and rewriting queries ● find bugs in testing … or in production ● Various tools ● pgTAP ● Framework-level tests − Rails, Django, Catalyst, JBoss, etc.
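      For the pgTAP route, a minimal sketch of what a test looks like (assumes pgTAP is already installed in the database; the table and index names here are illustrative only):

      BEGIN;
      SELECT plan(2);
      -- schema assertions: fail loudly if someone drops or renames these objects
      SELECT has_table('user_log');
      SELECT has_index('user_log', 'user_log_acct_idx');
      SELECT * FROM finish();
      ROLLBACK;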
    • Application Stack Checkup
    • The Layer Cake [diagram]:
      Application: Queries, Transactions
      Middleware: Drivers, Connections, Caching
      PostgreSQL: Schema, Config
      Operating System: Filesystem, Kernel
      Hardware: Storage, RAM/CPU, Network
    • The Funnel [diagram]: Application → Middleware → PostgreSQL → OS → HW
    • Check PostgreSQL Drivers ● Does the driver version match the PostgreSQL version? ● Have you applied all updates? ● Are you using the best driver? ● There are several Python, C++ drivers ● Don't use ODBC if you can avoid it. ● Does the driver support cached plans & binary data? ● If so, are they being used?
    • Check Caching
    • Check Caching ● Does the application use data caching? ● what kind? ● could it be used more? ● what is the cache invalidation strategy? ● is there protection from “cache refresh storms”? ● Does the application use HTTP caching? ● could they be using it more?
    • Check Connection Pooling ● Is the application using connection pooling? ● all web applications should, and most OLTP ● external or built into the application server? ● Is it configured correctly? ● max. efficiency: transaction / statement mode ● make sure timeouts match
    • Check Query Design ● PostgreSQL does better with fewer, bigger statements ● Check for common query mistakes ● joins in the application layer ● pulling too much data and discarding it ● huge OFFSETs ● unanchored text searches
    • Check Transaction Management ● Are transactions being used for loops? ● batches of inserts or updates can be 75% faster if wrapped in a transaction ● Are transactions aborted properly? ● on error ● on timeout ● transactions being held open while non-database activity runs
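      The batching point is plain transaction control; a trivial sketch (the table name is illustrative) of what "wrapped in a transaction" means for a loop of inserts:

      -- one commit (and one WAL flush) for the whole batch,
      -- instead of one per statement in autocommit mode
      BEGIN;
      INSERT INTO import_log (msg) VALUES ('row 1');
      INSERT INTO import_log (msg) VALUES ('row 2');
      -- ... many more rows ...
      COMMIT;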
    • Common Ailments of the Database Server
    • Check for them, monitor for them ● ailments could throw off your response time targets ● database could even “hit the wall” ● check for them during health check ● and during each checkup ● add daily/continuous monitors for them ● Nagios check_postgres.pl has checks for many of these things
    • Database Growth ● Checkup: ● check both total database size and largest table(s) size daily or weekly ● Symptoms: ● database grows faster than expected ● some tables grow continuously and rapidly
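      One way to do that daily or weekly check is to record sizes into a tracking table on a schedule; a sketch, where growth_log and the cron schedule are assumptions you would adapt:

      -- one-time setup
      CREATE TABLE growth_log (
          sampled_at    timestamptz,
          db_size       bigint,
          largest_table text,
          largest_size  bigint
      );

      -- run daily (e.g. from cron), then graph or diff the rows over time
      INSERT INTO growth_log
      SELECT now(),
             pg_database_size(current_database()),
             relname,
             pg_total_relation_size(relid)
      FROM pg_stat_user_tables
      ORDER BY pg_total_relation_size(relid) DESC
      LIMIT 1;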
    • Database Growth ● Caused By: ● faster than expected increase in usage ● “append forever” tables ● Database Bloat ● Leads to: ● slower seq scans and index scans ● swapping & temp files ● slower backups
    • Database Growth ● Treatment: ● check for Bloat ● find largest tables and make them smaller − expire data − partitioning ● horizontal scaling (if possible) ● get better storage & more RAM, sooner
    • Database Bloat
      -[ RECORD 1 ]+----------------------
      schemaname   | public
      tablename    | user_log
      tbloat       | 3.4
      wastedpages  | 2356903
      wastedbytes  | 19307749376
      wastedsize   | 18 GB
      iname        | user_log_accttime_idx
      ituples      | 941451584
      ipages       | 9743581
      iotta        | 40130146
      ibloat       | 0.2
      wastedipages | 0
      wastedibytes | 0
      wastedisize  | 0 bytes
    • Database Bloat ● Caused by: ● Autovacuum not keeping up − or not enough manual vacuum − often on specific tables only ● FSM set wrong (before 8.4) ● Idle In Transaction ● Leads To: ● slow response times ● unpredictable response times ● heavy I/O
    • Database Bloat ● Treatment: ● make autovacuum more aggressive − on specific tables with bloat ● fix FSM_relations/FSM_pages ● check when tables are getting vacuumed ● check for Idle In Transaction
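      The full bloat-estimation query behind output like the RECORD above is long; as a rough first pass, dead-tuple counts from the statistics collector (available from 8.3 onward) show which tables autovacuum is falling behind on:

      SELECT relname, n_live_tup, n_dead_tup,
             last_vacuum, last_autovacuum
      FROM pg_stat_user_tables
      ORDER BY n_dead_tup DESC
      LIMIT 10;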
    • Memory Usage Growth
      00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
      01:00:00        0        0      100        0        0      100        0        0
      02:00:00        0        0      100        0        0      100        0        0
      03:00:00        0        0      100        0        0      100        0        0
      04:00:00        0        0      100        0        0      100        0        0

      00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
      01:00:00     3788      115       98        0        0      100        0        0
      02:00:00    21566      420       78        0        0      100        0        0
      03:00:00   455721     1791       59        0        0      100        0        0
      04:00:00      908        6       96        0        0      100        0        0
    • Memory Usage Growth ● Caused by: ● Database Growth or Bloat ● work_mem limit too high ● bad queries ● Leads To: ● database out of cache − slow response times ● OOM Errors (OOM Killer)
    • Memory Usage Growth ● Treatment ● Look at ways to shrink queries, DB − partitioning − data expiration ● lower work_mem limit ● refactor bad queries ● Or just buy more RAM
    • Idle Connections
      select datname, usename, count(*)
      from pg_stat_activity
      where current_query = '<IDLE>'
      group by datname, usename;

       datname | usename | count
      ---------+---------+-------
       track   | www     |   318
    • Idle Connections ● Caused by: ● poor session management in application ● wrong connection pool settings ● Leads to: ● memory usage for connections ● slower response times ● out-of-connections at peak load
    • Idle Connections ● Treatment: ● refactor application ● reconfigure connection pool − or add one
    • Idle In Transaction
      select datname, usename, max(now() - xact_start) as max_time, count(*)
      from pg_stat_activity
      where current_query ~* '<IDLE> in transaction'
      group by datname, usename;

       datname | usename |   max_time    | count
      ---------+---------+---------------+-------
       track   | admin   | 00:00:00.0217 |     1
       track   | www     | 01:03:06.0709 |     7
    • Idle In Transaction ● Caused by: ● poor transaction control by application ● abandoned sessions not being terminated fast enough ● Leads To: ● locking problems ● database bloat ● out of connections
    • Idle In Transaction ● Treatment ● refactor application ● change driver/ORM settings for transactions ● change session timeouts & keepalives on pool, driver, database
    • Longer Running Queries ● Detection: ● log slow queries to PostgreSQL log ● do daily or weekly report (pgFouine) ● Symptoms: ● number of long-running queries in log increasing ● slowest queries getting slower
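      Besides log analysis, you can spot currently long-running statements directly; a sketch against the 8.x/9.0 view (later releases rename procpid to pid and current_query to query, and the one-minute threshold is arbitrary):

      SELECT procpid, datname, usename,
             now() - query_start AS runtime,
             current_query
      FROM pg_stat_activity
      WHERE current_query <> '<IDLE>'
        AND now() - query_start > interval '1 minute'
      ORDER BY runtime DESC;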
    • Longer Running Queries ● Caused by: ● database growth ● poorly-written queries ● wrong indexes ● out-of-date stats ● Leads to: ● out-of-CPU ● out-of-connections
    • Longer Running Queries ● Treatments: ● refactor queries ● update indexes ● make autoanalyze more aggressive ● control database growth
    • Too Many Queries
    • Too Many Queries ● Caused By: ● joins in middleware ● not caching ● poll cycles without delays ● other application code issues ● Leads To: ● out-of-CPU ● out-of-connections
    • Too Many Queries ● Treatment: ● characterize queries using logging ● refactor application
    • Locking ● Detection: ● log_lock_waits ● scan activity log for deadlock warnings ● query pg_stat_activity and pg_locks ● Symptoms: ● deadlock error messages ● number and time of lock_waits getting larger
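      The pg_stat_activity / pg_locks check might look like this; a sketch for 8.x/9.0 (procpid and current_query become pid and query in later releases):

      -- sessions whose lock requests have not been granted
      SELECT l.pid, l.locktype, l.mode, a.usename,
             now() - a.query_start AS waiting_for,
             a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
      WHERE NOT l.granted
      ORDER BY waiting_for DESC;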
    • Locking ● Caused by: ● long-running operations with exclusive locks ● inconsistent foreign key updates ● poorly planned runtime DDL ● Leads to: ● poor response times ● timeouts ● deadlock errors
    • Locking ● Treatment ● analyze locks ● refactor operations taking locks − establish a canonical order of updates for long transactions − use pessimistic locks with NOWAIT ● rely on cascade for FK updates − not on middleware code
    • Temp File Usage ● Detection: ● log_temp_files = 100kB ● scan logs for temp files weekly or daily ● Symptoms: ● temp file usage getting more frequent ● queries using temp files getting longer
    • Temp File Usage ● Caused by: ● sorts, hashes & aggregates too big for work_mem ● Leads to: ● slow response times ● timeouts
    • Temp File Usage ● Treatment ● find swapping queries via logs ● set work_mem higher for that ROLE, or ● refactor them to need less memory, or ● buy more RAM
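      Raising work_mem only for the role that runs the big sorts or aggregates looks like this ("reporting" is an illustrative role name; the value is something you would tune):

      -- applies to new sessions for that role
      ALTER ROLE reporting SET work_mem = '256MB';
      -- or just for the current session / the next big query
      SET work_mem = '256MB';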
    • All healthy now? See you in six months!
    • Q&A ● Josh Berkus ● josh@pgexperts.com ● it.toolbox.com/blogs/database-soup ● PostgreSQL Experts ● www.pgexperts.com ● pgCon Sponsor ● Also see: ● Load Testing (tomorrow) ● Testing BOF (Friday). Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons attribution license, except for 3rd-party images, which are property of their respective owners.