PGConf.ASIA 2019 Bali - Modern PostgreSQL Monitoring & Diagnostics - Mahadevan Ramachandran
2. Modern PostgreSQL Monitoring & Diagnostics
Mahadevan Ramachandran
Co-founder & CEO, RapidLoop
September 10, 2019
Mahadevan Ramachandran PostgreSQL Monitoring September 10, 2019 2 / 55
3. Hello!
Mahadevan Ramachandran
Co-founder & CEO, RapidLoop
We build monitoring products:
OpsDash – server & service monitoring
pgDash – dedicated PostgreSQL monitoring
4. In The Beginning...
OpsDash was our first product. It does server and service monitoring. It
also does basic PostgreSQL monitoring. OpsDash itself uses Postgres
internally.
As we grew, we realized that we needed more in-depth monitoring for
Postgres.
But:
What should we collect? There was no standard list of metrics and
information to collect.
How should we collect? There was no standard way to collect that
worked across Postgres versions, distros, etc.
Why should we collect what we were supposed to collect? How were
we supposed to interpret it?
5. What we needed
A modern system by which we could collect all relevant info about a
Postgres server into one place
Extract & store relevant metrics into timeseries db
Display collected information in a rich set of dashboards – not just as
a bunch of timeseries graphs
More importantly: let algorithms have a look at all this information
and perform diagnostics – for example:
WAL file count increased steadily over the last 24 hours
Look at transactions running for more than 24 hours – none
Look at wal archiving info – not failing – what then?
Look at replication slots – one of them has become inactive – aha!
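The diagnostic walk-through above can be sketched as a chain of checks. This is only an illustration of the idea, not code from pgmetrics or pgDash; the `Snapshot` class, its field names and `diagnose_wal_growth` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    """Hypothetical summary of collected data (field names are illustrative)."""
    wal_file_counts_24h: List[int]      # WAL file count samples, oldest first
    longest_xact_hours: float = 0.0     # age of the oldest open transaction
    archive_failures: int = 0           # failed WAL archive attempts
    inactive_slots: List[str] = field(default_factory=list)


def diagnose_wal_growth(s: Snapshot) -> List[str]:
    """Return likely causes for a steadily growing WAL file count."""
    counts = s.wal_file_counts_24h
    growing = (
        len(counts) >= 2
        and all(b >= a for a, b in zip(counts, counts[1:]))
        and counts[-1] > counts[0]
    )
    if not growing:
        return []
    causes = []
    # Same order of checks as on the slide.
    if s.longest_xact_hours > 24:
        causes.append("transaction open for more than 24 hours")
    if s.archive_failures > 0:
        causes.append("WAL archiving is failing")
    for slot in s.inactive_slots:
        causes.append(f"replication slot {slot!r} is inactive")
    return causes or ["no known cause found"]
```

For example, `diagnose_wal_growth(Snapshot([19, 25, 40, 80], inactive_slots=["standby1"]))` reports only the inactive slot, which mirrors the "aha!" step above.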
6. So we built pgmetrics
We did some research:
Started with our requirements
Actually read the Postgres documentation :-)
Various metrics collection agents/plugins, check_nagios.pl
Read a few Postgres administration books
and then built an open source tool called pgmetrics
Added more features and tweaks based on user feedback
Now queries and collects over 350 metrics..
..using over 50 different queries
7. pgmetrics
Open source CLI tool - also usable as a library
Run anywhere - Go, single binary, statically linked, no dependencies -
Windows, Linux, FreeBSD, macOS
No installation or Postgres extensions required - so works with AWS
RDS, Heroku too
Easy to use - exact same command-line arguments and env. vars as
psql
Usable for scripting and automation - JSON, CSV outputs
Accommodates Postgres version differences - works with v9.3+
8. How It Works
pgmetrics collects information from:
Statistics collector views – pg_stat_archiver, pg_stat_bgwriter,
pg_stat_replication, pg_stat_wal_receiver, pg_stat_activity,
pg_stat_subscription, ...
Functions – pg_is_in_recovery(), pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(), pg_tablespace_size(), ...
System catalog views – pg_database, pg_tablespace, pg_class, ...
PSS Extension – pg_stat_statements
Configuration settings – pg_settings
Bloat – using the query from check_nagios.pl
Filesystem – pg_ls_dir('pg_wal')
System metrics – /proc filesystem
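Collecting from these sources across versions means picking the right names per server. The sketch below shows one way that could look for the WAL-related items; it is an assumption about the approach, not pgmetrics' actual code, and `wal_names` and `wal_file_count_query` are hypothetical helpers. The only fact relied on is that PostgreSQL 10 renamed `pg_xlog` to `pg_wal` and the `*_xlog_*_location()` functions to `*_wal_*_lsn()`.

```python
def wal_names(server_version_num: int) -> dict:
    """Pick WAL-related names for a server version (e.g. 100010 for 10.10)."""
    if server_version_num >= 100000:
        return {
            "wal_dir": "pg_wal",
            "receive_fn": "pg_last_wal_receive_lsn()",
            "replay_fn": "pg_last_wal_replay_lsn()",
        }
    # Names used before PostgreSQL 10.
    return {
        "wal_dir": "pg_xlog",
        "receive_fn": "pg_last_xlog_receive_location()",
        "replay_fn": "pg_last_xlog_replay_location()",
    }


def wal_file_count_query(server_version_num: int) -> str:
    """Build a query that counts entries in the WAL directory.

    Note: pg_ls_dir() lists every entry, including subdirectories such as
    archive_status, so a real collector would filter the names further.
    """
    names = wal_names(server_version_num)
    return f"SELECT count(*) FROM pg_ls_dir('{names['wal_dir']}')"
```

The `server_version_num` value can be read from the server with `SHOW server_version_num`.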
9. Demo Time
Let’s have a look!
11. Cluster Overview
Name: 10/main
Server Version: 10.10 (Ubuntu 10.10-1.pgdg18.04+1)
Server Started: 23 Aug 2019 3:47:32 AM (24 minutes ago)
System Identifier: 6685915216424112509
Timeline: 1
Last Checkpoint: 23 Aug 2019 4:07:36 AM (4 minutes ago)
Prior LSN: 13E/4100F370
REDO LSN: 13E/4103E9F8 (190 KiB since Prior)
Checkpoint LSN: 13E/410411D8 (10 KiB since REDO)
Transaction IDs: 1499764934 to 1640241289 (diff = 140476355)
Notification Queue: 0.0% used
Active Backends: 3 (max 100)
Recovery Mode? no
12. Cluster Overview
What to monitor:
Transaction ID range
Time since last checkpoint
Number of (client) backends
Notification queue usage
Time since server start
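As a rough illustration of why the transaction ID range matters: transaction IDs are 32-bit and wrap around, and about 2^31 of them can lie "in the past" at any time. The sketch below computes the range and how much of that limit it uses. The function names are hypothetical, and the input values are the ones shown in the Cluster Overview output.

```python
XID_SPACE = 2 ** 32         # transaction IDs are 32-bit counters that wrap
WRAPAROUND_LIMIT = 2 ** 31  # about 2 billion XIDs can be in the past


def xid_range(oldest: int, newest: int) -> int:
    """Number of transaction IDs between oldest and newest, allowing for wrap."""
    return (newest - oldest) % XID_SPACE


def xid_range_pct(oldest: int, newest: int) -> float:
    """The range as a percentage of the wraparound limit."""
    return 100.0 * xid_range(oldest, newest) / WRAPAROUND_LIMIT


diff = xid_range(1499764934, 1640241289)   # 140476355, as in the output above
used = xid_range_pct(1499764934, 1640241289)
```

An alert on `used` crossing some threshold (the threshold itself is a judgment call) gives early warning well before wraparound protection kicks in.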
13. Backends
Backends:
Total Backends: 17 (17.0% of max 100)
Problematic: 5 waiting on locks, 6 waiting on other, 1 xact too long, 3 idle in xact
Waiting for Locks:
+-------+--------+---------+-------------+----------+----------------------+------------------------+
| PID   | User   | App     | Client Addr | Database | Wait                 | Query Start            |
+-------+--------+---------+-------------+----------+----------------------+------------------------+
| 28525 | mdevan | psql    |             | bench    | Lock / relation      | 28 Aug 2019 6:42:13 AM |
| 28539 | mdevan | pgbench |             | bench    | Lock / transactionid | 28 Aug 2019 6:42:56 AM |
| 28541 | mdevan | pgbench |             | bench    | Lock / transactionid | 28 Aug 2019 6:42:56 AM |
| 28565 | mdevan | psql    |             | bench    | Lock / relation      | 28 Aug 2019 6:42:26 AM |
| 28588 | mdevan | psql    |             | bench    | Lock / relation      | 28 Aug 2019 6:42:45 AM |
+-------+--------+---------+-------------+----------+----------------------+------------------------+
Other Waiting Backends:
+-------+--------+---------+-------------+----------+-----------------------+------------------------+
| PID   | User   | App     | Client Addr | Database | Wait                  | Query Start            |
+-------+--------+---------+-------------+----------+-----------------------+------------------------+
| 22066 | pgdash |         |             | pgdash   | Client / ClientRead   | 28 Aug 2019 6:40:06 AM |
| 27976 | pgdash |         |             | pgdash   | Client / ClientRead   | 28 Aug 2019 6:40:06 AM |
| 28174 | mdevan | psql    |             | bench    | Client / ClientRead   | 28 Aug 2019 6:41:47 AM |
| 28534 | mdevan | pgbench |             | bench    | LWLock / WALWriteLock | 28 Aug 2019 6:42:56 AM |
| 28536 | mdevan | pgbench |             | bench    | LWLock / WALWriteLock | 28 Aug 2019 6:42:56 AM |
| 28542 | mdevan | pgbench |             | bench    | LWLock / WALWriteLock | 28 Aug 2019 6:42:56 AM |
+-------+--------+---------+-------------+----------+-----------------------+------------------------+
(continued..)
14. Backends
Long Running (>60 sec) Transactions:
+-------+--------+------+-------------+----------+---------------------------------------+
| PID   | User   | App  | Client Addr | Database | Transaction Start                     |
+-------+--------+------+-------------+----------+---------------------------------------+
| 28174 | mdevan | psql |             | bench    | 28 Aug 2019 6:41:40 AM (1 minute ago) |
+-------+--------+------+-------------+----------+---------------------------------------+
Idling in Transaction:
+-------+--------+---------+-------------+----------+----------+------------------------+
| PID   | User   | App     | Client Addr | Database | Aborted? | State Change           |
+-------+--------+---------+-------------+----------+----------+------------------------+
| 28174 | mdevan | psql    |             | bench    | no       | 28 Aug 2019 6:41:47 AM |
| 28535 | mdevan | pgbench |             | bench    | no       | 28 Aug 2019 6:42:56 AM |
| 28540 | mdevan | pgbench |             | bench    | no       | 28 Aug 2019 6:42:56 AM |
+-------+--------+---------+-------------+----------+----------+------------------------+
15. Backends
What to monitor:
Backends waiting for locks
Backends that have been running for too long
Backends that are idling in transaction
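The three categories above map onto columns of pg_stat_activity (`wait_event_type`, `xact_start`, `state`). The sketch below buckets rows that have already been fetched as dictionaries; `classify_backends` and the 60-second default are illustrative assumptions, not pgmetrics' implementation.

```python
from datetime import datetime, timedelta


def classify_backends(rows, now, max_xact_age=timedelta(seconds=60)):
    """Bucket pg_stat_activity rows (as dicts) into the problem classes above."""
    result = {"waiting_on_locks": [], "xact_too_long": [], "idle_in_xact": []}
    idle_states = ("idle in transaction", "idle in transaction (aborted)")
    for r in rows:
        # Heavyweight lock waits are reported with wait_event_type = 'Lock'.
        if r.get("wait_event_type") == "Lock":
            result["waiting_on_locks"].append(r["pid"])
        # xact_start is NULL (None) when no transaction is open.
        xact_start = r.get("xact_start")
        if xact_start is not None and now - xact_start > max_xact_age:
            result["xact_too_long"].append(r["pid"])
        if r.get("state") in idle_states:
            result["idle_in_xact"].append(r["pid"])
    return result


now = datetime(2019, 8, 28, 6, 42, 56)
rows = [
    {"pid": 28174, "state": "idle in transaction", "wait_event_type": "Client",
     "xact_start": datetime(2019, 8, 28, 6, 41, 40)},
    {"pid": 28525, "state": "active", "wait_event_type": "Lock",
     "xact_start": datetime(2019, 8, 28, 6, 42, 13)},
]
problems = classify_backends(rows, now)
```

A single backend can land in more than one bucket, as PID 28174 does in the sample output.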
16. WAL Files
WAL Files:
WAL Archiving? yes
WAL Files: 19
Ready Files: 0
Archive Rate: 1.20 per min
Last Archived: 27 Aug 2019 8:46:52 AM (13 seconds ago)
Last Failure:
Totals: 7265 succeeded, 0 failed
Totals Since: 23 Aug 2019 3:47:36 AM (4 days ago)
17. WAL Files
What to monitor:
Number of WAL files
Number of WAL files ready for archiving
Archiving failures
Rate of archiving
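The archiving rate can be derived from the `archived_count` and `stats_reset` columns of pg_stat_archiver. The sketch below is one plausible way to compute it and may differ from how pgmetrics does; `archive_rate_per_min` is a hypothetical helper, and the sample values come from the WAL Files output.

```python
from datetime import datetime


def archive_rate_per_min(archived_count: int, stats_reset: datetime,
                         now: datetime) -> float:
    """Average number of WAL files archived per minute since the stats reset."""
    minutes = (now - stats_reset).total_seconds() / 60.0
    return archived_count / minutes if minutes > 0 else 0.0


rate = archive_rate_per_min(
    7265,                                # "Totals: 7265 succeeded"
    datetime(2019, 8, 23, 3, 47, 36),    # "Totals Since"
    datetime(2019, 8, 27, 8, 47, 5),     # 13 seconds after "Last Archived"
)
```

Comparing this rate with the rate at which WAL files are generated shows whether archiving is keeping up.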
18. BG Writer
Checkpoint Rate: 0.20 per min
Average Write: 6.0 MiB per checkpoint
Total Checkpoints: 1214 sched (100.0%) + 0 req (0.0%) = 1214
Total Write: 1.3 TiB, @ 3.7 MiB per sec
Buffers Allocated: 257126091 (1.9 TiB)
Buffers Written: 938583 chkpt (0.5%) + 84471609 bgw (48.4%) +
89281998 be (51.1%)
Clean Scan Stops: 769729
BE fsyncs: 0
Counts Since: 23 Aug 2019 3:47:36 AM (4 days ago)
19. BG Writer
What to monitor:
Checkpoint rate (compare with checkpoint_timeout)
Ratio of scheduled to requested checkpoints
Percentage of buffers written by the bgwriter
Stops of the bgwriter clean scan run
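The percentages in the BG Writer output follow directly from three counters in pg_stat_bgwriter (`buffers_checkpoint`, `buffers_clean`, `buffers_backend` in the PostgreSQL versions of that time). A minimal sketch of the arithmetic, with hypothetical function names:

```python
def buffer_write_shares(buffers_checkpoint: int, buffers_clean: int,
                        buffers_backend: int) -> dict:
    """Percentage of buffers written by checkpoints, the bgwriter and backends."""
    total = buffers_checkpoint + buffers_clean + buffers_backend
    if total == 0:
        return {"checkpoint": 0.0, "bgwriter": 0.0, "backend": 0.0}
    return {
        "checkpoint": round(100.0 * buffers_checkpoint / total, 1),
        "bgwriter": round(100.0 * buffers_clean / total, 1),
        "backend": round(100.0 * buffers_backend / total, 1),
    }


def checkpoint_interval_minutes(checkpoints_per_min: float) -> float:
    """Average minutes between checkpoints, to compare with checkpoint_timeout."""
    return 1.0 / checkpoints_per_min


shares = buffer_write_shares(938583, 84471609, 89281998)
interval = checkpoint_interval_minutes(0.20)
```

With the sample numbers, checkpoints occur about every 5 minutes, which matches the default `checkpoint_timeout` of 5 minutes and is consistent with all checkpoints being scheduled rather than requested.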
21. Tablespaces
What to monitor:
Disk usage per tablespace
Inode usage (typically not an issue)
23. Locks
What to monitor:
Overall number of locks
‘relation’ locks (see also Blocked Queries at database level)
‘advisory’ locks
25. Database Overview
Database #4:
Name: bench
Owner: mdevan
Tablespace: pg_default
Connections: 1 (no max limit)
Frozen Xid Age: 149467276
Transactions: 159267223 (57.6%) commits,
117062056 (42.4%) rollbacks
Cache Hits: 98.1%
Rows Changed: ins 25.0%, upd 75.0%, del 0.0%
Total Temp: 0 B in 0 files
Problems: 0 deadlocks, 0 conflicts
Totals Since: 23 Aug 2019 3:47:39 AM (4 days ago)
Size: 1.4 GiB
26. Database Overview
What to monitor:
Number of connections
Transaction ID range
Cache efficiency
Commit ratio
Deadlock Count
Query Conflict Count (on standbys)
Size of database on-disk
Temporary files
Total data written into temporary files
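Cache efficiency and commit ratio are simple ratios over pg_stat_database counters (`blks_hit`, `blks_read`, `xact_commit`, `xact_rollback`). A small sketch, with hypothetical helper names and the transaction counts from the Database Overview output:

```python
def pct(part: int, whole: int) -> float:
    """part as a percentage of whole; 0.0 when there is no data yet."""
    return 0.0 if whole == 0 else 100.0 * part / whole


def commit_ratio(xact_commit: int, xact_rollback: int) -> float:
    """Share of transactions that committed."""
    return pct(xact_commit, xact_commit + xact_rollback)


def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Share of block requests served from shared buffers."""
    return pct(blks_hit, blks_hit + blks_read)


ratio = commit_ratio(159267223, 117062056)   # about 57.6, as shown above
```

These counters are cumulative since the last stats reset, so tracking the change between two collections is usually more informative than the lifetime value.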
28. Slow Queries
What to monitor:
Average time taken and rows/call for specific queries
29. Blocked Queries
Blocked Query #1:
Query: TRUNCATE TABLE locktest;
Started By: psql mdevan/bench (PID 28565)
Waiting Since: 28 Aug 2019 6:42:26 AM (30 seconds ago)
Waiting For:
Query: LOCK TABLE locktest IN ACCESS EXCLUSIVE MODE;
Lock: relation, AccessExclusiveLock, table public.locktest
Started By: psql mdevan/bench (PID 28174)
Waiting For:
Query: SELECT * FROM locktest FOR UPDATE;
Lock: relation, AccessExclusiveLock, table public.locktest
Started By: psql mdevan/bench (PID 28525)
Blocked Query #2:
Query: CLUSTER VERBOSE locktest;
Started By: psql mdevan/bench (PID 28588)
Waiting Since: 28 Aug 2019 6:42:45 AM (11 seconds ago)
Waiting For:
Query: LOCK TABLE locktest IN ACCESS EXCLUSIVE MODE;
Lock: relation, AccessExclusiveLock, table public.locktest
Started By: psql mdevan/bench (PID 28174)
Waiting For:
Query: SELECT * FROM locktest FOR UPDATE;
Lock: relation, AccessExclusiveLock, table public.locktest
Started By: psql mdevan/bench (PID 28525)
Waiting For:
Query: TRUNCATE TABLE locktest;
Lock: relation, AccessExclusiveLock, table public.locktest
Started By: psql mdevan/bench (PID 28565)
30. Blocked Queries
What to monitor:
Number of blocked queries
Record query information for later analysis
31. Tables
Table #2 in "bench":
Name: bench.public.pgbench_tellers
Columns: 4
Manual Vacuums: never
Manual Analyze: never
Auto Vacuums: 7237, last 30 seconds ago
Auto Analyze: 6910, last 30 seconds ago
Post-Analyze: 0.0% est. rows modified
Row Estimate: 100.0% live of total 100
Rows Changed: ins 0.0%, upd 99.4%, del 0.0%
HOT Updates: 99.4% of all updates
Seq Scans: 93933255, 100.0 rows/scan
Idx Scans: 95608483, 1.0 rows/scan
Cache Hits: 100.0% (idx=94.5%)
Size: 744 KiB
Bloat: 704 KiB (94.6%)
32. Tables
What to monitor:
Time since last vacuum
Time since last analyze
HOT updates
Cache efficiency
Number of sequential scans
Size of the table on-disk
Bloat, both in bytes as well as a % of table size
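Tracking bloat both in bytes and as a percentage matters because a small table can be almost entirely bloat without being a problem. The sketch below shows the arithmetic and a size-aware check; `bloat_pct`, `bloat_alert` and the default thresholds are illustrative assumptions.

```python
KIB = 1024
MIB = 1024 * KIB


def bloat_pct(bloat_bytes: int, size_bytes: int) -> float:
    """Bloat as a percentage of the table's on-disk size."""
    return 0.0 if size_bytes == 0 else 100.0 * bloat_bytes / size_bytes


def bloat_alert(size_bytes: int, bloat_bytes: int,
                min_size: int = 100 * MIB, max_pct: float = 10.0) -> bool:
    """True only for tables that are both large enough and bloated enough."""
    return size_bytes >= min_size and bloat_pct(bloat_bytes, size_bytes) > max_pct


# pgbench_tellers from the sample output: 704 KiB of bloat in a 744 KiB table.
tellers_pct = bloat_pct(704 * KIB, 744 * KIB)
tellers_alert = bloat_alert(744 * KIB, 704 * KIB)
```

Here the table is about 94.6% bloat, yet the check stays quiet because the table is far below the size threshold.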
34. Indexes
What to monitor:
Unused indexes (same scan count over n days)
Cache efficiency
Size of the index on-disk
Bloat, both in bytes as well as a % of index size
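"Same scan count over n days" can be checked by comparing two snapshots of `idx_scan` (from pg_stat_user_indexes) taken n days apart. A minimal sketch, assuming the snapshots are available as dictionaries; `unused_indexes` is a hypothetical helper and ignores statistics resets between the snapshots.

```python
def unused_indexes(old_scans: dict, new_scans: dict) -> list:
    """Names of indexes whose idx_scan count did not change between snapshots.

    Indexes missing from the older snapshot are skipped, since there is
    not enough history to judge them.
    """
    return sorted(
        name
        for name, count in new_scans.items()
        if name in old_scans and old_scans[name] == count
    )


old = {"accounts_pkey": 100, "accounts_name_idx": 7}
new = {"accounts_pkey": 250, "accounts_name_idx": 7, "new_idx": 0}
candidates = unused_indexes(old, new)
```

Indexes that back primary key or unique constraints may show no scans and still be needed, so the result is a list of candidates to review rather than a drop list.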
36. Outgoing Replication, Physical
Outgoing physical replication connection to a standby:
Destination #2:
User: repluser
Application: walreceiver
Client Address: 127.0.0.1/32
State: streaming
Started At: 23 Aug 2019 6:00:15 AM (10 minutes ago)
Sent LSN: 13E/DD556190
Written Until: 13E/DD556190 (no write lag)
Flushed Until: 13E/DD555D30 (flush lag = 1.1 KiB)
Replayed Until: 13E/DD555D30 (no replay lag)
Sync Priority: 0
Sync State: async
37. Outgoing Replication, Physical
Generic outgoing physical replication connection:
Destination #3:
User: mdevan
Application: pg_receivewal
Client Address:
State: streaming
Started At: 23 Aug 2019 4:32:35 AM (49 seconds ago)
Sent LSN: 13E/63643DC8
Written Until: 13E/636439B0 (write lag = 1.0 KiB)
Flushed Until: 13E/63000000 (flush lag = 6.3 MiB)
Replayed Until:
Sync Priority: 0
Sync State: async
38. Outgoing Replication, Physical
What to monitor:
Write lag
Flush lag
Replay lag
WAL sender state – "streaming", "catchup"
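The lag figures in the replication output are byte differences between LSNs. An LSN such as 13E/DD556190 is a 64-bit position written as two hexadecimal halves, so the lag can be computed client-side as sketched below (the server-side equivalent is `pg_wal_lsn_diff()`). The helper names are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN like '13E/DD556190' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(ahead: str, behind: str) -> int:
    """Bytes of WAL between two LSNs."""
    return parse_lsn(ahead) - parse_lsn(behind)


# Standby from the first sample: sent vs. flushed, about 1.1 KiB.
flush_lag_standby = lag_bytes("13E/DD556190", "13E/DD555D30")
# pg_receivewal sample: sent vs. flushed, about 6.3 MiB.
flush_lag_receivewal = lag_bytes("13E/63643DC8", "13E/63000000")
```

Measuring lag in bytes complements the time-based lag columns, which report nothing while the primary is idle.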
39. Physical Replication Slots
Physical Replication Slots:
+-------------+--------+---------------+--------------+-----------+
| Name        | Active | Oldest Txn ID | Restart LSN  | Temporary |
+-------------+--------+---------------+--------------+-----------+
| repl_test_1 | yes    |               | 13E/63000000 | no        |
| standby1    | yes    |               | 13E/63643DC8 | no        |
+-------------+--------+---------------+--------------+-----------+
40. Physical Replication Slots
What to monitor:
Is the slot active or not?
41. Incoming Replication
Recovery Status:
Replay paused: no
Received LSN: 160/2D09CE0
Replayed LSN: 160/2D09CE0 (no lag)
Last Replayed Txn: 28 Aug 2019 4:52:39 AM (now)
Incoming Replication Stats:
Status: streaming
Received LSN: 160/2D09CE0 (started at 153/48000000, 51 GiB)
Timeline: 1 (was 1 at start)
Latency: 58us
Replication Slot: standby1
42. Incoming Replication
What to monitor:
Replay lag
Latency
43. Outgoing Replication, Logical
Outgoing logical replication connection:
Destination #1:
User: mdevan
Application: pg_recvlogical
Client Address:
State: streaming
Started At: 23 Aug 2019 3:47:39 AM (45 minutes ago)
Sent LSN: 13E/63643DC8
Written Until: 13E/636439B0 (write lag = 1.0 KiB)
Flushed Until: 13E/636439B0 (no flush lag)
Replayed Until:
Sync Priority: 0
Sync State: async
44. Outgoing Replication, Logical
What to monitor:
Write lag
Flush lag
Replay lag
51. Other
What to monitor:
vacuum jobs
disabled triggers
changes to configuration settings
changes to the list of roles & membership
system-level
maximum load average in the last 24 hours
maximum memory usage in the last 24 hours
disk IOPS and bandwidth (MB/s)
CPU usage (esp. user, iowait)
52. pgDash
We spun off our Postgres monitoring into its own product.
pgDash is what you’d do with the output of pgmetrics – dashboards,
diagnostics, query performance, alerting and more.
53. pgDash Features - https://pgdash.io
Diagnostics - Automatically analyze and call out situations like high
bloat, tables not vacuumed in a week, inactive replication slots
Query Performance - Queries executed during a time range
Blocked Queries - Historical information about locks and blocked
queries, including the SQL itself
Index Management - Looking for indexes not used in the last 30
days?
Alerts - Meaningful alerts: "bloat > 10% for tables of size 100 MiB
or more"
And More - Replication, In-Depth metrics about Tables and Indexes,
Tablespaces, Backends, ...
54. pgmetrics Roadmap
Features in the pipeline:
Collect data from log files
Deadlock details
auto_explain output to get query plans
vacuum job information
Collect/monitor data from more sources
AWS CloudWatch & AWS Enhanced Monitoring for RDS
PgBouncer (already available)
PgPool
Collect query plans directly..
from Postgres, maybe via an enhanced pg_stat_statements?
55. Thank you!
We’d love to hear your thoughts about pgmetrics!
pgmetrics Home https://pgmetrics.io
GitHub https://github.com/rapidloop/pgmetrics
pgDash https://pgdash.io
RapidLoop https://www.rapidloop.com
Me! mahadevan@rapidloop.com
Thanks for your time and have a great evening!