Slides from my talk at Citus Con on Optimizing Autovacuum: PostgreSQL's vacuum cleaner.
Talk Abstract below:
If you have run PostgreSQL for any serious OLTP workload, you have heard of autovacuum. Autovacuum is PostgreSQL’s way of running vacuum regularly to clear bloat from your tables and indexes. However, in spite of having autovacuum on, a large number of PostgreSQL users still see their database bloat increasing. What’s going on?
In the last decade, I have personally worked with 50+ Postgres customers who have struggled to figure out why autovacuum isn’t working how they expect. In this talk, we will walk through what I’ve learned from analyzing and improving these production Postgres databases. You will learn how autovacuum works, how to figure out why it is not working as you expect, and what you can do to fix it.
2. About Me
Engineering Manager in the Open Source PG team at Microsoft
Working with PostgreSQL for > 10 years
Previously solutions / customer engineer at Microsoft, Citus
Learning how to be a parent to my 10 month old daughter
3. Intro to VACUUM
In Postgres, updates and deletes leave old row versions behind for concurrency control: an update writes a new row version, and a delete only marks the row as deleted.
Once all transactions to which a row version is visible complete, the row version can be removed.
This is done by VACUUM.
[Figure: a table page before VACUUM holds four row versions: Data4, Data3 (updated to Data4), Data2, and Data1 (deleted). After VACUUM only Data4 and Data2 remain.]
4. How to invoke VACUUM in PostgreSQL
VACUUM [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ table_and_columns [, ...] ]
where option can be one of:
FULL [ boolean ]
FREEZE [ boolean ]
VERBOSE [ boolean ]
ANALYZE [ boolean ]
DISABLE_PAGE_SKIPPING [ boolean ]
SKIP_LOCKED [ boolean ]
INDEX_CLEANUP [ boolean ]
TRUNCATE [ boolean ]
and table_and_columns is:
table_name [ ( column_name [, ...] ) ]
5. VACUUMing manually requires you to think about
Manually scheduling these jobs
When to VACUUM?
Which tables to VACUUM? ... different tables change at different rates
How many to VACUUM in parallel? How to throttle?
And others…
6. Enter Autovacuum
• Wake up every autovacuum_naptime seconds
• Check for tables modified significantly
• Start workers to run VACUUM and ANALYZE jobs in parallel
• Repeat
• On by default. DO NOT turn off
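To check how autovacuum is currently configured on a server, the relevant settings can be read from pg_settings. This is a minimal sketch that only uses standard catalog columns:

```sql
-- List every autovacuum-related setting with its current value and origin.
SELECT name, setting, unit, source
FROM pg_settings
WHERE name LIKE 'autovacuum%'
ORDER BY name;
```

The autovacuum row should show "on", and autovacuum_naptime is reported in seconds.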
8. Problems with un-tuned autovacuums
Vacuum too aggressively on a small workload / machine -> Vacuum consumes a lot of resources -> Slowness
Example 1: 10GB database with 10 TPS + 2 cores
Vacuum too little on a large workload / machine -> Excessive bloat -> Too much storage usage -> Slowness
Example 2: 1TB database with 10,000 TPS + 64 cores
9. 4 common autovacuum problems
VACUUM can’t clean dead rows
VACUUM isn’t triggered often enough
VACUUM running too slow
Transaction ID wraparound
11. When a VACUUM is triggered
Obsoleted tuples > autovacuum_vacuum_threshold +
autovacuum_vacuum_scale_factor * number of tuples
OR
Inserted tuples > autovacuum_vacuum_insert_threshold +
autovacuum_vacuum_insert_scale_factor * number of tuples
OR
age(relfrozenxid) > autovacuum_freeze_max_age transactions
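The first condition can be checked per table with a query along these lines. It is a sketch under two assumptions: it uses the server-wide settings only (per-table storage parameters are ignored), and it relies on the planner's row estimate in pg_class.reltuples, which can be stale or negative for tables that have never been analyzed.

```sql
-- Compare dead tuples per table with the point at which autovacuum triggers.
SELECT s.schemaname,
       s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8
           * greatest(c.reltuples, 0) AS dead_tuple_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 20;
```

Tables whose n_dead_tup sits well above dead_tuple_trigger for long periods are candidates for the problems discussed in the following slides.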
12. Signs VACUUM needs to be triggered more
1. Bloat or dead tuples are growing faster than expected
2. You have to manually vacuum tables to clear up bloat
3. Last autovacuum for a fast-growing table is too far in the past
SELECT relname, last_autovacuum FROM pg_stat_user_tables
4. Autovacuum count for a fast-growing table is low
SELECT relname, autovacuum_count, vacuum_count FROM pg_stat_user_tables
13. Solving “VACUUM isn’t triggered often enough”
Adjust scale factors according to workload
Set autovacuum_vacuum_scale_factor and autovacuum_vacuum_insert_scale_factor based on the size and growth rate of the tables.
Example: For a table which has 1B rows, that’s 200M rows changed before VACUUM gets triggered with the default scale factor (0.2). A scale factor of 0.02 or even 0.002 might be better.
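Scale factors can be overridden per table rather than server-wide. A minimal sketch, assuming a hypothetical large table named big_events and PostgreSQL 13 or later (the insert scale factor does not exist in older releases):

```sql
-- big_events is a hypothetical table name; 0.02 is the example value from the slide.
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_insert_scale_factor = 0.02
);

-- Confirm the per-table overrides.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'big_events';
```

Per-table settings leave small tables on the defaults, where a 20% change is still a small absolute number of rows.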
15. Signs VACUUM is running too slow
• VACUUM is always running on your database
• Your rate of cleaning up bloat < rate at which you are updating / inserting rows.
16. Solving “VACUUM running too slow” - Basic
Disable cost limiting
VACUUM might be sleeping occasionally to reduce I/O impact.
Set autovacuum_vacuum_cost_delay to 0, OR change autovacuum_vacuum_cost_limit to a high value (10000)
Increase autovacuum_max_workers
If you have many large tables and cores
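These settings can be changed with ALTER SYSTEM. The sketch below assumes superuser access; the worker count of 6 is an illustrative value, not a recommendation from the talk.

```sql
-- Raise the cost limit so autovacuum sleeps less often.
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;
-- Alternative: remove the sleep entirely.
-- ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;

-- Cost settings are picked up on a configuration reload.
SELECT pg_reload_conf();

-- Changing the worker count has traditionally required a server restart.
ALTER SYSTEM SET autovacuum_max_workers = 6;
```

ALTER SYSTEM cannot run inside a transaction block, so each statement should be issued on its own.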
17. Looking deeper - pg_stat_progress_vacuum
Scanning heap slowly
• Compare heap_blks_scanned vs heap_blks_total to see progress
Vacuuming indexes slowly
• index_vacuum_count is high
• Too many indexes to be vacuumed.
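Both signs can be read from one query against pg_stat_progress_vacuum. A sketch, restricted to the current database so that the relid cast resolves to a table name:

```sql
-- Progress of every VACUUM currently running in this database.
SELECT p.pid,
       p.relid::regclass AS table_name,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       round(100.0 * p.heap_blks_scanned
             / nullif(p.heap_blks_total, 0), 1) AS pct_scanned,
       p.index_vacuum_count
FROM pg_stat_progress_vacuum p
WHERE p.datname = current_database();
```

Running it a few times in a row shows whether pct_scanned is moving and whether index_vacuum_count keeps climbing for the same table.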
18. Solving “VACUUM running too slow” further
Scanning heap slowly
• Prefetch relations in memory.
• Larger shared_buffers for better caching.
Vacuuming indexes slowly
• Increase max_parallel_maintenance_workers to vacuum indexes in parallel.
• Increase maintenance_work_mem / autovacuum_work_mem to store more dead tuples and reduce number of index vacuuming cycles.
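For a manual VACUUM, the index-related settings above can be tried at session level. This is a sketch only: big_events is a hypothetical table name, the memory and worker values are illustrative assumptions, and the PARALLEL option requires PostgreSQL 13 or later.

```sql
-- Session-level settings; values are illustrative.
SET maintenance_work_mem = '1GB';
SET max_parallel_maintenance_workers = 4;

-- Vacuum one table, processing its indexes with up to 4 parallel workers.
VACUUM (PARALLEL 4, VERBOSE) big_events;
```

The PARALLEL option belongs to the manual VACUUM command, and the VERBOSE output reports how many index scans were needed.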
19. All done? Well…
Autovacuum is now triggering VACUUM, and VACUUM has resources.
But what if dead tuples still don’t come down?
22. Who needs them?
Long running backends
Long-running queries on standby
Replication slots not in use
Uncommitted prepared transactions
23. Long running backends
Find long running backends
SELECT pid, datname, usename,
state, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
Solutions
• Use pg_terminate_backend() to terminate long running backends.
• Consider setting up a large statement_timeout or at least log_min_duration_statement to log long running queries.
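Before terminating anything, it helps to see how long each transaction has been open. A sketch, assuming a one-hour cutoff (an arbitrary threshold chosen for illustration) and a placeholder pid:

```sql
-- Backends whose current transaction has been open for more than an hour.
SELECT pid, usename, state,
       now() - xact_start AS xact_age,
       left(query, 60) AS query_start
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour'
ORDER BY xact_start;

-- 12345 is a placeholder pid taken from the query above.
SELECT pg_terminate_backend(12345);
```

Sessions in the "idle in transaction" state are worth a close look, since they hold back cleanup without doing any work.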
24. Long running queries on standby
Setting hot_standby_feedback = on reduces replication conflicts, but it can cause queries on the secondary to “hold” rows on the primary.
Solutions
• Limit long running queries on the secondary.
• Consider dealing with replication conflicts and setting hot_standby_feedback = off.
• Use vacuum_defer_cleanup_age to keep rows for longer and reduce conflicts (this setting was removed in PostgreSQL 16).
Find xmin of replication connections
SELECT pid, client_hostname,
state, backend_xmin
FROM pg_stat_replication
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
25. Unused replication slots
If standby is delayed or down, then VACUUMing can be delayed.
• Physical replication with hot_standby_feedback = on
• System catalog bloat with logical replication
Solutions
• Use SELECT pg_drop_replication_slot() to drop unneeded replication slots.
• Consider setting hot_standby_feedback to off.
Find replication slots retaining old transactions
SELECT slot_name, slot_type,
database, xmin, catalog_xmin
FROM pg_replication_slots
ORDER BY greatest(age(xmin),
age(catalog_xmin)) DESC NULLS LAST;
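Slots that no consumer is connected to are the usual suspects. A sketch, where old_standby_slot is a hypothetical slot name; dropping a slot is irreversible, so confirm that nothing still depends on it first:

```sql
-- Slots with no active consumer.
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots
WHERE NOT active;

-- old_standby_slot is a hypothetical slot name.
SELECT pg_drop_replication_slot('old_standby_slot');
```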
26. Uncommitted “PREPARED” transactions
If 2PC transactions are prepared but not committed or rolled back, the rows they can see are kept.
Solutions
• Use ROLLBACK PREPARED … to remove old and unneeded prepared transactions.
Find old prepared transactions
SELECT gid, prepared, owner,
database, transaction AS xmin
FROM pg_prepared_xacts
ORDER BY age(transaction) DESC;
27. Another possible reason: DDL operations causing VACUUM to terminate
• VACUUM terminates itself if it can’t acquire required locks.
• If there’s DDL activity all the time on tables, that might not allow VACUUM to run and hence leave a lot of dead tuples.
29. What do you know about transaction ID wraparound VACUUMs?
30. Transaction ID Wraparound VACUUMs - Quick Intro
• Postgres transaction IDs are limited (32 bits).
• To allow > 2**32 transactions, VACUUM freezes rows visible to all future transactions.
• If not autovacuumed, a wraparound VACUUM is invoked on a table once every autovacuum_freeze_max_age minus vacuum_freeze_min_age transactions.
32. Managing Transaction ID Wraparound VACUUMs
• If you see a lot of autovacuum (to prevent wraparound), it might make sense to increase autovacuum_freeze_max_age to > 1B.
• Monitor progress towards transaction ID wraparound and set alerts.
• If alerts are triggered, manually VACUUM (and not VACUUM FREEZE) or unblock VACUUM’s progress.
• If you see WARNING: database "mydb" must be vacuumed within X transactions, execute a DB-wide VACUUM immediately.
• If these WARNINGs are ignored, the DB will shut down and not start until it is VACUUMed.
• Understand how they are different (previous slide) and manage your application accordingly.
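Progress towards wraparound can be monitored with age() on the frozen transaction IDs. A sketch of two queries that could feed an alert; the percentage is computed against the 2^31 - 1 limit, and the alert threshold itself is left to the reader:

```sql
-- Per database: how far the oldest unfrozen transaction ID has aged.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 2) AS pct_towards_wraparound
FROM pg_database
ORDER BY xid_age DESC;

-- Per table in the current database: the tables holding the age back.
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY xid_age DESC
LIMIT 20;
```

Comparing xid_age with autovacuum_freeze_max_age shows which tables are next in line for a wraparound VACUUM.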
33. 5 things to remember about Autovacuum & Postgres
Autovacuum can solve most of your bloat problems.
Configure it to make sure it’s running often and fast enough.
Make sure you don’t have things blocking cleaning of dead tuples.
Manage transaction ID wraparound VACUUMs so they don’t impact your workload but run at the right time.
Ensure you VACUUM before the system has to shut down due to running out of transaction IDs.