In the “Sharing is caring” spirit, we came up with a series of internal talks called By Showmaxers, for Showmaxers, and we recently started making them public. Talks about networks and Android app building are already available.
Our latest talk focuses on PostgreSQL Terminology, and is led by Angus Dippenaar. He worked on Showmax projects from South Africa, and moved to work with us in Prague, Czech Republic.
The talk was meant to fill some holes in our knowledge of PostgreSQL. So, it guides you through the basic PostgreSQL terminology you need to understand when reading the official documentation and blogs.
You may learn what all these PostgreSQL terms mean:
Command, query, local or global object, non-schema local objects, relation, tablespace, database, database cluster, instance and its processes like postmaster or backend; session, connection, heap, file segment, table, TOAST, tuple, view, materialized (view), transaction, commit, rollback, index, write-ahead log, WAL record, WAL file, checkpoint, Multi-version concurrency control (MVCC), dead tuples (dead rows), or transaction exhaustion.
The terminology is followed by a demonstration of transaction exhaustion.
Get the complete explanation and see the demonstration of the transaction exhaustion and of tuple freezing in the talk on YouTube: https://youtu.be/E-RkI3Ws7gM.
2. Part 1:
Terminology
Some of what I have here is copy-pasta from https://www.postgresql.org/docs/current/glossary.html
with some extra information added from their respective pages as well as some of my own knowledge
and research.
Glossary
3. Command
You will see this used all over the documentation, but it's never explained.
A command is a string that is sent to the server in order for it to do something for you. In psql, commands are
separated by semicolons.
A command generally is used to:
● fetch data
● modify data
● administer the PostgreSQL instance.
SELECT * FROM my_table;
CREATE EXTENSION pg_stat_statements;
BEGIN; DELETE FROM my_table;
4. Object
Any object that can be created with a CREATE command.
Most objects are specific to one database, and are commonly known as local SQL objects.
5. Local Object
Schema Local Objects: Name and type are unique within each schema
● Relations
● Routines
● Data types
CREATE TABLE; CREATE VIEW; CREATE INDEX
CREATE FUNCTION
CREATE TYPE
6. Non-schema Local Objects
Local Objects: Name and type are unique within each database
● Extensions
● Data type casts
● Foreign data wrappers
CREATE EXTENSION
CREATE CAST
CREATE FOREIGN DATA WRAPPER
7. Global Objects
Exist entirely outside of any specific database. Names are unique within the database cluster.
● Roles
● Tablespaces
● Replication origins
● Subscriptions for logical replication
● Databases
CREATE ROLE
CREATE TABLESPACE
SELECT pg_replication_origin_create('origin_name')
CREATE SUBSCRIPTION
CREATE DATABASE
8. Tablespace
A named location on the server file system.
Allows database admins to define locations in the file system where the files representing database
objects can be stored.
This is very useful if you have databases of varying sizes or for optimizing performance. You can put a
bigger, or less needed database on a slower disk, and a very active database on a faster disk.
Initially, a database cluster contains a single usable tablespace which is used as the default for all SQL
objects, called pg_default .
9. Tablespace
Some examples:
CREATE TABLESPACE tablespace_name LOCATION 'directory';
Then you can create an SQL object in it:
CREATE DATABASE name TABLESPACE tablespace_name;
CREATE TABLE name (column_name integer) TABLESPACE tablespace_name;
CREATE INDEX name ON table_name (column_name) TABLESPACE tablespace_name;
10. Database
A named collection of local SQL objects.
You need to connect to a database when connecting to a cluster.
The SQL standard calls databases “catalogs”, but there is no difference in practice.
There are 2 ways to create a database:
1. CREATE DATABASE dbname [OWNER rolename] from an SQL environment
2. createdb [-O rolename] dbname from the shell
There are 2 ways to destroy a database:
1. DROP DATABASE dbname from an SQL environment
2. dropdb dbname from the shell
11. Database Cluster
A collection of databases and global SQL objects, and their common static and dynamic metadata.
In PostgreSQL, the term cluster is also sometimes used to refer to an instance.
12. Instance
A group of backend and auxiliary processes that communicate using a common shared memory area.
One postmaster process manages the instance.
One instance manages exactly one database cluster with all its databases.
Many instances can run on the same server as long as their TCP ports do not conflict.
13. Postmaster
The very first process of an instance.
It manages the other processes and creates backend processes on demand.
14. Backend
Process of an instance which acts on behalf of a client session and handles its requests.
One backend process will be forked for each client session.
15. Session
A state that allows a client and backend to interact, communicating over a connection.
16. Connection
An established line of communication between a client process and a backend process, supporting a
session.
Usually over a network, but it can also work over a local Unix-domain socket.
17. Query
A type of command sent by a client to a backend.
Most of the time, a query will be retrieving data or modifying the database.
18. Relation
The generic term for all objects in a database that have:
1. A name
2. A list of attributes defined in a specific order
Includes:
● Tables
● Sequences
● Views
● Foreign Tables
● Materialized views
● Composite types
● Indexes
19. Heap
This is not the memory heap of the application.
It is the data for a relation.
The heap is stored in one or more file segments.
20. File Segment
A physical file which stores data for a given relation.
The segment size is set with --with-segsize during compilation; the default is 1 GB.
If a relation exceeds the size limit, it is split into multiple segments.
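To make the splitting concrete, here is a small Python sketch that lists the segment files a relation of a given size would need. It assumes the default 1 GB segment size and the naming scheme described on the storage file layout page (the first segment is named after the filenode, later ones get a .1, .2, ... suffix). The filenode 16384 and the helper name segment_names are made up for the example.

```python
import math

SEGMENT_SIZE = 1024 ** 3  # assumed default segment size: 1 GB


def segment_names(filenode: int, relation_bytes: int) -> list:
    """Return the segment file names needed to hold relation_bytes of data."""
    # Even an empty relation has its first segment file.
    count = max(1, math.ceil(relation_bytes / SEGMENT_SIZE))
    # The first segment has no suffix; later ones are filenode.1, filenode.2, ...
    return [str(filenode) if i == 0 else f"{filenode}.{i}" for i in range(count)]


# A 2.5 GB relation with the (hypothetical) filenode 16384.
print(segment_names(16384, int(2.5 * 1024 ** 3)))
```

This is only a size calculation; it does not look at a real data directory.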
To know more than you ever needed to, see: https://www.postgresql.org/docs/current/storage-file-layout.html
22. TOAST
Stands for: The Oversized-Attribute Storage Technique
A mechanism by which large attributes of table rows are split and stored in a secondary table, called the
TOAST table.
Each relation with large attributes has its own TOAST table.
Long string storage is generally where you will find TOAST being used.
24. Tuple
A collection of attributes in a fixed order.
That order may be defined by the relation where the tuple is contained.
When talking about a table, a tuple is generally referred to as a row.
25. View
A relation that is defined by a SELECT statement, but has no storage of its own.
Any time a query references a view, the definition of the view is substituted into the query.
This substitution happens before the query reaches the planner/optimizer.
26. Materialized
The property that some information has been pre-computed and stored for later use, rather than
computing it on-the-fly.
27. Materialized View
Like an immutable table.
Update the results with REFRESH MATERIALIZED VIEW.
You can CREATE INDEX on it.
You can also ALTER or DROP a MATERIALIZED VIEW like you can with a table.
28. Transaction
A combination of commands that must act as a single atomic command.
They all succeed or all fail as a single unit.
Their effects are not visible to other sessions until the transaction is complete.
Each transaction has a Transaction ID, or XID. A transaction is assigned its XID when it first
causes a database modification.
Manually started with BEGIN, and ends with ROLLBACK or COMMIT.
31. Index
A relation that contains data derived from a table or materialized view.
Its internal structure supports fast retrieval of and access to the original data.
32. Write-Ahead Log (WAL)
The journal that keeps track of the changes in the database cluster.
Consists of multiple WAL records, written sequentially to WAL files.
There is only 1 WAL per cluster.
33. WAL Record
A low-level, binary description of an individual change.
Replayed in the event of a database failure.
It's more efficient to append to this sequential log than to modify the data files directly on every change.
Also a method of Postgres replication. Records are streamed to the replicas and replayed.
A change to the cluster is considered persistent when its WAL record is written to disk.
34. WAL File
A.K.A. WAL segment. A.K.A. WAL segment file.
If the system crashes, the files are read in order, eventually restoring the last state of the database.
Barman ships these WAL files to allow you to restore your database to any point in time by replaying
the WAL records until the requested time has been reached.
Each WAL file can be released after a checkpoint writes all the changes to the corresponding data files.
Releasing the file can be done by either:
● Deleting it
● Changing its name so that it will be used in the future. A.K.A. recycling.
35. Checkpoint
A point in the WAL sequence at which it is guaranteed that the heap and index data files have been
updated with all information from shared memory modified before that checkpoint.
A checkpoint record is written and flushed to WAL to mark that point.
A checkpoint is started:
● Every checkpoint_timeout seconds
● If max_wal_size is about to be exceeded
● When calling CHECKPOINT
Whichever comes first.
36. Multi-version concurrency control (MVCC)
A mechanism designed to allow several transactions to be reading and writing the same rows without
one process causing other processes to stall.
A read will not block a write and a write will not block a read.
37. How MVCC works
Postgres stores transaction information with each row: xmin and xmax.
These are used to determine if a row is visible to a transaction or not.
Roughly, a row is visible to a transaction if xmin < XID < xmax (an xmax of 0 means the row has not been deleted).
This depends on the isolation level.
By default (READ COMMITTED), as soon as a transaction is committed, the new visibility is applied to all transactions.
The SERIALIZABLE isolation level works as described before.
38. What actually happens
A row is given:
● xmin when it is INSERTed
● xmax when it is marked as DELETEd.
Updating a row is like inserting a new row and deleting the old one.
You can query xmin and xmax from any row.
SELECT xmin, xmax FROM table_name;
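The xmin/xmax bookkeeping can be sketched as a toy model in Python. This is not how Postgres implements it: it ignores commit status, snapshots, isolation levels and XID wraparound, and the names RowVersion, is_visible and update are invented for the example. It only illustrates the simplified rule from the slides and the "update = insert + delete" idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    xmin: int                   # XID of the inserting transaction
    xmax: Optional[int] = None  # XID of the deleting transaction, if any


def is_visible(row: RowVersion, xid: int) -> bool:
    """Simplified rule from the slides: visible if xmin < XID < xmax."""
    inserted_before = row.xmin < xid
    not_yet_deleted = row.xmax is None or xid < row.xmax
    return inserted_before and not_yet_deleted


def update(rows: list, old: RowVersion, new_value: str, xid: int) -> RowVersion:
    """An UPDATE marks the old version as deleted and inserts a new version."""
    old.xmax = xid
    new = RowVersion(new_value, xmin=xid)
    rows.append(new)
    return new


rows = [RowVersion("v1", xmin=100)]
update(rows, rows[0], "v2", xid=105)

# A transaction with XID 103 still sees v1; one with XID 110 sees v2.
print([r.value for r in rows if is_visible(r, 103)])
print([r.value for r in rows if is_visible(r, 110)])
```

Both versions of the row exist side by side, which is why a reader and a writer do not have to block each other.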
39. Dead tuples (Dead rows)
When a tuple (row) is no longer visible to any transaction, it is considered dead.
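Under the same simplified visibility rule (a row is visible while XID < xmax), a dead-tuple check can be sketched in a few lines of Python. The function name is_dead and the list of active XIDs are assumptions for illustration; the sketch also assumes the deleting transaction has committed and ignores wraparound.

```python
def is_dead(xmax, active_xids):
    """A row version is dead once no active transaction can still see it."""
    if xmax is None:  # never deleted, so still live
        return False
    # Every active transaction must have an XID after the deletion.
    return all(xmax < xid for xid in active_xids)


print(is_dead(105, [103, 110]))  # transaction 103 can still see the row
print(is_dead(105, [108, 110]))  # no active transaction can see it
```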
40. Time for a little quest
Transaction Exhaustion (Wraparound)
41. Transaction Exhaustion
Transaction IDs are 32 bits, so you can have a total of 2^32 transactions.
The XIDs are split into 2 parts:
● XIDs in the past
● XIDs in the future
You can get your current XID with:
SELECT txid_current();
42. Transaction Exhaustion
When txid_current reaches 2^32, the next transaction will wrap around back to 0.
All of a sudden, all rows appear to be in the future.
All deleted rows are not deleted anymore and all created rows are not created.
This is referred to as “Not a good time”.
But it actually works a bit differently...
45. Transaction Exhaustion
Basically:
● Past XIDs are txid_current - 2^31 to txid_current - 1.
● Future XIDs are txid_current + 1 to txid_current + 2^31 - 1.
So we have ~2 billion transactions.
About 1 million transactions before “Not a good time”, Postgres will not allow any new
transactions, and will start a VACUUM, even if autovacuum is not enabled.
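The past/future split can be sketched as a modulo-2^32 comparison in Python. The helper name xid_precedes is made up for this example, and the sketch treats every XID as an ordinary one (it ignores the special reserved XIDs that Postgres handles separately).

```python
XID_SPACE = 2 ** 32
HALF = 2 ** 31


def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is in the past of XID b, using modulo-2^32 arithmetic."""
    diff = (a - b) % XID_SPACE  # wrap into 0 .. 2^32 - 1
    if diff >= HALF:            # reinterpret as a signed 32-bit value
        diff -= XID_SPACE
    return diff < 0


current = 100
print(xid_precedes(current - 1, current))                   # just behind us
print(xid_precedes((current - HALF) % XID_SPACE, current))  # oldest "past" XID
print(xid_precedes(current + 1, current))                   # just ahead of us
print(xid_precedes(current + HALF - 1, current))            # furthest "future" XID
```

Note that a numerically huge XID such as 4 294 967 290 counts as being in the past of XID 100, because the comparison is circular rather than linear.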
46. Transaction Exhaustion
What “Not a good time” looks like in the logs:
WARNING: database "mydb" must be vacuumed within x transactions
HINT: To avoid database shutdown, execute a database-wide VACUUM in "mydb"
47. Vacuum
It has 2 jobs:
1. Remove dead tuples from tables or materialized views.
2. Freeze tuples.
VACUUM steps, as reported by SELECT datname, phase FROM pg_stat_progress_vacuum:
1. initializing
2. scanning heap
3. vacuuming indexes
4. vacuuming heap
5. cleaning up indexes
6. truncating heap
7. performing final cleanup
See: https://www.postgresql.org/docs/current/progress-reporting.html
48. Tuple Freezing
Each row has a frozen bit which, if set, means that no matter what xmin and xmax are set
to, this row is always in the past.
Tuple freezing is the process of setting this bit on all tuples that are in the past of all the current
transactions.
Each table has a relfrozenxid value that is the xmin of the oldest row that is not frozen.
datfrozenxid is the oldest relfrozenxid for the database.
So datfrozenxid + 2^31 - 1 million is actually “Not a good time”.
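As a worked example of that formula, the Python sketch below computes how many transactions are left before new ones are refused. It takes the 1 million margin from the talk at face value, treats both values as plain integers that have not wrapped, and uses a made-up helper name, xids_left. The inputs are the approximate values from the visualization slides.

```python
WRAP_HORIZON = 2 ** 31
SAFETY_MARGIN = 1_000_000  # "about 1 million", as stated in the talk


def xids_left(datfrozenxid: int, txid_current: int) -> int:
    """Transactions remaining before new transactions are refused."""
    stop_point = datfrozenxid + WRAP_HORIZON - SAFETY_MARGIN
    return stop_point - txid_current


print(xids_left(0, 900_000_000))              # everything still seems normal
print(xids_left(0, 1_400_000_000))            # getting closer
print(xids_left(1_800_000_000, 2_300_000_000))  # after VACUUM froze old tuples
```

Freezing tuples raises datfrozenxid, which pushes the stop point forward and is why the database becomes usable again after VACUUM.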
51. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = 0.
txid_current = ~900M.
Everything still seems normal.
[Diagram: number line marking datfrozenxid, datfrozenxid + 2^31, txid_current and txid_current + 2^31]
52. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = 0.
txid_current = ~1 400M.
We’re starting to get close to the database shutting down.
53. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = 0.
txid_current = ~2 000M.
“Not a good time” happened.
At around 2 billion transactions, the database stops accepting connections and starts VACUUM.
54. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = ~350M.
txid_current = ~2 000M.
VACUUM starts freezing tuples.
We can only connect again when it’s done.
55. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = ~700M.
txid_current = ~2 000M.
VACUUM is still freezing tuples.
Still no new transactions allowed.
56. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = ~1 000M.
txid_current = ~2 000M.
VACUUM is still freezing tuples.
Still no new transactions allowed.
57. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = ~1 400M.
txid_current = ~2 000M.
VACUUM is still freezing tuples.
Still no new transactions allowed.
58. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = ~1 800M.
txid_current = ~2 000M.
VACUUM is done.
New transactions are allowed to start again.
59. Transaction Exhaustion and Tuple Freezing
Visualized
datfrozenxid = ~1 800M.
txid_current = ~2 300M.
Database is working again.
60. Transaction Exhaustion and Tuple Freezing
Visualized
I made a little animation to help understand:
https://tuple-freezing-demo.angusd.com