The future is CSN
Alexander Korotkov
Postgres Professional
Alexander Korotkov The future is CSN 1 / 31
Russian developers of PostgreSQL:
Alexander Korotkov, Teodor Sigaev, Oleg Bartunov
▶ Speakers at PGCon, PGConf: 20+ talks
▶ GSoC mentors
▶ PostgreSQL commi ers (1+1 in progress)
▶ Conference organizers
▶ 50+ years of PostgreSQL expertship:
development, audit, consul ng
▶ Postgres Professional co-founders
PostgreSQL CORE
▶ Locale support
▶ PostgreSQL extendability:
GiST(KNN), GIN, SP-GiST
▶ Full Text Search (FTS)
▶ NoSQL (hstore, jsonb)
▶ Indexed regexp search
▶ Create AM & Generic WAL
▶ Table engines (WIP)
Extensions
▶ intarray
▶ pg_trgm
▶ ltree
▶ hstore
▶ plantuner
▶ jsquery
▶ RUM
Alexander Korotkov The future is CSN 2 / 31
Why do we need snapshots? (1/5)
Alexander Korotkov The future is CSN 3 / 31
Why do we need snapshots? (2/5)
Alexander Korotkov The future is CSN 4 / 31
Why do we need snapshots? (3/5)
Alexander Korotkov The future is CSN 5 / 31
Why do we need snapshots? (4/5)
Alexander Korotkov The future is CSN 6 / 31
Why do we need snapshots? (5/5)
Alexander Korotkov The future is CSN 7 / 31
Exis ng MVCC snapshots
Alexander Korotkov The future is CSN 8 / 31
How do snapshots work?
▶ Array of ac ve transac on ids is stored in shared
memory.
▶ GetSnapshotData() scans all the ac ve xids while
holding shared ProcArrayLock.
▶ Assigning of new xid doesn’t require ProcArrayLock.
▶ Clearing ac ve xid requires exclusive ProcArrayLock.
▶ 9.6 comes with “group clear xid” op miza on.
Mul ple xids of finished transac ons could be
cleared using single exclusive ProcArrayLock.
Alexander Korotkov The future is CSN 9 / 31
Problem with of snapshots
▶ Nowadays mul -core systems running can run
thousands of backends simultaneously. For short
queries GetSnapshotData() becomes just CPU
expensive.
▶ LWLock subsystem is just not designed for high
concurrency. In par cular, exclusive lock waits could
have infinite starva on. Therefore, it’s impossible to
connect while there is high flow of short readonly
queries.
▶ In the mixed read-write workload, ProcArrayLock
could become a bo leneck.
Alexander Korotkov The future is CSN 10 / 31
Commit sequence number (CSN) snapshots
Alexander Korotkov The future is CSN 11 / 31
CSN development history
▶ Jun 7, 2013 – proposal by Ants Aasma
▶ May 30, 2014 – first path by Heikki Linnakangas
▶ PGCon 2015 – talk by Dilip Kumar (no patch
published)
▶ Aug, 2016 – Heikki returned to this work
Alexander Korotkov The future is CSN 12 / 31
CSN snapshots proper es
Pro:
▶ Taking snapshots is cheaper. It’s even possible to
make it lockless.
▶ CSN snapshots are more friendly to distributed
systems. Distributed visibility techniques like
incremental snapshots or Clock-SI assumes that
snapshot is represented by single number.
Cons:
▶ Have to map XID ⇒ CSN while visibility check.
Alexander Korotkov The future is CSN 13 / 31
XID ⇒ CSN map worst case
1 M rows table, xmin is random in 10 M transac ons
version first scan, ms next scans, ms
master 2500 50
csn 4900 4900
Without CSN we have to lookup CLOG only during first
scan of the table. During first scan hint bits are set.
Second and subsequent scans use hint bits and don’t
lookup CLOG.
Alexander Korotkov The future is CSN 14 / 31
Could we hint XID ⇒ CSN map as well?
▶ In general, it’s possible. We could rewrite XID of
commi ed transac on into its CSN.
▶ Xmin and xmax are 32-bit. Usage of 32-bit CSN is
undesirable. We already have xid and mul xact
wraparounds. Yet another CSN wraparound would
be discouraging.
▶ Se ng hint bits is not WAL-logged. We need to
preserve this property.
Alexander Korotkov The future is CSN 15 / 31
Make both XID and CSN 64-bit
▶ Add 64-bit xid_epoch, mul xact_epoch and
csn_epoch to page header.
▶ Allocate high bit of xmin and xmax for CSN flag.
▶ Actual xid or csn stored in xmin or xmax should be
found as corresponding epoch plus xmin or xmax.
▶ We s ll can address 231
xids from xmin and xmax as
we did before.
▶ Wraparound is possible only inside single page. And
it could be resolved by single page freeze.
Alexander Korotkov The future is CSN 16 / 31
CSN-rewrite patch
▶ Use 64-bit XID and CSN as described before.
▶ Rewrite XID to CSN instead of se ng “commi ed”
hint bit.
▶ Lockless snapshot taking.
▶ WIP, not published yet.
Alexander Korotkov The future is CSN 17 / 31
XID ⇒ CSN map worst case
1 M rows table, xmin is random in 10 M transac ons
version first scan, ms next scans, ms
master 2500 50
csn 4900 4900
csn-rewrite 4900 50
Subsequent scans of table is as cheap as it was before.
First scan s ll have a room for op miza on.
Alexander Korotkov The future is CSN 18 / 31
Benchmarks
Alexander Korotkov The future is CSN 19 / 31
Taking snapshots (SELECT 1)
0 50 100 150 200 250
# Clients
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -f sel1.sql on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 20 / 31
Read-only benchmark
0 50 100 150 200 250
# Clients
0
500000
1000000
1500000
2000000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 21 / 31
Read-write benchmark
0 50 100 150 200 250
# Clients
0
20000
40000
60000
80000
100000
120000
140000
160000
TPS
pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 22 / 31
Random: 78% read queries, 22% write
queries
0 50 100 150 200 250
# Clients
0
200000
400000
600000
800000
1000000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -b select-only@9 -b tpcb-like@1
on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 23 / 31
Custom script with extra 20 read queries
set naccounts 100000 * :scale
set aid1 random(1, :naccounts)
...........................................................
set aid20 random(1, :naccounts)
set aid random(1, 100000 * :scale)
set bid random(1, 1 * :scale)
set tid random(1, 10 * :scale)
set delta random(-5000, 5000)
SELECT abalance FROM pgbench_accounts WHERE aid IN (:aid1);
...........................................................
SELECT abalance FROM pgbench_accounts WHERE aid IN (:aid20);
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :d
END;
Alexander Korotkov The future is CSN 24 / 31
Custom script with extra 20 read queries
0 50 100 150 200 250
# Clients
0
10000
20000
30000
40000
50000
60000
70000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -f rrw.sql on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 25 / 31
Further PostgreSQL OLTP bo lenecks
Alexander Korotkov The future is CSN 26 / 31
Further PostgreSQL OLTP bo lenecks
▶ Buffer manager – slow hash-table, pin, locks etc.
▶ Synchronous protocol.
▶ Executor.
▶ Slow xid alloca on – a lot of locks.
Alexander Korotkov The future is CSN 27 / 31
Further PostgreSQL OLTP bo lenecks in
numbers
▶ SELECT val FROM t WHERE id IN (:id1, ... :id10) –
150K per second = 1.5M key-value pairs per second, no gain.
Bo leneck in buffer manager.
▶ SELECT 1 with CSN-rewrite patch – 3.9M queries per second.
Protocol and executor are bo lenecks.
▶ SELECT txid_current() – 390K per second. Bo leneck in
locks.
Alexander Korotkov The future is CSN 28 / 31
How can we improve PostgreSQL OLTP?
▶ True in-memory engine without buffer manager.
▶ Asynchronous binary protocol for processing more
short queries.
▶ Executor improvements including JIT-compila on.
▶ Lockless xid alloca on.
Alexander Korotkov The future is CSN 29 / 31
Conclusion
▶ Despite all the micro-op miza ons made, our
snapshot model could be a bo leneck on modern
mul core systems. And it would be even worse
bo leneck on future systems.
▶ CSN is the way to remove this bo leneck. It also
more friendly to distributed systems.
▶ It’s possible to minimize XID ⇒ CSN map in the
same way we minimize CLOG accesses.
Alexander Korotkov The future is CSN 30 / 31
Thank you for a en on!
Alexander Korotkov The future is CSN 31 / 31

The future is CSN

  • 1.
    The future isCSN Alexander Korotkov Postgres Professional Alexander Korotkov The future is CSN 1 / 31
  • 2.
    Russian developers ofPostgreSQL: Alexander Korotkov, Teodor Sigaev, Oleg Bartunov ▶ Speakers at PGCon, PGConf: 20+ talks ▶ GSoC mentors ▶ PostgreSQL commi ers (1+1 in progress) ▶ Conference organizers ▶ 50+ years of PostgreSQL expertship: development, audit, consul ng ▶ Postgres Professional co-founders PostgreSQL CORE ▶ Locale support ▶ PostgreSQL extendability: GiST(KNN), GIN, SP-GiST ▶ Full Text Search (FTS) ▶ NoSQL (hstore, jsonb) ▶ Indexed regexp search ▶ Create AM & Generic WAL ▶ Table engines (WIP) Extensions ▶ intarray ▶ pg_trgm ▶ ltree ▶ hstore ▶ plantuner ▶ jsquery ▶ RUM Alexander Korotkov The future is CSN 2 / 31
  • 3.
    Why do weneed snapshots? (1/5) Alexander Korotkov The future is CSN 3 / 31
  • 4.
    Why do weneed snapshots? (2/5) Alexander Korotkov The future is CSN 4 / 31
  • 5.
    Why do weneed snapshots? (3/5) Alexander Korotkov The future is CSN 5 / 31
  • 6.
    Why do weneed snapshots? (4/5) Alexander Korotkov The future is CSN 6 / 31
  • 7.
    Why do weneed snapshots? (5/5) Alexander Korotkov The future is CSN 7 / 31
  • 8.
    Exis ng MVCCsnapshots Alexander Korotkov The future is CSN 8 / 31
  • 9.
    How do snapshotswork? ▶ Array of ac ve transac on ids is stored in shared memory. ▶ GetSnapshotData() scans all the ac ve xids while holding shared ProcArrayLock. ▶ Assigning of new xid doesn’t require ProcArrayLock. ▶ Clearing ac ve xid requires exclusive ProcArrayLock. ▶ 9.6 comes with “group clear xid” op miza on. Mul ple xids of finished transac ons could be cleared using single exclusive ProcArrayLock. Alexander Korotkov The future is CSN 9 / 31
  • 10.
    Problem with ofsnapshots ▶ Nowadays mul -core systems running can run thousands of backends simultaneously. For short queries GetSnapshotData() becomes just CPU expensive. ▶ LWLock subsystem is just not designed for high concurrency. In par cular, exclusive lock waits could have infinite starva on. Therefore, it’s impossible to connect while there is high flow of short readonly queries. ▶ In the mixed read-write workload, ProcArrayLock could become a bo leneck. Alexander Korotkov The future is CSN 10 / 31
  • 11.
    Commit sequence number(CSN) snapshots Alexander Korotkov The future is CSN 11 / 31
  • 12.
    CSN development history ▶Jun 7, 2013 – proposal by Ants Aasma ▶ May 30, 2014 – first path by Heikki Linnakangas ▶ PGCon 2015 – talk by Dilip Kumar (no patch published) ▶ Aug, 2016 – Heikki returned to this work Alexander Korotkov The future is CSN 12 / 31
  • 13.
    CSN snapshots properes Pro: ▶ Taking snapshots is cheaper. It’s even possible to make it lockless. ▶ CSN snapshots are more friendly to distributed systems. Distributed visibility techniques like incremental snapshots or Clock-SI assumes that snapshot is represented by single number. Cons: ▶ Have to map XID ⇒ CSN while visibility check. Alexander Korotkov The future is CSN 13 / 31
  • 14.
    XID ⇒ CSNmap worst case 1 M rows table, xmin is random in 10 M transac ons version first scan, ms next scans, ms master 2500 50 csn 4900 4900 Without CSN we have to lookup CLOG only during first scan of the table. During first scan hint bits are set. Second and subsequent scans use hint bits and don’t lookup CLOG. Alexander Korotkov The future is CSN 14 / 31
  • 15.
    Could we hintXID ⇒ CSN map as well? ▶ In general, it’s possible. We could rewrite XID of commi ed transac on into its CSN. ▶ Xmin and xmax are 32-bit. Usage of 32-bit CSN is undesirable. We already have xid and mul xact wraparounds. Yet another CSN wraparound would be discouraging. ▶ Se ng hint bits is not WAL-logged. We need to preserve this property. Alexander Korotkov The future is CSN 15 / 31
  • 16.
    Make both XIDand CSN 64-bit ▶ Add 64-bit xid_epoch, mul xact_epoch and csn_epoch to page header. ▶ Allocate high bit of xmin and xmax for CSN flag. ▶ Actual xid or csn stored in xmin or xmax should be found as corresponding epoch plus xmin or xmax. ▶ We s ll can address 231 xids from xmin and xmax as we did before. ▶ Wraparound is possible only inside single page. And it could be resolved by single page freeze. Alexander Korotkov The future is CSN 16 / 31
  • 17.
    CSN-rewrite patch ▶ Use64-bit XID and CSN as described before. ▶ Rewrite XID to CSN instead of se ng “commi ed” hint bit. ▶ Lockless snapshot taking. ▶ WIP, not published yet. Alexander Korotkov The future is CSN 17 / 31
  • 18.
    XID ⇒ CSNmap worst case 1 M rows table, xmin is random in 10 M transac ons version first scan, ms next scans, ms master 2500 50 csn 4900 4900 csn-rewrite 4900 50 Subsequent scans of table is as cheap as it was before. First scan s ll have a room for op miza on. Alexander Korotkov The future is CSN 18 / 31
  • 19.
    Benchmarks Alexander Korotkov Thefuture is CSN 19 / 31
  • 20.
    Taking snapshots (SELECT1) 0 50 100 150 200 250 # Clients 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 TPS pgbench -s 1000 -j $n -c $n -M prepared -f sel1.sql on 4 x 18 cores Intel Xeon E7-8890 processors median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300 master xact-align csn-rewrite Alexander Korotkov The future is CSN 20 / 31
  • 21.
    Read-only benchmark 0 50100 150 200 250 # Clients 0 500000 1000000 1500000 2000000 TPS pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300 master xact-align csn-rewrite Alexander Korotkov The future is CSN 21 / 31
  • 22.
    Read-write benchmark 0 50100 150 200 250 # Clients 0 20000 40000 60000 80000 100000 120000 140000 160000 TPS pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300 master xact-align csn-rewrite Alexander Korotkov The future is CSN 22 / 31
  • 23.
    Random: 78% readqueries, 22% write queries 0 50 100 150 200 250 # Clients 0 200000 400000 600000 800000 1000000 TPS pgbench -s 1000 -j $n -c $n -M prepared -b select-only@9 -b tpcb-like@1 on 4 x 18 cores Intel Xeon E7-8890 processors median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300 master xact-align csn-rewrite Alexander Korotkov The future is CSN 23 / 31
  • 24.
    Custom script withextra 20 read queries set naccounts 100000 * :scale set aid1 random(1, :naccounts) ........................................................... set aid20 random(1, :naccounts) set aid random(1, 100000 * :scale) set bid random(1, 1 * :scale) set tid random(1, 10 * :scale) set delta random(-5000, 5000) SELECT abalance FROM pgbench_accounts WHERE aid IN (:aid1); ........................................................... SELECT abalance FROM pgbench_accounts WHERE aid IN (:aid20); BEGIN; UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; SELECT abalance FROM pgbench_accounts WHERE aid = :aid; UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :d END; Alexander Korotkov The future is CSN 24 / 31
  • 25.
    Custom script withextra 20 read queries 0 50 100 150 200 250 # Clients 0 10000 20000 30000 40000 50000 60000 70000 TPS pgbench -s 1000 -j $n -c $n -M prepared -f rrw.sql on 4 x 18 cores Intel Xeon E7-8890 processors median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300 master xact-align csn-rewrite Alexander Korotkov The future is CSN 25 / 31
  • 26.
    Further PostgreSQL OLTPbo lenecks Alexander Korotkov The future is CSN 26 / 31
  • 27.
    Further PostgreSQL OLTPbo lenecks ▶ Buffer manager – slow hash-table, pin, locks etc. ▶ Synchronous protocol. ▶ Executor. ▶ Slow xid alloca on – a lot of locks. Alexander Korotkov The future is CSN 27 / 31
  • 28.
    Further PostgreSQL OLTPbo lenecks in numbers ▶ SELECT val FROM t WHERE id IN (:id1, ... :id10) – 150K per second = 1.5M key-value pairs per second, no gain. Bo leneck in buffer manager. ▶ SELECT 1 with CSN-rewrite patch – 3.9M queries per second. Protocol and executor are bo lenecks. ▶ SELECT txid_current() – 390K per second. Bo leneck in locks. Alexander Korotkov The future is CSN 28 / 31
  • 29.
    How can weimprove PostgreSQL OLTP? ▶ True in-memory engine without buffer manager. ▶ Asynchronous binary protocol for processing more short queries. ▶ Executor improvements including JIT-compila on. ▶ Lockless xid alloca on. Alexander Korotkov The future is CSN 29 / 31
  • 30.
    Conclusion ▶ Despite allthe micro-op miza ons made, our snapshot model could be a bo leneck on modern mul core systems. And it would be even worse bo leneck on future systems. ▶ CSN is the way to remove this bo leneck. It also more friendly to distributed systems. ▶ It’s possible to minimize XID ⇒ CSN map in the same way we minimize CLOG accesses. Alexander Korotkov The future is CSN 30 / 31
  • 31.
    Thank you fora en on! Alexander Korotkov The future is CSN 31 / 31