In-memory OLTP storage with persistence and transaction support

Alexander Korotkov
Postgres Professional
October 25, 2017

Disclaimer
▶ This talk is not about something production-ready. Don’t hold
your breath while waiting to use any of the functionality
considered here in production. By the time this functionality is
available and production-ready, it might have become something
dramatically different.
▶ This talk is about some intermediate results achieved during
development. These results are presented for discussion and
brainstorming in order to make further development better.

What is this talk about?
▶ We (Postgres Pro) have a prototype of in-memory OLTP storage
implemented using FDW.
▶ It’s a proof of concept of the opportunities for in-memory OLTP in
PostgreSQL (it was debatable whether there are any).
▶ It’s yet another example of alternative storage implemented
using the FDW interface before we have native pluggable storages.
So, it’s a waypoint to verify where we are on pluggable storages.

Why pluggable storages?
▶ The lack of pluggable storage support is understood as a
limitation.
PostgreSQL has always been positioned as a highly extensible DBMS, while
the lack of pluggable storage support is a large gap in this area.
▶ Rising interest in PostgreSQL from enterprises.
Postgres-centric companies have enough resources to support
multiple storage engines. Enterprises are also interested in using
PostgreSQL for non-OLTP tasks. Alternative storages might improve
OLTP too (UNDO log for better update performance).

Use cases for pluggable storages
▶ Different MVCC implementation: mostly variations of UNDO
log, but not only.
▶ Data compression: row-level, page-level, etc.
▶ Non-disk-oriented storage: in-memory, SSD-optimized,
NVRAM-optimized.
▶ Non-heap-like row layout: index-organized table (IOT),
including LSM.
▶ Non-row-oriented data layout: either columnar or Parquet
layouts.

Current state of pluggable storage
https://www.postgresql.org/message-id/flat/20160812231527.GA690404%40alvherre.pgsql
▶ Started as a quite mechanical separation of heap_* methods into
a storage AM interface.
▶ The boundary of the storage layer was significantly shifted during
discussions.
▶ Still a lot of work to do.

Current view on pluggable storage properties
▶ All the storages should share the same transaction model (or have
no transactions at all).
▶ All the storages should write the same WAL stream.
▶ Tuples have to be identified by TIDs (further improvement is
possible).
▶ Storages should share some of the index access methods.
▶ The index access method interface should be expanded with new
functions (at least retail tuple delete).
▶ Storages may have completely different MVCC implementations.

Why FDW for prototyping
Pros:
▶ FDW is completely free in the way it scans and modifies the
foreign table.
▶ This approach is already used in cstore_fdw [1] and vops [2].
Cons:
▶ Lack of control over associated resources,
▶ Lack of DDL support.
[1] https://github.com/citusdata/cstore_fdw
[2] https://github.com/postgrespro/vops

Why in-memory?
▶ No extra mapping layer (buffer manager) to traverse from one page
to another.
▶ Row-level WAL takes less space (no page-level information, no
explicit index logging), but is slower to apply.
▶ Better IO utilization (both snapshots and WAL are written
sequentially).

What is this particular in-memory engine?
▶ An index-organized table where the index is an in-memory B-tree.
▶ This B-tree supports transactions and MVCC using an UNDO log,
which is a circular buffer in memory containing both row-level
and page-level records.
▶ It writes full data snapshots on checkpoints, plus row-level WAL.
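
The persistence model above (full snapshot at checkpoint plus row-level WAL) can be illustrated with a toy sketch. This is not the prototype's code: the class name TinyStore, its methods, and the record format are all hypothetical, and Python lists and dicts stand in for the snapshot file and the WAL.

```python
class TinyStore:
    """Toy store: full snapshot at checkpoint plus a row-level log (illustrative only)."""

    def __init__(self):
        self.rows = {}      # in-memory data, keyed by primary key
        self.snapshot = {}  # last full snapshot (stands in for a snapshot file)
        self.wal = []       # row-level records written since that snapshot

    def put(self, key, value):
        self.wal.append(("put", key, value))  # log the row change first
        self.rows[key] = value

    def delete(self, key):
        self.wal.append(("del", key, None))
        self.rows.pop(key, None)

    def checkpoint(self):
        # Write the whole data set in one sequential pass, then start a fresh log.
        self.snapshot = dict(self.rows)
        self.wal = []

    def recover(self):
        # Rebuild state: load the last snapshot, then replay the newer row records.
        rows = dict(self.snapshot)
        for op, key, value in self.wal:
            if op == "put":
                rows[key] = value
            else:
                rows.pop(key, None)
        return rows
```

Note that the log carries only row changes, with no page images or index records, which is why it is compact but needs replaying on top of a snapshot.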

Why our in-memory engine is a good
example of pluggable storage
Because it does things in quite a different way.
▶ It stores data in main memory with quite a different
model of persistence: full data snapshots plus
row-level WAL.
▶ It doesn’t have a heap-like layout.
▶ It uses a very different MVCC implementation:
a combination of row-level and page-level undo logs.

Why our in-memory engine is a bad
example of pluggable storage
▶ It uses the CSN snapshot model, which is far from getting
committed.
▶ Tuples aren’t identified by TIDs.
▶ Persistence is implemented using a set of hacks.

Configuration parameters
▶ in_memory_engine.shared_pool_size – size of the
separate pool of 1k pages for in-memory tables.
▶ in_memory_engine.undo_size – size of the circular
buffer for undo records to support transactions and
MVCC.

Usage: defining an in-memory table and
inserting data
CREATE EXTENSION in_memory;
CREATE FOREIGN TABLE im_test
(
    id int8 NOT NULL,
    val text NOT NULL
) SERVER in_memory OPTIONS (indices 'unique (id)',
                            persistent 'true');
INSERT INTO im_test
    (SELECT id, 'val' || id FROM generate_series(1, 1000000) id);

Usage: querying a single key
# EXPLAIN ANALYZE SELECT * FROM im_test WHERE id = 50000;
QUERY PLAN
----------------------------------------------------------------
Foreign Scan on im_test (cost=0.06..4.52 rows=357 width=40)
(actual time=0.190..0.191 rows=1 loops=1)
Pk conds: (id = 50000)
Planning time: 0.635 ms
Execution time: 0.260 ms
(4 rows)

Usage: querying a key range
# EXPLAIN ANALYZE SELECT * FROM im_test
WHERE id >= 10000 AND id <= 20000;
QUERY PLAN
----------------------------------------------------------------
Pk conds: (id >= 10000 AND id <= 20000)
(4 rows)

Usage: querying on a non-key condition
# EXPLAIN ANALYZE SELECT * FROM im_test WHERE val LIKE '%1111%';
QUERY PLAN
----------------------------------------------------------------
Filter: (val ~~ '%1111%'::text)
Rows Removed by Filter: 999720
(5 rows)

Usage: non-persistent tables are writable on
standby
*** Master ***
# CREATE FOREIGN TABLE im_test (id int8 NOT NULL, val text NOT NULL)
SERVER in_memory OPTIONS (indices 'unique (id)',
                          persistent 'false');
# INSERT INTO im_test
(SELECT id, 'val' || id FROM generate_series(1, 1000000) id);
INSERT 0 1000000
*** Standby ***
# SELECT * FROM im_test;
id | val
----+-----
(0 rows)
# INSERT INTO im_test VALUES (1, 'foo'), (2, 'bar');
INSERT 0 2
# SELECT * FROM im_test;
id | val
----+-----
1 | foo
2 | bar
(2 rows)

Limitations
▶ Only a B-tree with limited functionality is supported.
▶ No secondary indexes are supported yet.
▶ No out-of-line storage is supported for tuples yet.
▶ The undo log mustn’t wrap around during a single transaction (such a
transaction is automatically aborted).
▶ If a required undo record has already been overwritten in the circular
buffer, then a “snapshot’s too old” error is emitted.
▶ Serializable isolation level isn’t supported.
▶ Replication isn’t supported yet.
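
The undo-related limitations follow from the undo log being a fixed-size circular buffer. A minimal sketch of that behaviour, assuming records are addressed by an ever-growing position; the class UndoRing and its methods are hypothetical names, not the prototype's API:

```python
class UndoRing:
    """Fixed-size circular buffer of undo records (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.next_pos = 0  # total number of records ever written

    def append(self, record):
        pos = self.next_pos
        self.slots[pos % self.size] = record  # reuses the slot of the oldest record
        self.next_pos += 1
        return pos

    def read(self, pos):
        if pos >= self.next_pos:
            raise IndexError("no undo record at this position yet")
        # A record survives only while fewer than `size` records follow it.
        if pos < self.next_pos - self.size:
            raise LookupError("snapshot too old")
        return self.slots[pos % self.size]
```

A reader holding an old position gets an error instead of wrong data once the buffer has wrapped past it; a larger undo buffer simply postpones that point.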

Read-only benchmark
[Plot: QPS (0 to 1,600,000) vs. number of clients (0 to 250); series: builtin, in-memory]
pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300

Why is there no win?
Storage is only one layer participating in query
execution. There are also:
▶ Network layer,
▶ Executor,
▶ Parser (analyze & rewrite if not prepared),
▶ Transaction management (including snapshot
acquisition),
▶ ...

Measuring overheads
[Plot: QPS (0 to 3,500,000) vs. number of clients (0 to 250); series: read-only, "SELECT 1;", ";"]

Read-only benchmark:
fetch 9 values per single query
\set aid1 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE
aid IN (:aid1, :aid2, :aid3, :aid4, :aid5, :aid6, :aid7, :aid8, :aid9);

fetch 9 values per single query
[Plot: QPS (0 to 1,200,000) vs. number of clients (0 to 250); series: builtin, in-memory]
pgbench -s 1000 -j $n -c $n -M prepared -f ro9.sql on 4 x 18 cores Intel Xeon E7-8890 processors

compare values-per-second
[Plot: VPS (0 to 10,000,000) vs. number of clients (0 to 250); series: builtin-1, builtin-9, in_memory-1, in_memory-9]
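
The values-per-second comparison presumably normalizes the two read-only runs by the number of values each query fetches (1 for pgbench -S, 9 for the IN-list query). A sketch of that arithmetic, under that assumption; the function name and the sample numbers are hypothetical and are not measurements from the talk:

```python
def values_per_second(qps, values_per_query):
    """Convert query throughput into fetched-values throughput."""
    return qps * values_per_query


# Hypothetical throughputs, for illustration only.
single = values_per_second(1_000_000, 1)
batched = values_per_second(500_000, 9)
```

With these made-up numbers the 9-value query runs at half the QPS but still fetches 4.5 times as many values per second, which is the kind of effect the per-query overheads would produce.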

Read-write benchmark without persistence
(async commit)
[Plot: TPS (0 to 250,000) vs. number of clients (0 to 250); series: unlogged table, in_memory]
pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors

Read-write benchmark with persistence
(async commit)
[Plot: TPS (0 to 200,000) vs. number of clients (0 to 250); series: builtin, in-memory]

Read-write benchmark:
do the transaction in a single statement
CREATE OR REPLACE FUNCTION tcpb_trx(_aid int, _bid int, _tid int, _delta int)
RETURNS void AS $$
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;
    PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + _delta WHERE tid = _tid;
    UPDATE pgbench_branches SET bbalance = bbalance + _delta WHERE bid = _bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
        (_tid, _bid, _aid, _delta, CURRENT_TIMESTAMP);
END;
$$ LANGUAGE plpgsql;
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
SELECT tcpb_trx(:aid, :bid, :tid, :delta);

Read-write benchmark with persistence
(async commit, function vs. interactive)
[Plot: TPS (0 to 600,000) vs. number of clients (0 to 250); series: builtin, builtin-func, in_memory, in_memory-func]

Hacks used in implementation
▶ Minimalistic CSN implementation
CSNs are assigned, but neither used nor written to SLRUs.
The in-memory engine doesn’t need SLRU.
▶ Checkpoint hook
The in-memory engine writes a full data snapshot on checkpoint.
▶ Generic logical message hook
Used to implement custom recovery/replication. This is an awful
hack.
▶ TRUNCATE using utility command hook
TRUNCATE isn’t supported by FDW directly.
▶ DROP support using event trigger
Used to free the resources occupied by the in-memory table.

Recovery problem
▶ Row-level WAL is compact, but it requires meta-information to
apply. That is, we need to be able to read the system catalog while
applying WAL, including during recovery.
▶ We can’t access the system catalog during recovery, because the
whole database isn’t accessible until it has been recovered.

Recovery problem solution:
2-phase recovery
In the second phase we have a consistent system catalog.
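
One way to picture 2-phase recovery is the toy sketch below. It rests on an assumption the slides do not spell out: that the first phase restores the catalog while only setting row-level records aside, and the second phase decodes and applies them. The function name and the record format are hypothetical.

```python
def two_phase_recovery(wal):
    """Replay `wal`, a list of ("catalog", table, columns) or ("row", table, values)."""
    catalog = {}   # table name -> column names
    deferred = []  # row-level records that cannot be decoded yet
    # Phase 1: restore the catalog; row-level records are only set aside.
    for kind, table, payload in wal:
        if kind == "catalog":
            catalog[table] = payload
        else:
            deferred.append((table, payload))
    # Phase 2: the catalog is now consistent, so row records can be decoded.
    tables = {name: [] for name in catalog}
    for table, values in deferred:
        tables[table].append(dict(zip(catalog[table], values)))
    return tables
```

The row records carry bare values only; the column names needed to interpret them come from the catalog, which is why they cannot be applied before the catalog itself has been recovered.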

Future roadmap
Integrate in-memory as pluggable storage:
▶ In-memory B-tree as an index access method.
▶ Implement storage for in-memory tables in one of the following
ways:
▶ Write some kind of «in-memory heap», AND/OR
▶ Write a storage wrapper implementing an index-organized
table.

Thank you for your attention!
