HOW ZHEAP WORKS
REINVENTING POSTGRESQL STORAGE
BY HANS-JÜRGEN SCHÖNIG
ABOUT ME AND MY COMPANY
■ Who is the guy?
■ Who is CYBERTEC?
HANS-JÜRGEN SCHÖNIG
CEO & SENIOR DATABASE CONSULTANT
■ PostgreSQL since 1999
■ author of various database books
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
WEB www.cybertec-postgresql.com
DATABASE SERVICES
DATA Science
▪ Artificial Intelligence
▪ Machine Learning
▪ Big Data
▪ Business Intelligence
▪ Data Mining
▪ etc.
POSTGRESQL Services
▪ 24/7 Support
▪ Training
▪ Consulting
▪ Performance Tuning
▪ Clustering
▪ etc.
CLIENT SECTORS
▪ ICT
▪ University
▪ Government
▪ Automotive
▪ Industry
▪ Trade
▪ Finance
▪ etc.
AGENDA
■ traditional tables
■ table bloat and VACUUM
■ Why a new storage system?
■ zheap: the goal
■ zheap: basic architecture
■ zheap: transaction slots, etc.
■ performance impacts
■ roadmap
TRADITIONAL TABLES
HEAP: STANDARD TABLES
■ Data structure looks as follows:
HEAP AND TRANSACTIONS
UPDATES AND VISIBILITY
PROBLEMS WITH HEAP
MAIN ISSUE: TABLE BLOAT
test=# CREATE TABLE a (aid int) WITH (autovacuum_enabled = off);
CREATE TABLE
test=# INSERT INTO a SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
35 MB
(1 row)
MAIN ISSUE: TABLE BLOAT
test=# UPDATE a SET aid = aid + 1;
UPDATE 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
MAIN ISSUE: TABLE BLOAT
test=# VACUUM VERBOSE a;
INFO: vacuuming "public.a"
INFO: "a": removed 1000000 row versions in 4425 pages
INFO: "a": found 1000000 removable, 1000000 nonremovable row versions in 8850 out of 8850 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 539
...
VACUUM
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
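The three sessions above can be replayed with a toy model. This is only a sketch: it counts row versions per page instead of storing data, and the figure of 226 rows per page is an assumption derived from the VACUUM output (1,000,000 rows in 4425 pages). The class and method names are hypothetical and have nothing to do with PostgreSQL internals.

```python
PAGE_SIZE = 8192        # PostgreSQL default block size in bytes
ROWS_PER_PAGE = 226     # assumption: 1,000,000 rows / 4425 pages, rounded up


class ToyHeap:
    """Counts live and dead row versions per page; stores no real data."""

    def __init__(self):
        self.live = []  # live row versions on each page
        self.dead = []  # dead row versions on each page

    def _append(self, n):
        # Fill the last page first, then extend the table with new pages.
        while n > 0:
            if not self.live or self.live[-1] + self.dead[-1] == ROWS_PER_PAGE:
                self.live.append(0)
                self.dead.append(0)
            room = ROWS_PER_PAGE - self.live[-1] - self.dead[-1]
            take = min(room, n)
            self.live[-1] += take
            n -= take

    def insert(self, n):
        self._append(n)

    def update_all(self):
        # Every live version turns dead and a new version is appended.
        n = sum(self.live)
        self.dead = [l + d for l, d in zip(self.live, self.dead)]
        self.live = [0] * len(self.live)
        self._append(n)

    def vacuum(self):
        # Dead versions become reusable space, but only empty pages at the
        # very end of the table can be given back to the file system.
        self.dead = [0] * len(self.dead)
        while self.live and self.live[-1] == 0:
            self.live.pop()
            self.dead.pop()

    def size_mb(self):
        return len(self.live) * PAGE_SIZE / 2**20


heap = ToyHeap()
heap.insert(1_000_000)
print(len(heap.live), round(heap.size_mb()))
heap.update_all()
print(len(heap.live), round(heap.size_mb()))
heap.vacuum()
print(len(heap.live), round(heap.size_mb()))
```

In this model the new row versions end up at the end of the table, so VACUUM frees space inside the file but finds no empty trailing pages to cut off. The file therefore stays at its doubled size.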
ONE WORD ABOUT VACUUM
■ VACUUM is not always allowed to reclaim dead rows
■ A row must be REALLY dead for VACUUM to do its job
■ Long transactions can be an enemy
→ Once you are in pain, it tends not to go away
WAYS OUT
■ VACUUM FULL: Needs a table lock
■ pg_squeeze:
  ▪ Shrinking tables with less locking
  ▪ Move between tablespaces
  ▪ Index-organize tables
HINT: Try to avoid bloat in the first place!
ZHEAP
COMING TO THE RESCUE
ZHEAP: DESIGN GOALS
■ Perform UPDATE in place
■ Have smaller tables
  ▪ smaller tuple headers
  ▪ improved alignment
■ Reduce writes as much as possible
  ▪ avoid dirtying pages unless data is modified
  ▪ normal heaps dirty pages in some cases during reads
■ Reuse space more quickly
■ Get rid of VACUUM
ZHEAP: TUPLE HEADERS
■ Heap: 20+ bytes per row
■ Zheap: 5 bytes per row
How can this be achieved?
■ The tuple header controls “visibility”
■ “Normalize tuple header”
■ Move visibility info to the page level
ZHEAP: TRANSACTION SLOTS
Transaction slots hold transactional visibility
ZHEAP: TRANSACTION SLOTS
Transaction slots:
■ 16 bytes of storage each
■ Each slot contains the following information:
  ▪ transaction id
  ▪ epoch
  ▪ latest undo record pointer of that transaction
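To see how three fields can add up to 16 bytes, here is a minimal sketch of one slot as a packed record. The field widths (4-byte transaction id, 4-byte epoch, 8-byte undo record pointer) are an assumption chosen to match the 16 bytes on the slide; the helper names and the sample values are purely illustrative.

```python
import struct

# Assumed layout: 4-byte transaction id, 4-byte epoch, 8-byte undo pointer.
SLOT = struct.Struct("<IIQ")


def pack_slot(xid, epoch, undo_ptr):
    """Serialize one transaction slot into its fixed-size form."""
    return SLOT.pack(xid, epoch, undo_ptr)


def unpack_slot(raw):
    """Turn the raw bytes of a slot back into named fields."""
    xid, epoch, undo_ptr = SLOT.unpack(raw)
    return {"xid": xid, "epoch": epoch, "undo_ptr": undo_ptr}


# Illustrative values only.
raw = pack_slot(xid=539, epoch=0, undo_ptr=0x3EC00000)
print(SLOT.size, unpack_slot(raw))
```

The point of the design is that this information is stored once per transaction and page rather than once per row, which is what allows the per-row header to shrink.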
What if we need more slots?
ZHEAP: TPD PAGES
■ TPD: Stores additional transaction slots if the “4” slots on a page are not enough
■ TPD pages are interleaved with normal pages
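The overflow idea can be sketched as follows. This is a toy model under stated assumptions: a page has 4 on-page slots (the “4” from the slide), and a plain list stands in for a TPD page. ToyPage, reserve_slot and release_slot are hypothetical names, not zheap functions.

```python
SLOTS_PER_PAGE = 4  # the "4" from the slide


class ToyPage:
    def __init__(self):
        self.slots = [None] * SLOTS_PER_PAGE  # xid per on-page slot
        self.tpd = []                         # stand-in for a TPD page

    def reserve_slot(self, xid):
        # A transaction that already holds a slot keeps using it.
        if xid in self.slots:
            return ("page", self.slots.index(xid))
        if xid in self.tpd:
            return ("tpd", self.tpd.index(xid))
        # Otherwise take the first free on-page slot ...
        for i, owner in enumerate(self.slots):
            if owner is None:
                self.slots[i] = xid
                return ("page", i)
        # ... and spill into the TPD once all on-page slots are taken.
        self.tpd.append(xid)
        return ("tpd", len(self.tpd) - 1)

    def release_slot(self, xid):
        # Hypothetical helper: free an on-page slot that is no longer needed.
        if xid in self.slots:
            self.slots[self.slots.index(xid)] = None


page = ToyPage()
for xid in (100, 101, 102, 103, 104):
    print(xid, page.reserve_slot(xid))
```

Four concurrent transactions fit on the page itself; the fifth one is the first to land in the overflow area.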
UNDO: HANDLING STALE DATA
OPERATION: INSERT
■ Allocate a transaction slot
■ Emit an undo entry to fix things on error
■ Space can be reclaimed instantly after a ROLLBACK
→ The simplest operation
OPERATION: UPDATE
■ More complicated, there are two cases:
  ▪ The new row fits into the old space
  ▪ The new row does not fit into the old space
OPERATION: UPDATE FITS
■ If the new row is not longer than the old one:
  ▪ We can overwrite it
  ▪ Emit undo record
In short: We hold the new row in zheap and a copy of the old row in undo so that we can copy it back into the table in case it is needed.
OPERATION: UPDATE DOESN’T FIT
■ More expensive than the in-place case:
  ▪ DELETE old row
  ▪ INSERT new row in a different place
  ▪ Less efficient
Space can instantly be reclaimed in the following cases:
■ When updating a row to a shorter version
■ When non-in-place UPDATEs are performed
OPERATION: DELETE
■ How it works:
  ▪ Emit undo record
  ▪ DELETE row from zheap
The old row can be moved back into zheap during ROLLBACK.
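The INSERT, UPDATE and DELETE slides share one pattern: write an undo record first, then change the table in place. A minimal sketch of that pattern, assuming a dictionary as the table and a list as the undo log; ToyZheap and its methods are hypothetical names and ignore pages, slots and concurrency entirely.

```python
INSERTED = object()  # marker: the row did not exist before the transaction


class ToyZheap:
    def __init__(self):
        self.rows = {}   # row id -> current value, changed in place
        self.undo = []   # undo records: (xid, row id, previous value)

    def insert(self, xid, rid, value):
        self.undo.append((xid, rid, INSERTED))
        self.rows[rid] = value

    def update(self, xid, rid, value):
        self.undo.append((xid, rid, self.rows[rid]))  # keep the old row
        self.rows[rid] = value                        # overwrite in place

    def delete(self, xid, rid):
        self.undo.append((xid, rid, self.rows.pop(rid)))

    def rollback(self, xid):
        # Apply this transaction's undo records, newest first.
        for rec_xid, rid, old in reversed(self.undo):
            if rec_xid != xid:
                continue
            if old is INSERTED:
                self.rows.pop(rid, None)
            else:
                self.rows[rid] = old
        self.undo = [rec for rec in self.undo if rec[0] != xid]


table = ToyZheap()
table.insert(1, "a", 10)
table.undo.clear()            # pretend transaction 1 committed long ago
table.update(2, "a", 11)
table.delete(2, "a")
table.insert(2, "b", 20)
table.rollback(2)
print(table.rows)
```

The table only ever holds the newest version of a row. Old versions live in the undo log until they are either copied back by a rollback or no longer needed.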
UNDO PAGE FORMAT
ROLLBACK
■ In case a ROLLBACK happens:
  ▪ undo has to make sure that the old state of the table is restored
  ▪ Old rows have to be copied back
  ▪ ROLLBACK takes longer than with a heap!
Undo itself can be removed in three cases:
■ as soon as no transaction can see the data anymore
■ as soon as all undo actions have been completed
■ for committed transactions, once they are all-visible
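The removal rules can be written down as a small predicate. This is a simplified sketch, not the actual discard logic: it compares transaction ids as plain integers (ignoring wraparound), and oldest_active_xid is an assumed stand-in for "the oldest transaction that may still need to see old data". UndoRecord and can_discard are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class UndoRecord:
    xid: int            # transaction that wrote the record
    committed: bool     # True if that transaction committed
    undo_applied: bool  # True if an abort has been fully rolled back


def can_discard(rec, oldest_active_xid):
    if rec.committed:
        # Committed changes: keep undo until no snapshot can need the old row.
        return rec.xid < oldest_active_xid
    # Aborted or in progress: keep undo until the rollback has been applied.
    return rec.undo_applied


log = [
    UndoRecord(xid=100, committed=True, undo_applied=False),
    UndoRecord(xid=105, committed=False, undo_applied=True),
    UndoRecord(xid=110, committed=True, undo_applied=False),
    UndoRecord(xid=112, committed=False, undo_applied=False),
]
kept = [rec.xid for rec in log if not can_discard(rec, oldest_active_xid=108)]
print(kept)
```

As with VACUUM on a heap, a long-running transaction holds back the horizon, so in this model it also delays the removal of undo.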
UNDO WORKERS
■ Discarding the undo logs is performed by the discard worker
■ The undo launcher checks the rollback_hash_table periodically
■ It spawns new undo workers to perform the rollback
■ Each spawned undo worker processes the rollback requests for a particular database
UNDO LOG PROCESSING
OBSERVATIONS
PREPARING DATA
■ Creating some random data
test=# SET temp_buffers TO '1 GB';
SET
test=# CREATE TEMP TABLE raw AS
SELECT id,
hashtext(id::text) as name,
random() * 10000 AS n, true AS b
FROM generate_series(1, 10000000) AS id;
SELECT 10000000
LOADING A HEAP
■ Populating a normal table
test=# \timing
Timing is on.
test=# CREATE TABLE h1 (LIKE raw) USING heap;
CREATE TABLE
Time: 7.836 ms
test=# INSERT INTO h1 SELECT * FROM raw;
INSERT 0 10000000
Time: 7495.798 ms (00:07.496)
LOADING A ZHEAP
■ Mind the runtime
test=# CREATE TABLE z1 (LIKE raw) USING zheap;
CREATE TABLE
Time: 8.045 ms
test=# INSERT INTO z1 SELECT * FROM raw;
INSERT 0 10000000
Time: 27947.516 ms (00:27.948)
ZHEAP IS SMALLER
■ Smaller tuple headers make a difference
test=# \d+
                      List of relations
  Schema   | Name | Type  | Owner | Persistence |  Size  | ...
-----------+------+-------+-------+-------------+--------+----
 pg_temp_5 | raw  | table | hs    | temporary   | 498 MB |
 public    | h1   | table | hs    | permanent   | 498 MB |
 public    | z1   | table | hs    | permanent   | 251 MB |
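A back-of-the-envelope estimate shows where the difference comes from. The heap figures (23-byte tuple header padded to 24, tuples padded to a multiple of 8, 4-byte line pointer, 24-byte page header, 8 kB pages) are standard on 64-bit builds. The zheap figures are assumptions for this sketch: a 5-byte header as stated on the slide, no alignment padding, a 4-byte line pointer, and 4 transaction slots of 16 bytes per page.

```python
import math

PAGE_SIZE = 8192
PAGE_HEADER = 24
ROWS = 10_000_000
DATA_BYTES = 4 + 4 + 8 + 1   # id int4, name int4, n float8, b boolean
LINE_POINTER = 4


def align8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) // 8 * 8


def table_mb(bytes_per_row, usable_page_bytes):
    """Estimated table size in MB for ROWS rows of the given footprint."""
    rows_per_page = usable_page_bytes // bytes_per_row
    pages = math.ceil(ROWS / rows_per_page)
    return pages * PAGE_SIZE / 2**20


# Heap: 23-byte header padded to 24, whole tuple padded to a multiple of 8.
heap_row = align8(24 + DATA_BYTES) + LINE_POINTER
heap_mb = table_mb(heap_row, PAGE_SIZE - PAGE_HEADER)

# zheap (assumptions): 5-byte header, no padding, 4 slots of 16 bytes per page.
zheap_row = 5 + DATA_BYTES + LINE_POINTER
zheap_mb = table_mb(zheap_row, PAGE_SIZE - PAGE_HEADER - 4 * 16)

print(round(heap_mb), round(zheap_mb))
```

Under these assumptions a row costs 52 bytes in the heap and 26 bytes in zheap, and the estimate lands close to the 498 MB and 251 MB reported by \d+.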
ZHEAP IN ACTION
test=# BEGIN;
BEGIN
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
test=*# UPDATE z1 SET id = id + 1;
UPDATE 10000000
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
UNDO IN ACTION
[hs@hs-MS-7817 undo]$ pwd
/home/hs/db13/base/undo
[hs@hs-MS-7817 undo]$ ls -l | tail -n 10
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EC00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003ED00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EE00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EF00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F000000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F100000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F200000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F300000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F400000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F500000
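The file names in the listing follow a visible pattern. The sketch below treats the part after the dot as a hexadecimal byte offset, which is an interpretation of the listing and not something the slides state; parse is a hypothetical helper.

```python
SEGMENT_BYTES = 1048576  # file size shown by ls -l

# A few of the file names from the listing above.
names = [
    "000001.003EC00000",
    "000001.003ED00000",
    "000001.003EE00000",
    "000001.003EF00000",
    "000001.003F000000",
]


def parse(name):
    """Split a segment file name into (first part, second part) as integers."""
    first, second = name.split(".")
    return int(first, 16), int(second, 16)


offsets = [parse(name)[1] for name in names]
steps = [b - a for a, b in zip(offsets, offsets[1:])]
print(steps)
print(offsets[-1] // 2**20)
```

Consecutive names differ by exactly one file size, so the files line up as 1 MB segments of one continuous log. Read this way, the last offsets sit at roughly 1 GB.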
ROADMAP
WHAT WE ARE WORKING ON
■ agree on final design issues
■ fix bugs in current code
  ▪ large code base
  ▪ not easy to handle
■ preparing a patch to move “undo” to core
  ▪ “undo” is core infrastructure
We hope to bring this into core some day.
QUESTIONS?
Feel free to contact me!
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
TWITTER @postgresql_007
