Parallel Recovery in PostgreSQL
Koichi Suzuki
May 31st, 2023
© Copyright EnterpriseDB Corporation, 2023. All rights reserved.
Agenda
• Recovery Overview
• Possibility of recovery parallelism
• Additional rules and synchronization
• Implementation architecture
• Current achievement
• Remaining work
Recovery Overview
• WAL (Write-Ahead Log) plays an essential role in backup and recovery.
• It is also used in log-shipping replication.
• Recovery is done by a dedicated startup process.
[Diagram: data updates are written both to the database and to the WAL; checkpoints flush buffers to serialized data; WAL copies feed the archive and log shipping; a database in recovery mode reads each WAL record in series and replays each WAL record in series.]
Single replay thread, why?
● Single-threaded replay guarantees consistent replay
○ WAL order is the order of writes to the database.
○ Replaying in write order is simple and the safest.
Current WAL replay architecture
WAL record types:
● Categorized by “resource manager”
○ One for each type of data object
■ Heap, transaction, Btree, GIN, ….
○ Each resource manager provides a replay function for its records.
● Replay (“redo” in the source) reads the resource manager ID (type RmgrId) of
each WAL record and calls that resource manager’s replay function (see the sketch after this list).
○ StartupXLOG() in xlog.c
● Replay is done by the dedicated startup process.
● Many more things are handled in StartupXLOG():
○ Determining whether the postmaster can begin to accept connections,
○ Feedback of commit/abort to the primary,
○ Recovery target handling,
….(many more)...
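A minimal sketch of this dispatch, loosely following the redo loop in StartupXLOG() (simplified; recovery-target checks, restartpoints and error handling are omitted, and details differ between versions):

```c
/* Simplified sketch of the single-process redo loop (not the exact source). */
for (;;)
{
    XLogRecord *record = ReadRecord(xlogreader, LOG, false);   /* fetch next WAL record */

    if (record == NULL)
        break;                                  /* reached the end of WAL */

    /* Look up the owning resource manager by xl_rmid and call its replay routine. */
    RmgrTable[record->xl_rmid].rm_redo(xlogreader);

    /* ...plus recovery-target checks, restartpoint requests, etc. ... */
}
```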
Current list of RmgrId (as of V14/15)
XLOG, Transaction, Storage, CLOG, Database, Tablespace, MultiXact, RelMap, Standby, Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist, BRIN, CommitTs, ReplicationOrigin, Generic, LogicalMessage
When a new resource manager (for example, a new access method or index) is
added, a new RmgrId and its replay function are added. No changes are needed
in the current redo framework.
Structure of WAL (simplified)
[Diagram: the WAL stream starts at initdb and is divided into fixed-size segments (16 MB by default), each containing WAL records.]
The LSN is the 64-bit position of a WAL record, counted in bytes from the very beginning of the WAL stream (see the sketch below).
● Each WAL segment has its own header.
● A WAL record may span multiple WAL segment files.
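As a self-contained toy illustration of how a 64-bit LSN relates to segment files (the sample LSN is made up, and the server itself uses macros such as XLByteToSeg with the configured wal_segment_size):

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;                    /* an LSN: 64-bit byte position in the WAL stream */

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)   /* default segment size: 16 MB */

int main(void)
{
    XLogRecPtr lsn = UINT64_C(0x16B374D848);    /* hypothetical LSN, displayed as 16/B374D848 */

    uint64_t segno  = lsn / WAL_SEGMENT_SIZE;               /* which segment file it falls in */
    uint32_t offset = (uint32_t) (lsn % WAL_SEGMENT_SIZE);  /* byte offset inside that segment */

    printf("LSN %X/%08X -> segment number %llu, offset %u\n",
           (unsigned) (lsn >> 32), (unsigned) lsn,
           (unsigned long long) segno, offset);
    return 0;
}
```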
Motivation for parallelism in recovery
Motivation
• Reduce the time to restore from a backup.
• Reduce replication lag in log-shipping replication.
• At least, WAL records for different database pages can be replayed in parallel.
Challenge (for example)
• There are many constraints we must follow during WAL replay:
○ When replaying COMMIT/ABORT, all the WAL records for the
transaction must already have been replayed.
○ Some WAL records carry replay information for multiple pages.
Basic idea of constraints in parallel replay
1. For each database page, replay must be done in LSN order.
2. WAL records for different pages can be replayed by different worker processes in
parallel.
3. WAL records not associated with a database page are replayed in LSN order by a
dedicated worker (the transaction worker).
4. For some WAL records, such as timeline changes, all the preceding WAL records
must have been replayed before such a record is replayed.
5. If a WAL record is associated with multiple pages, it is replayed by one of the
workers.
○ To maintain the first constraint, we need some synchronization (mentioned
later).
6. Before replaying commit/abort/prepare, it must be confirmed that all the WAL
records for the transaction have been replayed (mentioned later).
Worker processes for parallel recovery
All the worker processes are the startup process itself or its child processes.
● Reader worker (dedicated)
○ This is the startup process itself. Reads each WAL record and queues it to the
distributor worker.
● Distributor worker (dedicated)
○ Parses each WAL record and determines which worker process should handle it.
○ If a WAL record is associated with block data, it is assigned to one of
the block workers based on the hash value of the block (see the sketch after this list).
○ Adds each WAL record to the queue of the chosen worker process in LSN order.
○ Could be merged into the reader worker.
● Transaction worker (dedicated)
○ Replays all the WAL records without block information.
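A toy sketch of the hash-based assignment mentioned above; the structure and hash here are illustrative, not the patch's actual code. The point is that a given page always maps to the same block worker, so its WAL records are replayed there in LSN order:

```c
#include <stdint.h>

/* Identifies one block touched by a WAL record (illustrative field set). */
typedef struct
{
    uint32_t spcOid;        /* tablespace OID */
    uint32_t dbOid;         /* database OID */
    uint32_t relNumber;     /* relation file number */
    uint32_t blockNum;      /* block number within the relation */
} BlockTag;

/* Deterministically picks a block worker for a block: identical blocks always
 * land on the same worker, so their WAL records stay in LSN order there. */
static int
choose_block_worker(const BlockTag *tag, int num_block_workers)
{
    uint32_t h = tag->spcOid;

    h = h * 31 + tag->dbOid;
    h = h * 31 + tag->relNumber;
    h = h * 31 + tag->blockNum;
    return (int) (h % (uint32_t) num_block_workers);
}
```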
Worker processes for parallel recovery (cont.)
● Block worker (multiple, configurable)
○ Replays the WAL records assigned to it.
○ Handling of multi-page WAL records is described later.
● Invalid page worker (dedicated)
○ The target database page may be missing or inconsistent because the table
may have been dropped by subsequent WAL.
○ Current recovery registers this as an invalid page and makes the correction when
replaying the corresponding DELETE/DROP WAL.
○ For now this is a dedicated process so that the existing code can be reused; its hash
tables live in a memory context belonging to a single process.
○ May receive queued records from the transaction and block workers.
○ Could be modified to use shared memory; in that case this worker is not
needed.
Worker processes
[Diagram: worker process tree, configured with three block workers, all children of the startup process.]
Shared data among workers
● WAL records and their parsed data are shared among workers through dynamic
shared memory (dsm).
○ Allocated by the reader worker (startup) before the other workers are forked.
○ Handling of multi-page WAL records is described later.
● Other shared data structures are accommodated in the same dsm (see the sketch below).
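A hedged sketch of how the shared area might be set up with the dsm API before the workers are forked; the function and variable names are assumptions, and the actual code in parallel_replay.c may differ:

```c
#include "postgres.h"
#include "storage/dsm.h"

static dsm_segment *pr_shared_seg;      /* assumed name: the one dsm segment for parallel replay */
static void        *pr_shared_area;     /* base address: WAL buffer, queues and status live here */

static void
pr_setup_shared_area(Size size)
{
    /* Created by the reader (startup) process before the other workers are forked. */
    pr_shared_seg  = dsm_create(size, 0);
    pr_shared_area = dsm_segment_address(pr_shared_seg);

    /* The forked workers can reach the same segment, e.g. via dsm_attach()
     * using dsm_segment_handle(pr_shared_seg). */
}
```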
Shared data among workers (cont.)
● Buffer for WAL records and their parsed results,
● Status of outstanding transactions,
○ Latest LSN assigned to each block worker,
● Status of each worker,
● Queues used to assign WAL to workers,
● History of outstanding WAL records waiting for replay,
○ Used to advance the replayed LSN.
● Data is protected using spinlocks.
○ No I/O or wait event should occur while a spinlock is held.
○ Only one spinlock should be held by a worker at a time.
■ We may have to allow exceptions though…
WAL buffer is allocated in cyclic manner
● The WAL buffer is allocated in a cyclic manner to avoid fragmentation.
● Every WAL record is eventually replayed and its data is freed.
● By allocating the WAL buffer cyclically, we don’t have to worry about memory
fragmentation.
● A buffer is allocated at the start of the main free area.
● When a buffer is freed, it is concatenated with the surrounding free area and, if
necessary, the main free area info is updated.
● If a free area sits among allocated buffers, we ignore it until it is
concatenated to the main free area (see the sketch below).
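A toy sketch of cyclic allocation from the main free area (assumed names; freeing, the tracking of free areas between allocated buffers, and full/empty disambiguation are omitted):

```c
#include <stddef.h>

/* Toy ring-style allocator state: the "main free area" runs from head to tail,
 * wrapping around the end of the buffer.  Names are illustrative, not the patch's own. */
typedef struct
{
    char   *base;       /* start of the whole WAL buffer */
    size_t  size;       /* total buffer size */
    size_t  head;       /* allocation point: start of the main free area */
    size_t  tail;       /* end of the main free area (start of the oldest live buffer) */
} PrBufferPool;

/* Returns NULL when the main free area is too small; the caller then waits for
 * other workers to replay WAL records and free their buffers. */
static void *
pr_alloc(PrBufferPool *pool, size_t len)
{
    if (pool->head >= pool->tail)           /* free space: [head, size) and [0, tail) */
    {
        if (pool->size - pool->head >= len)
        {
            void *p = pool->base + pool->head;
            pool->head += len;
            return p;
        }
        if (pool->tail > len)               /* wrap around to the beginning */
        {
            pool->head = len;
            return pool->base;
        }
    }
    else if (pool->tail - pool->head >= len)    /* free space is one contiguous chunk */
    {
        void *p = pool->base + pool->head;
        pool->head += len;
        return p;
    }
    return NULL;                            /* not enough room: wait and retry */
}
```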
WAL buffer is allocated in cyclic manner (cont.)
[Diagram: the allocation point moves cyclically through the buffer from head to tail; allocated areas grow from the start of the main free area and are freed back as WAL is replayed.]
● If a sufficiently large main free area is not available, wait until other workers replay WAL records and free their buffers.
● If all workers are waiting for the buffer, replay aborts and suggests enlarging the shared buffer (configurable through a GUC).
● With three block workers and ten queues for each worker, 4 MB of buffer looks sufficient. We may need more with more
block workers and more queue depth.
Transaction status
● Composed of the set of latest WAL LSNs assigned to each block worker for each
transaction.
● When the block worker replays that LSN, the entry is cleared.
● When the transaction worker replays commit/abort/prepare, it checks that all the WAL records for
the transaction have been replayed.
● If not, it waits until those workers replay all the assigned WAL records (mentioned later).
[Diagram: for each outstanding transaction, an array holds the latest LSN assigned to each block worker (worker 1 … worker n); the transaction worker waits until all block workers have replayed the corresponding WAL before replaying commit/abort/prepare.]
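An illustrative sketch of this bookkeeping; field and function names are assumptions, and in the actual design the structure lives in shared memory under a spinlock:

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define PR_MAX_BLOCK_WORKERS 8              /* illustrative fixed bound */

/* Per-transaction record of the latest WAL LSN handed to each block worker.
 * A block worker clears its slot once it has replayed that LSN. */
typedef struct
{
    uint32_t    xid;
    XLogRecPtr  last_assigned_lsn[PR_MAX_BLOCK_WORKERS];   /* 0 = nothing outstanding */
} PrTxnStatus;

/* The transaction worker may replay commit/abort/prepare only when every slot
 * is clear, i.e. every block worker has replayed what it was assigned. */
static int
pr_txn_fully_replayed(const PrTxnStatus *txn, int num_block_workers)
{
    for (int i = 0; i < num_block_workers; i++)
        if (txn->last_assigned_lsn[i] != 0)
            return 0;               /* outstanding WAL: wait for sync from that worker */
    return 1;
}
```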
Worker status
● Latest assigned WAL LSN,
● Latest replayed WAL LSN,
● Flags to indicate
○ whether the worker is waiting for sync from other workers,
○ whether the worker failed,
○ whether the worker terminated.
WAL history
● The recovered point can be advanced only when all preceding WAL records have been replayed.
● If any earlier WAL record has not been replayed yet, we must wait to advance the recovered point
until that record is replayed.
The WAL history tracks this status.
[Diagram: the recovered point sits at the boundary between the fully replayed prefix and records not yet replayed, and advances only when all previous WAL records have been replayed.]
Synchronization among workers
• A dedicated queue entry asking for a sync message is enqueued.
• The sync message is sent via a socket (Unix domain, UDP-style datagrams). Each worker has its own
socket to receive sync messages (see the sketch below).
• For multi-page WAL records, another mechanism is implemented for performance.
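A minimal, self-contained sketch of such a sync channel; socket paths, the message format and error handling are illustrative assumptions, not the patch's actual code:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Each worker binds its own Unix-domain datagram socket and blocks on it
 * when it needs to wait for a sync message. */
static int
pr_open_sync_socket(const char *path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);
    bind(fd, (struct sockaddr *) &addr, sizeof(addr));
    return fd;
}

/* Another worker wakes it up by sending a one-byte datagram to that path;
 * the waiting worker simply blocks in recv() on its own socket. */
static void
pr_send_sync(const char *peer_path)
{
    struct sockaddr_un addr;
    char msg = 'S';
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, peer_path, sizeof(addr.sun_path) - 1);
    sendto(fd, &msg, 1, 0, (struct sockaddr *) &addr, sizeof(addr));
    close(fd);
}
```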
Waiting for all the preceding WAL replay
● The distributor worker enqueues a request asking for synchronization.
● The distributor then reads sync data from the target worker via the socket.
● When the target worker reaches the synchronization request, it writes the synchronization
data to the socket.
Synchronization for multi-page WAL
● A multi-page WAL record is assigned to all the block workers chosen by the hash value of each
page it touches.
● The WAL record is associated with an array of the assigned workers and the number of remaining
workers (the initial value is the number of assigned block workers).
● When a block worker reaches the record, it checks whether the remaining-worker count is one (only
itself); see the sketch after this list.
○ If so, the worker replays the WAL record and then writes sync to all the other
assigned workers.
○ If not, the worker decrements the count and waits for sync from the worker that replays it.
● This guarantees that WAL replay for each page happens in LSN order.
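A small sketch of the “last worker replays” rule; names are illustrative, and the counter update must be protected by a spinlock, which is omitted here:

```c
/* Shared bookkeeping for one multi-page WAL record (illustrative). */
typedef struct
{
    int num_assigned;       /* block workers this record was queued to */
    int num_remaining;      /* workers that have not reached it yet; starts at num_assigned */
    /* ... parsed WAL record, list of assigned workers, ... */
} PrMultiPageWal;

/* Called by a block worker when it dequeues the record.
 * Returns 1 if this worker is the last one: it replays the record and then
 * sends sync to all the other assigned workers.
 * Returns 0 otherwise: the worker waits for sync from the replaying worker. */
static int
pr_pick_multipage(PrMultiPageWal *w)
{
    if (w->num_remaining == 1)      /* only myself left */
        return 1;
    w->num_remaining--;
    return 0;
}
```

Because each block worker drains its queue in LSN order, the last worker to reach the record knows that all earlier WAL for every page it touches has already been replayed.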
Synchronization for multi-page WAL (cont.)
A multi-page WAL record is queued to all the assigned block workers, one entry per page (by the
distributor worker).
[Diagram: block workers that are not last wait while some other block worker replays; the last block worker to pick up the WAL record replays it and syncs the others.]
Current status of the work and achievement
Current Status
● The code is almost done.
● It still contains a lot of debug code.
● It runs under gdb control.
● Parallelism appears to work without issues.
● Existing redo functions run without modification.
Affected major source files
Modified/Added  Source  Outline

A   src/backend/access/transam/parallel_replay.c
    src/include/access/parallel_replay.h
    Outline: main parallel recovery logic; each worker's loop.

M   src/backend/access/transam/xlog.c
    src/backend/access/transam/xlogreader.c
    src/backend/access/transam/xlogutils.c
    src/include/access/xlog.h
    src/include/access/xlogreader.h
    src/include/access/xlogutils.h
    Outline: read and analyze WAL into the shared buffer; pass WAL to the distributor worker.

M   src/backend/postmaster/postmaster.c
    src/backend/postmaster/startup.c
    src/backend/storage/lmgr/proc.c
    src/backend/utils/activity/backend_status.c
    src/backend/utils/adt/pgstatfuncs.c
    src/include/miscadmin.h
    src/include/postmaster/postmaster.h
    src/include/storage/proc.h
    src/include/utils/backend_status.h
    Outline: fork the worker processes.
Affected major source files (cont.)
Modified/Added  Source  Outline

M   src/backend/utils/misc/guc.c
    Outline: additional GUC parameters.

M   src/backend/utils/error/elog.c
    src/backend/utils/activity/backend_status.c
    src/backend/utils/misc/ps_status.c
    Outline: add the workers to logging and status reporting.
Additional major GUC parameters
Name                  Type  Default           Description
parallel_replay       bool  off               Enable parallel recovery.
num_preplay_workers   int   5                 Number of parallel replay workers, including the
                                              reader, distributor, transaction and invalid page workers.
num_preplay_queues    int   128               Number of queue elements used to pass WAL to the
                                              various workers.
num_preplay_max_txn   int   max_connections   Number of transactions that can run in parallel
                                              during recovery.
preplay_buffers       int   (calculated)      Shared buffer size, allocated using dynamic
                                              shared memory.
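A hypothetical postgresql.conf fragment using these parameters might look like this (the values are illustrative; with three block workers, num_preplay_workers counts the reader, distributor, transaction and invalid page workers plus the block workers):

```
parallel_replay = on
num_preplay_workers = 7      # reader + distributor + txn + invalid page + 3 block workers
num_preplay_queues = 128
num_preplay_max_txn = 100    # defaults to max_connections
#preplay_buffers             # left unset: defaults to a calculated value
```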
Remaining work
● Clean up test code.
● Clean up structures, which still contain some unused data from past
trial-and-error attempts.
● Validate consistent recovery point determination.
● Determine the end of WAL read.
● Port to version 16 or later.
● Run without gdb and measure the performance gain for:
○ archive recovery,
○ log-shipping replication.
Issues for further discussion/development
Issues for discussion
● Synchronization:
○ Now using Unix domain sockets directly. May need a wrapper for portability.
● Queueing:
○ Now using POSIX mqueue directly. It is very fast and simple, and it supports
many-to-one queueing.
○ Is it portable to other environments such as Windows?
● Bringing this into PG core:
○ Large amount of code: around 7k lines with debug code.
○ Maybe around 4k lines without debug code.
○ Need experts’ help to divide this into manageable smaller pieces for the
commitfest.
Resources
Source repo
https://github.com/koichi-szk/postgres.git
Branch: parallel_replay_14_6
Test tools/environment
https://github.com/koichi-szk/parallel_recovery_test.git
Branch: main
PostgreSQL Wiki:
https://wiki.postgresql.org/wiki/Parallel_Recovery
These are all public now. Please let me know if you are interested.
Thank you very much
Koichi Suzuki
koichi.suzuki@enterprisedb.com
koichi.dbms@gmail.com
Challenge in the test
● The startup process starts before the postmaster begins to accept connections.
● There is no direct means to pause it and report its PID so that gdb can attach.
● Each worker writes this information to a debug file and waits until gdb is attached and a signal file
is touched (see the sketch below).
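A minimal sketch of such a hook (file handling and names are illustrative assumptions): the worker records its PID in a debug file and spins until a signal file is touched, leaving time to attach gdb:

```c
#include <stdio.h>
#include <unistd.h>

static void
pr_wait_for_debugger(const char *debug_file, const char *signal_file)
{
    FILE *f = fopen(debug_file, "a");

    if (f)
    {
        /* e.g. "gdb -p <pid>" can be pasted into a terminal for each worker */
        fprintf(f, "worker pid %d waiting for gdb attach\n", (int) getpid());
        fclose(f);
    }

    /* Loop until someone touches the signal file after attaching gdb. */
    while (access(signal_file, F_OK) != 0)
        sleep(1);
}
```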
[Screenshot: the gdb attach commands written to the debug file are pasted into terminals; each terminal runs gdb for one worker.]