Streaming huge databases using logical decoding

Streaming Huge Databases
using Logical Decoding
Adventures of a naive programmer
Oleksandr “Alex” Shulgin, Zalando SE
Overview
● Introduction
● Problem Statement
● Approach
● Problems
● Some Numbers
● Conclusion
Introduction
What is Logical Decoding all about?
● A new feature of PostgreSQL since version 9.4.
● Allows streaming database changes in a custom format.
● Requires an Output Plugin to be written (yes, in C).
● Consistent snapshot before all the changes?
Logical Decoding
CREATE_REPLICATION_SLOT "slot1" LOGICAL <plugin_name>;
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT 'XXXXXXXX-N';
SELECT pg_export_snapshot();
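The CREATE_REPLICATION_SLOT command is issued over the replication protocol and returns a snapshot name; the other statements are what a regular SQL session uses to adopt it. A minimal worker-side sketch, assuming psycopg2 (snapshot name and connection string are illustrative):

import psycopg2

# Hypothetical worker-side sketch: adopt the snapshot exported when the
# replication slot was created, so the table export sees exactly the state
# at which logical decoding will start.
snapshot_id = "XXXXXXXX-N"   # as returned alongside CREATE_REPLICATION_SLOT

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True       # manage the transaction by hand
cur = conn.cursor()
cur.execute("BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ")
# Must be the first statement of the transaction; utility commands cannot
# take bound parameters, so the snapshot name is interpolated directly.
cur.execute("SET TRANSACTION SNAPSHOT '%s'" % snapshot_id)
# ... run the consistent table export within this transaction ...
cur.execute("ROLLBACK")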
Logical Decoding Output
{
  "action": "I",            /* INSERT */
  "relation": [
    "myschema",             /* INTO myschema.mytable(id) */
    "mytable"
  ],
  "newtuple": {
    "id": 1                 /* VALUES(1) */
  }
}
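As an illustration of consuming messages in this format, a minimal sketch (the layout is the one shown above; downstream handling is left out):

import json

def handle_change(raw_message):
    # Minimal consumer of one decoded change in the JSON format above.
    msg = json.loads(raw_message)
    schema, table = msg["relation"]
    if msg["action"] == "I":                       # INSERT
        print("INSERT INTO %s.%s" % (schema, table), msg["newtuple"])
    # "U" (UPDATE) and "D" (DELETE) would be handled analogously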
Problem Statement
Approach
[architecture diagram: worker processes (W)]
https://github.com/zalando/saiki-tawon
Command Line and Dashboard
Long Live the Snapshot
{
  "w-8ee20b3": {
    "snapshot_id": "51E426C2-1",
    "ts_start": "2016-01-14 11:02:40 UTC",
    "heartbeat": "2016-01-15 07:10:27 UTC",
    "pid": 1,
    "backend_pid": 58793
  },
  "w-fbfb655": {
    "snapshot_id": "51E426C4-1",
    "ts_start": "2016-01-14 11:02:41 UTC",
    "heartbeat": "2016-01-15 07:10:28 UTC",
    "pid": 1,
    "backend_pid": 58794
  },
  ...
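The status records suggest each worker keeps its snapshot in a long-running backend and reports a heartbeat. A hypothetical staleness check over such records (threshold and field names are illustrative, matching the sample above):

from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=5)           # illustrative threshold

def stale_workers(status, now=None):
    # Workers whose heartbeat has not been refreshed recently have
    # probably lost their snapshot-holding transaction.
    now = now or datetime.utcnow()
    fmt = "%Y-%m-%d %H:%M:%S UTC"
    return [name for name, info in status.items()
            if now - datetime.strptime(info["heartbeat"], fmt) > STALE_AFTER]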
The Source System
● Ubuntu precise (12.04), kernel 3.2.0-xx-generic
● CPU: Xeon @ 2.50 GHz, 24 cores
● RAM: 125.88 GB
● 6x HDDs (spinning drives) in RAID 1+0, 5 TB total capacity
● Data size: 3.0 TB / 17 billion rows (+ 1.8 TB of indexes)
● PostgreSQL 9.6devel
The Target System
“Things are working amazingly fast  
when you write to /dev/null.”
– proverbial wisdom
Problems?
Problems!
● OpenVPN quickly becomes the bottleneck on the laptop
Obvious solution: deploy workers closer to the database.
Docker + Mesosphere DCOS
https://zalando-techmonkeys.github.io/
Problems
● Workers quickly run out of memory
The (problematic) code:
cursor.execute("SELECT * FROM mytable")
● Invokes PQexec(), which buffers the entire result set in client memory.
● An asynchronous connection doesn’t help.
● psycopg2 is not designed to stream results.
Problems
● Invoke COPY protocol!
Corrected code:
cursor.copy_expert(
    "COPY (SELECT * FROM mytable) TO STDOUT",
    ...)
● Tried 32 MB, then 64 MB of memory per worker: still not enough...
● One of the field values was around 80 MB(!). Not much we can do about that.
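A more complete version of the corrected call, as a minimal sketch (the sink object is illustrative; copy_expert() only needs something with a write() method):

import psycopg2

class CountingSink(object):
    """Illustrative file-like sink for COPY ... TO STDOUT output."""
    def __init__(self):
        self.bytes_seen = 0
    def write(self, data):
        self.bytes_seen += len(data)     # forward to Kafka/S3/etc. instead
        return len(data)

conn = psycopg2.connect("dbname=mydb")
with conn.cursor() as cur:
    sink = CountingSink()
    # COPY streams rows incrementally, so the worker no longer has to hold
    # the whole result set in memory as it would with PQexec().
    cur.copy_expert("COPY (SELECT * FROM mytable) TO STDOUT", sink)
    print("copied", sink.bytes_seen, "bytes")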
More Problems?
● More problems with this code
The correct(?) code:
cursor.copy_expert(
    "COPY (SELECT * FROM mytable) TO STDOUT",
    ...)
● The query should really be SELECT * FROM [ONLY] myschema.mytable: schema-qualify the table and decide whether inheritance child tables are included.
NoSQL?
● How about some JSON for comparison?
SELECT row_to_json(x.*) FROM mytable AS x
● Slows down the export 2-3 times.
● Not 100% equivalent to what the output plugin emits.
● Have to write a C function for every plugin.
What if we wrote a generic function...
CREATE FUNCTION pg_logical_slot_stream_relation(
    IN slot_name NAME,
    IN relnamespace NAME,
    IN relname NAME,
    IN nochildren BOOL DEFAULT FALSE,
    VARIADIC options TEXT[] DEFAULT '{}',
    OUT data TEXT
)
RETURNS SETOF TEXT ...
The Final Code
cursor.copy_expert(
    "COPY (SELECT pg_logical_slot_stream_relation(...)) TO STDOUT",
    ...)
● Do not use SELECT … FROM pg_logical_slot_… – it materializes the whole result in the backend.
● Requires changes to core PostgreSQL.
● Ideally should not require a slot, only a snapshot.
● Slots cannot be used concurrently (yet).
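Put together, a worker's export call might look like this minimal sketch, assuming the proposed pg_logical_slot_stream_relation() patch is applied (slot, schema, table and output file names are illustrative):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn.cursor() as cur, open("mytable.stream", "w") as out:
    # Call the streaming function in the SELECT list (not in FROM), wrapped
    # in COPY, so the result is streamed rather than materialized.
    cur.copy_expert(
        "COPY (SELECT pg_logical_slot_stream_relation("
        "'slot1', 'myschema', 'mytable', false)) TO STDOUT",
        out)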
At Last: Some Numbers
6 client processes, SSL (no compression), 1 Gbit/s network interface

Query                 | Run Time           | Volume | Throughput
----------------------+--------------------+--------+-----------
SELECT * FROM …       | 7.5 h              | 2.7 TB | 105 MB/s
pglogical / JSON      | 17.5 h             | 6.7 TB | 112 MB/s
pglogical / native    | 30+ h (incomplete) | 11+ TB | 106 MB/s
Bottled Water / Avro  | 13.5 h             | 5.0 TB | 108 MB/s
Space for Improvement
In the native protocol format, pglogical_output emits metadata with every tuple.
● Metadata overhead: 5.0 TB (167%)
○ nspname + relname + attnames
● Protocol overhead: 1.5 TB (50%)
○ message type + lengths
Set plugin option relmeta_cache_size to -1.
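If this option is exposed through the proposed streaming function, it could be passed as an alternating name/value pair in the VARIADIC options argument, mirroring pg_logical_slot_get_changes(); that calling convention is an assumption here:

# Hypothetical: append the plugin option as a name/value pair.
copy_sql = ("COPY (SELECT pg_logical_slot_stream_relation("
            "'slot1', 'myschema', 'mytable', false, "
            "'relmeta_cache_size', '-1')) TO STDOUT")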
A Common Number: ~110 MB/s
● Network is apparently the bottleneck.
● What if we enable SSL compression?..
sslcompression=1?
● Nowadays this is really hard to do, because of the CRIME vulnerability.
● Older distro versions: set the OPENSSL_DEFAULT_ZLIB environment variable.
● Newer distro versions: OpenSSL is compiled without zlib support. Good luck!
● TLSv1.3 will remove support for compression.
● HINT: your streaming replication is likely to be running uncompressed.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-4929
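Where an OpenSSL build still has zlib support merely disabled by default (older Debian/Ubuntu releases), a sketch of requesting compression from libpq might look like this; sslcompression is a standard libpq connection option, and the environment variable is the distro-specific switch mentioned above:

import os
import psycopg2

# Safest is to export OPENSSL_DEFAULT_ZLIB before starting the process,
# so it is set before OpenSSL initializes its compression methods.
os.environ["OPENSSL_DEFAULT_ZLIB"] = "1"
conn = psycopg2.connect("dbname=mydb sslmode=require sslcompression=1")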
A Much Better Picture
Compression FTW!
24 client processes, SSL (with compression)
Query                 | Run Time            | Volume              | Notes
----------------------+---------------------+---------------------+----------
SELECT * FROM …       | 3 h (vs. 7.5 h)     | 2.7 TB              |
pglogical / JSON      | 7-8* h (vs. 17.5 h) | 6.7 TB              | *ordering
pglogical / native    | 8-9* h (vs. 30+ h)  | 7.2 TB (vs. 11+ TB) |
Bottled Water / Avro  | 10 h (vs. 13.5 h)   | 5.0 TB              |
In Conclusion
● Set relmeta_cache_size when using pglogical_output’s native protocol format.
● Run a benchmark to see if you need compression.
● Order tables from largest to smallest.
● Do start consuming the replication slot once the export is finished (see the sketch after this list).
● Help needed: review the proposed streaming interface!
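A sketch of consuming the slot after the export is done, using the replication protocol support for psycopg2 referenced below (merged in psycopg2 2.7); the slot name is the one used earlier:

import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect("dbname=mydb",
                        connection_factory=LogicalReplicationConnection)
cur = conn.cursor()
cur.start_replication(slot_name="slot1", decode=True)

def consume(msg):
    print(msg.payload)                      # forward downstream instead
    # Acknowledge, so the server can recycle WAL behind this position.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)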
References
● https://github.com/zalando/saiki-tawon
● https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
● https://github.com/confluentinc/bottledwater-pg/
● https://zalando-techmonkeys.github.io
● https://github.com/zalando/pg_view
● http://www.slideshare.net/AlexanderShulgin3/adding-replication-protocol-support-for-psycopg2
● http://www.postgresql.org/message-id/flat/CACACo5RNZ0OB8K...
Thank you!
Questions?