PostgreSQL + Kafka
The Delight of Change Data Capture
Jeff Klukas - Data Engineer at Simple

PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
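
As a rough illustration of what "modeling database tables as streams" can mean, the sketch below replays keyed change messages to rebuild a table, and shows how keeping only the last message per key (the idea behind compacted topics) preserves that result. The message shape loosely follows the Bottled Water examples in the slides (a key derived from the primary key, and a null value for a DELETE), simplified here to plain Python dicts. The names `replay` and `compact` are hypothetical helpers for this example, not part of Kafka or Bottled Water.

```python
from typing import Dict, List, Optional, Tuple

# A change message is (key, value); a value of None marks a delete (tombstone).
Message = Tuple[int, Optional[dict]]


def replay(log: List[Message]) -> Dict[int, dict]:
    """Rebuild the current table state by applying messages in log order."""
    table: Dict[int, dict] = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)   # DELETE: drop the row if present
        else:
            table[key] = value     # INSERT or UPDATE: last write wins
    return table


def compact(log: List[Message]) -> List[Message]:
    """Keep only the last message for each key, preserving log order."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [msg for offset, msg in enumerate(log) if last_offset[msg[0]] == offset]


log: List[Message] = [
    (56789, {"transaction_id": 56789, "amount": 20.00}),  # INSERT
    (56789, {"transaction_id": 56789, "amount": 25.00}),  # UPDATE
    (56790, {"transaction_id": 56790, "amount": 5.00}),   # INSERT
    (56789, None),                                        # DELETE
]

state = replay(log)
compacted = compact(log)
```

Replaying `compacted` yields the same table as replaying the full `log`, which is why a compacted topic can stand in for a table snapshot. This sketch keeps tombstones in the compacted log; a real broker may eventually discard them.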


  1. PostgreSQL + Kafka: The Delight of Change Data Capture. Jeff Klukas - Data Engineer at Simple
  2. Overview: commit logs (what are they?); write-ahead logging (WAL); commit logs as a data store; demo: change data capture; use cases
  3. Commit Logs. https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
  4. Commit Logs: Ordered, Immutable, Durable
  5. Commit Logs: Ordered, Immutable, Durable. In practice, old logs can be deleted or archived.
  6. Write-Ahead Logging (WAL)
  7. “WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage” (https://www.postgresql.org/docs/current/static/wal-intro.html)
  8. “Logical decoding is the process of extracting all persistent changes to a database's tables into a coherent, easy to understand format which can be interpreted without detailed knowledge of the database's internal state.” (https://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html)
  9. (no text on this slide)
  10. Topic Partitions
  11. Topics
  12. Compacted Topics
  13. https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
  14. INSERT INTO transactions VALUES (56789, 20.00);
      Bottled Water - Message Key: { "transaction_id": { "int": 56789 } }
      Bottled Water - Message Value: { "transaction_id": {"int": 56789}, "amount": {"double": 20.00} }
  15. UPDATE transactions SET amount = 25.00 WHERE transaction_id = 56789;
      Bottled Water - Message Key: { "transaction_id": { "int": 56789 } }
      Bottled Water - Message Value: { "transaction_id": {"int": 56789}, "amount": {"double": 25.00} }
  16. DELETE FROM transactions WHERE transaction_id = 56789;
      Bottled Water - Message Key: { "transaction_id": { "int": 56789 } }
      Bottled Water - Message Value: null
  17. Use Cases. Diagram: tx-service, tx-postgres
  18. Diagram: tx-service, tx-postgres, tx-pgkafka. Kafka topic: tx-pgkafka
  19. Diagram: tx-service, tx-postgres, tx-pgkafka, demux-service. Kafka topic: tx-pgkafka
  20. Diagram: tx-service, tx-postgres, tx-pgkafka, demux-service. Kafka topics: tx-pgkafka, customers-table, transactions-table
  21. Diagram: tx-service, tx-postgres, tx-pgkafka, demux-service, activity-service, activity-postgres, activity-pgkafka. Kafka topics: tx-pgkafka, customers-table, transactions-table, activity-pgkafka
  22. Diagram: tx-service, tx-postgres, tx-pgkafka, demux-service, activity-service, activity-postgres, activity-pgkafka, Amazon Redshift (Data Warehouse), Amazon S3 (Data Lake), analytics-service. Kafka topics: tx-pgkafka, customers-table, transactions-table, activity-pgkafka
  23. Same diagram as slide 22, labeled: Change Data Capture
  24. Same diagram as slide 22, labeled: Messaging
  25. Same diagram as slide 22, labeled: Analytics
  27. 27. 27 • Blog post on Simple’s CDC pipeline • https://www.simple.com/engineering • Bottled Water: https://github.com/confluentinc/bottledwater-pg • Debezium (CDC to Kafka from Postgres, MySQL, or MongoDB) • http://debezium.io/ • https://wecode.wepay.com/posts/streaming-databases-in- realtime-with-mysql-debezium-kafka • https://www.confluent.io/kafka-summit-sf17/ • Martin Kleppmann, Making Sense of Stream Processing eBook Also See…
  28. 28. Thank You 28
  29. 29. Extras 29
  30. 30. 30 The Dual Write Problem https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
  31. 31. 31 Redshift Architecture Amazon Redshift
  32. 32. Replicating to Redshift 32
  33. 33. 33 Table Schema CREATE TABLE pgkafka_txservice_transactions ( pg_lsn NUMERIC(20,0) ENCODE raw, pg_txn_id BIGINT ENCODE lzo, pg_operation CHAR(6) ENCODE bytedict, pg_txn_timestamp TIMESTAMP ENCODE lzo, ingestion_timestamp TIMESTAMP ENCODE lzo, transaction_id INT ENCODE lzo, amount NUMERIC(18,2) ENCODE lzo ) DISTKEY transaction_id SORTKEY (transaction_id, pg_lsn, pg_operation); Amazon Redshift
  34. 34. 34 Deduplication CREATE TABLE deduped LIKE pgkafka_txservice_transactions; INSERT INTO deduped SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY pg_lsn ORDER BY ingestion_timestamp DESC) FROM pgkafka_txservice_transactions ) WHERE row_number = 1; DROP TABLE pgkafka_txservice_transactions; ALTER TABLE deduped RENAME TO pgkafka_txservice_transactions; Amazon Redshift
  35. 35. 35 View of Current State CREATE VIEW current_txservice_transactions AS SELECT transaction_id, amount, FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY pg_lsn, pg_operation) AS n, COUNT(*) OVER (PARTITION BY transaction_id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS c FROM pgkafka_txservice_transactions) WHERE n = c AND pg_operation <> 'delete'; Amazon Redshift
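
The deduplication and current-state queries in the last two slides can be hard to follow in flattened form, so here is a minimal Python sketch of the same logic on in-memory rows. It assumes rows are dicts carrying the column names from the table schema slide; the sample data, the integer timestamps, and the helper names `dedupe` and `current_state` are made up for illustration and are not taken from the talk.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def dedupe(rows: List[Row]) -> List[Row]:
    """Keep one row per pg_lsn, preferring the most recently ingested copy."""
    best: Dict[int, Row] = {}
    for row in rows:
        kept = best.get(row["pg_lsn"])
        if kept is None or row["ingestion_timestamp"] > kept["ingestion_timestamp"]:
            best[row["pg_lsn"]] = row
    return sorted(best.values(), key=lambda r: r["pg_lsn"])


def current_state(rows: List[Row]) -> Dict[int, Any]:
    """Take the last change per transaction_id and drop ids whose last change is a delete."""
    latest: Dict[int, Row] = {}
    for row in sorted(rows, key=lambda r: (r["pg_lsn"], r["pg_operation"])):
        latest[row["transaction_id"]] = row  # later changes overwrite earlier ones
    return {
        tid: row["amount"]
        for tid, row in latest.items()
        if row["pg_operation"] != "delete"
    }


# Hypothetical sample: the first change was delivered twice (same pg_lsn).
rows: List[Row] = [
    {"pg_lsn": 100, "pg_operation": "insert", "ingestion_timestamp": 1, "transaction_id": 56789, "amount": 20.0},
    {"pg_lsn": 100, "pg_operation": "insert", "ingestion_timestamp": 2, "transaction_id": 56789, "amount": 20.0},
    {"pg_lsn": 110, "pg_operation": "update", "ingestion_timestamp": 3, "transaction_id": 56789, "amount": 25.0},
    {"pg_lsn": 120, "pg_operation": "insert", "ingestion_timestamp": 4, "transaction_id": 56790, "amount": 5.0},
    {"pg_lsn": 130, "pg_operation": "delete", "ingestion_timestamp": 5, "transaction_id": 56790, "amount": None},
]

deduped = dedupe(rows)
current = current_state(deduped)
```

Keying deduplication on `pg_lsn` works in this sketch because each change carries a distinct log position, so a redelivered message shows up as a second row with the same `pg_lsn`.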
