
PostgreSQL + Kafka: The Delight of Change Data Capture


PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.


  1. PostgreSQL + Kafka: The Delight of Change Data Capture. Jeff Klukas, Data Engineer at Simple
  2. Overview
     • Commit logs: what are they?
     • Write-ahead logging (WAL)
     • Commit logs as a data store
     • Demo: change data capture
     • Use cases
  3. Commit Logs (diagram from https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/)
  4. Commit Logs: ordered, immutable, durable
  5. Commit Logs: ordered, immutable, durable. In practice, old logs can be deleted or archived.
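The three properties above can be illustrated with a minimal, hypothetical append-only log in Python (the `CommitLog` class and its methods are our own sketch, not a Kafka or PostgreSQL API): records are appended in order, never modified in place, and each append is flushed to permanent storage.

```python
import json
import os
import tempfile

class CommitLog:
    """A minimal append-only commit log: ordered, immutable, durable."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        """Append one record and flush it to disk before returning."""
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable: the record survives a crash

    def read(self, offset=0):
        """Replay records in exactly the order they were written."""
        with open(self.path) as f:
            for i, line in enumerate(f):
                if i >= offset:
                    yield json.loads(line)

log = CommitLog(os.path.join(tempfile.mkdtemp(), "commit.log"))
log.append({"op": "insert", "transaction_id": 56789, "amount": 20.00})
log.append({"op": "update", "transaction_id": 56789, "amount": 25.00})
print(list(log.read()))
```

Readers never mutate the log; they only replay it from some offset, which is the same consumption model Kafka exposes.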
  6. Write-Ahead Logging (WAL)
  7. "WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage." (https://www.postgresql.org/docs/current/static/wal-intro.html)
  8. "Logical decoding is the process of extracting all persistent changes to a database's tables into a coherent, easy to understand format which can be interpreted without detailed knowledge of the database's internal state." (https://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html)
  9. (image slide)
  10. Topic Partitions
  11. Topics
  12. Compacted Topics
  13. Bottled Water (https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/)
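Log compaction, the idea behind compacted topics, can be sketched in a few lines: Kafka retains at least the latest record for each key, and a null value (a "tombstone") marks a key for removal. This is a from-scratch Python illustration of that semantics, not Kafka's actual implementation:

```python
def compact(log):
    """Return the compacted view of a keyed log: the latest value per key
    wins, and a None value (tombstone) removes the key entirely."""
    latest = {}
    for key, value in log:  # log is ordered oldest -> newest
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    (56789, {"amount": 20.00}),   # insert
    (56789, {"amount": 25.00}),   # update supersedes the insert
    (12345, {"amount": 5.00}),    # insert
    (12345, None),                # tombstone: the row was deleted
]
print(compact(log))  # only the latest surviving state per key remains
```

This is why a compacted topic is a natural home for a database table: the compacted view converges to the table's current contents.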
  14. INSERT INTO transactions VALUES (56789, 20.00);
      Bottled Water - Message Key: { "transaction_id": { "int": 56789 } }
      Bottled Water - Message Value: { "transaction_id": {"int": 56789}, "amount": {"double": 20.00} }
  15. UPDATE transactions SET amount = 25.00 WHERE transaction_id = 56789;
      Bottled Water - Message Key: { "transaction_id": { "int": 56789 } }
      Bottled Water - Message Value: { "transaction_id": {"int": 56789}, "amount": {"double": 25.00} }
  16. DELETE FROM transactions WHERE transaction_id = 56789;
      Bottled Water - Message Key: { "transaction_id": { "int": 56789 } }
      Bottled Water - Message Value: null
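The three slides above follow one convention: the message key is the row's primary key, the message value is the full row after the change, and a DELETE produces a null value, i.e. a tombstone that a compacted topic can later drop. A rough Python sketch of that mapping (the `cdc_message` helper is our own, not Bottled Water's code):

```python
def cdc_message(operation, row, pk="transaction_id"):
    """Map a row-level change to a (key, value) pair, Bottled Water style:
    key = primary-key columns; value = the full new row, or None on delete."""
    key = {pk: row[pk]}
    value = None if operation == "delete" else dict(row)
    return key, value

print(cdc_message("insert", {"transaction_id": 56789, "amount": 20.00}))
print(cdc_message("update", {"transaction_id": 56789, "amount": 25.00}))
print(cdc_message("delete", {"transaction_id": 56789, "amount": 25.00}))
```

Because the key is stable across the row's lifetime, all changes to one row land in the same partition and stay ordered.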
  17. Use Cases (diagram): tx-service, tx-postgres
  18. tx-service, tx-postgres, tx-pgkafka; Kafka topic: tx-pgkafka
  19. tx-service, tx-postgres, tx-pgkafka, demux-service; Kafka topic: tx-pgkafka
  20. tx-service, tx-postgres, tx-pgkafka, demux-service; Kafka topics: tx-pgkafka, customers-table, transactions-table
  21. tx-service, tx-postgres, tx-pgkafka, demux-service, activity-service, activity-postgres, activity-pgkafka; Kafka topics: tx-pgkafka, customers-table, transactions-table, activity-pgkafka
  22. tx-service, tx-postgres, tx-pgkafka, demux-service, activity-service, activity-postgres, activity-pgkafka, Amazon Redshift (Data Warehouse), Amazon S3 (Data Lake), analytics-service; Kafka topics: tx-pgkafka, customers-table, transactions-table, activity-pgkafka
  23. Same diagram, labeled: Change Data Capture
  24. Same diagram, labeled: Messaging
  25. Same diagram, labeled: Analytics
  26. Recap
     • Commit logs: what are they?
     • Write-ahead logging (WAL)
     • Commit logs as a data store
     • Demo: change data capture
     • Use cases
  27. Also See…
     • Blog post on Simple’s CDC pipeline: https://www.simple.com/engineering
     • Bottled Water: https://github.com/confluentinc/bottledwater-pg
     • Debezium (CDC to Kafka from Postgres, MySQL, or MongoDB): http://debezium.io/
     • https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka
     • https://www.confluent.io/kafka-summit-sf17/
     • Martin Kleppmann, Making Sense of Stream Processing (eBook)
  28. Thank You
  29. Extras
  30. The Dual Write Problem (https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/)
  31. Redshift Architecture (Amazon Redshift)
  32. Replicating to Redshift
  33. Table Schema (Amazon Redshift):
      CREATE TABLE pgkafka_txservice_transactions (
          pg_lsn NUMERIC(20,0) ENCODE raw,
          pg_txn_id BIGINT ENCODE lzo,
          pg_operation CHAR(6) ENCODE bytedict,
          pg_txn_timestamp TIMESTAMP ENCODE lzo,
          ingestion_timestamp TIMESTAMP ENCODE lzo,
          transaction_id INT ENCODE lzo,
          amount NUMERIC(18,2) ENCODE lzo
      )
      DISTKEY (transaction_id)
      SORTKEY (transaction_id, pg_lsn, pg_operation);
  34. Deduplication (Amazon Redshift):
      CREATE TABLE deduped (LIKE pgkafka_txservice_transactions);
      INSERT INTO deduped
      SELECT pg_lsn, pg_txn_id, pg_operation, pg_txn_timestamp,
             ingestion_timestamp, transaction_id, amount
      FROM (
          SELECT *, ROW_NUMBER() OVER (PARTITION BY pg_lsn
              ORDER BY ingestion_timestamp DESC) AS row_number
          FROM pgkafka_txservice_transactions
      ) t
      WHERE row_number = 1;
      DROP TABLE pgkafka_txservice_transactions;
      ALTER TABLE deduped RENAME TO pgkafka_txservice_transactions;
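Since Kafka consumers get at-least-once delivery, the same change can land in Redshift twice. The deduplication step keeps, for each pg_lsn, only the copy with the newest ingestion_timestamp. A stdlib-Python model of that same logic (our own sketch, mirroring the ROW_NUMBER ... = 1 filter):

```python
def dedupe(rows):
    """Keep one row per pg_lsn: the one with the latest ingestion_timestamp."""
    best = {}
    for row in rows:
        lsn = row["pg_lsn"]
        if lsn not in best or row["ingestion_timestamp"] > best[lsn]["ingestion_timestamp"]:
            best[lsn] = row
    return sorted(best.values(), key=lambda r: r["pg_lsn"])

rows = [
    {"pg_lsn": 100, "ingestion_timestamp": "2017-01-01T00:00:00", "amount": 20.0},
    {"pg_lsn": 100, "ingestion_timestamp": "2017-01-01T00:05:00", "amount": 20.0},  # redelivered copy
    {"pg_lsn": 200, "ingestion_timestamp": "2017-01-01T00:06:00", "amount": 25.0},
]
print(dedupe(rows))  # one row per pg_lsn survives
```

Keying on pg_lsn works because the WAL assigns each change a unique log sequence number, so two rows with the same pg_lsn are necessarily copies of the same change.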
  35. View of Current State (Amazon Redshift):
      CREATE VIEW current_txservice_transactions AS
      SELECT transaction_id, amount
      FROM (
          SELECT *,
              ROW_NUMBER() OVER (PARTITION BY transaction_id
                  ORDER BY pg_lsn, pg_operation) AS n,
              COUNT(*) OVER (PARTITION BY transaction_id
                  ROWS BETWEEN UNBOUNDED PRECEDING
                  AND UNBOUNDED FOLLOWING) AS c
          FROM pgkafka_txservice_transactions
      ) t
      WHERE n = c AND pg_operation <> 'delete';
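The view's trick is that n = c selects each transaction's final change (its row number equals the partition count), and the filter on pg_operation then hides transactions whose final change was a delete. That logic can be sanity-checked with a small Python model (our own sketch of the same semantics):

```python
def current_state(rows):
    """For each transaction_id, keep only the change with the highest
    (pg_lsn, pg_operation), then drop it if that final change was a delete."""
    last = {}
    for row in sorted(rows, key=lambda r: (r["pg_lsn"], r["pg_operation"])):
        last[row["transaction_id"]] = row  # later changes overwrite earlier ones
    return {
        tid: {"amount": row["amount"]}
        for tid, row in last.items()
        if row["pg_operation"] != "delete"
    }

rows = [
    {"transaction_id": 56789, "pg_lsn": 100, "pg_operation": "insert", "amount": 20.0},
    {"transaction_id": 56789, "pg_lsn": 200, "pg_operation": "update", "amount": 25.0},
    {"transaction_id": 12345, "pg_lsn": 300, "pg_operation": "insert", "amount": 5.0},
    {"transaction_id": 12345, "pg_lsn": 400, "pg_operation": "delete", "amount": 5.0},
]
print(current_state(rows))  # deleted rows vanish; surviving rows show latest values
```

This is the same log-compaction idea from the Kafka slides, expressed in the warehouse: replaying the full change log and keeping only each key's latest surviving state.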
