Data Stream Processing for Beginners with Kafka and CDC


Data Stream Processing for Beginners with Apache Kafka and Change Data Capture

Published in: Data & Analytics

  1. Data Stream Processing for Beginners with Apache Kafka and Change Data Capture - Abhijit Kumar
  2. Agenda
     • Intro to Data Stream Processing
     • What is Change Data Capture
     • CDC Use Cases
     • How to capture change data
     • CDC with Kafka and Kafka Connect
     • Intro to Debezium
     • Demo
  3. About Me
     • 12+ years of work experience in software development and architecture
     • Currently working as a Data Architect at Deltatre
     • Previously worked at EY, Cisco, Dell and SAP
     • Moved to Sydney from India 6 months ago
     One interesting fact about me: back in India I worked for 3 startups, and all three had successful exits (acquired by Cisco, Dell and SAP).
     Email:
  4. Data Stream Processing
     • Big data technology
     • Processing of data in motion
     • Computing on data as soon as it is produced
     • Continuous streams: sensor events, user activity on a website, financial trades, etc.
     • Data is stored in data stores only for later processing.
     • Getting a stream of data out of a traditional RDBMS is a challenge.
  5. What is CDC
     • CDC is identifying and capturing changes made to a database.
     • Change data capture records the insert, update, and delete activity that is applied to the database.
     • Earlier technologies: table differencing, change-value selection, and database triggers.
     • These were inefficient and imposed substantial overhead on source servers.
     • Log-based CDC is the approach adopted now.
     • It uses a background process to scan the database transaction logs.
  6. CDC Use Cases
     • Data Replication
     • Microservice Architecture
     • Others: Caching, Alerting, Anomaly Detection
  7. CDC Use Case: Data Replication
     • Replicate data to other DBs and keep content in sync
     • Send changes to a data processing system
     • Share DB data with other consumers/teams
  8. CDC Use Case: Microservice Architecture
     • Share data between services without coupling
     • Each microservice keeps optimised views of the data coming from the source database.
  9. CDC: Other Use Cases
     • Update caches with changes
     • Data sync between caches
     • Use Elasticsearch or Solr as a data sink to enable full-text search on database content
     • Alerting and anomaly detection
  10. How to do CDC: Legacy Approach
      • Parallel writes: the application updates different DBs at the same time.
      • Polling for changes (identifying new, deleted and updated rows in the source table)
      • Triggers (performance issues, versioning issues, maintenance issues)
  11. Preferred Way for CDC
      Monitoring the DB continuously and identifying the changes:
      • Reading the database logs
      • No inconsistencies due to failures
      • Both upstream and downstream applications are unaware of the CDC application.
  12. Database Logs for CDC
      • The DB maintains a log of changes.
      • Logs are used for transaction recovery, replication, etc.
      • MySQL: binlog, Postgres: write-ahead log (WAL), MongoDB: oplog
      • This ordered sequence of changes is turned into a stream of events for CDC.
  13. Kafka for CDC
      • Kafka key = table primary key
      • Kafka guarantees ordering (per partition)
      • Pull-based mechanism
      • Supports compaction
      • Horizontal scalability
  14. Kafka Connect
      • Tool for streaming data between Apache Kafka and other data systems
      • Framework for source and sink connectors
      • Tracks offsets: replay in case of failure
      • Rich ecosystem of connectors
  15. CDC Message Format
      • Key (primary key of the table) and value (data)
      • Payload: before and after state, plus source information
      • Messages can be serialised as JSON or Avro
  16. Debezium Connectors
      • Supports: MySQL, Postgres, MongoDB, Oracle
      • Provides a common event format (all connectors use the same format)
      • Provides monitoring support via JMX
      • Filtering and snapshot modes
  17. Demo
      Use Docker images to do the following:
      • Start ZooKeeper
      • Start Kafka
      • Start MySQL (with preloaded data)
      • Open a MySQL terminal
      • Start the Kafka Connect service
      • Register and start the Debezium MySQL connector
      • Watch the Kafka topic
      • Modify records in MySQL and view the captured data changes in the Kafka topic
  18. What to Do with CDC Events
      • CDC data can be transformed with a stream application
      • Kafka Streams applications for Java and Scala developers
      • KSQL can be used by non-developers
      • Kafka Connect to sink the data
  19. Do It Yourself: Docker Images • • • •
  20. “Thank You” - Abhijit Kumar
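
As a follow-up to the "CDC Message Format" and "Data Replication" slides, here is a minimal Python sketch of how a consumer might apply change events to keep a replica in sync. It is not the code from the demo. It assumes a Debezium-style payload that has already been unwrapped from any schema wrapper, with `before`, `after`, `source` and an `op` code (`c` = create, `u` = update, `d` = delete, `r` = snapshot read). The sample events, the `customers` table, the plain integer keys and the helper names `apply_change` and `replay` are made up for illustration; a real consumer would read the key/value pairs from a Kafka topic.

```python
import json

# Hypothetical sample events: (key, value) pairs as they might arrive from a
# Kafka topic. The key is the table's primary key (simplified to an int here),
# the value is a JSON string, and None stands for a null-value tombstone.
SAMPLE_EVENTS = [
    (1001, json.dumps({"before": None,
                       "after": {"id": 1001, "email": "a@example.com"},
                       "source": {"table": "customers"}, "op": "c"})),
    (1002, json.dumps({"before": None,
                       "after": {"id": 1002, "email": "c@example.com"},
                       "source": {"table": "customers"}, "op": "c"})),
    (1001, json.dumps({"before": {"id": 1001, "email": "a@example.com"},
                       "after": {"id": 1001, "email": "b@example.com"},
                       "source": {"table": "customers"}, "op": "u"})),
    (1002, json.dumps({"before": {"id": 1002, "email": "c@example.com"},
                       "after": None,
                       "source": {"table": "customers"}, "op": "d"})),
    (1002, None),  # tombstone that may follow a delete
]


def apply_change(replica, key, raw_value):
    """Apply one change event to a dict that mirrors the source table."""
    if raw_value is None:
        # Tombstone: make sure the row is gone (no-op if already deleted).
        replica.pop(key, None)
        return
    event = json.loads(raw_value)
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, or update: the "after" state is the new row.
        replica[key] = event["after"]
    elif op == "d":
        # Delete: "after" is null, so drop the row by its primary key.
        replica.pop(key, None)
    else:
        raise ValueError(f"unexpected op code: {op!r}")


def replay(events):
    """Replay events in order and return the resulting replica."""
    replica = {}
    for key, raw_value in events:
        apply_change(replica, key, raw_value)
    return replica


if __name__ == "__main__":
    print(replay(SAMPLE_EVENTS))
```

Because every event is keyed by the primary key and Kafka preserves order within a partition, replaying the events in order is enough to rebuild the current state of each row. That same property is what makes log compaction safe for CDC topics: only the latest event per key is needed to reconstruct the row.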