Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data Stream Processing for Beginners with Kafka and CDC


Published on

Data Stream Processing for Beginners with Apache Kafka and Change Data Capture

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Data Stream Processing for Beginners with Kafka and CDC

  1. 1. Data Stream Processing for Beginners with Apache Kafka and Change Data Capture -Abhijit Kumar
  2. 2. Agenda • Intro to Data Stream Processing • What is Change Data Capture • CDC Usecases • How to capture change data • CDC with Kafka and Kafka Connect • Intro to Debezium • Demo
  3. 3. About Me • 12+ years of work experience in Software Development and Architect • Currently working as a Data Architect at Deltatre • Previously worked at EY, Cisco, Dell and SAP • Moved to Sydney 6 months back from India One iinteresting fact about me: Back in India I worked for 3 startups and all three had a successful exits (Startups acquired by Cisco, Dell and SAP) Email:
  4. 4. Data Stream Processing • Big data technology • Processing of data in motion • Computing on data as soon as it is produced • Continuous streams: sensor events, user activity on a website, financial trades, etc • Data is only stored in data stores for processing later. • Getting stream of data from traditional RDBMS is a challenge.
  5. 5. What is CDC • CDC is identifying and capturing changes made to a database. • Change data capture records insert, update, and delete activity that is applied • Earlier technologies: Table differencing, change-value selection, and database triggers. • Inefficient and had substantial overhead on source servers • Log-based cdc is adopted now • Utilises a background process to scan database transaction logs
  6. 6. CDC Usecases • Data Replication • Microservice Architecture • Others: Caching, Alerting, Anomaly Detection
  7. 7. CDC Use Case: Data Replication • Replicate data to other DBs and keep content in sync • Send changes to Data Processing System • Sharing DB with other consumers/teams
  8. 8. CDC Usecase: Microservice Architecture • Share data between services without coupling • Each Microservices service keeps optimised views of data coming from source data base.
  9. 9. CDC Other Usecase • Update caches with changes • Data sync between caching • Using Elasticsearch or Solr as data sink to enable full text search on database • Alert and anomaly detection
  10. 10. How to do CDC: Legacy Approach • Parallel writes: Application level update different DBs at the same time. • Polling for changes (identifying the new, delete and update at source table) • Triggers (Performance issues, versioning issues, maintenance issue)
  11. 11. Preferred way for CDC Monitoring the DB continuously and identifying the changes: • Reading the database logs • No inconsistencies due to failure • Both upstream and downstream applications are unaware of this application.
  12. 12. Database logs for CDC • DB maintains log of changes. • Logs are used for TX recovery, replication, etc • Mysql - binlog, Postgres - write-ahead log, MongoDB- op log • These ordered sequence of changes are created into stream events for CDC.
  13. 13. Kafka for CDC • Kafka Key - Table Primary Key • Kafka guarantees ordering (per partition) • Pull based mechanism • Supports compaction • Horizontal scalability
  14. 14. Kafka Connect • Tool for streaming data between Apache Kafka and other data systems. • Framework for source and sink connectors • Tracks offsets: Replay in case of failure • Rich eco-system of connector
  15. 15. CDC Message Format • Key (Primary key of table ) and Value (Data) • Payload: Before and After state and Source information • Message can be wrapped in JSON and AVRO format
  16. 16. Debezium Connectors • Supports: MySQL, Postgres, MongoDB, Oracle • Provides Common event format (all connectors have same format) • Provides monitoring support via JMX • Filtering and snapshot modes
  17. 17. Demo Use docker images to start following: • Start Zookeeper • Kafka • Start Mysql (preloaded data) • Mysql terminal • Kafka Connect Service • Register and start Debezium-mysql connector • Watch Kafka topic • Modify records in mysql and view the captured data change in Kafka topic
  18. 18. What to do with CDC events • Transformation of cdc data can be done with Stream Application • Kafka Stream application for Java and Scala developer • KSQL can be used for non-developers • Kafka Connect to sink data
  19. 19. Do it yourself Docker Images • • • •
  20. 20. –Abhijit Kumar “Thank You”