Data Stream Processing for Beginners with
Apache Kafka and Change Data Capture
-Abhijit Kumar
https://au.linkedin.com/in/abhijitkumar1
Agenda
• Intro to Data Stream Processing
• What is Change Data Capture
• CDC Use Cases
• How to capture change data
• CDC with Kafka and Kafka Connect
• Intro to Debezium
• Demo
About Me
• 12+ years of work experience in software
development and architecture
• Currently working as a Data Architect at
Deltatre
• Previously worked at EY, Cisco, Dell and SAP
• Moved to Sydney from India 6 months ago
One interesting fact about me:
Back in India I worked for 3 startups, and all three
had successful exits (acquired by
Cisco, Dell and SAP)
https://au.linkedin.com/in/abhijitkumar1
Email: abhijitk.connect@gmail.com
Data Stream Processing
• Big data technology
• Processing of data in motion
• Computing on data as soon as it is produced
• Continuous streams: sensor events, user activity on a website,
financial trades, etc.
• Unlike batch processing, data is not first stored in a data store and
processed later.
• Getting a stream of data out of a traditional RDBMS is a challenge.
What is CDC
• CDC is the process of identifying and capturing changes made to a database.
• Change data capture records the insert, update, and delete activity
applied to tables.
• Earlier techniques: table differencing, change-value selection,
and database triggers.
• These were inefficient and put substantial overhead on source servers.
• Log-based CDC is the approach adopted today:
• it utilises a background process to scan the database transaction logs.
CDC Usecases
• Data Replication
• Microservice Architecture
• Others: Caching, Alerting, Anomaly Detection
CDC Use Case: Data Replication
• Replicate data to other DBs and keep content in sync
• Send changes to a data processing system
• Share the DB's data with other consumers/teams
CDC Use Case: Microservice Architecture
• Share data between services without coupling
• Each microservice keeps an optimised view of the data
coming from the source database.
Other CDC Use Cases
• Update caches with changes (see the sketch below)
• Keep data in sync between caches
• Use Elasticsearch or Solr as a data sink to enable full-text
search over database content
• Alerting and anomaly detection
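To make the caching use case concrete, here is a minimal sketch of a consumer that keeps an in-process map in sync with change events. The topic name (`dbserver1.inventory.customers`), string serialisation, and the convention that a null value (tombstone) means a delete are assumptions for illustration, not from the talk.

```java
import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Keep a local cache in sync by consuming CDC events from a Kafka topic.
public class CacheUpdater {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cache-updater");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, String> cache = new ConcurrentHashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    if (rec.value() == null) {
                        cache.remove(rec.key());           // tombstone => row deleted
                    } else {
                        cache.put(rec.key(), rec.value()); // insert/update => refresh entry
                    }
                }
            }
        }
    }
}
```

The same pattern materialises a local view for a microservice, or feeds a search index when pointed at Elasticsearch instead of a map.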
How to Do CDC: Legacy Approaches
• Parallel writes: the application updates different DBs at
the same time.
• Polling for changes: identifying new, updated, and deleted
rows in the source table (see the sketch below).
• Triggers: performance issues, versioning issues,
maintenance issues.
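For illustration, a minimal sketch of the polling approach using plain JDBC. The `customers` table, its `updated_at` column, and the connection details are hypothetical. Note the inherent weaknesses: intermediate updates between polls are missed, and deletes are invisible unless rows are soft-deleted.

```java
import java.sql.*;

// Legacy CDC via polling: repeatedly select rows changed since the last poll.
public class PollingCdc {
    public static void main(String[] args) throws Exception {
        Timestamp lastSeen = new Timestamp(0); // start from the beginning
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/inventory", "user", "password")) {
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, name, updated_at FROM customers"
                        + " WHERE updated_at > ? ORDER BY updated_at")) {
                    ps.setTimestamp(1, lastSeen);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("changed row: id=%d name=%s%n",
                                    rs.getLong("id"), rs.getString("name"));
                            lastSeen = rs.getTimestamp("updated_at"); // advance watermark
                        }
                    }
                }
                Thread.sleep(5_000); // poll interval trades latency against source load
            }
        }
    }
}
```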
Preferred way for CDC
Monitor the DB continuously and identify the changes by:
• Reading the database logs
• No inconsistencies in case of failure
• Transparent: both upstream and downstream applications are
unaware of the CDC process.
Database logs for CDC
• The DB maintains a log of changes.
• Logs are used for transaction recovery, replication, etc.
• MySQL: binlog; Postgres: write-ahead log (WAL); MongoDB: oplog
(see below for a quick MySQL check)
• This ordered sequence of changes is turned into a stream
of events for CDC.
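Applications don't normally read these logs directly, but you can at least confirm they are enabled before attempting CDC. A small sketch for MySQL, assuming a locally reachable instance with illustrative credentials:

```java
import java.sql.*;

// Check whether MySQL's binlog (the change log a CDC tool reads) is enabled.
public class BinlogCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/inventory", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'log_bin'")) {
            while (rs.next()) {
                // "ON" means row changes are being written to the binlog
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
        }
    }
}
```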
Kafka for CDC
• Kafka message key = table primary key (see the sketch below)
• Kafka guarantees ordering (per partition)
• Pull-based mechanism
• Supports log compaction (keep only the latest value per key)
• Horizontal scalability
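A minimal sketch of why the primary key makes a good message key: all changes for one row go to the same partition, so they stay ordered, and a compacted topic can retain just the latest state per row. The topic name and payloads here are illustrative.

```java
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

// Keying change events by the table's primary key keeps all changes for one
// row in one partition (ordered), and lets compaction retain the latest state.
public class KeyedChangeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // key = primary key "1001": successive updates for the same row
            // stay in order and compact down to the last value
            producer.send(new ProducerRecord<>("inventory.customers", "1001",
                    "{\"id\":1001,\"name\":\"Sally\"}"));
            producer.send(new ProducerRecord<>("inventory.customers", "1001",
                    "{\"id\":1001,\"name\":\"Sally Thomas\"}"));
        }
    }
}
```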
Kafka Connect
• Tool for streaming data between Apache Kafka and other
data systems.
• Framework for source and sink connectors
• Tracks offsets: replay in case of failure
• Rich ecosystem of connectors
CDC Message Format
• Key (primary key of the table) and Value (the data)
• Payload: before and after state, plus source information
(see the example below)
• Messages can be serialised in JSON or Avro format
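For example, a Debezium-style update event carries the row's state before and after the change plus source metadata. A sketch using Jackson to unwrap the envelope; the field values here are made up:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Extract old/new row state from a Debezium-style change event envelope.
public class ChangeEventReader {
    public static void main(String[] args) throws Exception {
        // Illustrative update event: before/after row images plus source metadata
        String json = """
            {"payload": {
               "op": "u",
               "before": {"id": 1001, "name": "Sally"},
               "after":  {"id": 1001, "name": "Sally Thomas"},
               "source": {"db": "inventory", "table": "customers"}
            }}""";

        JsonNode payload = new ObjectMapper().readTree(json).get("payload");
        System.out.println("operation: " + payload.get("op").asText()); // c/u/d
        System.out.println("before:    " + payload.get("before"));
        System.out.println("after:     " + payload.get("after"));
    }
}
```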
Debezium Connectors
• Supports MySQL, Postgres, MongoDB and Oracle
• Provides a common event format (all connectors emit the
same format)
• Provides monitoring support via JMX
• Supports filtering and snapshot modes
Demo
Use Docker images to start the following:
• Start ZooKeeper
• Start Kafka
• Start MySQL (with preloaded data)
• Open a MySQL terminal
• Start the Kafka Connect service
• Register and start the Debezium MySQL connector (see the sketch after this list)
• Watch the Kafka topic
• Modify records in MySQL and view the captured data changes in the Kafka topic
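The registration step posts a JSON configuration to Kafka Connect's REST endpoint (`POST /connectors`). A sketch using Java's built-in HttpClient; the hostnames, credentials, and property values follow the Debezium MySQL tutorial of that era (newer Debezium releases rename some properties, e.g. `database.whitelist` became `database.include.list`), so treat them as assumptions for your own setup.

```java
import java.net.URI;
import java.net.http.*;

// Register the Debezium MySQL connector with the Kafka Connect REST API.
public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        String config = """
            {"name": "inventory-connector",
             "config": {
               "connector.class": "io.debezium.connector.mysql.MySqlConnector",
               "database.hostname": "mysql",
               "database.port": "3306",
               "database.user": "debezium",
               "database.password": "dbz",
               "database.server.id": "184054",
               "database.server.name": "dbserver1",
               "database.whitelist": "inventory",
               "database.history.kafka.bootstrap.servers": "kafka:9092",
               "database.history.kafka.topic": "schema-changes.inventory"
             }}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}
```

Once the connector is running, change events appear on per-table topics named after the server and table, e.g. `dbserver1.inventory.customers`.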
What to do with CDC events
• CDC data can be transformed with a stream processing
application
• Kafka Streams for Java and Scala developers (see the sketch below)
• KSQL for non-developers
• Kafka Connect to sink the data into other systems
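A minimal Kafka Streams sketch: read the change topic from the demo, keep only the `after` row state, and forward it to a downstream topic. The topic names and the JSON envelope shape are carried over from the earlier assumptions.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

// Transform CDC events: extract the new row state and forward it downstream.
public class CdcTransform {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-transform");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        ObjectMapper mapper = new ObjectMapper();
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> changes = builder.stream("dbserver1.inventory.customers");
        changes
            .mapValues(value -> {
                try {   // keep only the new row state from the change envelope
                    return mapper.readTree(value).at("/payload/after").toString();
                } catch (Exception e) {
                    return null; // skip unparsable records and tombstones
                }
            })
            .filter((key, value) -> value != null)
            .to("customers-latest");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```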
Do it yourself
Docker images and docs
• https://hub.docker.com/u/debezium/
• https://github.com/debezium/docker-images
• https://github.com/confluentinc/cp-docker-images
• https://docs.confluent.io/current/connect/managing/connectors.html
–Abhijit Kumar
“Thank You”
