SlideShare a Scribd company logo
Data Stream Processing for Beginners with
Apache Kafka and Change Data Capture
-Abhijit Kumar
https://au.linkedin.com/in/abhijitkumar1
Agenda
• Intro to Data Stream Processing
• What is Change Data Capture
• CDC Usecases
• How to capture change data
• CDC with Kafka and Kafka Connect
• Intro to Debezium
• Demo
About Me
• 12+ years of work experience in Software
Development and Architect
• Currently working as a Data Architect at
Deltatre
• Previously worked at EY, Cisco, Dell and SAP
• Moved to Sydney 6 months back from India
One iinteresting fact about me:
Back in India I worked for 3 startups and all three
had a successful exits (Startups acquired by
Cisco, Dell and SAP)
https://au.linkedin.com/in/abhijitkumar1
Email: abhijitk.connect@gmail.com
Data Stream Processing
• Big data technology
• Processing of data in motion
• Computing on data as soon as it is produced
• Continuous streams: sensor events, user activity on a website,
financial trades, etc
• Data is only stored in data stores for processing later.
• Getting stream of data from traditional RDBMS is a challenge.
What is CDC
• CDC is identifying and capturing changes made to a database.
• Change data capture records insert, update, and delete activity that
is applied
• Earlier technologies: Table differencing, change-value selection,
and database triggers.
• Inefficient and had substantial overhead on source servers
• Log-based cdc is adopted now
• Utilises a background process to scan database transaction logs
CDC Usecases
• Data Replication
• Microservice Architecture
• Others: Caching, Alerting, Anomaly Detection
CDC Use Case: Data
Replication
• Replicate data to other DBs and keep content in sync
• Send changes to Data Processing System
• Sharing DB with other consumers/teams
CDC Usecase: Microservice
Architecture
• Share data between services without coupling
• Each Microservices service keeps optimised views of data
coming from source data base.
CDC Other Usecase
• Update caches with changes
• Data sync between caching
• Using Elasticsearch or Solr as data sink to enable full
text search on database
• Alert and anomaly detection
How to do CDC: Legacy
Approach
• Parallel writes: Application level update different DBs at
the same time.
• Polling for changes (identifying the new, delete and
update at source table)
• Triggers (Performance issues, versioning issues,
maintenance issue)
Preferred way for CDC
Monitoring the DB continuously and identifying the changes:
• Reading the database logs
• No inconsistencies due to failure
• Both upstream and downstream applications are unaware of this
application.
Database logs for CDC
• DB maintains log of changes.
• Logs are used for TX recovery, replication, etc
• Mysql - binlog, Postgres - write-ahead log, MongoDB- op
log
• These ordered sequence of changes are created into
stream events for CDC.
Kafka for CDC
• Kafka Key - Table Primary Key
• Kafka guarantees ordering (per partition)
• Pull based mechanism
• Supports compaction
• Horizontal scalability
Kafka Connect
• Tool for streaming data between Apache Kafka and other
data systems.
• Framework for source and sink connectors
• Tracks offsets: Replay in case of failure
• Rich eco-system of connector
CDC Message Format
• Key (Primary key of table ) and Value (Data)
• Payload: Before and After state and Source information
• Message can be wrapped in JSON and AVRO format
Debezium Connectors
• Supports: MySQL, Postgres, MongoDB, Oracle
• Provides Common event format (all connectors have
same format)
• Provides monitoring support via JMX
• Filtering and snapshot modes
Demo
Use docker images to start following:
• Start Zookeeper
• Kafka
• Start Mysql (preloaded data)
• Mysql terminal
• Kafka Connect Service
• Register and start Debezium-mysql connector
• Watch Kafka topic
• Modify records in mysql and view the captured data change in Kafka topic
What to do with CDC events
• Transformation of cdc data can be done with Stream
Application
• Kafka Stream application for Java and Scala developer
• KSQL can be used for non-developers
• Kafka Connect to sink data
Do it yourself
Docker Images
• https://hub.docker.com/u/debezium/
• https://github.com/debezium/docker-images
• https://github.com/confluentinc/cp-docker-images
• https://docs.confluent.io/current/connect/managing/connectors.html
–Abhijit Kumar
“Thank You”

More Related Content

What's hot

Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB Meetup
Johannes Moser
 
The Rise of Streaming SQL
The Rise of Streaming SQLThe Rise of Streaming SQL
The Rise of Streaming SQL
Sriskandarajah Suhothayan
 
Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019
Petr Zapletal
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data Modeling
DATAVERSITY
 
SQL vs. NoSQL. It's always a hard choice.
SQL vs. NoSQL. It's always a hard choice.SQL vs. NoSQL. It's always a hard choice.
SQL vs. NoSQL. It's always a hard choice.
Denis Reznik
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
Ike Ellis
 
Cloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure toolsCloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure tools
Pushkar Chivate
 
Geek Sync | Data Integrity Demystified - Deborah Melkin | IDERA
Geek Sync | Data Integrity Demystified - Deborah Melkin | IDERAGeek Sync | Data Integrity Demystified - Deborah Melkin | IDERA
Geek Sync | Data Integrity Demystified - Deborah Melkin | IDERA
IDERA Software
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
Viet-Trung TRAN
 
NoSql Brownbag
NoSql BrownbagNoSql Brownbag
NoSql Brownbag
Sandeep Kumar
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
Hari Shreedharan
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
Adam Doyle
 
Sql server etl framework
Sql server etl frameworkSql server etl framework
Sql server etl framework
nijs
 
Data Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and SparkData Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and Spark
Anant Corporation
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing Venice
Yan Yan
 
Cassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate
Cassandra Lunch #87: Recreating Cassandra.api using Astra and StargateCassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate
Cassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate
Anant Corporation
 
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
Marek Maśko
 
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Todd Fritz
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
Adam Doyle
 
Kafka Summit SF 2017 - Building Event-Driven Services with Stateful Streams
Kafka Summit SF 2017 - Building Event-Driven Services with Stateful StreamsKafka Summit SF 2017 - Building Event-Driven Services with Stateful Streams
Kafka Summit SF 2017 - Building Event-Driven Services with Stateful Streams
confluent
 

What's hot (20)

Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB Meetup
 
The Rise of Streaming SQL
The Rise of Streaming SQLThe Rise of Streaming SQL
The Rise of Streaming SQL
 
Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019Change Data Capture - Scale by the Bay 2019
Change Data Capture - Scale by the Bay 2019
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data Modeling
 
SQL vs. NoSQL. It's always a hard choice.
SQL vs. NoSQL. It's always a hard choice.SQL vs. NoSQL. It's always a hard choice.
SQL vs. NoSQL. It's always a hard choice.
 
Survey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data LandscapeSurvey of the Microsoft Azure Data Landscape
Survey of the Microsoft Azure Data Landscape
 
Cloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure toolsCloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure tools
 
Geek Sync | Data Integrity Demystified - Deborah Melkin | IDERA
Geek Sync | Data Integrity Demystified - Deborah Melkin | IDERAGeek Sync | Data Integrity Demystified - Deborah Melkin | IDERA
Geek Sync | Data Integrity Demystified - Deborah Melkin | IDERA
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
 
NoSql Brownbag
NoSql BrownbagNoSql Brownbag
NoSql Brownbag
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Sql server etl framework
Sql server etl frameworkSql server etl framework
Sql server etl framework
 
Data Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and SparkData Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and Spark
 
Introducing Venice
Introducing VeniceIntroducing Venice
Introducing Venice
 
Cassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate
Cassandra Lunch #87: Recreating Cassandra.api using Astra and StargateCassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate
Cassandra Lunch #87: Recreating Cassandra.api using Astra and Stargate
 
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
SQLSaturday 664 - Troubleshoot SQL Server performance problems like a Microso...
 
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, SparkReactive Fast Data & the Data Lake with Akka, Kafka, Spark
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
 
Kafka Summit SF 2017 - Building Event-Driven Services with Stateful Streams
Kafka Summit SF 2017 - Building Event-Driven Services with Stateful StreamsKafka Summit SF 2017 - Building Event-Driven Services with Stateful Streams
Kafka Summit SF 2017 - Building Event-Driven Services with Stateful Streams
 

Similar to Data Stream Processing for Beginners with Kafka and CDC

Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
Databricks
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
AboutYouGmbH
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
Lars Albertsson
 
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Modern ETL: Azure Data Factory, Data Lake, and SQL DatabaseModern ETL: Azure Data Factory, Data Lake, and SQL Database
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Eric Bragas
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
Rob Winters
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Sam Palani
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
sqlserver.co.il
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
David Smelker
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Migrating to Amazon RDS with Database Migration Service
Migrating to Amazon RDS with Database Migration ServiceMigrating to Amazon RDS with Database Migration Service
Migrating to Amazon RDS with Database Migration Service
Amazon Web Services
 
Scylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the DatabaseScylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the Database
ScyllaDB
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Databricks
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
Kent Graziano
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
John D Almon
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Alluxio, Inc.
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Building cloud native data microservice
Building cloud native data microserviceBuilding cloud native data microservice
Building cloud native data microservice
Nilanjan Roy
 
Cloud-native Data
Cloud-native DataCloud-native Data
Cloud-native Data
cornelia davis
 
Cloud-Native-Data with Cornelia Davis
Cloud-Native-Data with Cornelia DavisCloud-Native-Data with Cornelia Davis
Cloud-Native-Data with Cornelia Davis
VMware Tanzu
 
Efficient & effective data management for research projects : ILRI's Data Ma...
Efficient & effective  data management for research projects : ILRI's Data Ma...Efficient & effective  data management for research projects : ILRI's Data Ma...
Efficient & effective data management for research projects : ILRI's Data Ma...
CIARD Movement
 

Similar to Data Stream Processing for Beginners with Kafka and CDC (20)

Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Modern ETL: Azure Data Factory, Data Lake, and SQL DatabaseModern ETL: Azure Data Factory, Data Lake, and SQL Database
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Migrating to Amazon RDS with Database Migration Service
Migrating to Amazon RDS with Database Migration ServiceMigrating to Amazon RDS with Database Migration Service
Migrating to Amazon RDS with Database Migration Service
 
Scylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the DatabaseScylla Summit 2016: Compose on Containing the Database
Scylla Summit 2016: Compose on Containing the Database
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Building cloud native data microservice
Building cloud native data microserviceBuilding cloud native data microservice
Building cloud native data microservice
 
Cloud-native Data
Cloud-native DataCloud-native Data
Cloud-native Data
 
Cloud-Native-Data with Cornelia Davis
Cloud-Native-Data with Cornelia DavisCloud-Native-Data with Cornelia Davis
Cloud-Native-Data with Cornelia Davis
 
Efficient & effective data management for research projects : ILRI's Data Ma...
Efficient & effective  data management for research projects : ILRI's Data Ma...Efficient & effective  data management for research projects : ILRI's Data Ma...
Efficient & effective data management for research projects : ILRI's Data Ma...
 

Recently uploaded

一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
Do People Really Know Their Fertility Intentions?  Correspondence between Sel...Do People Really Know Their Fertility Intentions?  Correspondence between Sel...
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
Xiao Xu
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
newdirectionconsulta
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
hiju9823
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 

Recently uploaded (20)

一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
Do People Really Know Their Fertility Intentions?  Correspondence between Sel...Do People Really Know Their Fertility Intentions?  Correspondence between Sel...
Do People Really Know Their Fertility Intentions? Correspondence between Sel...
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service LucknowCall Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
Call Girls Lucknow 0000000000 Independent Call Girl Service Lucknow
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 

Data Stream Processing for Beginners with Kafka and CDC

  • 1. Data Stream Processing for Beginners with Apache Kafka and Change Data Capture -Abhijit Kumar https://au.linkedin.com/in/abhijitkumar1
  • 2. Agenda • Intro to Data Stream Processing • What is Change Data Capture • CDC Usecases • How to capture change data • CDC with Kafka and Kafka Connect • Intro to Debezium • Demo
  • 3. About Me • 12+ years of work experience in Software Development and Architect • Currently working as a Data Architect at Deltatre • Previously worked at EY, Cisco, Dell and SAP • Moved to Sydney 6 months back from India One iinteresting fact about me: Back in India I worked for 3 startups and all three had a successful exits (Startups acquired by Cisco, Dell and SAP) https://au.linkedin.com/in/abhijitkumar1 Email: abhijitk.connect@gmail.com
  • 4. Data Stream Processing • Big data technology • Processing of data in motion • Computing on data as soon as it is produced • Continuous streams: sensor events, user activity on a website, financial trades, etc • Data is only stored in data stores for processing later. • Getting stream of data from traditional RDBMS is a challenge.
  • 5. What is CDC • CDC is identifying and capturing changes made to a database. • Change data capture records insert, update, and delete activity that is applied • Earlier technologies: Table differencing, change-value selection, and database triggers. • Inefficient and had substantial overhead on source servers • Log-based cdc is adopted now • Utilises a background process to scan database transaction logs
  • 6. CDC Usecases • Data Replication • Microservice Architecture • Others: Caching, Alerting, Anomaly Detection
  • 7. CDC Use Case: Data Replication • Replicate data to other DBs and keep content in sync • Send changes to Data Processing System • Sharing DB with other consumers/teams
  • 8. CDC Usecase: Microservice Architecture • Share data between services without coupling • Each Microservices service keeps optimised views of data coming from source data base.
  • 9. CDC Other Usecase • Update caches with changes • Data sync between caching • Using Elasticsearch or Solr as data sink to enable full text search on database • Alert and anomaly detection
  • 10. How to do CDC: Legacy Approach • Parallel writes: Application level update different DBs at the same time. • Polling for changes (identifying the new, delete and update at source table) • Triggers (Performance issues, versioning issues, maintenance issue)
  • 11. Preferred way for CDC Monitoring the DB continuously and identifying the changes: • Reading the database logs • No inconsistencies due to failure • Both upstream and downstream applications are unaware of this application.
  • 12. Database logs for CDC • DB maintains log of changes. • Logs are used for TX recovery, replication, etc • Mysql - binlog, Postgres - write-ahead log, MongoDB- op log • These ordered sequence of changes are created into stream events for CDC.
  • 13. Kafka for CDC • Kafka Key - Table Primary Key • Kafka guarantees ordering (per partition) • Pull based mechanism • Supports compaction • Horizontal scalability
  • 14. Kafka Connect • Tool for streaming data between Apache Kafka and other data systems. • Framework for source and sink connectors • Tracks offsets: Replay in case of failure • Rich eco-system of connector
  • 15. CDC Message Format • Key (Primary key of table ) and Value (Data) • Payload: Before and After state and Source information • Message can be wrapped in JSON and AVRO format
  • 16. Debezium Connectors • Supports: MySQL, Postgres, MongoDB, Oracle • Provides Common event format (all connectors have same format) • Provides monitoring support via JMX • Filtering and snapshot modes
  • 17. Demo Use docker images to start following: • Start Zookeeper • Kafka • Start Mysql (preloaded data) • Mysql terminal • Kafka Connect Service • Register and start Debezium-mysql connector • Watch Kafka topic • Modify records in mysql and view the captured data change in Kafka topic
  • 18. What to do with CDC events • Transformation of cdc data can be done with Stream Application • Kafka Stream application for Java and Scala developer • KSQL can be used for non-developers • Kafka Connect to sink data
  • 19. Do it yourself Docker Images • https://hub.docker.com/u/debezium/ • https://github.com/debezium/docker-images • https://github.com/confluentinc/cp-docker-images • https://docs.confluent.io/current/connect/managing/connectors.html