SlideShare a Scribd company logo
Change Data Capture
The road thus far...
Agenda
● What is CDC?
● Why should we CDC?
● High Level Architecture
● The Stack
○ Kafka Connect
○ Debezium
○ Schema Registry
○ Avro
● Open Issues
● Future Steps
● Demo
Agenda
What is
CDC?
● A set of software design patterns used to determine
(and track) the data that has changed so that action
can be taken using the changed data
● An approach to data integration that is based on the
identification, capture and delivery of the changes
made to enterprise data sources.
● feed the data into a central hub of data streams,
where it can readily be combined with event streams
and data from other databases in real-time
● Continuous data integration paradigm.
Agenda
Why
Should we
do it?
● To support data lake driven architecture
○ Current architecture is obsolete
○ Not best practice (deletes)
○ Not incremental - Overwrites everything
○ Freshness of Data
○ Being done behind your backs!
● Same data <> different representations & different
needs
○ Data lake
○ Search index
○ Cache
● To Support event driven architecture & sourcing
○ Review Created, Token Deleted ETC……..
Change Data Capture
BREAKS DATABASE ENCAPSULATION
But hey, we’ve being doing it for the
past 3 years...
Treat Your Data Models Like you
would with your APIs!
Or at least use the tools that would
enable you to do so
Change Data Capture
High Level Architecture
High Level Architecture
THIS IS POWERFUL IN SO MANY
WAYS WE CAN ONLY IMAGINE!
Kafka Connect
● Kafka Connect, an open source component of
Apache Kafka, is a framework for connecting Kafka
with external systems such as databases, key-value
stores, search indexes, and file systems.
● Source Connector
○ A source connector ingests entire databases
and streams table updates to Kafka topics. It
can also collect metrics from all of your
application servers into Kafka topics, making
the data available for stream processing with
low latency.
● Sink Connector
○ delivers data from Kafka topics into secondary
indexes such as Elasticsearch or batch systems
Kafka Connect
● An open source distributed platform for change
data capture.
● Turns your existing databases into event streams,
so applications can see and respond immediately to
each row-level change in the databases
● Debezium is built on top of Apache Kafka and
provides Kafka Connect compatible connectors that
monitor specific database.
● Reads the Binlog of the source database and
provides a unified structured event which describes
the changes
Agenda
Debezium
● Schema Registry provides a serving layer for your
metadata.
● It provides a RESTful interface for storing and
retrieving Avro schemas.
● It stores a versioned history of all schemas,
provides multiple compatibility settings and
allows evolution of schemas according to the
configured compatibility settings and expanded
Avro support.
● It provides serializers that plug into Kafka clients that
handle schema storage and retrieval for Kafka
messages that are sent in the Avro format.
Agenda
Schema
Registry
Schema Registry
● Avro is a binary data serialization framework
developed within Apache's Hadoop project.
● Similarly to how in a SQL database you can’t add
data without creating a table first, one can’t create
an Avro object without first providing a schema.
● Avro schemas are defined using JSON
● An Avro object contains the schema and the data.
The data without the schema is an invalid Avro
object. That’s a big difference with say, CSV, or JSON.
● You can make your schemas evolve over time.
Apache Avro has a concept of projection which
makes evolving schema seamless to the end user.
Agenda
Apache
Avro
Avro vs. JSON
Avro JSON
● Schema evolution!!!!!!!!!!!!!
● Fast, Compact & Binary
● Avro has support for primitive types &
complex types as well.
● Documentation of the schema is built in.
● Supports compression such as Google’s
Snappy.
● Readable using Avro-consumers which are
shipped with the schema registry
● JSON can be read by pretty much any
language
● JSON has no native schema support
● JSON objects can be quite big in size
because of repeated keys
● No comments, metadata, documentation
● Typing and parsing is the responsibility of
the consumer: INT, LONG?
● Readable
● Production - Orders/Unsubscribers?
● Secor - support schema registry param
● Single Message Transformations
○ Debezium supports masking, and column
blacklist
● Steam changes from MongoDB’s oplog
○ Register new type of connector
● Database 2 Database stream
○ Sync MySQL-> Elasticsearch like in filter &
search
● All time data and backfill
○ Consume topic from beginning
○ Use data lake to replay all time data
Agenda
Future
Steps
● Metorikku support avro serialization + schema
registry
● Metrics & Logging & Alerting
● Debezium/Schema registry Cluster, Scaling and
deployment
○ Worker Per Table? Database? Databases?
● Log Compaction and retention per connector
● UI
○ Schema Registry
○ Connect
● Dive deeper to each technology
Agenda
Future
Steps
● Introducing new complexity to the system
● Kafka Version 2.0
● Avro support in Kafka Gem and Go Packages
● Binlog + Performance = TBD
○ We can use Kafka connect built in JDBC driver
which does not work with the binlog
● Aurora + Binlog = 💩
Agenda
Open
Issues
Demo
Questions?
Thank You!

More Related Content

What's hot

Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
Naresh Rupareliya
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with Debezium
ChengKuan Gan
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
DataWorks Summit
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
Knoldus Inc.
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
Knoldus Inc.
 
Stream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream SharingStream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream Sharing
confluent
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
confluent
 
DevNation Live: Kafka and Debezium
DevNation Live: Kafka and DebeziumDevNation Live: Kafka and Debezium
DevNation Live: Kafka and Debezium
Red Hat Developers
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
Discover Pinterest
 
An Introduction to Confluent Cloud: Apache Kafka as a Service
An Introduction to Confluent Cloud: Apache Kafka as a ServiceAn Introduction to Confluent Cloud: Apache Kafka as a Service
An Introduction to Confluent Cloud: Apache Kafka as a Service
confluent
 
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 ActsUnlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
HostedbyConfluent
 
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...
HostedbyConfluent
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Kai Wähner
 

What's hot (20)

Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with Debezium
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Stream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream SharingStream Processing with Flink and Stream Sharing
Stream Processing with Flink and Stream Sharing
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?
 
DevNation Live: Kafka and Debezium
DevNation Live: Kafka and DebeziumDevNation Live: Kafka and Debezium
DevNation Live: Kafka and Debezium
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
An Introduction to Confluent Cloud: Apache Kafka as a Service
An Introduction to Confluent Cloud: Apache Kafka as a ServiceAn Introduction to Confluent Cloud: Apache Kafka as a Service
An Introduction to Confluent Cloud: Apache Kafka as a Service
 
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 ActsUnlocking the Power of Apache Flink: An Introduction in 4 Acts
Unlocking the Power of Apache Flink: An Introduction in 4 Acts
 
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...
Exactly-Once Semantics Revisited: Distributed Transactions across Flink and K...
 
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
 

Similar to Change data capture

Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
StampedeCon
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
Lars Albertsson
 
Red Hat Storage Roadmap
Red Hat Storage RoadmapRed Hat Storage Roadmap
Red Hat Storage Roadmap
Red_Hat_Storage
 
Red Hat Storage Roadmap
Red Hat Storage RoadmapRed Hat Storage Roadmap
Red Hat Storage Roadmap
Colleen Corrice
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
DataWorks Summit
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Summer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpointSummer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpoint
Christopher Dubois
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
Hari Shreedharan
 
Real time data pipline with kafka streams
Real time data pipline with kafka streamsReal time data pipline with kafka streams
Real time data pipline with kafka streams
Yoni Farin
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
Mark Smith
 
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Red Hat Developers
 
What We Learned From Building a Modern Messaging and Streaming System for Cloud
What We Learned From Building a Modern Messaging and Streaming System for CloudWhat We Learned From Building a Modern Messaging and Streaming System for Cloud
What We Learned From Building a Modern Messaging and Streaming System for Cloud
StreamNative
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
Knoldus Inc.
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
Knoldus Inc.
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
Anton Nazaruk
 

Similar to Change data capture (20)

Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Red Hat Storage Roadmap
Red Hat Storage RoadmapRed Hat Storage Roadmap
Red Hat Storage Roadmap
 
Red Hat Storage Roadmap
Red Hat Storage RoadmapRed Hat Storage Roadmap
Red Hat Storage Roadmap
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Summer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpointSummer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpoint
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User GroupStreamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
 
Real time data pipline with kafka streams
Real time data pipline with kafka streamsReal time data pipline with kafka streams
Real time data pipline with kafka streams
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
 
What We Learned From Building a Modern Messaging and Streaming System for Cloud
What We Learned From Building a Modern Messaging and Streaming System for CloudWhat We Learned From Building a Modern Messaging and Streaming System for Cloud
What We Learned From Building a Modern Messaging and Streaming System for Cloud
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 

Recently uploaded

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 

Recently uploaded (20)

To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 

Change data capture

  • 1. Change Data Capture The road thus far...
  • 2. Agenda ● What is CDC? ● Why should we CDC? ● High Level Architecture ● The Stack ○ Kafka Connect ○ Debezium ○ Schema Registry ○ Avro ● Open Issues ● Future Steps ● Demo
  • 3. Agenda What is CDC? ● A set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data ● An approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources. ● feed the data into a central hub of data streams, where it can readily be combined with event streams and data from other databases in real-time ● Continuous data integration paradigm.
  • 4. Agenda Why Should we do it? ● To support data lake driven architecture ○ Current architecture is obsolete ○ Not best practice (deletes) ○ Not incremental - Overwrites everything ○ Freshness of Data ○ Being done behind your backs! ● Same data <> different representations & different needs ○ Data lake ○ Search index ○ Cache ● To Support event driven architecture & sourcing ○ Review Created, Token Deleted ETC……..
  • 5. Change Data Capture BREAKS DATABASE ENCAPSULATION But hey, we’ve being doing it for the past 3 years...
  • 6. Treat Your Data Models Like you would with your APIs! Or at least use the tools that would enable you to do so
  • 8. High Level Architecture High Level Architecture
  • 9. THIS IS POWERFUL IN SO MANY WAYS WE CAN ONLY IMAGINE!
  • 10. Kafka Connect ● Kafka Connect, an open source component of Apache Kafka, is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. ● Source Connector ○ A source connector ingests entire databases and streams table updates to Kafka topics. It can also collect metrics from all of your application servers into Kafka topics, making the data available for stream processing with low latency. ● Sink Connector ○ delivers data from Kafka topics into secondary indexes such as Elasticsearch or batch systems
  • 12. ● An open source distributed platform for change data capture. ● Turns your existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases ● Debezium is built on top of Apache Kafka and provides Kafka Connect compatible connectors that monitor specific database. ● Reads the Binlog of the source database and provides a unified structured event which describes the changes Agenda Debezium
  • 13. ● Schema Registry provides a serving layer for your metadata. ● It provides a RESTful interface for storing and retrieving Avro schemas. ● It stores a versioned history of all schemas, provides multiple compatibility settings and allows evolution of schemas according to the configured compatibility settings and expanded Avro support. ● It provides serializers that plug into Kafka clients that handle schema storage and retrieval for Kafka messages that are sent in the Avro format. Agenda Schema Registry
  • 15. ● Avro is a binary data serialization framework developed within Apache's Hadoop project. ● Similarly to how in a SQL database you can’t add data without creating a table first, one can’t create an Avro object without first providing a schema. ● Avro schemas are defined using JSON ● An Avro object contains the schema and the data. The data without the schema is an invalid Avro object. That’s a big difference with say, CSV, or JSON. ● You can make your schemas evolve over time. Apache Avro has a concept of projection which makes evolving schema seamless to the end user. Agenda Apache Avro
  • 16. Avro vs. JSON Avro JSON ● Schema evolution!!!!!!!!!!!!! ● Fast, Compact & Binary ● Avro has support for primitive types & complex types as well. ● Documentation of the schema is built in. ● Supports compression such as Google’s Snappy. ● Readable using Avro-consumers which are shipped with the schema registry ● JSON can be read by pretty much any language ● JSON has no native schema support ● JSON objects can be quite big in size because of repeated keys ● No comments, metadata, documentation ● Typing and parsing is the responsibility of the consumer: INT, LONG? ● Readable
  • 17. ● Production - Orders/Unsubscribers? ● Secor - support schema registry param ● Single Message Transformations ○ Debezium supports masking, and column blacklist ● Steam changes from MongoDB’s oplog ○ Register new type of connector ● Database 2 Database stream ○ Sync MySQL-> Elasticsearch like in filter & search ● All time data and backfill ○ Consume topic from beginning ○ Use data lake to replay all time data Agenda Future Steps
  • 18. ● Metorikku support avro serialization + schema registry ● Metrics & Logging & Alerting ● Debezium/Schema registry Cluster, Scaling and deployment ○ Worker Per Table? Database? Databases? ● Log Compaction and retention per connector ● UI ○ Schema Registry ○ Connect ● Dive deeper to each technology Agenda Future Steps
  • 19. ● Introducing new complexity to the system ● Kafka Version 2.0 ● Avro support in Kafka Gem and Go Packages ● Binlog + Performance = TBD ○ We can use Kafka connect built in JDBC driver which does not work with the binlog ● Aurora + Binlog = 💩 Agenda Open Issues
  • 20. Demo