Change Data Capture
The road thus far...
Agenda
● What is CDC?
● Why should we CDC?
● High Level Architecture
● The Stack
○ Kafka Connect
○ Debezium
○ Schema Registry
○ Avro
● Open Issues
● Future Steps
● Demo
What is CDC?
● A set of software design patterns used to determine
(and track) the data that has changed so that action
can be taken using the changed data
● An approach to data integration that is based on the
identification, capture and delivery of the changes
made to enterprise data sources.
● Feeds the data into a central hub of data streams, where it can readily be combined with event streams and data from other databases in real time (see the example event below).
● A continuous data integration paradigm.
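To make this concrete, here is a minimal sketch (in Python) of what a single row-level change event might look like once it lands in the stream; the field names are illustrative, loosely modeled on the before/after/op envelope described on the Debezium slide later, not an exact payload:

    # Illustrative shape of one row-level change event (names are assumptions):
    change_event = {
        "op": "u",                                   # c = create, u = update, d = delete
        "before": {"id": 42, "status": "pending"},   # row state before the change
        "after":  {"id": 42, "status": "approved"},  # row state after the change
        "source": {"table": "reviews", "ts_ms": 1535000000000},
    }

Downstream consumers (data lake, search index, cache) can each project whatever they need out of the same event.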
Why should we do it?
● To support data lake driven architecture
○ Current architecture is obsolete
○ Not best practice (deletes)
○ Not incremental - Overwrites everything
○ Freshness of Data
○ Being done behind your backs!
● Same data <> different representations & different
needs
○ Data lake
○ Search index
○ Cache
● To support event-driven architecture & event sourcing
○ Review Created, Token Deleted, etc.
Change Data Capture
BREAKS DATABASE ENCAPSULATION
But hey, we've been doing it for the past 3 years...
Treat your data models like you would your APIs!
Or at least use the tools that enable you to do so
Change Data Capture
High Level Architecture
THIS IS POWERFUL IN SO MANY
WAYS WE CAN ONLY IMAGINE!
Kafka Connect
● Kafka Connect, an open source component of
Apache Kafka, is a framework for connecting Kafka
with external systems such as databases, key-value
stores, search indexes, and file systems.
● Source Connector
○ A source connector ingests entire databases
and streams table updates to Kafka topics. It
can also collect metrics from all of your
application servers into Kafka topics, making
the data available for stream processing with
low latency.
● Sink Connector
○ A sink connector delivers data from Kafka topics into secondary indexes such as Elasticsearch, or into batch systems (see the sketch below).
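Connectors are usually registered by POSTing a JSON config to the Connect workers' REST API. A minimal sketch in Python, assuming a Connect cluster on localhost:8083 and that Confluent's Elasticsearch sink connector plugin is installed; the names, topics and URLs are placeholders:

    import json
    import requests  # the Connect REST API is plain JSON over HTTP

    # Hypothetical sink: stream the "reviews" topic into Elasticsearch.
    sink = {
        "name": "reviews-es-sink",
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": "reviews",
            "connection.url": "http://localhost:9200",
            "type.name": "_doc",
        },
    }

    resp = requests.post(
        "http://localhost:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(sink),
    )
    resp.raise_for_status()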
Debezium
● An open source distributed platform for change data capture.
● Turns your existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases.
● Debezium is built on top of Apache Kafka and provides Kafka Connect compatible connectors that monitor specific databases.
● Reads the binlog of the source database and provides a unified structured event which describes the changes (see the registration sketch below).
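A Debezium source connector is registered the same way as any other Connect connector. A minimal sketch, assuming the Debezium MySQL connector plugin is installed on the Connect workers; hostnames, credentials and names below are placeholders, and the property names follow Debezium's MySQL connector configuration:

    import json
    import requests

    # Hypothetical Debezium MySQL source: tail the binlog and publish one topic per table.
    source = {
        "name": "reviews-mysql-source",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.internal",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "secret",
            "database.server.id": "184054",        # unique id for the binlog client
            "database.server.name": "reviews-db",  # logical name, becomes the topic prefix
            "database.whitelist": "reviews",       # which database(s) to capture
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "schema-changes.reviews",
        },
    }

    requests.post(
        "http://localhost:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(source),
    ).raise_for_status()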
Schema Registry
● Schema Registry provides a serving layer for your metadata.
● It provides a RESTful interface for storing and retrieving Avro schemas (see the sketch below).
● It stores a versioned history of all schemas, provides multiple compatibility settings and allows schemas to evolve according to the configured compatibility settings, with expanded Avro support.
● It provides serializers that plug into Kafka clients and handle schema storage and retrieval for Kafka messages that are sent in Avro format.
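The REST interface is easy to poke at directly. A minimal sketch against a registry on localhost:8081; the subject name and schema are made up:

    import json
    import requests

    SR = "http://localhost:8081"

    # Register a (new version of a) schema under a subject; the registry returns its id.
    review_schema = {
        "type": "record",
        "name": "Review",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "status", "type": "string"},
        ],
    }
    resp = requests.post(
        f"{SR}/subjects/reviews-value/versions",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": json.dumps(review_schema)}),
    )
    print(resp.json())  # e.g. {"id": 1}

    # Browse what is registered and fetch the latest version of a subject.
    print(requests.get(f"{SR}/subjects").json())
    print(requests.get(f"{SR}/subjects/reviews-value/versions/latest").json())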
Apache Avro
● Avro is a binary data serialization framework developed within Apache's Hadoop project.
● Similarly to how in a SQL database you can't add data without creating a table first, you can't create an Avro object without first providing a schema.
● Avro schemas are defined using JSON.
● An Avro object contains the schema and the data. The data without the schema is an invalid Avro object. That's a big difference from, say, CSV or JSON.
● You can make your schemas evolve over time. Apache Avro has a concept of projection which makes evolving schemas seamless to the end user (see the sketches below).
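A minimal serialization sketch using fastavro (one of several Avro libraries for Python); the schema and record are made up:

    import io
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    # The schema is plain JSON-shaped data; nothing can be written without it.
    schema = parse_schema({
        "type": "record",
        "name": "Review",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "status", "type": "string"},
        ],
    })

    # Serialize a single record to compact binary, then read it back.
    buf = io.BytesIO()
    schemaless_writer(buf, schema, {"id": 42, "status": "approved"})
    buf.seek(0)
    print(schemaless_reader(buf, schema))  # {'id': 42, 'status': 'approved'}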
Avro vs. JSON

Avro
● Schema evolution!
● Fast, compact & binary
● Avro has support for primitive types & complex types as well
● Documentation of the schema is built in
● Supports compression such as Google's Snappy
● Readable using Avro consumers, which are shipped with the schema registry

JSON
● JSON can be read by pretty much any language
● JSON has no native schema support
● JSON objects can be quite big in size because of repeated keys
● No comments, metadata, documentation
● Typing and parsing is the responsibility of the consumer: INT? LONG?
● Readable
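The schema-evolution point above is the headline difference. A minimal sketch of Avro projection with fastavro: data written with an old schema is read with a newer reader schema that adds a field with a default (both schemas are made up):

    import io
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    # Writer schema: what old producers used.
    v1 = parse_schema({
        "type": "record", "name": "Review",
        "fields": [{"name": "id", "type": "long"}],
    })

    # Reader schema: adds a field with a default, so old data still resolves.
    v2 = parse_schema({
        "type": "record", "name": "Review",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "status", "type": "string", "default": "unknown"},
        ],
    })

    buf = io.BytesIO()
    schemaless_writer(buf, v1, {"id": 42})  # written with the old schema
    buf.seek(0)
    print(schemaless_reader(buf, v1, v2))   # read with the new one -> {'id': 42, 'status': 'unknown'}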
Future Steps
● Production - Orders/Unsubscribers?
● Secor - support schema registry param
● Single Message Transformations
○ Debezium supports masking and column blacklists
● Stream changes from MongoDB's oplog
○ Register a new type of connector
● Database-to-database streams
○ Sync MySQL -> Elasticsearch, like in filter & search
● All-time data and backfill
○ Consume the topic from the beginning (see the consumer sketch below)
○ Use the data lake to replay all-time data
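For the backfill item, consuming from the beginning is just an offset-reset choice on a fresh consumer group. A minimal sketch with the confluent_kafka client; broker, group id and topic name are placeholders:

    from confluent_kafka import Consumer

    # A fresh group.id plus auto.offset.reset=earliest replays the topic from the start
    # (as far back as retention / compaction still keeps).
    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "reviews-backfill",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["reviews-db.reviews.reviews"])

    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue
            if msg.error():
                raise RuntimeError(msg.error())
            print(msg.topic(), msg.offset(), msg.value())  # reload / re-index here
    finally:
        consumer.close()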
Future Steps (cont.)
● Metorikku: support Avro serialization + schema registry
● Metrics & logging & alerting
● Debezium/Schema Registry cluster, scaling and deployment
○ Worker per table? Per database? Per several databases?
● Log compaction and retention per connector
● UI
○ Schema Registry
○ Connect
● Dive deeper into each technology
Open Issues
● Introducing new complexity to the system
● Kafka version 2.0
● Avro support in the Kafka gem and Go packages
● Binlog + performance = TBD
○ We could fall back to Kafka Connect's built-in JDBC connector, which does not work off the binlog
● Aurora + binlog = 💩
Demo
Questions?
Thank You!
