Technology choices for Apache Kafka and Change Data Capture

Kate Stanley and Andrew Schofield
Apache Kafka London Meetup October 2019
IBM Event Streams, Apache Kafka

© 2019 IBM Corporation

Change Data Capture identifies and
captures the changes to a data store
as a stream of Kafka events

Point-to-point data integration
[Diagram labels: MASTER DATABASE, RECOVERY DATABASE, AUDIT LOG, QUERY CACHE, APPLICATION]

It’s publish/subscribe for data
[Diagram labels: MASTER DATABASE, RECOVERY DATABASE, AUDIT LOG, QUERY CACHE, APPLICATION]

Technology choices
These different approaches have all been used successfully
1. Data store natively generates a feed of changes
2. Repeated queries, with optimization or restrictions
3. Log scanning

Why use Kafka with CDC?
Kafka has lots of connectors to other systems
It acts as a buffer, loosening coupling between source and target
Publish/subscribe, instead of point-to-point
Makes it easy to process the CDC stream as events in Kafka client application code

Kafka Connect JDBC source

JDBC source connector
Uses JDBC to connect to any compliant relational database,
e.g. Oracle, Microsoft SQL Server, Db2, MySQL and Postgres.

Requires a Kafka Connect runtime

Can bulk copy tables with any columns
To receive just the changes, particular columns are needed
Open-source: https://github.com/confluentinc/kafka-connect-jdbc

Configuring the JDBC connector
$ curl -X PUT -H "Content-Type: application/json" \
    -d '{"connector.class":"io.confluent.connect.jdbc.JdbcSourceConnector"}' \
    http://localhost:8083/connector-plugins/JdbcSourceConnector/config/validate
Required config options:
name
connector.class
connection.url – JDBC connection URL
topic.prefix – prefix to prepend to table names
mode – bulk, incrementing, timestamp, timestamp+incrementing
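
The required options listed above can be assembled and sanity-checked before they are sent to the Kafka Connect REST API. This is a minimal sketch: the connector name, the JDBC URL and the topic prefix are assumptions made up for illustration, and check_config is a hypothetical helper, not part of Kafka Connect.

```python
import json

# Hypothetical values for illustration; only the option names and the mode
# strings are taken from the slide above.
connector_config = {
    "name": "jdbc-source-example",               # assumed connector name
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:sqlite:example.db",  # assumed JDBC URL
    "topic.prefix": "example-",                  # assumed prefix for topic names
    "mode": "incrementing",
    "incrementing.column.name": "id",            # needed by incrementing mode
}

VALID_MODES = {"bulk", "incrementing", "timestamp", "timestamp+incrementing"}
REQUIRED = ["name", "connector.class", "connection.url", "topic.prefix", "mode"]


def check_config(config):
    """Return a list of problems found in a connector config dict."""
    problems = [f"missing option: {key}" for key in REQUIRED if key not in config]
    mode = config.get("mode")
    if mode is not None and mode not in VALID_MODES:
        problems.append(f"unknown mode: {mode}")
    return problems


problems = check_config(connector_config)
request_body = json.dumps(connector_config, indent=2)
print(problems)
print(request_body)
```

The printed JSON is the kind of body that could be passed to curl with -d, either to the validate endpoint shown above or when creating the connector.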

Incrementing mode
Use a strictly incrementing column on each table to detect only new rows.
Requires incrementing.column.name to be set
Does not detect modifications or deletions of existing rows
ID column must be present on all tables
Identifier must be in a single column
id  First name  Surname   Amount
0   John        Smith     20
1   Daisy       Williams  25
2   Laura       Thomas    15
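
The idea behind incrementing mode can be sketched as a repeated query that remembers the highest id seen so far. This is an illustration only, not the connector's actual implementation: it uses an in-memory SQLite database, and the table name payments and the column names are assumptions.

```python
import sqlite3

# In-memory table loosely modelled on the slide's example data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "id INTEGER PRIMARY KEY, first_name TEXT, surname TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [(0, "John", "Smith", 20), (1, "Daisy", "Williams", 25)],
)


def poll_new_rows(conn, last_id):
    """Return rows with id greater than last_id and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, first_name, surname, amount FROM payments "
        "WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id


# The first poll picks up every existing row.
first_poll, last_id = poll_new_rows(conn, -1)

# One insert, one update and one delete happen between polls.
conn.execute("INSERT INTO payments VALUES (2, 'Laura', 'Thomas', 15)")
conn.execute("UPDATE payments SET amount = 30 WHERE id = 0")
conn.execute("DELETE FROM payments WHERE id = 1")

# Only the inserted row is returned; the update and the delete go unnoticed.
second_poll, last_id = poll_new_rows(conn, last_id)
print(first_poll)
print(second_poll)
```

The second poll shows the limitation listed above: a query on id alone has no way to see changes to rows it has already passed.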

Timestamp mode
Use a timestamp column to detect new and modified rows.
Requires timestamp.column.name to be set
Timestamp column must be updated with each write
Timestamp column must be monotonically incrementing
Timestamp column must be present on all tables
Does not guarantee that all updated data is delivered, since timestamps aren’t unique.
timestamp            First name  Surname   Amount
2019-10-09 18:10:15  John        Smith     20
2019-10-09 18:17:36  Daisy       Williams  25
2019-10-09 18:57:12  Laura       Thomas    15
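
A simplified sketch of timestamp-based polling shows both the benefit (updates are detected) and the caveat (a row sharing an already-seen timestamp can be missed). It is not the connector's real query. The payments table and the updated_at column are assumptions, and Laura's timestamp is deliberately changed from the slide so that it collides with Daisy's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "updated_at TEXT, first_name TEXT, surname TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [
        ("2019-10-09 18:10:15", "John", "Smith", 20),
        ("2019-10-09 18:17:36", "Daisy", "Williams", 25),
    ],
)


def poll_changed_rows(conn, last_ts):
    """Return rows stamped later than last_ts and the new high-water mark."""
    rows = conn.execute(
        "SELECT updated_at, first_name, surname, amount FROM payments "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_ts,),
    ).fetchall()
    new_last_ts = rows[-1][0] if rows else last_ts
    return rows, new_last_ts


# The empty string sorts before every timestamp, so this returns all rows.
first_poll, last_ts = poll_changed_rows(conn, "")

# An update that also bumps the timestamp is detected on the next poll.
conn.execute(
    "UPDATE payments SET amount = 30, updated_at = '2019-10-09 18:57:12' "
    "WHERE first_name = 'John'"
)
# A row written with the same timestamp as the stored high-water mark is missed.
conn.execute(
    "INSERT INTO payments VALUES ('2019-10-09 18:17:36', 'Laura', 'Thomas', 15)"
)

second_poll, last_ts = poll_changed_rows(conn, last_ts)
print(second_poll)
```

ISO-formatted timestamps are stored as text here because they compare correctly as strings, which keeps the sketch free of date parsing.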

Timestamp+Incrementing mode
Uses both a timestamp column and an incrementing id column.
Detects new and updated rows.
More robust than timestamp alone since the combination of id and timestamp should be unique.
timestamp            id  First name  Surname   Amount
2019-10-09 18:10:15  0   John        Smith     20
2019-10-09 18:17:36  1   Daisy       Williams  25
2019-10-09 18:57:12  2   Laura       Thomas    15
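
Combining the two columns lets a poll tell apart rows that share a timestamp. The following is a hedged sketch of that idea rather than the connector's exact SQL. The payments table and its column names are assumptions, and Laura's timestamp is set equal to Daisy's on purpose to exercise the tie-break on id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (updated_at TEXT, id INTEGER PRIMARY KEY, "
    "first_name TEXT, surname TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?, ?)",
    [
        ("2019-10-09 18:10:15", 0, "John", "Smith", 20),
        ("2019-10-09 18:17:36", 1, "Daisy", "Williams", 25),
    ],
)


def poll_changed_rows(conn, last_ts, last_id):
    """Return rows after the (timestamp, id) position and the new position."""
    rows = conn.execute(
        "SELECT updated_at, id, first_name, surname, amount FROM payments "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id",
        (last_ts, last_ts, last_id),
    ).fetchall()
    if rows:
        last_ts, last_id = rows[-1][0], rows[-1][1]
    return rows, last_ts, last_id


first_poll, last_ts, last_id = poll_changed_rows(conn, "", -1)

# Same timestamp as the last row seen, but a higher id, so it is still found.
conn.execute(
    "INSERT INTO payments VALUES ('2019-10-09 18:17:36', 2, 'Laura', 'Thomas', 15)"
)

second_poll, last_ts, last_id = poll_changed_rows(conn, last_ts, last_id)
print(second_poll)
```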

LICENSE:
Confluent Community License Agreement Version 1.0

Building the JDBC connector from source
1. git clone https://github.com/confluentinc/kafka-connect-jdbc.git
2. Edit the pom.xml:
   a) Comment out the Confluent parts of the pom.xml
   b) Add a version
   c) Comment out checkstyle
   d) Add Java 8 enforcement
   e) Add versions for dependencies
3. cd kafka-connect-jdbc
   mvn install -DskipTests

Building the JDBC connector from source
1. git clone https://github.com/confluentinc/kafka.git (Apache 2.0 license)
2. cd kafka
   gradle
   ./gradlew installAll
3. git clone https://github.com/confluentinc/common.git (Apache 2.0 license)
4. cd common
   mvn install
5. git clone https://github.com/confluentinc/kafka-connect-jdbc.git (Confluent Community license)
6. cd kafka-connect-jdbc
   mvn install

Running the JDBC connector
Must check the JDBC driver has been loaded (SQLite and Postgres included by default)
1. Increase log level to DEBUG
2. Check the JDBC driver JAR is in the ‘Loading plugin urls’ list
3. Check for the ‘Added plugin’ line immediately after
CLASSPATH=/Users/katherinestanley/connectors/mysql-connector-java-8.0.17.jar ./bin/connect-distributed.sh config/connect-distributed.properties
https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector

Debezium
Debezium is an open-source platform for change data capture using Kafka Connect
Supports MySQL, MongoDB, PostgreSQL and SQL Server; in incubation: Oracle, Cassandra, Db2 (soon)
Each supported database has separate code
Underlying technology depends on database
MySQL uses log scanning, SQL Server uses special CDC tables created by the database, …
Open-source – https://github.com/debezium/debezium
Proper open licence – Apache 2.0

Debezium – log scanning
[Diagram: Debezium, in a Kafka Connect worker, reads the DATABASE LOG and publishes the row changes]
DATABASE LOG: T1 Ins, T2 Upd, T1 Ins, T2 Ins, T1 Del, T1 Cmt, T2 Pre, T2 Cmt
Published: Ins, Upd, Ins, Ins, Del

Debezium MySQL
Uses log scanning – requires configuration of row-based binary logs
WRITE_ROWS for row insert
UPDATE_ROWS for row update
DELETE_ROWS for row delete
QUERY for all kinds of miscellaneous stuff, including transaction commit
Nice and efficient, but connector code is very specific to MySQL internal details

Database replication
[Diagram: Source database: DATABASE LOG, CAPTURE PROGRAM, CHANGE DATA TABLE, SOURCE TABLE. Target database: APPLY PROGRAM, TARGET TABLE]
DATABASE LOG: T1 Ins, T2 Upd, T1 Ins, T2 Ins, T1 Del, T1 Cmt, T2 Pre, T2 Cmt

Debezium – replication tables
[Diagram: Source database: DATABASE LOG, CAPTURE PROGRAM, CHANGE DATA TABLE, SOURCE TABLE. Kafka Connect worker: Debezium, which publishes the changes]
DATABASE LOG: T1 Ins, T2 Upd, T1 Ins, T2 Ins, T1 Del, T1 Cmt, T2 Pre, T2 Cmt
Published: Ins, Upd, Ins, Ins, Del

How can I try it?
Try the totally excellent Docker-based tutorial
https://debezium.io/documentation/reference/0.10/tutorial.html

Record formatting
The default is comprehensive and very verbose
{
  "schema": {
  },
  "payload": {
    "op": "u",
    "source": {
      ...
    },
    "ts_ms": "...",
    "before": {
      "field1": "oldvalue1",
      "field2": "oldvalue2"
    },
    "after": {
      "field1": "newvalue1",
      "field2": "newvalue2"
    }
  }
}

Record formatting
Just use the provided ExtractNewRecordState SMT
Before the SMT:
{
  "schema": {
  },
  "payload": {
    "op": "u",
    "source": {
      ...
    },
    "ts_ms": "...",
    "before": {
      "field1": "oldvalue1",
      "field2": "oldvalue2"
    },
    "after": {
      "field2": "newvalue2"
    }
  }
}
After the SMT:
{
  "field2": "newvalue2"
}
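
What the transform does to a record can be emulated in a few lines. This is a simplified stand-in for the real SMT, which runs inside Kafka Connect and has further options (for example around deletes) that are ignored here; the function name is hypothetical.

```python
import json


def extract_new_record_state(envelope):
    """Return the 'after' state of a change event envelope, or None."""
    payload = envelope.get("payload") or {}
    return payload.get("after")


# Envelope modelled on the slide; "source" and "ts_ms" are left as placeholders.
change_event = {
    "schema": {},
    "payload": {
        "op": "u",
        "source": {},
        "ts_ms": "...",
        "before": {"field1": "oldvalue1", "field2": "oldvalue2"},
        "after": {"field2": "newvalue2"},
    },
}

flattened = extract_new_record_state(change_event)
print(json.dumps(flattened))
```

In a connector configuration the real transform is typically enabled with transforms=unwrap and transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState, where the alias unwrap is an arbitrary name.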

IBM InfoSphere Data Replication
Enterprise-grade CDC built exclusively on log scanning
Focus on performance and transactionality
Can be customised with user code
Does not use Kafka Connect because it wants tighter control over publishing

IIDR architecture
[Diagram: Source server: DATABASE LOG, CDC SOURCE ENGINE. Target server: CDC TARGET ENGINE. Other labels: PARSE, TRANSFORM, WRITER (x2), MANAGEMENT CONSOLE. Arrows: Read, Send, Publish]
DATABASE LOG: T1 Ins, T2 Upd, T1 Ins, T2 Ins, T1 Del, T1 Cmt, T2 Pre, T2 Cmt
Sent: Ins, Upd, Ins, Ins, Del

Four Time-Interleaved Source Database Transactions
Transaction 1  Op1(Tab2)  Op2(Tab3)  Op3(Tab2)  Commit
Transaction 2  Op1(Tab2)  Op2(Tab2)  Op3(Tab3)  Op4(Tab2)  Commit
Transaction 3  Op1(Tab1)  Commit
Transaction 4  Op1(Tab1)  Commit
===================== TIME =====================>

Transactionally Consistent Consumer
Recreates order of operations in source database across multiple topics and
partitions, with no duplicates
Uses a "commitstream" topic to maintain transaction metadata
User topic data is not modified
Kafka records can be written out of strict order and TCC sorts it all out
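
One way to picture how transaction metadata lets a consumer restore source order is sketched below. Everything in it is hypothetical: the metadata layout, the names and the grouping by commit order are assumptions made for illustration and are not IIDR's actual commitstream format or algorithm.

```python
from typing import NamedTuple


class Record(NamedTuple):
    topic: str
    partition: int
    offset: int
    value: str


# Hypothetical metadata: committed transactions in commit order, each listing
# the (topic, partition, offset) positions of its operations in source order.
commit_stream = [
    ("T3", [("tab1", 0, 0)]),
    ("T1", [("tab2", 0, 0), ("tab3", 0, 0)]),
    ("T2", [("tab2", 0, 1)]),
]


def replay_in_source_order(records, commit_stream):
    """Order records by the metadata; unreferenced records are dropped."""
    by_position = {(r.topic, r.partition, r.offset): r for r in records}
    ordered = []
    for _tx_id, positions in commit_stream:
        for position in positions:
            ordered.append(by_position[position])
    return ordered


# Records as they might be consumed from several topics, in no useful order.
# The record at tab2 offset 2 is a duplicate that the metadata never mentions.
consumed = [
    Record("tab2", 0, 1, "T2 Op1"),
    Record("tab3", 0, 0, "T1 Op2"),
    Record("tab2", 0, 0, "T1 Op1"),
    Record("tab2", 0, 2, "T1 Op1 (duplicate)"),
    Record("tab1", 0, 0, "T3 Op1"),
]

ordered_values = [r.value for r in replay_in_source_order(consumed, commit_stream)]
print(ordered_values)
```

The user topics are only read in this sketch, never rewritten, which matches the point above that user topic data is not modified.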

Summary
There is a variety of open-source and commercial CDC options for Kafka
Choice depends largely on desired throughput, flexibility, semantics and cost

Thank you
Kate Stanley @katestanley91
Andrew Schofield https://medium.com/@andrew_schofield
Links: https://kafka.apache.org/documentation/#connect
https://github.com/confluentinc/kafka-connect-jdbc
https://debezium.io
https://github.com/debezium
IBM Event Streams: ibm.biz/aboutEventStreams
