"Kafka Connect, the framework for building scalable and reliable data pipelines, has gained immense popularity in the data engineering landscape. This session will provide a comprehensive guide to creating Kafka connectors using Kotlin, a language known for its conciseness and expressiveness.
In this session, we will explore a step-by-step approach to crafting Kafka connectors with Kotlin, from inception to deployment using an simple use case. The process includes the following key aspects:
Understanding Kafka Connect: We'll start with an overview of Kafka Connect and its architecture, emphasizing its importance in real-time data integration and streaming.
Connector Design: Delve into the design principles that govern connector creation. Learn how to choose between source and sink connectors and identify the data format that suits your use case.
Building a Source Connector: We'll start with building a Kafka source connector, exploring key considerations, such as data transformations, serialization, deserialization, error handling and delivery guarantees. You will see how Kotlin's concise syntax and type safety can simplify the implementation.
Testing: Learn how to rigorously test your connector to ensure its reliability and robustness, utilizing best practices for testing in Kotlin.
Connector Deployment: Go through the process of deploying your connector in a Kafka Connect cluster, and discuss strategies for monitoring and scaling.
Real-World Use Cases: Explore real-world examples of Kafka connectors built with Kotlin.
By the end of this session, you will have a solid foundation for creating and deploying Kafka connectors using Kotlin, equipped with practical knowledge and insights to make your data integration processes more efficient and reliable. Whether you are a seasoned developer or new to Kafka Connect, this guide will help you harness the power of Kafka and Kotlin for seamless data flow in your applications.
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and Deployment
1. Building Kafka Connectors with Kotlin
A Step-by-Step Guide to Creation and Deployment
By Sami Alashabi and Ramzi Alashabi
2.
Building Kafka
Connectors with
Kotlin
A Step-by-Step Guide to Creation
and Deployment
Sami Alashabi, Solutions Architect, Accenture/Essent
Ramzi Alashabi, Senior Data Engineer, ABN Amro
3.
Sami Alashabi
12+ Year Journey in Data
Various Roles and Segments
Architecture, Big Data, Real-Time
Low Latency Distributed Systems,
AWS
Love to solve problems
Love spending time with family
when I’m not coding/architecting
Kafka Enthusiast
https://www.linkedin.com/in/sami-alashabi/
4.
Ramzi Alashabi
10+ Years Data Specialist
Micro-services, ETLs, and Cloud
Engineering
Transform ideas to Production
Love learning new Languages &
hanging out with the fam.
Yes, I'm a Dog Person
https://www.linkedin.com/in/ramzialashabi/
5.
Agenda
01 Kafka Connect
Overview, Architecture, Types & Concepts
02 Kotlin
Introduction, Background, Features & Advantages
03 Implementation & Code
Building a Source Connector, Test & Deployment Strategies
04 Key Learnings
Summary & Takeaways
05 Q&A
Questions & Follow Up
21.
Connect: Standalone vs Distributed
Standalone
● Ideal for development & testing
● Tasks executed in a single process
● Configuration in a properties file
● No fault tolerance: if the process fails, all tasks stop
● No automatic scalability: to scale up, you need to manually start more standalone processes
Distributed
● Ideal for large production deployments
● Tasks are distributed across multiple worker nodes
● Configuration stored in Kafka, allowing dynamic updates
● Fault tolerance: tasks are automatically redistributed
● Automatic (elastic) scalability: more worker processes can be added to scale up
23.
Single Message Transform
Single Message Transform: a way to modify individual messages
as they flow through the Kafka Connect pipeline, e.g.
● ReplaceField:
org.apache.kafka.connect.transforms.ReplaceField$Key
● MaskField:
org.apache.kafka.connect.transforms.MaskField$Value
● InsertField:
org.apache.kafka.connect.transforms.InsertField$Value
"config": {
...
"transforms":"flatten,createKey",
"transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.flatten.delimiter": "_",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id,iid,project_id"
}
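When the built-in transforms don't fit, you can also implement your own. Below is a minimal sketch of a custom SMT in Kotlin; the class name and the upper-casing logic are illustrative, not part of the talk's connector:

import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.ConnectRecord
import org.apache.kafka.connect.transforms.Transformation

// Hypothetical SMT that upper-cases String values; illustrative only
class UpperCaseValue<R : ConnectRecord<R>> : Transformation<R> {
    override fun apply(record: R): R {
        val value = record.value() as? String ?: return record
        // newRecord copies the record, swapping in the transformed value
        return record.newRecord(
            record.topic(), record.kafkaPartition(),
            record.keySchema(), record.key(),
            record.valueSchema(), value.uppercase(),
            record.timestamp()
        )
    }
    override fun config(): ConfigDef = ConfigDef()
    override fun configure(configs: Map<String, *>?) {}
    override fun close() {}
}

Once packaged on the plugin path, it would be referenced in the connector config the same way as the built-in transforms, via its fully qualified class name in transforms.<alias>.type.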
24.
Converters & Data Formats
Data formats can be chosen depending on the specific requirements of your application:
● ProtobufConverter: when you need to optimize for speed and size -
io.confluent.connect.protobuf.ProtobufConverter
● JsonSchemaConverter: when you want a human-readable format and are working with RESTful APIs -
io.confluent.connect.json.JsonSchemaConverter
● AvroConverter: the easiest for schema evolution - io.confluent.connect.avro.AvroConverter
● JsonConverter: when you want a human-readable format and don't need a schema -
org.apache.kafka.connect.json.JsonConverter
"config": {
...
"key.converter":"io.confluent.connect.json.JsonSchemaConverter",
"key.converter.schema.registry.url":"http://schema-registry:8081",
"value.converter":"io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schema.registry.url":"http://schema-registry:8081"
}
26.
Introduction
Kotlin is a modern, statically typed programming
language that mainly targets the Java Virtual
Machine (JVM)
● It was first introduced by JetBrains in 2011.
● In 2019, Google announced Kotlin as an
official language for Android development.
● Growing Community of Developers.
27.
Features & Advantages

val inferred = "Hello, World!"      // Type inference: String
val message: Any = "Hello, World!"
if (message is String) {            // Smart cast from Any to String
    println(message.length)         // String-specific members now accessible
}

// Using default arguments
fun greet(name: String = "John Doe", message: String = "Hello") {
    println("$message, $name!")
}
greet()

// Safe call (?.): executes only when the value is not null
val name: String? = null
val length: Int? = name?.length

// Elvis operator (?:): use the value if not null, otherwise a default
val lengthOrDefault = name?.length ?: -1

// Not-null assertion (!!): use only when sure the value is not null
val forcedLength = name!!.length   // throws NullPointerException when name is null

// Higher-order function that takes a function as a parameter
fun calculate(x: Int, y: Int, operation: (Int, Int) -> Int): Int {
    return operation(x, y)
}

// Using a lambda expression
val result = calculate(5, 3) { a, b -> a + b }
Concise Syntax
Reduces boilerplate, allowing clean, compact & more readable code, e.g.
● Type inference
● Smart casts
● Default arguments
Safe & Reliable
Built-in null safety features, eliminating the infamous NullPointerException errors, using
● safe calls (?.)
● the Elvis operator (?:)
● the not-null assertion (!!)
Interoperability
Fully compatible with Java, which means you can seamlessly use Kotlin code in Java projects and vice versa.
Functional Programming Support
Embraces functional programming and offers features like higher-order & first-class functions, lambda expressions, and functional utilities such as map, filter, and reduce.
29.
build.gradle.kts
● Plugins: e.g. Java library plugin, the
Kotlin JVM plugin, the Git version plugin,
and the Maven Publish plugin.
● Repositories: specifies where to fetch
dependencies from.
● Dependencies: libraries the project
depends on, including both
implementation and test dependencies
● Tasks: Test, Build, Jar
● Publishing: publish to a Maven
repository (the remaining sections are sketched below)
plugins {
`java-library`
kotlin("jvm") version "1.9.22"
id("com.palantir.git-version") version "1.0.0"
`maven-publish`
}
dependencies {
implementation("org.apache.kafka:connect-api:3.4.0”)
implementation("commons-validator:commons-validator:1.7")
testImplementation("org.testcontainers:kafka:1.19.6")
}
Gitlab: Building a
Source Connector
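The snippet shows only the plugins and dependencies blocks; a hedged sketch of what the remaining sections from the bullets (repositories, tasks, publishing) could look like — the fat-jar packaging and publication name are assumptions, not the project's exact setup:

repositories {
    mavenCentral()
}

tasks.test {
    useJUnitPlatform() // run JUnit 5 tests
}

tasks.jar {
    // Bundle runtime dependencies so the jar can be dropped on the Connect plugin path
    from(configurations.runtimeClasspath.get().map { if (it.isDirectory) it else zipTree(it) })
    duplicatesStrategy = DuplicatesStrategy.EXCLUDE
}

publishing {
    publications {
        create<MavenPublication>("maven") {
            from(components["java"])
        }
    }
}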
30.
Source Connector Interface
● GitlabSourceConnector extends SourceConnector.
● SourceConnector: part of the Kafka
Connect framework to stream data from
external data systems to Kafka.
● Version: Returns the version of the
connector and is often used for logging
and debugging purposes.
Gitlab: Building a
Source Connector

class GitlabSourceConnector : SourceConnector() {
    override fun version(): String {
        return ConnectorVersionDetails::class.java.`package`.implementationVersion
            ?: "1.0.0"
    }
    override fun start(props: Map<String, String>) {}
    override fun config(): ConfigDef { TODO() }
    override fun taskClass(): Class<out Task> { TODO() }
    override fun taskConfigs(maxTasks: Int): List<Map<String, String>> { TODO() }
    override fun stop() {}
}
31.
Gitlab: Building a
Source Connector

class GitlabSourceConnector : SourceConnector() {
    override fun version(): String { TODO() }
    override fun start(props: Map<String, String>) {
        logger.info("Starting GitlabSourceConnector")
        this.props = props
    }
    override fun config(): ConfigDef { TODO() }
    override fun taskClass(): Class<out Task> { TODO() }
    override fun taskConfigs(maxTasks: Int): List<Map<String, String>> { TODO() }
    override fun stop() {
        logger.info("Requested connector to stop at ${Instant.now()}")
    }
}
Source Connector Lifecycle
● The start and stop methods are part of
the lifecycle of a Source Connector in
Kafka Connect.
● start(props) is called on initialization and allows setting up any resources the connector needs to run; props is a map of configuration settings.
● stop() is called when the connector is being shut down and is where it cleans up any resources that were opened or started in the start method.
32.
Gitlab: Building a
Source Connector
Source Connector Task
Configuration
● The taskConfigs method divides the work of the connector into smaller, independent tasks that can be distributed across multiple workers in a Kafka Connect cluster, with benefits such as:
○ Parallelism
○ Scalability
○ Fault Isolation
○ Flexibility
override fun taskConfigs(maxTasks: Int): List<Map<String, String>> {
    val taskConfigs = mutableListOf<Map<String, String>>()
    val repositories = props[REPOSITORIES]!!.split(", ")
    val groups = repositories.size.coerceAtMost(maxTasks)
    val reposGrouped = ConnectorUtils.groupPartitions(repositories, groups)
    for (group in reposGrouped) {
        val taskProps = mutableMapOf<String, String>()
        taskProps.putAll(props)
        taskProps[REPOSITORIES] = group.joinToString(";")
        taskConfigs.add(taskProps)
    }
    return taskConfigs
}
Input config:
{"gitlab.repositories": "Repo#1, Repo#2, Repo#3", "tasks.max": 2}
Output config:
[ {gitlab.repositories=Repo#1;Repo#2}, {gitlab.repositories=Repo#3} ]
33.
Gitlab: Building a
Source Connector
override fun config(): ConfigDef = CONFIG
const val GITLAB_ENDPOINT_CONFIG = "gitlab.service.url"
val CONFIG: ConfigDef = ConfigDef()
.define(
/* name = */ GITLAB_ENDPOINT_CONFIG,
/* type = */ ConfigDef.Type.STRING,
/* defaultValue = */ "https://gitlab.company.nl/api/v4",
/* validator = */ EndpointValidator(),
/* importance = */ ConfigDef.Importance.HIGH,
/* documentation = */ "GitLab API Root Endpoint Ex.
https://gitlab.example.com/api/v4/",
/* group = */ "Settings",
/* orderInGroup = */ -1,
/* width = */ ConfigDef.Width.MEDIUM,
/* displayName = */ "GitLab Endpoint",
/* recommender = */ EndpointRecommender()
)
Source Connector Configuration
● The ConfigDef class defines the configuration options the Kafka connector accepts.
34.
Gitlab: Building a
Source Connector
override fun config(): ConfigDef = CONFIG
class EndpointValidator : ConfigDef.Validator {
    override fun ensureValid(name: String?, value: Any?) {
        val url = value as String
        val validator = UrlValidator()
        if (!validator.isValid(url)) {
            throw ConfigException("$url must be a valid URL, e.g. https://gitlab.example.com/api/v4/")
        }
    }
}

class EndpointRecommender : ConfigDef.Recommender {
    override fun validValues(name: String, parsedConfig: Map<String, Any>): List<String> {
        return listOf("https://gitlab.company.nl/api/v4/")
    }
    override fun visible(name: String?, parsedConfig: Map<String, Any>?): Boolean {
        return true
    }
}
Source Connector Configuration
● Enhances usability and reduces the likelihood of configuration errors.
● Recommender: an instance of ConfigDef.Recommender that can suggest values for the configuration option, making it easier for users to configure options correctly.
● Validator: an instance of ConfigDef.Validator used to validate configuration values, which helps catch configuration errors early, before they cause problems at runtime.
35.
Gitlab: Building a
Source Connector

val mergedRequest: Schema = SchemaBuilder.struct()
    .name("com.sami12rom.mergedRequest")
    .version(1)
    .doc("Merged Request Value Schema")
    .field("id", SchemaBuilder.int64())
    .field("project_id", SchemaBuilder.int64())
    .field("title", SchemaBuilder.string().optional().defaultValue(null))
    .field("description", SchemaBuilder.string().optional().defaultValue(null))
    .build()

// Here mergedRequest is the parsed API response object, not the schema above
val struct = Struct(Schemas.mergedRequest)
struct.put("id", mergedRequest.id)
struct.put("project_id", mergedRequest.project_id)
struct.put("title", mergedRequest.title)
struct.put("description", mergedRequest.description)
Data Schemas: SchemaBuilder
● Schemas define the structure of the data in Kafka Connect and specify the type of each field, whether it's required or optional, and other properties.
○ Data types e.g. struct, map, array
○ Helps ensure data consistency
● A Struct holds the actual data and ensures that it conforms to the schema.
○ Needed for SourceRecord or SinkRecord.
36.
Gitlab: Building a
Source Connector
class GitlabSourceTask : SourceTask() {
    override fun start(props: Map<String, String>?) {
        initializeSource()
    }
    override fun poll(): MutableList<SourceRecord> {
        val records = mutableListOf<SourceRecord>()
        sleepForInterval()
        val response = ApiCalls.GitLabCall(props!!)
        val record = generateSourceRecord(response as MergedRequest)
        records.add(record)
        return records
    }
    override fun stop() {}
}
Source Task Class
● poll() is called repeatedly to pull data from the external source into Kafka. It should return a list of SourceRecord objects, or null if there's no data available.
37.
Source Record - Part 1
● topic: Name of the topic to write to.
● partition: Partition where the record will be
written, can be null to let Kafka assign it.
● keySchema & key: The schema & key for
this record.
● valueSchema & value: The schema & value
for this record. Value is the actual data that
will be written to the Kafka topic.
● timestamp: The timestamp for this record; can be null to let Kafka assign the current time.
● headers: Headers for this record.
Gitlab: Building a
Source Connector

val record = SourceRecord(
    /* sourcePartition = */ Map (defined by the connector),
    /* sourceOffset = */ Map (defined by the connector),
    /* topic = */ String,
    /* partition = */ Integer (Optional),
    /* keySchema = */ Schema (Optional),
    /* key = */ Object (Optional),
    /* valueSchema = */ Schema (Optional),
    /* value = */ Object (Optional),
    /* timestamp = */ Long (Optional),
    /* headers = */ generateHeaders() (Optional)
)
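Mapping those parameters onto the GitLab example, a sketch of what generateSourceRecord could look like — the topic name, the partition/offset keys, and the use of an Int64 key schema are assumptions for illustration:

fun generateSourceRecord(request: MergedRequest): SourceRecord {
    return SourceRecord(
        /* sourcePartition = */ mapOf("repository" to request.project_id.toString()),
        /* sourceOffset = */ mapOf("id" to request.id),
        /* topic = */ "gitlab-merged-requests",   // assumed topic name
        /* partition = */ null,                   // let Kafka assign the partition
        /* keySchema = */ Schema.INT64_SCHEMA,
        /* key = */ request.id,
        /* valueSchema = */ Schemas.mergedRequest,
        /* value = */ Struct(Schemas.mergedRequest)
            .put("id", request.id)
            .put("project_id", request.project_id)
            .put("title", request.title)
            .put("description", request.description),
        /* timestamp = */ System.currentTimeMillis()
    )
}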
38.
val record = SourceRecord(
    /* sourcePartition = */ Map,
    /* sourceOffset = */ Map,
    ...
)
Connector Restart Offset:
override fun start(props: Map<String, String>?) {
    initializeSource()
}
fun initializeSource(): Map<String, Any>? {
    return context.offsetStorageReader()
        .offset(sourcePartition())
}
Gitlab: Building a
Source Connector
Source Record - Part 2
(Restartability)
● sourcePartition: It defines the partition
of the source system that this record
came from, e.g. a table name for a
database connector.
● sourceOffset: It defines the position in
the source partition that this record came
from, e.g. an ID of the row for a database
connector.
● offsetStorageReader: retrieves the last stored offset for a specific partition so data ingestion can resume where it last left off (a sketch follows below).
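The sourcePartition() helper isn't shown on the slide; a sketch of how it and the resume logic might fit together inside the task — the "repository" key and the lastProcessedId field are assumptions:

// Illustrative: identify this task's slice of the source system
private fun sourcePartition(): Map<String, String> =
    mapOf("repository" to repositoryName)   // repositoryName: assumed task property

private var lastProcessedId: Long? = null

private fun resumeFromStoredOffset() {
    // offset() returns the last committed sourceOffset map, or null on the first run
    val offset: Map<String, Any>? =
        context.offsetStorageReader().offset(sourcePartition())
    lastProcessedId = offset?.get("id") as? Long
}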
39.
Testing Strategies
Unit Tests
Isolated tests for individual functions or methods, using a testing framework like JUnit and a mocking library like Mockito.
Integration Tests
Test the interaction between all components, using tools like Testcontainers to set up realistic testing environments (see the skeleton below).
End to End & Performance Tests
Validate the entire flow, from a record being produced in the source system to it being consumed from Kafka. Measure the throughput and latency of your connector under different loads.
Soak & Error Handling Tests
Run your connector for an extended period under typical load to identify long-term issues. Write tests to ensure your connector handles errors gracefully and recovers from failures.
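To make the Testcontainers point concrete, a minimal integration-test skeleton in Kotlin — the image tag and the test body are illustrative, not the talk's actual tests:

import org.junit.jupiter.api.Test
import org.testcontainers.containers.KafkaContainer
import org.testcontainers.junit.jupiter.Container
import org.testcontainers.junit.jupiter.Testcontainers
import org.testcontainers.utility.DockerImageName

@Testcontainers
class GitlabSourceConnectorIT {

    @Container
    private val kafka = KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))

    @Test
    fun `connector produces records to Kafka`() {
        // Point the connector (or an embedded Connect worker) at the container,
        // then consume from the output topic and assert on the records
        val bootstrap = kafka.bootstrapServers
        // ... start connector with bootstrap, poll the topic, assert ...
    }
}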
42.
Error Handling
Retries
For transient errors, e.g. temporary network issues.
Custom Error Handling
In your SourceTask or SinkTask, custom error handling logic can be added, e.g. catch exceptions, log them, and decide whether to fail the task or attempt to recover and continue (see the sketch below).
Monitoring Metrics
Actively monitor and alert on the connector's error rates, e.g. Task Error Metrics, Records Produced/Consumed, Task Status, Lag/Throughput Metrics.
Error Tolerance
errors.tolerance = none
● fail fast (default)
errors.tolerance = all
● silently ignore
errors.deadletterqueue.topic.name
● dead letter queues
Log Errors
Errors can be logged for troubleshooting, controlled by:
● errors.log.enable = true
● errors.log.include.messages
Avoid excessive use of Error or Warn levels in your logging.
Dead Letter Queue
Automatically send error records to a DLQ topic for later inspection, along with headers.
PS: DLQ is currently only supported for Sink Connectors, not for Source Connectors.
43.
Key Learnings
1. Planning and Design
● Understand your data source
● Decide on the type (source or sink)
● Plan config inputs, defaults, validators, and recommenders
● Consider the volume of data your connector will need to handle (parallel processing)
2. Connector Development
● Add the required dependencies
● Define the actions for the start and stop methods
● Determine the number of tasks based on your parallelism requirements
● Implement the poll method, and decide on the frequency of polling
3. Data Management
● Develop a function to fetch data from your source system
● Define the Schema and Struct
● Define the contents of the source record
● Choose the right Converter for your data format (operations)
● Consider the usage of Single Message Transforms (operations)
4. Resilience and Error Handling
● Design your connector with restartability and fault tolerance in mind
● Implement error handling
● Consider how the connector will handle network failures, API rate limits, etc.
5. Testing, Deployment, and Monitoring
● Test, test & test under different scenarios
● Set up a monitoring mechanism
● Implement proper logging
● Track performance (JMX)
44.
Q & A
Thank you for your attention
and participation
Please rate the session in the Kafka Summit App
Code
https://github.com/sami12rom/kafka-connect-gitlab