(Randall Hauch, Confluent) Kafka Summit SF 2018
The Kafka Connect framework makes it easy to move data into and out of Kafka, and you want to write a connector. Where do you start, and what are the most important things to know? This is an advanced talk that will cover important aspects of how the Connect framework works and best practices of designing, developing, testing and packaging connectors so that you and your users will be successful. We’ll review how the Connect framework is evolving, and how you can help develop and improve it.
52. void start(Map<String, String> props);
• The user-specified connector configuration
• Optionally talk to external system to get information about tasks
• Determine the task configuration
• Optionally start thread(s) to monitor external system, and if needed
ask for task reconfiguration
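As a minimal sketch, a hypothetical source connector might implement these responsibilities like this; MySourceConnector, MySourceTask, and the "task.id" property are illustrative, not part of the framework:
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MySourceConnector extends SourceConnector {
  private Map<String, String> config;

  @Override
  public void start(Map<String, String> props) {
    // Keep the user-supplied configuration; optionally contact the external
    // system here to discover the available work and start monitoring threads
    this.config = new HashMap<>(props);
  }

  @Override
  public List<Map<String, String>> taskConfigs(int maxTasks) {
    // Create at most maxTasks task configurations; each task gets the connector
    // config plus an illustrative "task.id" so it knows which slice of the
    // external system it owns
    List<Map<String, String>> taskConfigs = new ArrayList<>();
    for (int i = 0; i < maxTasks; i++) {
      Map<String, String> taskConfig = new HashMap<>(config);
      taskConfig.put("task.id", Integer.toString(i));
      taskConfigs.add(taskConfig);
    }
    return taskConfigs;
  }

  @Override
  public void stop() {
    // Stop any monitoring threads started in start()
  }

  @Override
  public Class<? extends Task> taskClass() {
    return MySourceTask.class;   // the hypothetical task sketched on the next slides
  }

  @Override
  public ConfigDef config() {
    return new ConfigDef();      // see the Configuration slide for a richer example
  }

  @Override
  public String version() {
    return "0.1.0";
  }
}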
55. void start(Map<String, String> props);
• This task’s configuration that was created by your connector
• Read previously committed offsets to know where in the external system
to start reading
• Create any resources it might need
- Connections to external system
- Buffers, queues, threads, etc.
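A hedged sketch of start() inside the hypothetical MySourceTask; the "source.name" partition key and "position" offset key are assumptions, not framework names:
private Map<String, String> sourcePartition;
private long startingPosition = 0L;

@Override
public void start(Map<String, String> props) {
  // props is the task configuration created by the connector’s taskConfigs()
  sourcePartition = Collections.singletonMap("source.name", props.get("task.id"));

  // Read the previously committed offset for this source partition so the task
  // knows where in the external system to resume reading
  Map<String, Object> committed = context.offsetStorageReader().offset(sourcePartition);
  if (committed != null && committed.get("position") != null) {
    startingPosition = ((Number) committed.get("position")).longValue();
  }

  // Create connections to the external system, buffers, queues, threads, etc.
}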
56. • Called frequently
• Get the next batch of records that Connect should write to Kafka
- block until there are “enough” records to return
- return null if no records right now
• For systems that push data
- use a separate thread to receive and process the data and enqueue it
- poll() then dequeues and returns the records
List<SourceRecord> poll() throws InterruptedException;
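A minimal sketch of that pattern inside the hypothetical MySourceTask, assuming a background thread enqueues records into a BlockingQueue:
private final BlockingQueue<SourceRecord> queue = new LinkedBlockingQueue<>();

@Override
public List<SourceRecord> poll() throws InterruptedException {
  // Wait briefly for the first record so the worker isn’t spinning
  SourceRecord first = queue.poll(1, TimeUnit.SECONDS);
  if (first == null) {
    return null;   // no records right now; the worker will call poll() again
  }
  // Return the first record plus whatever else has already been enqueued
  List<SourceRecord> batch = new ArrayList<>();
  batch.add(first);
  queue.drainTo(batch);
  return batch;
}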
57. SourceRecord
topic : String
partition : Integer
keySchema : Schema
key : Object
valueSchema : Schema
value : Object
timestamp : Long
headers : Headers
sourcePartition : Map
sourceOffset : Map
• Topic where the record is to be written
• Optional partition # for the topic
• Optional key and the schema that describes it
• Optional value and the schema that describes it
• Optional headers
• Optional timestamp
• Source “partition” and “offset”
- Describes where this record originated
- Defined by connector, used only by connector
- Connect captures last partition+offset that it writes, periodically committed to connect-offsets topic
- When task starts up, it reads these to know where it should start
- TIP: reuse the same partition Map instance
• Serialized via Converters
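A sketch of building a SourceRecord inside poll(); the topic name and the source partition/offset keys are illustrative, and nextPosition, valueSchema, and value are assumed to come from the task’s own reading logic:
Map<String, ?> sourcePartition = Collections.singletonMap("source.name", "my-source");
Map<String, ?> sourceOffset = Collections.singletonMap("position", nextPosition);

SourceRecord record = new SourceRecord(
    sourcePartition,          // where this record originated (reuse the same Map instance)
    sourceOffset,             // how far into that source this task has read
    "my-topic",               // topic to write to
    Schema.STRING_SCHEMA,     // key schema
    "some-key",               // key
    valueSchema,              // value schema (e.g. a Struct schema, next slide)
    value);                   // value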
58. Schema
name : String
version : String
doc : String
type : Type
parameters : Map
fields : List<Field>
• Name (required), version, and documentation
• Type of schema: primitive, Map, Array, Struct
• Whether the value described by the schema is optional
• Optional metadata parameters
• Information about the structure:
- For Struct schemas, the fields
- For Map schemas, the key schema and value schema
- For Array schemas, the element schema
Field
name : String
schema : Schema
index : int
• Name of field that this schema describes
• Schema for field value
• Index of field within the Struct
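For example, a Struct schema and value might be built with SchemaBuilder and Struct like this; the schema and field names are illustrative:
Schema valueSchema = SchemaBuilder.struct()
    .name("com.example.Customer")                    // schema name
    .version(1)
    .doc("A customer read from the external system")
    .field("id", Schema.INT64_SCHEMA)
    .field("name", Schema.STRING_SCHEMA)
    .field("email", Schema.OPTIONAL_STRING_SCHEMA)   // optional field
    .build();

Struct value = new Struct(valueSchema)
    .put("id", 42L)
    .put("name", "Alice");   // "email" is optional and can be left unset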
60. void start(Map<String, String> props);
• This task’s configuration that was created by your connector
- includes the topics or topics.regex property
• Start the task and create any resources it might need
61. void open(Collection<TopicPartition> partitions);
• Optionally pre-allocate writers and resources
• Consumer will by default start based on its own committed offsets
• Or use context.offset(...) to set the desired starting point (see the sketch below)
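Inside a hypothetical MySinkTask, open() might look like this; ExternalWriter and its methods are illustrative stand-ins for whatever client your external system uses:
private final Map<TopicPartition, ExternalWriter> writers = new HashMap<>();

@Override
public void open(Collection<TopicPartition> partitions) {
  for (TopicPartition tp : partitions) {
    ExternalWriter writer = new ExternalWriter(tp);   // hypothetical per-partition writer
    writers.put(tp, writer);

    // Optionally override the consumer’s committed position with the offset
    // recorded in the external system (useful for exactly once delivery)
    Long lastWritten = writer.lastWrittenKafkaOffset();
    if (lastWritten != null) {
      context.offset(tp, lastWritten + 1);            // next offset to consume
    }
  }
}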
62. • Called frequently with next batch of records returned from the consumer
• Size of batch depends on
- availability of records in our assigned topic partitions, and
- consumer settings in worker, including:
consumer.fetch.min.bytes and consumer.fetch.max.bytes
consumer.max.poll.interval.ms and consumer.max.poll.records
• Either write directly to external system or
buffer them and write when there are “enough” to send
void put(Collection<SinkRecord> records);
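A sketch of put() for the same hypothetical MySinkTask, buffering per topic partition and flushing when there are “enough” records; the threshold and the writer API are assumptions:
@Override
public void put(Collection<SinkRecord> records) {
  for (SinkRecord record : records) {
    TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
    writers.get(tp).buffer(record);
  }
  for (ExternalWriter writer : writers.values()) {
    if (writer.bufferedCount() >= 1000) {   // what counts as “enough” is connector-specific
      writer.flush();
    }
  }
}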
63. SinkRecord
topic : String
partition : Integer
keySchema : Schema
key : Object
valueSchema : Schema
value : Object
timestamp : Long
timestampType : enum
offset : long
headers : Headers
• Topic, partition, and offset from where the record was consumed
• Optional key and the schema that describes it
• Optional value and the schema that describes it
• Timestamp and its type (create time, log-append time, or none)
• Headers
- ordered list of name-value pairs
- utility to convert a value to the desired type, if possible
• Deserialized via Converters
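For example, a sink task might read a header and convert it with the Values utility; the header name "source.region" is illustrative:
for (Header header : record.headers()) {
  if ("source.region".equals(header.key())) {
    // Convert the header value to a String, if possible
    String region = Values.convertToString(header.schema(), header.value());
    // ... use the region to decide where to write the record ...
  }
}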
64. • The topic partitions that are no longer associated with the task
• Close any writers / resources for these topic partitions
void close(Collection<TopicPartition> partitions);
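Continuing the hypothetical MySinkTask and its illustrative writers:
@Override
public void close(Collection<TopicPartition> partitions) {
  for (TopicPartition tp : partitions) {
    ExternalWriter writer = writers.remove(tp);
    if (writer != null) {
      writer.close();   // whether to flush or discard buffered records first is connector-specific
    }
  }
}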
65. Sink Tasks and Exactly Once Semantics (EOS)
• Must track the offsets of records that were written to external system
- typically involves atomically storing offsets in external system
- for convenience can optionally still have consumer commit those offsets
• When a task restarts and is assigned topic partitions, it
- looks in the external system for the next offset for each topic partition
- sets in the task context the exact offsets at which to start consuming
66. • Called periodically based upon worker’s offset.flush.interval.ms
- defaults to 60 seconds
• Supplied with the current offsets of consumed records as of the last call to put(…)
• Return the offsets that the consumer should actually commit
- should reflect what was actually written so far to the external system for EOS
- or return null if consumer should not commit offsets
Map<TopicPartition, OffsetAndMetadata> preCommit(
Map<TopicPartition, OffsetAndMetadata> currentOffsets
);
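A sketch of preCommit() for the hypothetical MySinkTask, assuming each writer tracks the last Kafka offset it has actually flushed to the external system:
@Override
public Map<TopicPartition, OffsetAndMetadata> preCommit(
    Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
  Map<TopicPartition, OffsetAndMetadata> committable = new HashMap<>();
  for (Map.Entry<TopicPartition, ExternalWriter> entry : writers.entrySet()) {
    long flushed = entry.getValue().lastFlushedKafkaOffset();              // hypothetical
    committable.put(entry.getKey(), new OffsetAndMetadata(flushed + 1));   // next offset to consume
  }
  return committable;   // the consumer commits only what was actually written
}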
68. Configuration
• Only way to control a connector
• Make as simple as possible, but no simpler
• Use validators and recommenders in ConfigDef
• Strive for backward compatibility
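A sketch of a ConfigDef using a validator and the PASSWORD type (see the Security slide); the config names are illustrative:
public static final ConfigDef CONFIG_DEF = new ConfigDef()
    .define("connection.url", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
            "URL of the external system")
    .define("connection.password", ConfigDef.Type.PASSWORD, ConfigDef.Importance.HIGH,
            "Password for the external system; masked in logs and the REST API")
    .define("batch.size", ConfigDef.Type.INT, 1000, ConfigDef.Range.between(1, 100000),
            ConfigDef.Importance.MEDIUM, "Maximum number of records per write");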
69. Message Ordering
• Each Kafka topic partition is totally ordered
• Design connectors around this notion
70. Scaling
• Use tasks to parallelize work
- each task runs in its own thread
- you can start other threads, but always clean them up properly
• Sink connectors - typically use # of tasks specified by user
• Source connectors - depending on the source, only one task (or just a few) may make sense
71. Testing
• Design for testability
• Unit tests and integration tests in your build
- look for reusable integration tests (PR5516) using
embedded ZK & Kafka
• Continuous integration for regression tests
• System tests use real ZK, Kafka, Connect, and external systems
• Performance and soak tests
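As a unit-test sketch (JUnit assumed) against the hypothetical MySourceConnector from earlier:
import static org.junit.Assert.assertEquals;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.junit.Test;

public class MySourceConnectorTest {
  @Test
  public void shouldCreateOneConfigPerTask() {
    MySourceConnector connector = new MySourceConnector();
    connector.start(new HashMap<>());

    List<Map<String, String>> taskConfigs = connector.taskConfigs(3);
    assertEquals(3, taskConfigs.size());   // matches the sketch’s taskConfigs() behavior
  }
}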
72. Be Resilient
• What are the failure modes?
• Can you retry (forever) rather than fail?
- consider RetriableException
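For example, a sink task can signal a transient failure so the framework retries the same batch instead of killing the task; writeToExternalSystem is a hypothetical helper assumed to throw IOException:
@Override
public void put(Collection<SinkRecord> records) {
  try {
    writeToExternalSystem(records);
  } catch (IOException e) {
    // The framework will pause briefly and redeliver these records
    throw new RetriableException("External system unavailable; will retry", e);
  }
}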
73. Logging
• Don’t overuse ERROR or WARN
• Use INFO so users know what’s happening under normal operation and when it’s misbehaving
• Use DEBUG so users can diagnose why it’s behaving strangely
• Use TRACE for yourself, the developer
74. Security
• Use PASSWORD configuration type in ConfigDef
• Communicate securely with external system
• Don’t log records (except maybe at TRACE)
75. Package Names
• Don’t use org.apache.kafka.* packages
• These are reserved for use by the Apache Kafka project
82. Recent Improvements
AK 1.1
• Expose record headers (KIP-145)
• Bug fixes and minor improvements
AK 2.0
• Error handling (KIP-298)
• Externalized secrets (KIP-297)
• Connect REST plugin (KIP-285)
• Bug fixes and minor improvements
83. Planned Improvements
• Bug fixes and minor improvements
• Logging improvements (KAFKA-3816)
• Integration test framework (pr 5516)
• Incremental cooperative rebalancing (wiki)
• Idempotent source connectors (KIP-318)
• Create topics for source connectors (KIP-158)
• Exactly once source connectors
• And many more
84. Key Takeaways
1. You can write connectors!
2. Connect does most of the work for you
3. Connectors focus on
- using the external system
- mapping external data to records on topics and partitions