Apache Kafka® is the technology behind event streaming, which is fast becoming the central nervous system of flexible, scalable, modern data architectures. Customers want to connect their databases, data warehouses, applications, microservices, and more to power the event streaming platform. To connect to Apache Kafka, you need a connector!
This online talk dives into the new Verified Integrations Program and its integration requirements, the Connect API, and the sources and sinks that use Kafka Connect. We cover the verification steps, share code samples created by popular application and database companies, and discuss the resources available to support you through the connector development process.
This is Part 2 of 2 in Building Kafka Connectors - The Why and How
Connectors
● A logical job that copies data in and out of Apache Kafka®
● Maintains tasks
● Provides lifecycle and configuration information
● Ultimately a JAR available to the connect JVM
● Stateless! Use Kafka topics or source / target for state
Tasks
● Lifecycle managed by a connector
● Runs in a worker
● Manages one or more topic partitions
● The assignment is dynamic at runtime
● Does the actual copying and transforming of things
Workers
● Actual JVM processes running on a computer of some kind
● Tasks get allocated to workers
● Can run in standalone or distributed mode
○ Standalone is good for dev, one-offs, and conference demos
○ Distributed good for scale and fault tolerance
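For reference, a worker can be launched in either mode with the scripts that ship with Apache Kafka; a quick sketch (the two worker properties files are the stock examples, and my-connector.properties stands in for whatever connector config you create):

# Standalone: one worker process, connector config passed as a properties file
bin/connect-standalone.sh config/connect-standalone.properties my-connector.properties

# Distributed: start one or more workers, then submit connector configs via the REST API
bin/connect-distributed.sh config/connect-distributed.properties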
Converters
● Convert data to bytes for storing in Kafka
● Convert bytes read from Kafka back into data for output somewhere else
● Sit between the connector and Kafka in either direction
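Converters are selected in the worker (or per-connector) configuration; a typical sketch using the built-in JSON converter, with the Confluent Schema Registry Avro converter shown as an alternative:

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
# or, with Confluent Schema Registry:
# value.converter=io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url=http://localhost:8081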
Single Message Transforms

Modify events before storing in Kafka:
● Mask sensitive information
● Add identifiers
● Tag events
● Lineage/provenance
● Remove unnecessary columns

Modify events going out of Kafka:
● Route high-priority events to faster datastores
● Direct events to different Elasticsearch indexes
● Cast data types to match the destination
● Remove unnecessary columns
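As a sketch of how transforms are wired up, here is a hypothetical sink connector configuration; MaskField and TimestampRouter are the standard Kafka SMTs, while the field name ssn and the topic format are purely illustrative:

transforms=mask,route
transforms.mask.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.mask.fields=ssn
transforms.route.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.route.topic.format=${topic}-${timestamp}
transforms.route.timestamp.format=yyyyMMdd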
Built-in Transformations
● InsertField – Add a field using either static data or record metadata
● ReplaceField – Filter or rename fields
● MaskField – Replace a field with a valid null value for its type (0, empty string, etc.)
● ValueToKey – Set the key to one of the value’s fields
● HoistField – Wrap the entire event as a single field inside a Struct or a Map
● ExtractField – Extract a specific field from a Struct or Map and include only this field in the result
● SetSchemaMetadata – Modify the schema name or version
● TimestampRouter – Modify the topic of a record based on the original topic and timestamp; useful when a sink needs to write to different tables or indexes based on timestamps
● RegexRouter – Modify the topic of a record based on the original topic, a replacement string, and a regular expression
Starting a connector
void start(Map<String, String> props);
•The user-specified connector configuration
•Optionally talk to the external system to get information about tasks
•Determine the task configurations
•Optionally start thread(s) to monitor the external system and, if needed, ask for task reconfiguration
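A minimal sketch of what start() might look like in a source connector; the "hosts" setting and the MyConnectorConfig helper are hypothetical, not part of the Connect API:

@Override
public void start(Map<String, String> props) {
  this.configProps = props;                          // keep the raw user-supplied config
  this.config = new MyConnectorConfig(props);        // hypothetical ConfigDef-backed wrapper
  this.hosts = Arrays.asList(config.getString("hosts").split(","));
  // Optionally start a thread that watches the external system and calls
  // context.requestTaskReconfiguration() when the task layout needs to change.
}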
Task configs
List<Map<String, String>> taskConfigs(int maxTasks);
•Tell Connect the configuration for each of the tasks
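One common pattern, sketched below under the assumption that the connector copies a list of tables (the "tables" property is illustrative), is to spread the work across at most maxTasks task configs using ConnectorUtils.groupPartitions:

@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
  int numGroups = Math.min(tables.size(), maxTasks);
  List<List<String>> groups = ConnectorUtils.groupPartitions(tables, numGroups);
  List<Map<String, String>> taskConfigs = new ArrayList<>();
  for (List<String> group : groups) {
    Map<String, String> taskConfig = new HashMap<>(configProps);  // start from the connector config
    taskConfig.put("tables", String.join(",", group));            // hypothetical per-task property
    taskConfigs.add(taskConfig);
  }
  return taskConfigs;
}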
Considerations - Testing
•Design for testability
•Unit tests and integration tests in your build
-look for reusable integration tests (PR5516) using embedded ZK & Kafka
•Continuous integration for regression tests
•System tests use real ZK, Kafka, Connect, and external systems
•Performance and soak tests
•See the Connect Verification Guide for more detailed information on testing.
SourceConnector and SinkConnector - Stopping
void stop();
•Notification that the connector is being stopped
•Clean up any resources that were allocated so far
SourceTask - Starting
•This task’s configuration that was created by your connector
•Read previously committed offsets to know where in the external system to start reading
•Create any resources it might need
- Connections to the external system
- Buffers, queues, threads, etc.
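A sketch of a SourceTask.start() that looks up its last committed source offset before opening connections; the "table"/"row_id" partition and offset keys are illustrative, while offsetStorageReader() is the standard Connect API for this:

@Override
public void start(Map<String, String> props) {
  this.table = props.get("table");                                 // hypothetical per-task setting
  Map<String, Object> sourcePartition = Collections.singletonMap("table", table);
  Map<String, Object> lastOffset = context.offsetStorageReader().offset(sourcePartition);
  this.nextRowId = (lastOffset == null) ? 0L : (Long) lastOffset.get("row_id");
  // open connections, allocate buffers/queues, start background threads here
}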
SourceTask - Getting records
List<SourceRecord> poll() throws InterruptedException;
● Called frequently
● Get the next batch of records that Connect should write to Kafka
● Block until there are “enough” records to return
● Return null if there are no records right now
● For systems that push data:
○ use a separate thread to receive and process the data and enqueue it
○ poll() then dequeues and returns the records
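Here is a minimal sketch of poll() for a push-style system, assuming a hypothetical BlockingQueue<SourceRecord> that a background receiver thread fills:

@Override
public List<SourceRecord> poll() throws InterruptedException {
  SourceRecord first = queue.poll(1, TimeUnit.SECONDS);  // block briefly for the first record
  if (first == null) {
    return null;                                         // nothing right now; Connect will poll again
  }
  List<SourceRecord> batch = new ArrayList<>();
  batch.add(first);
  queue.drainTo(batch);                                  // grab whatever else is already buffered
  return batch;
}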
SourceRecord
topic : String
partition : Integer
keySchema : Schema
key : Object
valueSchema : Schema
value : Object
timestamp : Long
headers : Headers
sourcePartition : Map
sourceOffset : Map
• Topic where the record is to be written
• Optional partition number for the topic
• Optional key and the schema that describes it
• Optional value and the schema that describes it
• Optional headers
• Optional timestamp
• Source “partition” and “offset”
- Describe where this record originated
- Defined by the connector and used only by the connector
- Connect captures the last partition+offset that it writes and periodically commits them to the connect-offsets topic
- When a task starts up, it reads these to know where it should start
- TIP: reuse the same partition Map instance
• Serialized via Converters
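For example, a record built from a hypothetical database row might look like the sketch below; the "customers" table, row_id offset key, customerId, and valueSchema/value are illustrative, while the constructor is the real SourceRecord one:

Map<String, ?> sourcePartition = Collections.singletonMap("table", "customers");
Map<String, ?> sourceOffset = Collections.singletonMap("row_id", rowId);

SourceRecord record = new SourceRecord(
    sourcePartition, sourceOffset,
    "customers",                          // topic
    null,                                 // partition: null lets the producer/partitioner decide
    Schema.STRING_SCHEMA, customerId,     // key schema + key
    valueSchema, value,                   // value schema + value (built elsewhere)
    System.currentTimeMillis());          // timestamp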
Connect Schemas

Schema
name : String
version : Integer
doc : String
type : Type
parameters : Map
fields : List<Field>
• Name (required), version, and documentation
• Type of schema: primitive, Map, Array, Struct
• Whether the value described by the schema is optional
• Optional metadata parameters
• Information about the structure:
- For Struct schemas, the fields
- For Map schemas, the key schema and value schema
- For Array schemas, the element schema

Field
name : String
schema : Schema
index : int
• Name of the field that this schema describes
• Schema for the field value
• Index of the field within the Struct
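Schemas are typically built with SchemaBuilder; a small sketch (the field names and the "com.example.Customer" schema name are illustrative):

Schema valueSchema = SchemaBuilder.struct()
    .name("com.example.Customer")
    .version(1)
    .doc("A customer row")
    .field("id", Schema.INT64_SCHEMA)
    .field("name", Schema.STRING_SCHEMA)
    .field("email", Schema.OPTIONAL_STRING_SCHEMA)   // optional field
    .build();

Struct value = new Struct(valueSchema)
    .put("id", 42L)
    .put("name", "Jane Doe");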
SinkTask - Starting
void start(Map<String, String> props);
•This task’s configuration that was created by your connector
- includes the topics and topics.regex properties
•Start the task and create any resources it might need
SinkTask - Assigned topic partitions
void open(Collection<TopicPartition> partitions);
•The topic partitions assigned to this task’s consumer
•Optionally pre-allocate writers and resources
•The consumer will by default start from its own committed offsets
•Or use context.offset(...) to set the desired starting point
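A sketch of open() for a sink that tracks its own offsets in the target system; the writers map, createWriterFor(), targetSystem, and lastCommittedOffsetIn() are hypothetical helpers, while context.offset(...) is the real SinkTaskContext call:

@Override
public void open(Collection<TopicPartition> partitions) {
  Map<TopicPartition, Long> startOffsets = new HashMap<>();
  for (TopicPartition tp : partitions) {
    writers.put(tp, createWriterFor(tp));                       // pre-allocate a writer per partition
    Long committed = lastCommittedOffsetIn(targetSystem, tp);   // hypothetical lookup in the target
    if (committed != null) {
      startOffsets.put(tp, committed);                          // resume exactly where the target left off
    }
  }
  if (!startOffsets.isEmpty()) {
    context.offset(startOffsets);                               // override the consumer's starting point
  }
}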
SinkTask - Process records
void put(Collection<SinkRecord> records);
● Called frequently with the next batch of records returned from the consumer
● Size of the batch depends on:
○ availability of records in our assigned topic partitions, and
○ consumer settings in the worker, including:
consumer.fetch.min.bytes and consumer.fetch.max.bytes
consumer.max.poll.interval.ms and consumer.max.poll.records
● Either write the records directly to the external system, or buffer them and write when there are “enough” to send
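A buffering sketch of put(); the buffer, flushSize, and writeToExternalSystem() names are illustrative:

@Override
public void put(Collection<SinkRecord> records) {
  buffer.addAll(records);                    // hypothetical in-memory buffer
  if (buffer.size() >= flushSize) {          // "enough" records to send
    writeToExternalSystem(buffer);           // hypothetical bulk write to the target
    buffer.clear();
  }
}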
SinkRecord
topic : String
partition : Integer
keySchema : Schema
key : Object
valueSchema : Schema
value : Object
timestamp : Long
timestampType : enum
offset : long
headers : Headers
• Topic, partition, and offset from where the record was consumed
• Optional key and the schema that describes it
• Optional value and the schema that describes it
• Timestamp and whether it is create time, log-append time, or none
• Headers
- ordered list of name-value pairs
- utility to convert a value to the desired type, if possible
• Deserialized via Converters
Sink Connectors and Exactly Once Semantics (EOS)
• Must track the offsets of records that were written to the external system
- typically involves atomically storing the offsets in the external system
- for convenience, the consumer can optionally still commit those offsets
• When a task restarts and is assigned topic partitions, it
- looks in the external system for the next offset for each topic partition
- sets in the task context the exact offsets where to start consuming
SinkTask - Committing offsets
Map<TopicPartition, OffsetAndMetadata> preCommit(
    Map<TopicPartition, OffsetAndMetadata> currentOffsets);
● Called periodically based upon the worker’s offset.flush.interval.ms
○ defaults to 60 seconds
● Supplied the current offsets for consumed records as of the last call to put(…)
● Return the offsets that the consumer should actually commit
○ should reflect what was actually written so far to the external system for EOS
○ or return null if the consumer should not commit offsets
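A sketch of preCommit() that only lets the consumer commit what the target system has durably written; lastFlushedOffset() is a hypothetical lookup:

@Override
public Map<TopicPartition, OffsetAndMetadata> preCommit(
    Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
  Map<TopicPartition, OffsetAndMetadata> committable = new HashMap<>();
  for (TopicPartition tp : currentOffsets.keySet()) {
    long flushed = lastFlushedOffset(tp);                  // hypothetical: last offset safely written
    committable.put(tp, new OffsetAndMetadata(flushed));
  }
  return committable;                                      // or return null to skip committing
}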
Considerations - Configuration
•Only way to control a connector
•Make as simple as possible, but no simpler
•Use validators and recommenders in ConfigDef
•Strive for backward compatibility
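A small ConfigDef sketch with a validator and a PASSWORD-typed setting; the setting names are illustrative, and recommenders can be attached via the longer define(...) overloads:

static final ConfigDef CONFIG_DEF = new ConfigDef()
    .define("hosts", ConfigDef.Type.LIST, ConfigDef.Importance.HIGH,
            "Comma-separated list of hosts in the external system")
    .define("batch.size", ConfigDef.Type.INT, 100,
            ConfigDef.Range.between(1, 10000), ConfigDef.Importance.MEDIUM,
            "Maximum number of records to write per request")
    .define("connection.password", ConfigDef.Type.PASSWORD, ConfigDef.Importance.HIGH,
            "Password for the external system (masked in logs and the REST API)");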
Considerations - Scaling
•Use tasks to parallelize work
-each task runs in its own thread
-you can start other threads, but always clean them up properly
•Sink connectors - typically use # of tasks specified by user
•Source connectors - sometimes only one task makes sense, sometimes several
Considerations - Version Compatibility
•Connectors are by definition multi-body problems
-external system
-Connect runtime
-Kafka broker
•Specify what versions can be used together
Considerations - Resilience
•What are the failure modes?
•Can you retry (forever with exponential backoff) rather than fail?
-consider RetriableException
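For example, a sink task might distinguish transient from fatal failures as in the sketch below; the exception types thrown by the client library are hypothetical, while RetriableException and ConnectException are the real Connect classes:

try {
  writeToExternalSystem(records);                 // hypothetical write
} catch (TransientNetworkException e) {
  // Transient: Connect retries the batch instead of killing the task
  throw new RetriableException("Temporary failure writing to the external system", e);
} catch (AuthorizationException e) {
  // Permanent: fail the task so the user can fix the configuration
  throw new ConnectException("Fatal error writing to the external system", e);
}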
Considerations - Logging
•Don’t overuse ERROR or WARN
•Use INFO so users know what’s happening under normal operation and when misbehaving
•Use DEBUG so users can diagnose why it’s behaving strangely
•Use TRACE for yourself
Considerations - Security
•Use PASSWORD configuration type in ConfigDef
•Communicate securely with external system
•Don’t log records (except maybe at TRACE)
Lifecycle Management
● How do you make sure systems are ready before sending them records?
○ SinkTaskContext.pause() / SinkTaskContext.resume() - when the target is not ready
○ SinkTask.open() - called when a partition is assigned to a sink task
○ Connector.start() - e.g., database creation
○ taskConfigs()
Ordering
● Apache Kafka guarantees ordering within a partition, but the connector developer has to ensure it as well
● Parallel transactions might complete out of order on the target system
DLQ
● There’s an open request for API access to the DLQ
● The DLQ is just a Kafka topic; it can be written to by a producer, but this violates Kafka Connect guidelines
● The DLQ is for records the connector cannot process, not for errors
Push vs Pull
● The Kafka Connect source/sink paradigm “reverses the polarity” of the Kafka producer/consumer paradigm on purpose.
○ Producers PUSH to Kafka; a source connector PULLs from the source system (the framework PUSHes to Kafka)
○ Consumers PULL from Kafka; a sink connector PUSHes to the sink (the framework PULLs from Kafka)
● What if I want a target to PULL via a sink connector?
○ The connector still PULLs
○ The sink task can call SinkTaskContext.pause(), check for liveness on the target, and then call SinkTaskContext.resume() when it is ready.
MongoDB
•Source and Sink connector
•Source: next release of the MongoDB CDC connector from Debezium
•Sink: next release of HPG’s community connector
•https://www.mongodb.com/blog/post/getting-started-with-the-mongodb-connector-for-apache-kafka-and-mongodb-atlas
•https://github.com/mongodb/mongo-kafka
Sign up, ask questions, or start the process
confluent.io/Verified-Integrations-Program/
Email
vip@confluent.io
Kafka Summit San Francisco
kafka-summit.org/
code KS19Online25 for 25% off
Q&A