(Randall Hauch, Confluent) Kafka Summit SF 2018
The Kafka Connect framework makes it easy to move data into and out of Kafka, and you want to write a connector. Where do you start, and what are the most important things to know? This is an advanced talk that will cover important aspects of how the Connect framework works and best practices of designing, developing, testing and packaging connectors so that you and your users will be successful. We’ll review how the Connect framework is evolving, and how you can help develop and improve it.
52. void start(Map<String, String> props);
• The user-specified connector configuration
• Optionally talk to external system to get information about tasks
• Determine the task configuration
• Optionally start thread(s) to monitor external system, and if needed
ask for task reconfiguration
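As a minimal sketch, a hypothetical source connector might implement these responsibilities like this; MySourceConnector, MySourceTask, and the "task.id" property are illustrative, not part of the framework:
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MySourceConnector extends SourceConnector {
  private Map<String, String> config;

  @Override
  public void start(Map<String, String> props) {
    // Keep the user-supplied configuration; optionally contact the external
    // system here to discover the available work and start monitoring threads
    this.config = new HashMap<>(props);
  }

  @Override
  public List<Map<String, String>> taskConfigs(int maxTasks) {
    // Create at most maxTasks task configurations; each task gets the connector
    // config plus an illustrative "task.id" so it knows which slice of the
    // external system it owns
    List<Map<String, String>> taskConfigs = new ArrayList<>();
    for (int i = 0; i < maxTasks; i++) {
      Map<String, String> taskConfig = new HashMap<>(config);
      taskConfig.put("task.id", Integer.toString(i));
      taskConfigs.add(taskConfig);
    }
    return taskConfigs;
  }

  @Override
  public void stop() {
    // Stop any monitoring threads started in start()
  }

  @Override
  public Class<? extends Task> taskClass() {
    return MySourceTask.class;   // the hypothetical task sketched on the next slides
  }

  @Override
  public ConfigDef config() {
    return new ConfigDef();      // see the Configuration slide for a richer example
  }

  @Override
  public String version() {
    return "0.1.0";
  }
}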
55. void start(Map<String, String> props);
• This task’s configuration that was created by your connector
• Read previously committed offsets to know where in the external system
to start reading
• Create any resources it might need
- Connections to external system
- Buffers, queues, threads, etc.
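A hedged sketch of start() inside the hypothetical MySourceTask; the "source.name" partition key and "position" offset key are assumptions, not framework names:
private Map<String, String> sourcePartition;
private long startingPosition = 0L;

@Override
public void start(Map<String, String> props) {
  // props is the task configuration created by the connector’s taskConfigs()
  sourcePartition = Collections.singletonMap("source.name", props.get("task.id"));

  // Read the previously committed offset for this source partition so the task
  // knows where in the external system to resume reading
  Map<String, Object> committed = context.offsetStorageReader().offset(sourcePartition);
  if (committed != null && committed.get("position") != null) {
    startingPosition = ((Number) committed.get("position")).longValue();
  }

  // Create connections to the external system, buffers, queues, threads, etc.
}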
56. • Called frequently
• Get the next batch of records that Connect should write to Kafka
- block until there are “enough” records to return
- return null if no records right now
• For systems that push data
- use a separate thread to receive and process the data and enqueue it
- poll() then dequeues and returns the records
List<SourceRecord> poll() throws InterruptedException;
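A minimal sketch of that pattern inside the hypothetical MySourceTask, assuming a background thread enqueues records into a BlockingQueue:
private final BlockingQueue<SourceRecord> queue = new LinkedBlockingQueue<>();

@Override
public List<SourceRecord> poll() throws InterruptedException {
  // Wait briefly for the first record so the worker isn’t spinning
  SourceRecord first = queue.poll(1, TimeUnit.SECONDS);
  if (first == null) {
    return null;   // no records right now; the worker will call poll() again
  }
  // Return the first record plus whatever else has already been enqueued
  List<SourceRecord> batch = new ArrayList<>();
  batch.add(first);
  queue.drainTo(batch);
  return batch;
}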
57. SourceRecord
topic : String
partition : Integer
keySchema : Schema
key : Object
valueSchema : Schema
value : Object
timestamp : Long
headers : Headers
sourcePartition : Map
sourceOffset : Map
• Topic where the record is to be written
• Optional partition # for the topic
• Optional key and the schema that describes it
• Optional value and the schema that describes it
• Optional headers
• Optional timestamp
• Source “partition” and “offset”
- Describes where this record originated
- Defined by connector, used only by connector
- Connect captures last partition+offset that it writes, periodically committed to connect-offsets topic
- When task starts up, it reads these to know where it should start
- TIP: reuse the same partition Map instance
• Serialized via Converters
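A sketch of building a SourceRecord inside poll(); the topic name and the source partition/offset keys are illustrative, and nextPosition, valueSchema, and value are assumed to come from the task’s own reading logic:
Map<String, ?> sourcePartition = Collections.singletonMap("source.name", "my-source");
Map<String, ?> sourceOffset = Collections.singletonMap("position", nextPosition);

SourceRecord record = new SourceRecord(
    sourcePartition,          // where this record originated (reuse the same Map instance)
    sourceOffset,             // how far into that source this task has read
    "my-topic",               // topic to write to
    Schema.STRING_SCHEMA,     // key schema
    "some-key",               // key
    valueSchema,              // value schema (e.g. a Struct schema, next slide)
    value);                   // value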
58. Schema
name : String
version : String
doc : String
type : Type
parameters : Map
fields : List<Field>
• Name (required), version, and documentation
• Type of schema: primitive, Map, Array, Struct
• Whether the value described by the schema is optional
• Optional metadata parameters
• Information about the structure:
- For Struct schemas, the fields
- For Map schemas, the key schema and value schema
- For Array schemas, the element schema
Field
name : String
schema : Schema
index : int
• Name of field that this schema describes
• Schema for field value
• Index of field within the Struct
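For example, a Struct schema and value might be built with SchemaBuilder and Struct like this; the schema and field names are illustrative:
Schema valueSchema = SchemaBuilder.struct()
    .name("com.example.Customer")                    // schema name
    .version(1)
    .doc("A customer read from the external system")
    .field("id", Schema.INT64_SCHEMA)
    .field("name", Schema.STRING_SCHEMA)
    .field("email", Schema.OPTIONAL_STRING_SCHEMA)   // optional field
    .build();

Struct value = new Struct(valueSchema)
    .put("id", 42L)
    .put("name", "Alice");   // "email" is optional and can be left unset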
60. void start(Map<String, String> props);
• This task’s configuration that was created by your connector
- includes the topics or topics.regex property
• Start the task and create any resources it might need
61. void open(Collection<TopicPartition> partitions);
• Optionally pre-allocate writers and resources
• Consumer will by default start based on its own committed offsets
• Or use context.offset(...) to set the desired starting point (see the sketch below)
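Inside a hypothetical MySinkTask, open() might look like this; ExternalWriter and its methods are illustrative stand-ins for whatever client your external system uses:
private final Map<TopicPartition, ExternalWriter> writers = new HashMap<>();

@Override
public void open(Collection<TopicPartition> partitions) {
  for (TopicPartition tp : partitions) {
    ExternalWriter writer = new ExternalWriter(tp);   // hypothetical per-partition writer
    writers.put(tp, writer);

    // Optionally override the consumer’s committed position with the offset
    // recorded in the external system (useful for exactly once delivery)
    Long lastWritten = writer.lastWrittenKafkaOffset();
    if (lastWritten != null) {
      context.offset(tp, lastWritten + 1);            // next offset to consume
    }
  }
}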
62. • Called frequently with next batch of records returned from the consumer
• Size of batch depends on
- availability of records in our assigned topic partitions, and
- consumer settings in worker, including:
consumer.fetch.min.bytes and consumer.fetch.max.bytes
consumer.max.poll.interval.ms and consumer.max.poll.records
• Either write directly to external system or
buffer them and write when there are “enough” to send
void put(Collection<SinkRecord> records);
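A sketch of put() for the same hypothetical MySinkTask, buffering per topic partition and flushing when there are “enough” records; the threshold and the writer API are assumptions:
@Override
public void put(Collection<SinkRecord> records) {
  for (SinkRecord record : records) {
    TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
    writers.get(tp).buffer(record);
  }
  for (ExternalWriter writer : writers.values()) {
    if (writer.bufferedCount() >= 1000) {   // what counts as “enough” is connector-specific
      writer.flush();
    }
  }
}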
63. SinkRecord
topic : String
partition : Integer
keySchema : Schema
key : Object
valueSchema : Schema
value : Object
timestamp : Long
timestampType : enum
offset : long
headers : Headers
• Topic, partition, and offset from where the record was consumed
• Optional key and the schema that describes it
• Optional value and the schema that describes it
• Timestamp and its type (create time, log-append time, or none)
• Headers
- ordered list of name-value pairs
- utility to convert a value to the desired type, if possible
• Deserialized via Converters
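For example, a sink task might read a header and convert it with the Values utility; the header name "source.region" is illustrative:
for (Header header : record.headers()) {
  if ("source.region".equals(header.key())) {
    // Convert the header value to a String, if possible
    String region = Values.convertToString(header.schema(), header.value());
    // ... use the region to decide where to write the record ...
  }
}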
64. • The topic partitions that are no longer associated with the task
• Close any writers / resources for these topic partitions
void close(Collection<TopicPartition> partitions);
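Continuing the hypothetical MySinkTask and its illustrative writers:
@Override
public void close(Collection<TopicPartition> partitions) {
  for (TopicPartition tp : partitions) {
    ExternalWriter writer = writers.remove(tp);
    if (writer != null) {
      writer.close();   // whether to flush or discard buffered records first is connector-specific
    }
  }
}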
65. Sink Tasks and Exactly Once Semantics (EOS)
• Must track the offsets of records that were written to external system
- typically involves atomically storing offsets in external system
- for convenience can optionally still have consumer commit those offsets
• When a task restarts and is assigned topic partitions, it
- looks in the external system for the next offset for each topic partition
- sets in the task context the exact offsets at which to start consuming
66. • Called periodically based upon worker’s offset.flush.interval.ms
- defaults to 60 seconds
• Supplied with the current offsets of consumed records as of the last call to put(…)
• Return the offsets that the consumer should actually commit
- should reflect what was actually written so far to the external system for EOS
- or return null if consumer should not commit offsets
Map<TopicPartition, OffsetAndMetadata> preCommit(
Map<TopicPartition, OffsetAndMetadata> currentOffsets
);
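A sketch of preCommit() for the hypothetical MySinkTask, assuming each writer tracks the last Kafka offset it has actually flushed to the external system:
@Override
public Map<TopicPartition, OffsetAndMetadata> preCommit(
    Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
  Map<TopicPartition, OffsetAndMetadata> committable = new HashMap<>();
  for (Map.Entry<TopicPartition, ExternalWriter> entry : writers.entrySet()) {
    long flushed = entry.getValue().lastFlushedKafkaOffset();              // hypothetical
    committable.put(entry.getKey(), new OffsetAndMetadata(flushed + 1));   // next offset to consume
  }
  return committable;   // the consumer commits only what was actually written
}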
68. Configuration
• Only way to control a connector
• Make as simple as possible, but no simpler
• Use validators and recommenders in ConfigDef
• Strive for backward compatibility
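A sketch of a ConfigDef using a validator and the PASSWORD type (see the Security slide); the config names are illustrative:
public static final ConfigDef CONFIG_DEF = new ConfigDef()
    .define("connection.url", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
            "URL of the external system")
    .define("connection.password", ConfigDef.Type.PASSWORD, ConfigDef.Importance.HIGH,
            "Password for the external system; masked in logs and the REST API")
    .define("batch.size", ConfigDef.Type.INT, 1000, ConfigDef.Range.between(1, 100000),
            ConfigDef.Importance.MEDIUM, "Maximum number of records per write");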
69. Message Ordering
• Each Kafka topic partition is totally ordered
• Design connectors around this notion
70. Scaling
• Use tasks to parallelize work
- each task runs in its own thread
- you can start other threads, but always clean them up properly
• Sink connectors - typically use # of tasks specified by user
• Source connectors - depending on the source, only one task (or just a few) may make sense
71. Testing
• Design for testability
• Unit tests and integration tests in your build
- look for reusable integration tests (PR5516) using
embedded ZK & Kafka
• Continuous integration for regression tests
• System tests use real ZK, Kafka, Connect, and external systems
• Performance and soak tests
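As a unit-test sketch (JUnit assumed) against the hypothetical MySourceConnector from earlier:
import static org.junit.Assert.assertEquals;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.junit.Test;

public class MySourceConnectorTest {
  @Test
  public void shouldCreateOneConfigPerTask() {
    MySourceConnector connector = new MySourceConnector();
    connector.start(new HashMap<>());

    List<Map<String, String>> taskConfigs = connector.taskConfigs(3);
    assertEquals(3, taskConfigs.size());   // matches the sketch’s taskConfigs() behavior
  }
}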
72. Be Resilient
• What are the failure modes?
• Can you retry (forever) rather than fail?
- consider RetriableException
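For example, a sink task can signal a transient failure so the framework retries the same batch instead of killing the task; writeToExternalSystem is a hypothetical helper assumed to throw IOException:
@Override
public void put(Collection<SinkRecord> records) {
  try {
    writeToExternalSystem(records);
  } catch (IOException e) {
    // The framework will pause briefly and redeliver these records
    throw new RetriableException("External system unavailable; will retry", e);
  }
}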
73. Logging
• Don’t overuse ERROR or WARN
• Use INFO so users know what’s happening under normal operation and when it’s misbehaving
• Use DEBUG so users can diagnose why it’s behaving strangely
• Use TRACE for yourself, the developer
74. Security
• Use PASSWORD configuration type in ConfigDef
• Communicate securely with external system
• Don’t log records (except maybe at TRACE)
75. Package Names
• Don’t use org.apache.kafka.* packages
• These are reserved for use by the Apache Kafka project
82. Recent Improvements
AK 1.1
• Expose record headers (KIP-145)
• Bug fixes and minor improvements
AK 2.0
• Error handling (KIP-298)
• Externalized secrets (KIP-297)
• Connect REST plugin (KIP-285)
• Bug fixes and minor improvements
83. Planned Improvements
• Bug fixes and minor improvements
• Logging improvements (KAFKA-3816)
• Integration test framework (pr 5516)
• Incremental cooperative rebalancing (wiki)
• Idempotent source connectors (KIP-318)
• Create topics for source connectors (KIP-158)
• Exactly once source connectors
• And many more
84. Key Takeaways
1. You can write connectors!
2. Connect does most of the work for you
3. Connectors focus on
- using the external system
- mapping external data to records on topics and partitions