Kafka Connect moves data into and out of Kafka topics from external systems. It uses connectors that define how to read and write data from sources such as files or databases and how to map that data to Kafka topics. A source connector consists of a SourceConnector, which runs on the leader node and distributes the work, and SourceTasks, which do the actual data ingestion. Sink connectors work analogously, delivering data from Kafka topics to external systems. While Kafka Connect provides a simple way to integrate systems with Kafka, it lacks some capabilities, such as exactly-once delivery and backpressure control over ingestion speed.
External storage ≈ Kafka topic
〉 Entity: a "virtual" topic in storage ↔ a topic in Kafka
〉 Partition: a logical partition (file name, table name) or a physical partition (a file on disk) ↔ a Kafka partition
〉 Offset in partition: a logical offset (line number in a file, ID value in a table) ↔ the record number within the partition (the offset)
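To make the mapping concrete, here is a minimal sketch of a SourceTask that treats a file as the "virtual" partition and the line number within it as the logical offset. The class name, the "files"/"file"/"line" config and offset keys, and the topic name are illustrative assumptions, not anything fixed by the framework.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical task: one file is the "virtual" partition, the line number
// within it is the logical offset.
public class FileSourceTask extends SourceTask {
    private String file;
    private long line;

    @Override
    public void start(Map<String, String> props) {
        // "files" is an assumed config key set by the connector; for brevity
        // this sketch handles only the first file of the assigned group.
        file = props.get("files").split(",")[0];
        // Resume from the last committed logical offset, if any.
        Map<String, Object> committed =
                context.offsetStorageReader().offset(Collections.singletonMap("file", file));
        line = committed == null ? 0L : (Long) committed.get("line");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        String payload = readLine(file, line);        // stub: real file I/O omitted
        if (payload == null) return Collections.emptyList();
        line++;
        return Collections.singletonList(new SourceRecord(
                Collections.singletonMap("file", file),   // storage "partition"
                Collections.singletonMap("line", line),   // storage "offset"
                "ingest-topic",                           // assumed destination topic
                Schema.STRING_SCHEMA, payload));
    }

    private String readLine(String file, long line) { return null; }  // stub

    @Override
    public void stop() {}

    @Override
    public String version() { return "0.1"; }
}
```

The sourcePartition and sourceOffset maps are what Kafka Connect persists to its offset storage; that is what lets a restarted task resume from where it left off.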
Components
SourceConnector
- defines the parallelism level
- distributes the work among tasks (sketch below)
- starts on the leader node
- initiates rebalancing jobs

Rebalancing is triggered by:
- applying a new connector config (via the REST API)
- changes in the structure of the ingested data (new tables, files, partitions, etc.)

SourceTask
- does the actual data ingestion
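A hedged sketch of the SourceConnector side, pairing with the task above: taskConfigs() is where the parallelism level and the work distribution are expressed. The "files" config key and the class names are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.util.ConnectorUtils;

// Hypothetical connector: splits a fixed file list across the tasks.
public class FileSourceConnector extends SourceConnector {
    private List<String> files;

    @Override
    public void start(Map<String, String> props) {
        // "files" is an assumed config key: a comma-separated file list.
        files = Arrays.asList(props.get("files").split(","));
    }

    // Parallelism level: the framework asks for at most maxTasks configs,
    // and each returned map becomes the config of one SourceTask.
    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        int groups = Math.min(files.size(), maxTasks);
        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> group : ConnectorUtils.groupPartitions(files, groups)) {
            Map<String, String> config = new HashMap<>();
            config.put("files", String.join(",", group));
            configs.add(config);
        }
        return configs;
    }

    @Override
    public Class<? extends Task> taskClass() { return FileSourceTask.class; }

    @Override
    public void stop() {}

    @Override
    public ConfigDef config() { return new ConfigDef(); }

    @Override
    public String version() { return "0.1"; }
}
```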
Storing in put()
〉 put() should be quick (there is an internal timeout)
〉 A limited number of records is passed to each put() call
〉 Automatic offset management (the framework commits the consumer offsets); see the sketch below
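A minimal sketch of the put()-based approach, assuming a hypothetical DirectSinkTask: the framework's automatic offset commits are only safe here because put() itself persists the batch before returning.

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical sink that stores data directly in put(): the write must be
// fast, because the framework calls put() with small batches and then
// periodically commits the consumer offsets on its own.
public class DirectSinkTask extends SinkTask {
    @Override
    public void start(Map<String, String> props) {}

    @Override
    public void put(Collection<SinkRecord> records) {
        // Must finish quickly (internal timeout); each batch is limited in size.
        writeBatch(records);          // stub: fast append to external storage
    }

    private void writeBatch(Collection<SinkRecord> records) {}  // stub

    @Override
    public void stop() {}

    @Override
    public String version() { return "0.1"; }
}
```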
Storing in flush()
〉 put() stores records in a temp file / in memory
〉 flush() uploads an optimal amount of data to storage
〉 Manual offset management (uploading index files); see the sketch below
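A sketch of the put()/flush() split with manual offset management; class and helper names are hypothetical. Overriding preCommit() lets the task decide which consumer offsets are safe to commit, instead of relying on the framework's automatic bookkeeping.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical sink with manual offset management: put() buffers,
// flush() uploads, preCommit() commits only what was actually uploaded.
public class ManualOffsetSinkTask extends SinkTask {
    private final List<SinkRecord> buffer = new ArrayList<>();
    private final Map<TopicPartition, OffsetAndMetadata> safeToCommit = new HashMap<>();

    @Override
    public void start(Map<String, String> props) {}

    @Override
    public void put(Collection<SinkRecord> records) {
        buffer.addAll(records);   // cheap; slow storage I/O is deferred
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        uploadWithIndexFile(buffer);          // stub: data files + index file
        safeToCommit.putAll(currentOffsets);  // these offsets are now durable
        buffer.clear();
    }

    // The framework commits whatever this returns instead of its own
    // bookkeeping: the offsets follow the data, not the other way around.
    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        flush(currentOffsets);
        return new HashMap<>(safeToCommit);
    }

    private void uploadWithIndexFile(List<SinkRecord> records) {}  // stub

    @Override
    public void stop() {}

    @Override
    public String version() { return "0.1"; }
}
```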
Global rebalancing
〉 A JVM running Kafka Connect can host multiple connectors
〉 Rebalancing one of them triggers rebalancing of all the rest
Solution: run one connector per JVM
Writing offsets without sending a source record
〉 Ingesting a file that produces no records (e.g. it is empty)
Solutions:
1) send a marker SourceRecord that carries the offset (sketch below)
2) obtain the OffsetStorageWriter via reflection and write the offset directly
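A sketch of solution 1, assuming the same hypothetical file source as earlier: the marker record carries only the source partition and offset, with a null value, so the framework persists the offset even though the file contributed no data. Downstream consumers have to recognize and skip such null-valued markers.

```java
import java.util.Collections;
import java.util.List;
import org.apache.kafka.connect.source.SourceRecord;

// Sketch of solution 1: an empty file yields no data, but we still want its
// offset committed. Emit a marker record carrying only source metadata.
public final class OffsetMarkers {
    static List<SourceRecord> markerFor(String file) {
        SourceRecord marker = new SourceRecord(
                Collections.singletonMap("file", file),   // source partition
                Collections.singletonMap("line", 0L),     // offset to persist
                "ingest-topic",                           // assumed topic name
                null, null);                              // no schema, null value
        return Collections.singletonList(marker);
    }
}
```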
Controlling ingestion speed (backpressure)
〉 Source
- no control over the speed of writes to Kafka
- solution: sleep() in poll() + producer tuning
〉 Sink
- no control over the speed of storing data in the external storage
- solution: sleep() + throw new RetriableException in put() (sketch below)
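A hedged sketch of the sink-side workaround (class and helper names are hypothetical). Throwing RetriableException tells the framework to redeliver the same batch to put(), and the sleep() turns that retry loop into crude backpressure; the source side is analogous, with the sleep() placed inside poll().

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.errors.RetriableException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical sink: when the external storage cannot keep up, back off
// and ask the framework to redeliver the same batch.
public class ThrottledSinkTask extends SinkTask {
    @Override
    public void start(Map<String, String> props) {}

    @Override
    public void put(Collection<SinkRecord> records) {
        if (storageOverloaded()) {                 // stub: e.g. check queue depth
            try {
                Thread.sleep(1000);                // crude backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // The framework will retry put() with the same records.
            throw new RetriableException("storage overloaded, retrying");
        }
        write(records);                            // stub: store the batch
    }

    private boolean storageOverloaded() { return false; }   // stub
    private void write(Collection<SinkRecord> records) {}   // stub

    @Override
    public void stop() {}

    @Override
    public String version() { return "0.1"; }
}
```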
Exactly once delivery
〉 not supported
〉 Source
- data and offsets are stored separately => duplicates are possible
- the technical capability exists, but it has not been implemented
Solutions:
- an extra deduplication step (for instance, in Kafka Streams)
- a compacted data topic
〉 Sink
- idempotence: upload an index file listing the data files + consistent file naming (sketch below)
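A sketch of the naming part of that idea (the scheme itself is an assumption of this example): if the data-file name is a pure function of the topic, partition, and offset range, a redelivered batch overwrites the same file instead of creating a duplicate.

```java
import org.apache.kafka.common.TopicPartition;

// Illustrative idempotent naming: the same batch always maps to the same
// file name, so retries overwrite rather than duplicate.
public final class FileNames {
    static String dataFileName(TopicPartition tp, long firstOffset, long lastOffset) {
        return String.format("%s-%d_%d-%d.data",
                tp.topic(), tp.partition(), firstOffset, lastOffset);
    }
}
```

Uploading the index file last then makes the whole batch visible atomically: a crash before the index upload leaves only orphaned data files, which a rerun simply overwrites.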