12. External storage ≈ Kafka topic
〉Entity: external storage acts as a "virtual" topic; the Kafka counterpart is a topic
〉Partition
- logical partition: file name, table name
- physical partition: file on disk
- Kafka counterpart: partition
〉Offset in partition
- logical offset: line number in a file, ID value in a table
- Kafka counterpart: record number within the partition (offset)
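The mapping above shows up directly in code: Kafka Connect represents a source partition and a source offset as plain String-to-Object maps attached to each SourceRecord. A minimal sketch for a file-based source, where the key names "file" and "line" are my own assumptions, not a framework convention:

```java
import java.util.Map;

// Sketch (assumption): a file-based source where the logical partition is the
// file name and the logical offset is the last line number read. In Kafka
// Connect these maps are passed to the SourceRecord constructor.
public class OffsetMapping {
    // Logical partition: identifies one "virtual" partition of the storage.
    static Map<String, Object> filePartition(String fileName) {
        return Map.of("file", fileName);
    }

    // Logical offset: position within that partition.
    static Map<String, Object> fileOffset(long lineNumber) {
        return Map.of("line", lineNumber);
    }
}
```

For a table-based source the same shape would hold, with the table name as the partition and the last ingested ID value as the offset.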
13. Components
〉SourceConnector
- defines the parallelism level
- distributes the work among tasks
- starts on the leader node
- launches the rebalancing job
〉Rebalancing job
- applies a new connector config (via the REST API)
- reacts to changes in the structure of the ingested data (new tables, files, partitions, etc.)
〉SourceTask
- performs the actual data ingestion
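The "defines parallelism / distributes work" role of the SourceConnector lives in its taskConfigs(maxTasks) method: the connector splits the known work units into at most maxTasks groups, and each group becomes the configuration of one SourceTask. A simplified sketch of that distribution logic in plain Java (round-robin assignment is my assumption; a real connector might split by size or by table):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: distribute work units (e.g. file names) round-robin across at most
// maxTasks groups, mirroring what SourceConnector.taskConfigs(maxTasks) does.
public class WorkDistribution {
    static List<List<String>> taskConfigs(List<String> files, int maxTasks) {
        // Never create more groups than there are files (but at least one).
        int groups = Math.min(maxTasks, Math.max(1, files.size()));
        List<List<String>> configs = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            configs.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            configs.get(i % groups).add(files.get(i)); // round-robin
        }
        return configs;
    }
}
```

When the rebalancing job detects new files or tables, the framework calls taskConfigs again and redistributes the work.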
36. Storing in put()
〉put() should be quick (the framework enforces an internal timeout)
〉Only a limited number of records is passed to each put() call
〉Offset management is automatic (handled by the consumer)
37. Storing in flush()
〉put() stores records in a temp file or in memory
〉flush() uploads an optimal amount of data to the storage
〉Offset management is manual (offsets advance when index files are uploaded)
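The put()/flush() split above can be sketched as a minimal buffering sink. This is plain Java standing in for a SinkTask, with a list playing the role of the external storage; the point is only the division of labor: put() does no I/O, flush() does one bulk upload.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (assumption): a SinkTask-style buffer. put() only appends to an
// in-memory buffer, so it stays well under the framework's internal timeout;
// flush() moves the whole accumulated batch to the external store in one
// upload, which is also the point where offsets would be committed.
public class BufferingSink {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> storage = new ArrayList<>(); // stand-in for external storage

    void put(List<String> records) {
        buffer.addAll(records); // quick: no I/O here
    }

    void flush() {
        storage.addAll(buffer); // one optimally sized upload
        buffer.clear();
    }

    int stored() {
        return storage.size();
    }
}
```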
43. Global rebalancing
〉A JVM running Kafka Connect can host multiple connectors
〉Rebalancing one of them triggers rebalancing of all the rest
Solution: run one connector per JVM
44. Writing offsets without sending source record
〉Sometimes an offset must be committed for input that yields no records (e.g. an empty file)
Solutions:
1) send a marker SourceRecord that carries the offset
2) obtain the offsetStorageWriter via reflection and write the offset directly
45. Controlling ingestion speed (backpressure)
〉Source
- no built-in control over the speed of writes to Kafka
- solution: sleep() in poll() + producer tuning
〉Sink
- no built-in control over the speed of storing data in the external storage
- solution: sleep() + throw a RetriableException in put()
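The source-side workaround can be sketched as a throttled poll loop: when there is nothing to ingest (or the task wants to slow down), poll() sleeps briefly and returns an empty batch instead of busy-looping. Plain Java standing in for SourceTask.poll(); the backoff constant is an arbitrary assumption:

```java
import java.util.Collections;
import java.util.List;

// Sketch (assumption): backpressure on the source side via sleep() in poll().
public class ThrottledPoll {
    static final long POLL_BACKOFF_MS = 500; // tuning knob, chosen arbitrarily

    static List<String> poll(List<String> pending) throws InterruptedException {
        if (pending.isEmpty()) {
            // Nothing to ingest: back off instead of spinning.
            Thread.sleep(POLL_BACKOFF_MS);
            return Collections.emptyList();
        }
        List<String> batch = List.copyOf(pending); // hand the batch to the framework
        pending.clear();
        return batch;
    }
}
```

On the sink side the analogous trick is sleeping inside put() and then throwing a retriable exception so the framework redelivers the same records later.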
46. Exactly once delivery
〉Not supported
〉Source
- data and offsets are stored separately => duplicates are possible
- it is technically feasible, but has not been implemented
Solutions:
- an extra deduplication step (for instance, Kafka Streams)
- a compacted data topic
〉Sink
- idempotence: upload an index file listing the data files + consistent file naming
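The deduplication step suggested for the source side reduces to: drop any record whose ID has already been seen. A minimal sketch in plain Java; in practice this would run as a Kafka Streams processor backed by a persistent state store rather than an in-memory set, and the choice of record ID is an assumption:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch (assumption): core of a downstream deduplication step. Records are
// identified by a string ID; duplicates produced by the at-least-once source
// are filtered out before the data reaches consumers.
public class Dedup {
    private final Set<String> seen = new HashSet<>();

    List<String> filter(List<String> ids) {
        List<String> out = new ArrayList<>();
        for (String id : ids) {
            if (seen.add(id)) { // add() returns false for an already-seen ID
                out.add(id);
            }
        }
        return out;
    }
}
```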