
Kafka Connect

Kafka Connect in practice


  1. Kafka Connect. Oleg Kuznetsov, Big Data Engineer
  2. Intro
  3. What is Kafka?
  4. What is Kafka?
  5. What is Kafka?
  6. Kafka Connect
  7. Kafka Connect
     〉 Focuses on data ingestion into and out of Kafka topics
     〉 Kafka Connect is a standalone app, not a library
     〉 Distributed mode
  8. Source Connector
  9. Topic => Topic
  10. External storage => Topic
  11. External storage => Topic
  12. External storage ≈ Kafka topic
      Entity: storage = "virtual" topic; Kafka = topic
      Partition: storage = logical partition (a file name, a table name); Kafka = physical partition (a file on disk)
      Offset in partition: storage = logical offset (a line number in a file, an ID value in a table); Kafka = record number within the partition (the offset)
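This mapping is exactly what the Connect API expresses in a SourceRecord: the source partition and the source offset are free-form maps chosen by the connector. A minimal sketch (the keys "filename" and "line", the topic name, and the payload are illustrative assumptions, not part of the API):

```java
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;

public class MappingExample {
    public static SourceRecord example() {
        // A file behaves like a "virtual" topic partition: the file name plays
        // the role of the partition, the line number the role of the offset.
        Map<String, ?> sourcePartition = Map.of("filename", "/data/input.txt");
        Map<String, ?> sourceOffset = Map.of("line", 42L);

        return new SourceRecord(
                sourcePartition,       // logical partition in the external storage
                sourceOffset,          // logical offset within that partition
                "my-topic",            // destination Kafka topic
                Schema.STRING_SCHEMA,  // value schema
                "line 42 of the file"  // value
        );
    }
}
```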
  13. Components
      SourceConnector: defines the parallelism level and the work distribution; starts on the leader node; runs the rebalancing job
      Rebalancing job: applies a new connector config (REST API); reacts to changes in the structure of the ingested data (a new table, files, partitions, etc.)
      SourceTask: performs the data ingestion
  14. Architecture
  15. Architecture
  16. Architecture
  17. Architecture
  18. Methods: SourceConnector
      〉 void start(Map<String, String> props)
      〉 List<Map<String, String>> taskConfigs(int maxTasks)
      〉 void stop()
  19. FileSourceConnector
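The FileSourceConnector code on this slide is an image and is not in the transcript; what follows is a minimal sketch of what such a connector could look like, assuming hypothetical config keys "dir", "topic", and "files", and the FileSourceTask class sketched further below:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

public class FileSourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override
    public void start(Map<String, String> props) {
        this.props = props; // e.g. {"dir": "/data", "topic": "file-topic"}
    }

    // Parallelism and work distribution: split the directory's files into at
    // most maxTasks groups; each returned map becomes the config of one task.
    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        File[] files = new File(props.get("dir")).listFiles();
        if (files == null || files.length == 0) {
            return new ArrayList<>();
        }
        int groups = Math.min(maxTasks, files.length);
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            configs.add(new HashMap<>(props));
        }
        for (int i = 0; i < files.length; i++) {
            // Append this file to the "files" list of task i % groups.
            configs.get(i % groups)
                   .merge("files", files[i].getPath(), (a, b) -> a + "," + b);
        }
        return configs;
    }

    @Override
    public void stop() {}

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // config validation omitted in this sketch
    }

    @Override
    public Class<? extends Task> taskClass() {
        return FileSourceTask.class; // the task sketched further below
    }

    @Override
    public String version() {
        return "0.1";
    }
}
```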
  20. FileSourceConnector (rebalancing)
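For the rebalancing part, a connector typically watches the external storage and asks the framework to redistribute work when its structure changes. A hedged sketch of a fragment inside FileSourceConnector (listFileNames() is a hypothetical helper; context is the ConnectorContext the runtime hands to every connector):

```java
// Inside FileSourceConnector: watch the directory and request a rebalance
// whenever the set of files changes.
private volatile boolean running;
private volatile Set<String> knownFiles = new HashSet<>();

private void startMonitoring(File dir) {
    running = true;
    Thread monitor = new Thread(() -> {
        while (running) {
            Set<String> current = listFileNames(dir); // hypothetical helper
            if (!current.equals(knownFiles)) {
                knownFiles = current;
                // Triggers the rebalancing job: the framework stops the
                // tasks, calls taskConfigs() again and restarts them.
                context.requestTaskReconfiguration();
            }
            try {
                Thread.sleep(10_000); // check the directory every 10 s
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    });
    monitor.setDaemon(true);
    monitor.start();
}
```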
  21. Architecture
  22. Architecture
  23. Architecture
  24. Methods: SourceTask
      〉 void start(Map<String, String> props)
      〉 List<SourceRecord> poll()
      〉 void stop()
  25. FileSourceTask
  26. FileSourceTask
  27. FileSourceTask
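The FileSourceTask slides are also images in the original deck; here is a minimal sketch of such a task, matching the connector above (the config keys, the batch size, and the one-file-per-task simplification are assumptions):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class FileSourceTask extends SourceTask {
    private String filename;
    private String topic;
    private BufferedReader reader;
    private long lineNumber;

    @Override
    public void start(Map<String, String> props) {
        filename = props.get("files").split(",")[0]; // one file per task, for simplicity
        topic = props.get("topic");
        try {
            reader = Files.newBufferedReader(Paths.get(filename));
        } catch (IOException e) {
            throw new ConnectException(e);
        }
    }

    // Called in a loop by the framework; every returned record carries the
    // source partition (file name) and source offset (line number) so that
    // Connect can commit progress on our behalf.
    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        try {
            String line;
            while (records.size() < 100 && (line = reader.readLine()) != null) {
                lineNumber++;
                records.add(new SourceRecord(
                        Map.of("filename", filename),
                        Map.of("line", lineNumber),
                        topic, Schema.STRING_SCHEMA, line));
            }
        } catch (IOException e) {
            throw new ConnectException(e);
        }
        if (records.isEmpty()) {
            Thread.sleep(1000); // nothing new yet; avoid a busy loop
            return null;        // null is a legal "nothing right now" answer
        }
        return records;
    }

    @Override
    public void stop() {
        try { reader.close(); } catch (IOException ignored) { }
    }

    @Override
    public String version() {
        return "0.1";
    }
}
```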
  28. Architecture
  29. FileSourceTask (offset filtering)
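Offset filtering means that on startup the task asks Connect for the last committed offset of its source partition and skips everything it has already produced. A sketch of what could go at the end of FileSourceTask.start(), using the same hypothetical "filename"/"line" keys as above:

```java
// At the end of FileSourceTask.start(): ask Connect for the last committed
// offset of this file and fast-forward past what was already ingested.
Map<String, Object> committed =
        context.offsetStorageReader().offset(Map.of("filename", filename));
if (committed != null && committed.get("line") != null) {
    long lastLine = (Long) committed.get("line");
    try {
        for (long i = 0; i < lastLine; i++) {
            reader.readLine(); // skip lines up to the stored offset
        }
        lineNumber = lastLine; // continue numbering from there
    } catch (IOException e) {
        throw new ConnectException(e);
    }
}
```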
  30. Architecture
  31. Architecture
  32. Sink Connector
  33. Architecture
  34. FileSinkConnector
  35. Methods: SinkTask
      〉 void start(Map<String, String> props)
      〉 void put(Collection<SinkRecord> records)
      〉 void flush(Map<TopicPartition, OffsetAndMetadata> currOffsets)
      〉 void stop()
  36. Storing in put()
      〉 put() must return quickly (there is an internal timeout)
      〉 Only a limited number of records is passed to each put() call
      〉 Automatic offset management (handled by the consumer)
  37. Storing in flush()
      〉 put() stores records in a temp file or in memory
      〉 flush() uploads an optimal amount of data to the storage (see the sketch below)
      〉 Manual offset management (uploading index files)
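Slides 36 and 37 describe the buffer-in-put(), upload-in-flush() pattern; here is a minimal sketch of a sink task following it (uploadToStorage() is a hypothetical helper that would also write the index file with the offsets):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class FileSinkTask extends SinkTask {
    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override
    public void start(Map<String, String> props) { }

    // put() must be quick: only append to an in-memory buffer (or temp file).
    @Override
    public void put(Collection<SinkRecord> records) {
        buffer.addAll(records);
    }

    // flush() is where the buffered data is actually uploaded to the external
    // storage, together with an index file recording currOffsets, so that the
    // task knows where to resume after a restart.
    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currOffsets) {
        uploadToStorage(buffer, currOffsets);
        buffer.clear();
    }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }

    private void uploadToStorage(List<SinkRecord> records,
                                 Map<TopicPartition, OffsetAndMetadata> offsets) {
        // hypothetical: write the data file(s) plus an index file with offsets
    }
}
```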
  38. Resume reading using offsets
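To resume, the sink can read the offsets back from its index files and rewind the consumer when partitions are assigned. A hedged sketch using SinkTask.open() and SinkTaskContext.offset() (readIndexOffset() is a hypothetical helper):

```java
// Inside FileSinkTask: when partitions are (re)assigned, read the offsets
// recorded in the index files and rewind the consumer, so reading resumes
// right after the last data that was safely stored.
@Override
public void open(Collection<TopicPartition> partitions) {
    Map<TopicPartition, Long> resumeAt = new HashMap<>();
    for (TopicPartition tp : partitions) {
        Long stored = readIndexOffset(tp); // last offset in the index file
        if (stored != null) {
            resumeAt.put(tp, stored + 1);  // next record to consume
        }
    }
    if (!resumeAt.isEmpty()) {
        context.offset(resumeAt); // rewinds the underlying consumer
    }
}
```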
  39. Run
  40. Dockerfile
  41. Starting the connector
  42. Facing reality
  43. Global rebalancing
      〉 A JVM running Kafka Connect can host multiple connectors
      〉 Rebalancing one of them triggers rebalancing of all the others
      Solution: run one connector per JVM
  44. Writing offsets without sending a source record
      〉 Example: ingesting a file that contains no records (it is empty)
      Solutions:
      1) send a marker SourceRecord carrying the offset (see the sketch below)
      2) obtain the offsetStorageWriter via reflection and write the offset directly
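A sketch of the first solution: return a marker record from poll() so that the framework still commits an offset for the empty file. The marker topic name and the "done" offset key are assumptions; source offsets are free-form maps, so any payload works:

```java
// Inside FileSourceTask.poll(), when the file turned out to be empty:
SourceRecord marker = new SourceRecord(
        Map.of("filename", filename), // source partition: the empty file
        Map.of("done", true),         // offset payload: "file fully processed"
        "ingestion-markers",          // hypothetical marker topic
        null,                         // no value schema
        null);                        // tombstone: only the offset matters
return List.of(marker);
```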
  45. Controlling ingestion speed (backpressure)
      〉 Source: no built-in control over how fast data is written to Kafka
        Solution: sleep() in poll(), plus producer tuning
      〉 Sink: no built-in control over how fast data is stored in the external storage
        Solution: sleep(), or throw a RetriableException in put() (see the sketch below)
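Both workarounds in one hedged sketch (note the actual class name is org.apache.kafka.connect.errors.RetriableException; tooFast(), readNextBatch(), and storageOverloaded() are hypothetical helpers):

```java
// Source side (in a SourceTask): poll() may simply sleep and return null
// to slow ingestion down.
@Override
public List<SourceRecord> poll() throws InterruptedException {
    if (tooFast()) {
        Thread.sleep(500); // back off before producing more
        return null;       // null is a legal "nothing right now" answer
    }
    return readNextBatch();
}

// Sink side (in a SinkTask): reject the batch and let the framework
// redeliver it later.
@Override
public void put(Collection<SinkRecord> records) {
    if (storageOverloaded()) {
        context.timeout(5_000); // ask Connect to retry this batch in 5 s
        throw new RetriableException("external storage is busy, retry later");
    }
    buffer.addAll(records);
}
```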
  46. Exactly-once delivery
      〉 Not supported out of the box
      〉 Source: data and offsets are stored separately, so duplicates are possible
        (the technical capability exists, but it has not been implemented)
        Solutions: an extra deduplication process (for instance, Kafka Streams); a compacted data topic
      〉 Sink: idempotence, achieved by uploading an index file along with the data files and by consistent file naming
  47. Conclusion
  48. Conclusion
      〉 Simple and fast
      〉 Full control over how data is ingested
      〉 Mature
      〉 No dedicated cluster required (workers coordinate through Kafka itself)
      〉 Lots of free connectors (Debezium, S3, FTP, Elasticsearch, etc.)
  49. Questions?
