12. External storage ≈ Kafka topic
〉Entity: external storage acts as a "virtual" topic; the Kafka counterpart is a topic
〉Partition
- logical partition: file name, table name
- physical partition: file on disk
- Kafka counterpart: partition
〉Offset in partition
- logical offset: line number in a file, ID value in a table
- Kafka counterpart: record number within the partition (offset)
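The mapping above shows up directly in code: Kafka Connect represents a source partition and a source offset as plain String-to-Object maps attached to each SourceRecord. A minimal sketch for a file-based source, where the key names "file" and "line" are my own assumptions, not a framework convention:

```java
import java.util.Map;

// Sketch (assumption): a file-based source where the logical partition is the
// file name and the logical offset is the last line number read. In Kafka
// Connect these maps are passed to the SourceRecord constructor.
public class OffsetMapping {
    // Logical partition: identifies one "virtual" partition of the storage.
    static Map<String, Object> filePartition(String fileName) {
        return Map.of("file", fileName);
    }

    // Logical offset: position within that partition.
    static Map<String, Object> fileOffset(long lineNumber) {
        return Map.of("line", lineNumber);
    }
}
```

For a table-based source the same shape would hold, with the table name as the partition and the last ingested ID value as the offset.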
13. Components
〉SourceConnector
- defines the parallelism level
- distributes the work among tasks
- starts on the leader node
- launches the rebalancing job
〉Rebalancing job
- applies a new connector config (via the REST API)
- reacts to changes in the structure of the ingested data (new tables, files, partitions, etc.)
〉SourceTask
- performs the actual data ingestion
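The "defines parallelism / distributes work" role of the SourceConnector lives in its taskConfigs(maxTasks) method: the connector splits the known work units into at most maxTasks groups, and each group becomes the configuration of one SourceTask. A simplified sketch of that distribution logic in plain Java (round-robin assignment is my assumption; a real connector might split by size or by table):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: distribute work units (e.g. file names) round-robin across at most
// maxTasks groups, mirroring what SourceConnector.taskConfigs(maxTasks) does.
public class WorkDistribution {
    static List<List<String>> taskConfigs(List<String> files, int maxTasks) {
        // Never create more groups than there are files (but at least one).
        int groups = Math.min(maxTasks, Math.max(1, files.size()));
        List<List<String>> configs = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            configs.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            configs.get(i % groups).add(files.get(i)); // round-robin
        }
        return configs;
    }
}
```

When the rebalancing job detects new files or tables, the framework calls taskConfigs again and redistributes the work.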
36. Storing in put()
〉put() should be quick (the framework enforces an internal timeout)
〉Only a limited number of records is passed to each put() call
〉Offset management is automatic (handled by the consumer)
37. Storing in flush()
〉put() stores records in a temp file or in memory
〉flush() uploads an optimal amount of data to the storage
〉Offset management is manual (offsets advance when index files are uploaded)
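The put()/flush() split above can be sketched as a minimal buffering sink. This is plain Java standing in for a SinkTask, with a list playing the role of the external storage; the point is only the division of labor: put() does no I/O, flush() does one bulk upload.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (assumption): a SinkTask-style buffer. put() only appends to an
// in-memory buffer, so it stays well under the framework's internal timeout;
// flush() moves the whole accumulated batch to the external store in one
// upload, which is also the point where offsets would be committed.
public class BufferingSink {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> storage = new ArrayList<>(); // stand-in for external storage

    void put(List<String> records) {
        buffer.addAll(records); // quick: no I/O here
    }

    void flush() {
        storage.addAll(buffer); // one optimally sized upload
        buffer.clear();
    }

    int stored() {
        return storage.size();
    }
}
```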
43. Global rebalancing
〉A JVM running Kafka Connect can host multiple connectors
〉Rebalancing one of them triggers rebalancing of all the rest
Solution: run one connector per JVM
44. Writing offsets without sending source record
〉Sometimes an offset must be committed for input that yields no records (e.g. an empty file)
Solutions:
1) send a marker SourceRecord that carries the offset
2) obtain the offsetStorageWriter via reflection and write the offset directly
45. Controlling ingestion speed (backpressure)
〉Source
- no built-in control over the speed of writes to Kafka
- solution: sleep() in poll() + producer tuning
〉Sink
- no built-in control over the speed of storing data in the external storage
- solution: sleep() + throw a RetriableException in put()
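The source-side workaround can be sketched as a throttled poll loop: when there is nothing to ingest (or the task wants to slow down), poll() sleeps briefly and returns an empty batch instead of busy-looping. Plain Java standing in for SourceTask.poll(); the backoff constant is an arbitrary assumption:

```java
import java.util.Collections;
import java.util.List;

// Sketch (assumption): backpressure on the source side via sleep() in poll().
public class ThrottledPoll {
    static final long POLL_BACKOFF_MS = 500; // tuning knob, chosen arbitrarily

    static List<String> poll(List<String> pending) throws InterruptedException {
        if (pending.isEmpty()) {
            // Nothing to ingest: back off instead of spinning.
            Thread.sleep(POLL_BACKOFF_MS);
            return Collections.emptyList();
        }
        List<String> batch = List.copyOf(pending); // hand the batch to the framework
        pending.clear();
        return batch;
    }
}
```

On the sink side the analogous trick is sleeping inside put() and then throwing a retriable exception so the framework redelivers the same records later.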
46. Exactly once delivery
〉Not supported
〉Source
- data and offsets are stored separately => duplicates are possible
- it is technically feasible, but has not been implemented
Solutions:
- an extra deduplication step (for instance, Kafka Streams)
- a compacted data topic
〉Sink
- idempotence: upload an index file listing the data files + consistent file naming
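The deduplication step suggested for the source side reduces to: drop any record whose ID has already been seen. A minimal sketch in plain Java; in practice this would run as a Kafka Streams processor backed by a persistent state store rather than an in-memory set, and the choice of record ID is an assumption:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch (assumption): core of a downstream deduplication step. Records are
// identified by a string ID; duplicates produced by the at-least-once source
// are filtered out before the data reaches consumers.
public class Dedup {
    private final Set<String> seen = new HashSet<>();

    List<String> filter(List<String> ids) {
        List<String> out = new ArrayList<>();
        for (String id : ids) {
            if (seen.add(id)) { // add() returns false for an already-seen ID
                out.add(id);
            }
        }
        return out;
    }
}
```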