This deck gives a brief introduction to Apache Kafka Connect, covering its benefits, use cases, and the motivation behind building it, along with a short discussion of its architecture.
2. Topics Covered
● What is Kafka Connect?
● Sources and Sinks
● Motivation behind Kafka Connect
● Use cases of Kafka Connect
● Architecture
● Demo
3. What is Kafka Connect?
● Added in the 0.9 release of Apache Kafka.
● A tool for scalably and reliably streaming data between Apache Kafka and other data systems.
4. What is Kafka Connect?
● It abstracts away the common problems every connector to Kafka needs to solve:
– schema management
– fault tolerance
– delivery semantics
– operations, monitoring, etc.
8. Motivation behind Kafka Connect
● Why build another framework when there are already so many to choose from?
● Most of the solutions do not integrate optimally with a stream data platform.
9. Benefits of Kafka Connect
● Broad copying by default
● Streaming and batch
● Scales to the application
● Focus on copying data only
● Accessible connector API
11. Connector Model
● The connector model defines how third-party developers create connector plugins which import or export data from another system.
● The model has two key concepts:
– Connector
– Tasks
13. Worker and Data Model
● The worker model represents the runtime in which connectors and tasks execute.
● The worker model allows Kafka Connect to scale to the application.
● The data model addresses the remaining requirements, such as tight coupling with Kafka and schema management.
14. Worker and Data Model
● Kafka Connect tracks offsets for each connector so that connectors can resume from their previous position in the event of failures or graceful restarts for maintenance.
● It has two types of workers:
– Standalone
– Distributed
For a long time, companies did data processing as big batch jobs: CSV files dumped out of databases, log files collected at the end of the day.
But businesses operate in real time. So, rather than processing data at the end of the day, why not react to it continuously as it arrives? This is where stream processing came into the picture, and this shift led to the popularity of Apache Kafka.
But even with Apache Kafka, building real-time data pipelines has required some effort.
That is why Kafka Connect was announced as a new feature in the 0.9 release of Kafka.
Schema management: The ability of the data pipeline to carry schema information where it is available. In the absence of this capability, you end up having to recreate it downstream; furthermore, if there are multiple consumers for the same data, each consumer has to recreate it. (A sketch follows after these definitions.)
Fault tolerance: Run several instances of a process and be resilient to failures.
Delivery semantics: Provide strong guarantees when machines fail or processes crash.
Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner.
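To make schema management concrete: Kafka Connect's data API (the org.apache.kafka.connect.data package) lets each record carry a schema alongside its value, so downstream consumers do not have to re-derive the structure. A minimal Java sketch, with hypothetical record fields:

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

public class SchemaExample {
    public static void main(String[] args) {
        // Declare the shape of the data once, at the source.
        // The "user" struct and its fields are invented for illustration.
        Schema userSchema = SchemaBuilder.struct().name("user")
                .field("id", Schema.INT64_SCHEMA)
                .field("email", Schema.STRING_SCHEMA)
                .build();

        // Each value carries its schema, so every consumer of the
        // topic sees the same structure without recreating it.
        Struct user = new Struct(userSchema)
                .put("id", 42L)
                .put("email", "jane@example.com");
        System.out.println(user);
    }
}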
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems.
It makes it simple to quickly define connectors that move large data sets into and out of Kafka.
Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency.
Sources import data into Kafka, and Sinks export data from Kafka.
An implementation of a Source or Sink is a Connector.
Users deploy connectors to enable data flows on Kafka.
Some of the certified connectors built on the Kafka Connect framework are (a deployment sketch follows below):
Source -> JDBC, Couchbase, Apache Ignite, Cassandra
Sink -> HDFS, Apache Ignite, Solr
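As an illustration of deploying one of these: a Kafka Connect worker exposes a REST API (port 8083 by default), and POSTing a JSON configuration to /connectors creates a connector. The sketch below submits a JDBC source config using Java's built-in HTTP client; the connector name, connection URL, and topic prefix are placeholders, and the configuration keys belong to the Confluent JDBC connector and may differ by version:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeployConnector {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC source configuration.
        String body = """
                {
                  "name": "jdbc-users-source",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:postgresql://localhost:5432/mydb",
                    "mode": "incrementing",
                    "incrementing.column.name": "id",
                    "topic.prefix": "db-"
                  }
                }""";

        // POST /connectors registers the connector with the worker.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}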
Most existing solutions do not integrate optimally with a stream data platform, where streaming, event-based data is the lingua franca and Kafka is the common medium that serves as a hub for all data.
For example, log and metric collection frameworks like Flume and Logstash do not handle integration with batch systems well, and they are operationally complex for large data pipelines, since an agent runs on each server.
ETL frameworks for data warehousing, such as Gobblin and Siro, target a specific use case and work with a single sink.
Broad copying by default: quickly define connectors that copy vast quantities of data between systems.
Streaming and batch: support copying to and from both streaming and batch-oriented systems.
Scales to the application: scale down to a single process running one connector in a small production environment, and scale up to an organization-wide service for copying data between a wide variety of large-scale systems.
Focus on copying data only: focus on reliable, scalable data copying, leaving transformation, enrichment, and other modifications to other tools.
Accessible connector API: it is easy to develop new connectors; the API and runtime model make implementing them simple.
Connectors are the largest logical unit of work in Kafka Connect and define where data should be copied to and from.
This might cover copying a whole database or a collection of databases into Kafka.
A connector does not perform any copying itself; instead, it breaks the job down and schedules tasks to do it.
Tasks are responsible for producing or consuming sequences of ConnectRecords in order to copy data.
The connector is the core concept that users of Kafka Connect interact with.
Partitions are balanced evenly across tasks.
Each task reads from its partitions, translates the data to Kafka Connect's format, and decides the destination topic (and possibly partition) in Kafka. A sketch of such a connector and task follows below.
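To make the connector/task split concrete, here is a minimal sketch of a source connector whose "partitions" are database tables. The class name and the "tables" config key are invented for illustration; note that taskConfigs() only divides the work, while the actual copying happens in the tasks:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

// Hypothetical connector that copies a list of tables into Kafka.
public class TableSourceConnector extends SourceConnector {
    private List<String> tables;

    @Override
    public void start(Map<String, String> props) {
        // "tables" is an invented config key naming the source partitions.
        tables = List.of(props.get("tables").split(","));
    }

    @Override
    public Class<? extends Task> taskClass() {
        return TableSourceTask.class; // the task sketched further below
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // No copying here: just balance the tables evenly across
        // at most maxTasks task configurations.
        int numTasks = Math.min(maxTasks, tables.size());
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            configs.add(new HashMap<>());
        }
        for (int i = 0; i < tables.size(); i++) {
            configs.get(i % numTasks)
                   .merge("tables", tables.get(i), (a, b) -> a + "," + b);
        }
        return configs;
    }

    @Override
    public void stop() {}

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "0.1";
    }
}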
This layer decouples the logical work (connectors) from the physical execution (workers executing tasks).
Workers are processes that execute connectors and tasks.
Workers automatically coordinate with each other to distribute work and provide scalability and fault tolerance.
The data model handles all the other requirements, such as schema management and tight coupling with Kafka.
Kafka Connect tracks offsets for each connector so that connectors can resume from their previous position in the event of failures or graceful restarts for maintenance; the task sketch below shows how.
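Here is a matching task sketch showing how a task reads back its last committed offset on startup and attaches partition/offset information to each record it emits; the "table"/"row" key names and the fetchRow() helper are invented:

import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical task that copies rows of a single table into Kafka.
public class TableSourceTask extends SourceTask {
    private String table;
    private long nextRow;

    @Override
    public void start(Map<String, String> props) {
        table = props.get("tables");
        // Ask the framework for the last committed offset for this
        // source partition, so the task resumes where it left off
        // after a failure or a graceful restart.
        Map<String, Object> offset = context.offsetStorageReader()
                .offset(Map.of("table", table));
        nextRow = (offset == null) ? 0L : (Long) offset.get("row");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // throttle this sketch's polling loop
        String value = fetchRow(table, nextRow); // stand-in for a real read
        SourceRecord record = new SourceRecord(
                Map.of("table", table),   // source partition
                Map.of("row", ++nextRow), // source offset, committed by the framework
                "db-" + table,            // destination topic
                Schema.STRING_SCHEMA, value);
        return List.of(record);
    }

    private String fetchRow(String table, long row) {
        return table + "-row-" + row; // placeholder data
    }

    @Override
    public void stop() {}

    @Override
    public String version() {
        return "0.1";
    }
}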
Standalone mode is the simplest mode, where a single process is responsible for executing all connectors and tasks.
Since it is a single process, it requires minimal configuration.
In distributed mode, you start many worker processes using the same group.id, and they automatically coordinate to schedule execution of connectors and tasks across all available workers.
Consider a simple example of a cluster of 3 workers (processes launched via any mechanism you choose) running two connectors.
The worker processes balance the connectors and tasks across themselves.
If a connector adds partitions, this causes it to regenerate task configurations.
If one of the workers fails, the remaining workers rebalance the connectors and tasks so that the work previously handled by the failed worker is moved to other workers.