Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Cheslack-Postava


Many companies are adopting Apache Kafka to power their data pipelines, including LinkedIn, Netflix, and Airbnb. Kafka’s ability to handle high-throughput, real-time data makes it a good fit for the data integration problem: it acts as the common buffer for all your data and bridges the gap between streaming and batch systems. However, building a data pipeline around Kafka today can be challenging because it requires combining a wide variety of tools to collect data from disparate data systems. One tool streams updates from your database to Kafka, another imports logs, and yet another exports to HDFS. As a result, building a data pipeline can take significant engineering effort and carries high operational overhead, because each of these tools requires ongoing monitoring and maintenance. Additionally, some of the tools are simply a poor fit for the job: the fragmented nature of the data integration tool ecosystem leads to creative but misguided solutions, such as misusing stream processing frameworks for data integration.

We describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First, we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we’ll explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately simplify your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.

Engineering

Kafka Connect: Real-time Data Integration at Scale with Apache Kafka
By Ewen Cheslack-Postava

Data Integration
getting data to all the right places

Introducing
Kafka Connect
Large-scale streaming data import/export for Kafka

Delivery Guarantees
Offsets automatically committed and restored
On restart: task checks offsets & rewinds
At-least-once delivery: flush data, then commit
Exactly once for connectors that support it (e.g. HDFS)
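
The "flush data, then commit" ordering on the Delivery Guarantees slide can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Kafka Connect code: `InMemorySink`, `OffsetStore`, `run_task`, and the `crash_after_flush_at` parameter are all hypothetical names invented for illustration. It shows why a crash between the flush and the offset commit produces duplicates rather than data loss.

```python
class InMemorySink:
    """Hypothetical destination system: records become durable only on flush()."""

    def __init__(self):
        self.buffer = []
        self.durable = []

    def write(self, record):
        self.buffer.append(record)

    def flush(self):
        self.durable.extend(self.buffer)
        self.buffer.clear()


class OffsetStore:
    """Hypothetical offset storage: remembers the next log position to read."""

    def __init__(self):
        self.committed = 0

    def commit(self, offset):
        self.committed = offset


def run_task(log, sink, offsets, batch_size, crash_after_flush_at=None):
    """Copy records from `log` to `sink` with at-least-once semantics.

    On (re)start the task rewinds to the last committed offset. Each batch is
    flushed to the sink first and only then is the offset committed, so a
    crash between the two steps causes a replay, never a gap.
    """
    position = offsets.committed  # on restart: check offsets and rewind
    while position < len(log):
        batch = log[position:position + batch_size]
        for record in batch:
            sink.write(record)
        sink.flush()  # step 1: make the data durable
        position += len(batch)
        if crash_after_flush_at is not None and position >= crash_after_flush_at:
            raise RuntimeError("simulated crash before offset commit")
        offsets.commit(position)  # step 2: record progress


def demo():
    """Run once with a simulated crash, then restart and finish."""
    log = ["a", "b", "c", "d"]
    sink, offsets = InMemorySink(), OffsetStore()
    try:
        run_task(log, sink, offsets, batch_size=2, crash_after_flush_at=4)
    except RuntimeError:
        pass  # the second batch was flushed but its offset was never committed
    run_task(log, sink, offsets, batch_size=2)  # restart from committed offset
    return sink.durable, offsets.committed
```

In this sketch, `demo()` ends with every record in the sink at least once, and the second batch appears twice because it is replayed after the restart. Exactly-once delivery requires the destination to cooperate, for example by storing data and offsets together atomically or by making writes idempotent, which is why the slide limits it to connectors that support it.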

Converters
Abstract serialization: 1 connector, many serialization formats
Convert between Kafka Connect Data API (Connectors) and serialized bytes (Kafka)
JSON and Avro are currently well supported
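
The idea of "1 connector, many serialization formats" can be sketched as follows. This is an illustrative Python analogue, not the Kafka Connect API: the `Converter` protocol, its method names, `KeyValueTextConverter`, and `source_connector_poll` are assumptions made for the example, and the key=value text format stands in for a second format such as Avro only to keep the sketch free of external dependencies.

```python
import json
from typing import Any, Dict, List, Protocol


class Converter(Protocol):
    """Hypothetical converter interface: structured record <-> serialized bytes."""

    def from_connect_data(self, value: Dict[str, Any]) -> bytes: ...

    def to_connect_data(self, data: bytes) -> Dict[str, Any]: ...


class JsonConverter:
    """Serializes records as UTF-8 JSON with sorted keys."""

    def from_connect_data(self, value: Dict[str, Any]) -> bytes:
        return json.dumps(value, sort_keys=True).encode("utf-8")

    def to_connect_data(self, data: bytes) -> Dict[str, Any]:
        return json.loads(data.decode("utf-8"))


class KeyValueTextConverter:
    """Stand-in second format: one key=value line per field (string values only)."""

    def from_connect_data(self, value: Dict[str, Any]) -> bytes:
        lines = [f"{key}={value[key]}" for key in sorted(value)]
        return "\n".join(lines).encode("utf-8")

    def to_connect_data(self, data: bytes) -> Dict[str, Any]:
        text = data.decode("utf-8")
        return dict(line.split("=", 1) for line in text.splitlines())


def source_connector_poll() -> List[Dict[str, Any]]:
    """Hypothetical connector: emits structured records and knows nothing about bytes."""
    return [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]


def produce(converter: Converter) -> List[bytes]:
    """The framework's side: apply whichever converter is configured."""
    return [converter.from_connect_data(r) for r in source_connector_poll()]
```

The connector code is identical in both cases; only the converter passed to `produce` changes, which is the separation the slide describes between the Data API on the connector side and serialized bytes on the Kafka side.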

Connectors Today
Confluent Open Source: HDFS, JDBC
Connector Hub: connectors.confluent.io
Examples: MySQL, MongoDB, Twitter, Solr, S3, MQTT, Bloomberg, Apache Ignite, and more

Connectors from the Hackathon
Jenkins connector – Aravind Yarram (Equifax)
Twitter semantic analysis and visualization – Ashish Singh (Cloudera)
Brain monitoring device connector – Silicon Valley Data Science
DynamoDB, Cassandra, Slack, Splunk, and many more

Coming soon…
Improved connector control via REST API, standardized configs, metrics
Single record transformations
Data pipelines in an app: embedded mode & Kafka Streams integration
Many more connectors

THANK YOU
@ewencp
@confluentinc
Try it out: http://confluent.io/download
More like this, but in blog form: http://confluent.io/blog
