Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Ewen Cheslack-Postava, Confluent
Spark Summit East talk
1. BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING (Ewen Cheslack-Postava, Confluent)
2. About Me: Ewen Cheslack-Postava
   • Engineer @ Confluent
   • Kafka Committer
   • Kafka Connect Lead
3. Traditional ETL
4. More Data Systems
5. Stream Processing
6. Separation of Concerns
7. Kafka Connect: large-scale streaming data import/export for Kafka
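As a concrete illustration beyond the slides, a minimal source connector needs only a small properties file; the file path and topic name below are placeholders, and the FileStreamSource connector used here ships with Kafka as a simple example connector.

    # file-source.properties (hypothetical example)
    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/app.log
    topic=app-logs

    # Run it with the standalone worker that ships with Kafka:
    bin/connect-standalone.sh config/connect-standalone.properties file-source.properties

tasks.max caps how many tasks the connector may split its work into, which is how Connect parallelizes a single logical connector (see "Tasks - Parallelism" below).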
8. Separation of Concerns
9. Tasks - Parallelism
10. Execution Model - Standalone
11. Execution Model - Distributed
12. Execution Model - Distributed
13. Execution Model - Distributed
14. Data Integration as a Service
15. Delivery Guarantees
   • Automatic offset checkpointing and recovery
     – Supports at least once
     – Exactly once for connectors that support it (e.g. HDFS)
     – At most once simply swaps write & commit
     – On restart: task checks offsets & rewinds
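To make the offset checkpointing concrete, here is an editorial sketch (not from the talk) of a source task in Scala: every record it emits carries a source partition and offset, the framework checkpoints those offsets, and on restart the task reads the committed offset back from its context and rewinds. The CounterSourceTask class and its partition/offset keys are hypothetical.

    import java.util.{Collections, List => JList, Map => JMap}
    import org.apache.kafka.connect.data.Schema
    import org.apache.kafka.connect.source.{SourceRecord, SourceTask}

    // Hypothetical counter source: emits one number per poll() and lets the
    // Connect framework checkpoint how far it has gotten.
    class CounterSourceTask extends SourceTask {
      private var next = 0L

      override def version(): String = "0.1"

      override def start(props: JMap[String, String]): Unit = {
        // On restart, look up the last committed offset and rewind to it.
        val offset = context.offsetStorageReader()
          .offset(Collections.singletonMap("source", "counter"))
        if (offset != null) next = offset.get("position").asInstanceOf[Long]
      }

      override def poll(): JList[SourceRecord] = {
        val record = new SourceRecord(
          Collections.singletonMap("source", "counter"),        // source partition
          Collections.singletonMap("position", Long.box(next)), // source offset
          "counter-topic", Schema.INT64_SCHEMA, Long.box(next))
        next += 1
        Collections.singletonList(record)
      }

      override def stop(): Unit = ()
    }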
16. Spark Streaming
   • Use Direct Kafka streams (1.3+)
     – Better integration, more efficient, better semantics
   • Spark Kafka Writer
     – At least once
     – Kafka community is working on improved producer semantics
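A minimal sketch of the direct stream approach in Scala, using the spark-streaming-kafka API introduced in Spark 1.3; the broker address and topic name are placeholders.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct stream: no receivers, one RDD partition per Kafka partition,
    // with offsets tracked by Spark rather than by ZooKeeper.
    val conf = new SparkConf().setAppName("connect-spark-example")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("app-logs"))

    stream.map(_._2).count().print()   // e.g. count records in each batch
    ssc.start()
    ssc.awaitTermination()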
17. Spark Streaming & Kafka Connect
   • Increase # of systems Spark Streaming works with, indirectly
   • Reduce friction to adopt Spark Streaming
   • Reduce need for Spark-specific connectors
   • By leveraging Kafka as de facto streaming data storage
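In the other direction, results can be written back to a Kafka topic that a sink connector then exports, so Spark never needs a connector for the downstream system. A rough at-least-once sketch that continues the stream from the previous example; the output topic name is a placeholder.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    stream.map(_._2).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition per batch; retries can produce
        // duplicates, hence at-least-once rather than exactly-once.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(v => producer.send(new ProducerRecord[String, String]("results", v)))
        producer.close()
      }
    }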
18. Kafka Connect Summary
   • Designed for large-scale stream or batch data integration
   • Community supported and certified way of using Kafka
   • Soon, large repository of open source connectors
   • Easy data pipelines when combined with Spark & Spark Streaming
19. THANK YOU.
   Follow me on Twitter: @ewencp
   Try it out: http://confluent.io/download
   More like this, but in blog form: http://confluent.io/blog
