Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Ewen Cheslack-Postava, Confluent
Spark Summit East talk
1. BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING (Ewen Cheslack-Postava, Confluent)
2. About Me: Ewen Cheslack-Postava
   • Engineer @ Confluent
   • Kafka Committer
   • Kafka Connect Lead
3. Traditional ETL
4. More Data Systems
5. Stream Processing
6. Separation of Concerns
7. Kafka Connect: large-scale streaming data import/export for Kafka
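As a concrete illustration beyond the slides, a minimal source connector needs only a small properties file; the file path and topic name below are placeholders, and the FileStreamSource connector used here ships with Kafka as a simple example connector.

    # file-source.properties (hypothetical example)
    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/app.log
    topic=app-logs

    # Run it with the standalone worker that ships with Kafka:
    bin/connect-standalone.sh config/connect-standalone.properties file-source.properties

tasks.max caps how many tasks the connector may split its work into, which is how Connect parallelizes a single logical connector (see "Tasks - Parallelism" below).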
8. Separation of Concerns
9. Tasks - Parallelism
10. Execution Model - Standalone
11. Execution Model - Distributed
12. Execution Model - Distributed
13. Execution Model - Distributed
14. Data Integration as a Service
15. Delivery Guarantees
   • Automatic offset checkpointing and recovery
     – Supports at least once
     – Exactly once for connectors that support it (e.g. HDFS)
     – At most once simply swaps write & commit
     – On restart: task checks offsets & rewinds
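To make the offset checkpointing concrete, here is an editorial sketch (not from the talk) of a source task in Scala: every record it emits carries a source partition and offset, the framework checkpoints those offsets, and on restart the task reads the committed offset back from its context and rewinds. The CounterSourceTask class and its partition/offset keys are hypothetical.

    import java.util.{Collections, List => JList, Map => JMap}
    import org.apache.kafka.connect.data.Schema
    import org.apache.kafka.connect.source.{SourceRecord, SourceTask}

    // Hypothetical counter source: emits one number per poll() and lets the
    // Connect framework checkpoint how far it has gotten.
    class CounterSourceTask extends SourceTask {
      private var next = 0L

      override def version(): String = "0.1"

      override def start(props: JMap[String, String]): Unit = {
        // On restart, look up the last committed offset and rewind to it.
        val offset = context.offsetStorageReader()
          .offset(Collections.singletonMap("source", "counter"))
        if (offset != null) next = offset.get("position").asInstanceOf[Long]
      }

      override def poll(): JList[SourceRecord] = {
        val record = new SourceRecord(
          Collections.singletonMap("source", "counter"),        // source partition
          Collections.singletonMap("position", Long.box(next)), // source offset
          "counter-topic", Schema.INT64_SCHEMA, Long.box(next))
        next += 1
        Collections.singletonList(record)
      }

      override def stop(): Unit = ()
    }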
16. Spark Streaming
   • Use Direct Kafka streams (1.3+)
     – Better integration, more efficient, better semantics
   • Spark Kafka Writer
     – At least once
     – Kafka community is working on improved producer semantics
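A minimal sketch of the direct stream approach in Scala, using the spark-streaming-kafka API introduced in Spark 1.3; the broker address and topic name are placeholders.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct stream: no receivers, one RDD partition per Kafka partition,
    // with offsets tracked by Spark rather than by ZooKeeper.
    val conf = new SparkConf().setAppName("connect-spark-example")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("app-logs"))

    stream.map(_._2).count().print()   // e.g. count records in each batch
    ssc.start()
    ssc.awaitTermination()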
17. Spark Streaming & Kafka Connect
   • Increase # of systems Spark Streaming works with, indirectly
   • Reduce friction to adopt Spark Streaming
   • Reduce need for Spark-specific connectors
   • By leveraging Kafka as de facto streaming data storage
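In the other direction, results can be written back to a Kafka topic that a sink connector then exports, so Spark never needs a connector for the downstream system. A rough at-least-once sketch that continues the stream from the previous example; the output topic name is a placeholder.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    stream.map(_._2).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition per batch; retries can produce
        // duplicates, hence at-least-once rather than exactly-once.
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(v => producer.send(new ProducerRecord[String, String]("results", v)))
        producer.close()
      }
    }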
18. Kafka Connect Summary
   • Designed for large-scale stream or batch data integration
   • Community supported and certified way of using Kafka
   • Soon, large repository of open source connectors
   • Easy data pipelines when combined with Spark & Spark Streaming
19. THANK YOU.
   Follow me on Twitter: @ewencp
   Try it out: http://confluent.io/download
   More like this, but in blog form: http://confluent.io/blog
