Spark + Kafka
1. The streaming application uses a StreamingContext, which in turn uses the SparkContext to launch jobs across the cluster.
2. Receivers running on Executors receive the incoming stream data.
3. Each Receiver divides the stream into blocks and writes those blocks to the Spark BlockManager.
4. The Spark BlockManager replicates the blocks to other Executors for fault tolerance.
5. The Receiver reports the received blocks to the StreamingContext.
6. The StreamingContext periodically (every batch interval) takes all the blocks received in that interval, creates an RDD from them, and uses the SparkContext to launch jobs on those RDDs.
7. Spark processes each RDD by running tasks on its blocks.
8. This process repeats for every batch interval (see the sketch after this list).
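To make the batch-interval mechanics above concrete, here is a minimal receiver-based Spark Streaming sketch; the application name, socket source, and 2-second batch interval are illustrative assumptions, not details from the deck.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // StreamingContext wraps a SparkContext and fixes the batch interval.
    // local[2]: a receiver occupies one core, so at least two are needed locally.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2s batch interval (assumption)

    // socketTextStream starts a Receiver on an Executor; received data is
    // split into blocks and stored through the BlockManager (steps 2-4).
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()            // launch receivers and the job generator
    ssc.awaitTermination() // jobs are generated every batch interval (steps 6-8)
  }
}
```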
Kafka Receivers
A reliable Receiver can use either:
• the Kafka High Level API (Spark's out-of-the-box receiver), or
• the Kafka Low Level API (available as a Spark package):
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
The High Level Kafka API has a serious issue with consumer re-balance and cannot be used in production:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
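For reference, the out-of-the-box High Level API receiver is created as sketched below; the ZooKeeper quorum, consumer group, and topic map are placeholder assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object HighLevelReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaReceiverSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // High Level API receiver: offsets and partition assignment are managed
    // by the Kafka consumer group in ZooKeeper, which is exactly where the
    // re-balance problem described above lives.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",       // ZooKeeper quorum (assumption)
      "my-consumer-group",  // consumer group id (assumption)
      Map("mytopic" -> 1))  // topic -> number of receiver threads (assumption)

    stream.map(_._2).print() // print message values; keys are Kafka message keys

    ssc.start()
    ssc.awaitTermination()
  }
}
```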
Low Level Kafka Receiver Challenges
A consumer implemented as a custom Spark Receiver needs to handle the following (see the leader-lookup sketch after this list):
• The consumer needs to know the leader of each partition.
• The consumer must be aware of leader changes.
• The consumer must handle ZooKeeper (ZK) timeouts.
• The consumer needs to manage Kafka offsets itself.
• The consumer needs to handle various failover scenarios.
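As one example of these burdens, a low-level consumer must discover a partition's leader before it can fetch from it. Below is a minimal leader-lookup sketch against the Kafka 0.8 SimpleConsumer API; the seed broker, topic, and partition are passed in as assumptions, and production code would also retry against other seed brokers and re-run this lookup on leader changes.

```scala
import kafka.javaapi.TopicMetadataRequest
import kafka.javaapi.consumer.SimpleConsumer
import scala.collection.JavaConverters._

object LeaderLookupSketch {
  // Ask a seed broker for topic metadata and return the leader host of the
  // given partition, if one is currently elected.
  def findLeader(seedBroker: String, port: Int, topic: String, partition: Int): Option[String] = {
    val consumer = new SimpleConsumer(seedBroker, port, 100000, 64 * 1024, "leaderLookup")
    try {
      val request  = new TopicMetadataRequest(java.util.Collections.singletonList(topic))
      val response = consumer.send(request)
      response.topicsMetadata.asScala
        .flatMap(_.partitionsMetadata.asScala)
        .find(_.partitionId == partition)
        .flatMap(pm => Option(pm.leader).map(_.host)) // leader may be null mid-election
    } finally consumer.close()
  }

  def main(args: Array[String]): Unit =
    // broker address, topic, and partition are assumptions for illustration
    println(findLeader("broker1", 9092, "mytopic", 0))
}
```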
Direct Kafka Stream Approach .. needs to mature
Where to store the offset details?
• Checkpointing is the default mechanism, but it has issues with checkpoint recoverability.
• Storing offsets in an external store is complex, and you must manage your own recovery.
• Saving offsets to an external store is only possible when you can get the offset details from the stream (see the sketch after this list).
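To show how offset details can be pulled from the direct stream (the prerequisite for saving them externally), here is a minimal sketch; the broker list and topic name are assumptions, and the external store itself is left out.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // The direct stream reads from Kafka without a receiver; offsets are
    // tracked by Spark itself rather than by ZooKeeper.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumption
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic")) // topic name is an assumption

    stream.foreachRDD { rdd =>
      // Each RDD from the direct stream exposes the Kafka offset ranges it
      // covers; these are the offset details you would persist externally.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic}/${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```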