SlideShare a Scribd company logo
1 of 18
Download to read offline
1© Cloudera, Inc. All rights reserved. 1© Cloudera, Inc. All rights reserved.
How to build leakproof stream processing pipelines with
Apache Kafka and Apache Spark​
2© Cloudera, Inc. All rights reserved.
● Guru Medasani
○ Data Science Architect at Domino Data Lab
○ Previously senior solutions architect at Cloudera
● Jordan Hambleton - Consulting Manager in San Francisco
○ Nearly 4 years as Resident Senior Architect at large technology firm
○ Previously software engineer building operational data systems on CDH
Introduction
3© Cloudera, Inc. All rights reserved.
● Intro
● Overview of Spark Streaming from Kafka
○ Workflow of the DStream and RDD
○ Spark Streaming Kafka consumer types
● Offset management
○ Motivation
○ Storing offsets in external data stores
● Q & A
Agenda
4© Cloudera, Inc. All rights reserved.
Overview
serverserver
partition1
Kafka Cluster
partitionn
partition2
. . . .
Topic A
142
143
144
. . .
121
122
123
. . .
137
138
139
. . .
server
partition3
129
130
131
. . .
server
executor1
executor2
executor3
Hadoop / YARN Cluster
executorn
. . . .
more parallelism
5© Cloudera, Inc. All rights reserved.
● DStream - sequence of RDDs
● Two approaches in KafkaUtils
○ Receiver based
○ Direct approach (recommended & the method we talk about)
● Spark streaming embeds a kafka client
○ Spark 1.6 uses the 0.9.0-kafka-2.0.0 client (SimpleConsumer)
○ Spark 2.x kafka 0-8-0 uses the 0.9.0-kafka-2.0.2 client (SimpleConsumer)
○ Spark 2.x kafka 0-10-0 uses the 0.10.0-kafka-2.1.0 client (KafkaConsumer)
Overview Spark Streaming from Kafka
6© Cloudera, Inc. All rights reserved.
DStream and RDD Workflow
● Spark Streaming
○ batchIntervalInSeconds
○ stopGracefullyOnShutdown
● Kafka
○ bootstrap.servers
○ auto.offset.reset
○ group.id
○ key.deserializer
○ value.deserializer
7© Cloudera, Inc. All rights reserved.
● spark-streaming-kafka-0-8 / 0.9.0-kafka-2.0.2
● DStream
○ Gets range of each topic/partition - throttle maxRatePerPartition
○ auto.offset.reset (smallest|largest)
○ refresh.leader.backoff.ms - lost leader
● KafkaRDD for set of topic, partition, offsets
○ User can now get offset ranges from RDD
■ topic, partition, fromOffset (inclusive), untilOffset (exclusive)
● KafkaRDDPartition iterator
○ SimpleConsumer initialized and batches of events fetched
○ refresh.leader.backoff.ms - lost leader
Spark Streaming Kafka Consumer # 1
8© Cloudera, Inc. All rights reserved.
● Supported - spark-streaming-kafka-0-10 / 0.10.0-kafka-2.1.0
● Internal Kafka client uses new Java KafkaConsumer
● ConsumerStrategies
○ subscribe, assign, subscribe pattern
● LocationStrategies
○ executor distribution strategy (consistent, fixed, brokers)
● DStream
○ Gets range of each topic/partition - throttle maxRatePerPartition
○ auto.offset.reset (earliest|latest)
○ Be careful - enable.auto.commit (default true)
○ heartbeat & session timeouts
Spark Streaming Kafka Consumer # 2
9© Cloudera, Inc. All rights reserved.
● DStream
○ Consumer poll for group coordination & discovery
○ Identify new partitions, from offsets
○ Pause consumer
○ seekToEnd to get untilOffsets
● KafkaRDD
○ Fixed [enable.auto.commit = false, auto.offset.reset = none, spark-executor-${group.id}]
○ Attempts to assign offset range consistently for optimal consumer caching
● KafkaRDDPartition iterator
○ Initialize/lookup CachedKafkaConsumer with executor group
■ consumer assigned per single topic, partition with internal buffer
■ on cache miss, seek and poll
Spark Streaming Kafka Consumer # 2
10© Cloudera, Inc. All rights reserved.
Keeping Track
11© Cloudera, Inc. All rights reserved.
● Planned Maintenance
○ Upgrades
○ Bug-fixes
● Unplanned Maintenance
○ Failures
● Application Processing Errors
○ Wrong calculations
○ Updated algorithm over known streaming data
● More control over messages
○ Just earliest and latest are insufficient
Motivation for Tracking Offsets
12© Cloudera, Inc. All rights reserved.
● Cast RDD to HasOffsetRanges
● DStream’s first transformation
Obtaining Offsets
13© Cloudera, Inc. All rights reserved.
Offset management Workflow
● Limited options prior to
spark-streaming-kafka-0-10
● Store offsets in external datastore
○ Checkpoints (Not recommended)
○ ZooKeeper
○ Kafka
○ HBase
● Do not have to manage offsets
14© Cloudera, Inc. All rights reserved.
● ZooKeeper
○ znode - /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
○ Only retains latest committed offsets
○ Can easily be managed by external tools
○ Leverage existing monitoring for Lag, no historical insight
Offset Management in ZooKeeper
15© Cloudera, Inc. All rights reserved.
● Kafka
○ CanCommitOffsets provides async commit to internal kafka topic
○ More difficult to manage internal kafka topic manually
○ Leverage existing monitoring for Lag, no historical insight
Offset Management in Kafka
16© Cloudera, Inc. All rights reserved.
● HBase
○ Unique entry per consumer group, batch
● Fine-grained monitoring over time
● HBase shell for easy management
● Get latest entry -
○ scan 'prod_stream',
○ STARTROW =>'device_alerts:csi_group',
○ REVERSED =>TRUE,
○ LIMIT =>1
Offset Management in HBase
schema:
row: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
column family: offsets
qualifier: <PARTITION_ID>
value: <OFFSET_ID>
17© Cloudera, Inc. All rights reserved.
● Spark Streaming job started for the first time
● No changes in Kafka partitions
● Increase in number of Kafka partitions
http://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Starting Streaming Jobs with Known Offsets
18© Cloudera, Inc. All rights reserved.
Questions?
Thank you
Jordan Hambleton
Guru Medasani

More Related Content

What's hot

Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Community
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
DataWorks Summit
 

What's hot (20)

Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategy
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
 
AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...
AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...
AdGear Use Case with Scylla - 1M Queries Per Second with Single-Digit Millise...
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
[Hadoop Meetup] Yarn at Microsoft - The challenges of scale
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native way
 
How to Build a Scylla Database Cluster that Fits Your Needs
How to Build a Scylla Database Cluster that Fits Your NeedsHow to Build a Scylla Database Cluster that Fits Your Needs
How to Build a Scylla Database Cluster that Fits Your Needs
 
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data AnalyticsSupersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics
 
Pythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra ClusterPythian: My First 100 days with a Cassandra Cluster
Pythian: My First 100 days with a Cassandra Cluster
 
Cassandra Metrics
Cassandra MetricsCassandra Metrics
Cassandra Metrics
 

Similar to How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​ | Strata San Jose 2018

Similar to How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​ | Strata San Jose 2018 (20)

Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
What's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon ValleyWhat's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon Valley
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​ | Strata San Jose 2018

  • 1. 1© Cloudera, Inc. All rights reserved. 1© Cloudera, Inc. All rights reserved. How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​
  • 2. 2© Cloudera, Inc. All rights reserved. ● Guru Medasani ○ Data Science Architect at Domino Data Lab ○ Previously senior solutions architect at Cloudera ● Jordan Hambleton - Consulting Manager in San Francisco ○ Nearly 4 years as Resident Senior Architect at large technology firm ○ Previously software engineer building operational data systems on CDH Introduction
  • 3. 3© Cloudera, Inc. All rights reserved. ● Intro ● Overview of Spark Streaming from Kafka ○ Workflow of the DStream and RDD ○ Spark Streaming Kafka consumer types ● Offset management ○ Motivation ○ Storing offsets in external data stores ● Q & A Agenda
  • 4. 4© Cloudera, Inc. All rights reserved. Overview serverserver partition1 Kafka Cluster partitionn partition2 . . . . Topic A 142 143 144 . . . 121 122 123 . . . 137 138 139 . . . server partition3 129 130 131 . . . server executor1 executor2 executor3 Hadoop / YARN Cluster executorn . . . . more parallelism
  • 5. 5© Cloudera, Inc. All rights reserved. ● DStream - sequence of RDDs ● Two approaches in KafkaUtils ○ Receiver based ○ Direct approach (recommended & the method we talk about) ● Spark streaming embeds a kafka client ○ Spark 1.6 uses the 0.9.0-kafka-2.0.0 client (SimpleConsumer) ○ Spark 2.x kafka 0-8-0 uses the 0.9.0-kafka-2.0.2 client (SimpleConsumer) ○ Spark 2.x kafka 0-10-0 uses the 0.10.0-kafka-2.1.0 client (KafkaConsumer) Overview Spark Streaming from Kafka
  • 6. 6© Cloudera, Inc. All rights reserved. DStream and RDD Workflow ● Spark Streaming ○ batchIntervalInSeconds ○ stopGracefullyOnShutdown ● Kafka ○ bootstrap.servers ○ auto.offset.reset ○ group.id ○ key.deserializer ○ value.deserializer
  • 7. 7© Cloudera, Inc. All rights reserved. ● spark-streaming-kafka-0-8 / 0.9.0-kafka-2.0.2 ● DStream ○ Gets range of each topic/partition - throttle maxRatePerPartition ○ auto.offset.reset (smallest|largest) ○ refresh.leader.backoff.ms - lost leader ● KafkaRDD for set of topic, partition, offsets ○ User can now get offset ranges from RDD ■ topic, partition, fromOffset (inclusive), untilOffset (exclusive) ● KafkaRDDPartition iterator ○ SimpleConsumer initialized and batches of events fetched ○ refresh.leader.backoff.ms - lost leader Spark Streaming Kafka Consumer # 1
  • 8. 8© Cloudera, Inc. All rights reserved. ● Supported - spark-streaming-kafka-0-10 / 0.10.0-kafka-2.1.0 ● Internal Kafka client uses new Java KafkaConsumer ● ConsumerStrategies ○ subscribe, assign, subscribe pattern ● LocationStrategies ○ executor distribution strategy (consistent, fixed, brokers) ● DStream ○ Gets range of each topic/partition - throttle maxRatePerPartition ○ auto.offset.reset (earliest|latest) ○ Be careful - enable.auto.commit (default true) ○ heartbeat & session timeouts Spark Streaming Kafka Consumer # 2
  • 9. 9© Cloudera, Inc. All rights reserved. ● DStream ○ Consumer poll for group coordination & discovery ○ Identify new partitions, from offsets ○ Pause consumer ○ seekToEnd to get untilOffsets ● KafkaRDD ○ Fixed [enable.auto.commit = false, auto.offset.reset = none, spark-executor-${group.id}] ○ Attempts to assign offset range consistently for optimal consumer caching ● KafkaRDDPartition iterator ○ Initialize/lookup CachedKafkaConsumer with executor group ■ consumer assigned per single topic, partition with internal buffer ■ on cache miss, seek and poll Spark Streaming Kafka Consumer # 2
  • 10. 10© Cloudera, Inc. All rights reserved. Keeping Track
  • 11. 11© Cloudera, Inc. All rights reserved. ● Planned Maintenance ○ Upgrades ○ Bug-fixes ● Unplanned Maintenance ○ Failures ● Application Processing Errors ○ Wrong calculations ○ Updated algorithm over known streaming data ● More control over messages ○ Just earliest and latest are insufficient Motivation for Tracking Offsets
  • 12. 12© Cloudera, Inc. All rights reserved. ● Cast RDD to HasOffsetRanges ● DStream’s first transformation Obtaining Offsets
  • 13. 13© Cloudera, Inc. All rights reserved. Offset management Workflow ● Limited options prior to spark-streaming-kafka-0-10 ● Store offsets in external datastore ○ Checkpoints (Not recommended) ○ ZooKeeper ○ Kafka ○ HBase ● Do not have to manage offsets
  • 14. 14© Cloudera, Inc. All rights reserved. ● ZooKeeper ○ znode - /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset) ○ Only retains latest committed offsets ○ Can easily be managed by external tools ○ Leverage existing monitoring for Lag, no historical insight Offset Management in ZooKeeper
  • 15. 15© Cloudera, Inc. All rights reserved. ● Kafka ○ CanCommitOffsets provides async commit to internal kafka topic ○ More difficult to manage internal kafka topic manually ○ Leverage existing monitoring for Lag, no historical insight Offset Management in Kafka
  • 16. 16© Cloudera, Inc. All rights reserved. ● HBase ○ Unique entry per consumer group, batch ● Fine-grained monitoring over time ● HBase shell for easy management ● Get latest entry - ○ scan 'prod_stream', ○ STARTROW =>'device_alerts:csi_group', ○ REVERSED =>TRUE, ○ LIMIT =>1 Offset Management in HBase schema: row: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS> column family: offsets qualifier: <PARTITION_ID> value: <OFFSET_ID>
  • 17. 17© Cloudera, Inc. All rights reserved. ● Spark Streaming job started for the first time ● No changes in Kafka partitions ● Increase in number of Kafka partitions http://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/ Starting Streaming Jobs with Known Offsets
  • 18. 18© Cloudera, Inc. All rights reserved. Questions? Thank you Jordan Hambleton Guru Medasani