SlideShare a Scribd company logo
1© Cloudera, Inc. All rights reserved. 1© Cloudera, Inc. All rights reserved.
How to build leakproof stream processing pipelines with
Apache Kafka and Apache Spark​
2© Cloudera, Inc. All rights reserved.
● Guru Medasani
○ Data Science Architect at Domino Data Lab
○ Previously senior solutions architect at Cloudera
● Jordan Hambleton - Consulting Manager in San Francisco
○ Nearly 4 years as Resident Senior Architect at large technology firm
○ Previously software engineer building operational data systems on CDH
Introduction
3© Cloudera, Inc. All rights reserved.
● Intro
● Overview of Spark Streaming from Kafka
○ Workflow of the DStream and RDD
○ Spark Streaming Kafka consumer types
● Offset management
○ Motivation
○ Storing offsets in external data stores
● Q & A
Agenda
4© Cloudera, Inc. All rights reserved.
Overview
serverserver
partition1
Kafka Cluster
partitionn
partition2
. . . .
Topic A
142
143
144
. . .
121
122
123
. . .
137
138
139
. . .
server
partition3
129
130
131
. . .
server
executor1
executor2
executor3
Hadoop / YARN Cluster
executorn
. . . .
more parallelism
5© Cloudera, Inc. All rights reserved.
● DStream - sequence of RDDs
● Two approaches in KafkaUtils
○ Receiver based
○ Direct approach (recommended & the method we talk about)
● Spark streaming embeds a kafka client
○ Spark 1.6 uses the 0.9.0-kafka-2.0.0 client (SimpleConsumer)
○ Spark 2.x kafka 0-8-0 uses the 0.9.0-kafka-2.0.2 client (SimpleConsumer)
○ Spark 2.x kafka 0-10-0 uses the 0.10.0-kafka-2.1.0 client (KafkaConsumer)
Overview Spark Streaming from Kafka
6© Cloudera, Inc. All rights reserved.
DStream and RDD Workflow
● Spark Streaming
○ batchIntervalInSeconds
○ stopGracefullyOnShutdown
● Kafka
○ bootstrap.servers
○ auto.offset.reset
○ group.id
○ key.deserializer
○ value.deserializer
7© Cloudera, Inc. All rights reserved.
● spark-streaming-kafka-0-8 / 0.9.0-kafka-2.0.2
● DStream
○ Gets range of each topic/partition - throttle maxRatePerPartition
○ auto.offset.reset (smallest|largest)
○ refresh.leader.backoff.ms - lost leader
● KafkaRDD for set of topic, partition, offsets
○ User can now get offset ranges from RDD
■ topic, partition, fromOffset (inclusive), untilOffset (exclusive)
● KafkaRDDPartition iterator
○ SimpleConsumer initialized and batches of events fetched
○ refresh.leader.backoff.ms - lost leader
Spark Streaming Kafka Consumer # 1
8© Cloudera, Inc. All rights reserved.
● Supported - spark-streaming-kafka-0-10 / 0.10.0-kafka-2.1.0
● Internal Kafka client uses new Java KafkaConsumer
● ConsumerStrategies
○ subscribe, assign, subscribe pattern
● LocationStrategies
○ executor distribution strategy (consistent, fixed, brokers)
● DStream
○ Gets range of each topic/partition - throttle maxRatePerPartition
○ auto.offset.reset (earliest|latest)
○ Be careful - enable.auto.commit (default true)
○ heartbeat & session timeouts
Spark Streaming Kafka Consumer # 2
9© Cloudera, Inc. All rights reserved.
● DStream
○ Consumer poll for group coordination & discovery
○ Identify new partitions, from offsets
○ Pause consumer
○ seekToEnd to get untilOffsets
● KafkaRDD
○ Fixed [enable.auto.commit = false, auto.offset.reset = none, spark-executor-${group.id}]
○ Attempts to assign offset range consistently for optimal consumer caching
● KafkaRDDPartition iterator
○ Initialize/lookup CachedKafkaConsumer with executor group
■ consumer assigned per single topic, partition with internal buffer
■ on cache miss, seek and poll
Spark Streaming Kafka Consumer # 2
10© Cloudera, Inc. All rights reserved.
Keeping Track
11© Cloudera, Inc. All rights reserved.
● Planned Maintenance
○ Upgrades
○ Bug-fixes
● Unplanned Maintenance
○ Failures
● Application Processing Errors
○ Wrong calculations
○ Updated algorithm over known streaming data
● More control over messages
○ Just earliest and latest are insufficient
Motivation for Tracking Offsets
12© Cloudera, Inc. All rights reserved.
● Cast RDD to HasOffsetRanges
● DStream’s first transformation
Obtaining Offsets
13© Cloudera, Inc. All rights reserved.
Offset management Workflow
● Limited options prior to
spark-streaming-kafka-0-10
● Store offsets in external datastore
○ Checkpoints (Not recommended)
○ ZooKeeper
○ Kafka
○ HBase
● Do not have to manage offsets
14© Cloudera, Inc. All rights reserved.
● ZooKeeper
○ znode - /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
○ Only retains latest committed offsets
○ Can easily be managed by external tools
○ Leverage existing monitoring for Lag, no historical insight
Offset Management in ZooKeeper
15© Cloudera, Inc. All rights reserved.
● Kafka
○ CanCommitOffsets provides async commit to internal kafka topic
○ More difficult to manage internal kafka topic manually
○ Leverage existing monitoring for Lag, no historical insight
Offset Management in Kafka
16© Cloudera, Inc. All rights reserved.
● HBase
○ Unique entry per consumer group, batch
● Fine-grained monitoring over time
● HBase shell for easy management
● Get latest entry -
○ scan 'prod_stream',
○ STARTROW =>'device_alerts:csi_group',
○ REVERSED =>TRUE,
○ LIMIT =>1
Offset Management in HBase
schema:
row: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
column family: offsets
qualifier: <PARTITION_ID>
value: <OFFSET_ID>
17© Cloudera, Inc. All rights reserved.
● Spark Streaming job started for the first time
● No changes in Kafka partitions
● Increase in number of Kafka partitions
http://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Starting Streaming Jobs with Known Offsets
18© Cloudera, Inc. All rights reserved.
Questions?
Thank you
Jordan Hambleton
Guru Medasani

More Related Content

What's hot

Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
Cloudera, Inc.
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Cloudera, Inc.
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Cloudera, Inc.
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
Cloudera, Inc.
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
Cloudera, Inc.
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
نهاد مبارك
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Shravan (Sean) Pabba
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Spark Summit
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
Grant Henke
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architectures
Nacho García Fernández
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
DataWorks Summit
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data Meetup
Andrei Savu
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Rakuten Group, Inc.
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 

What's hot (20)

Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architectures
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data Meetup
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 

Similar to How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
LINE Corporation
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
Samuel Kerrien
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
Wei-Chiu Chuang
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
Databricks
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
hadooparchbook
 
What's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon ValleyWhat's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon Valley
Ceph Community
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
Ceph Community
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
Gwen (Chen) Shapira
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
DataWorks Summit
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
Felicia Haggarty
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
Cloudera, Inc.
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Big Data Spain
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
DataWorks Summit
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Terraforming your Infrastructure on GCP
Terraforming your Infrastructure on GCPTerraforming your Infrastructure on GCP
Terraforming your Infrastructure on GCP
Samuel Chow
 

Similar to How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark (20)

Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
What's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon ValleyWhat's New with Ceph - Ceph Day Silicon Valley
What's New with Ceph - Ceph Day Silicon Valley
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Terraforming your Infrastructure on GCP
Terraforming your Infrastructure on GCPTerraforming your Infrastructure on GCP
Terraforming your Infrastructure on GCP
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Wired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptxWired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptx
SimonedeGijt
 
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to KnowThe Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
onemonitarsoftware
 
NYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction InnovationNYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction Innovation
NYGGS Construction ERP Software
 
Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.
shivamt017
 
Introduction to Cloud computing for Internet of Things
Introduction to Cloud computing for Internet of ThingsIntroduction to Cloud computing for Internet of Things
Introduction to Cloud computing for Internet of Things
NachuSubramanian1
 
Leading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptxLeading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptx
taskroupseo
 
Install Ruby on Rails Like a Pro: Effortless Guide
Install Ruby on Rails Like a Pro: Effortless GuideInstall Ruby on Rails Like a Pro: Effortless Guide
Install Ruby on Rails Like a Pro: Effortless Guide
rorbitssoftware
 
Artificial intelligence in customer services or chatbots
Artificial intelligence  in customer services or chatbotsArtificial intelligence  in customer services or chatbots
Artificial intelligence in customer services or chatbots
kayash1656
 
Odoo E-commerce website development guides
Odoo E-commerce website development guidesOdoo E-commerce website development guides
Odoo E-commerce website development guides
jhkdigitalmarketing
 
IoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdf
IoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdfIoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdf
IoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdf
mohitd6
 
ENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentationENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentation
sofiafernandezon
 
To Avoid Mistakes When Using Online Attendance Sheets
To Avoid Mistakes When Using Online Attendance SheetsTo Avoid Mistakes When Using Online Attendance Sheets
To Avoid Mistakes When Using Online Attendance Sheets
Task Tracker
 
UMiami degree offer diploma Transcript
UMiami degree offer diploma TranscriptUMiami degree offer diploma Transcript
UMiami degree offer diploma Transcript
attueb
 
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
singhlata50dh
 
當測試開始左移
當測試開始左移當測試開始左移
當測試開始左移
Jersey (CHE-PING) Su
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
SSTech System
 
Attendance Tracking From Paper To Digital
Attendance Tracking From Paper To DigitalAttendance Tracking From Paper To Digital
Attendance Tracking From Paper To Digital
Task Tracker
 
HIRE A HACKER FOR CHEATING HUSBAND/WIFE)
HIRE A HACKER FOR CHEATING HUSBAND/WIFE)HIRE A HACKER FOR CHEATING HUSBAND/WIFE)
HIRE A HACKER FOR CHEATING HUSBAND/WIFE)
josephinedrea942
 
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTIONBITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
ssuser2b426d1
 
Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...
Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...
Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...
ashiklo9823
 

Recently uploaded (20)

Wired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptxWired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptx
 
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to KnowThe Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
 
NYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction InnovationNYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction Innovation
 
Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.
 
Introduction to Cloud computing for Internet of Things
Introduction to Cloud computing for Internet of ThingsIntroduction to Cloud computing for Internet of Things
Introduction to Cloud computing for Internet of Things
 
Leading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptxLeading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptx
 
Install Ruby on Rails Like a Pro: Effortless Guide
Install Ruby on Rails Like a Pro: Effortless GuideInstall Ruby on Rails Like a Pro: Effortless Guide
Install Ruby on Rails Like a Pro: Effortless Guide
 
Artificial intelligence in customer services or chatbots
Artificial intelligence  in customer services or chatbotsArtificial intelligence  in customer services or chatbots
Artificial intelligence in customer services or chatbots
 
Odoo E-commerce website development guides
Odoo E-commerce website development guidesOdoo E-commerce website development guides
Odoo E-commerce website development guides
 
IoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdf
IoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdfIoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdf
IoT In Manufacturing_ Use Cases, Benefits, and Challenges.pdf
 
ENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentationENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentation
 
To Avoid Mistakes When Using Online Attendance Sheets
To Avoid Mistakes When Using Online Attendance SheetsTo Avoid Mistakes When Using Online Attendance Sheets
To Avoid Mistakes When Using Online Attendance Sheets
 
UMiami degree offer diploma Transcript
UMiami degree offer diploma TranscriptUMiami degree offer diploma Transcript
UMiami degree offer diploma Transcript
 
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
High Girls Call Chennai 000XX00000 Provide Best And Top Girl Service And No1 ...
 
當測試開始左移
當測試開始左移當測試開始左移
當測試開始左移
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
 
Attendance Tracking From Paper To Digital
Attendance Tracking From Paper To DigitalAttendance Tracking From Paper To Digital
Attendance Tracking From Paper To Digital
 
HIRE A HACKER FOR CHEATING HUSBAND/WIFE)
HIRE A HACKER FOR CHEATING HUSBAND/WIFE)HIRE A HACKER FOR CHEATING HUSBAND/WIFE)
HIRE A HACKER FOR CHEATING HUSBAND/WIFE)
 
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTIONBITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
BITCOIN HEIST RANSOMEWARE ATTACK PREDICTION
 
Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...
Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...
Vip Girls Call ServiCe Hyderabad 0000000000 Pooja Best High Class Hyderabad A...
 

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

  • 1. 1© Cloudera, Inc. All rights reserved. 1© Cloudera, Inc. All rights reserved. How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​
  • 2. 2© Cloudera, Inc. All rights reserved. ● Guru Medasani ○ Data Science Architect at Domino Data Lab ○ Previously senior solutions architect at Cloudera ● Jordan Hambleton - Consulting Manager in San Francisco ○ Nearly 4 years as Resident Senior Architect at large technology firm ○ Previously software engineer building operational data systems on CDH Introduction
  • 3. 3© Cloudera, Inc. All rights reserved. ● Intro ● Overview of Spark Streaming from Kafka ○ Workflow of the DStream and RDD ○ Spark Streaming Kafka consumer types ● Offset management ○ Motivation ○ Storing offsets in external data stores ● Q & A Agenda
  • 4. 4© Cloudera, Inc. All rights reserved. Overview serverserver partition1 Kafka Cluster partitionn partition2 . . . . Topic A 142 143 144 . . . 121 122 123 . . . 137 138 139 . . . server partition3 129 130 131 . . . server executor1 executor2 executor3 Hadoop / YARN Cluster executorn . . . . more parallelism
  • 5. 5© Cloudera, Inc. All rights reserved. ● DStream - sequence of RDDs ● Two approaches in KafkaUtils ○ Receiver based ○ Direct approach (recommended & the method we talk about) ● Spark streaming embeds a kafka client ○ Spark 1.6 uses the 0.9.0-kafka-2.0.0 client (SimpleConsumer) ○ Spark 2.x kafka 0-8-0 uses the 0.9.0-kafka-2.0.2 client (SimpleConsumer) ○ Spark 2.x kafka 0-10-0 uses the 0.10.0-kafka-2.1.0 client (KafkaConsumer) Overview Spark Streaming from Kafka
  • 6. 6© Cloudera, Inc. All rights reserved. DStream and RDD Workflow ● Spark Streaming ○ batchIntervalInSeconds ○ stopGracefullyOnShutdown ● Kafka ○ bootstrap.servers ○ auto.offset.reset ○ group.id ○ key.deserializer ○ value.deserializer
  • 7. 7© Cloudera, Inc. All rights reserved. ● spark-streaming-kafka-0-8 / 0.9.0-kafka-2.0.2 ● DStream ○ Gets range of each topic/partition - throttle maxRatePerPartition ○ auto.offset.reset (smallest|largest) ○ refresh.leader.backoff.ms - lost leader ● KafkaRDD for set of topic, partition, offsets ○ User can now get offset ranges from RDD ■ topic, partition, fromOffset (inclusive), untilOffset (exclusive) ● KafkaRDDPartition iterator ○ SimpleConsumer initialized and batches of events fetched ○ refresh.leader.backoff.ms - lost leader Spark Streaming Kafka Consumer # 1
  • 8. 8© Cloudera, Inc. All rights reserved. ● Supported - spark-streaming-kafka-0-10 / 0.10.0-kafka-2.1.0 ● Internal Kafka client uses new Java KafkaConsumer ● ConsumerStrategies ○ subscribe, assign, subscribe pattern ● LocationStrategies ○ executor distribution strategy (consistent, fixed, brokers) ● DStream ○ Gets range of each topic/partition - throttle maxRatePerPartition ○ auto.offset.reset (earliest|latest) ○ Be careful - enable.auto.commit (default true) ○ heartbeat & session timeouts Spark Streaming Kafka Consumer # 2
  • 9. 9© Cloudera, Inc. All rights reserved. ● DStream ○ Consumer poll for group coordination & discovery ○ Identify new partitions, from offsets ○ Pause consumer ○ seekToEnd to get untilOffsets ● KafkaRDD ○ Fixed [enable.auto.commit = false, auto.offset.reset = none, spark-executor-${group.id}] ○ Attempts to assign offset range consistently for optimal consumer caching ● KafkaRDDPartition iterator ○ Initialize/lookup CachedKafkaConsumer with executor group ■ consumer assigned per single topic, partition with internal buffer ■ on cache miss, seek and poll Spark Streaming Kafka Consumer # 2
  • 10. 10© Cloudera, Inc. All rights reserved. Keeping Track
  • 11. 11© Cloudera, Inc. All rights reserved. ● Planned Maintenance ○ Upgrades ○ Bug-fixes ● Unplanned Maintenance ○ Failures ● Application Processing Errors ○ Wrong calculations ○ Updated algorithm over known streaming data ● More control over messages ○ Just earliest and latest are insufficient Motivation for Tracking Offsets
  • 12. 12© Cloudera, Inc. All rights reserved. ● Cast RDD to HasOffsetRanges ● DStream’s first transformation Obtaining Offsets
  • 13. 13© Cloudera, Inc. All rights reserved. Offset management Workflow ● Limited options prior to spark-streaming-kafka-0-10 ● Store offsets in external datastore ○ Checkpoints (Not recommended) ○ ZooKeeper ○ Kafka ○ HBase ● Do not have to manage offsets
  • 14. 14© Cloudera, Inc. All rights reserved. ● ZooKeeper ○ znode - /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset) ○ Only retains latest committed offsets ○ Can easily be managed by external tools ○ Leverage existing monitoring for Lag, no historical insight Offset Management in ZooKeeper
  • 15. 15© Cloudera, Inc. All rights reserved. ● Kafka ○ CanCommitOffsets provides async commit to internal kafka topic ○ More difficult to manage internal kafka topic manually ○ Leverage existing monitoring for Lag, no historical insight Offset Management in Kafka
  • 16. 16© Cloudera, Inc. All rights reserved. ● HBase ○ Unique entry per consumer group, batch ● Fine-grained monitoring over time ● HBase shell for easy management ● Get latest entry - ○ scan 'prod_stream', ○ STARTROW =>'device_alerts:csi_group', ○ REVERSED =>TRUE, ○ LIMIT =>1 Offset Management in HBase schema: row: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS> column family: offsets qualifier: <PARTITION_ID> value: <OFFSET_ID>
  • 17. 17© Cloudera, Inc. All rights reserved. ● Spark Streaming job started for the first time ● No changes in Kafka partitions ● Increase in number of Kafka partitions http://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/ Starting Streaming Jobs with Known Offsets
  • 18. 18© Cloudera, Inc. All rights reserved. Questions? Thank you Jordan Hambleton Guru Medasani