SlideShare a Scribd company logo
1 of 34
Apache Kafka + Apache Spark =
♡
Let's check how Kafka integrates with Spark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
2
Apache Spark
3
The distributed data processing ecosystem
4
SQL Structured Streaming Streaming GraphX MLib
Python Scala Java RSQL
Kubernetes Hadoop YARN Mesos
AWS
DataProc HDInsightEMR
Databricks
GCP Azure
Databricks
Maintainers
5
+
Apache Spark
Structured Streaming
6
Streaming query execution - micro-batch
7
load state
for t1 query
load offsets
to process &
write them
for t1 query
process
data
confirm
processed
offsets &
next
watermark
commit state
t2
partition-based
checkpoint location
state store offset log commit log
Streaming query execution - continuous (experimental)
epoch
coordinator
persist offsets
checkpoint location
offset log commit log
order
offsets
logging
Streaming query execution - continuous (experimental)
process datatask 1
process datatask 2
process datatask 3
epoch
coordinator
persist offsets
checkpoint location
offset log commit log
t
order
offsets
logging report processed
offsets
long-running, per partition
Streaming query execution - continuous (experimental)
process datatask 1
process datatask 2
process datatask 3
epoch
coordinator
persist offsets
checkpoint location
offset log commit log
t
order
offsets
logging report processed
offsets
if all tasks
processed
offsets within
epoc
long-running, per partition
Popular data transformations
11
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K):
KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
Popular data transformations
12
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K):
KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]):
Dataset[U]
def mapGroups[U : Encoder](f: (K, Iterator[V]) => U):
Dataset[U]
def flatMapGroups[U : Encoder](f: (K, Iterator[V]) =>
TraversableOnce[U]): Dataset[U]
def join(right: Dataset[_], joinExprs: Column, joinType: String)
def reduce(func: (T, T) => T): T
Structured Streaming pipeline example
13
val loadQuery = sparkSession.readStream.format("kafka")
.option("kafka.bootstrap.servers", "210.0.0.20:9092")
.option("client.id", s"simple_kafka_spark_app")
.option("subscribePattern", "ss_starting_offset.*")
.option("startingOffsets", "earliest")
.load()
val processingLogic = loadQuery.selectExpr("CAST(value AS STRING)").as[String]
.filter(letter => letter.nonEmpty)
.map(letter => letter.size)
.select($"value".as("letter_length"))
.agg(Map("letter_length" -> "sum"))
val writeQuery = processingLogic.writeStream.outputMode("update")
.option("checkpointLocation", "/tmp/kafka-sample")
.format("console")
writeQuery.start().awaitTermination()
data source
data
processing
logic
data sink
Apache Kafka data
source
14
Kafka data source configuration
15
⇢ Where?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
Kafka data source configuration
16
⇢ Where?
⇢ What?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
startingOffsets, endingOffsets - topic/partition or global
Kafka data source configuration
17
⇢ Where?
⇢ What?
⇢ How?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
startingOffsets, endingOffsets - topic/partition or global
data loss failure (streaming), max reading rate control, Spark partitions number
Kafka input schema
18
key
[binary]
value
[binary]
topic
[string]
partition
[int]
offset
[long]
timestamp
[long]
timestampType
[int]
Kafka input schema
19
key
[binary]
value
[binary]
topic
[string]
partition
[int]
offset
[long]
timestamp
[long]
timestampType
[int]
val query = dataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS
STRING)")
.groupByKey(row => row.getAs[String]("key"))
From the fetch to the reading - micro-batch
20
data loss
checks,
skewness
optimization
initialize
offsets to
process
create data
consumer if
needed
checkpoint
processed
offsets
poll
data
Apache Kafka broker
next offsets to process
max offsets in partition
(no maxOffsetsPerTrigger)
distribute
offsets to
executors
as long as
the read offset < max offset for topic/partition
data locality
if new data
available
data loss checks
if no
fatal failure
Data loss protection - conditions
21
deleted partitions
Data loss protection - conditions
22
deleted partitions expired records
(metadata consumer)
Data loss protection - conditions
23
deleted partitions expired records
(metadata consumer)
new partitions
with missing
offsets
Data loss protection - conditions
24
deleted partitions expired records
(metadata consumer)
new partitions
with missing
offsets
expired records
(data consumer)
Apache Kafka data sink
25
Delivery
semantics
26
at-least
once
At-least once - why?
27
protected def checkForErrors(): Unit = {
if (failedWrite != null) {
throw failedWrite
}
}
KafkaRowWriter
At-least once - why?
28
private val callback = new Callback() {
override def
onCompletion(recordMetadata:
RecordMetadata, e: Exception): Unit = {
if (failedWrite == null && e != null) {
failedWrite = e
}
}
}
KafkaRowWriter
At-least once - why?
29
def write(row: InternalRow): Unit = {
checkForErrors()
sendRow(row, producer)
}
KafkaStreamDataWriter
Output
generation
30
1 or
multiple
topics
1 or multiple outputs - how?
31
private def createProjection = {
val topicExpression = topic.map(Literal(_)).orElse {
inputSchema.find(_.name == TOPIC_ATTRIBUTE_NAME)
}.getOrElse {
throw new IllegalStateException(s"topic option required when no " +
s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
}
KafkaRowWriter
Summary
32
● micro-batch oriented
● low latency in progress effort
● fault-tolerance with checkpoint mechanism
● batch and streaming supported
● alternative way to other streaming approaches
Resources
● Kafka on Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-
integration.html
● Structured streaming support for consuming from Kafka:
https://issues.apache.org/jira/browse/SPARK-15406
● Github data generator: https://github.com/bartosz25/data-generator
● Kafka + Spark pipeline example: https://github.com/bartosz25/sessionization-demo
● Kafka + Spark series: https://www.waitingforcode.com/tags/kafka-spark-structured-streaming
33
Thank you !
@waitingforcode / waitingforcode.com
34

More Related Content

What's hot

Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Alexey Lesovsky
 
RestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message QueueRestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message QueueGleicon Moraes
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiInfluxData
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Lucidworks
 
Troubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationTroubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationAlexey Lesovsky
 
Troubleshooting PostgreSQL with pgCenter
Troubleshooting PostgreSQL with pgCenterTroubleshooting PostgreSQL with pgCenter
Troubleshooting PostgreSQL with pgCenterAlexey Lesovsky
 
Centralized + Unified Logging
Centralized + Unified LoggingCentralized + Unified Logging
Centralized + Unified LoggingGabor Kozma
 
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).Alexey Lesovsky
 
collectd & PostgreSQL
collectd & PostgreSQLcollectd & PostgreSQL
collectd & PostgreSQLMark Wong
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Alexey Lesovsky
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in CloudSteven Wu
 
PostgreSQL Procedural Languages: Tips, Tricks and Gotchas
PostgreSQL Procedural Languages: Tips, Tricks and GotchasPostgreSQL Procedural Languages: Tips, Tricks and Gotchas
PostgreSQL Procedural Languages: Tips, Tricks and GotchasJim Mlodgenski
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxData
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsCommand Prompt., Inc
 
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companionPGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companionPGConf APAC
 
Managing PostgreSQL with PgCenter
Managing PostgreSQL with PgCenterManaging PostgreSQL with PgCenter
Managing PostgreSQL with PgCenterAlexey Lesovsky
 

What's hot (20)

Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 
RestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message QueueRestMQ - HTTP/Redis based Message Queue
RestMQ - HTTP/Redis based Message Queue
 
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry PiMonitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
Monitoring Your ISP Using InfluxDB Cloud and Raspberry Pi
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
 
Troubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationTroubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming Replication
 
Troubleshooting PostgreSQL with pgCenter
Troubleshooting PostgreSQL with pgCenterTroubleshooting PostgreSQL with pgCenter
Troubleshooting PostgreSQL with pgCenter
 
Pgcenter overview
Pgcenter overviewPgcenter overview
Pgcenter overview
 
Centralized + Unified Logging
Centralized + Unified LoggingCentralized + Unified Logging
Centralized + Unified Logging
 
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
 
Full Text Search in PostgreSQL
Full Text Search in PostgreSQLFull Text Search in PostgreSQL
Full Text Search in PostgreSQL
 
collectd & PostgreSQL
collectd & PostgreSQLcollectd & PostgreSQL
collectd & PostgreSQL
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 
Scaling Flink in Cloud
Scaling Flink in CloudScaling Flink in Cloud
Scaling Flink in Cloud
 
PostgreSQL Procedural Languages: Tips, Tricks and Gotchas
PostgreSQL Procedural Languages: Tips, Tricks and GotchasPostgreSQL Procedural Languages: Tips, Tricks and Gotchas
PostgreSQL Procedural Languages: Tips, Tricks and Gotchas
 
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
 
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companionPGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
PGConf APAC 2018 - Patroni: Kubernetes-native PostgreSQL companion
 
Managing PostgreSQL with PgCenter
Managing PostgreSQL with PgCenterManaging PostgreSQL with PgCenter
Managing PostgreSQL with PgCenter
 
PostgreSQL and PL/Java
PostgreSQL and PL/JavaPostgreSQL and PL/Java
PostgreSQL and PL/Java
 
PostgreSQL Replication Tutorial
PostgreSQL Replication TutorialPostgreSQL Replication Tutorial
PostgreSQL Replication Tutorial
 

Similar to Apache Spark Structured Streaming + Apache Kafka = ♡

Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Microsoft Tech Community
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sigmoid
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
From Zero to Stream Processing
From Zero to Stream ProcessingFrom Zero to Stream Processing
From Zero to Stream ProcessingEventador
 
Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offsetDaeMyung Kang
 

Similar to Apache Spark Structured Streaming + Apache Kafka = ♡ (20)

Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
From Zero to Stream Processing
From Zero to Stream ProcessingFrom Zero to Stream Processing
From Zero to Stream Processing
 
Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
 

Recently uploaded

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 

Recently uploaded (20)

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 

Apache Spark Structured Streaming + Apache Kafka = ♡

  • 1. Apache Kafka + Apache Spark = ♡ Let's check how Kafka integrates with Spark Bartosz Konieczny @waitingforcode
  • 2. First things first Bartosz Konieczny #dataEngineer #ApacheSparkEnthusiast #AWSuser #waitingforcode.com #becomedataengineer.com #@waitingforcode #github.com/bartosz25 /data-generator /spark-scala-playground ... 2
  • 4. The distributed data processing ecosystem 4 SQL Structured Streaming Streaming GraphX MLib Python Scala Java RSQL Kubernetes Hadoop YARN Mesos AWS DataProc HDInsightEMR Databricks GCP Azure Databricks
  • 7. Streaming query execution - micro-batch 7 load state for t1 query load offsets to process & write them for t1 query process data confirm processed offsets & next watermark commit state t2 partition-based checkpoint location state store offset log commit log
  • 8. Streaming query execution - continuous (experimental) epoch coordinator persist offsets checkpoint location offset log commit log order offsets logging
  • 9. Streaming query execution - continuous (experimental) process datatask 1 process datatask 2 process datatask 3 epoch coordinator persist offsets checkpoint location offset log commit log t order offsets logging report processed offsets long-running, per partition
  • 10. Streaming query execution - continuous (experimental) process datatask 1 process datatask 2 process datatask 3 epoch coordinator persist offsets checkpoint location offset log commit log t order offsets logging report processed offsets if all tasks processed offsets within epoc long-running, per partition
  • 11. Popular data transformations 11 def select(cols: Column*): DataFrame def as(alias: String): Dataset[T] def map[U : Encoder](func: T => U): Dataset[U] def filter(condition: Column): Dataset[T] def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T] def limit(n: Int): Dataset[T]
  • 12. Popular data transformations 12 def select(cols: Column*): DataFrame def as(alias: String): Dataset[T] def map[U : Encoder](func: T => U): Dataset[U] def filter(condition: Column): Dataset[T] def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T] def limit(n: Int): Dataset[T] def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] def mapGroups[U : Encoder](f: (K, Iterator[V]) => U): Dataset[U] def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U] def join(right: Dataset[_], joinExprs: Column, joinType: String) def reduce(func: (T, T) => T): T
  • 13. Structured Streaming pipeline example 13 val loadQuery = sparkSession.readStream.format("kafka") .option("kafka.bootstrap.servers", "210.0.0.20:9092") .option("client.id", s"simple_kafka_spark_app") .option("subscribePattern", "ss_starting_offset.*") .option("startingOffsets", "earliest") .load() val processingLogic = loadQuery.selectExpr("CAST(value AS STRING)").as[String] .filter(letter => letter.nonEmpty) .map(letter => letter.size) .select($"value".as("letter_length")) .agg(Map("letter_length" -> "sum")) val writeQuery = processingLogic.writeStream.outputMode("update") .option("checkpointLocation", "/tmp/kafka-sample") .format("console") writeQuery.start().awaitTermination() data source data processing logic data sink
  • 15. Kafka data source configuration 15 ⇢ Where? kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
  • 16. Kafka data source configuration 16 ⇢ Where? ⇢ What? kafka.bootstrap.servers + (subscribe, subscribePattern, assign) startingOffsets, endingOffsets - topic/partition or global
  • 17. Kafka data source configuration 17 ⇢ Where? ⇢ What? ⇢ How? kafka.bootstrap.servers + (subscribe, subscribePattern, assign) startingOffsets, endingOffsets - topic/partition or global data loss failure (streaming), max reading rate control, Spark partitions number
  • 19. Kafka input schema 19 key [binary] value [binary] topic [string] partition [int] offset [long] timestamp [long] timestampType [int] val query = dataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") .groupByKey(row => row.getAs[String]("key"))
  • 20. From the fetch to the reading - micro-batch 20 data loss checks, skewness optimization initialize offsets to process create data consumer if needed checkpoint processed offsets poll data Apache Kafka broker next offsets to process max offsets in partition (no maxOffsetsPerTrigger) distribute offsets to executors as long as the read offset < max offset for topic/partition data locality if new data available data loss checks if no fatal failure
  • 21. Data loss protection - conditions 21 deleted partitions
  • 22. Data loss protection - conditions 22 deleted partitions expired records (metadata consumer)
  • 23. Data loss protection - conditions 23 deleted partitions expired records (metadata consumer) new partitions with missing offsets
  • 24. Data loss protection - conditions 24 deleted partitions expired records (metadata consumer) new partitions with missing offsets expired records (data consumer)
  • 25. Apache Kafka data sink 25
  • 27. At-least once - why? 27 protected def checkForErrors(): Unit = { if (failedWrite != null) { throw failedWrite } } KafkaRowWriter
  • 28. At-least once - why? 28 private val callback = new Callback() { override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = { if (failedWrite == null && e != null) { failedWrite = e } } } KafkaRowWriter
  • 29. At-least once - why? 29 def write(row: InternalRow): Unit = { checkForErrors() sendRow(row, producer) } KafkaStreamDataWriter
  • 31. 1 or multiple outputs - how? 31 private def createProjection = { val topicExpression = topic.map(Literal(_)).orElse { inputSchema.find(_.name == TOPIC_ATTRIBUTE_NAME) }.getOrElse { throw new IllegalStateException(s"topic option required when no " + s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present") } KafkaRowWriter
  • 32. Summary 32 ● micro-batch oriented ● low latency in progress effort ● fault-tolerance with checkpoint mechanism ● batch and streaming supported ● alternative way to other streaming approaches
  • 33. Resources ● Kafka on Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka- integration.html ● Structured streaming support for consuming from Kafka: https://issues.apache.org/jira/browse/SPARK-15406 ● Github data generator: https://github.com/bartosz25/data-generator ● Kafka + Spark pipeline example: https://github.com/bartosz25/sessionization-demo ● Kafka + Spark series: https://www.waitingforcode.com/tags/kafka-spark-structured-streaming 33
  • 34. Thank you ! @waitingforcode / waitingforcode.com 34

Editor's Notes

  1. ask if everybody is aware of the watermark explain the idea of state store + where it can be stored explain where checkpoint location + where it can be stored (HDFS compatible fs)
  2. https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  3. https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  4. https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  5. limit is useless since it will stop returning data as soon as it's reached
  6. limit is useless since it will stop returning data as soon as it's reached
  7. THE CODE used in the transformation is distributed only once, for the first query, or it's compiled & distributed for every query?
  8. .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets → but it only applies to the batch processing!", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optinals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  9. .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets → but it only applies to the batch processing!", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optinals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  10. .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets → but it only applies to the batch processing!", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optinals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  11. HEADERS and 3.0!
  12. TODO: extract_json no schema registry, even though there was a blog post of Xebia about integration of it
  13. explain that it can be different → V1 vs V2 data source say that it doesn't happen for the next query becaue data is stored in memory, unless the check on data loss poll data = seek + poll poll data => explain data loss checks consumer on the executor lifecycle ⇒ Is it closed after the batch read? In fact, it depends whether there are new topic/partitons. If it's not the case, it's reused, if yes, a new one is created. an exception ⇒ contiunous streaming mode always recreates a new consumer!!!!!!! EXPLAIN the diff between micro batch and continuous reader
  14. explain why not transactions (see comment from wfc)
  15. say that KafkaRowWriter is shared by V1 and V2 data sinks