Apache Kafka + Apache Spark = ♡
Let's check how Kafka integrates with Spark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
2
Apache Spark
3
The distributed data processing ecosystem
4
Libraries: SQL, Structured Streaming, Streaming, GraphX, MLlib
Languages: Python, Scala, Java, R, SQL
Cluster managers: Kubernetes, Hadoop YARN, Mesos
Cloud platforms: AWS (EMR, Databricks), GCP (Dataproc), Azure (HDInsight, Databricks)
Maintainers
5
Apache Spark
Structured Streaming
6
Streaming query execution - micro-batch
7
[diagram] For the t1 query: load state (state store) → load offsets to process & write them (offset log) → process data → confirm processed offsets & next watermark (commit log) → commit state; the cycle restarts at t2. The state store is partition-based; the state store, offset log and commit log all live in the checkpoint location.
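For reference, a minimal sketch of how a micro-batch query backed by the checkpoint location pictured above could be declared; the dataFrame variable, the trigger interval and the checkpoint path are illustrative assumptions, not part of the original slide:

import org.apache.spark.sql.streaming.Trigger

val microBatchQuery = dataFrame.writeStream
  // a new micro-batch is started every 10 seconds (illustrative interval)
  .trigger(Trigger.ProcessingTime("10 seconds"))
  // the checkpoint location hosting the state store, offset log and commit log
  .option("checkpointLocation", "/tmp/micro-batch-checkpoint")
  .format("console")
  .start()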
Streaming query execution - continuous (experimental)
[diagram] The epoch coordinator orders the offsets logging and persists the offsets to the checkpoint location (offset log, commit log).
Streaming query execution - continuous (experimental)
[diagram] Long-running tasks, one per partition, process the data (task 1, task 2, task 3) and report the processed offsets at time t to the epoch coordinator, which orders the offsets logging and persists the offsets to the checkpoint location (offset log, commit log).
Streaming query execution - continuous (experimental)
[diagram] Long-running tasks, one per partition, process the data (task 1, task 2, task 3) and report the processed offsets at time t to the epoch coordinator; once all tasks have processed their offsets within the epoch, the coordinator persists the offsets to the checkpoint location (offset log, commit log).
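A minimal sketch of enabling this experimental mode, assuming an already loaded streaming dataFrame; the epoch interval and checkpoint path are illustrative:

import org.apache.spark.sql.streaming.Trigger

val continuousQuery = dataFrame.writeStream
  // continuous processing: offsets are committed once per epoch (1 second here)
  .trigger(Trigger.Continuous("1 second"))
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .format("console")
  .start()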
Popular data transformations
11
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K):
KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
Popular data transformations
12
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K):
KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]):
Dataset[U]
def mapGroups[U : Encoder](f: (K, Iterator[V]) => U):
Dataset[U]
def flatMapGroups[U : Encoder](f: (K, Iterator[V]) =>
TraversableOnce[U]): Dataset[U]
def join(right: Dataset[_], joinExprs: Column, joinType: String)
def reduce(func: (T, T) => T): T
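A hedged usage sketch combining a few of the transformations listed above; the in-memory Dataset and all names are illustrative assumptions:

import org.apache.spark.sql.{Dataset, SparkSession}

val sparkSession = SparkSession.builder().master("local[*]")
  .appName("transformations-demo").getOrCreate()
import sparkSession.implicits._

val words: Dataset[String] = Seq("kafka", "spark", "streaming").toDS()

// keep non-empty words, group them by their first letter, sum the lengths
val lengthsPerFirstLetter: Dataset[(String, Int)] = words
  .filter(word => word.nonEmpty)
  .groupByKey(word => word.substring(0, 1))
  .mapGroups((firstLetter, wordsInGroup) => (firstLetter, wordsInGroup.map(_.length).sum))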
Structured Streaming pipeline example
13
val loadQuery = sparkSession.readStream.format("kafka")
.option("kafka.bootstrap.servers", "210.0.0.20:9092")
.option("client.id", s"simple_kafka_spark_app")
.option("subscribePattern", "ss_starting_offset.*")
.option("startingOffsets", "earliest")
.load()
val processingLogic = loadQuery.selectExpr("CAST(value AS STRING)").as[String]
.filter(letter => letter.nonEmpty)
.map(letter => letter.size)
.select($"value".as("letter_length"))
.agg(Map("letter_length" -> "sum"))
val writeQuery = processingLogic.writeStream.outputMode("update")
.option("checkpointLocation", "/tmp/kafka-sample")
.format("console")
writeQuery.start().awaitTermination()
[diagram labels] readStream block = data source · transformations = data processing logic · writeStream block = data sink
Apache Kafka data
source
14
Kafka data source configuration
15
⇢ Where? kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
Kafka data source configuration
16
⇢ Where? kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
⇢ What? startingOffsets, endingOffsets - per topic/partition or global
Kafka data source configuration
17
⇢ Where? kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
⇢ What? startingOffsets, endingOffsets - per topic/partition or global
⇢ How? failing on data loss (streaming), max reading rate control, Spark partitions number
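A minimal sketch grouping the options by the three questions above, assuming an existing sparkSession; the broker address, topic name and values are illustrative:

val kafkaSource = sparkSession.readStream.format("kafka")
  // Where? the broker list + one of: subscribe, subscribePattern, assign
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  // What? per topic/partition JSON (-2 = earliest, -1 = latest) or a global value
  .option("startingOffsets", """{"events":{"0":23,"1":-2}}""")
  // How? data loss handling, max reading rate, Spark partitions number
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", "10000")
  .option("minPartitions", "10")
  .load()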
Kafka input schema
18
key [binary] · value [binary] · topic [string] · partition [int] · offset [long] · timestamp [long] · timestampType [int]
Kafka input schema
19
key [binary] · value [binary] · topic [string] · partition [int] · offset [long] · timestamp [long] · timestampType [int]

val query = dataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .groupByKey(row => row.getAs[String]("key"))
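Since key and value arrive as binary, a common follow-up is to parse the value as JSON; a hedged sketch, where the schema and column names are illustrative assumptions:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical schema of the JSON documents stored in the value field
val valueSchema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("event", StringType)
))

val parsedRecords = dataFrame
  .selectExpr("CAST(value AS STRING) AS json_value")
  .select(from_json(col("json_value"), valueSchema).as("record"))
  .select("record.*")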
From the fetch to the reading - micro-batch
20
[diagram] Driver side: initialize the offsets to process (data loss checks, skewness optimization) → if new data is available, distribute the offsets to the executors (data locality) → checkpoint the processed offsets if no fatal failure. Executor side: create the data consumer if needed, get from the Apache Kafka broker the next offsets to process and the max offsets in partition (when no maxOffsetsPerTrigger is set), then poll the data as long as the read offset < max offset for the topic/partition (data loss checks).
Data loss protection - conditions
21
deleted partitions
Data loss protection - conditions
22
deleted partitions · expired records (metadata consumer)
Data loss protection - conditions
23
deleted partitions · expired records (metadata consumer) · new partitions with missing offsets
Data loss protection - conditions
24
deleted partitions · expired records (metadata consumer) · new partitions with missing offsets · expired records (data consumer)
Apache Kafka data sink
25
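A minimal sketch of writing a streaming Dataset to Kafka, assuming a processedRecords Dataset whose key and value columns are castable to string; the broker address, topic and checkpoint path are illustrative:

val kafkaSinkQuery = processedRecords
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output_topic")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()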
Delivery semantics
26
at-least once
At-least once - why?
27
protected def checkForErrors(): Unit = {
  if (failedWrite != null) {
    throw failedWrite
  }
}
KafkaRowWriter
At-least once - why?
28
private val callback = new Callback() {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (failedWrite == null && e != null) {
      failedWrite = e
    }
  }
}
KafkaRowWriter
At-least once - why?
29
def write(row: InternalRow): Unit = {
  checkForErrors()
  sendRow(row, producer)
}
KafkaStreamDataWriter
Output generation
30
1 or multiple topics
1 or multiple outputs - how?
31
private def createProjection = {
  val topicExpression = topic.map(Literal(_)).orElse {
    inputSchema.find(_.name == TOPIC_ATTRIBUTE_NAME)
  }.getOrElse {
    throw new IllegalStateException(s"topic option required when no " +
      s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
  }
  // ...
KafkaRowWriter
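A hedged sketch of the two variants: with the topic option every row goes to one topic, without it each row must carry its own topic column, resolved by the projection above; all names and the routing rule are illustrative:

val multiTopicQuery = processedRecords
  .selectExpr(
    "CAST(value AS STRING)",
    // no 'topic' option below, so every row picks its own destination topic
    "IF(length(value) > 10, 'long_records', 'short_records') AS topic")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/multi-topic-checkpoint")
  .start()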
Summary
32
● micro-batch oriented
● low latency (continuous execution) is a work-in-progress effort
● fault-tolerance thanks to the checkpoint mechanism
● batch and streaming supported
● an alternative to other streaming approaches
Resources
● Kafka on Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
● Structured streaming support for consuming from Kafka:
https://issues.apache.org/jira/browse/SPARK-15406
● Github data generator: https://github.com/bartosz25/data-generator
● Kafka + Spark pipeline example: https://github.com/bartosz25/sessionization-demo
● Kafka + Spark series: https://www.waitingforcode.com/tags/kafka-spark-structured-streaming
33
Thank you !
@waitingforcode / waitingforcode.com
34
Editor's Notes
  1. ask if everybody is aware of the watermark; explain the idea of the state store + where it can be stored; explain the checkpoint location + where it can be stored (HDFS-compatible fs)
  2. https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  3. https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  4. https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  5. limit is useless since it will stop returning data as soon as it's reached
  6. limit is useless since it will stop returning data as soon as it's reached
  7. Is THE CODE used in the transformation distributed only once, for the first query, or is it compiled & distributed for every query?
  8. .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optional: failOnDataLoss, maxOffsetsPerTrigger, minPartitions
  9. .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optional: failOnDataLoss, maxOffsetsPerTrigger, minPartitions
  10. .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optional: failOnDataLoss, maxOffsetsPerTrigger, minPartitions
  11. HEADERS and 3.0!
  12. TODO: extract_json; no schema registry support, even though there was a Xebia blog post about integrating it
  13. explain that it can be different → V1 vs V2 data source; say that it doesn't happen for the next query because the data is stored in memory, unless the data loss check runs; poll data = seek + poll; poll data ⇒ explain the data loss checks; consumer-on-the-executor lifecycle ⇒ is it closed after the batch read? In fact, it depends on whether there are new topic/partitions: if not, it's reused, if yes, a new one is created. One exception ⇒ the continuous streaming mode always recreates a new consumer! EXPLAIN the difference between the micro-batch and continuous readers
  14. explain why not transactions (see comment from wfc)
  15. say that KafkaRowWriter is shared by V1 and V2 data sinks