The Internals of
Stateful Stream Processing in
Spark Structured Streaming
© Jacek Laskowski / @JacekLaskowski / jacek@japila.pl
● Freelance IT consultant
● Specializing in Spark, Kafka, Kafka Streams, Scala
● Development | Consulting | Training | Speaking
● "The Internals Of" online books
● Contributor to Apache Spark
● Confluent Community Catalyst (Class of 2019 - 2020)
● Contact me at jacek@japila.pl
● Follow @JacekLaskowski on Twitter for more #ApacheSpark
#ApacheKafka #KafkaStreams
Jacek Laskowski
Friendly reminder
Pictures...take a lot of pictures! 📷
Why Should You Care?
1. In case of trouble in production, everything counts 😎
Stateful Stream Processing
1. Stateful stream processing is stream processing with state
2. State is simply a collection of keys and their current values
3. State can be explicit (available to a developer) or implicit (internal)
4. In Spark Structured Streaming, a streaming query is stateful when it uses one of
the following:
a. Streaming Aggregation
b. Arbitrary Stateful Streaming Aggregation
c. Stream-Stream Join
d. Streaming Deduplication
e. Streaming Limit
5. Read up on Stateful Stream Processing in The Internals of Spark Structured
Streaming online book
StateStore
1. StateStore is the abstraction of key-value stores for managing state in
Stateful Stream Processing
a. abort
b. commit
c. get
d. getRange
e. id
f. iterator
g. metrics
h. put
i. remove
j. version
2. Identified by operator and partition IDs
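The contract above can be pictured with a toy, in-memory model. This is only a sketch: `ToyStateStore` is a hypothetical Python class, not Spark's StateStore API, and it covers just a few of the listed methods. The assumption it illustrates is that updates stay private until `commit` publishes them as the next version, and `abort` discards them.

```python
class ToyStateStore:
    """Hypothetical in-memory stand-in for a versioned key-value state store."""

    def __init__(self, committed, version):
        self._committed = dict(committed)   # state as of `version`
        self._working = dict(committed)     # uncommitted updates for the next version
        self.version = version

    def get(self, key):
        return self._working.get(key)

    def put(self, key, value):
        self._working[key] = value

    def remove(self, key):
        self._working.pop(key, None)

    def iterator(self):
        return iter(sorted(self._working.items()))

    def commit(self):
        """Publish the working copy as the next version and return that version."""
        self._committed = dict(self._working)
        self.version += 1
        return self.version

    def abort(self):
        """Discard every uncommitted update."""
        self._working = dict(self._committed)


store = ToyStateStore(committed={"a": 1}, version=0)
store.put("b", 2)
store.abort()                  # the update to "b" is thrown away
store.put("a", 5)
new_version = store.commit()   # the next version holds only the update to "a"
```

Keeping a separate working copy is one simple way to make `abort` cheap; the real implementation may organize this differently.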
HDFSBackedStateStore
1. HDFSBackedStateStore is a concrete StateStore that uses a Hadoop
DFS-compatible file system for versioned state persistence
2. The default and only known implementation of StateStore
3. Created when the StateStore utility is requested to retrieve the StateStore for a
given ID and version (via HDFSBackedStateStoreProvider)
StateStoreProvider
1. StateStoreProvider is the abstraction of StateStore providers that manage
StateStores in Stateful Stream Processing
a. getStore(version: Long): StateStore
b. init
c. stateStoreId
d. supportedCustomMetrics
2. spark.sql.streaming.stateStore.providerClass internal configuration property
a. Fully-qualified class name of a StateStoreProvider
b. Default: HDFSBackedStateStoreProvider
3. Identified by operator and partition IDs
HDFSBackedStateStoreProvider
1. HDFSBackedStateStoreProvider is a concrete StateStoreProvider that uses a
Hadoop DFS-compatible file system for versioned state checkpointing
2. The default and only known implementation of StateStoreProvider
a. spark.sql.streaming.stateStore.providerClass internal configuration property
3. HDFSBackedStateStoreProvider uses HDFSBackedStateStores to manage state (one
per state version)
4. Manages versioned compressed state in delta and snapshot files
a. Uses cache internally for faster access to state versions
b. Periodically “compresses” delta files into snapshots
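The delta and snapshot idea can be sketched with plain dictionaries. Everything here is an assumption for illustration: `load_state`, the `deltas` and `snapshots` layouts, and the use of `None` as a removal marker are hypothetical and do not reflect Spark's on-disk file format. The sketch only shows that a version can be rebuilt from the latest snapshot at or below it plus the deltas that follow.

```python
def load_state(version, snapshots, deltas):
    """Rebuild the state at `version` from the latest snapshot plus later deltas."""
    base = max((v for v in snapshots if v <= version), default=0)
    state = dict(snapshots.get(base, {}))   # version 0 is the empty state
    for v in range(base + 1, version + 1):
        for key, value in deltas[v]:
            if value is None:               # None marks a removed key (assumption)
                state.pop(key, None)
            else:
                state[key] = value
    return state


# One list of (key, value) changes per committed version.
deltas = {
    1: [("a", 1)],
    2: [("b", 2)],
    3: [("a", None), ("c", 3)],
}
snapshots = {}

# "Compress" deltas 1..2 into a snapshot, as a periodic maintenance task might.
snapshots[2] = load_state(2, snapshots, deltas)

# Loading version 3 now starts from snapshot 2 and replays a single delta.
state_v3 = load_state(3, snapshots, deltas)
```

The trade-off shown is the usual one: deltas are cheap to write per version, while snapshots bound how many deltas must be replayed on load.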
StateStoreCoordinator
1. StateStoreCoordinator keeps track of state stores on Spark executors (per
host and executor ID)
2. A ThreadSafeRpcEndpoint RPC endpoint that handles the following messages:
a. ReportActiveInstance
b. GetLocation
c. DeactivateInstances
3. Used by StateStoreRDD for the location preferences of partitions (based on
the location of the stores)
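A rough way to picture the coordinator's bookkeeping is a dictionary from store ID to executor location. `ToyCoordinator`, its method names, and the `(run_id, operator_id, partition_id)` tuple used as the store ID are hypothetical; they only mirror the three messages listed above and are not Spark's RPC API.

```python
class ToyCoordinator:
    """Hypothetical registry of where each state store instance is loaded."""

    def __init__(self):
        # (run_id, operator_id, partition_id) -> (host, executor_id)
        self._instances = {}

    def report_active_instance(self, store_id, host, executor_id):
        # The latest report wins, so a store that moved is tracked at its new home.
        self._instances[store_id] = (host, executor_id)

    def get_location(self, store_id):
        return self._instances.get(store_id)

    def deactivate_instances(self, run_id):
        # Forget every store that belongs to the given query run.
        self._instances = {
            store_id: location
            for store_id, location in self._instances.items()
            if store_id[0] != run_id
        }


coordinator = ToyCoordinator()
coordinator.report_active_instance(("run-1", 0, 3), "host-a", "exec-1")
coordinator.report_active_instance(("run-1", 0, 3), "host-b", "exec-2")  # store moved
preferred = coordinator.get_location(("run-1", 0, 3))
coordinator.deactivate_instances("run-1")
after_stop = coordinator.get_location(("run-1", 0, 3))
```

A scheduler that asks `get_location` before placing a task can send the partition to the executor that already has the state loaded, which is the location preference described above.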
StateStoreRDD
1. StateStoreRDD is an RDD that represents (part of) a stateful streaming query
a. Single micro-batch actually
2. Executes storeUpdateFunction with the StateStore and data (per stateful
physical operator and partition ID)
a. FlatMapGroupsWithStateExec
b. StateStoreRestoreExec
c. StateStoreSaveExec
d. StreamingDeduplicateExec
e. StreamingGlobalLimitExec
3. StreamingQuery.explain
4. Uses StateStoreCoordinator for the preferred locations of a partition for job
scheduling
Streaming Aggregation (1 of 2)
1. In Spark Structured Streaming, streaming aggregation is a streaming query
that was described (built) using the following high-level streaming operators:
a. Dataset.groupBy, Dataset.rollup, Dataset.cube (RelationalGroupedDataset)
b. Dataset.groupByKey (KeyValueGroupedDataset)
c. SQL’s GROUP BY clause (including WITH CUBE and WITH ROLLUP)
2. High-level operators create a logical plan with one or more Aggregate logical
operators
a. Similarly to good ol’ aggregations in Spark SQL
3. IncrementalExecution uses StatefulAggregationStrategy execution planning
strategy for planning streaming aggregations
a. StateStoreRestoreExec and StateStoreSaveExec physical operators
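The division of labour between the two physical operators can be loosely sketched as restore, merge, save. This is a simplification under stated assumptions: `run_micro_batch` is a hypothetical function, the aggregation is a plain count, and returning only the keys touched by the batch is a choice made for the example rather than a statement about Spark's output modes.

```python
from collections import Counter


def run_micro_batch(state, rows):
    """Fold one micro-batch of keys into the running counts kept in `state`."""
    partial = Counter(rows)                                  # aggregate the new rows only
    restored = {key: state.get(key, 0) for key in partial}   # "restore": previous values
    merged = {key: restored[key] + partial[key] for key in partial}
    state.update(merged)                                     # "save": write results back
    return merged                                            # only keys seen in this batch


state = {}
out1 = run_micro_batch(state, ["a", "b", "a"])
out2 = run_micro_batch(state, ["b", "c"])
```

Only the keys present in the current micro-batch are read from and written to the state, which is why the cost of a batch depends on the keys it touches rather than on the total state size.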
Streaming Aggregation (2 of 2)
👉 IncrementalExecution — QueryExecution of Streaming Queries
Stream-Stream Join (1 of 2)
1. In Spark Structured Streaming, streaming join is a streaming query that was
described (built) using the following high-level streaming operators:
a. Dataset.join
b. SQL’s JOIN clause
2. High-level operators create a logical plan with one or more Join logical
operators
a. Similarly to good ol’ joins in Spark SQL
3. IncrementalExecution uses StreamingJoinStrategy execution planning
strategy for planning streaming joins
a. StreamingSymmetricHashJoinExec physical operator
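The general technique behind a symmetric hash join can be sketched in a few lines: each side buffers its rows by join key, and every incoming row is matched against what the other side has buffered so far. `ToySymmetricHashJoin` is a hypothetical class for illustration only. It keeps its buffers in memory and never evicts old rows, so it says nothing about how the physical operator stores or expires its state.

```python
from collections import defaultdict


class ToySymmetricHashJoin:
    """Hypothetical inner join of two streams that buffers both sides by key."""

    def __init__(self):
        self._buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def process(self, side, key, value):
        other = "right" if side == "left" else "left"
        matches = self._buffers[other].get(key, [])   # rows already seen on the other side
        self._buffers[side][key].append(value)        # remember this row for later arrivals
        if side == "left":
            return [(key, value, match) for match in matches]
        return [(key, match, value) for match in matches]


join = ToySymmetricHashJoin()
r1 = join.process("left", "k1", "L1")    # nothing buffered on the right yet
r2 = join.process("right", "k1", "R1")   # matches the buffered left row
r3 = join.process("left", "k1", "L2")    # matches the buffered right row
r4 = join.process("right", "k2", "R2")   # no left row with this key
```

Because both inputs are unbounded, both sides have to be kept as state, which is what makes a stream-stream join a stateful operation.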
Stream-Stream Join (2 of 2)
“The Internals Of” Online Books
1. The Internals of Spark SQL
2. The Internals of Spark Structured Streaming
3. The Internals of Apache Spark
Questions?
1. Follow @jaceklaskowski on Twitter (DMs open)
2. Upvote my questions and answers on StackOverflow
3. Contact me at jacek@japila.pl
4. Connect with me on LinkedIn