State management in Structured Streaming

Chandan Prakash

Copyright 2018 © Qubole
Agenda
● Structured Streaming: Brief Intro
● Types of Stream Processing: Stateless vs. Stateful
● State in Stream Processing
● State Store in Stream Processing
● State Management in Old Spark Streaming
● State Management in Structured Streaming
● Demo with Code Example
● Quiz, Food for Thought

What does this picture represent?
[Image labels: Batch Processing, Stream Processing. Image source: Google]

Structured Streaming : Brief Intro
● Built on Spark SQL engine
● Illusion: the stream of incoming data is an unbounded Input Table, the processing logic is a SQL query, and the output of processing is a Results Table
● Internally, the query is converted into incremental micro-batch processing
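
The incremental micro-batch idea can be illustrated without Spark. The sketch below is plain Python, not the Structured Streaming API; the function names and sample data are made up for illustration. It keeps a running result table that is updated from each new micro-batch and checks that the answer matches recomputing the same count over the whole input table.

```python
from collections import Counter


def run_incremental(micro_batches):
    """Process micro-batches one at a time, updating a running result table."""
    result_table = Counter()        # stands in for the Results Table
    for batch in micro_batches:     # each batch = rows newly appended to the input table
        result_table.update(batch)  # only the new rows are touched
    return dict(result_table)


def run_as_batch(micro_batches):
    """Reference answer: run the same count over the entire input table at once."""
    all_rows = [row for batch in micro_batches for row in batch]
    return dict(Counter(all_rows))


batches = [["a", "b"], ["a"], ["c", "a"]]
incremental = run_incremental(batches)
full = run_as_batch(batches)
```

Both calls return the same counts; the incremental version simply never revisits rows from earlier batches.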

Structured Streaming Query Example

Types of Stream Processing
● Stateless Streaming
○ Processing of every record is independent
○ Operations like map, filter
● Stateful Streaming
○ Processing of each record depends on previous records
○ Operations like counting records per distinct key, deduplicating records
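
A minimal Python sketch of the difference, with hypothetical class and function names: the stateless pipeline looks at one record at a time, while the counter and the deduplicator must remember something about earlier records.

```python
def stateless_pipeline(records):
    """filter + map: each record is handled without looking at any other record."""
    return [r.upper() for r in records if r != ""]


class StatefulCounter:
    """Running count per key; the dict is the state carried across records."""

    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class Deduplicator:
    """Drops records already seen; the set of seen records is the state."""

    def __init__(self):
        self.seen = set()

    def process(self, record):
        if record in self.seen:
            return None          # duplicate, emit nothing
        self.seen.add(record)
        return record


records = ["x", "y", "x"]
counter = StatefulCounter()
running_counts = [counter.process(r) for r in records]
dedup = Deduplicator()
unique = [r for r in records if dedup.process(r) is not None]
mapped = stateless_pipeline(["x", "", "y"])
```

If the process restarts, the stateless pipeline loses nothing, whereas `counts` and `seen` have to be restored from somewhere. That is the problem a state store addresses.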

State in Stream Processing
● State of Streaming Progress
○ Metadata of stream processing: offsets
○ Keeps track of how much data has been processed so far
○ Needed for fault tolerance
○ Present in both stateless and stateful processing
● State of Data
○ Intermediate information carried between records
○ Operations like aggregation, deduplication
○ Present in stateful processing
Note: "State" on its own generally means the state of data used in processing. The
state of streaming progress is usually called metadata/offsets.
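
The two kinds of state can be sketched as two separate dictionaries. This is plain Python under simplifying assumptions, not Spark's checkpoint format: `progress` holds the offset, `state` holds per-key counts, and `crash_after` is a hypothetical knob that simulates a failure. After the simulated crash, a second run resumes from the saved offset and reuses the saved counts.

```python
def run_stream(source, progress, state, batch_size=2, crash_after=None):
    """Count keys batch by batch, recording progress (offset) and data state."""
    offset = progress.get("offset", 0)   # state of streaming progress
    batches_done = 0
    while offset < len(source):
        if crash_after is not None and batches_done == crash_after:
            raise RuntimeError("simulated failure")
        batch = source[offset:offset + batch_size]
        for key in batch:
            state[key] = state.get(key, 0) + 1   # state of data
        offset += len(batch)
        progress["offset"] = offset              # remember how far we got
        batches_done += 1
    return state


source = ["a", "b", "a", "c", "a"]
progress, state = {}, {}
try:
    run_stream(source, progress, state, crash_after=1)
except RuntimeError:
    pass                                  # first batch was processed before the crash
offset_after_crash = progress["offset"]
recovered = run_stream(source, progress, state)   # resumes from the saved offset
from_scratch = run_stream(source, {}, {})
```

The recovered counts equal a run with no failure only because both the offset and the counts survived the crash; losing either one would give wrong results.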

State Store in Streaming
● Reliable place for reading and writing intermediate data (state)
● Survives streaming failures so that processing can be restored from the same point
● Options: in-memory (for example, an in-memory HashMap), file systems, storage systems

State Management in Old (DStream) Spark Streaming
● RDD-based streaming
● Inefficient, flawed design
○ State is persisted together with offset metadata
○ Complete snapshot is persisted every micro-batch
○ Tightly coupled to, and synchronous with, Spark RDD tasks
○ No provision for incremental state persistence
○ Processing overhead that becomes a bottleneck as state grows
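
To see why a complete snapshot per micro-batch becomes a bottleneck, the sketch below counts how many state entries would be written per batch under each strategy. It is a plain Python illustration with made-up function names and data, and it measures entries rather than bytes or time.

```python
def entries_written_full_snapshot(batches):
    """Persist the whole state after every batch (the old DStream approach)."""
    state, written = {}, []
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        written.append(len(state))      # every entry is written again
    return written


def entries_written_incremental(batches):
    """Persist only the entries that changed in this batch."""
    state, written = {}, []
    for batch in batches:
        changed = set()
        for key in batch:
            state[key] = state.get(key, 0) + 1
            changed.add(key)
        written.append(len(changed))    # cost follows the batch, not the state size
    return written


batches = [["k0", "k1"], ["k2", "k3"], ["k4", "k0"]]
full_cost = entries_written_full_snapshot(batches)
incremental_cost = entries_written_incremental(batches)
```

With full snapshots the per-batch write cost grows with the total number of keys; with incremental persistence it stays proportional to the keys touched in the batch.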

State Management in Structured Streaming
Fundamental shift from Old Spark Streaming
● Decoupled from offsets/metadata checkpointing
● Asynchronous to Spark Tasks/Jobs
● Incremental State persistence

HDFS-backed State Management
1. In-memory HashMap + HDFS
2. Versioned key-value store per partition
3. Versioned delta file per partition
4. Partition task is scheduled on the same executor where the previous state resides
5. Synchronous write to the HashMap and the delta file output stream
6. Asynchronous daemon thread per executor for snapshotting and for purging/deleting files in HDFS
7. Only one thread in an executor can write to a given delta file, but threads from multiple executors can try to write to the same delta file.
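
The versioned map plus delta file design can be sketched in a few dozen lines of Python. This is a toy model under stated assumptions, not Spark's implementation: it writes JSON lines to a local temporary directory instead of HDFS, the class name `DeltaBackedStore` and the file naming are choices made for this sketch, and snapshotting is a direct call rather than a background thread.

```python
import json
import os
import tempfile


class DeltaBackedStore:
    """Toy per-partition store: a dict in memory plus one delta file per version."""

    def __init__(self, directory):
        self.directory = directory
        self.version = 0       # last committed version
        self.data = {}         # in-memory map (stands in for the HashMap)
        self._stream = None    # open delta file for the version being built
        self._tmp_path = None

    def _delta_stream(self):
        if self._stream is None:
            self._tmp_path = os.path.join(self.directory, f"{self.version + 1}.delta.tmp")
            self._stream = open(self._tmp_path, "w")
        return self._stream

    def put(self, key, value):
        # Synchronous write to both the map and the delta file stream.
        self.data[key] = value
        self._delta_stream().write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")

    def remove(self, key):
        self.data.pop(key, None)
        self._delta_stream().write(json.dumps({"op": "remove", "key": key}) + "\n")

    def commit(self):
        # Closing and renaming the file makes the new version visible.
        self._delta_stream().close()
        self.version += 1
        os.replace(self._tmp_path, os.path.join(self.directory, f"{self.version}.delta"))
        self._stream = None
        return self.version

    def snapshot(self):
        # Full copy of the map at the current version (a direct call in this sketch).
        with open(os.path.join(self.directory, f"{self.version}.snapshot"), "w") as f:
            json.dump(self.data, f)

    @classmethod
    def load(cls, directory, version):
        store = cls(directory)
        start = 0
        for v in range(version, 0, -1):    # newest snapshot at or before `version`
            path = os.path.join(directory, f"{v}.snapshot")
            if os.path.exists(path):
                with open(path) as f:
                    store.data = json.load(f)
                start = v
                break
        for v in range(start + 1, version + 1):   # replay later deltas in order
            with open(os.path.join(directory, f"{v}.delta")) as f:
                for line in f:
                    change = json.loads(line)
                    if change["op"] == "put":
                        store.data[change["key"]] = change["value"]
                    else:
                        store.data.pop(change["key"], None)
        store.version = version
        return store


directory = tempfile.mkdtemp()
store = DeltaBackedStore(directory)
store.put("a", 1); store.put("b", 1); store.commit()    # version 1
store.put("a", 2); store.commit(); store.snapshot()     # version 2 plus a snapshot
store.remove("b"); store.put("c", 1); store.commit()    # version 3
restored_v3 = DeltaBackedStore.load(directory, 3)
restored_v1 = DeltaBackedStore.load(directory, 1)
```

Loading version 3 reads the version 2 snapshot and replays a single delta, while loading version 1 replays from the beginning. Snapshots bound the replay work, which is why they can be produced lazily in the background.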

Code Entities in HDFS-backed State Management
● StatefulOperators
○ define the computation logic to be executed against the state store with the set of rows in a partition
● StateStoreOps
○ prepares a StateStoreRDD for running computations against the state store, using the computation logic passed by the stateful operator
● storeUpdateFunction
○ contains the computation logic defining what to do against the state store with the data generated in a partition task
● HDFSBackedStateStore
○ concrete implementation of the state store using a concurrent hash map, backed by the HDFS file system for persistence
● HDFSBackedStateStoreProvider
○ contains methods to get a given store and to run maintenance tasks (snapshotting, purging, deleting files, cleaning old states)
● StateStoreCoordinator
○ tracks where each partition's last versioned state is held in memory, so that the partition's task can be scheduled on that executor
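
How these entities cooperate can be sketched with a Python analogue. The classes below are hypothetical stand-ins named after the roles in the list, not Spark's Scala classes, and they keep everything in memory. A partition task asks a provider for the store at a given version, applies a store update function to the partition's rows, commits a new version, and reports its location to a coordinator.

```python
class Coordinator:
    """Remembers which executor last loaded each partition's store."""

    def __init__(self):
        self.locations = {}

    def report_active(self, partition, executor):
        self.locations[partition] = executor

    def preferred_location(self, partition):
        return self.locations.get(partition)


class Provider:
    """Hands out the store for a requested version and records committed versions."""

    def __init__(self):
        self.versions = {0: {}}

    def get_store(self, version):
        return dict(self.versions[version])    # working copy for the new version

    def commit(self, version, store):
        self.versions[version + 1] = store
        return version + 1


def count_update(store, rows):
    """A store update function: running count per key."""
    for key in rows:
        store[key] = store.get(key, 0) + 1
    return store


def run_partition_task(provider, coordinator, partition, executor,
                       version, rows, store_update_function):
    store = provider.get_store(version)
    updated = store_update_function(store, rows)
    new_version = provider.commit(version, updated)
    coordinator.report_active(partition, executor)
    return new_version


provider, coordinator = Provider(), Coordinator()
v1 = run_partition_task(provider, coordinator, 0, "executor-1", 0,
                        ["a", "b", "a"], count_update)
v2 = run_partition_task(provider, coordinator, 0, "executor-1", v1,
                        ["a"], count_update)
```

A scheduler could call `preferred_location(0)` before the next batch to place the task where version 2 is already in memory.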

Code Flow of Stateful Structured Streaming

Quiz Time
Possible issues with the HDFS-backed implementation in production?
● State size is constrained by executor memory
● The same executor memory has to be shared with RDD computation
● A single daemon thread is responsible for snapshotting all state HashMaps, file cleanup, etc.

Food for Thought
Possible solution to the limits of the in-memory HashMap?

Embedded/Local Store: RocksDB
● Embedded key-value data store
● Improved fork of LevelDB, open-sourced by Facebook
● Brings the database close to the processing
● Pros:
○ No memory issues (unlike the HashMap)
○ No network latency (unlike Cassandra)
○ Fast writes: buffer + sequential transaction log
○ Isolation
● Cons:
○ Not distributed
○ Not replicated
○ Maintenance overhead, uses non-JVM memory
● Architecture:
○ Memtable: in-memory buffer
○ Change log
○ SST table on disk
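
The three architecture pieces fit together as in the toy below. It is a plain Python sketch of the general log-structured idea, not RocksDB's actual implementation or API: the class `TinyLSM`, the JSON file format, and the two-entry memtable limit are assumptions made for illustration, and compaction is left out.

```python
import json
import os
import tempfile


class TinyLSM:
    """Toy store: memtable in memory, append-only change log, sorted files on disk."""

    def __init__(self, directory, memtable_limit=2):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}     # in-memory write buffer
        self.sst_files = []    # flushed sorted files, oldest first
        self.log_path = os.path.join(directory, "change.log")
        self.log = open(self.log_path, "a")

    def put(self, key, value):
        self.log.write(json.dumps([key, value]) + "\n")   # sequential append
        self.log.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the buffer as a file sorted by key, then start a fresh log.
        path = os.path.join(self.directory, f"{len(self.sst_files)}.sst")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sst_files.append(path)
        self.memtable = {}
        self.log.close()
        self.log = open(self.log_path, "w")

    def get(self, key):
        if key in self.memtable:                 # newest data first
            return self.memtable[key]
        for path in reversed(self.sst_files):    # then newest file to oldest
            with open(path) as f:
                for k, v in json.load(f):
                    if k == key:
                        return v
        return None


db = TinyLSM(tempfile.mkdtemp())
db.put("a", 1)
db.put("b", 2)     # memtable reaches its limit and is flushed to 0.sst
db.put("a", 3)     # newer value lives in the memtable and shadows the file
```

Writes only ever append to the log and update a small in-memory map, which is where the fast-write property in the list comes from; the state itself lives mostly on local disk rather than on the JVM heap.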

Embedded Stores in Streaming Systems
● Apache Flink
https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
● Apache Samza
https://samza.apache.org/learn/documentation/0.7.0/container/state-management.html
● Kafka Streams
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management

Summary
● What stateful processing and state in streaming are
● Architecture of state management in stateful processing of Structured Streaming
● Code example
● Why an embedded store like RocksDB is so important in stream processing

Thank You. Questions?
Qubole Blog : https://www.qubole.com/blog/
