RocksDB State Store in Structured Streaming
Vikram Agrawal
Qubole
State Storage Example
[Diagram: a running count maintained across three micro-batches]
● Batch 1 processes [1-4]: Count = 10; State = 10
● Batch 2 processes [5-8]: Count = 10 + 26 = 36; State = 36
● Batch 3 processes [9-10]: Count = 36 + 19 = 55; State = 55
Running Count
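To ground the example, here is a minimal Scala sketch (not from the slides) of a stateful Structured Streaming query that carries a running total across micro-batches; the rate source, column names, and console sink are assumptions chosen to keep the sketch self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch of a running aggregation whose partial result must be
// carried across micro-batches, which is exactly what the state store holds.
object RunningCountExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RunningCountExample").getOrCreate()
    import spark.implicits._

    // Built-in "rate" source: emits (timestamp, value) rows continuously.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 4)
      .load()

    // Global running sum; the partial sum is the state kept between batches.
    val runningSum = input.agg(sum($"value").as("running_sum"))

    runningSum.writeStream
      .outputMode("complete") // re-emit the updated running total each batch
      .format("console")
      .start()
      .awaitTermination()
  }
}
```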
Current Implementation
● Versioned key-value store for every
shuffle partitions
● In memory hashmap for key-value
store
● For each batch
○ Load previous version (batch) state
checkpointed in HDFS/S3
○ Create new state using previous version
and current batch data records
○ Create delta file in checkpointed folder
for the updated state.
● Maintenance task every x minutes
to create snapshot files using delta
files
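The per-batch lifecycle above can be summarized in code. The sketch below is illustrative only: Spark's actual class is HDFSBackedStateStoreProvider, and the helpers here (replayFromCheckpoint, writeDelta) are hypothetical stand-ins for its checkpoint I/O.

```scala
import scala.collection.mutable

// Illustrative per-partition sketch of the HashMap-based lifecycle above.
// replayFromCheckpoint and writeDelta are hypothetical placeholders.
object InMemoryStateStoreSketch {
  type Key = String
  type Value = Long

  // Versioned maps for one shuffle partition, held in executor JVM memory.
  private val versions = mutable.Map.empty[Long, Map[Key, Value]]

  // Load the previous version, falling back to the HDFS/S3 checkpoint.
  def loadVersion(v: Long): Map[Key, Value] =
    versions.getOrElse(v, replayFromCheckpoint(v)) // snapshot + delta files

  // Build the new version from the old state plus this batch's records,
  // then record the changes as a delta file in the checkpoint folder.
  def commitBatch(prevVersion: Long, updates: Map[Key, Value]): Long = {
    versions(prevVersion + 1) = loadVersion(prevVersion) ++ updates
    writeDelta(prevVersion + 1, updates)
    prevVersion + 1
  }

  private def replayFromCheckpoint(v: Long): Map[Key, Value] = Map.empty
  private def writeDelta(v: Long, updates: Map[Key, Value]): Unit = ()
}
```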
Issues in the Current Implementation
● Scalability
○ Uses executor JVM memory to store the states, so the state store size is limited by the executor's memory.
○ Executor JVM memory is shared between state storage and other task operations, so state store size impacts task-execution performance.
● Latency
○ GC pauses, executor failures, and OOM errors become common as the state store grows, increasing the overall latency of each micro-batch.
RocksDB Based Implementation
● Use RocksDB instead of a HashMap for each shuffle partition
● RocksDB:
○ A storage engine with a key/value interface, based on LevelDB
○ New writes are inserted into the memtable; when the memtable fills up, it flushes the data to local storage
○ Supports both point lookups and range scans, and provides different types of ACID guarantees
○ Optimized for flash storage
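As a concrete reference, here is a minimal RocksJava sketch of the point writes and lookups that replace the in-memory HashMap; the database path and keys are made up for illustration.

```scala
import org.rocksdb.{Options, RocksDB}

// Basic RocksJava usage: an on-disk key-value store per shuffle partition.
object RocksDbBasics {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary() // load the native library once per JVM

    val options = new Options().setCreateIfMissing(true)
    val db = RocksDB.open(options, "/tmp/state-p0") // illustrative path

    try {
      // Writes go to the memtable first; RocksDB flushes them to SST
      // files on local storage when the memtable fills up.
      db.put("campaign-42".getBytes, "1050".getBytes)
      val value = db.get("campaign-42".getBytes) // point lookup
      println(new String(value))
    } finally {
      db.close()
      options.close()
    }
  }
}
```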
End Goals
● Consistency (see the sketch after this list)
○ On successful batch completion, all state updates should be committed
○ On failure, none of the updates should be committed
○ Solution => use transactional RocksDB
● Isolation
○ The thread that opens the transaction in write mode should be able to read its own updates
○ Another thread that opens the DB in read-only mode should not see any of those updates
○ Solution => back up the RocksDB in a separate folder after every batch completion
● Durability
○ Checkpoint the delta files to an S3 folder
○ Periodically compact all delta files to create a snapshot
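A sketch of the consistency goal using RocksJava's TransactionDB follows: either all of a batch's updates commit or none do, and the writing thread can read its own uncommitted updates. The backup-to-a-separate-folder step for read isolation is not shown, and the path and keys are illustrative.

```scala
import org.rocksdb._

// Transactional RocksDB: commit all of a batch's updates, or roll back.
object TransactionalBatchSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()

    val options = new Options().setCreateIfMissing(true)
    val txnDb =
      TransactionDB.open(options, new TransactionDBOptions(), "/tmp/state-p0")

    val txn = txnDb.beginTransaction(new WriteOptions())
    try {
      txn.put("campaign-42".getBytes, "1050".getBytes)
      // The writing thread sees its own uncommitted update ...
      assert(txn.get(new ReadOptions(), "campaign-42".getBytes) != null)
      txn.commit()   // batch succeeded: make all updates visible
    } catch {
      case e: Exception =>
        txn.rollback() // batch failed: discard every update
        throw e
    } finally {
      txn.close()
      txnDb.close()
    }
  }
}
```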
RocksDB State Store Implementation
[Architecture diagram: on a worker node, Executor 1 runs the tasks for partition P1 and partition P2 at batch V3, each with its own RocksDB client backed by the local hard disk. Per-partition delta files (P1v1.delta, P1v2.delta, P1v3.delta; P2v1.delta, P2v2.delta, P2v3.delta) are uploaded to the external file system under /P1 and /P2, and a maintenance thread writes snapshot files (P1V2.snapshot, P2V2.snapshot) there as well.]
Implementation
● Creation of new state (sketched in code after this list): for batch x and partition Pi, where Node(Pi, x) is the executor that processed Pi in batch x
○ If Node(Pi, x) = Node(Pi, x-1): the state is already loaded in RocksDB
○ Else if Node(Pi, x) = Node(Pi, x-2): update the RocksDB state by applying the downloaded Delta(Pi, x-1)
○ Otherwise, create a new RocksDB store from the checkpointed data (snapshot + deltas)
● During batch execution
○ Open RocksDB in transactional mode
○ On successful completion of the batch:
■ Commit the transaction
■ Upload the delta file to the checkpoint folder (S3/HDFS)
■ Create a backup of the current DB state in local storage
○ Abort the transaction on any error
● Snapshot creation (maintenance task)
○ Create a tarball of the last backed-up DB state and upload it to the checkpoint folder
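The state-loading decision above can be written out as follows; every helper here (nodeFor, applyDelta, restoreFromCheckpoint) is a hypothetical stand-in, not the actual implementation.

```scala
// Sketch of the state-loading decision for partition Pi at batch x.
// All helpers are hypothetical placeholders.
object RocksDbStateLoadSketch {
  def loadState(partition: Int, batch: Long): Unit = {
    if (nodeFor(partition, batch) == nodeFor(partition, batch - 1)) {
      // Same executor as the previous batch: state is already in local RocksDB.
    } else if (nodeFor(partition, batch) == nodeFor(partition, batch - 2)) {
      // Ran here two batches ago: catch up by applying the one missing delta.
      applyDelta(partition, batch - 1)
    } else {
      // Cold start on this executor: rebuild from snapshot + delta files.
      restoreFromCheckpoint(partition, batch - 1)
    }
  }

  private def nodeFor(partition: Int, batch: Long): String = ???
  private def applyDelta(partition: Int, batch: Long): Unit = ???
  private def restoreFromCheckpoint(partition: Int, upTo: Long): Unit = ???
}
```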
Performance Analysis
● Setup
○ Master: r3.xlarge; Executors: r3.xlarge
○ Driver size: 8 GB; Executor size: 12.5 GB
○ Campaign dataset: 10,000 unique campaigns
○ Ingestion rate: 5K events per second
● Sink
○ 3-node MSK (Kafka) cluster
● Query (sketched in code below)
○ Sliding-window aggregation on event time and campaign-id
○ ~150 bytes of state data generated per key
● Config
○ shuffle partitions = 200 and 8
○ spark.dynamicAllocation.maxExecutors = 2
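For reference, here is a sketch of the shape of the benchmark query. Window sizes, topic, column names, and the broker address are assumptions, not taken from the slides, and the console sink stands in for the Kafka (MSK) sink to keep the sketch self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sliding-window aggregation keyed on event time and campaign-id.
// Requires the spark-sql-kafka-0-10 package for the Kafka source.
object CampaignWindowQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CampaignWindowQuery").getOrCreate()
    import spark.implicits._
    spark.conf.set("spark.sql.shuffle.partitions", 8) // one of the two tested values

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed address
      .option("subscribe", "campaign-events")           // assumed topic
      .load()
      .select($"timestamp".as("eventTime"),
              $"value".cast("string").as("campaignId"))

    val counts = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window($"eventTime", "10 minutes", "5 minutes"), $"campaignId")
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```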
Comparison
[Chart: Executor's GC time and heap usage with memory-based state storage]
[Chart: Executor's GC time and heap usage with RocksDB-based state storage]
