Jose Torres, Databricks
Continuous Processing in Structured Streaming
#Dev4SAIS
Continuous Processing Overview
● Unified Spark SQL API
● No microbatches
● Low (~1ms) latency
[Figure: Continuous Processing vs. Microbatch]
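The latency gap between the two execution modes can be illustrated with a toy model. This is only a sketch: the function names, the 1 second batch interval, the 0.1 second batch cost, and the per-record cost are assumptions chosen for illustration, not measurements from Spark.

```python
import math


def microbatch_latency(arrival: float, interval: float, batch_cost: float) -> float:
    """Toy latency (seconds) of a record in microbatch mode.

    Assumption: batches are cut every `interval` seconds and each batch
    takes `batch_cost` seconds to run.
    """
    # The record waits for the next batch boundary, then for the batch to finish.
    next_boundary = math.ceil(arrival / interval) * interval
    return (next_boundary - arrival) + batch_cost


def continuous_latency(per_record_cost: float) -> float:
    """Toy latency in continuous mode: a long-running task picks the
    record up as soon as it arrives, so only processing time remains."""
    return per_record_cost


# Hypothetical arrival times (seconds) within one batch interval.
arrivals = [0.01, 0.25, 0.70, 0.99]
mb = [microbatch_latency(t, interval=1.0, batch_cost=0.1) for t in arrivals]
cp = [continuous_latency(per_record_cost=0.001) for _ in arrivals]

for t, m, c in zip(arrivals, mb, cp):
    print(f"arrival={t:.2f}s  microbatch={m:.3f}s  continuous={c:.3f}s")
```

In this model the microbatch latency is dominated by the wait for the next batch boundary, which is the cost that removing microbatches avoids.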
DStream API
● Non-declarative, similar to RDDs
● Scala/Java only
● Checkpoints only through complete snapshots
● No event time
Structured Streaming
● Data represented as a virtual append-only table
● Unified Spark SQL query API
● Batch and streaming queries return the same results
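The "virtual append-only table" idea can be sketched without Spark. In the hypothetical example below, `batch_count` and `StreamingCount` are illustrative names, not Spark APIs; the point is only that a query maintained incrementally over appended rows ends with the same answer as a batch query over the full table.

```python
from collections import Counter


def batch_count(table):
    """Batch query: count rows per key over the whole table at once."""
    return Counter(key for key, _ in table)


class StreamingCount:
    """Streaming query: the same count, updated as rows are appended."""

    def __init__(self):
        self.state = Counter()

    def append(self, rows):
        # Only the newly appended rows are processed; earlier rows are
        # already reflected in the running state.
        for key, _ in rows:
            self.state[key] += 1
        return self.state


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]

stream = StreamingCount()
stream.append(rows[:2])   # first rows arrive
stream.append(rows[2:])   # more rows are appended later

print(stream.state == batch_count(rows))
```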
Structured Streaming
Structured Streaming Features
● Dataframes and Datasets
● SQL, Python, and R language APIs
● Delta-based aggregation state
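Delta-based aggregation state contrasts with the complete snapshots used for DStream checkpoints. The following is a minimal sketch of the idea, assuming a hypothetical `DeltaStateStore` class: each committed version records only the keys that changed, and a version is rebuilt by replaying deltas. It is not Spark's state store implementation.

```python
class DeltaStateStore:
    """Toy versioned key-value store that persists only per-version deltas."""

    def __init__(self):
        # deltas[i] holds the keys changed in version i + 1.
        self.deltas = []

    def commit(self, changed: dict) -> int:
        """Record the keys changed since the previous version; return the new version."""
        self.deltas.append(dict(changed))
        return len(self.deltas)

    def load(self, version: int) -> dict:
        """Rebuild the full state at `version` by replaying deltas in order."""
        state = {}
        for delta in self.deltas[:version]:
            state.update(delta)
        return state


store = DeltaStateStore()
store.commit({"a": 1, "b": 1})  # version 1: two new keys
store.commit({"a": 2})          # version 2: only "a" changed
store.commit({"c": 1})          # version 3: only "c" changed

print(store.load(2))
print(store.load(3))
```

Each commit writes an amount of data proportional to the number of changed keys rather than to the total state size, which is the motivation for storing deltas.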
Microbatches
Continuous Processing
Chandy-Lamport Checkpoints
[Figure: a Data Stream passing through Reader, Aggregation Task, and Writer Task, plus the Driver. Labeled steps: epoch marker; save checkpoint to state store; partition level commit; checkpoint complete; ready for global commit.]
● Asynchronous
● Consistent
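The marker-based checkpoint flow in the figure can be sketched as a single pipeline. This is a simplified illustration, not Spark's implementation: `EpochMarker`, `reader`, `aggregate`, and `writer` are hypothetical names, the state store is a plain dict, and the report to the driver is a list append.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpochMarker:
    epoch: int


def reader(records, epoch_size):
    """Emit records, placing an epoch marker after every `epoch_size` records."""
    epoch = 0
    for i, rec in enumerate(records, start=1):
        yield rec
        if i % epoch_size == 0:
            yield EpochMarker(epoch)
            epoch += 1


def aggregate(stream, state_store):
    """Running sum. When a marker passes, snapshot the state and forward the marker."""
    total = 0
    for item in stream:
        if isinstance(item, EpochMarker):
            state_store[item.epoch] = total  # save checkpoint to state store
            yield item
        else:
            total += item
            yield total


def writer(stream, sink, committed):
    """Buffer output; on a marker, commit this partition's epoch and report it."""
    pending = []
    for item in stream:
        if isinstance(item, EpochMarker):
            sink.extend(pending)          # partition level commit
            pending.clear()
            committed.append(item.epoch)  # report to the driver
        else:
            pending.append(item)


state_store, sink, committed = {}, [], []
writer(aggregate(reader([1, 2, 3, 4, 5], epoch_size=2), state_store), sink, committed)

print(state_store)  # state as of each marker
print(sink)         # output committed so far
print(committed)    # epochs reported to the driver
```

Processing never pauses for a checkpoint, since the marker simply travels with the data (asynchronous), and each snapshot reflects exactly the records that preceded its marker (consistent). The last record, 5, arrives after the final marker and so stays uncommitted in this run.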
Checkpointing - Detailed
[Figure: three Reader/Processing tasks, a Shuffle, and two Aggregation/Writer tasks. Labels: Driver; epoch markers.]
Checkpointing - Detailed
[Figure: three Reader/Processing tasks, a Shuffle, and two Aggregation/Writer tasks. Each Aggregation task is labeled "aggregation checkpoint".]
Checkpointing - Detailed
[Figure: three Reader/Processing tasks, a Shuffle, and two Aggregation/Writer tasks. Labels: "commit partition writer" (three times); Driver: "ready for global commit".]
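The final step, where the driver waits for every writer partition before declaring an epoch ready for global commit, might look like the sketch below. The class and method names are hypothetical, and committing epochs strictly in order is an assumption of this sketch rather than something the slides state.

```python
class EpochCoordinator:
    """Toy driver-side tracker for partition-level commits."""

    def __init__(self, num_writer_partitions: int):
        self.num_writer_partitions = num_writer_partitions
        self.reports = {}         # epoch -> set of partitions that reported
        self.last_committed = -1
        self.log = []             # epochs globally committed, in order

    def report_partition_commit(self, epoch: int, partition: int) -> None:
        """Called when one writer partition finishes its commit for `epoch`."""
        self.reports.setdefault(epoch, set()).add(partition)
        self._commit_ready_epochs()

    def _commit_ready_epochs(self) -> None:
        # Assumption: epochs commit in order, so a later epoch waits
        # until every earlier epoch has been globally committed.
        nxt = self.last_committed + 1
        while len(self.reports.get(nxt, ())) == self.num_writer_partitions:
            self.log.append(nxt)  # ready for global commit
            self.last_committed = nxt
            del self.reports[nxt]
            nxt += 1


driver = EpochCoordinator(num_writer_partitions=2)
driver.report_partition_commit(epoch=0, partition=0)
driver.report_partition_commit(epoch=1, partition=0)
driver.report_partition_commit(epoch=1, partition=1)  # epoch 1 complete, epoch 0 is not
log_before = list(driver.log)
driver.report_partition_commit(epoch=0, partition=1)  # unblocks epoch 0, then epoch 1

print(log_before, driver.log)
```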
Continuous Processing API
● It’s just Structured Streaming
● Run the same queries in continuous mode
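Because continuous mode is selected through the trigger, switching an existing query should only require changing one call. The sketch below assumes PySpark 2.3 or later and an existing `SparkSession` passed in as `spark`; the helper names and the checkpoint path are hypothetical. The `rate` source and `console` sink are used because they are among the sources and sinks listed as supported in continuous mode.

```python
def trigger_kwargs(continuous: bool, interval: str = "1 second") -> dict:
    """Keyword arguments for DataStreamWriter.trigger().

    In continuous mode the interval is the checkpoint interval;
    in microbatch mode it is the batch interval.
    """
    return {"continuous": interval} if continuous else {"processingTime": interval}


def start_rate_to_console(spark, continuous: bool):
    """Start a map-only streaming query; `spark` is an existing SparkSession."""
    events = (
        spark.readStream.format("rate")   # built-in test source
        .option("rowsPerSecond", 10)
        .load()
        .where("value % 2 = 0")           # map-like operation only
    )
    return (
        events.writeStream.format("console")
        .option("checkpointLocation", "/tmp/continuous-demo")  # hypothetical path
        .trigger(**trigger_kwargs(continuous))
        .start()
    )


# Usage, assuming a running SparkSession named `spark`:
#   query = start_rate_to_console(spark, continuous=True)
#   query.awaitTermination()
print(trigger_kwargs(continuous=True))
print(trigger_kwargs(continuous=False))
```

The query definition is identical in both modes; only the trigger argument differs.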
Continuous Processing in 2.3
● Initial experimental release
● Supports ETL use cases
Ongoing And Future Work
● Shuffles (SPARK-24036)
● Event time (SPARK-24459)
● Metrics (SPARK-23887)
● Exactly-once semantics mode (SPARK-24460)
● Performance testing (TBD)
● Additional data sources (TBD)
Q&A

Continuous Processing in Structured Streaming with Jose Torres