Introduction to Structured Streaming

© 2015 IBM Corporation1
! Agenda
- Spark Streaming 1.X
•  Features
•  Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming
•  Addressing the Improvement Areas
•  API
•  Fault Tolerance
•  Event Time
•  Managing Streaming queries
- Structured Streaming Examples
https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts

Spark Streaming 1. X
! Features of Spark Streaming
-  High Level API (stateful, joins, aggregates, windows etc.)
•  Overlap with RDD API (batch)
-  Fault – Tolerant (exactly once semantics achievable)
-  Back Pressure
-  Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)
!
Apache
Hadoop
Day
2015

Spark Streaming 1. X – Areas of improvement
! Fault-tolerance
For end-2-end exactly once guarantees, user needs to do all the heavy lifting in
the Sink
Can that be handled in a very simple way for the end-user ?
Apache
Hadoop
Day
2015

Fault-Tolerant Semantics
Exactly
Once,
If
Outputs
are
Idempotent
or
transac6onal

Exactly
Once,
as
long
as
received
data
is
not
lost

Exactly
Once
needs
re-‐playable
sources
(e.g.
Ka?a
Direct)

Source
Receiver
Transforming
Outputting
Sink

Spark Streaming 1. X – Areas of improvement
! Fault-tolerance
-  For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink
! API
-  Request for more seamless API between Batch & Stream
-  Reduce complexities of streaming app *
! No Event Time support
-  Hard to support when processing time/batch time exposed in externals
! Streaming Query Management
! Micro-batch
!
Apache
Hadoop
Day
2015

Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits
- Extend the primary Batch API even to Streaming
- Gain an Optimizer and all other enhancements done in SparkSQL.
! Challenge
- Remove/Keep streaming complexities to minimum
!

Lets Dive in

SQL Batch vs SQL Streaming- Conceptually

Batch vs Streaming - Programmatically

Output Modes - Sink
! Defined as what gets written from the Result table to external storage (Sink)
! Output modes
-  Complete – Entire updated Result table is written to external storage.
-  Append – Only new rows added in the Result table since last incremental query execution is
written to external storage.
-  Update - Only the rows updated in the Result table since last incremental query execution is
written to external storage.
Upto implementation of Storage connector to decide how to write.
* Aggregate queries only support complete mode and non-aggregate queries append mode

Supported Sinks & Modes in 2.0
*DEBUG
ONLY

*DEBUG
ONLY

Windowing in Structured Streaming

Window operations
!  Continuous time based aggregations are most common in Streaming applications.
-  Sliding window & Tumbling window
E.g. Top x hashtags on Twitter in last half hour, every 5 minutes
! New function that treats windowing as a regular aggregation
!  Used in a Group By clause
Can be used in Batch as well

Event Time Windows
! Event-Time is time embedded within the data itself
It is not the time Spark received the data
! What about processing time windows if you want them

Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, the late data is put in its correct
window group
! Use a normal filter in the SQL ?
! Watermarks

Fault Tolerance
! Why Care?
! Different guarantees for Data Loss
! Atleast Once
! Exactly Once
! What all can fail?
! Driver
! Executor

Spark 1.x Best Fault tolerance - Kafka Direct API
•  Simplified Parallelism
•  Less Storage Need
•  Exactly Once Semantics.
source & processing
Beneﬁts
of
this
approach

Fault Tolerance in Structured Streaming
Active
Driver
Checkpoint
to
HDFS

! Structured Streaming Checkpointing
Decided Offsets ranges for a trigger interval is logged to checkpoint Directory *before* any
processing is started for that trigger
Nth record in log indicates data that is currently being processed
N-1 entry in log indicates offsets idempotent written to Sink
Log entries are monotonically increasing integers
! On Recovery
Restart processing of nth entry in WAL

Fault Tolerance in Structured Streaming
! End-to-End Exactly Once guarantees with
-  idempotent Sinks (built-in for commonly used sinks e.g. Files / JDBC)
-  Built-in Sources will *mostly* be only ones that support replay
https://issues.apache.org/jira/browse/SPARK-15842

Managing Streaming Queries
!  Streaming in 1.x was definetly lacking in
-  Starting / Stopping individual Streaming Queries
-  Changing the computation done in a Query.
-  When a Streaming Query abnormally terminates handle more gracefully than app crash.

Summary
!  Overall has a good set of features
-  Easier code share between Batch and Streaming (No different type hierarchies)
-  Window not tied to Batch interval
-  No Streaming context
-  Optimizer now available for your queries.
!  Getting started
-  Combining of 3 things (Output Mode & Sink Type & Query type) needs some time to wrap your head around *
And not much control over those.
-  Only get Runtime exceptions when you mess with above
!  How does it compare to Apache Beam ?

For Each Sink

Thank YOU

Introduction to Structured Streaming

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Introduction to Structured Streaming

Similar to Introduction to Structured Streaming (20)

More from datamantra

More from datamantra (15)

Recently uploaded

Recently uploaded (20)

Introduction to Structured Streaming