● Continuous Data Flow Programming Model in
Spark introduced in 2.0
● Low Latency & High Throughput System
● Exactly-Once Semantics - No Duplicates
● Stateful Aggregation over Time, Event-Time Windows, etc.
● A Streaming platform built on top of Spark SQL
● Express your streaming computation the same way you would express a
batch computation on Spark SQL (see the sketch after this list)
● Alpha release shipped with Spark 2.0
● Supports HDFS and S3 now; support for Kafka,
Kinesis, and other sources is coming very soon.
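For instance, assuming an existing SparkSession named spark and a made-up schema and input path, the batch and streaming versions of the same count differ only in how the data is read:

import org.apache.spark.sql.types._

// Hypothetical one-column schema and input directory, purely for illustration
val wordSchema = new StructType().add("word", StringType)

// Batch: one-off count over a static directory of CSV files
val batchCounts = spark.read.schema(wordSchema).csv("hdfs:///data/words")
  .groupBy("word").count()

// Streaming: identical query shape; only read vs. readStream changes
val streamingCounts = spark.readStream.schema(wordSchema).csv("hdfs:///data/words")
  .groupBy("word").count()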
Spark Streaming (DStreams):
● Micro Batching: streams are modeled as Discretized Streams (DStreams)
● Running aggregations need to be specified with
the updateStateByKey method
● Requires careful construction of fault-tolerance logic.
Structured Streaming:
● Live data streams keep appending
to a DataFrame called the Unbounded Table
● Runs incremental aggregates on the Unbounded Table
● Continuous Data Flow: streams are appended to
an Unbounded Table, with the DataFrame APIs available on top of it.
● No need to specify any special method for running
aggregates over time, windows, or records.
● Look at the network socket word-count program below.
● Streaming output is written in Complete, Append, or Update mode.
lines = Input Table
wordCounts = Result Table
// Socket stream - read lines as and when they arrive on the NetCat channel
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()
// Split into words and keep a running count per word
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
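The fragment above stops at the aggregation; a minimal sketch of the surrounding pieces (SparkSession, imports, and the query start) might look like the following - the app name is illustrative, and NetCat can be started with nc -lk 9999:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredSocketWordCount").getOrCreate()
import spark.implicits._   // needed for lines.as[String]

// ... the readStream / groupBy code from the slide above goes here ...

// Write the full, updated Result Table to the console on every trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()   // block until the streaming query is stopped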
● File Source (HDFS, S3 - Text, Parquet, CSV, etc.)
● Socket Stream (NetCat)
● Kafka, Kinesis, and other input sources are under
development, so cross your fingers.
● DataStreamReader API (a reader sketch follows)
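A rough sketch of the DataStreamReader against a file source, assuming the same spark session as above; the schema, header option, and directory are made-up examples, and streaming file sources need an explicit schema:

import org.apache.spark.sql.types._

// Schema must be supplied up front for streaming file sources
val userSchema = new StructType().add("name", StringType).add("age", IntegerType)

// Watch a directory on HDFS/S3 and treat newly arriving CSV files as a stream
val csvDF = spark.readStream
  .schema(userSchema)
  .option("header", "true")
  .csv("hdfs:///data/incoming/users")   // illustrative path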
Output Sink Types:
● Parquet Sink - Parquet files on HDFS, S3
● Console Sink - Terminal
● Memory Sink - In memory table that can be queried over time interactively
● Foreach Sink - run arbitrary code on the output (sink sketches follow)
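Hedged sketches of the sink types, reusing wordCounts (aggregated) and csvDF (non-aggregated) from the earlier snippets; paths, table names, and the ForeachWriter body are illustrative:

import org.apache.spark.sql.{ForeachWriter, Row}

// Console sink - dump each trigger's result to the terminal
wordCounts.writeStream.outputMode("complete").format("console").start()

// Memory sink - keep the result as an in-memory table and query it interactively
wordCounts.writeStream.outputMode("complete").format("memory").queryName("word_counts").start()
spark.sql("SELECT * FROM word_counts").show()

// Parquet (file) sink - append new rows as Parquet files on HDFS/S3
csvDF.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output/users")
  .option("checkpointLocation", "hdfs:///data/checkpoints/users")
  .start()

// Foreach sink - run arbitrary code on every output row
val rowWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true
  def process(row: Row): Unit = println(row)   // e.g. push to an external store
  def close(errorOrNull: Throwable): Unit = ()
}
wordCounts.writeStream.outputMode("complete").foreach(rowWriter).start()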
● Append Mode (Default)
○ Only new rows are appended to the output sink.
○ Applicable only for non-aggregated queries (select, where, filter, join, etc.)
● Complete Mode
○ Outputs the whole Result Table to the sink on every trigger.
○ Applicable only for aggregated queries (groupBy, etc.)
● Update Mode
○ Only rows that were updated since the last trigger are written to the output sink.
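A sketch of how the output mode is chosen per query, again reusing csvDF and wordCounts from the earlier snippets; the age filter is just an example, and Update mode is only available on Spark versions that ship it:

// Append mode (default) - non-aggregated query, only new rows are emitted
val adults = csvDF.select("name", "age").where("age >= 18")
adults.writeStream.format("console").start()   // outputMode("append") is implied

// Complete mode - the whole Result Table is rewritten each trigger (aggregations)
wordCounts.writeStream.outputMode("complete").format("console").start()

// Update mode - only rows changed since the last trigger are written
wordCounts.writeStream.outputMode("update").format("console").start()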
CheckPointing:
● In case of failure, recover the progress and state of the previous
query and continue where it left off.
● Configure a checkpoint location on the DataStreamWriter returned by
writeStream (sketched below).
● Must be configured for the Parquet/File sinks.
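A minimal sketch of configuring checkpointing on a query; the directory is illustrative and should live on a fault-tolerant store such as HDFS or S3:

val checkpointedQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///data/checkpoints/wordcount")   // progress + state saved here
  .start()

// Restarting the same query with the same checkpoint location after a failure
// resumes from the committed offsets and state instead of reprocessing everything.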
Unsupported Operations:
● Sort, limit/take of the first N rows, distinct on an input stream
● Joins between two streaming Datasets
● Outer joins (full, left, right) between two streaming Datasets
● ds.count() ⇒ use ds.groupBy().count() instead (see the sketch below)
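For example, the streaming-friendly replacement for ds.count(), using the words Dataset from the word-count sketch:

// Not supported on a streaming Dataset: an eager action returning a single value
// val total = words.count()

// Supported: express it as a running aggregation query instead
val runningTotal = words.groupBy().count()
runningTotal.writeStream.outputMode("complete").format("console").start()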
● Structured Streaming is still experimental but please try it out.
● Streaming events are gathered and appended to an infinite
DataFrame series (the Unbounded Table), and queries run on
top of it.
● Development is very similar to development with Spark's static
DataFrame/Dataset APIs.
● Execute ad-hoc queries, run aggregates, update databases, track
session data, prepare dashboards, etc.
● readStream() - the schema of streaming DataFrames is
checked only at run time, hence they are untyped.
● writeStream() is available with various output modes and output sinks.
Always remember which output mode to use with which type of query and sink.
● Kafka, Kinesis, MLlib integrations, sessionization, and watermarks
are upcoming features being developed in the open source community.
● Structured Streaming is not recommended for production
workloads at this point, even for file or socket streaming.
Thank You!
Spark code is available in my GitHub:
Other Spark-related repositories:
My blogs and learnings in Spark: