Structured Streaming
Spark Streaming 2.0
https://hadoopist.wordpress.com
Giri R Varatharajan
https://www.linkedin.com/in/girivaratharajan
What is
Structured
Streaming in
Apache Spark
● Continuous Data Flow Programming Model in
Spark introduced in 2.0
● Low Latency & High Throughput System
● Exactly-Once Semantics - No Duplicates
● Stateful Aggregation over Time, Event,
Window, or Record
● A Streaming platform built on top of Spark SQL
● Express your streaming computation the same
way as your batch computation, using Spark SQL
DataFrames
● Alpha release shipped with Spark 2.0
● Supports HDFS and S3 now; support for Kafka,
Kinesis, and other sources is coming soon
Spark
Streaming
< 2.0
Behavior
● Micro-batching: streams are represented as Discretized
Streams (DStreams)
● Running aggregations need to be specified with
the updateStateByKey method
● Requires careful construction for fault tolerance
Micro
Batching
Streaming Model
● Live data streams keep appending to a
DataFrame called the Unbounded Table
● Incremental aggregates run on the
Unbounded Table
Spark
Streaming
2.0
Behavior
+
Demo
● Continuous Data Flow: streams are appended to an
Unbounded Table, with DataFrame APIs on top of it
● No need to specify any method for running
aggregates over time, window, or record
● See the network socket word count program below
● Streaming output is written in Complete, Append, or
Update mode
Continuous
Data Flow
Lines = Input Table
wordCounts = Result Table
Streaming Model
// Socket stream - read lines as and when they arrive on the netcat channel
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
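A minimal sketch of the rest of the word count program, assuming the lines stream above and a SparkSession named spark; names such as query are illustrative:

import spark.implicits._

// Split each incoming line into words (one row per word, column "value").
val words = lines.as[String].flatMap(_.split(" "))

// Running word count over the unbounded table.
val wordCounts = words.groupBy("value").count()

// Print the complete result table to the console after every trigger.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()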
Streaming Model
// Count words per sliding event-time window
val windowedCounts = words.groupBy(
  window($"timestamp", windowDuration, slideDuration), $"word"
).count().orderBy("window")
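For the windowed count above, words needs an event-time column. A hedged sketch of one way to get it, assuming the socket source's includeTimestamp option and a SparkSession named spark; the window and slide durations are illustrative:

import java.sql.Timestamp
import org.apache.spark.sql.functions.window
import spark.implicits._

val windowDuration = "10 minutes"   // illustrative values
val slideDuration  = "5 minutes"

// Socket source can attach an ingestion timestamp to each line.
val linesWithTs = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// One (word, timestamp) row per word.
val words = linesWithTs.as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Counts per window, sliding every slideDuration.
val windowedCounts = words.groupBy(
  window($"timestamp", windowDuration, slideDuration), $"word"
).count().orderBy("window")

windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", false)
  .start()
  .awaitTermination()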
Create/
Read Streams
SparkSession.readStream()
● File Source (HDFS, S3; text, CSV, JSON, Parquet,
etc.; see the sketch below)
● Socket Stream (NetCat)
● Kafka, Kinesis, and other input sources are still
under development
● DataStreamReader API
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)
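A sketch of a file-source read; streaming file sources need the schema supplied up front. The path and column names here are made-up examples:

import org.apache.spark.sql.types._

// Streaming file sources require an explicit schema.
val userSchema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

// Picks up new CSV files as they land in the directory.
val csvStream = spark.readStream
  .schema(userSchema)
  .option("header", "true")
  .format("csv")
  .load("/data/incoming/users")   // hypothetical HDFS/S3 directory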
Outputting
Streams
Dataset.writeStream()
Output Sink Types:
● Parquet Sink - Parquet files on HDFS, S3, etc.
● Console Sink - Terminal
● Memory Sink - In memory table that can be queried over time interactively
● Foreach Sink
● DataStreamWriter API
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)
Output Modes (see the sketch below):
● Append Mode (default)
○ Only new rows are appended to the result
○ Applicable only for non-aggregated queries (select, where, filter, join, etc.)
● Complete Mode
○ Outputs the whole result table to the sink on every trigger
○ Applicable only for aggregated queries (groupBy, etc.)
● Update Mode
○ Only rows updated since the last trigger are written to the output sink
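Two hedged sketches of writing an aggregated stream, assuming the wordCounts DataFrame from the earlier word count; the query name is illustrative:

// Console sink: print the complete result table on every trigger (debugging).
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Memory sink: keep the result in an in-memory table for interactive queries.
wordCounts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("word_counts")
  .start()

spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()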
CheckPointing
● In case of failure, recover the progress and state of
the previous query and continue where it left off
● Configure a checkpoint location in the writeStream
method of DataStreamWriter (see the sketch below)
● Must be configured for the Parquet / file sink
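A sketch of enabling checkpointing, assuming the csvStream DataFrame from the file-source sketch above; both directories are placeholders and should live on a fault-tolerant file system such as HDFS or S3:

// Progress and state are tracked under checkpointLocation, so a restarted
// query resumes where it left off instead of reprocessing everything.
csvStream.writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "/data/output/users")                     // hypothetical output dir
  .option("checkpointLocation", "/data/checkpoints/users")  // hypothetical checkpoint dir
  .start()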
Unsupported
Operations yet
● Sort, limit / take of first N rows, and distinct on input
streams
● Joins between two streaming Datasets
● Outer joins (full, left, right) between two streaming
Datasets
● ds.count() ⇒ use ds.groupBy().count() instead (see below)
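For the last point, a one-line illustration using the words Dataset from the word count sketch:

// ds.count() would return a single value and is not supported on a streaming
// Dataset; express it as an aggregation that yields a streaming result instead.
val totalWords = words.groupBy().count()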
Key Takeaways
● Structured Streaming is still experimental, but please try it out.
● Streaming events are gathered and appended to an infinite
DataFrame series (the Unbounded Table), and queries run on
top of it.
● Development is very similar to development against the static
DataFrame/Dataset APIs.
● Execute ad-hoc queries, run aggregates, update DBs, track
session data, prepare dashboards, etc.
● readStream() - the schema of a streaming DataFrame is
checked only at runtime, hence it is untyped.
● writeStream() offers various output modes and output sinks.
Always remember which output mode to use for which type of
query.
● Kafka, Kinesis, and MLlib integrations, sessionization, and
watermarks are upcoming features being developed by the open
source community.
● Structured Streaming is not recommended for production
workloads at this point, even for file or socket streaming.
Thank You
Spark code is available in my GitHub:
https://github.com/vgiri2015/Spark2.0-and-greater/tree/master/src/main/scala/structStreaming
Other Spark related repositories:
https://github.com/vgiri2015/spark-latest-v1
My blogs and Learning in Spark:
https://hadoopist.wordpress.com/category/apache-spark/
