This document provides an introduction to Spark Structured Streaming, a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming expresses streaming computations in the same way as batch computations and guarantees end-to-end exactly-once processing. The document also walks through a word count application built with Structured Streaming and discusses the output modes for writing streaming query results.
3. Structured Streaming with Spark
Structured Streaming - Introduction
● Scalable
● Fault-tolerant stream processing engine
● Built on the Spark SQL engine
● Express streaming computations like batch computations
● Guarantees end-to-end exactly-once processing
4. Structured Streaming with Spark
Structured Streaming - Introduction
● Provides Dataset/DataFrame API
● In Scala, Java, Python or R
○ Streaming aggregations
○ Event-time windows
○ Stream-to-batch joins
5. Structured Streaming with Spark
Hands-on - Word Count
● A server generates data on a host and port; this server works like a
producer
● The Structured Streaming code listens to that host and port
6. Structured Streaming with Spark
● Sample code is at CloudxLab GitHub repository
● https://github.com/cloudxlab/bigdata/blob/master/spark/examples/streaming/structured_streaming/ss_wc.scala
Word Count - Code
7. Structured Streaming with Spark
Word Count - Code
● Clone the repository
git clone https://github.com/cloudxlab/bigdata.git ~/cloudxlab
● Or update the repository if already cloned
cd ~/cloudxlab && git pull origin master
8. Structured Streaming with Spark
Word Count - Code
● Go to word_count directory
cd ~/cloudxlab/spark/examples/streaming/word_count
● The word_count.scala and word_count.py files contain the Scala and
Python code for the word count problem
● Open word_count.scala
vi word_count.scala
● Copy the code and paste in spark-shell
9. Structured Streaming with Spark
Word Count - Producer
Create the data producer
● Open a new web console
● Run the following command to start listening to 9999 port
nc -lk 9999
● Whatever you type here will be passed to any process connecting on
port 9999
10. Structured Streaming with Spark
Word Count - Producer
Type in data, for example:
● The quick brown fox jumps over the lazy dog
● my first Spark Streaming code
11. Structured Streaming with Spark
Code - Part 1 - Imports
// In spark-shell, copy-paste this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._

// This part is needed only if you are using spark-submit; the "spark"
// object is already made available by spark-shell:
// val spark = SparkSession
//   .builder
//   .appName("StructuredNetworkWordCount")
//   .getOrCreate()
12. Structured Streaming with Spark
Code - Part 2 - Computing Counts

// Create a DataFrame representing the stream of input lines from
// the connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate a running word count
val wordCounts = words.groupBy("value").count()
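The flatMap-then-groupBy-count pipeline is the same transformation you would write on a static batch. As a minimal sketch (plain Python, no Spark; the names are illustrative), here is the equivalent logic on a fixed list of lines:

```python
# Word count on a static batch of lines -- the same logic the
# streaming query above applies incrementally to arriving data.
from collections import Counter

batch_lines = ["The quick brown fox", "jumps over the lazy dog"]
batch_words = [w for line in batch_lines for w in line.split(" ")]
batch_counts = Counter(batch_words)
```

Note that, as in the Spark version, the split is on whitespace only, so counts are case-sensitive ("The" and "the" are distinct words).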
13. Structured Streaming with Spark
Code - Part 3 - Start

// Start running the query that prints the running counts to the
// console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
14. Structured Streaming with Spark
Programming Model
● Treat data stream as a table being appended
● Very similar to the batch processing model
● Express your streaming computation as a standard batch-like query, as on
a static table
● Spark runs it as an incremental query on the unbounded input table
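The incremental-query idea can be sketched without Spark. In this illustrative Python snippet (names are hypothetical, not Spark APIs), each arriving micro-batch appends rows to the conceptual input table, and the running aggregate is updated rather than recomputed from scratch:

```python
# Sketch of the unbounded-input-table model: every micro-batch
# appends rows, and the running aggregate is folded in incrementally.
running_counts = {}

def process_batch(rows):
    """Fold one micro-batch of input rows into the running word counts."""
    for row in rows:
        for word in row.split(" "):
            running_counts[word] = running_counts.get(word, 0) + 1

process_batch(["spark streaming"])  # first trigger
process_batch(["spark sql"])        # second trigger updates an existing row
```

After the second trigger the count for "spark" reflects both batches, which is exactly the behavior of the running word count in the example above.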
16. Structured Streaming with Spark
Programming Model - Basic Concepts
Consider the input data stream as the “Input Table”.
17. Structured Streaming with Spark
Programming Model - Basic Concepts
Every data item that is arriving on the stream is like a new row
being appended to the Input Table.
23. Structured Streaming with Spark
Output Modes
The “Output” is defined as what gets written out to the external storage.
The output can be written in one of three modes:
● Complete Mode
● Append Mode
● Update Mode
24. Structured Streaming with Spark
Output Mode - Complete
Complete Mode:
● The entire updated Result Table is written to the external storage
● The storage connector decides how to handle writing of the entire table
25. Structured Streaming with Spark
Output Mode - Append
Append Mode:
● Only the new rows appended to the Result Table since the last trigger
are written to the external storage
● Applicable only to queries where existing rows in the Result Table
are not expected to change
26. Structured Streaming with Spark
Output Mode - Update
Update Mode:
● Only the rows that were updated in the Result Table since the last
trigger are written to the external storage
● If the query doesn’t contain aggregations, it is equivalent to Append mode
● Available since Spark 2.1.1
Note that each mode is applicable only to certain types of queries.
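To make the three modes concrete, this illustrative sketch (plain Python, hypothetical names, not a Spark API) shows which rows each mode would select after a trigger, given the Result Table before and after that trigger:

```python
# Result Table state before and after one trigger.
before_trigger = {"spark": 1}
after_trigger = {"spark": 2, "sql": 1}

# Complete: the entire updated Result Table.
complete_out = dict(after_trigger)

# Update: only rows whose value changed since the last trigger.
update_out = {k: v for k, v in after_trigger.items()
              if before_trigger.get(k) != v}

# Append: only rows that are new since the last trigger.
append_out = {k: v for k, v in after_trigger.items()
              if k not in before_trigger}
```

Here Complete and Update both emit the changed "spark" row, while Append would emit only the new "sql" row; in a real aggregation query like the word count, only Complete and Update are actually applicable, since existing rows change.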
27. Structured Streaming with Spark
Example
● The first lines DataFrame is the input table
● The final wordCounts DataFrame is the result table