This document provides an introduction to Spark Structured Streaming, a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming expresses streaming computations in the same way as batch computations and guarantees end-to-end exactly-once processing. The document also walks through a word count application built with Structured Streaming and discusses the output modes for writing streaming query results.
3. Structured Streaming with Spark
Structured Streaming - Introduction
● Scalable
● Fault-tolerant stream processing engine
● Built on the Spark SQL engine
● Express streaming computations like batch computations
● Guarantees end-to-end exactly-once processing
4. Structured Streaming with Spark
Structured Streaming - Introduction
● Provides Dataset/DataFrame API
● In Scala, Java, Python or R
○ Streaming aggregations
○ Event-time windows
○ Stream-to-batch joins
5. Structured Streaming with Spark
Hands-on - Word Count
● A server generates data on a host and port; this server works like a
producer
● The Structured Streaming code listens to that host and port
6. Structured Streaming with Spark
● Sample code is at CloudxLab GitHub repository
● https://github.com/cloudxlab/bigdata/blob/master/spark/examples/streaming/structured_streaming/ss_wc.scala
Word Count - Code
7. Structured Streaming with Spark
Word Count - Code
● Clone the repository
git clone https://github.com/cloudxlab/bigdata.git ~/cloudxlab
● Or update the repository if already cloned
cd ~/cloudxlab && git pull origin master
8. Structured Streaming with Spark
Word Count - Code
● Go to word_count directory
cd ~/cloudxlab/spark/examples/streaming/word_count
● The word_count.scala and word_count.py files contain the Scala and
Python code for the word count problem
● Open word_count.scala
vi word_count.scala
● Copy the code and paste in spark-shell
9. Structured Streaming with Spark
Word Count - Producer
Create the data producer
● Open a new web console
● Run the following command to start listening to 9999 port
nc -lk 9999
● Whatever you type here will be passed to any process connecting on
port 9999
10. Structured Streaming with Spark
Word Count - Producer
Type in data, for example:
● The quick brown fox jumps over the lazy dog
● my first Spark Streaming code
11. Structured Streaming with Spark
Code - Part 1 - Imports
// In spark-shell, copy-paste this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import spark.implicits._

// This part is needed only if you are using spark-submit; the "spark"
// object is already made available by spark-shell:
// val spark = SparkSession
//   .builder
//   .appName("StructuredNetworkWordCount")
//   .getOrCreate()
12. Structured Streaming with Spark
Code - Part 2 - Computing Counts

// Create a DataFrame representing the stream of input lines from
// the connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate a running word count
val wordCounts = words.groupBy("value").count()
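The flatMap-then-groupBy-count pipeline is the same transformation you would write on a static batch. As a minimal sketch (plain Python, no Spark; the names are illustrative), here is the equivalent logic on a fixed list of lines:

```python
# Word count on a static batch of lines -- the same logic the
# streaming query above applies incrementally to arriving data.
from collections import Counter

batch_lines = ["The quick brown fox", "jumps over the lazy dog"]
batch_words = [w for line in batch_lines for w in line.split(" ")]
batch_counts = Counter(batch_words)
```

Note that, as in the Spark version, the split is on whitespace only, so counts are case-sensitive ("The" and "the" are distinct words).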
13. Structured Streaming with Spark
Code - Part 3 - Start

// Start running the query that prints the running counts to the
// console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
14. Structured Streaming with Spark
Programming Model
● Treat data stream as a table being appended
● Very similar to the batch processing model
● Express your streaming computation as a standard batch-like query, as on
a static table
● Spark runs it as an incremental query on the unbounded input table
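The incremental-query idea can be sketched without Spark. In this illustrative Python snippet (names are hypothetical, not Spark APIs), each arriving micro-batch appends rows to the conceptual input table, and the running aggregate is updated rather than recomputed from scratch:

```python
# Sketch of the unbounded-input-table model: every micro-batch
# appends rows, and the running aggregate is folded in incrementally.
running_counts = {}

def process_batch(rows):
    """Fold one micro-batch of input rows into the running word counts."""
    for row in rows:
        for word in row.split(" "):
            running_counts[word] = running_counts.get(word, 0) + 1

process_batch(["spark streaming"])  # first trigger
process_batch(["spark sql"])        # second trigger updates an existing row
```

After the second trigger the count for "spark" reflects both batches, which is exactly the behavior of the running word count in the example above.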
16. Structured Streaming with Spark
Programming Model - Basic Concepts
Consider the input data stream as the “Input Table”.
17. Structured Streaming with Spark
Programming Model - Basic Concepts
Every data item that is arriving on the stream is like a new row
being appended to the Input Table.
23. Structured Streaming with Spark
Output Modes
The “Output” is defined as what gets written out to the external storage.
The output can be written in one of three modes:
● Complete Mode
● Append Mode
● Update Mode
24. Structured Streaming with Spark
Output Mode - Complete
Complete Mode:
● The entire updated Result Table is written to the external storage
● The storage connector decides how to handle writing of the entire table
25. Structured Streaming with Spark
Output Mode - Append
Append Mode:
● Only the new rows appended to the Result Table since the last trigger
are written to the external storage
● Applicable only to queries where existing rows in the Result Table
are not expected to change
26. Structured Streaming with Spark
Output Mode - Update
Update Mode:
● Only the rows that were updated in the Result Table since the last
trigger are written to the external storage
● If the query doesn’t contain aggregations, it is equivalent to Append mode
● Available since Spark 2.1.1
Note that each mode is applicable only to certain types of queries.
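To make the three modes concrete, this illustrative sketch (plain Python, hypothetical names, not a Spark API) shows which rows each mode would select after a trigger, given the Result Table before and after that trigger:

```python
# Result Table state before and after one trigger.
before_trigger = {"spark": 1}
after_trigger = {"spark": 2, "sql": 1}

# Complete: the entire updated Result Table.
complete_out = dict(after_trigger)

# Update: only rows whose value changed since the last trigger.
update_out = {k: v for k, v in after_trigger.items()
              if before_trigger.get(k) != v}

# Append: only rows that are new since the last trigger.
append_out = {k: v for k, v in after_trigger.items()
              if k not in before_trigger}
```

Here Complete and Update both emit the changed "spark" row, while Append would emit only the new "sql" row; in a real aggregation query like the word count, only Complete and Update are actually applicable, since existing rows change.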
27. Structured Streaming with Spark
Example
● The first lines DataFrame is the input table
● The final wordCounts DataFrame is the result table