MIGRATING BATCH JOBS INTO STRUCTURED STREAMING
Introduction to near-real-time analytics
MARCGONZALEZ.EU
Hi, I’m Marc!
Freelance Data Engineer
Developer, Consultant
(and now) speaker
5+ years of big data experience, applied to the classifieds market.
Audience
• Experience with Dataframes.
• Experience with DStreams.
• Streams and tables theory.
• Beam model.
Notes
• Most material is from Tyler Akidau,
either from his blog, talks or book.
TALK STRUCTURE
1. Streams and tables theory.
2. Structured streaming migration demo.
3. Working with new developments.
Q&A: sli.do, event code #5910
1. Streams & Tables theory
"Every Stream can yield a Table at a certain time, & every Table can be observed into a Stream."
1. Streams & Tables theory: Demonstration
Formally:
  Stream = ∂Table / ∂t
  Table = ∫ Stream dt, from t₀ to now
By example:
• Underlying structure of a database system for handling updates.
• Change Data Capture (CDC) in microservices.
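To make the two definitions concrete, here is a minimal plain-Scala sketch of the duality: folding a stream of updates yields a table, and diffing two table versions yields a stream again.

// A "stream" of (key, value) updates and the "table" they integrate into.
case class Update(key: String, value: Int)

// Table = integral of the stream from t0 to now: fold every update into a map.
def toTable(stream: Seq[Update]): Map[String, Int] =
  stream.foldLeft(Map.empty[String, Int])((table, u) => table + (u.key -> u.value))

// Stream = derivative of the table over time: the changes between two versions.
def toStream(before: Map[String, Int], after: Map[String, Int]): Seq[Update] =
  after.collect { case (k, v) if !before.get(k).contains(v) => Update(k, v) }.toSeq

val table = toTable(Seq(Update("a", 1), Update("b", 2), Update("a", 3))) // Map(a -> 3, b -> 2)
val diff  = toStream(Map("a" -> 1), table)                               // Seq(Update(a,3), Update(b,2))

This is exactly the shape CDC takes: the change log is the stream, and the materialized table is its integral.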
1. Streams & Tables theory: General approach
Operations:
• Stream → Stream: Mapping
• Stream → Table: Grouping
• Table → Stream: Partitioning
This helps a lot with our migration, right?
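As a hedged preview of the demo in part 2, the three operations map onto familiar Spark calls; df stands for any (streaming) DataFrame and the column names are illustrative.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Stream -> Stream (mapping): element-wise projection/parsing of each record.
def mapping(df: DataFrame): DataFrame =
  df.selectExpr("CAST(value AS STRING) AS sValue")

// Stream -> Table (grouping): aggregation folds the stream into an (unbounded) table.
def grouping(df: DataFrame): DataFrame =
  df.agg(sum("score").as("total"))

// Table -> Stream: in Structured Streaming, the trigger and output mode decide when
// changes of that result table are emitted downstream, e.g.
//   result.writeStream.outputMode("update").trigger(Trigger.ProcessingTime("5 seconds"))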
1. Streams & Tables theory: Batch & Streaming Engines
"Semantically batch is really just a (strict) subset of streaming."
1. Streams & Tables theory: Bounded & Unbounded Tables
[Diagram: Data stream → unbounded table (struct) → Insights]
1. Streams & Tables theory: Bounded & Unbounded Tables
So can we swap one with the other?
[Diagram: Data stream → struct → Insights]
YES*
*but you're going to need:
• Tools for reasoning about time
• Guaranteed correctness
1. Streams & Tables theory:
Tools for reasoning about time
• Event vs Processing Time
• Windowing
• Triggers
1.1. Tools for reasoning about time:
Event vs Processing Times
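In short: event time is when something actually happened and travels inside the payload, while processing time is when the pipeline observes the record. A hedged sketch using the same Kafka columns as the demo in part 2 (df is the Kafka-sourced DataFrame):

// Event time travels inside the JSON payload; processing time is the moment Spark sees the record.
val withTimes = df.selectExpr(
    "CAST(get_json_object(CAST(value AS STRING), '$.eventTime') AS LONG) AS eventTimeMs",
    "timestamp AS procTime")                       // Kafka record timestamp ~ processing time
  .selectExpr("timestamp(eventTimeMs / 1000) AS eventTime", "procTime")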
1.1. Tools for reasoning about time:
Event vs Processing Times Example
1.1. Tools for reasoning about time: Windowing
• Partitioning a data set along temporal boundaries.
• Common shapes (in event time): Fixed, Sliding, Session.
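In Spark SQL these shapes are expressed with the window function (and session_window in newer releases); a minimal sketch over an eventTime column, where df and the score column are assumptions:

import org.apache.spark.sql.functions.{session_window, sum, window}
import spark.implicits._ // assumes an active SparkSession named `spark`

// Fixed (tumbling) 2-minute windows.
df.groupBy(window($"eventTime", "2 minutes")).agg(sum("score").as("total"))

// Sliding 10-minute windows, evaluated every 2 minutes.
df.groupBy(window($"eventTime", "10 minutes", "2 minutes")).agg(sum("score").as("total"))

// Session windows that close after a 5-minute gap (available in Spark 3.2+).
df.groupBy(session_window($"eventTime", "5 minutes")).agg(sum("score").as("total"))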
1.1. Tools for reasoning about time:
2 Minute Windowing Example
1.1. Tools for reasoning about time:
Triggers
• Mechanism for declaring when the output for a window should be
materialized (relative to some external signal).
• Per element
• Window completion
• Fixed
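Spark Structured Streaming exposes a subset of these through its Trigger API; per-element and window-completion triggers are Beam-model concepts with no direct Spark equivalent. A hedged sketch, reusing the demo's jsonDf:

import org.apache.spark.sql.streaming.Trigger

// Fixed (processing-time) trigger: materialize output every 5 seconds.
jsonDf.writeStream.trigger(Trigger.ProcessingTime("5 seconds"))

// One-shot trigger: process whatever is available once, then stop (closest to the old batch job).
jsonDf.writeStream.trigger(Trigger.Once())

// Continuous processing also exists in recent Spark versions (experimental):
//   jsonDf.writeStream.trigger(Trigger.Continuous("1 second"))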
1.1. Tools for reasoning about time:
2 Minute Triggers Example
1. Streams & Tables theory: Correctness
• State
• Watermarks
• Late data firing
• Exactly once
1.2. Correctness:
State
• Amount of context stored between runs.
1.2. Correctness:
Watermarks
• Watermarks are temporal notions of input completeness in the event-time
domain.
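In Structured Streaming a watermark is declared with withWatermark; a minimal sketch, assuming the parsed DataFrame and the imports from the windowing sketch above:

// Accept events up to 2 minutes late (in event time); older state can then be dropped.
val withLateness = parsed
  .withWatermark("eventTime", "2 minutes")
  .groupBy(window($"eventTime", "2 minutes"))
  .agg(sum("score").as("total"))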
1.2. Correctness:
Watermarks Example
1.2. Correctness: Handling late data
• Firing when events are observed outside the currently retained state.
Technique vs. side-effect:
• Discarding: approximate results
• Accumulation: duplicates
• Accumulation & retraction: late updates
1.2. Correctness:
Discarding late data Example
1.2. Correctness: Exactly once
"Exactly once = At least once + Only once"
1.2. Correctness: At least once
• Checkpoints (HDFS-compatible)
• Write-ahead log
1.2. Correctness: At least once
• Checkpoints relate to State: see the sketch below.
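A minimal sketch of wiring a checkpoint into the demo query; the path is illustrative, any HDFS-compatible location works:

val query = jsonDf.writeStream
  .outputMode("update")
  .format("memory")
  .queryName(queryName)
  // Offsets (write-ahead log) and aggregation state are persisted here by Spark.
  .option("checkpointLocation", "hdfs:///checkpoints/score-totals")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()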
1.2. Correctness: Only once
Technique vs. scope:
• Deduplication: Micro-batch
• Deduplication with Watermark: State
• Deduplication with Left Join: Resources
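The watermark-bounded variant maps onto dropDuplicates; a hedged sketch where eventId is an illustrative key column:

// Keep only the first occurrence of each (eventId, eventTime); the watermark lets Spark
// eventually drop old keys from the deduplication state.
val deduped = parsed
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")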
Recap Part 1
• Processing of Bounded & Unbounded Tables.
• Event vs Processing time & how it relates to Windowing and Triggering.
• Stateful processing is useful when working to guarantee correctness.
• State is managed with Watermarks, Late Data firings & fault-tolerant exactly-once semantics.
TALK STRUCTURE
1. Streams and tables theory.
2. Structured streaming migration demo.
3. Working with new developments.
2. Structured streaming migration demo: Batch Example

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Stream -> Stream mapping: extract the Kafka record value as a string.
def selectKafkaContent(df: DataFrame): DataFrame =
  df.selectExpr("CAST(value AS STRING) as sValue")

// Stream -> Stream mapping: pull the score out of the JSON payload.
def jsonScore(df: DataFrame): DataFrame =
  df.selectExpr("CAST(get_json_object(sValue, '$.score') as INT) score")

def parse(df: DataFrame): DataFrame = jsonScore(selectKafkaContent(df))

// Stream -> Table grouping: collapse everything into a single total.
def sumScores(df: DataFrame): DataFrame =
  df.agg(sum("score").as("total"))

// publishToMyKafka, kafka, kafkaUtils and TopicAndOffsets are test helpers from the demo project.
it should "sum 48 after consuming everything" in {
  publishToMyKafka
  kafka.getTopics().size shouldBe 1
  val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
  topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
    val df = kafkaUtils
      .load(topicAndOffset, kafkaConfiguration)
    val jsonDf = df
      .transform(parse)
      .transform(sumScores)
    jsonDf.collect()(0).get(0) shouldBe 48
  }
}

• kafkaUtils handles offsets.
• parse is a stream -> stream mapping op.
• sumScores is a stream -> table grouping op.
2. Structured streaming migration demo: Windowed Batch

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, window}
import spark.implicits._ // for the $"..." and '... column syntax

// Enforce a schema on the JSON payload and derive event and processing times.
def jsonScoreAndDate(df: DataFrame): DataFrame =
  df.selectExpr(
      "from_json(sValue, 'score INT, eventTime LONG, delayInMin INT') struct",
      "timestamp as procTime")
    .select(col("struct.*"), 'procTime)
    .selectExpr("timestamp(eventTime/1000) as eventTime", "score", "procTime")

def parse(df: DataFrame): DataFrame =
  jsonScoreAndDate(selectKafkaContent(df))

// Stream -> Table grouping, partitioned into fixed 2-minute event-time windows.
def windowedSumScores(df: DataFrame): DataFrame =
  df.groupBy(window($"eventTime", "2 minutes")).agg(sum("score").as("total"))

it should "sum 14, 18, 4, 12 after consuming everything in 2 minute windows" in {
  val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
  topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
    val df = kafkaUtils
      .load(topicAndOffset, kafkaConfiguration)
    val jsonDf = df
      .transform(parse)
      .transform(windowedSumScores)
    jsonDf
      .sort("window").collect()
      .foldLeft(Seq.empty[Int])(
        (a, v) => a ++ Seq(v.get(1).asInstanceOf[Long].toInt)
      ) shouldBe Seq(14, 18, 4, 12)
  }
}

• Extract eventTime from the JSON payload.
• Always try to enforce a schema (JSON, Avro).
• The fixed window partitions our insight.
it should "sum 14,18,4,12 after streaming everything in 2 minute windows" in {
val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
val df = spark.readStream
.format(“kafka")
.option(“kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", topicAndOffset.topic)
.load()
df.isStreaming shouldBe true
val jsonDf = df
.transform(parse)
.transform(windowedSumScores)
val query = jsonDf.writeStream
.outputMode("update") //complete
.format("memory") //console
.queryName(queryName)
.trigger(Trigger.ProcessingTime("5 seconds")) //Once
.start()
query.awaitTermination(10 * 1000)
spark.sql(s"select * from $queryName order by window asc")
.collect()
.foldLeft(Seq.empty[Int])(
(a, v) => a ++ Seq(v.get(1).asInstanceOf[Long].toInt)
) shouldBe Seq(14, 18, 4, 12)
}
}
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
2.Structured streaming migration demo:
Windowed Stream Example
• read -> readStream, allowing S.S.S.
to take control over the input
offsets.
• Our dataframe is now a streaming
dataframe, & supports all previous
ops.
• write -> writeStream
• Modes: Complete, Update &
Append
• Sinks: Kafka, File, forEach,
memory & console for debug.
• Queryable state -> Eventual
Consistency
Q? sli.do #5910
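Beyond the memory and console debug sinks, the same query can feed a downstream Kafka topic; a hedged sketch (topic name and checkpoint path are illustrative):

val toKafka = jsonDf
  .selectExpr("CAST(window.start AS STRING) AS key", "CAST(total AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "eu.marcgonzalez.demo.totals")
  .option("checkpointLocation", "/tmp/checkpoints/demo-totals")
  .outputMode("update")
  .start()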
it should "sum 5,18,4,12 after streaming everything in 2 minute windows" in {
timelyPublishToMyKafka
val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
// Same reader as previous
val jsonDf = df
.transform(parse)
.withWatermark("eventTime", "2 minutes")
.transform(windowedSumScores)
val query = jsonDf
.writeStream
.outputMode(“update") //append
.format(“memory")
.queryName(queryName)
.trigger(Trigger.ProcessingTime("5 seconds”))
.start()
query.awaitTermination(15 * SECONDS_MS)
spark.sql(s"select window, max(total) from $queryName
group by window order by window asc")
.collect()
.foldLeft(Seq.empty[Int])(
(a, v) => a ++ Seq(v.get(1).asInstanceOf[Long].toInt)
) shouldBe Seq(5, 18, 4, 12)
}
}
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
2.Structured streaming migration demo:
Windowed Stream + Watermark Example
• Introduce a delay when
sending to kafka.
• We add a 2 minute watermark
• Event on a closed window gets
discarded!
• Append mode only sends
closed windows.
Q? sli.do #5910
Recap Part 2
• Reuse the same DataFrame transformations.
• Replace read with readStream and write with writeStream.
• Fixed triggers + start().
• Choosing the right sinks and output modes is key.
• State is managed through Spark checkpoints. Do not mess with it!
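Pulling the recap together, a minimal end-to-end skeleton of the migrated job; this is a sketch, not the deck's exact code: topic, servers and paths are illustrative, while parse and windowedSumScores are the transformations reused from the batch job.

import org.apache.spark.sql.streaming.Trigger

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "eu.marcgonzalez.demo")
  .option("startingOffsets", "earliest")
  .load()

val totals = input
  .transform(parse)                         // same DataFrame transformation as in batch
  .withWatermark("eventTime", "2 minutes")  // bounds state and discards very late data
  .transform(windowedSumScores)

val query = totals.writeStream
  .outputMode("append")                     // only closed windows are emitted
  .format("parquet")                        // a file sink; pick whatever your consumers need
  .option("path", "hdfs:///data/demo-totals")
  .option("checkpointLocation", "hdfs:///checkpoints/demo-totals") // managed by Spark, don't touch
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

query.awaitTermination()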
TALK STRUCTURE
1. Streams and tables theory.
2. Structured streaming migration demo.
3. Working with new developments.
Beam model
• What results are calculated?
• Where in event time are results calculated?
• When in processing time are results materialized?
• How do refinements of results relate?
Beam model
• What results are calculated? Insights
• Where in event time are results calculated? Windowing
• When in processing time are results materialized? Triggers & Watermarks
• How do refinements of results relate? Late Data Firings, Exactly Once
Apache Beam
• Unified model
• Multiple languages
• Multiple runners
Runners Comparison
Upcoming Meetup!
January 29th @ 7pm
🍕🍺 & amazing terrace
https://www.meetup.com/Barcelona-Apache-Beam-Meetup
Thank you!
Q&A: sli.do #5910