MIGRATING BATCH JOBS INTO STRUCTURED STREAMING
Introduction to near-real-time analytics
MARCGONZALEZ.EU
Hi, I’m Marc!
Freelance Data Engineer
Developer, Consultant
(and now) speaker
5+ years of big data experience, applied to the classifieds market.
Audience
• Experience with Dataframes.
• Experience with DStreams.
• Streams and tables theory.
• Beam model.
Notes
• Most material is from Tyler Akidau,
either from his blog, talks or book.
TALK STRUCTURE
1. Streams and tables theory.
2. Structured streaming migration demo.
3. Working with new developments.
Q&A: sli.do, event code #5910
1. Streams & Tables theory
"Every Stream can yield a Table at a certain time, & every Table can be observed into a Stream."
1. Streams & Tables theory: Demonstration
Formally:
  Stream = ∂Table / ∂t
  Table = ∫ Stream dt, from t₀ to now
By example:
• Underlying structure of a database system for handling updates.
• Change Data Capture (CDC) in microservices.
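To make the two definitions concrete, here is a minimal plain-Scala sketch of the duality: folding a stream of updates yields a table, and diffing two table versions yields a stream again.

// A "stream" of (key, value) updates and the "table" they integrate into.
case class Update(key: String, value: Int)

// Table = integral of the stream from t0 to now: fold every update into a map.
def toTable(stream: Seq[Update]): Map[String, Int] =
  stream.foldLeft(Map.empty[String, Int])((table, u) => table + (u.key -> u.value))

// Stream = derivative of the table over time: the changes between two versions.
def toStream(before: Map[String, Int], after: Map[String, Int]): Seq[Update] =
  after.collect { case (k, v) if !before.get(k).contains(v) => Update(k, v) }.toSeq

val table = toTable(Seq(Update("a", 1), Update("b", 2), Update("a", 3))) // Map(a -> 3, b -> 2)
val diff  = toStream(Map("a" -> 1), table)                               // Seq(Update(a,3), Update(b,2))

This is exactly the shape CDC takes: the change log is the stream, and the materialized table is its integral.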
1. Streams & Tables theory: General approach
Operations:
• Stream → Stream: Mapping
• Stream → Table: Grouping
• Table → Stream: Partitioning
This helps a lot with our migration, right?
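As a hedged preview of the demo in part 2, the three operations map onto familiar Spark calls; df stands for any (streaming) DataFrame and the column names are illustrative.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Stream -> Stream (mapping): element-wise projection/parsing of each record.
def mapping(df: DataFrame): DataFrame =
  df.selectExpr("CAST(value AS STRING) AS sValue")

// Stream -> Table (grouping): aggregation folds the stream into an (unbounded) table.
def grouping(df: DataFrame): DataFrame =
  df.agg(sum("score").as("total"))

// Table -> Stream: in Structured Streaming, the trigger and output mode decide when
// changes of that result table are emitted downstream, e.g.
//   result.writeStream.outputMode("update").trigger(Trigger.ProcessingTime("5 seconds"))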
1. Streams & Tables theory: Batch & Streaming Engines
"Semantically batch is really just a (strict) subset of streaming."
1. Streams & Tables theory: Bounded & Unbounded Tables
[Diagram: Data stream → unbounded table (struct) → Insights]
1. Streams & Tables theory: Bounded & Unbounded Tables
So can we swap one with the other?
[Diagram: Data stream → struct → Insights]
YES*
*but you're going to need:
• Tools for reasoning about time
• Guaranteed correctness
1. Streams & Tables theory:
Tools for reasoning about time
• Event vs Processing Time
• Windowing
• Triggers
1.1. Tools for reasoning about time:
Event vs Processing Times
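In short: event time is when something actually happened and travels inside the payload, while processing time is when the pipeline observes the record. A hedged sketch using the same Kafka columns as the demo in part 2 (df is the Kafka-sourced DataFrame):

// Event time travels inside the JSON payload; processing time is the moment Spark sees the record.
val withTimes = df.selectExpr(
    "CAST(get_json_object(CAST(value AS STRING), '$.eventTime') AS LONG) AS eventTimeMs",
    "timestamp AS procTime")                       // Kafka record timestamp ~ processing time
  .selectExpr("timestamp(eventTimeMs / 1000) AS eventTime", "procTime")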
1.1. Tools for reasoning about time:
Event vs Processing Times Example
1.1. Tools for reasoning about time: Windowing
• Partitioning a data set along temporal boundaries.
• Common shapes (in event time): Fixed, Sliding, Session.
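In Spark SQL these shapes are expressed with the window function (and session_window in newer releases); a minimal sketch over an eventTime column, where df and the score column are assumptions:

import org.apache.spark.sql.functions.{session_window, sum, window}
import spark.implicits._ // assumes an active SparkSession named `spark`

// Fixed (tumbling) 2-minute windows.
df.groupBy(window($"eventTime", "2 minutes")).agg(sum("score").as("total"))

// Sliding 10-minute windows, evaluated every 2 minutes.
df.groupBy(window($"eventTime", "10 minutes", "2 minutes")).agg(sum("score").as("total"))

// Session windows that close after a 5-minute gap (available in Spark 3.2+).
df.groupBy(session_window($"eventTime", "5 minutes")).agg(sum("score").as("total"))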
1.1. Tools for reasoning about time:
2 Minute Windowing Example
1.1. Tools for reasoning about time:
Triggers
• Mechanism for declaring when the output for a window should be
materialized (relative to some external signal).
• Per element
• Window completion
• Fixed
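Spark Structured Streaming exposes a subset of these through its Trigger API; per-element and window-completion triggers are Beam-model concepts with no direct Spark equivalent. A hedged sketch, reusing the demo's jsonDf:

import org.apache.spark.sql.streaming.Trigger

// Fixed (processing-time) trigger: materialize output every 5 seconds.
jsonDf.writeStream.trigger(Trigger.ProcessingTime("5 seconds"))

// One-shot trigger: process whatever is available once, then stop (closest to the old batch job).
jsonDf.writeStream.trigger(Trigger.Once())

// Continuous processing also exists in recent Spark versions (experimental):
//   jsonDf.writeStream.trigger(Trigger.Continuous("1 second"))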
1.1. Tools for reasoning about time:
2 Minute Triggers Example
1. Streams & Tables theory: Correctness
• State
• Watermarks
• Late data firing
• Exactly once
1.2. Correctness:
State
• Amount of context stored between runs.
1.2. Correctness:
Watermarks
• Watermarks are temporal notions of input completeness in the event-time
domain.
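In Structured Streaming a watermark is declared with withWatermark; a minimal sketch, assuming the parsed DataFrame and the imports from the windowing sketch above:

// Accept events up to 2 minutes late (in event time); older state can then be dropped.
val withLateness = parsed
  .withWatermark("eventTime", "2 minutes")
  .groupBy(window($"eventTime", "2 minutes"))
  .agg(sum("score").as("total"))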
1.2. Correctness:
Watermarks Example
1.2. Correctness: Handling late data
• Firing when events are observed outside the currently retained state.
Technique vs. side-effect:
• Discarding: approximate results
• Accumulation: duplicates
• Accumulation & retraction: late updates
1.2. Correctness:
Discarding late data Example
1.2. Correctness: Exactly once
"Exactly once = At least once + Only once"
1.2. Correctness: At least once
• Checkpoints (HDFS-compatible)
• Write-ahead log
1.2. Correctness: At least once
• Checkpoints relate to State: see the sketch below.
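A minimal sketch of wiring a checkpoint into the demo query; the path is illustrative, any HDFS-compatible location works:

val query = jsonDf.writeStream
  .outputMode("update")
  .format("memory")
  .queryName(queryName)
  // Offsets (write-ahead log) and aggregation state are persisted here by Spark.
  .option("checkpointLocation", "hdfs:///checkpoints/score-totals")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()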
1.2. Correctness: Only once
Technique vs. scope:
• Deduplication: Micro-batch
• Deduplication with Watermark: State
• Deduplication with Left Join: Resources
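The watermark-bounded variant maps onto dropDuplicates; a hedged sketch where eventId is an illustrative key column:

// Keep only the first occurrence of each (eventId, eventTime); the watermark lets Spark
// eventually drop old keys from the deduplication state.
val deduped = parsed
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")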
Recap Part 1
• Processing of Bounded & Unbounded Tables.
• Event vs Processing time & how it relates to Windowing and Triggering.
• Stateful processing is useful when working to guarantee correctness.
• State is managed with Watermarks, Late Data firings & fault-tolerant exactly-once semantics.
TALK STRUCTURE
1. Streams and tables theory.
2. Structured streaming migration demo.
3. Working with new developments.
2. Structured streaming migration demo: Batch Example

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Stream -> Stream mapping: extract the Kafka record value as a string.
def selectKafkaContent(df: DataFrame): DataFrame =
  df.selectExpr("CAST(value AS STRING) as sValue")

// Stream -> Stream mapping: pull the score out of the JSON payload.
def jsonScore(df: DataFrame): DataFrame =
  df.selectExpr("CAST(get_json_object(sValue, '$.score') as INT) score")

def parse(df: DataFrame): DataFrame = jsonScore(selectKafkaContent(df))

// Stream -> Table grouping: collapse everything into a single total.
def sumScores(df: DataFrame): DataFrame =
  df.agg(sum("score").as("total"))

// publishToMyKafka, kafka, kafkaUtils and TopicAndOffsets are test helpers from the demo project.
it should "sum 48 after consuming everything" in {
  publishToMyKafka
  kafka.getTopics().size shouldBe 1
  val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
  topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
    val df = kafkaUtils
      .load(topicAndOffset, kafkaConfiguration)
    val jsonDf = df
      .transform(parse)
      .transform(sumScores)
    jsonDf.collect()(0).get(0) shouldBe 48
  }
}

• kafkaUtils handles offsets.
• parse is a stream -> stream mapping op.
• sumScores is a stream -> table grouping op.
2. Structured streaming migration demo: Windowed Batch

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, window}
import spark.implicits._ // for the $"..." and '... column syntax

// Enforce a schema on the JSON payload and derive event and processing times.
def jsonScoreAndDate(df: DataFrame): DataFrame =
  df.selectExpr(
      "from_json(sValue, 'score INT, eventTime LONG, delayInMin INT') struct",
      "timestamp as procTime")
    .select(col("struct.*"), 'procTime)
    .selectExpr("timestamp(eventTime/1000) as eventTime", "score", "procTime")

def parse(df: DataFrame): DataFrame =
  jsonScoreAndDate(selectKafkaContent(df))

// Stream -> Table grouping, partitioned into fixed 2-minute event-time windows.
def windowedSumScores(df: DataFrame): DataFrame =
  df.groupBy(window($"eventTime", "2 minutes")).agg(sum("score").as("total"))

it should "sum 14, 18, 4, 12 after consuming everything in 2 minute windows" in {
  val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
  topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
    val df = kafkaUtils
      .load(topicAndOffset, kafkaConfiguration)
    val jsonDf = df
      .transform(parse)
      .transform(windowedSumScores)
    jsonDf
      .sort("window").collect()
      .foldLeft(Seq.empty[Int])(
        (a, v) => a ++ Seq(v.get(1).asInstanceOf[Long].toInt)
      ) shouldBe Seq(14, 18, 4, 12)
  }
}

• Extract eventTime from the JSON payload.
• Always try to enforce a schema (JSON, Avro).
• The fixed window partitions our insight.
it should "sum 14,18,4,12 after streaming everything in 2 minute windows" in {
val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
val df = spark.readStream
.format(“kafka")
.option(“kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", topicAndOffset.topic)
.load()
df.isStreaming shouldBe true
val jsonDf = df
.transform(parse)
.transform(windowedSumScores)
val query = jsonDf.writeStream
.outputMode("update") //complete
.format("memory") //console
.queryName(queryName)
.trigger(Trigger.ProcessingTime("5 seconds")) //Once
.start()
query.awaitTermination(10 * 1000)
spark.sql(s"select * from $queryName order by window asc")
.collect()
.foldLeft(Seq.empty[Int])(
(a, v) => a ++ Seq(v.get(1).asInstanceOf[Long].toInt)
) shouldBe Seq(14, 18, 4, 12)
}
}
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
2.Structured streaming migration demo:
Windowed Stream Example
• read -> readStream, allowing S.S.S.
to take control over the input
offsets.
• Our dataframe is now a streaming
dataframe, & supports all previous
ops.
• write -> writeStream
• Modes: Complete, Update &
Append
• Sinks: Kafka, File, forEach,
memory & console for debug.
• Queryable state -> Eventual
Consistency
Q? sli.do #5910
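Beyond the memory and console debug sinks, the same query can feed a downstream Kafka topic; a hedged sketch (topic name and checkpoint path are illustrative):

val toKafka = jsonDf
  .selectExpr("CAST(window.start AS STRING) AS key", "CAST(total AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "eu.marcgonzalez.demo.totals")
  .option("checkpointLocation", "/tmp/checkpoints/demo-totals")
  .outputMode("update")
  .start()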
it should "sum 5,18,4,12 after streaming everything in 2 minute windows" in {
timelyPublishToMyKafka
val topicsAndOffsets = kafkaUtils.getTopicsAndOffsets("eu.marcgonzalez.demo")
topicsAndOffsets.foreach { topicAndOffset: TopicAndOffsets =>
// Same reader as previous
val jsonDf = df
.transform(parse)
.withWatermark("eventTime", "2 minutes")
.transform(windowedSumScores)
val query = jsonDf
.writeStream
.outputMode(“update") //append
.format(“memory")
.queryName(queryName)
.trigger(Trigger.ProcessingTime("5 seconds”))
.start()
query.awaitTermination(15 * SECONDS_MS)
spark.sql(s"select window, max(total) from $queryName
group by window order by window asc")
.collect()
.foldLeft(Seq.empty[Int])(
(a, v) => a ++ Seq(v.get(1).asInstanceOf[Long].toInt)
) shouldBe Seq(5, 18, 4, 12)
}
}
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
2.Structured streaming migration demo:
Windowed Stream + Watermark Example
• Introduce a delay when
sending to kafka.
• We add a 2 minute watermark
• Event on a closed window gets
discarded!
• Append mode only sends
closed windows.
Q? sli.do #5910
Recap Part 2
• Reuse the same DataFrame transformations.
• Replace read with readStream and write with writeStream.
• Fixed triggers + start().
• Choosing the right sinks and output modes is key.
• State is managed through Spark checkpoints. Do not mess with it!
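Pulling the recap together, a minimal end-to-end skeleton of the migrated job; this is a sketch, not the deck's exact code: topic, servers and paths are illustrative, while parse and windowedSumScores are the transformations reused from the batch job.

import org.apache.spark.sql.streaming.Trigger

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "eu.marcgonzalez.demo")
  .option("startingOffsets", "earliest")
  .load()

val totals = input
  .transform(parse)                         // same DataFrame transformation as in batch
  .withWatermark("eventTime", "2 minutes")  // bounds state and discards very late data
  .transform(windowedSumScores)

val query = totals.writeStream
  .outputMode("append")                     // only closed windows are emitted
  .format("parquet")                        // a file sink; pick whatever your consumers need
  .option("path", "hdfs:///data/demo-totals")
  .option("checkpointLocation", "hdfs:///checkpoints/demo-totals") // managed by Spark, don't touch
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

query.awaitTermination()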
TALK STRUCTURE
1. Streams and tables theory.
2. Structured streaming migration demo.
3. Working with new developments.
Beam model
• What results are calculated?
• Where in event time are results calculated?
• When in processing time are results materialized?
• How do refinements of results relate?
Beam model
• What results are calculated? Insights
• Where in event time are results calculated? Windowing
• When in processing time are results materialized? Triggers & Watermarks
• How do refinements of results relate? Late Data Firings, Exactly Once
Apache Beam
• Unified model
• Multiple languages
• Multiple runners
Runners Comparison
Upcoming Meetup!
January 29th @ 7pm
🍕🍺 & amazing terrace
https://www.meetup.com/Barcelona-Apache-Beam-Meetup
Thank you!
Q&A: sli.do #5910