Intro to Spark Streaming
(pandemic edition)
Oleg Korolenko for RSF Talks @Ktech, March 2020
image credits: @Matt Turck - Big Data Landscape 2017
Agenda
1. Some streaming concepts (quickly)
2. Streaming models: microbatching vs one-record-at-a-time
3. Windowing, watermarks, state management
4. Operations on state and joins
5. Sources and sinks
Not in this talk
» Spark as distributed compute engine
» I will not cover specific integrations (like with
Kafka)
» I will not compare it to some specific streaming
solutions
API hell
- DStreams (deprecated)
- Continuous mode (experimental from 2.3)
- Structured Streaming (the way to go, in this talk)
Streaming concepts: Data
Data in motion vs data at rest (in the past)
Potentially unbounded vs known size
Spark streaming - Concept
» serves small batches of data collected from the stream
» provides them at fixed time intervals (from 0.5 seconds)
» performs computation on each batch
image credits: Spark official doc
Microbatching
An application of the Bulk Synchronous Parallel (BSP) model.
Consists of:
1. A split distribution of asynchronous work (tasks)
2. A synchronous barrier, coming in at fixed intervals (stages)
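The two parts above can be illustrated with a minimal plain-Scala sketch. It is not Spark's implementation: the `Record` type, `toBatches`, `sumPerBatch` and `intervalMs` are hypothetical names chosen for illustration. Records are bucketed by a fixed interval, and each bucket is then processed as a whole, so the end of a bucket plays the role of the barrier.

```scala
object MicroBatchSketch {
  // Hypothetical record type: arrival time in milliseconds plus a payload.
  final case class Record(arrivalMs: Long, value: Int)

  // Assign each record to the batch whose fixed interval contains its arrival time.
  def toBatches(records: Seq[Record], intervalMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => r.arrivalMs / intervalMs)
      .toSeq
      .sortBy(_._1)

  // Process one complete batch at a time; here the "computation" is a simple sum.
  def sumPerBatch(records: Seq[Record], intervalMs: Long): Seq[(Long, Int)] =
    toBatches(records, intervalMs).map { case (batchId, rs) =>
      (batchId, rs.map(_.value).sum)
    }
}
```

With a 500 ms interval, records arriving at 100 ms and 400 ms land in batch 0, a record at 600 ms lands in batch 1, and intervals with no data simply produce no batch in this sketch.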
Model: Microbatching
Transforms a batch-like query into a series of
incremental execution plans
One-record-at-a-time processing
Dataflow programming
- computation is a graph of data flowing between operations
- computations are black boxes to one another (vs Catalyst in Spark)
Used in: Apache Flink, Google Dataflow
Model: One-record-at-a-time processing
Processes user functions by pipelining:
- deploys functions as pipelines in a cluster
- flows data through the pipelines
- pipeline steps are parallelized (differently, depending on the operators)
Microbatching vs one-record-at-a-time
Despite its higher latency, microbatching has its pros:
1. Sync boundaries give the ability to adapt (e.g. recovering tasks
after a failure if an executor is down, scaling executors, etc.)
2. Data is available as a set at every microbatch (we can inspect,
adapt, drop, get stats)
3. Easier model that looks like data at rest
Spark streaming API
» API on top of the Spark SQL DataFrame and Dataset APIs
// Read text from socket
val socketDF = spark
  .readStream
  .format("socket")
  .option(...)
  .load()
socketDF.isStreaming // Returns true for DataFrames that have streaming sources
Spark streaming API, behind the scenes
[DataFrame/Dataset] =>
[Logical plan] =>
[Optimized plan] =>
[Series of incremental execution plans]
Triggering
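Run only once:
import org.apache.spark.sql.streaming.Trigger

val onceStream = data
  .writeStream
  .format("console")
  .queryName("Once")
  .trigger(Trigger.Once()) // process all available data once, then stop
  .start()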
Triggering
Scheduled execution based on processing time:
val processingTimeStream = data
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("20 seconds"))
If processing of a batch hasn't finished when the next interval is due,
the next batch starts immediately after it completes.
Processing
We can use the usual Spark transformation and aggregation APIs,
but where are the streaming semantics in there?
credits: https://twitter.com/bgeerdink/status/776003500656517120
Processing: Windowing API
val avgBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType")
  .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
  .count()
Tumbling window
eventsDF
  .groupBy(window($"eventTime", "5 minutes")) // window() takes a Column in Scala
  .count()
image credits: @DataBricks Engineering blog
Sliding window
eventsDF
  .groupBy(window($"eventTime", "10 minutes", "5 minutes")) // size 10 min, slide 5 min
  .count()
image credits: @DataBricks Engineering blog
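How an event is assigned to tumbling and sliding windows can be sketched in plain Scala. This is an illustration of the arithmetic under the assumption that windows are aligned to time 0, not Spark's own code; `Window` and `windowsFor` are hypothetical names. A tumbling window is the special case where the slide equals the window size.

```scala
object WindowSketch {
  // Half-open window [startMs, endMs).
  final case class Window(startMs: Long, endMs: Long)

  // All windows of length sizeMs, starting every slideMs, that contain tsMs.
  def windowsFor(tsMs: Long, sizeMs: Long, slideMs: Long): Seq[Window] = {
    require(sizeMs > 0 && slideMs > 0 && slideMs <= sizeMs, "need 0 < slide <= size")
    // Start of the latest window that begins at or before the event.
    val lastStart = tsMs - Math.floorMod(tsMs, slideMs)
    Iterator
      .iterate(lastStart)(_ - slideMs)            // walk back one slide at a time
      .takeWhile(start => start + sizeMs > tsMs)  // keep windows still covering the event
      .map(start => Window(start, start + sizeMs))
      .toList
      .reverse                                    // earliest window first
  }
}
```

For example, with a size of 10 and a slide of 5 (in any time unit), an event at time 12 belongs to two windows, [5, 15) and [10, 20); with a size and slide of 5 it belongs to exactly one, [10, 15).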
Late events
image credits: @DataBricks Engineering blog
Watermarks
"all input data with event times less than X have
been observed"
eventsDF
  .withWatermark("eventTime", "10 minutes") // must be set before the aggregation
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
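The watermark rule can be sketched in plain Scala as follows. This is a simplification and an assumption on my part: the watermark here advances after every event, whereas a microbatch engine would typically advance it between batches. `acceptOnTime` and `delayMs` are hypothetical names.

```scala
object WatermarkSketch {
  // Returns the event times that are accepted, in arrival order.
  // The watermark is (largest event time seen so far) - delayMs;
  // an event older than the current watermark is treated as too late and dropped.
  def acceptOnTime(eventTimesMs: Seq[Long], delayMs: Long): Seq[Long] = {
    val (_, accepted) =
      eventTimesMs.foldLeft((Long.MinValue, Vector.empty[Long])) {
        case ((maxSeen, acc), ts) =>
          // No watermark exists until at least one event has been seen.
          val watermark =
            if (maxSeen == Long.MinValue) Long.MinValue else maxSeen - delayMs
          val newMax = math.max(maxSeen, ts)
          if (ts < watermark) (newMax, acc) else (newMax, acc :+ ts)
      }
    accepted
  }
}
```

With a delay of 100, an event at time 50 arriving after an event at time 200 is dropped (the watermark is already at 100), while an event at time 195 is still accepted.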
Watermarks
image credits: @DataBricks Engineering blog
Stateful processing
Work with data in the context of what we had already
seen in the stream
State management
image credits: @DataBricks Engineering blog
State management and checkpoints
State is stored in a checkpoint location on an HDFS-compatible file system (e.g. HDFS or S3)
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata
Operations - State
mapGroupsWithState // we produce a single result per group
flatMapGroupsWithState // we produce 0 or N results per group in output
Example: Domain
// Input events
val weatherEvents: Dataset[WeatherEvent]
// Weather station event
case class WeatherEvent(
  stationId: String,
  timestamp: Timestamp,
  temp: Double
)
// Weather avg temp output
case class WeatherEventAvg(
  stationId: String,
  start: Timestamp,
  end: Timestamp,
  avgTemp: Double
)
Compute using state
val weatherEventsMovingAvg = weatherEvents
  // group by station
  .groupByKey(_.stationId)
  // processing timeout
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
Mapping function
def mappingFunction(
  key: String,
  values: Iterator[WeatherEvent],
  groupState: GroupState[List[WeatherEvent]]
): WeatherEventAvg = {
  // update the state with the new events
  val updatedState = ...
  // update the group state
  groupState.update(updatedState)
  // compute new event output using updated state
  WeatherEventAvg(key, ts1, ts2, tempAvg)
}
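One possible shape for the state update and the average is sketched below in plain Scala, without Spark. It is an assumption for illustration, not the talk's actual logic: the `Reading` type, `updateState`, `average`, and the policy of keeping only the most recent `maxSize` readings are all hypothetical.

```scala
object MovingAverageSketch {
  // Hypothetical, Spark-free stand-in for a weather reading.
  final case class Reading(timestampMs: Long, temp: Double)

  // Merge new readings into the state and keep only the most recent maxSize
  // (an assumed retention policy, so the state cannot grow without bound).
  def updateState(state: List[Reading], incoming: Seq[Reading], maxSize: Int): List[Reading] =
    (state ++ incoming).sortBy(_.timestampMs).takeRight(maxSize)

  // Average temperature over the retained readings; None when there is no state yet.
  def average(state: List[Reading]): Option[Double] =
    if (state.isEmpty) None else Some(state.map(_.temp).sum / state.size)
}
```

Bounding the state matters in a long-running stream: an unbounded per-key list would grow for as long as the key keeps receiving events.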
Write to a sink and start the stream
// define the sink for the stream
weatherEventsMovingAvg
  .writeStream
  .format("kafka") // determines that the kafka sink is used
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("checkpointLocation", "/path/checkpoint")
  // stream will start processing events from sources and write to sink
  .start()
Operations - Joins
» stream join stream
» stream join batch
Sources
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
» TCP sockets
Working with sources
image credits: Stream Processing with Apache Spark @OReilly
Offsets in checkpoints
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata
Sinks
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
Experimentation:
- Memory, Console
Custom:
- foreach (implement ForeachWriter to integrate with anything)
Failure recovery
» Spark uses checkpoints
Write Ahead Log (WAL)
» in Spark Streaming, when we receive data from sources we buffer it
» we need to store additional metadata to register offsets etc.
» we save the offsets and the data so that we can replay it from the sources
"Exactly once" delivery guarantee
Combination of
replayable sources
idempotent sinks
processing checkpoints
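Why an idempotent sink matters when a batch is replayed after a failure can be sketched in plain Scala. This is an illustration under assumptions, not how any particular Spark sink is implemented: the `Sink` class and its methods are hypothetical, and each batch is assumed to carry a stable batch id that is the same on replay.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  // Hypothetical sink keyed by batch id: writing the same batch twice has no extra effect.
  final class Sink {
    private val committed = mutable.LinkedHashMap.empty[Long, Seq[String]]

    // A replayed batch (same id) is ignored, so replays cannot create duplicates.
    def write(batchId: Long, rows: Seq[String]): Unit =
      if (!committed.contains(batchId)) committed.update(batchId, rows)

    // All rows written so far, in the order the batches were first committed.
    def contents: Seq[String] = committed.values.flatten.toSeq
  }
}
```

If batch 1 is written, the job fails before recording its progress, and the source replays batch 1 after restart, the second write is a no-op and the output still contains each row once.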
Reads and refs
1. Streaming 102: The World Beyond Batch (article) by Tyler Akidau, 2016
2. Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri, O'Reilly, April 2019
3. Stream Processing with Apache Spark by Francois Garillot and Gerard Maas, O'Reilly, 2019
4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale (whitepaper) by Matei Zaharia, Berkeley
5. Event-time Aggregation and Watermarking in Apache Spark's Structured Streaming by Tathagata Das, Databricks engineering blog
Thanks !

Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Spark Streaming Intro @KTech

  • 1. Intro to Spark Streaming (pandemic edition). Oleg Korolenko for RSF Talks @Ktech, March 2020
  • 2. image credits: @Matt Turck - Big Data Landscape 2017
  • 3. Agenda
       1. Some streaming concepts (quickly)
       2. Streaming models: microbatching vs one-record-at-a-time
       3. Windowing, watermarks, state management
       4. Operations on state and joins
       5. Sources and sinks
  • 4. Not in this talk
       » Spark as a distributed compute engine
       » Specific integrations (such as Kafka)
       » Comparisons with other specific streaming solutions
  • 5. API hell
       - DStreams (deprecated)
       - Continuous mode (experimental since 2.3)
       - Structured Streaming (the way to go, covered in this talk)
  • 6. Streaming concepts: Data
       Data in motion vs data at rest (in the past)
       Potentially unbounded vs known size
  • 7. Spark streaming - Concept
       » serves small batches of data collected from the stream
       » provides them at fixed time intervals (from 0.5 seconds)
       » performs the computation on each batch
       image credits: Spark official doc
  • 8. Microbatching
       An application of the Bulk Synchronous Parallelism (BSP) model, which consists of:
       1. A split distribution of asynchronous work (tasks)
       2. A synchronous barrier, coming in at fixed intervals (stages)
  • 9. Model: Microbatching
       Transforms a batch-like query into a series of incremental execution plans
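The micro-batch idea can be illustrated without Spark. The sketch below slices a timestamped stream into fixed-interval batches and applies the same batch-style function to each one. `Record`, `microBatches`, `sumPerBatch`, and the interval value are hypothetical names chosen for illustration; this is a simplified model, not how Spark schedules its batches.

```scala
object MicroBatchSketch {
  // Hypothetical record type: arrival time in milliseconds plus a payload.
  final case class Record(arrivalMs: Long, value: Int)

  // Slice records into batches of `intervalMs`, keyed by batch start time.
  // Assumes non-negative arrival times.
  def microBatches(records: Seq[Record], intervalMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => (r.arrivalMs / intervalMs) * intervalMs)
      .toSeq
      .sortBy(_._1)

  // A batch-like query (here: a sum) applied once per micro-batch.
  def sumPerBatch(records: Seq[Record], intervalMs: Long): Seq[(Long, Int)] =
    microBatches(records, intervalMs).map { case (start, batch) =>
      start -> batch.map(_.value).sum
    }
}
```

Each batch is a complete, bounded collection, which is why ordinary batch operations apply to it unchanged.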
  • 10. One-record-at-a-time processing
        Dataflow programming:
        - the computation is a graph of data flowing between operations
        - operations are black boxes to each other (vs Catalyst in Spark, which can inspect and optimize the whole plan)
        Used in: Apache Flink, Google Dataflow
  • 11. Model: One-record-at-a-time processing
        Processes user functions by pipelining:
        - deploys functions as pipelines in a cluster
        - flows data through the pipelines
        - pipeline steps are parallelized (differently, depending on the operator)
  • 12. Microbatching vs one-at-a-time
        Pros of microbatching, despite its higher latency:
        1. Sync boundaries give the ability to adapt (e.g. recovering a task after an executor failure, scaling executors)
        2. Data is available as a set at every microbatch (we can inspect, adapt, drop, get stats)
        3. An easier model that looks like data at rest
  • 13. Spark streaming API
        » API on top of the Spark SQL DataFrame and Dataset APIs
          // Read text from a socket
          val socketDF = spark
            .readStream
            .format("socket")
            .option(...)
            .load()
          // Returns true for DataFrames that have streaming sources
          socketDF.isStreaming
  • 14. Spark streaming API, behind the scenes
        [DataFrame/Dataset] => [Logical plan] => [Optimized plan] => [Series of incremental execution plans]
  • 15. Triggering
        Run only once:
          import org.apache.spark.sql.streaming.Trigger
          val onceStream = data
            .writeStream
            .format("console")
            .queryName("Once")
            .trigger(Trigger.Once())
  • 16. Triggering
        Scheduled execution based on processing time:
          val processingTimeStream = data
            .writeStream
            .format("console")
            .trigger(Trigger.ProcessingTime("20 seconds"))
        If the previous batch has not finished when the interval elapses, the next batch starts immediately after it completes.
  • 17. Processing
        We can use the usual Spark transformation and aggregation APIs, but where are the streaming semantics?
  • 19. Processing: Windowing API
          // Counts events per sensor type in each 1-minute tumbling window
          val avgBySensorTypeOverTime = sensorStream
            .select($"timestamp", $"sensorType")
            .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
            .count()
  • 21. Sliding window
          // 10-minute windows that slide every 5 minutes
          eventsDF
            .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
            .count()
        image credits: @DataBricks Engineering blog
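With a 10-minute window sliding every 5 minutes, each event falls into two overlapping windows. The sketch below works out which windows contain a given event time, using plain Scala and minutes as the time unit. `Window` and `windowsFor` are hypothetical names, and the alignment of window starts to multiples of the slide is an assumption made for illustration.

```scala
object SlidingWindowSketch {
  // Half-open window [start, end), in minutes since some epoch.
  final case class Window(start: Long, end: Long)

  // All windows of length `size`, sliding every `slide`, that contain `t`.
  // Assumes t >= 0 and window starts aligned to multiples of `slide`.
  def windowsFor(t: Long, size: Long, slide: Long): Seq[Window] = {
    val lastStart = (t / slide) * slide        // latest window start <= t
    Iterator
      .iterate(lastStart)(_ - slide)           // walk back one slide at a time
      .takeWhile(start => start + size > t)    // keep windows that still cover t
      .map(start => Window(start, start + size))
      .toList
      .reverse                                 // earliest window first
  }
}
```

When `slide` equals `size` the same function yields exactly one window per event, which is the tumbling-window case.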
  • 22. Late events image credits: @DataBricks Engineering blog
  • 23. Watermarks
        "all input data with event times less than X have been observed"
          // The watermark is declared on the event-time column before grouping
          eventsDF
            .withWatermark("eventTime", "10 minutes")
            .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
            .count()
  • 25. Stateful processing
        Work with data in the context of what we have already seen in the stream
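A rough way to picture a watermark: it trails the largest event time seen so far by a fixed delay, and events older than it are treated as too late. The sketch below is a simplified plain-Scala model of that rule, with the watermark advancing between batches. `process` and its parameters are hypothetical, and the real engine applies the threshold to window state rather than to individual events, so treat this only as an illustration of the idea.

```scala
object WatermarkSketch {
  // Processes batches of event times in order and returns (accepted, dropped).
  // The watermark used for a batch is derived from all earlier batches:
  //   watermark = max event time seen so far - delay
  // Assumes non-negative event times.
  def process(batches: Seq[Seq[Long]], delay: Long): (List[Long], List[Long]) = {
    val start = (0L, List.empty[Long], List.empty[Long])
    val (_, accepted, dropped) = batches.foldLeft(start) {
      case ((maxSeen, acc, drop), batch) =>
        val watermark = maxSeen - delay
        // Events older than the watermark are considered too late.
        val (onTime, late) = batch.toList.partition(_ >= watermark)
        // The watermark only moves forward, after the batch is done.
        val newMax = (maxSeen +: batch).max
        (newMax, acc ++ onTime, drop ++ late)
    }
    (accepted, dropped)
  }
}
```

A larger delay accepts more late data at the cost of keeping window state around for longer.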
  • 26. State management image credits: @DataBricks Engineering blog
  • 27. State management and checkpoints
        Backed by an S3-compatible interface to store state:
          .
          |-- commits/
          |-- offsets/
          |-- sources/
          |-- state/
          `-- metadata
  • 28. Operations - State
          mapGroupsWithState      // we produce a single result per group
          flatMapGroupsWithState  // we produce 0 or N results per group
  • 29. Example: Domain
          import java.sql.Timestamp
          // Input events
          val weatherEvents: Dataset[WeatherEvent]
          // Weather station event
          case class WeatherEvent(
            stationId: String,
            timestamp: Timestamp,
            temp: Double
          )
          // Weather avg temp output
          case class WeatherEventAvg(
            stationId: String,
            start: Timestamp,
            end: Timestamp,
            avgTemp: Double
          )
  • 30. Compute using state
          val weatherEventsMovingAvg = weatherEvents
            // group by station
            .groupByKey(_.stationId)
            // processing timeout
            .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
  • 31. Mapping function
          def mappingFunction(
            key: String,
            values: Iterator[WeatherEvent],
            groupState: GroupState[List[WeatherEvent]]
          ): WeatherEventAvg = {
            // update the state with the new events
            val updatedState = ...
            // update the group state
            groupState.update(updatedState)
            // compute new event output using updated state
            WeatherEventAvg(key, ts1, ts2, tempAvg)
          }
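The state update at the heart of such a mapping function can be written as a pure function and tested without Spark. The sketch below is one possible version, under assumptions that are not in the slides: timestamps are plain `Long` milliseconds instead of `Timestamp`, the state keeps only the last `maxEvents` readings per station, and `update` is a hypothetical helper name.

```scala
object MovingAvgSketch {
  // Simplified domain types (Long timestamps instead of java.sql.Timestamp).
  final case class WeatherEvent(stationId: String, timestampMs: Long, temp: Double)
  final case class WeatherEventAvg(stationId: String, startMs: Long, endMs: Long, avgTemp: Double)

  // Hypothetical retention policy: keep the most recent readings only.
  val maxEvents = 3

  // Old state + new events => (new state, output record).
  // Assumes at least one event is present in the state or the new events.
  def update(
      key: String,
      newEvents: Seq[WeatherEvent],
      state: List[WeatherEvent]
  ): (List[WeatherEvent], WeatherEventAvg) = {
    val updated = (state ++ newEvents).sortBy(_.timestampMs).takeRight(maxEvents)
    val avg = updated.map(_.temp).sum / updated.size
    (updated, WeatherEventAvg(key, updated.head.timestampMs, updated.last.timestampMs, avg))
  }
}
```

Inside a real mapping function, the first element of the result would be passed to `groupState.update` and the second returned as the output.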
  • 32. Write to a sink and start the stream
          // define the sink for the stream
          weatherEventsMovingAvg
            .writeStream
            .format("kafka") // determines that the Kafka sink is used
            .option("kafka.bootstrap.servers", kafkaBootstrapServer)
            .option("checkpointLocation", "/path/checkpoint")
            // the stream starts processing events from the sources and writes to the sink
            .start()
  • 33. Operations - Joins
        » stream join stream
        » stream join batch
  • 34. Sources
        » File-based: JSON, CSV, Parquet, ORC, and plain text
        » Kafka, Kinesis, Flume
        » TCP sockets
  • 35. Working with sources
        image credits: Stream Processing with Apache Spark @OReilly
  • 36. Offsets in checkpoints
          .
          |-- commits/
          |-- offsets/
          |-- sources/
          |-- state/
          `-- metadata
  • 37. Sinks
        » File-based: JSON, CSV, Parquet, ORC, and plain text
        » Kafka, Kinesis, Flume
        Experimentation:
        - Memory, Console
        Custom:
        - foreach (implement ForeachWriter to integrate with anything)
  • 38. Failure recovery
        » Spark uses checkpoints and a Write Ahead Log (WAL)
        » in Spark Streaming, when we receive data from sources, we buffer it
        » we need to store additional metadata to register offsets etc.
        » we save offsets and data to be able to replay them from the sources
  • 39. "Exactly once" delivery guarantee
        A combination of:
        - replayable sources
        - idempotent sinks
        - processing checkpoints
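How the three ingredients combine can be shown with a toy pipeline in plain Scala. In the sketch below the source is replayable by offset, the sink is keyed by offset so rewrites are harmless, and the checkpoint is advanced only after a batch is fully written. `Pipeline`, `runBatch`, and `failBeforeCommit` are hypothetical names, and the crash is simulated; this models the idea, not Spark's checkpoint format.

```scala
import scala.collection.mutable

// Toy pipeline: replayable source + idempotent sink + checkpointed offset.
final class Pipeline(source: Vector[String]) {
  // Idempotent sink: writing the same offset twice leaves a single entry.
  val sink: mutable.Map[Int, String] = mutable.Map.empty[Int, String]

  // Checkpoint: next offset to read, advanced only after a successful batch.
  var committedOffset: Int = 0

  // Processes one batch starting at the checkpointed offset.
  // `failBeforeCommit` simulates a crash after the write but before the commit.
  def runBatch(batchSize: Int, failBeforeCommit: Boolean): Unit = {
    val from = committedOffset
    val end = math.min(from + batchSize, source.size)
    for (offset <- from until end) sink(offset) = source(offset).toUpperCase
    if (!failBeforeCommit) committedOffset = end
  }
}
```

After a simulated crash the same offsets are replayed and rewritten, yet each record still appears in the sink exactly once.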
  • 40. Reads and refs
        1. Streaming 102: The World beyond Batch (article) by Tyler Akidau, 2016
        2. Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri, O'Reilly, April 2019
        3. Stream Processing with Apache Spark by Francois Garillot and Gerard Maas, O'Reilly, 2019
        4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale (whitepaper) by Matei Zaharia, Berkeley
        5. Event-time Aggregation and Watermarking in Apache Spark's Structured Streaming by Tathagata Das, Databricks engineering blog
  • 41. Thanks!