Arbitrary Stateful Aggregations
using Structured Streaming
in Apache Spark™
Burak Yavuz
5/16/2017
Outline
• Structured Streaming Concepts
• Stateful Processing in Structured Streaming
• Use Cases
• Demos
The simplest way to perform streaming analytics
is not having to reason about streaming at all
New Model

Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input; the usual map/filter/reduce, plus new window and session ops

[Diagram: with a trigger every 1 sec, the input table grows over time with data up to points 1, 2, and 3, and the query runs over the whole input at each trigger]
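To make the model concrete, here is a minimal sketch (not from the deck) that treats a stream as an append-only input table; the built-in rate source and console sink are stand-ins chosen for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("new-model-sketch").getOrCreate()

// Input: an append-only table that grows as new data arrives. The built-in
// "rate" source emits (timestamp, value) rows, handy for experiments.
val input = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Query: ordinary DataFrame operations, written as if `input` were static.
val evens = input.filter("value % 2 = 0")

// Trigger: check the input for new data every 1 sec and rerun the query.
evens.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()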
New Model

Result: the final operated table, updated at every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time

[Diagram: at each trigger (every 1 sec), the query turns input data up to 1, 2, 3 into a result for data up to 1, 2, 3; in complete mode, all rows of the result table are output]
New Model

Result: the final operated table, updated at every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows that got added to the result table since the previous batch

*Not all output modes are feasible with all queries

[Diagram: the same trigger-by-trigger flow, but in append mode only the rows new since the last trigger are output]
Output Modes
• Append mode (default) - New rows added to the Result Table since the last trigger are output to the sink. Rows are output only once and cannot be rescinded.
Example use cases: ETL
Output Modes
• Complete mode - The whole Result Table is output to the sink after every trigger. Supported for aggregation queries.
Example use cases: Monitoring
Output Modes
• Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger are output to the sink.
Example use cases: Alerting, Sessionization
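A minimal sketch (an assumption, not from the deck) of how a mode is chosen in code: it is a string passed to outputMode on the stream writer.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("output-modes").getOrCreate()
import spark.implicits._

// A windowed count over the built-in rate source (chosen just for the example).
val counts = spark.readStream
  .format("rate").load()
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

// Pick "append", "complete", or "update"; not every mode is feasible for
// every query, e.g. complete mode here relies on the aggregation above.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()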
Outline
• Structured Streaming Concepts
• Stateful Processing in Structured Streaming
• Use Cases
• Demos
Event time Aggregations

Many use cases require aggregate statistics by event time, e.g. the number of errors in each system in 1-hour windows.

Many challenges: extracting event time from the data, and handling late, out-of-order data. The DStream APIs were insufficient for event-time operations.
Event time Aggregations

Windowing is just another type of grouping in Structured Streaming.

Number of records every hour:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 mins:

parsedData
  .groupBy(
    $"device",
    window($"timestamp", "10 minutes"))
  .avg("signal")

Use built-in functions to extract event time; no need for separate extractors.
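These snippets assume a parsedData Dataset with device, signal, and timestamp columns. One hedged way to build it, assuming a Kafka topic carrying JSON records (the broker, topic name, and schema below are invented for the example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("parsed-data").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)
  .add("timestamp", TimestampType)

val parsedData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "devices")                    // hypothetical topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("record"))
  .select("record.*")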
Advanced Aggregations

Powerful built-in aggregations: variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...

Multiple simultaneous aggregations:

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .agg(avg("signal"), stddev("signal"), max("signal"))

Custom aggs using reduceGroups, UDAFs:

// Compute a histogram of signal strength by device type.
val hist = ds.groupByKey(_.deviceType).mapGroups {
  case (deviceType, data: Iterator[DeviceData]) =>
    val buckets = new Array[Int](10)
    data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
    (deviceType, buckets)
}
Stateful Processing for Aggregations

In-memory streaming state is maintained for aggregations.

[Diagram: hourly window counts (12:00-17:00) updated trigger by trigger; late data updates the counts of old windows, shown in red]

Keeping state allows late data to update the counts of old windows. But the size of the state increases indefinitely if old windows are not dropped.
Watermarking and Late Data

Watermark [Spark 2.1]: a moving threshold that trails behind the max event time seen so far. The trailing gap defines how late data is expected to be.

[Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark sits at 12:20 PM; data older than the watermark is not expected]
Watermarking and Late Data

Data newer than the watermark may be late, but is allowed to aggregate. Data older than the watermark is "too late" and dropped.

State older than the watermark is automatically deleted to limit the amount of intermediate state.

[Diagram: late data between the watermark and the max event time is allowed to aggregate; data older than the watermark is dropped]
Watermarking and Late Data

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

Control the tradeoff between state size and lateness requirements:
Handle more lateness → keep more state
Reduce state → handle less lateness

[Diagram: an allowed lateness of 10 mins between the max event time and the watermark; late data above the watermark aggregates, data below it is dropped]
Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

The system tracks the max observed event time. After seeing an event at 12:14, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state older than 12:04 is deleted. Data that arrives late but is newer than the watermark (e.g. a 12:08 event arriving around 12:15) is still considered in the counts; data older than the watermark (e.g. 12:04) is too late, ignored in counts, and its state dropped.

[Diagram: event time vs. processing time across triggers at 12:05, 12:10, and 12:15, with the watermark trailing the max observed event time by 10 minutes]

More details in the blog post!
Working With Time

df.withWatermark("timestampColumn", "5 hours")        // how late data can be
  .groupBy(window($"timestampColumn", "1 minute"))    // how to group data by time
  .count()                                            //   (same in streaming & batch)
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))      // how often to emit updates

Separate processing details (output rate, late data tolerance) from query semantics.
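Putting the pieces together, a hedged end-to-end version of the same query; the update output mode, the console sink, and the assumption that df is a streaming DataFrame (with import spark.implicits._ in scope) are mine, not the deck's.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val query = df
  .withWatermark("timestampColumn", "5 hours")        // late-data tolerance
  .groupBy(window($"timestampColumn", "1 minute"))    // event-time grouping
  .count()
  .writeStream
  .outputMode("update")                               // emit only changed counts
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))      // output rate
  .start()

query.awaitTermination()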
Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState allows any user-defined stateful ops on a user-defined state
Direct support for per-key timeouts in event time or processing time
Supports Scala and Java

ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
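As a concrete (and hedged) instance of that shape, here is a per-key running event count; the Event type and the counting logic are invented for the example.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: Int, value: Long)  // hypothetical event type

// K = Int, V = Event, S = Long (the running count), U = (Int, Long)
def countEvents(
    id: Int,
    events: Iterator[Event],
    state: GroupState[Long]): (Int, Long) = {
  val updated = state.getOption.getOrElse(0L) + events.size
  state.update(updated)  // persisted across triggers
  (id, updated)
}

// Assuming ds: Dataset[Event] and import spark.implicits._ in scope:
val counts = ds
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(countEvents)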
flatMapGroupsWithState
• Applies the given function to each group of data, while maintaining a user-defined per-group state
• Invoked once per group in batch queries
• Invoked once per group at every trigger (when the group has data) in streaming queries
• Requires the user to provide an output mode for the function
flatMapGroupsWithState
• mapGroupsWithState is a special case with:
  • Output mode: Update
  • Output size: 1 row per group
• Supports both Processing Time and Event Time timeouts
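To see the special case concretely, the running count from the earlier sketch can be written with flatMapGroupsWithState by fixing Update mode and emitting exactly one row per group (again a hedged sketch, reusing the hypothetical Event type):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val counts = ds
  .groupByKey(_.id)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[Long]) =>
      val updated = state.getOption.getOrElse(0L) + events.size
      state.update(updated)
      Iterator.single((id, updated))  // one row per group; could be zero or many
  }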
Outline
• Structured Streaming Concepts
• Stateful Processing in Structured Streaming
• Use Cases
• Demos
Alerting

val monitoring = stream
  .as[Event]
  .groupByKey(_.id)
  .flatMapGroupsWithState(Append, GST.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))

Monitor a stream using custom stateful logic with timeouts.
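The elided body might look like the sketch below; the AlertState and Alert types, the error threshold, and the timeout handling are all hypothetical, not from the talk.

import org.apache.spark.sql.streaming.GroupState

case class Event(id: Int, errors: Int)      // hypothetical
case class AlertState(errorCount: Int)      // hypothetical
case class Alert(id: Int, message: String)  // hypothetical

def alertFunc(
    id: Int,
    events: Iterator[Event],
    state: GroupState[AlertState]): Iterator[Alert] = {
  if (state.hasTimedOut) {
    state.remove()  // no data for this key within the timeout: clean up
    Iterator.empty
  } else {
    val total = state.getOption.getOrElse(AlertState(0)).errorCount +
      events.map(_.errors).sum
    state.update(AlertState(total))
    state.setTimeoutDuration("1 hour")  // required with ProcessingTimeTimeout
    if (total > 100) Iterator.single(Alert(id, s"$id has seen $total errors"))
    else Iterator.empty
  }
}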
Sessionization

val sessions = stream
  .as[Event]
  .groupByKey(_.session_id)
  .mapGroupsWithState(GST.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .format("parquet")
  .start("/user/sessions")

Analyze sessions of user/system behavior.
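One hedged shape for the elided session-tracking function; the types, the 30-minute session gap, and the summary output are invented for the example (EventTimeTimeout also requires withWatermark on the input stream).

import org.apache.spark.sql.streaming.GroupState

case class Event(session_id: Int, timestamp: java.sql.Timestamp)   // hypothetical
case class SessionInfo(numEvents: Int)                             // hypothetical
case class SessionSummary(id: Int, numEvents: Int, done: Boolean)  // hypothetical

def sessionFunc(
    id: Int,
    events: Iterator[Event],
    state: GroupState[SessionInfo]): SessionSummary = {
  if (state.hasTimedOut) {
    // The watermark passed the session's timeout: the session is finished.
    val finished = state.get
    state.remove()
    SessionSummary(id, finished.numEvents, done = true)
  } else {
    val evs = events.toSeq  // the iterator is consumed twice below
    val info = SessionInfo(
      state.getOption.map(_.numEvents).getOrElse(0) + evs.size)
    state.update(info)
    // Close the session 30 minutes (event time) after its latest event.
    state.setTimeoutTimestamp(evs.map(_.timestamp.getTime).max + 30 * 60 * 1000)
    SessionSummary(id, info.numEvents, done = false)
  }
}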
Demo
SPARK SUMMIT 2017
DATA SCIENCE AND ENGINEERING AT SCALE
JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO
ORGANIZED BY spark-summit.org/2017
Discount Code: Databricks
We are hiring!
https://databricks.com/company/careers
Thank You
“Does anyone have any questions for my answers?” - Henry Kissinger