Arbitrary Stateful Aggregation and MERGE INTO - Data + AI Summit EU 2020

Databricks
Arbitrary Stateful Aggregation and MERGE INTO
Spark Structured Streaming + Delta Lake = “Double Metrics”
Jacek Laskowski / @jaceklaskowski / November 2020
About the Speaker
Jacek Laskowski is an IT Freelancer specializing in Apache
Spark, Delta Lake, Apache Kafka and Kafka Streams.
Contact me at jacek@japila.pl or DM on Twitter
@jaceklaskowski to discuss opportunities.
Best known for "The Internals Of" online books @
https://books.japila.pl
The Internals of Delta Lake
1. Available for free @
https://books.japila.pl/delta-lake-internals
Friendly Reminder
Should you have any questions,
feel free to ask them in the chat window.
I’m going to answer them at the end of the talk.
Thank you!
Client Requirements and Recommendations
1. A client wants to load Kafka records at
regular intervals
● Spark Structured Streaming
2. A client wants to do a stateful
aggregation in a custom per-group way
● KeyValueGroupedDataset.flatMapGroupsWithState
3. A client wants to update a Delta table
with aggregation results
● MERGE INTO
● DataStreamWriter.foreachBatch
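The third requirement (apply an upsert to the target table once per micro-batch) can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Spark or Delta Lake code: the table is a dict, and `merge_into` and `for_each_batch` are hypothetical names chosen to mirror MERGE INTO and DataStreamWriter.foreachBatch.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (key, aggregated value)


def merge_into(table: Dict[str, int], updates: Iterable[Row]) -> None:
    """Model of an upsert: update a row when its key matches, insert it otherwise."""
    for key, value in updates:
        # A dict assignment covers both branches:
        # WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT.
        table[key] = value


def for_each_batch(micro_batches: List[List[Row]], table: Dict[str, int]) -> None:
    """Model of a per-micro-batch sink: run one merge for every micro-batch."""
    for updates in micro_batches:
        merge_into(table, updates)


table: Dict[str, int] = {}
micro_batches = [
    [("a", 1), ("b", 1)],  # first trigger: both keys are inserted
    [("a", 3), ("c", 1)],  # second trigger: "a" is updated, "c" is inserted
]
for_each_batch(micro_batches, table)
print(table)
```

After both micro-batches the table holds the latest value per key, which is the behaviour the client asks for.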
Arbitrary Stateful Aggregation
1. KeyValueGroupedDataset.flatMapGroupsWithState (scaladoc)
2. A user-defined per-group state
3. For a static batch Dataset, the function will be invoked once per group
4. For a streaming Dataset, the function will be invoked for each group repeatedly
in every trigger, and updates to each group's state will be saved across
invocations
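The per-group contract described above can be sketched without Spark. The model below is an assumption-laden illustration in plain Python: `GroupState`, `run_trigger` and `running_count` are hypothetical names, the state store is a dict, and timeouts are ignored, so the function is only invoked for groups that received records in a trigger.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, int]  # (key, value)


class GroupState:
    """Hypothetical stand-in for a per-group state handle."""

    def __init__(self, value: Optional[int] = None) -> None:
        self._value = value

    @property
    def exists(self) -> bool:
        return self._value is not None

    def get(self) -> Optional[int]:
        return self._value

    def update(self, value: int) -> None:
        self._value = value


StateFunc = Callable[[str, Iterator[int], GroupState], Iterable[Record]]


def run_trigger(records: List[Record], store: Dict[str, int], func: StateFunc) -> List[Record]:
    """One trigger: group records by key, call func once per group, save the state."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    output: List[Record] = []
    for key, values in groups.items():
        state = GroupState(store.get(key))  # state left over from earlier triggers
        output.extend(func(key, iter(values), state))
        if state.exists:
            store[key] = state.get()  # persisted for the next invocation
    return output


def running_count(key: str, values: Iterator[int], state: GroupState) -> Iterable[Record]:
    """User-defined aggregation: count of records seen so far for this key."""
    total = (state.get() if state.exists else 0) + sum(1 for _ in values)
    state.update(total)
    yield (key, total)


store: Dict[str, int] = {}
first = run_trigger([("a", 10), ("b", 20), ("a", 30)], store, running_count)
second = run_trigger([("a", 40)], store, running_count)
print(first, second, store)
```

A static batch Dataset corresponds to a single `run_trigger` call; a streaming Dataset corresponds to repeated calls that share the same `store`.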
The Code
1. Code?! Open IntelliJ IDEA! 😎
Delta Lake Users Mailing List
1. Multiple executions of flatMapGroupsWithState when DeltaTable.merge
Possible Way-Outs (“Solutions”)
1. Separate Delta table for state?
a. Avoid multiple passes over flatMapGroupsWithState
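Why multiple passes matter can be shown with a small plain-Python model. It rests on an assumption the slides do not spell out: that the merge reads its source more than once and that the source is re-evaluated on every read. `make_lazy_source` and `two_pass_merge` are hypothetical names, and materialising the source once is a swapped-in illustration of "avoid multiple passes", not the separate-state-table idea itself.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]  # (key, running count)


def make_lazy_source(keys: List[str], state: Dict[str, int]) -> Callable[[], List[Row]]:
    """Return a source that re-runs the stateful aggregation on every evaluation."""

    def evaluate() -> List[Row]:
        rows: List[Row] = []
        for key in keys:
            state[key] = state.get(key, 0) + 1  # side effect on the per-group state
            rows.append((key, state[key]))
        return rows

    return evaluate


def two_pass_merge(table: Dict[str, int], source: Callable[[], List[Row]]) -> int:
    """Assumed merge shape: one pass to find matches, a second pass to write."""
    matched = {key for key, _ in source() if key in table}  # pass 1 evaluates the source
    for key, value in source():                             # pass 2 evaluates it again
        table[key] = value
    return len(matched)


# Lazy source: the stateful function runs twice, so every count is doubled.
state_lazy: Dict[str, int] = {}
table_lazy: Dict[str, int] = {}
two_pass_merge(table_lazy, make_lazy_source(["a", "b"], state_lazy))

# Materialised source: evaluate once, then hand the same rows to both passes.
state_once: Dict[str, int] = {}
table_once: Dict[str, int] = {}
rows = make_lazy_source(["a", "b"], state_once)()
two_pass_merge(table_once, lambda: rows)

print(table_lazy, table_once)
```

In this model the lazy variant reports a count of 2 for a single record per key, while the materialised variant reports 1.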
O’Reilly Learning Spark
2nd Edition
1. Available for free @ https://dbricks.co/get-ebook
2. Chapter 9 “Building Reliable Data Lakes with
Apache Spark” touches on Delta Lake
a. Also the competitors: Apache Hudi and
Apache Iceberg
That’s all folks! Thank you! ❤
/me Answering questions...
Jacek Laskowski / @jaceklaskowski / jacek@japila.pl