Arbitrary Stateful Aggregation and MERGE INTO - Data + AI Summit EU 2020

Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!

Arbitrary Stateful Aggregation
and MERGE INTO
Spark Structured Streaming + Delta Lake = “Double Metrics”
Jacek Laskowski / @jaceklaskowski / November 2020

About the Speaker
Jacek Laskowski is an IT freelancer specializing in Apache
Spark, Delta Lake, Apache Kafka and Kafka Streams.
Contact me at jacek@japila.pl or DM me on Twitter
@jaceklaskowski to discuss opportunities.
Best known for "The Internals Of" online books @
https://books.japila.pl

The Internals of Delta Lake
1. Available for free @
https://books.japila.pl/delta-lake-internals

Friendly Reminder
Should you have any questions,
feel free to ask them in the chat window.
I'll answer them at the end of the talk.
Thank you!

Client Requirements and Recommendations
1. A client wants to load Kafka records at
regular intervals
● Spark Structured Streaming
2. A client wants to do a stateful
aggregation in a custom per-group way
● KeyValueGroupedDataset.flatMapGroupsWithState
3. A client wants to update a Delta table
with aggregation results
● MERGE INTO
● DataStreamWriter.foreachBatch
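
The third requirement can be pictured with a small plain-Python stand-in. This is only a sketch of upsert semantics, not Spark or Delta Lake code: the dict `table` plays the Delta table, the loop over `micro_batches` plays the role of DataStreamWriter.foreachBatch handing over one micro-batch at a time, and `merge_into` is a hypothetical helper named after the SQL statement.

```python
from typing import Dict, Iterable, List, Tuple

def merge_into(target: Dict[str, int],
               source: Iterable[Tuple[str, int]]) -> Tuple[int, int]:
    """Upsert source rows into target, keyed on the first tuple element.

    Returns (rows updated, rows inserted). Hypothetical helper, not a Delta API.
    """
    updated = inserted = 0
    for key, value in source:
        if key in target:
            target[key] = value   # WHEN MATCHED THEN UPDATE
            updated += 1
        else:
            target[key] = value   # WHEN NOT MATCHED THEN INSERT
            inserted += 1
    return updated, inserted

# Two micro-batches of (key, aggregate) rows; the values are made up.
micro_batches: List[List[Tuple[str, int]]] = [
    [("a", 2), ("b", 1)],
    [("a", 3), ("c", 1)],
]

table: Dict[str, int] = {}
stats: List[Tuple[int, int]] = []
for batch_id, batch in enumerate(micro_batches):  # stand-in for foreachBatch
    stats.append(merge_into(table, batch))

print(table)
print(stats)
```

Each micro-batch overwrites the rows whose keys already exist and appends the rest, which is why the pattern suits a table of per-group aggregation results.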

Arbitrary Stateful Aggregation
1. KeyValueGroupedDataset.flatMapGroupsWithState (scaladoc)
2. A user-defined per-group state
3. For a static batch Dataset, the function will be invoked once per group
4. For a streaming Dataset, the function will be invoked for each group repeatedly
in every trigger, and updates to each group's state will be saved across
invocations
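
The streaming behaviour described above can be mimicked without Spark. The sketch below is a plain-Python simulation under simplifying assumptions: `GroupState`, `run_trigger` and `running_count` are hypothetical names, state timeouts are ignored, and the function is only invoked for groups that have rows in the current micro-batch.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

class GroupState:
    """Hypothetical stand-in for a per-group state handle."""
    def __init__(self, value: Any = None) -> None:
        self._value = value

    @property
    def exists(self) -> bool:
        return self._value is not None

    def get(self) -> Any:
        return self._value

    def update(self, new_value: Any) -> None:
        self._value = new_value

def run_trigger(batch: Iterable[Tuple[str, int]],
                func: Callable[[str, Iterator[int], GroupState], Iterable],
                store: Dict[str, Any]) -> List:
    """Simulate one trigger: group rows by key, call func once per group,
    and save each group's state so the next trigger can see it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in batch:
        groups[key].append(value)
    output: List = []
    for key, values in groups.items():
        state = GroupState(store.get(key))
        output.extend(func(key, iter(values), state))
        if state.exists:
            store[key] = state.get()
    return output

def running_count(key: str, values: Iterator[int], state: GroupState):
    """User-defined per-group logic: a running row count per key."""
    total = (state.get() if state.exists else 0) + sum(1 for _ in values)
    state.update(total)
    yield (key, total)

store: Dict[str, Any] = {}
first = run_trigger([("a", 1), ("b", 2), ("a", 3)], running_count, store)
second = run_trigger([("a", 4)], running_count, store)
print(first)
print(second)
```

The second trigger sees the count saved for key "a" by the first one, which is the "saved across invocations" part of the scaladoc.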

The Code
1. Code?! Open IntelliJ IDEA! 😎

Delta Lake Users Mailing List
1. Multiple executions of flatMapGroupsWithState when DeltaTable.merge

Possible Way-Outs (“Solutions”)
1. Separate Delta table for state?
a. Avoid multiple passes over flatMapGroupsWithState
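
Why multiple passes would double the metrics can be shown with a toy model. The sketch below is plain Python, not Delta Lake's implementation: it assumes that a merge scans its source twice (once to find matches, once to write) and that the source is recomputed on every scan. `two_pass_merge` and `stateful_step` are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

Rows = List[Tuple[str, int]]

calls = {"n": 0}

def stateful_step() -> Rows:
    """Stand-in for the stateful aggregation; counts how often it runs."""
    calls["n"] += 1
    return [("a", 2), ("b", 1)]

def two_pass_merge(target: Dict[str, int], scan: Callable[[], Rows]) -> Set[str]:
    """Toy merge that reads its source twice (an assumption of this sketch)."""
    # Pass 1: find the keys that already exist in the target.
    matched = {key for key, _ in scan() if key in target}
    # Pass 2: write updates and inserts.
    for key, value in scan():
        target[key] = value
    return matched

# Lazy source: every scan re-runs the stateful step.
calls["n"] = 0
two_pass_merge({"a": 1}, stateful_step)
lazy_calls = calls["n"]

# Materialized source: run the stateful step once, then merge from the result,
# similar in spirit to writing the state output to a separate table first.
calls["n"] = 0
materialized = stateful_step()
two_pass_merge({"a": 1}, lambda: materialized)
materialized_calls = calls["n"]

print(lazy_calls, materialized_calls)
```

In this model the stateful step runs twice per merge when the source is lazy and once when its output is materialized first.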

O’Reilly Learning Spark
2nd Edition
1. Available for free @ https://dbricks.co/get-ebook
2. Chapter 9 “Building Reliable Data Lakes with
Apache Spark” touches on Delta Lake
a. Also covers the competitors: Apache Hudi and
Apache Iceberg

That’s all folks! Thank you! ❤
/me Answering questions...
Jacek Laskowski / @jaceklaskowski / jacek@japila.pl
