Arbitrary Stateful Aggregation and MERGE INTO - Data + AI Summit EU 2020
Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!
1.
Arbitrary Stateful Aggregation
and MERGE INTO
Spark Structured Streaming + Delta Lake = “Double Metrics”
Jacek Laskowski / @jaceklaskowski / November 2020
2.
About the Speaker
Jacek Laskowski is an IT Freelancer specializing in Apache
Spark, Delta Lake, Apache Kafka and Kafka Streams.
Contact me at jacek@japila.pl or DM on Twitter
@jaceklaskowski to discuss opportunities.
Best known for "The Internals Of" online books @
https://books.japila.pl
3.
The Internals of Delta Lake
1. Available for free @
https://books.japila.pl/delta-lake-internals
4.
Friendly Reminder
Should you have any questions,
feel free to ask them in the chat window.
I’m going to answer them at the end of the talk.
Thank you!
5.
Client Requirements and Recommendations
1. A client wants to load Kafka records at
regular intervals
● Spark Structured Streaming
2. A client wants to do a stateful
aggregation in a custom per-group way
● KeyValueGroupedDataset.flatMapGroupsWithState
3. A client wants to update a Delta table
with aggregation results
● MERGE INTO
● DataStreamWriter.foreachBatch
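The third requirement can be pictured with a small plain-Python model. This is not Spark or Delta Lake code: `merge_into` and `foreach_batch` are hypothetical names that only mimic the upsert behaviour of MERGE INTO (update matched keys, insert the rest), applied once per micro-batch in the way DataStreamWriter.foreachBatch hands each batch to a handler.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (group key, aggregated value); an assumed row shape


def merge_into(target: Dict[str, int], source: Iterable[Row]) -> None:
    """Upsert: update the row when the key matches, insert it otherwise."""
    for key, value in source:
        target[key] = value  # a dict assignment covers both branches


def foreach_batch(
    batches: List[List[Row]],
    handler: Callable[[List[Row], int], None],
) -> None:
    """Call the handler once per micro-batch with the batch and its id."""
    for batch_id, batch in enumerate(batches):
        handler(batch, batch_id)


# The dict stands in for the Delta table of aggregation results.
table: Dict[str, int] = {}
foreach_batch(
    [[("a", 4), ("b", 2)], [("a", 14)]],
    lambda batch, batch_id: merge_into(table, batch),
)
```

After both batches the model table holds the latest value per key: the second batch updates `a` and leaves `b` untouched.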
6.
Arbitrary Stateful Aggregation
1. KeyValueGroupedDataset.flatMapGroupsWithState (scaladoc)
2. A user-defined per-group state
3. For a static batch Dataset, the function will be invoked once per group
4. For a streaming Dataset, the function will be invoked for each group repeatedly
in every trigger, and updates to each group's state will be saved across
invocations
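The streaming behaviour described above can be sketched in plain Python. The names `run_micro_batch` and `running_sum` are hypothetical, the dict is only a stand-in for Spark's state store, and the model ignores timeouts and output modes. It shows just one thing: the user function is called once per group in each trigger and sees the state saved by the previous trigger.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, int]  # (group key, value); an assumed record shape
StateFunc = Callable[[str, List[int], Optional[int]], Tuple[int, List[Row]]]


def run_micro_batch(
    records: List[Row],
    state_store: Dict[str, int],
    func: StateFunc,
) -> List[Row]:
    """Group one micro-batch by key and call func once per group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    output: List[Row] = []
    for key, values in groups.items():
        old_state = state_store.get(key)  # None on the first invocation
        new_state, emitted = func(key, values, old_state)
        state_store[key] = new_state  # saved for the next trigger
        output.extend(emitted)
    return output


def running_sum(key: str, values: List[int], state: Optional[int]):
    """User-defined per-group logic: keep a running total as the state."""
    total = (state or 0) + sum(values)
    return total, [(key, total)]


store: Dict[str, int] = {}
first = run_micro_batch([("a", 1), ("b", 2), ("a", 3)], store, running_sum)
second = run_micro_batch([("a", 10)], store, running_sum)
```

In the second trigger the group `a` starts from the total saved by the first one, which is the "saved across invocations" part of the scaladoc wording.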
8.
Delta Lake Users Mailing List
1. Multiple executions of flatMapGroupsWithState when DeltaTable.merge
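A plain-Python model of why this can happen, under the assumption that the merge scans its source more than once (the Delta Lake documentation notes that merge may perform two scans of the source dataset). `LazySource` and `two_pass_merge` are hypothetical names, not Delta Lake API. The lazy source recomputes its rows on every scan, so the stateful step feeding it runs twice.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]  # (group key, aggregated value); an assumed row shape


class LazySource:
    """Stand-in for an uncached query result: recomputed on every scan."""

    def __init__(self, compute: Callable[[], List[Row]]) -> None:
        self._compute = compute

    def rows(self) -> List[Row]:
        return self._compute()


def two_pass_merge(target: Dict[str, int], source: LazySource) -> int:
    """Scan the source twice: once to find matches, once to write."""
    matched = {key for key, _ in source.rows() if key in target}  # scan 1
    for key, value in source.rows():  # scan 2
        target[key] = value
    return len(matched)


invocations = {"count": 0}


def stateful_aggregation() -> List[Row]:
    """Models the stateful step; counts how often it is executed."""
    invocations["count"] += 1
    return [("a", 4), ("b", 2)]


target: Dict[str, int] = {"a": 1}
updated = two_pass_merge(target, LazySource(stateful_aggregation))
```

The merge result is correct, but `invocations["count"]` ends at 2 for a single merge. Any counter or metric updated inside the stateful step is doubled in the same way.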
9.
Possible Ways Out (“Solutions”)
1. Separate Delta table for state?
a. Avoid multiple passes over flatMapGroupsWithState
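One reading of this idea, sketched in plain Python: run the stateful step once, keep its output in a separate place, and let the merge scan that stored copy. `materialize` and `two_pass_merge` are hypothetical names, and the list of staged rows is only a stand-in for a separate Delta table. The sketch assumes the merge scans its source twice.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]  # (group key, aggregated value); an assumed row shape

calls = {"count": 0}


def stateful_aggregation() -> List[Row]:
    """Models the stateful step; counts how often it is executed."""
    calls["count"] += 1
    return [("a", 4), ("b", 2)]


def materialize(compute: Callable[[], List[Row]]) -> List[Row]:
    """Run the computation exactly once and keep a copy of its rows."""
    return list(compute())


def two_pass_merge(target: Dict[str, int], rows: List[Row]) -> int:
    """Scan the stored rows twice: once to find matches, once to write."""
    matched = {key for key, _ in rows if key in target}  # scan 1
    for key, value in rows:  # scan 2
        target[key] = value
    return len(matched)


staged = materialize(stateful_aggregation)  # the only execution
target: Dict[str, int] = {"a": 1}
updated = two_pass_merge(target, staged)
```

Both scans now read the staged rows, so the stateful step runs once per batch. The trade-off is an extra write and extra storage for the staged results.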
10.
O’Reilly Learning Spark
2nd Edition
1. Available for free @ https://dbricks.co/get-ebook
2. Chapter 9 “Building Reliable Data Lakes with
Apache Spark” touches on Delta Lake
a. Also the competitors: Apache Hudi and
Apache Iceberg
11.
That’s all folks! Thank you! ❤
/me Answering questions...
Jacek Laskowski / @jaceklaskowski / jacek@japila.pl