Delta Lake and Structured Streaming can be combined for write-heavy use cases. This talk walks through a use case at Intuit where we built a Merge on Read (MOR) architecture to meet a tight data-freshness SLA. With MOR there are different ways to view the fresh data, so we also cover how we performance-tested those options to arrive at the best method for this use case.
Building MOR on Delta Lake for Highly Random Updates
1. Building Merge on Read on Delta Lake
Justin Breese
Senior Solutions Architect
Nick Karpov
Resident Solutions Architect
2. Who are we?
Justin Breese
justin.breese@databricks.com | Los Angeles
Senior Strategic Solutions Architect
I pester Nick with a lot of questions and thoughts
Nick Karpov
nick.karpov@databricks.com | San Francisco
Senior Resident Solutions Architect
History & Music
3. Agenda
▪ Background: Copy on Write (COW) & Merge on Read (MOR)
▪ Use case, challenges, & MOR strategies
▪ Testing: choosing the right MOR strategy
▪ Rematerialization?
4. Problem statement(s)
▪ Dealing with highly random and update heavy CDC streams
▪ Needing fresh data to be readable at any given time
Summary
▪ Using MOR allows for faster writes while still serving reads that meet the SLA
5. Building Merge on Read on Delta Lake
▪ What is Merge on Read (MOR) and Copy on Write (COW)?
▪ What is the use case?
▪ Why did we build it?
▪ What is the architecture?
▪ How to test and verify it?
7. Copy on Write (COW)
▪ TL;DR the merge is done during the write
▪ Default config for Delta Lake
▪ Data is “merged” into a Delta table by physically rewriting existing files with modifications before making them available to the reader
▪ In Delta Lake, merge is a three-step process
▪ Great for write once read many scenarios
8. Delta Lake Merge - Under the hood
▪ source: new data, target: existing data (Delta table)
▪ Phase 1: Find the input files in target that are touched by the rows that
satisfy the condition and verify that no two source rows match with the
same target row [innerJoin]
▪ Phase 2: Read the touched files again and write new files with updated
and/or inserted rows
▪ Phase 3: Use the Delta protocol to atomically remove the touched files and
add the new files (write stuff to object/blob store)
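A minimal sketch of the kind of merge described above, using the DeltaTable Python API (table names, the join key, and the source path are illustrative, not the exact code from the talk):

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "baseline")           # existing Delta table (target)
source = spark.read.format("delta").load("/cdc/batch")   # new data (source)

# Upsert: matched target rows are rewritten into new files, unmatched source rows are inserted.
(target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```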
9. COW: What is Delta Lake doing under the hood?
Phase 2 double click
Phase 2: Read the touched files again and write new files with updated and/or inserted rows.
The type of join can vary depending on the conditions of the merge:
▪ Insert-only merge (e.g. no updates/deletes) → leftAntiJoin on the source to find the inserts
▪ Matched-only clauses (e.g. when matched) → rightOuterJoin
▪ Else (e.g. you have updates, deletes, and inserts) → fullOuterJoin
10. Merge on Read (MOR)
▪ TL;DR the “merge” is done during the read
▪ Common strategy: don’t logically merge until you NEED the result
▪ Implementation? Two tables and a view
▪ Materialized table
▪ Changelog table (can be a diff, Avro, parquet, etc.)
▪ View that acts as the referee between the two and is the source of truth
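For the changelog side, a minimal sketch of how an append-only changes table might be fed from the CDC stream, tagging each micro-batch with its batchId (table name, checkpoint path, and the parsed Kafka dataframe are assumptions for illustration):

```python
from pyspark.sql import functions as F

def append_changes(microbatch_df, batch_id):
    # Append only: never rewrite existing files, just tag this micro-batch and add it.
    (microbatch_df
        .withColumn("batchId", F.lit(batch_id))
        .write.format("delta")
        .mode("append")
        .partitionBy("batchId")
        .saveAsTable("changes"))

(cdc_stream                         # dataframe already parsed from the Kafka CDC feed
    .writeStream
    .foreachBatch(append_changes)
    .option("checkpointLocation", "/chk/changes")
    .start())
```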
11. Which one do you pick? Well, it depends...
▪ MOR: write many, read less
▪ COW: write less, read many
13. Use case info
▪ 100-200 events/second (6k-12k/minute)
▪ CDC data coming from Kafka
▪ usually 1-3 columns are changing
▪ partial updates
▪ Each row has a unique ID
▪ 200GB active files; growing at a small rate
▪ SLA: updates visible to point lookups in <5 min
▪ Currently doing daily batch overwrites; data can be up to 24 hours
stale
14. Initial observations and problems encountered
▪ Lots of updates: 96% of events
▪ Matching condition is uniformly distributed across the target
▪ No natural partitioning keys
▪ A sample of 50k events could touch rows from 2k different days
▪ Default Delta Lake Merge configs were not performing well
▪ Ended up rewriting almost the entire table each merge
16. Snapshot & Changeset
▪ Snapshot: base table ▪ Changeset: append only

                   Snapshot (base table)             Changeset (append only)
Primary key        id                                id
Most recent data   fragno                            fragno
Partitioning       optional (depends on use case)    Structured Streaming batchId (this is important!)
Data columns       many data columns ...             many data columns ...
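A minimal sketch of the two tables, keeping the id/fragno/batchId fields from the slide and using an illustrative stand-in for the many data columns:

```python
# Snapshot (baseline): one row per primary key, partitioning optional.
spark.sql("""
  CREATE TABLE IF NOT EXISTS baseline (
    id BIGINT,
    fragno BIGINT,        -- recency marker
    payload STRING        -- stand-in for the many data columns
  ) USING DELTA
""")

# Changeset: append only, partitioned by the Structured Streaming batchId
# so old batches can later be dropped with a metadata-only delete.
spark.sql("""
  CREATE TABLE IF NOT EXISTS changes (
    id BIGINT,
    fragno BIGINT,
    payload STRING,
    batchId BIGINT
  ) USING DELTA
  PARTITIONED BY (batchId)
""")
```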
17. Changeset
▪ Get the unique values in the changeset - the latest record per primaryKey
▪ Because the updates are partial, I need to coalesce(changes, baseline) column by column
▪ Check whether the dataframe can be broadcast*
▪ If I can broadcast 1GB of data and each row is 364 bytes, then I can broadcast up to ~2.8M rows. If the changeset is > 2.8M rows ⇒ do not broadcast -- because memory!
* if your changeset is small enough
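A minimal sketch of preparing the ranked changeset, assuming the id/fragno columns above; the 2.8M-row threshold is just the back-of-the-envelope number from the slide:

```python
from pyspark.sql import functions as F, Window
from pyspark.sql.functions import broadcast

changes = spark.read.table("changes")

# Keep only the latest change per primary key.
w = Window.partitionBy("id").orderBy(F.col("fragno").desc())
ranked_changeset = (changes
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))

# Broadcast only if the changeset is small enough (~2.8M rows at ~364 bytes/row ≈ 1GB).
if ranked_changeset.count() <= 2_800_000:
    ranked_changeset = broadcast(ranked_changeset)
```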
18. View: Methods to join rankedChangeset into the baseline
▪ Now that we have our changeset… we still need to compare these values to the baseline table to get the latest by id
▪ There are several methods to do this (a sketch of one follows below):
▪ doubleRankOver
▪ fullOuterJoin
▪ leftJoinAntiJoin: broadcastable!
▪ leftJoinUnionInserts: broadcastable! Great if you are guaranteed that your inserts are not upserts!
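A minimal sketch of the fullOuterJoin variant, assuming the ranked changeset from the previous step and coalescing each column so partial updates fall back to the baseline value:

```python
from pyspark.sql import functions as F

baseline = spark.read.table("baseline")
data_cols = [c for c in baseline.columns if c != "id"]

# Full outer join: changed rows take the change's value, untouched columns keep
# the baseline's, and brand-new ids fall through as inserts.
joined = baseline.alias("b").join(ranked_changeset.alias("c"), "id", "full_outer")
current_view = joined.select(
    F.col("id"),
    *[F.coalesce(F.col(f"c.{c}"), F.col(f"b.{c}")).alias(c) for c in data_cols]
)
```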
20. Testing [normally] takes a long time
But it doesn’t have to!
▪ Things to consider:
▪ How many tests are sufficient?
▪ How can I make them as even as possible?
▪ What do you actually want to test?
▪ Why is this part so hard and manual?
▪ Databricks has a `/runs/submit` API - starts a fresh cluster for each run
▪ Databricks notebooks have widgets which act as params
▪ Let’s do 3 tests for each viewType (method) and each operation (read/write) ⇒ 3 * 4 * 2 = 24 tests!
21. Create the widgets in your Notebooks
Create your results payload (note: we are calling the widgets as params)
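The original slide shows a screenshot of these cells; a minimal sketch of what they might look like (widget and field names are illustrative):

```python
# Widgets become parameters that the runs/submit API can set per test run.
dbutils.widgets.text("viewType", "fullOuterJoin")
dbutils.widgets.text("operation", "read")
dbutils.widgets.text("run", "0")

# Results payload built from the widget values (the params for this run).
payload = {
    "viewType": dbutils.widgets.get("viewType"),
    "operation": dbutils.widgets.get("operation"),
    "run": int(dbutils.widgets.get("run")),
}
```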
22. Create a timer function
Save to Delta table (note: payload)
Operation to test
Case statement to match the method and supply the correct view - send it to the stopwatch utility
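Again, the slide shows code that isn't reproduced here; a minimal sketch of a stopwatch utility that times the operation, attaches the timing to the payload, and saves it to a Delta table (view names and the lookup predicate are illustrative):

```python
import time
from pyspark.sql import Row

def stopwatch(payload, operation_fn):
    """Time an operation, attach the result to the payload, and persist it."""
    start = time.time()
    operation_fn()                      # e.g. run the read or write under test
    payload["seconds"] = time.time() - start
    # Append the result row to a Delta table for later analysis.
    (spark.createDataFrame([Row(**payload)])
        .write.format("delta").mode("append").saveAsTable("perf_results"))

# Match the method widget to the correct view, then time a point lookup.
view_type = dbutils.widgets.get("viewType")
if view_type == "fullOuterJoin":
    df = spark.table("current_state_full_outer")
elif view_type == "leftJoinUnionInserts":
    df = spark.table("current_state_left_join_union")
else:
    raise ValueError(f"unknown viewType: {view_type}")

stopwatch(payload, lambda: df.filter("id = 42").collect())
```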
23. Configuring the API
Check out my GitHub [https://github.com/justinbreese/databricks-gems#perftestautomationpy]
Made a simple script that leverages the Databricks runs/submit API
Run info
Cluster info
Here is what we will create:

Run   Operation   Method
0     Read        leftJoinUnionInserts
1     Read        leftJoinUnionInserts
2     Read        leftJoinUnionInserts
0     Read        outerJoined
1     Read        outerJoined
2     Read        outerJoined
0     Read        antiJoinLeftJoinUnion
1     Read        antiJoinLeftJoinUnion
2     Read        antiJoinLeftJoinUnion
24. Calling the API
Check out GitHub [https://github.com/justinbreese/databricks-gems#perftestautomationpy]
Made a simple script that leverages the Databricks runs/submit API
python3 perfTestAutomation.py -t <userAccessToken> -s 0 -j artifacts/perfTest.json
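Under the hood, the script posts one runs/submit request per test; a minimal sketch of that call, assuming a notebook that reads the widgets above (the workspace host, notebook path, and cluster spec are illustrative):

```python
import requests

def submit_run(host, token, run, operation, method):
    """Submit one perf-test run on a fresh cluster via the Jobs runs/submit API."""
    body = {
        "run_name": f"perfTest-{method}-{operation}-{run}",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 8,
        },
        "notebook_task": {
            "notebook_path": "/Shared/perfTest",
            "base_parameters": {"run": str(run), "operation": operation, "viewType": method},
        },
    }
    resp = requests.post(
        f"{host}/api/2.0/jobs/runs/submit",
        headers={"Authorization": f"Bearer {token}"},
        json=body,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]
```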
28. Periodic Rematerialization
Because you need to reset your baseline table for read perf
▪ If changes keep getting appended, you’ll have more and more rows to compare against at read time
▪ This makes your read performance degrade over time
▪ Therefore, you need a periodic job that resets your baseline table
▪ And yes, there are some choices for how to do this:
Method        Consideration(s)
Merge         Easy; very helpful if you have many larger partitions and only a small subset of them need to change; built into Delta Lake
Overwrite     Easy; great if you do not have or cannot partition, or if all/most partitions need to be changed
replaceWhere  Moderate; can only be used if you have partitions; built into Delta Lake
29. Periodic Rematerialization
▪ Now that we’ve materialized the new changes into the baseline, we want to delete those batches that we
don’t need
▪ Since we partitioned by batchId, when we delete those previous batches, this is a metadata only
operation and super fast/cheap - line 68
▪ We do this so we don’t duplicate changes and because we don’t need them anymore
▪ Remember: we have an initial bronze table that has all of our changes so we always have this if we ever need them
Code! Remember that we said that the batchId is important?
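The slide's code isn't reproduced here; a minimal sketch of the idea, assuming the baseline/changes tables and view sketched earlier and an illustrative last_materialized_batch checkpoint value:

```python
# 1. Rematerialize: fold the current view back into the baseline.
#    (Our use case ended up using a full overwrite; merge/replaceWhere are alternatives.
#    Delta's snapshot isolation allows overwriting a table the view reads from.)
(current_view.write.format("delta")
    .mode("overwrite")
    .saveAsTable("baseline"))

# 2. Drop the changeset batches that are now materialized. Because the table is
#    partitioned by batchId, this delete is a metadata-only operation.
#    last_materialized_batch is an illustrative checkpoint, not from the talk.
spark.sql(f"DELETE FROM changes WHERE batchId <= {last_materialized_batch}")
```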
30. Periodic Rematerialization
▪ Yes, you can even do some perfTesting on this to understand which
method fits your use case best
▪ Our use case ended up using overwrite as it was a better fit
▪ Changes happened very randomly; going back up to 2000+ days
▪ Dataset was ~200GB; partitioning could not be made effective
▪ 200GB is small and we can overwrite the complete table in <10 min with 80 cores
31. Final recap
▪ Talked about the use case
▪ Introduced the MOR architecture
▪ Talked about the two tables
▪ Different views and understanding their differences
▪ How to test the different view methods
▪ Periodic rematerialization
32. This wouldn’t have been possible without help from:
Chris Fish
Daniel Tomes
Tathagata Das (TD)
Burak Yavuz
Joe Widen
Denny Lee
Paul Roome