Jasper Groot, Eventbrite
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
#UnifiedDataAnalytics #SparkAISummit
Introduction
Personal Introduction
- Data engineering in the event industry for 4+ years
- Using Spark for 3+ years
- Currently at Eventbrite
Outline
• Structured Streaming
– In a nutshell
• Delta Lake
– How it works
• Data Warehousing
– Detailed example using these tools
– Gotchas
Structured Streaming
In a nutshell
• Introduced in Spark 2.0
• Streams are unbounded DataFrames
• Familiar API for anyone who has used DataFrames
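A minimal sketch of the idea, not code from the deck; the source path, schema, and column names are hypothetical:

```python
# Sketch: a streaming DataFrame is built and transformed with the same API
# as a batch DataFrame -- only the read entry point changes.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# readStream instead of read: the result is an unbounded DataFrame
events = (
    spark.readStream
    .format("json")
    .schema("event_id STRING, event_type STRING, ts TIMESTAMP")
    .load("/data/incoming/events")
)

# familiar DataFrame operations apply unchanged
counts = events.filter(col("event_type") == "order").groupBy("event_type").count()
```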
Structured Streaming
How streaming DataFrames differ
• Some operations are more restrictive:
– Distinct
– Joins
– Aggregations (must come after joins)
Structured Streaming - Recovery
Recovery is done through checkpointing
• Checkpointing uses write-ahead logs
• Stores running aggregates and progress
• Checkpoint location must be an HDFS-compatible file system
There are limitations on resuming from a
checkpoint after updating application logic
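A sketch of enabling recovery, assuming a streaming DataFrame `counts` is already defined; paths are hypothetical:

```python
# Sketch: the checkpoint location stores the write-ahead log, running
# aggregates, and progress, so the query can resume after failure.
# It must live on an HDFS-compatible file system (HDFS, S3, ...).
query = (
    counts.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/event_counts")
    .start("/warehouse/event_counts")
)
```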
Structured Streaming + Data
Warehousing
• Importance of watermarking
• Managing late data
• Using foreachBatch
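A sketch of watermarking, assuming a streaming DataFrame `events` with a timestamp column `ts`; names and intervals are hypothetical:

```python
# Sketch: the watermark bounds how late data may arrive before it is dropped,
# which lets Spark discard old aggregation state instead of keeping it forever.
from pyspark.sql.functions import window

late_tolerant = (
    events
    .withWatermark("ts", "10 minutes")            # accept data up to 10 min late
    .groupBy(window("ts", "5 minutes"), "event_type")
    .count()
)
```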
Delta Lake
• Open Sourced in 2019
• Parquet under the hood
• Enables ACID transactions
• Supports time travel (querying past table versions)
• UPDATE & DELETE existing records
• Schema management options
Delta Lake
• Files can be backed by
– AWS S3
– Azure Blob Storage
– HDFS
• Datasets can be converted between Parquet and Delta Lake
• Some SQL support
Delta Lake - ACID Transactions
Works using a transaction log
• Transaction log tracks state
• Files will not be deleted during read
• Optimistic conflict resolution
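A toy model of optimistic conflict resolution, for illustration only; this is not Delta's code (Delta also checks whether concurrent commits actually conflict before failing):

```python
# Toy transaction log: a writer records the version it read, and its commit
# only wins if no one else committed in the meantime; otherwise it retries.
class TxnLog:
    def __init__(self):
        self.commits = []  # ordered list of committed actions

    @property
    def version(self):
        return len(self.commits) - 1

    def try_commit(self, read_version, actions):
        """Commit succeeds only if nothing was committed since we read."""
        if read_version != self.version:
            return False  # conflict: caller must re-read and retry
        self.commits.append(actions)
        return True

log = TxnLog()
v = log.version                                      # both writers read here
assert log.try_commit(v, ["add file-1"])             # first writer wins
assert not log.try_commit(v, ["add file-2"])         # second writer conflicts
assert log.try_commit(log.version, ["add file-2"])   # retry at new version
```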
Delta Lake - ACID Transactions
(merge code walkthrough from the slides: defining our dataset; aliases for merge; the join condition; values to update on a match; if the join condition is not met, insert)
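The merge walkthrough above can be sketched with the Delta Lake Python API; this assumes an active SparkSession `spark`, and the paths and column names are hypothetical:

```python
# Sketch of the merge pattern described above.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/warehouse/customers")
updates = spark.read.format("json").load("/staging/customer_updates")  # our dataset

(
    target.alias("t")                                  # aliases for merge
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id")            # join condition
    .whenMatchedUpdate(set={                           # values to update
        "email": "s.email",
        "updated_at": "s.updated_at",
    })
    .whenNotMatchedInsertAll()     # if the join condition is not met, insert
    .execute()
)
```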
Delta Lake - ACID Transactions
Delta tracks operations on files
• Not all operations take effect on the files immediately
• A new log file is created for each transaction
Delta Lake - File Management
Cleaning up
• Delta provides a VACUUM command
• VACUUM can be run with a retention period
– Default retention period is 7 days
– VACUUM with a short retention period can corrupt the table for active readers and writers
• VACUUM is not recorded in the transaction log
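A sketch of cleanup, assuming an active SparkSession `spark`; the path is hypothetical:

```python
# Sketch: VACUUM removes files no longer referenced by the transaction log
# and older than the retention period. Default retention is 7 days (168 h).
from delta.tables import DeltaTable

DeltaTable.forPath(spark, "/warehouse/customers").vacuum()      # 7-day default
# or with an explicit retention period, in hours:
# DeltaTable.forPath(spark, "/warehouse/customers").vacuum(168)
```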
Pulling it all together
Structured Streaming
• Leverages many strengths of the DataFrame API
• Gives a clean way to manage late data
• Makes it manageable to join multiple streams
Pulling it all together
Delta Lake
• Gives us ACID transactions
• Logs what has taken place
• Requires some file management
Data Warehousing
There are many ways to model data; let's stick to an example:
• Star Schema
• Source is MySQL
• Sink is S3
• Possibilities to export from S3
Data Warehousing
(architecture and schema diagrams)
Data Warehousing
(code: read stream from Kafka; the value comes in as binary; parse the message using a fixed schema)
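A sketch of that step, assuming an active SparkSession `spark` with the Kafka connector available; the broker, topic, and message schema are hypothetical:

```python
# Sketch: read the changelog stream from Kafka and parse the binary value.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([                         # fixed schema for the payload
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("updated_at", TimestampType()),
])

raw = (
    spark.readStream                          # read stream from Kafka
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "mysql.changelog")
    .load()
)

# value comes in as binary: cast to string, then parse with the fixed schema
parsed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
       .select("data.*")
)
```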
(code: parse the MySQL data; write the stream as Delta, leveraging checkpoints)
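A sketch of the write side, assuming a parsed streaming DataFrame `parsed`; paths are hypothetical:

```python
# Sketch: write the parsed changelog stream as a Delta table, leveraging a
# checkpoint location so the query can recover from failure.
query = (
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/customers_raw")
    .start("/lake/customers_raw")
)
```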
Data Warehousing
Type 2 Dimension
• A valid new record must
– update the previous version
– insert itself as the new version
• Delta merge is the way to go
• Process batches using foreachBatch
Data Warehousing
foreachBatch method
• Your function receives the micro-batch DataFrame and a batchId
• You are free to handle the DataFrame as you see fit
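A sketch of the hook, assuming a streaming DataFrame `parsed`; the function body and paths are hypothetical:

```python
# Sketch: foreachBatch hands each micro-batch to an ordinary function, where
# batch (non-streaming) APIs such as Delta merge become available.
def upsert_batch(batch_df, batch_id):
    # batch_df is a regular DataFrame; handle it as you see fit
    batch_df.write.format("delta").mode("append").save("/warehouse/dim_customer")

query = (
    parsed.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/dim_customer")
    .start()
)
```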
Data Warehousing
(merge code walkthrough from the slides: a NULL merge key guarantees an insert; join to the table to merge into; filter to the most recent records; select only the batch data and merge key; the match condition matches only current records; set the previous latest record as non-current; insert all data if there is no previous iteration)
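The Type 2 merge described above can be sketched with the common Delta SCD2 staging pattern: rows that need both an update (closing the old version) and an insert (the new version) are staged twice, once with their real key and once with a NULL merge key that can never match. This assumes an active SparkSession `spark`; table path and column names are hypothetical:

```python
# Sketch of the Type 2 dimension merge, run from inside foreachBatch.
from delta.tables import DeltaTable
from pyspark.sql.functions import col

def upsert_scd2(batch_df, batch_id):
    dim = DeltaTable.forPath(spark, "/warehouse/dim_customer")

    # rows whose key matches a current record -> will close the old version
    updates = batch_df.alias("b").join(
        dim.toDF().alias("d"),
        (col("b.id") == col("d.id")) & (col("d.is_current") == True),
    ).selectExpr("b.*", "b.id AS merge_key")   # select only batch data + key

    # every row is also staged with a NULL merge key -> guaranteed insert
    inserts = batch_df.selectExpr("*", "CAST(NULL AS STRING) AS merge_key")
    staged = updates.unionByName(inserts)

    (
        dim.alias("d")
        # match condition: only match current records
        .merge(staged.alias("s"), "d.id = s.merge_key AND d.is_current = true")
        # set the previous latest record as non-current
        .whenMatchedUpdate(set={
            "is_current": "false",
            "end_date": "s.updated_at",
        })
        # insert all data if there is no previous iteration
        .whenNotMatchedInsert(values={
            "id": "s.id",
            "name": "s.name",
            "start_date": "s.updated_at",
            "end_date": "CAST(NULL AS TIMESTAMP)",
            "is_current": "true",
        })
        .execute()
    )
```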
Data Warehousing - Gotchas
File Management
• Smaller trigger windows mean more files
– More files mean slower reads
• How useful is table history?
• File size optimization
Data Warehousing - Gotchas
Streaming joins
• Watermarks required for stream-to-stream joins
• Be aware of the latency of your streams
• Handle late data beyond the watermark
– Set failure conditions for your streaming applications
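A sketch of a stream-to-stream join with watermarks on both sides; the streams, columns, and intervals are hypothetical:

```python
# Sketch: both inputs carry a watermark, and the join adds a time-range
# constraint so Spark can bound the state it keeps for matching.
from pyspark.sql.functions import expr

orders_w = orders.withWatermark("order_ts", "30 minutes")
payments_w = payments.withWatermark("payment_ts", "30 minutes")

joined = orders_w.join(
    payments_w,
    (orders_w.order_id == payments_w.order_id)
    & (payments_w.payment_ts >= orders_w.order_ts)
    & (payments_w.payment_ts <= orders_w.order_ts + expr("INTERVAL 1 HOUR")),
)
```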