Jasper Groot, Eventbrite
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
#UnifiedDataAnalytics #SparkAISummit
Introduction
Personal Introduction
- Data engineering in the event industry for 4+ years
- Using Spark for 3+ years
- Currently at Eventbrite
Outline
• Structured Streaming
– In a nutshell
• Delta Lake
– How it works
• Data Warehousing
– Detailed example using these tools
– Gotchas
Structured Streaming
In a nutshell
• Introduced in Spark 2.0
• Streams are unbounded DataFrames
• Familiar API for anyone who has used DataFrames
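A minimal sketch of the idea, not code from the deck; the source path, schema, and column names are hypothetical:

```python
# Sketch: a streaming DataFrame is built and transformed with the same API
# as a batch DataFrame -- only the read entry point changes.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# readStream instead of read: the result is an unbounded DataFrame
events = (
    spark.readStream
    .format("json")
    .schema("event_id STRING, event_type STRING, ts TIMESTAMP")
    .load("/data/incoming/events")
)

# familiar DataFrame operations apply unchanged
counts = events.filter(col("event_type") == "order").groupBy("event_type").count()
```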
Structured Streaming
How streaming DataFrames differ
• Some operations are more restrictive:
– Distinct
– Joins
– Aggregations (must come after joins)
Structured Streaming - Recovery
Recovery is done through checkpointing
• Checkpointing uses write-ahead logs
• Stores running aggregates and progress
• Checkpoint location must be an HDFS-compatible file system
There are limitations on resuming from a
checkpoint after updating application logic
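A sketch of enabling recovery, assuming a streaming DataFrame `counts` is already defined; paths are hypothetical:

```python
# Sketch: the checkpoint location stores the write-ahead log, running
# aggregates, and progress, so the query can resume after failure.
# It must live on an HDFS-compatible file system (HDFS, S3, ...).
query = (
    counts.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/event_counts")
    .start("/warehouse/event_counts")
)
```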
Structured Streaming + Data
Warehousing
• Importance of watermarking
• Managing late data
• Using foreachBatch
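A sketch of watermarking, assuming a streaming DataFrame `events` with a timestamp column `ts`; names and intervals are hypothetical:

```python
# Sketch: the watermark bounds how late data may arrive before it is dropped,
# which lets Spark discard old aggregation state instead of keeping it forever.
from pyspark.sql.functions import window

late_tolerant = (
    events
    .withWatermark("ts", "10 minutes")            # accept data up to 10 min late
    .groupBy(window("ts", "5 minutes"), "event_type")
    .count()
)
```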
Delta Lake
• Open Sourced in 2019
• Parquet under the hood
• Enables ACID transactions
• Supports time travel (querying past table versions)
• UPDATE & DELETE existing records
• Schema management options
Delta Lake
• Files can be backed by
– AWS S3
– Azure Blob Storage
– HDFS
• Datasets can be converted between Parquet and Delta Lake
• Some SQL support
Delta Lake - ACID Transactions
Works using a transaction log
• Transaction log tracks state
• Files will not be deleted during read
• Optimistic conflict resolution
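A toy model of optimistic conflict resolution, for illustration only; this is not Delta's code (Delta also checks whether concurrent commits actually conflict before failing):

```python
# Toy transaction log: a writer records the version it read, and its commit
# only wins if no one else committed in the meantime; otherwise it retries.
class TxnLog:
    def __init__(self):
        self.commits = []  # ordered list of committed actions

    @property
    def version(self):
        return len(self.commits) - 1

    def try_commit(self, read_version, actions):
        """Commit succeeds only if nothing was committed since we read."""
        if read_version != self.version:
            return False  # conflict: caller must re-read and retry
        self.commits.append(actions)
        return True

log = TxnLog()
v = log.version                                      # both writers read here
assert log.try_commit(v, ["add file-1"])             # first writer wins
assert not log.try_commit(v, ["add file-2"])         # second writer conflicts
assert log.try_commit(log.version, ["add file-2"])   # retry at new version
```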
Delta Lake - ACID Transactions
(merge code walkthrough from the slides: defining our dataset; aliases for merge; the join condition; values to update on a match; if the join condition is not met, insert)
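The merge walkthrough above can be sketched with the Delta Lake Python API; this assumes an active SparkSession `spark`, and the paths and column names are hypothetical:

```python
# Sketch of the merge pattern described above.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/warehouse/customers")
updates = spark.read.format("json").load("/staging/customer_updates")  # our dataset

(
    target.alias("t")                                  # aliases for merge
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id")            # join condition
    .whenMatchedUpdate(set={                           # values to update
        "email": "s.email",
        "updated_at": "s.updated_at",
    })
    .whenNotMatchedInsertAll()     # if the join condition is not met, insert
    .execute()
)
```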
Delta Lake - ACID Transactions
Delta tracks operations on files
• Not all operations take effect on the files immediately
• A new log file is created for each transaction
Delta Lake - File Management
Cleaning up
• Delta provides a VACUUM command
• VACUUM can be run with a retention period
– Default retention period is 7 days
– VACUUM with a short retention period can corrupt the table for active readers and writers
• VACUUM is not recorded in the transaction log
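A sketch of cleanup, assuming an active SparkSession `spark`; the path is hypothetical:

```python
# Sketch: VACUUM removes files no longer referenced by the transaction log
# and older than the retention period. Default retention is 7 days (168 h).
from delta.tables import DeltaTable

DeltaTable.forPath(spark, "/warehouse/customers").vacuum()      # 7-day default
# or with an explicit retention period, in hours:
# DeltaTable.forPath(spark, "/warehouse/customers").vacuum(168)
```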
Pulling it all together
Structured Streaming
• Leverages many strengths of the DataFrame API
• Gives a clean way to manage late data
• Makes it manageable to join multiple streams
Pulling it all together
Delta Lake
• Gives us ACID transactions
• Logs what has taken place
• Requires some file management
Data Warehousing
There are many ways to model data; let's stick to an example:
• Star Schema
• Source is MySQL
• Sink is S3
• Possibilities to export from S3
Data Warehousing
(architecture and schema diagrams)
Data Warehousing
(code: read stream from Kafka; the value comes in as binary; parse the message using a fixed schema)
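A sketch of that step, assuming an active SparkSession `spark` with the Kafka connector available; the broker, topic, and message schema are hypothetical:

```python
# Sketch: read the changelog stream from Kafka and parse the binary value.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([                         # fixed schema for the payload
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("updated_at", TimestampType()),
])

raw = (
    spark.readStream                          # read stream from Kafka
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "mysql.changelog")
    .load()
)

# value comes in as binary: cast to string, then parse with the fixed schema
parsed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
       .select("data.*")
)
```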
(code: parse the MySQL data; write the stream as Delta, leveraging checkpoints)
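A sketch of the write side, assuming a parsed streaming DataFrame `parsed`; paths are hypothetical:

```python
# Sketch: write the parsed changelog stream as a Delta table, leveraging a
# checkpoint location so the query can recover from failure.
query = (
    parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/customers_raw")
    .start("/lake/customers_raw")
)
```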
Data Warehousing
Type 2 Dimension
• A valid new record must
– update the previous version
– insert itself as the new version
• Delta merge is the way to go
• Process batches using foreachBatch
Data Warehousing
foreachBatch method
• Your function receives the micro-batch DataFrame and a batchId
• You are free to handle the DataFrame as you see fit
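A sketch of the hook, assuming a streaming DataFrame `parsed`; the function body and paths are hypothetical:

```python
# Sketch: foreachBatch hands each micro-batch to an ordinary function, where
# batch (non-streaming) APIs such as Delta merge become available.
def upsert_batch(batch_df, batch_id):
    # batch_df is a regular DataFrame; handle it as you see fit
    batch_df.write.format("delta").mode("append").save("/warehouse/dim_customer")

query = (
    parsed.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/dim_customer")
    .start()
)
```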
Data Warehousing
(merge code walkthrough from the slides: a NULL merge key guarantees an insert; join to the table to merge into; filter to the most recent records; select only the batch data and merge key; the match condition matches only current records; set the previous latest record as non-current; insert all data if there is no previous iteration)
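The Type 2 merge described above can be sketched with the common Delta SCD2 staging pattern: rows that need both an update (closing the old version) and an insert (the new version) are staged twice, once with their real key and once with a NULL merge key that can never match. This assumes an active SparkSession `spark`; table path and column names are hypothetical:

```python
# Sketch of the Type 2 dimension merge, run from inside foreachBatch.
from delta.tables import DeltaTable
from pyspark.sql.functions import col

def upsert_scd2(batch_df, batch_id):
    dim = DeltaTable.forPath(spark, "/warehouse/dim_customer")

    # rows whose key matches a current record -> will close the old version
    updates = batch_df.alias("b").join(
        dim.toDF().alias("d"),
        (col("b.id") == col("d.id")) & (col("d.is_current") == True),
    ).selectExpr("b.*", "b.id AS merge_key")   # select only batch data + key

    # every row is also staged with a NULL merge key -> guaranteed insert
    inserts = batch_df.selectExpr("*", "CAST(NULL AS STRING) AS merge_key")
    staged = updates.unionByName(inserts)

    (
        dim.alias("d")
        # match condition: only match current records
        .merge(staged.alias("s"), "d.id = s.merge_key AND d.is_current = true")
        # set the previous latest record as non-current
        .whenMatchedUpdate(set={
            "is_current": "false",
            "end_date": "s.updated_at",
        })
        # insert all data if there is no previous iteration
        .whenNotMatchedInsert(values={
            "id": "s.id",
            "name": "s.name",
            "start_date": "s.updated_at",
            "end_date": "CAST(NULL AS TIMESTAMP)",
            "is_current": "true",
        })
        .execute()
    )
```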
Data Warehousing - Gotchas
File Management
• Smaller trigger windows mean more files
– More files mean slower reads
• How useful is table history?
• File size optimization
Data Warehousing - Gotchas
Streaming joins
• Watermarks required for stream-to-stream joins
• Be aware of the latency of your streams
• Handle late data beyond the watermark
– Set failure conditions for your streaming applications
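A sketch of a stream-to-stream join with watermarks on both sides; the streams, columns, and intervals are hypothetical:

```python
# Sketch: both inputs carry a watermark, and the join adds a time-range
# constraint so Spark can bound the state it keeps for matching.
from pyspark.sql.functions import expr

orders_w = orders.withWatermark("order_ts", "30 minutes")
payments_w = payments.withWatermark("payment_ts", "30 minutes")

joined = orders_w.join(
    payments_w,
    (orders_w.order_id == payments_w.order_id)
    & (payments_w.payment_ts >= orders_w.order_ts)
    & (payments_w.payment_ts <= orders_w.order_ts + expr("INTERVAL 1 HOUR")),
)
```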