Fully Utilizing Spark for Data Validation
Kevin Kho
Open Source Community Engineer at Prefect
Agenda
▪ Starting Data Validation
▪ Great Expectations
▪ Pandera
▪ Fugue
▪ Validation by Partition
Case Study
• Fictitious Company
• Food Delivery
• Demand Pricing per Location
• Prices updated every 10 minutes
Data Validation
Model Training - Weekly pipeline
Loading Data From Source → Data Validation → Model Training
Price Inference - 10 minute pipeline
Loading Data From Source → Data Validation → Infer New Price → Update App Price
Common Validations
• Null Values
• Correct Schema
• DataFrame Shape
• Numeric values within range
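The four checks above can be sketched in plain Pandas before reaching for a framework. This is a minimal illustration, not code from the talk: the helper name run_common_checks, the column names, and the price bounds are all assumptions.

```python
import pandas as pd

# Hypothetical sample in the shape of the case study's pricing data.
prices = pd.DataFrame({
    "Location": ["FL", "FL", "CA"],
    "Price": [9.0, 7.5, 14.1],
})

def run_common_checks(df, expected_columns, min_rows, price_range):
    """Return one pass/fail flag per common validation (illustrative helper)."""
    low, high = price_range
    return {
        # Null values: no missing entries anywhere in the frame.
        "no_nulls": bool(not df.isnull().values.any()),
        # Correct schema: expected column names, and Price is numeric.
        "schema_ok": list(df.columns) == expected_columns
        and bool(pd.api.types.is_numeric_dtype(df["Price"])),
        # DataFrame shape: at least the minimum number of rows.
        "shape_ok": len(df) >= min_rows,
        # Numeric values within range (bounds inclusive).
        "range_ok": bool(df["Price"].between(low, high).all()),
    }

results = run_common_checks(prices, ["Location", "Price"], 1, (5, 20))
```

Each entry in results is a plain boolean, so a pipeline can fail fast with all(results.values()) or report which check failed.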
Validation in Spark with Great Expectations
Great Expectations
• "mostly" parameter: an expectation passes when at least the given fraction of values meets it
• Same interface for Pandas and Spark
• Different result formats
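To make the "mostly" idea concrete, here is a plain-Pandas sketch of the semantics. It deliberately does not call the Great Expectations API: the function values_between and its result keys are assumptions made for illustration only.

```python
import pandas as pd

def values_between(series, min_value, max_value, mostly=1.0):
    """Pass when at least `mostly` of the non-null values fall in range.

    Illustrative re-implementation of the idea, not the library's own code.
    """
    non_null = series.dropna()
    in_range = non_null.between(min_value, max_value)
    fraction_ok = float(in_range.mean()) if len(non_null) else 1.0
    return {
        "success": fraction_ok >= mostly,
        "unexpected_fraction": 1.0 - fraction_ok,
        "unexpected_values": non_null[~in_range].tolist(),
    }

# Hypothetical prices with one outlier out of eight values.
prices = pd.Series([9.0, 7.5, 8.5, 14.1, 12.3, 11.75, 12.5, 250.0])

strict = values_between(prices, 5, 20)               # every value must pass
lenient = values_between(prices, 5, 20, mostly=0.8)  # 80% is enough
```

With seven of eight values in range, the strict check fails while the lenient one passes, and both report the offending value, which is the kind of detail a richer result format would surface.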
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Is there a more lightweight framework?
Pandera
• Pandas only
• Built-in validations
• Statistical validations
• Easily extensible
Pandera - Sample Code
Comparison of Validation Frameworks
Great Expectations
▪ Spark support
▪ Flexible Success Criteria
▪ Detailed Outputs
▪ Data Documentation
▪ Notifications
▪ CLI
Pandera
▪ Pandas only
▪ Lightweight
▪ Hypothesis Testing
▪ Decorators to Wrap Code
Using Pandera (and Pandas libraries) in Spark
Fugue
• Friendlier interface than UDF
• Decouple logic and execution
• Write code once and scale seamlessly
Fugue
Interfaces: Python, SQL
Execution engines: Pandas, Spark, Dask
Fugue - Basic Code
Combining Fugue and Pandera
Validation by Partition
Motivation - FoodSloth’s Expansion
Example Data - FoodSloth’s Pricing
AsOfTime  Location  Sublocation    Price
12:30     FL        Tampa          9
12:30     FL        Miami          7.5
12:30     FL        Orlando        8.5
12:30     CA        San Francisco  14.1
12:30     CA        Los Angeles    12.3
12:30     CA        San Mateo      11.75
12:30     CA        San Diego      12.5
Validation by Partition
Takeaways
• Data Validation
• Great Expectations
• Pandera
• Fugue
• Validation by Partition
End of Presentation