Fully Utilizing Spark for Data Validation

Kevin Kho
Open Source Community Engineer at Prefect

Agenda
▪ Starting Data Validation
▪ Great Expectations
▪ Pandera
▪ Fugue
▪ Validation by Partition

Case Study
• Fictitious Company
• Food Delivery
• Demand Pricing per Location
• Update rate every 10 minutes

Data Validation
• Model Training - Weekly pipeline
▪ Loading Data From Source → Data Validation → Model Training
• Price Inference - 10 minute pipeline
▪ Loading Data From Source → Data Validation → Infer New Price → Update App Price

Common Validations
• Null Values
• Correct Schema
• DataFrame Shape
• Numeric values within range
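
The four checks above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark or Pandas, so it runs without any dependencies; the sample rows, the `EXPECTED_SCHEMA` and `RANGES` constants, and the `validate` helper are all hypothetical names chosen for this example, not part of any framework in the talk.

```python
from numbers import Real

# Hypothetical sample rows shaped like a pricing table.
rows = [
    {"Location": "FL", "Sublocation": "Tampa", "Price": 9.0},
    {"Location": "FL", "Sublocation": "Miami", "Price": 7.5},
    {"Location": "CA", "Sublocation": "San Diego", "Price": None},
]

# Assumed schema and value ranges, for illustration only.
EXPECTED_SCHEMA = {"Location": str, "Sublocation": str, "Price": Real}
RANGES = {"Price": (0, 20)}


def validate(rows, schema, ranges, min_rows=1):
    """Return a list of failure messages; an empty list means the data is valid."""
    failures = []

    # DataFrame shape: require a minimum number of rows.
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")

    for i, row in enumerate(rows):
        # Correct schema: column names must match exactly.
        if set(row) != set(schema):
            failures.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue

        for col, expected_type in schema.items():
            value = row[col]
            if value is None:
                # Null values.
                failures.append(f"row {i}: {col} is null")
            elif not isinstance(value, expected_type):
                # Correct schema: column types.
                failures.append(f"row {i}: {col} has type {type(value).__name__}")
            elif col in ranges:
                # Numeric values within range.
                low, high = ranges[col]
                if not low <= value <= high:
                    failures.append(f"row {i}: {col}={value} outside [{low}, {high}]")

    return failures


failures = validate(rows, EXPECTED_SCHEMA, RANGES)
clean_failures = validate(rows[:2], EXPECTED_SCHEMA, RANGES)
print(failures)
```

Returning a list of failures rather than raising on the first one lets a pipeline decide whether to stop, alert, or continue.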

Validation in Spark with Great Expectations

Great Expectations
• mostly parameter (minimum fraction of rows that must pass an expectation)
• Same interface for Spark as for Pandas
• Different result formats
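
To make the mostly parameter concrete, here is a rough sketch of its semantics in plain Python. It is not the Great Expectations API: the function name `expect_values_between` and the keys of the result dictionary are hypothetical, and skipping nulls before computing the fraction is an assumption of this sketch.

```python
def expect_values_between(values, low, high, mostly=1.0):
    """Pass if at least `mostly` (a fraction in [0, 1]) of non-null values lie in [low, high].

    Illustrative stand-in only; names and result keys are hypothetical.
    """
    # Assumption of this sketch: nulls are skipped rather than counted as failures.
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if not low <= v <= high]

    if non_null:
        fraction_ok = (len(non_null) - len(unexpected)) / len(non_null)
    else:
        fraction_ok = 1.0  # nothing to check, so nothing failed

    return {
        "success": fraction_ok >= mostly,
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


# Prices taken from the FoodSloth example data in this deck.
prices = [9, 7.5, 8.5, 14.1, 12.3, 11.75, 12.5]

strict = expect_values_between(prices, 7, 13)               # every value must pass
lenient = expect_values_between(prices, 7, 13, mostly=0.8)  # 80% must pass
print(strict["success"], lenient["success"])
```

With a range of 7 to 13, one of the seven prices falls outside, so the strict check fails while the check with `mostly=0.8` succeeds. This is the kind of flexible success criterion the slide refers to.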

Great Expectations - Detailed Results

Great Expectations - Data Documentation

Is there a more lightweight framework?

Pandera
• Pandas only
• Built-in validations
• Statistical validations
• Easily extensible

Comparison of Validation Frameworks
• Great Expectations
▪ Spark support
▪ Flexible Success Criteria
▪ Detailed Outputs
▪ Data Documentation
▪ Notifications
▪ CLI
• Pandera
▪ Pandas only
▪ Lightweight
▪ Hypothesis Testing
▪ Decorators to Wrap Code

Using Pandera (and Pandas libraries) in Spark

Fugue
• Friendlier interface than UDF
• Decouple logic and execution
• Write code once and scale seamlessly

Fugue
• Interfaces: Python, SQL
• Execution engines: Pandas, Spark, Dask

Motivation - FoodSloth’s Expansion

Example Data - FoodSloth’s Pricing
AsOfTime  Location  Sublocation    Price
12:30     FL        Tampa          9
12:30     FL        Miami          7.5
12:30     FL        Orlando        8.5
12:30     CA        San Francisco  14.1
12:30     CA        Los Angeles    12.3
12:30     CA        San Mateo      11.75
12:30     CA        San Diego      12.5

Takeaways
• Data Validation
• Great Expectations
• Pandera
• Fugue
• Validation by Partition

Validation by Partition
• Motivation
• Geographic differences
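
The geographic-differences motivation can be illustrated with the FoodSloth example data: a single price range that suits Florida rejects every California row, while a range chosen per Location accepts both. The sketch below uses plain Python as a stand-in for a per-partition function that would run on Pandas or Spark. The `PRICE_RANGES` values and the `validate_partition` helper are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Rows from the FoodSloth example data: (AsOfTime, Location, Sublocation, Price).
rows = [
    ("12:30", "FL", "Tampa", 9.0),
    ("12:30", "FL", "Miami", 7.5),
    ("12:30", "FL", "Orlando", 8.5),
    ("12:30", "CA", "San Francisco", 14.1),
    ("12:30", "CA", "Los Angeles", 12.3),
    ("12:30", "CA", "San Mateo", 11.75),
    ("12:30", "CA", "San Diego", 12.5),
]

# Hypothetical acceptable price ranges per Location (not from the talk).
PRICE_RANGES = {"FL": (7, 10), "CA": (11, 15)}


def validate_partition(location, partition):
    """Return the rows of one partition whose Price is outside that Location's range."""
    low, high = PRICE_RANGES[location]
    return [row for row in partition if not low <= row[3] <= high]


# Partition the data by Location, then validate each partition with its own rule.
partitions = defaultdict(list)
for row in rows:
    partitions[row[1]].append(row)

failures = {loc: validate_partition(loc, part) for loc, part in partitions.items()}

# For contrast: one global range (here Florida's) applied to every row.
global_failures = [row for row in rows if not 7 <= row[3] <= 10]

print(failures)
print(len(global_failures))
```

Because the validation logic takes one partition at a time, the same function can be applied to each group independently, which is what makes it a good fit for a distributed engine.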
