Building the Petcare Data Platform using Delta Lake and Kyte: Our Spark ETL Pipeline
George Claireaux & Kirby Prowting
Data Engineers at Mars Petcare
Agenda
Platform Introduction & Context
▪ Mars Petcare Data Platform
▪ Our Data Landscape
Advantages of a Databricks & Spark ETL Solution
▪ How does it work?
▪ Benefits and learnings
Using Delta Lake for ETL Config
▪ Benefits and our approach
Leveraging Delta Lake for exposing data to Data Scientists
▪ User Advantages
Platform Introduction & Context
Mars Petcare Data Platform
▪ Consistent format and location for data assets from across Mars Petcare
▪ Central single source of truth for Mars Petcare analytics
▪ Shared documentation and collaboration across analytics teams
Our Data Landscape
▪ Many different business units
▪ Varied approaches to primary data collection and structure at source
▪ Huge diversity in data structure, quality and format
▪ Wide range of source systems
▪ New data sources to ingest arriving consistently
▪ Technically literate data scientists using Databricks as data consumers
▪ Standardised data to enable cross-business analysis
▪ Fast and performant with tables on the order of >1 billion rows
Advantages of a Databricks
& Spark ETL Solution
Tech Stack
Spark
• Utilisation of open source libraries
• JDBC connections to sources
• Automated parsing of unstructured data
• ‘Infinitely’ scalable
Delta Lake
• ACID compliant
• Nicely woven into the Databricks fabric
• Time travel!
Azure
• The ‘Mars Choice’
• Vast suite of out-of-the-box tools
• Databricks integration
ETL Flow
[Diagram: multiple sources, each feeding into its own connector template]
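The deck shows the flow only as a diagram, but a connector template essentially boils down to a parameterised JDBC read landed as a Delta table. The function name, paths and options below are a hypothetical sketch, not Kyte's actual code:

```python
# Hypothetical sketch of a JDBC connector template: read one source table
# over JDBC and land it in the consistent sink format (a Delta table).
# All names, paths and options here are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_connector(jdbc_url, source_table, sink_path, user, password):
    # Read the source table using Spark's built-in JDBC data source.
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", source_table)
          .option("user", user)
          .option("password", password)
          .load())
    # Land the data as Delta so every source arrives in the same format.
    df.write.format("delta").mode("append").save(sink_path)

run_connector(
    jdbc_url="jdbc:sqlserver://source-host:1433;database=sales",
    source_table="dbo.orders",
    sink_path="/mnt/platform/raw/sales/orders",
    user="etl_reader",
    password="<from-secret-scope>",
)
```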
Control with Databricks API
▪ Custom Dashboard
▪ Creating runs / schedules
▪ Monitoring runs
▪ Current state of data environment
▪ Unit testing from DevOps
▪ Creating unit-test runs on Pull Requests
▪ Able to spin up specific cluster spec for
pinpoint testing
Enabling development of a bespoke ecosystem above and beyond an ETL pipeline
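The deck doesn't show the calls themselves; below is a minimal sketch against the Databricks Jobs REST API (the runs/submit and runs/get endpoints are real, while the workspace URL, token, cluster spec and notebook path are placeholders), spinning up a purpose-built cluster for a unit-test run on a pull request:

```python
# Submit a one-time run on a specific cluster spec and poll it to completion.
# Workspace URL, token, cluster spec and notebook path are placeholders.
import time
import requests

HOST = "https://<workspace>.azuredatabricks.net"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Create a unit-test run for a pull request, with a pinpoint cluster spec.
resp = requests.post(f"{HOST}/api/2.1/jobs/runs/submit", headers=HEADERS, json={
    "run_name": "unit-tests-pr",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Repos/etl/tests/run_unit_tests"},
})
run_id = resp.json()["run_id"]

# Monitor the run until it reaches a terminal state.
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get", headers=HEADERS,
                         params={"run_id": run_id}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Result:", state.get("result_state"))
        break
    time.sleep(30)
```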
Schema Evolution

Day 1 – initial load
| Column 1 (integer) | Column 2 (float) | Column 3 (string) |
|---|---|---|
| 1 | 1.0 | one |
• Schema is detected as shown
• Schema is stored in config metadata

Day 2
| Column 1 (integer) | Column 2 (float) | Column 3 (string) | Column 4 (boolean) |
|---|---|---|---|
| 1 | 1.0 | one | true |
• Schema is compared to our ‘truth’ in config (Day 1)
• Columns match up + additional column

Day 3
| Column 1 (date) | Column 2 (float) | Column 3 (string) | Column 4 (boolean) |
|---|---|---|---|
| 10/10/2020 | 1.0 | one | true |
• Schema is compared to our ‘truth’ in config (Day 2)
• Datatype change detected and data blocked

Day 4
| Column 1 (integer) | Column 2 (float) | Column 3 (string) |
|---|---|---|
| 1 | 1.0 | one |
• Schema is compared to our ‘truth’ in config (Day 2)
• Columns match up with a dropped column
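A minimal sketch of the comparison the four days above walk through, assuming the ‘truth’ is held as a column-name-to-datatype mapping in config; the function and representation are illustrative, not Kyte's internals:

```python
def check_schema(incoming_df, truth_schema):
    """Compare an incoming DataFrame against the 'truth' schema from config.

    truth_schema: dict mapping column name -> datatype string (illustrative).
    Added columns (Day 2) and dropped columns (Day 4) are tolerated;
    a datatype change (Day 3) blocks the load.
    """
    incoming = {f.name: f.dataType.simpleString()
                for f in incoming_df.schema.fields}

    added = set(incoming) - set(truth_schema)
    changed = {c for c in set(incoming) & set(truth_schema)
               if incoming[c] != truth_schema[c]}

    if changed:
        # Day 3 behaviour: datatype change detected, data blocked.
        raise ValueError(f"Datatype change detected, data blocked: {sorted(changed)}")

    # Day 2 behaviour: accept the load and store the evolved schema
    # back in the config metadata as the new 'truth'.
    return {**truth_schema, **{c: incoming[c] for c in added}}
```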
Source Control & Collaboration
▪ Managed with Azure DevOps
▪ Utilizes traditional Git workflows
▪ Allows engineers to deploy the entire project code base to a user’s Databricks environment for dev work and testing
▪ A defined Git and DevOps strategy leads to better collaboration and fewer bugs introduced
[Diagram: .py files synced between a local machine and Databricks via Git push/pull]
Using Delta Lake for
ETL Config
Main Benefits
ACID Transactions
▪ Makes concurrent writes threadsafe for ETL runs
▪ Allows a single master (per environment) set of configs rather than many spread-out files
▪ Manual changes are therefore easier to apply since there is a single master set
Zero Extra Infrastructure
▪ Saves the cost & complexity of spinning up a database to serve the same requirements
▪ Easy to interact with directly from Databricks
Versioning
▪ Fast, easy restores using time travel if anything goes wrong
▪ Can track the history of updates using the transaction log to debug issues
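These versioning bullets map onto standard Delta Lake operations; a brief sketch, where the table path is a placeholder and `spark` is the ambient SparkSession in a Databricks notebook:

```python
# Track the history of updates to the config table via the transaction log.
spark.sql("DESCRIBE HISTORY delta.`/mnt/platform/config/tables`").show(truncate=False)

# Time travel: read the config as it looked at a known-good version.
known_good = (spark.read.format("delta")
              .option("versionAsOf", 42)
              .load("/mnt/platform/config/tables"))

# Fast restore: overwrite the current table with that version.
(known_good.write.format("delta")
 .mode("overwrite")
 .save("/mnt/platform/config/tables"))
```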
Our Approach
[Diagram: config tables are defined in JSON and managed with Git push/pull. A ‘JSON to Delta Lake’ deployer, driven through the Databricks API, merges the changes into the config tables. On success, the JSON changes are pushed to the master Git branch; on failure, any attempted changes are reverted and the master Git branch is reset.]
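The merge step of that deployer could look roughly like the following, assuming JSON table definitions keyed by a hypothetical table_name column; the layout is illustrative rather than the actual implementation:

```python
from delta.tables import DeltaTable

# Load the JSON table definitions held under source control (path assumed).
updates = spark.read.option("multiLine", True).json("/mnt/repo/config/tables/*.json")

# Merge changes into the master Delta config table. Delta's ACID guarantees
# make the merge atomic, so a failure leaves the table untouched for the
# revert path in the diagram above.
config = DeltaTable.forPath(spark, "/mnt/platform/config/tables")
(config.alias("c")
 .merge(updates.alias("u"), "c.table_name = u.table_name")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```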
Leveraging Delta Lake for
exposing data to Data Scientists
User Advantages
Optimization
▪ Speed up query time and reduce costs through reduced compute
▪ Partition by common filters
▪ Z-order by join keys (IDs)
Versioned data
▪ Recreate research easily
▪ Validate models on updated datasets
▪ Allows analysis on the history of overwritten data
▪ “Freeze” tables for analysis on a static source without writing out and losing optimizations
Accessible meta-data
▪ Easy lookup of schema, datatypes, update history and optimizations without changing tools
▪ Surfaced easily through the HIVE metastore in Databricks
▪ Registering as unmanaged HIVE tables gives a simple, flexible access point to Databricks users
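The Delta commands behind these advantages, sketched with placeholder paths, table and column names (`df` stands in for a curated DataFrame):

```python
# Partition by common filters when writing, so typical queries prune files.
(df.write.format("delta")
 .partitionBy("business_unit", "load_date")
 .save("/mnt/platform/curated/transactions"))

# Z-order by join keys to co-locate related rows and cut compute on joins.
spark.sql("OPTIMIZE delta.`/mnt/platform/curated/transactions` ZORDER BY (pet_id)")

# Register as an unmanaged (external) HIVE table: a simple, flexible access
# point for Databricks users, with the data staying where it already lives.
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.transactions
    USING DELTA
    LOCATION '/mnt/platform/curated/transactions'
""")

# "Freeze" a table for reproducible research by pinning a version, without
# writing a copy out and losing the optimizations above.
frozen = (spark.read.format("delta")
          .option("versionAsOf", 100)
          .load("/mnt/platform/curated/transactions"))
```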
A better world for pets
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.