
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL Pipeline


At Mars Petcare (in a division known as Kinship Data & Analytics) we are building out the Petcare Data Platform – a cloud-based Data Lake solution.

  1. Building the Petcare Data Platform using Delta Lake and Kyte: Our Spark ETL Pipeline. George Claireaux & Kirby Prowting, Data Engineers at Mars Petcare
  2. Agenda ▪ Platform Introduction & Context: Mars Petcare Data Platform; Our Data Landscape ▪ Advantages of a Databricks & Spark ETL Solution: How does it work? Benefits and learnings ▪ Using Delta Lake for ETL Config: Benefits and our approach ▪ Leveraging Delta Lake for exposing data to Data Scientists: User Advantages
  3. Platform Introduction & Context
  4. Mars Petcare Data Platform ▪ Consistent format and location for data assets from across Mars Petcare ▪ Central single source of truth for Mars Petcare analytics ▪ Shared documentation and collaboration across analytics teams
  5. Mars Petcare
  6. Our Data Landscape. Source: ▪ Many different business units ▪ Varied approaches to primary data collection and structure at source ▪ Huge diversity in data structure, quality and format ▪ Wide range of source systems ▪ New data sources to ingest consistently arriving. Sink: ▪ Technically literate data scientists using Databricks as data consumers ▪ Standardised data to enable cross-business analysis ▪ Fast and performant with tables on the order of > 1 billion rows
  7. Advantages of a Databricks & Spark ETL Solution
  8. Tech Stack ▪ Spark: utilisation of open source libraries; JDBC connections to sources; automated parsing of unstructured data; ‘infinitely’ scalable ▪ Delta Lake: ACID compliant; nicely woven into the Databricks fabric; time travel! ▪ Azure: the ‘Mars Choice’; vast suite of out-of-the-box tools; Databricks integration
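
The JDBC point above is how connector templates typically pull from relational sources. A minimal PySpark sketch, assuming a SQL Server source and Databricks secret scopes; the URL, table, secret names and paths are illustrative placeholders, not the real Kyte code:

    # Minimal JDBC ingest sketch; `spark` and `dbutils` are provided by Databricks.
    # Hostname, database, table and secret names below are placeholders.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://source-host:1433;databaseName=petcare")
        .option("dbtable", "dbo.visits")
        .option("user", dbutils.secrets.get("etl-scope", "jdbc-user"))
        .option("password", dbutils.secrets.get("etl-scope", "jdbc-password"))
        .load()
    )

    # Land the raw extract in the lake as a Delta table
    df.write.format("delta").mode("append").save("/mnt/lake/raw/visits")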
  9. ETL Flow: [diagram] multiple sources feeding into per-source connector templates
  10. ETL Flow: [diagram repeated, same sources and connector templates]
  11. Control with Databricks API ▪ Custom Dashboard: creating runs / schedules; monitoring runs; current state of data environment ▪ Unit testing from DevOps: creating unit-test runs on Pull Requests; able to spin up a specific cluster spec for pinpoint testing. Enabling development of a bespoke ecosystem above and beyond an ETL pipeline
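
As a rough illustration of this slide, runs can be driven through the Databricks Jobs REST API (`/api/2.0/jobs/runs/submit` and `/api/2.0/jobs/runs/get`). The workspace URL, token, cluster spec and notebook path here are assumptions, not the team's actual configuration:

    import requests

    HOST = "https://<workspace>.azuredatabricks.net"   # placeholder workspace
    TOKEN = "<personal-access-token>"                  # placeholder token
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # Submit a one-off run, e.g. a unit-test run triggered by a Pull Request,
    # on a specific cluster spec for pinpoint testing
    run = requests.post(
        f"{HOST}/api/2.0/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "unit-tests-on-pr",
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Shared/kyte/run_unit_tests"},
        },
    ).json()

    # Poll the run state, as a monitoring dashboard would
    state = requests.get(
        f"{HOST}/api/2.0/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": run["run_id"]},
    ).json()["state"]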
  12. Schema Evolution ▪ Day 1 (initial load): Column 1 (integer), Column 2 (float), Column 3 (string); sample row 1, 1.0, one. Schema is detected as shown and stored in config metadata ▪ Day 2: Column 1 (integer), Column 2 (float), Column 3 (string), Column 4 (boolean). Schema is compared to our ‘truth’ in config (Day 1); columns match up plus an additional column ▪ Day 3: Column 1 (date), Column 2 (float), Column 3 (string), Column 4 (boolean); sample row 10/10/2020, 1.0, one, true. Schema is compared to our ‘truth’ in config (Day 2); datatype change detected and data blocked ▪ Day 4: Column 1 (integer), Column 2 (float), Column 3 (string). Schema is compared to our ‘truth’ in config (Day 2); columns match up with a dropped column
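
The Day 1-4 flow reduces to comparing the incoming schema against the stored ‘truth’: new and dropped columns are tolerated, datatype changes block the load. A hypothetical plain-Python sketch, not the actual Kyte implementation:

    def check_schema(incoming, truth):
        """incoming/truth: dicts of {column_name: datatype_string}."""
        changed = {c: (truth[c], t) for c, t in incoming.items()
                   if c in truth and truth[c] != t}
        if changed:
            # e.g. Day 3: Column 1 went from integer to date -> block the data
            raise ValueError(f"Datatype change detected, blocking load: {changed}")
        added = set(incoming) - set(truth)    # e.g. Day 2: accepted, truth updated
        dropped = set(truth) - set(incoming)  # e.g. Day 4: accepted
        return added, dropped

    # Day 2 against the Day 1 truth: one additional column, load accepted
    added, dropped = check_schema(
        {"Column 1": "integer", "Column 2": "float",
         "Column 3": "string", "Column 4": "boolean"},
        {"Column 1": "integer", "Column 2": "float", "Column 3": "string"},
    )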
  13. Source Control & Collaboration ▪ Managed with Azure DevOps ▪ Utilizes traditional Git workflows ▪ Allows engineers to deploy the entire project code base (.py files) from their local machine, via Git push/pull, to a user's Databricks environment for dev work and testing ▪ A defined Git and DevOps strategy leads to better collaboration and fewer bugs introduced
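
One way to picture the "deploy the code base to a Databricks environment" step is the Workspace API's import endpoint; this is a hedged sketch of that pattern, not necessarily how Kyte does it, and the host, token and paths are invented for illustration:

    import base64
    import requests

    HOST = "https://<workspace>.azuredatabricks.net"   # placeholder workspace
    TOKEN = "<personal-access-token>"                  # placeholder token

    # Read a local .py file and push it into a user's workspace folder
    with open("kyte/etl/connector.py", "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")

    requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "path": "/Users/dev@example.com/kyte/etl/connector",  # placeholder
            "format": "SOURCE",
            "language": "PYTHON",
            "content": content,
            "overwrite": True,
        },
    )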
  14. Using Delta Lake for ETL Config
  15. Main Benefits ▪ ACID Transactions: makes concurrent writes thread-safe for ETL runs; allows a single master (per environment) set of configs rather than many spread-out files; manual changes are therefore easier to apply since there is a single master set ▪ Zero Extra Infrastructure: saves the cost & complexity of spinning up a database to serve the same requirements; easy to interact with directly from Databricks ▪ Versioning: fast, easy restores using time travel if anything goes wrong; can track the history of updates using the transaction log to debug issues
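
A short sketch of the versioning benefits on a Delta config table; the path and version number are illustrative:

    # `spark` is the Databricks-provided session; the path is a placeholder.
    config_path = "/mnt/lake/config/table_definitions"

    # Track the history of updates via the Delta transaction log
    spark.sql(f"DESCRIBE HISTORY delta.`{config_path}`").show(truncate=False)

    # Time travel: read the config as it was at an earlier version
    old = spark.read.format("delta").option("versionAsOf", 42).load(config_path)

    # Fast, easy restore: overwrite the current table with the old snapshot
    old.write.format("delta").mode("overwrite").save(config_path)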
  16. Our Approach: [diagram] config tables are defined in JSON and managed with Git push/pull. A ‘JSON to Delta Lake’ deployer, driven through the Databricks API, merges changes into the config tables. On success, the JSON changes are pushed to the master Git branch; on failure, any attempted changes are reverted and the master Git branch is reset
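
The "merge changes into config tables" step can be pictured as a Delta MERGE upsert. A minimal sketch, assuming the deployer loads the JSON definitions into a DataFrame keyed by a table_name column; the paths and column name are assumptions:

    from delta.tables import DeltaTable

    # Incoming config definitions from the Git-managed JSON (placeholder path)
    updates = spark.read.json("/mnt/lake/config/json/table_definitions.json")

    # Current master config table in Delta (placeholder path)
    target = DeltaTable.forPath(spark, "/mnt/lake/config/table_definitions")

    # Upsert: update matching definitions, insert new ones
    (target.alias("t")
     .merge(updates.alias("u"), "t.table_name = u.table_name")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())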
  17. Our Approach
  18. Leveraging Delta Lake for exposing data to Data Scientists
  19. User Advantages ▪ Optimization: speed up query time and reduce costs through reduced compute; partition by common filters; Z-order by join keys (IDs) ▪ Versioned data: recreate research easily; validate models on updated datasets; allows analysis on the history of overwritten data; “freeze” tables for analysis on a static source without writing out and losing optimizations ▪ Accessible metadata: easy lookup of schema, datatypes, update history and optimizations without changing tools; surfaced easily through the Hive metastore in Databricks; registering as unmanaged Hive tables gives a simple, flexible access point to Databricks users
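
These advantages map onto standard Delta operations on Databricks. A sketch with illustrative table, path and column names:

    # `spark` is the Databricks-provided session; all names below are illustrative.
    df = spark.createDataFrame(
        [("p1", "bu_a", 3)],
        "pet_id string, business_unit string, visits int",
    )

    # Partition by a common filter column at write time
    (df.write.format("delta")
       .partitionBy("business_unit")
       .mode("overwrite")
       .save("/mnt/lake/curated/pet_visits"))

    # Z-order by join keys (IDs) to reduce data scanned on joins
    spark.sql("OPTIMIZE delta.`/mnt/lake/curated/pet_visits` ZORDER BY (pet_id)")

    # Register as an unmanaged Hive table: a simple access point for users
    spark.sql("CREATE DATABASE IF NOT EXISTS curated")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS curated.pet_visits
        USING DELTA LOCATION '/mnt/lake/curated/pet_visits'
    """)

    # "Freeze" a version for reproducible analysis via time travel
    frozen = (spark.read.format("delta")
              .option("versionAsOf", 0)
              .load("/mnt/lake/curated/pet_visits"))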
  20. A better world for pets
  21. A better world for pets
  22. A better world for pets
  23. Feedback: Your feedback is important to us. Don't forget to rate and review the sessions.
