Code Once Use Often with Declarative Data Pipelines


Did you know that 160 billion (160,000,000,000) pounds of food ends up in North American landfills each year? Flashfood helps reduce food waste by providing a mobile marketplace where grocers can sell food nearing its best-before date. In 2020 alone, Flashfood diverted 11.2 million pounds of food from landfills while saving shoppers 29 million dollars on groceries.

To operate and optimize the marketplace, Flashfood ingests, processes, and surfaces a wide variety of data from the core application, partners, and external sources. As the volume, variety, and velocity of sources and sinks grow, the complexity of scheduling and maintaining jobs grows in tandem. We noticed that this complexity stemmed largely from divergent implementations of core ETL mechanics rather than from the business logic itself.

We’ve implemented declarative data pipelines following a mantra of ‘code once, use often’ to tame this complexity. We started by building a highly configurable Apache Spark application that is initialized with details of the source, file type, transformation, load destination, and so on. We then extended Airflow’s DatabricksSubmitRunOperator, which let us customize the cluster and parameters used in execution. Finally, we used airflow-declarative to define DAGs in YAML, enabling us to set configurations, instantiate jobs, and orchestrate execution in a human-readable file.
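
As a rough sketch of what this looks like end to end (the exact airflow-declarative schema has more options than shown, and the operator, paths, and config fields below are hypothetical placeholders, not our production names), a declaratively defined sync might read:

dags:
  flashfood_sync:
    args:
      start_date: 2021-01-01
      schedule_interval: "0 6 * * *"
    operators:
      sync_orders:
        class: pipelines.operators.SyncTableOperator  # hypothetical custom operator
        args:
          config:
            source: {type: jdbc, table: orders}
            file_type: parquet
            load: {type: delta, path: s3://lake/bronze/orders}  # hypothetical path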

The declarative approach means less-specialized team members can set up an ETL pipeline with confidence, no longer needing deep knowledge of Apache Spark’s intricacies. Additionally, by ensuring that boilerplate logic is implemented only once, we reduced maintenance and increased delivery speed by 80%.


1. Code Once Use Often: Declarative Data Pipelines. Anthony Awuley, Carter Kilgour
2. Agenda
   • Flashfood
   • Problem
   • The Declarative Pipeline
   • Examples
   • Lessons Learned
   • Spark YAML
3. Food Waste: the larger problem
   • 160 billion pounds of food in North America ends up in landfills each year
   • Food waste makes up at least 6% of all greenhouse gas emissions globally
   • If international food waste were a country, it would be the third-largest contributor to GHG emissions, behind the US and China [1]
   [1] National Geographic, March 2016
4. Food Waste: the larger problem
   • According to usda.gov, about 30–40% of the US food supply ends up in landfills
   • In Canada, about 58% (35.5 million tonnes) of all food produced goes to waste annually [1]
   • 10.5 percent (13.7 million) of US households were food insecure at some time during 2019
   [1] Second Harvest, 2019
5. Flashfood
   • A marketplace for food nearing expiry
   • Grocers recover costs on shrink
   • Grocers reduce their carbon footprint
   • More families are fed fresh food affordably
   • In 2020 alone, Flashfood diverted 11.2 million pounds of food from landfills and saved shoppers 29 million dollars on groceries
6. Flashfood Data
   • Data Science: recommendation systems, fraud detection, dynamic pricing
   • Product: power our mobile and web platforms
   • Analytics: drive data-driven decisions, business intelligence
7. Data Flow: Databricks
8. Problem Definition: many file types, many clouds, many sources, many pipelines
   • Partners are key to our business; we are flexible on how we integrate and manage their data
   • Some of our partners have cloud provider restrictions
   • We have several other operational and third-party sources
9. Problem Definition
10. Problem Statement: How can we quickly create and easily maintain a growing number of pipelines?
11. Attempt 1: not enough automation
    Operational Database → SyncTable1Job(), SyncTable2Job(), SyncTable3Job(), SyncTable4Job(), … n jobs
12. Attempt 1: not enough automation
13. Attempt 2: too much automation
    Operational Database → MagicSyncAllTablesJob()
14. Attempt 2: too much automation
15. Problem
    Too much automation:
    • Inferred values cause unexpected behavior
    • Hard to make changes
    • Difficult to reuse code
    • Lazy solutions to problems
    • Hard to debug
    Not enough automation:
    • Difficult to maintain
    • More room for errors
    • Time spent on boilerplate logic
    • Difficult to share code and pass on work
    • Additions require Spark knowledge
16. The Declarative Data Pipeline: YAML-based Airflow DAGs driving a config-based Spark application
17. Attempt 3: the right amount of automation
    Database → SyncTableJob(config1), with config1, config2, config3, config4, … n configs
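
The configs themselves are small declarative documents, one per synced table. As a sketch (field names are illustrative, not Flashfood's exact production schema), config1 might read:

# config1: sync the orders table (illustrative)
source:
  type: jdbc
  url: jdbc:postgresql://...   # connection details elided
  table: orders
transform:
  - drop_columns: [internal_notes]
  - filter: "order_total > 0"
load:
  type: delta
  mode: append
  path: s3://lake/bronze/orders   # hypothetical lake path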
18. Attempt 3: the right amount of automation
    Database → SyncTableJob(config2)
19. Attempt 3: the right amount of automation
    Database → SyncTableJob(config3)
20. Attempt 3: the right amount of automation
    Database → SyncTableJob(config4)
21. Why configs?
    • Creates a contract between source and sink
    • Forces the DRY principle for similar jobs
    • New jobs can be added manually or programmatically
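
Because each job is just data, programmatic job creation reduces to a loop. A minimal Airflow sketch, assuming a hypothetical SyncTableOperator (sketched under the Custom Operator slide below) and a configs/ directory of YAML files:

import glob
from datetime import datetime

import yaml
from airflow import DAG

from pipelines.operators import SyncTableOperator  # hypothetical custom operator

with DAG("sync_tables", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    for path in sorted(glob.glob("configs/*.yaml")):
        with open(path) as f:
            cfg = yaml.safe_load(f)  # one config per table: source, transform, load
        # One task per config: adding a table means adding a file, not writing code.
        SyncTableOperator(task_id=f"sync_{cfg['source']['table']}", config=cfg)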
22. airflow-declarative
23. SyncTableJob
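
The original slide showed the job's code, which this transcript does not capture. A minimal sketch of a config-driven Spark entry point in that spirit (structure assumed, not Flashfood's actual code; extract, transform, and load are sketched under the Extract, Transform, and Load slides below):

import json
import sys

from pyspark.sql import SparkSession

def run(config: dict) -> None:
    """Execute one declarative sync: extract, transform, load."""
    spark = SparkSession.builder.appName(f"sync_{config['source']['table']}").getOrCreate()
    df = extract(spark, config["source"])
    df = transform(df, config.get("transform", []))
    load(df, config["load"])

if __name__ == "__main__":
    # The operator passes the config as a single JSON-encoded argument.
    run(json.loads(sys.argv[1]))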
24. Custom Operator
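
A sketch of what extending the Databricks operator can look like, using the provider's DatabricksSubmitRunOperator; the file location, pool handling, and config shape are assumptions for illustration:

import json

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

class SyncTableOperator(DatabricksSubmitRunOperator):
    """Turn one sync config into a Databricks run with a tailored cluster."""

    def __init__(self, config: dict, **kwargs):
        super().__init__(
            new_cluster={
                "spark_version": "8.2.x-scala2.12",
                "num_workers": config.get("workers", 2),
                # Instance pools cut cluster start-up time (see Lessons Learned);
                # assumes a pool id is present in the config.
                "instance_pool_id": config.get("pool_id"),
            },
            spark_python_task={
                "python_file": "dbfs:/jobs/sync_table_job.py",  # hypothetical location
                "parameters": [json.dumps(config)],
            },
            **kwargs,
        )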
25. Extract
26. Extract (continued)
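
Slides 25 and 26 showed the extract step. A hedged sketch of a config-dispatched reader (source types and option names are illustrative):

def extract(spark, src: dict):
    """Read a DataFrame from whichever source the config declares."""
    if src["type"] == "jdbc":
        return (
            spark.read.format("jdbc")
            .option("url", src["url"])
            .option("dbtable", src["table"])
            .load()
        )
    # File-based sources (csv, json, parquet, ...) on any cloud object store.
    return (
        spark.read.format(src["file_type"])
        .options(**src.get("options", {}))
        .load(src["path"])
    )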
27. Transform
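
A matching sketch of the transform step, applying the declared steps in order (the step vocabulary here is invented for illustration):

from pyspark.sql import DataFrame

def transform(df: DataFrame, steps: list) -> DataFrame:
    """Apply each declared transform step in order."""
    for step in steps:
        if "drop_columns" in step:
            df = df.drop(*step["drop_columns"])
        elif "filter" in step:
            df = df.filter(step["filter"])  # SQL expression, e.g. "order_total > 0"
        elif "rename" in step:
            for old, new in step["rename"].items():
                df = df.withColumnRenamed(old, new)
    return df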
28. Load
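
And the load step, again dispatched purely from config (a sketch; an upsert/merge sink would need Delta-specific logic beyond what is shown):

def load(df, sink: dict) -> None:
    """Write the DataFrame to the declared sink."""
    (
        df.write.format(sink.get("type", "delta"))
        .mode(sink.get("mode", "append"))  # standard save modes only; merge is extra
        .save(sink["path"])
    )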
29. Summary
    • Airflow: extensible orchestration, community operators, compute for ‘small data’ jobs
    • Config-based: separates business logic from application logic
    • Databricks: native support across clouds, scalable processing, reliable connectors
30. Results
    • Reduced maintenance overhead
    • Democratized the ability to create similar jobs
    • Improved readability and coding standards
31. Lessons Learned
    • Favor parameters over inference
    • Reuse code for extract and load
    • Instance pools are important
32. Challenges ahead
    • How much to generalize the config
    • Programmatically adding new configurations
    • A grammar parser for simple function definitions in YAML
    • Checking YAML validity at the source
    • Could this be open sourced?
    • We have SparkR, PySpark, and Spark SQL; could we have Spark YAML?
33. Spark YAML
    • Combine orchestration with execution
    • Simplify usage of parameter-heavy functions
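
To make the idea concrete, a purely hypothetical "Spark YAML" job, where the YAML is the program rather than input to one, might read as follows. Nothing like this exists today; it is only the proposal the slide sketches:

# Hypothetical Spark YAML: orchestration and execution in one document
job: sync_orders
read:
  format: jdbc
  options:
    url: jdbc:postgresql://...   # elided
    dbtable: orders
transform:
  - select: [order_id, store_id, order_total]
  - filter: "order_total > 0"
write:
  format: delta
  mode: append
  path: s3://lake/bronze/orders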
34. "A young writer is easily tempted by the allusive and ethereal and ironic and reflective, but the declarative is at the bottom of most good writing." - Garrison Keillor
35. Keillor's Principles
    • Explicit: settings and variables should be explicit
    • Indelicate: the system should extend without breaking
    • Logical: behavior should do exactly as stated
    • Simplistic: jobs should make limited decisions and fail quickly
    • Declarative: pipelines should be clear in function and execution
36. Feedback
    Your feedback is important to us. Don't forget to rate and review the sessions.
