Let's Build a Service Oriented Data Pipeline!

Like many software projects, data pipelines built by Business Intelligence teams often start out as quickly built monoliths, but over time they can be made simpler and more maintainable by splitting them into a series of loosely coupled jobs, stealing ideas from service oriented architecture.

In this presentation we'll look at the challenges we've faced scaling Data Analytics at Hootsuite, then move into a live coding session, where we'll stitch together a data pipeline as a series of Scala apps, deployed to AWS Lambda, connected using Airbnb's open source Airflow tool.
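
The idea of loosely coupled jobs connected by an orchestration layer can be sketched in a few lines. This is a minimal illustration only, not the talk's code: the talk uses Scala apps on AWS Lambda wired together with Airflow, whereas this sketch uses plain Python functions and the standard library's `graphlib`. The names `run_pipeline`, `load_events`, `organization_stats`, and `max_attempts` are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    jobs: Dict[str, Callable[[], None]],
    upstream: Dict[str, Set[str]],
    max_attempts: int = 3,
) -> List[str]:
    """Run each job only after all of its upstream jobs have succeeded.

    Returns job names in completion order. Re-raises the last error if a
    job still fails after max_attempts tries.
    """
    completed: List[str] = []
    # static_order() yields every job after all of its dependencies.
    for name in TopologicalSorter(upstream).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                break
            except Exception:
                if attempt == max_attempts:
                    raise
        completed.append(name)
    return completed


# Hypothetical stand-ins for the two jobs described in the talk.
log: List[str] = []
attempts = {"organization_stats": 0}


def load_events() -> None:
    log.append("load_events")


def organization_stats() -> None:
    attempts["organization_stats"] += 1
    if attempts["organization_stats"] == 1:
        # Simulate one transient failure to exercise the retry path.
        raise RuntimeError("transient failure")
    log.append("organization_stats")


order = run_pipeline(
    jobs={"load_events": load_events, "organization_stats": organization_stats},
    upstream={"load_events": set(), "organization_stats": {"load_events"}},
)
print(order)  # ['load_events', 'organization_stats']
```

Each job knows nothing about the others; the dependency map is the only place where ordering lives, which is the role a scheduler such as Airflow plays in the pipeline described here.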

Published in: Technology

  1. Let’s Build a Service Oriented Data Pipeline! Yasha Podeswa, Software Developer | Hootsuite. June 2016
  2. Before: Oceanographer (Me!)
  3. Now: Software Developer at Hootsuite (Me!)
  4. This Talk: Introduce a problem that requires a new data pipeline; design it in a service oriented style; build it on stage!
  5. The Problem: Passive Aggressive Inc. just cancelled their subscription! Desperate Dan in trouble!
  6. Want to Build a Tool Like This
  7. Want to Build a Tool Like This
  8. Want to Build a Tool Like This
  9. What We’re Starting With
  10. What We’re Starting With: Things Users Did
  11. What We’re Starting With: Things Organizations Did
  12. What We’re Starting With: Crap
  13. High Level Plan: JSON files → Calculate stats about organizations → DB
  14. High Level Plan: JSON files → Calculate stats about organizations → DB (Extract, Transform, Load)
  15. High Level Plan: JSON files → Calculate stats about organizations → DB (Extract, Transform, Load)
  16. JSON files → Calculate stats about organizations → DB. Steps: Clean and organize data; Calculate stats per organization
  17. JSON files → Calculate stats about organizations → DB. Steps: Clean and organize data; Calculate stats per organization. Note: Useful for lots of things!
  18. JSON files → Calculate stats about organizations → DB. Steps: Clean and organize data; Calculate stats per organization. Note: Shouldn’t run until dependent job done
  19. Need a “Service” Communication and Orchestration Layer!
  20. Let’s build it!
  21. First App: Event Cleaning and Loading. Read logs from S3, clean and sort into different types of events, load into data warehouse. Vanilla Scala app; AWS Lambda
  22. Second App: Organization Stat Calculation. Read cleaned/sorted events from data warehouse, calculate stats about organization, load stats to data warehouse. Vanilla Scala app; AWS Lambda
  23. Third App: Airflow. Hook up the Lambda apps in a dependency graph: scheduling, retries, monitoring
  24. Steal my code! https://github.com/yashap/etl-load-events https://github.com/yashap/etl-organization-stats https://github.com/yashap/airflow
  25. Questions?
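
The two jobs from slides 21 and 22 can be sketched as a pair of small functions. This is a rough illustration under stated assumptions, not the code from the linked repositories (which are Scala apps): the event schema with `type` and `organization_id` fields, the sample log lines, and the function names `clean_and_sort` and `organization_stats` are all hypothetical, and in-memory values stand in for S3 and the data warehouse.

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def clean_and_sort(raw_lines: Iterable[str]) -> Dict[str, List[dict]]:
    """First job: parse JSON log lines and group valid events by type."""
    sorted_events: Dict[str, List[dict]] = defaultdict(list)
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # Drop lines that are not JSON at all.
        if not isinstance(event, dict):
            continue
        # Assumed schema: a usable event has "type" and "organization_id".
        if "type" not in event or "organization_id" not in event:
            continue
        sorted_events[event["type"]].append(event)
    return dict(sorted_events)


def organization_stats(
    sorted_events: Dict[str, List[dict]]
) -> Dict[str, Counter]:
    """Second job: count events of each type per organization."""
    stats: Dict[str, Counter] = defaultdict(Counter)
    for event_type, events in sorted_events.items():
        for event in events:
            stats[event["organization_id"]][event_type] += 1
    return dict(stats)


# Hypothetical raw log lines, including two junk records.
raw = [
    '{"type": "user_event", "organization_id": "org-1", "action": "login"}',
    '{"type": "organization_event", "organization_id": "org-1", "action": "plan_changed"}',
    '{"type": "user_event", "organization_id": "org-2", "action": "login"}',
    "not json",
    '{"unexpected": true}',
]

events = clean_and_sort(raw)
stats = organization_stats(events)
print(stats["org-1"]["user_event"])  # 1
```

Keeping the cleaned, sorted events as the only interface between the two functions mirrors the split on slides 16 to 18: the cleaning job's output is reusable by other consumers, and the stats job depends only on that output being present.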
