Successfully reported this slideshow.
Your SlideShare is downloading. ×

Introduction to Apache Airflow - Data Day Seattle 2016

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 68 Ad

More Related Content

Slideshows for you (19)

Similar to Introduction to Apache Airflow - Data Day Seattle 2016 (20)

Advertisement

More from Sid Anand (20)

Recently uploaded (20)

Advertisement

Introduction to Apache Airflow - Data Day Seattle 2016

  1. 1. Introducing Apache Airflow (Incubating) Sid Anand (@r39132) Data Day Seattle 2016 1
  2. 2. About Me 2 Work [ed | s] @ Maintainer on Reports to Co-Chair for
  3. 3. Apache Airflow 3 What is it?
  4. 4. 4 Apache Airflow : What is it? Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
  5. 5. 5 Apache Airflow : What is it? Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs) It ships with • DAG Scheduler • Web application (UI) • Powerful CLI
  6. 6. 6 Apache Airflow : What is it?
  7. 7. Airflow - Authoring DAGs 7 Airflow: Visualizing a DAG
  8. 8. 8 Airflow: Author DAGs in Python! No need to bundle many XML files! Airflow - Authoring DAGs
  9. 9. 9 Airflow: The Tree View offers a view of DAG Runs over time! Airflow - Authoring DAGs
  10. 10. Airflow - Performance Insights 10 Airflow: Gantt charts reveal the slowest tasks for a run!
  11. 11. 11 Airflow: …And we can easily see performance trends over time Airflow - Performance Insights
  12. 12. 12 Apache Airflow : What is it? When would you use a Workflow Scheduler like Airflow? • ETL Pipelines • Machine Learning Pipelines • Predictive Data Pipelines • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc… • General Job Scheduling (e.g. Cron) • DB Back-ups, Scheduled code/config deployment
  13. 13. 13 Apache Airflow : What is it? What should a Workflow Scheduler do well? • Schedule a graph of dependencies • where Workflow = A DAG of Tasks • Handle task failures • Report / Alert on failures • Monitor performance of tasks over time • Enforce SLAs • E.g. Alerting if time or correctness SLAs are not met • Scale
  14. 14. 14 Apache Airflow : What is it? What Does Apache Airflow Add? • Configuration-as-code • Usability - Stunning UI / UX • Centralized configuration • Resource Pooling • Extensibility
  15. 15. Apache Airflow 15 Incubating
  16. 16. 16 Apache Airflow : Incubating Timeline • Airflow was created @ Airbnb in 2015 by Maxime Beauchemin • Max launched it @ Hadoop Summit in Summer 2015 • On 3/31/2016, Airflow —> Apache Incubator Today • 166+ Contributors • 300+ Users • 40+ companies officially using it! • 9 Committers/Maintainers <— We’re growing here
  17. 17. Agari 17 What We Do!
  18. 18. Agari : What We Do 18
  19. 19. 19 Agari : What We Do
  20. 20. 20 Agari : What We Do
  21. 21. 21 Agari : What We Do
  22. 22. 22 Agari : What We Do
  23. 23. 23 Enterprise Customers email metadata apply trust models email md + trust score Agari’s Current Product Agari : What We Do
  24. 24. 24 email metadata apply trust models email md + trust score Agari’s Future ProductEnterprise Customers Agari : What We Do
  25. 25. Apache Airflow @ Agari How Do We Use It? 25
  26. 26. Classes of Orchestration 26 apply trust models (message scoring) build trust models cron++ (general job scheduler) New Product (Enterprise Protect) Operational Automation BI / ETL N / A
  27. 27. Classes of Orchestration 27 apply trust models (message scoring) build trust models cron++ (general job scheduler) New Product (Enterprise Protect) Operational Automation BI / ETL N / A This Talk
  28. 28. Use-Case : Message Scoring Batch Pipeline Architecture 28
  29. 29. Use-Case : Message Scoring 29 enterprise A enterprise B enterprise C S3 S3 uploads every 15 minutes
  30. 30. Use-Case : Message Scoring 30 enterprise A enterprise B enterprise C S3 Airflow kicks of a Spark message scoring job every hour
  31. 31. Use-Case : Message Scoring 31 enterprise A enterprise B enterprise C S3 Spark job writes scored messages and stats to another S3 bucket S3
  32. 32. Use-Case : Message Scoring 32 enterprise A enterprise B enterprise C S3 This triggers SNS/SQS messages events S3 SNS SQS
  33. 33. Use-Case : Message Scoring 33 enterprise A enterprise B enterprise C S3 An Autoscale Group (ASG) of Importers spins up when it detects SQS messages S3 SNS SQS Importers ASG
  34. 34. 34 enterprise A enterprise B enterprise C S3 The importers rapidly ingest scored messages and aggregate statistics into the DB S3 SNS SQS Importers ASG DB Use-Case : Message Scoring
  35. 35. 35 enterprise A enterprise B enterprise C S3 Users receive alerts of untrusted emails & can review them in the web app S3 SNS SQS Importers ASG DB Use-Case : Message Scoring
  36. 36. 36 enterprise A enterprise B enterprise C S3 S3 SNS SQS Importers ASG DB Airflow manages the entire process Use-Case : Message Scoring
  37. 37. Use-Case : Message Scoring Airflow DAG 37
  38. 38. 38 Airflow DAG
  39. 39. 39 5 minute wait for S3 eventual consistency Airflow DAG
  40. 40. 40 1 hour a day, we also build new models Airflow DAG
  41. 41. 41 build models Airflow DAG
  42. 42. 42 dummy needed for branch operator Airflow DAG
  43. 43. 43 • trigger_rule: one_success Airflow DAG
  44. 44. 44 Prep Spark Run Airflow DAG
  45. 45. 45 • Run Spark • Verify a record is written to the DB • Wait for the SQS queue to empty Airflow DAG
  46. 46. 46 • Compute discrepancies • Send email report • Update monitoring graphs • Raise SLA (correctness) alerts Airflow DAG
  47. 47. SLAs & Insights Airflow 47
  48. 48. 48 Desirable Qualities of a Resilient Data Pipeline OperabilityCorrectness Timeliness Scalable/Available • Data Integrity (no loss, etc…) • Expected data distributions • All output within time-bound SLAs (e.g. 1 hour) • Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs • Quick Recoverability • ASGs, SQS, SNS, S3
  49. 49. 49 Desirable Qualities of a Resilient Data Pipeline OperabilityCorrectness Timeliness • Data Integrity (no loss, etc…) • Expected data distributions • All output within time-bound SLAs (e.g. 1 hour) • Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs • Quick Recoverability SLA SLA • ASGs, SQS, SNS, S3 Scalable/Available
  50. 50. 50 Correctness : Email Reporting orgs
  51. 51. 51 Correctness : Email Reporting For each org, we check for duplicate or missing data as a count & percentage orgs
  52. 52. 52 Correctness : Email Reporting These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push! orgs
  53. 53. 53 Correctness : Monitoring
  54. 54. 54 Airflow: …And easy to integrate with Ops tools! Correctness & Timeliness : Alerting
  55. 55. 55 Airflow: …And easy to integrate with Ops tools! Correctness & Timeliness : Alerting Timeliness SLA miss
  56. 56. 56 Airflow: …And easy to integrate with Ops tools! Correctness & Timeliness : Alerting Timeliness SLA miss dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)
  57. 57. 57 Airflow: …And easy to integrate with Ops tools! Correctness & Timeliness : Alerting Timeliness SLA miss Correctness SLA miss
  58. 58. 58 Airflow: …And easy to integrate with Ops tools! Correctness & Timeliness : Alerting Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps
  59. 59. Use-Case : Model Building v2 For Both Batch & Near Realtime Scoring Pipelines 59
  60. 60. 60 Airflow DAG
  61. 61. 61 Model Building DAG Launch an EMR cluster
  62. 62. 62 Run model building as EMR steps Model Building DAG
  63. 63. 63 Validate models Send email notification if tests fail Model Building DAG
  64. 64. 64 Terminate EMR cluster Model Building DAG
  65. 65. Apache Airflow Next Steps 65 Areas for Improvement
  66. 66. 66 Apache Airflow Next Steps Improvement Areas • Security • API (though we do have a CLI) • Deployment / Versioning • Execution Scale Out • On-demand Execution
  67. 67. Acknowledgments 67 • Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones • Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle None of this work would be possible without the contributions of the strong team below
  68. 68. Questions? (@r39132) 68

×