
Airflow - An Open Source Platform to Author and Monitor Data Pipelines

  1. Airflow
  2. Airflow: an open source platform to programmatically author and monitor data pipelines.
  3. Street Cred
  4. The Problem Statement
     * Companies grow to have a complex network of processes that have intricate dependencies
     * Analytics & batch processing are mission critical
     * Tons of time is spent writing jobs, monitoring, and troubleshooting issues
  5. Common Symptoms
     * Data lineage is opaque
     * The learning curve gets steeper as the ecosystem grows
     * Code / logic is duplicated (not OO)
     * An open-ended framework means heterogeneous methods
     * Ownership and accountability fall behind
     * Maintenance time grows exponentially
     * Operational metadata (job duration, landing times, logs, …) is scattered, if it exists at all
     * Loose guidelines lead to bad practices
     * The signal-to-noise ratio on alerts gets out of control
  6. Genesis
     * Airbnb was outgrowing its workflow methodology
     * High hopes:
       * Dynamic pipeline generation
       * A programmatic environment
       * A platform people love working with
     * We took it as an opportunity to build and share
  7. Airflow@Airbnb
     * Data Warehousing
     * Experimentation
     * Growth Analytics
     * Email Targeting
     * Sessionization
     * Search Ranking
     * Infrastructure Maintenance
     * Engagement Analytics
     * Anomaly Detection
  8. Scale
     * 6 c3.8xlarge nodes
     * 128 processing slots
     * 5-6k tasks per day
     * 30 pipeline authors @Airbnb
     * 8 Airflow contributors
     * 5 companies using Airflow in production
     * No bottleneck in sight
  9. Design Choices
     * Python!
     * A focus on authoring
     * Programmatic: the DAG definition is code (see the sketch below)
     * Modular & easily extensible
     * Rich CLI
     * Rich web UI
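To illustrate the "DAG definition is code" point, here is a minimal sketch of a pipeline definition. The DAG id, owner, schedule, and commands are made up for illustration, and operator import paths have moved around between Airflow releases, so treat this as a sketch rather than canonical usage.

```python
# Minimal sketch of a programmatic DAG definition (hypothetical pipeline name,
# schedule, and commands; import paths vary across Airflow versions).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",                 # hypothetical owner
    "start_date": datetime(2015, 6, 1),  # hypothetical start date
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Because the DAG is plain Python, pipelines can be generated dynamically
# (loops, configuration files, factory functions) instead of hand-written specs.
dag = DAG("example_pipeline", default_args=default_args, schedule_interval="@daily")

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Declare the dependency: extract must complete before load runs.
extract.set_downstream(load)
```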
  10. UI Demo!
  11. Technologies
     * Python backend
     * Distributed execution is extensible, but we use Celery on Redis for now
     * UI: Flask / SqlAlchemy / d3.js / Highcharts
     * Templating: Jinja! (see the sketch below)
     * Metadata database: SqlAlchemy (MySQL)
     * Airflow/requirements.txt
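As a quick illustration of the Jinja templating mentioned above, here is a sketch of a templated task; the DAG id and the command are hypothetical, while `{{ ds }}` is the built-in macro for the execution date.

```python
# Sketch of Jinja templating in a task definition: bash_command is a template
# rendered at run time, with {{ ds }} expanding to the execution date
# (YYYY-MM-DD). DAG id and command are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("templating_example", start_date=datetime(2015, 6, 1),
          schedule_interval="@daily")

add_partition = BashOperator(
    task_id="add_partition",
    bash_command='echo "ALTER TABLE events ADD PARTITION (ds={{ ds }})"',
    dag=dag,
)
```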
  12. Operators are Task Factories (see the sketch below): BashOperator, PythonOperator, HiveOperator, MySqlOperator, CascadingOperator, SparkOperator, DummyOperator, EmailOperator, ExternalTaskSensor, HdfsSensor, Hive2SambaOperator, HivePartitionSensor, MySqlToHiveTransfer, PostgresOperator, PrestoCheckOperator, PrestoIntervalCheckOperator, PrestoValueCheckOperator, HiveToMySqlTransfer, S3KeySensor, S3ToHiveTransfer, SqlSensor, TimeSensor, and MORE! …
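To show what "operators are task factories" means in practice, here is a sketch in which each operator instantiation mints a task, so an ordinary Python loop stamps out several tasks; the DAG id, table names, and commands are hypothetical, and import paths vary by Airflow version.

```python
# Sketch of operators as task factories: every operator instantiation yields
# a task, so plain Python loops can generate many tasks from one definition.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("factory_example", start_date=datetime(2015, 6, 1),
          schedule_interval="@daily")

done = DummyOperator(task_id="done", dag=dag)

for table in ("users", "listings", "bookings"):  # hypothetical tables
    load = BashOperator(
        task_id="load_%s" % table,
        bash_command="echo loading %s" % table,
        dag=dag,
    )
    # Fan-in: every per-table load must finish before the final marker task.
    load.set_downstream(done)
```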
  13. Architecture (diagram): Scheduler, Workers, Web Servers, Metadata Database, Local Repo, and Master Repo, connected to external systems such as MySQL, Hive, HDFS, Cascading, Spark, Presto, …
  14. Conclusion
  15. In the works / next steps
     * Build a community!
     * class YarnExecutor(BaseExecutor): (see the sketch below)
     * Sharing our services / frameworks
     * Hive stats collection
     * Anomaly detection
     * Experimentation framework
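The `class YarnExecutor(BaseExecutor):` bullet refers to plugging a new execution backend into Airflow. The skeleton below is only an illustration of what such a subclass might look like; the exact BaseExecutor hooks have changed across Airflow versions, so the method names are an assumption, and no real YarnExecutor is implied.

```python
# Illustrative skeleton of a custom executor; the hook names below are an
# assumption about the BaseExecutor interface, not a working implementation.
from airflow.executors.base_executor import BaseExecutor


class YarnExecutor(BaseExecutor):
    def start(self):
        # Connect to the YARN ResourceManager (hypothetical setup step).
        pass

    def execute_async(self, key, command, queue=None):
        # Submit the task's command to YARN as an application (hypothetical).
        pass

    def sync(self):
        # Poll YARN for application states and record task success/failure.
        pass

    def end(self):
        # Drain in-flight applications before shutting the executor down.
        pass
```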
  16. Qs?
