Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Airflow at WePay

4,380 views

Published on

A 20 minute talk about how WePay runs airflow. Discusses usage and operations. Also covers running Airflow in Google cloud.

Video of the talk is available here:

https://wepayinc.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k

Published in: Technology
  • Be the first to comment

Airflow at WePay

  1. 1. Airflow at WePay Chris Riccomini · June 14, 2016
  2. 2. Who? Chris Riccomini Engineer at WePay Working on infrastructure (mostly) Formerly LinkedIn, PayPal
  3. 3. Who?
  4. 4. Goal • What do we use Airflow for? • How do we operate Airflow?
  5. 5. Usage • ETL • Reporting • Monitoring • Machine learning • CRON replacement
  6. 6. Usage • ETL • Reporting • Monitoring • Machine learning • CRON replacement
  7. 7. Usage • 350 DAGs • 7,000 DAG runs per-day • 20,000 task instances/day
  8. 8. Environments
  9. 9. Airflow deployment • Google cloud platform • One n1-highcpu-32 machine (32 cores, 28G mem) • CloudSQL hosted MySQL 5.6 (250GB, 12GB used) • Supervisord • Icinga2 (investigating Sensu)
  10. 10. Airflow deployment
  11. 11. Airflow deployment pip install git+https://git@our-wepay-repo.com/DataInfra/airflow.git@1.7.1.2-wepay#egg=airflow[gcp_api,mysql,crypto]==1.7.1.2+wepay4
  12. 12. Airflow scheduler • Single scheduler on same machine as webserver executor = LocalExecutor parallelism = 64 dag_concurrency = 64 max_active_runs_per_dag = 16
  13. 13. Airflow logs • /var/log/airflow • Remote logger points to Google cloud storage • Experimenting with ELK
  14. 14. Airflow connections
  15. 15. Airflow security • Active directory • LDAP Airflow backend • Disabled admin and data profiler tabs
  16. 16. DAG development
  17. 17. DAG development 1. Install gcloud 2. Run `gcloud auth login` 3. Install/start Airflow 4. Add a Google cloud platform connection (just set project_id)
  18. 18. DAG development
  19. 19. DAG development
  20. 20. DAG testing • flake8 • Code coverage
  21. 21. DAG testing • Test that the scheduler can import the DAG without a failure • Check that the owner of every task is a known team • Check that the email of every task is set to a known team
  22. 22. DAG deployment • CRON (ironically) that pulls from airlfow-dags every two minutes $ cat ~/refresh-dags #!/bin/bash git -C /etc/airflow/dags/dags/dev clean -f -d git -C /etc/airflow/dags/dags/dev pull • Webserver/scheduler restarts happen manually (right now) • DAGs toggled off by default
  23. 23. DAG characteristics • (almost) All work happens off Airflow machine • Fairly homogenous operator usage (GCP) • Idempotent (re-run a DAG at any time) • ETL DAGs are very small, but there are many of them
  24. 24. Questions? (We’re hiring)
  25. 25. Addendum
  26. 26. Usage

×