Airflow at WePay

A 20-minute talk about how WePay runs Airflow. Discusses usage and operations, and covers running Airflow on Google Cloud.

Video of the talk is available here:

https://wepayinc.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k


Airflow at WePay

  1. Airflow at WePay Chris Riccomini · June 14, 2016
  2. Who? Chris Riccomini · Engineer at WePay · Working on infrastructure (mostly) · Formerly LinkedIn, PayPal
  3. Who?
  4. Goal • What do we use Airflow for? • How do we operate Airflow?
  5. Usage • ETL • Reporting • Monitoring • Machine learning • CRON replacement
  6. Usage • ETL • Reporting • Monitoring • Machine learning • CRON replacement
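
A minimal sketch of what a CRON-replacement DAG looks like against the 1.7-era Airflow API. The DAG id, schedule, owner, and script path are illustrative, not WePay's actual code:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        'owner': 'data-infra',                    # illustrative team name
        'email': ['data-infra@example.com'],      # illustrative alert address
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    # replaces a crontab entry like "0 2 * * * /opt/reports/nightly.sh"
    dag = DAG(
        'nightly_report',
        default_args=default_args,
        start_date=datetime(2016, 6, 1),
        schedule_interval='0 2 * * *',
    )

    run_report = BashOperator(
        task_id='run_report',
        bash_command='/opt/reports/nightly.sh ',  # trailing space stops Jinja treating ".sh" as a template file
        dag=dag,
    )
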
  7. Usage • 350 DAGs • 7,000 DAG runs per day • 20,000 task instances per day
  8. Environments
  9. Airflow deployment • Google Cloud Platform • One n1-highcpu-32 machine (32 cores, 28GB memory) • CloudSQL-hosted MySQL 5.6 (250GB, 12GB used) • Supervisord • Icinga2 (investigating Sensu)
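
Supervisord keeps the webserver and scheduler processes up; a sketch of what the program entries might look like (paths, user, and log locations are assumptions):

    [program:airflow-webserver]
    command=/usr/local/bin/airflow webserver -p 8080
    user=airflow
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/airflow/webserver.log

    [program:airflow-scheduler]
    command=/usr/local/bin/airflow scheduler
    user=airflow
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/airflow/scheduler.log
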
  10. Airflow deployment
  11. Airflow deployment pip install git+https://git@our-wepay-repo.com/DataInfra/airflow.git@1.7.1.2-wepay#egg=airflow[gcp_api,mysql,crypto]==1.7.1.2+wepay4
  12. Airflow scheduler • Single scheduler on the same machine as the webserver • executor = LocalExecutor • parallelism = 64 • dag_concurrency = 64 • max_active_runs_per_dag = 16
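
In context, those scheduler settings are the [core] section of airflow.cfg (values taken from the slide):

    [core]
    executor = LocalExecutor
    parallelism = 64
    dag_concurrency = 64
    max_active_runs_per_dag = 16
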
  13. Airflow logs • /var/log/airflow • Remote logger points to Google Cloud Storage • Experimenting with ELK
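
In the 1.7 era, pointing the remote logger at Google Cloud Storage was a pair of [core] settings in airflow.cfg; a sketch (the bucket name and connection id are assumptions):

    [core]
    base_log_folder = /var/log/airflow
    remote_base_log_folder = gs://my-airflow-logs/logs
    remote_log_conn_id = gcs_default
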
  14. Airflow connections
  15. Airflow security • Active Directory • LDAP Airflow backend • Disabled admin and data profiler tabs
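
The LDAP backend is switched on in airflow.cfg; a sketch with placeholder values (the URI, filters, and DNs depend on the Active Directory layout and are assumptions here):

    [webserver]
    authenticate = True
    auth_backend = airflow.contrib.auth.backends.ldap_auth

    [ldap]
    uri = ldaps://ldap.example.com:636
    user_filter = objectClass=*
    user_name_attr = uid
    bind_user = cn=airflow,dc=example,dc=com
    bind_password = change-me
    basedn = dc=example,dc=com
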
  16. DAG development
  17. DAG development 1. Install gcloud 2. Run `gcloud auth login` 3. Install/start Airflow 4. Add a Google Cloud Platform connection (just set project_id)
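
Step 4 can be done through the UI (Admin → Connections) or scripted against the metadata DB. A sketch of the scripted version; the GCP extras field names varied across Airflow versions, so the ones below are assumptions:

    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id='gcp_default',               # illustrative connection id
        conn_type='google_cloud_platform',
        # assumed extras key; per the slide, only the project id needs to be set
        extra='{"extra__google_cloud_platform__project": "my-project-id"}',
    )
    session = settings.Session()
    session.add(conn)
    session.commit()
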
  18. DAG development
  19. DAG development
  20. DAG testing • flake8 • Code coverage
  21. DAG testing • Test that the scheduler can import the DAG without a failure • Check that the owner of every task is a known team • Check that the email of every task is set to a known team
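
A sketch of what those checks can look like as plain test functions; the DAG folder path and the team/email whitelists are assumptions:

    from airflow.models import DagBag

    KNOWN_TEAMS = {'data-infra', 'risk', 'payments'}    # illustrative
    KNOWN_EMAILS = {'data-infra@example.com'}           # illustrative

    def test_dags_import_without_failure():
        dagbag = DagBag(dag_folder='/etc/airflow/dags', include_examples=False)
        assert not dagbag.import_errors

    def test_every_task_has_known_owner_and_email():
        dagbag = DagBag(dag_folder='/etc/airflow/dags', include_examples=False)
        for dag in dagbag.dags.values():
            for task in dag.tasks:
                assert task.owner in KNOWN_TEAMS
                # task.email may be a string or a list depending on how it was set
                emails = task.email if isinstance(task.email, list) else [task.email]
                assert all(e in KNOWN_EMAILS for e in emails)
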
  22. DAG deployment • A CRON job (ironically) that pulls from airflow-dags every two minutes:

    $ cat ~/refresh-dags
    #!/bin/bash
    git -C /etc/airflow/dags/dags/dev clean -f -d
    git -C /etc/airflow/dags/dags/dev pull

  • Webserver/scheduler restarts happen manually (right now) • DAGs toggled off by default
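
The crontab entry driving that script would be something like this (the two-minute schedule is from the slide; the home directory and log path are assumptions):

    */2 * * * * /home/airflow/refresh-dags >> /var/log/airflow/refresh-dags.log 2>&1
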
  23. DAG characteristics • (almost) All work happens off the Airflow machine • Fairly homogeneous operator usage (GCP) • Idempotent (re-run a DAG at any time) • ETL DAGs are very small, but there are many of them
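
These characteristics fit together. The editor's notes mention that most DAGs are generated dynamically from a single Python script; a sketch of that pattern, with idempotency coming from overwriting the destination table on every run (table names, datasets, and schedule are illustrative, not WePay's actual code):

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    TABLES = ['payments', 'payers', 'merchants']    # illustrative table list

    for table in TABLES:
        dag = DAG(
            'etl_%s' % table,
            start_date=datetime(2016, 6, 1),        # start_date bounds backfill
            schedule_interval='@hourly',
        )
        BigQueryOperator(
            task_id='load_%s' % table,
            bql='SELECT * FROM staging.%s' % table,
            destination_dataset_table='warehouse.%s' % table,
            write_disposition='WRITE_TRUNCATE',     # overwrite on re-run, so re-runs are safe
            dag=dag,
        )
        # DAG objects must be module-level for the scheduler to find them
        globals()['etl_%s' % table] = dag
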
  24. Questions? (We’re hiring)
  25. Addendum
  26. Usage

Editor's Notes

  • Platform payments
    Buyer, seller, and platform (eBay, Uber, Airbnb, etc.)
    Unique risk, regulatory, and feature requirements
  • Mostly want to focus on the second bullet point
  • ETL is most mature use case. Others are upcoming.
    Mostly talking about ETL use case today.
  • ETL is most mature use case. Others are upcoming.
    Mostly talking about ETL use case today.
  • Will discuss the low task:dag ratio
    Most DAGs are dynamically generated from a single Python script
  • Bottom three environments all have their own Airflow
    Production is being spun up. Will be run differently. Subsequent talk will cover that.
  • Run an internal GitHub branch
    Run off latest release
    Cherry-pick some patches that have been merged to master
    Never run patches not on master (no forking!)
  • Actual pip install command
  • Have a connection to the Airflow DB for debugging
    DB connections to our MySQL DBs
    GCP connections for GCP service accounts (per-team)
  • Monolithic repository for all DAGs right now
    `dags` directory has ETL/prod/dev folders for DAGs that go into each environment
  • Development happens locally
  • Send a PR to the airflow-dags repo
  • TeamCity CI kicks off on the PR
  • First, basic code quality checks catch some errors
  • Then run Airflow DAG checks
    Don’t test DAGs. Treat them as configuration. Treat operators as code.
  • All DAGs are pretty much BigQueryOperators, or moving data in and out of BigQuery.
    Small task:DAG ratio makes it easier to:
    Turn off tasks
    Reset state
    Get around backfill (by using start_date)
  • Example table
    Extract/Load
    Rollups
    View creation
    Data quality checks