Airflow at WePay

  1. Airflow at WePay Chris Riccomini · June 14, 2016
  2. Who? • Chris Riccomini • Engineer at WePay • Working on infrastructure (mostly) • Formerly LinkedIn, PayPal
  3. Who?
  4. Goal • What do we use Airflow for? • How do we operate Airflow?
  5. Usage • ETL • Reporting • Monitoring • Machine learning • CRON replacement
  6. Usage • ETL • Reporting • Monitoring • Machine learning • CRON replacement
  7. Usage • 350 DAGs • 7,000 DAG runs per day • 20,000 task instances per day
  8. Environments
  9. Airflow deployment • Google Cloud Platform • One n1-highcpu-32 machine (32 cores, 28GB mem) • CloudSQL hosted MySQL 5.6 (250GB, 12GB used) • Supervisord • Icinga2 (investigating Sensu)
  10. Airflow deployment
  11. Airflow deployment
      pip install git+https://git@our-wepay-repo.com/DataInfra/airflow.git@1.7.1.2-wepay#egg=airflow[gcp_api,mysql,crypto]==1.7.1.2+wepay4
  12. Airflow scheduler • Single scheduler on same machine as webserver
      executor = LocalExecutor
      parallelism = 64
      dag_concurrency = 64
      max_active_runs_per_dag = 16
  13. Airflow logs • /var/log/airflow • Remote logger points to Google Cloud Storage • Experimenting with ELK
  14. Airflow connections
  15. Airflow security • Active Directory • LDAP Airflow backend • Disabled admin and data profiler tabs
  16. DAG development
  17. DAG development 1. Install gcloud 2. Run `gcloud auth login` 3. Install/start Airflow 4. Add a Google Cloud Platform connection (just set project_id); a minimal example DAG is sketched after the slide list
  18. DAG development
  19. DAG development
  20. DAG testing • flake8 • Code coverage
  21. DAG testing • Test that the scheduler can import the DAG without a failure • Check that the owner of every task is a known team • Check that the email of every task is set to a known team (these checks are sketched after the slide list)
  22. DAG deployment • CRON (ironically) that pulls from airflow-dags every two minutes • Webserver/scheduler restarts happen manually (right now) • DAGs toggled off by default
      $ cat ~/refresh-dags
      #!/bin/bash
      git -C /etc/airflow/dags/dags/dev clean -f -d
      git -C /etc/airflow/dags/dags/dev pull
  23. DAG characteristics • (almost) All work happens off the Airflow machine • Fairly homogeneous operator usage (GCP) • Idempotent (re-run a DAG at any time) • ETL DAGs are very small, but there are many of them (the generation pattern is sketched after the slide list)
  24. Questions? (We’re hiring)
  25. Addendum
  26. Usage
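
A minimal sketch of what the local development setup on slide 17 feeds into, written against the Airflow 1.7-era contrib API the deck pins. The DAG id, connection id, query, and table names are all hypothetical; only the BigQueryOperator-centric shape reflects the slides (the notes say almost all DAGs are BigQueryOperators).

    # Hypothetical local-dev DAG (slide 17). Names, query, and connection id
    # are made up; assumes a GCP connection with only project_id set.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.contrib.operators.bigquery_operator import BigQueryOperator

    default_args = {
        'owner': 'data-infra',                # CI: owner must be a known team
        'email': ['data-infra@example.com'],  # CI: email must be a known team
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        dag_id='example_rollup',
        default_args=default_args,
        start_date=datetime(2016, 6, 1),
        schedule_interval='@daily',
    )

    rollup = BigQueryOperator(
        task_id='daily_rollup',
        bql='SELECT ... FROM [project:dataset.events]',  # placeholder query
        destination_dataset_table='project:dataset.daily_rollup',
        write_disposition='WRITE_TRUNCATE',  # truncate-and-replace keeps re-runs idempotent
        bigquery_conn_id='gcp_default',      # hypothetical connection id
        dag=dag,
    )

A truncate-and-replace write disposition like this is one way to get the "re-run a DAG at any time" idempotence described on slide 23.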
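
The slide 21 checks map naturally onto Airflow's DagBag. Below is a sketch of what such CI checks could look like; the dags path and the known-team/known-email sets are assumptions, not WePay's actual test code.

    # Hypothetical CI checks in the spirit of slide 21.
    from airflow.models import DagBag

    KNOWN_TEAMS = {'data-infra', 'payments'}    # hypothetical teams
    KNOWN_EMAILS = {'data-infra@example.com'}   # hypothetical team addresses

    dagbag = DagBag(dag_folder='/etc/airflow/dags', include_examples=False)

    # 1. The scheduler must be able to import every DAG without a failure.
    assert not dagbag.import_errors, dagbag.import_errors

    for dag_id, dag in dagbag.dags.items():
        for task in dag.tasks:
            # 2. The owner of every task must be a known team.
            assert task.owner in KNOWN_TEAMS, (dag_id, task.task_id, task.owner)
            # 3. The email of every task must be set, and set to a known team.
            emails = task.email if isinstance(task.email, (list, tuple)) else [task.email]
            assert task.email and set(emails) <= KNOWN_EMAILS, (dag_id, task.task_id, task.email)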
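
The editor's notes say most DAGs are dynamically generated from a single Python script, which is what produces the low task:DAG ratio on slide 23. A common shape for that pattern, with a hypothetical table list and placeholder tasks:

    # Hypothetical sketch of generating many small per-table ETL DAGs from
    # one script; the table list and task body are made up.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    TABLES = ['accounts', 'payments', 'merchants']  # hypothetical tables

    for table in TABLES:
        dag = DAG(
            dag_id='etl_%s' % table,
            start_date=datetime(2016, 6, 1),
            schedule_interval='@hourly',
        )
        # Stand-in for the real extract/load work (e.g. a BigQuery load job).
        DummyOperator(task_id='extract_load', dag=dag)

        # The scheduler only picks up DAGs it finds at module level, so each
        # generated DAG is registered in globals().
        globals()[dag.dag_id] = dag

Keeping each table in its own tiny DAG makes it easy to turn off one table, reset its state, or adjust its start_date without touching the rest, as the notes point out.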

Editor's Notes

  1. Platform payments: buyer, seller, and platform (eBay, Uber, Airbnb, etc.). Unique risk, regulatory, and feature requirements.
  2. * Mostly want to focus on the second bullet point
  3. ETL is most mature use case. Others are upcoming. Mostly talking about ETL use case today.
  4. ETL is most mature use case. Others are upcoming. Mostly talking about ETL use case today.
  5. Will discuss the low task:DAG ratio. Most DAGs are dynamically generated from a single Python script.
  6. Bottom three environments all have Airflow instances. Production is being spun up and will be run differently; a subsequent talk will cover that.
  7. Run an internal GitHub branch. Run off the latest release. Cherry-pick some patches that have been merged to master. Never run patches not on master (no forking!)
  8. * Actual pip install command
  9. Have a connection to the Airflow DB for debugging. DB connections to our MySQL DBs. GCP connections for GCP service accounts (per-team).
  10. Monolithic repository for all DAGs right now. The `dags` directory has ETL/prod/dev folders for DAGs that go into each environment.
  11. * Development happens locally
  12. * Send a PR to the airflow-dags repo
  13. * TeamCity CI kicks off on the PR
  14. * First run basic code quality checks catch some errors
  15. Then run Airflow DAG checks. Don't test DAGs: treat them as configuration, and treat operators as code.
  16. All DAGs are pretty much BigQueryOperators, or moving data in and out of BigQuery. The small task:DAG ratio makes it easier to turn off tasks, reset state, and get around backfill (by using start_date).
  17. Example table: extract/load, rollups, view creation, data quality checks.