A 20-minute talk about how WePay runs Airflow. Discusses usage and operations, and also covers running Airflow on Google Cloud.
Video of the talk is available here:
https://wepayinc.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k
DAG testing
• Test that the scheduler can import the DAG without a failure
• Check that the owner of every task is a known team
• Check that the email of every task is set to a known team
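A rough sketch of what these three checks might look like as pytest tests against Airflow's DagBag (the team names and emails here are hypothetical placeholders):

# test_dags.py -- sketch of the three DAG checks above.
# KNOWN_TEAMS and KNOWN_EMAILS are hypothetical placeholders.
import pytest
from airflow.models import DagBag

KNOWN_TEAMS = {"data-infra", "payments", "risk"}
KNOWN_EMAILS = {team + "@example.com" for team in KNOWN_TEAMS}

@pytest.fixture(scope="session")
def dagbag():
    # Parse every DAG file the same way the scheduler would.
    return DagBag(dag_folder="dags/", include_examples=False)

def test_scheduler_can_import_all_dags(dagbag):
    assert not dagbag.import_errors

def test_every_task_owner_is_a_known_team(dagbag):
    for dag in dagbag.dags.values():
        for task in dag.tasks:
            assert task.owner in KNOWN_TEAMS

def test_every_task_email_is_a_known_team(dagbag):
    for dag in dagbag.dags.values():
        for task in dag.tasks:
            emails = task.email if isinstance(task.email, list) else [task.email]
            assert set(emails) <= KNOWN_EMAILS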
DAG deployment
• Cron (ironically) pulls from the airflow-dags repo every two minutes
$ cat ~/refresh-dags
#!/bin/bash
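# Drop untracked files, then pull the latest DAGs; run by cron every two minutes.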
git -C /etc/airflow/dags/dags/dev clean -f -d
git -C /etc/airflow/dags/dags/dev pull
• Webserver/scheduler restarts happen manually (right now)
• DAGs toggled off by default
DAG characteristics
• (Almost) all work happens off the Airflow machines
• Fairly homogeneous operator usage (GCP)
• Idempotent (re-run a DAG at any time)
• ETL DAGs are very small, but there are many of them
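To illustrate the idempotency point, a minimal sketch of one of these small ETL DAGs (dataset and table names are hypothetical); WRITE_TRUNCATE means a re-run simply overwrites the destination table rather than appending to it:

# Sketch of a small, idempotent ETL DAG; all names are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG(
    dag_id="etl_example_table",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

BigQueryOperator(
    task_id="load_example_table",
    bql="SELECT * FROM staging.example_table",  # renamed to "sql" in newer Airflow releases
    destination_dataset_table="warehouse.example_table",
    write_disposition="WRITE_TRUNCATE",  # re-runs overwrite, not append
    dag=dag,
)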
Platform payments
Buyer, seller, and platform (eBay, Uber, Airbnb, etc.)
Unique risk, regulatory, and feature requirements
* Mostly want to focus on the second bullet point
ETL is most mature use case. Others are upcoming.
Mostly talking about ETL use case today.
Will discuss the low task:DAG ratio
Most DAGs are dynamically generated from a single Python script
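A sketch of what that generation script might look like (the table list is hypothetical): one loop emits a small DAG per table and registers it in globals() so the scheduler can discover it.

# Sketch: one script dynamically generating many small DAGs.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

TABLES = ["payments", "accounts", "payouts"]  # hypothetical table list

for table in TABLES:
    dag = DAG(
        dag_id="etl_{}".format(table),
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
    )
    BigQueryOperator(
        task_id="load_{}".format(table),
        bql="SELECT * FROM staging.{}".format(table),
        destination_dataset_table="warehouse.{}".format(table),
        write_disposition="WRITE_TRUNCATE",
        dag=dag,
    )
    # The scheduler only picks up DAG objects reachable at module level.
    globals()[dag.dag_id] = dag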
The bottom three all have their own Airflow instances
Production is being spun up. Will be run differently. Subsequent talk will cover that.
Run an internal GitHub branch
Run off latest release
Cherry-pick some patches that have been merged to master
Never run patches not on master (no forking!)
* Actual pip install command
Have a connection to the Airflow DB for debugging
DB connections to our MySQL DBs
GCP connections for GCP service accounts (per-team)
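As a sketch, a per-team GCP connection might be referenced from an operator like this (the connection id "bigquery_team_payments" is a hypothetical name):

# Sketch: a task using a per-team GCP service account via its
# Airflow connection id; all names are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG(
    dag_id="per_team_conn_example",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

BigQueryOperator(
    task_id="query_with_team_credentials",
    bql="SELECT 1",
    bigquery_conn_id="bigquery_team_payments",  # hypothetical per-team connection
    dag=dag,
)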
Monolithic repository for all DAGs right now
`dags` directory has ETL/prod/dev folders for DAGs that go into each environment
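That implies a layout roughly like this (assumed from the bullet above and the paths in the refresh script):

airflow-dags/
    dags/
        etl/
        prod/
        dev/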
* Development happens locally
* Send a PR to the airflow-dags repo
* TeamCity CI kicks off on the PR
* First, basic code quality checks run to catch some errors
* Then the Airflow DAG checks described above run
Don’t test DAGs. Treat them as configuration. Treat operators as code.
Pretty much all DAGs are BigQueryOperators, or tasks moving data in and out of BigQuery.
Small task:DAG ratio makes it easier to:
• Turn off tasks
• Reset state
• Get around backfill (by using start_date; see the sketch below)
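For the backfill point, the trick is to give a brand-new DAG a start_date at or near the present, so the scheduler has almost nothing to backfill. A minimal sketch (the date is hypothetical):

# Sketch: sidestepping a long backfill by starting the DAG "now".
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="etl_brand_new_table",
    start_date=datetime(2017, 6, 1),  # hypothetical: set to (roughly) today
    schedule_interval="@daily",
)
# The scheduler only creates runs from start_date forward, so turning
# this DAG on does not kick off a months-long backfill.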
Example table pipeline (sketched below):
• Extract/Load
• Rollups
• View creation
• Data quality checks
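A sketch of how those four stages might chain together for one table (all names and queries are hypothetical; the check operator fails the run if its query returns a falsy value):

# Sketch: extract/load >> rollup >> view creation >> data quality check.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

dag = DAG(
    dag_id="example_table",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

extract_load = BigQueryOperator(
    task_id="extract_load",
    bql="SELECT * FROM staging.example_table",
    destination_dataset_table="warehouse.example_table",
    write_disposition="WRITE_TRUNCATE",
    dag=dag,
)

rollup = BigQueryOperator(
    task_id="rollup",
    bql="SELECT day, SUM(amount) AS total FROM warehouse.example_table GROUP BY day",
    destination_dataset_table="warehouse.example_table_daily",
    write_disposition="WRITE_TRUNCATE",
    dag=dag,
)

create_view = BigQueryOperator(
    task_id="create_view",
    bql="CREATE OR REPLACE VIEW reporting.example_table AS "
        "SELECT * FROM warehouse.example_table_daily",
    use_legacy_sql=False,  # DDL needs standard SQL; assumes a version with this flag
    dag=dag,
)

quality_check = BigQueryCheckOperator(
    task_id="quality_check",
    sql="SELECT COUNT(*) FROM warehouse.example_table",  # zero rows fails the check
    dag=dag,
)

extract_load >> rollup >> create_view >> quality_check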