Airflow at Lyft

April 10th 2019
Tao Feng | @feng-tao | Software Engineer, Lyft
Airflow @ Lyft
2
Who
● Tao Feng
● Engineer at Lyft Data Platform
● Apache Airflow PMC and Committer
● Working on different data products (Airflow,
Amundsen, etc)
Agenda
• Airflow in general
• Airflow @ Lyft
• Upstream @ Lyft
• Next Step
• Summary
3
Airflow in general
4
Airflow in general
5
• Airflow recently became an Apache top-level project (TLP).
‒ 20 PMC members / committers in total
• Recent releases: 1.10, 1.10.1, 1.10.2 (1.10.3 is coming).
‒ New features: Airflow RBAC, Airflow K8s integration, etc.
• New process for proposing architecture changes:
‒ Airflow Improvement Proposals (currently 19+ proposals)
• The community recently conducted an Airflow user survey (link).
11k+ GitHub stars
740+ contributors
250+ companies using Airflow
Airflow @ Lyft
6
7
Core Infra high level architecture @ Lyft
Airflow Architecture @ Lyft
8
Airflow Architecture @ Lyft
• WebUI: the portal where users view the status of their DAGs.
• Metadata DB: the Airflow metastore, which stores various job statuses.
• Scheduler: a multi-process service that parses the DAG bag, creates DAG objects, and
triggers the executor to run tasks whose dependencies are met.
• Executor: a message-queuing process that orchestrates worker processes to execute
tasks. We use the CeleryExecutor at Lyft.
• TARS: the Airflow development / backfill environment, which provides access to production
data. 9
Airflow Architecture @ Lyft
10
• Main Cluster Config: Apache Airflow 1.8.2 with cherry-picks and numerous in-house
Lyft patches.
• Scale: three sets of ASGs for workers.
‒ ASG #1: 15 worker nodes of the r5.4xlarge (16 vCPU, 128 GB mem) type. This fleet of
workers processes low-priority, memory-intensive tasks.
‒ ASG #2: 3 worker nodes of the m4.4xlarge (16 vCPU, 64 GB mem) type. This fleet of
workers is dedicated to DAGs with a strict SLA.
‒ ASG #3: 1 worker node of the m4.10xlarge (40 vCPU, 160 GB mem) type. This single node
processes compute-intensive workloads from a critical team’s DAGs.
‒ Backfill box (TARS): 1 node of the m4.16xlarge (64 vCPU, 256 GB mem) type. This box is used for fast
DAG prototyping and backfill.
Airflow daily stats @ Lyft
11
600+
DAGs
800+
DagRuns
25k+
task instances (TIs)
Airflow Monitoring @
Lyft
12
Airflow Availability
• Scheduler and worker health check
‒ Uses a canary monitoring DAG.
‒ If no task has been scheduled for 10 minutes, it is considered downtime.
• UI health check
‒ Leverages the Envoy membership health check.
• Total system uptime percentage
‒ Airflow is down if the scheduler, workers, or web server is down.
13
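The scheduler health check above reduces to a simple threshold comparison: the canary DAG heartbeats on a short schedule, and we alert when the last scheduled task is too old. A minimal sketch, with illustrative names rather than Lyft's actual monitoring code; the 10-minute threshold is the one quoted above:

```python
from datetime import datetime, timedelta

# If no canary task has been scheduled within the threshold,
# the scheduler/workers are considered down.
DOWNTIME_THRESHOLD = timedelta(minutes=10)

def scheduler_is_healthy(last_scheduled_at: datetime, now: datetime) -> bool:
    """Return True if a canary task was scheduled within the downtime threshold."""
    return now - last_scheduled_at <= DOWNTIME_THRESHOLD
```

A monitoring job would evaluate this every minute against the most recent canary task's scheduled timestamp.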
Schedule Delay
• scheduler delay = TI.start_time - TI.execution_date
14
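The schedule-delay metric above is plain datetime arithmetic over two task instance (TI) fields. Note the result also absorbs the schedule interval itself (execution_date is the logical date of the run, not when it was queued), so it is a rough signal rather than a precise queueing time:

```python
from datetime import datetime

def scheduler_delay_seconds(start_time: datetime, execution_date: datetime) -> float:
    """scheduler delay = TI.start_time - TI.execution_date, in seconds."""
    return (start_time - execution_date).total_seconds()
```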
DAG last run time
• The time that has elapsed since the DAG file was last processed.
• If this time grows too long, the scheduler is having trouble processing the
DAG files.
‒ E.g., parser threads may be tied up by misbehaving DAG files.
15
Executor parallelism
• Parallelism: controls the number of concurrently running tasks.
‒ Monitor your worker nodes’ CPU utilization before increasing this value.
16
Airflow monitoring @ Lyft
17
Stats name and meaning:
• dag_processing.last_run.seconds_ago.<dag_file>: seconds since <dag_file> was last processed
• executor.open_slots: number of open slots on the executor (parallelism - # running tasks)
• executor.queued_tasks: number of queued tasks on the executor
• executor.running_tasks: number of running tasks on the executor
• pool.starving_tasks.<pool_name>: number of starving tasks in the pool, i.e., how many tasks are waiting because of the pool's slot limit
• …
Airflow Customization
@ Lyft
18
Airflow customization @ Lyft
• UI auditing
• Extra link for task instance UI panel (AIRFLOW-161)
19
Airflow customization @ Lyft
• DAG dependency graph
20
Improve Airflow
Reliability @ Lyft
21
Improving Airflow Performance @ Lyft
• Reduce Airflow UI page load time
‒ Change default_dag_run_display_number to 5.
• Tunables that impact task execution parallelism
‒ parallelism
‒ concurrency
‒ max_active_runs
‒ pool
22
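A rough mental model of how these tunables compose: each layer caps the number of concurrently running task instances. This is a simplification of Airflow's scheduler logic, not its actual implementation, and it leaves out max_active_runs, which caps DagRuns rather than individual tasks:

```python
# parallelism is the global cap across all workers, dag_concurrency caps a
# single DAG, and pool_slots caps the pool the tasks run in. The effective
# limit for one DAG's tasks in one pool is the tightest of the three.
def runnable_task_cap(parallelism: int, dag_concurrency: int, pool_slots: int) -> int:
    """Upper bound on concurrently running task instances for one DAG in one pool."""
    return min(parallelism, dag_concurrency, pool_slots)
```

This is why raising only one knob (e.g., parallelism) often does nothing: another layer is usually the binding constraint.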
Improving Airflow Reliability at Lyft
• Source control for pools
‒ All Airflow pools are defined in a source-controlled file on GitHub.
‒ Airflow pools are configured from this file at runtime.
• Integration tests for DAGs enforce best practices and improve reliability
‒ Every DAG should be loadable within a time threshold.
‒ Every DAG should be associated with valid pools.
‒ External task sensors should be valid (the referenced (dag_id, task_id) exists).
‒ Every pool is used by at least one DAG.
‒ Every sensor has a reasonable timeout.
‒ Every DAG has a non-dynamic start date.
• Secure UI access
23
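The integration tests above can be sketched as checks over the parsed DAG bag. This sketch validates a simplified in-memory DAG description rather than a real DagBag; the threshold value and field names are hypothetical:

```python
LOAD_TIME_THRESHOLD_S = 30.0  # hypothetical per-file load budget

def check_dag(dag: dict, valid_pools: set, load_seconds: float) -> list:
    """Return the list of best-practice rules a DAG violates (empty = passes)."""
    errors = []
    if load_seconds > LOAD_TIME_THRESHOLD_S:
        errors.append("DAG file too slow to load")
    if dag["pool"] not in valid_pools:
        errors.append("pool not defined in source-controlled pool file")
    if dag["start_date_is_dynamic"]:  # e.g. start_date=datetime.now()
        errors.append("dynamic start date")
    return errors
```

Running checks like these in CI, before a DAG ever reaches the scheduler, is what turns the best practices listed above into enforced invariants.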
Production Debug @
Lyft
24
Production Debug @ Lyft
• We document every production issue investigation in a shared doc.
• A few methodologies:
‒ View the centralized Airflow dashboard.
‒ Identify whether it is a UI or scheduler (backend) issue.
‒ View the webserver or scheduler log.
∙ If the log is not available on the machine, check Kibana.
∙ To further identify issues, we sometimes even look at logs in S3.
‒ Use different tools for further investigation:
∙ If an exception is thrown, understand which part of the Airflow code throws it.
∙ On a CPU / memory alarm, use top to identify which DAG causes the issue.
∙ If the failure is Celery-related, log in to the Celery Flower UI to investigate further.
∙ ...
25
Airflow Gotchas @ Lyft
26
Airflow Gotchas at Lyft
• DST
‒ The UI doesn’t have timezone support, even upstream.
‒ Our internal scheduler version has no timezone support.
• DAGs with dynamic start dates
‒ Hard to predict when the DAG is scheduled.
• Long-running external task sensors that don’t have valid external tasks.
• HivePartitionSensor doesn’t work for partial partitions
‒ It only checks whether data exists, not whether the data is fully loaded.
• Backfill experience
‒ We use the LocalExecutor to backfill.
• Long-running sensors occupy the pool’s task slots.
• Users confuse DAG-level and task-level arguments
‒ E.g., putting max_active_runs in the default task arguments.
• Legacy high-abstraction framework over Airflow
‒ Hard to debug, for users and for us. 27
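To illustrate the dynamic start date gotcha above: a static start_date gives the scheduler a fixed anchor for computing schedule intervals, while a dynamic value like datetime.now() yields a different anchor on every DAG-file parse, which is why the run time becomes unpredictable. A minimal sketch:

```python
from datetime import datetime, timedelta

# The scheduler derives run intervals from start_date. A static value is the
# same on every parse; datetime.now() moves the anchor each time the file is
# re-parsed, so the "first due run" keeps shifting.
STATIC_START = datetime(2019, 1, 1)  # good: fixed anchor

def first_run_date(start_date: datetime, schedule_interval: timedelta) -> datetime:
    """A run covering [start_date, start_date + interval) fires at the interval's end."""
    return start_date + schedule_interval
```

With STATIC_START the first run is deterministic; with datetime.now() it depends on when the scheduler happened to parse the file.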
Upstream @ Lyft
28
Improve backfill
experience
29
Improve backfill experience
30
• New options for backfill
‒ --reset_dagruns: if used, Airflow first checks whether any existing dag_runs /
task_instances are associated with the backfill date range. If so, it prompts the user to
confirm clearing those task_instances first. (AIRFLOW-2718)
‒ --rerun_failed_tasks: if used, Airflow automatically retries failed tasks
without requiring any user intervention. (AIRFLOW-2566)
• Backfill respects pools for isolation (AIRFLOW-1557)
Improve backfill experience
Support batch backfill
• Use {{ prev_ds }} and {{ ds }} in SQL
‒ prev_ds equals ds - schedule_interval.
‒ Users can change the schedule_interval in
the DAG file during backfill.
• Users can override DAG params with the -c
option during backfill.
31
INSERT OVERWRITE TABLE {{ dest_db(default.superhero_data) }}
SELECT supe.superhero_name AS superhero_name,
       pop.popularity AS popularity
FROM {{ source_table(events.superheroes) }} supe
WHERE ds >= {{ prev_ds }} AND ds < {{ ds }}

airflow backfill superheroes -s 2018-05-01 -e 2018-05-08 -c {'hive_cluster': 'backfill_cluster'}
Airflow DAG level
access
33
Airflow DAG level access @ Lyft
34
• DAG access control has always been a real need at Lyft
‒ HR data, financial data, etc.
‒ The workaround was to build an isolated, dedicated cluster for each use case.
• Airflow introduced the RBAC feature in 1.10
‒ The new Airflow webserver is based on Flask-AppBuilder.
‒ It ships with 5 static roles (Admin, User, Op, Viewer, Public).
‒ ...
• Airflow DAG level access (AIRFLOW-2267)
‒ Provides additional granular access control at the DAG level.
Airflow DAG level access @ Lyft
• The new Airflow UI migrated from Flask-Admin to Flask-AppBuilder (FAB).
• It uses FAB’s security model.
35
Airflow DAG level access @ Lyft
• Which Airflow releases include the change?
‒ 1.10.2 includes the initial implementation.
‒ 1.10.3 (upcoming) includes the enhancements.
• How it works
‒ Two new permissions: can_dag_read (read), can_dag_edit (write).
‒ DAG-level roles can be created through the CLI / UI by an Admin (doc).
‒ A DAG-level role can only see its viewable DAGs.
‒ Users can declare permissions in the DAG file (AIRFLOW-2694).
36
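The per-DAG permissions above can be declared in the DAG file itself (AIRFLOW-2694) as a role-to-permissions mapping. The sketch below models the lookup with plain dicts; 'finance_team' is a hypothetical role name, and in a real DAG file such a mapping would be passed to the DAG constructor as the access_control argument:

```python
# Shape of a DAG-level access declaration: role name -> set of DAG permissions.
# 'finance_team' is hypothetical; Viewer is one of the static FAB roles.
access_control = {
    "finance_team": {"can_dag_read", "can_dag_edit"},
    "Viewer": {"can_dag_read"},
}

def viewable_dags(user_roles: list, dag_access: dict) -> list:
    """Return the DAG ids a user can see, given per-DAG role -> perms mappings."""
    return sorted(
        dag_id
        for dag_id, acl in dag_access.items()
        if any("can_dag_read" in acl.get(role, set()) for role in user_roles)
    )
```

This mirrors the "DAG-level role can only see its viewable DAGs" behavior: the UI filters the DAG list down to entries where the user's roles carry can_dag_read.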
Airflow DAG level access @ Lyft
37
• We built a new cluster based on the
Airflow master branch and
onboarded a couple of new sensitive-data
use cases.
‒ Each use case has its own repo.
‒ User-role relationships are source-
controlled in a YAML file.
• DAG owners specify the access
control info in the DAG files.
• Gotchas
‒ New user onboarding
‒ Integration between FAB and
Google authentication (OAuth)
‒ Integration with the internal ACL
service
‒ ...
User registration flow
Next Step
38
Next Step
• Support the Airflow DAG-level access feature in beta internally.
• Integrate Airflow RBAC / DAG-level features with the internal ACL service (FAB issue).
• Migrate all existing DAGs to the new cluster.
• Explore running Airflow with the K8s executor internally.
39
Summary
40
Summary
41
• The Airflow community has been growing a lot!
• We shared our experience operating Airflow at Lyft.
• We shared some of our upstream work:
‒ Improving the Airflow backfill experience
‒ Supporting Airflow DAG-level access
Acknowledgement
42
• Members who maintain Airflow at Lyft
‒ Alagappan Sethuraman
‒ Andrew Stahlman
‒ Chao-han Tsai
‒ Jinhyuk Chang
‒ Junda Yang
‒ Max Payton
‒ Tao Feng
• Special thanks to Maxime Beauchemin, who provided numerous suggestions for us.
Tao Feng | @feng-tao
Slides at TBD
Blog at go.lyft.com/airflowblog
Icons under Creative Commons License from https://thenounproject.com/ 43
Backup
44

Editor's Notes

  1. Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows; workflows are defined as code. Growing community. Todo: first mention the stat, then the fact.
  2. What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client or from backend events triggered on the server. The data comes to our message bus (Kinesis/Kafka) and then, with light ELTing, is written to S3 where it persists; today we keep all the data in archival. We then develop data models and transform raw events into Hive tables. We use Hive for long-running queries and Presto for interactive queries. People build dashboards on top of Hive and do exploratory analysis in Presto. Airflow is used for scheduling (executive dashboards, metric aggregation, derived data generation, machine learning feature computation).
  3. https://github.com/dpgaspar/Flask-AppBuilder/issues/518 We are not the only team managing Airflow, but we are the biggest one at Lyft. Previously, other teams with security requirements ran separate clusters for their own use cases.
  4. Parallelism set to 200. Worker instance types: r5.4xlarge (16 vCPU, 128 GB mem), m4.4xlarge (16 vCPU, 64 GB), m4.10xlarge (40 vCPU, 160 GB), m4.16xlarge (64 vCPU, 256 GB).
  5. Canary monitoring DAG: when we do Airflow maintenance, we check whether the canary DAG is running as the signal for whether there are any issues.
  6. Scheduler delay roughly equals the time for the scheduler to pick up the task (depends on the scheduling loop and task priority) plus the time for a Celery worker to pick the task up from the Celery broker. Measured with the canary monitoring DAG.
  7. Open slots, running tasks, queued tasks
  8. At Lyft we mostly use ExternalTaskSensor and HivePartitionSensor. One of our interns’ summer projects built a DAG dependency graph based on ExternalTaskSensor and HivePartitionSensor. The info is generated by a daily Airflow DAG.
  9. Parallelism: this variable controls the number of task instances that the Airflow workers can run simultaneously. Users can increase the parallelism variable in airflow.cfg; we normally suggest increasing this value when doing backfill. Concurrency: the Airflow scheduler will run no more than concurrency task instances for your DAG at any given time. Concurrency is defined as a DAG input argument; if you do not set it, the scheduler uses the default value from the dag_concurrency entry in airflow.cfg. max_active_runs: Airflow will run no more than max_active_runs DagRuns of your DAG at a given time; if you do not set it on your DAG, Airflow uses the default value from the max_active_runs_per_dag entry in airflow.cfg. We suggest users not set depends_on_past to true, and increase this configuration during backfill. Pool: an Airflow pool is used to limit execution parallelism. Users can increase priority_weight for a critical task.
  10. Todo: mention pool source control. Todo: need some examples for reliability.
  11. Todo: mention pool source control. Todo: need some examples for reliability.
  12. Data engineering handbook
  13. Provide a util to allow users to easily promote the partition to the table in the dest schema.
  14. Talk about backfill improvement