
Airflow - Insane power in a Tiny Box

A walk-through of what Airflow is and isn't, and how to use Airflow to construct dynamic tasks and automate your entire ETL process. The presentation can be seen here: http://dovy.io/airflow/airflow-strength-and-weaknesses-and-dynamic-tasks


  1. Airflow - Insane power in a tiny box
  2. A BRIEF HISTORY OF DATA PIPELINES. For real, this is how we used to do it…
  3. The dev's answer to EVERYTHING: cron / crontab. This works great for some use cases, but falls short in many other ways. It runs your job at the time you set, every time it can, provided the computer is on. But there is no recovery, logs are self-managed, you are never sure when it actually ran, and it can only execute on one computer.
  4. It keeps tasks alive: Supervisor / supervisord. A fantastic utility that works as expected, with an optional embedded UI and CLI utility. It keeps everything up, lets you see what's going on, and even rotates logs and supports process groups. But it still executes on the one computer, and it isn't more than it advertises to be. Limited scope.
  5. Someone said… we can do better.
  6. Airflow is a "workflow management system" created by Airbnb. "Today, we are proud to announce that we are open sourcing and sharing Airflow, our workflow management platform." June 2, 2015. https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8 And it's all written in Python!
  7. What IS Airflow? But really… it's dependency control, task management, task recovery, charting, logging, alerting, history, folder watching, trending, dynamic tasks: ANYTHING your pipeline may need…
  8. Airflow is NOT… perfect. https://airflow.apache.org/ So contribute, and help it get better!
  9. The Airflow architecture: Webserver / UI, Scheduler, Worker.
  10. WITH VERY LITTLE WORK… Airflow can be run locally, or in much more complex configurations.
  11. Master / Slave / UI configuration, with logs being fed to GCS.
  12. How we provision Airflow.
  13. We place it all on a single Google Compute Engine VM. No bull! Excuse me? CPU: n1-standard-2 (2 vCPUs, 7.5 GB memory). HD: 30 GB standard persistent disk (non-SSD).
  14. LET'S TALK ABOUT AIRFLOW DAGs
  15. A few key Airflow concepts.
  01 DAGs: a Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Written in Python.
  02 Operators: an operator describes how a single task performs in a workflow (DAG). There are many types of operators: BashOperator, PythonOperator, EmailOperator, HTTPOperator, MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, Sensor, DockerOperator.
  03 Tasks: once an operator is instantiated, it's referred to as a task.

      import time
      from datetime import datetime
      from pprint import pprint

      from airflow import DAG
      from airflow.operators.python_operator import PythonOperator

      dag = DAG(
          dag_id='example_python_operator',
          schedule_interval=None,
          start_date=datetime(2017, 1, 1),  # a start_date is required
      )

      def my_sleeping_function(random_base):
          '''This is a function that will run within the DAG execution'''
          time.sleep(random_base)

      def print_context(ds, **kwargs):
          pprint(kwargs)
          print(ds)
          return 'Whatever you return gets printed in the logs'

      run_this = PythonOperator(
          task_id='print_the_context',
          provide_context=True,
          python_callable=print_context,
          dag=dag)

      # Generate 10 sleeping tasks, sleeping from 0.0 to 0.9 seconds respectively
      for i in range(10):
          task = PythonOperator(
              task_id='sleep_for_' + str(i),
              python_callable=my_sleeping_function,
              op_kwargs={'random_base': float(i) / 10},
              dag=dag,
          )
          task.set_upstream(run_this)
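  To make the operator list above concrete, here is a minimal sketch (not from the deck; the dag_id, schedule, and echo commands are illustrative assumptions) of a BashOperator DAG on a cron-style schedule, which gives you retries, logging, and a UI where a crontab entry gives you none:

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.bash_operator import BashOperator

      # Hypothetical DAG that runs every day at 02:00, like a crontab entry.
      dag = DAG(
          dag_id='nightly_example',       # illustrative name
          schedule_interval='0 2 * * *',  # standard cron syntax
          start_date=datetime(2017, 1, 1),
      )

      extract = BashOperator(
          task_id='extract',
          bash_command='echo "pull data from the source"',  # placeholder
          dag=dag,
      )

      load = BashOperator(
          task_id='load',
          bash_command='echo "load data into the warehouse"',  # placeholder
          dag=dag,
      )

      # load only runs after extract succeeds
      load.set_upstream(extract)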
  16. Stop doing things the way you always have; think dynamically. You can automate your tasks by reading source code or listing files in a directory. You don't have to worry about execution order; you only need to present Airflow with the relationships. Think in terms of how you can remove human error. Let Airflow work for you, as in the sketch below.
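  As a hedged illustration of listing files to generate tasks (the JOBS_DIR path, the process_file callable, and the dag_id are assumptions, not from the deck), a DAG file can scan a directory and create one task per file it finds:

      import os
      from datetime import datetime

      from airflow import DAG
      from airflow.operators.python_operator import PythonOperator

      JOBS_DIR = '/etc/airflow/jobs'  # hypothetical directory of job files

      def process_file(path):
          # Placeholder: parse and run the job described by this file.
          print('processing %s' % path)

      dag = DAG(
          dag_id='dynamic_from_directory',  # illustrative name
          schedule_interval='@daily',
          start_date=datetime(2017, 1, 1),
      )

      # One task per file: drop in a new job file and a new task appears,
      # with no DAG edits and no chance of human error.
      for filename in sorted(os.listdir(JOBS_DIR)):
          PythonOperator(
              task_id='process_' + os.path.splitext(filename)[0],
              python_callable=process_file,
              op_kwargs={'path': os.path.join(JOBS_DIR, filename)},
              dag=dag,
          )

  Adding a file to the directory is all it takes to extend the pipeline.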
  17. Airflow really shines with dynamic tasks. A dictionary of dependencies: what if you made a script that parsed all your jobs and detected all dependencies automatically?

  Now what if you took that dictionary and fed it into Airflow? How would that simplify your pipeline?

      dependencies = {
          'topic_billing_frequency': [
              'dim_billing_frequency',
              'dim_account'
          ],
          'topic_payment_method': [
              'dim_credit_card_type',
              'dim_payment_accounts'
          ]
      }

  Let's take a look…
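  As a sketch of the "parse all your jobs" idea (the SQL_DIR directory, the one-.sql-file-per-job layout, and the regex are illustrative assumptions, not the presenter's actual parser), dependencies can be detected by scanning each job's SQL for the tables it reads:

      import os
      import re

      SQL_DIR = 'sql'  # hypothetical directory holding one .sql file per job

      def detect_dependencies(sql_dir):
          '''Build {job_name: [tables it reads]} by scanning FROM/JOIN clauses.'''
          dependencies = {}
          for filename in os.listdir(sql_dir):
              if not filename.endswith('.sql'):
                  continue
              job = os.path.splitext(filename)[0]
              with open(os.path.join(sql_dir, filename)) as f:
                  sql = f.read()
              # Naive parse: any table named after FROM or JOIN is a dependency.
              tables = re.findall(r'(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)', sql, re.IGNORECASE)
              dependencies[job] = sorted(set(tables))
          return dependencies

      dependencies = detect_dependencies(SQL_DIR)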
  18. Airflow really shines with dynamic tasks. The code to run it all:
  01 Top-level dependencies: the top-level tasks are created first. Each of these tasks sits downstream of creating the cluster and upstream of deleting it.
  02 Child dependencies: each child dependency is then iterated over, and a task is created for it. Each child is given the delete task as a downstream, so the cluster is never deleted until all tasks are complete.
  03 Connect children to parents: each child task is set upstream of its parent, so a parent only runs once its dependencies have finished.

      all_tasks = {}

      # Create all parent tasks, top level
      for key, value in dependencies.all_dependencies.items():
          if key not in all_tasks:
              all_tasks[key] = PythonOperator(
                  task_id=key,
                  python_callable=process,
                  op_kwargs={},
                  provide_context=True,
                  dag=dag,
                  retries=30,
                  retry_delay=timedelta(minutes=10),
                  on_retry_callback=airflow_retry_function,
                  on_failure_callback=airflow_error_function,
                  on_success_callback=airflow_success_function,
              )
              all_tasks[key].set_upstream(task_create_cluster)
              all_tasks[key].set_downstream(task_delete_cluster)

      # Create all nested dependency tasks
      for key, value in dependencies.all_dependencies.items():
          for item in value:
              if item not in all_tasks:
                  all_tasks[item] = PythonOperator(
                      task_id=item,
                      python_callable=process,
                      op_kwargs={},
                      provide_context=True,
                      dag=dag,
                      retries=30,
                      retry_delay=timedelta(minutes=10),
                      on_retry_callback=airflow_retry_function,
                      on_failure_callback=airflow_error_function,
                      on_success_callback=airflow_success_function,
                  )
                  all_tasks[item].set_downstream(task_delete_cluster)
              # The child runs before its parent
              all_tasks[item].set_downstream(all_tasks[key])
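  One quick, hedged way to sanity-check the graph those loops produce (assuming the same dag object as above) is to print each task together with its upstream task ids:

      # Print every task and the tasks that must finish before it
      for task in dag.tasks:
          print('%s <- %s' % (task.task_id, sorted(task.upstream_task_ids)))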
  19. What does that code do? This is real code being used today.
  20. Dovy Paukstys, Consultant at Caserta. #geek #bigdata #redux How can I help? http://dovy.io http://twitter.com/simplerain dovy.paukstys@caserta.com http://reduxframework.com https://github.com/dovy/ http://linkedin.com/in/dovyp
