
Fuss Free ETL with Airflow


Data scientists spend a lot of time doing (and redoing) tedious ETL work, especially when they don't have data engineers to support their ETL pipelines. It doesn't have to be that way. This talk will cover Airflow, an awesome open-source ETL workflow tool developed by Airbnb (and inspired by Facebook's Dataswarm ETL system). We will go over how data scientists can set up, monitor, and self-serve their pipelines without data engineering's support. There will also be a live demo.

http://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/231024947/

Published in: Technology

  1. Fuss Free ETL: Tame your data pipelines with Airflow. Vijay Bhat, @vijaysbhat, /in/vijaysbhat
  2. About Me: 13 Years In The Industry; Mobile Financial Services; Smart Meter Analytics; Data Science Applications; Forecasting; Fraud Detection; Recommendation Systems; Social Media Growth Analytics
  3. 80%
  4. What, When, How
  5. What does your ETL process look like?
  6. What do we do to get here?
  7. Checklist:
      ❏ Automation
      ❏ Scheduling
      ❏ Version Control
      ❏ Redundancy
      ❏ Error Recovery
      ❏ Monitoring
  8. Evolution
  9. Introducing Airflow
      ● Open source ETL workflow engine
      ● Developed by Airbnb
      ● Inspired by Facebook’s Dataswarm
      ● Production ready
      ● Pipelines written in Python
  10. Defining Pipelines
  11. Pipeline Code Structure: Define default arguments

          from datetime import datetime, timedelta

          default_args = {
              'owner': 'airflow',
              'depends_on_past': False,
              'start_date': datetime(2015, 6, 1),
              'email': ['airflow@airflow.com'],
              'email_on_failure': False,
              'email_on_retry': False,
              'retries': 1,
              'retry_delay': timedelta(minutes=5),
              # 'queue': 'bash_queue',
              # 'pool': 'backfill',
              # 'priority_weight': 10,
              # 'end_date': datetime(2016, 1, 1),
          }

  12. Pipeline Code Structure: Instantiate DAG

          from airflow import DAG

          dag = DAG(
              'tutorial',
              default_args=default_args,
              schedule_interval=timedelta(1))  # run once a day

  13. Pipeline Code Structure: Define tasks

          # Import path used by Airflow 1.x releases
          from airflow.operators.bash_operator import BashOperator

          t1 = BashOperator(
              task_id='print_date',
              bash_command='date',
              dag=dag)

          t2 = BashOperator(
              task_id='sleep',
              bash_command='sleep 5',
              retries=3,  # overrides 'retries' from default_args
              dag=dag)

  14. Pipeline Code Structure: Chain tasks

          t2.set_upstream(t1)

          # This means that t2 will depend on t1
          # running successfully to run
          # It is equivalent to
          # t1.set_downstream(t2)

          # t3 (task_id='templated') is not defined on these slides
          t3.set_upstream(t1)

          # all of this is equivalent to
          # dag.set_dependency('print_date', 'sleep')
          # dag.set_dependency('print_date', 'templated')

  15. Then we get this pipeline: t1 → t2 and t1 → t3
  16. Deployment Process (diagram labels): Develop, Test, PR Review, Code Merge, Prod Airflow Scheduler
  17. Job Runs
  18. Logs
  19. Performance - Gantt Chart
  20. Operators
      ● Action
          ○ PythonOperator
          ○ HiveOperator
          ○ ...
      ● Transfer
          ○ S3ToHiveTransfer
          ○ HiveToDruidTransfer
          ○ ...
      ● Sensor
          ○ HdfsSensor
          ○ HivePartitionSensor
          ○ ...
  21. Useful Configuration Options
      ● depends_on_past
          ○ wait until task run for previous day is complete?
      ● wait_for_downstream
          ○ dependency on downstream tasks for previous day.
      ● sla
          ○ send email alerts if SLA is missed.
  22. CLI Commands
      ● airflow [-h]
          ○ webserver
          ○ scheduler
          ○ test
          ○ run
          ○ backfill
          ○ ...
  23. Example: Hacker News Sentiment Tracker
  24. Hacker News Example: Data Flow (diagram labels): Hacker News, Pull Data, Call API, IBM Watson, Reporting DB, Caravel Dashboard
  25. Hacker News Example: Demo
  26. pip install airflow!
  27. Thank You. vijay@innosparkdata.com
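
The dependency declarations on slides 14 and 15 can be illustrated without Airflow itself. The sketch below is not Airflow code and does not reflect how the Airflow scheduler is implemented; it only uses Python's standard-library graphlib module (Python 3.9+) to show the run order implied by making 'sleep' and 'templated' downstream of 'print_date'. The helper name run_in_waves and the upstream dictionary are hypothetical, introduced here for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task id to the set of tasks it waits on (its upstream tasks).
# This mirrors t2.set_upstream(t1) and t3.set_upstream(t1) from slide 14.
upstream = {
    "print_date": set(),          # t1 has no upstream tasks
    "sleep": {"print_date"},      # t2 waits on t1
    "templated": {"print_date"},  # t3 waits on t1
}


def run_in_waves(upstream):
    """Group tasks into waves; each task's upstreams sit in earlier waves."""
    sorter = TopologicalSorter(upstream)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = run_in_waves(upstream)
print(waves)  # [['print_date'], ['sleep', 'templated']]
```

Tasks in the same wave have no dependency on each other, which is why t2 and t3 appear side by side under t1 in the pipeline on slide 15.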
