Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Modern Data Pipelines with Apache Airflow (Momentum Dev Con 2018)

325 views

Published on

Original presentation on Google Slides - https://docs.google.com/presentation/d/1mCgDT7DEj2jsrr09Omm4lAspihmPXaA8r-dBuyPNT5U/edit?usp=sharing

---

Abstract:

Big data needs span both business users and developers across the world. Most classical ETL and BI tools attempted to cater to this hybrid demographic resulting in cluttered GUI tools that were inflexible, inextensible, and frustrating to use. Apache Airflow takes a different approach by representing tasks and config as Python code.

Airflow is a platform to programmatically author, schedule and monitor workflows composed of arbitrary tasks run on regular schedules. Airflow provides a platform for distributed task execution across complex workflows as directed acyclic graphs (DAGs) defined by code.

Built on top of Airflow, Astronomer provides a containerized Airflow service on Kubernetes as well as a variety of Airflow components and integrations to promote code reuse, extensibility, and modularity. The core of our stack is available as cloud hosted, on prem, and is also fully open source.

---

Bio: Taylor D. Edmiston

Taylor Edmiston is a senior software engineer experienced in designing and building backend systems from web apps to APIs to platforms for startups. He has software development experience at startups from multiple top accelerators including AngelPad, Techstars, and The Brandery. Currently, he's a developer on the core team at Astronomer.io working on the customer data platform that runs batch workflows via Airflow and clickstream pipelines via Kafka on top of Kubernetes. He's in the top 25% all time on Stack Overflow having reached over 500k fellow software developers, and the top 1% on Codewars.

On a personal note, he enjoys getting stamps in his passport and has traveled to 9 countries across 4 continents so far.

---

Bio: Andy Cooper

Andy Cooper is a Software Engineer who previously focused on ETL and Business Intelligence development. More recently he has applied those skills into building a Data Engineering Platform at Astronomer.

Outside of work Andy enjoys just about any outdoor activity, including climbing, hiking, biking and skiing.

Published in: Data & Analytics
  • Be the first to comment

Modern Data Pipelines with Apache Airflow (Momentum Dev Con 2018)

  1. 1. Modern Data Pipelines with Apache Airflow Andy Cooper & Taylor Edmiston @ Astronomer.io Momentum Dev Con 2018
  2. 2. Taylor Edmiston ● Backend software engineer building the Airflow platform at Astronomer.io ● 9 years with Python, 6 years as a professional developer ● Top 20% all time on Stack Overflow with a reach of 750k developers ● Enjoys travel - 9 countries / 4 continents About Us Andy Cooper ● Data Engineer ● 6 years of experience developing software and data pipelines ● Began career developing traditional data warehouses with Microsoft stack ● Using Airflow since 1.7
  3. 3. What is Astronomer? ● Astronomer is a data engineering platform built on Apache Airflow and clickstream analytics ● Building tools that make data engineers lives easier ● Seed-stage startup, founded ~3 years ago, located in Cincinnati (OTR) ● AngelPad #9 batch ● https://www.astronomer.io ● https://www.crunchbase.com/organization/astronomer
  4. 4. What do we do? Airflow ● Astronomer Cloud (Managed Airflow) ○ Get up and running with Airflow quickly ● Astronomer Enterprise (docs) ○ Keep your data and workflows in your private cloud ○ Astronomer Spacecamp - Enterprise support & training available (https://www.astronomer.io/blog/announcin g-astronomer-spacecamp/) ● Astronomer Open (docs) ○ The core of our platform is open source — try our Docker images on your machine Clickstream ● A clickstream analytics pipeline and router for user events ● Client-side (web, native mobile) or server-side ● Not an analytics service! We integrate with 50+ ● Free tier ● astronomer.io/clickstream ● 2-min demo video - https://www.youtube.com/watch?v=ru7VM e5MXZk
  5. 5. (~40 min) Outline ● (5 min) Intro ● (10 min) Part I - Airflow overview & concepts ● (10 min) Part II - Example DAGs ● Midpoint Q&A? ● (10 min) Part III - Getting started with Airflow + Astro CLI demo ● (5 min) Summary / Outro ● Q&A
  6. 6. What We’ll Cover ● Airflow Concepts ● Getting Started with Airflow ● Astro CLI ● Preview and Discussion Of Airflow UI ● Q&A
  7. 7. What is Apache Airflow? ● “Airflow is a platform to programmatically author, schedule and monitor workflows.” ● Open Source currently in the Apache Incubator phase ○ 7,500 stars ○ 4,000 commits ○ 400 contributors ● Written in Python ● Leverages Flask web framework
  8. 8. Airflow Concepts
  9. 9. What is a DAG? Directed Acyclic Graph
  10. 10. Define Your Pipelines in Code
  11. 11. A Centralized Web App for All Workflows
  12. 12. ● A quick look into DAG and task progress ● Error Logging ● Connections & Variables ● Connection Pooling Web App Features
  13. 13. Hooks and Operators
  14. 14. ● An interface to an external system ● Often a wrapper for an API client ● Examples ○ DbApiHook ○ S3Hook ○ SlackHook Hooks
  15. 15. ● Sensor Operators ○ S3KeySensor ○ S3PrefixSensor ○ HTTPSensor ● Action Operators ○ BashOperator ○ PythonOperator ○ EmailOperator ● Transfer Operators ○ SalesforceToRedshiftSchemaSync ○ SalesforceToS3 Operators
  16. 16. DAG Runs & Task Instances
  17. 17. Dynamic DAGs
  18. 18. Executors & Scaling
  19. 19. ● SequentialExecutor ● LocalExecutor ○ No additional dependencies ○ Multi-threaded out of the box ● CeleryExecutor ● MesosExecutor ● KubernetesExecutor (future) Executors
  20. 20. Plugins
  21. 21. ● Extend the Airflow API ● Build new dashboards ● Create custom Hooks and Operators ● Astronomer maintains the most comprehensive collection of Airflow Plugins ○ github.com/airflow-plugins ● Code reuse, composition, good software engineering practices, etc ● Examples ○ Salesforce To Redshift Plugin ○ airflow-api-plugin ○ Airflow DAG Creation Manager Plugin What can a plugin do?
  22. 22. Example DAGs
  23. 23. ● GitHub stats DAG ● Clickstream Redshift loader DAG ○ ~200 million events per month from customer apps ○ ~2 million Airflow task instances per month ● https://github.com/airflow-plugins/Example-Airflow-DAGs DAG Examples
  24. 24. Github Issue and Commit Tracking Ex.
  25. 25. Clickstream Redshift DAG
  26. 26. Clickstream Redshift DAG ● Your Website → Astronomer Clickstream → S3 → [S3 sensor → Redshift copy via Apache Spark] ● Dynamic DAGs configured via API → Scheduler (cached) → Variable
  27. 27. Astro CLI The fastest way to get started with Airflow
  28. 28. How can I get started with Airflow? ● Source Code ○ https://github.com/astronomerio/astro-cli ● Install CLI ○ $ curl -sL https://install.astronomer.io | sudo bash ● Start a Project ○ $ mkdir test-project && cd test-project ○ $ astro airflow init ○ $ astro airflow start
  29. 29. Takeaway ● Part I - Airflow overview & concepts ● Part II - Example DAGs ● Part III - Getting started with Airflow + Astro CLI demo
  30. 30. ● Official ○ https://github.com/apache/incubator-airflow ○ https://airflow.apache.org ○ Airflow Dev Mailing List ○ Apache Airflow meetups ● Community ○ https://github.com/airflow-plugins ○ https://soundcloud.com/the-airflow-podcast ○ https://github.com/jghoman/awesome-apache-airflow ● Related Talks ○ https://blog.tedmiston.com/talks/ Resources
  31. 31. Contact Info ● Andy ○ https://twitter.com/andscoop ○ https://www.linkedin.com/in/andscoop/ ○ https://andscoop.com/ ○ andy.cooper@astronomer.io ● Taylor ○ https://twitter.com/kicksopenminds ○ https://www.linkedin.com/in/tedmiston/ ○ https://blog.tedmiston.com ○ taylor@astronomer.io

×