Building a Data Pipeline
using Apache Airflow
(on AWS / GCP)
Yohei Onishi
PyCon SG 2019, Oct. 11 2019
Presenter Profile
● Name: Yohei Onishi
● Data Engineer at a Japanese
retail company
● Based in Singapore since Oct.
2018
● Apache Airflow Contributor
Objective
● Expected audience: Data engineers
○ who are working on building a data pipeline
○ who are looking for a better workflow solution
● Goal: Provide the following so they can start using Airflow
○ Airflow overview and how to author workflow
○ Airflow cluster and CI/CD pipeline
○ Data engineering services on AWS / GCP
Data pipeline
data source -> collect -> ETL -> analytics -> data consumer
● data source: micro services, enterprise systems, IoT devices
● collect: object storage, message queue
● data consumer: micro services, enterprise systems, BI tool
Example: logistics operation monitoring
[Diagram labels: factory, warehouse, store; FA shipment, shipment order, WH receipt / shipment, store receipt; inventory management system; ETL; KPI report; regional logistics operators]
Airflow overview
● Open-sourced by Airbnb, now an Apache top-level project
● Cloud Composer: managed Airflow cluster on GCP
● Dynamic workflow generation by Python code
● Easily extensible, so you can fit it to your use case
● Scalable: uses a message queue to orchestrate an
arbitrary number of workers
● Workflow visualization
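The message-queue bullet above can be pictured with a small standard-library sketch. This is not Airflow's own executor code: `queue.Queue` stands in for the message broker, threads stand in for worker nodes, and the task names are hypothetical.

```python
import queue
import threading


def run_tasks(task_names, num_workers):
    """Fan tasks out to num_workers threads through one shared queue."""
    broker = queue.Queue()   # stand-in for the message broker
    results = {}             # task name -> id of the worker that ran it
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            task = broker.get()
            if task is None:         # sentinel: no more work for this worker
                broker.task_done()
                return
            with lock:
                results[task] = worker_id
            broker.task_done()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for name in task_names:
        broker.put(name)
    for _ in threads:
        broker.put(None)             # one sentinel per worker
    broker.join()
    for t in threads:
        t.join()
    return results


done = run_tasks(["db_to_s3", "s3_to_s3", "kpi_report"], num_workers=2)
print(sorted(done))
```

Adding capacity in this picture means raising `num_workers`; the producer side does not change, which is the property the bullet is pointing at.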
Example: Copy a file from one S3 bucket to another
[Diagram: export records as CSV to a bucket in the local region, then transfer the file to a regional bucket (Singapore region, US region, EU region)]
DEMO: UI and source code
sample code: https://github.com/yohei1126/pycon-apac-2019-airflow-sample
Concept: Directed acyclic graph, operator, task, etc
custom_param_per_dag = {'sg': { ... }, 'eu': { ... }, 'us': { ... }}
for region, v in custom_param_per_dag.items():
    dag = DAG('shipment_{}'.format(region), ...)
    export = PostgresToS3Operator(task_id='db_to_s3', ...)
    transfer = S3CopyObjectOperator(task_id='s3_to_s3', ...)
    # export must finish before transfer starts
    export >> transfer
    # Register each DAG under its dag_id so Airflow can discover it
    globals()[dag.dag_id] = dag
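The loop above can be tried without an Airflow installation. In the sketch below, `MiniDag` and `MiniTask` are hypothetical stand-ins for `DAG` and the operators, and a plain dict stands in for `globals()`; it only illustrates the two ideas on the slide, chaining with `>>` and registering one DAG per region under a string key.

```python
class MiniTask:
    """Hypothetical stand-in for an Airflow operator instance."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b so chains work
        self.downstream.append(other)
        return other


class MiniDag:
    """Hypothetical stand-in for a DAG; it only keeps an id and its tasks."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = []


def build_dags(regions, namespace):
    """Create one DAG per region and register each under its dag_id."""
    for region in regions:
        dag = MiniDag('shipment_{}'.format(region))
        export = MiniTask('db_to_s3')
        transfer = MiniTask('s3_to_s3')
        export >> transfer
        dag.tasks = [export, transfer]
        # The key is the dag_id string, not the DAG object itself
        namespace[dag.dag_id] = dag
    return namespace


dags = build_dags(['sg', 'eu', 'us'], {})
print(sorted(dags))
```

In a real DAG file the namespace would be the module's `globals()`, since Airflow collects DAG objects from module-level names when it imports the file.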
template
t1 = PostgresToS3Operator(
    task_id='db_to_s3',
    # Jinja placeholders are filled in at run time, once per DAG run
    sql="SELECT * FROM shipment WHERE region = '{{ params.region }}' "
        "AND ship_date = '{{ execution_date.strftime('%Y-%m-%d') }}'",
    bucket=default_args['source_bucket'],
    object_key='{{ params.region }}/{{ execution_date.strftime("%Y%m%d%H%M%S") }}.csv',
    params={'region': region},
    dag=dag)
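To see what the templated fields above resolve to, here is a minimal sketch that imitates the rendering with plain `str.format` instead of Jinja. The function name `render_export_fields` and the example date are assumptions made for illustration; Airflow itself performs this substitution when the task runs.

```python
from datetime import datetime


def render_export_fields(region, execution_date):
    """Return the SQL and object key the templates would resolve to."""
    ship_date = execution_date.strftime("%Y-%m-%d")
    sql = (
        "SELECT * FROM shipment WHERE region = '{region}' "
        "AND ship_date = '{ship_date}'"
    ).format(region=region, ship_date=ship_date)
    object_key = "{}/{}.csv".format(
        region, execution_date.strftime("%Y%m%d%H%M%S"))
    return sql, object_key


# Example run for the 'sg' region on an arbitrary execution date
sql, object_key = render_export_fields("sg", datetime(2019, 10, 11))
print(sql)
print(object_key)
```

Because the values depend on the execution date, each DAG run exports a different day's rows to a different object key without any change to the DAG code.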
Operator
class PostgresToS3Operator(BaseOperator):
    # Attributes listed here are rendered as Jinja templates before execute() runs
    template_fields = ('sql', 'bucket', 'object_key')

    def __init__(self, ..., *args, **kwargs):
        super(PostgresToS3Operator, self).__init__(*args, **kwargs)
        ...

    def execute(self, context):
        ...
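The body of `execute` is elided on the slide. As a rough idea of the kind of work such a method might do, the sketch below runs a query and serializes the rows to CSV text. It is an assumption, not the author's implementation: `sqlite3` stands in for Postgres, the returned string stands in for the object uploaded to S3, and the table contents are made up.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, sql, params=()):
    """Run a query and return the result set as CSV text with a header row."""
    cursor = conn.execute(sql, params)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # cursor.description holds one tuple per column; the name comes first
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()


# Hypothetical in-memory table standing in for the shipment database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipment (id INTEGER, region TEXT, ship_date TEXT)")
conn.executemany(
    "INSERT INTO shipment VALUES (?, ?, ?)",
    [(1, "sg", "2019-10-11"), (2, "eu", "2019-10-11"), (3, "sg", "2019-10-12")],
)
csv_text = export_query_to_csv(
    conn,
    "SELECT * FROM shipment WHERE region = ? AND ship_date = ? ORDER BY id",
    ("sg", "2019-10-11"),
)
print(csv_text)
```

In an operator, the rendered `sql`, `bucket`, and `object_key` attributes would supply the query and the upload destination.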
Building a data pipeline: AWS vs GCP
                            | AWS (2 years ago)                                 | GCP (current)
Workflow (Airflow cluster)  | EC2 (or ECS / EKS)                                | Cloud Composer
Big data processing         | Spark on EC2 (or EMR)                             | Cloud Dataflow (or Dataproc)
Data warehouse              | Hive on EC2 -> Athena (or Hive on EMR / Redshift) | BigQuery
CI / CD                     | Jenkins on EC2 (or CodeBuild)                     | Cloud Build
AWS: Airflow cluster
[Diagram: an admin reaches the web servers on master nodes (1) and (2) through a load balancer (LB); a scheduler runs on a master node; worker nodes (1), (2), ... each run executors (1..N); the nodes share the Airflow metadata DB, the Celery result backend, and the message broker]
Source: http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
GCP: Airflow Cluster = Cloud Composer
● Managed Airflow cluster provided by GCP
○ Fully managed
○ Built-in integration with other GCP services
● To focus on business logic, build your Airflow cluster
with Cloud Composer
GCP: Airflow Cluster = Cloud Composer
● An Airflow cluster on Google Kubernetes Engine can easily be created from the CLI or the web console
● Allowed changes to the cluster: increasing the number of worker nodes or installing Python modules
● You cannot install Linux commands on the worker nodes.
Source: https://cloud.google.com/composer/docs/concepts/overview
AWS: Running Spark job in client mode
[Diagram: Spark on YARN in client mode, with the Spark driver on the Airflow worker node]
● Build the Spark cluster outside of the Airflow cluster
● The official SparkSQLOperator does not support cluster mode
● Use the official SparkSubmitOperator or extend the official SparkSQLOperator
● Note: if you run a Spark job in client mode, the Spark driver runs on the Airflow worker node. This can cause out-of-memory errors on the driver side.
Source: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html
AWS: Running Spark job in cluster mode
[Diagram: Spark on YARN in cluster mode, submitted from the Airflow worker node]
● Specify cluster mode for SparkSubmitOperator in your Airflow DAG
● Your Spark job then runs in a YARN container (Spark cluster)
● This gives the Spark driver enough memory, because it no longer runs on the Airflow worker node
Source: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html
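The difference between the two modes comes down to one option on the `spark-submit` command line. The sketch below builds that command as an argument list; the helper name `spark_submit_command` and the application file `etl_job.py` are hypothetical, and how an Airflow operator passes the deploy mode through is not shown here.

```python
def spark_submit_command(application, deploy_mode="cluster", driver_memory=None):
    """Build a spark-submit argument list for a job on YARN."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if driver_memory:
        # In cluster mode this memory is allocated in a YARN container,
        # not on the machine that submits the job
        cmd += ["--driver-memory", driver_memory]
    cmd.append(application)
    return cmd


cmd = spark_submit_command("etl_job.py", driver_memory="4g")
print(" ".join(cmd))
```

With `deploy_mode="client"` the same command would start the driver on the submitting host, which in this setup is the Airflow worker node.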
GCP: Big data processing = Cloud Dataflow
● Fully managed service for streaming / batch data processing
● Single API for both batch and streaming data
● Develop a pipeline in Apache Beam SDK (Java, Python and Go)
● Fully integrated with GCP services
● https://cloud.google.com/dataflow/
GCP: Big data processing = Cloud Dataflow
[Diagram: the Airflow executor on an Airflow worker node (Composer) deploys Dataflow jobs in two ways: (1) run local code (Dataflow Java (Jar) or Dataflow Python), or (2) run a Dataflow template (Java or Python) that was uploaded to GCS in advance; Dataflow loads the template and deploys the job]
Data warehouse: Hive / Athena / BigQuery
                | Hive                         | AWS Athena    | BigQuery
Managed or not  | No                           | Fully managed | Fully managed
Pricing model   | Pay for compute resources    | Pay for usage | Pay for usage
Standard SQL    | No (HiveQL)                  | Yes           | Yes
Data load       | Required                     | Not required  | Required
Partitioning    | Any column                   | Any column    | Daily partition
Scalability     | Depends on your cluster size | Mid           | High (petabytes)
AWS: Data warehouse = Athena
[Diagram: the Airflow worker runs a query on Athena; Athena queries S3 (data storage) and exports the query result to S3 (destination)]
● AWSAthenaOperator supports running queries
● Explicit table partitioning is needed
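One common way to handle explicit partitioning is to register each new partition with a DDL statement before querying it. The sketch below only builds that statement as a string; the partition column `dt`, the bucket name, and the prefix layout are assumptions made for illustration.

```python
def add_partition_sql(table, partition_date, bucket, prefix):
    """Return a DDL statement that registers one daily partition."""
    location = "s3://{}/{}/dt={}/".format(bucket, prefix, partition_date)
    return (
        "ALTER TABLE {table} ADD IF NOT EXISTS "
        "PARTITION (dt = '{dt}') LOCATION '{loc}'"
    ).format(table=table, dt=partition_date, loc=location)


# Hypothetical table and bucket names
ddl = add_partition_sql("shipment", "2019-10-11", "example-bucket", "shipment")
print(ddl)
```

A statement like this could run as its own task ahead of the query task, with the date taken from the DAG run's execution date.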
GCP: Data warehouse = BigQuery
[Diagram: Composer (Airflow cluster) drives three steps: (1) load from GCS (data storage) into BigQuery, (2) run query in BigQuery, (3) export query result to GCS (destination)]
AWS: CI/CD pipeline
[Diagram: raising / merging a PR in the GitHub repo sends a notification through AWS SNS to AWS SQS; the Airflow worker polls the queue and runs an Ansible script (git pull, test, deployment)]
GCP: CI/CD pipeline
[Diagram: merging a PR in the GitHub repo triggers a build in Cloud Build (test and deploy); Cloud Build uploads to the GCS bucket provided by Composer, and Composer (Airflow cluster) deploys it automatically]
Building a data pipeline: AWS vs GCP
                            | AWS (2 years ago)                                 | GCP (current, recommended)
Workflow (Airflow cluster)  | EC2 (or ECS / EKS)                                | Cloud Composer
Big data processing         | Spark on EC2 (or EMR)                             | Cloud Dataflow (or Dataproc)
Data warehouse              | Hive on EC2 -> Athena (or Hive on EMR / Redshift) | BigQuery
CI / CD                     | Jenkins on EC2 (or CodeBuild)                     | Cloud Build
Summary
● Data engineers have to build reliable and scalable data pipelines to accelerate data analytics activities
● Airflow is a great tool for authoring and monitoring workflows
● An HA cluster is required in production
● IMHO, GCP provides better managed services for data pipelines and data warehouses
Thank you!