Cloud Composer
Airflow Summit 2023
Workshop
September 21, 2023
Hi! It's nice to meet you!
Bartosz Jankiewicz
Engineering Manager
Filip Knapik
Group Product Manager
Michał Modras
Engineering Manager
Leah Cole
Developer Relations
Engineer
Rafal Biegacz
Engineering Manager
Arun Vattoly
Technical Solutions
Engineer
Victor
Cloud Support
Manager
Mateusz Henc
Software Engineer
Agenda
Of the workshop
01
Table of contents
01 Agenda
02 Setting up
03 Introduction to Cloud Composer
04 Disaster Recovery in Cloud Composer
05 Data Lineage in Cloud Composer
Agenda
Introduction (15m)
● Composer Architecture
● Composer Features
Disaster Recovery (1h)
● Snapshots
☕
Data Lineage (1h)
● Composer + BigQuery
● Composer + Dataproc
Setting up
Workshop Projects
02
GCP Projects used during the Workshop
● Main activities and exercises will be done in environments created in a dedicated project that we set
up for you. Composer environments were pre-created for you as well.
These projects and environments will be deleted after the workshop.
● The Composer project information you should use was sent to you in a separate email.
● If the email address you used to register for the workshop is not associated with an active
Google Account, you will need to complete the registration as described here
● You can also get this information during the workshop
A voucher for Google Cloud Platform (GCP) Credits
● As part of this workshop you will receive a GCP credits voucher worth $500.
To redeem the credits, in addition to an active Google Account you will need to set up your own
GCP project and associate it with an active billing account.
This project and the GCP credits will be owned by you.
You can activate the GCP coupon within 2 months after the workshop.
The workshop's GCP credits are valid for 1 year after activation.
Introduction to
Cloud Composer
03
Cloud Composer 2 architecture
Cloud Composer 2 interacts with the following services:
● Cloud SQL - hosts the Airflow metadata database
● Cloud Storage - user-uploaded content (DAGs, user data)
● Kubernetes - runs the Scheduler(s), WebServer, Redis queue,
SQL proxy and Airflow workloads
● Cloud Logging - stores and indexes component logs
● Cloud Monitoring - searchable Cloud Composer metrics
… and many more that we manage for you.
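As a quick, hypothetical illustration of what "managed" means here, the sketch below reads an environment's configuration with the google-cloud-orchestration-airflow Python client; the project, region, and environment names are placeholders, and field names may vary slightly between client versions.

```python
# Sketch: inspect a Composer 2 environment with the Python client.
# All resource names below are placeholders.
from google.cloud.orchestration.airflow import service_v1

client = service_v1.EnvironmentsClient()
env = client.get_environment(
    name="projects/my-project/locations/us-central1/environments/my-env"
)
# The Cloud Storage prefix where DAGs are uploaded.
print(env.config.dag_gcs_prefix)
# The Airflow web server URL served from the environment's GKE cluster.
print(env.config.airflow_uri)
```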
Cloud Composer in a nutshell
Orchestrate work across Google Cloud (BigQuery, Data Fusion, Dataflow, Dataproc, Storage,
100+ APIs, …), external SaaS services and proprietary APIs.
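To make "orchestrate work" concrete, here is a minimal, hypothetical DAG of the kind Composer runs; all names and the query are placeholders, not workshop code.

```python
# Illustrative DAG: one BigQuery job followed by a GCS-to-GCS copy.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

with DAG(
    dag_id="orchestration_sketch",
    start_date=pendulum.datetime(2023, 9, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_query = BigQueryInsertJobOperator(
        task_id="run_query",
        configuration={
            "query": {"query": "SELECT 1", "useLegacySql": False}  # placeholder
        },
    )
    copy_results = GCSToGCSOperator(
        task_id="copy_results",
        source_bucket="my-source-bucket",
        source_object="exports/*",
        destination_bucket="my-archive-bucket",
    )
    run_query >> copy_results  # BigQuery job first, then the copy
```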
Cloud Composer benefits
● Simple deployment
● Robust monitoring & logging
● Enterprise security features
● DAG code portability
● Technical support
● Managed infrastructure
Disaster Recovery
in Cloud Composer
04
What is a disaster?
A disaster is an event where Cloud Composer or other components essential for your environment's operation are unavailable.
In Google Cloud, the impact of a disaster can be zonal, regional or global.
HR
Highly resilient Cloud Composer environments are Cloud Composer 2 environments that use built-in redundancy and failover mechanisms, reducing the environment's susceptibility to zonal failures and single-point-of-failure outages.
DR
Disaster Recovery (DR), in the context
of Cloud Composer, is a process of
restoring the environment's operation
after a disaster. The process involves
recreating the environment, possibly in
another region.
What about Composer with High Resilience?
HR != DR
Composer HR makes sure your application is available right now. DR makes sure you can get it back up later.
High resilience is critical for the availability of Cloud Composer, but it does not by itself make the environment recoverable.
For example, a critical historical transactions table may be lost, while new transactions are still being processed.
Definition of availability
Availability = Uptime / Total Time
Availability should be calculated based on how long a service was unavailable
over a specified period of time. Planned downtime is still downtime.
For example, a service that is down for 7.2 hours in a 30-day month has 99% availability.
Availability in distributed systems
Chaining (AND): two services in sequence, each with a 99% SLO:
Availability = Aa * Ab = 0.99 * 0.99 ≈ 98%
Parallelization (OR): two redundant services, each with a 99% SLO:
Availability = 1 - (1 - Aa) * (1 - Ab) = 1 - 0.01 * 0.01 = 99.99%
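A small script to check those numbers (illustrative only):

```python
# Combined availability of services, as in the 99%-SLO example above.
def chained(*availabilities):
    """AND: every service must be up for the system to be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """OR: the system is down only if every service is down."""
    down = 1.0
    for a in availabilities:
        down *= (1 - a)
    return 1 - down

print(f"chained:  {chained(0.99, 0.99):.4f}")   # 0.9801 -> ~98%
print(f"parallel: {parallel(0.99, 0.99):.4f}")  # 0.9999 -> 99.99%
```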
DR Process: Failover
Step 1: Everything is fine
[Diagram: the Primary environment writes Scheduled Snapshots to Snapshots storage.]
Note: Use a multi-regional GCS bucket for snapshots storage.
DR Process: Failover
Step 2: Disaster!
[Diagram: the Primary environment is down; Snapshots storage is unaffected.]
That's why the snapshots storage should be multi-regional.
DR Process: Failover
Step 3: Create a Cloud Composer environment in the DR region
[Diagram: a Failover environment is created in the DR region next to Snapshots storage; the Primary environment is still down.]
RTO and RPO
Recovery Point Objective (RPO): the maximum acceptable length of time during which data might be lost from your application due to a major incident.
Recovery Time Objective (RTO): the maximum acceptable length of time that your application can be offline.
DR scenarios
Warm DR is a variant of Disaster Recovery where you use a standby failover environment that you create before a disaster occurs (lower RTO, higher cost: RTO ↘ Cost ↗).
Cold DR is a variant of Disaster Recovery where you create the failover environment only after a disaster occurs (higher RTO, lower cost: RTO ↗ Cost ↘).
DR Process: Failover
Step 4: Load snapshot
[Diagram: the snapshot is loaded from Snapshots storage into the Failover environment; the Primary environment is still down.]
DR Process: Failover
Step 4 (continued): Snapshot loaded, workflows resumed
[Diagram: the Failover environment runs the restored DAGs and now writes its own Scheduled Snapshots to Snapshots storage; the Primary environment is still down.]
DR Process: Failover
Step 5: Disaster mitigated
[Diagram: the Primary environment is back up; the Failover environment keeps writing Scheduled Snapshots to Snapshots storage.]
Note: Make sure to pause DAGs in the Primary environment ⏸
DR Process: Failover
Next steps
🥶 Option 1: Switch the Failover environment with the Primary environment and delete the Primary environment (Cold DR)
🌤 Option 1a: Switch the Failover environment with the Primary environment and keep it (Warm DR)
🥶 Option 2: Fall back to the Primary environment and delete the Failover environment (Cold DR)
🌤 Option 2a: Fall back to the Primary environment and keep the Failover environment (Warm DR)
Creating a detailed DR plan
1. What is your RTO?
2. What is your RPO?
3. How do you want to verify your plan?
https://cloud.google.com/architecture/dr-scenarios-planning-guide
Practice
Step 1: Create a Snapshots storage bucket
● Use a multi-regional bucket to ensure
resiliency to regional failures.
● Make sure the bucket is accessible to your
environment's service account:
○ Grant permissions on the created bucket to
lab-sa@airflow-summit-workshop-{project}.iam.gserviceaccount.com
(one way to do this is sketched below)
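One way to grant that access, sketched with the google-cloud-storage client; the bucket name is a placeholder and roles/storage.objectAdmin is an assumption about the role your setup needs:

```python
# Hedged sketch: grant the lab service account access to the snapshots bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-snapshots-bucket")  # placeholder name
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectAdmin",  # assumed role; adjust as needed
        "members": {
            "serviceAccount:lab-sa@airflow-summit-workshop-{project}"
            ".iam.gserviceaccount.com"
        },
    }
)
bucket.set_iam_policy(policy)
```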
Step 2: Configure scheduled snapshots
Primary environment
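In the workshop you configure scheduled snapshots through the Console (shown above). Purely as a hedged sketch of the equivalent API call: the v1 Composer API exposes a RecoveryConfig, but the field and mask names below may differ between client versions, and the schedule is an assumption.

```python
# Hypothetical sketch: enable scheduled snapshots via the Composer API.
from google.cloud.orchestration.airflow import service_v1
from google.protobuf import field_mask_pb2

client = service_v1.EnvironmentsClient()
operation = client.update_environment(
    request={
        "name": "projects/my-project/locations/us-central1/environments/my-env",
        "environment": {
            "config": {
                "recovery_config": {
                    "scheduled_snapshots_config": {
                        "enabled": True,
                        "snapshot_location": "gs://my-snapshots-bucket",
                        "snapshot_creation_schedule": "0 */6 * * *",  # every 6h (assumed)
                        "time_zone": "UTC",
                    }
                }
            }
        },
        "update_mask": field_mask_pb2.FieldMask(
            paths=["config.recovery_config.scheduled_snapshots_config"]
        ),
    }
)
operation.result()  # long-running operation
```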
Step 3: Create manual snapshot
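You can take a manual snapshot from the Console or gcloud; as a hedged sketch, the SaveSnapshot call in the Python client looks roughly like this (names follow the v1 API and may vary by client version):

```python
# Hypothetical sketch: save a manual snapshot of the primary environment.
from google.cloud.orchestration.airflow import service_v1

client = service_v1.EnvironmentsClient()
operation = client.save_snapshot(
    request={
        "environment": "projects/my-project/locations/us-central1/environments/my-env",
        "snapshot_location": "gs://my-snapshots-bucket",  # bucket from Step 1
    }
)
response = operation.result()  # blocks until the snapshot is written
print(response.snapshot_path)  # e.g. gs://.../myproject_us-central1_my-env_<timestamp>
```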
[Optional] Step 3a: Set up metadata DB maintenance
The Airflow metadata database must hold less than 20 GB of data to support
snapshots.
1. Upload a maintenance DAG - http://bit.ly/3t1iiYJ
a. The DAG is already in your environment (a simplified sketch of the idea follows below).
2. Verify the database size metric.
3. [Optional] Set up an alert on the database size metric.
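The linked maintenance DAG is what the exercise uses; purely to illustrate the idea, Airflow 2.3+ also ships an `airflow db clean` command that a trivial DAG can run on a schedule (the retention and cadence below are assumptions):

```python
# Simplified sketch of metadata-DB maintenance, NOT the workshop's DAG.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="db_cleanup_sketch",
    start_date=pendulum.datetime(2023, 9, 1, tz="UTC"),
    schedule_interval="@weekly",  # assumed cadence
    catchup=False,
) as dag:
    BashOperator(
        task_id="db_clean",
        # Purge metadata rows older than 30 days (assumed retention);
        # --skip-archive drops the rows instead of archiving them in the DB.
        bash_command=(
            "airflow db clean "
            "--clean-before-timestamp {{ macros.ds_add(ds, -30) }} "
            "--skip-archive --yes"
        ),
    )
```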
Step 4: Verify a snapshot has been created
1. Visit the storage bucket to observe the created snapshot objects.
2. … or better, delegate this effort to Cloud Monitoring.
a. https://cloud.google.com/composer/docs/composer-2/disaster-recovery-with-snapshots
Step 5: Disaster!
Step 6: Load the snapshot in the failover environment
Note: In the Warm DR case, skip some load options to reduce
loading time and therefore RTO (see the sketch below).
Secondary environment
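A hedged sketch of the load step, mirroring the v1 API's LoadSnapshot call; the skip_* flags are the "options" the note refers to, and all names are placeholders:

```python
# Hypothetical sketch: restore a snapshot into the failover environment.
from google.cloud.orchestration.airflow import service_v1

client = service_v1.EnvironmentsClient()
operation = client.load_snapshot(
    request={
        "environment": "projects/my-project/locations/us-east1/environments/failover-env",
        "snapshot_path": "gs://my-snapshots-bucket/"
        "myproject_us-central1_test-env_2023-09-15T05-02-06",
        # Warm DR: a standby environment already has packages and settings,
        # so skipping these steps shortens load time and therefore RTO.
        "skip_pypi_packages_installation": True,
        "skip_environment_variables_setting": True,
    }
)
operation.result()
```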
[Extra topic] Anatomy of a Snapshot
● A snapshot is a folder in a GCS bucket.
● Convention for folder names (the convention is not validated during load):
{project}_{region}_{environment}_{ISO_DATE_AND_TIME}
E.g.: myproject_us-central1_test-env_2023-09-15T05-02-06/
● Contents:
airflow-database.postgres.sql.gz
environment.json
fernet-key.txt
gcs/
dags/
metadata.json
preliminary_metadata.json
Step 7: Verify Failover environment health
1. Visit the Monitoring Dashboard.
2. Check that your DAGs are running.
3. Verify the DAG history - it should have been loaded from the snapshot.
Limitations
1. The database size cannot exceed 20 GB (the metric is available in the Monitoring dashboard).
2. Snapshots can be saved at intervals of 2 hours or more.
Good practices
1. Prepare your DR plan.
2. Test your disaster recovery procedure on a regular basis.
3. Decide what to do with the primary environment afterwards.
4. Set up DB maintenance and monitor DB size.
5. Set up monitoring for scheduled snapshot operations.
Let's take a break ☕
Data lineage
in Cloud Composer
05
Data lineage traces the
relationship between data
sources based on movement of
data, explaining how data was
sourced and transformed.
● Airflow gains rich lineage capabilities thanks to
the OpenLineage integration.
● Implemented by Dataplex in Google Cloud.
Data Lineage in Google Cloud: Dataplex
● Process
● Run (execution of Process)
● Event (emitted in a Run)
Data Lineage in Cloud Composer
- Currently based on the Airflow
Lineage Backend feature.
- The backend exports lineage data to
Dataplex.
- We are working on migrating to
OpenLineage.
Data Lineage in Cloud Composer
Supported operators:
- BigQueryExecuteQueryOperator
- BigQueryInsertJobOperator
- BigQueryToBigQueryOperator
- BigQueryToCloudStorageOperator
- BigQueryToGCSOperator
- GCSToBigQueryOperator
- GoogleCloudStorageToBigQueryOperator
- DataprocSubmitJobOperator
(An illustrative DAG using one of these operators follows below.)
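A minimal sketch of a DAG whose lineage the backend reports automatically because it uses one of the supported operators; project, dataset, and table names are placeholders:

```python
# Illustrative DAG: this BigQuery task's lineage is captured automatically
# when data lineage is enabled in the Composer environment.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="bq_lineage_sketch",
    start_date=pendulum.datetime(2023, 9, 1, tz="UTC"),
    schedule_interval=None,
) as dag:
    BigQueryInsertJobOperator(
        task_id="curate_table",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE my_dataset.curated AS "
                    "SELECT * FROM my_dataset.raw"
                ),
                "useLegacySql": False,
            }
        },
        location="us-central1",  # placeholder location
    )
```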
Data Lineage in other Google Cloud services
- A growing number of Google Cloud
services support Data Lineage (e.g.
BigQuery, Dataproc, Cloud Data
Fusion).
- Goal: complete data lake lineage.
Exercise 1: Lineage from a Composer-
orchestrated BigQuery pipeline
This exercise covers data lineage with Cloud Composer in a
BigQuery context, using a minimum viable data engineering pipeline.
We will demonstrate lineage capture from an Airflow DAG composed
of BigQuery actions.
We will first use an Apache Airflow DAG on Cloud Composer to
orchestrate the BigQuery jobs and observe the lineage captured by
Dataplex. Note that lineage shows up minutes after a process is
run or an entity is created.
1. Review the existing Composer environment.
2. Review the existing lineage graph.
3. Review the Airflow DAG code.
Go to the DAGs section of the Composer environment.
4. Run the Airflow DAG.
5. Validate the creation of the
tables in BigQuery.
6. Review the lineage captured in the Dataplex
UI.
7. Navigate back to the Composer Airflow DAG
using the link in the Data Lineage UI.
Exercise 2: Lineage from a Composer-
orchestrated Dataproc Spark job
In this exercise, we will repeat what we did with the lineage of the
BigQuery-based Airflow DAG, except we will use Apache Spark on Dataproc
Serverless instead. Note that Dataproc Serverless is not natively
supported by Dataplex automated lineage capture. We will
use the custom lineage feature in Cloud Composer.
1. Review the DAG.
Navigate to the Cloud Composer UI and launch the Airflow UI.
Click on the Spark DAG.
2. Verify the inlets & outlets definition.
Scroll to the "inlets" and "outlets" where we specify lineage for the
BigQuery external tables (a sketch of this pattern follows below).
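A hedged sketch of this pattern, based on the Cloud Composer data-lineage documentation: the airflow.composer.data_lineage.entities import path is Composer-specific, the project and raw-zone names are placeholders, and only the outlet matches the exercise's curated table.

```python
# Sketch of custom lineage: declare a task's inputs and outputs explicitly.
import pendulum
from airflow import DAG
from airflow.composer.data_lineage.entities import BigQueryTable  # Composer-only
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_lineage_sketch",
    start_date=pendulum.datetime(2023, 9, 1, tz="UTC"),
    schedule_interval=None,
) as dag:
    curate = BashOperator(
        task_id="crimes_curation",
        bash_command="echo 'submit the Dataproc Serverless batch here'",  # stand-in
        inlets=[BigQueryTable(project_id="my-project",       # placeholders
                              dataset_id="oda_raw_zone",
                              table_id="crimes_raw")],
        outlets=[BigQueryTable(project_id="my-project",
                               dataset_id="oda_curated_zone",
                               table_id="crimes_curated_spark")],
    )
```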
3. Run the DAG.
4. Verify the Dataproc Serverless batch jobs in the
Dataproc Batches UI in the Google Cloud Console.
5. Analyze the lineage captured from the
Composer environment's Airflow.
The lineage captured is custom and BigQuery external-table centric, and therefore not visible in the Dataplex UI. The latency of lineage availability
depends on the discovery settings for the asset.
Navigate to the BigQuery UI and click on the external table, oda_curated_zone.crimes_curated_spark.
Thank you.