Clearing Airflow
obstructions
@tati_alchueyr
Online
15 July 2021
Principal Data Engineer @ BBC Datalab
● During this session there will be some quizzes
● Be prepared to either:
○ Scan the QR code
○ Access using the URL
Heads up!
@tati_alchueyr.__doc__
● Brazilian living in London since 2014
● Principal Data Engineer at the BBC Datalab team
● Graduated in Computer Engineering at Unicamp
● Passionate software developer for 18 years
● Experience in the private and public sectors
● Developed software for Medicine, Media and
Education
● Loves Open Source
● Loves Brazilian Jiu Jitsu
● Proud mother of Amanda (v4.0)
I ❤ Airflow Community & Summit
Tomek
Urbaszek
Jarek
Potiuk
Ash
Berlin-Taylor
Kaxil Naik
Leah
Cole
BBC.Datalab.Hummingbirds
The work presented here is the result of lots of teamwork
within one squad of a much larger team and organisation
active squad team members
directly contributed in the past
Darren
Mundy
David
Hollands
Richard
Bownes
Marc
Oppenheimer
Bettina
Hermant
Tatiana
Al-Chueyr
A. I don’t run Airflow
B. On-premise
C. Astronomer.io
D. Google Cloud Composer
E. AWS Managed Workflows
F. Other
Quiz time how do you run Airflow?
Quiz time how do you run Airflow?
* responses from Airflow Summit 2021 participants, during the presentation
How we use Airflow
when things went wrong
How we use Airflow when things go wrong
How we use Airflow infrastructure
● Managed Airflow
○ Cloud Composer (GCP)
○ Terraform
● Celery Executor (GCP constraint)
○ Running within Kubernetes (GKE)
● Outdated version 1.10.12
○ Upgrades have been time consuming
■ Last: 1.10.4 => 1.10.12 @ Dec ‘20
○ GCP supports newer releases
■ 1.10.15 since April ‘21 (released March ‘21)
■ 2.0.1 since May ‘21 (released Feb ‘21)
● Execute within Airflow executors
○ BaseOperator
○ BashOperator
○ DummyOperator
○ PythonOperator
○ ShortCircuitOperator
○ TriggerDagRunOperator
○ GCSDeleteObjectsOperator
● Delegate to Kubernetes in a dedicated GKE cluster
○ GKEPodOperator
● Delegate to Apache Beam (Dataflow)
○ DataflowPythonOperator
○ DataflowCreatePythonJobOperator
How we use Airflow operators
How we use Airflow local, dev, int, prod
How we use Airflow application
● Data integration: ETL (extract, transform, load)
● Machine learning: training and precomputation
User
activity
Content
metadata
Train Model
Artefacts
Predict Prediction
Results
Extract &
Transform
Extract &
Transform
historical data future
User activity
features
Content metadata
features
Ingest user activity
Ingest content metadata
Train model Precompute
How we use Airflow application
historical data future
Train model Precompute
How we use Airflow application
historical data future * April 2021
A. Ingest & transform Content Metadata
B. Ingest & transform User Activity
C. Train model
D. Precompute recommendations
Quiz time which is our most stable DAG?
A. Ingest & transform Content Metadata Python Operator.
○ ~ 225k records
○ Transforms ~12 GB => 57 MB
B. Ingest & transform User Activity Dataflow Operator.
○ ~ 3.5 million records
○ Output: ~ 2 GB
C. Train model Kubernetes Operator.
○ Output: ~ 8 GB model & artefacts
D. Precompute recommendations Dataflow Operator.
○ ~ 3.5 million records
○ Output: ~ 2.5 GB
Quiz time which is our most stable DAG? tips.
Quiz time which is our most stable DAG? attendees.
* responses from Airflow Summit 2021 participants, during the presentation
Quiz time which is our most stable DAG? answer.
A. Ingest & transform Content Metadata
○ 2 live incidents
B. Ingest & transform User Activity
○ 1 live incident
C. Train model
○ 2 live incidents
D. Precompute recommendations
○ 2 live incidents
* from October 2020 until April 2021
A. Ingest & transform Content Metadata
○ Insufficient CPU
○ Spikes in volume -> timeouts / re-runs continuing from where they stopped
B. Ingest & transform User Activity
○ Idempotency issue
C. Train model
○ K8s Pod reattach
○ Scheduling leading to two tasks running concurrently
D. Precompute recommendations
○ Change to default job settings in Dataflow
○ GCS access limit
○ Non-stop Dataflow job
Quiz time which is our most stable DAG? details.
Removal of workflows
obstructions
when things went wrong
Obstruction 1 The programme metadata chronic issue
historical data future
Ingest user activity
Ingest content metadata
Train model Precompute
Obstruction 1 The programme metadata chronic issue
historical data future
DAG’s goals
● Import objects from AWS S3 (protected by STS) into Google Cloud Storage
● Requirements: between dozens and thousands of KB-sized objects
● Filter and enrich the metadata
● Merge multiple streams of data and create an up-to-date snapshot
Mostly implemented using subclasses of the PythonOperator class.
Obstruction 1 The programme metadata chronic issue
historical data
Obstruction 1 The programme metadata chronic issue
historical data
Issue
Depending on the
volumes of data, a single
PythonOperator task
which usually takes
10 min could take almost
3h!
Solutions
Increase timeouts
Improve machine type
Delegate processing to
another service
Consequences
Delay
Blocked Airflow executor
Obstruction 2 When the user activity workflow failed
historical data future
Ingest user activity
Ingest content metadata
Train model Precompute
Obstruction 2 When the user activity workflow failed
historical data future
DAG’s goals
● Read from user activity Parquet files in Google Cloud Storage
● Filter relevant activity and metadata
● Export a snapshot for the relevant interval of time
● Requirements: millions of records in MB-sized files
Mostly implemented using subclasses of the DataflowOperator class.
Obstruction 2 When the user activity workflow failed
historical data future
Obstruction 2 When the user activity workflow failed
historical data future
Troubleshooting
● The volume of user activity used to train the model had doubled!
Obstruction 2 When the user activity workflow failed
historical data future
Obstruction 2 When the user activity workflow failed
historical data future
What happened
● Dataflow took longer than expected to run a job triggered by Airflow
● Airflow retried
● Both jobs completed successfully - and wrote their output to the same directory!
● The setup to train the model didn’t expect to handle such a spike in the volume of data
and failed
Obstruction 2 When the user activity workflow failed
historical data future
What happened
● Dataflow took longer than expected to run a job triggered by Airflow
● Airflow retried
● Both jobs completed successfully - and wrote their output to the same directory!
● The setup to train the model didn’t expect to handle such a spike in the volume of data
and failed
Solution
● Have idempotent tasks
● Clear the target path before processing a task
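The "clear the target path before processing" idea can be sketched with an in-memory stand-in for GCS (the bucket class, paths and helper names below are illustrative, not the team's actual code):

```python
# Minimal sketch of an idempotent task: delete everything under the target
# prefix before writing, so a duplicate or retried run overwrites instead of
# appending. `FakeBucket` stands in for a GCS client; names are illustrative.

class FakeBucket:
    def __init__(self):
        self.objects = {}

    def list_prefix(self, prefix):
        return [key for key in self.objects if key.startswith(prefix)]

    def delete(self, key):
        del self.objects[key]

    def write(self, key, data):
        self.objects[key] = data


def export_snapshot(bucket, prefix, shards):
    # Idempotency: remove output left by a previous (partial or duplicate) run.
    for key in bucket.list_prefix(prefix):
        bucket.delete(key)
    for i, shard in enumerate(shards):
        bucket.write(f"{prefix}/part-{i:05d}.parquet", shard)


bucket = FakeBucket()
export_snapshot(bucket, "user_activity/2021-07-08", ["a", "b"])
# A retry producing fewer shards must not leave stale files behind:
export_snapshot(bucket, "user_activity/2021-07-08", ["a"])
print(sorted(bucket.objects))  # only part-00000 remains
```

With this shape, two jobs racing into the same directory can no longer double the downstream input: the last writer wins cleanly instead of the outputs accumulating.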
Obstruction 3 When precompute failed due to training
historical data future
Ingest user activity
Ingest content metadata
Train model Precompute
Obstruction 3 When precompute failed due to training
historical data future
Obstruction 3 When precompute failed due to training
historical data future
"message": "No such object: datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl"
Obstruction 3 When precompute failed due to training
historical data future
Obstruction 3 When precompute failed due to training
historical data future
Obstruction 3 When precompute failed due to training
historical data future
[2021-07-08 12:15:55,278] {logging_mixin.py:112} INFO - Running <TaskInstance: train_model.move_model_from_k8s_to_gcs 2021-07-08T00:00:00+00:00 [running]> on host airflow-worker-867b96c854-jzqw7
[2021-07-08 12:15:55,422] {gcp_container_operator.py:299} INFO - Using gcloud with application default credentials.
[2021-07-08 12:15:57,286] {pod_launcher.py:173} INFO - Event: move-model-to-gcs-3deac821b57047619c1c9505ddc5db18 had an event of type Pending
[2021-07-08 12:15:57,286] {pod_launcher.py:139} WARNING - Pod not yet started: move-model-to-gcs-3deac821b57047619c1c9505ddc5db18
[2021-07-08 12:17:57,499] {taskinstance.py:1152} ERROR - Pod Launching failed: Pod Launching failed: Pod took too long to start
[2021-07-08 12:18:59,584] {pod_launcher.py:156} INFO - b'gsutil -m rm gs://datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl || true\n'
[2021-07-08 12:18:59,911] {pod_launcher.py:156} INFO - b'gsutil -m mv /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl (...)
[2021-07-08 12:19:35,295] {pod_launcher.py:156} INFO - b'Operation completed over 1 objects/4.7 GiB.\n'
[2021-07-08 12:19:35,536] {pod_launcher.py:156} INFO - b'rm -rf /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl\n'
[2021-07-08 12:19:36,687] {taskinstance.py:1071} INFO - Marking task as SUCCESS.dag_id=train_model,
Obstruction 3 When precompute failed due to training
historical data future
$ kubectl logs move-model-to-gcs-a0b5193a42e040aaa37b3ad82953ee29 -n xantus-training
gsutil -m rm gs://datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl || true
CommandException: 1 files/objects could not be removed.
gsutil -m mv /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl
gs://datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl
Copying file:///data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl [Content-Type=application/octet-stream]...
Removing file:///data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl...
| [1/1 files][ 4.7 GiB/ 4.7 GiB] 100% Done 111.3 MiB/s ETA 00:00:00
Operation completed over 1 objects/4.7 GiB.
rm -rf /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl
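One possible interleaving behind the logs above: two attempts of a "gsutil rm target || true; gsutil mv local target" task overlap, and the late attempt's rm deletes the earlier attempt's upload. A sketch with an in-memory stand-in for the object store (all names illustrative):

```python
# Two pod attempts of the same "move model to GCS" task race each other.
# `store` stands in for the GCS bucket.
store = {}

def rm(key):
    store.pop(key, None)   # "|| true": ignore a missing object

def mv(key, data):
    store[key] = data      # upload the local artefact

target = "recommenders/xantus/model/2021-07-08/xantus.pkl"

rm(target)                  # retry pod: clears the target (no-op)
mv(target, "model-bytes")   # retry pod: uploads the model, reports SUCCESS
rm(target)                  # original pod, waking up late: clears the target!
print(target in store)      # False - downstream precompute finds no model
```

The task that "succeeded" leaves no artefact behind, which is exactly why precompute later failed with "No such object".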
Obstruction 3 When precompute failed due to training
historical data future
What happened
● The GKE node pool, where jobs were executed, was struggling
● The Airflow Kubernetes Operator timed out after a long wait (Kubernetes didn’t know)
● A new task retry was triggered
● Both pods ran concurrently and, because of how we implemented idempotency, the data
was deleted - yet the last task retry was reported as successful
Solution
● Use newer version of the KubernetesPodOperator
● Confirm by the end of the task that the desired artefact exists
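The "confirm the artefact exists" safeguard can be sketched as a final task step (the exception and helper names are illustrative; real code would list the GCS bucket):

```python
# Last step of the task: verify the expected artefact actually exists, so a
# silent race surfaces as a loud task failure instead of a downstream one.

class MissingArtefactError(Exception):
    pass

def assert_artifact_exists(existing_objects, path):
    # `existing_objects` stands in for a listing of the target bucket.
    if path not in existing_objects:
        raise MissingArtefactError(f"No such object: {path}")

objects = {"recommenders/xantus/model/2021-07-08/xantus.pkl"}
assert_artifact_exists(objects, "recommenders/xantus/model/2021-07-08/xantus.pkl")  # passes
try:
    assert_artifact_exists(objects, "recommenders/xantus/model/2021-07-09/xantus.pkl")
except MissingArtefactError as err:
    print(err)
```

This moves the failure to the task that caused it, where retries and alerts can act on it.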
Obstruction 4 Intermittent DAG after an Airflow upgrade
historical data future
Ingest user activity
Ingest content metadata
Train model Precompute
Obstruction 4 Intermittent DAG after an Airflow upgrade
historical data future
What happened
● After upgrading from Airflow 1.10.4 to 1.10.12 some KubernetesPodOperator tasks
became intermittent
● Legit running pods failed
● The logs seemed to show that new jobs were trying to reattach to previously existing
Pods and failed
Obstruction 4 Intermittent DAG after an Airflow upgrade
historical data future
Obstruction 4 Intermittent DAG after an Airflow upgrade
historical data future
HTTP response headers: HTTPHeaderDict({'Audit-Id': '133bc1f2-388b-490c-bdc6-34053685d5ee', 'Content-Type': 'application/json', 'Date': 'Sat, 30 Jan 2021 09:11:33 GMT', 'Content-Length': '231'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container "base" in pod "create-dataset-17a3b6f132e44544a836550be367c670" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}\n
(...)
"/opt/python3.6/lib/python3.6/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
Obstruction 4 Intermittent DAG after an Airflow upgrade
historical data future
Solution
https://airflow.apache.org/docs/apache-airflow/1.10.12/_api/airflow/contrib/operators/kubernetes_pod_operator/index.html
Obstruction 5 When the Dataflow job said no
historical data future
Ingest user activity
Ingest content metadata
Train model Precompute
Obstruction 5 When the Dataflow job said no
historical data future
Obstruction 5 When the Dataflow job said no
historical data future
OSError: [Errno 28] No space left on device
During handling (...)
Obstruction 5 When the Dataflow job said no
future
If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB.
https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#python
Obstruction 5 When the Dataflow job said no
future
Obstruction 5 When the Dataflow job said no
future
--experiments=shuffle_mode=appliance
Solution
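Disabling the Dataflow Shuffle service makes the job fall back to worker-local disk (250 GB default instead of 25 GB). A sketch of the options as they might be passed through the Airflow Dataflow operator's options kwargs (project/region values are placeholders):

```python
# Pipeline options that opt out of the Dataflow Shuffle service via the
# documented experiment flag, restoring the larger per-worker disk default.
dataflow_options = {
    "project": "my-gcp-project",   # placeholder
    "region": "europe-west1",      # placeholder
    "experiments": ["shuffle_mode=appliance"],
}

# Equivalent command-line form for a directly-launched Beam pipeline:
cli_flags = ["--experiments=shuffle_mode=appliance"]
print(dataflow_options["experiments"][0])
```

The alternative, keeping Shuffle enabled and raising `disk_size_gb`, trades shuffle performance against cost; the experiment flag was the smaller change here.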
Obstruction 6 The never ending Dataflow job
historical data future
Ingest user activity
Ingest content metadata
Train model Precompute
Obstruction 6 The never ending Dataflow job
historical data future
Obstruction 6 The never ending Dataflow job
historical data future
[2021-05-13 11:52:34,884] {gcp_dataflow_hook.py:1 (...) Traceback (most recent call last):
  File "/home/airflow/gcs/dags/predictions/compute_predictions.py", line 337, in run
    return pipeline
  File "/tmp/dataflow_venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 569, in __exit__
    self.result.wait_until_finish()
  File "/tmp/dataflow_venv/lib/python3.6/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1650, in wait_until_finish
    self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED,
(...)
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
  compute-predictions-xantu-05130255-opno-harness-mtj1 Root cause: The worker lost contact with the service.,
  compute-predictions-xantu-05130255-opno-harness-2x4v Root cause: The worker lost contact with the service.,
  compute-predictions-xantu-05130255-opno-harness-5gkd Root cause: The worker lost contact with the service.,
  compute-predictions-xantu-05130255-opno-harness-2t1n Root cause: The worker lost contact with the service.'
Obstruction 6 The never ending Dataflow job
historical data future
Obstruction 6 The never ending Dataflow job
historical data future
Solution
● Backport the latest Google Cloud operators in Apache Airflow
● Particularly:
○ DataflowCreatePythonJobOperator
○ DataflowJobStatusSensor
https://medium.com/google-cloud/backporting-google-cloud-operators-in-apache-airflow-34b6c9efffc8
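What the `DataflowJobStatusSensor` adds can be sketched in plain Python: poll the job state and fail fast on terminal errors, instead of sitting on a never-ending job. The state names follow the Dataflow job states; the polling callable here is a fake stand-in for the Dataflow jobs API:

```python
# Sketch of a status-polling sensor: succeed on DONE, raise on FAILED or
# CANCELLED, and time out rather than wait forever. Names are illustrative.
import itertools

TERMINAL_OK = {"JOB_STATE_DONE"}
TERMINAL_BAD = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

def wait_for_job(poll_state, max_polls=10):
    for _ in range(max_polls):
        state = poll_state()
        if state in TERMINAL_OK:
            return state
        if state in TERMINAL_BAD:
            raise RuntimeError(f"Dataflow job ended in {state}")
    raise TimeoutError("Job did not reach a terminal state in time")

# Simulated successive API responses for a healthy job:
fake_job_states = itertools.chain(
    ["JOB_STATE_RUNNING", "JOB_STATE_RUNNING", "JOB_STATE_DONE"],
    itertools.repeat("JOB_STATE_DONE"),
)
print(wait_for_job(lambda: next(fake_job_states)))  # JOB_STATE_DONE
```

Splitting "submit" from "wait" this way also means a stuck job trips a bounded timeout instead of billing for days.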
A. Processing programme metadata within Airflow executors
B. Non-idempotent tasks
C. Breaking change after Airflow upgrade
D. Breaking change in upstream service
E. Not monitoring pre-production environments efficiently
Quiz time which do you reckon was the most costly issue?
Quiz time which do you reckon was the most costly issue?
* responses from Airflow Summit 2021 participants, during the presentation
Quiz time which do you reckon was the most costly issue?
A. Processing programme metadata within Airflow executors
B. Non-idempotent tasks
C. Breaking change after Airflow upgrade
D. Breaking change in upstream service
E. Not monitoring pre-production environments efficiently
Quiz time which do you reckon was the most costly issue?
The never ending Dataflow job, triggered with the DataflowOperator in our int environment, ran for over 3 days and cost over £12k
Hygienic workflows
throughout development
Hygienic pipelines
Smell 1 To plugin or not to plugin
future
● Requirement:
○ Have common packages across multiple DAGs without using PyPI or similar
● Attempt:
○ Use plugins to expose those
○ Deploy using
■ gcloud composer environments storage plugins import .
■ gcloud composer environments storage dags import .
● Problems
○ Lots of broken deployments
○ Unsynchronised upload of plugins and DAGs to Composer web server & workers
○ Issues in enabling DAG serialisation with plugins
● Solution:
○ Stop using plugins
○ Use standard Python packages
○ Upload them with gsutil to the Composer bucket path
■ gsutil rm (...)
■ gsutil cp (...)
Smell 1 To plugin or not to plugin
future
Smell 2 Configuration (mis)patterns
historical data future
● Requirement:
○ Have a strategy for handling environment-specific and common configuration
● Attempt:
○ To use Environment variables to declare each env-specific variable
○ To deploy using Terraform
○ To declare common configuration in the DAGs
● Problems
○ Each updated variable meant a Cloud Composer redeployment
○ Redundant configuration definition across DAGs and multiple utilities modules
○ Hard to identify the data source / target paths
Smell 2 Configuration (mis)patterns
historical data future
● Solution: load configuration files into Airflow Variables that hold the paths and settings
https://www.astronomer.io/guides/dynamically-generating-dags
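The configuration-driven approach can be sketched like this: one config structure per environment, loaded once (e.g. into an Airflow Variable) and queried by the DAGs, instead of constants scattered across modules. Bucket names and keys below are illustrative:

```python
# Sketch of configuration-driven paths: common settings merged with
# environment-specific ones, so source/target paths live in one place.
import json

CONFIG = json.loads("""
{
  "common": {"model_name": "xantus"},
  "environments": {
    "dev":  {"bucket": "datalab-sounds-dev-data"},
    "prod": {"bucket": "datalab-sounds-prod-data"}
  }
}
""")

def model_path(env, run_date):
    # Environment-specific values override the common ones.
    cfg = {**CONFIG["common"], **CONFIG["environments"][env]}
    return f"gs://{cfg['bucket']}/recommenders/{cfg['model_name']}/model/{run_date}/"

print(model_path("prod", "2021-07-08"))
```

Because the environment differences are data rather than code, promoting a DAG from dev to prod no longer requires editing (or redeploying) the DAG itself.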
Smell 2 Configuration (mis)patterns
historical data future
❏ Keep processing out of Airflow executors
❏ Idempotency matters - and it can be hard!
❏ Backporting is better than sticking to the past
❏ Reviewing release notes can help avoid live incidents
❏ Monitoring pre-production environments can save money
Takeaways Avoiding live incidents
❏ Avoid plugins
❏ A delete-deploy approach can avoid problems
❏ Early configuration-driven approach saves time
Takeaways Keeping the house clean
Much more than obstructions
With the help of Apache Airflow, Datalab:
● Enabled the BBC to end a contract with an external recommendation service by
increasing audience engagement by 59%
● Serves millions of personalised recommendations daily to BBC
audiences
● Built a configurable Machine Learning pipeline agnostic of the model
○ Constantly adds new variants and extends workflows
Thank you!
Tatiana Al-Chueyr
@tati_alchueyr
 
Scaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamScaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamTatiana Al-Chueyr
 
Responsible machine learning at the BBC
Responsible machine learning at the BBCResponsible machine learning at the BBC
Responsible machine learning at the BBCTatiana Al-Chueyr
 
Responsible Machine Learning at the BBC
Responsible Machine Learning at the BBCResponsible Machine Learning at the BBC
Responsible Machine Learning at the BBCTatiana Al-Chueyr
 
PyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPCPyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPCTatiana Al-Chueyr
 
PythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummiesPythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummiesTatiana Al-Chueyr
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for youTatiana Al-Chueyr
 
PyConUK 2016 - Writing English Right
PyConUK 2016  - Writing English RightPyConUK 2016  - Writing English Right
PyConUK 2016 - Writing English RightTatiana Al-Chueyr
 
InVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareInVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareTatiana Al-Chueyr
 
Automatic English text correction
Automatic English text correctionAutomatic English text correction
Automatic English text correctionTatiana Al-Chueyr
 
Python packaging and dependency resolution
Python packaging and dependency resolutionPython packaging and dependency resolution
Python packaging and dependency resolutionTatiana Al-Chueyr
 
Rio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comRio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comTatiana Al-Chueyr
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and SemanticsTatiana Al-Chueyr
 
Desarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebasDesarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebasTatiana Al-Chueyr
 
Desarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y AndroidDesarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y AndroidTatiana Al-Chueyr
 

More from Tatiana Al-Chueyr (20)

Integrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache AirflowIntegrating ChatGPT with Apache Airflow
Integrating ChatGPT with Apache Airflow
 
Contributing to Apache Airflow
Contributing to Apache AirflowContributing to Apache Airflow
Contributing to Apache Airflow
 
From an idea to production: building a recommender for BBC Sounds
From an idea to production: building a recommender for BBC SoundsFrom an idea to production: building a recommender for BBC Sounds
From an idea to production: building a recommender for BBC Sounds
 
Precomputing recommendations with Apache Beam
Precomputing recommendations with Apache BeamPrecomputing recommendations with Apache Beam
Precomputing recommendations with Apache Beam
 
Scaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamScaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache Beam
 
Responsible machine learning at the BBC
Responsible machine learning at the BBCResponsible machine learning at the BBC
Responsible machine learning at the BBC
 
Responsible Machine Learning at the BBC
Responsible Machine Learning at the BBCResponsible Machine Learning at the BBC
Responsible Machine Learning at the BBC
 
PyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPCPyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPC
 
Sprint cPython at Globo.com
Sprint cPython at Globo.comSprint cPython at Globo.com
Sprint cPython at Globo.com
 
PythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummiesPythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummies
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for you
 
PyConUK 2016 - Writing English Right
PyConUK 2016  - Writing English RightPyConUK 2016  - Writing English Right
PyConUK 2016 - Writing English Right
 
InVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareInVesalius: 3D medical imaging software
InVesalius: 3D medical imaging software
 
Automatic English text correction
Automatic English text correctionAutomatic English text correction
Automatic English text correction
 
Python packaging and dependency resolution
Python packaging and dependency resolutionPython packaging and dependency resolution
Python packaging and dependency resolution
 
Rio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comRio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.com
 
PythonBrasil[8] closing
PythonBrasil[8] closingPythonBrasil[8] closing
PythonBrasil[8] closing
 
Linking the world with Python and Semantics
Linking the world with Python and SemanticsLinking the world with Python and Semantics
Linking the world with Python and Semantics
 
Desarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebasDesarollando aplicaciones web en python con pruebas
Desarollando aplicaciones web en python con pruebas
 
Desarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y AndroidDesarollando aplicaciones móviles con Python y Android
Desarollando aplicaciones móviles con Python y Android
 

Recently uploaded

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Clearing Airflow Obstructions

  • 1. Clearing Airflow obstructions @tati_alchueyr Online 15 July 2021 Principal Data Engineer @ BBC Datalab
  • 2. ● During this session there will be some quizzes ● Be prepared to either: ○ Scan the QR code ○ Access using the URL Heads up!
  • 3. @tati_alchueyr.__doc__ ● Brazilian living in London since 2014 ● Principal Data Engineer at the BBC Datalab team ● Graduated in Computer Engineering at Unicamp ● Passionate software developer for 18 years ● Experience in the private and public sectors ● Developed software for Medicine, Media and Education ● Loves Open Source ● Loves Brazilian Jiu Jitsu ● Proud mother of Amanda (v4.0)
  • 4. I ❤ Airflow Community & Summit Tomek Urbaszek Jarek Potiuk Ash Berlin-Taylor Kaxil Naik Leah Cole
  • 5. BBC.Datalab.Hummingbirds The work presented here is the result of lots of teamwork within one squad of a much larger team and organisation active squad team members directly contributed in the past Darren Mundy David Hollands Richard Bownes Marc Oppenheimer Bettina Hermant Tatiana Al-Chueyr
  • 6. A. I don’t run Airflow B. On-premise C. Astronomer.io D. Google Cloud Composer E. AWS Managed Workflows F. Other Quiz time how do you run Airflow?
  • 7. Quiz time how do you run Airflow? * responses from Airflow Summit 2021 participants, during the presentation
  • 8. How we use Airflow when things went wrong
  • 9. How we use Airflow when things go wrong
  • 10. How we use Airflow infrastructure ● Managed Airflow ○ Cloud Composer (GCP) ○ Terraform ● Celery Executors GCP constraint ○ Running within Kubernetes (GKE) ● Outdated version 1.10.12 ○ Upgrades have been time consuming ■ Last: 1.10.4 => 1.10.12 @ Dec ‘20 ○ GCP supports newer releases ■ 1.10.5 since April ‘21 (March ‘21) ■ 2.0.1 since May ‘21 (Feb ‘21)
  • 11. ● Execute within Airflow executors ○ BaseOperator ○ BashOperator ○ DummyOperator ○ PythonOperator ○ ShortCircuitOperator ○ TriggerDagRunOperator ○ GCSDeleteObjectsOperator ● Delegate to Kubernetes in a dedicated GKE cluster ○ GKEPodOperator ● Delegate to Apache Beam (Dataflow) ○ DataflowPythonOperator ○ DataflowCreatePythonJobOperator How we use Airflow operators
  • 12. How we use Airflow local, dev, int, prod
  • 13. How we use Airflow application ● Data integration: ETL (extract, transform, load) ● Machine learning: training and precomputation User activity Content metadata Train Model Artefacts Predict Prediction Results Extract & Transform Extract & Transform historical data future User activity features Content metadata features
  • 14. Ingest user activity Ingest content metadata Train model Precompute How we use Airflow application historical data future
  • 15. Train model Precompute How we use Airflow application historical data future * April 2021
  • 16. A. Ingest & transform Content Metadata B. Ingest & transform User Activity C. Train model D. Precompute recommendations Quiz time which is our most stable DAG? (the slide showed each option's input/output as diagrams)
  • 17. A. Ingest & transform Content Metadata Python Operator. ○ ~ 225k records ○ Transforms ~12 GB => 57 MB B. Ingest & transform User Activity Dataflow Operator. ○ ~ 3.5 million records ○ Output: ~ 2 GB C. Train model Kubernetes Operator. ○ Output: ~ 8 GB model & artefacts D. Precompute recommendations Dataflow Operator. ○ ~ 3.5 million records ○ Output: ~ 2.5 GB Quiz time which is our most stable DAG? tips.
  • 18. Quiz time which is our most stable DAG? attendees. * responses from Airflow Summit 2021 participants, during the presentation
  • 19. Quiz time which is our most stable DAG? answer. A. Ingest & transform Content Metadata ○ 2 live incidents B. Ingest & transform User Activity ○ 1 live incident C. Train model ○ 2 live incidents D. Precompute recommendations ○ 2 live incidents * from October 2020 until April 2021
  • 20. A. Ingest & transform Content Metadata ○ Insufficient CPU ○ Spikes -> Timeouts (during higher volumes) / Continuing from where it stopped B. Ingest & transform User Activity ○ Idempotency issue C. Train model ○ K8s Pod reattach ○ Scheduling leading to two tasks running concurrently D. Precompute recommendations ○ Change to default job settings in Dataflow ○ GCS access limit ○ Non-stop Dataflow job Quiz time which is our most stable DAG? details.
  • 22. Obstruction 1 The programme metadata chronic issue historical data future Ingest user activity Ingest content metadata Train model Precompute
  • 23. Obstruction 1 The programme metadata chronic issue historical data future DAG’s goals ● Import objects from AWS S3 (protected by STS) into Google Cloud Storage ● Requirements: between dozens and thousands of KB-sized objects ● Filter and enrich the metadata ● Merge multiple streams of data and create an up-to-date snapshot Mostly implemented using subclasses of the Python Operator.
  • 24. Obstruction 1 The programme metadata chronic issue historical data
  • 25. Obstruction 1 The programme metadata chronic issue historical data Issue Depending on the volumes of data, a single PythonOperator task which usually takes 10 min could take almost 3h! Solutions Increase timeouts Improve machine type Delegate processing to another service Consequences Delay Blocked Airflow executor
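One of the mitigations listed on the slide above, bounding how long a heavy PythonOperator task may hold an executor, can be sketched as task kwargs. This is a minimal sketch: the task id and timeout values are illustrative, not the talk's actual configuration.

```python
from datetime import timedelta

# Hypothetical kwargs for the metadata-merge PythonOperator: a bounded
# execution_timeout makes the task fail fast instead of blocking a Celery
# executor for almost 3h when input volumes spike.
heavy_task_kwargs = dict(
    task_id="merge_metadata_snapshot",        # illustrative task id
    execution_timeout=timedelta(minutes=45),  # generous vs the usual ~10 min run
    retries=2,
)
```

The deeper fix from the talk is delegating the processing to another service; the timeout only limits the blast radius on the Airflow side.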
  • 26. Obstruction 2 When the user activity workflow failed historical data future Ingest user activity Ingest content metadata Train model Precompute
  • 27. Obstruction 2 When the user activity workflow failed historical data future DAG’s goals ● Read from user activity Parquet files in Google Cloud Storage ● Filter relevant activity and metadata ● Export a snapshot for the relevant interval of time ● Requirements: millions of records in MB-sized files Mostly implemented using subclasses of the Dataflow Operator.
  • 28. Obstruction 2 When the user activity workflow failed historical data future
  • 29. Obstruction 2 When the user activity workflow failed historical data future
  • 30. Troubleshooting ● The volume of user activity meant to train the model had doubled! Obstruction 2 When the user activity workflow failed historical data future
  • 31. Obstruction 2 When the user activity workflow failed historical data future What happened ● Dataflow took longer than expected to run a job triggered by Airflow ● Airflow retried ● Both jobs completed successfully - and output the data in the same directory! ● The setup to train the model didn’t expect to handle such a spike in the volume of data and failed
  • 32. Obstruction 2 When the user activity workflow failed historical data future What happened ● Dataflow took longer than expected to run a job triggered by Airflow ● Airflow retried ● Both jobs completed successfully - and output the data in the same directory! ● The setup to train the model didn’t expect to handle such a spike in the volume of data and failed Solution ● Have idempotent tasks ● Clear the target path before processing a task
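The fix above — clear the target path before a task writes, so a retry replaces rather than doubles the output — can be sketched with a local-filesystem stand-in. In production this would clear the GCS prefix instead; the function and file names here are illustrative.

```python
import shutil
from pathlib import Path

def write_snapshot_idempotently(target_dir, records):
    """Wipe the target path, then write, so a retried task replaces
    rather than appends to the output of an earlier attempt."""
    target = Path(target_dir)
    if target.exists():
        shutil.rmtree(target)  # drop any output left behind by a previous try
    target.mkdir(parents=True)
    (target / "part-00000.txt").write_text("\n".join(records))
```

Running it twice for the same interval leaves exactly one copy of the data, which is the property the duplicated Dataflow jobs violated.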
  • 33. Obstruction 3 When precompute failed due to training historical data future Ingest user activity Ingest content metadata Train model Precompute
  • 34. Obstruction 3 When precompute failed due to training historical data future
  • 35. Obstruction 3 When precompute failed due to training historical data future "message": "No such object: datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl"
  • 36. Obstruction 3 When precompute failed due to training historical data future
  • 37. Obstruction 3 When precompute failed due to training historical data future
  • 38. Obstruction 3 When precompute failed due to training historical data future [2021-07-08 12:15:55,278] {logging_mixin.py:112} INFO - Running <TaskInstance: train_model.move_model_from_k8s_to_gcs 2021-07-08T00:00:00+00:00 [running]> on host airflow-worker-867b96c854-jzqw7 [2021-07-08 12:15:55,422] {gcp_container_operator.py:299} INFO - Using gcloud with application default credentials. [2021-07-08 12:15:57,286] {pod_launcher.py:173} INFO - Event: move-model-to-gcs-3deac821b57047619c1c9505ddc5db18 had an event of type Pending [2021-07-08 12:15:57,286] {pod_launcher.py:139} WARNING - Pod not yet started: move-model-to-gcs-3deac821b57047619c1c9505ddc5db18 [2021-07-08 12:17:57,499] {taskinstance.py:1152} ERROR - Pod Launching failed: Pod Launching failed: Pod took too long to start [2021-07-08 12:18:59,584] {pod_launcher.py:156} INFO - b'gsutil -m rm gs://datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl || true\n' [2021-07-08 12:18:59,911] {pod_launcher.py:156} INFO - b'gsutil -m mv /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl (...) [2021-07-08 12:19:35,295] {pod_launcher.py:156} INFO - b'Operation completed over 1 objects/4.7 GiB.\n' [2021-07-08 12:19:35,536] {pod_launcher.py:156} INFO - b'rm -rf /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl\n' [2021-07-08 12:19:36,687] {taskinstance.py:1071} INFO - Marking task as SUCCESS. dag_id=train_model,
  • 39. Obstruction 3 When precompute failed due to training historical data future $ kubectl logs move-model-to-gcs-a0b5193a42e040aaa37b3ad82953ee29 -n xantus-training gsutil -m rm gs://datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl || true CommandException: 1 files/objects could not be removed. gsutil -m mv /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl gs://datalab-sounds-prod-6c75-data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl Copying file:///data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl [Content-Type=application/octet-stream]... Removing file:///data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl... | [1/1 files][ 4.7 GiB/ 4.7 GiB] 100% Done 111.3 MiB/s ETA 00:00:00 Operation completed over 1 objects/4.7 GiB. rm -rf /data/recommenders/xantus/model/2021-07-08T00:00:00+00:00/xantus.pkl
  • 40. Obstruction 3 When precompute failed due to training historical data future What happened ● The GKE node pool, where jobs were executed, was struggling ● Airflow Kubernetes Operator timed out after a long time waiting (Kubernetes didn’t know) ● A new task retry was triggered ● Both pods run concurrently and due to how we implemented idempotency, the data was deleted - but the last task retry was successful Solution ● Use newer version of the KubernetesPodOperator ● Confirm by the end of the task that the desired artefact exists
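The second half of the fix above — confirm at the end of the task that the artefact it was meant to produce actually exists — can be sketched as a guard function. This is a local-path stand-in for the real GCS check; the name and threshold are illustrative.

```python
from pathlib import Path

def assert_artefact_exists(path, min_bytes=1):
    """Fail the producing task loudly if its expected artefact is missing
    or empty, instead of letting a downstream DAG (e.g. precompute)
    discover the gap hours later."""
    artefact = Path(path)
    if not artefact.is_file() or artefact.stat().st_size < min_bytes:
        raise FileNotFoundError(f"Expected artefact missing or empty: {path}")
```

Wired in as the final step of the training task, this turns the "model deleted by a concurrent retry" failure into an immediate, attributable error.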
  • 41. Obstruction 4 Intermittent DAG after an Airflow upgrade historical data future historical data future Ingest user activity Ingest content metadata Train model Precompute
  • 42. Obstruction 4 Intermittent DAG after an Airflow upgrade historical data future What happened ● After upgrading from Airflow 1.10.4 to 1.10.12 some KubenetesPodOperator tasks became intermittent ● Legit running pods failed ● The logs seemed to show that new jobs were trying to reattach to previously existing Pods and failed
  • 43. Obstruction 4 Intermittent DAG after an Airflow upgrade historical data future
  • 44. Obstruction 4 Intermittent DAG after an Airflow upgrade historical data future HTTP response headers: HTTPHeaderDict({'Audit-Id': '133bc1f2-388b-490c-bdc6-34053685d5ee', 'Content-Type': 'application/json', 'Date': 'Sat, 30 Jan 2021 09:11:33 GMT', 'Content-Length': '231'}) HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container "base" in pod "create-dataset-17a3b6f132e44544a836550be367c670" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}\n' (...) "/opt/python3.6/lib/python3.6/site-packages/kubernetes/client/rest.py", line 231, in request raise ApiException(http_resp=r) kubernetes.client.rest.ApiException: (400) Reason: Bad Request
  • 45. Obstruction 4 Intermittent DAG after an Airflow upgrade historical data future Solution https://airflow.apache.org/docs/apache-airflow/1.10.12/_api/airflow/contrib/operators/kubernetes_pod_operator/index.html
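The 1.10.12 operator docs linked above expose a `reattach_on_restart` flag on the KubernetesPodOperator, which makes a retry launch a fresh pod instead of reattaching to one from the previous attempt. A sketch of the relevant kwargs — the task id, namespace, and image are illustrative, not the talk's actual values:

```python
# Illustrative kwargs for KubernetesPodOperator (Airflow 1.10.12 contrib API).
train_pod_kwargs = dict(
    task_id="train_model",              # illustrative
    namespace="xantus-training",
    name="train-model",
    image="eu.gcr.io/example/trainer",  # hypothetical image
    reattach_on_restart=False,          # retries start a fresh pod
    is_delete_operator_pod=True,        # clean pods up after completion
    startup_timeout_seconds=600,
)
```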
  • 46. Obstruction 5 When the Dataflow job said no historical data future historical data future Ingest user activity Ingest content metadata Train model Precompute
  • 47. Obstruction 5 When the Dataflow job said no historical data future
  • 48. Obstruction 5 When the Dataflow job said no historical data future OSError: [Errno 28] No space left on device During handling
  • 49. Obstruction 5 When the Dataflow job said no future If a batch job uses Dataflow Shuffle, then the default is 25 GB; otherwise, the default is 250 GB. https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#python
  • 50. Obstruction 5 When the Dataflow job said no future
  • 51. Obstruction 5 When the Dataflow job said no future --experiments=shuffle_mode=appliance Solution
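Passing the flag from the slide above via Airflow amounts to adding it to the job's pipeline options. A sketch of how the experiment might be wired into a Dataflow operator's options dict — the project, region, and bucket values are hypothetical:

```python
# Illustrative Dataflow pipeline options: opting out of the Dataflow Shuffle
# service restores the larger (250 GB) default worker disk for batch jobs.
dataflow_default_options = {
    "project": "example-project",                # hypothetical
    "region": "europe-west1",
    "temp_location": "gs://example-bucket/tmp",  # hypothetical
    "experiments": ["shuffle_mode=appliance"],
}
```

Alternatively, explicitly sizing workers with `disk_size_gb` avoids relying on either default.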
  • 52. Obstruction 6 The never ending Dataflow job historical data future historical data Ingest user activity Ingest content metadata Train model Precompute
  • 53. Obstruction 6 The never ending Dataflow job historical data future
  • 54. Obstruction 6 The never ending Dataflow job historical data future [2021-05-13 11:52:34,884] {gcp_dataflow_hook.py:1 (...) Traceback (most recent call last): File "/home/airflow/gcs/dags/predictions/compute_predictions.py", line 337, in run return pipeline File "/tmp/dataflow_venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 569, in __exit__ self.result.wait_until_finish() File "/tmp/dataflow_venv/lib/python3.6/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1650, in wait_until_finish self) apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, (...) The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: compute-predictions-xantu-05130255-opno-harness-mtj1 Root cause: The worker lost contact with the service., compute-predictions-xantu-05130255-opno-harness-2x4v Root cause: The worker lost contact with the service., compute-predictions-xantu-05130255-opno-harness-5gkd Root cause: The worker lost contact with the service., compute-predictions-xantu-05130255-opno-harness-2t1n Root cause: The worker lost contact with the service.'
  • 55. Obstruction 6 The never ending Dataflow job historical data future
  • 56. Obstruction 6 The never ending Dataflow job historical data future Solution ● Backport the latest Google Cloud operators in Apache Airflow ● Particularly: ○ DataflowCreatePythonJobOperator ○ DataflowJobStatusSensor https://medium.com/google-cloud/backporting-google-cloud-operators-in-apache-airflow-34b 6c9efffc8
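Splitting submission from monitoring is the key idea behind DataflowJobStatusSensor: a lightweight sensor polls the job state and fails fast on a timeout, instead of one operator blocking indefinitely on a hung job. A plain-Python sketch of that polling pattern (state names follow Dataflow's JOB_STATE_* convention; the function is illustrative, not the sensor's actual implementation):

```python
import time

def wait_for_job(get_state, expected="JOB_STATE_DONE",
                 poke_interval=1.0, timeout=10.0):
    """Poll get_state() until it returns `expected` or `timeout` elapses.

    Mirrors what DataflowJobStatusSensor does for us: poll the job state
    at `poke_interval` and raise quickly on failure or timeout, rather
    than letting a never-ending job run (and bill) for days.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state == expected:
            return True
        if state in ("JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
            raise RuntimeError(f"Dataflow job ended in {state}")
        time.sleep(poke_interval)
    raise TimeoutError("Dataflow job did not reach the expected state in time")
```

In a DAG, the same deadline would come from the sensor's `timeout`/`poke_interval` arguments, so a stuck job fails the task instead of running forever.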
  • 57. A. Processing programme metadata within Airflow executors B. Non-idempotent tasks C. Breaking change after Airflow upgrade D. Breaking change in upstream service E. Not monitoring efficiently pre-production environments Quiz time which do you reckon was the most costly issue?
  • 58. Quiz time which do you reckon was the most costly issue? * responses from Airflow Summit 2021 participants, during the presentation
  • 59. Quiz time which do you reckon was the most costly issue? A. Processing programme metadata within Airflow executors B. Non-idempotent tasks C. Breaking change after Airflow upgrade D. Breaking change in upstream service E. Not monitoring efficiently pre-production environments
  • 60. Quiz time which do you reckon was the most costly issue? The never ending Dataflow job, triggered with DataflowOperator in our int environment, ran for over 3 days and cost over £12k
  • 62. Smell 1 To plugin or not to plugin future ● Requirement: ○ Have common packages across multiple DAGs without using PyPI or similar ● Attempt: ○ Use plugins to expose those ○ Deploy using ■ gcloud composer environments storage dags import . ■ gcloud composer environments storage plugins import . ● Problems ○ Lots of broken deployments ○ Unsynchronised upload of plugins and DAGs to the Composer web server & workers ○ Issues in enabling DAG serialisation with plugins
  • 63. ● Solution: ○ Stop using plugins ○ Use standard Python packages ○ Upload them to the Composer bucket path using the Google Cloud SDK ■ gsutil rm (...) ■ gsutil cp (...) Smell 1 To plugin or not to plugin future
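The delete-then-copy deploy can be sketched as a small script. Bucket and package names below are illustrative, not our real paths; the first argument lets the sketch be dry-run without GCP credentials:

```shell
#!/usr/bin/env bash
# Hedged sketch of deploying plain Python packages (instead of plugins)
# to the Composer DAGs bucket: delete the old copy first, then upload,
# so stale modules never linger on the web server or workers.
deploy_shared_libs() {
  # $1: gsutil command ("gsutil", or "echo gsutil" for a dry run)
  # $2: Composer bucket, e.g. gs://example-composer-bucket (illustrative)
  local gsutil_cmd="$1" bucket="$2"
  $gsutil_cmd -m rm -r "${bucket}/dags/shared_libs" || true
  $gsutil_cmd -m cp -r ./shared_libs "${bucket}/dags/shared_libs"
}

# Dry run: prints the gsutil commands rather than executing them.
deploy_shared_libs "echo gsutil" "gs://example-composer-bucket"
```

Deleting before copying is what makes the deploy safe: removed modules disappear from the bucket instead of silently surviving in workers' copies.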
  • 64. Smell 2 Configuration (mis)patterns historical data future ● Requirement: ○ Have a strategy for handling environment-specific and common configuration ● Attempt: ○ Use environment variables to declare each env-specific variable ○ Deploy them using Terraform ○ Declare common configuration in the DAGs ● Problems ○ Every environment variable update triggered a Cloud Composer deployment ○ Redundant configuration definitions across DAGs and multiple utility modules ○ Hard to identify the data source / target paths
  • 65. Smell 2 Configuration (mis)patterns historical data future ● Solution: have configuration files loaded into Airflow variables with paths and variables https://www.astronomer.io/guides/dynamically-generating-dags
  • 66. Smell 2 Configuration (mis)patterns historical data future
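The configuration-file approach above can be sketched in plain Python: common defaults live in code, and environment-specific settings are merged over them. In a real DAG the JSON would come from an Airflow Variable (e.g. `Variable.get(..., deserialize_json=True)`); all names and values here are illustrative:

```python
import json

# Common settings shared by every DAG, declared once (illustrative values).
COMMON_CONFIG = {"owner": "datalab", "retries": 2}

def load_dag_config(variable_json: str) -> dict:
    """Merge environment-specific settings over the common defaults.

    In a real DAG the JSON string would come from an Airflow Variable;
    it is passed in directly here so the sketch stays self-contained.
    """
    env_config = json.loads(variable_json)
    return {**COMMON_CONFIG, **env_config}

# An example "int" environment override: its own bucket and more retries.
config = load_dag_config(
    '{"gcs_bucket": "gs://example-int-bucket", "retries": 3}'
)
```

With paths like `gcs_bucket` living in one variable per environment, the data sources and targets become visible in a single place instead of being scattered across DAGs and utility modules.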
  • 67. ❏ Keep processing out of Airflow executors ❏ Idempotency matters - and it can be hard! ❏ Backporting is better than sticking to the past ❏ Reviewing release notes can help avoid live incidents ❏ Monitoring pre-production environments can save money Takeaways Avoiding live incidents
  • 68. ❏ Avoid plugins ❏ A delete-deploy approach can avoid problems ❏ Early configuration-driven approach saves time Takeaways Keeping the house clean
  • 69. Much more than obstructions With the help of Apache Airflow, Datalab: ● Enabled the BBC to end a contract with an external recommendation service, by increasing audience engagement by 59% ● Serves millions of personalised recommendations to BBC audiences daily ● Built a configurable Machine Learning pipeline agnostic of the model ○ Constantly adds new variants and extends workflows