SlideShare a Scribd company logo
1 of 43
Download to read offline
Creating your own Chat GPT with
Apache Airflow
@tati_alchueyr
Staff Software Engineer - Astronomer
13th July 2023 - AI Camp London Meetup
Turing test
https://marcabraham.com/2022/10/17/what-is-the-turing-test/
ChatGPT
https://chat.openai.com/
https://chat.openai.com/
https://xkcd.com/329/
inspect(ChatGPT)
● Artificial intelligence chatbot
● Developed by OpenAI
● Proprietary machine learning model
○ Uses LLM (Large Language Models)
○ GPT == Generative Pre-Trained Transformer
○ Fine-tuned GPT-3.5 (text-DaVinci-003)
● Over 100 million user base
● Dataset size: 570 GBs; 175 Billion Parameters
● Estimated cost to run per month: $3 million
https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app
https://indianexpress.com/article/technology/tech-news-technology/chatgpt-interesting-things-to-know-8334991/
https://meetanshi.com/blog/chatgpt-statistics/
help(LLM)
A Large Language Model is a type
of AI algorithm trained on huge
amounts of text data that can
understand and generate text
help(LLM)
LLM can be characterized by 4 parameters:
● Size of the training dataset
● Cost of training
● Size of the model
● Performance after training
timeline(LLM)
https://samim.io/p/2023-04-30-evolutionary-tree-of-llms/
Proprietary LLM limitations
● Data Privacy and Security
● Dependency and Customisation
● Cost and Scalability
● Access and Availability
Open-source LLM alternatives
● LLaMA (Meta)
● Alpaca (Stanford)
● Vicuna (Berkeley, Carnegie Mellon, Stanford)
● Dolly (Datricks)
● Open Assistant (individuals)
● h2oGPT (h2o)
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
h2oGPT about
● Open-source (Apache 2.0) generative AI
● Empowers users to create their own language models
● https://gpt.h2o.ai/
● https://github.com/h2oai/h2ogpt
● https://www.youtube.com/watch?v=Coj72EzmX20&t=757s
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
h2oGPT about
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
Apache Airflow
Apache Airflow is an open-source
platform for developing,
scheduling, and monitoring
batch-oriented workflows.
help(airflow)
usage(airflow)
https://github.com/apache/airflow
https://pypistats.org/packages/apache-airflow
airflow.__author__
example(workflow)
airflow.concepts
airflow.concepts
airflow.concepts
airflow providers packages
https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html
● apache-airflow-providers-airbyte
● apache-airflow-providers-alibaba
● apache-airflow-providers-amazon
● apache-airflow-providers-apache-beam
● apache-airflow-providers-apache-cassandra
● apache-airflow-providers-apache-drill
● apache-airflow-providers-apache-druid
● apache-airflow-providers-apache-flink
● apache-airflow-providers-apache-hdfs
● apache-airflow-providers-apache-hive
● apache-airflow-providers-apache-impala
● apache-airflow-providers-apache-kafka
● apache-airflow-providers-apache-kylin
● apache-airflow-providers-apache-livy
● apache-airflow-providers-apache-pig
● apache-airflow-providers-apache-pinot
● apache-airflow-providers-apache-spark
● apache-airflow-providers-apache-sqoop
● apache-airflow-providers-apprise
● apache-airflow-providers-arangodb
● apache-airflow-providers-asana
● apache-airflow-providers-atlassian-jira
● apache-airflow-providers-celery
● apache-airflow-providers-cloudant
● apache-airflow-providers-cncf-kubernetes
● apache-airflow-providers-common-sql
● apache-airflow-providers-databricks
● apache-airflow-providers-datadog
● apache-airflow-providers-dbt-cloud
● apache-airflow-providers-dingding
● apache-airflow-providers-discord
● apache-airflow-providers-docker
● apache-airflow-providers-elasticsearch
● apache-airflow-providers-exasol
● apache-airflow-providers-facebook
● apache-airflow-providers-ftp
● apache-airflow-providers-github
● apache-airflow-providers-google
● apache-airflow-providers-grpc
● apache-airflow-providers-hashicorp
airflow providers packages
https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html
● apache-airflow-providers-http
● apache-airflow-providers-imap
● apache-airflow-providers-influxdb
● apache-airflow-providers-jdbc
● apache-airflow-providers-jenkins
● apache-airflow-providers-microsoft-azure
● apache-airflow-providers-microsoft-mssql
● apache-airflow-providers-microsoft-psrp
● apache-airflow-providers-microsoft-winrm
● apache-airflow-providers-mongo
● apache-airflow-providers-mysql
● apache-airflow-providers-neo4j
● apache-airflow-providers-odbc
● apache-airflow-providers-openfaas
● apache-airflow-providers-openlineage
● apache-airflow-providers-opsgenie
● apache-airflow-providers-oracle
● Apache-airflow-providers-pagerduty
● Apache-airflow-providers-papermill
● Apache-airflow-providers-plexus
● apache-airflow-providers-postgres
● apache-airflow-providers-presto
● apache-airflow-providers-qubole
● apache-airflow-providers-redis
● apache-airflow-providers-salesforce
● apache-airflow-providers-samba
● apache-airflow-providers-segment
● apache-airflow-providers-sendgrid
● apache-airflow-providers-sftp
● apache-airflow-providers-singularity
● apache-airflow-providers-slack
● apache-airflow-providers-smtp
● apache-airflow-providers-snowflake
● apache-airflow-providers-sqlite
● apache-airflow-providers-ssh
● apache-airflow-providers-tableau
● apache-airflow-providers-tabular
● apache-airflow-providers-telegram
● apache-airflow-providers-trino
● apache-airflow-providers-vertica
● apache-airflow-providers-zendesk
airflow example DAG
from airflow import DAG
from datetime import datetime
def train_model():
pass
with DAG(
“train_models",
start_date=datetime(2023, 7, 4),
schedule="@daily") as dag:
train_model = PythonOperator(
task_id="train_model",
python_callable=train_model
)
airflow example DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from random import randint
from datetime import datetime
def _evaluate_model():
return randint(1,10)
def _choose_best(ti):
tasks = [
"evaluate_model_a",
"evaluate_model_b"
]
accuracies = [ti.xcom_pull(task_id) for task_id in
tasks]
best_accuracy = max(accuracies)
for model, model_accuracy in zip(tasks,
accuracies):
if model_accuracy == best_accuracy:
return model
with DAG(
"evaluate_models",
start_date=datetime(2023, 7, 4),
schedule="@daily") as dag:
evaluate_model_a = PythonOperator(
task_id="evaluate_model_a",
python_callable=_evaluate_model
)
evaluate_model_b = PythonOperator(
task_id="evaluate_model_b",
python_callable=_evaluate_model
)
choose_best_model = PythonOperator(
task_id="choose_best_model",
python_callable=_choose_best
)
[evaluate_model_a, evaluate_model_b] >>
choose_best_model
airflow example DAG
airflow example of pipelines
airflow example of pipelines
airflow example of pipelines
Building an AI Chat Bot
with Airflow
Airflow to build a LLM Chat Bot
● Open-source and cloud-agnostic: you are not locked in!
● Same orchestration tool for ELT/ETL and ML
● Automate the steps of a model pipeline, using Airflow to:
○ Monitor the status and duration of tasks over time
○ Retry on failures
○ Send notifications (email, slack, others) to the team
● Dynamically trigger tasks using different hyper parameters
● Dynamically select models based on their scores
● Trigger model pipelines based of dataset changes
● Smoothly run tasks in VMs, containers or Kubernetes
Use the KubernetesPodOperator
● Create tasks which are run in Kubernetes pods
● Use node_affinity to allocate job to run on the nodepool
with the desired memory/CPU/GPU
● Use k8s.V1VolumeMount to efficiently mount volumes (e.g.
NFS) to access large models from different Pods (evaluate,
serve)
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
Use Dataset-aware scheduling
● Schedule when tasks (from other DAGs) complete successfully
from airflow.datasets import Dataset
with DAG(“ingest_dataset”, ...):
MyOperator(
# this task updates example.csv
outlets=[Dataset("s3://dataset-bucket/source-data.parquet")],
...,
)
with DAG(“train_model”,
# this DAG should be run when source-data.parquet is updated (by dag “ingest_dataset”)
schedule=[Dataset("s3://dataset-bucket/source_data.csv")],
...,
):
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
Use Dynamic Task Mapping
● Create a variable number of tasks at runtime based upon the
data created by the previous task
● Can be useful in several situations, including chosing the most
adequate model
● Support map/reduce
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html
Dynamic Task Mapping
from __future__ import annotations
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
with DAG(
dag_id="example_dynamic_task_mapping",
start_date=datetime(2022, 3, 4)
) as dag:
@task
def evaluate_model(model_path):
(...)
return evaluation_metrics
@task
def chose_model(metrics_by_model):
(...)
return chosen_one
models_metrics = evaluate_model.expand(
model_path=["/data/model1", "/data/model2", "/data/model3"]
)
chose_model(models_metrics)
Apache Airflow
Community
Apache Airflow Community
https://airflow.apache.org/community/
https://github.com/apache/airflow
https://www.meetup.com/london-apache-airflow-meetup/
https://www.astronomer.io/
@tati_alchueyr
tatiana.alchueyr@astronomer.io
Thank you!

More Related Content

What's hot

NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020Timothy McAliley
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAlex Tumanoff
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022HostedbyConfluent
 
Migration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQLMigration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQLPGConf APAC
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Combine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklCombine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklNeo4j
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 

What's hot (20)

NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
 
Migration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQLMigration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQL
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Combine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quicklCombine Spring Data Neo4j and Spring Boot to quickl
Combine Spring Data Neo4j and Spring Boot to quickl
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Ml ops on AWS
Ml ops on AWSMl ops on AWS
Ml ops on AWS
 

Similar to Integrating ChatGPT with Apache Airflow

Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...OpenShift Origin
 
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...SaltStack
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSDerrick Qin
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overviewprevota
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflowmutt_data
 
Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on KubernetesJoerg Henning
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark DownscalingDatabricks
 
GE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoTGE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoTKai Zhao
 
OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...
OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...
OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...OpenShift Origin
 
OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce
OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce
OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce Diane Mueller
 
How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014Puppet
 
BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013Andy Bunce
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheusBob Cotton
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analyticsSouth West Data Meetup
 
Fluent 2018: When third parties stop being polite... and start getting real
Fluent 2018: When third parties stop being polite... and start getting realFluent 2018: When third parties stop being polite... and start getting real
Fluent 2018: When third parties stop being polite... and start getting realAkamai Developers & Admins
 
When Third Parties Stop Being Polite... and Start Getting Real
When Third Parties Stop Being Polite... and Start Getting RealWhen Third Parties Stop Being Polite... and Start Getting Real
When Third Parties Stop Being Polite... and Start Getting RealNicholas Jansma
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 
Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209mffiedler
 

Similar to Integrating ChatGPT with Apache Airflow (20)

Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
Extending OpenShift Origin: Build Your Own Cartridge with Bill DeCoste of Red...
 
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
SaltConf14 - Eric johnson, Google - Orchestrating Google Compute Engine with ...
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
 
Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on Kubernetes
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
GE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoTGE Predix 新手入门 赵锴 物联网_IoT
GE Predix 新手入门 赵锴 物联网_IoT
 
OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...
OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...
OpenShift Origin Community Day (Boston) Extending OpenShift Origin: Build You...
 
OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce
OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce
OpenShift Origin Community Day (Boston) Writing Cartridges V2 by Jhon Honce
 
How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014How to Puppetize Google Cloud Platform - PuppetConf 2014
How to Puppetize Google Cloud Platform - PuppetConf 2014
 
BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013BaseX user-group-talk XML Prague 2013
BaseX user-group-talk XML Prague 2013
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheus
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
 
Fluent 2018: When third parties stop being polite... and start getting real
Fluent 2018: When third parties stop being polite... and start getting realFluent 2018: When third parties stop being polite... and start getting real
Fluent 2018: When third parties stop being polite... and start getting real
 
When Third Parties Stop Being Polite... and Start Getting Real
When Third Parties Stop Being Polite... and Start Getting RealWhen Third Parties Stop Being Polite... and Start Getting Real
When Third Parties Stop Being Polite... and Start Getting Real
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209Testing kubernetes and_open_shift_at_scale_20170209
Testing kubernetes and_open_shift_at_scale_20170209
 

More from Tatiana Al-Chueyr

Contributing to Apache Airflow
Contributing to Apache AirflowContributing to Apache Airflow
Contributing to Apache AirflowTatiana Al-Chueyr
 
From an idea to production: building a recommender for BBC Sounds
From an idea to production: building a recommender for BBC SoundsFrom an idea to production: building a recommender for BBC Sounds
From an idea to production: building a recommender for BBC SoundsTatiana Al-Chueyr
 
Precomputing recommendations with Apache Beam
Precomputing recommendations with Apache BeamPrecomputing recommendations with Apache Beam
Precomputing recommendations with Apache BeamTatiana Al-Chueyr
 
Scaling machine learning to millions of users with Apache Beam
Scaling machine learning to millions of users with Apache BeamScaling machine learning to millions of users with Apache Beam
Scaling machine learning to millions of users with Apache BeamTatiana Al-Chueyr
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow ObstructionsTatiana Al-Chueyr
 
Scaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamScaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamTatiana Al-Chueyr
 
Responsible machine learning at the BBC
Responsible machine learning at the BBCResponsible machine learning at the BBC
Responsible machine learning at the BBCTatiana Al-Chueyr
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonTatiana Al-Chueyr
 
Responsible Machine Learning at the BBC
Responsible Machine Learning at the BBCResponsible Machine Learning at the BBC
Responsible Machine Learning at the BBCTatiana Al-Chueyr
 
PyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPCPyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPCTatiana Al-Chueyr
 
PythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummiesPythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummiesTatiana Al-Chueyr
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for youTatiana Al-Chueyr
 
PyConUK 2016 - Writing English Right
PyConUK 2016  - Writing English RightPyConUK 2016  - Writing English Right
PyConUK 2016 - Writing English RightTatiana Al-Chueyr
 
InVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareInVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareTatiana Al-Chueyr
 
Automatic English text correction
Automatic English text correctionAutomatic English text correction
Automatic English text correctionTatiana Al-Chueyr
 
Python packaging and dependency resolution
Python packaging and dependency resolutionPython packaging and dependency resolution
Python packaging and dependency resolutionTatiana Al-Chueyr
 
Rio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comRio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comTatiana Al-Chueyr
 

More from Tatiana Al-Chueyr (20)

Contributing to Apache Airflow
Contributing to Apache AirflowContributing to Apache Airflow
Contributing to Apache Airflow
 
From an idea to production: building a recommender for BBC Sounds
From an idea to production: building a recommender for BBC SoundsFrom an idea to production: building a recommender for BBC Sounds
From an idea to production: building a recommender for BBC Sounds
 
Precomputing recommendations with Apache Beam
Precomputing recommendations with Apache BeamPrecomputing recommendations with Apache Beam
Precomputing recommendations with Apache Beam
 
Scaling machine learning to millions of users with Apache Beam
Scaling machine learning to millions of users with Apache BeamScaling machine learning to millions of users with Apache Beam
Scaling machine learning to millions of users with Apache Beam
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Scaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache BeamScaling machine learning workflows with Apache Beam
Scaling machine learning workflows with Apache Beam
 
Responsible machine learning at the BBC
Responsible machine learning at the BBCResponsible machine learning at the BBC
Responsible machine learning at the BBC
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and Python
 
Responsible Machine Learning at the BBC
Responsible Machine Learning at the BBCResponsible Machine Learning at the BBC
Responsible Machine Learning at the BBC
 
PyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPCPyConUK 2018 - Journey from HTTP to gRPC
PyConUK 2018 - Journey from HTTP to gRPC
 
Sprint cPython at Globo.com
Sprint cPython at Globo.comSprint cPython at Globo.com
Sprint cPython at Globo.com
 
PythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummiesPythonBrasil[8] - CPython for dummies
PythonBrasil[8] - CPython for dummies
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for you
 
Crafting APIs
Crafting APIsCrafting APIs
Crafting APIs
 
PyConUK 2016 - Writing English Right
PyConUK 2016  - Writing English RightPyConUK 2016  - Writing English Right
PyConUK 2016 - Writing English Right
 
InVesalius: 3D medical imaging software
InVesalius: 3D medical imaging softwareInVesalius: 3D medical imaging software
InVesalius: 3D medical imaging software
 
Automatic English text correction
Automatic English text correctionAutomatic English text correction
Automatic English text correction
 
Python packaging and dependency resolution
Python packaging and dependency resolutionPython packaging and dependency resolution
Python packaging and dependency resolution
 
Rio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.comRio info 2013 - Linked Data at Globo.com
Rio info 2013 - Linked Data at Globo.com
 
PythonBrasil[8] closing
PythonBrasil[8] closingPythonBrasil[8] closing
PythonBrasil[8] closing
 

Recently uploaded

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformWSO2
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseWSO2
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 

Recently uploaded (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Integrating ChatGPT with Apache Airflow

  • 1. Creating your own Chat GPT with Apache Airflow @tati_alchueyr Staff Software Engineer - Astronomer 13th July 2023 - AI Camp London Meetup
  • 4.
  • 6.
  • 9. inspect(ChatGPT) ● Artificial intelligence chatbot ● Developed by OpenAI ● Proprietary machine learning model ○ Uses LLM (Large Language Models) ○ GPT == Generative Pre-Trained Transformer ○ Fine-tuned GPT-3.5 (text-DaVinci-003) ● Over 100 million user base ● Dataset size: 570 GBs; 175 Billion Parameters ● Estimated cost to run per month: $3 million https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app https://indianexpress.com/article/technology/tech-news-technology/chatgpt-interesting-things-to-know-8334991/ https://meetanshi.com/blog/chatgpt-statistics/
  • 10. help(LLM) A Large Language Model is a type of AI algorithm trained on huge amounts of text data that can understand and generate text
  • 11. help(LLM) LLM can be characterized by 4 parameters: ● Size of the training dataset ● Cost of training ● Size of the model ● Performance after training
  • 13. Proprietary LLM limitations ● Data Privacy and Security ● Dependency and Customisation ● Cost and Scalability ● Access and Availability
  • 14. Open-source LLM alternatives ● LLaMA (Meta) ● Alpaca (Stanford) ● Vicuna (Berkeley, Carnegie Mellon, Stanford) ● Dolly (Datricks) ● Open Assistant (individuals) ● h2oGPT (h2o) https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
  • 15. h2oGPT about ● Open-source (Apache 2.0) generative AI ● Empowers users to create their own language models ● https://gpt.h2o.ai/ ● https://github.com/h2oai/h2ogpt ● https://www.youtube.com/watch?v=Coj72EzmX20&t=757s https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
  • 18. Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. help(airflow)
  • 25. airflow providers packages https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html ● apache-airflow-providers-airbyte ● apache-airflow-providers-alibaba ● apache-airflow-providers-amazon ● apache-airflow-providers-apache-beam ● apache-airflow-providers-apache-cassandra ● apache-airflow-providers-apache-drill ● apache-airflow-providers-apache-druid ● apache-airflow-providers-apache-flink ● apache-airflow-providers-apache-hdfs ● apache-airflow-providers-apache-hive ● apache-airflow-providers-apache-impala ● apache-airflow-providers-apache-kafka ● apache-airflow-providers-apache-kylin ● apache-airflow-providers-apache-livy ● apache-airflow-providers-apache-pig ● apache-airflow-providers-apache-pinot ● apache-airflow-providers-apache-spark ● apache-airflow-providers-apache-sqoop ● apache-airflow-providers-apprise ● apache-airflow-providers-arangodb ● apache-airflow-providers-asana ● apache-airflow-providers-atlassian-jira ● apache-airflow-providers-celery ● apache-airflow-providers-cloudant ● apache-airflow-providers-cncf-kubernetes ● apache-airflow-providers-common-sql ● apache-airflow-providers-databricks ● apache-airflow-providers-datadog ● apache-airflow-providers-dbt-cloud ● apache-airflow-providers-dingding ● apache-airflow-providers-discord ● apache-airflow-providers-docker ● apache-airflow-providers-elasticsearch ● apache-airflow-providers-exasol ● apache-airflow-providers-facebook ● apache-airflow-providers-ftp ● apache-airflow-providers-github ● apache-airflow-providers-google ● apache-airflow-providers-grpc ● apache-airflow-providers-hashicorp
  • 26. airflow providers packages https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html ● apache-airflow-providers-http ● apache-airflow-providers-imap ● apache-airflow-providers-influxdb ● apache-airflow-providers-jdbc ● apache-airflow-providers-jenkins ● apache-airflow-providers-microsoft-azure ● apache-airflow-providers-microsoft-mssql ● apache-airflow-providers-microsoft-psrp ● apache-airflow-providers-microsoft-winrm ● apache-airflow-providers-mongo ● apache-airflow-providers-mysql ● apache-airflow-providers-neo4j ● apache-airflow-providers-odbc ● apache-airflow-providers-openfaas ● apache-airflow-providers-openlineage ● apache-airflow-providers-opsgenie ● apache-airflow-providers-oracle ● Apache-airflow-providers-pagerduty ● Apache-airflow-providers-papermill ● Apache-airflow-providers-plexus ● apache-airflow-providers-postgres ● apache-airflow-providers-presto ● apache-airflow-providers-qubole ● apache-airflow-providers-redis ● apache-airflow-providers-salesforce ● apache-airflow-providers-samba ● apache-airflow-providers-segment ● apache-airflow-providers-sendgrid ● apache-airflow-providers-sftp ● apache-airflow-providers-singularity ● apache-airflow-providers-slack ● apache-airflow-providers-smtp ● apache-airflow-providers-snowflake ● apache-airflow-providers-sqlite ● apache-airflow-providers-ssh ● apache-airflow-providers-tableau ● apache-airflow-providers-tabular ● apache-airflow-providers-telegram ● apache-airflow-providers-trino ● apache-airflow-providers-vertica ● apache-airflow-providers-zendesk
  • 27. airflow example DAG from airflow import DAG from datetime import datetime def train_model(): pass with DAG( “train_models", start_date=datetime(2023, 7, 4), schedule="@daily") as dag: train_model = PythonOperator( task_id="train_model", python_callable=train_model )
  • 28. airflow example DAG from airflow import DAG from airflow.operators.python import PythonOperator from random import randint from datetime import datetime def _evaluate_model(): return randint(1,10) def _choose_best(ti): tasks = [ "evaluate_model_a", "evaluate_model_b" ] accuracies = [ti.xcom_pull(task_id) for task_id in tasks] best_accuracy = max(accuracies) for model, model_accuracy in zip(tasks, accuracies): if model_accuracy == best_accuracy: return model with DAG( "evaluate_models", start_date=datetime(2023, 7, 4), schedule="@daily") as dag: evaluate_model_a = PythonOperator( task_id="evaluate_model_a", python_callable=_evaluate_model ) evaluate_model_b = PythonOperator( task_id="evaluate_model_b", python_callable=_evaluate_model ) choose_best_model = PythonOperator( task_id="choose_best_model", python_callable=_choose_best ) [evaluate_model_a, evaluate_model_b] >> choose_best_model
  • 30. airflow example of pipelines
  • 31. airflow example of pipelines
  • 32. airflow example of pipelines
  • 33. Building an AI Chat Bot with Airflow
  • 34. Airflow to build a LLM Chat Bot ● Open-source and cloud-agnostic: you are not locked in! ● Same orchestration tool for ELT/ETL and ML ● Automate the steps of a model pipeline, using Airflow to: ○ Monitor the status and duration of tasks over time ○ Retry on failures ○ Send notifications (email, slack, others) to the team ● Dynamically trigger tasks using different hyper parameters ● Dynamically select models based on their scores ● Trigger model pipelines based of dataset changes ● Smoothly run tasks in VMs, containers or Kubernetes
  • 35. Use the KubernetesPodOperator ● Create tasks which are run in Kubernetes pods ● Use node_affinity to allocate job to run on the nodepool with the desired memory/CPU/GPU ● Use k8s.V1VolumeMount to efficiently mount volumes (e.g. NFS) to access large models from different Pods (evaluate, serve) https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
  • 36. Use Dataset-aware scheduling ● Schedule when tasks (from other DAGs) complete successfully from airflow.datasets import Dataset with DAG(“ingest_dataset”, ...): MyOperator( # this task updates example.csv outlets=[Dataset("s3://dataset-bucket/source-data.parquet")], ..., ) with DAG(“train_model”, # this DAG should be run when source-data.parquet is updated (by dag “ingest_dataset”) schedule=[Dataset("s3://dataset-bucket/source_data.csv")], ..., ): https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
  • 37. Use Dynamic Task Mapping ● Create a variable number of tasks at runtime based upon the data created by the previous task ● Can be useful in several situations, including chosing the most adequate model ● Support map/reduce https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html
  • 38. Dynamic Task Mapping from __future__ import annotations from datetime import datetime from airflow import DAG from airflow.decorators import task with DAG( dag_id="example_dynamic_task_mapping", start_date=datetime(2022, 3, 4) ) as dag: @task def evaluate_model(model_path): (...) return evaluation_metrics @task def chose_model(metrics_by_model): (...) return chosen_one models_metrics = evaluate_model.expand( model_path=["/data/model1", "/data/model2", "/data/model3"] ) chose_model(models_metrics)
  • 41.