Talk given at the London AICamp meetup on 13 July 2023. It is an introduction to building open-source ChatGPT-like chatbots and some of the considerations to keep in mind while training/tuning them using Airflow.
9. inspect(ChatGPT)
● Artificial intelligence chatbot
● Developed by OpenAI
● Proprietary machine learning model
○ Uses LLM (Large Language Models)
○ GPT == Generative Pre-Trained Transformer
○ Fine-tuned GPT-3.5 (text-davinci-003)
● User base of over 100 million
● Dataset size: 570 GB; 175 billion parameters
● Estimated cost to run per month: $3 million
https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app
https://indianexpress.com/article/technology/tech-news-technology/chatgpt-interesting-things-to-know-8334991/
https://meetanshi.com/blog/chatgpt-statistics/
10. help(LLM)
A Large Language Model is a type of AI algorithm trained on huge amounts of text data that can understand and generate text
11. help(LLM)
An LLM can be characterized by four parameters:
● Size of the training dataset
● Cost of training
● Size of the model
● Performance after training
15. h2oGPT about
● Open-source (Apache 2.0) generative AI
● Empowers users to create their own language models
● https://gpt.h2o.ai/
● https://github.com/h2oai/h2ogpt
● https://www.youtube.com/watch?v=Coj72EzmX20&t=757s
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
27. airflow example DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _train_model():
    pass

with DAG(
    "train_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:

    train_model = PythonOperator(
        task_id="train_model",
        python_callable=_train_model
    )
28. airflow example DAG
from datetime import datetime
from random import randint

from airflow import DAG
from airflow.operators.python import PythonOperator

def _evaluate_model():
    return randint(1, 10)

def _choose_best(ti):
    tasks = [
        "evaluate_model_a",
        "evaluate_model_b"
    ]
    accuracies = [ti.xcom_pull(task_ids=task_id) for task_id in tasks]
    best_accuracy = max(accuracies)
    for model, model_accuracy in zip(tasks, accuracies):
        if model_accuracy == best_accuracy:
            return model

with DAG(
    "evaluate_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:

    evaluate_model_a = PythonOperator(
        task_id="evaluate_model_a",
        python_callable=_evaluate_model
    )
    evaluate_model_b = PythonOperator(
        task_id="evaluate_model_b",
        python_callable=_evaluate_model
    )
    choose_best_model = PythonOperator(
        task_id="choose_best_model",
        python_callable=_choose_best
    )

    [evaluate_model_a, evaluate_model_b] >> choose_best_model
34. Airflow to build an LLM Chat Bot
● Open-source and cloud-agnostic: you are not locked in!
● Same orchestration tool for ELT/ETL and ML
● Automate the steps of a model pipeline, using Airflow to:
○ Monitor the status and duration of tasks over time
○ Retry on failures
○ Send notifications (email, slack, others) to the team
● Dynamically trigger tasks using different hyperparameters
● Dynamically select models based on their scores
● Trigger model pipelines based on dataset changes
● Smoothly run tasks in VMs, containers or Kubernetes
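To illustrate the retry and notification bullets above, a minimal sketch of default_args that could be passed to a DAG so every task retries on failure and alerts the team (the owner name and email address are hypothetical placeholders):

```python
from datetime import timedelta

# Sketch: default_args are applied to every task in the DAG.
# Owner and email address are hypothetical placeholders.
default_args = {
    "owner": "ml-team",
    "retries": 3,                         # retry each failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email": ["ml-team@example.com"],
    "email_on_failure": True,             # notify the team when a task fails
}

# Passed to the DAG as: DAG("train_models", default_args=default_args, ...)
```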
35. Use the KubernetesPodOperator
● Create tasks which are run in Kubernetes pods
● Use node_affinity to allocate a job to run on the node pool with the desired memory/CPU/GPU
● Use k8s.V1VolumeMount to efficiently mount volumes (e.g. NFS) to access large models from different pods (evaluate, serve)
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
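A sketch of how these pieces could fit together, assuming the cncf-kubernetes provider is installed; the node pool label, NFS server, image, and paths are assumptions, not values from the talk (this is a config fragment that needs a live Airflow/Kubernetes environment, so it is illustrative only):

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

# Hypothetical node pool label: schedule the pod on the GPU node pool.
gpu_affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=k8s.V1NodeSelector(
            node_selector_terms=[
                k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="cloud.google.com/gke-nodepool",
                            operator="In",
                            values=["gpu-pool"],
                        )
                    ]
                )
            ]
        )
    )
)

# Hypothetical NFS volume holding the large model files.
models_volume = k8s.V1Volume(
    name="models",
    nfs=k8s.V1NFSVolumeSource(server="nfs.internal", path="/models"),
)

evaluate_model = KubernetesPodOperator(
    task_id="evaluate_model",
    name="evaluate-model",
    namespace="ml",
    image="my-registry/evaluate:latest",  # hypothetical image
    affinity=gpu_affinity,
    volumes=[models_volume],
    volume_mounts=[k8s.V1VolumeMount(name="models", mount_path="/models")],
)
```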
36. Use Dataset-aware scheduling
● Schedule DAGs to run when tasks (from other DAGs) complete successfully
from airflow.datasets import Dataset
with DAG("ingest_dataset", ...):
    MyOperator(
        # this task updates source-data.parquet
        outlets=[Dataset("s3://dataset-bucket/source-data.parquet")],
        ...,
    )

with DAG("train_model",
    # this DAG runs when source-data.parquet is updated (by DAG "ingest_dataset")
    schedule=[Dataset("s3://dataset-bucket/source-data.parquet")],
    ...,
):
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
37. Use Dynamic Task Mapping
● Create a variable number of tasks at runtime based on the data created by the previous task
● Can be useful in several situations, including choosing the most adequate model
● Supports map/reduce
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html
38. Dynamic Task Mapping
from __future__ import annotations

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="example_dynamic_task_mapping",
    start_date=datetime(2022, 3, 4)
) as dag:

    @task
    def evaluate_model(model_path):
        (...)
        return evaluation_metrics

    @task
    def choose_model(metrics_by_model):
        (...)
        return chosen_one

    models_metrics = evaluate_model.expand(
        model_path=["/data/model1", "/data/model2", "/data/model3"]
    )
    choose_model(models_metrics)