Talk given at the London AICamp meetup on 13 July 2023. It is an introduction to building open-source ChatGPT-like chatbots and some of the considerations to keep in mind while training/tuning them using Airflow.
9. inspect(ChatGPT)
● Artificial intelligence chatbot
● Developed by OpenAI
● Proprietary machine learning model
○ Uses LLM (Large Language Models)
○ GPT == Generative Pre-Trained Transformer
○ Fine-tuned GPT-3.5 (text-davinci-003)
● User base of over 100 million
● Dataset size: 570 GB; 175 billion parameters
● Estimated cost to run per month: $3 million
https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app
https://indianexpress.com/article/technology/tech-news-technology/chatgpt-interesting-things-to-know-8334991/
https://meetanshi.com/blog/chatgpt-statistics/
10. help(LLM)
A Large Language Model is a type of AI algorithm trained on huge amounts of text data that can understand and generate text
11. help(LLM)
An LLM can be characterized by four parameters:
● Size of the training dataset
● Cost of training
● Size of the model
● Performance after training
15. h2oGPT about
● Open-source (Apache 2.0) generative AI
● Empowers users to create their own language models
● https://gpt.h2o.ai/
● https://github.com/h2oai/h2ogpt
● https://www.youtube.com/watch?v=Coj72EzmX20&t=757s
https://bdtechtalks.com/2023/04/17/open-source-chatgpt-alternatives/
27. airflow example DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _train_model():
    pass

with DAG(
    "train_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:

    train_model = PythonOperator(
        task_id="train_model",
        python_callable=_train_model
    )
28. airflow example DAG
from datetime import datetime
from random import randint

from airflow import DAG
from airflow.operators.python import PythonOperator

def _evaluate_model():
    return randint(1, 10)

def _choose_best(ti):
    tasks = [
        "evaluate_model_a",
        "evaluate_model_b"
    ]
    accuracies = [ti.xcom_pull(task_ids=task_id) for task_id in tasks]
    best_accuracy = max(accuracies)
    for model, model_accuracy in zip(tasks, accuracies):
        if model_accuracy == best_accuracy:
            return model

with DAG(
    "evaluate_models",
    start_date=datetime(2023, 7, 4),
    schedule="@daily") as dag:

    evaluate_model_a = PythonOperator(
        task_id="evaluate_model_a",
        python_callable=_evaluate_model
    )
    evaluate_model_b = PythonOperator(
        task_id="evaluate_model_b",
        python_callable=_evaluate_model
    )
    choose_best_model = PythonOperator(
        task_id="choose_best_model",
        python_callable=_choose_best
    )

    [evaluate_model_a, evaluate_model_b] >> choose_best_model
34. Airflow to build an LLM Chat Bot
● Open-source and cloud-agnostic: you are not locked in!
● Same orchestration tool for ELT/ETL and ML
● Automate the steps of a model pipeline, using Airflow to:
○ Monitor the status and duration of tasks over time
○ Retry on failures
○ Send notifications (email, slack, others) to the team
● Dynamically trigger tasks using different hyperparameters
● Dynamically select models based on their scores
● Trigger model pipelines based on dataset changes
● Smoothly run tasks in VMs, containers or Kubernetes
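To illustrate the retry and notification bullets above, a minimal sketch of default_args that could be passed to a DAG so every task retries on failure and alerts the team (the owner name and email address are hypothetical placeholders):

```python
from datetime import timedelta

# Sketch: default_args are applied to every task in the DAG.
# Owner and email address are hypothetical placeholders.
default_args = {
    "owner": "ml-team",
    "retries": 3,                         # retry each failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    "email": ["ml-team@example.com"],
    "email_on_failure": True,             # notify the team when a task fails
}

# Passed to the DAG as: DAG("train_models", default_args=default_args, ...)
```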
35. Use the KubernetesPodOperator
● Create tasks which are run in Kubernetes pods
● Use node_affinity to allocate a job to run on the node pool with the desired memory/CPU/GPU
● Use k8s.V1VolumeMount to efficiently mount volumes (e.g. NFS) to access large models from different pods (evaluate, serve)
https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
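A sketch of how these pieces could fit together, assuming the cncf-kubernetes provider is installed; the node pool label, NFS server, image, and paths are assumptions, not values from the talk (this is a config fragment that needs a live Airflow/Kubernetes environment, so it is illustrative only):

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

# Hypothetical node pool label: schedule the pod on the GPU node pool.
gpu_affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=k8s.V1NodeSelector(
            node_selector_terms=[
                k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="cloud.google.com/gke-nodepool",
                            operator="In",
                            values=["gpu-pool"],
                        )
                    ]
                )
            ]
        )
    )
)

# Hypothetical NFS volume holding the large model files.
models_volume = k8s.V1Volume(
    name="models",
    nfs=k8s.V1NFSVolumeSource(server="nfs.internal", path="/models"),
)

evaluate_model = KubernetesPodOperator(
    task_id="evaluate_model",
    name="evaluate-model",
    namespace="ml",
    image="my-registry/evaluate:latest",  # hypothetical image
    affinity=gpu_affinity,
    volumes=[models_volume],
    volume_mounts=[k8s.V1VolumeMount(name="models", mount_path="/models")],
)
```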
36. Use Dataset-aware scheduling
● Schedule DAGs to run when tasks (from other DAGs) complete successfully
from airflow.datasets import Dataset
with DAG("ingest_dataset", ...):
    MyOperator(
        # this task updates source-data.parquet
        outlets=[Dataset("s3://dataset-bucket/source-data.parquet")],
        ...,
    )

with DAG("train_model",
    # this DAG runs when source-data.parquet is updated (by DAG "ingest_dataset")
    schedule=[Dataset("s3://dataset-bucket/source-data.parquet")],
    ...,
):
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
37. Use Dynamic Task Mapping
● Create a variable number of tasks at runtime based on the data created by the previous task
● Can be useful in several situations, including choosing the most adequate model
● Supports map/reduce
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/dynamic-task-mapping.html
38. Dynamic Task Mapping
from __future__ import annotations

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="example_dynamic_task_mapping",
    start_date=datetime(2022, 3, 4)
) as dag:

    @task
    def evaluate_model(model_path):
        (...)
        return evaluation_metrics

    @task
    def choose_model(metrics_by_model):
        (...)
        return chosen_one

    models_metrics = evaluate_model.expand(
        model_path=["/data/model1", "/data/model2", "/data/model3"]
    )
    choose_model(models_metrics)