Running LLM in Kubernetes
Volodymyr Tsap
CTO @ SHALB
What is an LLM?
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language
generation and understanding.
LLMs acquire these abilities by learning statistical relationships from text documents during a computationally
intensive self-supervised and semi-supervised training process.
LLMs are artificial neural networks, the largest and most capable of which are built with a transformer-based
architecture.
Wikipedia
What are Transformers?
● Transformers are deep learning models that have revolutionized how natural
language processing tasks are approached.
● Transformers use a unique architecture that relies on self-attention mechanisms to weigh
the significance of different words in a sentence. This lets the model capture the context of
each word more effectively than previous models, leading to better understanding and
generation of text.
Building LLM. Data Collection and Preparation.
● Collect a large and diverse dataset from various sources such as books, websites, and other
texts.
● Clean and preprocess the data to remove irrelevant content, normalize text (e.g., lowercasing,
removing special characters), and ensure data quality.
Building LLM. Tokenization and Vocabulary Building.
● Tokenize the text data into smaller units (tokens) such as words, subwords, or characters. This
step may involve choosing a specific tokenization algorithm (e.g., BPE, WordPiece).
● Create a vocabulary of unique tokens and possibly generate embeddings for them. This could
involve pre-training embeddings or using embeddings from an existing model.
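To make the BPE step concrete, here is a toy sketch in plain Python: start from characters, repeatedly merge the most frequent adjacent pair. It is an illustration of the idea only, not a production tokenizer.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and apply a few BPE merges.
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After a few merges, frequent character runs like "low" become single vocabulary tokens, while rare suffixes stay split, which is exactly the trade-off BPE makes.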
Building LLM. Model Architecture Design.
● Choose a transformer architecture (e.g., GPT, BERT) that suits the goals of your LLM. This
involves deciding on the number of layers, attention heads, and other hyperparameters.
● Implement or adapt an existing transformer model framework using deep learning libraries such
as TensorFlow or PyTorch.
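For intuition on how those hyperparameters translate into model size, here is a rough back-of-the-envelope parameter count for a GPT-style decoder (an approximation that ignores biases, layer norms, and positional embeddings):

```python
def transformer_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style decoder.

    Per layer: attention takes ~4 * d_model^2 (Q, K, V, output projections)
    and the MLP ~8 * d_model^2 (the usual 4x expansion), i.e. ~12 * d_model^2.
    Token embeddings add vocab_size * d_model.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# A GPT-2-small-like configuration: 12 layers, d_model=768, ~50k vocabulary.
print(transformer_params(12, 768, 50257))  # ~124M parameters
```

The estimate lands close to GPT-2 small's published ~124M parameters, which is a useful sanity check when sizing hardware for training or inference.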
Building LLM. Training.
● Split the data into training, validation, and test sets.
● Pre-train the model on the collected data, iteratively updating its weights over multiple
epochs. This step is computationally intensive and can take from hours to weeks depending on
the model size and hardware capabilities.
● Use techniques such as gradient clipping, learning rate scheduling, and regularization to
improve training efficiency and model performance.
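Two of those techniques can be sketched in plain Python. These are simplified stand-ins for what frameworks provide out of the box (e.g., PyTorch's `torch.nn.utils.clip_grad_norm_` and its LR schedulers):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients down so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

def lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup followed by cosine decay, a common LLM schedule."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

Clipping keeps a single bad batch from blowing up the weights; the warmup phase avoids large updates while the optimizer statistics are still noisy.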
Building LLM. Fine-Tuning (Optional).
● Fine-tune the pre-trained model on a smaller, task-specific dataset if the LLM will be used for
specific applications (e.g., question answering, sentiment analysis).
● Adjust hyperparameters and training settings to optimize performance for the target task.
Building LLM. Evaluation and Testing.
● Evaluate the model on a test set to measure its performance using appropriate metrics (e.g.,
accuracy, F1 score, perplexity).
● Perform error analysis and adjust the training process as necessary to improve model quality.
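Of those metrics, perplexity is the one specific to language models, and it is simple to compute: the exponential of the average negative log-likelihood per token.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns every token probability 0.25 has perplexity 4:
# it is "as confused as" a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 10))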
Building LLM. Saving and Deployment.
● Save the trained model weights and configuration to files.
● Deploy the model for inference, which can involve setting up a serving infrastructure capable of
handling requests in real-time or batch processing.
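A minimal sketch of the save step using only the standard library. The config-file-plus-weight-file layout mirrors what Hugging Face's `save_pretrained` produces; the file names and the pickle format here are illustrative, not its actual serialization.

```python
import json
import pickle

# Toy stand-ins for trained artifacts (illustrative values).
config = {"n_layers": 12, "d_model": 768, "vocab_size": 50257}
weights = {"wte": [0.1, -0.2], "ln_f.bias": [0.0]}

# Save the configuration as human-readable JSON...
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# ...and the weights in a binary format.
with open("weights.pkl", "wb") as f:
    pickle.dump(weights, f)
```

Keeping the config separate from the weights means a serving framework can read the architecture first and allocate tensors before streaming the (much larger) weight file.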
TL;DR: Watch Andrej Karpathy's explanation.
Hugging Face - the GitHub for LLMs
LLM Files
How to run? Using Google Colab with a T4 GPU
How to run? Using a laptop and llama.cpp. Quantization.
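Quantization is what makes laptop inference possible: weights are stored in fewer bits. llama.cpp's GGUF quantization schemes are more sophisticated, but the basic absmax int8 idea looks like this sketch:

```python
def quantize_int8(weights):
    """Absmax quantization: map floats to int8 [-127, 127] plus one scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]

w = [0.5, -1.27, 0.02]
q, s = quantize_int8(w)   # q = [50, -127, 2], 1 byte per weight instead of 4
```

Each weight now takes 1 byte instead of 4, at the cost of a small rounding error bounded by the scale, which is why a 7B model can fit in a few GB of RAM.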
Using Managed Cloud Services.
● Amazon SageMaker
● Google Cloud AI Platform & Vertex AI
● Microsoft Azure Machine Learning
● NVIDIA AI Enterprise
● Hugging Face Endpoints
● AnyScale Endpoints
Why run them in Kubernetes?
1. We already know it :)
2. Scalability: resource efficiency, HPA, auto-scaling, API limits, etc.
3. Price: managed services add a 20-40% overhead; reserved instances cut costs.
4. GPU sharing.
5. ML ecosystem: pipelines, artifacts (Kubeflow, Ray Framework).
6. No vendor lock-in. Portable.
LLM Serving Frameworks
Options to run LLM on K8s.
1. KServe from Kubeflow.
2. Ray Serve from Ray Framework.
3. Flux AI controller.
4. Your own Kubernetes wrapper on top of these frameworks.
We chose TGI (Hugging Face Text Generation Inference) and made it Kubernetes-ready.
We have Docker. Let's adapt it to Kubernetes.
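The adaptation can be sketched as a plain Kubernetes Deployment wrapping the TGI container. This is a hedged, minimal example: the image tag, model id, and volume setup are illustrative and would be parameterized in a real template, and scheduling on GPU nodes assumes the NVIDIA device plugin is installed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2"]
          ports:
            - containerPort: 80
          resources:
            limits:
              nvidia.com/gpu: 1        # schedule onto a GPU node
          volumeMounts:
            - name: model-cache
              mountPath: /data         # cache downloaded weights
      volumes:
        - name: model-cache
          emptyDir: {}
```

A persistent volume instead of `emptyDir` would avoid re-downloading tens of GB of weights on every pod restart, which is one of the details the template below takes care of.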
Demo Time!
Let’s bootstrap the infrastructure from a cluster.dev template
Then add model configuration
Apply and check that the model is running
Changing models and infrastructure
Enabling HF chat-ui
Deploy Monitoring and Metrics with DCGM Exporter
Thank you! Now, questions?

"Running Open-Source LLM models on Kubernetes", Volodymyr Tsap

  • 1. Running LLM in Kubernetes Volodymyr Tsap CTO @ SHALB
  • 2. What is LLM? A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and understanding. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks, the largest and most capable of which are built with a transformer-based architecture. Wikipedia
  • 3. What is Transformers? ! Transformers are a type of deep learning model that have revolutionized the way natural language processing tasks are approached. ! Transformers utilize a unique architecture that relies on self-attention mechanisms to weigh the significance of different words in a sentence. This allows the model to capture the context of each word more effectively than previous models, leading to better understanding and generation of text.
  • 4. Building LLM. Data Collection and Preparation. ! Collect a large and diverse dataset from various sources such as books, websites, and other texts. ! Clean and preprocess the data to remove irrelevant content, normalize text (e.g., lowercasing, removing special characters), and ensure data quality.
  • 5. Building LLM. Tokenization and Vocabulary Building. ! Tokenize the text data into smaller units (tokens) such as words, subwords, or characters. This step may involve choosing a specific tokenization algorithm (e.g., BPE, WordPiece). ! Create a vocabulary of unique tokens and possibly generate embeddings for them. This could involve pre-training embeddings or using embeddings from an existing model.
  • 6. Building LLM. Model Architecture Design. ! Choose a transformer architecture (e.g., GPT, BERT) that suits the goals of your LLM. This involves deciding on the number of layers, attention heads, and other hyperparameters. ! Implement or adapt an existing transformer model framework using deep learning libraries such as TensorFlow or PyTorch.
  • 7. Building LLM. Model Architecture Design.
  • 8. Building LLM. Training. ! Split the data into training, validation, and test sets. ! Pre-train the model on the collected data, which involves running it through the computation of weights over multiple epochs. This step is computationally intensive and can take from hours to weeks depending on the model size and hardware capabilities. ! Use techniques such as gradient clipping, learning rate scheduling, and regularization to improve training efficiency and model performance.
  • 9. Building LLM. Fine-Tuning (Optional). ! Fine-tune the pre-trained model on a smaller, task-specific dataset if the LLM will be used for specific applications (e.g., question answering, sentiment analysis). ! Adjust hyperparameters and training settings to optimize performance for the target task.
  • 10. Building LLM. Evaluation and Testing. ! Evaluate the model on a test set to measure its performance using appropriate metrics (e.g., accuracy, F1 score, perplexity). ! Perform error analysis and adjust the training process as necessary to improve model quality.
  • 11. Building LLM. Saving and Deployment. ! Save the trained model weights and configuration to files. ! Deploy the model for inference, which can involve setting up a serving infrastructure capable of handling requests in real-time or batch processing.
  • 12. TLDR. Watch Andrej Karpathy Explanation.
  • 13. Hugging Face - GitHub for LLM’s
  • 16. How to run? Using Google Colab with T4 gpu
  • 17. How to run? Using laptop and llama.cpp. Quantization.
  • 18. Using Managed Cloud Services. ! Amazon SageMaker ! Google Cloud AI Platform & Vertex AI ! Microsoft Azure Machine Learning ! NVIDIA AI Enterprise ! Hugging Face Endpoints ! AnyScale Endpoints
  • 19. Why to run them in Kubernetes? 1. We already know him :) 2. Scalability. Resource efficiency, HPA, auto-scaling, API Limits, etc.. 3. Price. Managed service 20-40% overhead. Reserved instances. 4. GPU sharing. 5. ML ecosystem - pipelines, artifacts. (KubeFlow, Ray Framework). 6. No vendor lock. Transportable.
  • 21. Options to run LLM on K8s. 1. KServe from Kubeflow. 2. Ray Serve from Ray Framework. 3. Flux AI controller. 4. Own Kubernetes wrapper on top of Frameworks.
  • 22. We choose TGI and made it Kubernetes ready.
  • 23. We have Docker, Lets adapt it to Kubernetes
  • 25. Let’s bootstrap infrastructure from cluster.dev template
  • 26. Then add model configuration
  • 27. Apply and check the model is running
  • 28. Changing models and infrastructure
  • 30. Deploy Monitoring and Metrics with DCGM Exporter
  • 31. Thank you! Now Questions?