Dan Sun, Bloomberg
Theofilos Papapanagiotou, AWS
The State and Future of
Cloud Native ML Model Serving
Productionizing AI Models Is Challenging
“Launching AI application pilots is deceptively easy, but
deploying them into production is notoriously challenging.”
[Diagram: an inference request/response flows through pre-processing, the model (input/output), and post-processing. Supporting pieces include a feature store for feature extraction and image/text preprocessing, a model store, and a REST/gRPC load balancer, with scalability, security, reproducibility/portability, and observability as cross-cutting concerns.]
Why Is Kubernetes a Great Platform for Serving Models?
● Microservices: Kubernetes handles container orchestration, facilitating the
deployment of microservices with resource management capabilities
and load balancing.
● Reproducibility/Portability: ML model deployments can be written
once, reproduced, and run with declarative YAML in a cloud-agnostic
way.
● Scalability: Kubernetes provides horizontal scaling features for both
CPU and GPU workloads.
● Fault tolerance: Kubernetes detects and recovers from container failures,
making deployments more resilient to outages and minimizing downtime.
What’s KServe?
● KServe is a highly scalable and standards-based cloud-native
model inference platform on Kubernetes for Trusted AI that
encapsulates the complexity of deploying models to production.
● KServe can be deployed standalone or as an add-on component with
Kubeflow, in cloud or on-premises environments.
KServe History
(Timeline, 2019-2022: NVIDIA contributed the Inference Protocol; IBM contributed ModelMesh.)
The Current State
● Features supported as of the KServe 0.10 release

Core Inference
● Transformer/Predictor
● Serving Runtimes
● Custom Runtime SDK
● Open Inference Protocol
● Serverless Autoscaling
● Cloud/PVC Storage

Advanced Inference
● ModelMesh for Multi-Model Serving
● Inference Graph
● Payload Logging
● Request Batching
● Canary Rollout

Model Explainability & Monitoring
● Text, Image, Tabular Explainer
● Bias Detector
● Adversarial Detector
● Outlier Detector
● Drift Detector
Serving Runtimes

InferenceService (for user)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
        version: 1.0
      storageUri: s3://bucket/sklearn/mnist.joblib
      runtime: kserve-sklearnserver # optional

Serving Runtime (for KServe admin)
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-sklearnserver
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:latest
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
        - --http_port=8080
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
● The ModelSpec specifies the model format and version of the trained model.
● KServe automatically selects a serving runtime that supports the given model
format and instantiates the deployment.
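Serving runtimes such as kserve-sklearnserver are themselves built with the Custom Runtime SDK: a model class loads the artifact and answers predict calls behind the standard protocol. Below is a minimal, illustrative sketch of such a predictor using the KServe Python SDK (the joblib file name and payload shape are assumptions, not the actual sklearnserver implementation):

# Minimal custom predictor sketch using the KServe Python SDK (illustrative only).
# Assumes a scikit-learn model stored by the storage-initializer at /mnt/models/model.joblib.
import joblib
from kserve import Model, ModelServer

class SklearnPredictor(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.load()

    def load(self):
        # The storage-initializer downloads the storageUri content under /mnt/models.
        self.model = joblib.load("/mnt/models/model.joblib")
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        # v1-protocol-style payload: {"instances": [[...], [...]]}
        instances = payload["instances"]
        return {"predictions": self.model.predict(instances).tolist()}

if __name__ == "__main__":
    ModelServer().start([SklearnPredictor("example-sklearn-isvc")])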
Serving Runtime Support Matrix
(Matrix of serving runtimes against supported model formats.)
● Serving runtimes: MLServer (open protocol), Triton (open protocol), TorchServe (v1 and open protocol), KServe Runtime (v1 and open protocol), TFServing (v1).
● Model formats: scikit-learn, xgboost, lightgbm, TensorFlow, PyTorch, TorchScript, ONNX, MLFlow, Custom.
Open Inference Protocol
● The Open Inference Protocol enables a standardized, high-performance inference
data plane.
● Allows interoperability between serving runtimes.
● Allows building clients or benchmarking tools that work with a wide range of
serving runtimes.
● It is implemented by KServe, MLServer, Triton, TorchServe, OpenVINO, and AMD
Inference Server.
https://github.com/kserve/open-inference-protocol
Open Inference Protocol
● REST vs. gRPC: ease of use vs. high performance
REST gRPC
GET v2/health/live rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse)
GET v2/health/ready rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse)
GET v2/models/{model_name}/ready rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse)
GET v2/models/{model_name} rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse)
POST v2/models/{model_name}/infer rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse)
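Because every compliant runtime exposes the same v2 endpoints, a single client works across all of them. A minimal sketch with Python requests (the host and model name are placeholders for a deployed InferenceService):

# Minimal Open Inference Protocol (v2) REST client sketch; host and model name are placeholders.
import requests

HOST = "http://example-sklearn-isvc.default.example.com"
MODEL = "example-sklearn-isvc"

# Health and readiness checks
assert requests.get(f"{HOST}/v2/health/live").status_code == 200
assert requests.get(f"{HOST}/v2/models/{MODEL}/ready").status_code == 200

# Inference: one FP32 tensor of shape [1, 3]
payload = {
    "inputs": [
        {"name": "tensor", "shape": [1, 3], "datatype": "FP32", "data": [[1.0, 2.0, 3.0]]}
    ]
}
response = requests.post(f"{HOST}/v2/models/{MODEL}/infer", json=payload)
print(response.json()["outputs"])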
Open Inference Protocol Codec
● Single input vs. multiple inputs with the numpy/pandas codecs
● Numpy codec: a numeric array (e.g. [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) is encoded as a single tensor input. For example, three numeric features f1=1.0, f2=2.0, f3=3.0 become:
{
  "inputs": [
    {
      "name": "tensor",
      "shape": [1, 3],
      "datatype": "FP32",
      "data": [[1.0, 2.0, 3.0]]
    }
  ]
}
● Pandas codec: a DataFrame with mixed-type columns (f1, f2, f3 with rows (1.0, 2, "foo") and (2.0, 4, "bar")) is encoded as one named input per column, each with its own datatype:
{
  "inputs": [
    {
      "name": "f1",
      "shape": [1, 2],
      "datatype": "FP32",
      "data": [[1.0, 2.0]]
    },
    {
      "name": "f3",
      "shape": [1, 2],
      "datatype": "BYTES",
      "data": [["foo", "bar"]]
    }
  ]
}
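To make the pandas-codec mapping concrete, here is a hand-rolled sketch that builds the same kind of payload from a DataFrame (it mirrors the behaviour shown above rather than calling the MLServer codec API itself; the datatype mapping is a simplified assumption):

# Sketch: encode a pandas DataFrame into an Open Inference Protocol payload,
# one named input per column, mirroring the pandas codec shown above.
import pandas as pd

df = pd.DataFrame({"f1": [1.0, 2.0], "f2": [2, 4], "f3": ["foo", "bar"]})

def datatype_for(series: pd.Series) -> str:
    # Simplified mapping; real codecs handle more dtypes and precisions.
    if series.dtype.kind == "f":
        return "FP64"
    if series.dtype.kind == "i":
        return "INT64"
    return "BYTES"

payload = {
    "inputs": [
        {
            "name": column,
            "shape": [len(df)],  # shape conventions vary between codecs
            "datatype": datatype_for(df[column]),
            "data": df[column].tolist(),
        }
        for column in df.columns
    ]
}
print(payload)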
Inference Service Deployment

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-feast-transformer"
spec:
  transformer:
    containers:
      - image: kserve/driver-transformer:latest
        name: kserve-container
        command:
          - "python"
          - "-m"
          - "driver_transformer"
        args:
          - --entity_ids
          - driver_id
          - --feature_refs
          - driver_hourly_stats:acc_rate
          - driver_hourly_stats:avg_daily_trips
          - driver_hourly_stats:conv_rate
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://pv/driver"
https://kserve.github.io/website/0.10/modelserving/v1beta1/transformer/feast/
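The kserve/driver-transformer image above is a custom transformer that looks up the listed Feast features before calling the predictor. As a rough illustration of how such a transformer can be written with the KServe Python SDK (the Feast lookup is stubbed out; the class and helper names here are hypothetical, not the actual driver_transformer code):

# Sketch of a feature-enriching transformer built with the KServe Python SDK.
# The Feast lookup is stubbed; the real driver_transformer queries a Feast online
# store for the driver_hourly_stats features before the request reaches the predictor.
from kserve import Model, ModelServer

class DriverTransformer(Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.ready = True

    def preprocess(self, payload: dict, headers: dict = None) -> dict:
        # payload: {"instances": [{"driver_id": 1001}, ...]}
        enriched = [self.lookup_features(i["driver_id"]) for i in payload["instances"]]
        return {"instances": enriched}

    def lookup_features(self, driver_id: int) -> list:
        # Placeholder for a Feast online-store query returning
        # [acc_rate, avg_daily_trips, conv_rate] for the given driver.
        return [0.5, 10, 0.7]

if __name__ == "__main__":
    model = DriverTransformer("sklearn-feast-transformer", predictor_host="predictor-host:80")
    ModelServer().start([model])

The model server forwards the preprocessed payload to the configured predictor host, so the transformer only shapes the request; the sklearn predictor then scores the assembled feature vectors.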
Model Storage Patterns
● Runs as an init container (storage-initializer) that downloads the model to /mnt/models before the kserve-container starts.
● Runs as a sidecar (model-puller container next to the modelmesh container) that pulls models on demand in ModelMesh, into a local directory such as:
/models
|-- model1/
|   |- model1.bin
|-- model2/
|   |- model2.bin
|-- model3/
|   |- model3.bin
Multi-Model Serving
● A single model per container usually under-utilizes resources.
● You can share container resources by deploying multiple models together, which is especially useful for GPU sharing.
● When scaling up, the same set of models must be scaled out together.
[Diagram: two InferenceService replicas, each packing model1, model2, and model3 onto a shared GPU.]
ModelMesh
● Models can be scaled independently while sharing container resources.
● ModelMesh is a scale-out layer for model servers that can load and serve multiple
models concurrently, and it uses the same KServe InferenceService API.
● Designed for high-volume, high-density, high-churn production serving, it has powered
IBM production AI services for a number of years and manages thousands of models.
● Kubernetes schedules pods onto nodes; ModelMesh schedules models onto containers.
Inference Graph
● Inference Graph is built for more complex, multi-stage inference pipelines.
● Inference Graph is deployed declaratively and is highly scalable.
● Inference Graph supports Sequence, Switch, Ensemble, and Splitter nodes.
● Inference Graph is highly composable: it is made up of a list of routing nodes, and each node
consists of a set of routing steps, each of which can route to either an InferenceService or another node.

apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "dog-breed-pipeline"
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - serviceName: cat-dog-classifier
          name: cat_dog_classifier # step name
        - serviceName: dog-breed-classifier
          name: dog_breed_classifier
          data: $request
          condition: "[@this].#(predictions.0==\"dog\")"
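Once deployed, the graph exposes a single endpoint: the Sequence root calls cat-dog-classifier first and only forwards the original request to dog-breed-classifier when the condition matches. A hedged client sketch (the graph URL and payload shape are placeholders that depend on the underlying classifiers):

# Sketch: calling the dog-breed-pipeline InferenceGraph endpoint.
# The URL is a placeholder; the payload shape depends on the underlying classifiers.
import base64
import requests

GRAPH_URL = "http://dog-breed-pipeline.default.example.com"

with open("dog.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post(GRAPH_URL, json={"instances": [{"image": {"b64": image_b64}}]})
print(response.json())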
KServe Cloud Native Ecosystem
● KServe offers seamless integration with many CNCF projects to provide
production model deployments with security, observability, and
serverless capabilities.
Serverless Inference
● Autoscaling based on incoming request RPS or concurrency.
● Unified autoscaler for both CPUs and GPUs.
[Diagram: a queue-proxy sidecar in front of the kserve-container, showing 2 concurrent requests against a container concurrency target of 1.]
KnativeCon talk 2022
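One way to see concurrency-based scaling in practice is to drive parallel traffic at an InferenceService and watch replicas scale out. A rough load-generation sketch (the endpoint and payload are placeholders):

# Sketch: send concurrent inference requests to observe concurrency-based autoscaling.
# Endpoint and payload are placeholders for a real InferenceService.
import concurrent.futures
import requests

URL = "http://example-sklearn-isvc.default.example.com/v1/models/example-sklearn-isvc:predict"
PAYLOAD = {"instances": [[1.0, 2.0, 3.0, 4.0]]}

def send(_: int) -> int:
    return requests.post(URL, json=PAYLOAD, timeout=30).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    codes = list(pool.map(send, range(100)))

print({code: codes.count(code) for code in set(codes)})  # status-code histogram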
ServiceMesh: Secure Inference Service
● The Istio control plane mounts client certificates into the sidecar proxies, enabling automatic
authentication between the transformer and predictor and encrypting service traffic.
● Authorization can be implemented with Istio Authorization Policies or Open Policy Agent.
[Diagram: the Istio Ingress Gateway performs the authorization check; traffic then flows over http/grpc to the KServe Transformer (queue-proxy + kserve-container) and over mTLS to the KServe Predictor (queue-proxy + kserve-container).]
SPIFFE and SPIRE: Strong Identity
● SPIFFE is a secure identity framework that can be federated across heterogeneous environments.
● SPIRE distinguishes itself from other identity providers, such as API keys or secrets, by running an
attestation process before issuing credentials.
[Diagram: a SPIRE Server issues JWT-SVIDs through SPIRE Agents running on both VMs and Kubernetes nodes; the agent attests the KServe Predictor workload (queue-proxy + kserve-container).]
KServe Observability
● OpenTelemetry can be used to instrument, collect, and export metrics, logs,
and traces that you can analyze to understand production Inference
Service performance.
[Diagram: trace spans collected across the gateway, queue-proxy, and KServe container.]
Cloud Native Buildpacks
● Cloud Native Buildpacks can turn your custom transformer or predictor into a
container image without a Dockerfile, and the resulting image can run anywhere.
● Helps ensure security and compliance requirements are met.
Looking Forward
KServe 1.0 Roadmap
● Graduate KServe core inference capabilities to stable/GA
● Open Inference Protocol coverage for all supported serving runtimes
● Graduate KServe SDK to 1.0
● Graduate ModelMesh and Inference Graph
● Large Language Model (LLM) Support
● KServe Observability and Security improvements
● Update KServe 1.0 documentation
https://github.com/kserve/kserve/blob/master/ROADMAP.md
Large Language Models
Source: https://huggingface.co/blog/large-language-models
(Timeline chart of large language model releases, with the KServe release marked.)
Distributed Inference for LLM
● Large transformer-based models with
hundreds of billions of parameters are
computationally expensive.
● NVIDIA FasterTransformer implements
an accelerated engine for inference of
transformer-based models spanning many
GPUs and nodes in a distributed manner.
● Hugging Face Accelerate allows
distributed inference with sharded
checkpoints using less memory.
● Tensor parallelism allows running
inference on multiple GPUs.
● Pipeline parallelism allows inference
across multi-GPU and multi-node
environments.
Large Language Model Challenges
● Inference hardware requirements
○ Cost: requires significant computational resources with high-end GPUs and a large
amount of memory
○ Latency: long response times of up to tens of seconds
● Model blob/file sizes in GBs (BLOOM; see the quick check after this list):
○ 176bln params = 360GB
○ 72 splits of 5GB, which need to be mapped onto multiple GPU devices
● Model loading time
○ From network (S3, MinIO) to instance disk
○ From instance disk to CPU RAM
○ From CPU RAM to GPU RAM
● Model serving runtime: FasterTransformer-Triton (32GB)
● Model data can be sensitive and private for inference
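The 360GB figure follows directly from the parameter count and weight precision. A quick back-of-the-envelope check (assuming 16-bit weights):

# Back-of-the-envelope check of the BLOOM model size quoted above.
params = 176e9           # 176 billion parameters
bytes_per_param = 2      # BF16/FP16 weights
total_gb = params * bytes_per_param / 1e9
print(round(total_gb))          # ~352 GB, in line with the ~360GB blob size
print(round(total_gb / 72, 1))  # ~4.9 GB per split, matching 72 splits of ~5GB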
LLM (BLOOM 176bln) Inference Demo
curl https://fastertransformer.default.kubecon.theofpa.people.aws.dev/v1/models/fastertransformer:predict -d '{
"inputs":
[
{
"input":"Kubernetes is the best platform to serve your models because",
"output_len":"40"
}
]
}'
[["Kubernetes is the best platform to serve your models because it is a cloud-based
service that is built on top of the Kubernetes API. It is a great platform for developers
who want to build their own Kubernetes application"]]
https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1/triton/fastertransformer
LLM Download and Loading Time
$ git lfs clone https://huggingface.co/bigscience/bloom-560m
$ ls -lh bloom-560m
total 3.2G
-rw-r--r-- 1 theofpa domain^users 693 Apr 13 11:16 config.json
-rw-r--r-- 1 theofpa domain^users 1.1G Apr 13 11:22 flax_model.msgpack
-rw-r--r-- 1 theofpa domain^users 16K Apr 13 11:16 LICENSE
-rw-r--r-- 1 theofpa domain^users 1.1G Apr 13 11:22 model.safetensors
-rw-r--r-- 1 theofpa domain^users 1.1G Apr 13 11:22 pytorch_model.bin
-rw-r--r-- 1 theofpa domain^users 21K Apr 13 11:16 README.md
-rw-r--r-- 1 theofpa domain^users 85 Apr 13 11:16 special_tokens_map.json
-rw-r--r-- 1 theofpa domain^users 222 Apr 13 11:16 tokenizer_config.json
-rw-r--r-- 1 theofpa domain^users 14M Apr 13 11:16 tokenizer.json
Hugging Face Transformers to FasterTransformer conversion, based on tensor parallelism and target precision:
python3 FasterTransformer/examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
  -o fastertransformer/1 -i ./bloom-560m/
$ ls -lhS fastertransformer/1 | head
total 2.1G
-rw-r--r-- 1 theofpa domain^users 980M Apr 13 11:28 model.wte.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.0.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.0.mlp.dense_h_to_4h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.10.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.10.mlp.dense_h_to_4h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.11.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.11.mlp.dense_h_to_4h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.12.mlp.dense_4h_to_h.weight.0.bin
-rw-r--r-- 1 theofpa domain^users 16M Apr 13 11:28 model.layers.12.mlp.dense_h_to_4h.weight.0.bin
Image source: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/multigpu.html#Sharding
storage-initializer:
2023-04-13 19:28:34.151 1 root INFO [downlo
Copying contents of s3://kubecon-models/mod
2023-04-13 19:29:28.922 1 root INFO [downlo
Successfully copied s3://kubecon-models/mod
/mnt/models

predictor:
I0413 19:29:34.205155 1 libfastertransforme
TRITONBACKEND_ModelInitialize: fastertransf
1)
I0413 19:29:45.448307 1 model_lifecycle.cc:
successfully loaded 'fastertransformer' ver
BloombergGPT - Powered by KServe
● Trained for approximately 53 days on 64 servers, each with 8 A100 GPUs, on
AWS SageMaker.
● HF checkpoints are converted into FasterTransformer format with the
target precision set to BF16, tensor parallelism 2, and pipeline parallelism 1.
● Deployed on KServe on-prem for internal product development with the
NVIDIA Triton FasterTransformer serving runtime, using 2 A100 GPUs (80GB
memory) per replica.
https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/
BloombergGPT Paper: https://arxiv.org/abs/2303.17564
BloombergGPT - KServe Deployment
[Diagram: text input goes to a Preprocessing (Tokenizer) predictor, which sends input tokens over gRPC to the Triton FasterTransformer predictor spanning GPU 1 and GPU 2; the generated tokens are returned and decoded into the text output.]
KServe Community
● KServe Website
● KServe on GitHub
● KServe Community