1. Triton as NLP Model Inference Back-end
Ko Ko, Microsoft AI MVP
2022/07/30 COSCUP @NTUST
2. About Ko Ko
• Just call me Ko Ko.
• Microsoft AI MVP.
• Speaker at large conferences such as COSCUP, .NET Conf, ModernWeb, and so on.
• https://www.linkedin.com/in/ko-ko-b12a3474/
3. Contents
Overview of Triton
Structure of Triton
Other Features in Triton
Types of Model and Model Repo Structure
Config for Served Model
Config for Ensemble Model
Start Triton Inference Server
Practical Example of NLP Model Deployment
4. What are the problems of AI inference?
AI inference services on the server are getting heavier and heavier.
Concurrency is still a big issue in AI back-ends.
More and more models are being integrated into a single service.
Many AI models are still run straight from Jupyter notebooks.
6. Overview of Triton
1. Born for deployment of AI models.
2. BSD license.
3. Speeds up AI model inference.
4. Supports multiple AI model frameworks (TensorRT, TensorFlow, PyTorch, ONNX, and more).
5. Supports gRPC and HTTP (see the client sketch after this list).
6. Supports CPU, single GPU, and multiple GPUs.
7. Model management: load, unload, and update.
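Since Triton exposes both HTTP and gRPC endpoints, a thin client is enough to call a served model. Below is a minimal sketch using the official tritonclient Python package (pip install tritonclient[http]); the model name my_model and the tensor names INPUT__0 / OUTPUT__0 are hypothetical placeholders for whatever the model's config.pbtxt declares.

import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# One FP32 input tensor; name, shape, and dtype must match the model config.
data = np.array([[1.0, 2.0, 3.0]], dtype=np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Request the named output tensor and run inference.
result = client.infer(
    model_name="my_model",  # hypothetical model name
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
print(result.as_numpy("OUTPUT__0"))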
10. Other Features in Triton
● Model Analyzer (see the command sketch after this list)
○ Performance analysis
○ Memory analysis
● NGC
○ Just like Docker Hub, but for NVIDIA solutions
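As a rough idea of how Model Analyzer is invoked, the command below profiles one model from a local repository. This is a sketch based on the Model Analyzer CLI; flag spellings can differ between versions, and my_model is a placeholder.

$ model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models my_model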
11. Types of model
● Stateless
○ CV-related models
● Stateful
○ Predicts results based on previous results
○ Some NLP models
● Ensemble
○ A pipeline of models (see the config sketch below)
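An ensemble is itself declared with a config.pbtxt. The sketch below wires a hypothetical tokenizer model into a hypothetical bert_classifier; every model and tensor name here is a placeholder, and each input_map key must match the corresponding step model's own input name.

name: "nlp_pipeline"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_TEXT" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
  { name: "SCORES" data_type: TYPE_FP32 dims: [ 2 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"        # hypothetical preprocessing model
      model_version: -1              # -1 means the latest version
      input_map { key: "TEXT" value: "RAW_TEXT" }
      output_map { key: "IDS" value: "token_ids" }
    },
    {
      model_name: "bert_classifier"  # hypothetical classifier model
      model_version: -1
      input_map { key: "INPUT_IDS" value: "token_ids" }
      output_map { key: "LOGITS" value: "SCORES" }
    }
  ]
}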
13. Model Repo Structure
Models must follow a fixed structure of files and folders (see the layout sketch below).
config.pbtxt is not required for TensorRT, TensorFlow SavedModel, and ONNX models, since Triton can auto-complete their configuration.
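For example, a minimal repository holding a single ONNX model (the name my_model is a placeholder) looks like this:

model_repository/
└── my_model/
    ├── config.pbtxt    (optional for ONNX)
    └── 1/              (version folder)
        └── model.onnx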
$ tritonserver --model-repository=<model-repository-path>
If your model repo is in the cloud, such as Azure Blob Storage:
$ tritonserver --model-repository=as://account_name/container_name/path/to/model/repository
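For completeness, a minimal config.pbtxt for the hypothetical ONNX model above could look like the sketch below; the tensor names, shapes, and types are assumptions and must match the actual model.

name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "INPUT__0" data_type: TYPE_FP32 dims: [ 3 ] }
]
output [
  { name: "OUTPUT__0" data_type: TYPE_FP32 dims: [ 2 ] }
]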
24. Triton on Azure Machine Learning
https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-deploy-with-triton?tabs=endpoint
YAML file:

name: densenet-onnx-model
version: 1
path: ./models
type: triton_model
description: Registering my Triton format model.
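Assuming the YAML above is saved as model.yml, registering it with the Azure ML CLI v2 would look roughly like this (the resource group and workspace names are placeholders):

$ az ml model create --file model.yml \
    --resource-group my-rg --workspace-name my-ws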
25. Recap
Overview of Triton
Structure of Triton
Other Features in Triton
Types of Model and Model Repo Structure
Config for Served Model
Config for Ensemble Model
Start Triton Inference Server
Practical Example of NLP Model Deployment