This talk explains how we deploy computer vision models at Veriff by combining classic software engineering techniques with the latest AWS tooling. Thanks to this, we have cut our model deployment time from 10 days to 1 day and our inference costs by 75%.
5. PREVIOUS SOLUTION
Run ML on Kubernetes
● Each model runs in its own microservice
● GPUs are expensive!
● GPUs are difficult to get!
● Custom solution to share GPUs between services
[Diagram: a K8s cluster running three ML services spread across two GPU nodes]
6. PREVIOUS SOLUTION
● A unique model per service
● Only one model version available
● Pods are independent -> no batch processing
● A new model needs a new service
● CPU steps consume GPU nodes
[Diagram: two ML services in the K8s cluster, each running image fetch, preprocessing, inference, and post-processing on a GPU node]
7. PREVIOUS SOLUTION
Drawbacks
● Development time is high (a service per model)
● Running GPU models is expensive
● Models are difficult to reuse
9. TRITON MODELS
Triton Inference Server
● No-code solution
● Supports the major training frameworks
● Wraps models into APIs
● Multiple backends
● Inference pipelines
● Dynamic batching
● Multi-model (see the client sketch below)
● Multi-version
[Diagram: model weights and a config file form a model repository, which a Triton server exposes as an ML API on a GPU node in the K8s cluster]
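To make the multi-model and multi-version bullets concrete, here is a minimal sketch of calling one model served by Triton over its HTTP API using the tritonclient package. The model name, version, and tensor names (face_detector, input__0, output__0) are hypothetical placeholders, not Veriff's actual models.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton server's HTTP endpoint.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor; names and shapes are hypothetical.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input__0", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)

# The same server hosts many models; model_version selects one of
# the versions stored in the model repository.
result = client.infer(
    model_name="face_detector",
    model_version="2",
    inputs=inputs,
)
print(result.as_numpy("output__0").shape)
```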
10. TRITON MODELS
Good & Bad
● Low migration time (no code)
● Standardization -> reusability
● Reduced GPU costs
● Inference pipelines
● Multiple models & multiple versions
● GPU management still in Kubernetes
● A new repository per model
12. AWS MME
Multi-Model Endpoints
● Fleet of instances under a single endpoint
● Supports different instance types (CPU & GPU)
● Autoscaling policies
● Models are Triton model repositories
● Models loaded on demand (LRU cache); see the invocation sketch below
[Diagram: an MME endpoint whose load balancer routes requests to instances A, B, and C; each instance caches model versions (A v1, A v2, B v1) pulled from S3]
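A minimal sketch of invoking one model behind a SageMaker multi-model endpoint with boto3: TargetModel selects the model artifact inside the endpoint's S3 prefix, and SageMaker loads it on demand. The endpoint and artifact names (cv-models-mme, face_detector_v2.tar.gz) are hypothetical.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder request body; real calls send serialized input tensors.
payload = b"\x00" * 16

# TargetModel picks which artifact under the endpoint's S3 prefix
# to run; SageMaker loads it on demand and keeps it in an LRU cache.
response = runtime.invoke_endpoint(
    EndpointName="cv-models-mme",           # hypothetical endpoint name
    TargetModel="face_detector_v2.tar.gz",  # hypothetical model artifact
    ContentType="application/octet-stream",
    Body=payload,
)
result = response["Body"].read()
```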
14. AWS MME
Good & Bad
● GPUs are managed outside our clusters
● Autoscaling minimizes GPU and operational costs (policy sketch below)
● Models are easy to deploy
● Model artifacts need to be built
● No model metrics available
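As an illustration of the autoscaling point, here is a sketch of registering a target-tracking scaling policy for a SageMaker endpoint variant through the Application Auto Scaling API. The endpoint and variant names, capacities, and thresholds are assumptions, not Veriff's production values.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# The scalable resource is the endpoint variant's instance count.
resource_id = "endpoint/cv-models-mme/variant/AllTraffic"  # hypothetical names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)

# Track invocations per instance; SageMaker adds or removes instances
# to keep the metric near the target value.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # assumed target invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```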
18. MME AT VERIFF
Monorepo
● All MME models are hosted in a monorepo
● Shared tooling
● Inference pipelines
● Model versioning
● Unit tests for models
● CI takes care of everything (packaging sketch below)
● Deployment -> PR
[Diagram: model weights and a config file go through model conversion and unit tests to produce a model repository, which is deployed to staging and then production]
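A rough sketch of what the CI packaging step could look like: laying out a Triton model repository, archiving it, and uploading it to the endpoint's S3 bucket. Paths, bucket, and file names are hypothetical, and the real pipeline adds model conversion and unit tests around this step.

```python
import tarfile
from pathlib import Path

import boto3

# Triton model repository layout (names are hypothetical):
#   face_detector/
#     config.pbtxt      <- model configuration
#     2/model.onnx      <- version 2 weights
repo = Path("face_detector")
(repo / "2").mkdir(parents=True, exist_ok=True)
(repo / "config.pbtxt").write_text(
    'name: "face_detector"\nbackend: "onnxruntime"\n'
)
# In the real pipeline the converted weights are copied into repo / "2" here.

# MME expects one tar.gz per model under the endpoint's S3 prefix.
with tarfile.open("face_detector_v2.tar.gz", "w:gz") as tar:
    tar.add(repo, arcname=repo.name)

boto3.client("s3").upload_file(
    "face_detector_v2.tar.gz",
    "veriff-mme-models",  # hypothetical bucket
    "models/face_detector_v2.tar.gz",
)
```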
19. AWS MME AT VERIFF
Good & Bad
● GPUs are managed outside our clusters
● Autoscaling minimizes GPU and operational costs
● Model pipelines are easy to deploy (via PR)
● Quality control for models
● Model metrics available
● New functionality built into a shared client (sketch below)
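The talk does not show the shared client's code, so the following is only a guess at its shape: a thin Python wrapper that hides the endpoint and serialization details from model consumers, giving one place to add new functionality. Every name here (MMEClient, the endpoint, the JSON payload format) is hypothetical.

```python
import json

import boto3


class MMEClient:
    """Hypothetical shared client wrapping a multi-model endpoint."""

    def __init__(self, endpoint_name: str):
        self._endpoint = endpoint_name
        self._runtime = boto3.client("sagemaker-runtime")

    def predict(self, model: str, version: str, features: dict) -> dict:
        # Centralizing the invocation means batching, retries, or
        # metrics can be added here once for every model consumer.
        response = self._runtime.invoke_endpoint(
            EndpointName=self._endpoint,
            TargetModel=f"{model}_{version}.tar.gz",
            ContentType="application/json",
            Body=json.dumps(features),
        )
        return json.loads(response["Body"].read())


# Usage (names are placeholders):
# client = MMEClient("cv-models-mme")
# result = client.predict("face_detector", "v2", {"image_url": "..."})
```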