AI Workloads running on Cloud Run with GPUs

TL/DR
• Google has introduced 20 GB Nvidia L4 GPUs for Cloud Run instances, now in public preview
Longer version…
GPU-enabled workloads…
• Apps
• APIs
• Localhost
• Cloud Run
Managed AI Services
Text and Image Generation:
OpenAI ChatGPT, Google Gemini, Microsoft Copilot
Coding:
GitHub Copilot, Cursor, Google Gemini Code Assist, Amazon Q Developer
Multi-modal:
Google NotebookLM… generates study guides, podcasts, briefing docs, FAQs, timelines
Productivity:
Notion AI, Grammarly, Office 365 Copilot
Video & Audio Creation:
Runway ML, Descript, ElevenLabs
ChatGPT
VertexAI Gemini-1.5-flash-002
Experimental Google Models
VertexAI gemini-2.0-pro-exp-02-05
VertexAI imagegen
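For reference alongside the VertexAI demos, here is a minimal sketch of calling Gemini 1.5 Flash through the Vertex AI Python SDK; the project ID and location are placeholders, not values from the talk:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project/region; use your own GCP project and a supported location.
vertexai.init(project="my-project", location="europe-west2")

model = GenerativeModel("gemini-1.5-flash-002")
response = model.generate_content("Summarise what Cloud Run GPUs are good for.")
print(response.text)
```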
Booooo…
• code-gecko@002:
• “Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: code-gecko. Please submit a quota increase request.”
• code-bison:
• error generating content: rpc error: code = FailedPrecondition desc = Project `7282415755` is not allowed to use Publisher Model `projects/play-pen-pup/locations/europe-west2/publishers/google/models/code-bison`
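These two failures map to standard google.api_core exception types, so a demo client can separate the retryable quota error from the hard precondition failure. A sketch, assuming `model` is the GenerativeModel from the earlier snippet; the helper name and backoff policy are illustrative, not from the talk:

```python
import time

from google.api_core.exceptions import FailedPrecondition, ResourceExhausted


def generate_with_backoff(model, prompt, max_attempts=5):
    """Retry per-minute quota errors; surface hard precondition failures."""
    for attempt in range(max_attempts):
        try:
            return model.generate_content(prompt)
        except ResourceExhausted:
            # "Quota exceeded ... per_minute_per_project_per_base_model" is a
            # 429; back off and retry (or request a quota increase).
            time.sleep(2 ** attempt)
        except FailedPrecondition:
            # "Project ... is not allowed to use Publisher Model ..." will not
            # succeed on retry; the model must be enabled for the project.
            raise
    raise RuntimeError("Still rate-limited after retries")
```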
VertexAI Endpoints
Localhost
Ollama
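For the localhost leg, Ollama exposes a simple local REST API (port 11434 by default). A minimal sketch, using the Gemma 2 9B model mentioned in the references and assuming it has already been pulled:

```python
import requests

# Ollama's local REST endpoint; assumes `ollama pull gemma2:9b` has been run.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma2:9b", "prompt": "Why run inference on a GPU?", "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```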
Cloud Run GPUs
• Required quota: total Nvidia L4 GPU allocation
• Fully managed: no extra drivers or libraries needed
• On-demand availability with no reservations needed
• Scale down to zero for cost savings when not in use
• Cold start: expect ~5 seconds
• 1 GPU per Cloud Run instance (includes side-cars)
• Always on… (deploy sketch below)
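As a hedged illustration of those constraints, a minimal deploy sketch using the preview-era `gcloud beta run deploy` GPU flags; the service name, image path and region are placeholders, and exact flags and limits may change as the preview matures:

```python
import subprocess

# Hypothetical service name and image; resource floors per the GPU preview docs.
subprocess.run(
    [
        "gcloud", "beta", "run", "deploy", "gpu-demo",
        "--image", "us-docker.pkg.dev/my-project/demos/gpu-demo:latest",
        "--region", "us-central1",
        "--gpu", "1",                      # 1 GPU per instance (includes side-cars)
        "--gpu-type", "nvidia-l4",         # the 20 GB L4 from the TL/DR
        "--cpu", "4", "--memory", "16Gi",  # minimums required alongside a GPU
        "--no-cpu-throttling",             # GPU needs always-allocated CPU ("Always on…")
        "--max-instances", "1",            # stay inside the total L4 quota
    ],
    check=True,
)
```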
| GPU Model | Memory | Key Features | Suitable Workloads |
|---|---|---|---|
| NVIDIA H200 | 141 GB | High-performance GPU with NVLink, PCIe, and Hopper architecture; ideal for AI training and inference | Large-scale deep learning training, AI inference, and high-performance computing (HPC) |
| NVIDIA H100 | 80 GB | High-performance GPU with NVLink, PCIe, and Hopper architecture; ideal for AI training and inference | Large-scale deep learning training, AI inference, and high-performance computing (HPC) |
| NVIDIA A100 | 40 GB | NVLink connectivity, Tensor Core acceleration, optimized for ML and HPC | Large-scale ML training, data analytics, scientific simulations |
| NVIDIA V100 | 16 GB | NVLink support, Tensor Core acceleration, previous-gen AI acceleration | Deep learning training, HPC workloads |
| NVIDIA T4 | 16 GB | Energy-efficient, optimized for inference, supports GPU GRID | Inference, training, remote visualization, video transcoding |
| NVIDIA P100 | 16 GB | High performance without NVLink, good for compute-intensive tasks | Deep learning training, HPC workloads |
| NVIDIA P4 | 8 GB | Low power consumption, optimized for inference and graphics acceleration | Inference, remote visualization, video transcoding |
| NVIDIA L4 | 20 GB | Power-efficient GPU, PCIe, and advanced AI inference & video acceleration | AI inference, video processing (supports AV1 encoding/decoding), and machine learning |
Cloud Run & GPU Demo Time!
Optimizing Container Storage

| Model Location | Deploy Time | Setup Complexity | Startup Speed | Storage Cost |
|---|---|---|---|---|
| Container Image | Slow (large models take longer to import) | Requires redeployment for changes | Varies by model size; large models may need Cloud Storage | Multiple copies in Artifact Registry |
| Cloud Storage (FUSE mount) | Fast (downloads during startup) | Easy, no Docker changes needed | Fast with network optimizations, but no parallel downloads | One copy in Cloud Storage |
| Cloud Storage (gcloud) | Fast (downloads during startup) | Moderate, requires CLI or API setup | Faster than FUSE due to parallel downloads | One copy in Cloud Storage |
| Internet | Fast (downloads during startup) | Simple, many frameworks support this | Unpredictable, potential reliability risks | Depends on the provider; best to use Cloud Storage |
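As a sketch of the "Cloud Storage (gcloud)" row, model weights can be pulled in parallel at container startup with the google-cloud-storage transfer manager; the bucket name, prefix and target directory are placeholders:

```python
from google.cloud.storage import Client, transfer_manager

# Placeholder bucket/prefix; download model weights in parallel at startup.
bucket = Client().bucket("my-model-weights")
blob_names = [blob.name for blob in bucket.list_blobs(prefix="gemma2-9b/")]

results = transfer_manager.download_many_to_path(
    bucket, blob_names, destination_directory="/models", max_workers=8
)
for name, result in zip(blob_names, results):
    if isinstance(result, Exception):  # failures are returned, not raised
        raise RuntimeError(f"download of {name} failed: {result}")
```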
Thank you
/pmgledhill102/cloudrun-gpu-demos
References and Links
CodeLabs
• Calling from a Cloud Run service into Gemini:
• https://codelabs.developers.google.com/codelabs/deploy-from-github/gen-ai-python#10
• https://codelabs.developers.google.com/codelabs/deploy-from-github/genkit-nodejs?hl=en#10
• Using Stable Diffusion
• https://codelabs.developers.google.com/codelabs/how-to-use-transformers-js-cloud-run-gpu?hl=en#6 (HuggingFace Transformers.js)
• https://codelabs.developers.google.com/codelabs/how-to-use-stable-diffusion-cloud-run-gpu?hl=en#7 (TorchServe)
• Ollama as a sidecar with Cloud Run GPUs and Open WebUI
• https://codelabs.developers.google.com/codelabs/how-to-use-ollama-sidecar-open-webui-frontend-cloud-run-gpu?hl=en#0
• Summarise text into a storage bucket
• https://codelabs.developers.google.com/codelabs/how-to-gemini-text-summarization-cloud-run-functions?hl=en#3
• LLM inference on Cloud Run GPUs
• https://codelabs.developers.google.com/codelabs/how-to-run-inference-cloud-run-gpu-vllm?hl=en#0
• Gemini-powered chat app
• https://codelabs.developers.google.com/codelabs/how-to-deploy-gemini-powered-chat-app-cloud-run?hl=en#0
• Cloud Run with Gemini Function Calling
• https://codelabs.developers.google.com/codelabs/how-to-cloud-run-gemini-function-calling?hl=en#0
• Cloud Run Quiz using Gemini through VertexAI
• https://codelabs.developers.google.com/cloud-genai?hl=en#6
Web Docs
• Ollama model server: (https://cloud.google.com/run/docs/tutorials/gpu-gemma2-with-ollama)
• Gemma 2 (9B)… 5.4GB
• Llama 3.1 (8B)
• Mistral (7B)
• Qwen2 (7B)
• Trade-off: location of model weights
• vLLM model server: (Run LLM inference on Cloud Run GPUs with vLLM)
• Gemma 2
• OpenCV (Computer Vision) with and without GPUs:
• https://github.com/GoogleCloudPlatform/cloudrun-gpus-opencv-cuda-demo/
YouTube Videos
• Building with HuggingFace playlist: https://www.youtube.com/playlist?list=PLIivdWyY5sqIwEOfjCSVl87ND7Rn3m1Fd


Editor's Notes

• Model loading recommendations (for the Optimizing Container Storage table): https://cloud.google.com/run/docs/configuring/services/gpu-best-practices#model-loading-recommendations