Innovating Inference at Exascale
Remote triggering of Large Language Models on HPC clusters using Globus
Compute
Aditya Tanikanti
atanikanti@anl.gov
Computer Scientist
ALCF
Inference at ALCF
[Image caption] Subset of neurons from human brain tissue, reconstructed on Aurora using the FFN convolutional network and based on electron microscopy data from Harvard; includes a 2-micron scale bar. Larger-volume analysis will continue on Aurora under an INCITE award.
[Image caption] Potential to screen 240 billion molecules in 10 minutes on 10K Aurora nodes for the RTCB cancer protein. High-performance binding affinity prediction with a Transformer-based surrogate model, A. Vasan et al., to appear in HICOMB 2024.
Connectomics and the RTCB cancer protein (captioned above) are two of the many model inference examples that run on ALCF clusters.
GPU workloads for Inference
Inference for LLMs, among other workloads, epitomizes the state of the art in inference capability. Our overarching objective is to democratize access to inference services built on scientific data for all users of ALCF clusters.
Source: https://www.investors.com/news/technology/ai-stocks-market-shifting-to-inferencing-from-training/
LLM Inference
● Unlocking the potential of Large Language Models (LLMs) such as ChatGPT requires tools like vLLM, a library designed specifically for fast LLM inference and straightforward deployment.
● vLLM supports a wide array of generative Transformer models from the HuggingFace Transformers ecosystem; its documentation lists the supported model architectures along with notable models that use each one.
Source: https://docs.vllm.ai/
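As a minimal client-side sketch: vLLM can expose an OpenAI-compatible HTTP server, so a completion request is just a JSON POST. The URL, model name, and prompt below are placeholders, not values from these slides.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build the JSON body for an OpenAI-style /v1/completions request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def post_completion(url: str, body: dict) -> dict:
    """POST the request body to a running vLLM server and parse the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (assumes a vLLM OpenAI-compatible server is already running):
# body = build_completion_request("meta-llama/Llama-2-7b-hf", "Hello, Aurora!")
# print(post_completion("http://localhost:8000/v1/completions", body))
```

Only standard-library modules are used on the client side; the heavy lifting stays on the server where vLLM is installed.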
Single User: Globus Compute for remote inference at ALCF
[Architecture diagram: remote client → Polaris]
● Client App: users interact from notebooks, using Globus tools to execute vLLM remotely.
● Remote authentication: user credentials are authenticated before work runs on Polaris.
● Endpoint: set up on Polaris to run vLLM and Ray to serve LLM models, with access to the model weights.
● GPU node: allocated elastically based on model requirements.
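The single-user flow above can be sketched as submitting a plain Python function to a Globus Compute endpoint on Polaris. The endpoint ID and model path are placeholders; the vLLM import is deferred into the function body so only the remote GPU node, not the client, needs vLLM installed.

```python
def generate_on_endpoint(prompt: str, model_path: str) -> str:
    """Runs on the remote Globus Compute endpoint, not on the client."""
    # vLLM and the model weights are available on the Polaris GPU node.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=64))
    return outputs[0].outputs[0].text


# Client side (requires globus-compute-sdk and an authenticated session):
# from globus_compute_sdk import Executor
# with Executor(endpoint_id="<your-endpoint-uuid>") as ex:
#     future = ex.submit(generate_on_endpoint, "What is exascale?", "/path/to/weights")
#     print(future.result())
```

Deferring the import keeps the client environment lightweight, which is the point of triggering inference remotely rather than installing the serving stack locally.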
Multi User: Globus Compute for remote inference at ALCF
[Architecture diagram: client → Django web portal → Polaris job scheduler]
● Client App: users interact with the Django portal through its UI or API.
● Django portal: provides UI and API access to the running vLLM endpoints.
● Endpoints: multiple pre-registered endpoints running vLLM + Ray with various models, launched via the Polaris job scheduler.
● GPU nodes: allocated elastically based on the model requested.
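A hypothetical sketch of the portal's routing step: the portal keeps a registry of pre-registered Globus Compute endpoints keyed by model name and forwards each request to the endpoint serving the requested model. The registry contents here are made up for illustration.

```python
# Illustrative registry: model name -> Globus Compute endpoint ID.
MODEL_ENDPOINTS = {
    "llama-2-70b": "endpoint-uuid-llama",
    "mistral-7b": "endpoint-uuid-mistral",
}


def resolve_endpoint(model: str, registry: dict = MODEL_ENDPOINTS) -> str:
    """Return the endpoint ID serving `model`, or raise for unknown models."""
    try:
        return registry[model]
    except KeyError:
        raise ValueError(f"No pre-registered endpoint serves model {model!r}")
```

Keeping the registry on the portal side is what lets multiple users share a pool of long-running endpoints instead of each launching their own.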
Summary
At present, our inference service delivers models from Polaris. Ongoing efforts focus on building computational endpoints to support model deployment from Aurora, the AI Testbed, and dedicated inference clusters.
Step-by-step guide to running vLLM on Polaris:
https://github.com/atanikan/vllm_service
Globus Compute notebook to run vLLM remotely:
https://github.com/atanikan/vllm_service/blob/main/inference_using_globus/vLLM_Inference.ipynb
Django web portal to run inference:
https://github.com/argonne-lcf/inference-as-a-service/tree/main
This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.