Innovating Inference at Exascale
Remote triggering of Large Language Models on HPC clusters using Globus
Compute
Aditya Tanikanti
atanikanti@anl.gov
Computer Scientist
ALCF
Inference at ALCF
[Image caption] Subset of neurons from human brain tissue, reconstructed on Aurora using the FFN convolutional network and based on electron microscopy data from Harvard; includes a 2-micron scale bar. Larger-volume analysis will continue on Aurora under an INCITE award.
[Image caption] Potential to screen 240 billion molecules in 10 minutes on 10K Aurora nodes for the RTCB cancer protein. High-performance binding affinity prediction with a Transformer-based surrogate model, A. Vasan et al., to appear in HICOMB 2024.
Connectomics and the RTCB cancer protein (captioned above) are two of the many model inference examples that run on ALCF clusters.
GPU workloads for Inference
Inference for LLMs, among other workloads, epitomizes the state of the art in inference capability. Our overarching objective is to democratize access to inference services built on scientific data for all users of ALCF clusters.
Source: https://www.investors.com/news/technology/ai-stocks-market-shifting-to-inferencing-from-training/
LLM Inference
● Unlocking the potential of Large Language Models (LLMs) such as ChatGPT requires tools like vLLM, a library designed specifically for fast LLM inference and straightforward deployment.
● vLLM supports a wide array of generative Transformer models from the HuggingFace Transformers ecosystem; its documentation lists the supported model architectures along with notable models that use each one.
Source: https://docs.vllm.ai/
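As a minimal client-side sketch: vLLM can expose an OpenAI-compatible HTTP server, so a completion request is just a JSON POST. The URL, model name, and prompt below are placeholders, not values from these slides.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build the JSON body for an OpenAI-style /v1/completions request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def post_completion(url: str, body: dict) -> dict:
    """POST the request body to a running vLLM server and parse the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (assumes a vLLM OpenAI-compatible server is already running):
# body = build_completion_request("meta-llama/Llama-2-7b-hf", "Hello, Aurora!")
# print(post_completion("http://localhost:8000/v1/completions", body))
```

Only standard-library modules are used on the client side; the heavy lifting stays on the server where vLLM is installed.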
Single User: Globus Compute for remote inference at ALCF
[Architecture diagram: remote client → Polaris]
● Client App: users interact from notebooks, using Globus tools to execute vLLM remotely.
● Remote authentication: user credentials are authenticated before work runs on Polaris.
● Endpoint: set up on Polaris to run vLLM and Ray to serve LLM models, with access to the model weights.
● GPU node: allocated elastically based on model requirements.
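The single-user flow above can be sketched as submitting a plain Python function to a Globus Compute endpoint on Polaris. The endpoint ID and model path are placeholders; the vLLM import is deferred into the function body so only the remote GPU node, not the client, needs vLLM installed.

```python
def generate_on_endpoint(prompt: str, model_path: str) -> str:
    """Runs on the remote Globus Compute endpoint, not on the client."""
    # vLLM and the model weights are available on the Polaris GPU node.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=64))
    return outputs[0].outputs[0].text


# Client side (requires globus-compute-sdk and an authenticated session):
# from globus_compute_sdk import Executor
# with Executor(endpoint_id="<your-endpoint-uuid>") as ex:
#     future = ex.submit(generate_on_endpoint, "What is exascale?", "/path/to/weights")
#     print(future.result())
```

Deferring the import keeps the client environment lightweight, which is the point of triggering inference remotely rather than installing the serving stack locally.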
Multi User: Globus Compute for remote inference at ALCF
[Architecture diagram: client → Django web portal → Polaris job scheduler]
● Client App: users interact with the Django portal through its UI or API.
● Django portal: provides UI and API access to the running vLLM endpoints.
● Endpoints: multiple pre-registered endpoints running vLLM + Ray with various models, launched via the Polaris job scheduler.
● GPU nodes: allocated elastically based on the model requested.
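A hypothetical sketch of the portal's routing step: the portal keeps a registry of pre-registered Globus Compute endpoints keyed by model name and forwards each request to the endpoint serving the requested model. The registry contents here are made up for illustration.

```python
# Illustrative registry: model name -> Globus Compute endpoint ID.
MODEL_ENDPOINTS = {
    "llama-2-70b": "endpoint-uuid-llama",
    "mistral-7b": "endpoint-uuid-mistral",
}


def resolve_endpoint(model: str, registry: dict = MODEL_ENDPOINTS) -> str:
    """Return the endpoint ID serving `model`, or raise for unknown models."""
    try:
        return registry[model]
    except KeyError:
        raise ValueError(f"No pre-registered endpoint serves model {model!r}")
```

Keeping the registry on the portal side is what lets multiple users share a pool of long-running endpoints instead of each launching their own.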
Summary
At present, our inference service delivers models from Polaris. Ongoing efforts focus on building computational endpoints to support model deployment from Aurora, the AI Testbed, and dedicated inference clusters.
Step-by-step guide to running vLLM on Polaris:
https://github.com/atanikan/vllm_service
Globus Compute notebook to run vLLM remotely:
https://github.com/atanikan/vllm_service/blob/main/inference_using_globus/vLLM_Inference.ipynb
Django web portal to run inference:
https://github.com/argonne-lcf/inference-as-a-service/tree/main
This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.