Accelerating LLM Inference in the Cloud
Nilesh Agarwal
Co-Founder & CTO, Inferless
Previously Amazon & Trilogy
Helped hundreds of companies optimize their cloud
infrastructure and developed a machine-learning-based ads
automation platform. Published a patent on
containerization of applications and several others on code
analysis.
Infrastructure for LLM Inference
● A request for inference goes to the
load balancer
● It checks for replicas that are
available to serve the request
● If there is no replica, or all replicas are
busy, it starts a new container
● The container pulls the OCI image
and the model weights to boot up the
replica (a minimal sketch of this flow follows below)
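A minimal sketch of this routing and scale-up flow, in Python. The Replica class and the pull/inference helpers below are illustrative stand-ins, not Inferless APIs:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    model_id: str
    busy: bool = False

replicas: list[Replica] = []

def pull_oci_image(model_id: str) -> None:
    print(f"pulling OCI image for {model_id}")   # placeholder for the container runtime pull

def pull_model_weights(model_id: str) -> None:
    print(f"pulling weights for {model_id}")     # placeholder for the weight download (the slow part)

def run_inference(replica: Replica, payload: dict) -> dict:
    return {"model": replica.model_id, "output": "..."}  # placeholder for the actual forward pass

def find_available_replica(model_id: str) -> Replica | None:
    # The load balancer looks for a replica that is up and not busy.
    return next((r for r in replicas if r.model_id == model_id and not r.busy), None)

def start_container(model_id: str) -> Replica:
    # Cold start: pull the OCI image and the model weights, then register the new replica.
    pull_oci_image(model_id)
    pull_model_weights(model_id)
    replica = Replica(model_id=model_id)
    replicas.append(replica)
    return replica

def route_request(model_id: str, payload: dict) -> dict:
    replica = find_available_replica(model_id)
    if replica is None:                          # no replica, or all replicas are busy
        replica = start_container(model_id)
    replica.busy = True
    try:
        return run_inference(replica, payload)
    finally:
        replica.busy = False
```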
Limitations Encountered with Scaling LLMs
● No native LRU caching of model weights, which can degrade performance and
be costly when pulling from cloud storage (typically 200-500 Mbps); a rough
sketch of such a cache follows this list
● High setup time for containers; each model depends on a different set of
packages
● NFS can be expensive and hard to scale as demand for a high-performance
storage solution grows
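A rough sketch of the missing piece from the first bullet: an LRU cache of weight directories on local disk in front of cloud storage. The cache path, size limit, and download helper are assumptions for illustration:

```python
import os
import shutil
from collections import OrderedDict

CACHE_DIR = "/nvme/model-cache"          # hypothetical local cache location
CACHE_LIMIT_BYTES = 500 * 1024**3        # e.g. 500 GB of NVMe reserved for weights

class WeightCache:
    """LRU cache of model-weight directories on local disk, in front of cloud storage."""

    def __init__(self) -> None:
        self._entries: "OrderedDict[str, int]" = OrderedDict()   # model_id -> size in bytes

    def get(self, model_id: str) -> str:
        path = os.path.join(CACHE_DIR, model_id)
        if model_id in self._entries:
            self._entries.move_to_end(model_id)          # cache hit: mark as most recently used
            return path
        self._download_from_cloud(model_id, path)        # cache miss: slow 200-500 Mbps link
        self._entries[model_id] = self._dir_size(path)
        self._evict_if_needed()
        return path

    def _evict_if_needed(self) -> None:
        while sum(self._entries.values()) > CACHE_LIMIT_BYTES and len(self._entries) > 1:
            victim, _ = self._entries.popitem(last=False)        # evict least recently used
            shutil.rmtree(os.path.join(CACHE_DIR, victim), ignore_errors=True)

    @staticmethod
    def _dir_size(path: str) -> int:
        return sum(os.path.getsize(os.path.join(root, f))
                   for root, _, files in os.walk(path) for f in files)

    @staticmethod
    def _download_from_cloud(model_id: str, dest: str) -> None:
        os.makedirs(dest, exist_ok=True)                 # placeholder for an object-store download
```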
How Alluxio is helping Inferless pull model
weights faster
● Most cloud machines have NVMe
disks that are very fast but mostly
unused
● Using FUSE, Alluxio creates a virtual
mount that can serve as a drop-in
replacement for the existing storage path (see the sketch below)
● It creates a three-tier cache from
which it tries to pull the weights
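Because the cache is exposed as a regular POSIX path through the FUSE mount, the model-loading code stays the same; only the directory it reads from changes. The mount path below is illustrative, not a fixed Alluxio location:

```python
import os

# Illustrative path; the actual mount point depends on how the FUSE mount is configured.
WEIGHTS_DIR = os.environ.get("WEIGHTS_DIR", "/mnt/alluxio/models/llama-7b")

def load_weight_files(weights_dir: str = WEIGHTS_DIR) -> dict[str, bytes]:
    """Read weight shards exactly as if they were on local disk.

    The FUSE mount serves reads from the cache tiers when the weights are already
    present and falls back to cloud storage on a miss, so this code is unchanged
    from a local-disk or NFS version.
    """
    blobs = {}
    for name in sorted(os.listdir(weights_dir)):
        if name.endswith((".safetensors", ".bin")):
            with open(os.path.join(weights_dir, name), "rb") as f:
                blobs[name] = f.read()
    return blobs
```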
How Inferless is helping companies deploy ML models
–CONFIDENTIAL–
Latency:
A key challenge in crafting this service is swiftly
loading model weights into GPU memory for
immediate inference. To combat this, we've
introduced multi-tier caching for both model
weights and runtimes, ensuring cold boots
occur in mere seconds.
For Llama-7B:
Cold start: 13 seconds
Inference: 11.17 seconds (A10G)
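A hedged sketch of how the cold-start vs. inference split above can be measured. The load_model and generate helpers are placeholders for whatever serving stack is used; the 13 s / 11.17 s figures are Inferless measurements, not something this snippet reproduces:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn and report wall-clock time, mirroring the cold-start vs. inference split above."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

# Hypothetical serving hooks -- replace with the actual model loader and generator.
def load_model(model_id: str):
    return object()   # placeholder: weight download + load into GPU memory happens here

def generate(model, prompt: str) -> str:
    return "..."      # placeholder: token generation happens here

model = timed("cold-start", load_model, "llama-7b")          # weights -> GPU memory
output = timed("inference", generate, model, "Hello world")  # first request on the warm replica
```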
When to Use Serverless Inference?
● Unpredictable Usage:
a. Ideal for spiky loads
b. Scales linearly with demand
c. A warm pool ensures fast response
● Real-time Latency Critical:
a. When you need results in seconds, not minutes
b. Better scalability than containers/other deployments
● Model Size Constraints:
a. Best for models between 1B and 30B parameters
b. Avoid for very large models due to the high switch-in/out penalty (a small decision sketch follows below)
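A small decision-helper sketch mirroring these criteria; the 1-30B band comes from the slide, the rest is an illustrative rule of thumb rather than a hard rule:

```python
def serverless_is_a_fit(traffic_is_spiky: bool,
                        needs_results_in_seconds: bool,
                        model_params_billions: float) -> bool:
    """Rough heuristic based on the criteria above; not a hard rule."""
    if not (1 <= model_params_billions <= 30):   # very large models pay a high switch-in/out penalty
        return False
    # Spiky traffic or second-level latency needs are where scale-to-zero plus a warm pool pays off.
    return traffic_is_spiky or needs_results_in_seconds

print(serverless_is_a_fit(traffic_is_spiky=True, needs_results_in_seconds=True, model_params_billions=7))  # True
```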
Thank you
Scan the QR code
to reach out
www.inferless.com

