Accelerating LLM Inference in the Cloud
Nilesh Agarwal
Co-Founder & CTO, Inferless
Previously Amazon & Trilogy
Helped hundreds of companies optimize their cloud
infrastructure and developed a machine-learning-based ads
automation platform. Published a patent on
containerization of applications and several others on code
analysis.
Infrastructure for LLM Inference
● A request for inference goes to the
load balancer
● It checks for replicas that are
available to serve the request
● If there is no replica, or all replicas are
busy, it starts a new container
● The container pulls the OCI image
and the model weights to boot up the
replica (a minimal sketch of this flow follows below)
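A minimal sketch of this routing and scale-up flow, in Python. The Replica class and the pull/inference helpers below are illustrative stand-ins, not Inferless APIs:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    model_id: str
    busy: bool = False

replicas: list[Replica] = []

def pull_oci_image(model_id: str) -> None:
    print(f"pulling OCI image for {model_id}")   # placeholder for the container runtime pull

def pull_model_weights(model_id: str) -> None:
    print(f"pulling weights for {model_id}")     # placeholder for the weight download (the slow part)

def run_inference(replica: Replica, payload: dict) -> dict:
    return {"model": replica.model_id, "output": "..."}  # placeholder for the actual forward pass

def find_available_replica(model_id: str) -> Replica | None:
    # The load balancer looks for a replica that is up and not busy.
    return next((r for r in replicas if r.model_id == model_id and not r.busy), None)

def start_container(model_id: str) -> Replica:
    # Cold start: pull the OCI image and the model weights, then register the new replica.
    pull_oci_image(model_id)
    pull_model_weights(model_id)
    replica = Replica(model_id=model_id)
    replicas.append(replica)
    return replica

def route_request(model_id: str, payload: dict) -> dict:
    replica = find_available_replica(model_id)
    if replica is None:                          # no replica, or all replicas are busy
        replica = start_container(model_id)
    replica.busy = True
    try:
        return run_inference(replica, payload)
    finally:
        replica.busy = False
```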
Limitations Encountered with Scaling LLMs
● No native LRU caching of model weights, which can degrade performance and
be costly when pulling from cloud storage (typically 200-500 Mbps); a rough
sketch of such a cache follows this list
● High setup time for containers; each model depends on a different set of
packages
● NFS can be expensive and hard to scale as demand for a high-performance
storage solution grows
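A rough sketch of the missing piece from the first bullet: an LRU cache of weight directories on local disk in front of cloud storage. The cache path, size limit, and download helper are assumptions for illustration:

```python
import os
import shutil
from collections import OrderedDict

CACHE_DIR = "/nvme/model-cache"          # hypothetical local cache location
CACHE_LIMIT_BYTES = 500 * 1024**3        # e.g. 500 GB of NVMe reserved for weights

class WeightCache:
    """LRU cache of model-weight directories on local disk, in front of cloud storage."""

    def __init__(self) -> None:
        self._entries: "OrderedDict[str, int]" = OrderedDict()   # model_id -> size in bytes

    def get(self, model_id: str) -> str:
        path = os.path.join(CACHE_DIR, model_id)
        if model_id in self._entries:
            self._entries.move_to_end(model_id)          # cache hit: mark as most recently used
            return path
        self._download_from_cloud(model_id, path)        # cache miss: slow 200-500 Mbps link
        self._entries[model_id] = self._dir_size(path)
        self._evict_if_needed()
        return path

    def _evict_if_needed(self) -> None:
        while sum(self._entries.values()) > CACHE_LIMIT_BYTES and len(self._entries) > 1:
            victim, _ = self._entries.popitem(last=False)        # evict least recently used
            shutil.rmtree(os.path.join(CACHE_DIR, victim), ignore_errors=True)

    @staticmethod
    def _dir_size(path: str) -> int:
        return sum(os.path.getsize(os.path.join(root, f))
                   for root, _, files in os.walk(path) for f in files)

    @staticmethod
    def _download_from_cloud(model_id: str, dest: str) -> None:
        os.makedirs(dest, exist_ok=True)                 # placeholder for an object-store download
```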
How Alluxio is helping Inferless pull model
weights faster
● Most cloud machines have NVMe
disks that are very fast but mostly
unused
● Using FUSE, Alluxio creates a virtual
mount that can serve as a drop-in
replacement for the existing storage path (see the sketch below)
● It creates a three-tier cache from
which it tries to pull the weights
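Because the cache is exposed as a regular POSIX path through the FUSE mount, the model-loading code stays the same; only the directory it reads from changes. The mount path below is illustrative, not a fixed Alluxio location:

```python
import os

# Illustrative path; the actual mount point depends on how the FUSE mount is configured.
WEIGHTS_DIR = os.environ.get("WEIGHTS_DIR", "/mnt/alluxio/models/llama-7b")

def load_weight_files(weights_dir: str = WEIGHTS_DIR) -> dict[str, bytes]:
    """Read weight shards exactly as if they were on local disk.

    The FUSE mount serves reads from the cache tiers when the weights are already
    present and falls back to cloud storage on a miss, so this code is unchanged
    from a local-disk or NFS version.
    """
    blobs = {}
    for name in sorted(os.listdir(weights_dir)):
        if name.endswith((".safetensors", ".bin")):
            with open(os.path.join(weights_dir, name), "rb") as f:
                blobs[name] = f.read()
    return blobs
```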
How Inferless is helping companies deploy ML models
–CONFIDENTIAL–
Latency:
A key challenge in crafting this service is swiftly
loading model weights into GPU memory for
immediate inference. To combat this, we've
introduced multi-tier caching for both model
weights and runtimes, ensuring cold boots
occur in mere seconds.
For Llama-7B:
Cold start: 13 seconds
Inference: 11.17 seconds (A10G)
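A hedged sketch of how the cold-start vs. inference split above can be measured. The load_model and generate helpers are placeholders for whatever serving stack is used; the 13 s / 11.17 s figures are Inferless measurements, not something this snippet reproduces:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn and report wall-clock time, mirroring the cold-start vs. inference split above."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

# Hypothetical serving hooks -- replace with the actual model loader and generator.
def load_model(model_id: str):
    return object()   # placeholder: weight download + load into GPU memory happens here

def generate(model, prompt: str) -> str:
    return "..."      # placeholder: token generation happens here

model = timed("cold-start", load_model, "llama-7b")          # weights -> GPU memory
output = timed("inference", generate, model, "Hello world")  # first request on the warm replica
```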
When to Use Serverless Inference?
● Unpredictable Usage:
a. Ideal for spiky loads
b. Scales linearly with demand
c. A warm pool ensures fast response
● Real-time Latency Critical:
a. When you need results in seconds, not minutes
b. Better scalability than containers/other deployments
● Model Size Constraints:
a. Best for models between 1B and 30B parameters
b. Avoid for very large models due to the high switch-in/out penalty (a small decision sketch follows below)
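A small decision-helper sketch mirroring these criteria; the 1-30B band comes from the slide, the rest is an illustrative rule of thumb rather than a hard rule:

```python
def serverless_is_a_fit(traffic_is_spiky: bool,
                        needs_results_in_seconds: bool,
                        model_params_billions: float) -> bool:
    """Rough heuristic based on the criteria above; not a hard rule."""
    if not (1 <= model_params_billions <= 30):   # very large models pay a high switch-in/out penalty
        return False
    # Spiky traffic or second-level latency needs are where scale-to-zero plus a warm pool pays off.
    return traffic_is_spiky or needs_results_in_seconds

print(serverless_is_a_fit(traffic_is_spiky=True, needs_results_in_seconds=True, model_params_billions=7))  # True
```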
Thank you
Scan the QR code
to reach out
www.inferless.com

