With so many external APIs (OpenAI, Bard, …) and open-source models (LLAMA, Mistral, …), building a user-facing application must be easy! What could go wrong? What do we have to think about before creating these experiences?
Here is a short glimpse of some of the things you need to think about when building your own application:
Finetuning or using pre-trained models
Token optimizations: every word costs time and money
Building small ML models vs using prompts for all tasks
Prompt Engineering
Prompt versioning
Building an evaluation framework
Engineering challenges for streaming data
Moderation & safety of LLMs
.... and the list goes on.
4. What keeps me up at night?
• LLM Models: FineTune vs External API
• Token Optimizations & Latency
• Building a robust evaluations framework
• Prompt Engineering
• Engineering challenges
• Building small LMs vs using prompts for most ML tasks
• Prompt versioning
• When should we use RAG?
• Moderation and safety guardrails
• A/B testing prompt versions, Agent versions, LLM models: what creates the best consumer experience?
5. LLM Models: To finetune or not?

External API (OpenAI, Claude, Bard, …)
A great place to start building your first consumer-facing applications.
Pros
• Hosted by a third party: reliable uptime
• Wide range of use cases
• Prompts are developed by the community
• Should have good data privacy and safety measures
Cons
• Models are not trained on your specific use case, which could produce lower-quality results.
• Paying an external vendor (example: OpenAI) can be expensive.

Finetuned Open Source Models (LLAMA, Falcon, T5, …)
• Full Finetuning
• PEFT Finetuning
Pros
• Smaller models
• Data is not sent to an external API
• Transparency: you can investigate the code
• Scope for innovation and collaboration
Cons
• Self-hosting can be expensive
• Since the code is open, it's vulnerable to hacking
• Full finetuning: the model can lose its ability to handle general behaviors, resulting in poor performance on tasks it wasn't originally trained for.

Finetuned GPT-3.5
Once you have collected data and gathered expertise in LLMs, it's time to finetune. If your application is built on GPT-3.5, finetuning it improves performance.
Pros
• An application/agent built with GPT-3.5 can have performance similar to GPT-4.
• Less expensive.
• The pipeline for training is available & documented.
• Use prompting & develop on already available resources.
Cons
• Tied to OpenAI.
• Could get more expensive in the future.
• Code is a black box.
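To make the API-vs-self-hosted tradeoff concrete, a back-of-the-envelope cost comparison helps. The sketch below is a minimal illustration; the request volumes, per-token rate, and GPU hourly rate are made-up assumptions, not real quotes:

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_1k_tokens: float) -> float:
    """Estimate monthly spend on a pay-per-token external API (30-day month)."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1000 * price_per_1k_tokens

def monthly_self_host_cost(gpu_hourly_rate: float, gpus: int = 1) -> float:
    """Estimate monthly spend on always-on GPU instances for self-hosting."""
    return gpu_hourly_rate * gpus * 24 * 30

# Hypothetical numbers: 50k requests/day, 2k tokens each, $0.002 per 1k tokens
api = monthly_api_cost(50_000, 2_000, 0.002)
hosted = monthly_self_host_cost(gpu_hourly_rate=2.5, gpus=2)
print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
```

At higher traffic the per-token API bill grows linearly while the self-hosting bill stays roughly flat, which is one reason the finetuned open-source path becomes attractive at scale.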
6. Token Optimization & Latency
Every word costs money and takes time!

Model parameter counts:
• GPT-4: 1.76 T
• GPT-3.5: 175 B
• Claude: 93-137 B
• LLAMA: 7-70 B
Optimization Techniques
• Use smaller LMs for classification, NER & other relevant tasks
• Context summarization
• Stop-word removal
• Make fewer calls to LLMs
• Optimize prompt sizes & combine prompts
• Specify a token limit for content generated by LLMs
• Finetuning: use smaller models with task-specific data to achieve similar performance without prompts
• Queue requests to stay within TPM (tokens-per-minute) limits
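Several of these techniques boil down to enforcing a token budget before calling the model. A minimal sketch, using a naive whitespace split as a stand-in for a real tokenizer such as tiktoken:

```python
def truncate_context(context: str, max_tokens: int) -> str:
    """Trim context to a token budget, keeping the most recent tokens.

    A whitespace split stands in for a real tokenizer here; production
    code would count tokens with the model's own tokenizer.
    """
    tokens = context.split()
    if len(tokens) <= max_tokens:
        return context
    return " ".join(tokens[-max_tokens:])  # keep the tail (most recent turns)

def build_prompt(instructions: str, context: str, budget: int) -> str:
    """Combine a fixed instruction block with as much context as fits."""
    remaining = budget - len(instructions.split())
    return instructions + "\n" + truncate_context(context, remaining)

history = "turn1 turn2 turn3 turn4 turn5 turn6"
print(truncate_context(history, 3))  # -> "turn4 turn5 turn6"
```

The same budget-first discipline applies to combined prompts and to capping generated output: decide the limit up front, enforce it in code, and latency and cost become predictable.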
8. Building a robust Evaluation Framework
Constantly evolves: needs versioning
• Offline evaluation
• Online evaluation
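An offline evaluation loop can start as small as scoring model outputs against a versioned set of expected answers. A minimal sketch; the dataset format, toy model, and exact-match metric are illustrative assumptions:

```python
import json

def exact_match(prediction: str, expected: str) -> bool:
    """Simplest possible metric; real frameworks add fuzzy or LLM-graded scoring."""
    return prediction.strip().lower() == expected.strip().lower()

def evaluate(model_fn, dataset, prompt_version: str) -> dict:
    """Run model_fn over an eval set and tie the score to a prompt version."""
    hits = sum(exact_match(model_fn(ex["input"]), ex["expected"]) for ex in dataset)
    return {
        "prompt_version": prompt_version,  # versioning: scores are meaningless without it
        "accuracy": hits / len(dataset),
        "n": len(dataset),
    }

# Toy model and dataset for illustration
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
toy_model = lambda q: {"2+2": "4", "capital of France": "paris"}.get(q, "")
report = evaluate(toy_model, dataset, prompt_version="v1.2")
print(json.dumps(report))
```

Because the framework constantly evolves, storing each report alongside its prompt version is what makes offline scores comparable over time; online evaluation then layers A/B tests on live traffic on top of this.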
9. Engineering challenges
Streaming output gives a better user experience
• Text is broken into chunks; chunks need to be re-processed to create the output, which increases compute requirements & needs real-time processing.
• Use coroutines when building a FastAPI endpoint to ensure concurrent request handling.
• Use a singleton design to make sure the same function is not instantiated multiple times.
• As systems are built by stacking multiple layers for intelligent decision making, latency can increase with high traffic. This can lead to timeouts. Building a queuing system can help with timeouts and a sub-optimal user experience.
• LLM results are not deterministic: they are ML models!
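The coroutine and singleton points above can be sketched together. This is an illustrative asyncio-only example; the hard-coded chunk generator stands in for a real streaming LLM response:

```python
import asyncio
from functools import lru_cache

class LLMClient:
    """Expensive-to-create client; the singleton guard avoids re-instantiation."""
    def __init__(self):
        self.calls = 0

    async def stream(self, prompt: str):
        """Yield output in chunks, simulating a streaming LLM response."""
        self.calls += 1
        for chunk in ["Hello", ", ", "world", "!"]:
            await asyncio.sleep(0)  # yield control so other requests can run
            yield chunk

@lru_cache(maxsize=1)
def get_client() -> LLMClient:
    """Singleton accessor: every caller shares one client instance."""
    return LLMClient()

async def handle_request(prompt: str) -> str:
    """A coroutine endpoint handler that re-assembles streamed chunks."""
    client = get_client()
    return "".join([chunk async for chunk in client.stream(prompt)])

async def main():
    # Concurrent requests interleave via coroutines and share one client.
    results = await asyncio.gather(handle_request("a"), handle_request("b"))
    print(results, get_client() is get_client())

asyncio.run(main())
```

In a FastAPI service the same pattern applies: `async def` path operations play the role of `handle_request`, and a cached accessor (or a module-level instance) keeps one client alive across requests instead of rebuilding it per call.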
11. Thank You
Taranveer Singh, Snir Orlanczyk, Hardik Nahata, Bonaventure Raj
A huge shout out to my team!
https://www.linkedin.com/in/sanghamitra-deb-ml/