Prompt Engineering Bulgaria 2024
10 DECEMBER 2024
Small Language
Models got Smaller
to Run on your Phone
What are Phi-3 SLMs Capable of?
• Solution Architect @ Kongsberg Digital
• Microsoft AI & IoT MVP
• External Expert Eurostars-Eureka, Horizon Europe
• External Expert InnoFund Denmark, RIF Cyprus
• Business Interests
• Web Development, SOA, Integration
• IoT, Machine Learning
• Security & Performance Optimization
• Contact
• ivelin.andreev@kongsbergdigital.com
• www.linkedin.com/in/ivelin
• www.slideshare.net/ivoandreev
About
TAKEAWAYS
● Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
● https://arxiv.org/pdf/2404.14219
● Phi-3 Cookbook
● https://github.com/microsoft/Phi-3CookBook
● Activation-aware Weight Quantization for LLM Compression and Acceleration
● https://arxiv.org/abs/2306.00978
● Microsoft Responsible AI Standard
● https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5cmFl
● ONNX Runtime generate() API Samples
● https://github.com/microsoft/onnxruntime-genai/blob/main/README.md
● Deployment
● https://techcommunity.microsoft.com/blog/educatordeveloperblog/deploy-a-phi-3-model-in-azure-ai-and-consume-it-with-c-and-semantic-kernel/4188024
Why SLMs?
● Cost-Effective
○ Cheaper to train and efficient to run; many are free to use
○ Low compute requirements (CPU, GPU)
○ No cloud infrastructure cost adding up over time (on-premises)
● Future of Secure AI
○ All data processing happens locally, ensuring privacy and compliance
○ Easier to fine-tune on specific data
○ Limited training scope reduces exposure to bias
● No Network Dependency
○ Higher reliability
○ No network delays
● Policy Enforcement
○ Cloud vendor could make changes without notice (APIs, behaviour, filters)
SLM Evolution
● LLMs
○ Gold standard for solving creative tasks
○ Slow to train, difficult to fine-tune, expensive
● Phi-3 Family (by Microsoft)
○ Announced on MS Build (April 2024)
○ The most capable open cost-effective SLMs
● Highlights
○ Performance on par with 10x larger models
○ Instruction-tuned – reflects how people normally communicate
○ Available on AZ AI Foundry (a.k.a. Studio), Hugging Face and Ollama
○ Azure AI provides deployment and fine-tuning advantage
Phi-3
Open Source
● Language Model Quality
○ Precision/Recall – factual correctness of the generated information
○ Diversity – variability of response
○ Fluency – grammatical correctness
○ Consistency – with the subject matter
● Quality-Cost
○ Target Customers – individuals and small organisations
○ Quality – model performance on the task it was trained for
○ Cost of Training
Quality-Cost Tradeoff
| Model      | Parameters | Training Cost | Notes           |
| Phi-3-mini | 3.8 B      | $0.5M – $1M   | Estimated       |
| GPT-3      | 175 B      | $4.6M – $12M  |                 |
| GPT-4      | 500 B – 1T | $40M – $60M   | Estimated       |
| Llama-3    | 405 B      | $640M – $800M | Estimated, Open |
● Compact Size, High Performance
○ 3.8B parameters, Production Ready
○ First SLM with 128K context
○ Competitive against GPT-3.5 and Llama-3
● Extensive Training Dataset
○ 3.3T tokens (wide range, filtered web, LLM synthetic)
○ Educationally relevant and logically rigorous data
● Edge Deployment
○ MIT-licensed open source
○ Enhanced privacy, Industrial use
● Multimodal Capabilities
○ Primarily language model, Phi-3.5-Vision (images and text)
Small, Nimble & Capable
Phi-3 vs GPT-3.5 Technical
Parameters
Phi-3-medium has 14b
params, 8% of GPT-3.5 175b
Training Data
Trained on 3.3T tokens, 6.5x
GPT-3.5 500B tokens (est.)
Context
Two context-length options:
4K (default), 128K (max)
Model Size
3’072 hidden dimensions,
32 attention heads (1.8GB)
vs 12’288 and 96 in GPT-3.5
(350GB)
• Phi-3.5-mini
• Phi-3.5-vision
o Multi-frame image understanding and reasoning
o Not optimized for multi-lingual use cases
• Phi-3.5-MoE (Mixture of Experts)
o 16 experts
o Total model size of 42B parameters
o Activates 6.6B parameters at a time, using two experts per token
• Mixture of Experts
o Experts – individual models within larger architecture, expert in certain area
o Gate – trained NN, determines most relevant expert activation
o Sparse Activation – only few experts are activated
o Output Layer – combines Expert output
https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
https://aka.ms/try-phi3.5moe
Phi-3.5 (Aug 2024)
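The gating and sparse activation described above can be sketched in a few lines. This is a toy illustration, not Phi-3.5-MoE's actual implementation; only the 16-expert / top-2 figures follow the slide, and the "experts" here are stand-in functions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(gate_logits, experts, x, top_k=2):
    """Sparse MoE forward pass: only the top_k experts are activated."""
    probs = softmax(gate_logits)                 # gate network output
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    z = sum(probs[i] for i in top)               # renormalize the selected gates
    # Output layer: weighted combination of the active experts only
    return sum(probs[i] / z * experts[i](x) for i in top)

# 16 toy "experts": expert i simply multiplies its input by (i + 1)
experts = [lambda x, i=i: (i + 1) * x for i in range(16)]
gate_logits = [0.0] * 16
gate_logits[3], gate_logits[7] = 5.0, 4.0        # gate prefers experts 3 and 7
y = moe_forward(gate_logits, experts, 1.0)       # only 2 of the 16 experts run
```

This is why a 42B-parameter MoE can activate only 6.6B parameters per token: the gate selects two experts and the remaining fourteen never execute.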
● The first model in this category
○ 3.8B parameters, 128K context, multi-lingual
● Average 5-6% better quality
● Multi-lingual support
○ High-resource languages: Arabic, Chinese, Czech, Danish, Dutch, English,
Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean,
Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish,
Ukrainian
● Vocabulary Size
○ Phi-3-mini – 32K tokens
○ Phi-3-small – 100K tokens
Phi-3.5-mini vs Phi-3-mini
● Models run in the cloud and on the edge
● Runs locally on mobile devices (1.8GB RAM, 12 tokens/sec on iPhone 14)
AZ Serverless Deployment Pricing
Phi-3 Deployment Options
| Model | Context | Input (€ / 1M tokens) | Output (€ / 1M tokens) |
| Phi-3-mini (-4k-instruct / -128k-instruct) | 4K / 128K | €0.13 | €0.50 |
| Phi-3.5-mini-instruct, Phi-3.5-vision-instruct | 128K | €0.13 | €0.50 |
| Phi-3-small (-8k-instruct / -128k-instruct) | 8K / 128K | €0.15 | €0.58 |
| Phi-3-medium (-4k-instruct / -128k-instruct) | 4K / 128K | €0.17 | €0.65 |
| Phi-3.5-MoE-instruct | 128K | €0.16 | €0.62 |
| GPT-4o mini | 128K | €0.16 | €0.62 |
| GPT-4o-0513 | 128K | €4.63 | €13.89 |
| GPT-4o-2024-08-06 [newer, more censored] | 128K | €2.32 | €9.26 |
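For intuition, the serverless rates above translate into monthly cost like this. A back-of-the-envelope sketch only: the rates come from the pricing table, while the example workload is purely hypothetical:

```python
# (input, output) price in EUR per 1M tokens, from the pricing table above
RATES = {
    "Phi-3-mini":  (0.13, 0.50),
    "Phi-3.5-MoE": (0.16, 0.62),
    "GPT-4o mini": (0.16, 0.62),
    "GPT-4o-0513": (4.63, 13.89),
}

def monthly_cost(model, input_tokens, output_tokens):
    """Monthly cost in EUR for a given token volume."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# Hypothetical workload: 50M input + 10M output tokens per month
phi = monthly_cost("Phi-3-mini", 50_000_000, 10_000_000)    # ~11.50 EUR
gpt = monthly_cost("GPT-4o-0513", 50_000_000, 10_000_000)   # ~370.40 EUR
```

At these rates the same workload costs roughly 30x more on GPT-4o-0513 than on Phi-3-mini.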
How is that Possible?
● Quantization
○ Compress the model while maintaining most of its accuracy
○ Convert ANN weight precision (e.g. Float16) to a lower precision (e.g. Int4)
● Quantization Accuracy
○ Degradation of the quantized ANN's KPIs (e.g. accuracy) vs. the baseline
● Model Weights
○ A small fraction of the weights matters disproportionately for performance
○ Higher activation magnitude = more important feature
○ Scale up key weights before quantization
● Activation-Aware Quantization (AWQ)
○ Protects the important weights instead of treating all weights equally
○ Reduces activation errors compared to alternatives
○ Maintains generalization and quantization accuracy
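The "scale up key weights" idea can be shown on a toy weight group. This is a heavy simplification of the AWQ paper linked in the takeaways: real AWQ searches the per-channel scale and folds the inverse scale into the previous layer, and the numbers below are invented for illustration:

```python
def rtn_int4(ws):
    """Round-to-nearest int4 quantization of one weight group.
    The step size is set by the largest magnitude in the group."""
    step = max(abs(w) for w in ws) / 7          # symmetric int4 range [-7, 7]
    return [max(-8, min(7, round(w / step))) * step for w in ws]

# One weight group; channel 0 is "salient": a small weight with large activations
weights = [0.05, 0.9, -0.7, 0.3]
act_mag = [8.0, 0.3, 0.2, 0.25]                 # mean activation magnitude per channel
salient = max(range(len(weights)), key=lambda i: act_mag[i])

# Plain round-to-nearest: the salient weight rounds to zero (100% error there)
plain = rtn_int4(weights)

# AWQ idea: scale the salient weight UP before quantization, divide back after,
# so it lands on a representable grid point while the group step barely changes
s = 2.0
scaled = [w * s if i == salient else w for i, w in enumerate(weights)]
awq = [q / s if i == salient else q for i, q in enumerate(rtn_int4(scaled))]
```

The salient channel is quantized much more faithfully, which is the mechanism behind "quantize while protecting the important weights".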
Performance
1. Requirements (Microsoft Responsible AI Compliance)
○ Accountability (Human in control)
○ Transparency (Explain behaviour and decisions)
○ Fairness & Inclusiveness (the same quality of recommendations for everyone)
○ Reliability & Safety (consistent, safe behaviour)
○ Privacy & Security (transparent collection and storage of data)
2. Training
○ Pre-training – heavily filtered public web and synthetic data
○ Post-training
■ Supervised finetuning (SFT)
■ Direct preference optimization (DPO)
○ Safety and bias mitigation
3. Evaluation
○ Various academic benchmarks to compare
Safety First Model Design
• Massive Multitask Language Understanding Test
o 57 areas (Math, Bio, Physics, …), 100 questions each, 4 levels of complexity
o GPT-4o leads with a score of 88.7; Llama 3.1 scores 88.6
Phi-3 Language Understanding
https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
● Multimodal (language & vision), based on Phi-3.5-mini (4.2B parameters, 128K context)
○ No direct support for non-image files
○ Trained with synthetic data (generated by GPT-4o)
● Use Cases & Performance (vs. GPT-4o): BLINK Test
Phi-3.5 Vision Performance
Phi-3 Family Performance (vs. Llama-3 and GPT-3.5)
GPT-4o would have
stolen the show
• Language models are no longer simply completing sentences
• Phi-3 models deliver high performance
• Phi-3 often outperforms larger models
• Strong reasoning and logic capabilities
• Very strong maths abilities
• Factual knowledge is weaker than in large models
• Good code-generation performance
• HumanEval – 164 versatile programming tasks, 8 samples each
Benchmark Conclusions
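HumanEval results are typically reported as pass@k, and "8 samples each" feeds the standard unbiased estimator. A minimal version of that estimator (from the original HumanEval paper, not defined in this deck; the example numbers are illustrative):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per task, c of them correct.
    Returns the probability that at least one of k random samples passes."""
    if n - c < k:
        return 1.0                       # not enough failures to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 8 samples per task: if 4 of the 8 pass, pass@1 estimates to 0.5
p = pass_at_k(8, 4, 1)
```

Per-task estimates like this are averaged over all 164 tasks to produce the headline benchmark number.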
Limitations
| Limitation | Details | Mitigation |
| 1. Model Size | Smaller model = limited capacity to store factual knowledge | Augmentation with external sources (DB, web search) |
| 2. Factual Inaccuracies | Affects the reliability of the output; undermines trust | Common challenge for small GenAI; typically solved with RAG, since it is impossible to train a model on everything |
| 3. Multilingual | 23 languages – limits usefulness | Understands other languages, but non-high-resource languages perform poorly |
| 4. Safety | Fails on some sensitive inquiries (disinformation) | Safety post-training: automated evaluations across Responsible AI (RAI) harm categories |
| 5. Ethical | Amplifies bias from the training data | Supervised fine-tuning with safe data to steer output in the right direction |
● Open Source
○ Users may train other models on Phi-3 output
○ Risk of contaminating datasets used to train other models
● Potential LLaMA License Issue
○ Phi-3 is trained with synthetic data
○ LLaMA-derived data contaminating Phi-3 could effectively propagate into Phi-3 outputs
■ The LLaMA license prohibits using its outputs to improve non-LLaMA-licensed models
● Synthetic Data
○ Dependency of SLM training on LLM output
○ Could increase bias and negatively affect performance
● Off-topic Moralizing
○ Probably the most censored model by now
○ “You turned this LLM into a schizophrenic moralizing dolt willing to break the flow of stories, and
even interrupt them with absurd lecturing, when they drift out of a fairy-tail perversion of reality
that you've deemed appropriate.”
Criticism
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/20
Usage
● Azure AI Foundry
○ 1800+ models available
○ Phi-family is Microsoft Collection
● Easy to fine tune
○ Requires relevant training data
○ Deploy trained model
● Microsoft Guidelines
○ Fine-tune when you have a specific use case you can clearly articulate
○ Consider few-shot learning first
○ Consider RAG
○ Check whether the base model fails on edge cases or output format
○ https://learn.microsoft.com/en-us/azure/ai-studio/concepts/fine-tuning-overview
Phi-3 in the Cloud
Ollama
● Download and Install Ollama
https://ollama.com/download
● Install Phi-3
Hugging Face
● Install Hugging Face CLI
● Install the generate() API for CPU
● Download Phi3-Vision files
● Download phi3V example by MSFT
https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3v.py
Phi-3 on the Edge
PS > ollama run phi3:mini [2.2GB]
PS > ollama run phi3:medium
PS > ollama run phi3.5 [2.2GB]
> pip install -U "huggingface_hub[cli]"
> pip install onnxruntime-genai
> huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
● The following code needs to be fixed in Phi3v.py
Phi3v.py Issue
generated_text = ""
# Generate one token per iteration until the generator signals completion
while not generator.is_done():
    generator.compute_logits()                 # logits over the model vocabulary
    generator.generate_next_token()            # select the next token
    for token in generator.get_next_tokens():  # fetch the newly generated token(s)
        generated_text += tokenizer_stream.decode(token)  # decode and accumulate
# Print out the generated text
print(generated_text)
● OCR with structured output format
○ Documents are not supported directly
○ Query specific information (e.g. "What is the price of …")
○ Query characteristics (e.g. "What is the colour of …")
DEMO
Thank you!
See you next year at the first event of 2025, "Global Power Platform Bootcamp"