Prompt Engineering Bulgaria 2024
10 DECEMBER 2024
Small Language
Models got Smaller
to Run on your Phone
What are Phi-3 SLMs Capable of?
• Solution Architect @ Kongsberg Digital
• Microsoft AI & IoT MVP
• External Expert Eurostars-Eureka, Horizon Europe
• External Expert InnoFund Denmark, RIF Cyprus
• Business Interests
• Web Development, SOA, Integration
• IoT, Machine Learning
• Security & Performance Optimization
• Contact
• ivelin.andreev@kongsbergdigital.com
• www.linkedin.com/in/ivelin
• www.slideshare.net/ivoandreev
About
TAKEAWAYS
● Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
● https://arxiv.org/pdf/2404.14219
● Phi-3 Cookbook
● https://github.com/microsoft/Phi-3CookBook
● Activation-aware Weight Quantization for LLM Compression and Acceleration
● https://arxiv.org/abs/2306.00978
● Microsoft Responsible AI Standard
● https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5cmFl
● ONNX Runtime generate() API Samples
● https://github.com/microsoft/onnxruntime-genai/blob/main/README.md
● Deployment
● https://techcommunity.microsoft.com/blog/educatordeveloperblog/deploy-a-phi-3-model-in-azure-ai-and-consume-it-with-c-and-semantic-kernel/4188024
Why SLMs?
● Cost-Effective
○ Cheaper to train and efficient to run; many are free to use
○ Low compute requirements (CPU, GPU)
○ No cloud infrastructure cost adding up over time (on-premises)
● Future of Secure AI
○ All data processing happens locally, ensuring privacy and compliance
○ Easier to fine-tune on specific data
○ Limited training scope reduces exposure to bias
● No Network Dependency
○ Higher reliability
○ No network delays
● Policy Enforcement
○ Cloud vendor could make changes without notice (APIs, behaviour, filters)
SLM Evolution
● LLMs
○ Gold standard for solving creative tasks
○ Slow to train, difficult to fine-tune, expensive
● Phi-3 Family (by Microsoft)
○ Announced on MS Build (April 2024)
○ The most capable open cost-effective SLMs
● Highlights
○ Performance on par with 10x larger models
○ Instruction-tuned – reflects how people normally communicate
○ Available on AZ AI Foundry (a.k.a. Studio), Hugging Face and Ollama
○ Azure AI provides deployment and fine-tuning advantage
Phi-3
Open Source
● Language Model Quality
○ Precision/Recall – factual correctness of the generated information
○ Diversity – variability of response
○ Fluency – grammatical correctness
○ Consistency – with the subject matter
● Quality-Cost
○ Target Customers – individuals and small organisations
○ Quality – model performance on the task it was trained for
○ Cost of Training
Quality-Cost Tradeoff
| Model      | Parameters | Training Cost | Notes           |
| Phi-3-mini | 3.8 B      | $0.5M – $1M   | Estimated       |
| GPT-3      | 175 B      | $4.6M – $12M  |                 |
| GPT-4      | 500 B – 1T | $40M – $60M   | Estimated       |
| Llama-3    | 405 B      | $640M – $800M | Estimated, Open |
● Compact Size, High Performance
○ 3.8B parameters, Production Ready
○ First SLM with 128K context
○ Competitive against GPT-3.5 and Llama-3
● Extensive Training Dataset
○ 3.3T tokens (wide range, filtered web, LLM synthetic)
○ Educationally relevant and logically rigorous data
● Edge Deployment
○ MIT-licensed open source
○ Enhanced privacy, Industrial use
● Multimodal Capabilities
○ Primarily language model, Phi-3.5-Vision (images and text)
Small, Nimble & Capable
Phi-3 vs GPT-3.5 Technical
Parameters
Phi-3-medium has 14b
params, 8% of GPT-3.5 175b
Training Data
Trained on 3.3T tokens, 6.5x
GPT-3.5 500B tokens (est.)
Context
Two context-length options:
4K (default), 128K (max)
Model Size
3’072 hidden dimensions,
32 attention heads (1.8GB)
vs 12’288 and 96 in GPT-3.5
(350GB)
• Phi-3.5-mini
• Phi-3.5-vision
o Multi-frame image understanding and reasoning
o Not optimized for multi-lingual use cases
• Phi-3.5-MoE (Mixture of Experts)
o 16 experts
o Total model size of 42B parameters
o Activates 6.6B parameters at a time, using two experts per token
• Mixture of Experts
o Experts – individual models within larger architecture, expert in certain area
o Gate – trained NN, determines most relevant expert activation
o Sparse Activation – only few experts are activated
o Output Layer – combines Expert output
https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
https://aka.ms/try-phi3.5moe
Phi-3.5 (Aug 2024)
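The gating and sparse activation described above can be sketched in a few lines. This is a toy illustration, not Phi-3.5-MoE's actual implementation; only the 16-expert / top-2 figures follow the slide, and the "experts" here are stand-in functions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(gate_logits, experts, x, top_k=2):
    """Sparse MoE forward pass: only the top_k experts are activated."""
    probs = softmax(gate_logits)                 # gate network output
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    z = sum(probs[i] for i in top)               # renormalize the selected gates
    # Output layer: weighted combination of the active experts only
    return sum(probs[i] / z * experts[i](x) for i in top)

# 16 toy "experts": expert i simply multiplies its input by (i + 1)
experts = [lambda x, i=i: (i + 1) * x for i in range(16)]
gate_logits = [0.0] * 16
gate_logits[3], gate_logits[7] = 5.0, 4.0        # gate prefers experts 3 and 7
y = moe_forward(gate_logits, experts, 1.0)       # only 2 of the 16 experts run
```

This is why a 42B-parameter MoE can activate only 6.6B parameters per token: the gate selects two experts and the remaining fourteen never execute.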
● The first model in this category
○ 3.8B parameters, 128K context, multi-lingual
● Average 5-6% better quality
● Multi-lingual support
○ High-resource languages: Arabic, Chinese, Czech, Danish, Dutch, English,
Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean,
Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish,
Ukrainian
● Vocabulary Size
○ Phi-3-mini – 32K tokens
○ Phi-3-small – 100K tokens
Phi-3.5-mini vs Phi-3-mini
● Models run in the cloud and on the edge
● Runs locally on mobile devices (1.8GB RAM, 12 tokens/sec on iPhone 14)
AZ Serverless Deployment Pricing
Phi-3 Deployment Options
| Model | Context | Input (€ / 1M tokens) | Output (€ / 1M tokens) |
| Phi-3-mini (-4k-instruct / -128k-instruct) | 4K / 128K | €0.13 | €0.50 |
| Phi-3.5-mini-instruct, Phi-3.5-vision-instruct | 128K | €0.13 | €0.50 |
| Phi-3-small (-8k-instruct / -128k-instruct) | 8K / 128K | €0.15 | €0.58 |
| Phi-3-medium (-4k-instruct / -128k-instruct) | 4K / 128K | €0.17 | €0.65 |
| Phi-3.5-MoE-instruct | 128K | €0.16 | €0.62 |
| GPT-4o mini | 128K | €0.16 | €0.62 |
| GPT-4o-0513 | 128K | €4.63 | €13.89 |
| GPT-4o-2024-08-06 [newer, more censored] | 128K | €2.32 | €9.26 |
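For intuition, the serverless rates above translate into monthly cost like this. A back-of-the-envelope sketch only: the rates come from the pricing table, while the example workload is purely hypothetical:

```python
# (input, output) price in EUR per 1M tokens, from the pricing table above
RATES = {
    "Phi-3-mini":  (0.13, 0.50),
    "Phi-3.5-MoE": (0.16, 0.62),
    "GPT-4o mini": (0.16, 0.62),
    "GPT-4o-0513": (4.63, 13.89),
}

def monthly_cost(model, input_tokens, output_tokens):
    """Monthly cost in EUR for a given token volume."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# Hypothetical workload: 50M input + 10M output tokens per month
phi = monthly_cost("Phi-3-mini", 50_000_000, 10_000_000)    # ~11.50 EUR
gpt = monthly_cost("GPT-4o-0513", 50_000_000, 10_000_000)   # ~370.40 EUR
```

At these rates the same workload costs roughly 30x more on GPT-4o-0513 than on Phi-3-mini.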
How is that Possible?
● Quantization
○ Compress the model while maintaining most of its accuracy
○ Convert ANN weight precision (e.g. Float16) to a lower precision (e.g. Int4)
● Quantization Accuracy
○ Degradation of the quantized ANN's KPIs (e.g. accuracy) vs. the baseline
● Model Weights
○ A small fraction of the weights matters disproportionately for performance
○ Higher activation magnitude = more important feature
○ Scale up key weights before quantization
● Activation-Aware Quantization (AWQ)
○ Protects the important weights instead of treating all weights equally
○ Reduces activation errors compared to alternatives
○ Maintains generalization and quantization accuracy
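The "scale up key weights" idea can be shown on a toy weight group. This is a heavy simplification of the AWQ paper linked in the takeaways: real AWQ searches the per-channel scale and folds the inverse scale into the previous layer, and the numbers below are invented for illustration:

```python
def rtn_int4(ws):
    """Round-to-nearest int4 quantization of one weight group.
    The step size is set by the largest magnitude in the group."""
    step = max(abs(w) for w in ws) / 7          # symmetric int4 range [-7, 7]
    return [max(-8, min(7, round(w / step))) * step for w in ws]

# One weight group; channel 0 is "salient": a small weight with large activations
weights = [0.05, 0.9, -0.7, 0.3]
act_mag = [8.0, 0.3, 0.2, 0.25]                 # mean activation magnitude per channel
salient = max(range(len(weights)), key=lambda i: act_mag[i])

# Plain round-to-nearest: the salient weight rounds to zero (100% error there)
plain = rtn_int4(weights)

# AWQ idea: scale the salient weight UP before quantization, divide back after,
# so it lands on a representable grid point while the group step barely changes
s = 2.0
scaled = [w * s if i == salient else w for i, w in enumerate(weights)]
awq = [q / s if i == salient else q for i, q in enumerate(rtn_int4(scaled))]
```

The salient channel is quantized much more faithfully, which is the mechanism behind "quantize while protecting the important weights".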
Performance
1. Requirements (Microsoft Responsible AI Compliance)
○ Accountability (Human in control)
○ Transparency (Explain behaviour and decisions)
○ Fairness & Inclusiveness (the same quality of recommendations for everyone)
○ Reliability & Safety (consistent, safe behaviour)
○ Privacy & Security (transparent collection and storage of data)
2. Training
○ Pre-training – heavily filtered public web and synthetic data
○ Post-training
■ Supervised finetuning (SFT)
■ Direct preference optimization (DPO)
○ Safety and bias mitigation
3. Evaluation
○ Various academic benchmarks to compare
Safety First Model Design
• Massive Multitask Language Understanding Test
o 57 areas (Math, Bio, Physics, …), 100 questions each, 4 levels of complexity
o GPT-4o leads with a score of 88.7; Llama 3.1 scores 88.6
Phi-3 Language Understanding
https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
● Multimodal (language & vision), based on Phi-3.5-mini (4.2B parameters, 128K context)
○ No direct support for non-image files
○ Trained with synthetic data (generated by GPT-4o)
● Use Cases & Performance (vs. GPT-4o): BLINK Test
Phi-3.5 Vision Performance
Phi-3 Family Performance (vs. Llama-3 and GPT-3.5)
GPT-4o would have
stolen the show
• Language models are no longer simply completing sentences
• Phi-3 models deliver high performance
• Phi-3 often outperforms larger models
• Strong reasoning and logic capabilities
• Very strong maths abilities
• Factual knowledge is weaker than in large models
• Good code-generation performance
• HumanEval – 164 versatile programming tasks, 8 samples each
Benchmark Conclusions
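HumanEval results are typically reported as pass@k, and "8 samples each" feeds the standard unbiased estimator. A minimal version of that estimator (from the original HumanEval paper, not defined in this deck; the example numbers are illustrative):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per task, c of them correct.
    Returns the probability that at least one of k random samples passes."""
    if n - c < k:
        return 1.0                       # not enough failures to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 8 samples per task: if 4 of the 8 pass, pass@1 estimates to 0.5
p = pass_at_k(8, 4, 1)
```

Per-task estimates like this are averaged over all 164 tasks to produce the headline benchmark number.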
Limitations
| Limitation | Details | Mitigation |
| 1. Model Size | Smaller model = limited capacity to store factual knowledge | Augmentation with external sources (DB, web search) |
| 2. Factual Inaccuracies | Affects the reliability of the output; undermines trust | Common challenge for small GenAI; typically solved with RAG, since it is impossible to train a model on everything |
| 3. Multilingual | 23 languages – limits usefulness | Understands other languages, but non-high-resource languages perform poorly |
| 4. Safety | Fails on some sensitive inquiries (disinformation) | Safety post-training: automated evaluations across Responsible AI (RAI) harm categories |
| 5. Ethical | Amplifies bias from the training data | Supervised fine-tuning with safe data to steer output in the right direction |
● Open Source
○ Users may train other models on Phi-3 output
○ Risk of contaminating datasets used to train other models
● Potential LLaMA License Issue
○ Phi-3 is trained with synthetic data
○ LLaMA-derived data contaminating Phi-3 could effectively propagate into Phi-3 outputs
■ The LLaMA license prohibits using its outputs to improve non-LLaMA-licensed models
● Synthetic Data
○ Dependency of SLM training on LLM output
○ Could increase bias and negatively affect performance
● Off-topic Moralizing
○ Probably the most censored model by now
○ “You turned this LLM into a schizophrenic moralizing dolt willing to break the flow of stories, and
even interrupt them with absurd lecturing, when they drift out of a fairy-tail perversion of reality
that you've deemed appropriate.”
Criticism
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/20
Usage
● Azure AI Foundry
○ 1800+ models available
○ Phi-family is Microsoft Collection
● Easy to fine tune
○ Requires relevant training data
○ Deploy trained model
● Microsoft Guidelines
○ Fine-tune when you have a specific use case you can clearly articulate
○ Consider few-shot learning first
○ Consider RAG
○ Check whether the base model fails on edge cases or output format
○ https://learn.microsoft.com/en-us/azure/ai-studio/concepts/fine-tuning-overview
Phi-3 in the Cloud
Ollama
● Download and Install Ollama
https://ollama.com/download
● Install Phi-3
Hugging Face
● Install Hugging Face CLI
● Install the generate() API for CPU
● Download Phi3-Vision files
● Download phi3V example by MSFT
https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3v.py
Phi-3 on the Edge
PS > ollama run phi3:mini [2.2GB]
PS > ollama run phi3:medium
PS > ollama run phi3.5 [2.2GB]
> pip install -U "huggingface_hub[cli]"
> pip install onnxruntime-genai
> huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
● The following code needs to be fixed in Phi3v.py
Phi3v.py Issue
generated_text = ""
# Generate one token per iteration until the generator signals completion
while not generator.is_done():
    generator.compute_logits()                 # logits over the model vocabulary
    generator.generate_next_token()            # select the next token
    for token in generator.get_next_tokens():  # fetch the newly generated token(s)
        generated_text += tokenizer_stream.decode(token)  # decode and accumulate
# Print out the generated text
print(generated_text)
● OCR with structured output format
○ Documents are not supported directly
○ Query specific information (e.g. "What is the price of …")
○ Query characteristics (e.g. "What is the colour of …")
DEMO
Thank you!
See you next year at the first event of 2025, "Global Power Platform Bootcamp"