A Survey on Model Compression for Large Language Models
Paper presentation
Sanjana Kothari
CMPE 297
Introduction
● Advancements like GPT-4 have pushed the boundaries of AI with human-like language processing, but their large size limits their deployment and access.
● Model compression techniques are critical to shrink these LLMs, making them viable for low-resource devices and reducing their environmental footprint.
LLMs can understand and generate human-like text, enabling them to perform a wide range of language tasks.
The survey by Xunyu Zhu et al. sheds light on strategies for model compression of Large Language Models.
Reference: https://arxiv.org/pdf/2308.07633.pdf
Compression techniques
Pruning
● Pruning is a powerful technique to reduce the size or complexity of a model by removing unnecessary or redundant components.
● It also makes the model storage-friendly, memory-efficient, and compute-efficient.
● Two types:
○ Unstructured pruning
○ Structured pruning
Unstructured pruning
● Simplifies an LLM by removing specific parameters without considering its internal structure.
● Targets individual weights or neurons in the LLM, usually by applying a threshold to zero out parameters below it (see the sketch below).
● Drawbacks:
○ Disregards the overall LLM structure, resulting in an irregular sparse model composition that demands specialized compression techniques for efficient storage and computation of the pruned model.
○ Often involves substantial retraining to regain accuracy, which is especially expensive for LLMs.
E.g., SparseGPT
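To make the thresholding idea concrete, here is a minimal NumPy sketch of magnitude-based unstructured pruning. This is the generic baseline only, not SparseGPT's actual algorithm (which prunes layer by layer with an approximate second-order reconstruction step); the function name and the 90% sparsity target are illustrative.

import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    # Pick the threshold so that `sparsity` fraction of entries fall below it,
    # then zero those entries out.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask

W = np.random.randn(512, 512)
W_pruned = magnitude_prune(W, sparsity=0.9)
print(f"zeroed fraction: {np.mean(W_pruned == 0):.2%}")   # ~90%

Note that the result is still a dense array full of zeros; realizing actual storage or speed gains requires the specialized sparse formats and kernels mentioned in the drawbacks above.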
Structured pruning
● Simplifies an LLM by removing entire structural components, such as neurons, channels, or layers.
● Targets whole sets of weights at once, reducing model complexity and memory usage while keeping the overall LLM structure intact (see the sketch below).
E.g., GUM and LLM-Pruner
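A minimal sketch of the idea, assuming a simple norm-based importance score (the actual criteria in GUM and LLM-Pruner are more sophisticated, e.g., gradient-based). In a real network, the matching input dimensions of the following layer must be removed as well.

import numpy as np

def prune_neurons(W: np.ndarray, keep_ratio: float) -> np.ndarray:
    # Score each output neuron (row of W) by its L2 norm and keep the strongest.
    norms = np.linalg.norm(W, axis=1)
    k = int(W.shape[0] * keep_ratio)
    keep = np.sort(np.argsort(norms)[-k:])   # indices of kept neurons, in order
    return W[keep]

W = np.random.randn(1024, 768)
W_small = prune_neurons(W, keep_ratio=0.5)
print(W.shape, "->", W_small.shape)   # (1024, 768) -> (512, 768)

Unlike unstructured pruning, the result is a genuinely smaller dense matrix, so it speeds up inference on standard hardware without sparse-storage tricks.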
Knowledge Distillation
A technique that enhances the performance of a smaller, simpler ‘student’ model by transferring knowledge from a larger, more complex ‘teacher’ model, distilling the teacher’s comprehensive knowledge into a more efficient form.
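The standard formulation (Hinton et al.'s soft-target loss) trains the student to match the teacher's softened output distribution as well as the ground-truth labels; the LLM-specific methods below build on variants of this idea. A minimal PyTorch sketch, where the temperature T and mixing weight alpha are tunable hyperparameters:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between the softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # T^2 keeps gradient magnitudes comparable across temperatures
    # Ordinary cross-entropy against the hard ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard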
White-box and Black-box Knowledge Distillation
● White-box distillation goes beyond traditional KD by using not only the teacher model’s outputs but also its internal parameters and representations. This gives the student model insight into the teacher’s reasoning and decision-making processes.
● E.g., MINILLM, GKD, TF-LLMD
● In black-box distillation, only the predictions made by the teacher LLM are accessible.
● Black-box methods exploit the Emergent Abilities that large LLMs exhibit when tackling intricate tasks.
● Different facets of emergent abilities include In-Context Learning (ICL), Chain-of-Thought (CoT), and Instruction Following (IF).
Types of Black-box Knowledge Distillation
● In-Context Learning (ICL) distillation is a technique where LLMs teach smaller language models to perform new tasks using structured prompts that include task descriptions and examples.
● Chain-of-Thought (CoT) distillation includes intermediate reasoning steps in the prompts, not just input-output examples (see the sketch after this list).
● Instruction Following (IF) distillation aims to upgrade the ability of language models to perform tasks described by instructions without requiring explicit examples. It fine-tunes models on a variety of tasks framed as instructions, enabling them to understand and execute previously unseen directives.
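A hypothetical sketch of how one CoT-distillation training example might be assembled: the teacher's rationale is kept in the training target so the student learns the reasoning steps rather than just the final answer. The function and field names are illustrative, not taken from any specific method in the survey.

def make_cot_example(question: str, teacher_rationale: str, answer: str) -> dict:
    # The prompt asks for step-by-step reasoning; the target contains the
    # teacher's rationale followed by the answer, so both get distilled.
    return {
        "prompt": f"Q: {question}\nA: Let's think step by step.",
        "target": f"{teacher_rationale} Therefore, the answer is {answer}.",
    }

ex = make_cot_example(
    "If a train travels 60 km in 1.5 hours, what is its average speed?",
    "Speed = distance / time = 60 km / 1.5 h = 40 km/h.",
    "40 km/h",
)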
Low-Rank Factorization
● Compresses LLMs by decomposing a weight matrix W into two smaller matrices U and V, with W ≈ UV: U is an m×k matrix and V is a k×n matrix, with k substantially smaller than m and n, so the parameter count drops from mn to k(m+n).
● It greatly reduces the number of parameters and computational overhead.
E.g., LoRA and TensorGPT
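A minimal sketch using truncated SVD, the textbook way to obtain the best rank-k approximation (LoRA applies the same low-rank idea to weight updates during fine-tuning rather than to the weights themselves; the rank k=64 here is arbitrary):

import numpy as np

def low_rank_factorize(W: np.ndarray, k: int):
    # Truncated SVD gives the best rank-k approximation of W in Frobenius norm.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]   # fold singular values into U

W = np.random.randn(1024, 1024)
U_k, V_k = low_rank_factorize(W, k=64)
print(W.size, "->", U_k.size + V_k.size)   # 1048576 -> 131072 parameters (8x fewer)
print("relative error:", np.linalg.norm(W - U_k @ V_k) / np.linalg.norm(W))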
Quantization
● Reduces the storage and computational demands of deep learning models by converting floating-point numbers into integers or other discrete forms (a minimal sketch follows this list).
● Replacing the traditional floating-point representation with a discrete one significantly reduces storage requirements and computational complexity.
● Effective quantization methods can significantly compress models with minimal impact on accuracy.
● Two types of quantization:
○ Quantization-aware training (QAT)
○ Post-training quantization (PTQ)
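A minimal sketch of symmetric round-to-nearest quantization to int8, assuming a single scale for the whole tensor; real LLM quantizers use finer granularity and cleverer scale selection:

import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric uniform quantization: map the largest magnitude to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

W = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(W)
print(f"{W.nbytes} -> {q.nbytes} bytes")   # 262144 -> 65536, a 4x reduction
print(f"mean abs error: {np.abs(W - dequantize(q, scale)).mean():.5f}")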
Quantization-Aware Training (QAT)
● Models are adjusted to low-precision formats during training to better handle the precision loss from quantization while maintaining performance (see the sketch after this list).
● LLM-QAT tackles the challenge of obtaining training data by using outputs from a pre-trained model for data-free distillation, quantizing weights, activations, and key-value (KV) caches to as low as 4 bits, which is crucial for achieving high efficiency in large models such as LLaMA.
● PEQA and QLoRA, both Parameter-Efficient Fine-Tuning (PEFT) methods, aim to compress models and speed up inference while conserving memory without compromising performance.
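The mechanism that makes training through quantization possible is the straight-through estimator: the forward pass simulates rounding, while the backward pass pretends rounding was the identity. A minimal PyTorch sketch of this standard trick (not LLM-QAT's full recipe; the fixed per-tensor scale is a simplification):

import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # Simulate int8 quantization: round to the grid, then dequantize.
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass gradients as if rounding were the identity.
        return grad_output, None   # no gradient for the fixed scale

x = torch.randn(4, 4, requires_grad=True)
scale = (x.abs().max() / 127).detach()   # fixed scale for this sketch
y = FakeQuant.apply(x, scale)
y.sum().backward()   # x.grad is all ones, exactly as if no quantization occurred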
Post-Training Quantization (PTQ)
● PTQ reduces an LLM's storage and computational demands by quantizing its parameters after training. This method is valued for its straightforwardness and its ability to compress models efficiently without altering the architecture or requiring retraining.
● There are two approaches to PTQ:
○ Weight Quantization: quantize only the weights of the LLM to enhance efficiency and reduce computational demands. LUT-GEMM, GPTQ, and AWQ work using this technique (see the sketch after this list).
○ Weight and Activation Quantization: quantize both the weights and activations of the LLM. ZeroQuant and SmoothQuant are some of the more popular methods that take this approach.
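A sketch of the round-to-nearest, per-output-channel baseline for weight-only PTQ. Methods like GPTQ and AWQ substantially improve on this (GPTQ compensates rounding error with approximate second-order information; AWQ protects salient channels using activation statistics), so treat this only as the starting point they refine:

import numpy as np

def quantize_weights_per_channel(W: np.ndarray, bits: int = 4):
    # One scale per output channel (row) limits the error from outlier rows.
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q = np.clip(np.round(W / scales), -qmax, qmax).astype(np.int8)
    return Q, scales

W = np.random.randn(8, 16).astype(np.float32)
Q, scales = quantize_weights_per_channel(W)
W_hat = Q * scales   # dequantized weights used at inference time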
Measuring inference efficiency of LLMs
Number of parameters: the total number of learnable weights or variables that the LLM must optimize during training.
Model size: the disk space required to store the entire LLM.
Compression ratio: the ratio between the size of the original, uncompressed LLM and the size of the compressed LLM.
Inference time: the time taken by the LLM to process input data and generate responses.
Floating point operations (FLOPs): the number of arithmetic operations on floating-point numbers that the LLM performs when processing input data.
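Two of these metrics expressed as hypothetical helper functions (the names and timing protocol are illustrative; careful benchmarking would also control for warm-up and batching):

import time

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    return original_bytes / compressed_bytes   # e.g. 4.0 for fp32 -> int8 weights

def mean_inference_time(model_fn, inputs, n_runs: int = 10) -> float:
    # Average wall-clock seconds per forward pass; model_fn is any callable.
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(inputs)
    return (time.perf_counter() - start) / n_runs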
Future Direction
Performance-Size Tradeoff: enabling the design of more efficient compression techniques within existing hardware limits.
Dynamic LLM Compression: reducing or eliminating the reliance on trial and error to determine the compressed size and structure of LLMs, e.g., by developing techniques like Neural Architecture Search (NAS) that reduce dependence on human-designed architectures.
Explainability: adopting transparent, explainable compression approaches will improve our understanding, ease the evaluation of compressed models, and ultimately lead to more reliable AI systems.
Thank you