SlideShare a Scribd company logo
1 of 23
Linformer
•Self-Attention Mechanism
: 𝑂(𝑛2
)
•Linformer Self-Attention Mechanism
: 𝑂(𝑛)
Introduction
In this paper, we seek to answer the question
can Transformer models be optimized to avoid this quadratic operation, or is
this operation required to maintain strong performance?
𝑛
𝑛
Sparse Transformers
• 𝑂 𝑛 𝑛 , 2% accuracy drop with 20% speed up.
Reformer - LSH Attention
• LSH(locally-sensitive hashing)
-> Multi-round(increases the number of sequential operations)
• efficiency gains only appear on sequences with length > 2048
𝑂(𝑛𝑙𝑜𝑔 𝑛 )
Comparison of Complexity
Related Work
• Mixed Precision
• Knowledge Distillation
• Sparse Attention
• LSH Attention
• Improving Optimizer Efficiency
Background
• MuiltiHead Attention
𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑 𝑄, 𝐾, 𝑉 = 𝐶𝑜𝑛𝑐𝑎𝑡 ℎ𝑒𝑎𝑑1, ℎ𝑒𝑎𝑑2, ℎ𝑒𝑎𝑑3 … ℎ𝑒𝑎𝑑ℎ 𝑊 𝑂
ℎ𝑒𝑎𝑑𝑖 = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛 𝑄𝑊𝑖
𝑄
, 𝐾𝑊𝑖
𝐾
, 𝑉𝑊𝑖
𝑉
= 𝑠𝑜𝑓𝑡𝑚𝑎𝑥[
𝑄𝑊𝑖
𝑄
𝐾𝑊𝑖
𝐾 𝑇
𝑑 𝑘
]𝑉𝑊𝑖
𝑇
𝑊 𝑄, 𝑊 𝐾, 𝑊 𝑉 ∈ ℝ 𝑑×𝑑
𝑊 𝑂
∈ ℝℎ𝑑×𝑑
𝑃 ∈ ℝ 𝑛×𝑛
Self Attention is Low Rank
• context mapping matrix 𝑃 is low-rank
• Model : RoBERTa-base, RoBERTa-large
• Task : IMDB(Classification), Wiki103(masked language modeling)
Eigenvalue and Eigenvector
𝐴𝑥 = 𝜆𝑥
• 𝐴 = 𝑠𝑞𝑢𝑎𝑟𝑒 𝑚𝑎𝑡𝑟𝑖𝑥
• 𝑥 = 𝑒𝑖𝑔𝑒𝑛𝑣𝑒𝑐𝑡𝑜𝑟 (𝑖𝑓 𝑛𝑜𝑡 0)
• 𝜆 = 𝑒𝑖𝑔𝑒𝑛𝑣𝑎𝑙𝑢𝑒
행렬 곱의 결과가 원래 벡터와 Direction이 같고 Scale만 변했다는 의미
https://rfriend.tistory.com/181
SVD(Singular Value Decomposition)
singular value
Rotation RotationScale
Sigular Value : 모든 행렬 가능, 음이 아닌 scalar
Eigen Value : 정사각행렬만 가능, 모든 scalar
• eigenvalue : 512 diagonal matrix 이겠고
• eigenvalue를 구해보니 128 ~ 부터 dominan한 값이 saturation 되더라
• 그래서 low rank가 가능하더라
𝑃 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥[
𝑄𝑊𝑖
𝑄
𝐾𝑊𝑖
𝐾 𝑇
𝑑 𝑘
]
𝑃 ∈ ℝ 𝑛×𝑛
eigenvalue
= 512 diagonal matrix
Theorem 1
𝐷𝐴 ∶ 𝑛 × 𝑛 diagonal matrix
𝑃 ∶ exp 𝐴 ⋅ 𝐷𝐴
−1
𝑅 𝑇
𝑅
𝑅 ∈ ℝ 𝑘×𝑛
𝑤 ∈ ℝ 𝑛
𝑘 = 5 log 𝑛 / (𝜖2 − 𝜖3)
• 𝑃와 𝑃가 근사하다면 rank를 줄일 수 있다.
SVD
• Theorem 1을 결과를 기반으로 k개를 뽑아서 low rank로 만든다는 것은 이렇게
표현할 수 있지만 SVD를 위한 추가적 연산이 필요하다.
• 학습 된 파라미터는 줄일 수 있다.
-> 그래서 다른 접근법을 사용한다.
Linformer
Transformer vs Linformer
• 𝐸𝑖, 𝐹𝑖 ∈ ℝ 𝑛×𝑘
: linear projection matrices
• 𝑃 ∈ ℝ 𝑛×𝑛
• 𝑃 ∈ ℝ 𝑛×𝑑
Transformer vs Linformer
𝑛
Efficiency Techniques
• Parameter Sharing between projections
• Headwise sharing: for each layer, we share two projection matrices 𝐸 and 𝐹 such
that 𝐸𝑖 = 𝐸 and 𝐹𝑖 = 𝐹 across all heads 𝑖.
• Key-value sharing: we do headwise sharing, with the additional constraint of
sharing the key and value projections. For each layer, we create a single projection
matrix 𝐸 such that 𝐸𝑖 = 𝐹𝑖 = 𝐸 for each key-value projection matrix across all
head 𝑖.
• Layerwise sharing: we use a single projection matrix 𝐸 across all layers, for all
heads, and for both key and value.
Efficiency Techniques
• Nonuniform projected dimension
• One can choose a different projected
dimension 𝑘 for different heads and
layers.
• This implies one can choose a smaller
projected dimension 𝑘 for higher
layers.
The right figure plots the heatmap of normalized cumulative eigenvalue
at the 128-th largest eigenvalue across different layers and heads in
Wiki103 data.
Efficiency Techniques
• General projections
• One can also choose different kinds of low-dimensional projection methods instead
of a simple linear projection.
• For example, one can choose mean/max pooling, or convolution where the kernel
and stride is set to 𝑛/𝑘.
• The convolutional functions contain parameters that require training.
Experiments - Training Curve
Experiments - Results of Accuracy
Experiments - Results of Inference time efficiency

More Related Content

Similar to 2021 01-02-linformer

4200 Kte7.0 Training V1.0
4200 Kte7.0 Training V1.04200 Kte7.0 Training V1.0
4200 Kte7.0 Training V1.0wayneliao
 
微软客户端技术纵览
微软客户端技术纵览微软客户端技术纵览
微软客户端技术纵览ntoskrnl
 
Ruby on Rails 2.1 What's New Chinese Version
Ruby on Rails 2.1 What's New Chinese VersionRuby on Rails 2.1 What's New Chinese Version
Ruby on Rails 2.1 What's New Chinese VersionLibin Pan
 
Chinaonrails Rubyonrails21 Zh
Chinaonrails Rubyonrails21 ZhChinaonrails Rubyonrails21 Zh
Chinaonrails Rubyonrails21 ZhJesse Cai
 
Windows 7兼容性系列课程(5):Windows 7徽标认证
Windows 7兼容性系列课程(5):Windows 7徽标认证Windows 7兼容性系列课程(5):Windows 7徽标认证
Windows 7兼容性系列课程(5):Windows 7徽标认证Chui-Wen Chiu
 
RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)
RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)
RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)Kei Nakagawa
 
query optimization
query optimizationquery optimization
query optimizationDimara Hakim
 
20090323 Phpstudy
20090323 Phpstudy20090323 Phpstudy
20090323 PhpstudyYusuke Ando
 
12.2008 Trendbird Monthly Trend Report Sample
12.2008 Trendbird  Monthly Trend Report Sample12.2008 Trendbird  Monthly Trend Report Sample
12.2008 Trendbird Monthly Trend Report Samplewebtel125
 
設計思考研究演講 07.11
設計思考研究演講 07.11設計思考研究演講 07.11
設計思考研究演講 07.11NTUST
 
Windows 7兼容性系列课程(3):有针对的兼容性开发(上)
Windows 7兼容性系列课程(3):有针对的兼容性开发(上)Windows 7兼容性系列课程(3):有针对的兼容性开发(上)
Windows 7兼容性系列课程(3):有针对的兼容性开发(上)Chui-Wen Chiu
 
創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟
創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟
創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟taiwanweb20
 
Five Minutes Introduction For Rails
Five Minutes Introduction For RailsFive Minutes Introduction For Rails
Five Minutes Introduction For RailsKoichi ITO
 
QM-081-品質控制統計製程管制
QM-081-品質控制統計製程管制QM-081-品質控制統計製程管制
QM-081-品質控制統計製程管制handbook
 
Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)
Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)
Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)Chui-Wen Chiu
 
Zurich2007 MySQL Query Optimization
Zurich2007 MySQL Query OptimizationZurich2007 MySQL Query Optimization
Zurich2007 MySQL Query OptimizationHiệp Lê Tuấn
 
Zurich2007 MySQL Query Optimization
Zurich2007 MySQL Query OptimizationZurich2007 MySQL Query Optimization
Zurich2007 MySQL Query OptimizationHiệp Lê Tuấn
 
QM-078-企業導入六標準差之個案探討
QM-078-企業導入六標準差之個案探討QM-078-企業導入六標準差之個案探討
QM-078-企業導入六標準差之個案探討handbook
 
Qiskit advocate demo qsvm
Qiskit advocate demo qsvmQiskit advocate demo qsvm
Qiskit advocate demo qsvmYuma Nakamura
 

Similar to 2021 01-02-linformer (20)

4200 Kte7.0 Training V1.0
4200 Kte7.0 Training V1.04200 Kte7.0 Training V1.0
4200 Kte7.0 Training V1.0
 
微软客户端技术纵览
微软客户端技术纵览微软客户端技术纵览
微软客户端技术纵览
 
Iir 08 ver.1.0
Iir 08 ver.1.0Iir 08 ver.1.0
Iir 08 ver.1.0
 
Ruby on Rails 2.1 What's New Chinese Version
Ruby on Rails 2.1 What's New Chinese VersionRuby on Rails 2.1 What's New Chinese Version
Ruby on Rails 2.1 What's New Chinese Version
 
Chinaonrails Rubyonrails21 Zh
Chinaonrails Rubyonrails21 ZhChinaonrails Rubyonrails21 Zh
Chinaonrails Rubyonrails21 Zh
 
Windows 7兼容性系列课程(5):Windows 7徽标认证
Windows 7兼容性系列课程(5):Windows 7徽标认证Windows 7兼容性系列课程(5):Windows 7徽标认证
Windows 7兼容性系列课程(5):Windows 7徽标认证
 
RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)
RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)
RM-CVaR: Regularized Multiple β-CVaR Portfolio(IJCAI Presentation)
 
query optimization
query optimizationquery optimization
query optimization
 
20090323 Phpstudy
20090323 Phpstudy20090323 Phpstudy
20090323 Phpstudy
 
12.2008 Trendbird Monthly Trend Report Sample
12.2008 Trendbird  Monthly Trend Report Sample12.2008 Trendbird  Monthly Trend Report Sample
12.2008 Trendbird Monthly Trend Report Sample
 
設計思考研究演講 07.11
設計思考研究演講 07.11設計思考研究演講 07.11
設計思考研究演講 07.11
 
Windows 7兼容性系列课程(3):有针对的兼容性开发(上)
Windows 7兼容性系列课程(3):有针对的兼容性开发(上)Windows 7兼容性系列课程(3):有针对的兼容性开发(上)
Windows 7兼容性系列课程(3):有针对的兼容性开发(上)
 
創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟
創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟
創業家研習營-7分鐘創意簡報技巧,Mr.6劉威麟
 
Five Minutes Introduction For Rails
Five Minutes Introduction For RailsFive Minutes Introduction For Rails
Five Minutes Introduction For Rails
 
QM-081-品質控制統計製程管制
QM-081-品質控制統計製程管制QM-081-品質控制統計製程管制
QM-081-品質控制統計製程管制
 
Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)
Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)
Windows 7兼容性系列课程(2):Windows 7用户权限控制 (UAC)
 
Zurich2007 MySQL Query Optimization
Zurich2007 MySQL Query OptimizationZurich2007 MySQL Query Optimization
Zurich2007 MySQL Query Optimization
 
Zurich2007 MySQL Query Optimization
Zurich2007 MySQL Query OptimizationZurich2007 MySQL Query Optimization
Zurich2007 MySQL Query Optimization
 
QM-078-企業導入六標準差之個案探討
QM-078-企業導入六標準差之個案探討QM-078-企業導入六標準差之個案探討
QM-078-企業導入六標準差之個案探討
 
Qiskit advocate demo qsvm
Qiskit advocate demo qsvmQiskit advocate demo qsvm
Qiskit advocate demo qsvm
 

More from JAEMINJEONG5

Jaemin_230701_Simple_Copy_paste.pptx
Jaemin_230701_Simple_Copy_paste.pptxJaemin_230701_Simple_Copy_paste.pptx
Jaemin_230701_Simple_Copy_paste.pptxJAEMINJEONG5
 
2022-01-17-Rethinking_Bisenet.pptx
2022-01-17-Rethinking_Bisenet.pptx2022-01-17-Rethinking_Bisenet.pptx
2022-01-17-Rethinking_Bisenet.pptxJAEMINJEONG5
 
2021 04-04-google nmt
2021 04-04-google nmt2021 04-04-google nmt
2021 04-04-google nmtJAEMINJEONG5
 
2021 03-02-distributed representations-of_words_and_phrases
2021 03-02-distributed representations-of_words_and_phrases2021 03-02-distributed representations-of_words_and_phrases
2021 03-02-distributed representations-of_words_and_phrasesJAEMINJEONG5
 
2021 03-02-transformer interpretability
2021 03-02-transformer interpretability2021 03-02-transformer interpretability
2021 03-02-transformer interpretabilityJAEMINJEONG5
 
2021 01-04-learning filter-basis
2021 01-04-learning filter-basis2021 01-04-learning filter-basis
2021 01-04-learning filter-basisJAEMINJEONG5
 
2020 11 4_bag_of_tricks
2020 11 4_bag_of_tricks2020 11 4_bag_of_tricks
2020 11 4_bag_of_tricksJAEMINJEONG5
 
2020 11 1_sleep_net
2020 11 1_sleep_net2020 11 1_sleep_net
2020 11 1_sleep_netJAEMINJEONG5
 
2020 11 3_face_detection
2020 11 3_face_detection2020 11 3_face_detection
2020 11 3_face_detectionJAEMINJEONG5
 
white blood cell classification
white blood cell classificationwhite blood cell classification
white blood cell classificationJAEMINJEONG5
 

More from JAEMINJEONG5 (19)

Jaemin_230701_Simple_Copy_paste.pptx
Jaemin_230701_Simple_Copy_paste.pptxJaemin_230701_Simple_Copy_paste.pptx
Jaemin_230701_Simple_Copy_paste.pptx
 
2022-01-17-Rethinking_Bisenet.pptx
2022-01-17-Rethinking_Bisenet.pptx2022-01-17-Rethinking_Bisenet.pptx
2022-01-17-Rethinking_Bisenet.pptx
 
Swin transformer
Swin transformerSwin transformer
Swin transformer
 
2021 06-02-tabnet
2021 06-02-tabnet2021 06-02-tabnet
2021 06-02-tabnet
 
2021 05-04-u2-net
2021 05-04-u2-net2021 05-04-u2-net
2021 05-04-u2-net
 
2021 04-04-google nmt
2021 04-04-google nmt2021 04-04-google nmt
2021 04-04-google nmt
 
2021 04-03-sean
2021 04-03-sean2021 04-03-sean
2021 04-03-sean
 
2021 04-01-dalle
2021 04-01-dalle2021 04-01-dalle
2021 04-01-dalle
 
2021 03-02-spade
2021 03-02-spade2021 03-02-spade
2021 03-02-spade
 
2021 03-02-distributed representations-of_words_and_phrases
2021 03-02-distributed representations-of_words_and_phrases2021 03-02-distributed representations-of_words_and_phrases
2021 03-02-distributed representations-of_words_and_phrases
 
2021 03-02-transformer interpretability
2021 03-02-transformer interpretability2021 03-02-transformer interpretability
2021 03-02-transformer interpretability
 
2021 01-04-learning filter-basis
2021 01-04-learning filter-basis2021 01-04-learning filter-basis
2021 01-04-learning filter-basis
 
2020 12-03-vit
2020 12-03-vit2020 12-03-vit
2020 12-03-vit
 
2020 12-2-detr
2020 12-2-detr2020 12-2-detr
2020 12-2-detr
 
2020 11 4_bag_of_tricks
2020 11 4_bag_of_tricks2020 11 4_bag_of_tricks
2020 11 4_bag_of_tricks
 
2020 11 1_sleep_net
2020 11 1_sleep_net2020 11 1_sleep_net
2020 11 1_sleep_net
 
2020 12-1-adam w
2020 12-1-adam w2020 12-1-adam w
2020 12-1-adam w
 
2020 11 3_face_detection
2020 11 3_face_detection2020 11 3_face_detection
2020 11 3_face_detection
 
white blood cell classification
white blood cell classificationwhite blood cell classification
white blood cell classification
 

Recently uploaded

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture designssuser87fa0c1
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage examplePragyanshuParadkar1
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 

Recently uploaded (20)

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
pipeline in computer architecture design
pipeline in computer architecture  designpipeline in computer architecture  design
pipeline in computer architecture design
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage example
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 

2021 01-02-linformer

  • 1.
  • 3. Introduction In this paper, we seek to answer the question can Transformer models be optimized to avoid this quadratic operation, or is this operation required to maintain strong performance? 𝑛 𝑛
  • 4. Sparse Transformers • 𝑂 𝑛 𝑛 , 2% accuracy drop with 20% speed up.
  • 5. Reformer - LSH Attention • LSH(locally-sensitive hashing) -> Multi-round(increases the number of sequential operations) • efficiency gains only appear on sequences with length > 2048 𝑂(𝑛𝑙𝑜𝑔 𝑛 )
  • 7. Related Work • Mixed Precision • Knowledge Distillation • Sparse Attention • LSH Attention • Improving Optimizer Efficiency
  • 8. Background • MuiltiHead Attention 𝑀𝑢𝑙𝑡𝑖𝐻𝑒𝑎𝑑 𝑄, 𝐾, 𝑉 = 𝐶𝑜𝑛𝑐𝑎𝑡 ℎ𝑒𝑎𝑑1, ℎ𝑒𝑎𝑑2, ℎ𝑒𝑎𝑑3 … ℎ𝑒𝑎𝑑ℎ 𝑊 𝑂 ℎ𝑒𝑎𝑑𝑖 = 𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛 𝑄𝑊𝑖 𝑄 , 𝐾𝑊𝑖 𝐾 , 𝑉𝑊𝑖 𝑉 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥[ 𝑄𝑊𝑖 𝑄 𝐾𝑊𝑖 𝐾 𝑇 𝑑 𝑘 ]𝑉𝑊𝑖 𝑇 𝑊 𝑄, 𝑊 𝐾, 𝑊 𝑉 ∈ ℝ 𝑑×𝑑 𝑊 𝑂 ∈ ℝℎ𝑑×𝑑 𝑃 ∈ ℝ 𝑛×𝑛
  • 9. Self Attention is Low Rank • context mapping matrix 𝑃 is low-rank • Model : RoBERTa-base, RoBERTa-large • Task : IMDB(Classification), Wiki103(masked language modeling)
  • 10. Eigenvalue and Eigenvector 𝐴𝑥 = 𝜆𝑥 • 𝐴 = 𝑠𝑞𝑢𝑎𝑟𝑒 𝑚𝑎𝑡𝑟𝑖𝑥 • 𝑥 = 𝑒𝑖𝑔𝑒𝑛𝑣𝑒𝑐𝑡𝑜𝑟 (𝑖𝑓 𝑛𝑜𝑡 0) • 𝜆 = 𝑒𝑖𝑔𝑒𝑛𝑣𝑎𝑙𝑢𝑒 행렬 곱의 결과가 원래 벡터와 Direction이 같고 Scale만 변했다는 의미 https://rfriend.tistory.com/181
  • 11. SVD(Singular Value Decomposition) singular value Rotation RotationScale Sigular Value : 모든 행렬 가능, 음이 아닌 scalar Eigen Value : 정사각행렬만 가능, 모든 scalar
  • 12. • eigenvalue : 512 diagonal matrix 이겠고 • eigenvalue를 구해보니 128 ~ 부터 dominan한 값이 saturation 되더라 • 그래서 low rank가 가능하더라 𝑃 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥[ 𝑄𝑊𝑖 𝑄 𝐾𝑊𝑖 𝐾 𝑇 𝑑 𝑘 ] 𝑃 ∈ ℝ 𝑛×𝑛 eigenvalue = 512 diagonal matrix
  • 13. Theorem 1 𝐷𝐴 ∶ 𝑛 × 𝑛 diagonal matrix 𝑃 ∶ exp 𝐴 ⋅ 𝐷𝐴 −1 𝑅 𝑇 𝑅 𝑅 ∈ ℝ 𝑘×𝑛 𝑤 ∈ ℝ 𝑛 𝑘 = 5 log 𝑛 / (𝜖2 − 𝜖3) • 𝑃와 𝑃가 근사하다면 rank를 줄일 수 있다.
  • 14. SVD • Theorem 1을 결과를 기반으로 k개를 뽑아서 low rank로 만든다는 것은 이렇게 표현할 수 있지만 SVD를 위한 추가적 연산이 필요하다. • 학습 된 파라미터는 줄일 수 있다. -> 그래서 다른 접근법을 사용한다.
  • 16. Transformer vs Linformer • 𝐸𝑖, 𝐹𝑖 ∈ ℝ 𝑛×𝑘 : linear projection matrices • 𝑃 ∈ ℝ 𝑛×𝑛 • 𝑃 ∈ ℝ 𝑛×𝑑
  • 18. Efficiency Techniques • Parameter Sharing between projections • Headwise sharing: for each layer, we share two projection matrices 𝐸 and 𝐹 such that 𝐸𝑖 = 𝐸 and 𝐹𝑖 = 𝐹 across all heads 𝑖. • Key-value sharing: we do headwise sharing, with the additional constraint of sharing the key and value projections. For each layer, we create a single projection matrix 𝐸 such that 𝐸𝑖 = 𝐹𝑖 = 𝐸 for each key-value projection matrix across all head 𝑖. • Layerwise sharing: we use a single projection matrix 𝐸 across all layers, for all heads, and for both key and value.
  • 19. Efficiency Techniques • Nonuniform projected dimension • One can choose a different projected dimension 𝑘 for different heads and layers. • This implies one can choose a smaller projected dimension 𝑘 for higher layers. The right figure plots the heatmap of normalized cumulative eigenvalue at the 128-th largest eigenvalue across different layers and heads in Wiki103 data.
  • 20. Efficiency Techniques • General projections • One can also choose different kinds of low-dimensional projection methods instead of a simple linear projection. • For example, one can choose mean/max pooling, or convolution where the kernel and stride is set to 𝑛/𝑘. • The convolutional functions contain parameters that require training.
  • 22. Experiments - Results of Accuracy
  • 23. Experiments - Results of Inference time efficiency