Towards Efficient Transformers
Sangmin Woo
2020.12.17
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
2 / 33
Contents
[2020 ICLR] Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
U.C. Berkeley & Google Research
[2020 ICML] Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
Idiap Research Institute & EPFL & University of Washington & University of Geneva
[2020 NIPS] Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh
Ravula, Qifan Wang, Li Yang, Amr Ahmed
Google Research
[2021 ICLR] Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins,
Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
Google & University of Cambridge & DeepMind & Alan Turing Institute
3 / 33
Recap
▪ Attention is all you need [1]
• Scaled dot-product attention mechanism
• The output for each query is computed as an attention-weighted sum of the
values (𝑉), with the attention weights obtained from the dot products of the
queries (𝑄) with the keys (𝐾) (see the sketch below).
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
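A minimal NumPy sketch of this scaled dot-product attention; the shapes and the small example are illustrative, not taken from the slides:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d), K: (n, d), V: (n, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n): every query against every key
    weights = softmax(scores, axis=-1)  # attention weights per query
    return weights @ V                  # attention-weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
out = attention(Q, K, V)                # (6, 8)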
4 / 33
Recap
▪ Attention is all you need [1]
5 / 33
Recap
▪ Attention is all you need [1]
• The operation matches every single query with every single key to find out
where information flows → 𝑂(𝑛²) complexity 🙁
• However, this information flow is mostly sparse
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
6 / 33
Recap
▪ Attention is all you need [1]
• The operation matches every single query with every single key to find out
where information flows → 𝑂(𝑛²) complexity 🙁
• However, this information flow is mostly sparse → let's reduce the
complexity!
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
7 / 33
Reformer: The Efficient Transformer
[2020 ICLR]
▪ Locality Sensitive Hashing (LSH)
• Bucketing
• If distance between 𝑣1 and 𝑣2 is small → same bucket
• If distance between 𝑣1 and 𝑣2 is large → different bucket
8 / 33
Reformer: The Efficient Transformer
[2020 ICLR]
▪ Angular Locality Sensitive Hashing (LSH)
• If cosine distance between 𝑣1 and 𝑣2 is small → same bucket
• If cosine distance between 𝑣1 and 𝑣2 is large → different bucket
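A minimal sketch of the random-rotation hashing used for angular LSH in Reformer: project onto a shared random matrix R and take the argmax over the concatenation [xR; −xR], so that vectors with small cosine distance tend to land in the same bucket. The bucket count and vector sizes below are illustrative choices.

import numpy as np

def angular_lsh(x, n_buckets, seed=0):
    # x: (n, d); n_buckets must be even
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))   # shared random rotation
    xR = x @ R                                           # (n, n_buckets // 2)
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

x = np.random.default_rng(1).normal(size=(10, 16))
buckets = angular_lsh(x, n_buckets=8)   # bucket id in [0, 8) per vector;
                                        # attention is then restricted to same-bucket pairs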
9 / 33
Reformer: The Efficient Transformer
[2020 ICLR]
▪ Angular Locality Sensitive Hashing (LSH)
• If cosine distance between 𝑣1 and 𝑣2 is small → same bucket
• If cosine distance between 𝑣1 and 𝑣2 is large → different bucket
10/ 33
Reformer: The Efficient Transformer
[2020 ICLR]
▪ Efficiency: Memory & Time complexity
11 / 33
Reformer: The Efficient Transformer
[2020 ICLR]
▪ LSH Attention Performance
12/ 33
Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention [2020 ICML]
▪ Generalized Attention Mechanism
• Typical Attention Formulation
• Generalized Form
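The "Typical Attention Formulation" and "Generalized Form" on this slide were figures; roughly reconstructed from the paper, row i of the output is

\[
V'_i \;=\; \frac{\sum_{j} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j} \mathrm{sim}(Q_i, K_j)},
\qquad
\mathrm{sim}(Q_i, K_j) \;=\; \exp\!\Big(\tfrac{Q_i K_j^\top}{\sqrt{d}}\Big)
\]

where the exponential similarity recovers standard softmax attention, and any non-negative similarity function can be substituted for it.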
13/ 33
Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention [2020 ICML]
▪ Linearized Attention
• (Roughly) A kernel lets you represent a similarity score (e.g., K(𝑋, 𝑌)) as an inner
product of feature vectors (e.g., 𝜙(𝑋)ᵀ𝜙(𝑌)) in some other space.
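Writing the similarity of the generalized form above as a kernel, sim(𝑄_i, 𝐾_j) = 𝜙(𝑄_i)ᵀ𝜙(𝐾_j), and using associativity gives the linear-time rearrangement (a rough reconstruction of the slide's figure):

\[
V'_i \;=\; \frac{\sum_j \phi(Q_i)^\top \phi(K_j)\, V_j}{\sum_j \phi(Q_i)^\top \phi(K_j)}
\;=\; \frac{\phi(Q_i)^\top \sum_j \phi(K_j)\, V_j^\top}{\phi(Q_i)^\top \sum_j \phi(K_j)}
\]

The two sums over j are shared by every query, so they are computed once; the n × n attention matrix is never formed.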
14/ 33
Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention [2020 ICML]
▪ Linearized Attention
• General Form
• Kernel as similarity function
• Kernel function
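A minimal sketch of the resulting linearized attention with the paper's kernel feature map 𝜙(x) = elu(x) + 1 (non-causal case; shapes are illustrative):

import numpy as np

def phi(x):
    # elu(x) + 1: the positive feature map used in the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (n, d), V: (n, d_v)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                     # (d, d_v), computed once for all queries
    Z = Qp @ Kp.sum(axis=0)           # (n,) normalizer per query
    return (Qp @ KV) / Z[:, None]     # (n, d_v); no n x n matrix is ever formed

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
out = linear_attention(Q, K, V)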
15/ 33
Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention [2020 ICML]
▪ Efficiency: Memory & Time complexity
16/ 33
Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention [2020 ICML]
▪ Convergence Comparison
17/ 33
Transformers are RNNs: Fast Autoregressive
Transformers with Linear Attention [2020 ICML]
▪ Comparison of Image Generation Speed
18/ 33
Big Bird: Transformers for Longer Sequences
[2020 NIPS]
▪ Big Bird
• Big Bird = All in one!
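An illustrative sketch of a Big Bird-style attention mask combining the three patterns (random + window + global). The paper's block structure and exact sampling are simplified away here, and all sizes are toy values:

import numpy as np

def bigbird_mask(n, window=1, n_global=1, n_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)       # True = attention allowed
    for i in range(n):                        # 1) sliding-window (local) attention
        mask[i, max(0, i - window):min(n, i + window + 1)] = True
    mask[:n_global, :] = True                 # 2) global tokens attend everywhere...
    mask[:, :n_global] = True                 #    ...and are attended by everyone
    for i in range(n):                        # 3) a few random keys per query
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    return mask

m = bigbird_mask(n=12)                        # sparse pattern: O(n) nonzeros overall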
19/ 33
Big Bird: Transformers for Longer Sequences
[2020 NIPS]
▪ Big Bird
20/ 33
Big Bird: Transformers for Longer Sequences
[2020 NIPS]
▪ Big Bird in Graph Perspective
21/ 33
Big Bird: Transformers for Longer Sequences
[2020 NIPS]
▪ Building Block Ablations
22/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Performer at a High Level
23/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Decomposing the Attention Matrix
24/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Approximating the Softmax Kernel
• Let's decompose the softmax function into an inner product of feature maps
via a kernel!
• Attention approximation with a kernel
• Random feature map 𝜙
Where,
• The choice of ℎ and 𝑓 determines which kernel you approximate, and the
more 𝜔's you sample, the more accurately you approximate the kernel.
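The feature-map formulas on this slide were figures; a rough reconstruction of the paper's general form, with ℎ, 𝑓_1, …, 𝑓_l and the distribution of 𝜔 as the free choices:

\[
\phi(x) \;=\; \frac{h(x)}{\sqrt{m}}
\big( f_1(\omega_1^\top x), \dots, f_1(\omega_m^\top x), \;\dots,\;
      f_l(\omega_1^\top x), \dots, f_l(\omega_m^\top x) \big),
\qquad \omega_1, \dots, \omega_m \sim \mathcal{D} \ \text{i.i.d.},
\]
\[
\text{so that the target kernel is } \; K(x, y) \;=\; \mathbb{E}\big[\phi(x)^\top \phi(y)\big].
\]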
25/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Approximating the Softmax Kernel
• Random feature map 𝜙
Where,
• Example:
• The choice of ℎ and 𝑓 determines what the 𝜙 function is
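For example (reconstructing the slide's figure from the paper), choosing h(x) = exp(‖x‖²/2), l = 2, f₁ = sin, f₂ = cos and 𝜔 ~ N(0, I_d) yields the classical trigonometric random features for the softmax kernel:

\[
\phi_{\mathrm{trig}}(x) \;=\; \frac{e^{\|x\|^2/2}}{\sqrt{m}}
\big( \sin(\omega_1^\top x), \dots, \sin(\omega_m^\top x),
      \cos(\omega_1^\top x), \dots, \cos(\omega_m^\top x) \big),
\qquad
\mathbb{E}\big[\phi_{\mathrm{trig}}(x)^\top \phi_{\mathrm{trig}}(y)\big] = \exp(x^\top y).
\]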
26/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Approximating the Softmax Kernel
• Softmax-kernel
• Approximating Softmax
Where,
• Robust Approximation of Softmax
Bad approximation… 🙁
Negative feature values
(from sin / cos) lead to unstable
behavior → high variance
27/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Positive & Orthogonal Random Features (ORFs)
• Positive features
• ORFs: if 𝜔1, …, 𝜔𝑚 are exactly orthogonal, the variance of the estimator is
reduced further
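A minimal sketch of the positive random features, 𝜙⁺(x) = exp(𝜔ᵀx − ‖x‖²/2)/√m, together with one common way to make the 𝜔's exactly orthogonal (QR of a Gaussian matrix, rows rescaled to Gaussian norms). These details are one reasonable construction under the paper's setup, not its exact code:

import numpy as np

def orthogonal_gaussian(m, d, seed=0):
    # Stack d x d orthogonal blocks (QR of a Gaussian matrix), then rescale
    # each row to the norm of a fresh Gaussian vector.
    rng = np.random.default_rng(seed)
    blocks = []
    while len(blocks) * d < m:
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        blocks.append(Q)
    W = np.concatenate(blocks, axis=0)[:m]                        # (m, d)
    norms = np.linalg.norm(rng.normal(size=(m, d)), axis=1, keepdims=True)
    return W * norms

def positive_features(X, W):
    # X: (n, d), W: (m, d) -> (n, m); every entry is strictly positive
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(m)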
28/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Fast Attention Via positive Orthogonal Random features (FAVOR+)
• Softmax can be approximated by the kernel
Where,
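Putting the pieces together in FAVOR+ style: rescale queries and keys by d^(-1/4) so that 𝜙⁺(q)ᵀ𝜙⁺(k) ≈ exp(qᵀk/√d), then reuse the linear-attention rearrangement. A compact, self-contained sketch; `positive_features` is the same map as in the previous sketch, and for brevity the 𝜔's are drawn i.i.d. here (FAVOR+ additionally orthogonalizes them as above):

import numpy as np

def positive_features(X, W):
    return np.exp(X @ W.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
n, d, m = 6, 8, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
W = rng.normal(size=(m, d))                 # i.i.d. Gaussian; orthogonalize as in the ORF sketch

Qp = positive_features(Q / d ** 0.25, W)    # (n, m)
Kp = positive_features(K / d ** 0.25, W)    # (n, m)
out = (Qp @ (Kp.T @ V)) / (Qp @ Kp.sum(axis=0))[:, None]   # approx. softmax attention, linear in n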
29/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Softmax Attention Approximation Error
30/ 33
Rethinking Attention with Performers
[2021 ICLR]
▪ Forward & Backward Speed
31/ 33
Further Readings
[2020 arXiv] Longformer: The Long-Document Transformer
[2020 arXiv] Synthesizer: Rethinking Self-Attention in Transformer Models
[2020 arXiv] Linformer: Self-Attention with Linear Complexity
…
32/ 33
Concluding Remarks
▪ The Transformer is known to have quadratic complexity 𝑂(𝑛²) 🙁
▪ Several studies aim to reduce the quadratic complexity to linear
complexity!
• Reformer: Angular Locality-Sensitive Hashing
• Linear Attention: Kernel Trick
• Big Bird: Random + Window + Global Attention
• Performer: Approximating Softmax with Kernel (FAVOR+)
• Longformer, Synthesizer, Linformer…
▪ While maintaining performance, they successfully reduce the
complexity 🙂
▪ Many studies are still digging into the efficiency issues of
Transformers…
Thank You
shmwoo9395@{gist.ac.kr, gmail.com}