Towards Efficient Transformers
Sangmin Woo
2020.12.17
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
Contents
[2020 ICLR] Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
UC Berkeley & Google Research
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
Idiap Research Institute & EPFL & University of Washington & University of Geneva
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
Google Research
[2021 ICLR] Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
Google & University of Cambridge & DeepMind & Alan Turing Institute
Recap
■ Attention is all you need [1]
• Scaled dot-product attention mechanism
• The output for a query is computed as an attention-weighted sum of the values (𝑉), with the attention weights obtained from the product of the queries (𝑄) with the keys (𝐾).
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
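The mechanism above can be sketched in a few lines of NumPy (a minimal single-head sketch, not the full multi-head module of the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention: softmax(Q Kᵀ / √d) V.

    Q, K, V: (n, d) arrays; returns an (n, d) array."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): every query vs. every key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # attention-weighted sum of values

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Note the (n, n) score matrix: this is exactly where the quadratic cost discussed on the next slides comes from.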
Recap
■ Attention is all you need [1]
• The operation matches every single query with every single key to find out where information flows → 𝑂(𝑛²) complexity
• However, those information flows are mostly sparse
→ Let's reduce the complexity!
Reformer: The Efficient Transformer [2020 ICLR]
■ Locality Sensitive Hashing (LSH)
• Bucketing
• If the distance between 𝑣1 and 𝑣2 is small → same bucket
• If the distance between 𝑣1 and 𝑣2 is large → different bucket
■ Angular Locality Sensitive Hashing (LSH)
• If the cosine distance between 𝑣1 and 𝑣2 is small → same bucket
• If the cosine distance between 𝑣1 and 𝑣2 is large → different bucket
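The bucketing idea can be sketched with classic random-hyperplane (SimHash-style) angular LSH. This is an illustrative sketch, not Reformer's exact scheme, which hashes with random rotations and an argmax over concatenated projections, but the same-bucket intuition is identical:

```python
import numpy as np

def angular_lsh(vectors, n_planes=8, seed=0):
    """Hash vectors into buckets by which side of each random
    hyperplane they fall on; nearby directions tend to share a bucket."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    planes = rng.normal(size=(d, n_planes))            # random hyperplane normals
    bits = (vectors @ planes) > 0                      # sign of each projection
    return bits.astype(int) @ (1 << np.arange(n_planes))  # pack sign bits into a bucket id

v1 = np.array([1.0, 0.2, 0.0])
v2 = np.array([0.9, 0.3, 0.1])   # small angle to v1 → likely the same bucket
v3 = -v1                         # opposite direction → guaranteed different bucket
buckets = angular_lsh(np.stack([v1, v2, v3]))
print(buckets)
```

In LSH attention, each query then only attends within its own bucket, instead of over all n positions.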
■ Efficiency: Memory & Time complexity
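The headline comparison, as recalled from the Reformer paper, for sequence length $n$:

```latex
\underbrace{O(n^2)}_{\text{full attention}}
\;\longrightarrow\;
\underbrace{O(n \log n)}_{\text{LSH attention}}
```

The log factor comes from sorting queries/keys by hash bucket before chunked attention.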
■ LSH Attention Performance
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
■ Generalized Attention Mechanism
• Typical Attention Formulation
• Generalized Form
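In symbols, row $i$ of the output in the two formulations (recalled from the paper's notation, with $N$ positions and key dimension $D$):

```latex
% Typical (softmax) attention:
V'_i = \frac{\sum_{j=1}^{N} \exp\!\left(Q_i^\top K_j / \sqrt{D}\right) V_j}
            {\sum_{j=1}^{N} \exp\!\left(Q_i^\top K_j / \sqrt{D}\right)}

% Generalized form: any non-negative similarity function
V'_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}
            {\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)}
```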
■ Linearized Attention
• (Roughly) A kernel lets you represent a similarity K(𝑋, 𝑌) as an inner product of feature maps, 𝜙(𝑋)ᵀ𝜙(𝑌), in some other space.
■ Linearized Attention
• General Form
• Kernel as a similarity function
• Kernel function
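A minimal sketch of the resulting computation, using the paper's feature map φ(x) = elu(x) + 1 (non-causal case only; the autoregressive/RNN view adds a running sum over j):

```python
import numpy as np

def elu_feature_map(x):
    """φ(x) = elu(x) + 1: strictly positive, so the weights form a valid distribution."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linearized attention with sim(q, k) = φ(q)ᵀφ(k).

    Reassociating Σⱼ φ(Qᵢ)ᵀφ(Kⱼ) Vⱼ as φ(Qᵢ)ᵀ (Σⱼ φ(Kⱼ) Vⱼᵀ)
    makes the cost linear in sequence length n instead of quadratic."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)  # (n, d) feature maps
    KV = Kf.T @ V                                    # (d, d), computed once for all queries
    Z = Qf @ Kf.sum(axis=0)                          # (n,) per-query normalizers
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))
out = linear_attention(Q, K, V)
print(out.shape)  # (5, 4)
```

The same sum computed the "quadratic way" gives identical results; only the association order changes.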
■ Efficiency: Memory & Time complexity
■ Convergence Comparison
■ Comparison of Image Generation Speed
Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
■ Big Bird
• Big Bird = Random + Window + Global attention, all in one!
■ Big Bird from a Graph Perspective
■ Building Block Ablations
Rethinking Attention with Performers [2021 ICLR]
■ Performer at a High Level
■ Decomposing the Attention Matrix
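The decomposition, sketched in the paper's notation (recalled from the Performer paper; $L$ is the sequence length):

```latex
\mathrm{Att}(Q, K, V) = D^{-1} A V, \qquad
A = \exp\!\left(Q K^\top / \sqrt{d}\right)\ \text{(elementwise)}, \qquad
D = \mathrm{diag}(A \mathbf{1}_L)

% Performer replaces A by a low-rank product of random-feature matrices:
A \approx \widehat{Q}\,\widehat{K}^\top, \qquad
\widehat{Q} = \phi(Q),\ \ \widehat{K} = \phi(K)
\;\Rightarrow\;
\mathrm{Att}(Q, K, V) \approx \widehat{D}^{-1}\!\left(\widehat{Q}\left(\widehat{K}^\top V\right)\right)
```

As with linear attention, computing $\widehat{K}^\top V$ first avoids ever materializing the $L \times L$ matrix.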
■ Approximating the Softmax Kernel
• Let's decompose the softmax function into an inner product of feature maps via a kernel!
• Attention approximation with the kernel
• Random feature map 𝜙
• The choice of ℎ and 𝑓 determines which kernel you approximate, and the more 𝜔's you sample, the more accurately you approximate the kernel.
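The general random feature map, as recalled from the Performer paper ($m$ is the number of sampled $\omega$'s):

```latex
\phi(x) = \frac{h(x)}{\sqrt{m}}
\left( f_1(\omega_1^\top x), \ldots, f_1(\omega_m^\top x),\ \ldots,\
       f_l(\omega_1^\top x), \ldots, f_l(\omega_m^\top x) \right),
\qquad \omega_1, \ldots, \omega_m \overset{\text{iid}}{\sim} \mathcal{D}

% trigonometric softmax features:  h(x) = e^{\|x\|^2/2},\quad f_1 = \sin,\ f_2 = \cos
% positive softmax features:      h(x) = e^{-\|x\|^2/2},\quad f_1 = \exp
```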
■ Approximating the Softmax Kernel
• Random feature map 𝜙
• Example: the choice of ℎ and 𝑓 determines what the 𝜙 function is
■ Approximating the Softmax Kernel
• Softmax-kernel
• Approximating Softmax with trigonometric (sin/cos) random features
• Robust Approximation of Softmax: the trigonometric features can take negative values, which leads to unstable behavior → high variance, i.e., a bad approximation (especially where the true attention scores are close to 0)
■ Positive & Orthogonal Random Features (ORFs)
• Positive features
• ORFs: if 𝜔1, …, 𝜔𝑚 are constrained to be exactly orthogonal, the variance of the estimator is further reduced
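The positive-feature identity behind this slide, exp(xᵀy) = E over ω ~ N(0, I) of exp(ωᵀx − ‖x‖²/2) · exp(ωᵀy − ‖y‖²/2), can be checked numerically. A minimal sketch with plain i.i.d. Gaussian ω's (no orthogonality trick):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d) * 0.25
y = rng.normal(size=d) * 0.25

m = 100_000
omega = rng.normal(size=(m, d))        # ω ~ N(0, I_d), i.i.d. rows

fx = np.exp(omega @ x - x @ x / 2)     # positive random features of x
fy = np.exp(omega @ y - y @ y / 2)     # positive random features of y

estimate = np.mean(fx * fy)            # Monte-Carlo estimate of exp(xᵀy)
exact = np.exp(x @ y)
print(estimate, exact)                 # the two agree closely
```

Unlike the sin/cos features, every sampled feature here is strictly positive, which keeps the estimator stable even when exp(xᵀy) is near zero.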
■ Fast Attention Via positive Orthogonal Random features (FAVOR+)
• Softmax can be approximated by a kernel built from positive orthogonal random features
■ Softmax Attention Approximation Error
■ Forward & Backward Speed
Further Readings
[2020 arXiv] Longformer: The Long-Document Transformer
[2020 arXiv] Synthesizer: Rethinking Self-Attention in Transformer Models
[2020 arXiv] Linformer: Self-Attention with Linear Complexity
…
Concluding Remarks
■ Transformer is known to have quadratic complexity 𝑂(𝑛²)
■ Several studies aim to reduce the quadratic complexity to linear (or near-linear) complexity!
• Reformer: Angular Locality Sensitive Hashing
• Linear Attention: Kernel Trick
• Big Bird: Random + Window + Global Attention
• Performer: Approximating Softmax with a Kernel (FAVOR+)
• Longformer, Synthesizer, Linformer, …
■ While maintaining performance, they successfully reduce the complexity
■ Many studies are still digging into the efficiency issues of Transformers…
Thank You
shmwoo9395@{gist.ac.kr, gmail.com}
More Related Content

Similar to Towards Efficient Transformers

vasp_tutorial.pptx
vasp_tutorial.pptxvasp_tutorial.pptx
vasp_tutorial.pptxUsman Mastoi
 
Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Integrated Carbon Observation System (ICOS)
 
Spm And Sicm Lecture
Spm And Sicm LectureSpm And Sicm Lecture
Spm And Sicm Lecturesschraml
 
Creative machine learning approaches for climate change detection
Creative machine learning approaches for climate change detectionCreative machine learning approaches for climate change detection
Creative machine learning approaches for climate change detectionZachary Labe
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSLiemNguyenDuy
 
Ultimate astronomicalimaging
Ultimate astronomicalimagingUltimate astronomicalimaging
Ultimate astronomicalimagingClifford Stone
 
Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...
Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...
Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...zeenta zeenta
 
[Mmlab seminar 2016] deep learning for human pose estimation
[Mmlab seminar 2016] deep learning for human pose estimation[Mmlab seminar 2016] deep learning for human pose estimation
[Mmlab seminar 2016] deep learning for human pose estimationWei Yang
 
Intelligent reflecting surface 2
Intelligent reflecting surface 2Intelligent reflecting surface 2
Intelligent reflecting surface 2VARUN KUMAR
 
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...Spark Summit
 
Monte carlo methods in graphics and hacking
Monte carlo methods in graphics and hackingMonte carlo methods in graphics and hacking
Monte carlo methods in graphics and hackingHimanshu Goel
 
Pixelor presentation slides for SIGGRAPH Asia 2020
Pixelor presentation slides for SIGGRAPH Asia 2020Pixelor presentation slides for SIGGRAPH Asia 2020
Pixelor presentation slides for SIGGRAPH Asia 2020Ayan Das
 
Monte carlo and network cmg'14
Monte carlo and network cmg'14Monte carlo and network cmg'14
Monte carlo and network cmg'14Alex Gilgur
 
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...MLconf
 
Finding Meaning in Points, Areas and Surfaces: Spatial Analysis in R
Finding Meaning in Points, Areas and Surfaces: Spatial Analysis in RFinding Meaning in Points, Areas and Surfaces: Spatial Analysis in R
Finding Meaning in Points, Areas and Surfaces: Spatial Analysis in RRevolution Analytics
 
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...Alejandro Salado
 

Similar to Towards Efficient Transformers (20)

vasp_tutorial.pptx
vasp_tutorial.pptxvasp_tutorial.pptx
vasp_tutorial.pptx
 
Kailash(13EC35032)_mtp.pptx
Kailash(13EC35032)_mtp.pptxKailash(13EC35032)_mtp.pptx
Kailash(13EC35032)_mtp.pptx
 
Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...Combining remote sensing earth observations and in situ networks: detection o...
Combining remote sensing earth observations and in situ networks: detection o...
 
Spm And Sicm Lecture
Spm And Sicm LectureSpm And Sicm Lecture
Spm And Sicm Lecture
 
Creative machine learning approaches for climate change detection
Creative machine learning approaches for climate change detectionCreative machine learning approaches for climate change detection
Creative machine learning approaches for climate change detection
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
Ultimate astronomicalimaging
Ultimate astronomicalimagingUltimate astronomicalimaging
Ultimate astronomicalimaging
 
Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...
Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...
Dual-hop Variable-Gain Relaying with Beamforming over 휿−흁 Shadowed Fading Cha...
 
Conformer review
Conformer reviewConformer review
Conformer review
 
[Mmlab seminar 2016] deep learning for human pose estimation
[Mmlab seminar 2016] deep learning for human pose estimation[Mmlab seminar 2016] deep learning for human pose estimation
[Mmlab seminar 2016] deep learning for human pose estimation
 
Intelligent reflecting surface 2
Intelligent reflecting surface 2Intelligent reflecting surface 2
Intelligent reflecting surface 2
 
Human Action Recognition
Human Action RecognitionHuman Action Recognition
Human Action Recognition
 
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...Distributed Data Processing using Spark by  Panos Labropoulos_and Sarod Yataw...
Distributed Data Processing using Spark by Panos Labropoulos_and Sarod Yataw...
 
Monte carlo methods in graphics and hacking
Monte carlo methods in graphics and hackingMonte carlo methods in graphics and hacking
Monte carlo methods in graphics and hacking
 
Pixelor presentation slides for SIGGRAPH Asia 2020
Pixelor presentation slides for SIGGRAPH Asia 2020Pixelor presentation slides for SIGGRAPH Asia 2020
Pixelor presentation slides for SIGGRAPH Asia 2020
 
Monte carlo and network cmg'14
Monte carlo and network cmg'14Monte carlo and network cmg'14
Monte carlo and network cmg'14
 
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...
Tamara G. Kolda, Distinguished Member of Technical Staff, Sandia National Lab...
 
05 2017 05-04-clear sky models g-kimball
05 2017 05-04-clear sky models g-kimball05 2017 05-04-clear sky models g-kimball
05 2017 05-04-clear sky models g-kimball
 
Finding Meaning in Points, Areas and Surfaces: Spatial Analysis in R
Finding Meaning in Points, Areas and Surfaces: Spatial Analysis in RFinding Meaning in Points, Areas and Surfaces: Spatial Analysis in R
Finding Meaning in Points, Areas and Surfaces: Spatial Analysis in R
 
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
 

More from Sangmin Woo

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxSangmin Woo
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptxSangmin Woo
 
Masked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxMasked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxSangmin Woo
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxSangmin Woo
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxSangmin Woo
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptxSangmin Woo
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptxSangmin Woo
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningSangmin Woo
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Sangmin Woo
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsSangmin Woo
 
Neural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global contextNeural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global contextSangmin Woo
 
Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsSangmin Woo
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationSangmin Woo
 

More from Sangmin Woo (13)

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptx
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptx
 
Masked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptxMasked Autoencoders Are Scalable Vision Learners.pptx
Masked Autoencoders Are Scalable Vision Learners.pptx
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptx
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptx
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptx
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
 
Neural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global contextNeural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global context
 
Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene Graphs
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
 

Recently uploaded

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform EngineeringJemma Hussein Allen
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsPaul Groth
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Product School
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsVlad Stirbu
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesThousandEyes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀DianaGray10
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
 

Recently uploaded (20)

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 

Towards Efficient Transformers

  • 1. Towards Efficient Transformers Sangmin Woo 2020.12.17 [2020 ICLR] Reformer: The Efficient Transformer [2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 NIPS] Big Bird: Transformers for Longer Sequences [2021 ICLR] Rethinking Attention with Performers
  • 2. 2 / 33 Contents [2020 ICLR] Reformer: The Efficient Transformer Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya U.C. Berkeley & Google Research [2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention Angelos Katharopoulo, Apoorv Vyas, Nikolaos Pappas, Franc¸ois Fleuret Idiap Research Institute & EPPL & University of Washington & University of Geneva [2020 NIPS] Big Bird: Transformers for Longer Sequences Manzil Zaheer, Guru Guruganesh, Avinava Dubey Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed Google Research [2021 ICLR] Rethinking Attention with Performers Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Son, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller Google & University of Cambridge & DeepMind & Alan Turing Institute
  • 3. 3 / 33 Recap  Attention is all you need [1] • Scaled dot-product attention mechanism • The output for the query is computed as an attention weighted sum of values (𝑉), with the attention weights obtained from the product of the queries (𝑄) with keys (𝐾). [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • 4. 4 / 33 Recap – Attention is all you need [1]
  • 6. 6 / 33 Recap – Attention is all you need [1] • The operation matches every single query with every single key to find out where information flows → 𝑂(𝑛²) complexity • However, the information flow is mostly sparse → let’s reduce the complexity!
  • 7. 7 / 33 Reformer: The Efficient Transformer [2020 ICLR] – Locality Sensitive Hashing (LSH) • Bucketing • If the distance between 𝑣1 and 𝑣2 is small → same bucket • If the distance between 𝑣1 and 𝑣2 is large → different bucket
  • 8. 8 / 33 Reformer: The Efficient Transformer [2020 ICLR] – Angular Locality Sensitive Hashing (LSH) • If the cosine distance between 𝑣1 and 𝑣2 is small → same bucket • If the cosine distance between 𝑣1 and 𝑣2 is large → different bucket
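A minimal sign-of-random-projection sketch of angular LSH. Reformer's actual scheme hashes via random rotations and an argmax over concatenated projections, but the idea is the same: vectors pointing in nearby directions hash to the same bucket.

```python
import numpy as np

def angular_lsh_bucket(v, n_planes=4, seed=0):
    # Hash a vector by which side of n_planes random hyperplanes it falls on.
    # Vectors with small cosine distance agree on most signs, so they land
    # in the same bucket with high probability; magnitude is ignored.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((v.shape[-1], n_planes))
    bits = (v @ planes) > 0
    return int(bits @ (2 ** np.arange(n_planes)))  # bucket id in [0, 2^n_planes)
```

Attention is then restricted to queries and keys that share a bucket, so each query attends only to a small candidate set rather than the full sequence.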
  • 10. 10 / 33 Reformer: The Efficient Transformer [2020 ICLR] – Efficiency: Memory & Time Complexity
  • 11. 11 / 33 Reformer: The Efficient Transformer [2020 ICLR] – LSH Attention Performance
  • 12. 12 / 33 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML] – Generalized Attention Mechanism • Typical Attention Formulation • Generalized Form
  • 13. 13 / 33 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML] – Linearized Attention • (Roughly) A kernel lets you represent a similarity between vectors (e.g., 𝐾(𝑋, 𝑌)) as an inner product of feature vectors (e.g., 𝜙(𝑋)ᵀ𝜙(𝑌)) in some other space.
  • 14. 14 / 33 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML] – Linearized Attention • General Form • Kernel as similarity function • Kernel function
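The linearized attention can be sketched as follows, using the 𝜙(x) = elu(x) + 1 feature map from the paper (non-causal case, for illustration). The key point is that associativity lets us contract 𝜙(K)ᵀV first, avoiding the (n, n) matrix entirely:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a simple, strictly positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1).
    # Contracting phi(K)^T V first costs O(n d^2) instead of O(n^2 d).
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    KV = Kp.T @ V                 # (d, d) summary of all keys and values
    Z = Qp @ Kp.sum(axis=0)       # per-query normalizer
    return (Qp @ KV) / Z[:, None]
```

In the causal setting the same contraction becomes a running sum over positions, which is what makes the Transformer behave like an RNN at inference time.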
  • 15. 15 / 33 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML] – Efficiency: Memory & Time Complexity
  • 16. 16 / 33 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML] – Convergence Comparison
  • 17. 17 / 33 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML] – Comparison of Image Generation Speed
  • 18. 18 / 33 Big Bird: Transformers for Longer Sequences [2020 NIPS] – Big Bird • Big Bird = All in one!
  • 19. 19 / 33 Big Bird: Transformers for Longer Sequences [2020 NIPS] – Big Bird
  • 20. 20 / 33 Big Bird: Transformers for Longer Sequences [2020 NIPS] – Big Bird in Graph Perspective
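A sketch of how Big Bird's three patterns (sliding window + global tokens + random keys) combine into one boolean attention mask; the sizes and the blocked structure of the real implementation are simplified here:

```python
import numpy as np

def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
    # True at (i, j) means query i may attend to key j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        mask[i, lo:hi] = True                                        # sliding window
        mask[i, rng.choice(n, size=n_random, replace=False)] = True  # random keys
    mask[:n_global, :] = True   # global tokens attend to everything
    mask[:, :n_global] = True   # every token attends to the global tokens
    return mask
```

In the graph perspective of the slide above, this mask is the adjacency matrix: window edges preserve locality, random edges keep the graph's diameter small, and global nodes connect any pair of tokens in two hops.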
  • 21. 21 / 33 Big Bird: Transformers for Longer Sequences [2020 NIPS] – Building Block Ablations
  • 22. 22 / 33 Rethinking Attention with Performers [2021 ICLR] – Performer at a High Level
  • 23. 23 / 33 Rethinking Attention with Performers [2021 ICLR] – Decomposing the Attention Matrix
  • 24. 24 / 33 Rethinking Attention with Performers [2021 ICLR] – Approximating the Softmax Kernel • Let’s decompose the softmax function into an inner product of linear functions with a kernel! • Attention approximation with a kernel • Random feature map 𝜙, where • The choice of ℎ and 𝑓 determines which kernel you approximate, and the more 𝜔’s you sample, the more accurately you approximate the kernel.
  • 25. 25 / 33 Rethinking Attention with Performers [2021 ICLR] – Approximating the Softmax Kernel • Random feature map 𝜙, where • Example: • The choice of ℎ and 𝑓 determines what the 𝜙 function is
  • 26. 26 / 33 Rethinking Attention with Performers [2021 ICLR] – Approximating the Softmax Kernel • Softmax kernel • Approximating softmax, where • Robust approximation of softmax • Bad approximation: negative feature values (from the sin / cos features) lead to unstable behavior → high variance
  • 27. 27 / 33 Rethinking Attention with Performers [2021 ICLR] – Positive & Orthogonal Random Features (ORFs) • Positive features • ORFs: constraining 𝜔1, …, 𝜔𝑚 to be exactly orthogonal further reduces the variance of the estimator
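A sketch of the positive random-feature estimator (plain i.i.d. Gaussian 𝜔's here; the orthogonalization step of ORFs is omitted). With 𝜙(x) = exp(𝜔ᵀx − ‖x‖²/2)/√m, the inner product 𝜙(x)ᵀ𝜙(y) is a Monte Carlo estimate of the softmax kernel exp(xᵀy), and every feature is positive:

```python
import numpy as np

def positive_features(x, omegas):
    # phi(x)_i = exp(omega_i . x - ||x||^2 / 2) / sqrt(m): all entries positive,
    # avoiding the sign cancellations that make sin/cos features high-variance.
    m = omegas.shape[0]
    return np.exp(x @ omegas.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 20000
omegas = rng.standard_normal((m, d))          # omega_i ~ N(0, I_d)
x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
estimate = (positive_features(x[None], omegas) @ positive_features(y[None], omegas).T).item()
exact = float(np.exp(x @ y))                  # the softmax kernel SM(x, y)
```

Sampling more 𝜔's shrinks the estimation error, and making the 𝜔's orthogonal (the ORF trick) shrinks it further at the same m.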
  • 28. 28 / 33 Rethinking Attention with Performers [2021 ICLR] – Fast Attention Via positive Orthogonal Random features (FAVOR+) • Softmax can be approximated by this kernel, where
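Putting the pieces together, a sketch of FAVOR+-style attention with plain Gaussian features (the paper additionally orthogonalizes the 𝜔's and periodically redraws them during training):

```python
import numpy as np

def favor_attention(Q, K, V, m=256, seed=0):
    # Approximate softmax attention in O(n m d): replace exp(q . k) by
    # phi(q) . phi(k) with positive random features, contract phi(K)^T V
    # before multiplying by phi(Q), then normalize the rows.
    rng = np.random.default_rng(seed)
    omegas = rng.standard_normal((m, Q.shape[-1]))
    def phi(X):
        return np.exp(X @ omegas.T - np.sum(X**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)
    Qp, Kp = phi(Q), phi(K)
    num = Qp @ (Kp.T @ V)          # never forms the (n, n) attention matrix
    den = Qp @ Kp.sum(axis=0)      # per-query normalizer
    return num / den[:, None]
```

Because every feature is positive, the normalizer stays well-behaved, which is precisely what the sin/cos features of the earlier slide could not guarantee.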
  • 29. 29 / 33 Rethinking Attention with Performers [2021 ICLR] – Softmax Attention Approximation Error
  • 30. 30 / 33 Rethinking Attention with Performers [2021 ICLR] – Forward & Backward Speed
  • 31. 31 / 33 Further Readings [2020 arXiv] Longformer: The Long-Document Transformer [2020 arXiv] Synthesizer: Rethinking Self-Attention in Transformer Models [2020 arXiv] Linformer: Self-Attention with Linear Complexity …
  • 32. 32 / 33 Concluding Remarks • The Transformer is known to have quadratic complexity 𝑂(𝑛²) • Several studies aim to reduce the quadratic complexity to linear complexity: • Reformer: Angular Locality Sensitive Hashing • Linear Attention: Kernel Trick • Big Bird: Random + Window + Global Attention • Performer: Approximating Softmax with a Kernel (FAVOR+) • Longformer, Synthesizer, Linformer… • While maintaining performance, they successfully reduced the complexity • Many studies are still digging into the efficiency issues of the Transformer…