Towards Efficient Transformers
1. Towards Efficient Transformers
Sangmin Woo
2020.12.17
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
2. Contents
[2020 ICLR] Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
U.C. Berkeley & Google Research
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
Idiap Research Institute & EPFL & University of Washington & University of Geneva
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
Google Research
[2021 ICLR] Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
Google & University of Cambridge & DeepMind & Alan Turing Institute
3. Recap
Attention is all you need [1]
• Scaled dot-product attention mechanism
• The output for each query is computed as an attention-weighted sum of the values (𝑉), with the attention weights obtained from the scaled dot products of the queries (𝑄) with the keys (𝐾) (see the sketch below).
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
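A minimal NumPy sketch of the scaled dot-product attention described above (variable names and shapes are illustrative, not taken from the slides):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K: (n, d_k), V: (n, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # (n, n): every query against every key
        scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax = attention weights
        return weights @ V                              # attention-weighted sum of the values

Note that the (n, n) score matrix is exactly what the following slides try to avoid.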
5. Recap
Attention is all you need [1]
• The operation matches every single query with every single key to find out where information flows → 𝑂(𝑛²) complexity
• However, those information flows are mostly sparse
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
6. Recap
Attention is all you need [1]
• The operation matches every single query with every single key to find out where information flows → 𝑂(𝑛²) complexity
• However, those information flows are mostly sparse → Let's reduce the complexity!
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
7. Reformer: The Efficient Transformer [2020 ICLR]
Locality Sensitive Hashing (LSH)
• Bucketing
• If distance between 𝑣1 and 𝑣2 is small → same bucket
• If distance between 𝑣1 and 𝑣2 is large → different bucket
8. Reformer: The Efficient Transformer [2020 ICLR]
Angular Locality Sensitive Hashing (LSH)
• If cosine distance between 𝑣1 and 𝑣2 is small → same bucket
• If cosine distance between 𝑣1 and 𝑣2 is large → different bucket
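A minimal sketch of the angular LSH hashing step (random projections followed by an argmax, as described in the Reformer paper); vectors with small cosine distance are likely to receive the same bucket id. The function name and NumPy details are mine:

    import numpy as np

    def angular_lsh_buckets(x, n_buckets, seed=0):
        # x: (n, d) query/key vectors; n_buckets must be even.
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((x.shape[-1], n_buckets // 2))  # random projection directions
        proj = x @ R                                            # (n, n_buckets // 2)
        # h(x) = argmax([xR ; -xR]): nearby directions tend to share the argmax.
        return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

Attention is then restricted to queries and keys that fall into the same bucket, which is what brings the cost below 𝑂(𝑛²).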
9. Reformer: The Efficient Transformer [2020 ICLR]
Angular Locality Sensitive Hashing (LSH)
• If distance between 𝑣1 and 𝑣2 is small → same bucket
• If distance between 𝑣1 and 𝑣2 is large → different bucket
10. Reformer: The Efficient Transformer [2020 ICLR]
Efficiency: Memory & Time complexity
• LSH attention reduces the 𝑂(𝑛²) attention cost to 𝑂(𝑛 log 𝑛) in the sequence length 𝑛.
12. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Generalized Attention Mechanism
• Typical (softmax) attention formulation
• Generalized form with an arbitrary similarity function (written out below)
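The equations on this slide are figures; for reference, the generalized form used in the paper can be written as (standard formulation, transcription mine):

    V'_i = \frac{\sum_j \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_j \mathrm{sim}(Q_i, K_j)},
    \qquad \text{softmax attention} \iff \mathrm{sim}(q, k) = \exp\!\left(q^\top k / \sqrt{d}\right)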
13. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Linearized Attention
• (Roughly) a kernel 𝐾(𝑋, 𝑌) lets you write a similarity between vectors as an inner product 𝜙(𝑋)ᵀ𝜙(𝑌) of feature maps in some other space (a small worked example is given below).
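As a concrete textbook example (not on the slide): for 2-dimensional inputs, the polynomial kernel 𝐾(x, y) = (xᵀy)² is an inner product of explicit feature maps,

    \phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right), \qquad
    \phi(x)^\top \phi(y) = x_1^2 y_1^2 + 2\,x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x^\top y)^2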
14. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Linearized Attention
• General form of attention with a similarity function
• Kernel as the similarity function: sim(q, k) = 𝜙(q)ᵀ𝜙(k)
• Kernel (feature) function used in the paper: 𝜙(x) = elu(x) + 1 (see the sketch below)
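A minimal NumPy sketch of the non-causal linearized attention with the paper's feature map 𝜙(x) = elu(x) + 1; variable names are mine. The key point is that the sums over keys and values are computed once and shared across all queries, so the 𝑛 × 𝑛 matrix is never formed:

    import numpy as np

    def elu_feature_map(x):
        # phi(x) = elu(x) + 1 > 0, the feature map proposed in the paper
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    def linear_attention(Q, K, V, eps=1e-6):
        # Q, K: (n, d_k), V: (n, d_v)
        Qp, Kp = elu_feature_map(Q), elu_feature_map(K)   # (n, d_k)
        KV = Kp.T @ V                                     # (d_k, d_v), computed once
        Z = Kp.sum(axis=0)                                # (d_k,), shared normalizer terms
        return (Qp @ KV) / (Qp @ Z + eps)[:, None]        # (n, d_v), O(n) in sequence length

In the causal (autoregressive) case the same quantities become running prefix sums over positions, which is why the model can be run like an RNN at inference time.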
15. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Efficiency: Memory & Time complexity
• Linear attention scales as 𝑂(𝑛) in both time and memory with respect to sequence length, versus 𝑂(𝑛²) for softmax attention.
16. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Convergence Comparison
17. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Comparison of Image Generation Speed
18. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Big Bird
• Big Bird = random attention + sliding-window (local) attention + global attention, all in one sparse pattern (a sketch of the combined mask follows below).
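A rough illustrative sketch of how the three pieces combine into a single sparse attention mask (the paper's actual implementation is blocked for efficiency, and the sizes below are hypothetical defaults, not the paper's settings):

    import numpy as np

    def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
        # Returns an (n, n) boolean mask: True = this query attends to this key.
        rng = np.random.default_rng(seed)
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            lo, hi = max(0, i - window), min(n, i + window + 1)
            mask[i, lo:hi] = True                                        # sliding-window (local) attention
            mask[i, rng.choice(n, size=n_random, replace=False)] = True  # random attention
        mask[:n_global, :] = True   # global tokens attend to everything ...
        mask[:, :n_global] = True   # ... and everything attends to them
        return mask

Each query then attends to only 𝑂(window + random + global) keys instead of all 𝑛, so the attention cost grows linearly with sequence length.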
19. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Big Bird
20. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Big Bird in Graph Perspective
• Each attention pattern can be viewed as the adjacency matrix of a graph over tokens; Big Bird's sparse graph (random + window + global edges) keeps short paths between every pair of tokens and thereby approximates the fully connected graph of full attention.
21. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Building Block Ablations
24. Rethinking Attention with Performers [2021 ICLR]
Approximating the Softmax Kernel
• Let's decompose the softmax function into an inner product of (random) feature maps — i.e., treat it as a kernel!
• Attention is then approximated through this kernel instead of the explicit 𝑛 × 𝑛 softmax matrix.
• Random feature map 𝜙, built from a deterministic function ℎ, component functions 𝑓, and random vectors 𝜔 drawn i.i.d. (the general form is reproduced below).
• The choice of ℎ and 𝑓 determines which kernel you approximate, and the more 𝜔 you sample, the more accurately you approximate the kernel.
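Since the equations on this slide are figures, here is the general random-feature form from the paper, up to minor notation (transcription mine):

    \phi(x) = \frac{h(x)}{\sqrt{m}} \Big( f_1(\omega_1^\top x), \ldots, f_1(\omega_m^\top x), \ldots, f_l(\omega_1^\top x), \ldots, f_l(\omega_m^\top x) \Big),
    \qquad \omega_1, \ldots, \omega_m \overset{\mathrm{iid}}{\sim} \mathcal{D}

and the kernel is then estimated as 𝐾(x, y) ≈ 𝜙(x)ᵀ𝜙(y).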
25. Rethinking Attention with Performers [2021 ICLR]
Approximating the Softmax Kernel
• Random feature map 𝜙 (same general form as above)
• Example: the choice of ℎ and 𝑓 determines what the 𝜙 function is — e.g., trigonometric (sin/cos) features and positive (exp) features correspond to different choices.
26. Rethinking Attention with Performers [2021 ICLR]
Approximating the Softmax Kernel
• Softmax kernel: SM(x, y) = exp(xᵀy)
• Approximating softmax with trigonometric (sin/cos) random features
• Robust approximation of softmax is needed: the sin/cos features can take negative values, which leads to unstable behavior and high variance of the estimate — a bad approximation precisely where the true softmax values are small.
27. Rethinking Attention with Performers [2021 ICLR]
Positive & Orthogonal Random Features (ORFs)
• Positive features: build 𝜙 from exponentials instead of sin/cos, so every entry of 𝜙(x) is strictly positive (see the identity below)
• ORFs: constraining 𝜔₁, …, 𝜔_𝑚 to be exactly orthogonal (while keeping the correct marginal distributions) further reduces the variance of the estimator
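The positive-feature construction rests on the following identity for the softmax kernel (reproduced here because the slide's equations are figures):

    \mathrm{SM}(x, y) = \exp(x^\top y)
    = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)} \Big[ \exp\!\big(\omega^\top x - \tfrac{\|x\|^2}{2}\big)\, \exp\!\big(\omega^\top y - \tfrac{\|y\|^2}{2}\big) \Big]

so 𝜙⁺(x) = exp(−‖x‖²/2) · m^{−1/2} (exp(𝜔₁ᵀx), …, exp(𝜔_𝑚ᵀx)) is an unbiased estimator with strictly positive entries, avoiding the cancellation that makes the sin/cos estimator unstable.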
28. Rethinking Attention with Performers [2021 ICLR]
Fast Attention Via positive Orthogonal Random features (FAVOR+)
• The softmax attention matrix is approximated through the positive-feature kernel 𝜙⁺(q)ᵀ𝜙⁺(k), so the output can be computed with the linear-attention recipe, never materializing the 𝑛 × 𝑛 matrix (a sketch follows below).
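A minimal NumPy sketch of FAVOR+-style attention built from the pieces above: orthogonal Gaussian directions (via QR, with norms resampled to behave like Gaussian vectors), positive softmax-kernel features, and the same linear-time combination as in linear attention. Function names and details are my simplifications, not the authors' code:

    import numpy as np

    def orthogonal_gaussian(m, d, seed=0):
        # m orthogonal directions in R^d, norms resampled to match Gaussian vectors.
        assert m <= d  # sketch only; the paper stacks orthogonal blocks when m > d
        rng = np.random.default_rng(seed)
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))           # orthogonal matrix
        norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
        return norms[:, None] * q[:m]                               # (m, d)

    def positive_features(x, omega):
        # phi+(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m): strictly positive entries.
        m = omega.shape[0]
        return np.exp(x @ omega.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

    def favor_plus_attention(Q, K, V, m=64, eps=1e-6, seed=0):
        # Approximates softmax(Q K^T) V in O(n m d) time, never forming the (n, n) matrix.
        # (In practice Q and K are first rescaled by d**-0.25 to mimic softmax(Q K^T / sqrt(d)).)
        omega = orthogonal_gaussian(m, Q.shape[-1], seed)
        Qp, Kp = positive_features(Q, omega), positive_features(K, omega)  # (n, m)
        KV = Kp.T @ V                                               # (m, d_v)
        Z = Kp.sum(axis=0)                                          # (m,)
        return (Qp @ KV) / (Qp @ Z + eps)[:, None]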
31. Further Readings
[2020 arXiv] Longformer: The Long-Document Transformer
[2020 arXiv] Synthesizer: Rethinking Self-Attention in Transformer Models
[2020 arXiv] Linformer: Self-Attention with Linear Complexity
…
32. Concluding Remarks
The Transformer is known to have quadratic complexity 𝑂(𝑛²) in the sequence length.
Several studies aim to reduce this quadratic complexity to (near-)linear complexity:
• Reformer: Angular Locality Sensitive Hashing (LSH)
• Linear Attention: Kernel Trick
• Big Bird: Random + Window + Global Attention
• Performer: Approximating Softmax with Kernel (FAVOR+)
• Longformer, Synthesizer, Linformer…
While maintaining performance, they successfully reduce the complexity.
Many studies are still digging into the efficiency issues of Transformers…