Towards Efficient Transformers
1. Towards Efficient Transformers
Sangmin Woo
2020.12.17
[2020 ICLR] Reformer: The Efficient Transformer
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
[2021 ICLR] Rethinking Attention with Performers
2. Contents
[2020 ICLR] Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
U.C. Berkeley & Google Research
[2020 ICML] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
Idiap Research Institute & EPFL & University of Washington & University of Geneva
[2020 NeurIPS] Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
Google Research
[2021 ICLR] Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
Google & University of Cambridge & DeepMind & Alan Turing Institute
3. Recap
Attention is all you need [1]
• Scaled dot-product attention mechanism
• The output for each query is computed as an attention-weighted sum of the values (𝑉), with the attention weights obtained from the scaled dot products of the queries (𝑄) with the keys (𝐾) (see the sketch below).
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
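A minimal NumPy sketch of the scaled dot-product attention described above (variable names and shapes are illustrative, not taken from the slides):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K: (n, d_k), V: (n, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # (n, n): every query against every key
        scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax = attention weights
        return weights @ V                              # attention-weighted sum of the values

Note that the (n, n) score matrix is exactly what the following slides try to avoid.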
5. Recap
Attention is all you need [1]
• The operation matches every single query with every single key to find out where information flows → 𝑂(𝑛²) complexity
• However, those information flows are mostly sparse
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
6. Recap
Attention is all you need [1]
• The operation matches every single query with every single key to find out where information flows → 𝑂(𝑛²) complexity
• However, those information flows are mostly sparse → Let's reduce the complexity!
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
7. Reformer: The Efficient Transformer [2020 ICLR]
Locality Sensitive Hashing (LSH)
• Bucketing
• If distance between 𝑣1 and 𝑣2 is small → same bucket
• If distance between 𝑣1 and 𝑣2 is large → different bucket
8. Reformer: The Efficient Transformer [2020 ICLR]
Angular Locality Sensitive Hashing (LSH)
• If cosine distance between 𝑣1 and 𝑣2 is small → same bucket
• If cosine distance between 𝑣1 and 𝑣2 is large → different bucket
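A minimal sketch of the angular LSH hashing step (random projections followed by an argmax, as described in the Reformer paper); vectors with small cosine distance are likely to receive the same bucket id. The function name and NumPy details are mine:

    import numpy as np

    def angular_lsh_buckets(x, n_buckets, seed=0):
        # x: (n, d) query/key vectors; n_buckets must be even.
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((x.shape[-1], n_buckets // 2))  # random projection directions
        proj = x @ R                                            # (n, n_buckets // 2)
        # h(x) = argmax([xR ; -xR]): nearby directions tend to share the argmax.
        return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

Attention is then restricted to queries and keys that fall into the same bucket, which is what brings the cost below 𝑂(𝑛²).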
9. Reformer: The Efficient Transformer [2020 ICLR]
Angular Locality Sensitive Hashing (LSH)
• If distance between 𝑣1 and 𝑣2 is small → same bucket
• If distance between 𝑣1 and 𝑣2 is large → different bucket
10. Reformer: The Efficient Transformer [2020 ICLR]
Efficiency: Memory & Time complexity
• LSH attention reduces the 𝑂(𝑛²) attention cost to 𝑂(𝑛 log 𝑛) in the sequence length 𝑛.
12. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Generalized Attention Mechanism
• Typical (softmax) attention formulation
• Generalized form with an arbitrary similarity function (written out below)
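The equations on this slide are figures; for reference, the generalized form used in the paper can be written as (standard formulation, transcription mine):

    V'_i = \frac{\sum_j \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_j \mathrm{sim}(Q_i, K_j)},
    \qquad \text{softmax attention} \iff \mathrm{sim}(q, k) = \exp\!\left(q^\top k / \sqrt{d}\right)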
13. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Linearized Attention
• (Roughly) a kernel 𝐾(𝑋, 𝑌) lets you write a similarity between vectors as an inner product 𝜙(𝑋)ᵀ𝜙(𝑌) of feature maps in some other space (a small worked example is given below).
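As a concrete textbook example (not on the slide): for 2-dimensional inputs, the polynomial kernel 𝐾(x, y) = (xᵀy)² is an inner product of explicit feature maps,

    \phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right), \qquad
    \phi(x)^\top \phi(y) = x_1^2 y_1^2 + 2\,x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x^\top y)^2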
14. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Linearized Attention
• General form of attention with a similarity function
• Kernel as the similarity function: sim(q, k) = 𝜙(q)ᵀ𝜙(k)
• Kernel (feature) function used in the paper: 𝜙(x) = elu(x) + 1 (see the sketch below)
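A minimal NumPy sketch of the non-causal linearized attention with the paper's feature map 𝜙(x) = elu(x) + 1; variable names are mine. The key point is that the sums over keys and values are computed once and shared across all queries, so the 𝑛 × 𝑛 matrix is never formed:

    import numpy as np

    def elu_feature_map(x):
        # phi(x) = elu(x) + 1 > 0, the feature map proposed in the paper
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    def linear_attention(Q, K, V, eps=1e-6):
        # Q, K: (n, d_k), V: (n, d_v)
        Qp, Kp = elu_feature_map(Q), elu_feature_map(K)   # (n, d_k)
        KV = Kp.T @ V                                     # (d_k, d_v), computed once
        Z = Kp.sum(axis=0)                                # (d_k,), shared normalizer terms
        return (Qp @ KV) / (Qp @ Z + eps)[:, None]        # (n, d_v), O(n) in sequence length

In the causal (autoregressive) case the same quantities become running prefix sums over positions, which is why the model can be run like an RNN at inference time.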
15. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Efficiency: Memory & Time complexity
• Linear attention scales as 𝑂(𝑛) in both time and memory with respect to sequence length, versus 𝑂(𝑛²) for softmax attention.
16. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Convergence Comparison
17. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [2020 ICML]
Comparison of Image Generation Speed
18. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Big Bird
• Big Bird = random attention + sliding-window (local) attention + global attention, all in one sparse pattern (a sketch of the combined mask follows below).
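A rough illustrative sketch of how the three pieces combine into a single sparse attention mask (the paper's actual implementation is blocked for efficiency, and the sizes below are hypothetical defaults, not the paper's settings):

    import numpy as np

    def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
        # Returns an (n, n) boolean mask: True = this query attends to this key.
        rng = np.random.default_rng(seed)
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            lo, hi = max(0, i - window), min(n, i + window + 1)
            mask[i, lo:hi] = True                                        # sliding-window (local) attention
            mask[i, rng.choice(n, size=n_random, replace=False)] = True  # random attention
        mask[:n_global, :] = True   # global tokens attend to everything ...
        mask[:, :n_global] = True   # ... and everything attends to them
        return mask

Each query then attends to only 𝑂(window + random + global) keys instead of all 𝑛, so the attention cost grows linearly with sequence length.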
19. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Big Bird
20. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Big Bird in Graph Perspective
• Each attention pattern can be viewed as the adjacency matrix of a graph over tokens; Big Bird's sparse graph (random + window + global edges) keeps short paths between every pair of tokens and thereby approximates the fully connected graph of full attention.
21. Big Bird: Transformers for Longer Sequences [2020 NeurIPS]
Building Block Ablations
24. Rethinking Attention with Performers [2021 ICLR]
Approximating the Softmax Kernel
• Let's decompose the softmax function into an inner product of (random) feature maps — i.e., treat it as a kernel!
• Attention is then approximated through this kernel instead of the explicit 𝑛 × 𝑛 softmax matrix.
• Random feature map 𝜙, built from a deterministic function ℎ, component functions 𝑓, and random vectors 𝜔 drawn i.i.d. (the general form is reproduced below).
• The choice of ℎ and 𝑓 determines which kernel you approximate, and the more 𝜔 you sample, the more accurately you approximate the kernel.
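Since the equations on this slide are figures, here is the general random-feature form from the paper, up to minor notation (transcription mine):

    \phi(x) = \frac{h(x)}{\sqrt{m}} \Big( f_1(\omega_1^\top x), \ldots, f_1(\omega_m^\top x), \ldots, f_l(\omega_1^\top x), \ldots, f_l(\omega_m^\top x) \Big),
    \qquad \omega_1, \ldots, \omega_m \overset{\mathrm{iid}}{\sim} \mathcal{D}

and the kernel is then estimated as 𝐾(x, y) ≈ 𝜙(x)ᵀ𝜙(y).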
25. Rethinking Attention with Performers [2021 ICLR]
Approximating the Softmax Kernel
• Random feature map 𝜙 (same general form as above)
• Example: the choice of ℎ and 𝑓 determines what the 𝜙 function is — e.g., trigonometric (sin/cos) features and positive (exp) features correspond to different choices.
26. Rethinking Attention with Performers [2021 ICLR]
Approximating the Softmax Kernel
• Softmax kernel: SM(x, y) = exp(xᵀy)
• Approximating softmax with trigonometric (sin/cos) random features
• Robust approximation of softmax is needed: the sin/cos features can take negative values, which leads to unstable behavior and high variance of the estimate — a bad approximation precisely where the true softmax values are small.
27. Rethinking Attention with Performers [2021 ICLR]
Positive & Orthogonal Random Features (ORFs)
• Positive features: build 𝜙 from exponentials instead of sin/cos, so every entry of 𝜙(x) is strictly positive (see the identity below)
• ORFs: constraining 𝜔₁, …, 𝜔_𝑚 to be exactly orthogonal (while keeping the correct marginal distributions) further reduces the variance of the estimator
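The positive-feature construction rests on the following identity for the softmax kernel (reproduced here because the slide's equations are figures):

    \mathrm{SM}(x, y) = \exp(x^\top y)
    = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)} \Big[ \exp\!\big(\omega^\top x - \tfrac{\|x\|^2}{2}\big)\, \exp\!\big(\omega^\top y - \tfrac{\|y\|^2}{2}\big) \Big]

so 𝜙⁺(x) = exp(−‖x‖²/2) · m^{−1/2} (exp(𝜔₁ᵀx), …, exp(𝜔_𝑚ᵀx)) is an unbiased estimator with strictly positive entries, avoiding the cancellation that makes the sin/cos estimator unstable.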
28. Rethinking Attention with Performers [2021 ICLR]
Fast Attention Via positive Orthogonal Random features (FAVOR+)
• The softmax attention matrix is approximated through the positive-feature kernel 𝜙⁺(q)ᵀ𝜙⁺(k), so the output can be computed with the linear-attention recipe, never materializing the 𝑛 × 𝑛 matrix (a sketch follows below).
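A minimal NumPy sketch of FAVOR+-style attention built from the pieces above: orthogonal Gaussian directions (via QR, with norms resampled to behave like Gaussian vectors), positive softmax-kernel features, and the same linear-time combination as in linear attention. Function names and details are my simplifications, not the authors' code:

    import numpy as np

    def orthogonal_gaussian(m, d, seed=0):
        # m orthogonal directions in R^d, norms resampled to match Gaussian vectors.
        assert m <= d  # sketch only; the paper stacks orthogonal blocks when m > d
        rng = np.random.default_rng(seed)
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))           # orthogonal matrix
        norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
        return norms[:, None] * q[:m]                               # (m, d)

    def positive_features(x, omega):
        # phi+(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m): strictly positive entries.
        m = omega.shape[0]
        return np.exp(x @ omega.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

    def favor_plus_attention(Q, K, V, m=64, eps=1e-6, seed=0):
        # Approximates softmax(Q K^T) V in O(n m d) time, never forming the (n, n) matrix.
        # (In practice Q and K are first rescaled by d**-0.25 to mimic softmax(Q K^T / sqrt(d)).)
        omega = orthogonal_gaussian(m, Q.shape[-1], seed)
        Qp, Kp = positive_features(Q, omega), positive_features(K, omega)  # (n, m)
        KV = Kp.T @ V                                               # (m, d_v)
        Z = Kp.sum(axis=0)                                          # (m,)
        return (Qp @ KV) / (Qp @ Z + eps)[:, None]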
31. Further Readings
[2020 arXiv] Longformer: The Long-Document Transformer
[2020 arXiv] Synthesizer: Rethinking Self-Attention in Transformer Models
[2020 arXiv] Linformer: Self-Attention with Linear Complexity
…
32. Concluding Remarks
The Transformer is known to have quadratic complexity 𝑂(𝑛²) in the sequence length.
Several studies aim to reduce this quadratic complexity to (near-)linear complexity:
• Reformer: Angular Locality Sensitive Hashing (LSH)
• Linear Attention: Kernel Trick
• Big Bird: Random + Window + Global Attention
• Performer: Approximating Softmax with Kernel (FAVOR+)
• Longformer, Synthesizer, Linformer…
While maintaining performance, they successfully reduce the complexity.
Many studies are still digging into the efficiency issues of Transformers…