The Transformer architecture is responsible for many state-of-the-art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, this performance comes at a cost: O(n²) time and memory complexity, where n is the length of the input sequence. This makes it computationally infeasible to feed long documents to the standard Transformer. To overcome this limitation, a number of approaches have been proposed that modify the self-attention mechanism in interesting ways.
In this presentation, I will describe the Transformer architecture, and specifically its self-attention mechanism, and then survey some of the approaches proposed to address the O(n²) complexity. Some of these approaches have been implemented in the HuggingFace transformers library, and I will demonstrate code for document-level operations using one of them.
1. Transformer Mods for Document-Length Inputs
A survey of techniques to make long input sequences practical to use with Transformers
Sujit Pal, Elsevier Labs
September 2020
2. About Me
• Work at Elsevier Labs
• Ex-search guy, Lucene and Solr mainly
• Got interested in NLP and ML as search started using these techniques.
• Mostly focus on NLP problems nowadays.
5. Transformer and Self-Attention
• Embeddings for the terms in the sequence are input to the encoder and decoder in parallel.
• The input paths mingle in the self-attention layer.
• Processing is parallelized again at the input to the FFN layer.
• Each term vector is split into Q, K, and V vectors using trainable weights WQ, WK, and WV, as sketched below.
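A minimal NumPy sketch of this step (dimensions and random weights are illustrative, not from any trained model): each term vector is projected into query, key, and value vectors, and the outputs are attention-weighted sums of the values.

import numpy as np

n, d_model, d_k = 6, 16, 4                       # sequence length, model dim, head dim
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))            # one embedding per term
WQ, WK, WV = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV                 # per-term query/key/value vectors
scores = Q @ K.T / np.sqrt(d_k)                  # n x n matrix of scaled dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                             # n x d_k contextualized vectors

The n x n scores matrix is where the quadratic cost discussed on the next slide comes from.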
8. Self-Attention is Sparse
• Self-attention is O(n²) whether it appears in a seq2seq model or in a Transformer.
• This precludes use with large n (long input sequences), even though the Transformer no longer has the RNN's sequential-processing bottleneck.
• But… the self-attention matrix is sparse.
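A quick back-of-the-envelope calculation makes the quadratic cost concrete (float32, one attention matrix per head per layer):

def attn_matrix_mb(n, bytes_per_float=4):
    # memory for a single n x n float32 attention matrix
    return n * n * bytes_per_float / 2**20

for n in (512, 4096, 65536):
    print(f"n={n}: {attn_matrix_mb(n):,.0f} MB")
# n=512: 1 MB; n=4096: 64 MB; n=65536: 16,384 MB -- per head, per layer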
9. Sparse Transformers
• Autoregressive (left to right)
• Two-dimensional factorization of the attention matrix:
− Strided – each position attends to its own row and column
− Fixed – each position attends to a fixed column and the elements after the latest column element
• Algorithmic complexity O(n√n)
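A sketch of the strided pattern as an attention mask, assuming the common reading that position i attends to the most recent stride positions (its "row") plus every stride-th earlier position (its "column"); this is illustrative, not the paper's code.

import numpy as np

def strided_mask(n, stride):
    # True where query position i may attend to key position j (autoregressive: j <= i)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            in_row = (i - j) < stride          # the most recent `stride` positions
            in_col = (i - j) % stride == 0     # every stride-th earlier position
            mask[i, j] = in_row or in_col
    return mask

m = strided_mask(16, 4)                        # stride ~ sqrt(n) gives O(n * sqrt(n)) entries
print(m.sum(), "of", 16 * 16, "entries kept")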
10. Transformer-XL
• Autoregressive; segments input into fixed-size blocks
• Segment-level recurrence with state reuse
• Analogous to BPTT (Backpropagation Through Time): caches and applies the sequence of hidden states from previous segments
• Better perplexity scores, with effective context lengths of up to ~900 tokens
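A simplified sketch of segment-level recurrence (causal masking and Transformer-XL's relative position encodings are omitted; segment_attention is a hypothetical helper):

import numpy as np

def segment_attention(h_curr, h_cache, WQ, WK, WV):
    # keys/values also see the cached previous segment; the cache is
    # treated as constant, so no gradient would flow into it
    h_ext = np.concatenate([h_cache, h_curr], axis=0)   # [cache_len + seg_len, d]
    Q = h_curr @ WQ                                     # queries only from current segment
    K, V = h_ext @ WK, h_ext @ WV
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

d = 8
rng = np.random.default_rng(1)
WQ, WK, WV = (rng.standard_normal((d, d)) for _ in range(3))
cache = np.zeros((0, d))
for segment in np.split(rng.standard_normal((12, d)), 3):  # 3 segments of 4 tokens
    out = segment_attention(segment, cache, WQ, WK, WV)
    cache = segment.copy()                               # state reuse in the next step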
11. Reformer
• Uses Locality-Sensitive Hashing (LSH) to convert the sparse attention matrix into a set of dense matrices by:
− Hashing input tokens
− Sorting and chunking
• Reversible residuals
• Algorithmic complexity O(n log n)
• Can handle 64K-token inputs
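A simplified sign-random-projection sketch of the hash/sort/chunk idea; the paper actually uses angular LSH on shared query/key vectors, so this is only illustrative.

import numpy as np

def lsh_buckets(x, n_hashes=4, seed=0):
    # random-projection LSH: nearby vectors tend to get the same bucket id
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_hashes))
    return ((x @ R) > 0).astype(int) @ (2 ** np.arange(n_hashes))

x = np.random.default_rng(2).standard_normal((64, 16))   # 64 token vectors
order = np.argsort(lsh_buckets(x), kind="stable")         # sort tokens by bucket
chunks = np.split(order, 8)                               # fixed-size chunks of 8 tokens
# attention is then computed densely within each chunk (plus its neighbor chunk)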
12. Routing Transformer
• Adds a sparse routing module, based on online k-means, to self-attention
• Groups the K and Q vectors into clusters using k-means
• Each attention step considers only the context in the same cluster
• Provides some notion of non-local or global context
• Algorithmic complexity O(n√n), with the number of clusters k set to √n
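An illustrative sketch of the routing step, with fixed random centroids standing in for the online k-means centroids the model maintains during training:

import numpy as np

def routing_mask(Q, K, centroids):
    # assign queries and keys to their nearest centroid; a query may
    # attend only to keys that landed in the same cluster
    q_cluster = np.argmin(((Q[:, None] - centroids) ** 2).sum(-1), axis=-1)
    k_cluster = np.argmin(((K[:, None] - centroids) ** 2).sum(-1), axis=-1)
    return q_cluster[:, None] == k_cluster[None, :]   # n x n boolean routing mask

rng = np.random.default_rng(3)
Q, K = rng.standard_normal((32, 8)), rng.standard_normal((32, 8))
centroids = rng.standard_normal((4, 8))               # k = 4 clusters
print(routing_mask(Q, K, centroids).mean())           # fraction of attention entries kept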
13. Sinkhorn Transformer
• Meta-sorting network that learns to rearrange and sort the input sequence
• Sequences are blocked and attention is computed within each block
• Memory complexity O(B² + N_B²), where B is the block size and N_B is the number of blocks, N_B << n
• SortCut variant looks only at the top-k nearest neighbors in the block, reducing complexity to O(n·k)
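The learned rearrangement relies on Sinkhorn normalization: repeatedly normalizing the rows and columns of a positive matrix pushes it toward a doubly-stochastic "soft permutation" of the blocks. A tiny sketch of that normalization:

import numpy as np

def sinkhorn(scores, n_iters=8):
    # alternate row and column normalization of a positive matrix
    P = np.exp(scores)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)   # rows sum to 1
        P = P / P.sum(axis=0, keepdims=True)   # columns sum to 1
    return P

block_scores = np.random.default_rng(4).standard_normal((6, 6))  # N_B x N_B affinities
P = sinkhorn(block_scores)   # soft permutation over the 6 blocks; attention is block-local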
14. Linformer
• Observation: self-attention can be approximated by a low-rank matrix
• Equivalent to taking the SVD of the attention matrix and keeping the top-k principal components
• Introduces a new self-attention pipeline that uses learned linear projections for dimensionality reduction of K and V
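A sketch of the projection step, with random matrices standing in for the learned projections (called E and F in the paper) that shrink the length dimension of K and V:

import numpy as np

n, d, k = 1024, 64, 128                    # sequence length, head dim, projected length, k << n
rng = np.random.default_rng(5)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) for _ in range(2))

K_proj, V_proj = E @ K, F @ V              # n x d  ->  k x d
scores = Q @ K_proj.T / np.sqrt(d)         # n x k instead of n x n
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ V_proj                           # n x d output computed in O(n * k)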
15. Attention with Linear Complexity
• Uses the associativity of matrix multiplication to convert (Q Kᵀ) V into Q (Kᵀ V), as shown below
• Algorithmic complexity changes from O(n²·d) to O(n·d²), where d is the size of the K, Q, and V vectors and n is the input length, d << n
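A worked example of the reordering. The softmax is omitted here; Shen et al. instead apply separate normalizations to Q and K, which keeps the reassociation exact.

import numpy as np

n, d = 2048, 64
rng = np.random.default_rng(6)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out1 = (Q @ K.T) @ V    # standard order: materializes an n x n matrix, O(n^2 * d)
out2 = Q @ (K.T @ V)    # reassociated: only a d x d intermediate, O(n * d^2)
print(np.allclose(out1, out2))   # True -- matrix multiplication is associative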
16. Longformer
• Sparsifies the full attention matrix according to an "attention pattern"
• Sliding window of size k (k/2 left, k/2 right) captures local context
• Optional dilated sliding windows at different layers of multi-head attention give a bigger receptive field
• Additional (task-dependent) global attention – [CLS] for classification, question tokens for QA
• Scales linearly with input length n and sliding-window size k, i.e. O(n·k); the effect of global context is minimal, k << n
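A short example of the corresponding HuggingFace transformers API (recent library versions; the checkpoint is the one released by AllenAI, and the input path is hypothetical):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_document_text = open("document.txt").read()   # any long document
inputs = tokenizer(long_document_text, return_tensors="pt",
                   truncation=True, max_length=4096)

# sliding-window attention everywhere; global attention only on [CLS]
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
outputs = model(**inputs, global_attention_mask=global_attention_mask)
doc_embedding = outputs.last_hidden_state[:, 0]    # [CLS] vector as document embedding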
17. Big Bird
• Considers self-attention as a DAG and applies graph sparsification principles
• Attention is composed of:
− A set of global tokens g that attend to all parts of the sequence (ITC – a subset of input tokens; ETC – additional tokens such as [CLS])
− For each query Q, a set of r random keys it will attend to
− A local neighborhood of size w for each input token
• Complexity is O(n·w) because the effects of g and r are negligible, w << n
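An illustrative construction of a BigBird-style attention mask combining the three components (parameter values are arbitrary, and real implementations work in blocks for efficiency):

import numpy as np

def bigbird_mask(n, n_global=2, n_random=3, window=3, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_global, :] = True                 # global tokens attend everywhere...
    mask[:, :n_global] = True                 # ...and everyone attends to them
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        mask[i, lo:hi] = True                                     # local window of size w
        mask[i, rng.choice(n, n_random, replace=False)] = True    # r random keys
    return mask

m = bigbird_mask(64)
print(f"{m.mean():.2%} of the full attention matrix retained")    # O(w + r + g) entries per row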
18. References
• Attention Is All You Need (Vaswani et al., 2017)
• Visualizing a Neural Machine Translation Model (Mechanics of Seq2Seq Models with Attention) (Alammar, 2018)
• The Illustrated Transformer (Alammar, 2018)
• Generating Long Sequences with Sparse Transformers (Child, Gray, Radford, and Sutskever, 2019)
• Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
• Reformer: The Efficient Transformer (Kitaev, Kaiser, and Levskaya, 2020)
• Efficient Content-Based Sparse Attention with Routing Transformers (Roy, Saffar, Vaswani, and Grangier, 2020)
• Sparse Sinkhorn Attention (Tay et al., 2020)
• Linformer: Self-Attention with Linear Complexity (Wang, Li, Khabsa, Fang, and Ma, 2020)
• Efficient Attention: Attention with Linear Complexities (Shen et al., 2020)
• Longformer: The Long-Document Transformer (Beltagy, Peters, and Cohan, 2020)
• Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
• A Survey of Long-Term Context in Transformers (May, 2020)
19. Notebooks
• lf1_longformer_pretrained.ipynb – uses pre-trained and fine-tuned Longformer models for document embedding and question answering, respectively.
• lf2_longformer_sentiment_training.ipynb – trains a Longformer model for sentiment classification.
The huggingface/transformers project provides (PyTorch and TF) implementations for several of the transformers discussed here:
• Transformer-XL
• Reformer
• Longformer