September 2020
Sujit Pal, Elsevier Labs
Transformer Mods for
Document Length Inputs
A survey of techniques to make long input
sequences practical to use with Transformers
About Me
• Work at Elsevier Labs
• Ex-search guy, Lucene and Solr mainly
• Got interested in NLP and ML as search started using these techniques.
• Mostly focus on NLP problems nowadays.
2
Agenda
• Transformers
• Self-Attention and its limitations
• Approaches to address self-attention limitations
• Code walkthrough with Longformer
3
Seq2seq, Attention, and Transformer
4
Attention amplifies signal for specific terms
Transformer and Self-Attention
• Embeddings for the terms in a sequence are fed into the encoder and decoder in parallel.
• The input paths mingle in the self-attention layer.
• They are parallelized again when input to the FFN layer.
• Each term vector is projected into Q, K, and V vectors using trainable weights WQ, WK, and WV (see the sketch below).
5
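Below is a minimal numpy sketch of the scaled dot-product self-attention step described on this slide; the sizes, the random weights, and the single-head setup are illustrative assumptions, not values from any paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 6, 16, 8            # toy sizes: sequence length, model dim, head dim
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))     # one embedded input sequence
WQ = rng.normal(size=(d_model, d_k))  # trainable projections in a real model
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

Q, K, V = X @ WQ, X @ WK, X @ WV      # each term vector projected into query, key, value
scores = Q @ K.T / np.sqrt(d_k)       # n x n score matrix: the part that grows quadratically
attn = softmax(scores, axis=-1)       # each row: how much a term attends to every other term
output = attn @ V                     # weighted sum of value vectors, shape (n, d_k)
```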
Self-Attention in depth
6
Self-Attention is O(n²)
7
Self-Attention is sparse
• Self-attention is O(n²) whether it is used in a seq2seq model or a transformer.
• This precludes its use with large n (long input sequences),
• even though we no longer have the sequential-processing bottleneck of RNNs.
• But… the self-attention matrix is sparse.
8
Sparse Transformers
• Autoregressive (left to right)
• Two-dimensional factorization of the attention matrix (masks sketched below):
− Strided (center) – each position attends to its row and its column
− Fixed (right) – each position attends to a fixed column and the elements after the latest column element
• Algorithmic complexity O(n√n)
9
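The strided and fixed patterns can be written down as boolean attention masks. A minimal numpy sketch, assuming an autoregressive setting and stride l ≈ √n as in the paper; the helper names are mine, not the paper's.

```python
import numpy as np

def strided_mask(n, l):
    """Strided: position i attends to the l most recent positions (its row,
    including itself) and to every l-th earlier position (its column)."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    causal = j <= i
    return causal & (((i - j) < l) | ((i - j) % l == 0))

def fixed_mask(n, l, c=1):
    """Fixed: position i attends within its own length-l block and to the last
    c positions of every block (the fixed summary columns)."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    causal = j <= i
    return causal & (((i // l) == (j // l)) | ((j % l) >= (l - c)))

n = 16
l = int(np.sqrt(n))   # stride of about sqrt(n) keeps roughly O(n * sqrt(n)) attended pairs
print(strided_mask(n, l).sum(), "of", n * n, "entries kept (strided)")
print(fixed_mask(n, l).sum(), "of", n * n, "entries kept (fixed)")
```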
Transformer-XL
• Autoregressive; segments the input into fixed-size blocks
• Segment-level recurrence with state reuse (sketched below)
• Analogous to BPTT (Backpropagation Through Time): caches and reuses the sequence of hidden states from previous segments.
• Better perplexity scores, with an effective context of up to about 900 tokens.
10
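A minimal sketch of segment-level recurrence for one attention layer: hidden states of the previous segment are cached and prepended to the keys and values of the current segment. Causal masking and the relative positional encodings used by Transformer-XL are omitted, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(h_q, h_kv, WQ, WK, WV):
    Q, K, V = h_q @ WQ, h_kv @ WK, h_kv @ WV
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

seg_len, d = 4, 8
rng = np.random.default_rng(0)
WQ, WK, WV = [rng.normal(size=(d, d)) for _ in range(3)]
segments = [rng.normal(size=(seg_len, d)) for _ in range(3)]   # a toy "document" split into fixed-size segments

memory = None
outputs = []
for h in segments:
    # keys and values see the cached previous segment; queries come only from the current one
    h_kv = h if memory is None else np.concatenate([memory, h], axis=0)
    outputs.append(attend(h, h_kv, WQ, WK, WV))
    memory = h.copy()   # cache current hidden states for the next segment (the real model stops gradients here)
```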
Reformer
• Uses Locality-Sensitive Hashing (LSH) to convert the sparse attention matrix into a set of small dense matrices (sketched below):
− hashing the input tokens
− sorting and chunking
• Reversible residual layers
• Algorithmic complexity O(n log n)
• Can handle 64K token inputs
11
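A sketch of the LSH bucketing step (random-rotation hashing, with queries and keys tied as in the paper): positions whose vectors hash to the same bucket are sorted next to each other and chunked, and attention is then computed within (and across adjacent) chunks. Names are illustrative.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: project onto random directions; argmax over [xR, -xR]
    assigns each position a bucket id, so similar vectors tend to collide."""
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    projected = x @ R                                 # (n, n_buckets / 2)
    return np.concatenate([projected, -projected], axis=-1).argmax(axis=-1)

n, d, n_buckets = 64, 16, 8
rng = np.random.default_rng(0)
qk = rng.normal(size=(n, d))                 # Reformer shares one projection for queries and keys

buckets = lsh_buckets(qk, n_buckets, rng)
order = np.argsort(buckets, kind="stable")   # sort positions by bucket id
chunks = order.reshape(-1, n // n_buckets)   # fixed-size chunks of sorted positions; each chunk
                                             # attends within itself and to the previous chunk
```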
Routing Transformer
• Adds a sparse routing module, based on online k-means, to self-attention
• Groups the K and Q matrices into clusters using k-means (sketched below)
• Each attention step considers only the context within the same cluster
• Provides some notion of non-local or global context
• Algorithmic complexity O(n√n * k), where k = number of clusters
12
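A sketch of the routing step: queries and keys are assigned to the nearest of k shared centroids, and a query may only attend to keys routed to the same cluster. Here the centroids are random stand-ins for the ones the model maintains with online k-means, and all names are illustrative.

```python
import numpy as np

n, d, k = 32, 16, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
centroids = rng.normal(size=(k, d))   # the real model updates these with online k-means

def assign(x, centroids):
    # nearest-centroid assignment by squared euclidean distance
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=-1)

q_cluster, k_cluster = assign(Q, centroids), assign(K, centroids)

# routing mask: query i is scored against key j only if both were routed to the same cluster
mask = q_cluster[:, None] == k_cluster[None, :]
print(mask.sum(), "of", n * n, "query-key pairs are scored")
```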
Sinkhorn Transformer
• A meta-sorting network learns to rearrange and sort the input sequence
• The sequence is divided into blocks and attention is computed within each block (sketched below).
• Memory complexity O(B² + NB²), where B is the block size and NB is the number of blocks, NB << n.
• The SortCut variant only looks at the top k nearest neighbors in the block, reducing complexity to O(n * k).
13
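A crude sketch of the block structure, in which a random permutation of key blocks stands in for the permutation the real model learns with a Sinkhorn-normalized sorting network: each query block attends only to the key block sorted to it, so each score matrix is B x B and the sorting matrix itself is NB x NB.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, B = 32, 8, 4                # sequence length, head dim, block size
NB = n // B                       # number of blocks
rng = np.random.default_rng(0)
Q = rng.normal(size=(NB, B, d))   # queries, keys, values grouped into blocks
K = rng.normal(size=(NB, B, d))
V = rng.normal(size=(NB, B, d))

perm = rng.permutation(NB)        # stand-in for the learned block sorting
K_sorted, V_sorted = K[perm], V[perm]

out = np.empty_like(Q)
for b in range(NB):               # per block: only a B x B score matrix is ever built
    scores = Q[b] @ K_sorted[b].T / np.sqrt(d)
    out[b] = softmax(scores) @ V_sorted[b]
```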
Linformer
• Observation: the self-attention matrix can be approximated by a low-rank matrix
• i.e., one can take its SVD and use only the top k principal components
• Linformer instead introduces a new self-attention pipeline that uses linear projections to do the dimensionality reduction (sketched below).
14
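A sketch of the Linformer bottleneck: fixed-size linear projections E and F map the length-n keys and values down to k rows before attention, so the score matrix is n x k rather than n x n. Shapes and names are illustrative; in the model E and F are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, k = 1024, 64, 128        # sequence length, head dim, projected length (k << n)
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
E = rng.normal(size=(k, n))    # projection for keys (learned in the real model)
F = rng.normal(size=(k, n))    # projection for values

K_proj, V_proj = E @ K, F @ V          # (k, d) each: the low-rank bottleneck
scores = Q @ K_proj.T / np.sqrt(d)     # (n, k) instead of (n, n)
out = softmax(scores) @ V_proj         # (n, d), linear in n
```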
Attention with Linear Complexity
• Uses the associativity of matrix multiplication to convert (Q Kᵀ) V into the equivalent Q (Kᵀ V) (sketched below).
• Algorithmic complexity changes from O(n² * d) to O(n * d²), where d is the size of the K, Q, and V vectors and n is the input length, d << n.
15
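The regrouping in isolation, using unnormalized attention so the identity is exact (the paper applies separate softmax normalizations to Q and K so that the same regrouping still goes through): (Q Kᵀ) V and Q (Kᵀ V) give the same result, but the second form never materializes the n x n matrix.

```python
import numpy as np

n, d = 2048, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

quadratic = (Q @ K.T) @ V      # builds an n x n intermediate: ~O(n^2 * d) work
linear = Q @ (K.T @ V)         # builds only a d x d intermediate: ~O(n * d^2) work
print(np.allclose(quadratic, linear))   # True: matrix multiplication is associative
```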
Longformer
• Sparsifies the full attention matrix according to an "attention pattern" (sketched below)
• Sliding window of size k (k/2 left, k/2 right) to capture local context
• Optional dilated sliding windows at different layers of multi-head attention to get a bigger receptive field
• Additional (task-dependent) global attention – [CLS] for classification, question tokens for QA
• Scales linearly with input length n and sliding window length k, i.e. O(n * k); the effect of the global context is minimal, k << n
16
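A sketch of the Longformer attention pattern as a boolean mask: a width-k sliding window around each position, plus a few task-dependent positions (here just [CLS]) given symmetric global attention. The mask builder is my own illustration; the actual implementation avoids materializing the full n x n matrix.

```python
import numpy as np

def longformer_mask(n, window, global_positions):
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2     # sliding window: k/2 to the left, k/2 to the right
    for g in global_positions:              # global tokens attend everywhere and are attended by everyone
        mask[g, :] = True
        mask[:, g] = True
    return mask

n, k = 4096, 512
mask = longformer_mask(n, k, global_positions=[0])   # position 0 = [CLS] for a classification task
print(mask.sum(), "of", n * n, "entries kept")       # roughly n * k plus 2n for the global token
```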
Big Bird
• Considers self-attention as a directed graph and applies graph sparsification principles
• The sparse pattern is composed of (mask sketched below):
− a set of global tokens g that attend to all parts of the sequence (ITC – a subset of the input tokens, ETC – additional tokens such as [CLS])
− for each query Q, a set of r random keys it will attend to
− a local neighborhood of size w for each input token
• Complexity is O(n * w) because the effect of g and r is negligible, w << n
17
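The Big Bird pattern combines the three components listed above. A sketch of the corresponding boolean mask, with illustrative names; the released implementation works block-wise rather than with an explicit dense mask.

```python
import numpy as np

def bigbird_mask(n, w, r, n_global, rng):
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    mask = np.abs(i - j) <= w // 2           # local neighborhood of size w
    mask[:n_global, :] = True                # global tokens attend to the whole sequence...
    mask[:, :n_global] = True                # ...and every token attends to them (ITC style: reuse the first tokens)
    for q in range(n):                       # each query additionally attends to r random keys
        mask[q, rng.choice(n, size=r, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
mask = bigbird_mask(n=1024, w=64, r=3, n_global=2, rng=rng)
print(mask.sum(), "of", 1024 * 1024, "entries kept")   # about n * w, since g and r only add O(n)
```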
References
• Attention Is All You Need (Vaswani et al., 2017)
• Visualizing a Neural Machine Translation Model (Mechanics of Seq2seq Models with Attention) (Alammar, 2018)
• The Illustrated Transformer (Alammar, 2018)
• Generating Long Sequences with Sparse Transformers (Child, Gray, Radford, and Sutskever, 2019)
• Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
• Reformer: The Efficient Transformer (Kitaev, Kaiser, and Levskaya, 2020)
• Efficient Content-Based Sparse Attention with Routing Transformers (Roy, Saffar, Vaswani, and Grangier, 2020)
• Sparse Sinkhorn Attention (Tay et al., 2020)
• Linformer: Self-Attention with Linear Complexity (Wang, Li, Khabsa, Fang, and Ma, 2020)
• Efficient Attention: Attention with Linear Complexities (Shen et al., 2020)
• Longformer: The Long-Document Transformer (Beltagy, Peters, and Cohan, 2020)
• Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
• A Survey of Long-Term Context in Transformers (May, 2020)
18
Notebooks
• lf1_longformer_pretrained.ipynb – using pre-trained and fine-tuned Longformer models for document embedding and question answering, respectively.
• lf2_longformer_sentiment_training.ipynb – training a Longformer model for sentiment classification.
19
The huggingface/transformers project provides (PyTorch and TF) implementations for the transformers discussed here; a minimal Longformer usage sketch follows the list.
• Transformer-XL
• Reformer
• Longformer
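For reference, a minimal version of what the first notebook presumably does: load a pretrained Longformer from huggingface/transformers and pool its output into a document embedding. The checkpoint name (allenai/longformer-base-4096) and the [CLS]-pooling choice are my assumptions, not necessarily what the notebook uses.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

model_name = "allenai/longformer-base-4096"   # assumed checkpoint
tokenizer = LongformerTokenizer.from_pretrained(model_name)
model = LongformerModel.from_pretrained(model_name)

document = "A long document goes here ..."
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=4096)

# global attention on [CLS] (position 0); all other tokens use the sliding window
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    last_hidden_state = model(**inputs, global_attention_mask=global_attention_mask)[0]

doc_embedding = last_hidden_state[:, 0]   # [CLS] vector as a simple document embedding
```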
Thank you
