September 2020
Sujit Pal, Elsevier Labs
Transformer Mods for
Document Length Inputs
A survey of techniques to make long input
sequences practical to use with Transformers
About Me
• Work at Elsevier Labs
• Ex-search guy, Lucene and Solr mainly
• Got interested in NLP and ML as search started using these techniques.
• Mostly focus on NLP problems nowadays.
2
Agenda
• Transformers
• Self-Attention and its limitations
• Approaches to address self-attention limitations
• Code walkthrough with Longformer
3
Seq2seq, Attention, and Transformer
4
Attention amplifies signal for specific terms
Transformer and Self-Attention
• Embeddings for the terms in a sequence are fed into the encoder and decoder in parallel.
• The input paths mingle in the self-attention layer.
• They are parallelized again when input to the FFN layer.
• Each term vector is projected into Q, K, and V vectors using trainable weights WQ, WK, and WV (see the sketch below).
5
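Below is a minimal numpy sketch of the scaled dot-product self-attention step described on this slide; the sizes, the random weights, and the single-head setup are illustrative assumptions, not values from any paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 6, 16, 8            # toy sizes: sequence length, model dim, head dim
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))     # one embedded input sequence
WQ = rng.normal(size=(d_model, d_k))  # trainable projections in a real model
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

Q, K, V = X @ WQ, X @ WK, X @ WV      # each term vector projected into query, key, value
scores = Q @ K.T / np.sqrt(d_k)       # n x n score matrix: the part that grows quadratically
attn = softmax(scores, axis=-1)       # each row: how much a term attends to every other term
output = attn @ V                     # weighted sum of value vectors, shape (n, d_k)
```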
Self-Attention in depth
6
Self-Attention is O(n²)
7
Self-Attention is sparse
• Self-attention is O(n²) whether it is used in a seq2seq model or a transformer.
• This precludes its use with large n (long input sequences),
• even though we no longer have the sequential-processing bottleneck of RNNs.
• But… the self-attention matrix is sparse.
8
Sparse Transformers
• Autoregressive (left to right)
• Two-dimensional factorization of the attention matrix (masks sketched below):
− Strided (center) – each position attends to its row and its column
− Fixed (right) – each position attends to a fixed column and the elements after the latest column element
• Algorithmic complexity O(n√n)
9
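The strided and fixed patterns can be written down as boolean attention masks. A minimal numpy sketch, assuming an autoregressive setting and stride l ≈ √n as in the paper; the helper names are mine, not the paper's.

```python
import numpy as np

def strided_mask(n, l):
    """Strided: position i attends to the l most recent positions (its row,
    including itself) and to every l-th earlier position (its column)."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    causal = j <= i
    return causal & (((i - j) < l) | ((i - j) % l == 0))

def fixed_mask(n, l, c=1):
    """Fixed: position i attends within its own length-l block and to the last
    c positions of every block (the fixed summary columns)."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    causal = j <= i
    return causal & (((i // l) == (j // l)) | ((j % l) >= (l - c)))

n = 16
l = int(np.sqrt(n))   # stride of about sqrt(n) keeps roughly O(n * sqrt(n)) attended pairs
print(strided_mask(n, l).sum(), "of", n * n, "entries kept (strided)")
print(fixed_mask(n, l).sum(), "of", n * n, "entries kept (fixed)")
```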
Transformer-XL
• Autoregressive; segments the input into fixed-size blocks
• Segment-level recurrence with state reuse (sketched below)
• Analogous to BPTT (Backpropagation Through Time): caches and reuses the sequence of hidden states from previous segments.
• Better perplexity scores, with an effective context of up to about 900 tokens.
10
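A minimal sketch of segment-level recurrence for one attention layer: hidden states of the previous segment are cached and prepended to the keys and values of the current segment. Causal masking and the relative positional encodings used by Transformer-XL are omitted, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(h_q, h_kv, WQ, WK, WV):
    Q, K, V = h_q @ WQ, h_kv @ WK, h_kv @ WV
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

seg_len, d = 4, 8
rng = np.random.default_rng(0)
WQ, WK, WV = [rng.normal(size=(d, d)) for _ in range(3)]
segments = [rng.normal(size=(seg_len, d)) for _ in range(3)]   # a toy "document" split into fixed-size segments

memory = None
outputs = []
for h in segments:
    # keys and values see the cached previous segment; queries come only from the current one
    h_kv = h if memory is None else np.concatenate([memory, h], axis=0)
    outputs.append(attend(h, h_kv, WQ, WK, WV))
    memory = h.copy()   # cache current hidden states for the next segment (the real model stops gradients here)
```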
Reformer
• Uses Locality-Sensitive Hashing (LSH) to convert the sparse attention matrix into a set of small dense matrices (sketched below):
− hashing the input tokens
− sorting and chunking
• Reversible residual layers
• Algorithmic complexity O(n log n)
• Can handle 64K token inputs
11
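A sketch of the LSH bucketing step (random-rotation hashing, with queries and keys tied as in the paper): positions whose vectors hash to the same bucket are sorted next to each other and chunked, and attention is then computed within (and across adjacent) chunks. Names are illustrative.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH: project onto random directions; argmax over [xR, -xR]
    assigns each position a bucket id, so similar vectors tend to collide."""
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    projected = x @ R                                 # (n, n_buckets / 2)
    return np.concatenate([projected, -projected], axis=-1).argmax(axis=-1)

n, d, n_buckets = 64, 16, 8
rng = np.random.default_rng(0)
qk = rng.normal(size=(n, d))                 # Reformer shares one projection for queries and keys

buckets = lsh_buckets(qk, n_buckets, rng)
order = np.argsort(buckets, kind="stable")   # sort positions by bucket id
chunks = order.reshape(-1, n // n_buckets)   # fixed-size chunks of sorted positions; each chunk
                                             # attends within itself and to the previous chunk
```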
Routing Transformer
• Adds a sparse routing module, based on online k-means, to self-attention
• Groups the K and Q matrices into clusters using k-means (sketched below)
• Each attention step considers only the context within the same cluster
• Provides some notion of non-local or global context
• Algorithmic complexity O(n√n * k), where k = number of clusters
12
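A sketch of the routing step: queries and keys are assigned to the nearest of k shared centroids, and a query may only attend to keys routed to the same cluster. Here the centroids are random stand-ins for the ones the model maintains with online k-means, and all names are illustrative.

```python
import numpy as np

n, d, k = 32, 16, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
centroids = rng.normal(size=(k, d))   # the real model updates these with online k-means

def assign(x, centroids):
    # nearest-centroid assignment by squared euclidean distance
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=-1)

q_cluster, k_cluster = assign(Q, centroids), assign(K, centroids)

# routing mask: query i is scored against key j only if both were routed to the same cluster
mask = q_cluster[:, None] == k_cluster[None, :]
print(mask.sum(), "of", n * n, "query-key pairs are scored")
```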
Sinkhorn Transformer
• A meta-sorting network learns to rearrange and sort the input sequence
• The sequence is divided into blocks and attention is computed within each block (sketched below).
• Memory complexity O(B² + NB²), where B is the block size and NB is the number of blocks, NB << n.
• The SortCut variant only looks at the top k nearest neighbors in the block, reducing complexity to O(n * k).
13
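A crude sketch of the block structure, in which a random permutation of key blocks stands in for the permutation the real model learns with a Sinkhorn-normalized sorting network: each query block attends only to the key block sorted to it, so each score matrix is B x B and the sorting matrix itself is NB x NB.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, B = 32, 8, 4                # sequence length, head dim, block size
NB = n // B                       # number of blocks
rng = np.random.default_rng(0)
Q = rng.normal(size=(NB, B, d))   # queries, keys, values grouped into blocks
K = rng.normal(size=(NB, B, d))
V = rng.normal(size=(NB, B, d))

perm = rng.permutation(NB)        # stand-in for the learned block sorting
K_sorted, V_sorted = K[perm], V[perm]

out = np.empty_like(Q)
for b in range(NB):               # per block: only a B x B score matrix is ever built
    scores = Q[b] @ K_sorted[b].T / np.sqrt(d)
    out[b] = softmax(scores) @ V_sorted[b]
```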
Linformer
• Observation: the self-attention matrix can be approximated by a low-rank matrix
• i.e., one can take its SVD and use only the top k principal components
• Linformer instead introduces a new self-attention pipeline that uses linear projections to do the dimensionality reduction (sketched below).
14
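A sketch of the Linformer bottleneck: fixed-size linear projections E and F map the length-n keys and values down to k rows before attention, so the score matrix is n x k rather than n x n. Shapes and names are illustrative; in the model E and F are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, k = 1024, 64, 128        # sequence length, head dim, projected length (k << n)
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
E = rng.normal(size=(k, n))    # projection for keys (learned in the real model)
F = rng.normal(size=(k, n))    # projection for values

K_proj, V_proj = E @ K, F @ V          # (k, d) each: the low-rank bottleneck
scores = Q @ K_proj.T / np.sqrt(d)     # (n, k) instead of (n, n)
out = softmax(scores) @ V_proj         # (n, d), linear in n
```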
Attention with Linear Complexity
• Uses the associativity of matrix multiplication to convert (Q Kᵀ) V into the equivalent Q (Kᵀ V) (sketched below).
• Algorithmic complexity changes from O(n² * d) to O(n * d²), where d is the size of the K, Q, and V vectors and n is the input length, d << n.
15
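The regrouping in isolation, using unnormalized attention so the identity is exact (the paper applies separate softmax normalizations to Q and K so that the same regrouping still goes through): (Q Kᵀ) V and Q (Kᵀ V) give the same result, but the second form never materializes the n x n matrix.

```python
import numpy as np

n, d = 2048, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

quadratic = (Q @ K.T) @ V      # builds an n x n intermediate: ~O(n^2 * d) work
linear = Q @ (K.T @ V)         # builds only a d x d intermediate: ~O(n * d^2) work
print(np.allclose(quadratic, linear))   # True: matrix multiplication is associative
```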
Longformer
• Sparsifies the full attention matrix according to an "attention pattern" (sketched below)
• Sliding window of size k (k/2 left, k/2 right) to capture local context
• Optional dilated sliding windows at different layers of multi-head attention to get a bigger receptive field
• Additional (task-dependent) global attention – [CLS] for classification, question tokens for QA
• Scales linearly with input length n and sliding window length k, i.e. O(n * k); the effect of the global context is minimal, k << n
16
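A sketch of the Longformer attention pattern as a boolean mask: a width-k sliding window around each position, plus a few task-dependent positions (here just [CLS]) given symmetric global attention. The mask builder is my own illustration; the actual implementation avoids materializing the full n x n matrix.

```python
import numpy as np

def longformer_mask(n, window, global_positions):
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2     # sliding window: k/2 to the left, k/2 to the right
    for g in global_positions:              # global tokens attend everywhere and are attended by everyone
        mask[g, :] = True
        mask[:, g] = True
    return mask

n, k = 4096, 512
mask = longformer_mask(n, k, global_positions=[0])   # position 0 = [CLS] for a classification task
print(mask.sum(), "of", n * n, "entries kept")       # roughly n * k plus 2n for the global token
```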
Big Bird
• Considers self-attention as a directed graph and applies graph sparsification principles
• The sparse pattern is composed of (mask sketched below):
− a set of global tokens g that attend to all parts of the sequence (ITC – a subset of the input tokens, ETC – additional tokens such as [CLS])
− for each query Q, a set of r random keys it will attend to
− a local neighborhood of size w for each input token
• Complexity is O(n * w) because the effect of g and r is negligible, w << n
17
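The Big Bird pattern combines the three components listed above. A sketch of the corresponding boolean mask, with illustrative names; the released implementation works block-wise rather than with an explicit dense mask.

```python
import numpy as np

def bigbird_mask(n, w, r, n_global, rng):
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    mask = np.abs(i - j) <= w // 2           # local neighborhood of size w
    mask[:n_global, :] = True                # global tokens attend to the whole sequence...
    mask[:, :n_global] = True                # ...and every token attends to them (ITC style: reuse the first tokens)
    for q in range(n):                       # each query additionally attends to r random keys
        mask[q, rng.choice(n, size=r, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
mask = bigbird_mask(n=1024, w=64, r=3, n_global=2, rng=rng)
print(mask.sum(), "of", 1024 * 1024, "entries kept")   # about n * w, since g and r only add O(n)
```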
References
• Attention Is All You Need (Vaswani et al., 2017)
• Visualizing a Neural Machine Translation Model (Mechanics of Seq2seq Models with Attention) (Alammar, 2018)
• The Illustrated Transformer (Alammar, 2018)
• Generating Long Sequences with Sparse Transformers (Child, Gray, Radford, and Sutskever, 2019)
• Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
• Reformer: The Efficient Transformer (Kitaev, Kaiser, and Levskaya, 2020)
• Efficient Content-Based Sparse Attention with Routing Transformers (Roy, Saffar, Vaswani, and Grangier, 2020)
• Sparse Sinkhorn Attention (Tay et al., 2020)
• Linformer: Self-Attention with Linear Complexity (Wang, Li, Khabsa, Fang, and Ma, 2020)
• Efficient Attention: Attention with Linear Complexities (Shen et al., 2020)
• Longformer: The Long-Document Transformer (Beltagy, Peters, and Cohan, 2020)
• Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)
• A Survey of Long-Term Context in Transformers (May, 2020)
18
Notebooks
• lf1_longformer_pretrained.ipynb – using pre-trained and fine-tuned Longformer models for document embedding and question answering, respectively.
• lf2_longformer_sentiment_training.ipynb – training a Longformer model for sentiment classification.
19
The huggingface/transformers project provides (PyTorch and TF) implementations for the transformers discussed here; a minimal Longformer usage sketch follows the list.
• Transformer-XL
• Reformer
• Longformer
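For reference, a minimal version of what the first notebook presumably does: load a pretrained Longformer from huggingface/transformers and pool its output into a document embedding. The checkpoint name (allenai/longformer-base-4096) and the [CLS]-pooling choice are my assumptions, not necessarily what the notebook uses.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

model_name = "allenai/longformer-base-4096"   # assumed checkpoint
tokenizer = LongformerTokenizer.from_pretrained(model_name)
model = LongformerModel.from_pretrained(model_name)

document = "A long document goes here ..."
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=4096)

# global attention on [CLS] (position 0); all other tokens use the sliding window
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    last_hidden_state = model(**inputs, global_attention_mask=global_attention_mask)[0]

doc_embedding = last_hidden_state[:, 0]   # [CLS] vector as a simple document embedding
```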
Thank you
