Longformer
Allen AI
Outline
• Intro

• Longformer’s attention

• Intuition

• Structure

• Questions & Concerns

• Experiments

• Results & Ablation

• For Pretraining 

• Discussion
Intro
• Limitation on document length (Transformer-based models)

• Bottleneck: Self-Attention layers are expensive

• Time and memory scale quadratically with sequence length

• Contributions:

• Cheaper for long-document tasks (e.g. QA)

• Contextual representations of the entire document
• Existing approaches (a minimal chunking/truncation sketch follows this list):

• Chunking: divide the long document into pieces

• Truncation: information loss, e.g. BERT's 512-token limit

• Two-stage models: e.g. pool >> candidates >> answer (e.g. multi-hop QA)
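As a quick illustration of those workarounds, here is a minimal sketch (not code from the paper; chunk_tokens, max_len, and stride are hypothetical names) of chunking and truncation for a 512-token model:

    # Minimal sketch: the chunking/truncation workaround that 512-token
    # models such as BERT force on long documents.
    def chunk_tokens(token_ids, max_len=512, stride=128):
        """Split a long token sequence into overlapping 512-token chunks."""
        chunks = []
        step = max_len - stride
        for start in range(0, max(1, len(token_ids) - stride), step):
            chunks.append(token_ids[start:start + max_len])
        return chunks

    # Truncation simply drops everything past position 512 (information loss).
    truncated = list(range(10_000))[:512]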
Longformer’s Attention
• Capture distant context (even the full sequence) efficiently

• Avoid full attention: the pattern is sparse, but attention still has a cost (see the quick arithmetic below)

• Types: Windowed attention / Dilated attention

• Evaluate its ability (vs. RoBERTa)

• Continue training from RoBERTa's checkpoint, then apply it to downstream tasks
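To see why the sparse pattern helps, some back-of-the-envelope arithmetic (illustrative numbers only; n and w are example values, not the paper's settings):

    # Full self-attention scales as O(n^2); sliding-window attention as O(n * w).
    n, w = 4096, 512                       # sequence length, window size (examples)
    full_attention_scores = n * n          # 16,777,216 score entries per head
    sliding_window_scores = n * w          # 2,097,152 score entries per head
    print(full_attention_scores / sliding_window_scores)  # 8x fewer, and linear in n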
• Sliding Window: 

• Fixed window size W (say W=2: attend to the previous and next token only)

• Stacked across layers, with a different W per layer (small W: local info)

• Dilated Window:

• Add a dilation size D to the sliding window (gaps between attended positions)

• Larger receptive field (can attend farther away thanks to D, so longer inputs)

• Global + Sliding Window:

• Customized attention: globally attending tokens are pre-selected and task dependent (e.g. [CLS] for classification, question tokens for QA)

• Two sets of Q/K/V projections (one for global attention, one for the sliding window); a mask sketch follows below
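A minimal sketch of the combined pattern (a dense boolean mask for clarity only; the paper's efficient implementation uses a custom banded kernel, and all names and dimensions here are illustrative):

    import torch

    def longformer_style_mask(seq_len, window, global_idx):
        """Naive mask mimicking the sparse pattern: each token attends to a local
        window on each side, and pre-selected global tokens attend to (and are
        attended by) every position."""
        i = torch.arange(seq_len)
        mask = (i[None, :] - i[:, None]).abs() <= window   # sliding-window band
        mask[global_idx, :] = True                          # global rows
        mask[:, global_idx] = True                          # global columns
        return mask

    # Two separate Q/K/V projections, as noted above: one set for the sliding
    # window, one for the global tokens (hypothetical dimensions).
    d_model = 64
    qkv_local = torch.nn.Linear(d_model, 3 * d_model)
    qkv_global = torch.nn.Linear(d_model, 3 * d_model)

    mask = longformer_style_mask(seq_len=16, window=2, global_idx=[0])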
Questions & Concerns
• Q1: Loss of contextual information (a sort of distortion)?

• Increase W from bottom to top layers:

small windows for local info in lower layers, larger windows for broader context higher up (see the configuration sketch after this list)

• Use dilated sliding windows only at higher layers

• Q2: Layers must be stacked to expand the receptive field.

• Combine (share) each layer's attention?

• Local attention first / stacked layers / increasing W & D / more elastic
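A hypothetical per-layer configuration reflecting the points above: small windows at the bottom for local information, larger windows (and dilation) only at higher layers. The exact values are illustrative, not the paper's:

    # Illustrative per-layer schedule (values are assumptions, not the paper's).
    n_layers = 12
    window_sizes = [32, 32, 64, 64, 128, 128, 256, 256, 512, 512, 512, 512]
    dilations = [1] * 8 + [2] * 4      # dilation only in the top layers
    assert len(window_sizes) == len(dilations) == n_layers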
Experiments
• Task: character-level autoregressive LM (suited to longer sequences) on
text8 & enwik8

• Constraint: train on the longest sequence that fits in limited memory

• Training:

• Staged training: 5 phases (see the schedule sketch below)

• Sequence length grows from 2048 to 23040

• Evaluation:

• Max length of 32256 (possible because no optimizer state or gradients are kept at inference)
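A hypothetical schedule in the spirit of those 5 phases: start at 2048 and grow to 23040, decreasing the learning rate as the sequence gets longer. The intermediate lengths and all learning rates below are assumptions, not the paper's values:

    # Illustrative staged-training schedule (only the 5 phases, the 2048 start,
    # and the 23040 end come from the slides; the rest is assumed).
    phases = [
        {"phase": 1, "seq_len": 2048,  "lr": 3e-5},
        {"phase": 2, "seq_len": 4096,  "lr": 1.5e-5},
        {"phase": 3, "seq_len": 8192,  "lr": 7.5e-6},
        {"phase": 4, "seq_len": 16384, "lr": 3.75e-6},
        {"phase": 5, "seq_len": 23040, "lr": 1.875e-6},
    ]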
Results & Ablation
• BPC: bits per character; lower is better (see the small helper below)

• Same number of parameters, better performance

• Window size should increase from lower to higher layers

• Adding some dilation may improve results
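For reference, BPC is just the per-character cross-entropy expressed in bits. A small illustrative helper (not from the paper):

    import math

    def bits_per_character(total_nll_nats, n_chars):
        """Average negative log-likelihood per character, converted to bits."""
        return total_nll_nats / (n_chars * math.log(2))

    # e.g. a mean cross-entropy of ~0.693 nats/char is about 1.0 BPC
    print(bits_per_character(total_nll_nats=0.693 * 1000, n_chars=1000))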
Pretraining
• Pretrain Longformer, then fine-tune on 6 tasks

• Continue from the RoBERTa checkpoint

• Longformer attention can be loaded into any Transformer model

• Pretrained with MLM (masked language modeling)

• Same tokenizer as RoBERTa (byte-pair encoding)

• W=512, D=0 (no dilation) for all layers

• Max length: 4096 for Longformer vs. 512 for RoBERTa

• How to reuse the position embeddings? Copy RoBERTa's 512 learned position embeddings repeatedly to cover 4096 positions (see the sketch after this list)

• Finetuning on QA tasks

• Better than RoBERTa, but not yet state of the art (next slide)
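A minimal sketch of that position-embedding initialization (copy RoBERTa's 512 embeddings repeatedly to cover 4096 positions); the tensor names and the random stand-in are illustrative, and real checkpoints carry padding-offset details omitted here:

    import torch

    d_model = 768
    roberta_pos = torch.randn(512, d_model)               # stand-in for RoBERTa's table
    longformer_pos = roberta_pos.repeat(4096 // 512, 1)   # copy 8 times -> (4096, d_model)
    assert longformer_pos.shape == (4096, d_model)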
Results & Ablation
• All better-scoring models use

• multi-stage pipelines, GNNs, etc.

• Ablation confirms the improvement is not simply due to additional pretraining

Discussion
• For long-document tasks:

• QA (multi-hop, open-domain QA)

• Long-document generation, e.g. summarization

• Other tasks…

• For explainability:

• Task-specific

• Contextual embeddings

• Multi-head
