Longformer
Allen AI
Outline
• Intro

• Longformer’s attention

• Intuition

• Structure

• Questions & Concerns

• Experiments

• Results & Ablation

• For Pretraining 

• Discussion
Intro
• Limitation on document length (Transformer-based models)

• Bottleneck: Self-Attention layers are expensive

• Time and memory scale quadratically with sequence length

• Contributions:

• Cheaper for long-document tasks (e.g. QA)

• Contextual representations of the entire document
• Existing approaches (a minimal chunking/truncation sketch follows this list):

• Chunking: divide the long document into pieces

• Truncation: information loss, e.g. BERT's 512-token limit

• Two-stage models: e.g. pool >> candidates >> answer (e.g. multi-hop QA)
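As a quick illustration of those workarounds, here is a minimal sketch (not code from the paper; chunk_tokens, max_len, and stride are hypothetical names) of chunking and truncation for a 512-token model:

    # Minimal sketch: the chunking/truncation workaround that 512-token
    # models such as BERT force on long documents.
    def chunk_tokens(token_ids, max_len=512, stride=128):
        """Split a long token sequence into overlapping 512-token chunks."""
        chunks = []
        step = max_len - stride
        for start in range(0, max(1, len(token_ids) - stride), step):
            chunks.append(token_ids[start:start + max_len])
        return chunks

    # Truncation simply drops everything past position 512 (information loss).
    truncated = list(range(10_000))[:512]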
Longformer’s Attention
• Capture distant context (even the full sequence) efficiently

• Avoid full attention: the pattern is sparse, but attention still has a cost (see the quick arithmetic below)

• Types: Windowed attention / Dilated attention

• Evaluate its ability (vs. RoBERTa)

• Continue training from RoBERTa's checkpoint, then apply it to downstream tasks
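To see why the sparse pattern helps, some back-of-the-envelope arithmetic (illustrative numbers only; n and w are example values, not the paper's settings):

    # Full self-attention scales as O(n^2); sliding-window attention as O(n * w).
    n, w = 4096, 512                       # sequence length, window size (examples)
    full_attention_scores = n * n          # 16,777,216 score entries per head
    sliding_window_scores = n * w          # 2,097,152 score entries per head
    print(full_attention_scores / sliding_window_scores)  # 8x fewer, and linear in n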
• Sliding Window: 

• Fixed window size W (say W=2: attend to the previous and next token only)

• Stacked across layers, with a different W per layer (small W: local info)

• Dilated Window:

• Add a dilation size D to the sliding window (gaps between attended positions)

• Larger receptive field (can attend farther away thanks to D, so longer inputs)

• Global + Sliding Window:

• Customized attention: globally attending tokens are pre-selected and task dependent (e.g. [CLS] for classification, question tokens for QA)

• Two sets of Q/K/V projections (one for global attention, one for the sliding window); a mask sketch follows below
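A minimal sketch of the combined pattern (a dense boolean mask for clarity only; the paper's efficient implementation uses a custom banded kernel, and all names and dimensions here are illustrative):

    import torch

    def longformer_style_mask(seq_len, window, global_idx):
        """Naive mask mimicking the sparse pattern: each token attends to a local
        window on each side, and pre-selected global tokens attend to (and are
        attended by) every position."""
        i = torch.arange(seq_len)
        mask = (i[None, :] - i[:, None]).abs() <= window   # sliding-window band
        mask[global_idx, :] = True                          # global rows
        mask[:, global_idx] = True                          # global columns
        return mask

    # Two separate Q/K/V projections, as noted above: one set for the sliding
    # window, one for the global tokens (hypothetical dimensions).
    d_model = 64
    qkv_local = torch.nn.Linear(d_model, 3 * d_model)
    qkv_global = torch.nn.Linear(d_model, 3 * d_model)

    mask = longformer_style_mask(seq_len=16, window=2, global_idx=[0])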
Questions & Concerns
• Q1: Loss of contextual information (a sort of distortion)?

• Increase W from bottom to top layers:

small windows for local info in lower layers, larger windows for broader context higher up (see the configuration sketch after this list)

• Use dilated sliding windows only at higher layers

• Q2: Layers must be stacked to expand the receptive field.

• Combine (share) each layer's attention?

• Local attention first / stacked layers / increasing W & D / more elastic
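A hypothetical per-layer configuration reflecting the points above: small windows at the bottom for local information, larger windows (and dilation) only at higher layers. The exact values are illustrative, not the paper's:

    # Illustrative per-layer schedule (values are assumptions, not the paper's).
    n_layers = 12
    window_sizes = [32, 32, 64, 64, 128, 128, 256, 256, 512, 512, 512, 512]
    dilations = [1] * 8 + [2] * 4      # dilation only in the top layers
    assert len(window_sizes) == len(dilations) == n_layers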
Experiments
• Task: character-level autoregressive LM (suited to longer sequences) on
text8 & enwik8

• Constraint: train on the longest sequence that fits in limited memory

• Training:

• Staged training: 5 phases (see the schedule sketch below)

• Sequence length grows from 2048 to 23040

• Evaluation:

• Max length of 32256 (possible because no optimizer state or gradients are kept at inference)
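A hypothetical schedule in the spirit of those 5 phases: start at 2048 and grow to 23040, decreasing the learning rate as the sequence gets longer. The intermediate lengths and all learning rates below are assumptions, not the paper's values:

    # Illustrative staged-training schedule (only the 5 phases, the 2048 start,
    # and the 23040 end come from the slides; the rest is assumed).
    phases = [
        {"phase": 1, "seq_len": 2048,  "lr": 3e-5},
        {"phase": 2, "seq_len": 4096,  "lr": 1.5e-5},
        {"phase": 3, "seq_len": 8192,  "lr": 7.5e-6},
        {"phase": 4, "seq_len": 16384, "lr": 3.75e-6},
        {"phase": 5, "seq_len": 23040, "lr": 1.875e-6},
    ]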
Results & Ablation
• BPC: bits per character; lower is better (see the small helper below)

• Same number of parameters, better performance

• Window size should increase from lower to higher layers

• Adding some dilation may improve results
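For reference, BPC is just the per-character cross-entropy expressed in bits. A small illustrative helper (not from the paper):

    import math

    def bits_per_character(total_nll_nats, n_chars):
        """Average negative log-likelihood per character, converted to bits."""
        return total_nll_nats / (n_chars * math.log(2))

    # e.g. a mean cross-entropy of ~0.693 nats/char is about 1.0 BPC
    print(bits_per_character(total_nll_nats=0.693 * 1000, n_chars=1000))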
Pretraining
• Pretrain Longformer, then fine-tune on 6 tasks

• Continue from the RoBERTa checkpoint

• Longformer attention can be loaded into any Transformer model

• Pretrained with MLM (masked language modeling)

• Same tokenizer as RoBERTa (byte-pair encoding)

• W=512, D=0 (no dilation) for all layers

• Max length: 4096 for Longformer vs. 512 for RoBERTa

• How to reuse the position embeddings? Copy RoBERTa's 512 learned position embeddings repeatedly to cover 4096 positions (see the sketch after this list)

• Finetuning on QA tasks

• Better than RoBERTa, but not yet state of the art (next slide)
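A minimal sketch of that position-embedding initialization (copy RoBERTa's 512 embeddings repeatedly to cover 4096 positions); the tensor names and the random stand-in are illustrative, and real checkpoints carry padding-offset details omitted here:

    import torch

    d_model = 768
    roberta_pos = torch.randn(512, d_model)               # stand-in for RoBERTa's table
    longformer_pos = roberta_pos.repeat(4096 // 512, 1)   # copy 8 times -> (4096, d_model)
    assert longformer_pos.shape == (4096, d_model)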
Results & Ablation
• All better-scoring models use

• multi-stage pipelines, GNNs, etc.

• Ablation confirms the improvement is not simply due to additional pretraining

Discussion
• For long-document tasks:

• QA (multi-hop, open-domain QA)

• Long-document generation, e.g. summarization

• Other tasks…

• For explainability:

• Task-specific

• Contextual embeddings

• Multi-head
