3. Intro
• Limitation on document length (transformer-based models)
• Bottleneck: Self-Attention layers are expensive
• Time, Memory
• Contributions:
• Cheaper for long-document tasks (e.g. QA)
• Contextual representation of the entire document
• Existing approaches (see the sketch after this list):
• Chunking: divide the long document into pieces
• Truncation: information loss, e.g. BERT's 512-token limit
• Two-stage models: e.g. pool candidates, then answer (e.g. multi-hop QA)
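To make the chunking trade-off concrete, here is a minimal sketch; the helper name and the stride value are illustrative, not from the paper:

```python
# Hypothetical helper (not from the paper): chunk a long token sequence into
# overlapping 512-token pieces, as BERT-style models require.
def chunk_tokens(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of max_len with `stride` tokens of overlap."""
    chunks = []
    step = max_len - stride
    for start in range(0, max(1, len(token_ids) - stride), step):
        chunks.append(token_ids[start:start + max_len])
    return chunks

# A 2,000-token document becomes 5 chunks; any dependency that crosses a
# chunk boundary is lost to the model.
print(len(chunk_tokens(list(range(2000)))))  # 5
```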
4. Longformer’s Attention
• Capture far context (even the full sequence) efficiently
• Avoid full attention (sparse attention still has a cost; rough comparison below)
• Types: windowed attention / dilated attention
• Evaluate ability (vs. RoBERTa)
• Continually trained from RoBERTa's checkpoint, then
applied to downstream tasks
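A back-of-the-envelope comparison of full vs. windowed attention cost, with illustrative numbers (n and w as used later in the slides):

```python
# Back-of-the-envelope cost comparison (illustrative numbers only):
# full self-attention computes n^2 query-key scores, a sliding window only n*w.
n = 4096   # sequence length
w = 512    # window size
full_scores = n * n      # 16,777,216 scores per head
window_scores = n * w    #  2,097,152 scores per head
print(full_scores / window_scores)  # 8.0x fewer entries; memory scales the same way
```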
5. • Sliding Window (toy mask sketch for all three patterns below):
• Fixed window size W (e.g. W=2 means attending only to the previous and next token)
• Stacked across layers, each with its own W (small W: local info)
• Dilated Window:
• Add a dilation size D to the sliding window
• Larger receptive field (can attend farther away with the same W)
• Global + Sliding Window:
• Customized attention (tokens pre-selected per task), e.g. [CLS] for classification, question tokens for QA
• Two sets of Q/K/V projections (one for global attention, one for the sliding window)
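The three patterns can be visualized as boolean attention masks; the toy numpy sketch below is for intuition only (the paper uses a custom banded CUDA kernel, and the parameter names here are mine):

```python
import numpy as np

# Toy boolean attention masks for the three patterns.
def longformer_mask(seq_len, window, dilation=1, global_idx=()):
    """mask[i, j] = True means query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    for i in range(seq_len):
        for k in range(-half, half + 1):       # half a window on each side (+ self)
            j = i + k * dilation               # dilation > 1 skips positions
            if 0 <= j < seq_len:
                mask[i, j] = True
    for g in global_idx:                       # global tokens attend everywhere
        mask[g, :] = True
        mask[:, g] = True                      # ...and every token attends to them
    return mask

# Sliding window (W=4), then dilated window (D=2) with one global token at position 0.
print(longformer_mask(8, window=4).astype(int))
print(longformer_mask(8, window=4, dilation=2, global_idx=[0]).astype(int))
```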
6. Questions
• Q1: Loss of contextual information (a sort of distortion)?
• Increase W from bottom to top layers:
small W for local info in lower layers, large W for the entire context higher up
• Use the dilated sliding window only at higher layers
• Q2: Must layers be stacked to expand the receptive field? (worked numbers below)
• Combine (share) each layer's attention
• Local first / stacked / increasing W & D / more elastic
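A quick back-of-the-envelope on how the receptive field grows with stacked layers, using the paper's l·w and l·d·w estimates (the layer count and dilation value below are illustrative):

```python
# Receptive-field arithmetic for stacked sliding-window layers, following the
# paper's l*w (plain) and l*d*w (dilated) estimates; d here is a made-up value.
layers, w, d = 12, 512, 4
plain_rf = layers * w        # 6,144 positions reachable at the top layer
dilated_rf = layers * d * w  # 24,576 positions with dilation
print(plain_rf, dilated_rf)
```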
7. Experiments
• Task: character-level autoregressive LM (suited to long sequences) on
text8 & enwik8
• Constraint: the longest sequence that fits in limited memory
• Training:
• Staged training: 5 phases (schedule sketch below)
• Sequence length from 2,048 to 23,040
• Evaluation:
• Max length 32,256 at evaluation (no optimizer state or gradients needed, so longer sequences fit)
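A rough sketch of the staged schedule; the doubling of sequence length and halving of the learning rate per phase follow the paper's description, but the starting learning rate below is a placeholder:

```python
# Sketch of the 5-phase staged training schedule: sequence length (and window
# size) doubles each phase while the learning rate halves; the final length is
# capped at 23,040 by GPU memory, and the starting LR is a placeholder value.
seq_len, lr = 2048, 3e-5
schedule = []
for phase in range(1, 6):
    schedule.append({"phase": phase, "seq_len": seq_len, "lr": lr})
    seq_len, lr = seq_len * 2, lr / 2
schedule[-1]["seq_len"] = 23040  # last phase does not fully double
for row in schedule:
    print(row)
```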
8. Result & Ablation
• BPC: bits per character; smaller is better (formula sketch below)
• Same number of parameters, better performance
• Window size should increase from lower to higher layers
• Adding some dilation may improve performance
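For reference, BPC is just the per-character negative log-likelihood expressed in bits; a minimal sketch with toy numbers:

```python
import math

# BPC = average negative log-likelihood per character, converted from nats to bits.
# Toy numbers below, purely for illustration.
def bits_per_character(total_nll_nats, num_chars):
    return total_nll_nats / (num_chars * math.log(2))

print(round(bits_per_character(total_nll_nats=700.0, num_chars=1000), 3))  # ~1.01
```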
9. Pretraining
• Pretrain Longformer, then fine-tune on 6 tasks
• Continue from the RoBERTa checkpoint
* Longformer attention can be loaded into any transformer model
• Pretrained with MLM
• Same tokenizer as RoBERTa (byte-level BPE)
• W=512 / D=0 for all layers
• Max length L=4096 for Longformer vs. L=512 for RoBERTa
• Position embeddings: RoBERTa's 512 embeddings are copied repeatedly to cover 4096 positions (sketch after this list)
• Fine-tuning on QA tasks
• Better than RoBERTa, but…
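A minimal sketch of the position-embedding extension mentioned above, assuming toy weight matrices in place of the real RoBERTa parameters:

```python
import numpy as np

# Sketch of the position-embedding trick: copy RoBERTa's 512 position
# embeddings repeatedly to cover 4096 positions (toy random weights here).
hidden = 768
roberta_pos_emb = np.random.randn(512, hidden)
longformer_pos_emb = np.concatenate([roberta_pos_emb] * (4096 // 512), axis=0)
print(longformer_pos_emb.shape)  # (4096, 768)
```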
10. Results & Ablation
• The models that beat Longformer use task-specific architectures
(multi-stage pipelines, GNNs, …)
• Ablation confirms the gains are NOT simply due to the additional
pretraining
11. Discussion
• For long-document tasks:
• QA (multi-hop, open-domain QA)
• Long document generation, e.g. summarization
• Other tasks…
• For explainability:
• Task-specific attention patterns
• Contextual embeddings
• Multi-head attention