Longformer
Allen AI
Outline
• Intro

• Longformer’s attention

• Intuition

• Structure

• Questions & Concerns

• Experiments

• Results & Ablation

• For Pretraining 

• Discussion
Intro
• Document-length limitation of transformer-based models

• Bottleneck: Self-Attention layers are expensive

• Time and memory (both scale quadratically with sequence length)

• Contributions:

• Cheaper for long-document tasks (e.g. QA)

• Contextual representations of the entire document
• Existing approaches:

• Chunking: split the long document into pieces

• Truncation: information loss, e.g. BERT's 512-token limit (see the truncation sketch below)

• Two-stage models: pool/rank candidates, then answer, e.g. in multihop QA
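To make the truncation issue concrete: a standard BERT pipeline simply cuts the input at 512 tokens and discards the rest. A minimal sketch using the Hugging Face tokenizer (the model name and the placeholder document are illustrative, not from the slides):

```python
from transformers import AutoTokenizer

# Hypothetical long document; only its first 512 tokens survive truncation.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "some very long document " * 10_000  # placeholder text
enc = tokenizer(long_text, truncation=True, max_length=512)
print(len(enc["input_ids"]))  # 512: everything after that is simply lost
```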
Longformer’s Attention
• Capture far-away context (even the full sequence) efficiently

• Avoid full attention (sparse attention still has a cost, but far less; a cost comparison follows this list)

• Types: Windowed attention / Dilated attention

• Evaluate its ability vs. RoBERTa

• Continued training from RoBERTa's checkpoint, then applied to downstream tasks
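A back-of-the-envelope comparison of attention-score entries per layer makes "expensive" concrete (the numbers are illustrative, chosen to match the pretrained model's n = 4096 and w = 512):

```python
# Full self-attention scales as O(n^2); a sliding window scales as O(n * w).
n, w = 4096, 512
print("full attention scores :", n * n)  # 16_777_216
print("sliding-window scores :", n * w)  #  2_097_152  (8x fewer here)
```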
• Sliding Window: 

• Fixed window size W (e.g. W = 2: attend to the previous and next token only)

• Stacked across layers, with a different W per layer (small W captures local information)

• Dilated Window:

• Add a dilation size D to the sliding window

• Larger receptive field: the same window can reach D times farther away, covering longer input

• Global + Sliding Window:

• Customized attention, pre-selected depending on the task (e.g. [CLS] for classification, question tokens for QA)

• Two sets of Q/K/V projections: one for global attention, one for sliding-window attention (a mask sketch follows this list)
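A minimal NumPy sketch of the attention patterns just described. It only illustrates the pattern: the function name and arguments are placeholders, and the real implementation uses banded custom kernels rather than materializing a dense n x n mask.

```python
import numpy as np

def longformer_mask(seq_len, window, dilation=1, global_idx=()):
    """Boolean mask: entry [i, j] is True if query i may attend to key j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    for i in range(seq_len):
        # sliding window: +/- half positions around i, stepping by `dilation`
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < seq_len:
                mask[i, j] = True
    # global tokens (e.g. [CLS] or question tokens) attend to, and are
    # attended by, every position
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# 16 tokens, window of 4, token 0 global (as for [CLS] in classification)
print(longformer_mask(16, 4, global_idx=(0,)).astype(int))
```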
Questions
• Q1: Is contextual information lost (a kind of distortion)?

• Increase W from bottom to top layers:

small windows capture local information in lower layers, large windows cover the entire context higher up

• Use dilated sliding windows only at higher layers

• Q2: Layers must be stacked to expand the receptive field.

• Combine (share) each layer's attention

• Local first / stacked / increasing W & D / more elastic (see the receptive-field sketch below)
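A back-of-the-envelope sketch of why stacking matters: with sliding-window attention the receptive field grows by roughly W × D tokens per layer, so small local windows at the bottom combined with larger, dilated windows at the top still cover very long contexts. The values below are illustrative, not the paper's configuration.

```python
def receptive_field(windows, dilations):
    """Approximate top-layer receptive field (in tokens) after stacking
    sliding-window layers: each layer widens it by about window * dilation."""
    return sum(w * d for w, d in zip(windows, dilations))

# small local windows low, larger + dilated windows high
print(receptive_field([32, 64, 128, 256], [1, 1, 2, 4]))  # 1376 tokens
```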
Experiments
• Task: character-level autoregressive LM (suited to longer sequences) on text8 & enwik8

• Constraint: the longest sequence that fits in limited GPU memory

• Training:

• Staged training in 5 phases

• Sequence length grows from 2,048 to 23,040 (a schedule sketch follows this list)

• Evaluation:

• Evaluation length up to 32,256 (no optimizer state or gradients at inference, so longer sequences fit in memory)
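A hedged sketch of the staged-training idea: grow the sequence length (and window size) phase by phase while lowering the learning rate. The endpoints 2,048 and 23,040 come from the slide; the intermediate lengths, window sizes, and learning rates below are purely illustrative.

```python
# Illustrative 5-phase schedule; not the paper's exact hyperparameters.
phases = [
    {"seq_len":  2048, "window":  64, "lr": 3.0e-5},
    {"seq_len":  4096, "window": 128, "lr": 1.5e-5},
    {"seq_len":  8192, "window": 256, "lr": 7.5e-6},
    {"seq_len": 16384, "window": 512, "lr": 3.8e-6},
    {"seq_len": 23040, "window": 512, "lr": 1.9e-6},  # longest that fits in memory
]
for i, p in enumerate(phases, 1):
    print(f"phase {i}: length {p['seq_len']}, window {p['window']}, lr {p['lr']}")
```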
Results & Ablation
• BPC: bits per character; smaller is better (a formula sketch follows this list)

• Better performance with the same number of parameters

• Window size should increase from lower to higher layers

• Adding some dilation may improve results slightly
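For reference, BPC is just the average negative log-likelihood per character expressed in bits; this is the standard definition, not something specific to this paper.

```python
import math

def bits_per_character(nll_nats_per_char: float) -> float:
    """Convert average per-character negative log-likelihood from nats
    (natural log, as produced by a cross-entropy loss) to bits."""
    return nll_nats_per_char / math.log(2)

print(bits_per_character(0.80))  # ~1.15 BPC
```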
Pretraining
• Pretrain Longformer, then fine-tune on 6 tasks

• Continue from the RoBERTa checkpoint

* Longformer attention can be plugged into any transformer model

• Pretrained with MLM (masked language modeling)

• Same tokenizer as RoBERTa (byte-level BPE)

• W = 512, D = 0 (no dilation) for all layers

• Max length: 4,096 for Longformer vs. 512 for RoBERTa

• How to reuse the position embeddings? Copy RoBERTa's 512 learned position embeddings repeatedly to cover 4,096 positions (see the tiling sketch below)

• Fine-tune on QA tasks

• Better than RoBERTa, but state-of-the-art systems add task-specific machinery (next section)
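The slide's question about position embeddings is answered in the paper by copying RoBERTa's learned position embeddings repeatedly to initialize the 4,096 positions. A minimal PyTorch sketch; the function name and exact shapes are assumptions (RoBERTa's actual table also contains a couple of special offset positions).

```python
import torch

def extend_position_embeddings(pos_emb: torch.Tensor, new_max_len: int = 4096) -> torch.Tensor:
    """Tile a (old_len, dim) position-embedding matrix until it covers
    new_max_len positions, then truncate; mirrors the copy-initialization
    described in the Longformer paper."""
    old_len, _ = pos_emb.shape            # e.g. (512, 768)
    reps = -(-new_max_len // old_len)     # ceiling division
    return pos_emb.repeat(reps, 1)[:new_max_len]

# usage sketch: new_emb = extend_position_embeddings(roberta_pos_emb_weight)
```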
Results & Ablation
• The models that still beat Longformer rely on extra machinery

• e.g. multi-stage pipelines, GNNs…

• Ablation confirms the improvement is not merely due to the additional pretraining

Discussion
• For long-document tasks:

• QA (multihop, open-domain QA)

• Long-document generation, e.g. summarization

• Other tasks…

• For explainability:

• Task specific 

• Contextual Embeddings 

• Multihead
