Longformer
Allen AI
Outline
• Intro

• Longformer’s attention

• Intuition

• Structure

• Questions & Concerns

• Experiments

• Results & Ablation

• For Pretraining 

• Discussion
Intro
• Transformer-based models limit the document length

• Bottleneck: self-attention layers are expensive

• Time and memory grow quadratically with sequence length

• Contributions:

• Cheaper for long-document tasks (e.g. QA)

• Contextual representation of the entire context
• Existing approaches:

• Chunking: divide the long document into pieces

• Truncation: information loss, e.g. BERT's 512-token limit

• Two-stage models, e.g. pool >> candidates >> answer, as in multi-hop QA
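To make the time and memory bottleneck concrete, the sketch below counts how many query-key pairs get scored under full attention versus a fixed local window. The function names and the choice of 512 and 4096 tokens are illustrative assumptions, not taken from the slides' experiments.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by full self-attention: every token sees every token."""
    return n * n


def windowed_attention_pairs(n: int, w: int) -> int:
    """Query-key pairs when each token sees itself plus w // 2 neighbours per side."""
    half = w // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)        # window is clipped at the sequence start
        hi = min(n - 1, i + half)    # and at the sequence end
        total += hi - lo + 1
    return total


for n in (512, 4096):
    full = full_attention_pairs(n)
    local = windowed_attention_pairs(n, w=512)
    print(f"n={n}: full={full}, windowed={local}, ratio={full / local:.1f}")
```

The full count grows with n squared, while the windowed count grows roughly as n times w, which is why a fixed window stays affordable as documents get longer.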
Longformer’s Attention
• Capture distant context (even the full sequence) efficiently

• Avoid full attention (sparse, but still has a cost)

• Types: windowed attention / dilated attention

• Evaluate against RoBERTa

• Continue training from RoBERTa's checkpoint(?), then apply to downstream tasks
• Sliding window:

• Fixed window size W (e.g. W = 2 means each token attends only to the previous and next token)

• Stacked across layers, each with its own W (small W: local information)

• Dilated window:

• Add a dilation size D to the sliding window

• Larger receptive field (longer input; can attend to positions farther away, depending on D)

• Global + sliding window:

• Task-specific attention on pre-selected positions, e.g. MLM

• Two sets of QKV projections (one for global attention, one for the sliding window)
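The three patterns above can be sketched as boolean attention masks. This is a toy illustration, not the Longformer implementation: `attention_mask` and `render` are hypothetical helpers, `gap` plays the role of D (so `gap=0` means no dilation), and the separate QKV projections for global attention are not modelled.

```python
def attention_mask(n, window, gap=0, global_positions=()):
    """Build an n x n mask; mask[i][j] is True when token i may attend to token j.

    window: window size W; each token sees W // 2 positions on each side.
    gap: dilation D, the number of skipped positions between attended ones
         (0 gives a plain sliding window).
    global_positions: pre-selected indices that attend to, and are attended
         by, every position.
    """
    half = window // 2
    step = gap + 1
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        mask[i][i] = True  # a token always sees itself
        for k in range(1, half + 1):
            for j in (i - k * step, i + k * step):
                if 0 <= j < n:
                    mask[i][j] = True
    for g in global_positions:
        for j in range(n):
            mask[g][j] = True  # the global token attends everywhere
            mask[j][g] = True  # every token attends to the global token
    return mask


def render(mask):
    """Draw the mask as text: 'x' for an allowed pair, '.' for a blocked one."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


print(render(attention_mask(8, window=2)))                         # sliding window
print()
print(render(attention_mask(8, window=2, gap=1)))                  # dilated window
print()
print(render(attention_mask(8, window=2, global_positions=(0,))))  # global + sliding
```

With `window=2` each row has a narrow band around the diagonal; adding a gap pushes the band outward without adding entries, and a global position fills one full row and one full column.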
Questions
• Q1: Loss of contextual information (a sort of distortion)

• Increase W from the bottom layers to the top: low layers capture local context, high layers the entire sequence

• Use the dilated sliding window only at higher layers

• Q2: Layers must be stacked to expand the receptive field

• Combine (share) each layer's attention

• Local first / stacked / increasing W and D / more elastic
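The point in Q2 about stacking can be checked with a little arithmetic: each layer lets information travel a further W // 2 attended positions in each direction, and dilation stretches every step by D + 1. The sketch below assumes that simple model; `receptive_field` is a hypothetical helper and the per-layer settings are made-up examples.

```python
def receptive_field(layers):
    """Number of input positions the top layer can see.

    layers: list of (window, gap) pairs from the bottom layer to the top,
            where window is W and gap is the dilation D (0 = no dilation).
    """
    reach = 0  # how far information can travel to one side
    for window, gap in layers:
        reach += (window // 2) * (gap + 1)
    return 2 * reach + 1  # both sides plus the token itself


fixed = [(512, 0)] * 12                              # same W in every layer
increasing = [(32, 0), (64, 0), (128, 0), (256, 0)]  # W grows towards the top
dilated_top = [(32, 0), (64, 0), (128, 1), (256, 1)] # dilation only in higher layers

for name, layers in [("fixed", fixed), ("increasing", increasing), ("dilated top", dilated_top)]:
    print(name, receptive_field(layers))
```

Under this model a single layer never sees beyond its own window, so long-range context only appears once several layers are stacked or dilation is added near the top.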
Experiments
• Task: character-level autoregressive LM (for longer sequences) on text8 and enwik8

• Boundary: the longest sequence that fits in limited memory

• Training:

• Staged training: 5 phases

• Sequence length from 2048 to 23040

• Evaluation:

• Max sequence length up to 32256… (no optimizer state or gradients?)
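The slide gives only the number of phases and the first and last sequence lengths. A minimal sketch of one schedule that fits those numbers, assuming the length doubles each phase and is capped at the final value (the doubling rule and the helper name `staged_lengths` are assumptions):

```python
def staged_lengths(start, final, phases):
    """Sequence length per training phase: double each phase, capped at `final`."""
    lengths = []
    length = start
    for _ in range(phases):
        lengths.append(min(length, final))
        length *= 2
    lengths[-1] = final  # the last phase always trains at the final length
    return lengths


print(staged_lengths(start=2048, final=23040, phases=5))
```

Starting short keeps early phases cheap while the model learns local context; the longest and most memory-hungry sequences are only used at the end.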
Results & Ablation
• BPC: bits per character, the smaller the better

• Same number of parameters, better performance

• Window size should increase from lower to higher layers

• Adding some dilation may improve results
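BPC is the average negative log2 probability the model assigns to each correct character, which equals the usual cross-entropy loss in nats divided by ln 2. A small sketch with made-up probabilities (the function names are illustrative):

```python
import math


def bpc_from_probs(probs):
    """Average bits per character given the probability of each correct character."""
    return -sum(math.log2(p) for p in probs) / len(probs)


def bpc_from_nats(loss_nats_per_char):
    """Convert a per-character cross-entropy loss in nats to bits per character."""
    return loss_nats_per_char / math.log(2)


# Toy probabilities for four characters; a more confident model gives a lower BPC.
print(bpc_from_probs([0.5, 0.25, 0.5, 0.25]))
print(bpc_from_nats(0.7))
```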
Pretraining
• Pretrain Longformer, then fine-tune on 6 tasks

• Continue from the RoBERTa checkpoint(??)

* Longformer attention can be loaded into any Transformer model

• Pretrained with MLM…

• Same tokenizer as RoBERTa (byte-level BPE)

• W = 512, D = 0 for all layers

• L(Longformer) = 4096… while L(RoBERTa) = 512

• How to utilize position embeddings??

• Fine-tuning on QA tasks

• Better than RoBERTa, but
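One possible answer to the position-embedding question above is to initialize the 4096 positions by repeating the 512 pretrained ones, so local ordering within each 512-token stretch is preserved. A minimal sketch under that assumption, using toy one-dimensional vectors in place of real weights; `extend_position_embeddings` is a hypothetical helper, not a library function.

```python
def extend_position_embeddings(pretrained, new_length):
    """Tile pretrained position vectors until `new_length` positions are covered."""
    if not pretrained:
        raise ValueError("need at least one pretrained position embedding")
    old_length = len(pretrained)
    # Position i reuses the vector of pretrained position i mod old_length.
    return [list(pretrained[i % old_length]) for i in range(new_length)]


# Toy stand-in for 512 pretrained position vectors (real ones are learned).
toy_pretrained = [[float(i)] for i in range(512)]
extended = extend_position_embeddings(toy_pretrained, 4096)
print(len(extended), extended[511], extended[512])
```

The copied vectors are only a starting point; continued MLM pretraining would then adapt them to the longer context.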
Results & Ablation
• All models that do better use

• multi-stage pipelines, GNNs…

• Confirmed that the improvement is not due to the additional pretraining
Discussion
• For long-document tasks:

• QA (multi-hop, open-domain QA)

• Long-document generation, e.g. summarization

• Other tasks…

• For explainability

• Task-specific

• Contextual embeddings

• Multi-head
