Longformer: The Long-Document Transformer

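This week's topic from the NLP team is Longformer: The Long-Document Transformer. Since its introduction in 2017, the Transformer has achieved state-of-the-art results in many areas, because its self-attention mechanism captures contextual information from the entire sequence well. Because self-attention looks at the whole input text, however, its complexity is O(n²), so it demands a great deal of computation, time, and memory. To address this, Longformer proposes an attention mechanism that costs only O(n). Even for long inputs, Longformer learns whole-context representations that take the entire text into account, and is reported to improve performance without depending on the model architecture. Today's detailed paper review was presented by 황소현 of the NLP team.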

Longformer: The Long-Document Transformer
NLP Team
박희수, 주정헌, 황소현 (presenter)
https://arxiv.org/abs/2004.05150

3. Longformer
• The original Transformer model has a self-attention component with O(n²) time and memory complexity, where n is the input sequence length.
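To make the quadratic cost concrete, here is a minimal arithmetic sketch. The function names are hypothetical, and the 4 bytes per score (fp32) is an assumption rather than something stated on the slide.

```python
def full_attention_scores(n: int) -> int:
    """Number of query-key scores one attention head computes for n tokens."""
    return n * n


def score_matrix_mib(n: int, bytes_per_score: int = 4) -> float:
    """Memory for one head's n x n score matrix in MiB (4 bytes = fp32, an assumption)."""
    return full_attention_scores(n) * bytes_per_score / 2**20


# Growing n by 8x grows the score matrix by 64x.
for n in (512, 4096, 16384):
    print(f"n={n:>6}: {full_attention_scores(n):>12,} scores, "
          f"{score_matrix_mib(n):>7.1f} MiB per head")
```

Under these assumptions, going from 512 to 4,096 tokens takes the per-head score matrix from 1 MiB to 64 MiB.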

3. Attention Pattern
Sliding Window
• Given a fixed window size w, each token attends to (1/2)w tokens on each side
• Computation complexity: O(n × w)
• Receptive field: l × w (l = number of layers)
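The sliding-window pattern can be sketched as a boolean mask. This is an illustrative toy in plain Python, not the authors' implementation; `sliding_window_mask` and `count_attended` are hypothetical names, and the sizes n = 16, w = 4 are chosen only to keep the output small.

```python
def sliding_window_mask(n: int, w: int) -> list:
    """mask[i][j] is True when token i may attend to token j.

    Each token sees w // 2 neighbours on each side, plus itself.
    """
    half = w // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def count_attended(mask: list) -> int:
    """Total number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


n, w = 16, 4
mask = sliding_window_mask(n, w)
pairs = count_attended(mask)
print(f"windowed pairs: {pairs}, upper bound n*(w+1): {n * (w + 1)}, full n*n: {n * n}")
```

The number of computed pairs is bounded by n × (w + 1), which grows linearly in n for a fixed w, in line with the O(n × w) figure on the slide.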

3. Attention Pattern
Dilated Sliding Window
• The window is expanded by the dilation size d (gaps of size d between attended tokens)
• Receptive field: l × d × w
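A minimal sketch of the dilated variant, assuming that dilation d means a token attends only to positions whose offset is a multiple of d. The function names and the example sizes are hypothetical; the receptive-field helper simply evaluates the l × d × w formula, which assumes the same d and w in every layer.

```python
def dilated_window_mask(n: int, w: int, d: int) -> list:
    """mask[i][j] is True when token i attends to token j under dilation d.

    With d = 1 this reduces to the plain sliding window.
    """
    half = w // 2

    def attends(i: int, j: int) -> bool:
        offset = abs(i - j)
        return offset % d == 0 and offset // d <= half

    return [[attends(i, j) for j in range(n)] for i in range(n)]


def receptive_field(num_layers: int, d: int, w: int) -> int:
    """Top-layer receptive field l x d x w, assuming fixed d and w in all layers."""
    return num_layers * d * w


mask = dilated_window_mask(n=16, w=4, d=2)
row8 = [j for j, ok in enumerate(mask[8]) if ok]
print("token 8 attends to:", row8)
print("receptive field, 12 layers, w=512:",
      receptive_field(12, 1, 512), "(d=1) vs", receptive_field(12, 2, 512), "(d=2)")
```

The per-token cost is unchanged (still about w + 1 positions), while the span covered by the window doubles when d = 2.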

3. Attention Pattern
Global attention
• Representations are learned only at the positions of special tokens
• Attention weights are computed over all of the encoder's hidden states
• Computation complexity: O(n)
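Combining the sliding window with global attention can be sketched as below. The symmetric rule (a global token attends to every token, and every token attends to it) follows the paper's description; the function name, the choice of position 0 as the special token, and the sizes are assumptions for illustration.

```python
def longformer_mask(n: int, w: int, global_positions) -> list:
    """Sliding-window mask plus symmetric global attention.

    mask[i][j] is True when i and j are within the window, or when either
    i or j is a global (special-token) position.
    """
    half = w // 2
    g = set(global_positions)
    return [[abs(i - j) <= half or i in g or j in g for j in range(n)]
            for i in range(n)]


n, w = 16, 4
mask = longformer_mask(n, w, global_positions=[0])  # e.g. a leading special token
total = sum(sum(row) for row in mask)
print("pairs with one global token:", total)
```

Each global token adds roughly 2n extra pairs (one row and one column), so with a small, fixed number of special tokens the added cost stays linear in n.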

4. Autoregressive Language Modeling
4.1 Attention Pattern
⚫ Use different window sizes across the layers
• Window size grows toward the upper layers
• Lower layers: local information; upper layers: information about the whole sequence
• This balances the trade-off between efficiency and performance
⚫ Use dilated sliding window attention
• Lower layers: no dilation, to increase capacity
• Upper layers: dilation applied to only 2 heads
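One way to write such a per-layer pattern down as a configuration is sketched here. Every number in it (12 layers, 8 heads, windows from 32 to 512, dilation 2, upper layers starting at layer 6) is a hypothetical placeholder, as are the names `LayerAttention` and `build_layer_pattern`; only the overall shape (growing windows, dilation on 2 heads in upper layers) comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerAttention:
    window: int            # sliding-window size for this layer
    dilations: List[int]   # one dilation per attention head (1 = no dilation)


def build_layer_pattern(num_layers: int = 12, num_heads: int = 8,
                        bottom_window: int = 32, top_window: int = 512,
                        first_dilated_layer: int = 6,
                        dilated_heads: int = 2, dilation: int = 2) -> List[LayerAttention]:
    """Windows grow linearly from bottom to top; only upper layers use dilation."""
    layers = []
    steps = max(num_layers - 1, 1)
    for k in range(num_layers):
        window = bottom_window + (top_window - bottom_window) * k // steps
        if k >= first_dilated_layer:
            # Upper layers: dilation on a small number of heads only.
            dilations = [dilation] * dilated_heads + [1] * (num_heads - dilated_heads)
        else:
            # Lower layers: no dilation, keeping full capacity for local context.
            dilations = [1] * num_heads
        layers.append(LayerAttention(window, dilations))
    return layers


pattern = build_layer_pattern()
for k, layer in enumerate(pattern):
    print(k, layer.window, layer.dilations)
```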

4.2 Experiment Setup
⚫ The model was found to need a large number of gradient updates to learn local context first, before learning longer context
⚫ Staged training procedure
• Window size and sequence length are increased over several phases
• Training starts with a short sequence length and a small window size
• Each subsequent phase doubles the sequence length and the window size
• 5 phases in total, starting at sequence length 2,048 and ending at 23,040
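The staged procedure can be sketched as a simple schedule. Two things here are assumptions: the starting window size of 32 is a hypothetical value, and the cap at 23,040 is one way to reconcile "doubling each phase" with the stated final length (pure doubling from 2,048 would give 32,768 in phase 5).

```python
def staged_schedule(num_phases: int = 5, start_seq_len: int = 2048,
                    max_seq_len: int = 23040, start_window: int = 32) -> list:
    """Double sequence length and window each phase, capping the sequence length."""
    schedule = []
    seq_len, window = start_seq_len, start_window
    for phase in range(1, num_phases + 1):
        schedule.append({
            "phase": phase,
            "seq_len": min(seq_len, max_seq_len),  # cap is an assumption
            "window": window,                      # starting value is hypothetical
        })
        seq_len *= 2
        window *= 2
    return schedule


schedule = staged_schedule()
for phase in schedule:
    print(phase)
```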

4.2.1 Results
⚫ BPC (bits per character)
• The character-level counterpart of perplexity: the average number of bits needed per character (perplexity = 2^BPC)
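As a small worked example of the metric, BPC is the average negative log2 probability the model assigns to each true character. The probabilities below are made up for illustration, and the function names are hypothetical.

```python
import math


def bits_per_character(char_probs) -> float:
    """Average negative log2 probability assigned to each true character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


def char_perplexity(bpc: float) -> float:
    """Character-level perplexity corresponding to a BPC value."""
    return 2.0 ** bpc


# Made-up model probabilities for four consecutive characters.
probs = [0.5, 0.25, 0.5, 0.25]
bpc = bits_per_character(probs)
print(f"BPC = {bpc}, character-level perplexity = {char_perplexity(bpc):.3f}")
```

Lower BPC is better: a model that assigns higher probability to the correct next character needs fewer bits to encode it.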

4.2.2 Ablation Study
⚫ Performance is best when the window size increases from the bottom layers to the top layers
⚫ Applying dilation improves performance further

5. Pretraining and Finetuning
⚫ Position Embeddings
• RoBERTa's maximum position is 512 → extra position embeddings are added to extend it to 4,096
• The new embeddings are initialized by copying RoBERTa's 512 position embeddings
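The copy-based initialization can be sketched with plain lists. This toy assumes the copy is done by repeating the pretrained table end to end until the new length is filled; the function name and the tiny 4-position table are hypothetical stand-ins for the 512 × hidden-size matrix.

```python
def extend_position_embeddings(pretrained: list, new_max_positions: int) -> list:
    """Build a longer position table by repeating the pretrained rows in order."""
    old = len(pretrained)
    return [list(pretrained[pos % old]) for pos in range(new_max_positions)]


# Toy stand-in: 4 pretrained positions with 2-dimensional embeddings.
pretrained = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
extended = extend_position_embeddings(pretrained, 16)

print(len(extended), "positions; position 5 reuses pretrained position", 5 % 4)
print("copies needed for 512 -> 4096:", 4096 // 512)
```

Under this assumption, extending 512 positions to 4,096 amounts to 8 back-to-back copies, so local position structure learned by the original model is preserved within each copy.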

⚫ Continued MLM Pretraining
• RoBERTa's weights are kept fixed
• Batch size: 64
• 65K gradient updates on inputs of sequence length 4,096
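For a sense of scale, the pretraining budget implied by these settings can be worked out directly. Reading "65K" as exactly 65,000 updates is an assumption, and the count is of token positions processed, not unique tokens.

```python
BATCH_SIZE = 64
SEQ_LEN = 4096
GRADIENT_UPDATES = 65_000  # assumption: "65K" read as 65,000

tokens_per_update = BATCH_SIZE * SEQ_LEN
total_token_positions = tokens_per_update * GRADIENT_UPDATES

print(f"tokens per update: {tokens_per_update:,} (2**18 = {2**18:,})")
print(f"token positions processed overall: {total_token_positions:,}")
```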

6. Tasks
• Contexts longer than 512 tokens are used as features
• Baseline: RoBERTa

Longformer: The Long-Document Transformer

1. Longformer: The Long-Document Transformer (title slide)
2. 1. Background
3. 2. Related Work
4. 3. Longformer
5. 3. Attention Pattern: Sliding Window
6. 3. Attention Pattern: Dilated Sliding Window
7. 3. Attention Pattern: Global attention
8. Q&A
9. 4. Autoregressive Language Modeling: 4.1 Attention Pattern
10. 4. Autoregressive Language Modeling: 4.2 Experiment Setup
11. 4. Autoregressive Language Modeling: 4.2.1 Results
12. 4. Autoregressive Language Modeling: 4.2.2 Ablation Study
13. 5. Pretraining and Finetuning: Position Embeddings
14. 5. Pretraining and Finetuning: Continued MLM Pretraining
15. Q&A
16. 6. Tasks
17. 6. Tasks: 6.1 Question answering
18. 6. Tasks: 6.2 Coreference Resolution
19. 6. Tasks: 6.3 Document Classification
20. 7. Longformer-Encoder-Decoder (LED)
21. Q&A
22. THANK YOU