[EMNLP 2017 Reading Group] Efficient Attention using a Fixed-Size Memory Representation
1. Efficient Attention using a Fixed-Size Memory Representation
Denny Britz, Melody Y. Guan, and Minh-Thang Luong
11/22 @EMNLP2017 Reading
Reader: M1 Hayahide Yamagishi
2. Introduction
● Most encoder-decoder architectures are equipped with an attention mechanism.
● However, human translators do not reread previously translated source words (as shown by eye-tracking studies).
● They believe “it may be unnecessary to look back at the
entire original source sequence at each step.”
● They proposed an alternative attention mechanism
○ Lower computational cost
4. Attention [Bahdanau+ ICLR2015, Luong+ EMNLP2015]
● The attention mechanism builds the context vectors c.
○ s: encoder states, h: decoder states
○ c_i = Σ_j α_ij s_j, where α_ij = softmax_j(score(h_i, s_j))
● Computational time: O(D^2 |S||T|)
○ D: state size of the encoder and decoder
○ |S| and |T| denote the lengths of the source and target, respectively.
○ With Luong's dot attention, the computational time is O(D|S||T|) (see the sketch below)
○ Luong's dot attention score: h_i^T s_j
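To make the cost concrete, here is a minimal NumPy sketch (not the authors' code) of Luong-style dot attention for a single decoder step: scoring every encoder state costs O(D|S|) per target step, hence O(D|S||T|) for the whole sequence.

import numpy as np

def dot_attention_step(h_i, S):
    """h_i: decoder state (D,); S: encoder states (|S|, D)."""
    scores = S @ h_i                       # h_i^T s_j for every source position j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # softmax over source positions
    return alpha @ S                       # context vector c_i, shape (D,)

h_i = np.random.randn(256)                 # D = 256
S = np.random.randn(40, 256)               # |S| = 40
c_i = dot_attention_step(h_i, S)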
5. Memory-Based Attention Model (Proposed method)
● During encoding, they computed an attention matrix C.
○ sizes of C and W: K×D
○ K: the number of attention vectors
○ computational time: O(KD|S|)
● C is regarded as a compact, fixed-size memory (sketch below).
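A hedged NumPy sketch of forming the K×D memory C from the encoder states, following this slide's description (W is the learned K×D scoring matrix); the paper's exact parameterization may differ.

import numpy as np

def build_memory(S, W):
    """S: encoder states (|S|, D); W: scoring matrix (K, D). Returns C (K, D)."""
    scores = W @ S.T                            # (K, |S|): one score row per memory slot
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # normalize over source positions
    return alpha @ S                            # each slot is a weighted sum of encoder states

S = np.random.randn(40, 256)    # |S| = 40, D = 256
W = np.random.randn(32, 256)    # K = 32 attention vectors
C = build_memory(S, W)          # O(KD|S|) work, independent of |T|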
6. Memory-Based Attention Model (Proposed method)
● During decoding, they computed the context vector c.
○ They used C, instead of the encoder states, for computing the attention.
● Total computational time: O(KD(|S| + |T|))
○ They expected their model to be faster than the O(D^2 |S||T|) baseline.
○ For long sequences (large |S|), this model will also be faster than dot attention.
● They used a sigmoid function instead of a softmax for calculating the attention scores (sketch below).
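A minimal sketch of one decoder step attending over the K×D memory C instead of the |S| encoder states; the dot-product score against the memory slots is an assumption, since the slides do not spell out the score function. Each step costs O(KD), so decoding costs O(KD|T|) and the total is O(KD(|S| + |T|)); for example, with D = 512, K = 32 and |S| = |T| = 35, that is roughly 1.1M operations versus about 320M for O(D^2 |S||T|).

import numpy as np

def memory_attention_step(h_i, C):
    """h_i: decoder state (D,); C: memory (K, D)."""
    scores = C @ h_i                         # one score per memory slot, O(KD)
    gates = 1.0 / (1.0 + np.exp(-scores))    # sigmoid instead of softmax, as on this slide
    return gates @ C                         # context vector, shape (D,)

C = np.random.randn(32, 256)    # K = 32, D = 256
h_i = np.random.randn(256)
c_i = memory_attention_step(h_i, C)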
7. Position Encoding
● The calculation of C is the same for every k, so nothing forces the K vectors to differ.
○ “we would hope for the model to learn to generate distinct attention contexts”
● They therefore add position encodings (sketch below).
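A hedged sketch of how position information can break the symmetry across the K memory slots: a slot- and position-dependent bias is added to the scores before normalization in the build_memory sketch above. The concrete encoding used here (a random bias matrix) is only illustrative, not the paper's exact formulation.

import numpy as np

def build_memory_with_positions(S, W, pos_bias):
    """S: (|S|, D); W: (K, D); pos_bias: (K, |S|) position encodings."""
    scores = W @ S.T + pos_bias              # distinct bias for each slot k and position s
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ S

S = np.random.randn(40, 256)
W = np.random.randn(32, 256)
pos_bias = np.random.randn(32, 40)           # illustrative only; the paper defines its own encoding
C = build_memory_with_positions(S, W, pos_bias)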
8. Experiment 1: Toy copying (like NTM [Graves+ 2014])
● The task is to copy random sequences (sketch below).
○ Lengths: 0 to {10, 50, 100, 200}
● Vocabulary size: 20
● 2-layer, bi-directional LSTM (256 units)
● Dropout: 0.2
● Train : test = 100,000 : 1,000
○ batch size: 128
○ They trained for 200,000 steps.
● 1 × NVIDIA K40m GPU
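A small sketch of the copying data described on this slide: random sequences over a 20-symbol vocabulary whose target is identical to the source, with 100,000 training and 1,000 test examples.

import random

def make_copy_example(max_len=50, vocab_size=20):
    length = random.randint(0, max_len)
    src = [random.randrange(vocab_size) for _ in range(length)]
    return src, list(src)           # the target is an exact copy of the source

train = [make_copy_example(max_len=50) for _ in range(100_000)]
test = [make_copy_example(max_len=50) for _ in range(1_000)]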
11. Result
● The vanilla encoder-decoder (without attention) performs poorly.
● The best choice of K depends on the sequence length.
● The decoding speedup grows as the sequence length increases.
● “Traditional attention may be representing the source with
redundancy and wasting computational resources.”
12. Experiment 2: Neural Machine Translation
● WMT’17
○ English-Czech (52M sentences)
○ English-German (5.9M sentences)
○ English-Finnish (2.6M sentences)
○ English-Turkish (207K sentences)
○ Dev: newstest2015, Test: newstest2016 (except for en-tr)
○ The average sentence length in the test data is 35.
● Hyperparameters
○ Vocabulary: 16,000 subwords (BPE)
○ Hidden state size: 512
○ Other parameters are the same as in the copying experiment (config sketch below).
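For reference, the hyperparameters on this slide (plus the shared settings from the copying experiment) restated as a plain Python dict; the key names are illustrative, not taken from the authors' code.

nmt_config = {
    "subword_vocab_size": 16_000,      # BPE subwords
    "hidden_size": 512,                # encoder/decoder state size
    "num_layers": 2,                   # carried over from the copying experiment
    "rnn_cell": "bidirectional LSTM",
    "dropout": 0.2,
    "batch_size": 128,
    "train_steps": 200_000,
    "dev_set": "newstest2015",
    "test_set": "newstest2016",
}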
15. Discussion
● “Our memory attention model performs on-par with, or slightly better than, the baseline model”
● Position encoding improves model performance.
● For tasks where K << |T| (e.g., summarization), this model is expected to perform well.
● Decoding time decreased.