Efficient Attention using a Fixed-Size Memory Representation
Denny Britz, Melody Y. Guan, and Minh-Thang Luong
11/22 @ EMNLP2017 Reading
Reader: M1 Hayahide Yamagishi
Introduction
● Most encoder-decoder architectures are equipped with an attention mechanism.
● However, human translators do not reread previously translated source words (as shown by eye-tracking studies).
● The authors believe "it may be unnecessary to look back at the entire original source sequence at each step."
● They propose an alternative attention mechanism
○ with a smaller computational cost.
Attention [Bahdanau+ ICLR2015, Luong+ EMNLP2015]
● The attention mechanism builds the context vectors c.
○ s: encoder state, h: decoder state
○ c_i = Σ_j α_ij s_j, where α_ij = softmax_j(score(h_i, s_j))
● Computational time: O(D²|S||T|)
○ D: state size of the encoder and decoder
○ |S| and |T| denote the lengths of the source and target, respectively.
○ With Luong's dot attention, the computational time is O(D|S||T|).
○ Luong's dot attention score: h_i^T s_j (both baselines are sketched below)
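A minimal NumPy sketch (mine, not the authors' code) contrasting the two baseline scoring functions; the dimensions and the weight names W1, W2, v are illustrative assumptions. The additive form pays O(D²) per (i, j) pair, the dot form only O(D):

```python
import numpy as np

S, T, D = 50, 40, 256
enc = np.random.randn(S, D)            # encoder states s_j
dec = np.random.randn(T, D)            # decoder states h_i

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Additive (Bahdanau-style) scoring: the D x D projections make each of the
# |S||T| score evaluations cost O(D^2), hence O(D^2 |S||T|) overall.
W1, W2, v = np.random.randn(D, D), np.random.randn(D, D), np.random.randn(D)
def additive_context(h):
    scores = np.tanh(enc @ W1.T + h @ W2.T) @ v   # (S,)
    return softmax(scores) @ enc                   # context c_i

# Dot (Luong-style) scoring: h_i^T s_j costs O(D) per pair, O(D|S||T|) overall.
def dot_context(h):
    return softmax(enc @ h) @ enc

contexts = np.stack([dot_context(h) for h in dec])  # (T, D)
```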
Memory-Based Attention Model (proposed method)
● During encoding, they compute an attention matrix C (see the sketch below).
○ Size of C and W: K×D
○ K: the number of attention vectors
○ Computational time: O(KD|S|)
● C is regarded as a compact, fixed-size memory.
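A hedged sketch of the encoding step, reconstructed from this slide alone: C and W are K×D, and each row of C is a weighted average of the encoder states, at cost O(KD|S|). The exact parameterization of the scores is my assumption, not necessarily the paper's:

```python
import numpy as np

K, D, S = 32, 256, 100
enc = np.random.randn(S, D)          # encoder states s_j
W = np.random.randn(K, D) * 0.01     # learned scoring parameters (assumed form)

scores = W @ enc.T                   # (K, S): one score per (slot, position)
scores -= scores.max(axis=1, keepdims=True)
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over source positions
C = alpha @ enc                      # (K, D): compact fixed-size memory
```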
Memory-Based Attention Model (proposed method)
● During decoding, they compute the context vector c (see the sketch below).
○ The attention is computed over C instead of over the encoder states.
● Total computational time: O(KD(|S| + |T|))
○ They expect their model to be faster than the O(D²|S||T|) baseline.
○ For long sequences (large |S|), the model is also faster than dot attention.
● They use a sigmoid function instead of a softmax to calculate the attention scores.
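Continuing the sketch above: decoding reads only the K×D memory C, never the |S| encoder states, so each of the |T| steps costs O(KD) and the total is O(KD(|S| + |T|)). Per the slide, the scores are sigmoids (so they need not sum to 1); the exact scoring form is again my assumption:

```python
def memory_context(h, C):
    gates = 1.0 / (1.0 + np.exp(-(C @ h)))   # (K,) sigmoid attention scores
    return gates @ C                          # context vector c_t, shape (D,)

c_t = memory_context(np.random.randn(D), C)
```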
Position Encoding
● The calculation of C does not depend on k.
○ "we would hope for the model to learn to generate distinct attention contexts"
● They add position encodings (illustrated below).
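Without any dependence on k, all K rows of C could collapse into the same context. A hedged illustration of the remedy, reusing the variables from the encoding sketch: bias slot k's scores toward a distinct region of the source before the softmax. This linear bias is a placeholder of mine, not necessarily the paper's exact position encoding:

```python
# Slot k prefers source positions near the fraction k/K of the sentence.
j = np.arange(S) / max(S - 1, 1)              # source position, scaled to [0, 1]
k = np.arange(K) / max(K - 1, 1)              # memory slot index, scaled to [0, 1]
pos_bias = -np.abs(k[:, None] - j[None, :])   # (K, S) positional penalty
scores = W @ enc.T + pos_bias                 # then softmax and C as before
```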
Experiment 1: Toy copying (like NTM [Graves+ 2014])
● Copy a random sequence (data sketched below).
○ Length: 0 to {10, 50, 100, 200}
● Vocabulary size: 20
● 2-layer, bi-directional LSTM (256 units)
● Dropout: 0.2
● Train : test = 100,000 : 1,000
○ Batch size: 128
○ Trained for 200,000 steps.
● Hardware: one K40m GPU
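A hedged sketch of the toy copy data from this slide: random token sequences over a 20-symbol vocabulary whose target equals the source. The helper name and sampling details are mine; the lengths and sizes come from the slide:

```python
import numpy as np

def make_copy_batch(batch_size=128, max_len=100, vocab=20, rng=None):
    rng = rng or np.random.default_rng(0)
    lengths = rng.integers(1, max_len + 1, size=batch_size)
    return [(seq, seq.copy())                 # (source, target) pairs
            for n in lengths
            for seq in [rng.integers(0, vocab, size=n)]]

batch = make_copy_batch()   # 128 pairs of identical source/target sequences
```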
Result
● The vanilla encoder-decoder is weak.
● The appropriate value of K depends on the sequence length.
● Decoding becomes comparatively faster as the sequence length grows.
● "Traditional attention may be representing the source with redundancy and wasting computational resources."
Experiment 2: Neural Machine Translation
● WMT'17
○ English-Czech (52M sentences)
○ English-German (5.9M sentences)
○ English-Finnish (2.6M sentences)
○ English-Turkish (207K sentences)
○ Dev: newstest2015, Test: newstest2016 (en-tr is not included in newstest2016)
○ The average sentence length in the test data is 35.
● Hyperparameters (summarized below)
○ Vocabulary: 16,000 subwords (BPE)
○ Hidden state size: 512
○ Other settings are the same as in the copy experiments.
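For convenience, the setup on this slide, plus the settings carried over from Experiment 1, can be summarized as a config dict; the key names are my own, the values come from the slides:

```python
# Hypothetical config summarizing the slides; key names are illustrative.
nmt_config = {
    "vocab_size": 16_000,       # BPE subwords
    "hidden_size": 512,
    "encoder": "2-layer bidirectional LSTM",  # carried over from Experiment 1
    "dropout": 0.2,
    "batch_size": 128,
    "train_steps": 200_000,
    "dev_set": "newstest2015",
    "test_set": "newstest2016",
}
```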
Result
Learning curves
Discussion
● “Our memory attention model performs on-par with, or
slightly better, than the baseline model”
● Position encoding improves model performance.
● Tasks where K << |T| (for example, summarization) should benefit most from this model.
● Decoding time decreased.
Discussion
● The softmax/softmax variant (softmax scoring at both encoding and decoding time) performs badly.
Visualizing: each memory slot (no position encoding)
Visualizing: when K is small
Visualizing: when the model tries to translate
Conclusion
● They proposed a memory-based attention mechanism.
● Their technique leads to a speedup.
● It can fit complex data, such as NMT.
Impressions
● Getting a speedup with almost no change in performance is a good thing.
● I would have liked a comparison with the local attention of [Luong+ EMNLP'15].
○ Isn't this something like precomputing a local-attention-style summary in advance?
○ Especially since Luong is a co-author.
● It is puzzling that the attention spreads out even without position encoding.
● I think this approach is better suited to summarization.
