240311_JW_labseminar[Sequence to Sequence Learning with Neural Networks].pptx
1. Jin-Woo Jeong
Network Science Lab
Dept. of Mathematics
The Catholic University of Korea
E-mail: zeus0208b@gmail.com
Ilya Sutskever, Oriol Vinyals, Quoc V. Le
2. 1
Previous work
• LSTM
Introduction
• Motivation
• Introduction
Model
Experiment
• Dataset
• Training details
• Experimental results
Conclusion
Q/A
3. 2
Previous work
LSTM
• In the previous presentation, I built a stock-price prediction program using an LSTM, but I only checked
the predictions visually and did not measure accuracy. For time-series data, as with regression, accuracy is
difficult to measure directly, so I evaluated the model with MAE (mean absolute error), computing it on
both the training set and the test set.
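As a rough sketch of the evaluation step (the prices below are hypothetical, not my actual data), MAE can be computed like this:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average |prediction - target|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Hypothetical closing prices (in won) and model predictions.
train_true = [1000, 1010, 1020, 1030]
train_pred = [ 998, 1012, 1019, 1033]
print(mae(train_true, train_pred))  # -> 2.0 (won, on average)
```

The same function is applied to the held-out test predictions to get the test MAE.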
4. 3
Introduction
Motivation
• DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of
fixed dimensionality. This is a significant limitation, since many important problems are best expressed with
sequences whose lengths are not known a priori.
• It is therefore clear that a domain-independent method that learns to map sequences to sequences would
be useful.
• In this paper, they show that a straightforward application of the Long Short-Term Memory (LSTM)
architecture can solve general sequence-to-sequence problems.
5. 4
Introduction
Introduction
• The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large
fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from
that vector.
• The model stops making predictions after outputting the end-of-sentence token.
• The encoder LSTM reads the input sentence in reverse, because doing so introduces many short-term
dependencies in the data that make the optimization problem much easier.
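A minimal sketch of how a training pair might be prepared with this reversal trick (the token lists and the helper name are my own illustration, not from the paper):

```python
def prepare_pair(src_tokens, tgt_tokens, eos="<eos>"):
    """Reverse the source sentence (the paper's trick) but keep the
    target in its original order, appending the end-of-sentence token."""
    return list(reversed(src_tokens)), tgt_tokens + [eos]

src, tgt = prepare_pair(["a", "b", "c"], ["x", "y", "z"])
print(src)  # ['c', 'b', 'a']
print(tgt)  # ['x', 'y', 'z', '<eos>']
```

Only the source side is reversed; the decoder still produces the target left to right until it emits `<eos>`.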
6. 5
Introduction
Introduction
• Here is a figure for better understanding. The source sentence enters the encoder, a single context
vector carrying the information of the sentence is passed to the decoder, and translation then
proceeds until <eos> is produced.
• The encoder and decoder have separate parameters (weights).
• In this figure the source sentence is shown in its original order, but the authors repeatedly
note that reversing the tokens of the source sentence greatly improved the performance
of the model.
• They state that this simple trick of reversing the words in the source sentence is
one of the key technical contributions of this work.
7. 6
Model
Model
The goal of the LSTM is to estimate the conditional probability p(y_1, ⋯, y_T′ | x_1, ⋯, x_T), where
(x_1, ⋯, x_T) is an input sequence and (y_1, ⋯, y_T′) is its corresponding output sequence, whose length T′ may
differ from T. The LSTM computes this conditional probability by first obtaining the fixed-dimensional
representation v of the input sequence (x_1, ⋯, x_T), given by the last hidden state of the LSTM, and then
computing the probability of (y_1, ⋯, y_T′) with a standard LSTM-LM formulation whose initial hidden state
is set to the representation v of (x_1, ⋯, x_T):

p(y_1, ⋯, y_T′ | x_1, ⋯, x_T) = ∏_{t=1}^{T′} p(y_t | v, y_1, ⋯, y_{t−1})

In this equation, each p(y_t | v, y_1, ⋯, y_{t−1}) distribution is represented with a softmax over all the words in
the vocabulary.
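The factorization p(y_1, ⋯, y_T′ | x) = ∏_t p(y_t | v, y_1, ⋯, y_{t−1}) can be sketched numerically: given per-step vocabulary logits from a decoder (the toy logits below are made up for illustration), the sequence log-probability is just the sum of per-step softmax log-probabilities.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sequence_log_prob(step_logits, target_ids):
    """log p(y_1..y_T' | x) = sum_t log p(y_t | v, y_1..y_{t-1}),
    where step_logits[t] are the decoder's vocabulary logits at step t
    (assumed to already condition on v and the previous targets)."""
    return sum(float(np.log(softmax(l)[y]))
               for l, y in zip(step_logits, target_ids))

# Two decoding steps over a 3-word toy vocabulary.
logits = [np.array([2.0, 0.5, 0.1]), np.array([0.1, 3.0, 0.2])]
print(sequence_log_prob(logits, [0, 1]))  # log-probability of the sequence
```

Working in log space avoids underflow from multiplying many small probabilities.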
8. 7
Experiments
Experiments
• Dataset: the WMT'14 English-to-French MT task
• In the experiments, they used two different LSTMs (an encoder and a decoder), chose LSTMs with four
layers, and reversed the order of the words of the input sentences (but not of the target sentences) in
both the training and test sets.
• Training details
• All of the LSTM's parameters were initialized from the uniform distribution between -0.08 and 0.08.
• They used stochastic gradient descent without momentum, with a fixed learning rate of 0.7. After 5 epochs,
they began halving the learning rate every half epoch, and trained the models for a total of 7.5
epochs.
• They used batches of 128 sequences.
• They made sure that all sentences in a minibatch were roughly of the same length.
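The learning-rate schedule described above (fixed at 0.7 for 5 epochs, then halved every half epoch until epoch 7.5) can be written as a small function; the function name and argument defaults are my own:

```python
def learning_rate(epoch, base_lr=0.7, halve_after=5.0, halve_every=0.5):
    """Schedule from the paper: a fixed rate for the first 5 epochs,
    then halved every half epoch (training stops at 7.5 epochs).
    `epoch` may be fractional."""
    if epoch <= halve_after:
        return base_lr
    halvings = int((epoch - halve_after) / halve_every)
    return base_lr / (2 ** halvings)

for e in [1.0, 5.0, 5.5, 6.0, 7.5]:
    print(e, learning_rate(e))
```

By the end of training (epoch 7.5) the rate has been halved five times, down to 0.7 / 32.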
13. 12
Conclusion
Conclusion
• The success of their simple LSTM-based approach on MT suggests that it should do well on many other
sequence learning problems, provided they have enough training data.
• They conclude that it is important to find a problem encoding that has the greatest number of short-term
dependencies, as these make the learning problem much simpler.
• These results suggest that their approach will likely do well on other challenging sequence-to-sequence
problems.
As a result, the training MAE is 36 won, which means that on average the difference between the predicted and the actual stock price is about 36 won.
The test MAE, however, is almost 5,000 won, which does not look like a good number. Still, when we look at the graph, the model captures the overall trend of the price correctly.
In this paper, they propose an efficient Seq2Seq machine translation architecture using LSTMs.
Seq2Seq was a breakthrough for deep-learning-based machine translation.
1000-dimensional word embeddings.
When initializing the parameters to follow a uniform distribution, the range was set between -0.08 and 0.08.
While most of the sentences are short, a single long sentence in a batch would be computationally wasteful because all the other sentences in that batch would have to be padded. Therefore, they grouped sentences of similar length together, which sped up training.
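A minimal sketch of this length bucketing (sorting by length before slicing into batches; the helper name and toy sentences are my own illustration):

```python
def length_bucketed_batches(sentences, batch_size):
    """Group sentences of similar length into minibatches so that
    little padding is needed inside each batch."""
    ordered = sorted(sentences, key=len)   # stable sort by token count
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

sents = [["a"], ["a", "b", "c"], ["a", "b"], ["x"]]
for batch in length_bucketed_batches(sents, 2):
    print([len(s) for s in batch])  # [1, 1] then [2, 3]
```

Each batch now contains sentences of nearly equal length, so padding overhead is small.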
The BLEU score is one measure of machine translation quality.
Applying the ensemble technique to the LSTM alone is enough to outperform the baseline model.
When the deep learning approach was combined with traditional statistical techniques, it performed well.
Combined with a traditional SMT system, the result comes close to the best WMT'14 result.
This is a two-dimensional projection of the LSTM hidden-state values using PCA.
We can see that phrases with similar meanings are clustered well together.
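A minimal numpy-only sketch of such a projection (via SVD of the centred states; the random vectors stand in for real hidden states, which would come from the trained encoder):

```python
import numpy as np

def pca_2d(states):
    """Project hidden-state vectors onto their first two principal
    components, as in the paper's visualization figure."""
    X = np.asarray(states, dtype=float)
    X = X - X.mean(axis=0)                       # centre the data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # shape (n_samples, 2)

rng = np.random.default_rng(0)
states = rng.normal(size=(10, 8))                # 10 hypothetical hidden states
print(pca_2d(states).shape)                      # (10, 2)
```

Plotting the two resulting coordinates per sentence gives the kind of cluster picture shown on the slide.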
This shows that, compared to the baseline model, the performance is better.