Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
1
Background
• Machine translation is a field that has been extensively researched for a long time in the areas of NLP (Natural
Language Processing) and Computer Science.
• Initially, research in this field was based on rule-based methods, but these methods did not yield good
performance.
• Later, a shift towards statistical methods led to significant improvements in performance.
• The emergence of Neural Machine Translation (NMT) has received considerable attention and acclaim in the
field.
Machine Translation
2
Background
• RBMT is designed based on the grammatical, syntactic, and lexical rules of each language.
• Since the order of sentences varies from language to language, specific translation rules are required for each
language.
• The process involves converting the analyzed language into an intermediate language (interlingual) and then
mapping it back to the target language's words through a reverse process.
• This method lacks flexibility and has difficulty adapting to new languages or expressions.
Rule-based Machine Translation (RBMT)
3
Background
• SMT operates using statistical probability models.
• It estimates the probability of a given source language sentence being translated into a target language
sentence.
• Unlike the rule-based approach of RBMT, SMT relies more on patterns learned from large amounts of data
rather than on the rules and structure of the language.
Statistical Machine Translation (SMT)
4
Background
• NMT is a machine translation approach based on artificial neural networks.
• It converts input sentences into vector forms.
• Utilizes an Encoder – Decoder structure.
• Requires only pairs of data – operates in an End-to-End manner.
Neural Machine Translation (NMT)
5
Background
• Example sentence: "I was wondering if you can help me on this problem."
• Tokenization: Breaking down the sentence into individual words or tokens.
• Example: "I / was / wondering / if / you / can / help / me / on / this / problem"
• Cleaning and Extraction: Removing unnecessary words and extracting the essential ones.
• Example: "i / wondering / you / help / problem"
• Encoding: Representing words or phrases in a numerical format.
• Example 1: "4, 6, 1, 5, 7"
• Example 2: "[1, 0], [0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 0]"
Text preprocessing
6
Background
• Breaking down the sentence "화분에 예쁜 꽃이 피었다" into
individual grammatical components.
• Tokenization:
• 화분(명사) / 에(조사) / 예쁘(어간) / ㄴ(어미) / 꽃(명사) /
이(조사) / 피(어간) / 었(어미) / 다(어미)
• Conjugation of the verb '모르다'
Tokenization
• Tokenization is very difficult, especially for a morphologically rich language such as Korean.
• Hence the need for efficient preprocessing.
7
Background
• Example code for tokenization using NLTK's TreebankWordTokenizer (a runnable sketch follows this slide).
Tokenization
8
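The code on the slide is shown as an image; below is a minimal sketch of the same idea using NLTK's TreebankWordTokenizer (assuming NLTK is installed; the sentence is the running example from the earlier slide).

```python
from nltk.tokenize import TreebankWordTokenizer

# Split the example sentence into word-level tokens (Penn Treebank conventions).
tokenizer = TreebankWordTokenizer()
sentence = "I was wondering if you can help me on this problem."
tokens = tokenizer.tokenize(sentence)
print(tokens)
# Roughly: ['I', 'was', 'wondering', 'if', 'you', 'can', 'help', 'me', 'on', 'this', 'problem', '.']
```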
Background
• Cleaning:
• Converting from uppercase to lowercase.
• Removing words with low frequency of occurrence.
• Eliminating short words, pronouns, and articles.
• Stemming(어간 추출):
• Extracting word stems.
• Example 1: Using the Porter Algorithm.
• Lemmatization(표제어 추출):
• Extracting lemmas (base forms of words).
• Stopword(불용어 제거):
• Removing stopwords (commonly used words that may be irrelevant in some contexts).
Cleaning and Extraction
9
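As an illustration of these cleaning and extraction steps, a short NLTK-based sketch (the toolkit choice and the crude length-based cleaning rule are assumptions; the WordNet and stopword corpora must be downloaded once):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# One-time downloads for the lemmatizer and the stopword list.
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

tokens = ["I", "was", "wondering", "if", "you", "can", "help", "me", "on", "this", "problem"]

# Cleaning: lowercase everything and drop very short words (a crude stand-in for
# removing pronouns and articles).
cleaned = [t.lower() for t in tokens if len(t) > 2]

# Stemming with the Porter algorithm vs. lemmatization with WordNet.
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in cleaned])        # e.g. 'wondering' -> 'wonder'
print([lemmatizer.lemmatize(t) for t in cleaned])

# Stopword removal.
stop_words = set(stopwords.words("english"))
print([t for t in cleaned if t not in stop_words])
```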
Background
• Integer-Encoding
Encoding
10
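The slide's figure is an image; a minimal sketch of frequency-based integer encoding on a toy corpus (the corpus and the convention of reserving 0 for padding are illustrative assumptions):

```python
from collections import Counter

corpus = [["i", "wondering", "you", "help", "problem"],
          ["you", "help", "me"]]

# Count word frequencies, then assign smaller integer ids to more frequent words.
counts = Counter(word for sent in corpus for word in sent)
word_to_id = {word: idx + 1 for idx, (word, _) in enumerate(counts.most_common())}  # 0 kept for padding

encoded = [[word_to_id[w] for w in sent] for sent in corpus]
print(word_to_id)
print(encoded)  # each sentence becomes a list of integer ids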
Problem Statement
• The problem of a bottleneck due to fixed-size vectors.
• In Neural Machine Translation (NMT), performance significantly deteriorates with longer sentences.
Issues with the Encoder-Decoder Model
11
Previous work
• A deep learning structure designed for analyzing sequential data.
• The output o(2) reflects both past and current information, since it is computed from h(1) and x(2).
• U: Input layer to hidden layer.
• W: Hidden layer at time t to hidden layer at time t+1.
• V: Hidden layer to output layer.
Recurrent Neural Networks (RNN)
12
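A numerical sketch of a single vanilla-RNN step with the weight matrices U, W, V named as on the slide (the layer sizes and random toy inputs are arbitrary assumptions):

```python
import numpy as np

input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_dim, input_dim))    # input layer -> hidden layer
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden layer at t -> hidden layer at t+1
V = rng.normal(size=(output_dim, hidden_dim))   # hidden layer -> output layer

def rnn_step(x_t, h_prev):
    """h_t mixes the current input with the previous hidden state; o_t is read from h_t."""
    h_t = np.tanh(U @ x_t + W @ h_prev)
    o_t = V @ h_t
    return h_t, o_t

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):     # a toy length-5 input sequence
    h, o = rnn_step(x_t, h)
```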
Previous work
• RNNs are models that are flexible in terms of the length of input and output values, allowing for the
construction of RNNs in various structures depending on the form of the input and output.
Recurrent Neural Networks (RNN)
13
Previous work
Recurrent Neural Networks (RNN)
14
Previous work
• As the number of time steps in a vanilla RNN grows, a long-term dependency problem arises: information from earlier time steps is not sufficiently carried forward to later ones.
• If the information needed for a prediction appears at the beginning of the sequence, the model can no longer predict effectively.
• Example 1: "I grew up in France and want to be a plumber who is the best in the world and I speak
fluent French."
Long Short-Term Memory (LSTM)
15
Previous work
Long Short-Term Memory (LSTM)
• The core idea of LSTM is to store the information from previous steps in
a memory cell and pass it forward.
• It determines how much of the past information to forget based on the
current information, multiplies it accordingly, and then adds the current
information to this result to pass it on to the next time step.
16
Previous work
Long Short-Term Memory (LSTM)
• Forget Gate
• A gate that decides how much of the past
information to forget.
17
Previous work
Long Short-Term Memory (LSTM)
• Input Gate & Input Candidate
• The Input Gate determines the amount of
current information to be entered into the cell,
and the Input Candidate calculates the current
information.
18
Previous work
Long Short-Term Memory (LSTM)
• Calculations in the Memory Cell
• The process of storing information in the
memory cell using the Forget Gate, Input Gate,
and Input Candidate.
19
Previous work
Long Short-Term Memory (LSTM)
• Output Gate
• The Output Gate decides the amount of the
current memory cell value to be output as the
current hidden layer value.
• Output Layer
• The calculation for the output layer is the same
as in RNN.
20
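Putting the pieces above together, a sketch of one LSTM step combining the Forget Gate, Input Gate, Input Candidate, memory-cell update, and Output Gate (the weight shapes, omitted biases, and toy sizes are my assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_dim, input_dim = 8, 4
rng = np.random.default_rng(0)

def new_weight():
    return rng.normal(size=(hidden_dim, hidden_dim + input_dim))

W_f, W_i, W_c, W_o = (new_weight() for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z)               # forget gate: how much of the past cell to keep
    i_t = sigmoid(W_i @ z)               # input gate: how much new information to write
    c_tilde = np.tanh(W_c @ z)           # input candidate: the current information itself
    c_t = f_t * c_prev + i_t * c_tilde   # memory cell update
    o_t = sigmoid(W_o @ z)               # output gate
    h_t = o_t * np.tanh(c_t)             # hidden state fed to the output layer
    return h_t, c_t
```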
Previous work
• The Gated Recurrent Unit (GRU) is a model that simplifies the operational process of LSTM.
GRU
21
Previous work
GRU
• Reset Gate
• The purpose of the Reset Gate in GRU is to
appropriately reset past information. It uses the
sigmoid function as its output, generating
values between (0, 1), which are then multiplied
with the previous hidden layer.
22
Previous work
GRU
• Update Gate
• The Update Gate combines the roles of LSTM's forget gate and input gate, determining the ratio at which past and current information are blended when the state is updated.
23
Previous work
GRU
• Candidate
• This stage involves calculating the current time
step's candidate information.
• The key aspect is that it doesn't use the
information from the past hidden layer directly;
instead, it multiplies it by the result of the reset
gate.
24
Previous work
GRU
• Hidden Layer Calculation
• This step involves calculating the current hidden
layer by combining the results of the update
gate and the candidate.
• The output of the sigmoid function determines
the amount of information from the current
time step, while 1 minus the sigmoid function's
output determines the amount of information
from the previous time step.
• It varies by context whether LSTM or GRU is more effective.
• A clear advantage of GRU is that it has fewer weights to learn.
25
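A corresponding sketch of one GRU step; the mixing convention (u_t weighting the candidate and 1 − u_t weighting the previous hidden state) follows the slide's description, and the shapes and sizes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_dim, input_dim = 8, 4
rng = np.random.default_rng(0)
W_r, W_u, W_h = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for _ in range(3))

def gru_step(x_t, h_prev):
    z = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ z)                                         # reset gate
    u_t = sigmoid(W_u @ z)                                         # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate uses the reset past
    h_t = u_t * h_tilde + (1.0 - u_t) * h_prev                     # mix current and past information
    return h_t
```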
Methodology
Methodology
26
Methodology
• Assume that the hidden state up to step t-1 has been calculated.
Methodology
27
Methodology
Methodology
• Apply an FC (fully connected) layer followed by the tanh function.
• The distribution obtained after applying softmax is the Attention Distribution.
• A Context vector is created through a weighted sum.
28
Methodology
Methodology
• Concatenate the Context vector with the output at time t-1 and feed it into the decoder.
• Then, after passing through an FC layer and softmax, predict the output at time t.
29
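A numerical sketch of the attention steps on the last two slides (FC layer + tanh to score each source position, softmax for the Attention Distribution, weighted sum for the Context vector); the additive scoring form and all dimensions are illustrative assumptions in the spirit of Bahdanau attention:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Tx, enc_dim, dec_dim, att_dim = 6, 8, 8, 10
rng = np.random.default_rng(0)
H = rng.normal(size=(Tx, enc_dim))          # encoder hidden states h_1 .. h_Tx
s_prev = rng.normal(size=dec_dim)           # decoder hidden state at step t-1
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=att_dim)

# FC layer + tanh produces one alignment score per source position.
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
attention_distribution = softmax(scores)      # softmax over source positions
context_vector = attention_distribution @ H   # weighted sum of encoder states
```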
Baseline
Base RNN decoder
• Encoder's Hidden State
• The encoder converts the input sentence x = (x_1, x_2, ..., x_Tx) into a fixed-length vector.
• This fixed-length vector is also known as the context vector c; h_t denotes the encoder hidden state at time t.
30
Baseline
Base RNN decoder
• Given the context vector c from the encoder, the next word y_t is predicted based on c and the previously predicted words y_1, y_2, ..., y_{t-1}.
• The translation is generated through the conditional probability written out below, and the objective is to maximize it.
31
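The conditional probability referred to above, written out (following the paper's formulation):

p(\mathbf{y}) = \prod_{t=1}^{T} p\big(y_t \mid \{y_1, \ldots, y_{t-1}\},\, c\big)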
Baseline
• A forward RNN reads the sequence from the beginning in order and computes the forward hidden states.
• A backward RNN reads the sequence in reverse, from the end to the
beginning, and calculates the backward hidden state.
• For each word, concatenate the forward hidden state and the backward
hidden state.
Align and translate
• h_i contains information from both before and after the word.
32
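In formula form, each annotation h_i concatenates the forward and backward hidden states (as in the paper):

h_i = \left[\, \overrightarrow{h}_i^{\top} \,;\, \overleftarrow{h}_i^{\top} \,\right]^{\top}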
Baseline
Bahdanau Attention Decoder
• The conditional probability p(y_i | y_1, ..., y_{i-1}, x) is the probability of the target word y_i given the previously generated words and the input sentence x.
• It is computed by a function g that takes the previous output y_{i-1}, the current decoder hidden state s_i, and the context vector c_i.
• The hidden state s_i is updated by a function f whose inputs are the previous hidden state s_{i-1}, the previous output word y_{i-1}, and the context vector c_i.
• The context vector c_i is computed as a weighted sum of the encoder hidden states (annotations) h_j.
• The weights α_ij are determined by the attention mechanism.
• The attention weights α_ij are obtained by applying a softmax to the alignment scores e_ij, which measure how well the inputs around position j match the output at position i.
• The alignment model a computes e_ij as a function of the decoder's previous hidden state s_{i-1} and the encoder hidden state h_j.
33
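Collected as formulas (as given in the paper):

p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i), \qquad s_i = f(s_{i-1}, y_{i-1}, c_i)

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)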
Experiments
• Using the WMT’14 dataset for translating English to French.
• No UNK: Sentences without any unknown words.
• RNNsearch-50*: Trained for an extended period until there was no further improvement in performance on
the development set.
• Moses: A conventional phrase-based translation system that utilizes a separate monolingual corpus.
Quantitative results
• The longer the sentence, the more significant the difference in performance.
34
Experiments
Alignment matrix (attention weight visualization)
35
Experiments
Alignment matrix (attention weight visualization)
36
Paper review
• This paper introduces a novel architecture, 'RNNsearch', to overcome the limitations of the conventional encoder-decoder approach to neural machine translation by dynamically searching for the relevant input words (annotations) while generating each target word, rather than compressing the source into a single fixed-length vector.
• The experimental results show that RNNsearch significantly outperforms the conventional encoder-decoder model,
especially in translating longer sentences, and demonstrates robustness against the length of the source sentence.
• Remarkably, RNNsearch achieves a translation performance comparable to existing phrase-based statistical
machine translation systems, marking a promising step towards improved machine translation and a deeper
understanding of natural languages.
Conclusions
Editor's Notes

1. Frequency-based sorting and padding, one-hot encoding for the cross-entropy computation, Word2Vec encoding, and TF-IDF encoding (plus an interim summary) could also be covered, but are not treated separately here.
2. A deep learning structure for analyzing sequential data. o(2) reflects both past and current information via h(1) and x(2). U: input layer → hidden layer; W: hidden layer at time t → hidden layer at time t+1; V: hidden layer → output layer.
3. Because RNNs are flexible with respect to the lengths of their inputs and outputs, they can be arranged in various structures depending on the form of the input and output.
4. As the time steps of a vanilla RNN grow, the long-term dependency problem arises: earlier information is not sufficiently carried forward. If the information needed for a prediction lies at the beginning, prediction becomes infeasible.
5. The core idea of LSTM is to store information from previous steps in a memory cell and pass it along. Based on the current information, it decides how much of the past to forget (by multiplication), adds the current information to that result, and passes it to the next time step.
6. Multiply the current input and the previous hidden state by their respective weights, add them, and apply the sigmoid; the output is multiplied with the cell state of the previous step. Since the sigmoid lies in (0, 1), a value near 1 keeps most of the past information, while a value near 0 discards most of it.
7. The information the current step actually carries (the input candidate) is written to the cell, weighted by how important it is (the input gate).
8. First, past information is forgotten by the amount computed in the forget gate; then the current candidate, scaled by the input gate, is added to obtain the memory cell at the current step. In the formula, * denotes a pointwise operation.
9. The Gated Recurrent Unit (GRU) is a model that slightly simplifies the operation of LSTM.
10. The reset gate aims to appropriately reset past information: it uses a sigmoid output in (0, 1) and multiplies it with the previous hidden state.
11. The update gate is like a merger of LSTM's forget gate and input gate, deciding the ratio at which past and current information are refreshed. The sigmoid output u(t) determines the amount of current information, and (1 − u(t)) is multiplied with the previous hidden state; these are analogous to LSTM's input gate and forget gate, respectively.
12. The step that computes the current candidate information. The key point is that the previous hidden state is not used directly; it is first multiplied by the reset-gate output. Here τ is the hyperbolic tangent and * is a pointwise operation.
13. The step that combines the update-gate result and the candidate to compute the current hidden state. The sigmoid output determines the amount of current information, and one minus the sigmoid output determines the amount of past information. Depending on the task, either LSTM or GRU may perform better; a clear advantage of GRU is that it has fewer weights to learn.
14. Input to the RNN-based encoder.
15. Assume the hidden states up to step t−1 have been computed; input to the RNN-based encoder.
16. Apply an FC layer and then tanh; the distribution obtained after softmax is the attention distribution; the context vector is formed by a weighted sum.
17. The context vector is concatenated with the output at time t−1 and fed into the decoder; then, through an FC layer and softmax, the output at time t is predicted.
18. The encoder's hidden state: the encoder converts the input sentence x = (x_1, x_2, ..., x_Tx) into a fixed-length vector, the context vector c. h_t denotes the hidden state at time t. (y_{t−1}: previous word, s_t: hidden state, c: context vector.)
19. Given the context vector c from the encoder, the next word y_t is predicted from c and the previously predicted words y_1, y_2, ..., y_{t−1}. The translation is generated through the conditional probability below, which is maximized. (y_{t−1}: previous word, s_t: hidden state, c: context vector.)
20. Bidirectional RNN, attention, decoder.
21. y_{t−1}: previous word; s_t: hidden state; c: context vector; a: log softmax.
22. RNNencdec-30: the baseline without attention; RNNsearch: with attention applied.
23. RNNencdec-30: the baseline without attention; RNNsearch: with attention applied.
24. RNNencdec-30: the baseline without attention; RNNsearch: with attention applied.