Pointing the Unknown Words
ACL 2016
1
Pointing the Unknown Words
Caglar Gulcehre (Université de Montréal)
Sungjin Ahn (Université de Montréal)
Ramesh Nallapati (IBM T.J. Watson Research)
Bowen Zhou (IBM T.J. Watson Research)
Yoshua Bengio (Université de Montréal, CIFAR Senior Fellow)
Abstract (excerpt)
The problem of rare and unknown words is an important issue that can potentially affect the performance of many NLP systems, including both the traditional count-based and the deep learning models. We propose a novel way to deal with the rare and unseen words for the neural network models using attention. Our model uses two softmax layers in order to predict the [...]

Introduction (excerpt)
[...] softmax output layer where each of the output dimensions corresponds to a word in a predefined word shortlist. Because computing a high-dimensional softmax is computationally expensive, in practice the shortlist is limited to the top-K most frequent words in the training corpus. All other words are then replaced by a special word, called the unknown word (UNK).
The shortlist approach has two fundamental problems. The first problem, which is known as the rare word problem, is that some of the words [...]
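The shortlist-plus-UNK preprocessing described above can be summarized in a few lines. The sketch below is illustrative only; the helper names and the toy corpus are mine, not the paper's.

```python
from collections import Counter

UNK = "<unk>"

def build_shortlist(corpus_tokens, k):
    """Keep the K most frequent words; all other words will map to UNK."""
    counts = Counter(corpus_tokens)
    shortlist = [w for w, _ in counts.most_common(k)]
    word2id = {w: i for i, w in enumerate(shortlist)}
    word2id.setdefault(UNK, len(word2id))  # reserve an id for UNK
    return word2id

def encode(sentence_tokens, word2id):
    """Replace out-of-shortlist words by the UNK id."""
    unk_id = word2id[UNK]
    return [word2id.get(w, unk_id) for w in sentence_tokens]

# toy usage
tokens = "guillaume and cesar have a blue car in lausanne".split()
vocab = build_shortlist(tokens, k=5)
print(encode("cesar has a blue car".split(), vocab))
```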
2
French:  Guillaume et César ont une voiture bleue à Lausanne.
English: Guillaume and César have a blue car in Lausanne.
         (Copy)          (Copy)                  (Copy)

Figure 1: An example of how copying can happen for machine translation. Common words that appear in both the source and the target can directly be copied from the source to the target. The rest of the unknown words in the target can be copied from the input after being translated with a dictionary.
(Diagram: an RNN language model reading "killed a man yesterday ." token by token and predicting the next word at each step, ending with [eos].)

p_t = softmax(W_hp h_t + b_hp)                         (16)
h_t = RNN_{t' < t}(x_{w_{t'}})                         (17)
softmax(s)_i = exp(s_i) / Σ_{s_j ∈ s} exp(s_j)         (18)

where p_t is the distribution over the next word w_t, W_hp ∈ R^{V×N} is the output projection for a vocabulary of size V and hidden size N, and b_hp is the output bias; W_hp(w) and b_hp(w) denote the row and bias entry for word w.

3
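As a concrete illustration of Eqs. (16)-(18), here is a minimal NumPy sketch of the full-vocabulary softmax output layer. The shapes and random parameters are assumptions, and the RNN of Eq. (17) is abstracted away as a given hidden state h_t.

```python
import numpy as np

def softmax(s):
    """softmax(s)_i = exp(s_i) / sum_j exp(s_j), computed stably."""
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

V, N = 10000, 256                      # vocabulary size, hidden size (assumed)
rng = np.random.default_rng(0)
W_hp = rng.normal(scale=0.01, size=(V, N))   # output projection, W_hp in R^{V x N}
b_hp = np.zeros(V)                           # output bias

h_t = rng.normal(size=N)               # RNN hidden state at step t (Eq. 17, abstracted)
p_t = softmax(W_hp @ h_t + b_hp)       # Eq. (16): distribution over the vocabulary
w_t = int(p_t.argmax())                # greedy next-word choice
print(p_t.shape, p_t.sum(), w_t)
```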
Pointer Softmax

• Combine the shortlist softmax over the V vocabulary words with a pointer over the T source positions (the decoder softmax equations from the previous slide are shown again).

4
5
3 Neural Machine Translation Model with Attention [Bahdanau+15]

As the baseline neural machine translation system, we use the model proposed by (Bahdanau et al., 2014) that learns to (soft-)align and translate jointly. We refer to this model as NMT.

The encoder of the NMT is a bidirectional RNN (Schuster and Paliwal, 1997). The forward RNN reads the input sequence x = (x_1, ..., x_T) in the left-to-right direction, resulting in a sequence of hidden states (→h_1, ..., →h_T). The backward RNN reads x in the reversed direction and outputs (←h_1, ..., ←h_T). We then concatenate the hidden states of the forward and backward RNNs at each time step and obtain a sequence of annotation vectors (h_1, ..., h_T), where h_j = [→h_j || ←h_j]. Here, || denotes the concatenation operator. Thus, each annotation vector h_j encodes information about the j-th word with respect to all the other surrounding words. The decoder uses a GRU, and a deep output layer (Pascanu et al., 2013) is used to compute the probability of the target words [...].

Decoder with attention (Bahdanau et al., 2014), with the paper's index i corresponding to timestep t here (t = i):

p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, ..., y_{t-1}}, c)                                  (2)
p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)                                   (3)
p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),   s_i = f(s_{i-1}, y_{i-1}, c_i)  (4)
c_i = Σ_{j=1}^{T_x} α_{ij} h_j                                                         (5)
α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik}),   e_{ij} = a(s_{i-1}, h_j)           (6)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. Unlike the basic encoder–decoder (Eq. (2)), the probability in Eq. (4) is conditioned on a distinct context vector c_i for each target word y_i. Each annotation h_j contains information about the whole input sequence with a strong focus on the parts surrounding the j-th word, the context vector c_i is a weighted sum of these annotations, and the score e_{ij} = a(s_{i-1}, h_j) measures how well the inputs around position j match the output at position i, based on the decoder hidden state s_{i-1} just before emitting y_i.

Figure 1 (Bahdanau et al., 2014): the proposed model generating the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T), with attention weights α_{t,1}, ..., α_{t,T} over the encoder annotations h_1, ..., h_T and decoder states s_{t-1}, s_t.

c.f. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention

6
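To make Eqs. (4)-(6) concrete, here is a small NumPy sketch of one attention step. The alignment model a(s_{i-1}, h_j) is assumed to take the usual single-hidden-layer MLP form; all shapes and parameters are illustrative, not the paper's.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

Tx, H, S = 7, 8, 6                     # source length, annotation size, decoder state size (assumed)
rng = np.random.default_rng(1)
h = rng.normal(size=(Tx, H))           # annotations h_1..h_Tx from the bidirectional encoder
s_prev = rng.normal(size=S)            # decoder state s_{i-1}

# assumed MLP alignment model: a(s_{i-1}, h_j) = v^T tanh(W_a s_{i-1} + U_a h_j)
W_a = rng.normal(size=(H, S))
U_a = rng.normal(size=(H, H))
v_a = rng.normal(size=H)

e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h[j]) for j in range(Tx)])  # energies e_ij (Eq. 6)
alpha = softmax(e)                     # attention weights alpha_ij (Eq. 6)
c = alpha @ h                          # context vector c_i as a weighted sum (Eq. 5)
print(alpha.round(3), c.shape)
```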
7
8
(Figure 1, the copying example for machine translation, shown again.)
9
10
Table 4 (excerpt): Generated summaries from NMT with PS. Boldface words are the words copied from the source.

Source #1:  china 's tang gonghong set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target #1:  china 's tang <unk> , sets world weightlifting record
NMT+PS #1:  china 's tang gonghong wins women 's weightlifting weightlifting title at asian games

The same example with anonymized entities (entity pointers):
Source #1:  <v1> 's <v2> <v3> set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target #1:  <v1> 's <v2> <v3> , sets world weightlifting record
11
• "gonghong" is outside the shortlist: the reference contains <unk>, while NMT+PS copies the word from the source.
12
The experimental results comparing the Pointer Softmax with the NMT model are displayed in Table 1 for the UNK-pointers data and in Table 2 for the entity-pointers data. As our experiments show, the pointer softmax improves over the baseline NMT on both the UNK data and the entities data. Our hope was that the improvement would be larger for the entities data, since the incidence of pointers was much greater. However, it turns out this is not the case, and we suspect the main reason is the anonymization of entities, which removed data sparsity by converting all entities to integer ids shared across all documents. We believe that on de-anonymized data our model could help more, since the issue of data sparsity is more acute in that case.

Table 1: Results on the Gigaword Corpus when pointers are used for UNKs in the training data, using Rouge-F1 as the evaluation metric.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt          34.87    16.54    32.27
NMT + lvt + PS     35.19    16.66    32.51

Table 4 (excerpt, continued): Generated summaries from NMT with PS. Boldface words are the words copied from the source.

Source #2:  owing to criticism , nbc said on wednesday that it was ending a three-month-old experiment that would have brought the first liquor advertisements onto national broadcast network television .
Target #2:  advertising : nbc retreats from liquor commercials
NMT+PS #2:  nbc says it is ending a three-month-old experiment

Source #3:  a senior trade union official here wednesday called on ghana 's government to be " mindful of the plight " of the ordinary people in the country in its decisions on tax increases .
Target #3:  tuc official , on behalf of ordinary ghanaians
NMT+PS #3:  ghana 's government urged to be mindful of the plight
13
Table 2: Results on the anonymized Gigaword Corpus when pointers are used for entities, using Rouge-F1 as the evaluation metric.

                 Rouge-1  Rouge-2  Rouge-L
NMT + lvt          34.89    16.78    32.37
NMT + lvt + PS     35.11    16.76    32.55

14
(Figure 1, the copying example for machine translation, shown again: common words appearing in both the source and the target can be copied directly; the remaining unknown target words can be copied from the input after dictionary translation.)
15
• Gradients are clipped when their norm exceeds 1 (Pascanu et al., 2012).

Table 5: Europarl Dataset (EN-FR)

            BLEU-4
NMT          20.19
NMT + PS     23.76

[...] we first check if the same word y_t appears in the source sentence. If it does not, we then check if a translated version of the word exists in the source sentence by using a look-up table between the source and the target language. If the word is in the source sentence, we then use the location of the word in the source as the target. Otherwise, we check if one of the English senses from the cross-language dictionary of the French word is in the source. If it is in the source sentence, then we use the location of that word as our translation. Otherwise, we just use the argmax of l_t as the target.

For the switching network d_t, we observed that using a two-layered MLP with the noisy-tanh activation function (Gulcehre et al., 2016) with a residual connection from the lower layer (He et al., 2015) [...].

In Table 5, we provide the result of NMT with the pointer softmax, and we observe about a 3.6 BLEU score improvement over our baseline.

Figure 4: A comparison of the validation learning curves [...].
16
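The pointer-target heuristic described on the slide above can be sketched as follows. The function name, the dictionary format, and the toy data are assumptions for illustration, not the authors' code.

```python
def pointer_target(y_t, source_tokens, fr2en_dict, l_t):
    """Pick the source position to point at for target word y_t.

    source_tokens : list of source-side tokens
    fr2en_dict    : assumed French->English lookup table {french_word: [english senses]}
    l_t           : attention/pointer distribution over source positions (list of floats)
    """
    # 1) the same word appears verbatim in the source
    if y_t in source_tokens:
        return source_tokens.index(y_t)
    # 2) some source word translates to y_t according to the dictionary
    for j, src in enumerate(source_tokens):
        if y_t in fr2en_dict.get(src, []):
            return j
    # 3) otherwise fall back to the most attended source position (argmax of l_t)
    return max(range(len(l_t)), key=lambda j: l_t[j])

# toy usage (all data here is illustrative)
src = "guillaume et cesar ont une voiture bleue a lausanne".split()
dico = {"voiture": ["car"], "bleue": ["blue"]}
print(pointer_target("cesar", src, dico, [0.1] * len(src)))   # exact match -> position 2
print(pointer_target("car", src, dico, [0.1] * len(src)))     # dictionary  -> position 5
```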
Figure 2: A depiction of the neural machine translation architecture with attention. At each timestep, the model generates the attention distribution l_t. We use l_t and the encoder's hidden states to obtain the context c_t. The decoder uses c_t to predict a vector of probabilities for the words w_t by using the vocabulary softmax.

4 The Pointer Softmax

In this section, we introduce our method, called the pointer softmax (PS), to deal with the rare and unknown words. The pointer softmax is applicable to many NLP tasks, because it resolves the limitations of neural networks with respect to unknown words. It can be used in parallel with other existing techniques such as the large vocabulary trick (Jean et al., 2014). Our model learns two key abilities jointly to make the pointing mechanism applicable in more general settings: (i) to predict whether it is required to use the pointing [...]. To accomplish this, we introduce a switching network to the model. The switching network, which is a multilayer perceptron in our experiments, takes the representation of the context sequence (similar to the input annotation in NMT) and the previous hidden state of the output RNN as its input. It outputs a binary variable z_t which indicates whether to use the shortlist softmax (when z_t = 1) or the location softmax (when z_t = 0). Note that if the word that is expected to be generated at each timestep is neither in the shortlist nor in the context sequence, the switching network selects the shortlist softmax, and then the shortlist softmax predicts UNK. The details of the pointer softmax model can be seen in Figure 3 as well.

Figure 3: A depiction of the Pointer Softmax (PS). (Diagram: the switching variable z_t weights, with p and 1-p, the vocabulary softmax y^w_t against the pointer distribution l_t / y^l_t over the source sequence; pointing copies the attended source word.)
17
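Here is a minimal NumPy sketch of how the switching network's output can combine the shortlist softmax and the location softmax into one distribution, as depicted in Figure 3. The sigmoid switch on [c_t; s_{t-1}] and all shapes are assumptions for illustration rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, Tx, H = 50, 9, 16                     # shortlist size, source length, feature size (assumed)
rng = np.random.default_rng(2)

shortlist_scores = rng.normal(size=V)    # decoder logits over the shortlist vocabulary
location_scores = rng.normal(size=Tx)    # attention logits over source positions (l_t)
c_t = rng.normal(size=H)                 # context vector (representation of the source)
s_prev = rng.normal(size=H)              # previous hidden state of the output RNN

# switching network: here a single linear layer on [c_t; s_prev] producing p = P(z_t = 1),
# i.e. the probability of using the shortlist softmax at this timestep
w_switch = rng.normal(size=2 * H)
p = sigmoid(w_switch @ np.concatenate([c_t, s_prev]))

y_w = softmax(shortlist_scores)          # shortlist softmax
y_l = softmax(location_scores)           # location softmax (pointer distribution)

# concatenated output distribution: first V entries = shortlist words, last Tx = source positions
output = np.concatenate([p * y_w, (1.0 - p) * y_l])
print(output.sum())                      # sums to 1: a valid distribution over words + positions
```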
Conclusion (excerpt)
[...] able to improve the results even when it is used together with the large-vocabulary trick. In the case of neural machine translation, we observed that training with the pointer softmax also improved the convergence speed of the model. For French-to-English machine translation on the Europarl corpora, we observe that using the pointer softmax can also improve the training convergence of the model.

7 Acknowledgments
We would also like to thank the developers of Theano (http://deeplearning.net/software/theano/) for developing such a powerful tool.

References
[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
[Bengio and Senécal2008] Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. Neural Networks, IEEE Transactions on, 19(4):713–722.
[Bordes et al.2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
[Cheng and Lapata2016] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
[Cho et al.2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[Chung et al.2014] Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
[Gulcehre et al.2016] Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. arXiv preprint arXiv:1603.00391.
[Gutmann and Hyvärinen2012] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.
[He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
[Jean et al.2014] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
[Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Luong et al.2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
[Morin and Bengio2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pages 246–252.
[Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
[Pascanu et al.2013] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.
[Schuster and Paliwal1997] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.
[Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[Theano Development Team2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.
[Tomasello et al.2007] Michael Tomasello, Malinda Carpenter, and Ulf Liszkowski. 2007. A new look at infant pointing. Child Development, 78(3):705–722.
[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2674–2682.
[Zeiler2012] Matthew D. Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Common Fixed Theorems Using Random Implicit Iterative SchemesCommon Fixed Theorems Using Random Implicit Iterative Schemes
Common Fixed Theorems Using Random Implicit Iterative Schemes
inventy
 
k-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsk-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture models
Frank Nielsen
 
SURF 2012 Final Report(1)
SURF 2012 Final Report(1)SURF 2012 Final Report(1)
SURF 2012 Final Report(1)Eric Zhang
 
Skiena algorithm 2007 lecture16 introduction to dynamic programming
Skiena algorithm 2007 lecture16 introduction to dynamic programmingSkiena algorithm 2007 lecture16 introduction to dynamic programming
Skiena algorithm 2007 lecture16 introduction to dynamic programmingzukun
 
Digital Signal Processing[ECEG-3171]-Ch1_L07
Digital Signal Processing[ECEG-3171]-Ch1_L07Digital Signal Processing[ECEG-3171]-Ch1_L07
Digital Signal Processing[ECEG-3171]-Ch1_L07
Rediet Moges
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
ananth
 
Statement of stochastic programming problems
Statement of stochastic programming problemsStatement of stochastic programming problems
Statement of stochastic programming problems
SSA KPI
 
Noise Immunity With Hermite Polynomial Presentation Final Presentation
Noise Immunity With Hermite Polynomial Presentation Final PresentationNoise Immunity With Hermite Polynomial Presentation Final Presentation
Noise Immunity With Hermite Polynomial Presentation Final Presentationguestf6db45
 
Dependent Types and Dynamics of Natural Language
Dependent Types and Dynamics of Natural LanguageDependent Types and Dynamics of Natural Language
Dependent Types and Dynamics of Natural Language
Daisuke BEKKI
 
DSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxDSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptx
HamedNassar5
 
Contemporary Models of Natural Language Processing
Contemporary Models of Natural Language ProcessingContemporary Models of Natural Language Processing
Contemporary Models of Natural Language Processing
Katerina Vylomova
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
Willy Marroquin (WillyDevNET)
 
Lecture7 channel capacity
Lecture7   channel capacityLecture7   channel capacity
Lecture7 channel capacity
Frank Katta
 

Similar to Pointing the Unknown Words (20)

Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
Ruifeng.pptx
Ruifeng.pptxRuifeng.pptx
Ruifeng.pptx
 
RNN and sequence-to-sequence processing
RNN and sequence-to-sequence processingRNN and sequence-to-sequence processing
RNN and sequence-to-sequence processing
 
Fol
FolFol
Fol
 
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowLearning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
 
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
 
Common Fixed Theorems Using Random Implicit Iterative Schemes
Common Fixed Theorems Using Random Implicit Iterative SchemesCommon Fixed Theorems Using Random Implicit Iterative Schemes
Common Fixed Theorems Using Random Implicit Iterative Schemes
 
k-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture modelsk-MLE: A fast algorithm for learning statistical mixture models
k-MLE: A fast algorithm for learning statistical mixture models
 
SURF 2012 Final Report(1)
SURF 2012 Final Report(1)SURF 2012 Final Report(1)
SURF 2012 Final Report(1)
 
Skiena algorithm 2007 lecture16 introduction to dynamic programming
Skiena algorithm 2007 lecture16 introduction to dynamic programmingSkiena algorithm 2007 lecture16 introduction to dynamic programming
Skiena algorithm 2007 lecture16 introduction to dynamic programming
 
Digital Signal Processing[ECEG-3171]-Ch1_L07
Digital Signal Processing[ECEG-3171]-Ch1_L07Digital Signal Processing[ECEG-3171]-Ch1_L07
Digital Signal Processing[ECEG-3171]-Ch1_L07
 
An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)An overview of Hidden Markov Models (HMM)
An overview of Hidden Markov Models (HMM)
 
Statement of stochastic programming problems
Statement of stochastic programming problemsStatement of stochastic programming problems
Statement of stochastic programming problems
 
Noise Immunity With Hermite Polynomial Presentation Final Presentation
Noise Immunity With Hermite Polynomial Presentation Final PresentationNoise Immunity With Hermite Polynomial Presentation Final Presentation
Noise Immunity With Hermite Polynomial Presentation Final Presentation
 
Dynamic programing
Dynamic programingDynamic programing
Dynamic programing
 
Dependent Types and Dynamics of Natural Language
Dependent Types and Dynamics of Natural LanguageDependent Types and Dynamics of Natural Language
Dependent Types and Dynamics of Natural Language
 
DSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptxDSP_DiscSignals_LinearS_150417.pptx
DSP_DiscSignals_LinearS_150417.pptx
 
Contemporary Models of Natural Language Processing
Contemporary Models of Natural Language ProcessingContemporary Models of Natural Language Processing
Contemporary Models of Natural Language Processing
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
Lecture7 channel capacity
Lecture7   channel capacityLecture7   channel capacity
Lecture7 channel capacity
 

Recently uploaded

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Pointing the Unknown Words

  • 1. Title slide: Pointing the Unknown Words (ACL 2016), by Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio (Université de Montréal and IBM T.J. Watson Research). The slide reproduces the first page of the paper, with its header and abstract.
  • 2. Figure 1 of the paper: an example of how copying can happen for machine translation. French: "Guillaume et Cesar ont une voiture bleue a Lausanne." English: "Guillaume and Cesar have a blue car in Lausanne." Common words that appear in both the source and the target can be copied directly from the source; the remaining unknown words in the target can be copied from the source after being translated with a dictionary.
  • 3. Baseline: an RNN predicts the next word w_t over the full vocabulary of size V (example on the slide: given "killed a man yesterday .", emit "a man yesterday . [eos]"). The equations are
  p_t = softmax(W^{hp} h_t + b^{hp}),
  h_t = \overrightarrow{RNN}_{t' \prec t}(x_{w_{t'}}),
  softmax(s)_i = exp(s_i) / \sum_{s_j \in s} exp(s_j),
  with W^{hp} \in R^{V \times N}, where N is the hidden-state size and W^{hp}(w), b^{hp}(w) denote the row and bias entry for word w. (A minimal code sketch of this softmax appears after the slide transcript.)
  • 4. The same next-word prediction written in two ways: a softmax over the vocabulary of size V, p_t = softmax(W^{hp} h_t + b^{hp}), or a softmax over the T positions of the source sentence. Combining the two is the idea behind the Pointer Softmax introduced by this paper.
  • 5. The baseline NMT model with attention (Bahdanau et al., 2014), which learns to (soft-)align and translate jointly (figure and equations from that paper; c.f. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention). The encoder is a bidirectional RNN (Schuster and Paliwal, 1997): the forward RNN reads x = (x_1, ..., x_T) left to right and produces (\overrightarrow{h}_1, ..., \overrightarrow{h}_T), the backward RNN reads x in reverse and produces (\overleftarrow{h}_1, ..., \overleftarrow{h}_T), and the annotation vectors are the concatenations h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j], so each h_j encodes the j-th word with respect to its surrounding context. The decoder defines each conditional probability as
  p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),   s_i = f(s_{i-1}, y_{i-1}, c_i),
  where, unlike the plain encoder-decoder, the probability is conditioned on a distinct context vector c_i for each target word y_i:
  c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,   \alpha_{ij} = exp(e_{ij}) / \sum_{k=1}^{T_x} exp(e_{ik}),   e_{ij} = a(s_{i-1}, h_j).
  (Figure 1 of Bahdanau et al.: the model generating the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).)
  • 6. The attention weights α (same model as the previous slide): the annotation vectors h_j come from the bidirectional encoder, the alignment scores e_{ij} = a(s_{i-1}, h_j) measure how well the inputs around position j match the output at position i given the decoder state s_{i-1} just before emitting y_i, α_{ij} is their softmax over the source positions, and the context c_i = \sum_j \alpha_{ij} h_j feeds the decoder state and the output distribution p_t. (A small numerical sketch of one attention step appears after the slide transcript.)
  • 7. Recap: the vocabulary softmax p_t = softmax(W^{hp} h_t + b^{hp}) over the V shortlist words versus a softmax over the T source positions; the Pointer Softmax chooses between the two at every time step.
  • 8. The copying example of Figure 1 again (French: "Guillaume et Cesar ont une voiture bleue a Lausanne." English: "Guillaume and Cesar have a blue car in Lausanne."), showing which target words can be copied from the source directly or after dictionary translation.
  • 10. Example outputs from Table 4 of the paper (generated summaries from NMT with PS; boldface words in the paper are the words copied from the source):
  Source #1: china 's tang gonghong set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
  Target #1: china 's tang <unk> , sets world weightlifting record
  NMT+PS #1: china 's tang gonghong wins women 's weightlifting weightlifting title at asian games
  Source #2: owing to criticism , nbc said on wednesday that it was ending a three-month-old experiment that would have brought the first liquor advertisements onto national broadcast network television .
  The slide also shows the entity-anonymized version of example #1, in which entities are replaced by placeholders: Source "<v1> 's <v2> <v3> set a world record with a clean and jerk lift of ### kilograms ..." and Target "<v1> 's <v2> <v3> , sets world weightlifting record".
  • 12. The pointer copies "gonghong", an <unk> under the shortlist, directly from the source. Discussion from the paper: pointer softmax improves over the baseline NMT on both the UNK data and the entities data. The improvement is not larger for the entities data even though pointers occur much more often there; the suspected main reason is the anonymization of entities, which removes data sparsity by converting all entities to integer ids shared across documents. On de-anonymized data the model could help more, since data sparsity is more acute in that case.
  Table 1 (Gigaword corpus, pointers used for UNKs in the training data, Rouge-F1):
                   Rouge-1  Rouge-2  Rouge-L
    NMT + lvt       34.87    16.54    32.27
    NMT + lvt + PS  35.19    16.66    32.51
  The slide repeats the Table 4 examples, including:
  NMT+PS #2: nbc says it is ending a three-month-old experiment
  Source #3: a senior trade union official here wednesday called on ghana 's government to be " mindful of the plight " of the ordinary people in the country in its decisions on tax increases .
  Target #3: tuc official , on behalf of ordinary ghanaians
  NMT+PS #3: ghana 's government urged to be mindful of the plight
  • 13. Results tables (Rouge-F1):
  Table 1 (Gigaword corpus, pointers used for UNKs in the training data):
                   Rouge-1  Rouge-2  Rouge-L
    NMT + lvt       34.87    16.54    32.27
    NMT + lvt + PS  35.19    16.66    32.51
  Table 2 (anonymized Gigaword corpus, pointers used for entities):
                   Rouge-1  Rouge-2  Rouge-L
    NMT + lvt       34.89    16.78    32.37
    NMT + lvt + PS  35.11    16.76    32.55
  The suspected reason the entity setting does not gain more is again the anonymization of entities, which removes data sparsity by converting all entities to shared integer ids; the slide also repeats the anonymized example #1 ("<v1> 's <v2> <v3> set a world record ..." / "<v1> 's <v2> <v3> , sets world weightlifting record").
  • 14. Summary bullets over the Figure 1 copying example (Guillaume et Cesar ont une voiture bleue a Lausanne. / Guillaume and Cesar have a blue car in Lausanne.): common words appearing in both source and target can be copied directly, and the remaining unknown target words can be copied after a dictionary translation.
  • 15. Machine translation results and training details. Table 5 (Europarl dataset, EN-FR):
               BLEU-4
    NMT         20.19
    NMT + PS    23.76
  With the pointer softmax, NMT gains about 3.6 BLEU over the baseline. To build the copy targets for training: first check whether the target word y_t itself appears in the source sentence; if not, check whether a translated version of the word exists in the source sentence using a look-up table between the source and target languages, and if so use that source location as the target; otherwise, check whether one of the English senses from the dictionary entry of the French word is in the source sentence and use the location of that word; otherwise, fall back to the argmax of l_t as the target. (A sketch of this lookup cascade appears after the slide transcript.) For the switching network d_t, a two-layered MLP with noisy-tanh activations (Gulcehre et al., 2016) and residual connections from the lower layer (He et al., 2015) was observed to work well, and gradients are clipped when their norm exceeds 1 (Pascanu et al., 2012). The slide also shows Figure 4: a comparison of the validation learning curves.
  • 16. The two architectures side by side.
  Figure 2: a depiction of the neural machine translation architecture with attention. At each timestep, the model generates the attention distribution l_t; l_t and the encoder's hidden states give the context c_t, and the decoder uses c_t to predict a vector of probabilities for the words w_t with the vocabulary softmax.
  Section 4, The Pointer Softmax: the pointer softmax (PS) deals with rare and unknown words, is applicable to many NLP tasks because it resolves the limitation unknown words place on neural networks, and can be used in parallel with other techniques such as the large vocabulary trick (Jean et al., 2014). The model learns two key abilities jointly so that pointing applies in more general settings: (i) to predict whether the pointing mechanism is needed at each time step, and (ii) to point to the right location in the context sequence. To accomplish this, a switching network is introduced: a multilayer perceptron that takes the representation of the context sequence (similar to the input annotation in NMT) and the previous hidden state of the output RNN, and outputs a binary variable z_t indicating whether to use the shortlist softmax (z_t = 1) or the location softmax (z_t = 0). If the word to be generated at a time step is neither in the shortlist nor in the context sequence, the switching network selects the shortlist softmax, which then predicts UNK.
  Figure 3: a depiction of the Pointer Softmax (PS). The decoder state s_t and context c_t drive the switch z_t, which mixes (with weights p and 1 - p) the vocabulary softmax output y^w_t and the pointer distribution l_t over the source sequence ("point & copy"). (A sketch of this p / (1 - p) mixing appears after the slide transcript.)
  • 17. For the summarization task, the pointer softmax was able to improve the results even when it is used together with the large-vocabulary trick. In the case of neural machine translation, we observed that training with the pointer softmax also improved the convergence speed of the model. For French-to-English machine translation on the Europarl corpora, we observe that using the pointer softmax can also improve the training convergence of the model.

7 Acknowledgments

We would also like to thank the developers of Theano (http://deeplearning.net/software/theano/) for developing such a powerful tool.

References

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
[Bengio and Senécal2008] Yoshua Bengio and Jean-Sébastien Senécal. 2008. Adaptive importance sampling to accelerate training of a neural probabilistic language model. Neural Networks, IEEE Transactions on, 19(4):713–722.
[Bordes et al.2015] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.
[Cheng and Lapata2016] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
[Cho et al.2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[Chung et al.2014] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
[Gulcehre et al.2016] Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. arXiv preprint arXiv:1603.00391.
[Gutmann and Hyvärinen2012] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.
[He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
[Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.
[Jean et al.2014] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.
[Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[Luong et al.2015] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
[Morin and Bengio2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pages 246–252.
[Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
[Pascanu et al.2013] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
[Rush et al.2015] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.
[Schuster and Paliwal1997] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.
[Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[Theano Development Team2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May.
[Tomasello et al.2007] Michael Tomasello, Malinda Carpenter, and Ulf Liszkowski. 2007. A new look at infant pointing. Child Development, 78(3):705–722.
[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2674–2682.
[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.