Development of Recurrent Neural Networks
• LSTM
  – Gates can assign weights depending on the history and the current word
  – A memory cell lets the model handle long-term dependencies
• LSTM has too many components (parameters)
  – Example: is a memory cell really necessary to handle a phrase of only a few words?
[Figure: composing the phrase "…smoking increase the risk of lung…" through a memory cell into a distributed representation of the phrase]
…using a function, F(x_1, …, x_T), where F(·) is modeled by a variant of recurrent neural network (RNN).
3.1 Baseline: Long Short-Term Memory
Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a variant of RNN that is applied successfully to various NLP tasks including word segmentation (Chen et al., 2015), dependency parsing (Dyer et al., 2015), machine translation (Sutskever et al., 2014), and sentiment analysis (Tai et al., 2015). LSTM computes the input gate i_t ∈ R^d, forget gate f_t ∈ R^d, output gate o_t ∈ R^d, memory cell c_t ∈ R^d, and hidden state h_t ∈ R^d for a given embedding x_t at position t.⁵

i_t = σ(W_{ix} x_t + W_{ih} h_{t-1})    (1)
f_t = σ(W_{fx} x_t + W_{fh} h_{t-1})    (2)
o_t = σ(W_{ox} x_t + W_{oh} h_{t-1})    (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_{cx} x_t + W_{ch} h_{t-1})    (4)
h_t = o_t ⊙ g(c_t)    (5)

⁵ We omitted peephole connections and bias terms in this study. We set the number of dimensions of hidden states identical to that of word embeddings (d) so that we can adapt the objective function of the Skip-gram model (Section 3.3).
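As a concrete reading of Equations 1–5, here is a minimal NumPy sketch of one LSTM step. It follows the same simplifications (no peephole connections, no bias terms, hidden size equal to the embedding size d); the weight-dictionary keys, the choice of g = tanh, and the toy dimensions are illustrative assumptions rather than details taken from the paper.

# Minimal sketch of one LSTM step (Equations 1-5); g is assumed to be tanh.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One step of the gate/cell updates. W maps names like "ix" to d x d matrices."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)   # Eq. (1): input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)   # Eq. (2): forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)   # Eq. (3): output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)  # Eq. (4): memory cell
    h_t = o_t * np.tanh(c_t)                          # Eq. (5): hidden state
    return h_t, c_t

# Toy usage: scan a phrase of five dummy word embeddings and keep the last hidden state.
d = 8
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ["ix", "ih", "fx", "fh", "ox", "oh", "cx", "ch"]}
h, c = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, W)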
Gated Additive Composition (GAC) [Takase+ 16]
• Gates assign weights depending on the history and the current word
  – Additive composition + word order + gates
• Input gate:
  – Ignores function words (the, of) and takes in content words (increase, risk)
• Forget gate:
  – Controls how much of the previous distributed representation to discard (forget)
[Figure: composing the phrase "…smoking increase the risk of lung…" into a distributed representation of the phrase]
… apparent (Jozefowicz et al., 2015). We are unsure whether LSTM is the optimal architecture for modeling relational patterns.
For this reason, we simplified the LSTM architecture as follows. We removed the memory cell by replacing c_t with the hidden state h_t because the problem of exponential error decay (Hochreiter et al., 2001) might not be prominent for relational patterns. We also removed the matrices corresponding to W_{hh} and W_{hx} because most relational patterns hold additive composition. This simplification yields the architecture defined by Equations 6–8.
i_t = σ(W_{ix} x_t + W_{ih} h_{t-1})    (6)
f_t = σ(W_{fx} x_t + W_{fh} h_{t-1})    (7)
h_t = g(f_t ⊙ h_{t-1} + i_t ⊙ x_t)      (8)
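The same toy setting gives a sketch of one GAC step per Equations 6–8; the only parameters left are the four gate matrices, and g = tanh plus the weight names are again assumptions carried over from the LSTM sketch above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gac_step(x_t, h_prev, W):
    """One GAC step: no memory cell, no output gate, new content is just x_t."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)   # Eq. (6): how much of the word to take in
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)   # Eq. (7): how much history to keep
    h_t = np.tanh(f_t * h_prev + i_t * x_t)           # Eq. (8): gated additive composition
    return h_t

Scanning a relational pattern with gac_step and taking the final h_t yields the pattern vector h_p used below (the scan order over the pattern is left unspecified in this sketch).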
l_p = Σ_{τ ∈ C_p} ( log σ(h_p^⊤ x̃_τ) + Σ_{k=1}^{K} log σ(−h_p^⊤ x̆_{τ'}) )    (9)
In this formula, K denotes the number of negative samples; h_p ∈ R^d is the vector for the relational pattern p computed by LSTM or GAC; x̃_τ ∈ R^d is the context vector for the word w_τ;⁶ x̆_{τ'} ∈ R^d is the context vector for a word that was sampled from …
⁶ The Skip-gram model has two kinds of vectors, x_t and x̃_t, assigned to a word w_t. Equation 2 of the original paper (Mikolov et al., 2013) denotes x_t (word vector) as v (input vector) and x̃_t (context vector) as v′ (output vector). The word2vec implementation does not write context (output) vectors but only word (input) vectors to a model file. Therefore, we modified the source code to save context vectors and use them in Equation 9. This modification ensures the consistency of the entire model.
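To make Equation 9 concrete, the sketch below evaluates the negative-sampling objective for a single pattern vector h_p. The uniform choice of negative samples, the matrix layout of the context vectors, and the function and variable names are simplifying assumptions (word2vec in practice draws negatives from a smoothed unigram distribution).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pattern_log_likelihood(h_p, ctx_vectors, context_ids, K, rng):
    """Equation 9 for one relational pattern.

    h_p          : (d,) pattern vector from LSTM or GAC.
    ctx_vectors  : (V, d) context ("output") vectors, one per vocabulary word.
    context_ids  : word ids observed around the pattern (the set C_p).
    K            : number of negative samples per context word.
    """
    l_p = 0.0
    for tau in context_ids:
        l_p += np.log(sigmoid(h_p @ ctx_vectors[tau]))              # positive term
        negatives = rng.integers(0, ctx_vectors.shape[0], size=K)   # simplified: uniform sampling
        for neg in negatives:
            l_p += np.log(sigmoid(-h_p @ ctx_vectors[neg]))         # negative terms
    return l_p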
Gate values vs. input words and history
• Input words and histories observed when each GAC gate opens or closes
3      2,272   0.234  0.386  0.344  0.370  0.…
4      1,206   0.208  0.306  0.314  0.329  0.…
> 5      423   0.278  0.315  0.369  0.384  0.…
All    5,555   0.215  0.340  0.336  0.356  0.…

Table 1: Spearman's rank correlations on different pattern lengths (number… ; the column headers and the rightmost values are truncated in extraction).
              w_t w_{t+1} w_{t+2} ...
large i_t     reimburse for
(input        payable in
 open)        liable to
small i_t     a charter member of
(input        a valuable member of
 close)       be an avid reader of
large f_t     be eligible to participate in
(forget       be require to submit
 open)        be request to submit
small f_t     coauthor of
(forget       capital of
 close)       center of

Table 2: Prominent moments for input/forget gates.
[Right column truncated in extraction. The legible fragments discuss positions at which |i_t|_2 or |f_t|_2 is small or large on the relational patterns, the order in which each pattern is composed (from the last word), and state that Table 2 displays the prominent moments identified with this procedure, grouped into tendencies for when the input and forget gates open or close.]
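Since the selection procedure behind Table 2 is only partially legible above, the following is merely an illustration of the general idea: scan each pattern with GAC, record the L2 norms of the input and forget gate activations at every position, and rank the (pattern, position) pairs by those norms to find moments where a gate is wide open or nearly closed. The function names, the scan order, and the data layout are assumptions.

import numpy as np

def gac_step_with_gates(x_t, h_prev, W):
    """Same update as gac_step above, but also returns the gate activations."""
    i_t = 1.0 / (1.0 + np.exp(-(W["ix"] @ x_t + W["ih"] @ h_prev)))
    f_t = 1.0 / (1.0 + np.exp(-(W["fx"] @ x_t + W["fh"] @ h_prev)))
    h_t = np.tanh(f_t * h_prev + i_t * x_t)
    return h_t, i_t, f_t

def gate_moments(patterns, embed, W, d):
    """Collect (|i_t|_2, |f_t|_2, pattern, position) records for later sorting."""
    records = []
    for words in patterns:                       # each pattern is a list of word strings
        h = np.zeros(d)
        for t, w in enumerate(words):
            h, i_t, f_t = gac_step_with_gates(embed[w], h, W)
            records.append((np.linalg.norm(i_t), np.linalg.norm(f_t), tuple(words), t))
    return records

# Example ranking: positions where the input gate is most open / most closed.
# records = gate_moments(patterns, embed, W, d)
# most_open   = sorted(records, key=lambda r: -r[0])[:3]
# most_closed = sorted(records, key=lambda r:  r[0])[:3]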