1. Komachi Lab ACL reading 2014/8/1
Fast and Robust Neural Network
Joint Model for Statistical Machine
Translation
Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas
Lamar, Richard Schwartz and John Makhoul
Presented by Yoshiaki Kitagawa
7. Concrete example
When the target word is "the":
• The affiliated source word (here, "money") is determined by a few heuristic rules (three cases)
[Figure: the context vector for target word "the", using a 3-word target history and a 5-word source window around the affiliation. "the" inherits its affiliation from "money". The number in each box denotes the index of the word in the context vector; the indexes must be consistent across samples, but the absolute ordering does not affect results.]
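The three-case affiliation heuristic can be sketched as follows. This is a minimal illustration based on how the paper describes the rules; the function name, data layout, and the exact tie-breaking direction are my own assumptions, not code from the paper.

```python
def affiliation(i, align, n_target):
    """Affiliation of target position i (three-case heuristic, sketched).

    align[j] is the sorted list of source indices aligned to target
    position j; an empty list means the word is unaligned."""
    srcs = align[i]
    if srcs:
        # Case 1: aligned to exactly one source word -> that word.
        # Case 2: aligned to several -> the middle one (rounding down).
        return srcs[(len(srcs) - 1) // 2]
    # Case 3: unaligned -> inherit the affiliation of the nearest aligned
    # target word (checking the right neighbor first, as an assumption).
    # This is how "the" inherits its affiliation from "money".
    for d in range(1, n_target):
        if i + d < n_target and align[i + d]:
            return affiliation(i + d, align, n_target)
        if i - d >= 0 and align[i - d]:
            return affiliation(i - d, align, n_target)
    return None
```

For example, with `align = [[0], [], [1, 2, 3]]` the unaligned middle word inherits its affiliation from its right neighbor.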
8. NN architecture
• Input: 14 words (3 target words + 11 source words)
• Each word is mapped to a 192-dimensional vector
• Two hidden layers, 512 dimensions each
– tanh non-linearity
• Output layer
– soft-max turns the scores into probabilities
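The forward pass described above can be sketched in numpy. The layer sizes come from the slide; the vocabulary size and the random initialization are illustrative assumptions (the real parameters are learned by backpropagation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the slide; VOCAB is an illustrative assumption.
VOCAB, CTX, EMB, HID = 5000, 14, 192, 512

# Randomly initialized parameters (learned in practice).
E  = rng.normal(0, 0.1, (VOCAB, EMB))              # word embeddings
W1 = rng.normal(0, 0.1, (CTX * EMB, HID)); b1 = np.zeros(HID)
W2 = rng.normal(0, 0.1, (HID, HID));       b2 = np.zeros(HID)
Wo = rng.normal(0, 0.1, (HID, VOCAB));     bo = np.zeros(VOCAB)

def forward(context_ids):
    """context_ids: 14 indices (3-word target history + 11 source words)."""
    x  = E[context_ids].reshape(-1)       # 14 x 192 = 2688-dim input
    h1 = np.tanh(x @ W1 + b1)             # hidden layer 1: 512 dims, tanh
    h2 = np.tanh(h1 @ W2 + b2)            # hidden layer 2: 512 dims, tanh
    u  = h2 @ Wo + bo                     # raw output scores
    u -= u.max()                          # shift for numerical stability
    p  = np.exp(u)
    return p / p.sum()                    # soft-max probabilities

probs = forward(rng.integers(0, VOCAB, size=CTX))
```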
10. What led to the self-normalizer
• The output layer is a soft-max, so the log-likelihood is:

log(P(x)) = log( e^(U_r(x)) / Z(x) ) = U_r(x) − log(Z(x))

Z(x) = Σ_{r'=1}^{|V|} e^(U_{r'}(x))

where x is the sample, U is the raw output layer scores, r is the output layer row corresponding to the observed target word, and Z(x) is the softmax normalizer.
• Computing Z(x) takes time
Here is the key point:
– If only log(Z(x)) = 0, i.e., Z(x) = 1 …!
– Then we could use log(P(x)) = U_r(x)!
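The cost gap is easy to make concrete: scoring the observed word needs one column of the output matrix, while Z(x) needs all |V| of them. A minimal numpy sketch (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
HID, VOCAB = 512, 20_000               # illustrative sizes
Wo = rng.normal(0, 0.1, (HID, VOCAB))  # output layer weights
h  = rng.normal(0, 1.0, HID)           # last hidden activation
r  = 42                                # row of the observed target word

# Exact log P(x): every output score is needed just to form Z(x).
U = h @ Wo                             # |V| raw scores -- the dominant cost
m = U.max()
logZ = m + np.log(np.exp(U - m).sum()) # stable log of the normalizer
log_p = U[r] - logZ

# If Z(x) = 1 were guaranteed, a single dot product would suffice:
u_r = h @ Wo[:, r]
```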
11. Self-normalizer
• Decoding time is dominated by computing the output layer over the whole vocabulary
• Training with the objective below drives log(Z(x)) as close to 0 as possible
– This makes decoding roughly 15× faster

If we could guarantee that log(Z(x)) were always equal to 0 (i.e., Z(x) = 1) then at decode time we would only have to compute row r of the output layer instead of the whole matrix. While we cannot train a neural network with this guarantee, we can explicitly encourage the log-softmax normalizer to be as close to 0 as possible by augmenting our training objective function:

L = Σ_i [ log(P(x_i)) − α(log(Z(x_i)) − 0)² ]
  = Σ_i [ log(P(x_i)) − α log²(Z(x_i)) ]

In this case, the output layer bias weights are initialized to log(1/|V|), so that the initial network is self-normalized. At decode time, we simply use U_r(x) as the feature score, rather than log(P(x)).

• α is a parameter, tuned between 0 and 1
– Note that α = 0 is identical to a standard NN
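The augmented objective, including the log(1/|V|) bias initialization, can be sketched per sample. The sizes and random weights below are illustrative assumptions; only the loss formula and the bias initialization come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
HID, VOCAB = 8, 50        # tiny illustrative sizes
ALPHA = 0.1               # alpha, tuned between 0 and 1

W = rng.normal(0, 0.1, (HID, VOCAB))
# Output bias initialized to log(1/|V|), so the initial network is
# (approximately) self-normalized: log(Z(x)) starts near 0.
b = np.full(VOCAB, np.log(1.0 / VOCAB))

def log_z(u):
    """log of the softmax normalizer Z(x), computed stably."""
    m = u.max()
    return m + np.log(np.exp(u - m).sum())

def self_normalized_loss(h, r, alpha=ALPHA):
    """Negative augmented objective for one sample:
    -( log P(x) - alpha * log^2 Z(x) ).

    h: hidden activation; r: row of the observed target word."""
    u = h @ W + b                 # raw output scores U(x)
    lz = log_z(u)
    log_p = u[r] - lz             # log P(x) = U_r(x) - log Z(x)
    return -(log_p - alpha * lz ** 2)

h = rng.normal(0, 0.01, HID)
```

With α = 0 the penalty term vanishes and the loss reduces to the standard negative log-likelihood, matching the note on the slide.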