3. Language model
• A language model assigns a probability to a word sequence
– A fluent sentence should receive a higher probability: P(I have a dream) > P(a have I dream)
• Language models are evaluated with perplexity
– Lower perplexity → a better model
4. Language model
• Assigns a probability P to a word sequence
– P(I have a dream) > P(a have I dream) > P(fuga spam hoge)
• Also appears as the decoder side of an RNN Encoder-Decoder
• The probability of a sentence such as "I have a dream" is factorized by the chain rule (see the sketch below):
P(I have a dream)
= P(I) P(have | I) P(a | I have) P(dream | I have a)
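A minimal Python sketch of this factorization and of perplexity as the evaluation metric; the conditional probabilities below are made-up toy values, not the output of a real model:

```python
import math

# Toy conditional probabilities (hypothetical values, for illustration only).
cond_probs = {
    "P(I)": 0.05,
    "P(have | I)": 0.20,
    "P(a | I have)": 0.10,
    "P(dream | I have a)": 0.02,
}

# Chain rule: P(I have a dream) = P(I) P(have | I) P(a | I have) P(dream | I have a)
sentence_prob = math.prod(cond_probs.values())

# Perplexity = exp of the average negative log-probability per token (lower is better)
perplexity = math.exp(-math.log(sentence_prob) / len(cond_probs))

print(sentence_prob)  # 2e-06
print(perplexity)     # about 26.6
```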
5. RNN language model
• An RNN computes each conditional probability from the preceding words
• The same RNN cell is applied at every time step, feeding in one word at a time (see the sketch below)
[Figure: an RNN unrolled over four steps; inputs <BOS>, I, have, a; outputs P(I), P(have | I), P(a | I have), P(dream | I have a)]
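A minimal PyTorch sketch of such an RNN language model; the LSTM-based architecture and dimensions are assumptions for illustration, not the exact model from the slides:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # plays the role of W in the next slides

    def forward(self, tokens):
        # tokens: (batch, seq_len), e.g. <BOS> I have a
        h, _ = self.rnn(self.embed(tokens))           # (batch, seq_len, hidden_dim)
        logits = self.out(h)                          # (batch, seq_len, vocab_size)
        # Softmax over the vocabulary gives P(next word | preceding words)
        return torch.log_softmax(logits, dim=-1)

# Usage: log-probability of "I have a dream" given inputs "<BOS> I have a"
vocab = {"<BOS>": 0, "I": 1, "have": 2, "a": 3, "dream": 4}
model = RNNLM(vocab_size=len(vocab))
inputs = torch.tensor([[vocab["<BOS>"], vocab["I"], vocab["have"], vocab["a"]]])
targets = torch.tensor([[vocab["I"], vocab["have"], vocab["a"], vocab["dream"]]])
log_probs = model(inputs).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
sentence_log_prob = log_probs.sum()  # log P(I have a dream)
```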
6. RNN language model (1/2)
• The RNN encodes the context c (the preceding words) into a hidden state hc
• Multiplying hc by the word embedding matrix W and applying a Softmax gives the next-word distribution [P(x1 | c), P(x2 | c), …] (sketch below)
[Figure: <BOS> and I are fed to the RNN; the hidden state hc is multiplied by W to produce [P(x1 | c), P(x2 | c), …]]
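A numpy sketch of this output layer; hc and W are random toy values standing in for a trained model's hidden state and word embeddings:

```python
import numpy as np

d, vocab_size = 4, 5                     # toy dimensions
rng = np.random.default_rng(0)
hc = rng.normal(size=d)                  # hidden state for the context c
W = rng.normal(size=(vocab_size, d))     # one embedding w_x per vocabulary word

logits = W @ hc                          # logit for word x is hc . w_x
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # Softmax: [P(x1 | c), ..., P(x5 | c)]
```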
7. RNN language model (2/2)
• Repeating this for all N contexts over the M vocabulary words → an N × M matrix A of true next-word probabilities
[Figure: at each step the RNN emits a row [P(x1 | c1), P(x2 | c1), …], [P(x1 | c2), P(x2 | c2), …], …; stacking the rows gives A]
• Likewise, stacking the hidden states gives Hθ, and the model's logits form the matrix HθWᵀ
• To reproduce the true distributions, the model must satisfy HθWᵀ = log A (up to a per-row constant)
From [Yang+ 18]: we are asking whether there exists a parameter θ such that Pθ(X | c) = P*(X | c) for every context c. We start by looking at how a Softmax-based model works.

2.1 SOFTMAX
The majority of parametric language models use a Softmax function operating on a context vector (or hidden state) hc and a word embedding wx. Specifically, the model distribution is usually written as

Pθ(x | c) = exp(hc^T wx) / Σx' exp(hc^T wx')

where hc is a function of c and wx is a function of x; the dot product hc^T wx is called a logit.

To help discuss the expressiveness of Softmax, define the matrices

Hθ = [hc1^T; hc2^T; …; hcN^T],   Wθ = [wx1^T; wx2^T; …; wxM^T]

where Hθ ∈ R^{N×d} and Wθ ∈ R^{M×d}; their rows are the context vectors and the word embeddings, respectively. The subscript θ indicates that both are model parameters.
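A small numpy sketch of the resulting Softmax bottleneck: the logit matrix HθWθᵀ can have rank at most d, so when log A has higher rank the model cannot match the true distribution exactly. The matrices below are random toy data used only to illustrate the rank argument:

```python
import numpy as np

N, M, d = 8, 10, 3                      # N contexts, M words, small hidden size d
rng = np.random.default_rng(0)

H = rng.normal(size=(N, d))             # H_theta: one hidden state per context
W = rng.normal(size=(M, d))             # W_theta: one embedding per word

logits = H @ W.T                        # N x M logit matrix H_theta W_theta^T
print(np.linalg.matrix_rank(logits))    # at most d = 3

# A "true" distribution whose log-probability matrix has high rank:
A = rng.dirichlet(np.ones(M), size=N)   # each row is a probability vector over M words
print(np.linalg.matrix_rank(np.log(A))) # typically min(N, M) = 8 > 3

# Since rank(H W^T) <= d, no choice of H and W can make H W^T equal log A
# (up to per-row constants) when log A has rank much larger than d:
# this is the Softmax bottleneck.
```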
11. Experiments
• Task: word-level language modeling, evaluated by perplexity
• Datasets: PTB and WikiText-2 (WT2)
– Low-frequency words are replaced with <unk>
• Base model: 3-layer LSTM (AWD-LSTM) [Merity+ 17]
                 PTB        WikiText-2
Vocab            10,000     33,278
#Token  Train    929,590    2,088,628
        Valid    73,761     217,646
        Test     82,431     245,569
Table 1: Statistics of PTB and WikiText-2.
12. Results
• Output layers compared: standard (linear) Softmax vs. Mixture of Softmaxes (MoS) [Yang+ 18]
• MoS [Yang+ 18] breaks the Softmax bottleneck by mixing several Softmax distributions (see the sketch below)
• Published perplexities for reference:
– MoS [Yang+ 18]: PPL 54.44 on PTB, 61.45 on WT2
• about 22M parameters on PTB
– [Takase+ 18]: PPL 52.38 on PTB, 58.03 on WT2
– Transformer-XL [Dai+ 19]: PPL 54.52 on PTB
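A PyTorch sketch of the MoS idea from [Yang+ 18]: a prior over K components and K latent context vectors, all sharing one set of word embeddings. The layer shapes follow the general recipe of the paper, but the exact sizes here are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, hidden_dim, vocab_size, n_components=15):
        super().__init__()
        self.K = n_components
        self.prior = nn.Linear(hidden_dim, n_components)                # mixture weights pi_k(c)
        self.latent = nn.Linear(hidden_dim, n_components * hidden_dim)  # K context vectors h_{c,k}
        self.decoder = nn.Linear(hidden_dim, vocab_size)                # shared word embeddings W

    def forward(self, h):
        # h: (batch, hidden_dim) hidden state for context c
        pi = torch.softmax(self.prior(h), dim=-1)                       # (batch, K)
        hk = torch.tanh(self.latent(h)).view(-1, self.K, h.size(-1))    # (batch, K, hidden_dim)
        softmaxes = torch.softmax(self.decoder(hk), dim=-1)             # (batch, K, vocab)
        # P(x | c) = sum_k pi_k(c) * Softmax(h_{c,k} W^T)_x
        # The log of this mixture is not confined to a rank-d subspace,
        # which is how MoS breaks the Softmax bottleneck.
        return (pi.unsqueeze(-1) * softmaxes).sum(dim=1)                # (batch, vocab)
```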
Single model perplexities on validation and test sets of Penn Treebank and WikiText-2. For a fair comparison, results were obtained by running the respective open-source implementations locally, while being comparable to the published ones. Training time per epoch is shown for a single Tesla P100 GPU.

                                          PENN TREEBANK                         WIKITEXT-2
                                          #Param  Valid PPL  Test PPL  #Sec/Ep  #Param  Valid PPL  Test PPL  #Sec/Ep
Linear-Softmax w/ AWD-LSTM,
  w/o finetune (Merity et al., 2017)      24.2M   60.83      58.37     ~60      33M     68.11      65.22     ~120
Ours LMS-PLIF, 10^5 knots,
  w/ AWD-LSTM, w/o finetune               24.4M   59.45      57.25     ~70      33.2M   67.87      64.86     ~150
MoS, K = 15, w/ AWD-LSTM,
  w/o finetune (Yang et al., 2017)        26.6M   58.58      56.43     ~150     33M     66.01      63.33     ~550
MoS (15 comp.) + our PLIF (10^6 knots),
  w/ AWD-LSTM, w/o finetune               28.6M   58.20      56.02     ~220     -       -          -         -
From the LMS-PLIF paper: …and WikiText-2 (Merity et al., 2016). These datasets have word vocabulary sizes of 10,000 and 33,000. … We integrate our PLIF layer on top of the state-of-the-art language models AWD-LSTM (Merity et al., 2017) and AWD-LSTM+MoS (Yang et al., 2017). Additionally, our PLIF architecture can also be combined with MoS instead of standard Softmax. We call this model "MoS + PLIF". We use the AWD-LSTM open-source implementation. All the models in Table 1 were run locally … these results; we did this to understand how the different …
[Figure 4: Learned function for the model in Table 2]