Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Recent Progress
in RNN and NLP
Tohoku University
Inui and Okazaki Lab.
Sosuke Kobayashi
⼩林 颯介
• Revised presentation slides from
2016/6/22 NLP-DL MTG@Preferred Networks and
2016/6/30 Inui and Okazaki Lab. Talk
• Over...
• Basic RNN
• RNN’s Unit
• Benchmarking various RNNs
• Connections in RNNs
• RNN and Tree
• Regularizations and Tricks for...
• Benchmarking RNN’s various units
• Variants of LSTM or GRU
• Examinations of gates in LSTM
• Initialization trick of LST...
LSTM and GRU
• LSTM [Hochreiter&Schmidhuber97] • GRU [Cho+14]
(Biases are omitted.)Tohoku University, Inui and Okazaki Lab...
• Search better unit structures from LSTM and GRU by
mutating computation graphs
• Arith.: Calculation with noise tokens
•...
• Better units are similar to GRU
• Due to bias from
search algorighm?
• MUT1: Update gate is
controlled by only x (not h)...
• Remove LSTM’s input, forget, output gates?
• LSTM’s forget gate’s bias is initialized to +1?
≒ Keeping 73% cell’s values...
• “LSTM: A Search Space Odyssey.” Cool title.
• Good examinations on LSTM
• Gates, peephole, tanh before output,
forget ga...
• Structurally Constrained Recurrent Network (SCRN)
[Mikolov+15]
• RNN with a simple cell
by weighted sum
• IRNN [Le+15]
•...
• Minimal Gated Unit; MGU
[Zhou+, 16]
Other GRU-like Units
• Simple Gated Unit; SGU
[Gao+, 16]
• Deep SGU; DSGU
[Gao+, 16]...
• Multiplicative Integration [Wu+16]
• Improvements by using multiplication with addition in
RNN, changing
into
.
• Simila...
• Visualization of character-based language model
• A cell got a function of apostrophes’ opening and
closing
• But other ...
Word Ablation [Kádár+16]
• Analyzing a GRU’s output by omission score
when encoding a image caption
• Model predicting
ima...
• Removing word ‘pizza’ removes a just ’pizza’ from the
image (searched from dataset)
Word Ablation [Kádár+16]
Tohoku Univ...
• Analyzing mean omission scores in dataset on pos-tag,
model for image focuses on
NN > JJ > VB, CD > ...
Word Ablation [K...
• Connections of RNNs
• Tree structure and RNN (LSTM)
• Tree-based Composition by Shift-reduce
2. Connections and Trees
To...
• Clockwork RNN. Combination of different RNNs with
different step processing [Koutník+14, Liu+15,
(Chung+16)]
• Gated Fee...
Pixel Recurrent Neu
x1
xi
xn
xn2
Figure 2. Left: To generate pixel xi one conditions on all the pre-
viously generated pix...
• Tree-LSTM [Tai+15] Apply LSTM to follow directed edges
(child to parent) of tree structure. Most cited “Tree-LSTM”
• S-L...
Figure 2: Attentional Encoder-Decoder model.
dj is calculated as the summation vector weighted
by ↵j(i):
dj =
nX
i=1
↵j(i)...
The hungry cat
NP (VP(S
REDUCE
GENNT(NP)NT(VP)
…
cat hungry The
a<t
p(at)
ut
TtSt
Figure 5: Neural architecture for definin...
• Joint learning of shift-reduce parsing and sentence-level
classification with shift-reduce-based tree-LSTM
composition. ...
• Repeated attention with
LSTM captures
input vectors as a set
[Vinyals+15]
• (End-to-end) Memory Networks [Sukhbaatar+15]...
• Regularizations
• Dropout in RNN
• Batch Normalization in RNN
• Other Regularizations
• Multi-task learning and pre-trai...
• Dropout [Hinton+12, Srivastava+14]
Drop nodes by probability p and multiply 1/(1-p).
• RNN (mainly LSTM and GRU) need so...
• Batch Normalization [Ioffe+15] Normalization, scale and shift
function at intermediate layer. Popular for CNN (DNN).
• R...
• Norm-stabilizer [Krueger+15]
• Penalize the difference between the norms of hidden
vectors at successive time steps
• Te...
• Auto-(sentence-)encoder and language modeling as pre-
training for sentence classifications. (But, joint learning is
not...
• Lighten Calculation of Softmax output
• Copy mechanism
• Character-based
• Global Optimization of Decoding
4. Decoding
T...
• Softmax over large vocabulary (class) has large time
and space computational complexity.
Lighten (replace) it by
• Sampl...
• Copy function from source sentence
• [Gulcehre+16] calculates attention distribution over source
sentence(‘s LSTM output...
forms and their meanings is non-trivial (de Saus-
sure, 1916). While some compositional relation-
ships exist, e.g., morph...
Figure 1: Illustration of the Scheduled Sampling approach,
• Reinventing the wheel (of non-NN research)?
(Even so, these a...
• REINFORCE to optimize BLEU/ROUGE [Ranzato+15]
• Minimum Risk Training [Shen+15, Ayana+16]
• Optimization for beam search...
• Better RNN’s units and connections are produced,
however, their impacts are small now
(compared to ones between “vanilla...
Upcoming SlideShare
Loading in …5
×

Recent Progress in RNN and NLP

13,118 views

Published on

Revised presentation slide for NLP-DL, 2016/6/22.
Recent Progress (from 2014) in Recurrent Neural Networks and Natural Language Processing.
Profile http://www.cl.ecei.tohoku.ac.jp/~sosuke.k/
Japanese ver. https://www.slideshare.net/hytae/rnn-63761483

Published in: Technology

Recent Progress in RNN and NLP

  1. 1. Recent Progress in RNN and NLP Tohoku University Inui and Okazaki Lab. Sosuke Kobayashi ⼩林 颯介
  2. 2. • Revised presentation slides from 2016/6/22 NLP-DL MTG@Preferred Networks and 2016/6/30 Inui and Okazaki Lab. Talk • Overview of basic progress in RNN from late 2014 • Attention is not included. c.f. http://www.slideshare.net/yutakikuchi927/deep-learning-nlp-attention • Not published arXiv papers are marked with ” ” • Reference: https://docs.google.com/document/d/1nmkidNi_MsRPbB65kHsmyMfGqmaQ0r5dW518J8k_aeI/edit?usp=sharing ( https://goo.gl/kE6GCM ) Note Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  3. 3. • Basic RNN • RNN’s Unit • Benchmarking various RNNs • Connections in RNNs • RNN and Tree • Regularizations and Tricks for RNN’s Learning • Decoding Agenda Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  4. 4. • Benchmarking RNN’s various units • Variants of LSTM or GRU • Examinations of gates in LSTM • Initialization trick of LSTM • High performance by simple units • Visualization and analysis 1. Unit and Benchmark Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  5. 5. LSTM and GRU • LSTM [Hochreiter&Schmidhuber97] • GRU [Cho+14] (Biases are omitted.)Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  6. 6. • Search better unit structures from LSTM and GRU by mutating computation graphs • Arith.: Calculation with noise tokens • XML: Character-based prediction of XML tags • PTB: Language modeling Discovered Units [Jozefowicz+15] Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  7. 7. • Better units are similar to GRU • Due to bias from search algorighm? • MUT1: Update gate is controlled by only x (not h). It looks reasonable for Arith. h_tをh_(t_1)にずらしてください 7 [Jozefowicz+15] GRU: Discovered Units Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  8. 8. • Remove LSTM’s input, forget, output gates? • LSTM’s forget gate’s bias is initialized to +1? ≒ Keeping 73% cell’s values initially? • Initializing forget gate with positive bias is good ([Gers+2000] also said so.) • Dropout improves LSTM, not GRU, in language modeling • Gates’ importance are f >> i > o. Examination of LSTM [Jozefowicz+15] Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  9. 9. • “LSTM: A Search Space Odyssey.” Cool title. • Good examinations on LSTM • Gates, peephole, tanh before output, forget gate = 1 – input gate (like GRU) • Full gate recurrence; gates are also controlled by gates’ values at previous step • Peephole is not important, forget gate is important, f=1-i is good and can save the # of parameters • You are recommended to use a common LSTM [Greff+15] Examination of LSTM Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  10. 10. • Structurally Constrained Recurrent Network (SCRN) [Mikolov+15] • RNN with a simple cell by weighted sum • IRNN [Le+15] • Simple RNN with recurrent matrix initialized with identity matrix and ReLU instead of tanh • Effects of diagonal and orthogonal matrix in RNN [Henaff+16] Other Devised Units Q is diagonal Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  11. 11. • Minimal Gated Unit; MGU [Zhou+, 16] Other GRU-like Units • Simple Gated Unit; SGU [Gao+, 16] • Deep SGU; DSGU [Gao+, 16] Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  12. 12. • Multiplicative Integration [Wu+16] • Improvements by using multiplication with addition in RNN, changing into . • Similarly in LSTM and GRU • Improve performances of many tasks • In the near future, this will get common ...? Multiplicative Integration resurgence of new structural designs for recurrent neural networks (RNNs) esigns are derived from popular structures including vanilla RNNs, Long works (LSTMs) [4] and Gated Recurrent Units (GRUs) [5]. Despite of their ost of them share a common computational building block, described by the (Wx + Uz + b), (1) Rm are state vectors coming from different information sources, W 2 Rd⇥n e-to-state transition matrices, and b is a bias vector. This computational a combinator for integrating information flow from the x and z by a sum by a nonlinearity . We refer to it as the additive building block. Additive ly implemented in various state computations in RNNs (e.g. hidden state RNNs, gate/cell computations of LSTMs and GRUs. an alternative design for constructing the computational building block by of information integration. Specifically, instead of utilizing sum operation e Hadamard product “ ” to fuse Wx and Uz: (Wx Uz + b) (2) ucture Description and Analysis neral Formulation of Multiplicative Integration idea behind Multiplicative Integration is to integrate different information flows Wx adamard product “ ”. A more general formulation of Multiplicative Integration e bias vectors 1 and 2 added to Wx and Uz: ((Wx + 1) (Uz + 2) + b) 1, 2 2 Rd are bias vectors. Notice that such formulation contains the first order itive building block, i.e., 1 Uht 1 + 2 Wxt. In order to make the Mult on more flexible, we introduce another bias vector ↵ 2 Rd to gate2 the term W g the following formulation: (↵ Wx Uz + 1 Uz + 2 Wx + b), t the number of parameters of the Multiplicative Integration is about the same as t building block, since the number of new parameters (↵, 1 and 2) are negligible c number of parameters. Also, Multiplicative Integration can be easily extended to Us3 , that adopt vanilla building blocks for computing gates and output states, wher replace them with the Multiplicative Integration. More generally, in any kind of information flows (k 2) are involved (e.g. RNN with multiple skip connect dforward models like residual networks [12]), one can implement pairwise Mult on for integrating all k information sources.Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  13. 13. • Visualization of character-based language model • A cell got a function of apostrophes’ opening and closing • But other cells can not be interpretable Visualization Figure 2: Several examples of cells with interpretable activa A tanh(cell)’s value. Red -1 <---> +1Blue [Karpathy+15] Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  14. 14. Word Ablation [Kádár+16] • Analyzing a GRU’s output by omission score when encoding a image caption • Model predicting image’s vector (CNN output) focuses on nouns • Language model focuses more evenly omission(i, S) = 1 cosine(hend(S), hend(Si)) (12) Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  15. 15. • Removing word ‘pizza’ removes a just ’pizza’ from the image (searched from dataset) Word Ablation [Kádár+16] Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  16. 16. • Analyzing mean omission scores in dataset on pos-tag, model for image focuses on NN > JJ > VB, CD > ... Word Ablation [Kádár+16] Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  17. 17. • Connections of RNNs • Tree structure and RNN (LSTM) • Tree-based Composition by Shift-reduce 2. Connections and Trees Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  18. 18. • Clockwork RNN. Combination of different RNNs with different step processing [Koutník+14, Liu+15, (Chung+16)] • Gated Feedback RNN. Feedback outputs into lower layers with gate [Chung+15] • Depth-Gated LSTM, Highway LSTM: Cell are connected to the upper layer‘s cell with gate [Yao+15, Chen+15] • k-th layer’s input is from k-1th’s input and output [Zhou+16] • Hierarchical RNN [Serban+2015] Connections in Multi-RNNs Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  19. 19. Pixel Recurrent Neu x1 xi xn xn2 Figure 2. Left: To generate pixel xi one conditions on all the pre- viously generated pixels left and above of xi. Center: Illustration of a Row LSTM with a kernel of size 3. The dependency field of the Row LSTM does not reach pixels further away on the sides of the image. Right: Illustration of the two directions of the Di- agonal BiLSTM. The dependency field of the Diagonal BiLSTM covers the entire available context in the image. Figure 3. In the Diagonal BiLSTM, to allow for parallelization along the diagonals, the input map is skewed by offseting each row by one position with respect to the previous row. When the spatial layer is computed left to right and column by column, the output map is shifted back into the original size. The convolution uses a kernel of size 2 ⇥ 1. (2015); Uria et al. (2014)). By contrast we model p(x) as a discrete distribution, with every conditional distribution 3 T th tu fo x p d la T in T a c L th tw u T (s re in la T th s h Pixel Recurrent Neural Networks x1 xi xn xn2 Figure 2. Left: To generate pixel xi one conditions on all the pre- viously generated pixels left and above of xi. Center: Illustration of a Row LSTM with a kernel of size 3. The dependency field of the Row LSTM does not reach pixels further away on the sides of the image. Right: Illustration of the two directions of the Di- agonal BiLSTM. The dependency field of the Diagonal BiLSTM covers the entire available context in the image. 3.1. Row LSTM The Row LSTM is a unidirectiona the image row by row from top to b tures for a whole row at once; the formed with a one-dimensional con xi the layer captures a roughly triang pixel as shown in Figure 2 (center). dimensional convolution has size k larger the value of k the broader the c The weight sharing in the convoluti invariance of the computed features The computation proceeds as follow an input-to-state component and a r component that together determine th LSTM core. To enhance parallelizat • Grid LSTM. Each axis has each LSTM for multi- dimensional applications [Kalchbrenner+15] • RNN for DAG, (image) pixel [Shuai+15, Zhu+16, Oord+16] • Structure complexity of RNN model [Zhang+16] as a conference paper at ICLR 2016 2d Grid LSTM blockblock m0 h0 h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Block cks form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 ons. The dashed lines indicate identity transformations. The standard LSTM block a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has ector m1 applied along the vertical dimension. essfully train feed-forward networks with up to 900 layers of depth. Grid LSTM with er review as a conference paper at ICLR 2016 2d Grid LSTM blockandard LSTM block m0 h0 h0 I ⇤ xi h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Block re 1: Blocks form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 3 dimensions. The dashed lines indicate identity transformations. The standard LSTM block not have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has memory vector m1 applied along the vertical dimension. ed to successfully train feed-forward networks with up to 900 layers of depth. Grid LSTM with review as a conference paper at ICLR 2016 2d Grid LSTM blockard LSTM block m0 h0 h0 I ⇤ xi h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Block 1: Blocks form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 dimensions. The dashed lines indicate identity transformations. The standard LSTM block ot have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has mory vector m1 applied along the vertical dimension. to successfully train feed-forward networks with up to 900 layers of depth. Grid LSTM with conference paper at ICLR 2016 2d Grid LSTM block m0 h0 h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Block orm the standard LSTM and those that form Grid LSTM networks of N = 1, 2 The dashed lines indicate identity transformations. The standard LSTM block mory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has m1 applied along the vertical dimension. ully train feed-forward networks with up to 900 layers of depth. Grid LSTM with onference paper at ICLR 2016 2d Grid LSTM block m0 h0 h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Block rm the standard LSTM and those that form Grid LSTM networks of N = 1, 2 The dashed lines indicate identity transformations. The standard LSTM block mory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has m1 applied along the vertical dimension. ly train feed-forward networks with up to 900 layers of depth. Grid LSTM with review as a conference paper at ICLR 2016 2d Grid LSTM blockdard LSTM block m0 h0 h0 I ⇤ xi h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Block e 1: Blocks form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 dimensions. The dashed lines indicate identity transformations. The standard LSTM block ot have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has emory vector m1 applied along the vertical dimension. d to successfully train feed-forward networks with up to 900 layers of depth. Grid LSTM with Under review as a conference paper at ICLR 2016 2d Grid LSTM blockStandard LSTM block m m0 h0 h h0 I ⇤ xi h1 h2 h0 2 h0 1 m1 m0 1 m0 2m2 1d Grid LSTM Block 3d Grid LSTM Figure 1: Blocks form the standard LSTM and those that form Grid LSTM networks o and 3 dimensions. The dashed lines indicate identity transformations. The standard L does not have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM the memory vector m1 applied along the vertical dimension. is used to successfully train feed-forward networks with up to 900 layers of depth. Grid L two dimensions is analogous to the Stacked LSTM, but it adds cells along the depth dim Connections in Multi-RNNs Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  20. 20. • Tree-LSTM [Tai+15] Apply LSTM to follow directed edges (child to parent) of tree structure. Most cited “Tree-LSTM” • S-LSTM [Zhu+15] Add peephole and remove input x • LSTM-RecursiveNN [Le+15] Control forget and input gates with untied matrices of each cell and ouput (h). Input gate is applied before tanh. • Top-down TreeLSTM [Zhang+16] Sentence generation from root of dependency tree Tree-LSTM Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1556–1566, Beijing, China, July 26-31, 2015. c 2015 Association for Computational Linguistics works, a type of recurrent neural net- work with a more complex computational unit, have obtained strong results on a va- riety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syn- tactic properties that would naturally com- bine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree- LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Senti- ment Treebank). 1 Introduction Most models for distributed representations of phrases and sentences—that is, models where real- valued vectors are used to represent meaning—fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence repre- sentations are independent of word order; for ex- ample, they can be generated by averaging con- stituent word representations (Landauer and Du- mais, 1997; Foltz et al., 1998). In contrast, se- quence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent sub- phrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011). x1 x2 x4 x5 x6 y1 y2 y3 y4 y6 Figure 1: Top: A chain-structured LSTM net- work. Bottom: A tree-structured LSTM network with arbitrary branching factor. Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., “cats climb trees” vs. “trees climb cats”). We therefore turn to order- sensitive sequential or tree-structured models. In particular, tree-structured models are a linguisti- cally attractive option due to their relation to syn- tactic interpretations of sentence structure. A nat- ural question, then, is the following: to what ex- tent (if at all) can we do better with tree-structured models as opposed to sequential models for sen- tence representation? In this paper, we work to- wards addressing this question by directly com- paring a type of sequential model that has recently been used to achieve state-of-the-art results in sev- eral NLP tasks against its tree-structured general- ization. Due to their capability for processing arbitrary- length sequences, recurrent neural networks 1556 w0 w0w1w2 w4 w5 w6 w0 w4 w5 G EN -L GEN-NX-LGEN-NX-L G EN -R GEN-NX-R GEN-NX-R w1w2w3 LD LD Figure 4: Generation of left and right dependents of node w0 according to LDTREELSTM. by input gate it and how much of the earlier mem- ory cell ˆcl t0 will be forgotten is controlled by forget gate ft. This process is computed as follows: z,l ˆl 1 z,l ˆl i g l d r t r p d t p g o Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  21. 21. Figure 2: Attentional Encoder-Decoder model. dj is calculated as the summation vector weighted by ↵j(i): dj = nX i=1 ↵j(i)hi. (6) To incorporate the attention mechanism into the decoding process, the context vector is used for the the j-th word prediction by putting an additional hidden layer ˜sj: ˜s = tanh(W [s ; d ] + b ), (7) Figure 3: Proposed model: Tree-to-sequence tentional NMT model. a sentence inherent in language. We propose novel tree-based encoder in order to explicitly ta the syntactic structure into consideration in t NMT model. We focus on the phrase structure a sentence and construct a sentence vector fro phrase vectors in a bottom-up fashion. The se tence vector in the tree-based encoder is the • Tree-based and Sequential Encoder for Attention. [Eriguchi+16] • Tree-LSTM composition with leaf nodes output from seq-LSTM • “The cutest approach!”, Kyunghyun Cho said at SedMT, NAACL16. • Undercoated seq-LSTM makes nodes more context-aware and less ambiguous. +[Bowman+16] Combination of Tree and Seq Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  22. 22. The hungry cat NP (VP(S REDUCE GENNT(NP)NT(VP) … cat hungry The a<t p(at) ut TtSt Figure 5: Neural architecture for defining a distribution over at given representations of the stack (St), output buffer (Tt) and history of actions (a<t). Details of the composition architecture of the NP, the action history LSTM, and the other elements of the stack are not shown. This architecture corresponds to the generator state at line 7 of Figure 4. of the forward and reverse LSTMs are concatenated, passed through an affine transformation and a tanh nonlinearity to become the subtree embedding.4 Be- cause each of the child node embeddings (u, v, w in Fig. 6) is computed similarly (if it corresponds to an internal node), this composition function is a kind of recursive neural network. 4.4 Discriminative Parsing Model A discriminative parsing model can be obtained by replacing the embedding of Tt at each time step with an embedding of the input buffer Bt. To train this model, the conditional likelihood of each sequence of actions given the input string is maximized.5 5 Inference via Importance Sampling • Generation by sequential actions from {GEN(word), REDUCE, NT(non-terminal symbol)}. Features for action decisions are LSTM’s outputs of (1) terminals (2) stack (3) action history. Recurrent Neural Network Grammars ber of output nodes in a parse tree as a function of the number of input words, stating the runtime com- plexity of the parsing algorithm as a function of the input size requires further assumptions. Assuming our fixed constraint on maximum depth, it is linear. 3.5 Comparison to Other Models Our generation algorithm algorithm differs from previous stack-based parsing/generation algorithms in two ways. First, it constructs rooted tree struc- tures top down (rather than bottom up), and sec- ond, the transition operators are capable of directly generating arbitrary tree structures rather than, e.g., assuming binarized trees, as is the case in much prior work that has used transition-based algorithms to produce phrase-structure trees (Sagae and Lavie, 2005; Zhang and Clark, 2011; Zhu et al., 2013). 4 Generative Model RNNGs use the generator transition set just pre- sented to define a joint distribution on syntax trees (y) and words (x). This distribution is defined as a sequence model over generator transitions that is pa- rameterized using a continuous space embedding of the algorithm state at each time step (ut); i.e., p(x, y) = |a(x,y)| Y t=1 p(at | a<t) = |a(x,y)| Y t=1 exp r> at ut + bat P a02AG(Tt,St,nt) exp r> a0 ut + ba0 , resentations of them we use recurrent neural net- works to “encode” their contents (Cho et al., 2014). Since the output buffer and history of actions are only appended to and only contain symbols from a finite alphabet, it is straightforward to apply a stan- dard RNN encoding architecture. The stack (S) is more complicated for two reasons. First, the ele- ments of the stack are more complicated objects than symbols from a discrete alphabet: open nontermi- nals, terminals, and full trees, are all present on the stack. Second, it is manipulated using both push and pop operations. To efficiently obtain representations of S under push and pop operations, we use stack LSTMs (Dyer et al., 2015). 4.1 Syntactic Composition Function When a REDUCE operation is executed, the parser pops a sequence of completed subtrees and/or to- kens (together with their vector embeddings) from the stack and makes them children of the most recent open nonterminal on the stack, “completing” the constituent. To compute an embedding of this new subtree, we use a composition function based on bidirectional LSTMs, which is illustrated in Fig. 6. NP u v w NP u v w NP x x Figure 6: Syntactic composition function based on bidirec- [Dyer+16] Input: The hungry cat meows . Stack Buffer Action 0 The | hungry | cat | meows | . NT(S) 1 (S The | hungry | cat | meows | . NT(NP) 2 (S | (NP The | hungry | cat | meows | . SHIFT 3 (S | (NP | The hungry | cat | meows | . SHIFT 4 (S | (NP | The | hungry cat | meows | . SHIFT 5 (S | (NP | The | hungry | cat meows | . REDUCE 6 (S | (NP The hungry cat) meows | . NT(VP) 7 (S | (NP The hungry cat) | (VP meows | . SHIFT 8 (S | (NP The hungry cat) | (VP meows . REDUCE 9 (S | (NP The hungry cat) | (VP meows) . SHIFT 10 (S | (NP The hungry cat) | (VP meows) | . REDUCE 11 (S (NP The hungry cat) (VP meows) .) Figure 2: Top-down parsing example. tackt Termst Open NTst Action Stackt+1 Termst+1 Open NTst+1 T n NT(X) S | (X T n + 1 T n GEN(x) S | x T | x n | (X | ⌧1 | . . . | ⌧` T n REDUCE S | (X ⌧1 . . . ⌧`) T n 1 ure 3: Generator transitions. Symbols defined as in Fig. 1 with the addition of T representing the history of generated terminals. Stack Terminals Action 0 NT(S) 1 (S NT(NP) 2 (S | (NP GEN(The) 3 (S | (NP | The The GEN(hungry) 4 (S | (NP | The | hungry The | hungry GEN(cat) 5 (S | (NP | The | hungry | cat The | hungry | cat REDUCE 6 (S | (NP The hungry cat) The | hungry | cat NT(VP) 7 (S | (NP The hungry cat) | (VP The | hungry | cat GEN(meows) 8 (S | (NP The hungry cat) | (VP meows The | hungry | cat | meows REDUCE 9 (S | (NP The hungry cat) | (VP meows) The | hungry | cat | meows GEN(.) 10 (S | (NP The hungry cat) | (VP meows) | . The | hungry | cat | meows | . REDUCE 11 (S (NP The hungry cat) (VP meows) .) The | hungry | cat | meows | . • REDUCE action weaves a new chunk vector by bi-LSTM and re-stacks it. “NP→the→hungry→cat” Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  23. 23. • Joint learning of shift-reduce parsing and sentence-level classification with shift-reduce-based tree-LSTM composition. When REDUCE the top 2 chunks are composed by tree-LSTM. • Speedy tree composition (like Recurrent NN). SPINN [Bowman+16] bu er down sat stack cat the composition tracking transition down sat the cat composition tracking transition down sat the cat tracking (a) The SPINN model unrolled for two transitions during the processing of the sentence the cat sat down. ‘Tracking’, ‘transition’, and ‘composition’ are neural network layers. Gray arrows indicate connections which are blocked by a gating function. bu er stack t = 0 down sat cat the t = 1 down sat cat the t = 2 down sat cat the t = 3 down sat the cat t = 4 down sat the cat t = 5 down sat the cat t = 6 sat down the cat t = 7 = T (the cat) (sat down) output to model for semantic task bu er down sat stack cat the composition tracking transition down sat the cat composition tracking transition down sat the cat tracking (a) The SPINN model unrolled for two transitions during the processing of the sentence the cat sat down. ‘Tracking’, ‘transition’, and ‘composition’ are neural network layers. Gray arrows indicate connections which are blocked by a gating function. bu er stack t = 0 down sat cat the t = 1 down sat cat the t = 2 down sat cat the t = 3 down sat the cat t = 4 down sat the cat t = 5 down sat the cat t = 6 sat down the cat t = 7 = T (the cat) (sat down) output to model for semantic task (b) The fully unrolled SPINN for the cat sat down, with neural network layers omitted for clarity. Stack-augmented Parser-Interpreter Neural Network Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  24. 24. • Repeated attention with LSTM captures input vectors as a set [Vinyals+15] • (End-to-end) Memory Networks [Sukhbaatar+15] Sentence encoding by weighted sum. Earlier word(’s vector) has larger weights at smaller dims. Later word(‘s vector) has larger weights at larger dims. • e.g., When sentence length d is 10 and vector dimension J is 20, value of 1st vec at 1st dim: (1-1/20)-(1/10)(1-2*1/20) = 0.86 value of 1st vec at 20th dim: (1-20/20)-(1/10)(1-2*20/20) = 0.1 value of 10th vec at 1st dim: (1-1/20)-(10/10)(1-2*1/20) = 0.05 value of 10th vec at 20th dim: (1-20/20)-(10/10)(1-2*20/20) = 1.0 (Encoders without C/RNN) tor with the structure lkj = (1 j/J) (k/d)(1 2j/J) (assuming 1-based indexing), ng the number of words in the sentence, and d is the dimension of the embedding. This presentation, which we call position encoding (PE), means that the order of the words mi. The same representation is used for questions, memory inputs and memory outputs. Encoding: Many of the QA tasks require some notion of temporal context, i.e. in ample of Section 2, the model needs to understand that Sam is in the bedroom after simplistic nature of the QA language). The same representation is used for the d answer a. Two versions of the data are used, one that has 1000 training problems second larger one with 10,000 per task. Details wise stated, all experiments used a K = 3 hops model with the adjacent weight sharing all tasks that output lists (i.e. the answers are multiple words), we take each possible of possible outputs and record them as a separate answer vocabulary word. presentation: In our experiments we explore two different representations for s. The first is the bag-of-words (BoW) representation that takes the sentence 2, ..., xin}, embeds each word and sums the resulting vectors: e.g mi = P j Axij and j. The input vector u representing the question is also embedded as a bag of words: . This has the drawback that it cannot capture the order of the words in the sentence, ortant for some tasks. propose a second representation that encodes the position of words within the s takes the form: mi = P j lj · Axij, where · is an element-wise multiplication. lj is a 4 4.2 ATTENTION MECHANISMS Neural models with memories coupled to differentiable addressing mechanism have been success- fully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bah- danau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a “content” based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following: qt = LSTM(q⇤ t 1) (3) ei,t = f(mi, qt) (4) ai,t = exp(ei,t) P j exp(ej,t) (5) rt = X i ai,tmi (6) q⇤ t = [qt rt] (7) Read Process Write Figure 1: The Read-Process-and-Write model. where i indexes through each memory vector mi (typically equal to the cardinality of X), qt is a query vector which allows us to read rt from the memories, f is a function that computes a single scalar from mi and qt (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. q⇤ t is the state which this LSTM evolves, and is formed by concatenating the query qt with the resulting attention readout rt. t is the index which indicates how many “processing steps” are being carried to compute the state to be fed to the decoder. Note that permuting mi and mi0 has no effect on the read vector rt. 4.3 READ, PROCESS, WRITE Our model, which naturally handles input sets, has three components (the exact equations and im- plementation will be released in an appendix prior to publication): • A reading block, which simply embeds each element xi 2 X using a small neural network onto a memory vector mi (the same neural network is used for all i). • A process block, which is an LSTM without inputs or outputs performing T steps of com- putation over the memories mi. This LSTM keeps updating its state by reading mi repeat- edly using the attention mechanism described in the previous section. At the end of this block, its hidden state q⇤ T is an embedding which is permutation invariant to the inputs. See eqs. (3)-(7) for more details. 4 fully applied to handwriting generation a danau et al., 2015a), and more general c 2015). Since we are interested in associa This has the property that the vector retri shuffled the memory. This is crucial for our process block based on an attention m qt = LSTM(q⇤ t 1) (3) ei,t = f(mi, qt) (4) ai,t = exp(ei,t) P j exp(ej,t) (5) rt = X i ai,tmi (6) q⇤ t = [qt rt] (7) where i indexes through each memory v a query vector which allows us to read single scalar from mi and qt (e.g., a do recurrent state but which takes no inputs by concatenating the query qt with the re how many “processing steps” are being c that permuting mi and mi0 has no effect o 4.3 READ, PROCESS, WRITE Our model, which naturally handles inpu plementation will be released in an appen • A reading block, which simply e onto a memory vector mi (the sa • A process block, which is an LS putation over the memories mi. edly using the attention mechan block, its hidden state q⇤ T is an em eqs. (3)-(7) for more details. Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  25. 25. • Regularizations • Dropout in RNN • Batch Normalization in RNN • Other Regularizations • Multi-task learning and pre-training of encoder(-decoder) 3. Learning Tricks Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  26. 26. • Dropout [Hinton+12, Srivastava+14] Drop nodes by probability p and multiply 1/(1-p). • RNN (mainly LSTM and GRU) need some tricks • At upward (inter-layer) connections [Zaremba+14] • At update terms [Semeniuta+16] • Use one consistent dropout mask in one seq. (Effect looks obscure now) [Semeniuta+16, Gal15] • Zoneout. Stochastically preserve previous c and h [Krueger+16] • Word dropout. Stochastically use zero/mean/<unk> vec as a word vec. [Iyyer+15, Dai&Le15, Dyer+15, Bowman+15] Dropout , d(ht−1)] + bh), (4) function from Equation 2. d-forward fully connected is a significant difference: ks every fully-connected only once, while it is not nt layer: each training ex- mposed of a number of in- ropout this results in hid- on every step. This obser- tion of how to sample the re two options: sample it sequence (per-sequence) mask on every step (per- wo strategies for sampling etail in Section 3.4. ht = ot ∗ f(ct), where it, ft, ot are input, output and forget gate step t; gt is the vector of cell updates and ct is updated cell vector used to update the hidden s ht; σ is the sigmoid function and ∗ is the elem wise multiplication. Our approach is to apply dropout to the cell date vector ct as follows: ct = ft ∗ ct−1 + it ∗ d(gt) In contrast, Moon et al. (2015) propose to ply dropout directly to the cell values and use sequence sampling: ct = d(ft ∗ ct−1 + it ∗ gt) We will discuss the limitations of the appro of Moon et al. (2015) in Section 3.4 and sup Figure 1: Illustration of the three types circles represent connections, hidden state we apply dropout. gt = f(Wg xt, rt ∗ ht−1 + bg) ht = (1 − zt) ∗ ht−1 + zt ∗ gt Similarly to the LSTMs, we propoose dropout to the hidden state updates vector ht = (1 − zt) ∗ ht−1 + zt ∗ d(gt) To the best of our knowledge, this work is to study the effect of recurrent dropout networks. 3.4 Dropout and memory Before going further with the explanatio LSTM GRU Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  27. 27. • Batch Normalization [Ioffe+15] Normalization, scale and shift function at intermediate layer. Popular for CNN (DNN). • RNN needs tricks [Cooijmans+16, Laurent+15] • Mean and Variance value for normalization are prepared at each time step • Don’t insert at cell’s recurrence • Initialize scale value lower (0.1). Prevent sigm/tanh saturation initially. Batch Normalization hidden-to-hidden transformations. We introduce the batch-normalizing transform BN( · ; , ) the LSTM as follows: 0 B B @ ˜ft ˜it ˜ot ˜gt 1 C C A = BN(Whht 1; h, h) + BN(Wxxt; x, x) + b (6) ct = (˜ft) ct 1 + (˜it) tanh( ˜gt) (7) ht = (˜ot) tanh(BN(ct; c, c)) (8) network by discarding the absolute scale of activations. We want to a preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data. 3 Normalization via Mini-Batch Statistics Since the full whitening of each layer’s inputs is costly and not everywhere differentiable, we make two neces- sary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input x = (x(1) . . . x(d) ), we will nor- malize each dimension x(k) = x(k) − E[x(k) ] Var[x(k)] where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the fea- tures are not decorrelated. Note that simply normalizing each input of a layer may change what the layer can represent. For instance, nor- B = {x1... Let the normalized values be x1. formations be y1...m. We refer to BNγ,β : x1...m → as the Batch Normalizing Trans Transform in Algorithm 1. In the added to the mini-batch variance Input: Values of x over a mini- Parameters to be learned Output: {yi = BNγ,β(xi)} µB ← 1 m m i=1 xi σ2 B ← 1 m m i=1 (xi − µB)2 xi ← xi − µB σ2 B + ϵ yi ← γxi + β ≡ BNγ,β(xi) Algorithm 1: Batch Normalizi Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  28. 28. • Norm-stabilizer [Krueger+15] • Penalize the difference between the norms of hidden vectors at successive time steps • Temporal Coherence Loss [Jonschkowski&Brock15] • Penalize the difference between the hidden vectors at successive time steps. Regularization for Successiveness Overfitting in machine learning is addressed by restricting the space o considered. This can be accomplished by reducing the number of par with an inductive bias for simpler models, such as early stopping. can be achieved by incorporating more sophisticated prior knowledg activations on a reasonable path can be difficult, especially across lo in mind, we devise a regularizer for the state representation learned RNNs, that aims to encourage stability of the path taken through repr we propose the following additional cost term for Recurrent Neural N 1 T TX t=1 (khtk2 kht 1k2)2 Where ht is the vector of hidden activations at time-step t, and is a h amounts of regularization. We call this penalty the norm-stabilizer, as norms of the hiddens to be stable (i.e. approximately constant acros coherence” penalty of Jonschkowski & Brock (2015), our penalty representation to remain constant, only its norm. In the absence of inputs and nonlinearities, a constant norm would imp to-hidden transition matrix for simple RNNs (SRNNs). However, in t sition matrix, inputs and nonlinearities can still change the norm of instability. This makes targeting the hidden activations directly a mo ing norm stability. Stability becomes especially important when we sequences at test time than those seen during training (the “training h arXiv:1511.08400v Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  29. 29. • Auto-(sentence-)encoder and language modeling as pre- training for sentence classifications. (But, joint learning is not good.) [Dai&Le15] • Multi-task learning for encoder-decoder. The coefficients of tasks’ losses are so important. (Multi-language translation, parsing, image captioning, auto-encoder, skip-thought vectors) [Luong+15] Multi-task Learning nce paper at ICLR 2016 English (unsupervised) German (translation) Tags (parsing)English y Setting – one encoder, multiple decoders. This scheme is useful for either as in Dong et al. (2015) or between different tasks. Here, English and Ger- of words in the respective languages. The α values give the proportions of are allocated for the different tasks. Published as a conference paper at ICLR 2016 English (unsupervised) German (translation) Tags (parsing)English Figure 2: One-to-many Setting – one encoder, multiple decoders. This scheme is useful for either multi-target translation as in Dong et al. (2015) or between different tasks. Here, English and Ger- man imply sequences of words in the respective languages. The α values give the proportions of parameter updates that are allocated for the different tasks. for constituency parsing as used in (Vinyals et al., 2015a), (b) a sequence of German words for ma- chine translation (Luong et al., 2015a), and (c) the same sequence of English words for autoencoders or a related sequence of English words for the skip-thought objective (Kiros et al., 2015). 3.2 MANY-TO-ONE SETTING This scheme is the opposite of the one-to-many setting. As illustrated in Figure 3, it consists of mul- tiple encoders and one decoder. This is useful for tasks in which only the decoder can be shared, for example, when our tasks include machine translation and image caption generation (Vinyals et al., 2015b). In addition, from a machine translation perspective, this setting can benefit from a large amount of monolingual data on the target side, which is a standard practice in machine translation system and has also been explored for neural MT by Gulcehre et al. (2015). English (unsupervised) Image (captioning) English German (translation) Figure 3: Many-to-one setting – multiple encoders, one decoder. This scheme is handy for tasks in which only the decoders can be shared. 3.3 MANY-TO-MANY SETTING Lastly, as the name describes, this category is the most general one, consisting of multiple encoders Published as a conference paper at ICLR 2016 German (translation) English (unsupervised) German (unsupervised) English Figure 4: Many-to-many setting – multiple encoders, multiple decoders. We consider t in a limited context of machine translation to utilize the large monolingual corpora i source and the target languages. Here, we consider a single translation task and two un autoencoder tasks. consist of ordered sentences, e.g., paragraphs. Unfortunately, in many applications th machine translation, we only have sentence-level data where the sentences are unordered. that, we split each sentence into two halves; we then use one half to predict the other hal Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  30. 30. • Lighten Calculation of Softmax output • Copy mechanism • Character-based • Global Optimization of Decoding 4. Decoding Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  31. 31. • Softmax over large vocabulary (class) has large time and space computational complexity. Lighten (replace) it by • Sampled Softmax • Class-factored Softmax • Hierarchical Softmax • BlackOut • Noise Constrastive Estimation; NCE • Self-normalization • Negative Sampling Lighten Softmax Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  32. 32. • Copy function from source sentence • [Gulcehre+16] calculates attention distribution over source sentence(‘s LSTM outputs). (Pointer Networks [Vinyals+15]) Sigmoid gate-based weighted sum of common vocabulary output probability and copy vocabulary probability distribution. • [Gu+16] Similar, but more complicated structure Copy Mechanism hello , my name is Tony Jebara . Attentive Read hi , Tony Jebara <eos> hi , Tony h1 h2 h3 h4 h5 s1 s2 s3 s4 h6 h7 h8 “Tony” DNN Embedding for “Tony” Selective Read for “Tony” (a) Attention-based Encoder-Decoder (RNNSearch) (c) State Update s4 SourceVocabulary Softmax Prob(“Jebara”)=Prob(“Jebara”, g) +Prob(“Jebara”, c) … ... (b) Generate-Mode & Copy-Mode ! M M Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  33. 33. forms and their meanings is non-trivial (de Saus- sure, 1916). While some compositional relation- ships exist, e.g., morphological processes such as adding -ing or -ly to a stem have relatively reg- ular effects, many words with lexical similarities convey different meanings, such as, the word pairs lesson () lessen and coarse () course. 3 C2W Model Our compositional character to word (C2W) model is based on bidirectional LSTMs (Graves and Schmidhuber, 2005), which are able to learn complex non-local dependencies in sequence models. An illustration is shown in Figure 1. The input of the C2W model (illustrated on bottom) is a single word type w, and we wish to obtain is a d-dimensional vector used to represent w. This model shares the same input and output of a word lookup table (illustrated on top), allowing it to eas- ily replace then in any network. As input, we define an alphabet of characters C. For English, this vocabulary would contain an entry for each uppercase and lowercase letter as well as numbers and punctuation. The input word w is decomposed into a sequence of characters c1, . . . , cm, where m is the length of w. Each ci cats cat cats job .... .... ........ cats c a t s a c t .... .... s Character Lookup Table Word Lookup Table Bi-LSTM embeddings for word "cats" embeddings for word "cats" • In/output chunk is a character, not a defined word • Language model, various task’s input features, machine translation’s decoding LM: [Sutskever+11, Graves13, Ling+15a, Kim+15], MT: [Chung+16, Ling+15b,Costa- jussa&Fonollosa16,Luong+16] • Combination of words and characters [Kang+11,Józefowicz+16,Miyamoto&Cho16] • (Not only RNN composition, but also CNN) • Good in terms of morphology and rare word problem Character-based Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  34. 34. Figure 1: Illustration of the Scheduled Sampling approach, • Reinventing the wheel (of non-NN research)? (Even so, these are useful and good next steps.) • Use model’s prediction as next input in training (while usually only true input is used) [Bengio+15] • Similar to DAgger [Daumé III16; Blog] • Use dynamic oracle [Daumé III16; Blog] [Ballesteros+16, Goldberg&Nivre13] Global Decoding Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  35. 35. • REINFORCE to optimize BLEU/ROUGE [Ranzato+15] • Minimum Risk Training [Shen+15, Ayana+16] • Optimization for beam search [Wiseman&Rush16] Global Decoding In order to apply the REINFORCE algorithm (Williams, 1992; Zaremba & Sutskever, 2015) to the problem of sequence generation we cast our problem in the reinforcement learning (RL) frame- work (Sutton & Barto, 1988). Our generative model (the RNN) can be viewed as an agent, which interacts with the external environment (the words and the context vector it sees as input at every time step). The parameters of this agent defines a policy, whose execution results in the agent pick- ing an action. In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step. After taking an action the agent updates its internal state (the hid- den units of RNN). Once the agent has reached the end of a sequence, it observes a reward. We can choose any reward function. Here, we use BLEU (Papineni et al., 2002) and ROUGE-2 (Lin & Hovy, 2003) since these are the metrics we use at test time. BLEU is essentially a geometric mean over n-gram precision scores as well as a brevity penalty (Liang et al., 2006); in this work, we consider up to 4-grams. ROUGE-2 is instead recall over bi-grams. Like in imitation learning, we have a training set of optimal sequences of actions. During training we choose actions according to the current policy and only observe a reward at the end of the sequence (or after maximum sequence length), by comparing the sequence of actions from the current policy against the optimal action sequence. The goal of training is to find the parameters of the agent that maximize the expected reward. We define our loss as the negative expected reward: L✓ = X wg 1 ,...,wg T p✓(wg 1, . . . , wg T )r(wg 1, . . . , wg T ) = E[wg 1 ,...wg T ]⇠p✓ r(wg 1, . . . , wg T ), (9) where wg n is the word chosen by our model at the n-th time step, and r is the reward associated with the generated sequence. In practice, we approximate this expectation with a single sample from the distribution of actions implemented by the RNN (right hand side of the equation above and Figure 9 of Supplementary Material). We refer the reader to prior work (Zaremba & Sutskever, 2015; Williams, 1992) for the full derivation of the gradients. Here, we directly report the partial derivatives and their interpretation. The derivatives w.r.t. parameters are: @L✓ @✓ = X t @L✓ @ot @ot @✓ (10) 6 Published as a conference paper at ICLR 2016 h2 = ✓( , h1) p✓(w| , h1) XENT h1 w2 w3XENT top-k w0 1,...,k p✓(w|w0 1,...,k, h2) w00 1,...,k h3 = ✓(w0 1,...,k, h2) top-k Figure 3: Illustration of the End-to-End BackProp method. The first steps of the unrolled sequence (here just the first step) are exactly the same as in a regular RNN trained with cross-entropy. How- ever, in the remaining steps the input to each module is a sparse vector whose non-zero entries are the k largest probabilities of the distribution predicted at the previous time step. Errors are back- propagated through these inputs as well. While this algorithm is a simple way to expose the model to its own predictions, the loss function optimized is still XENT at each time step. There is no explicit supervision at the sequence level while training the model. 3.2 SEQUENCE LEVEL TRAINING We now introduce a novel algorithm for sequence level training, which we call Mixed Incremental Cross-Entropy Reinforce (MIXER). The proposed method avoids the exposure bias problem, and oss L using a two-step pro- ass, we compute candidate n violations (sequences with backward pass, we back- ugh the seq2seq RNNs. Un- ining, the first-step requires case beam search) to find Time Step a red dog smells home today the dog dog barks quickly Friday red blue cat barks straight now runs today a red dog runs quickly today blue dog barks home today Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi
  36. 36. • Better RNN’s units and connections are produced, however, their impacts are small now (compared to ones between “vanilla and LSTM” or “1- layer or multi-layer”.) • Analysis is more needed in general and each task • Designing models with (reasonable) idea may good result, e.g., tree composition • Regularization and learning tricks increased • Other decoding training or inference algorithms are required Summary Tohoku University, Inui and Okazaki Lab. Sosuke Kobayashi

×