
# RNN, LSTM and Seq-2-Seq Models


Presented by Jayeol Chun and Sang-Hyun Eun
June 9, 2016.


### RNN, LSTM and Seq-2-Seq Models

1. Introductory Presentation on RNN, LSTM and Seq-2-Seq Models, by Jayeol Chun and Sang-Hyun Eun.

   1. Brief Overview of the Theory behind RNN

   Q: What is an RNN?

2. Feed-Forward vs. Feed-Back: Static vs. Dynamic

   As opposed to a feed-forward network such as a Convolutional Neural Network (CNN), which contains no cycles, a Recurrent Neural Network (RNN) maintains the persistence of information by linking the outputs of previous computations to later computations. It is thus well suited for processing sequences of characters, naturally making it an ideal tool in NLP.

   Basic RNN computation in theory:

   ```python
   import numpy as np

   class RNN:
       # ...
       def step(self, x):
           # update the hidden state
           self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
           # compute the output vector
           y = np.dot(self.W_hy, self.h)
           return y

   # main instruction
   rnn = RNN()
   y = rnn.step(x)  # x is an input vector, y is the RNN's output vector
   ```

   Point to take away: quite simple!

   Challenge: the Unstable Gradient Problem. "The gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers." In at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. Question: the more hidden layers, the better?

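   A minimal sketch, not from the slides, that fills in the parts elided in the step code above (weight initialization, one-hot inputs) under assumed sizes, and unrolls the update over a short sequence to make the persistence of the hidden state concrete:

   ```python
   import numpy as np

   np.random.seed(0)
   hidden_size, vocab_size = 8, 4                 # assumed sizes for illustration
   W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
   W_xh = np.random.randn(hidden_size, vocab_size) * 0.1
   W_hy = np.random.randn(vocab_size, hidden_size) * 0.1

   h = np.zeros(hidden_size)                      # hidden state persists across steps
   for char_id in [0, 1, 2, 2]:                   # e.g. "hell" as indices
       x = np.eye(vocab_size)[char_id]            # one-hot input vector
       h = np.tanh(W_hh @ h + W_xh @ x)           # same update as step() above
       y = W_hy @ h                               # output depends on all inputs so far
   print(y)
   ```
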
3. "Backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the front layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly."

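   Not from the slides: a tiny numeric sketch of the chain-rule point above, multiplying n small per-layer factors (tanh derivatives at assumed pre-activations) to show how quickly the error signal shrinks:

   ```python
   import numpy as np

   def tanh_grad(a):
       # derivative of tanh at pre-activation a; at most 1, usually much smaller
       return 1.0 - np.tanh(a) ** 2

   preacts = [1.5, -2.0, 1.0, 2.5, -1.2]   # assumed per-layer pre-activations
   signal = 1.0
   for n, a in enumerate(preacts, start=1):
       signal *= tanh_grad(a)              # chain rule: one factor per layer
       print(f"gradient factor after {n} layers: {signal:.6f}")
   ```
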
4. 2. Long Short Term Memory Network (LSTM)

   The most commonly used type of RNN that addresses the above challenge; it can learn to recognize context-sensitive languages.

   The key is the cell state. It runs straight down the entire chain, with only minor linear interactions, and updates its information through structures called gates. There are 3 main types of gates:

   - Forget Gate Layer: a sigmoid layer that chooses what information to forget.
   - Input Gate Layer: chooses which values to update and what values to add.
   - Output Gate Layer: filters the cell state to decide which values to output.

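   Not the presenters' code: a minimal NumPy sketch of a single LSTM step, with assumed weight names and sizes, just to show where the forget, input, and output gates act on the cell state:

   ```python
   import numpy as np

   def sigmoid(z):
       return 1.0 / (1.0 + np.exp(-z))

   def lstm_step(x, h_prev, c_prev, W, b):
       # W maps the concatenated [h_prev, x] to the four gate pre-activations
       f, i, o, g = np.split(W @ np.concatenate([h_prev, x]) + b, 4)
       f = sigmoid(f)              # forget gate: what to drop from the cell state
       i = sigmoid(i)              # input gate: which new values to write
       o = sigmoid(o)              # output gate: what part of the cell to expose
       g = np.tanh(g)              # candidate values to add
       c = f * c_prev + i * g      # cell state runs straight through, lightly edited
       h = o * np.tanh(c)          # new hidden state
       return h, c

   # toy usage with assumed sizes
   hidden, inputs = 8, 4
   rng = np.random.default_rng(0)
   W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
   b = np.zeros(4 * hidden)
   h = c = np.zeros(hidden)
   for x in np.eye(inputs):        # feed a few one-hot inputs
       h, c = lstm_step(x, h, c, W, b)
   print(h)
   ```
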
5. 3. Sequence to Sequence Model

   The Seq-2-Seq Model consists of two RNNs: an encoder that processes the input and maps it to a vector, and a decoder that generates the output sequence of symbols from that vector representation. Specifically, the encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence of symbols. The two networks are trained jointly to maximize the conditional probability of the target sequence given the source sequence. Each box in the slide's figure represents a cell of the RNN.

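   Not from the slides: a bare-bones sketch of the encoder/decoder split using the simple RNN update from earlier. The names, sizes, and greedy decoding loop are assumptions, meant only to show a variable-length source collapsing into one fixed-length vector that the decoder expands back into a target sequence:

   ```python
   import numpy as np

   rng = np.random.default_rng(0)
   H, V = 8, 4                                    # assumed hidden and vocab sizes
   enc_Whh, enc_Wxh = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, V))
   dec_Whh, dec_Wxh = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, V))
   dec_Why = rng.normal(0, 0.1, (V, H))

   def encode(source_ids):
       # fold a variable-length source sequence into one fixed-length vector
       h = np.zeros(H)
       for t in source_ids:
           h = np.tanh(enc_Whh @ h + enc_Wxh @ np.eye(V)[t])
       return h

   def decode(h, steps):
       # greedily emit a target sequence, starting from the encoder's vector
       out, prev = [], 0                          # 0 plays the role of a start symbol
       for _ in range(steps):
           h = np.tanh(dec_Whh @ h + dec_Wxh @ np.eye(V)[prev])
           prev = int(np.argmax(dec_Why @ h))     # pick the most likely next symbol
           out.append(prev)
       return out

   print(decode(encode([0, 1, 2, 2]), steps=4))
   ```

   With untrained weights the output is arbitrary; training would adjust both networks jointly to maximize the probability of the correct target sequence, as described above.
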
6. Example: Sample RNN / Seq-2-Seq Code

7. The sample code:

   ```python
   import tensorflow as tf
   from tensorflow.models.rnn import rnn, rnn_cell
   import numpy as np

   char_rdic = ['h', 'e', 'l', 'o']                      # id -> char
   char_dic = {w: i for i, w in enumerate(char_rdic)}    # char -> id
   sample = [char_dic[c] for c in "hello"]               # "hello" as indices

   x_data = np.array([[1, 0, 0, 0],   # h
                      [0, 1, 0, 0],   # e
                      [0, 0, 1, 0],   # l
                      [0, 0, 1, 0]],  # l
                     dtype='f')

   # Configuration
   char_vocab_size = len(char_dic)
   rnn_size = 4        # = char_vocab_size, one-hot coding (one of 4)
   time_step_size = 4  # 'hell' -> predict 'ello'
   batch_size = 1      # one sample

   # RNN model
   rnn_cell = rnn_cell.BasicRNNCell(rnn_size)
   state = tf.zeros([batch_size, rnn_cell.state_size])
   X_split = tf.split(0, time_step_size, x_data)
   outputs, state = tf.nn.seq2seq.rnn_decoder(X_split, state, rnn_cell)
   print(state)
   print(outputs)

   # logits:  list of 2D Tensors of shape [batch_size x num_decoder_symbols].
   # targets: list of 1D batch-sized int32 Tensors of the same length as logits.
   # weights: list of 1D batch-sized float-Tensors of the same length as logits.
   logits = tf.reshape(tf.concat(1, outputs), [-1, rnn_size])
   targets = tf.reshape(sample[1:], [-1])
   weights = tf.ones([time_step_size * batch_size])

   loss = tf.nn.seq2seq.sequence_loss_by_example([logits], [targets], [weights])
   cost = tf.reduce_sum(loss) / batch_size
   train_op = tf.train.RMSPropOptimizer(0.01, 0.9).minimize(cost)

   # Launch the graph in a session
   with tf.Session() as sess:
       # you need to initialize all variables
       tf.initialize_all_variables().run()
       for i in range(100):
           sess.run(train_op)
           result = sess.run(tf.arg_max(logits, 1))
           print(result, [char_rdic[t] for t in result])
   ```