Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

406 views

Published on

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.

Published in:
Data & Analytics

No Downloads

Total views

406

On SlideShare

0

From Embeds

0

Number of Embeds

0

Shares

0

Downloads

35

Comments

0

Likes

1

No embeds

No notes for slide

- 1. Recurrent Neural Networks DLAI – MARTA R. COSTA-JUSSÀ
- 2. 2 Outline 1. The importance of context. Sequence modeling 2. Feed-forward networks review 3. Vanilla RNN 4. Vanishing gradient 5. Gating methodology 6. Other RNN extensions 7. Use case
- 3. 1.The importance of context 3
- 4. 4 The importance of context. Sequence modeling ● Recall the 5th digit of your phone number ● Sing your favourite song beginning at third sentence ● Recall 10th character of the alphabet Probably you went straight from the beginning of the stream in each case… because in sequences order matters! Idea: retain the information preserving the importance of order
- 5. Language Applications •Language Modeling (probability) •Machine Translation •Speech Recognition p(I’m) p(fine|I’m) p(.|fine) EOS I’m fine .<s> 5
- 6. Image Applications Image credit: The Unreasonable Effectiveness of RNNs, André Karpathy 6
- 7. 2. Feedforward Network Review 7
- 8. 8 Feedforward Networks FeedFORWARD Every y/hi is computed from the sequence of forward activations out of input x.
- 9. 9 Where is the Memory I? If we have a sequence of samples... predict sample x[t+1] knowing previous values {x[t], x[t-1], x[t-2], …, x[t-τ]}
- 10. 10 Where is the Memory II? ... ... ... x[t+1] x[t-L], …, x[t-1], x[t] x[t+1] L Feed Forward approach: ● static window of size L ● slide the window time-step wise
- 11. 11 Where is the Memory III? ... ... ... x[t+1] x[t-L], …, x[t-1], x[t] x[t+1] L ... ... ... x[t-L+1], …, x[t], x[t+1] x[t+2]Feed Forward approach: ● static window of size L ● slide the window time-step wise
- 12. 12 Where is the Memory IV? Feed Forward approach: ● static window of size L ● slide the window time-step wise ... ... ... x[t+1] x[t-L], …, x[t-1], x[t] x[t+1] L ... ... ... x[t-L+1], …, x[t], x[t+1] x[t+2] ... ... ... x[t-L+2], …, x[t+1], x[t+2] x[t+3]
- 13. Any problems with this approach? 13
- 14. 14 Problems for the FNN + static window approach I ● If increasing L, fast growth of num of parameters!
- 15. 15 Problems for the FNN + static window approach II ● If increasing L, fast growth of num of parameters! ● Decisions are independent between time-steps! ○ The network doesn’t care about what happened at previous time-step, only present window matters → doesn’t look good TIME-STEPS: the memory do you want to include in your network. If you want your network to have memory of 60 characters, this number should be 60.
- 16. 16 Problems for the FNN + static window approach III ● If increasing L, fast growth of num of parameters! ● Decisions are independent between time-steps! ○ The network doesn’t care about what happened at previous time-step, only present window matters → doesn’t look good ● Cumbersome padding when there are not enough samples to fill L size ○ Can’t work with variable sequence lengths
- 17. 3. Vanilla RNN 17
- 18. 18 Recurrent Neural Network (RNN) adding the “temporal” evolution Allow to build specific connections capturing ”history” Image src: https://medium.com/towards- data-science/introduction-to-recurrent- neural-network-27202c3945f3 𝑥" ℎ" 𝑦"
- 19. 19 Recurrent Neural Network (RNN) adding the “temporal” evolution Allow to build specific connections capturing ”history” Image src: https://medium.com/towards- data-science/introduction-to-recurrent- neural-network-27202c3945f3 𝑥" ℎ" 𝑦" 𝒚 𝒕 = 𝒔𝒐𝒇𝒕𝒎𝒂𝒙(𝑽𝒉 𝒕)
- 20. 20 RNN: parameters 𝑦2 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑉ℎ2) 𝑁;<=2> = × 𝑁A;2B;2> = Image src: https://medium.com/towards- data-science/introduction-to-recurrent- neural-network-27202c3945f3
- 21. 21 RNN: unfolding BEWARE: We have extra depth now! Every time-step is an extra level of depth (as a deeper stack of layers in a feed-forward fashion!)
- 22. 22 RNN: depth 1 Forward in space propagation
- 23. 23 RNN: depth 2 Forward in time propagation
- 24. 24 RNN: depth in space + time Hence we have two data flows: Forward in space + time propagation: 2 projections per layer activation Last time-step includes the context of our decisions recursively W, U, V shared across all steps 𝒚 𝒕 = 𝒔𝒐𝒇𝒕𝒎𝒂𝒙(𝑽𝒉 𝒕)
- 25. 25 Alternatives of recurrence Image src: Goodfellow et al. Deep Learning. The MIT Press
- 26. 26 Alternatives of outputs Image src: Goodfellow et al. Deep Learning. The MIT Press
- 27. 27 Training a RNN I: BPTT Backpropagation through time (BPTT): The training algorithm for updating network weights to minimize error including time
- 28. 28 Remember BackPropagation 1. Present a training input pattern and propagate it through the network to get an output. 2. Compare the predicted outputs to the expected outputs and calculate the error. 3. Calculate the derivatives of the error with respect to the network weights. 4. Adjust the weights to minimize the error. 5. Repeat.
- 29. 29 BPTT I Cross Entropy Loss 𝑦2 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑉ℎ2)
- 30. 30 BPTT II NOTE: our goal is to calculate the gradients of the error with respect to our parameters U, W and V and and then learn good parameters using Stochastic Gradient Descent. Just like we sum up the errors, we also sum up the gradients at each time step for one training example:
- 31. 31 BPTT III (E3 computation) 𝜕𝐸E 𝜕𝑊 = 𝜕𝐸E 𝜕𝑦GE 𝜕𝑦GE 𝜕ℎE 𝜕ℎE 𝜕𝑊 ℎE = 𝑓(𝑈𝑥2 + 𝑊ℎJ)
- 32. 32 BPTT IV 𝜕𝐸E 𝜕𝑊 = K 𝜕𝐸E 𝜕𝑦GE 𝜕𝑦GE 𝜕ℎE 𝜕ℎE 𝜕ℎL 𝜕ℎL 𝜕𝑊 E MNO
- 33. 4. Vanishing gradient 33
- 34. Example Maria has studied law and she is a lawyer. 34
- 35. 35 Vanishing gradient I ● During training gradients explode/vanish easily because of depth-in-time → Exploding/Vanishing gradients!
- 36. 36 Vanishing gradient II For example, 𝜕𝐸E 𝜕𝑊 = K 𝜕𝐸E 𝜕𝑦GE 𝜕𝑦GE 𝜕ℎE 𝜕ℎE 𝜕ℎL 𝜕ℎL 𝜕𝑊 E MNO 𝜕ℎE 𝜕ℎ" = 𝜕ℎE 𝜕ℎJ 𝜕ℎJ 𝜕ℎ" 𝜕𝐸E 𝜕𝑊 = K 𝜕𝐸E 𝜕𝑦GE 𝜕𝑦GE 𝜕ℎE ( P 𝜕ℎQ 𝜕ℎQR" E QNLS" ) 𝜕ℎL 𝜕𝑊 E MNO
- 37. 37 Vanishing gradient III 𝜕𝐸E 𝜕𝑊 = K 𝜕𝐸E 𝜕𝑦GE 𝜕𝑦GE 𝜕ℎE ( P 𝜕ℎQ 𝜕ℎQR" E QNLS" ) 𝜕ℎL 𝜕𝑊 E MNO tanh and derivative. Source: http://nn.readthedocs.org/en/rtd/transfer/
- 38. 38 Standard Solutions Proper initialization of Weight Matrix Regularization of outputs or Dropout Use of ReLU Activations as it’s derivative is either 0 or 1
- 39. Quizz 39
- 40. 5. Gating method 40
- 41. Standard RNN 41 https://www.nextbigfuture.com/2016/03/recurrent-neural-nets.html http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- 42. Long-Short Term Memory (LSTM) 42
- 43. 43 Gating method 1. Change the way in which past information is kept → create the notion of cell state, a memory unit that keeps long-term information in a safer way by protecting it from recursive operations 2. Make every RNN unit able to decide whether the current time-step information matters or not, to accept or discard (optimized reading mechanism) 3. Make every RNN unit able to forget whatever may not be useful anymore by clearing that info from the cell state (optimized clearing mechanism) 4. Make every RNN unit able to output the decisions whenever it is ready to do so (optimized output mechanism)
- 44. 44 Long Short Term Memory (LSTM) cell An LSTM cell is defined by two groups of neurons plus the cell state (memory unit): 1. Gates 2. Activation units 3. Cell state Computation Flow
- 45. 45 Long Short-Term Memory (LSTM) Three gates are governed by sigmoid units (btw [0,1]) define the control of in & out information.. Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) slide credit: Xavi Giro
- 46. 46 Forget Gate Forget Gate: Concatenate Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
- 47. 47 Forget Gate: Example Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes LANGUAGE MODELING Joan es un chico activo y Anna es una chica calmada Forget about ”male” gender
- 48. 48 Input Gate Input Gate Layer New contribution to cell state Classic neuron Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
- 49. 49 Input Gate: Example Input Gate Layer Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes LANGUAGE MODELING Joan es un chico activo y Anna es una chica calmada Input about ”female” gender
- 50. 50 Update Cell State Update Cell State (memory): Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
- 51. 51 Output Gate Output Gate Layer Output to next layer Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
- 52. 52 Output Gate: Example Figure: Cristopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes LANGUAGE MODELING Joan es un chico activo y Anna es una chica calmada Info relevant for a verb? : ”female”, “singular”, “3rd person”
- 53. 53 LSTM: parameters An LSTM cell is defined by two groups of neurons plus the cell state (memory unit): 1. Gates 2. Activation units 3. Cell state 3 gates + input activation
- 54. 54 Gated Recurrent Unit (GRU) Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." AMNLP 2014. slide credit: Xavi Giro
- 55. Visual Comparison FNN, Vanilla RNNs and LSTMs 55 Image src http://blog.echen.me/2017/05/30/exploring-lstms/
- 56. Visual Comparison FNN, Vanilla RNNs and LSTMs 56 Image src http://blog.echen.me/2017/05/30/exploring-lstms/
- 57. Visual Comparison FNN, Vanilla RNNs and LSTMs 57 Image src http://blog.echen.me/2017/05/30/exploring-lstms/
- 58. Visualization 58
- 59. 6. Other RNN extensions 59
- 60. Bidirectional RNNs Image src: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ 60
- 61. Deep RNNs Image src: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ 61
- 62. 7. Use Case 62
- 63. Character-level language models P(‘he’)=P(e|h) P(‘hel’)=P(l|’he’) P(‘hell’)=P(l|’hel’) P(‘hello’)=P(o|’hell’) All these probabilities should be likely 63
- 64. Character-level language models with RNN h 1 0 0 0 e 0 1 0 0 l 0 0 1 0 o 0 0 0 1 64
- 65. Character-level language models diagram I Imagesrc: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 65 Hidden layer Output layer
- 66. Character-level language diagram II Imagesrc: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 66 Hidden layer Input layer
- 67. Character-level language diagram III Imagesrc: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 67
- 68. Code Available: https://gist.github.com/karpathy/ min-char-rnn.py 68
- 69. Still ISSUES with RNNs?? 69
- 70. 70 Thanks ! Q&A ?

No public clipboards found for this slide

Be the first to comment