Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial Intelligence)
https://telecombcn-dl.github.io/2017-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.

  1. Recurrent Neural Networks. DLAI – MARTA R. COSTA-JUSSÀ
  2. Outline: 1. The importance of context. Sequence modeling 2. Feed-forward networks review 3. Vanilla RNN 4. Vanishing gradient 5. Gating methodology 6. Other RNN extensions 7. Use case
  3. 1. The importance of context
  4. The importance of context. Sequence modeling: ● Recall the 5th digit of your phone number ● Sing your favourite song beginning at the third sentence ● Recall the 10th character of the alphabet. You probably went straight from the beginning of the stream in each case… because in sequences order matters! Idea: retain the information while preserving the importance of order.
  5. Language Applications: • Language Modeling (probability) • Machine Translation • Speech Recognition. Example: p(I’m) p(fine|I’m) p(.|fine) over the sequence <s> I’m fine . EOS
  6. Image Applications. Image credit: The Unreasonable Effectiveness of RNNs, Andrej Karpathy
  7. 2. Feedforward Network Review
  8. Feedforward Networks. FeedFORWARD: every y/hi is computed from the sequence of forward activations out of the input x.
  9. Where is the Memory I? If we have a sequence of samples, predict sample x[t+1] knowing the previous values {x[t], x[t-1], x[t-2], …, x[t-τ]}.
  10. Where is the Memory II? Feed-forward approach: ● static window of size L ● slide the window time-step wise. [Figure: a feed-forward net maps the window x[t-L], …, x[t-1], x[t] to the prediction x[t+1].]
  11. Where is the Memory III? Feed-forward approach: ● static window of size L ● slide the window time-step wise. [Figure: the window slides one step, from x[t-L], …, x[t] → x[t+1] to x[t-L+1], …, x[t+1] → x[t+2].]
  12. Where is the Memory IV? Feed-forward approach: ● static window of size L ● slide the window time-step wise, as in the sliding-window sketch below. [Figure: successive windows x[t-L], …, x[t] → x[t+1]; x[t-L+1], …, x[t+1] → x[t+2]; x[t-L+2], …, x[t+2] → x[t+3].]
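A minimal sketch of the static-window setup described on these slides, assuming a 1-D signal stored in a NumPy array; the helper name make_windows and the window indexing (last L samples) are illustrative, not from the slides.

```python
import numpy as np

def make_windows(x, L):
    """Build (window, target) pairs: predict x[t+1] from the last L samples."""
    inputs, targets = [], []
    for t in range(L - 1, len(x) - 1):
        inputs.append(x[t - L + 1 : t + 1])  # static window slid time-step wise
        targets.append(x[t + 1])             # next sample to predict
    return np.array(inputs), np.array(targets)

# Example: a toy signal and a window of size L=4
signal = np.sin(np.linspace(0, 10, 100))
X, y = make_windows(signal, L=4)
print(X.shape, y.shape)  # (96, 4) (96,)
```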
  13. Any problems with this approach?
  14. Problems of the FNN + static window approach I: ● Increasing L makes the number of parameters grow fast!
  15. Problems of the FNN + static window approach II: ● Increasing L makes the number of parameters grow fast! ● Decisions are independent between time-steps! ○ The network doesn’t care about what happened at the previous time-step; only the present window matters → doesn’t look good. TIME-STEPS: the amount of memory you want to include in your network. If you want your network to have a memory of 60 characters, this number should be 60.
  16. Problems of the FNN + static window approach III: ● Increasing L makes the number of parameters grow fast! ● Decisions are independent between time-steps! ○ The network doesn’t care about what happened at the previous time-step; only the present window matters → doesn’t look good. ● Cumbersome padding when there are not enough samples to fill a window of size L ○ Can’t work with variable sequence lengths.
  17. 3. Vanilla RNN
  18. Recurrent Neural Network (RNN): adding the “temporal” evolution. Allows building specific connections capturing “history”. [Figure: recurrent cell with input x_t, hidden state h_t and output y_t.] Image src: https://medium.com/towards-data-science/introduction-to-recurrent-neural-network-27202c3945f3
  19. Recurrent Neural Network (RNN): adding the “temporal” evolution. Allows building specific connections capturing “history”. $y_t = \mathrm{softmax}(V h_t)$. Image src: https://medium.com/towards-data-science/introduction-to-recurrent-neural-network-27202c3945f3
  20. RNN: parameters. $y_t = \mathrm{softmax}(V h_t)$, where $V$ holds $N_{\text{units}} \times N_{\text{outputs}}$ parameters. Image src: https://medium.com/towards-data-science/introduction-to-recurrent-neural-network-27202c3945f3
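A minimal NumPy sketch of the vanilla RNN step the slides describe, $h_t = \tanh(U x_t + W h_{t-1})$ and $y_t = \mathrm{softmax}(V h_t)$; the sizes and variable names here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

# Illustrative sizes: inputs, hidden units, outputs
N_in, N_units, N_out = 4, 8, 4
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (N_units, N_in))     # input -> hidden
W = rng.normal(0, 0.1, (N_units, N_units))  # hidden -> hidden (recurrence)
V = rng.normal(0, 0.1, (N_out, N_units))    # hidden -> output

def rnn_step(x_t, h_prev):
    """One time-step: forward in space (U, V) and forward in time (W)."""
    h_t = np.tanh(U @ x_t + W @ h_prev)
    y_t = softmax(V @ h_t)
    return h_t, y_t

# Unfold over a toy sequence; U, W, V are shared across all steps
h = np.zeros(N_units)
for x_t in np.eye(N_in):               # e.g. a sequence of one-hot inputs
    h, y = rnn_step(x_t, h)
print(y.shape)                         # (4,)
```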
  21. RNN: unfolding. BEWARE: we have extra depth now! Every time-step is an extra level of depth (like a deeper stack of layers in a feed-forward fashion!)
  22. RNN: depth 1. Forward-in-space propagation.
  23. RNN: depth 2. Forward-in-time propagation.
  24. RNN: depth in space + time. Hence we have two data flows, forward-in-space + forward-in-time propagation: 2 projections per layer activation. The last time-step recursively includes the context of our previous decisions. W, U, V are shared across all steps. $y_t = \mathrm{softmax}(V h_t)$
  25. Alternatives of recurrence. Image src: Goodfellow et al., Deep Learning, The MIT Press
  26. Alternatives of outputs. Image src: Goodfellow et al., Deep Learning, The MIT Press
  27. Training an RNN I: BPTT. Backpropagation through time (BPTT): the training algorithm for updating network weights to minimize the error, including the time dimension.
  28. Remember backpropagation: 1. Present a training input pattern and propagate it through the network to get an output. 2. Compare the predicted outputs to the expected outputs and calculate the error. 3. Calculate the derivatives of the error with respect to the network weights. 4. Adjust the weights to minimize the error. 5. Repeat. (A minimal sketch of these five steps follows.)
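The five steps above, sketched on a toy one-layer softmax classifier trained by plain gradient descent rather than on the RNN itself; every name, size and value here is illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))             # toy training inputs
t = rng.integers(0, 3, size=32)           # toy integer targets (3 classes)
Wc = rng.normal(0, 0.1, (3, 10))          # weights of a one-layer classifier

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(100):
    p = softmax_rows(X @ Wc.T)                        # 1. forward propagate
    E = -np.log(p[np.arange(len(t)), t]).mean()       # 2. compare: cross-entropy error
    dlogits = p.copy()
    dlogits[np.arange(len(t)), t] -= 1.0
    dW = (dlogits / len(t)).T @ X                     # 3. derivatives w.r.t. weights
    Wc -= lr * dW                                     # 4. adjust the weights
                                                      # 5. repeat
print(round(float(E), 3))
```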
  29. BPTT I. Cross-entropy loss, summed over time steps; $y_t = \mathrm{softmax}(V h_t)$
  30. BPTT II. NOTE: our goal is to calculate the gradients of the error with respect to our parameters U, V and W, and then learn good parameters using Stochastic Gradient Descent. Just as we sum up the errors, we also sum up the gradients at each time step for one training example: $\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$
  31. BPTT III ($E_3$ computation). $\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial h_3} \frac{\partial h_3}{\partial W}$, with $h_3 = f(U x_3 + W h_2)$
  32. BPTT IV. $\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial h_3} \frac{\partial h_3}{\partial h_k} \frac{\partial h_k}{\partial W}$
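A sketch of how the sum over $k$ above accumulates in code for the cross-entropy/softmax RNN defined earlier ($h_t = \tanh(U x_t + W h_{t-1})$, $y_t = \mathrm{softmax}(V h_t)$). This follows the usual derivation; function and variable names, and the toy sizes, are illustrative.

```python
import numpy as np

def bptt(xs, targets, U, W, V):
    """Backpropagation through time for one sequence; returns loss and dU, dW, dV."""
    T, H = len(xs), W.shape[0]
    hs, ps = {-1: np.zeros(H)}, {}
    loss = 0.0
    for t in range(T):                                # forward in time
        hs[t] = np.tanh(U @ xs[t] + W @ hs[t - 1])
        z = V @ hs[t]
        ps[t] = np.exp(z - z.max()); ps[t] /= ps[t].sum()
        loss += -np.log(ps[t][targets[t]])            # cross-entropy, summed over t
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(H)
    for t in reversed(range(T)):                      # backward through time
        dy = ps[t].copy(); dy[targets[t]] -= 1.0      # dE_t w.r.t. the logits
        dV += np.outer(dy, hs[t])
        dh = V.T @ dy + dh_next                       # error from output and from step t+1
        dh_raw = (1.0 - hs[t] ** 2) * dh              # back through tanh
        dU += np.outer(dh_raw, xs[t])
        dW += np.outer(dh_raw, hs[t - 1])             # the sum over k builds up here
        dh_next = W.T @ dh_raw
    return loss, dU, dW, dV

# Toy usage with one-hot inputs
H, Dx, Dy = 8, 4, 4
rng = np.random.default_rng(0)
U, W, V = (rng.normal(0, 0.1, s) for s in [(H, Dx), (H, H), (Dy, H)])
xs, targets = list(np.eye(Dx)), [1, 2, 3, 0]
print(bptt(xs, targets, U, W, V)[0])
```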
  33. 4. Vanishing gradient
  34. Example: Maria has studied law and she is a lawyer.
  35. Vanishing gradient I. ● During training, gradients explode or vanish easily because of the depth in time → Exploding/Vanishing gradients!
  36. Vanishing gradient II. For example, $\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial h_3} \frac{\partial h_3}{\partial h_k} \frac{\partial h_k}{\partial W}$, where $\frac{\partial h_3}{\partial h_1} = \frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial h_1}$, so $\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial h_3} \Big( \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}} \Big) \frac{\partial h_k}{\partial W}$
  37. Vanishing gradient III. $\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial h_3} \Big( \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}} \Big) \frac{\partial h_k}{\partial W}$. [Figure: tanh and its derivative. Source: http://nn.readthedocs.org/en/rtd/transfer/]
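A tiny numerical illustration of why the product of Jacobians above shrinks: each factor $\partial h_j / \partial h_{j-1} = \mathrm{diag}(1 - h_j^2)\,W$ contains tanh derivatives bounded by 1, so repeated multiplication drives the gradient norm toward zero. The small random W and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
W = rng.normal(0, 0.1, (H, H))        # small recurrent weights
h = rng.normal(size=H)

grad = np.eye(H)                       # running product of dh_j / dh_{j-1}
for j in range(1, 21):
    h = np.tanh(W @ h)
    jac = np.diag(1.0 - h ** 2) @ W    # Jacobian of h_j w.r.t. h_{j-1}
    grad = jac @ grad
    if j % 5 == 0:
        print(j, np.linalg.norm(grad))  # the norm shrinks as steps accumulate
```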
  38. Standard solutions: proper initialization of the weight matrix; regularization of the outputs or dropout; use of ReLU activations, since their derivative is either 0 or 1.
  39. Quiz
  40. 5. Gating method
  41. Standard RNN. Image src: https://www.nextbigfuture.com/2016/03/recurrent-neural-nets.html, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  42. Long Short-Term Memory (LSTM)
  43. Gating method: 1. Change the way in which past information is kept → create the notion of cell state, a memory unit that keeps long-term information in a safer way by protecting it from recursive operations. 2. Make every RNN unit able to decide whether the current time-step information matters or not, accepting or discarding it (optimized reading mechanism). 3. Make every RNN unit able to forget whatever may not be useful anymore by clearing that info from the cell state (optimized clearing mechanism). 4. Make every RNN unit able to output its decisions whenever it is ready to do so (optimized output mechanism).
  44. Long Short-Term Memory (LSTM) cell. An LSTM cell is defined by two groups of neurons plus the cell state (memory unit): 1. Gates 2. Activation units 3. Cell state. [Figure: computation flow.]
  45. Long Short-Term Memory (LSTM). Three gates, governed by sigmoid units (between [0,1]), control the flow of information in and out of the cell. Figure: Christopher Olah, “Understanding LSTM Networks” (2015). Slide credit: Xavi Giro
  46. Forget Gate. Forget gate, applied to the concatenation of the previous hidden state and the current input: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  47. Forget Gate: Example. LANGUAGE MODELING: “Joan es un chico activo y Anna es una chica calmada” (Joan is an active boy and Anna is a calm girl) → forget the “male” gender. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  48. Input Gate. Input gate layer: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$. New contribution to the cell state (a classic neuron): $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  49. Input Gate: Example. Input gate layer. LANGUAGE MODELING: “Joan es un chico activo y Anna es una chica calmada” → input the “female” gender. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  50. Update Cell State. Update the cell state (memory): $C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  51. Output Gate. Output gate layer: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$. Output to the next layer: $h_t = o_t \ast \tanh(C_t)$. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  52. Output Gate: Example. LANGUAGE MODELING: “Joan es un chico activo y Anna es una chica calmada” → info relevant for a verb?: “female”, “singular”, “3rd person”. Figure: Christopher Olah, “Understanding LSTM Networks” (2015) / Slide: Alberto Montes
  53. LSTM: parameters. An LSTM cell is defined by two groups of neurons plus the cell state (memory unit): 1. Gates 2. Activation units 3. Cell state. In total: 3 gates + the input activation (a minimal LSTM step sketch follows).
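A NumPy sketch of one LSTM time-step using the gate equations listed on the previous slides (Olah's notation: $f_t$, $i_t$, $\tilde{C}_t$, $C_t$, $o_t$, $h_t$); the weight shapes, initialization and names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM time-step: 3 gates + input activation acting on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])      # concatenate previous state and input
    f_t = sigmoid(Wf @ z + bf)             # forget gate: what to erase from C
    i_t = sigmoid(Wi @ z + bi)             # input gate: what to write into C
    C_tilde = np.tanh(Wc @ z + bc)         # candidate (new contribution to cell state)
    C_t = f_t * C_prev + i_t * C_tilde     # update the cell state (memory)
    o_t = sigmoid(Wo @ z + bo)             # output gate: what to expose
    h_t = o_t * np.tanh(C_t)               # output to the next layer / next step
    return h_t, C_t

# Illustrative sizes: input dim 4, hidden dim 8; weights then biases, gate by gate
D, H = 4, 8
rng = np.random.default_rng(0)
params = [rng.normal(0, 0.1, (H, H + D)) if k % 2 == 0 else np.zeros(H)
          for k in range(8)]
h, C = np.zeros(H), np.zeros(H)
for x_t in np.eye(D):
    h, C = lstm_step(x_t, h, C, *params)
print(h.shape, C.shape)                    # (8,) (8,)
```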
  54. Gated Recurrent Unit (GRU). Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” EMNLP 2014. Slide credit: Xavi Giro. (A GRU step sketch follows.)
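For comparison with the LSTM sketch above, one GRU time-step in a common formulation (update gate $z_t$, reset gate $r_t$, no separate cell state). Conventions differ across papers on whether $z_t$ gates the old or the new state; the shapes and names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU time-step: update gate z_t, reset gate r_t, candidate state."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ zx)                                        # update gate
    r_t = sigmoid(Wr @ zx)                                        # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # interpolate old/new

D, H = 4, 8
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(0, 0.1, (H, H + D)) for _ in range(3))
h = np.zeros(H)
for x_t in np.eye(D):
    h = gru_step(x_t, h, Wz, Wr, Wh)
print(h.shape)    # (8,)
```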
  55. Visual Comparison: FNN, Vanilla RNNs and LSTMs. Image src: http://blog.echen.me/2017/05/30/exploring-lstms/
  56. Visual Comparison: FNN, Vanilla RNNs and LSTMs. Image src: http://blog.echen.me/2017/05/30/exploring-lstms/
  57. Visual Comparison: FNN, Vanilla RNNs and LSTMs. Image src: http://blog.echen.me/2017/05/30/exploring-lstms/
  58. Visualization
  59. 6. Other RNN extensions
  60. Bidirectional RNNs: the sequence is processed both forward and backward, so the output at time t can depend on past and future context (a minimal sketch follows). Image src: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
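A sketch of the bidirectional idea: run one recurrent pass left-to-right and another right-to-left, then concatenate the two hidden states at each position. The helper names (rnn_states, bidirectional_states) and sizes are illustrative.

```python
import numpy as np

def rnn_states(xs, U, W):
    """Return the sequence of hidden states h_t = tanh(U x_t + W h_{t-1})."""
    h, states = np.zeros(W.shape[0]), []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h)
        states.append(h)
    return states

def bidirectional_states(xs, fwd, bwd):
    """Concatenate forward states with re-aligned backward states per position."""
    h_fwd = rnn_states(xs, *fwd)
    h_bwd = rnn_states(xs[::-1], *bwd)[::-1]   # backward pass, re-aligned to positions
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]

D, H = 4, 8
rng = np.random.default_rng(0)
fwd = (rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)))
bwd = (rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)))
xs = list(np.eye(D))
print(bidirectional_states(xs, fwd, bwd)[0].shape)   # (16,): past + future context
```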
  61. Deep RNNs. Image src: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
  62. 7. Use Case
  63. Character-level language models. Training on ‘hello’: P(‘he’) = P(e|h), P(‘hel’) = P(l|‘he’), P(‘hell’) = P(l|‘hel’), P(‘hello’) = P(o|‘hell’). All these probabilities should be high.
  64. Character-level language models with RNN. One-hot character encodings: h = [1 0 0 0], e = [0 1 0 0], l = [0 0 1 0], o = [0 0 0 1]. (A data-preparation sketch follows.)
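A sketch of the one-hot encoding above and of how input/target pairs are formed from ‘hello’, in the spirit of the min-char-rnn code referenced later; the names are illustrative, and the index order here comes from sorting the vocabulary, so it differs from the slide's h, e, l, o order.

```python
import numpy as np

text = "hello"
chars = sorted(set(text))                    # ['e', 'h', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(chars)}

def one_hot(c):
    v = np.zeros(len(chars))
    v[char_to_ix[c]] = 1.0                   # e.g. 'h' -> [0, 1, 0, 0] with this ordering
    return v

# Training pairs: each character predicts the next one
inputs  = [one_hot(c) for c in text[:-1]]    # h, e, l, l
targets = [char_to_ix[c] for c in text[1:]]  # e, l, l, o
print(targets)                               # [0, 2, 2, 3]
```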
  65. Character-level language model diagram I. [Figure: hidden layer and output layer.] Image src: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  66. Character-level language model diagram II. [Figure: hidden layer and input layer.] Image src: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  67. Character-level language model diagram III. Image src: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  68. Code available: min-char-rnn.py (https://gist.github.com/karpathy/)
  69. Still ISSUES with RNNs??
  70. Thanks! Q&A?