Recurrent Neural Networks
Daniel Thorngren
Sharath T.S.
Shubhangi Tandon
Motivation
● Translational Invariance
● Deep computational graph:
(Figure: the character sequence T, O, B, E, O, R, ... fed in one symbol per step, with an output y(0), y(1), y(2), ..., y(t) produced at each step.)
Computation Graphs
● Different sources use different systems.
This is the system the book uses.
● Nodes are variables
● Connections are functions
● A variable is computed using all of the
connections pointing towards it
● Can compute derivatives by applying
chain rule, working backwards through
the graph.
NOT ALL GRAPHS FOLLOW THESE RULES!
>:(
(Figure: an example graph with input x, hidden layer h, output y, target yt, loss L, and weight matrix W.)
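As a concrete illustration of working backwards through such a graph (an illustrative addition, not part of the slide), here is a minimal numpy sketch of the chain rule on a tiny x -> h -> y -> L graph; the scalar weights W and V and all values are made up.

import numpy as np

W, V = 0.5, -1.2            # illustrative scalar weights on the two connections
x, yt = 2.0, 1.0            # input and target (truth)

h = np.tanh(W * x)          # hidden variable, computed from the connection pointing at it
y = V * h                   # output variable
L = 0.5 * (y - yt) ** 2     # loss variable

# Work backwards through the graph, one connection (function) at a time.
dL_dy = y - yt
dL_dh = dL_dy * V
dL_dW = dL_dh * (1 - h ** 2) * x   # chain rule through tanh down to the weight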
RNN Computation Graphs - Unfolding
Legend: y = output, L = loss function, yt = truth, h = hidden layer, x = input; a square on a connection means it feeds the next timestep.
(Figure: the folded graph, with a recurrent loop on h, unfolds into a chain of hidden states h receiving inputs x(0), x(1), x(2), ..., x(t) and producing outputs y(t) that are compared with targets yt(1), yt(2), yt(3), ..., yt(t) under losses L.)
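To make the unfolded computation concrete (an illustrative addition, not part of the slide), here is a minimal numpy sketch of the vanilla recurrence the figure depicts; the weight names follow the Wxh/Whh/Why convention of the loss-function code later in the deck.

import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why, bh, by):
    """Unfold h(t) = tanh(Wxh x(t) + Whh h(t-1) + bh), y(t) = Why h(t) + by over the sequence."""
    h, ys = h0, []
    for x in xs:                              # one iteration per unfolded timestep
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # the same weights are reused at every step
        ys.append(Why @ h + by)
    return h, ys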
Common Design Patterns
● Standard: hidden-to-hidden recurrence with an output (and loss) at every step
● Output recurrence only: the previous output is the only information fed back into the next hidden state
● Single output: the output is computed once, at the end of the sequence
(Figure: the three unrolled graphs, with inputs x(1), x(2), ..., x(t), ..., x(n), hidden units h, outputs y, targets yt, and losses L.)
Training
● Compute forward in time, work backwards for gradient.
○ Exploding/vanishing gradient problem
● Teacher forcing:
○ At training time, pipe the true outputs into the hidden layer instead of the model's own outputs (sketch after the figure caption below)
(Figure: two unrolled graphs. At training time, the true targets yt(t), yt(t+1) are fed into the next hidden state and losses L are computed against the outputs y(t), y(t+1); at test time, the model's own outputs are fed back instead.)
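A minimal sketch of the two modes (an illustrative addition; step(x, prev_out, h) stands in for whatever single-step cell is used):

def unroll(xs, targets, h, step, teacher_forcing=True):
    """step(x, prev_out, h) -> (y, h); prev_out is the output signal fed back into the cell."""
    prev, ys = None, []
    for x, yt in zip(xs, targets):
        y, h = step(x, prev, h)
        ys.append(y)
        prev = yt if teacher_forcing else y   # truth at training time, model output at test time
    return ys, h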
Recursive Neural Nets
● Map a sequence to a tree, and reduce
the tree one layer at a time until you
reach a single point, your output
● Many choices of how to arrange the
tree.
(Figure: inputs x(1), x(2), x(3), x(4) combined pairwise with shared weights U and W up a tree to a single output y, compared with the target yt under loss L.)
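A minimal sketch of one possible arrangement, a balanced pairwise reduction with the shared weights U and W from the figure (an illustrative addition; the tree could equally be arranged by a parser or learned):

import numpy as np

def reduce_tree(xs, U, W, b):
    """Combine adjacent nodes with the same weights until a single vector remains."""
    nodes = list(xs)
    while len(nodes) > 1:
        nxt = [np.tanh(U @ nodes[i] + W @ nodes[i + 1] + b)
               for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:            # carry an unpaired node up to the next layer
            nxt.append(nodes[-1])
        nodes = nxt
    return nodes[0]                   # the single output representation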
Deep Recurrent Neural Nets
Can add depth to any of the stages mentioned:
● Multiple recurrent layers stacked on top of each other (hidden states h1, h2)
● Extra input, output, and hidden-layer processing
● The same, plus direct hidden-to-hidden connections
(Figure: the three architectures drawn with inputs x, hidden layers h / h1, h2, and outputs yt.)
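A minimal sketch of the first pattern, stacking recurrent layers so each layer's state feeds the one above (an illustrative addition; parameter names are made up):

import numpy as np

def deep_rnn_step(x, hs, params):
    """One timestep through a stack of recurrent layers.
    hs: per-layer hidden states; params: per-layer (Wxh, Whh, bh) tuples."""
    inp, new_hs = x, []
    for h, (Wxh, Whh, bh) in zip(hs, params):
        h = np.tanh(Wxh @ inp + Whh @ h + bh)
        new_hs.append(h)
        inp = h                       # this layer's new state is the next layer's input
    return new_hs                     # the top state would drive the output layer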
What makes Recurrent Networks so special? Sequences!
(1) Vanilla mode without an RNN: fixed-size input to fixed-size output (e.g. image classification).
(2) Sequence output (e.g. image captioning).
(3) Sequence input (e.g. sentiment analysis).
(4) Sequence input and sequence output (e.g. machine translation).
(5) Synced sequence input and output (e.g. video classification, labeling each frame).
The unreasonable effectiveness of RNNs
- Character-level language model
- An LSTM trained on Leo Tolstoy's War and Peace
- Sampled outputs after 100, 300, 700, and 2000 training iterations
Challenges of Vanishing and Exploding Gradients
Hidden-state recurrence relation, analysed with the power method:
- A spectral radius of the recurrent weight matrix greater than 1 makes the gradient explode; smaller than 1 makes it vanish.
- The variance is multiplied at every cell (i.e. at every timestep).
- For feed-forward networks of fixed size (depth n):
  - to obtain some desired variance v*, choose the individual weights with variance v = v*^(1/n) (the n-th root of v*);
  - such carefully chosen scaling can avoid the vanishing and exploding gradient problem.
- For RNNs the same weights are reused at every timestep, so no single choice of scale works for all sequence lengths; this means we cannot effectively capture long-term dependencies.
- The gradient of a long-term interaction has exponentially smaller magnitude than that of a short-term interaction.
- After a forward pass, the slopes of the non-linearities are fixed.
- Back-propagation is then like going forwards through a linear system in which the slope of each non-linearity has been fixed.
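A minimal numpy demonstration of the spectral-radius claim (an illustrative addition, not from the slides): repeatedly applying the same recurrent matrix shrinks or blows up a state perturbation depending on whether its spectral radius is below or above 1.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)) / np.sqrt(32)      # a random recurrent matrix
radius = np.abs(np.linalg.eigvals(W)).max()

for scale in (0.5, 1.5):                             # spectral radius below / above 1
    Ws = W * (scale / radius)
    delta = rng.standard_normal(32)                  # a perturbation of the hidden state
    for _ in range(50):                              # 50 "timesteps" of the linearised recurrence
        delta = Ws @ delta
    print(scale, np.linalg.norm(delta))              # vanishes for 0.5, explodes for 1.5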
Loss function of a char-level RNN
# Wxh, Whh, Why, bh, by and vocab_size are module-level model parameters.
def lossFun(inputs, targets, hprev):
  """
  inputs and targets are both lists of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t], 0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y; see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
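A hedged sketch of how this loss function might be driven (not part of the original deck): a plain gradient-descent loop over consecutive character windows. Here data, char_to_ix, hidden_size, and the learning rate are assumed to exist; the original min-char-rnn script uses Adagrad rather than plain SGD.

# Illustrative training loop: `data` is the training text, `char_to_ix` maps characters to indices.
seq_length, learning_rate = 25, 1e-1
hprev = np.zeros((hidden_size, 1))
for p in range(0, len(data) - seq_length - 1, seq_length):
    inputs  = [char_to_ix[ch] for ch in data[p:p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    for param, dparam in zip([Wxh, Whh, Why, bh, by], [dWxh, dWhh, dWhy, dbh, dby]):
        param += -learning_rate * dparam      # plain SGD step (clipping already done inside lossFun)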
Understanding gradient flow dynamics
How to overcome gradient issues?
Remedial strategies #1
- Gradient clipping for exploding gradients
- Skip connections through time
  - Integer-valued skip length
  - Example: ResNet uses analogous skip connections across depth
- Leaky units (sketched after this list)
  - A linear self-connection lets the balance between remembering and forgetting be adapted more smoothly and flexibly by adjusting a real-valued α, rather than an integer-valued skip length.
  - α can be sampled from a distribution or learned.
- Removing connections
  - The network learns to interact with both far-off and nearby timesteps.
  - Explicit, discrete updates take place at different times, with a different frequency for different groups of units.
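A minimal sketch of a leaky unit, plus norm-based gradient clipping as an alternative to the element-wise np.clip used in lossFun (an illustrative addition; α is the retention coefficient of the linear self-connection):

import numpy as np

def leaky_update(mu, new_value, alpha):
    """Linear self-connection: alpha near 1 remembers far back, alpha near 0 forgets quickly."""
    return alpha * mu + (1.0 - alpha) * new_value

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together if their joint norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads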
Remedial strategies #2
- Regularization to maintain information flow
  - Require the gradient being back-propagated at any time step t to be similar in magnitude to the gradient of the loss at the very last step, e.g. with a penalty of the form
    Ω = Σ_t ( ‖(∂L/∂h(t)) (∂h(t)/∂h(t−1))‖ / ‖∂L/∂h(t)‖ − 1 )²
  - For easy gradient computation, the back-propagated gradient ∂L/∂h(t) is treated as a constant.
  - Doesn't perform as well as leaky units when data is abundant,
  - perhaps because the constant-gradient approximation doesn't scale well.
Echo State Networks
- The recurrent and input weights are fixed; only the output weights are learned.
- Relies on the idea that a big, random expansion of the input vector can often make it easy for a linear model to fit the data.
- Fix the recurrent weights to have some spectral radius such as 3; the state does not explode, thanks to the stabilizing effect of saturating nonlinearities like tanh.
- Sparse connectivity: very few non-zero values in the hidden-to-hidden weights.
  - This creates loosely coupled oscillators, so information can hang around in a particular part of the network.
- It is important to choose the scale of the input-to-hidden connections: they need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information those oscillators contain about the recent history.
- Echo-state machinery can also be used to initialize the weights of a fully trainable recurrent network (Sutskever, 2012; Sutskever et al., 2013).
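A minimal echo-state sketch under the ideas above (an illustrative addition: random sparse recurrent weights rescaled to a chosen spectral radius, with only a linear readout trained, here by ridge regression); all sizes and names are made up.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 3, 200

Win = rng.uniform(-0.5, 0.5, (n_res, n_in))          # fixed input-to-hidden weights
W = rng.standard_normal((n_res, n_res))
W *= rng.random((n_res, n_res)) < 0.05               # sparse hidden-to-hidden weights
W *= 1.2 / np.abs(np.linalg.eigvals(W)).max()        # rescale to a chosen spectral radius

def run_reservoir(xs):
    h, states = np.zeros(n_res), []
    for x in xs:
        h = np.tanh(Win @ x + W @ h)                 # fixed, untrained recurrence
        states.append(h.copy())
    return np.array(states)                          # shape (T, n_res)

def fit_readout(states, targets, lam=1e-6):
    """Ridge regression for the only trainable weights: Wout = Y^T S (S^T S + lam I)^-1."""
    S = states                                       # (T, n_res); targets: (T, n_out)
    return targets.T @ S @ np.linalg.inv(S.T @ S + lam * np.eye(n_res))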
LSTMs
Unrolled RNN
The repeating module in a standard RNN contains
a single layer.
The repeating module in an LSTM contains four
interacting layers.
LSTMs
- Adding even more structure
- LSTM: an RNN cell with four gates that control how information is retained
  - The input value can be accumulated into the state if the sigmoidal input gate allows it.
  - The cell state unit has a linear self-loop whose weight is controlled by the forget gate.
  - The output of the cell can be shut off by the output gate.
- All the gating units have a sigmoid nonlinearity, while the 'g' gate can have any squashing nonlinearity.
- The i and g gates combine through a multiplicative interaction:
  - g: what value between -1 and 1 should be added to the cell state,
  - i: whether to go through with that operation.
- Forget gate: can kill gradients in the LSTM if set to zero. Initialize it to 1 at the start so gradients flow nicely, and the LSTM learns to shut or open it whenever it wants.
- The state unit can also be used as an extra input to the gating units (peephole connections).
LSTM - Equations
(Figure: the forget, input, cell-update, and output equations.)
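Because the equations on this slide were an image, here is a minimal numpy sketch of one LSTM step using the standard gate equations (a reconstruction, not the slide's exact notation; the weight names are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    """One LSTM timestep; z concatenates the previous hidden state and the current input."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)        # forget gate: how much of c_prev to keep
    i = sigmoid(Wi @ z + bi)        # input gate: whether to write the candidate
    g = np.tanh(Wg @ z + bg)        # candidate value in [-1, 1] to add to the cell state
    o = sigmoid(Wo @ z + bo)        # output gate: whether to expose the state
    c = f * c_prev + i * g          # the linear self-loop on the cell state
    h = o * np.tanh(c)
    return h, c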
Gated Recurrent Units
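The GRU slide was likewise a diagram; as a hedged reconstruction, here is the standard GRU update (in the usual formulation of Cho et al., 2014), with made-up weight names:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU timestep: update gate z, reset gate r, candidate state h_tilde."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)                              # update gate
    r = sigmoid(Wr @ hx + br)                              # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)
    return (1 - z) * h_prev + z * h_tilde                  # interpolate old and new state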
LSTM: A Search Space Odyssey
- 2015 paper by Greff et al.
- Compares eight variants of the LSTM architecture:
  - GRUs
  - without peephole connections
  - without the output gate
  - without non-linearities at the output and forget gate, etc.
- Trained for 5200 iterations, using over 15 CPU-years in total
- No major improvement was seen: the classic LSTM architecture works about as well as the other versions.
Encoder-Decoder Frameworks: Seq2Seq
Sequence to Sequence with Attention - Neural Machine Translation (NMT)
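Both of these slides were diagrams; as a hedged illustration (not the slides' exact model), here is how additive attention in the spirit of Bahdanau et al. (2015), cited in the references, combines a decoder state with the encoder states; Wa, Ua, va are made-up parameter names:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s, H, Wa, Ua, va):
    """s: decoder state (d,); H: encoder states (T, d). Returns attention weights and context."""
    scores = np.array([va @ np.tanh(Wa @ s + Ua @ h) for h in H])   # one score per source position
    alpha = softmax(scores)                                          # weights sum to 1
    return alpha, alpha @ H                                          # context = weighted sum of encoder states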
Explicit Memory
● Motivation
○ Some knowledge is implicit, subconscious, and difficult to verbalize
■ Ex: how a dog looks different from a cat.
○ Other knowledge is explicit, declarative, and straightforward to put into words
■ Ex: everyday commonsense knowledge -> a cat is a kind of animal
■ Ex: very specific facts -> "the meeting with the sales team is at 3:00 PM, room 141."
○ Neural networks excel at storing implicit knowledge but struggle to memorize facts
■ SGD requires a sample to be presented several times before a network memorizes it, and even then not precisely (Graves et al., 2014b).
○ Explicit memory lets systems rapidly and intentionally store and retrieve specific facts, and reason with them sequentially.
Memory Networks
● Memory networks include a set of memory cells that can be accessed via an addressing mechanism.
○ They originally required a supervision signal instructing them how to use their memory cells (Weston et al., 2014).
○ Graves et al. (2014b) introduced Neural Turing Machines (NTMs), which
■ are able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake, and
■ allow end-to-end training using a content-based soft attention mechanism (Bahdanau et al., 2015).
Memory Networks
● Soft addressing (content-based)
○ Each cell state is a vector; the weight used to read from or write to a cell is a function of that cell's content.
■ The weights can be produced with a softmax across all cells.
○ Allows retrieving a complete vector-valued memory even when we can only produce a pattern that matches some but not all of its elements.
● Hard addressing (location-based)
○ Output a discrete memory location, or treat the weights as probabilities and choose a particular cell to read from or write to.
○ Requires specialized optimization algorithms.
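A minimal sketch of content-based soft addressing (an illustrative addition; the cosine similarity measure and sharpness beta are assumptions): score each memory cell against a query key, softmax the scores into weights, and read a weighted mixture.

import numpy as np

def soft_read(memory, key, beta=5.0):
    """memory: (N, d) cell contents; key: (d,) query. Returns weights and the read vector."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()                      # softmax over all cells: soft, differentiable addressing
    return w, w @ memory              # partial matches still retrieve full stored vectors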
Memory Networks
Resources and References
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://www.youtube.com/watch?v=iX5V1WpxxkY&list=PL16j5WbGpaM0_Tj8CRmurZ8Kk1gEBc7fg&index=10
- https://www.coursera.org/learn/neural-networks/home/week/7
- https://www.coursera.org/learn/neural-networks/home/week/8
- http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf
- https://arxiv.org/abs/1409.0473
- https://arxiv.org/pdf/1308.0850.pdf
- https://arxiv.org/pdf/1503.04069.pdf
Thank you !!
Optimisation for Long-term Dependencies
- Problem
  - Whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction.
  - This does not mean it is impossible to learn, but it might take a very long time to learn long-term dependencies.
  - Gradient-based optimization becomes increasingly difficult, with the probability of successful training reaching 0 for sequences of only length 10 or 20.
- Leaky units & multiple time scales
  - Skip connections through time
  - Leaky units: the linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than the integer-valued skip length.
  - Removing connections
- Gradient clipping