Recurrent Neural Networks
Daniel Thorngren
Sharath T.S.
Shubhangi Tandon
Motivation
● Translational invariance: the same parameters are shared across timesteps, so a pattern is treated the same way wherever it occurs in the sequence.
● Reading a sequence one symbol at a time produces a very deep computational graph.
(Figure: predicting the character sequence "T O B E O R ..." one step at a time, producing outputs y(0), y(1), y(2), y(3), y(4), ..., y(t))
Computation Graphs
● Different sources use different conventions; this is the one the book uses:
● Nodes are variables.
● Connections are functions.
● A variable is computed using all of the connections pointing towards it.
● Derivatives can be computed by applying the chain rule while working backwards through the graph.
● Not all graphs in the literature follow these rules!
(Figure: example graph with input x, hidden state h, prediction ŷ, target y, loss L, and weights W)
RNN Computation Graphs - Unfolding
Legend: Output = ŷ(t), Loss func. = L, Truth = y(t), Hidden L. = h, Input = x(t); a square on a connection means it connects to the next timestep.
(Figure: the folded recurrent graph on the left and, on the right, the same graph unfolded over timesteps, with x(t), h, ŷ(t), y(t), and L repeated at every step)
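As a concrete illustration of unfolding, here is a minimal numpy sketch of the forward pass of a vanilla RNN. The parameter names (Wxh, Whh, Why) and sizes are illustrative, not taken from the slides; the point is that the same weights are reused at every timestep, and unfolding simply draws one copy of the cell per step.

import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why, bh, by):
    """Unfold a vanilla RNN over a list of input vectors xs."""
    h = h0
    hs, ys = [], []
    for x in xs:                               # one "copy" of the cell per timestep
        h = np.tanh(Wxh @ x + Whh @ h + bh)    # hidden state h(t)
        y_hat = Why @ h + by                   # output ŷ(t)
        hs.append(h)
        ys.append(y_hat)
    return hs, ys

# Tiny usage example with random parameters (hypothetical sizes).
D_in, D_h, D_out, T = 4, 8, 3, 5
rng = np.random.default_rng(0)
xs = [rng.standard_normal(D_in) for _ in range(T)]
params = [rng.standard_normal(s) * 0.1 for s in
          [(D_h, D_in), (D_h, D_h), (D_out, D_h), (D_h,), (D_out,)]]
hs, ys = rnn_forward(xs, np.zeros(D_h), *params)
print(len(ys), ys[0].shape)   # 5 (3,)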
Common Design Patterns
● Standard: produce an output ŷ(t) at every step, with hidden-to-hidden recurrence.
● Output recurrence only: the output at one step is the only information fed back into the next step.
● Output at the end: read the whole sequence x(1) ... x(t) ... x(n), then compute a single output.
(Figure: the three corresponding computation graphs with x, h, ŷ, y, and L)
Training
● Compute forward in time, then work backwards through the unfolded graph for the gradient (backpropagation through time).
○ Exploding/vanishing gradient problem.
● Teacher forcing:
○ At training time, pipe the true outputs y(t) into the hidden layer instead of the model's own outputs ŷ(t); at test time the model feeds back its own predictions (see the sketch below).
(Figure: the teacher-forced graph at training time vs. the free-running graph at test time)
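A minimal sketch of the difference, assuming an output-recurrent cell; the function names and shapes here are illustrative, not from the slides. The only change between training and test time is which value gets fed back.

import numpy as np

def step(prev_y, x, Wyh, Wxh, Why, bh, by):
    """One step of an output-recurrent cell: the previous *output* is fed back."""
    h = np.tanh(Wyh @ prev_y + Wxh @ x + bh)
    return Why @ h + by

def run(xs, ys_true, params, teacher_forcing=True):
    """Teacher forcing feeds back the ground truth y(t); free running feeds back ŷ(t)."""
    prev = np.zeros_like(ys_true[0])
    outputs = []
    for x, y_true in zip(xs, ys_true):
        y_hat = step(prev, x, *params)
        outputs.append(y_hat)
        prev = y_true if teacher_forcing else y_hat   # the only difference
    return outputs

# Hypothetical sizes for illustration.
D_x, D_y, D_h, T = 3, 2, 5, 4
rng = np.random.default_rng(1)
params = (rng.standard_normal((D_h, D_y)) * 0.1, rng.standard_normal((D_h, D_x)) * 0.1,
          rng.standard_normal((D_y, D_h)) * 0.1, np.zeros(D_h), np.zeros(D_y))
xs = [rng.standard_normal(D_x) for _ in range(T)]
ys = [rng.standard_normal(D_y) for _ in range(T)]
train_out = run(xs, ys, params, teacher_forcing=True)    # training time
test_out = run(xs, ys, params, teacher_forcing=False)    # test time: ys used only for length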
Recursive Neural Nets
● Map a sequence to a tree, and reduce the tree one layer at a time until you reach a single node, which is the output (see the sketch below).
● There are many choices of how to arrange the tree.
(Figure: inputs x(1) ... x(4) combined pairwise with shared weights U, W up to a single output ŷ, with target y and loss L)
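A minimal sketch of one possible arrangement, a balanced binary tree over the inputs, with the same combining weights U and W reused at every internal node. The shapes and the pairwise-reduction choice are illustrative assumptions.

import numpy as np

def combine(left, right, U, W, b):
    """Merge two child representations into one parent node (shared weights)."""
    return np.tanh(U @ left + W @ right + b)

def recursive_reduce(xs, U, W, b):
    """Reduce the sequence pairwise, layer by layer, until one vector remains."""
    layer = list(xs)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(combine(layer[i], layer[i + 1], U, W, b))
        if len(layer) % 2 == 1:          # an odd leftover node is carried up unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

D = 6
rng = np.random.default_rng(2)
U, W, b = rng.standard_normal((D, D)) * 0.1, rng.standard_normal((D, D)) * 0.1, np.zeros(D)
xs = [rng.standard_normal(D) for _ in range(4)]   # x(1) ... x(4)
root = recursive_reduce(xs, U, W, b)              # single output representation
print(root.shape)                                 # (6,)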
Deep Recurrent Neural Nets
Depth can be added at any of the stages mentioned:
● Multiple recurrent layers stacked on top of one another (h1, h2).
● Extra processing layers in the input-to-hidden, hidden-to-hidden, and hidden-to-output paths.
● Plus direct hidden-to-hidden (skip) connections to keep the path between timesteps short.
(Figure: the three deep variants, each with inputs x, hidden layers h, and outputs ŷ)
What makes Recurrent Networks so special? Sequences!
(1) Vanilla mode without an RNN, fixed-size input to fixed-size output (e.g. image classification).
(2) Sequence output (e.g. image captioning).
(3) Sequence input (e.g. sentiment analysis).
(4) Sequence input and sequence output (e.g. machine translation).
(5) Synced sequence input and output (e.g. video classification, labelling each frame).
The unreasonable effectiveness of RNNs
- Character-level language model
- An LSTM trained on Leo Tolstoy's War and Peace
- Sample outputs after 100, 300, 700 and 2,000 training iterations
Challenges of Vanishing and Exploding Gradients
- Hidden-state recurrence relation, analysed with the power method: repeatedly applying h(t) = Wᵀ h(t-1) means the largest eigenvalues of W come to dominate.
- The spectral radius of the recurrent weight matrix therefore makes the gradient explode (if above 1) or vanish (if below 1); see the numerical sketch below.
- Variance multiplies at every cell (timestep).
- For feed-forward networks of fixed size (depth n):
  - to obtain some desired variance v*, choose the individual weights with variance v = ⁿ√v*;
  - such carefully chosen scaling can avoid the vanishing and exploding gradient problem.
- For RNNs, the same weights are reused at every step, so this trick is unavailable and long-term dependencies cannot be captured effectively.
- The gradient of a long-term interaction has exponentially smaller magnitude than that of a short-term interaction.
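A small numerical sketch of the power-method intuition (purely illustrative, not from the slides): repeated multiplication by the same recurrent matrix makes a backpropagated vector's norm grow or shrink geometrically with the spectral radius.

import numpy as np

rng = np.random.default_rng(0)
H, T = 32, 50
W = rng.standard_normal((H, H)) / np.sqrt(H)           # random recurrent matrix
radius = np.max(np.abs(np.linalg.eigvals(W)))          # its spectral radius

for scale in (0.5, 1.5):                               # spectral radius < 1 vs > 1
    Ws = W * (scale / radius)                          # rescale to the target spectral radius
    g = rng.standard_normal(H)                         # stand-in for a backpropagated gradient
    for _ in range(T):
        g = Ws.T @ g                                   # one backprop step through h(t) = W h(t-1)
    print(f"spectral radius {scale}: |grad| after {T} steps = {np.linalg.norm(g):.3e}")
# Typically prints a vanishingly small norm for 0.5 and a huge norm for 1.5.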
- After a forward pass, the derivatives (slopes) of the non-linearities are fixed.
- Backpropagation is then like going backwards through a linear system in which the slopes of the non-linearities have been fixed, so gradients can shrink or grow without bound.
Loss function of a char-level RNN
# Adapted from Karpathy's min-char-rnn (see references); assumes the model
# parameters Wxh, Whh, Why, bh, by and vocab_size are defined at module level.
import numpy as np

def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers (character indices).
  hprev is an Hx1 array holding the initial hidden state.
  Returns the loss, gradients on model parameters, and the last hidden state.
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # hidden state
    ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
    loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy) loss
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1  # backprop into y; see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext  # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through the tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
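For context, a minimal sketch of how this loss function might be driven. The filename, hyperparameters, and the plain-SGD update below are illustrative choices, not from the slides (Karpathy's full min-char-rnn script uses Adagrad).

data = open('input.txt').read()                  # any plain-text corpus (filename is illustrative)
chars = sorted(set(data))
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}

hidden_size, seq_length, learning_rate = 100, 25, 1e-1
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01
Whh = np.random.randn(hidden_size, hidden_size) * 0.01
Why = np.random.randn(vocab_size, hidden_size) * 0.01
bh, by = np.zeros((hidden_size, 1)), np.zeros((vocab_size, 1))

hprev = np.zeros((hidden_size, 1))
for p in range(0, len(data) - seq_length - 1, seq_length):
    inputs = [char_to_ix[c] for c in data[p:p + seq_length]]
    targets = [char_to_ix[c] for c in data[p + 1:p + seq_length + 1]]
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    for param, dparam in zip([Wxh, Whh, Why, bh, by],
                             [dWxh, dWhh, dWhy, dbh, dby]):
        param -= learning_rate * dparam          # plain SGD step (min-char-rnn uses Adagrad)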
Understanding gradient flow dynamics
How to overcome gradient issues?
Remedial strategies #1
- Gradient clipping for exploding gradients (see the sketch after this list).
- Skip connections through time
  - Integer-valued skip length d: add direct connections from variables d steps in the past.
  - Analogous in spirit to the skip connections of ResNet.
- Leaky units
  - Linear self-connections allow the effects of remembering and forgetting to be adapted more smoothly and flexibly by adjusting a real-valued α, rather than by adjusting an integer-valued skip length.
  - α can be sampled from a distribution or learnt.
- Removing connections
  - Replace some length-one connections with longer ones, so units learn to interact with both far-off and nearby timesteps.
  - Explicit and discrete updates take place at different times, with a different frequency for different groups of units.
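Two of these remedies in a minimal numpy sketch; the threshold value and α below are illustrative. Norm-based gradient clipping rescales the whole gradient when its norm exceeds a threshold, and a leaky unit mixes the new state with the old one through a real-valued α.

import numpy as np

def clip_by_norm(grads, threshold=5.0):
    """Rescale the full gradient if its global norm exceeds the threshold."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

def leaky_update(h_prev, candidate, alpha=0.9):
    """Leaky unit: a linear self-connection with real-valued coefficient alpha.

    alpha close to 1 remembers the past for a long time; alpha close to 0
    follows the new candidate state almost immediately.
    """
    return alpha * h_prev + (1.0 - alpha) * candidate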
Remedial strategies #2
- Regularization to maintain information flow
  - Require the gradient being backpropagated at any time step t to keep roughly the same magnitude as the gradient of the loss at the final time step.
  - Pascanu et al. (2013) regularizer:

    Ω = Σ_t ( ‖(∂L/∂h(t)) · (∂h(t)/∂h(t-1))‖ / ‖∂L/∂h(t)‖ − 1 )²

  - For easy gradient computation, ∂L/∂h(t) is treated as a constant.
  - Doesn't perform as well as leaky units when data is abundant.
  - Perhaps because the constant-gradient approximation doesn't scale well.
Echo State Networks
- Recurrent and input weights are fixed; only the output (readout) weights are learned (see the sketch below).
- Relies on the idea that a big, random expansion of the input history can often make it easy for a linear model to fit the data.
- Fix the recurrent weights to have some spectral radius such as 3; the dynamics do not explode, thanks to the stabilizing effect of saturating nonlinearities like tanh.
- Sparse connectivity: very few non-zero values in the hidden-to-hidden weights.
  - This creates loosely coupled oscillators, so information can hang around in one part of the network.
- It is important to choose the scale of the input-to-hidden connections: they need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information those oscillators carry about the recent history.
- ESN-style weights have also been used to initialize fully trainable recurrent networks (Sutskever, 2012; Sutskever et al., 2013).
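A minimal echo-state sketch along these lines; the sizes, sparsity, spectral radius, and ridge-regression readout are illustrative choices, not prescribed by the slides.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 200, 500

# Sparse random reservoir, rescaled to a chosen spectral radius; never trained.
W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < 0.05)
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))           # fixed input scaling

u = np.sin(np.linspace(0, 20 * np.pi, T + 1))          # toy signal: predict the next value
states = np.zeros((T, n_res))
h = np.zeros(n_res)
for t in range(T):
    h = np.tanh(W @ h + W_in @ u[t:t + 1])             # reservoir update (fixed weights)
    states[t] = h

# Only the linear readout is learned, here with ridge regression.
targets = u[1:T + 1]
reg = 1e-6
W_out = np.linalg.solve(states.T @ states + reg * np.eye(n_res), states.T @ targets)
pred = states @ W_out
print("train MSE:", np.mean((pred - targets) ** 2))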
LSTMs
Unrolled RNN
- The repeating module in a standard RNN contains a single layer.
- The repeating module in an LSTM contains four interacting layers.
LSTMs
- Adding even more structure.
- LSTM: an RNN cell with four gates that control how information is retained.
- The input value can be accumulated into the state if the sigmoidal input gate allows it.
- The cell state unit has a linear self-loop whose weight is controlled by the forget gate.
- The output of the cell can be shut off by the output gate.
- All the gating units have a sigmoid nonlinearity, while the 'g' gate can use any squashing nonlinearity.
- The i and g gates interact multiplicatively:
  - g: what value between -1 and 1 should be added to the cell state.
  - i: whether to go through with the operation.
- Forget gate: can kill gradients in the LSTM if set to zero. Initialize its bias to 1 at the start so gradients flow nicely and the LSTM learns to shut or open the gate whenever it wants.
- The state unit can also be used as an extra input to the gating units (peephole connections).
LSTM - Equations
Forget, input, update, and output steps (standard formulation shown below).
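The standard LSTM equations, following the formulation in the referenced colah post; W_* and b_* are the per-gate weights and biases, [h(t-1), x(t)] is concatenation, and ⊙ is elementwise multiplication.

Forget gate:   f(t) = σ( W_f · [h(t-1), x(t)] + b_f )
Input gate:    i(t) = σ( W_i · [h(t-1), x(t)] + b_i )
Candidate:     g(t) = tanh( W_g · [h(t-1), x(t)] + b_g )
Cell update:   C(t) = f(t) ⊙ C(t-1) + i(t) ⊙ g(t)
Output gate:   o(t) = σ( W_o · [h(t-1), x(t)] + b_o )
Hidden state:  h(t) = o(t) ⊙ tanh( C(t) )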
Gated Recurrent Units
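For comparison, the commonly used GRU equations (Cho et al., 2014); note that some papers swap the roles of z and 1 − z. z is the update gate and r the reset gate.

Update gate:   z(t) = σ( W_z · [h(t-1), x(t)] + b_z )
Reset gate:    r(t) = σ( W_r · [h(t-1), x(t)] + b_r )
Candidate:     h̃(t) = tanh( W_h · [r(t) ⊙ h(t-1), x(t)] + b_h )
Hidden state:  h(t) = (1 − z(t)) ⊙ h(t-1) + z(t) ⊙ h̃(t)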
LSTM: A Search Space Odyssey
- 2015 paper by Greff et al.
- Compares 8 variants of the LSTM architecture:
  - GRU-like coupling of gates
  - Without peephole connections
  - Without the output gate
  - Without non-linearities at the output and forget gates, etc.
- Trained for 5,200 iterations, using over 15 CPU-years of compute
- No major improvement in results: the classic LSTM architecture works as well as the other versions
Encoder-Decoder Frameworks: Seq2Seq
Sequence to Sequence with Attention - NMT
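The slides illustrate this with figures only. As a brief reminder, additive (Bahdanau-style) attention, as in the referenced arXiv:1409.0473, computes the decoder context at step t from the encoder states h_1 ... h_T:

Alignment score:   e(t, j) = a( s(t-1), h_j )          (a is a small feed-forward network)
Attention weight:  α(t, j) = exp(e(t, j)) / Σ_k exp(e(t, k))
Context vector:    c(t)    = Σ_j α(t, j) · h_j
Decoder state:     s(t)    = f( s(t-1), y(t-1), c(t) )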
Explicit Memory
● Motivation
○ Some knowledge is implicit, subconscious, and difficult to verbalize
■ Ex: how a dog looks different from a cat.
○ Other knowledge is explicit, declarative, and straightforward to put into words
■ Ex: everyday commonsense knowledge -> "a cat is a kind of animal"
■ Ex: very specific facts -> "the meeting with the sales team is at 3:00 PM, room 141."
○ Neural networks excel at storing implicit knowledge but struggle to memorize facts
■ SGD requires a sample to be presented several times before a network memorizes it, and even then not precisely (Graves et al., 2014b)
○ Explicit memory allows systems to rapidly and intentionally store and retrieve specific facts, and to reason with them sequentially.
Memory Networks
● Memory networks include a set of memory cells that can be accessed via an addressing mechanism.
○ Originally required a supervision signal instructing them how to use their memory cells (Weston et al., 2014)
○ Graves et al. (2014b) introduced Neural Turing Machines (NTMs), which
■ are able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake
■ allow end-to-end training using a content-based soft attention mechanism (Bahdanau et al., 2015)
Memory Networks
● Soft addressing (content-based)
○ Each cell stores a vector; the weight used to read from or write to a cell is a function of that cell's content.
■ The weights can be produced with a softmax over all cells.
○ A vector-valued memory can be retrieved completely even when we can only produce a pattern that matches some, but not all, of its elements (see the sketch below).
● Hard addressing (location-based)
○ Output a discrete memory location, or treat the weights as probabilities and choose a single cell to read from or write to.
○ Requires specialized optimization algorithms.
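A minimal sketch of content-based soft addressing. The cosine-similarity scoring and the sharpness parameter beta are common choices assumed here, not taken from the slides.

import numpy as np

def soft_read(memory, key, beta=5.0):
    """Read from every cell at once, weighted by content similarity to the key."""
    # Cosine similarity between the key and each memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    logits = beta * sims
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                      # softmax across all cells
    return weights @ memory, weights              # weighted sum of cell contents

rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 16))             # 8 cells, each a 16-d vector
partial = memory[3].copy()
partial[8:] = 0                                   # a key matching only some elements of cell 3
value, w = soft_read(memory, partial)
print(w.argmax())                                 # usually 3: a partial match retrieves the full vector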
Memory Networks
Resources and References
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://www.youtube.com/watch?v=iX5V1WpxxkY&list=PL16j5WbGpaM0_Tj8CRmurZ8Kk1gEBc7fg&index=10
- https://www.coursera.org/learn/neural-networks/home/week/7
- https://www.coursera.org/learn/neural-networks/home/week/8
- http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf
- https://arxiv.org/abs/1409.0473
- https://arxiv.org/pdf/1308.0850.pdf
- https://arxiv.org/pdf/1503.04069.pdf
Thank you !!
Optimisation for Long-Term Dependencies
- Problem
  - Whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction.
  - This does not mean it is impossible to learn, but that it might take a very long time to learn long-term dependencies.
  - Gradient-based optimization becomes increasingly difficult, with the probability of successful training reaching 0 for sequences of only length 10 or 20.
- Leaky units & multiple time scales
  - Skip connections through time.
  - Leaky units: the linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting a real-valued α rather than by adjusting an integer-valued skip length.
  - Removing connections.
- Gradient clipping