This document discusses issues with recurrent neural networks (RNNs) and advanced RNN architectures. It begins by explaining the vanishing gradient problem, where the gradient signal decays exponentially with the number of time steps, preventing RNNs from learning long-term dependencies, and the related exploding gradient problem, which is addressed with gradient clipping. Two solutions to vanishing gradients are introduced: long short-term memory (LSTM) networks, which add gating mechanisms to better preserve error signals, and gated recurrent units (GRUs), a simpler variant of the LSTM. Bidirectional RNNs, which process the input sequence in both directions, and multi-layer (stacked) RNNs are also covered. The document aims to provide intuition on these advanced RNN architectures and how they address the limitations of basic RNNs.
7. Vanishing gradient intuition
In backpropagation through time, the gradient of the loss with respect to an early hidden state is a product of many per-step Jacobians ∂h^(t)/∂h^(t−1) (chain rule).
What happens if these factors are small?
Vanishing gradient problem: when these factors are small, the gradient signal gets smaller and smaller as it backpropagates further.
8. Vanishing gradient proof sketch (linear case)
• Recall: h^(t) = σ(W_h h^(t−1) + W_x x^(t) + b_1)
• What if σ were the identity function, σ(x) = x?
  ∂h^(t)/∂h^(t−1) = diag(σ′(W_h h^(t−1) + W_x x^(t) + b_1)) W_h = I · W_h = W_h    (chain rule)
• Consider the gradient of the loss J^(i)(θ) on step i, with respect to the hidden state h^(j) on some previous step j. Let ℓ = i − j.
  ∂J^(i)(θ)/∂h^(j) = ∂J^(i)(θ)/∂h^(i) ∏_{j<t≤i} ∂h^(t)/∂h^(t−1)    (chain rule)
                   = ∂J^(i)(θ)/∂h^(i) ∏_{j<t≤i} W_h = ∂J^(i)(θ)/∂h^(i) W_h^ℓ    (value of ∂h^(t)/∂h^(t−1))
• If W_h is "small", then the term W_h^ℓ gets exponentially problematic as ℓ becomes large.
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
(and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
9. Vanishing gradient proof sketch (linear case)
• What's wrong with W_h^ℓ?
• Consider if the eigenvalues of W_h are all less than 1:
  λ_1, λ_2, …, λ_n < 1 (sufficient but not necessary), with eigenvectors q_1, q_2, …, q_n
• We can write ∂J^(i)(θ)/∂h^(j) using the eigenvectors of W_h as a basis:
  ∂J^(i)(θ)/∂h^(j) = ∑_{k=1}^{n} c_k λ_k^ℓ q_k ≈ 0    (for large ℓ)
  Each λ_k^ℓ approaches 0 as ℓ grows, so the gradient vanishes (a small numeric illustration follows below).
• What about nonlinear activations σ (i.e., what we actually use)?
• Pretty much the same thing, except the proof requires λ_k < γ for some γ dependent on dimensionality and σ.
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
(and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
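As an aside (not part of the original slides), here is a minimal NumPy sketch illustrating the proof sketch above: with a hypothetical recurrent matrix W_h whose eigenvalues are all below 1, repeatedly multiplying an upstream gradient by W_h drives its norm toward zero; eigenvalues above 1 would make it explode instead.

```python
import numpy as np

# Hypothetical recurrent weight matrix with eigenvalues (0.5, 0.4), both below 1.
W_h = np.array([[0.5, 0.1],
                [0.0, 0.4]])

# Stand-in for the "upstream" gradient dJ/dh at the later step i.
grad = np.array([1.0, 1.0])

# Backpropagate over ell steps in the linear case: one multiplication by W_h per step.
for ell in [1, 5, 10, 20]:
    g = grad @ np.linalg.matrix_power(W_h, ell)
    print(f"ell = {ell:2d}  ||dJ/dh^(j)|| = {np.linalg.norm(g):.2e}")

# The norm decays roughly like (largest eigenvalue)^ell, so the gradient vanishes;
# replacing 0.5 and 0.4 with values above 1 makes the same product blow up.
```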
10. Why is vanishing gradient a problem?
Gradient signal from far away is lost because it's much smaller than gradient signal from close by.
So model weights are updated only with respect to near effects, not long-term effects.
11. Effect of vanishing gradient on RNN-LM
• LM task: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ___"
• To learn from this training example, the RNN-LM needs to model the dependency between "tickets" on the 7th step and the target word "tickets" at the end.
• But if the gradient is small, the model can't learn this dependency.
• So the model is unable to predict similar long-distance dependencies at test time.
12. Effect of vanishing gradient on RNN-LM
• LM task: "The writer of the books ___" (is / are)
• Correct answer: "The writer of the books is planning a sequel"
• Syntactic recency: "The writer of the books is" (correct)
• Sequential recency: "The writer of the books are" (incorrect)
• Due to vanishing gradient, RNN-LMs are better at learning from sequential recency than syntactic recency, so they make this type of error more often than we'd like [Linzen et al 2016]
"Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies", Linzen et al, 2016. https://arxiv.org/pdf/1611.01368.pdf
13. Why is exploding gradient a problem?
• If the gradient becomes too big, then the SGD update step becomes too big:
  θ^new = θ^old − α ∇_θ J(θ)    (α: learning rate, ∇_θ J(θ): gradient)
• This can cause bad updates: we take too large a step and reach a bad parameter configuration (with large loss)
• In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)
14. Gradient clipping: solution for exploding gradient
• Gradient clipping: if the norm of the gradient is greater than some threshold, scale it down before applying the SGD update (a minimal sketch follows below)
• Intuition: take a step in the same direction, but a smaller step
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
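As an illustration, here is a minimal NumPy sketch of norm-based gradient clipping as described above. The function name and threshold value are illustrative assumptions, not from the slides; deep learning frameworks ship equivalents (e.g. PyTorch's torch.nn.utils.clip_grad_norm_).

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If ||grad|| exceeds the threshold, rescale grad so its norm equals the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)  # same direction, smaller step
    return grad

# Example: an "exploding" gradient is scaled back before the SGD update.
g = np.array([30.0, 40.0])          # norm = 50
g_clipped = clip_gradient(g)        # norm = 5, direction unchanged
theta = np.zeros(2)
lr = 0.1
theta = theta - lr * g_clipped      # SGD update with the clipped gradient
print(g_clipped, np.linalg.norm(g_clipped))
```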
15. How to fix the vanishing gradient problem?
• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten.
• How about an RNN with separate memory?
16. Long Short-Term Memory (LSTM)
• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem.
• On step t, there is a hidden state h^(t) and a cell state c^(t).
• Both are vectors of length n.
• The cell stores long-term information.
• The LSTM can erase, write and read information from the cell.
• The selection of which information is erased/written/read is controlled by three corresponding gates.
• The gates are also vectors of length n.
• On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between.
• The gates are dynamic: their value is computed based on the current context.
"Long short-term memory", Hochreiter and Schmidhuber, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
17. Long Short-Term Memory (LSTM)
We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t). On timestep t:

Forget gate: controls what is kept vs forgotten, from the previous cell state
  f^(t) = σ(W_f h^(t−1) + U_f x^(t) + b_f)
Input gate: controls what parts of the new cell content are written to the cell
  i^(t) = σ(W_i h^(t−1) + U_i x^(t) + b_i)
Output gate: controls what parts of the cell are output to the hidden state
  o^(t) = σ(W_o h^(t−1) + U_o x^(t) + b_o)
New cell content: this is the new content to be written to the cell
  c̃^(t) = tanh(W_c h^(t−1) + U_c x^(t) + b_c)
Cell state: erase ("forget") some content from the last cell state, and write ("input") some new cell content
  c^(t) = f^(t) ∘ c^(t−1) + i^(t) ∘ c̃^(t)
Hidden state: read ("output") some content from the cell
  h^(t) = o^(t) ∘ tanh(c^(t))

All of these are vectors of the same length n.
Sigmoid function σ: all gate values are between 0 and 1.
Gates are applied using the element-wise (Hadamard) product ∘.
(A minimal forward-pass sketch of these equations follows below.)
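To connect the equations to code, here is a minimal NumPy sketch of a single LSTM step under the formulation above. The weight names, sizes, and the usage example are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM timestep following the slide equations.
    x: input x^(t); h_prev: h^(t-1); c_prev: c^(t-1);
    p: dict of weight matrices W_*, U_* and biases b_* for f, i, o, c."""
    f = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x + p["b_i"])        # input gate
    o = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x + p["b_o"])        # output gate
    c_tilde = np.tanh(p["W_c"] @ h_prev + p["U_c"] @ x + p["b_c"])  # new cell content
    c = f * c_prev + i * c_tilde    # cell state: forget some old content, write some new
    h = o * np.tanh(c)              # hidden state: read some content from the cell
    return h, c

# Tiny usage example with random parameters (n = 4 hidden units, d = 3 input features).
rng = np.random.default_rng(0)
n, d = 4, 3
params = {}
for gate in ["f", "i", "o", "c"]:
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n, n))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n, d))
    params[f"b_{gate}"] = np.zeros(n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d)):   # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
print(h)
```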
18. Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this:
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
19. Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually like this (the diagram tracks c^(t−1), h^(t−1), the gates f^(t), i^(t), o^(t), the new cell content c̃^(t), and the outputs c^(t), h^(t)):
• Compute the forget gate
• Forget some cell content
• Compute the input gate
• Compute the new cell content
• Compute the output gate
• Write some new cell content
• Output some cell content to the hidden state
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
20. How does the LSTM solve vanishing gradients?
• The LSTM architecture makes it easier for the RNN to preserve information over many timesteps.
• E.g. if the forget gate is set to remember everything on every timestep, then the info in the cell is preserved indefinitely.
• By contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state.
• The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies.
21. Gated Recurrent Units (GRU)
• Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM.
• On each timestep t we have an input x^(t) and a hidden state h^(t) (no cell state).

Update gate: controls what parts of the hidden state are updated vs preserved
  u^(t) = σ(W_u h^(t−1) + U_u x^(t) + b_u)
Reset gate: controls what parts of the previous hidden state are used to compute the new content
  r^(t) = σ(W_r h^(t−1) + U_r x^(t) + b_r)
New hidden state content: the reset gate selects useful parts of the previous hidden state; use this and the current input to compute the new hidden content
  h̃^(t) = tanh(W_h (r^(t) ∘ h^(t−1)) + U_h x^(t) + b_h)
Hidden state: the update gate simultaneously controls what is kept from the previous hidden state and what is updated to the new hidden state content
  h^(t) = (1 − u^(t)) ∘ h^(t−1) + u^(t) ∘ h̃^(t)

How does this solve vanishing gradient? Like the LSTM, the GRU makes it easier to retain information long-term (e.g. by setting the update gate to 0). (A minimal sketch follows below.)
"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al. 2014, https://arxiv.org/pdf/1406.1078v3.pdf
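For comparison with the LSTM sketch above, here is a minimal NumPy sketch of a single GRU step under the same conventions (weight names and shapes are illustrative assumptions, not from the slides).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU timestep following the slide equations (no cell state).
    p: dict with W_u, U_u, b_u, W_r, U_r, b_r, W_h, U_h, b_h."""
    u = sigmoid(p["W_u"] @ h_prev + p["U_u"] @ x + p["b_u"])               # update gate
    r = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x + p["b_r"])               # reset gate
    h_tilde = np.tanh(p["W_h"] @ (r * h_prev) + p["U_h"] @ x + p["b_h"])   # new content
    return (1.0 - u) * h_prev + u * h_tilde                                # mix old and new
```

Note that the GRU has three weight blocks where the LSTM has four, which is where its parameter savings come from.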
22. LSTM vs GRU
• Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely used.
• The biggest difference is that the GRU is quicker to compute and has fewer parameters.
• There is no conclusive evidence that one consistently performs better than the other.
• LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data).
• Rule of thumb: start with LSTM, but switch to GRU if you want something more efficient.
23. Bidirectional RNNs: motivation
Task: Sentiment Classification (example input: "the movie was terribly exciting !", label: positive, predicted from a sentence encoding).
We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.
These contextual representations only contain information about the left context (e.g. "the movie was"). What about the right context?
In this example, "exciting" is in the right context, and this modifies the meaning of "terribly" (from negative to positive).
24. Bidirectional RNNs
(Example input: "the movie was terribly exciting !")
A forward RNN and a backward RNN each run over the sentence, and their hidden states are concatenated at each position.
This contextual representation of "terribly" has both left and right context!
25. Bidirectional RNNs
On timestep t:
  Forward RNN:  h_fw^(t) = RNN_FW(h_fw^(t−1), x^(t))
  Backward RNN: h_bw^(t) = RNN_BW(h_bw^(t+1), x^(t))
  Concatenated hidden states: h^(t) = [h_fw^(t); h_bw^(t)]
RNN_FW / RNN_BW is a general notation meaning "compute one forward step of the RNN" – it could be a vanilla, LSTM or GRU computation.
We regard the concatenation h^(t) as "the hidden state" of the bidirectional RNN. This is what we pass on to the next parts of the network.
Generally, these two RNNs have separate weights. (A minimal sketch follows below.)
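Here is a minimal NumPy sketch of the bidirectional wiring described above, using a vanilla RNN step as the per-step computation (it could equally be an LSTM or GRU step). The function and parameter names are illustrative assumptions.

```python
import numpy as np

def rnn_step(x, h_prev, W_h, W_x, b):
    """One vanilla RNN step; an LSTM or GRU step could be substituted here."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def bidirectional_encode(xs, fw, bw):
    """Run a forward and a backward RNN over xs (a list of input vectors) with
    separate weights, and concatenate their hidden states at each timestep."""
    T, n = len(xs), fw["W_h"].shape[0]
    h_fw, h_bw = np.zeros(n), np.zeros(n)
    fw_states, bw_states = [None] * T, [None] * T
    for t in range(T):                      # forward RNN: left to right
        h_fw = rnn_step(xs[t], h_fw, fw["W_h"], fw["W_x"], fw["b"])
        fw_states[t] = h_fw
    for t in reversed(range(T)):            # backward RNN: right to left
        h_bw = rnn_step(xs[t], h_bw, bw["W_h"], bw["W_x"], bw["b"])
        bw_states[t] = h_bw
    # "The hidden state" of the bidirectional RNN at step t: [forward; backward]
    return [np.concatenate([fw_states[t], bw_states[t]]) for t in range(T)]
```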
26. Bidirectional RNNs: simplified diagram
(Example input: "the movie was terribly exciting !")
In the simplified diagram, two-way arrows indicate bidirectionality, and the depicted hidden states are assumed to be the concatenated forward+backward states.
27. Bidirectional RNNs
• Note: bidirectional RNNs are only applicable if you have access to the entire input sequence.
• They are not applicable to Language Modeling, because in LM you only have left context available.
• If you do have the entire input sequence (e.g. any kind of encoding), bidirectionality is powerful (you should use it by default).
• For example, BERT (Bidirectional Encoder Representations from Transformers) is a powerful pretrained contextual representation system built on bidirectionality.
• You will learn more about BERT later in the course!
28. Multi-layer RNNs
• RNNs are already "deep" in one dimension (they unroll over many timesteps).
• We can also make them "deep" in another dimension by applying multiple RNNs – this is a multi-layer RNN.
• This allows the network to compute more complex representations.
• The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
• Multi-layer RNNs are also called stacked RNNs.
29. Multi-layer RNNs
(Example input: "the movie was terribly exciting !", processed by RNN layer 1, then RNN layer 2, then RNN layer 3.)
The hidden states from RNN layer i are the inputs to RNN layer i+1 (see the sketch below).
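A minimal NumPy sketch of this stacking rule follows; the layer layout and parameter names are illustrative assumptions, and the per-step function is a vanilla RNN step standing in for any recurrent cell.

```python
import numpy as np

def rnn_step(x, h_prev, W_h, W_x, b):
    """One vanilla RNN step (an LSTM or GRU step could be substituted)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def stacked_rnn(xs, layers):
    """layers: list of parameter dicts, one per RNN layer (lowest layer first).
    Each layer's W_x must accept the previous layer's hidden size as input dim.
    The hidden states of layer i become the input sequence of layer i+1."""
    seq = xs
    for p in layers:
        h = np.zeros(p["W_h"].shape[0])
        out = []
        for x in seq:
            h = rnn_step(x, h, p["W_h"], p["W_x"], p["b"])
            out.append(h)
        seq = out        # feed this layer's hidden states to the next layer up
    return seq           # hidden states of the top RNN layer
```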
30. Multi-layer RNNs in practice
• High-performing RNNs are often multi-layer (but aren't as deep as convolutional or feed-forward networks).
• For example: in a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN.
• However, skip-connections/dense-connections are needed to train deeper RNNs (e.g. 8 layers).
• Transformer-based networks (e.g. BERT) can be up to 24 layers.
• You will learn about Transformers later; they have a lot of skipping-like connections.
"Massive Exploration of Neural Machine Translation Architectures", Britz et al, 2017. https://arxiv.org/pdf/1703.03906.pdf