Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
1. Full Stack Deep Learning - UW Spring 2020 - Sergey Karayev - with content by Pieter Abbeel, Josh Tobin
Recurrent Neural Networks
2. Agenda
1. Sequence Problems
2. RNNs
3. Vanishing gradients and LSTMs
4. Case study: Machine Translation
(Bidirectionality and Attention)
5. CTC loss
6. Pros and Cons
7. A preview of non-recurrent sequence models
4. Sequence problems
Task: input → output
• Time series forecasting: time series → predicted next value
• Sentiment classification: review text → predicted sentiment
• Translation: English text → French text
• Speech recognition and generation: audio waveform ↔ text
• Text or music generation: Ø → text or music
• Image captioning: image → description (e.g., “The quick brown fox jumped over the lazy dog”)
• Question answering: text → text
5. Types of sequence problems
1. Why RNNs?
Figure: one-to-one, one-to-many, many-to-one, and many-to-many input/output structures (from http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
7-10. Why not use feedforward networks?
Figure: a naive feedforward approach to a many-to-many problem. Concatenate the inputs across all timesteps, pass them through a fully connected layer, then reshape the output back into a sequence of per-timestep outputs.
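A minimal PyTorch sketch of this concatenate / fully connected / reshape pipeline (all dimensions here are made up: k features per timestep, t timesteps, d output features per timestep):

import torch
import torch.nn as nn

# Made-up dimensions: k features per timestep, t timesteps, d outputs per timestep.
k, t, d = 8, 5, 4

class NaiveFeedforwardSeq(nn.Module):
    """Many-to-many via one fully connected layer: concatenate, FC, reshape."""
    def __init__(self):
        super().__init__()
        # One weight matrix of shape (k*t) x (d*t): every input timestep
        # is connected to every output timestep.
        self.fc = nn.Linear(k * t, d * t)

    def forward(self, x):              # x: (batch, t, k)
        flat = x.flatten(start_dim=1)  # concatenate timesteps: (batch, k*t)
        out = self.fc(flat)            # (batch, d*t)
        return out.view(-1, t, d)      # reshape back into a sequence

model = NaiveFeedforwardSeq()
y = model(torch.randn(2, t, k))        # works only if every input has exactly t steps
print(y.shape)                         # torch.Size([2, 5, 4])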
11. Problem 1: variable-length inputs
A fully connected layer requires a fixed-size input. We can deal with this by padding all sequences to the max length (sketched below), but…
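A minimal sketch of the padding workaround, using PyTorch's pad_sequence utility on made-up sequences:

import torch
from torch.nn.utils.rnn import pad_sequence

# Three hypothetical sequences of different lengths (8 features per step).
seqs = [torch.randn(3, 8), torch.randn(5, 8), torch.randn(2, 8)]

# Zero-pad every sequence to the longest length in the batch.
padded = pad_sequence(seqs, batch_first=True)
print(padded.shape)  # torch.Size([3, 5, 8])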
12. Problem 2: memory scaling
The memory requirement scales linearly with the number of timesteps: with a k-dimensional feature at each of t timesteps, the concatenated input is (k * t)-dimensional, so mapping it to a d-dimensional output requires a weight matrix with k * t * d entries.
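A quick back-of-the-envelope check with made-up sizes shows how fast this grows:

# Hypothetical sizes, only to illustrate the k * t * d scaling.
k, t, d = 128, 1000, 128          # features per step, timesteps, output dim
params = k * t * d                # entries in the single FC weight matrix
print(f"{params:,}")              # 16,384,000 -- and doubling t doubles it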
13. Problem 3: overkill
This k * t * d weight matrix has to learn patterns everywhere they may occur in the sequence! That ignores the nature of the problem: the same patterns recur over time.
15. Agenda
1. Sequence Problems
2. RNNs
3. Vanishing gradients and LSTMs
4. Case study: Machine Translation
(Bidirectionality and Attention)
5. CTC loss
6. Pros and Cons
7. A preview of non-recurrent sequence models
21. RNNs for many-to-one problems
2. Review of RNNs
Architecture: the input sequence feeds an RNN encoder; the encoder's state at the last timestep (a vector such as [0.5, 0.2, -0.1, -0.3, 0.4, 1.2] on the slide) goes through a fully connected classifier to produce the output.
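A minimal PyTorch sketch of this architecture, with made-up dimensions (the six-number vector on the slide stands in for the final hidden state):

import torch
import torch.nn as nn

class ManyToOneClassifier(nn.Module):
    """RNN encoder -> state at last timestep -> fully connected classifier."""
    def __init__(self, input_dim=8, hidden_dim=6, num_classes=3):
        super().__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                  # x: (batch, timesteps, input_dim)
        _, h_last = self.rnn(x)            # h_last: (1, batch, hidden_dim)
        return self.fc(h_last.squeeze(0))  # logits: (batch, num_classes)

model = ManyToOneClassifier()
logits = model(torch.randn(4, 10, 8))      # works for any sequence length
print(logits.shape)                        # torch.Size([4, 3])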
24. RNNs for one-to-many problems
Encoder-decoder architectures: an encoder (e.g., a ConvNet) maps the input to a hidden state vector (e.g., [0.5, 0.2, -0.1, -0.3, 0.4, 1.2]), and an RNN decoder generates the output from that state, e.g., the caption “The quick brown fox jumped over the lazy dog”.
25. RNNs for one-to-many problems
Unrolled view: the ConvNet's output serves as the decoder RNN's initial hidden state h0, and the decoder emits the output sequence one step at a time.
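A minimal sketch of the one-to-many setup, with the ConvNet stubbed out as a linear layer and made-up dimensions:

import torch
import torch.nn as nn

class OneToManyDecoder(nn.Module):
    """Image encoder output seeds the RNN's initial hidden state h0;
    the decoder then emits one token per step."""
    def __init__(self, feat_dim=512, hidden_dim=6, vocab_size=100):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)  # stand-in for a ConvNet
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, tokens):         # tokens: (batch, steps)
        h0 = self.encoder(image_feats).unsqueeze(0)  # (1, batch, hidden_dim)
        outputs, _ = self.rnn(self.embed(tokens), h0)
        return self.out(outputs)                     # (batch, steps, vocab_size)

model = OneToManyDecoder()
logits = model(torch.randn(2, 512), torch.randint(0, 100, (2, 7)))
print(logits.shape)                                  # torch.Size([2, 7, 100])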
32. RNNs for many-to-many problems
Encoder-decoder architectures again: an encoder RNN reads the input sentence (“I am a student”) into a hidden state vector (e.g., [0.5, 0.2, -0.1, -0.3, 0.4, 1.2]), and a decoder RNN generates the translation (“Je suis étudiant”).
33-34. RNNs for many-to-many problems
Unrolled view: the encoder side reads “I am a student <s>”; the decoder side produces “Je suis étudiant <s>”. All the information in the input sentence is condensed into one hidden state vector! (In practice, we need more tricks for this to work -- explained soon.)
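A minimal sketch of this encoder-decoder (seq2seq) pattern, with made-up vocabulary sizes and teacher-forced decoder inputs:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder RNN compresses the source sentence into its final hidden
    state; the decoder RNN starts from that state and emits the target."""
    def __init__(self, src_vocab=100, tgt_vocab=100, hidden_dim=6):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden_dim)
        self.encoder = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.decoder = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):             # (batch, src_len), (batch, tgt_len)
        _, h = self.encoder(self.src_embed(src))  # all source info ends up in h
        dec_out, _ = self.decoder(self.tgt_embed(tgt), h)
        return self.out(dec_out)              # (batch, tgt_len, tgt_vocab)

model = Seq2Seq()
logits = model(torch.randint(0, 100, (2, 5)), torch.randint(0, 100, (2, 6)))
print(logits.shape)                           # torch.Size([2, 6, 100])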
35. Agenda
1. Sequence Problems
2. RNNs
3. Vanishing gradients and LSTMs
4. Case study: Machine Translation
(Bidirectionality and Attention)
5. CTC loss
6. Pros and Cons
7. A preview of non-recurrent sequence models
36. RNN Desiderata
3. Vanishing gradients
• Goal: handle long sequences
• Connect events from the past to outcomes in the future
• i.e., long-term dependencies
• e.g., remember the name of a character from the first sentence
37. Vanilla RNNs: the reality
• Can’t handle more than 10-20 timesteps
• Longer-term dependencies get lost
• Why? Vanishing gradients
https://bair.berkeley.edu/blog/2018/08/06/recurrent/
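A tiny numerical sketch of the effect, using a made-up one-dimensional RNN h_t = tanh(w * h_{t-1}): each backward step multiplies the gradient by w times the tanh derivative, so it shrinks geometrically:

import math

# Hypothetical 1-D vanilla RNN: h_t = tanh(w * h_{t-1}), with w < 1.
w, h, grad = 0.9, 1.0, 1.0
for t in range(20):
    h = math.tanh(w * h)
    grad *= w * (1 - h * h)   # chain rule: d h_t / d h_{t-1} = w * tanh'(w * h_{t-1})
    print(f"t={t+1:2d}  |d h_t / d h_0| ~ {grad:.2e}")
# The factor shrinks geometrically, so inputs 10-20 steps back barely
# influence the gradient -- the vanishing gradient problem.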