Reproducing and Analyzing
Adaptive Computation Time
in PyTorch and TensorFlow
Dani Fojo, Víctor Campos, Xavier Giró-i-Nieto
06/02/2018
Outline
1. Motivation
2. Theoretical background
3. Related work
4. Adaptive Computation Time
5. Implementation
6. Experiments
7. Conclusions
Motivation
Motivation
Why do we need adaptive computation?
Motivation
The complexity of posing a problem is not proportional to the complexity of solving it.
Example: Fermat's Last Theorem was stated in 1637, but only proved in 1995.
A. Wiles, “Modular elliptic curves and Fermat’s Last Theorem”, 1995
Motivation
Problems may differ in complexity
Theoretical background
Neural Networks
Neural networks alternate linear transformations with non-linear activation functions.
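As a minimal illustration (not code from the talk; layer sizes are arbitrary), a PyTorch model that alternates linear layers and non-linearities:

import torch
import torch.nn as nn

# Alternating linear transformations and non-linearities
model = nn.Sequential(
    nn.Linear(64, 128),   # linear
    nn.ReLU(),            # non-linear
    nn.Linear(128, 10),   # linear
)

x = torch.randn(32, 64)   # a batch of 32 input vectors
logits = model(x)         # shape (32, 10)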
Loss function
Mean Squared Error: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
Cross Entropy: $\mathcal{L}_{\mathrm{CE}} = -\sum_{i} y_i \log \hat{y}_i$
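Both losses are available out of the box in PyTorch; a small illustration (not from the talk, shapes are arbitrary):

import torch
import torch.nn as nn

pred = torch.randn(8, 10)                 # model outputs (logits for cross entropy)
target_reg = torch.randn(8, 10)           # regression targets
target_cls = torch.randint(0, 10, (8,))   # class labels

mse = nn.MSELoss()(pred, target_reg)          # mean squared error
ce = nn.CrossEntropyLoss()(pred, target_cls)  # cross entropy over the logits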
Recurrent Neural Networks (RNNs)
Main idea: the network keeps a state that is updated at every time step.
Slide credit: Xavier Giró
Recurrent Neural Networks (RNNs)
[Figure: the same RNN unfolded over time, shown from a front view and a side view]
Slide credit: Xavier Giró
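In code, “the network has a state” simply means the same cell is applied at every time step to the new input and the previous state. A minimal PyTorch sketch (not the talk’s code; sizes are arbitrary):

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=8)
state = torch.zeros(1, 8)          # initial state

inputs = torch.randn(5, 1, 4)      # a sequence of 5 time steps
for x_t in inputs:                 # "unfolding" the network over time
    state = cell(x_t, state)       # same cell, updated state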
Related work
Related work
Spatially-Adaptive Computation Time for Residual Networks.
M. Figurnov et al. “Spatially Adaptive Computation Time for Residual Networks” CVPR 2017
Related work
LSTM-Jump and Skip-RNN.
A. Yu et al. “Learning to Skim Text” ACL 2017
V. Campos et al. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks” ICLR 2018
Adaptive Computation Time
Adaptive Computation Time
Adaptive Computation Time for RNNs (ACT).
A. Graves “Adaptive Computation Time for Recurrent Neural Networks”, arXiv 2016
Adaptive Computation Time (ACT)
Simple Recurrent Neural Network (RNN)
$s_t = \mathcal{S}(s_{t-1}, x_t), \qquad y_t = W_y s_t + b_y$
Model description
For each input $x_t$, the RNN performs a variable number of intermediate updates
$s_t^n = \mathcal{S}(s_t^{n-1}, x_t^n)$, each producing an intermediate output $y_t^n$.

Halting probability: $h_t^n = \sigma(W_h s_t^n + b_h)$

Each sample is processed until these probabilities add up to one:
$N(t) = \min\{n' : \sum_{n=1}^{n'} h_t^n \geq 1 - \epsilon\}$

● Residual: $R(t) = 1 - \sum_{n=1}^{N(t)-1} h_t^n$
● Outputs: $y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n$ and $s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n$,
  with $p_t^n = h_t^n$ for $n < N(t)$ and $p_t^{N(t)} = R(t)$
Limiting computation
● Modified loss: $\hat{\mathcal{L}}(x, y) = \mathcal{L}(x, y) + \tau P(x)$, where $\tau$ is the time penalty
● Ponder cost: $P(x) = \sum_{t=1}^{T} \rho_t$, with $\rho_t = N(t) + R(t)$
Ponder cost: Intuition
In $\rho_t = N(t) + R(t)$, the number of steps $N(t)$ is piecewise constant and provides no gradient.
The gradient flows only through the residual $R(t)$, i.e. through the negated halting probabilities
$1 - \sum_n h_t^n$, so minimizing the ponder cost pushes the halting probabilities up and encourages
the network to halt earlier.
Limiting computation
Maximum steps: the number of intermediate updates is also capped by a hard limit, $N(t) \leq M$.
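Putting the equations above together, a single ACT time step can be sketched in PyTorch as follows. This is only an illustrative sketch, not the implementation from the talk: the choice of nn.RNNCell, the names eps and max_steps, and the batching scheme are assumptions.

import torch
import torch.nn as nn

class ACTStep(nn.Module):
    def __init__(self, input_size, hidden_size, eps=0.01, max_steps=100):
        super().__init__()
        # One extra input feature: a flag marking the first intermediate update
        self.cell = nn.RNNCell(input_size + 1, hidden_size)
        self.halting = nn.Linear(hidden_size, 1)
        self.eps = eps
        self.max_steps = max_steps

    def forward(self, x, state):
        # x: (batch, input_size), state: (batch, hidden_size)
        batch = x.size(0)
        budget = torch.ones(batch)            # 1 - accumulated halting probability
        ponder = torch.zeros(batch)           # rho_t = N(t) + R(t)
        mean_state = torch.zeros_like(state)  # sum_n p_t^n * s_t^n
        running = torch.ones(batch)           # 1 while the sample has not halted

        for n in range(self.max_steps):
            flag = torch.full((batch, 1), 1.0 if n == 0 else 0.0)
            state = self.cell(torch.cat([x, flag], dim=1), state)
            h = torch.sigmoid(self.halting(state)).squeeze(1)

            halt = (budget - h < self.eps).float()
            if n == self.max_steps - 1:
                halt = torch.ones_like(halt)   # hard cap: force the remaining samples to halt
            halting_now = running * halt

            # p_t^n = h_t^n while running, except the residual R(t) on the halting step
            p = torch.where(halting_now.bool(), budget, h) * running

            mean_state = mean_state + p.unsqueeze(1) * state
            ponder = ponder + running + halting_now * budget  # counts N(t), then adds R(t)

            budget = budget - p
            running = running * (1.0 - halting_now)
            if running.sum() == 0:
                break

        return mean_state, ponder

The returned ponder cost is summed over time steps and added to the task loss scaled by the time penalty $\tau$, giving the modified loss above.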
Implementation
Deep Learning Frameworks
Computation graph
Static graph vs. dynamic graph
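A rough illustration of the difference, assuming the TensorFlow 1.x and PyTorch APIs of the time (not code from the talk):

import tensorflow as tf   # TensorFlow 1.x: build a static graph, then run it in a session
import torch              # PyTorch: operations execute eagerly (dynamic graph)

# Static graph: symbolic placeholders and ops, executed later via a session
a = tf.placeholder(tf.float32, shape=())
b = tf.placeholder(tf.float32, shape=())
c = a * b
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 2.0, b: 3.0}))   # 6.0

# Dynamic graph: the graph is built on the fly as the Python code runs
x = torch.tensor(2.0)
y = torch.tensor(3.0)
print(x * y)                                          # tensor(6.)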
Implementation: PyTorch
Pros:
● Dynamic graph
Cons:
● Version 0.2
● Lack of documentation
● Too slow for custom RNNs
Implementation: TensorFlow
Pros:
● Version 1.4
● Much better documentation
● Much faster
Cons:
● Static graph
● Harder to use
Implementation
bit.ly/dani-tfg
Experiments
Experiments
We wanted to test ACT, but comparing it with a simple RNN is unfair: ACT performs extra computation for each input, so the baseline should too.
New baseline: Repeat-RNN
Each input is fed to the network a fixed number of times (repetitions) before moving on to the next sample.
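A minimal sketch of this baseline in PyTorch (an assumed form, not the talk’s code): the same input is processed by the recurrent cell a fixed number of times before the next element of the sequence.

import torch
import torch.nn as nn

class RepeatRNN(nn.Module):
    def __init__(self, input_size, hidden_size, repetitions=3):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.repetitions = repetitions

    def forward(self, inputs):
        # inputs: (seq_len, batch, input_size)
        state = torch.zeros(inputs.size(1), self.cell.hidden_size)
        states = []
        for x in inputs:                        # loop over time steps
            for _ in range(self.repetitions):   # fixed number of updates per input
                state = self.cell(x, state)
            states.append(state)
        return torch.stack(states)              # (seq_len, batch, hidden_size)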
Experiments: Parity
Experiments: Addition
Each digit is fed as a one-hot encoding.

One-hot encoding
Each digit $d \in \{0, \dots, 9\}$ is represented as a vector with a 1 in position $d$ and 0 everywhere else.
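For illustration, a hypothetical helper that builds the one-hot vector for a digit (not the talk’s code):

import torch

def one_hot_digit(d: int) -> torch.Tensor:
    # 10-dimensional vector with a 1 at position d and 0 elsewhere
    v = torch.zeros(10)
    v[d] = 1.0
    return v

one_hot_digit(3)   # tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])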
Parity: Comparison
Addition: Comparison
Conclusions
Conclusions
1. We implemented ACT in two frameworks (PyTorch and TensorFlow).
2. We designed Repeat-RNN and compared it to ACT.
3. We achieved better performance than ACT with a simpler, more interpretable model.
Future work
● ICLR 2018 workshop
● Improve ACT
● Skipping samples in Repeat-RNN
Backup slides
Parity: ACT-RNN
Parity: Repeat-RNN
Parity: Comparison
Addition: ACT-LSTM
Addition: Repeat-LSTM
Addition: Ponder distribution
Addition: Comparison
