4. SPG
Abstract
Policy-gradient methods in reinforcement learning come with two issues:
1. the need for warm-start training
2. the need for sample variance reduction
Cold-Start Reinforcement Learning with Softmax Policy Gradient [2] overcomes both, based on a softmax value function. The method can be used to train sequence generation models for structured output prediction problems.
6. SPG
Introduction
The procedure when applying a policy-gradient method to sequence generation is as follows:
The model proposes a sequence.
A reward is computed for the proposed sequence based on the ground-truth target, using metrics such as ROUGE [3] for summarization, or CIDEr [5] and SPICE [1] for image captioning.
The weighted average of the log-likelihood of the proposed sequences is optimized, with the rewards as weights (see the sketch below).
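A minimal sketch of this recipe in Python; `model.sample_sequence`, `model.log_likelihood`, and `reward_fn` are hypothetical helper names standing in for a real sequence model and metric, not APIs from the paper:

```python
# Minimal sketch of reward-weighted log-likelihood training (REINFORCE-style).
def policy_gradient_loss(model, x, y_true, reward_fn, num_samples=4):
    total = 0.0
    for _ in range(num_samples):
        z = model.sample_sequence(x)             # the model proposes a sequence
        r = reward_fn(z, y_true)                 # e.g. ROUGE / CIDEr / SPICE score
        total += r * model.log_likelihood(z, x)  # reward-weighted log-likelihood
    return -total / num_samples                  # negate: we minimize a loss
```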
8. SPG
Limitations of Existing Sequence Learning Regimes
The standard training method for sequence learning is Maximum-Likelihood Estimation (MLE). Given a set of inputs $X = \{x^i\}$ and target sequences $Y = \{y^i\}$, the MLE loss function is:
\[
\mathcal{L}_{\mathrm{MLE}}(\theta) = \sum_i \mathcal{L}^i_{\mathrm{MLE}}(\theta), \qquad \text{where } \mathcal{L}^i_{\mathrm{MLE}}(\theta) = -\log p_\theta\left(y^i \mid x^i\right) \tag{1}
\]
Here $x^i$ and $y^i = \left(y^i_1, \ldots, y^i_T\right)$ denote the input and the target sequence of the $i$-th example, respectively.
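As a concrete illustration of Eq. (1), a small numeric sketch; the token probabilities below are made-up example values:

```python
import numpy as np

def mle_loss(token_log_probs):
    # Eq. (1): L^i_MLE = -log p_theta(y^i | x^i), i.e. the negative sum of
    # per-token log-probabilities log p_theta(y^i_t | x^i, y^i_{1:t-1}).
    return -np.sum(token_log_probs)

# Example: a 3-token target whose tokens the model assigns these probabilities.
print(mle_loss(np.log([0.7, 0.5, 0.9])))  # ~1.155
```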
9. SPG
Limitations of Existing Sequence Learning Regimes
Reward Augmented Maximum Likelihood (RAML)
Under MLE, all alternative outputs are equally penalized through normalization, regardless of their relationship to the ground-truth target. Reward Augmented Maximum Likelihood (RAML) [4] generalizes MLE by weighting each output with a reward-based distribution, and the resulting sum can be approximated with Monte Carlo integration:
\[
\mathcal{L}^i_{\mathrm{RAML}}(\theta) = -\sum_{z^i} r_R\left(z^i \mid y^i\right) \log p_\theta\left(z^i \mid x^i\right) \approx -\frac{1}{J}\sum_{j=1}^{J} \log p_\theta\left(z^{ij} \mid x^i\right) \tag{2}
\]
where $r_R\left(z^i \mid y^i\right) = \dfrac{\exp\left(R(z^i \mid y^i)/\tau\right)}{\sum_{z^i}\exp\left(R(z^i \mid y^i)/\tau\right)}$.
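A small sketch of the exponentiated-reward distribution $r_R$ and the Monte Carlo draws $z^{ij}$ it feeds; the candidate rewards and the temperature value are assumed for illustration:

```python
import numpy as np

def raml_weights(rewards, tau=0.9):
    # r_R(z | y) from Eq. (2): exponentiated rewards, normalized over the
    # candidate outputs, with temperature tau controlling the concentration.
    scaled = np.asarray(rewards, dtype=float) / tau
    scaled -= scaled.max()                     # shift for numerical stability
    w = np.exp(scaled)
    return w / w.sum()

rewards = [1.0, 0.6, 0.1]                      # e.g. sentence-level reward values
probs = raml_weights(rewards)
samples = np.random.choice(len(rewards), size=5, p=probs)  # Monte Carlo draws z^{ij}
```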
10. SPG
Limitations of Existing Sequence Learning Regimes
Challenge
However, there is a large discrepancy between the model prediction distribution $p_\theta(z^i \mid x^i)$ and the values of the reward $R(z^i \mid y^i)$, which is especially acute during the early training stages.
This makes training inefficient from a speed-of-convergence perspective and unsatisfactory from a theoretical and modeling perspective.
Both issues are addressed by the value function described next.
12. SPG
Softmax Policy Gradient (SPG) Method
Softmax value function
Original value function:
\[
\mathcal{L}^i_{\mathrm{PG}}(\theta) = -V^i_{\mathrm{PG}}(\theta), \qquad V^i_{\mathrm{PG}}(\theta) = \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[R\left(z^i \mid y^i\right)\right] \tag{3}
\]
Softmax value function:
\[
\mathcal{L}^i_{\mathrm{SPG}}(\theta) = -V^i_{\mathrm{SPG}}(\theta), \qquad V^i_{\mathrm{SPG}}(\theta) = \log \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[\exp\left(R\left(z^i \mid y^i\right)\right)\right] \tag{4}
\]
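To make the difference concrete, a toy sketch comparing the two value functions on an enumerable set of candidate sequences; the probabilities and rewards are assumed example values:

```python
import numpy as np

def pg_value(probs, rewards):
    # Eq. (3): V_PG = E_{p_theta}[R(z | y)], a plain expectation.
    return float(np.dot(probs, rewards))

def spg_value(probs, rewards):
    # Eq. (4): V_SPG = log E_{p_theta}[exp(R(z | y))], a softmax (log-sum-exp) value.
    m = np.max(rewards)                        # shift for numerical stability
    return float(m + np.log(np.dot(probs, np.exp(np.asarray(rewards) - m))))

probs   = np.array([0.2, 0.5, 0.3])            # toy model distribution over 3 candidates
rewards = np.array([1.0, 0.3, 0.0])            # their rewards against the ground truth
print(pg_value(probs, rewards), spg_value(probs, rewards))  # 0.35 vs ~0.42
```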
13. SPG
Softmax Policy Gradient (SPG) Method
Inference
Reward function
Reward increments:
\[
\Delta r^i_t\left(z^i_t \mid y^i, z^i_{1:t-1}\right) := R\left(z^i_{1:t} \mid y^i\right) - R\left(z^i_{1:t-1} \mid y^i\right) \tag{5}
\]
Additional reward functions:
\[
\mathrm{DUP}^i_t = \begin{cases} -1 & \text{if } z^i_t = z^i_{t-1} \\ 0 & \text{otherwise} \end{cases} \tag{6}
\]
\[
\mathrm{EOS}^i_t = \begin{cases} -1 & \text{if } z^i_t = \text{</S>} \text{ and } t < |y^i| \\ 0 & \text{otherwise} \end{cases} \tag{7}
\]
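A sketch of the per-token reward bookkeeping in Eqs. (5)-(7); the condition in `eos_penalty` follows the "end-of-sequence token emitted before the target length" reading reconstructed above:

```python
def reward_increment(R, z_prefix, y):
    # Eq. (5): the gain in reward from appending the latest token z_t, where
    # R(prefix, y) scores a partial sequence against the target y.
    return R(z_prefix, y) - R(z_prefix[:-1], y)

def dup_penalty(z, t):
    # Eq. (6): -1 if token t repeats the previous token, 0 otherwise.
    return -1 if t > 0 and z[t] == z[t - 1] else 0

def eos_penalty(z, t, y, eos="</S>"):
    # Eq. (7): -1 if the end-of-sequence token is emitted before the
    # target length is reached, 0 otherwise (assumed reading).
    return -1 if z[t] == eos and t < len(y) else 0
```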
14. SPG
Softmax Policy Gradient (SPG) Method
Bang-bang Rewarded SPG Method
To minimize the effort of fine-tuning the reward weights, the paper proposes a bang-bang rewarded softmax value function:
\[
\mathcal{L}^i_{\mathrm{BBSPG}}(\theta) = -\sum_{w^i} p\left(w^i\right) \log \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[\exp\left(R\left(z^i \mid y^i, w^i\right)\right)\right] \tag{8}
\]
\[
\frac{\partial}{\partial\theta}\bar{\mathcal{L}}^i_{\mathrm{BBSPG}}(\theta) = -\sum_{w^i} p\left(w^i\right) \underbrace{\sum_{z^i} \bar{q}_\theta\left(z^i \mid x^i, y^i, w^i\right) \frac{\partial}{\partial\theta}\log p_\theta\left(z^i \mid x^i\right)}_{\triangleq\, -\frac{\partial}{\partial\theta}\bar{\mathcal{L}}^i_{\mathrm{SPG}}(\theta \mid w^i)} \tag{9}
\]
Figure: An example of sequence generation.
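A sketch of drawing the bang-bang weights that Eq. (8) averages over; treating each $w^i_t$ as 0 with probability $p_{\mathrm{drop}}$ and a fixed value $W$ otherwise is an assumption consistent with the sampling description on the next slide, and $W = 1.0$, $p_{\mathrm{drop}} = 0.5$ are assumed example settings:

```python
import numpy as np

def sample_bangbang_weights(T, p_drop=0.5, W=1.0, rng=None):
    # Each per-step weight w_t is 0 with probability p_drop (token drawn from
    # the model itself) and W otherwise (token drawn from the reward-augmented
    # distribution).
    rng = np.random.default_rng() if rng is None else rng
    return np.where(rng.random(T) < p_drop, 0.0, W)

print(sample_bangbang_weights(6))  # e.g. [1. 0. 1. 1. 0. 1.]
```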
15. SPG
Softmax Policy Gradient (SPG) Method
Bang-bang Rewarded SPG Method
Main Loss
Using Monte Carlo integration, we approximate Eq. (9) by first drawing $w^{ij}$ from $p\left(w^i\right)$ and then iteratively drawing $z^{ij}_t$ from $\bar{q}_\theta\left(z^i_t \mid x^i, z^i_{1:t-1}, y^i, w^{ij}_t\right)$ for $t = 1, \ldots, T$.
For larger values of $p_{\mathrm{drop}}$, the $w^{ij}$ sample contains more entries with $w^{ij}_t = 0$, and the resulting $z^{ij}$ contains proportionally more samples from the model prediction distribution. After $z^{ij}$ is obtained, only the log-likelihoods of the $z^{ij}_t$ with $w^{ij}_t \neq 0$ are included in the loss:
\[
\frac{\partial}{\partial\theta}\bar{\mathcal{L}}^i_{\mathrm{BBSPG}}(\theta) \approx -\frac{1}{J}\sum_{j=1}^{J} \sum_{\left\{t :\, w^{ij}_t \neq 0\right\}} \frac{\partial}{\partial\theta}\log p_\theta\left(z^{ij}_t \mid x^i, z^{ij}_{1:t-1}\right) \tag{10}
\]
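A minimal sketch of Eq. (10) for a single drawn sequence $z^{ij}$; the log-probabilities and weights below are made-up example values:

```python
import numpy as np

def bbspg_token_loss(token_log_probs, w):
    # Eq. (10), one Monte Carlo sample: only tokens whose bang-bang weight
    # w_t is nonzero contribute their (negative) log-likelihood to the loss.
    mask = np.asarray(w) != 0
    return -np.sum(np.asarray(token_log_probs)[mask])

log_p = np.log([0.6, 0.4, 0.8, 0.5])   # per-token log p_theta(z_t | x, z_{1:t-1})
w     = [1.0, 0.0, 1.0, 0.0]           # bang-bang weights drawn for this sample
print(bbspg_token_loss(log_p, w))      # ~0.734
```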
18. SPG
Conclusion
Based on a softmax value function, this policy-gradient approach eliminates the need for warm-start training and for sample variance reduction during policy updates.
The proposed method achieves superior performance on text-to-text (automatic summarization) and image-to-text (automatic image captioning) tasks.
19. SPG
Conclusion
References I
[1] Peter Anderson et al. "SPICE: Semantic Propositional Image Caption Evaluation". In: Computer Vision - ECCV 2016, 14th European Conference, Proceedings, Part V. Ed. by Bastian Leibe et al. Lecture Notes in Computer Science. Springer, 2016, pp. 382-398. isbn: 9783319464534. doi: 10.1007/978-3-319-46454-1_24.
20. SPG
Conclusion
References II
[2] Nan Ding and Radu Soricut. "Cold-Start Reinforcement Learning with Softmax Policy Gradient". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 2814-2823. isbn: 9781510860964.
[3] Chin-Yew Lin and Franz Josef Och. "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics". In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. ACL '04. Barcelona, Spain: Association for Computational Linguistics, 2004, 605-es. doi: 10.3115/1218955.1219032. url: https://doi.org/10.3115/1218955.1219032.
21. SPG
Conclusion
References III
[4] Mohammad Norouzi et al. Reward Augmented Maximum
Likelihood for Neural Structured Prediction. 2017. arXiv:
1609.00150 [cs.LG].
[5] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. "CIDEr: Consensus-based image description evaluation". In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 4566-4575. doi: 10.1109/CVPR.2015.7299087.