4. SPG
Abstract
Policy-gradient methods in reinforcement learning come with two issues:
1. the need for warm-start training
2. the need for sample variance reduction
Cold-Start Reinforcement Learning with Softmax Policy Gradient [2] overcomes both, based on a softmax value function. The method can be used to train sequence generation models for structured output prediction problems.
6. SPG
Introduction
The procedure when applying a policy-gradient method to sequence generation is as follows:
The model proposes a sequence.
A reward is computed for the proposed sequence based on the ground-truth target, using metrics such as ROUGE [3] for summarization, or CIDEr [5] and SPICE [1] for image captioning.
The weighted average of the log-likelihood of the proposed sequences is optimized, with the rewards as weights (see the sketch below).
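A minimal sketch of this recipe in Python; `model.sample_sequence`, `model.log_likelihood`, and `reward_fn` are hypothetical helper names standing in for a real sequence model and metric, not APIs from the paper:

```python
# Minimal sketch of reward-weighted log-likelihood training (REINFORCE-style).
def policy_gradient_loss(model, x, y_true, reward_fn, num_samples=4):
    total = 0.0
    for _ in range(num_samples):
        z = model.sample_sequence(x)             # the model proposes a sequence
        r = reward_fn(z, y_true)                 # e.g. ROUGE / CIDEr / SPICE score
        total += r * model.log_likelihood(z, x)  # reward-weighted log-likelihood
    return -total / num_samples                  # negate: we minimize a loss
```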
8. SPG
Limitations of Existing Sequence Learning Regimes
The standard training method for sequence learning is Maximum-Likelihood Estimation (MLE). Given a set of inputs $X = \{x^i\}$ and target sequences $Y = \{y^i\}$, the MLE loss function is:
\[
\mathcal{L}_{\mathrm{MLE}}(\theta) = \sum_i \mathcal{L}^i_{\mathrm{MLE}}(\theta), \qquad \text{where } \mathcal{L}^i_{\mathrm{MLE}}(\theta) = -\log p_\theta\left(y^i \mid x^i\right) \tag{1}
\]
Here $x^i$ and $y^i = \left(y^i_1, \ldots, y^i_T\right)$ denote the input and the target sequence of the $i$-th example, respectively.
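As a concrete illustration of Eq. (1), a small numeric sketch; the token probabilities below are made-up example values:

```python
import numpy as np

def mle_loss(token_log_probs):
    # Eq. (1): L^i_MLE = -log p_theta(y^i | x^i), i.e. the negative sum of
    # per-token log-probabilities log p_theta(y^i_t | x^i, y^i_{1:t-1}).
    return -np.sum(token_log_probs)

# Example: a 3-token target whose tokens the model assigns these probabilities.
print(mle_loss(np.log([0.7, 0.5, 0.9])))  # ~1.155
```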
9. SPG
Limitations of Existing Sequence Learning Regimes
Reward Augmented Maximum Likelihood (RAML)
Under MLE, all alternative outputs are equally penalized through normalization, regardless of their relationship to the ground-truth target. Reward Augmented Maximum Likelihood (RAML) [4] generalizes MLE by weighting each output with a reward-based distribution, and the resulting sum can be approximated with Monte Carlo integration:
\[
\mathcal{L}^i_{\mathrm{RAML}}(\theta) = -\sum_{z^i} r_R\left(z^i \mid y^i\right) \log p_\theta\left(z^i \mid x^i\right) \approx -\frac{1}{J}\sum_{j=1}^{J} \log p_\theta\left(z^{ij} \mid x^i\right) \tag{2}
\]
where $r_R\left(z^i \mid y^i\right) = \dfrac{\exp\left(R(z^i \mid y^i)/\tau\right)}{\sum_{z^i}\exp\left(R(z^i \mid y^i)/\tau\right)}$.
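A small sketch of the exponentiated-reward distribution $r_R$ and the Monte Carlo draws $z^{ij}$ it feeds; the candidate rewards and the temperature value are assumed for illustration:

```python
import numpy as np

def raml_weights(rewards, tau=0.9):
    # r_R(z | y) from Eq. (2): exponentiated rewards, normalized over the
    # candidate outputs, with temperature tau controlling the concentration.
    scaled = np.asarray(rewards, dtype=float) / tau
    scaled -= scaled.max()                     # shift for numerical stability
    w = np.exp(scaled)
    return w / w.sum()

rewards = [1.0, 0.6, 0.1]                      # e.g. sentence-level reward values
probs = raml_weights(rewards)
samples = np.random.choice(len(rewards), size=5, p=probs)  # Monte Carlo draws z^{ij}
```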
10. SPG
Limitations of Existing Sequence Learning Regimes
Challenge
However, there is a large discrepancy between the model prediction distribution $p_\theta(z^i \mid x^i)$ and the values of the reward $R(z^i \mid y^i)$, which is especially acute during the early training stages.
This makes training inefficient from a speed-of-convergence perspective and unsatisfactory from a theoretical and modeling perspective.
Both issues are addressed by the value function described next.
12. SPG
Softmax Policy Gradient (SPG) Method
Softmax value function
Original value function:
\[
\mathcal{L}^i_{\mathrm{PG}}(\theta) = -V^i_{\mathrm{PG}}(\theta), \qquad V^i_{\mathrm{PG}}(\theta) = \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[R\left(z^i \mid y^i\right)\right] \tag{3}
\]
Softmax value function:
\[
\mathcal{L}^i_{\mathrm{SPG}}(\theta) = -V^i_{\mathrm{SPG}}(\theta), \qquad V^i_{\mathrm{SPG}}(\theta) = \log \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[\exp\left(R\left(z^i \mid y^i\right)\right)\right] \tag{4}
\]
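To make the difference concrete, a toy sketch comparing the two value functions on an enumerable set of candidate sequences; the probabilities and rewards are assumed example values:

```python
import numpy as np

def pg_value(probs, rewards):
    # Eq. (3): V_PG = E_{p_theta}[R(z | y)], a plain expectation.
    return float(np.dot(probs, rewards))

def spg_value(probs, rewards):
    # Eq. (4): V_SPG = log E_{p_theta}[exp(R(z | y))], a softmax (log-sum-exp) value.
    m = np.max(rewards)                        # shift for numerical stability
    return float(m + np.log(np.dot(probs, np.exp(np.asarray(rewards) - m))))

probs   = np.array([0.2, 0.5, 0.3])            # toy model distribution over 3 candidates
rewards = np.array([1.0, 0.3, 0.0])            # their rewards against the ground truth
print(pg_value(probs, rewards), spg_value(probs, rewards))  # 0.35 vs ~0.42
```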
13. SPG
Softmax Policy Gradient (SPG) Method
Inference
Reward function
Reward increments:
\[
\Delta r^i_t\left(z^i_t \mid y^i, z^i_{1:t-1}\right) := R\left(z^i_{1:t} \mid y^i\right) - R\left(z^i_{1:t-1} \mid y^i\right) \tag{5}
\]
Additional reward functions:
\[
\mathrm{DUP}^i_t = \begin{cases} -1 & \text{if } z^i_t = z^i_{t-1} \\ 0 & \text{otherwise} \end{cases} \tag{6}
\]
\[
\mathrm{EOS}^i_t = \begin{cases} -1 & \text{if } z^i_t = \text{</S>} \text{ and } t < |y^i| \\ 0 & \text{otherwise} \end{cases} \tag{7}
\]
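A sketch of the per-token reward bookkeeping in Eqs. (5)-(7); the condition in `eos_penalty` follows the "end-of-sequence token emitted before the target length" reading reconstructed above:

```python
def reward_increment(R, z_prefix, y):
    # Eq. (5): the gain in reward from appending the latest token z_t, where
    # R(prefix, y) scores a partial sequence against the target y.
    return R(z_prefix, y) - R(z_prefix[:-1], y)

def dup_penalty(z, t):
    # Eq. (6): -1 if token t repeats the previous token, 0 otherwise.
    return -1 if t > 0 and z[t] == z[t - 1] else 0

def eos_penalty(z, t, y, eos="</S>"):
    # Eq. (7): -1 if the end-of-sequence token is emitted before the
    # target length is reached, 0 otherwise (assumed reading).
    return -1 if z[t] == eos and t < len(y) else 0
```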
14. SPG
Softmax Policy Gradient (SPG) Method
Bang-bang Rewarded SPG Method
To minimize the effort of fine-tuning the reward weights, the paper proposes a bang-bang rewarded softmax value function:
\[
\mathcal{L}^i_{\mathrm{BBSPG}}(\theta) = -\sum_{w^i} p\left(w^i\right) \log \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[\exp\left(R\left(z^i \mid y^i, w^i\right)\right)\right] \tag{8}
\]
\[
\frac{\partial}{\partial\theta}\bar{\mathcal{L}}^i_{\mathrm{BBSPG}}(\theta) = -\sum_{w^i} p\left(w^i\right) \underbrace{\sum_{z^i} \bar{q}_\theta\left(z^i \mid x^i, y^i, w^i\right) \frac{\partial}{\partial\theta}\log p_\theta\left(z^i \mid x^i\right)}_{\triangleq\, -\frac{\partial}{\partial\theta}\bar{\mathcal{L}}^i_{\mathrm{SPG}}(\theta \mid w^i)} \tag{9}
\]
Figure: An example of sequence generation.
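A sketch of drawing the bang-bang weights that Eq. (8) averages over; treating each $w^i_t$ as 0 with probability $p_{\mathrm{drop}}$ and a fixed value $W$ otherwise is an assumption consistent with the sampling description on the next slide, and $W = 1.0$, $p_{\mathrm{drop}} = 0.5$ are assumed example settings:

```python
import numpy as np

def sample_bangbang_weights(T, p_drop=0.5, W=1.0, rng=None):
    # Each per-step weight w_t is 0 with probability p_drop (token drawn from
    # the model itself) and W otherwise (token drawn from the reward-augmented
    # distribution).
    rng = np.random.default_rng() if rng is None else rng
    return np.where(rng.random(T) < p_drop, 0.0, W)

print(sample_bangbang_weights(6))  # e.g. [1. 0. 1. 1. 0. 1.]
```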
15. SPG
Softmax Policy Gradient (SPG) Method
Bang-bang Rewarded SPG Method
Main Loss
Using Monte Carlo integration, we approximate Eq. (9) by first drawing $w^{ij}$ from $p\left(w^i\right)$ and then iteratively drawing $z^{ij}_t$ from $\bar{q}_\theta\left(z^i_t \mid x^i, z^i_{1:t-1}, y^i, w^{ij}_t\right)$ for $t = 1, \ldots, T$.
For larger values of $p_{\mathrm{drop}}$, the $w^{ij}$ sample contains more entries with $w^{ij}_t = 0$, and the resulting $z^{ij}$ contains proportionally more samples from the model prediction distribution. After $z^{ij}$ is obtained, only the log-likelihoods of the $z^{ij}_t$ with $w^{ij}_t \neq 0$ are included in the loss:
\[
\frac{\partial}{\partial\theta}\bar{\mathcal{L}}^i_{\mathrm{BBSPG}}(\theta) \approx -\frac{1}{J}\sum_{j=1}^{J} \sum_{\left\{t :\, w^{ij}_t \neq 0\right\}} \frac{\partial}{\partial\theta}\log p_\theta\left(z^{ij}_t \mid x^i, z^{ij}_{1:t-1}\right) \tag{10}
\]
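A minimal sketch of Eq. (10) for a single drawn sequence $z^{ij}$; the log-probabilities and weights below are made-up example values:

```python
import numpy as np

def bbspg_token_loss(token_log_probs, w):
    # Eq. (10), one Monte Carlo sample: only tokens whose bang-bang weight
    # w_t is nonzero contribute their (negative) log-likelihood to the loss.
    mask = np.asarray(w) != 0
    return -np.sum(np.asarray(token_log_probs)[mask])

log_p = np.log([0.6, 0.4, 0.8, 0.5])   # per-token log p_theta(z_t | x, z_{1:t-1})
w     = [1.0, 0.0, 1.0, 0.0]           # bang-bang weights drawn for this sample
print(bbspg_token_loss(log_p, w))      # ~0.734
```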
18. SPG
Conclusion
Based on a softmax value function, this policy-gradient approach eliminates the need for warm-start training and for sample variance reduction during policy updates.
The proposed method achieves superior performance on text-to-text (automatic summarization) and image-to-text (automatic image captioning) tasks.
19. SPG
Conclusion
References I
[1] Peter Anderson et al. "SPICE: Semantic Propositional Image Caption Evaluation". In: Computer Vision - ECCV 2016, 14th European Conference, Proceedings, Part V. Ed. by Bastian Leibe et al. Lecture Notes in Computer Science. Springer, 2016, pp. 382-398. isbn: 9783319464534. doi: 10.1007/978-3-319-46454-1_24.
20. SPG
Conclusion
References II
[2] Nan Ding and Radu Soricut. "Cold-Start Reinforcement Learning with Softmax Policy Gradient". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 2814-2823. isbn: 9781510860964.
[3] Chin-Yew Lin and Franz Josef Och. "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics". In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. ACL '04. Barcelona, Spain: Association for Computational Linguistics, 2004, 605-es. doi: 10.3115/1218955.1219032. url: https://doi.org/10.3115/1218955.1219032.
21. SPG
Conclusion
References III
[4] Mohammad Norouzi et al. Reward Augmented Maximum
Likelihood for Neural Structured Prediction. 2017. arXiv:
1609.00150 [cs.LG].
[5] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. "CIDEr: Consensus-based image description evaluation". In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 4566-4575. doi: 10.1109/CVPR.2015.7299087.