Cold-Start Reinforcement Learning with Softmax Policy Gradient
NeurIPS, 2017
Chih-Chun Chen, Pin-Yen Liu, Po-Chuan Chen
June 8, 2023
Table of contents
1 Abstract
2 Introduction
3 Limitations of Existing Sequence Learning Regimes
4 Softmax Policy Gradient (SPG) Method
5 Conclusion
Abstract

When we use policy gradient in reinforcement learning, two issues arise:
1 warm-start training
2 sample variance reduction
Cold-Start Reinforcement Learning with Softmax Policy Gradient [2] overcomes both of them, based on a softmax value function. The method can be used to train sequence generation models for structured output prediction problems.
Introduction

The procedure for applying the policy-gradient method to sequence generation is as follows:
The model proposes a sequence.
A reward is computed for the proposed sequence based on the ground-truth target, using metrics such as ROUGE [3] for summarization and CIDEr [5] or SPICE [1] for image captioning.
The weighted average of the log-likelihood of the proposed sequences is optimized, with the reward as weight (a minimal sketch follows below).
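To make this recipe concrete, here is a minimal, self-contained sketch of the reward-weighted log-likelihood objective. Everything in it is a toy stand-in, not the paper's implementation: `probs` plays the role of the model's per-step distributions, and `reward` is a simple matching score rather than ROUGE/CIDEr/SPICE.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 5, 4

# Stand-in for the model's per-step output distributions p_theta(. | x, z_{1:t-1}).
probs = rng.dirichlet(np.ones(vocab_size), size=seq_len)
y = [0, 1, 2, 3]  # ground-truth target sequence

def sample_sequence():
    # Step 1: the model proposes a sequence by sampling one token per step.
    return [int(rng.choice(vocab_size, p=probs[t])) for t in range(seq_len)]

def reward(z):
    # Step 2: reward against the ground truth (toy: fraction of matching positions).
    return sum(a == b for a, b in zip(z, y)) / len(y)

def log_likelihood(z):
    return sum(np.log(probs[t][z[t]]) for t in range(seq_len))

# Step 3: weighted average of log-likelihoods of proposed sequences, reward as weight.
samples = [sample_sequence() for _ in range(32)]
objective = np.mean([reward(z) * log_likelihood(z) for z in samples])
print(objective)
```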
Limitations of Existing Sequence Learning Regimes

A standard method for sequence-learning training is Maximum-Likelihood Estimation (MLE). Given a set of inputs $X = \{x^i\}$ and target sequences $Y = \{y^i\}$, the MLE loss function is:

$$L_{\mathrm{MLE}}(\theta) = \sum_i L^i_{\mathrm{MLE}}(\theta), \quad \text{where} \quad L^i_{\mathrm{MLE}}(\theta) = -\log p_\theta\left(y^i \mid x^i\right) \qquad (1)$$

Here $x^i$ and $y^i = (y^i_1, \ldots, y^i_T)$ denote the input and the target sequence of the $i$-th example, respectively.
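As a sanity check on Eq. (1), here is a minimal sketch for a single example, assuming the usual autoregressive factorization $\log p_\theta(y \mid x) = \sum_t \log p_\theta(y_t \mid x, y_{1:t-1})$; the per-step distributions are random placeholders for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5
y = [0, 1, 2, 3]  # target sequence y^i

# Placeholder for p_theta(. | x^i, y^i_{1:t-1}) at each step t, shape (T, V).
step_probs = rng.dirichlet(np.ones(vocab_size), size=len(y))

# L^i_MLE(theta) = -log p_theta(y^i | x^i) = -sum_t log p_theta(y^i_t | ...).
loss_mle = -sum(np.log(step_probs[t][y[t]]) for t in range(len(y)))
print(loss_mle)
```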
Reward Augmented Maximum Likelihood (RAML)

A generalization of MLE is Reward Augmented Maximum Likelihood (RAML) [4]. Under MLE, all alternative outputs are equally penalized through normalization, regardless of their relationship to the ground-truth target; RAML instead weights each output by its exponentiated reward, and the sum is approximated with Monte Carlo integration:

$$L^i_{\mathrm{RAML}}(\theta) = -\sum_{z^i} r_R\left(z^i \mid y^i\right) \log p_\theta\left(z^i \mid x^i\right) \simeq -\frac{1}{J} \sum_{j=1}^{J} \log p_\theta\left(z^{ij} \mid x^i\right) \qquad (2)$$

where $r_R\left(z^i \mid y^i\right) = \dfrac{\exp\left(R(z^i \mid y^i)/\tau\right)}{\sum_{z^i} \exp\left(R(z^i \mid y^i)/\tau\right)}$.
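To see what the payoff distribution $r_R$ does, here is a tiny numeric sketch over an enumerable candidate set; the rewards and temperatures are toy values, not from the paper.

```python
import numpy as np

def raml_weights(rewards, tau):
    # r_R(z | y) = exp(R(z|y)/tau) / sum_z' exp(R(z'|y)/tau)
    scaled = np.asarray(rewards) / tau
    scaled -= scaled.max()  # subtract the max for numerical stability
    w = np.exp(scaled)
    return w / w.sum()

rewards = [1.0, 0.6, 0.1]  # R(z|y) for three candidate outputs
for tau in (0.1, 1.0, 10.0):
    print(tau, raml_weights(rewards, tau))
# Small tau concentrates the mass on the highest-reward (ground-truth-like)
# output, recovering MLE; large tau spreads it toward uniform.
```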
Challenge

However, there is a large discrepancy between the model prediction distribution $p_\theta(z^i \mid x^i)$ and the values of the reward $R(z^i \mid y^i)$, which is especially acute during the early training stages.

This causes problems: training is inefficient from a speed-to-convergence perspective and unsatisfactory from a theoretical and modeling perspective.

Both of these issues are addressed by the value function described next.
Softmax Policy Gradient (SPG) Method

Softmax value function

Original value function:

$$L^i_{\mathrm{PG}}(\theta) = -V^i_{\mathrm{PG}}(\theta), \quad V^i_{\mathrm{PG}}(\theta) = \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[ R\left(z^i \mid y^i\right) \right] \qquad (3)$$

Softmax value function:

$$L^i_{\mathrm{SPG}}(\theta) = -V^i_{\mathrm{SPG}}(\theta), \quad V^i_{\mathrm{SPG}}(\theta) = \log\left( \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[ \exp\left( R\left(z^i \mid y^i\right) \right) \right] \right) \qquad (4)$$
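A tiny numeric comparison of Eqs. (3) and (4), with a toy distribution and toy rewards, shows how the softmax value differs from the plain expectation.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # toy p_theta(z | x) over three outputs
R = np.array([0.1, 0.5, 1.0])  # toy rewards R(z | y)

v_pg = np.sum(p * R)                    # Eq. (3): V_PG = E[R]
v_spg = np.log(np.sum(p * np.exp(R)))   # Eq. (4): V_SPG = log E[exp(R)]
print(v_pg, v_spg)
# By Jensen's inequality V_SPG >= V_PG: the softmax value puts extra emphasis
# on high-reward outputs even when the model assigns them low probability.
```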
Inference

Reward function

Reward increments:

$$\Delta r^i_t\left(z^i_t \mid y^i, z^i_{1:t-1}\right) := R\left(z^i_{1:t} \mid y^i\right) - R\left(z^i_{1:t-1} \mid y^i\right) \qquad (5)$$

Additional reward functions:

$$\mathrm{DUP}^i_t = \begin{cases} -1 & \text{if } z^i_t = z^i_{t-1} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

$$\mathrm{EOS}^i_t = \begin{cases} -1 & \text{if } z^i_t = \text{</S> and } t < |y^i| \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
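Below is a minimal sketch of Eqs. (5)–(7) for one proposed sequence. The prefix reward `R` is a toy matching count (a real system would use an incremental ROUGE/CIDEr-style score), and the exact early-EOS condition is an assumption reconstructed from the garbled slide.

```python
EOS = "</S>"

def R(prefix, y):
    # Toy prefix reward: number of positions matching the ground truth so far.
    return sum(a == b for a, b in zip(prefix, y))

def delta_r(z, y, t):
    # Eq. (5): reward increment contributed by token z[t] given prefix z[:t].
    inc = R(z[:t + 1], y) - R(z[:t], y)
    if t > 0 and z[t] == z[t - 1]:
        inc += -1                      # Eq. (6): duplicate-token penalty
    if z[t] == EOS and t < len(y) - 1:
        inc += -1                      # Eq. (7): premature end-of-sequence penalty
    return inc

y = ["a", "b", "c", EOS]
z = ["a", "a", EOS, EOS]
print([delta_r(z, y, t) for t in range(len(z))])  # -> [1, -1, -1, 0]
```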
Bang-bang Rewarded SPG Method

To minimize the effort of fine-tuning the reward weights, we propose a bang-bang rewarded softmax value function:

$$L^i_{\mathrm{BBSPG}}(\theta) = -\sum_{w^i} p\left(w^i\right) \log\left( \mathbb{E}_{p_\theta(z^i \mid x^i)}\left[ \exp\left( R\left(z^i \mid y^i, w^i\right) \right) \right] \right) \qquad (8)$$

$$\frac{\partial}{\partial\theta} \tilde{L}^i_{\mathrm{BBSPG}}(\theta) = -\sum_{w^i} p\left(w^i\right) \underbrace{\sum_{z^i} \tilde{q}_\theta\left(z^i \mid x^i, y^i, w^i\right) \frac{\partial}{\partial\theta} \log p_\theta\left(z^i \mid x^i\right)}_{\triangleq -\frac{\partial}{\partial\theta} \tilde{L}^i_{\mathrm{SPG}}(\theta \mid w^i)} \qquad (9)$$

Figure: An example of sequence generation.
Main Loss

Using Monte Carlo integration, we approximate Eq. (9) by first drawing $w^{ij}$ from $p(w^i)$ and then iteratively drawing $z^{ij}_t$ from $\tilde{q}_\theta\left(z^i_t \mid x^i, z^i_{1:t-1}, y^i, w^{ij}_t\right)$ for $t = 1, \ldots, T$.

For larger values of $p_{\mathrm{drop}}$, the $w^{ij}$ sample contains more $w^{ij}_t = 0$, and the resulting $z^{ij}$ contains proportionally more samples from the model prediction distribution. After $z^{ij}$ is obtained, only the log-likelihoods of the $z^{ij}_t$ with $w^{ij}_t \neq 0$ are included in the loss:

$$\frac{\partial}{\partial\theta} \tilde{L}^i_{\mathrm{BBSPG}}(\theta) \simeq -\frac{1}{J} \sum_{j=1}^{J} \sum_{\left\{t : w^{ij}_t \neq 0\right\}} \frac{\partial}{\partial\theta} \log p_\theta\left(z^{ij}_t \mid x^i, z^{ij}_{1:t-1}\right). \qquad (10)$$
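Here is a minimal runnable sketch of this Monte-Carlo scheme, computing the masked negative log-likelihood whose gradient is Eq. (10). Two loud assumptions: $w_t$ is drawn i.i.d. (0 with probability $p_{\mathrm{drop}}$, a constant weight otherwise), and $\tilde{q}_\theta$ is crudely approximated by taking the target token wherever $w_t \neq 0$; all model pieces are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, p_drop, J = 5, 0.5, 16
y = [0, 1, 2, 3]

# Placeholder for the model's per-step distributions p_theta(. | x, z_{1:t-1}).
step_probs = rng.dirichlet(np.ones(vocab_size), size=len(y))

def masked_nll():
    w = np.where(rng.random(len(y)) < p_drop, 0.0, 1.0)  # draw w^{ij} ~ p(w)
    z, nll = [], 0.0
    for t in range(len(y)):
        if w[t] != 0.0:
            z_t = y[t]  # crude stand-in for a draw from q~_theta
            nll -= np.log(step_probs[t][z_t])  # only these steps enter Eq. (10)
        else:
            # Sampled from the model prediction distribution; excluded from the loss.
            z_t = int(rng.choice(vocab_size, p=step_probs[t]))
        z.append(z_t)  # in a real model, z_{1:t} would condition the next step
    return nll

loss = np.mean([masked_nll() for _ in range(J)])  # average over J samples
print(loss)
```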
Algorithm

[Algorithm figure: the slide showed the training algorithm as a figure, which did not survive extraction.]
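In place of the missing figure, here is a hedged toy reconstruction of what a BBSPG-style training loop over Eqs. (8)–(10) could look like. It uses a tabular softmax "model" (one logit row per step) rather than a real sequence-to-sequence network, and the same crude $\tilde{q}_\theta$ stand-in as above; it is a sketch of the idea, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
V, y, p_drop, J, lr = 5, [0, 1, 2, 3], 0.5, 8, 0.5
logits = np.zeros((len(y), V))  # the "parameters" theta of the toy model

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for step in range(50):
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(J):  # J Monte-Carlo samples of the bang-bang weights
        w = np.where(rng.random(len(y)) < p_drop, 0.0, 1.0)
        for t in range(len(y)):
            if w[t] != 0.0:  # only these steps enter Eq. (10)
                z_t = y[t]   # crude stand-in for a draw from q~_theta
                # Gradient of -log softmax(logits[t])[z_t] w.r.t. logits[t].
                grad[t] += (probs[t] - np.eye(V)[z_t]) / J
    logits -= lr * grad  # one gradient step on the Eq. (10) estimate

print(softmax(logits).argmax(axis=1))  # converges to the target [0 1 2 3]
```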
Conclusion

Based on a softmax value function, this policy-gradient approach eliminates the need for warm-start training and for sample variance reduction during policy updates.

The proposed method achieves superior performance on text-to-text (automatic summarization) and image-to-text (automatic image captioning) tasks.
References

[1] Peter Anderson et al. "SPICE: Semantic Propositional Image Caption Evaluation". In: Computer Vision – ECCV 2016, Part V. Lecture Notes in Computer Science. Springer, 2016, pp. 382–398. isbn: 9783319464534. doi: 10.1007/978-3-319-46454-1_24.

[2] Nan Ding and Radu Soricut. "Cold-Start Reinforcement Learning with Softmax Policy Gradient". In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 2814–2823. isbn: 9781510860964.

[3] Chin-Yew Lin and Franz Josef Och. "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics". In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. ACL '04. Barcelona, Spain: Association for Computational Linguistics, 2004. doi: 10.3115/1218955.1219032.

[4] Mohammad Norouzi et al. Reward Augmented Maximum Likelihood for Neural Structured Prediction. 2017. arXiv: 1609.00150 [cs.LG].

[5] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. "CIDEr: Consensus-based Image Description Evaluation". In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 4566–4575. doi: 10.1109/CVPR.2015.7299087.