Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
Ruibo Liu, Chenyan Jia, Ge Zhang et al.
Speaker: Po-Chuan Chen, National Yang Ming Chiao Tung University, Hsinchu
December 6, 2022
Table of contents
1 Abstract
2 Introduction
3 Previous work
4 Approach
   Problem Statement of Re-alignment
   Augmented Edits Modeling
   Refinement by Reinforcement Learning
5 Experiments
6 Limitations and Discussion
7 Conclusion
Abstract
This paper presents SECOND THOUGHTS, a new learning paradigm that enables language models (LMs) to re-align with human values. It models the chain-of-edits between value-unaligned and value-aligned text, first fine-tuning an LM on the edits and then refining it with reinforcement learning.
The generated editing steps also offer better interpretability and ease of interactive error correction.
Motivation
Current large-scale pre-trained language models (LMs) have shown great success in many knowledge-recalling tasks.
However, their ability to tell socially good text from bad (or to generate prosocial text) in open-world settings is still limited, even when the models are scaled up to hundreds of billions of parameters.
Figure: fine-tuned language models (LMs) for generating text (figure omitted).
Contribution
Presents a new learning paradigm that makes current LMs aware of human value alignment.
Trained with SECOND THOUGHTS, LMs can not only re-align their generation with human values, even when the context has already been poisoned, but also show the chain of editing steps for ease of interpretability and to facilitate further edits.
The experiments confirm that simply scaling up LMs is not adequate for good alignment with human values, which echoes the findings of recent studies.
Previous work
This section reviews existing work that uses in-context explanations during prompting or training, and summarizes other value-alignment methods for language models:
Learning From In-Context Instructions
Human Value Alignment for Language Models
Learning From In-Context Instructions
The few-shot performance of LMs can be enhanced by learning from in-context instructions, e.g., task descriptions and answer demonstrations.
However, such instructions normally require careful human design, which is costly, and their quality greatly affects performance.
Human Value Alignment for Language Models
Trained on unfiltered and problematic language from the web, current large-scale LMs have been shown to be poorly aligned with human values.
Existing general-purpose remedies include filtering the training data, attribute-controlled generation, and modifying the decoding algorithm with hard or soft constraints.
However, the experiments show that these remedies perform poorly once the context has already been poisoned, and fixes for this situation are costly and not always available in every value-alignment dataset.
SECOND THOUGHTS
SECOND THOUGHTS comprises two main steps:
1 Infer chains-of-edits automatically from source and target responses with a dynamic programming algorithm, and fine-tune an LM on the edits-augmented training data.
2 Deploy a reinforcement learning stage to refine the generation, using either adversarial imitation learning or value modeling.
Figure: comparison (figure omitted).
Problem Statement of Re-alignment
Value-alignment datasets normally consist of contexts, value-aligned responses, and value-unaligned responses.
Existing alignment methods formulate the value-alignment task as a conditional generation problem: given a situation as the context, train a model that generates responses resembling the value-aligned target rather than the unaligned wrong one.
However, many studies have shown that LMs trained with such a paradigm can be easily derailed by poisoned contexts, i.e., contexts that already include value-unaligned content, either from the model's own generation or from malicious users.
To teach a model how to re-align, they deliberately append the value-unaligned response to the context, referred to as the source, and keep the value-aligned response as the target, as sketched below.
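To make this setup concrete, here is a minimal sketch of forming one re-alignment training instance. The function name and the plain-string concatenation format are illustrative assumptions, not the paper's exact preprocessing.

```python
# Hypothetical sketch: the value-unaligned response is deliberately
# appended to the context (together forming the "source"), while the
# value-aligned response stays the "target".
def make_realignment_example(context: str, unaligned: str, aligned: str):
    source = f"{context} {unaligned}"  # deliberately poisoned context
    target = aligned                   # value-aligned response to recover
    return source, target

# Example usage:
# src, tgt = make_realignment_example(
#     "Mike finds a wallet on the street.",
#     "He keeps the cash for himself.",
#     "He returns the wallet to its owner.",
# )
```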
Augmented Edits Modeling
1 DP-based Edits Inference
2 Augmented Edits Modeling (AEM)
DP-based Edits Inference
Given two text strings, source and target, one can find unlimited ways to edit source to produce target. They therefore apply two constraints on the editing:
1 The edits should be combinations of generic editing operations: inserting, deleting, or replacing a single token.
2 Each edit operation has a cost, and the goal is to infer the chain-of-edits with minimum total cost.
Under these constraints, the edits-inference problem becomes a token-level "edit distance problem", which can be solved by dynamic programming (DP), as in the sketch below.
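A minimal sketch of the DP inference: token-level Levenshtein distance with backtracking to recover one minimum-cost chain-of-edits. The operation costs are hypothetical parameters; the paper's exact cost settings are not specified here.

```python
from typing import List, Tuple

def infer_chain_of_edits(
    source: List[str],
    target: List[str],
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    rep_cost: float = 1.0,
) -> Tuple[float, List[Tuple[str, int, str]]]:
    """Return (min_cost, edits); each edit is (op, source_index, token)."""
    m, n = len(source), len(target)
    # dp[i][j] = minimum cost to turn source[:i] into target[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_cost
    for j in range(1, n + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # tokens match: no edit needed
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + del_cost,      # delete a source token
                    dp[i][j - 1] + ins_cost,      # insert a target token
                    dp[i - 1][j - 1] + rep_cost,  # replace a single token
                )
    # Backtrack to recover one minimum-cost chain-of-edits.
    edits, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + rep_cost:
            edits.append(("replace", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            edits.append(("delete", i - 1, source[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, target[j - 1]))
            j -= 1
    return dp[m][n], list(reversed(edits))
```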
Augmented Edits Modeling (AEM)
To augment the edits, they run the DP algorithm on the same source-target pairs with a variety of editing costs, creating a collection of chains-of-edits for each pair, which they call positive demonstrations (y+). They then fine-tune an LM on these source-edits-target text inputs.
The paper also constructs negative demonstrations (y−) by using targets from other contexts, so the inferred chains-of-edits produce value-aligned responses that are incoherent with the given context. A sketch of this data construction follows.
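A hypothetical sketch of the AEM data construction, reusing infer_chain_of_edits from the previous block. The cost grid, the [SRC]/[EDITS]/[TGT] markers, and whitespace tokenization are all illustrative assumptions rather than the paper's exact settings.

```python
import random

def serialize(context, source, edits, target):
    # Linearize one training example: context, source, chain-of-edits, target.
    edit_str = " ; ".join(f"{op} @{i}: {tok}" for op, i, tok in edits)
    return f"{context} [SRC] {source} [EDITS] {edit_str} [TGT] {target}"

def build_demonstrations(examples, cost_grid=((1.0, 1.0, 1.0),
                                              (1.0, 1.0, 2.0),
                                              (2.0, 1.0, 1.0))):
    positives, negatives = [], []
    for idx, (context, source, target) in enumerate(examples):
        # Positive demonstrations (y+): same pair, varied editing costs,
        # yielding multiple chains-of-edits per source-target pair.
        for ins_c, del_c, rep_c in cost_grid:
            _, edits = infer_chain_of_edits(source.split(), target.split(),
                                            ins_c, del_c, rep_c)
            positives.append(serialize(context, source, edits, target))
        # Negative demonstration (y-): a value-aligned target borrowed from
        # another context, so the edited result is incoherent here.
        wrong = random.choice(
            [t for j, (_, _, t) in enumerate(examples) if j != idx])
        _, edits = infer_chain_of_edits(source.split(), wrong.split())
        negatives.append(serialize(context, source, edits, wrong))
    return positives, negatives
```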
Figure: augmented edits modeling (figure omitted).
Refinement by Reinforcement Learning
Based on manual examination, the generated responses tend to be generic, so they are motivated to deploy a reinforcement learning (RL) stage to further refine generation quality.
Notation
context and source: x
target generated by SECOND THOUGHTS: y
The state at time t is the set of generated tokens before t (i.e., s_t = y_<t).
The action is the current step's output token (i.e., a_t = y_t).
The softmax output of the language-modeling head is taken as the policy π_t for picking token y_t (action a_t), given the state s_t = y_<t, as in the snippet below.
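A minimal sketch of reading the LM head as an RL policy, assuming a Hugging Face-style causal LM; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def policy_at_step(model, input_ids):
    # The state s_t = y_<t is simply the tokens generated so far (input_ids).
    logits = model(input_ids).logits           # [batch, seq_len, vocab]
    pi_t = F.softmax(logits[:, -1, :], dim=-1) # policy over actions a_t = y_t
    return pi_t

# Sampling one action (the next token): a_t = torch.multinomial(pi_t, 1)
```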
Adversarial Imitation Learning (AIL)
They propose to leverage negative samples to penalize the LM for imitating the mismatched target, by training an adversarial LM only on the negative demonstrations y−, so that following its policy π_t^ADV leads to incoherent generations. The t-th step objective of AIL to be maximized is:

$$J_{\mathrm{AIL},t} = \mathbb{E}_{\tau \sim \pi_t^{*}} \Big[ \underbrace{-\log \pi_t^{\mathrm{ADV}}(a_t \mid s_t)}_{\text{unlikelihood}} + \alpha \underbrace{\log \pi_t^{*}(a_t \mid s_t)}_{\text{likelihood}} \Big] - \beta\, \mathrm{KL}\big(\pi_t \,\|\, \pi_t^{*}\big)$$

where π*_t is the desired refinement policy, α is the balancing factor, and the KL penalty term KL(π_t ∥ π*_t) with coefficient β is the trust-region constraint, which prevents the updated policy from drifting too far away from the original one. A simplified loss sketch follows.
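A minimal PyTorch sketch of the per-step AIL update, assuming the batch of states and actions was sampled under the refinement policy. Tensor names are illustrative: policy_logits come from the policy being refined (π*_t), adv_logits from the adversarial LM, ref_logits from the frozen original policy. A REINFORCE-style surrogate is used so the unlikelihood term, which is constant with respect to the policy parameters, still yields a gradient; the paper's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def ail_step_loss(policy_logits, adv_logits, ref_logits, actions,
                  alpha=1.0, beta=0.1):
    log_pi = F.log_softmax(policy_logits, dim=-1)            # pi*_t
    log_pi_adv = F.log_softmax(adv_logits, dim=-1).detach()  # pi_ADV (frozen)
    log_pi_ref = F.log_softmax(ref_logits, dim=-1).detach()  # pi_t (frozen)

    taken = actions.unsqueeze(-1)
    logp_taken = log_pi.gather(-1, taken).squeeze(-1)
    # Unlikelihood: how strongly the adversary disprefers the sampled token.
    unlikelihood = -log_pi_adv.gather(-1, taken).squeeze(-1)

    # Trust-region penalty KL(pi_t || pi*_t) against the original policy.
    kl = (log_pi_ref.exp() * (log_pi_ref - log_pi)).sum(dim=-1)

    # REINFORCE-style surrogate: unlikelihood enters as a reward weighting
    # the policy log-prob, alongside the direct likelihood term.
    objective = unlikelihood * logp_taken + alpha * logp_taken - beta * kl
    return -objective.mean()  # negate: maximize objective via gradient descent
```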
The intuition behind this design is to maximize the unlikelihood of forming a trajectory τ = {s_1, a_1, ..., s_t, a_t} that could be induced by the adversarial policy π^ADV, weighted against the balancing likelihood term.
After refinement, the learned policy π*_t generates tokens unlike those that would be produced by π^ADV, forming sequences more coherent with the context.
Value Modeling (VM)
The other refinement method directly learns a value function, by training a binary LM-based classifier f on the mixture of positive and negative demonstrations.
The sigmoid of the log-likelihood predicted by f is taken as the reward r, i.e., r = σ(log f(x, y)). The t-th step reward is adjusted by an importance-sampling ratio between the current and original policy for off-policy stability, and an entropy bonus term encourages more exploration of the current policy:

$$J_{\mathrm{VM},t} = \mathbb{E}_{\tau \sim \pi_t} \Big[ \frac{\pi_t^{*}(a_t \mid s_t)}{\pi_t(a_t \mid s_t)} \cdot r_t \Big] + \lambda\, \mathcal{H}\big(\pi_t^{*}(\cdot \mid s_t)\big)$$
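A minimal PyTorch sketch of the per-step VM update, with illustrative names: policy_logits from the current policy π*_t, old_logits from the frozen original policy π_t (under which the batch was sampled), and clf_loglik standing in for log f(x, y) from the binary classifier. The paper's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

def vm_step_loss(policy_logits, old_logits, actions, clf_loglik, lam=0.01):
    log_pi_new = F.log_softmax(policy_logits, dim=-1)        # pi*_t
    log_pi_old = F.log_softmax(old_logits, dim=-1).detach()  # pi_t (frozen)

    taken = actions.unsqueeze(-1)
    # Importance-sampling ratio pi*_t(a|s) / pi_t(a|s) for off-policy stability.
    ratio = (log_pi_new.gather(-1, taken)
             - log_pi_old.gather(-1, taken)).exp().squeeze(-1)

    reward = torch.sigmoid(clf_loglik).detach()  # r = sigmoid(log f(x, y))

    # Entropy bonus on the current policy encourages exploration.
    entropy = -(log_pi_new.exp() * log_pi_new).sum(dim=-1)

    objective = ratio * reward + lam * entropy
    return -objective.mean()  # negate: maximize via gradient descent
```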
SECOND THOUGHTS on three benchmark datasets
1 Moral Stories. The Moral Stories dataset (N = 20,000) examines whether LMs can generate moral responses in diverse social situations.
2 MIC. The MIC dataset (N = 38,000) studies whether chatbots can generate utterances that are aligned with a set of "Rules of Thumb (RoT)" of morality.
3 ETHICS-Deontology. The ETHICS dataset (N = 25,356) investigates the performance of LMs on five human-value alignment tasks.
The paper also considers two smaller-scale human-value alignment datasets, HHH (Helpful, Honest, & Harmless) (N = 178) and TruthfulQA (N = 299), to evaluate the domain-transfer ability.
Main Results on the Performance of Value Alignment
Two dimensions are measured:
Alignment, by asking "To what extent does the edited response improve the original response in terms of alignment with human values?" Answers range from 1 (not at all) to 7 (to an extreme extent).
Coherence, by asking "How coherent is the edited response with the given context?" Answers range from 1 (not at all) to 7 (extremely coherent).
Value Transfer Learning with Limited Human-Labeled Data
Transfer-learning ability of SECOND THOUGHTS from seen human values (i.e., trained on MRL, MIC, ETC) to unseen values (i.e., tested on TQA, HHH).
Error Analysis and Human-Guided Correction
SECOND THOUGHTS enables higher-quality human-guided corrections, in terms of alignment and coherence scores.
Configuration for the Best-Performing SECOND THOUGHTS
Hyperparameter search on the balancing factor α and the entropy factor λ in the Moral Stories task for the best-performing SECOND THOUGHTS.
Limitations and Discussion
SECOND THOUGHTS can be limited by the LM it is built on; for instance, the total length of the chain-of-edits is bounded by the maximum sequence length allowed by the LM.
Conclusion
This paper has proposed SECOND THOUGHTS, a novel learning paradigm that enables LMs to re-align with human values when given a poisoned context.
In addition, the chain-of-edits modeling by SECOND THOUGHTS enables easy error diagnosis and human-guided correction, which they believe to be an essential ability for human-AI interactive systems.