4. Quark
Abstract
Large language models may generate content that is misaligned with
users' expectations, for example toxic language, repeated content, and
other undesired responses.
This paper addresses this challenge with an algorithm for optimizing a
reward function that quantifies an (un)wanted property.
4 / 36
6. Quark
Introduction
Large language models, which are trained on vast amounts of web
text, have demonstrated remarkable proficiency across various tasks.
However, this extensive training can also lead to the manifestation of
undesirable behaviors in these models.
6 / 36
7. Quark
Introduction
Unlearning undesirable behaviors through supervised learning on a
curated corpus [5] poses challenges in data collection and risks
overfitting, potentially losing desirable traits.
An alternative approach builds a detector for undesirable behavior [7],
but adjusting the model based on such a detector is non-trivial: detectors
score full-text samples rather than offering token-level feedback, making
direct differentiation through them impractical.
7 / 36
8. Quark
Introduction
Sentence-level scalar feedback that dynamically adjusts learning fits
naturally into the reinforcement learning (RL) paradigm.
In NLP, RL is used to optimize scalar metrics as rewards. Nonetheless,
(deep) RL is sensitive to variance in the reward function [1], requiring
additional models and specialized heuristics for stable training, which
often doubles the number of learnable parameters.
8 / 36
9. Quark
Introduction
Contribution
This paper introduces Quantized Reward Konditioning (Quark), an
algorithm for reward-based (un)learning with language models. Quark
iterates three steps (a minimal sketch of the full loop follows this list):
1 Collect samples with the current language model.
2 Sort them into quantiles based on reward, identifying each quantile
by a reward token added to the language model's input.
3 Maximize the likelihood of samples from each reward quantile
conditioned on its reward token, while staying close to the original
language model via a KL-divergence penalty.
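As a rough illustration of how these three steps fit together, the following
Python sketch shows one possible Quark-style loop. It is not the authors'
implementation; sample_conditioned, reward_fn, quantize_pool, and
train_step are assumed helper names standing in for the operations above.

# Illustrative Quark-style training loop (not the paper's code).
# sample_conditioned, reward_fn, quantize_pool, and train_step are
# hypothetical helpers for the operations described on this slide.
def quark_loop(model, ref_model, prompts, reward_fn, K=5, iterations=10):
    # Seed the data pool with scored samples from the initial model (step 1).
    pool = [(x, y, reward_fn(x, y))
            for x in prompts
            for y in sample_conditioned(model, x, reward_token=None)]
    for _ in range(iterations):
        # Quantization: sort the pool by reward into K quantiles,
        # each tagged with a reward token r_1 ... r_K (step 2).
        quantiles = quantize_pool(pool, K)
        # Learning: maximize likelihood of each quantile given its reward
        # token, with a KL penalty toward the frozen ref_model (step 3).
        train_step(model, ref_model, quantiles)
        # Exploration: re-sample conditioned on the best reward token r_K
        # and add the newly scored samples to the pool (back to step 1).
        pool += [(x, y, reward_fn(x, y))
                 for x in prompts
                 for y in sample_conditioned(model, x, reward_token=K)]
    return model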
9 / 36
11. Quark
Quark: Quantized Reward Konditioning
Figure 1: Quantized Reward Konditioning (Quark) is an online, off-policy
reinforcement learning (RL) algorithm used to (un)learn properties from
language models via three iterative stages: exploration, quantization, and
learning.
11 / 36
12. Quark
Quark: Quantized Reward Konditioning
Quantized Reward Konditioning
The algorithm consists of three steps:
Exploration
Quantization
Learning
12 / 36
13. Quark
Quark: Quantized Reward Konditioning
Initialization
A pre-trained language model p0(y | x)
A set of training prompts X
A reward function r(x, y) → R
Sequences of tokens x and y drawn from a vocabulary V
Quark creates a data pool by sampling from p0 based on training
prompts and then evaluates the samples using the reward function:
D0 = {(x, y, r(x, y)) | y ∼ p0(· | x), ∀x ∈ X}
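As a concrete (assumed) example, the initial pool can be built by sampling
continuations from a frozen GPT-2 checkpoint via the HuggingFace
transformers API and scoring them with a user-supplied reward_fn; the
paper's actual implementation details may differ.

# Sketch of building the initial data pool D0 (assumes HuggingFace GPT-2
# and a reward_fn(x, y) -> float supplied by the user).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
p0 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def build_initial_pool(prompts, reward_fn, max_new_tokens=20):
    pool = []
    for x in prompts:
        inputs = tokenizer(x, return_tensors="pt")
        with torch.no_grad():
            out = p0.generate(**inputs, do_sample=True, top_p=0.9,
                              max_new_tokens=max_new_tokens,
                              pad_token_id=tokenizer.eos_token_id)
        # Keep only the newly generated continuation y.
        y = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
        pool.append((x, y, reward_fn(x, y)))
    return pool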
13 / 36
14. Quark
Quark: Quantized Reward Konditioning
Quantization
Quark quantizes data examples based on their rewards by dividing the
sorted pool into equally sized quantiles D1, . . . , DK.
Each quantile is associated with a reward token rk, where higher k
corresponds to higher rewards.
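One straightforward way to implement this is to sort the pooled samples by
reward and cut the sorted list into K equal slices; the helper below is a
minimal sketch of that idea (quantize_pool is an assumed name, not the
paper's code).

# Sketch of reward quantization: sort by reward, split into K equal quantiles.
def quantize_pool(pool, K):
    # pool: list of (x, y, reward); returns {k: samples}, k = 1 (lowest) .. K (highest).
    ranked = sorted(pool, key=lambda item: item[2])  # ascending reward
    n = len(ranked)
    quantiles = {}
    for k in range(1, K + 1):
        lo = (k - 1) * n // K
        hi = k * n // K
        quantiles[k] = ranked[lo:hi]  # quantile k is tagged with reward token r_k
    return quantiles

For example, with K = 5 and 1,000 pooled samples, each quantile holds
roughly 200 samples, and the 200 highest-reward samples receive the token r_5.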
14 / 36
15. Quark
Quark: Quantized Reward Konditioning
Learning
Quark maximizes likelihood on quantized data pool D while applying
a KL-penalty to maintain fidelity to the original model:
max_θ E_{k∼U(1,K)} E_{(x,y)∼Dk} [ log pθ(y | x, rk) − β Σ_{t=1}^{T} KL(p0(· | y<t, x) ∥ pθ(· | y<t, x, rk)) ],
where each KL term is Σ_{yt∈V} p0(yt) log [ p0(yt) / pθ(yt) ].
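In code, this objective is a token-level cross-entropy term plus a
per-position KL penalty between the frozen initial model and the
conditioned policy. The PyTorch sketch below shows one possible per-batch
loss; how the reward token rk is prepended and how the logits are obtained
are assumptions, not the paper's implementation.

# Sketch of the Quark learning loss for one batch (illustrative only).
# logits_theta: policy logits given (x, r_k, y_<t), shape [batch, seq_len, vocab]
# logits_p0:    frozen-model logits given (x, y_<t), same shape
# labels:       target token ids for y, shape [batch, seq_len]
import torch
import torch.nn.functional as F

def quark_loss(logits_theta, logits_p0, labels, beta):
    # Maximum-likelihood term: NLL of y under p_theta(. | x, r_k).
    nll = F.cross_entropy(logits_theta.transpose(1, 2), labels)
    # Per-position KL(p0(. | y_<t, x) || p_theta(. | y_<t, x, r_k)).
    log_p_theta = F.log_softmax(logits_theta, dim=-1)
    log_p0 = F.log_softmax(logits_p0, dim=-1)
    kl = (log_p0.exp() * (log_p0 - log_p_theta)).sum(dim=-1)  # [batch, seq_len]
    # Sum the KL penalty over positions t and average over the batch.
    return nll + beta * kl.sum(dim=-1).mean()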
15 / 36
16. Quark
Quark: Quantized Reward Konditioning
Exploration
Quark updates its data pool by sampling from the model, focusing on
tokens with the highest rewards:
D ← D ∪ {(x, y, r(x, y)) | y ∼ p𝜃 (· | x, rK) , ∀ x ∈ X}
This step focuses on probing the model for high-reward completions
in order to explore promising areas of the distribution.
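Exploration differs from the initial sampling only in that generation is
conditioned on the highest reward token; a minimal sketch is shown below,
assuming the reward tokens <r_1> ... <r_K> have been added to the tokenizer
and model vocabulary (an assumption, not the paper's exact interface).

# Sketch of the exploration step: sample conditioned on the best reward token r_K.
def explore(model, tokenizer, prompts, reward_fn, K, **gen_kwargs):
    new_samples = []
    for x in prompts:
        # Prepend the highest reward token (assumed to be the string "<r_K>").
        inputs = tokenizer(f"<r_{K}>" + x, return_tensors="pt")
        out = model.generate(**inputs, do_sample=True, **gen_kwargs)
        y = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
        new_samples.append((x, y, reward_fn(x, y)))
    return new_samples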
16 / 36
20. Quark
Experiments
Unlearning Toxicity from Language Models
They consider unlearning toxicity from GPT-2 [8] on the
REALTOXICITYPROMPTS benchmark [3]. Additionally, they conduct an
out-of-domain evaluation with the WRITINGPROMPTS dataset [2].
In their experiment, the Perspective API serves as the reward function,
providing a score between 0 (toxic) and 1 (non-toxic).
They use K = 5 quantiles.
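Under this setup the reward for a generation is simply one minus its
toxicity; the snippet below sketches such a reward wrapper, where
perspective_toxicity is a hypothetical placeholder for a Perspective API
call returning a toxicity probability in [0, 1].

# Hypothetical toxicity reward: higher reward = less toxic.
# perspective_toxicity(text) stands in for a Perspective API call
# returning a toxicity probability in [0, 1].
def toxicity_reward(x, y):
    # x: prompt (unused here), y: generated continuation.
    return 1.0 - perspective_toxicity(y)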
20 / 36
23. Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
They use 100K prompts from the OpenWebText Corpus (OWT) [4].
They use a sentiment classifier (DistilBERT [9]) fine-tuned on the
SST-2 dataset [10], obtained from HuggingFace, as the reward function
to steer exploration toward promising regions.
It provides sentiment scores ranging from 0 (negative) to 1 (positive),
and they use K = 5 quantiles.
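One common way to obtain such a score is the HuggingFace sentiment-analysis
pipeline with the standard DistilBERT SST-2 checkpoint; the sketch below
maps each generation to its positive-class probability (the exact checkpoint
and scoring details used in the paper may differ).

# Sketch of a sentiment reward using the HuggingFace pipeline API.
# Assumes the widely used DistilBERT SST-2 checkpoint; the paper's setup may differ.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def sentiment_reward(x, y):
    # x: prompt (unused here), y: generated continuation.
    result = sentiment(y)[0]  # {'label': 'POSITIVE' | 'NEGATIVE', 'score': p}
    p = result["score"]
    return p if result["label"] == "POSITIVE" else 1.0 - p  # 1 = positive, 0 = negative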
23 / 36
24. Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
Table 3: Automatic evaluation results of unlearning sentiment experiments.
24 / 36
25. Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
Table 4: Human evaluation results of unlearning sentiment experiments.
25 / 36
26. Quark
Experiments
Unlearning Degenerate Repetition
They use WIKITEXT-103 [6] as the dataset.
They use a diversity metric as the reward:
diversity(y) = ∏_{n=2}^{4} (1.0 − rep-n(y) / 100),
where rep-n(y) = 100 × (1.0 − |unique n-grams(y)| / |total n-grams(y)|).
They use K = 8 quantiles.
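The metric is straightforward to compute from n-gram statistics; the
function below is a direct sketch of the formula above, operating on a
whitespace-tokenized generation.

# Sketch of the diversity reward: product over n = 2..4 of the unique-n-gram ratio.
def rep_n(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

def diversity(text):
    tokens = text.split()
    score = 1.0
    for n in range(2, 5):  # n = 2, 3, 4
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return score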
26 / 36
32. Quark
Conclusion
Conclusion
This paper presents Quark, an algorithm that uses reward optimization
to mitigate undesirable properties acquired by language models during
pretraining.
32 / 36
33. Quark
Conclusion
Reflection
This paper acknowledges two primary concerns regarding the dual use
of this method:
1 Like any controllable text generation technique, Quark could
potentially be exploited for malicious purposes.
2 There is a risk that reward functions may unintentionally reflect
societal biases, particularly when derived from opaque or
complex neural networks.
33 / 36
34. Quark
Conclusion
References I
[1] Rishabh Agarwal, Max Schwarzer, et al. “Deep reinforcement
learning at the edge of the statistical precipice”. In: Advances in
Neural Information Processing Systems (2021),
pp. 29304–29320.
[2] Angela Fan, Mike Lewis, et al. “Hierarchical Neural Story
Generation”. In: Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long
Papers). 2018, pp. 889–898.
[3] Samuel Gehman, Suchin Gururangan, et al.
“RealToxicityPrompts: Evaluating Neural Toxic Degeneration
in Language Models”. In: Findings of the Association for
Computational Linguistics: EMNLP 2020. 2020,
pp. 3356–3369.
34 / 36
35. Quark
Conclusion
References II
[4] Aaron Gokaslan and Vanya Cohen. OpenWebText Corpus.
http://Skylion007.github.io/OpenWebTextCorpus.
2019.
[5] Alisa Liu, Maarten Sap, et al. “DExperts: Decoding-Time
Controlled Text Generation with Experts and Anti-Experts”. In:
Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long
Papers). 2021, pp. 6691–6706.
[6] Stephen Merity, Caiming Xiong, et al. “Pointer Sentinel
Mixture Models”. In: International Conference on Learning
Representations. 2016.
35 / 36
36. Quark
Conclusion
References III
[7] Romain Paulus, Caiming Xiong, et al. “A Deep Reinforced
Model for Abstractive Summarization”. In: International
Conference on Learning Representations. 2018.
[8] Alec Radford, Jeffrey Wu, et al. “Language models are
unsupervised multitask learners”. In: OpenAI blog (2019), p. 9.
[9] Victor Sanh, Lysandre Debut, et al. “DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter”. In: Proceedings
of the Thirty-third Conference on Neural Information Processing
Systems. 2019.
[10] Richard Socher, Alex Perelygin, et al. “Recursive deep models
for semantic compositionality over a sentiment treebank”. In:
Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing. 2013, pp. 1631–1642.
36 / 36