4. Quark
Abstract
Large language models may generate content that is misaligned with
users' expectations, for example toxic language, repeated content, and
other undesired responses.
This paper addresses this challenge with an algorithm for optimizing a
reward function that quantifies an (un)wanted property.
4 / 36
6. Quark
Introduction
Large language models, which are trained on vast amounts of web
text, have demonstrated remarkable proficiency across various tasks.
However, this extensive training can also lead to the manifestation of
undesirable behaviors in these models.
6 / 36
7. Quark
Introduction
Unlearning undesirable behaviors through supervised learning on a
curated corpus [5] poses challenges in data collection and risks
overfitting, potentially losing desirable traits.
An alternative approach builds a detector for undesirable behavior [7],
but adjusting the model based on such a detector is non-trivial: detectors
score full-text samples rather than offering token-level feedback, making
direct differentiation through them impractical.
7 / 36
8. Quark
Introduction
Sentence-level scalar feedback that dynamically adjusts learning fits
naturally into the reinforcement learning (RL) paradigm.
In NLP, RL is used to optimize scalar metrics as rewards. Nonetheless,
(deep) RL is sensitive to variance in the reward function [1], requiring
additional models and specialized heuristics for stable training, which
often doubles the number of learnable parameters.
8 / 36
9. Quark
Introduction
Contribution
This paper introduces Quantized Reward Konditioning (Quark), an
algorithm for reward-based (un)learning with language models. Quark
iterates three steps (a minimal sketch of the full loop follows this list):
1 Collect samples with the current language model.
2 Sort them into quantiles based on reward, identifying each quantile
by a reward token added to the language model's input.
3 Maximize the likelihood of samples from each reward quantile
conditioned on its reward token, while staying close to the original
language model via a KL-divergence penalty.
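As a rough illustration of how these three steps fit together, the following
Python sketch shows one possible Quark-style loop. It is not the authors'
implementation; sample_conditioned, reward_fn, quantize_pool, and
train_step are assumed helper names standing in for the operations above.

# Illustrative Quark-style training loop (not the paper's code).
# sample_conditioned, reward_fn, quantize_pool, and train_step are
# hypothetical helpers for the operations described on this slide.
def quark_loop(model, ref_model, prompts, reward_fn, K=5, iterations=10):
    # Seed the data pool with scored samples from the initial model (step 1).
    pool = [(x, y, reward_fn(x, y))
            for x in prompts
            for y in sample_conditioned(model, x, reward_token=None)]
    for _ in range(iterations):
        # Quantization: sort the pool by reward into K quantiles,
        # each tagged with a reward token r_1 ... r_K (step 2).
        quantiles = quantize_pool(pool, K)
        # Learning: maximize likelihood of each quantile given its reward
        # token, with a KL penalty toward the frozen ref_model (step 3).
        train_step(model, ref_model, quantiles)
        # Exploration: re-sample conditioned on the best reward token r_K
        # and add the newly scored samples to the pool (back to step 1).
        pool += [(x, y, reward_fn(x, y))
                 for x in prompts
                 for y in sample_conditioned(model, x, reward_token=K)]
    return model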
9 / 36
11. Quark
Quark: Quantized Reward Konditioning
Figure 1: Quantized Reward Konditioning (Quark) is an online, off-policy
reinforcement learning (RL) algorithm used to (un)learn properties from
language models via three iterative stages: exploration, quantization, and
learning.
11 / 36
12. Quark
Quark: Quantized Reward Konditioning
Quantized Reward Konditioning
The algorithm consists of three steps:
Exploration
Quantization
Learning
12 / 36
13. Quark
Quark: Quantized Reward Konditioning
Initialization
A pre-trained language model p0(y | x)
A set of training prompts X
A reward function r(x, y) → R
Sequences of tokens x and y drawn from a vocabulary V
Quark creates a data pool by sampling from p0 based on training
prompts and then evaluates the samples using the reward function:
D0 = {(x, y, r(x, y)) | y ∼ p0(· | x), ∀x ∈ X}
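As a concrete (assumed) example, the initial pool can be built by sampling
continuations from a frozen GPT-2 checkpoint via the HuggingFace
transformers API and scoring them with a user-supplied reward_fn; the
paper's actual implementation details may differ.

# Sketch of building the initial data pool D0 (assumes HuggingFace GPT-2
# and a reward_fn(x, y) -> float supplied by the user).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
p0 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def build_initial_pool(prompts, reward_fn, max_new_tokens=20):
    pool = []
    for x in prompts:
        inputs = tokenizer(x, return_tensors="pt")
        with torch.no_grad():
            out = p0.generate(**inputs, do_sample=True, top_p=0.9,
                              max_new_tokens=max_new_tokens,
                              pad_token_id=tokenizer.eos_token_id)
        # Keep only the newly generated continuation y.
        y = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
        pool.append((x, y, reward_fn(x, y)))
    return pool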
13 / 36
14. Quark
Quark: Quantized Reward Konditioning
Quantization
Quark quantizes data examples based on their rewards by dividing the
sorted pool into equally sized quantiles D1, . . . , DK.
Each quantile is associated with a reward token rk, where higher k
corresponds to higher rewards.
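One straightforward way to implement this is to sort the pooled samples by
reward and cut the sorted list into K equal slices; the helper below is a
minimal sketch of that idea (quantize_pool is an assumed name, not the
paper's code).

# Sketch of reward quantization: sort by reward, split into K equal quantiles.
def quantize_pool(pool, K):
    # pool: list of (x, y, reward); returns {k: samples}, k = 1 (lowest) .. K (highest).
    ranked = sorted(pool, key=lambda item: item[2])  # ascending reward
    n = len(ranked)
    quantiles = {}
    for k in range(1, K + 1):
        lo = (k - 1) * n // K
        hi = k * n // K
        quantiles[k] = ranked[lo:hi]  # quantile k is tagged with reward token r_k
    return quantiles

For example, with K = 5 and 1,000 pooled samples, each quantile holds
roughly 200 samples, and the 200 highest-reward samples receive the token r_5.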
14 / 36
15. Quark
Quark: Quantized Reward Konditioning
Learning
Quark maximizes likelihood on quantized data pool D while applying
a KL-penalty to maintain fidelity to the original model:
max_θ E_{k∼U(1,K)} E_{(x,y)∼Dk} [ log pθ(y | x, rk) − β Σ_{t=1}^{T} KL(p0(· | y<t, x) ∥ pθ(· | y<t, x, rk)) ],
where each KL term is Σ_{yt∈V} p0(yt) log [ p0(yt) / pθ(yt) ].
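In code, this objective is a token-level cross-entropy term plus a
per-position KL penalty between the frozen initial model and the
conditioned policy. The PyTorch sketch below shows one possible per-batch
loss; how the reward token rk is prepended and how the logits are obtained
are assumptions, not the paper's implementation.

# Sketch of the Quark learning loss for one batch (illustrative only).
# logits_theta: policy logits given (x, r_k, y_<t), shape [batch, seq_len, vocab]
# logits_p0:    frozen-model logits given (x, y_<t), same shape
# labels:       target token ids for y, shape [batch, seq_len]
import torch
import torch.nn.functional as F

def quark_loss(logits_theta, logits_p0, labels, beta):
    # Maximum-likelihood term: NLL of y under p_theta(. | x, r_k).
    nll = F.cross_entropy(logits_theta.transpose(1, 2), labels)
    # Per-position KL(p0(. | y_<t, x) || p_theta(. | y_<t, x, r_k)).
    log_p_theta = F.log_softmax(logits_theta, dim=-1)
    log_p0 = F.log_softmax(logits_p0, dim=-1)
    kl = (log_p0.exp() * (log_p0 - log_p_theta)).sum(dim=-1)  # [batch, seq_len]
    # Sum the KL penalty over positions t and average over the batch.
    return nll + beta * kl.sum(dim=-1).mean()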
15 / 36
16. Quark
Quark: Quantized Reward Konditioning
Exploration
Quark updates its data pool by sampling from the model, focusing on
tokens with the highest rewards:
D ← D ∪ {(x, y, r(x, y)) | y ∼ p𝜃 (· | x, rK) , ∀ x ∈ X}
This step focuses on probing the model for high-reward completions
in order to explore promising areas of the distribution.
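Exploration differs from the initial sampling only in that generation is
conditioned on the highest reward token; a minimal sketch is shown below,
assuming the reward tokens <r_1> ... <r_K> have been added to the tokenizer
and model vocabulary (an assumption, not the paper's exact interface).

# Sketch of the exploration step: sample conditioned on the best reward token r_K.
def explore(model, tokenizer, prompts, reward_fn, K, **gen_kwargs):
    new_samples = []
    for x in prompts:
        # Prepend the highest reward token (assumed to be the string "<r_K>").
        inputs = tokenizer(f"<r_{K}>" + x, return_tensors="pt")
        out = model.generate(**inputs, do_sample=True, **gen_kwargs)
        y = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
        new_samples.append((x, y, reward_fn(x, y)))
    return new_samples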
16 / 36
20. Quark
Experiments
Unlearning Toxicity from Language Models
They consider unlearning toxicity from GPT-2 [8] on the
REALTOXICITYPROMPTS benchmark [3]. Additionally, they conduct an
out-of-domain evaluation with the WRITINGPROMPTS dataset [2].
In their experiment, the Perspective API serves as the reward function,
providing a score between 0 (toxic) and 1 (non-toxic).
They use K = 5 quantiles.
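Under this setup the reward for a generation is simply one minus its
toxicity; the snippet below sketches such a reward wrapper, where
perspective_toxicity is a hypothetical placeholder for a Perspective API
call returning a toxicity probability in [0, 1].

# Hypothetical toxicity reward: higher reward = less toxic.
# perspective_toxicity(text) stands in for a Perspective API call
# returning a toxicity probability in [0, 1].
def toxicity_reward(x, y):
    # x: prompt (unused here), y: generated continuation.
    return 1.0 - perspective_toxicity(y)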
20 / 36
23. Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
They use 100K prompts from the OpenWebText Corpus (OWT) [4].
They use a sentiment classifier (DistilBERT [9]) fine-tuned on the
SST-2 dataset [10], obtained from HuggingFace, as the reward function
to steer exploration toward promising regions.
It provides sentiment scores ranging from 0 (negative) to 1 (positive),
and they use K = 5 quantiles.
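One common way to obtain such a score is the HuggingFace sentiment-analysis
pipeline with the standard DistilBERT SST-2 checkpoint; the sketch below
maps each generation to its positive-class probability (the exact checkpoint
and scoring details used in the paper may differ).

# Sketch of a sentiment reward using the HuggingFace pipeline API.
# Assumes the widely used DistilBERT SST-2 checkpoint; the paper's setup may differ.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def sentiment_reward(x, y):
    # x: prompt (unused here), y: generated continuation.
    result = sentiment(y)[0]  # {'label': 'POSITIVE' | 'NEGATIVE', 'score': p}
    p = result["score"]
    return p if result["label"] == "POSITIVE" else 1.0 - p  # 1 = positive, 0 = negative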
23 / 36
24. Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
Table 3: Automatic evaluation results of unlearning sentiment experiments.
24 / 36
25. Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
Table 4: Human evaluation results of unlearning sentiment experiments.
25 / 36
26. Quark
Experiments
Unlearning Degenerate Repetition
They use WIKITEXT-103 [6] as the dataset.
They use a diversity metric as the reward:
diversity(y) = ∏_{n=2}^{4} (1.0 − rep-n(y) / 100),
where rep-n(y) = 100 × (1.0 − |unique n-grams(y)| / |total n-grams(y)|).
They use K = 8 quantiles.
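The metric is straightforward to compute from n-gram statistics; the
function below is a direct sketch of the formula above, operating on a
whitespace-tokenized generation.

# Sketch of the diversity reward: product over n = 2..4 of the unique-n-gram ratio.
def rep_n(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

def diversity(text):
    tokens = text.split()
    score = 1.0
    for n in range(2, 5):  # n = 2, 3, 4
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return score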
26 / 36
32. Quark
Conclusion
Conclusion
This paper presents Quark, an algorithm that uses reward optimization
to mitigate undesirable properties acquired by language models during
pretraining.
32 / 36
33. Quark
Conclusion
Reflection
This paper acknowledges two primary concerns regarding the dual use
of this method:
1 Like any controllable text generation technique, Quark could
potentially be exploited for malicious purposes.
2 There is a risk that reward functions may unintentionally reflect
societal biases, particularly when derived from opaque or
complex neural networks.
33 / 36
34. Quark
Conclusion
References I
[1] Rishabh Agarwal, Max Schwarzer, et al. “Deep reinforcement
learning at the edge of the statistical precipice”. In: Advances in
Neural Information Processing Systems (2021),
pp. 29304–29320.
[2] Angela Fan, Mike Lewis, et al. “Hierarchical Neural Story
Generation”. In: Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long
Papers). 2018, pp. 889–898.
[3] Samuel Gehman, Suchin Gururangan, et al.
“RealToxicityPrompts: Evaluating Neural Toxic Degeneration
in Language Models”. In: Findings of the Association for
Computational Linguistics: EMNLP 2020. 2020,
pp. 3356–3369.
34 / 36
35. Quark
Conclusion
References II
[4] Aaron Gokaslan and Vanya Cohen. OpenWebText Corpus.
http://Skylion007.github.io/OpenWebTextCorpus.
2019.
[5] Alisa Liu, Maarten Sap, et al. “DExperts: Decoding-Time
Controlled Text Generation with Experts and Anti-Experts”. In:
Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long
Papers). 2021, pp. 6691–6706.
[6] Stephen Merity, Caiming Xiong, et al. “Pointer Sentinel
Mixture Models”. In: International Conference on Learning
Representations. 2016.
35 / 36
36. Quark
Conclusion
References III
[7] Romain Paulus, Caiming Xiong, et al. “A Deep Reinforced
Model for Abstractive Summarization”. In: International
Conference on Learning Representations. 2018.
[8] Alec Radford, Jeffrey Wu, et al. “Language models are
unsupervised multitask learners”. In: OpenAI blog (2019), p. 9.
[9] Victor Sanh, Lysandre Debut, et al. “DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter”. In: Proceedings
of the Thirty-third Conference on Neural Information Processing
Systems. 2019.
[10] Richard Socher, Alex Perelygin, et al. “Recursive deep models
for semantic compositionality over a sentiment treebank”. In:
Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing. 2013, pp. 1631–1642.
36 / 36