Quark
Quark: Controllable Text Generation with
Reinforced [Un]learning
NeurIPS, 2022
Ximing Lu, Sean Welleck, Jack Hessel et al.
Speaker: Po-Chuan Chen
Apr 9, 2024
1 / 36
Quark
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
2 / 36
Quark
Abstract
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
3 / 36
Quark
Abstract
Abstract
Large language models may generate content that is misaligned with
the user’s expectations, for example toxic language, repetitive text,
or otherwise undesired responses.
This paper addresses this challenge with an algorithm for optimizing a
reward function that quantifies an (un)wanted property.
4 / 36
Quark
Introduction
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
5 / 36
Quark
Introduction
Introduction
Large language models, which are trained on vast amounts of web
text, have demonstrated remarkable proficiency across various tasks.
However, this extensive training can also lead to the manifestation of
undesirable behaviors in these models.
6 / 36
Quark
Introduction
Introduction
Unlearning undesirable behaviors through supervised learning on a
curated corpus [5] poses challenges in data collection and risks
overfitting, potentially losing desirable traits.
An alternative approach involves building a detector for undesirable
behavior [7], but adjusting the model based on this detector is
non-trivial, as detectors score full-text samples rather than offering
token-level feedback, making direct differentiation impractical.
7 / 36
Quark
Introduction
Introduction
Sentence-level, scalar feedback that dynamically adjusts learning fits
naturally into the reinforcement learning (RL) paradigm.
In NLP, RL optimizes such scalar metrics as rewards. Nonetheless, (deep)
RL is sensitive to variance in the reward signal [1], necessitating
additional models and specialized heuristics for stable training, often
nearly doubling the number of learnable parameters.
8 / 36
Quark
Introduction
Contribution
This paper introduces Quantized Reward Konditioning (Quark), an
algorithm for reward-based (un)learning with language models. It
iterates three steps, sketched in code below:
1 Collect samples with the current language model.
2 Sort them into quantiles based on reward, identifying each quantile
by a reward token prepended to the language model’s input.
3 Maximize the likelihood of samples from each reward quantile
conditioned on its reward token, while staying close to the original
language model via a KL-divergence penalty.
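A minimal Python sketch of this loop; the sampling, reward, and training routines are passed in as callables since they are task-specific (this is an illustration, not the authors' implementation):

```python
# High-level sketch of the Quark loop. sample_fn, reward_fn, and train_fn
# are supplied by the caller because they depend on the task.
def quark(model, prompts, sample_fn, reward_fn, train_fn,
          num_iterations=10, K=5):
    pool = []  # data pool of (prompt, completion, reward) triples
    for _ in range(num_iterations):
        # Exploration: sample completions with the current model.
        for x in prompts:
            y = sample_fn(model, x)
            pool.append((x, y, reward_fn(x, y)))
        # Quantization: sort by reward and split into K equal-sized quantiles
        # (quantile K holds the highest-reward samples).
        pool.sort(key=lambda ex: ex[2])
        n = len(pool) // K
        quantiles = [pool[i * n:(i + 1) * n] for i in range(K)]
        # Learning: maximize likelihood of each quantile's samples conditioned
        # on its reward token, with a KL penalty toward the initial model.
        train_fn(model, quantiles)
    return model
```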
9 / 36
Quark
Quark: Quantized Reward Konditioning
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
10 / 36
Quark
Quark: Quantized Reward Konditioning
Figure 1: Quantized Reward Konditioning (Quark) is an online, off-policy
reinforcement learning (RL) algorithm used to (un)learn properties from
language models via three iterative stages: exploration, quantization, and
learning.
11 / 36
Quark
Quark: Quantized Reward Konditioning
Quantized Reward Konditioning
This algorithm contains three steps:
Exploration
Quantization
Learning
12 / 36
Quark
Quark: Quantized Reward Konditioning
Initialization
A pre-trained language model p0(y | x)
A set of training prompts X
A reward function r(x, y) ∈ R
Sequences of tokens x and y drawn from a vocabulary V
Quark creates a data pool by sampling from p0 on the training
prompts and then evaluates the samples using the reward function:
D0 = {(x, y, r(x, y)) | y ∼ p0(· | x), ∀x ∈ X}
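A hedged sketch of building D0 with the HuggingFace transformers API; the decoding settings and the reward_fn callable are assumptions, not the paper's exact configuration:

```python
# Sketch: build the initial data pool D0 by sampling from the pretrained
# model on the training prompts and scoring each completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

def build_initial_pool(prompts, reward_fn, max_new_tokens=20):
    pool = []
    for x in prompts:
        ids = tokenizer(x, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, do_sample=True, top_p=0.9,
                                 max_new_tokens=max_new_tokens)
        # Decode only the newly generated continuation y.
        y = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        pool.append((x, y, reward_fn(x, y)))
    return pool
```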
13 / 36
Quark
Quark: Quantized Reward Konditioning
Quantization
Quark quantizes data examples based on their rewards by dividing the
sorted pool into equally sized quantiles D1, . . . , DK.
Each quantile is associated with a reward token rk, where higher k
corresponds to higher rewards.
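A small sketch of this quantization step; the literal reward-token strings are an assumed naming (the paper adds K new tokens to the vocabulary):

```python
# Sort the pool by reward and split it into K equal-sized quantiles,
# tagging each with a reward token r_1 (lowest) ... r_K (highest).
def quantize(pool, K=5):
    ordered = sorted(pool, key=lambda ex: ex[2])  # ascending reward
    n = len(ordered) // K
    quantiles = {}
    for k in range(1, K + 1):
        token = f"<|reward_{k}|>"  # assumed naming of the reward tokens
        quantiles[token] = ordered[(k - 1) * n : k * n]
    return quantiles
```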
14 / 36
Quark
Quark: Quantized Reward Konditioning
Learning
Quark maximizes likelihood on quantized data pool D while applying
a KL-penalty to maintain fidelity to the original model:
max_θ E_{k∼U(1,K)} E_{(x,y)∼D_k} [ log p_θ(y | x, r_k) − β Σ_{t=1}^{T} KL( p_0(· | y_{<t}, x) ∥ p_θ(· | y_{<t}, x, r_k) ) ]

where each KL term is Σ_{y_t∈V} p_0(y_t | y_{<t}, x) log [ p_0(y_t | y_{<t}, x) / p_θ(y_t | y_{<t}, x, r_k) ].
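A simplified single-example PyTorch sketch of this objective; batching and prompt masking are elided, β is treated as a plain hyperparameter, and for brevity the frozen reference model is fed the same input as the policy even though the paper conditions p0 without the reward token:

```python
import torch
import torch.nn.functional as F

def quark_loss(policy, ref_model, tokenizer, reward_token, x, y, beta=0.05):
    # Tokenize "reward token + prompt" and the continuation separately so we
    # know which positions belong to y.
    prompt_ids = tokenizer(reward_token + x, return_tensors="pt").input_ids
    cont_ids = tokenizer(y, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, cont_ids], dim=-1)

    logits = policy(ids).logits[:, :-1, :]  # p_theta(. | ..., r_k)
    with torch.no_grad():
        # Frozen reference model p0 (sees the same input here for brevity).
        ref_logits = ref_model(ids).logits[:, :-1, :]

    T = cont_ids.shape[-1]
    targets = ids[:, -T:]

    # Maximum-likelihood term over the continuation tokens only.
    nll = F.cross_entropy(logits[:, -T:, :].transpose(1, 2), targets)

    # Per-token KL(p0 || p_theta), summed over the continuation positions.
    logp = F.log_softmax(logits[:, -T:, :], dim=-1)
    ref_logp = F.log_softmax(ref_logits[:, -T:, :], dim=-1)
    kl = (ref_logp.exp() * (ref_logp - logp)).sum(dim=-1).sum()

    return nll + beta * kl
```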
15 / 36
Quark
Quark: Quantized Reward Konditioning
Exploration
Quark updates its data pool by sampling from the model conditioned on
the highest-reward token rK:
D ← D ∪ {(x, y, r(x, y)) | y ∼ p𝜃 (· | x, rK) , ∀ x ∈ X}
This step focuses on probing the model for high-reward completions
in order to explore promising areas of the distribution.
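A sketch of this exploration step under the same assumptions as above (reward-token naming and decoding settings are illustrative):

```python
# Sample completions conditioned on the highest-reward token, score them,
# and return the new (prompt, completion, reward) triples for the pool.
import torch

def explore(policy, tokenizer, prompts, reward_fn,
            best_reward_token="<|reward_5|>", max_new_tokens=20):
    new_examples = []
    for x in prompts:
        ids = tokenizer(best_reward_token + x, return_tensors="pt").input_ids
        with torch.no_grad():
            out = policy.generate(ids, do_sample=True, top_p=0.9,
                                  max_new_tokens=max_new_tokens)
        y = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        new_examples.append((x, y, reward_fn(x, y)))
    return new_examples
```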
16 / 36
Quark
Quark: Quantized Reward Konditioning
17 / 36
Quark
Experiments
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
18 / 36
Quark
Experiments
Experiments
This study uses GPT2-large [8] as the initial policy p0 for toxicity and
sentiment experiments, and GPT2-base for repetition experiments.
19 / 36
Quark
Experiments
Unlearning Toxicity from Language Models
They consider unlearning toxicity from GPT-2 on the
REALTOXICITYPROMPTS benchmark [3], and additionally conduct an
out-of-domain evaluation on the WRITINGPROMPTS dataset [2].
In their experiments, the Perspective API serves as the reward
function, providing a score between 1 (non-toxic) and 0 (toxic).
They use K = 5 quantiles.
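A hedged sketch of such a reward wrapper around the Perspective API; the endpoint and response fields follow the public API documentation and are not taken from the paper's code:

```python
# Toxicity reward: 1 minus the Perspective TOXICITY score, so 1 = non-toxic
# and 0 = toxic, matching the convention above. Details are an approximation.
import requests

PERSPECTIVE_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
                   "comments:analyze")

def toxicity_reward(text: str, api_key: str) -> float:
    body = {"comment": {"text": text},
            "requestedAttributes": {"TOXICITY": {}}}
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    resp.raise_for_status()
    toxicity = resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return 1.0 - toxicity
```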
20 / 36
Quark
Experiments
Unlearning Toxicity from Language Models
Table 1: Automatic evaluation results of unlearning toxicity experiments.
21 / 36
Quark
Experiments
Unlearning Toxicity from Language Models
Table 2: Human evaluation results of unlearning toxicity experiments.
22 / 36
Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
They use 100K prompts from the OpenWebText Corpus (OWT) [4].
A sentiment classifier (DistilBERT [9]) from HuggingFace, trained on
the SST-2 dataset [10], serves as the reward function for steering
generations away from the unwanted sentiment.
It provides sentiment scores ranging from 1 (positive) to 0 (negative),
and they use K = 5 quantiles.
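A sketch of a matching sentiment reward using the SST-2 DistilBERT checkpoint on the HuggingFace hub; the checkpoint name and the mapping to a single [0, 1] score are assumptions:

```python
# Sentiment reward in the spirit of the setup above: probability of the
# POSITIVE class, so 1 = positive and 0 = negative.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def sentiment_reward(text: str) -> float:
    result = sentiment(text)[0]   # e.g. {"label": "POSITIVE", "score": 0.97}
    if result["label"] == "POSITIVE":
        return result["score"]
    return 1.0 - result["score"]
```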
23 / 36
Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
Table 3: Automatic evaluation results of unlearning sentiment experiments.
24 / 36
Quark
Experiments
Steering Away from Unwanted Sentiment of Generated
Texts
Table 4: Human evaluation results of unlearning sentiment experiments.
25 / 36
Quark
Experiments
Unlearning Degenerate Repetition
They use WIKITEXT-103 [6] as the dataset.
They use a diversity metric as the reward:
diversity(y) = ∏_{n=2}^{4} ( 1.0 − rep-n(y) / 100 )

where rep-n(y) = 100 × ( 1.0 − |unique n-grams(y)| / |total n-grams(y)| ).
They use K = 8 quantiles.
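A direct Python transcription of this diversity reward, operating on a list of tokens (tokenization is left to the caller):

```python
# rep-n: percentage of repeated n-grams; diversity: product of the unique
# n-gram ratios for n = 2..4, as in the formula above.
def rep_n(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

def diversity(tokens):
    score = 1.0
    for n in range(2, 5):
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return score
```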
26 / 36
Quark
Experiments
Unlearning Degenerate Repetition
Table 5: Unlearning repetitions of sequences generated from GPT2-base via
greedy decoding
27 / 36
Quark
Experiments
Unlearning Degenerate Repetition
Figure 2: The orange and blue lines denote Quark with and without the
unlikelihood loss respectively.
28 / 36
Quark
Model Ablations
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
29 / 36
Quark
Model Ablations
30 / 36
Quark
Conclusion
Table of contents
1 Abstract
2 Introduction
3 Quark: Quantized Reward Konditioning
4 Experiments
5 Model Ablations
6 Conclusion
31 / 36
Quark
Conclusion
Conclusion
This paper presents Quark, an algorithm aimed at mitigating
undesirable properties acquired by language models during
pretraining through reward optimization.
32 / 36
Quark
Conclusion
Reflection
This paper acknowledges two primary concerns regarding the dual use
of this method:
1 Like any controllable text generation technique, Quark could
potentially be exploited for malicious purposes.
2 There is a risk that reward functions may unintentionally reflect
societal biases, particularly when derived from opaque or
complex neural networks.
33 / 36
Quark
Conclusion
References I
[1] Rishabh Agarwal, Max Schwarzer, et al. “Deep reinforcement
learning at the edge of the statistical precipice”. In: Advances in
Neural Information Processing Systems (2021),
pp. 29304–29320.
[2] Angela Fan, Mike Lewis, et al. “Hierarchical Neural Story
Generation”. In: Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long
Papers). 2018, pp. 889–898.
[3] Samuel Gehman, Suchin Gururangan, et al.
“RealToxicityPrompts: Evaluating Neural Toxic Degeneration
in Language Models”. In: Findings of the Association for
Computational Linguistics: EMNLP 2020. 2020,
pp. 3356–3369.
34 / 36
Quark
Conclusion
References II
[4] Aaron Gokaslan and Vanya Cohen. OpenWebText Corpus.
http://Skylion007.github.io/OpenWebTextCorpus.
2019.
[5] Alisa Liu, Maarten Sap, et al. “DExperts: Decoding-Time
Controlled Text Generation with Experts and Anti-Experts”. In:
Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long
Papers). 2021, pp. 6691–6706.
[6] Stephen Merity, Caiming Xiong, et al. “Pointer Sentinel
Mixture Models”. In: International Conference on Learning
Representations. 2016.
35 / 36
Quark
Conclusion
References III
[7] Romain Paulus, Caiming Xiong, et al. “A Deep Reinforced
Model for Abstractive Summarization”. In: International
Conference on Learning Representations. 2018.
[8] Alec Radford, Jeffrey Wu, et al. “Language models are
unsupervised multitask learners”. In: OpenAI blog (2019), p. 9.
[9] Victor Sanh, Lysandre Debut, et al. “DistilBERT, a distilled version
of BERT: smaller, faster, cheaper and lighter”. In: Proceedings of the
Thirty-third Conference on Neural Information Processing Systems. 2019.
[10] Richard Socher, Alex Perelygin, et al. “Recursive deep models
for semantic compositionality over a sentiment treebank”. In:
Proceedings of the 2013 conference on empirical methods in
natural language processing. 2013, pp. 1631–1642.
36 / 36
