This paper evaluates the effectiveness of offline reinforcement learning methods for dialogue response generation. It finds that decision transformers and implicit Q-learning show improvements over teacher forcing, generating responses that are similar in meaning to the target while not requiring exact matching. Evaluation on several datasets demonstrates these offline RL methods achieve better performance than teacher forcing according to automated metrics and human evaluations, while avoiding issues with online reinforcement learning.
TEACHING AND LEARNING BASED OPTIMISATION (Uday Wankar)
Teaching–Learning-Based Optimization (TLBO) seems to be a rising star among a number of metaheuristics with relatively competitive performance. It is reported to outperform some well-known metaheuristics on constrained benchmark functions, constrained mechanical design, and continuous non-linear numerical optimization problems. Such a breakthrough has steered us towards investigating the secrets of TLBO's dominance. This report presents findings on TLBO qualitatively and quantitatively, through code reviews and experiments, respectively.
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI (Jack Clark)
This document discusses deep reinforcement learning through policy optimization. It begins with an introduction to reinforcement learning and how deep neural networks can be used to approximate policies, value functions, and models. It then discusses how deep reinforcement learning can be applied to problems in robotics, business operations, and other machine learning domains. The document reviews how reinforcement learning relates to other machine learning problems like supervised learning and contextual bandits. It provides an overview of policy gradient methods and the cross-entropy method for policy optimization before discussing Markov decision processes, parameterized policies, and specific policy gradient algorithms like the vanilla policy gradient algorithm and trust region policy optimization.
This tutorial covers machine learning approaches for learning to rank documents in information retrieval systems. It discusses how early IR methods did not incorporate machine learning. It then covers: (1) ordinal regression approaches that learn multiple thresholds to account for the ordered nature of relevance labels; (2) optimizing pairwise preferences between documents, which decomposes the problem and allows for efficient algorithms; (3) directly optimizing rank-based evaluation measures like MAP and NDCG using structural SVMs, boosting, or smooth approximations to allow for gradient descent optimization of discontinuous objectives. The goal is to outperform traditional IR methods by applying machine learning techniques to learn good ranking functions.
AUTOMATIC TRANSFER RATE ADJUSTMENT FOR TRANSFER REINFORCEMENT LEARNING (gerogepatton)
This paper proposes a novel parameter for transfer reinforcement learning to avoid over-fitting when an agent uses a transferred policy from a source task. Learning robot systems have recently been studied for many applications, such as home robots, communication robots, and warehouse robots. However, if the agent reuses knowledge that has been sufficiently learned in the source task, deadlock may occur and appropriate transfer learning may not be realized. In previous work, a parameter called the transfer rate was proposed to adjust the ratio of transfer; its contributions include avoiding deadlock in the target task. However, adjusting the parameter depends on human intuition and experience, and a method for deciding the transfer rate has not been discussed. Therefore, this paper proposes an automatic method for adjusting the transfer rate using a sigmoid function. Computer simulations are used to evaluate the effectiveness of the proposed method in improving environmental adaptation performance in a target task, i.e., the situation of reusing knowledge.
How to formulate reinforcement learning in illustrative ways (YasutoTamura1)
This lecture introduces reinforcement learning and how to approach learning it. It discusses formulating the environment as a Markov decision process and defines important concepts like policy, value functions, returns, and the Bellman equation. The key ideas are that reinforcement learning involves optimizing a policy to maximize expected returns, and value functions are introduced to indirectly evaluate and improve the policy through dynamic programming methods like policy iteration and value iteration. Understanding these fundamental concepts through simple examples is emphasized as the starting point for learning reinforcement learning.
Offline Reinforcement Learning for Informal Summarization in Online Domains.pdf (Po-Chuan Chen)
The document proposes an approach to generate natural language summaries for online content using offline reinforcement learning. It involves crawling Twitter data, fine-tuning models like RoBERTa and GPT-2, and using a reinforcement learning algorithm (PPO) to further train the text generation model using a reward function. The methodology, planned experiment, related work and conclusion are discussed over multiple sections and figures.
Adaptive relevance feedback in information retrieval (YI-JHEN LIN)
Adaptive relevance feedback aims to optimize the balance between the original query and feedback documents. The paper proposes learning an adaptive feedback coefficient based on query and feedback document characteristics. These include query and feedback document discrimination and divergence between the query and feedback. Logistic regression is used to learn weights mapping query-feedback pairs to coefficients. Experiments show the approach improves retrieval performance compared to fixed coefficients, especially when training and test data are in the same domain.
Financial Trading as a Game: A Deep Reinforcement Learning Approach (謙益 黃)
An automatic program that generates constant profit from the financial market is lucrative for every market practitioner. Recent advances in deep reinforcement learning provide a framework for end-to-end training of such a trading agent. In this paper, we propose a Markov Decision Process (MDP) model suitable for the financial trading task and solve it with the state-of-the-art deep recurrent Q-network (DRQN) algorithm. We propose several modifications to the existing learning algorithm to make it more suitable for the financial trading setting: 1. We employ a substantially smaller replay memory (only a few hundred entries) compared to those used in modern deep reinforcement learning algorithms (often millions in size). 2. We develop an action augmentation technique to mitigate the need for random exploration by providing extra feedback signals for all actions to the agent. This enables us to use a greedy policy over the course of learning, and it shows strong empirical performance compared to the more commonly used ε-greedy exploration. However, this technique is specific to financial trading under a few market assumptions. 3. We sample a longer sequence for recurrent neural network training. A side product of this mechanism is that we can now train the agent every T steps, which greatly reduces training time since the overall computation goes down by a factor of T. We combine all of the above into a complete online learning algorithm and validate our approach on the spot foreign exchange market.
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract... (Deren Lei)
Deep reinforcement learning (RL) has been a commonly used strategy for the abstractive summarization task to address both the exposure bias and non-differentiable task issues. However, the conventional ROUGE-L reward simply looks for exact n-gram matches between candidates and annotated references, which inevitably makes the generated sentences repetitive and incoherent. In this paper, we explore the practicability of utilizing distributional semantics to measure the matching degree. Our proposed distributional semantics reward has distinct superiority in capturing the lexical and compositional diversity of natural language.
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING (MLReview)
This document proposes using meta-learning and an LSTM model to learn an optimization algorithm for few-shot learning. The model, called a meta-learner, is trained on multiple datasets to learn how to efficiently train a learner network on new small datasets. The meta-learner LSTM models the parameter updates of the learner network during training, learning an initialization and update rule. The inputs to the meta-learner are the loss, parameters, and gradient, and it outputs updated parameters. This learned update rule can then be used to train the learner network on new small datasets, enabling few-shot learning using only a small amount of labeled data.
Strategies for Cooperation Emergence in Distributed Service Discovery (Miguel Rebollo)
The document presents strategies to promote cooperative behavior in distributed service discovery networks. It proposes using social plasticity techniques like link rewiring and variable incentives. Simulation results show that combining social plasticity with different incentive policies like rewarding shorter paths or more similar services leads to higher cooperative behavior rates, search success rates and lower path lengths, even when cooperators are initially the minority.
This document provides an introduction to reinforcement learning. It defines reinforcement learning as finding a policy that maximizes the sum of rewards by interacting with an environment. It discusses key concepts like Markov decision processes, value functions, temporal difference learning, Q-learning, and deep reinforcement learning. The document also provides examples of applications in games, robotics, economics and comparisons of model-based planning versus model-free reinforcement learning approaches.
This document discusses using deep reinforcement learning and deep learning techniques for agent-based models. It discusses using deep learning to approximate policy and value functions, using imitation learning to learn from expert demonstrations, and using Q-learning and model-based reinforcement learning to optimize agent behavior. Micro-emulations use deep learning to model individual agent behavior, while macro-emulations aim to emulate the overall system behavior. Open problems include using reinforcement learning to find optimal policies given an agent-based model simulator.
Recent Advances in Flower Pollination Algorithm (Editor IJCATR)
Flower Pollination Algorithm (FPA) is a nature-inspired algorithm based on the pollination process of plants. Recently, FPA has become a popular algorithm in the evolutionary computation field due to its superiority to many other algorithms. Consequently, in this paper, FPA, its improvements, its hybridizations, and its applications in many fields, such as operations research, engineering, and computer science, are discussed and analyzed. Based on its applications in the field of optimization, this algorithm appears to have better convergence speed than other algorithms. The survey investigates the differences between FPA versions as well as its applications. In addition, several future improvements are suggested.
Cuckoo Search: Recent Advances and Applications (Xin-She Yang)
This document summarizes recent advances and applications of the cuckoo search algorithm, a nature-inspired metaheuristic optimization algorithm developed in 2009. Cuckoo search mimics the brood parasitism breeding behavior of some cuckoo species. It uses a combination of local and global search achieved through random walks and Levy flights to efficiently explore the search space. Studies show cuckoo search often finds optimal solutions faster than genetic algorithms and particle swarm optimization. The algorithm has been applied to diverse optimization problems and continues to be improved and extended to multi-objective optimization.
This document proposes using linear function approximation as a computationally efficient method for solving reinforcement learning problems compared to neural network approaches like TRPO and PPO. It summarizes TRPO and PPO, which use neural networks to approximate value functions. The paper then presents a natural actor-critic algorithm that uses linear function approximation instead of neural networks for value approximation. The author evaluates this approach on cart pole and acrobot benchmarks and finds it trains faster than neural network methods while achieving equivalent or better results, especially on sparse reward problems. This allows the paper to recommend using natural policy gradient methods with linear function approximation over TRPO and PPO for traditional and sparse reward low-dimensional reinforcement learning challenges.
Reinforcement Learning for Test Case Prioritization (Lionel Briand)
1) The document discusses using reinforcement learning for test case prioritization in continuous integration environments. It compares different ranking models (listwise, pairwise, pointwise) and reinforcement learning algorithms.
2) Pairwise and pointwise ranking models generally perform better than listwise, and pairwise training times are better than pointwise. The best configuration is pairwise ranking with the ACER algorithm.
3) When compared to traditional machine learning ranking models, the best reinforcement learning configuration provides significantly better ranking accuracy than the state-of-the-art MART model.
4) However, relying solely on test execution history may not provide sufficient features for an accurate prioritization policy regardless of the approach. Enriched datasets with more features
Fuzzy clustering has been widely studied and applied in a variety of key areas of science and engineering. In this paper, the Improved Teaching-Learning-Based Optimization (ITLBO) algorithm is used for data clustering, in which objects in the same cluster are similar. The algorithm has been tested on several datasets and compared with other popular clustering algorithms. Results show that the proposed method improves clustering output and can be efficiently used for fuzzy clustering.
This document summarizes a paper on Cold-Start Reinforcement Learning with Softmax Policy Gradient. It introduces the limitations of existing sequence learning methods like maximum likelihood estimation and reward augmented maximum likelihood. It then describes the softmax policy gradient method which uses a softmax value function to overcome issues with warm starts and sample variance. The method achieves better performance on text summarization and image captioning tasks.
Naver learning to rank question answer pairs using hrde-ltc (NAVER Engineering)
The automatic question answering (QA) task has long been considered a primary objective of artificial intelligence.
Among the QA sub-systems, we focused on the answer-ranking part. In particular, we investigated a novel neural network architecture with an additional data clustering module to improve performance in ranking answer candidates that are longer than a single sentence. This work can be used not only for the QA ranking task, but also to evaluate the relevance of the next utterance given a dialogue generated from a dialogue model.
In this talk, I'll present our research results (NAACL 2018) and their potential use cases (e.g., fake news detection). Finally, I'll conclude by discussing some issues with previous research and introducing recent approaches in academia.
Is Reinforcement Learning (Not) for Natural Language Processing.pdf (Po-Chuan Chen)
The document presents RL4LMs, a library for training language models with reinforcement learning. It introduces RL4LMs, which enables generative models to be optimized with RL algorithms. It also presents the GRUE benchmark for evaluating models, which pairs NLP tasks with reward functions capturing human preferences. Additionally, it introduces the NLPO algorithm that dynamically learns task-specific constraints to reduce the large action space in language generation. The goal is to facilitate research in building RL methods to better align language models with human preferences.
This document provides an overview of an introductory lecture on reinforcement learning. The key points covered include:
- Reinforcement learning involves an agent learning through trial-and-error interactions with an environment by receiving rewards.
- The goal of reinforcement learning is for the agent to select actions that maximize total rewards. This involves making decisions to balance short-term versus long-term rewards.
- Major components of a reinforcement learning agent include its policy, which determines its behavior, its value function which predicts future rewards, and its model which represents its understanding of the environment's dynamics.
This document discusses algorithm-independent machine learning techniques. It introduces concepts like bias and variance, which can quantify how well a learning algorithm matches a problem without depending on a specific algorithm. Methods like cross-validation, bootstrapping, and resampling can be used with different algorithms. While no algorithm is inherently superior, such techniques provide guidance on algorithm use and help integrate multiple classifiers.
Similar to On the Effectiveness of Offline RL for Dialogue Response Generation.pdf (20)
Effective Structured Prompting by Meta-Learning and Representative Verbalizer... (Po-Chuan Chen)
This paper proposes MetaPrompter, which utilizes meta-learning to learn a prompt pool that can generate effective prompts for complex tasks. It also introduces a new soft verbalizer called Representative Verbalizer (RepVerb) that constructs label embeddings from feature embeddings. In experiments on few-shot classification tasks, MetaPrompter outperforms prior meta-prompt tuning methods while requiring significantly fewer parameters.
Quark: Controllable Text Generation with Reinforced [Un]learning.pdf (Po-Chuan Chen)
This document summarizes a research paper titled "Quark: Controllable Text Generation with Reinforced [Un]learning". The paper introduces Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function to (un)learn unwanted properties from large language models. Quark iteratively collects samples, sorts them into quantiles based on reward, and maximizes the likelihood of high-reward samples while regularizing the model to remain close to the original. Experiments show Quark can effectively reduce toxicity, unwanted sentiment, and repetition in generated text.
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible... (Po-Chuan Chen)
The SEEK paper proposes a new method for empathetic dialogue generation that models the emotion flow between utterances in a conversation. It introduces two tasks - fine-grained emotion recognition of each utterance and predicting the emotion of the response. It also models the bi-directional interaction between emotional context and commonsense knowledge selection to generate appropriate responses. Experiments on the EmpatheticDialogues dataset show the SEEK method outperforms baselines in automatic and human evaluations.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transfor... (Po-Chuan Chen)
The document summarizes a paper titled "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". It proposes GPTQ, a new one-shot quantization method that can quantize large generative pre-trained models like GPT-3 with 175 billion parameters to 3-4 bits within a few GPU hours with minimal accuracy loss. GPTQ improves upon existing quantization methods by employing arbitrary weight order, lazy batch updates of the Hessian matrix, and a Cholesky reformulation to scale efficiently to huge models and achieve over 2x higher compression than prior work. Experimental results show GPTQ outperforms baseline quantization and enables extremely accurate models to fit in a single
A Statistical Perspective on Retrieval-Based Models.pdf (Po-Chuan Chen)
This paper presents a statistical perspective on retrieval-based models for classification. It analyzes such models using two different frameworks: local empirical risk minimization and classification in an extended feature space. For local empirical risk minimization, the paper provides assumptions and derives an excess risk bound that decomposes the error of the local model into different terms related to the local vs global optimal risk, sample vs retrieved set risk, generalization error of the local model, and central absolute moment of the local model. It also shows how to tighten the bound by leveraging the local structure of the data distribution.
A Neural Corpus Indexer for Document Retrieval.pdf (Po-Chuan Chen)
The document describes Neural Corpus Indexer (NCI), a sequence-to-sequence neural network that indexes documents by generating relevant document identifiers directly from input queries. NCI represents documents with hierarchical semantic identifiers generated via k-means clustering. It uses a prefix-aware weight-adaptive decoder and consistency-based regularization during training. Experiments on Natural Questions and TriviaQA datasets show NCI outperforms existing retrieval methods by significantly improving recall.
AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.pdf (Po-Chuan Chen)
This document summarizes the AdaMix paper, which proposes a new parameter-efficient fine-tuning method called AdaMix. AdaMix uses a mixture of adaptation modules, where it trains multiple views of the task by randomly routing inputs to different adaptation modules. By tuning only 0.1-0.2% of the model parameters, AdaMix outperforms both full model fine-tuning and other state-of-the-art PEFT methods on various NLU and NLG tasks according to experiments on datasets like GLUE, E2E, WebNLG and DART. AdaMix works by introducing a set of adaptation modules in each transformer layer and applying a stochastic routing policy during training, along with consistency regularization and adaptation
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent... (Po-Chuan Chen)
This paper proposes LLaMA-Adapter, a lightweight method to efficiently fine-tune the LLaMA language model into an instruction-following model. It uses learnable adaption prompts prepended to word tokens in higher transformer layers. Additionally, it introduces zero-initialized attention with a gating mechanism that incorporates instructional signals while preserving pre-trained knowledge. Experiments show LLaMA-Adapter can generate high-quality responses comparable to fully fine-tuned models, and it can be extended to multi-modal reasoning tasks.
Active Retrieval Augmented Generation.pdf (Po-Chuan Chen)
The paper proposes Forward-Looking Active REtrieval augmented generation (FLARE), which iteratively retrieves information during text generation based on the predicted upcoming sentence. FLARE uses the predicted next sentence as a query to retrieve documents if it contains low-confidence tokens, then regenerates the sentence. Experiments show FLARE outperforms baselines on multiple knowledge-intensive tasks. However, FLARE did not significantly improve performance on a short-text dataset, where continual retrieval of disparate information may not be needed.
This document describes a Kaggle competition called Image to Prompts that aims to predict the text prompt for a generated image using a generative text-to-image model. The method uses an ensemble of a Vision Transformer, CLIP Interrogator, and OFA models. Analysis shows the CLIP Interrogator and OFA models generate higher quality prompts than the ViT model. Future work to improve methods includes generating a larger dataset of image-prompt pairs and training customized models on this data.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf (Po-Chuan Chen)
The document describes the RAG (Retrieval-Augmented Generation) model for knowledge-intensive NLP tasks. RAG combines a pre-trained language generator (BART) with a dense passage retriever (DPR) to retrieve and incorporate relevant knowledge from Wikipedia. RAG achieves state-of-the-art results on open-domain question answering, abstractive question answering, and fact verification by leveraging both parametric knowledge from the generator and non-parametric knowledge retrieved from Wikipedia. The retrieved knowledge can also be updated without retraining the model.
Evaluating Parameter Efficient Learning for Generation.pdf (Po-Chuan Chen)
This document summarizes a research paper that evaluated parameter efficient learning methods (PERMs) for natural language generation tasks. The researchers compared PERMs like adapter tuning, prefix tuning, and prompt tuning to finetuning large pre-trained language models on several metrics. Their results showed that PERMs can outperform finetuning with fewer training samples or larger models, and that adapter tuning generalizes best across domains while prefix tuning produces the most faithful generations. The study provides insights into how PERMs can help adapt models with limited data.
Off-Policy Deep Reinforcement Learning without Exploration.pdf (Po-Chuan Chen)
BCQ is an algorithm for off-policy reinforcement learning that combines deep Q-learning with a state-conditioned generative model to produce only previously seen actions from a batch of data. BCQ uses the generative model to propose actions similar to the batch, then selects the highest valued action via a Q-network. It addresses overestimation bias through importance sampling and clipped double Q-learning. Experiments show BCQ achieves state-of-the-art performance on benchmark continuous control and discrete action tasks by constraining behavior to the batch data.
A Mixture-of-Expert Approach to RL-based Dialogue Management.pdf (Po-Chuan Chen)
This document discusses a mixture of experts (MoE) approach for reinforcement learning-based dialogue management. It introduces a MoE language model consisting of: (1) a primitive language model capable of generating diverse utterances, (2) several specialized expert models trained for different intents, and (3) a dialogue manager that selects utterances from the experts. The experts are constructed by training on labeled conversation data. Reinforcement learning is used to train the dialogue manager to optimize long-term dialogue quality by selecting among the expert utterances. Experiments demonstrate the MoE approach can generate more coherent and engaging conversations than single language models.
HyperPrompt: Prompt-based Task-Conditioning of Transformers.pdf (Po-Chuan Chen)
HyperPrompt is a novel architecture that introduces learnable hyper-prompts into the self-attention module of Transformers to enable efficient multi-task learning. HyperPrompt achieves competitive performance compared to strong baselines using only 0.14% additional parameters. It introduces hyper-prompts as global task memories for queries to attend to, and uses hypernetworks to generate layer-specific and task-specific prompts. Experiments on GLUE and SuperGLUE show HyperPrompt outperforms parameter-efficient baselines while maintaining low computational cost for both training and inference.
Training language models to follow instructions with human feedback.pdf (Po-Chuan Chen)
This paper presents InstructGPT, a method for fine-tuning large language models to follow a broad range of written instructions from humans. The researchers first collected a dataset of human demonstrations for various tasks and used it to train an initial supervised model. They then collected human rankings of model outputs to train a reward model and further fine-tuned the supervised model with reinforcement learning to maximize rewards. Evaluation showed the fine-tuned model was preferred by humans over GPT-3 for following instructions while maintaining performance on other tasks.
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines (Christina Lin)
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
A review on techniques and modelling methodologies used for checking electrom... (nooriasukmaningtyas)
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from discrete devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, is confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI, and sensors give misleading values, which can prove fatal in the case of automobiles. In this paper, the authors have non-exhaustively reviewed research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
Batteries: Introduction – types of batteries – discharging and charging of a battery – characteristics of a battery – battery rating – various tests on batteries – primary battery: silver button cell – secondary battery: Ni-Cd battery – modern battery: lithium-ion battery – maintenance of batteries – choice of batteries for electric vehicle applications.
Fuel Cells: Introduction – importance and classification of fuel cells – description, principle, components, and applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell, and direct methanol fuel cells.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL (gerogepatton)
As digital technology becomes more deeply embedded in power systems, protecting the communication networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3) is a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities. Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because the interconnection of these networks makes them vulnerable to a variety of cyberattacks. To address this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion detection in smart grids, combining a Convolutional Neural Network (CNN) with the Long Short-Term Memory (LSTM) algorithm. We employed a recent intrusion detection dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to train and test our model. Our experiments show that the CNN-LSTM method detects smart grid intrusions much better than other deep learning algorithms used for classification. In addition, the proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection accuracy rate of 99.50%.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of the proposed model. These findings underscore the model's competence in precise brain tumor localization and its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, with emphasis on addressing false positives and resource efficiency.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used in wind power conversion systems. First, a doubly fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC), and second-order sliding mode controller (SOSMC). Their results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 (Sinan KOZAK)
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
On the Effectiveness of Offline RL for Dialogue Response Generation.pdf
1. On the Effectiveness of Offline RL for Dialogue Response Generation
ICML, 2023
Paloma Sodhi, Felix Wu, Ethan R. Elenberg et al.
Speaker: Po-Chuan Chen
Dec 12, 2023
2. Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
3. Abstract
4. Abstract
Many language models are trained with teacher forcing (TF), which attempts to match human language exactly, even though identical meanings can be expressed in different ways.
Offline RL, in contrast, shows a clear performance improvement over teacher forcing while not inducing training instability or sacrificing practical training budgets¹.
¹ https://github.com/asappresearch/dialogue-offline-rl
5. Introduction
6. Introduction
Historically, text generation models have typically been trained with teacher forcing (TF) [6], which involves predicting the next token in a sequence to exactly match the human utterance in a ground-truth dataset.
But this is a challenging objective, and addressing it by designing a loss that incorporates human-in-the-loop feedback can be expensive.
7. Contribution
In this paper, they present a comprehensive evaluation of offline RL methods for dialogue text generation and investigate best practices.
They implement three complementary approaches: TF-Top, Decision Transformers (DT) [1], and ILQL [2].
They also find that offline RL methods show a clear performance improvement over teacher forcing, achieving a trade-off in which the generated text stays close enough in meaning to the human target without matching it exactly.
8. Problem Formulation
Dialogue Response Generation as an MDP
Rewards for Dialogue Response Generation
Why Offline Reinforcement Learning?
10. Dialogue Response Generation as an MDP
There is a supervised dataset of context-response pairs {(x_i, y_i)}_{i=1}^N, where the context x is the conversation history and the response y = {y_1, . . . , y_T} is a target sequence of tokens.
Figure 1: Dialogue generation as a Markov Decision Process (MDP)
11. Dialogue Response Generation as an MDP (Cont.)
The goal is to learn a policy π : s_t → a_t maximizing return.
- States: s_t ∈ S is the context x together with the partially generated sequence of tokens up to and including time step t, ŷ_{≤t} := {ŷ_1, . . . , ŷ_t}.
- Actions: a_t ∈ A is the set of next tokens ŷ_{t+1} available from the vocabulary V.
- Transition function: T(s_{t+1} | s_t, a_t) is deterministic, since every state-action pair (ŷ_{≤t}, ŷ_{t+1}) leads to the unique next state ŷ_{≤t+1}.
- Rewards: r_t : S × A → [0, 1] computes the similarity between the generated response ŷ and the target response y.
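To make the formulation concrete, below is a minimal sketch of this token-level MDP in Python (the class and method names are my own illustration, not the paper's):
```python
# A minimal sketch of the token-level MDP (illustrative; names are assumptions).
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DialogueMDP:
    context: List[str]          # conversation history x
    target: List[str]           # target response y
    similarity: Callable[[List[str], List[str]], float]  # reward in [0, 1]
    eos: str = "<eos>"
    generated: List[str] = field(default_factory=list)   # ŷ_{<=t} so far

    def state(self) -> List[str]:
        # State s_t: the context plus the tokens generated up to step t.
        return self.context + self.generated

    def step(self, token: str) -> float:
        # Deterministic transition: appending the chosen token yields the
        # unique next state. The reward is terminal: zero until <eos>.
        self.generated.append(token)
        if token == self.eos:
            return self.similarity(self.generated[:-1], self.target)
        return 0.0
```
Here, similarity can be any sequence-level metric; plugging in a BERTScore-style scorer recovers the terminal reward discussed on the next slide.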
12. Rewards for Dialogue Response Generation
The metric should capture both what the speaker is trying to communicate and the relevance to the conversation.
Definitions for the reward:
- Collecting human-in-the-loop annotations
- Automated metrics: BERTScore [7], BLEURT [5]
They use a terminal reward; the objective is the expected cumulative return over an episode, E_π[Σ_{t=0}^{T} γ^t r_t], assumed undiscounted (γ = 1).
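As an illustration, a terminal reward of this kind could be computed with the open-source bert-score package; this is a sketch under my own assumptions, and the paper's exact metric configuration may differ:
```python
# Sketch of a BERTScore-based terminal reward (pip install bert-score).
from bert_score import score

def terminal_reward(generated: str, target: str) -> float:
    # score() returns (precision, recall, F1) tensors, one entry per pair.
    _, _, f1 = score([generated], [target], lang="en", verbose=False)
    return float(f1[0])

# Because rewards are zero until the final token, the undiscounted return of
# every (state, action) along a sequence equals this terminal value.
```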
13. Why Offline Reinforcement Learning?
For text generation, online reinforcement learning requires the agent
to balance trying out new actions to learn about the environment
against exploiting what it has already learned.
This can be particularly challenging in text generation, as the action
space (i.e., the vocabulary size) is often large.
Another problem is that the reward landscape is sparse; policies can
therefore get stuck during training in local minima where the reward
is persistently zero.
13 / 39
14. On the Effectiveness of Offline RL for Dialogue Response Generation
Problem Formulation
Why Offline Reinforcement Learning?
Why Offline Reinforcement Learning? (Cont.)
Offline RL provides a learning paradigm that combines
Supervised learning’s ability to leverage existing data
General utility optimization power of online reinforcement
learning methods
They collect an offline dataset of state transitions
$\mathcal{D} = \{(s_t^i, a_t^i, r_t^i, s_{t+1}^i)\}_{i=1}^{N}$ using a
behavior policy $\pi_\beta$.
The goal is to learn a policy $\pi$ that maximizes performance on the
dataset while staying close to the behavior policy:
$$\max_{\pi} \; J_{\mathcal{D}}(\pi) - \alpha\, D(\pi, \pi_\beta)$$
14 / 39
15. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Table of contents I
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
Fine Tune on Top Returns
Decision Transformers: Condition on Return
Off-Policy Q-Learning
On-Policy RL: PPO
Comparison between Approaches
15 / 39
16. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Table of contents II
5 Experiments
6 Discussion
16 / 39
17. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Fine Tune on Top Returns
Fine Tune on Top Returns
The simplest approach is to fine-tune a model on “top”
demonstrations, i.e. teacher forcing on top returns (TF-Top).
The gradient update is simply the log-likelihood gradient on the data
subset $\mathcal{D}_{\text{top}}$:
$$\mathbb{E}_{s_t, a_t \sim \mathcal{D}_{\text{top}}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \quad \mathcal{D}_{\text{top}} = \{(s_t, a_t) \in \mathcal{D} \mid \hat{Q}(s_t, a_t) \ge 1 - \delta\}$$
Here $\delta$ is obtained by taking the top percentile of all returns
$\hat{Q}(s_t, a_t)$; because the reward is terminal, the return for any
token along a sequence is the same as the final reward received at the
end of that sequence (see the sketch below).
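A minimal sketch of the TF-Top selection step, assuming a flat record layout and a 10% cut-off (both illustrative choices):

```python
# Sketch: select the top-return subset D_top, then fine-tune with
# ordinary teacher forcing on only the kept demonstrations.
import numpy as np


def select_top(dataset: list, top_percent: float = 10.0) -> list:
    """Each record is assumed to look like
    {"context": str, "response": str, "return": float}; every token of a
    sequence shares that sequence's terminal return."""
    returns = np.array([d["return"] for d in dataset])
    cutoff = np.percentile(returns, 100.0 - top_percent)  # plays the role of 1 - δ
    return [d for d in dataset if d["return"] >= cutoff]
```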
17 / 39
18. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return
Decision Transformer (DT) learns the return-conditional distribution of
actions in each state, then defines a policy by sampling from the
distribution of actions that receive high returns.
Given a data point $(s_t, a_t)$, they take its return
$\hat{Q}(s_t, a_t)$, tokenize it, and then fine-tune a model
conditioned on this return token.
The gradient update is simply the log-likelihood gradient,
$$\mathbb{E}_{s_t, a_t \sim \mathcal{D}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t, \hat{Q}(s_t, a_t))\right]$$
At test time, they condition the model on the highest return
$\hat{Q}_{\text{top}}$ (see the sketch below).
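A sketch of return conditioning in code; the `<ret=k>` control-token format and the 10-bin discretization are illustrative assumptions, not the paper's exact tokenization:

```python
# Sketch: Decision Transformer style return conditioning for text.
def make_dt_example(context: str, response: str, ret: float, n_bins: int = 10) -> str:
    bin_id = min(int(ret * n_bins), n_bins - 1)  # discretize return in [0, 1]
    return f"<ret={bin_id}> {context} {response}"  # teacher-force on this string


# Test time: prepend the highest return bin to ask for the best behavior.
best_bin = 10 - 1
prompt = f"<ret={best_bin}> Customer: my order never arrived. Agent:"
```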
18 / 39
19. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return (Cont.)
Figure 2: Decision Transformer architecture
19 / 39
20. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Decision Transformers: Condition on Return
Decision Transformers: Condition on Return (Cont.)
One advantage of decision transformer over fine-tuning on top returns
is that the model is trained to explicitly learn a decision boundary
between different returns.
However, both approaches have the theoretical drawback of requiring
“trajectory coverage”.
Trajectory coverage
The training dataset must contain trajectories starting from the
initial state $s_0$ that achieve high return. As a result, the number
of data points needed increases exponentially with the length of the
trajectory (with vocabulary $\mathcal{V}$ and length $T$, there are up
to $|\mathcal{V}|^T$ distinct trajectories).
20 / 39
21. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Off-Policy Q-Learning
Off-Policy Q-Learning
Here, they use an offline variant of Q-learning: Implicit Q-Learning
(ILQL).
ILQL adds two extra heads to the pre-trained model: the action-value
head $Q_\theta(s_t, a_t)$, which scores the utility of a token $a_t$
given a sequence $s_t$, and the state-value head $V_\psi(s_t)$, which
scores the value of the sequence $s_t$.
The implicit policy is defined as
$$\pi_\theta(a_t \mid s_t) = \pi_\beta(a_t \mid s_t)\, \exp\!\big(\eta\,(Q_\theta(s_t, a_t) - V_\psi(s_t))\big)$$
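In code, this implicit policy amounts to shifting the behavior policy's logits by the scaled advantage (a sketch with assumed tensor shapes):

```python
# Sketch: ILQL's implicit policy as a reweighting of π_β's logits.
import torch


def implicit_policy_logits(behavior_logits: torch.Tensor,  # [batch, vocab], π_β
                           q_values: torch.Tensor,         # [batch, vocab], Q head
                           v_values: torch.Tensor,         # [batch, 1], V head
                           eta: float = 1.0) -> torch.Tensor:
    advantage = q_values - v_values            # Q(s_t, a_t) − V(s_t) per token
    # log π_β + η · advantage; the normalizing constant is absorbed on sampling.
    return behavior_logits + eta * advantage


# Sampling: torch.distributions.Categorical(logits=implicit_policy_logits(...))
```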
21 / 39
22. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Off-Policy Q-Learning
Off-Policy Q-Learning (Cont.)
The gradient update is
$$\mathbb{E}_{s_t, a_t, s_{t+1} \sim \mathcal{D}}\Big[\nabla_\theta Q_\theta(s_t, a_t)\,\underbrace{\big(r(s_t, a_t) + V_\psi(s_{t+1}) - Q_\theta(s_t, a_t)\big)}_{\text{Temporal Difference Error}}\Big] - \alpha\, \mathbb{E}_{s_t \sim \mathcal{D}}\, \nabla_\theta\, \mathrm{KL}\big(\pi_\beta(\cdot \mid s_t)\,\big\|\,\pi_\theta(\cdot \mid s_t)\big)$$
This paper improves upon the original ILQL by regularizing against the
logits of the pre-trained TF policy $\pi_\beta$ instead of the
demonstration data $\mathcal{D}$, which is better suited to settings
where we may not have much demonstration data (see the loss sketch
below).
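A sketch of the corresponding loss; the semi-gradient TD term and the KL regularizer toward the frozen TF logits follow the update above, while other ILQL details (e.g., expectile regression for $V_\psi$) are omitted:

```python
# Sketch of the ILQL-style update: TD loss on the Q head plus a KL
# regularizer toward the frozen pre-trained TF policy's logits.
import torch
import torch.nn.functional as F


def ilql_loss(q_sa: torch.Tensor,          # Q_θ(s_t, a_t), shape [batch]
              v_next: torch.Tensor,        # V_ψ(s_{t+1}), shape [batch]
              reward: torch.Tensor,        # r(s_t, a_t), shape [batch]
              policy_logits: torch.Tensor, # π_θ logits, [batch, vocab]
              tf_logits: torch.Tensor,     # frozen TF (π_β) logits, [batch, vocab]
              alpha: float = 0.1) -> torch.Tensor:
    td_target = (reward + v_next).detach()   # treat the TD target as constant
    td_loss = F.mse_loss(q_sa, td_target)    # gradient ∝ TD-error · ∇Q
    kl = F.kl_div(F.log_softmax(policy_logits, dim=-1),
                  F.log_softmax(tf_logits, dim=-1),
                  log_target=True, reduction="batchmean")  # KL(π_β ‖ π_θ)
    return td_loss + alpha * kl
```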
22 / 39
23. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
On-Policy RL: PPO
On-Policy RL: PPO
In this paper, they also compare against an online RL algorithm:
Proximal Policy Optimization [4].
The gradient update is
$$\mathbb{E}_{s_t, a_t \sim \pi_{\theta_{\text{old}}}}\left[\nabla_\theta\, \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A(s_t, a_t)\right]$$
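A sketch of this surrogate as a training loss; the clipping term from full PPO [4] is included for completeness, even though the slide shows only the unclipped ratio:

```python
# Sketch: PPO surrogate loss on importance-weighted advantages.
import torch


def ppo_loss(logp_new: torch.Tensor,   # log π_θ(a_t | s_t), [batch]
             logp_old: torch.Tensor,   # log π_θ_old(a_t | s_t), [batch]
             advantage: torch.Tensor,  # A(s_t, a_t), [batch]
             clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Negative sign: minimizing this loss ascends the surrogate objective.
    return -torch.mean(torch.min(ratio * advantage, clipped * advantage))
```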
23 / 39
24. On the Effectiveness of Offline RL for Dialogue Response Generation
Approach
Comparison between Approaches
Comparison between Approaches
When are DT and Q-learning comparable?
Q-learning's main advantage over return conditioning is its ability to
stitch together parts of suboptimal trajectories. For MDPs where such
stitching is not possible, e.g., a tree, DT and ILQL are comparable in
performance. They hypothesize that dialogue text generation belongs to
this class of MDPs.
When are DT and TF Top comparable?
DT should be expected to do better than TF Top only when the data TF
Top throws away provides valuable information.
If that information is already captured by the base TF model, then DT
and TF Top are likely to perform similarly.
24 / 39
25. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
Experimental Setup
Results and Analysis
6 Discussion
25 / 39
26. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Experimental Setup
Experimental Setup
They evaluate offline RL methods using three task-oriented dialogue
datasets.
MultiWOZ 2.2, which is a widely used dataset created to evaluate
the performance of dialogue systems in multi-domain settings.
Action Based Conversations Dataset, which contains
customer-agent conversations where the agent’s goal is to solve a
customer problem.
TaskMaster-3, which contains conversations between users and
a system on movie ticketing.
26 / 39
27. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Experimental Setup
Baseline and Metrics
They choose a terminal binary reward, BERTCLICK, which is a
thresholded BERTScore with threshold 0.6.
They evaluate on a range of automated similarity metrics shown to have
a high correlation with human judgments, such as BERTScore, BLEURT,
METEOR, and BLEU.
Baselines: TF, TF All, TF Top, DT, ILQL, and PPO.
For base models they study GPT2-Medium² and DistilGPT2³, which have
355M and 82M parameters, respectively.
² https://huggingface.co/gpt2-medium
³ https://huggingface.co/distilgpt2
27 / 39
28. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Experimental Setup
Training Process
1 They train the TF model on all the training data.
2 Then, they use this trained TF model to generate an offline RL
dataset.
3 Finally, they fine-tune the different RL models on varying
percentages of the generated offline RL data (sketched below).
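A sketch of step 2, building the offline RL dataset from the trained TF model (`sample_response` and `terminal_reward` are placeholders for actual decoding and for the BERTScore reward sketched earlier):

```python
# Sketch of the pipeline's data-generation step.
def sample_response(tf_model, context: str) -> str:
    return tf_model(context)  # placeholder for decoding from the TF model


def terminal_reward(generated: str, target: str) -> float:
    return float(generated == target)  # placeholder similarity in [0, 1]


def build_offline_dataset(tf_model, pairs, num_samples: int = 4) -> list:
    """Step 2: sample responses per context and score them against the target."""
    dataset = []
    for context, target in pairs:
        for _ in range(num_samples):
            response = sample_response(tf_model, context)
            dataset.append({"context": context,
                            "response": response,
                            "return": terminal_reward(response, target)})
    return dataset
```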
28 / 39
29. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
Results and Analysis
Table 1: Comparison across different methods on average metrics and
dataset size with DistilGPT2. 20% and 80% refer to the percentage of
the data used for fine-tuning the offline RL methods.
29 / 39
30. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How does performance vary across multiple responses?
TF optimizes for recall, so with multiple responses, it should be able
to reach the performance of offline RL methods.
Figure 3: Average BERTCLICK over top-k responses
30 / 39
31. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How do improvements look qualitatively to human
evaluators?
Figure 4: Human evaluation (similarity and relevance) of TF, TF Top, DT on
100 examples with 2 representative examples presented.
31 / 39
32. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How does offline RL compare with PPO?
Table 2: Comparison of offline RL (DT) against online RL (PPO).
32 / 39
33. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
How does ILQL critic perform as a ranker?
Table 3: Comparison when ranking responses generated by the base TF
model.
33 / 39
34. On the Effectiveness of Offline RL for Dialogue Response Generation
Experiments
Results and Analysis
Can online data collection help DT?
They compare with Quark [3], which can be viewed as an online
counterpart to DT. Its performance depends on how good a coverage the
samples from the base TF model provide.
Figure 5: Average BERTCLICK for DT vs Quark
34 / 39
35. On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
Table of contents
1 Abstract
2 Introduction
3 Problem Formulation
4 Approach
5 Experiments
6 Discussion
35 / 39
36. On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
Discussion
In this paper, they examine the effectiveness of offline RL methods for
generating dialogue text.
This paper found that:
1 Offline RL models learn to produce text that is close enough in
meaning to the human target.
2 Decision Transformer is a practical choice.
3 Future directions include learning reward functions from human
feedback and extending to multi-turn dialogue.
36 / 39
37. On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
Limitations
This paper did not consider large language models, so it is possible
that its findings do not generalize to large-scale models with
billions of parameters.
37 / 39
38. On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
References I
[1] Lili Chen et al. “Decision Transformer: Reinforcement Learning
via Sequence Modeling”. In: Advances in Neural Information
Processing Systems. Ed. by M. Ranzato et al. Vol. 34. Curran
Associates, Inc., 2021, pp. 15084–15097. url:
https://proceedings.neurips.cc/paper_files/paper/2021/file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf.
[2] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline
Reinforcement Learning with Implicit Q-Learning. 2021. arXiv:
2110.06169 [cs.LG].
[3] Ximing Lu et al. Quark: Controllable Text Generation with
Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL].
38 / 39
39. On the Effectiveness of Offline RL for Dialogue Response Generation
Discussion
References II
[4] John Schulman et al. Proximal Policy Optimization Algorithms.
2017. arXiv: 1707.06347 [cs.LG].
[5] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. “BLEURT:
Learning Robust Metrics for Text Generation”. In: Proceedings
of ACL. 2020.
[6] Ronald J. Williams and David Zipser. “A Learning Algorithm
for Continually Running Fully Recurrent Neural Networks”. In:
Neural Comput. 1.2 (1989), pp. 270–280. issn: 0899-7667. doi:
10.1162/neco.1989.1.2.270. url:
https://doi.org/10.1162/neco.1989.1.2.270.
[7] Tianyi Zhang et al. BERTScore: Evaluating Text Generation with
BERT. 2020. arXiv: 1904.09675 [cs.CL].
39 / 39