Hello~! :)
While studying the Sutton & Barto book, the classic textbook for Reinforcement Learning, I created slides about Chapter 2, Multi-armed Bandits.
If you find any mistakes, I would appreciate your feedback.
Thank you.
2. Reference
- Reinforcement Learning : An Introduction,
Richard S. Sutton & Andrew G. Barto
(Link)
- Multi-Armed Bandits (Korean blog post), Chris Song
(https://brunch.co.kr/@chris-song/62)
3. Index
1. Why do you need to know Multi-armed Bandits?
2. A k-armed Bandit Problem
3. Simple-average Action-value Methods
4. A simple Bandit Algorithm
5. Gradient Bandit Algorithm
6. Summary
4. 1. Why do you need to know
Multi-armed Bandits(MAB)?
5. 1. Why do you need to know MAB?
- Reinforcement Learning (RL) uses training information that
"evaluates" the actions taken, rather than "instructs" by giving correct actions
- Evaluative feedback indicates how good the action taken was
- For simplicity, we consider the "nonassociative" setting: acting in only one situation
- Most prior work with evaluative feedback was done in this setting
- "Nonassociative" + "evaluative feedback" -> MAB
- MAB introduces basic learning methods that are extended in later chapters
6. 1. Why do you need to know MAB?
In my opinion,
- We can't really understand RL without understanding MAB.
- MAB deals with exploitation vs. exploration, one of the core ideas in RL.
- The ideas behind MAB appear throughout the full reinforcement learning problem.
- MAB is also useful in many practical fields on its own.
8. 2. A k-armed Bandit Problem
Do you know what MAB is?
9. Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
- A slot machine -> a bandit
- A slot machine's lever -> an arm
- N slot machines
-> Multi-armed Bandits
10. Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
Among the various slot machines,
which machine should I put my money in
and whose lever should I pull?
11. Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
How can you make the best return
on your investment?
12. Do you know what MAB is?
Source : Multi-Armed Bandit
Image source : https://brunch.co.kr/@chris-song/62
MAB is an algorithm created to
optimize investment across slot machines
13. 2. A k-armed Bandit Problem
A K-armed Bandit Problem
14. A K-armed Bandit Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
- $t$: discrete time step or play number
- $A_t$: action at time step $t$
- $R_t$: reward at time step $t$
- $q_*(a)$: true value (expected reward) of action $a$
15. A K-armed Bandit Problem
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
In our k-armed bandit problem, each of the k actions has
an expected or mean reward given that that action is selected;
let us call this the value of that action.
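In symbols, this is the definition from the book: the value of an arbitrary action $a$ is the expected reward given that $a$ is selected.

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$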
18. Simple-average Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$Q_t(a)$ converges to $q_*(a)$
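The sample-average estimate behind this slide, as written in the book (Eq. 2.1): each action's value is estimated by averaging the rewards received when that action was selected.

$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$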
22. Greedy Action Selection Method
$\operatorname{argmax}_a Q_t(a)$: a value of $a$ at which $Q_t(a)$ takes its maximal value
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
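The greedy action selection rule itself, from the book:

$$A_t \doteq \operatorname{argmax}_a Q_t(a)$$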
23. Greedy Action Selection Method
Greedy action selection always exploits
current knowledge to maximize immediate reward
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
24. Greedy Action Selection Method
Greedy Action Selection Method's disadvantage
Is it a good idea to always select the greedy action,
exploiting current knowledge,
and maximize only the immediate reward?
26. ε-greedy Action Selection Method
Exploitation is the right thing to do to maximize
the expected reward on the one step,
but Exploration may produce the greater total reward in the long run.
27. ε-greedy Action Selection Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$\varepsilon$: probability of taking a random action in an $\varepsilon$-greedy policy
28. ε-greedy Action Selection Method
Exploitation
Exploration
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
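As a minimal sketch (an illustration, not from the slides), ε-greedy action selection could look like this in Python; the array `q_values` of current estimates is an assumed input:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a random action; otherwise be greedy."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: current best action

# Example: 4-armed bandit with current estimates, epsilon = 0.1
print(epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.4]), epsilon=0.1))
```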
30. Upper-Confidence-Bound(UCB) Action Selection Method
- $\ln t$: natural logarithm of $t$
- $N_t(a)$: the number of times that action $a$ has been selected prior to time $t$
- $c$: a number $c > 0$ that controls the degree of exploration
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
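The UCB action selection rule these terms belong to, as given in the book (Eq. 2.10):

$$A_t \doteq \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$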
31. Upper-Confidence-Bound(UCB) Action Selection Method
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The idea of this UCB action selection is that
the square-root term is a measure of
the uncertainty (or potential) in the estimate of $a$'s value:
it reflects how likely it is that this slot machine
may still be the optimal slot machine
32. Upper-Confidence-Bound(UCB) Action Selection Method
UCB Action Selection Methodโs disadvantage
UCB is more difficult than ε-greedy to extend beyond bandits
to the more general reinforcement learning settings
One difficulty is in dealing with nonstationary problems
Another difficulty is dealing with large state spaces
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
34. 4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
35. 4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
36. Incremental Implementation
$Q_n$ denotes the estimate of an action's value
after the action has been selected $n - 1$ times
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
37. Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
38. Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
$$Q_{n+1} = Q_n + \frac{1}{n} \left[ R_n - Q_n \right]$$
holds even for $n = 1$,
obtaining $Q_2 = R_1$ for arbitrary $Q_1$
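A minimal Python sketch of this incremental update inside the simple bandit loop; the Gaussian reward model and the hypothetical `q_true` means are assumptions for illustration:

```python
import numpy as np

def simple_bandit(q_true, epsilon=0.1, steps=1000, rng=None):
    """Epsilon-greedy bandit with incremental sample-average updates."""
    rng = rng or np.random.default_rng()
    k = len(q_true)
    Q = np.zeros(k)  # action-value estimates Q(a)
    N = np.zeros(k)  # selection counts N(a)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))      # explore
        else:
            a = int(np.argmax(Q))         # exploit
        R = rng.normal(q_true[a], 1.0)    # assumed noisy Gaussian reward
        N[a] += 1
        Q[a] += (R - Q[a]) / N[a]         # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
    return Q

print(simple_bandit(q_true=[0.2, 0.8, 0.5]))
```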
39. Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
40. Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
41. Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
The step size $\frac{1}{n}$ is not constant (it changes every step)
and is suitable for stationary problems
42. Incremental Implementation
Source : Reinforcement Learning : An Introduction, Richard S. Sutton & Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
43. Incremental Implementation
The general form of the update rule is
NewEstimate ← OldEstimate + StepSize [ Target - OldEstimate ].
The expression [ Target - OldEstimate ] is an error in the estimate.
The target is presumed to indicate a desirable direction
in which to move, though it may be noisy.
44. 4. A simple Bandit Algorithm
- Incremental Implementation
- Tracking a Nonstationary Problem
45. Tracking a Nonstationary Problem
46. Tracking a Nonstationary Problem
Why do you think it should be changed from 1/n to α?
47. Tracking a Nonstationary Problem
Why do you think it should be changed from 1/n to α?
We often encounter RL problems that are effectively nonstationary.
In such cases it makes sense to give more weight
to recent rewards than to long-past rewards.
One of the most popular ways of doing this is to use
a constant step-size parameter.
The step-size parameter α ∈ (0, 1] is constant.
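A minimal Python sketch of tracking with a constant step size; the random-walk testbed here is an illustrative assumption:

import random

def update_constant_alpha(Q, R, alpha=0.1):
    # Q_{n+1} = Q_n + alpha * (R_n - Q_n): an exponential
    # recency-weighted average that keeps tracking a moving target.
    return Q + alpha * (R - Q)

q_true, Q = 0.0, 0.0
for t in range(1000):
    q_true += random.gauss(0, 0.01)   # the true value drifts
    R = q_true + random.gauss(0, 1)   # noisy reward
    Q = update_constant_alpha(Q, R)   # estimate follows the drift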
48. Tracking a Nonstationary Problem
49. Tracking a Nonstationary Problem
Q_{n+1} = Q_n + α (R_n - Q_n)
        = Q_n + α R_n - α Q_n
        = α R_n + (1 - α) Q_n
Does the same hold one step earlier, Q_n = α R_{n-1} + (1 - α) Q_{n-1}?
∴ Q_n = α R_{n-1} + (1 - α) Q_{n-1}
50. Tracking a Nonstationary Problem
Unrolling the recursion gives the exponential recency-weighted average:
Q_{n+1} = (1 - α)^n Q_1 + Σ_{i=1..n} α (1 - α)^{n-i} R_i
52. Tracking a Nonstationary Problem
Sequences of step-size parameters that meet the convergence conditions
(Σ_n α_n(a) = ∞ and Σ_n α_n(a)^2 < ∞) often converge
very slowly or need considerable tuning
in order to obtain a satisfactory convergence rate,
so in practice the step-size schedule must be tuned carefully.
54. 5. Gradient Bandit Algorithm
In addition to the simple bandit algorithm above,
there is another approach: a bandit algorithm based on the gradient method.
55. 5. Gradient Bandit Algorithm
We consider learning a numerical preference
for each action a, which we denote H_t(a).
The larger the preference, the more often that action is taken,
but the preference has no interpretation in terms of reward.
In other words, a large preference H_t(a) does not by itself mean a large reward;
however, a large reward can increase the preference H_t(a).
56. 5. Gradient Bandit Algorithm
The action probabilities are determined according to
a soft-max distribution (i.e., Gibbs or Boltzmann distribution):
π_t(a) = Pr{A_t = a} = e^{H_t(a)} / Σ_b e^{H_t(b)}
57. 5. Gradient Bandit Algorithm
π_t(a) - probability of selecting action a at time t
Initially all action preferences are the same (e.g., H_1(a) = 0 for all a),
so that all actions have an equal probability of being selected.
58. 5. Gradient Bandit Algorithm
There is a natural learning algorithm for this setting
based on the idea of stochastic gradient ascent.
On each step, after selecting action A_t and receiving the reward R_t,
the action preferences are updated:
H_{t+1}(A_t) = H_t(A_t) + α (R_t - R̄_t)(1 - π_t(A_t))   (the selected action A_t)
H_{t+1}(a) = H_t(a) - α (R_t - R̄_t) π_t(a)   (all non-selected actions a ≠ A_t)
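A minimal Python sketch of one gradient-bandit step; reward_fn and the demo's Gaussian arms are illustrative assumptions, not from the slides:

import math
import random

def softmax(H):
    # pi_t(a) = exp(H[a]) / sum_b exp(H[b]); shift by max for stability.
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_step(H, baseline, t, reward_fn, alpha=0.1):
    pi = softmax(H)
    A = random.choices(range(len(H)), weights=pi)[0]  # sample A_t ~ pi_t
    R = reward_fn(A)
    baseline += (R - baseline) / t    # incremental average of rewards (R_bar)
    for a in range(len(H)):
        if a == A:
            H[a] += alpha * (R - baseline) * (1 - pi[a])  # selected action
        else:
            H[a] -= alpha * (R - baseline) * pi[a]        # others move oppositely
    return H, baseline

# Usage: arm a pays mean a, so preferences should come to favor the last arm.
H, baseline = [0.0] * 3, 0.0
for t in range(1, 501):
    H, baseline = gradient_bandit_step(H, baseline, t,
                                       lambda a: random.gauss(a, 1))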
59. 5. Gradient Bandit Algorithm
1) What does R̄_t mean?
2) What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
61. What does R̄_t mean?
R̄_t ∈ ℝ is the average of all the rewards.
The R̄_t term serves as a baseline:
if the reward is higher than the baseline,
then the probability of taking A_t in the future is increased,
and if the reward is below the baseline, then the probability is decreased.
The non-selected actions move in the opposite direction.
63. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
- The gradient bandit algorithm is a stochastic approximation to gradient ascent:
  in exact gradient ascent each preference would be updated as
  H_{t+1}(a) = H_t(a) + α ∂E[R_t]/∂H_t(a)
- The expected reward is E[R_t] = Σ_x π_t(x) q_*(x)
- By the law of total expectation, E[R_t] = E[ E[R_t | A_t] ]
64. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
65. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
66. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
The gradient sums to zero over all the actions:
Σ_x ∂π_t(x)/∂H_t(a) = 0
→ as H_t(a) is changed, some actions' probabilities go up
and some go down, but the sum of the changes must be
zero because the sum of the probabilities is always one.
67. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
68. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
E[R_t] = E[ E[R_t | A_t] ]
69. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
Please refer to page 40 in the linked reference slides.
70. What does (R_t - R̄_t)(1 - π_t(A_t)) mean?
∴ We can substitute a sample of the expectation above
for the performance gradient in
H_{t+1}(a) = H_t(a) + α ∂E[R_t]/∂H_t(a).
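For reference, a compact summary of the key identities in this derivation (Section 2.8 of the book); 1_{a=x} denotes the indicator function:

∂π_t(x)/∂H_t(a) = π_t(x) (1_{a=x} - π_t(a))
∂E[R_t]/∂H_t(a) = E[ (R_t - R̄_t)(1_{a=A_t} - π_t(a)) ]
H_{t+1}(a) = H_t(a) + α (R_t - R̄_t)(1_{a=A_t} - π_t(a))

The last line reduces to the two-case update rule on slide 58: the selected action uses 1 - π_t(A_t), and every other action uses -π_t(a).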
74. 6. Summary
A simple bandit algorithm : Incremental Implementation
76. 6. Summary
A simple bandit algorithm : Tracking a Nonstationary Problem
78. 6. Summary
Gradient Bandit Algorithm
79. 6. Summary
A parameter study of the various bandit algorithms