Reinforcement learning is a method for learning behaviors through trial-and-error interactions with an environment. The goal is to maximize a numerical reward signal by discovering the actions that yield the most reward. The learner is not told which actions to take directly, but must instead determine which actions are best by trying them out. This document outlines reinforcement learning concepts like exploration versus exploitation, where exploration involves trying non-optimal actions to gain more information, while exploitation uses current knowledge to choose optimal actions. It also discusses formalisms like Markov decision processes and the tradeoff between maximizing short-term versus long-term rewards in reinforcement learning problems.
Reinforcement Learning (RL) approaches deal with finding an optimal, reward-based policy for acting in an environment (talk in English).
However, what has led to its widespread use is its combination with deep neural networks (DNNs), i.e., deep reinforcement learning (Deep RL). Recent successes, from not only learning to play games but surpassing human players at them, to academia-industry research collaborations on manipulation of objects, locomotion skills, smart grids, etc., have demonstrated its value on a wide variety of challenging tasks.
With applications spanning games, robotics, dialogue, healthcare, marketing, energy and many more domains, Deep RL might just be the power that drives the next generation of Artificial Intelligence (AI) agents!
In this talk we discuss the application of Reinforcement Learning to games. Recently, OpenAI created an algorithm capable of beating a human team at DOTA, a game considered to involve a great amount of complexity and strategy. We'll evaluate the role Reinforcement Learning plays in the world of games, look at some of the main achievements and what they look like in terms of implementation, and review some of the history of AI applied to games and how things have evolved over time.
Deep Reinforcement Learning talk at PI School, covering the following contents:
1- Deep Reinforcement Learning
2- Q-Learning
3- Deep Q-Learning (DQN)
4- Google DeepMind paper (DQN for Atari)
Deep Reinforcement Learning and Its Applications (Bill Liu)
What is the most exciting AI news in recent years? AlphaGo!
What are key techniques for AlphaGo? Deep learning and reinforcement learning (RL)!
What are application areas for deep RL? A lot! In fact, besides games, deep RL has achieved tremendous results in diverse areas like recommender systems and robotics.
In this talk, we will introduce deep reinforcement learning, present several applications, and discuss issues and potential solutions for successfully applying deep RL in real life scenarios.
https://www.aicamp.ai/event/eventdetails/W2021042818
Lecture slides from DASI, spring 2018, National Cheng Kung University, Taiwan. The content covers deep reinforcement learning: policy gradients, including variance reduction and importance sampling.
[1312.5602] Playing Atari with Deep Reinforcement Learning (Seung Jae Lee)
Presentation slides for 'Playing Atari with Deep Reinforcement Learning' by Mnih et al.
You can find more presentation slides on my website:
https://www.endtoend.ai
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Maximum Entropy Reinforcement Learning (Stochastic Control) (Dongmin Lee)
I reviewed the following papers.
- T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
Thank you.
Hello~! :)
While studying the Sutton and Barto book, the standard textbook for Reinforcement Learning, I created slides about Multi-armed Bandits (Chapter 2).
If there are any mistakes, I would appreciate your feedback right away.
Thank you.
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity (Hung Le)
Despite remarkable successes in various domains such as robotics and games, Reinforcement Learning (RL) still struggles with exploration inefficiency. For example, in hard Atari games, state-of-the-art agents often require billions of trial actions, equivalent to years of practice, while a moderately skilled human player can achieve the same score in just a few hours of play. This contrast emerges from the difference in exploration strategies: humans leverage memory, intuition and experience, while current RL agents rely primarily on random trial and error. This tutorial reviews recent advances in enhancing RL exploration efficiency through intrinsic motivation or curiosity, allowing agents to navigate environments without external rewards. Unlike previous surveys, we analyze intrinsic motivation through a memory-centric perspective, drawing parallels between human and agent curiosity, and providing a memory-driven taxonomy of intrinsic motivation approaches.
The talk consists of three main parts. Part A provides a brief introduction to RL basics, delves into the historical context of the explore-exploit dilemma, and raises the challenge of exploration inefficiency. In Part B, we present a taxonomy of self-motivated agents leveraging deliberate, RAM-like, and replay memory models to compute surprise, novelty, and goal, respectively. Part C explores advanced topics, presenting recent methods using language models and causality for exploration. Whenever possible, case studies and hands-on coding demonstrations will be presented.
Matineh Shaker, Artificial Intelligence Scientist, Bonsai, at MLconf SF 2017 (MLconf)
Deep Reinforcement Learning with Shallow Trees:
In this talk, I present Concept Network Reinforcement Learning (CNRL), developed at Bonsai. It is an industrially applicable approach to solving complex tasks using reinforcement learning, which facilitates problem decomposition, allows component reuse, and simplifies reward functions. Inspired by Sutton’s options framework, we introduce the notion of “Concept Networks” which are tree-like structures in which leaves are “sub-concepts” (sub-tasks), representing policies on a subset of state space. The parent (non-leaf) nodes are “Selectors”, containing policies on which sub-concept to choose from the child nodes, at each time during an episode. There will be a high-level overview on the reinforcement learning fundamentals at the beginning of the talk.
Bio: Matineh Shaker is an Artificial Intelligence Scientist at Bonsai in Berkeley, CA, where she builds machine learning, reinforcement learning, and deep learning tools and algorithms for general purpose intelligent systems. She was previously a Machine Learning Researcher at Geometric Intelligence, Data Science Fellow at Insight Data Science, Predoctoral Fellow at Harvard Medical School. She received her PhD from Northeastern University with a dissertation in geometry-inspired manifold learning.
This presentation contains an introduction to reinforcement learning, a comparison with other learning approaches, an introduction to Q-Learning, and some applications of reinforcement learning in video games.
https://telecombcn-dl.github.io/2017-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
Speaker: 곽동현 (PhD student at Seoul National University, currently at NAVER Clova)
An overview of reinforcement learning and an introduction to recent deep-learning-based RL trends.
Presentation video:
http://tv.naver.com/v/2024376
https://youtu.be/dw0sHzE1oAc
An efficient use of temporal difference technique in Computer Game Learning (Prabhu Kumar)
A computer game that uses the temporal-difference algorithm from machine learning, improving the computer's ability to learn and to explore the best next move by combining greedy movement techniques with exploration techniques over future game states.
TensorFlow and Deep Learning Tips and Tricks (Ben Ball)
Presented at https://www.meetup.com/TensorFlow-and-Deep-Learning-Singapore/events/241183195/ . Tips and Tricks for using Tensorflow with Deep Reinforcement Learning.
See our blog for more information at http://prediction-machines.com/blog/
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ... (Ashwin Rao)
I am pleased to introduce a new and exciting course, as part of ICME at Stanford University. I will be teaching CME 241 (Reinforcement Learning for Stochastic Control Problems in Finance) in Winter 2019.
25 introduction reinforcement_learning
1. Introduction to Artificial Intelligence
Reinforcement Learning
Andres Mendez-Vazquez
April 26, 2019
2. Outline
1 Reinforcement Learning
   Introduction
   Example
   A K-Armed Bandit Problem
   Exploration vs Exploitation
   The Elements of Reinforcement Learning
   Limitations
2 Formalizing Reinforcement Learning
   Introduction
   Markov Process and Markov Decision Process
   Return, Policy and Value function
3 Solution Methods
   Multi-armed Bandits
   Problems of Implementations
   General Formula in Reinforcement Learning
   A Simple Bandit Algorithm
   Dynamic Programming
   Policy Evaluation
   Policy Iteration
   Policy and Value Improvement
   Example
   Temporal-Difference Learning
   A Small Review of Monte Carlo Methods
   The Main Idea
   The Monte Carlo Principle
   Going Back to Temporal Difference
   Q-learning: Off-Policy TD Control
4 The Neural Network Approach
   Introduction, TD-Gammon
   TD(λ) by Back-Propagation
   Conclusions
4. Reinforcement Learning is learning what to do
What does Reinforcement Learning want?
Maximize a numerical reward signal.
The learner is not told which actions to take; it must discover which actions yield the most reward.
6. These ideas come from Dynamical Systems Theory
Specifically
As the optimal control of incompletely-known Markov Decision Processes (MDP).
Thus, the device must do the following
Interact with the environment to learn by using punishments and rewards.
9. A mobile robot
It decides whether it should enter a new room in search of more trash to collect, or go back to charge its battery.
It needs to take this decision based on:
the current charge level of its battery
how quickly it can arrive at its base
12. A K-Armed Bandit Problem
Definition
You are faced repeatedly with a choice among k different options, or actions.
After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the chosen action.
Definition (Stationary Probability)
A stationary distribution of a Markov chain is a probability distribution that remains unchanged as the Markov chain progresses in time.
16. Therefore
Each of the k actions has an expected value
$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$
Something Notable
If you knew the value of each action, it would be trivial to solve the k-armed bandit problem.
18. What do we want?
To find the estimated value of action a at time t, $Q_t(a)$, such that
$|Q_t(a) - q_*(a)| < \epsilon$ with $\epsilon$ as small as possible
20. Greedy Actions
Intuitions
To choose the action with the greatest $\mathbb{E}[R_t \mid A_t = a]$
This is known as Exploitation
You are exploiting your current knowledge of the values of the actions.
Contrasting to Exploration
When you select the non-greedy actions to obtain better knowledge of their values.
23. An Interesting Conundrum
It has been observed that
Exploitation maximizes the expected reward on the one step,
Exploration may produce the greater total reward in the long run.
Thus, the idea
The value of a state s is the total amount of reward an agent can expect to accumulate over the future, starting from s.
Thus, Values
They indicate the long-term profit of states after taking into account the states that follow and their rewards.
27. Therefore, the parts of a Reinforcement Learning system
A Policy
It defines the learning agent's way of behaving at a given time.
The policy is the core of a reinforcement learning system.
In general, policies may be stochastic.
A Reward Signal
It defines the goal of a reinforcement learning problem.
The system's sole objective is to maximize the total reward over the long run.
This encourages Exploitation: it is immediate... you might want something more long-term.
A Value Function
It specifies what is good in the long run.
This encourages Exploration.
36. Finally
A Model of the Environment
This simulates the behavior of the environment, allowing us to make inferences about how the environment can behave.
Something Notable
Models are used for planning, meaning we make decisions before they are executed.
Thus, Reinforcement Learning methods using models are called model-based.
39. Limitations
Given that
Reinforcement learning relies heavily on the concept of state,
it has the same problem as Markov Decision Processes:
defining the state so as to obtain a better representation of the search space.
This is where the creative part of the problem comes in
As we have seen, solutions in AI rely a lot on the creativity of the solution...
45. Markov Process and Markov Decision Process
Definition of Markov Process
A sequence of states is Markov if and only if the probability of moving to the next state $s_{t+1}$ depends only on the present state $s_t$:
$P[s_{t+1} \mid s_t] = P[s_{t+1} \mid s_1, \ldots, s_t]$
Something Notable
We always talk about a time-homogeneous Markov chain in Reinforcement Learning: the probability of the transition is independent of t,
$P[s_{t+1} = s' \mid s_t = s] = P[s_t = s' \mid s_{t-1} = s]$
47. Formally
Definition (Markov Process)
A Markov Process (or Markov Chain) is a tuple (S, P), where
1 S is a finite set of states
2 P is a state transition probability matrix, $P_{ss'} = P[s_{t+1} = s' \mid s_t = s]$
Thus, the dynamic of the Markov process is as follows:
$s_0 \xrightarrow{P_{s_0 s_1}} s_1 \xrightarrow{P_{s_1 s_2}} s_2 \xrightarrow{P_{s_2 s_3}} s_3 \rightarrow \cdots$
49. Then, we get the following familiar definition
Definition
An MDP is a tuple (S, A, P, R, α), where
1 S is a finite set of states
2 A is a finite set of actions
3 P is a state transition probability matrix, $P^a_{ss'} = P[s_{t+1} = s' \mid s_t = s, A_t = a]$
4 α ∈ [0, 1] is called the discount factor
5 R : S × A → R is a reward function
Thus, the dynamic of the Markov Decision Process is as follows, using a probability to pick an action:
$s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \rightarrow \cdots$
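To make the tuple concrete, here is a minimal sketch (not from the slides) of how such a finite MDP could be written down as plain Python dictionaries. The states, actions, probabilities, and rewards below are invented for illustration, loosely echoing the mobile-robot example from earlier.

```python
# Hypothetical two-state MDP mirroring the tuple (S, A, P, R, alpha) above.
S = ["low_battery", "high_battery"]   # finite set of states
A = ["search", "recharge"]            # finite set of actions

# P[s][a] maps a next state s' to the transition probability P^a_{ss'}
P = {
    "high_battery": {"search":   {"high_battery": 0.7, "low_battery": 0.3},
                     "recharge": {"high_battery": 1.0}},
    "low_battery":  {"search":   {"low_battery": 0.6, "high_battery": 0.4},
                     "recharge": {"high_battery": 1.0}},
}

# R[s][a] is the expected immediate reward R(s, a); values are made up
R = {
    "high_battery": {"search": 2.0, "recharge": 0.0},
    "low_battery":  {"search": 1.0, "recharge": 0.0},
}

alpha = 0.9  # discount factor in [0, 1]
```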
57. The Goal of Reinforcement Learning
We want to maximize the expected value of the return
i.e. To obtain the optimal policy!!!
Thus, as in our previous study of MDPs:
$G_t = R_{t+1} + \alpha R_{t+2} + \cdots = \sum_{k=0}^{\infty} \alpha^k R_{t+k+1}$
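As a small illustration (not part of the original slides), the return can be computed for a finite reward sequence; the truncation to a finite horizon and the sample reward values are assumptions made for the example.

```python
def discounted_return(rewards, alpha):
    """Compute G_t = sum_k alpha^k * R_{t+k+1} for a finite reward sequence.

    `rewards` is the list [R_{t+1}, R_{t+2}, ...] observed from time t onward.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (alpha ** k) * r
    return g

# Example: three rewards of 1 with alpha = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], alpha=0.9))
```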
59. Remarks
Interpretation
The discounted future rewards can be interpreted as the current value of future rewards.
The discount factor α
α close to 0 leads to “short-sighted” evaluation... you are only looking for immediate rewards!!!
α close to 1 leads to “long-sighted” evaluation... you are willing to wait for a better reward!!!
61. Why use this discount factor?
Basically
1 The discount factor avoids infinite returns in cyclic Markov processes, given the properties of the geometric series.
2 There is a certain uncertainty about the future reward.
3 If the reward is financial, immediate rewards may earn more interest than delayed rewards.
4 Animal/human behavior shows a preference for immediate reward.
65. A Policy
Using our MDP ideas...
A policy π is a distribution over actions given states,
$\pi(a \mid s) = P[A_t = a \mid S_t = s]$
A policy guides the choice of action at a given state, and it is time-independent.
Then, we introduce two value functions
State-value function.
Action-value function.
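A minimal sketch, reusing the hypothetical states and actions from the MDP sketch above, of what a stochastic policy π(a|s) looks like as a probability table and how an action is sampled from it; the probabilities are arbitrary.

```python
import random

# Hypothetical stochastic policy pi(a|s) as a table of probabilities.
pi = {
    "low_battery":  {"recharge": 0.8, "search": 0.2},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a|s)."""
    actions = list(pi[s].keys())
    probs = [pi[s][a] for a in actions]
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "low_battery"))
```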
67. The state-value function $v_\pi$ of an MDP
It is defined as
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
This gives the long-term value of the state s:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
$= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \alpha^k R_{t+k+1} \mid S_t = s\right]$
$= \mathbb{E}_\pi[R_{t+1} + \alpha G_{t+1} \mid S_t = s]$
$= \underbrace{\mathbb{E}_\pi[R_{t+1} \mid S_t = s]}_{\text{immediate reward}} + \underbrace{\mathbb{E}_\pi[\alpha v_\pi(S_{t+1}) \mid S_t = s]}_{\text{discounted value of successor state}}$
71. The Action-Value Function $q_\pi(s, a)$
It is the expected return starting from state s, taking action a, and then following policy π:
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
We can also obtain the following decomposition
$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \alpha q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
75. Then
Expressing $q_\pi(s, a)$ in terms of $v_\pi(s)$ in the expression for $v_\pi(s)$ gives the Bellman Equation:
$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'} v_\pi(s')\right]$
The Bellman equation relates the state-value function of one state with that of other states.
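The outline lists Policy Evaluation under Dynamic Programming; as a preview, here is a minimal sketch (not the slides' own implementation) that applies this Bellman equation as an iterative update until the values stop changing. It assumes the dictionary-based MDP and policy format used in the earlier sketches.

```python
def policy_evaluation(S, P, R, pi, alpha, tol=1e-8):
    """Iteratively apply the Bellman equation for v_pi until convergence.

    Expects the dictionary format from the MDP sketch above:
    P[s][a][s'] = transition probability, R[s][a] = expected reward, pi[s][a] = pi(a|s).
    """
    v = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # v(s) <- sum_a pi(a|s) [ R(s,a) + alpha * sum_s' P(s'|s,a) v(s') ]
            new_v = sum(
                pi[s][a] * (R[s][a] + alpha * sum(p * v[s2] for s2, p in P[s][a].items()))
                for a in pi[s]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```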
77. Also
Bellman equation for $q_\pi(s, a)$:
$q_\pi(s, a) = R^a_s + \alpha \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a' \mid s') \, q_\pi(s', a')$
How do we solve this problem?
We have the following solutions
80. Action-Value Methods
One natural way to estimate this is by averaging the rewards actually received:
$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$
Nevertheless
If the denominator is zero, we define $Q_t(a)$ as some default value.
Now, as the denominator goes to infinity, by the law of large numbers,
$Q_t(a) \longrightarrow q_*(a)$
Furthermore
We call this the sample-average method.
It is not the best estimator, but it allows us to introduce a way to select actions...
A classic one is the greedy selection:
At = arg max_a Qt(a)
Therefore
Greedy action selection always exploits current knowledge to maximize immediate reward.
It spends no time on apparently inferior actions to see if they might really be better.
And Here a Heuristic
A simple alternative is to act greedily most of the time,
but every once in a while, say with small probability ε, select randomly from among all the actions with equal probability.
This is called the ε-greedy method.
In the limit, as every action is sampled infinitely often, this method ensures:
Qt(a) → q∗(a)
Example
In the following example, the q∗(a) were selected according to a Gaussian distribution with mean 0 and variance 1.
Then, when an action is selected, the reward Rt is drawn from a normal distribution with mean q∗(At) and variance 1.
Imagine the following
Let Ri denote the reward received after the ith selection of this action, and let
Qn = (R1 + R2 + · · · + Rn−1) / (n − 1)
The problem with this technique
The amount of memory and computation needed to keep updating the estimates grows as more rewards arrive.
Therefore
We need something more efficient.
We can do something like that by realizing the following:
Qn+1 = (1/n) Σ_{i=1}^n Ri
     = (1/n) [ Rn + Σ_{i=1}^{n−1} Ri ]
     = (1/n) [ Rn + (n − 1) Qn ]
     = Qn + (1/n) [ Rn − Qn ]
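As a quick illustration of this incremental rule, here is a minimal Python sketch (the reward sequence is made up for the example) showing that updating Qn with Qn + (1/n)[Rn − Qn] reproduces the plain sample average without storing past rewards.

```python
def incremental_mean(rewards):
    """Running average using Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n          # only the current estimate is stored
    return q

rewards = [1.0, 0.0, 2.0, 1.5]    # made-up reward sequence
assert abs(incremental_mean(rewards) - sum(rewards) / len(rewards)) < 1e-12
```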
General Formula in Reinforcement Learning
This pattern appears almost all the time in RL:
New Estimate = Old Estimate + Step Size [Target − Old Estimate]
It looks a lot like gradient descent if we assume a first-order Taylor expansion:
f(x) ≈ f(a) + f'(a)(x − a)  ⇒  f'(a) ≈ [f(x) − f(a)] / (x − a)
Then
xn+1 = xn + γ f'(a) = xn + γ [f(x) − f(a)] / (x − a) = xn + γ [f(x) − f(a)]
(absorbing the factor 1/(x − a) into the step size)
A Simple Bandit Algorithm
Initialize, for a = 1 to k:
1 Q(a) ← 0
2 N(a) ← 0
Loop until a certain threshold γ:
1 A ← arg max_a Q(a) with probability 1 − ε, or a random action with probability ε
2 R ← bandit(A)
3 N(A) ← N(A) + 1
4 Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
(a runnable sketch of this loop is given below)
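A minimal Python sketch of this ε-greedy bandit loop, assuming a stationary k-armed Gaussian testbed like the example above; the bandit function, k, ε, and the number of steps are illustrative choices, not part of the original slides.

```python
import random

def run_bandit(k=10, eps=0.1, steps=1000, seed=0):
    """epsilon-greedy action-value method with incremental sample averages."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]   # true values, as in the example
    Q = [0.0] * k                                       # estimates
    N = [0] * k                                         # visit counts

    def bandit(a):                                      # reward ~ N(q*(a), 1)
        return rng.gauss(q_star[a], 1.0)

    for _ in range(steps):
        if rng.random() < eps:                          # explore
            A = rng.randrange(k)
        else:                                           # exploit (ties -> first max)
            A = max(range(k), key=lambda a: Q[a])
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                       # incremental update
    return Q, q_star

Q, q_star = run_bandit()
```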
Remarks
Remember
This works well for stationary problems.
For more on non-stationary problems, see
"Reinforcement Learning: An Introduction" by Sutton and Barto, Chapter 2.
Dynamic Programming
As with many things in Computer Science
We love divide et impera (divide and conquer)...
Then
In Reinforcement Learning, the Bellman equation gives the recursive decomposition, and the value function stores and reuses the sub-solutions.
Policy Evaluation
For this, we assume as always
The problem can be solved with the Bellman Equation.
We start from an initial guess v1.
Then, we apply the Bellman equation iteratively:
v1 → v2 → · · · → vπ
For this, we use synchronous updates:
we update the value of every state in the present iteration at the same time, based on the values from the previous iteration.
Then
At each iteration k + 1, for all states s ∈ S,
we update vk+1(s) from vk(s') according to the Bellman equation, where s' is a successor state of s:
vk+1(s) = Σ_{a∈A} π(a|s) [ R^a_s + α Σ_{s'∈S} P^a_{ss'} vk(s') ]
Something Notable
The iterative process stops when the maximum difference between the value function at the current step and that at the previous step is smaller than some small positive constant.
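A compact Python sketch of this iterative policy evaluation, assuming the MDP is given in tabular form; the transition representation P[s][a] as (prob, next_state, reward) triples and the tiny two-state example are assumptions of this sketch, not something specified on the slides.

```python
def policy_evaluation(P, pi, alpha=0.9, theta=1e-6):
    """Iterative policy evaluation with synchronous updates.

    P[s][a]  : list of (prob, next_state, reward) triples
    pi[s][a] : probability of taking action a in state s
    alpha    : discount factor (the slides' α)
    """
    n_states = len(P)
    V = [0.0] * n_states                       # initial guess v1
    while True:
        delta = 0.0
        V_new = [0.0] * n_states               # synchronous: built entirely from the old V
        for s in range(n_states):
            v = 0.0
            for a, transitions in enumerate(P[s]):
                for prob, s_next, r in transitions:
                    v += pi[s][a] * prob * (r + alpha * V[s_next])
            V_new[s] = v
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                      # stop when the max change is small
            return V

# Tiny 2-state, 1-action example: state 0 -> state 1 (reward 1), state 1 -> state 1 (reward 0)
P = [[[(1.0, 1, 1.0)]], [[(1.0, 1, 0.0)]]]
pi = [[1.0], [1.0]]
print(policy_evaluation(P, pi))                # V(0) ≈ 1.0, V(1) ≈ 0.0
```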
However
Our ultimate goal is to find the optimal policy.
For this, we can use the policy iteration process.
Algorithm
1 Initialize π randomly
2 Repeat until the previous policy and the current policy are the same:
1 Evaluate vπ by policy evaluation.
2 Using synchronous updates, for each state s, let
π'(s) = arg max_{a∈A} q(s, a) = arg max_{a∈A} [ R^a_s + α Σ_{s'∈S} P^a_{ss'} vπ(s') ]
Now
Something Notable
This policy iteration algorithm improves the policy at each step.
Assume that
We have a deterministic policy π, and after one improvement step we get π'.
We know that the value improves in one step because
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)
Then
We can see that
vπ(s) ≤ qπ(s, π'(s)) = Eπ'[Rt+1 + αvπ(St+1) | St = s]
     ≤ Eπ'[Rt+1 + αqπ(St+1, π'(St+1)) | St = s]
     ≤ Eπ'[Rt+1 + αRt+2 + α² qπ(St+2, π'(St+2)) | St = s]
     ≤ Eπ'[Rt+1 + αRt+2 + · · · | St = s] = vπ'(s)
Then
If improvements stop,
qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)
This means that the Bellman optimality equation has been satisfied:
vπ(s) = max_{a∈A} qπ(s, a)
Therefore, vπ(s) = v∗(s) for all s ∈ S, and π is an optimal policy.
Once the policy has been improved
We can iteratively obtain a sequence of monotonically improving policies and values:
π0 −E→ vπ0 −I→ π1 −E→ vπ1 −I→ π2 −E→ · · · −I→ π∗ −E→ v∗
with E = evaluation and I = improvement.
Policy Iteration (using iterative policy evaluation)
Step 1. Initialization
V(s) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S
Step 2. Policy Evaluation
1 Loop:
2 ∆ ← 0
3 Loop for each s ∈ S:
4 v ← V(s)
5 V(s) ← Σ_{s',r} p(s', r|s, π(s)) [r + αV(s')]
6 ∆ ← max(∆, |v − V(s)|)
7 until ∆ < θ (a small threshold for accuracy)
Further
Step 3. Policy Improvement
1 policy-stable ← True
2 For each s ∈ S:
3 old_action ← π(s)
4 π(s) ← arg max_a Σ_{s',r} p(s', r|s, a) [r + αV(s')]
5 If old_action ≠ π(s), then policy-stable ← False
6 If policy-stable, then stop and return V ≈ v∗ and π ≈ π∗; else go to Step 2.
(a code sketch combining these steps is given below)
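Putting Steps 1-3 together, here is a minimal Python sketch of policy iteration for a tabular MDP; the transition representation P[s][a] as (prob, next_state, reward) triples, and the assumption that every state has the same number of actions, are choices of this sketch, not something specified on the slides.

```python
def policy_iteration(P, alpha=0.9, theta=1e-6):
    """Tabular policy iteration.  P[s][a]: list of (prob, next_state, reward)."""
    n_states = len(P)
    n_actions = len(P[0])                       # assumes the same action set in every state
    pi = [0] * n_states                         # deterministic policy: one action index per state
    V = [0.0] * n_states

    def q(s, a, V):                             # one-step lookahead value of (s, a)
        return sum(p * (r + alpha * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Step 2: iterative policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                V[s] = q(s, pi[s], V)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Step 3: greedy policy improvement
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = max(range(n_actions), key=lambda a: q(s, a, V))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi
```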
Jack’s Car Rental
Setup
We have a car rental business with two locations (s1, s2).
Each time a car is available, Jack rents it out and is credited $10 by the company.
If Jack is out of cars at that location, then the business is lost.
Returned cars become available for renting the next day.
Jack has options
He can move cars between the two locations overnight, at a cost of $2 per car, and at most 5 cars can be moved.
We choose a probability distribution for modeling requests and returns
The number of rentals and returns at each location follows a Poisson distribution:
P(n | λ) = (λ^n / n!) e^{−λ}
We simplify the problem by imposing an upper bound of n = 20 cars per location;
basically, any additional cars produced by returns simply disappear.
Finally
We have the following λ's for requests and returns:
1 Requests at s1: λ = 3
2 Requests at s2: λ = 4
3 Returns at s1: λ = 3
4 Returns at s2: λ = 2
Each λ defines how many cars are requested or returned in a time interval (one day); a small sketch of these distributions follows below.
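As a small illustration of how these Poisson rates enter the model, here is a hedged Python sketch that computes the truncated request and return probabilities one would use when building the transition model; the truncation at 20 cars and the λ values come from the slides, but lumping the tail mass at the bound, and the function names, are choices made for this example.

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P(n | λ) = λ^n e^{-λ} / n!"""
    return (lam ** n) * exp(-lam) / factorial(n)

def truncated_poisson(lam, max_n=20):
    """Probabilities for 0..max_n events; the tail mass is lumped at max_n,
    mirroring the simplification that extra cars 'simply disappear'."""
    probs = [poisson_pmf(n, lam) for n in range(max_n)]
    probs.append(1.0 - sum(probs))          # remaining mass at the upper bound
    return probs

# Rates from the example: requests (3, 4) and returns (3, 2) per day.
request_probs = [truncated_poisson(3), truncated_poisson(4)]
return_probs = [truncated_poisson(3), truncated_poisson(2)]
```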
History
There is original earlier work
by Arthur Samuel, which Sutton built on to develop TD(λ).
It was famously applied by Gerald Tesauro
to build TD-Gammon,
a program that learned to play backgammon at the level of expert human players.
Temporal-Difference (TD) Learning
Something Notable
It is possibly the central and novel idea of reinforcement learning.
It has similarities with Monte Carlo methods:
they use experience to solve the prediction problem (for more, look at Sutton’s book).
Monte Carlo methods wait until Gt is known:
V(St) ← V(St) + γ [Gt − V(St)], with Gt = Σ_{k=0}^∞ α^k Rt+k+1
Let us call this method constant-α Monte Carlo.
Fundamental Theorem of Simulation
Cumulative Distribution Function
With every random variable, we associate a function called the Cumulative Distribution Function (CDF), defined as follows:
FX(x) = P(X ≤ x)
With properties:
FX(x) ≥ 0
FX(x) is a non-decreasing function of x.
Therefore
One thing that is made clear by this theorem is that we can generate X in three ways:
1 First generate X ∼ f, and then U|X = x ∼ U{u | u ≤ f(x)}; but this is useless, since it requires already being able to generate X.
2 First generate U, and then X|U = u ∼ U{x | u ≤ f(x)}.
3 Generate (X, U) jointly, a smart idea:
it allows us to generate the pair on a larger set where simulation is easier (see the sketch below).
Not as simple as it looks.
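A minimal Python sketch of option 3: accept-reject sampling, where the pair (X, U) is generated uniformly on an easy, larger set (a box around the density) and kept only when it falls under the curve. The target density, its bound M, and all names are illustrative choices, not from the slides.

```python
import random

def rejection_sample(pdf, bound, rng=random.Random(0)):
    """Sample X with density `pdf` on [0, 1] by generating (X, U) uniformly
    in the box [0, 1] x [0, bound] and keeping pairs with U <= pdf(X)."""
    while True:
        x = rng.random()                  # proposal on the larger, simple set
        u = rng.uniform(0.0, bound)
        if u <= pdf(x):                   # accept the pair under the density curve
            return x

# Example target: Beta(2, 2) density f(x) = 6 x (1 - x), bounded above by M = 1.5.
samples = [rejection_sample(lambda x: 6 * x * (1 - x), 1.5) for _ in range(1000)]
```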
Markov Chains
The random process Xt ∈ S, for t = 1, 2, ..., T, has the Markov property iff
p(XT | XT−1, XT−2, ..., X1) = p(XT | XT−1)
Finite-state, discrete-time Markov chains (|S| < ∞)
can be completely specified by the transition matrix
P = [pij], with pij = P[Xt = j | Xt−1 = i]
For irreducible chains, the stationary distribution π is the long-term proportion of time that the chain spends in each state.
It is computed by solving π = πP.
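As a small worked example, here is a Python sketch that finds π = πP by repeatedly multiplying an initial distribution by the transition matrix (power iteration); the two-state matrix is made up for illustration.

```python
def stationary_distribution(P, iters=10000, tol=1e-12):
    """Solve pi = pi P for a row-stochastic matrix P (list of rows) by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n                                   # start from the uniform distribution
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]                                          # made-up 2-state chain
print(stationary_distribution(P))                        # ≈ [0.833, 0.167]
```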
The Idea
Draw an i.i.d. set of samples {x(i)}_{i=1}^N
from a target density p(x) on a high-dimensional space X,
the set of possible configurations of a system.
Thus, we can use those samples to approximate the target density:
pN(x) = (1/N) Σ_{i=1}^N δ_{x(i)}(x)
where δ_{x(i)}(x) denotes the Dirac delta mass located at x(i).
Consequently
We can approximate the integrals:
IN(f) = (1/N) Σ_{i=1}^N f(x(i))  −a.s.→  I(f) = ∫_X f(x) p(x) dx  as N → ∞
where the variance of f(x) is
σ²_f = E_{p(x)}[f²(x)] − I²(f) < ∞
Therefore the variance of the estimator IN(f) is
var(IN(f)) = σ²_f / N
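A tiny Python sketch of this Monte Carlo principle, assuming a target density we can sample directly (a standard normal here) and a made-up test function; it simply checks that the sample mean of f approaches the true expectation as N grows.

```python
import random

def mc_estimate(f, sampler, n):
    """I_N(f) = (1/N) sum_i f(x_i), with x_i ~ p drawn by `sampler`."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)
sampler = lambda: rng.gauss(0.0, 1.0)        # p(x): standard normal
f = lambda x: x * x                          # E_p[X^2] = 1 for N(0, 1)

for n in (100, 10_000, 1_000_000):
    print(n, mc_estimate(f, sampler, n))     # converges towards 1.0, error ~ 1/sqrt(N)
```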
What if we do not wait for Gt?
TD methods need to wait only until the next time step.
At time t + 1, they use the observed reward Rt+1 and the estimate V(St+1).
The simplest TD method updates
V(St) ← V(St) + γ [Rt+1 + αV(St+1) − V(St)]
immediately on transition to St+1 and on receiving Rt+1.
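A minimal Python sketch of tabular TD(0) prediction under this update, keeping the slides' convention of γ for the step size and α for the discount; the environment interface env.reset()/env.step(a) and the fixed policy are assumptions of this sketch.

```python
def td0_prediction(env, policy, n_states, episodes=1000, step=0.1, alpha=0.9):
    """Tabular TD(0): V(S_t) <- V(S_t) + step * [R_{t+1} + alpha * V(S_{t+1}) - V(S_t)].

    Assumed interface: env.reset() -> state, env.step(a) -> (next_state, reward, done);
    policy(state) -> action.
    """
    V = [0.0] * n_states
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else alpha * V[s_next])
            V[s] += step * (target - V[s])       # update as soon as R_{t+1}, S_{t+1} are observed
            s = s_next
    return V
```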
This method is called TD(0), or one-step TD.
Actually
There are also n-step TD methods.
Differences with the Monte Carlo methods:
In the Monte Carlo update, the target is Gt.
In the TD update, the target is Rt+1 + αV(St+1).
TD as a Bootstrapping Method
It is quite similar to Dynamic Programming.
Additionally, we know that
vπ(s) = Eπ[Gt | St = s]
      = Eπ[Rt+1 + αGt+1 | St = s]
      = Eπ[Rt+1 + αvπ(St+1) | St = s]
Monte Carlo methods use an estimate of vπ(s) = Eπ[Gt | St = s] as a target.
Instead, Dynamic Programming methods use an estimate of vπ(s) = Eπ[Rt+1 + αvπ(St+1) | St = s].
Furthermore
The Monte Carlo target is an estimate
because the expected return is not known; a sample return is used instead.
The DP target is an estimate
because vπ(St+1) is not known, and the current estimate, V(St+1), is used instead.
Here TD goes beyond and combines both ideas:
it combines the sampling of Monte Carlo with the bootstrapping of DP.
As in many methods, there is an error,
and it is used to improve the value estimates:
δt = Rt+1 + αV(St+1) − V(St)
Because the TD error depends on the next state and next reward,
it is not actually available until one time step later:
δt is the error in V(St), available at time t + 1.
Finally
If the array V does not change during the episode,
the Monte Carlo error can be written as a sum of TD errors:
Gt − V(St) = Rt+1 + αGt+1 − V(St) + αV(St+1) − αV(St+1)
           = δt + α (Gt+1 − V(St+1))
           = δt + αδt+1 + α² (Gt+2 − V(St+2))
           = Σ_{k=t}^{T−1} α^{k−t} δk
Optimality of TD(0)
Under batch updating,
TD(0) converges deterministically to a single answer, independent of the step size, as long as γ is chosen to be sufficiently small.
Compare this with the constant-step-size Monte Carlo update V(St) ← V(St) + γ [Gt − V(St)].
Q-learning: Off-Policy TD Control
One of the early breakthroughs in reinforcement learning
was the development of an off-policy TD control algorithm, Q-learning (Watkins, 1989).
It is defined as
Q(St, At) ← Q(St, At) + γ [Rt+1 + α max_a Q(St+1, a) − Q(St, At)]
Here
the learned action-value function, Q, directly approximates q∗, the optimal action-value function,
independent of the policy being followed.
Furthermore
The policy still has an effect:
it determines which state-action pairs are visited and updated.
Therefore
All that is required for correct convergence is that all pairs continue to be updated.
Under this assumption, and a variant of the usual stochastic approximation conditions,
Q has been shown to converge with probability 1 to q∗.
Q-learning (Off-Policy TD Control) for estimating π ≈ π∗
Algorithm parameters
α ∈ (0, 1] and a small ε > 0
Initialize
Initialize Q(s, a), for all s ∈ S⁺, a ∈ A, arbitrarily, except that Q(terminal, ·) = 0
Then
Loop for each episode:
1 Initialize S
2 Loop for each step of the episode:
3 Choose A from S using a policy derived from Q (for example, ε-greedy)
4 Take action A, observe R, S'
Q(S, A) ← Q(S, A) + γ [R + α max_a Q(S', a) − Q(S, A)]
5 S ← S'
6 Until S is terminal
(a runnable sketch of this loop is given below)
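A minimal Python sketch of this Q-learning loop, keeping the slides' γ-for-step-size and α-for-discount convention; the env.reset()/env.step(a) interface and the parameter values are assumptions of this sketch.

```python
import random

def q_learning(env, n_states, n_actions, episodes=500,
               step=0.1, alpha=0.9, eps=0.1, seed=0):
    """Off-policy TD control (Q-learning) with an epsilon-greedy behaviour policy.

    Assumed interface: env.reset() -> state, env.step(a) -> (next_state, reward, done).
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]     # Q(terminal, .) stays 0

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else alpha * max(Q[s_next]))
            Q[s][a] += step * (target - Q[s][a])          # greedy target, hence off-policy
            s = s_next
    return Q
```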
One of the earliest successes
Backgammon is a major game,
with numerous tournaments and regular world championship matches.
It is a stochastic game with a fully observable description of the world.
What did they use?
TD-Gammon used a nonlinear form of TD(λ).
An estimated value v̂(s, w) is used to estimate the probability of winning from state s.
Then
To achieve this, rewards were defined as zero for all time steps except those on which the game is won.
The interesting part
Tesauro’s representation
For each point on the backgammon board,
four units indicate the number of white pieces on the point.
Basically
If there are no white pieces, all four units take the value zero;
if there is one white piece, the first unit takes the value one;
and so on...
Therefore, given four units for white pieces and another four for black pieces at each of the 24 points,
we have 4*24 + 4*24 = 192 units.
Then
We have that
Two additional units encoded the number of white and black pieces on the bar (each took the value n/2, with n the number on the bar).
Two others represented the number of black and white pieces already removed from the board (these took the value n/15).
Two units indicated, in a binary fashion, whether it was white’s or black’s turn.
A sketch of this encoding is given below.
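A hedged Python sketch of this input encoding; the exact rule Tesauro used for the fourth unit is not given on the slides, so the truncated-unary scheme below (unit k is 1 when more than k pieces sit on the point) is an illustrative simplification, and all names are made up for the example.

```python
def encode_point(count):
    """Four units per colour per point: a simplified truncated-unary code.
    (Tesauro's fourth unit used a different scaling; this is an assumption.)"""
    return [1.0 if count > k else 0.0 for k in range(4)]

def encode_board(white_points, black_points, white_bar, black_bar,
                 white_off, black_off, white_to_move):
    """198 inputs: 24 points x 4 units x 2 colours, plus bar, borne-off and turn units."""
    assert len(white_points) == len(black_points) == 24
    units = []
    for w, b in zip(white_points, black_points):
        units += encode_point(w) + encode_point(b)        # 192 point units
    units += [white_bar / 2.0, black_bar / 2.0]           # pieces on the bar
    units += [white_off / 15.0, black_off / 15.0]         # pieces already removed
    units += [1.0, 0.0] if white_to_move else [0.0, 1.0]  # whose turn it is
    return units                                           # length 198

x = encode_board([0] * 23 + [2], [2] + [0] * 23, 0, 0, 0, 0, True)
assert len(x) == 198
```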
TD(λ) by Back-Propagation
TD-Gammon uses a semi-gradient form of TD(λ).
TD(λ) has the following general update rule:
wt+1 = wt + γ [Rt+1 + αv̂(St+1, wt) − v̂(St, wt)] zt
where wt is the vector of all modifiable parameters
and zt is a vector of eligibility traces.
Eligibility traces unify and generalize TD and Monte Carlo methods.
They work like a long–short term memory:
the rough idea is that when a component of wt participates in producing an estimated value, the corresponding component of zt is bumped up and then begins to fade away.
How is this done?
Update Rule
zt = αλ zt−1 + ∇v̂(St, wt), with z0 = 0
Something Notable
The gradient in this equation can be computed efficiently by the back-propagation procedure.
For this, we set the error at the output of the ANN to
v̂(St+1, wt) − v̂(St, wt)
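A minimal Python sketch of semi-gradient TD(λ) for a linear value function. TD-Gammon used a neural network with back-propagation to obtain ∇v̂; a linear v̂ (where the gradient is just the feature vector) keeps the example short. The trajectory format, feature function, step size, and trace decay are illustrative assumptions, and the slides' γ/α naming for step size/discount is kept.

```python
def td_lambda_linear(episodes, features, n_features,
                     step=0.01, alpha=0.9, lam=0.8):
    """Semi-gradient TD(lambda) with a linear value function v(s, w) = w . x(s).

    episodes : iterable of trajectories [(s0, r1, s1), (s1, r2, s2), ...]
    features : maps a state to a list of n_features numbers
               (terminal states should map to all-zero features so v(terminal) = 0)
    """
    w = [0.0] * n_features
    v = lambda s: sum(wi * xi for wi, xi in zip(w, features(s)))
    for episode in episodes:
        z = [0.0] * n_features                       # eligibility trace, z0 = 0
        for s, r, s_next in episode:
            x = features(s)                          # for linear v, grad v = x(s)
            delta = r + alpha * v(s_next) - v(s)     # TD error
            z = [alpha * lam * zi + xi for zi, xi in zip(z, x)]
            w = [wi + step * delta * zi for wi, zi in zip(w, z)]
    return w
```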
Conclusions
With new tools such as Reinforcement Learning,
the new Artificial Intelligence has a lot to do in the world.
Thanks for being in this class.
As they say, we have a lot to do...