The document discusses reinforcement learning and its key concepts. It covers defining the reinforcement learning problem through reward maximization and Bellman's equation. It then discusses learning methods like Monte Carlo, temporal difference learning, and Q-learning. It also covers improvements like the importance of exploration versus exploitation and eligibility traces for accelerated learning.
This presentation contains an introduction to reinforcement learning, a comparison with other learning approaches, an introduction to Q-Learning, and some applications of reinforcement learning in video games.
Deep Reinforcement Learning talk at PI School, covering the following contents:
1- Deep Reinforcement Learning
2- Q-Learning
3- Deep Q-Learning (DQN)
4- Google DeepMind paper (DQN for Atari)
Reinforcement Learning (RL) deals with finding an optimal reward-based policy for acting in an environment (talk in English).
However, what has led to their widespread use is their combination with deep neural networks (DNNs), i.e., deep reinforcement learning (Deep RL). Recent successes, not only in learning to play games but in surpassing humans at them, along with academia-industry research collaborations on manipulation of objects, locomotion skills, smart grids, etc., have demonstrated their worth on a wide variety of challenging tasks.
With applications spanning games, robotics, dialogue, healthcare, marketing, energy, and many more domains, Deep RL might just be the power that drives the next generation of Artificial Intelligence (AI) agents!
These slides describe Deep SARSA, Deep Q-learning, and DQN; they were used for a Reinforcement Learning study group lecture at the Korea Artificial Intelligence Laboratory.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
In some applications, the output of the system is a sequence of actions, and a single action is not important on its own; in game playing, for example, a single move by itself is not decisive. When the agent acts on its environment, it receives some evaluation of its action (a reinforcement), but it is not told which action is the correct one for achieving its goal.
Reinforcement Learning 2: Multi-armed Bandits, by Seung Jae Lee
A summary of Chapter 2: Multi-armed Bandits of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html
Deep Reinforcement Learning and Its Applications, by Bill Liu
What is the most exciting AI news in recent years? AlphaGo!
What are key techniques for AlphaGo? Deep learning and reinforcement learning (RL)!
What are the application areas for deep RL? A lot! In fact, besides games, deep RL has achieved tremendous results in diverse areas like recommender systems and robotics.
In this talk, we will introduce deep reinforcement learning, present several applications, and discuss issues and potential solutions for successfully applying deep RL in real-life scenarios.
https://www.aicamp.ai/event/eventdetails/W2021042818
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Anyone who is interested in Reinforcement Learning can have a look. It covers Markov Reward Processes, Markov Decision Processes and Dynamic Programming!
An Efficient Use of the Temporal Difference Technique in Computer Game Learning, by Prabhu Kumar
A computer game that uses the temporal-difference algorithm from machine learning, improving the computer's ability to learn and to find the best next move by combining greedy move selection with exploration of future game states.
Maximum Entropy Reinforcement Learning (Stochastic Control), by Dongmin Lee
I reviewed the following papers.
- T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
Thank you.
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017 (MLconf)
Ben Lau is a quantitative researcher in a macro hedge fund in Hong Kong, and he looks to apply mathematical models and signal processing techniques to study the financial market. Prior to joining the financial industry, he specialized in using his mathematical modelling skills to discover the mysteries of the universe whilst working at the Stanford Linear Accelerator Center, a national accelerator laboratory, where he studied the asymmetry between matter and antimatter by analysing tens of billions of collision events created by the particle accelerators. Ben was awarded his Ph.D. in Particle Physics from Princeton University and his undergraduate degree (with First Class Honours) from the Chinese University of Hong Kong.
Abstract Summary:
Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is key to enabling fully autonomous driving in the future. Reinforcement learning has been considered a strong AI paradigm which can be used to teach machines through interaction with the environment and by learning from their mistakes. In this talk, we will discuss how to apply deep reinforcement learning techniques to train a self-driving car in an open-source racing car simulator called TORCS. I am going to share how this is implemented and will discuss various challenges in this project.
Review: Demystifying Deep Reinforcement Learning (written by Tambet Matiisen), reviewed by Hogeon Seo
The link of the original article: https://ai.intel.com/demystifying-deep-reinforcement-learning/
This review summarizes:
How do I learn reinforcement learning?
Reinforcement Learning is Hot!
What is RL?
General approach to model the RL problem
Maximize the total future reward
A function Q(s,a) = the maximum discounted future reward (DFR)
How to get Q-function?
Deep Q Network
Experience Replay
Exploration-Exploitation
Speaker: Donghyun Kwak (PhD student at Seoul National University; currently at NAVER Clova)
An overview of reinforcement learning and recent trends in deep-learning-based RL.
Talk videos:
http://tv.naver.com/v/2024376
https://youtu.be/dw0sHzE1oAc
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Reinforcement Learning
Örjan Ekeberg, Machine Learning

1 Defining the Problem
Reward Maximization
Simplifying Assumptions
Bellman's Equation
2 Learning
Monte-Carlo Method
Temporal-Difference
Learning to Act
Q-Learning
3 Improvements
Importance of Making Mistakes
Eligibility Trace
Reinforcement Learning
Learning of a behavior without explicit information about correct actions
A reward gives information about success
Agent–Environment Abstraction:
[Diagram: the agent sends an action to the environment; the environment returns a state and a reward.]
Policy — Choice of action, depending on current state
Learning Objective:
Develop a policy which maximizes the reward
... the total reward over the agent's lifetime!
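As a concrete picture of this loop, here is a minimal sketch in Python; the toy environment, the random policy, and every name in it are illustrative assumptions, not anything from the slides.

```python
# Sketch of the agent-environment loop (all classes and names illustrative).
import random

class ToyEnv:
    """Toy environment: state is a step counter; action 1 earns reward 1."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1                          # environment moves to a new state
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward                # next state and reward

def random_policy(state):
    """A (poor) policy: pick an action at random, ignoring the state."""
    return random.choice([0, 1])

env = ToyEnv()
state = env.reset()
total = 0.0
for _ in range(10):
    action = random_policy(state)            # policy: state -> action
    state, reward = env.step(action)         # environment: action -> state, reward
    total += reward                          # learning aims to maximize this
print(total)
```

A learning agent would replace `random_policy` with one that improves from the observed rewards; that is exactly what the rest of the deck develops.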
Credit Assignment Problems
The reward does not necessarily arrive when you do something good
(temporal credit assignment)
The reward does not say what was good
(structural credit assignment)
Consider a minimal behavior selection situation: what is the best action to take?
[Figure: a small decision graph; each transition carries an immediate reward such as −1, 2, 0, 1, or 3.]
Reward — Immediate consequence of our decision
Planning — Taking future reward into account
Optimal Behavior — Maximize total reward
Can the agent make the right decision without explicitly modeling the future?
Yes! If the Sum of Future Rewards for each situation is known.
[Figure: the same decision graph, now annotated with each state's sum of future rewards, e.g. 4, −2, 2, 8, −4.]
Call this the State Value.
State Values are subjective:
they depend on the agent's own behavior
and must be re-adjusted as the agent's behavior improves.
Planning Horizon — How long is the future?
[Figure: two reward-versus-time plots contrasting undiscounted and discounted future reward.]
An infinite future allows for infinite postponing.
Discounting — Make early reward more valuable
Discount factor (γ) — Time scale of planning
Finite Horizon:
$\max \sum_{t=0}^{h} r_t$
Infinite Horizon:
$\max \sum_{t=0}^{\infty} \gamma^t r_t$
The infinite horizon requires discounting of future reward ($0 < \gamma < 1$).
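To make the discounting concrete, here is a minimal sketch (plain Python; the function name and numbers are illustrative) computing the discounted return $\sum_{t} \gamma^t r_t$ of a finite reward sequence:

```python
# Minimal sketch: discounted return of a finite reward sequence.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# With 0 < gamma < 1, early reward is worth more than late reward:
print(discounted_return([0, 0, 3, -1]))  # ≈ 1.701 (reward arrives late)
print(discounted_return([3, -1, 0, 0]))  # = 2.1   (same rewards, earlier)
```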
The Reward function controls which task should be solved:
Game (Chess, Backgammon)
Reward only at the end: +1 when winning, −1 when losing
Avoiding mistakes (cycling, balancing, ...)
Reward −1 at the end (when failing)
Find a short/fast/cheap path to a goal
Reward −1 at each step
Simplifying Assumptions:
Discrete time
Finite number of actions: $a_i \in \{a_1, a_2, a_3, \ldots, a_n\}$
Finite number of states: $s_i \in \{s_1, s_2, s_3, \ldots, s_m\}$
The environment is a stationary Markov Decision Process:
reward and next state depend only on $s$, $a$, and chance
Classical model problem: Grid World
Each state is represented by a position in a grid
The agent acts by moving to other positions
[Figure: a trivial labyrinth; G marks the goal states.]
Reward: −1 at each step until a goal state (G) is reached
The value of a state depends on the current policy.
[Figure: grid-world state values $V$ under an optimal policy (values 0 to −3, falling off with distance from the goals) versus under a random policy (values 0 down to −22, e.g. −14, −18, −20, −22).]
The Agent's Internal Representation
Policy — the action chosen by the agent for each state:
$\pi(s) \to a$
Value Function — the expected total future reward from $s$ when following policy $\pi$:
$V^\pi(s) \to \mathbb{R}$
Representation of the Environment
Where does an action take us? $\delta(s, a) \to s'$
How much reward do we receive? $r(s, a) \to \mathbb{R}$
The values of different states are interrelated.
Bellman's Equation:
$V^\pi(s) = r(s, \pi(s)) + \gamma \cdot V^\pi(\delta(s, \pi(s)))$
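Since $r$, $\delta$, and $\pi$ determine $V^\pi$ through this recursion, $V^\pi$ can be computed by simple fixed-point iteration when they are known. A minimal sketch, assuming a made-up two-state deterministic MDP (all tables and names below are illustrative):

```python
# Sketch: evaluate V^pi by iterating Bellman's equation on a toy
# deterministic MDP. 'G' is an absorbing goal state with value 0.

gamma = 0.9
states = ["s1", "s2", "G"]
policy = {"s1": "right", "s2": "right"}                   # pi(s) -> a
delta  = {("s1", "right"): "s2", ("s2", "right"): "G"}    # delta(s, a) -> s'
reward = {("s1", "right"): -1.0, ("s2", "right"): -1.0}   # r(s, a)

V = {s: 0.0 for s in states}
for _ in range(100):                      # iterate until the fixed point
    for s in states:
        if s == "G":
            continue                      # goal state keeps value 0
        a = policy[s]
        # Bellman's equation: V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s)))
        V[s] = reward[(s, a)] + gamma * V[delta[(s, a)]]

print(V)   # {'s1': -1.9, 's2': -1.0, 'G': 0.0}
```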
2 Learning
Monte-Carlo Method
Temporal-Difference
Learning to Act
Q-Learning
Normal scenario: $r(s, a)$ and $\delta(s, a)$ are not known, so $V^\pi$ must be estimated from experience.
Monte-Carlo Method:
Start at a random $s$
Follow $\pi$, storing the rewards and the visited states $s_t$
When the goal is reached, update the $V^\pi(s)$ estimate for all visited states with the future reward we actually received
Painfully slow!
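A sketch of this procedure on a made-up one-dimensional corridor; the environment, the random policy, and all names are illustrative assumptions:

```python
# Sketch: Monte-Carlo estimation of V^pi on a toy corridor.
import random

gamma = 1.0
n = 5                                       # states 0..4; state 4 is the goal

def run_episode():
    """Follow the (random) policy from a random start; record states and rewards."""
    s = random.randrange(n - 1)
    states, rewards = [], []
    while s != n - 1:
        states.append(s)
        s = max(s + random.choice([-1, 1]), 0)   # the policy pi: move at random
        rewards.append(-1.0)                     # reward -1 at each step
    return states, rewards

V, visits = [0.0] * n, [0] * n
for _ in range(5000):
    states, rewards = run_episode()
    G, returns = 0.0, []
    for r in reversed(rewards):             # future reward from each visit
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, G in zip(states, returns):       # update every visited state
        visits[s] += 1
        V[s] += (G - V[s]) / visits[s]      # incremental average of returns

print([round(v, 1) for v in V])             # estimated V^pi per state
```

Each estimate only changes after a whole episode finishes, which is why the method is so slow.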
Temporal Difference Learning
[Figure: state values along a path, before and after a one-step update.]
Temporal Difference — the difference between:
the Experienced value (immediate reward + value of next state), and
the Expected value.
It is a measure of unexpected success, with two sources:
a higher immediate reward than expected, or
reaching a better situation than expected.
Expected Value:
$V^\pi(s_t)$
One-step Experienced Value:
$r_{t+1} + \gamma \cdot V^\pi(s_{t+1})$
TD-signal — a measure of surprise / disappointment:
$\mathrm{TD} = r_{t+1} + \gamma \cdot V^\pi(s_{t+1}) - V^\pi(s_t)$
TD Learning:
$V^\pi(s_t) \leftarrow V^\pi(s_t) + \eta \, \mathrm{TD}$
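The same corridor estimate can be computed online with this rule, one update per step instead of one per episode. A sketch (same illustrative environment and names as before):

```python
# Sketch: TD(0) learning of V^pi on the toy corridor.
import random

gamma, eta, n = 1.0, 0.1, 5                 # state n-1 is the goal
V = [0.0] * n

for _ in range(5000):                       # episodes
    s = random.randrange(n - 1)
    while s != n - 1:
        s_next = max(s + random.choice([-1, 1]), 0)   # random policy
        r = -1.0                                      # reward per step
        td = r + gamma * V[s_next] - V[s]   # TD-signal: surprise/disappointment
        V[s] += eta * td                    # V(s) <- V(s) + eta * TD
        s = s_next

print([round(v, 1) for v in V])
```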
Learning to Act
How do we get a policy from the TD signal? Many possibilities...
Actor–Critic Model
(Barto, Sutton, and Anderson, IEEE Trans. Syst. Man & Cybern., 1983)
Q-Learning
(Watkins, 1989; Watkins & Dayan, Machine Learning, 1992)
Actor–Critic Model
[Diagram: both Actor and Critic receive the state; the Critic turns the reward into a TD signal that updates both, and the Actor emits the action.]
Actor — Associates states with actions, $\pi(\cdot)$
Critic — Associates states with their value, $V(\cdot)$
The TD-signal is used as a reinforcement signal to update both!
High TD (unexpected success):
increase the value of the preceding state,
increase the tendency to take the same action again.
Low TD (unexpected failure):
decrease the value of the preceding state,
decrease the tendency to take the same action again.
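As a sketch of how the TD signal can drive both parts, here is a minimal tabular actor-critic on the same illustrative corridor; the preference table, the softmax actor, and all parameters are assumptions for illustration, not the original 1983 formulation:

```python
# Sketch: tabular actor-critic. The Critic learns V from the TD signal;
# the Actor adjusts action preferences pref(s, a) with the same signal.
import math
import random

gamma, eta_v, eta_p, n = 0.9, 0.1, 0.1, 5   # state n-1 is the goal
actions = [-1, 1]
V = [0.0] * n
pref = {(s, a): 0.0 for s in range(n) for a in actions}

def sample_action(s):
    """Actor: softmax over the action preferences in state s."""
    weights = [math.exp(pref[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

for _ in range(3000):
    s = random.randrange(n - 1)
    while s != n - 1:
        a = sample_action(s)
        s2 = min(max(s + a, 0), n - 1)
        r = -1.0
        td = r + gamma * V[s2] - V[s]       # Critic's TD signal
        V[s] += eta_v * td                  # high TD: raise value of preceding state
        pref[(s, a)] += eta_p * td          # high TD: raise tendency to repeat action
        s = s2

print({s: max(actions, key=lambda a: pref[(s, a)]) for s in range(n - 1)})
```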
Q-Learning
Estimate $Q(s, a)$ instead of $V(s)$.
$Q(s, a)$: expected total reward when doing $a$ from $s$.
$\pi(s) = \operatorname{argmax}_a Q(s, a)$
$V(s) = \max_a Q(s, a)$
The Q-function can also be learned using Temporal-Difference:
$Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
where $s'$ is the next state.
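A sketch of tabular Q-learning with this update, again on the illustrative corridor (environment, parameters, and names are assumptions):

```python
# Sketch: tabular Q-learning.
# Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
import random

gamma, eta, n = 0.9, 0.1, 5                 # state n-1 is the goal
actions = [-1, 1]                           # move left / right
Q = {(s, a): 0.0 for s in range(n) for a in actions}

for _ in range(5000):
    s = random.randrange(n - 1)
    while s != n - 1:
        a = random.choice(actions)          # fully exploratory behavior
        s_next = min(max(s + a, 0), n - 1)
        r = -1.0
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
        s = s_next

greedy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n - 1)}
print(greedy)                               # should prefer +1, toward the goal
```

The behavior above is fully random while the update targets the greedy max over $a'$; this mismatch is what makes Q-learning off-policy.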
3 Improvements
Importance of Making Mistakes
Eligibility Trace
What do we do when...
The environment is not deterministic
The environment is not fully observable
There are way too many states
The states are not discrete
The agent is acting in continuous time
The Exploration–Exploitation Dilemma
If an agent strictly follows a greedy policy based on the current estimate of $Q$, learning is not guaranteed to converge to the true $Q$.
Simple solution: use a policy which has a certain probability of "making mistakes".
ε-greedy:
Sometimes (with probability ε) take a random action instead of the one that seems best (greedy)
Softmax:
Assign to each action a probability of being chosen, depending on how good it seems
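Both exploration policies fit in a few lines. A sketch (function names and parameters are illustrative):

```python
# Sketch: epsilon-greedy and softmax action selection.
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, temperature=1.0):
    """Choose each action with probability proportional to exp(Q / T)."""
    weights = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

q = [1.0, 2.0, 0.5]
print(epsilon_greedy(q), softmax(q))        # mostly action 1, sometimes others
```

Lowering ε (or the temperature) over time shifts the agent from exploration toward exploitation.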
Accelerated Learning
Idea: TD updates can be used to improve not only the last state's value, but also the values of states we visited earlier.
$\forall s, a: \quad Q(s, a) \leftarrow Q(s, a) + \eta \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \cdot e(s, a)$
$e$ is a remaining trace (the eligibility trace), encoding how long ago we were in $s$ doing $a$.
This is often denoted TD(λ), where λ is the time constant of the trace $e$.
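A sketch of this trace-based update, in the style of SARSA(λ), on the illustrative corridor used earlier; the decay schedule $e \leftarrow \gamma \lambda e$ and all names are assumptions for illustration:

```python
# Sketch: SARSA(lambda) with accumulating eligibility traces.
import random

gamma, lam, eta, n = 0.9, 0.8, 0.1, 5       # lam = trace time constant lambda
actions = [-1, 1]
Q = {(s, a): 0.0 for s in range(n) for a in actions}

for _ in range(2000):
    e = {sa: 0.0 for sa in Q}               # eligibility trace per (s, a)
    s, a = random.randrange(n - 1), random.choice(actions)
    while s != n - 1:
        s2 = min(max(s + a, 0), n - 1)
        a2 = random.choice(actions)         # exploratory behavior
        r = -1.0
        td = r + gamma * Q[(s2, a2)] - Q[(s, a)]   # one-step TD error
        e[(s, a)] += 1.0                    # mark the pair we just used
        for sa in Q:                        # update ALL pairs, weighted by trace
            Q[sa] += eta * td * e[sa]
            e[sa] *= gamma * lam            # traces fade with time
        s, a = s2, a2

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n - 1)})
```

Each TD update now reaches back along the recent trajectory, which is exactly the acceleration the trace is meant to provide.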