Review
Demystifying Deep Reinforcement Learning
written by Tambet Matiisen (Dec 22, 2015)
Hogeon Seo
Artificial Intelligence Laboratory
Institute of Integrated Technology (IIT)
Gwangju Institute of Science and Technology (GIST)
2018.11.09. @ Reinforcement Learning Study Seminar
https://ai.intel.com/demystifying-deep-reinforcement-learning/
Demystifying Deep Reinforcement Learning
Contents
How do I learn reinforcement learning?
Reinforcement Learning is Hot!
What is RL?
General approach to model the RL problem
Maximize the total future reward
A function Q(s,a) = the maximum DFR
How to get the Q-function?
Deep Q Network
Experience Replay
Exploration-Exploitation
Demystifying Deep Reinforcement Learning
Reinforcement Learning is Hot!
• Playing Atari with Deep Reinforcement Learning (Dec 19, 2013)
• AlphaGo: Mastering the ancient game of Go with Machine Learning (Jan 27, 2016)
Examples of RL: Deep Q Learning network playing ATARI, AlphaGo,
Berkeley robot stacking Legos, physically-simulated quadruped leaping over terrain.
Demystifying Deep Reinforcement Learning
What is RL?
• What we have to do is decide whether to move left, move right, or fire => a kind of classification
• In a given situation, we should choose the proper action to earn rewards => the goal of RL
• Compared to supervised and unsupervised learning,
RL has sparse and time-delayed labels: the rewards => a defining characteristic of RL
• The rewards depend on the preceding actions taken in a specific situation => credit assignment problem
• If an action results in a reward,
should we keep repeating it, or try other actions in search of higher rewards? => explore-exploit dilemma
Demystifying Deep Reinforcement Learning
General approach to model the RL problem
• When an AGENT performs an ACTION in a STATE of the ENVIRONMENT,
the ACTION can yield a REWARD
• ACTIONS follow the POLICY, which specifies what to do in each STATE
• The ENVIRONMENT is stochastic
• The next STATE is therefore somewhat random
• The set of STATES and ACTIONS, together with the rules for transitioning between them => a Markov decision process
• An EPISODE consists of a finite sequence of STATES, ACTIONS, and REWARDS
• The Markov decision process relies on the Markov assumption
that the probability of the next STATE depends only on the current STATE and ACTION (one EPISODE of this loop is sketched below)
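A minimal Python sketch of one EPISODE of the agent-environment loop described above; the env.reset()/env.step() interface and the choose_action placeholder are illustrative assumptions, not part of the original post.

    # Sketch of one EPISODE of the agent-environment loop.
    # env.reset()/env.step() and choose_action are assumed placeholders.
    def run_episode(env, choose_action):
        state = env.reset()                               # initial STATE
        total_reward, done = 0.0, False
        while not done:
            action = choose_action(state)                 # ACTION chosen by the POLICY
            next_state, reward, done = env.step(action)   # stochastic ENVIRONMENT transition
            total_reward += reward                        # REWARDS accumulate over the EPISODE
            state = next_state                            # Markov assumption: only the current STATE matters
        return total_reward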
Demystifying Deep Reinforcement Learning
Maximize the total future reward
• For an AGENT, a good strategy is to always choose the action that maximizes the future reward
• Given an EPISODE, the total reward is the sum of all REWARDS received
• Given an EPISODE, the total future reward from time t is the sum of REWARDS from t onward
• However, because the ENVIRONMENT is stochastic, the further into the future we look, the more the REWARDS may diverge.
To account for this uncertainty, the discounted future reward (DFR) is used (see the formulas after this slide)
• If γ = 0, the POLICY is short-sighted and relies only on immediate REWARDS.
If the ENVIRONMENT is deterministic and the same ACTIONS always result in the same REWARDS, γ = 1 can be used
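In the notation of the original post, the total reward, the total future reward from time t, and the discounted future reward are:

    R = r_1 + r_2 + \dots + r_n
    R_t = r_t + r_{t+1} + \dots + r_n
    R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_n = r_t + \gamma R_{t+1}

where γ is the discount factor between 0 and 1.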
Demystifying Deep Reinforcement Learning
A function Q(s,a) = the maximum DFR
• Q(s, a) is the maximum discounted future reward
obtained when ACTION a is performed in STATE s and the subsequent ACTIONS are chosen optimally:
in other words, the best possible score at the end of the EPISODE
after performing ACTION a in STATE s
• This function is called the Q-function
because it represents the “quality” of a certain ACTION in a given STATE
• Given the Q-function, the POLICY simply picks the ACTION with the highest Q-value (see the formulas after this slide)
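In symbols, following the original post:

    Q(s_t, a_t) = \max R_{t+1}
    \pi(s) = \arg\max_a Q(s, a)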
Demystifying Deep Reinforcement Learning
How to get the Q-function?
• Using the discounted future reward, the Q-value of STATE s and ACTION a
can be expressed in terms of the Q-value of the next STATE s’:
Q(s, a) = r + γ max_a’ Q(s’, a’) <= the Bellman equation
• Q-learning iteratively approximates the Q-function using the Bellman equation.
A basic Q-learning algorithm is sketched below
• α is the learning rate, which controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account
• If α = 1, the update reduces exactly to the Bellman equation
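A minimal tabular Q-learning sketch in Python; the env interface, the ε-greedy action selection, and the default hyperparameters are illustrative assumptions, not taken from the original post.

    # Tabular Q-learning sketch; env.reset()/env.step()/env.actions are assumed placeholders.
    import random
    from collections import defaultdict

    def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                      # Q[(state, action)], initialized to 0
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                # epsilon-greedy: explore with probability epsilon, otherwise exploit
                if random.random() < epsilon:
                    action = random.choice(env.actions)
                else:
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Bellman-based update: move Q[s,a] toward r + gamma * max_a' Q[s',a']
                target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in env.actions)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q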
Demystifying Deep Reinforcement Learning
Deep Q Network
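As a rough illustration of the idea, the sketch below (in PyTorch) shows a network that takes the STATE as input and outputs one Q-value per ACTION, trained to minimize the squared difference between its prediction and the Bellman target r + γ max_a’ Q(s’, a’). The fully connected layers and their sizes are simplifying assumptions; DeepMind’s network is convolutional and takes stacked game frames as input.

    # Minimal deep Q-network sketch (layer sizes are illustrative).
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, state_dim, num_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, num_actions),      # one Q-value per ACTION
            )

        def forward(self, state):
            return self.net(state)

    def td_loss(q_net, batch, gamma=0.99):
        # batch: tensors of states, actions (long), rewards, next states, done flags (float)
        states, actions, rewards, next_states, dones = batch
        q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = q_net(next_states).max(dim=1).values
            target = rewards + gamma * q_next * (1 - dones)
        # squared error between prediction and the Bellman target
        return nn.functional.mse_loss(q_pred, target)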
Demystifying Deep Reinforcement Learning
Experience Replay
• Approximating the Q-function with a convolutional neural network takes a long time to train
• The key trick is experience replay
• During play, all EXPERIENCES <s, a, r, s’> are stored in a replay memory
• When training the network,
random mini-batches sampled from the replay memory are used
instead of only the most recent transition
=> this breaks the similarity of subsequent training samples,
which could otherwise drive the network into a local minimum (see the sketch below)
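A minimal replay memory sketch in Python; the capacity and batch size are illustrative choices, not values from the original post.

    # Replay memory: store experiences <s, a, r, s'> and sample random mini-batches.
    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)      # oldest experiences are dropped first

        def store(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # random sampling breaks the correlation between consecutive transitions
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)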
Demystifying Deep Reinforcement Learning
Exploration-Exploitation
• Q-learning addresses the credit assignment problem:
it propagates rewards back in time until it reaches the crucial decision point that actually caused the REWARD
• If the Q-table or Q-network is initialized randomly, its predictions are initially random as well
• If we pick the ACTION with the highest Q-value,
the ACTION will be random and the AGENT performs naive “exploration”
• As the Q-function converges,
it returns more consistent Q-values and the amount of exploration decreases
=> the AGENT settles for the first effective strategy it finds
• To solve this issue, ε-greedy exploration is used – with probability ε, choose a random ACTION (see the sketch below)
• DeepMind decreases ε from 1 to 0.1 over time – in the beginning the system makes completely random
moves to explore the STATE space maximally, and then it settles down to a fixed exploration rate
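A minimal ε-greedy sketch in Python with ε annealed linearly from 1.0 to 0.1; the schedule length and the q_values interface are illustrative assumptions.

    # Epsilon-greedy action selection with a linearly annealed epsilon.
    import random

    def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, anneal_steps=1000000):
        # decay epsilon linearly over anneal_steps, then keep it fixed at eps_end
        fraction = min(step / anneal_steps, 1.0)
        return eps_start + fraction * (eps_end - eps_start)

    def select_action(q_values, step):
        eps = epsilon_by_step(step)
        if random.random() < eps:
            return random.randrange(len(q_values))    # explore: random ACTION
        # exploit: ACTION with the highest predicted Q-value
        return max(range(len(q_values)), key=lambda a: q_values[a])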
