
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.



  1. 1. Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Technical University of Catalonia Reinforcement Learning (Reloaded) Day 11 Lecture 1 #DLUPC http://bit.ly/dlai2018
  2. 2. 2 Acknowledgements Víctor Campos victor.campos@bsc.es PhD Candidate Barcelona Supercomputing Center
  3. 3. 3 Acknowledgements Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
  4. 4. 4 A broader picture of types of learning... Slide inspired by Alex Graves (Deepmind) at “Unsupervised Learning Tutorial” @ NeurIPS 2018. ...with a teacher ...without a teacher Active agent... Reinforcement learning (with extrinsic reward) Intrinsic motivation / Exploration. Passive agent... Supervised learning Unsupervised learning
  5. 5. 5 A broader picture of types of learning... Slide inspired by Alex Graves (Deepmind) at “Unsupervised Learning Tutorial” @ NeurIPS 2018. ...with a teacher ...without a teacher Active agent... Reinforcement learning (with extrinsic reward) Intrinsic motivation / Exploration. Passive agent... Supervised learning Unsupervised learning
  6. 6. 6 Outline 1. Previously in DLAI
  7. 7. 7 Architecture Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
  8. 8. 8 Training data Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018) In RL, training data is obtained by recording interaction sequences of states, actions and rewards. A complete episode of T interactions: S1, A1, R2, S2, A2, …, ST. One experience corresponds to a single tuple within an episode: (s, a, r, s′).
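To make the experience bookkeeping concrete, here is a minimal Python sketch (my own illustration, not code from the lecture) that rolls out one episode and stores every (s, a, r, s′) tuple. The ToyEnv environment and the random policy are hypothetical stand-ins for whatever environment and policy are actually used.

```python
# Roll out one episode and store each experience tuple (s, a, r, s').
import random

class ToyEnv:
    """Hypothetical environment: a walk on positions 0..4, reward +1 at position 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                       # action in {-1, +1}
        self.state = min(max(self.state + action, 0), 4)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

def collect_episode(env, policy, max_steps=100):
    """Return a list of (s, a, r, s') experience tuples for one episode."""
    experiences = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        experiences.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return experiences

random_policy = lambda s: random.choice([-1, +1])
episode = collect_episode(ToyEnv(), random_policy)
print(episode)   # e.g. [(0, 1, 0.0, 1), (1, -1, 0.0, 0), ...]
```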
  9. 9. 9 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. An MDP is defined by the tuple (S, A, R, P, γ). Interaction loop: the environment samples an initial state s0 ~ p(s0); at each step the agent selects an action a, and the environment samples the next state s′ ~ P(·|s, a) and a reward r ~ R(·|s, a). [Figure: agent-environment loop over state (s), action (a), reward (r).]
  10. 10. 10 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The MDP is defined by the tuple (S, A, R, P, γ): the set of possible states S, the set of possible actions A, the reward distribution R, the transition probability distribution P over next states, and the discount factor γ.
  11. 11. 11 MDP: Model (of the Environment) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Within the tuple (S, A, R, P, γ), the Model (of the Environment) corresponds to the transition probabilities P and the reward distribution R.
  12. 12. 12 MDP: Model (of the Environment) The transition function P(s′, r | s, a) records the probability of transitioning from s to s′ after taking action a while obtaining reward r: P(s′, r | s, a) = ℙ[S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a], where ℙ denotes probability. State-transition function (sum of the transition function over all rewards): P(s′ | s, a) = Σ_r P(s′, r | s, a). Reward function: R(s, a) = 𝔼[R_{t+1} | S_t = s, A_t = a] = Σ_r r Σ_{s′} P(s′, r | s, a).
  13. 13. 13 MDP: Model (of the Environment) The Model (of the Environment) is defined by: ● P(s′|s,a): state-transition function ● R(s,a): reward function. The environment samples the next state s′ ~ P(·|s, a) and the reward r ~ R(·|s, a).
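As a rough sketch (assumptions mine, not from the slides), the model of a small discrete MDP can be stored as a table of the joint transition function P(s′, r | s, a); the state-transition function P(s′|s, a) and the reward function R(s, a) then follow by summing over rewards, exactly as in the definitions above. The two-state MDP below is hypothetical.

```python
# Tabular model of a hypothetical 2-state MDP: model[(s, a)] = list of (probability, s', r).
model = {
    ("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a"): [(1.0, "s1", 0.0)],
}

def transition_prob(s, a, s_next):
    """P(s'|s,a) = sum over rewards of P(s', r | s, a)."""
    return sum(p for p, sp, r in model[(s, a)] if sp == s_next)

def expected_reward(s, a):
    """R(s,a) = E[r | s, a] = sum over outcomes of r * P(s', r | s, a)."""
    return sum(p * r for p, sp, r in model[(s, a)])

print(transition_prob("s0", "a", "s1"))   # 0.8
print(expected_reward("s0", "a"))         # 0.8
```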
  14. 14. 14 Figure: OpenAI Spinning Up MDP: Model (of the Environment)
  15. 15. 15 MDP: Model (of the Environment) ● Model-based RL: R(·|s,a) and P(·|s,a) are known, so an optimal solution can be found with Dynamic Programming. Because of the poor data efficiency of model-free RL, model-based approaches are often the only practical option for real-world agents. More details in Sergey Levine’s slides @ Berkeley. ● Model-free RL: there is no prior knowledge of the world; the agent must learn all dynamics from scratch. The resulting poor data efficiency limits its application to real-world agents and typically requires sim2real transfer for real deployment.
  16. 16. 16 MDP: Model (of the Environment) Example of sim2real: Dexterity (OpenAI 2018)
  17. 17. 17 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The agent selects actions according to its Policy π, a function S ➝ A that specifies which action to take given a state. ● On-policy learning: the agent is trained on interaction sequences generated by the target policy itself. ● Off-policy learning: the agent is trained on interaction sequences obtained from a behaviour policy different from the target policy.
  18. 18. 18 MDP: Policy The Policy π is a function S ➝ A that specifies which action to take in each state. ● Deterministic: π(s) = a ● Stochastic: π(a|s) = ℙπ[A = a | S = s]
  19. 19. 19 MDP: Policy The Policy ㄫ is a function S ➝ A that specifies which action to take in each state. ● Discrete actions: categorical distribution over actions. ○ In the deterministic case: take the argmax action. ○ In the stochastic case: sample from the categorical distribution. ● Continuous actions: Gaussian distribution over actions; i.e. the policy generally outputs the mean and std per dimension. ○ In the deterministic case: take the mean. ○ In the stochastic case: sample from a normal distribution. Slide concept: Víctor Campos
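The four cases above can be illustrated with a short sketch (mine, with made-up numbers standing in for the outputs of a policy network):

```python
# Discrete vs. continuous actions, each with a deterministic or stochastic policy.
import numpy as np

rng = np.random.default_rng(0)

# --- Discrete actions: the policy outputs a categorical distribution --------
action_probs = np.array([0.1, 0.7, 0.2])                    # pi(a|s) for 3 actions
a_det = int(np.argmax(action_probs))                        # deterministic: argmax action
a_sto = int(rng.choice(len(action_probs), p=action_probs))  # stochastic: sample the categorical

# --- Continuous actions: the policy outputs mean and std per dimension ------
mean, std = np.array([0.3, -1.2]), np.array([0.5, 0.1])
u_det = mean                                                # deterministic: take the mean
u_sto = rng.normal(mean, std)                               # stochastic: sample the Gaussian

print(a_det, a_sto, u_det, u_sto)
```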
  20. 20. 20 MDP: Return Gt The future reward, also known as the return Gt, is the total sum of discounted rewards from time t onwards: Gt = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}, where γ ∈ [0,1] is the discount factor.
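A minimal sketch of this definition, computing Gt for every time step of a hypothetical reward sequence by scanning it backwards:

```python
# Discounted returns G_t = R_{t+1} + gamma * G_{t+1}, computed from the end of the episode.
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ..., G_{T-1}] for one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g        # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))   # [0.81, 0.9, 1.0]
```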
  21. 21. 21 MDP: Value Functions Vπ(s) & Qπ(s,a) Given a policy π, a value function is the expected return Gt obtained by following π: ● State-value function: Vπ(s) = 𝔼π[Gt | St = s], the expected return when starting from state s. ● Action-value function (more specific): Qπ(s,a) = 𝔼π[Gt | St = s, At = a], the expected return when starting from state s and taking action a.
  22. 22. 22 MDP: Advantage Function Aπ(s,a) The Advantage function (A-value) is defined as: Aπ(s,a) = Qπ(s,a) − Vπ(s). Aπ(s,a) tells how good or bad an action is with respect to the expected return over all possible actions in a given state.
  23. 23. 23 MDP: Optimal Value and Policy The optimal value functions produce the maximum return: V*(s) = maxπ Vπ(s) and Q*(s,a) = maxπ Qπ(s,a). The optimal policy π* is the policy that achieves these optimal value functions V*(s) and Q*(s,a).
  24. 24. 24 Outline 1. Previously in DLAI 2. Common Approaches
  25. 25. 25 Value vs Policy vs Model Value function Policy π Model (of the Environment)
  26. 26. 26 Value vs Policy vs Model Value function Policy π Model (of the Environment) Goals of Reinforcement Learning
  27. 27. 27 Value vs Policy vs Model David Silver (DeepMind), “Introduction to Reinforcement Learning” (2015). Summary of approaches in RL based on whether we want to learn the value function, the policy, or the model of the environment.
  28. 28. 28 Value vs Policy vs Model David Silver (DeepMind), “Introduction to Reinforcement Learning” (2015). Summary of approaches in RL based on whether we want to learn the value function, the policy, or the model of the environment.
  29. 29. 29 Value vs Policy vs Model Value function Policy π Model (of the Environment) Goals of Reinforcement Learning
  30. 30. 30 Inferring a Policy from the Value Value function Policy π Given a value function, a policy can be easily defined. For example, the greedy policy: π(s) = arg maxa∈A Q(s,a)
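For example, with a hypothetical tabular Q, the greedy policy is just an argmax per state; a minimal sketch:

```python
# Greedy policy derived from an action-value table: pi(s) = argmax_a Q(s, a).
Q = {  # Q[s][a], illustrative values only
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_policy(s):
    return max(Q[s], key=Q[s].get)   # action with the highest Q-value in state s

print(greedy_policy("s0"))   # 'right'
print(greedy_policy("s1"))   # 'left'
```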
  31. 31. 31 Generalized Policy Iteration (GPI) Generalized Policy Iteration (GPI) algorithms adopt an iterative procedure to improve the policy π: the value function Vπi is repeatedly updated to better approximate the true value of the current policy, while the policy is repeatedly improved towards optimality. Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
  32. 32. 32 Generalized Policy Iteration (GPI) Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012) Example: Grid world ● The agent lives in a grid. ● Walls block the agent’s path. ● Big rewards come at the end.
  33. 33. 33 Generalized Policy Iteration (GPI) Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012) Example: Grid world with noise = 0.2 ● The agent’s actions do not always go as planned: ○ Action North takes the agent: ■ North (80%) ■ West (10%) ■ East (10%) ○ Similarly for the West, East & South actions.
  34. 34. 34 Generalized Policy Iteration (GPI) Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
  35. 35. 35 Generalized Policy Iteration (GPI) Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
  36. 36. 36 Generalized Policy Iteration (GPI) Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
  37. 37. 37 Generalized Policy Iteration (GPI) Pieter Abbeel, “Markov Decision Processes and Exact Solution Methods” (2012)
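In the spirit of the grid-world slides above, here is a rough value-iteration sketch (my own simplification: the layout, rewards and the convention of pinning terminal values to their exit rewards are illustrative choices, not the exact Berkeley example). It repeatedly backs up state values under the noisy action model: the intended move succeeds with probability 0.8 and slips to each perpendicular direction with probability 0.1.

```python
# Value iteration on a hypothetical 3x4 grid with one wall and two terminal cells (+1 / -1).
GAMMA, NOISE = 0.9, 0.2
ROWS, COLS = 3, 4
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}     # big rewards come at the end
WALLS = {(1, 1)}

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def step(s, move):
    r, c = s[0] + MOVES[move][0], s[1] + MOVES[move][1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in WALLS:
        return s                              # bumping into a wall keeps the agent in place
    return (r, c)

def value_iteration(iters=100):
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) not in WALLS}
    for _ in range(iters):
        new_V = {}
        for s in V:
            if s in TERMINALS:
                new_V[s] = TERMINALS[s]       # terminal value fixed to its exit reward
                continue
            q_values = []
            for a in MOVES:
                outcomes = [(1 - NOISE, step(s, a))] + \
                           [(NOISE / 2, step(s, p)) for p in PERP[a]]
                q_values.append(sum(p * GAMMA * V[s2] for p, s2 in outcomes))
            new_V[s] = max(q_values)          # Bellman optimality backup
        V = new_V
    return V

print(value_iteration()[(2, 0)])   # value of the bottom-left start state
```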
  38. 38. 38 GPI: Monte-Carlo Methods (MC) Monte-Carlo (MC) methods learn from complete episodes of raw experience, S1, A1, R2, S2, A2, …, ST, and use the observed mean return as an approximation of the expected return Gt: V(s) = Σt 𝟙[St = s] Gt / Σt 𝟙[St = s], where 𝟙[St = s] is a binary indicator function. Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
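A minimal sketch (not the lecture's code) of this Monte-Carlo estimate: V(s) as the empirical mean of the returns observed after visiting s, over complete episodes given here as hypothetical lists of (state, reward) pairs.

```python
# Every-visit Monte-Carlo prediction of V(s) from complete episodes.
from collections import defaultdict

def mc_state_values(episodes, gamma=1.0):
    total_return = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:                   # episode = [(s_t, r_{t+1}), ...]
        g = 0.0
        for s, r in reversed(episode):         # accumulate G_t backwards through the episode
            g = r + gamma * g
            total_return[s] += g               # sum of returns observed from state s
            visits[s] += 1                     # number of visits to s
    return {s: total_return[s] / visits[s] for s in visits}

episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 0.0), ("B", 1.0)]]
print(mc_state_values(episodes))   # empirical mean return per state
```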
  39. 39. 39 GPI: Monte-Carlo Methods (MC) The optimal policy π* is learned by iterating as follows: 1. Improve the policy greedily with respect to the current Q: π(s) = arg maxa∈A Q(s,a). 2. Generate a new complete episode with the updated policy π. 3. Update Q using the new episode. Lilian Weng, “A (Long) Peek into Reinforcement Learning” (2018)
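A rough sketch of this loop (my own, under simplifying assumptions: an ε-greedy version of step 1, every-visit updates in step 3, and a hypothetical TwoStateEnv so the code is self-contained):

```python
# Monte-Carlo control: act (epsilon-)greedily, collect a complete episode, update Q.
import random
from collections import defaultdict

ACTIONS = [0, 1]
GAMMA, EPSILON = 0.99, 0.1

class TwoStateEnv:
    """Hypothetical toy environment: action 1 moves towards the goal state 2."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + a, 2)
        done = self.s == 2
        return self.s, (1.0 if done else 0.0), done

Q = defaultdict(float)      # Q[(s, a)]
N = defaultdict(int)        # visit counts for incremental means

def policy(s):
    if random.random() < EPSILON:                     # occasional exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])      # 1. greedy w.r.t. the current Q

def mc_control_iteration(env, max_steps=50):
    # 2. Generate a complete episode with the current policy.
    s, episode = env.reset(), []
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        episode.append((s, a, r))
        s = s_next
        if done:
            break
    # 3. Update Q with the returns observed in the episode (every-visit MC).
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + GAMMA * g
        N[(s, a)] += 1
        Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]      # incremental mean of returns

env = TwoStateEnv()
for _ in range(200):
    mc_control_iteration(env)
print(dict(Q))   # Q should end up favouring action 1 in states 0 and 1
```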
  40. 40. 40 Figure: OpenAI Spinning Up Deep Q-learning (DQN)
  41. 41. 41 Deep Q-learning (DQN) Problem: estimating the optimal Q*(s,a) by exhaustive search might be feasible for a small number of state-action pairs, but it is intractable for large state-action spaces. Solution: learn a function approximation Q(s,a; θ) with machine learning, e.g. a neural network with parameters θ, and act greedily with respect to it: π(s) = arg maxa∈A Q(s,a; θ). Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
  42. 42. 42 Deep Q-learning (DQN) A single feed-forward pass through Q(s, ·; θ) computes the Q-values for all actions from the current state, which is efficient. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
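A condensed PyTorch sketch of this idea (my own, not the Nature paper's full recipe: experience replay and the separate target network are omitted, and the network sizes are arbitrary). The network maps a state to Q-values for all actions in one forward pass and is regressed towards the TD target r + γ·maxa′ Q(s′, a′; θ).

```python
# Core DQN update: Q(s, .; theta) in one forward pass, regression to the TD target.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One gradient step on a batch of (s, a, r, s', done) transitions."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta)
    with torch.no_grad():
        target = r + GAMMA * (1 - done) * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical batch of 8 random transitions, just to show the expected shapes.
s = torch.randn(8, STATE_DIM); s_next = torch.randn(8, STATE_DIM)
a = torch.randint(N_ACTIONS, (8,)); r = torch.randn(8); done = torch.zeros(8)
print(dqn_update(s, a, r, s_next, done))
```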
  43. 43. 43 Deep Q-learning (DQN) Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
  44. 44. 44 Value vs Policy vs Model David Silver (DeepMind), “Introduction to Reinforcement Learning” (2015). Summary of approaches in RL based on whether we want to learn the value function, the policy, or the model of the environment.
  45. 45. 45 Value vs Policy vs Model Value function Policy π Model (of the Environment) Goals of Reinforcement Learning
  46. 46. 46 Value vs Policy vs Model Value function Policy π Directly learn the policy by estimating the parameters θ of a stochastic policy function: π(a|s; θ)
  47. 47. 47 Figure: OpenAI Spinning Up REINFORCE (Vanilla Policy Gradient - VPG)
  48. 48. 48 Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. "Policy gradient methods for reinforcement learning with function approximation." NIPS 2000. REINFORCE (Vanilla Policy Gradient - VPG) is the mathematical formulation of ‘trial and error’: try an action, and make it more likely if it resulted in positive reward; otherwise, make it less likely. Note the opposite optimization directions: ● Supervised learning: minimize a loss function by gradient descent. ● Reinforcement learning: maximize J(θ), the expected return of the policy, by gradient ascent.
  49. 49. 49 Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018) Policy parameters (weights of the NN) REINFORCE (Vanilla Policy Gradient - VPG)
  50. 50. 50 Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018) Policy parameters (weights of the NN) Estimation of the expectation over all possible trajectories, using N sampled trajectories (Monte Carlo) REINFORCE (Vanilla Policy Gradient - VPG)
  51. 51. 51 Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018) Policy parameters (weights of the NN) Estimation of the expectation over all possible trajectories, using N sampled trajectories (Monte Carlo) The policy gradient: if we follow it, the action ai,t will be more likely if the agent ever finds itself again in state si,t REINFORCE (Vanilla Policy Gradient - VPG)
  52. 52. 52 Slide: Víctor Campos, Barcelona TensorFlow Meetup (2018) Policy parameters (weights of the NN) Estimation of the expectation over all possible trajectories, using N sampled trajectories (Monte Carlo) Cumulative (discounted) reward REINFORCE (Vanilla Policy Gradient - VPG) The policy gradient: if we follow it, the action ai,t will be more likely if the agent ever finds itself again in state si,t. The annotated estimator: ∇θ J(θ) ≈ (1/N) Σi [ Σt ∇θ log πθ(ai,t | si,t) ] · [ Σt r(si,t, ai,t) ]
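Putting the annotated estimator into code, here is a minimal PyTorch sketch of a REINFORCE update (my own illustration; the network, the single-trajectory batch and the return value are hypothetical, and no baseline or reward-to-go refinement is included):

```python
# REINFORCE / vanilla policy gradient: ascend J(theta) by weighting log-probs with the return.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, episode_return):
    """One ascent step from a single sampled trajectory (N = 1)."""
    logits = policy_net(states)                                           # (T, N_ACTIONS)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs.sum() * episode_return)   # minimizing -J(theta) = gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical trajectory of T = 5 steps with cumulative (discounted) reward 3.0.
states = torch.randn(5, STATE_DIM)
actions = torch.randint(N_ACTIONS, (5,))
print(reinforce_update(states, actions, 3.0))
```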
  53. 53. 53 Learn more Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley. Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
  54. 54. 54 Learn more David Silver, UCL COMP050, Reinforcement Learning
  55. 55. 55 Learn more OpenAI Spinning Up in Deep RL
  56. 56. 56 A taxonomy of RL Algorithms Figure: OpenAI Spinning Up
  57. 57. 57 Final Questions
