
Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona 2018


https://telecombcn-dl.github.io/2018-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.



  1. 1. Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Technical University of Catalonia Reinforcement Learning: MDP & DQN Day 5 Lecture 2 #DLUPC http://bit.ly/dlai2018
  2. 2. 2 Acknowledgements Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  3. 3. 3 Acknowledgements Víctor Campos victor.campos@bsc.es PhD Candidate Barcelona Supercomputing Center Míriam Bellver miriam.bellver@bsc.edu PhD Candidate Barcelona Supercomputing Center
  4. 4. 4 Video lecture Xavier Giró, DLAI 2017
  5. 5. 5 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  6. 6. 6 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  7. 7. Types of machine learning Yann LeCun’s Black Forest cake 7
  8. 8. 8 Types of machine learning We can categorize three types of learning procedures: 1. Supervised Learning: 𝐲 = ƒ(𝐱), predict label y corresponding to observation x. 2. Unsupervised Learning: ƒ(𝐱), estimate the distribution of observation x. 3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱), predict action y based on observation x, to maximize a future reward z.
  9. 9. 9 Types of machine learning We can categorize three types of learning procedures: 1. Supervised Learning: 𝐲 = ƒ(𝐱) 2. Unsupervised Learning: ƒ(𝐱) 3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱), to maximize a future reward 𝐳
  10. 10. 10 Rich Sutton, “Temporal-Difference Learning” University of Alberta (2017)
  11. 11. 11 Motivation What is Reinforcement Learning (RL) ? “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96] Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of artificial intelligence research 4 (1996): 237-285.
  12. 12. 12 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  13. 13. 13 Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
  14. 14. 14 Architecture Figure: UCL Course on RL by David Silver
  15. 15. 15 Architecture Figure: UCL Course on RL by David Silver Environment
  16. 16. 16 Architecture Figure: UCL Course on RL by David Silver Environment state (st )
  17. 17. 17 Architecture Figure: UCL Course on RL by David Silver Environment state (st )
  18. 18. 18 Architecture Figure: UCL Course on RL by David Silver Environment Agent state (st )
  19. 19. 19 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) state (st )
  20. 20. 20 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st )
  21. 21. 21 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st ) The reward is given to the agent with a delay with respect to previous states and actions!
  22. 22. 22 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 )
  23. 23. 23 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) GOAL: Reach the highest score possible at the end of the game.
  24. 24. 24 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) GOAL: Learn how to take actions to maximize cumulative reward
  25. 25. 25 Architecture Multiple problems can be formulated with an RL architecture. Cart-Pole Problem Objective: Balance a pole on top of a movable cart Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  26. 26. 26 Architecture Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied on the cart. Reward: 1 at each time step if the pole is upright.
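
The agent-environment loop of these slides maps directly onto the OpenAI Gym interface mentioned later under RL Frameworks. As a minimal illustration only (assuming the classic Gym API, where env.step() returns observation, reward, done, info), a random agent on CartPole-v0 looks like this:

import gym

# Cart-Pole: state is [position, horizontal velocity, angle, angular speed],
# action pushes the cart left or right, reward is +1 per step the pole stays upright.
env = gym.make("CartPole-v0")

state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()              # placeholder agent: random action
    state, reward, done, info = env.step(action)    # environment returns next state and reward
    total_reward += reward

print("Episode return:", total_reward)
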
  27. 27. 27 Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page] Multiple problems can be formulated with an RL architecture. Robot Locomotion Objective: Make the robot move forward
  28. 28. 28 Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page] Environment Agent action (At ) reward (rt ) state (st ) State: angle and position of the joints. Action: torques applied on the joints. Reward: 1 at each time step for being upright + moving forward.
  29. 29. 29 Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
  30. 30. 30 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  31. 31. 31 Markov Decision Process (MDP) Markov Decision Processes provide a formalism for reinforcement learning problems. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Markov property: the current state completely characterises the state of the world.
  32. 32. 32 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. An MDP is defined by the tuple (S, A, R, P, γ): set of states S, set of actions A, reward distribution R, transition probabilities P, and discount factor γ.
  33. 33. 33 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Environment samples initial state s0 ~ p(s0 ) Agent selects action at Environment samples next state st+1 ~ P ( . | st , at ) Environment samples reward rt ~ R( . | st , at ) reward (rt ) state (st ) action (at )
  34. 34. 34 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Agent selects action at following policy π. A Policy π is a function S ➝ A that specifies which action to take in each state.
  35. 35. 35 MDP: Policy A Policy π is a function S ➝ A that specifies which action to take in each state. The agent selects action at following policy π. Agent GOAL: Learn how to take actions to maximize reward. MDP GOAL: Find the policy π* that maximizes the cumulative discounted reward.
  36. 36. 36 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Multiple problems can be formulated with an RL architecture. Grid World (a simple MDP) Objective: reach one of the terminal states (greyed out) in the least number of actions.
  37. 37. 37 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) Each cell is a state. A negative “reward” (penalty) is given for each transition: rt = r = -1
  38. 38. 38 MDP: Policy: Random Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Example: Actions resulting from applying a random policy on this Grid World problem.
  39. 39. 39 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exercise: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
  40. 40. 40 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Solution: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
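
As a toy illustration of this Grid World, the sketch below rolls out a random policy and accumulates the per-step penalty. The exact layout (a 4x4 grid with terminal states in two opposite corners) is an assumption made for the example, not necessarily the one drawn in the original figure.

import random

# Toy Grid World: 4x4 grid, two terminal corner states, reward -1 per move (assumed layout).
N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition: move if the target cell is inside the grid, otherwise stay."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), -1          # every transition costs -1
    return (r, c), -1

def random_policy(state):
    return random.choice(list(ACTIONS))

state, total_reward = (2, 1), 0
while state not in TERMINAL:
    action = random_policy(state)
    state, reward = step(state, action)
    total_reward += reward

print("Random-policy return:", total_reward)   # an optimal policy reaches a terminal state in far fewer steps
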
  41. 41. 41 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, action...) ? GOAL: Find policy π* that maximizes the cumulative discounted reward: Environment samples initial state s0 ~ p(s0 ) Agent selects action at ~π (.|st ) Environment samples next state st+1 ~ P ( .| st , at ) Environment samples reward rt ~ R(. | st ,at ) reward (rt ) state (st ) action (at )
  42. 42. 42 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, actions)? GOAL: Find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* maximizes the expected sum of rewards, where the expectation is taken over the initial state, the action selected at each step t, and the state sampled for step t+1.
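
The equation image does not survive in this transcript; following the CS231n formulation credited on the slide, the objective reads (in LaTeX):

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, \pi\right], \quad \text{with } s_{0} \sim p(s_{0}),\; a_{t} \sim \pi(\cdot \mid s_{t}),\; s_{t+1} \sim P(\cdot \mid s_{t}, a_{t})
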
  43. 43. 43 MDP: Value Function Vπ (s) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good state s is for a given policy π? With the value function at state s, Vπ (s): the expected cumulative reward from following policy π starting from state s.
  44. 44. 44 MDP: Q-Value Function Qπ (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good a state-action pair (s,a) is for a given policy π? With the Q-value function at state s and action a, Qπ (s,a): the expected cumulative reward from taking action a in state s and then following policy π.
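
The formulas are again lost in the transcript; the standard definitions matching the wording of these two slides are:

V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, \pi\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_{0} = s, a_{0} = a, \pi\right]
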
  45. 45. 45 MDP: Optimal Q-Value Function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The optimal Q-value function at state s and action a, Q* (s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair, i.e. the one obtained by choosing the policy that maximizes the expected cumulative reward.
  46. 46. 46 MDP: Optimal Q-Value Function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q* (s,a) satisfies the following Bellman equation: the maximum expected cumulative reward for the considered pair (s,a) equals the reward for (s,a) plus the discounted (by the discount factor) maximum expected cumulative reward for the future pair (s’,a’), taken in expectation over the possible future states s’ (randomness).
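
In symbols (a reconstruction of the lost equation, in the notation of Mnih et al. 2013):

Q^{*}(s,a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right]
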
  47. 47. 47 MDP: Bellman Equation Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q* (s,a) satisfies the Bellman equation above. GOAL: Find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* corresponds to taking the best action in any state according to Q*: in each state, select the action that maximizes the expected cumulative reward.
  48. 48. 48 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Value iteration algorithm: use the Bellman equation as an iterative update, computing an updated Q-value function from the current Q-values of the future pairs (s’,a’). The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
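
A minimal sketch of this value iteration update on a tiny tabular MDP; the MDP itself (its transition tensor P and reward tensor R) is made up purely for illustration:

import numpy as np

# Tabular Q-value iteration:
# Q_{i+1}(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * max_{a'} Q_i(s',a'))
n_states, n_actions, gamma = 3, 2, 0.9

P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] transition probabilities (illustrative)
P[0, 0, 1] = 1.0; P[0, 1, 2] = 1.0
P[1, 0, 2] = 1.0; P[1, 1, 0] = 1.0
P[2, :, 2] = 1.0                                # state 2 is absorbing
R = np.zeros((n_states, n_actions, n_states))   # R[s, a, s'] rewards (illustrative)
R[0, 1, 2] = 1.0; R[1, 0, 2] = 5.0

Q = np.zeros((n_states, n_actions))
for i in range(200):                            # iterate until (approximate) convergence
    Q_new = np.einsum("san,san->sa", P, R + gamma * Q.max(axis=1))
    if np.max(np.abs(Q_new - Q)) < 1e-8:
        break
    Q = Q_new

print("Q* estimate:\n", Q)
print("Greedy (optimal) policy:", Q.argmax(axis=1))
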
  49. 49. 49 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exploring all possible states and actions is not scalable by itself, let alone iteratively. E.g. for a video game, it would require generating all possible pixel configurations and actions… as many times as the iterations needed to estimate Q*(s,a). The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
  50. 50. 50 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exploring all possible states and actions is not scalable. E.g. for a video game, it would require generating all possible pixel configurations and actions. Solution: use a deep neural network as a function approximator of Q*(s,a): Q(s,a,Ө) ≈ Q*(s,a), where Ө are the neural network parameters.
  51. 51. 51 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  52. 52. 52 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a)
  53. 53. 53 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a). Forward pass loss function: sample a (s,a) pair and a future state s’; compare the Q-value predicted with Өi against the target computed by predicting the Q-value of s’ with the previous parameters Өi-1.
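
Written out (a reconstruction following Mnih et al. 2013, with θ standing for the slides’ Ө):

y_{i} = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]
\qquad
L_{i}(\theta_{i}) = \mathbb{E}_{s,a}\!\left[ \left( y_{i} - Q(s, a; \theta_{i}) \right)^{2} \right]
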
  54. 54. 54 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Train the DNN to approximate a Q-value function that satisfies the Bellman equation
  55. 55. 55 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Must compute reward during training
  56. 56. 56 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Forward pass: compute the loss function. Backward pass: gradient update of the loss with respect to the Q-function parameters Ө.
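
As an illustrative sketch only (PyTorch is an assumption here; the frameworks section of the lecture uses Keras/TensorFlow), one forward/backward training step could look as follows, with q_net holding the current parameters Өi and target_net the older parameters Өi-1; the batch tensors are placeholders sampled from experience:

import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    # states, next_states: (B, state_dim); actions: (B,) long; rewards, dones: (B,) float (dones is a 0/1 mask)
    states, actions, rewards, next_states, dones = batch

    # Forward pass: Q(s, a; theta_i) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y = r + gamma * max_a' Q(s', a'; theta_{i-1}); no gradient flows through the target
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)

    loss = F.mse_loss(q_sa, y)

    # Backward pass: gradient of the loss w.r.t. the Q-function parameters theta_i
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
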
  57. 57. 57 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(s,a,Ө) ≈ Q*(s,a)
  58. 58. 58 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(s,a,Ө) ≈ Q*(s,a). Efficiency: a single feedforward pass computes the Q-values for all actions from the current state.
  59. 59. 59 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533. Number of actions between 4 and 18, depending on the Atari game.
  60. 60. 60 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(st , ⬅), Q(st , ➡), Q(st , ⬆), Q(st ,⬇ )
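
A sketch of that network shape (again in PyTorch, which is an assumption), following the architecture reported in Mnih et al. (2015): a stack of 4 preprocessed 84x84 frames goes in, and one Q-value per action comes out of a single forward pass:

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Nature-DQN style network: conv feature extractor + fully connected head with |A| outputs."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # Q(s_t, a) for every action a
        )

    def forward(self, x):                        # x: (batch, 4, 84, 84)
        return self.head(self.features(x))

q_values = DQN(n_actions=4)(torch.zeros(1, 4, 84, 84))
print(q_values.shape)                            # torch.Size([1, 4]): Q(st, ⬅), Q(st, ➡), Q(st, ⬆), Q(st, ⬇)
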
  61. 61. 61 Deep Q-learning: Demo Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”
  62. 62. 62 Deep Q-learning: Demo
  63. 63. 63 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  64. 64. 64 RL Frameworks OpenAI Gym + keras-rl. keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently on either CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box. Slide credit: Míriam Bellver
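
A condensed version of the CartPole example from the keras-rl repository; module paths and argument names may differ across keras-rl and Keras versions, so treat it as a sketch rather than a definitive recipe:

import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make("CartPole-v0")
nb_actions = env.action_space.n

# Small fully connected Q-network: state in, one Q-value per action out
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(16, activation="relu"),
    Dense(16, activation="relu"),
    Dense(nb_actions, activation="linear"),
])

dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=SequentialMemory(limit=50000, window_length=1),
               nb_steps_warmup=100, target_model_update=1e-2,
               policy=EpsGreedyQPolicy())
dqn.compile(Adam(lr=1e-3), metrics=["mae"])

dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)   # train on the Gym environment
dqn.test(env, nb_episodes=5, visualize=False)              # evaluate the learned policy
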
  65. 65. 65 RL Frameworks OpenAI Universe environment
  66. 66. 66 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  67. 67. 67 Learn more Deep Learning TV, “Reinforcement learning - Ep. 30” Siraj Raval, Deep Q Learning for Video Games
  68. 68. 68 Learn more David Silver, UCL COMP050, Reinforcement Learning
  69. 69. 69 Learn more Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley. Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
  70. 70. 70 Learn more Nando de Freitas, “Machine Learning” (University of Oxford)
  71. 71. 71 Homework Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489
  72. 72. 72 Homework (exam material !!) Greg Kohs, “AlphaGo” (2017). [@ Netflix]
  73. 73. 73 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  74. 74. 74 Final Questions
