
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intelligence)

https://telecombcn-dl.github.io/2017-dlai/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.



  1. 1. [course site] Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politecnica de Catalunya Technical University of Catalonia Reinforcement Learning Day 7 Lecture 2 #DLUPC
  2. 2. 2 Acknowledgments Bellver M, Giró-i-Nieto X, Marqués F, Torres J. Hierarchical Object Detection with Deep Reinforcement Learning. Deep Reinforcement Learning Workshop, NIPS 2016.
  3. 3. 3 Acknowledgments Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  4. 4. 4 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  5. 5. 5 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  6. 6. 6 Motivation What is Reinforcement Learning ? “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96] Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of artificial intelligence research 4 (1996): 237-285.
  7. 7. Yann LeCun’s Black Forest cake 7 Motivation
  8. 8. We can categorize three types of learning procedures: 1. Supervised Learning: y = ƒ(x) 2. Unsupervised Learning: ƒ(x) 3. Reinforcement Learning (RL): y = ƒ(x) 8 Predict label y corresponding to observation x. Estimate the distribution of observation x. Predict action y based on observation x, to maximize a future reward z. Motivation
  9. 9. We can categorize three types of learning procedures: 1. Supervised Learning: y = ƒ(x) 2. Unsupervised Learning: ƒ(x) 3. Reinforcement Learning (RL): y = ƒ(x) 9 Motivation
  10. 10. 10 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  11. 11. 11 Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
  12. 12. 12 Bernhard Schölkopf, “Learning to see and act”, Nature 2015. Motivation
  13. 13. 13 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  14. 14. 14Figure: UCL Course on RL by David Silver Architecture
  15. 15. 15Figure: UCL Course on RL by David Silver Environment Architecture
  16. 16. 16Figure: UCL Course on RL by David Silver Environment state (st ) Architecture
  17. 17. 17Figure: UCL Course on RL by David Silver Environment state (st ) Architecture
  18. 18. 18Figure: UCL Course on RL by David Silver Environment Agent state (st ) Architecture
  19. 19. 19Figure: UCL Course on RL by David Silver Environment Agent action (At )state (st ) Architecture
  20. 20. 20Figure: UCL Course on RL by David Silver Environment Agent action (At )state (st ) Architecture
  21. 21. 21Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st ) Architecture
  22. 22. 22 Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st ) Architecture The reward given to the agent is delayed with respect to previous states and actions!
  23. 23. 23 Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) Architecture
  24. 24. 24 Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) Architecture GOAL: Complete the game with the highest score.
  25. 25. 25 Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) Architecture GOAL: Learn how to take actions to maximize the cumulative reward.
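The agent-environment loop on these slides maps directly onto a few lines of code. A minimal sketch, assuming an OpenAI Gym-style environment and a random policy (the environment name and the episode loop are illustrative, not taken from the slides):

```python
import gym

env = gym.make("Breakout-v0")            # illustrative environment name
state = env.reset()                       # environment samples the initial state s_0
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()    # agent selects an action a_t (random policy here)
    state, reward, done, info = env.step(action)  # environment returns s_{t+1} and r_t
    total_reward += reward                # goal: maximize the accumulated reward
```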
  26. 26. 26 Other problems that can be formulated with an RL architecture. Cart-Pole Problem Objective: Balance a pole on top of a movable cart. Architecture Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  27. 27. 27 Architecture Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied to the cart. Reward: 1 at each time step if the pole is upright.
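In Gym terms, the cart-pole state and reward above look roughly like this. Note that Gym's CartPole-v0 discretizes the action into a left/right push rather than a continuous force; this snippet is an assumption, not part of the slides:

```python
import gym

env = gym.make("CartPole-v0")
print(env.observation_space)  # Box(4,): position, horizontal velocity, angle, angular speed
print(env.action_space)       # Discrete(2): push the cart left or right
# Reward: +1 at every time step while the pole stays upright.
```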
  28. 28. 28 Other problems that can be formulated with an RL architecture. Robot Locomotion Objective: Make the robot move forward. Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
  29. 29. 29 Architecture Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) State: angle and position of the joints. Action: torques applied on the joints. Reward: 1 at each time step for being upright + forward movement.
  30. 30. 30 Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
  31. 31. 31 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  32. 32. 32 Markov Decision Processes (MDP) Markov Decision Processes provide a formalism for reinforcement learning problems. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Markov property: Current state completely characterises the state of the world.
  33. 33. 33 Markov Decision Processes (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ)
  34. 34. 34 Markov Decision Processes (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Environment samples initial state s0 ~ p(s0 ). Agent selects action at . Environment samples next state st+1 ~ P(· | st , at ). Environment samples reward rt ~ R(· | st , at ). reward (rt ) state (st ) action (at )
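Written out, the MDP tuple and the sampling loop from this slide are (a compact restatement in standard notation):

```latex
% S: set of states, A: set of actions, R: reward distribution,
% P: transition probabilities, gamma: discount factor
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)

% Interaction at each time step t:
s_0 \sim p(s_0), \qquad
a_t = \pi(s_t), \qquad
r_t \sim \mathcal{R}(\,\cdot \mid s_t, a_t), \qquad
s_{t+1} \sim \mathbb{P}(\,\cdot \mid s_t, a_t)
```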
  35. 35. 35 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Agent selects action at following policy π. A policy π is a function S ➝ A that specifies which action to take in each state.
  36. 36. 36 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Agent selects action at following policy π. A policy π is a function S ➝ A that specifies which action to take in each state. Agent GOAL: Learn how to take actions to maximize reward. MDP GOAL: Find the policy π* that maximizes the cumulative discounted reward.
  37. 37. 37 Other problems that can be formulated with an RL architecture. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. MDP: Policy Grid World (a simple MDP) Objective: reach one of the terminal states (greyed out) in the least number of actions.
  38. 38. 38Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) Each cell is a state: A negative “reward” (penalty) for each transition rt = r = -1 MDP: Policy
  39. 39. 39Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. MDP: Policy Example: Actions resulting from applying a random policy on this Grid World problem.
  40. 40. 40Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exercise: Draw the actions resulting from applying an optimal policy in this Grid World problem. MDP: Optimal Policy π*
  41. 41. 41 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Solution: Actions resulting from applying an optimal policy in this Grid World problem. MDP: Optimal Policy π*
  42. 42. 42 MDP: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, action...) ? GOAL: Find policy π* that maximizes the cumulative discounted reward: Environment samples initial state s0 ~ p(s0 ) Agent selects action at ~π (.|st ) Environment samples next state st+1 ~ P ( .| st , at ) Environment samples reward rt ~ R(. | st ,at ) reward (rt ) state (st ) action (at )
  43. 43. 43 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, actions)? GOAL: Find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* will maximize the expected sum of rewards, where the expectation is taken over the initial state, the action selected at each step t, and the state sampled for t+1. MDP: Optimal Policy π*
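The objective referred to on the last two slides is, in standard form (a reconstruction of the formula that appears as an image in the original deck):

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} r_t \;\middle|\; \pi \right],
\qquad
s_0 \sim p(s_0),\;\; a_t \sim \pi(\,\cdot \mid s_t),\;\; s_{t+1} \sim \mathbb{P}(\,\cdot \mid s_t, a_t)
```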
  44. 44. 44 MDP: Policy: Value function Vπ (s) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good state s is for a given policy π? With the value function at state s, Vπ (s): the expected cumulative reward from following policy π from state s.
  45. 45. 45 MDP: Policy: Q-value function Qπ (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good a state-action pair (s,a) is for a given policy π? With the Q-value function at state s and action a, Qπ (s,a): the expected cumulative reward from taking action a in state s and then following policy π.
  46. 46. 46 MDP: Policy: Optimal Q-value function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The optimal Q-value function at state s and action a, Q* (s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair, i.e. the Q-value under the policy that maximizes the expected cumulative reward.
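The three definitions above correspond to the following standard formulas (reconstructed from the definitions; the equations themselves are images in the original slides):

```latex
% Value function: expected cumulative reward from following policy pi from state s
V^{\pi}(s) = \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} r_t \;\middle|\; s_0 = s, \pi \right]

% Q-value function: expected cumulative reward from taking action a in state s, then following pi
Q^{\pi}(s, a) = \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} r_t \;\middle|\; s_0 = s, a_0 = a, \pi \right]

% Optimal Q-value function: best achievable expected cumulative reward from the pair (s, a)
Q^{*}(s, a) = \max_{\pi} \; \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^{t} r_t \;\middle|\; s_0 = s, a_0 = a, \pi \right]
```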
  47. 47. 47 MDP: Policy: Bellman equation Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The optimal Q-value function Q* (s,a) satisfies the following Bellman equation: the maximum expected cumulative reward for the considered pair (s,a) is the reward for (s,a) plus the discounted (discount factor γ) maximum expected cumulative reward for the future pair (s’,a’), taking the expectation across the possible future states s’ (randomness).
  48. 48. 48 MDP: Policy: Bellman equation Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Since Q* (s,a) satisfies the Bellman equation, the optimal policy π* corresponds to taking the best action in any state according to Q*: select the action a’ that maximizes the expected cumulative reward. GOAL: Find the policy π* that maximizes the cumulative discounted reward.
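The Bellman equation described on these two slides, in standard notation (a reconstruction of the formula shown as an image):

```latex
Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]
\qquad \text{and} \qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```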
  49. 49. 49 MDP: Policy: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Value iteration algorithm: estimate the Bellman equation with an iterative update. At each iteration, the updated Q-value function is computed from the current Q-value of the future pair (s’,a’). The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
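The iterative update used by value iteration, in standard notation (reconstructed from the description; the formula is an image on the slide):

```latex
Q_{i+1}(s, a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q_{i}(s', a') \;\middle|\; s, a \right],
\qquad Q_i \to Q^{*} \text{ as } i \to \infty
```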
  50. 50. 50 MDP: Policy: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞. This iterative approach is not scalable because it requires computing Qi (s,a) for every state-action pair. E.g., if the state is the current game pixels, it is computationally infeasible to compute Qi (s,a) for the entire state space!
  51. 51. 51 MDP: Policy: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. This iterative approach is not scalable because it requires computing Q(s,a) for every state-action pair. E.g., if the state is the current game pixels, it is computationally infeasible to compute Q(s,a) for the entire state space! Solution: Use a deep neural network as a function approximator of Q*(s,a): Q(s,a,Ө) ≈ Q*(s,a), where Ө are the neural network parameters.
  52. 52. 52 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning ○ Forward and Backward passes ○ DQN ○ Experience Replay ○ Examples 5. RL Frameworks 6. Learn more ○ Coming next…
  53. 53. 53 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a)
  54. 54. 54 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a). Forward Pass. Loss function: sample a (s,a) pair and predict its Q-value with the current parameters Өi ; sample a future state s’ and predict its Q-value with the previous parameters Өi-1 to build the target.
  55. 55. 55 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Train the DNN to approximate a Q-value function that satisfies the Bellman equation
  56. 56. 56 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Must compute reward during training
  57. 57. 57 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Forward Pass: loss function. Backward Pass: gradient update of the loss with respect to the Q-function parameters Ө.
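The forward-pass loss and the backward-pass gradient sketched on these slides are the ones from Mnih et al. (2013), reconstructed here because the formulas appear as images in the deck:

```latex
% Forward pass: squared error between the target y_i (computed with the previous
% parameters theta_{i-1}) and the Q-value predicted with the current parameters theta_i
L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\!\left[ \big( y_i - Q(s, a; \theta_i) \big)^{2} \right],
\qquad
y_i = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \;\middle|\; s, a \right]

% Backward pass: gradient with respect to the Q-function parameters theta_i
\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\!\left[
\big( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \big)\,
\nabla_{\theta_i} Q(s, a; \theta_i) \right]
```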
  58. 58. 58 Source: Tambet Matiisen, Demystifying Deep Reinforcement Learning (Nervana) Deep Q-learning: Deep Q-Network DQN Q(s,a,Ө) ≈ Q*(s,a)
  59. 59. 59 Source: Tambet Matiisen, Demystifying Deep Reinforcement Learning (Nervana) Deep Q-learning: Deep Q-Network DQN Q(s,a,Ө) ≈ Q*(s,a) Efficiency: a single feedforward pass computes the Q-values for all actions from the current state.
  60. 60. 60 Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533. Deep Q-learning: Deep Q-Network DQN Number of actions between 4-18, depending on the Atari game
  61. 61. 61 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(st , ⬅), Q(st , ➡), Q(st , ⬆), Q(st ,⬇ )
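A minimal sketch of such a DQN in Keras. The layer sizes follow the Nature 2015 DQN architecture, but the preprocessing (4 stacked 84x84 grayscale frames) and the framework choice are assumptions, not taken from the slides:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

num_actions = 4  # 4-18 actions, depending on the Atari game

# Input: the last 4 preprocessed frames; output: one Q-value per action,
# so a single feedforward pass scores every action for the current state.
model = Sequential([
    Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=(84, 84, 4)),
    Conv2D(64, (4, 4), strides=2, activation='relu'),
    Conv2D(64, (3, 3), strides=1, activation='relu'),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(num_actions, activation='linear'),
])
```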
  62. 62. 62 Deep Q-learning: Experience Replay Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Learning from batches of consecutive samples is problematic: ● Samples are too correlated ➡ inefficient learning ● Q-network parameters determine the next training samples ➡ can lead to bad feedback loops.
  63. 63. 63 Deep Q-learning: Experience Replay Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Experience replay: ● Continually update a replay memory table of transitions (st , at , rt , st+1 ) as game (experience) episodes are played. ● Train a Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples.
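A minimal sketch of a replay memory in Python (class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions (s_t, a_t, r_t, s_{t+1}, done) and sample random minibatches."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded when full

    def store(self, state, action, reward, next_state, done):
        # Continually updated as game (experience) episodes are played.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random minibatches break the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)
```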
  64. 64. 64Andrej Karpathy, “ConvNetJS Deep Q Learning Demo” Deep Q-learning: Demo
  65. 65. 65 Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. "Hierarchical Object Detection with Deep Reinforcement Learning." Deep Reinforcement Learning Workshop NIPS 2016. Deep Q-learning: DQN: Computer Vision Method for performing hierarchical object detection in images guided by a deep reinforcement learning agent. OBJECT FOUND
  66. 66. 66 Deep Q-learning: DQN: Computer Vision State: The agent will decide which action to choose based on: ● visual description of the current observed region ● history vector that maps past actions performed
  67. 67. 67 Deep Q-learning: DQN: Computer Vision Reward: Reward for movement actions Reward for terminal action
  68. 68. 68 Deep Q-learning: DQN: Computer Vision Actions: Two kinds of actions: ● movement actions: to which of the 5 possible regions defined by the hierarchy to move ● terminal action: the agent indicates that the object has been found
  69. 69. 69 Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. "Hierarchical Object Detection with Deep Reinforcement Learning." Deep Reinforcement Learning Workshop NIPS 2016. Deep Q-learning: DQN: Computer Vision
  70. 70. 70 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  71. 71. 71 RL Frameworks OpenAI Gym + keras-rl. keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box. Slide credit: Míriam Bellver
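A short usage sketch, closely following the CartPole example from the keras-rl documentation (hyperparameters are illustrative and the exact API may differ between keras-rl versions):

```python
import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import EpsGreedyQPolicy

env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

# Small fully connected Q-network for the 4-dimensional CartPole state.
model = Sequential([
    Flatten(input_shape=(1,) + env.observation_space.shape),
    Dense(16, activation='relu'),
    Dense(nb_actions, activation='linear'),
])

dqn = DQNAgent(model=model, nb_actions=nb_actions,
               memory=SequentialMemory(limit=50000, window_length=1),
               policy=EpsGreedyQPolicy(),
               nb_steps_warmup=100, target_model_update=1e-2)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=10000, visualize=False, verbose=1)
```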
  72. 72. 72 OpenAI Universe environment RL Frameworks
  73. 73. 73 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  74. 74. 74 Deep Learning TV, “Reinforcement learning - Ep. 30” Siraj Raval, Deep Q Learning for Video Games Learn more
  75. 75. Emma Brunskill, Stanford CS234: Reinforcement Learning Learn more
  76. 76. David Silver, UCL COMP050, Reinforcement Learning Learn more
  77. 77. Nando de Freitas, “Machine Learning” (University of Oxford) Learn more
  78. 78. 78 Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley. Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017) Learn more
  79. 79. 79 Learn more Slide credit: Míriam Bellver Actor-Critic algorithm: the actor observes the state and performs an action; the critic assesses how good the action was (a ‘q-value’), and the gradients are used to train both the actor and the critic. Grondman, Ivo, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. "A survey of actor-critic reinforcement learning: Standard and natural policy gradients." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, no. 6 (2012): 1291-1307.
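One common form of the actor-critic updates hinted at above (a sketch in standard notation, not taken from the slides): the actor is a parametrized policy π_θ trained with the policy gradient, and the critic Q_w is trained on a temporal-difference error.

```latex
% Actor update: policy gradient weighted by the critic's value estimate
\nabla_{\theta} J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\; Q_{w}(s, a) \right]

% Critic update: temporal-difference error with learning rate alpha
\delta_t = r_t + \gamma\, Q_{w}(s_{t+1}, a_{t+1}) - Q_{w}(s_t, a_t),
\qquad w \leftarrow w + \alpha\, \delta_t\, \nabla_{w} Q_{w}(s_t, a_t)
```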
  80. 80. ● Evolution Strategies Learn more
  81. 81. 81 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning ○ Forward and Backward passes ○ DQN ○ Experience Replay ○ Examples 5. RL Frameworks 6. Learn more ○ Coming next…
  82. 82. Conclusions Reinforcement Learning ● There is no supervisor, only a reward signal ● Feedback is delayed, not instantaneous ● Time really matters (sequential, non-i.i.d. data) Slide credit: UCL Course on RL by David Silver
  83. 83. 83 Coming next... https://www.theguardian.com/technology/2014/jan/27/google-acquires-uk-artificial-intelligence-startup-deepmind
  84. 84. 84 Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489 Coming next...
  85. 85. 85 Greg Kohs, “AlphaGo” (2017) Coming next...
  86. 86. 86 Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J. and Quan, J., 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. [Press release] Coming next...
  88. 88. 88 Coming next... Edifici Vèrtex (Auditori)
