
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.


- 1. Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya Technical University of Catalonia Reinforcement Learning: MDP & DQN Day 5 Lecture 2 #DLUPC http://bit.ly/dlai2018
- 2. 2 Acknowledgements Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
- 3. 3 Acknowledgements Víctor Campos victor.campos@bsc.es PhD Candidate Barcelona Supercomputing Center Míriam Bellver miriam.bellver@bsc.edu PhD Candidate Barcelona Supercomputing Center
- 4. 4 Video lecture Xavier Giró, DLAI 2017
- 5. 5 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 6. 6 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 7. Types of machine learning Yann LeCun’s Black Forest cake 7
- 8. 8 Types of machine learning We can categorize three types of learning procedures: 1. Supervised Learning: 𝐲 = ƒ(𝐱) 2. Unsupervised Learning: ƒ(𝐱) 3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱) 𝐳 Predict label y corresponding to observation x Estimate the distribution of observation x Predict action y based on observation x, to maximize a future reward z
- 9. 9 Types of machine learning We can categorize three types of learning procedures: 1. Supervised Learning: 𝐲 = ƒ(𝐱) 2. Unsupervised Learning: ƒ(𝐱) 3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱) 𝐳
- 10. 10 Rich Sutton, “Temporal-Difference Learning” University of Alberta (2017)
- 11. 11 Motivation What is Reinforcement Learning (RL) ? “a way of programming agents by reward and punishment without needing to specify how the task is to be achieved” [Kaelbling, Littman, & Moore, 96] Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of artificial intelligence research 4 (1996): 237-285.
- 12. 12 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 13. 13 Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
- 14. 14 Architecture Figure: UCL Course on RL by David Silver
- 15. 15 Architecture Figure: UCL Course on RL by David Silver Environment
- 16. 16 Architecture Figure: UCL Course on RL by David Silver Environment state (st )
- 17. 17 Architecture Figure: UCL Course on RL by David Silver Environment state (st )
- 18. 18 Architecture Figure: UCL Course on RL by David Silver Environment Agent state (st )
- 19. 19 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) state (st )
- 20. 20 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st )
- 21. 21 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st ) Reward is given to the agent delayed with respect to previous states and actions!
- 22. 22 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 )
- 23. 23 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) GOAL: Reach the highest score possible at the end of the game.
- 24. 24 Architecture Figure: UCL Course on RL by David Silver Environment Agent action (At ) reward (rt ) state (st+1 ) GOAL: Learn how to take actions to maximize cumulative reward
- 25. 25 Architecture Multiple problems can be formulated with an RL architecture. Cart-Pole Problem Objective: balance a pole on top of a movable cart Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
- 26. 26 Architecture Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied on the cart. Reward: +1 at each time step if the pole is upright
- 27. 27 Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page] Multiple problems can be formulated with an RL architecture. Robot Locomotion Objective: make the robot move forward
- 28. 28 Architecture Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page] Environment Agent action (At ) reward (rt ) state (st ) State: angle and position of the joints. Action: torques applied on the joints. Reward: +1 at each time step for upright + forward movement
- 29. 29 Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
- 30. 30 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 31. 31 Markov Decision Process (MDP) Markov Decision Processes provide a formalism for reinforcement learning problems. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Markov property: the current state completely characterises the state of the world.
- 32. 32 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ)
- 33. 33 Markov Decision Process (MDP) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Environment samples initial state s0 ~ p(s0 ) Agent selects action at Environment samples next state st+1 ~ P ( .| st , at ) Environment samples reward rt ~ R( .| st , at ) reward (rt ) state (st ) action (at )
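The MDP sampling loop on this slide can be sketched in a few lines of Python. Everything below is a hypothetical stand-in (a two-state toy environment and a fixed policy), just to show the shape of the state/action/reward interaction, not any environment from the lecture.

```python
# Minimal sketch of the MDP loop: the environment samples states and rewards,
# the agent selects actions. Toy dynamics: action 1 in state 0 earns reward 1.

def sample_initial_state():
    """Environment samples s0 ~ p(s0)."""
    return 0

def sample_transition(state, action):
    """Environment samples s' ~ P(.|s, a) and r ~ R(.|s, a) (deterministic here)."""
    next_state = (state + action) % 2
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    return next_state, reward

def policy(state):
    """A policy pi: S -> A. Here: always take action 1."""
    return 1

state = sample_initial_state()
total_reward = 0.0
for t in range(5):
    action = policy(state)                             # agent selects a_t
    state, reward = sample_transition(state, action)   # env samples s_{t+1}, r_t
    total_reward += reward
print(total_reward)
```

The states alternate 0 → 1 → 0 → …, so the reward is earned on every visit to state 0.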
- 34. 34 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. (S, A, R, P, γ) Agent selects action at policy π A policy π is a function S ➝ A that specifies which action to take in each state.
- 35. 35 MDP: Policy A Policy π is a function S ➝ A that specifies which action to take in each state. Agent selects action at policy π GOAL: Learn how to take actions to maximize reward Agent GOAL: Find policy π* that maximizes the cumulative discounted reward: MDP
- 36. 36 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Multiple problems can be formulated with an RL architecture. Grid World (a simple MDP) Objective: reach one of the terminal states (greyed out) in the least number of actions.
- 37. 37 MDP: Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) Each cell is a state: A negative “reward” (penalty) for each transition rt = r = -1
- 38. 38 MDP: Policy: Random Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Example: Actions resulting from applying a random policy on this Grid World problem.
- 39. 39 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exercise: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
- 40. 40 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Solution: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
- 41. 41 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, action...) ? GOAL: Find policy π* that maximizes the cumulative discounted reward: Environment samples initial state s0 ~ p(s0 ) Agent selects action at ~π (.|st ) Environment samples next state st+1 ~ P ( .| st , at ) Environment samples reward rt ~ R(. | st ,at ) reward (rt ) state (st ) action (at )
- 42. 42 MDP: Policy: Optimal Policy π* Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0 , transition probabilities, action) ? GOAL: Find policy π* that maximizes the cumulative discounted reward: The optimal policy π* will maximize the expected sum of rewards: initial state selected action at t sampled state for t+1expected cumulative discounted reward
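As a small worked example of the cumulative discounted reward being maximized, assuming a discount factor γ = 0.9 and a made-up reward sequence:

```python
# Discounted return G = sum_t gamma^t * r_t for a hypothetical reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]   # r_0, r_1, r_2 (illustrative values)
G = sum(gamma**t * r for t, r in enumerate(rewards))
# G = 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```

The discount factor weights near-term rewards more heavily than distant ones, which is what makes the infinite sum well defined.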
- 43. 43 MDP: Value Function Vπ (s) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we estimate how good state s is for a given policy π? With the value function at state s, Vπ (s): the expected cumulative reward from following policy π starting from state s.
- 44. 44 MDP: Q-Value Function Qπ (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we estimate how good a state-action pair (s,a) is for a given policy π? With the Q-value function at state s and action a, Qπ (s,a): the expected cumulative reward from taking action a in state s and then following policy π.
- 45. 45 MDP: Optimal Q-Value Function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The optimal Q-value function at state s and action a, Q* (s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair: choose the policy that maximizes the expected cumulative reward (From the previous page) Q-value function
- 46. 46 MDP: Optimal Q-Value Function Q* (s,a) Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q* (s,a) satisfies the following Bellman equation: Maximum expected cumulative reward for future pair (s’,a’) FUTURE REWARD (From the previous page) Optimal Q-value function reward for considered pair (s,a) Maximum expected cumulative reward for considered pair (s,a) Expectation across possible future states s’ (randomness) discount factor
- 47. 47 MDP: Bellman Equation Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q* (s,a) satisfies the following Bellman equation: The optimal policy π* corresponds to taking the best action in any state according to Q*. GOAL: Find policy π* that maximizes the cumulative discounted reward: select action a’ that maximizes expected cumulative reward
- 48. 48 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Value iteration algorithm: Estimate the Bellman equation with an iterative update. The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞. (From the previous page) Bellman Equation Updated Q-value function Current Q-value for future pair (s’,a’)
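The value-iteration update above can be sketched on a tiny toy problem. This is a hypothetical 1-D grid (states 0..3, state 3 terminal, reward -1 per step), not the exact Grid World from the slides:

```python
# Value iteration: Q_{i+1}(s,a) = r + gamma * max_{a'} Q_i(s',a')
# (deterministic transitions, so the expectation over s' is trivial).

GAMMA = 1.0
STATES = [0, 1, 2, 3]
TERMINAL = 3
ACTIONS = [-1, +1]          # move left / move right

def step(s, a):
    return max(0, min(3, s + a))

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for i in range(100):
    newQ = {}
    for s in STATES:
        for a in ACTIONS:
            if s == TERMINAL:
                newQ[(s, a)] = 0.0      # no reward after the terminal state
                continue
            s2 = step(s, a)
            best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
            newQ[(s, a)] = -1.0 + GAMMA * best_next
    Q = newQ

# The optimal policy takes the best action in every state according to Q*.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES if s != TERMINAL}
print(policy)
```

The iteration converges to Q*(s, a) = -(number of steps to the terminal state), and the greedy policy moves right everywhere, which is the least-actions behavior the Grid World slide asks for.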
- 49. 49 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exploring all possible states and actions is not scalable by itself, let alone iteratively. E.g., for a video game it would require generating all possible pixel configurations and actions… as many times as the iterations needed to estimate Q*(s,a). The iterative estimation Qi (s,a) will converge to the optimal Q*(s,a) as i ➝ ∞. Updated Q-value function Current Q-value for future pair (s’,a’)
- 50. 50 MDP: Solving the Optimal Policy Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Solution: use a deep neural network as a function approximator of Q*(s,a). Q(s,a,θ) ≈ Q*(s,a) Neural Network parameters Exploring all possible states and actions is not scalable. E.g., for a video game it would require generating all possible pixel configurations and actions.
- 51. 51 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 52. 52 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,θ) ≈ Q*(s,a)
- 53. 53 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,θ) ≈ Q*(s,a) Forward Pass Loss function: Sample a (s,a) pair Predicted Q-value with θi Sample a future state s’ Predict Q-value with θi-1
- 54. 54 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Train the DNN to approximate a Q-value function that satisfies the Bellman equation
- 55. 55 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Must compute reward during training
- 56. 56 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Backward Pass Gradient update (with respect to Q-function parameters θ): Forward Pass Loss function:
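The forward/backward passes on this slide can be sketched with a deliberately simple stand-in: one Q-learning update using a linear approximator Q(s, a; θ) = θ[a]·φ(s) instead of a deep network (the feature vectors and hyperparameters below are illustrative, not from the lecture):

```python
import numpy as np

# One Q-learning step on a linear Q-function.
# Forward pass: loss = (y - Q(s,a;theta))^2 with Bellman target
#   y = r + gamma * max_a' Q(s',a';theta), held fixed during the step.
# Backward pass: gradient step on theta[a].

rng = np.random.default_rng(0)
n_actions, n_features, gamma, lr = 2, 4, 0.99, 0.1
theta = rng.normal(size=(n_actions, n_features))

def q_values(phi):
    return theta @ phi                       # one Q-value per action

phi_s = rng.normal(size=n_features)
phi_s /= np.linalg.norm(phi_s)               # features of state s (normalized)
phi_s2 = rng.normal(size=n_features)         # features of sampled next state s'
a, r = 0, 1.0                                # sampled (action, reward)

y = r + gamma * np.max(q_values(phi_s2))     # Bellman target
td_error = y - q_values(phi_s)[a]            # forward pass
theta[a] += lr * td_error * phi_s            # backward pass: d(loss)/d(theta[a])
```

After the update, Q(s, a; θ) has moved toward the target y, so the TD error shrinks; DQN does the same thing with a convolutional network and stochastic gradients.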
- 57. 57 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(s,a,θ) ≈ Q*(s,a)
- 58. 58 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(s,a,θ) ≈ Q*(s,a) A single feedforward pass computes the Q-values for all actions from the current state (efficient).
- 59. 59 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533. Number of actions between 4-18, depending on the Atari game
- 60. 60 Deep Q-learning: Deep Q-Network DQN Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q(st , ⬅), Q(st , ➡), Q(st , ⬆), Q(st ,⬇ )
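The output structure above, where one forward pass yields Q(st, ⬅), Q(st, ➡), Q(st, ⬆), Q(st, ⬇) simultaneously, can be sketched with a tiny fully connected network. The layer sizes are illustrative; the Nature DQN uses a convolutional network over raw pixels:

```python
import numpy as np

# Sketch of the DQN output head: the network maps a state vector to a vector
# of Q-values, one entry per action (here 4, like the four arrow actions).

rng = np.random.default_rng(1)
state_dim, hidden, n_actions = 8, 16, 4
W1 = rng.normal(scale=0.1, size=(hidden, state_dim))
W2 = rng.normal(scale=0.1, size=(n_actions, hidden))

def q_network(state):
    h = np.maximum(0.0, W1 @ state)   # ReLU hidden layer
    return W2 @ h                     # [Q(s,left), Q(s,right), Q(s,up), Q(s,down)]

q = q_network(rng.normal(size=state_dim))
greedy_action = int(np.argmax(q))     # act greedily with respect to the Q-values
```

Predicting all actions at once is what makes both greedy action selection and the max over a' in the Bellman target a single forward pass.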
- 61. 61 Deep Q-learning: Demo Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”
- 62. 62 Deep Q-learning: Demo
- 63. 63 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 64. 64 RL Frameworks OpenAI Gym + keras-rl keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box. Slide credit: Míriam Bellver
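keras-rl and OpenAI Gym are built around one small interface: `reset()` returns an initial state and `step(action)` returns `(state, reward, done, info)`. A tiny hypothetical environment implementing that contract (not a real Gym env, just the shape of the API):

```python
# A minimal Gym-style environment: any object with this interface can be
# plugged into the standard agent-environment loop.

class TinyEnv:
    """Gym-style env: reset() -> state, step(a) -> (state, reward, done, info)."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        done = self.t >= 3          # episode ends after 3 steps
        return self.t, 1.0, done, {}

env = TinyEnv()
state, total = env.reset(), 0.0
done = False
while not done:
    state, reward, done, _ = env.step(0)   # a fixed dummy action
    total += reward
```

Real Gym environments (e.g. the Atari games or CartPole) follow exactly this loop, which is why keras-rl can train an agent on any of them without custom glue code.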
- 65. 65 RL Frameworks OpenAI Universe environment
- 66. 66 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 67. 67 Learn more Deep Learning TV, “Reinforcement learning - Ep. 30” Siraj Raval, Deep Q Learning for Video Games
- 68. 68 Learn more David Silver, UCL COMP050, Reinforcement Learning
- 69. 69 Learn more Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley. Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
- 70. 70 Learn more Nando de Freitas, “Machine Learning” (University of Oxford)
- 71. 71 Homework Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489
- 72. 72 Homework (exam material !!) Greg Kohs, “AlphaGo” (2017). [@ Netflix]
- 73. 73 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
- 74. 74 Final Questions
