Abstract Summary:

Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is the key to enabling fully autonomous driving in the future. Reinforcement learning has been considered a strong AI paradigm which can be used to teach machines through interaction with the environment and by learning from their mistakes. In this talk, we will discuss how to apply a deep reinforcement learning technique to train a self-driving car in an open-source racing car simulator called TORCS. I am going to share how this is implemented and will discuss various challenges in this project.

- 1. Deep Reinforcement Learning: using deep learning to play self-driving car games. Ben Lau. Ben Lau - Deep Learning and Reinforcement, MLConf 2017, New York City
- 2. What is Reinforcement Learning? 3 classes of learning:
  - Supervised Learning: labeled data; direct feedback
  - Unsupervised Learning: no labeled data; no feedback; find hidden structure
  - Reinforcement Learning: uses reward as feedback; learns a series of actions; trial and error
- 3. RL: Agent and Environment. [Diagram: the Agent sends action $A_t$ to the Environment; the Environment returns observation $O_t$ and reward $R_t$.]
  - At each step $t$ the Agent: receives observation $O_t$; executes action $A_t$; receives reward $R_t$
  - The Environment: receives action $A_t$; sends observation $O_{t+1}$; sends reward $R_{t+1}$
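The interaction loop on slide 3 can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, `RandomAgent`, and `run_episode` are hypothetical names invented here, and the coin-guessing task has nothing to do with TORCS. It only shows the order of the observation, action, and reward exchange.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a hidden coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.coin = self.rng.randint(0, 1)

    def reset(self):
        # The first observation carries no information about the coin.
        return 0

    def step(self, action):
        # The environment receives the action, then sends back
        # the next observation and the reward.
        reward = 1.0 if action == self.coin else 0.0
        observation = self.coin  # reveal the coin that was just guessed
        self.coin = self.rng.randint(0, 1)  # flip a new coin for the next step
        return observation, reward


class RandomAgent:
    """Hypothetical agent that ignores the observation and guesses at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.randint(0, 1)


def run_episode(env, agent, steps=10):
    """Run the agent-environment loop and record (observation, action, reward)."""
    observation = env.reset()
    history = []
    for _ in range(steps):
        action = agent.act(observation)              # agent executes an action
        next_observation, reward = env.step(action)  # environment responds
        history.append((observation, action, reward))
        observation = next_observation
    return history
```

A learning agent would replace `RandomAgent.act` with something that uses the recorded rewards; the loop itself stays the same.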
- 4. RL: State. Experience is a sequence of observations, actions, rewards: $o_1, a_1, r_1, \dots, o_{t-1}, a_{t-1}, r_{t-1}, o_t, a_t, r_t$. The state is a summary of experience: $s_t = f(o_1, a_1, r_1, \dots, o_{t-1}, a_{t-1}, r_{t-1}, o_t, a_t, r_t)$. Note: not all states are fully observable. [Images: Fully Observable; Not Fully Observable]
- 5. Approach to Reinforcement Learning
  - Value-Based RL: estimate the optimal value function $Q^{*}(s, a)$; this is the maximum value achievable under any policy
  - Policy-Based RL: search directly for the optimal policy $\pi^{*}$; this is the policy achieving maximum future reward
  - Model-Based RL: build a model of the environment; plan (e.g. by lookahead) using the model
- 6. Deep Learning + RL → AI. [Diagram: game input feeds a deep convolutional network whose outputs are Steer, Gas Pedal, and Brake; the reward is fed back to the network.]
- 7. Policies
  - A deterministic policy is the agent's behavior. It is a map from state to action: $a_t = \pi(s_t)$
  - In reinforcement learning, the agent's goal is to choose each action such that it maximizes the sum of future rewards
  - Choose $a_t$ to maximize $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
  - $\gamma \in [0, 1]$ is a discount factor, as the reward is less certain when it is further away

  | State (s) | Action (a) |
  |---|---|
  | Obstacle | Brake |
  | Corner | Left/Right |
  | Straight line | Acceleration |
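As a small worked example of the discounted sum on slide 7, the sketch below computes $R_t$ from a finite list of future rewards. The function name `discounted_return` and the sample rewards are assumptions made for illustration; the slide's sum is infinite, whereas this version simply stops when the list ends.

```python
def discounted_return(rewards, gamma):
    """Return r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...

    `rewards` is the finite list [r_{t+1}, r_{t+2}, ...].
    """
    total = 0.0
    # Work backwards so each step applies R = r + gamma * R_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With `gamma = 0` only the first reward counts; with `gamma = 1` all rewards count equally.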
- 8. Approach to Reinforcement Learning
  - Value-Based RL: estimate the optimal value function $Q^{*}(s, a)$; this is the maximum value achievable under any policy
- 9. Value Function
  - A value function is a prediction of future reward: how much reward will I get from action $a$ in state $s$?
  - A Q-value function gives the expected total reward from state-action pair $(s, a)$, under policy $\pi$, with discount factor $\gamma$: $Q^{\pi}(s, a) = E[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a \,]$
  - An optimal value function is the maximum achievable value: $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$
  - Once we have $Q^{*}$ we can act optimally: $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$
- 10. Understanding Q Function
  - The best way to understand the Q-function is to think of it as a "strategy guide"
  - Suppose you are playing a difficult game (DOOM). If you have a strategy guide, it's pretty easy: just follow the guide
  - Suppose you are in state $s$ and need to make a decision. If you have this Q-function (strategy guide), then it is easy: just pick the action with the highest Q
  - [Image: Doom Strategy Guide]
- 11. How to find Q-function
  - Discounted future reward: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$, which can be written as $R_t = r_t + \gamma R_{t+1}$
  - Recall the definition of the Q-function (the maximum reward if we choose action $a$ in state $s$): $Q(s_t, a_t) = \max R_{t+1}$
  - Therefore, we can rewrite the Q-function as: $Q(s, a) = r + \gamma \times \max_{a'} Q(s', a')$
  - In plain English: the maximum future reward for $(s, a)$ is the immediate reward $r$ plus the maximum future reward in the next state $s'$ over actions $a'$
  - It can be solved by dynamic programming or by an iterative method
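The iterative solution mentioned on slide 11 can be illustrated on a tiny deterministic problem. Everything in this sketch is hypothetical and chosen only to keep the arithmetic easy to follow: a three-state chain (states 0, 1, and a terminal state 2), two actions, a reward of 1 for reaching the terminal state, and $\gamma = 0.9$. It repeatedly applies $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$ until the table stops changing.

```python
GAMMA = 0.9
TERMINAL = 2
ACTIONS = ("left", "right")

# Hypothetical deterministic dynamics: (state, action) -> (next_state, reward)
TRANSITIONS = {
    (0, "left"): (0, 0.0),
    (0, "right"): (1, 0.0),
    (1, "left"): (0, 0.0),
    (1, "right"): (2, 1.0),  # reaching the terminal state pays 1
}


def q_iteration(sweeps=100):
    """Repeatedly apply Q(s, a) = r + gamma * max_a' Q(s', a')."""
    q = {key: 0.0 for key in TRANSITIONS}
    for _ in range(sweeps):
        for (state, action), (next_state, reward) in TRANSITIONS.items():
            if next_state == TERMINAL:
                best_next = 0.0  # no future reward after the episode ends
            else:
                best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] = reward + GAMMA * best_next
    return q


q = q_iteration()
for key in sorted(q):
    print(key, round(q[key], 3))
```

In the resulting table, moving right has the higher value in both non-terminal states, so picking the action with the highest Q leads straight to the reward.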
- 12. Deep Q-Network (DQN)
  - The action-value function (Q-function) is often very big
  - DQN idea: use a neural network to compress this Q-table, using the weights $w$ of the network: $Q(s, a) \approx Q(s, a, w)$
  - Training becomes finding a set of optimal weights $w$ instead
  - In the literature this is often called "non-linear function approximation"

  | State | Action | Value |
  |---|---|---|
  | A | 1 | 140.11 |
  | A | 2 | 139.22 |
  | B | 1 | 145.89 |
  | B | 2 | 140.23 |
  | C | 1 | 123.67 |
  | C | 2 | 135.27 |
  | … | … | … |
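To make the Q-table on slide 12 concrete, the sketch below stores the six values shown on the slide in a dictionary and picks the action with the highest Q for each state. The names `Q_TABLE` and `greedy_action` are illustrative assumptions. A DQN would replace the dictionary lookup with a network evaluation $Q(s, a, w)$, which this sketch does not attempt.

```python
# Q-values copied from the table on the slide: (state, action) -> value
Q_TABLE = {
    ("A", 1): 140.11, ("A", 2): 139.22,
    ("B", 1): 145.89, ("B", 2): 140.23,
    ("C", 1): 123.67, ("C", 2): 135.27,
}


def greedy_action(q_table, state):
    """Return the action with the highest Q-value in the given state."""
    candidates = {a: v for (s, a), v in q_table.items() if s == state}
    return max(candidates, key=candidates.get)


for state in ("A", "B", "C"):
    print(state, greedy_action(Q_TABLE, state))
```

The table grows with the number of states times the number of actions, which is why the slide proposes compressing it into network weights.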
- 13. DQN Demo: using a Deep Q-Network to play Doom
- 14. Approach to Reinforcement Learning
  - Policy-Based RL: search directly for the optimal policy $\pi^{*}$; this is the policy achieving maximum future reward
- 15. Deep Policy Network
  - Review: a policy is the agent's behavior. It is a map from state to action: $a_t = \pi(s_t)$
  - We can search for the policy directly
  - Let's parameterize the policy by some model parameters $\theta$: $a = \pi(s, \theta)$
  - We call it policy-based reinforcement learning because we adjust the model parameters $\theta$ directly
  - The goal is to maximize the total discounted reward from the beginning: maximize total $R = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$
- 16. Policy Gradient
  - How to make good actions more likely?
  - Define the objective function as the total discounted reward: $L(\theta) = E[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid \pi_{\theta}(s, a) \,]$ or $L(\theta) = E[\, R \mid \pi_{\theta}(s, a) \,]$, where the expectation of the total reward $R$ is calculated under some probability distribution $p(x \mid \theta)$ parameterized by $\theta$
  - The goal becomes maximizing the total reward by computing the gradient $\frac{\partial L(\theta)}{\partial \theta}$
- 17. Policy Gradient (II)
  - Recall: the Q-function is the maximum discounted future reward in state $s$, action $a$: $Q(s_t, a_t) = \max R_{t+1}$
  - In the continuous case we can write it as $Q(s_t, a_t) = R_{t+1}$
  - Therefore, we can compute the gradient as $\frac{\partial L(\theta)}{\partial \theta} = E_{p(x \mid \theta)}\left[\frac{\partial Q}{\partial \theta}\right]$
  - Using the chain rule, we can rewrite it as $\frac{\partial L(\theta)}{\partial \theta} = E_{p(x \mid \theta)}\left[\frac{\partial Q^{\theta}(s, a)}{\partial a} \frac{\partial a}{\partial \theta}\right]$
  - No dynamics model required! (1) It only requires that $Q$ is differentiable w.r.t. $a$; (2) as long as $a$ can be parameterized as a function of $\theta$
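The chain rule on slide 17 can be tried out on a one-parameter toy problem. Every specific here is an assumption made for illustration and is not taken from the talk: the critic is a fixed hand-written function $Q(s, a) = -(a - 2s)^2$ (so the best action is $a = 2s$), the policy is linear, $a = \theta s$, and the expectation is replaced by an average over three sample states. In a real actor-critic setup both $Q$ and the policy would be learned networks.

```python
def dq_da(state, action):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (action - 2.0 * state)


def da_dtheta(state, theta):
    """da/dtheta for the hypothetical linear policy a = theta * s."""
    return state


def policy_gradient(theta, states):
    """Average of dQ/da * da/dtheta over the sample states (chain rule)."""
    grads = []
    for s in states:
        a = theta * s  # action chosen by the current policy
        grads.append(dq_da(s, a) * da_dtheta(s, theta))
    return sum(grads) / len(grads)


def train(theta=0.0, states=(0.5, 1.0, 1.5), learning_rate=0.1, steps=200):
    """Gradient ascent on theta; no model of the environment is used."""
    for _ in range(steps):
        theta += learning_rate * policy_gradient(theta, states)
    return theta


print(train())  # approaches 2.0, the slope of the best action a = 2s
```

The update only ever touches $\partial Q / \partial a$ and $\partial a / \partial \theta$, which is the point the slide makes about not needing a dynamics model.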
- 18. The power of Policy Gradient
  - Because the policy gradient does not require a dynamics model, no prior domain knowledge is required
  - AlphaGo doesn't have any domain knowledge pre-programmed
  - It keeps playing many games (via self-play) and adjusts the policy parameters $\theta$ to maximize the reward (winning probability)
- 19. Intuition: Value vs Policy RL
  - Value-based RL is similar to a driving instructor: a score is given for every action taken by the student
  - Policy-based RL is similar to a driver: it is the actual policy for how to drive a car
- 20. The car racing game TORCS
  - TORCS is a state-of-the-art open source simulator written in C++
  - Main features: sophisticated dynamics; provided with several tracks and controllers
  - Sensors: rangefinder; speed; position on track; rotation speed of wheels; RPM; angle with track
  - Quite realistic to self-driving cars… [Image: Track sensors]
- 21. Deep Learning Recipe. [Diagram: game input state $s$ (rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with track) feeds a deep neural network whose outputs are Steer, Gas Pedal, and Brake; the reward is fed back to the network.] Compute the optimal policy $\pi$ via policy gradient
- 22. Design of the reward function
  - Obvious choice: the highest velocity of the car, $R = V_{car} \cos\theta$
  - However, experience showed that learning was not very stable
  - Use a modified reward function: $R = V_x \cos\theta - V_x \sin\theta - V_x |\text{track pos}|$
  - This encourages the car to stay in the center of the track
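The modified reward on slide 22 is simple enough to write out directly. This is a sketch of the formula as it appears on the slide, not the author's actual code: the function name `reward`, the parameter names, and the units (angle in radians, `track_pos` as a normalized offset where 0 is the center of the track) are assumptions.

```python
import math


def reward(speed_x, angle, track_pos):
    """R = V_x cos(theta) - V_x sin(theta) - V_x |track pos|.

    speed_x:   car speed V_x (assumed units: any consistent speed unit)
    angle:     angle theta between car and track axis (assumed radians)
    track_pos: offset from the track center (assumed 0 = center)
    """
    return (speed_x * math.cos(angle)
            - speed_x * math.sin(angle)
            - speed_x * abs(track_pos))


print(reward(10.0, 0.0, 0.0))  # aligned and centered: full speed counts
print(reward(10.0, 0.0, 0.5))  # aligned but off-center: penalized
print(reward(10.0, 0.3, 0.0))  # centered but angled: penalized
```

Note that with the formula exactly as written, the sign of the second term follows the sign of $\theta$, so only positive angles are penalized by it.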
- 23. Source code available here: Google "DDPG Keras"
- 24. Training Set: Aalborg Track
- 25. Validation Set: Alpine Tracks. Recall basic machine learning: make sure you test the model on the validation set, not the training set
- 26. Learning how to brake. Since we try to maximize the velocity of the car, the AI agent doesn't want to hit the brake at all (as it goes against the reward function). Using the Stochastic Brake idea
- 27. Final Demo: car does not stay in the center of the track
- 28. Future Application: self-driving cars
- 29. Future Application
- 30. Thank you! Twitter: @yanpanlau
- 31. Appendix
- 32. How to find Q-function (II)
  - $Q(s, a) = r + \gamma \times \max_{a'} Q(s', a')$
  - We could use an iterative method to solve for the Q-function, given a transition (s, a, …
  - We want $r + \gamma \times \max_{a'} Q(s', a')$ to be the same as $Q(s, a)$
  - Treating the search for the Q-function as a regression task, we can define a loss function
  - Loss function $= \frac{1}{2}\left[\, r + \gamma \times \max_{a'} Q(s', a') - Q(s, a) \,\right]^2$, where $r + \gamma \times \max_{a'} Q(s', a')$ is the target and $Q(s, a)$ is the prediction
  - Q is optimal when the loss function is at its minimum
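The loss on slide 32 can be evaluated by hand for a single transition. In this sketch the Q-values are stored in a plain dictionary, and the state names, action names, and numbers are hypothetical; in a DQN the two lookups would be network predictions and the loss would be minimized with respect to the weights.

```python
def td_loss(q, state, action, reward, next_state, actions, gamma):
    """0.5 * (target - prediction)^2 for one transition."""
    # Target: immediate reward plus discounted best value in the next state.
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    # Prediction: the current estimate for the state-action pair taken.
    prediction = q[(state, action)]
    return 0.5 * (target - prediction) ** 2


# Hypothetical Q estimates for illustration only.
q = {
    ("s0", "go"): 1.0,
    ("s1", "go"): 2.0,
    ("s1", "stop"): 4.0,
}

# target = 1.0 + 0.5 * max(2.0, 4.0) = 3.0, prediction = 1.0
# loss = 0.5 * (3.0 - 1.0)^2 = 2.0
print(td_loss(q, "s0", "go", 1.0, "s1", ("go", "stop"), 0.5))
```

The loss is zero exactly when the prediction already equals the target, which is the condition the slide describes for an optimal Q.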
