
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017


Ben Lau is a quantitative researcher at a macro hedge fund in Hong Kong, where he looks to apply mathematical models and signal processing techniques to study the financial market. Prior to joining the financial industry, he specialized in using his mathematical modelling skills to discover the mysteries of the universe whilst working at the Stanford Linear Accelerator Centre, a national accelerator laboratory, where he studied the asymmetry between matter and antimatter by analysing tens of billions of collision events created by the particle accelerators. Ben was awarded his Ph.D. in Particle Physics from Princeton University and received his undergraduate degree (with First Class Honours) from the Chinese University of Hong Kong.

Abstract Summary:

Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is key to enabling fully autonomous driving in the future. Reinforcement learning has been considered a strong AI paradigm which can be used to teach machines through interaction with the environment and learning from their mistakes. In this talk, we will discuss how to apply deep reinforcement learning techniques to train a self-driving car in TORCS, an open-source racing car simulator. I will share how this is implemented and discuss various challenges in this project.



  1. Deep Reinforcement Learning: using deep learning to play self-driving car games. Ben Lau, MLConf 2017, New York City.
  2. What is Reinforcement Learning? Three classes of learning:
     • Supervised Learning: labeled data, direct feedback
     • Unsupervised Learning: no labels, no feedback, "find hidden structure"
     • Reinforcement Learning: reward as feedback, learn a series of actions, trial and error
  3. RL: Agent and Environment. At each step t:
     • the Agent receives observation O_t, executes action A_t, and receives reward R_t
     • the Environment receives action A_t, then sends observation O_{t+1} and reward R_{t+1}
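A minimal sketch of this interaction loop in Python, using a gym-style interface; `env` and `agent` here are hypothetical placeholder objects, not the TORCS setup from the talk:

```python
# Generic agent-environment loop (gym-style interface; env and agent are placeholders).
def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                      # initial observation O_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)            # agent chooses A_t from O_t
        observation, reward, done, info = env.step(action)  # env returns O_{t+1}, R_{t+1}
        agent.observe(reward, observation, done)   # agent receives the feedback
        total_reward += reward
        if done:
            break
    return total_reward
```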
  4. RL: State. Experience is a sequence of observations, actions, and rewards: o_1, r_1, a_1, ..., o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t. The state is a summary of experience: s_t = f(o_1, r_1, a_1, ..., o_t, r_t, a_t). Note: not every state is fully observable.
  5. Approaches to Reinforcement Learning:
     • Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy
     • Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward
     • Model-Based RL: build a model of the environment and plan (e.g. by lookahead) using the model
  6. Deep Learning + RL → AI. Game input → deep convolutional network → actions (steer, gas pedal, brake), with reward as feedback.
  7. Policies. A deterministic policy is the agent's behavior: a map from state to action, a_t = π(s_t). In reinforcement learning, the agent's goal is to choose each action so as to maximize the sum of future rewards: choose a_t to maximize R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... The discount factor γ ∈ [0, 1] reflects that rewards are less certain the further away they are. Example mapping from state s to action a: obstacle → brake; corner → steer left/right; straight line → accelerate.
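As a concrete illustration of the discounted sum above, here is a small helper (not from the slides) that computes R_t from a list of future rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    where rewards = [r_{t+1}, r_{t+2}, ...]."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example: with gamma = 0.9, three rewards of 1.0 give 1.0 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```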
  8. Approaches to Reinforcement Learning. Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.
  9. Value Function. A value function is a prediction of future reward: how much reward will I get from action a in state s? A Q-value function gives the expected total reward from state-action pair (s, a), under policy π, with discount factor γ: Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s, a]. The optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a). Once we have Q*, we can act optimally: π*(s) = argmax_a Q*(s, a).
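Acting optimally once Q* is known is just an argmax over the available actions; a minimal sketch with a tabular Q stored as a dict (the states and values are made up for illustration):

```python
# Greedy policy pi*(s) = argmax_a Q*(s, a) over a small tabular Q-function.
def greedy_action(Q, state, actions):
    """Q maps (state, action) pairs to values; actions is the list of candidate actions."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("corner", "brake"): 1.2, ("corner", "accelerate"): -0.5}
print(greedy_action(Q, "corner", ["brake", "accelerate"]))  # -> "brake"
```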
  10. Understanding the Q-function. The best way to understand the Q-function is to think of it as a "strategy guide". Suppose you are playing a difficult game (Doom): if you have a strategy guide, it's pretty easy, just follow the guide. Likewise, if you are in state s and need to make a decision, and you have the Q-function (the strategy guide), it is easy: just pick the action with the highest Q-value.
  11. How to find the Q-function. Discounted future reward: R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^(n-t) r_n, which can be written as R_t = r_t + γ R_{t+1}. Recall the definition of the Q-function (the maximum reward achievable after choosing action a in state s): Q(s_t, a_t) = max R_{t+1}. Therefore, we can rewrite the Q-function as Q(s, a) = r + γ max_{a'} Q(s', a'). In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward in the next state s', over the next action a'. It can be solved by dynamic programming or by an iterative method, as sketched below.
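One standard iterative scheme built on this recursion is tabular Q-learning, sketched here as a single update step (a textbook formulation, not the DQN approach used later in the talk):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) towards the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)  # tabular Q-function, initialized to zero
q_learning_update(Q, s="corner", a="brake", r=1.0, s_next="straight",
                  actions=["brake", "accelerate"])
```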
  12. Deep Q-Network (DQN). The action-value function (Q-table) is often very big. DQN idea: use a neural network to compress this Q-table, with the network weights w as parameters: Q(s, a) ≈ Q(s, a, w). Training then becomes finding an optimal set of weights w instead. In the literature this is often called "non-linear function approximation". Example Q-table being approximated:
     State  Action  Value
     A      1       140.11
     A      2       139.22
     B      1       145.89
     B      2       140.23
     C      1       123.67
     C      2       135.27
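A minimal Keras sketch of such a function approximator: the state goes in, and the network outputs one Q-value per discrete action. The layer sizes and the state/action dimensions are illustrative assumptions, not taken from the talk:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

STATE_DIM = 4     # size of the state vector (assumed for illustration)
NUM_ACTIONS = 3   # number of discrete actions (assumed for illustration)

# Q(s, a, w): the network weights w stand in for the huge Q-table.
model = Sequential([
    Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    Dense(64, activation="relu"),
    Dense(NUM_ACTIONS, activation="linear"),  # one Q-value per action
])
model.compile(optimizer="adam", loss="mse")
```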
  13. DQN Demo: using a deep Q-network to play Doom.
  14. Approaches to Reinforcement Learning. Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.
  15. Deep Policy Network. Review: a policy is the agent's behavior, a map from state to action: a_t = π(s_t). We can search for the policy directly. Let's parameterize the policy by some model parameters θ: a = π(s, θ). This is called policy-based reinforcement learning because we adjust the model parameters θ directly. The goal is to maximize the total discounted reward from the beginning: maximize R = r_1 + γ r_2 + γ^2 r_3 + ...
  16. Policy Gradient. How do we make good actions more likely? Define the objective function as the total discounted reward: L(θ) = E[r_1 + γ r_2 + γ^2 r_3 + ... | π_θ(s, a)], i.e. L(θ) = E[R | π_θ(s, a)], where the expectation of the total reward R is taken under some probability distribution p(a|θ) parameterized by θ. The goal becomes maximizing the total reward by computing the gradient ∂L(θ)/∂θ.
  17. Policy Gradient (II). Recall: the Q-function is the maximum discounted future reward in state s after taking action a: Q(s_t, a_t) = max R_{t+1}. In the continuous case we can write Q(s_t, a_t) = R_{t+1}. Therefore, we can compute the gradient as ∂L(θ)/∂θ = E_{p(a|θ)}[∂Q/∂θ]. Using the chain rule, we can rewrite this as ∂L(θ)/∂θ = E_{p(a|θ)}[(∂Q_θ(s, a)/∂a)(∂a/∂θ)]. No dynamics model is required: it only requires that (1) Q is differentiable with respect to a, and (2) a can be parameterized as a function of θ.
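In code, this chain rule is what the actor update of a DDPG-style implementation computes: back-propagate ∂Q/∂a through the critic into the actor parameters θ. A hedged TensorFlow 2 sketch, assuming `actor` and `critic` are Keras models and the critic takes [state, action] as its inputs (this illustrates the gradient only; it is not the exact code from the DDPG Keras project referenced later):

```python
import tensorflow as tf

def actor_update(actor, critic, states, optimizer):
    """One deterministic policy-gradient step: increase Q(s, pi(s, theta)) w.r.t. theta."""
    with tf.GradientTape() as tape:
        actions = actor(states, training=True)              # a = pi(s, theta)
        q_values = critic([states, actions], training=False)
        loss = -tf.reduce_mean(q_values)                    # minimizing -Q ascends dQ/da * da/dtheta
    grads = tape.gradient(loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```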
  18. The power of Policy Gradient. Because the policy gradient does not require a model of the dynamics, no prior domain knowledge is required. AlphaGo does not pre-programme any domain knowledge: it keeps playing many games (via self-play) and adjusts the policy parameters θ to maximize the reward (the winning probability).
  19. Intuition: Value vs Policy RL. Value-based RL is like a driving instructor: a score is given for every action the student takes. Policy-based RL is like the driver: it is the actual policy of how to drive a car.
  20. The car racing game TORCS. TORCS is a state-of-the-art open source simulator written in C++. Main features: sophisticated dynamics; several tracks and controllers provided; sensors including track rangefinders, speed, position on track, rotation speed of the wheels, RPM, and angle with the track. Quite realistic for self-driving cars.
  21. Deep Learning Recipe. Game input → state s (rangefinders, speed, position on track, rotation speed of the wheels, RPM, angle with the track) → deep neural network → actions (steer, gas pedal, brake), with reward as feedback. Compute the optimal policy π via policy gradient; a sketch of such a network follows below.
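A sketch of what such a policy network could look like in Keras: the sensor vector goes in, three continuous controls come out. The 29-dimensional sensor vector and the hidden-layer sizes are assumptions chosen to match common TORCS setups, not necessarily the exact architecture used in the talk:

```python
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

SENSOR_DIM = 29  # rangefinders, speed, track position, wheel speeds, RPM, angle (assumed size)

state_in = Input(shape=(SENSOR_DIM,))
h = Dense(300, activation="relu")(state_in)
h = Dense(600, activation="relu")(h)
steer = Dense(1, activation="tanh")(h)     # steering in [-1, 1]
gas = Dense(1, activation="sigmoid")(h)    # gas pedal in [0, 1]
brake = Dense(1, activation="sigmoid")(h)  # brake in [0, 1]
actor = Model(inputs=state_in, outputs=Concatenate()([steer, gas, brake]))
```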
  22. Design of the reward function. Obvious choice: the velocity of the car, R = V_car cos θ. However, experience shows that learning is not very stable with this reward. Use a modified reward function instead: R = V_x cos θ - V_x sin θ - V_x |trackPos|, which encourages the car to stay in the center of the track.
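The modified reward is easy to compute from the sensor readings; a small sketch, assuming `speed_x` is the car's longitudinal speed, `angle` the angle between the car and the track axis, and `track_pos` the normalized distance from the track center (the variable names are illustrative; the formula follows the slide):

```python
import math

def torcs_reward(speed_x, angle, track_pos):
    """R = Vx*cos(theta) - Vx*sin(theta) - Vx*|trackPos|: reward forward progress,
    penalize sideways velocity and drifting away from the track center."""
    return (speed_x * math.cos(angle)
            - speed_x * math.sin(angle)
            - speed_x * abs(track_pos))
```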
  23. Source code available here: Google "DDPG Keras".
  24. Training set: Aalborg track.
  25. Validation set: Alpine tracks. Recall basic machine learning: make sure you test the model on the validation set, not the training set.
  26. Learning how to brake. Since we try to maximize the velocity of the car, the AI agent does not want to hit the brake at all (braking goes against the reward function). Use the stochastic brake idea, sketched below.
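The slide only names the "stochastic brake" idea. One plausible reading, offered here purely as an assumption rather than the confirmed recipe, is to let the brake command through only with some probability during exploration, so the agent still gathers braking experience without braking constantly:

```python
import random

def stochastic_brake(brake_command, apply_prob=0.1):
    """Keep the policy's brake output only with probability apply_prob during exploration.
    This is one possible interpretation of the 'stochastic brake' trick, not the confirmed method."""
    return brake_command if random.random() < apply_prob else 0.0
```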
  27. Final Demo: the car does not stay in the center of the track.
  28. Future Application: self-driving cars.
  29. Future Application.
  30. Thank you! Twitter: @yanpanlau
  31. Appendix
  32. How to find the Q-function (II). Q(s, a) = r + γ max_{a'} Q(s', a'). We can use an iterative method to solve for the Q-function, given a transition (s, a, r, s'): we want r + γ max_{a'} Q(s', a') (the target) to equal Q(s, a) (the prediction). Treating the search for the Q-function as a regression task, we can define a loss function: Loss = 1/2 [r + γ max_{a'} Q(s', a') - Q(s, a)]^2. Q is optimal when the loss function is at its minimum.
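A sketch of this loss for a single transition (s, a, r, s'), with the target treated as a constant, assuming a Keras Q-network that outputs one value per discrete action as in the DQN slide:

```python
import numpy as np

def q_loss(model, s, a, r, s_next, gamma=0.99):
    """Loss = 0.5 * (r + gamma * max_a' Q(s', a') - Q(s, a))^2 for one transition.
    s and s_next are 1-D state vectors; a is the index of the action taken."""
    target = r + gamma * np.max(model.predict(s_next[None, :], verbose=0))  # fixed target
    prediction = model.predict(s[None, :], verbose=0)[0, a]
    return 0.5 * (target - prediction) ** 2
```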
