MOUNTAIN CAR PROBLEM USING TEMPORAL DIFFERENCE (TD) & VALUE ITERATION (VI) REINFORCEMENT LEARNING ALGORITHMS

Slide notes
  • 19: Figure shows the graph of RMS vs. episode at the 11th episode at the top, while the bottom one displays the car learning in the mountain.
  • 20: Figure shows the graph of RMS vs. episode at the 1000th episode.

Transcript

  • 1. MOUNTAIN CAR PROBLEM USING TEMPORAL DIFFERENCE (TD) & VALUE ITERATION (VI) REINFORCEMENT LEARNING ALGORITHMS By Muzammil Abdulrahman & Yusuf Garba Dambatta Mevlana University Konya, Turkey 2013
  • 2. INTRODUCTION  The aim of the mountain car problem is for the car to learn from two continuous variables, • position and • velocity,  so that it can reach the top of the mountain in a minimum number of steps.  The car starts from rest, and its engine power alone is not enough to drive it over the hill in front. 2
  • 3. INTRODUCTION CONT. To climb up the hill, the car would need to swing back and forth inside the valley 3
  • 4. INTRODUCTION CONT.  It does so by accelerating forward and backward in order to gather momentum.  The agent receives a negative reward at every time step until the goal is reached.  Because the agent has no information about the goal until an initial success, it uses reinforcement learning methods.  In this project, we employed the TD Q-learning and value iteration algorithms 4
  • 5. REINFORCEMENT LEARNING  Reinforcement learning is a distinct learning paradigm in the field of machine learning  in which only an estimate of the correctness of the answer is provided to the system  It deals with how an agent should take actions in an environment so as to maximize a cumulative reward  It is learning from interaction  and it is goal-oriented learning 5
  • 6. CHARACTERISTICS  No direct training examples – (delayed) rewards instead  Goal-oriented learning  Learning about, from, and while interacting with an external environment  Need to balance exploration of the environment with exploitation  The environment might be stochastic and/or unknown  The agent's actions affect future rewards 6
  • 7. EXAMPLES Robot moving in an environment
  • 8. EXAMPLES  Chess Master 8
  • 9. UNSUPERVISED LEARNING  Training info = evaluations (rewards / penalties)  Input → RL System → Output (actions)  Objective: get as much reward as possible 9
  • 10. SUPERVISED LEARNING  Training info = desired (target) outputs  Input → Supervised Learning System → Output  Training example = {input (state), target output}  Error = (target output − actual output) 10
  • 11. TEMPORAL DIFFERENCE(TD)  Temporal difference (TD) learning is a prediction method.  It has been mostly used for solving the reinforcement learning problem.  TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. 11
  • 12. TD Q-LEARNING  The TD Q-learning update rule is as follows:  Q(a, s) ← Q(a, s) + α [ R(s) + γ max_{a′} Q(a′, s′) − Q(a, s) ] 12
  • 13. TD Q-LEARNING ALGORITHM  Initialize Q values for all states ‘s’ and actions ‘a’  Obtain the current state  Select an action according to the current state  Implement the selected action and obtain an immediate reward and the next state  Update the Q function according to the update rule above  Update the system state  Stop the algorithm when the maximum number of iterations is reached 13
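As an illustration of the loop above, a minimal tabular Q-learning sketch in Python for the mountain car follows. It assumes a simple grid discretization of position and velocity and the standard mountain-car dynamics clipped to the slide's ranges; the grid sizes, learning rate, discount factor, step limit and fixed ε are illustrative choices, not values from the presentation.

    import numpy as np

    # Illustrative sketch of the TD Q-learning algorithm on slides 12-13.
    # Grid sizes, hyper-parameters and dynamics are assumptions.
    N_POS, N_VEL, N_ACT = 40, 40, 3          # discretization bins and 3 actions
    ALPHA, GAMMA, EPSILON = 0.5, 0.99, 0.01  # learning rate, discount, exploration

    Q = np.zeros((N_POS, N_VEL, N_ACT))      # step 1: initialize Q for all (s, a)

    def discretize(position, velocity):
        """Map the continuous (position, velocity) pair to grid indices."""
        p = int((position + 1.5) / (0.55 + 1.5) * (N_POS - 1))
        v = int((velocity + 0.07) / 0.14 * (N_VEL - 1))
        return p, v

    def step(position, velocity, action):
        """Mountain-car dynamics (standard form, clipped to the slide's ranges)."""
        accel = action - 1                                 # actions 0,1,2 -> -1,0,+1
        velocity = np.clip(velocity + 0.001 * accel
                           - 0.0025 * np.cos(3 * position), -0.07, 0.07)
        position = np.clip(position + velocity, -1.5, 0.55)
        reward = 0.0 if position >= 0.55 else -1.0         # -1 per step, 0 at the goal
        return position, velocity, reward

    for episode in range(1000):
        position, velocity = -0.5, 0.0                     # start at rest in the valley
        for t in range(10_000):
            s = discretize(position, velocity)             # step 2: current state
            if np.random.rand() < EPSILON:                 # step 3: e-greedy selection
                a = np.random.randint(N_ACT)
            else:
                a = int(np.argmax(Q[s]))
            position, velocity, r = step(position, velocity, a)  # step 4: act, observe
            s_next = discretize(position, velocity)
            # step 5: TD Q-learning update from slide 12
            Q[s][a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s][a])
            if r == 0.0:                                   # goal reached
                break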
  • 14. ε-GREEDY SELECTION (Q, S, EPSILON)  The agent randomly selects an action from the Q table based on the ε-greedy strategy.  Initially, epsilon = 0.01, which is the probability of selecting a random action.  It becomes approximately zero when the car agent has fully learned how to climb the front hill (no randomness, because it has learned the best action). 14
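A sketch of this ε-greedy selection in Python is shown below; the Q-table layout and the explicit decay schedule are assumptions, since the slides only state that ε starts at 0.01 and approaches zero once the car has learned.

    import numpy as np

    def egreedy_select(Q, s, epsilon=0.01):
        """Return a random action with probability epsilon, else the greedy action."""
        n_actions = Q[s].shape[-1]
        if np.random.rand() < epsilon:
            return int(np.random.randint(n_actions))   # explore
        return int(np.argmax(Q[s]))                    # exploit

    # Example usage: a Q table indexed by (position bin, velocity bin, action)
    # and an epsilon decayed toward zero as learning progresses (assumed schedule).
    Q = np.zeros((40, 40, 3))
    epsilon = 0.01
    for episode in range(1000):
        action = egreedy_select(Q, (20, 20), epsilon)
        epsilon *= 0.995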
  • 15. STATE, ACTION & REWARD State: The states are position and speed. Position is in the range −1.5 to 0.55 and speed is in the range −0.07 to 0.07. Action: The agent has one of these 3 actions at all times: forward, backward, neutral (forward acceleration = +1 m/s2, backward acceleration = −1 m/s2, neutral = 0 m/s2). Reward: The agent receives a reward of −1 for all actions, except when the agent reaches the goal state, where it receives a reward of 0. 15
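The same definitions, written out as plain Python constants for reference; only the numeric ranges and the reward scheme come from this slide, while the names and the assumption that the goal sits at the upper position bound are illustrative.

    # State, action and reward definitions from slide 15 as constants.
    POSITION_RANGE = (-1.5, 0.55)    # continuous position; goal assumed at the upper bound
    VELOCITY_RANGE = (-0.07, 0.07)   # continuous velocity
    ACTIONS = {+1: "forward", -1: "backward", 0: "neutral"}   # acceleration in m/s^2

    def reward(position, goal=POSITION_RANGE[1]):
        """-1 on every step until the goal position is reached, then 0."""
        return 0.0 if position >= goal else -1.0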
  • 16. VALUE ITERATION  The value iteration algorithm, which is also called backward induction,  combines policy improvement and a truncated policy evaluation into a single update step  V(s) = R(s) + γ max_a ∑_{s′} T(s, a, s′) V(s′) 16
  • 17. VALUE ITERATION ALGORITHM  Inputs: (S, A, T, R, γ) and ε, a threshold value  Initialize V0 for every state ‘s’  For each state, compute the next approximation using the Bellman backup equation:  V(s) ← R(s) + γ max_a ∑_{s′} T(s, a, s′) V(s′),  δ ← V(s′) − V(s)  Repeat until δ < ε  Return V 17
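A minimal value-iteration sketch in Python for the discretized mountain car follows. Because the car's dynamics are deterministic, T(s, a, s′) reduces to a single successor state per action; the grid resolution and the dynamics discretization are assumptions, while the threshold ε = 0.0001 and the −1-per-step reward come from the presentation.

    import numpy as np

    # Illustrative value iteration for the discretized mountain car (slides 16-17).
    N_POS, N_VEL = 40, 40
    GAMMA, EPS = 0.99, 1e-4                       # discount and threshold epsilon
    positions = np.linspace(-1.5, 0.55, N_POS)
    velocities = np.linspace(-0.07, 0.07, N_VEL)

    def next_state(i, j, action):
        """Deterministic transition on the grid, so T(s, a, s') is 1 for one s'."""
        pos, vel = positions[i], velocities[j]
        vel = np.clip(vel + 0.001 * action - 0.0025 * np.cos(3 * pos), -0.07, 0.07)
        pos = np.clip(pos + vel, -1.5, 0.55)
        return np.abs(positions - pos).argmin(), np.abs(velocities - vel).argmin()

    V = np.zeros((N_POS, N_VEL))                  # initialize V0
    while True:
        delta = 0.0
        for i in range(N_POS):
            for j in range(N_VEL):
                if positions[i] >= 0.55:          # goal states keep value 0
                    continue
                # Bellman backup: V(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') V(s')
                best = max(V[next_state(i, j, a)] for a in (-1, 0, +1))
                new_v = -1.0 + GAMMA * best       # R(s) = -1 away from the goal
                delta = max(delta, abs(new_v - V[i, j]))
                V[i, j] = new_v
        if delta < EPS:                           # stop once the error is below epsilon
            break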
  • 18. GRAPHICAL RESULTS  The graph shows the relation between the RMS value (also called the policy loss) and the number of episodes.  The RMS value is the error between the current Q value and the previous Q value.  With some probability, the agent chooses an action randomly; if the chosen action happens to be bad, it causes an instant rise in the error.  At convergence, the error is approximately zero.  In our case, convergence is reached when 3 or more successive RMS values equal 0.0001 or less 18
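A small sketch of this convergence test, assuming the RMS value is computed as the root-mean-square difference between successive Q tables (the slides describe it only as the error between the current and previous Q values):

    import numpy as np
    from collections import deque

    def rms(q_new, q_old):
        """Root-mean-square difference between two successive Q tables."""
        return float(np.sqrt(np.mean((q_new - q_old) ** 2)))

    # Convergence: three or more successive RMS values of 0.0001 or less.
    recent = deque(maxlen=3)

    def converged(q_new, q_old, threshold=1e-4):
        recent.append(rms(q_new, q_old))
        return len(recent) == 3 and all(e <= threshold for e in recent)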
  • 19.  The car in the mountain is displayed at the 11th iteration to visualize how the car agent learns. 19
  • 20. GRAPH 20
  • 21. CONT. Figure shows the graph of Total Reward vs. Episode at the 1000th episode 21
  • 22. RESULT CONT.  The car in the mountain is displayed at the 11th iteration to visualize how the car agent learns.  After the 11th iteration, the display is stopped to reduce the time it takes to converge.  After 3 or more successive RMS values equal 0.0001 or less, the car is displayed again to show that it has fully learned how to reach the goal state in any episode while maintaining a constant number of steps. 22
  • 23. VI RESULTS  The graph below shows the convergence error over iterations 23
  • 24. VI CONT. Figure 6 shows the graph of Optimal Positions and Velocities over time at the top, while the bottom one displays the car learning in the mountain. 24
  • 25. VI CONT.  The first episode records the highest error  This is because the error is the difference between the current value function and the previous value function, i.e. Error = V(s′) − V(s)  But initially the previous value function is 0  Hence Error = V(s′) 25
  • 26. VI CONT.  At subsequent episodes, the error keeps decreasing as the next updated value functions increase.  At convergence, the error (approximately 0) is less than the threshold value (ε = 0.0001), which is the termination criterion for this project.  Finally, the optimal policy is returned. 26
  • 27. VI CONT.  The graphs below show the optimal positions and velocities over time  The first graph is that of the optimal positions over time  It simply shows the optimal positions attained by the car as it attempts to reach the goal state at different times 27
  • 28. CONT.  The second graph shows the optimal velocities attained by the car as it attempts to reach the goal state at different times  The car initially accelerates from its rest position to reach a position of −0.2; it then swings back to gather enough momentum, reaching a position of −0.95, and finally accelerates forward again to reach the goal state 28
  • 29. CONCLUSION In this project, the temporal difference and value iteration learning algorithms were implemented for the mountain car problem. Both algorithms converged, determining the optimal policy for reaching the goal state. 29