
An Introduction to Reinforcement Learning (RL)

Slides from Aditya Tarigoppula's talk at NYC Machine Learning on December 13th.

  1. An Introduction to Reinforcement Learning (RL) and RL Brain Machine Interface (RL-BMI). Aditya Tarigoppula, www.joefrancislab.com, SUNY Downstate Medical Center.
  2. Outline (flow diagram): START/END; Environment; Optimality and value functions; Methods for attaining optimality (DP, MC, TD); RL Examples; Eligibility Traces; BMI & RL-BMI.
  3. Environment model: Markov decision process (MDP)
     1) States S
     2) Actions A
     3) State transition probabilities: P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }, with Σ_{s'} P^a_{ss'} = 1
     4) Reward / return: R_t = r_{t+1} + r_{t+2} + ... + r_T (episodic), or R_t = Σ_{k=0..∞} γ^k r_{t+k+1} with discount factor 0 ≤ γ ≤ 1
     5) Policy π: S → A (deterministic, non-stationary)
     RL problem: the decision maker, the 'agent', needs to learn the optimal policy in an 'environment' so as to maximize the total amount of reward it receives over the long term.
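To make the slide's notation concrete, here is a minimal Python sketch of an MDP in this form; the two states, two actions, and all of the numbers are invented for illustration and are not the example used in the talk.

```python
# Illustrative encoding of the MDP elements above: states, actions,
# transition probabilities P[s][a][s'], rewards R[s][a][s'], discount gamma.

STATES = ["s1", "s2"]
ACTIONS = ["left", "right"]

P = {  # P[s][a][s'] = Pr{s_{t+1} = s' | s_t = s, a_t = a}; each row sums to 1
    "s1": {"left": {"s1": 0.9, "s2": 0.1}, "right": {"s1": 0.2, "s2": 0.8}},
    "s2": {"left": {"s1": 0.8, "s2": 0.2}, "right": {"s1": 0.1, "s2": 0.9}},
}

R = {  # R[s][a][s'] = expected immediate reward for that transition
    "s1": {"left": {"s1": 0.0, "s2": 1.0}, "right": {"s1": 0.0, "s2": 1.0}},
    "s2": {"left": {"s1": 0.0, "s2": 0.0}, "right": {"s1": 0.0, "s2": 5.0}},
}

GAMMA = 0.9  # discount factor, 0 <= gamma <= 1

# A deterministic policy pi: S -> A
policy = {"s1": "right", "s2": "right"}
```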
  4. • The agent performs the action under the policy being followed.
     • The environment is everything else other than the agent.
  5. Value functions
     State value function:
       V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s }
              = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
     State-action value function:
       Q^π(s,a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0..∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
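A small sketch of how these definitions can be evaluated in code, assuming the nested-dict MDP layout from the snippet above and a deterministic policy; the function names are illustrative, not from the talk.

```python
def q_pi(s, a, V, P, R, gamma):
    """Q_pi(s, a) = sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V[s']),
    i.e. expected one-step reward plus the discounted value of the successor
    state, using the nested-dict MDP layout sketched earlier."""
    return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())


def v_pi_backup(s, policy, V, P, R, gamma):
    """Bellman expectation backup for a deterministic policy:
    V_pi(s) = Q_pi(s, policy[s])."""
    return q_pi(s, policy[s], V, P, R, gamma)
```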
  6. Optimal value functions
     Optimal policy: a policy that is better than or equal to all other policies (in the sense of maximizing expected reward) is called an optimal policy.
     Optimal state value function: V*(s) = max_π V^π(s)
     Optimal state-action value function: Q*(s,a) = max_π Q^π(s,a)
     Bellman optimality equations:
       V*(s) = max_a E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
       Q*(s,a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }
  7. Example
     At time t: acquire the brain state → decoder → action selection (trying to execute an optimum action) → action executed.
     At time t+1: observe the reward and update the decoder.
  8. Example
     From state s, the selected action leads to S1 with Pr 0.8, S3 with Pr 0.1, S2 with Pr 0.1, and S4 with Pr 0.
     V^π(s) = 0.8 (R(s,a1) + γ V^π(s1)) + 0.1 (R(s,a2) + γ V^π(s2)) + 0.1 (R(s,a3) + γ V^π(s3))
     (Prof. Andrew Ng, Lecture 16, Machine Learning)
  9. Outline, revisited: "We're here!" at the methods for attaining optimality (DP, MC, TD).
  10. Solution methods for the RL problem
      ◦ Dynamic Programming (DP): a method for optimizing problems that exhibit overlapping subproblems and optimal substructure.
      ◦ Monte Carlo (MC): requires only experience: sample sequences of states, actions, and rewards from interaction with an environment.
      ◦ Temporal Difference learning (TD): combines the better aspects of DP (bootstrapped estimation) and MC (learning from experience) without the 'troublesome' aspects of either.
  11. Dynamic Programming: Policy Evaluation
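The policy-evaluation algorithm itself did not survive extraction on this slide, so here is a standard iterative policy evaluation loop as a stand-in, again assuming the nested-dict MDP layout used in the earlier sketches.

```python
def policy_evaluation(states, policy, P, R, gamma, theta=1e-6):
    """Iterative policy evaluation (DP): sweep Bellman expectation backups
    in place until no value changes by more than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(p * (R[s][a][s2] + gamma * V[s2])
                        for s2, p in P[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```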
  12. Dynamic Programming: Policy Improvement
      Q^π(s, π'(s)) ≥ V^π(s)
      Policy iteration: π_0 →(E) V^{π_0} →(I) π_1 →(E) V^{π_1} →(I) ... →(I) π* →(E) V*
      E = policy evaluation, I = policy improvement
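A matching sketch of the greedy policy-improvement step, which produces a π' satisfying Q^π(s, π'(s)) ≥ V^π(s); the function name and MDP layout are the same illustrative assumptions as before.

```python
def policy_improvement(states, actions, V, P, R, gamma):
    """Greedy improvement: pi'(s) = argmax_a sum_{s'} P[s][a][s'] *
    (R[s][a][s'] + gamma * V[s'])."""
    return {
        s: max(actions,
               key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                 for s2, p in P[s][a].items()))
        for s in states
    }
```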
  13. Dynamic Programming: Policy Iteration vs. Value Iteration
      In value iteration, the full policy-evaluation sweep of policy iteration is replaced with a single backup:
      V(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
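The backup on this slide is the core of value iteration; a compact sketch under the same assumed MDP layout:

```python
def value_iteration(states, actions, P, R, gamma, theta=1e-6):
    """Value iteration: repeatedly apply the Bellman optimality backup
    V(s) <- max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])
    until the value function stops changing by more than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (R[s][a][s2] + gamma * V[s2])
                            for s2, p in P[s][a].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```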
  14. Monte Carlo vs. DP
      ◦ In MC, the estimates for each state are independent; in other words, MC methods do not "bootstrap".
      ◦ The DP backup includes only one-step transitions, whereas the MC backup goes all the way to the end of the episode.
      ◦ With MC, the computational expense of estimating the value of a single state is lower when one requires the values of only a subset of the states.
  15. Monte Carlo Policy Evaluation: every-visit MC, first-visit MC
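A sketch of first-visit MC prediction; the episode format, a list of (state, reward) pairs where the reward is the one received on leaving that state, is an assumption made for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo prediction: V(s) is estimated as the average
    of the returns that follow the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # compute the return G_t backwards from the end of the episode
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns_at[t] = G
        # credit only the first visit to each state
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns_sum[s] += returns_at[t]
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```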
  16. Monte Carlo
      -> Without a model, we need Q-value estimates.
      -> All state-action pairs should be visited.
      -> Exploration techniques: 1) exploring starts, 2) ε-greedy policy (next slide).
  17. Monte Carlo: as promised, this is the "NEXT SLIDE"!
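The ε-greedy rule promised here can be sketched in a few lines; the dict layout of Q, keyed by (state, action), is an assumption for the example.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """epsilon-greedy exploration: with probability epsilon pick a random
    action, otherwise act greedily w.r.t. the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```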
  18. Temporal Difference methods
      ◦ Like MC, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
      V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
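The TD(0) update on this slide, written as a function; the name and the dict-based value table are illustrative.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) backup, matching the slide's update rule:
    V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```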
  19. TD(λ): bias-variance tradeoff
      As λ (the trace-decay parameter) increases, bias decreases and variance increases.
      Intuition: start with a large λ and then decrease it over time.
  20. SARSA vs. Q-learning: the difference between the two update rules.
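The difference the slide refers to is in the bootstrap target; a sketch of both update rules, with Q stored as a dict keyed by (state, action) (an assumption).

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA (on-policy): bootstrap from the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q-learning (off-policy): bootstrap from the greedy next action,
    regardless of which action the behaviour policy takes next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```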
  21. Outline, revisited: "We're here!" at Eligibility Traces.
  22. Eligibility Traces
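Little of this slide survived extraction beyond the title, so here is a hedged sketch of the backward-view TD(λ) update with an eligibility trace; the accumulating-vs-replacing switch is an added assumption, not necessarily what the slide showed.

```python
def td_lambda_update(V, e, s, r, s_next, alpha, gamma, lam, states,
                     replacing=False):
    """Backward-view TD(lambda): every recently visited state is nudged
    toward the TD target in proportion to its eligibility trace."""
    td_error = r + gamma * V[s_next] - V[s]
    e[s] = 1.0 if replacing else e[s] + 1.0   # bump the trace for s_t
    for x in states:
        V[x] += alpha * td_error * e[x]       # credit all eligible states
        e[x] *= gamma * lam                   # decay every trace
    return td_error
```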
  23. Outline, revisited: "We're here!" at BMI & RL-BMI.
  24. Online / closed-loop RL-BMI architecture
      The neural signal is mapped through a network with tanh(.) activations to Q values, one per action:
        action = output_index[ max( Q_i(s_t) ) ],   Q(s_t, a_t) = Q_i(s_t, action)
        TD_err = r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)
        delta = TD_err * e_trace
      'delta' is used for updating the weights through back-propagation.
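A heavily hedged sketch of how the quantities named on this slide fit together. The slide says delta is back-propagated through the decoder network; this single-layer tanh stand-in, with made-up shapes, learning rate, and a scalar eligibility trace, is only an illustration of the action selection, TD_err, and delta computation, not the talk's decoder.

```python
import numpy as np

def rl_bmi_step(W, neural_state, next_state, reward, e_trace,
                gamma=0.9, lr=0.01):
    """One illustrative closed-loop update.  W has shape
    (n_actions, n_features); each row produces one Q value via tanh."""
    q = np.tanh(W @ neural_state)             # Q_i(s_t), one entry per action
    action = int(np.argmax(q))                # action = index of max Q_i(s_t)

    q_next = np.tanh(W @ next_state)
    next_action = int(np.argmax(q_next))

    # TD_err = r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
    td_err = reward + gamma * q_next[next_action] - q[action]
    delta = td_err * e_trace                  # 'delta' drives the weight update

    # gradient of the chosen action's tanh output w.r.t. its weights
    grad = (1.0 - q[action] ** 2) * neural_state
    W[action] += lr * delta * grad
    return action, td_err
```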
  25. BMI setup
      Scott, S. H. (1999). "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching." J Neurosci Methods 89(2): 119-27.
  26. Actor-Critic Model (figure: http://drugabuse.gov/researchreports/methamph/meth04.gif)
  27. References
      Sutton, R. S. & Barto, A. G., Reinforcement Learning: An Introduction.
      Prof. Andrew Ng's Machine Learning lectures; http://heli.stanford.edu
      Dr. Joseph T. Francis, www.joefrancislab.com
      Prof. Peter Dayan
      Dr. Justin Sanchez Group, http://www.bme.miami.edu/nrg/
