Slides from Aditya Tarigoppula's talk at NYC Machine Learning on December 13th.

- 1. An Introduction to Reinforcement Learning (RL) and RL Brain Machine Interface (RL-BMI). Aditya Tarigoppula, www.joefrancislab.com, SUNY Downstate Medical Center
- 2. Outline (diagram, START / END): Optimality; Value functions; Methods for attaining optimality (DP, MC, TD); Environment; RL Examples; Eligibility Traces; BMI & RL-BMI
- 3. Environment model - Markov decision process: 1) States $S$. 2) Actions $A$. 3) State transition probabilities $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$, with $\sum_{s'} P^a_{ss'} = 1$. 4) Reward: $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_{t+T}$, or with discounting $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$, $0 \le \gamma \le 1$. 5) Policy $\pi: s \to a$ (deterministic, non-stationary policy). RL Problem: the decision maker, or 'agent', needs to learn the optimum policy in an 'environment' to maximize the total amount of reward it receives over the long term.
- 4. • Agent performs the action under the policy being followed. • Environment is everything other than the agent. (Diagram label: $P^a_{ss'}$)
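- 5. Value Functions. State value function: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s\} = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$. State-action value function: $Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$.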
- 6. Optimal Value Function. Optimal policy: a policy that is better than or equal to all other policies, in the sense of maximizing expected reward. Optimal state value function: $V^*(s) = \max_\pi V^\pi(s)$. Optimal state-action value function: $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. Bellman optimality equations: $V^*(s) = \max_a E\{r_{t+1} + \gamma V^*(s') \mid s_t = s, a_t = a\}$ and $Q^*(s,a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s',a') \mid s_t = s, a_t = a\}$.
- 7. Example. At time = t: acquire brain state, pass it to the decoder, select an action (trying to execute an optimum action), execute the action. At time = t+1: observe reward, update the decoder.
- 8. Example. Transitions from state s: to S1 with Pr 0.8, to S3 with Pr 0.1, to S2 with Pr 0.1, to S4 with Pr 0. $V(s) = [0.8\,(R(s,a_1) + \gamma V(s_1)) + 0.1\,(R(s,a_2) + \gamma V(s_2)) + 0.1\,(R(s,a_3) + \gamma V(s_3))]$. (Prof. Andrew Ng, Lecture 16, Machine Learning)
- 9. Outline (repeated, with a "We're here!" marker)
- 10. Solution Methods for the RL problem ◦ Dynamic Programming (DP): a method for optimizing problems that exhibit overlapping subproblems and optimal substructure. ◦ Monte Carlo method (MC): requires only experience, that is, sample sequences of states, actions, and rewards from interaction with an environment. ◦ Temporal Difference learning (TD): a method that combines the better aspects of DP (estimation) and MC (experience) without incorporating the 'troublesome' aspects of both.
- 11. Dynamic Programming: Policy Evaluation
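- 12. Dynamic Programming: Policy Improvement. $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$. $\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \dots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$, where E = Policy Evaluation and I = Policy Improvement.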
- 13. Dynamic Programming: Policy Iteration and Value Iteration. For value iteration, replace entire section with $V(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V(s')]$.
- 14. Monte Carlo vs. DP ◦ The MC estimates for each state are independent. In other words, MC methods do not "bootstrap". ◦ The DP backup diagram includes only one-step transitions, whereas the MC backup diagram goes all the way to the end of the episode. ◦ With MC, the computational expense of estimating the value of a single state is independent of the number of states, which makes it cheaper when one requires the value of only a subset of the states.
- 15. Monte Carlo Policy Evaluation: every-visit MC; first-visit MC
- 16. Monte Carlo: -> Without a model, we need Q-value estimates. -> All state-action pairs should be visited. -> Exploration techniques: 1) Exploring starts 2) ε-greedy policy (next slide)
- 17. Monte Carlo: As promised, this is the "NEXT SLIDE"!
- 18. Temporal Difference Methods ◦ Like MC, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
- 19. TD(λ) bias-variance tradeoff: as the trace decay parameter λ increases, bias decreases and variance increases. Intuition: start with a large λ and then decrease it over time.
- 20. SARSA and Q-Learning: the difference
- 21. Outline (repeated). We're here: Eligibility Traces
- 22. Eligibility Traces OR
- 23. Outline (repeated). We're here: BMI & RL-BMI
- 24. Online/closed-loop RL-BMI architecture (diagram labels: NEURAL SIGNAL, tanh(.), reward). action = output_index[$\max(Q_i(s_t))$]; $Q(s_t, a_t) = Q_i(s_t, \text{action})$; TD_err = $r_t + \gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$; delta = TD_err * e_trace. 'delta' is used for updating the weights through back-propagation.
- 25. BMI setup. Scott, S. H. (1999). "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching." J Neurosci Methods 89(2): 119-27.
- 26. Actor-Critic Model. http://drugabuse.gov/researchreports/methamph/meth04.gif
- 27. References: Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto; Prof. Andrew Ng's Machine Learning lectures, http://heli.stanford.edu; Dr. Joseph T. Francis, www.joefrancislab.com; Prof. Peter Dayan; Dr. Justin Sanchez Group, http://www.bme.miami.edu/nrg/
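
The value iteration backup from slides 6 and 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-state MDP, its transition table `P`, the discount `GAMMA`, and the helper names `backup` and `value_iteration` are all hypothetical and do not come from the talk.

```python
# Hypothetical 3-state, 2-action MDP; every number here is made up.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}
GAMMA = 0.9  # assumed discount factor


def backup(V, s, a):
    """Expected one-step return: sum over s' of P * (R + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-8):
    """Sweep V(s) <- max_a backup(V, s, a) until the largest change is < tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = max(backup(V, s, a) for a in P[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
    return V, policy


if __name__ == "__main__":
    values, greedy = value_iteration()
    print(values, greedy)
```

In this toy problem only the transition from state 1 into state 2 pays a reward, so the greedy policy picks action 1 in states 0 and 1.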
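
Slides 16 and 20 mention ε-greedy exploration and the difference between SARSA and Q-learning without showing the update rules in the transcript. The sketch below follows the standard tabular forms from Sutton & Barto; the constants `ALPHA`, `GAMMA`, `EPSILON`, the dictionary layout of `Q`, and the function names are assumptions made for illustration.

```python
import random

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # illustrative values only


def epsilon_greedy(Q, s, actions, rng=random):
    """With probability EPSILON explore at random, otherwise act greedily."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy: bootstrap from the action a2 actually chosen in s2."""
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])


def q_learning_update(Q, s, a, r, s2, actions):
    """Off-policy: bootstrap from the best action available in s2."""
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])
```

The only difference is the bootstrap target: SARSA uses the value of the next action that the behaviour policy selected, while Q-learning uses the maximum over next actions regardless of what is executed.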
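
The eligibility-trace idea from slides 19, 22 and 24 (a TD error scaled by a decaying trace) can be illustrated with tabular TD(λ) using accumulating traces. This is a minimal sketch of the textbook algorithm, not the neural-network decoder from the talk; the function name `td_lambda_episode`, the episode format, and the default parameters are assumptions.

```python
def td_lambda_episode(V, episode, alpha=0.5, gamma=1.0, lam=0.5):
    """Update V in place from one episode of (state, reward, next_state) steps.

    V must contain every state that appears, including the terminal state
    (whose value stays at its initial 0.0 because its trace is never raised).
    """
    e = {s: 0.0 for s in V}  # one eligibility trace per state
    for s, r, s2 in episode:
        delta = r + gamma * V[s2] - V[s]  # one-step TD error
        e[s] += 1.0                       # accumulating trace for the visited state
        for x in V:
            V[x] += alpha * delta * e[x]  # credit states in proportion to their trace
            e[x] *= gamma * lam           # decay all traces
    return V


if __name__ == "__main__":
    values = {"A": 0.0, "B": 0.0, "T": 0.0}
    td_lambda_episode(values, [("A", 0.0, "B"), ("B", 1.0, "T")])
    print(values)
```

With `lam=0` this reduces to the one-step TD update on slide 18, and as `lam` approaches 1 the updates approach Monte Carlo returns, which is the bias-variance tradeoff on slide 19.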
