
- 1. Reinforcement Learning. V. Saranya, AP/CSE, Sri Vidya College of Engineering and Technology, Virudhunagar.
- 2. What is learning? Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal.
- 3. Learning types. Supervised learning: a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given; you can think of it as having a kind teacher. Reinforcement learning: the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.
- 4. Reinforcement learning: the task. Learn how to behave successfully to achieve a goal while interacting with an external environment. Learn via experience! Examples: Game playing: the player knows whether it wins or loses, but not how to move at each step. Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
- 5. RL is learning from interaction
- 6. RL model. Each percept (e) is enough to determine the state (the state is accessible). The agent can decompose the reward component from a percept. The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement. Think of reinforcement as reward. This can be modeled as an MDP!
- 7. Review of the MDP model <S, A, T, R>. (Diagram: the agent-environment interaction loop, producing a trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, ...) S: set of states. A: set of actions. T(s,a,s') = P(s'|s,a): the probability of transitioning from s to s' given action a. R(s,a): the expected reward for taking action a in state s, i.e. R(s,a) = Σ_{s'} P(s'|s,a) r(s,a,s') = Σ_{s'} T(s,a,s') r(s,a,s').
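As a minimal sketch of the expected-reward formula above (the state and action names and the tiny dictionary-based MDP are invented for illustration):

```python
# Hypothetical toy MDP: T[(s, a)] maps successor states s' to P(s'|s, a);
# r[(s, a, s_next)] is the immediate reward for that transition.
T = {
    ("s0", "go"): {"s1": 0.8, "s0": 0.2},
}
r = {
    ("s0", "go", "s1"): 1.0,
    ("s0", "go", "s0"): 0.0,
}

def expected_reward(s, a):
    """R(s, a) = sum over s' of T(s, a, s') * r(s, a, s')."""
    return sum(p * r[(s, a, s_next)] for s_next, p in T[(s, a)].items())

print(expected_reward("s0", "go"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```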
- 8. Model-based vs. model-free approaches. But we don't know anything about the environment model, i.e. the transition function T(s,a,s'). Two approaches follow. Model-based RL: learn the model, and use it to derive the optimal policy, e.g. the adaptive dynamic programming (ADP) approach. Model-free RL: derive the optimal policy without learning the model, e.g. LMS and temporal-difference approaches. Which one is better?
- 9. Passive learning vs. active learning. Passive learning: the agent simply watches the world going by and tries to learn the utilities of being in various states. Active learning: the agent does not simply watch, but also acts.
- 10. Example environment
- 11. Passive learning scenario. The agent sees sequences of state transitions and associated rewards. The environment generates the state transitions and the agent perceives them, e.g. (1,1) (1,2) (1,3) (2,3) (3,3) (4,3)[+1] and (1,1) (1,2) (1,3) (1,2) (1,3) (1,2) (1,1) (2,1) (3,1) (4,1) (4,2)[-1]. Key idea: update the utility values using the given training sequences.
- 12. Passive learning scenario.
- 13. LMS (least mean squares) updating. Reward-to-go of a state: the sum of the rewards from that state until a terminal state is reached. Key: use the observed reward-to-go of a state as direct evidence of the actual expected utility of that state. Learn the utility function directly from sequence examples.
- 14. LMS updating.
      function LMS-UPDATE(U, e, percepts, M, N) returns an updated U
          if TERMINAL?[e] then
              reward-to-go ← 0
              for each e_i in percepts (starting from the end) do
                  s ← STATE[e_i]
                  reward-to-go ← reward-to-go + REWARD[e_i]
                  U[s] ← RUNNING-AVERAGE(U[s], reward-to-go, N[s])
      function RUNNING-AVERAGE(U[s], reward-to-go, N[s])
          U[s] ← (U[s] × (N[s] − 1) + reward-to-go) / N[s]
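A runnable Python version of the LMS-UPDATE pseudocode (a minimal sketch: the trial format, a list of (state, reward) pairs, and the sample per-step reward of -0.04 are assumptions for illustration):

```python
def lms_update(U, N, trial):
    """LMS update: once a terminal state is reached, walk the trial backwards,
    accumulate the reward-to-go, and fold it into a running average per state.

    U: dict state -> estimated utility
    N: dict state -> number of times the state has been updated
    trial: list of (state, reward) pairs ending in a terminal state
    """
    reward_to_go = 0.0
    for state, reward in reversed(trial):
        reward_to_go += reward
        N[state] = N.get(state, 0) + 1
        # running average: U[s] <- (U[s] * (N[s] - 1) + reward_to_go) / N[s]
        U[state] = (U.get(state, 0.0) * (N[state] - 1) + reward_to_go) / N[state]
    return U

U, N = {}, {}
lms_update(U, N, [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((4, 3), 1.0)])
print(U[(1, 1)])  # reward-to-go from (1,1): -0.04 * 3 + 1.0 = 0.88
```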
- 15. LMS updating algorithm in passive learning. Drawback: it ignores the fact that the actual utility of a state is constrained to be the probability-weighted average of its successors' utilities. It converges very slowly to the correct utility values (requires a lot of sequences); for our example, more than 1000!
- 16. Temporal-difference method in passive learning. TD(0) key idea: adjust the estimated utility of the current state based on its immediate reward and the estimated value of the next state. The update rule is U(s) ← U(s) + α(R(s) + U(s') − U(s)), where α is the learning-rate parameter. Only when α is a function that decreases as the number of times a state has been visited increases can U(s) converge to the correct value.
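The TD(0) rule above as a one-line Python update (a sketch; α is fixed here for brevity, whereas the slide notes it should decay with visit counts for convergence):

```python
def td_update(U, s, s_next, reward, alpha):
    """TD(0): U(s) <- U(s) + alpha * (R(s) + U(s') - U(s)).
    Undiscounted form, as on the slide; for the discounted case,
    multiply U(s') by a discount factor gamma."""
    U[s] = U.get(s, 0.0) + alpha * (reward + U.get(s_next, 0.0) - U.get(s, 0.0))
    return U

U = {"B": 1.0}
td_update(U, "A", "B", reward=0.0, alpha=0.5)
print(U["A"])  # 0.0 + 0.5 * (0.0 + 1.0 - 0.0) = 0.5
```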
- 17. The TD learning curve. (Figure: utility estimates over training sequences for the states (4,3), (2,3), (2,2), (1,1), (3,1), (4,1), (4,2).)
- 18. Adaptive dynamic programming (ADP) in passive learning. Unlike the LMS and TD methods (model-free approaches), ADP is a model-based approach! The update rule for passive learning is U(s) = Σ_{s'} T(s,s') (r(s,s') + γ U(s')). However, in an unknown environment T is not given: the agent must learn T itself from experience with the environment. How to learn T?
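Once T and r have been estimated, the passive-ADP rule can be iterated to a fixed point. A minimal sketch (the two-state chain and γ = 0.9 are invented for illustration):

```python
GAMMA = 0.9

# Hypothetical learned model under a fixed policy:
# T[s][s'] = P(s'|s), r[(s, s')] = immediate reward on the transition.
T = {"A": {"B": 1.0}, "B": {"B": 1.0}}
r = {("A", "B"): 0.0, ("B", "B"): 1.0}

def adp_evaluate(T, r, gamma, sweeps=200):
    """Iterate U(s) = sum_s' T(s,s') * (r(s,s') + gamma * U(s')) to a fixed point."""
    U = {s: 0.0 for s in T}
    for _ in range(sweeps):
        U = {s: sum(p * (r[(s, s2)] + gamma * U[s2])
                    for s2, p in T[s].items())
             for s in T}
    return U

U = adp_evaluate(T, r, GAMMA)
print(round(U["B"], 2), round(U["A"], 2))  # B earns 1 per step: 1/(1-0.9) = 10.0
```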
- 19. ADP learning curves. (Figure: utility estimates over training sequences for the states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2).)
- 20. Active learning. An active agent must consider what actions to take and what their outcomes may be. The utility update equation is U(s) = max_a (R(s,a) + γ Σ_{s'} T(s,a,s') U(s')), and the rule to choose the action is a = argmax_a (R(s,a) + γ Σ_{s'} T(s,a,s') U(s')).
- 21. Active ADP algorithm.
      For each s, initialize U(s), T(s,a,s') and R(s,a).
      Initialize s to the current state that is perceived.
      Loop forever:
          Select an action a = argmax_a (R(s,a) + γ Σ_{s'} T(s,a,s') U(s')) and execute it, using the current model R and T.
          Receive the immediate reward r and observe the new state s'.
          Use the transition tuple <s, a, s', r> to update the model R and T (see further).
          For all states s, update U(s) using the rule U(s) = max_a (R(s,a) + γ Σ_{s'} T(s,a,s') U(s')).
          s ← s'.
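The loop above can be sketched in Python. Everything concrete here is an assumption for illustration: the two-state deterministic environment, γ = 0.9, and the optimistic default for untried state-action pairs (a crude stand-in for the exploration function of slide 27):

```python
import random

random.seed(0)
GAMMA = 0.9
R_PLUS = 100.0  # optimistic value for untried pairs (assumed; cf. slide 27)

# Hypothetical true environment, deterministic to keep the sketch short:
# dynamics[(s, a)] = (next state, immediate reward).
dynamics = {
    ("A", "stay"): ("A", 0.0), ("A", "go"): ("B", 0.0),
    ("B", "stay"): ("B", 1.0), ("B", "go"): ("A", 0.0),
}
states, actions = ["A", "B"], ["stay", "go"]

counts = {}                        # learned T: (s, a) -> {s': visit count}
R = {}                             # learned R: (s, a) -> running mean reward
U = {st: 0.0 for st in states}

def q(s, a):
    """R(s,a) + gamma * sum_s' T(s,a,s') U(s'), from the learned model."""
    c = counts.get((s, a), {})
    n = sum(c.values())
    if n == 0:
        return R_PLUS              # untried pair: optimistic estimate
    return R.get((s, a), 0.0) + GAMMA * sum(k / n * U[s2] for s2, k in c.items())

s = "A"
for _ in range(500):
    # Greedy action w.r.t. the current model (random tie-break).
    a = max(actions, key=lambda act: (q(s, act), random.random()))
    s2, rew = dynamics[(s, a)]
    # Update the model from the transition tuple <s, a, s', r>.
    c = counts.setdefault((s, a), {})
    c[s2] = c.get(s2, 0) + 1
    n = sum(c.values())
    R[(s, a)] = R.get((s, a), 0.0) + (rew - R.get((s, a), 0.0)) / n
    # Full sweep: U(s) = max_a [R(s,a) + gamma * sum_s' T(s,a,s') U(s')].
    U = {st: max(q(st, act) for act in actions) for st in states}
    s = s2

print({st: round(u, 2) for st, u in U.items()})
```

The optimistic default guarantees each state-action pair is tried at least once; after that the repeated full sweeps behave like value iteration on the learned model, so for this chain U("B") converges to 1/(1 − γ) = 10 and U("A") to 9.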
- 22. How to learn the model? Use the transition tuple <s, a, s', r> to learn T(s,a,s') and R(s,a). That's supervised learning! Since the agent observes every transition (s, a, s', r) directly, take (s,a)/s' as an input/output example of the transition probability function T. Different supervised-learning techniques apply (see further reading for details). Use r and T(s,a,s') to learn R(s,a): R(s,a) = Σ_{s'} T(s,a,s') r.
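The simplest supervised technique here is frequency counting over a batch of observed tuples, shown below (the sample tuples are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical observed transition tuples <s, a, s', r>.
tuples = [
    ("A", "go", "B", 0.0),
    ("A", "go", "B", 0.0),
    ("A", "go", "A", 0.0),
    ("B", "stay", "B", 1.0),
]

counts = defaultdict(Counter)   # (s, a) -> Counter of successor states s'
rewards = defaultdict(list)     # (s, a) -> list of observed rewards

for s, a, s2, rew in tuples:
    counts[(s, a)][s2] += 1
    rewards[(s, a)].append(rew)

def T(s, a, s2):
    """Maximum-likelihood estimate of P(s'|s,a): relative frequency."""
    c = counts[(s, a)]
    return c[s2] / sum(c.values())

def R(s, a):
    """R(s,a) estimated as the mean observed reward for (s,a)."""
    rs = rewards[(s, a)]
    return sum(rs) / len(rs)

print(T("A", "go", "B"), R("B", "stay"))  # 2 of 3 "go" samples reached B
```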
- 23. ADP approach: pros and cons. Pros: the ADP algorithm converges far faster than LMS and temporal-difference learning, because it uses the information from the model of the environment. Cons: intractable for large state spaces, since each step updates U for all states; this can be improved by prioritized sweeping.
- 24. Exploration problem in active learning. An action has two kinds of outcomes: it gains rewards on the current experience tuple (s,a,s'), and it affects the percepts received, and hence the ability of the agent to learn.
- 25. Exploration problem in active learning. There is a trade-off when choosing an action between its immediate good (reflected in its current utility estimate, using what we have learned) and its long-term good (exploring more of the environment helps the agent behave optimally in the long run). Two extreme approaches: the "wacky" approach acts randomly, in the hope that it will eventually explore the entire environment; the "greedy" approach acts to maximize utility using the current model estimate. Just like humans in the real world! People must decide between continuing in a comfortable existence or striking out into the unknown in the hope of discovering a new and better life.
- 26. Exploration problem in active learning. One kind of solution: the agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to correct. In a given state, the agent should give some weight to actions it has not tried very often, while tending to avoid actions believed to be of low utility. This is implemented by an exploration function f(u,n) that assigns a higher utility estimate to relatively unexplored action-state pairs. Change the value-function update rule to U⁺(s) = max_a (r(s,a) + γ f(Σ_{s'} T(s,a,s') U⁺(s'), N(s,a))), where U⁺ denotes the optimistic estimate of the utility.
- 27. Exploration problem in active learning. One kind of definition of f(u,n): f(u,n) = R⁺ if n < N_e, and u otherwise. Here R⁺ is an optimistic estimate of the best possible reward obtainable in any state. The agent will try each action-state pair (s,a) at least N_e times, and will behave initially as if there were wonderful rewards scattered all over the place: optimism.
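This piecewise definition is tiny in code. A sketch, with R⁺ = 2.0 and N_e = 5 as assumed constants:

```python
R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumption)
N_E = 5        # minimum number of tries per state-action pair (assumption)

def f(u, n):
    """Exploration function from the slide: return the optimistic value R+
    until the state-action pair has been tried N_e times, then the real
    utility estimate u."""
    return R_PLUS if n < N_E else u

print(f(0.3, 2), f(0.3, 7))  # 2.0 (still exploring), 0.3 (enough data)
```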
- 28. Generalization in reinforcement learning. Use generalization techniques to deal with large state or action spaces, e.g. function approximation techniques.
- 29. Genetic algorithms and evolutionary programming. Start with a set of individuals. Apply selection and reproduction operators to "evolve" an individual that is successful (as measured by a fitness function).
- 30. Genetic algorithms and evolutionary programming. Imagine the individuals as agent functions and the fitness function as the performance measure or reward function. No attempt is made to learn the relationship between the rewards and the actions taken by an agent; the algorithm simply searches directly in the space of individuals to find one that maximizes the fitness function.
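A toy sketch of this select-and-reproduce loop (the bit-string individuals and the count-the-ones fitness function are invented for illustration, standing in for agent functions and a reward function):

```python
import random

random.seed(1)
GENOME_LEN, POP_SIZE, GENERATIONS = 12, 20, 60

def fitness(ind):
    """Toy fitness: number of 1-bits (stands in for a performance measure)."""
    return sum(ind)

def mutate(ind, rate=0.05):
    """Flip each bit with a small probability."""
    return [b ^ 1 if random.random() < rate else b for b in ind]

def crossover(a, b):
    """One-point crossover of two parents."""
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP_SIZE // 2]    # selection: keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    pop = parents + children         # reproduction

best = max(pop, key=fitness)
print(fitness(best))
```

Note that nothing here models the environment or the reward-action relationship: the loop searches the space of individuals directly, exactly as the slide describes.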
