
Reinforcement learning




  1. Reinforcement Learning: The Exploding Kittens Edition (Tarek Amr)
  2. Why Reinforcement Learning? I learned, after playing many times, that I'm more likely to win if I play this move after that one. No one kept telling me to make this or that move!
  3. States, Actions and Rewards. [Diagram: states S_t, S_{t+1}, S_{t+2} linked by actions A_t and A_{t+1}, leading to a goal state with reward R.]
  4. What’s a good reward? If getting an Exploding Kitten card gives me a reward of -1, what reward do I get for a Defuse card? And for a Nope card?
  5. From Rewards, States get Values. And from values come policies!
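  6. A state has a value (V). [Diagram: states S_t, S_{t+1}, S_{t+2} linked by actions A_t and A_{t+1}, leading to a goal state with reward R, annotated with values V_t and V_{t+1}.]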
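  7. Or a state/action pair has a value (Q). [Diagram: states S_t, S_{t+1}, S_{t+2} linked by actions A_t and A_{t+1}, leading to a goal state with reward R, annotated with values Q_t and Q_{t+1}.]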
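  8. Temporal Difference: S-A-R-S-A. [Diagram: states S_t, S_{t+1}, S_{t+2} linked by actions A_t and A_{t+1}, leading to a goal state with reward R.] Update rule: Q_t := Q_t + α (R_{t+1} + γ Q_{t+1} - Q_t)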
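  9. Epsilon Greedy: Exploration vs Exploitation. [Diagram: states S_t, S_{t+1}, S_{t+2} linked by actions A_t and A_{t+1}, leading to a goal state with reward R.] Update rule: Q_t := Q_t + α (R_{t+1} + γ Q_{t+1} - Q_t)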
  10. Deep Q Learning (warning: oversimplification ahead)

      State Feature 1 | State Feature 2 | Action | Value
      10              | 20              | JUMP   | 0.5
      20              | 15              | DUCK   | 0.6
      15              | 25              | JUMP   | 0.8

      This is a Q-table. What if there are too many states and actions?
  11. MDP, MC and TD
      Markov Decision Process:
        ● You need to know the states and the transitions between them.
      Monte Carlo (variance ↑):
        ● You wait until the episode ends, then re-assign values to states.
        ● No need to even know the states; we sample from the environment.
      Temporal Difference (bias ↑):
        ● Update on the go. No need to even have goal states.
  12. Let’s play the RL vs SL game
      for (i=0; i<3; i++) {
        ● Pick a Catawiki problem
        ● Should it be solved via
          ○ Reinforcement learning?
          ○ Supervised learning?
      }
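
The S-A-R-S-A update on slide 8 and the epsilon-greedy choice on slide 9 can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the talk: the five-state chain environment, the function names (`step`, `epsilon_greedy`, `sarsa`) and the values of alpha, gamma and epsilon are all hypothetical choices made for the example.

```python
import random

N_STATES = 5          # hypothetical toy chain: states 0..4
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
GOAL = N_STATES - 1   # reaching state 4 ends the episode


def step(state, action):
    """Toy environment (assumed): reward 1 on reaching the goal, else 0."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties at random so the untrained agent does not always go left.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action]
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(q, state, epsilon, rng)
        done = False
        while not done:
            nxt, reward, done = step(state, action)
            nxt_action = epsilon_greedy(q, nxt, epsilon, rng)
            # Q_t := Q_t + alpha * (R_{t+1} + gamma * Q_{t+1} - Q_t)
            # The goal state has no future value, so Q_{t+1} is 0 there.
            target = reward + (0.0 if done else gamma * q[nxt][nxt_action])
            q[state][action] += alpha * (target - q[state][action])
            state, action = nxt, nxt_action
    return q


if __name__ == "__main__":
    for s, row in enumerate(sarsa()):
        print(s, [round(v, 3) for v in row])
```

The name S-A-R-S-A is visible in the loop: the update uses the current state and action, the reward, and the next state and the action actually chosen there, which is what makes it an on-policy method.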

Editor's Notes

  • We expect, in general, that the environment will be nondeterministic; that is, that taking the same action in the same state on two different occasions may result in different next states and/or different reinforcement values. However, we assume the environment is stationary; that is, that the probabilities of making state transitions or receiving specific reinforcement signals do not change over time.
  • Reinforcement learning differs from the more widely studied problem of supervised learning in several ways. The most important difference is that there is no presentation of input/output pairs. Instead, after choosing an action the agent is told the immediate reward and the subsequent state, but is not told which action would have been in its best long-term interests. It is necessary for the agent to gather useful experience about the possible system states, actions, transitions and rewards actively to act optimally.
    Another difference from supervised learning is that on-line performance is important: the evaluation of the system is often concurrent with learning.
    Use cases for RL: if there is path dependence (i.e. the order of your moves matters, as in chess), if you have a budget (e.g. a maximum number of emails to send, or money), or if your decisions select your future training examples (e.g. greedily not bidding on new websites in programmatic advertising will never allow you to acquire data about them). (via Peter Tegelaar)
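
The distinction in the first note between a nondeterministic and a stationary environment can be made concrete with a small sketch. Everything here is an assumption made for illustration: the state names, the single action `"go"` and the 0.8/0.2 transition probabilities are hypothetical. Taking the same action in the same state can lead to different next states, yet the probabilities behind those outcomes stay fixed over time.

```python
import random
from collections import Counter

# Hypothetical transition model P(next_state | state, action).
# It never changes while the agent learns, which is what "stationary" means.
TRANSITIONS = {
    ("s0", "go"): [("s1", 0.8), ("s0", 0.2)],
}


def step(state, action, rng):
    """Sample one next state; repeated calls may return different results."""
    outcomes = TRANSITIONS[(state, action)]
    next_states = [s for s, _ in outcomes]
    weights = [p for _, p in outcomes]
    return rng.choices(next_states, weights=weights, k=1)[0]


def empirical_frequencies(state, action, n, rng):
    """Estimate the transition probabilities from n sampled steps."""
    counts = Counter(step(state, action, rng) for _ in range(n))
    return {s: c / n for s, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # Two separate batches of experience give similar estimates,
    # because the underlying probabilities do not drift.
    print(empirical_frequencies("s0", "go", 10_000, rng))
    print(empirical_frequencies("s0", "go", 10_000, rng))
```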