3. Reinforcement Learning
● Learning from interaction!
○ Driving a car,
○ Holding a conversation,
● Goal-directed approach
○ Closed-loop,
○ Reward oriented,
4. Reinforcement vs. Unsupervised Learning
● Unsupervised learning: finds hidden structures!
● ... in unlabeled data!
● RL does not rely on uncovering structure,
● RL maximizes a reward signal instead!
5. Exploration vs. Exploitation Dilemma
● Exploit known actions to obtain rewards!
● Explore new actions to perform better in the future!
● Neither Exploration nor Exploitation alone suffices!
● Closest to the human and animal learning!
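The dilemma above is commonly handled with an ε-greedy rule: exploit the best-known action most of the time, explore a random one with small probability ε. A minimal sketch in a bandit-style setting (the function names and the running-average update are illustrative, not from the slides):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))              # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

def update(q_values, counts, action, reward):
    """Incremental running average of observed rewards per action."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]
```

With ε = 0 the rule is purely greedy; raising ε trades immediate reward for information about the other actions.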
6. Examples
● Mobile Robot
○ Search for more trash,
○ Or find its way back to the battery-charging station,
● Adaptive Controller for Petrol Refinery
○ Optimize yield/cost/quality,
○ Specified marginal costs,
7. Agent & Environment
● Policy,
○ Mapping from states to actions,
● Reward,
○ Immediate signal; pain, pleasure,
● Value Function,
○ Farsighted judgement of states,
● Model,
○ Mimics the environment, enables planning,
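The four elements above can be sketched as plain tabular structures; the tiny two-state problem and its names here are hypothetical, chosen only to show how the pieces fit together:

```python
# Hypothetical tabular agent components for a two-state problem.
policy = {"s0": "right", "s1": "right"}   # policy: mapping state -> action
reward = {("s1", "right"): 1.0}           # reward: immediate signal per (state, action)
value = {"s0": 0.9, "s1": 1.0}            # value function: farsighted judgement of states
model = {("s0", "right"): "s1"}           # model: predicts the next state (mimics environment)

state = "s0"
action = policy[state]                    # act from the policy
next_state = model[(state, action)]       # the model mimics the environment: "s1"
r = reward.get((state, action), 0.0)      # 0.0 here; 1.0 when acting from "s1"
```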
8. Pick and Place Robot
Actions:
Voltages applied to the motors,
States:
Latest readings of joint angles and velocities,
Reward:
+1 for each successful pick-up, computed in the environment!
9. Goals & Markov Decision Process
Goals:
Maximize the cumulative reward in the long run!
Markov Decision Process:
A state retaining all relevant information satisfies the Markov Property!
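The Markov Property on the slide can be written out: the next state and reward depend only on the current state and action, not on the full history,

```latex
\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t, A_t\}
  = \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, R_1, \ldots, S_t, A_t\}
```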
10. Markov Decision Process ctd.
A task is a finite MDP if,
● The state and action spaces are finite,
● The environment satisfies the Markov property,
Example: Recycling Robot
● Actively search for a can,
● Remain still and wait for a can,
● Go back to station,
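The recycling robot above can be written down as a small finite MDP over battery levels. The probabilities and rewards below are illustrative placeholders, not values from the slides:

```python
# States: battery "high" or "low"; actions as on the slide.
# (state, action) -> list of (probability, next_state, reward)
ALPHA, BETA = 0.9, 0.6        # assumed probabilities of keeping charge while searching
R_SEARCH, R_WAIT = 2.0, 1.0   # assumed rewards for searching / waiting

mdp = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],  # ran flat: penalty
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outgoing transition probabilities sum to one.
for transitions in mdp.values():
    assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9
```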
16. Monte Carlo Methods
● Used in an algorithm that mimics policy iteration,
○ Policy Evaluation,
■ Average the observed returns for each (s, a) over episodes ==> Q
○ Policy Improvement,
■ Next policy from Q (Greedy Policy),
● Given s, the new policy selects the a that maximizes Q(s, · )
● Works in episodic problems ONLY!
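The evaluation/improvement loop above can be sketched as every-visit Monte Carlo control. The one-state episodic task and all names here are hypothetical; returns are undiscounted for simplicity:

```python
import random
from collections import defaultdict

def mc_control(generate_episode, actions, num_episodes=1000, epsilon=0.1):
    """Mimic policy iteration: average returns per (s, a) into Q,
    then improve the policy greedily w.r.t. Q (episodic tasks only)."""
    Q = defaultdict(float)   # (s, a) -> average return
    N = defaultdict(int)     # (s, a) -> visit count
    policy = {}              # s -> greedy action
    for _ in range(num_episodes):
        episode = generate_episode(policy, epsilon)  # [(s, a, r), ...]
        G = 0.0
        for s, a, r in reversed(episode):
            G += r  # undiscounted return
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]           # averages ==> Q
            policy[s] = max(actions, key=lambda b: Q[(s, b)])  # greedy improvement
    return Q, policy

def toy_episode(policy, epsilon, actions=("left", "right")):
    """Hypothetical one-state task: "right" ends the episode with +1,
    "left" costs -0.1 and the episode continues (capped at 10 steps)."""
    s, episode = "s0", []
    for _ in range(10):
        if s not in policy or random.random() < epsilon:
            a = random.choice(actions)   # explore
        else:
            a = policy[s]                # exploit
        r = 1.0 if a == "right" else -0.1
        episode.append((s, a, r))
        if a == "right":
            break
    return episode
```

Because returns are only available once an episode terminates, this scheme needs episodic problems, matching the last bullet above.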