My	Robot	Can	Learn	
Using	Reinforcement	Learning	
to	Teach	my	Robot
Marcel	Tilly
Senior	Program	Manager
Microsoft	AI	and	Research
Once upon a time…
Agenda
• Context	for	Reinforcement	Learning
• Motivation	for	Reinforcement	Learning
• The	Reinforcement	Learning	Problem
• Aspects	of	an	RL	Agent
• Samples	for	Reinforcement	Learning
Reinforcement	Learning	Applications
RL application areas (survey by Csaba Szepesvári of 77 recent application papers, based on an IEEE search for the keywords "RL" and "application"):
Process Control 23% · Networking 21% · Resource Management 18% · Robotics 13% · Other 8% · Autonomic Computing 6% · Traffic 6% · Finance 4%
signal processing
natural language processing
web services
brain-computer interfaces
aircraft control
engine control
bio/chemical reactors
sensor networks
routing
call admission control
network resource management
power systems
inventory control
supply chains
customer service
mobile robots, motion control, Robocup, vision; stoplight control, trains, unmanned vehicles
load balancing
memory management
algorithm tuning
option pricing
asset management
Rich Sutton: Deconstructing Reinforcement Learning. ICML 2009
Just	some	useless	information…
Facets	of	Reinforcement	Learning
Reinforcement learning sits at the intersection of several fields, each with its own name for the same problem:
• Computer Science: Machine Learning
• Neuroscience: Reward System
• Psychology: Classical/Operant Conditioning
• Economics: Bounded Rationality
• Mathematics: Operations Research
• Engineering: Optimal Control
Machine	Learning
We	can	answer	the	4	major	questions:
• How	much/How	many?
• Which	category?
• Which	groups?	[What	is	wrong?]
• Which	action?
How	much/	How	many
• What	will	be	the	temperature	
next	Thursday?
• What	will	be	my	energy	costs	
next	month?
• How many new users will I get?
→ Regression
Which	category?
• Is	there	a	cat	or	a	dog	on	the	
image?
• Which	machine	failure	is	causing	
the	significant	data	signature?
• What	is	the	topic/sentiment	of	
this	news	article?
→ Classification
Which	groups?
• Which customers have similar tastes?
• Which visitors like the same movies?
• Which	topics	can	I	extract	from	
the	document?
• Which	data	does	not	fit	nicely	in	
what	I	have	seen	so	far?
→ Clustering / Anomaly Detection
Which	action?
• Should I raise or lower the temperature?
• Should	I	clean	the	living	room	
or	should	I	stay	plugged?
• Should	I	brake	or	accelerate?
• What	is	the	next	move	for	this	
Go	match?
→ Reinforcement Learning
Machine	Learning
• Supervised learning: learning by example
• Unsupervised learning: you do not know what is in your data
• Reinforcement Learning: learning by trial and error (using function approximation)
• Related: Semi-Supervised Learning, Active Learning
Characteristics	of	RL
Why	is	RL	really	different?
• There	is	no	supervisor,	only	a	reward	signal
• Feedback	is	delayed,	not	instantaneous
• Time	really	matters
• Agent's actions affect the subsequent data it receives
Examples	for	Reinforcement	Learning
• Fly stunt manoeuvres with a helicopter
• Recommend	restaurants	to	users
• Optimize	online	music	store
• Control	a	house
• Control	a	power	station
• Make a humanoid robot walk
• Play	games	better	than	humans
• Make	a	bot	have	a	conversation	like	a	human
What	is	Reinforcement	Learning?
"… the idea of a learning system that wants something. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning."
• Agents take actions (A) in an environment and receive rewards (R)
• Goal is to find the policy (π) that maximizes rewards
• Inspired by research into psychology and animal learning
Definition
Sutton,	Barto
Agent	and	Environment
At each step the agent:
• Executes action At
• Receives observation Ot
• Receives scalar reward Rt
The environment:
• Receives action At
• Emits observation Ot+1
• Emits scalar reward Rt+1
Approaches:
• MDP, POMDP
• Multi-armed bandit
[Diagram: the agent–environment loop — the agent sends action At to the environment; the environment returns observation Ot and reward Rt]
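A minimal Python sketch of this loop, with a hypothetical temperature environment standing in for the real world (the class and the reward rule below are invented for illustration):

```python
import random

class TemperatureEnvironment:
    """Hypothetical environment: a room that heats up unless the cooler is on."""
    def __init__(self, temp=90):
        self.temp = temp

    def step(self, action):
        # Receives action A_t, emits observation O_{t+1} and scalar reward R_{t+1}.
        self.temp += -2 if action == "on" else +2
        observation = self.temp
        reward = 1 if 88 <= self.temp <= 92 else -1   # "good" vs. "bad"
        return observation, reward

env = TemperatureEnvironment()
for t in range(10):
    action = random.choice(["on", "off"])        # the agent executes action A_t
    observation, reward = env.step(action)       # and receives O_{t+1}, R_{t+1}
    print(t, action, observation, reward)
```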
History	and	State
• The history is the sequence of observations, actions, and rewards
• i.e. all observable variables up to time t
• i.e. the sensorimotor stream of a robot or embodied agent
• What happens next depends on the history:
• The agent selects actions
• The environment selects observations and rewards
• State is the information used to select the next action
• Formally, state is a function of the history:
Ht = O1, R1, A1, …, At−1, Ot, Rt
St = f(Ht)
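A toy illustration of St = f(Ht): the history is the full stream of observations, rewards, and actions, and one common (Markov) choice of state function keeps only the latest observation. The values below are made up for illustration.

```python
# History H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t as a list of (O, R, A) steps.
history = [
    (90, 0, "on"),   # O_1, R_1, A_1
    (88, 0, "on"),   # O_2, R_2, A_2
    (86, 1, None),   # O_3, R_3 (no action chosen yet)
]

def markov_state(history):
    """One possible state function f(H_t): use only the latest observation."""
    latest_observation, _, _ = history[-1]
    return latest_observation

print(markov_state(history))  # -> 86
```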
Short	RL	Experiment
?
Reinforcement	Learning	on	the	Lego	
Mindstorms NXT	Robot
Taken	from:	https://www.youtube.com/watch?v=WF9QWc_lxfM&t=17s
Components	of	an	RL	agent
An	RL	agent	may	include	one	or	more	of	these	components:
• Policy:	agent's	behavior	function	
• Maps	from	state	to	action
• Deterministic policy: A = π(S)
• Stochastic policy: π(A|S) = ℙ[A|S]
• Value	function:	how	good	is	each	state	and/or	action
• How much reward will I get from this action?
• Optimal value function (Bellman optimality equation):
Q*(S, A) = 𝔼S′ [ R + γ maxA′ Q*(S′, A′) | S, A ]
• Model:	agent's	representation	of	the	environment
[Diagram: policy π maps state S to action A; value function Q maps (S, A) to a value V; model (T, R) maps (S, A) to next state S′ and reward R]
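Value-based methods turn the optimality equation above into an update rule. A minimal tabular Q-learning sketch is shown below; the environment interface (reset/step/actions) and the hyper-parameter values are assumptions for illustration, not part of any particular library.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(S,A) += alpha * (R + gamma * max_A' Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()                       # assumed environment interface
        done = False
        while not done:
            # epsilon-greedy policy derived from the current value estimates
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```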
Approaches	To	Reinforcement	Learning
• Value-based	RL
• Estimate	the	optimal	value	function	Q*(S,A)
• This	is	the	maximum	value	achievable	under	any	policy
• Policy-based	RL
• Search	directly	for	the	optimal	policy	𝜋*
• This	is	the	policy	achieving	maximum	future	reward
• Model-based	RL
• Build	a	model	of	the	environment
• Plan	(e.g.	by	lookahead)	using	model
• Use	deep	neural	networks	to	represent	them	->	DeepRL
Grid	World:	Rewards	and	Goals
Sample:	Process	Control
[Diagram: environment loop for a cooling system — Action (on | off), Observation (Temp = n), Reward (good | bad)]
How	could	it	work?
Temp before (Ot) | Cooler (Action) | Temp after (Ot+1) | Opportunities | Observations | Probability (Reward?)
90 | on  | 80 | 1 | 0 | 0
90 | on  | 82 | 1 | 1 | 1
90 | on  | 84 | 1 | 0 | 0
90 | on  | 86 | 1 | 0 | 0
90 | on  | 88 | 1 | 0 | 0
90 | on  | 90 | 1 | 0 | 0
90 | off | 88 | 1 | 0 | 0
90 | off | 90 | 1 | 0 | 0
90 | off | 92 | 1 | 1 | 1
90 | off | 94 | 1 | 0 | 0
90 | off | 96 | 1 | 0 | 0
90 | off | 98 | 1 | 0 | 0
The	result:	A	model
Temp before | Cooler (Action) | Temp after | Opportunities | Observations | Probability
90 | on  | 80 | 404 | 10  | 0.025
90 | on  | 82 | 404 | 134 | 0.332
90 | on  | 84 | 404 | 215 | 0.532
90 | on  | 86 | 404 | 34  | 0.084
90 | on  | 88 | 404 | 9   | 0.022
90 | on  | 90 | 404 | 2   | 0.005
90 | off | 88 | 381 | 1   | 0.003
90 | off | 90 | 381 | 23  | 0.059
90 | off | 92 | 381 | 101 | 0.261
90 | off | 94 | 381 | 163 | 0.421
90 | off | 96 | 381 | 75  | 0.194
90 | off | 98 | 381 | 24  | 0.062
Now: Take it backward (St → A → St+1)
The model table above can be read in reverse: given the current temperature St = 90 and a desired temperature St+1, pick the action (on or off) whose row assigns the highest probability to reaching that temperature.
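A rough Python sketch of this idea: count how often each (temp before, action) pair led to each temp after, turn the counts into probabilities, and then read the model backward to pick the action most likely to reach the desired temperature. The logged transitions below are invented for illustration.

```python
from collections import Counter, defaultdict

# (temp_before, action, temp_after) transitions, e.g. logged from the cooler.
transitions = [(90, "on", 84), (90, "on", 82), (90, "off", 94), (90, "off", 92)]

# Forward: count opportunities and observations, then estimate probabilities.
opportunities = Counter((t0, a) for t0, a, _ in transitions)
observations = Counter(transitions)
model = defaultdict(dict)
for (t0, a, t1), count in observations.items():
    model[(t0, a)][t1] = count / opportunities[(t0, a)]   # the Probability column

# Backward: given S_t and a desired S_{t+1}, choose the action most likely to get there.
def choose_action(temp_before, desired_temp_after, actions=("on", "off")):
    return max(actions, key=lambda a: model[(temp_before, a)].get(desired_temp_after, 0.0))

print(choose_action(90, 84))  # -> "on"
```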
How	to	do	it	with	a	Mindstorms Robot?
https://www.youtube.com/watch?v=WF9QWc_lxfM&t=17s
Angel	Martinez-Tenor:	Reinforcement	Learning	on	the	Lego	Mindstorms NXT	Robot.
Sample:	Atari	Games
David Silver (DeepMind):
Applying RL to Atari games, trying to play better than a human.
[Diagram: the same agent–environment loop — action At, observation Ot, reward Rt — with the Atari emulator as the environment]
An	example	for	DeepRL with	Atari
• End-to-end learning of values Q(S, A) from pixels S
• Input	state	S is	stack	of	raw	pixels	from	last	4	
frames
• Output	is	Q(S,A) for	18	joystick/button	positions
• Reward	is	change	in	score	for	that	step
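A minimal PyTorch sketch of a network with exactly this input/output shape (a stack of 4 raw 84×84 frames in, 18 action values out); the layer sizes follow the commonly published DQN architecture and are not taken from this slide.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q(S, A) from pixels: input is a stack of the last 4 frames, output is one Q-value per action."""
    def __init__(self, n_actions=18):   # 18 joystick/button positions
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):          # frames: (batch, 4, 84, 84) grayscale pixels
        return self.net(frames)

q_values = DQN()(torch.zeros(1, 4, 84, 84))   # -> tensor of shape (1, 18)
```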
Project	Malmo	@	MSR
• Makes	(deep)	reinforcement	learning	available	as	a	platform	
• Code	that	helps	artificial	intelligence	agents	sense	and	act	
within	the	Minecraft	environment
• The	two	components	can	run	on	Windows,	Linux,	or	Mac	OS
• Write	your	agent	in	Python,	Lua,	C#,	C++	or	Java
Sneak	Preview
Try	it	today:	https://github.com/Microsoft/malmo#getting-started
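As a rough sketch of what an agent loop looks like, modeled on the patterns in the Malmo Python tutorials; the mission XML file and the timing are placeholders, so see the getting-started samples above for working code.

```python
import time
import MalmoPython  # ships with the Malmo platform

agent_host = MalmoPython.AgentHost()
mission = MalmoPython.MissionSpec(open("mission.xml").read(), True)   # assumed mission file
agent_host.startMission(mission, MalmoPython.MissionRecordSpec())

world_state = agent_host.getWorldState()
while world_state.is_mission_running:
    agent_host.sendCommand("move 1")             # act in the Minecraft environment
    time.sleep(0.5)
    world_state = agent_host.getWorldState()     # sense: observations and rewards
    for reward in world_state.rewards:
        print("reward:", reward.getValue())
```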
…	there	is	one	more	thing
Watch	this:
Wrap-up
• RL	could	become	the	next	star	in	ML
• More	storage	space
• More	compute	power
• Applications	in	IoT,	autonomous	driving,	process	control
• Good	foundation	research
• Convincing	prototypes	and	applications
→ Focus shift
David	Silver
"Reinforcement Learning + Deep Learning = AI"
Books
• Sutton and Barto, "Reinforcement Learning: An Introduction" (1998)
• H.M. Schwartz, "Multi-Agent Machine Learning: A Reinforcement Approach" (2014)
• Csaba Szepesvári, "Algorithms for Reinforcement Learning" (2010)
References
• Some content is reused from:
• Introduction to Reinforcement Learning – Shane M. Conway
• Lecture 1: Introduction to Reinforcement Learning – David Silver
• How reinforcement learning works in Becca 7 – Brandon Rohrer
• Johnson M., Hofmann K., Hutton T., Bignell D. (2016) The Malmo Platform for Artificial Intelligence Experimentation. Proc. 25th International Joint Conference on Artificial Intelligence, Ed. Kambhampati S., p. 4246. AAAI Press, Palo Alto, California, USA. https://github.com/Microsoft/malmo
Thanks!
marcel.tilly@microsoft.com
