My	Robot	Can	Learn	
Using	Reinforcement	Learning	
to	Teach	my	Robot
Marcel	Tilly
Senior	Program	Manager
Microsoft	AI	and	Research
Once upon a time…
Agenda
• Context	for	Reinforcement	Learning
• Motivation	for	Reinforcement	Learning
• The	Reinforcement	Learning	Problem
• Aspects	of	an	RL	Agent
• Samples	for	Reinforcement	Learning
Reinforcement	Learning	Applications
RL application areas (survey by Csaba Szepesvári of 77 recent application papers, based on an IEEE search for the keywords "RL" and "application"):
Process Control 23% · Networking 21% · Resource Management 18% · Robotics 13% · Other 8% · Autonomic Computing 6% · Traffic 6% · Finance 4%
signal processing
natural language processing
web services
brain-computer interfaces
aircraft control
engine control
bio/chemical reactors
sensor networks
routing
call admission control
network resource management
power systems
inventory control
supply chains
customer service
mobile robots, motion control, Robocup, vision; stoplight control, trains, unmanned vehicles
load balancing
memory management
algorithm tuning
option pricing
asset management
Rich Sutton: Deconstructing Reinforcement Learning. ICML 2009
Just	some	useless	information…
Facets	of	Reinforcement	Learning
Reinforcement learning sits at the intersection of several fields, each with its own name for the same problem:
• Computer Science: Machine Learning
• Neuroscience: Reward System
• Psychology: Classical/Operant Conditioning
• Economics: Bounded Rationality
• Mathematics: Operations Research
• Engineering: Optimal Control
Machine	Learning
We	can	answer	the	4	major	questions:
• How	much/How	many?
• Which	category?
• Which	groups?	[What	is	wrong?]
• Which	action?
How	much/	How	many
• What	will	be	the	temperature	
next	Thursday?
• What	will	be	my	energy	costs	
next	month?
• How many new users will I get?
→ Regression
Which	category?
• Is	there	a	cat	or	a	dog	on	the	
image?
• Which	machine	failure	is	causing	
the	significant	data	signature?
• What	is	the	topic/sentiment	of	
this	news	article?
→ Classification
Which	groups?
• Which customers have similar tastes?
• Which visitors like the same movies?
• Which	topics	can	I	extract	from	
the	document?
• Which	data	does	not	fit	nicely	in	
what	I	have	seen	so	far?
→ Clustering / Anomaly Detection
Which	action?
• Should I raise or lower the temperature?
• Should	I	clean	the	living	room	
or	should	I	stay	plugged?
• Should	I	brake	or	accelerate?
• What	is	the	next	move	for	this	
Go	match?
→ Reinforcement Learning
Machine	Learning
• Supervised learning: learning by example
• Unsupervised learning: you do not know what is in your data
• Reinforcement Learning: learning by trial and error (using function approximation)
• Related: Semi-Supervised Learning, Active Learning
Characteristics	of	RL
Why	is	RL	really	different?
• There	is	no	supervisor,	only	a	reward	signal
• Feedback	is	delayed,	not	instantaneous
• Time	really	matters
• Agent's actions affect the subsequent data it receives
Examples	for	Reinforcement	Learning
• Fly stunt manoeuvres with a helicopter
• Recommend	restaurants	to	users
• Optimize	online	music	store
• Control	a	house
• Control	a	power	station
• Make a humanoid robot walk
• Play	games	better	than	humans
• Make	a	bot	have	a	conversation	like	a	human
What	is	Reinforcement	Learning?
"… the idea of a learning system that wants something. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning."
• Agents take actions (A) in an environment and receive rewards (R)
• Goal is to find the policy (π) that maximizes rewards
• Inspired by research into psychology and animal learning
Definition
Sutton,	Barto
Agent	and	Environment
At each step the agent:
• Executes action At
• Receives observation Ot
• Receives scalar reward Rt
The environment:
• Receives action At
• Emits observation Ot+1
• Emits scalar reward Rt+1
Approaches:
• MDP, POMDP
• Multi-armed bandit
[Diagram: the agent–environment loop — the agent sends action At to the environment; the environment returns observation Ot and reward Rt]
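A minimal Python sketch of this loop, with a hypothetical temperature environment standing in for the real world (the class and the reward rule below are invented for illustration):

```python
import random

class TemperatureEnvironment:
    """Hypothetical environment: a room that heats up unless the cooler is on."""
    def __init__(self, temp=90):
        self.temp = temp

    def step(self, action):
        # Receives action A_t, emits observation O_{t+1} and scalar reward R_{t+1}.
        self.temp += -2 if action == "on" else +2
        observation = self.temp
        reward = 1 if 88 <= self.temp <= 92 else -1   # "good" vs. "bad"
        return observation, reward

env = TemperatureEnvironment()
for t in range(10):
    action = random.choice(["on", "off"])        # the agent executes action A_t
    observation, reward = env.step(action)       # and receives O_{t+1}, R_{t+1}
    print(t, action, observation, reward)
```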
History	and	State
• The history is the sequence of observations, actions, and rewards
• i.e. all observable variables up to time t
• i.e. the sensorimotor stream of a robot or embodied agent
• What happens next depends on the history:
• The agent selects actions
• The environment selects observations and rewards
• State is the information used to select the next action
• Formally, state is a function of the history:
Ht = O1, R1, A1, …, At−1, Ot, Rt
St = f(Ht)
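A toy illustration of St = f(Ht): the history is the full stream of observations, rewards, and actions, and one common (Markov) choice of state function keeps only the latest observation. The values below are made up for illustration.

```python
# History H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t as a list of (O, R, A) steps.
history = [
    (90, 0, "on"),   # O_1, R_1, A_1
    (88, 0, "on"),   # O_2, R_2, A_2
    (86, 1, None),   # O_3, R_3 (no action chosen yet)
]

def markov_state(history):
    """One possible state function f(H_t): use only the latest observation."""
    latest_observation, _, _ = history[-1]
    return latest_observation

print(markov_state(history))  # -> 86
```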
Short	RL	Experiment
?
Reinforcement	Learning	on	the	Lego	
Mindstorms NXT	Robot
Taken	from:	https://www.youtube.com/watch?v=WF9QWc_lxfM&t=17s
Components	of	an	RL	agent
An	RL	agent	may	include	one	or	more	of	these	components:
• Policy:	agent's	behavior	function	
• Maps	from	state	to	action
• Deterministic policy: A = π(S)
• Stochastic policy: π(A|S) = ℙ[A|S]
• Value	function:	how	good	is	each	state	and/or	action
• How much reward will I get from this action?
• Optimal value function (Bellman optimality equation):
Q*(S, A) = 𝔼S′ [ R + γ maxA′ Q*(S′, A′) | S, A ]
• Model:	agent's	representation	of	the	environment
[Diagram: policy π maps state S to action A; value function Q maps (S, A) to a value V; model (T, R) maps (S, A) to next state S′ and reward R]
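Value-based methods turn the optimality equation above into an update rule. A minimal tabular Q-learning sketch is shown below; the environment interface (reset/step/actions) and the hyper-parameter values are assumptions for illustration, not part of any particular library.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(S,A) += alpha * (R + gamma * max_A' Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()                       # assumed environment interface
        done = False
        while not done:
            # epsilon-greedy policy derived from the current value estimates
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```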
Approaches	To	Reinforcement	Learning
• Value-based	RL
• Estimate	the	optimal	value	function	Q*(S,A)
• This	is	the	maximum	value	achievable	under	any	policy
• Policy-based	RL
• Search	directly	for	the	optimal	policy	𝜋*
• This	is	the	policy	achieving	maximum	future	reward
• Model-based	RL
• Build	a	model	of	the	environment
• Plan	(e.g.	by	lookahead)	using	model
• Use	deep	neural	networks	to	represent	them	->	DeepRL
Grid	World:	Rewards	and	Goals
Sample:	Process	Control
[Diagram: environment loop for a cooling system — Action (on | off), Observation (Temp = n), Reward (good | bad)]
How	could	it	work?
Temp before (Ot) | Cooler (Action) | Temp after (Ot+1) | Opportunities | Observations | Probability (Reward?)
90 | on  | 80 | 1 | 0 | 0
90 | on  | 82 | 1 | 1 | 1
90 | on  | 84 | 1 | 0 | 0
90 | on  | 86 | 1 | 0 | 0
90 | on  | 88 | 1 | 0 | 0
90 | on  | 90 | 1 | 0 | 0
90 | off | 88 | 1 | 0 | 0
90 | off | 90 | 1 | 0 | 0
90 | off | 92 | 1 | 1 | 1
90 | off | 94 | 1 | 0 | 0
90 | off | 96 | 1 | 0 | 0
90 | off | 98 | 1 | 0 | 0
The	result:	A	model
Temp before | Cooler (Action) | Temp after | Opportunities | Observations | Probability
90 | on  | 80 | 404 | 10  | 0.025
90 | on  | 82 | 404 | 134 | 0.332
90 | on  | 84 | 404 | 215 | 0.532
90 | on  | 86 | 404 | 34  | 0.084
90 | on  | 88 | 404 | 9   | 0.022
90 | on  | 90 | 404 | 2   | 0.005
90 | off | 88 | 381 | 1   | 0.003
90 | off | 90 | 381 | 23  | 0.059
90 | off | 92 | 381 | 101 | 0.261
90 | off | 94 | 381 | 163 | 0.421
90 | off | 96 | 381 | 75  | 0.194
90 | off | 98 | 381 | 24  | 0.062
Now: Take it backward (St → A → St+1)
The model table above can be read in reverse: given the current temperature St = 90 and a desired temperature St+1, pick the action (on or off) whose row assigns the highest probability to reaching that temperature.
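A rough Python sketch of this idea: count how often each (temp before, action) pair led to each temp after, turn the counts into probabilities, and then read the model backward to pick the action most likely to reach the desired temperature. The logged transitions below are invented for illustration.

```python
from collections import Counter, defaultdict

# (temp_before, action, temp_after) transitions, e.g. logged from the cooler.
transitions = [(90, "on", 84), (90, "on", 82), (90, "off", 94), (90, "off", 92)]

# Forward: count opportunities and observations, then estimate probabilities.
opportunities = Counter((t0, a) for t0, a, _ in transitions)
observations = Counter(transitions)
model = defaultdict(dict)
for (t0, a, t1), count in observations.items():
    model[(t0, a)][t1] = count / opportunities[(t0, a)]   # the Probability column

# Backward: given S_t and a desired S_{t+1}, choose the action most likely to get there.
def choose_action(temp_before, desired_temp_after, actions=("on", "off")):
    return max(actions, key=lambda a: model[(temp_before, a)].get(desired_temp_after, 0.0))

print(choose_action(90, 84))  # -> "on"
```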
How	to	do	it	with	a	Mindstorms Robot?
https://www.youtube.com/watch?v=WF9QWc_lxfM&t=17s
Angel	Martinez-Tenor:	Reinforcement	Learning	on	the	Lego	Mindstorms NXT	Robot.
Sample:	Atari	Games
David Silver (DeepMind):
Applying RL to Atari games, trying to play better than a human.
[Diagram: the same agent–environment loop — action At, observation Ot, reward Rt — with the Atari emulator as the environment]
An	example	for	DeepRL with	Atari
• End-to-end learning of values Q(S, A) from pixels S
• Input	state	S is	stack	of	raw	pixels	from	last	4	
frames
• Output	is	Q(S,A) for	18	joystick/button	positions
• Reward	is	change	in	score	for	that	step
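A minimal PyTorch sketch of a network with exactly this input/output shape (a stack of 4 raw 84×84 frames in, 18 action values out); the layer sizes follow the commonly published DQN architecture and are not taken from this slide.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q(S, A) from pixels: input is a stack of the last 4 frames, output is one Q-value per action."""
    def __init__(self, n_actions=18):   # 18 joystick/button positions
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):          # frames: (batch, 4, 84, 84) grayscale pixels
        return self.net(frames)

q_values = DQN()(torch.zeros(1, 4, 84, 84))   # -> tensor of shape (1, 18)
```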
Project	Malmo	@	MSR
• Makes	(deep)	reinforcement	learning	available	as	a	platform	
• Code	that	helps	artificial	intelligence	agents	sense	and	act	
within	the	Minecraft	environment
• The	two	components	can	run	on	Windows,	Linux,	or	Mac	OS
• Write	your	agent	in	Python,	Lua,	C#,	C++	or	Java
Sneak	Preview
Try	it	today:	https://github.com/Microsoft/malmo#getting-started
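As a rough sketch of what an agent loop looks like, modeled on the patterns in the Malmo Python tutorials; the mission XML file and the timing are placeholders, so see the getting-started samples above for working code.

```python
import time
import MalmoPython  # ships with the Malmo platform

agent_host = MalmoPython.AgentHost()
mission = MalmoPython.MissionSpec(open("mission.xml").read(), True)   # assumed mission file
agent_host.startMission(mission, MalmoPython.MissionRecordSpec())

world_state = agent_host.getWorldState()
while world_state.is_mission_running:
    agent_host.sendCommand("move 1")             # act in the Minecraft environment
    time.sleep(0.5)
    world_state = agent_host.getWorldState()     # sense: observations and rewards
    for reward in world_state.rewards:
        print("reward:", reward.getValue())
```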
…	there	is	one	more	thing
Watch	this:
Wrap-up
• RL	could	become	the	next	star	in	ML
• More	storage	space
• More	compute	power
• Applications	in	IoT,	autonomous	driving,	process	control
• Good	foundation	research
• Convincing	prototypes	and	applications
→ Focus shift
David	Silver
"Reinforcement Learning + Deep Learning = AI"
Books
• Sutton and Barto, "Reinforcement Learning: An Introduction" (1998)
• H.M. Schwartz, "Multi-Agent Machine Learning: A Reinforcement Approach" (2014)
• Csaba Szepesvári, "Algorithms for Reinforcement Learning" (2010)
References
• Some content is reused from:
• Introduction to Reinforcement Learning – Shane M. Conway
• Lecture 1: Introduction to Reinforcement Learning – David Silver
• How reinforcement learning works in Becca 7 – Brandon Rohrer
• Johnson M., Hofmann K., Hutton T., Bignell D. (2016) The Malmo Platform for Artificial Intelligence Experimentation. Proc. 25th International Joint Conference on Artificial Intelligence, Ed. Kambhampati S., p. 4246. AAAI Press, Palo Alto, California, USA. https://github.com/Microsoft/malmo
Thanks!
marcel.tilly@microsoft.com
