Deep Reinforcement Learning
Lecture 1: Motivation + Overview + MDPs + Exact Solution Methods

Pieter Abbeel
OpenAI / UC Berkeley / Gradescope

Many slides made with John Schulman, Yan (Rocky) Duan and Xi (Peter) Chen
Organizers	/	Instructors	/	TAs
Logistics
- Laptop setup for labs
  - Check your email!
- Communication with staff
  - In-person
  - Piazza
- Help us out! :)
- Laptops
  - Stay charged
- Tomorrow
  - Make sure to have your badges!
Bootcamp Goals
- Understand mathematical and algorithmic foundations of Deep RL
- Have implemented many of the core algorithms
Schedule -- Saturday
8:30: Breakfast / Check-in
9-10: Core Lecture 1: Intro to MDPs and Exact Solution Methods (Pieter Abbeel)
10-10:10: Brief Sponsor Intros
10:10-11:10: Core Lecture 2: Sample-based Approximations and Fitted Learning (Rocky Duan)
11:10-11:30: Coffee Break
11:30-12:30: Core Lecture 3: DQN + variants (Vlad Mnih)
12:30-1:30: Lunch [catered]
1:30-2:15: Core Lecture 4a: Policy Gradients and Actor Critic (Pieter Abbeel)
2:15-3:00: Core Lecture 4b: Pong from Pixels (Andrej Karpathy)
3:00-3:30: Core Lecture 5: Natural Policy Gradients, TRPO, and PPO (John Schulman)
3:30-3:50: Coffee Break
3:50-4:30: Core Lecture 6: Nuts and Bolts of Deep RL Experimentation (John Schulman)
4:30-7: Labs 1-2-3
7-8: Frontiers Lecture I: Recent Advances, Frontiers and Future of Deep RL (Vlad Mnih)
Schedule -- Sunday
8:30: Breakfast
9-10: Core Lecture 7: SVG, DDPG, and Stochastic Computation Graphs (John Schulman)
10-11: Core Lecture 8: Derivative-free Methods (Peter Chen)
11-11:30: Coffee Break
11:30-12:30: Core Lecture 9: Model-based RL (Chelsea Finn)
12:30-1:30: Lunch [catered]
1:30-2:30: Core Lecture 10: Utilities / Inverse RL (Pieter Abbeel / Chelsea Finn)
2:30-3:10: Two-minute Presentations by each TA
3:10-3:30: Coffee Break
3:30-6: Labs 4-5
6-7: Frontiers Lecture II: Recent Advances, Frontiers and Future of Deep RL (Sergey Levine)
Markov Decision Process
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]
Assumption: agent gets to observe the state
Some Reinforcement Learning Success Stories
Kohl and Stone, 2004; Ng et al, 2004; Tedrake et al, 2005; Kober and Peters, 2009;
Mnih et al, 2013 (DQN); Silver et al, 2014 (DPG); Lillicrap et al, 2015 (DDPG);
Schulman et al, 2016 (TRPO + GAE); Levine*, Finn*, et al, 2016 (GPS);
Mnih et al, 2016 (A3C); Silver*, Huang*, et al, 2016 (AlphaGo)
RL Algorithms Landscape
Markov Decision Process (MDP)
An MDP is defined by:
- Set of states S
- Set of actions A
- Transition function P(s' | s, a)
- Reward function R(s, a, s')
- Start state s0
- Discount factor γ
- Horizon H
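To make these ingredients concrete, here is a minimal sketch (not from the lecture) of how a small tabular MDP could be stored as NumPy arrays; the sizes and reward placement are made up for illustration.

```python
# Illustrative sketch: a tiny tabular MDP stored as NumPy arrays.
import numpy as np

n_states, n_actions = 4, 2
gamma, horizon = 0.9, 100          # discount factor and horizon H

# P[s, a, s'] = transition probability P(s' | s, a); each P[s, a, :] sums to 1
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a, s'] = reward for the transition (s, a, s')
R = np.zeros((n_states, n_actions, n_states))
R[:, :, -1] = 1.0                  # e.g., reward for landing in the last state

s0 = 0                             # start state
```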
Example MDP: Gridworld
[Figure: gridworld, with the goal and an example policy π shown on the slide]
Outline
An MDP is defined by:
- Set of states S
- Set of actions A
- Transition function P(s' | s, a)
- Reward function R(s, a, s')
- Start state s0
- Discount factor γ
- Horizon H

- Optimal Control
  = given an MDP (S, A, P, R, γ, H), find the optimal policy π*

- Exact Methods:
  - Value Iteration
  - Policy Iteration

For now: small discrete state-action spaces, as they are simpler to get the main concepts across. We will consider large / continuous spaces later!
Optimal Value Function V*
= sum of discounted rewards when starting from state s and acting optimally

$V^*(s) = \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^t R(s_t, a_t, s_{t+1}) \;\Big|\; \pi, s_0 = s \Big]$
Optimal Value Function V*
= sum of discounted rewards when starting from state s and acting optimally

$V^*(s) = \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^t R(s_t, a_t, s_{t+1}) \;\Big|\; \pi, s_0 = s \Big]$

Let's assume: actions deterministically successful, gamma = 1, H = 100
V*(4,3) = 1
V*(3,3) = 1
V*(2,3) = 1
V*(1,1) = 1
V*(4,2) = -1
Optimal Value Function V*
= sum of discounted rewards when starting from state s and acting optimally

$V^*(s) = \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^t R(s_t, a_t, s_{t+1}) \;\Big|\; \pi, s_0 = s \Big]$

Let's assume: actions deterministically successful, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.9
V*(2,3) = 0.9*0.9 = 0.81
V*(1,1) = 0.9*0.9*0.9*0.9*0.9 = 0.59
V*(4,2) = -1
Optimal Value Function V*
= sum of discounted rewards when starting from state s and acting optimally

$V^*(s) = \max_\pi \; \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^t R(s_t, a_t, s_{t+1}) \;\Big|\; \pi, s_0 = s \Big]$

Let's assume: actions successful w/ probability 0.8, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.8 * 0.9 + 0.1 * 0.9 * V*(3,3) + 0.1 * 0.9 * V*(3,2)
V*(2,3) =
V*(1,1) =
V*(4,2) =
Value Iteration
- $V^*_0(s)$ = optimal value for state s when H = 0
  - $V^*_0(s) = 0 \quad \forall s$
- $V^*_1(s)$ = optimal value for state s when H = 1
  - $V^*_1(s) = \max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^*_0(s')\big)$
- $V^*_2(s)$ = optimal value for state s when H = 2
  - $V^*_2(s) = \max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^*_1(s')\big)$
- $V^*_k(s)$ = optimal value for state s when H = k
  - $V^*_k(s) = \max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^*_{k-1}(s')\big)$
Value Iteration
Algorithm:
Start with $V^*_0(s) = 0$ for all s.
For k = 1, …, H:
    For all states s in S:
        $V^*_k(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^*_{k-1}(s')\big)$
        $\pi^*_k(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^*_{k-1}(s')\big)$
This is called a value update or Bellman update/back-up
Value Iteration (gridworld demo)
Noise = 0.2, Discount = 0.9
[A sequence of slides shows the gridworld value estimates after k = 0, 1, 2, …, 12, and 100 Bellman back-ups.]
Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite horizon problem, which satisfies the Bellman equations

$V^*(s) = \max_a \sum_{s'} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^*(s')\big]$

- Now we know how to act for infinite horizon with discounted rewards!
  - Run value iteration till convergence.
  - This produces V*, which in turn tells us how to act, namely following:

    $\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^*(s')\big]$

  - Note: the infinite horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
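Once V* is in hand, the arg-max above is one line of NumPy. A hedged sketch, using the same illustrative array conventions as the value iteration sketch earlier:

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """pi*(s) = argmax_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)          # one action index per state
```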
Convergence: Intuition
- $V^*(s)$ = expected sum of rewards accumulated starting from state s, acting optimally for $\infty$ steps
- $V^*_H(s)$ = expected sum of rewards accumulated starting from state s, acting optimally for H steps
- Additional reward collected over time steps H+1, H+2, …

  $\gamma^{H+1} R(s_{H+1}) + \gamma^{H+2} R(s_{H+2}) + \ldots \;\le\; \gamma^{H+1} R_{\max} + \gamma^{H+2} R_{\max} + \ldots \;=\; \frac{\gamma^{H+1}}{1-\gamma} R_{\max}$

  goes to zero as H goes to infinity. Hence $V^*_H \xrightarrow{\;H \to \infty\;} V^*$.

For simplicity of notation, in the above it was assumed that rewards are always greater than or equal to zero. If rewards can be negative, a similar argument holds, using max |R| and bounding from both sides.
Convergence and Contractions
- Define the max-norm:
- Theorem: For any two approximations U and V
  - I.e., any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true value function, and value iteration converges to a unique, stable, optimal solution.
- Theorem:
  - I.e., once the change in our approximation is small, it must also be close to correct.
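The norm definition and the two theorem statements did not survive the slide export. In their standard form (my reconstruction, which may not match the slide's exact notation; $\mathcal{T}$ denotes the Bellman back-up operator), they read:

$\|U\|_\infty = \max_s |U(s)|, \qquad (\mathcal{T}V)(s) = \max_a \sum_{s'} P(s'|s,a)\,\big[R(s,a,s') + \gamma V(s')\big]$

$\|\mathcal{T}U - \mathcal{T}V\|_\infty \le \gamma\,\|U - V\|_\infty \qquad$ (the back-up is a $\gamma$-contraction in max-norm)

$\|V_{k+1} - V^*\|_\infty \le \frac{\gamma}{1-\gamma}\,\|V_{k+1} - V_k\|_\infty \qquad$ (a small change between iterates implies the iterate is close to $V^*$)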
(a)	Prefer	the	close	exit	(+1),	risking	the	cliff	(-10)	
(b)	Prefer	the	close	exit	(+1),	but	avoiding	the	cliff	(-10)	
(c)	Prefer	the	distant	exit	(+10),	risking	the	cliff	(-10)	
(d)	Prefer	the	distant	exit	(+10),	avoiding	the	cliff	(-10)	
Exercise	1:	Effect	of	Discount	and	Noise	
(1)	γ	=	0.1,	noise	=	0.5	
(2)	γ	=	0.99,	noise	=	0	
(3)	γ =	0.99,	noise	=	0.5	
(4)	γ =	0.1,	noise	=	0
Exercise 1 Solution
(a) Prefer close exit (+1), risking the cliff (-10)  ---  (4) γ = 0.1, noise = 0
(b) Prefer close exit (+1), avoiding the cliff (-10)  ---  (1) γ = 0.1, noise = 0.5
(c) Prefer distant exit (+10), risking the cliff (-10)  ---  (2) γ = 0.99, noise = 0
(d) Prefer distant exit (+10), avoiding the cliff (-10)  ---  (3) γ = 0.99, noise = 0.5
Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally

Bellman Equation:
$Q^*(s,a) = \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\big)$

Q-Value Iteration:
$Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a')\big)$
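A compact NumPy sketch of this Q back-up, under the same illustrative `(S, A, S)` array conventions as the earlier value iteration sketch (function name and iteration count are my own choices):

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iters=100):
    """Tabular Q-value iteration sketch. P and R have shape (S, A, S)."""
    n_states, n_actions = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                        # max_{a'} Q_k(s', a')
        # Q_{k+1}[s, a] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
    return Q
```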
Q-Value Iteration
Noise = 0.2, Discount = 0.9, k = 100

$Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a')\big)$
Outline
- Optimal Control
  = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
- Exact Methods:
  - Value Iteration
  - Policy Iteration

For now: small discrete state-action spaces, as they are simpler to get the main concepts across. We will consider large / continuous spaces later!
Policy Evaluation
- Recall value iteration:

  $V^*_k(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^*_{k-1}(s')\big)$

- Policy evaluation for a given policy $\pi(s)$:

  $V^\pi_k(s) \leftarrow \sum_{s'} P(s'|s,\pi(s))\,\big(R(s,\pi(s),s') + \gamma V^\pi_{k-1}(s')\big)$

- At convergence:

  $\forall s: \; V^\pi(s) = \sum_{s'} P(s'|s,\pi(s))\,\big(R(s,\pi(s),s') + \gamma V^\pi(s')\big)$
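As an illustration (again with made-up array and function names, not lab code), iterative policy evaluation for a deterministic tabular policy might look like:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation; pi is an (S,)-array of action indices."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, pi]              # (S, S): P(s' | s, pi(s))
    R_pi = R[idx, pi]              # (S, S): R(s, pi(s), s')
    V = np.zeros(n_states)
    while True:
        # One sweep of the V^pi back-up
        V_new = (P_pi * (R_pi + gamma * V[None, :])).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```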
Exercise 2
Consider a stochastic policy π(a|s), where π(a|s) is the probability of taking action a when in state s. Which of the following is the correct update to perform policy evaluation for this stochastic policy?

1. $V^\pi_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^\pi_k(s')\big)$

2. $V^\pi_{k+1}(s) \leftarrow \sum_{s'} \sum_a \pi(a|s)\, P(s'|s,a)\,\big(R(s,a,s') + \gamma V^\pi_k(s')\big)$

3. $V^\pi_{k+1}(s) \leftarrow \sum_a \pi(a|s)\, \max_{s'} P(s'|s,a)\,\big(R(s,a,s') + \gamma V^\pi_k(s')\big)$
Policy Iteration
One iteration of policy iteration:
- Policy evaluation for current policy $\pi_k$:
  - Iterate until convergence:

    $V^{\pi_k}_{i+1}(s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s))\,\big[R(s,\pi_k(s),s') + \gamma V^{\pi_k}_i(s')\big]$

- Policy improvement: find the best action according to one-step look-ahead:

  $\pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^{\pi_k}(s')\big]$

- Repeat until policy converges
- At convergence: optimal policy; and converges faster than value iteration under some conditions
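Putting the two steps together, here is a hedged NumPy sketch of one possible policy iteration loop. The function and variable names are mine, and the fixed number of inner evaluation sweeps is an arbitrary illustrative choice:

```python
import numpy as np

def policy_iteration(P, R, gamma, eval_sweeps=200):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation for pi_k (a fixed number of sweeps, for simplicity)
        P_pi, R_pi = P[idx, pi], R[idx, pi]
        V = np.zeros(n_states)
        for _ in range(eval_sweeps):
            V = (P_pi * (R_pi + gamma * V[None, :])).sum(axis=1)
        # Policy improvement via one-step look-ahead
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # policy stable -> done
            return pi, V
        pi = pi_new
```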
Policy Evaluation Revisited
- Idea 1: modify Bellman updates

  $V^\pi_{i+1}(s) \leftarrow \sum_{s'} P(s'|s,\pi(s))\,\big[R(s,\pi(s),s') + \gamma V^\pi_i(s')\big]$

- Idea 2: it is just a linear system, solve with Numpy (or whatever)
  variables: $V^\pi(s)$
  constants: P, R

  $\forall s: \; V^\pi(s) = \sum_{s'} P(s'|s,\pi(s))\,\big[R(s,\pi(s),s') + \gamma V^\pi(s')\big]$
Policy Iteration Guarantees
Policy Iteration iterates over:
- Policy evaluation for current policy $\pi_k$:
  - Iterate until convergence:

    $V^{\pi_k}_{i+1}(s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s))\,\big[R(s,\pi_k(s),s') + \gamma V^{\pi_k}_i(s')\big]$

- Policy improvement: find the best action according to one-step look-ahead:

  $\pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^{\pi_k}(s')\big]$

Theorem. Policy iteration is guaranteed to converge, and at convergence, the current policy and its value function are the optimal policy and the optimal value function!

Proof sketch:
(1) Guaranteed to converge: In every step the policy improves. This means that a given policy can be encountered at most once. This means that after we have iterated as many times as there are different policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: by definition of convergence, at convergence π_{k+1}(s) = π_k(s) for all states s. This means

  $\forall s: \; V^{\pi_k}(s) = \max_a \sum_{s'} P(s'|s,a)\,\big[R(s,a,s') + \gamma V^{\pi_k}(s')\big]$

Hence $V^{\pi_k}$ satisfies the Bellman equation, which means $V^{\pi_k}$ is equal to the optimal value function V*.
Outline
- Optimal Control
  = given an MDP (S, A, P, R, γ, H), find the optimal policy π*
- Exact Methods:
  - Value Iteration
  - Policy Iteration

Limitations:
- Iteration over / storage for all states and actions: requires small, discrete state-action space
- Update equations require access to dynamics model
