This document provides an introduction to reinforcement learning. It begins with an overview of reinforcement learning and how it differs from supervised and unsupervised learning. It then discusses how to model reinforcement learning problems using Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). The document provides details on MDPs, including how to define the value function and solve MDPs using value iteration and policy iteration. It also discusses how to learn MDP models and use Q-learning, a model-free reinforcement learning method.
In real-world scenarios, decision making can be a very challenging task, even for modern computers. Generalized reinforcement learning (GRL) was developed to facilitate complex decision making in highly dynamic systems through flexible policy generalization mechanisms using kernel-based methods. GRL combines the use of sampling, kernel functions, stochastic processes, non-parametric regression, and functional clustering.
An Introduction to Reinforcement Learning - The Doors to AGI (Anirban Santara)
Reinforcement Learning (RL) is a genre of Machine Learning in which an agent learns to choose optimal actions in different states in order to reach its specified goal, solely by interacting with the environment through trial and error. Unlike supervised learning, the agent does not get examples of "correct" actions in given states as ground truth. Instead, it has to use feedback from the environment (which can be sparse and delayed) to improve its policy over time. The formulation of the RL problem closely resembles the way in which human beings learn to act in different situations. Hence it is often considered the gateway to achieving the goal of Artificial General Intelligence.
This talk introduces the audience to key theoretical concepts such as the formulation of the RL problem as a Markov Decision Process (MDP) and the solution of MDPs using dynamic programming and policy-gradient-based algorithms. State-of-the-art deep reinforcement learning algorithms will also be covered, along with a case study of the application of reinforcement learning in robotics.
We propose a distributed deep learning model to learn control policies directly from high-dimensional sensory input using reinforcement learning (RL). We adapt the DistBelief software framework to efficiently train the deep RL agents using the Apache Spark cluster computing framework.
Reinforcement Learning is a growing subset of Machine Learning and one of the most important frontiers of Artificial Intelligence. Its goal is to capture higher logic and use more adaptable algorithms than classical Machine Learning.
Formally, it denotes a set of algorithms that deal with sequential decision-making and are capable of making highly intelligent decisions depending on their local environment.
Reinforcement Learning problems can be described as an agent that has to make decisions in its environment in order to optimize a cumulative reward, and it is clear that this formalization applies to a great variety of tasks in many different fields.
In this talk, the main features of the most important Reinforcement Learning algorithms will be illustrated and explored in depth, with some concrete, explanatory examples.
Bio:
Marco Del Pra
Marco was born in Venice 41 years ago, has two master's degrees (Computer Science and Mathematics), and has two important publications in applied mathematics.
He has been working in Artificial Intelligence for 10 years, mainly as a freelancer. Among others, he worked for the European Commission's Joint Research Center, for Cuebiq, and as Data Science Lead for Microsoft's Artificial Intelligence projects in Italy.
This presentation contains an introduction to reinforcement learning, a comparison with other learning paradigms, an introduction to Q-Learning, and some applications of reinforcement learning in video games.
Exploration Strategies in Reinforcement Learning (Dongmin Lee)
I presented "Exploration Strategies in Reinforcement Learning" at AI Robotics KR.
- Exploration strategies in RL
1. Epsilon-greedy
2. Optimism in the face of uncertainty
3. Thompson (posterior) sampling
4. Information theoretic exploration (e.g., Entropy Regularization in RL)
Thank you.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
Maximum Entropy Reinforcement Learning (Stochastic Control) (Dongmin Lee)
I reviewed the following papers.
- T. Haarnoja, et al., “Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., “Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
Thank you.
Speaker: Donghyun Kwak (곽동현; PhD student at Seoul National University, currently at NAVER Clova)
An overview of reinforcement learning and recent deep-learning-based RL trends.
Talk video:
http://tv.naver.com/v/2024376
https://youtu.be/dw0sHzE1oAc
In some applications, the output of the system is a sequence of actions. In such cases a single action is not important on its own; consider game playing, where a single move by itself is not that important. As the agent acts on its environment, it receives some evaluation of its action (reinforcement), but it is not told which action is the correct one to achieve its goal.
Reinforcement Learning Guide For Beginners (gokulprasath06)
Web Optimization is a Reinforcement Learning problem. Q-Learning is introduced as a way to integrate AB Testing, Attribution, and Predictive Targeting.
To make Reinforcement Learning algorithms work in the real world, one has to get around (what Sutton calls) the "deadly triad": the combination of bootstrapping, function approximation, and off-policy evaluation. The first step here is to understand the Value Function Vector Space/Geometry and then make one's way into Gradient TD algorithms (a big breakthrough in overcoming the "deadly triad").
Introduction to Reinforcement Learning
Edward Balaban

4-5. What is Reinforcement Learning?
Supervised learning: learn a model from training data that maps inputs to outputs, then use it to generate outputs for future inputs.
Unsupervised learning: recognize patterns in input data.
Reinforcement learning (RL): provide the learning agent with a reward function and let it figure out the best strategy for obtaining large rewards.
RL has been used in such diverse applications as:
- Business strategy planning
- Aircraft control
- Optimal routing (data packets, vehicles, etc.)
- Robot motion control
Some of the material in these slides is borrowed from the Andrew Ng and Wheeler Ruml lectures on reinforcement learning.
6-7. How do we model for RL?
Modeling frameworks with increasing levels of uncertainty:
- State space models: no uncertainty
- Markov Decision Processes (MDPs): uncertainty in action effects
- Partially Observable Markov Decision Processes (POMDPs): uncertainty in action effects and in the current state
Other modeling frameworks exist, e.g. Predictive State Representations:
- Generalizations of POMDPs that were shown to have both a greater representational capacity than POMDPs and to yield representations that are at least as compact (Singh et al., 2004 and Even-Dar et al., 2005)
- Represent the state of a dynamic system by tracking occurrence probabilities of a set of future events (tests), conditioned on past events (histories)
- Rely solely on observable quantities (unlike POMDPs)
9-11. Markov Decision Process (MDP)
- States: $S = \{s_1, \ldots, s_{|S|}\}$
- Actions: $A = \{a_1, \ldots, a_{|A|}\}$
- Transition probabilities: $T(s, a, s') = \Pr(s' \mid s, a)$
- Rewards: $R: S \to \mathbb{R}$
- Policy: $\pi: S \to A$, with $\Pi$ the set of all policies

Value function:
$$V^\pi(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi\right]$$

Bellman equation:
$$V^\pi(s) = R(s) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^\pi(s')$$

Optimal value function:
$$V^*(s) = \max_\pi V^\pi(s)$$
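To make these definitions concrete, here is a minimal Python sketch (not from the slides): a hypothetical two-state MDP stored as arrays, with policy evaluation done by iterating the Bellman equation above to a fixed point.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP: T[s, a, s'] = Pr(s' | s, a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([0.0, 1.0])   # reward depends on the state only, as on the slide
gamma = 0.9

def policy_evaluation(pi, T, R, gamma, tol=1e-8):
    """Iterate V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') V(s') to a fixed point."""
    n = len(R)
    V = np.zeros(n)
    while True:
        V_new = np.array([R[s] + gamma * T[s, pi[s]] @ V for s in range(n)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi = np.array([0, 1])  # a fixed deterministic policy
print(policy_evaluation(pi, T, R, gamma))
```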
12. Markov Decision Process (MDP), continued
Bellman equation for the optimal value function:
$$V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s')$$
Optimal policy:
$$\pi^*(s) = \arg\max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s')$$
$$V^*(s) = V^{\pi^*}(s) \ge V^\pi(s) \quad \text{for any policy } \pi$$
13-14. Learning an MDP
Usually $S$, $A$, and $\gamma$ are known. The transition probabilities can be estimated from experience:
$$T(s, a, s') = \frac{\#\text{ times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{ times took action } a \text{ in state } s}$$
Similarly, if $R$ is unknown, we can also pick our estimate of the expected immediate reward $R(s)$ in state $s$ to be the average reward observed in that state.
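A short sketch of these count-based estimates, under the assumption that experience is available as a log of (s, a, s', r) tuples; the log format and function name are illustrative, not from the slides.

```python
import numpy as np

def estimate_model(experience, n_states, n_actions):
    """Count-based estimates of T(s, a, s') and R(s), as on the slide.
    `experience` is an iterable of (s, a, s_next, r) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros(n_states)
    visits = np.zeros(n_states)
    for s, a, s_next, r in experience:
        counts[s, a, s_next] += 1
        reward_sums[s] += r
        visits[s] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Use a uniform distribution where a (s, a) pair was never tried
    T_hat = np.divide(counts, totals,
                      out=np.full_like(counts, 1.0 / n_states),
                      where=totals > 0)
    # Average observed reward per state (zero where a state was never visited)
    R_hat = np.divide(reward_sums, visits,
                      out=np.zeros(n_states), where=visits > 0)
    return T_hat, R_hat
```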
15. Solving an MDP: Value Iteration
$$\forall s \in S,\; V(s) \leftarrow 0$$
Repeat until convergence:
$$\forall s \in S,\; V(s) \leftarrow R(s) + \max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V(s')$$
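A direct translation of this loop into Python might look like the sketch below; it reuses the T, R, gamma array conventions from the earlier snippets and also returns the greedy policy with respect to the final V.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Value iteration: V(s) <- R(s) + max_a gamma * sum_s' T(s, a, s') V(s')."""
    n_states = len(R)
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = gamma * sum_s' T(s, a, s') * V(s')
        Q = gamma * np.einsum("sat,t->sa", T, V)
        V_new = R + Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # value function and greedy policy
        V = V_new
```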
16. Convergence
From the definition of the Bellman operator:
$$\|B(V_1) - B(V_2)\|_\infty = \max_{s \in S}\left| R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_1(s') - R(s) - \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_2(s') \right| \quad (1)$$
$$= \gamma \cdot \max_{s \in S}\left| \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_1(s') - \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_2(s') \right| \quad (2)$$
To go further, we need to understand whether the two maximization operations over the set of actions for $V_1$ and $V_2$ can be combined. To do that, let's use the following definitions:
$$f_1(a) = \sum_{s' \in S} P_{sa}(s')\, V_1(s') \quad (3)$$
$$f_2(a) = \sum_{s' \in S} P_{sa}(s')\, V_2(s') \quad (4)$$
17. Convergence, continued
In order to, for the moment, get rid of the max operators, let's also define $a_1^*$ as the action that maximizes $f_1$ and $a_2^*$ as the action that maximizes $f_2$. Then $\left|\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_1(s') - \max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V_2(s')\right|$ can be written as $|f_1(a_1^*) - f_2(a_2^*)|$.
Since $f_1(a_1^*) \ge f_1(a_2^*)$ and $f_2(a_2^*) \ge f_2(a_1^*)$ (by virtue of $a_1^*$ and $a_2^*$ maximizing $f_1$ and $f_2$, respectively), we can "unpack" the absolute value operator as follows:
$$f_1(a_1^*) - f_2(a_2^*) \le f_1(a_1^*) - f_2(a_1^*) \quad (5)$$
$$f_2(a_2^*) - f_1(a_1^*) \le f_2(a_2^*) - f_1(a_2^*) \quad (6)$$
Then it is also true that
$$f_1(a_1^*) - f_2(a_2^*) \le |f_1(a_1^*) - f_2(a_1^*)| \quad (7)$$
$$f_2(a_2^*) - f_1(a_1^*) \le |f_2(a_2^*) - f_1(a_2^*)| \quad (8)$$
And, finally, it should also be true for $\forall a$ that
$$f_1(a_1^*) - f_2(a_2^*) \le \max_a |f_1(a) - f_2(a)| \quad (9)$$
$$f_2(a_2^*) - f_1(a_1^*) \le \max_a |f_2(a) - f_1(a)| \quad (10)$$
Therefore we can conclude that
$$\left|\max_a f_1(a) - \max_a f_2(a)\right| \le \max_a |f_1(a) - f_2(a)| \quad (11)$$
18. Convergence, continued
Then Equation 2 can be rewritten as an inequality:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \left| \sum_{s' \in S} P_{sa}(s')\, V_1(s') - \sum_{s' \in S} P_{sa}(s')\, V_2(s') \right| \quad (12)$$
Simplifying further, we get:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \left| \sum_{s' \in S} P_{sa}(s') \left( V_1(s') - V_2(s') \right) \right| \quad (13)$$
By using the triangle inequality and the fact that $P_{sa}(s') \ge 0$, we can rewrite the above expression as
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \sum_{s' \in S} P_{sa}(s') \left| V_1(s') - V_2(s') \right| \quad (14)$$
$\sum_{s' \in S} P_{sa}(s') \left| V_1(s') - V_2(s') \right|$ can be seen as the expectation of $\left| V_1(s') - V_2(s') \right|$. It is, therefore, no greater than the maximum value that $\left| V_1(s') - V_2(s') \right|$ can take. Thus the above inequality can be written as:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \max_{s' \in S} \left| V_1(s') - V_2(s') \right| \quad (15)$$
19. Convergence, continued
The remaining expression on the right can only be maximized with respect to $s'$, so we can simplify to
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s' \in S} \left| V_1(s') - V_2(s') \right| \quad (16)$$
What we have on the right-hand side now is the definition of the infinity norm, therefore we finally obtain:
$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty \quad (17)$$
We'll prove that the Bellman operator has at most one fixed point by contradiction. Let's assume that there are two distinct fixed points, $V_1$ and $V_2$. Since $B(V_1) = V_1$ and $B(V_2) = V_2$, the inequality obtained above becomes
$$\|V_1 - V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty \quad (18)$$
$$(1 - \gamma)\, \|V_1 - V_2\|_\infty \le 0 \quad (19)$$
Since $\gamma \in [0, 1)$, we have $1 - \gamma > 0$. An infinity norm of any variable is non-negative, so the only way for the above expression to be true is if $\|V_1 - V_2\|_\infty = 0$, and, consequently, if $V_1 = V_2$. Therefore we have proved that the Bellman operator has at most one fixed point.
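The contraction property (17) is also easy to check numerically; the following sketch (not from the slides) builds a random hypothetical MDP and verifies the inequality for a pair of arbitrary value functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random hypothetical MDP to check the contraction property numerically
n_s, n_a, gamma = 5, 3, 0.9
T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=2, keepdims=True)   # normalize rows into probability distributions
R = rng.random(n_s)

def bellman(V):
    """The Bellman operator B(V)(s) = R(s) + gamma * max_a sum_s' T(s,a,s') V(s')."""
    return R + gamma * np.einsum("sat,t->sa", T, V).max(axis=1)

V1, V2 = rng.random(n_s), rng.random(n_s)
lhs = np.max(np.abs(bellman(V1) - bellman(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)  # True: ||B(V1) - B(V2)||_inf <= gamma * ||V1 - V2||_inf
```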
20. Using an MDP with value iteration
Repeat:
- Execute $\pi$ in the MDP for some number of trials.
- Using the accumulated experience in the MDP, update the estimates for $T(s, a, s')$ (and $R$, if applicable).
- Apply value iteration to get a new estimated value function $V$.
- Update $\pi$ to be the greedy policy with respect to $V$.
21. Solving an MDP: Policy Iteration
Initialize $\pi$ randomly.
Repeat until convergence:
- $V \leftarrow V^\pi$
- $\forall s \in S,\; \pi(s) \leftarrow \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s')\, V(s')$
$V \leftarrow V^\pi$ can be done efficiently by solving Bellman's equations as a system of linear equations.
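A sketch of this procedure, with the evaluation step $V \leftarrow V^\pi$ solved as a linear system exactly as the slide suggests (array conventions as in the earlier snippets):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Policy iteration; the evaluation step solves the Bellman equations
    (I - gamma * T_pi) V = R as a linear system."""
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        T_pi = T[np.arange(n_states), pi]         # T_pi[s, s'] = T(s, pi(s), s')
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Greedy improvement: argmax_a sum_s' T(s, a, s') V(s')
        pi_new = np.einsum("sat,t->sa", T, V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```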
22-27. Solving (and learning) an MDP: Q-learning
Model-free reinforcement learning. Recall:
$$V(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, V(s')$$
Think of Q-learning as a regression!
Explore states: in state $s$, took action $a$, got reward $r$, ended up in state $s'$: $(s, a, s', r)$.
$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot (\text{error})$$
$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot (\text{sensed} - \text{predicted})$$
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( \left[ r + \gamma \max_{a'} Q(s', a') \right] - Q(s, a) \right)$$
Stochastic update with step size $\alpha$.
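A tabular Q-learning sketch built around this update rule; the env_reset/env_step interfaces are assumptions standing in for some environment, not code from the talk.

```python
import numpy as np

def q_learning(env_reset, env_step, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumed interfaces: env_reset() -> s and env_step(s, a) -> (s', r, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            # Epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else Q[s].argmax()
            s_next, r, done = env_step(s, a)
            # Stochastic update with step size alpha (the slide's final rule)
            target = r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```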
28-30. Continuous State MDP
A more realistic form of MDP:
- Needs a simulator
- Solving continuous-state MDPs:
  - LQR
  - Fitted Value Iteration
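As a rough illustration of fitted value iteration, the sketch below regresses V over a sample of continuous states instead of storing a table; the simulator, reward function, and choice of a linear regressor are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fitted_value_iteration(simulator, sample_states, actions, reward,
                           gamma=0.95, n_next=10, iters=50):
    """Fitted value iteration sketch: V is a regression model over sampled
    continuous states. simulator(s, a) is assumed to draw a next state s'
    from the (possibly stochastic) dynamics."""
    model = LinearRegression()
    model.fit(sample_states, np.zeros(len(sample_states)))  # V = 0 initially
    for _ in range(iters):
        targets = []
        for s in sample_states:
            # Monte Carlo estimate of E[V(s')] for each action via the simulator
            q_values = [
                np.mean([model.predict(simulator(s, a).reshape(1, -1))[0]
                         for _ in range(n_next)])
                for a in actions
            ]
            targets.append(reward(s) + gamma * max(q_values))
        model.fit(sample_states, np.array(targets))  # the supervised regression step
    return model
```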
31. An example MDP - the inverted pendulum
- A thin pole is connected via a free hinge to a cart.
- The cart can move laterally on a smooth table surface.
- Failure occurs if:
  - the angle of the pole deviates by more than a certain amount from the vertical position
  - the cart's position goes out of bounds
- The objective is to develop a controller to balance the pole.
- The only actions the controller can take are to accelerate the cart either left or right.
- The algorithm cannot use any knowledge of the dynamics of the underlying system.
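This setup matches the classic CartPole benchmark; as an illustration (assuming the gymnasium package is available), an agent interacts with it through observations, actions, and failure-terminated episodes like so:

```python
import gymnasium as gym  # assumption: the gymnasium package; CartPole-v1 matches this setup

env = gym.make("CartPole-v1")  # cart on a track, pole on a free hinge; actions: push left/right
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # a random policy; an RL agent would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated      # episode ends on pole-angle or position bounds

print(total_reward)
env.close()
```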
34. Partially Observable Markov Decision Process (POMDP)
- States: $S = \{s_1, \ldots, s_{|S|}\}$
- Actions: $A = \{a_1, \ldots, a_{|A|}\}$
- Transition probabilities: $T(s, a, s') = \Pr(s' \mid s, a)$
- Observations: $Z = \{z_1, \ldots, z_{|Z|}\}$
- Observation probabilities: $O(z, a, s') = \Pr(z \mid s', a)$
- Belief state (agent): $b = \{b(s_1), \ldots, b(s_{|S|})\}: S \to [0, 1]^{|S|}$, with $\sum_{i=1}^{|S|} b(s_i) = 1$
- Belief space: $B$, the set of all belief states (infinite)
- Initial belief: $b_0$
- Rewards: $R: S \to \mathbb{R}$
- Policy: $\pi: B \to A$, with $\Pi$ the set of all policies
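The belief state is updated by Bayes' rule after each action and observation; a minimal discrete-case sketch (the continuous version of this update appears later in the deck) might look like the following, where the T and O array layouts are illustrative assumptions.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Discrete Bayes filter: b'(s') is proportional to
    O(z, a, s') * sum_s T(s, a, s') b(s).
    Assumed layouts: T[s, a, s'] and O[z, a, s']."""
    predicted = b @ T[:, a, :]          # sum_s b(s) T(s, a, s')
    updated = O[z, a, :] * predicted    # weight by the observation likelihood
    return updated / updated.sum()      # normalize so the belief sums to 1
```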
35. Solving a POMDP
Solving a realistic POMDP exactly is often computationally intractable.
Approximate method families:
- Point-based methods
- Monte Carlo methods
- Generalization methods
36-38. Example: Prognostic Decision Making
System Degradation
- All aerospace systems experience degradation.
- Degradation can be use- or time-dependent.
- The operating environment is often a significant factor.
(Image: JAXA Hayabusa)
Faults
- Degradation can accelerate if a fault occurs.
- In a complex, multi-component system a fault can have cascading effects.
- In case of a fault, a quick mitigation decision is often required.
(Image: United Flight 232)
System Health Management (SHM)
- Recent designs, e.g. the S-92, have more SHM capabilities (in fault detection and diagnosis).
- Still, maintenance is predominantly done based on fixed schedules.
- In-flight emergencies are handled through the skill and ingenuity of the crew and ground control.
(Image: Sikorsky S-92)
39-41. How Can We Do Better?
In recent years progress has been made in using physics modeling and computational methods for:
- Fault detection,
- Fault magnitude estimation,
- Degradation trajectory prediction (prognostics).
Research on how to utilize prognostic health information is in the very early stages, however.
Prognostic Decision Making (PDM)
The process of selecting system actions informed by predictions of the future system health state.
PDM can help with the following, for example:
- Component life extension,
- Fault mitigation,
- Mission replanning,
- Crew decision support in emergencies,
- Condition-based maintenance,
- Asset allocation.
42. System
Described as a continuous-state, continuous-action POMDP:
- State space: $S \subseteq \mathbb{R}^n$
- Action space: $A \subseteq \mathbb{R}^m$
- Observations: $Z \subseteq \mathbb{R}^p$
- Transition function: $T(s, a, s') = pdf(s' \mid s, a): S \times A \times S \to [0, \infty)$
- Observation function: $O(z', a, s') = pdf(z' \mid s', a): S \times A \times Z \to [0, \infty)$
- Belief state: $b(s) = pdf(s)$
- Belief space: $B$, the set of all belief states
- Initial belief: $b_0$
- Belief update: $b_{az'}(s') \propto O(z', a, s') \int_S T(s, a, s')\, b(s)\, ds$
- Policy: $\pi(a, b) = pdf(a \mid b): A \times B \to [0, \infty)$, with $\Pi$ the set of all policies
- Costs: $C = \{c_1(s, a), \ldots, c_{|C|}(s, a)\}: S \times A \to \mathbb{R}^{|C|}$
- Rewards: $R(s, r) = pdf(r \mid s): S \times \mathbb{R} \to [0, \infty)$
- Objectives: $F = \{f_1(s), \ldots, f_{|F|}(s)\}: S \to \mathbb{R}^{|F|}$
- Constraints: $G = \{g_1(s), \ldots, g_{|G|}(s)\}: S \to \mathbb{B}^{|G|}$
43. System Degradation
Let $H = \{h_1, \ldots, h_{|H|}\}$ be the vector of system health parameters incorporated into the state.
- Fault: $G_{fault} \subseteq G$ defines significant deviations from the expected nominal behavior. A fault occurs if $\exists i: g_i(s) = \text{true},\ g_i \in G_{fault}$.
- Failure: $G_{failure} \subseteq G$ defines states where the system loses functional capability with respect to a health parameter $h \in H$.
- System failure: $F: S \to \mathbb{B}$, a boolean function indicating when the entire system is effectively non-functional ($F$ is defined via the $G_{failure}$ set).
- End of Life (EoL): $t_{EoL}$, the time at which $F(s) = \text{true}$.
- Remaining Useful Life: $RUL = t_{EoL} - t$.
[Figure: a health parameter $h$ decaying over time $t$, crossing the fault threshold at $t_{fault}$ and the failure threshold at EoL.]
47. Decision Making
The process of finding (or approximating) $\pi^*$, such that
$$\pi^* = \arg\max_{\pi \in \Pi} J^\pi(b_t)$$
48. Case Study: UAV Mission Replanning
Given:
- An initial mission route (not necessarily optimized) which includes waypoint parameter constraints (e.g. on airspeed or bank angle).
- Each waypoint is associated with a payoff value.
- A healthy vehicle is able to complete the entire route within the energy and component health constraints.
- Transition costs between a pair of waypoints are history-dependent.
- A fault occurs that makes it impossible to complete the mission before the End of Life (EoL).
Find:
A policy $\pi$ that maximizes mission payoff and extends the remaining useful life.
49. Reasoning Architecture
[Block diagram: Decision Maker, Vehicle Simulation (including prognostic models), Diagnoser, and Vehicle, connected by the input route and parameter constraints, candidate routes, health and energy cost estimates, observations, and the initial/current fault sets.]
- A Particle Filter is currently used as the decision-making algorithm.
- The Decision Maker picks ordered waypoint subsets and parameter values for candidate routes and proposed routes.
- The vehicle simulation is 6DOF, with prognostic models for battery and motor temperatures, as well as the battery state of charge.
- The fault mode currently implemented is increased motor friction.
- The fault leads to increased current consumption and motor/battery overheating.
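Since the architecture relies on a particle filter, here is a minimal sketch of one bootstrap particle-filter step for tracking a continuous belief state; the transition and observation models are assumed user-supplied callables, not code from the talk.

```python
import numpy as np

def particle_filter_step(particles, weights, a, z,
                         transition_sample, obs_likelihood, rng):
    """One bootstrap particle-filter step. Assumed interfaces:
    transition_sample(s, a) draws s' ~ T(.|s, a);
    obs_likelihood(z, a, s') evaluates O(z | s', a)."""
    # Propagate each particle through the (stochastic) transition model
    particles = np.array([transition_sample(s, a) for s in particles])
    # Reweight by the likelihood of the received observation
    weights = weights * np.array([obs_likelihood(z, a, s) for s in particles])
    weights /= weights.sum()
    # Resample to avoid weight degeneracy
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```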
50. Mission Replanning Simulation
52. Resources
- Sutton and Barto book: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/
- Intro to POMDPs: http://cs.brown.edu/research/ai/pomdp/tutorial/index.html
- Stanford Autonomous Helicopter project: http://heli.stanford.edu
- NASA Vehicle Health Management (Intelligent Systems Division): http://ti.arc.nasa.gov/tech/dash/pcoe/publications/
- E. Balaban and J. J. Alonso, "A Modeling Framework for Prognostic Decision Making and its Application to UAV Mission Planning", in Proceedings of the Annual Conference of the Prognostics and Health Management Society, 2013, pp. 1-12: https://c3.nasa.gov/dashlink/resources/881/