Kai Zhang | 2015-11-25
Intrinsically Motivated Reinforcement Learning
Outline: 1. Perception Action Cycle  2. Information-to-go  3. Examples  4. References
Reinforcement Learning Revisited
Agent and Environment interact at discrete time steps: t = 0,1,2,…
• Agent observes state at step t: $s_t \in S$
• produces action at step t: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \mathbb{R}$
• and resulting next state: $s_{t+1}$
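To make the cycle concrete, here is a minimal, self-contained sketch in Python; the 1-D chain environment, its dynamics, and the random action choice are illustrative assumptions, not part of the talk.

```python
import random

# Minimal sketch of the perception-action cycle on a hypothetical
# 1-D chain: states 0..4, actions -1/+1, reward 1 on reaching state 4.
s = 0                                    # initial state s_0
for t in range(20):
    a = random.choice([-1, 1])           # a_t in A(s_t) (here: a random policy)
    s_next = min(max(s + a, 0), 4)       # environment produces s_{t+1}
    r = 1.0 if s_next == 4 else 0.0      # resulting reward r_{t+1}
    s = s_next
```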
The Agent Learns A Policy
Policy at step t, $\pi_t$: a mapping from states to action probabilities,
$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$.
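As an illustration, a tabular stochastic policy can be stored as a row-stochastic matrix and sampled from directly; the 3-state, 2-action numbers below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = pi_t(s, a): each row is a distribution over actions.
pi = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])

s = 1                              # current state s_t
a = rng.choice(2, p=pi[s])         # sample a_t with P(a_t = a) = pi_t(s, a)
```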
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.

Recall the state-value function for policy $\pi$:

$$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$$

and the Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + \gamma V^\pi(s') \right]$$
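A minimal policy-evaluation sketch, assuming the model is given as tabular arrays (the names P, R, pi and the tolerance are illustrative): it repeatedly sweeps the Bellman equation above until the values stop changing.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation via repeated Bellman backups.

    P[s, a, s'] -- transition probabilities P_{ss'}^a
    R[s, a, s'] -- expected rewards R_{ss'}^a
    pi[s, a]    -- policy pi(s, a)
    """
    V = np.zeros(P.shape[0])
    while True:
        # V(s) <- sum_a pi(s,a) sum_s' P_{ss'}^a [R_{ss'}^a + gamma V(s')]
        V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```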
Graphical Model for the Perception Action Cycle
Both the future extrinsic reward (Value) and the intrinsic reward (Information-to-go) are optimized together, using Bellman-like equations, w.r.t. both channels.
Bellman meets Shannon
Richard Ernest Bellman
(August 26, 1920 – March 19, 1984)
Claude Elwood Shannon
(April 30, 1916 – February 24, 2001)
Decision/action Sequences and Information
Let $G$ denote our target (relevant) variable. The mutual information at state $s$ about $G$ is

$$I(s; G) = E_g\!\left[ \log \frac{p(g \mid s)}{p(g)} \right] = \sum_{g \in G} p(g \mid s) \log \frac{p(g \mid s)}{p(g)}$$

For an MDP the following recursion holds:

$$I^\pi(s_t; G) = \sum_{a_t \in A} \pi(a_t \mid s_t) \sum_{s_{t+1}} P_{s_t s_{t+1}}^{a_t} \left[ \Delta I_{s_t s_{t+1}}^{a_t} + I^\pi(s_{t+1}; G) \right]$$

with the local information gain

$$\Delta I_{s_t s_{t+1}}^{a_t} = \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} + \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})}$$
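A rough dynamic-programming sketch of this recursion for a finite MDP. The tabular arrays, the state distribution mu used to form the marginals $p(s_{t+1})$ and $\pi(a_{t+1})$, and the fixed number of sweeps (the undiscounted sum is only finite for episodic/absorbing tasks) are all assumptions for illustration.

```python
import numpy as np

def information_to_go(P, pi, mu, n_sweeps=50):
    """DP sweeps for I^pi(s; G) built from the local gains Delta I.

    P[s, a, s'] -- transition probabilities, pi[s, a] -- policy,
    mu[s]       -- assumed state distribution (full support) for the marginals.
    """
    S, A, _ = P.shape
    p_next = np.einsum('s,sa,sat->t', mu, pi, P)   # marginal p(s_{t+1})
    a_marg = np.einsum('s,sa->a', mu, pi)          # marginal pi(a_{t+1})
    I = np.zeros(S)
    for _ in range(n_sweeps):
        I_new = np.zeros(S)
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    w = pi[s, a] * P[s, a, s2]
                    if w == 0.0:
                        continue
                    # environment term of Delta I
                    dI = np.log(P[s, a, s2] / p_next[s2])
                    # decision term, averaged over the next action a_{t+1}
                    nz = pi[s2] > 0
                    dI += np.sum(pi[s2][nz] * np.log(pi[s2][nz] / a_marg[nz]))
                    I_new[s] += w * (dI + I[s2])
        I = I_new
    return I
```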
Standard RL:

Agent and environment interact at discrete time steps t = 0, 1, 2, …
• agent observes state at step t: $s_t \in S$
• produces action at step t: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \mathbb{R}$
• and resulting next state: $s_{t+1}$

Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + \gamma V^\pi(s') \right]$$

Solved for $V^\pi(s)$ by DP, given $P_{ss'}^a$, $R_{ss'}^a$, and $\pi$.

Information-driven counterpart:

Agent has goal variable $g \in G$ and interacts with the environment at time steps t = 0, 1, 2, …
• estimates/infers an internal state: $\hat{s}_t \in \hat{S}$, characterized by $p(\hat{s} \mid s)$ and $p(g \mid \hat{s})$
• produces action at step t: $a_t \in A(s_t)$ with $\pi(a \mid \hat{s})$
• gets an estimated information gain: $\Delta I_{ss'}^a \in \mathbb{R}$
• and the resulting world next state: $s_{t+1}$ with $P_{ss'}^a$

Bellman equation for $I^\pi$:

$$I^\pi(\hat{s}; g) = \sum_a \pi(a \mid \hat{s}) \sum_{s'} P_{ss'}^a \left[ \Delta I_{ss'}^a + I^\pi(\hat{s}'; g) \right]$$

Solved for $I^\pi$ using DP and probabilistic inference.
Combining (future) Value and Information
In cases where information is free, we can maximize value irrespective of its information cost. In general, however, we want to
(1) reduce decision complexity;
(2) maximize the information gained about the environment.
These two goals can be met by combining the information and value equations with a Lagrange multiplier, which turns the trade-off into a single optimization problem (sketched below).
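One way to write the combined objective, in the spirit of Rubin et al.'s "Trading value and information" (see References); the exact functional used in the talk may differ. Introduce a trade-off parameter $\beta$ and minimize the free energy

$$F^\pi(s; \beta) \;=\; I^\pi(s; G) \;-\; \beta\, V^\pi(s) \qquad \text{over policies } \pi.$$

Small $\beta$ emphasizes the information terms and yields simple, high-entropy policies; large $\beta$ recovers standard value maximization. The resulting Bellman-like recursion for $F^\pi$ can then be solved by alternating DP sweeps with Blahut-Arimoto-style policy updates.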
Grid World Example
[Figure: grid-world policies for β = 0.01, 0.05, 0.5, 5. As β increases, the value term dominates and the policy becomes increasingly deterministic and goal-directed; at small β the policy stays close to uniform.]
Future Work
• Stochastic world
• More complicated tasks
References
Tishby, N. & Polani, D. Information theory of decisions and actions. In: Perception-Reason-Action Cycle: Models, Algorithms and Systems, ed. V. Cutsuridis, A. Hussain & J. G. Taylor, pp. 601–636. Springer, 2010.

Tishby, N., Pereira, F. C. & Bialek, W. The information bottleneck method. In: Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.

Rubin, J., Shamir, O. & Tishby, N. Trading value and information. In: Decision Making with Imperfect Decision Makers, pp. 57–74. Springer, 2012.