Kai Zhang | 2015-11-25
Intrinsically Motivated Reinforcement Learning
Outline: 1. Perception Action Cycle  2. Information-to-go  3. Examples  4. References
Reinforcement Learning Revisited
Agent and Environment interact at discrete time steps: t = 0,1,2,…
• Agent observes state at step t: $s_t \in S$
• produces action at step t: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \mathbb{R}$
• and resulting next state: $s_{t+1}$
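To make the cycle concrete, here is a minimal, self-contained sketch in Python; the 1-D chain environment, its dynamics, and the random action choice are illustrative assumptions, not part of the talk.

```python
import random

# Minimal sketch of the perception-action cycle on a hypothetical
# 1-D chain: states 0..4, actions -1/+1, reward 1 on reaching state 4.
s = 0                                    # initial state s_0
for t in range(20):
    a = random.choice([-1, 1])           # a_t in A(s_t) (here: a random policy)
    s_next = min(max(s + a, 0), 4)       # environment produces s_{t+1}
    r = 1.0 if s_next == 4 else 0.0      # resulting reward r_{t+1}
    s = s_next
```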
The Agent Learns A Policy
Policy at step t, $\pi_t$: a mapping from states to action probabilities,
$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$.
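As an illustration, a tabular stochastic policy can be stored as a row-stochastic matrix and sampled from directly; the 3-state, 2-action numbers below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = pi_t(s, a): each row is a distribution over actions.
pi = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])

s = 1                              # current state s_t
a = rng.choice(2, p=pi[s])         # sample a_t with P(a_t = a) = pi_t(s, a)
```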
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.

Recall the state-value function for policy $\pi$:

$$V^\pi(s) = E_\pi\left[ R_t \mid s_t = s \right] = E_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right]$$

and the Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + \gamma V^\pi(s') \right]$$
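A minimal policy-evaluation sketch, assuming the model is given as tabular arrays (the names P, R, pi and the tolerance are illustrative): it repeatedly sweeps the Bellman equation above until the values stop changing.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation via repeated Bellman backups.

    P[s, a, s'] -- transition probabilities P_{ss'}^a
    R[s, a, s'] -- expected rewards R_{ss'}^a
    pi[s, a]    -- policy pi(s, a)
    """
    V = np.zeros(P.shape[0])
    while True:
        # V(s) <- sum_a pi(s,a) sum_s' P_{ss'}^a [R_{ss'}^a + gamma V(s')]
        V_new = np.einsum('sa,sat,sat->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```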
Graphical Model for the Perception Action Cycle
Both the future extrinsic reward (Value) and the intrinsic reward (Information-to-go) are optimized together, using Bellman-like equations, w.r.t. both channels.
Bellman meets Shannon
Richard Ernest Bellman
(August 26, 1920 – March 19, 1984)
Claude Elwood Shannon
(April 30, 1916 – February 24, 2001)
Decision/action Sequences and Information
Let $G$ denote our target (relevant) variable. The mutual information at state $s$ about $G$ is

$$I(s; G) = E_g\!\left[ \log \frac{p(g \mid s)}{p(g)} \right] = \sum_{g \in G} p(g \mid s) \log \frac{p(g \mid s)}{p(g)}$$

For an MDP the following recursion holds:

$$I^\pi(s_t; G) = \sum_{a_t \in A} \pi(a_t \mid s_t) \sum_{s_{t+1}} P_{s_t s_{t+1}}^{a_t} \left[ \Delta I_{s_t s_{t+1}}^{a_t} + I^\pi(s_{t+1}; G) \right]$$

with the local information gain

$$\Delta I_{s_t s_{t+1}}^{a_t} = \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} + \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})}$$
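A rough dynamic-programming sketch of this recursion for a finite MDP. The tabular arrays, the state distribution mu used to form the marginals $p(s_{t+1})$ and $\pi(a_{t+1})$, and the fixed number of sweeps (the undiscounted sum is only finite for episodic/absorbing tasks) are all assumptions for illustration.

```python
import numpy as np

def information_to_go(P, pi, mu, n_sweeps=50):
    """DP sweeps for I^pi(s; G) built from the local gains Delta I.

    P[s, a, s'] -- transition probabilities, pi[s, a] -- policy,
    mu[s]       -- assumed state distribution (full support) for the marginals.
    """
    S, A, _ = P.shape
    p_next = np.einsum('s,sa,sat->t', mu, pi, P)   # marginal p(s_{t+1})
    a_marg = np.einsum('s,sa->a', mu, pi)          # marginal pi(a_{t+1})
    I = np.zeros(S)
    for _ in range(n_sweeps):
        I_new = np.zeros(S)
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    w = pi[s, a] * P[s, a, s2]
                    if w == 0.0:
                        continue
                    # environment term of Delta I
                    dI = np.log(P[s, a, s2] / p_next[s2])
                    # decision term, averaged over the next action a_{t+1}
                    nz = pi[s2] > 0
                    dI += np.sum(pi[s2][nz] * np.log(pi[s2][nz] / a_marg[nz]))
                    I_new[s] += w * (dI + I[s2])
        I = I_new
    return I
```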
Standard RL:

Agent and environment interact at discrete time steps t = 0, 1, 2, …
• agent observes state at step t: $s_t \in S$
• produces action at step t: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \mathbb{R}$
• and resulting next state: $s_{t+1}$

Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + \gamma V^\pi(s') \right]$$

Solved for $V^\pi(s)$ by DP, given $P_{ss'}^a$, $R_{ss'}^a$, and $\pi$.

Information-driven counterpart:

Agent has goal variable $g \in G$ and interacts with the environment at time steps t = 0, 1, 2, …
• estimates/infers an internal state: $\hat{s}_t \in \hat{S}$, characterized by $p(\hat{s} \mid s)$ and $p(g \mid \hat{s})$
• produces action at step t: $a_t \in A(s_t)$ with $\pi(a \mid \hat{s})$
• gets an estimated information gain: $\Delta I_{ss'}^a \in \mathbb{R}$
• and the resulting world next state: $s_{t+1}$ with $P_{ss'}^a$

Bellman equation for $I^\pi$:

$$I^\pi(\hat{s}; g) = \sum_a \pi(a \mid \hat{s}) \sum_{s'} P_{ss'}^a \left[ \Delta I_{ss'}^a + I^\pi(\hat{s}'; g) \right]$$

Solved for $I^\pi$ using DP and probabilistic inference.
Combining (future) Value and Information
In cases where information is free, we can maximize value irrespective of its information cost. In general, however, we want to
(1) reduce decision complexity;
(2) maximize the information gained about the environment.
These two goals can be met by combining the information and value equations with a Lagrange multiplier, which turns the trade-off into a single optimization problem (sketched below).
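One way to write the combined objective, in the spirit of Rubin et al.'s "Trading value and information" (see References); the exact functional used in the talk may differ. Introduce a trade-off parameter $\beta$ and minimize the free energy

$$F^\pi(s; \beta) \;=\; I^\pi(s; G) \;-\; \beta\, V^\pi(s) \qquad \text{over policies } \pi.$$

Small $\beta$ emphasizes the information terms and yields simple, high-entropy policies; large $\beta$ recovers standard value maximization. The resulting Bellman-like recursion for $F^\pi$ can then be solved by alternating DP sweeps with Blahut-Arimoto-style policy updates.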
Grid World Example
[Figure: grid-world policies for β = 0.01, 0.05, 0.5, 5. As β increases, the value term dominates and the policy becomes increasingly deterministic and goal-directed; at small β the policy stays close to uniform.]
Future Work
• Stochastic world
• More complicated tasks
References
Tishby, N. & Polani, D. Information theory of decisions and actions. In: Perception-Reason-Action Cycle: Models, Algorithms and Systems, ed. V. Cutsuridis, A. Hussain & J. G. Taylor, pp. 601–636. Springer, 2010.

Tishby, N., Pereira, F. C. & Bialek, W. The information bottleneck method. In: Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.

Rubin, J., Shamir, O. & Tishby, N. Trading value and information. In: Decision Making with Imperfect Decision Makers, pp. 57–74. Springer, 2012.