Kai Zhang | 2015-11-25 Intrinsic Motivated Reinforcement Learning
1. Perception Action Cycle 2. Information-to-go 3. Examples 4. References
Intrinsically Motivated Reinforcement Learning
Kai Zhang
EECS Department
2015-11-25
Reinforcement Learning Revisited
Agent and Environment interact at discrete time steps: t = 0, 1, 2, …
• Agent observes state at step t: $s_t \in S$
• produces action at step t: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \Re$
• and resulting next state: $s_{t+1}$
The Agent Learns A Policy
Policy at step t, $\pi_t$:
A mapping from states to action probabilities
$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$
Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$
Recall:
State-value function for policy $\pi$:
$$V^\pi(s) = E_t\{R_t \mid s_t = s\} = E_t\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$
Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'} \left[R^{a}_{ss'} + \gamma V^\pi(s')\right]$$
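The Bellman equation for the state-value function can be turned into an iterative update. The following is a minimal sketch of policy evaluation in plain Python, not code from the slides: the function name `policy_evaluation`, the nested-list layout `P[s][a][s2]` and `R[s][a][s2]`, the discount `gamma = 0.9`, and the two-state example MDP are all assumptions chosen for illustration.

```python
def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-10):
    """Sweep the Bellman equation for V^pi until the largest change is below tol.

    P[s][a][s2]: transition probability, R[s][a][s2]: reward, pi[s][a]: policy.
    """
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(
                pi[s][a] * sum(
                    P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in range(n_states)
                )
                for a in range(len(pi[s]))
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v  # in-place update, reused by later states in the same sweep
        if delta < tol:
            return V


# Hypothetical two-state MDP: in state 0, action 0 stays (reward 0) and
# action 1 moves to the absorbing state 1 (reward 1). State 1 yields no reward.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniformly random policy

V = policy_evaluation(P, R, pi)
print(round(V[0], 4))  # 0.9091, i.e. 0.5 / (1 - 0.5 * 0.9)
```

For this example the fixed point can be checked by hand: $V(1) = 0$ and $V(0) = 0.5 \cdot 0.9\, V(0) + 0.5 \cdot 1$, which gives $V(0) = 0.5 / 0.55$.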
Graphical Model for the Perception Action Cycle
Both the future extrinsic reward (value) and the future intrinsic reward (information-to-go) are optimized
together, using Bellman-like equations with respect to both channels.
Bellman meets Shannon
Richard Ernest Bellman
(August 26, 1920 – March 19, 1984)
Claude Elwood Shannon
(April 30, 1916 – February 24, 2001)
Decision/action Sequences and Information
Let G denote our target (relevant) variable
$$I(s; G) = E_g\left[\log \frac{p(g \mid s)}{p(g)}\right] = \sum_{g \in G} p(g, s) \log \frac{p(g \mid s)}{p(g)}$$
- The mutual information at state s on G.
For an MDP the following recursion holds:
$$I^\pi(s_t; G) = \sum_{a_t \in A} \pi(a_t \mid s_t) \sum_{s_{t+1}} P^{a_t}_{s_t s_{t+1}} \left[\Delta I^{a_t}_{s_t s_{t+1}} + I^\pi(s_{t+1}; G)\right]$$
and
$$\Delta I^{a_t}_{s_t s_{t+1}} = \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} + \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})}$$
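As a small numeric check of the one-step information gain defined above, the following Python sketch evaluates its two log-ratio terms for hand-picked probabilities. The function name `delta_info`, the use of natural logarithms (so the result is in nats), and the example numbers are assumptions made for illustration; none of them come from the slides.

```python
import math


def delta_info(p_next_given_sa, p_next, pi_next_given_s, pi_next):
    """One-step information gain: log p(s'|s,a)/p(s') + log pi(a'|s')/pi(a')."""
    state_term = math.log(p_next_given_sa / p_next)    # surprise about the next state
    action_term = math.log(pi_next_given_s / pi_next)  # state dependence of the next action
    return state_term + action_term


# Hypothetical numbers: a deterministic transition into a state whose marginal
# probability is 1/4, followed by an action chosen exactly as often as on average.
gain = delta_info(p_next_given_sa=1.0, p_next=0.25,
                  pi_next_given_s=0.5, pi_next=0.5)
print(round(gain, 4))  # 1.3863, i.e. log(4) + log(1)
```

The second term vanishes whenever the policy at the next state matches the marginal action distribution, so a state-independent policy contributes no decision information.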
Agent and Environment interact at discrete time steps: t = 0, 1, 2, …
• agent observes state at step t: $s_t \in S$
• produces action at step t: $a_t \in A(s_t)$
• gets resulting reward: $r_{t+1} \in \Re$
• and resulting next state: $s_{t+1}$
Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'} \left[R^{a}_{ss'} + \gamma V^\pi(s')\right]$$
Solved for $V^\pi(s)$ by DP given $P^{a}_{ss'}$, $R^{a}_{ss'}$ and $\pi$
Agent has goal variable $g \in G$
Interacts with environment at time steps t = 0, 1, 2, …
• estimates/infers an internal state: $\hat{s}_t \in \hat{S}$, characterized by $p(\hat{s} \mid s)$, $p(g \mid \hat{s})$
• produces action at step t: $a_t \in A(s_t)$ with $\pi(a \mid \hat{s})$
• gets estimated information gain: $\Delta I^{a}_{s,s'} \in \Re$
• resulting world next state: $s_{t+1}$ with $P^{a}_{s,s'}$
Bellman equation for $I^\pi$:
$$I^\pi(\hat{s}; g) = \sum_{a} \pi(a \mid \hat{s}) \sum_{s'} P^{a}_{ss'} \left[\Delta I^{a}_{ss'} + I^\pi(\hat{s}'; g)\right]$$
Solved for $I^\pi$ using DP and probabilistic inference
Combining (future) Value and Information
In cases where information is free, we can maximize value irrespective of its
information cost.
In general, however, we want to
(1) reduce decision complexity;
(2) maximize the environment information gain.
These two goals can be achieved by combining the information and value equations
using a Lagrange multiplier, which turns the task into an optimization problem.
Trading Value and Information
‘Free Energy’ formulation:
$$\begin{aligned}
F^\pi(s_t, a_t, \beta) &= I^\pi(s_t, a_t) - \beta Q^\pi(s_t, a_t) \\
&= E_{p(s_{t+1} \mid s_t, a_t)\,\pi(a_{t+1} \mid s_{t+1})}\left[\log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} + \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})} - \beta R^{a_t}_{s_t s_{t+1}} + I^\pi(s_{t+1}, a_{t+1}) - \beta Q^\pi(s_{t+1}, a_{t+1})\right]
\end{aligned}$$
$$\pi(a \mid s) = \frac{\pi(a)}{Z(s, \beta)} \exp\left(-F^\pi(s, a, \beta)\right)$$
$$Z(s, \beta) = \sum_{a} \pi(a) \exp\left(-F^\pi(s, a, \beta)\right)$$
$$\pi(a) = \sum_{s} \pi(a \mid s)\, p(s)$$
INFO-RL Algorithm
$$F^\pi(s_t, a_t, \beta) = E_{p(s_{t+1} \mid s_t, a_t)}\left[\log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} - \beta R^{a_t}_{s_t s_{t+1}} + E_{\pi(a_{t+1} \mid s_{t+1})}\left[\log \frac{\pi(a_{t+1} \mid s_{t+1})}{\pi(a_{t+1})} + F^\pi(s_{t+1}, a_{t+1}, \beta)\right]\right]$$
These 3 equations should be iterated till convergence for every state (like Blahut-Arimoto)
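The alternating iteration can be sketched in plain Python on a toy problem. This is a hedged illustration under several assumptions that are not in the slides: the names `chain_world` and `info_rl`; a deterministic 5-state chain with an absorbing goal in the middle and reward -1 per step; $p(s)$ uniform over non-terminal states; $p(s_{t+1})$ fixed to a uniform distribution instead of the policy-induced marginal; $F = 0$ at the goal; natural logarithms; and a fixed number of sweeps. The policy update is taken as $\pi(a \mid s) \propto \pi(a) \exp(-F)$, the sign under which minimizing $F = I - \beta Q$ favors rewarding actions; this sign convention is also an assumption of the sketch.

```python
import math


def chain_world(n=5, goal=2):
    """Hypothetical deterministic chain: action 0 steps left, action 1 steps right."""
    P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
    R = [[[-1.0] * n for _ in range(2)] for _ in range(n)]  # cost of 1 per step
    for s in range(n):
        P[s][0][max(s - 1, 0)] = 1.0      # bumping into a wall keeps the state
        P[s][1][min(s + 1, n - 1)] = 1.0
    terminal = [s == goal for s in range(n)]
    return P, R, terminal


def info_rl(P, R, beta, terminal, n_sweeps=200):
    """Alternate the updates of pi(a), F(s, a, beta) and pi(a|s)."""
    n_s, n_a = len(P), len(P[0])
    live = [s for s in range(n_s) if not terminal[s]]
    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]  # pi(a|s), initially uniform
    F = [[0.0] * n_a for _ in range(n_s)]         # F stays 0 at terminal states
    for _ in range(n_sweeps):
        # pi(a) = sum_s pi(a|s) p(s), with p(s) uniform over non-terminal states
        # (floored to keep the logarithm finite)
        prior = [max(sum(pi[s][a] for s in live) / len(live), 1e-12)
                 for a in range(n_a)]
        # cont[s'] = E_{pi(a'|s')}[log pi(a'|s')/pi(a') + F(s', a')], zero at the goal
        cont = [0.0] * n_s
        for s in live:
            cont[s] = sum(pi[s][a] * (math.log(pi[s][a] / prior[a]) + F[s][a])
                          for a in range(n_a) if pi[s][a] > 0.0)
        for s in live:
            for a in range(n_a):
                # with p(s') = 1/n_s, log p(s'|s,a)/p(s') = log(n_s * p(s'|s,a))
                F[s][a] = sum(
                    P[s][a][s2] * (math.log(P[s][a][s2] * n_s)
                                   - beta * R[s][a][s2] + cont[s2])
                    for s2 in range(n_s) if P[s][a][s2] > 0.0)
            # pi(a|s) = pi(a) exp(-F) / Z(s, beta); shifting by min(F) avoids overflow
            shift = min(F[s])
            weights = [prior[a] * math.exp(-(F[s][a] - shift)) for a in range(n_a)]
            Z = sum(weights)
            pi[s] = [w / Z for w in weights]
    return F, pi


P, R, terminal = chain_world()
F_high, pi_high = info_rl(P, R, beta=5.0, terminal=terminal)
F_low, pi_low = info_rl(P, R, beta=0.01, terminal=terminal)
for s in range(len(P)):
    print(s, [round(x, 3) for x in pi_low[s]], [round(x, 3) for x in pi_high[s]])
```

In this toy setting a larger $\beta$ weights the reward more heavily, so the policy becomes closer to deterministic in moving toward the goal. Because $p(s_{t+1})$ is held fixed here, the state-information term acts as a constant per-step cost, which is a simplification of the self-consistent scheme.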
Grid World Example
Figure panels: $\beta = 0.01$, $\beta = 0.05$, $\beta = 0.5$, $\beta = 5$
Future Work
• Stochastic world
• More complicated tasks
References
N. Tishby and D. Polani. Information theory of decisions and actions. In Perception-reason-action cycle: Models, algorithms and systems, ed. V. Cutsuridis, A. Hussain, and J. G. Taylor, pages 601–36. Springer, 2010.
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
J. Rubin, O. Shamir, and N. Tishby. Trading value and information. In Decision Making with Imperfect Decision Makers, pages 57–74. Springer, 2012.
Questions

