Computational Models of Intrinsic Motivation, Natsuki Oka

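Open lecture, Cognitive Interaction Design
2016.7.13
Computational Models of Intrinsic Motivation
Natsuki Oka
Kyoto Institute of Technology
Faculty of Information and Human Sciences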

Reinforcement learning
[Figure: agent-environment loop with labels: environment, agent, action, state, reward]

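Reinforcement learning + intrinsic motivation
[Figure: agent-environment loop with labels: environment, agent, action, state, external reward, (A) predictor, internal reward, (B) RL]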

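Various kinds of intrinsic motivation
• Novelty, curiosity
• Familiarity, predictability, ease of learning, task size and time limits
• Sense of achievement
• Social interaction, feeling of being accepted by others
• Goals (improving one's ability, coming to know something, beating others, avoiding failure at any of these)
• Self-efficacy (confidence that one can do well), sense of competence (the feeling of being able to exercise one's ability)
• Sense of self-determination, autonomy
[Figure labels: fun, interest, engagement; information content]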

Reinforcement learning
[Figure: agent-environment loop with labels: environment, agent, action 𝑎, state 𝑠, reward 𝑟]
Objective: get as much reward as possible

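Q-learning (immediate reward only)
$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha \left[ r - Q_k(s,a) \right]$$
[Figure labels: $r$, $Q_k(s,a)$, $Q_{k+1}(s,a)$, $\alpha$, 1]
The reward $r$ may vary.
The value $Q_k(s,a)$ of taking action $a$ in state $s$ converges to the expected value of $r$.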
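The update rule above can be tried numerically. The following is a minimal sketch, assuming a single state-action pair and a hypothetical reward that is 1 with probability 0.7 and 0 otherwise (so its expected value is 0.7); the function names, the step size, and the reward distribution are illustrative choices, not part of the lecture material.

```python
import random


def update(q, r, alpha):
    """One step of Q_{k+1} = Q_k + alpha * (r - Q_k)."""
    return q + alpha * (r - q)


def run(num_steps=5000, alpha=0.01, seed=0):
    """Apply the update repeatedly to a noisy reward and return the final estimate."""
    rng = random.Random(seed)
    q = 0.0  # initial estimate Q_0 (assumed to be 0)
    for _ in range(num_steps):
        # Hypothetical reward: 1 with probability 0.7, otherwise 0.
        r = 1.0 if rng.random() < 0.7 else 0.0
        q = update(q, r, alpha)
    return q


final_q = run()
print(final_q)  # should end up close to the expected reward of 0.7
```

With a constant step size the estimate keeps fluctuating around the expected reward; a smaller $\alpha$ reduces the fluctuation at the cost of slower adaptation.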

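Q-learning (immediate reward only)
When $r$ varies:
$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha \left[ r_{k+1} - Q_k(s,a) \right]$$
Rearranging gives
$$Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$$
→ exponential, recency-weighted average

Derivation:
$$\begin{aligned}
Q_{k+1} &= (1-\alpha) Q_k + \alpha r_{k+1} \\
Q_1 &= (1-\alpha) Q_0 + \alpha r_1 \\
Q_2 &= (1-\alpha) Q_1 + \alpha r_2 = (1-\alpha)\left((1-\alpha) Q_0 + \alpha r_1\right) + \alpha r_2 \\
&\;\;\vdots \\
Q_k &= (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i
\end{aligned}$$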
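The equivalence between the step-by-step update and the recency-weighted average can be checked numerically. This is a small sketch under assumed values: the reward sequence, the initial value, and the step size below are arbitrary examples, and the function names are introduced here for illustration.

```python
def iterate(q0, rewards, alpha):
    """Apply Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k) over the whole reward sequence."""
    q = q0
    for r in rewards:
        q = q + alpha * (r - q)
    return q


def closed_form(q0, rewards, alpha):
    """Evaluate (1 - alpha)^k * Q_0 + sum_i alpha * (1 - alpha)^(k - i) * r_i."""
    k = len(rewards)
    total = (1 - alpha) ** k * q0
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (k - i) * r
    return total


def weights(k, alpha):
    """Weight on Q_0 followed by the weights on r_1 .. r_k."""
    return [(1 - alpha) ** k] + [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]


rewards = [1.0, 0.0, 2.0, 5.0, 3.0]  # arbitrary example rewards
print(iterate(0.5, rewards, 0.3), closed_form(0.5, rewards, 0.3))
print(weights(5, 0.3))  # later rewards get larger weights; the weights sum to 1
```

Because the weights sum to 1 and decay geometrically with age, the estimate is a weighted average that favors recent rewards, which is what lets it track a reward that drifts over time.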

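Action selection
• Softmax method
Choose action $a$ with probability proportional to $e^{Q(s,a)/\tau}$:
$$P(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b=1}^{n} e^{Q(s,b)/\tau}}$$
As the temperature constant $\tau$ grows, selection approaches uniformly random choice; as $\tau$ shrinks, it approaches always choosing the action currently estimated to yield the most reward.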
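The effect of the temperature can be seen in a short sketch. The Q-values below are made-up numbers for one state, and subtracting the maximum before exponentiating is a standard numerical-stability step added here; it does not change the resulting probabilities.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Return selection probabilities proportional to exp(Q / tau)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def select_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [1.0, 2.0, 3.0]  # hypothetical Q(s, a) for three actions
print(softmax_probs(q_values, tau=100.0))  # nearly uniform
print(softmax_probs(q_values, tau=1.0))    # favors the third action
print(softmax_probs(q_values, tau=0.01))   # almost always the third action
```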

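Q-learning that also accounts for delayed reward: the action value converges to the expected sum of future rewards, each attenuated by the discount rate $\gamma$ (the discounted return)
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \right]$$
[Figure: example with node labels S, T, U, V, W and A, B, C, D, E and the values 10 and 2; plot of action value against episode]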
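The update with delayed reward can be sketched on a tiny problem. The environment below is a hypothetical four-state chain invented for illustration (it is not the example in the slide's figure): the agent starts in state 0, moves left or right, and receives a reward of 10 only on reaching the terminal state 3. Actions are chosen uniformly at random, which is enough for Q-learning to learn the values in this deterministic setting.

```python
import random

N_STATES = 4       # states 0..3; state 3 is terminal (hypothetical chain)
ACTIONS = (0, 1)   # 0 = left, 1 = right
GOAL_REWARD = 10.0


def step(s, a):
    """Deterministic transition: reward only when the terminal state is reached."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    r = GOAL_REWARD if s_next == N_STATES - 1 else 0.0
    return s_next, r


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            a = rng.choice(ACTIONS)
            s_next, r = step(s, a)
            # The terminal state's values stay 0, so its max contributes nothing.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


q = q_learning()
for s in range(N_STATES - 1):
    print(s, q[s])
```

In this chain the value of moving right shrinks by a factor of $\gamma$ for each extra step between the agent and the reward, which is the discounting described above.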

Formal Theory of Creativity & Fun & Intrinsic
Motivation (1990-2010) by Jürgen Schmidhuber
http://people.idsia.ch/~juergen/creativity.html
• (A) an adaptive predictor of the growing data
history as the agent is interacting with its
environment
• (B) a reinforcement learner selecting the
actions that shape the history
• (B) is motivated to learn to invent
interesting things that (A) does not yet know
but can easily learn.

(continued)
• To maximize future expected reward, (B)
learns more and more complex behaviors that
yield initially surprising (but eventually
boring) novel patterns that make (A) quickly
improve.

(continued)
• O(t): the state of some observer O at time t
• H(t): its history of previous actions &
sensations & rewards until time t
• Beauty B(D,O(t)) of any data D: the negative
number of bits required to encode D
• Interestingness I(D,O(t)) of data D for
observer O at discrete time
step t>0: I(D,O(t))= B(D,O(t))-B(D,O(t-1))
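The definitions of beauty and interestingness above can be illustrated with a toy observer. In this sketch the observer is assumed to be a simple Bernoulli model of binary data, and the number of bits needed to encode the data is taken to be its negative log-likelihood under that model; this stands in for a general compressor and is not the implementation described in the cited work. All names and the example data are hypothetical.

```python
import math


def code_length_bits(data, p_one):
    """Bits needed to encode a binary sequence under a Bernoulli(p_one) model."""
    return -sum(math.log2(p_one if x == 1 else 1.0 - p_one) for x in data)


def beauty(data, p_one):
    """B(D, O): the negative number of bits the observer needs to encode D."""
    return -code_length_bits(data, p_one)


def learn(data):
    """Laplace-smoothed estimate of P(x = 1), standing in for the observer's learning."""
    return (sum(data) + 1) / (len(data) + 2)


def interestingness(data, p_before, p_after):
    """I(D, O(t)) = B(D, O(t)) - B(D, O(t-1)): bits saved by the latest learning step."""
    return beauty(data, p_after) - beauty(data, p_before)


history = [1, 1, 1, 0, 1, 1, 1, 1]  # made-up observations
p_prev = 0.5                        # observer before learning: no preference
p_now = learn(history)              # (7 + 1) / (8 + 2) = 0.8

print(interestingness(history, p_prev, p_now))  # positive: the data became more compressible
print(interestingness(history, p_now, p_now))   # zero: nothing new was learned
```

The second call shows the "eventually boring" case: once the observer's model stops improving on the data, the interestingness drops to zero even though the data themselves have not changed.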

(continued)
• Intrinsic reward ri(t)=I(H(t),O(t))
• External reward re(t)
• Total reward r(t)=g(ri(t),re(t)), e.g., g(a,b)=a+b

(continued)
Implementations
• Intrinsic reward: prediction error
• Intrinsic reward: improvements in prediction
error
• Intrinsic reward: relative entropies between
the agent's priors and posteriors
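The three kinds of intrinsic reward listed above can each be written in a few lines. The sketch below makes several assumptions that are not in the slides: squared error as the measure of prediction error, a discrete set of hypotheses about a coin's bias for the Bayesian case, and the relative entropy taken as KL(posterior ‖ prior) in bits. The combination with the external reward uses g(a, b) = a + b, the example given earlier.

```python
import math


def prediction_error(prediction, observation):
    """Intrinsic reward 1: squared prediction error (assumed error measure)."""
    return (observation - prediction) ** 2


def learning_progress(error_before, error_after):
    """Intrinsic reward 2: improvement (decrease) in prediction error."""
    return error_before - error_after


def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def kl_divergence_bits(p, q):
    """Intrinsic reward 3: relative entropy KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_reward(r_int, r_ext):
    """r(t) = g(ri(t), re(t)) with g(a, b) = a + b."""
    return r_int + r_ext


# Prediction error and its improvement after one learning step (hypothetical numbers).
pred, obs, lr = 0.5, 1.0, 0.5
err_before = prediction_error(pred, obs)
pred = pred + lr * (obs - pred)  # simple predictor update
err_after = prediction_error(pred, obs)
print(err_before, learning_progress(err_before, err_after))

# Relative entropy between prior and posterior over three coin-bias hypotheses.
biases = [0.2, 0.5, 0.8]             # hypothetical hypotheses for P(heads)
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = bayes_update(prior, biases)  # after observing one head
print(kl_divergence_bits(posterior, prior))
print(total_reward(kl_divergence_bits(posterior, prior), 1.0))
```

Raw prediction error stays high for inputs that are inherently unpredictable, whereas the improvement-based and belief-change-based rewards fall to zero once nothing more can be learned, which is why the three choices can lead to quite different exploratory behavior.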

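(Reference)
2016 Annual Conference of the Japanese Society for Artificial Intelligence, 1O4-OS-22a-3
https://kaigi.org/jsai/webprogram/2016/pdf/273.pdf
Acquisition of the concept of number through interaction
Kyoto Institute of Technology
高井利将, 岡夏樹 (Natsuki Oka), 早川博章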

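Information content and average information content (entropy)
• The information gained on learning that an event with probability $p$ has actually occurred is
$$I = -\log_2 p$$
$p = \frac{1}{2} \rightarrow I = 1$, $p = \frac{1}{4} \rightarrow I = 2$
• When the events have probabilities $p_j$, the average information $H$ gained from one occurrence is
$$H = -\sum_j p_j \log_2 p_j$$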
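Both formulas translate directly into code. The sketch below uses hypothetical function names and treats terms with zero probability as contributing nothing to the entropy, the usual convention.

```python
import math


def information_bits(p):
    """Information gained when an event of probability p occurs: I = -log2(p)."""
    return -math.log2(p)


def entropy_bits(probs):
    """Average information H = -sum_j p_j * log2(p_j); zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(information_bits(0.5))              # 1 bit
print(information_bits(0.25))             # 2 bits
print(entropy_bits([0.5, 0.5]))           # fair coin
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # four equally likely events
```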

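Average information content (entropy)
Coin-toss example: probability of heads $p$
• $H(p) = -\{ p \log_2 p + (1-p) \log_2 (1-p) \}$
[Figure: plot of $H$ against $p$, with both axes running from 0 to 1]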
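A few worked values of the coin-toss entropy show the shape of the curve; the particular values of $p$ are chosen here for illustration.

```latex
\begin{aligned}
H\!\left(\tfrac{1}{2}\right) &= -\left\{ \tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2} \right\} = 1 \\
H\!\left(\tfrac{1}{4}\right) &= -\left\{ \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{3}{4}\log_2\tfrac{3}{4} \right\}
  = \tfrac{1}{2} + \tfrac{3}{4}\log_2\tfrac{4}{3} \approx 0.811 \\
H(p) &\to 0 \quad \text{as } p \to 0 \text{ or } p \to 1
\end{aligned}
```

The entropy is largest for a fair coin, where the outcome is least predictable, and it is symmetric about $p = \frac{1}{2}$ because swapping heads and tails leaves the formula unchanged.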

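In situations of interaction with a person
[Figure: agent-environment loop with labels: environment (including the partner), agent, action, state (including the partner's actions), external reward, (A) predictor, internal reward, (B) RL]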

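In situations of interaction with a person
[Figure: the same agent-environment loop with labels: environment (including the partner), agent, action, state (including the partner's actions), external reward, (A) predictor, internal reward, (B) RL; additional labels: like, dislike, 1 2 3 4, and * to ****]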

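Assignment
• A simple reinforcement learning program is provided as a sample; add an intrinsic motivation mechanism to it. You may change the specification, including inputs and outputs, as needed.
• Deliverables:
– Executable source code
– Report (covering the following)
• Explanation of the specification (how the human provides input, how the agent responds, what kind of intrinsic motivation was added, etc.)
• How the agent's behavior changed once intrinsic motivation was introduced
• How you expect the human's behavior to be affected by that change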

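References
• Simple reinforcement learning sample program
http://www.ii.is.kit.ac.jp/oka/RL5.html
If the sample program is revised, the link above may break. In that case, find the link from
http://www.ii.is.kit.ac.jp/oka/
• Links to HTML and JavaScript introductions and other materials are posted on Moodle (the on-campus e-learning platform)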

References
• Second Interdisciplinary Symposium on
Information-Seeking, Curiosity and Attention
https://openlab-flowers.inria.fr/t/second-interdisciplinary-symposium-on-information-seeking-curiosity-and-attention-neurocuriosity-2016/187
• Information-seeking, curiosity, and attention:
computational and neural mechanisms
http://www.pyoudeyer.com/TICSCuriosity2013.pdf
