Learning Coordination strategies
using reinforcement learning
-- Myriam Z. Abramson, dissertation, 2003
張景照 (Dorgon Chang), 2012-06-14
Index
• Coordination problem (the problem to be solved)
• Evaluation of the Go board
• Reinforcement learning
• Temporal Difference learning (using Sarsa)
• Learning Vector Quantization (LVQ)
• Sarsa LVQ (SLVQ) <= the method proposed by the author
Coordination problem
• The coordination strategy problem is, simply put, an action selection problem.
• When we only know the local situation, how do we choose a correct action, without relying on the end-game state, so that it combines well with the other actions?
• How local tactics influence the overall strategy.
Evaluation of the Go board
This method conveys the spatial connectivity between the stones.
ε is a user-defined threshold: when a point's influence exceeds ε, the influence keeps propagating outward.
Summing all the numbers on the board gives an evaluation value for the whole position.
Black stones radiate +1 outward; white stones radiate -1 outward.
This evaluation value is used as the reward in the methods that follow (see the sketch below).
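A minimal sketch of this influence-based evaluation, assuming a simple breadth-first propagation with decay that stops once a point's influence falls to ε or below; the board encoding, the decay factor, and the parameter values are illustrative assumptions, not the dissertation's exact algorithm.

```python
from collections import deque

EPSILON = 0.1   # assumed propagation threshold (the slide's user-defined ε)
DECAY = 0.5     # assumed decay applied at each propagation step

def evaluate_board(board):
    """Influence-based evaluation: black stones radiate +1, white stones -1,
    influence spreads to neighbours while it still exceeds EPSILON, and the
    board evaluation (used as the reward) is the sum over all points."""
    size = len(board)
    influence = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            if board[y][x] == 0:
                continue
            source = 1.0 if board[y][x] > 0 else -1.0   # +1 black, -1 white
            queue = deque([(x, y, source)])             # breadth-first spread
            seen = {(x, y)}
            while queue:
                cx, cy, value = queue.popleft()
                influence[cy][cx] += value
                nxt = value * DECAY
                if abs(nxt) <= EPSILON:                 # stop once influence falls to ε
                    continue
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = cx + dx, cy + dy
                    if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in seen:
                        seen.add((nx, ny))
                        queue.append((nx, ny, nxt))
    return sum(sum(row) for row in influence)           # reward for the position

# tiny usage example on a 5x5 board: one black stone (+1) and one white stone (-1)
board = [[0] * 5 for _ in range(5)]
board[1][1], board[3][3] = 1, -1
print(evaluate_board(board))
```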
Reinforcement learning:
Introduction
• The goal of machine learning is to produce an agent, and RL is one such method; its defining characteristics are trial-and-error search and delayed reward.
[Figure: agent-environment loop. The agent plays the next stone (action) on the board position (state) and receives a reward such as win, lose, or draw; the agent looks a few moves ahead on the board.]
Reinforcement learning:
Value Function
• π = the policy the agent uses to select actions.
• s = the current state.
• V^π(s): the expected reward obtained from state s under policy π.
• Q^π(s, a): the expected reward obtained by taking action a in state s under policy π.
• The most common policy is ε-greedy (others include greedy, ε-soft, softmax, ...).
• ε lies between 0 and 1; the higher its value, the more exploration is encouraged (exploration vs. exploitation).
• ε-greedy: most of the time choose the action with the highest estimated reward; with a small probability ε choose randomly (see the sketch below).
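A minimal sketch of ε-greedy action selection over a table of action values; the Q-table representation and the parameter value are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

# usage: three candidate moves with current value estimates
q = {"A": 0.7, "B": 0.2, "C": 0.4}
print(epsilon_greedy(q, list(q), epsilon=0.1))
```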
Temporal Difference learning
• TD learning is a method for estimating the value function in RL.
DP: the current estimate is built on previously learned estimates (bootstrapping).
MC: play out random games and use the statistics of their outcomes to handle situations that may arise in the future.
The TD method combines both ideas.
Temporal Difference learning:
Forward View of TD(λ) (1)
• Monte Carlo: observe the rewards for all steps in an episode.
• TD(0): observe one step only; the two-step return observes two steps.
• TD(λ) is a method for averaging all n-step returns.
One-step and two-step returns:

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1}), \qquad R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$

The λ-return averages all n-step returns, and the value of the visited state is updated toward it:

$$R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}, \qquad \Delta V_t(s_t) = \alpha\left[R_t^{\lambda} - V_t(s_t)\right]$$
Setting λ = 0 gives TD(0); setting λ = 1 gives Monte Carlo.
r = the reward at time t; γ = the discount rate applied to future rewards.
R_t^(n) is the total reward obtained by looking n steps ahead from time t (up to the end of the game at step T); it is a scalar (see the sketch below).
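A minimal sketch of the forward view: computing the n-step returns and the λ-return from one recorded episode. The truncation at the episode end (the final weight absorbing the remaining mass) follows the standard finite-horizon formulation; the reward list and parameter values are illustrative assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): n rewards plus the discounted value estimate of the state n steps ahead
    (or the plain truncated sum if the episode ends first)."""
    T = len(rewards)
    steps = min(n, T - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:
        ret += gamma ** n * values[t + n]
    return ret

def lambda_return(rewards, values, t, gamma, lam):
    """R_t^λ = (1-λ) Σ_n λ^(n-1) R_t^(n), with the final weight λ^(T-t-1)
    placed on the complete return."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total

# usage: a 3-step episode; values[k] is the current estimate V(s_k)
rewards = [0.0, 0.0, 1.0]          # reward received after each move
values = [0.1, 0.2, 0.5, 0.0]      # V(s_0) .. V(s_3), terminal state worth 0
print(lambda_return(rewards, values, t=0, gamma=1.0, lam=0.5))
# lam=0.0 reproduces the one-step TD(0) target, lam=1.0 the Monte Carlo return
```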
Temporal Difference learning:
Forward View of TD(λ) (2)
• Monte Carlo: observe the rewards for all steps in an episode.
• TD(0): observe one step only; the two-step return observes two steps.
• TD(λ) is a method for averaging all n-step returns:

$$R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Substituting λ = 0 leaves only the one-step return, R_t^(1) = r_{t+1} + γ V_t(s_{t+1}), so the value update reduces to TD(0); setting λ = 1 instead gives Monte Carlo.
(r = the reward at time t; γ = the discount rate applied to future rewards; R_t^(n) is the n-step return from time t, a scalar.)
Temporal Difference learning:
Forward View of TD(λ) (3)
• Monte Carlo: observe the rewards for all steps in an episode.
• TD(0): observe one step only; the two-step return observes two steps.
• TD(λ) is a method for averaging all n-step returns:

$$R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Substituting λ = 1 makes every finite-term weight (1-λ)λ^(n-1) vanish, leaving all the weight on the complete return, so the value update becomes the Monte Carlo update; setting λ = 0 instead gives TD(0).
(r = the reward at time t; γ = the discount rate applied to future rewards; R_t^(n) is the n-step return from time t, a scalar.)
Temporal Difference learning:
Forward View of TD(λ) (4)
T is the total number of steps in a game; t is the index of the current step within the game.
$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$

[Figure: the state sequence S_0, S_1, S_2, S_3 with weights w_1, w_2, w_3 attached to the 1-, 2-, and 3-step returns. Normalization ensures that the weights sum to 1: Σ_n w_n = 1.]
Set λ = 0.5, t = 0, T = 3 (worked out below).
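A worked example of the weights for λ = 0.5, t = 0, T = 3, assuming the standard finite-horizon λ-return weighting in which each n-step return receives (1-λ)λ^(n-1) and the complete return absorbs the remaining weight λ^(T-t-1):

```latex
\begin{align*}
w_1 &= (1-\lambda)\lambda^{0} = 0.5\\
w_2 &= (1-\lambda)\lambda^{1} = 0.25\\
w_3 &= \lambda^{T-t-1} = \lambda^{2} = 0.25 \qquad\text{(weight on the complete return)}\\
w_1 + w_2 + w_3 &= 1
\end{align*}
```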
Temporal Difference learning:
Forward View of TD(λ) (5)
λ^(T-t-1) = the sum of all the remaining weights; it falls on the complete return.
The lower λ is, the faster the weights decay and the more the earlier (short-horizon) returns are emphasized; the higher λ is, the slower the weights decay and the more the later returns are emphasized.
Summary of the role and meaning of λ:
1. It serves as a bridge between the TD and MC methods.
2. It determines how to punish or reward an action that has no immediate effect.
=> Eligibility Traces
If λ = 0.1 => 1-λ = 0.9 (the first weight is large and the rest drop off quickly); if λ = 0.9 => 1-λ = 0.1 (the weight is spread over many later returns).
The figure above shows the result for λ = 0.5, t = 0, T = 3.
Temporal Difference learning:
Backward View of TD(λ) (1)
• Eligibility Traces:
• Reinforcing Events:
• Value updates:
$$e_t(s) =
\begin{cases}
\gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t\\
\gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}$$

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$
The non-recursive definition of the trace:

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{ss_k}, \qquad
I_{ss_k} =
\begin{cases}
1 & \text{if } s = s_k\\
0 & \text{otherwise}
\end{cases}$$
The reinforcing events (TD errors) are used to update backwards, step by step (see the sketch below).
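A minimal sketch of the backward view: tabular TD(λ) with accumulating eligibility traces, applying the update ΔV(s) = α δ e(s) to every traced state after each step. The episode encoding and the parameter values are illustrative assumptions, and a real Go agent would keep Q(s, a) (or Q(m, a), as in SLVQ below) rather than a plain V table.

```python
from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.5):
    """Backward view of TD(lambda) over one episode.
    episode: list of (state, reward, next_state) transitions."""
    e = defaultdict(float)                      # eligibility trace e(s)
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]    # reinforcing event (TD error)
        e[s] += 1.0                             # accumulate trace for the visited state
        for state in list(e):
            V[state] += alpha * delta * e[state]   # value update for every traced state
            e[state] *= gamma * lam                # decay all traces
    return V

# usage: a tiny 3-transition episode ending in a terminal state "T"
V = defaultdict(float)
episode = [("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, "T")]
print(dict(td_lambda_episode(episode, V)))
```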
Temporal Difference learning:
Backward View of TD(λ) (2)
• Eligibility Traces:
• Reinforcing Events:
• Value updates:
$$e_t(s) =
\begin{cases}
\gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t\\
\gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}$$

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$
Setting λ = 0 makes every trace zero except the one for the current state, so the update reduces to TD(0), whose target is the one-step return:

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$

The non-recursive definition of the trace:

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{ss_k}, \qquad
I_{ss_k} =
\begin{cases}
1 & \text{if } s = s_k\\
0 & \text{otherwise}
\end{cases}$$
Temporal Difference learning:
Why Backward View?
• Forward view
  – theoretical view: conceptually easier to understand
  – not directly implementable: the required information still has to be obtained by simulating forward
• Backward view
  – mechanistic view: easier to implement
  – simple conceptually and computationally
  – in the offline case it achieves the same result as the forward view (this can be proven)
Temporal Difference learning:
Equivalence of the Forward and Backward Views
$$\sum_{t=0}^{T-1} \Delta V_t^{b}(s) = \sum_{t=0}^{T-1} \Delta V_t^{f}(s_t)\, I_{ss_t}, \qquad
I_{ss_t} =
\begin{cases}
1 & \text{if } s = s_t\\
0 & \text{otherwise}
\end{cases}$$

The backward view (left-hand side) and the forward view (right-hand side) produce the same total value update.
Ref: 7.4 Equivalence of the Forward and Backward Views, http://www.cs.ualberta.ca/~sutton/book/7/node1.html (proof that the two are equal in the offline case).
[Figure: the forward-view updates summed out for λ = 1 (MC) and T = 3.]
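A small numerical check of the offline equivalence, as a sketch under these assumptions: the value function V is held fixed during the episode, all increments are accumulated and compared at the end, and the finite-horizon λ-return is used for the forward view. The episode data and parameters are made up.

```python
from collections import defaultdict

def forward_increments(states, rewards, V, alpha, gamma, lam):
    """Forward view: per-state sum of alpha * (lambda-return - V(s_t))."""
    T = len(rewards)
    inc = defaultdict(float)
    for t in range(T):
        ret, n_step = 0.0, []
        for n in range(1, T - t + 1):                 # n-step returns from time t
            ret += gamma ** (n - 1) * rewards[t + n - 1]
            bootstrap = gamma ** n * V[states[t + n]] if t + n < T else 0.0
            n_step.append(ret + bootstrap)
        lam_ret = sum((1 - lam) * lam ** (n - 1) * n_step[n - 1] for n in range(1, T - t))
        lam_ret += lam ** (T - t - 1) * n_step[-1]    # complete return gets the rest
        inc[states[t]] += alpha * (lam_ret - V[states[t]])
    return inc

def backward_increments(states, rewards, V, alpha, gamma, lam):
    """Backward view: accumulate alpha * delta_t * e_t(s) with V held fixed (offline)."""
    T = len(rewards)
    inc, e = defaultdict(float), defaultdict(float)
    for t in range(T):
        delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
        e[states[t]] += 1.0
        for s in list(e):
            inc[s] += alpha * delta * e[s]
            e[s] *= gamma * lam
    return inc

# a made-up 4-step episode; V holds arbitrary fixed estimates, terminal state worth 0
states = ["a", "b", "a", "c", "end"]
rewards = [0.0, 1.0, 0.0, 2.0]
V = {"a": 0.3, "b": -0.1, "c": 0.5, "end": 0.0}
f = forward_increments(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.6)
b = backward_increments(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.6)
print({s: round(f[s], 6) for s in V})
print({s: round(b[s], 6) for s in V})   # the two dictionaries match
```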
Temporal Difference learning:
The Sarsa algorithm
[Figure: Sarsa pseudocode, with the behavior policy and the estimation policy labelled; in Sarsa they are the same policy, i.e., on-policy learning.]
For each game (outer loop), and for each stone played (inner loop), the value is updated. How many steps ahead R_t looks when updating depends on the method used, e.g., Sarsa(λ); the one-step target is R_t^(1) = r_{t+1} + γ V_t(s_{t+1}). A minimal one-step Sarsa sketch follows.
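A minimal sketch of tabular one-step Sarsa with ε-greedy selection, in the spirit of the pseudocode figure. ChainEnv is a made-up toy environment (not from the thesis), and α, γ, ε are illustrative values.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy stand-in environment: walk right along a short chain to reach a reward."""
    def __init__(self, length=5):
        self.length = length
    def reset(self):
        return 0
    def legal_actions(self, state):
        return ["left", "right"]
    def step(self, state, action):
        nxt = state + 1 if action == "right" else max(state - 1, 0)
        done = nxt == self.length - 1
        return nxt, (1.0 if done else 0.0), done

def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular one-step Sarsa: Q(s,a) += alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)
    def choose(state):
        actions = env.legal_actions(state)
        if random.random() < epsilon:                      # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])   # exploit
    for _ in range(episodes):                  # for each game
        state = env.reset()
        action = choose(state)
        done = False
        while not done:                        # for each move
            next_state, reward, done = env.step(state, action)
            next_action = None if done else choose(next_state)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q

Q = sarsa(ChainEnv())
print(max(Q, key=Q.get))   # the highest-valued state/action pair
```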
Learning Vector Quantization
• Main purpose: data compression.
• Basic idea: represent the entire input sample space with a small number of clusters => find a representative point (prototype) for each class.
VQ: suited to data without class information. LVQ: suited to data with class information.
[Figure: M = 3 prototype vectors m1, m2, m3 (drawn as O) among the input data points (drawn as +).]
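A minimal sketch of the basic LVQ1 update rule: the nearest prototype is pulled toward an input of the same class and pushed away from an input of a different class. The prototype initialization, learning rate, and toy data are illustrative assumptions.

```python
import math
import random

def lvq1_train(data, prototypes, lr=0.1, epochs=10):
    """data: list of (vector, label); prototypes: list of [vector, label] (mutable)."""
    for _ in range(epochs):
        random.shuffle(data)
        for x, label in data:
            # find the nearest prototype (geometric distance, as on the slide)
            m = min(prototypes, key=lambda p: math.dist(p[0], x))
            sign = 1.0 if m[1] == label else -1.0   # attract same class, repel other class
            m[0] = [w + sign * lr * (xi - w) for w, xi in zip(m[0], x)]
    return prototypes

# usage: two classes in 2D, three prototypes (M = 3)
data = [([0.0, 0.1], "A"), ([0.2, 0.0], "A"), ([1.0, 1.1], "B"), ([0.9, 1.0], "B")]
prototypes = [[[0.1, 0.1], "A"], [[0.8, 0.9], "B"], [[0.5, 0.5], "B"]]
print(lvq1_train(data, prototypes))
```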
SLVQ: Architecture (1)
<= representative points (prototypes); initially they are scattered randomly over the board.
(An idea of what a SOM looks like.)
Build n agents = a pattern database.
The SOM algorithm can be used to decide dynamically how many prototypes M are needed, i.e., the number of patterns can grow or shrink dynamically.
The agent records the value of each state/action pair it has tried; via the LVQ algorithm, Q(s, a) => Q(m, a), so the size of the state space is greatly compressed.
SLVQ: Architecture (2)
[Figure: with M = 3 prototypes m1, m2, m3, the initial weights of each prototype are generated randomly; at the end of each game the prototypes are updated with LVQ.]
With more and more training games, the prototypes become sufficiently representative => they gradually converge.
When updating the prototypes, a similarity computation (geometric distance) is used to find the matching pattern.
Ref: S. Santini and R. Jain. Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 1999.
The Backward View is used for the update (a combined sketch follows below).
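A minimal sketch of how the pieces above might fit together: each board state is matched to its nearest prototype m, Sarsa-style values are kept as Q(m, a) instead of Q(s, a), and the matched prototypes are nudged toward the states they represented at the end of a game. This is only an interpretation of the slide's Q(s, a) => Q(m, a) compression, not the dissertation's exact algorithm; all names and parameters are assumptions.

```python
import math
import random
from collections import defaultdict

class SLVQAgent:
    def __init__(self, prototypes, alpha=0.1, gamma=0.9, epsilon=0.1, lvq_lr=0.05):
        self.prototypes = prototypes            # list of feature vectors (the m's)
        self.Q = defaultdict(float)             # Q(m, a) instead of Q(s, a)
        self.alpha, self.gamma, self.epsilon, self.lvq_lr = alpha, gamma, epsilon, lvq_lr
        self.matched = []                       # (prototype index, state) pairs this game

    def nearest(self, state):
        """Match the state vector to its nearest prototype (geometric distance)."""
        return min(range(len(self.prototypes)),
                   key=lambda i: math.dist(self.prototypes[i], state))

    def act(self, state, actions):
        m = self.nearest(state)
        self.matched.append((m, state))
        if random.random() < self.epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: self.Q[(m, a)])

    def learn(self, state, action, reward, next_state, next_action):
        """Sarsa update on the compressed state space Q(m, a)."""
        m, m_next = self.nearest(state), self.nearest(next_state)
        target = reward + self.gamma * self.Q[(m_next, next_action)]
        self.Q[(m, action)] += self.alpha * (target - self.Q[(m, action)])

    def end_of_game(self):
        """LVQ-style prototype update at the end of the game: pull each matched
        prototype slightly toward the states it represented, then clear the history."""
        for m, state in self.matched:
            proto = self.prototypes[m]
            self.prototypes[m] = [w + self.lvq_lr * (x - w) for w, x in zip(proto, state)]
        self.matched = []
```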
Candidate Moves (1)
• Empirically, a move is better if it serves multiple purposes. Below are the move features used for Go (a tagging sketch follows the list):
• Attack: reduce the opponent's liberties
• Defend: increase one's own liberties
• Claim: increase one's own influence
• Invade: decrease the opponent's influence
• Connect: join two groups
• Conquer: enclose liberties
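A hypothetical sketch of how a candidate move could be tagged with these features by comparing liberty and influence counts before and after the move; how those counts are obtained is left to a real Go engine, so this only illustrates the bookkeeping, not the thesis's move generator.

```python
def tag_move(own_libs_before, own_libs_after,
             opp_libs_before, opp_libs_after,
             own_infl_before, own_infl_after,
             opp_infl_before, opp_infl_after,
             joins_groups=False, encloses_liberties=False):
    """Tag a candidate move with the features it realises, given liberty and
    influence counts measured before and after playing it."""
    tags = []
    if opp_libs_after < opp_libs_before:
        tags.append("Attack")     # reduces the opponent's liberties
    if own_libs_after > own_libs_before:
        tags.append("Defend")     # increases one's own liberties
    if own_infl_after > own_infl_before:
        tags.append("Claim")      # increases one's own influence
    if opp_infl_after < opp_infl_before:
        tags.append("Invade")     # decreases the opponent's influence
    if joins_groups:
        tags.append("Connect")    # joins two of one's own groups
    if encloses_liberties:
        tags.append("Conquer")    # encloses liberties
    return tags or ["No use"]     # purposeless moves are dropped from the candidates

# usage: a move that both attacks and claims influence
print(tag_move(3, 3, 4, 3, 10.0, 12.5, 8.0, 8.0))
```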
Candidate Moves (2)
Attack: A, B, C, D, E, F => reduce the opponent's liberties
(Black is the attacking side.)
Defend: N, O, P, G, Q => increase one's own liberties
No use: M, L, K, J, I, H => removed from the candidate-move list
These are the possible attack and defense points of one agent in the pattern database.
[Figure: matching a board position against a prototype m from the pattern database.]
References (1)
• English references:
• Myriam Z. Abramson, Learning Coordination Strategies Using Reinforcement Learning, dissertation, George Mason University, Fairfax, VA, 2003
• Shin Ishii, Control of exploitation-exploration meta-parameter in reinforcement learning, Nara Institute of Science and Technology, Neural Networks 15(4-6), pp. 665-687, 2002
• Simon Haykin, Neural Networks and Learning Machines, Third Edition, Chapter 12, Pearson Education
• Richard S. Sutton, A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation, Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta
References (2)
• Chinese references:
• 陳漢鴻 (Chen Han-Hung), Self-Learning for Computer Chinese Chess, Master's thesis, Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, June 2006
References (3)
• Web references:
• Reinforcement Learning, http://www.cse.unsw.edu.au/~cs9417ml/RL1/index.html, 2009.12.03
• Cyber Rodent Project, http://www.cns.atr.jp/cnb/crp/, 2009.12.03
• Off-Policy Learning, http://rl.cs.mcgill.ca/Projects/off-policy.html, 2009.12.03
• [MATH] Monte Carlo Method, http://www.wretch.cc/blog/glCheng/3431370, 2009.12.03
• Intelligent agent, http://en.wikipedia.org/wiki/Intelligent_agent, 2009.12.03
• Simple Competitive Learning, http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/competitive.html, 2009.12.12
• Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html, 2009.12.12
• Tabu search, http://sjchen.im.nuu.edu.tw/Project_Courses/ML/Tabu.pdf, 2009.12.12
• Self Organizing Maps, http://davis.wpi.edu/~matt/courses/soms/, 2009.12.16
• Reinforcement Learning, http://www.informatik.uni-freiburg.de/~ki/teaching/ws0607/advanced/recordings/reinforcement.pdf, 2009.12.25
Editor's Notes

  1. The goal of this thesis is not to learn to play the game of Go as a tournament-class program, but to learn from Go how to approach the coordination strategy problem.
  2. The evaluation of the board is done using Chinese rules, which allow a player to fill in their own territory without penalty and allow a program to play without recognizing life-or-death patterns. This evaluation conveys the spatial connectivity between the stones.
  3. For example: a mouse arrives at a spot, gets an initial reward, and then sniffs around to locate the cheese; afterwards it can imagine in its head how to proceed. Of course, the further ahead it imagines, the less reliable those steps become, so later steps are discounted more heavily. In the figure above: α is the learning rate, used to control the speed of learning; ε adjusts the randomness of action selection, and the larger its value, the more exploration is encouraged; γ is the discount rate, expressing that future rewards are worth less than present ones, hence the discount (a value greater than 1 would mean future rewards are valued more). An important issue for this method is how to balance exploration and exploitation; without that balance, the RL agent's self-learning fails. Why does this matter? Exploration means the agent tries different actions to discover potential gains; exploitation means executing the actions currently known to yield reward. The steps by which an RL agent builds a world model of its environment: 1. The agent observes an input state. 2. It decides what action to take via a decision-making function (policy). 3. It executes the action. 4. The agent receives a scalar reward (reinforcement signal) from the environment; sometimes a critic system converts this primary reinforcement signal into a higher-quality heuristic signal. 5. It records how much reward the state/action pair obtained. After observing enough states, the decision policy can be optimized, so the agent performs well in that particular environment.
  4. Move the policy towards the greedy policy (i.e., ε-soft); it converges to the best ε-soft policy.
  5. The Monte Carlo part: it can learn in a brand-new environment and use previous experience to solve problems. The DP part: the current estimate is built on previously learned estimates. If we cannot compute how good a position is, we play it out many times from that point and count how often it wins; if the winning rate is high, the shape is good. Monte Carlo: observe the rewards for all steps in an episode. The policy used to select actions is called the action-selection policy (or behavior policy); the policy used to decide which new action to evaluate is called the estimation policy (or target policy). On-policy learning follows some policy when selecting new actions, and the value function is updated according to the results of executing them; these policies do not always choose the highest-value action and usually keep some mechanism for exploration. Three very common policies are ε-soft, ε-greedy, and softmax. Off-policy methods can learn different policies through different behavior and evaluation: the behavior policy used to choose actions usually leaves some room for exploration, e.g., ε-greedy, while the estimation policy is fully greedy and updates only with the maximum-reward action. The advantage is that the control of exploration can be separated from the learning procedure. => The value estimate is not immediate.
  6. T = t + 5, t = 0. Set λ to 0 and we get TD(0); set λ to 1 and we get MC, but in a better way. A normalization factor of 1-λ ensures that the weights sum to 1. There are two ways to view eligibility traces: 1. they are a bridge from TD to Monte Carlo methods; 2. the more mechanistic view. Higher λ settings lead to longer-lasting traces; that is, a larger proportion of credit from a reward can be given to more distal states and actions when λ is higher, with λ = 1 producing learning parallel to Monte Carlo RL algorithms. TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play backgammon better than expert human players.
  9. A normalization factor of 1-λ ensures that the weights sum to 1. The resulting backup is toward a return called the λ-return.
  10. There are two viewpoints on why eligibility traces exist: 1. they serve as a bridge between the TD and Monte Carlo methods; 2. they give TD methods more machinery, for example for solving the temporal credit assignment problem => how do we punish or reward an action that has no immediate effect? (They determine the weights placed on future rewards.)
  11. Wherever the game actually is in time right now, S is the state at that time point.
  12. Wherever the game actually is in time right now, S is the state at that time point. We use the TD error to update backwards one step at a time; the current TD error is handed back to the previous states. The backward view of TD(λ) is oriented toward looking backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to the state's eligibility trace at that time. We might imagine ourselves riding along the stream of states, computing TD errors, and shouting them back to the previously visited states, as suggested by Figure 7.8. Where the TD error and traces come together we get the update given by (7.7). To better understand the backward view, consider what happens at various values of λ. If λ = 0, then by (7.5) all traces are zero at t except for the trace corresponding to s_t. Thus the TD(λ) update (7.7) reduces to the simple TD rule (6.2), which we henceforth call TD(0). In terms of Figure 7.8, TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ < 1, more of the preceding states are changed, but each more temporally distant state is changed less because its eligibility trace is smaller, as suggested in the figure. We say that the earlier states are given less credit for the TD error.
  13. Forward view = backward view; this can be proven, see Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html: the sum of all the updates is the same for the two algorithms.
  14. Usually the agent is trained privately through self-play until it reaches a certain level before being put up against human players. With the backward method, when actually deployed online, the update can be made when one game finishes and the next game is about to begin.
  15. "For each episode" => all of the training; "for each step" => a single training game. Training can be done offline or online.
  16. In competitive learning, only the single winning neuron adjusts its weights; the other neurons are not adjusted. LVQ is very close to VQ, but the difference is: VQ is used for data compression and for obtaining codebook vectors, and suits data without class information. LVQ aims to find representative points for data of the same class and then classify with those points, so it suits classification problems. In vector quantization, we assume a codebook defined by M prototype vectors already exists; M is user-defined and the prototype vectors are chosen randomly. Each input belongs to the nearest cluster. LVQ is a supervised version of VQ that can be used when we have labeled input data.
  17. The more prototypes, the slower the convergence. There is a trade-off between the number of codebook vectors M and the learning speed: with more codebook vectors, the internal representation of the task is more detailed => but more time is needed to learn the relationships among them. Self-Organizing Map neural network: an unsupervised SOM algorithm can dynamically decide how many prototypes are needed. (Learning Coordination Strategies Using Reinforcement Learning, dissertation, p. 29.)
  18. Empirically, a move is better if it serves multiple purposes.
  19. Experimental results: SLVQ+STS defeated Wally and minimax. Wally => J. K. Millen, Programming the game of Go, Byte Magazine, April 1981. The codebook weight vectors (300) were initialized with random weights.