Learning Coordination strategies
using reinforcement learning
-- Myriam Z. Abramson, dissertation, 2003
張景照 (Dorgon Chang), 2012-06-14
Index
• Coordination problem (the problem to be solved)
• Evaluation of the Go board
• Reinforcement learning
• Temporal Difference learning (using Sarsa)
• Learning Vector Quantization (LVQ)
• Sarsa LVQ (SLVQ) <= the method proposed by the author
Coordination problem
• The coordination strategy problem is, simply put, an action selection problem.
• When we only know the local situation, how do we choose a correct action, without relying on the end-game state, so that it combines well with the other actions?
• How local tactics influence the overall strategy.
Evaluation of the Go board
This method conveys the spatial connectivity between the stones.
ε is a user-defined threshold: when a point's influence exceeds ε, the influence keeps propagating outward.
Summing all the numbers on the board gives an evaluation value for the whole position.
Black stones radiate +1 outward; white stones radiate -1 outward.
This evaluation value is used as the reward in the methods that follow (see the sketch below).
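A minimal sketch of this influence-based evaluation, assuming a simple breadth-first propagation with decay that stops once a point's influence falls to ε or below; the board encoding, the decay factor, and the parameter values are illustrative assumptions, not the dissertation's exact algorithm.

```python
from collections import deque

EPSILON = 0.1   # assumed propagation threshold (the slide's user-defined ε)
DECAY = 0.5     # assumed decay applied at each propagation step

def evaluate_board(board):
    """Influence-based evaluation: black stones radiate +1, white stones -1,
    influence spreads to neighbours while it still exceeds EPSILON, and the
    board evaluation (used as the reward) is the sum over all points."""
    size = len(board)
    influence = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            if board[y][x] == 0:
                continue
            source = 1.0 if board[y][x] > 0 else -1.0   # +1 black, -1 white
            queue = deque([(x, y, source)])             # breadth-first spread
            seen = {(x, y)}
            while queue:
                cx, cy, value = queue.popleft()
                influence[cy][cx] += value
                nxt = value * DECAY
                if abs(nxt) <= EPSILON:                 # stop once influence falls to ε
                    continue
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = cx + dx, cy + dy
                    if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in seen:
                        seen.add((nx, ny))
                        queue.append((nx, ny, nxt))
    return sum(sum(row) for row in influence)           # reward for the position

# tiny usage example on a 5x5 board: one black stone (+1) and one white stone (-1)
board = [[0] * 5 for _ in range(5)]
board[1][1], board[3][3] = 1, -1
print(evaluate_board(board))
```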
Reinforcement learning:
Introduction
• The goal of machine learning is to produce an agent, and RL is one such method; its defining characteristics are trial-and-error search and delayed reward.
[Figure: agent-environment loop. The agent plays the next stone (action) on the board position (state) and receives a reward such as win, lose, or draw; the agent looks a few moves ahead on the board.]
Reinforcement learning:
Value Function
• π = the policy the agent uses to select actions.
• s = the current state.
• V^π(s): the expected reward obtained from state s under policy π.
• Q^π(s, a): the expected reward obtained by taking action a in state s under policy π.
• The most common policy is ε-greedy (others include greedy, ε-soft, softmax, ...).
• ε lies between 0 and 1; the higher its value, the more exploration is encouraged (exploration vs. exploitation).
• ε-greedy: most of the time choose the action with the highest estimated reward; with a small probability ε choose randomly (see the sketch below).
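A minimal sketch of ε-greedy action selection over a table of action values; the Q-table representation and the parameter value are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

# usage: three candidate moves with current value estimates
q = {"A": 0.7, "B": 0.2, "C": 0.4}
print(epsilon_greedy(q, list(q), epsilon=0.1))
```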
Temporal Difference learning
• TD learning is a method for estimating the value function in RL.
DP: the current estimate is built on previously learned estimates (bootstrapping).
MC: play out random games and use the statistics of their outcomes to handle situations that may arise in the future.
The TD method combines both ideas.
Temporal Difference learning:
Forward View of TD(λ) (1)
• Monte Carlo: observe the rewards for all steps in an episode.
• TD(0): observe one step only; the two-step return observes two steps.
• TD(λ) is a method for averaging all n-step returns.
One-step and two-step returns:

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1}), \qquad R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$

The λ-return averages all n-step returns, and the value of the visited state is updated toward it:

$$R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}, \qquad \Delta V_t(s_t) = \alpha\left[R_t^{\lambda} - V_t(s_t)\right]$$
Setting λ = 0 gives TD(0); setting λ = 1 gives Monte Carlo.
r = the reward at time t; γ = the discount rate applied to future rewards.
R_t^(n) is the total reward obtained by looking n steps ahead from time t (up to the end of the game at step T); it is a scalar (see the sketch below).
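A minimal sketch of the forward view: computing the n-step returns and the λ-return from one recorded episode. The truncation at the episode end (the final weight absorbing the remaining mass) follows the standard finite-horizon formulation; the reward list and parameter values are illustrative assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): n rewards plus the discounted value estimate of the state n steps ahead
    (or the plain truncated sum if the episode ends first)."""
    T = len(rewards)
    steps = min(n, T - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:
        ret += gamma ** n * values[t + n]
    return ret

def lambda_return(rewards, values, t, gamma, lam):
    """R_t^λ = (1-λ) Σ_n λ^(n-1) R_t^(n), with the final weight λ^(T-t-1)
    placed on the complete return."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total

# usage: a 3-step episode; values[k] is the current estimate V(s_k)
rewards = [0.0, 0.0, 1.0]          # reward received after each move
values = [0.1, 0.2, 0.5, 0.0]      # V(s_0) .. V(s_3), terminal state worth 0
print(lambda_return(rewards, values, t=0, gamma=1.0, lam=0.5))
# lam=0.0 reproduces the one-step TD(0) target, lam=1.0 the Monte Carlo return
```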
Temporal Difference learning:
Forward View of TD(λ) (2)
• Monte Carlo: observe the rewards for all steps in an episode.
• TD(0): observe one step only; the two-step return observes two steps.
• TD(λ) is a method for averaging all n-step returns:

$$R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Substituting λ = 0 leaves only the one-step return, R_t^(1) = r_{t+1} + γ V_t(s_{t+1}), so the value update reduces to TD(0); setting λ = 1 instead gives Monte Carlo.
(r = the reward at time t; γ = the discount rate applied to future rewards; R_t^(n) is the n-step return from time t, a scalar.)
Temporal Difference learning:
Forward View of TD(λ) (3)
• Monte Carlo: observe the rewards for all steps in an episode.
• TD(0): observe one step only; the two-step return observes two steps.
• TD(λ) is a method for averaging all n-step returns:

$$R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$

Substituting λ = 1 makes every finite-term weight (1-λ)λ^(n-1) vanish, leaving all the weight on the complete return, so the value update becomes the Monte Carlo update; setting λ = 0 instead gives TD(0).
(r = the reward at time t; γ = the discount rate applied to future rewards; R_t^(n) is the n-step return from time t, a scalar.)
Temporal Difference learning:
Forward View of TD(λ) (4)
T is the total number of steps in a game; t is the index of the current step within the game.
$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$

[Figure: the state sequence S_0, S_1, S_2, S_3 with weights w_1, w_2, w_3 attached to the 1-, 2-, and 3-step returns. Normalization ensures that the weights sum to 1: Σ_n w_n = 1.]
Set λ = 0.5, t = 0, T = 3 (worked out below).
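A worked example of the weights for λ = 0.5, t = 0, T = 3, assuming the standard finite-horizon λ-return weighting in which each n-step return receives (1-λ)λ^(n-1) and the complete return absorbs the remaining weight λ^(T-t-1):

```latex
\begin{align*}
w_1 &= (1-\lambda)\lambda^{0} = 0.5\\
w_2 &= (1-\lambda)\lambda^{1} = 0.25\\
w_3 &= \lambda^{T-t-1} = \lambda^{2} = 0.25 \qquad\text{(weight on the complete return)}\\
w_1 + w_2 + w_3 &= 1
\end{align*}
```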
Temporal Difference learning:
Forward View of TD(λ) (5)
λ^(T-t-1) = the sum of all the remaining weights; it falls on the complete return.
The lower λ is, the faster the weights decay and the more the earlier (short-horizon) returns are emphasized; the higher λ is, the slower the weights decay and the more the later returns are emphasized.
Summary of the role and meaning of λ:
1. It serves as a bridge between the TD and MC methods.
2. It determines how to punish or reward an action that has no immediate effect.
=> Eligibility Traces
If λ = 0.1 => 1-λ = 0.9 (the first weight is large and the rest drop off quickly); if λ = 0.9 => 1-λ = 0.1 (the weight is spread over many later returns).
The figure above shows the result for λ = 0.5, t = 0, T = 3.
Temporal Difference learning:
Backward View of TD(λ) (1)
• Eligibility Traces:
• Reinforcing Events:
• Value updates:
$$e_t(s) =
\begin{cases}
\gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t\\
\gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}$$

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$
The non-recursive definition of the trace:

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{ss_k}, \qquad
I_{ss_k} =
\begin{cases}
1 & \text{if } s = s_k\\
0 & \text{otherwise}
\end{cases}$$
The reinforcing events (TD errors) are used to update backwards, step by step (see the sketch below).
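A minimal sketch of the backward view: tabular TD(λ) with accumulating eligibility traces, applying the update ΔV(s) = α δ e(s) to every traced state after each step. The episode encoding and the parameter values are illustrative assumptions, and a real Go agent would keep Q(s, a) (or Q(m, a), as in SLVQ below) rather than a plain V table.

```python
from collections import defaultdict

def td_lambda_episode(episode, V, alpha=0.1, gamma=0.9, lam=0.5):
    """Backward view of TD(lambda) over one episode.
    episode: list of (state, reward, next_state) transitions."""
    e = defaultdict(float)                      # eligibility trace e(s)
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]    # reinforcing event (TD error)
        e[s] += 1.0                             # accumulate trace for the visited state
        for state in list(e):
            V[state] += alpha * delta * e[state]   # value update for every traced state
            e[state] *= gamma * lam                # decay all traces
    return V

# usage: a tiny 3-transition episode ending in a terminal state "T"
V = defaultdict(float)
episode = [("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, "T")]
print(dict(td_lambda_episode(episode, V)))
```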
Temporal Difference learning:
Backward View of TD(λ) (2)
• Eligibility Traces:
• Reinforcing Events:
• Value updates:
$$e_t(s) =
\begin{cases}
\gamma\lambda\, e_{t-1}(s) & \text{if } s \neq s_t\\
\gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t
\end{cases}$$

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$

$$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$$
Setting λ = 0 makes every trace zero except the one for the current state, so the update reduces to TD(0), whose target is the one-step return:

$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$

The non-recursive definition of the trace:

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k} I_{ss_k}, \qquad
I_{ss_k} =
\begin{cases}
1 & \text{if } s = s_k\\
0 & \text{otherwise}
\end{cases}$$
Temporal Difference learning:
Why Backward View?
• Forward view
  – theoretical view: conceptually easier to understand
  – not directly implementable: the required information still has to be obtained by simulating forward
• Backward view
  – mechanistic view: easier to implement
  – simple conceptually and computationally
  – in the offline case it achieves the same result as the forward view (this can be proven)
Temporal Difference learning:
Equivalence of the Forward and Backward Views
$$\sum_{t=0}^{T-1} \Delta V_t^{b}(s) = \sum_{t=0}^{T-1} \Delta V_t^{f}(s_t)\, I_{ss_t}, \qquad
I_{ss_t} =
\begin{cases}
1 & \text{if } s = s_t\\
0 & \text{otherwise}
\end{cases}$$

The backward view (left-hand side) and the forward view (right-hand side) produce the same total value update.
Ref: 7.4 Equivalence of the Forward and Backward Views, http://www.cs.ualberta.ca/~sutton/book/7/node1.html (proof that the two are equal in the offline case).
[Figure: the forward-view updates summed out for λ = 1 (MC) and T = 3.]
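A small numerical check of the offline equivalence, as a sketch under these assumptions: the value function V is held fixed during the episode, all increments are accumulated and compared at the end, and the finite-horizon λ-return is used for the forward view. The episode data and parameters are made up.

```python
from collections import defaultdict

def forward_increments(states, rewards, V, alpha, gamma, lam):
    """Forward view: per-state sum of alpha * (lambda-return - V(s_t))."""
    T = len(rewards)
    inc = defaultdict(float)
    for t in range(T):
        ret, n_step = 0.0, []
        for n in range(1, T - t + 1):                 # n-step returns from time t
            ret += gamma ** (n - 1) * rewards[t + n - 1]
            bootstrap = gamma ** n * V[states[t + n]] if t + n < T else 0.0
            n_step.append(ret + bootstrap)
        lam_ret = sum((1 - lam) * lam ** (n - 1) * n_step[n - 1] for n in range(1, T - t))
        lam_ret += lam ** (T - t - 1) * n_step[-1]    # complete return gets the rest
        inc[states[t]] += alpha * (lam_ret - V[states[t]])
    return inc

def backward_increments(states, rewards, V, alpha, gamma, lam):
    """Backward view: accumulate alpha * delta_t * e_t(s) with V held fixed (offline)."""
    T = len(rewards)
    inc, e = defaultdict(float), defaultdict(float)
    for t in range(T):
        delta = rewards[t] + gamma * V[states[t + 1]] - V[states[t]]
        e[states[t]] += 1.0
        for s in list(e):
            inc[s] += alpha * delta * e[s]
            e[s] *= gamma * lam
    return inc

# a made-up 4-step episode; V holds arbitrary fixed estimates, terminal state worth 0
states = ["a", "b", "a", "c", "end"]
rewards = [0.0, 1.0, 0.0, 2.0]
V = {"a": 0.3, "b": -0.1, "c": 0.5, "end": 0.0}
f = forward_increments(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.6)
b = backward_increments(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.6)
print({s: round(f[s], 6) for s in V})
print({s: round(b[s], 6) for s in V})   # the two dictionaries match
```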
Temporal Difference learning:
The Sarsa algorithm
[Figure: Sarsa pseudocode, with the behavior policy and the estimation policy labelled; in Sarsa they are the same policy, i.e., on-policy learning.]
For each game (outer loop), and for each stone played (inner loop), the value is updated. How many steps ahead R_t looks when updating depends on the method used, e.g., Sarsa(λ); the one-step target is R_t^(1) = r_{t+1} + γ V_t(s_{t+1}). A minimal one-step Sarsa sketch follows.
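A minimal sketch of tabular one-step Sarsa with ε-greedy selection, in the spirit of the pseudocode figure. ChainEnv is a made-up toy environment (not from the thesis), and α, γ, ε are illustrative values.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy stand-in environment: walk right along a short chain to reach a reward."""
    def __init__(self, length=5):
        self.length = length
    def reset(self):
        return 0
    def legal_actions(self, state):
        return ["left", "right"]
    def step(self, state, action):
        nxt = state + 1 if action == "right" else max(state - 1, 0)
        done = nxt == self.length - 1
        return nxt, (1.0 if done else 0.0), done

def sarsa(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular one-step Sarsa: Q(s,a) += alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)
    def choose(state):
        actions = env.legal_actions(state)
        if random.random() < epsilon:                      # explore
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])   # exploit
    for _ in range(episodes):                  # for each game
        state = env.reset()
        action = choose(state)
        done = False
        while not done:                        # for each move
            next_state, reward, done = env.step(state, action)
            next_action = None if done else choose(next_state)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q

Q = sarsa(ChainEnv())
print(max(Q, key=Q.get))   # the highest-valued state/action pair
```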
Learning Vector Quantization
• Main purpose: data compression.
• Basic idea: represent the entire input sample space with a small number of clusters => find a representative point (prototype) for each class.
VQ: suited to data without class information. LVQ: suited to data with class information.
[Figure: M = 3 prototype vectors m1, m2, m3 (drawn as O) among the input data points (drawn as +).]
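A minimal sketch of the basic LVQ1 update rule: the nearest prototype is pulled toward an input of the same class and pushed away from an input of a different class. The prototype initialization, learning rate, and toy data are illustrative assumptions.

```python
import math
import random

def lvq1_train(data, prototypes, lr=0.1, epochs=10):
    """data: list of (vector, label); prototypes: list of [vector, label] (mutable)."""
    for _ in range(epochs):
        random.shuffle(data)
        for x, label in data:
            # find the nearest prototype (geometric distance, as on the slide)
            m = min(prototypes, key=lambda p: math.dist(p[0], x))
            sign = 1.0 if m[1] == label else -1.0   # attract same class, repel other class
            m[0] = [w + sign * lr * (xi - w) for w, xi in zip(m[0], x)]
    return prototypes

# usage: two classes in 2D, three prototypes (M = 3)
data = [([0.0, 0.1], "A"), ([0.2, 0.0], "A"), ([1.0, 1.1], "B"), ([0.9, 1.0], "B")]
prototypes = [[[0.1, 0.1], "A"], [[0.8, 0.9], "B"], [[0.5, 0.5], "B"]]
print(lvq1_train(data, prototypes))
```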
SLVQ: Architecture (1)
<= representative points (prototypes); initially they are scattered randomly over the board.
(An idea of what a SOM looks like.)
Build n agents = a pattern database.
The SOM algorithm can be used to decide dynamically how many prototypes M are needed, i.e., the number of patterns can grow or shrink dynamically.
The agent records the value of each state/action pair it has tried; via the LVQ algorithm, Q(s, a) => Q(m, a), so the size of the state space is greatly compressed.
SLVQ: Architecture (2)
[Figure: with M = 3 prototypes m1, m2, m3, the initial weights of each prototype are generated randomly; at the end of each game the prototypes are updated with LVQ.]
With more and more training games, the prototypes become sufficiently representative => they gradually converge.
When updating the prototypes, a similarity computation (geometric distance) is used to find the matching pattern.
Ref: S. Santini and R. Jain. Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 1999.
The Backward View is used for the update (a combined sketch follows below).
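A minimal sketch of how the pieces above might fit together: each board state is matched to its nearest prototype m, Sarsa-style values are kept as Q(m, a) instead of Q(s, a), and the matched prototypes are nudged toward the states they represented at the end of a game. This is only an interpretation of the slide's Q(s, a) => Q(m, a) compression, not the dissertation's exact algorithm; all names and parameters are assumptions.

```python
import math
import random
from collections import defaultdict

class SLVQAgent:
    def __init__(self, prototypes, alpha=0.1, gamma=0.9, epsilon=0.1, lvq_lr=0.05):
        self.prototypes = prototypes            # list of feature vectors (the m's)
        self.Q = defaultdict(float)             # Q(m, a) instead of Q(s, a)
        self.alpha, self.gamma, self.epsilon, self.lvq_lr = alpha, gamma, epsilon, lvq_lr
        self.matched = []                       # (prototype index, state) pairs this game

    def nearest(self, state):
        """Match the state vector to its nearest prototype (geometric distance)."""
        return min(range(len(self.prototypes)),
                   key=lambda i: math.dist(self.prototypes[i], state))

    def act(self, state, actions):
        m = self.nearest(state)
        self.matched.append((m, state))
        if random.random() < self.epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: self.Q[(m, a)])

    def learn(self, state, action, reward, next_state, next_action):
        """Sarsa update on the compressed state space Q(m, a)."""
        m, m_next = self.nearest(state), self.nearest(next_state)
        target = reward + self.gamma * self.Q[(m_next, next_action)]
        self.Q[(m, action)] += self.alpha * (target - self.Q[(m, action)])

    def end_of_game(self):
        """LVQ-style prototype update at the end of the game: pull each matched
        prototype slightly toward the states it represented, then clear the history."""
        for m, state in self.matched:
            proto = self.prototypes[m]
            self.prototypes[m] = [w + self.lvq_lr * (x - w) for w, x in zip(proto, state)]
        self.matched = []
```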
Candidate Moves (1)
• Empirically, a move is better if it serves multiple purposes. Below are the move features used for Go (a tagging sketch follows the list):
• Attack: reduce the opponent's liberties
• Defend: increase one's own liberties
• Claim: increase one's own influence
• Invade: decrease the opponent's influence
• Connect: join two groups
• Conquer: enclose liberties
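A hypothetical sketch of how a candidate move could be tagged with these features by comparing liberty and influence counts before and after the move; how those counts are obtained is left to a real Go engine, so this only illustrates the bookkeeping, not the thesis's move generator.

```python
def tag_move(own_libs_before, own_libs_after,
             opp_libs_before, opp_libs_after,
             own_infl_before, own_infl_after,
             opp_infl_before, opp_infl_after,
             joins_groups=False, encloses_liberties=False):
    """Tag a candidate move with the features it realises, given liberty and
    influence counts measured before and after playing it."""
    tags = []
    if opp_libs_after < opp_libs_before:
        tags.append("Attack")     # reduces the opponent's liberties
    if own_libs_after > own_libs_before:
        tags.append("Defend")     # increases one's own liberties
    if own_infl_after > own_infl_before:
        tags.append("Claim")      # increases one's own influence
    if opp_infl_after < opp_infl_before:
        tags.append("Invade")     # decreases the opponent's influence
    if joins_groups:
        tags.append("Connect")    # joins two of one's own groups
    if encloses_liberties:
        tags.append("Conquer")    # encloses liberties
    return tags or ["No use"]     # purposeless moves are dropped from the candidates

# usage: a move that both attacks and claims influence
print(tag_move(3, 3, 4, 3, 10.0, 12.5, 8.0, 8.0))
```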
Candidate Moves (2)
Attack: A, B, C, D, E, F => reduce the opponent's liberties
(Black is the attacking side.)
Defend: N, O, P, G, Q => increase one's own liberties
No use: M, L, K, J, I, H => removed from the candidate-move list
These are the possible attack and defense points of one agent in the pattern database.
[Figure: matching a board position against a prototype m from the pattern database.]
References (1)
• English references:
• Myriam Z. Abramson, Learning Coordination Strategies Using Reinforcement Learning, dissertation, George Mason University, Fairfax, VA, 2003
• Shin Ishii, Control of exploitation-exploration meta-parameter in reinforcement learning, Nara Institute of Science and Technology, Neural Networks 15(4-6), pp. 665-687, 2002
• Simon Haykin, Neural Networks and Learning Machines, Third Edition, Chapter 12, Pearson Education
• Richard S. Sutton, A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation, Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta
References (2)
• Chinese references:
• 陳漢鴻 (Chen Han-Hung), Self-Learning for Computer Chinese Chess, Master's thesis, Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, June 2006
References (3)
• Web references:
• Reinforcement Learning, http://www.cse.unsw.edu.au/~cs9417ml/RL1/index.html, 2009.12.03
• Cyber Rodent Project, http://www.cns.atr.jp/cnb/crp/, 2009.12.03
• Off-Policy Learning, http://rl.cs.mcgill.ca/Projects/off-policy.html, 2009.12.03
• [MATH] Monte Carlo Method, http://www.wretch.cc/blog/glCheng/3431370, 2009.12.03
• Intelligent agent, http://en.wikipedia.org/wiki/Intelligent_agent, 2009.12.03
• Simple Competitive Learning, http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/competitive.html, 2009.12.12
• Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html, 2009.12.12
• Tabu search, http://sjchen.im.nuu.edu.tw/Project_Courses/ML/Tabu.pdf, 2009.12.12
• Self Organizing Maps, http://davis.wpi.edu/~matt/courses/soms/, 2009.12.16
• Reinforcement Learning, http://www.informatik.uni-freiburg.de/~ki/teaching/ws0607/advanced/recordings/reinforcement.pdf, 2009.12.25
Editor's Notes

  1. The goal of this thesis is not to learn to play the game of Go as a tournament-class program, but to learn from Go how to approach the coordination strategy problem.
  2. The evaluation of the board is done using Chinese rules, which allow a player to fill in their own territory without penalty and allow a program to play without recognizing life-or-death patterns. This evaluation conveys the spatial connectivity between the stones.
  3. For example: a mouse arrives at a spot, gets an initial reward, and then sniffs around to locate the cheese; afterwards it can imagine in its head how to proceed. Of course, the further ahead it imagines, the less reliable those steps become, so later steps are discounted more heavily. In the figure above: α is the learning rate, used to control the speed of learning; ε adjusts the randomness of action selection, and the larger its value, the more exploration is encouraged; γ is the discount rate, expressing that future rewards are worth less than present ones, hence the discount (a value greater than 1 would mean future rewards are valued more). An important issue for this method is how to balance exploration and exploitation; without that balance, the RL agent's self-learning fails. Why does this matter? Exploration means the agent tries different actions to discover potential gains; exploitation means executing the actions currently known to yield reward. The steps by which an RL agent builds a world model of its environment: 1. The agent observes an input state. 2. It decides what action to take via a decision-making function (policy). 3. It executes the action. 4. The agent receives a scalar reward (reinforcement signal) from the environment; sometimes a critic system converts this primary reinforcement signal into a higher-quality heuristic signal. 5. It records how much reward the state/action pair obtained. After observing enough states, the decision policy can be optimized, so the agent performs well in that particular environment.
  4. Move the policy towards the greedy policy (i.e., ε-soft); it converges to the best ε-soft policy.
  5. The Monte Carlo part: it can learn in a brand-new environment and use previous experience to solve problems. The DP part: the current estimate is built on previously learned estimates. If we cannot compute how good a position is, we play it out many times from that point and count how often it wins; if the winning rate is high, the shape is good. Monte Carlo: observe the rewards for all steps in an episode. The policy used to select actions is called the action-selection policy (or behavior policy); the policy used to decide which new action to evaluate is called the estimation policy (or target policy). On-policy learning follows some policy when selecting new actions, and the value function is updated according to the results of executing them; these policies do not always choose the highest-value action and usually keep some mechanism for exploration. Three very common policies are ε-soft, ε-greedy, and softmax. Off-policy methods can learn different policies through different behavior and evaluation: the behavior policy used to choose actions usually leaves some room for exploration, e.g., ε-greedy, while the estimation policy is fully greedy and updates only with the maximum-reward action. The advantage is that the control of exploration can be separated from the learning procedure. => The value estimate is not immediate.
  6. T = t + 5, t = 0. Set λ to 0 and we get TD(0); set λ to 1 and we get MC, but in a better way. A normalization factor of 1-λ ensures that the weights sum to 1. There are two ways to view eligibility traces: 1. they are a bridge from TD to Monte Carlo methods; 2. the more mechanistic view. Higher λ settings lead to longer-lasting traces; that is, a larger proportion of credit from a reward can be given to more distal states and actions when λ is higher, with λ = 1 producing learning parallel to Monte Carlo RL algorithms. TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play backgammon better than expert human players.
  9. A normalization factor of 1-λ ensures that the weights sum to 1. The resulting backup is toward a return called the λ-return.
  10. There are two viewpoints on why eligibility traces exist: 1. they serve as a bridge between the TD and Monte Carlo methods; 2. they give TD methods more machinery, for example for solving the temporal credit assignment problem => how do we punish or reward an action that has no immediate effect? (They determine the weights placed on future rewards.)
  11. Wherever the game actually is in time right now, S is the state at that time point.
  12. Wherever the game actually is in time right now, S is the state at that time point. We use the TD error to update backwards one step at a time; the current TD error is handed back to the previous states. The backward view of TD(λ) is oriented toward looking backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to the state's eligibility trace at that time. We might imagine ourselves riding along the stream of states, computing TD errors, and shouting them back to the previously visited states, as suggested by Figure 7.8. Where the TD error and traces come together we get the update given by (7.7). To better understand the backward view, consider what happens at various values of λ. If λ = 0, then by (7.5) all traces are zero at t except for the trace corresponding to s_t. Thus the TD(λ) update (7.7) reduces to the simple TD rule (6.2), which we henceforth call TD(0). In terms of Figure 7.8, TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ < 1, more of the preceding states are changed, but each more temporally distant state is changed less because its eligibility trace is smaller, as suggested in the figure. We say that the earlier states are given less credit for the TD error.
  13. Forward view = backward view; this can be proven, see Eligibility Traces, http://www.cs.ualberta.ca/~sutton/book/7/node1.html: the sum of all the updates is the same for the two algorithms.
  14. Usually the agent is trained privately through self-play until it reaches a certain level before being put up against human players. With the backward method, when actually deployed online, the update can be made when one game finishes and the next game is about to begin.
  15. "For each episode" => all of the training; "for each step" => a single training game. Training can be done offline or online.
  16. In competitive learning, only the single winning neuron adjusts its weights; the other neurons are not adjusted. LVQ is very close to VQ, but the difference is: VQ is used for data compression and for obtaining codebook vectors, and suits data without class information. LVQ aims to find representative points for data of the same class and then classify with those points, so it suits classification problems. In vector quantization, we assume a codebook defined by M prototype vectors already exists; M is user-defined and the prototype vectors are chosen randomly. Each input belongs to the nearest cluster. LVQ is a supervised version of VQ that can be used when we have labeled input data.
  17. The more prototypes, the slower the convergence. There is a trade-off between the number of codebook vectors M and the learning speed: with more codebook vectors, the internal representation of the task is more detailed => but more time is needed to learn the relationships among them. Self-Organizing Map neural network: an unsupervised SOM algorithm can dynamically decide how many prototypes are needed. (Learning Coordination Strategies Using Reinforcement Learning, dissertation, p. 29.)
  18. Empirically, a move is better if it serves multiple purposes.
  19. Experimental results: SLVQ+STS defeated Wally and minimax. Wally => J. K. Millen, Programming the game of Go, Byte Magazine, April 1981. The codebook weight vectors (300) were initialized with random weights.