Reinforcement Learning in 55 Minutes
In 55 minutes: from the basics up to policy iteration in a simple maze environment
Minseop Kim
Groot Seminar
What is reinforcement learning?
• There is no supervisor (no labels)! Only a reward signal.
• It is based on the reward hypothesis.
Sequential Decision Making
• The notion of a time step is introduced.
• Goal: select actions to maximise total future reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward
Reward, Agent, Environment, Action, Observation
• Agent
• The entity that makes the decisions
• The thing we have to train
• Ex) a game character, a helicopter, an exploration robot …
• Environment
• Everything the agent interacts with
• Something we cannot control
• Ex) the game program, the Earth, Mars …
• Reward 𝑅𝑡
• Scalar feedback signal
• Indicates how well the agent is doing at step t
(Figure: the agent-environment interaction loop)
Reward, Agent, Environment, Action, Observation
• Action 𝐴𝑡
• The decision the agent makes at every time step
• Discrete space ex) moving up/down/left/right, the position to place a Go stone …
• Continuous space ex) the bending angle of a robot joint, the speed of a helicopter …
• Observation 𝑂𝑡
• Everything the agent observes from the environment after taking an action
History, State
• History 𝐻𝑡
• The sequence of observations, actions, and rewards
• State 𝑆𝑡
• A function of the history: 𝑆𝑡 = 𝑓(𝐻𝑡)
• Based on the state, the agent assesses the current situation and takes an action
Policy, Return
• Policy 𝜋
• The probability that the agent takes action a in the current state s
• Return 𝐺𝑡
• The total reward accumulated from time step t onward (see the formula just below)
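For concreteness, the return is usually written (as in the David Silver course this deck follows) as the discounted sum of future rewards; the discount factor γ ∈ [0, 1] is an extra symbol not used elsewhere in this deck:
𝐺𝑡 = 𝑅𝑡+1 + γ 𝑅𝑡+2 + γ² 𝑅𝑡+3 + … , i.e. the sum over k ≥ 0 of γᵏ 𝑅𝑡+k+1.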
Value function, Action value function
• Value function 𝑉𝜋(𝑠)
• The expected return when following policy 𝜋 from the current state s
• Used to evaluate the goodness/badness of states
• Action value function 𝑞𝜋(𝑠, 𝑎)
• The expected return when taking action a in the current state s and then following policy 𝜋 (both are written out as expectations just below)
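In symbols, and consistent with the definitions above, both quantities are expectations of the return 𝐺𝑡:
• 𝑉𝜋(𝑠) = 𝔼𝜋[ 𝐺𝑡 | 𝑆𝑡 = 𝑠 ]
• 𝑞𝜋(𝑠, 𝑎) = 𝔼𝜋[ 𝐺𝑡 | 𝑆𝑡 = 𝑠, 𝐴𝑡 = 𝑎 ]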
Simple grid world example
- State: 1~14, plus the terminal states (the shaded cells)
- If the agent bumps into a wall, it receives r = -1 and stays where it is
- Let's compute the value function for states 1~14!
In fact, computing it directly is really hard!
We will learn how in the rest of this talk. :)
Episode
• Episode
• Simply put, one full playthrough of the game
• Ex) 1(east, -1) → 2(east, -1) → 3(west, -1) → 2(south, -1) → 6(east, -1) → 7(south, -1) → 11(south, -1) → Terminal
Ex) 5(east, -1) → 6(south, -1) → 10(south, -1) → 14(east, -1) → Terminal
• Episodic environment: an environment with a terminal state or a fixed number of time steps
• Non-episodic environment: an environment where the game goes on forever
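For example, taking γ = 1 (no discounting), the return of the first episode above is just the sum of its rewards, 𝐺 = 7 × (−1) = −7, and the return of the second episode is 𝐺 = 4 × (−1) = −4.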
Evaluate value function using Monte-Carlo method
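In short, Monte-Carlo evaluation estimates 𝑉𝜋(𝑠) as the empirical mean of the returns observed after visiting s, over many complete episodes generated by following 𝜋. Because the return 𝐺𝑡 can only be computed once an episode has finished, this only works in episodic environments.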
Incremental mean
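The key identity is that the mean of k samples can be updated from the mean of the first k−1 samples without storing them all:
μ_k = μ_{k−1} + (1/k) · (x_k − μ_{k−1})
i.e. new mean = old mean + step size × (new sample − old mean).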
Incremental Monte-Carlo Updates
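Combining the two ideas above, a minimal every-visit incremental Monte-Carlo update can be sketched in Python as follows; the N_table/V_table names follow the "Simple grid world example" slide below, while the episode format (a list of (state, reward) pairs) is an assumption made here for illustration:

import numpy as np

N_table = np.zeros(14)   # visit count N(s) for each state
V_table = np.zeros(14)   # running mean of observed returns, V(s)

def mc_update(episode, gamma=1.0):
    # episode: list of (state, reward) pairs in time order,
    # where state is a 0-based index (slide state 1 -> index 0)
    G = 0.0
    for state, reward in reversed(episode):   # walk backwards to accumulate G_t
        G = reward + gamma * G
        N_table[state] += 1
        # incremental mean: V(s) <- V(s) + (1/N(s)) * (G_t - V(s))
        V_table[state] += (G - V_table[state]) / N_table[state]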
Evaluate value function using TD (Temporal Difference) method
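The TD(0) update replaces the full return with a bootstrapped target built from the next reward and the current value estimate, so it can be applied after every single step, even in non-episodic environments. With a step size α (an extra symbol not used elsewhere in this deck):
V(𝑆𝑡) ← V(𝑆𝑡) + α · ( 𝑅𝑡+1 + γ V(𝑆𝑡+1) − V(𝑆𝑡) )
A one-step sketch on the V_table above (the function name and arguments are illustrative):

def td0_update(V_table, s, r, s_next, done, alpha=0.1, gamma=1.0):
    # bootstrapped target: next reward plus discounted value of the next state
    # (taken as 0 when the episode has just terminated)
    target = r + (0.0 if done else gamma * V_table[s_next])
    V_table[s] += alpha * (target - V_table[s])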
Bias/Variance Trade-Off
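In brief: the Monte-Carlo target 𝐺𝑡 is an unbiased estimate of 𝑉𝜋(𝑆𝑡) but has high variance, since it depends on every random action, transition, and reward until the end of the episode; the TD target 𝑅𝑡+1 + γ V(𝑆𝑡+1) depends on only one random step, so its variance is much lower, but it is biased because it bootstraps from the current, still-imperfect estimate V(𝑆𝑡+1).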
Simple grid world example
• import numpy as np
• Value function table
• N_table = np.zeros(14)        # visit count N(s) for each of the 14 states
• V_table = np.zeros(14)        # value estimate V(s) for each state
• Action value function table
• N_table = np.zeros((14, 4))   # visit count N(s, a) per state-action pair
• Q_table = np.zeros((14, 4))   # action value estimate Q(s, a), 4 actions per state
Policy iteration
1. Compute the 𝑞𝜋 values for the current policy 𝜋 (via MC or TD)
2. Policy improvement, done 𝜀-greedily with respect to 𝑞𝜋
3. Repeat steps 1 and 2 -> this yields the optimal policy 𝜋∗ that maximizes the sum of rewards (the return)
(provided the GLIE condition, Greedy in the Limit with Infinite Exploration, is satisfied)
𝜀-greedy
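A minimal sketch of 𝜀-greedy action selection over the Q_table defined earlier (the function name and the use of numpy's random module are illustrative choices):

import numpy as np

def epsilon_greedy(Q_table, state, epsilon=0.1):
    # with probability epsilon explore (random action),
    # otherwise exploit (action with the highest Q value)
    n_actions = Q_table.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q_table[state]))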
Simple grid world example
Now we can solve it!
Simple maze example
Reward: 1 when the goal is reached, 0 otherwise
State: the agent's coordinates in the maze
Optimal policy
https://github.com/minseop4898/RL-Course-by-David-Silver/blob/master/GLIE%20Monte-Carlo%20Control/GLIE%20MC%20in%20simple%20maze.ipynb
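For reference, the GLIE Monte-Carlo control loop used for a maze like this can be sketched roughly as follows. The env object with reset()/step(action) returning (next_state, reward, done), the table sizes, and the 𝜀 = 1/k schedule are assumptions made here for illustration, not taken verbatim from the notebook:

import numpy as np

n_states, n_actions = 14, 4                  # illustrative sizes
Q_table = np.zeros((n_states, n_actions))
N_table = np.zeros((n_states, n_actions))

def glie_mc_control(env, n_episodes=10000, gamma=1.0):
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                    # GLIE: exploration decays to zero
        # 1. generate one episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q_table[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # 2. policy evaluation: incremental every-visit MC update of Q
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N_table[state, action] += 1
            Q_table[state, action] += (G - Q_table[state, action]) / N_table[state, action]
        # 3. policy improvement is implicit: the next episode acts
        #    epsilon-greedily with respect to the updated Q_table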
Thank you.