CS294-112 Lecture 06

Deep Reinforcement Learning
CS294-112, 2017 Fall
Lecture 6
손규빈

Department of Industrial and Management Engineering, Korea University

Contents
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements

Can we omit policy gradient completely?
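Advantage: how much better the current action is than the average action under the policy.

$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$

Forget about the policy for a moment and simply pick the action that maximizes the advantage:

$\pi'(a_t \mid s_t) = 1$ if $a_t = \arg\max_{a_t} A^\pi(s_t, a_t)$
$\pi'(a_t \mid s_t) = 0$ otherwise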

Policy iteration
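Algorithm

1. Policy evaluation: evaluate $A^\pi(s, a)$
Compute the value function of the current policy (repeated N times; the estimate gets closer to the true value with each repetition).

2. Policy improvement: set $\pi \leftarrow \pi'$
Improve the policy using the given value function (performed once).

$\pi'(a_t \mid s_t) = 1$ if $a_t = \arg\max_{a_t} A^\pi(s_t, a_t)$
$\pi'(a_t \mid s_t) = 0$ otherwise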

Dynamic programming
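An example where both s and a are discrete:

- 16 states, 4 actions per state

- The value of every state is stored in a table, so there is no approximation error

- Transition matrix: a 16 x 16 x 4 tensor

Bootstrapped update:

$V^\pi(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, a)}[V^\pi(s')] \right]$

Because the policy is deterministic (probabilities 1 and 0), we can write $\pi(s)$ for the chosen action and simplify the update to:

$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]$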

Policy iteration with dynamic programming
Policy iteration

1. evaluate $V^\pi(s)$

2. set $\pi \leftarrow \pi'$

Policy evaluation

$V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]$
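
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 4-state chain, its reward, the discount 0.9, and all names (`P`, `R`, `backup`, `evaluate_policy`, `improve_policy`) are hypothetical and do not come from the slides. The greedy step takes the argmax of $r(s, a) + \gamma \mathbb{E}[V^\pi(s')]$, which picks the same action as the argmax of the advantage.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2   # hypothetical toy chain; actions: 0 = left, 1 = right

def next_state(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

# P[s][a][s2] = p(s2 | s, a): deterministic here, but stored as a full tensor.
P = [[[1.0 if s2 == next_state(s, a) else 0.0 for s2 in range(N_STATES)]
      for a in range(N_ACTIONS)] for s in range(N_STATES)]
# R[s][a] = r(s, a): reward 1 whenever the move ends in the last state.
R = [[1.0 if next_state(s, a) == N_STATES - 1 else 0.0
      for a in range(N_ACTIONS)] for s in range(N_STATES)]

def backup(V, s, a):
    """r(s, a) + gamma * E_{s' ~ p(s'|s,a)}[V(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_STATES))

def evaluate_policy(pi, n_sweeps=200):
    """Policy evaluation: repeat the bootstrapped update for a fixed policy."""
    V = [0.0] * N_STATES
    for _ in range(n_sweeps):
        V = [backup(V, s, pi[s]) for s in range(N_STATES)]
    return V

def improve_policy(V):
    """Policy improvement: pick the greedy action in every state."""
    return [max(range(N_ACTIONS), key=lambda a: backup(V, s, a))
            for s in range(N_STATES)]

def policy_iteration():
    pi = [0] * N_STATES          # start from "always left"
    while True:
        V = evaluate_policy(pi)
        new_pi = improve_policy(V)
        if new_pi == pi:         # policy is stable, so stop
            return pi, V
        pi = new_pi

pi_star, V_star = policy_iteration()
print(pi_star, [round(v, 2) for v in V_star])
```

Evaluation runs many sweeps while improvement runs once per round, matching the "N times" versus "once" split described for policy iteration.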

Even simpler dynamic programming (1/2)
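With the bootstrapped estimate, the advantage is

$A^\pi(s, a) = r(s, a) + \gamma \mathbb{E}[V^\pi(s')] - V^\pi(s)$

The last term does not depend on $a$, so maximizing the advantage over actions is the same as maximizing the Q-function:

$Q^\pi(s, a) = r(s, a) + \gamma \mathbb{E}[V^\pi(s')]$, so $\arg\max_a A^\pi(s, a) = \arg\max_a Q^\pi(s, a)$

Value iteration is the procedure that computes this maximum of the Q-function directly:

1. set $Q(s, a) \leftarrow r(s, a) + \gamma \mathbb{E}[V(s')]$

2. set $V(s) \leftarrow \max_a Q(s, a)$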

Even simpler dynamic programming (2/2)
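If the Q-function value of every (state, action) combination is stored in a table (shown as a figure on the original slide), the maximum of Q over actions can be read off for each state.

Value iteration skips the step of finding an explicit policy and simply repeats the following:

1. set $Q(s, a) \leftarrow r(s, a) + \gamma \mathbb{E}[V(s')]$

2. set $V(s) \leftarrow \max_a Q(s, a)$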
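
As a rough sketch, the two-step loop can be run on a 16-state, 4-action grid like the one mentioned earlier. The grid layout, the terminal goal cell, the reward of 1 for reaching it, and the discount 0.9 are all assumptions made for this example, as are the names `step` and `value_iteration`. The dynamics are deterministic, so the expectation over $s'$ reduces to a single next state.

```python
GAMMA = 0.9
SIZE = 4                        # 4 x 4 grid -> 16 states
N_STATES, N_ACTIONS = SIZE * SIZE, 4
GOAL = N_STATES - 1             # assumption: bottom-right cell is terminal
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Assumed deterministic dynamics: return (next_state, reward)."""
    if s == GOAL:
        return s, 0.0           # absorbing terminal state, no further reward
    row, col = divmod(s, SIZE)
    d_row, d_col = MOVES[a]
    row = min(max(row + d_row, 0), SIZE - 1)   # walls keep the agent in place
    col = min(max(col + d_col, 0), SIZE - 1)
    s2 = row * SIZE + col
    return s2, (1.0 if s2 == GOAL else 0.0)

def value_iteration(n_iters=50):
    V = [0.0] * N_STATES
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_iters):
        # 1. set Q(s, a) <- r(s, a) + gamma * E[V(s')]  (a single s' here)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                s2, r = step(s, a)
                Q[s][a] = r + GAMMA * V[s2]
        # 2. set V(s) <- max_a Q(s, a)
        V = [max(Q[s]) for s in range(N_STATES)]
    return Q, V

Q, V = value_iteration()
for row in range(SIZE):
    print(" ".join(f"{V[row * SIZE + col]:.3f}" for col in range(SIZE)))
```

No policy is ever stored; if one is needed afterwards, it can be read off the table as the argmax over each row of `Q`.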

Fitted value iteration (1/2)
The grid-world approach above cannot be applied to data such as images.

Curse of dimensionality

For an image of shape 200 x 200 x 3, a table would need |S| x |A| Q-function values, where

$|S| = (255^3)^{200 \times 200}$
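
To get a feel for how large this number is, the short calculation below counts its decimal digits using logarithms instead of building the integer. The variable names are made up for the example; the figure of 255 values per channel is taken from the slide as-is.

```python
import math

height, width, channels = 200, 200, 3
values_per_channel = 255          # as on the slide

# log10 of |S| = (255^3)^(200*200) = 255^(3*200*200)
log10_num_states = channels * height * width * math.log10(values_per_channel)
num_digits = math.floor(log10_num_states) + 1

print(f"|S| has {num_digits} decimal digits")
```

A table with one entry per state is therefore out of the question, which is the motivation for fitting a function approximator instead.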

Fitted value iteration (2/2)
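Approximate the value function with a neural network $V_\phi$:

$L(\phi) = \frac{1}{2} \left\| V_\phi(s) - \max_a Q^\pi(s, a) \right\|^2$

Here $V_\phi(s)$ is the prediction ($\hat{y}$) and $\max_a Q^\pi(s, a)$ is the regression target ($y$).

Fitted value iteration algorithm

1. set $y_i \leftarrow \max_{a_i} \left( r(s_i, a_i) + \gamma \mathbb{E}[V_\phi(s'_i)] \right)$

2. set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| V_\phi(s_i) - y_i \right\|^2$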

What if we don't know the transition dynamics?
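Value iteration only works when the transition dynamics are known, because the target

$\max_{a_i} \left( r(s_i, a_i) + \gamma \mathbb{E}[V_\phi(s'_i)] \right)$

requires the outcome of every action from the same state $s_i$.

The fix is to evaluate the Q-function instead of the value function in policy iteration; its backup can be estimated from sampled transitions $(s, a, s', r)$.

1. evaluate $Q^\pi(s, a)$

2. set $\pi \leftarrow \pi'$

Policy evaluation

$Q^\pi(s, a) \leftarrow r(s, a) + \gamma \mathbb{E}_{s' \sim p(s' \mid s, a)}[Q^\pi(s', \pi(s'))]$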

Can we do the "max" trick again?
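Going from policy iteration to value iteration, the explicit policy was replaced by choosing the action with the largest value.

=> The same trick works with Q-values, and the result can be used even when the transition dynamics are unknown.

Fitted Q-iteration algorithm

1. set $y_i \leftarrow r(s_i, a_i) + \gamma \mathbb{E}[V_\phi(s'_i)]$, with $\mathbb{E}[V(s'_i)] \approx \max_{a'} Q_\phi(s'_i, a')$

2. set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| Q_\phi(s_i, a_i) - y_i \right\|^2$

Because a Q-function is already available, the value of the next state can be obtained from it whenever needed; once again it is a max over actions.

+ works with off-policy samples

+ only one network, hence lower variance (no high-variance policy gradient)

- no convergence guarantees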

Fitted Q-iteration
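Full fitted Q-iteration algorithm

1. collect dataset $\{(s_i, a_i, s'_i, r_i)\}$ using some policy

2. set $y_i \leftarrow r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s'_i, a')$

3. set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| Q_\phi(s_i, a_i) - y_i \right\|^2$

Steps 2 and 3 are repeated K times.

Hyperparameters

- dataset size N

- collection policy

- iterations K (of steps 2 and 3)

- gradient steps S (in step 3)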
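
The three steps and the four hyperparameters can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the toy chain environment, the uniformly random collection policy, the learning rate, and the use of a lookup table `phi[s][a]` in place of a neural network $Q_\phi$ are all assumptions made to keep the example dependency-free.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2   # hypothetical toy chain; actions: 0 = left, 1 = right

def env_step(s, a):
    """Assumed environment; the learner only ever sees sampled transitions."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def collect_dataset(n, rng):
    """Step 1: collect N transitions (s, a, s', r) with a uniformly random policy."""
    data = []
    for _ in range(n):
        s, a = rng.randrange(N_STATES), rng.randrange(N_ACTIONS)
        s2, r = env_step(s, a)
        data.append((s, a, s2, r))
    return data

def fitted_q_iteration(data, k_iters=150, grad_steps=20, lr=1.0):
    phi = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q_phi(s, a) = phi[s][a]
    for _ in range(k_iters):                             # K iterations
        # Step 2: targets use the current phi and stay fixed during step 3.
        targets = [r + GAMMA * max(phi[s2]) for (_, _, s2, r) in data]
        # Step 3: S gradient steps on the mean of 1/2 (Q_phi(s_i, a_i) - y_i)^2.
        for _ in range(grad_steps):
            grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
            for (s, a, _, _), y in zip(data, targets):
                grad[s][a] += (phi[s][a] - y) / len(data)
            for s in range(N_STATES):
                for a in range(N_ACTIONS):
                    phi[s][a] -= lr * grad[s][a]
    return phi

rng = random.Random(0)
phi = fitted_q_iteration(collect_dataset(400, rng))
greedy = [max(range(N_ACTIONS), key=lambda a: phi[s][a]) for s in range(N_STATES)]
print(greedy)
```

The dataset is collected once by a policy that never looks at `phi`, yet the greedy policy read off the fitted table still prefers moving right, which is the off-policy property in action.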

Why is this algorithm off-policy?
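Off-policy

- The data does not have to come from the policy that maximizes over actions.
=> The data-collecting policy may even do nothing but explore.

- The policy that chooses actions is separated from the policy used in the learning objective.

The algorithm simply holds a big bucket of (state, action) transitions and does not care how each individual transition was generated.

The training target uses the maximum of the Q-function over next actions, $\max_{a'} Q_\phi(s'_i, a')$, which does not depend on the actions the data-collecting policy took, so learning from such data is fine.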

Online Q-learning algorithm
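Algorithm

1. take some action $a_i$ and observe $(s_i, a_i, s'_i, r_i)$

2. $y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s'_i, a')$

3. $\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(s_i, a_i) \left( Q_\phi(s_i, a_i) - y_i \right)$

Unlike full fitted Q-iteration, each update uses a single sample.

Because the algorithm is off-policy, there is a lot of freedom in how the action in step 1 is chosen.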
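
A minimal sketch of the three-step loop, assuming a toy chain environment and a lookup table in place of a network (both are illustrative choices, not from the slides). For a table, $\frac{dQ_\phi}{d\phi}(s_i, a_i)$ is 1 at the visited entry and 0 elsewhere, so step 3 touches a single number. The action in step 1 is chosen uniformly at random here to emphasize that any behavior policy is allowed.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
N_STATES, N_ACTIONS = 4, 2   # hypothetical toy chain; actions: 0 = left, 1 = right

def env_step(s, a):
    """Assumed environment; the learner never looks inside this function."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def online_q_learning(n_steps=5000, seed=0):
    rng = random.Random(seed)
    phi = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # Q_phi(s, a) = phi[s][a]
    s = 0
    for _ in range(n_steps):
        # 1. take some action and observe (s, a, s', r); here: uniformly random
        a = rng.randrange(N_ACTIONS)
        s2, r = env_step(s, a)
        # 2. y = r + gamma * max_a' Q_phi(s', a')
        y = r + GAMMA * max(phi[s2])
        # 3. phi <- phi - alpha * dQ/dphi * (Q_phi(s, a) - y);
        #    for a table, dQ/dphi is 1 at entry (s, a) and 0 elsewhere
        phi[s][a] -= ALPHA * (phi[s][a] - y)
        s = s2
    return phi

phi = online_q_learning()
greedy = [max(range(N_ACTIONS), key=lambda a: phi[s][a]) for s in range(N_STATES)]
print(greedy)
```

Each transition is used once and then discarded, in contrast to the fixed dataset of full fitted Q-iteration.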

Exploration with Q-learning
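Examples of exploration with Q-learning:

- Epsilon-greedy
With a small probability (the exploration rate, 10% in this example), act randomly.
With the remaining 90% probability, choose the action with the largest Q-function value.

- Boltzmann exploration: turn the Q-values into probabilities and sample the action from them, e.g. $\pi(a_t \mid s_t) \propto \exp(Q_\phi(s_t, a_t))$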
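
Both rules can be written as small action-selection helpers. The sketch below makes two assumptions that are not in the slides: the random branch of epsilon-greedy samples uniformly over all actions (including the greedy one), and the Boltzmann rule takes a `temperature` parameter that rescales the Q-values before the softmax. Function names are illustrative.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon act uniformly at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; subtracting the max keeps exp() from overflowing."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann(q_values, rng, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

rng = random.Random(0)
q_values = [1.0, 3.0, 2.0]
print(epsilon_greedy(q_values, epsilon=0.1, rng=rng))
print([round(p, 3) for p in boltzmann_probs(q_values)])
print(boltzmann(q_values, rng))
```

Epsilon-greedy treats all non-greedy actions alike, while Boltzmann exploration gives actions with higher Q-values a larger share of the exploration.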
