Hindsight experience replay paper review

Hindsight Experience Replay
OpenAI
Paper review
Presenter : Uijin Jung

Presenter
• Name : Uijin Jung

• Github : github.com/jinPrelude

• Busan native of 16 years

• Lived in Gyeonggi-do for 3 years, so my accent is mixed

Contents
• Abstract

1. Introduction

2. Background

3. Hindsight Experience Replay

4. Experiments

5. Related work

6. Conclusion

1. Introduction
• Reward engineering limits the applicability of RL in the
real world because it requires both RL expertise and
domain-specific knowledge.

• But dealing with sparse rewards is also one of the biggest
challenges in RL.

• One ability humans have, unlike the current generation of
model-free RL algorithms, is to learn almost as much from
achieving an undesired outcome as from the desired one.

2. Background
• Reinforcement Learning

• Deep Q Learning (DQN)

• Deep Deterministic Policy Gradient (DDPG)

• Universal Value Function Approximators (UVFA)

• Bit flipping environment

State: n bits, each in {0,1}
{0,1} {0,1} {0,1} {0,1} … {0,1} {0,1} {0,1} {0,1}

Actions: A = {0, 1, …, n-1}

Goal: g (goal) = 0 1 0 0 … 1 1 0 1 (n bits)

The episode succeeds when the state equals g.

Slide example: n = 40
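The bit flipping environment above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the class name `BitFlipEnv`, its method names, and the `seed` parameter are hypothetical. The rule that action i flips bit i, and the sparse reward of -1 until the state equals the goal, follow the environment as described in the HER paper.

```python
import random
from typing import List, Tuple


class BitFlipEnv:
    """Hypothetical sketch: state and goal are n-bit vectors."""

    def __init__(self, n: int = 40, seed: int = 0) -> None:
        self.n = n
        self.rng = random.Random(seed)
        self.state: List[int] = []
        self.goal: List[int] = []

    def reset(self) -> Tuple[List[int], List[int]]:
        # Sample a random start state and a random goal, each in {0,1}^n.
        self.state = [self.rng.randint(0, 1) for _ in range(self.n)]
        self.goal = [self.rng.randint(0, 1) for _ in range(self.n)]
        return list(self.state), list(self.goal)

    def step(self, action: int) -> Tuple[List[int], int, bool]:
        # Action i in A = {0, ..., n-1} flips bit i of the state.
        self.state[action] ^= 1
        done = self.state == self.goal
        # Sparse reward: 0 only when the state equals the goal, else -1.
        reward = 0 if done else -1
        return list(self.state), reward, done
```

With n bits there are 2^n possible states, so a randomly acting agent almost never sees a reward other than -1 once n is large. That is the difficulty the rest of the talk addresses.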

3. Hindsight Experience Replay
• Example episode (3 bits): s_0 = 1 1 0, …, s_T = 1 0 0, with goal g = 0 1 1

• The final state misses the goal: s_T ≠ g
Episode reward : {-1, -1, …, -1, -1}

• Every transition is stored in R together with the original goal g:
(s_0, a_0, r_0, s_1, g)
(s_1, a_1, r_1, s_2, g)
…
(s_t, a_t, r_t, s_{t+1}, g)
…
(s_{T-1}, a_{T-1}, r_{T-1}, s_T, g)

• In hindsight, pretend the goal was the state actually reached: s_T = s_T
Episode reward : {-1, -1, …, -1, 0}

• The same transitions are stored again in R′ with the goal replaced by s_T
and the rewards r′_t recomputed against s_T:
(s_0, a_0, r′_0, s_1, s_T)
(s_1, a_1, r′_1, s_2, s_T)
…
(s_t, a_t, r′_t, s_{t+1}, s_T)
…
(s_{T-1}, a_{T-1}, r′_{T-1}, s_T, s_T)

• Memory: both R and R′ are added to the replay memory

• Training: the agent is trained on transitions sampled from Memory
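The relabeling step walked through above can be sketched as follows. The function names `sparse_reward` and `relabel_with_final_state` and the tuple layout are hypothetical choices for illustration. The episode is a made-up 3-bit trajectory chosen to match the slide's s_0 = 1 1 0, s_T = 1 0 0 and g = 0 1 1. Only the goal-replacement idea comes from the talk.

```python
from typing import List, Tuple

State = Tuple[int, ...]
# Layout assumed here: (s_t, a_t, r_t, s_{t+1}, goal)
Transition = Tuple[State, int, int, State, State]


def sparse_reward(next_state: State, goal: State) -> int:
    """Return 0 if the goal is reached, -1 otherwise."""
    return 0 if next_state == goal else -1


def relabel_with_final_state(episode: List[Transition]) -> List[Transition]:
    """Build R' by swapping the goal for the final state s_T."""
    final_state = episode[-1][3]
    return [
        # The reward is recomputed against the substituted goal.
        (s, a, sparse_reward(s_next, final_state), s_next, final_state)
        for (s, a, _r, s_next, _g) in episode
    ]


# Illustrative episode: 110 -> 111 -> 101 -> 100, original goal 011.
goal = (0, 1, 1)
episode = [
    ((1, 1, 0), 2, -1, (1, 1, 1), goal),
    ((1, 1, 1), 1, -1, (1, 0, 1), goal),
    ((1, 0, 1), 2, -1, (1, 0, 0), goal),
]

# Memory holds both the original transitions (R) and the relabeled ones (R').
memory = episode + relabel_with_final_state(episode)
```

The original episode carries rewards {-1, -1, -1}, while the relabeled copy carries {-1, -1, 0}, so the learner sees at least one non-negative reward per episode even when the real goal is never reached.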

4. Experiments
• Three different tasks: pushing, sliding, pick-and-place

• How we define MDPs

• Does HER improve performance?

• Does HER improve performance even if there is only one goal we care
about?

• How does HER interact with reward shaping?

• How many goals should we replay each trajectory with and how to
choose them?

• Deployment on a physical robot

• How many goals should we replay each trajectory
with and how to choose them?
• future: replay with k random states which come from the same episode as the
transition being replayed and were observed after it.

• episode: replay with k random states coming from the same episode as the transition
being replayed.

• random: replay with k random states encountered so far in the whole training
procedure.
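The three strategies differ only in the pool of states that substitute goals are drawn from, which a short sketch makes concrete. The function `sample_goals` and its parameters are hypothetical. Two assumptions are made here: goals are sampled with replacement, and "observed after it" is read as s_{t+1} onward.

```python
import random
from typing import List, Sequence, Tuple

State = Tuple[int, ...]


def sample_goals(
    strategy: str,
    episode_states: Sequence[State],  # s_0 ... s_T of the current episode
    t: int,                           # index of the replayed transition s_t -> s_{t+1}
    all_states: Sequence[State],      # every state encountered so far in training
    k: int,
    rng: random.Random,
) -> List[State]:
    """Pick k substitute goals for the transition at step t."""
    if strategy == "future":
        # Assumption: "observed after it" means s_{t+1}, ..., s_T.
        candidates = episode_states[t + 1:]
    elif strategy == "episode":
        candidates = episode_states
    elif strategy == "random":
        candidates = all_states
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if not candidates:
        return []
    # Assumption: sampling with replacement.
    return [rng.choice(candidates) for _ in range(k)]
```

Each returned goal would then be paired with the replayed transition and a recomputed reward before being stored in the replay memory.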

6. Conclusion
• We showed that HER allows training policies which push,
slide and pick-and-place objects with a robotic arm to the
specified positions while the vanilla RL algorithm fails to
solve these tasks.
