[DL Reading Group] Deep Reinforcement Learning that Matters


Deep Learning JP

2017/12/8 Deep Learning JP: http://deeplearning.jp/seminar-2/


DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Deep Reinforcement Learning that Matters
Reiji Hatsugai

$a_t \sim \pi(a \mid s_t)$
$s_{t+1} \sim P(s' \mid s_t, a_t)$
$r_{t+1} = r(s_t, a_t, s_{t+1})$
$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau} \right]$
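The quantity inside the expectation is the discounted return of one episode. A minimal sketch of how it could be computed for a finite reward sequence; the function name and the example rewards are illustrative and not from the slides:

```python
def discounted_return(rewards, gamma):
    """Return sum over tau of gamma**tau * rewards[tau] for one finite episode."""
    total = 0.0
    # Work backwards so each step needs a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The optimal policy is the one that maximizes the average of this value over episodes; in practice the infinite sum is truncated at the episode length.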

TRPO
DQN DDQN
A3C
UNREAL PCL
ACER
PPO
Q-Prop
IPG
ACKTR
DDPG
D4PG
SAC
Soft Q

Many methods have been developed since reinforcement learning became "deep".

Deep Reinforcement Learning that Matters
• ICML 2017 reproducibility workshop: Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
• Accepted at AAAI 2018

Deep Reinforcement Learning that Matters
•
– ACKTR (Wu et al. 2017)
– PPO (Schulman et al. 2017)
– DDPG (Lillicrap et al. 2015)
– TRPO (Schulman et al. 2015)
• ACKTR, PPO
• DDPG, TRPO baseline

Deep Reinforcement Learning that Matters
• Network Architecture
• Reward Scale
• Random Seeds and Trials
• Environments
• Codebases
• Reporting Evaluation Metrics

Network Architecture
•
– (64, 64) (rllab)
– (100, 50, 25) (Q-Prop)
– (400, 300) (DDPG)
• Activation Function
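To make the three hidden-layer settings above concrete, here is a small sketch that builds a plain MLP for each one and counts its parameters. The observation and action sizes (17 and 6), the weight initialization, and all function names are assumptions for illustration; none of them come from the slides or from the listed codebases.

```python
import math
import random


def init_mlp(sizes, rng):
    """Build [(W, b), ...] for layer widths `sizes` with small uniform weights."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x, activation):
    """Apply each layer; the activation is skipped on the output layer."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [activation(v) for v in x]
    return x


def n_params(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


# Hidden sizes as listed on the slide.
HIDDEN = {"rllab": (64, 64), "Q-Prop": (100, 50, 25), "DDPG": (400, 300)}
ACTIVATIONS = {"tanh": math.tanh, "relu": lambda v: max(0.0, v)}
OBS_DIM, ACT_DIM = 17, 6  # hypothetical environment dimensions

rng = random.Random(0)
param_counts = {
    name: n_params(init_mlp((OBS_DIM, *hidden, ACT_DIM), rng))
    for name, hidden in HIDDEN.items()
}
```

Under these assumed dimensions the (400, 300) network has roughly twenty times as many parameters as the (64, 64) one, which gives a sense of how different the "same" algorithm can be across codebases.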

Network Architecture
• PPO
• Tanh
• PPO
• “This also suggests a possible need for hyper parameter agnostic algorithms”

Reward Scale
• Q DQN clipping
• $\hat{r} = \sigma r$
• σ=0.1
• LeCun et al. 2012; Glorot and Bengio 2010; Vincent, de Brébisson, and Bouthillier 2015
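The two reward treatments named on this slide can be sketched as follows: multiplying each reward by a constant σ (the slide lists σ=0.1), and DQN-style clipping. The clipping range of [-1, 1] and the function names are assumptions for illustration.

```python
def scale_reward(reward, sigma=0.1):
    """Rescale a reward by a constant factor before it is used for training."""
    return sigma * reward


def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a reward into [low, high]; the default range is an assumption."""
    return max(low, min(high, reward))
```

Scaling keeps the relative size of rewards, while clipping discards it for anything outside the range, so the two are not interchangeable.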

Reward Scale
• Reward Scale
• Reward Scale
• Layer norm
• Learning values across many orders of magnitude (van Hasselt et al. 2016)
– adaptive
• HumanoidStandup-v1 100
– Reward Scale

Random Seeds and Trials
• 10 seeds
• 10 split into 5 and 5
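A sketch of the experiment this slide describes: take the results of 10 seeds, split them into two groups of 5, and compare the groups as if they were two different algorithms. The returns below are made-up numbers, and using Welch's t statistic for the comparison is an assumption rather than something the surviving slide text states.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(var_a / len(a) + var_b / len(b))


def split_and_compare(returns, rng):
    """Randomly split the runs into two halves and compare their means."""
    shuffled = list(returns)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    group_a, group_b = shuffled[:half], shuffled[half:]
    return statistics.mean(group_a), statistics.mean(group_b), welch_t(group_a, group_b)


rng = random.Random(0)
# Hypothetical final returns of one algorithm under 10 different seeds.
final_returns = [rng.gauss(3000.0, 800.0) for _ in range(10)]
mean_a, mean_b, t_stat = split_and_compare(final_returns, rng)
```

Because both groups come from the same algorithm and hyperparameters, any gap between `mean_a` and `mean_b` is due to seeds alone.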

Random Seeds and Trials
• 2
–
–
• seed
• power analysis

Environment
• Hopper, HalfCheetah, Swimmer, Walker2D

HalfCheetah
• HalfCheetah DDPG
• Hopper DDPG
• Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
• DDPG Q
• HalfCheetah DDPG DDPG base HalfCheetah unfair

Swimmer
• TRPO
• policy local optimum

Code base
• TRPO DDPG rllab, baseline

Code base
• dramatic impacts on performance

Reporting Evaluation Metrics
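The bullet text for this slide did not survive extraction. As one common way to report an evaluation metric with its uncertainty, here is a sketch of a percentile bootstrap confidence interval for the mean return over runs. That the slide showed this particular method is an assumption; the function name and defaults are illustrative.

```python
import random
import statistics


def bootstrap_mean_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Reporting the interval alongside the mean makes it visible when two learning curves differ by less than their seed-to-seed variation.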

Deep Reinforcement Learning that Matters
•
•
–
–
–
–
•
– hyperparameters agnostic algorithm
• “There is often no clear winner among all benchmark environments.”
44

• HalfCheetah Hopper DDPG stable, unstable
• task difficulty algorithm
• Simple Nearest Neighbor Policy Method for Continuous Control Tasks
– Nearest Neighbor Policy
– task difficulty task
– NN task

• NN-1, NN-2
• NN-1
1.
2. action
• NN-2
1.
2. action 1step 1
• Sparse reward
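The numbered steps for NN-1 and NN-2 were lost in extraction, so the following is only a sketch of the general idea of a nearest-neighbor policy: store states and actions from past trajectories, then at each step return the action attached to the closest stored state. Whether this per-step lookup matches NN-1 or NN-2 exactly is an assumption, as are the class and method names.

```python
import math


def nearest_index(query, states):
    """Index of the stored state closest to `query` in Euclidean distance."""
    return min(range(len(states)), key=lambda i: math.dist(query, states[i]))


class NearestNeighborPolicy:
    """Non-parametric policy: replay the action of the nearest stored state."""

    def __init__(self):
        self.states = []
        self.actions = []

    def add_trajectory(self, states, actions):
        # In practice only successful trajectories would be stored.
        self.states.extend(states)
        self.actions.extend(actions)

    def act(self, state):
        return self.actions[nearest_index(state, self.states)]
```

A usage example: after `policy.add_trajectory([(0.0, 0.0), (1.0, 1.0)], ["left", "right"])`, calling `policy.act((0.9, 0.8))` looks up the stored state `(1.0, 1.0)` and returns its action.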

Simple Nearest Neighbor
• Sparse Mountain Car
• HalfCheetah
• HalfCheetah
• task difficulty
• ICLR 3, 4, 4
• NN Policy

• HalfCheetah
•
– sensor
• 3 MLP
• Towards Generalization and Simplicity in Continuous Control
– Policy parameterization: RBF
– Natural Gradient
– Neural Net humanoid
– MuJoCo (Todorov), Natural Gradient (Kakade)
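As a rough illustration of an RBF-parameterized policy, the sketch below maps an observation to random Fourier features (a standard approximation of an RBF kernel) and applies a linear layer on top. The feature form sin(P·s/ν + φ), the sampling distributions, and every name here are assumptions for illustration; the slide does not spell out the parameterization.

```python
import math
import random


def make_rbf_features(obs_dim, n_features, bandwidth, seed=0):
    """Return a function mapping an observation to random Fourier features."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
                  for _ in range(n_features)]
    phase = [rng.uniform(-math.pi, math.pi) for _ in range(n_features)]

    def features(obs):
        return [
            math.sin(sum(p * s for p, s in zip(row, obs)) / bandwidth + shift)
            for row, shift in zip(projection, phase)
        ]

    return features


def linear_policy(weights, bias, feats):
    """Action mean as a linear function of the features."""
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]
```

Only `weights` and `bias` would be trained (for example with a natural gradient method); the random projection and phases stay fixed, which keeps the policy linear in its parameters.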

Towards Generalization and Simplicity in Continuous Control

• sensor DeepLearning
•
– sparse reward
• IL, IRL??
– normalize


19. Deep Reinforcement Learning that Matters • Network Architecture • Reward Scale • Random Seeds and Trials • Environments • Codebases • Reporting Evaluation Metrics (extrinsic factors)

20. Deep Reinforcement Learning that Matters • Network Architecture • Reward Scale • Random Seeds and Trials • Environments • Codebases • Reporting Evaluation Metrics (intrinsic factors)


32. Random Seeds and Trials: < 0.05
