Exploration by Random Network Distillation
권휘
Intrinsic vs. Extrinsic Reward
Total reward: $r_t = e_t + i_t$, the sum of the extrinsic reward $e_t$ (the environment's reward) and the intrinsic reward $i_t$ (an exploration bonus).
Intrinsic Reward
From P.-Y. Oudeyer et al., Intrinsic Motivation Systems for Autonomous Mental Development, IEEE Transactions on Evolutionary Computation, 2007.
See also P.-Y. Oudeyer and F. Kaplan, What is intrinsic motivation? A typology of computational approaches, Frontiers in Neurorobotics, 2009.
Intrinsic Reward
Tabular case: visitation count (one standard form is sketched below)
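As a concrete form (a standard count-based bonus from this literature, e.g. MBIE-EB; the symbols are mine, not the slide's): $i_t = \beta / \sqrt{N(s_t)}$, where $N(s_t)$ is the visitation count of state $s_t$ and $\beta$ scales the bonus, so the bonus shrinks as a state is revisited.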
Intrinsic Reward
Non-tabular case: pseudo-counts, prediction error (forward dynamics, inverse dynamics, …); one common form is sketched below
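As a sketch of the prediction-error variant (symbols mine): with an observation embedding $\phi$ and a learned forward model $f$, the bonus is the model's error on the next state, $i_t = \| f(\phi(s_t), a_t) - \phi(s_{t+1}) \|^2$; states the model cannot yet predict earn a large bonus.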
Previous study (‘18 Aug)
From https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
Feature spaces for the forward-dynamics model:
- Pixels
- Random Features (RF)
- Variational Autoencoder (VAE)
- Inverse Dynamics Features (IDF)
Good features should be compact, sufficient, and stationary.
Previous study (‘18 Aug)
- Diverse environments (48 Atari, Mario, 2 Roboschool, two-player Pong, 2 Unity mazes)
- Large scale (~2048 parallel envs)
- Curiosity (intrinsic reward) only
- Infinite horizon
- Stabilization techniques
- Limitation: stochastic dynamics (the noisy-TV problem)
See Yuri Burda et al., Large-Scale Study of Curiosity-Driven Learning, arXiv:1808.04355, 2018.
Environment: Montezuma's Revenge
Environment: the noisy-TV problem
From https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
Random Network Distillation (‘18 Oct)
From https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
Target: a randomly initialized, fixed network
Predictor: a randomly initialized network trained to match the target's outputs (a minimal sketch follows below)
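A minimal sketch of the core mechanism, assuming a PyTorch setup (the architecture, sizes, and names here are illustrative, not the paper's exact CNN):

import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int) -> nn.Module:
    # Target and predictor share an architecture; only the predictor is trained.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim, out_dim = 64, 32                  # illustrative sizes
target = make_net(obs_dim, out_dim)        # randomly initialized, then frozen
predictor = make_net(obs_dim, out_dim)     # randomly initialized, trained
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # Exploration bonus: distillation error ||predictor(o) - target(o)||^2.
    # Novel observations give large errors; familiar ones have been distilled away.
    with torch.no_grad():
        return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

def train_predictor(obs_batch: torch.Tensor) -> None:
    # Minimize the same error on observations gathered by the agent.
    loss = (predictor(obs_batch) - target(obs_batch)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Because the target is a fixed function of the current observation only, stochastic transitions (the noisy-TV) cannot keep the error high forever: the predictor eventually fits whatever the agent keeps seeing.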
Contribution
1. Works well with high-dimensional observations
2. Scales easily to large numbers of parallel environments
3. Sidesteps the stochastic-dynamics (noisy-TV) problem
4. Combines intrinsic and extrinsic rewards
5. Achieves a high score in Montezuma's Revenge (passes the 1st level)
Sources of Prediction Errors
1. Amount of training data
2. Stochasticity
3. Model misspecification
4. Learning dynamics
Relation to Uncertainty Quantification
Relation to Uncertainty Quantification
From I. Osband et al., Randomized Prior Functions for Deep Reinforcement Learning, arXiv:1806.03335, 2018.
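For context, a one-line summary of Osband et al. (my paraphrase): ensemble member $k$ predicts $Q_k(x) = f_{\theta_k}(x) + \beta\,p_k(x)$, where $p_k$ is a fixed, randomly initialized prior network and only $f_{\theta_k}$ is trained; for Bayesian linear regression this yields exact posterior samples. RND fits a predictor to a fixed random network, which is the same as fitting (predictor minus prior) to the constant zero function, so the distillation error can be read as an estimate of epistemic uncertainty.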
Intrinsic + Extrinsic returns
1. Intrinsic: non-episodic, with its own value head (V_I)
2. Extrinsic: episodic, with its own value head (V_E); the policy uses the combined value V = V_E + V_I
3. Reward and observation normalization (sketched below)
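A sketch of both normalizations (the running-statistics class below is my own minimal implementation; the clip range [-5, 5] and dividing intrinsic rewards by a running std of intrinsic returns follow the paper):

import numpy as np

class RunningMeanStd:
    # Tracks a running mean/variance with batch updates (parallel variance formula).
    def __init__(self, shape=()):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4

    def update(self, x: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        self.mean = self.mean + delta * batch_count / tot
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / tot) / tot
        self.count = tot

obs_rms = RunningMeanStd(shape=(84, 84))   # observation stats; shape is illustrative
ret_rms = RunningMeanStd(shape=())         # updated with running intrinsic returns

def normalize_obs(obs: np.ndarray) -> np.ndarray:
    # Observation normalization: whiten, then clip to [-5, 5] (predictor/target input
    # only; the policy network's inputs are not normalized this way).
    return np.clip((obs - obs_rms.mean) / np.sqrt(obs_rms.var + 1e-8), -5.0, 5.0)

def normalize_intrinsic(r_int: np.ndarray) -> np.ndarray:
    # Reward normalization: divide by a running std of intrinsic returns so the
    # bonus scale stays comparable across environments and over training.
    return r_int / np.sqrt(ret_rms.var + 1e-8)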
Experiments
Conditions:
- 30K rollouts of length 128 per environment, with 128 parallel environments
- Discount factor (0.99 vs. 0.999)
- RNN vs. CNN policies
- Number of parallel envs
Metrics:
- Mean episodic return
- Number of rooms the agent finds over the training run
Experiments
Experiments
Pseudo-code
PPO
Collect initial observations to initialize the normalization statistics
Why? (a condensed sketch of the loop follows below)
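To make the flow concrete, here is a condensed, self-contained toy version of the loop (env_step, policy, and intrinsic_reward below are trivial stand-ins, not the paper's components; a real run would use Atari, PPO, and the RND networks sketched earlier):

import numpy as np

# Trivial stand-ins so the control flow runs end to end.
def env_reset(): return np.zeros(4)
def env_step(a): return np.random.randn(4), float(a == 0), False   # obs, ext reward, done
def policy(obs): return np.random.randint(2)
def intrinsic_reward(obs): return float(np.square(obs).mean())      # stub for RND error

# Step 1: act randomly for a while to initialize observation statistics,
# so the predictor/target inputs are sensibly scaled from the start.
warmup = [env_step(policy(None))[0] for _ in range(1000)]
obs_mean = np.mean(warmup, axis=0)
obs_std = np.std(warmup, axis=0) + 1e-8

obs = env_reset()
for iteration in range(10):
    rollout = []
    for t in range(128):                              # rollout of length 128
        a = policy(obs)
        next_obs, r_ext, done = env_step(a)
        norm_obs = np.clip((next_obs - obs_mean) / obs_std, -5, 5)
        r_int = intrinsic_reward(norm_obs)            # RND bonus on the next observation
        rollout.append((obs, a, r_ext, r_int, done))
        obs = env_reset() if done else next_obs
    # A real implementation would now run the PPO update on the combined
    # advantages (episodic extrinsic + non-episodic intrinsic value heads) and
    # several optimization epochs of the distillation loss on the collected
    # observations; the statistics keep being updated throughout training.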
Editor's Notes
  1. Let's first look at the concepts of intrinsic and extrinsic rewards. The authors define the total reward over time as the extrinsic reward given by the environment plus an additional intrinsic reward, an exploration bonus.
  2. The extrinsic reward is the reward the environment is built to provide; let's look at what the newly introduced intrinsic reward is. The paper below describes it as building a mechanism that lets the learning robot evaluate degrees of novelty, surprise, and complexity from its own perspective.
  3. In the environments we train on, novelty and surprise can also be expressed in terms of how often a state has been visited. So in the tabular case we count visits and fold the count into the reward: the more a state is visited, the less surprising it becomes.
  4. In the non-tabular case we cannot count each grid cell, so other methods are needed; among them are CTS, introduced last week, and RND, introduced this week. RND uses a prediction error as the exploration bonus, and in some variants the error comes from predicting the dynamics.
  5. Before looking at RND, let's review the same authors' immediately preceding work, the paper "Large-Scale Study of Curiosity-Driven Learning", which I see as the basis of RND. As briefly noted above, a prediction error, for example from predicting the dynamics, can be used as the exploration bonus. In that paper, the error of a forward-dynamics model is used as the bonus. Various functions can serve as the predictor's feature space; the authors argue it should be compact, sufficient, and stationary. The paper then trains at large scale with curiosity alone and shows it works to a reasonable degree.
  6. The previous study trains under the conditions below and shows that it works. Several feature extractors are used, and the authors say it is hard to judge which is best: learned features generalize somewhat better, while random features do surprisingly well. And there is a problem: it cannot cope well with stochastic dynamics, which is no surprise. The noisy-TV!
  7. This is Montezuma's Revenge, the famous Atari game. Predicting the next observation from this screen does not look like a particularly hard task.
  8. Now, what about an environment like this? It is the noisy-TV environment the authors introduce in their previous paper. A stochastic TV hangs on the wall, and once the agent discovers it, it can no longer move on. The reason: the agent seeks new observations, and compared with the maze it has been exploring, the TV always shows something new, so the agent can never leave. The TV wins!
  9. So, let's solve the stochastic-dynamics problem. If we embed only the current observation into features, then no matter how stochastic the environment is, the screens the agent keeps seeing will keep being learned, so the exploration bonus will shrink. Ta-da!
  10. RND's contributions are as follows. Approaches outside the RND family had difficulty learning from high-dimensional observations, and the authors say RND solves that. As we saw above, it avoids the stochastic-dynamics problem, and, extending the previous work, it also succeeds in performing well on the combined intrinsic + extrinsic problem.
  11. These are the factors that can produce prediction error, i.e. the things that generate the exploration bonus: the amount of training data, stochasticity, the model's capacity (misspecification), and the learning dynamics. As for stochasticity, forward-dynamics prediction can pick up a lot of noise, whereas RND predicts features of the current observation, so it suffers much less!
  12. Osband et al. relate computing uncertainty via Bayesian linear regression to dropout, ensembles, and random priors for ensembles; connecting to that, the authors argue RND's distillation error amounts to computing an uncertainty. I don't fully understand this part.
  13. Now, training intrinsic and extrinsic rewards together: the intrinsic return keeps accumulating as the agent seeks observations, so it is made non-episodic, removing game over so the return keeps growing across episodes (much like a human). The extrinsic return is kept episodic so the agent learns the best path seen up to death. Reward normalization is done because the error scale differs across environments and points in training, making hyperparameters hard to tune. Observation normalization is done because the predictor learns toward a random target network with frozen parameters, so scale mismatches can make learning difficult; the statistics are initialized over a few initial steps and then kept updated throughout training. The policy network's inputs are not separately normalized.
  14. Training proceeds under the conditions below.
  15. (Same notes as slide 13.)
  16. (Same notes as slide 13.)
  17. I have two questions about RND. First, in the final optimization over N_opt steps, why is the distillation loss also optimized multiple times? Second, when learning from the intrinsic reward with no game over and the discount factor applied continuously, training should work as described, but intuitively it seems that as the step count grows, later frames would become less and less meaningful; I wonder whether there is related work on this.