Exploration by Random Network Distillation
권휘
Intrinsic vs. Extrinsic Reward
Total reward: $r_t = e_t + i_t$, the sum of the extrinsic reward $e_t$ (the environment's reward) and the intrinsic reward $i_t$ (an exploration bonus).
Intrinsic Reward
From P.-Y. Oudeyer et al., Intrinsic Motivation Systems for Autonomous Mental Development, IEEE Transactions on Evolutionary Computation, 2007.
See also P.-Y. Oudeyer and F. Kaplan, What is intrinsic motivation? A typology of computational approaches, Frontiers in Neurorobotics, 2009.
Intrinsic Reward
Tabular case: visitation count (one standard form is sketched below)
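As a concrete form (a standard count-based bonus from this literature, e.g. MBIE-EB; the symbols are mine, not the slide's): $i_t = \beta / \sqrt{N(s_t)}$, where $N(s_t)$ is the visitation count of state $s_t$ and $\beta$ scales the bonus, so the bonus shrinks as a state is revisited.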
Intrinsic Reward
Non-tabular case: pseudo-counts, prediction error (forward dynamics, inverse dynamics, …); one common form is sketched below
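As a sketch of the prediction-error variant (symbols mine): with an observation embedding $\phi$ and a learned forward model $f$, the bonus is the model's error on the next state, $i_t = \| f(\phi(s_t), a_t) - \phi(s_{t+1}) \|^2$; states the model cannot yet predict earn a large bonus.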
Previous study (‘18 Aug)
From https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
Feature spaces for the forward-dynamics model:
- Pixels
- Random Features (RF)
- Variational Autoencoder (VAE)
- Inverse Dynamics Features (IDF)
Good features should be compact, sufficient, and stationary.
Previous study (‘18 Aug)
- Diverse environments (48 Atari, Mario, 2 Roboschool, two-player Pong, 2 Unity mazes)
- Large scale (~2048 parallel envs)
- Curiosity (intrinsic reward) only
- Infinite horizon
- Stabilization techniques
- Limitation: stochastic dynamics (the noisy-TV problem)
See Yuri Burda et al., Large-Scale Study of Curiosity-Driven Learning, arXiv:1808.04355, 2018.
Environment: Montezuma's Revenge
Environment: the noisy-TV problem
From https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
Random Network Distillation (‘18 Oct)
From https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
Target: a randomly initialized, fixed network
Predictor: a randomly initialized network trained to match the target's outputs (a minimal sketch follows below)
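A minimal sketch of the core mechanism, assuming a PyTorch setup (the architecture, sizes, and names here are illustrative, not the paper's exact CNN):

import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int) -> nn.Module:
    # Target and predictor share an architecture; only the predictor is trained.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim, out_dim = 64, 32                  # illustrative sizes
target = make_net(obs_dim, out_dim)        # randomly initialized, then frozen
predictor = make_net(obs_dim, out_dim)     # randomly initialized, trained
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # Exploration bonus: distillation error ||predictor(o) - target(o)||^2.
    # Novel observations give large errors; familiar ones have been distilled away.
    with torch.no_grad():
        return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

def train_predictor(obs_batch: torch.Tensor) -> None:
    # Minimize the same error on observations gathered by the agent.
    loss = (predictor(obs_batch) - target(obs_batch)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Because the target is a fixed function of the current observation only, stochastic transitions (the noisy-TV) cannot keep the error high forever: the predictor eventually fits whatever the agent keeps seeing.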
Contribution
1. Works well with high-dimensional observations
2. Scales easily to large numbers of parallel environments
3. Sidesteps the stochastic-dynamics (noisy-TV) problem
4. Combines intrinsic and extrinsic rewards
5. Achieves a high score in Montezuma's Revenge (passes the 1st level)
Sources of Prediction Errors
1. Amount of training data
2. Stochasticity
3. Model misspecification
4. Learning dynamics
Relation to Uncertainty Quantification
Relation to Uncertainty Quantification
From I. Osband et al., Randomized Prior Functions for Deep Reinforcement Learning, arXiv:1806.03335, 2018.
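For context, a one-line summary of Osband et al. (my paraphrase): ensemble member $k$ predicts $Q_k(x) = f_{\theta_k}(x) + \beta\,p_k(x)$, where $p_k$ is a fixed, randomly initialized prior network and only $f_{\theta_k}$ is trained; for Bayesian linear regression this yields exact posterior samples. RND fits a predictor to a fixed random network, which is the same as fitting (predictor minus prior) to the constant zero function, so the distillation error can be read as an estimate of epistemic uncertainty.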
Intrinsic + Extrinsic returns
1. Intrinsic: non-episodic, with its own value head (V_I)
2. Extrinsic: episodic, with its own value head (V_E); the policy uses the combined value V = V_E + V_I
3. Reward and observation normalization (sketched below)
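A sketch of both normalizations (the running-statistics class below is my own minimal implementation; the clip range [-5, 5] and dividing intrinsic rewards by a running std of intrinsic returns follow the paper):

import numpy as np

class RunningMeanStd:
    # Tracks a running mean/variance with batch updates (parallel variance formula).
    def __init__(self, shape=()):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4

    def update(self, x: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        self.mean = self.mean + delta * batch_count / tot
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / tot) / tot
        self.count = tot

obs_rms = RunningMeanStd(shape=(84, 84))   # observation stats; shape is illustrative
ret_rms = RunningMeanStd(shape=())         # updated with running intrinsic returns

def normalize_obs(obs: np.ndarray) -> np.ndarray:
    # Observation normalization: whiten, then clip to [-5, 5] (predictor/target input
    # only; the policy network's inputs are not normalized this way).
    return np.clip((obs - obs_rms.mean) / np.sqrt(obs_rms.var + 1e-8), -5.0, 5.0)

def normalize_intrinsic(r_int: np.ndarray) -> np.ndarray:
    # Reward normalization: divide by a running std of intrinsic returns so the
    # bonus scale stays comparable across environments and over training.
    return r_int / np.sqrt(ret_rms.var + 1e-8)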
Experiments
Conditions:
- 30K rollouts of length 128 per environment, with 128 parallel environments
- Discount factor (0.99 vs. 0.999)
- RNN vs. CNN policies
- Number of parallel envs
Metrics:
- Mean episodic return
- Number of rooms the agent finds over the training run
Experiments
Experiments
Pseudo-code
PPO
Collect initial observations to initialize the normalization statistics
Why? (a condensed sketch of the loop follows below)
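To make the flow concrete, here is a condensed, self-contained toy version of the loop (env_step, policy, and intrinsic_reward below are trivial stand-ins, not the paper's components; a real run would use Atari, PPO, and the RND networks sketched earlier):

import numpy as np

# Trivial stand-ins so the control flow runs end to end.
def env_reset(): return np.zeros(4)
def env_step(a): return np.random.randn(4), float(a == 0), False   # obs, ext reward, done
def policy(obs): return np.random.randint(2)
def intrinsic_reward(obs): return float(np.square(obs).mean())      # stub for RND error

# Step 1: act randomly for a while to initialize observation statistics,
# so the predictor/target inputs are sensibly scaled from the start.
warmup = [env_step(policy(None))[0] for _ in range(1000)]
obs_mean = np.mean(warmup, axis=0)
obs_std = np.std(warmup, axis=0) + 1e-8

obs = env_reset()
for iteration in range(10):
    rollout = []
    for t in range(128):                              # rollout of length 128
        a = policy(obs)
        next_obs, r_ext, done = env_step(a)
        norm_obs = np.clip((next_obs - obs_mean) / obs_std, -5, 5)
        r_int = intrinsic_reward(norm_obs)            # RND bonus on the next observation
        rollout.append((obs, a, r_ext, r_int, done))
        obs = env_reset() if done else next_obs
    # A real implementation would now run the PPO update on the combined
    # advantages (episodic extrinsic + non-episodic intrinsic value heads) and
    # several optimization epochs of the distillation loss on the collected
    # observations; the statistics keep being updated throughout training.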
Editor's Notes
  1. Let's first look at the concepts of intrinsic and extrinsic rewards. The authors define the total reward over time as the extrinsic reward given by the environment plus an additional intrinsic reward, an exploration bonus.
  2. The extrinsic reward is the reward the environment is built to provide; let's look at what the newly introduced intrinsic reward is. The paper below describes it as building a mechanism that lets the learning robot evaluate degrees of novelty, surprise, and complexity from its own perspective.
  3. In the environments we train on, novelty and surprise can also be expressed in terms of how often a state has been visited. So in the tabular case we count visits and fold the count into the reward: the more a state is visited, the less surprising it becomes.
  4. In the non-tabular case we cannot count each grid cell, so other methods are needed; among them are CTS, introduced last week, and RND, introduced this week. RND uses a prediction error as the exploration bonus, and in some variants the error comes from predicting the dynamics.
  5. Before looking at RND, let's review the same authors' immediately preceding work, the paper "Large-Scale Study of Curiosity-Driven Learning", which I see as the basis of RND. As briefly noted above, a prediction error, for example from predicting the dynamics, can be used as the exploration bonus. In that paper, the error of a forward-dynamics model is used as the bonus. Various functions can serve as the predictor's feature space; the authors argue it should be compact, sufficient, and stationary. The paper then trains at large scale with curiosity alone and shows it works to a reasonable degree.
  6. The previous study trains under the conditions below and shows that it works. Several feature extractors are used, and the authors say it is hard to judge which is best: learned features generalize somewhat better, while random features do surprisingly well. And there is a problem: it cannot cope well with stochastic dynamics, which is no surprise. The noisy-TV!
  7. This is Montezuma's Revenge, the famous Atari game. Predicting the next observation from this screen does not look like a particularly hard task.
  8. Now, what about an environment like this? It is the noisy-TV environment the authors introduce in their previous paper. A stochastic TV hangs on the wall, and once the agent discovers it, it can no longer move on. The reason: the agent seeks new observations, and compared with the maze it has been exploring, the TV always shows something new, so the agent can never leave. The TV wins!
  9. So, let's solve the stochastic-dynamics problem. If we embed only the current observation into features, then no matter how stochastic the environment is, the screens the agent keeps seeing will keep being learned, so the exploration bonus will shrink. Ta-da!
  10. RND's contributions are as follows. Approaches outside the RND family had difficulty learning from high-dimensional observations, and the authors say RND solves that. As we saw above, it avoids the stochastic-dynamics problem, and, extending the previous work, it also succeeds in performing well on the combined intrinsic + extrinsic problem.
  11. These are the factors that can produce prediction error, i.e. the things that generate the exploration bonus: the amount of training data, stochasticity, the model's capacity (misspecification), and the learning dynamics. As for stochasticity, forward-dynamics prediction can pick up a lot of noise, whereas RND predicts features of the current observation, so it suffers much less!
  12. Osband et al. relate computing uncertainty via Bayesian linear regression to dropout, ensembles, and random priors for ensembles; connecting to that, the authors argue RND's distillation error amounts to computing an uncertainty. I don't fully understand this part.
  13. Now, training intrinsic and extrinsic rewards together: the intrinsic return keeps accumulating as the agent seeks observations, so it is made non-episodic, removing game over so the return keeps growing across episodes (much like a human). The extrinsic return is kept episodic so the agent learns the best path seen up to death. Reward normalization is done because the error scale differs across environments and points in training, making hyperparameters hard to tune. Observation normalization is done because the predictor learns toward a random target network with frozen parameters, so scale mismatches can make learning difficult; the statistics are initialized over a few initial steps and then kept updated throughout training. The policy network's inputs are not separately normalized.
  14. Training proceeds under the conditions below.
  15. (Same notes as slide 13.)
  16. (Same notes as slide 13.)
  17. I have two questions about RND. First, in the final optimization over N_opt steps, why is the distillation loss also optimized multiple times? Second, when learning from the intrinsic reward with no game over and the discount factor applied continuously, training should work as described, but intuitively it seems that as the step count grows, later frames would become less and less meaningful; I wonder whether there is related work on this.