I presented a summary of the Agent57 paper at our RL paper review study group. Agent57, which builds on NGU (Never Give Up) with several improvements, is the first RL algorithm to score above the human benchmark on all 57 Atari games. I hope many people find this helpful.
3. Agent57: Outperforming the Atari Human Benchmark (Badia, A. P. et al., 2020) - Introduction
• Arcade Learning Environment (ALE)
• An interface to a diverse set of Atari 2600 game environments designed to be engaging and challenging for human players
• The Atari 2600 games are well suited for evaluating general competency in AI agents for three main reasons:
• 1) Varied enough to claim generality
• 2) Each interesting enough to be representative of settings that might be faced in practice
• 3) Each created by an independent party to be free of experimenter's bias
4. Introduction
• Previous achievements of deep RL in the ALE:
• DQN (Mnih et al., 2015): 100% HNS on 23 games
• Rainbow (M. Hessel et al., 2017): 100% HNS on 53 games
• R2D2 (Kapturowski et al., 2018): 100% HNS on 52 games
• MuZero (Schrittwieser et al., 2019): 100% HNS on 51 games
→ Despite all efforts, no single RL algorithm had been able to achieve over 100% HNS on all 57 Atari games with one set of hyperparameters.
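For reference, HNS (human normalized score) follows the standard ALE convention: the agent's raw game score is rescaled so that a random policy scores 0% and the average human baseline scores 100%. A minimal sketch; the example baseline numbers are illustrative, not taken from the paper:

```python
def human_normalized_score(agent_score: float,
                           random_score: float,
                           human_score: float) -> float:
    """Standard ALE human normalized score (HNS), in percent.

    100% means the agent matches the average human baseline;
    0% means it is no better than a random policy.
    """
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Example with illustrative Breakout-like baseline values.
print(human_normalized_score(agent_score=400.0,
                             random_score=1.7,
                             human_score=30.5))
```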
5. Introduction
• How do we measure artificial general intelligence?
6. Introduction
• Atari-57 5th percentile performance
7. Introduction
• Challenging issues for deep RL in the ALE
• 1) Long-term credit assignment
• This problem is particularly hard when rewards are delayed and credit needs to be assigned over long sequences of actions.
• The game of Skiing is a canonical example due to its peculiar reward structure.
• The goal of the game is to run downhill through all gates as fast as possible.
• A penalty of five seconds is given for each missed gate.
• The reward, given only at the end, is proportional to the time elapsed.
• Long-term credit assignment is needed to understand why an action taken early in the game (e.g. missing a gate) has a negative impact on the obtained reward.
8. Introduction
• Challenging issues for deep RL in the ALE
• 2) Exploration in large, high-dimensional state spaces
• Games like Private Eye, Montezuma's Revenge, Pitfall! or Venture are widely considered hard-exploration games, as hundreds of actions may be required before a first positive reward is seen.
• In order to succeed, agents need to keep exploring the environment despite the apparent impossibility of finding positive rewards.
9. Introduction
• Challenging issues for deep RL in the ALE
• Exploration algorithms in deep RL generally fall into three categories:
• Randomized value functions
• Unsupervised policy learning
• Intrinsic motivation: NGU
• Other work combines handcrafted features, domain-specific knowledge or privileged pre-training to side-step the exploration problem.
10. Introduction
• In summary, our contributions are as follows:
• 1) A new parameterization of the state-action value function that decomposes the contributions of the intrinsic and extrinsic rewards. As a result, we significantly increase the training stability over a large range of intrinsic reward scales.
• 2) A meta-controller: an adaptive mechanism to select which of the policies (parameterized by exploration rate and discount factor) to prioritize throughout the training process. This allows the agent to control the exploration/exploitation trade-off by dedicating more resources to one or the other.
11. Introduction
• In summary, our contributions are as follows:
• 3) Finally, we demonstrate for the first time performance that is above the human baseline across all 57 Atari games. As part of these experiments, we also find that simply re-tuning the backprop-through-time window to be twice the previously published window for R2D2 led to superior long-term credit assignment (e.g., in Solaris) while still maintaining or improving overall performance on the remaining games.
12. Background: NGU
• Our work builds on top of the Never Give Up (NGU) agent, which combines two ideas:
• Curiosity-driven exploration
• Distributed deep RL agents, in particular R2D2
• NGU computes an intrinsic reward in order to encourage exploration. This reward is defined by combining:
• Per-episode novelty
• Life-long novelty
13. Background: NGU
• Per-episode novelty, $r_t^{\text{episodic}}$, rapidly vanishes over the course of an episode, and it is computed by comparing observations to the contents of an episodic memory.
• Life-long novelty, $\alpha_t$, slowly vanishes throughout training, and it is computed by using a parametric model.
15. Background: NGU
• With this, the intrinsic reward $r_t^i$ is defined as follows:
$$r_t^i = r_t^{\text{episodic}} \cdot \min\{\max\{\alpha_t, 1\}, L\}$$
where $L = 5$ is a chosen maximum reward scaling.
• This leverages the long-term novelty provided by $\alpha_t$, while $r_t^{\text{episodic}}$ continues to encourage the agent to explore within an episode.
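A minimal sketch of this combination, assuming `r_episodic` comes from the episodic-memory comparison and `alpha` from the life-long novelty model (neither is reproduced here):

```python
def intrinsic_reward(r_episodic: float, alpha: float, L: float = 5.0) -> float:
    """NGU intrinsic reward: episodic novelty modulated by life-long novelty.

    alpha is clipped to [1, L], so the life-long term can only amplify the
    per-episode novelty (never suppress it below 1x, never above Lx).
    """
    return r_episodic * min(max(alpha, 1.0), L)
```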
16. Background: NGU
• At time $t$, NGU adds $N$ different scales of the same intrinsic reward, $\beta_j r_t^i$ ($\beta_j \in \mathbb{R}^+$, $j \in \{0, \dots, N-1\}$), to the extrinsic reward provided by the environment, $r_t^e$, to form $N$ potential total rewards $r_{j,t} = r_t^e + \beta_j r_t^i$.
• Consequently, NGU aims to learn the $N$ associated optimal state-action value functions $Q_{r_j}^*$, one for each reward function $r_{j,t}$.
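A sketch of the reward mixing, assuming the vector of exploration rates `betas` has already been chosen:

```python
import numpy as np

def mixed_rewards(r_extrinsic: float, r_intrinsic: float,
                  betas: np.ndarray) -> np.ndarray:
    """Form the N total rewards r_{j,t} = r^e_t + beta_j * r^i_t,
    one per member of the policy family."""
    return r_extrinsic + betas * r_intrinsic

# Example with N = 4 illustrative beta values.
print(mixed_rewards(1.0, 0.2, betas=np.array([0.0, 0.1, 0.2, 0.3])))
```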
17. Background: NGU
• The exploration rates $\beta_j$ are parameters that control the degree of exploration.
• Higher values will encourage exploratory policies.
• Smaller values will encourage exploitative policies.
• Additionally, for purposes of learning long-term credit assignment, each $Q_{r_j}^*$ has its own associated discount factor $\gamma_j$.
18. Background: NGU
• Since the intrinsic reward is typically much more dense than the extrinsic reward, the pairs $\{(\beta_j, \gamma_j)\}_{j=0}^{N-1}$ are chosen so as to allow long-term horizons (high values of $\gamma_j$) for exploitative policies (small values of $\beta_j$), and short-term horizons (low values of $\gamma_j$) for exploratory policies (high values of $\beta_j$).
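A hypothetical schedule illustrating this pairing; the linear ramps and endpoint values below are assumptions for illustration (the paper's actual schedules are more involved):

```python
import numpy as np

def beta_gamma_schedule(N: int = 32, beta_max: float = 0.3,
                        gamma_min: float = 0.99, gamma_max: float = 0.997):
    """Illustrative (beta_j, gamma_j) family: beta_j increases with j
    (more exploratory), while gamma_j decreases (shorter horizon)."""
    j = np.arange(N)
    betas = beta_max * j / (N - 1)                                 # 0 ... beta_max
    gammas = gamma_max - (gamma_max - gamma_min) * j / (N - 1)     # gamma_max ... gamma_min
    return list(zip(betas, gammas))
```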
20. Background: NGU
• To learn the state-action value functions $Q_{r_j}^*$, NGU trains a recurrent neural network $Q(x, a, j; \theta)$, where $j$ is a one-hot vector indexing one of the $N$ implied MDPs (in particular $(\beta_j, \gamma_j)$), $x$ is the current observation, $a$ is an action, and $\theta$ are the parameters of the network (including the recurrent state).
• In practice, NGU can be unstable and fail to learn an appropriate approximation of $Q_{r_j}^*$ for all the state-action value functions in the family, even in simple environments.
21. Background: NGU
• This is especially the case when the scale and sparseness of $r_t^e$ and $r_t^i$ are both different, or when one reward is more noisy than the other.
• We conjecture that learning a common state-action value function for a mix of rewards is difficult when the rewards are very different in nature.
22. Background: NGU
• Our agent is a deep distributed RL agent, in the lineage of R2D2 and NGU.
23. Background: NGU
• It decouples the data collection and the learning processes by having many actors feed data to a central prioritized replay buffer.
• A learner can then sample training data from this buffer.
• More precisely, the replay buffer contains sequences of transitions that are removed regularly in a FIFO manner.
24. Background: NGU
• These sequences come from actor processes that interact with independent copies of the environment, and they are prioritized based on temporal-difference errors.
• The priorities are initialized by the actors and updated by the learner with the updated state-action value function $Q(x, a, j; \theta)$.
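A toy sketch of such a prioritized, FIFO-evicting sequence replay; the priority exponent and single-sequence sampling are simplifying assumptions (R2D2 derives a sequence's priority from a mix of max and mean absolute TD errors and samples batches):

```python
import random
from collections import deque

class PrioritizedSequenceReplay:
    """Toy prioritized replay over transition sequences, evicted FIFO when full."""

    def __init__(self, capacity: int, priority_exponent: float = 0.9):
        self.buffer = deque(maxlen=capacity)      # FIFO eviction via maxlen
        self.priorities = deque(maxlen=capacity)
        self.eta = priority_exponent

    def add(self, sequence, td_error: float):
        """Actors insert sequences with an initial TD-error-based priority."""
        self.buffer.append(sequence)
        self.priorities.append(abs(td_error) ** self.eta)

    def sample(self):
        """Sample one sequence with probability proportional to its priority.
        (Toy caveat: the returned index is only valid until the next add.)"""
        total = sum(self.priorities)
        weights = [p / total for p in self.priorities]
        idx = random.choices(range(len(self.buffer)), weights=weights)[0]
        return idx, self.buffer[idx]

    def update_priority(self, idx: int, td_error: float):
        """The learner refreshes priorities after recomputing TD errors."""
        self.priorities[idx] = abs(td_error) ** self.eta
```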
25. Background: NGU
• According to those priorities, the learner samples sequences of transitions from the replay buffer to construct an RL loss.
• Then, it updates the parameters of the neural network $Q(x, a, j; \theta)$ by minimizing the RL loss to approximate the optimal state-action value function.
• Finally, each actor shares the same network architecture as the learner, but with different weights.
26. Background: NGU
• We refer to the parameters of the $l$-th actor as $\theta_l$. The learner weights $\theta$ are sent to the actors frequently, which allows each actor to update its own weights $\theta_l$.
• Each actor uses a different value $\epsilon_l$, which is employed to follow an $\epsilon_l$-greedy policy based on the current estimate of the state-action value function $Q(x, a, j; \theta_l)$.
• In particular, at the beginning of each episode, each actor in NGU uniformly selects a pair $(\beta_j, \gamma_j)$.
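For reference, distributed agents in this lineage typically assign each actor a fixed epsilon from a log-spaced family; a sketch following the Ape-X recipe (the constants 0.4 and 7 come from that paper, and their use here is an assumption):

```python
def actor_epsilon(l: int, num_actors: int,
                  base_eps: float = 0.4, alpha: float = 7.0) -> float:
    """Per-actor exploration rate, Ape-X style:
    epsilon_l = base_eps ** (1 + alpha * l / (num_actors - 1)).
    Actor 0 explores the most; the last actor is nearly greedy."""
    return base_eps ** (1.0 + alpha * l / (num_actors - 1))

# Example: 256 actors span epsilons from 0.4 down to ~0.4**8.
print(actor_epsilon(0, 256), actor_epsilon(255, 256))
```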
27. Improvements to NGU
• State-action value function parameterization:
$$Q(x, a, j; \theta) = Q(x, a, j; \theta^e) + \beta_j Q(x, a, j; \theta^i)$$
• $Q(x, a, j; \theta^e)$: extrinsic component
• $Q(x, a, j; \theta^i)$: intrinsic component
• Weights: $\theta^e$ and $\theta^i$ are trained separately, with identical architectures and $\theta = \theta^i \cup \theta^e$.
• Rewards: $r^e$ and $r^i$ are used separately, but both components share the same target policy $\pi = \text{argmax}_{a \in \mathcal{A}} Q(x, a, j; \theta)$.
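A sketch of how this split value function could be combined at action-selection time (the array names and values are illustrative, not from the paper):

```python
import numpy as np

def combined_q(q_extrinsic: np.ndarray, q_intrinsic: np.ndarray,
               beta_j: float) -> np.ndarray:
    """Agent57's decomposed value function: Q = Q^e + beta_j * Q^i.
    Both arrays hold per-action values for the current observation."""
    return q_extrinsic + beta_j * q_intrinsic

# Greedy action under the combined value.
q_e = np.array([0.5, 1.2, 0.3])   # illustrative extrinsic values
q_i = np.array([2.0, 0.1, 0.4])   # illustrative intrinsic values
action = int(np.argmax(combined_q(q_e, q_i, beta_j=0.3)))
```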
28. Improvements to NGU
• State-action value function parameterization
• We show that this optimization of separate state-action values is equivalent to the optimization of the original single state-action value function with reward $r^e + \beta_j r^i$ (Appendix B).
• Even though the theoretical objective being optimized is the same, the parameterization is different: we use two different neural networks to approximate each of these state-action values.
31. Improvements to NGU
• Adaptive exploration over a family of policies
• The core idea of NGU is to jointly train a family of policies with different degrees of exploratory behavior using a single network architecture.
• In this way, training these exploratory policies plays the role of a set of auxiliary tasks that can help train the shared architecture even in the absence of extrinsic rewards.
• A major limitation of this approach is that all policies are trained equally, regardless of their contribution to the learning progress.
32. Improvements to NGU
• We propose to incorporate a meta-controller that can adaptively select which policies to use both at training and evaluation time. This carries two important consequences.
33. Improvements to NGU
• Meta-controller: adaptive policy selection
• 1) By selecting which policies to prioritize during training, we can allocate more of the capacity of the network to better represent the state-action value functions of the policies that are most relevant for the task at hand.
• Note that this is likely to change throughout the training process, naturally building a curriculum to facilitate training.
34. Improvements to NGU
• Meta-controller: adaptive policy selection
• Policies are represented by pairs of exploration rate and discount factor, $(\beta_j, \gamma_j)$, which determine the discounted cumulative rewards to maximize.
• It is natural to expect policies with higher $\beta_j$ and lower $\gamma_j$ to make more progress early in training, while the opposite would be expected as training progresses.
35. Improvements to NGU
• Meta-controller: adaptive policy selection
• 2) This mechanism also provides a natural way of choosing the best policy in the family to use at evaluation time.
• Considering a wide range of values of $\gamma_j$ with $\beta_j \approx 0$ provides a way of automatically adjusting the discount factor on a per-task basis.
• This significantly increases the generality of the approach.
36. Improvements to NGU
• We propose to implement the meta-controller using a nonstationary multi-arm bandit algorithm running independently on each actor.
• The reason for this choice, as opposed to a global meta-controller, is that each actor follows a different $\epsilon_l$-greedy policy, which may alter the choice of the optimal arm.
37. Improvements to NGU
• The multi-armed bandit problem is one in which a fixed, limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to the choice.
39. Improvements to NGU
• Nonstationary multi-arm bandit algorithm
• Each arm $j$ of the $N$-arm bandit is linked to a policy in the family and corresponds to a pair $(\beta_j, \gamma_j)$. At the beginning of each episode, say the $k$-th, the meta-controller chooses an arm $J_k$ (a random variable) setting which policy will be executed.
• Then the $l$-th actor acts $\epsilon_l$-greedily with respect to the corresponding state-action value function, $Q(x, a, J_k; \theta_l)$, for the whole episode.
• The undiscounted extrinsic episode returns, denoted $R_k^e(J_k)$, are used as the reward signal to train the multi-arm bandit algorithm of the meta-controller.
40. Improvements to NGU
• Nonstationary multi-arm bandit algorithm
• The reward signal $R_k^e(J_k)$ is non-stationary, as the agent changes throughout training. Thus, a classical bandit algorithm such as Upper Confidence Bound (UCB) will not be able to adapt to the changes of the reward through time.
• Therefore, we employ a simplified sliding-window UCB (SW-UCB) with $\epsilon_{\text{UCB}}$-greedy exploration.
• With probability $1 - \epsilon_{\text{UCB}}$, this algorithm runs a slight modification of classic UCB on a sliding window of size $\tau$, and selects a random arm with probability $\epsilon_{\text{UCB}}$.
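A minimal sketch of a sliding-window UCB meta-controller under these assumptions (the window size, bonus coefficient, and epsilon value are illustrative; the paper's simplified SW-UCB differs in details):

```python
import math
import random
from collections import deque

class SlidingWindowUCB:
    """epsilon-greedy UCB over the last `window` episodes, with one arm
    per (beta_j, gamma_j) policy in the family."""

    def __init__(self, num_arms: int, window: int = 160,
                 eps_ucb: float = 0.5, bonus: float = 1.0):
        self.num_arms = num_arms
        self.eps_ucb = eps_ucb
        self.bonus = bonus
        self.history = deque(maxlen=window)  # (arm, episode_return) pairs

    def select_arm(self) -> int:
        counts = [0] * self.num_arms
        sums = [0.0] * self.num_arms
        for arm, ret in self.history:
            counts[arm] += 1
            sums[arm] += ret
        # Play any arm unseen within the window first.
        for arm in range(self.num_arms):
            if counts[arm] == 0:
                return arm
        # Random arm with probability eps_ucb, otherwise UCB.
        if random.random() < self.eps_ucb:
            return random.randrange(self.num_arms)
        t = len(self.history)
        ucb = [sums[a] / counts[a]
               + self.bonus * math.sqrt(math.log(t) / counts[a])
               for a in range(self.num_arms)]
        return max(range(self.num_arms), key=lambda a: ucb[a])

    def update(self, arm: int, episode_return: float):
        # Reward signal: undiscounted extrinsic return of the finished episode.
        self.history.append((arm, episode_return))
```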
43. Conclusions
• We present the first deep reinforcement learning agent with performance above the human benchmark on all 57 Atari games.
• The agent is able to balance the learning of the different skills required to be performant on such a diverse set of games:
• Exploration
• Exploitation
• Long-term credit assignment
44. Conclusions
• To do that, we propose simple improvements to an existing agent, Never Give Up, which has good performance on hard-exploration games but does not by itself have strong overall performance across all 57 games:
• 1) Using a different parameterization of the state-action value function
• 2) Using a meta-controller to dynamically adapt the novelty preference and discount factor
• 3) Using a longer backprop-through-time window to learn with the Retrace algorithm