Seed rl paper review

SEED RL :
Scalable and Efficient Deep-RL with Accelerated Central
from google
2020/07 이경만
Review by
RL
이미지 출처 : 게임 토막 - SEED9 Entertainment

SEED RL은...
IMPALA나 R2D2 같은 분산 강화학습 알고리즘을 Cloud,TPU에서 구현 할때 어떤 점을 고려해서
설계해야 하나에 대한 모범답안 같은 구현체.
IMPALA가 open source로 Deepmind에서 공개 되었지만 여전히 사용하기 힘든 이유는 분산환경의
엔지니어링이 어렵기 때문이죠.
SEED RL은 TPU기반 하에 RL 알고리즘에 대한 구현체로서 더 나아가 최신 기술을 활용해 비용과
성능에 효율적인 구조로 만들기 위한 엔지니어링 적인 고민이 들어 있습니다.
하지만 왠지 TF2에 TPU pod을 사용해야 할거 같기도 하고 모델 사이즈가 작고 LSTM을 사용하지 않는
경우 통신양이 크게 줄어들지는 않고 레이턴시가 치명적인 실시간 게임의 경우 문제 될 수 있다는 단점도
가지고 있기 때문에 상황에 맞춰서 써야 합니다.. 그래도 특히 모델 사이즈가 크다면 매우 좋은 효과가
있고 클라우드와 TPU, gRPC등의 최신기술을 이용해 RL을 효율적으로 구현 할 수 있는 방법을 오픈 소스로
공개했다는데 의의가 있습니다.

ABSTRACT
We present a modern scalable reinforcement learning agent called SEED (Scalable, Efﬁcient Deep-RL).
우리의SEED라 불리는현대적이고확장가능한강화학습에이전트를소개한다. ( 확장가능, 효율적인Deep-RL)
By effectively utilizing modern accelerators, we show that it is not only possible to train on millions of
frames per second but also to lower the cost of experiments compared to current methods.
최근의accelerator들을효과적으로사용해서초당백만프레임을학습하는게가능할뿐아니라현재소개된모든방법들의비용을낮추는방법을보여
주겠다.
We achieve this with a simple architecture that features centralized inference and an optimized
communication layer.
우리는중앙집중식inference와 효율적인통신레이어를가진심플한구조를통해이를이뤄냈다.
SEED adopts two state of the art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2
(Q-learning), and is evaluated on Atari-57, DeepMind Lab and Google Research Football.
We improve the state of the art on Football and are able to reach state of the art on Atari-57 three times
faster in wall-time. For the scenarios we consider, a 40% to 80% cost reduction for running experiments is
achieved. The implementation along with experiments is open-sourced so results can be reproduced and
novel ideas tried out.
football 과 아타리57 환경에서SOTA를 3배 정도빠르게달성했고.이 시나리오를적용하면실험에드는비용을40~80%정도줄일수 있었다. 모든실험의
구현은open-source로 구현하였으므로모든아이디어들은재현될 수 있다.
https://github.com/google-research/seed_rl

INTRODUCTION
The field of reinforcement learning (RL) has recently seen impressive results across a variety of tasks. This has
in part been fueled by the introduction of deep learning in RL and the introduction of accelerators such as
GPUs. In the very recent history, focus on massive scale has been key to solve a number of complicated games
such as AlphaGo (Silver et al., 2016), Dota (OpenAI, 2018) and StarCraft 2 (Vinyals et al., 2017).
The sheer amount of environment data needed to solve tasks trivial to humans, makes distributed machine
learning unavoidable for fast experiment turnaround time. 인간에게사소한문제라도RL에서는순수한환경의데이터가 필요하기
때문에가속화된실험환경의분산기계학습은피할수 없습니다.
RL is inherently comprised of heterogeneous tasks: running environments, model inference, model training,
replay buffer, etc. and current state-of-the-art distributed algorithms do not efficiently use compute
resources for the tasks.
RL은 본질적으로여러가지기술들의복합체입니다. : 환경을실행, 모델을인퍼런스/트레이닝, 리플레이버퍼, 등등, 그리고여러가지SOTA 분산알고리즘은
이러한컴퓨팅리소스와태스크를효과적으로사용하지못합니다.
The amount of data and inefficient use of resources makes experiments unreasonably expensive.
The two main challenges addressed in this paper are scaling of reinforcement learning and optimizing the use
of modern accelerators, CPUs and other resources.
이 논문의두개의메인챌린지는강화학습의규묘를키우고하고최신가속기(그냥 TPU라고하는게..), CPU와 다른리소스들을최적화시키는것이다.

INTRODUCTION#2
We introduce SEED (Scalable, Efficient, Deep-RL), a modern RL agent that scales well, is flexible and
efficiently utilizes available resources. It is a distributed agent where model inference is done centrally
combined with fast streaming RPCs to reduce the overhead of inference calls.
We show that with simple methods, one can achieve state-of-the-art results faster on a number of tasks. For
optimal performance, we use TPUs (cloud.google.com/tpu/) and TensorFlow 2 (Abadi et al., 2015) to simplify
the implementation.
The cost of running SEED is analyzed against IMPALA (Espeholt et al., 2018) which is a commonly used
state-of-the-art distributed RL algorithm (Veeriah et al. (2019); Li et al. (2019); Deverett et al. (2019);
Omidshafiei et al. (2019); Vezhnevets et al. (2019); Hansen et al. (2019); Schaarschmidt et al.; Tirumala et al.
(2019), ...).
We show cost reductions of up to 80% while being significantly faster. When scaling SEED to many
accelerators, it can train on millions of frames per second. Finally, the implementation is open-sourced
together with examples of running it at scale on Google Cloud (see Appendix A.4 for details) making it easy to
reproduce results and try novel idea

Appendix A.4
A.4 SEED LOCALLY AND ON CLOUD SEED is open-sourced together with an example
of running it both on a local machine and with scale using AI Platform, part of Google
Cloud.
We provide a public Docker image with low-level components implemented in C++
already pre-compiled to minimize the time needed to start SEED experiments. The main
pre-requisite to running on Cloud is setting up a Cloud Project.
The provided startup script uploads the image and runs training for you. For more details
please see github.com/ google-research/seed_rl.

RELATED WORK - scaling Value based method
● scaling DQN was Nair et al. (2015) that used asynchronous SGD (Dean et al., 2012) together with a
distributed setup consisting of actors, replay buffers, parameter servers and learners.
● Since then, it has been shown that asynchronous SGD leads to poor sample complexity while not being
signiﬁcantly faster (Chen et al., 2016; Espeholt et al., 2018) , async SGD는 크게빠르지않으면서샘플효율성이나빠짐
● Along with advances for Q-learning such as prioritized replay (Schaul et al., 2015), dueling networks
(Wang et al., 2016), and double-Q learning (van Hasselt, 2010; Van Hasselt et al., 2016) the
state-of-the-art distributed Q-learning was improved with Ape-X (Horgan et al., 2018).
● Recently, R2D2 (Kapturowski et al., 2018) achieved impressive results across all the Arcade Learning
Environment (ALE) (Bellemare et al., 2013) games by incorporating value-function rescaling (Pohlen et
al., 2018) and LSTMs (Hochreiter & Schmidhuber, 1997) on top of the advancements of Ape-X.

RELATED WORK - scaling policy gradients methods
● A3C (Mnih et al., 2016) introduced asynchronous single-machine training using asynchronous SGD and
relied exclusively on CPUs. GPUs were later introduced in
● GA3C (Mahmood, 2017) with improved speed but poor convergence results due to an inherently
on-policy method being used in an off-policy setting. - off-policy 세팅으로on-policy 방법을사용해속도는향상시켰지만
수렴은거의되지않았다.-> policy lag?
● This was corrected by V-trace (Espeholt et al., 2018) in the IMPALA agent both for single machine
training and also scaled using a simple actor-learner architecture to more than a thousand machines.
● PPO (Schulman et al., 2017) serves a similar purpose to V-trace and was used in OpenAI Rapid (Petrov et
al., 2018) with the actor-learner architecture extended with Redis (redis.io), an in-memory data store,
and was scaled to 128,000 CPUs.

RELATED WORK - environments&agent
● For inexpensive environments like ALE, a single machine with multiple accelerators can achieve results
quickly (Stooke & Abbeel, 2018). This approach was taken a step further by converting ALE to run on a
GPU (Dalton et al., 2019).
● A third class of algorithms is evolutionary algorithms. With simplicity and massive scale, they have
achieved impressive results on a number of tasks (Salimans et al., 2017; Such et al., 2017).
● Besides algorithms, there exist a number of useful libraries and frameworks for reinforcement learning.
ELF (Tian et al., 2017) is a framework for efﬁciently interacting with environments, avoiding Python
global-interpreter-lock contention.
● Dopamine (Castro et al., 2018) is a ﬂexible research focused RL framework with a strong emphasis on
reproducibility. It has state of the art agent implementations such as Rainbow (Hessel et al., 2017) but is
single-threaded.
● TF-Agents (Guadarrama et al., 2018) and rlpyt (Stooke & Abbeel, 2019) both have a broader focus with
implementations for several classes of algorithms but as of writing, they do not have distributed
capability for large scale RL.

RELATED WORK - environments #2
● RLLib (Liang et al., 2017) provides a number of composable distributed components and a
communication abstraction with a number of algorithm implementations such as IMPALA and Ape-X.
Concurrent with this work,
● TorchBeast (Küttler et al., 2019) was released which is an implementation of single-machine IMPALA
with remote environments. ( *TorchBeast는 pure python인 mono beast , c++ 통신모듈까지합쳐진 poly beast로 나눠짐)
● SEED is closest related to IMPALA, but has a number of key differences that combine the benefits of
single-machine training with a scalable architecture. Inference is moved to the learner but environments
run remotely. seed는 IMPALA와 가깝다. 하지만확장가능한구조의싱글머신트레이닝의몇가지다른장점을합친차이점을가지고있다.
인퍼런스가러너에이동했고환경들은리모트로실행된다.
This is combined with a fast communication layer to mitigate latency issues from the increased number
of remote calls. The result is significantly faster training at reduced costs by as much as 80% for the
scenarios we consider. Along with a policy gradients (V-trace) implementation we also provide an
implementation of state of the art Q-learning (R2D2).
In the work we use TPUs but in principle, any modern accelerator could be used in their place. TPUs are
particularly well-suited given they high throughput for machine learning applications and the scalability.
Up to 2048 cores are connected with a fast interconnect providing 100+ petaflops of compute.

ARCHITECTURE
Before introducing the architecture of SEED, we ﬁrst analyze the generic actor-learner architecture used by
IMPALA, which is also used in various forms in Ape-X, OpenAI Rapid and others.
SEED를 소개하기전에기존의일반적인actor-learner 아키텍쳐(IMPALA , Ape-X ,OpenAI Rapid 등등) 를 보기로합시다.
An overview of the architecture is shown in Figure 1a.
A large number of actors repeatedly read model parameters from the learner (or parameter servers). Each
actor then proceeds the local model to sample actions and generate a full trajectory of observations, actions,
policy logits/Q-values.
Finally, this trajectory along with recurrent state is transferred to a shared queue or replay buffer.
Asynchronously, the learner reads batches of trajectories from the queue/replay buffer and optimizes the
model.

Figure 1a.
https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

ARCHITECTURE
There are a number of reasons for why this architecture falls short:
이 아키텍처가 충분하지 않은 데는 여러 가지 이유가 있습니다.
1. Using CPUs for neural network inference: The actor machines are usually CPU-based (occasionally
GPU-based for expensive environments). CPUs are known to be computationally inefﬁcient for neural networks
(Raina et al., 2009). When the computational needs of a model increase, the time spent on inference starts to
outweigh the environment step computation. The solution is to increase the number of actors which increases
the cost and affects convergence (Espeholt et al., 2018).
actor에서inference용으로CPU를 사용합니다. CPU는 뉴럴넷을돌리기엔비효율적입니다. (모델사이즈가커지면커질수록) 솔루션중하나는
ACTOR 갯수를충분히늘리는것입니다.
2. Inefﬁcient resource utilization: Actors alternate between two tasks: environment steps and inference steps.
The compute requirements for the two tasks are often not similar which leads to poor utilization or slow actors.
E.g. some environments are inherently single-threading while neural networks are easily parallelizable.
리소스사용에비효율적입니다. 액터가환경을실행하고인퍼런스를하는두가지일을해야하기때문입니다. 환경과뉴럴넷은cpu사용타입이다르기
때문이죠(때때로어떤환경은싱글쓰레드입니다.)
3. Bandwidth requirements: Model parameters, recurrent state and observations are transferred between
actors and learners. Relatively to model parameters, the size of the observation trajectory often only accounts
for a few percents.1 Furthermore, memory-based models send large states, increase bandwidth requirements.
네트웍밴드위스도많이필요합니다: 모델파라메터, rnn 히든상태, 옵져베이션은 ACTOR와 Learner 사이에전송됩니다. 상대적으로모델
파라메터가크기때문에옵져베이션트레젝토리의크기는전체전송량의몇% 밖에되지않습니다. 또한LSTM같은메모리베이스모델은매우큰
스테이트를보내기때문에많은네트웍밴트위스를요구합니다.

ARCHITECTURE
● While single-machine approaches such as GA3C (Mahmood, 2017) and single-machine IMPALA avoid
using CPU for inference (1) and do not have network bandwidth requirements (3), they are restricted by
resource usage (2) and the scale required for many types of environments.
GA3C나 single-machine IMPALA 같은single-machine 접근은inference에 CPU사용을피하면서네트웍밴드위스를요구하지않고대신자원의
사용에제약을받습니다. 하지만많은 타입의환경에서스케일을요구합니다.
● The architecture used in SEED (Figure 1b) solves the problems mentioned above. Inference and
trajectory accumulation is moved to the learner which makes it conceptually a single-machine setup
with remote environments (besides handling failures). Moving the logic effectively makes the actors a
small loop around the environments.
SEED에서사용된아키텍쳐는위의문제를해결합니다. 인퍼런스와트레젝토리수집은러너로이동함으로서개념적으로는싱글머신셋업에리모트
환경을만듭니다. 로직을옮김으로서액터를환경과관련된스몰루프로만들어버립니다. (액터는더 심플해집니다.)
● For every single environment step, the observations are sent to the learner, which runs the inference
and sends actions back to the actors. -_-;;;; 하지만액터는매스탭마다옵져베이션을러너에보내고액션을받아오게됩니다.

https://ai.googleblog.com/2020/03/massively-scaling-reinforcement.html

ARCHITECTURE - new problem, Latency
● For every single environment step, the observations are sent to the learner, which runs the inference
and sends actions back to the actors. -_-;;;; 하지만액터는매스탭마다옵져베이션을러너에보내고액션을받아오게됩니다.
● This introduces a new problem: Latency. To minimize latency, we created a simple framework that uses
gRPC (grpc.io) - a high performance RPC library. Specifically, we employ streaming RPCs where the
connection from actor to learner is kept open and metadata sent only once.
이것은새로운문제를일으킵니다. 레이턴시! , 레이턴시를최소화하기위해우리는gRPC(고성능RPC 라이브러리)를 사용하는작은프레임웍을
만들었습니다. . 특별히우리는actor로부터learner에게는오픈을유지하고메타데이터를한번만보내기위해Streaming RPC를 사용합니다.
● Furthermore, the framework includes a batching module that efficiently batches multiple actor
inference calls together.
특히(러너쪽) 프레임웍에배치모듈을넣어액터의여러요청들을함께처리할수 있어효율성을높였습니다.
● In cases where actors can fit on the same machine as learners, gRPC uses unix domain sockets and thus
reduces latency, CPU and syscall overhead. Overall, the end-to-end latency, including network and
inference, is faster for a number of the models we consider (see Appendix A.7).
액터는 러너가같은머신에있는경우 gRPC는 unix domain socket을 사용하기때문에CPU와 syscall 오버헤드를낮출수 있습니다. 전반적으로end
to end 레이턴시, network과 inference 를 포함해우리가생각하는몇몇모델은더 빨랐습니다. ( see Appendix A.7).

gRPC - Streaming RPC (순서 보장 + 사이즈 가변)
https://grpc.io/docs/what-is-grpc/core-concepts/
Unary RPC - 클라이언트에서 고정 사이즈 요청 / 고정
사이즈 응답
Server Streaming RPC - 클라이언트에서 고정 사이즈
요청 / 서버에서는 Stream 을 응답으로
Client Streaming RPC - 클라이언트에서 스트리밍 ,
서버에서 고정 사이즈 응답
Bidirectional Streaming RPC - 서버/ 클라 스트리밍
요청/응답

IMPALA의 Off policy 는 트레젝토리가학습을위해큐
보내지기까지전체트레젝토리에서같은policy를 사용
합니다. 폴리시는딱 2번만바뀝니다.
SEED의 Off policy 는 트레젝토리의학습이계속해서반영됩니다.
따라서트레젝토리는매우많은다른폴리시에의해구성됩니다.
The IMPALA and SEED architectures differ in that for SEED, at any point in time, only one copy of the model
exists whereas for distributed IMPALA each actor has its own copy.
IMPALA와 SEED의 구조의다른부분은SEED는 동일한모델로학습하고인퍼런스하지만, IMPALA의 액터는복제된모델로분산실행된다는점입니다.
This changes the way the trajectories are off-policy. In IMPALA (Figure 2a), an actor uses the same policy πθt
for an entire trajectory. 이런차이가IMPALA를 off-policy의 방법입니다. actor는 전체trajectory를 모을동안동일한폴리시를사용합니다.
For SEED (Figure 2b), the policy during an unroll of a trajectory may change multiple times with later steps
using more recent policies closer to the one used at optimization time.
SEED의 경우하나의trajectory를 모으는동안(학습의결과로나온) 여러개의policy로 바뀌게되어동일하지않습니다.

Detailed Learner Architecture in SEED

A detailed view of the learner in the SEED architecture is shown on Figure 3.
Three types of threads are running:
1. Inference
2. Data prefetching
3. Training.
Inference threads receive a batch of observations, rewards and episode termination ﬂags.
Inference Thread는 Observations 과 리워드와에피소드끝 플래그의배치를받는다.
They load the recurrent states and send the data to the inference TPU core. The sampled actions and new
recurrent states are received, and the actions are sent back to the actors while the latest recurrent states
are stored.
Inference Thread는 Observations 과 리워드와에피소드끝 플래그의배치를받는다. 그리고recurrent states를 로드해데이터들을TPU core에 보내고
action과 새로운recurrent states를 받아서actions는 각각의Actor에 보내고recurrent states 는 저장한다.
When a trajectory is fully unrolled it is added to a FIFO queue or replay buffer and later sampled by data
prefetching threads. Finally, the trajectories are pushed to a device buffer for each of the TPU cores taking
part in training.
trajectory가 모두나오면(에피소드가끝나면) Queue나 리플레이버퍼에추가되고일정주기로prefectching thread 에서샘플링된다. 최종적으로
trajectory는 device buffer 에 푸시되고각각의TPU core들은학습을위해가져가게된다.

The training thread (the main Python thread) takes the prefetched trajectories, computes gradients using
the training TPU cores and applies the gradients on the models of all TPU cores (inference and training)
synchronously.
트레이닝쓰레드(메인파이썬쓰레드)는 trajectory들을얻어서트레이닝TPU를 사용해그라디언트를계산하고얻어진그라디언트를전체TPU코어
(인퍼런스+ 트레이닝)에 적용한다.
The ratio of inference and training cores can be adjusted for maximum throughput and utilization.
트레이닝과인퍼런스용TPU 코어갯수는최대성능과사용량에의해조절된다.
The architecture scales to a TPU pod (2048 cores) by round-robin assigning actors to TPU host machines,
and having separate inference threads for each TPU host.
전체구조는actor를 라운드로빈방식으로TPU host machine 에 할당함으로서TPU pod 규모로스케일링할 수 있습니다. (2048core) 각각의TPU host 에는
독립된inference thread가 동작합니다.
When actors wait for a response from the learner, they are idle so in order to fully utilize the machines, we
run multiple environments on a single actor.
actor가 learner로부터응답을대기할때는멈춰야합니다. 따라서전체머신의성능을풀로활용하기위해서하나의액터에여러개의환경을실행합니다.

출처 : https://www.youtube.com/watch?v=7WhWkhFAIO4
2013년에 DNN으로 음성인식을 바꾸면서 사람들이 음성 검색을 하루에 3분씩만 하더라도 구글 전체
데이터 센터의 2배 이상의 컴퓨팅 파워가 필요하다는 것을 알게 되었다.
TPU의 목표는 GPU의 10배 성능향상이다.
이 목표하에 Google에서 TPU를 자체 디자인,검사, 생산해 TPU v1은 15개월만에 데이터 센터에 공급함.

TPU pod , TPU host
The architecture scales to a TPU pod (2048 cores) by round-robin assigning actors to TPU host machines,
and having separate inference threads for each TPU host.
https://cloud.google.com/tpu/docs/system-architecture?hl=ko#v3-performance
https://www.youtube.com/watch?v=7WhWkhFAIO4 TPU v1 논문 리뷰 ( 한글)

https://cloud.google.com/tpu/docs/system-architecture?hl=ko#v3-performance

summarize
To summarize, we solve the issues listed previously by:
1. Moving inference to the learner and thus eliminating any neural network related computations from
the actors. Increasing the model size in this architecture will not increase the need for more actors (in
fact the opposite is true).
인퍼런스를learner쪽으로옮겨서NN관련연산을actor에서지워버림. 모델사이즈를키우는것이액터의부하를키우지않습니다. (사실반대도
진실. -_-;; 모델사이즈가작아도Actor의 부하가더 줄어들게되지않음. ...)
2. Batching inference on the learner and having multiple environments on the actor. This fully utilize both
the accelerators on the learner and CPUs on the actors. The number of TPU cores for inference and
training is ﬁnely tuned to match the inference and training workloads. All factors help reducing the
cost of experiments.
Actor에 멀티환경을적용해learner에 batch inference를 한다. (이렇게해서tragectory를 보내는것과batch observation 을 보내는것을셈셈으로
해서효율을높인다.) learner의 가속기(TPU)와 Actor의 CPU의 효율을극대화시킨다. inference를 위한TPU 코어의갯수는학습도중파인튜닝을
한다. 이 모든게학습비용을줄이는데도움이된다.
3. Everything involving the model stays on the learner and only observations and actions are sent
between the actors and the learner. This reduces bandwidth requirements by as much as 99%.
모델에관련된모든건leaener에 머물러있고observation과 action만 actor과 learner 사이에보내진다. 이게네트웍부하를 최대99% 줄여준다.
4. Using streaming gRPC that has minimal latency and minimal overhead and integrating batching into
the server module.
streaming gRPC를 사용해레이턴시를최소화했고부하도최소화했고배치를서버모듈에통합했다.

We provide the following two algorithms
V-TRACE
● One of the algorithms we adapt into the framework is V-trace (Espeholt et al., 2018).
● We do not include any of the additions that have been proposed on top of IMPALA such as van den Oord
et al. (2018); Gregor et al. (2019). Impala에 구현된 V-Trace에 어떤것도 추가하지 않았다.
● The additions can also be applied to SEED and since they are more computational expensive, they would
beneﬁt from the SEED architecture. V-Trace의 추가사항은 SEED에도 적용할 수 있고 그게 더 많은 계산비용이 들더라도 SEED
아키텍쳐의 이점을 누릴 수 있다.
Q-LEARNING
● We show the versatility of SEED’s architecture by fully implementing R2D2 (Kapturowski et al., 2018), a state of the art distributed
value-based agent.
● R2D2 itself builds on a long list of improvements over DQN (Mnih et al., 2015): double Q-learning (van Hasselt, 2010; Van Hasselt et al.,
2016), multi-step bootstrap targets (Sutton, 1988; Sutton & Barto, 1998; Mnih et al., 2016), dueling network architecture (Wang et al.,
2016), prioritized distributed replay buffer (Schaul et al., 2015; Horgan et al., 2018), value-function rescaling (Pohlen et al., 2018), LSTM’s
(Hochreiter & Schmidhuber, 1997) and burn-in (Kapturowski et al., 2018).
● Instead of a distributed replay buffer, we show that it is possible to keep the replay buffer on the learner with a straightforward ﬂexible
implementation.
● This reduces complexity by removing one type of job in the setup. It has the drawback of being limited by the memory of the learner but it
was not a problem in our experiments by a large margin: a replay buffer of 105 trajectories of length 120 of 84 × 84 uncompressed
grayscale observations (following R2D2’s hyperparameters) takes 85GBs of RAM, while Google Cloud machines can offer hundreds of
GBs. 분산리플레이버퍼를사용하지않고learner의 메모리85GB를 리플레이버퍼로사용해구조를단순화시킴(TPU v3는 128GB의
메모리가있음)
● However, nothing prevents the use of a distributed replay buffer together with SEED’s central inference, in cases where a much larger
replay buffer is needed , 더많은메모리가필요하면분산replay memory를 사용해도된다.

EXPERIMENTS-STABILITY
하이퍼 파라메터를 바꿔가면서 실험했을때 IMPALA와 SEED의 결과물이 동일했다.
(왼쪽 : 최종 리턴 , 아래 : 하이퍼 파라메터의 조합 번호)
동일 파라메터에 동일한 결과물.

EXPERIMENTS-STABILITY
Frame = 4* step
환경이많을수록시간당샘플링되는프레임은많아짐, 동일프레임효율성은IMPALA와 비슷하지만, 최고리턴에도달하는시간은환경을늘릴
수록빨라지며, 어떤경우는환경을두배로늘리는것만으로최고리턴에더 빠르게도달함. (7.5시간-> 2.5시간) 하지만환경을너무늘리면샘플
효율성이떨어지는경우도있음(첫번째게임의경우12480 환경에서프레임당효율이떨어짐)

EXPERIMENTS- SPEED
Compared to the baseline IMPALA using 2
Nvidia P100’s
we ﬁnd that using 2 TPU v3 cores in SEED
improves the speed by 1.6x (see Table 1).
P100을 2개쓴 경우 보다 TPU v3의 2 core를 쓴
경우가 1.6배 속도로 더 빠름.
Additionally, using 8 cores adds another 4.1x
speed-up.
A speed-up of 4.5x is achievable if the batch size
is increased linearly with the number of training
cores
However, we found that increasing the batch
size, like with DeepMind Lab, hurts sample
complexity.
배치사이즈를너무늘리는경우 딥마인드랩에서샘플복잡도를
떨어뜨리는경우가있음을발견함

● 안좋은 소식은 TPU를 대여할때 core 8개
단위로 밖에 되지 않습니다.
● 좋은 소식은 선점형을 이용하면 1/3이하
가격으로 사용 가능하고 TPU v2의 경우 v3의
56% 가격으로 사용 가능하다는 점입니다. 대신
선점형을 사용하기 위한 처리를 해줘야 합니다.

CONCLUSION
● 최신 액셀러레이터를보다 잘 활용하여 이전 분산 아키텍처보다 환경 프레임 당 더 빠르고 비용이
저렴한 새로운 강화 학습 에이전트 아키텍처를 도입하고 분석했습니다
● 동일한 샘플 효율성을 유지하면서 Google Research Football의 최고 점수를 개선했으며 Atari-57의
3.1 배 더 빠른 최고 점수를 달성하면서 강력한 IMPALA 와 비교하여 DeepMind Lab에서 11 배의
wall-time 속도 향상을 해냈습니다.
● 에이전트는 open-source이며 패키징 되어 Google Cloud에서 쉽게 실행할 수 있습니다. (Docker
이미지도 제공하는것을 의미)
● 우리는 이를 통해 최신 연구 결과가 공유 되고 이를 바탕으로 강화학습 연구가 가속화되기를
원합니다.
● 이 새로운 에이전트 구조의 가능성을 보여주는 현실 시나리오에서 초당 수백만 프레임이 가능함을
증명했습니다. ( 이전 연구의 80배)
● 하지만 몇몇 환경에서 배치를 키우는것은 샘플 효율성을 해치는것도 실험 결과에 나왔습니다.
SL이나 RL에서 더큰 배치 사이즈로 샘플 효율성을 유지하는것에 대한 것은 어느정도
연구되었습니다. 이는 강화학습을 확대하기 위해 점점 개방적으로 연구해야 하는 분야라고 생각
합니다.

SEED는 IMPALA의 구현체로 보면 Torchbeast나 Deepmind가 공개한 코드와
크게 다르지 않지만 Cloud의 최신 가속기(TPU pod)기반의 RL 어플리케이션으로 보면 독특한 지위를 가지게 됩니다.
- 학습 속도/비용 문제는 연구에서도 산업에서도 빠르면 빠를 수록/저렴하면 저렴할수록 더 많은것을 할 수 있기 때문에
반드시 해결해야할 문제입니다.
- 그런 점에서 SEED RL은 집단이 아닌 개인이 혼자 수백~수천대를 컨트롤 할 수 있는 클라우드 기반 분산 RL은 필수고
거기에 최신가속기인 TPU의 잇점을 활용해 인퍼런스를 학습 환경에 추가 할 수 있다는 것을 보여주었습니다.
- RL뿐 아니라 다른 Deep-learning Application에서도 클라우드 기반 가속기의 접근은 필수인 상황에서 엔지니어링
예제 로서의 가치는 매우 크다고 생각 합니다.
- SEED RL은 모델이 크면 클수록 더 좋은 가성비를 내게 됩니다. 앞으로 모델은 점점더 무거워 질 것이기 때문에 충분히
가치가 있다고 생각합니다
- 모델이 작은 경우 GPU로도 비슷한 결과가 나오게 구현도 가능할 거 같습니다. ( mpala)
저자들의 결론 처럼 이 코드가 SEED가 되어 더 많은 연구와 논문이 나오면 좋을것 같습니다.
나의 결론

Seed rl paper review

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Seed rl paper review

Similar to Seed rl paper review (20)

Recently uploaded

Recently uploaded (20)

Seed rl paper review