MuZero review by nomoreid

Mastering Atari, Go, chess and shogi by
planning with a learned model #1
from DeepMind
Review by 이경만 (nomoreid), 2021

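MuZero's name is, of course, based on AlphaZero: it keeps "Zero" to indicate that it was trained without imitating
human data, and replaces "Alpha" with "Mu" to signify that it now plans with a learned model. Looking a little
closer, Mu turns out to be rich in meaning:
● 夢, which can be read as mu in Japanese, means "dream", just as MuZero uses its learned model to imagine future scenarios.
● The Greek letter μ, pronounced mu, can also stand for the learned model.
● 無, pronounced mu in Japanese, means "nothing", doubling down on the idea of learning from scratch: not only is there no human data to imitate, the rules are not provided either.
Source: http://www.furidamu.org/blog/2020/12/22/muzero-intuition/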

MCTS or search
Netflix: The Queen's Gambit

Progress of Go AI
What if we knew the winning probability of each move in a given state? What if we examined every move?
= exhaustive search
+ perfect move

MinMax search
https://en.wikipedia.org/wiki/Minimax
Alpha–beta pruning
https://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning
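Alpha–beta pruning is a search algorithm that reduces the number of nodes the minimax algorithm evaluates in its
search tree. It is an adversarial search algorithm commonly used for machine play of two-player games
(Tic-tac-toe, Chess, Go, etc.). It stops evaluating a move as soon as at least one possibility is found that proves
the move to be worse than a previously examined move; such moves need no further evaluation. Applied to a standard
minimax tree, it returns the same move as minimax would, but prunes away the branches that cannot influence the
final decision.
Minimax (sometimes MinMax, MM or saddle point) is a decision rule used in artificial intelligence, decision theory,
game theory, statistics and philosophy for minimizing the possible loss in a worst-case (maximum loss) scenario.
When dealing with gains, it is called "maximin": maximize the minimum gain. It was originally formulated for
n-player zero-sum game theory, covering both the case where players move alternately and the case where they move
simultaneously, and has since been extended to more complex games and to general decision-making under uncertainty.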
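The pruning rule described above can be sketched in a few lines. This is a minimal illustration on a hand-built tree, not code from any Go or chess engine; the nested-list tree format and the `counter` argument are assumptions made only for this example.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, counter=None):
    """Minimax value of `node` with alpha-beta pruning.

    A node is either a number (leaf score from the maximizer's point of view)
    or a list of child nodes. `counter` optionally counts evaluated leaves.
    """
    if not isinstance(node, list):
        if counter is not None:
            counter[0] += 1
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, counter))
            alpha = max(alpha, best)
            if alpha >= beta:  # the minimizer already has a better option elsewhere
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, counter))
        beta = min(beta, best)
        if alpha >= beta:  # the maximizer already has a better option elsewhere
            break
    return best

# Depth-2 tree: the maximizer picks a branch, the minimizer picks a leaf in it.
tree = [[3, 5], [2, 9], [0, 1]]
visited = [0]
value = alphabeta(tree, True, counter=visited)
print(value, visited[0])
```

On this tree the result equals the plain minimax value, while two of the six leaves are never evaluated.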

MCTS
Source: https://www.youtube.com/watch?v=L0A86LmH7Yw

MCTS
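How AlphaGo Zero learns
Exploration
Exploitation
Early in learning, the agent tries many new things (exploration); as knowledge accumulates, it relies
more and more on that accumulated knowledge (exploitation). - nomoreid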

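The fewer visits a child has relative to its parent node, the larger its pb_c value becomes and the more likely
that node is to be visited. This is the term that balances exploration and exploitation.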
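A minimal sketch of this exploration term, assuming the pUCT form and the constants c1 = 1.25 and c2 = 19652 reported in the MuZero paper. The function names are hypothetical and chosen for this example only.

```python
import math

PB_C_INIT = 1.25    # c1 (assumed from the paper's search description)
PB_C_BASE = 19652   # c2 (assumed from the paper's search description)

def exploration_bonus(parent_visits, child_visits, prior):
    """Exploration term: grows when the child is rarely visited relative to its parent."""
    pb_c = math.log((parent_visits + PB_C_BASE + 1) / PB_C_BASE) + PB_C_INIT
    pb_c *= math.sqrt(parent_visits) / (child_visits + 1)
    return pb_c * prior

def ucb_score(parent_visits, child_visits, prior, value):
    """Exploitation (mean value of the child) plus exploration (bonus)."""
    return value + exploration_bonus(parent_visits, child_visits, prior)

# Same parent and prior: the unvisited child receives the much larger bonus.
print(exploration_bonus(100, 0, 0.5), exploration_bonus(100, 10, 0.5))
```

During selection, the child with the highest `ucb_score` is followed, so a rarely visited child can win even when its current value estimate is lower.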

https://deepmind.com/blog/article/muzero-mastering-go-chess-shogi-and-atari-without-rules

http://www.aitimes.com/news/articleView.html?idxno=135133
http://www.hani.co.kr/arti/science/technology/975985.html
Articles

https://www.bbc.com/news/technology-55403473

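- It was not even taught the rules?
- Even in environments it was not trained on?
- AlphaGo Zero has to be given the rules in advance
- Focuses on the most important aspects of the environment
- Learns on its own, like a person, and works out the principles?
- General enough to be used for optimizing YouTube's video compression algorithm?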

Constructing agents with planning capabilities has long been one of the main challenges in the
pursuit of artificial intelligence.
Summary: building planning ability into agents is something AI has pursued for a long time.
Tree-based planning methods have enjoyed huge success in challenging domains, such as chess [1]
and Go [2], where a perfect simulator is available.
Summary: tree-based planning has been very successful in challenging domains such as chess and Go, where a perfect
simulator exists. (What is a perfect simulator?)
However, in real-world problems, the dynamics governing the environment are often complex and
unknown.
Summary: in real-world problems, the dynamics that govern the environment are mostly complex and unknown.
Here we present the MuZero algorithm, which, by combining a tree-based search with a learned
model, achieves superhuman performance in a range of challenging and visually complex domains,
without any knowledge of their underlying dynamics.
Summary: MuZero combines tree-based search with a learned model and reaches superhuman performance in several
challenging, visually complex domains without knowing the underlying dynamics.
The MuZero algorithm learns an iterable model that produces predictions relevant to planning: the
action-selection policy, the value function and the reward.
Summary: MuZero learns an iterable model that produces the predictions relevant to planning (action-selection
policy, value function, reward). (This model is the environment model.)
When evaluated on 57 different Atari games - the canonical video game environment for testing artificial intelligence
techniques, in which model-based planning approaches have historically struggled [4] - the MuZero algorithm achieved
state-of-the-art performance.
Summary: on Atari, MuZero outperforms other model-based algorithms and achieves SOTA. (Atari is not a perfect simulator.)
When evaluated on Go, chess and shogi - canonical environments for high-performance planning - the MuZero algorithm
matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm that
was supplied with the rules of the game.
Summary: on Go, chess and shogi, MuZero matched, without any knowledge of the game dynamics, the superhuman
performance of AlphaZero, which was given the rules of the game.

Planning algorithms based on lookahead search have achieved remarkable successes in artificial
intelligence. Human world champions have been defeated in classic games such as checkers, chess, Go
and poker, and planning algorithms have had real-world impact in applications from logistics to chemical
synthesis.
Summary: planning algorithms based on lookahead search have had notable successes in AI (chess, Go, poker,
chemical synthesis and so on).
However, these planning algorithms all rely on knowledge of the environment's dynamics, such as the
rules of the game or an accurate simulator, preventing their direct application to real-world domains such
as robotics, industrial control or intelligent assistants, where the dynamics are normally unknown.
Summary: all of these planning algorithms depend on knowledge of the environment dynamics, such as the game rules
or an accurate simulator. That prevents applying them directly to real-world domains such as robotics, industrial
control or intelligent assistants, where the dynamics are usually unknown.
Model-based reinforcement learning (RL) aims to address this issue by first learning a model of the
environment's dynamics and then planning with respect to the learned model. Typically, these models
have either focused on reconstructing the true environmental state or the sequence of full observations.
Summary: model-based RL first learns a model of the environment dynamics and then plans with the learned model
(that is, runs a planning algorithm on it). Typically these models focus on reconstructing either the true
environment state or the full sequence of observations.
However, previous work remains far from the state of the art in visually rich domains, such as Atari 2600
games. Instead, the most successful methods are based on model-free RL - that is, they estimate the
optimal policy and/or value function directly from interactions with the environment. However, model-free
algorithms are in turn far from the state of the art in domains that require precise and sophisticated
lookahead, such as chess and Go.
Summary: earlier model-based work was far from SOTA in visually rich domains such as Atari. The most successful
methods there are model-free: they estimate the optimal policy and/or value function directly from interaction
with the environment. Model-free RL, in turn, is far from SOTA in domains that need precise lookahead, such as
chess and Go.

                  Atari (visually rich domain)       Go, chess (domains requiring precise lookahead)
Model-free RL     SOTA                               Not so good
Model-based RL    None (there is no environment      SOTA
                  model, so it cannot be applied)
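Perfect simulator? (#nomoreid)
The paper uses the term "perfect simulator", but in context it means that planning requires a simulator that
supports look-forward search from a given point in time.
In other words, we need a simulator in which we can try several actions from some state and then rewind.
Most games do not offer this, and in the real world it is even harder.
Role of the simulator: given the current state S and an action A, it returns the next state S', the reward, the
done flag and so on. Most simulators, however, keep S internally, so they often provide no way to step backwards
or to return to a specific state.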
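The requirement in the note above (try several actions from one state, then rewind) can be illustrated with a toy environment. `CounterEnv`, `snapshot` and `restore` are hypothetical names invented for this sketch; they do not come from any particular simulator or library.

```python
import copy

class CounterEnv:
    """Toy environment: the state is an integer and actions add +1 or -1.

    Reaching state 3 gives reward 1 and ends the episode.
    """

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Typical simulator interface: the state lives inside the environment.
        self.state += action
        done = self.state == 3
        reward = 1.0 if done else 0.0
        return self.state, reward, done

    def snapshot(self):
        # The extra capability that planning needs: save the internal state.
        return copy.deepcopy(self.state)

    def restore(self, saved):
        # Rewind to a previously saved state.
        self.state = copy.deepcopy(saved)

def try_actions(env, actions):
    """Evaluate each action from the current state, rewinding after every try."""
    saved = env.snapshot()
    rewards = {}
    for action in actions:
        _, reward, _ = env.step(action)
        rewards[action] = reward
        env.restore(saved)
    return rewards

env = CounterEnv()
env.step(1)
env.step(1)
result = try_actions(env, [+1, -1])
print(result, env.state)
```

Without `snapshot` and `restore`, the loop in `try_actions` could not compare the two actions from the same state, which is the gap a learned model is meant to fill.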

MuZero!!!
Here we introduce MuZero, a new approach to model-based RL that achieves
both state-of-the-art performance in Atari 2600 games - a visually complex set of
domains - and superhuman performance in precision planning tasks such as
chess, shogi and Go, without prior knowledge of the game dynamics.
Summary: the paper introduces MuZero, a new approach to model-based RL that achieves SOTA on Atari 2600, a visually
complex set of domains, and superhuman performance in Go, chess and shogi, without prior knowledge of the game
dynamics.
MuZero builds on AlphaZero's powerful search and policy iteration algorithms, but
incorporates a learned model into the training procedure. MuZero also extends
AlphaZero to a broader set of environments, including single agent domains and
non-zero rewards at intermediate time steps.
Summary: MuZero is built on top of AlphaZero's search and policy iteration, but integrates a learned model into the
training procedure. It also extends AlphaZero to a broader set of environments, including single-agent domains and
environments that give non-zero rewards at intermediate time steps.

The main idea of the algorithm (summarized in Fig. 1) is to predict those aspects
of the future that are directly relevant for planning.
Summary: the main idea is to predict only the aspects of the future that are directly relevant for planning.
(The word "aspects" is used because what is predicted is a hidden state, not an observation or an environment state.)
1. The model receives the observation (for example, an image of the Go board
or the Atari screen) as an input and transforms it into a hidden state.
Summary: the model takes the observation (for example the Go board or the Atari screen) and converts it into a
hidden state.
2. The hidden state is then updated iteratively by a recurrent process that
receives the previous hidden state and a hypothetical next action.
Summary: the hidden state is updated iteratively by a recurrent process that takes the previous hidden state and a
hypothetical next action.
3. At every one of these steps, the model produces a policy (predicting the move
to play), value function (predicting the cumulative reward, for example, the
eventual winner) and immediate reward prediction (for example, the points
scored by playing a move).
Summary: at every step the model produces a policy (prediction of the move), a value function (prediction of the
cumulative reward, for example the eventual winner) and an immediate reward prediction (for example the points
scored by a move).
4. The model is trained end to end, with the sole objective of accurately
estimating these three important quantities, to match the improved policy and
value function generated by search, as well as the observed reward.
Summary: the model is trained end to end with the single objective of estimating these three quantities accurately,
so that they match the improved policy and value function produced by search and the observed reward.

There is no direct requirement or constraint on the hidden state to capture all
information necessary to reconstruct the original observation, drastically reducing
the amount of information the model has to maintain and predict.
Summary: the hidden state is not required to hold all the information needed to reconstruct the original
observation, which greatly reduces the amount of information the model has to maintain and predict.
Neither is there any requirement for the hidden state to match the unknown, true
state of the environment; nor any other constraints on the semantics of state.
Summary: the hidden state also does not have to match the unknown, true state of the environment, and there are no
other constraints on what the state means.
Instead, the hidden states are free to represent any state that correctly estimates
the policy, value function and reward. Intuitively, the agent can invent, internally,
any dynamics that lead to accurate planning.
Summary: the hidden states are free to represent anything that yields correct estimates of the policy, value
function and reward. Intuitively, the agent can invent, internally, any dynamics that lead to accurate planning.

Fig. 1a, How MuZero uses its model to plan.
The model consists of three connected components for representation, dynamics and prediction.
Summary: the model consists of three connected components: representation, dynamics and prediction.
Given a previous hidden state s_{k−1} and a candidate action a_k, the dynamics function g produces an
immediate reward r_k and a new hidden state s_k.
Summary: given the previous hidden state s_{k−1} and a candidate action a_k, the dynamics function g returns an
immediate reward r_k and a new hidden state s_k.
The policy p_k and value function v_k are computed from the hidden state s_k by a prediction function f.
Summary: the prediction function f takes the hidden state s_k as input and computes the policy p_k and the value v_k.
The initial hidden state s_0 is obtained by passing the past observations (for example, the Go board or
Atari screen) into a representation function h.
Summary: the initial hidden state s_0 comes from the representation function h, which takes the past observations
as input.
Search tree structure
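The way the three functions fit together can be sketched with toy stand-ins. The bodies below are arbitrary arithmetic chosen only so that the data flow is visible; in MuZero each of h, g and f is a neural network, and the names `representation`, `dynamics`, `prediction` and `unroll` are assumptions made for this example.

```python
def representation(observations):
    """h: past observations -> initial hidden state s_0 (toy: a single number)."""
    return float(sum(observations))

def dynamics(state, action):
    """g: (s_{k-1}, a_k) -> (r_k, s_k). Toy transition and toy reward."""
    next_state = 0.5 * state + action
    reward = float(action)
    return reward, next_state

def prediction(state):
    """f: s_k -> (p_k, v_k). Toy uniform policy over two actions, toy value."""
    policy = [0.5, 0.5]
    value = state
    return policy, value

def unroll(observations, actions):
    """Apply h once, then g and f once per hypothetical action."""
    state = representation(observations)
    policy, value = prediction(state)
    outputs = [(None, policy, value)]  # no reward is predicted at k = 0
    for action in actions:
        reward, state = dynamics(state, action)
        policy, value = prediction(state)
        outputs.append((reward, policy, value))
    return outputs

for k, (reward, policy, value) in enumerate(unroll([1, 2, 3], [1, 0])):
    print(k, reward, policy, value)
```

Note that the real environment is touched only through the observations passed to `representation`; every later step runs on hidden states.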

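Source: YouTube, 팡요랩, AlphaGo Zero paper analysis: https://youtu.be/CgOGKChwWrw

In AlphaGo Zero, the raw network (policy network) used without MCTS is more than 2,000 Elo points
weaker than the same network used with MCTS.
(A 200-point difference corresponds to roughly a 75% win rate.)
Image source: Mastering the game of Go without human knowledge (Nature)
The effect of MCTS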
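The "200 points is about 75%" rule of thumb follows from the standard Elo expected-score formula. A minimal check, assuming the usual logistic form with a scale of 400 (the function name is hypothetical):

```python
def elo_expected_score(rating_diff):
    """Expected score of the stronger player for a given Elo rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

print(round(elo_expected_score(200), 3))   # about 0.76
print(elo_expected_score(2000))            # very close to 1
```

A gap of 2,000 points therefore means the raw network would almost never beat the version that searches.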

b, How MuZero acts in the environment.
An MCTS is performed at each timestep t, as described in a. An action a_{t+1} is sampled
from the search policy π_t, which is proportional to the visit count for each action from the
root node.
Summary: MCTS is run at every step t as described in panel a. One action a_{t+1} is sampled from the search policy
π_t, which is proportional to the visit count of each action at the root node.
The environment receives the action and generates a new observation o_{t+1} and reward
u_{t+1}. At the end of the episode, the trajectory data are stored into a replay buffer.
Summary: the environment receives the action and produces a new observation o_{t+1} and reward u_{t+1}. At the end
of the episode the trajectory data are stored in the replay buffer.
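A small sketch of how an action could be drawn in proportion to the root visit counts. The dictionary of counts and the function names are assumptions for this example; any temperature or other adjustment to the counts is left out.

```python
import random

def search_policy(visit_counts):
    """Normalize root visit counts into a probability distribution over actions."""
    total = sum(visit_counts.values())
    return {action: count / total for action, count in visit_counts.items()}

def sample_action(visit_counts, rng=random):
    """Sample one action with probability proportional to its visit count."""
    policy = search_policy(visit_counts)
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

counts = {"left": 30, "right": 10, "stay": 10}   # e.g. after 50 simulations
print(search_policy(counts))
print(sample_action(counts, random.Random(0)))
```

The same normalized distribution is what later serves as the policy target π_t during training.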

c, How MuZero trains its model.
A trajectory is sampled from the replay buffer. For the initial step, the representation function h receives as input the past
observations o_1, ..., o_t from the selected trajectory.
Summary: one trajectory is taken from the replay buffer. At the first step, the representation function h receives
the past observations o_1, ..., o_t of the selected trajectory.
The model is subsequently unrolled recurrently for K steps.
Summary: the model is then unrolled recurrently for K steps.
At each step k, the dynamics function g receives as input the hidden state s_{k−1} from the previous step and the real action a_{t+k}.
Summary: at each step, the dynamics function g receives s_{k−1} from the previous step and the real action a_{t+k}
that was taken in the trajectory.
The parameters of the representation, dynamics and prediction functions are jointly trained, end to end, by backpropagation through
time, to predict three quantities: the policy p_k ≈ π_{t+k}, value function v_k ≈ z_{t+k} and reward r_k ≈ u_{t+k}, where z_{t+k} is a sample return:
either the final reward (board games) or n-step return (Atari). Schematic Go boards at the top of the figure represent the sequence
of observations.
Summary: the parameters of the representation, dynamics and prediction functions are trained jointly, end to end,
by backpropagation through time, so that p_k ≈ π_{t+k}, v_k ≈ z_{t+k} and r_k ≈ u_{t+k}. z_{t+k} is a sample
return: the final outcome for board games, the n-step return for Atari. (In other words, the loss is built so that,
when the model is fed the same actions as the observed trajectory, the output of each function matches the value
from the real environment.)

Previous work
RL can be subdivided into two principal categories: model based and model free.
Summary: RL splits into two categories, model-based and model-free.
Model-based RL constructs, as an intermediate step, a model of the environment. Classically,
this model is represented by a Markov decision process (MDP) consisting of two components: a
state transition model, predicting the next state given the selected action, and a reward model,
predicting the expected reward during that transition.
Summary: model-based RL builds a model of the environment as an intermediate step. Classically this model is a
Markov decision process (MDP) with two components: a state transition model, which predicts the next state given
the selected action, and a reward model, which predicts the expected reward during that transition.
Once a model has been constructed, it is straightforward to apply MDP planning algorithms, such
as value iteration or Monte Carlo tree search (MCTS), to compute the optimal value function or
optimal policy for the MDP.
Summary: once the model exists, MDP planning algorithms such as value iteration or MCTS can be applied directly to
compute the optimal value function or the optimal policy for the MDP.
In large or partially observed environments, the algorithm must first construct the state
representation that the model should predict. This tripartite separation between representation
learning, model learning and planning is potentially problematic, as the agent is not able to
optimize its representation or model for the purpose of effective planning, so, for example,
modelling errors may compound during planning.
Summary: in large or partially observed environments, the algorithm first has to construct the state representation
that the model should predict. Separating representation learning, model learning and planning into three stages
can be a problem: the agent cannot optimize its representation or its model for effective planning, so, for
example, modelling errors may compound during planning.

A common approach to model-based RL focuses on directly modelling the observation stream at the
pixel level. It has been hypothesized that deep, stochastic models may mitigate the problems of
compounding error.
Summary: a common approach in model-based RL is to model the observation stream directly at the pixel level. It has
been hypothesized that deep, stochastic models can mitigate compounding error.
However, planning at pixel-level granularity is not computationally tractable in large-scale problems.
Other methods build a latent state-space model that is sufficient to reconstruct the observation
stream at the pixel level or to predict its future latent states, which facilitates more efficient planning
but still focuses the majority of the model capacity on potentially irrelevant detail.
Summary: planning at pixel-level granularity is computationally intractable for large problems. Other methods build
a latent state-space model that can reconstruct the observation stream at the pixel level or predict future latent
states. That makes planning more efficient, but most of the model capacity still goes to detail that may be
irrelevant to the task. (nomoreid: in short, a lot is wasted on observation reconstruction that has nothing to do
with the task.)
None of these previous methods have constructed a model that facilitates effective planning in
visually complex domains such as Atari; results lag behind well tuned, model-free methods, even in
terms of data efficiency.
Summary: none of these earlier methods built a model that enables effective planning in visually complex domains
such as Atari. Their results lag behind well-tuned model-free methods, even in data efficiency.
A quite different approach to model-based RL has recently been developed, focused end to end on
predicting the value function. The main idea of these methods is to construct an abstract MDP model
such that planning in the abstract MDP is equivalent to planning in the real environment.
Summary: a rather different approach to model-based RL was developed recently, focused end to end on predicting the
value function. The main idea is to construct an abstract MDP such that planning in the abstract MDP is equivalent
to planning in the real environment.
This is achieved by ensuring value equivalence, that is, that, starting from the same real state, the
cumulative reward of a trajectory through the abstract MDP matches the cumulative reward of a
trajectory in the real environment.
Summary: this is achieved through value equivalence: starting from the same real state, the cumulative reward of a
trajectory through the abstract MDP matches the cumulative reward of a trajectory in the real environment.

The predictron introduced value equivalent models for predicting value functions (without actions).
Summary: the predictron (a 2017 paper by David Silver et al.) introduced value-equivalent models for predicting
value functions (without actions).
Although the underlying model still takes the form of an MDP, there is no requirement for its transition
model to match real states in the environment. Instead the MDP model is viewed as a hidden layer of
a deep neural network.
Summary: the underlying model still has the form of an MDP, but its transition model is not required to match the
real states of the environment. Instead the MDP model is viewed as a hidden layer of a deep neural network.
The unrolled MDP is trained such that the expected cumulative sum of rewards matches the
expected value with respect to the real environment, for example, by temporal-difference learning.
Value equivalent models have also been applied to optimizing value (with actions).
Summary: the unrolled MDP is trained so that its expected cumulative sum of rewards matches the expected value in
the real environment (for example with TD learning). Value-equivalent models have also been applied to optimizing
value (with actions).
Value-aware model learning constructs an MDP model, such that a step of value iteration using the
model produces the same outcome as the real environment. TreeQN learns an abstract MDP model,
such that a tree search over that model (represented by a tree-structured neural network)
approximates the optimal value function. Value iteration networks learn a local MDP model, such that
many steps of value iteration over that model (represented by a convolutional neural network)
approximates the optimal value function. Value prediction networks are perhaps the closest precursor
to MuZero: they learn an MDP model grounded in real actions; the unrolled MDP is trained such that
the cumulative sum of rewards, conditioned on the actual sequence of actions generated by a simple
lookahead search, matches the real environment. Unlike MuZero there is no policy prediction, and
the search utilizes only value prediction.

Model-based vs model-free (from OpenAI Spinning Up)
Source: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
MuZero


Dyna: Integrated Planning, Acting, and Learning
- Builds an environment model (a table) from the experience gathered while doing RL
- The environment model is used for planning, which helps compute the value/policy
Source: http://incompleteideas.net/book/RLbook2020trimmed.pdf, p. 162
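The two bullets above correspond to the tabular Dyna-Q scheme. Below is a minimal sketch of one update, assuming a deterministic environment and a dictionary-based Q-table and model; the function name and default parameters are illustrative choices, not values from the book.

```python
import random

def dyna_q_update(q, model, s, a, r, s2, actions,
                  alpha=0.5, gamma=0.9, planning_steps=5, rng=random):
    """One Dyna-Q step: learn from a real transition, then plan with the model."""

    def backup(state, action, reward, next_state):
        # One-step Q-learning backup on a (real or simulated) transition.
        best_next = max(q.get((next_state, b), 0.0) for b in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

    backup(s, a, r, s2)        # direct RL from real experience
    model[(s, a)] = (r, s2)    # model learning: remember what the environment did
    for _ in range(planning_steps):
        # Planning: replay transitions sampled from the learned table.
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        backup(ps, pa, pr, ps2)

q, model = {}, {}
dyna_q_update(q, model, s=0, a=1, r=1.0, s2=1, actions=[0, 1], planning_steps=1)
print(q, model)
```

With a single stored transition, the planning step simply repeats the same backup, so the Q-value moves from 0.5 to 0.75 without any further interaction with the environment.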

World Model
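- Builds the environment model with a VAE and an MDN-RNN
- The model is used for RL training
- The latent vector z produced by the VAE is used as input
- The imagined future that the MDN-RNN predicts from z is used for training
- Because z is a vector for reconstructing the image, it is likely to contain a lot of noise that
does not serve the RL objective
- This is the kind of method the MuZero paper refers to in: "Typically, these models have either
focused on reconstructing the true environmental state
or the sequence of full observations."
Image source: https://worldmodels.github.io/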

Source: https://vimeo.com/238243832

MuZero algorithm
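We now describe the MuZero algorithm in more detail. Predictions are made at
each time step t, for each of k = 0, ..., K steps, by a model μ_θ, with parameters θ,
conditioned on past observations o_1, ..., o_t and for k > 0 on future actions a_{t+1}, ...,
a_{t+k}. The model predicts three future quantities:
  policy:  p_t^k ≈ π(a_{t+k+1} | o_1, ..., o_t, a_{t+1}, ..., a_{t+k})
  value:   v_t^k ≈ E[u_{t+k+1} + γ u_{t+k+2} + ... | o_1, ..., o_t, a_{t+1}, ..., a_{t+k}]
  reward:  r_t^k ≈ u_{t+k}
where u is the true, observed reward, π is the policy used to select real actions
and γ is the discount function of the environment.
Summary: at every time step the model μ_θ (parameters θ) makes its predictions from the past observations
(o_1, ..., o_t) and, for k > 0, the future actions (a_{t+1}, ..., a_{t+k}). u is the reward actually observed, π is
the policy that selected the real actions, and γ is the discount of the environment.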

Internally, at each time step t (subscripts t are suppressed for simplicity), the model is
represented by the combination of a representation function, a dynamics function and a
prediction function.
Summary: internally, at each time step t (t is dropped for simplicity), the model is the combination of a
representation function, a dynamics function and a prediction function.
The dynamics function g_θ is a recurrent process, r_k, s_k = g_θ(s_{k−1}, a_k), that computes, at each
hypothetical step k, an immediate reward r_k and an internal state s_k. It mirrors the structure of
an MDP model that computes the expected reward and state transition for a given state and
action.
Summary: the dynamics function g is a recurrent process that computes, at each hypothetical step k, an immediate
reward r_k and an internal state s_k: r_k, s_k = g_θ(s_{k−1}, a_k).
However, unlike traditional approaches to model-based RL, this internal state s_k has no
semantics of environment state attached to it - it is simply the hidden state of the overall model
and its sole purpose is to accurately predict relevant, future quantities: policies, values and
rewards.
Summary: unlike traditional model-based RL, the internal state s_k carries no environment-state meaning. It is
simply the hidden state of the model, and its only purpose is to predict future policies, values and rewards
accurately.
In this paper, the dynamics function is represented deterministically; the extension to stochastic
transitions is left for future work.
Summary: in this paper the dynamics function is deterministic; stochastic transitions are left for future work.
(With stochastic transitions, the outcome of an action is probabilistic. For comparison, a deterministic policy is
π(x) = a and a stochastic policy is π(a|x) = P(A = a | X = x), ∀a ∈ A(x), ∀x ∈ X.)
A prediction function f_θ computes the policy and value functions from the internal state s_k, p_k, v_k
= f_θ(s_k), akin to the joint policy and value network of AlphaZero. A representation function h_θ
initializes the 'root' state s_0 by encoding past observations, s_0 = h_θ(o_1, ..., o_t); again, this has no
special semantics beyond its support for future predictions.
Summary: the prediction function f_θ computes the policy and value from the internal state s_k,
p_k, v_k = f_θ(s_k), much like AlphaZero's joint policy and value network. The representation function h_θ encodes
the past observations to initialize the 'root' state, s_0 = h_θ(o_1, ..., o_t). Once again, this state has no
special meaning beyond supporting future predictions.

Given such a model, it is possible to search over hypothetical future trajectories a_1,
..., a_k given past observations o_1, ..., o_t. For example, a naive search could simply
select the k-step action sequence that maximizes the value function.
Summary: with such a model, given the past observations (o_1, ..., o_t), we can search over hypothetical future
trajectories (a_1, ..., a_k). For example, a naive search could pick the k-step action sequence that maximizes the
value function.
More generally, we may apply any MDP planning algorithm to the internal rewards
and state space induced by the dynamics function. Specifically, we use an MCTS
algorithm similar to AlphaZero's search, generalized to allow for single-agent domains
and intermediate rewards (Methods).
Summary: more generally, any MDP planning algorithm can be applied to the internal rewards and state space induced
by the dynamics function. The paper uses an MCTS algorithm similar to AlphaZero's search, generalized to allow
single-agent domains and intermediate rewards.
The MCTS algorithm may be viewed as a search policy π_t = P[a_{t+1} | o_1, ..., o_t] and
search value function ν_t ≈ E[u_{t+1} + γ u_{t+2} + ... | o_1, ..., o_t] that both selects an action and
predicts cumulative reward given past observations o_1, ..., o_t. At each internal node, it
makes use of the policy, value function and reward estimate produced by the current
model parameters θ, and combines these values together using lookahead search to
produce an improved policy π_t and improved value function ν_t at the root of the search
tree. The next action a_{t+1} ∼ π_t is then chosen by the search policy.
Summary: MCTS can be viewed as a search policy π_t = P[a_{t+1} | o_1, ..., o_t] and a search value function
ν_t ≈ E[u_{t+1} + γ u_{t+2} + ... | o_1, ..., o_t]: given the past observations o_1, ..., o_t, it both selects an
action and predicts the cumulative reward. At each internal node it uses the policy, value function and reward
estimate produced by the current model parameters θ, and combines them through lookahead search into an improved
policy π_t and an improved value function ν_t at the root of the search tree. The next action a_{t+1} is then
sampled from the search policy.

All parameters of the model are trained jointly to accurately match the policy, value
function and reward prediction, for every hypothetical step k, to three corresponding
targets observed after k actual time steps have elapsed.
Summary: all parameters of the model are trained jointly so that, for every hypothetical step k, the policy, value
and reward predictions match the three corresponding targets observed after k real time steps.
Similarly to AlphaZero, the first objective is to minimize the error between the actions
predicted by the policy p_t^k and by the search policy π_{t+k}. Also like AlphaZero, value
targets are generated by playing out the game or MDP using the search policy.
Summary: as in AlphaZero, the first objective is to minimize the error between the actions predicted by the policy
p_t^k and those of the search policy π_{t+k}. Also as in AlphaZero, the value targets are generated by playing out
the game or MDP with the search policy.
However, unlike AlphaZero, we allow for long episodes with discounting and
intermediate rewards by computing an n-step return z_t that bootstraps n steps into the
future from the search value,
  z_t = u_{t+1} + γ u_{t+2} + ... + γ^{n−1} u_{t+n} + γ^n ν_{t+n}.
Summary: unlike AlphaZero, MuZero allows long episodes with discounting and intermediate rewards by computing an
n-step return z_t that bootstraps n steps into the future from the search value. (This is what lets search be used
on more general problems such as Atari.)
Final outcomes {lose, draw, win} in board games are treated as rewards u_t ∈ {−1, 0,
+1} occurring at the final step of the episode.
Summary: in board games the final outcomes lose, draw and win are treated as rewards −1, 0 and +1 at the last step
of the episode.
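A minimal sketch of the n-step return z_t = u_{t+1} + γ u_{t+2} + ... + γ^{n−1} u_{t+n} + γ^n ν_{t+n}. The list layout (`rewards[i]` holding u_{i+1}, `search_values[i]` holding ν_i) and the choice to treat everything past the end of the episode as zero are assumptions made for this example.

```python
def n_step_return(rewards, search_values, t, n, gamma):
    """n-step return z_t bootstrapped from the search value nu_{t+n}.

    rewards[i]       : u_{i+1}, the reward observed after acting at step i
    search_values[i] : nu_i, the search value at step i
    Terms that fall beyond the end of the episode contribute 0.
    """
    z = 0.0
    for i in range(n):
        if t + i < len(rewards):
            z += (gamma ** i) * rewards[t + i]
    if t + n < len(search_values):
        z += (gamma ** n) * search_values[t + n]
    return z

# Atari-like case: intermediate rewards, discounting, bootstrap after 2 steps.
print(n_step_return([1.0, 0.0, 2.0, 0.0], [0.5, 0.4, 0.3, 0.2], t=0, n=2, gamma=0.5))

# Board-game-like case: one final reward, no discount, n longer than the game.
print(n_step_return([0.0, 0.0, 1.0], [0.1, 0.2, 0.3], t=0, n=10, gamma=1.0))
```

In the second call the return reduces to the final outcome, which matches the board-game setting described above.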

Specifically, the second objective is to minimize the error between the value function
v_t^k and the value target, z_{t+k}. The third objective is to minimize the error between the
predicted immediate reward r_t^k and the observed immediate reward u_{t+k}. Finally, an L2
regularization term is also added, scaled by a constant c, leading to the overall loss:
  l_t(θ) = Σ_{k=0..K} l^p(π_{t+k}, p_t^k) + Σ_{k=0..K} l^v(z_{t+k}, v_t^k) + Σ_{k=1..K} l^r(u_{t+k}, r_t^k) + c‖θ‖²
Summary: the second objective is to reduce the error between the value v_t^k and the value target z_{t+k}. The
third is to reduce the error between the predicted reward r_t^k and the observed reward u_{t+k}. Finally, an L2
regularization term scaled by a constant c is added to form the overall loss.
where l^p, l^v and l^r are loss functions for policy, value and reward, respectively.
Supplementary Fig. 2 summarizes the equations governing how the MuZero algorithm
plans, acts and learns. We note that for chess, Go and shogi, the same squared error
loss as AlphaZero is used for rewards and values. A cross-entropy loss was found to
be more stable than a squared error when encountering rewards and values of
variable scale in Atari. Cross-entropy was used for the policy loss in both cases.
Summary: l^p, l^v and l^r are the loss functions for policy, value and reward. Supplementary Fig. 2 summarizes the
equations for how MuZero plans, acts and learns. For chess, Go and shogi, rewards and values use the same squared
error loss as AlphaZero. For Atari, where rewards and values vary in scale, a cross-entropy loss was found to be
more stable than a squared error. The policy loss is cross-entropy in both cases.
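The structure of the overall loss can be sketched as a plain sum over the unrolled steps. This example assumes the board-game setting (squared error for value and reward, cross-entropy for the policy) and assumes that reward terms start at k = 1; the function names, the tuple layout and the default `c` are illustrative.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def squared_error(target, predicted):
    return (target - predicted) ** 2

def muzero_loss(targets, predictions, params, c=1e-4):
    """Sum of policy, value and reward losses over k = 0..K plus L2 regularization.

    targets[k]     = (pi_{t+k}, z_{t+k}, u_{t+k})
    predictions[k] = (p_t^k,    v_t^k,   r_t^k)
    """
    loss = 0.0
    for k, ((pi, z, u), (p, v, r)) in enumerate(zip(targets, predictions)):
        loss += cross_entropy(pi, p)        # l^p
        loss += squared_error(z, v)         # l^v
        if k > 0:                           # assumed: no reward term at the root
            loss += squared_error(u, r)     # l^r
    loss += c * sum(w * w for w in params)  # c * ||theta||^2
    return loss

targets = [([1.0, 0.0], 1.0, 0.0), ([0.5, 0.5], 0.0, 1.0)]
predictions = [([1.0, 0.0], 1.0, 0.0), ([0.5, 0.5], 0.5, 0.5)]
print(muzero_loss(targets, predictions, params=[1.0, 2.0], c=0.1))
```

For Atari, the two `squared_error` calls would be replaced by cross-entropy over a categorical representation of the scalar, as the text above notes.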

φ(x) refers to the representation of a
real number x through a linear
combination of its adjacent integers, as
described in the Network Architecture
section.
Summary: φ(x) represents a real number x as a linear combination of its
adjacent integers, as described in the Network Architecture section.
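A small sketch of such a representation: a real number is spread over its two neighbouring integers so that the weighted sum recovers it (for example, 3.7 becomes weight 0.3 on 3 and 0.7 on 4). The support range, the function names and the clamping are assumptions for this example, and any scaling applied to the scalar before encoding is left out.

```python
import math

def two_hot(x, low, high):
    """Encode real x as weights over the integers low..high (inclusive)."""
    x = min(max(x, low), high)          # clamp to the support
    lower = math.floor(x)
    frac = x - lower
    weights = [0.0] * (high - low + 1)
    weights[lower - low] = 1.0 - frac   # weight on the integer below x
    if frac > 0:
        weights[lower - low + 1] = frac # weight on the integer above x
    return weights

def decode(weights, low):
    """Recover the real number as the weighted sum of the support integers."""
    return sum(w * (low + i) for i, w in enumerate(weights))

encoded = two_hot(3.7, 0, 5)
print(encoded)
print(decode(encoded, 0))
```

Trained with cross-entropy against this target, a network outputs a distribution over the support, and `decode` turns that distribution back into a scalar.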

Next?
● The rest of the paper
○ Results
○ Method
■ Comparison to AlphaZero
■ Search
■ Selection
■ Expansion
■ Backup
■ Hyperparameters
■ Data generation
■ Observation and action encoding : Representation function
■ Dynamics function
■ Network architecture
■ Training
■ MuZero Reanalyze
■ Evaluation

Results
We applied the MuZero algorithm to the classic board games Go, chess and shogi, as
benchmarks for challenging planning problems, and to all 57 games in the Atari learning
environment, as benchmarks for visually complex RL domains.
In each case, we trained MuZero for K = 5 hypothetical steps. Training proceeded for one
million mini-batches of size 2,048 in board games and of size 1,024 in Atari. During both
training and evaluation, MuZero used 800 simulations for each search in board games
and 50 simulations for each search in Atari.
Summary: K is set to 5 (the model is unrolled for 5 hypothetical steps during training). Training ran for one
million mini-batches, of size 2,048 for board games and 1,024 for Atari. Each search used 800 simulations in board
games and 50 simulations in Atari.
The representation function uses the same convolutional and residual architecture as
AlphaZero, but with 16 residual blocks instead of 20. The dynamics function uses the
same architecture as the representation function and the prediction function uses the
same architecture as AlphaZero. All networks use 256 hidden planes (see Methods for
further details).
Summary: the representation function uses the same convolutional and residual architecture as AlphaZero, but with
16 residual blocks instead of 20. The dynamics function has the same architecture as the representation function,
and the prediction function is the same as in AlphaZero. All networks use 256 hidden planes (see Methods for
details).

AlphaGo Zero's network architecture
https://dylandjian.github.io/alphago-zero/
https://jonathan-hui.medium.com/alphago-zero-a-game-changer-14ef6e45eba5

Figure 2 shows the performance throughout training in each game. In Go, MuZero slightly
exceeded the performance of AlphaZero, despite using less computation per node in the search
tree (16 residual blocks per evaluation in MuZero compared with 20 blocks in AlphaZero). This
suggests that MuZero may be caching its computation in the search tree and using each
additional application of the dynamics model to gain a deeper understanding of the position.
Summary: Fig. 2 shows performance during training for each game. In Go, MuZero slightly exceeds AlphaZero even
though it uses less computation per node in the search tree (16 residual blocks in MuZero vs 20 in AlphaZero).
This suggests that MuZero may be caching its computation in the search tree and using each additional application
of the dynamics model to gain a deeper understanding of the position.
In Atari, MuZero achieved state-of-the-art performance for both mean and median normalized
score across the 57 games of the arcade learning environment, outperforming the previous
state-of-the-art method R2D2 (a model-free approach) in 42 out of 57 games, and outperforming
the previous best model-based approach SimPLe in all games (Table 1 and Supplementary
Table 1).
Summary: in Atari, MuZero achieved SOTA in both mean and median normalized score across the 57 games of the arcade
learning environment. It beat R2D2, the previous model-free SOTA, in 42 of the 57 games, and beat SimPLe, the
previous best model-based approach, in all games.
We also evaluated a second version of MuZero that was optimized for greater sample efficiency.
Specifically, it reanalyses old trajectories by re-running the MCTS using the latest network
parameters to provide fresh targets (see 'MuZero Reanalyze' in Methods). When applied to 57
Atari games, using 200 million frames of experience per game, MuZero Reanalyze achieved
731% median normalized score, compared with 192%, 231% and 431% for previous
state-of-the-art model-free approaches IMPALA, Rainbow and LASER, respectively.
Summary: a second version of MuZero, optimized for sample efficiency, was also evaluated. It reanalyses old
trajectories by re-running MCTS with the latest network parameters to produce fresh targets (see "MuZero
Reanalyze" in Methods). With 200 million frames of experience per game on the 57 Atari games, MuZero Reanalyze
reached a median normalized score of 731%, compared with 192%, 231% and 431% for the previous model-free SOTA
methods IMPALA, Rainbow and LASER, respectively.

We compare separately against agents trained in large (top) and small (bottom) data settings; all agents other than MuZero
used model-free RL techniques. Mean and median scores are given, compared with human testers. The best results are
highlighted in bold. MuZero shows state-of-the-art performance in both settings. a: Hyperparameters were tuned per game.
Summary: the comparison is split into a large-data setting (top) and a small-data setting (bottom). All agents
other than MuZero use model-free RL. Mean and median scores are relative to human testers, and the best results are
in bold. MuZero is SOTA in both settings. (a: hyperparameters were tuned per game.)

To understand the role of the model in MuZero, we also ran several experiments, focusing on the
board game of Go and the Atari game of Ms. Pac-Man. First, we tested the scalability of planning
(Fig. 3a), in the canonical planning problem of Go. We compared the performance of search in
AlphaZero, using a perfect model, to the performance of search in MuZero, using a learned
model. Specifically, the fully trained AlphaZero or MuZero was evaluated by comparing MCTS
with different thinking times. MuZero matched the performance of a perfect model, even when
doing much larger searches (thinking time of up to 10 s) than those from which the model was
trained (thinking time of around 0.1 s; see also Supplementary Fig. 3a). MuZero안의 model의 역할을 이해하기
위해 우리는 바둑과 아타리 게임중 하나인 Ms Pac-Man을 중점으로 몇가지 실험을 돌렸다. 먼저 우리는 바둑의 표준 계획 문제에서 계획의 확장성에
(scalability of planning) 대해서 실험했다.(Fig 3a) 완벽한 시뮬레이터를 사용하는 AlphaZero 에서의 서치 성능과 학습된 모델을 사용하는 MuZero의 서치
성능을 비교했다. 특히 충분히 훈련된 AlphaZero와 MuZero 에서 MCTS 시간의 차이를 비교 평가했다. Muzero는 모델이 학습할때보다 더 큰 서치를 하는
상황에서도 완벽한 모델의 성능과 일치했다. (Fig. 3a 0.1초의 사고시간) 아마도 Perfect simulator의 오타인듯.
We also investigated the scalability of planning across all Atari games (Fig. 3b). We compared
MCTS with different numbers of simulations, using the fully trained MuZero. The improvements
due to planning are much less marked than in Go, perhaps because of greater model inaccuracy;
performance improved slightly with search time, but plateaued at around 100 simulations. Even
with a single simulation—that is, when selecting moves solely according to the policy
network—MuZero performed well, suggesting that, by the end of training, the raw policy has
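learned to internalize the benefits of search (see also Supplementary Fig. 3b).
We also investigated the scalability of planning across all Atari games (Fig. 3b), comparing MCTS with different numbers of simulations on the fully trained MuZero. The gains from planning were much less marked than in Go, perhaps because the learned model is less accurate. Performance improved slightly with search time but plateaued at around 100 simulations (that is, 100 simulations per step). Even with a single simulation, which amounts to simply following the policy network, MuZero performed well. This suggests that by the end of training the raw policy network has learned to internalize the benefits of search.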

Fig. 3 | Evaluations of MuZero on Go, all 57 Atari games and Ms. Pac-Man.
a, Scaling with search time per move in Go, comparing the learned model with the ground truth simulator. Both
networks were trained at 800 simulations per search, equivalent to 0.1 s per search. Remarkably, the learned
model is able to scale well to up to two orders of magnitude longer searches than seen during training.
b, Scaling of final human normalized mean score in Atari with the number of simulations per search. The network
was trained at 50 simulations per search. Dark line indicates mean score and the shaded regions indicate the
25th to 75th and 5th to 95th percentiles. The learned model’s performance increases up to 100 simulations per
search. Beyond, even when scaling to much longer searches than during training, the learned model’s
performance remains stable and decreases only slightly. This contrasts with the much better scaling in Go (a),
presumably due to greater model inaccuracy in Atari than Go.
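(nomoreid: the model was trained with 50 simulations per search. Beyond roughly 100 simulations, raising the number of simulations at evaluation time no longer improves performance.)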

Next, we tested our model-based learning algorithm against a comparable model-free learning
algorithm (Fig. 3c). We replaced the training objective of MuZero (equation (1)) with a model-free
Q-learning objective (as used by R2D2), and the dual policy and value heads with a single head
representing the action-value function Q(⋅|st). Subsequently, we trained and evaluated the new
model without using any search. When evaluated on Ms. Pac-Man, our model-free algorithm
achieved identical results to R2D2, but learned much slower than MuZero and converged to a
much lower final score. We conjecture that the search-based policy improvement step of MuZero
provides a stronger learning signal than the high-bias, high-variance targets used by Q-learning.
Next, we compared our model-based algorithm against a comparable model-free algorithm (Fig. 3c). We replaced MuZero's training objective with a model-free Q-learning objective (as used by R2D2) and replaced the dual policy and value heads with a single head representing the action-value function Q. We then trained and evaluated the new model without any search. On Ms. Pac-Man it achieved results identical to R2D2, but it learned much more slowly than MuZero and converged to a much lower final score. We conjecture that MuZero's search-based policy improvement step provides a stronger learning signal than the high-bias, high-variance targets used by Q-learning.
To better understand the nature of MuZero’s learning algorithm, we measured how MuZero’s
training scales with respect to the amount of search it uses during training. Figure 3d shows the
performance in Ms. Pac-Man, using an MCTS of different simulation counts per move throughout
training. Surprisingly, and in contrast to previous work, even with only six simulations per
move—fewer than the number of actions—MuZero learned an effective policy and improved
rapidly. With more simulations, the performance jumped much higher. For analysis of the policy
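improvement during each individual iteration, see also Supplementary Fig. 3c, d.
To better understand the nature of MuZero's learning algorithm, we measured how training scales with the amount of search used during training. Fig. 3d shows performance on Ms. Pac-Man with MCTS at different simulation counts per move throughout training. Surprisingly, and in contrast to previous work, MuZero learned an effective policy and improved rapidly even with only six simulations per move, which is fewer than the number of actions. With more simulations, performance jumped much higher.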

c, Comparison of MCTS-based training with Q-learning in the MuZero framework on Ms. Pac-Man,
keeping network size and amount of training constant. The state-of-the-art Q-learning algorithm R2D2 is
shown as a baseline. Our Q-learning implementation reaches the same final score as R2D2, but improves
slower and results in much lower final performance compared with MCTS-based training.
d, Different networks trained at different numbers of simulations (sims) per move, but all evaluated at 50
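simulations per move.
(nomoreid: the networks were trained with different simulation counts but all evaluated at the same 50 simulations per move. This suggests that the number of simulations used during training has the larger effect on performance.)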

Conclusions
Many of the breakthroughs in artificial intelligence have been based on either high-performance
planning or model-free RL methods.
Here we have introduced a method that combines the benefits of both approaches. Our
algorithm, MuZero, has both matched the superhuman performance of high-performance
planning algorithms in their favoured domains (logically complex board games such as chess
and Go) and outperformed state-of-the-art model-free RL algorithms in their favoured domains
(visually complex Atari games). Crucially, our method does not require any knowledge of the
environment dynamics, potentially paving the way towards the application of powerful learning
and planning methods to a host of real-world domains for which there exists no perfect simulator.
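Many breakthroughs in artificial intelligence have been based on either high-performance planning or model-free RL. Here we introduced a method that combines the benefits of both. MuZero matched the superhuman performance of high-performance planning algorithms in their favoured domains (logically complex board games such as chess and Go) and outperformed state-of-the-art model-free RL algorithms in their favoured domains (visually complex Atari games). Crucially, the method requires no knowledge of the environment dynamics, which may pave the way to applying powerful learning and planning methods to real-world domains that have no perfect simulator.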

Method, Comparison to AlphaZero
MuZero is designed for a more general setting than AlphaGo Zero and AlphaZero.
In other words, MuZero targets a broader class of problems than AlphaGo Zero and AlphaZero.
In AlphaGo Zero and AlphaZero, the planning process makes use of a simulator
that samples the next state and reward (for example, according to the
environment’s dynamics, or the rules of the game).
In AlphaGo Zero and AlphaZero, planning relies on a simulator that samples the next state and reward, for example from the environment's dynamics or from the rules of the game.
The simulator updates the state of the game while traversing the search tree (Fig.
1a). The simulator is used to provide three important pieces of knowledge: (1)
state transitions in the search tree, (2) actions available at each node of the
search tree and (3) episode termination within the search tree. In MuZero, all of
these have been replaced with the use of a single implicit model learned by a
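neural network (Fig. 1b).
The simulator updates the state of the game while the search tree is traversed (Fig. 1a). It supplies three important pieces of knowledge: (1) state transitions in the search tree, (2) the actions available at each node of the search tree and (3) episode termination within the search tree. In MuZero, all three are replaced by a single implicit model learned by a neural network.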

(1) State transitions. AlphaZero had access to a perfect simulator of the environment’s
dynamics. In contrast, MuZero employs a learned dynamics model within its search. Under this
model, each node in the tree is represented by a corresponding hidden state; by providing a
hidden state sk−1 and an action ak to the model, the search algorithm can transition to a new
node sk = g(sk−1, ak).
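(1) State transitions: AlphaZero had access to a perfect simulator of the environment's dynamics. MuZero instead uses a learned dynamics model inside its search. Under this model, each node in the tree is represented by a corresponding hidden state. Given a hidden state s^{k−1} and an action a^k, the search algorithm transitions to a new node s^k = g(s^{k−1}, a^k).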
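As a minimal sketch of how a search can move between hidden states without a simulator, the snippet below unrolls a sequence of actions through a dynamics function. `toy_dynamics` is a hypothetical stand-in for the learned network g (it also returns a reward, as the Expansion section describes); only the calling pattern s^k = g(s^{k−1}, a^k) comes from the text.

```python
from typing import Callable, List, Sequence, Tuple

HiddenState = Tuple[float, ...]
Dynamics = Callable[[HiddenState, int], Tuple[float, HiddenState]]


def toy_dynamics(state: HiddenState, action: int) -> Tuple[float, HiddenState]:
    """Hypothetical stand-in for the learned g: returns (reward, next hidden state)."""
    next_state = tuple(0.5 * x + action for x in state)
    reward = float(action)
    return reward, next_state


def unroll(root: HiddenState, actions: Sequence[int], g: Dynamics) -> List[HiddenState]:
    """Apply s^k = g(s^{k-1}, a^k) for each action; no environment simulator is called."""
    states = [root]
    for action in actions:
        _reward, next_state = g(states[-1], action)
        states.append(next_state)
    return states


path = unroll((0.0, 2.0), [1, 0], toy_dynamics)
print(path)
```

In the real system g is a neural network and the hidden state is a stack of feature planes, so the states have no meaning outside the model.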
(2) Actions available. We consider a standard problem formulation where the set of
available actions is provided at each time step alongside the observation.
(2) Available actions: we assume the standard problem formulation in which the set of available actions is provided at every time step together with the observation.
During search, however, it could be helpful to specify the available actions at each interior
node—which would require knowledge of how the available actions change over time.
During search it could help to know which actions are available at each interior node, but that would require knowledge of how the set of available actions changes over time.
AlphaZero used the set of legal actions obtained from the simulator to mask the policy
network at interior nodes. MuZero does not perform any masking within the search tree,
but only masks legal actions at the root of the search tree where the set of available
actions is directly observed. The policy network rapidly learns to exclude actions that are
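unavailable, simply because they are never selected.
AlphaZero used the set of legal actions obtained from the simulator to mask the policy network at interior nodes. MuZero performs no masking inside the search tree; it masks illegal actions only at the root, where the set of available actions is directly observed. The policy network quickly learns to exclude unavailable actions, simply because they are never selected.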
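The difference between the root and interior nodes can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the policy is assumed to arrive as a list of logits indexed by action.

```python
import math
from typing import Dict, Sequence


def root_policy(policy_logits: Sequence[float], legal_actions: Sequence[int]) -> Dict[int, float]:
    """Softmax restricted to the legal actions observed at the root."""
    exps = {a: math.exp(policy_logits[a]) for a in legal_actions}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def interior_policy(policy_logits: Sequence[float]) -> Dict[int, float]:
    """Inside the tree no legal-action set is known, so every action keeps its prior."""
    exps = [math.exp(x) for x in policy_logits]
    total = sum(exps)
    return {a: e / total for a, e in enumerate(exps)}


logits = [0.0, 0.0, 0.0]
print(root_policy(logits, legal_actions=[0, 2]))
print(interior_policy(logits))
```

Because training targets come from root visit counts, illegal actions always receive zero target probability, which is how the unmasked interior policy ends up avoiding them.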

(3) Terminal states. AlphaZero stopped the search at tree nodes representing terminal states and used the
terminal value provided by the simulator instead of the value produced by the network. MuZero does not
give special treatment to terminal states and always uses the value predicted by the network.
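(3) Terminal states: AlphaZero stopped the search at tree nodes representing terminal states and used the terminal value provided by the simulator instead of the value produced by the network. MuZero gives terminal states no special treatment (it does not stop the search there) and always uses the value predicted by the network.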
Inside the tree, the search can proceed past a state that would terminate the simulator. In this case, the
network is expected to always predict the same value, which may be achieved by modelling terminal
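states as absorbing states during training.
Inside the tree, the search can continue past a state that would terminate the simulator. In that case the network is expected to keep predicting the same value, which can be achieved by modelling terminal states as absorbing states during training. (nomoreid: in an MDP, an absorbing state is one that never transitions to a different state; I read this as saying that unrolled steps past the end of an episode should not be trained as ordinary states.)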
In addition, MuZero is designed to operate in the general RL setting: single-agent domains with
discounted intermediate rewards of arbitrary magnitude. In contrast, AlphaGo Zero and AlphaZero were
designed to operate in two-player games with undiscounted terminal rewards of ±1.
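In addition, MuZero is designed for the general RL setting: single-agent domains (such as Atari) with discounted intermediate rewards of arbitrary magnitude. AlphaGo Zero and AlphaZero, by contrast, were designed for two-player games with undiscounted terminal rewards of ±1.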
Many other generalizations of MuZero may be possible, for example, to stochastic, continuous,
non-stationary or temporally extended environments, or to imperfect information or general sum games.
These generalizations are left for future work.
(Extensions listed by the authors: stochastic, continuous, non-stationary or temporally extended environments, and imperfect-information or general-sum games.)

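Source: https://www.cs.cmu.edu/~avrim/ML14/lect0409.pdf
- Zero-sum game: one player's gain is the other player's loss (most two-player board games).
- General-sum game: payoffs need not sum to zero, so the players can all win together or all lose together.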

Search
We now describe the search algorithm used by MuZero. Our approach is based
on MCTS with upper confidence bounds, an approach to planning that converges
asymptotically to the optimal policy in single agent domains and to the minimax
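value function in zero sum games.
We now describe MuZero's search algorithm. It is based on MCTS with upper confidence bounds, a planning approach that converges asymptotically to the optimal policy in single-agent domains and to the minimax value function in zero-sum games (one-on-one board games such as Go).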
Every node of the search tree is associated with an internal state s. For each
action a from s there is an edge (s, a) that stores a set of statistics {N(s, a), P(s,
a), Q(s, a), R(s, a), S(s, a)}, respectively representing visit counts N, policy P,
mean value Q, reward R and state transition S. Similar to AlphaZero, the search is
divided into three stages, repeated for a number of simulations.
Every node of the search tree is associated with an internal state s. Each action a from s has an edge (s, a) that stores a set of statistics: visit count N(s, a), policy P(s, a), mean value Q(s, a), reward R(s, a) and state transition S(s, a). As in AlphaZero, the search is divided into three stages that are repeated for a number of simulations. (nomoreid: the three stages appear to be the selection, expansion and backup stages described next.)

Selection
Each simulation starts from the internal root state s0 , and finishes when the simulation
reaches a leaf node sl . For each hypothetical time step k = 1 ... l of the simulation, an
action ak is selected according to the stored statistics for internal state sk−1, by maximizing
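over a probabilistic upper confidence tree (PUCT) bound:

\[ a^k = \arg\max_a \left[ Q(s,a) + P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \left( c_1 + \log\left( \frac{\sum_b N(s,b) + c_2 + 1}{c_2} \right) \right) \right] \quad (2) \]

where a and b are possible actions. The constants c1 and c2 are used to control the
influence of the policy P(s, a) relative to the value Q(s, a) as nodes are visited more
often. In our experiments, c1 = 1.25 and c2 = 19,652. For k < l, the next state and reward
are looked up in the state transition and reward table, s^k = S(s^{k−1}, a^k) and r^k = R(s^{k−1}, a^k).
Each simulation starts from the internal root state s^0 and ends when it reaches a leaf node s^l. At each hypothetical time step k = 1 ... l of the simulation, an action a^k is chosen from the statistics stored for the internal state s^{k−1} so as to maximize the PUCT upper confidence bound. The constants c1 and c2 control how strongly the policy P(s, a) influences the choice relative to the value Q(s, a) as nodes are visited more often; the experiments use c1 = 1.25 and c2 = 19,652. While k < l, the next state and reward are read from the state transition and reward tables.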

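Worked example (c1 = 1.25, c2 = 19,652): a parent node with 50 visits has three children a, b and c with 0, 10 and 40 visits. The exploration factor pbc, the term that multiplies P(s, a) in the PUCT formula, is then:
a: N = 0, parent visits = 50, pbc ≈ 8.86
b: N = 10, parent visits = 50, pbc ≈ 0.81
c: N = 40, parent visits = 50, pbc ≈ 0.22
The PUCT score therefore behaves as follows:
- The fewer visits a child has relative to its parent, the larger its pbc (exploration).
- When the children have similar visit counts, their pbc values are nearly equal, so the action with the highest Q(s, a) + P(s, a) · pbc is chosen (exploitation). While the total visit count is small the prior P(s, a) carries more weight; as visits accumulate, pbc shrinks and the prior matters less.
In the formula on the previous slide, v(a) = Q(s, a).
Image source: https://www.youtube.com/watch?v=L0A86LmH7Yw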
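The pbc values in the example can be reproduced with a few lines of Python. This is a sketch of the PUCT exploration term with the paper's constants c1 = 1.25 and c2 = 19,652; the function names are my own. The slide's figures agree up to rounding.

```python
import math

C1 = 1.25    # c1 from the paper
C2 = 19652   # c2 from the paper


def exploration_weight(parent_visits: int, child_visits: int,
                       c1: float = C1, c2: float = C2) -> float:
    """pbc: the factor that multiplies the prior P(s, a) in the PUCT rule."""
    growth = c1 + math.log((parent_visits + c2 + 1) / c2)
    return math.sqrt(parent_visits) / (1 + child_visits) * growth


def puct_score(q: float, prior: float, parent_visits: int, child_visits: int) -> float:
    """Q(s, a) + P(s, a) * pbc; selection takes the action with the largest score."""
    return q + prior * exploration_weight(parent_visits, child_visits)


for name, visits in [("a", 0), ("b", 10), ("c", 40)]:
    print(name, round(exploration_weight(50, visits), 2))
```

With c2 this large, the logarithmic term stays close to zero for realistic visit counts, so pbc is dominated by c1 · sqrt(parent visits) / (1 + child visits).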

Expansion
At the final time step l of the simulation, the reward and state are computed by the
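dynamics function, r^l, s^l = g_θ(s^{l−1}, a^l), and stored in the corresponding tables, R(s^{l−1}, a^l) = r^l and S(s^{l−1}, a^l) = s^l.
At the final time step l of the simulation, the reward and state are computed by the dynamics function g and stored in the corresponding tables.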
The policy and value function are computed by the prediction function, pl , vl = fθ
(sl ). A new node, corresponding to state sl is added to the search tree. Each edge
(sl , a) from the newly expanded node is initialized to {N(s l , a) = 0, Q(s l , a) = 0,
P(s^l, a) = p^l}.
The policy and value are computed by the prediction function. A new node corresponding to state s^l is added to the search tree, and each edge (s^l, a) leaving the newly expanded node is initialized to {N(s^l, a) = 0, Q(s^l, a) = 0, P(s^l, a) = p^l}.
Note that the search algorithm makes at most one call to the dynamics function
and prediction function respectively per simulation; the computational cost is of the
same order as in AlphaZero.
The search algorithm calls the dynamics function and the prediction function at most once each per simulation, so the computational cost is of the same order as in AlphaZero.
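The expansion step can be sketched like this. The dictionary layout and the `toy_dynamics` / `toy_prediction` functions are hypothetical placeholders for the learned networks g and f; only the order of calls and the initial edge statistics follow the text.

```python
from typing import Callable, Dict, List, Tuple


def expand_leaf(parent_state, action: int,
                dynamics: Callable, prediction: Callable) -> Dict:
    """One expansion: a single call to g and a single call to f."""
    reward, state = dynamics(parent_state, action)   # r^l, s^l = g(s^{l-1}, a^l)
    policy, value = prediction(state)                # p^l, v^l = f(s^l)
    # Every outgoing edge starts with N = 0, Q = 0 and the predicted prior.
    edges = {a: {"N": 0, "Q": 0.0, "P": p} for a, p in enumerate(policy)}
    return {"state": state, "reward": reward, "value": value, "edges": edges}


def toy_dynamics(state: int, action: int) -> Tuple[float, int]:
    """Hypothetical stand-in for g."""
    return float(action), state + action


def toy_prediction(state: int) -> Tuple[List[float], float]:
    """Hypothetical stand-in for f: fixed two-action policy, value grows with state."""
    return [0.25, 0.75], 0.1 * state


node = expand_leaf(3, 1, toy_dynamics, toy_prediction)
print(node)
```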

Backup
At the end of the simulation, the statistics along the trajectory are updated. The backup is
generalized to the case where the environment can emit intermediate rewards, have a
discount γ different from 1 and the value estimates are unbounded. (We note that in board
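games, the discount is assumed to be 1 and there are no intermediate rewards.)
At the end of the simulation, the statistics along the trajectory are updated. The backup is generalized to environments that can emit intermediate rewards, have a discount γ different from 1 and have unbounded value estimates. (In board games the discount is 1 and there are no intermediate rewards.)
For k = l ... 0, we form an l − k-step estimate of the cumulative discounted reward,
bootstrapping from the value function v^l:

\[ G^k = \sum_{\tau=0}^{l-1-k} \gamma^{\tau} r_{k+1+\tau} + \gamma^{l-k} v^l \]

That is, starting from the leaf value v^l we obtain an (l − k)-step estimate of the cumulative discounted reward, which is used to refresh the node values.
For k = l ... 1, we update the statistics for each edge (s^{k−1}, a^k) in the simulation path as
follows:

\[ Q(s^{k-1}, a^k) := \frac{N(s^{k-1}, a^k) \cdot Q(s^{k-1}, a^k) + G^k}{N(s^{k-1}, a^k) + 1}, \qquad N(s^{k-1}, a^k) := N(s^{k-1}, a^k) + 1 \]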
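A compact way to implement the backup is to walk the simulation path from the leaf to the root, using the recursion G^{k−1} = r_k + γ G^k with G^l = v^l. The sketch below assumes each edge is a plain dict with keys "N" and "Q"; that layout is illustrative, not the paper's data structure.

```python
from typing import Dict, List


def backup(path: List[Dict[str, float]], rewards: List[float],
           leaf_value: float, discount: float = 0.997) -> float:
    """Update edges (s^{k-1}, a^k), k = l ... 1, and return G^0.

    path[k-1] holds the statistics of edge (s^{k-1}, a^k);
    rewards[k-1] is r_k, the reward predicted for that transition.
    """
    g = leaf_value  # G^l = v^l
    for edge, reward in zip(reversed(path), reversed(rewards)):
        # Running mean of the returns seen through this edge.
        edge["Q"] = (edge["N"] * edge["Q"] + g) / (edge["N"] + 1)
        edge["N"] += 1
        # Step one level closer to the root: G^{k-1} = r_k + gamma * G^k.
        g = reward + discount * g
    return g


edges = [{"N": 0, "Q": 0.0}, {"N": 0, "Q": 0.0}]
root_return = backup(edges, rewards=[1.0, 2.0], leaf_value=10.0, discount=0.5)
print(edges, root_return)
```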

In two-player zero sum games, the value functions are assumed to be bounded within the [0, 1] interval.
This choice allows us to combine value estimates with probabilities using a variant of the PUCT rule
(equation (2)).
In two-player zero-sum games the value function is assumed to be bounded within [0, 1], which makes it possible to combine value estimates with probabilities using a variant of the PUCT rule.
However, as in many environments the value is unbounded, it is necessary to adjust the PUCT rule. A
simple solution would be to use the maximum score that can be observed in the environment to either
rescale the value or set the PUCT constants appropriately. However, both solutions are game specific and
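require adding prior knowledge to the MuZero algorithm.
In many environments, however, the value is unbounded, so the PUCT rule has to be adjusted. A simple solution is to use the maximum score observable in the environment either to rescale the value or to set the PUCT constants appropriately. Both options are game specific, though, and require building prior knowledge into the MuZero algorithm.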
To avoid this, MuZero computes normalized Q-value estimates Q ∈ [0, 1] by using the
minimum–maximum values observed in the search tree up to that point. When a node is reached during
the selection stage, the algorithm computes the normalized Q values of its edges to be used in place of
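the Q values in the PUCT rule using the equation

\[ \bar{Q}(s^{k-1}, a^k) = \frac{Q(s^{k-1}, a^k) - \min_{s,a \in \text{Tree}} Q(s,a)}{\max_{s,a \in \text{Tree}} Q(s,a) - \min_{s,a \in \text{Tree}} Q(s,a)} \]

To avoid this, MuZero computes normalized Q-value estimates in [0, 1] from the minimum and maximum values observed in the search tree up to that point. When a node is reached during the selection stage, the algorithm computes the normalized Q values of its edges and uses them in place of the raw Q values in the PUCT rule.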
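The normalization only needs the running minimum and maximum of the values seen in the tree. A minimal sketch, with a class name of my own choosing and the assumption that values are returned unchanged until two distinct values have been observed:

```python
class MinMaxStats:
    """Tracks the smallest and largest Q value observed in the search tree so far."""

    def __init__(self) -> None:
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, value: float) -> None:
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value: float) -> float:
        # Assumption: before a range exists, leave the value as it is.
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value


stats = MinMaxStats()
for q in (-5.0, 15.0):
    stats.update(q)
print(stats.normalize(5.0))
```

Because the bounds come from the tree itself, no game-specific score range has to be supplied.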

Hyperparameters
For simplicity we preferentially use the same architectural choices and hyperparameters as in previous
work. Specifically, we started with the network architecture and search choices of AlphaZero . For board
games, we use the same PUCT constants, Dirichlet exploration noise and the same 800 simulations per
search as in AlphaZero. Owing to the much smaller branching factor and simpler policies in Atari, we used
only 50 simulations per search to speed up experiments. As shown in Fig. 3b, the algorithm is not very
sensitive to this choice. We also use the same discount (0.997) and value transformation (see ‘Network
architecture’) as R2D2. For parameter values not mentioned in the text, please refer to the pseudocode
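(see ‘Code availability’).
For simplicity we reused the architectural choices and hyperparameters of previous work wherever possible. Specifically, we started from the network architecture and search settings of AlphaZero. For board games we use the same PUCT constants, the same Dirichlet exploration noise and the same 800 simulations per search as AlphaZero. Because Atari has a much smaller branching factor and simpler policies, we used only 50 simulations per search to speed up experiments; as Fig. 3b shows, the algorithm is not very sensitive to this choice. We also use the same discount (0.997) and value transformation as R2D2. Parameter values not mentioned in the text can be found in the pseudocode.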

Data generation
To generate training data, the latest checkpoint of the network (updated every 1,000 training steps) is used to play
games with MCTS. In the board games Go, chess and shogi, the search is run for 800 simulations per move to
pick an action; in Atari, due to the much smaller action space 50 simulations per move are sufficient.
For board games, games are sent to the training job as soon as they finish. Owing to the much larger length of
Atari games (up to 30 min or 108,000 frames), intermediate sequences are sent every 200 moves. In board
games, the training job keeps an in-memory replay buffer of the most recent one million games received; in Atari,
where the visual observations are larger, the most recent 125,000 sequences of length 200 are kept.
During the generation of experience in the board game domains, the same exploration scheme as the one
described in AlphaZero is used. Using a variation of this scheme, in the Atari domain, actions are sampled from
the visit count distribution throughout the duration of each game, instead of just the first k moves. Moreover, the
visit count distribution is parametrized using a temperature parameter T:

\[ p_\alpha = \frac{N(\alpha)^{1/T}}{\sum_b N(b)^{1/T}} \]

T is decayed as a function of the number of training steps of the network. Specifically, for the first 500,000 training steps a
temperature of 1.0 is used, for the next 250,000 steps a temperature of 0.5 and for the remaining 250,000 a temperature
of 0.25. This ensures that the action selection becomes greedier as training progresses.
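The temperature schedule and the resulting action distribution can be sketched as below. The step thresholds are the ones quoted above (500,000 steps at T = 1.0, the next 250,000 at 0.5, then 0.25); the function names are my own, and the sampling helper assumes at least one action has a nonzero visit count.

```python
import random
from typing import List, Sequence


def temperature_for_step(training_step: int) -> float:
    """Schedule from the text: 1.0, then 0.5, then 0.25."""
    if training_step < 500_000:
        return 1.0
    if training_step < 750_000:
        return 0.5
    return 0.25


def visit_count_policy(visit_counts: Sequence[int], temperature: float) -> List[float]:
    """p_a = N(a)^(1/T) / sum_b N(b)^(1/T)."""
    weights = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(visit_counts: Sequence[int], temperature: float, rng=random) -> int:
    """Draw one action index from the temperature-adjusted visit distribution."""
    probs = visit_count_policy(visit_counts, temperature)
    return rng.choices(range(len(visit_counts)), weights=probs, k=1)[0]


print(visit_count_policy([10, 30], temperature_for_step(0)))
print(visit_count_policy([10, 30], temperature_for_step(600_000)))
```

Lowering T raises the counts to a higher power, so the most visited action takes a larger share and selection becomes greedier.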

Observation and action encoding - Representation function
The history over board states used as input to the representation function for Go,
chess and shogi is represented similarly to AlphaZero . In Go and shogi, we
encode the last eight board states as in AlphaZero; in chess, we increased the
history to the last 100 board states to allow correct prediction of draws. For Atari,
the input of the representation function includes the last 32 RGB frames at
resolution 96 × 96 along with the last 32 actions that led to each of those frames.
We encode the historical actions because unlike board games, an action in Atari
does not necessarily have a visible effect on the observation. RGB frames are
encoded as one plane per colour, rescaled to the range [0, 1], for red, green and
blue, respectively. We perform no other normalization, whitening or other
preprocessing of the RGB input. Historical actions are encoded as simple bias
planes, scaled as a/18 (there are 18 total actions in Atari).

Observation and action encoding-Dynamics function.
The input to the dynamics function is the hidden state produced by the
representation function or previous application of the dynamics function,
concatenated with a representation of the action for the transition. Actions are
encoded spatially in planes of the same resolution as the hidden state. In Atari,
this resolution is 6 × 6 (see description of downsampling in ‘Network architecture’),
in board games, this is the same as the board size (19 × 19 for Go, 8 × 8 for
chess, 9 × 9 for shogi). In Go, a normal action (playing a stone on the board) is
encoded as an all-zero plane, with a single one in the position of the played stone.
A pass is encoded as an all-zero plane.
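The Go action encoding described above is simple enough to write out directly. This sketch uses nested Python lists for the plane and a hypothetical `(row, col)` tuple for a move, with `None` standing for a pass; both conventions are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def encode_go_action(move: Optional[Tuple[int, int]], board_size: int = 19) -> List[List[float]]:
    """One plane at board resolution: a single 1 at the played stone, all zeros for a pass."""
    plane = [[0.0] * board_size for _ in range(board_size)]
    if move is not None:
        row, col = move
        plane[row][col] = 1.0
    return plane


stone = encode_go_action((3, 15))
passed = encode_go_action(None)
print(sum(map(sum, stone)), sum(map(sum, passed)))
```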

In chess, eight planes are used to encode the action. The first one-hot plane
encodes which position the piece was moved from. The next two planes encode
which position the piece was moved to: a one-hot plane to encode the target
position, if on the board, and a second binary plane to indicate whether the target
was valid (on the board) or not. This is necessary because for simplicity, our policy
action space enumerates a superset of all possible actions, not all of which are
legal, and we use the same action space for policy prediction and to encode the
dynamics function input. The remaining five binary planes are used to indicate the
type of promotion, if any (queen, knight, bishop, rook, none).

The encoding for shogi is similar, with a total of 11 planes. We use the first eight
planes to indicate where the piece moved from—either a board position (first
one-hot plane) or the drop of one of the seven types of prisoner (remaining seven
binary planes). The next two planes are used to encode the target as in chess.
The remaining binary plane indicates whether the move was a promotion or not.
In Atari, an action is encoded as a one-hot vector that is tiled appropriately into
planes.

Network architecture
The prediction function pk , vk = fθ(s k ) uses the same architecture as
AlphaZero: one or two convolutional layers that preserve the resolution but reduce
the number of planes, followed by a fully connected layer to the size of the output.
For value and reward prediction in Atari, we follow ref. 47 in scaling targets using
an invertible transform h(x) = sign(x)(\sqrt{|x| + 1} − 1) + εx, where ε = 0.001 in all
our experiments. We then apply a transformation ϕ to the scalar reward and value
targets to obtain equivalent categorical representations. We use a discrete support
set of size 601 with one support for every integer between −300 and 300.
Under this transformation, each scalar is represented as the linear combination of
its two adjacent supports, such that the original value can be recovered by x =
xlow × plow + xhigh × phigh. As an example, a target of 3.7 would be represented
as a weight of 0.3 on the support for 3 and a weight of 0.7 on the support for 4.
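The two steps, scaling with h and then spreading a scalar over its two neighbouring integer supports, can be sketched in plain Python. The closed-form inverse is derived here by solving the quadratic in sqrt(|x| + 1); the function names and the clamping of out-of-range values to the support are my own assumptions.

```python
import math
from typing import List

EPS = 0.001  # epsilon from the text


def h(x: float, eps: float = EPS) -> float:
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return math.copysign(1.0, x) * (math.sqrt(abs(x) + 1) - 1) + eps * x


def h_inverse(y: float, eps: float = EPS) -> float:
    """Inverse of h, obtained by solving for sqrt(|x| + 1)."""
    root = (math.sqrt(1 + 4 * eps * (abs(y) + 1 + eps)) - 1) / (2 * eps)
    return math.copysign(1.0, y) * (root * root - 1)


def to_two_hot(x: float, low: int = -300, high: int = 300) -> List[float]:
    """Represent x as weights on its two adjacent integer supports (601 bins by default)."""
    x = min(max(x, low), high)  # assumption: clamp to the support range
    lower = math.floor(x)
    p_high = x - lower
    probs = [0.0] * (high - low + 1)
    probs[lower - low] = 1.0 - p_high
    if p_high > 0:
        probs[lower - low + 1] = p_high
    return probs


def expected_value(probs: List[float], low: int = -300) -> float:
    """Recover the scalar as the expectation over the supports."""
    return sum(p * (low + i) for i, p in enumerate(probs))


probs = to_two_hot(3.7)
print(probs[303], probs[304], expected_value(probs))
```

At inference time the order is reversed: take the expected value of the softmax output, then apply `h_inverse`.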

The value and reward outputs of the network are also modelled using a softmax
output of size 601. During inference, the actual value and rewards are obtained by
first computing their expected value under their respective softmax distribution and
subsequently by inverting the scaling transformation. Scaling and transformation
of the value and reward happens transparently on the network side and is not
visible to the rest of the algorithm.
Both the representation and dynamics function use the same architecture as
AlphaZero, but with 16 instead of 20 residual blocks. We use 3 × 3 kernels and
256 hidden planes for each convolution.

For Atari, where observations have large spatial resolution, the representation
function starts with a sequence of convolutions with stride 2 to reduce the spatial
resolution. Specifically, starting with an input observation of resolution 96 × 96 and
128 planes (32 history frames of 3 colour channels each, concatenated with the
corresponding 32 actions broadcast to planes), we downsample as follows: 1
convolution with stride 2 and 128 output planes, output resolution 48 × 48; 2
residual blocks with 128 planes; 1 convolution with stride 2 and 256 output planes,
output resolution 24 × 24; 3 residual blocks with 256 planes; average pooling with
stride 2, output resolution 12 × 12; 3 residual blocks with 256 planes; average
pooling with stride 2, output resolution 6 × 6. The kernel size is 3 × 3 for all
operations.
For the dynamics function (which always operates at the downsampled resolution
of 6 × 6), the action is first encoded as an image, then stacked with the hidden
state of the previous step along the plane dimension.

Figures (MuZero network diagram and evaluation loop), source: https://github.com/werner-duvaud/muzero-general/blob/master/docs/muzero-network-werner-duvaud.png

Training.
During training, the MuZero network is unrolled for K hypothetical steps and
aligned to sequences sampled from the trajectories generated by the MCTS
actors. Sequences are selected by sampling a state from any game in the replay
buffer, then unrolling for K steps from that state. In Atari, samples are drawn
according to prioritized replay, with priority P(i) = p_i^α / Σ_k p_k^α, where p_i = |ν_i − z_i|, ν is the
search value and z the observed n-step return. To correct for sampling bias
introduced by the prioritized sampling, we scale the loss using the importance
sampling ratio w_i = (1/N · 1/P(i))^β. In all our experiments, we set α = β = 1. For board
games, states are sampled uniformly.
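As a rough sketch of the prioritized sampling described above: the priority of a position is the absolute gap between its search value ν and its observed n-step return z, and the importance weight undoes the resulting sampling bias. The function names and the small priority floor (which keeps never-surprising positions sampleable and avoids division by zero) are my own assumptions.

```python
from typing import List, Sequence


def sampling_probabilities(search_values: Sequence[float], n_step_returns: Sequence[float],
                           alpha: float = 1.0, floor: float = 1e-6) -> List[float]:
    """P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |nu_i - z_i|."""
    priorities = [max(abs(v - z), floor) ** alpha
                  for v, z in zip(search_values, n_step_returns)]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probabilities: Sequence[float], beta: float = 1.0) -> List[float]:
    """w_i = (1 / (N * P(i)))^beta, used to scale each sample's loss."""
    n = len(probabilities)
    return [(1.0 / (n * p)) ** beta for p in probabilities]


probs = sampling_probabilities([1.0, 2.0], [0.0, 0.0])
weights = importance_weights(probs)
print(probs, weights)
```

With α = β = 1, a position sampled twice as often as uniform has its loss halved, so the expected gradient matches uniform sampling.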

Each observation ot along the sequence also has a corresponding search policy
πt, search value function νt and environment reward ut. At each unrolled step k,
the network has a loss to the policy, value and reward target for that step, summed
to produce the total loss for the MuZero network (see equation (1)). Note that, in
board games without intermediate rewards, we omit the reward prediction loss.
For board games, we bootstrap directly to the end of the game, equivalent to
predicting the final outcome; for Atari we bootstrap for n = 10 steps into the future.
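The n-step value target for Atari can be sketched as follows. The indexing convention is an assumption for illustration: `rewards[i]` is the reward received after acting at step i, and `search_values[i]` is the search value ν at step i; past the end of the episode nothing is bootstrapped.

```python
from typing import Sequence


def n_step_value_target(rewards: Sequence[float], search_values: Sequence[float],
                        t: int, n: int = 10, discount: float = 0.997) -> float:
    """z_t = sum_{i<n} gamma^i * u_{t+1+i} + gamma^n * nu_{t+n}."""
    horizon = min(n, len(rewards) - t)
    target = sum(discount ** i * rewards[t + i] for i in range(horizon))
    bootstrap_index = t + n
    if bootstrap_index < len(search_values):
        # Bootstrap from the search value n steps ahead.
        target += discount ** n * search_values[bootstrap_index]
    return target


rewards = [1.0, 2.0, 3.0]
search_values = [0.0, 0.0, 10.0, 0.0]
print(n_step_value_target(rewards, search_values, t=0, n=2, discount=0.5))
```

For board games the text instead bootstraps straight to the end of the game, which makes the target the final outcome.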

To maintain roughly similar magnitude of gradient across different unroll steps, we
scale the gradient in two separate locations. (1) We scale the loss of each head by
1/K, where K is the number of unroll steps. This ensures that the total gradient has
similar magnitude irrespective of how many steps we unroll for. (2) We also scale
the gradient at the start of the dynamics function by 1/2. This ensures that the total
gradient applied to the dynamics function stays constant. In the experiments
reported in this paper, we always unroll for K = 5 steps. For a detailed illustration,
see Fig. 1. To improve the learning process and bound the activations, we also
scale the hidden state to the same range as the action input ([0, 1]):
\[ s_{\text{scaled}} = \frac{s - \min(s)}{\max(s) - \min(s)} \]

All experiments were run using third-generation Google Cloud tensor processing
units (TPUs). For each board game, we used 16 TPUs for training and 1,000
TPUs for self-play. For each game in Atari, in the 20 billion frame setting we used
8 TPUs for training and 32 TPUs for self-play. In the smaller 200 million frame
setting, we used only four TPUs for training and two TPUs for self-play, equivalent
to two weeks of training on 1 GPU. The much smaller proportion of TPUs used for
acting in Atari is due to the smaller number of simulations per move (50 instead of
800) and the smaller size of the dynamics function compared with the
representation function.

Note that the network is trained separately for each environment (that is, one
model for each different Atari game or board game). However, in principle, the
same model could be shared between different environments during training, or
could be tested in new environments (that is, zero-shot generalization); this
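approach is left to future work.
(nomoreid: a separate network is trained for each environment; sharing one network across environments, for example for zero-shot generalization, is left as future work.)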

MuZero Reanalyze
To improve the sample efficiency of MuZero, we introduced a second variant of the algorithm,
MuZero Reanalyze. MuZero Reanalyze revisits its past time steps and re-executes its search
using the latest model parameters, potentially resulting in a better-quality policy than the original
search. This fresh policy is used as the policy target for 80% of updates during MuZero training.
Furthermore, a target network, v^− = f_{θ^−}(s^0), based on recent parameters θ^−, is used to provide
a fresher, stable n-step bootstrapped target for the value function, z_t = u_{t+1} + γ u_{t+2} + ... + γ^{n−1} u_{t+n} + γ^n v^−_{t+n}.
In addition, several other hyperparameters were adjusted— primarily to increase sample reuse
and avoid overfitting of the value function. Specifically, 2.0 samples were drawn per state,
instead of 0.1; the value target was weighted down to 0.25 compared with weights of 1.0 for
policy and reward targets; and the n-step return was reduced to n = 5 steps instead of n = 10
steps.
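To improve sample efficiency, the authors introduce a second variant of the algorithm, MuZero Reanalyze. It revisits past time steps and re-runs the search on them with the latest model parameters, which can yield a better policy than the original search. This fresh policy is used as the policy target for 80% of the updates during MuZero training (the remaining 20% presumably keep the policy from the original search). In addition, a target network v^− based on recent parameters θ^− provides a fresher, stable n-step bootstrapped target for the value function.
Several other hyperparameters were also adjusted, mainly to increase sample reuse and to avoid overfitting of the value function. Specifically, 2.0 samples were drawn per state instead of 0.1 (nomoreid: unclear; presumably the average number of times each stored state is sampled for training); the value target was given a weight of 0.25, against 1.0 for the policy and reward targets (presumably weights in the loss); and the n-step return used n = 5 instead of n = 10.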

Source: http://www.furidamu.org/blog/2020/12/22/muzero-intuition/
● a reanalyse buffer that receives all trajectories generated by the actors and keeps the most recent ones.
● multiple reanalyse actors that sample stored trajectories from the reanalyse buffer, re-run MCTS using the latest network checkpoints from the learner
and send the resulting trajectories with updated search statistics to the learner.
For the learner, "fresh" and reanalysed trajectories are indistinguishable; this makes it very simple to vary the proportion of fresh vs reanalysed trajectories.
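
(nomoreid: in the setting with 1/100 of the data, MuZero Reanalyze run for the same duration reaches about 43% of the mean score and about 35.8% of the median score. Results for an otherwise identical setting do not appear to be published.)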

Evaluation
We evaluated the relative strength of MuZero (Fig. 2) in board games by measuring the Elo rating of each
player. We estimate the probability that player a will defeat player b by a logistic function
\[ p(a \text{ defeats } b) = \left(1 + 10^{\,c_{\mathrm{elo}} (e(b) - e(a))}\right)^{-1} \]
and estimate the ratings e(⋅) by Bayesian logistic regression, computed
by the BayesElo program (ref. 50) using the standard constant c_elo = 1/400.
Elo ratings were computed from the results of an 800-simulations-per-move tournament between iterations
of MuZero during training, and also a baseline player: either Stockfish, Elmo or AlphaZero, respectively.
Baseline players used an equivalent search time of 100 ms per move. The Elo rating of the baseline
players was anchored to publicly available values (ref. 5). In Atari, we computed mean reward over 1,000
episodes per game, limited to the standard 30 min or 108,000 frames per episode, using 50 simulations
per move unless indicated otherwise. To mitigate the effects of the deterministic nature of the Atari
simulator, we employed two different evaluation strategies: 30 noop random starts and human starts. For
the former, at the beginning of each episode, a random number of between 0 and 30 noop actions are
applied to the simulator before handing control to the agent. For the latter, start positions are sampled from
human expert play to initialize the Atari simulator before handing the control to the agent.
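The Elo win-probability formula used in the board-game evaluation is easy to check numerically. A minimal sketch with the standard constant c_elo = 1/400 (the function name is my own):

```python
def win_probability(elo_a: float, elo_b: float, c_elo: float = 1 / 400) -> float:
    """p(a defeats b) = 1 / (1 + 10^(c_elo * (e(b) - e(a))))."""
    return 1.0 / (1.0 + 10 ** (c_elo * (elo_b - elo_a)))


print(win_probability(1400, 1000))
print(win_probability(1000, 1000))
```

A 400-point rating advantage therefore corresponds to winning about ten games in eleven.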

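Reviewer's conclusion
- MuZero can be viewed as a virtuous cycle of learning.
- The environment model is learned from the trajectories observed during play.
- Searching with the environment model yields better policy/value functions.
- The better policy/value functions produce better trajectories, which in turn train the environment model.
- Whether MuZero works well depends on how well the environment suits AlphaGo's basic idea of using search results as the training targets for policy and value.
- If taking the same action always leads to the same next state, MuZero is likely to work well.
- In other words, the more deterministic the interaction between environment and action, the better MuZero works.
- Conversely, in a simple environment that needs no search, MuZero is wasteful.
- Each action requires dozens of additional inferences.
- Training likewise requires dozens to hundreds of additional inferences.
- MuZero has the advantage when simulation is expensive.
- It wins whenever running the simulator costs more than the dozens of inferences needed per move.
- With the search depth (k) and the number of simulations tuned to the environment, training can cost less than running the simulator (MuZero Reanalyze).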
