Yuandong Tian at AI Frontiers : Planning in Reinforcement Learning

Planning in Reinforcement Learning
Yuandong Tian
Research Scientist
Facebook AI Research
AI works in a lot of situations
Medical, Translation, Personalization, Surveillance, Object Recognition, Smart Design, Speech Recognition, Board Games
What AI still needs to improve
• Very little supervised data
• Complicated environments
• Lots of corner cases
• Exponential space to explore
Examples: Home Robotics, Autonomous Driving, ChatBot, StarCraft, Question Answering, Common Sense
What AI still needs to improve
A plot of performance vs. effort shows a scary trend of slowing down: "No way, it doesn't work" / Initial Enthusiasm ("It really works! All in AI!") / "Man, we need more data" / trying all possible hacks / "How can that be…" (despair), with performance plateauing below human level.
We need novel algorithms
Reinforcement Learning
The agent sends an Action to the environment; the environment returns the next State and a Reward.
[R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction]
Atari Games, Go, DoTA 2, Doom, Quake 3, StarCraft
Why is planning important?
Just one example
AlphaGo Zero
A closed loop: the current model generates training data through self-play; the self-play games are then used to update the model. Zero human knowledge required.
[Silver et al, Mastering the game of Go without human knowledge, Nature 2017]
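The loop above can be sketched in a few lines. This is a minimal sketch with stub functions standing in for MCTS-driven self-play and network training; the scalar "model" and the averaging update are illustrative assumptions, not AlphaGo Zero's actual components.

```python
import random

def self_play(model, num_games):
    # Stub: each "game" yields one (state, value_target) pair; a real system
    # would also record MCTS visit counts as policy targets.
    return [(random.random(), random.choice([-1.0, 1.0])) for _ in range(num_games)]

def update(model, data):
    # Stub "training": nudge the scalar model toward the mean value target.
    target = sum(z for _, z in data) / len(data)
    return 0.9 * model + 0.1 * target

model = 0.0
for iteration in range(3):          # self-play -> training data -> updated model
    data = self_play(model, num_games=16)
    model = update(model, data)
```

The point is the shape of the loop: the model is both the producer of training data and the object being trained.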
AlphaGo Zero Strength
• 3-day version
• 4.9M games, 1600 rollouts/move
• 20-block ResNet
• Defeats AlphaGo Lee
• 40-day version
• 29M games, 1600 rollouts/move
• 40-block ResNet
• Defeats AlphaGo Master by 89:11
ELF: Extensive, Lightweight and Flexible Framework for Game Research
Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, Larry Zitnick
https://github.com/facebookresearch/ELF
[Y. Tian et al, ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games, NIPS 2017]
ELF: A simple for-loop. The environment runs in C++ and the agent in Python; they exchange State/Reward and Action.
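The "simple for-loop" is just the standard agent-environment interaction. Below is a minimal sketch; `ToyEnv` is a made-up stand-in for the C++ game side, not ELF's actual API.

```python
# Minimal agent-environment for-loop: reset, then act/observe until done.
class ToyEnv:
    def reset(self):
        self.t = 0
        return self.t                       # initial state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == self.t % 2 else 0.0
        done = self.t >= 5
        return self.t, reward, done         # next state, reward, done flag

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = (state + 1) % 2                # trivial policy matched to the env
    state, reward, done = env.step(action)
    total_reward += reward
```

In ELF the loop body stays this simple on the Python side; the heavy lifting (game simulation, batching) happens underneath in C++.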
How ELF works
Multiple game threads (C++, e.g. threads 0-7) run games concurrently; their states are grouped into batches, and each batch is processed on the Python side.
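The batching idea can be sketched with stdlib threads and a queue (a simplification of ELF's C++ machinery, with made-up sizes): many game threads push states into a shared queue, and a single consumer pops fixed-size batches so the model runs once per batch instead of once per game.

```python
import queue
import threading

BATCH_SIZE = 4
states = queue.Queue()

def game_thread(game_id, num_steps):
    for t in range(num_steps):
        states.put((game_id, t))            # one "state" per game step

# 8 game threads, 8 steps each -> 64 states total
threads = [threading.Thread(target=game_thread, args=(g, 8)) for g in range(8)]
for th in threads:
    th.start()
for th in threads:
    th.join()

batches = []
while states.qsize() >= BATCH_SIZE:         # group states into model batches
    batches.append([states.get() for _ in range(BATCH_SIZE)])
```

Batching amortizes model inference cost across games, which is what makes the simple Python loop fast enough for RL training.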
Distributed ELF
A server runs training while many clients run evaluation/self-play: the server sends requests (game params) to the clients and receives experiences back. This puts AlphaGoZero (more synchronization) and AlphaZero (less synchronization) into the same framework.
ELF OpenGo
• The system can be trained with 2000 GPUs in 2 weeks.
• Decent performance against professional players and strong bots.
• Abundant ablation analysis.
• Decoupled design; code highly reusable for other games.
We open source the code and the pre-trained model for the Go and ML community
https://github.com/pytorch/ELF
Simple tutorial in experimental branch (tutorial, tutorial_distri)
ELF OpenGo Performance
Vs top professional players (single GPU, 80k rollouts, 50 seconds/move; players offered unlimited thinking time):
Name (ELO, world rank): result
Kim Ji-seok (3590, #3): 5-0
Shin Jin-seo (3570, #5): 5-0
Park Yeonghun (3481, #23): 5-0
Choi Cheolhan (3466, #30): 5-0
Overall: 20-0
Vs strong bot (LeelaZero [158603eb, 192x15, Apr. 25, 2018]): 980 wins, 18 losses (98.2%)
Vs professional players: single GPU, 2k rollouts, 27-0 against Taiwanese pros.
Planning is how new knowledge is created
Knowledge propagates backward from the game end, where the reward signal is, toward the game start: moves near the end become meaningful first while the opening stays close to random. The bot is already dan level even when its opening doesn't make much sense.
Win rate against a bot without planning grows with the number of planning rollouts (from 50% toward 100%).
Training is almost always constrained by model capacity (which is why 40 blocks > 20 blocks).
Planning is how new knowledge is created
Planning methods trade off lookahead depth: tree search looks T steps forward; one-step lookahead and Monte Carlo sampling are cheaper alternatives. Temporal Difference (TD) in Reinforcement Learning: learning a neural network that directly predicts the optimal value/policy.
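As a concrete reference point, here is TD(0) on a tiny 3-state chain, assuming the standard update V(s) ← V(s) + α(r + γV(s′) − V(s)); the chain and constants are made up for illustration.

```python
# TD(0) value update: bootstrap the target from the next state's estimate.
gamma, alpha = 0.9, 0.5
V = {0: 0.0, 1: 0.0, 2: 0.0}                 # state 2 is terminal (V stays 0)
episode = [(0, 0.0, 1), (1, 1.0, 2)]         # (s, r, s') transitions

for s, r, s_next in episode:
    td_error = r + gamma * V[s_next] - V[s]  # bootstrapped target minus estimate
    V[s] += alpha * td_error
```

Note that after one episode only the state adjacent to the reward has moved; repeated episodes propagate value backward, mirroring the slide's point about knowledge spreading from where the reward signal is.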
Planning is how game AI is created
Starting from the current game situation (example: Lufei Ruan vs. Yifan Hou, 2010), extensive search/planning expands candidate move sequences and evaluates their consequences (Black wins or White wins at each leaf). Classic tools: AlphaBeta Pruning with Iterative Deepening, and Monte Carlo Tree Search.
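A toy version of the first of these techniques: alpha-beta pruning on a hand-built depth-2 game tree, where leaves are scores from the maximizing player's view (the tree and values are illustrative, not from the talk).

```python
# Alpha-beta pruning: minimax search that skips branches which cannot
# change the final decision.
def alphabeta(node, alpha, beta, maximizing):
    if isinstance(node, (int, float)):       # leaf: static evaluation
        return node
    if maximizing:
        v = float("-inf")
        for child in node:
            v = max(v, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, v)
            if alpha >= beta:                # prune remaining siblings
                break
        return v
    v = float("inf")
    for child in node:
        v = min(v, alphabeta(child, alpha, beta, True))
        beta = min(beta, v)
        if alpha >= beta:
            break
    return v

tree = [[3, 5], [2, 9], [0, 7]]              # max over mins: 3, 2, 0
best = alphabeta(tree, float("-inf"), float("inf"), True)
```

Iterative deepening wraps a search like this in a loop over increasing depth limits so the engine always has a best-so-far move.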
How to plan without a known model?
If you don’t have a ground-truth dynamics model …
• Only (human/expert) trajectories, no world model.
• Limited access to world models:
• Cannot restart the model; cannot query arbitrary (s, a) pairs.
• Noisy signals from the world model.
• …
If you have a ground-truth dynamics model …
• Unlimited access to the exact world model.
• May query any (s, a) pair.
• …
Without one, build one: a model that predicts the next state from the current state and the action to take. Example task: navigation to a target.
How to plan the trajectory? (Yi Wu, Georgia Gkioxari, Yuxin Wu)
Build a semantic model
Task: find “oven”. The agent maintains an incomplete model of the environment: a semantic graph over rooms (outdoor, living room, dining room, kitchen) and objects (sofa, car, chair, oven).
[Y. Wu et al, Learning and Planning with a Semantic Model, submitted to ICLR 2019]
Build a semantic model
Bayesian inference computes the posterior 𝑃(𝑧|𝑌) over latent room/object connectivity 𝑧 (outdoor, living room, dining room, kitchen, car, chair, sofa, oven) given the learning experience 𝑌. Edge beliefs (e.g., 0.5 to 0.95) are updated from experience, and the agent chooses the next step, here “kitchen”.
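A single-edge version of this update can be written as a Bernoulli posterior via Bayes' rule. The prior and likelihoods below are made-up numbers for illustration, not the paper's learned values.

```python
# Belief that "kitchen connects to living room" (z = 1), updated after one
# noisy observation Y via P(z|Y) = P(Y|z) P(z) / P(Y).
prior = 0.5                     # P(z = 1) before any experience
p_obs_given_z1 = 0.8            # P(Y = seen | z = 1), assumed sensor model
p_obs_given_z0 = 0.1            # P(Y = seen | z = 0), assumed false-positive rate

evidence = prior * p_obs_given_z1 + (1 - prior) * p_obs_given_z0   # P(Y = seen)
posterior = prior * p_obs_given_z1 / evidence                      # P(z = 1 | Y)
```

One supporting observation pushes the belief from 0.5 to about 0.89; repeated observations drive edges toward the near-certain values shown on the slide.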
LEAPS
LEArning and Planning with a Semantic model
The semantic graph connects the living room to the kitchen, dining room, chair, and sofa, each edge carrying a belief 𝑃(𝑧 node,living room). After an observation 𝑌, an edge can become certain, e.g. 𝑃(𝑧 sofa,living room | 𝑌) = 1, while the remaining edges keep their priors.
Planning the trajectory, and more exploration.
House3D
SUNCG dataset, 45K scenes, all objects are fully labeled.
https://github.com/facebookresearch/House3D
Observations: RGB image, depth, segmentation mask, top-down map.
Learning the Prior between Different Rooms
Test Performance on ConceptNav
Case Study: go to “outdoor”
• Prior 𝑃(𝑧): from the birth location, edges toward Outdoor, Living room, and Garage start at beliefs such as 0.76, 0.38, and 0.12 (with 0.73 and 0.28 on other edges). Sub-goal: Outdoor. Failed; the posterior 𝑃(𝑧|𝐸) drops the tried edge to 0.01.
• Sub-goal: Garage. Failed; its edge belief drops to 0.08 in the posterior.
• Sub-goal: Living Room. Success; its edge belief rises to 0.99.
• Sub-goal: Outdoor, now reached via the living room. Success, with the connecting edges at 0.99.
Improving Existing Plans
Iterative LQR: what if the dynamics are nonlinear? Linearize around the current trajectory, solve the resulting LQR problem to obtain a new policy, then sample according to the new policy and repeat.
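The inner LQR solve that gets repeated can be sketched in the scalar case. The dynamics x′ = a·x + b·u and cost Σ(q·x² + r·u²) below are toy assumptions, not the talk's setup; iterating the discrete Riccati equation yields the gain K of the linear policy u = −K·x.

```python
# Scalar discrete-time Riccati iteration for LQR.
a, b, q, r = 1.0, 1.0, 1.0, 1.0   # assumed toy dynamics and cost weights
P = q                              # initialize cost-to-go curvature
for _ in range(100):               # fixed-point iteration converges quickly here
    K = (b * P * a) / (r + b * P * b)
    P = q + a * P * (a - b * K)
K = (b * P * a) / (r + b * P * b)  # optimal gain: u = -K * x
```

With these numbers P converges to the golden ratio (≈1.618) and K to ≈0.618. Iterative LQR wraps this solve in an outer loop: re-linearize the nonlinear dynamics around the trajectory generated by the new policy, and solve again.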
Non-differentiable Plans
• Directly predicting combinatorial solutions, e.g. the convex hull of a point set, with a seq2seq model.
[O. Vinyals et al., Pointer Networks, NIPS 2015]
• Scheduling jobs to slots with policy gradient.
[H. Mao et al., Resource Management with Deep Reinforcement Learning, ACM Workshop on Hot Topics in Networks, 2016]
Neural network rewriter: given the current solution 𝑠ₜ, sample a region 𝑔ₜ ∼ SP(𝑔ₜ), 𝑔ₜ ⊂ 𝑠ₜ (score predictor), sample a rewriting rule 𝑎ₜ ∼ RS(𝑔ₜ) (rule selector), and apply it: 𝑠ₜ₊₁ = 𝑓(𝑠ₜ, 𝑔ₜ, 𝑎ₜ). Pipeline: Input Encoder, Score Predictor, Rule Selector.
[X. Chen and Y. Tian, Learning to Progressively Plan, submitted to ICLR 2019]
Is it simpler to improve the solution progressively?
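The rewriting loop itself is simple; here is a sketch where the "solution" is a list of numbers and the pickers are deterministic stubs standing in for the learned score predictor (SP) and rule selector (RS).

```python
# Progressive rewriting: pick a region g_t of s_t, pick a rule a_t, apply it.
def pick_region(s):                 # stand-in for the score predictor SP
    return max(range(len(s)), key=lambda i: s[i])   # highest-cost position

def pick_rule(region):              # stand-in for the rule selector RS
    return "decrement"              # single made-up rewrite rule

def apply_rule(s, i, rule):         # s_{t+1} = f(s_t, g_t, a_t)
    s = list(s)
    if rule == "decrement" and s[i] > 0:
        s[i] -= 1
    return s

s = [3, 1, 2]                       # initial solution
for _ in range(4):                  # each step locally improves the solution
    i = pick_region(s)
    s = apply_rule(s, i, pick_rule(i))
```

In the paper's setting the state is a schedule or expression tree, the rules are domain rewrites, and both pickers are trained with RL on the cost reduction.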
Job Scheduling
Scheduling example: three jobs with (duration, size) of (T=2, S=1), (T=3, S=2), and (T=1, S=3) placed over time on Resource 1 and Resource 2; the jobs are also represented as a graph. Poor placements cause jobs to slow down, which rewriting can then fix.
Job Scheduling
On job scheduling the pipeline is: a DAG-LSTM embedding encodes the job graph (Input Encoder); FC layers score candidate regions to rewrite, e.g. g2: 0.1, g4: 0.7, g5: 0.2 (Score Predictor); and a softmax picks the rewriting action 𝑎ₜ (Rule Selector), turning 𝑠ₜ into 𝑠ₜ₊₁.
Job Scheduling
#Resources 2 5 10 20
Shortest Job First 4.80 5.83 5.58 5.00
Shortest First Search 4.25 5.05 5.54 4.98
DeepRM 2.81 6.52 9.20 10.18
Neural Rewriter (Ours) 2.80 3.36 4.50 4.63
Optim (LB) 2.57 3.02 4.08 4.26
Earliest Job First (UB) 11.11 13.62 22.13 24.23
Expression Simplification
The same pipeline on expressions: a Tree-LSTM embedding encodes the expression tree (Input Encoder); FC layers score candidate sub-trees, e.g. a1: 0, …, a5: 1 (constant reduction), …, a19: 0 (Score Predictor); and a softmax picks the rewrite 𝑎ₜ (Rule Selector). Example step: v0 ≤ min(v2, v1 − v1) rewrites to v0 ≤ min(v2, 0).
Expression Simplification
Method: Expr Len Reduction / Tree Size Reduction
Halide Rule-based Rewriter: 36.13 / 9.68
Heuristic Search: 43.27 / 12.09
Neural Rewriter (Ours): 46.98 / 13.53
Example
5 ≤ max(𝑣, 3) + 3
→ 5 ≤ max(𝑣 + 3, 3 + 3)
→ (5 ≤ 𝑣 + 3) OR (5 ≤ 3 + 3)
→ (5 ≤ 𝑣 + 3) OR True
→ True
Future Directions
• Multi-Agent
• Hierarchical RL (e.g., a command hierarchy: Admiral (General), Captain, Lieutenant)
• RL Systems
• Model-based RL (Policy Optimization + Model Estimation)
• RL applications
• RL for Optimization
How to do well in Reinforcement Learning?
• Experience on (distributed) systems
• Strong math skills
• Strong coding skills
• Parameter tuning skills
Thanks!
1 of 44

Recommended

Long Lin at AI Frontiers : AI in Gaming by
Long Lin at AI Frontiers : AI in GamingLong Lin at AI Frontiers : AI in Gaming
Long Lin at AI Frontiers : AI in GamingAI Frontiers
927 views22 slides
Anima Anandkumar at AI Frontiers : Modern ML : Deep, distributed, Multi-dimen... by
Anima Anandkumar at AI Frontiers : Modern ML : Deep, distributed, Multi-dimen...Anima Anandkumar at AI Frontiers : Modern ML : Deep, distributed, Multi-dimen...
Anima Anandkumar at AI Frontiers : Modern ML : Deep, distributed, Multi-dimen...AI Frontiers
1K views24 slides
Alpha go 16110226_김영우 by
Alpha go 16110226_김영우Alpha go 16110226_김영우
Alpha go 16110226_김영우영우 김
892 views17 slides
Agile Deep Learning by
Agile Deep LearningAgile Deep Learning
Agile Deep LearningDavid Murgatroyd
1.6K views56 slides
Deep learning with tensorflow by
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflowCharmi Chokshi
1.5K views70 slides
Big data app meetup 2016-06-15 by
Big data app meetup 2016-06-15Big data app meetup 2016-06-15
Big data app meetup 2016-06-15Illia Polosukhin
1.7K views31 slides

More Related Content

What's hot

TensorFlow 101 by
TensorFlow 101TensorFlow 101
TensorFlow 101Raghu Rajah
366 views15 slides
Practical Deep Learning by
Practical Deep LearningPractical Deep Learning
Practical Deep LearningAndré Karpištšenko
2.5K views39 slides
iT Cafe - OT & The first approach to Deep Learning by
iT Cafe - OT & The first approach to Deep LearningiT Cafe - OT & The first approach to Deep Learning
iT Cafe - OT & The first approach to Deep LearningDongmin Kim
157 views29 slides
Ted Willke, Intel Labs MLconf 2013 by
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
7.6K views35 slides
Convolutional neural network in practice by
Convolutional neural network in practiceConvolutional neural network in practice
Convolutional neural network in practice남주 김
10K views162 slides
Analyze this by
Analyze thisAnalyze this
Analyze thisAjay Ohri
962 views18 slides

What's hot(11)

iT Cafe - OT & The first approach to Deep Learning by Dongmin Kim
iT Cafe - OT & The first approach to Deep LearningiT Cafe - OT & The first approach to Deep Learning
iT Cafe - OT & The first approach to Deep Learning
Dongmin Kim157 views
Ted Willke, Intel Labs MLconf 2013 by MLconf
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
MLconf7.6K views
Convolutional neural network in practice by 남주 김
Convolutional neural network in practiceConvolutional neural network in practice
Convolutional neural network in practice
남주 김10K views
Analyze this by Ajay Ohri
Analyze thisAnalyze this
Analyze this
Ajay Ohri962 views
Hadoop and Machine Learning by joshwills
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
joshwills53.1K views
Scott Clark, Software Engineer, Yelp at MLconf SF by MLconf
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SF
MLconf4.3K views
Deep Learning Review by 明信 蘇
Deep Learning ReviewDeep Learning Review
Deep Learning Review
明信 蘇895 views
Practical deep learning for computer vision by Eran Shlomo
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
Eran Shlomo849 views

Similar to Yuandong Tian at AI Frontiers : Planning in Reinforcement Learning

Atari Game State Representation using Convolutional Neural Networks by
Atari Game State Representation using Convolutional Neural NetworksAtari Game State Representation using Convolutional Neural Networks
Atari Game State Representation using Convolutional Neural Networksjohnstamford
3K views39 slides
One Person, One Model, One World: Learning Continual User Representation wi... by
One Person, One Model, One World:  Learning Continual User Representation  wi...One Person, One Model, One World:  Learning Continual User Representation  wi...
One Person, One Model, One World: Learning Continual User Representation wi...westlakereplab
3 views27 slides
Deep learning to the rescue - solving long standing problems of recommender ... by
Deep learning to the rescue - solving long standing problems of recommender ...Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...Balázs Hidasi
13.6K views17 slides
AI_Module_1_Lecture_1.pptx by
AI_Module_1_Lecture_1.pptxAI_Module_1_Lecture_1.pptx
AI_Module_1_Lecture_1.pptxadityab33
2 views31 slides
Open ai openpower by
Open ai openpowerOpen ai openpower
Open ai openpowerGanesan Narayanasamy
5.9K views29 slides
Applying AI in Games (GDC2019) by
Applying AI in Games (GDC2019)Applying AI in Games (GDC2019)
Applying AI in Games (GDC2019)Jun Okumura
2.4K views44 slides

Similar to Yuandong Tian at AI Frontiers : Planning in Reinforcement Learning(20)

Atari Game State Representation using Convolutional Neural Networks by johnstamford
Atari Game State Representation using Convolutional Neural NetworksAtari Game State Representation using Convolutional Neural Networks
Atari Game State Representation using Convolutional Neural Networks
johnstamford3K views
One Person, One Model, One World: Learning Continual User Representation wi... by westlakereplab
One Person, One Model, One World:  Learning Continual User Representation  wi...One Person, One Model, One World:  Learning Continual User Representation  wi...
One Person, One Model, One World: Learning Continual User Representation wi...
westlakereplab3 views
Deep learning to the rescue - solving long standing problems of recommender ... by Balázs Hidasi
Deep learning to the rescue - solving long standing problems of recommender ...Deep learning to the rescue - solving long standing problems of recommender ...
Deep learning to the rescue - solving long standing problems of recommender ...
Balázs Hidasi13.6K views
AI_Module_1_Lecture_1.pptx by adityab33
AI_Module_1_Lecture_1.pptxAI_Module_1_Lecture_1.pptx
AI_Module_1_Lecture_1.pptx
adityab332 views
Applying AI in Games (GDC2019) by Jun Okumura
Applying AI in Games (GDC2019)Applying AI in Games (GDC2019)
Applying AI in Games (GDC2019)
Jun Okumura2.4K views
Machine Learning & AI - 2022 intro for pre-college students.pdf by Ed Fernandez
Machine Learning & AI - 2022 intro for pre-college students.pdfMachine Learning & AI - 2022 intro for pre-college students.pdf
Machine Learning & AI - 2022 intro for pre-college students.pdf
Ed Fernandez220 views
Dealing with Estimation, Uncertainty, Risk, and Commitment by TechWell
Dealing with Estimation, Uncertainty, Risk, and CommitmentDealing with Estimation, Uncertainty, Risk, and Commitment
Dealing with Estimation, Uncertainty, Risk, and Commitment
TechWell718 views
Ropossum: A Game That Generates Itself by Mohammad Shaker
Ropossum: A Game That Generates ItselfRopossum: A Game That Generates Itself
Ropossum: A Game That Generates Itself
Mohammad Shaker2.9K views
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014 by Austin Ogilvie
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Austin Ogilvie1.7K views
Worker Productivity 20230628 v1.pptx by ISSIP
Worker Productivity 20230628 v1.pptxWorker Productivity 20230628 v1.pptx
Worker Productivity 20230628 v1.pptx
ISSIP28 views
Hacking Predictive Modeling - RoadSec 2018 by HJ van Veen
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen1.2K views
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam... by Codemotion
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Codemotion263 views
2020 04 04 NetCoreConf - Machine Learning.Net by Bruno Capuano
2020 04 04 NetCoreConf - Machine Learning.Net2020 04 04 NetCoreConf - Machine Learning.Net
2020 04 04 NetCoreConf - Machine Learning.Net
Bruno Capuano65 views
HPAI Class 2 - human aspects and computing systems in ai - 012920 by melendez321
HPAI  Class 2 - human aspects and computing systems in ai - 012920HPAI  Class 2 - human aspects and computing systems in ai - 012920
HPAI Class 2 - human aspects and computing systems in ai - 012920
melendez3212K views
MediaEval 2016 - COSMIR and the OpenMIC Challenge: A Plan for Sustainable Mus... by multimediaeval
MediaEval 2016 - COSMIR and the OpenMIC Challenge: A Plan for Sustainable Mus...MediaEval 2016 - COSMIR and the OpenMIC Challenge: A Plan for Sustainable Mus...
MediaEval 2016 - COSMIR and the OpenMIC Challenge: A Plan for Sustainable Mus...
multimediaeval158 views
Breaking Through The Challenges of Scalable Deep Learning for Video Analytics by Jason Anderson
Breaking Through The Challenges of Scalable Deep Learning for Video AnalyticsBreaking Through The Challenges of Scalable Deep Learning for Video Analytics
Breaking Through The Challenges of Scalable Deep Learning for Video Analytics
Jason Anderson389 views
Deep Learning: a birds eye view by Roelof Pieters
Deep Learning: a birds eye viewDeep Learning: a birds eye view
Deep Learning: a birds eye view
Roelof Pieters8.3K views

More from AI Frontiers

Divya Jain at AI Frontiers : Video Summarization by
Divya Jain at AI Frontiers : Video SummarizationDivya Jain at AI Frontiers : Video Summarization
Divya Jain at AI Frontiers : Video SummarizationAI Frontiers
1.9K views16 slides
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI by
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI AI Frontiers
843 views45 slides
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi... by
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi...Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi...
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi...AI Frontiers
571 views18 slides
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un... by
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...AI Frontiers
614 views105 slides
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen... by
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen...Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen...
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen...AI Frontiers
322 views62 slides
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural Networks by
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural NetworksTraining at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural Networks
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural NetworksAI Frontiers
508 views176 slides

More from AI Frontiers(20)

Divya Jain at AI Frontiers : Video Summarization by AI Frontiers
Divya Jain at AI Frontiers : Video SummarizationDivya Jain at AI Frontiers : Video Summarization
Divya Jain at AI Frontiers : Video Summarization
AI Frontiers1.9K views
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI by AI Frontiers
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
AI Frontiers843 views
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi... by AI Frontiers
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi...Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi...
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 1: Heuristi...
AI Frontiers571 views
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un... by AI Frontiers
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...
Training at AI Frontiers 2018 - Ni Lao: Weakly Supervised Natural Language Un...
AI Frontiers614 views
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen... by AI Frontiers
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen...Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen...
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-lecture 2: Incremen...
AI Frontiers322 views
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural Networks by AI Frontiers
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural NetworksTraining at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural Networks
Training at AI Frontiers 2018 - Udacity: Enhancing NLP with Deep Neural Networks
AI Frontiers508 views
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 3: Any-Angl... by AI Frontiers
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 3: Any-Angl...Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 3: Any-Angl...
Training at AI Frontiers 2018 - LaiOffer Self-Driving-Car-Lecture 3: Any-Angl...
AI Frontiers334 views
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ... by AI Frontiers
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
AI Frontiers965 views
Percy Liang at AI Frontiers : Pushing the Limits of Machine Learning by AI Frontiers
Percy Liang at AI Frontiers : Pushing the Limits of Machine LearningPercy Liang at AI Frontiers : Pushing the Limits of Machine Learning
Percy Liang at AI Frontiers : Pushing the Limits of Machine Learning
AI Frontiers813 views
Ilya Sutskever at AI Frontiers : Progress towards the OpenAI mission by AI Frontiers
Ilya Sutskever at AI Frontiers : Progress towards the OpenAI missionIlya Sutskever at AI Frontiers : Progress towards the OpenAI mission
Ilya Sutskever at AI Frontiers : Progress towards the OpenAI mission
AI Frontiers2K views
Mark Moore at AI Frontiers : Uber Elevate by AI Frontiers
Mark Moore at AI Frontiers : Uber ElevateMark Moore at AI Frontiers : Uber Elevate
Mark Moore at AI Frontiers : Uber Elevate
AI Frontiers1.5K views
Mario Munich at AI Frontiers : Consumer robotics: embedding affordable AI in ... by AI Frontiers
Mario Munich at AI Frontiers : Consumer robotics: embedding affordable AI in ...Mario Munich at AI Frontiers : Consumer robotics: embedding affordable AI in ...
Mario Munich at AI Frontiers : Consumer robotics: embedding affordable AI in ...
AI Frontiers768 views
Arnaud Thiercelin at AI Frontiers : AI in the Sky by AI Frontiers
Arnaud Thiercelin at AI Frontiers : AI in the SkyArnaud Thiercelin at AI Frontiers : AI in the Sky
Arnaud Thiercelin at AI Frontiers : AI in the Sky
AI Frontiers701 views
Wei Xu at AI Frontiers : Language Learning in an Interactive and Embodied Set... by AI Frontiers
Wei Xu at AI Frontiers : Language Learning in an Interactive and Embodied Set...Wei Xu at AI Frontiers : Language Learning in an Interactive and Embodied Set...
Wei Xu at AI Frontiers : Language Learning in an Interactive and Embodied Set...
AI Frontiers928 views
Sumit Gupta at AI Frontiers : AI for Enterprise by AI Frontiers
Sumit Gupta at AI Frontiers : AI for EnterpriseSumit Gupta at AI Frontiers : AI for Enterprise
Sumit Gupta at AI Frontiers : AI for Enterprise
AI Frontiers1.1K views
Alex Ermolaev at AI Frontiers : Major Applications of AI in Healthcare by AI Frontiers
Alex Ermolaev at AI Frontiers : Major Applications of AI in HealthcareAlex Ermolaev at AI Frontiers : Major Applications of AI in Healthcare
Alex Ermolaev at AI Frontiers : Major Applications of AI in Healthcare
AI Frontiers827 views
Melissa Goldman at AI Frontiers : AI & Finance by AI Frontiers
Melissa Goldman at AI Frontiers : AI & FinanceMelissa Goldman at AI Frontiers : AI & Finance
Melissa Goldman at AI Frontiers : AI & Finance
AI Frontiers996 views
Li Deng at AI Frontiers : From Modeling Speech/Language to Modeling Financial... by AI Frontiers
Li Deng at AI Frontiers : From Modeling Speech/Language to Modeling Financial...Li Deng at AI Frontiers : From Modeling Speech/Language to Modeling Financial...
Li Deng at AI Frontiers : From Modeling Speech/Language to Modeling Financial...
AI Frontiers967 views
Ashok Srivastava at AI Frontiers : Using AI to Solve Complex Economic Problems by AI Frontiers
Ashok Srivastava at AI Frontiers : Using AI to Solve Complex Economic ProblemsAshok Srivastava at AI Frontiers : Using AI to Solve Complex Economic Problems
Ashok Srivastava at AI Frontiers : Using AI to Solve Complex Economic Problems
AI Frontiers719 views
Rohit Tripathi at AI Frontiers : Using intelligent connectivity and AI to tra... by AI Frontiers
Rohit Tripathi at AI Frontiers : Using intelligent connectivity and AI to tra...Rohit Tripathi at AI Frontiers : Using intelligent connectivity and AI to tra...
Rohit Tripathi at AI Frontiers : Using intelligent connectivity and AI to tra...
AI Frontiers907 views

Recently uploaded

TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc
130 views29 slides
Microsoft Power Platform.pptx by
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
74 views38 slides
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueShapeBlue
68 views13 slides
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesShapeBlue
178 views15 slides
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...ShapeBlue
48 views17 slides
MVP and prioritization.pdf by
MVP and prioritization.pdfMVP and prioritization.pdf
MVP and prioritization.pdfrahuldharwal141
39 views8 slides

Recently uploaded(20)

TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc130 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 views
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue48 views
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ... by ShapeBlue
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
Backroll, News and Demo - Pierre Charton, Matthias Dhellin, Ousmane Diarra - ...
ShapeBlue121 views
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by ShapeBlue

Yuandong Tian at AI Frontiers : Planning in Reinforcement Learning

  • 1. Planning in Reinforcement Learning Yuandong Tian Research Scientist Facebook AI Research
  • 2. AI works in a lot of situations Medical Translation Personalization Surveillance Object Recognition Smart Design Speech Recognition Board game
  • 3. What AI still needs to improve Very little supervised data Complicated environments Lots of corner cases. Home Robotics Autonomous Driving ChatBot StarCraft Question Answering Common Sense Exponential space to explore
  • 4. What AI still needs to improve Human level A scary trend of slowing down “Man, we need more data” Initial Enthusiasm “It really works! All in AI!” Trying all possible hacks “How can that be…” Despair “No way, it doesn’t work” Performance Efforts
  • 5. What AI still needs to improve Human level A scary trend of slowing down “Man, we need more data” Initial Enthusiasm “It really works! All in AI!” Trying all possible hacks “How can that be…” Despair “No way, it doesn’t work” Performance Efforts We need novel algorithms
  • 6. Reinforcement Learning Action State Reward Agent Environment [R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction] Atari Games Go DoTA 2 Doom Quake 3 StarCraft
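The agent-environment loop on this slide (state → action → reward, repeat) can be sketched as a minimal self-contained toy; the 1-D chain environment and all names are illustrative, not from any RL library:

```python
import random

class ChainEnv:
    """Toy 1-D chain: states 0..4, reward +1 for reaching state 4."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the state is clipped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = ChainEnv()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])        # a purely random agent
    state, reward, done = env.step(action)
    total_reward += reward
```

A learning agent would replace `random.choice` with a policy that is improved from the observed rewards.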
  • 7. Why is planning important? Just one example
  • 8. AlphaGo Zero Update Models Generate Training data Self-Replays Zero human knowledge [Silver et al, Mastering the game of Go without human knowledge, Nature 2017]
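The AlphaGo Zero loop on this slide (self-play with MCTS generates data, the model is updated on it, repeat) can be sketched schematically. Every name below is an illustrative placeholder, not DeepMind's actual code; `mcts_move` in particular stands in for a full Monte Carlo Tree Search:

```python
import random

def legal_moves(state):
    # Toy 3x3 board: any of the 9 cells not yet played.
    return [m for m in range(9) if m not in state]

def mcts_move(model, state, rollouts=1600):
    # Placeholder for MCTS guided by the model; here just a random legal move.
    return random.choice(legal_moves(state))

def self_play(model):
    state, trajectory = [], []
    while legal_moves(state):              # play until the board is full
        move = mcts_move(model, state)
        trajectory.append((tuple(state), move))
        state.append(move)
    return trajectory

replay_buffer = []
model = None                               # stands in for the ResNet
for iteration in range(3):                 # a few outer training iterations
    replay_buffer.extend(self_play(model)) # self-play generates training data
    # model = update(model, replay_buffer) # gradient step on the buffer (omitted)
```

The point of the sketch is the data flow: the planner (MCTS) produces the training targets, and the model only ever learns from its own planned self-play.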
  • 9. AlphaGo Zero Strength • 3-day version • 4.9M games, 1600 rollouts/move • 20-block ResNet • Defeats AlphaGo Lee. • 40-day version • 29M games, 1600 rollouts/move • 40-block ResNet • Defeats AlphaGo Master by 89:11
  • 10. ELF: Extensive, Lightweight and Flexible Framework for Game Research Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, Larry Zitnick https://github.com/facebookresearch/ELF [Y. Tian et al, ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games, NIPS 2017]
  • 11. ELF: A simple for-loop Action State Reward Agent Environment C++ Python
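The "simple for-loop" idea can be sketched as follows. The names are illustrative stand-ins, not the real ELF API; in real ELF, C++ game threads produce the batches and a neural network consumes them:

```python
import numpy as np

def model(states):
    # Stand-in for a neural network forward pass: one action per state.
    return np.random.randint(0, 4, size=len(states))

def collect_batch(num_games, state_dim):
    # In real ELF, C++ game threads fill this batch; here we fake it.
    return np.random.rand(num_games, state_dim)

for _ in range(10):                        # the "simple for-loop"
    batch = collect_batch(num_games=16, state_dim=8)
    actions = model(batch)                 # one forward pass per batch
    # ...actions would be sent back to the C++ side here...
```

The design point is that the Python side sees only batched tensors, so one model call serves many concurrently running games.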
  • 13. Distributed ELF Server Evaluate/Selfplay Training Send request (game params) Receive experiences Client Client Client Client Client Client Client AlphaGoZero (more synchronization) AlphaZero (less synchronization) Putting AlphaGoZero and AlphaZero into the same framework
  • 14. ELF OpenGo • System can be trained with 2000 GPUs in 2 weeks. • Decent performance against professional players and strong bots. • Abundant ablation analysis • Decoupled design, code highly reusable for other games. We open source the code and the pre-trained model for the Go and ML community https://github.com/pytorch/ELF Simple tutorial in experimental branch (tutorial, tutorial_distri)
  • 15. ELF OpenGo Performance
    Vs top professional players (single GPU, 80k rollouts, 50 seconds; unlimited thinking time for the players): 20-0 overall
    Name (rank) | ELO (world rank) | Result
    Kim Ji-seok | 3590 (#3) | 5-0
    Shin Jin-seo | 3570 (#5) | 5-0
    Park Yeonghun | 3481 (#23) | 5-0
    Choi Cheolhan | 3466 (#30) | 5-0
    Vs strong bot (LeelaZero [158603eb, 192x15, Apr. 25, 2018]): 980 wins, 18 losses (98.2%)
    Vs professional players (single GPU, 2k rollouts): 27-0 against Taiwanese pros
  • 16. Planning is how new knowledge is created [Figure: from game start to game end, random moves turn into meaningful moves; meaningful play appears first near the end of the game, where the reward signal is] Already dan level even if the opening doesn’t make much sense. [Figure: win rate against a bot without planning (50%-100%) vs. number of planning rollouts] Training is almost always constrained by model capacity (why 40b > 20b)
  • 17. Planning is how new knowledge is created T-step looking forward Tree searchOne-step looking forward Monte Carlo Sampling Learning a neural network that directly predicts the optimal value/policy Temporal Difference (TD) in Reinforcement Learning:
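The one-step look-ahead mentioned here is exactly the temporal-difference update; a tabular TD(0) sketch makes it concrete (the chain setup and numbers are toy assumptions, not from the talk):

```python
gamma, alpha = 0.9, 0.1
V = [0.0] * 5                    # tabular values for states 0..4

def td_update(s, reward, s_next):
    # V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s))
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])

# Repeatedly observe the transition 3 -> 4 with reward 1; V(3) converges
# to the one-step look-ahead target: reward + gamma * V(4) = 1.0.
for _ in range(1000):
    td_update(3, 1.0, 4)
```

T-step look-ahead and tree search generalize the same idea: compute a better target by planning further ahead, then push it into the value function.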
  • 18. Planning is how game AI is created Extensive search/planning Evaluate Consequence Black wins White wins Black wins White wins Black wins Current game situation Lufei Ruan vs. Yifan Hou (2010) AlphaBeta Pruning + Iterative Deepening Monte Carlo Tree Search …
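Of the classical search methods named on this slide, alpha-beta pruning is the easiest to sketch on an explicit toy game tree (the tree and leaf values below are made up for illustration):

```python
def alphabeta(node, alpha, beta, maximizing):
    """Nested lists are internal nodes; numbers are leaf evaluations."""
    if isinstance(node, (int, float)):     # leaf: static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:              # beta cut-off: prune remaining children
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:                  # alpha cut-off
            break
    return value

tree = [[3, 5], [2, 9], [0, 7]]            # depth-2 tree, maximizer to move
best = alphabeta(tree, float("-inf"), float("inf"), True)
```

Here the maximizer's best value is max(min(3,5), min(2,9), min(0,7)) = 3; the cut-offs let the search skip branches that cannot change this result.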
  • 19. How to plan without a known model? If you don’t have a ground truth dynamics model… • Only (human/expert) trajectories, no world model. • Limited access to the world model. • Cannot restart the model, cannot query arbitrary (s, a) pairs. • Noisy signals from the world model. • … If you have a ground truth dynamics model… • Infinite access to the exact world model. • May query any (s, a) pair. • … Current state Next state Action to take Build one
  • 20. Navigation Target Yi Wu Georgia Gkioxari Yuxin Wu How to plan the trajectory?
  • 21. Build a semantic model outdoor living room sofa Find “oven” car chair dining room kitchen oven Incomplete model of the environment [Y. Wu et al, Learning and Planning with a Semantic Model, submitted to ICLR 2019]
  • 22. Build a semantic model Bayesian Inference 𝑃(𝑧|𝑌) car chair dining room kitchen oven 0.7 0.95 0.8 0.5 0.6 0.7 Next step “kitchen” outdoor living room sofa Learning experience 𝑌
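The Bayesian inference P(z|Y) on this slide can be sketched with a toy binary hypothesis for one semantic-model edge; all probabilities below are illustrative assumptions, not numbers from the paper:

```python
def posterior(prior, p_obs_if_true, p_obs_if_false):
    # Bayes' rule for a binary hypothesis z given one observation Y:
    # P(z|Y) = P(Y|z) P(z) / (P(Y|z) P(z) + P(Y|~z) P(~z))
    num = p_obs_if_true * prior
    return num / (num + p_obs_if_false * (1.0 - prior))

# z = "the kitchen is reachable from the living room", prior P(z) = 0.5.
# The agent fails to reach the kitchen three times; each such failure is
# assumed 3x more likely when the rooms are NOT connected.
p = 0.5
for _ in range(3):
    p = posterior(p, p_obs_if_true=0.2, p_obs_if_false=0.6)
```

After three failures the posterior drops to 1/28 ≈ 0.036, which is how the planner's edge probabilities (e.g. 0.7 → 0.01 in the case study) get revised from experience.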
  • 23. LEAPS LEArning and Planning with a Semantic model living room kitchen chair sofa Dining room 𝑃(𝑧kitchen,living room) 𝑃(𝑧sofa,living room) 𝑃(𝑧chair,living room) 𝑃(𝑧dining,living room) living room kitchen chair sofa Dining room 𝑃(𝑧kitchen,living room) 𝑃 𝑧sofa,living room 𝑌 = 1 𝑃(𝑧chair,living room) 𝑃(𝑧dining,living room) Planning the trajectory and more exploration Planning the trajectory and more exploration
  • 24. House3D SUNCG dataset, 45K scenes, all objects are fully labeled. https://github.com/facebookresearch/House3D Depth Segmentation maskRGB image Top-down map
  • 25. Learning the Prior between Different Rooms
  • 26. Test Performance on ConceptNav
  • 27. Case Study • A case study • Go to “outdoor” Prior: 𝑃(𝑧) Birth Outdoor Living room Garage 0.12 0.38 0.76 0.73 0.28 Sub-Goal: Outdoor
  • 28. Case Study • A case study • Go to “outdoor” Sub-Goal: Outdoor Failed! Birth Outdoor Living room Garage 0.12 0.38 0.01 0.73 0.28 Posterior: 𝑃(𝑧|𝐸)
  • 29. • A case study • Go to “outdoor” Posterior: 𝑃(𝑧|𝐸) Sub-Goal: Garage Birth Outdoor Living room Garage 0.12 0.01 0.73 0.28 Failed! Case Study 0.08
  • 30. Case Study • A case study • Go to “outdoor” Sub-Goal: Living Room Posterior: 𝑃(𝑧|𝐸) Birth Outdoor Living room Garage 0.12 0.08 0.01 0.28 Success 0.99
  • 31. Case Study • A case study • Go to “outdoor” Sub-Goal: Outdoor Posterior: 𝑃(𝑧|𝐸) Success Birth Outdoor Living room Garage 0.99 0.08 0.01 0.280.99
  • 33. Iterative LQR What if the dynamics is nonlinear? New Policy Sample according to the new policy
  • 34. Non-differentiable Plans • Directly predicting combinatorial solutions. [O. Vinyals. et al, Pointer Networks, NIPS 2015] Convex hull Seq2seq model [H. Mao et al, Resource Management with Deep Reinforcement Learning, ACM Workshop on Hot Topics in Networks, 2016] Schedule the job to i-th slot Policy gradient
  • 35. Neural network rewriter st Sample 𝒈 𝒕 ∼ 𝑺𝑷 𝒈 𝒕 , 𝒈 𝒕 ⊂ 𝒔 𝒕 Sample 𝒂 𝒕 ∼ 𝑹𝑺 𝒈 𝒕 𝒔 𝒕+𝟏 = 𝒇(𝒔 𝒕, 𝒈 𝒕, 𝒂 𝒕) Input Encoder Score Predictor Rule Selector [X. Chen and Y. Tian, Learning to Progressively Plan, submitted to ICLR 2019] Is it simpler to improve the solution progressively?
  • 36. Job Scheduling Scheduling 1 2 3 T=2, S=1 T=3, S=2 T=1, S=3 0 1 2 3 1 32 4 5 6 0 1 2 3 1 2 3 1 32 4 5 6 1 2 3 Graph representationJobs 10 Resource 1 Resource 2 time time Slow down Slow down
  • 37. Job Scheduling g2 0.1 g4 0.7 g5 0.2 Input Encoder Score Predictor Rule Selector 0.1 -0.3 0.2 DAG-LSTM Embedding 0 1 5 3 42 FC FC FC Softmax St at St+1 0 1 5 3 42 -0.2 0.5 𝐠 𝐭 0 1 5 4 32
  • 38. Job Scheduling
    #Resources              | 2     | 5     | 10    | 20
    Shortest Job First      | 4.80  | 5.83  | 5.58  | 5.00
    Shortest First Search   | 4.25  | 5.05  | 5.54  | 4.98
    DeepRM                  | 2.81  | 6.52  | 9.20  | 10.18
    Neural Rewriter (Ours)  | 2.80  | 3.36  | 4.50  | 4.63
    Optim (LB)              | 2.57  | 3.02  | 4.08  | 4.26
    Earliest Job First (UB) | 11.11 | 13.62 | 22.13 | 24.23
  • 39. Expression Simplification a1 0 … a5 1 (Constant Reduction) … a19 0 Input Encoder Score Predictor Rule Selector Embedding Tree-LSTM FC 0 -1 0.7 FC Softmax St St+1 𝐠 𝐭 at <= min - v0 v2 v1 v1 <= min - v0 v2 v1 v1 <= min 0 v0 v2
  • 40. Expression Simplification
    Method                     | Expr Len Reduction | Tree Size Reduction
    Halide Rule-based Rewriter | 36.13 | 9.68
    Heuristic Search           | 43.27 | 12.09
    Neural Rewriter (Ours)     | 46.98 | 13.53
  • 41. Example ≤ 5 max v 3 + 3 5 ≤ max 𝑣, 3 + 3 ≤ 5 v 3 max + + 33 ≤ 5 v 3 + ≤ 5 3 3 + OR ≤ 5 v 3 + True OR True
  • 42. Future Directions Multi-Agent Admiral (General) Captain Lieutenant Hierarchical RL RL Systems Model-based RL Policy Optimization Model Estimation RL applications RL for Optimization
  • 43. How to do well in Reinforcement Learning? Experience with (distributed) systems Strong math skills Strong coding skills Parameter-tuning skills

Editor's Notes

  1. RL is one of the methods that might come to the rescue. The basic idea of RL is very simple: we have an agent that perceives the state of the environment, takes an action, and receives a reward. The environment receives the action, changes its internal state, and the cycle repeats. With virtual environments, we could potentially get an infinite amount of data to train our models. Since then, deep reinforcement learning has made substantial progress in different kinds of games, including Atari games, Go, DoTA 2, etc. Then the question is, what is the next step? In this talk, I am going to cover a few recent works that explore the power of planning in the reinforcement learning setting.
  3. I will start with one example why planning is important.
  4. We all know that last year DeepMind released a paper in Nature about AlphaGoZero, which learns a super-human Go bot without any human knowledge. The idea is very simple: starting from a randomly initialized model, we use planning algorithms like MCTS to find the best move at each step, and save the self-plays into a replay buffer. Then we update the model based on the self-play buffer, and repeat.
  5. This approach is surprisingly simple, yet it gives very strong performance. After 3 days with thousands of TPUs, the model can already beat AlphaGo Lee; after 40 days it defeats AlphaGo Master.
  6. Inspired by these interesting and exciting results, we tried reproducing AlphaGoZero with our recently published ELF platform. The goal is to understand why such a simple approach yields such strong performance.
  7. ELF is an Extensive, Lightweight and Flexible framework that makes a practical RL system easy and manageable. For this, ELF puts all the implementation and design details on the C++ side, leaving the Python side a simple for-loop, as in the textbook. Moreover, each iteration the Python side receives a batch for a neural network model to operate on, improving efficiency.
  8. The key idea of ELF is to achieve dynamic batching from multiple game instances. In this platform, many games run simultaneously. From time to time, each has requests to call a deep learning API (like PyTorch) to compute the next states and actions, or to store the game history into replay buffers. ELF provides a dynamic batching interface so that requests can be automatically batched for high throughput.
  9. We improved our framework to support the distributed setting, putting AGZ and AZ into the same framework.
  10. We then released ELF OpenGo, which reproduces AZ and AGZ.
  11. One question is: what can we learn from it?
  12. One interesting observation from this experiment is that knowledge is propagated backwards. You start to see meaningful moves at the end of the game, where there is a reward signal. Over iterations, the meaningful moves are backpropagated all the way to the opening of the game. The right figure shows that even with well-trained models, additional planning still makes the bot much stronger. In fact, the strength of the bot is always constrained by the capacity of the model. We are working on an arXiv paper to discuss the training details.
  13. Not only does training AGZ/AZ require planning; planning is also very important during training for general reinforcement learning. This includes not only one-step look-ahead, but also T-step look-ahead as well as more complicated planning mechanisms like tree search. A large portion of DRL tries to push the results given by planning into neural networks.
  14. Planning is also very important for general game AI. Chess and other board games are important examples.
  15. Traditional