
AlphaGo Zero

An internal talk on AlphaGo Zero given at Didi at the end of 2017.

  1. AlphaGo Zero (guodong)
  2. Go • What is the problem, essentially? Search and scoring over a finite space • How would you implement a Go-playing program yourself? • Look a few moves ahead (maze search) -> complexity problem • A library of standard patterns (rules) + learning expert moves • Evaluating the board position
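To put a number on the complexity bullet above: the 2016 AlphaGo paper quotes a branching factor of roughly b ≈ 250 and a game length of roughly d ≈ 150 moves for Go, so naive lookahead over the full game tree would have to consider on the order of

$$
b^{d} \approx 250^{150} \approx 10^{360}
$$

positions, which is why brute-force search alone cannot work.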
  3. Three related papers • Beat humans at some games without human knowledge, 2015: Human-level control through deep reinforcement learning • Beat humans at Go while partially relying on human knowledge, 2016: Mastering the Game of Go with Deep Neural Networks and Tree Search • Dominated humans at Go without human knowledge, 2017: Mastering the Game of Go without Human Knowledge
  4. AlphaGo
  5. MCTS • Most Go programs are based on MCTS • Narrow the search space via a policy network and a value network • Exploration tradeoff: prefer actions with high prior probability, low visit count, and high action value
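The exploration tradeoff in the last bullet is usually written as the PUCT selection rule from the AlphaGo papers (shown here in its AlphaGo Zero form):

$$
a_t = \arg\max_a \big( Q(s_t, a) + U(s_t, a) \big), \qquad
U(s, a) = c_{\text{puct}} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}
$$

A high prior P, a low visit count N, and a high action value Q all raise an action's score, matching the preference stated above.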
  6. MCTS steps • Selection: starting from the current state, trace one path through the tree, based on Q value, prior probability (from the policy network), and visit count • Expansion: once extending the path in depth is judged worthwhile (visit count above a threshold), sample with the default policy • Evaluation: combine the value-network prediction with one random-rollout result and update the state value by sampling; the value network approximates the outcome of games played by a strong policy, while rollouts evaluate the outcome of games played by a weaker policy • Backup: improve the tree (update visit counts and Q values)
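Below is a minimal Python sketch of these four steps. The stubs `policy_fn`, `value_fn`, `step_fn`, and `legal_fn` are hypothetical stand-ins for the real networks and game engine; for brevity it expands every leaf rather than waiting for a visit-count threshold, skips the rollout mixing, and ignores the sign flip between the two players.

```python
import math
import random


class Node:
    """One search-tree node holding the MCTS statistics from the slide."""
    def __init__(self, prior):
        self.prior = prior          # P(s, a), prior from the policy network
        self.visit_count = 0        # N(s, a)
        self.value_sum = 0.0        # accumulated backed-up value, W(s, a)
        self.children = {}          # action -> Node

    def q_value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def select_child(node, c_puct=1.0):
    """Selection score: high prior, low visit count, high action value."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        return child.q_value() + u

    return max(node.children.items(), key=score)


def run_mcts(root_state, root, policy_fn, value_fn, step_fn, legal_fn,
             num_simulations=100):
    for _ in range(num_simulations):
        node, state, path = root, root_state, [root]
        # 1. Selection: follow the best Q + U path down the existing tree.
        while node.children:
            action, node = select_child(node)
            state = step_fn(state, action)
            path.append(node)
        # 2. Expansion: add children with priors from the policy network.
        priors = policy_fn(state)                    # {action: probability}
        for action in legal_fn(state):
            node.children[action] = Node(priors.get(action, 1e-3))
        # 3. Evaluation: value-network estimate of the leaf (original AlphaGo
        #    also mixed in one random-rollout result here).
        value = value_fn(state)
        # 4. Backup: update visit counts and accumulated values along the path.
        for visited in path:
            visited.visit_count += 1
            visited.value_sum += value
    return root


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end: 3 legal moves, random values.
    policy = lambda s: {a: 1 / 3 for a in range(3)}
    value = lambda s: random.uniform(-1, 1)
    step = lambda s, a: s + (a,)
    legal = lambda s: range(3) if len(s) < 5 else []
    root = run_mcts((), Node(prior=1.0), policy, value, step, legal)
    print({a: c.visit_count for a, c in root.children.items()})
```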
  7. AlphaGo • Powers MCTS with a Policy Network and a Value Network • Policy Network (SL + policy gradient on a DNN) • Value Network (value function approximation + DNN + MC targets)
  8. AlphaGo: Policy Network • Purpose: narrow the search down to high-probability moves • First trained by SL to predict human expert moves, then refined by policy-gradient RL (sketched below)
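A hedged sketch of the two training stages using PyTorch, with randomly generated tensors standing in for real board features, expert moves, and game outcomes (the actual policy network was a deep convolutional net over board feature planes, not this toy MLP):

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the policy network: 361 board features -> 362 move logits.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(361, 256), torch.nn.ReLU(), torch.nn.Linear(256, 362))
opt = torch.optim.SGD(policy_net.parameters(), lr=1e-3)

# Stage 1 (SL): cross-entropy against human expert moves.
boards = torch.randn(8, 361)                  # fake batch of board features
expert_moves = torch.randint(0, 362, (8,))    # fake expert move labels
loss_sl = F.cross_entropy(policy_net(boards), expert_moves)
opt.zero_grad(); loss_sl.backward(); opt.step()

# Stage 2 (RL): REINFORCE-style refinement; raise the log-probability of
# moves played in games that were won (z = +1), lower it for losses (z = -1).
played_moves = torch.randint(0, 362, (8,))          # fake self-play moves
z = torch.randint(0, 2, (8,)).float() * 2 - 1       # fake outcomes in {-1, +1}
log_probs = F.log_softmax(policy_net(boards), dim=1)
loss_rl = -(z * log_probs[torch.arange(8), played_moves]).mean()
opt.zero_grad(); loss_rl.backward(); opt.step()
```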
  9. Value Network • Purpose: evaluate positions in the tree • Value-based reinforcement learning (function approximation with a DNN) • MSE between the predicted values and the observed rewards • "Labels" come from MC (episodes from self-play; a whole episode shares a single reward) • Prone to overfitting because successive positions are very similar
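In symbols, the value network v_θ(s) is regressed onto the self-play outcome z by minimizing

$$
\mathbb{E}_{(s, z) \sim \text{self-play}} \big[ (z - v_\theta(s))^2 \big]
$$

and the 2016 paper mitigated the overfitting in the last bullet by sampling only one position from each self-play game, so that near-duplicate positions do not all share the same label.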
  10. AlphaGo Zero
  11. AlphaGo Zero • 1. Trained solely by self-play RL, without any supervision from human data • 2. End-to-end: the raw board image as input features • 3. A single neural network, using residual blocks • 4. MCTS relies on the network alone, without performing any MC rollouts • NB: all knowledge is learned through the network • Incorporates lookahead search inside the training loop
  12. AlphaGo Zero Algorithm • AlphaGo's earlier approach: fix the network first, then build MCTS on top of it • Here instead: randomly initialize the network parameters and the MCTS • Iterate the following steps • Run MCTS guided by the network's predictions • Use the moves chosen by the MCTS-improved policy as the training input for the next iteration of the network (self-play) • Minimize the loss (formula below):
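The loss referred to on the slide, as given in the 2017 AlphaGo Zero paper: with the network outputting move probabilities and a value, (p, v) = f_θ(s), the parameters are trained to match the MCTS visit distribution π and the self-play outcome z,

$$
l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \, \lVert \theta \rVert^{2}
$$

where c weights the L2 regularization term.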
  13. Comparison • AlphaGo Zero reaches a higher rating; training time drops dramatically; it learns strategies different from those of human experts
  14. Performance
  15. Lessons learned • Not relying on human experience vs. unsupervised learning • Data quality sets the upper bound of the policy: can a model trained on human experience beat humans? • Deterministic vs. non-deterministic problems • Whether the rewards are well defined • Whether the game rules are explicit • Marketing the technology brand • Sampling is valuable: MC in value-network inference, MC in tree search • RL is powerful for dynamic problems when combined with MC • Human knowledge is probably only locally optimal • Engineering (tricks) is critical
