
[1807] Learning Montezuma's Revenge from a Single Demonstration

Presentation slides for 'Learning Montezuma's Revenge from a Single Demonstration' by T. Salimans and R. Chen.

You can find more presentation slides on my website:
https://www.endtoend.ai



  1. Learning Montezuma’s Revenge from a Single Demonstration (18.07), Ryan Lee
  2. Exploration and Learning
     ● Exploration: find an action sequence with positive reward
     ● Learning: remember and generalize that action sequence
     ● Need both for a successful agent
  3. Montezuma’s Revenge
     ● One of the hardest games on the Atari 2600
     ● Sparse rewards → exploration is difficult
     https://www.retrogames.cz/play_124-Atari2600.php?language=EN
  4. Simplifying Exploration with Demonstrations
     ● Solution: shorten the episode (see the code sketch after the transcript)
       ○ Start the agent near the end of the demonstration
       ○ Train the agent until it ties or beats the demonstrator’s score
       ○ Gradually move the starting point back in time
     Demo sequence: Go down Ladder 1 → Go down Rope → Go down Ladder 2 → Jump over Skull → Go up Ladder 3
  5. [Diagram: the demo sequence (Go down Ladder 1 → Go down Rope → Go down Ladder 2 → Jump over Skull → Go up Ladder 3) shown repeatedly, with the agent’s starting point moved progressively earlier in the demonstration]
  6. Result
     ● 74,500 points on Montezuma’s Revenge (state of the art)
     ● Surpasses the demo score of 71,500
     ● Exploits an emulator flaw
  7. Comparison with DeepMind’s approach
     ● DeepMind’s approach
       ○ Needs less control over the environment
       ○ Agents imitate the demo
     ● This approach
       ○ Needs full game states in the demo
       ○ Directly optimizes the game score → less overfitting to a sub-optimal demo
       ○ Better for multiplayer games, where performance should be optimized against various opponents
  8. Remaining Challenges
     ● The agent cannot reach the exact state seen in the demo
       ○ It needs to generalize between similar states
       ○ Problematic in Gravitar or Pitfall
     ● Careful hyperparameter tuning needed
     ● High variance across runs
     ● Neural networks do not generalize as well as humans
     https://blog.openai.com/openai-baselines-ppo/
  9. Thank you!
     Original content by OpenAI
     ● Learning Montezuma’s Revenge from a Single Demonstration
     You can find more content at
     ● github.com/seungjaeryanlee
     ● www.endtoend.ai
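Below is a minimal Python sketch of the procedure from slide 4: start episodes near the end of the demonstration, train until the agent ties or beats the score the demonstrator collected from that point onward, then move the starting point back in time. The snapshot format and the `train_policy` / `evaluate_policy` hooks are assumptions made for illustration, not the authors' actual implementation (the original work trains with PPO-style RL, see the blog post linked on slide 8).

```python
# Sketch of the single-demonstration curriculum (slide 4).
# Assumptions, not taken from the slides:
#   - demo_snapshots is a list of (emulator_state, score_so_far) pairs
#     saved at each timestep of the demonstration
#   - train_policy / evaluate_policy are placeholder hooks that reset the
#     environment to a saved emulator state and run the actual RL algorithm

def run_demo_curriculum(demo_snapshots, demo_score,
                        train_policy, evaluate_policy,
                        shift=50, eval_episodes=20):
    """Move the episode start point backwards through the demonstration."""
    policy = None
    start = max(len(demo_snapshots) - shift, 0)    # begin near the end of the demo

    while True:
        state, score_so_far = demo_snapshots[start]
        remaining = demo_score - score_so_far      # score the agent still has to earn

        # Train until the agent ties or beats the demonstrator from this start point.
        while True:
            policy = train_policy(policy, reset_state=state)
            avg_return = evaluate_policy(policy, reset_state=state,
                                         episodes=eval_episodes)
            if avg_return >= remaining:
                break

        if start == 0:                             # reached the true initial state
            return policy
        start = max(start - shift, 0)              # move the start point back in time
```

The loop above only captures the core idea; details such as distributed rollouts and how start points are assigned to workers are left to the actual training setup.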
