Discovery and Learning of Navigation Goals from Pixels in Minecraft

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 40 Ad

Discovery and Learning of Navigation Goals from Pixels in Minecraft

Download to read offline

Master MATT thesis defense by Juan José Nieto
Advised by Víctor Campos and Xavier Giro-i-Nieto.
27th May 2021.

Pre-training Reinforcement Learning (RL) agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation using either variational or contrastive techniques. We demonstrate that both allow learning a set of basic navigation skills by maximizing an information-theoretic objective. We assess our method on 3D Minecraft maps of varying complexity. Our results show that representations and conditioned policies learned from pixels are sufficient for toy examples, but do not scale to realistic and complex maps. We also explore alternative rewards and input observations to overcome these limitations.
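
For reference (this equation does not appear in the slide text; it follows the standard formulation of this family of methods rather than anything specific to the thesis), the information-theoretic objective is the mutual information between states S and latent skill variables Z, which can be decomposed in a forward or a reverse form:

$$\mathcal{I}(S; Z) \;=\; \underbrace{\mathcal{H}(S) - \mathcal{H}(S \mid Z)}_{\text{forward}} \;=\; \underbrace{\mathcal{H}(Z) - \mathcal{H}(Z \mid S)}_{\text{reverse}}$$

Since the conditional entropies are intractable, they are bounded with a learned model, e.g. a variational decoder $q_\phi(s \mid z)$ or a contrastive estimator of $q_\phi(z \mid s)$, and that same model supplies the intrinsic reward for the skill-conditioned policy (e.g. $r(s, z) \propto \log q_\phi(s \mid z)$).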

https://imatge.upc.edu/web/publications/discovery-and-learning-navigation-goals-pixels-minecraft

Discovery and Learning of Navigation Goals from Pixels in Minecraft

  1. 1. Discovery and learning of navigation goals from pixels in Minecraft. Juan José Nieto Salas, Master Thesis, May 27th, 2021. Acknowledgements: Xavier Giró (Advisor), Víctor Campos (Advisor), Òscar Mañas, Roger Creus.
  2. 2. REINFORCEMENT LEARNING: diagram of the agent-environment loop (agent, environment, state, action, reward).
  3. 3. MOTIVATION: https://www.youtube.com/watch?v=GHo8B4JMC38
  4. 4. MOTIVATION: SELF-SUPERVISED LEARNING. COMPUTER VISION: Mathilde Caron, et al. "Emerging Properties in Self-Supervised Vision Transformers." (2021). NATURAL LANGUAGE PROCESSING: Tom B. Brown, et al. "Language Models are Few-Shot Learners." (2020).
  5. 5. UNSUPERVISED RL. INTRINSIC MOTIVATION: EMPOWERMENT. Benjamin Eysenbach, et al. "Diversity is All You Need: Learning Skills without a Reward Function." (2018). Archit Sharma, et al. "Dynamics-Aware Unsupervised Discovery of Skills." (2020).
  6. 6. Explore, Discover and Learn (EDL), Víctor Campos et al., ICML 2020: good coverage of the state space; independent of how the state distribution is induced.
  7. 7. Explore, Discover and Learn: (1) define the state distribution and how we sample from it; (2) learn the mapping from s to z and define the intrinsic rewards; (3) learn behaviours by training the conditioned policies on z (see the training-loop sketch after the transcript).
  8. 8. Explore, Discover and Learn: reward as reconstruction error using MSE does not scale to pixels, i.e. from (x, y) coordinates to (3, H, W) images (see the reward sketch after the transcript).
  9. 9. IMPLEMENTATION: SKILL DISCOVERY → DISCOVER NAVIGATION GOALS.
  10. 10. IMPLEMENTATION: SKILL DISCOVERY → DISCOVER NAVIGATION GOALS; SKILL LEARNING → LEARN BEHAVIOURS THAT GUIDE THE AGENT TOWARDS THESE GOALS.
  11. 11. Explore, Discover and Learn: Explore. ● Induce state distribution from expert trajectories ● Information-theoretic objectives do not encode human priors properly. MineRL tasks: Navigate, Treechop, Obtain Bed, Obtain Diamond, Obtain Iron Pickaxe, Obtain Meat (MineRL, Guss et al., 2019).
  12. 12. Explore, Discover and Learn: Discover. Maximize mutual information between inputs and some latent variables (FORWARD vs. REVERSE form).
  13. 13. Explore, Discover and Learn: Discover. Maximize mutual information between inputs and some latent variables (FORWARD vs. REVERSE form).
  14. 14. Explore, Discover and Learn: Discover. Maximize mutual information between inputs and some latent variables (FORWARD vs. REVERSE form).
  15. 15. Explore, Discover and Learn: Discover. VARIATIONAL: "Auto-Encoding Variational Bayes", Kingma et al. (2014). CONTRASTIVE: "Representation Learning with Contrastive Predictive Coding", Oord et al. (2018).
  16. 16. Explore, Discover and Learn: Learn. VARIATIONAL vs. CONTRASTIVE.
  17. 17. Pipeline: Variational (see the reward sketch after the transcript).
  18. 18. Pipeline: Contrastive (see the reward sketch after the transcript).
  19. 19. Experiments.
  20. 20. Skill discovery: index maps from random trajectories (CONTRASTIVE, VARIATIONAL, MAP).
  21. 21. Skill discovery: index maps from expert trajectories (CONTRASTIVE, VARIATIONAL, MAP).
  22. 22. Skill discovery: PCA over embeddings learned from expert trajectories (CONTRASTIVE, VARIATIONAL).
  23. 23. Skill learning: 1. Toy map with random trajectories. 2. Toy map with expert plays. 3. Realistic map with random trajectories, where the input is composed of pixels and coordinates.
  24. 24. Experiment 1: handcrafted map, random trajectories, contrastive approach. MAP, REWARD MAP.
  25. 25. Experiment 1: handcrafted map, random trajectories, contrastive approach. MAP, REWARD MAP, TRAJECTORIES IN EVALUATION, AVERAGE REWARD OVER TIME.
  26. 26. Experiment 1: handcrafted map, random trajectories, contrastive approach.
  27. 27. Experiment 1: handcrafted map, random trajectories, contrastive approach.
  28. 28. Experiment 2: handcrafted map, expert trajectories, variational approach. CENTROIDS, RECONSTRUCTION.
  29. 29. Experiment 2: handcrafted map, expert trajectories, variational approach; z3 reconstruction → REWARD MAP. MAP.
  30. 30. Experiment 2: handcrafted map, expert trajectories, variational approach; z3 reconstruction → REWARD MAP. MAP, TRAJECTORIES IN EVALUATION, AVERAGE REWARD OVER TIME.
  31. 31. Experiment 2: handcrafted map, expert trajectories, variational approach.
  32. 32. Experiment 2: handcrafted map, expert trajectories, variational approach.
  33. 33. Experiment 2: handcrafted map, expert trajectories, variational approach.
  34. 34. Experiment 3.
  35. 35. Experiment 3: real map, random trajectories, variational approach; inputs: pixels and coordinates. REWARD MAP, MAP.
  36. 36. Experiment 3: real map, random trajectories, variational approach; inputs: pixels and coordinates. REWARD MAP, TRAJECTORIES IN EVALUATION, AVERAGE REWARD OVER TIME, MAP.
  37. 37. Experiment 3: real map, random trajectories, variational approach; inputs: pixels and coordinates.
  38. 38. Embodied AI Workshop.
  39. 39. Conclusions: ● We empirically demonstrate that expert trajectories are sufficient for discovering generic skills ● We maximize empowerment with both variational and contrastive approaches ● We successfully learned meaningful skills by using the reverse form of the mutual information.
  40. 40. THANK YOU!
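
Reward sketch (referenced from slides 8, 17 and 18 above). The thesis pipeline figures are images and are not reproduced in the transcript, so the following is only a minimal PyTorch-style illustration of the two intrinsic-reward forms discussed there: a variational reward based on reconstruction error under a decoder, and a contrastive reward based on an InfoNCE-style score against a skill codebook. All module names, shapes and hyperparameters (PixelEncoder, the 64-dimensional latent, the 10-skill codebook, the temperature) are assumptions for illustration, not the author's implementation.

```python
# Hypothetical sketch: two intrinsic-reward heads for skill-conditioned RL.
# Not the thesis implementation; names, shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelEncoder(nn.Module):
    """Maps (3, 64, 64) observations to a compact latent vector."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(latent_dim)

    def forward(self, obs):
        return self.fc(self.conv(obs))


def variational_reward(decoder: nn.Module, skill_z: torch.Tensor, state_feat: torch.Tensor):
    """Reward ~ log q(s|z); with a unit-variance Gaussian decoder this is,
    up to constants, the negative squared reconstruction error."""
    recon = decoder(skill_z)                           # predicted state features
    return -((recon - state_feat) ** 2).sum(dim=-1)


def contrastive_reward(state_emb: torch.Tensor, skill_emb: torch.Tensor, temperature: float = 0.1):
    """Reward ~ log q(z|s) estimated with an InfoNCE-style softmax over the
    skill codebook: high reward when the state embedding is close to its skill."""
    logits = state_emb @ skill_emb.t() / temperature   # (batch, n_skills)
    return F.log_softmax(logits, dim=-1)               # pick the column of the active skill


if __name__ == "__main__":
    enc = PixelEncoder()
    obs = torch.rand(8, 3, 64, 64)                     # a batch of fake frames
    feats = enc(obs)
    codebook = torch.randn(10, 64)                     # 10 hypothetical skill embeddings
    skill_ids = torch.randint(0, 10, (8,))
    dec = nn.Linear(64, 64)                            # toy decoder, for the sketch only
    r_var = variational_reward(dec, codebook[skill_ids], feats)
    r_con = contrastive_reward(feats, codebook)[torch.arange(8), skill_ids]
    print(r_var.shape, r_con.shape)                    # torch.Size([8]) torch.Size([8])
```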
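
Training-loop sketch (referenced from slide 7 above). A hypothetical outline of the third EDL stage, i.e. learning behaviours by training a skill-conditioned policy on intrinsic rewards such as the heads sketched in this section; the gym-style environment interface, the policy's act/update methods and the episode horizon are placeholders, not the setup used in the thesis.

```python
# Hypothetical outline of the "Learn" stage: train a skill-conditioned policy
# on intrinsic rewards. Interfaces and hyperparameters are placeholders.
import random


def run_skill_learning(env, policy, reward_fn, n_skills=10, episodes=1000, horizon=500):
    for _ in range(episodes):
        z = random.randrange(n_skills)                # sample a skill, keep it fixed for the episode
        obs = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy.act(obs, z)               # policy is conditioned on the skill
            next_obs, _, done, _ = env.step(action)   # the extrinsic reward is ignored
            r = reward_fn(next_obs, z)                # intrinsic reward, e.g. variational or contrastive
            trajectory.append((obs, z, action, r, next_obs, done))
            obs = next_obs
            if done:
                break
        policy.update(trajectory)                     # any on-policy or off-policy RL update
```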

Editor's Notes

  • -> goal: mine a diamond block.
    -> NeurIPS challenge

    -> long sequence of actions, impossible to perform by chance
    -> rather, learn a set of skills to ease training and to solve more complex tasks
    -> mention examples of skills

    -> learn these skills without supervision, inspired by the success of self-supervised learning
  • -> mention these two examples

    -> in this paradigm we extract features that can be transferred to other downstream tasks.
    -> since it does not require annotating labels, there are no scalability problems

    -> can we transfer these ideas to RL?
    -> what kind of tasks?
    -> it is not enough to extract features; we want to transfer behaviours or skills
  • -> it’s a little bit difficult to assess the learned skills since we do not have labels! But these simple examples and plots help with this task
    -> we’ll also show some differences between discovering skills from random and expert trajectories
    -> for that, we use two different maps

    -> showing a top view of the map!
    -> these index maps are a way of assessing the learned skills

    -> each dot belongs to an observation from a random trajectory
    -> it has been encoded, and we pick the index of the closest embedding from the codebook
    -> each index is mapped to a different color, forming these plots

    -> explain the results: the variational approach gives more discrete regions, the contrastive one more overlapping ones

  • These experiments show the progress made during our work

    Make sure that everyone understands what the agent's observations are!
  • Although, as we have seen, the learned skills struggle when deployed in realistic environments, since they are quite mixed and overlapping
    Mention that the variational approach is common, but the contrastive approach is relatively new for maximizing empowerment
    These skills could be used together with a hierarchical policy on top to perform more complex tasks
