
Atari Game State Representation using Convolutional Neural Networks



I recently gave a talk to some MSc Machine Learning students at De Montfort University about the project I did for my MSc. The work included looking at feature extraction from game screens using the Arcade Learning Environment and Convolutional Neural Networks (CNN).

The work set out to investigate whether the costly nature of Q-Learning could be offset by a system trained on 'expert' data. The system uses the same technology that DeepMind used in their 2013 paper.

Published in: Technology


  1. 1. Training a Multi Layer Perceptron with Expert Data and Game State Representation using Convolutional Neural Networks JOHN STAMFORD MSC INTELLIGENT SYSTEMS AND ROBOTICS
  2. 2. Contents Background and Initial Brief Previous Work Motivation Technical Frameworks State Representation Testing Results Conclusion Future work
  3. 3. Background / Brief Based on a project by Google/DeepMind Build an app to capture gameplay data ◦ Users play Atari games on a mobile device ◦ We capture the data (somehow) Use the data in machine learning ◦ Reduce the costly nature of Reinforcement Learning
  4. 4. DeepMind Bought by Google for £400 million “Playing Atari with Deep Reinforcement Learning” (2013) General Agent ◦ No prior knowledge of the environment ◦ Inputs (States) and Outputs (Actions) ◦ Learn Policies ◦ Mapping States and Actions Deep Reinforcement Learning Deep Q Networks (DQN) 2015 Paper Release (with source code in Lua)
  5. 5. Motivation Starts the Q-Learning Sample Code ◦ Deep Reinforcement Learning (Q-Learning) ◦ Links to Deepmind (Mnih et al. 2013) Costly nature of Reinforcement Learning ◦ Trial and Error Approach ◦ Issues with long term goals ◦ Makes lots of mistakes ◦ Celiberto et al. (2010) states... “this technique is not efficient enough to be used in applications with real world demands due to the time that the agent needs to learn”
  6. 6. Background Q-Learning (RL) ◦ Learn the optimal policy: which action to take at each state ◦ Represented as $Q(s, a)$ Functioning: Watkins and Dayan (1992) state that the ◦ system observes its current state $x_n$ ◦ selects and performs an action $a_n$ ◦ observes the subsequent state $y_n$ and receives the reward $r_n$ ◦ updates the $Q_n(s, a)$ value for $s = x_n$, $a = a_n$ using ◦ a learning rate $\alpha_n$ ◦ a discount factor $\gamma$ $$Q_n(s, a) = (1 - \alpha_n)\, Q_{n-1}(s, a) + \alpha_n \left[ r_n + \gamma \max_{a'} Q_{n-1}(y_n, a') \right]$$
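The update rule on this slide can be sketched as a small tabular routine. This is a minimal illustration rather than the code used in the project: the dictionary-based table, the state and action labels, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the table `q` in place and return the new value."""
    # Best value obtainable from the subsequent state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    # Blend the old estimate with the new target, weighted by the learning rate.
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (reward + gamma * best_next)
    return q[(state, action)]


# Hypothetical states and actions, used only to exercise the function.
q = defaultdict(float)
actions = ["left", "right", "fire"]
q[("s1", "fire")] = 2.0
new_value = q_update(q, "s0", "left", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
```

With these numbers the target is 1.0 + 0.9 * 2.0 = 2.8, and blending it with the old estimate of 0 at a learning rate of 0.5 gives 1.4.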
  7. 7. Pseudo Code Source: Mnih et al. (2013)
  8. 8. Representation of Q(s,a) Actions States Q Values
  9. 9. Other Methods Imitation Learning (IL) ◦ Applied to robotics e.g. Nemec et al. (2010), Schmidts et al. (2011) and Kunze et al. (2013) Could this be applied to the games agent? ◦ Potentially by mapping the states and the actions from observed game play ◦ Manually updating the policies Hamahata et al. (2008) state that “imitation learning consisting of a simple observation cannot give us the sophisticated skill”
  10. 10. Other Methods Combining RL and IL ◦ Kulkarni (2012, p. 4) refers to this as ‘semi-supervised learning’ ◦ Barto and Rosenstein (2004) suggest the use of a model which acts as a supervisor and an actor. Supervisor Information (Barto and Rosenstein, 2004) State Representation
  11. 11. The Plan (at this point) Reduce the costly impact of RL ◦ Use some form of critic or early reward system ◦ If no Q Value exists for that state, then check with an expert Capture Expert Data ◦ States ◦ Actions ◦ Rewards Build a model Use the model to inform the Q Learning System
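One way to read the plan above ("if no Q value exists for that state, then check with an expert") is as an action-selection rule with a fallback. The sketch below assumes a dictionary Q-table keyed by (state, action) and an expert policy supplied as a plain function; both are hypothetical stand-ins, not the project's actual interfaces.

```python
def choose_action(q_table, expert_policy, state, actions):
    """Pick the greedy action if the state has Q values, otherwise ask the expert."""
    # Collect only the Q values that have actually been learned for this state.
    known = {a: q_table[(state, a)] for a in actions if (state, a) in q_table}
    if known:
        return max(known, key=known.get)
    # No learned value for this state yet: defer to the expert model.
    return expert_policy(state)


# Hypothetical table and expert, for illustration only.
actions = ["left", "right", "fire"]
q_table = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
expert = lambda state: "fire"

seen_choice = choose_action(q_table, expert, "s0", actions)
unseen_choice = choose_action(q_table, expert, "s9", actions)
```

In a full system the expert would be the model trained on captured gameplay, and its suggestion could also be used to seed the Q-table for the unseen state.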
  12. 12. Data Capture Plan Capture Input Data Using Stella VCS based Android Solution User Actions (a logged sequence of Up, Down, Left, Right inputs) Account for SEED Variant setSeed(12345679) Replay in the Lab Extract Score & States Using ALE
  13. 13. The Big Problem We couldn’t account for the randomisation ◦ALE is based on Stella ◦ Version problems ◦Tested various approaches ◦Replayed games over Skype We could save the state..! ◦But had some problems Other problems
  14. 14. Technical Implementation Arcade Learning Environment (ALE) (Bellemare et al. 2013) ◦ General Agent Testing Environment using Atari Games ◦ Supports 50+ Games ◦ Based on the Stella VCS Atari Emulator ◦ Supports Agents in C++, Java and more... Python 2.7 (Anaconda Distribution) Theano (ML Framework written in Python) ◦ Mnih et al. (2013) ◦ Q-Learning Sample Code ◦ Korjus (2014) Linux then Windows 8, CUDA Support
  15. 15. Computational Requirements Test System ◦ Simple CNN / MLP ◦ 16,000 greyscale 28x28 images Results ◦ Significant difference with CUDA support ◦ The CNN process is very computationally costly MLP Speed Test Results CNN Speed Test Results
  16. 16. States and Actions States - Screen Data ◦Raw Screen Data ◦SDL (SDL_Surface) ◦ BMP File Actions – Controller Inputs Resulted in…. ◦Lots of Images matched to entries in a CSV File
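The pairing of saved screen images with CSV entries could be loaded along the following lines. The column names `frame` and `action`, the file names, and the in-memory sample are assumptions made for this sketch; the slides do not show the actual CSV layout.

```python
import csv
import io

# Hypothetical CSV contents: one row per saved screen image and the controller input.
SAMPLE_CSV = "frame,action\nframe_000001.bmp,LEFT\nframe_000002.bmp,FIRE\n"


def load_pairs(handle):
    """Return (image file name, action) pairs from an open CSV file handle."""
    return [(row["frame"], row["action"]) for row in csv.DictReader(handle)]


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
```

For a real capture the same function would be given an open file instead of the `io.StringIO` sample.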
  17. 17. Rewards ALE Reward Data

      void BreakoutSettings::step(const System& system) {
          // update the reward
          int x = readRam(&system, 77);
          int y = readRam(&system, 76);
          reward_t score = 1 * (x & 0x000F) + 10 * ((x & 0x00F0) >> 4) + 100 * (y & 0x000F);
          m_reward = score - m_score;
          m_score = score;

          // update terminal status
          int byte_val = readRam(&system, 57);
          if (!m_started && byte_val == 5) m_started = true;
          m_terminal = m_started && byte_val == 0;
      }
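The score arithmetic in the ALE snippet above treats each nibble of the RAM bytes as one decimal digit. The Python sketch below restates that decoding for illustration only; the function and class names are hypothetical, and the byte values in the example are invented rather than read from an emulator.

```python
def decode_breakout_score(x, y):
    """Decode the score from two RAM bytes whose nibbles each hold one decimal digit."""
    units = x & 0x0F
    tens = (x & 0xF0) >> 4
    hundreds = y & 0x0F
    return units + 10 * tens + 100 * hundreds


class RewardTracker:
    """Report the reward as the change in score between consecutive steps."""

    def __init__(self):
        self.score = 0

    def step(self, x, y):
        score = decode_breakout_score(x, y)
        reward = score - self.score
        self.score = score
        return reward


example_score = decode_breakout_score(0x42, 0x01)
tracker = RewardTracker()
first_reward = tracker.step(0x05, 0x00)
second_reward = tracker.step(0x12, 0x00)
```

Here `0x42` contributes 4 tens and 2 units and `0x01` contributes 1 hundred, giving 142. The tracker mirrors the `m_reward = score - m_score` line, so moving from a score of 5 to a score of 12 yields a reward of 7.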
  18. 18. State Representation Screen Pixels: 160 x 210 RGB If we used them as inputs... ◦ RGB: 100,800 ◦ Greyscale: 33,600 Mnih et al. (2013) use cropped 84 x 84 images ◦ Good: High Resolution, Lots of Features Present ◦ Bad: Costly when handling lots of training data The MNIST Example Set uses 28 x 28 ◦ Good: Computationally Acceptable ◦ Bad: Limited Detail The problem ◦ Unable to process large amounts of hi-res images ◦ Low-res images gave poor results
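The input counts on this slide follow from multiplying width, height and the number of colour channels. A short check of that arithmetic, with a helper name chosen only for this example:

```python
def input_count(width, height, channels=1):
    """Number of network inputs when every pixel value is fed in directly."""
    return width * height * channels


full_rgb = input_count(160, 210, channels=3)   # full Atari screen, three colour channels
full_grey = input_count(160, 210)              # full Atari screen, one channel
cropped = input_count(84, 84)                  # image size used by Mnih et al. (2013)
mnist_sized = input_count(28, 28)              # MNIST-sized image
```

The full greyscale screen has roughly 43 times as many inputs as a 28 x 28 image, which is the trade-off between detail and cost that the slide describes.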
  19. 19. Original System - Image Processing Image Resize Methods Temporal Data (Frame Merging)
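The slide names image resizing and frame merging without stating which methods were used. The sketch below shows one plausible pair, block averaging for the resize and a pixel-wise maximum for the merge; both choices are assumptions for illustration and may differ from the original system.

```python
def downsample(image, factor):
    """Shrink a 2D greyscale image by averaging non-overlapping factor x factor blocks."""
    height, width = len(image), len(image[0])
    result = []
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [image[top + dy][left + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


def merge_frames(frames):
    """Combine consecutive frames by keeping the brightest value seen at each pixel."""
    return [[max(pixels) for pixels in zip(*rows)] for rows in zip(*frames)]


small = downsample([[0, 0, 9, 9], [0, 0, 9, 9]], factor=2)
merged = merge_frames([[[0, 5], [1, 0]], [[3, 0], [0, 2]]])
```

A pixel-wise maximum keeps moving objects such as the ball visible in a single merged image, at the cost of losing the order in which the positions occurred.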
  20. 20. Original System - Training Results 28x28 Images 64x64 Images 84x84 (4,100 images) = Memory Error 7 minutes for 16,000 28x28 18 minutes for 4,000 64x64
  21. 21. Development Original Revised
  22. 22. CNN Framework Mnih et al. (2013) make use of Convolutional Neural Networks Feature extraction ◦ Can be used to reduce Dimensionality of the Domain Space ◦ Examples include ◦ Hand Writing Classification Yuan et al. (2012), Bottou et al. (1994) ◦ Face Detection Garcia and Delakis (2004) and Chen et al. (2006) A CNN as inputs for a fully connected MLP (Bergstra et al. 2010).
  23. 23. Convolutional Neural Networks Feature Extraction Developed as a result of the work of LeCun et al. (1998) Take inspiration from the visual processes of cats and monkeys (Hubel and Wiesel 1962, 1968) Can accommodate changes in Scale, Rotation, Stroke Width, etc. Can handle Noise See:
  24. 24. Convolution of an Image Example Kernel (3 x 3, rows separated by semicolons): [0 0 0; 0 1 0; 0 0 0] Source:
  25. 25. Other Examples (3 x 3 kernels, rows separated by semicolons) [0 -1 0; -1 5 -1; 0 -1 0] [0 1 0; 1 -4 1; 0 1 0] [1 0 -1; 0 0 0; -1 0 1] [-1 -1 -1; -1 8 -1; -1 -1 -1] Source:
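Applying a kernel such as those on the two slides above means sliding it over the image and summing the element-wise products at each position. The sketch below is a plain-Python version with no padding (so the output is smaller than the input); the test image is invented for the example. It does not flip the kernel, which makes no difference here because these kernels are unchanged by a 180 degree rotation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    k_height, k_width = len(kernel), len(kernel[0])
    out_height = len(image) - k_height + 1
    out_width = len(image[0]) - k_width + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_height) for j in range(k_width))
            for c in range(out_width)
        ]
        for r in range(out_height)
    ]


IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]

# Invented 4 x 4 image with a single bright pixel.
image = [
    [1, 1, 1, 1],
    [1, 5, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

unchanged = convolve2d(image, IDENTITY)
sharpened = convolve2d(image, SHARPEN)
```

The first kernel returns the interior of the image untouched, while the second exaggerates the bright pixel relative to its neighbours, which is why it is commonly used for sharpening.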
  26. 26. CNN Feature Extraction Single Convolutional Layer ◦ From Full Resolution Images (160 x 210 RGB) 1,939 Inputs 130 Inputs
  27. 27. CNN Feature Extraction Binary Conversion ◦ Accurate State Representation Lower Computational Costs ◦ Single Convolution Layer (15 seconds for 2,391 images / 11.7 seconds for 1,790) ◦ Reduced number of inputs for the MLP ◦ More Manageable
  28. 28. Problems & Limitations Binary Conversion was too severe (Breakout) Feature removed by binary conversion as shown above Seaquest could not differentiate between the enemy and the goals
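The binary conversion and the information loss described above can be illustrated with a simple threshold. The threshold value and pixel intensities below are invented for the example; the slides do not state how the conversion was actually performed.

```python
def to_binary(image, threshold):
    """Map every pixel to 1 if it is brighter than `threshold`, otherwise 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


# Hypothetical row: background (0), one object (90) and a different object (200).
row = [[0, 90, 200]]
binary_row = to_binary(row, threshold=50)
```

Both objects end up as 1, so anything that was only distinguishable by intensity or colour, such as enemies and goals in Seaquest, becomes identical after the conversion.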
  29. 29. New System Training Results Test Configuration Results Lowest Error Rate: 32.50%
  30. 30. Evidence of Learning MLP New System
  31. 31. More Testing
  32. 32. Conclusion Large amounts of data CNN as a Preprocessor... ◦ Reduced Computational Costs ◦ Allowed for good state representation ◦ Reduced dimensionality for the MLP Old System ◦ No evidence of learning New System ◦ Evidence of the system learning ◦ Needs to be implemented as an agent to test real-world effectiveness
  33. 33. What would I do differently? Better Evaluation Methodology ◦ What was the frequency/distribution of controls? ◦ Was the system better at different games or controls? Went too far with the image conversion...
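The question about the frequency and distribution of controls can be answered directly from the captured action log. A minimal sketch, assuming the actions are available as a list of labels (the labels here are made up):

```python
from collections import Counter


def action_distribution(actions):
    """Return the share of each control input in a recorded action sequence."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}


# Hypothetical recorded inputs.
recorded = ["left", "left", "right", "fire"]
distribution = action_distribution(recorded)
majority_share = max(distribution.values())
```

The share of the most common control is a useful baseline: a classifier that always predicts that control already achieves an error rate of one minus this share, so reported error rates are easier to interpret alongside it.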
  34. 34. Future Work 1. Data Collection Methods 2. Foundation for Q-Learning
  35. 35. Future Work 3. State Representation Step 1 Identify areas of interest Step 2 Process and Classify Area Step 3 Update State Representation
  36. 36. Future Work 4. Explore the effects of multiple Convolutional Layers 5. Build a working agent...! ? ?
  37. 37. Useful Links ALE (Visual Studio Version) Replicating the Paper “Playing Atari with Deep Reinforcement Learning” - Kristjan Korjus et al Github for the above project ALE : ALE Old Site:
  38. 38. Bibliography
      Barto, A. G. and Rosenstein, M. T. (2004), `Supervised actor-critic reinforcement learning', Handbook of Learning and Approximate Dynamic Programming 2, 359.
      Bellemare, M. G., Naddaf, Y., Veness, J. and Bowling, M. (2013), `The arcade learning environment: An evaluation platform for general agents', Journal of Artificial Intelligence Research 47, 253-279.
      Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. and Bengio, Y. (2010), Theano: a CPU and GPU math expression compiler, in `Proceedings of the Python for Scientific Computing Conference (SciPy)'. Oral Presentation.
      Celiberto, L., Matsuura, J., Lopez de Mantaras, R. and Bianchi, R. (2010), Using transfer learning to speed-up reinforcement learning: A cased-based approach, in `Robotics Symposium and Intelligent Robotic Meeting (LARS), 2010 Latin American', pp. 55-60.
      Korjus, K., Kuzovkin, I., Tampuu, A. and Pungas, T. (2014), Replicating the paper "Playing Atari with Deep Reinforcement Learning", Technical report, University of Tartu.
      Kulkarni, P. (2012), Reinforcement and systemic machine learning for decision making, John Wiley & Sons, Hoboken.
      Kunze, L., Haidu, A. and Beetz, M. (2013), Acquiring task models for imitation learning through games with a purpose, in `Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on', pp. 102-107.
      Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M. (2013), Playing Atari with deep reinforcement learning, in `NIPS Deep Learning Workshop'.
      Nemec, B., Zorko, M. and Zlajpah, L. (2010), Learning of a ball-in-a-cup playing robot, in `Robotics in Alpe-Adria-Danube Region (RAAD), 2010 IEEE 19th International Workshop on', pp. 297-301.
      Schmidts, A. M., Lee, D. and Peer, A. (2011), Imitation learning of human grasping skills from motion and force data, in `Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on', pp. 1002-1007.
      Watkins, C. J. C. H. and Dayan, P. (1992), `Technical note: Q-learning', Machine Learning 8, 279-292.
  39. 39. Thank you