Training Drone Image Models with Grand Theft Auto

CCRi Data Scientist Monica Rajendiran's presentation from Charlottesville's 2018 #tomtomfest Machine Learning conference.

Published in: Data & Analytics

  1. 1. Training Image Models with
  2. 2. Video Learning for Analysis from Deep Embeddings: Timothy Emerick, PhD; Sue He; Alexander Polis; Monica Rajendiran
  3. 3. truck
  4. 4. truck bicyclist
  5. 5. A green truck is crossing an intersection.
  6. 6. A group of people are crossing the street.
  7. 7. ★ Machine vision models often require large amounts of labeled data to train well ★ Existing labeled datasets can be too generic, with too broad a concept space for our purposes
  9. 9. ImageNet: 14 million+ images of 21K+ class entities. YouTube-8M: 450K+ hours of 4700+ class entities. References: Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution). "ImageNet Large Scale Visual Recognition Challenge." IJCV, 2015. Sami Abu-El-Haija et al. "YouTube-8M: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016).
  11. 11. ★ Graphics have become extremely realistic over the years ★ Games are codeable, enabling complex simulations ★ Simulating in-game lets you ignore low-level tasks like movement animations and routing
  14. 14. ★ Rockstar Advanced Game Engine’s (RAGE) super realistic graphics ★ Huge modding community provides lots of customization ★ Programmatically configurable options
  17. 17. ★ Programmatically configurable options ○ Script Hook V is a library that lets you write scripts that run in-game ○ Exposes thousands of function calls
  18. 18. ★ Programmatically configurable options ○ We can generate entities of choice in-game and have them perform complex actions ○ Vehicles: driving, turning, waiting at stoplights ○ People: entering/exiting vehicles, waiting to cross the street, parking ○ Environment: weather, time of day, camera elevation, zoom
  19. 19. ★ Grand Theft Auto Dataset: ○ Video footage ○ Objects of interest per frame (vehicles and pedestrians) ○ Object location information (bounding box information) ○ Text Descriptions (e.g. a white truck is turning left)
  20. 20. CNNs ★ Extract features from the input image, distilled down to class predictions ★ Preserve spatial relationships between pixels (example classes: Bird, Airplane, Superman, Car)
  21. 21. Input Image (5x5):
            0 2 2 3 0
            2 5 3 3 7
            8 7 4 5 3
            5 4 4 0 2
            8 7 3 8 5
  22. 22. Input Image * Filter (Weights) = Feature Map
          Filter (3x3):
            1 0 1
            0 1 0
            0 0 0
          Feature Map (3x3):
             7  8  5
            12 12 15
            16 16  7
  23. 23-31. The filter moves across the input image one position at a time. At each position every covered pixel is multiplied by its filter weight (x1 or x0) and the products are summed into one feature map value, filling the map in order: 7, 8, 5, 12, 12, 15, 16, 16, 7.
  32. 32. 3 feature maps produced from 3 filters Bird Airplane Superman Car
  33. 33. Filter weights (3x3):
            -1 -1 -1
            -1  8 -1
            -1 -1 -1
  34. 34. CNNs ★ Extract features from the input image, distilled down to class predictions ★ Preserve spatial relationships between pixels (example classes: Bird, Airplane, Superman, Car)
  35. 35. ★ YOLO9000 (YOLO v2) is a real-time object detection convolutional neural network architecture ★ Redmon, Joseph and Farhadi, Ali. "YOLO9000: better, faster, stronger." arXiv (2017).
  37. 37. Diagram labels: Game Engine, Action Generation, Camera Control, Environment Control, Annotations, Text Extraction, Pedestrians/Vehicles, Camera, Environment
  39. 39. RNNs ★ Work well with sequential input (e.g. words in a sentence or a vector of numbers representing an image) ★ For a given input, incorporate a “feedback” loop: the information received and the decision made for the previous input in the sequence feed into the current step (diagram labels: Input, Neural Network, Output)
  40. 40. Vocabulary of 4 letters: h e l o. Letters could be encoded as: h [1 0 0 0], e [0 1 0 0], l [0 0 1 0], o [0 0 0 1]. Input “h”, output “e” (full sequence: h to e, e to l, l to l, l to o)
  41. 41. Input “e”, output “l”
  42. 42. Input “l”, output “l”
  43. 43. Input “l”, output “o”
  44. 44. LSTM (Long Short Term Memory) ★ A variation of RNNs ★ LSTMs use additional units of “memory” for longer connections across sequence inputs (diagram: a chain of LSTM cells labeled with the words A, car, white, driving)
  45. 45. Attention ★ Train model to focus on salient objects in the image ★ Instead of feeding features from the entire image to an RNN, just feed the salient region’s features
  46. 46. Diagram: a chain of LSTM cells labeled with the words A, car, white, driving
  47. 47. “A man in a white shirt is walking”
  48. 48. “A white service vehicle is parked”
  49. 49. Search by Text in Video (example search: “red truck”) ★ Extract captions from video and store them in an index ★ Fast video search by text query over large amounts of video
  50. 50. Search by Example in Video ★ A user-defined bounding box on a video frame ★ Query for similar objects of interest in the entirety of a video dataset, at the frame level
  52. 52. ★ GTA V allows us to create fully annotated, custom-tailored, photorealistic datasets ★ We can use this dataset to train models that are good at object detection/localization, captioning, and search by example or text for overhead video ★ Models trained on GTA data are also applicable in areas such as real-time security camera alerting and self-driving cars
  55. 55. www.ccri.com mrajendiran@ccri.com
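
The convolution animated in slides 21 to 31 can be reproduced in a few lines. What follows is a minimal sketch in plain Python, not the presenters' code. The function name `convolve2d_valid` and the uniform test patch are hypothetical, and the sketch assumes stride 1 with no padding, which is what the slide values imply. It also applies the nine values from slide 33 to a uniform patch, on the assumption that they are filter weights.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding.

    The kernel is not flipped, which matches how CNN layers are usually
    implemented (strictly speaking this is cross-correlation).
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each covered pixel by its weight and sum the products.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Input image and filter weights as shown on slides 21 and 22.
input_image = [
    [0, 2, 2, 3, 0],
    [2, 5, 3, 3, 7],
    [8, 7, 4, 5, 3],
    [5, 4, 4, 0, 2],
    [8, 7, 3, 8, 5],
]
filter_weights = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]

feature_map = convolve2d_valid(input_image, filter_weights)
for row in feature_map:
    print(row)
# [7, 8, 5]
# [12, 12, 15]
# [16, 16, 7]

# Values from slide 33, assumed here to be filter weights.
slide_33_filter = [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1],
]
# Hypothetical uniform patch: the weights sum to zero, so the response is zero.
flat_patch = [[4] * 4 for _ in range(4)]
print(convolve2d_valid(flat_patch, slide_33_filter))  # [[0, 0], [0, 0]]
```

Because the slide 33 weights sum to zero, regions of constant brightness produce no response, while abrupt changes in pixel value do. Kernels of this shape are commonly used for edge detection.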
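
The letter example on slides 40 to 43 can also be written out. The sketch below one-hot encodes the four-letter vocabulary, builds the input/output pairs for "hello", and runs one recurrent update per letter so the feedback loop from slide 39 is visible. The hidden size of 3, the random untrained weights, the tanh update rule, and all function names are assumptions for illustration. There is no output layer or training, so it predicts nothing.

```python
import math
import random

VOCAB = ["h", "e", "l", "o"]


def one_hot(letter):
    """Encode a letter as on slide 40, e.g. 'h' -> [1, 0, 0, 0]."""
    return [1 if letter == v else 0 for v in VOCAB]


def training_pairs(word):
    """Pair each letter with the letter that follows it."""
    return list(zip(word[:-1], word[1:]))


def rnn_step(x, h_prev, w_xh, w_hh):
    """One recurrent update: h_t = tanh(W_xh x_t + W_hh h_{t-1})."""
    h_next = []
    for i in range(len(h_prev)):
        pre = sum(w_xh[i][j] * x[j] for j in range(len(x)))
        pre += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_next.append(math.tanh(pre))
    return h_next


HIDDEN_SIZE = 3  # hypothetical; the slides do not give a size
rng = random.Random(0)
# Untrained random weights, for illustration only.
w_xh = [[rng.uniform(-1, 1) for _ in VOCAB] for _ in range(HIDDEN_SIZE)]
w_hh = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(HIDDEN_SIZE)]

hidden = [0.0] * HIDDEN_SIZE
for input_letter, target_letter in training_pairs("hello"):
    # The previous hidden state is fed back in alongside the new input.
    hidden = rnn_step(one_hot(input_letter), hidden, w_xh, w_hh)
    print(input_letter, "->", target_letter, [round(v, 3) for v in hidden])
```

The two "l" inputs have identical encodings but different targets ("l", then "o"). Only the hidden state carried over from the earlier letters can tell them apart.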
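
Slide 49 describes storing generated captions in an index and searching them by text. One simple way to do that is an inverted index from words to frames, sketched below. This is an assumption about how such an index could look, not a description of the presenters' system. The clip names, frame numbers, function names, and the "red truck" caption are hypothetical; the other captions are taken from the slides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_caption_index(captions):
    """Map each caption word to the set of (video_id, frame) pairs using it."""
    index = defaultdict(set)
    for video_id, frame, text in captions:
        for word in tokenize(text):
            index[word].add((video_id, frame))
    return index


def search(index, query):
    """Return frames whose caption contains every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical clip names and frame numbers.
captions = [
    ("clip_a", 10, "A green truck is crossing an intersection."),
    ("clip_a", 45, "A group of people are crossing the street."),
    ("clip_b", 7, "A man in a white shirt is walking"),
    ("clip_b", 90, "A white service vehicle is parked"),
    ("clip_c", 3, "A red truck is turning left"),  # hypothetical caption
]

index = build_caption_index(captions)
print(search(index, "red truck"))  # [('clip_c', 3)]
print(search(index, "white"))      # [('clip_b', 7), ('clip_b', 90)]
```

Lookups touch only the entries for the query words, so search time depends on the query length and the number of matching frames, not on the total amount of video.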
