Training Drone Image Models with Grand Theft Auto

1. Training Image Models with

2. Video Learning for Analysis from Deep Embeddings. Timothy Emerick, PhD; Sue He; Alexander Polis; Monica Rajendiran

4. truck

5. truck bicyclist

6. A green truck is crossing an intersection.

7. A group of people are crossing the street.

8. ★ Machine vision models often require large amounts of labeled data to train well
★ Existing labeled datasets can be too generic, with a concept space too broad for our purposes

10. ImageNet: 14 million+ images of 21K+ class entities
YouTube-8M: 450K+ hours of video covering 4700+ class entities
References:
Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution). "ImageNet Large Scale Visual Recognition Challenge." IJCV, 2015.
Abu-El-Haija, Sami, et al. "YouTube-8M: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016).

12. ★ Game graphics have become extremely realistic over the years
★ Games are programmable, enabling complex simulations
★ Simulating in-game lets you leave low-level tasks such as movement animations and routing to the game engine

15. ★ The Rockstar Advanced Game Engine (RAGE) produces highly realistic graphics
★ A huge modding community provides lots of customization
★ Programmatically configurable options

18. ★ Programmatically configurable options
  ○ Script Hook V is a library that lets you write scripts that run in-game
  ○ Thousands of function calls are available

19. ★ Programmatically configurable options
  ○ We can generate entities of choice in-game and have them perform complex actions
  ○ Vehicles: driving, turning, waiting at stoplights
  ○ People: entering/exiting vehicles, waiting to cross the street, parking
  ○ Environment: weather, time of day, camera elevation, zoom
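The configurable options listed above can be thought of as a scenario description that is sampled before each recording. The sketch below is only an illustration of that idea in Python: the `Scenario` class, the option lists, and the numeric ranges are all hypothetical and are not the Script Hook V API or the authors' actual configuration.

```python
import random
from dataclasses import dataclass

# Hypothetical option lists; the values actually exposed by the game and its
# scripting library are not given in the slides.
WEATHER = ["clear", "rain", "fog"]
VEHICLE_ACTIONS = ["driving", "turning", "waiting at stoplight"]
PEDESTRIAN_ACTIONS = ["entering vehicle", "exiting vehicle", "waiting to cross"]


@dataclass
class Scenario:
    weather: str
    hour_of_day: int            # 0-23
    camera_elevation_m: float   # assumed unit: metres above ground
    camera_zoom: float
    vehicle_action: str
    pedestrian_action: str


def sample_scenario(rng: random.Random) -> Scenario:
    """Draw one random scenario; the ranges are illustrative assumptions."""
    return Scenario(
        weather=rng.choice(WEATHER),
        hour_of_day=rng.randrange(24),
        camera_elevation_m=rng.uniform(20.0, 120.0),
        camera_zoom=rng.uniform(1.0, 4.0),
        vehicle_action=rng.choice(VEHICLE_ACTIONS),
        pedestrian_action=rng.choice(PEDESTRIAN_ACTIONS),
    )


# Seeding each generator makes every scenario reproducible.
scenarios = [sample_scenario(random.Random(seed)) for seed in range(3)]
```

Seeding per scenario is one possible design choice: it lets a given clip be regenerated later with identical settings.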

20. ★ Grand Theft Auto Dataset:
  ○ Video footage
  ○ Objects of interest per frame (vehicles and pedestrians)
  ○ Object location information (bounding boxes)
  ○ Text descriptions (e.g. "a white truck is turning left")
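One way to picture a single record of such a dataset is a per-frame annotation holding the objects, their bounding boxes, and a caption. The following is a minimal sketch; the class names, field names, and pixel-coordinate convention are assumptions made for illustration, not the authors' actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectAnnotation:
    label: str   # e.g. "truck" or "pedestrian"
    x_min: int   # bounding box in pixel coordinates (assumed convention)
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        """Box area in pixels; zero if the box is degenerate."""
        width = max(0, self.x_max - self.x_min)
        height = max(0, self.y_max - self.y_min)
        return width * height


@dataclass
class FrameAnnotation:
    video_id: str      # hypothetical identifier for the source clip
    frame_index: int
    objects: List[ObjectAnnotation] = field(default_factory=list)
    caption: str = ""


frame = FrameAnnotation(
    video_id="example_clip",
    frame_index=120,
    objects=[ObjectAnnotation("truck", 40, 60, 140, 110)],
    caption="a white truck is turning left",
)
```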

21. CNNs
★ Extract features from the input image and distill them down to class predictions
★ Preserve the spatial relationship between pixels
(Example output classes: Bird, Airplane, Superman, Car)

22-32. Sliding a 3×3 filter over a 5×5 input image to build a 3×3 feature map, one output value per slide.

Input image:
    0 2 2 3 0
    2 5 3 3 7
    8 7 4 5 3
    5 4 4 0 2
    8 7 3 8 5

Filter (weights):
    1 0 1
    0 1 0
    0 0 0

Feature map:
     7  8  5
    12 12 15
    16 16  7

At each position, the overlapping pixels are multiplied by the filter weights (x1 x0 x1 / x0 x1 x0 / x0 x0 x0) and the products are summed. For example, the top-left value is 0·1 + 2·1 + 5·1 = 7.
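The sliding-filter walkthrough on these slides can be reproduced in a few lines of Python. This is a minimal sketch using the input image and filter from the slides; the function name `slide_filter` is made up here. Note that, as on the slides, the filter is applied without flipping it (strictly a cross-correlation, which is what CNN layers usually compute).

```python
def slide_filter(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply overlapping pixels by the weights and sum the products.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [0, 2, 2, 3, 0],
    [2, 5, 3, 3, 7],
    [8, 7, 4, 5, 3],
    [5, 4, 4, 0, 2],
    [8, 7, 3, 8, 5],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]

feature_map = slide_filter(image, kernel)
for row in feature_map:
    print(row)
# [7, 8, 5]
# [12, 12, 15]
# [16, 16, 7]
```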

33. 3 feature maps produced from 3 filters (output classes: Bird, Airplane, Superman, Car)

34. Filter:
    -1 -1 -1
    -1  8 -1
    -1 -1 -1
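The weights of this filter sum to zero, so it gives no response on flat regions and a strong response where brightness changes, which is why kernels of this shape are commonly used for edge detection. A small sketch, assuming a made-up 5×5 test image with a dark left side and a bright right side:

```python
def slide_filter(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(len(image[0]) - k + 1)
        ]
        for r in range(len(image) - k + 1)
    ]


edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

# Hypothetical test image: a vertical edge between columns 1 and 2.
edge_image = [[0, 0, 10, 10, 10] for _ in range(5)]
flat_image = [[10] * 5 for _ in range(5)]

edge_response = slide_filter(edge_image, edge_kernel)
flat_response = slide_filter(flat_image, edge_kernel)

print(edge_response[0])  # [-30, 30, 0]
print(flat_response[0])  # [0, 0, 0]
```

The response flips sign across the edge and drops to zero once the window lies entirely inside the bright region.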

36. ★ YOLO9000 (YOLO v2) is a real-time object detection convolutional neural network architecture
★ Redmon, Joseph and Farhadi, Ali. "YOLO9000: better, faster, stronger." arXiv (2017).

39. (Pipeline diagram; labels: Game Engine; Action Generation, Camera Control, Environment Control; Annotations, Text Extraction; Pedestrians/Vehicles, Camera, Environment)

42. RNNs
★ Work well with sequential input (e.g. words in a sentence, or a vector of numbers representing an image)
★ For each input, incorporate a “feedback” loop: the information received and the decision made for the previous input in the sequence feed into the current step
(Diagram labels: Input, Neural Network, Output)
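The feedback loop is commonly written as h_t = tanh(W_xh · x_t + W_hh · h_(t-1) + b), where h_(t-1) is the state carried over from the previous input. The sketch below implements that standard "vanilla" RNN update in plain Python. The slides do not specify the exact cell used, and the weights here are arbitrary illustrative values, not trained ones.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One vanilla RNN update: h_t = tanh(W_xh x + W_hh h_prev + b)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_xh, w_hh, b):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states


# Arbitrary illustrative weights (2 inputs, 2 hidden units).
W_XH = [[0.5, -0.3], [0.1, 0.8]]
W_HH = [[0.2, 0.4], [-0.6, 0.3]]
BIAS = [0.0, 0.0]

A, B = [1.0, 0.0], [0.0, 1.0]
states_ab = run_rnn([A, B], W_XH, W_HH, BIAS)
states_ba = run_rnn([B, A], W_XH, W_HH, BIAS)
```

Because the hidden state is carried forward, the same two inputs presented in a different order leave the network in a different final state.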

43. Vocabulary of 4 letters: h e l o
Letters could be encoded as:
    h [1 0 0 0]
    e [0 1 0 0]
    l [0 0 1 0]
    o [0 0 0 1]
Input “h” → output “e”

44. Input “e” → output “l”

45. Input “l” → output “l”

46. Input “l” → output “o”
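The letter encoding and the input/output pairs on these slides can be generated programmatically. This is a small sketch that assumes the training word is "hello", which is what the four letter pairs on the slides suggest; the function names are made up for illustration.

```python
VOCAB = ["h", "e", "l", "o"]


def one_hot(letter):
    """Encode a letter as a vector with a single 1 at its vocabulary index."""
    return [1 if letter == v else 0 for v in VOCAB]


def next_letter_pairs(word):
    """Pair each letter with the letter that follows it."""
    return list(zip(word[:-1], word[1:]))


pairs = next_letter_pairs("hello")
encoded = [(one_hot(a), one_hot(b)) for a, b in pairs]

for (a, b), (x, y) in zip(pairs, encoded):
    print(f"input {a} {x} -> target {b} {y}")
# input h [1, 0, 0, 0] -> target e [0, 1, 0, 0]
# input e [0, 1, 0, 0] -> target l [0, 0, 1, 0]
# input l [0, 0, 1, 0] -> target l [0, 0, 1, 0]
# input l [0, 0, 1, 0] -> target o [0, 0, 0, 1]
```

The two "l" inputs are identical vectors but have different targets, which is why the network needs the state carried over from earlier letters to tell them apart.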

47. LSTM
★ A variation of RNNs (Long Short-Term Memory)
★ LSTMs use additional units of “memory” to maintain longer connections across sequence inputs
(Diagram: chain of LSTM cells; word labels: A, car, white, driving)

49. Attention
★ Train the model to focus on salient objects in the image
★ Instead of feeding features from the entire image to an RNN, feed only the salient region’s features
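To illustrate the difference between whole-image features and salient-region features, the sketch below averages feature vectors either over a full feature grid or over one rectangular region of it. This is a deliberately simplified "hard crop" stand-in: the slides do not describe the actual attention mechanism, learned attention typically uses soft weights, and the grid values and region here are invented.

```python
def mean_features(feature_map, region=None):
    """Average feature vectors over the grid, or over a sub-region.

    region is (row0, col0, row1, col1) with exclusive end indices.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    r0, c0, r1, c1 = region if region is not None else (0, 0, rows, cols)
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


# Hypothetical 3x3 grid of 2-dimensional features: background everywhere,
# one salient cell in the centre.
background = [0.0, 1.0]
salient = [9.0, 0.0]
grid = [
    [background, background, background],
    [background, salient, background],
    [background, background, background],
]

whole_image = mean_features(grid)
salient_only = mean_features(grid, region=(1, 1, 2, 2))
```

Averaging over the whole grid dilutes the salient cell's signal with background, while the cropped version passes it on unchanged.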

50. (Diagram: chain of LSTM cells; word labels: A, car, white, driving)

52. “A man in a white shirt is walking”

53. “A white service vehicle is parked”

54. Search by Text in Video
★ Extract captions from video and store them in an index
★ Fast video search by text query over large amounts of video
(Example search: “red truck”)
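A simple way to picture caption search is an inverted index that maps each word to the frames whose caption contains it. The sketch below is one possible minimal implementation, not the system described in the talk; the clip names, frame numbers, and the "red truck" captions are invented for the example.

```python
from collections import defaultdict


def build_index(captions):
    """Map each caption word to the set of (video_id, frame_index) it appears in."""
    index = defaultdict(set)
    for video_id, frame_index, caption in captions:
        for word in caption.lower().split():
            index[word.strip(".,")].add((video_id, frame_index))
    return index


def search(index, query):
    """Return frames whose caption contains every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical captions extracted from two clips.
captions = [
    ("clip_a", 10, "A red truck is crossing an intersection."),
    ("clip_a", 55, "A group of people are crossing the street."),
    ("clip_b", 3, "A white service vehicle is parked"),
    ("clip_b", 40, "A red truck is parked"),
]

index = build_index(captions)
print(search(index, "red truck"))  # [('clip_a', 10), ('clip_b', 40)]
```

Building the index once up front is what makes each query fast: a lookup touches only the entries for the query words rather than every caption.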

55. Search by Example in Video
★ The user draws a bounding box on a video frame
★ Query for similar objects of interest across the entire video dataset, at the frame level
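Search by example is often implemented by comparing an embedding of the selected region against embeddings of objects in other frames. The sketch below assumes cosine similarity over small made-up embedding vectors; the slides do not state which similarity measure or embedding model is used, so treat every name and number here as illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_by_example(query_embedding, frame_objects, top_k=2):
    """Rank stored objects by similarity to the query embedding."""
    scored = [
        (cosine_similarity(query_embedding, embedding), video_id, frame_index)
        for video_id, frame_index, embedding in frame_objects
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings for objects detected in earlier frames.
frame_objects = [
    ("clip_a", 10, [0.9, 0.1, 0.0]),
    ("clip_a", 55, [0.0, 0.2, 0.9]),
    ("clip_b", 40, [0.8, 0.2, 0.1]),
]

# Hypothetical embedding of the region inside the user's bounding box.
query = [1.0, 0.0, 0.0]
results = search_by_example(query, frame_objects)
matches = [(video_id, frame_index) for _, video_id, frame_index in results]
```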

57. ★ GTA V allows us to create fully annotated, custom-tailored, photorealistic datasets
★ We can use this dataset to train models for object detection/localization, captioning, and search by example or by text in overhead video
★ Models trained on GTA data are also applicable in areas such as real-time security camera alerting and self-driving cars

60. www.ccri.com mrajendiran@ccri.com
