
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond


This talk will present some recent advances in video understanding at Google. It will cover the technology behind progress in applications such as large-scale video annotation for YouTube, video summarization and Motion Stills, as well as our research in weakly-supervised learning, domain adaptation from YouTube to Google Photos and action recognition. I will also give my perspective on promising directions for future research in video.


  1. Large-Scale Video Understanding: YouTube and Beyond. Rahul Sukthankar, Machine Perception, Google Research (https://research.google.com/teams/perception/). AI Frontiers Conference, Nov. 3, 2017
  2. Machine Perception Really Works! (better than I expected)
  3. Sample of Perception tech in products: signals for Image Search ranking, related images, search-by-image, etc.
  4. Sample of Perception tech in products: Cloud Video API, Cloud Vision API
  5. Sample of Perception tech in products: HDR+ in Android Camera (photo: Seth LaForge, Nexus 5X), Mobile Vision API
  6. Sample of Perception tech in products: organizing image & video collections in Photos and making them searchable by content; microvideo tech in Photos & Motion Stills; de-reflection & tracking in Photo Scanner
  7. Sample of Perception tech in products: personalized sticker packs in Allo; on-device handwriting input & recognition; OCR for many languages
  8. Sample of Perception tech in products: visual & auditory annotation & signals on YouTube; thumbnail/preview selection & optimization for YouTube; non-speech sound captions on YouTube
  9. Sample of Perception tech in products: region tracking for the custom blurring tool on YouTube; mobile creative effects on YouTube
  10. Useful Applications for Video Technology: capture a moment; improve & manipulate; watch, listen, understand. Help users create, enhance, organize, and discover videos.
  11. Privacy: Region Tracking & Blurring for YouTube
  12. Fun Effects from Tracking (on Mobile) for YouTube
  13. Large-Scale Video Annotation for YouTube
  14. Large-Scale Video Annotation for YouTube. Video understanding pipeline as of ~5 years ago: pixels & sound samples → extract features (hand-designed descriptors) → frame features → quantize & aggregate (codebook, histogram) → video features → train model (e.g., AdaBoost) on training data → label such as “Roller-blading”. A minimal sketch of this classic pipeline follows.
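For concreteness, here is a minimal sketch of that classic pipeline in Python with scikit-learn, using a k-means codebook and an AdaBoost classifier. The descriptor dimensions, codebook size, and the random training data are placeholder assumptions; only the pipeline shape (extract → quantize & aggregate → train) comes from the slide.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import AdaBoostClassifier

    def video_histogram(frame_descriptors, codebook):
        """Quantize each frame descriptor to its nearest codeword, then
        aggregate the assignments into a normalized histogram."""
        words = codebook.predict(frame_descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Stand-in training data: per-video matrices of hand-designed frame
    # descriptors (n_frames x 128) plus binary labels (e.g. "roller-blading").
    train_descs = [np.random.rand(50, 128) for _ in range(20)]
    labels = np.random.randint(0, 2, size=20)

    # 1) Learn a codebook over all training-frame descriptors.
    codebook = KMeans(n_clusters=64, n_init=4).fit(np.vstack(train_descs))
    # 2) Quantize & aggregate each video into a fixed-length feature.
    X = np.array([video_histogram(d, codebook) for d in train_descs])
    # 3) Train the per-label model (e.g., AdaBoost, as on the slide).
    clf = AdaBoostClassifier(n_estimators=50).fit(X, labels)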
  15. Large-Scale Video Annotation for YouTube. Modern video understanding pipeline: pixels & sound samples → a magic box containing many convolutional, deep, end-to-end buzzwords :-) → “Roller-blading”, trained end to end on training data.
  16. Deep-learned visual features: an Inception model trained on noisy image data provides a bottleneck embedding layer (1000-d); videos with noisy labels are then represented by aggregating frame-level embeddings into video-level features via max pooling, average pooling, or VLAD pooling (sketched below).
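The frame-to-video aggregation can be as simple as the pooling operators named on the slide. Below is a minimal NumPy sketch; the 1000-d embedding size matches the slide's bottleneck layer, while the number of VLAD cluster centers (8) and the random inputs are assumptions.

    import numpy as np

    def max_pool(frames):            # frames: (n_frames, d) embeddings
        return frames.max(axis=0)

    def avg_pool(frames):
        return frames.mean(axis=0)

    def vlad_pool(frames, centers):
        """Simplified VLAD: sum the residuals of each frame embedding from
        its nearest cluster center, then L2-normalize the concatenation."""
        d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)               # nearest center per frame
        vlad = np.zeros_like(centers)
        for i, k in enumerate(assign):
            vlad[k] += frames[i] - centers[k]
        vlad = vlad.ravel()
        return vlad / (np.linalg.norm(vlad) + 1e-12)

    frames = np.random.rand(120, 1000)   # 120 frames of 1000-d bottleneck embeddings
    centers = np.random.rand(8, 1000)    # codebook learned offline (e.g., k-means)
    video_feature = vlad_pool(frames, centers)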
  17. Deep-learned vs. handcrafted features: deep-learned visual features with VLAD coding (1024-d) reach 0.272 mAP, versus handcrafted audio-visual features (~40K-d) at 0.153 mAP. That is +80% mean average precision with 40x more compact features. (A sketch of the mAP metric follows.)
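The comparison is in mean average precision (mAP). For reference, a minimal sketch of the metric over multi-label scores, assuming binary ground-truth label matrices (the talk does not spell out the exact evaluation protocol):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_average_precision(y_true, y_score):
        """Mean of per-label average precision over a (videos x labels)
        score matrix, skipping labels with no positive ground truth."""
        aps = [average_precision_score(y_true[:, j], y_score[:, j])
               for j in range(y_true.shape[1]) if y_true[:, j].any()]
        return float(np.mean(aps))

    # Toy usage with random stand-in data (100 videos, 5 labels).
    y_true = np.random.randint(0, 2, size=(100, 5))
    y_score = np.random.rand(100, 5)
    print(mean_average_precision(y_true, y_score))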
  18. Personal video search in Google Photos: lots of videos, almost no metadata
  19. “Dancing” on the web
  20. “Dancing” in home videos
  21. Domain adaptation: finding home-video-like videos on YouTube by contrasting capture device, video frame rate, and video orientation (an illustrative filter follows).
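The slide lists metadata cues for separating home-style footage from professional content. A purely illustrative filter in that spirit; every field name and threshold below is a hypothetical stand-in, not the actual production logic:

    def looks_like_home_video(meta):
        """Illustrative metadata filter mirroring the slide's cues:
        capture device, frame rate, and orientation. All field names and
        thresholds are hypothetical stand-ins."""
        phone_capture = meta.get("capture_device") == "mobile"
        consumer_fps = meta.get("frame_rate_fps", 30) <= 30   # vs. high-fps pro rigs
        portrait = meta.get("height_px", 0) > meta.get("width_px", 1)
        return phone_capture and (consumer_fps or portrait)

    # e.g. looks_like_home_video({"capture_device": "mobile",
    #                             "frame_rate_fps": 30,
    #                             "width_px": 720, "height_px": 1280})  -> True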
  22.–26. The technology behind personal video search, built up in five stages: (1) an image/photo annotation model trained on web images; (2) a YouTube frame annotation model trained on video thumbnails (a domain-adapted frame-level vision model); (3) a YouTube video annotation model trained on YouTube videos (a domain-adapted video-level vision model); (4) a YouTube audio annotation model trained on YouTube videos (a domain-adapted audio model); and (5) fusion & calibration trained on home videos (a domain-adapted personal video model), yielding labels such as “toddler dancing”. A sketch of one plausible fusion stage follows.
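The talk does not detail how stage (5) combines the upstream models; a common pattern for late fusion & calibration is a per-label logistic regressor over the per-model scores, trained on the target domain (here, home videos). A sketch under that assumption, with random stand-in data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical per-video scores for one label from the four upstream
    # models: photo-on-frames, frame-level vision, video-level vision, audio.
    rng = np.random.default_rng(0)
    scores = rng.random((500, 4))           # 500 home videos x 4 model scores
    labels = rng.integers(0, 2, 500)        # ground truth for this label

    # Per-label late fusion & calibration: a logistic regressor learns how
    # much to trust each model and maps the combination to a probability.
    fuser = LogisticRegression().fit(scores, labels)
    calibrated = fuser.predict_proba(scores)[:, 1]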
  27.–32. Evolution of personal video annotation models, in four steps: (1) photo annotation model applied on video frames; (2) domain adaptation + fusion across frames; (3) fusion across multiple vision models; (4) fusion across multiple audio-visual models. Net effect: > 2x recall gain.
  33. Learning aesthetics: YouTube Thumbnails
  34. Learning aesthetics: the YouTube thumbnail quality model (a selection sketch follows)
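One natural way to use such a quality model is to score sampled frames and keep the argmax, as the Google Research Blog post linked below describes at a high level. A sketch, where `quality_model` is a hypothetical callable standing in for the learned network and the sampling stride is an assumption:

    import numpy as np

    def pick_thumbnail(frames, quality_model, stride=30):
        """Score every `stride`-th frame with a learned quality model and
        return the best one. `quality_model` maps a frame to a scalar
        aesthetic/quality score."""
        candidates = frames[::stride]
        scores = np.array([quality_model(f) for f in candidates])
        best = int(scores.argmax())
        return candidates[best], float(scores[best])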
  35. Learning aesthetics: YouTube Thumbnails
  36. Learning aesthetics: YouTube Thumbnails. See “Improving YouTube video thumbnails with deep neural nets”, Google Research Blog, Oct. 2015.
  37. Video retargeting (spatial): original video, reframed for a banner aspect ratio
  38. Video retargeting (temporal): video preview (duration: 6 secs)
  39. Motion Stabilization
  40. Motion Stills app: Stream and One-Up views
  41. Motion Stills examples: cinemagraphs
  42. Motion Stills examples: gifs / memes
  43. Motion Stills examples: timelapse
  44. Promising Directions for Future Research: Learning from Video
  45. Self-Supervised Imitation. Pierre Sermanet*, Corey Lynch*, Yevgen Chebotar*, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine. Google Brain + University of Southern California (* equal contribution)
  46. Multi-view capture
  47. Time-Contrastive Networks (TCN) (figure source: [Rippel et al 2015]). arxiv.org/abs/1704.06888v2, sermanet.github.io/imitate. A sketch of the time-contrastive triplet objective follows.
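The core of TCN is a time-contrastive triplet objective: frames captured at the same moment from different views should embed close together, while temporally distant frames from the same view should be pushed apart. A minimal NumPy sketch; the margin value is an assumption, and the paper trains this through a deep embedding network rather than on raw vectors:

    import numpy as np

    def tcn_triplet_loss(anchor, positive, negative, margin=0.2):
        """Time-contrastive triplet loss: `anchor` and `positive` embed the
        same moment seen from two different views; `negative` embeds a
        temporally distant frame from the anchor's own view."""
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        return max(0.0, d_pos - d_neg + margin)

    # Toy usage on random 32-d embeddings.
    a, p, n = (np.random.rand(32) for _ in range(3))
    print(tcn_triplet_loss(a, p, n))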
  48. Approach (pouring, real). *RL method used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning, Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 2017]
  49. Resulting policies
  50. Pose imitation (real robot)
  51. Useful Datasets for Video Understanding
      ● Large-scale video annotation
        ○ Sports-1M: >1M videos from ~500 classes [with Stanford]
        ○ YouTube-8M: ~8M videos from ~4800 classes
      ● Action recognition in video
        ○ THUMOS: temporal localization in untrimmed videos [with UCF, INRIA]
        ○ Kinetics: 400+ short clips per class for 400 actions [with DeepMind]
        ○ AVA: spatially localized atomic actions [with Berkeley, INRIA]
      ● Object recognition
        ○ YouTube-BB: spatially localized objects in video (80 classes)
        ○ Open Images: spatially localized objects in images (600 classes)
  52. Sports-1M: 1.1M videos from 487 sports classes (video classification)
  53. YouTube-8M Video Research Dataset: research.google.com/youtube8m/
  54. THUMOS Challenge Series: temporal localization in untrimmed videos
  55. YouTube Bounding Boxes: spatial localization of one object through time
  56. AVA: spatial localization of an actor performing atomic actions (example atomic action: “paint”)
  57.–58. Open Images v3: detailed spatial annotations in images (example validation images)
  59. Conclusion
      ● Significant progress in large-scale video annotation for YouTube
      ● Video understanding has many applications beyond YouTube
      ● We encourage others to work on video through public datasets
      ● Many exciting research problems ahead, particularly in learning from video (“I think there’s a lot more progress to be made in video understanding”)
