Matt Feiszli at AI Frontiers: Video Understanding


I will discuss the state of the art of video understanding, particularly its research and applications at Facebook. I will focus on two active areas: multimodality and time. Video is naturally multi-modal, offering great possibility for content understanding while also opening new doors like unsupervised and weakly-supervised learning at scale. Temporal representation remains a largely open problem; while we can describe a few seconds of video, there is no natural representation for a few minutes of video. I will discuss recent progress, the importance of these problems for applications, and what we hope to achieve.

Published in: Technology

  1. VIDEO UNDERSTANDING. Matt Feiszli, Research Scientist / Manager, Facebook AI
  2. AI @FACEBOOK Research Tools Platforms Product
  3. VIDEO @FACEBOOK SEE NOTES Facebook AI Mobile Vision Video ML Integrity FRL (AR / VR)
  4. Make it Relevant. Understand: • What is this about? • What’s the language? • Who’s in it? • Where does it take place? Personalize: • Who wants to see this? • Which part(s)? Deliver: • Highest possible quality • Many possible devices • Variety of bandwidths
  5. HUMAN-LEVEL UNDERSTANDING (by watching)
  7. TIME: motion & change
  8. WHERE ARE WE NOW? o Multimodal, temporal signal o Idea: Novel tasks replace labels • Language + vision • Audio as labels for video o Aspirations vs. reality?
  9. RETRIEVAL & RANKING o Watch / no-watch: first few minutes • Should “understand” several minutes o Goal: Long-form content representation o Reality: Metadata is strongest signal. • Topic tagging • People, places, activities, brands
  10. GREAT MOMENTS o Video: Boredom punctuated by greatness • Highlight reels, summaries • Objectionable content o Can find some moments. • Highly multimodal. o Complex actions, intents are a mystery.
  11. Visuotemporal Structure
  12. Action: Doing Pushups
  13. Correspondence
  14. Temporal Structure: o Action labels have temporal structure • Pushups: two key poses, two transitions • Compare: “Baking a cake” o Current visual models tend to ignore this • Instead: correlated objects, scenes, etc.
  15. Temporal Structure: o Speech recognition • Words -> phonemes -> features • Modern models mostly learn this o Not without ambiguity, but… • … far better than actions (Source: Macquarie University, Dept. of Linguistics, “Vowel Spectra”)
  16. Towards self-supervision? o Goal: self-supervision (“free supervision”) o Examples: • Compression (e.g. autoencoders) • Neighboring image patches • Temporal ordering • Audio-visual matching
  17. Self-Supervised Learning with Audio and Video Temporal Synchronization. Bruno Korbar, Du Tran, Lorenzo Torresani. NIPS 2018
  18. Related Work: Arandjelovic & Zisserman, ICCV’17, L^3 Net
  19. Audio-Video Temporal Synchronization (AVTS)
  20. Sound Localization
  21. What is a Label (at Scale)? o Goal: rich features via extremely large label spaces o “Extremely large label space”? • Verbs + objects? • Combinations of attributes? • Natural language?
  22. Extreme Scale: Exploring the Limits of Supervised Pretraining. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten. o SOA on ImageNet1K, 85.4% Top1 accuracy • Architecture: ResNeXt101-32x48 • Data: 3.5B Images • Labels: 17K classes • Training: 300 GPUs distributed training • Supervision: Weakly supervised
  23. Extreme Scale: Learnings from Video (to be published) o Transfer learning from 100M videos? • Already setting new SOA on Kinetics, Epic Kitchens, etc. o Temporal models? o Labels? • Size of label space • Objects, actions, etc.
  24. Is a toy car a car?
  25. Thank you!
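
As a rough illustration of the audio-visual matching idea from slides 16 to 19, the sketch below builds in-sync and out-of-sync (video, audio) clip pairs from a single video and scores them with a margin-based contrastive loss. It is not the authors' implementation: the function names `make_sync_pairs` and `contrastive_loss`, the fixed `shift`, and the `margin` value are assumptions made for illustration, and the embeddings would in practice come from learned video and audio networks that are not shown here.

```python
import math
import random


def make_sync_pairs(num_clips, shift, rng):
    """Build (video_idx, audio_idx, label) pairs from one video.

    label 1: audio taken from the same time step as the video clip (in sync).
    label 0: audio taken `shift` steps away in the same video (out of sync).
    No manual annotation is needed; the labels come from the timeline itself.
    """
    pairs = []
    for t in range(num_clips):
        pairs.append((t, t, 1))
        other = (t + shift) % num_clips
        if other != t:
            pairs.append((t, other, 0))
    rng.shuffle(pairs)
    return pairs


def contrastive_loss(video_emb, audio_emb, label, margin=1.0):
    """Margin-based contrastive loss for one (video, audio) embedding pair.

    In-sync pairs are pulled together (squared distance).
    Out-of-sync pairs are pushed apart until they are `margin` away.
    """
    dist = math.dist(video_emb, audio_emb)
    if label == 1:
        return dist ** 2
    return max(margin - dist, 0.0) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = make_sync_pairs(num_clips=4, shift=2, rng=rng)
    # Hypothetical 2-D embeddings standing in for network outputs.
    video = {t: [float(t), 0.0] for t in range(4)}
    audio = {t: [float(t), 0.1] for t in range(4)}
    total = sum(contrastive_loss(video[v], audio[a], y) for v, a, y in pairs)
    print(f"{len(pairs)} pairs, mean loss {total / len(pairs):.4f}")
```

Shifted audio from the same video gives a harder negative than audio from an unrelated video, because the two clips share scene and speaker and differ only in timing.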