Matt Feiszli at AI Frontiers: Video Understanding

AI Frontiers
Nov. 13, 2018

Editor's Notes

  1. Intro: talk about AI@FB – the who and the why of video.
  2. @FB, AI means full stack: research all the way to production. It depends on tools (PyTorch has seen great adoption and we're standardizing on it internally). The AI org provides platforms, workflows, models, and infra. All of this serves product, and there is an increasing amount of AI distributed through the various product organizations and verticals.
  3. Just a few examples of teams (some you might not know about); can't mention them all. Facebook AI Research (FAIR): research, tools, platforms. Video MPK and video NYC, along with folks from FAIR. But lots of the same work is done by the verticals, e.g. mobile: pose, effects, etc. Not on the slide, but Portal: pose tracking for the AI cameraman, a well-reviewed videoconferencing feature using state-of-the-art pose tracking algorithms. Video ML: focused on video product features and various integrity use cases (organized a CVPR workshop). AR/VR: awesome work. Note: didn't even mention Feed, Ads, etc.
  4. A video is uploaded – this happens tens of millions of times per day. [bullets] Doing a good job here essentially requires human-level understanding. So let's move to the science.
  5. What's the promise of video? Learn everything by watching. Physics, language, planning, causality, intent… it's all there. Not convincing? There are a ton of instructional and educational videos – language tutorials, etc. Make it easier. Why is video better?
  6. Modalities: both correlated and complementary. E.g. topic tagging. Amateur sports video: visual. Pro sports broadcast: visual and speech are both powerful, and complementary. Playing an instrument: both correlated. On the other hand, consider news: not visual, but speech or OCR tells you: cooking, news, medical, etc. Also accessibility: read a menu, signs, etc. Self-supervision: labeling video is really, really expensive and slow; new tasks reduce the need for labels.
  7. Time – video evolves. Consequences, planning, interactions. Reasonable goal: can we learn sports by watching? Understand player intent, predict actions? Prediction tasks: future as supervision for past.
  8. Search, recommendations, chaining. Consider topic tagging: it requires multimodal understanding. Recall menus, signs, the news crawl, the sports/stock ticker. Problems: summarization, salience, and common-sense world understanding.
  9. Video is boredom with occasional moments of greatness (or really, significant change). Compare: photos are self-selected for greatness – if you see one at all, it's probably good. Some moments are reasonably learnable (visually, by listening, etc.). Hard: people interactions, tense moments; sentiment is not well-developed (listening for a loud soundtrack). Hard: most successful ML relies on large, labeled datasets, and you need a query or a person to ground it. E.g. training highlight models: the trickiest part is assembling the dataset, and the model learns clickbait. An ethical warning for us all. FB thinks about bias in ML models; we have ethical and operational guidelines when building datasets, conducting research, and building our products. This affects design decisions daily.
  10. Let's look at a typical visual task: action recognition. Each point on the curve is one frame: ImageNet features at 1 FPS on UCF-101 action recognition, embedded in 2D using t-SNE, with color representing velocity (rate of change); see the sketch below. So what's happening? Periodic motion: two slow / stationary repeated points, with faster transitions between them. A microcosm, but like longer video in many ways: repeated sections, similar sequences – car chases, soccer set pieces like corner kicks, etc. Self-similarity.
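A minimal sketch of how a visualization like this might be produced: extract ImageNet-pretrained features for frames sampled at roughly 1 FPS, project them to 2D with t-SNE, and color each point by how quickly the embedding is changing. The backbone choice (ResNet-50), the preprocessing, and the t-SNE settings are assumptions for illustration, not the exact pipeline from the talk.

```python
# Hypothetical sketch: embed per-frame ImageNet features in 2D with t-SNE and
# color by "velocity" (rate of change between consecutive frame embeddings).
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# ImageNet-pretrained backbone used as a frozen feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the classifier head, keep 2048-d features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    """frames: list of HxWx3 uint8 arrays sampled at ~1 FPS."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch).numpy()          # (T, 2048)

def tsne_plot(frames):
    feats = frame_features(frames)
    # "Velocity": distance between consecutive frame embeddings.
    vel = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    vel = np.concatenate([[vel[0]], vel])    # pad so every point gets a color
    xy = TSNE(n_components=2, perplexity=10).fit_transform(feats)
    plt.scatter(xy[:, 0], xy[:, 1], c=vel, cmap="viridis")
    plt.colorbar(label="embedding change per second")
    plt.show()
```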
  11. A man doing pushups, from the public UCF-101 dataset.
  12. Compression is very different – it requires a metric. The last three construct binary classification problems: do patches match? Are the frames in the right order? Do audio and video match?
  13. AVC uses a 2D ConvNet for the visual stream, so it has no temporal context and no motion modeling, and it trains only on easy negative examples – the negative sound is selected from a different video (see the sketch below). What does AVC really learn? Semantic correlation between audio and video.
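A rough sketch of what AVC-style (audio-visual correspondence) training pairs and a two-tower match/no-match classifier could look like. The dataset interface, network names, and dimensions are assumptions for illustration, not the implementation discussed in the talk.

```python
# Hypothetical sketch of AVC-style pair construction and a correspondence classifier.
import random
import torch
import torch.nn as nn

def make_avc_batch(dataset, batch_size):
    """Each example: (frames, audio, label). Positives pair a clip with its own
    audio; easy negatives pair it with audio drawn from a *different* video."""
    frames, audio, labels = [], [], []
    for _ in range(batch_size):
        i = random.randrange(len(dataset))
        v, a = dataset[i]                       # (frames, audio) for one clip
        if random.random() < 0.5:
            frames.append(v); audio.append(a); labels.append(1.0)      # matched
        else:
            j = i
            while j == i:                       # easy negative: any other video
                j = random.randrange(len(dataset))
            _, a_neg = dataset[j]
            frames.append(v); audio.append(a_neg); labels.append(0.0)  # mismatched
    return torch.stack(frames), torch.stack(audio), torch.tensor(labels)

class AVCNet(nn.Module):
    """Two-tower correspondence classifier: a 2D visual ConvNet (single frame,
    no motion modeling) plus an audio ConvNet, fused into a match score."""
    def __init__(self, visual_net, audio_net, dim=512):
        super().__init__()
        self.visual_net, self.audio_net = visual_net, audio_net
        self.head = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, frames, audio):
        v, a = self.visual_net(frames), self.audio_net(audio)
        return self.head(torch.cat([v, a], dim=-1)).squeeze(-1)  # logit: do A/V match?
```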
  14. Introduce a new task, Audio-Video Temporal Synchronization (AVTS), for self-supervised pre-training, and demonstrate effective curriculum learning for AVTS (see the sketch below). Use the fact that time is continuous: motion can be related to acoustics. State-of-the-art results for self-supervised training in both the audio and video domains.
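A sketch of how AVTS-style sampling might differ from AVC: positives are temporally aligned audio and video from the same moment, easy negatives take audio from a different video, and hard negatives take audio from the same video shifted in time, introduced later under a simple curriculum. The clip/audio interface, shift ranges, and schedule here are illustrative assumptions.

```python
# Hypothetical sketch of AVTS-style negative sampling with a simple curriculum.
import random

def sample_avts_pair(video, easy_neg_video=None, hard=False,
                     clip_len=1.0, min_shift=0.5, max_shift=2.0):
    """Return (frames, audio, label). `video.frames`, `video.audio`, and
    `video.duration` are an assumed interface; `easy_neg_video` is required
    when hard=False.

    Positive: frames and audio cut from the same moment of `video`.
    Easy negative: audio from a different video (the AVC-style negative).
    Hard negative: audio from the *same* video, shifted in time, so the model
    must relate motion to acoustics rather than just semantics."""
    t = random.uniform(0, video.duration - clip_len)
    frames = video.frames(t, t + clip_len)
    if random.random() < 0.5:
        return frames, video.audio(t, t + clip_len), 1.0
    if hard:
        shift = random.choice([-1, 1]) * random.uniform(min_shift, max_shift)
        s = min(max(t + shift, 0), video.duration - clip_len)
        return frames, video.audio(s, s + clip_len), 0.0
    u = random.uniform(0, easy_neg_video.duration - clip_len)
    return frames, easy_neg_video.audio(u, u + clip_len), 0.0

def use_hard_negatives(epoch, warmup_epochs=10, hard_fraction=0.75):
    # Curriculum sketch: train early epochs on easy negatives only, then mix in
    # temporally shifted hard negatives from the same video.
    return epoch >= warmup_epochs and random.random() < hard_fraction
```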
  15. Now: not redefining labels, but asking what the definition is. Based on a conversation with Dhruv, cite... Verbs take objects: "Pick up baby" and "Pick up ball" have similar semantics but are visually distinct. Large label spaces: attributes (large via combinatorial explosion? "matt with red tie", "matt with green shirt"), hashtags, phrases, natural language (phrases, sentences, etc.). Train vision on, or jointly with, word embeddings – see the sketch below.
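One way to read "train vision on, or jointly with, word embeddings" is to regress video features onto a pretrained embedding of the label phrase instead of a one-hot class, so related labels like "pick up baby" and "pick up ball" land near each other in the target space. The sketch below is a hypothetical illustration; the encoder, projection, and loss choice are assumptions, not the talk's method.

```python
# Hypothetical sketch: train a video encoder against label word embeddings
# rather than one-hot classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoToEmbedding(nn.Module):
    def __init__(self, video_encoder, feat_dim=2048, embed_dim=300):
        super().__init__()
        self.encoder = video_encoder                # any clip encoder, e.g. a 3D ConvNet
        self.proj = nn.Linear(feat_dim, embed_dim)  # map video features into word-embedding space

    def forward(self, clips):
        return F.normalize(self.proj(self.encoder(clips)), dim=-1)

def embedding_loss(video_emb, label_emb):
    # Pull each clip toward the (pretrained, frozen) embedding of its label phrase;
    # cosine distance preserves the geometry of the word-embedding space.
    return (1 - F.cosine_similarity(video_emb, F.normalize(label_emb, dim=-1))).mean()

# Usage sketch: label_emb could be the mean word2vec/fastText vector of the label
# phrase ("pick up baby"), looked up once and kept fixed during training.
```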
  16. This starts to raise the question of what a label actually is. Combinations of attributes?
  17. 100M videos? Current tasks saturate at around 5-10M videos. New SOTA on Kinetics, EPIC-Kitchens, etc. Pretraining 3D via inflation from images? Not as good as pretraining on 3D clips. Pretraining 2D on image frames vs. video frames? Small gap – not really necessary. The label space should be adapted to the task; in general the number of labels is not a huge win, and it seems better to have more videos than more labels. Interesting difference: what's a negative section? The simplified model is positive sections vs. negative sections, but that's not really true for actions. Large models benefit more.
  18. Since we're talking about labels: we were annotating a dataset internally. There's always some back-and-forth; the annotators ask questions. In this case: "Is a toy car a car?" It seems so benign – how could this be wrong either way? This was a vision dataset, so we applied visual rules: if a class boundary can be decided visually, then it's a fair boundary to draw. Toy cars tend to co-occur with playrooms, kids' hands, etc. The motion is all different, the sounds are different, and they may look different or be made of plastic. So we made two labels: toy car vs. car. But you see what we did there? We judged an object based on its context. Apply this thinking elsewhere? It could be terrible. This example is interesting because it seems so benign – and the decision we made doesn't feel incorrect. But we're building these big correlation machines, and they'll learn an object together with its context.