
Matt Feiszli at AI Frontiers : Video Understanding

I will discuss the state of the art of video understanding, particularly its research and applications at Facebook. I will focus on two active areas: multimodality and time. Video is naturally multi-modal, offering great possibility for content understanding while also opening new doors like unsupervised and weakly-supervised learning at scale. Temporal representation remains a largely open problem; while we can describe a few seconds of video, there is no natural representation for a few minutes of video. I will discuss recent progress, the importance of these problems for applications, and what we hope to achieve.

Matt Feiszli at AI Frontiers : Video Understanding

  1. VIDEO UNDERSTANDING Matt Feiszli, Research Scientist / Manager, Facebook AI
  2. AI @FACEBOOK Research Tools Platforms Product
  3. VIDEO @FACEBOOK (see notes) Facebook AI Mobile Vision Video ML Integrity FRL (AR / VR)
  4. Make it Relevant Understand • What is this about? • What’s the language? • Who’s in it? • Where does it take place? Personalize • Who wants to see this? • Which part(s)? Deliver • Highest possible quality • Many possible devices • Variety of bandwidths
  5. HUMAN-LEVEL UNDERSTANDING (by watching)
  6. VISION AUDIO SPEECH OCR
  7. TIME motion & change
  8. WHERE ARE WE NOW? o Multimodal, temporal signal o Idea: Novel tasks replace labels • Language + vision • Audio as labels for video o Aspirations vs. reality?
  9. RETRIEVAL & RANKING o Watch / no-watch: first few minutes • Should “understand” several minutes o Goal: Long-form content representation o Reality: Metadata is strongest signal. • Topic tagging • People, places, activities, brands
  10. GREAT MOMENTS o Video: Boredom punctuated by greatness • Highlight reels, summaries • Objectionable content o Can find some moments. • Highly multimodal. o Complex actions, intents are a mystery.
  11. Visuotemporal Structure
  12. Action: Doing Pushups
  13. Correspondence
  14. Temporal Structure o Action labels have temporal structure • Pushups: two key poses, two transitions • Compare: “Baking a cake” o Current visual models tend to ignore this • Instead: correlated objects, scenes, etc.
  15. Temporal Structure o Speech recognition • Words -> phonemes -> features • Modern models mostly learn this o Not without ambiguity, but… • … far better than actions (Macquarie University, Dept. of Linguistics, “Vowel Spectra”)
  16. Towards self-supervision? o Goal: self-supervision (“free supervision”) o Examples: • Compression (e.g. autoencoders) • Neighboring image patches • Temporal ordering • Audio-visual matching (a minimal sketch of the audio-visual matching task follows this list)
  17. Self-Supervised Learning with Audio and Video Temporal Synchronization. Bruno Korbar, Du Tran, Lorenzo Torresani. NIPS 2018
  18. Related Work: L^3 Net, Arandjelovic & Zisserman, ICCV’17
  19. Audio-Video Temporal Synchronization (AVTS)
  20. Sound Localization
  21. What is a Label (at Scale)? o Goal: rich features via extremely large label spaces o “Extremely large label space”? • Verbs + objects? • Combinations of attributes? • Natural language?
  22. Extreme Scale: Exploring the Limits of Weakly Supervised Pretraining. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten. o SOTA on ImageNet1K, 85.4% Top-1 accuracy • Architecture – ResNeXt101-32x48 • Data – 3.5B images • Labels – 17K classes • Training – distributed training on 300 GPUs • Supervision – Weakly supervised
  23. Extreme Scale: Learnings from Video (to be published) o Transfer learning from 100M videos? • Already setting new SOTA on Kinetics, Epic Kitchens, etc. o Temporal models? o Labels? • Size of label space • Objects, actions, etc.
  24. Is a toy car a car?
  25. Thank you!
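
Slides 16-19 describe self-supervised pre-training by audio-visual matching and temporal synchronization (AVTS). Below is a minimal sketch of that kind of pretext task, not the authors' code: a two-stream network classifies whether a video clip and an audio clip are temporally aligned, with "easy" negatives taken from other videos and "hard" negatives taken from the same video at a different time. The encoder sizes, tensor shapes, and the make_batch sampling helper are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of an AVTS-style pretext task:
# classify whether a video clip and an audio clip are temporally aligned.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Tiny 3D-conv stand-in for a real video backbone."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):            # x: (B, 3, T, H, W)
        return self.fc(self.conv(x).flatten(1))

class AudioEncoder(nn.Module):
    """Tiny 2D-conv stand-in for an audio backbone over log-mel spectrograms."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):            # x: (B, 1, mel_bins, frames)
        return self.fc(self.conv(x).flatten(1))

class AVSyncClassifier(nn.Module):
    """Fuse the two embeddings and predict aligned (1) vs. not aligned (0)."""
    def __init__(self, dim=128):
        super().__init__()
        self.video_enc = VideoEncoder(dim)
        self.audio_enc = AudioEncoder(dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, video, audio):
        z = torch.cat([self.video_enc(video), self.audio_enc(audio)], dim=1)
        return self.head(z).squeeze(1)   # logits

def make_batch(video, audio_aligned, audio_other, audio_shifted, hard_ratio=0.25):
    """Positives are aligned pairs; negatives mix 'easy' audio from other videos
    with 'hard' audio from the same video but a different time window
    (a curriculum can gradually increase hard_ratio)."""
    n = video.size(0)
    n_hard = int(hard_ratio * n)
    neg_audio = torch.cat([audio_shifted[:n_hard], audio_other[n_hard:]], dim=0)
    videos = torch.cat([video, video], dim=0)
    audios = torch.cat([audio_aligned, neg_audio], dim=0)
    labels = torch.cat([torch.ones(n), torch.zeros(n)], dim=0)
    return videos, audios, labels

# One illustrative training step on random tensors standing in for real clips.
model = AVSyncClassifier()
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
video = torch.randn(4, 3, 8, 64, 64)                       # 4 clips, 8 frames each
aud_pos, aud_other, aud_shift = (torch.randn(4, 1, 40, 100) for _ in range(3))
v, a, y = make_batch(video, aud_pos, aud_other, aud_shift)
loss = F.binary_cross_entropy_with_logits(model(v, a), y)
loss.backward()
opt.step()
```

The binary "aligned or not" label is free supervision: it comes from the video file itself, with no human annotation.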

Editor's Notes

  • Intro
    Talk about AI@FB, the who and the why of video.
  • @FB AI means full stack
    Research all the way to production. Depends on tools (PyTorch has seen great adoption and we’re standardizing internally).
    The AI org provides platforms, workflows, models, and infra. All of this serves product, and there is an increasing amount of AI distributed through the various product organizations and verticals.
  • Just a few examples of teams (you might not know about): can’t mention all.
    Facebook AI (FAIR): research, tools, platforms. Video MPK, Video NYC, along with folks from FAIR. But lots of the same work is done by verticals.
    E.g. mobile: pose, effects, etc. Not on the slide, but Portal: pose tracking for the AI cameraman, a well-reviewed videoconferencing feature using SOTA pose tracking algorithms.

    Video ML: focus on video product features. Various integrity use cases (organized a CVPR workshop).
    AR/VR: awesome work.
    Note: didn’t even mention Feed, Ads, etc.
  • A video is uploaded – this happens tens of millions of times per day.

    [bullets]

    To do a good job here requires essentially human-level understanding. So let’s move to the science.
  • What’s the promise of video? Learn everything by watching. Physics, language, planning, causality, intent… it’s all there.

    Not convincing? There are a ton of instructional and educational videos. Language tutorials, etc. Make it easier.

    Why is video better?
  • Modalities. Both correlated + complementary.

    E.g. topic tagging. Amateur sports video: visual. Pro sports broadcast: visual + speech both powerful, complementary.
    Playing an instrument: both correlated. On the other hand, consider news: not visual, but speech or OCR tells you the topic: cooking, news, medical, etc.
    Also accessibility: read a menu, signs, etc.

    SELF-SUPERVISION: labeling video is really, really expensive and slow. New tasks reduce the need for labels.
  • Time – video evolves.
    Consequences, planning, interactions.
    Reasonable goal: can we learn sports by watching? Understand player intent, predict actions? Prediction tasks: future as supervision for past.

  • Search, recommendations, chaining
    Consider topic tagging: it requires multimodal understanding. Recall menus, signs, the news crawl, the sports / stock ticker.
    Problems: summarization, salience, and common sense world understanding.
  • Video is boredom with occasional moments of greatness (or really, significant change).
    Compare: photos are self-selected for greatness. If you see a photo at all, it’s probably good.
    Some moments are reasonably learnable (visually, by listening, etc.).

    Hard: interactions between people, tense moments; sentiment detection is not well-developed (e.g. listening for a loud soundtrack).
    Hard: most successful ML relies on large, labeled datasets. You need a query or a person to ground it.

    E.g. training highlight models: the trickiest part is assembling the dataset, and the model learns clickbait. An ethical warning for us all. FB thinks about bias in ML models. We have ethical and operational guidelines for building datasets, conducting research, and building our products. This affects design decisions daily.
  • Let’s look at a typical visual task: action recognition.
    Each point on the curve is one frame: ImageNet features at 1 FPS on UCF-101 action recognition, embedded in 2D using t-SNE. Color represents velocity (rate of change). A toy version of this visualization is sketched after these notes.
    So what’s happening? Periodic motion: two slow / stationary repeated points, with faster transitions between them.

    A microcosm, but like longer video in many ways: repeated sections, similar sequences. Car chases, soccer set pieces like a corner kick, etc. Self-similarity.
  • A man doing pushups, from the public UCF-101 dataset.
  • Compression is very different – it requires a metric.
    The last three construct binary classification problems: do patches match? Are the frames in the right order? Do audio and video match?
  • Uses a 2D ConvNet for the visual stream: no temporal context and no motion modeling.

    Trains only on easy negative examples: the negative sound is selected from a different video.
    What does this (AVC) really learn? Semantic correlation between audio and video.

  • Introduce a new task, Audio-Video Temporal Synchronization (AVTS), for self-supervised pre-training.

    Demonstrate effective curriculum learning for AVTS. Use the fact that time is continuous: motion can be related to acoustics.

    State-of-the-art results on self-supervised training in both the audio and video domains.
  • Now: not redefining labels, but asking what the definition is.
    Based on a conversation with Dhruv, cite...
    Verbs take objects
    ”Pick up baby”, “Pick up ball”: similar semantics, visually distinct.
    Large label spaces:
    Attributes: large via combinatorial explosion? “matt with red tie”, “matt with green shirt”

    Hashtags, phrases
    Natural language: phrases, sentences, etc. Train vision on, or jointly with, word embeddings. (A toy hashtags-as-weak-labels sketch follows these notes.)
  • This starts to raise the question of “what is a label?”
    Combinations of attributes?
  • 100M? Current tasks saturate at around 5-10M videos.
    New SOTA on Kinetics, Epic Kitchens, etc.

    Pretraining 3D via inflation from images? Not as good as pretraining on 3D clips. Pretraining 2D on image frames vs video frames? Small gap. Not really necessary.

    The label space should be adapted to the task.
    In general it seems the number of labels is not a huge win; it seems better to have more videos than more labels.

    Interesting difference: what is a negative section? The simplified model is positive sections vs. negative sections, but that is not really true for actions.

    Large models benefit more
  • Since we’re talking about labels: We were annotating a dataset internally. There’s always some back-and-forth; the annotators ask questions.
    In this case: “Is a toy car a car?”
    It seems so benign -- how could this be wrong either way? This was a vision dataset, so we applied visual rules: if a class boundary can be decided visually, then it’s a fair boundary to draw. Toy cars tend to co-occur with playrooms, kids’ hands, etc. The motion is all different. The sounds are different. They may look different or be made of plastic. So, we made two labels.
    Toy car vs. car. But you see what we did there? We judged an object based on its context. Apply this thinking elsewhere? It could be terrible.
    This example is interesting because it seems so benign – and the decision we made doesn’t feel incorrect. But we’re building these big correlation machines, and they’ll learn an object together with its context.
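
For the frame-embedding plot described in the notes for slides 11-12 (per-frame ImageNet features embedded in 2D with t-SNE, colored by velocity), here is a minimal sketch under stated assumptions: frames are assumed to have already been decoded at 1 FPS into a tensor, a recent torchvision ResNet-50 stands in for the feature extractor, and frame_embedding_plot is a name introduced for illustration.

```python
# Sketch of the t-SNE frame-trajectory visualization described in the notes.
import torch
import torch.nn as nn
import numpy as np
from torchvision import models, transforms
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def frame_embedding_plot(frames, out_path="tsne_frames.png"):
    """frames: (N, 3, H, W) float tensor in [0, 1], one frame per second of video."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Identity()              # keep the 2048-d penultimate features
    backbone.eval()

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        feats = backbone(normalize(frames)).cpu().numpy()     # (N, 2048)

    # "Velocity" = rate of change of the features between consecutive frames.
    velocity = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    velocity = np.concatenate([velocity[:1], velocity])       # pad to length N

    # Embed the per-frame features in 2D; perplexity must be < number of frames.
    xy = TSNE(n_components=2, perplexity=min(30, len(feats) - 1),
              init="random", random_state=0).fit_transform(feats)

    plt.scatter(xy[:, 0], xy[:, 1], c=velocity, cmap="viridis")
    plt.plot(xy[:, 0], xy[:, 1], linewidth=0.5, alpha=0.5)    # temporal order as a curve
    plt.colorbar(label="feature velocity")
    plt.savefig(out_path)

# Example with random stand-in frames (replace with frames decoded at 1 FPS).
frame_embedding_plot(torch.rand(40, 3, 224, 224))
```

On a real pushups clip, the two slow key poses show up as dense, low-velocity clusters and the transitions as fast, sparse arcs between them.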
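The notes above also discuss hashtags as an extremely large, weak label space. As a concrete illustration, here is a toy sketch of one common recipe for hashtags-as-weak-labels training: a per-example softmax whose target is spread uniformly over that example's hashtags. The vocabulary size, backbone, and data pipeline are placeholders and assumptions, not the production setup.

```python
# Toy sketch of weakly supervised pretraining with hashtags as labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_HASHTAGS = 17_000          # large label space, as on the slide (17K classes)

class HashtagModel(nn.Module):
    def __init__(self, feat_dim=512, num_labels=NUM_HASHTAGS):
        super().__init__()
        # Placeholder backbone; in practice a ResNeXt / video backbone goes here.
        self.backbone = nn.Sequential(nn.Flatten(),
                                      nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))

def weak_label_loss(logits, hashtag_lists):
    """Cross-entropy against a target spread uniformly over each example's hashtags."""
    target = torch.zeros_like(logits)
    for i, tags in enumerate(hashtag_lists):
        target[i, tags] = 1.0 / len(tags)
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

model = HashtagModel()
images = torch.rand(2, 3, 64, 64)             # stand-in for decoded frames/images
hashtags = [[12, 4051], [7, 250, 16999]]      # noisy hashtag indices per example
loss = weak_label_loss(model(images), hashtags)
loss.backward()
```

The labels are noisy and contextual (the toy-car question is exactly about where one hashtag ends and another begins), which is why the label-space design choices in the notes matter.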
