I will discuss the state of the art of video understanding, particularly its research and applications at Facebook. I will focus on two active areas: multimodality and time. Video is naturally multi-modal, offering great possibility for content understanding while also opening new doors like unsupervised and weakly-supervised learning at scale. Temporal representation remains a largely open problem; while we can describe a few seconds of video, there is no natural representation for a few minutes of video. I will discuss recent progress, the importance of these problems for applications, and what we hope to achieve.
Intro: AI@FB, and the who and why of video.
@FB, AI means full stack: research all the way to production. It depends on tools (PyTorch has seen great adoption and we’re standardizing internally). The AI org provides platforms, workflows, models, and infra. All of this serves product, and there are increasing amounts of AI distributed through the various product organizations and verticals.
Just a few examples of teams (you might not know about); can’t mention all. Facebook AI (FAIR): research, tools, platforms. Video MPK and Video NYC, along with folks from FAIR. But lots of the same is done by verticals, e.g. mobile: pose, effects, etc. Not on the slide, but Portal: pose tracking for the AI cameraman, a well-reviewed videoconferencing feature using SOTA pose-tracking algorithms.
Video ML: focus on video product features, plus various integrity use cases (organized a CVPR workshop). AR/VR: awesome work. Note: didn’t even mention Feed, Ads, etc.
A video is uploaded; this happens tens of millions of times per day.
To do a good job here requires essentially human understanding. So let’s move to the science.
What’s the promise of video? Learn everything by watching. Physics, language, planning, causality, intent… it’s all there.
Not convinced? There are a ton of instructional and educational videos (language tutorials, etc.) that make learning easier.
Why is video better?
Modalities. Both correlated + complementary.
E.g. topic tagging. Amateur sports video: visual. Pro sports broadcast: visual + speech, both powerful and complementary. Playing an instrument: both correlated. OTOH consider news: not visual, but speech or OCR tells you: cooking, news, medical, etc. Also accessibility: read a menu, signs, etc.
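To make the correlated-plus-complementary point concrete, here is a minimal late-fusion sketch in NumPy. The topic list, the scores, and the `late_fuse` helper are all hypothetical illustrations, not anything from a production system:

```python
import numpy as np

# Hypothetical topic vocabulary for one video.
TOPICS = ["sports", "news", "cooking"]

def late_fuse(modality_logits: dict) -> np.ndarray:
    """Average per-modality softmax scores over whichever modalities are present."""
    probs = []
    for name, logits in modality_logits.items():
        e = np.exp(logits - np.max(logits))  # numerically stable softmax
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)

# A news broadcast: the visuals are ambiguous, but speech/OCR is decisive.
fused = late_fuse({
    "visual": np.array([0.1, 0.2, 0.1]),  # near-uniform: not visually distinctive
    "speech": np.array([0.0, 3.0, 0.0]),  # transcript strongly indicates "news"
})
print(TOPICS[int(np.argmax(fused))])  # -> news
```

Late fusion degrades gracefully: modalities that are missing (no speech track, no readable text) are simply omitted from the dictionary.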
SELF-SUPERVISION: labeling video is really, really expensive and slow. New tasks reduce the need for labels.
Time: video evolves. Consequences, planning, interactions. A reasonable goal: can we learn sports by watching? Understand player intent, predict actions? Prediction tasks: the future as supervision for the past.
Search, recommendations, chaining. Consider topic tagging: it requires multimodality. Recall menus, signs, the news crawl, the sports / stock ticker. Problems: summarization, salience, and common-sense world understanding.
Video is boredom with occasional moments of greatness (or really, significant change). Compare: photos are self-selected for greatness; if you see one at all, it’s probably good. Some moments are reasonably learnable (visually, by listening, etc.).
Hard: people interactions, tense moments; sentiment is not well developed (listening for a loud soundtrack). Hard: most successful ML relies on large, labeled datasets; need a query or a person to ground it.
E.g. training highlight models: the trickiest part is assembling the dataset; otherwise the model learns clickbait. An ethical warning for us all. FB thinks about bias in ML models. We have ethical and operational guidelines for building datasets, conducting research, and building our products. This affects design decisions daily.
Let’s look at a typical visual task: action recognition. Each point on the curve is one frame: ImageNet features at 1 FPS on UCF-101 action recognition, embedded in 2D using t-SNE. Color represents velocity (rate of change). So what’s happening? Periodic motion: two slow / stationary repeated points, with faster transitions between them.
A microcosm, but like longer video in many ways: repeated sections, similar sequences. Car chases, soccer set pieces like a corner kick, etc. Self-similarity.
Man doing pushups, from the UCF-101 public dataset.
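The visualization described above can be sketched as follows. This is a minimal, hedged version: random features stand in for real CNN outputs, and PCA (via SVD) stands in for t-SNE so the sketch stays dependency-free; only the shapes and the velocity coloring match the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-frame ImageNet features sampled at 1 FPS.
# (The talk uses real CNN features; random data here just shows the shapes.)
T, D = 120, 2048
feats = rng.normal(size=(T, D))

# "Velocity": rate of change between consecutive frames.
# Slow / stationary poses give small values; fast transitions give large ones.
velocity = np.linalg.norm(np.diff(feats, axis=0), axis=1)

# 2-D embedding for plotting. The talk uses t-SNE; plain PCA keeps this
# sketch free of extra dependencies.
centered = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ vt[:2].T  # one 2-D point per frame
```

Plotting `embedding` with points colored by `velocity` reproduces the kind of curve shown on the slide: repeated slow regions connected by fast transitions.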
Compression is very different: it requires a metric. The last three construct binary classification problems: do patches match? Are frames in the right order? Do audio and video match?
Uses a 2D ConvNet for the visual stream: no temporal context and no motion modeling.
Trains only on easy negative examples: the negative sound is selected from a different video. What does this (AVC) really learn? Semantic correlation between audio and video.
Introduce a new task, Audio-Video Temporal Synchronization (AVTS), for self-supervised pre-training.
Demonstrate effective curriculum learning for AVTS. Use the fact that time is continuous: we can relate motion to acoustics.
State-of-the-art results for self-supervised training in both the audio and video domains.
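A minimal sketch of how AVTS-style training pairs could be sampled, following the curriculum idea above: easy negatives take audio from a different video, hard negatives take audio from the same video shifted in time. The clip index, the durations, and `sample_pair` are hypothetical illustrations, not the paper’s actual pipeline:

```python
import random

random.seed(0)

# Hypothetical clip index: video id -> duration in seconds.
VIDEOS = {"vid_a": 30.0, "vid_b": 45.0, "vid_c": 60.0}
CLIP_LEN, MIN_SHIFT = 1.0, 2.0  # hard negatives are shifted by at least 2 s

def sample_pair(video_id, kind):
    """Return (visual, audio) clip specs as (video_id, start_time) tuples."""
    dur = VIDEOS[video_id]
    t = random.uniform(0, dur - CLIP_LEN)
    if kind == "positive":            # same video, same time: in sync
        return (video_id, t), (video_id, t)
    if kind == "easy_negative":       # audio from a different video
        other = random.choice([v for v in VIDEOS if v != video_id])
        return (video_id, t), (other, random.uniform(0, VIDEOS[other] - CLIP_LEN))
    if kind == "hard_negative":       # same video, out of sync by >= MIN_SHIFT
        while True:
            t2 = random.uniform(0, dur - CLIP_LEN)
            if abs(t2 - t) >= MIN_SHIFT:
                return (video_id, t), (video_id, t2)
    raise ValueError(kind)

# Curriculum: train on easy negatives first, then mix in hard negatives.
vis, aud = sample_pair("vid_a", "hard_negative")
```

Hard negatives are what force the model past mere semantic audio-video correlation and toward actual synchronization, i.e. relating motion to acoustics.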
Now: not redefining labels, but asking what the definition is. Based on a conversation with Dhruv, cite... Verbs take objects: “pick up baby” and “pick up ball” have similar semantics but are visually distinct. Large label spaces: attributes, large via combinatorial explosion? “Matt with red tie”, “Matt with green shirt”.
Hashtags, phrases. Natural language: phrases, sentences, etc. Train vision on, or jointly with, word embeddings.
Starts to raise the question of “what is a label?” Combinations of attributes?
100M? Current tasks saturate at around 5-10M videos. New SOTA on Kinetics, Epic Kitchens, etc.
Pretraining 3D via inflation from images? Not as good as pretraining on 3D clips. Pretraining 2D on image frames vs video frames? Small gap. Not really necessary.
The label space should be adapted to the task. In general, the number of labels is not a huge win; it seems better to have more videos than more labels.
Interesting difference: what’s a negative section? The simplified model is positive sections vs. negative sections, but that’s not really true for actions.
Large models benefit more
Since we’re talking about labels: we were annotating a dataset internally. There’s always some back-and-forth; the annotators ask questions. In this case: “Is a toy car a car?” It seems so benign; how could this be wrong either way? This was a vision dataset, so we applied visual rules: if a class boundary can be decided visually, then it’s a fair boundary to draw. Toy cars tend to co-occur with playrooms, kids’ hands, etc. The motion is all different. The sounds are different. They may look different or be made of plastic. So we made two labels: toy car vs. car. But you see what we did there? We judged an object based on its context. Apply this thinking elsewhere and it could be terrible. This example is interesting because it seems so benign, and the decision we made doesn’t feel incorrect. But we’re building these big correlation machines, and they’ll learn an object together with its context.
Matt Feiszli at AI Frontiers: Video Understanding
Research Scientist / Manager
FRL (AR / VR)
Make it Relevant
• What is this about?
• What’s the language?
• Who’s in it?
• Where does it take place?
• Who wants to see it?
• Which part(s)?
• Highest possible quality
• Many possible devices
• Variety of bandwidths
WHERE ARE WE NOW?
o Multimodal, temporal signal
o Idea: Novel tasks replace labels
• Language + vision
• Audio as labels for video
o Aspirations vs. reality?
o Watch / no-watch: first few minutes
• Should “understand” several minutes
o Goal: Long-form content representation
o Reality: Metadata is strongest signal.
• Topic tagging
• People, places, activities, brands
o Video: Boredom punctuated by greatness
• Highlight reels, summaries
• Objectionable content
o Can find some moments.
• Highly multimodal.
o Complex actions, intents are a mystery.
o Action labels have temporal structure
• Pushups: two key poses, two transitions
• Compare: “Baking a cake”
o Current visual models tend to ignore this
• Instead: correlated objects, scenes, etc
o Speech recognition
• Words -> phonemes -> features
• Modern models mostly learn this
o Not without ambiguity, but…
• … far better than actions
Macquarie University, Dept. of Linguistics, “Vowel Spectra”
o Goal: self-supervision (“free supervision”)
• Compression (e.g. autoencoders)
• Neighboring image patches
• Temporal ordering
• Audio-visual matching
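The temporal-ordering task in the list above can be sketched as a shuffle-and-learn-style example generator: ordered frame triples are positives, shuffled ones negatives, so the labels come for free from the video itself. `ordering_example` and its parameters are illustrative, not from the talk:

```python
import random

random.seed(0)

def ordering_example(num_frames, positive, gap=5):
    """Pretext task: is a frame triple in temporal order?

    Returns (frame_indices, label); label 1 means temporally ordered.
    """
    start = random.randrange(num_frames - 2 * gap)
    triple = [start, start + gap, start + 2 * gap]  # temporally ordered
    if not positive:
        while True:
            shuffled = random.sample(triple, 3)
            if shuffled != triple:  # make sure the negative is actually shuffled
                triple = shuffled
                break
    return triple, int(positive)

frames, label = ordering_example(num_frames=300, positive=False)
```

A small classifier trained on such pairs never sees a human label, yet must learn something about motion and dynamics to succeed.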
Self-Supervised Learning with Audio and Video
Bruno Korbar, Du Tran, Lorenzo Torresani
Arandjelovic & Zisserman
o Goal: rich features via extremely large label spaces
o “Extremely large label space”?
• Verbs + objects?
• Combinations of attributes?
• Natural language?
What is a Label (at Scale)?
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar
Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten.
o SOTA on ImageNet-1K: 85.4% top-1 accuracy
• Architecture – ResNeXt101-32x48
• Data – 3.5B Images
• Labels – 17K classes
• Training – 300 GPUs distributed training
• Supervision – Weakly supervised
Extreme Scale: Exploring the Limits of Weakly Supervised Pretraining
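A hedged sketch of the multi-hashtag softmax targets used in weakly supervised pretraining of this kind: each of an image’s k hashtags gets weight 1/k, and training minimizes softmax cross-entropy against those soft targets. The tiny vocabulary and helper names are hypothetical (the real vocabulary has ~17K classes):

```python
import numpy as np

# Hypothetical hashtag vocabulary; the real one has ~17K classes.
VOCAB = {"#dog": 0, "#cat": 1, "#beach": 2, "#sunset": 3}

def soft_targets(hashtags):
    """Each of an image's k hashtags gets weight 1/k; targets sum to 1."""
    t = np.zeros(len(VOCAB))
    for h in hashtags:
        t[VOCAB[h]] = 1.0 / len(hashtags)
    return t

def softmax_xent(logits, targets):
    """Cross-entropy between softmax(logits) and the soft targets."""
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -(targets * log_probs).sum()

targets = soft_targets(["#dog", "#beach"])
loss = softmax_xent(np.array([2.0, -1.0, 1.5, 0.0]), targets)
```

The key point: hashtags are noisy, free labels, and at billions of images this weak supervision outweighs the noise.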
o Transfer learning from 100M videos?
• Already setting new SOTA on Kinetics, Epic Kitchens, etc.
o Temporal models?
• Size of label space
• Objects, actions, etc.
Extreme Scale: Learnings from Video (to be published)