Self-supervised learning (SSL) is behind some of the latest AI breakthroughs. By enabling learning from vast amounts of unlabeled data, rather than relying on carefully annotated datasets, it has unlocked the potential of AI across natural language processing, audio, and computer vision. This talk covers how it is used for vision tasks.
Presented at the AI & ML meetup @Azuga on July 1, 2023 by Vidhya Vinay.
streamingo.ai
Self-Supervised Learning
• “Dark matter of intelligence”
• Learns from unlabeled data
• Able to match or surpass models trained with a supervised approach
• Works for text, image, video, audio, and time-series data
Why Self-Supervised Learning
• The learned representations can be used for a variety of tasks. For example, in NLP the downstream tasks could be summarization, translation, or text generation.
• In supervised learning, the task has to be defined beforehand.
• Unsupervised learning doesn't learn the representation.
Playback speed
• Take clips of t frames from each video, selecting the frames so that the playback speed is altered.
• Collect every p-th frame, where p is the playback rate, which either speeds the video up or slows it down.
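The sampling step can be sketched roughly as follows. This is a minimal illustration, not the method from any particular paper: the function names, the candidate rates (1, 2, 4), and the use of frame indices in place of decoded frames are all assumptions, and only speed-up by integer rates is covered.

```python
import random


def sample_clip(num_frames, t, p, start=0):
    """Return t frame indices, taking every p-th frame beginning at `start`."""
    indices = [start + i * p for i in range(t)]
    if indices[-1] >= num_frames:
        raise ValueError("video too short for this clip length and rate")
    return indices


def make_speed_example(num_frames, t, rates=(1, 2, 4), rng=random):
    """Pick a playback rate at random; its position in `rates` is the pretext label.

    `rates` is a hypothetical choice of speed classes, not taken from the talk.
    """
    label = rng.randrange(len(rates))
    p = rates[label]
    max_start = num_frames - (t - 1) * p - 1
    if max_start < 0:
        raise ValueError("video too short for this clip length and rate")
    start = rng.randint(0, max_start)
    return sample_clip(num_frames, t, p, start), label
```

A model trained on such examples would receive the frames at the returned indices and be asked to predict the label, that is, which playback rate produced the clip.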
Temporal Order
• Each video V is split into clips of t frames.
• Each set of clips contains a single clip in the correct order; the remaining clips are modified by shuffling the order.
• For example, (t2, t1, t3) is incorrect and (t1, t2, t3) is correct.
• Also called odd-one-out learning.
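One way to build such a set of clips is sketched below. The helper names and the default of three clips per set are assumptions made for illustration; a clip is represented here as a list of distinct frame identifiers rather than pixel data.

```python
import random


def shuffled_copy(clip, rng):
    """Return a permutation of `clip` that differs from the original order."""
    if len(set(clip)) < 2:
        raise ValueError("need at least two distinct frames to shuffle")
    original = list(clip)
    out = list(clip)
    while out == original:
        rng.shuffle(out)
    return out


def make_order_set(clip, num_clips=3, rng=random):
    """Build a set with one correctly ordered clip and shuffled copies elsewhere.

    Returns (clips, correct_pos); correct_pos is the pretext label.
    """
    correct_pos = rng.randrange(num_clips)
    clips = [
        list(clip) if i == correct_pos else shuffled_copy(clip, rng)
        for i in range(num_clips)
    ]
    return clips, correct_pos
```

The network sees every clip in the set and is trained to point at the position of the one whose frames are in the correct order.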
Frame Prediction
• Reconstructing motion or generating motion from RGB frames
• Uses optical flow as the motion signal
• A discriminator and a variational autoencoder are used to measure the quality of the generated predictions
• Another approach is to create motion maps and then predict the next frame at various resolutions
• Uses a reconstruction loss to measure the quality of the reconstruction
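As a small illustration of a reconstruction loss, the sketch below computes the mean squared error between a predicted frame and the true next frame. Mean squared error is an assumed choice here (the slides do not name a specific loss), and frames are plain 2-D lists of pixel intensities to keep the example dependency-free.

```python
def reconstruction_loss(predicted, target):
    """Mean squared error between two frames given as equally shaped 2-D lists."""
    if len(predicted) != len(target):
        raise ValueError("frames must have the same number of rows")
    total = 0.0
    count = 0
    for pred_row, target_row in zip(predicted, target):
        if len(pred_row) != len(target_row):
            raise ValueError("frames must have the same number of columns")
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must not be empty")
    return total / count
```

A perfect prediction gives a loss of zero, and the loss grows with the squared per-pixel error, so lower values indicate a better reconstruction.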
Multimodal Masked Modeling
• First introduced in NLP as Masked Language Modeling (MLM)
• Bidirectional Encoder Representations from Transformers (BERT) was extended to the video domain by transforming raw visual data into a discrete sequence of tokens using hierarchical k-means
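The two steps, turning visual features into discrete tokens and then masking some of them, might look roughly like this. Note the simplifications: the sketch uses a flat nearest-centroid assignment rather than hierarchical k-means, it assumes the centroids have already been learned, and `MASK`, `tokenize`, `mask_tokens`, and the 15% masking rate are illustrative choices rather than details from the talk.

```python
import random

MASK = -1  # hypothetical id reserved for masked positions


def tokenize(features, centroids):
    """Map each feature vector to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(f, centroids[k]))
        for f in features
    ]


def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Replace a random subset of tokens with MASK.

    Returns (masked_sequence, targets), where targets maps each masked
    position to the original token the model should predict.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```

The model is then trained to recover the entries in `targets` from the masked sequence, mirroring how MLM predicts hidden words from their context.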
View Augmentation
• Change in appearance using augmentations such as random resized crop, channel drop, random color jitter, random grayscale, and/or random rotation
• Positive pairs are augmented versions of the original clips
• Negative pairs are clips from other videos
• Popular approaches: SimCLR, BYOL, and MoCo
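How positive and negative pairs enter training can be illustrated with an InfoNCE-style contrastive loss, the general form used by methods such as SimCLR and MoCo. This is a simplified sketch for a single anchor: the function names, the temperature of 0.1, and the use of plain Python lists as embeddings are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss for one anchor, one positive, and a list of negatives.

    The loss is low when the anchor is close to its positive (an augmented
    view of the same clip) and far from the negatives (clips from other videos).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Calling `info_nce([1, 0], [1, 0], [[0, 1]])` yields a loss close to zero, while swapping the positive and the negative yields a much larger one.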
Temporal Augmentation
• Augmentation used to generate pairs by modifying the temporal order or the start and end of a clip interval
• Maximize the similarity function between two temporally adjacent frames in the same video
• Minimize the similarity between frames from different videos
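Pair selection for this scheme can be sketched as follows. The function name, the `max_gap` parameter, and the representation of a dataset as a mapping from video id to frame count are all assumptions for illustration; the similarity function itself is left to the training loop.

```python
import random


def sample_temporal_triplet(videos, max_gap=1, rng=random):
    """Sample (anchor, positive, negative) as (video_id, frame_index) tuples.

    `videos` maps a video id to its number of frames. The positive is a frame
    at most `max_gap` steps after the anchor in the same video; the negative
    is a random frame from a different video.
    """
    ids = sorted(videos)
    if len(ids) < 2:
        raise ValueError("need at least two videos to draw a negative")
    vid, other = rng.sample(ids, 2)
    n = videos[vid]
    if n < 2 or videos[other] < 1:
        raise ValueError("videos are too short to sample from")
    i = rng.randrange(n - 1)                    # anchor frame index
    gap = rng.randint(1, min(max_gap, n - 1 - i))
    j = rng.randrange(videos[other])            # negative frame index
    return (vid, i), (vid, i + gap), (other, j)
```

Training would then push the embeddings of the anchor and positive together and push the anchor and negative apart.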