1. Survey on Self-Supervised Multimodal
Representation Learning and
Foundation Models
SUBMITTED BY
CHIRUDEEP GORLE
016682627
2. Introduction
Deep learning now underpins virtually all contemporary intelligent systems,
particularly those that handle natural language and computer vision.
However, annotating large datasets for supervised learning is costly and
often impractical. To circumvent this issue, researchers have turned to
self-supervised learning, in which models learn from the data itself without
direct human supervision. This survey examines the development of
self-supervised learning across the language, visual, and auditory modalities.
3. Research questions
The first part of this survey reviews the research on representation learning
for individual modalities, including language, vision, audio, and robotics.
It then examines work that combines these modalities. The study addresses
the goal of self-supervised representations, the application of diverse
learning algorithms for representations across several modalities, and the
benefits of combining these modalities to create more effective AI agents.
4. Language
Language modeling builds statistical models that capture the joint
probability distribution of word sequences in sentences and use that
distribution to characterize a language. Early neural work addressed the
curse of dimensionality by training distributed word representations from
neighboring words. The statistical formulation factorizes the probability of
a sentence into the conditional probability of each word given all the
preceding words.
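The conditional-probability view can be illustrated with a minimal bigram
model, a deliberately simplified sketch: modern systems learn these
probabilities with neural networks, but the statistical idea of estimating
P(next word | previous words) from co-occurrence counts is the same. The
corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count bigrams to estimate P(next_word | previous_word)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
p = next_word_prob(model, "the", "cat")  # 2/3: "the" precedes "cat" in 2 of 3 sentences
```

A full language model conditions on all preceding words rather than just one;
the bigram restriction is what keeps the example tiny.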
5. Image
Self-supervised techniques have also been used to learn visual
representations in computer vision. As in NLP, image representation learning
obtains self-supervision through pretext tasks, which generate pseudo-labels
for artificial prediction problems. Solving these pretext tasks allows models
to learn useful representations without explicit supervision.
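One classic pretext task of this kind is rotation prediction: each image is
rotated by a random multiple of 90 degrees, and the pseudo-label is the
rotation index the model must recover. The sketch below shows only the
pseudo-label generation step (the images are toy 2D lists, not real tensors,
and no network is trained), as a minimal illustration of how such labels come
for free from the data itself.

```python
import random

def rotate90(image):
    """Rotate a 2D list (toy image) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng=random):
    """Pretext task: rotate the image by a random multiple of 90 degrees.
    The pseudo-label is the rotation index (0..3) the model must predict."""
    label = rng.randrange(4)
    rotated = image
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label

img = [[1, 2],
       [3, 4]]
rotated, label = make_rotation_example(img)  # e.g. label 1 gives [[3, 1], [4, 2]]
```

A classifier trained to predict `label` from `rotated` must learn orientation
cues (object shape, layout), which transfer to downstream vision tasks.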
6. Audio
Comparable progress has been made in learning speech representations using
techniques influenced by BERT and Transformers. Since speech is a waveform,
its features are commonly encoded as spectrograms. An audio encoder processes
these features and learns an embedding, denoted z. A decoder then takes z as
input and reconstructs the original features. A reconstruction loss drives
this generative self-supervised setup and encourages the model to develop
rich speech representations.
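The encode-decode-reconstruct loop described above can be sketched as a tiny
linear autoencoder. This is purely illustrative: the weights below are
hand-picked rather than learned, and the 4-dimensional input stands in for a
spectrogram frame; a real model would learn the encoder and decoder by
minimizing the same reconstruction loss over many frames.

```python
def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(x, x_hat):
    """Reconstruction loss between original features and decoder output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

# Toy 4-dim "spectrogram frame" compressed to a 2-dim embedding z.
# Weights are hand-picked for illustration, not learned.
W_enc = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]
W_dec = [[1.0, 0.0],
         [1.0, 0.0],
         [0.0, 1.0],
         [0.0, 1.0]]

x = [0.2, 0.2, 0.8, 0.8]   # input speech features
z = matvec(W_enc, x)       # embedding z = [0.2, 0.8]
x_hat = matvec(W_dec, z)   # reconstruction from z
loss = mse(x, x_hat)       # 0 here: this input is perfectly reconstructable
```

Training drives `loss` toward zero across the dataset, which forces `z` to
retain the information needed to rebuild the input, i.e. a rich representation.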
7. Robotics
Robotics has been influenced by recent advances in image representation
learning such as MoCo and SwAV. In robotics and reinforcement learning, these
techniques provide auxiliary rewards when the environment itself offers only
sparse rewards. By generating self-supervised auxiliary rewards, they make it
possible to build RL agents that perform well on robotics tasks. SwAV in
particular has shown promising results in RL scenarios such as games and
simulated control tasks. This flexibility helps robotics researchers overcome
the sparse-reward problem and improve the performance of RL agents.
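One common form such an auxiliary reward takes is a novelty bonus computed in
the learned embedding space: states whose embeddings are far from anything
seen before earn extra reward, filling the gaps between sparse environment
rewards. The sketch below is a minimal, hypothetical version of that idea
(`novelty_reward` and the 2-dimensional embeddings are illustrative, not from
MoCo or SwAV).

```python
import math

def novelty_reward(embedding, memory, scale=1.0):
    """Self-supervised auxiliary reward: distance from the current state's
    embedding to its nearest neighbour in a memory of past embeddings.
    Novel states (far from everything seen so far) get a larger bonus."""
    if not memory:
        return scale
    nearest = min(math.dist(embedding, m) for m in memory)
    return scale * nearest

memory = []
for state_embedding in ([0.0, 0.0], [0.0, 0.1], [3.0, 4.0]):
    bonus = novelty_reward(state_embedding, memory)  # large for [3.0, 4.0]
    memory.append(state_embedding)
```

In practice the embeddings come from a pretrained visual encoder, and the
bonus is added to the (sparse) environment reward at each step.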
8. Multimodal
Studies have focused on combining two or more modalities to address tasks
such as image captioning, visual question answering, and music composition.
These tasks require combining textual and visual information. Many methods
have been developed that learn cross-modal embeddings capturing information
from both domains in order to encode and mix the modalities. In the ViLBERT
model, for instance, a masked language modeling objective predicts the missing
information from both the visual and textual representations, adapting the
BERT architecture to handle multimodal inputs.
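What "cross-modal embeddings that capture both domains" buys you can be shown
with a minimal retrieval sketch: once an image encoder and a text encoder
project into a shared space, cosine similarity scores how well a caption
matches an image. This is not ViLBERT's architecture; the embeddings below
are hypothetical hand-written vectors standing in for encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings from an image encoder and a text encoder,
# already projected into a shared cross-modal space.
image_emb = [0.9, 0.1, 0.0]
captions = {
    "a cat on a mat": [0.8, 0.2, 0.1],
    "a stock market chart": [0.0, 0.1, 0.9],
}
best = max(captions, key=lambda c: cosine(image_emb, captions[c]))
# best is the caption whose embedding aligns with the image's
```

Training objectives like ViLBERT's masked prediction, or contrastive losses,
are what shape the shared space so that matching pairs score highest.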
9. Research opportunities
Multimodal learning has come a long way, but there is still plenty to
explore. One open area is the development of learnable augmentation schemes
that go beyond the limitations of hand-crafted augmentations; removing the
associated trial and error and biases could further advance multimodal
learning. Designing more effective pretext tasks for learning rich feature
representations is another direction. Transformer-based models, which
dominate multimodal learning, also need improvement in training efficiency
and ease of use.
10. Conclusion
This study presents an in-depth analysis of existing, successful
self-supervised representation learning methods in NLP, vision, audio,
robotics, and multimodal applications. It traces the evolution of these
algorithms from learning embedding representations through their use in
downstream applications, and highlights the positive results of multimodal
representation learning. The study also addresses the approaches' current
limitations and offers potential future developments.