Learning Transferable Visual Models from Natural Language Supervision: CLIP
Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, et al.
Presented By: Zohaib Hassan
Objective
Problem:
• Traditional SOTA computer vision models rely on labeled datasets with fixed categories.
• This reliance limits their generality and flexibility.
Proposed Solution:
• A method that enables flexible, scalable learning from natural language.
• Learning directly from raw text (natural descriptions) paired with images is a promising alternative.
Background and Motivating Work
• Inspired by NLP breakthroughs (e.g., GPT-3)
• NLP and Vision Integration
• Pre-training methods which learn directly from raw, unstructured text
• Previous Works: VirTex and ConVIRT combined vision and NLP but did not reach the scale and flexibility of CLIP.
CLIP: Contrastive Language-Image Pre-Training
• CLIP - A Promising Solution to Traditional Model Limitations
• Developed to overcome these limitations of traditional computer vision systems
• Uses natural language descriptions as a form of supervision
• Learns from a large dataset of image-text pairs
• Does not rely on labeled datasets with predefined classes
• Offers Generalization, Scalability, and Zero-Shot Transfer Capabilities
Key Concept - Natural Language Supervision
CLIP leverages text descriptions paired with images to learn visual representations. Instead of labeled datasets with fixed classes, it uses freely available text as supervision.
This enables CLIP to learn a vast range of visual concepts without requiring traditional, manually labeled data.
Approach Overview
The CLIP Framework: CLIP's framework involves pre-training on (image, text) pairs using a contrastive objective.
Steps Involved:
1) Image Encoder: Uses ResNet or ViT
2) Text Encoder: Transformer-based text embeddings.
3) Contrastive Objective: Learns by maximizing similarity between correct image-text pairs and minimizing it for incorrect ones.
Given a batch of N (image, text) pairs, CLIP aims to identify the N correct pairings among the N × N possible pairs (see the contrastive-loss sketch below).
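To make the contrastive objective concrete, here is a minimal PyTorch sketch of the symmetric loss, adapted from the pseudocode in the paper. The encoder outputs are placeholders, and the temperature is fixed for brevity (the paper learns it as a parameter).

```python
# Minimal sketch of CLIP's symmetric contrastive loss (PyTorch).
# The fixed temperature is a simplification; the paper learns it.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: [N, d] outputs of the two encoders."""
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix; entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # The correct text for image i sits at index i (and vice versa)
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric cross-entropy: image-to-text over rows, text-to-image over columns
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```

Each row of the logits matrix scores one image against all N texts and each column scores one text against all N images, which is why the loss averages both directions.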
Summary of Approach
Data and Dataset Creation
Dataset Size: Built a new dataset of 400 million (image, text) pairs from the internet...
Balance and Scale: Limited to 20,000 pairs per query
Unique Dataset: Diverse and internet-scale data, facilitating broad generalization.
Scalability: Unlike MS-COCO and ImageNet, CLIP’s dataset isn’t manually labeled, increasing scalability.
Training Strategy
Contrastive Learning: Predict correct image-text pairs...
CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of correct pairs (see the training-step sketch below).
Objective: Maximize similarity for correct pairs, minimize it for incorrect ones.
Contrastive Objective: Key to improving learning efficiency.
Image/Text Embeddings: These enable the “zero-shot” applications.
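As a rough illustration of the joint training described above, the sketch below runs one optimization step over both encoders, reusing the clip_contrastive_loss from the earlier sketch. The encoder objects, optimizer settings, and input names are illustrative assumptions, not the paper's exact training configuration.

```python
# Sketch of one joint training step over both encoders.
# `image_encoder`, `text_encoder`, and the learning rate are placeholders.
import torch

params = list(image_encoder.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def train_step(images, tokenized_texts):
    image_features = image_encoder(images)          # [N, d]
    text_features = text_encoder(tokenized_texts)   # [N, d]
    loss = clip_contrastive_loss(image_features, text_features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```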
Zero-Shot Learning and Transferability
Zero-Shot Concept: CLIP performs tasks without fine-tuning...
• Designed to generalize across tasks with zero-shot learning
Task Examples: OCR, Action Recognition, Geo-Localization
Impact: Achieves zero-shot ImageNet performance comparable to traditional supervised models (see the classification sketch below).
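To illustrate how the learned embeddings enable zero-shot transfer, here is a hedged sketch of zero-shot classification: class names are turned into prompts, embedded by the text encoder, and compared to image embeddings by cosine similarity. The encoder and tokenizer names, the class list, and the prompt template are illustrative assumptions.

```python
# Minimal sketch of zero-shot classification with trained CLIP encoders.
# `image_encoder`, `text_encoder`, `tokenize`, and `images` are assumed
# to be available; class names and the prompt template are illustrative.
import torch
import torch.nn.functional as F

class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    # One text embedding per class acts as a zero-shot linear classifier weight
    text_features = F.normalize(text_encoder(tokenize(prompts)), dim=-1)  # [C, d]
    image_features = F.normalize(image_encoder(images), dim=-1)           # [N, d]

# Cosine similarity of each image to each class prompt; highest score wins
logits = image_features @ text_features.t()   # [N, C]
predictions = logits.argmax(dim=-1)           # indices into class_names
```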
Experimental Setup and Results
Benchmarking: Over 30 datasets including ImageNet, for tasks such as classification, OCR, and geo-localization...
Performance:
• Zero-shot ImageNet Performance: Comparable to a fully supervised ResNet-50.
• Cross-Task Transfer: Strong performance without fine-tuning on specific datasets.
Versatile Benchmarks: First model to achieve this scale and versatility in a zero-shot setting.
Efficiency and Scalability
• Scaling: Smooth scaling of performance as tested with different model sizes (ResNet-50 to Vision Transformers).
• Efficiency: The contrastive learning framework significantly improves training efficiency.
• Takeaway: CLIP scales smoothly across model sizes, maintaining high performance.
Robustness and Analysis
Robustness: Evaluates adaptability to new data distributions...
Robustness to Distribution Shifts: High adaptability to diverse datasets and real-world scenarios.
Strong Zero-Shot Performance: Often outperforms few-shot alternatives.
Prompt Engineering and Ensembling
Prompt Engineering: Adds context to CLIP’s interpretations... (e.g., adding “A photo of a…”)
Ensembling: Improves zero-shot performance by combining several different prompts per class (see the ensembling sketch below).
Results: Prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average.
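The sketch below shows one common way to implement prompt ensembling under the same assumptions as the earlier sketches: each class name is embedded under several templates and the normalized embeddings are averaged into a single classifier weight per class. The specific templates are illustrative, not the paper's exact list.

```python
# Sketch of prompt ensembling: average normalized text embeddings over
# multiple templates per class. `text_encoder` and `tokenize` are placeholders.
import torch
import torch.nn.functional as F

templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a drawing of a {}.",
]

def build_zero_shot_classifier(class_names):
    weights = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in templates]
            emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)  # [T, d]
            mean_emb = F.normalize(emb.mean(dim=0), dim=-1)             # [d]
            weights.append(mean_emb)
    return torch.stack(weights, dim=0)  # [C, d]; rows serve as class weights
```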
Limitations and Challenges
CLIP’s limitations include potential data bias and challenges with complex tasks.
• Data Bias: Internet-sourced data can introduce biases...
• Complex Tasks: Underperformance in specialized domains (e.g., medical images)
Conclusions
CLIP represents a shift towards scalable, flexible vision models.
Future Directions:
• Improving Robustness: More robust models across domains.
• Specialization: Enhanced handling of complex tasks.
Q&A: Questions and feedback?

Editor's Notes

  • #1 - Today, I’ll be presenting a paper titled Learning Transferable Visual Models from Natural Language Supervision, commonly known as CLIP. This research was prepared by researchers at OpenAI.
  • #2 Traditional computer vision models rely on labeled datasets with fixed categories, which limits their generalizability. The reliance on labeled data constrains flexibility, making it difficult for models to adapt to new visual concepts or tasks. CLIP aims to solve this by learning directly from natural language descriptions paired with images.
  • #3 Recent breakthroughs in NLP, like GPT-3, show that models trained on vast text data without task-specific labels can handle diverse tasks. CLIP builds on this idea of learning from raw, unstructured text, using text descriptions of images for broader understanding. Models like VirTex and ConVIRT laid the groundwork for combining vision and language but did not reach the scale or flexibility of CLIP.
  • #4 CLIP is designed to overcome these limitations by using natural language as a form of supervision. Instead of relying on fixed-label datasets, CLIP can learn from any image with a paired text description. It offers three main benefits: Generalization (it can handle various tasks), Scalability (it scales with internet data), and Zero-Shot Learning (it can perform tasks without additional training).
  • #5 Natural language supervision means that CLIP learns directly from text descriptions associated with images, bypassing the need for specific labels. This enables it to learn a wide range of visual concepts beyond fixed categories. Show the previous image to illustrate how the text itself acts as the “label.”
  • #6 CLIP’s framework involves pre-training on image-text pairs using a method called contrastive learning, where it tries to match the right image with the right caption in a batch. CLIP has an image encoder and a text encoder (transformer-based), which both map images and text into a shared embedding space. It learns by maximizing similarity between correct pairs and minimizing it for incorrect ones.
  • #7 While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.
  • #8 The authors built a large dataset of 400 million (image, text) pairs from the internet to ensure diversity. Coverage and balance are important to avoid over-representation of specific concepts (limited to 20,000 pairs per query). This scale helps CLIP learn more flexibly and generalize well across tasks.
  • #9 In CLIP’s training, contrastive learning allows it to identify the correct image-text pairs within each batch. This process creates a shared multi-modal embedding space, where related images and text are close to each other. The cosine similarity measure is used to align correct pairs and separate incorrect pairs.
  • #10 One of CLIP’s major strengths is its zero-shot transferability, which means it can handle tasks it wasn’t specifically trained for, such as OCR or action recognition. The impact of this is significant, as CLIP achieved zero-shot performance comparable to traditional SOTA supervised models.
  • #11 CLIP was benchmarked across 30+ datasets and tasks, where it often matched or exceeded the performance of traditional models. The ImageNet zero-shot performance was a key result, where CLIP performed comparably to a fully supervised ResNet-50.
  • #12 As the model size increases from ResNet-50 to Vision Transformers, CLIP’s performance scales smoothly, maintaining high accuracy across tasks. Figure for efficiency.
  • #13 Robustness basically means adaptability across tasks. CLIP’s robustness was tested by evaluating its adaptability to diverse datasets and handling real-world variations. Its strength under domain shifts means it often outperforms few-shot learning models.
  • #14 This basically means improving performance with prompts. Prompt engineering, such as adding contextual phrases like ‘A photo of...’, significantly boosts CLIP’s accuracy on specific tasks. Ensembling combines several different prompts for each class rather than relying on a single one.
  • #15 CLIP, while powerful, has some limitations, such as data bias from internet-sourced images and struggles with very specialized domains, like medical imagery. There are also ethical concerns due to biases in internet-sourced data.
  • #16 - CLIP represents a major step forward for vision models, demonstrating scalability and flexibility with natural language supervision.