Webinar: Multimodal Search with CLIP (September 2022)
Searching Across Image and Text
Intro to OpenAI’s CLIP
Sujit Pal, Technology Research Director
James Briggs, Staff Developer Advocate
Raphael Pisoni, Senior Computer Vision Engineer
Webinar: Multimodal Search with CLIP (September 2022)
Agenda
● Intro to image search and CLIP
○ What is image search
○ What is CLIP
○ How to use Pinecone with CLIP
● How to use CLIP
○ CLIP application #1: Multilingual image search
○ CLIP application #2: Satellite images with JAX
● Q&A
Webinar: Multimodal Search with CLIP (September 2022)
🏦 Where is the Bank of England?
🌱 Where is the grassy bank?
🛩🛩 How does a plane bank?
How to introduce Python to children? 🐍
🐝 “the bees decided to have a mutiny against their queen”
🐝 “flying stinging insects rebelled in opposition to the matriarch”
Representing Meaning
Webinar: Multimodal Search with CLIP (September 2022)
Representing Meaning
+ Images example
+ Image <> text example
- Make clear we’re comparing semantics
A little CLIP recap:
Model developed by OpenAI.
Trained on images and captions.
Places corresponding image and
text embeddings close to each
other.
Tries to place non-matching embeddings as far apart as possible.
Enough data and training lets semantic clusters emerge.
Image credit: OpenAI
Webinar: Multimodal Search with CLIP (September 2022)
Searching with CLIP:
Use CLIP to embed images/texts
that you want to make searchable.
Add embeddings to a catalogue or
vector search database.
To retrieve similar images/texts, embed your query with CLIP and find the most similar embeddings.
As a similarity metric, use cosine similarity or more memory-efficient approximations (e.g. approximate nearest-neighbour search).
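A minimal brute-force retrieval sketch with cosine similarity (names and data are placeholders; a vector database would replace the NumPy search at scale):

import numpy as np

def cosine_top_k(query_emb, corpus_embs, k=5):
    # Normalise so a plain dot product equals cosine similarity.
    query = query_emb / np.linalg.norm(query_emb)
    corpus = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = corpus @ query           # (n,) cosine similarities
    top = np.argsort(-scores)[:k]     # indices of the k best matches
    return top, scores[top]

# corpus_embs: CLIP embeddings of the items you made searchable, shape (n, d).
# query_emb: CLIP embedding of the text or image query, shape (d,).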
Webinar: Multimodal Search with CLIP (September 2022)
CLIP embeddings:
CLIP embeddings encode some
semantic meaning.
Apart from search they can be
used as a starting point for many
other use cases:
● image captioning
● image generation
● zero-shot detection
● your idea here…
Source: https://github.com/borisdayma/dalle-mini
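For instance, the same embeddings give you zero-shot classification almost for free; a minimal sketch with the Hugging Face CLIP API (labels and image path are placeholders):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a plane"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per label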
Webinar: Multimodal Search with CLIP (September 2022)
BUT…
The original CLIP was only trained on pairs of images and English texts
crawled from the internet.
What if?
● we want to use it in a different language?
● it doesn’t know enough about our target domain?
● we want to use it with a different modality besides images and text?
Webinar: Multimodal Search with CLIP (September 2022)
CLIP Use Cases
Webinar: Multimodal Search with CLIP (September 2022)
Bringing CLIP
to the Italian Language
with Raphael Pisoni
Webinar: Multimodal Search with CLIP (September 2022)
Who am I?
Webinar: Multimodal Search with CLIP (September 2022)
CLIP does not understand your language?
We were there…
Our Goal:
Train a version of CLIP that understands Italian while preserving as much of the original performance as possible.
Giuseppe Attanasio (NLP)
Raphael Pisoni (CV)
Silvia Terragni (NLP)
Gabriele Sarti (NLP)
Sri Lakshimi (AI)
Federico Bianchi (NLP)
Webinar: Multimodal Search with CLIP (September 2022)
Here’s what we learned:
● Keep as much as possible!
● Make heavy use of pretraining!
● Data quality matters!
● Tips and Tricks!
Webinar: Multimodal Search with CLIP (September 2022)
Keep as much as possible:
The original CLIP model took two
weeks to train on 256 GPUs. Let’s
not waste all of that!
No matter what language we use,
we can always keep the amazing
Image model!
Webinar: Multimodal Search with CLIP (September 2022)
Make heavy use of pretraining:
The original CLIP text branch was
replaced with a great BERT model
that was pre-trained on Italian text
by the Bavarian State Library.
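For reference, a hedged loading sketch; the checkpoint below is one of the dbmdz Italian BERT models published by the Bavarian State Library's digital library team, and the exact checkpoint used may differ:

from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint id (dbmdz Italian BERT).
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
text_encoder = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")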
Webinar: Multimodal Search with CLIP (September 2022)
Make heavy use of pretraining (pt.2):
The reprojection layers between the
models are needed to bring the
outputs to the same dimensionality
and are initialised randomly.
That means that at the start of training we will have large and messy gradients.
Webinar: Multimodal Search with CLIP (September 2022)
Make heavy use of pretraining (pt. 3):
To avoid the initial gradients
messing with the pretrained models,
we simply pretrain the model with
frozen backbones.
This makes sure that the
reprojection layers are aligned and
the gradients are smooth before we
start fine-tuning the backbones.
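A minimal PyTorch-style sketch of this warm-up stage, assuming 512-d CLIP image features and 768-d BERT text features (the real CLIP-Italian training code is in JAX/Flax and differs in detail):

import torch
import torch.nn as nn

class ProjectionHeads(nn.Module):
    """Randomly initialised reprojection layers mapping both encoders to a shared space."""
    def __init__(self, image_dim=512, text_dim=768, shared_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, image_feats, text_feats):
        return self.image_proj(image_feats), self.text_proj(text_feats)

# Warm-up: freeze both pretrained backbones and train only the projections,
# so their random initialisation cannot disturb the pretrained weights.
# (image_encoder and text_encoder stand for the pretrained CLIP vision model and Italian BERT.)
# for p in image_encoder.parameters(): p.requires_grad_(False)
# for p in text_encoder.parameters(): p.requires_grad_(False)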
Webinar: Multimodal Search with CLIP (September 2022)
Data quality matters:
CLIP was trained on 400 Million image/caption pairs.
Compared to that, we worked in an extremely low-resource setting.
We used:
● WIT 600k
● MSCOCO-it 100k
● Conceptual Captions 700k
● La Foto del Giorno 30k
Total: ~1.4 Million Italian image/caption pairs.
No surprise: The data quality made a HUGE difference!
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 1
Don’t use AdamW!
Everybody uses AdamW, but its weight decay would shrink the pretrained weights that we want to keep as much as possible.
Use Adam or this instead:
import optax

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.01, eps=0.001),
    optax.scale_by_belief(),
    optax.scale_by_schedule(decay_lr_schedule_fn),
    optax.scale(-1.0),
)
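The slide does not show how decay_lr_schedule_fn is defined; a minimal sketch assuming a plain linear decay (the actual CLIP-Italian schedule and values may differ):

# Hypothetical values: decay the learning rate linearly from 1e-4 to 0 over 10,000 steps.
decay_lr_schedule_fn = optax.linear_schedule(
    init_value=1e-4,
    end_value=0.0,
    transition_steps=10_000,
)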
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 2
Always augment your data!
We used pretty crazy image
transforms.
We tried text augmentations but did
not use them due to lack of time.
import torch
from torchvision.transforms import (
    ColorJitter, ConvertImageDtype, InterpolationMode, RandomAffine,
    RandomAutocontrast, RandomCrop, RandomEqualize, RandomHorizontalFlip,
    RandomPerspective, Resize)

transforms = torch.nn.Sequential(
    Resize(
        [image_size],
        interpolation=InterpolationMode.BICUBIC),
    RandomCrop(
        [image_size],
        pad_if_needed=True,
        padding_mode="edge"),
    ColorJitter(hue=0.1),
    RandomHorizontalFlip(),
    RandomAffine(
        degrees=15,
        translate=(0.1, 0.1),
        scale=(0.8, 1.2),
        shear=(-15, 15, -15, 15),
        interpolation=InterpolationMode.BILINEAR,
        fill=127),
    RandomPerspective(
        distortion_scale=0.3,
        p=0.3,
        interpolation=InterpolationMode.BILINEAR,
        fill=127),
    RandomAutocontrast(p=0.3),
    RandomEqualize(p=0.3),
    ConvertImageDtype(torch.float)
)
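Because the pipeline is a torch.nn.Sequential of tensor transforms, it can be applied directly to a uint8 image tensor; a small usage sketch (the file name is a placeholder, and image_size, e.g. 224, must be defined before building the transforms above):

from torchvision.io import read_image

img = read_image("example.jpg")   # uint8 tensor of shape (C, H, W)
augmented = transforms(img)       # float tensor after the augmentations above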
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 3
Use the biggest batch size you can fit!
Common practice anyway, but even more important for CLIP!
The bigger the batch size, the more negative examples will be in the batch: a batch of N image/caption pairs yields N positives and N²-N in-batch negatives.
Alternatively, use hard negative mining (which we did not have time to do).
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 4
A word on the CLIP loss:
CLIP used a contrastive loss, which is somewhat controversial.
We used it with the logit scale fixed to 20.
Today there are better losses and loss variants. (example)
import jax
import jax.numpy as jnp


def cross_entropy(logits, axis):
    logprobs = jax.nn.log_softmax(logits, axis=axis)
    nll = jnp.diag(logprobs)
    ce = -jnp.mean(nll)
    return ce


def clip_loss(similarity):
    loss = (cross_entropy(similarity, axis=0) +
            cross_entropy(similarity, axis=1)) / 2
    return loss


def compute_loss(image_embeds, text_embeds, logit_scale):
    # normalized features
    image_embeds = image_embeds / jnp.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / jnp.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # cosine similarity as logits
    logit_scale = jnp.exp(logit_scale)
    logits_per_text = jnp.matmul(text_embeds, image_embeds.T) * logit_scale
    logits_per_image = logits_per_text.T
    return clip_loss(logits_per_image)
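A quick sanity check of the loss on random embeddings (shapes are illustrative; with the logit scale fixed to 20 as on the slide, pass its log because compute_loss exponentiates the argument):

key_img, key_txt = jax.random.split(jax.random.PRNGKey(0))
image_embeds = jax.random.normal(key_img, (8, 512))   # batch of 8, 512-dim embeddings
text_embeds = jax.random.normal(key_txt, (8, 512))
loss = compute_loss(image_embeds, text_embeds, jnp.log(20.0))
print(loss)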
Webinar: Multimodal Search with CLIP (September 2022)
Results:
We used the multilingual mCLIP (Reimers et al., 2020) as a baseline.
Webinar: Multimodal Search with CLIP (September 2022)
Image Retrieval (MSCOCO-it)
MRR CLIP-Italian mCLIP
MRR@1 0.3797 0.2874
MRR@5 0.5039 0.3957
MRR@10 0.5204 0.4129
0-Shot Classification (Imagenet)
Accuracy CLIP-Italian mCLIP
Acc@1 22.11 20.15
Acc@5 43.69 36.57
Acc@10 52.55 42.91
Acc@100 81.08 67.11
Bonus Result:
● 0-Shot Object Localization
Webinar: Multimodal Search with CLIP (September 2022)
Demo Time!
Webinar: Multimodal Search with CLIP (September 2022)
Hugging Face Space
GitHub Repo
CLIP for Satellite Image Search
with Sujit Pal
Webinar: Multimodal Search with CLIP (September 2022)
Who am I?
● Work for Elsevier Labs
● Technology Research Director
● Areas of interest: Search, NLP
and ML
● My main focus is improving
search through Machine
Learning.
Webinar: Multimodal Search with CLIP (September 2022)
The Project
● Objective – fine-tune OpenAI CLIP for satellite images and use it to power an image search application
● Team TWIML
Artashes Arutiunian (Arto) @arampacha
Dev Vidhani @devv
Goutham Venkatesh @goutham794
Mayank Bhaskar @cataluna84
Ritobrota Ghosh (Rito) @ghosh-r
Sujit Pal @sujitpal
Webinar: Multimodal Search with CLIP (September 2022)
Why fine-tune?
● CLIP is very good at identifying “natural” images, but doesn’t do too well with
images from specialized domains.
Recall @1 Recall @3 Recall @5 Recall @10
Baseline (OpenAI CLIP on HF) 0.572 0.745 0.837 0.939
Our final model 0.883 0.968 0.982 0.998
Webinar: Multimodal Search with CLIP (September 2022)
Dataset
● RSICD Dataset for remote sensing image captioning task (repo)
● Approximately 10k (224 x 224) pixel images, each with 5 single-sentence captions
● Augmented with additional datasets:
○ UCM Dataset (repo)
○ Sydney Dataset (repo)
Webinar: Multimodal Search with CLIP (September 2022)
Data Augmentation
Caption Augmentation (back-translation):
"image of a Baltimore ballpark" → "image d'un terrain de balle à Baltimore" (French) → "image of a ball field in Baltimore"
Image Augmentation
Webinar: Multimodal Search with CLIP (September 2022)
Fine tuning
● Batch size: 1024 (128 x 8 TPU cores)
● Best model
○ ADAM Optimizer
○ LR 5e-6 with linear decay
● For each batch:
○ N positive examples
○ N²-N negative examples
● Uses contrastive loss (OpenAI default)
● Download model from HuggingFace: (our best model); see the loading sketch below
● Access training code on GitHub: (arampacha/CLIP-RSICD)
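The fine-tuned checkpoint can be loaded with the standard Hugging Face CLIP classes; a minimal sketch, assuming the published best model is flax-community/clip-rsicd-v2 (follow the links on the slide for the exact repo):

from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint id for the fine-tuned satellite-image CLIP.
model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")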
Webinar: Multimodal Search with CLIP (September 2022)
Applying fine-tuned model
● Application
○ Text to Image
○ Image to Image
○ Identify landmarks in image
● Index time
○ Compute image embeddings for all images in corpus using fine-tuned CLIP model
○ Store in vector database (Pinecone or similar; see the sketch after this list)
● Query time
○ Compute embedding of text or image query
○ Find approximate nearest neighbors in index and return ordered by nearness
● Demo link
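A rough index/query sketch with the Pinecone Python client as it looked around the time of this webinar (API key, environment, index name, and the embedding variables are placeholders; dimension 512 assumes a ViT-B/32 CLIP backbone):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("clip-satellite", dimension=512, metric="cosine")
index = pinecone.Index("clip-satellite")

# Index time: upsert (id, embedding) pairs produced by the fine-tuned CLIP model.
index.upsert(vectors=[("img-0001", image_embedding.tolist())])

# Query time: embed the text or image query with CLIP, then fetch nearest neighbours.
results = index.query(vector=query_embedding.tolist(), top_k=10)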
Demo
Webinar: Multimodal Search with CLIP (September 2022)
Pinecone.io
▪ Fully managed vector database
▪ Free and paid subscriptions for millions to billions of vectors
Pinecone.io/learn
▪ Vector Search in the Wild
▪ Embedding Models for Image Search
Our Community
Get Started
CLIP and Image Search
■ HuggingFace CLIP Documentation
■ OpenAI CLIP Documentation
■ Sujit’s Satellite Image use case
■ Raphael’s CLIP-Italian use case
Follow Raphael and Sujit on Twitter