Webinar: Multimodal Search with CLIP (September 2022)
Searching Across Image and Text
Intro to OpenAI’s CLIP
Sujit Pal, Technology Research Director
James Briggs, Staff Developer Advocate
Raphael Pisoni, Senior Computer Vision Engineer
Webinar: Multimodal Search with CLIP (September 2022)
Agenda
● Intro to image search and CLIP
○ What is image search
○ What is CLIP
○ How to use Pinecone with CLIP
● How to use CLIP
○ CLIP application #1: Multilingual image search
○ CLIP application #2: Satellite images with JAX
● Q&A
Webinar: Multimodal Search with CLIP (September 2022)
🏦 Where is the Bank of England?
🌱 Where is the grassy bank?
🛩🛩 How does a plane bank?
How to introduce Python to children? 🐍
🐝 “the bees decided to have a mutiny against their queen”
🐝 “flying stinging insects rebelled in opposition to the matriarch”
Representing Meaning
Webinar: Multimodal Search with CLIP (September 2022)
Representing Meaning
+ Images example
+ Image <> text example
- Make clear we’re comparing semantics
A little CLIP recap:
Model developed by OpenAI.
Trained on images and captions.
Places corresponding image and
text embeddings close to each
other.
Tries to place non-matching embeddings as far apart as possible.
Enough data and training lets semantic clusters emerge.
Image credit: OpenAI
Webinar: Multimodal Search with CLIP (September 2022)
Searching with CLIP:
Use CLIP to embed images/texts
that you want to make searchable.
Add embeddings to a catalogue or
vector search database.
To retrieve similar images/texts, embed your query with CLIP and find the most similar embeddings.
As a similarity metric, use cosine similarity or more memory-efficient approximations (e.g. approximate nearest-neighbour search).
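A minimal brute-force retrieval sketch with cosine similarity (names and data are placeholders; a vector database would replace the NumPy search at scale):

import numpy as np

def cosine_top_k(query_emb, corpus_embs, k=5):
    # Normalise so a plain dot product equals cosine similarity.
    query = query_emb / np.linalg.norm(query_emb)
    corpus = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = corpus @ query           # (n,) cosine similarities
    top = np.argsort(-scores)[:k]     # indices of the k best matches
    return top, scores[top]

# corpus_embs: CLIP embeddings of the items you made searchable, shape (n, d).
# query_emb: CLIP embedding of the text or image query, shape (d,).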
Webinar: Multimodal Search with CLIP (September 2022)
CLIP embeddings:
CLIP embeddings encode some
semantic meaning.
Apart from search they can be
used as a starting point for many
other use cases:
● image captioning
● image generation
● zero-shot detection
● your idea here…
Source: https://github.com/borisdayma/dalle-mini
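For instance, the same embeddings give you zero-shot classification almost for free; a minimal sketch with the Hugging Face CLIP API (labels and image path are placeholders):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a plane"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per label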
Webinar: Multimodal Search with CLIP (September 2022)
BUT…
The original CLIP was only trained on pairs of images and English texts
crawled from the internet.
What if?
● we want to use it in a different language?
● it doesn’t know enough about our target domain?
● we want to use it with a different modality besides images and text?
Webinar: Multimodal Search with CLIP (September 2022)
CLIP Use Cases
Webinar: Multimodal Search with CLIP (September 2022)
Bringing CLIP
to the Italian Language
with Raphael Pisoni
Webinar: Multimodal Search with CLIP (September 2022)
Who am I?
Webinar: Multimodal Search with CLIP (September 2022)
CLIP does not understand your language?
We were there…
Our Goal:
Train a version of CLIP that understands Italian while preserving as much of the original performance as possible.
Giuseppe Attanasio (NLP)
Raphael Pisoni (CV)
Silvia Terragni (NLP)
Gabriele Sarti (NLP)
Sri Lakshimi (AI)
Federico Bianchi (NLP)
Webinar: Multimodal Search with CLIP (September 2022)
Here’s what we learned:
● Keep as much as possible!
● Make heavy use of pretraining!
● Data quality matters!
● Tips and Tricks!
Webinar: Multimodal Search with CLIP (September 2022)
Keep as much as possible:
The original CLIP model took two
weeks to train on 256 GPUs. Let’s
not waste all of that!
No matter what language we use,
we can always keep the amazing
Image model!
Webinar: Multimodal Search with CLIP (September 2022)
Make heavy use of pretraining:
The original CLIP text branch was
replaced with a great BERT model
that was pre-trained on Italian text
by the Bavarian State Library.
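For reference, a hedged loading sketch; the checkpoint below is one of the dbmdz Italian BERT models published by the Bavarian State Library's digital library team, and the exact checkpoint used may differ:

from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint id (dbmdz Italian BERT).
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
text_encoder = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")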
Webinar: Multimodal Search with CLIP (September 2022)
Make heavy use of pretraining (pt.2):
The reprojection layers between the
models are needed to bring the
outputs to the same dimensionality
and are initialised randomly.
That means that at the start of training we will have large and messy gradients.
Webinar: Multimodal Search with CLIP (September 2022)
Make heavy use of pretraining (pt. 3):
To avoid the initial gradients
messing with the pretrained models,
we simply pretrain the model with
frozen backbones.
This makes sure that the
reprojection layers are aligned and
the gradients are smooth before we
start fine-tuning the backbones.
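A minimal PyTorch-style sketch of this warm-up stage, assuming 512-d CLIP image features and 768-d BERT text features (the real CLIP-Italian training code is in JAX/Flax and differs in detail):

import torch
import torch.nn as nn

class ProjectionHeads(nn.Module):
    """Randomly initialised reprojection layers mapping both encoders to a shared space."""
    def __init__(self, image_dim=512, text_dim=768, shared_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, image_feats, text_feats):
        return self.image_proj(image_feats), self.text_proj(text_feats)

# Warm-up: freeze both pretrained backbones and train only the projections,
# so their random initialisation cannot disturb the pretrained weights.
# (image_encoder and text_encoder stand for the pretrained CLIP vision model and Italian BERT.)
# for p in image_encoder.parameters(): p.requires_grad_(False)
# for p in text_encoder.parameters(): p.requires_grad_(False)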
Webinar: Multimodal Search with CLIP (September 2022)
Data quality matters:
CLIP was trained on 400 Million image/caption pairs.
Compared to that, we worked in an extremely low-resource setting.
We used:
● WIT 600k
● MSCOCO-it 100k
● Conceptual Captions 700k
● La Foto del Giorno 30k
Total: ~1.4 Million Italian image/caption pairs.
No surprise: The data quality made a HUGE difference!
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 1
Don’t use AdamW!
Everybody uses AdamW, but its weight decay would shrink the pretrained weights that we want to keep as much as possible.
Use Adam or this instead:
import optax

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.01, eps=0.001),
    optax.scale_by_belief(),
    optax.scale_by_schedule(decay_lr_schedule_fn),
    optax.scale(-1.0),
)
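The slide does not show how decay_lr_schedule_fn is defined; a minimal sketch assuming a plain linear decay (the actual CLIP-Italian schedule and values may differ):

# Hypothetical values: decay the learning rate linearly from 1e-4 to 0 over 10,000 steps.
decay_lr_schedule_fn = optax.linear_schedule(
    init_value=1e-4,
    end_value=0.0,
    transition_steps=10_000,
)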
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 2
Always augment your data!
We used pretty crazy image
transforms.
We tried text augmentations but did
not use them due to lack of time.
import torch
from torchvision.transforms import (
    ColorJitter, ConvertImageDtype, InterpolationMode, RandomAffine,
    RandomAutocontrast, RandomCrop, RandomEqualize, RandomHorizontalFlip,
    RandomPerspective, Resize)

transforms = torch.nn.Sequential(
    Resize(
        [image_size],
        interpolation=InterpolationMode.BICUBIC),
    RandomCrop(
        [image_size],
        pad_if_needed=True,
        padding_mode="edge"),
    ColorJitter(hue=0.1),
    RandomHorizontalFlip(),
    RandomAffine(
        degrees=15,
        translate=(0.1, 0.1),
        scale=(0.8, 1.2),
        shear=(-15, 15, -15, 15),
        interpolation=InterpolationMode.BILINEAR,
        fill=127),
    RandomPerspective(
        distortion_scale=0.3,
        p=0.3,
        interpolation=InterpolationMode.BILINEAR,
        fill=127),
    RandomAutocontrast(p=0.3),
    RandomEqualize(p=0.3),
    ConvertImageDtype(torch.float)
)
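Because the pipeline is a torch.nn.Sequential of tensor transforms, it can be applied directly to a uint8 image tensor; a small usage sketch (the file name is a placeholder, and image_size, e.g. 224, must be defined before building the transforms above):

from torchvision.io import read_image

img = read_image("example.jpg")   # uint8 tensor of shape (C, H, W)
augmented = transforms(img)       # float tensor after the augmentations above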
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 3
Use the biggest batch size you can fit!
Common practice anyway, but even more important for CLIP!
The bigger the batch size, the more negative examples will be in the batch: a batch of N image/caption pairs yields N positives and N²-N in-batch negatives.
Alternatively, use hard negative mining (which we did not have time to do).
Webinar: Multimodal Search with CLIP (September 2022)
Tips and Tricks pt. 4
A word on the CLIP loss:
CLIP used a contrastive loss, which is somewhat controversial.
We used it with the logit scale fixed to 20.
Today there are better losses and loss variants. (example)
import jax
import jax.numpy as jnp


def cross_entropy(logits, axis):
    logprobs = jax.nn.log_softmax(logits, axis=axis)
    nll = jnp.diag(logprobs)
    ce = -jnp.mean(nll)
    return ce


def clip_loss(similarity):
    loss = (cross_entropy(similarity, axis=0) +
            cross_entropy(similarity, axis=1)) / 2
    return loss


def compute_loss(image_embeds, text_embeds, logit_scale):
    # normalized features
    image_embeds = image_embeds / jnp.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / jnp.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # cosine similarity as logits
    logit_scale = jnp.exp(logit_scale)
    logits_per_text = jnp.matmul(text_embeds, image_embeds.T) * logit_scale
    logits_per_image = logits_per_text.T
    return clip_loss(logits_per_image)
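A quick sanity check of the loss on random embeddings (shapes are illustrative; with the logit scale fixed to 20 as on the slide, pass its log because compute_loss exponentiates the argument):

key_img, key_txt = jax.random.split(jax.random.PRNGKey(0))
image_embeds = jax.random.normal(key_img, (8, 512))   # batch of 8, 512-dim embeddings
text_embeds = jax.random.normal(key_txt, (8, 512))
loss = compute_loss(image_embeds, text_embeds, jnp.log(20.0))
print(loss)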
Webinar: Multimodal Search with CLIP (September 2022)
Results:
We used the multilingual mCLIP (Reimers et al., 2020) as a baseline.
Webinar: Multimodal Search with CLIP (September 2022)
Image Retrieval (MSCOCO-it)
MRR CLIP-Italian mCLIP
MRR@1 0.3797 0.2874
MRR@5 0.5039 0.3957
MRR@10 0.5204 0.4129
0-Shot Classification (Imagenet)
Accuracy CLIP-Italian mCLIP
Acc@1 22.11 20.15
Acc@5 43.69 36.57
Acc@10 52.55 42.91
Acc@100 81.08 67.11
Bonus Result:
● 0-Shot Object Localization
Webinar: Multimodal Search with CLIP (September 2022)
Demo Time!
Webinar: Multimodal Search with CLIP (September 2022)
Hugging Face Space
GitHub Repo
CLIP for Satellite Image Search
with Sujit Pal
Webinar: Multimodal Search with CLIP (September 2022)
Who am I?
● Work for Elsevier Labs
● Technology Research Director
● Areas of interest: Search, NLP
and ML
● My main focus is improving
search through Machine
Learning.
Webinar: Multimodal Search with CLIP (September 2022)
The Project
● Objective – fine-tune OpenAI CLIP for satellite images and use it to power an image search application
● Team TWIML
Artashes Arutiunian (Arto) @arampacha
Dev Vidhani @devv
Goutham Venkatesh @goutham794
Mayank Bhaskar @cataluna84
Ritobrota Ghosh (Rito) @ghosh-r
Sujit Pal @sujitpal
Webinar: Multimodal Search with CLIP (September 2022)
Why fine-tune?
● CLIP is very good at identifying “natural” images, but doesn’t do too well with
images from specialized domains.
Recall @1 Recall @3 Recall @5 Recall @10
Baseline (OpenAI CLIP on HF) 0.572 0.745 0.837 0.939
Our final model 0.883 0.968 0.982 0.998
Webinar: Multimodal Search with CLIP (September 2022)
Dataset
● RSICD Dataset for remote sensing image captioning task (repo)
● Approximately 10k (224 x 224) pixel images, each with 5 single-sentence captions
● Augmented with additional datasets:
○ UCM Dataset (repo)
○ Sydney Dataset (repo)
Webinar: Multimodal Search with CLIP (September 2022)
Data Augmentation
Caption Augmentation (back-translation):
"image of a Baltimore ballpark" → "image d'un terrain de balle à Baltimore" (French) → "image of a ball field in Baltimore"
Image Augmentation
Webinar: Multimodal Search with CLIP (September 2022)
Fine tuning
● Batch size: 1024 (128 x 8 TPU cores)
● Best model
○ ADAM Optimizer
○ LR 5e-6 with linear decay
● For each batch:
○ N positive examples
○ N²-N negative examples
● Uses contrastive loss (OpenAI default)
● Download model from HuggingFace: (our best model); see the loading sketch below
● Access training code on GitHub: (arampacha/CLIP-RSICD)
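The fine-tuned checkpoint can be loaded with the standard Hugging Face CLIP classes; a minimal sketch, assuming the published best model is flax-community/clip-rsicd-v2 (follow the links on the slide for the exact repo):

from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint id for the fine-tuned satellite-image CLIP.
model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")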
Webinar: Multimodal Search with CLIP (September 2022)
Applying fine-tuned model
● Application
○ Text to Image
○ Image to Image
○ Identify landmarks in image
● Index time
○ Compute image embeddings for all images in corpus using fine-tuned CLIP model
○ Store in vector database (Pinecone or similar; see the sketch after this list)
● Query time
○ Compute embedding of text or image query
○ Find approximate nearest neighbors in index and return ordered by nearness
● Demo link
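A rough index/query sketch with the Pinecone Python client as it looked around the time of this webinar (API key, environment, index name, and the embedding variables are placeholders; dimension 512 assumes a ViT-B/32 CLIP backbone):

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("clip-satellite", dimension=512, metric="cosine")
index = pinecone.Index("clip-satellite")

# Index time: upsert (id, embedding) pairs produced by the fine-tuned CLIP model.
index.upsert(vectors=[("img-0001", image_embedding.tolist())])

# Query time: embed the text or image query with CLIP, then fetch nearest neighbours.
results = index.query(vector=query_embedding.tolist(), top_k=10)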
Demo
Webinar: Multimodal Search with CLIP (September 2022)
Pinecone.io
▪ Fully managed vector database
▪ Free and paid subscriptions for millions to billions of vectors
Pinecone.io/learn
▪ Vector Search in the Wild
▪ Embedding Models for Image Search
Our Community
Get Started
CLIP and Image Search
■ HuggingFace CLIP Documentation
■ OpenAI CLIP Documentation
■ Sujit’s Satellite Image use case
■ Raphael’s CLIP-Italian use case
Follow Raphael and Sujit on Twitter