April 2022
Sujit Pal (ORCID ID: https://orcid.org/0000-0002-6225-110X)
Learning a Joint Embedding Representation for Image Search using Self-supervised means
Haystack Conference 2022
Who am I?
• Work for Elsevier Labs
• Technology Research Director
• Areas of interest: Search, NLP, ML
• Interested in improving search through
Machine Learning
• Background
• Model fine-tuning
• Image Search
• Demos
Agenda
Background
• Image Search
• Existing Approaches
− Text search on image captions
− Searching on image tags
• Embedding based search
• Unsupervised or self-supervised
Motivation
The Scream is the popular
name given to a composition
created by
Norwegian Expressionist artist
Edvard Munch in 1893. The
agonised face in the painting
has become one of the most
iconic images of art, seen as
symbolising the anxiety of
the human condition.
• Distributional Hypothesis –
words that are used or occur in
similar contexts tend to purport
similar meanings [Wikipedia]
• Word2Vec – neural network learns
weights that project words from a
sparse high dimensional
vocabulary space to a dense low
dimensional (300-D) space.
• Words that have similar meaning
tend to cluster together in this
space.
Embeddings: Origin Story
Image Credits: Explainable Artificial Intelligence (Part II) – Model Interpretation Strategies (Sarkar, 2018)
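As a minimal illustration of this clustering behavior, here is a short sketch using gensim's downloader API and pre-trained GloVe vectors (the model name and query word are illustrative assumptions, not taken from the slides):

```python
# Sketch: nearest neighbors in a pre-trained word embedding space.
import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
word_vectors = api.load("glove-wiki-gigaword-100")

# Words with similar meanings land close together in the embedding space,
# so the nearest neighbors of a word are semantically related words.
print(word_vectors.most_similar("painting", topn=5))
```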
• Same clustering effect is observed
with images as well
• Unsupervised: MNIST images
encoded using Sparse AutoEncoders
and projected to 2D show clustering
behavior.
• Supervised: Images encoded using
pre-trained ImageNet networks and
encodings used as input to
downstream classifiers. Needs
(relatively) small amounts of training
data for task.
Embeddings for Image Search
Image Credits: Deep Learning of Part-based Representation of Data using Sparse Autoencoders with Non-negativity
constraints (Hosseini et al, 2015), Learning a multi-view weighted majority vote classifier (Goyal, 2018)
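A minimal sketch of the supervised case above: a pre-trained ImageNet ResNet50 with its classification head removed serves as an image encoder, and the pooled features feed a downstream classifier. The torchvision weights enum and image path are assumptions for illustration:

```python
# Sketch: pre-trained ImageNet ResNet50 as an image encoder (image path is a placeholder).
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.IMAGENET1K_V2
resnet = models.resnet50(weights=weights)
resnet.fc = torch.nn.Identity()          # drop the classifier head, keep the 2048-d pooled features
resnet.eval()

image = weights.transforms()(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = resnet(image)            # shape (1, 2048): input to a small downstream classifier
```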
• Image Search == Image Similarity, i.e.,
a ranking problem
• Relevant results must appear above
less-relevant results
• Classification Loss (cross-entropy)
better suited to learning hyperplanes
that separate one class of images
from another.
• Contrastive Loss better suited for
ranking images by similarity
• Pairwise loss – embedding pushes
positive pairs close together and
negative pairs far apart.
The need for Contrastive Learning
Image Credit: Understanding Ranking Loss, Contrastive Loss, Margin Loss, Hinge Loss and all those confusing names (blog post by Raul Gomez)
• Uses Contrastive Loss instead of Cross-Entropy
• Given a pair of items x0 and x1 and a label y (1 if x0 and x1 are similar, 0
otherwise), contrastive loss is defined as:
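The formula on the original slide is an image; the standard pairwise contrastive loss (Hadsell et al., 2006), which matches the description given here, can be written as:

```latex
L(x_0, x_1, y) = y \, d^2 + (1 - y) \, \max(0,\; m - d)^2,
\qquad d = \lVert f(x_0) - f(x_1) \rVert_2
```

where f is the shared encoder and d is the distance between the embedded pair.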
• m is the margin for negative pairs. If a dissimilar pair is already separated by more than the margin, no additional effort is wasted trying to push it further apart.
Contrastive Learning
• No explicit labels
• Model learns from pairs of similar
or dissimilar items
• Only similar pairs need to be
provided, dissimilar pairs can
generally be inferred
• Pairs can be multi-modal – for
example, text + audio, text +
video, and text + images.
• In our case, the pairs are images + captions
Advantages of Contrastive Learning
Image Credit: Improving self-supervised representation learning by synthesizing challenging negatives (Kalantidis et al, 2020)
• My inspiration for this work.
• Trained on medical images and
captions (self-supervised)
• Siamese network
• Image Encoder: ResNet50
• Text Encoder: BERT
• Minimizes bi-directional loss
between positive image-text pairs
• ConVIRT image encoders uniformly
did better on downstream tasks
compared to previous approaches.
Previous Work: ConVIRT
Image Credit: Contrastive Learning of Medical Visual Representations from Paired Images and Texts (Zhang et al, 2020)
• Part of the HuggingFace JAX/Flax community week event
• Part of a team of "non-work colleagues" from TWIML
• Meant to train on medical images
but another team was quicker, so
trained on satellite images instead
• Placed third
• That’s how I learned about CLIP
• Blog post has more details
Fine-tuning CLIP for a different domain
Model Fine-tuning
• Contrastive Language-Image Pretraining
• Pretrained model, trained in self-supervised
manner on 400M image-caption pairs from the
Internet
• Trained on 256 GPUs for 2 weeks
• Matches original ResNet50 performance on
ImageNet
• Good zero-shot performance on "natural images"
• Architecture:
− Text Encoder: Text Transformer
− Image Encoder: Vision Transformer
• Available on HuggingFace
Our Base Model: OpenAI CLIP
Image Credit: Learning Transferable Visual Models from Natural Language Supervision (Radford et al, 2021)
• Fine-tuning == extending the
contrastive pretraining with
additional image + caption pairs
from the target domain
• For each batch of size N:
− N positive pairs
− N² − N negative pairs
• Trained using a Contrastive Objective against the positive pairs, sampling negatives from within the batch (see the sketch after this slide).
Fine-tuning CLIP
Image Credit: Learning Transferable Visual Models from Natural Language Supervision (Radford et al, 2021)
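A minimal PyTorch sketch of the fine-tuning objective described above, using the HuggingFace CLIP classes: each batch yields an N x N similarity matrix whose diagonal holds the positive pairs, and a symmetric cross-entropy loss pulls those together while pushing the in-batch negatives apart. The checkpoint name and training-step structure are illustrative assumptions, not the exact code used for this work:

```python
# Sketch: CLIP-style symmetric in-batch contrastive loss over N image+caption pairs.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

def training_step(images, captions):
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt_emb = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                  attention_mask=inputs["attention_mask"]), dim=-1)
    logits = img_emb @ txt_emb.t() * model.logit_scale.exp()   # N x N, diagonal = positives
    labels = torch.arange(len(images))
    loss = (F.cross_entropy(logits, labels) +                  # image -> text direction
            F.cross_entropy(logits.t(), labels)) / 2           # text -> image direction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```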
• ImageCLEF 2017 Caption
Prediction Task
− 164,614 training images and
captions
− 10,000 validation images and
captions
− 10,000 test images (without
caption)
• Repurposed training and
validation sets for fine-tuning on
medical images + captions.
Dataset
• Ignore provided test split for training
• Split training dataset 90:10 into training and validation
• Make the validation split the new test split
• Final data split
− 148,153 training images + captions
− 16,461 validation images + captions
− 10,000 test images + captions
Dataset Preprocessing
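A minimal sketch of the 90:10 re-split described above; the file names and column layout are assumptions for illustration:

```python
# Sketch: 90:10 re-split of the ImageCLEF training set; the provided validation
# set becomes the new test set (file paths are placeholders).
import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("imageclef2017_train_captions.tsv", sep="\t",
                       names=["image_id", "caption"])
train_split, val_split = train_test_split(train_df, test_size=0.1, random_state=42)
test_split = pd.read_csv("imageclef2017_val_captions.tsv", sep="\t",
                         names=["image_id", "caption"])
print(len(train_split), len(val_split), len(test_split))   # ~148k / ~16k / 10k
```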
• Metric: Mean Reciprocal Rank (MRR) @k (see the sketch after this slide)
• Baseline: evaluated the pre-trained CLIP model out of the box (before fine-tuning)
• Correct caption for image returned 43% of the time at first position, 57% of
the time within the first 10 positions.
Evaluation
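A minimal sketch of how MRR@k can be computed over the test set (function and variable names are illustrative):

```python
# Sketch: Mean Reciprocal Rank at k. For each image, captions are ranked by
# similarity to the image; the score is 1/rank of the correct caption if it
# appears within the top k, else 0, averaged over all test images.
def mrr_at_k(ranked_caption_ids, correct_caption_ids, k=10):
    total = 0.0
    for ranked, correct in zip(ranked_caption_ids, correct_caption_ids):
        for rank, caption_id in enumerate(ranked[:k], start=1):
            if caption_id == correct:
                total += 1.0 / rank
                break
    return total / len(correct_caption_ids)
```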
• Best training hyperparameters
− Batch size 64
− Optimizer Adam
− Learning Rate 5e-5
− Number of epochs 10
− Number of training samples 50,000
• Regularization achieved by sampling random
subset of 50k samples from 148k training
dataset
• No image or text augmentation
• Training loss was still falling while validation loss was flattening out, so there may still be some room for improvement
Model Training
• Evaluated for multiple runs against the (new) test set.
• Results show significant improvement over baseline.
Training Results
Image Search
• Encode images into L2-normalized vectors
using fine-tuned model
• Store encoded image vectors in vector store
• At query time, query text or image is encoded
using same fine-tuned model
• Vector store searched for encodings that are
“close” to the query vector
• Images and captions corresponding to the top K closest vectors are returned in the query results (see the sketch after this slide)
Image Search Architecture
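A minimal sketch of the encoding step, using the fine-tuned CLIP model through HuggingFace transformers; the fine-tuned checkpoint path is a placeholder:

```python
# Sketch: encode images and query text into L2-normalized 512-d CLIP vectors;
# these are the vectors stored in (and queried against) the vector store.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("path/to/fine-tuned-clip")      # placeholder path
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def encode_image(path):
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        vec = model.get_image_features(pixel_values=inputs["pixel_values"])
    return F.normalize(vec, dim=-1).squeeze(0)                    # unit-length 512-d vector

def encode_text(query):
    inputs = processor(text=[query], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        vec = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return F.normalize(vec, dim=-1).squeeze(0)
```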
• Exhaustive Search – match the query vector to every vector in the
database (O(N))
• Approximate Nearest Neighbors (ANN) – trade off accuracy for speed.
− Space partitioning (see blog post by Eyal Trabelsi)
o Using Forests (ANNOY)
o Using LSH
o Pre-clustering and matching centroids
o Quantization / Product Quantization
− Hierarchical Navigable Small Worlds Graph (HNSW) – hierarchy of probability
skip lists (see blog post by James Briggs)
Vector Search
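A minimal sketch contrasting exhaustive search with an HNSW index, using the hnswlib library; the library choice and index parameters are illustrative (the actual system uses Vespa's built-in HNSW support, described next):

```python
# Sketch: exhaustive O(N) search vs. approximate nearest-neighbor search with HNSW.
import numpy as np
import hnswlib

dim, n = 512, 100_000
vectors = np.float32(np.random.rand(n, dim))
query = np.float32(np.random.rand(dim))

# Exhaustive: score the query against every stored vector
scores = vectors @ query                        # cosine similarity if vectors are L2-normalized
top10_exact = np.argsort(-scores)[:10]

# Approximate: HNSW graph index trades a little recall for much faster queries
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(50)                                # higher ef -> better recall, slower queries
labels, distances = index.knn_query(query, k=10)
```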
• We chose Vespa as our vector store
• Hybrid Search platform supports both
text (BM25) and vector (HNSW) search,
allowing easy comparison of legacy and
new behavior
• Our demo application allows:
− Text to image search – using caption text,
caption vector, and hybrid
− Image to image search – using image
vector, and image vector + caption
Demo Application
• Schema (s/m/a/schemas/image.sd)
− image_id (string, properties:
summary, attribute)
− image_path (string, properties:
summary, attribute)
− caption_text (string, properties:
summary, index)
− clip_vector (tensor<float>[512],
properties: attribute, index,
HNSW)
Vespa Configuration
• Ranking Profiles
− caption-search – bm25(caption_text)
− image-search – closeness(field, clip_vector)
− combined-search – bm25(caption_text) + closeness(field, clip_vector)
• Query Vector template needs to be defined in s/m/a/search/query-profiles/types/root.xml
• Managing Vespa – I used homegrown scripts from sujitpal/vespa-poc, but it is probably better to use the official vespa-cli tool.
Vespa Configuration (cont’d)
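A minimal sketch of how the ranking profiles above might be exercised from Python through Vespa's HTTP search API; the endpoint, YQL annotation, and parameter names follow Vespa's nearest-neighbor query conventions but are assumptions rather than the exact demo code:

```python
# Sketch: retrieve the k images whose clip_vector is closest to a CLIP-encoded
# query vector, using the image-search ranking profile (endpoint is a placeholder).
import requests

def search_images(query_vector, k=10, endpoint="http://localhost:8080/search/"):
    body = {
        "yql": f"select * from sources * where "
               f"{{targetHits: {k}}}nearestNeighbor(clip_vector, query_vector)",
        "ranking.profile": "image-search",
        "input.query(query_vector)": list(query_vector),
        "hits": k,
    }
    response = requests.post(endpoint, json=body)
    response.raise_for_status()
    return response.json()["root"].get("children", [])
```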
Demos
• Demo link for clip-imageclef
− internal, need to be in Elsevier VPN
− http://10.169.23.142/clip-demo
• Demo link for clip-rsicd
− publicly available on HuggingFace spaces
− https://huggingface.co/spaces/sujitpal/clip-rsicd-demo
Demo Links
• Code for fine-tuning CLIP with ImageCLEF data and image search using
Vespa – elsevierlabs-os/clip-image-search
Resources
Thank you
sujit.pal@elsevier.com
https://twitter.com/palsujit
https://www.linkedin.com/in/sujitpal/