Polysemous Visual-Semantic Embedding
for Cross-Modal Retrieval
Ruijie Quan 2019/07/13
MOTIVATION
21.07.2019
Most current methods learn injective embedding functions that map each
instance to a single point in the shared space.
Drawback:
They cannot effectively handle polysemous instances, even though individual
instances and their cross-modal associations are often ambiguous in
real-world scenarios.
CONTRIBUTIONS
1. Introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute
multiple and diverse representations of an instance by combining global context
with locally-guided features via multi-head self-attention and residual learning.
2. Tackle the more challenging case of video-text retrieval.
3. Collect a new dataset of 50K video-sentence pairs from social media, dubbed
MRW (my reaction when).
INTRODUCTION
Injective embedding can be problematic:
1. Ambiguity in individual instances,
e.g., polysemous words and images containing multiple objects.
2. Partial cross-domain association,
e.g., a sentence may describe only certain regions of an image, and
a video may contain extra frames not described by its associated sentence.
INTRODUCTION
Address the above issues by:
1. formulating instance embedding as a one-to-many mapping task
2. optimizing the mapping functions to be robust to ambiguous instances and
partial cross-modal associations
APPROACH
To address the issue of ambiguous instances:
Propose a novel one-to-many instance embedding model, the Polysemous Instance
Embedding Network (PIE-Net), which
• extracts K embeddings of each instance by combining global and local information
of its input
• obtains K locally-guided representations by attending to different parts of an input
instance (e.g., regions, frames, words) using a multi-head self-attention module
• combines each local representation with the global representation via residual
learning to avoid learning redundant information
• regularizes the K locally-guided representations to be diverse
APPROACH
To address the partial association issue:
Polysemous Visual-Semantic Embedding (PVSE) ties two PIE-Nets together and trains
the model in the multiple-instance learning (MIL) framework.
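A minimal sketch of how MIL-style matching between two sets of K embeddings could look. It assumes the similarity of a pair is the best cosine similarity among the K x K combinations and uses a hinge-based triplet loss; the names `set_similarity` and `mil_triplet_loss` and the margin value are hypothetical and not taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def set_similarity(vis_embs, txt_embs):
    """Score a pair by its best-matching embeddings (assumption: max over K x K)."""
    return max(cosine(v, s) for v in vis_embs for s in txt_embs)

def mil_triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss: the matching pair should beat the non-matching pair by `margin`."""
    return max(0.0, margin - set_similarity(anchor, positive)
               + set_similarity(anchor, negative))

# Toy example with K = 2 video embeddings and K = 1 sentence embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
matching_sentence = [[0.0, 2.0]]        # aligns with the second video embedding
unrelated_sentence = [[-1.0, -1.0]]     # aligns with neither video embedding
confusing_sentence = [[1.0, 0.0]]       # aligns with the first video embedding

easy_loss = mil_triplet_loss(video, matching_sentence, unrelated_sentence)
hard_loss = mil_triplet_loss(video, matching_sentence, confusing_sentence)
```

Because only the best pair contributes, a sentence that describes part of a video can still match it through a single embedding, which is the point of treating the K embeddings as a bag.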
APPROACH
1. Modality-Specific Feature Encoder
Image Encoder (ResNet-152)
• local features: the feature map before the final average pooling layer
• global features: apply average pooling and feed the output to one
fully-connected layer
Video Encoder (ResNet-152)
• local features: the 2048-dim outputs from the final average pooling layer
• global features: feed the local features Ψ(x) into a bidirectional GRU (bi-GRU)
with H hidden units, and take the final hidden states
Sentence Encoder (GloVe pretrained on the CommonCrawl dataset)
• local features: the L 300-dim word vectors
• global features: feed them into a bi-GRU with H hidden units, and take the
final hidden states
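A minimal sketch of the local/global split for the image encoder, assuming a tiny nested-list feature map as a stand-in for the ResNet-152 feature map before the final average pooling. The function name `image_encoder_features` and the toy weights are hypothetical; no real CNN is involved.

```python
def image_encoder_features(feature_map, fc_weight, fc_bias):
    """Split a H x W x C feature map into local and global features.

    local:  one C-dim vector per spatial position (H*W vectors)
    global: average-pool over positions, then one fully-connected layer
    """
    # Flatten the spatial grid into a list of region vectors.
    local = [cell for row in feature_map for cell in row]
    channels = len(local[0])
    # Average pooling over all spatial positions.
    pooled = [sum(vec[c] for vec in local) / len(local) for c in range(channels)]
    # Single fully-connected layer: one output per weight row.
    global_feat = [
        sum(w * p for w, p in zip(w_row, pooled)) + b
        for w_row, b in zip(fc_weight, fc_bias)
    ]
    return local, global_feat

# Toy 2 x 2 x 3 feature map and a 3 -> 2 fully-connected layer.
feature_map = [
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
    [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]],
]
fc_weight = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
fc_bias = [0.0, 1.0]

local_feats, global_feat = image_encoder_features(feature_map, fc_weight, fc_bias)
```

The video and sentence encoders follow the same split, except that the global feature comes from the final hidden states of a bi-GRU run over the local features.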
APPROACH
2. Local Feature Transformer
Multi-head self-attention
3. Feature Fusion With Residual Learning
Residual Learning
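Steps 2 and 3 can be sketched as follows. The equations did not survive on the slide, so this is one plausible form under stated assumptions: a two-layer attention (tanh, then softmax over positions) yields K attention maps, and each attended feature is projected, squashed with a sigmoid, added to the global feature, and layer-normalized. The weight names `w1`, `w2`, `w3` and all sizes are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-6):
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def pie_net(local_feats, global_feat, w1, w2, w3):
    """local_feats: N x D, global_feat: E, w1: D x A, w2: A x K, w3: D x E."""
    # Step 2 (assumed form): K attention maps over the N local positions.
    hidden = [[math.tanh(v) for v in row] for row in matmul(local_feats, w1)]
    scores = matmul(hidden, w2)                             # N x K
    attn = [softmax(list(col)) for col in zip(*scores)]     # K x N, rows sum to 1
    attended = matmul(attn, local_feats)                    # K x D
    # Step 3 (assumed form): residual fusion with the global feature.
    projected = matmul(attended, w3)                        # K x E
    embeddings = [
        layer_norm([g + 1.0 / (1.0 + math.exp(-p)) for g, p in zip(global_feat, row)])
        for row in projected
    ]
    return embeddings, attn

# Toy sizes: N positions, D-dim local features, A attention units, K heads, E-dim output.
random.seed(0)
N, D, A, K, E = 5, 8, 6, 3, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

local = rand_matrix(N, D)
global_vec = [random.gauss(0.0, 1.0) for _ in range(E)]
embs, attn = pie_net(local, global_vec, rand_matrix(D, A), rand_matrix(A, K), rand_matrix(D, E))
```

Adding the attended features to the global feature, rather than replacing it, keeps every one of the K embeddings anchored to the same global context while each head contributes a different local emphasis.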
APPROACH
4. Optimization and Inference
Three loss terms:
• MIL loss
• Diversity loss
• Domain discrepancy loss
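The loss formulas are missing from the slide. As an illustration of the diversity term only, here is a minimal sketch that assumes the penalty is the Frobenius norm of (A A^T - I) over the K attention maps, a common choice for self-attention heads; whether the slides apply it to attention maps or to the K representations is not recoverable here. The name `diversity_penalty` is hypothetical, and the domain discrepancy loss is not covered.

```python
import math

def diversity_penalty(attn):
    """Frobenius norm of (A A^T - I) for a K x N attention matrix A.

    The penalty is zero when the K heads attend to disjoint positions
    and grows as heads overlap.
    """
    k = len(attn)
    total = 0.0
    for i in range(k):
        for j in range(k):
            gram = sum(a * b for a, b in zip(attn[i], attn[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return math.sqrt(total)

# Two heads looking at different positions versus two identical heads.
disjoint_heads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
identical_heads = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

penalty_disjoint = diversity_penalty(disjoint_heads)
penalty_identical = diversity_penalty(identical_heads)
```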
MRW DATASET
EXPERIMENTS
The number of embeddings K:
There is a significant improvement from K = 0 to K = 1, which shows the
effectiveness of the Local Feature Transformer.
Global vs. locally-guided features:
• Balancing global and local information in the final embedding is important.
• Simply concatenating the two features (no residual learning) hurts performance.
EXPERIMENTS
Sensitivity analysis on different loss weights:
The results show that both loss terms are important in our model. Overall, the
model is not very sensitive to the two relative weight terms.
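One way to read "the two relative weight terms" is as weights on the diversity and domain discrepancy losses added to the MIL loss. Under that assumption, the overall objective would take the form below; the symbols λ₁, λ₂ and the subscripts are illustrative, not taken from the slides.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{mil}}
  \;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{div}}
  \;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{dom}}
```

Setting either weight to zero removes the corresponding term, which is the comparison a sensitivity analysis over loss weights would make.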
Thank you for your attention.

