A Survey on Cross-Modal Embedding
@ymym3412 / nlpaper.challenge
@ymas0315

Topics:
• Cross-Modal Embedding
• Cross-Modal Retrieval
• Audio-Visual Embedding
Agenda
• Cross-Modal Embedding: what it is and why it matters
• Cross-Modal Retrieval
  ◦ Extensions to video and 3D data
  ◦ Methods built on Adversarial Training and Consistency Loss
• Audio-Visual Embedding
  ◦ Self-supervised learning from web video
  ◦ Applications: separation, localization, generation
Cross-Modal Retrieval

Cross-Modal Retrieval searches one modality with a query from another, e.g. Text <-> Image.

Representative datasets:
• Wikipedia: image/text pairs taken from Wikipedia articles
• NUS-WIDE: web images from Flickr with user tags (Image/Tag); low-level visual features are distributed with the data
• Pascal VOC: 9,963 images collected from Flickr, often used together with LabelMe annotations (Image/Text)
• Flickr-30k: 30,000 Flickr images with crowd-sourced captions
• Recipe1M: about 1M cooking recipes paired with food images
Cross-Modal Retrieval: taxonomy of methods

By representation:
• Real-Valued Representation
• Binary Representation

By supervision:
• Unsupervised Method
• Pairwise based Method
• Supervised Method

Unsupervised Methods learn the shared space from co-occurring pairs alone: CCA maximises the correlation between the 2 views, and AutoEncoder variants reconstruct each modality from a shared code.
Pairwise based Methods are Ranking Methods: matched and mismatched image/text pairs are embedded into the Shared Space and trained with a ranking loss.
Supervised Methods additionally use class labels to structure the shared space.
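The pairwise ranking idea above can be sketched with a toy triplet loss; the 2-dimensional embeddings and margin below are illustrative only, not any particular paper's setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pushes the matched pair to be closer than the
    mismatched pair by at least `margin` in the shared space."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy shared-space embeddings: an image, its caption, an unrelated caption.
img = [1.0, 0.0]
cap_pos = [0.9, 0.1]
cap_neg = [0.0, 1.0]
loss = triplet_ranking_loss(img, cap_pos, cap_neg)  # 0.0: already well ranked
```

Swapping the positive and negative captions yields a positive loss, which is the gradient signal these methods train on.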
Video: Localizing Moments in Video with Natural Language (ICCV2017)
◦ Retrieves the moment within a video that matches a natural-language query
◦ Combines local moment features with the Global Context of the whole video
Video: Attentive Moment Retrieval in Videos (SIGIR2018)
◦ Retrieves query-relevant moments from a video
◦ Applies Attention between the query and the video's temporal context to score candidate moments
3D: Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequence (AAAI2019)
◦ Cross-Modal Retrieval between 3D shapes and text
◦ Encodes a 3D shape as a sequence of rendered views and pairs it with word sequences via joint reconstruction and prediction
Adversarial Training: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval (CVPR2018)
◦ Learns binary hash codes for image and text, guided by a self-supervised semantic network
◦ Adversarial Training makes the two modalities' feature distributions indistinguishable, i.e. modality-invariant
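The payoff of hashing methods is cheap retrieval: once both modalities are mapped to binary codes, search reduces to Hamming distance. A minimal sketch, with hypothetical 8-bit codes standing in for the hashing networks' outputs:

```python
def hamming(a, b):
    """Number of differing bits between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_code, database, k=2):
    """Rank database codes by Hamming distance to the query, return top-k names."""
    ranked = sorted(database.items(), key=lambda kv: hamming(query_code, kv[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical codes as a text hashing network and an image hashing network
# might produce them; names are illustrative.
text_query = [1, 0, 1, 1, 0, 0, 1, 0]
image_db = {
    "img_a": [1, 0, 1, 1, 0, 0, 1, 1],  # 1 bit from the query
    "img_b": [0, 1, 0, 0, 1, 1, 0, 1],  # 8 bits from the query
    "img_c": [1, 0, 1, 1, 0, 1, 1, 1],  # 2 bits from the query
}
top2 = retrieve(text_query, image_db)  # ["img_a", "img_c"]
```

Real systems use 32 to 128 bit codes and hardware popcount, but the ranking logic is exactly this.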
Adversarial Training: Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval (AAAI2019)
◦ Couples 2 cycle-consistent GANs for unsupervised cross-modal hashing
◦ The Outer Cycle GAN translates between the image and text representations
◦ The Inner Cycle GAN ties the continuous representations to their binary hash codes
Consistency Loss: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (CVPR2018)
◦ Adds generative Decoders that reconstruct one modality from the other's embedding
◦ The reconstruction objectives (with Adversarial Training for image generation) act as a consistency signal that grounds the shared space
Consistency Loss: Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images (CVPR2019)
◦ Combines Metric Learning, Adversarial Training, and a cross-modal Consistency Loss to align recipe text and food image embeddings
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
◦ A two-stream BERT over Vision and Language inputs
◦ Vision->Language and Language->Vision Attention are exchanged through Co-Attention Transformer layers
◦ Pretrained by Masking image regions / words and predicting them, together with image-text alignment prediction
◦ The Language Encoder is initialised from BERT; the Vision stream mirrors the BERT architecture
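The co-attention idea reduces to ordinary scaled dot-product attention where the queries come from one modality and the keys/values from the other. A minimal sketch with toy 2-dimensional features (illustrative values, not ViLBERT's actual dimensions or weights):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values,
    weighted by its similarity to the keys."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return out

# Co-attention: the vision stream queries the language stream, and vice versa.
vision = [[1.0, 0.0], [0.0, 1.0]]    # toy image-region features
language = [[1.0, 0.0], [0.5, 0.5]]  # toy word-token features
vision_attended = attend(vision, language, language)
language_attended = attend(language, vision, vision)
```

In ViLBERT each stream then runs its own feed-forward and self-attention layers on these exchanged features; the sketch shows only the cross-modal exchange itself.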
Audio-Visual Embedding

Why learn audio-visual embeddings?
◦ The Audio and Visual streams of a video are naturally synchronised
  ⇒ their Alignment provides free supervision, with no manual labels
◦ Sound reveals objects and events that may be hard to see, and vice versa
Tasks built on these embeddings include cross-modal retrieval, source separation, sound source localization, and generation.
Cross-modal retrieval: Audio-Visual Embedding Network (AVE-Net)
◦ Embeds images and audio into a shared space using only their natural correspondence as supervision
◦ Two DNN branches (vision and audio) are trained so that distance in the shared space reflects correspondence
◦ Handles both Cross-modal (image <-> audio) and Intra-modal retrieval
◦ Evaluated with nDCG@30 (Higher is better)
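For reference, nDCG@k scores a ranked retrieval list by rewarding relevant items near the top and normalising by the best possible ordering. A small self-contained implementation with a toy relevance list:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k retrieved items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (sorted) ranking; in [0, 1]."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the top retrieved items for one query (1 = relevant, 0 = not).
retrieved = [1, 0, 1, 0, 0]
score = ndcg_at_k(retrieved, k=5)  # < 1.0 because a relevant item is ranked third
```

A perfectly ordered list scores exactly 1.0, which is why "Higher is better".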
Audio-visual source separation: Looking to Listen at the Cocktail Party
◦ Isolates a target speaker's voice from a noisy mixture by conditioning on the speaker's face in the video
◦ Demo: https://www.youtube.com/watch?v=rVQVAPiJWKU
Sound source localization: Learning to Localize Sound Source in Visual Scenes
◦ An attention mechanism highlights the image regions that produce the sound
◦ The Attention can be learned without labels, and improves further when supervised localization annotations are available
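The localization attention amounts to comparing one audio embedding against the visual feature at every spatial position and normalising the result into a map. A toy sketch (a flattened 2x2 feature grid with made-up values, not the paper's network):

```python
import math

def localization_map(spatial_features, audio_embedding):
    """Attention over spatial positions: softmax of the dot product between
    each visual location's feature vector and the audio embedding."""
    scores = [sum(f * a for f, a in zip(feat, audio_embedding))
              for feat in spatial_features]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    total = sum(es)
    return [e / total for e in es]

# Toy 2x2 grid of visual features (flattened row-major) and one audio embedding.
grid = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [0.9, 0.1]]
audio = [1.0, 0.0]  # "sounds like" whatever the first feature dimension encodes
attn = localization_map(grid, audio)  # peaks at the first grid cell
```

Reshaped back to the image grid, the weights form the heatmap over the sounding region.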
Image/sound generation: Speech2Face: Learning the Face Behind a Voice
◦ A face decoder reconstructs the speaker's face image from a voice embedding
Audio-Visual datasets:
• Youtube 8M: large-scale dataset of labeled Youtube videos
• AudioSet: 632 audio event classes over 2,084,320 human-labeled clips
• AVSpeech: speech video segments covering roughly 29,000 speaker IDs
• Yahoo Flickr Creative Commons 100M (YFCC100M): about 100M photos and videos posted to Flickr under Creative Commons licenses
• VoxCeleb1, 2: interview clips of thousands of celebrities collected from Youtube
Soundnet: Learning sound representations from unlabeled video (NIPS2016)
◦ Learns audio representations from unlabeled video
◦ Pretrained vision networks act as teachers, distilling their predictions into the audio network
◦ The learned audio features, with an SVM on top, perform well on acoustic classification and can be combined with visual features (Audio+Vision)
Look, Listen and Learn (ICCV2017)
◦ Trains visual and audio subnetworks jointly from scratch on unlabeled video
◦ Self-supervised objective: predict whether a frame and an audio clip correspond (Audio-Visual Correspondence (AVC))
◦ Solving AVC produces strong Audio-visual representations without any manual labels
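The AVC task needs no annotation because the labels come from the data layout itself: a frame with its own audio is a positive, a frame with audio swapped in from another clip is a negative. A sketch of that pair construction (the clip records and field names are hypothetical; in practice these are video frames and audio spectrograms):

```python
import random

def make_avc_pairs(clips, num_negatives=2, seed=0):
    """Build (frame, audio, label) training pairs for Audio-Visual
    Correspondence: label 1 when frame and audio come from the same clip,
    0 when the audio is sampled from a different clip."""
    rng = random.Random(seed)
    pairs = []
    for i, clip in enumerate(clips):
        pairs.append((clip["frame"], clip["audio"], 1))  # corresponding pair
        others = [c for j, c in enumerate(clips) if j != i]
        for other in rng.sample(others, min(num_negatives, len(others))):
            pairs.append((clip["frame"], other["audio"], 0))  # mismatched pair
    return pairs

# Hypothetical clip records.
clips = [
    {"frame": "frame_0", "audio": "audio_0"},
    {"frame": "frame_1", "audio": "audio_1"},
    {"frame": "frame_2", "audio": "audio_2"},
]
pairs = make_avc_pairs(clips)  # 3 positives + 6 negatives
```

A binary classifier trained on such pairs is the entire supervision signal in Look, Listen and Learn.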
Objects that Sound (ECCV2018)
◦ Builds on the L3 network (from Look, Listen and Learn)
◦ Uses AVC training for Cross-modal retrieval and Sound source localization
◦ Proposes AVOL-Net, which localizes the sounding object within the image
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features (ECCV2018)
◦ Self-supervised objective: predict whether the audio and video streams are temporally aligned (Alignment)
◦ The fused multisensory features transfer to Action recognition
◦ and to Audio-visual source separation (AVSS)
AVSS: The Sound of Pixels (ECCV2018)
◦ PixelPlayer (http://sound-of-pixels.csail.mit.edu/) separates the input sound and localizes each separated component at the pixel level
◦ Trained self-supervised with the Mix-and-Separate framework: mix the audio tracks of two videos, then learn to recover each original track
◦ The audio network splits the mixture into K component channels, which pixel-wise visual features weight into a spectrogram mask per pixel
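The Mix-and-Separate trick works because mixing two known tracks yields free ground truth: the ideal mask for each source is simply its share of the mixture at every time-frequency bin. A toy sketch with flattened magnitude spectrograms (values illustrative):

```python
def ideal_ratio_masks(spec_a, spec_b):
    """Mix two known spectrograms and compute, per time-frequency bin,
    each source's share of the mixture. Applying a mask to the mixture
    recovers that source; a network is trained to predict these masks."""
    mix = [a + b for a, b in zip(spec_a, spec_b)]
    mask_a = [a / m if m > 0 else 0.0 for a, m in zip(spec_a, mix)]
    mask_b = [b / m if m > 0 else 0.0 for b, m in zip(spec_b, mix)]
    return mix, mask_a, mask_b

# Toy magnitude spectrograms (flattened) for two sources.
piano = [0.8, 0.0, 0.4, 0.0]
violin = [0.2, 0.6, 0.0, 0.0]
mix, mask_piano, mask_violin = ideal_ratio_masks(piano, violin)
recovered_piano = [m * w for m, w in zip(mask_piano, mix)]  # equals `piano`
```

In PixelPlayer the mask for each pixel is produced by weighting the K audio components with that pixel's visual feature; the recovery step is the same masking shown here.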
Speech2Face: Learning the Face Behind a Voice (CVPR2019)
◦ Reconstructs a canonical image of a speaker's face from a short clip of their voice
◦ The voice Encoder is trained so its output matches the face embedding of a pretrained face recognition network (voice ⇔ face)
Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI2019)
◦ Generates a talking-face video from a still face image plus driving speech audio / video
◦ Adversarial training disentangles the representation into speech content and speaker identity
Summary
• Surveyed Cross-Modal Embeddings: Cross-Modal Retrieval between Image and Text, and Audio-Visual Embeddings between Audio and Vision
• Cross-Modal Retrieval now reaches beyond Image and Text to Video and 3D, with Adversarial Training widely adopted
• Audio-Visual embeddings can be learned self-supervised from the natural co-occurrence of sound and vision
• Embeddings spanning Image/Text/Audio/Video point toward broader Cross-Modal applications
References
Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, Liang Wang: A Comprehensive Survey on Cross-modal Retrieval
T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng: NUS-WIDE: A Real-World Web Image Database from National University of Singapore
Sung Ju Hwang, Kristen Grauman: Reading between the Lines: Object Localization Using Implicit Cues from Image Tags
Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier: From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions
Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi: Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images
J. Zhou, G. Ding, Y. Guo: Latent Semantic Sparse Hashing for Cross-Modal Similarity Search
Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, Antonio Torralba: Learning Cross-modal Embeddings for Cooking Recipes and Food Images
Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord: Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler: VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Alexander Hermans, Lucas Beyer, Bastian Leibe: In Defense of the Triplet Loss for Person Re-Identification
Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, Dacheng Tao: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, Nuno Vasconcelos: A New Approach to Cross-Modal Multimedia Retrieval
Ting Yao, Tao Mei, Chong-Wah Ngo: Learning Query and Image Similarities with Ranking Canonical Correlation Analysis
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell: Localizing Moments in Video with Natural Language
Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao: Attentive Moment Retrieval in Videos
Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, Matthias Zwicker: Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences
Chao Li, Cheng Deng, Lei Wang, De Xie, Xianglong Liu: Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval
Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Yusuf Aytar, Carl Vondrick, Antonio Torralba: SoundNet: Learning Sound Representations from Unlabeled Video
Relja Arandjelović, Andrew Zisserman: Objects that Sound
Relja Arandjelović, Andrew Zisserman: Look, Listen and Learn
Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba: The Sound of Pixels
Andrew Owens, Alexei A. Efros: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik: Speech2Face: Learning the Face Behind a Voice
Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon: Learning to Localize Sound Source in Visual Scenes
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang: Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
