A Survey on Cross-Modal Embedding

Presentation slides for "A Survey on Cross-modal Embedding", given at the 2nd NLP/CV Cutting-Edge Study Session (第2回 NLP/CV最先端勉強会).

Published in: Science

A Survey on Cross-Modal Embedding

  1. A Survey on Cross-Modal Embedding
  2. @ymym3412, nlpaper.challenge, @ymas0315
  3. Cross-Modal Embedding; Cross-Modal Retrieval; Audio-Visual Embedding
  4. Cross-Modal Embedding
  5. Cross-Modal Embedding
  6. Cross-Modal Embedding
  7. Cross-Modal Embedding: Cross-Modal Retrieval (3D; Adversarial Training, Consistency Loss); Audio-Visual Embedding (Web)
  8. Cross-Modal Retrieval
  9. Cross-Modal Retrieval: Text <-> Image (Wikipedia)
  10. Image/Tag datasets: NUS-WIDE (Flickr; 18.6; Low-level); Pascal VOC (Flickr, LabelMe; 9963)
  11. Image/Text datasets: Wikipedia (Wikipedia); Flickr-30k (Flickr; 30000; 16); Recipe1M (1M)
  12. Cross-Modal Retrieval: • Real-Valued Representation • Binary Representation • Unsupervised Method • Pairwise based Method • Supervised Method
  13. Unsupervised Method: CCA, 2 AutoEncoder
  14. Pairwise based Method: Ranking Method
  15. Supervised Method: Shared Space
  16. Localizing Moments in Video with Natural Language (ICCV2017) ◦ Global Context
  17. Attentive Moment Retrieval in Videos (SIGIR2018) ◦ Attention First
  18. 3D: Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences (AAAI2019) ◦ 3D Cross-Modal Retrieval ◦ 3D
  19. Adversarial Training: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval (CVPR2018) ◦ Adversarial Training
  20. Adversarial Training: Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval (AAAI2019) ◦ 2 GAN ◦ Outer Cycle GAN ◦ Inner Cycle GAN
  21. Consistency Loss: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (CVPR2018) ◦ Decoder, Adversarial Training ◦ Adversarial Training
  22. Consistency Loss: Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images (CVPR2019) ◦ Metric Learning, Adversarial Training, Consistency Loss
  23. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks ◦ Vision, Language, BERT ◦ Vision->Language and Language->Vision Attention: Co-Attention Transformer ◦ Mask ◦ Vision/Language Encoder, BERT
  24. Audio-Visual Embedding
  25. Audio-Visual Embedding ◦ Audio, Visual ⇒ Alignment
  26. Audio-Visual: Cross-modal retrieval ◦ Audio-Visual Embedding Network (AVE-Net) ◦ DNN ◦ Cross-modal, Intra-modal
  27. Audio-Visual: Cross-modal retrieval ◦ Audio-Visual Embedding Network (AVE-Net) ◦ DNN ◦ Cross-modal, Intra-modal ◦ nDCG@30 (Higher is better)
  28. Audio-Visual: Audio-visual source separation ◦ Looking to Listen at the Cocktail Party ◦ https://www.youtube.com/watch?v=rVQVAPiJWKU
  29. Audio-Visual: Sound source localization ◦ Learning to Localize Sound Source in Visual Scenes ◦ attention ◦ Attention, supervised
  30. Audio-Visual: Image/sound generation ◦ Speech2Face: Learning the Face Behind a Voice ◦ decoder
  31. Datasets: Youtube 8M; AudioSet (632; 2,084,320); AVSpeech (29,000; ID); Yahoo Flickr Creative Commons 100M (YFCC100M) (80; 100M; Flickr Creative Commons); VoxCeleb1, 2 (Youtube; 2000)
  32. SoundNet: Learning Sound Representations from Unlabeled Video (NIPS2016) ◦ SVM (Audio+Vision)
  33-34. Look, Listen and Learn (ICCV2017) ◦ visual, audio ◦ Audio-Visual Correspondence (AVC) ◦ AVC, Audio-visual
  35-36. Objects that Sound (ECCV2018) ◦ L3 ◦ Cross-modal retrieval, Sound source localization, AVC, AVOL-Net
  37. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features (ECCV2018) ◦ Action recognition ◦ Audio-visual source separation, Alignment
  38-43. AVSS: The Sound of Pixels (ECCV2018) ◦ PixelPlayer (http://sound-of-pixels.csail.mit.edu/) ◦ Mix-and-Separate
  44. Speech2Face: Learning the Face Behind a Voice (CVPR2019) ◦ Encoder
  45-46. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (AAAI2019) ◦ disentangle
  47. ◦ Cross-Modal Embeddings: Image/Text (Cross-Modal Retrieval), Audio/Vision (Audio-Visual Embeddings) ◦ Cross-Modal Retrieval: Image, Text, Video, 3D; Adversarial Training ◦ Audio-Visual ◦ Image/Text/Audio/Video Cross-Modal
  48. References
      Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, Liang Wang: A Comprehensive Survey on Cross-modal Retrieval
      T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng: NUS-WIDE: A real-world web image database from National University of Singapore
      Sung Ju Hwang, Kristen Grauman: Reading between the Lines: Object Localization Using Implicit Cues from Image Tags
      Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
      Hao Wang, Doyen Sahoo, Chenghao Liu, Ee-peng Lim, Steven C. H. Hoi: Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images
      J. Zhou, G. Ding, Y. Guo: Latent Semantic Sparse Hashing for Cross-Modal Similarity Search
      Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, Antonio Torralba: Learning Cross-modal Embeddings for Cooking Recipes and Food Images
      Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, Matthieu Cord: Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
      Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler: VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
      Alexander Hermans, Lucas Beyer, Bastian Leibe: In Defense of the Triplet Loss for Person Re-Identification
      Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, Dacheng Tao: Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
  49. References (continued)
      Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, Nuno Vasconcelos: A New Approach to Cross-Modal Multimedia Retrieval
      Ting Yao, Tao Mei, Chong-Wah Ngo: Learning Query and Image Similarities with Ranking Canonical Correlation Analysis
      Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell: Localizing Moments in Video with Natural Language
      Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao: Attentive Moment Retrieval in Videos
      Zhizhong Han, Mingyang Shang, Xiyang Wang, Yu-Shen Liu, Matthias Zwicker: Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences
      Chao Li, Cheng Deng, Lei Wang, De Xie, Xianglong Liu: Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval
      Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang: Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
      Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
  50. References (continued)
      Yusuf Aytar, Carl Vondrick, Antonio Torralba: SoundNet: Learning Sound Representations from Unlabeled Video
      Relja Arandjelović, Andrew Zisserman: Objects that Sound
      Relja Arandjelović, Andrew Zisserman: Look, Listen and Learn
      Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba: The Sound of Pixels
      Andrew Owens, Alexei A. Efros: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
      Tae-Hyun Oh, Tali Dekel, Changil Kim, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Wojciech Matusik: Speech2Face: Learning the Face Behind a Voice
      Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, Michael Rubinstein: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
      Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon: Learning to Localize Sound Source in Visual Scenes
      Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, Xiaogang Wang: Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
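
Slide 14 names ranking methods under the pairwise-based family, and the reference list includes VSE++ and the triplet loss. The slide text does not give a formula, so the following is only a minimal sketch of one common choice: a bidirectional max-margin ranking loss over a batch of paired image and text embeddings. The function names, the cosine similarity, and the margin of 0.2 are assumptions made for illustration, not details taken from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranking_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional max-margin ranking loss (illustrative, assumed form).

    image_embs[i] and text_embs[i] are a matching pair; every other
    combination in the batch is treated as a negative.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        pos = cosine(image_embs[i], text_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image i as the query, text j as a negative.
            total += max(0.0, margin - pos + cosine(image_embs[i], text_embs[j]))
            # Text i as the query, image j as a negative.
            total += max(0.0, margin - pos + cosine(image_embs[j], text_embs[i]))
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(ranking_loss(images, aligned_texts))  # no violation when pairs align
print(ranking_loss(images, swapped_texts))  # positive when pairs are mismatched
```

Summing over all negatives is the simplest variant; keeping only the hardest negative per query is another option that the VSE++ title points to.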
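
Slide 12 separates real-valued from binary representations, and slides 19 and 20 cover hashing networks. As a rough illustration of why binary codes are attractive for retrieval, the sketch below binarizes embeddings by sign and ranks database items by Hamming distance. The sign threshold, the toy vectors, and all function names are assumptions; the hashing networks on the slides learn their codes rather than thresholding fixed embeddings.

```python
def binarize(embedding):
    """Turn a real-valued vector into a tuple of 0/1 bits by sign (assumed scheme)."""
    return tuple(1 if x > 0 else 0 for x in embedding)


def hamming(code_a, code_b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def retrieve(query_code, database_codes, top_k=3):
    """Return indices of the top_k database codes closest to the query."""
    order = sorted(range(len(database_codes)),
                   key=lambda i: hamming(query_code, database_codes[i]))
    return order[:top_k]


# Hypothetical text query embedding and image database embeddings.
text_query = binarize([0.9, -0.2, 0.4, -0.7])
image_codes = [
    binarize([-1.0, 1.0, -1.0, 1.0]),
    binarize([0.5, -0.1, 0.3, 0.2]),
    binarize([1.0, -1.0, 1.0, -1.0]),
]
print(retrieve(text_query, image_codes))
```

Because the distance is a bit count, lookups of this kind stay cheap even for large databases, which is the usual motivation for the binary branch of the taxonomy.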
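
Slide 27 reports retrieval quality as nDCG@30, where higher is better. The slides do not spell out the metric, so here is a small sketch of nDCG@k under two stated assumptions: the gain is the raw relevance grade (some formulations use 2^rel - 1 instead), and the ideal ranking is built from the same list of retrieved items. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of retrieved items, in the order the system returned them.
print(ndcg_at_k([3, 2, 1], k=3))  # already in the best order
print(ndcg_at_k([1, 2, 3], k=3))  # same items, worst order
```

With k set to 30 this matches the cutoff named on the slide, provided the evaluation there uses the same gain and normalization conventions.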
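
Slides 38 to 43 name the Mix-and-Separate framework of The Sound of Pixels without describing it. As I understand the idea, audio from different videos is mixed and a model is trained to recover each source, so the separation targets come for free. The sketch below shows only the target construction on toy magnitude grids standing in for spectrograms; it contains no network and no STFT, and it assumes magnitudes add when sources are mixed, which is an approximation. All names and numbers are hypothetical.

```python
def mix(spec_a, spec_b):
    """Element-wise sum of two equally sized magnitude grids (assumed additive)."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(spec_a, spec_b)]


def binary_mask(target, other):
    """1.0 where the target source is at least as loud as the other, else 0.0."""
    return [[1.0 if t >= o else 0.0 for t, o in zip(row_t, row_o)]
            for row_t, row_o in zip(target, other)]


def apply_mask(mask, mixture):
    """Keep only the mixture bins that the mask assigns to one source."""
    return [[m * x for m, x in zip(row_m, row_x)]
            for row_m, row_x in zip(mask, mixture)]


# Toy 2x2 "spectrograms" for the audio tracks of two different videos.
spec_a = [[4, 0], [1, 3]]
spec_b = [[1, 2], [1, 0]]

mixture = mix(spec_a, spec_b)          # what a separation model would receive
mask_a = binary_mask(spec_a, spec_b)   # training target for source A
print(apply_mask(mask_a, mixture))     # estimate of source A from the mixture
```

In a trained system the mask would be predicted from the mixture together with visual features, and a loss would compare it against a target such as `mask_a`.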
