These slides were used by Umemoto of our company at an internal technical study session.
They explain the Transformer, an architecture that has attracted much attention in recent years.
"Arithmer Seminar" is weekly held, where professionals from within and outside our company give lectures on their respective expertise.
The slides are made by the lecturer from outside our company, and shared here with his/her permission.
Arithmer Inc. is a mathematics company that originated in the Graduate School of Mathematical Sciences at the University of Tokyo. We apply modern mathematics to bring advanced AI systems into solutions across a wide range of fields. Our job is to work out how to use AI effectively to make work more efficient and to produce results that are useful to people.
Arithmer began at the University of Tokyo Graduate School of Mathematical Sciences. Today, our research in modern mathematics and AI systems enables us to provide solutions to tough, complex issues. At Arithmer, we believe it is our job to realize the potential of AI by improving work efficiency and producing results that are more useful to society.
Unified Vision-Language Pre-Training for Image Captioning and VQA
Source: Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao: Unified Vision-Language Pre-Training for Image Captioning and VQA, The Thirty-Fourth AAAI Conference on Artificial Intelligence, pp. 13041-13049 (2020)
URL: https://aaai.org/ojs/index.php/AAAI/article/view/7005/6859
Summary: This paper proposes Unified VLP, a unified model for solving Vision-Language tasks. "Unified" refers to two points: the encoder and the decoder are contained within a single Transformer, and the same model can solve two tasks of quite different character, image captioning and VQA. The paper further applies the kind of pre-training previously used for language models such as BERT to a Vision-Language model, and shows that using image-caption pairs as the pre-training dataset improves the model's performance.
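To make the "single Transformer serving as both encoder and decoder" idea concrete, here is a minimal PyTorch sketch. It is not the authors' code: the class name, the dimensions, and the use of the first region vector as a classification proxy are simplifying assumptions, and positional/segment embeddings are omitted. The point it illustrates is that one shared stack of self-attention layers processes the joint image-text sequence, and only the attention mask changes between bidirectional understanding (VQA) and causal text generation (captioning).

import torch
import torch.nn as nn

class UnifiedVLPSketch(nn.Module):
    """Hypothetical, heavily simplified Unified VLP-style model."""
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4,
                 n_layers=2, num_answers=3129, region_dim=2048):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # detector region features -> model dim
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # One shared Transformer stack; the attention mask alone decides
        # whether it behaves as a bidirectional encoder or a causal decoder.
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)    # word prediction (captioning)
        self.vqa_head = nn.Linear(d_model, num_answers)  # answer classification (VQA)

    def forward(self, region_feats, token_ids, causal_text=False):
        v = self.region_proj(region_feats)   # (B, R, d)
        t = self.tok_emb(token_ids)          # (B, T, d)
        x = torch.cat([v, t], dim=1)         # joint image-text sequence
        R, T = v.size(1), t.size(1)
        mask = None
        if causal_text:  # seq2seq (caption-generation) mode
            mask = torch.zeros(R + T, R + T, dtype=torch.bool, device=x.device)
            # text positions may not attend to future text positions
            mask[R:, R:] = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                                 device=x.device), diagonal=1)
            mask[:R, R:] = True              # image regions never look at text
        h = self.transformer(x, mask=mask)
        # The first region vector stands in for a [CLS] token in this sketch.
        return self.lm_head(h[:, R:]), self.vqa_head(h[:, 0])

model = UnifiedVLPSketch()
regions = torch.randn(2, 36, 2048)               # 36 detector regions per image
tokens = torch.randint(0, 30522, (2, 12))        # caption or question tokens
lm_logits, _ = model(regions, tokens, causal_text=True)    # captioning mode
_, vqa_logits = model(regions, tokens, causal_text=False)  # VQA mode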
6. Text to Image
Task: generate an image from its textual description (an image caption).
[Figure: image caption → generated image; example from Text-to-Image Synthesis [9]]
Example model pipeline [10] (a sketch follows this slide): Caption → Scene Graph → Scene Layout → Image
Trends:
・Image generation via an intermediate scene layout [9, 10]
・Extending the "V" or "L" side: Text to Video [11], Story Visualization (an image sequence from text) [12]
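The staged pipeline of [10] can be summarized in code. The sketch below is purely illustrative: the dataclasses and the three stage functions (parse_caption, predict_layout, render_image) are hypothetical stand-ins for the learned components in the paper, namely a scene-graph parser, a graph-convolution layout predictor, and a cascaded-refinement image generator.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]                      # e.g. ["sheep", "grass"]
    relations: List[Tuple[str, str, str]]   # (subject, predicate, object)

@dataclass
class SceneLayout:
    # one (object label, bounding box) pair per object; boxes as (x0, y0, x1, y1)
    boxes: List[Tuple[str, Tuple[float, float, float, float]]]

def parse_caption(caption: str) -> SceneGraph:
    # Placeholder: a real system uses a learned or rule-based parser.
    return SceneGraph(objects=["sheep", "grass"],
                      relations=[("sheep", "standing on", "grass")])

def predict_layout(graph: SceneGraph) -> SceneLayout:
    # Placeholder: [10] predicts a box and a mask per object with a
    # graph convolution network over the scene graph.
    return SceneLayout(boxes=[(o, (0.1, 0.1, 0.5, 0.5)) for o in graph.objects])

def render_image(layout: SceneLayout) -> str:
    # Placeholder: [10] fills the layout in with a cascaded refinement network.
    return f"image with {len(layout.boxes)} objects"

def caption_to_image(caption: str) -> str:
    # Caption -> Scene Graph -> Scene Layout -> Image, as on the slide.
    return render_image(predict_layout(parse_caption(caption)))

print(caption_to_image("a sheep standing on grass"))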
9. Other Tasks
・Text-based image editing [20]
[Figure: input image + editing instruction (e.g. "The flower has red petals with yellow stigmas in the middle") → edited image]
・Vision-and-Language Navigation [21]
[Figure: 3D environment + navigation instruction → agent movement]
・Vision-and-Language tasks are likely to keep appearing one after another
・Research is also extending to further multi-modality (Vision + Language + X, e.g. Audio) [19]
42. CV Papers
• Papers covered
• ※ Numbers correspond to the table on the next page
1. Antol et al., "VQA: Visual Question Answering.", ICCV 2015
2. Goyal et al., "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering.", CVPR 2017
3. Das et al., "Embodied Question Answering", CVPR 2018
4. Kafle et al., "DVQA: Understanding Data Visualizations via Question Answering.", CVPR 2018
5. Li et al., "Visual Question Generation as Dual Task of Visual Question Answering.", CVPR 2018
44. NLP Papers
• Papers covered
• ※ Numbers correspond to the table on the next page
1. Li et al., "Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions.", EMNLP 2018
2. Patro et al., "Multimodal Differential Network for Visual Question Generation.", EMNLP 2018
3. Chao et al., "Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets.", NAACL 2018
4. Mahendru et al., "The Promise of Premise: Harnessing Question Premises in Visual Question Answering.", EMNLP 2017
5. Fukui et al., "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding.", EMNLP 2016
51. References
[1] Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Das, Abhishek, et al. "Embodied question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018.
[3] Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language models." International Conference on Machine Learning. 2014.
[4] Chen, Xinlei, et al. "Microsoft COCO captions: Data collection and evaluation server." arXiv preprint arXiv:1504.00325 (2015).
[5] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.
[6] Yoshida, Kota, et al. "Neural Joking Machine: Humorous image captioning." arXiv preprint arXiv:1805.11850 (2018).
[7] Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[8] Huang, Ting-Hao Kenneth, et al. "Visual storytelling." Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016.
[9] Hong, Seunghoon, et al. "Inferring semantic layout for hierarchical text-to-image synthesis." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[10] Johnson, Justin, Agrim Gupta, and Li Fei-Fei. "Image generation from scene graphs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
52. References
[11] Li, Yitong, et al. "Video generation from text." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[12] Li, Yitong, et al. "StoryGAN: A Sequential Conditional GAN for Story Visualization." arXiv preprint arXiv:1812.02784 (2018).
[13] Wang, Peng, et al. "FVQA: Fact-based visual question answering." IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
[14] Misra, Ishan, et al. "Learning by Asking Questions." arXiv preprint arXiv:1712.01238 (2017).
[15] Das, Abhishek, et al. "Visual dialog." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[16] Massiceti, Daniela, et al. "FlipDial: A generative model for two-way visual dialogue." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[17] Wu, Qi, et al. "Are you talking to me? Reasoned visual dialog generation through adversarial learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[18] Kottur, Satwik, et al. "Visual coreference resolution in visual dialog using neural module networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[19] Hori, Chiori, et al. "End-to-end audio visual scene-aware dialog using multimodal attention-based video features." arXiv preprint arXiv:1806.08409 (2018).
[20] Chen, Jianbo, et al. "Language-based image editing with recurrent attentive models." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[21] Anderson, Peter, et al. "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
53. References
[22] Goyal, Yash, et al. "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[23] Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[24] Yang, Zichao, et al. "Stacked attention networks for image question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[25] Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[26] Perez, Ethan, et al. "FiLM: Visual reasoning with a general conditioning layer." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[27] Das, Abhishek, et al. "Learning cooperative visual dialog agents with deep reinforcement learning." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[28] Alamri, Huda, et al. "Audio-Visual Scene-Aware Dialog." arXiv preprint arXiv:1901.09107 (2019).
[29] Li, Qing, et al. "Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions." EMNLP 2018.
[30] Mahendru, Aroma, et al. "The Promise of Premise: Harnessing Question Premises in Visual Question Answering." EMNLP 2017.