Frontiers of Vision and Language:
Bridging Images and Texts by Deep Learning
The University of Tokyo
Yoshitaka Ushiku
losnuevetoros
Documents = Vision + Language
Vision & Language:
an emerging topic
• Integration of CV, NLP
and ML techs
• Several backgrounds
– Impact of Deep Learning
• Image recognition (CV)
• Machine translation (NLP)
– Growth of user-generated
content
– Exploratory research on
Vision and Language
2012: Impact of Deep Learning
Academic AI startup A famous company
Many slides refer to the first use of CNN (AlexNet) on ImageNet
Large gap of error rates
on ImageNet
1st team: 15.3%
2nd team: 26.2%
2012: Impact of Deep Learning
According to the official site…
1st team w/ DL
Error rate: 15%
2nd team w/o DL
Error rate: 26%
[http://image-net.org/challenges/LSVRC/2012/results.html]
It’s me!!
2014: Another impact of Deep Learning
• Deep learning appears in machine translation
[Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing gradient
problem in RNNs
→ Handles relations between distant words in a sentence
– A four-layer LSTM is trained end to end
→ Comparable to the state of the art (English to French)
• Emergence of common techniques such as CNN/RNN
Reduction of barriers to entry into CV+NLP
Input
Output
Growth of user-generated content
Especially in content posting/sharing services
• Facebook: 300 million photos per day
• YouTube: 400 hours of video per minute
Pōhutukawa blooms this
time of the year in New
Zealand. As the flowers
fall, the ground
underneath the trees look
spectacular.
Pairs of a sentence
+ a video / photo
→Collectable in
large quantities
Exploratory research on Vision and Language
Captioning an image associated with its article
[Feng+Lapata, ACL 2010]
• Input: article + image Output: caption for image
• Dataset: Sets of article + image + caption
× 3361
King Toupu IV died at the
age of 88 last week.
As a result of these backgrounds:
Various research topics such as …
Image Captioning
Group of people sitting
at a table with a dinner.
Tourists are standing on
the middle of a flat desert.
[Ushiku+, ICCV 2015]
Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Multilingual + Image Caption Translation
Ein Masten mit zwei Ampeln
für Autofahrer. (German)
A pole with two lights
for drivers. (English)
[Hitschler+, ACL 2016]
Visual Question Answering
[Fukui+, EMNLP 2016]
Image Generation from Captions
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
Goal of this keynote
Survey of research on vision and language
• Historical flow of each area
• Changes by Deep Learning
× Deep Learning enabled these researches
✓ Deep Learning boosted these researches
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
Frontiers of Vision and Language 1
Image Captioning
Every picture tells a story
Dataset:
Images + <object, action, scene> + Captions
1. Predict <object, action, scene> for an input
image using MRF
2. Search for the existing caption associated with
similar <object, action, scene>
<Horse, Ride, Field>
[Farhadi+, ECCV 2010]
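The two steps above can be sketched roughly as follows. This is only an illustration, not the method of [Farhadi+, ECCV 2010]: the MRF prediction is assumed to have already produced a triplet, and the similarity measure (number of matching slots), the function names, and the tiny dataset are all hypothetical.

```python
# Hypothetical captions annotated with <object, action, scene> triplets.
DATASET = [
    (("horse", "ride", "field"), "A person rides a horse in a field."),
    (("pet", "sleep", "ground"), "A dog sleeps on the ground."),
    (("transportation", "move", "track"), "A train moves along the track."),
]


def triplet_similarity(a, b):
    """Count the slots (object, action, scene) on which two triplets agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def retrieve_caption(predicted, dataset=DATASET):
    """Return the existing caption whose triplet best matches the predicted one."""
    _, best_caption = max(
        dataset, key=lambda item: triplet_similarity(predicted, item[0])
    )
    return best_caption


# Step 1 (the MRF prediction) is assumed to have produced this triplet.
print(retrieve_caption(("horse", "ride", "ground")))
```

A real system would use a learned similarity between triplets rather than exact slot matches.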
Every picture tells a story
<pet, sleep, ground>
See something unexpected.
<transportation, move, track>
A man stands next to a train
on a cloudy day.
[Farhadi+, ECCV 2010]
Retrieve? Generate?
• Retrieve
– A small gray dog on a leash.
• Generate
– Template-based
dog+stand ⇒ A dog stands.
– Template-free
A small white dog standing on a leash.
A small gray dog
on a leash.
A black dog
standing in
grassy area.
A small white dog
wearing a flannel
warmer.
Input Dataset
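The difference between retrieving and template-based generation can be illustrated with a small sketch. The word-overlap score, the function names, and the fixed "A <subject> <verb>s." pattern are assumptions made for this example; template-free generation needs a learned language model and is not shown.

```python
CAPTIONS = [
    "A small gray dog on a leash.",
    "A black dog standing in grassy area.",
    "A small white dog wearing a flannel warmer.",
]


def retrieve(query_words, captions):
    """Retrieval: reuse the existing caption sharing the most words with the query."""
    def overlap(caption):
        words = set(caption.lower().rstrip(".").split())
        return len(words & set(query_words))
    return max(captions, key=overlap)


def template_generate(subject, verb):
    """Template-based generation: fill a fixed Subject+Verb pattern."""
    return f"A {subject} {verb}s."


# Words assumed to have been recognized in the input image.
print(retrieve({"small", "gray", "dog", "leash"}, CAPTIONS))
print(template_generate("dog", "stand"))
```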
Captioning with multi-keyphrases
[Ushiku+, ACM MM 2012]
End of sentence
[Ushiku+, ACM MM 2012]
Benefits of Deep Learning
• Refinement of image recognition [Krizhevsky+, NIPS 2012]
• Deep learning appears in machine translation
[Sutskever+, NIPS 2014]
– LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing gradient
problem in RNNs
→ Handles relations between distant words in a sentence
– A four-layer LSTM is trained end to end
→ Comparable to the state of the art (English to French)
Emergence of common techniques such as CNN/RNN
Reduction of barriers to entry into CV+NLP
Input
Output
Google NIC
Concatenation of Google’s methods
• GoogLeNet [Szegedy+, CVPR 2015]
• MT with LSTM
[Sutskever+, NIPS 2014]
Caption (word seq.) $S_0, \dots, S_N$ for image $I$
$S_0$: beginning of the sentence
$S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$
$S_t = \mathrm{LSTM}(S_{t-1}),\quad t = 2, \dots, N-1$
$S_N$: end of the sentence
[Vinyals+, CVPR 2015]
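The recurrence above amounts to a loop that feeds each generated word back into the decoder. The sketch below shows only that control flow: the CNN and LSTM are replaced by a hypothetical lookup table (`TOY_NEXT_WORD`), so it does not reproduce the model of [Vinyals+, CVPR 2015].

```python
START, END = "<S>", "</S>"

# Stand-in for LSTM(CNN(I)) and LSTM(S_{t-1}): a hypothetical table mapping
# the previous word to the next word for one imaginary image.
TOY_NEXT_WORD = {START: "a", "a": "dog", "dog": "stands", "stands": END}


def next_word(prev_word, image_feature=None):
    """Toy replacement for one decoder step; it ignores the image feature."""
    return TOY_NEXT_WORD.get(prev_word, END)


def generate_caption(image_feature, max_len=20):
    """Greedy decoding: emit one word per step until the end symbol appears."""
    words = [START]  # S_0
    for _ in range(max_len):
        # Only the first step sees the image feature, as in S_1 = LSTM(CNN(I)).
        feature = image_feature if len(words) == 1 else None
        word = next_word(words[-1], feature)
        words.append(word)
        if word == END:  # S_N
            break
    return words


print(generate_caption(image_feature=[0.1, 0.2]))
```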
Examples of generated captions
[https://github.com/tensorflow/models/tree/master/im2txt]
[Vinyals+, CVPR 2015]
Comparison to [Ushiku+, ACM MM 2012]
Input image
Estimation of important words (conventional object recognition)
• [Ushiku+, ACM MM 2012]: Fisher Vector + Linear classifier
• Neural image captioning: Convolutional Neural Network
Connect the words with grammar model (conventional machine translation)
• [Ushiku+, ACM MM 2012]: Log Linear Model + beam search
• Neural image captioning: Recurrent Neural Network + beam search
• Trained using only images and captions
• Approaches are similar to each other
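Both approaches rely on beam search to connect words into a sentence. A minimal sketch of beam search follows, assuming a hypothetical `step_fn` that returns next-word probabilities for a partial sentence; the toy distribution is invented for illustration and stands in for either the log-linear model or the RNN.

```python
import math


def beam_search(step_fn, start, end, beam_size=2, max_len=10):
    """Keep the beam_size best partial sequences by total log-probability."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, prob in step_fn(seq).items():
                candidates.append((seq + [word], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Sequences ending with the end symbol leave the beam.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical next-word distribution that depends only on the last word.
TOY = {
    "<S>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</S>": 1.0},
    "cat": {"</S>": 1.0},
}


def toy_step(seq):
    return TOY[seq[-1]]


print(beam_search(toy_step, "<S>", "</S>", beam_size=2))
```

With `beam_size=1` the search becomes greedy and commits to "a" first, while a beam of two recovers the more probable sentence that starts with "the".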
Current development: Accuracy
• Attention-based captioning [Xu+, ICML 2015]
– Focus on some areas for predicting each word!
– Both attention and caption models are trained
using pairs of an image & caption
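Focusing on some areas can be sketched as a softmax-weighted average of region features. The sketch below is a generic soft-attention step under stated assumptions: the relevance scores are passed in directly, whereas in a trained model they would be computed from the decoder state, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, scores):
    """Return the attention weights and the weighted sum of region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Two image regions with 2-d features; the second region scores higher.
weights, context = attend([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 2.0])
print(weights, context)
```

The context vector is what the caption model would consume when predicting the next word.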
Current development: Problem setting
Dense captioning
[Lin+, BMVC 2015] [Johnson+, CVPR 2016]
Current development: Problem setting
Generating captions for a photo sequence
[Park+Kim, NIPS 2015][Huang+, NAACL 2016]
The family got together for a cookout.
They had a lot of delicious food.
The dog was happy to be there.
They had a great time on the beach.
They even had a swim in the water.
Current development: Problem setting
Captioning using sentiment terms
[Mathews+, AAAI 2016][Shin+, BMVC 2016]
Neutral caption
Positive caption
Frontiers of Vision and Language 2
Video Captioning
Before Deep Learning
• Grounding of languages and objects in videos
[Yu+Siskind, ACL 2013]
– Learning from only videos and their captions
– Experiments in a small setting with few objects
– Controlled and small dataset
• Deep Learning should suit this problem
– Image Captioning: single image → word sequence
– Video Captioning: image sequence → word
sequence
End-to-end learning by Deep Learning
• LRCN
[Donahue+, CVPR 2015]
– CNN+RNN for
• Action recognition
• Image / Video
Captioning
• Video to Text
[Venugopalan+, ICCV 2015]
– CNNs to recognize
• Objects from RGB frames
• Actions from flow images
– RNN for captioning
Video Captioning
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Video Captioning
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
Video Retrieval from Caption
• Input: Captions
• Output: A video related to the caption
10 sec video clip from 40 min database!
• Video captioning is also addressed
A woman in blue is playing ping pong in a room.
A guy is skiing with no shirt on and yellow snow pants.
A man is water skiing while attached to a long rope.
[Yamaguchi+, ICCV 2017]
Frontiers of Vision and Language 3
Multilingual +
Image Caption Translation
Towards multiple languages
Datasets with multilingual captions
• IAPR TC12 [Grubinger+, 2006] English + German
• Multi30K [Elliott+, 2016] English + German
• STAIR Captions [Yoshikawa+, 2017]
English + Japanese
Development of cross-lingual tasks
• Non-English-caption generation
• Image Caption Translation
Input: Pair of a caption in Language A + an image
or A caption in Language A
Output: Caption in Language B
Non-English-caption generation
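Most research generates English captions
• Japanese [Miyazaki+Shimizu, ACL 2016]
• Chinese [Li+, ICMR 2016]
• Turkish [Unal+, SIU 2016]
Çimlerde koşan bir köpek (Turkish: "A dog running on the grass")
金色头发的小女孩 (Chinese: "A little girl with blonde hair")
柵の中にキリンが一頭立っています (Japanese: "A giraffe is standing inside the fence")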
Just collecting non-English captions?
Transfer learning among languages
[Miyazaki+Shimizu, ACL 2016]
• Vision-Language grounding $W_{im}$ is transferred
• Efficient learning using small amount of captions
an elephant is
an elephant
一匹の 象が 土の
一匹の 象が
Image Caption Translation
Machine translation via visual data
Images can boost MT [Calixto+, 2012]
• Example below (English to Portuguese):
Does the word “seal” in English
– mean “seal” as in “stamp”?
– mean “seal” as in the sea animal?
• [Calixto+, 2012] argue that the mistranslation can be
avoided using a related image (w/o experiments)
Mistranslation!
Input: Caption in Language A + image
• Caption translation via an associated image
[Elliott+, 2015] [Hitschler+, ACL 2016]
– Generate translation candidates
– Re-rank the candidates using similar images’
captions in Language B
Eine Person in
einem Anzug
und Krawatte
und einem Rock.
(In German)
Translation w/o the related image
A person in a suit and tie
and a rock.
Translation with the related image
A person in a suit and tie
and a skirt.
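The re-ranking step can be illustrated with a toy scorer. This is a sketch under loud assumptions, not the method of [Hitschler+, ACL 2016]: the word-overlap score, the function names, and the caption of the "similar image" are hypothetical; only the two candidate translations come from the example above.

```python
def tokens(sentence):
    """Lower-case a sentence and split it into a set of words."""
    return set(sentence.lower().replace(".", "").split())


def rerank(candidates, similar_image_captions):
    """Order candidates by word overlap with captions of similar images."""
    reference = set()
    for caption in similar_image_captions:
        reference |= tokens(caption)
    return sorted(candidates, key=lambda c: len(tokens(c) & reference), reverse=True)


candidates = [
    "A person in a suit and tie and a rock.",
    "A person in a suit and tie and a skirt.",
]
# Hypothetical Language-B caption attached to a visually similar image.
similar = ["A woman wearing a skirt and a jacket."]
print(rerank(candidates, similar)[0])
```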
Input: Caption in Language A
• Cross-lingual document retrieval via images
[Funaki+Nakayama, EMNLP 2015]
• Zero-shot machine translation
[Nakayama+Nishida, 2017]
Frontiers of Vision and Language 4
Visual Question Answering
Visual Question Answering (VQA)
Proposed in Human-Computer Interfaces
• VizWiz [Bigham+, UIST 2010]
Manually solved on AMT
• Automation for the first time (w/o Deep Learning)
[Malinowski+Fritz, NIPS 2014]
• Similar term: Visual Turing Test [Malinowski+Fritz, 2014]
VQA: Visual Question Answering
• Established VQA as an AI problem
– Provided a benchmark dataset
– Experimental results with reasonable baselines
• Portal web site is also organized
– http://www.visualqa.org/
– Annual competition for VQA accuracy
[Antol+, ICCV 2015]
What color are her eyes?
What is the mustache made of?
VQA Dataset
Collected questions and answers on AMT
• Over 100K real images and 30K abstract images
• About 700K questions+10 answers for each
VQA=Multiclass Classification
Feature $z_{I+Q}$ is fed to a standard classifier
Question $Q$: What objects are found on the bed?
Answer $A$: bed sheets, pillow
Image $I$
Image feature $x_I$
Question feature $x_Q$
Integrated feature $z_{I+Q}$
Development of VQA
How to calculate the integrated feature $z_{I+Q}$?
• VQA [Antol+, ICCV 2015]: Just concatenate them
$z_{I+Q} = [x_I;\ x_Q]$
• Summation
e.g. Summation of an image feature with attention
and a question feature [Xu+Saenko, ECCV 2016]
$z_{I+Q} = x_I + x_Q$
• Multiplication
e.g. Bilinear multiplication using DFT
[Fukui+, EMNLP 2016]
$z_{I+Q} = x_I \otimes x_Q$
• Hybrid of summation and multiplication
e.g. Concatenation of sum and multiplication
[Saito+, ICME 2017]
$z_{I+Q} = [x_I + x_Q;\ x_I \odot x_Q]$
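The four ways of combining the image and question features listed above can be sketched on plain Python lists. This is a simplification: an element-wise product stands in for the multiplication, whereas the bilinear method of [Fukui+, EMNLP 2016] is more involved, and all function names here are hypothetical.

```python
def concat(x_i, x_q):
    """Concatenation: stack the two feature vectors."""
    return list(x_i) + list(x_q)


def elementwise_sum(x_i, x_q):
    """Summation: add features of equal length position by position."""
    return [a + b for a, b in zip(x_i, x_q)]


def elementwise_product(x_i, x_q):
    """Multiplication (simplified): multiply position by position."""
    return [a * b for a, b in zip(x_i, x_q)]


def hybrid(x_i, x_q):
    """Hybrid: concatenate the sum and the product."""
    return concat(elementwise_sum(x_i, x_q), elementwise_product(x_i, x_q))


x_i, x_q = [1.0, 2.0], [3.0, 4.0]
print(concat(x_i, x_q), elementwise_sum(x_i, x_q))
print(elementwise_product(x_i, x_q), hybrid(x_i, x_q))
```

Whichever variant is used, the resulting vector is then passed to an ordinary multiclass classifier over candidate answers.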
VQA Challenge
Examples from competition results
Q: What is the woman holding?
GT A: laptop
Machine A: laptop
Q: Is it going to rain soon?
GT A: yes
Machine A: yes
VQA Challenge
Examples from competition results
Q: Why is there snow on one side of the stream and clear grass on the other?
GT A: shade
Machine A: yes
Q: Is the hydrant painted a new color?
GT A: yes
Machine A: no
Frontiers of Vision and Language 5
Image Generation from Captions
Image generation from input caption
Photo-realistic image generation itself is difficult
• [Mansimov+, ICLR 2016]: Incrementally draw using LSTM
• N.B. Photo synthesis is well studied [Hays+Efros, 2007]
Generative Adversarial Networks (GAN)
[Goodfellow+, NIPS 2014]
• Unconditional generative model
• Adversarial learning of Generator and Discriminator
• GAN using convolution … DCGAN [Radford+, ICLR 2016]
Before Conditional Generative Models
Generator
Random vector → Image
Discriminator
Discriminates real or fake
is a fake
image from Generator!
is a … hmm
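The adversarial game between Generator and Discriminator can be written as two loss functions. The sketch below uses the binary cross-entropy discriminator loss and the commonly used non-saturating generator loss; the Discriminator's outputs are passed in as plain probabilities, and the function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss: real images should score 1, generated ones 0.

    d_real and d_fake are the Discriminator's probabilities of "real"
    for a real image and for a generated image, each strictly in (0, 1).
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: push the score on fakes toward 1."""
    return -math.log(d_fake)


# A confident Discriminator has low loss while the Generator's loss is high.
print(discriminator_loss(d_real=0.9, d_fake=0.1), generator_loss(d_fake=0.1))
# When the Discriminator cannot tell ("hmm"), it outputs 0.5 for everything.
print(discriminator_loss(d_real=0.5, d_fake=0.5), generator_loss(d_fake=0.5))
```

Training alternates between lowering the first loss with respect to the Discriminator and the second with respect to the Generator.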
Add a Caption to Generator and Discriminator
Conditional Generative Models
Tries to generate an image
・photo-realistic
・related to the caption
Tries to detect an image
・fake
・unrelated
[Reed+, ICML 2016]
Examples of generated images
• Birds (CUB) / Flowers (Oxford-102)
– About 10K images & 5 captions for each image
– 200 kinds of birds / 102 kinds of flowers
A tiny bird, with a tiny beak,
tarsus and feet, a blue crown,
blue coverts, and black
cheek patch
Bright droopy yellow petals
with burgundy streaks, and a
yellow stigma
[Reed+, ICML 2016]
Towards more realistic image generation
StackGAN [Zhang+, 2016]
Two-step GANs
• First GAN generates small and fuzzy image
• Second GAN enlarges and refines it
Examples of generated images
This bird is blue with white
and has a very short beak.
This flower is white and
yellow in color, with petals
that are wavy and smooth.
[Zhang+, 2016]
N.B. Results use datasets specialized in birds / flowers
→ Further breakthroughs are needed to generate general images
Take-home Messages
• Surveyed research on vision and language
1. Image Captioning
2. Video Captioning
3. Multilingual + Image Caption Translation
4. Visual Question Answering
5. Image Generation from Captions
• Contributions of Deep Learning
– Most research themes existed before Deep Learning
– Commodity techniques for processing images, videos and natural
languages
– Evolution of recognition and generation
Towards a new stage of vision and language!

More Related Content

What's hot

[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution
[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution
[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution
Deep Learning JP
 
[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation
[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation
[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation
Kazuyuki Miyazawa
 
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
株式会社メタップスホールディングス
 
Image captioning
Image captioningImage captioning
Image captioning
Rajesh Shreedhar Bhat
 
論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...
論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...
論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...
Toru Tamaki
 
Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...
Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...
Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...
Toshiki Sakai
 
Deep Learning for Computer Vision: Data Augmentation (UPC 2016)
Deep Learning for Computer Vision: Data Augmentation (UPC 2016)Deep Learning for Computer Vision: Data Augmentation (UPC 2016)
Deep Learning for Computer Vision: Data Augmentation (UPC 2016)
Universitat Politècnica de Catalunya
 
GANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズムGANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズム
Hirosaji
 
Explanation methods for Artificial Intelligence Models
Explanation methods for Artificial Intelligence ModelsExplanation methods for Artificial Intelligence Models
Explanation methods for Artificial Intelligence Models
Deep Learning Italia
 
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Krishnaram Kenthapadi
 
Domain Transfer and Adaptation Survey
Domain Transfer and Adaptation SurveyDomain Transfer and Adaptation Survey
Domain Transfer and Adaptation Survey
Sangwoo Mo
 
ESE presentation.pptx
ESE presentation.pptxESE presentation.pptx
ESE presentation.pptx
SuprithaRavishankar
 
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Toshinori Hanya
 
[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
Deep Learning JP
 
CV分野での最近の脱○○系3選
CV分野での最近の脱○○系3選CV分野での最近の脱○○系3選
CV分野での最近の脱○○系3選
Kazuyuki Miyazawa
 
Deep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalDeep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image Retrieval
Edwin Efraín Jiménez Lepe
 
Latent diffusions vs DALL-E v2
Latent diffusions vs DALL-E v2Latent diffusions vs DALL-E v2
Latent diffusions vs DALL-E v2
Vitaly Bondar
 
DLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
DLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose EstimationDLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
DLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
harmonylab
 
Res netと派生研究の紹介
Res netと派生研究の紹介Res netと派生研究の紹介
Res netと派生研究の紹介
masataka nishimori
 
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII
 

What's hot (20)

[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution
[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution
[DL輪読会]GLEAN: Generative Latent Bank for Large-Factor Image Super-Resolution
 
[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation
[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation
[CVPR2020読み会@CV勉強会] 3D Packing for Self-Supervised Monocular Depth Estimation
 
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
 
Image captioning
Image captioningImage captioning
Image captioning
 
論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...
論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...
論文紹介:Learn2Augment: Learning to Composite Videos for Data Augmentation in Act...
 
Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...
Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...
Cv勉強会cvpr2018読み会: Im2Flow: Motion Hallucination from Static Images for Action...
 
Deep Learning for Computer Vision: Data Augmentation (UPC 2016)
Deep Learning for Computer Vision: Data Augmentation (UPC 2016)Deep Learning for Computer Vision: Data Augmentation (UPC 2016)
Deep Learning for Computer Vision: Data Augmentation (UPC 2016)
 
GANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズムGANの概要とDCGANのアーキテクチャ/アルゴリズム
GANの概要とDCGANのアーキテクチャ/アルゴリズム
 
Explanation methods for Artificial Intelligence Models
Explanation methods for Artificial Intelligence ModelsExplanation methods for Artificial Intelligence Models
Explanation methods for Artificial Intelligence Models
 
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
Responsible AI in Industry (Tutorials at AAAI 2021, FAccT 2021, and WWW 2021)
 
Domain Transfer and Adaptation Survey
Domain Transfer and Adaptation SurveyDomain Transfer and Adaptation Survey
Domain Transfer and Adaptation Survey
 
ESE presentation.pptx
ESE presentation.pptxESE presentation.pptx
ESE presentation.pptx
 
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
 
[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
[DL輪読会]CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
 
CV分野での最近の脱○○系3選
CV分野での最近の脱○○系3選CV分野での最近の脱○○系3選
CV分野での最近の脱○○系3選
 
Deep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image RetrievalDeep Residual Hashing Neural Network for Image Retrieval
Deep Residual Hashing Neural Network for Image Retrieval
 
Latent diffusions vs DALL-E v2
Latent diffusions vs DALL-E v2Latent diffusions vs DALL-E v2
Latent diffusions vs DALL-E v2
 
DLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
DLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose EstimationDLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
DLゼミ: ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
 
Res netと派生研究の紹介
Res netと派生研究の紹介Res netと派生研究の紹介
Res netと派生研究の紹介
 
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
 

Viewers also liked

C. G. Jung (1875 1961)
C. G. Jung (1875 1961)C. G. Jung (1875 1961)
C. G. Jung (1875 1961)
Grant Heller
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
Roelof Pieters
 
Jung, carl gustav el hombre y sus simbolos
Jung, carl gustav   el hombre y sus simbolosJung, carl gustav   el hombre y sus simbolos
Jung, carl gustav el hombre y sus simbolos
Alma Heil 916 NOS
 
Carl Gustav Jung
Carl Gustav JungCarl Gustav Jung
Carl Gustav Jung
Guillermo Baeza
 
Carl Gustav Jung
Carl Gustav JungCarl Gustav Jung
Carl Gustav Jung
Nahima Abigail M. Rayo
 
MIGUEL SERRANO C.G.JUNG Y HERMANN HESSE
MIGUEL SERRANO C.G.JUNG Y HERMANN HESSEMIGUEL SERRANO C.G.JUNG Y HERMANN HESSE
MIGUEL SERRANO C.G.JUNG Y HERMANN HESSE
Alma Heil 916 NOS
 
Analytical Psychology
Analytical PsychologyAnalytical Psychology
Analytical Psychology
EHTISHAM MANZOOR
 
An introduction to Deep Learning with Apache MXNet (November 2017)
An introduction to Deep Learning with Apache MXNet (November 2017)An introduction to Deep Learning with Apache MXNet (November 2017)
An introduction to Deep Learning with Apache MXNet (November 2017)
Julien SIMON
 
Presentation_2_11_15_PsychologyMeeting_SimpleVersion
Presentation_2_11_15_PsychologyMeeting_SimpleVersionPresentation_2_11_15_PsychologyMeeting_SimpleVersion
Presentation_2_11_15_PsychologyMeeting_SimpleVersion
Lu Vechi, PhD
 
Convolutional Neural Networks (DLAI D5L1 2017 UPC Deep Learning for Artificia...
Convolutional Neural Networks (DLAI D5L1 2017 UPC Deep Learning for Artificia...Convolutional Neural Networks (DLAI D5L1 2017 UPC Deep Learning for Artificia...
Convolutional Neural Networks (DLAI D5L1 2017 UPC Deep Learning for Artificia...
Universitat Politècnica de Catalunya
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Roelof Pieters
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
Lukas Masuch
 
Python for Image Understanding: Deep Learning with Convolutional Neural Nets
Python for Image Understanding: Deep Learning with Convolutional Neural NetsPython for Image Understanding: Deep Learning with Convolutional Neural Nets
Python for Image Understanding: Deep Learning with Convolutional Neural Nets
Roelof Pieters
 
Deep Learning Explained
Deep Learning ExplainedDeep Learning Explained
Deep Learning Explained
Melanie Swan
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
Christian Perone
 
Holy eucharist
Holy eucharistHoly eucharist
Holy eucharist
CAstañares Inutsij
 
Portable Retinal Imaging and Medical Diagnostics
Portable Retinal Imaging and Medical DiagnosticsPortable Retinal Imaging and Medical Diagnostics
Portable Retinal Imaging and Medical Diagnostics
PetteriTeikariPhD
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
Jen Aman
 
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習
台灣資料科學年會
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applications
Buhwan Jeong
 

Viewers also liked (20)

C. G. Jung (1875 1961)
C. G. Jung (1875 1961)C. G. Jung (1875 1961)
C. G. Jung (1875 1961)
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Jung, carl gustav el hombre y sus simbolos
Jung, carl gustav   el hombre y sus simbolosJung, carl gustav   el hombre y sus simbolos
Jung, carl gustav el hombre y sus simbolos
 
Carl Gustav Jung
Carl Gustav JungCarl Gustav Jung
Carl Gustav Jung
 
Carl Gustav Jung
Carl Gustav JungCarl Gustav Jung
Carl Gustav Jung
 
MIGUEL SERRANO C.G.JUNG Y HERMANN HESSE
MIGUEL SERRANO C.G.JUNG Y HERMANN HESSEMIGUEL SERRANO C.G.JUNG Y HERMANN HESSE
MIGUEL SERRANO C.G.JUNG Y HERMANN HESSE
 
Analytical Psychology
Analytical PsychologyAnalytical Psychology
Analytical Psychology
 
An introduction to Deep Learning with Apache MXNet (November 2017)
An introduction to Deep Learning with Apache MXNet (November 2017)An introduction to Deep Learning with Apache MXNet (November 2017)
An introduction to Deep Learning with Apache MXNet (November 2017)
 
Presentation_2_11_15_PsychologyMeeting_SimpleVersion
Presentation_2_11_15_PsychologyMeeting_SimpleVersionPresentation_2_11_15_PsychologyMeeting_SimpleVersion
Presentation_2_11_15_PsychologyMeeting_SimpleVersion
 
Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning

  • 1. Frontiers of Vision and Language: Bridging Images and Texts by Deep Learning The University of Tokyo Yoshitaka Ushiku losnuevetoros
  • 2. Documents = Vision + Language Vision & Language: an emerging topic • Integration of CV, NLP and ML techs • Several backgrounds – Impact of Deep Learning • Image recognition (CV) • Machine translation (NLP) – Growth of user generated contents – Exploratory researches on Vision and Language
  • 3. 2012: Impact of Deep Learning Academic AI startup A famous company Many slides refer to the first use of CNN (AlexNet) on ImageNet
  • 4. 2012: Impact of Deep Learning Academic AI startup A famous company Large gap of error rates on ImageNet 1st team: 15.3% 2nd team: 26.2% Many slides refer to the first use of CNN (AlexNet) on ImageNet
  • 5. 2012: Impact of Deep Learning According to the official site… 1st team w/ DL Error rate: 15% 2nd team w/o DL Error rate: 26% [http://image-net.org/challenges/LSVRC/2012/results.html] It’s me!!
  • 6. 2014: Another impact of Deep Learning • Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] mitigates the vanishing gradient problem in RNNs → Dealing with relations between distant words in a sentence – Four-layer LSTM is trained in an end-to-end manner → comparable to state-of-the-art (English to French) • Emergence of common techs such as CNN/RNN Reduction of barriers to get into CV+NLP Input Output
  • 7. Growth of user generated contents Especially in content posting/sharing services • Facebook: 300 million photos per day • YouTube: 400 hours of video per minute Pōhutukawa blooms this time of the year in New Zealand. As the flowers fall, the ground underneath the trees look spectacular. Pairs of a sentence + a video / photo → Collectable in large quantities
  • 8. Exploratory researches on Vision and Language Captioning an image associated with its article [Feng+Lapata, ACL 2010] • Input: article + image Output: caption for image • Dataset: Sets of article + image + caption × 3361 King Toupu IV died at the age of 88 last week.
  • 9. Exploratory researches on Vision and Language Captioning an image associated with its article [Feng+Lapata, ACL 2010] • Input: article + image Output: caption for image • Dataset: Sets of article + image + caption × 3361 King Toupu IV died at the age of 88 last week. As a result of these backgrounds: Various research topics such as …
  • 10. Image Captioning Group of people sitting at a table with a dinner. Tourists are standing on the middle of a flat desert. [Ushiku+, ICCV 2015]
  • 11. Video Captioning A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food. [Shin+, ICIP 2016]
  • 12. Multilingual + Image Caption Translation Ein Masten mit zwei Ampeln für Autofahrer. (German) A pole with two lights for drivers. (English) [Hitschler+, ACL 2016]
  • 14. Image Generation from Captions This bird is blue with white and has a very short beak. This flower is white and yellow in color, with petals that are wavy and smooth. [Zhang+, 2016]
  • 15. Goal of this keynote Surveying research on vision & language • Historical flow of each area • Changes by Deep Learning × Deep Learning enabled this research ✓ Deep Learning boosted this research 1. Image Captioning 2. Video Captioning 3. Multilingual + Image Caption Translation 4. Visual Question Answering 5. Image Generation from Captions
  • 16. Frontiers of Vision and Language 1 Image Captioning
  • 17. Every picture tells a story Dataset: Images + <object, action, scene> + Captions 1. Predict <object, action, scene> for an input image using MRF 2. Search for the existing caption associated with similar <object, action, scene> <Horse, Ride, Field> [Farhadi+, ECCV 2010]
  • 18. Every picture tells a story <pet, sleep, ground> See something unexpected. <transportation, move, track> A man stands next to a train on a cloudy day. [Farhadi+, ECCV 2010]
  • 19. Retrieve? Generate? • Retrieve • Generate – Template-based: e.g. generating a Subject+Verb sentence – Template-free [Figure: input image; dataset captions: "A small gray dog on a leash." "A black dog standing in grassy area." "A small white dog wearing a flannel warmer."]
  • 20. Retrieve? Generate? • Retrieve – A small gray dog on a leash. • Generate – Template-based: e.g. generating a Subject+Verb sentence – Template-free [Figure: input image; dataset captions: "A small gray dog on a leash." "A black dog standing in grassy area." "A small white dog wearing a flannel warmer."]
  • 21. Retrieve? Generate? • Retrieve – A small gray dog on a leash. • Generate – Template-based: dog+stand ⇒ A dog stands. – Template-free [Figure: input image; dataset captions: "A small gray dog on a leash." "A black dog standing in grassy area." "A small white dog wearing a flannel warmer."]
  • 22. Retrieve? Generate? • Retrieve – A small gray dog on a leash. • Generate – Template-based: dog+stand ⇒ A dog stands. – Template-free: A small white dog standing on a leash. [Figure: input image; dataset captions: "A small gray dog on a leash." "A black dog standing in grassy area." "A small white dog wearing a flannel warmer."]
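The difference between retrieving a caption and generating one from a template can be sketched in a few lines. This is only a toy illustration: the <object, action, scene> triplets attached to each caption, the slot-overlap score, and the function names are assumptions made for the example, not the method of any paper cited on these slides.

```python
# Hypothetical toy dataset: each stored caption is paired with an
# <object, action, scene> triplet (the triplets are invented for this sketch).
DATASET = [
    (("dog", "walk", "street"), "A small gray dog on a leash."),
    (("dog", "stand", "grass"), "A black dog standing in grassy area."),
    (("dog", "wear", "indoor"), "A small white dog wearing a flannel warmer."),
]


def retrieve_caption(predicted, dataset=DATASET):
    """Retrieve: return the stored caption whose triplet shares the most slots."""
    def overlap(item):
        triplet, _caption = item
        return sum(p == t for p, t in zip(predicted, triplet))

    _best_triplet, best_caption = max(dataset, key=overlap)
    return best_caption


def template_caption(obj, verb):
    """Generate (template-based): fill a fixed Subject+Verb template.

    The inflection is deliberately naive (it just appends "s").
    """
    article = "An" if obj[0].lower() in "aeiou" else "A"
    return f"{article} {obj} {verb}s."


print(retrieve_caption(("dog", "stand", "grass")))
print(template_caption("dog", "stand"))
```

Retrieval can only return sentences that already exist in the dataset, while the template produces new but rigid sentences; template-free generation is what the neural models on the following slides aim for.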
  • 25. Benefits of Deep Learning • Refinement of image recognition [Krizhevsky+, NIPS 2012] • Deep learning appears in machine translation [Sutskever+, NIPS 2014] – LSTM [Hochreiter+Schmidhuber, 1997] solves the vanishing gradient problem in RNNs → Dealing with relations between distant words in a sentence – Four-layer LSTM is trained in an end-to-end manner → comparable to state-of-the-art (English to French) Emergence of common techs such as CNN/RNN Reduction of barriers to get into CV+NLP
  • 26. Google NIC Concatenation of Google’s methods • GoogLeNet [Szegedy+, CVPR 2015] • MT with LSTM [Sutskever+, NIPS 2014] Caption (word seq.) $S_0, \dots, S_N$ for image $I$ • $S_0$: beginning of the sentence • $S_1 = \mathrm{LSTM}(\mathrm{CNN}(I))$ • $S_t = \mathrm{LSTM}(S_{t-1}),\ t = 2, \dots, N-1$ • $S_N$: end of the sentence [Vinyals+, CVPR 2015]
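The recurrence on this slide amounts to a decoding loop: the image feature seeds the recurrent state, and each predicted word is fed back in to predict the next one. The sketch below shows greedy decoding under loud assumptions: `ToyStep` is a hand-written lookup table standing in for a trained CNN+LSTM, and the `init`/`next` interface is hypothetical, not the API of the im2txt code.

```python
START, END = "<s>", "</s>"


def greedy_decode(image_feature, step, max_len=20):
    """Generate words one by one, always taking the highest-scoring next word."""
    state = step.init(image_feature)  # plays the role of LSTM(CNN(I))
    word, caption = START, []
    for _ in range(max_len):
        scores, state = step.next(word, state)  # plays the role of LSTM(S_{t-1})
        word = max(scores, key=scores.get)
        if word == END:
            break
        caption.append(word)
    return caption


class ToyStep:
    """Stand-in for a trained model: a fixed table of next-word scores."""

    TABLE = {
        "<s>": {"a": 0.9, "dog": 0.1},
        "a": {"dog": 0.8, "</s>": 0.2},
        "dog": {"runs": 0.6, "</s>": 0.4},
        "runs": {"</s>": 1.0},
    }

    def init(self, image_feature):
        # A real model would turn the CNN feature into the initial LSTM state.
        return image_feature

    def next(self, word, state):
        return self.TABLE[word], state


print(greedy_decode("toy-image-feature", ToyStep()))  # ['a', 'dog', 'runs']
```

Greedy decoding is the simplest choice; systems of this kind typically use beam search instead, which keeps several candidate sentences alive at each step.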
  • 27. Examples of generated captions [https://github.com/tensorflow/models/tree/master/im2txt] [Vinyals+, CVPR 2015]
  • 28. Comparison to [Ushiku+, ACM MM 2012] Input image → Estimation of important words → Connect the words with grammar model • [Ushiku+, ACM MM 2012]: – Conventional object recognition: Fisher Vector + Linear classifier – Conventional machine translation: Log Linear Model + beam search • Neural image captioning: – Conventional object recognition replaced by a Convolutional Neural Network – Conventional machine translation replaced by a Recurrent Neural Network + beam search • Trained using only images and captions • Approaches are similar to each other
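Both pipelines on this slide end with beam search, so a minimal sketch of it may help. The toy next-word model, its probabilities, and the function names are assumptions for illustration only; neither paper's actual scoring model is reproduced here.

```python
import math


def beam_search(next_probs, start, end, beam_width=2, max_len=10):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for word, p in next_probs(seq).items():
                cand = (logp + math.log(p), seq + [word])
                if word == end:
                    finished.append(cand)
                else:
                    candidates.append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    if not finished:
        finished = beams  # nothing reached the end token within max_len
    return max(finished, key=lambda c: c[0])[1]


# Hypothetical next-word model: the locally best first word ("a") leads to a
# worse full sentence than the second-best first word ("the").
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def toy_next_probs(seq):
    return TOY_MODEL[seq[-1]]


print(beam_search(toy_next_probs, "<s>", "</s>", beam_width=2))
```

With `beam_width=1` the search degenerates to greedy decoding and commits to "a"; with a width of 2 it recovers the higher-probability sentence that starts with "the".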
  • 29. Current development: Accuracy • Attention-based captioning [Xu+, ICML 2015] – Focus on some areas for predicting each word! – Both attention and caption models are trained using pairs of an image & caption
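The core of soft attention is small: turn one score per image region into weights with a softmax, then take the weighted sum of the region features as the context for predicting the next word. In the sketch below the scores are simply given as inputs; in an attention-based captioner they would come from a learned function of the region features and the decoder state, which is omitted here. The function name and the toy numbers are assumptions.

```python
import math


def soft_attention(region_features, scores):
    """Return (softmax weights over regions, weighted-sum context vector)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Two regions with 2-D features; the second region gets a much higher score,
# so the context vector ends up close to the second region's feature.
weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 3.0])
print(weights, context)
```

Because the weights are recomputed for every word, the model can focus on a different image area at each step of the caption.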
  • 30. Current development: Problem setting Dense captioning [Lin+, BMVC 2015] [Johnson+, CVPR 2016]
  • 31. Current development: Problem setting Generating captions for a photo sequence [Park+Kim, NIPS 2015][Huang+, NAACL 2016] The family got together for a cookout. They had a lot of delicious food. The dog was happy to be there. They had a great time on the beach. They even had a swim in the water.
  • 32. Current development: Problem setting Captioning using sentiment terms [Mathews+, AAAI 2016][Shin+, BMVC 2016] Neutral caption Positive caption
  • 33. Frontiers of Vision and Language 2 Video Captioning
  • 34. Before Deep Learning • Grounding of languages and objects in videos [Yu+Siskind, ACL 2013] – Learning from only videos and their captions – Experiments in a small setting with few objects – Controlled and small dataset • Deep Learning should suit this problem – Image Captioning: single image → word sequence – Video Captioning: image sequence → word sequence
  • 35. End-to-end learning by Deep Learning • LRCN [Donahue+, CVPR 2015] – CNN+RNN for • Action recognition • Image / Video Captioning • Video to Text [Venugopalan+, ICCV 2015] – CNNs to recognize • Objects from RGB frames • Actions from flow images – RNN for captioning
  • 36. Video Captioning A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food. [Shin+, ICIP 2016]
  • 37. Video Captioning A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water. [Shin+, ICIP 2016]
  • 38. Video Retrieval from Caption • Input: Captions • Output: A video related to the caption 10 sec video clip from 40 min database! • Video captioning is also addressed A woman in blue is playing ping pong in a room. A guy is skiing with no shirt on and yellow snow pants. A man is water skiing while attached to a long rope. [Yamaguchi+, ICCV 2017]
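One common way to retrieve a clip from a caption is to embed both in a shared vector space and rank clips by cosine similarity. The sketch below assumes such embeddings already exist; the 2-D vectors, the clip ids, and the function names are invented for illustration, and this is not confirmed to be the method of [Yamaguchi+, ICCV 2017].

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_clip(caption_vec, clip_vecs):
    """Return the id of the clip whose embedding is closest to the caption's."""
    return max(clip_vecs, key=lambda clip_id: cosine(caption_vec, clip_vecs[clip_id]))


# Hypothetical embeddings of two clips and one caption.
clips = {"ping_pong": [1.0, 0.0], "skiing": [0.0, 1.0]}
print(retrieve_clip([0.9, 0.1], clips))  # ping_pong
```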
  • 39. Frontiers of Vision and Language 3 Multilingual + Image Caption Translation
  • 40. Towards multiple languages Datasets with multilingual captions • IAPR TC12 [Grubinger+, 2006] English + German • Multi30K [Elliott+, 2016] English + German • STAIR Captions [Yoshikawa+, 2017] English + Japanese Development of cross-lingual tasks • Non-English-caption generation • Image Caption Translation Input: Pair of a caption in Language A + an image or A caption in Language A Output: Caption in Language B
  • 42. Non-English-caption generation Most research generates English captions • Japanese [Miyazaki+Shimizu, ACL 2016] • Chinese [Li+, ICMR 2016] • Turkish [Unal+, SIU 2016] Çimlerde koşan bir köpek (A dog running on the grass) 金色头发的小女孩 (A little girl with blonde hair) 柵の中にキリンが一頭立っています (A giraffe is standing inside a fence)
  • 43. Just collecting non-English captions? Transfer learning among languages [Miyazaki+Shimizu, ACL 2016] • Vision-Language grounding $W_{im}$ is transferred • Efficient learning using a small amount of captions [Figure: partial English caption "an elephant is …" alongside partial Japanese caption "一匹の象が 土の …"]
  • 45. Machine translation via visual data Images can boost MT [Calixto+, 2012] • Example below (English to Portuguese): Does the word “seal” in English – mean “seal” similar to “stamp”? – mean “seal” which is a sea animal? • [Calixto+, 2012] argue that the mistranslation can be avoided using a related image (w/o experiments) Mistranslation!
  • 46. Input: Caption in Language A + image • Caption translation via an associated image [Elliott+, 2015] [Hitschler+, ACL 2016] – Generate translation candidates – Re-rank the candidates using similar images’ captions in Language B Eine Person in einem Anzug und Krawatte und einem Rock. (In German) Translation w/o the related image A person in a suit and tie and a rock. Translation with the related image A person in a suit and tie and a skirt.
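The two steps on this slide (generate candidates, then re-rank them with captions of similar images) can be sketched as follows. Everything concrete here is an assumption for illustration: the MT scores are made-up numbers, the neighbor captions are invented, and the word-overlap score is a stand-in for whatever similarity the cited papers actually use.

```python
def rerank(candidates, neighbor_captions, weight=1.0):
    """Re-rank (translation, mt_score) pairs; returns translations, best first.

    Each candidate's MT score is boosted by the fraction of its words that
    also appear in Language-B captions of visually similar images.
    """
    neighbor_words = set()
    for caption in neighbor_captions:
        neighbor_words.update(caption.lower().rstrip(".").split())

    def combined(item):
        text, mt_score = item
        words = text.lower().rstrip(".").split()
        overlap = sum(w in neighbor_words for w in words) / len(words)
        return mt_score + weight * overlap

    return [text for text, _score in sorted(candidates, key=combined, reverse=True)]


# Hypothetical MT output: the text-only system slightly prefers "rock".
candidates = [
    ("A person in a suit and tie and a rock.", 0.55),
    ("A person in a suit and tie and a skirt.", 0.50),
]
# Hypothetical English captions of images similar to the input image.
neighbors = ["A woman wearing a skirt and a jacket.", "A person in a skirt."]
print(rerank(candidates, neighbors)[0])
```

In this toy case the neighbor captions mention "skirt" but never "rock", which is enough to flip the ranking toward the translation that matches the image.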
  • 47. Input: Caption in Language A • Cross-lingual document retrieval via images [Funaki+Nakayama, EMNLP 2015] • Zero-shot machine translation [Nakayama+Nishida, 2017]
  • 48. Frontiers of Vision and Language 4 Visual Question Answering
  • 49. Visual Question Answering (VQA) Proposed in Human-Computer Interfaces • VizWiz [Bigham+, UIST 2010] Manually solved on AMT • Automation for the first time (w/o Deep Learning) [Malinowski+Fritz, NIPS 2014] • Similar term: Visual Turing Test [Malinowski+Fritz, 2014]
  • 50. VQA: Visual Question Answering • Established VQA as an AI problem – Provided a benchmark dataset – Experimental results with reasonable baselines • Portal web site is also organized – http://www.visualqa.org/ – Annual competition for VQA accuracy [Antol+, ICCV 2015] What color are her eyes? What is the mustache made of?
  • 51. VQA Dataset Collected questions and answers on AMT • Over 100K real images and 30K abstract images • About 700K questions+10 answers for each
  • 52. VQA = Multiclass Classification Integrated feature $z_{I+Q}$ is fed to a standard classifier • Image $I$ → Image feature $x_I$ • Question $Q$ (What objects are found on the bed?) → Question feature $x_Q$ • $x_I$, $x_Q$ → Integrated feature $z_{I+Q}$ → Answer $A$ (bed sheets, pillow)
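Treating VQA as multiclass classification means that the answer is picked from a fixed list of candidate answers by an ordinary classifier applied to $z_{I+Q}$. A minimal sketch with a linear classifier follows; the weight matrix, the answer list, and the feature values are toy assumptions, not trained parameters.

```python
def classify_answer(z, weights, answers):
    """Score every candidate answer with one linear layer; return the argmax."""
    scores = [sum(w * v for w, v in zip(row, z)) for row in weights]
    return answers[scores.index(max(scores))]


# Hypothetical 2-D integrated feature and a 2-answer vocabulary.
answers = ["pillow", "laptop"]
weights = [[1.0, 0.0],   # row of weights for "pillow"
           [0.0, 1.0]]   # row of weights for "laptop"
print(classify_answer([0.2, 0.7], weights, answers))  # laptop
```

One consequence of this framing is that the system can only produce answers that appear in its candidate list.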
  • 53. Development of VQA How to calculate the integrated feature $z_{I+Q}$? • VQA [Antol+, ICCV 2015]: Just concatenate them: $z_{I+Q} = [x_I;\ x_Q]$ • Summation e.g. Summation of an image feature with attention and a question feature [Xu+Saenko, ECCV 2016]: $z_{I+Q} = x_I + x_Q$ • Multiplication e.g. Bilinear multiplication using DFT [Fukui+, EMNLP 2016]: $z_{I+Q} = x_I \odot x_Q$ • Hybrid of summation and multiplication e.g. Concatenation of sum and multiplication [Saito+, ICME 2017]: $z_{I+Q} = [x_I + x_Q;\ x_I \odot x_Q]$
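The four ways of building the integrated feature can be compared side by side on toy vectors. Note the simplification: multiplication is shown as a plain element-wise product, whereas the bilinear method cited on the slide is considerably more elaborate. Function names and vector values are assumptions for the example.

```python
def concat(x_i, x_q):
    """Concatenation: stack the two feature vectors end to end."""
    return x_i + x_q


def summation(x_i, x_q):
    """Summation: element-wise sum (the vectors must have equal length)."""
    return [a + b for a, b in zip(x_i, x_q)]


def multiplication(x_i, x_q):
    """Multiplication: element-wise product, a simplified stand-in for bilinear pooling."""
    return [a * b for a, b in zip(x_i, x_q)]


def hybrid(x_i, x_q):
    """Hybrid: concatenate the sum and the product."""
    return summation(x_i, x_q) + multiplication(x_i, x_q)


x_i, x_q = [1.0, 2.0], [3.0, 4.0]
print(concat(x_i, x_q))          # [1.0, 2.0, 3.0, 4.0]
print(summation(x_i, x_q))       # [4.0, 6.0]
print(multiplication(x_i, x_q))  # [3.0, 8.0]
print(hybrid(x_i, x_q))          # [4.0, 6.0, 3.0, 8.0]
```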
  • 54. VQA Challenge Examples from competition results Q: What is the woman holding? GT A: laptop Machine A: laptop Q: Is it going to rain soon? GT A: yes Machine A: yes
  • 55. VQA Challenge Examples from competition results Q: Why is there snow on one side of the stream and clear grass on the other? GT A: shade Machine A: yes Q: Is the hydrant painted a new color? GT A: yes Machine A: no
  • 56. Frontiers of Vision and Language 5 Image Generation from Captions
  • 57. Image generation from input caption Photo-realistic image generation itself is difficult • [Mansimov+, ICLR 2016]: Incrementally draw using LSTM • N.B. Photo synthesis is well studied [Hays+Efros, 2007]
  • 58. Generative Adversarial Networks (GAN) [Goodfellow+, NIPS 2014] • Unconditional generative model • Adversarial learning of Generator and Discriminator • GAN using convolution … DCGAN [Radford+, ICLR 2016] Before Conditional Generative Models • Generator: Random vector → Image • Discriminator: Discriminates real or fake ("[image] is a fake image from Generator!")
  • 62. Generative Adversarial Networks (GAN) [Goodfellow+, NIPS 2014] • Unconditional generative model • Adversarial learning of Generator and Discriminator • GAN using convolution … DCGAN [Radford+, ICLR 2016] Before Conditional Generative Models • Generator: Random vector → Image • Discriminator: Discriminates real or fake ("[image] is a … hmm")
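The adversarial game sketched on these slides is usually written as the minimax objective of [Goodfellow+, NIPS 2014]. Here $G$ maps a random vector $z$ to an image, $D$ outputs the probability that its input is a real image, and the symbols $p_{\text{data}}$ and $p_z$ (the data and noise distributions) are notation introduced here rather than taken from the slides:

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The Discriminator raises this value by labelling real images as real and generated images as fake, while the Generator lowers it by producing images that the Discriminator accepts, which is the "… hmm" moment on the slide.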
  • 63. Add a Caption to Generator and Discriminator Conditional Generative Models • Generator: tries to generate an image that is ・photo-realistic ・related to the caption • Discriminator: tries to detect an image that is ・fake ・unrelated to the caption [Reed+, ICML 2016]
  • 64. Examples of generated images • Birds (CUB) / Flowers (Oxford-102) – About 10K images & 5 captions for each image – 200 kinds of birds / 102 kinds of flowers A tiny bird, with a tiny beak, tarsus and feet, a blue crown, blue coverts, and black cheek patch Bright droopy yellow petals with burgundy streaks, and a yellow stigma [Reed+, ICML 2016]
  • 65. Towards more realistic image generation StackGAN [Zhang+, 2016] Two-step GANs • First GAN generates a small and fuzzy image • Second GAN enlarges and refines it
  • 66. Examples of generated images This bird is blue with white and has a very short beak. This flower is white and yellow in color, with petals that are wavy and smooth. [Zhang+, 2016]
  • 67. Examples of generated images This bird is blue with white and has a very short beak. This flower is white and yellow in color, with petals that are wavy and smooth. [Zhang+, 2016] N.B. Results use datasets specialized in birds / flowers → Further breakthroughs are needed to generate general images
  • 68. Take-home Messages • Surveyed research on vision and language 1. Image Captioning 2. Video Captioning 3. Multilingual + Image Caption Translation 4. Visual Question Answering 5. Image Generation from Captions • Contributions of Deep Learning – Most research themes existed before Deep Learning – Commodity techs for processing images, videos and natural languages – Evolution of recognition and generation Towards a new stage of vision and language!

Editor's Notes

  1. In ILSVRC 2012, the only team that used a CNN, for the first time in the history of ILSVRC, won first place with overwhelming accuracy. This result triggered the spread of deep learning that continues today, and it has been reported on many slides. As you can see, slides from academics, from AI startups participating in this GTC, and from a famous company holding this GTC all report the same thing.
  2. The slide says that there was a large gap of error rates on ImageNet. Whereas the 2nd team achieved 26.2%, the 1st team achieved 15.3%. Again, there was a large gap of error rates, and again, there was a large gap of error rates. The 1st team is very famous, but some of you may be curious about the 2nd team; who are they?
  3. You can easily find the answer because the official site still has the information about ILSVRC 2012. Yes, the 1st team with deep learning achieved 15% error, the 2nd team without deep learning achieved 26% error … and if you scroll down this web page, the members of the second team are shown in a table. There seem to be several members in the second team, and now please remember this name. It is hard to pronounce. Yoshitaka Ushiku.
  4. Therefore, we propose a new approach by solving a novel problem, the “multi-keyphrase problem”. We assume that the contents of images can be … For example, if the image of the locomotive is the input, two keyphrases “” and “” are important. With only these keyphrases, we can generate a sentence by connecting them using grammatical knowledge. And even a rare image like the last one can be explained by estimating “man bites”, which describes the relation between “man” and “bite”. (Tap, then read) = “comes down to”