Recognize, Describe, and Generate:
Introduction of Recent Work at MIL
The University of Tokyo
Yoshitaka Ushiku
Journalist Robot
• Born in 2006
• Objective: publishing news automatically
– Recognize
• Objects, people, actions
– Describe
• What is happening
– Generate
• Content, as humans do
Outline
• Journalist Robot: ancestor of current work in MIL
– Today's research originates with this robot
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning
• Video Captioning
– Generate
• Image Reconstruction
• Video Generation
Recognize
MILJS: JavaScript × Deep Learning
[Hidaka+, ICLR Workshop 2017]
MILJS: JavaScript × Deep Learning
• Support for both learning and inference
• Support for nodes with GPGPUs
– Currently WebCL is utilized.
– Now working on WebGPU.
• Support for nodes w/o GPGPUs
• No software installation required
– Even ResNet with 152 layers can be trained
[Hidaka+, ICLR Workshop 2017]
WebDNN: Fastest Inference Framework on Web Browser
Optimizes trained models for inference only
• Caffe, Keras, Chainer
• IE, Edge, Safari, Chrome, Firefox on PCs and mobiles
• GPU computation
• Web camera
[Kikura+ 2017]
Asymmetric Tri-training for Domain Adaptation
• Unsupervised domain adaptation
Trained on MNIST → Works on SVHN?
– Ground-truth labels are available for the source (MNIST)
– However, there are no labels for the target (SVHN)
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
• Asymmetric Tri-training: pseudo labels for target domain
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
1st round: Training on MNIST → add pseudo labels for easy target samples
2nd round onward: Training on MNIST + pseudo-labeled samples → add more pseudo labels
(Figure: SVHN digits pseudo-labeled “eight” and “nine”)
[Saito+, ICML 2017]
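The pseudo-labeling loop above can be sketched on toy data. This is an illustrative simplification, not the paper's model: nearest-centroid classifiers stand in for the two CNN labelers F1/F2, the blob data and the 0.9 confidence threshold are made-up choices.

```python
# Sketch of the pseudo-labeling step in asymmetric tri-training, on toy
# 2-D data. Two labelers (F1, F2) are trained on different source
# subsets; target samples on which they agree confidently get pseudo
# labels, and a target-specific model Ft is fit on those alone.
import numpy as np

rng = np.random.default_rng(0)
# Source domain: two labeled Gaussian blobs; target: the same blobs, shifted
Xs = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
ys = np.array([0] * 100 + [1] * 100)
Xt = Xs + 0.5                                   # unlabeled target domain

def fit(X, y):                                  # "training" = class centroids
    return np.stack([X[y == c].mean(0) for c in (0, 1)])

def proba(C, X):                                # softmax over negative distance
    d = np.linalg.norm(X[:, None, :] - C[None], axis=2)
    e = np.exp(-d)
    return e / e.sum(1, keepdims=True)

idx1, idx2 = rng.permutation(200)[:150], rng.permutation(200)[:150]
F1, F2 = fit(Xs[idx1], ys[idx1]), fit(Xs[idx2], ys[idx2])

p1, p2 = proba(F1, Xt), proba(F2, Xt)
agree = p1.argmax(1) == p2.argmax(1)
confident = np.maximum(p1.max(1), p2.max(1)) > 0.9
mask = agree & confident                        # the "easy" target samples
pseudo_y = p1.argmax(1)[mask]
Ft = fit(Xt[mask], pseudo_y)                    # target-specific model
print(f"pseudo-labeled {mask.sum()} / {len(Xt)} target samples")
```

Each subsequent round would re-train F1/F2 on source plus the pseudo-labeled samples and relax the confidence threshold, growing the pseudo-labeled set.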
End-to-end learning for environmental sound classification
Existing methods for speech / sound recognition:
① Feature extraction: Fourier transform (log-mel features)
② Classification: CNN with the extracted feature map
[Tokozume+, ICASSP 2017]
Log-mel features are suitable for human speech; but for environmental sounds…?
End-to-end learning for environmental sound classification
Proposed approach (EnvNet):
CNN for both ① feature map extraction and ② classification
[Tokozume+, ICASSP 2017]
Extracted “feature map”
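The contrast between the two front ends can be made concrete with a toy sketch. This is not EnvNet itself: the frame size, filter count, and filter length are arbitrary assumptions, and random filters stand in for the first conv layer, whose weights EnvNet learns end to end.

```python
# (1) Fixed feature extraction: framed log power spectrum, the
#     hand-crafted path that log-mel features belong to.
# (2) Learned front end: a bank of 1-D filters applied to raw samples.
import numpy as np

fs = 16000
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.default_rng(0).normal(size=fs)

# (1) 25 ms frames -> windowed FFT -> log power spectrum
frames = wave[: fs // 400 * 400].reshape(-1, 400)
spec = np.abs(np.fft.rfft(frames * np.hanning(400))) ** 2
log_spec = np.log(spec + 1e-10)                           # shape (40, 201)

# (2) Random filters as a stand-in for a trained first conv layer
filters = np.random.default_rng(1).normal(size=(8, 64))   # 8 filters, 64 taps
feature_map = np.stack([np.convolve(wave, f, mode="valid") for f in filters])
print(log_spec.shape, feature_map.shape)
```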
End-to-end learning for environmental sound classification
Comparison of accuracy [%] on ESC-50[Piczak, ACM MM 2015]
[Tokozume+, ICASSP 2017]
log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
End-to-end CNN (Ours): 64.0
End-to-end CNN & log-mel feature + CNN (Ours): 71.0
EnvNet can extract discriminative features for environmental sounds
Multispectral Segmentation & Detection
• Robust Segmentation & Detection
– Daytime and nighttime
– Fog
• Use of multispectral images
– RGB
– Far-infrared (FIR)
– Mid-infrared (MIR)
– Near-infrared (NIR)
[Ha+, IROS 2017][Karasawa+, submitted to ACM MM 2017]
(Figure: example RGB, MIR, NIR, and FIR images)
• How to combine multispectral images?
– Just concatenate them to a multi-channel image?
– Develop some sophisticated methods?
• To capture multispectral images
– Develop a novel single camera
– Use different cameras
Low cost, but many problems:
• Camera parameters
• Capture timings
• Ours: Multispectral Fusion Network
– Baseline: SegNet [Badrinarayanan+, PAMI 2017]
– Independent encoders for each camera
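The per-modality-encoder idea can be sketched as follows. This is only the data flow, under stated assumptions: a 2×2 mean pooling stands in for the SegNet encoder blocks, image sizes are placeholders, and the real network fuses learned feature maps, not raw pools.

```python
# One small "encoder" per spectral band; outputs are merged into a
# single volume that a shared decoder would turn into per-pixel labels.
import numpy as np

rng = np.random.default_rng(0)
bands = {m: rng.normal(size=(64, 64)) for m in ["RGB", "FIR", "MIR", "NIR"]}

def encode(img):
    # Placeholder encoder: 2x2 mean pooling instead of conv blocks
    return img.reshape(32, 2, 32, 2).mean(axis=(1, 3))

fused = np.stack([encode(img) for img in bands.values()])   # (4, 32, 32)
# A shared decoder would map the fused volume to segmentation scores
print(fused.shape)
```

Keeping the encoders independent avoids forcing one set of filters to serve bands with very different statistics, which is the drawback of simply concatenating the bands into one multi-channel input.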
Multispectral Segmentation & Detection [Ha+, IROS 2017][Karasawa+, submitted to ACM MM 2017]
Visual Question Answering (VQA)
Question answering system for
• Associated image
• Question by natural language
[Saito+, ICME 2017]
Q: Is it going to rain soon?
Ground Truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other?
Ground Truth A: shade
Visual Question Answering (VQA)
After integration into z_{I+Q}: usual classification
Image I → image feature x_I
Question Q (“What objects are found on the bed?”) → question feature x_Q
x_I and x_Q → integrated vector z_{I+Q}
Answer A: bed sheets, pillow
[Saito+, ICME 2017]
VQA = multi-class classification
Visual Question Answering (VQA)
Current advancement: improving how to integrate x_I and x_Q
• Concatenation: z_{I+Q} = [x_I; x_Q]
e.g.) [Antol+, ICCV 2015]
• Summation: z_{I+Q} = x_I + x_Q
e.g.) Image feature (with attention) + Question feature [Xu+Saenko, ECCV 2016]
• Multiplication: z_{I+Q} = x_I ⊙ x_Q
e.g.) Bilinear multiplication [Fukui+, EMNLP 2016]
• This work: DualNet combining sum, multiplication, and concatenation: z_{I+Q} = [x_I + x_Q; x_I ⊙ x_Q]
[Saito+, ICME 2017]
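The fusion operators compared above can be written out in a few lines. Feature dimensions are assumptions, and the final line is my reading of the slide's DualNet combination (sum and element-wise product, concatenated), not a line-for-line reproduction of the model.

```python
# The three classic fusions of an image feature x_I and a question
# feature x_Q, plus a DualNet-style combination of sum and product.
import numpy as np

rng = np.random.default_rng(0)
x_I = rng.normal(size=512)   # image feature (512-d is an assumption)
x_Q = rng.normal(size=512)   # question feature

z_concat = np.concatenate([x_I, x_Q])      # concatenation
z_sum = x_I + x_Q                          # summation
z_mult = x_I * x_Q                         # element-wise multiplication
z_dual = np.concatenate([z_sum, z_mult])   # sum & product, concatenated
print(z_concat.shape, z_sum.shape, z_mult.shape, z_dual.shape)
```

Whichever z_{I+Q} is chosen, it feeds a standard classifier over the answer vocabulary, which is why the slide reduces VQA to multi-class classification.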
Visual Question Answering (VQA)
VQA Challenge 2016 (in CVPR 2016)
Won 1st place on abstract images without an attention mechanism
[Saito+, ICME 2017]
Q: What fruit is yellow and brown?
A: banana
Q: How many screens are there?
A: 2
Q: What is the boy playing with?
A: teddy bear
Q: Are there any animals swimming in the pond?
A: no
Describe
Automatic Image Captioning [Ushiku+, ACM MM 2011]
Training Dataset
A woman posing
on a red scooter.
White and gray
kitten lying on
its side.
A white van
parked in an
empty lot.
A white cat rests
head on a stone.
Silver car parked
on side of road.
A small gray dog
on a leash.
A black dog
standing in a
grassy area.
A small white dog
wearing a flannel
warmer.
Input Image
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
Nearest Captions
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
A small white dog standing on a leash.
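The retrieval flavor of the slide above can be sketched in a few lines: embed images, find the nearest training images, and reuse their captions. The random features and the caption strings are stand-ins; a real system would use learned image descriptors and then compose a new caption from the retrieved ones.

```python
# Nearest-caption retrieval: the query reuses captions of its
# nearest neighbors in feature space.
import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(8, 16))            # 8 "training images"
train_caps = [f"caption {i}" for i in range(8)]   # their captions
query = train_feats[3] + 0.01 * rng.normal(size=16)   # a near-duplicate

d = np.linalg.norm(train_feats - query, axis=1)   # distance to each image
nearest = np.argsort(d)[:3]                       # 3 nearest captions
print([train_caps[i] for i in nearest])
```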
Automatic Image Captioning [ACM MM 2012, ICCV 2015]
Group of people sitting
at a table with a dinner.
Tourists are standing on
the middle of a flat desert.
Image Captioning + Sentiment Terms [Shin+, BMVC 2016]
A confused man in a
blue shirt is sitting on a
bench.
A man in a blue shirt
and blue jeans is
standing in the
overlooked water.
A zebra standing in a
field with a tree in the
dirty background.
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
[Shin+, BMVC 2016]
The most probable noun is memorized
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
2. Force the model to predict a sentiment term just before that noun
[Shin+, BMVC 2016]
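The effect of step 2 can be illustrated with plain list manipulation. The real model re-decodes with the RNN under this constraint; the caption, noun, and sentiment term below are made-up stand-ins that only show where the constraint places the new word.

```python
# Toy view of step 2: insert a sentiment term right before the
# memorized most-probable noun of the step-1 caption.
caption = "a man in a blue shirt is sitting on a bench".split()
noun = "man"                 # most probable noun from step 1
sentiment = "confused"       # sentiment term chosen for the image

i = caption.index(noun)
caption_with_sentiment = caption[:i] + [sentiment] + caption[i:]
print(" ".join(caption_with_sentiment))
# → "a confused man in a blue shirt is sitting on a bench"
```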
Beyond Caption to Narrative
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Beyond Caption to Narrative [Shin+, ICIP 2016]
(Figure: per-segment captions combined into one narrative)
A man is holding a box of doughnuts. / he and a woman are standing next each other. / she is holding a plate of food. → Narrative
Beyond Caption to Narrative
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
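The surface pattern of the narratives above can be mimicked with simple connectives. This only illustrates the output format; the system generates the segment captions and their linking jointly, and the captions below are made-up examples.

```python
# Joining per-segment captions into a narrative with fixed connectives.
captions = [
    "a man is holding a box of doughnuts",
    "he and a woman are standing next to each other",
    "she is holding a plate of food",
]
connectives = [""] + ["then "] * (len(captions) - 1)
sentences = [(c + s).capitalize() + "." for c, s in zip(connectives, captions)]
narrative = " ".join(sentences)
print(narrative)
```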
Generate
Image Reconstruction [Kato+, CVPR 2014]
(Figure: local descriptors d_1, d_2, …, d_j, …, d_k, …, d_N modeled by a distribution p(d; θ); each image patch is rendered as x_j = f(θ_j); the image is labeled “Cat”)
Traditional pipeline for image classification:
Camera → extracting local descriptors → collecting descriptors → calculating global feature → classifying images
Image Reconstruction [Kato+, CVPR 2014]
(Figure: the same pipeline run in reverse)
Inverse problem: image reconstruction from a label (e.g. “Pot”)
Image Reconstruction [Kato+, CVPR 2014]
cat (bombay), camera, grand piano, headphone, joshua tree, pyramid, wheel chair, gramophone
Pot
Optimized arrangement using:
Global location cost + Adjacency cost
Other examples
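A toy version of the arrangement objective (global location cost + adjacency cost) can be written down directly. All costs, the 2×2 grid, and the brute-force solver are illustrative assumptions, not the paper's formulation.

```python
# Place 4 patches on a 2x2 grid, minimizing a global-location cost
# (each patch prefers a position) plus an adjacency cost (dissimilar
# neighbors are penalized).
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n = 4
pref = rng.uniform(0, 1, (n, 2))                 # preferred (row, col) per patch
sim = rng.uniform(0, 1, (n, n))
sim = (sim + sim.T) / 2                          # symmetric patch similarity
cells = [(r, c) for r in range(2) for c in range(2)]

def cost(order):
    # Global location cost: distance from each patch's preferred cell
    loc = sum(np.hypot(*(np.array(cells[i]) - pref[p]))
              for i, p in enumerate(order))
    # Adjacency cost: dissimilar patches in neighboring cells
    adj = sum((1 - sim[order[i], order[j]])
              for i in range(n) for j in range(n)
              if abs(cells[i][0] - cells[j][0]) + abs(cells[i][1] - cells[j][1]) == 1)
    return loc + adj

best = min(permutations(range(n)), key=cost)     # brute force over 4! orders
print(best, round(cost(best), 3))
```

At realistic patch counts the exhaustive search above is infeasible, so the actual method has to optimize the arrangement approximately.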
Video Generation
• Image generation is still challenging
Success is limited to controlled settings:
– Human faces
– Birds
– Flowers
• Video generation is …
– Additionally requiring temporal consistency
– Extremely challenging
[Yamamoto+, ACMMM 2016]
[Vondrick+, NIPS 2016]
BEGAN
[Berthelot+, 2017 Mar.]
StackGAN
[Zhang+, 2016 Dec.]
Video Generation
• This work: generating easy videos
– C3D (3D convolutional neural network)
for conditional generation with an input label
– tempCAE (temporal convolutional auto-encoder)
for regularizing video to improve its naturalness
[Yamamoto+, ACMMM 2016]
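The temporal-consistency requirement can be made concrete with an explicit frame-to-frame penalty. This is only the intuition behind the tempCAE regularizer, not the model itself: tempCAE is a convolutional auto-encoder, and the running-mean "smoothing" below is a crude stand-in for its effect.

```python
# Penalize frame-to-frame differences of a generated clip: a smoother
# clip scores lower, which is what a temporal regularizer encourages.
import numpy as np

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 32, 32))      # (frames, H, W), pure noise

def temporal_smoothness(clip):
    # Mean squared difference between consecutive frames
    return float(np.mean((clip[1:] - clip[:-1]) ** 2))

noisy = video
# Running mean over time as a stand-in for a temporally regularized clip
smoothed = np.cumsum(video, axis=0) / np.arange(1, 17)[:, None, None]
print(round(temporal_smoothness(noisy), 3),
      round(temporal_smoothness(smoothed), 3))
```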
Video Generation [Yamamoto+, ACMMM 2016]
Ours
(C3D+tempCAE)
Only C3D
Ours
(C3D+tempCAE)
Only C3D
Car runs to left
Rocket flies up
Conclusion
• MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
• This talk introduces some of the current research
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning, Video Captioning
– Generate
• Image Reconstruction, Video Generation