Recognize, Describe, and Generate:
Introduction of Recent Work at MIL
The University of Tokyo
Yoshitaka Ushiku
Journalist Robot
• Born in 2006
• Objective: publishing news automatically
– Recognize
• Objects, people, actions
– Describe
• What is happening
– Generate
• Content, as humans do
Outline
• Journalist Robot: ancestor of current work in MIL
– Today's research originates with this robot
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning
• Video Captioning
– Generate
• Image Reconstruction
• Video Generation
Recognize
MILJS: JavaScript × Deep Learning
[Hidaka+, ICLR Workshop 2017]
MILJS: JavaScript × Deep Learning
• Support for both learning and inference
• Support for nodes with GPGPUs
– Currently WebCL is utilized.
– Now working on WebGPU.
• Support for nodes w/o GPGPUs
• No software installation required
– Even ResNet with 152 layers can be trained
[Hidaka+, ICLR Workshop 2017]
WebDNN: Fastest Inference Framework on Web Browser
Optimizes trained models for inference only
• Caffe, Keras, Chainer
• IE, Edge, Safari, Chrome, Firefox on PCs and mobiles
• GPU computation
• Web camera
[Kikura+ 2017]
Asymmetric Tri-training for Domain Adaptation
• Unsupervised domain adaptation
Trained on MNIST → Works on SVHN?
– Ground-truth labels are available for the source (MNIST)
– However, there are no labels for the target (SVHN)
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
• Asymmetric Tri-training: pseudo labels for target domain
[Saito+, ICML 2017]
Asymmetric Tri-training for Domain Adaptation
1st round: Training on MNIST → add pseudo labels for easy target samples
2nd round onward: Training on MNIST + pseudo-labeled samples → add more pseudo labels
(Figure: SVHN digits pseudo-labeled “eight” and “nine”)
[Saito+, ICML 2017]
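The pseudo-labeling loop above can be sketched on toy data. This is an illustrative simplification, not the paper's model: nearest-centroid classifiers stand in for the two CNN labelers F1/F2, the blob data and the 0.9 confidence threshold are made-up choices.

```python
# Sketch of the pseudo-labeling step in asymmetric tri-training, on toy
# 2-D data. Two labelers (F1, F2) are trained on different source
# subsets; target samples on which they agree confidently get pseudo
# labels, and a target-specific model Ft is fit on those alone.
import numpy as np

rng = np.random.default_rng(0)
# Source domain: two labeled Gaussian blobs; target: the same blobs, shifted
Xs = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
ys = np.array([0] * 100 + [1] * 100)
Xt = Xs + 0.5                                   # unlabeled target domain

def fit(X, y):                                  # "training" = class centroids
    return np.stack([X[y == c].mean(0) for c in (0, 1)])

def proba(C, X):                                # softmax over negative distance
    d = np.linalg.norm(X[:, None, :] - C[None], axis=2)
    e = np.exp(-d)
    return e / e.sum(1, keepdims=True)

idx1, idx2 = rng.permutation(200)[:150], rng.permutation(200)[:150]
F1, F2 = fit(Xs[idx1], ys[idx1]), fit(Xs[idx2], ys[idx2])

p1, p2 = proba(F1, Xt), proba(F2, Xt)
agree = p1.argmax(1) == p2.argmax(1)
confident = np.maximum(p1.max(1), p2.max(1)) > 0.9
mask = agree & confident                        # the "easy" target samples
pseudo_y = p1.argmax(1)[mask]
Ft = fit(Xt[mask], pseudo_y)                    # target-specific model
print(f"pseudo-labeled {mask.sum()} / {len(Xt)} target samples")
```

Each subsequent round would re-train F1/F2 on source plus the pseudo-labeled samples and relax the confidence threshold, growing the pseudo-labeled set.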
End-to-end learning for environmental sound classification
Existing methods for speech / sound recognition:
① Feature extraction: Fourier transform (log-mel features)
② Classification: CNN with the extracted feature map
[Tokozume+, ICASSP 2017]
Log-mel features are suitable for human speech; but for environmental sounds…?
End-to-end learning for environmental sound classification
Proposed approach (EnvNet):
CNN for both ① feature map extraction and ② classification
[Tokozume+, ICASSP 2017]
Extracted “feature map”
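The contrast between the two front ends can be made concrete with a toy sketch. This is not EnvNet itself: the frame size, filter count, and filter length are arbitrary assumptions, and random filters stand in for the first conv layer, whose weights EnvNet learns end to end.

```python
# (1) Fixed feature extraction: framed log power spectrum, the
#     hand-crafted path that log-mel features belong to.
# (2) Learned front end: a bank of 1-D filters applied to raw samples.
import numpy as np

fs = 16000
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.default_rng(0).normal(size=fs)

# (1) 25 ms frames -> windowed FFT -> log power spectrum
frames = wave[: fs // 400 * 400].reshape(-1, 400)
spec = np.abs(np.fft.rfft(frames * np.hanning(400))) ** 2
log_spec = np.log(spec + 1e-10)                           # shape (40, 201)

# (2) Random filters as a stand-in for a trained first conv layer
filters = np.random.default_rng(1).normal(size=(8, 64))   # 8 filters, 64 taps
feature_map = np.stack([np.convolve(wave, f, mode="valid") for f in filters])
print(log_spec.shape, feature_map.shape)
```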
End-to-end learning for environmental sound classification
Comparison of accuracy [%] on ESC-50[Piczak, ACM MM 2015]
[Tokozume+, ICASSP 2017]
log-mel feature + CNN [Piczak, MLSP 2015]: 64.5
End-to-end CNN (Ours): 64.0
End-to-end CNN & log-mel feature + CNN (Ours): 71.0
EnvNet can extract discriminative features for environmental sounds
Multispectral Segmentation & Detection
• Robust Segmentation & Detection
– Daytime and nighttime
– Fog
• Use of multispectral images
– RGB
– Far-infrared (FIR)
– Mid-infrared (MIR)
– Near-infrared (NIR)
[Ha+, IROS 2017][Karasawa+, submitted to ACM MM 2017]
(Figure: example RGB, MIR, NIR, and FIR images)
• How to combine multispectral images?
– Just concatenate them to a multi-channel image?
– Develop some sophisticated methods?
• To capture multispectral images
– Develop a novel single camera
– Use different cameras
Low cost, but many problems:
• Camera parameters
• Capture timings
• Ours: Multispectral Fusion Network
– Baseline: SegNet [Badrinarayanan+, PAMI 2017]
– Independent encoders for each camera
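The per-modality-encoder idea can be sketched as follows. This is only the data flow, under stated assumptions: a 2×2 mean pooling stands in for the SegNet encoder blocks, image sizes are placeholders, and the real network fuses learned feature maps, not raw pools.

```python
# One small "encoder" per spectral band; outputs are merged into a
# single volume that a shared decoder would turn into per-pixel labels.
import numpy as np

rng = np.random.default_rng(0)
bands = {m: rng.normal(size=(64, 64)) for m in ["RGB", "FIR", "MIR", "NIR"]}

def encode(img):
    # Placeholder encoder: 2x2 mean pooling instead of conv blocks
    return img.reshape(32, 2, 32, 2).mean(axis=(1, 3))

fused = np.stack([encode(img) for img in bands.values()])   # (4, 32, 32)
# A shared decoder would map the fused volume to segmentation scores
print(fused.shape)
```

Keeping the encoders independent avoids forcing one set of filters to serve bands with very different statistics, which is the drawback of simply concatenating the bands into one multi-channel input.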
Multispectral Segmentation & Detection [Ha+, IROS 2017][Karasawa+, submitted to ACM MM 2017]
Visual Question Answering (VQA)
Question answering system for
• Associated image
• Question by natural language
[Saito+, ICME 2017]
Q: Is it going to rain soon?
Ground Truth A: yes
Q: Why is there snow on one side of the stream and clear grass on the other?
Ground Truth A: shade
Visual Question Answering (VQA)
After integration into z_{I+Q}: usual classification
Image I → image feature x_I
Question Q (“What objects are found on the bed?”) → question feature x_Q
x_I and x_Q → integrated vector z_{I+Q}
Answer A: bed sheets, pillow
[Saito+, ICME 2017]
VQA = multi-class classification
Visual Question Answering (VQA)
Current advancement: improving how to integrate x_I and x_Q
• Concatenation: z_{I+Q} = [x_I; x_Q]
e.g.) [Antol+, ICCV 2015]
• Summation: z_{I+Q} = x_I + x_Q
e.g.) Image feature (with attention) + Question feature [Xu+Saenko, ECCV 2016]
• Multiplication: z_{I+Q} = x_I ⊙ x_Q
e.g.) Bilinear multiplication [Fukui+, EMNLP 2016]
• This work: DualNet combining sum, multiplication, and concatenation: z_{I+Q} = [x_I + x_Q; x_I ⊙ x_Q]
[Saito+, ICME 2017]
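The fusion operators compared above can be written out in a few lines. Feature dimensions are assumptions, and the final line is my reading of the slide's DualNet combination (sum and element-wise product, concatenated), not a line-for-line reproduction of the model.

```python
# The three classic fusions of an image feature x_I and a question
# feature x_Q, plus a DualNet-style combination of sum and product.
import numpy as np

rng = np.random.default_rng(0)
x_I = rng.normal(size=512)   # image feature (512-d is an assumption)
x_Q = rng.normal(size=512)   # question feature

z_concat = np.concatenate([x_I, x_Q])      # concatenation
z_sum = x_I + x_Q                          # summation
z_mult = x_I * x_Q                         # element-wise multiplication
z_dual = np.concatenate([z_sum, z_mult])   # sum & product, concatenated
print(z_concat.shape, z_sum.shape, z_mult.shape, z_dual.shape)
```

Whichever z_{I+Q} is chosen, it feeds a standard classifier over the answer vocabulary, which is why the slide reduces VQA to multi-class classification.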
Visual Question Answering (VQA)
VQA Challenge 2016 (in CVPR 2016)
Won 1st place on abstract images without an attention mechanism
[Saito+, ICME 2017]
Q: What fruit is yellow and brown?
A: banana
Q: How many screens are there?
A: 2
Q: What is the boy playing with?
A: teddy bear
Q: Are there any animals swimming in the pond?
A: no
Describe
Automatic Image Captioning [Ushiku+, ACM MM 2011]
Training Dataset
A woman posing
on a red scooter.
White and gray
kitten lying on
its side.
A white van
parked in an
empty lot.
A white cat rests
head on a stone.
Silver car parked
on side of road.
A small gray dog
on a leash.
A black dog
standing in a
grassy area.
A small white dog
wearing a flannel
warmer.
Input Image
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
Nearest Captions
A small white dog wearing a flannel warmer.
A small gray dog on a leash.
A black dog standing in a grassy area.
A small white dog standing on a leash.
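The retrieval flavor of the slide above can be sketched in a few lines: embed images, find the nearest training images, and reuse their captions. The random features and the caption strings are stand-ins; a real system would use learned image descriptors and then compose a new caption from the retrieved ones.

```python
# Nearest-caption retrieval: the query reuses captions of its
# nearest neighbors in feature space.
import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(8, 16))            # 8 "training images"
train_caps = [f"caption {i}" for i in range(8)]   # their captions
query = train_feats[3] + 0.01 * rng.normal(size=16)   # a near-duplicate

d = np.linalg.norm(train_feats - query, axis=1)   # distance to each image
nearest = np.argsort(d)[:3]                       # 3 nearest captions
print([train_caps[i] for i in nearest])
```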
Automatic Image Captioning [ACM MM 2012, ICCV 2015]
Group of people sitting
at a table with a dinner.
Tourists are standing on
the middle of a flat desert.
Image Captioning + Sentiment Terms [Shin+, BMVC 2016]
A confused man in a
blue shirt is sitting on a
bench.
A man in a blue shirt
and blue jeans is
standing in the
overlooked water.
A zebra standing in a
field with a tree in the
dirty background.
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
[Shin+, BMVC 2016]
The most probable noun is memorized
Image Captioning + Sentiment Terms
Two steps for adding a sentiment term
1. Usual image captioning using CNN+RNN
2. Force the model to predict a sentiment term just before that noun
[Shin+, BMVC 2016]
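The effect of step 2 can be illustrated with plain list manipulation. The real model re-decodes with the RNN under this constraint; the caption, noun, and sentiment term below are made-up stand-ins that only show where the constraint places the new word.

```python
# Toy view of step 2: insert a sentiment term right before the
# memorized most-probable noun of the step-1 caption.
caption = "a man in a blue shirt is sitting on a bench".split()
noun = "man"                 # most probable noun from step 1
sentiment = "confused"       # sentiment term chosen for the image

i = caption.index(noun)
caption_with_sentiment = caption[:i] + [sentiment] + caption[i:]
print(" ".join(caption_with_sentiment))
# → "a confused man in a blue shirt is sitting on a bench"
```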
Beyond Caption to Narrative
A man is holding a box of doughnuts.
Then he and a woman are standing next each other.
Then she is holding a plate of food.
[Shin+, ICIP 2016]
Beyond Caption to Narrative [Shin+, ICIP 2016]
(Figure: per-segment captions combined into one narrative)
A man is holding a box of doughnuts. / he and a woman are standing next each other. / she is holding a plate of food. → Narrative
Beyond Caption to Narrative
A boat is floating on the water near a mountain.
And a man riding a wave on top of a surfboard.
Then he on the surfboard in the water.
[Shin+, ICIP 2016]
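The surface pattern of the narratives above can be mimicked with simple connectives. This only illustrates the output format; the system generates the segment captions and their linking jointly, and the captions below are made-up examples.

```python
# Joining per-segment captions into a narrative with fixed connectives.
captions = [
    "a man is holding a box of doughnuts",
    "he and a woman are standing next to each other",
    "she is holding a plate of food",
]
connectives = [""] + ["then "] * (len(captions) - 1)
sentences = [(c + s).capitalize() + "." for c, s in zip(connectives, captions)]
narrative = " ".join(sentences)
print(narrative)
```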
Generate
Image Reconstruction [Kato+, CVPR 2014]
(Figure: local descriptors d_1, d_2, …, d_j, …, d_k, …, d_N modeled by a distribution p(d; θ); each image patch is rendered as x_j = f(θ_j); the image is labeled “Cat”)
Traditional pipeline for image classification:
Camera → extracting local descriptors → collecting descriptors → calculating global feature → classifying images
Image Reconstruction [Kato+, CVPR 2014]
(Figure: the same pipeline run in reverse)
Inverse problem: image reconstruction from a label (e.g. “Pot”)
Image Reconstruction [Kato+, CVPR 2014]
cat (bombay), camera, grand piano, headphone, joshua tree, pyramid, wheel chair, gramophone
Pot
Optimized arrangement using:
Global location cost + Adjacency cost
Other examples
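A toy version of the arrangement objective (global location cost + adjacency cost) can be written down directly. All costs, the 2×2 grid, and the brute-force solver are illustrative assumptions, not the paper's formulation.

```python
# Place 4 patches on a 2x2 grid, minimizing a global-location cost
# (each patch prefers a position) plus an adjacency cost (dissimilar
# neighbors are penalized).
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n = 4
pref = rng.uniform(0, 1, (n, 2))                 # preferred (row, col) per patch
sim = rng.uniform(0, 1, (n, n))
sim = (sim + sim.T) / 2                          # symmetric patch similarity
cells = [(r, c) for r in range(2) for c in range(2)]

def cost(order):
    # Global location cost: distance from each patch's preferred cell
    loc = sum(np.hypot(*(np.array(cells[i]) - pref[p]))
              for i, p in enumerate(order))
    # Adjacency cost: dissimilar patches in neighboring cells
    adj = sum((1 - sim[order[i], order[j]])
              for i in range(n) for j in range(n)
              if abs(cells[i][0] - cells[j][0]) + abs(cells[i][1] - cells[j][1]) == 1)
    return loc + adj

best = min(permutations(range(n)), key=cost)     # brute force over 4! orders
print(best, round(cost(best), 3))
```

At realistic patch counts the exhaustive search above is infeasible, so the actual method has to optimize the arrangement approximately.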
Video Generation
• Image generation is still challenging
Success is limited to controlled settings:
– Human faces
– Birds
– Flowers
• Video generation is …
– Additionally requiring temporal consistency
– Extremely challenging
[Yamamoto+, ACMMM 2016]
[Vondrick+, NIPS 2016]
BEGAN
[Berthelot+, 2017 Mar.]
StackGAN
[Zhang+, 2016 Dec.]
Video Generation
• This work: generating easy videos
– C3D (3D convolutional neural network)
for conditional generation with an input label
– tempCAE (temporal convolutional auto-encoder)
for regularizing video to improve its naturalness
[Yamamoto+, ACMMM 2016]
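The temporal-consistency requirement can be made concrete with an explicit frame-to-frame penalty. This is only the intuition behind the tempCAE regularizer, not the model itself: tempCAE is a convolutional auto-encoder, and the running-mean "smoothing" below is a crude stand-in for its effect.

```python
# Penalize frame-to-frame differences of a generated clip: a smoother
# clip scores lower, which is what a temporal regularizer encourages.
import numpy as np

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 32, 32))      # (frames, H, W), pure noise

def temporal_smoothness(clip):
    # Mean squared difference between consecutive frames
    return float(np.mean((clip[1:] - clip[:-1]) ** 2))

noisy = video
# Running mean over time as a stand-in for a temporally regularized clip
smoothed = np.cumsum(video, axis=0) / np.arange(1, 17)[:, None, None]
print(round(temporal_smoothness(noisy), 3),
      round(temporal_smoothness(smoothed), 3))
```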
Video Generation [Yamamoto+, ACMMM 2016]
Ours
(C3D+tempCAE)
Only C3D
Ours
(C3D+tempCAE)
Only C3D
Car runs to left
Rocket flies up
Conclusion
• MIL: Machine Intelligence Laboratory
Beyond Human Intelligence Based on Cyber-Physical Systems
• This talk introduces some of the current research
– Recognize
• Basic: Framework for DL, Domain Adaptation
• Classification: Single-modality, Multi-modalities
– Describe
• Image Captioning, Video Captioning
– Generate
• Image Reconstruction, Video Generation