The document discusses Abhishek Sharma's PhD defense talk on learning from multiple views of data. It presents an overview of his work on semantic segmentation for extracting visual features from images, on a recursive context propagation network for incorporating contextual information, and on constructing a common representation space for matching content across modalities such as images and text.
A Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos (ijtsrd)
The paper presents a novel algorithm for object classification in videos based on an improved support vector machine (SVM) and a genetic algorithm. One of the problems of the support vector machine is the selection of appropriate kernel parameters, which has limited the accuracy of the SVM over the years. This research optimizes the parameters of the SVM radial basis function kernel using the genetic algorithm. Moving-object classification is a requirement in smart visual surveillance systems, as it allows the system to know what kind of object is in the scene and to recognize the actions the object can perform. The paper presents a GA-SVM machine learning approach for real-time object classification in videos. Radial distance signal features are extracted from the silhouettes of objects detected in videos, normalized, and fed into the GA-SVM model. A classification rate of 99.39% is achieved with the genetically trained SVM, compared with 99.1% for the standard SVM. A comparison with other classifiers in terms of classification accuracy shows better performance than the standard SVM, artificial neural network (ANN), genetic artificial neural network (GANN), K-nearest neighbor (K-NN), and K-means classifiers. Akintola Kolawole G., "A Novel GA-SVM Model For Vehicles And Pedestrial Classification In Videos", International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1, Issue-4, June 2017. URL: http://www.ijtsrd.com/papers/ijtsrd109.pdf http://www.ijtsrd.com/computer-science/artificial-intelligence/109/a-novel-ga-svm-model-for-vehicles-and-pedestrial-classification-in-videos/akintola-kolawole-g
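A minimal sketch of the idea, assuming scikit-learn and a toy dataset: a small genetic algorithm searches log-scaled (C, gamma) values for the RBF kernel, with cross-validated accuracy as the fitness. The population size, mutation scale, and generation count are illustrative choices, not the paper's.

```python
# Hedged sketch: tuning SVM RBF parameters (C, gamma) with a simple genetic
# algorithm, in the spirit of the GA-SVM approach described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fitness(ind):
    C, gamma = 10.0 ** ind          # genes encode log10(C) and log10(gamma)
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_score(clf, X, y, cv=3).mean()

pop = rng.uniform(-3, 3, size=(20, 2))       # initial random population
for gen in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]  # select the fittest half
    kids = parents[rng.integers(0, 10, 10)] + rng.normal(0, 0.3, (10, 2))
    pop = np.vstack([parents, kids])         # elitism plus mutated offspring

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("best log10(C), log10(gamma):", best)
```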
We test whether modern computer-vision algorithms can predict, from users' eye-movement patterns, whether they are reading relevant information. The slides accompany the video presentation at https://youtu.be/ZebBgUhL-EU
The full research paper is available at https://dl.acm.org/doi/10.1145/3343413.3377960 and at https://arxiv.org/abs/2001.05152
Comparison Between Levenberg-Marquardt And Scaled Conjugate Gradient Training... (CSCJournals)
The document compares the Levenberg-Marquardt and Scaled Conjugate Gradient algorithms for training a multilayer perceptron neural network for image compression. It finds that the two algorithms performed comparably overall: the Levenberg-Marquardt algorithm achieved slightly better accuracy, as measured by average training accuracy and mean squared error, while the Scaled Conjugate Gradient algorithm trained faster, as measured by average training iterations. The study compresses the standard Lena test image with both algorithms and analyzes the results.
Time-series forecasting of indoor temperature using pre-trained Deep Neural N... (Francisco Zamora-Martinez)
Artificial neural networks have proved to be good at time-series forecasting problems and are widely studied in the literature. Traditionally, shallow architectures were used because of convergence problems when training deep models. Recent research findings enable the training of deep architectures, opening an interesting new research area called deep learning. This paper presents a study of deep learning techniques applied to time-series forecasting on a real indoor temperature forecasting task, examining performance under different hyper-parameter configurations. With deep models, better generalization performance on the test set and a reduction in over-fitting were observed.
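As a rough illustration of the setup, assuming a synthetic temperature-like series and scikit-learn's MLPRegressor: a sliding window of past readings feeds a network with several hidden layers. The window length and layer sizes stand in for the hyper-parameters the paper studies.

```python
# Hedged sketch: time-series forecasting with a moderately deep MLP over a
# sliding window of past values. The "indoor temperature" series is synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor

t = np.arange(2000)
series = 21 + 2 * np.sin(2 * np.pi * t / 96)          # daily-cycle toy signal
series += np.random.default_rng(0).normal(0, 0.1, t.size)

lag = 24                                               # window of past readings
X = np.lib.stride_tricks.sliding_window_view(series[:-1], lag)
y = series[lag:]                                       # next value to predict

split = 1500
model = MLPRegressor(hidden_layer_sizes=(64, 64, 64),  # "deep" vs. one layer
                     max_iter=500, random_state=0)
model.fit(X[:split], y[:split])
print("test R^2:", model.score(X[split:], y[split:]))
```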
Incorporating Kalman Filter in the Optimization of Quantum Neural Network Par... (Waqas Tariq)
The Kalman filter has been used to estimate the instantaneous states of linear dynamic systems and is a good tool for inferring missing information from noisy measurements. The quantum neural network is another approach to merging fuzzy logic with neural networks, achieved by applying quantum mechanics theory to the structure of the neural network. The gradient descent algorithm has been widely used to train neural networks, but its tendency to become trapped in local minima is one of its disadvantages. This paper presents an algorithm to train the quantum neural network using the extended Kalman filter.
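The paper's target is a quantum neural network; the sketch below only illustrates the underlying extended-Kalman-filter update on a tiny classical one-neuron model, treating the weights as the filter state. The noise covariances and target function are assumptions.

```python
# Hedged sketch: extended-Kalman-filter training of a one-neuron tanh model.
# EKF state = weight vector; the measurement is the network output, and the
# Jacobian of the output with respect to the weights plays the role of H.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y_true = np.tanh(X @ np.array([1.5, -0.8]))    # target function (assumed)

w = np.zeros(2)                 # weights treated as the EKF state
P = np.eye(2) * 10.0            # state covariance
Q, R = 1e-5 * np.eye(2), 0.1    # process / measurement noise (tuning choices)

for x, y in zip(X, y_true):
    pred = np.tanh(x @ w)
    H = ((1 - pred ** 2) * x).reshape(1, 2)    # Jacobian d(pred)/d(w)
    S = H @ P @ H.T + R                        # innovation covariance
    K = P @ H.T / S                            # Kalman gain (S is 1x1 here)
    w = w + (K * (y - pred)).ravel()           # state (weight) update
    P = P - K @ H @ P + Q                      # covariance update

print("recovered weights:", w)   # should approach [1.5, -0.8]
```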
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different
classes. On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of five convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a final 1000-way softmax. To make training
faster, we used non-saturating neurons and a very efficient GPU implementation
of the convolution operation. To reduce overfitting in the fully-connected
layers we employed a recently-developed regularization method called “dropout”
that proved to be very effective. We also entered a variant of this model in the
ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%,
compared to 26.2% achieved by the second-best entry.
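A compact PyTorch rendering of the described layout (five convolutional layers, three fully-connected layers with dropout, 1000-way output). Local response normalization and the original two-GPU grouping are omitted, so this is a sketch rather than a faithful reproduction.

```python
# Hedged sketch of the five-conv / three-FC architecture described above.
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),   # 1000-way softmax applied in the loss
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

out = AlexNetSketch()(torch.randn(1, 3, 227, 227))
print(out.shape)   # torch.Size([1, 1000])
```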
The document discusses image recognition using convolutional neural networks (CNNs). It explains that CNNs consist of multiple layers of small neuron collections that look at small portions of an input image called receptive fields. The results are tiled to overlap and represent the original image better. CNNs learn filters through training rather than relying on hand-engineered features. Convolution involves calculating the overlap between functions as one is translated, and is used in CNNs to identify patterns across translated versions of inputs like images. Pointwise nonlinearities are applied between CNN layers to introduce nonlinearity.
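A direct numpy illustration of that idea: a small filter is translated across the image, and each output value is the overlap (dot product) with one receptive field, followed by a pointwise ReLU. The Sobel-like kernel below stands in for a learned filter.

```python
# Hedged sketch: convolution as "overlap while translating", computed directly.
import numpy as np

image = np.random.default_rng(0).random((8, 8))
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)   # Sobel-like edge filter

h, w = image.shape
k = kernel.shape[0]
out = np.zeros((h - k + 1, w - k + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        receptive_field = image[i:i + k, j:j + k]      # one small image patch
        out[i, j] = np.sum(receptive_field * kernel)   # overlap at this shift

out = np.maximum(out, 0)   # pointwise nonlinearity between layers (ReLU)
print(out.shape)           # (6, 6) feature map
```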
A Time Series ANN Approach for Weather Forecasting (ijctcm)
Weather forecasting is one of the most challenging problems around the world, both for its experimental value in meteorology and as a typical unbiased time-series forecasting problem in scientific research. Many methods have been proposed by various scientists, with the aim of predicting more accurately. This paper contributes to the same goal using an artificial neural network (ANN), simulated in MATLAB, to predict two important weather parameters: maximum and minimum temperature. The model was trained on 60 years of real data (1901-1960) and tested over the following 40 years to forecast maximum and minimum temperature. The results, based on the mean squared error (MSE), confirm that this multilayer-perceptron model has the potential for successful application to weather forecasting.
A SURVEY OF SPIKING NEURAL NETWORKS AND SUPPORT VECTOR MACHINE PERFORMANCE BY... (ijdms)
This document summarizes research on parallelizing spiking neural networks (SNNs) and support vector machines (SVMs) using GPUs. It reviews related work applying SNNs and SVMs to tasks like classification and pattern recognition. SNNs and SVMs can benefit from parallelization but require different approaches due to their computational characteristics. SNNs are better suited to FPGAs than GPUs due to their non-linear learning equations. SVMs parallelize well on GPUs by solving matrix operations in parallel. The document discusses factors to consider in parallelizing SNNs and SVMs, such as hardware limitations and memory requirements.
Optimization of Number of Neurons in the Hidden Layer in Feed Forward Neural ... (IJERA Editor)
The architecture of an Artificial Neural Network (ANN) is based on the problem domain; it is fixed during the training phase on sample data and used to infer results for the remaining data in the testing phase. Normally the architecture consists of three layers: an input layer with one node per known input value, a hidden layer, and an output layer whose nodes carry the results computed from the input and hidden nodes. The number of nodes in the hidden layer is usually decided heuristically, so that an optimum value is obtained within a reasonable number of iterations while the other parameters keep their default values. This study focuses on Cascade-Correlation Neural Networks (CCNN) using the Back-Propagation (BP) algorithm, which determines the number of neurons during the training phase itself by appending one neuron per iteration until the error condition is satisfied, and gives a promising result on the optimum number of hidden-layer neurons.
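A loose sketch of the growing idea, assuming scikit-learn: hidden neurons are added one at a time and training repeats until the validation error stops improving. Real cascade-correlation freezes existing weights and trains each new unit on the residual error, which this simplified loop does not do.

```python
# Hedged sketch: selecting the hidden-layer size by growing it one neuron at
# a time until the error condition (no further validation improvement) holds.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])          # toy target function
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_mse, best_n = np.inf, 0
for n_hidden in range(1, 31):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                       random_state=0).fit(X_tr, y_tr)
    mse = np.mean((net.predict(X_val) - y_val) ** 2)
    if mse < best_mse - 1e-4:          # error condition: meaningful improvement
        best_mse, best_n = mse, n_hidden
    elif n_hidden - best_n >= 5:       # patience: stop after 5 stagnant sizes
        break

print(f"selected {best_n} hidden neurons, validation MSE {best_mse:.4f}")
```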
Automatic time series forecasting using nonlinear autoregressive neural netwo... (journalBEEI)
This study aims to develop an automatic forecasting method for univariate time series using the nonlinear autoregressive neural network model with exogenous input (NARX). In this automatic setting, users only need to supply the input time series; an automatic forecasting algorithm then sets up the appropriate features, estimates the parameters of the model, and calculates forecasts without user intervention. The method includes preprocessing, tests for trends, and the application of first differences. The series were also tested for seasonality, with seasonal differences taken where the analysis indicated them, and were linearly scaled to [−1, +1]. The autoregressive lags and hidden neurons were selected through stepwise and optimization algorithms, respectively. Twenty NARX models were fitted with different random starting weights, and their forecasts were combined using an ensemble operator to obtain the final forecast. The proposed method was applied to real data, and its performance was compared with several available automatic models in the literature. Forecasting accuracy was measured by mean squared error (MSE) and mean absolute percent error (MAPE), and the results showed that the proposed method outperformed the other automatic models.
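A condensed sketch of that pipeline on toy data: first differences, linear scaling to [−1, +1], lagged inputs, 20 networks with different random starting weights, and a median as the ensemble operator. The paper's stepwise lag selection and exact ensemble operator are not reproduced here.

```python
# Hedged sketch of the automatic NARX-style pipeline described above.
import numpy as np
from sklearn.neural_network import MLPRegressor

series = np.cumsum(np.random.default_rng(0).normal(0.1, 1.0, 500))  # trended toy data

diff = np.diff(series)                          # first differences remove trend
lo, hi = diff.min(), diff.max()
scaled = 2 * (diff - lo) / (hi - lo) - 1        # linear scaling to [-1, +1]

lags = 12                                       # assumed lag order
X = np.lib.stride_tricks.sliding_window_view(scaled[:-1], lags)
y = scaled[lags:]

preds = []
for seed in range(20):                          # 20 models, random starting weights
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=1000,
                       random_state=seed).fit(X, y)
    preds.append(net.predict(scaled[-lags:].reshape(1, -1))[0])

step = np.median(preds)                                     # ensemble operator
forecast = series[-1] + ((step + 1) / 2 * (hi - lo) + lo)   # invert scaling + diff
print("one-step forecast:", forecast)
```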
Simulation of Single and Multilayer of Artificial Neural Network using Verilog (ijsrd.com)
This document discusses the simulation of single layer and multilayer artificial neural networks using Verilog. It begins with an introduction to artificial neural networks and their application in VLSI circuit fault diagnosis. It then provides details on the algorithm and design methodology for simulating a single layer neural network to model an AND gate, showing the calculation of error over iterations in Matlab and time taken using Verilog code. For a multilayer network modeling an XOR gate, it similarly discusses the backpropagation algorithm, showing error reduction over iterations in Matlab and time taken using Verilog. It concludes that neural networks can help minimize time to find faults in digital circuits.
This document presents research using artificial neural networks to identify toxic gases in real time. A multi-layer perceptron neural network was trained using data from a multi-sensor system that detected hydrogen sulfide, nitrogen dioxide, and their mixture. Features extracted from the sensor responses were used as inputs to the neural network. The network was trained online using backpropagation and achieved 100% accuracy classifying gases during training and 96.6% accuracy during testing, with low error rates. This model achieved better performance than previous methods and can identify low concentrations of toxic gases in real time, which has applications for air quality monitoring and safety.
Improving of artificial neural networks performance by using gpu's: a survey (csandit)
In this paper we study improving the performance of Artificial Neural Networks (ANN) by using parallel programming on GPU or FPGA architectures. It is well known that ANNs can be parallelized according to particular characteristics of the training algorithm. We discuss both approaches, software (GPU) and hardware (FPGA), and several training strategies: the perceptron training unit, Support Vector Machines (SVM), and Spiking Neural Networks (SNN). The approaches are evaluated by training speed and performance. The surveyed algorithms were coded by their authors on hardware such as Nvidia cards, FPGAs, or sequential circuits, depending on the methodology used, to compare learning time between GPU and CPU. The main applications were in pattern recognition, such as acoustic speech, odor, and clustering. According to the literature, the GPU has a great advantage over the CPU in learning time, except when rendering of images is involved, across several architectures of Nvidia cards and CPUs. The survey also includes a brief description of the types of ANN and their execution techniques, related to the research results.
IMPROVING OF ARTIFICIAL NEURAL NETWORKS PERFORMANCE BY USING GPU’S: A SURVEY (csandit)
This document provides a survey of improving the performance of artificial neural networks (ANNs) through parallel programming on GPUs. It discusses different ANN training strategies that can be parallelized, such as perceptrons, support vector machines, and spiking neural networks. GPUs provide significant speed advantages over CPUs for ANN training. The document reviews various studies that have implemented ANNs using GPUs and FPGAs, finding that GPUs reduce training time compared to CPUs, especially for algorithms involving large matrix operations like support vector machines. Spiking neural networks are better suited to FPGAs or custom circuits due to their complex temporal dynamics. The document concludes that GPUs are generally the best approach for ANN parallelization.
This document summarizes a study that used artificial neural networks (ANNs) and the Multi-Layer Perceptron model (MLP) to predict the bearing capacities of steel driven piles in sandy soils. The ANN was trained on data from full-scale pile load tests, including pile length, diameter, soil elastic modulus, and soil friction angle as inputs. The output was pile bearing capacity. The study examined factors for effective ANN behavior, trained and tested the network, and analyzed the sensitivity of the inputs on the output capacity prediction.
INVESTIGATIONS OF THE INFLUENCES OF A CNN’S RECEPTIVE FIELD ON SEGMENTATION O... (adeij1)
Segmentation of objects of various sizes has received relatively little attention in medical imaging and has been very challenging in computer vision tasks in general. We hypothesize that the receptive field of a deep model corresponds closely to the size of the object to be segmented, which could critically influence the segmentation accuracy of objects of varied sizes. In this study, we employed “AmygNet”, a dual-branch fully convolutional neural network (FCNN) with two different sizes of receptive fields, to investigate the effects of the receptive field on segmenting four major subnuclei of the bilateral amygdalae. The experiment was conducted on 14 subjects, all 3-dimensional MRI human brain images. Since the scales of the different subnuclear groups differ, investigating the accuracy for each subnuclear group under receptive fields of various sizes may reveal which receptive-field size suits objects of which scale. Under the given conditions, AmygNet with multiple receptive fields shows great potential in segmenting objects of different sizes.
Rainfall Prediction using Data-Core Based Fuzzy Min-Max Neural Network for Cl... (IJERA Editor)
This paper proposes a rainfall prediction system based on a classification technique. An advanced, modified neural network called the Data-Core-Based Fuzzy Min-Max Neural Network (DCFMNN) is used for pattern classification and applied to predict rainfall. The fuzzy min-max neural network (FMNN), which creates hyperboxes for classification and prediction, has a problem of overlapping neurons that is resolved in DCFMNN to give greater accuracy. The system consists of hyperbox formation, two kinds of neurons (overlapping neurons and classifying neurons), and classification used for prediction. For each hyperbox, its data core and the geometric center of its data are calculated. The advantages of this method are high accuracy and strong robustness. According to the evaluation results, the system gives better rainfall prediction and serves as a classification tool in real environments.
Hybrid neural networks for time series learning by Tian Guo, EPFL, Switzerland (EuroIoTa)
Time series are prevalent in the IoT environment and are used for monitoring the evolving behavior of entities or objects over time. Analyzing and mining such time-series data reveals insightful long-term and instantaneous information behind the data, e.g., trends, events, correlations, and causality.
Inspired by the recent successes of neural networks, in this talk we present a novel end-to-end hybrid neural network for learning the local and global contextual features of time series. In particular, we explore the idea of hybrid neural networks in a specific time-series learning problem, namely learning the local trend of time series. Local trends characterize the intermediate upward and downward patterns of a time series. Learning and forecasting local trends play an important role in many real applications, from investing in the stock market to resource allocation in data centers and load scheduling in the smart grid. We propose TreNet, a hybrid neural network which leverages convolutional neural networks (CNNs) to extract salient features from the raw local data of a time series and a long short-term memory recurrent neural network (LSTM) to capture the dependency among local trends. Preliminary experimental results on real datasets demonstrate the superiority of TreNet over conventional CNN, LSTM, and HMM methods and various kernel-based baselines.
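A minimal PyTorch sketch of the TreNet-style hybrid: a 1-D CNN branch for the raw local window, an LSTM branch over the preceding (duration, slope) trend pairs, and a head over the concatenated features. The feature sizes and concatenation-based fusion are assumptions rather than the paper's exact design.

```python
# Hedged sketch of a CNN + LSTM hybrid for local-trend prediction.
import torch
import torch.nn as nn

class TreNetSketch(nn.Module):
    def __init__(self, trend_dim=2, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # local raw-data branch
            nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
        )
        self.lstm = nn.LSTM(trend_dim, hidden, batch_first=True)  # trend branch
        self.head = nn.Linear(32 + hidden, trend_dim)   # predict (duration, slope)

    def forward(self, raw, trends):
        local = self.cnn(raw).squeeze(-1)               # (B, 32) local features
        _, (h, _) = self.lstm(trends)                   # trend-sequence state
        return self.head(torch.cat([local, h[-1]], dim=1))

model = TreNetSketch()
raw = torch.randn(4, 1, 32)        # raw data of the current local window
trends = torch.randn(4, 10, 2)     # last 10 (duration, slope) trend pairs
print(model(raw, trends).shape)    # torch.Size([4, 2])
```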
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A... (WithTheBest)
This presentation illustrates distinct statistical and machine learning approaches to automated recognition of major brain tissues in 3D brain MRI.
Nataliya Portman, Postdoctoral Fellow Faculty of Science, UOIT, Oshawa, ON Canada
PhD in Applied Mathematics, University of Waterloo | Postdoctoral Research on Brain MRI Segmentation, Neuro | Current: Applied Machine Learning in Materials Science, University of Ontario Institute of Technology
Task Adaptive Neural Network Search with Meta-Contrastive Learning (MLAI2)
Most conventional Neural Architecture Search (NAS) approaches are limited in that they only generate architectures without searching for the optimal parameters. While some NAS methods handle this issue by utilizing a supernet trained on a large-scale dataset such as ImageNet, they may be suboptimal if the target tasks are highly dissimilar from the dataset the supernet is trained on. To address such limitations, we introduce a novel problem of Neural Network Search (NNS), whose goal is to search for the optimal pretrained network for a novel dataset and constraints (e.g. number of parameters), from a model zoo. Then, we propose a novel framework to tackle the problem, namely Task-Adaptive Neural Network Search (TANS). Given a model-zoo that consists of networks pretrained on diverse datasets, we use a novel amortized meta-learning framework to learn a cross-modal latent space with contrastive loss, to maximize the similarity between a dataset and a high-performing network on it, and minimize the similarity between irrelevant dataset-network pairs. We validate the effectiveness and efficiency of our method on ten real-world datasets, against existing NAS/AutoML baselines. The results show that our method instantly retrieves networks that outperform models obtained with the baselines with significantly fewer training steps to reach the target performance, thus minimizing the total cost of obtaining a task-optimal network. Our code and the model-zoo are available at https://anonymous.4open.science/r/TANS-33D6
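A small sketch of the contrastive objective described there, assuming an InfoNCE-style formulation: matched dataset-network embedding pairs sit on the diagonal of a similarity matrix and are pulled together, while off-diagonal (irrelevant) pairs are pushed apart. The encoders and temperature value are placeholders.

```python
# Hedged sketch: cross-modal contrastive loss between dataset embeddings and
# embeddings of the networks that perform well on them.
import torch
import torch.nn.functional as F

def cross_modal_contrastive(dataset_emb, network_emb, temperature=0.1):
    d = F.normalize(dataset_emb, dim=1)     # (B, D) dataset embeddings
    n = F.normalize(network_emb, dim=1)     # (B, D) matching network embeddings
    logits = d @ n.t() / temperature        # similarity of all B x B pairs
    targets = torch.arange(d.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Random stand-ins for the outputs of the two (omitted) encoders:
loss = cross_modal_contrastive(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss))
```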
Types of Machine Learning Algorithms (CART, ID3) (Fatimakhan325)
The document summarizes several machine learning algorithms used for data mining:
- Decision trees use nodes and edges to iteratively divide data into groups for classification or prediction.
- Naive Bayes classifiers use Bayes' theorem for text classification, spam filtering, and sentiment analysis due to their multi-class prediction abilities.
- K-nearest neighbors algorithms find the closest K data points to make predictions for classification or regression problems.
- ID3, CART, and k-means clustering are also summarized highlighting their uses, advantages, and disadvantages.
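For a concrete feel of how the listed families behave, a short scikit-learn comparison on a toy dataset; the tree depth and K value are illustrative choices.

```python
# Hedged sketch: three of the summarized classifier families side by side.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for name, clf in [("decision tree", DecisionTreeClassifier(max_depth=3)),
                  ("naive Bayes", GaussianNB()),
                  ("k-NN (K=5)", KNeighborsClassifier(n_neighbors=5))]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```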
This paper presents a study using an artificial neural network (ANN) for load forecasting in the smart grid. Specifically, it uses a backpropagation network to forecast electricity load in Ontario, Canada based on weather and other input data. The paper describes collecting hourly load and weather data over two years, normalizing the data, creating a three-layer backpropagation network with different numbers of neurons, training the network using two algorithms, and testing the network on a separate data set to analyze forecast accuracy. The results show the ANN approach is able to accurately forecast electricity load based on the input factors.
In this paper, a new steganography algorithm is proposed to strengthen the security of data hiding and to increase the payload. The algorithm is based on four safety layers. The first layer compresses and encrypts the confidential message using set partitioning in hierarchical trees (SPIHT) and the Advanced Encryption Standard (AES), respectively. In the second layer, an irregular image segmentation (IIS) algorithm is applied to the cover image (Ic), based on adaptive reallocation of segment edges (ARSE) using an adaptive finite-element method (AFEM) to solve the proposed partial differential equation (PDE) numerically. In the third layer, an intelligent computing technique using a hybrid adaptive neural network with a modified ant colony optimizer (ANN_MACO) constructs a learning system; this system takes input from a support vector machine (SVM) that generates input patterns as byte-attribute features and produces new features to modify the cover image. The main innovation of the proposed steganography algorithm is applied in the fourth safety layer, which is robust enough to hide a large confidential message, up to six bits per pixel (bpp), in color images. The hiding algorithm resists statistical and visual attacks with high imperceptibility of the data hidden in the stego-images (Is). The experimental results are discussed and compared with previous steganography algorithms, demonstrating that the proposed algorithm significantly improves the security level of steganography by making it an arduous task to retrieve the embedded confidential message from color images.
Machine Learning Algorithms for Image Classification of Hand Digits and Face ... (IRJET Journal)
This document discusses machine learning algorithms for image classification using five different classification schemes. It summarizes the mathematical models behind each classification algorithm, including the Nearest Class Centroid classifier, Nearest Sub-Class Centroid classifier, k-Nearest Neighbor classifier, Perceptron trained using Backpropagation, and Perceptron trained using Mean Squared Error. It also describes the two datasets used in the experiments: the MNIST dataset of handwritten digits and the ORL face recognition dataset. The performance of the five classification schemes is compared on these datasets.
This document summarizes principal component analysis (PCA) and its application to face recognition. PCA is a technique used to reduce the dimensionality of large datasets while retaining the variations present in the dataset. It works by transforming the dataset into a new coordinate system where the greatest variance lies on the first coordinate (principal component), second greatest variance on the second coordinate, and so on. The document discusses how PCA can be used for face recognition by applying it to image datasets of faces. It reduces the dimensionality of the image data while preserving the key information needed to distinguish different faces. Experimental results show PCA provides reasonably accurate face recognition with low error rates.
The document discusses principal component analysis (PCA) and linear discriminant analysis (LDA) for dimensionality reduction in pattern recognition and their application to face recognition. PCA finds the directions along which the data varies the most to reduce dimensionality while retaining variation. LDA seeks directions that maximize between-class variation and minimize within-class variation. Studies show LDA performs better than PCA for classification when the training set is large and representative of each class.
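A short numpy sketch of the PCA step both summaries rely on: center the data, take the SVD, and project onto the leading components (eigenfaces, when rows are flattened face images). The random matrix stands in for real image data, and LDA is not shown.

```python
# Hedged sketch: PCA via SVD for dimensionality reduction, eigenface-style.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 64 * 64))            # 100 "images", 4096 pixels each

mean = X.mean(axis=0)
Xc = X - mean                             # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:50]                      # top 50 directions of greatest variance

Z = Xc @ components.T                     # 4096-D -> 50-D representation
X_rec = Z @ components + mean             # approximate reconstruction
print(Z.shape, np.linalg.norm(X - X_rec) / np.linalg.norm(X))
```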
Artificial Intelligence Applications in Petroleum Engineering - Part I (Ramez Abdalla, M.Sc)
This document discusses applications of artificial intelligence, specifically artificial neural networks and genetic algorithms, in petroleum engineering. It provides an overview of neural networks in OnePetro papers and describes the basic concepts and training processes of neural networks and genetic algorithms. It then discusses various applications of these techniques in reservoir engineering, production technologies, and oil-well drilling, including reservoir characterization, modeling, well test analysis, permeability prediction, production monitoring, drilling optimization, and more. The presentation aims to explore these applications in more depth.
The history of self-driving cars began in the 1930s with conceptual designs and progressed through the 20th century with early prototypes. Significant milestones include RCA Labs building a guided miniature car in the 1950s, vision-guided robotic vans achieving highway speeds in the 1980s, and USDOT demonstrations of automated highway systems in the 1990s. Development continued through military efforts in the 2000s and 2010s with increasing capabilities and testing of commercial applications such as mining haulage systems.
Super resolution in deep learning era - Jaejun Yoo (JaeJun Yoo)
1) The document discusses super-resolution techniques in deep learning, including inverse problems, image restoration problems, and different deep learning models.
2) Early models like SRCNN used convolutional networks for super-resolution but were shallow, while later models incorporated residual learning (VDSR), recursive learning (DRCN), and became very deep and dense (SRResNet).
3) Key developments included EDSR which provided a strong backbone model and GAN-based approaches like SRGAN which aimed to generate more realistic textures but require new evaluation metrics.
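A toy PyTorch sketch of the residual-learning idea behind VDSR, mentioned in point 2: the network predicts only the difference between the interpolated low-resolution input and the high-resolution target, with a global skip connection adding the input back. The depth and width here are far smaller than in the real model.

```python
# Hedged sketch: VDSR-style residual learning for super-resolution.
import torch
import torch.nn as nn

class ResidualSRSketch(nn.Module):
    def __init__(self, depth=8, channels=32):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):                 # x: bicubic-upscaled LR image
        return x + self.body(x)           # global skip: learn the residual only

lr_up = torch.randn(1, 1, 64, 64)
print(ResidualSRSketch()(lr_up).shape)    # torch.Size([1, 1, 64, 64])
```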
Machine learning in science and industry — day 4 (arogozhnikov)
- tabular data approach to machine learning and when it didn't work
- convolutional neural networks and their application
- deep learning: history and today
- generative adversarial networks
- finding optimal hyperparameters
- joint embeddings
1. The document describes using a deep neural network to detect changes between two SAR images by preclassifying the images, training the neural network on selected samples, and analyzing the results.
2. A similarity matrix and variance matrix are calculated during preclassification to identify and jointly label similar pixels, while different pixels are labeled separately. Good samples are selected to train the neural network.
3. The neural network is tested on images with different types and levels of noise and performs well at change detection, with performance increasing as noise decreases. Future work could focus on accelerating the training process.
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –... (IRJET Journal)
This document summarizes research on developing parallel algorithms to optimize solving the longest common subsequence (LCS) problem. LCS is commonly used for sequence comparison in bioinformatics. Traditional sequential dynamic programming algorithms have complexity of O(mn) for sequences of lengths m and n. The document reviews parallel algorithms developed using tools like OpenMP and GPUs like CUDA to reduce computation time. It proposes the authors' own optimized parallel algorithm for multi-core CPUs using OpenMP.
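For reference, the O(mn) sequential dynamic program that the surveyed parallel versions accelerate; cells on the same anti-diagonal of the table are mutually independent, which is the parallelism OpenMP and CUDA implementations exploit.

```python
# Hedged sketch: the classic O(mn) LCS dynamic-programming recurrence.
def lcs_length(a: str, b: str) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j] = LCS of prefixes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1          # characters match
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("ACCGGTCG", "GTCGTTCG"))   # length of their LCS
```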
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo... (MLconf)
Graph Representation Learning with Deep Embedding Approach:
Graphs are commonly used data structure for representing the real-world relationships, e.g., molecular structure, knowledge graphs, social and communication networks. The effective encoding of graphical information is essential to the success of such applications. In this talk I’ll first describe a general deep learning framework, namely structure2vec, for end to end graph feature representation learning. Then I’ll present the direct application of this model on graph problems on different scales, including community detection and molecule graph classification/regression. We then extend the embedding idea to temporal evolving user-product interaction graph for recommendation. Finally I’ll present our latest work on leveraging the reinforcement learning technique for graph combinatorial optimization, including vertex cover problem for social influence maximization and traveling salesman problem for scheduling management.
The document summarizes Yan Xu's upcoming presentation at the Houston Machine Learning Meetup on dimension reduction techniques. Yan will cover linear methods like PCA and nonlinear methods such as ISOMAP, LLE, and t-SNE. She will explain how these methods work, including preserving variance with PCA, using geodesic distances with ISOMAP, and modeling local neighborhoods with LLE and t-SNE. Yan will also demonstrate these methods on a dataset of handwritten digits. The meetup is part of a broader roadmap of machine learning topics that will be covered in future sessions.
Targeted Visual Content Recognition Using Multi-Layer Perceptron Neural Network (ijceronline)
Visual content recognition has become an attractive research field in computer vision and machine learning over the last few decades. The focus of this work is monument recognition. Images of significant locations, captured and maintained in databases, can be consulted by travelers before visiting the places; for instance, they can use images of a famous building to obtain its description. In all these applications, visual content recognition plays a key role. Humans can learn the contents of images and quickly identify them on seeing them again. In this paper we present a constructive training algorithm for a Multi-Layer Perceptron Neural Network (MLPNN) applied to a set of targeted object recognition applications. The target set consists of famous monuments in India for travel-guide applications. The training data set (TDS) consists of 3000 images, from which Gist features are extracted and given to the neural network during the training phase. The mean squared error (MSE) on the training data is computed and used as the metric to adjust the weights of the neural network, using the back-propagation algorithm. In the constructive learning, if the MSE is less than a predefined value, the number of hidden neurons is increased. Input patterns are trained incrementally until all patterns of the TDS have been presented and learned. The weights obtained during the training phase are used in the testing phase, in which new untrained images are given to the neural network for recognition. If a test image is recognized, its details are also displayed. The accuracy of this method is found to be 95%.
Trackster Pruning at the CMS High-Granularity CalorimeterYousef Fadila
The document discusses approaches for assigning weights to layer clusters in Tracksters to indicate the likelihood of belonging to the same particle or being contaminated. The goal is to develop reproducible code, port a trained model to C, and provide a final report and presentation. Various data representations and machine learning methods are explored, including layer-cluster level, extended layer-cluster level, sequence representations using LSTM and CNN, and graph representations using GCN and adaptive sampling. Performance is evaluated on classification of purity levels. Extended layer-cluster and sequence representations showed improved performance over the basic layer-cluster approach. Notebooks containing the code are described in an appendix.
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcscpconf
This document summarizes a research paper that proposes a modified version of Steering Kernel Regression called Median Based Parallel Steering Kernel Regression for improving image reconstruction. The key points are:
1. The proposed algorithm addresses two drawbacks of the original Steering Kernel Regression technique by implementing it in parallel on GPUs and multi-cores to improve computational efficiency, and using a median filter to suppress spurious edges in the output.
2. Experimental results show the proposed algorithm achieves a speedup of 21x using GPUs and 6x using multi-cores compared to serial implementation, while maintaining comparable reconstruction quality as measured by RMSE.
3. The algorithm is implemented iteratively, applying the median filter after each iteration.
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcsandit
Image reconstruction is the process of obtaining the original image from corrupted data. Applications of image reconstruction include computer tomography, radar imaging, weather forecasting, etc. Recently, the steering kernel regression method has been applied to image reconstruction [1]. There are two major drawbacks in this technique. Firstly, it is computationally intensive. Secondly, the output of the algorithm suffers from spurious edges (especially in the case of denoising). We propose a modified version of Steering Kernel Regression called the Median Based Parallel Steering Kernel Regression Technique. In the proposed algorithm, the first problem is overcome by implementing it on GPUs and multi-cores. The second problem is addressed by a gradient-based suppression scheme that uses a median filter. Our algorithm gives better output than Steering Kernel Regression; the results are compared using Root Mean Square Error (RMSE). Our algorithm also achieves a speedup of 21x on GPUs and 6x on multi-cores.
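As a generic illustration of the median-filtering step (not the authors' exact gradient-based suppression, which is applied after each regression iteration), the following sketch removes impulse noise with a 3×3 median window.

# Median filtering suppresses impulsive artifacts while preserving edges.
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
img = np.tile(np.linspace(0, 1, 64), (64, 1))          # smooth ramp image
noisy = img.copy()
spikes = rng.random(img.shape) < 0.05                  # 5% impulse noise
noisy[spikes] = rng.choice([0.0, 1.0], size=spikes.sum())

cleaned = median_filter(noisy, size=3)                 # 3x3 median window
print(float(np.abs(noisy - img).mean()), float(np.abs(cleaned - img).mean()))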
Data driven model optimization [autosaved]Russell Jarvis
Russell Jarvis is developing a general purpose optimizer called NeuronUnit to fit abstract neural models to the firing dynamics of specific biological neurons. As a proof of concept, he is using NeuronUnit to fit the Izhikevich model to a murine layer 5 neocortex pyramidal neuron. He discusses using virtual electrophysiology experiments in NeuronUnit along with real neuron recordings from the Allen Brain Atlas to derive error functions that guide the optimization of the model parameters to replicate the biological neuron.
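The Izhikevich model being fitted is compact enough to sketch directly; the following Euler integration uses standard regular-spiking parameters and is not NeuronUnit code.

import numpy as np

# Euler integration of the Izhikevich model (regular-spiking parameters):
#   dv/dt = 0.04 v^2 + 5 v + 140 - u + I
#   du/dt = a (b v - u);  spike at v >= 30 mV -> v := c, u := u + d
def izhikevich(I=10.0, a=0.02, b=0.2, c=-65.0, d=8.0, T=1000.0, dt=0.5):
    v, u = c, b * c
    spikes = []
    for step in range(int(T / dt)):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:                 # spike: record time and reset
            spikes.append(step * dt)
            v, u = c, u + d
    return spikes

print(len(izhikevich()))  # spike count over 1 s of constant input current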
This document summarizes research on enhancing gene expression programming (GEP) for Reynolds-averaged Navier-Stokes equations turbulence modeling with unsupervised clustering. It presents a GEP-enhanced multi-model framework that uses feature selection, dimensionality reduction, and clustering to assign different turbulence models to distinct regions of a flow, improving simulation accuracy. Results show the approach produces more accurate mean velocities and Reynolds stresses for a body-of-revolution testcase compared to baseline and GEP-driven models. Ongoing work includes optimizing the framework configuration and extending it to 3D domains.
1. Learning From Multiple Views of Data
PhD Defense talk of Abhishek Sharma
Collaborators
David W. Jacobs, Larry S. Davis, Hal Daumé III, Oncel Tuzel, Ming-Yu Liu, Abhishek Kumar, Jonghyun Choi, Murad Al Haj, Sanja Fidler and Angjoo Kanazawa
2. Overview
1. Introduction
PART - I
1. Content Extraction
   1. Semantic segmentation as a visual feature
   2. Contextual information
   3. Neural network model
PART - II
1. Cross-modal content matching
   1. Challenges
   2. PLS-based common representation
   3. Generalized Multi-view Analysis
2. Future Directions
3. Match image and sentence
Image courtesy – UIUC sentence-Image dataset: http://vision.cs.uiuc.edu/pascal-sentences/
[Figure: a text view (“Two parked jet airplanes facing opposite directions”) and an image view are both mapped into a canonical/common view]
4. Find the image based on a sentence
Two parked jet airplanes facing opposite directions
8. A simple computer-based matching of sentence and image
1. Task understanding
2. Content from text and image
1. jet airplanes
2. Two
3. Parked
4. facing opposite direction
3. Content Matching
9. Cross-view content matching challenges
Text – “Two parked jet airplanes facing opposite directions on a grassy land”
[Figure: a 10000-dimensional bag-of-words text vector (indices for words such as “jet”, “direction”, “facing”) compared against a SIFT BoW image histogram]
Challenges: Dimension mismatch, Semantic mismatch, Insufficient content. Deep learning?
10. Cross-view content matching challenge
Lack of correspondence
[Figure: two views share a common region while another region is missing; column-wise vectorization of two 8-cell grids shows that pixel-to-pixel correspondence breaks down]
Deep learning?
11. Other useful problems
Task – Face recognition
[Figure: query face matched against a face DB]
Content Extraction: Pixel, Attribute, SIFT, LBP, HOG, Gabor
Content Matching: CCA, PLS, Metric Learning, SVMs
12. Other useful problems
Task – Forensic sketch photo matching
[Figure: forensic sketch query matched against a suspect image database]
Image courtesy – Lois Gibson, “Forensic Art Essentials: A Manual for Law Enforcement Artists”
Content Extraction: SIFT, HOG, Gabor
Content Matching: Local LDA, PLS, CCA
13. This Dissertation
We are interested in extracting and matching task-dependent content across multiple modalities.
Tasks: pose-invariant face recognition; pose-lighting invariant face recognition; text-image matching; forensic image-photo matching
Content Extraction: Semantic Segmentation
Content Matching: Partial Least Squares, pose-error robust matching, Generalized Multi-view Analysis
16. Semantic Segmentation: Overview
1. Scene understanding, robotics, medical image analysis etc.
2. Related work
3. Problem formulation
4. Role of context
5. Intuitive picture
6. Mathematical picture
7. Complete Pipeline
8. Back-propagation and issues
9. Pure-node RCPN
10. Experiments
17. Related Work
1. Multi-scale CNN (Farabet, Pinheiro)
2. Deep CNN (DeepSeg)
3. Non-parametric template matching (Tighe_1, Tighe_2, Eigen, Yang)
4. CRF models (Gould, Munoz, Lempitsky, Kumar, Mottaghi, Yuille)
18. Semantic Segmentation: Problem formulation
Label each super-pixel
[Figure: input image and super-segment overlaid image produced by super-segmentation, with super-pixels labeled Road, Car, Ground]
Image courtesy – http://www.cs.unc.edu/~jtighe/Papers/ECCV10/siftflow/baseFinal.html
19. Semantic Segmentation: Context
• Labeling a super-pixel in isolation is difficult
• Without context, machines outperform humans: 77.4% vs. 72.2% (Mottaghi et al.)
[Figure: isolated super-pixels from regions labeled Building, Train, Aeroplane]
Image courtesy – Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun and Devi Parikh, “Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs”, IEEE CVPR 2013
21. Semantic Segmentation: Context
• Labeling a super-pixel in isolation is difficult
• Without context, machines outperform humans: 77.4% vs. 72.2% (Mottaghi et al.)
• Use context
• MRFs and CRFs
• Typically, MRFs and CRFs use human-designed potential functions and features
• The human visual system is complex – LEARN IT FROM DATA
Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun and Devi Parikh, “Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs”, IEEE CVPR 2013
22. Recursive Context Propagation Network or RCPN
1. Label each super-pixel using entire image
2. Fast feed-forward computations for real-time labeling
3. End-to-end learning
4. Modular to the segmentation pipeline
24. Semantic Segmentation - Pipeline
1. Super-pixel feature
• F_CNN = multi-scale CNN at scales 1, 2 and 4
• 8×8×16 → 2×2 maxpool → 7×7×64 → 2×2 maxpool → 7×7×256
• 256×3 = 768-dimensional pixel feature
• Field of View (FOV) for every pixel = 47×47, 94×94 and 188×188 at the three scales
• Super-pixels by LiuSeg; ~100 super-pixels per image
• v_i = average of the pixel features in each super-pixel (a sketch follows below)
• Data augmentation by 5 random average sets
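A minimal numpy sketch of the super-pixel feature step above; the random feature map and labels stand in for the multi-scale CNN output and the LiuSeg segmentation.

import numpy as np

# Average per-pixel CNN features inside each super-pixel (v_i in the slides).
def superpixel_features(pixel_feats, sp_labels):
    """pixel_feats: (H, W, D) array; sp_labels: (H, W) int array."""
    H, W, D = pixel_feats.shape
    flat_feats = pixel_feats.reshape(-1, D)
    flat_labels = sp_labels.reshape(-1)
    n_sp = flat_labels.max() + 1
    sums = np.zeros((n_sp, D))
    np.add.at(sums, flat_labels, flat_feats)               # per-super-pixel sums
    counts = np.bincount(flat_labels, minlength=n_sp)[:, None]
    return sums / np.maximum(counts, 1)                    # v_i = mean feature

feats = np.random.rand(32, 32, 768)                 # 768-dim, as in the slides
labels = np.random.randint(0, 100, size=(32, 32))   # ~100 super-pixels/image
print(superpixel_features(feats, labels).shape)     # (~100, 768)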
31. RCPN Back-propagation and Bypass Error
[Figure: RCPN parse tree – super-pixel features v1, v2 are mapped by the semantic network (sem) to x1, x2, combined (com) up a sub-tree to the root x9, decombined (dec) into context-enhanced features x̃, concatenated (cat) and passed to the labeler to produce y1 for label l1]
32. RCPN Back-propagation and Bypass Error
[Figure: the same parse tree with a path from x1 straight to the labeler – the combiner is bypassed, context is lost, and training falls into a poor local minimum]
Empirical gradient strengths: g_com ≪ g_sem ≈ g_dec ≪ g_lab
Ideal gradient strengths: g_sem < g_com < g_dec < g_lab
33. Pure-node RCPN or PN-RCPN
• RCPN + pure-node classification loss
• Benefits:
• Roughly 65% more training data
• Meaningful combinations from the combiner
• Deeper and stronger gradients
35. Grad Strength: RCPN vs. PN-RCPN
[Chart: per-module gradient strengths (sem, com, dec, lab) for RCPN vs. PN-RCPN]
RCPN: g_com ≪ g_sem ≈ g_dec ≪ g_lab
PN-RCPN: g_sem < g_com ≈ g_dec < g_lab
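For readers who want to reproduce this kind of diagnostic, a hedged PyTorch sketch of measuring per-module gradient norms follows; the tiny linear modules are placeholders, not the actual RCPN networks.

import torch
import torch.nn as nn

# Measure per-module gradient norms (g_sem, g_com, g_dec, g_lab above).
modules = nn.ModuleDict({
    "sem": nn.Linear(768, 128),   # semantic mapper (placeholder)
    "com": nn.Linear(256, 128),   # combiner (placeholder)
    "dec": nn.Linear(256, 128),   # decombiner (placeholder)
    "lab": nn.Linear(128, 8),     # labeler (placeholder)
})

x = torch.randn(4, 768)
h = torch.tanh(modules["sem"](x))
ctx = torch.tanh(modules["com"](torch.cat([h[:2], h[2:]], dim=1)))
dec = torch.tanh(modules["dec"](torch.cat([h[:2], ctx], dim=1)))
loss = modules["lab"](dec).pow(2).mean()   # dummy loss for illustration
loss.backward()

for name, m in modules.items():
    g = torch.cat([p.grad.flatten() for p in m.parameters()])
    print(name, float(g.norm()))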
36. Experiments: Datasets
We conduct semantic segmentation experiments on three datasets
Stanford Background
Color images with 8 semantic classes
Train/Test – 572/143 images
SIFT Flow
Color images with 33 semantic classes
Train/Test – 2488/200
Daimler Urban Dataset
Gray-scale images with 6 semantic classes
Train/Test – 500/200
37. Experiments: Details
• Per-pixel subtraction of 0.5
• 100 super-pixels/image for Stanford and SIFT Flow
• 800 for Daimler due to its larger image size
• 10 random parse trees with 5 random feature sets during training to avoid over-fitting
• 20 random parse trees with max-voting for testing
38. Experiments: Performance metric
1. Per-pixel accuracy (PPA)
2. Mean-class accuracy (MCA)
3. Intersection over Union (IoU) – penalizes under- and over-segmentation (a sketch follows this list)
4. Dynamic IoU (Dyn IoU) – IoU for dynamic objects
5. Time Per Image (TPI) – Both CPU and GPU
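As a concrete reference for metric 3, here is a minimal numpy implementation of per-class IoU averaged over classes; the experiments' exact averaging conventions may differ.

import numpy as np

# Mean IoU over classes, from predicted and ground-truth label maps.
def mean_iou(pred, gt, n_classes):
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.random.randint(0, 8, size=(240, 320))        # 8 classes, as in Stanford
pred = gt.copy()
pred[:40] = np.random.randint(0, 8, size=(40, 320))  # corrupt a strip
print(round(mean_iou(pred, gt, 8), 3))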
39. Stanford Results
Method PPA MCA IoU TPI (CPU/GPU)
Gould 76.4 NA NA 30 – 600 / NA
Munoz 76.9 NA NA 12 / NA
Tighe_1 77.5 NA NA 4 / NA
Kumar 79.4 NA NA < 600 / NA
Socher 78.1 NA NA NA / NA
Lempitzky 81.9 72.4 NA > 60 /NA
Singh 74.1 62.2 NA 20 / NA
Farabet 81.4 76.0 NA 60.5 / NA
Eigen 75.3 66.5 NA 16.6 / NA
Pinheiro 80.2 69.6 NA 10 / NA
Plain-NN 80.1 69.7 56.4 1.1 / 0.4
RCPN 81.8 73.9 61.3 1.1 / 0.4
PN-RCPN 82.1 79.0 64.0 1.1 / 0.4
TM-RCPN 82.3 79.1 64.5 1.6-6.1 / 0.9-5.9
40. SIFT Flow results
Method PPA MCA IoU TPI (CPU/GPU)
Tighe 77.0 30.1 NA 8.4 / NA
Liu 76.7 NA NA 31 / NA
Singh 79.2 33.8 NA 20 / NA
Eigen 77.1 32.5 NA 16.6 / NA
Farabet 78.5 29.6 NA NA / NA
Bal. Farabet 72.3 50.8 NA NA / NA
Tighe, 24 78.6 39.2 NA 8.4 / NA
Pinheiro 77.7 29.8 NA NA / NA
Yang 79.8 48.7 NA < 12 / NA
Plain-NN 76.3 32.1 24.7 1.1 / 0.36
RCPN 79.6 33.6 26.9 1.1 / 0.4
Bal. RCPN 75.5 48.0 28.6 1.1 / 0.4
PN-RCPN 80.9 39.1 30.8 1.1 / 0.4
Bal. PN-RCPN 75.5 52.8 30.2 1.1 / 0.4
TM-RCPN 80.8 38.4 30.7 1.6-6.1 / 0.9-5.4
Bal. TM-RCPN 76.4 52.6 31.4 1.6-6.1 / 0.9-5.4
DeepSeg 85.2 51.7 39.1 NA / 0.2
46. PLS based multi-modal face recognition
[Figure: PLS bridge – view-specific projections W_X and W_Y map two views X and Y (differing in pose, resolution, or photo vs. sketch) into a common subspace where shape = identity]
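A small sketch of the PLS-bridge idea using scikit-learn's PLSCanonical on synthetic paired views; the data, dimensions, and nearest-neighbor matching rule are illustrative, not the dissertation's experimental setup.

import numpy as np
from sklearn.cross_decomposition import PLSCanonical

# Learn projections W_X, W_Y into a common subspace where paired samples are
# maximally covariant. The synthetic views stand in for, e.g., two face poses.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))                  # shared "identity" factor
X = latent @ rng.normal(size=(5, 60)) + 0.1 * rng.normal(size=(200, 60))
Y = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(200, 40))

pls = PLSCanonical(n_components=5)
Xc, Yc = pls.fit_transform(X, Y)                    # both views in common space

# Paired samples should now be close: match by nearest neighbor across views.
dists = ((Xc[:, None, :] - Yc[None, :, :]) ** 2).sum(-1)
acc = (dists.argmin(axis=1) == np.arange(len(Xc))).mean()
print(f"cross-view match accuracy: {acc:.2f}")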
47. PLS based pose-invariant face recognition
[Chart: recognition rates (0.75–1.05 scale) in a partial comparison with PGFR, TFA, LLR and ELF under different testing scenarios, others vs. proposed]
• CMU PIE face data set for experiments
• 34 training and 34 testing subjects, intensity features
54. GMA cont..
• Multi-view extension of any generalized eigenvalue based feature extraction method
• GMA + LDA = GMLDA
D = Between-class scatter matrix; S = Within-class scatter matrix
• GMA + MFA = GMMFA
D = Penalty Graph; S = Intrinsic Graph
• GMA + LPP = GMLPP
D = Identity; S = Graph Laplacian of the Similarity matrix
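Each GMA ingredient above reduces, in the single-view case, to a generalized eigenvalue problem D w = λ S w. The sketch below solves the GMLDA instance (between-class vs. within-class scatter) on toy two-class data; it illustrates the building block, not the multi-view GMA solver itself.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(50, 3)) for m in (0.0, 3.0)])
y = np.repeat([0, 1], 50)

mu = X.mean(axis=0)
D = np.zeros((3, 3))                                 # between-class scatter
S = np.zeros((3, 3))                                 # within-class scatter
for c in (0, 1):
    Xc = X[y == c]
    diff = (Xc.mean(axis=0) - mu)[:, None]
    D += len(Xc) * diff @ diff.T
    S += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))

evals, evecs = eigh(D, S)        # generalized eigenproblem D w = lambda S w
w = evecs[:, -1]                 # top discriminant direction (largest lambda)
print(np.round(w / np.linalg.norm(w), 3))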
55. Pros and Cons
Cross-view classification and retrieval
Kernelizable
Closed form optimal solution
Supervised
Generalize to unseen classes
Domain agnostic
56. Pros and Cons
Still not ideal
Non-probabilistic
Shallow
Similar views across test and train
57. Final Picture
[Diagram: landscape of multi-view methods for paired data across VIEW 1 and VIEW 2 – CCA/PLS/BLM and GMA in latent spaces, SVM-2K/HMFDA in the original space, alongside the IDEAL method]
58. Experiments
Pose and Lighting Invariant face recognition
• 129 training subjects in 5 illuminations
• 129 test subjects (same identities, different session) in 18 illuminations
• 120 subjects in 5 illuminations
• 129 test subjects (different identities, different session) in 18 illuminations
59. Text-Image Retrieval
• Wiki pages (2173 + 693)
• 10 Different classes
• Latent Dirichlet Allocation Model based text features
• SIFT histogram based image features
• Precision-Recall based Mean Average Precision score
• SM – Semantic matching (domain-dependent approach)
• SCM – Semantic matching in CCA latent space (two-stage domain-dependent approach)
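For reference, a minimal implementation of the Mean Average Precision score listed above; the rankings here are random stand-ins for cross-modal retrieval results.

import numpy as np

# MAP: for each query, average precision at the ranks of its relevant items,
# then average over queries.
def average_precision(relevant, ranking):
    hits, precisions = 0, []
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return np.mean(precisions) if precisions else 0.0

rng = np.random.default_rng(0)
n_docs, n_queries = 100, 20
mAP = np.mean([
    average_precision(set(rng.choice(n_docs, 10, replace=False)),
                      rng.permutation(n_docs))
    for _ in range(n_queries)
])
print(round(float(mAP), 3))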
60. Future Directions
• Deep learning based feature extraction
• Large-scale Data collection
• Deep multi-view algorithms vs. a common deep network
• Unsupervised training
62. References
Tighe_1: J. Tighe and S. Lazebnik. Superparsing. Int. J. Comput. Vision, 101(2):329–349, 2013.
Tighe_2: J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. IEEE CVPR, 2013.
Gould: S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. IEEE ICCV, 2009.
Munoz: D. Munoz, J. A. Bagnell, and M. Hebert. Stacked hierarchical labeling. ECCV, 2010.
Kumar: M. P. Kumar and D. Koller. Efficiently selecting regions for scene understanding. IEEE CVPR, 2010.
Lempitsky: V. Lempitsky, A. Vedaldi, and A. Zisserman. A pylon model for semantic segmentation. NIPS, 2011.
Farabet: C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE TPAMI, August 2013.
Eigen: D. Eigen and R. Fergus. Nonparametric image parsing using adaptive neighbor sets. IEEE CVPR, 2012.
Joint: L. Ladický, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. Clocksin, and P. Torr. Joint optimization for object class segmentation and dense stereo reconstruction. International Journal of Computer Vision, 100(2):122–133, 2012.
Liu: C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. IEEE TPAMI, 33(12), Dec 2011.
LiuSeg: M.-Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy rate superpixel segmentation. IEEE CVPR, 2011.
Pinheiro: P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing. ICML, 2014.
Stixmantics: T. Scharwächter, M. Enzweiler, U. Franke, and S. Roth. Stixmantics: A medium-level model for real-time semantic scene understanding. ECCV, 2014.
Yang: J. Yang, B. Price, S. Cohen, and M.-H. Yang. Context driven scene parsing with attention to rare classes. CVPR, pages 3294–3301, 2014.