We present a deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images.
To deal with intra-class appearance and shape variations that commonly exist among different instances within the same object category,
we leverage a pyramidal model where affine transformation fields are progressively estimated in a coarse-to-fine manner so that the smoothness constraint is naturally imposed within deep networks.
PARN estimates residual affine transformations at each level and composes them to estimate final affine transformations.
Furthermore, to overcome the limitation of insufficient training data for semantic correspondence, we propose a novel weakly-supervised training scheme that generates progressive supervisions by leveraging correspondence consistency across image pairs.
Our method is fully learnable in an end-to-end manner and does not require quantizing the infinite, continuous space of affine transformation fields.
1. PYRAMIDAL AFFINE REGRESSION NETWORKS
FOR DENSE SEMANTIC CORRESPONDENCE
Sangryul Jeon
School of Electrical and Electronic Engineering
Yonsei University
Feb. 19, 2019
2. Contents
I. Introduction
II. Problem Formulation and Overview
III. Pyramidal Affine Regression Networks
IV. Training
V. Experimental Results
VI. Conclusion
5. Introduction
Dense Correspondence
• Establishing dense correspondences between visually similar images, i.e., taken
from similar viewpoints or at similar times
• Classical tasks: stereo matching (to obtain 3D depth information) and optical flow
(to obtain motion information)
• Are they enough to deal with challenging scenarios?
6. Introduction
Dense Semantic Correspondence
• Establishing dense correspondences between semantically similar images, i.e.,
different instances within the same object or scene category
• For example, the wheels of two different cars, the bodies of people and animals, etc.
[Figure: semantic correspondence examples]
7. Introduction
Dense Semantic Correspondence: Applications
Shape by-Example [Hassner & Basri ’13]
Label Transfer / Scene Parsing [Liu et al. ’11]
Depth Transfer [Karsch et al. ’14]
Face Recognition [Liu et al. ’11]
View Synthesis [Hassner et al. ’13]
[slide courtesy: T. Hassner]
11. Problem Formulation and Overview
Estimating local transformation across semantically similar images
• Affine Transformation Fields
• Non-rigid image deformations can be locally well approximated by affine
transformations
• Establishing dense affine transformation fields between images
12. Problem Formulation and Overview
Estimating local transformation across semantically similar images
• Affine Transformation Fields (2 × 3 matrix)
• A field T_i is defined at every pixel i = [i_x, i_y]^T and maps i to its
correspondence i' = T_i [i, 1]^T in homogeneous coordinates (see the sketch below)
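To make the notation concrete, here is a minimal NumPy sketch of applying a dense affine field to pixel coordinates (the function name and array layout are ours, not the paper's):

```python
import numpy as np

def warp_with_affine_field(T, H, W):
    """Map every pixel i = [i_x, i_y] through its own 2x3 affine matrix T_i.

    T: (H, W, 2, 3) dense affine transformation field.
    Returns an (H, W, 2) array of correspondences i' = T_i @ [i_x, i_y, 1]^T.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    # Homogeneous coordinates [i_x, i_y, 1] at every pixel, shape (H, W, 3)
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float32)
    # Per-pixel matrix-vector product: i' = T_i [i, 1]^T
    return np.einsum('hwij,hwj->hwi', T, coords)

# Sanity check: an identity field leaves every pixel in place
T = np.tile(np.array([[1, 0, 0], [0, 1, 0]], np.float32), (4, 4, 1, 1))
assert np.allclose(warp_with_affine_field(T, 4, 4)[2, 3], [3, 2])
```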
13. Problem Formulation and Overview
1. Smoothness constraints within pyramidal graph model
• J. Hur et al., “Generalized Deformable Spatial Pyramid: Geometry-Preserving
Dense Correspondence Estimation”, CVPR’2015
• Major weaknesses
1. Still a tremendous solution space
2. Relies on handcrafted descriptors and optimization techniques
14. Problem Formulation and Overview
2. Transformation parameter regression through CNN architecture
• Traditional matching pipeline
Histogram of Oriented Gradients (HOG) [Dalal et al., CVPR’05]
SIFT Flow [Liu et al., ECCV’08]
DAISY [Tola et al., CVPR’08]
Pipeline: Handcrafted Feature Representation → Feature Matching / Optimization → Parameter Estimator
15. Problem Formulation and Overview
2. Transformation parameter regression through CNN architecture
• CNN architecture for geometric matching
CNNgeometric [Rocco et al., CVPR’17]
CNNgeometric with supervision from inliers [Rocco et al., CVPR’18]
Attentive Semantic Alignment Networks [Seo et al., ECCV’18]
Pipeline: CNN Feature Representation → Feature Matching / Correlation Layer → Transformation Parameter Regressor
16. Problem Formulation and Overview
2. Transformation parameter regression through CNN architecture
• Major weaknesses
1. Assumes a single global transformation
2. Training data is synthesized in a self-supervised manner
18. Pyramidal Affine Regression Networks
Visualization of our PARN results
• Dense affine transformation fields are progressively estimated in a coarse-to-fine
manner, so that the smoothness is naturally imposed within deep networks
[Figure: image pair and warped results at levels 1–4]
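The abstract notes that residual affine transformations are estimated per level and composed into the final field. Below is a minimal NumPy sketch of one plausible per-pixel composition rule (standard homogeneous-coordinate composition; PARN's exact rule may differ):

```python
import numpy as np

def compose_affine(T_res, T_prev):
    """Compose per-pixel 2x3 affines: first apply T_prev, then T_res.

    Both fields have shape (H, W, 2, 3); returns the composed field.
    """
    A1, b1 = T_res[..., :2], T_res[..., 2]    # residual at the current level
    A2, b2 = T_prev[..., :2], T_prev[..., 2]  # (upsampled) coarser-level field
    A = np.einsum('hwij,hwjk->hwik', A1, A2)            # linear parts multiply
    b = np.einsum('hwij,hwj->hwi', A1, b2) + b1         # translations compose
    return np.concatenate([A, b[..., None]], axis=-1)
```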
20. Pyramidal Affine Regression Networks
Network Architecture
1. Hierarchical Feature Extraction
• Leverage the feature hierarchies within CNNs
• Convolutional activations are extracted by a siamese network with shared parameters W^c (sketch below)
→ Handles the trade-off between semantic robustness and matching precision
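A minimal PyTorch sketch of siamese hierarchical feature extraction, tapping the conv3-3/conv4-3/conv5-3 activations named on the experiments slide (the wrapper class and the choice of VGG-16 are our assumptions):

```python
import torch
import torchvision

class HierarchicalFeatures(torch.nn.Module):
    """Tap VGG-16 activations after conv3-3, conv4-3, conv5-3 (shared weights)."""
    # Indices of the ReLU following conv3-3 / conv4-3 / conv5-3 in vgg16.features
    TAPS = (15, 22, 29)

    def __init__(self):
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights='DEFAULT').features

    def forward(self, x):
        feats = []
        for idx, layer in enumerate(self.backbone):
            x = layer(x)
            if idx in self.TAPS:
                feats.append(x)  # coarse-to-fine hierarchy of activations
        return feats

net = HierarchicalFeatures().eval()
src_feats = net(torch.randn(1, 3, 224, 224))  # the same weights are reused for
tgt_feats = net(torch.randn(1, 3, 224, 224))  # the target image (siamese)
```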
21. Pyramidal Affine Regression Networks
Network Architecture
2. Constrained cost volume construction
• The cost volume between two extracted features is computed with a rectified
cosine similarity
[Figure: image pair and cost volumes at levels 1–4]
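A minimal PyTorch sketch of the rectified cosine similarity stated above (the constrained search window is omitted for brevity, and the function name is ours):

```python
import torch
import torch.nn.functional as F

def rectified_cosine_cost_volume(feat_a, feat_b):
    """Cost volume C[p, q] = max(0, <f_a(p), f_b(q)> / (|f_a(p)| |f_b(q)|)).

    feat_a, feat_b: (B, C, H, W) feature maps from the siamese extractor.
    Returns (B, H*W, H, W): similarity of every source location p (channel
    dimension) to every target location q (spatial dimensions).
    """
    B, C, H, W = feat_a.shape
    a = F.normalize(feat_a.view(B, C, -1), dim=1)   # L2-normalize features
    b = F.normalize(feat_b.view(B, C, -1), dim=1)
    corr = torch.bmm(a.transpose(1, 2), b)          # (B, H*W, H*W) cosines
    return F.relu(corr).view(B, H * W, H, W)        # rectification: clamp at 0
```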
22. Pyramidal Affine Regression Networks
Network Architecture
3. Locally-varying affine transformation field
• Progressively divide each grid into four rectangular grids, yielding a
2^(k−1) × 2^(k−1) grid of affine fields T at level k (worked example below)
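A quick worked reading of the subdivision rule, assuming level 1 is the single global grid:

```python
# Each level splits every cell into 2 x 2, so level k has 2^(k-1) x 2^(k-1)
# cells, each regressing its own 2x3 affine matrix (6 parameters).
for k in range(1, 5):
    n = 2 ** (k - 1)
    print(f"level {k}: {n}x{n} grid -> {6 * n * n} affine parameters")
# level 1: 1x1 grid -> 6 affine parameters
# level 2: 2x2 grid -> 24 affine parameters, and so on
```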
23. Pyramidal Affine Regression Networks
Network Architecture
3. Locally-varying affine transformation field
• Discontinuities between nearby affine fields result in blocky artifacts around grid
boundaries
[Figure: image pair and blocky artifacts at levels 1–3]
24. Pyramidal Affine Regression Networks
Network Architecture
3. Locally-varying affine transformation field
• To alleviate this, a bilinear upsampler is applied at the end of the successive CNNs
to smooth the affine field across grid boundaries (sketch below)
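A minimal PyTorch sketch of that upsampling step, assuming the six affine parameters are laid out as channels (our layout, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def upsample_affine_field(field, size):
    """Bilinearly upsample a dense affine field to smooth grid boundaries.

    field: (B, 6, H, W) per-pixel 2x3 affine parameters stored as channels.
    size:  (H2, W2) target resolution for the next, finer level.
    """
    return F.interpolate(field, size=size, mode='bilinear',
                         align_corners=False)
```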
27. Training
Generating Progressive Supervisions
• Challenge: the lack of ground-truth semantic correspondences
• How can the network be learned without pixel-level ground-truth annotations?
• Our solution: correspondence consistency
→ weakly-supervised learning using tentative training samples (see the sketch below)
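A minimal NumPy sketch of a forward-backward consistency check for mining tentative samples (the threshold and the dense-correspondence representation are our assumptions):

```python
import numpy as np

def consistent_matches(fwd, bwd, thresh=1.0):
    """Keep pixels whose forward match, mapped back, lands near itself.

    fwd: (H, W, 2) source->target correspondences (absolute x, y coords).
    bwd: (H, W, 2) target->source correspondences.
    Returns a boolean (H, W) mask of consistency-verified training samples.
    """
    H, W, _ = fwd.shape
    x = np.clip(np.round(fwd[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(fwd[..., 1]).astype(int), 0, H - 1)
    roundtrip = bwd[y, x]                  # follow each match back to source
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs, ys], axis=-1)     # original source coordinates
    err = np.linalg.norm(roundtrip - grid, axis=-1)
    return err < thresh                    # tentative positive samples
```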
28. Training
Generating Progressive Supervisions
• Correspondence consistency in computer vision
• Shape Matching [Huang et al., SGP’13]
• Co-segmentation [Wang et al., ICCV’13]
• Structure from Motion (SfM) [Zach et al., CVPR’10]
• Collections of Correspondences [Zhou et al., CVPR’15; Zhou et al., ICCV’15]
[Slide courtesy: Tinghui Zhou]
32. Experimental Results
Experimental Settings
• Three grid-level modules (K = 3)
• Feature maps M(k) sampled after the intermediate pooling layers: ‘conv5-3’, ‘conv4-3’, ‘conv3-3’
• r(k) set as a ratio of the whole search space: {1/10, 1/10, 1/15, 1/15}
Comparison to the latest methods on semantic correspondence
• “Convolutional Neural Network Architecture for Geometric Matching” (CNNgeo), CVPR’17
• “SCNet: Learning Semantic Correspondence” (SCNet), ICCV’17
• “DCTM: Discrete-Continuous Transformation Matching” (DCTM), ICCV’17
39. Conclusion
• We proposed a CNN architecture that estimates locally-varying affine
transformation fields across semantically similar images
• Our network is trained in a weakly-supervised manner, using
correspondence consistency across training image pairs
• We believe PARN can potentially benefit instance-level object
detection and segmentation, thanks to its robustness to severe
geometric variations