Deep Learning for Structure-from-Motion (SfM)

Deconstructing the SfM-Net architecture and beyond

Purpose of this presentation
● Deconstruction of the novel SfM-Net deep learning
architecture for Structure-from-Motion (SfM)
- Anticipating the increased use of deep learning for
SfM and “geometric” computer vision problems in
general.
- Hopefully this presentation incites feelings of
“that could be done in a better way”,
leading to better and better deep SfM architectures.
● A “bit of everything for everyone” approach, targeted
mainly at computer vision engineers who are not yet at
an advanced level in geometric deep learning.
● Trying to address the typical problem of finding the
relevant “seed literature” for a new topic, helping fresh
grad students, postdocs, software engineers and
startup founders.
- Answering “Do you know if someone has done some
work on the various steps involved in SfM?” to identify
which wheels do not need to be re-invented.

SfM • Structure from Motion Basics recap • Camera Projections
Structure-from-Motion (SfM). Instead of a
single stereo pair, the SfM technique requires
multiple, overlapping photographs as input to
feature extraction and 3-D reconstruction
algorithms. - Westoby et al
The SfM method basically computes the
relative camera positions between
all related photos. Once every relative
camera position is found, the scheme
uses these camera matrices to reconstruct all
feature points using triangulation.
Thus there are two main problems:
1) Image registration (e.g. SIFT,
SURF, ORB, etc)
2) Pose Estimation (e.g.
Perspective-n-Point with
RANSAC)
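The triangulation step mentioned above can be sketched as follows. This is a minimal linear (DLT) triangulation for two views, assuming the two camera projection matrices are already known; the names `triangulate_point` and `project` and the toy camera values are illustrative and not taken from any of the cited works.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: matching pixel coordinates (u, v) in the two images.
    """
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy setup (assumed values): two identical cameras, the second shifted sideways.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, -0.2, 4.0])

X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noisy feature matches the linear solution is usually only a starting point that is refined afterwards, e.g. by minimizing reprojection error.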
Image registration
e.g. find corresponding features from an image pair (http://cs.brown.edu/courses/cs143/proj3a/)
The more robust the algorithm used to find features
(e.g. old school vs. deep learning),
the higher the quality of the reconstructed
point cloud.
The camera projection matrix converts 3D real-world coordinates to 2D image coordinates
ults/proj3/html/agartia3/index.html
Perspective
Camera Toy
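As a small illustration of the camera projection matrix, the sketch below builds P = K [R | t] and maps 3D world points to 2D pixel coordinates. The intrinsics and the helper names `projection_matrix` and `project_points` are assumptions chosen for the example.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Compose the 3x4 camera projection matrix P = K [R | t]."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project_points(P, X_world):
    """Project N world points (N x 3) to pixel coordinates (N x 2)."""
    X_h = np.hstack([X_world, np.ones((len(X_world), 1))])  # homogeneous coords
    x_h = X_h @ P.T
    return x_h[:, :2] / x_h[:, 2:3]                          # perspective divide

# Assumed toy intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)        # camera aligned with the world axes
t = np.zeros(3)      # camera at the world origin
P = projection_matrix(K, R, t)

points = np.array([[0.0, 0.0, 2.0], [0.1, -0.05, 2.0]])
pixels = project_points(P, points)
```

A point on the optical axis lands on the principal point; the second point is offset in proportion to focal length divided by depth.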

SfM • Structure from Motion Basics recap • Fundamental Matrix

SfM • Structure from Motion Basics recap • pose estimates
Evaluating Pose Estimation Methods for
Stereo Visual Odometry on Robots
Date of Original Version: 8-2010
Hatem Alismail, Carnegie Mellon University;
Brett Browning, Carnegie Mellon University;
M. Bernardine Dias, Carnegie Mellon University
http://repository.cmu.edu/robotics/745/
Structure-From-Motion (SFM) methods, using
stereo data, are among the best performing
algorithms for motion estimation from video
imagery, or visual odometry. Critical to the
success of SFM methods is the quality of the initial
pose estimation algorithm from feature
correspondences. In this work, we evaluate the
performance of pose estimation algorithms
commonly used in SFM visual odometry. We
consider two classes of techniques to develop the
initial pose estimate: Absolute Orientation (AO)
methods, and Perspective-n-Point (PnP)
methods.
To date, there has not been a comparative study of
their performance on robot visual odometry tasks.
We undertake such a study to measure the
accuracy, repeatability, and robustness of these
techniques for vehicles moving in indoor
environments and in outdoor suburban roadways.
Our results show that PnP methods outperform
AO methods, with P3P being the best performing
algorithm. This is particularly true when stereo
triangulation uncertainty is high due to a wide
Field of View lens and small stereo-rig baseline.
Random forests versus Neural Networks — What's best for camera localization?
Daniela Massiceti ; Alexander Krull ; Eric Brachmann ; Carsten Rother ; Philip H.S. Torr
Robotics and Automation (ICRA), 2017 IEEE International Conference on; https://doi.org/10.1109/ICRA.2017.7989598
“To summarize, our best method, a ForestNet with a robust average, which has an equivalent fast and lightweight
RF, improves over the state-of-the-art for camera localization on the 7-Scenes dataset. While this work focuses
on scene coordinate regression for camera localization, our innovations may also be applied to other continuous
regression tasks.”
Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional
Neural Network
Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, Juho Kannala
https://arxiv.org/abs/1707.09733
“The neural network is trained for relative pose estimation in an end-to-end manner using training image pairs.
In contrast to previous work, our approach does not require scene-specific training of the network, which improves
scalability, and it can also be applied to scenes which are not available during the training of the network.”
DSAC - Differentiable RANSAC for Camera Localization
Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, Carsten Rother
“We call this approach DSAC, the differentiable counterpart of RANSAC. We apply DSAC to the problem of
camera localization, where deep learning has so far failed to improve on traditional approaches. We
demonstrate that by directly minimizing the expected loss of the output camera poses, robustly estimated by
RANSAC, we achieve an increase in accuracy. In the future, any deep learning pipeline can use DSAC as a robust
optimization component.”
Deep 6-DOF Tracking
Mathieu Garon, Jean-François Lalonde
“We present a temporal 6-DOF tracking method which leverages deep learning to achieve state-of-the-art
performance on challenging datasets of real world capture. Our method is both more accurate and more robust to
occlusions than the existing best performing approaches while maintaining real-time performance. To assess its
efficacy, we evaluate our approach on several challenging RGBD sequences of real objects in a variety of
conditions. Notably, we systematically evaluate robustness to occlusions through a series of sequences where the
object to be tracked is increasingly occluded. Finally, our approach is purely data-driven and does not require
any hand-designed features: robust tracking is automatically learned from data.”

SfM-NeT • Abstract
SfM-Net: Learning of Structure and Motion from Video
Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki
Google Research; Inria, Grenoble, France; Carnegie Mellon University
(Submitted on 25 Apr 2017) arXiv:1704.07804 [cs.CV] | https://arxiv.org/abs/1704.07804
We propose SfM-Net, a geometry-aware neural network for motion
estimation in videos that decomposes frame-to-frame pixel motion in
terms of scene and object depth, camera motion and 3D object rotations
and translations. Given a sequence of frames, SfM-Net predicts depth,
segmentation, camera and rigid object motions, converts those into a
dense frame-to-frame motion field (optical flow), differentiably warps
frames in time to match pixels and back-propagates.
The model can be trained with various degrees of supervision:
1) Self-supervised by the re-projection photometric error (completely
unsupervised),
2) Supervised by ego-motion (camera motion), or
3) Supervised by depth (e.g., as provided by RGBD sensors). SfM-Net
extracts meaningful depth estimates and successfully estimates
frame-to-frame camera rotations and translations.
It often successfully segments the moving objects in the scene, even
though such supervision is never provided.
SfM-Net: Given a pair of frames as input, our model decomposes frame-to-frame pixel
motion into 3D scene depth, 3D camera rotation and translation, a set of motion masks
and corresponding 3D rigid rotations and translations. It backprojects the resulting 3D
scene flow into 2D optical flow and warps accordingly to match pixels from one frame to
the next. Forward-backward consistency checks constrain the estimated depth.
SfM-NeT • Inspiration
SfM-Net is inspired by works that impose geometric
constraints on optical flow, exploiting rigidity of the visual
scene, such as early low-parametric optical flow methods
[e.g. Zelnik-Manor and Irani (2000)] or the so-called direct methods for
visual SLAM (Simultaneous Localization and Mapping) that
perform dense pixel matching from frame to frame while
estimating a camera trajectory and depth of the pixels in the
scene [e.g. Schöps et al. (2014) and Engel et al. (2014)].
In contrast to those, instead of optimizing directly over
optical flow vectors, 3D point coordinates or camera
rotation and translation, our model optimizes over neural
network weights that, given a pair of frames, produce such
3D structure and motion. In this way, our method learns to
estimate structure and motion, and can in principle
improve as it processes more videos, in contrast to non-
learning based alternatives. It can thus be made robust to
lack of texture, degenerate camera motion trajectories or
dynamic objects (our model explicitly accounts for those),
by providing appropriate supervision.
Our work is also inspired and builds upon recent works on
learning geometrically interpretable optical flow fields for
point cloud prediction in time [Byravan and Fox (2016)] and
backpropagating through camera projection for 3D human
pose estimation [Wu et al. (2016)] or single-view depth estimation
[Zhou et al. (2017), https://github.com/tinghuiz/SfMLearner].
Zhou et al. (2017) describe their SfMLearner system as follows:
“The training data to our system consists solely of
unlabeled image sequences capturing scene appearance
from different viewpoints, where the poses of the images
are not provided. Our training procedure produces two
models that operate independently, one for single-view
depth prediction, and one for multiview camera pose
estimation.”
Overview of the Large Scale Direct Monocular SLAM (LSD-SLAM) algorithm
http://www.doc.ic.ac.uk/~ab9515/lsdslam.html.

SfM-NeT • related Work
Differentiable warping [Jaderberg et al. (2015)] has been used to learn
end-to-end unsupervised optical flow [Yu et al. (2016)], disparity
flow in a stereo rig [Godard et al. (2016)] and video prediction
[Patraucean et al. (2015)]. The closest previous works to ours are
SE3-Nets [Byravan and Fox (2016)], 3D image interpreter [Wu et al. (2016)],
and Garg et al.’s (2016) depth CNN.
SE3-Nets [Byravan and Fox (2016)] use an actuation force from a
robot and an input point cloud to forecast a set of 3D rigid
object motions (rotation and translations) and
corresponding pixel motion assignment masks under a
static camera assumption.
Our work uses similar representation of pixel motion masks
and 3D motions to capture the dynamic objects in the
scene. However, our work differs in that
1) We predict depth and camera motion while SE3-
Nets operate on given point clouds and assume no
camera motion,
2) SE3-Nets are supervised with pre-recorded 3D
optical flow, while this work admits diverse and
much weaker supervision, as well as complete
lack of supervision,
3) SE3-Nets consider one frame and an action as
input to predict the future motion, while our model
uses pairs of frames as input to estimate the intra-
frame motion, and
4) SE3-Nets are applied to toy or lab-like setups
whereas we show results on real videos.
https://doi.org/10.1109/ICRA.2017.7989023
SE3-NET architecture. Input is a 3D point cloud and an n-dimensional action vector (bold-italics), both of which are
encoded and concatenated to a joint feature vector (CAT). The decoder uses this encoding to predict "k" object masks M
and "k" SE(3) transforms which are used to transform the input cloud via the "Transform layer" to generate the output.
Mask weights are sharpened and normalized before use for prediction. Conv = Convolution, FC = Fully Connected,
Deconv = Deconvolution, CAT = Concatenation

SfM-NeT • SfM-Net architecture
SfM-Net architecture. For each pair of consecutive frames I_t, I_{t+1}, a conv/deconv sub-network predicts depth d_t while another predicts a set of K segmentation
masks m_t. The coarsest feature maps of the motion-mask encoder are further decoded through fully connected layers towards 3D rotations and translations for
the camera and the K segmentations. The predicted depth is converted into a per frame point-cloud using estimated or known camera intrinsics. Then, it is
transformed according to the predicted 3D scene flow, as composed by the 3D camera motion and independent 3D mask motions. Transformed 3D depth is
projected back to the 2D next frame, and thus provides corresponding 2D optical flow fields. Differentiable backward warping maps frame I_{t+1} to I_t, and gradients
are computed based on pixel errors. Forward-backward constraints are imposed by repeating this process for the inverted frame pair I_{t+1}, I_t and constraining the
depths d_t and d_{t+1} to be consistent through the estimated scene motion.
coarse map
2 x fully connected layers

SfM-NeT • SfM-Net architecture • structure Network
We compute per frame depth using a
standard conv/deconv subnetwork
operating on a single frame (the
structure network on previous slide).
We use a RELU activation at our final
layer, since depth values are non-
negative.
Given depth d_t, we obtain the 3D
point cloud corresponding to the
pixels in the scene using a pinhole
camera model. Let (x_t^i, y_t^i) be the
column and row positions of the i-th
pixel in frame I_t and let (c_x, c_y, f) be the
camera intrinsics; the 3D point of each
pixel is then given by Equation (1),
where d_t^i denotes the depth value of
the i-th pixel. We use the camera
intrinsics when available and revert
to default values of (0.5, 0.5, 1.0)
otherwise. Therefore, the predicted
depth will only be correct up to a
scalar multiplier.
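The pinhole back-projection described above can be sketched as below. The formula used, X = (d / f) * [x / w - c_x, y / h - c_y, f], is an assumption based on the normalized default intrinsics (0.5, 0.5, 1.0) quoted in the text; the function name `depth_to_point_cloud` is illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth, cx=0.5, cy=0.5, f=1.0):
    """Back-project a depth map (h x w) into a per-pixel 3D point cloud (h x w x 3).

    Assumes intrinsics expressed in normalized image coordinates, so pixel
    positions are divided by the image width and height before use.
    """
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")  # row, column
    X = depth / f * (xs / w - cx)
    Y = depth / f * (ys / h - cy)
    Z = depth                      # (d / f) * f
    return np.stack([X, Y, Z], axis=-1)

# Toy example: a fronto-parallel plane at depth 2.
depth = np.full((4, 6), 2.0)
cloud = depth_to_point_cloud(depth)
```

Because the default focal length is arbitrary, scaling the depth map scales the whole cloud, which is one way to read the "correct up to a scalar multiplier" remark.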
https://youtu.be/vZELygPzV0M?t=51m47s
by Cyrill Stachniss
Xu et al. (2015)
https://machinelearningonline.blog/ by narasimman
Activation function variants for CIFAR-10/100 dataset
Godin et al. (2017): Test errors of the ResNet-110 architecture
using Dual ReLUs and Dual Exponential Linear Units [DELUs,
Clevert et al., (2016)], compared to the initial version with ReLUs
and the extension with Concatenated ReLU [CreLU,
Shang et al. (2017)] on CIFAR-10 and CIFAR-100, using an equal
parameter budget.
Comparison of activation functions. The
rectified linear unit (ReLU), the leaky ReLU (LReLU,
α = 0.1), the shifted ReLUs (SReLUs), and the
exponential linear unit (ELU, α = 1.0).
Clevert et al., (2016)
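For reference, three of the activation functions compared in the caption can be written in a few lines. This is a plain NumPy sketch using the α values quoted above; the shifted ReLU is left out.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: zero for negative inputs."""
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: small slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """Exponential linear unit: saturates smoothly towards -alpha."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
outputs = {"relu": relu(x), "leaky_relu": leaky_relu(x), "elu": elu(x)}
```

ReLU never outputs a negative value, which is why it fits a depth output layer, whereas ELU keeps negative activations bounded below by -α.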
The conv/deconv subnetwork is
similar in spirit to U-Net, which first
downsamples [ENCODER]
the dense map to a coarse map
through max pooling, which is
followed by upsampling
[DECODER].
Wojna et al. (2017)

SfM-NeT • SfM-Net architecture • Scene&Object Motion Network
We compute the motion of the camera and of independently moving objects in the
scene using a conv/deconv subnetwork that operates on a pair of images (the
motion network of the architecture).
We depth-concatenate the pair of frames and use a series of convolutional layers
to produce an embedding layer. We use two fully-connected layers to predict the
motion of the camera between the frames and a predefined number K of rigid
body motions that explain moving objects in the scene.
The fully-connected layers are used to predict translation parameters t_c, the pivot
points of the camera rotation p_c, and sin α, sin β, sin γ. These last three
parameters are constrained to be in the interval [−1, 1] by using RELU activation and
the minimum function.
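One possible reading of this parameterization is sketched below. The text does not spell out how ReLU and the minimum are combined, so `clamp_unit` is an assumption, as are the choice of the positive cosine root, the axis convention and the composition order Rx · Ry · Rz.

```python
import numpy as np

def clamp_unit(x):
    """Clamp values to [-1, 1] using only a ReLU and a minimum (assumed form)."""
    shifted = np.maximum(x + 1.0, 0.0)      # ReLU applied to the shifted input
    return np.minimum(shifted, 2.0) - 1.0   # cap at 2, then shift back

def rotation_from_sines(sin_a, sin_b, sin_g):
    """Build a 3x3 rotation matrix from predicted sines of three Euler angles."""
    sines = clamp_unit(np.array([sin_a, sin_b, sin_g], dtype=float))
    cosines = np.sqrt(1.0 - sines ** 2)     # positive root: angles in [-90°, 90°]
    (sa, sb, sg), (ca, cb, cg) = sines, cosines
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rz = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return Rx @ Ry @ Rz

# The third raw prediction is out of range and gets clamped to 1.
R = rotation_from_sines(0.3, -0.2, 5.0)
```

Predicting sines rather than raw matrix entries means the result is a valid rotation by construction, whatever the fully-connected layers output.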
For object motion, we use similar representations as for camera motion and predict parameters using
fully-connected layers on top of the same embedding E. While camera motion is a
global transformation applied to all the pixels in the scene, the object motion
transforms are weighted by the predicted membership probability of each pixel to
each rigid motion. These masks are produced by feeding the embedding layer
through a deconvolutional tower. We use sigmoid activations at the last layer
instead of softmax in order to allow each pixel to belong to any number of rigid
body motions.
When a pixel has zero activation across all K maps it is assigned to the static
background whose motion is a function of the global camera motion alone. We
allow a pixel to belong to multiple rigid body transforms in order to capture
composition of motions, e.g., through kinematic chains, such as articulated
bodies. Learning the required number of motions for a sequence is an interesting
open problem. We found that we could fix K = 3 for all experiments presented here.
Note that our method can learn to ignore unnecessary object motions in a
sequence by assigning no pixels to the corresponding mask.
http://www.math.tau.ac.il/~dcor/Graphics/cg-slides/trans3d.pdf

SfM-NeT • SfM-Net architecture • optical Flow
We obtain optical flow by first transforming the point
cloud obtained in Equation (1) using the camera and object
motion rigid body transformations, followed by projecting
the 3D points onto the image plane using the camera
intrinsics.
In the following, we drop the pixel superscript i from the 3D
coordinates, since it is clear we are referring to the motion
transformation of the i-th pixel of the t-th frame. We first apply
the object transformations:
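The chain from point cloud to optical flow can be sketched as follows. The equations on the slide were not recoverable, so the exact form is an assumption: mask-weighted rigid object motions about pivot points, then the camera motion, then a pinhole projection with normalized intrinsics. All function and argument names are illustrative.

```python
import numpy as np

def scene_flow_to_optical_flow(cloud, masks, obj_R, obj_t, obj_p,
                               cam_R, cam_t, cam_p, cx=0.5, cy=0.5, f=1.0):
    """Convert a point cloud (h x w x 3) plus rigid motions into 2D flow (h x w x 2)."""
    h, w, _ = cloud.shape
    X = cloud.reshape(-1, 3)

    # 1) Object motions: each pixel moves by the mask-weighted rigid displacement.
    X_obj = X.copy()
    for m, R, t, p in zip(masks, obj_R, obj_t, obj_p):
        moved = (X - p) @ R.T + t
        X_obj += m.reshape(-1, 1) * (moved - X)

    # 2) Camera motion: one global rigid transform applied to every point.
    X_cam = (X_obj - cam_p) @ cam_R.T + cam_t

    # 3) Project back to pixel coordinates of the next frame.
    x_new = w * (f * X_cam[:, 0] / X_cam[:, 2] + cx)
    y_new = h * (f * X_cam[:, 1] / X_cam[:, 2] + cy)

    # 4) Flow is the displacement of each pixel.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([x_new.reshape(h, w) - xs, y_new.reshape(h, w) - ys], axis=-1)

# Toy scene: flat plane at depth 2, no pixel assigned to the single object mask,
# camera translating sideways by 0.1.
h, w = 4, 8
depth = np.full((h, w), 2.0)
ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
cloud = np.stack([depth * (xs / w - 0.5), depth * (ys / h - 0.5), depth], axis=-1)
I3, zero = np.eye(3), np.zeros(3)
masks = np.zeros((1, h, w))

flow = scene_flow_to_optical_flow(cloud, masks, [I3], [np.array([0.0, 0.0, 1.0])],
                                  [zero], I3, np.array([0.1, 0.0, 0.0]), zero)
```

Since the object mask is zero everywhere, the object translation is ignored and the flow comes from the camera motion alone, which mirrors the static-background case described on the previous slide.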

Upgrade • SfM-NeT • Upgrade to architecture #1
The Devil is in the Decoder
Zbigniew Wojna, Vittorio Ferrari,
Sergio Guadarrama, Nathan Silberman,
Liang-Chieh Chen, Alireza Fathi, Jasper Uijlings
While encoders have been studied rigorously, relatively few
studies address the decoder side. Therefore this paper
presents an extensive comparison of a variety of
decoders for a variety of pixel-wise prediction tasks. Our
contributions are:
1) Decoders matter: we observe significant variance in
results between different types of decoders on various
problems.
2) We introduce a novel decoder: bilinear additive
upsampling.
3) We introduce new residual-like connections for decoders.
4) We identify two decoder types which give a consistently
high performance.
SfM-Net is more like a proof-of-concept network that will probably be upgraded by the
authors themselves with skip connections / residual-like connections for better
performance?
Wojna et al. (2017) found little advantage in depth prediction when using skip layers:
“For depth prediction, all layers except bilinear upsampling have good performance, whereas
adding skip layers to these results in equal performance except for depth-to-space, where it
slightly lowers performance”
When using residual connections, performance consistently improves:
“For the majority of combinations, we see that adding residual connections is beneficial.
Interestingly, we now can identify two upsampling methods which have consistently good
results on all problems presented in this paper, both which have residual connections: (1)
transposed convolutions + residual connections. (2) bilinear additive upsampling + residual
connections (both with and without skip connections).”
Our main results comparing a
variety of decoders on five
machine vision problems. The
upper part shows decoders
without residual-like connections;
the bottom shows decoders with
residual-like connections. The
colors represent relative
performance: red means top
performance, yellow means
reasonable performance, blue
means poor performance.

Upgrade • SfM-NeT • Upgrade to architecture #2
Learning a Multi-View
Stereo Machine
Abhishek Kar, Christian Häne, Jitendra Malik. UC Berkeley
https://people.eecs.berkeley.edu/~akar/deepmvs.pdf
In this work, we present Learnt Stereo Machines (LSM) -
a system which is able to reconstruct object geometry as
voxel occupancy grids or per-view depth maps from a small
number of views, including just a single image. We design
our system inspired by classical approaches while learning
each component from data embedded in an end to end
system. LSMs have built in projective geometry, enabling
reasoning in metric 3D space and effectively exploiting the
geometric structure of the Multi-view stereopsis (MVS)
problem.
Compared to classical approaches, which are designed to
exploit a specific cue such as silhouettes or photo-
consistency, our system learns to exploit the cues that are
relevant to the particular instance while also using priors
about shape to predict geometry for unseen regions.
Compared to recent learning based reconstruction
approaches, our system is able to better use camera pose
information leading to significantly large improvements
while adding more views. Finally, we show successful
generalization to unseen object categories demonstrating
that our network goes beyond semantic cues and strongly
uses geometric information for unified single and multi-view
3D reconstruction
Overview of a Learnt Stereo Machine (LSM). It takes as input one or more views and camera poses. The images are processed through a feature encoder
which are then unprojected into the 3D world frame using a differentiable unprojection operation. LSMs can produce two kinds of outputs – voxel
occupancy grids (Voxel LSM) decoded from G^o,
or per-view depth maps (Depth LSM) decoded after a projection operation.
Qualitative results for
per-view depth map
prediction on
ShapeNet. We show
the depth maps
predicted by Depth-
LSM (visualized with
shading from a shifted
viewpoint) and the
point cloud obtained
by unprojecting them
into world coordinates.

Upgrade • SfM-NeT • Relu alternatives
A summary of recommendations:
use ELU non-linearity without batchnorm, or ReLU with it.

Upgrade • SfM-NeT • Normalization techniques
Batch normalization, what was this?
TL;DR To reduce covariate shift (explained by Alex Smola)
[D] Weight normalization vs. layer normalization, has
anyone done benchmarks?
(Reddit r/MachineLearning, posted by carlthome)
Batch normalization is the norm (pun
intended) but for RNNs or small batch sizes
layer normalization and
weight normalization look like attractive
alternatives.
In the NIPS submission for weight normalization,
they have the layer normalization paper listed as
a reference (although never cited in the text), but
it has since been removed. This got me thinking
about pros/cons of the respective methods. Has
anyone done benchmarks comparing weight
normalization to layer normalization (particularly
for ResNets or RNNs)?
PS: Recurrent batch normalization is
memory intensive and should be avoided IMO,
but that too would be interesting to benchmark.
Batch Renormalization: Towards Reducing Minibatch
Dependence in Batch-Normalized Models
Sergey Ioffe, Google Inc., sioffe@google.com
(Submitted on 10 Feb 2017 (v1), last revised 30 Mar 2017 (this version, v2))
https://arxiv.org/abs/1702.03275 | https://github.com/titu1994/BatchRenormalization
Batch Normalization—What the hey?
By Karl N. Jun 8, 2016
Batch ReNorm is useful
especially with smaller batch
sizes. Validation accuracy for
models trained with either
batchnorm or Batch Renorm, where
normalization is performed for sets
of 4 examples (but with the
gradients aggregated over all 50×32
examples processed by the 50
workers). Batch Renorm allows the
model to train faster and achieve a
higher accuracy, although
normalizing sets of 32 examples
performs better.
“Batch normalization is applied to all
convolutional layer outputs.”
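To make the difference between the two normalizations concrete, here is a forward-pass-only sketch. It leaves out the learned scale and shift, the moving-average updates, and the fact that the correction terms r and d are treated as constants during backpropagation; the clipping limits `r_max` and `d_max` are example values.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature with the statistics of the current minibatch."""
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    return (x - mu_b) / sigma_b

def batch_renorm(x, moving_mean, moving_std, r_max=3.0, d_max=5.0, eps=1e-5):
    """Batch Renormalization: correct minibatch statistics towards moving ones."""
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    r = np.clip(sigma_b / moving_std, 1.0 / r_max, r_max)           # scale correction
    d = np.clip((mu_b - moving_mean) / moving_std, -d_max, d_max)   # shift correction
    return (x - mu_b) / sigma_b * r + d

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 5))   # tiny batch of 4 examples
mu_b = x.mean(axis=0)
sigma_b = np.sqrt(x.var(axis=0) + 1e-5)

same_stats = batch_renorm(x, mu_b, sigma_b)               # r = 1, d = 0
shifted = batch_renorm(x, mu_b + 0.1, 1.2 * sigma_b)      # corrections active
```

When the moving statistics equal the minibatch statistics the two functions agree; otherwise, as long as r and d are not clipped, the output equals normalization by the moving statistics, which is what reduces the dependence on small batches.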

Upgrade • SfM-NeT • pooling vs. strides vs. dilation
Christian Perone, R&D Machine Learning Engineer and Software Engineer at HP
https://www.slideshare.net/perone/deep-learning-convolutional-neural-networks
Typical architectures use pooling after each convolution
layer.
In the example on the left, 16 convolutions need to be computed
before pooling, and then the max needs to be computed over each
2x2 window. With a stride of 2 and no pooling, one
would compute only 4 convolutions while obtaining the same
downsampling.
Springenberg et al. (2015) suggest that the performance
decrease from this simplification may be negligible in
practice: “We find that max-pooling can simply be replaced
by a convolutional layer with increased stride without loss in
accuracy on several image recognition benchmarks”
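The 16-versus-4 count can be reproduced with a toy example. The sketch assumes a 6x6 input and a 3x3 kernel with no padding (sizes chosen for illustration), and the "convolution" is the cross-correlation used in CNNs.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Naive single-channel 2D convolution (cross-correlation), no padding."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (height and width must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

dense = conv2d_valid(image, kernel, stride=1)    # 4 x 4 = 16 kernel evaluations
pooled = max_pool_2x2(dense)                     # 2 x 2 after pooling
strided = conv2d_valid(image, kernel, stride=2)  # 2 x 2 = 4 kernel evaluations
```

Both routes give a 2x2 map, but the strided version keeps every second dense response instead of the window maximum, so the values differ even though the downsampling factor is the same.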
Dilated convolutions (algorithme à trous,
in practice the stationary wavelet transform)
do not reduce
dimensionality but rather aggregate
multi-scale contextual information.
See for example Yu et al. (2015)
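A one-dimensional sketch shows both properties: the output keeps the input length while each output sample sees a wider context. The zero padding and the helper names are choices made for this example.

```python
import numpy as np

def dilated_conv1d_same(signal, kernel, dilation=1):
    """1D dilated convolution (cross-correlation) with zero padding, same length."""
    k = len(kernel)
    span = (k - 1) * dilation                 # distance between first and last tap
    pad = span // 2
    padded = np.pad(signal, (pad, span - pad))
    out = np.zeros(len(signal))
    for tap, weight in enumerate(kernel):
        start = tap * dilation                # taps are spaced `dilation` apart
        out += weight * padded[start:start + len(signal)]
    return out

def effective_kernel_size(k, dilation):
    """Width of the input window covered by a dilated kernel with k taps."""
    return k + (k - 1) * (dilation - 1)

signal = np.zeros(9)
signal[4] = 1.0                               # unit impulse in the middle
out = dilated_conv1d_same(signal, np.ones(3), dilation=2)
```

A 3-tap kernel with dilation 2 responds at positions 2, 4 and 6, i.e. it covers a window of 5 samples with only 3 weights and without any downsampling.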
IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: PP, Issue: 99)
Generalizing Pooling Functions in CNNs:
Mixed, Gated, and Tree
Chen-Yu Lee ; Patrick Gallagher ; Zhuowen Tu
https://doi.org/10.1109/TPAMI.2017.2703082
Each convolutional layer consists of a series of 3 × 3
convolutional layers alternating between stride 1
and stride 2.

Upgrade • SfM-NeT • Enforcing sharp boundaries #1
The downsampling-upsampling combination can smooth out or remove some thin
structures and sharp boundaries, and several papers have been written to
address this.
“Unlike skip connections and previous encoder-decoder methods, we first
learn a coarse feature map after the encoder stage in a feedforward pass, and
then refine this feature map in a top-down strategy during the decoder stage
utilizing features at successively lower layers. Therefore, the deconvolutional
process is conducted stepwise, which is guided by Deeply-Supervision Net
providing the integrated direct supervision.”

Upgrade • SfM-NeT • Enforcing sharp boundaries #2
Segmentation-Aware Convolutional
Networks Using Local Attention Masks
Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos
(Submitted on 15 Aug 2017)
https://arxiv.org/abs/1708.04607 | http://cs.cmu.edu/~aharley/segaware
Segmentation-aware convolution
filters are invariant to backgrounds.
We achieve this in three steps: (i)
compute segmentation cues for each
pixel (i.e., “embeddings”), (ii) create a
foreground mask for each patch, and
(iii) combine the masks with
convolution, so that the filters only
process the local foreground in each
image patch.
Segmentation-aware bilateral filtering. Given an input image (left), a CNN
typically produces a smooth prediction map (middle top). Using learned per-pixel
embeddings (middle bottom), we adaptively smooth the FC8 feature map with our
segmentation-aware bilateral filter (right).
General schematic for our segmentation-aware CNN. The first part is an
embedding network, which is guided to compute embedding-like representations
at multiple scales, and constructs a final embedding as a weighted sum of the
intermediate embeddings. The loss on these layers operates on pairwise distances
computed from the embeddings. These same distances are then used to construct
local attention masks, that intercept the convolutions in a task-specific network.
The final objective backpropagates through both networks, fine-tuning the
embeddings for the task.
Visualizations of optical flow produced by
FlowNet and its segmentation-aware variant on
the FlyingChairs test set: segmentation-
awareness yields much sharper results than
the baseline.

SfM-NeT • Supervision
SfM-Net inverts the image
formation and extracts depth,
camera and object motions that gave
rise to the observed temporal
differences, similar to previous SfM
works [1, 6].
Such inverse problems are ill-posed
as many solutions of depth, camera
and object motion can give rise to the
same observed frame-to-frame pixel
values.
A learning-based solution, as
opposed to direct optimization, has
the advantage of learning to handle
such ambiguities through partial
supervision of their weights or
appropriate pre-training, or simply
because the same coefficients
(network weights) need to explain a
large abundance of video data
consistently.
We detail the various supervision
modes below and explore a subset of
them in the experimental section.
Kyong Hwan Jin ; Michael T. McCann ; Emmanuel Froustey ; Michael Unser | https://doi.org/10.1109/TIP.2017.2713099
IEEE Transactions on Image Processing ( Volume: 26, Issue: 9, Sept. 2017 )
ME5286 – Lecture 2 (Theory):
Image Formation and Cameras
by Saad J Bedros, University of Minnesota,
http://www.me.umn.edu/courses/me5286/
Adversarial Inversion: Inverse Graphics with Adversarial Priors
Hsiao-Yu Fish Tung, Adam Harley, William Seto, Katerina Fragkiadaki (Submitted on 31 May 2017)
Multi-view Supervision for Single-view Reconstruction via
Differentiable Ray Consistency
Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, Jitendra Malik
Toward Geometric Deep SLAM
MagicPoint and MagicWarp (from Magic Leap)
Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich (Submitted on 24 Jul 2017)
Geometric deep learning: going beyond Euclidean data
Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, Pierre Vandergheynst last revised 3 May 2017

SfM-NeT • Supervision • Self-supervision
Given unconstrained video, without
accompanying ground-truth
structure or motion information, our
model is trained to minimize the
photometric error between the
first frame and the second frame
warped towards the first according
to the predicted motion field, based
on well-known brightness
constancy assumptions (assuming
Lambertian surfaces). We use
differentiable image warping
proposed in the spatial transformer
work (Jaderberg et al., 2015) and
compute color constancy loss in a
fully differentiable manner.
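A minimal sketch of this self-supervised loss is given below for grayscale frames. It uses an L1 photometric error and clamps samples at the image border, both of which are assumptions; in a real training setup the same operations would be written in an autodiff framework so that gradients flow back to the predicted flow.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Sample a grayscale image at real-valued (x, y) positions, clamped to the border."""
    h, w = image.shape
    x = np.clip(x, 0.0, w - 1.0)
    y = np.clip(y, 0.0, h - 1.0)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def photometric_loss(frame1, frame2, flow):
    """Warp frame2 back towards frame1 with the flow and compare intensities."""
    h, w = frame1.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    warped = bilinear_sample(frame2, xs + flow[..., 0], ys + flow[..., 1])
    return float(np.mean(np.abs(frame1 - warped)))

# Toy frames: a horizontal intensity ramp that moves one pixel to the right.
h, w = 4, 8
ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
frame1 = xs.astype(float)
frame2 = xs - 1.0

true_flow = np.zeros((h, w, 2))
true_flow[..., 0] = 1.0
loss_true = photometric_loss(frame1, frame2, true_flow)
loss_zero = photometric_loss(frame1, frame2, np.zeros((h, w, 2)))
```

The correct flow leaves an error only in the last column, where the sample falls outside the image, while zero flow is wrong everywhere.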
“In particular, we use a loss function that combines a data
term that measures photometric constancy over time with a
spatial term that models the expected variation of flow across
the image. The photometric loss measures the difference
between the first input image and the (inverse) warped
subsequent image based on the predicted optical flow by the
network. The smoothness loss measures the difference
between spatially neighbouring flow predictions. Together,
these two losses form a proxy for losses based on the
groundtruth flow.”
https://www.slideshare.net/yuhuang/optic-flow-estimation-with-deep-learning
Light diffuseness metric Part 1: Theory
L Xia, MSc, SC Pont, PhD, I Heynderickx, PhD
Lighting Research & Technology Vol 49, Issue 4, 2017
http://doi.org/10.1177/1477153516631391
Thomas Y. Lee; David H. Brainard
Journal of Vision January 2014, Vol.14, 24. doi: 10.1167/14.1.24
“Human perception not necessarily the same as
a camera system”

SfM-NeT • Supervision • Spatial smoothness priors
When our network is self-supervised, we add
robust spatial smoothness penalties on the
optical flow field, the depth, and the inferred
motion maps, by penalizing the L1 norm of the
gradients across adjacent pixels, as usually
done in previous works [Kong and Black (2015)].
For depth prediction, we penalize the norm of
second order gradients in order to encourage
not constant but rather smoothly changing
depth values.
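The difference between the two penalties shows up on a slanted plane. The sketch below uses finite differences via `np.diff` and omits the mixed second derivative; the exact discretization used by the authors is not given in the text, so this is only one plausible form.

```python
import numpy as np

def first_order_smoothness(field):
    """L1 norm (mean) of first-order gradients between adjacent pixels."""
    dx = np.diff(field, axis=1)
    dy = np.diff(field, axis=0)
    return np.abs(dx).mean() + np.abs(dy).mean()

def second_order_smoothness(depth):
    """L1 norm (mean) of second-order gradients along each image axis."""
    dxx = np.diff(depth, n=2, axis=1)
    dyy = np.diff(depth, n=2, axis=0)
    return np.abs(dxx).mean() + np.abs(dyy).mean()

# A slanted plane: depth grows linearly from 1 to 3 across the image.
ramp = np.tile(np.linspace(1.0, 3.0, 9), (5, 1))

penalty_first = first_order_smoothness(ramp)
penalty_second = second_order_smoothness(ramp)
```

The first-order term penalizes the slanted plane even though it is perfectly smooth, whereas the second-order term is zero for it, which is the behaviour wanted for depth.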
http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/
By Michael Zibulevsky and Michael Elad
IEEE SIGNAL PROCESSING MAGAZINE [76] MAY 2010
DOI: 10.1109/MSP.2010.936023
Depth map inpainting under a second-order smoothness prior
Daniel Herrera C.†, Juho Kannala† , Lubor Ladický‡ , and Janne Heikkilä†
†Center for Machine Vision Research University of Oulu, Finland
‡Visual Geometry Group University of Oxford, UK
Levin (2004)'s approach uses a first-order prior, i.e.
it favors constant depth. Whereas our second-
order prior favors constant depth derivative.
This is clearly seen in the results of Figure 1. Levin's
approach correctly separates the surfaces but fills
the missing pixels (across the surface boundary)
with a constant depth, while our method provides a
smooth result that matches the ground truth
shape.
Difference between 1st-order and 2nd-order gradients for a depth image
(zoomed portion below), as computed with Matlab’s imgradient
https://doi.org/10.1007/978-3-642-38886-6_52

SfM-NeT • Supervision • Forward-backward consistency constraints
We incorporate forward-backward
consistency constraints between
inferred scene depth in different
frames. Composing scene flow
forward and backward across
consecutive frames allows us to
impose such forward-backward
consistency cycles across more than
one frame gaps, however, we have
not yet seen empirical gain from
doing so.
In other words, one could
“robustify” the network by using
more temporal samples, which
should improve inlier / outlier
separation.
Science of Electrical Engineering (ICSEE), IEEE International Conference on the
A Depth Restoration Occlusionless Temporal Dataset
Daniel Rotman ; Guy Gilboa Electrical Engineering Department, Technion - Israel Institute of Technology.
https://doi.org/10.1109/3DV.2016.26
“Utilizing multiple frames, we create a number of possibilities for an initial degraded depth map,
which allows us to arrive at a more educated decision when refining depth images. Evaluating this
method with our dataset shows significant benefits, particularly for overcoming real sensor-noise
artifacts.”
The dataset is freely downloadable at: http://visl.technion.ac.il/databases/drot2016/
Science of Electrical Engineering (ICSEE), IEEE International Conference on the
Frame rate reduction of depth cameras by RGB-based depth prediction
Daniel Rotman ; Omer Cohen ; Guy Gilboa Electrical Engineering Department, Technion - Israel Institute of Technology.
https://doi.org/10.1109/ICSEE.2016.7806153
“Depth cameras are becoming widely used for facilitating fast and robust natural user interaction. But
measuring depth can be high in power consumption mainly due to the active infrared illumination
involved in the acquisition process, for both structured-light and time-of-flight technologies. It
becomes a critical issue when the sensors are mounted on hand-held (mobile) devices, where
power usage is of the essence. A method is proposed to reduce the depth acquisition frame rate,
possibly by factors of 2 or 3, thus saving considerable power.
The compensation is done by calculating reliable depth estimations using a coupled color (RGB)
camera working at full frame rate. These predictions, which are shown to perform outstandingly,
create for the end user or application the perception of a depth sensor working at full frame rate.
Quality measures based on skeleton extraction and depth inaccuracy are used to calculate the
deviation from the ground truth.”

SfM-NeT • Supervision • Supervising depth
If depth is available on parts of the
input image, such as with video
sequences captured by a Kinect
sensor, we can use depth supervision
in the form of robust depth regression.
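A minimal sketch of depth supervision restricted to pixels where the sensor returned a value. The Huber penalty is an assumed choice of robust loss (the slide only says "robust depth regression"), and the `valid` mask convention is illustrative.

```python
import numpy as np

def masked_huber_depth_loss(pred, target, valid, delta=1.0):
    """Huber loss averaged over pixels where `valid` is True (sensor depth present)."""
    r = np.abs(pred - target)
    huber = np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
    n = valid.sum()
    return float((huber * valid).sum() / n) if n > 0 else 0.0

pred = np.array([[1.0, 2.0], [3.0, 10.0]])
target = np.array([[1.0, 2.5], [0.0, 0.0]])           # zeros where the sensor gave nothing
valid = np.array([[True, True], [False, False]])
loss = masked_huber_depth_loss(pred, target, valid)
```

Pixels without sensor depth contribute nothing, so the bottom row of `pred` is free to take any value here.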
Now, in theory, we can provide targets
automatically for an SfM pipeline
designed to operate:
1) without depth sensor, such as
traditional smartphone
- Target with Kinect or high-
quality laser scanner
2) Google Tango smartphone
with “low-quality depth sensing”
- Target with high-quality
laser scanner
No need for a massive Mechanical
Turk workforce for boring, time-
consuming labeling
3D with Kinect
Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on
Jan Smisek ; Michal Jancosek ; Tomas Pajdla
Date of Conference: 6-13 Nov. 2011
https://doi.org/10.1109/ICCVW.2011.6130380
“We demonstrate the functionality of
Kinect calibration by integrating it
into an SfM pipeline where 3D
measurements from a moving Kinect
are transformed into a common
coordinate system by computing
relative poses from matches in color
camera.”
SfM performs better when one has both RGB and depth data available

SfM-NeT • Supervision • Supervising camera motion
Supervising camera
motion. If ground-truth
camera pose trajectories
are available, we can
supervise our model by
computing corresponding
ground-truth camera
rotation and translation
from frame to frame, and
constrain our camera
motion predictions
accordingly.
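A sketch of how frame-to-frame targets could be derived from a ground-truth trajectory, assuming the poses are given as camera-to-world rotations and translations (this convention and the function names are assumptions, not taken from the paper). The rotation error is measured as the angle of the residual rotation.

```python
import numpy as np

def relative_pose(R_a, t_a, R_b, t_b):
    """Transform mapping points from camera frame a to camera frame b,
    given camera-to-world poses (R, t) of both frames."""
    R_rel = R_b.T @ R_a
    t_rel = R_b.T @ (t_a - t_b)
    return R_rel, t_rel

def rotation_angle(R_pred, R_gt):
    """Angle (radians) of the rotation taking R_pred to R_gt."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R_a, t_a = np.eye(3), np.zeros(3)
R_b, t_b = rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0])
R_rel, t_rel = relative_pose(R_a, t_a, R_b, t_b)
p_a = np.array([2.0, 0.0, 0.0])       # a point expressed in camera frame a
p_b = R_rel @ p_a + t_rel             # the same point in camera frame b
```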
IEEE Transactions on Image Processing
( Volume: 23, Issue: 12, Dec. 2014 )
Online Camera-Gyroscope Autocalibration
for Cell Phones
Chao Jia ; Brian L. Evans https://doi.org/10.1109/TIP.2014.2360120
Our contributions are: simultaneous online
camera self-calibration and camera-
gyroscope calibration based on an implicit
extended Kalman filter and generalization of
the multiple-view coplanarity constraint on
camera rotation in a rolling shutter camera
model for cell phones.
Now on the Google
Tango platform you
could use:
1) RGB video
2) Depth video
3) Gyroscope data
Sensor-based camera motion detection for unconstrained slam
Original Assignee: Qualcomm Incorporated
Publication date: Jul 12, 2016
US 9390344 B2 Techniques are presented for monocular visual
simultaneous localization and mapping (SLAM) based on detecting a
translational motion in the movement of the camera using at least one
motion sensor, while the camera is performing panoramic SLAM, and
initializing a three dimensional map for tracking of finite features. Motion
sensors may include one or more sensors, including inertial (gyroscope,
accelerometer), magnetic (compass), vision (camera) or any other sensors
built into mobile devices.
Virtual Reality, 2001. Proceedings. IEEE
Fusion of vision and gyro tracking for robust
augmented reality registration
S. You ; U. Neumann https://doi.org/10.1109/VR.2001.913772
The framework includes a two-channel
complementary motion filter that combines
the low-frequency stability of vision sensors
with the high-frequency tracking of gyroscope
sensors, hence achieving stable static and
dynamic six-degree-of-freedom pose tracking.
Our implementation uses an extended Kalman
filter (EKF).
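To illustrate the complementary idea only: a fixed-gain, single-axis filter that integrates the gyroscope for fast motion and pulls slowly towards the vision estimate to cancel drift. This is a much simpler stand-in than the EKF used by the authors; the gain `alpha` and the function name are assumptions.

```python
import numpy as np

def complementary_filter(gyro_rate, vision_angle, dt, alpha=0.98):
    """Fuse gyro rates (rad/s) and vision angles (rad) for one rotation axis."""
    est = np.empty(len(gyro_rate))
    angle = vision_angle[0]
    for k in range(len(gyro_rate)):
        # High-frequency path: gyro integration. Low-frequency path: vision.
        angle = alpha * (angle + gyro_rate[k] * dt) + (1.0 - alpha) * vision_angle[k]
        est[k] = angle
    return est

# A gyro with a constant 0.1 rad/s bias while the camera is actually still:
biased = complementary_filter(np.full(2000, 0.1), np.zeros(2000), dt=0.01)
```

Pure integration of the same biased gyro would drift to 2 rad over these 2000 samples, whereas the fused estimate stays bounded.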
Poling and Lerman (2016): We present a deeply integrated method of
exploiting low-cost gyroscopes to improve general purpose feature
tracking. Most previous methods use gyroscopes to initialize and bound
the search for features. In contrast, we use them to regularize the tracking
energy function so that they can directly assist in the tracking of
ambiguous and poor-quality features.

SfM-NeT • Supervision • Supervising optical flow and object motion
Supervising optical flow and
object motion. Ground-truth
optical flow, object masks, or
object motions require
expensive human annotation
on real videos. However,
these signals are available in
recent synthetic datasets
[20]. In such cases, our
model could be trained to
minimize, for example, an L1
regression loss between
predicted and ground-truth
flow vectors.
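The L1 regression loss mentioned above could look like the following sketch, assuming flow fields stored as (H, W, 2) arrays; the function name is illustrative.

```python
import numpy as np

def l1_flow_loss(pred_flow, gt_flow):
    """Mean over pixels of |du| + |dv| between predicted and ground-truth flow."""
    return float(np.abs(pred_flow - gt_flow).sum(axis=-1).mean())

pred_flow = np.zeros((2, 2, 2))
gt_flow = np.ones((2, 2, 2))
loss = l1_flow_loss(pred_flow, gt_flow)
```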
In this paper, we propose to use DenseNet for optical flow prediction. Our contributions are
two-fold. First, we extend current DenseNet to a fully convolutional network. Our model is
totally unsupervised, and achieves performance close to supervised approaches. Second,
we empirically show that replacing convolutions with dense blocks in the expanding part
yields better performance
(a) Semantic segmentation breaks the image into regions such as road, bike, person, sky, etc. (c) Existing optical flow
algorithms do not have access to either the segmentations or the semantics of the classes. (d) Our semantic optical
flow algorithm computes motion differently in different regions, depending on the semantic class label, resulting in
more precise flow, particularly at object boundaries. (b) The flow also helps refine the segmentation of the
foreground objects.

Upgrade • Supervision • Loss Function #1
http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.1171
Nearly all existing methods for stereo reconstruction assume that scene reflectance is
Lambertian{*} and make use of brightness constancy (BC) as a matching invariant. We
introduce a new invariant for stereo reconstruction called light transport constancy
(LTC), which allows completely arbitrary scene reflectance (bidirectional reflectance
distribution functions (BRDFs)). This invariant can be used to formulate a rank constraint
on multiview stereo matching when the scene is observed by several lighting
configurations in which only the lighting intensity varies.
{*} Lambertian reflectance means in practice
that the surface looks equally bright
regardless of the direction you view it from. This is
not true for specular and mirror reflection.
“CS 354 Lighting” by Mark Kilgard
Graphics Software Engineer at NVIDIA
Direct Visual Odometry using Bit-Planes
Hatem Alismail, Brett Browning, and Simon Lucey, The Robotics Institute, Carnegie Mellon University
https://arxiv.org/abs/1604.00990 (2016)
At the core of direct Visual SLAM is the reliance on a consistent photometric
appearance across images, otherwise known as the brightness constancy assumption.
Unfortunately, brightness constancy seldom holds in real world applications
In this work, we overcome brightness constancy by incorporating feature descriptors
into a direct visual odometry framework. This combination results in an efficient
algorithm that combines the strength of both feature-based algorithms and direct
methods. Namely, we achieve robustness to arbitrary photometric variations while
operating in low-textured and poorly lit environments.
An illustration of our Bit-Planes
descriptor where each channel is
composed of bits. Since the residual
vector is binary, least squares
minimization becomes equivalent to
minimizing the Hamming distance.
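The equivalence is easy to check numerically: for vectors with entries in {0, 1}, every squared difference is either 0 or 1, so the sum of squared differences counts the differing bits. A small sketch with random bit vectors (the descriptor itself is not reproduced here):

```python
import numpy as np

def sum_squared_differences(a, b):
    return int(np.sum((a - b) ** 2))

def hamming_distance(a, b):
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)       # 64 random bits
b = rng.integers(0, 2, size=64)
```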
[Principles of Remote Sensing; Soudarissanane (2016)]

Upgrade • Supervision • Loss Function #2
Geometric Loss Functions for
Camera Pose Regression with
Deep Learning
Alex Kendall, Roberto Cipolla
(Submitted on 2 Apr 2017 (v1), last revised 23 May 2017 (this version, v2))
We show that our geometric approach can improve PoseNet’s efficacy across many different datasets – narrowing
the deficit to traditional SIFT feature-based algorithms. For outdoor scenes ranging from 50,000 m² to 2 km² we
can achieve relocalisation accuracies of a few meters and a few degrees. In small rooms we are able to achieve
accuracies of 0.2 − 0.4m.
Comparison of different loss functions. We use an L1 distance for the
residuals in each loss. Linear sum combines position and orientation losses
with a constant scaling parameter β (Kendall and Cipolla 2015) and is defined
in (2). Learn weighting is the loss function in (3) which learns to combine
position and orientation using homoscedastic uncertainty. Reprojection error
implicitly combines rotation and translation by using the reprojection error of
the scene geometry as the loss (7). We find that homoscedastic uncertainty is
able to learn an effective weighting between position and orientation
quantities. The reprojection loss was not able to converge from random
initialisation. However, when used to fine-tune a network pretrained with (3) it
yields the best results.
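A sketch of the first two losses in the caption, assuming the commonly quoted forms: the linear sum L_x + β·L_q, and the learned weighting L_x·exp(−s_x) + s_x + L_q·exp(−s_q) + s_q, where s_x and s_q are trainable scalars standing for log-variances. Function names are illustrative; the per-sample position and orientation residuals are passed in as plain numbers.

```python
import numpy as np

def linear_sum_loss(loss_x, loss_q, beta):
    """Position loss plus orientation loss scaled by a hand-tuned constant beta."""
    return loss_x + beta * loss_q

def learned_weighting_loss(loss_x, loss_q, s_x, s_q):
    """Homoscedastic-uncertainty weighting; s_x and s_q are learned with the network."""
    return loss_x * np.exp(-s_x) + s_x + loss_q * np.exp(-s_q) + s_q

# With both log-variances at zero the learned form reduces to a plain sum.
plain = learned_weighting_loss(2.0, 0.5, 0.0, 0.0)
# For a fixed residual L, the term L*exp(-s) + s is minimized at s = log(L).
tuned = learned_weighting_loss(2.0, 0.5, np.log(2.0), np.log(0.5))
```

The `+ s` terms keep the network from driving the weights to zero by inflating the uncertainties.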

Upgrade • Supervision • Semi-supervised targets
Recurrent Ladder Networks
Alexander Ilin, Isabeau Prémont-Schwarz, Tele Hotloo Hao,
Antti Rasmus, Rinu Boney, Harri Valpola
(Submitted on 28 Jul 2017)
“We propose a recurrent
extension of the Ladder network,
which is motivated by the
inference required in hierarchical
latent variable models. We
demonstrate that the recurrent
Ladder is able to handle a wide
variety of complex learning tasks
that benefit from iterative
inference and temporal
modeling. The architecture
shows close-to-optimal results
on temporal modeling of video
data, competitive results on
music modeling, and improved
perceptual grouping based on
higher order abstractions, such
as stochastic textures and
motion cues. We present results
for fully supervised, semi-
supervised, and unsupervised
tasks. The results suggest that
the proposed architecture and
principles are powerful tools for
learning a hierarchy of
abstractions, handling temporal
information, modeling relations
and interactions between
objects.”
(a): Simple static hierarchical latent variable model. (b): Directions of message propagation. (c):
Computational graph implementing message propagation in (b). (d): The structure of the
Ladder network can be seen as a computational graph implementing message propagation in
(c). The red circles mark the operations corresponding to the nodes of the graph in (b). (e): The
structure of the recurrent Ladder (RLadder) network.
https://github.com/CuriousAI/mean-teacher
A sketch of a binary classification task with two labeled examples (large blue dots) and one unlabeled
example, demonstrating how the choice of unlabeled target (black circle) affects the fitted function
(gray curve).
(a) A model with no regularization is free to fit any function. (b) A model trained with noisy labeled data
(small dots) learns to give consistent predictions around labeled data points. (c) Consistency to noise
around unlabeled examples provides additional smoothing. For the clarity of illustration, the teacher
model (blue curve) is first fitted to the labeled examples, and then left unchanged during the training of
the student model. Also for clarity, we will omit the small dots in figures d and e. (d) Noise on the teacher
model reduces the bias of the targets without additional training. The expected direction of stochastic
gradient descent is towards the mean (large blue circle) of individual noisy targets (small blue circles).
(e) An ensemble of models gives an even better expected target. Both Temporal Ensembling
and the Mean Teacher method use this approach
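In Mean Teacher, the "ensemble" is obtained by keeping teacher weights as an exponential moving average of the student weights, and the student is trained to match the teacher's predictions on unlabeled inputs. A minimal sketch with weights held in a plain dict; the decay value and the squared-error consistency cost are assumptions for illustration.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step towards the current student weight."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def consistency_loss(student_pred, teacher_pred):
    """Mean squared difference between student and teacher predictions."""
    return float(np.mean((student_pred - teacher_pred) ** 2))

teacher = {"w": np.array([0.0, 0.0])}
student = {"w": np.array([1.0, -1.0])}
teacher = ema_update(teacher, student, decay=0.9)
```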

Upgrade • Supervision • “proxy” supervised targets
https://arxiv.org/abs/1702.02295 (Submitted on 8 Feb 2017 (v1), last revised 1 Jul 2017 (this version, v2))
We study the unsupervised learning of CNNs for optical flow estimation using proxy
ground truth data. Supervised CNNs, due to their immense learning capacity, have
shown superior performance on a range of computer vision problems including optical flow
prediction. They however require the ground truth flow which is usually not accessible
except on limited synthetic data. Without the guidance of ground truth optical flow,
unsupervised CNNs often perform worse as they are naturally ill-conditioned.
We therefore propose a novel framework in which proxy ground truth data generated
from classical approaches is used to guide the CNN learning. The models are further
refined in an unsupervised fashion using an image reconstruction loss. Our guided
learning approach is competitive with or superior to state-of-the-art approaches on three
standard benchmark datasets yet is completely unsupervised and can run in real time.
“More broadly, we introduce a paradigm which can be integrated into
future state-of-the-art motion estimation networks [Ranjan and Black (2016)]
to improve performance. In future work, we plan to experiment with
large-scale video corpora to learn non-rigid real world motion patterns
rather than just learning limited motions found in synthetic datasets.”

Upgrade • Supervision • Self-supervision
We improve CNN-based optical flow estimation in real videos by adding the
extra self-supervised task of future frame prediction, and training the
network with a mixture of synthetic and real-world videos. This
combination is made possible by putting a “multiplexer” at the entry of the
network which mixes data from the two sources on a timely basis.

Upgrade • Supervision • Dense object segmentation
Semantic Video Segmentation by
Gated Recurrent Flow Propagation
David Nilsson, Cristian Sminchisescu
(Submitted on 28 Dec 2016)
Deep Semantic Segmentation for Automated
Driving: Taxonomy, Roadmap and Challenges
Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, Senthil Yogamani
Submitted on 8 Jul 2017 (v1), last revised 3 Aug 2017
Overview of our Spatial
Transformer Gated
Recurrent Unit (STGRU),
combining a Spatial
Transformer Network for
optical flow warping with
a Gated Recurrent Unit to
adaptively propagate and
fuse semantic
segmentation information
over time.
Modular End to End learning: We
use the term modular end to end
learning when there are auxiliary
losses to ensure safety and interpretability.
For instance, segmentation
loss can be added as an auxiliary loss
for an end to end driving CNN
[Xu et al. (2016)]. Using this auxiliary
loss, the CNN loosely learns to
semantically segment, but it also
learns to have a better representation
for the intermediate features. It was
shown in that work that using auxiliary
loss outperforms the vanilla end to
end learning. The work also uses
recurrent gated unit after the CNN to
model temporal information.

Upgrade • Supervision • generative motion and content
https://github.com/sergeytulyakov/mocogan
https://sites.google.com/a/umich.edu/rubenevillegas/iclr2017
We propose a deep neural network for the prediction of future frames in natural video
sequences. To effectively handle complex evolution of pixels in videos, we propose to
decompose the motion and content, two key components generating dynamics in videos. Our
model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional
LSTM for pixel-level prediction, which independently capture the spatial layout of an image
and the corresponding temporal dynamics.
By independently modeling motion and content, predicting the next frame reduces to
converting the extracted content features into the next frame content by the identified motion
features, which simplifies the task of prediction. Our model is end-to-end trainable over
multiple time steps, and naturally learns to decompose motion and content without separate
training.

Upgrade • Supervision • data Augmentation
Depth degradation techniques [taken from Yang et al. (2012)].
(a) under-sampling, (b) undersampling with signal-dependent
noise, (c) random missing, and (d) structural missing.
[Rotman and Gilboa (2016)]
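The four degradation types could be applied to clean depth maps as a data augmentation step. The sketch below is one possible reading, not the procedure from the cited papers: invalid pixels are marked with 0, the noise standard deviation is assumed proportional to depth, and all parameter names are hypothetical.

```python
import numpy as np

def degrade_depth(depth, rng, step=2, noise_scale=0.01, drop_prob=0.1, hole=None):
    """Return a degraded copy of `depth`; 0 marks a missing measurement."""
    out = depth.astype(float).copy()
    # (a) under-sampling: keep every `step`-th pixel in both directions.
    keep = np.zeros(out.shape, dtype=bool)
    keep[::step, ::step] = True
    out[~keep] = 0.0
    # (b) signal-dependent noise: standard deviation grows with depth.
    out[keep] += rng.normal(0.0, 1.0, size=int(keep.sum())) * noise_scale * out[keep]
    # (c) random missing: drop individual pixels.
    out[rng.random(out.shape) < drop_prob] = 0.0
    # (d) structural missing: blank a rectangle (row0, row1, col0, col1).
    if hole is not None:
        r0, r1, c0, c1 = hole
        out[r0:r1, c0:c1] = 0.0
    return out

clean = np.full((8, 8), 2.0)
degraded = degrade_depth(clean, np.random.default_rng(0), hole=(0, 4, 0, 4))
```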
From left to right, Kinect 1, 2 and R200 RealSense. The Kinect 1
sensor features invalid (black) depth values, and crooked edges.
The Kinect 2 has false intermediate depth values (on the right side
of the object). The RS shows depth artifacts with erroneous values
(outlier white pixel on left bottom). [Rotman and Gilboa (2016)]
Alismail et al. (2016)
An Image Degradation Model for Depth-
augmented Image Editing
(2015) James W. Hennessey, Niloy J. Mitra,
http://dx.doi.org/10.1111/cgf.12707

Upgrade • Supervision • (multimodal) decomposition
Intrinsic Depth.
(a) Input video.
(b),(c) Albedo and
shading estimated by the intrinsic
video method.
(d) Surface
contours modified to combine RGB,
albedo and shading information.
(e) Proxy depth by propagating
sparse SfM depth using video
segments from [9].
(f) Depth estimated by our
method, which combines the previous
two methods.
(g) Depth from the original Depth
Transfer method.
(h) Depth from the fully-metric
method.
(i) Depth from the example-based
single image method.
(j) Ground truth depth. Note that
integrating information from different
intrinsic images improves the
estimation of the depth structure.
In (e) and (j), black pixels
indicate that no valid depth values are
provided.
Kong and Black (2015)
Decomposing Single Images for Layered Photo Retouching
Carlo Innamorati, Tobias Ritschel, Tim Weyrich, Niloy J. Mitra
University College London
http://dx.doi.org/10.1111/cgf.13220
http://geometry.cs.ucl.ac.uk/projects/2017/layered-retouching/
Outline of proposed technique. (a) The Kinect depth of an object is combined with (b) three photos at different rotations of a polarizing
filter. (c) Integration of surface normals obtained from Fresnel equations. Note the azimuthal ambiguity (observed as a flip in the shape)
and distortion of the zenith angle (observed as flatness in the shape). (d) Integration of surface normals after correcting for azimuthal
ambiguity removes the flip, and the final result is shown in (e) after correcting for zenith distortion and using physics-based integration.
Kadambi et al. (2015) - http://web.media.mit.edu/~achoo/polar3D/ - http://news.mit.edu/2015/algorithms-boost-3-d-imaging-resolution-1000-times-1201
Polarization-sensing to improve depth

Upgrade • Supervision • Multimodal Sensing • Rolling shutter motion
IEEE Transactions on Visualization and Computer Graphics ( Volume: 22, Issue: 11, Nov. 2016 )
Towards Kilo-Hertz 6-DoF Visual Tracking Using an Egocentric
Cluster of Rolling Shutter Cameras
Akash Bapat ; Enrique Dunn ; Jan-Michael Frahm
https://doi.org/10.1109/TVCG.2016.2593757
“The key idea is that a rolling shutter camera works by capturing the rows of
an image in rapid succession, essentially acting as a high-frequency 1D
image sensor. By integrating multiple rolling shutter cameras on the AR
device, our tracker is able to perform 6-DOF markerless tracking in a static
indoor environment with minimal latency.”
“Rolling Shutter (RS) cameras have become popularized because of low-cost imaging
capability. However, the RS cameras suffer from undesirable artifacts when the camera or
the subject is moving, or illumination condition changes. For that reason, Monocular Visual
Odometry (MVO) with RS cameras produces inaccurate ego-motion estimates. Previous
works solve this RS distortion problem with motion prediction from images and/or inertial
sensors. However, the MVO still has trouble in handling the RS distortion when the camera
motion changes abruptly (e.g. vibration of mobile cameras causes extremely fast motion
instantaneously).”
Coded Rolling Shutter Photography:
Flexible Space-Time Sampling
http://www.cs.columbia.edu/CAVE/projects/crsp/
The interlaced readout
can be used to compute
optical flow between the
two sub-images after
vertical interpolation. The
optical flow can be used
for motion interpolation,
skew compensation, and
motion deblur. Please refer
to the paper for details.

Upgrade • Supervision • Transfer learning #1
Application of transfer learning in
RGB-D object recognition
Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on
Abhishek Kumar ; S. Nithin Shrivatsav ; G. R. K. S. Subrahmanyam ; Deepak Mishra
https://doi.org/10.1109/ICACCI.2016.7732108
“Firstly we trained a CNN network with 10 classes of different objects and then we
transfer the parameters to RGB and depth CNN network. This enables the network to
train faster and also achieve higher accuracy for a given number of epochs.”
Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better
than Transferring from RGB-CNNs
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
Xinhang Song, Luis Herranz, Shuqiang Jiang
https://github.com/songxinhang/D-CNN
https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14695/14310
HHA encoding for depth data (Gupta et al. 2014), is a three
channel representation (horizontal disparity, height above
ground, and angle with the direction of gravity) of depth data.
Transferring deep representations
within the same modality (e.g. Places-
CNN fine tuned on SUN397) works
well, since low-level patterns have
similar distributions, and bottom layers
can be reused while adjusting the more
dataset-specific top layers. However,
fine tuning is not that effective in inter-
modal transfer, such as Places-CNN
to depth in the HHA space, where low-
level features require modality-
specific filters. In this paper, we focus
on the bottom layers, because they are
more critical to represent depth data
properly. By reducing the number of
parameters of the network, and using
weakly supervised learning over
patches, the complexity of the model
matches better the amount of data
available. This depth representation is
not only more discriminative than those
fine tuned from Places-CNN but also
when combined with RGB features the
gain is higher, showing that both are
complementary. Notice also, that we do
not depend (for depth) on large
datasets such as Places.
https://doi.org/10.1016/j.patcog.2017.07.026
“The RGB-specific detection
network is initialized with
ImageNet [Deng et al. (2009)]
RGB classification model. To
better leverage the depth
information, the modality-
correlated and depth-specific
network are initialized from a
supervision transfer model
[Gupta et al. (2016)]”

Learning Transferrable Knowledge for Semantic
Segmentation With Deep Convolutional Neural Network
Seunghoon Hong, Junhyuk Oh, Honglak Lee, Bohyung Han;
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3204-3212
https://doi.org/10.1109/CVPR.2016.349
Overall architecture of the proposed algorithm. Given a
feature extracted from the encoder, the attention model
estimates adaptive spatial saliency of each category associated
with input image. The outputs of attention model are
subsequently fed into the decoder, which generates foreground
segmentation mask of each focused region. During training, we
fix the encoder by pre-trained weights, and leverage the
segmentation annotations from source domain to train both the
decoder and the attention model, and image-level class labels in
both domains to train the attention model under classification
objective. After training, semantic segmentation on the target
domain is performed naturally by exploiting the decoder trained
with source images and the attention model adapted to target
domain
The contributions of this paper are summarized below.
● We propose a new paradigm for weakly-supervised semantic segmentation, which exploits segmentation annotations
from different categories to guide segmentations with weak annotations. To our knowledge, this is the first attempt to tackle the
weakly-supervised semantic segmentation problem by transfer learning.
● We propose a novel encoder-decoder architecture with attention model, which is appropriate to transfer the segmentation
knowledge across categories.
● The proposed algorithm achieves substantial performance improvement over existing weakly-supervised approaches by
exploiting segmentation annotations in exclusive categories.

Borrowing Treasures from the Wealthy:
Deep Transfer Learning through Selective
Joint Fine-tuning
Weifeng Ge, Yizhou Yu (Submitted on 28 Feb 2017 (v1),
last revised 6 Jun 2017 (this version, v2))
https://github.com/ZYYSzj/Selective-Joint-Fine-tuning
In this paper, we introduce a source-target selective joint
fine-tuning scheme for improving the performance of
deep learning tasks with insufficient training data. In this
scheme, a target learning task with insufficient training
data is carried out simultaneously with another source
learning task with abundant training data. However, the
source learning task does not use all existing training data.
Our core idea is to identify and use a subset of training
images from the original source learning task whose low-
level characteristics are similar to those from the target
learning task, and jointly fine-tune shared convolutional
layers for both tasks.
Pipeline of the proposed selective joint fine-tuning. From left to right: (a) Datasets in the source domain and the target domain. (b) Select nearest
neighbors of each target domain training sample in the source domain via a low-level feature space. (c) Deep convolutional neural network initialized
with weights pre-trained on ImageNet or Places. (d) Jointly optimize the source and target cost functions in their own label spaces.
Similar Image Search
There is a unique step in our pipeline. For each image
from the target domain, we search a certain number
of images with similar low-level characteristics
from the source domain. Only images returned from
these searches are used as training images for the
source learning task in selective joint fine-tuning. We
elaborate this image search step below.
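The search step can be sketched as a k-nearest-neighbour lookup in descriptor space. The descriptors are taken as given here (computing them from filter bank responses is omitted), Euclidean distance is an assumption, and the function name is illustrative.

```python
import numpy as np

def select_source_subset(target_desc, source_desc, k):
    """Indices of source images that are among the k nearest neighbours
    of at least one target image, given (n, d) descriptor arrays."""
    diff = target_desc[:, None, :] - source_desc[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)                  # (n_target, n_source)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.unique(nearest)

target_desc = np.array([[0.0, 0.0], [10.0, 10.0]])
source_desc = np.array([[0.0, 1.0], [5.0, 5.0], [9.0, 10.0], [100.0, 100.0]])
subset = select_source_subset(target_desc, source_desc, k=1)
```

Only the source images in `subset` would then take part in the joint fine-tuning; the distant image at index 3 is never selected.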
In summary, this paper has the following contributions:
● We introduce a new deep transfer learning scheme, called selective joint fine-tuning, for improving the
performance of deep learning tasks with insufficient training data. It is an important step forward in the
context of the widely adopted strategy of fine-tuning a pre-trained deep neural network.
● We develop a novel pipeline for implementing this deep transfer learning scheme. Specifically, we
compute descriptors from linear or nonlinear filter bank responses on training images from both
tasks, and use such descriptors to search for a desired subset of training samples for the source
learning task.
● Experiments demonstrate that our deep transfer learning scheme achieves state-of-the-art
performance on multiple visual classification tasks with insufficient training data for deep learning.

SfM-NeT • implementation Details
coarse map
2 x fully connected layers
Our depth-predicting structure and object-mask-
predicting motion conv/deconv networks share similar
architectures but use independent weights. Each consists of
a series of 3×3 convolutional layers alternating between
stride 1 and stride 2 followed by deconvolutional operations
consisting of a depth-to-space upsampling, concatenation
with corresponding feature maps from the convolutional
portion, and a 3×3 convolutional layer. Batch normalization
is applied to all convolutional layer outputs.
The structure network takes a single frame as input, while
the motion network takes a pair of frames. We predict depth
values using a 1×1 convolutional layer on top of the image-
sized feature map. We use ReLU activations because
depths are positive and a bias of 1 to prevent small depth
values. The maximum predicted depth value is further
clipped at 100 to prevent large gradients.
We predict object masks from the image-sized feature
map of the motion network using a 1×1 convolutional layer
with sigmoid activations. To encourage sharp masks we
multiply the logits of the masks by a parameter that is a
function of the number of steps for which the network has
been trained. The pivot variables are predicted as heat
maps using a softmax function over all the locations in the
image followed by a weighted average of the pixel locations.
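Two of the output heads described above, sketched in NumPy under one possible reading of the text: depth is ReLU output plus a bias of 1, clipped at 100, and a pivot is the softmax-weighted average of pixel coordinates (a "soft-argmax"). Function names are illustrative and the convolutional layers that produce the inputs are omitted.

```python
import numpy as np

def depth_activation(x, bias=1.0, max_depth=100.0):
    """ReLU keeps depth positive, the bias avoids tiny depths, the clip caps gradients."""
    return np.minimum(np.maximum(x, 0.0) + bias, max_depth)

def soft_argmax(logits):
    """Softmax over all pixel locations, then the expected (x, y) coordinate."""
    h, w = logits.shape
    p = np.exp(logits - logits.max())   # subtract the max for numerical stability
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())

heat = np.zeros((5, 5))
heat[2, 3] = 50.0                       # a sharp peak at row 2, column 3
pivot_x, pivot_y = soft_argmax(heat)
```

Unlike a hard argmax, the weighted average is differentiable with respect to the heat map, so the pivot location can be trained end to end.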
keras.layers.convolutional.Conv2D(filters, kernel_size, strides)
filters = 32, kernel_size = (3,3), strides = (1,1)
filters = 64, kernel_size = (3,3), strides = (2,2)
keras.layers.convolutional.UpSampling2D(size=(2, 2))

SfM-NeT • Experimental Results #1
Qualitative comparison of the estimated
depth using our unsupervised model
on sequences versus using stereo pairs
in the KITTI 2012 benchmark. When
using stereo pairs the camera pose
between the pair is constant and hence
the model is equivalent to the approach
of Garg et al. (2016).
For sequences, our model needs to
additionally predict camera rotation and
translation between the two frames. The
first six rows show successful
predictions even without camera pose
information and the last two illustrate
failure cases.
The failure cases show that when there
is no translation between the two frames
depth estimation fails whereas when
using stereo pairs there is always a
constant offset between the frames.

Ground truth
segmentation and
flow compared to
predicted motion
masks and flow
from SfM-Net in
KITTI 2015.
The model was
trained in a fully
unsupervised
manner. The top
six rows show
successful
prediction and the
last two show
typical failure
cases.

Motion segments computed from SfM-Net in MoSeg [Brox and Malik (2010)].
The model was trained in a fully unsupervised manner.
“We report camera rotation and translation error in Table 2 for
each of the Freiburg sequences compared to the error in the
benchmark’s baseline trajectories. Our model was trained from
scratch for each sequence and used the focal length value
provided with the dataset. We observe that our results better
estimate the frame-to-frame translation and are comparable for
rotation.”

SfM-NeT • Conclusion
Current geometric SLAM methods obtain excellent egomotion and rigid 3D reconstruction
results, but often come at a price of extensive engineering, low tolerance to moving
objects — which are treated as noise during reconstruction — and sensitivity to camera
calibration.
Furthermore, matching and reconstruction are difficult in low textured regions.
Incorporating learning into depth reconstruction, camera motion prediction and object
segmentation, while still preserving the constraints of image formation, is a promising way to
robustify SLAM and visual odometry even further. However, the exact training scenario
required to solve this more difficult inference problem remains an open question.
Exploiting long history and far in time forward-backward constraints with visibility
reasoning is an important future direction. Further, exploiting a small amount of
annotated videos for object segmentation, depth, and camera motion, and combining
those with an abundance of self-supervised videos, could help initialize the network
weights in the right regime and facilitate learning. Many other curriculum learning
regimes, including those that incorporate synthetic datasets, can also be considered

Pipeline • Future paths • “GRID” Architecture • Review
Our technical results are corroborated by an extensive
set of evaluations, presented in this paper as well as
independent empirical observations reported by other
groups. We also perform experiments showing the
practical implications of our framework for choosing the
best fully-connected design for a given problem.
Due to the fast pace of deep learning as a field, it is very hard to find a good methodological review. Be cautious when
reading these: e.g. a review from Feb 2017 is already quite old when read in August 2017.
Systematic evaluation of CNN advances on the ImageNet
Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas 16 May 2017
https://doi.org/10.1016/j.cviu.2017.05.007
https://www.researchgate.net/publication/316970253_Systematic_Evaluation_of_Convolution_Neural_Network_Advances_on_the_ImageNet
The commonly used input to CNN is raw RGB pixels and the commonly adopted
recommendation is not to use any pre-processing. There has not been much
research on the optimal colorspace or pre-processing techniques for CNN.
Rachmadi and Purnama (2015) explored different colorspaces for vehicle color
identification, Dong et al. (2014) compared YCrCb and RGB channels for image
superresolution, Graham (2015) extracted local average color from retina images in
winning solution to the Kaggle Diabetic Retinopathy Detection competition.
Petteri: The authors could have tested CIELab as well, which might have
been an interesting colorspace, especially in photo enhancement applications [e.g.
Yan et al. (2016)].

Pipeline • Future paths • ConvNet Architecture • DenseNet #1
To ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map
sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all
preceding layers and passes on its own feature-maps to all subsequent layers.
Implicit Deep Supervision. One explanation for the improved accuracy of dense convolutional networks may be that
individual layers receive additional supervision from the loss function through the shorter connections. One can
interpret DenseNets to perform a kind of “deep supervision”. The benefits of deep supervision have previously been
shown in deeply-supervised nets (DSN; Lee et al. 2014), which have classifiers attached to every hidden layer, enforcing
the intermediate layers to learn discriminative features.

Image classification Semantic Segmentation Optical Flow
https://arxiv.org/abs/1611.09326 https://arxiv.org/abs/1707.06316

Classical expanding uses series of convolutions, deconvolutions, and skip connections to recover
the spatial resolution in order to get the per-pixel prediction results. Due to the good properties of
DenseNet, we propose to replace the convolutions with dense blocks during expanding as
well. However, if we follow the same dense connectivity pattern, the number of feature maps after
each dense block will keep increasing. Considering that the resolution of the feature maps also
increases during expanding, the computational cost will be intractable for current GPUs. Thus, for a
dense block in the expanding part, we do not concatenate the input to its final output. For example,
if the input has k_0 channels, the output of an L-layer dense block will have Lk feature maps. k is the
growth rate of a DenseNet, defining the number of feature maps each layer produces. Note that
dense blocks in the contracting part will output k_0 + Lk feature maps. For symmetry, we also
introduce four dense blocks in the expanding part, each of which has four layers. The bottom layer
feature maps at the same resolution are concatenated through skip connections. Between the
dense blocks, there are transition up layers composed of two 3×3 deconvolutions with a stride of
2. One is for upsampling the estimated optical flow, and the other is for upsampling the feature
maps.
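The channel bookkeeping described above can be checked with a small NumPy sketch. It is only an illustration, not the authors' network: `dense_block` is a hypothetical helper, each "layer" is a random 1×1 convolution followed by ReLU standing in for the real convolutional layers, and the values of k_0, L and k are arbitrary.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, concat_input, rng):
    """x has shape (channels, height, width).

    Every layer sees the concatenation of all earlier feature maps and
    contributes `growth_rate` new ones.
    """
    features = [x]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)
        # Stand-in for a real conv layer: random 1x1 convolution + ReLU.
        w = rng.standard_normal((growth_rate, stacked.shape[0]))
        new = np.maximum(np.einsum("oc,chw->ohw", w, stacked), 0.0)
        features.append(new)
    # Contracting part: keep the input (k_0 + L*k maps).
    # Expanding part: drop the input (L*k maps).
    outputs = features if concat_input else features[1:]
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
k0, L, k = 16, 4, 12  # illustrative values only
x = rng.standard_normal((k0, 8, 8))
contracting = dense_block(x, L, k, concat_input=True, rng=rng)
expanding = dense_block(x, L, k, concat_input=False, rng=rng)
print(contracting.shape[0], expanding.shape[0])  # k0 + L*k and L*k
```

Dropping the input from the output is what keeps the number of feature maps from growing from block to block while the spatial resolution increases.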
“Our model is totally unsupervised, thus we can experiment
with large-scale video corpora in future work, to learn non-
rigid real world motion patterns. Through comparison of
popular CNN architectures, we found that it is important to
design novel operators or networks for optical flow
estimation instead of relying on existing architectures
for image classification.”
In this work, we choose FlowFields (Bailer et al. 2015) as our classical optical flow estimator. To
our knowledge, it is one of the most accurate flow estimators among the published work. We
hope that by using FlowFields to generate proxy ground truth, we can learn to estimate
motion between image pairs as effectively as using the true ground truth.
If a classical approach fails to detect certain motion patterns, a network trained on the proxy
ground truth is also likely to miss these patterns. This leads us to ask if there is other
unsupervised guidance that can improve the network training?
The unsupervised approach (Yu et al. 2016) treats optical flow estimation as an image
reconstruction problem based on the intuition that if the estimated flow and the next frame can
be used to reconstruct the current frame then the network has learned useful representations
of the underlying motions.
Note that we could add additional unsupervised guides like a gradient constancy assumption
or an edge-aware weighted smoothness loss (Godard et al 2016) to further fine tune our models.
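The image-reconstruction idea can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the loss used in the cited papers: `warp_backward` and `photometric_loss` are hypothetical names, frames are single-channel, sampling is bilinear with border clamping, and the penalty is a plain L1 difference.

```python
import numpy as np

def warp_backward(next_frame, flow):
    """Reconstruct the current frame by sampling next_frame at (x + u, y + v).

    next_frame: (H, W) image; flow: (H, W, 2) with (u, v) in pixels.
    """
    h, w = next_frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Bilinear interpolation between the four neighbouring pixels.
    top = (1 - wx) * next_frame[y0, x0] + wx * next_frame[y0, x1]
    bottom = (1 - wx) * next_frame[y1, x0] + wx * next_frame[y1, x1]
    return (1 - wy) * top + wy * bottom

def photometric_loss(current, next_frame, flow):
    """Mean absolute difference between the current frame and its reconstruction."""
    return np.mean(np.abs(current - warp_backward(next_frame, flow)))

rng = np.random.default_rng(0)
current = rng.random((6, 8))
next_frame = np.roll(current, 1, axis=1)   # scene shifts one pixel to the right
true_flow = np.zeros((6, 8, 2))
true_flow[..., 0] = 1.0                    # u = +1, v = 0
loss_true = photometric_loss(current, next_frame, true_flow)
loss_zero = photometric_loss(current, next_frame, np.zeros((6, 8, 2)))
print(loss_true < loss_zero)
```

With the correct flow, the reconstruction matches the current frame everywhere except the clamped border column, so the loss is lower than for a zero flow field. In a network this signal would be backpropagated through a differentiable warp.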

degradation. However, during inference the
lower layers do not know about high layer
features, although they contain contextual high
semantics that benefit low layers to adaptively
extract informative features for later layers.
In this paper, we study the influence of
backward skip connections which are in the
opposite direction to forward shortcuts, i.e.
paths from high layers to low layers. To achieve
this -- which indeed runs counter to the nature
of feed-forward networks -- we propose a new
fully convolutional model that consists of a pair
of networks. A `Slave' network is dedicated to
provide the backward connections from its top
layers to the `Master' network's bottom layers.
‘Forward skip’ is an illustration of one forward skip connection in our
own baseline FCN-8s-F1. ‘Backward Skip’ is our proposed design
where we align two networks, a Master and Slave. The Slave network
provides the backward skip connections from its higher layers to the
Master network lower layers. ‘Block in MSNet’ shows an illustration of
our final model’s skip connections, where additional forward skip
connections have been added to the Master.
A detailed overview of our MSNet-FB1 model. The model consists of two networks; Master
and Slave. Slave is a basic FCN-8s. The Master is a FCN-8s-F1 (as our baseline) and
contains the backward skip connections taken from the Slave network. The forward
propagation starts in the Slave network, and then the generated feature maps are fed back
to lower layers in the Master network serving as backward skip connections within the
Master. Notice that the backward skip connections (blue arrows) are exactly the inverse of
the forward skip connections (purple arrows).
An illustration of several convolutional blocks of our MSNet-FB2
with dense skip connections. Notice that the Master network is
FCN-8s-F2. We set P = 3 and N = 3. Here the dense backward skip
connections are in inverse directions to the dense forward skip
connections

The use of backward connections is actually closer to the cliché of deep learning being analogous to the human
brain.
Deep neural network explains early visual and inferior temporal (IT) representations of object images.
Each representation in model and brain was characterized by the dissimilarity matrix of the response patterns
elicited by a set of real-world photos of objects.
(a) Representations become monotonically more similar to those of human inferior temporal (IT) cortex as we
ascend the layers of the Krizhevsky et al. (2012) neural network. When the final representational stages are
linearly remixed to emphasize the same semantic dimensions as IT using linear category discriminants
(second bar from the right), and when each layer and each discriminant are assigned a weight to model the
prevalence of different computational features in IT (cross-validated to avoid overfitting to the image set;
rightmost bar), the noise ceiling (gray shaded region) is reached, indicating that the model fully explains the
data. (b) Lower layers of the deep neural network resemble the representations in the foveal confluence of
early visual areas (V1–V3).
http://dx.doi.org/10.1146/annurev-vision-082114-035447
Center for Brains, Minds and Machines, McGovern Institute, MIT
“Feed-forward zombie”, from The Sciences of Consciousness: Progress and Problems,
Center for Brains, Minds and Machines (CBMM), Christof Koch - Allen Institute for Brain Science,
https://youtu.be/4gT-1S3FO4s?t=1h9m34s “Not pleasing the people worshipping at the altar of computationalism”

Pipeline • Future paths • Uncertainty • with DenseNet
There are two major types of uncertainty one can
model. Aleatoric uncertainty captures noise inherent in
the observations. On the other hand, epistemic
uncertainty accounts for uncertainty in the model --
uncertainty which can be explained away given enough
data. Traditionally it has been difficult to model
epistemic uncertainty in computer vision, but with new
Bayesian deep learning tools this is now possible. We
study the benefits of modeling epistemic vs. aleatoric
uncertainty in Bayesian deep learning models for vision
tasks.
Our model based on DenseNet can process a 640 ×
480 resolution image in 150ms on a NVIDIA Titan X
GPU. The aleatoric uncertainty models add negligible
compute. However, epistemic models require
expensive Monte Carlo dropout sampling. For
models such as ResNet, this is possible to achieve
economically because only the last few layers contain
dropout. Other models, like DenseNet, require the entire
architecture to be sampled. This is difficult to parallelize
due to GPU memory constraints, and often results in a
50× slowdown for 50 Monte Carlo samples.
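Monte Carlo dropout can be sketched with a toy two-layer network in NumPy. This is an assumption-laden illustration, not the DenseNet model from the paper: `forward` and `mc_dropout_predict` are hypothetical names, the weights are random, and the variance across samples is used as a simple proxy for epistemic uncertainty.

```python
import numpy as np

def forward(x, w1, w2, rng, drop_p):
    """One stochastic forward pass with dropout kept active at test time."""
    h = np.maximum(x @ w1, 0.0)
    if drop_p > 0:
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)  # inverted dropout scaling
    return h @ w2

def mc_dropout_predict(x, w1, w2, rng, drop_p=0.5, num_samples=50):
    """Run num_samples stochastic passes; return predictive mean and variance."""
    samples = np.stack(
        [forward(x, w1, w2, rng, drop_p) for _ in range(num_samples)]
    )
    return samples.mean(axis=0), samples.var(axis=0)

rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 32))
w2 = rng.standard_normal((32, 1))
x = rng.standard_normal((5, 4))
mean, var = mc_dropout_predict(x, w1, w2, rng)
print(mean.shape, var.shape)
```

The loop makes the cost visible: each prediction needs `num_samples` full forward passes, which is where the roughly 50× slowdown for 50 samples comes from when the whole architecture has to be resampled.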

Pipeline • Future paths • Uncertainty • With model compression
Bayesian Compression
for Deep Learning
Christos Louizos, Karen Ullrich, Max Welling
(Submitted on 24 May 2017 (v1), last revised 10 Aug 2017 (this version, v3))
From a Bayesian perspective network pruning and
reducing bit precision for the weights is aligned
with achieving high accuracy, because Bayesian
methods search for the optimal model structure
(which leads to pruning with sparsity inducing
priors), and reward uncertain posteriors over
parameters through the bits back argument [Hinton and Van Camp, 1993]
(which leads to removing insignificant bits). This relation is made explicit in
the MDL principle [Grünwald, 2007], which is known to be
related to Bayesian inference.
By employing sparsity inducing priors for hidden
units (and not individual weights) we can prune
neurons including all their ingoing and outgoing
weights. This avoids more complicated and
inefficient coding schemes needed for pruning or
vector quantizing individual weights. As an
additional Bayesian bonus we can use the
posterior uncertainties to assess which bits
are significant and remove the ones which
fluctuate too much under posterior sampling.
From this we derive the optimal fixed point
precision per layer, which is still practical on chip.
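Why pruning whole hidden units avoids sparse storage formats can be seen in a small NumPy sketch. It is not the Bayesian procedure from the paper: `prune_hidden_units` is a hypothetical helper, and the keep-mask here comes from a placeholder rule (units whose outgoing weights are all zero) rather than from sparsity-inducing priors.

```python
import numpy as np

def prune_hidden_units(w_in, b, w_out, keep):
    """Remove hidden units together with their ingoing and outgoing weights.

    w_in: (hidden, inputs), b: (hidden,), w_out: (outputs, hidden),
    keep: boolean mask of shape (hidden,).
    """
    return w_in[keep], b[keep], w_out[:, keep]

def mlp(x, w_in, b, w_out):
    return np.maximum(x @ w_in.T + b, 0.0) @ w_out.T

rng = np.random.default_rng(0)
w_in = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
w_out = rng.standard_normal((2, 6))
w_out[:, [1, 4]] = 0.0                      # pretend units 1 and 4 were switched off

keep = np.abs(w_out).sum(axis=0) > 0        # placeholder pruning criterion
w_in_p, b_p, w_out_p = prune_hidden_units(w_in, b, w_out, keep)

x = rng.standard_normal((5, 3))
print(w_in.shape, "->", w_in_p.shape)
print(np.allclose(mlp(x, w_in, b, w_out), mlp(x, w_in_p, b_p, w_out_p)))
```

The pruned layers stay dense, just smaller, so no index structure such as CSC is needed to store them.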
For the actual compression task
we compare our method to
current work in three different
scenarios: (i) compression
achieved only by pruning, here, for
non-group methods we use the
CSC format to store parameters;
(ii) compression based on the
former but with reduced bit
precision per layer (only for the
weights); and (iii) the maximum
compression rate as proposed by
Han et al. [2016]. We believe these
to be relevant scenarios because
(i) can be applied with already
existing frameworks such as
Tensorflow, (ii) is a practical
scheme given upcoming GPUs and
frameworks will be designed to
work with low and mixed precision
arithmetics [Lin and Talathi, 2016,
Gysel, 2016]

Pipeline • Future paths • Uncertainty • Geometric problems
http://mi.eng.cam.ac.uk/projects/relocalisation/

Future • Geometric Architectures

Pipeline • Future paths • Geometric Deep Learning #1
Bronstein et al. (July 2017): “Geometric deep learning
(http://geometricdeeplearning.com/) is an umbrella term for emerging
techniques attempting to generalize (structured) deep neural models to non-
Euclidean domains, such as graphs and manifolds. The purpose of this article
is to overview different examples of geometric deep-learning problems and
present available solutions, key difficulties, applications, and future research
directions in this nascent field”
SCNN (2013)
GCNN/ChebNet (2016)
GCN (2016)
GNN (2009)
Geodesic CNN (2015)
Anisotropic CNN (2016)
MoNet (2016)
Localized SCNN (2015)

Bronstein et al. (July 2017): “The non-Euclidean nature of data
implies that there are no such familiar properties as global
parameterization, common system of coordinates, vector space
structure, or shift-invariance. Consequently, basic operations like
convolution that are taken for granted in the Euclidean case are even
not well defined on non-Euclidean domains.”
“First attempts to generalize neural networks to graphs we are aware of
are due to Mori et al. (2005) who proposed a scheme combining
recurrent neural networks and random walk models. This approach
went almost unnoticed, re-emerging in a modern form in
Sukhbaatar et al. (2016) and Li et al. (2015) due to the renewed recent
interest in deep learning.”
“In a parallel effort in the computer vision and graphics community,
Masci et al. (2015) showed the first CNN model on meshed surfaces,
resorting to a spatial definition of the convolution operation based on
local intrinsic patches. Among other applications, such models were
shown to achieve state-of-the-art performance in finding
correspondence between deformable 3D shapes. Followup works
proposed different construction of intrinsic patches on point clouds
(Boscaini et al. 2016a,b) and general graphs (Monti et al. 2016).”
In calculus, the notion of derivative describes
how the value of a function changes with an
infinitesimal change of its argument. One of the
big differences distinguishing classical calculus
from differential geometry is a lack of vector
space structure on the manifold, prohibiting us
from naïvely using expressions like f(x+dx). The
conceptual leap that is required to generalize
such notions to manifolds is the need to work
locally in the tangent space.
Physically, a tangent vector field can be
thought of as a flow of material on a manifold.
The divergence measures the net flow of a field
at a point, allowing to distinguish between field
‘sources’ and ‘sinks’. Finally, the Laplacian (or
Laplace-Beltrami operator in differential
geometric jargon)
“A centerpiece of classical Euclidean signal processing is the property of the Fourier
transform diagonalizing the convolution operator, colloquially referred to as the
Convolution Theorem. This property allows to express the convolution f⋆g of two
functions in the spectral domain as the element-wise product of their Fourier transforms.
Unfortunately, in the non-Euclidean case we cannot even define the operation x-x’ on the
manifold or graph, so the notion of convolution does not directly extend to this case.”
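The Convolution Theorem, and one common way of carrying it over to graphs, can be checked numerically. The sketch below is an illustration under assumptions: it uses circular convolution on a 1D grid, and for the graph case it takes the eigenvectors of the Laplacian of a ring graph as the "Fourier basis"; `spectral_filter` is a hypothetical helper, not an API from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# Euclidean case: circular convolution computed directly ...
direct = np.array(
    [sum(f[j] * g[(i - j) % n] for j in range(n)) for i in range(n)]
)
# ... and as an element-wise product in the Fourier domain.
spectral = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real
print(np.allclose(direct, spectral))

# Graph case: there is no "x - x'", so filtering is defined in the
# eigenbasis of the graph Laplacian instead (ring graph used as a toy domain).
adjacency = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
eigvals, eigvecs = np.linalg.eigh(laplacian)

def spectral_filter(signal, response):
    """Apply a filter given by its response on the Laplacian eigenvalues."""
    return eigvecs @ (response(eigvals) * (eigvecs.T @ signal))

smoothed = spectral_filter(f, lambda lam: np.exp(-lam))  # low-pass example
print(smoothed.shape)
```

Choosing the response `lam -> lam` reproduces multiplication by the Laplacian itself, which is a quick sanity check on the eigenbasis.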

Bronstein et al. (July 2017): “We expect the following years to bring exciting new approaches
and results, and conclude our review with a few observations of current key difficulties and
potential directions of future research.”
Generalization: Generalizing
deep learning models to
geometric data requires not only
finding non-Euclidean
counterparts of basic building
blocks (such as convolutional
and pooling layers), but also
generalization across different
domains. Generalization
capability is a key requirement in
many applications, including
computer graphics, where a
model is learned on a training set
of non-Euclidean domains (3D
shapes) and then applied to
previously unseen ones.
Time-varying domains: An
interesting extension of geometric
deep learning problems discussed
in this review is coping with signals
defined over a dynamically
changing structure. In this case, we
cannot assume a fixed domain and
must track how these changes
affect signals. This could prove
useful to tackle applications such
as abnormal activity detection in
social or financial networks. In the
domain of computer graphics and
vision, potential applications deal
with dynamic shapes (e.g. 3D video
captured by a range sensor).
Computation: The final consideration is
a computational one. All existing deep
learning software frameworks are
primarily optimized for Euclidean data.
One of the main reasons for the
computational efficiency of deep
learning architectures (and one of the
factors that contributed to their
renaissance) is the assumption of
regularly structured data on 1D or 2D
grid, allowing to take advantage of
modern GPU hardware. Geometric data,
on the other hand, in most cases do not
have a grid structure, requiring different
ways to achieve efficient computations.
It seems that computational paradigms
developed for large-scale graph
processing are more adequate
frameworks for such applications.

Pipeline • Future paths • sparsity primer from neuroscience
Questioning the role of sparse
coding in the brain
Anton Spanne, Henrik Jörntell
July 2015 - http://doi.org/10.1016/j.tins.2015.05.005
Optimal Degrees of Synaptic
Connectivity
Litwin-Kumar A, Harris KD, Axel R, Sompolinsky H, Abbott LF
Feb 2017 - http://doi.org/10.1016/j.neuron.2017.01.030
3D Visual Response Properties of MSTd Emerge
from an Efficient, Sparse Population Code
Michael Beyeler, Nikil Dutt and Jeffrey L. Krichmar
Journal of Neuroscience 10 August 2016
http://doi.org/10.1523/JNEUROSCI.0396-16.2016
● Sparse coding is questioned on both theoretical
and experimental grounds.
● Generalization is important to current brain
models but is weak under sparse coding.
● The beneficial properties ascribed to sparse
coding can be achieved by alternative means.
Sparse code is a compromise between local code and dense code. (A)
Comparison of coding schemes that differ in their ratio of active neurons: in
other words, in their sparseness. The activities within the population during
two hypothetical contexts (context A and context B) are shown as examples of
how different contexts are represented within the population. Note that by our
definition only a single context would be active at any time because a context
represents the global brain state (i.e., all the neurons). In local code, a context is
represented by the activity of a single neuron, or a small subset of neurons, and
different contexts are represented by different neurons. Notably, the activities
of the neurons are not independent because if a neuron is responding to
context A, it will not respond to any other context. In dense code, all neurons
are active and their combined activity is used to encode each context. Any
state in between the two extreme cases of local and dense code can in
principle be labeled sparse code. The reduction of average activation leads to a
reduction in the overlap or interference between the activation during
different contexts. (B) In the special case of binary activation functions,
maximal representational capacity is obtained if 50% of the neurons are active
during each context. For this reason an average activation of 50% is usually
considered dense code in the binary case.
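The 50% figure for binary codes can be reproduced by counting patterns. This sketch assumes one simple reading of "representational capacity", namely the number of distinct patterns with exactly k of N binary neurons active; `capacity_bits` is a hypothetical helper and N = 100 is arbitrary.

```python
from math import comb, log2

def capacity_bits(num_neurons, num_active):
    """log2 of the number of distinct patterns with exactly num_active neurons on."""
    return log2(comb(num_neurons, num_active))

n = 100
capacities = {k: capacity_bits(n, k) for k in (1, 5, 25, 50, 75, 100)}
best = max(range(n + 1), key=lambda k: comb(n, k))

for k, bits in capacities.items():
    print(f"{k:3d} active of {n}: {bits:6.2f} bits")
print("maximum at", best, "active neurons")
```

A local code (one active neuron) distinguishes only N contexts, while the count peaks when half of the neurons are active and falls off symmetrically beyond that.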
● Sparse synaptic wiring can optimize a neural
representation for associative learning
● Maximizing dimension predicts the degree of
connectivity for cerebellum-like circuits
● Supervised plasticity of input connections is
needed to exploit dense wiring
● Performance of a Hebbian readout neuron is
formally related to dimension
(A) Schematic of network with a classifier that computes a weighted sum
of mixed-layer activity to determine the valence of an input pattern.
Example flow fields generated with the motion field model
[Longuet-Higgins and Prazdny (1980); Raudies (2013)]. Generated from a
pinhole camera with image plane.

Pipeline • Future paths • Non-Euclidean computability & Geometric Operators #1
“Our model is totally unsupervised, thus we can experiment with
large-scale video corpora in future work, to learn non-rigid real world
motion patterns. Through comparison of popular CNN architectures,
we found that it is important to design novel operators or
networks for optical flow estimation instead of relying on
existing architectures for image classification”
pixels remain. The labels would be
transformed in the same way but are not
shown here.
Away from GRID
Szegedy et al. 2014 (Google, cited by 4,027 articles):
“Today’s computing infrastructures are very inefficient
when it comes to numerical calculation on non-uniform
sparse data structures. Even if the number of arithmetic
operations is reduced by 100×, the overhead of lookups
and cache misses is so dominant that switching to sparse
matrices would not pay off. The gap is widened even further
by the use of steadily improving, highly tuned, numerical
libraries that allow for extremely fast dense matrix
multiplication. Also, non-uniform sparse models require
more sophisticated engineering and computing
infrastructure.”
Google not quite there yet for hardware-accelerated
sparse matrix deep learning.
“This Matrix unit of a custom ASIC, called a Tensor Processing Unit
(TPU), is designed for dense matrices. Sparse architectural support
was omitted for time-to-deploy reasons. Sparsity will have high
priority in future designs.”

Speeding up Convolutional Neural Networks
By Exploiting the Sparsity of Rectifier Units
Shaohuai Shi, Xiaowen Chu
(Submitted on 25 Apr 2017 (v1), last revised 15 May 2017 (this version, v2))
Rectifier neuron units (ReLUs) have been widely used in deep
convolutional networks. An ReLU converts negative values to zeros,
and does not change positive values, which leads to a high sparsity of
neurons. In this work, we first examine the sparsity of the outputs of
ReLUs in some popular deep convolutional architectures. And then
we use the sparsity property of ReLUs to accelerate the calculation
of convolution by skipping calculations of zero-valued neurons. The
proposed sparse convolution algorithm achieves some speedup
improvements on CPUs compared to the traditional matrix-matrix
multiplication algorithm for convolution when the sparsity is not less
than 0.9.
We measure the speed of compared algorithms on the Intel CPU: E5-2630v4 at
the core frequency of 2.20GHz with 128 GB memory.
We propose the inverse sparse convolution (ISC) algorithm by three steps: First,
we skip all the zero elements of the input data, and store the non-zero values in a
vector with their column and row information. Second, the kernel matrix is stored
as column-major matrix such that for each non-zero element (I_{c,i,j}) of inputs, a
continuous memory that stores kernels can be fetched and multiplied by I_{c,i,j} at one
time with AVX or SSE techniques. Third, transpose temporary results from the
second step to generate outputs.
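The three steps can be mimicked in NumPy to show that skipping zero inputs gives the same result as a dense convolution. This is a sketch of the idea only, not the authors' ISC implementation: there is no AVX/SSE, the function names are hypothetical, and stride 1 with no padding is assumed.

```python
import numpy as np

def dense_conv(inputs, kernels):
    """Reference: inputs (C, H, W), kernels (K, C, R, S), stride 1, no padding."""
    k, c, r, s = kernels.shape
    _, h, w = inputs.shape
    out = np.zeros((k, h - r + 1, w - s + 1))
    for y in range(h - r + 1):
        for x in range(w - s + 1):
            patch = inputs[:, y:y + r, x:x + s]
            out[:, y, x] = np.tensordot(kernels, patch, axes=3)
    return out

def sparse_conv(inputs, kernels):
    k, c, r, s = kernels.shape
    _, h, w = inputs.shape
    h_out, w_out = h - r + 1, w - s + 1
    # Step 2 layout: the K values for one (c, r, s) position are contiguous.
    kernels_t = np.ascontiguousarray(kernels.transpose(1, 2, 3, 0))
    out_t = np.zeros((h_out, w_out, k))
    # Step 1: visit only the non-zero inputs, with their coordinates.
    for ch, i, j in zip(*np.nonzero(inputs)):
        value = inputs[ch, i, j]
        for dr in range(r):
            for ds in range(s):
                y, x = i - dr, j - ds
                if 0 <= y < h_out and 0 <= x < w_out:
                    out_t[y, x] += value * kernels_t[ch, dr, ds]
    # Step 3: transpose the temporary result to the usual (K, H, W) layout.
    return out_t.transpose(2, 0, 1)

rng = np.random.default_rng(0)
inputs = np.maximum(rng.standard_normal((3, 6, 6)) - 1.0, 0.0)  # ReLU-like sparsity
kernels = rng.standard_normal((4, 3, 3, 3))
sparsity = np.mean(inputs == 0)
print(f"sparsity = {sparsity:.2f}")
print(np.allclose(dense_conv(inputs, kernels), sparse_conv(inputs, kernels)))
```

The work in `sparse_conv` scales with the number of non-zero inputs, which is why the abstract reports gains only at high sparsity.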
The Power of Sparsity in
Convolutional Neural Networks
Soravit Changpinyo, Mark Sandler, Andrey Zhmoginov
(Submitted on 21 Feb 2017)
We deactivate connections between filters in convolutional
layers in a way that allows us to harvest savings both in run-time
and memory for many network architectures. More specifically,
we generalize 2D convolution to use a channel-wise sparse
connection structure and show that this leads to significantly
better results than the baseline approach for large networks
including VGG and Inception V3.
“For example, when applied to Inception V3 (Fig. 4) achieves
AlexNet-level accuracy with fewer than 400K parameters and
VGG-level one (Fig. 5) with roughly 3.5M parameters. In addition, we
show that our method leads to an interesting novel incremental training technique,
where we take advantage of sparse (and smaller) models to build a dense network. One
interesting open direction is to enable incremental training not to simply densify the
network over time, but also increase the number of channels. This would allow us to
grow the network without having to fix its original shape in place.”
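A channel-wise sparse connection structure can be pictured as a binary mask over (output channel, input channel) pairs. The sketch below is an assumption about the general idea, not the paper's exact scheme: the mask is drawn at random with a fixed keep fraction, and `sparse_channel_mask` and `masked_conv_weights` are hypothetical names.

```python
import numpy as np

def sparse_channel_mask(out_channels, in_channels, keep_fraction, rng):
    """Boolean mask: True where an input channel feeds an output channel."""
    return rng.random((out_channels, in_channels)) < keep_fraction

def masked_conv_weights(weights, mask):
    """Zero out whole kh x kw filters for deactivated channel pairs.

    weights: (out, in, kh, kw); mask: (out, in) boolean.
    """
    return weights * mask[:, :, None, None]

rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 32, 3, 3))
mask = sparse_channel_mask(64, 32, keep_fraction=0.1, rng=rng)
sparse_weights = masked_conv_weights(weights, mask)

dense_params = weights.size
active_params = int(mask.sum()) * 3 * 3
print(dense_params, active_params)
```

Because entire filters are removed rather than individual weights, the surviving connections can still be computed with dense kernels per channel pair.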
Efficient Sparse-Winograd Convolutional
Neural Networks
Xingyu Liu, Song Han, Huizi Mao, William J. Dally
17 Feb 2017 (modified: 19 Feb 2017)
ICLR 2017 workshop submission
https://openreview.net/forum?id=r1rqJyHKg
Convolutional Neural Networks (CNNs) are
compute intensive which limits their application
on mobile devices. Their energy is dominated by
the number of multiplies needed to perform the
convolutions. Winograd’s minimal filtering
algorithm (Lavin and Gray (2015)) and network
pruning (Han et al. (2015)) reduce the operation
count. Unfortunately, these two methods cannot
be combined—because applying the Winograd
transform fills in the sparsity in both the weights
and the activations.
We propose two modifications to Winograd-based
CNNs to enable these methods to exploit sparsity.
First, we prune the weights in the “Winograd
domain” (after the transform) to exploit static
weight sparsity.
Second, we move the ReLU operation into the
“Winograd domain” to improve the sparsity of the
transformed activations. On CIFAR-10, our method
reduces the number of multiplications in the VGG-
nagadomi model by 10.2x with no loss of
accuracy.
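The conflict between pruning and the Winograd transform can be seen in the smallest case, F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The matrices below are the standard F(2,3) transforms; the example values and function names are only illustrative.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs from a 4-element input tile d and a 3-tap filter g."""
    return AT @ ((G @ g) * (BT @ d))   # 4 element-wise multiplications

def direct_f23(d, g):
    """Same two outputs computed directly (6 multiplications)."""
    return np.array([d[0:3] @ g, d[1:4] @ g])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, 0.0])          # a pruned filter: two of three taps are zero

print(winograd_f23(d, g), direct_f23(d, g))
print("non-zeros before transform:", np.count_nonzero(g))
print("non-zeros after transform: ", np.count_nonzero(G @ g))
```

The pruned filter has one non-zero tap, but its transformed version has three non-zero entries, so sparsity created in the spatial domain is largely lost. Pruning after the transform, as the abstract proposes, sidesteps this.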

SPARCNet: A Hardware Accelerator
for Efficient Deployment of Sparse
Convolutional Networks
Adam Page, Ali Jafari, Colin Shea, Tinoosh Mohsenin
ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special Issue on Hardware
and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems.
Volume 13 Issue 3, May 2017 - Article No. 31.
https://doi.org/10.1145/3005448
The SPARCNet accelerator with different numbers of
processing engines is implemented on a low-power Artix-7
FPGA platform. The FPGA-based accelerator is developed
using a combination of pure HDL written in Verilog and IP
cores developed using Xilinx’s Vivado HLS.
Additionally, the same networks are optimally implemented on a number of
embedded commercial-off-the-shelf platforms including NVIDIA's
CPU+GPU SoCs TK1 and TX1 and Intel Edison. Compared to NVIDIA's TK1
and TX1, the FPGA-based accelerator obtains 11.8× and 7.5× improvement
in energy efficiency. In addition to improving efficiency, the accelerator has
built-in support for sparsification techniques and ability to perform in-place
rectified linear unit (ReLU) activation function, max-pooling, and batch
normalization.
Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC
E Nurvitadhi, D Sheffield, J Sim… 2017
Field-Programmable Technology (FPT), 2016 International Conference on
Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?
E Nurvitadhi, G Venkatesh, J Sim, D Marr, R Huang FPGA’17
Hardware accelerator for analytics of sparse data
E Nurvitadhi, A Mishra, Y Wang, G Venkatesh… - Proceedings of the 2016
Sparse Matrix Multiplication on CAM Based Accelerator
L Yavits, R Ginosar - arXiv preprint arXiv:1705.09937, 2017
Cambricon-X: An accelerator for sparse neural networks
S Zhang, Z Du, L Zhang, H Lan, S Liu… - … (MICRO), 2016 49th …, 2016
Accelerator for Sparse Machine Learning
L Yavits, R Ginosar - IEEE Computer Architecture Letters, 2017
A Scalable FPGA-Based Accelerator for High-Throughput MCMC Algorithms
M Hosseini, R Islam, A Kulkarni… - … (FCCM), 2017 IEEE
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
A Parashar, M Rhu, A Mukkara, A Puglielli… - Proceedings of the 44th …, 2017
NullHop: A Flexible Convolutional Neural Network Accelerator Based on
Sparse Representations of Feature Maps
A Aimar, H Mostafa, E Calabrese… - arXiv preprint arXiv: …, 2017

Geometric DNNs • implementation options in practice #1: GVNN
ankurhanda/gvnn
gvnn: Geometric Vision with Neural
Networks
gvnn is primarily intended for self-supervised
learning using low-level vision. It is inspired by the
Spatial Transformer Networks (STN) paper that
appeared in NIPS in 2015 and its open source code
made available by Maxime Oquab. The code is
self-contained, i.e. the original implementation of STN by
Maxime is also within the repository.
STs were mainly limited to applying only 2D
transformations to the input. We added a new set of
transformations often needed for manipulating
data in 3D geometric computer vision. These
include the 3D counterparts of what were used in
original STN together with a lot more new
transformations and different M-estimators.
SO3 Layer
Rotations are represented as an so(3) 3-vector. This vector is turned into a rotation matrix via the exponential
map. For a more detailed view of the so(3) representation and the exponential map, read Ethan Eade's
Lie-Algebra Tutorial. Tom Drummond's notes on Lie algebra are also a great source for learning about
exponential maps. The
reason for choosing so3 representation is mainly due to its appealing properties when linearising rotations
(via Taylor series expansion) for iterative image alignment via classic linearise-solve-update rule. The figure
below shows how linearisation for SO3 is fitting a local plane on the sphere
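The exponential map from an so(3) 3-vector to a rotation matrix can be written with Rodrigues' formula. The NumPy sketch below is not gvnn's own code (gvnn is a Torch library); `hat` and `so3_exp` are hypothetical names, and the small-angle branch uses a second-order Taylor expansion.

```python
import numpy as np

def hat(omega):
    """Map a 3-vector to the corresponding skew-symmetric matrix."""
    x, y, z = omega
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def so3_exp(omega):
    """Rodrigues' formula: rotation about omega/|omega| by angle |omega|."""
    theta = np.linalg.norm(omega)
    k = hat(omega)
    if theta < 1e-8:
        # Near the identity, fall back to a Taylor expansion to avoid 0/0.
        return np.eye(3) + k + 0.5 * (k @ k)
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta**2
    return np.eye(3) + a * k + b * (k @ k)

rotation = so3_exp(np.array([0.0, 0.0, np.pi / 2]))  # 90 degrees about z
print(np.round(rotation, 6))
```

For small `omega` the map is approximately `I + hat(omega)`, which is the linearisation used in the linearise-solve-update loop mentioned above.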
Optical Flow
Lens Distortion
Projection Layer
