Passive Stereo Vision: From Traditional
to Deep Learning-based Methods
YU HUANG
SUNNYVALE, CALIFORNIA
YU.HUANG07@GMAIL.COM
Outline
• Modeling from multiple views
• Stereo matching
• constraints in stereo vision
• difficulties in stereo vision
• pipeline of stereo matching
• state-of-the-art methods
• Quality metric of stereo matching
• census transform and hamming distance
• guided filter in cost aggregation (volume)
• semi-global matching
• ELAS: efficient large scale stereo
• stereo matching as energy minimization
• dynamic programming/graph cut/belief propagation
• phase matching for stereo vision
• disparity refinement
• Multiple cameras/views
• Learning sparse representations of depth maps
• Stereopsis via deep learning
• Deep learning of depth (and motion)
• Stereo matching by CNN
• Constant Highway Networks and Reflective
Confidence Learning;
• Efficient Deep Learning for Stereo Matching;
• E2e Learning of Geometry and Context;
• Appendix A: Depth from an image by learning
• Appendix B: Learning and optimization
Modeling from Multiple Views in Computer Vision
[Figure: taxonomy of multi-view modeling along two axes, time (one frame, two frames, ...) and number of cameras — photograph, binocular stereo, trinocular stereo, multi-baseline stereo, camcorder, human vision, camera dome.]
Binocular Stereo
• Given a calibrated binocular stereo pair, fuse it to produce a depth image
[Figure: a stereo pair (image 1, image 2) fused into a dense depth map.]
Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923
Basic Stereo Matching Algorithm
• For each pixel in the first image
Find corresponding epipolar line in the right image
Examine all pixels on the epipolar line and pick the best match
Triangulate the matches to get depth information
• Simplest case: epipolar lines are corresponding
scanlines;
• If necessary, rectify the two stereo images to transform
epipolar lines into scanlines
Depth from Disparity
[Figure: similar triangles for a rectified stereo pair — camera centers O and O' separated by baseline B, focal length f, and a scene point X at depth z imaged at x and x'.]

By similar triangles, the disparity is

$$d = x - x' = \frac{fB}{z}$$
Disparity is inversely proportional to depth!
• Stereo camera calibration (focal length and baseline are known)
• Image planes of cameras parallel to each other and to the baseline
• Camera centers are at same height
• Focal lengths are the same
• Then, epipolar lines fall along the horizontal scan lines of the images
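As a concrete illustration of the relation above, a minimal sketch converting a disparity map into metric depth (the focal length f and baseline B below are placeholder calibration values, not from any particular rig):

```python
import numpy as np

def disparity_to_depth(disparity, f=700.0, B=0.12):
    """z = f * B / d; zero/negative disparities are treated as invalid."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = f * B / disparity[valid]   # depth inversely proportional to disparity
    return depth
```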
Calibration
• Find the intrinsic and extrinsic parameters of a camera
◦ Extrinsic parameters: the camera’s location and orientation in the world.
◦ Intrinsic parameters: the relationships between pixel coordinates and camera coordinates.
• The work of Roger Tsai (3D setup) and of Zhengyou Zhang (2D plane) is influential
• Basic idea:
◦ Given a set of world points Pi and their image coordinates (ui,vi)
◦ find the projection matrix M
◦ And then find intrinsic and extrinsic parameters.
• Calibration Techniques
◦ Calibration using 3D calibration object
◦ Calibration using 2D planar pattern
◦ Calibration using 1D object (line-based calibration)
◦ Self Calibration: no calibration objects
◦ Vanishing points for orthogonal directions
Calibration
• Calibration using 3D calibration object:
◦ Calibration is performed by observing a calibration object whose geometry in 3D space is known with very good precision.
◦ Calibration object usually consists of two or three planes orthogonal to each other, e.g. calibration cube
◦ Calibration can also be done with a plane undergoing a precisely known translation (Tsai approach)
◦ (+) most accurate calibration, simple theory
◦ (-) more expensive, more elaborate setup
• 2D plane-based calibration (Zhang approach)
◦ Requires observations of a planar pattern at a few different orientations
◦ No need to know the plane motion
◦ Setup is easy; the most popular approach
◦ Seems to be a good compromise (see the code sketch after this list).
• 1D line-based calibration:
◦ Relatively new technique.
◦ Calibration object is a set of collinear points, e.g., two points with known distance, three collinear points with known
distances, four or more…
◦ Camera can be calibrated by observing a moving line around a fixed point, e.g. a string of balls hanging from the ceiling!
◦ Can be used to calibrate multiple cameras at once. Good for network of cameras.
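A hedged sketch of Zhang-style plane-based calibration with OpenCV (the 9x6 checkerboard size and the calib/*.png image folder are illustrative assumptions):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                   # inner corners per row/column (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calib/*.png"):             # hypothetical image folder
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns RMS reprojection error, intrinsics K, distortion coefficients,
# and per-view extrinsics (rotation and translation vectors).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```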
Fundamental Matrix
Let p be a point in left image, p’ in right image
Epipolar relation
◦ p maps to epipolar line l’
◦ p’ maps to epipolar line l
Epipolar mapping described by a 3x3 matrix F
It follows that $p'^{\top} F p = 0$
[Figure: epipolar lines $l$ and $l'$ through corresponding points $p$ and $p'$.]
This matrix F is called
• the “Essential Matrix” E
– when image intrinsic parameters are known
• the “Fundamental Matrix”
– more generally (uncalibrated case)
Can solve for F from point correspondences
• Each (p, p’) pair gives one linear equation in entries of F
• 8 points suffice to solve for F (the 8-point algorithm); 5 points suffice for E (the 5-point algorithm)
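A sketch of estimating F from correspondences with OpenCV's RANSAC-wrapped 8-point algorithm; the synthetic matches below stand in for real feature matches:

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
pts1 = rng.uniform(0, 640, (50, 2)).astype(np.float32)          # stand-in matches
disp = rng.uniform(5, 40, 50).astype(np.float32)                # random disparities
pts2 = pts1 + np.column_stack([disp, np.zeros(50, np.float32)]) # rectified-style shift

F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0)

# Sanity check of the epipolar constraint: p'^T F p should be ~0 for inliers.
p = np.hstack([pts1, np.ones((len(pts1), 1), np.float32)])
p_ = np.hstack([pts2, np.ones((len(pts2), 1), np.float32)])
residuals = np.abs(np.einsum('ij,jk,ik->i', p_, F, p))
```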
Planar Rectification
Bring two views
to standard stereo setup
(moves the epipoles to infinity)
(not possible when an epipole lies in or close to the image)
Homographies are chosen to keep roughly the original image size (calibrated case)
or to minimize distortion (uncalibrated case)
Polar re-parameterization around epipoles
Requires only (oriented) epipolar geometry
Preserves the length of epipolar lines
Chooses the angular step between epipolar lines so that no pixels are compressed
[Figure: original image vs. rectified image.]
Polar Rectification
Works for all relative motions
Guarantees minimal image size
Determine the common region from the extremal
epipolar lines and the location of the epipole: $e'^{\top} F = 0$
Select half epipolar lines moving around the epipole
Construct rectified image line by line
[Figure: matching cost as a function of disparity for a window on corresponding left/right scanlines.]
Correspondence Search
• Slide a window along the right scanline and compare contents of that window with the
reference window in the left image
• Matching cost: SSD or normalized correlation
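A minimal sketch of this window-based search (unvectorized NumPy for clarity; winner-takes-all over SSD costs, rectified grayscale inputs assumed):

```python
import numpy as np

def block_match(left, right, max_disp=64, win=5):
    """Brute-force SSD block matching along rectified scanlines."""
    h, w = left.shape
    r = win // 2
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    disp = np.zeros((h, w), np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.sum((ref - right[y - r:y + r + 1, x - d - r:x - d + r + 1]) ** 2)
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))       # winner-takes-all
    return disp
```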
Constraints in Stereo Vision
• Color constancy
• Lambertian surface assumption;
• Epipolar geometry
• Scanline as epipolar line for a rectified pair;
• Uniqueness
• For any point in one image, there should be at
most one matching point in the other image;
• Ordering
• Corresponding points should be in the same
order in both views;
• Smoothness
• Disparities change slowly (for the most part).
[Figures: the epipolar plane and the epipolar lines for p and p'; illustrations of the uniqueness and ordering constraints.]
Difficulties in Stereo Vision
• Photometric distortions and noise;
• Foreshortening;
• Perspective distortions;
• Uniform/ambiguous regions;
• Repetitive/ambiguous patterns;
• Transparent objects;
• Occlusions and discontinuities.
Pipeline of Stereo Matching Methods
• Pre-processing: compensate for photometric distortion;
• LoG, Census transform, phase-only (DCT or WT), histogram equalization/matching, isotropic diffusion, …
• Cost computation:
• Absolute difference, squared difference, weighted difference, SAD, SSD, SWD, ZMNCC, …
• Cost aggregation:
• Bilateral filter, guided filter, non local, segment tree,...
• Disparity computation/optimization
• Integral image, box filtering, …
• Local (fast), global (slow), semi-global, …
• Disparity refinement
• Sub-pixel interpolation, median filter, cross check (left-right consistency check) and occlusion filling.
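In practice the whole classical pipeline is available off-the-shelf; a hedged sketch with OpenCV's semi-global block matcher (file names and parameters are typical placeholders, not tuned values):

```python
import cv2

left = cv2.imread("left.png")        # hypothetical rectified pair
right = cv2.imread("right.png")

matcher = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    P1=8 * 3 * 5 ** 2, P2=32 * 3 * 5 ** 2,        # small/large smoothness penalties
    uniquenessRatio=10, speckleWindowSize=100, speckleRange=2)

disp = matcher.compute(left, right).astype('float32') / 16.0   # fixed-point -> pixels
disp = cv2.medianBlur(disp, 5)                    # simple disparity refinement
```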
State-of-the-Art Stereo Matching Methods
• Local method
• Look at one image patch at a time
• Solve many small problems independently
• Faster, less accurate, usually works for high texture
• Needs enough texture in a patch to disambiguate
• Global method
• Look at the whole image
• Solve one large problem
• Slower, more accurate, works up to medium texture
• Propagates estimates from textured to untextured regions
• Sparse point-based method
• Still works for low textured regions, hard to handle ambiguous regions
• Semi-global method
• SGM (semi-global matching): reduces the 2-D search to 1-D searches along 8/16 directions.
Quality Metrics in Stereo Matching (Passive)
• General objective approaches:
• Compute error statistics w.r.t. some ground truth data;
• RMS (root-mean-squared) error (in disparity units) btw. computed disparity dC (x, y) and ground truth dT (x, y);
• Percentage of bad matching pixels;
• Select the following areas to support the analysis of matching results:
• textureless regions;
• occluded regions;
• depth discontinuity regions.
• Evaluate a synthetic image obtained by warping the reference with the disparity map;
• Forward warp the reference image by the computed disparity map;
• Inverse warp a new view by the computed disparity map.
• Subjective evaluation
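For reference, the two statistics above are usually written as follows (standard definitions from the stereo evaluation literature; $N$ is the pixel count and $\delta_d$ the bad-pixel threshold):

$$R = \left( \frac{1}{N} \sum_{(x,y)} \bigl| d_C(x,y) - d_T(x,y) \bigr|^2 \right)^{1/2}, \qquad B = \frac{1}{N} \sum_{(x,y)} \mathbf{1}\bigl[\, |d_C(x,y) - d_T(x,y)| > \delta_d \,\bigr]$$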
Census Transform and Hamming Distance
• The census transform converts each relative intensity difference in a window to 0 or 1, forming a binary vector whose
length matches the census window size;
• The census transform thus produces data of size (image size × vector size).
• Modified CTW: compared with the mean rather than the central pixel;
• Hamming distance of CT vectors with correlation windows used to find matched patches;
• Advantage: robustness to radiometric distortion, vignetting, lighting, boundaries and noise.
[Figure: worked example — a 5×5 intensity window is census-transformed by comparing each neighbor with the center pixel (X), yielding a bit string of length (square size of CTW) − 1, here 111111100000110001100011; the transform can be applied to intensity and gradient images respectively.]
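A sketch of a 5x5 census transform and the Hamming-distance matching cost (NumPy; borders wrap via np.roll, which a production version would mask out):

```python
import numpy as np

def census(img, r=2):
    """Encode each pixel as a (2r+1)^2 - 1 bit string of neighbor comparisons."""
    codes = np.zeros(img.shape, np.uint32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            codes = (codes << 1) | (neighbor < img)   # one comparison convention
    return codes

def hamming_cost(census_left, census_right, d):
    """Per-pixel Hamming distance at disparity d (popcount of the XOR)."""
    xor = census_left ^ np.roll(census_right, d, axis=1)
    bits = np.unpackbits(xor.view(np.uint8), axis=-1)
    return bits.reshape(*xor.shape, 32).sum(-1)
```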
Guided Filter in Cost Aggregation for Stereo Matching
• Idea: stereo match as labeling, a spatially smooth labeling with label transitions aligned with color edges;
• Edge preserving filter: WLS, Anisotropic diffusion, bilateral filter, total variation filter, guided filter, ...
• Guided filter works better than bilateral filter;
• The filtered cost volume is $C'_i = \sum_j W_{i,j}(I)\, C_j$, where the filter weights $W_{i,j}$ depend only on the guidance image $I$;
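A hedged sketch of this cost-volume filtering (requires the opencv-contrib package for cv2.ximgproc; the cost volume and the float [0,1] left guidance image are assumed to be computed elsewhere):

```python
import cv2
import numpy as np

def aggregate(cost_volume, guide, radius=9, eps=1e-4):
    """Filter each disparity slice of cost_volume (D, H, W), steered by guide."""
    out = np.empty_like(cost_volume)
    for d in range(cost_volume.shape[0]):
        out[d] = cv2.ximgproc.guidedFilter(guide, cost_volume[d], radius, eps)
    return out

# Winner-takes-all after edge-aware aggregation:
# disparity = aggregate(cost_volume, left_guide).argmin(axis=0)
```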
PatchMatch Stereo
• Idea: first randomly initialize disparities and plane parameters for each pixel, then
update the estimates by propagating information from neighboring pixels;
• Spatial propagation: check, for each pixel, the disparities and plane parameters of its left and
upper neighbors and replace the current estimates if the matching costs are smaller;
• View propagation: warp the point into the other view and check the corresponding
estimates in the other image; replace if the matching costs are lower;
• Temporal propagation: propagate the information analogously by considering the
estimates for the same pixel in the preceding and following video frames;
• Plane refinement: disparity and plane parameters for each pixel are refined by generating random
samples within an interval and updating the estimates if the matching costs are reduced;
• Post-processing: remove outliers with left/right consistency checking and a weighted
median filter; gaps are filled by propagating information from the neighborhood.
Semi-Global Matching for Stereo Computation
• Semi-global matching approximates a global optimization by combining several local optimization steps;
• Minimizing E(D) over the full two-dimensional image would be very costly, so SGM simplifies it by traversing
one-dimensional paths and enforcing the smoothness constraints along these explicit directions;
• At least 8 paths (16 suggested): horizontal, vertical and diagonal orientations;
• For instance, cost aggregation along a path in direction r is

$$L_r(\mathbf{p}, d) = C(\mathbf{p}, d) + \min\Bigl( L_r(\mathbf{p}-\mathbf{r}, d),\; L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\; L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\; \min_i L_r(\mathbf{p}-\mathbf{r}, i) + P_2 \Bigr) - \min_k L_r(\mathbf{p}-\mathbf{r}, k)$$

with a small penalty $P_1$ for disparity changes of one level and a large penalty $P_2$ for larger disparity changes;
• Pixel-based matching costs are computed with mutual information;
• Left-right consistency check for occlusion detection and disparity propagation for hole filling;
• To accelerate the process, down-sampled image pairs are used for disparity estimation.
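A NumPy sketch of the path aggregation above for the single left-to-right direction; the full method sums $L_r$ over 8 (or 16) directions before the winner-takes-all step:

```python
import numpy as np

def aggregate_lr(cost, P1=10.0, P2=120.0):
    """cost: (H, W, D) matching-cost volume -> path costs for r = left-to-right."""
    H, W, D = cost.shape
    L = np.empty_like(cost)
    L[:, 0] = cost[:, 0]
    for x in range(1, W):
        prev = L[:, x - 1]                               # L_r(p - r, .) for all rows
        prev_min = prev.min(axis=1, keepdims=True)
        d_minus = np.pad(prev, ((0, 0), (1, 0)), 'edge')[:, :-1] + P1   # d - 1 term
        d_plus = np.pad(prev, ((0, 0), (0, 1)), 'edge')[:, 1:] + P1     # d + 1 term
        best = np.minimum(np.minimum(prev, d_minus),
                          np.minimum(d_plus, prev_min + P2))
        L[:, x] = cost[:, x] + best - prev_min           # subtract min_k L_r(p-r, k)
    return L
```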
ELAS: Efficient Large Scale Stereo Matching
• Prior model;
• Likelihood model;
• The posterior can be factorized by Bayes' rule;
• The likelihood is calculated along the epipolar line;
• Disparity estimation as MAP inference;
• Equivalent to minimizing an energy function;
(A mean function links the support points and the observations.)
Stereo as Energy Minimization
• Find disparities d that minimize an energy function
• Simple pixel/window matching: $C(x, y, d)$ = SSD distance between windows $I(x, y)$ and $J(x, y + d(x, y))$
[Figure: the disparity space image (DSI) $C(x, y, d)$ for one scanline (y = 141), with x and d as axes.]
• Choose the minimum of each column in the DSI independently.
Dynamic Programming (DP) in Stereo Matching
• Can minimize E(d) independently per scanline using dynamic programming (DP);
[Figure: DP matching grid between the left and right scanlines, with match cost $C_{corr}$ for sequential moves and occlusion cost $C_{occl}$ for left- and right-occluded moves.]
Three cases:
• Sequential – cost of match
• Left occluded – cost of no match
• Right occluded – cost of no match
• DP yields the optimal path through grid,
the best set of matches for the ordering
constraint in scan-line stereo.
• Graph Cut
• Delete enough edges so that
• each pixel is connected to exactly one label node
• Cost of a cut: sum of deleted edge weights
• Finding min cost cut equivalent to finding global minimum of energy
function
Energy Minimization via Graph Cuts
[Figure: graph with pixel nodes and label nodes for disparities d1, d2, d3; edge weights on n-links and t-links encode the costs.]
• What defines a good stereo correspondence?
• 1. Match quality
• Want each pixel to find a good match in the other image
• 2. Smoothness
• If two pixels are adjacent, they should (usually) move
about the same amount
$$E(d) = \underbrace{\sum_p C(p, d_p)}_{\text{match cost}} + \underbrace{\sum_{(p,q) \in \mathcal{N}} V(d_p, d_q)}_{\text{smoothness cost}}$$

where $V$ is, e.g., the Potts model $V(d_p, d_q) = \mathbf{1}[d_p \neq d_q]$ or an L1 distance $V(d_p, d_q) = |d_p - d_q|$.
Graph cut: convert the multi-way cut into a sequence of binary cuts.
Model Stereo Vision by MRF and Solution by Belief Propagation
• Allows rich probabilistic models for images.
• But built in a local, modular way. Learn local relationships, get global effects out.
[Figure: MRF for stereo — a grid of hidden disparity nodes $x_i$, linked to neighboring disparity nodes by a disparity-disparity compatibility function $\Psi$ and to the local image observations $y_i$ by an image-disparity compatibility function $\Phi$.]

$$P(\{x\}, \{y\}) = \frac{1}{Z} \prod_{(i,j)} \Psi(x_i, x_j) \prod_i \Phi(x_i, y_i)$$
BELIEFS: approximate posterior marginal distributions over the neighborhood of node i;
MESSAGES: approximate sufficient statistics;
I. Belief update (message product);
II. Message propagation (convolution).
Hierarchical Belief Propagation (HBP) and Constant Space HBP
• HBP works in a coarse-to-fine manner;
• (a) initialize the messages at the coarsest level to all zeros;
• (b) apply BP at the coarsest level to iteratively refine the messages;
• (c) use refined messages from the coarser level to initialize the messages for the next level.
• Constant-space HBP relies on the fact that only a small number of disparity levels and the corresponding
message values are needed at each pixel to losslessly reconstruct the BP messages;
• Apply the coarse-to-fine (CTF) scheme to both spatial and depth domain, i.e. gradually reduce the number of
disparity levels as the messages propagate in CTF;
• Re-computes the data term at each level (not at each iteration);
• About 9/8 times slower, but memory does not grow with the maximum disparity;
• Energy computed only once at the finest level;
• Gradually reduce the disparity levels in CTF.
• The closer the messages are to the fixed points, the fewer the required disparity levels; Then, CSBP refines the
messages hierarchically to approach the fixed points.
Phase Matching in Frequency or WT Domain
• Phase reflects the structure information of the signal and inhibit the HF noise effect;
• Phase singularity is a problem;
• Local phase information as the primitive;
• Wavelet transform builds a hierarchical framework for multi-level coarse-to-fine processing;
• Stereo matching (disparity) with phase separation and instantaneous frequency of signals:
• Dynamic programming (DP) is used for global optimization (occlusion handling) in stereo matching;
• Phase is not uniformly stable;
• Smoothness constraints;
• Discontinuities detection;
• Multiple resolution solution:
• 1. top level: control points with feature matching, apply DP;
• 2. middle level: interpolation, apply DP;
• 3. bottom level: sub-pixel precision.
[Figures: left/right images, local phase, and the resulting disparity; results for the original image, phase matching, and phase matching with DP.]
Disparity/Depth Refinement
• Sub-pixel refinement: real valued disparities may be obtained by approximating the cost
function locally using a parabola;
• Left-Right Consistency Check: outlier detection by difference;
• By computing a disparity for every pixel of the left image (left to right);
• by computing a disparity for every pixel of the right image (right to left);
• Segmentation can be used for outlier identification.
• Occlusion filling:
• Occlusion detection;
• Background expansion;
• Inpainting.
• Discontinuities smoothing:
• Bilateral filtering.
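Two of these refinement steps are easy to make concrete; a hedged NumPy sketch of parabola sub-pixel interpolation and the left-right consistency check (1-pixel threshold assumed):

```python
import numpy as np

def subpixel(costs, d):
    """Fit a parabola through the cost minimum and its two neighbors."""
    if d == 0 or d == len(costs) - 1:
        return float(d)
    c0, c1, c2 = costs[d - 1], costs[d], costs[d + 1]
    denom = c0 - 2.0 * c1 + c2
    return d + 0.5 * (c0 - c2) / denom if denom > 0 else float(d)

def lr_check(disp_l, disp_r, thresh=1.0):
    """Mark pixels whose left->right and right->left disparities disagree."""
    h, w = disp_l.shape
    xs = np.arange(w)
    ok = np.zeros_like(disp_l, dtype=bool)
    for y in range(h):
        xr = np.clip(xs - disp_l[y].astype(int), 0, w - 1)
        ok[y] = np.abs(disp_l[y] - disp_r[y, xr]) <= thresh
    return ok   # False marks occlusions/outliers for later filling
```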
Multiple Cameras
Multi-baseline stereo
use the third view to verify depth estimates
Spatio-Temporal Video Disparity Estimation
• A key problem in extending stereo to video is flickering;
• Typical methods:
• Spatial temporal consistency: smoothing in the space-time volume;
• Post-processing of disparity maps by applying a median filter along the flow fields;
• Spatial-temporal cost aggregation and solved by local/global optimization methods;
• Joint disparity and flow estimation;
• SGM-based, as an instance;
• Modeled with MRF and solved by global optimization.
• Scene flow: 2D motion field along with 1D disparity change field.
• Dense methods are computationally very expensive;
• Sparse methods rely heavily on the success of the initial sparse correspondences.
Sparse Coding
Sparse coding (Olshausen & Field, 1996).
Originally developed to explain early visual processing in the brain (edge detection).
Objective: given a set of input data vectors $\{x_i\}$, learn a dictionary of
bases $\{d_k\}$ such that:
Each data vector is represented as a sparse linear combination of bases.
Sparse: mostly zeros
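The objective on the original slide is not reproduced here; a standard L1-penalized form of it is:

$$\min_{D,\, \{\alpha_i\}} \ \sum_i \left\| x_i - D \alpha_i \right\|_2^2 + \lambda \sum_i \left\| \alpha_i \right\|_1 \quad \text{s.t.} \quad \|d_k\|_2 \le 1 \ \ \forall k$$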
Predictive Sparse Coding
Recall the objective function for sparse coding:
Modify by adding a penalty for prediction error:
◦ Approximate the sparse code with an encoder
PSD for hierarchical feature training
◦ Phase 1: train the first layer;
◦ Phase 2: use encoder + absolute value as 1st feature extractor
◦ Phase 3: train the second layer;
◦ Phase 4: use encoder + absolute value as 2nd feature extractor
◦ Phase 5: train a supervised classifier on top layer;
◦ Phase 6: optionally train the whole network with supervised BP.
Methods of Solving Sparse Coding
Greedy methods: projecting the residual on some atom;
◦ Matching pursuit, orthogonal matching pursuit;
L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);
◦ The residual is updated iteratively in the direction of the atom;
Gradient-based finding new search directions
◦ Projected Gradient Descent
◦ Coordinate Descent
Homotopy: a set of solutions indexed by a parameter (regularization)
◦ LARS (Least Angle Regression)
First order/proximal methods: Generalized gradient descent
◦ solving efficiently the proximal operator
◦ soft-thresholding for L1-norm
◦ Accelerated by the Nesterov optimal first-order method
Iterative reweighting schemes
◦ L2-norm: Chartrand and Yin (2008)
◦ L1-norm: Candès et al. (2008)
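As a concrete instance of the proximal approach above, a minimal ISTA sketch: a gradient step on the data term followed by soft-thresholding, the proximal operator of the L1 norm:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam=0.1, n_iter=100):
    """Solve min_a ||x - D a||^2 + lam * ||a||_1 by iterative shrinkage."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)             # gradient of the data term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```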
Strategy of Dictionary Selection
• What D to use?
• A fixed overcomplete set of basis: no adaptivity.
• Steerable wavelet;
• Bandlet, curvelet, contourlet;
• DCT Basis;
• Gabor function;
• ….
• Data adaptive dictionary – learn from data;
• K-SVD: a generalized K-means clustering process for Vector Quantization (VQ).
• An iterative algorithm to effectively optimize the sparse approximation of signals in a learned
dictionary.
• Other methods of dictionary learning:
• non-negative matrix decompositions.
• sparse PCA (sparse dictionaries).
• fused-lasso regularizations (piecewise constant dictionaries)
• Extending the models: Sparsity + Self-similarity=Group Sparsity
Learning Sparse Representation in Depth Maps
• Sparse representations learned from
Middlebury database disparity maps;
• Then they are exploited in a two-layer
graphical model for inferring depth from
stereo, by including a sparsity prior on
the learned features;
◦ The first layer is solved using an existing MRF-based stereo matching algorithm;
◦ The second layer is solved using the non-stationary sparse coding algorithm.
Learning Sparse Representation in Depth Maps
[Figure: disparity results — (c) graph cut vs. (d) graph cut + sparse coding.]
Deep Learning
Representation learning attempts to automatically learn good features or representations;
Deep learning algorithms attempt to learn multiple levels of representation of increasing
complexity/abstraction (intermediate and high level features);
Become effective via unsupervised pre-training + supervised fine tuning;
◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than
shallow networks.
Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);
Semi-supervised: structure of manifold assumption;
◦ labeled data is scarce and unlabeled data is abundant.
Why Deep Learning?
Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization
problem);
◦ Learn prior from unlabeled data;
Shallow models are not for learning high-level abstractions;
◦ Ensembles or forests do not learn features first;
◦ Graphical models could be deep net, but mostly not.
Unsupervised learning could be “local-learning”;
◦ Resemble boosting with each layer being like a weak learner
Learning is weak in directed graphical models with many hidden variables;
◦ Sparsity and regularizer.
Traditional unsupervised learning methods cannot easily learn multiple levels of
representation;
◦ Layer-wise unsupervised learning is the solution.
Multi-task learning (transfer learning and self taught learning);
Other issues: scalability & parallelism with the burden from big data.
Multi Layer Neural Network
A neural network = running several logistic regressions at the same time;
◦ Neuron=logistic regression or…
Calculate error derivatives (gradients) to refine: back propagate the error derivative through model
(the chain rule)
◦ Online learning: stochastic/incremental gradient descent
◦ Batch learning: conjugate gradient descent
Problems in MLPs
Multi-Layer Perceptrons (MLPs), a classic feed-forward neural network, were popular for decades.
Gradient is progressively getting more scattered
◦ Below the top few layers, the correction signal is minimal
Gets stuck in local minima
◦ Especially start out far from ‘good’ regions (i.e., random initialization)
In usual settings, use only labeled data
◦ Almost all data is unlabeled!
◦ Instead the human brain can learn from unlabeled data.
Convolutional Neural Networks
CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized
neural input;
◦ local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often, spatial
or temporal sub-sampling;
◦ Related to generative MRF/discriminative CRF:
◦ CNN=Field of Experts MRF=ML inference in CRF;
◦ Generate ‘patterns of patterns’ for pattern recognition.
Each layer combines (merge, smooth) patches from previous layers
◦ Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.
◦ Convolution filters: (translation invariance) unsupervised;
◦ Local contrast normalization: increase sparsity, improve optimization/invariance.
C layers convolutions,
S layers pool/sample
Convolutional Neural Networks
Convolutional Networks are trainable multistage architectures composed of multiple stages;
Input and output of each stage are sets of arrays called feature maps;
At output, each feature map represents a particular feature extracted at all locations on input;
Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;
A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
◦ A fully connected layer: softmax transfer function for posterior distribution.
Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;
Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;
◦ In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N;
Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;
Supervised training is performed using a form of SGD to minimize the prediction error;
◦ Gradients are computed with the back-propagation method.
Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.
* is discrete convolution operator
LeNet (LeNet-5)
A layered model composed of convolution and subsampling operations followed by a holistic representation
and ultimately a classifier for handwritten digits;
Local receptive fields (5x5) with local connections;
Output via an RBF function, one for each class, with 84 inputs each;
Learning by Graph Transformer Networks (GTN);
AlexNet
A layered model composed of convol., subsample., followed by a holistic
representation and all-in-all a landmark classifier;
Consists of 5 convolutional layers, some of which followed by max-pooling
layers, 3 fully-connected layers with a final 1000-way softmax;
Fully-connected layers: linear classifiers/matrix multiplications;
ReLU are rectified-linear nonlinearities on layer output, can be trained
several times faster;
Local (contrast) normalization scheme aids generalization;
Overlapping pooling slightly less prone to overfitting;
Data augmentation: artificially enlarge the dataset using label-preserving
transformations;
Dropout: setting the output of each hidden neuron to zero with prob. 0.5;
Trained by SGD with batch # 128, momentum 0.9, weight decay 0.0005.
The network’s input is 150,528-dimensional, and the number of neurons in the network’s
remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.
MattNet
Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;
Preprocessing: subtracting a per-pixel mean;
Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the image and
randomly flipped horizontally to provide more views of each example;
SGD with mini-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting;
65M parameters trained for 12 days on a single Nvidia GPU;
Visualization by layered DeconvNets: project the feature activations back to the input pixel space;
◦ Reveal input stimuli exciting individual feature maps at any layer;
◦ Observe evolution of features during training;
◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important;
DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve structure;
Multiple such models were averaged together to further boost performance;
Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). # 1-5
layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i) via a rectified linear function, (ii)
3x3 max pooled (stride 2), (iii) contrast normalized 55x55 feature maps. # 6-7 layers: fully connected, input in
vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C - number of classes.
Top: A deconvnet layer (left) attached to
a convnet layer (right). The deconvnet
will reconstruct approximate version of
convnet features from the layer beneath.
Bottom: Unpooling operation in the
deconvnet, using switches which record
the location of the local max in each
pooling region (colored zones) during
pooling in the convnet.
Oxford VGG Net: Very Deep CNN
Networks of increasing depth using an architecture with very small (3×3) convolution filters;
◦ Spatial pooling is carried out by 5 max-pooling layers;
◦ A stack of convolutional layers followed by three Fully-Connected (FC) layers;
◦ All hidden layers are equipped with the rectification ReLU non-linearity;
◦ No Local Response Normalisation!
Trained by optimising the multinomial logistic regression objective using SGD;
Regularised by weight decay and dropout regularisation for the first two fully-connected layers;
The learning rate was initially set to $10^{-2}$, and then decreased by a factor of 10;
For random initialisation, sample the weights from a normal distribution;
Derived from the publicly available C++ Caffe toolbox, allow training and evaluation on multiple GPUs
installed in a single system, and on full-size (uncropped) images at multiple scales;
Combine the outputs of several models by averaging their soft-max class posteriors.
The depth of the configurations increases from the left (A) to the
right (E), as more layers are added (the added layers are shown in
bold). The convolutional layer parameters are denoted as
“conv<receptive field size> - <number of channels>”. The ReLU
activation function is not shown for brevity.
GoogleNet
Questions:
◦ Vanishing gradient?
◦ Exploding gradient?
◦ Tricky weight initialization?
Deep convolutional neural network architecture codenamed Inception;
◦ Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and
covered by readily available dense components;
◦ Judiciously applying dimension reduction and projections wherever the computational requirements would increase
too much otherwise;
Increasing the depth and width of the network but keeping the computational budget constant;
◦ Drawbacks: Bigger size typically means a larger number of parameters, which makes the enlarged network more
prone to overfitting and the dramatically increased use of computational resources;
◦ Solution: From fully connected to sparsely connected architectures, analyze the correlation statistics of the
activations of the last layer and clustering neurons with highly correlated outputs.
◦ Based on the well known Hebbian principle: neurons that fire together, wire together;
Trained using the DistBelief: distributed machine learning system.
Problems with training deep architectures?
[Figure: the Inception module (with dimension reductions) — convolution, pooling, softmax and other blocks; the full network stacks 9 Inception modules, a "network in a network in a network".]
PReLU Networks at MSR
A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit;
◦ PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk;
◦ Allows negative activations on the ReLU function with a control parameter a learned adaptively;
◦ Resolve diminishing gradient problem for very deep neural networks (> 13 layers) ;
Derive a robust initialization method better than “Xavier” (normalization) initialization;
Also use Spatial Pyramid Pooling (SPP) layer just before the fully connected layers;
Can train extremely deep rectified models and investigate deeper or wider network architectures;
[Figure: ReLU vs. PReLU.] Note: μ is momentum, ϵ is the learning rate.
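The activation itself is one line; a NumPy sketch (0.25 is a common initialization for the learned slope a):

```python
import numpy as np

def prelu(y, a=0.25):
    """f(y) = y for y > 0, a * y otherwise; a is learned per channel."""
    return np.where(y > 0, y, a * y)
```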
PReLU Networks at MSR
Performance: 4.94% top-5 test error on the ImageNet 2012 Classification dataset;
◦ ILSVRC 2014 winner (GoogLeNet, 6.66%);
Adopt the momentum method in BP training;
Mostly initialized by random weights from Gaussian distr.;
Investigate the variance of the FP responses in each layer;
Consider a sufficient condition in BP:
◦ The gradient is not exponentially large/small.
[Figure: architectures of the large PReLU network models.]
Batch Normalization at Google
Normalizing layer inputs for each mini-batch to handle saturating
nonlinearities and covariate shift;
◦ Internal Covariate Shift (ICS): the change in the distribution of network activations
due to the change in network parameters during training;
◦ Whitening to reduce ICS: linear transform to have zero means and unit variances, and
decorrelated;
◦ Fix the means and variance of layer inputs (instead of whitening jointly the features in
both I/O);
◦ Batch normalizing transform applied for activation over a mini-batch;
◦ BN transform is differentiable transform introducing normalized activations into the
network;
Batch normalized networks
◦ Unbiased variance estimate;
◦ Moving average;
Batch normalized ConvNets
◦ Effective mini-batch size;
◦ Per feature, not per activation.
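A minimal sketch of the batch-normalizing transform for one mini-batch (x of shape (batch, features); gamma and beta are the learned scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift
```

At inference time, running (moving) averages of mu and var replace the mini-batch statistics.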
Batch Normalization at Google
Reduce the dependence of gradients on the scale of the parameters or of the initial values;
◦ Prevent small changes from amplifying into larger and suboptimal changes in activation in gradients;
◦ Stabilize the parameter growth and make gradient propagation better behaved in BN training;
In some cases, eliminate the need of dropout as a regularizer;
◦ In ImageNet Classification, remove local response normalization and reduce photometric distortions;
◦ Reach 4.9% in top-five validation error and 4.8% test error (human raters only 5.1%).
Accelerating BN network:
◦ Enable larger learning rate and less care about initialization, which accelerates the training;
◦ Reduce L2 weight regularization;
◦ Accelerate the learning rate decay.
Batch Normalization at Google
Inception architecture
Neural Turing Machines
A Neural Turing Machine (NTM) architecture contains two basic components: a neural
network controller and a memory bank;
◦ During each update cycle, the controller network receives inputs from an external
environment and emits outputs in response;
◦ It also reads from and writes to a memory matrix via a set of parallel read and write heads.
These weightings arise by combining two addressing mechanisms with complementary
facilities;
◦ “content-based addressing”: focuses attention on locations based on the similarity between their
current values and values emitted by the controller;
◦ “location-based addressing”: the content of a variable is arbitrary, but the variable still needs a
recognizable name or addresses, by location, not by content;
Controller network: feed forward or recurrent.
Neural Turing Machines
Neural Turing Machine Architecture.
Flow Diagram of the Addressing Mechanism.
Highway Networks: Information Highway
Ease gradient-based training of very deep networks;
Allow unimpeded info. flow across several layers on information highways;
Use gating units to learn regulating the flow of info. through a network;
A highway network consists of multiple blocks such that the ith block computes a block
state Hi(x) and transform gate output Ti(x);
Highway networks with hundreds of layers can be trained directly using SGD and with a
variety of activation functions.
$$y = H(x) \cdot T(x) + x \cdot C(x), \qquad C = 1 - T$$
where $T$ is the transform gate and $C$ the carry gate.
Deep Residual Learning for Image Recognition
Reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning
unreferenced functions;
◦ Denote the desired underlying mapping as H(x); the stacked nonlinear layers then fit another mapping F(x) = H(x) - x;
◦ The formulation of F(x)+x can be realized by feed forward NN with “shortcut connections” (such as “Highway
Network” and “Inception”);
These residual networks are easier to optimize, and can gain accuracy from considerably increased depth;
An ensemble of 152-layer residual nets achieves 3.57% error on the ImageNet test set;
◦ 224x224 crops, per-pixel mean subtracted, color augmentation, batch normalization;
◦ SGD with a mini-batch size of 256, learning rate starting at 0.1 and divided by 10 at plateaus;
◦ Weight decay of 0.0001 and a momentum of 0.9, no dropout;
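A hedged PyTorch sketch of a basic residual block realizing F(x) + x (channel counts and layer choices are illustrative):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)           # identity shortcut: F(x) + x
```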
Rethink Inception Architecture for Computer Vision
Scale up networks in ways that aim at utilizing the added computation efficiently by factorized convolutions
and aggressive regularization;
Design principles in Inception:
◦ Avoid representational bottlenecks, especially early in the network;
◦ Higher dimensional representations are easier to process locally within a network;
◦ Spatial aggregation over lower dim embeddings w/o loss in representational power;
◦ Balance the width and depth of the network.
Factorizing convolutions with large filter size: asymmetric convolutions;
Auxiliary classifiers: act as regularizer, esp. batch normalized or dropout;
Grid size reduction: two parallel stride 2 blocks (pooling and activation) ;
Model regularization via label smoothing: marginalized effect of dropout;
Trained with TensorFlow: SGD with 50 replicas, batch size 32 for 100 epochs, a learning rate of 0.045
with an exponential decay rate of 0.94, and a decay of 0.9.
Rethink Inception Architecture for Computer Vision
Inception modules after the factorization of the nxn
convolutions. In the proposed architecture, it chooses
n = 7 for the 17x17 grid.
Inception modules with expanded
the filter bank outputs.
Inception modules where
each 5x5 convolution is
replaced by two 3x3
convolutions.
Rethink Inception Architecture for Computer Vision
Auxiliary classifier on top
of the last 17x17 layer Inception module that reduces the grid-size while
expands the filter banks. It is both cheap and avoids
the representational bottleneck.
The outline of the proposed
network architecture
Belief Nets
Belief net is a directed acyclic graph composed of stochastic variables.
Can observe some of the variables and solve two problems:
◦ inference: Infer the states of the unobserved variables.
◦ learning: Adjust the interactions between variables to more likely generate the observed data.
[Figure: a belief net — stochastic hidden causes with directed connections to visible effects; such nets are composed of layers of stochastic variables with weighted connections.]
Boltzmann Machines
Energy-based models associate an energy to each configuration of the stochastic variables of interest (for
example, MRF, nearest neighbor);
◦ Learning means adjustment of the low energy function’s shape properties;
Boltzmann machine is a stochastic recurrent model with hidden variables;
◦ Monte Carlo Markov Chain, i.e. MCMC sampling (appendix);
Restricted Boltzmann machine is a special case:
◦ Only one layer of hidden units;
◦ factorization of each layer’s neurons/units (no connections in the same layer);
Contrastive divergence: approximation of gradient (appendix).
For an RBM: probability $p(v,h) = e^{-E(v,h)}/Z$; energy function $E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j$; learning rule $\Delta W_{ij} \propto \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}$.
Deep Belief Networks
A hybrid model: can be trained as generative or
discriminative model;
Deep architecture: multiple layers (learn features
layer by layer);
◦ Multi-layer learning is difficult in sigmoid belief networks.
◦ Top two layers are undirected connections, RBM;
◦ Lower layers get top down directed connections
from layers above;
Unsupervised or self-taught pre-learning provides
a good initialization;
◦ Greedy layer-wise unsupervised training for
RBM
Supervised fine-tuning
◦ Generative: wake-sleep algorithm (Up-down)
◦ Discriminative: back propagation (bottom-up)
Deep Boltzmann Machine
Learning internal representations that become increasingly complex;
High-level representations built from a large supply of unlabeled inputs;
Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann
machine (undirected graph);
Generative fine-tuning: different from DBN
◦ Positive and negative phase (appendix)
Discriminative fine-tuning: the same to DBN
◦ Back propagation.
Denoising Auto-Encoder
Multilayer NNs with target output=input;
Reconstruction=decoder(encoder(input));
◦ Perturbs the input x to a corrupted version;
◦ Randomly sets some of the coordinates of input to zeros.
◦ Recover x from encoded perturbed data.
Learns a vector field towards higher probability regions;
Pre-trained with DBN or regularizer with perturbed training data;
Minimizes variational lower bound on a generative model;
◦ corresponds to regularized score matching on an RBM;
PCA=linear manifold=linear Auto Encoder;
Auto-encoder learns the salient variation like a nonlinear PCA.
Stacked Denoising Auto-Encoder
Stack many (may be sparse) auto-encoders in succession and train them using greedy layer-wise
unsupervised learning
◦ Drop the decode layer each time
◦ Performs better than stacking RBMs;
Supervised training on the last layer using final features;
(option) Supervised training on the entire network to fine-tune all weights of the neural net;
Empirically not quite as accurate as DBNs.
Stereopsis via Deep Learning
• Learn a binocular cross-correlation model: use two quadrature pairs to detect disparity;
◦ Various filters correspond to phases, positions and frequencies;
• Disparity as a latent variable: a pattern of matching filter responses;
◦ A joint probabilistic model over patch pairs and disparity, defined as a Boltzmann machine;
◦ Training amounts to finding the parameters that maximize the log-probability over pairs;
◦ An RBM is used in this case;
◦ During inference, each latent variable receives activity
from exactly two products of matched filter responses.
Stereopsis via Deep Learning
[Figure: example training data — rows 1–3 show rendered image planes for the left/right cameras (in row 3 the right camera is rotated by 45° around the z axis); images are rendered from the depth maps in row 4 and a randomly selected texture map from the Berkeley Segmentation Database.]
Example pairs from NORB-cluttered dataset. Learned binocular filter pairs.
Unsupervised Learning of Depth (and Motion)
• Learning about the interrelations between images from multiple cameras, multiple
frames in a video, or the combination of both;
• Depth and motion in a feature learning architecture based on the energy model;
• A single-layer autoencoder model uses multiplicative interactions to detect synchrony,
with a pooling layer independently trained on the hidden responses to achieve content
invariance;
• Depth as a latent variable in learning:
• Reconstruction error;
• Contraction as regularization;
• Complete objective function: reconstruction error plus the contraction penalty;
Note: there is no need for rectification, since
the model can learn any transformation
between the frames, not just horizontal shifts.
Unsupervised Learning of Depth (and Motion)
• Extension to stereo sequences: both depth and motion;
 Encoding depth:
 Encoding motion:
 Multiview disparity:
[Figure: model architectures — representations of depth, motion and disparity, each built by pooling over products of frame responses.]
Unsupervised Learning of Depth (and Motion)
Filters learned on stereo patch pairs from KITTI dataset.
Example of a filter pair learned on sequences by the
SAE-D model from the Hollywood3D dataset.
Stereo Matching by CNN
• Train a convolutional neural network on pairs of small image patches;
• The network output is used to initialize the matching cost btw a pair of patches;
• Eight layers, L1 through L8 with input as 9x9 gray patch and matching cost as output;
• 1st layer as convolutional only and other layers are fully connected.
• Rectified linear units follow each layer, except L8, but NO pooling!
• Trained with SGD (batch size 128), on 194 image pairs, with 45 million extracted examples.
• Matching costs are combined between neighboring pixels with similar image
intensities using cross-based cost aggregation;
• Smoothness constraints are enforced by semi-global matching (SGM) and a left-right
consistency check is used to detect and eliminate errors in occluded regions;
• sub-pixel enhancement and median filter + bilateral filter -> final disparity map;
• Achieves an error rate of 2.61% on the KITTI stereo benchmark (vs. 2.83% for the previous best).
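A hedged PyTorch sketch of such a patch-matching network (the layer widths below are illustrative, not the paper's exact ones; input is a stacked 9x9 left/right patch pair):

```python
import torch
import torch.nn as nn

class MatchNet(nn.Module):
    def __init__(self, n_hidden=200):
        super().__init__()
        self.conv = nn.Conv2d(2, 32, kernel_size=5)     # L1: convolutional layer
        self.fc = nn.Sequential(                        # remaining fully connected layers
            nn.Linear(32 * 5 * 5, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 1))                     # matching cost; no ReLU, no pooling

    def forward(self, patch_pair):                      # (N, 2, 9, 9)
        x = torch.relu(self.conv(patch_pair)).flatten(1)
        return self.fc(x)
```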
Stereo Matching by CNN
[Figure: the support region used by cross-based cost aggregation.]
A Deep Embedding Model for Stereo Matching Costs
This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding
image patches, and maps intensity values into an embedding feature space to measure pixel dissimilarities;
Features are extracted in a pair of patches at different scales, followed
by an inner product to obtain the matching scores, then the scores
from different scales are then merged for an ensemble.
The deployed network architecture of the testing model for deep embedding.
Features are extracted from the two images only once; the sliding-window-style
inner products can be grouped into a single matrix operation.
Improved Stereo Matching with Constant Highway
Networks and Reflective Confidence Learning
A 3-step pipeline for the stereo matching problem and a highway network architecture for computing the
matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a
hybrid loss that supports multilevel comparison of image patches.
A post-processing step employs a second deep convolutional neural network for pooling global
information from multiple disparities.
It outputs both the image disparity map, which replaces the conventional “winner takes all” strategy, and a
confidence in the prediction.
The confidence score is achieved by training the network with a reflective loss.
The learned confidence is employed to better detect outliers in the refinement.
Improved Stereo Matching with Constant Highway
Networks and Reflective Confidence Learning
The λ-ResMatch architecture of the matching cost network
Improved Stereo Matching with Constant Highway
Networks and Reflective Confidence Learning
The Global disparity network model for representing disparity patches
Efficient Deep Learning for Stereo Matching
• A matching network which is able to
produce very accurate results in less than a
second of GPU computation;
• A product layer which simply computes the
inner product between the two representations of
a Siamese architecture;
• Treats the disparity estimation problem as
multi-class classification, where the classes are
all possible disparities.
A Siamese network extracts marginal distributions
over all possible disparities for each pixel.
four-layer Siamese network
architecture
The code and data are available online at:
http://www.cs.toronto.edu/deepLowLevelVision.
End-to-End Learning of Geometry and
Context for Deep Stereo Regression
A deep learning architecture for regressing disparity from a rectified pair
of stereo images.
Leverage knowledge of the problem’s geometry to form a cost volume
using deep feature representations.
Learn to incorporate contextual information using 3-D convolutions over
this volume.
Disparity values are regressed from the cost volume using a differentiable soft
argmin operation, which allows training end-to-end to sub-pixel accuracy
without any additional post-processing or regularization.
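The soft argmin is compact enough to state directly; a hedged PyTorch sketch, assuming a cost volume of shape (N, D, H, W):

```python
import torch

def soft_argmin(cost_volume):
    """Softmax over negated costs, then the expected disparity (differentiable)."""
    prob = torch.softmax(-cost_volume, dim=1)           # per-pixel disparity distribution
    d = torch.arange(cost_volume.shape[1],
                     dtype=prob.dtype, device=prob.device).view(1, -1, 1, 1)
    return (prob * d).sum(dim=1)                        # sub-pixel disparity map (N, H, W)
```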
End-to-End Learning of Geometry and
Context for Deep Stereo Regression
End-to-end deep stereo regression architecture, GC-Net (Geometry and Context Network)
End-to-End Training of Hybrid
CNN-CRF Models for Stereo
• Convolutional neural networks (CNNs) + optimization-based approaches for stereo estimation;
• The optimization, posed as a conditional random field (CRF), takes local matching costs and consistency-
enforcing (smoothness) costs as inputs, both estimated by CNN blocks;
• Inference in the CRF is based on a linear programming relaxation with a fixed number of iterations;
• Training end-to-end: in the discriminative formulation (structured SVM), training is practically feasible;
• The optimization part efficiently replaces post-processing steps with a trainable, well-understood model.
A CNN, called Unary-CNN, computes features of the two
images for each pixel. The features are compared using a
Correlation layer. The resulting matching cost volume becomes
the unary cost of the CRF. The pairwise costs of the CRF are
parametrized by edge weights, which can either follow a usual
contrast sensitive model or estimated by the Pairwise-CNN.
End-to-End Training of Hybrid
CNN-CRF Models for Stereo
[Equations/figure: the cross-correlation of features φ0 and φ1; the CRF model optimizes a labeling cost composed of matching costs and pairwise terms.]
Appendix A:
Depth from Single Image by Learning
Learning-based Depth from Image
Initial over-segmentation (super pixels);
Markov Random Field (MRF) to infer patch’s orientation and location from image features (texture, color and
gradient);
◦ Connected, co-planar or colinear as prior;
◦ Occlusion boundaries /folds indication;
◦ Multi-conditional learning; solved by linear program;
MRF overlaid on “super pixels”
Occlusion/fold
Coplanarity and Colinearity
Single Image Depth Estimation From Predicted Semantic Labels
Semantic segmentation to guide the 3D reconstruction;
Works like holistic scene understanding:
◦ 1. Multi-class image labeling MRF for scene segmentation;
◦ 2. Depth estimation for each semantic class by learning (logistic regression);
◦ 3. Scene depth estimation by MRF (pixel or super-pixel) with potentials (learned boosted decision tree classifiers) and priors on
geometry (horizon prediction, vertical objects), pixel smoothness, super-pixel soft connectivity, co-planarity and
orientation.
semantically derived geometric constraints
Smoothed per-pixel log-depth prior for each semantic class with horizon rotated to center of image
Image semantic overlay ground truth depth measurements
Learning Depth from Examples
Two similar images are likely to have similar 3D structure (depth).
Nearest-neighbor (kNN) search: finding k image+depth pairs that are most similar to the query (histograms of
oriented gradients as feature);
Depth fusion: median filtering of the k depth fields;
Joint-bilateral depth filtering: smoothing of the median-fused depth.
[Figure: pipeline — k-NN search on the query, depth fusion and smoothing, depth output.]
Note: depth (disparity) warping via SIFT-flow in aligning with the query is omitted.
Depth Transfer for Monocular Video
k-NN search for candidate frames matching the query;
Depth changes are gradual frame-to-frame;
Moving objects are usually on the ground;
Candidates are warped with SIFT flow and regularized with smoothness and prior terms;
Is the computational cost worth it?
Depth Inference with MRF
To form a basis (dictionary) over the RGB and depth spaces, and represent depth maps by
a sparse linear combination of weights.
A prediction function is estimated between weight vectors in RGB to depth space to
recover depth maps from query images.
A final super-pixel post processor aligns depth maps with occlusion boundaries, creating
physically plausible results.
Scalable Exemplar Based Depth Transfer
Images with similar global depth profiles are clustered together in 2D using RGB pairwise
features (left); sparse positive descriptors on depth (right) are effective in grouping
images with similar depth profiles together.
Estimate a transformation T that maps points from one space to the other.
Learning to be a Depth Camera (Active Near-IR)
• Use hybrid classification-regression forests to learn how to map from near infrared
intensity images to absolute, metric depth in real-time;
• Simplify the problem by dividing it into sub-problems in the first layer, and then applies models
trained for these sub-problems in the second layer to solve the main problem efficiently;
• Restrict the depths of the object to a certain range for significant simplification;
• The first layer learns to infer a coarsely quantized depth range for each pixel, and optionally
pools these predictions across all pixels to obtain a more reliable distribution over these depth
ranges;
• The second layer then applies one or more expert regressors trained specifically on the inferred
depth ranges.
• Note: the forests do not need to explicitly model scene illumination, surface geometry and
reflectance, or complex inter-reflections, required by traditional SFS methods.
Learning to be a Depth Camera (Active Near-IR)
• Comparable to high-quality consumer depth cameras with a reduced cost, power
consumption, and form-factor.
Learning to be a Depth Camera (Active Near-IR)
• Applied for specific hand and face objects.
Appendix B:
Machine Learning and Optimization
Graphical Models
• Graphical Models: Powerful framework for representing dependency
structure between random variables.
• The joint probability distribution over a set of random variables.
• The graph contains a set of nodes (vertices) that represent random variables, and a set
of links (edges) that represent dependencies between those random variables.
• The joint distribution over all random variables decomposes into a product of
factors, where each factor depends on a subset of the variables.
• Two type of graphical models:
• Directed (Bayesian networks)
• Undirected (Markov random fields, Boltzmann machines)
• Hybrid graphical models that combine directed and undirected models, such as Deep
Belief Networks, Hierarchical-Deep Models.
Generative Model: MRF
Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi takes
value fi in a label set L.
Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it
satisfies Markov property.
◦ Generative model for joint probability p(x)
◦ allows no direct probabilistic interpretation
◦ define potential functions Ψ on maximal cliques A
◦ map joint assignment to non-negative real number
◦ requires normalization
MRFs are undirected graphical models.
Graph Cuts for Optimization
A flow network G(V, E) is defined as a fully connected directed graph
where each edge (u,v) in E has a positive capacity c(u,v) >= 0;
The max-flow problem is to find the flow of maximum value on a
flow network G;
A s-t cut or simply cut of a flow network G is a partition of V into S
and T = V-S, such that s in S and t in T;
A minimum cut of a flow network is a cut whose capacity is the
least over all the s-t cuts of the network;
Methods of max flow or mini-cut:
◦ Ford Fulkerson method;
◦ "Push-Relabel" method.
Mostly labeling is solved as an energy minimization problem;
Two common energy models:
◦ Potts Interaction Energy Model;
◦ Linear Interaction Energy Model.
Graph G contain two kinds of vertices: p-vertices and i-vertices;
◦ all the edges in the neighborhood N, called n-links;
◦ edges between the p-vertices and the i-vertices called t-links.
In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex;
The minimum cost multi-way cut will minimize the energy function where the severed n-links would
correspond to the boundaries of the labeled vertices;
The approximation algorithms to find this multi-way cut:
◦ "alpha-expansion" algorithm;
◦ "alpha-beta swap" algorithm.
Belief Propagation (BP)
• A simplified Bayes net: it propagates information throughout a graphical model via a series
of messages between neighboring nodes iteratively; it is likely to converge to a consensus that
determines the marginal probabilities of all the variables;
 messages estimate the cost (or energy) of a configuration of a clique given all other cliques;
then the messages are combined to compute a belief (marginal or maximum probability);
Two types of BP methods:
◦ max-product;
◦ sum-product.
BP provides exact solution when there are no loops in graph!
Equivalent to dynamic programming/Viterbi in these cases;
Loopy Belief Propagation: still provides approximate (but often good) solution;
Generalized BP for pairwise MRFs
◦ Hidden variables xi and xj are connected through a compatibility function;
◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function;
The joint probability is given by $P(\{x\},\{y\}) = \frac{1}{Z} \prod_{(i,j)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i, y_i)$
To improve inference by taking into account higher-order interactions among the
variables;
◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes;
◦ This is the intuition in Generalized Belief Propagation (GBP).
Stochastic Gradient Descent (SGD)
• The general class of estimators that arise as minimizers of sums are called M-
estimators;
• Their minimizers are stationary points of the likelihood function (zeroes of its derivative, the score
function);
• Online gradient descent samples a subset of summand functions at every step;
• The true gradient is approximated by a gradient at a single example;
• Shuffling of training set at each pass.
• There is a compromise between two forms, often called "mini-batches", where the
true gradient is approximated by a sum over a small number of training examples.
• SGD converges almost surely to a global minimum when the objective function
is convex or pseudo-convex, and otherwise converges almost surely to a local
minimum.
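A minimal mini-batch SGD loop for linear least squares, with the per-pass shuffling described above:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10, batch=32):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))              # shuffle each pass
        for s in range(0, len(X), batch):
            b = idx[s:s + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # mini-batch gradient
            w -= lr * grad
    return w
```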
Back Propagation
Minimize a loss such as the negative log-likelihood, e.g. $E(f(x_0, w), y_0) = -\log p(y_0 \mid f(x_0, w))$.
Variable Learning Rate
Too large learning rate
◦ cause oscillation in searching for the minimal point
Too slow learning rate
◦ too slow convergence to the minimal point
Adaptive learning rate
◦ At the beginning, the learning rate can be large when the current point is far from the
optimal point;
◦ Gradually, the learning rate will decay as time goes by.
Should not be too large or too small:
◦ annealing rate $\alpha(t) = \alpha(0) / (1 + t/T)$
◦ $\alpha(t)$ will eventually go to zero, but at the beginning it is almost a constant.
Variable Momentum
AdaGrad/AdaDelta
Dropout and Maxout for Overfitting
Dropout: set the output of each hidden neuron to zero w.p. 0.5.
◦ Motivation: Combining many different models that share parameters succeeds in reducing test
errors by approximately averaging together the predictions, which resembles the bagging.
◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not
participate in back propagation.
◦ So every time an input is presented, the NN samples a different architecture, but all these
architectures share weights.
◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence
of particular other units.
◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many
different random subsets of the other units.
◦ Without dropout, the network exhibits substantial overfitting.
◦ Dropout roughly doubles the number of iterations required to converge.
Maxout takes the maximum across multiple feature maps;
Weight Decay for Overfitting
Weight decay or L2 regularization adds a penalty term to the error function, a term called the
regularization term (the negative log prior in the Bayesian justification): E'(w) = E(w) + (λ/2)‖w‖²;
◦ Weight decay works as rescaling the weights in the learning rule, while bias learning stays the same;
◦ Prefers to learn small weights; large weights are allowed only if they improve the original cost function;
◦ A way of compromising between finding small weights and minimizing the original cost function;
In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
L1 regularization: weights that are not really useful shrink by a constant amount toward zero;
◦ Acts like a form of feature selection;
◦ Makes the input filters cleaner and easier to interpret;
L2 regularization penalizes large values strongly, while L1 regularization penalizes all values at the
same rate, tolerating a few large weights while driving small ones exactly to zero (both updates are sketched below);
Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distr. is the
posterior distribution for weights & hyper-parameters;
Hybrid Monte Carlo: gradient and sampling.
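A minimal sketch of how the two penalties enter the SGD update rule; the function names are illustrative.

```python
import numpy as np

# L2 weight decay in the SGD rule: the penalty's gradient lam*w rescales
# the weights by (1 - lr*lam) each step; bias learning stays the same.
def sgd_step_l2(w, b, grad_w, grad_b, lr=0.1, lam=1e-4):
    w = (1 - lr * lam) * w - lr * grad_w     # shrink weights, then descend
    b = b - lr * grad_b                      # no decay on biases
    return w, b

# L1 regularization instead shrinks every weight by a constant amount,
# pushing not-really-useful weights exactly to zero (feature selection).
def sgd_step_l1(w, grad_w, lr=0.1, lam=1e-4):
    return w - lr * grad_w - lr * lam * np.sign(w)
```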
Early Stopping for Overfitting
Steps in early stopping:
◦ Divide the available data into training and validation sets.
◦ Use a large number of hidden units.
◦ Use very small random initial values.
◦ Use a slow learning rate.
◦ Compute the validation error rate periodically during training.
◦ Stop training when the validation error rate "starts to go up".
Early stopping has several advantages:
◦ It is fast.
◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size.
◦ It requires only one major decision by the user: what proportion of validation cases to use.
Practical issues in early stopping:
◦ How many cases do you assign to the training and validation sets?
◦ Do you split the data into training and validation sets randomly or by some systematic algorithm?
◦ How do you tell when the validation error rate "starts to go up"?
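One common answer to the last question is a "patience" rule: stop once the validation error has failed to improve for several consecutive checks. A minimal sketch, where the rule and its constants are design choices rather than anything from the slides:

```python
import numpy as np

# Early-stopping skeleton with a "patience" rule for deciding when the
# validation error "starts to go up".
def train_with_early_stopping(step_fn, val_error_fn, max_steps=10000,
                              check_every=100, patience=5):
    best, best_step, bad_checks = np.inf, 0, 0
    for t in range(max_steps):
        step_fn()                          # one SGD update on training data
        if t % check_every == 0:
            err = val_error_fn()           # periodic validation error
            if err < best:
                best, best_step, bad_checks = err, t, 0
            else:
                bad_checks += 1            # error went up at this check
                if bad_checks >= patience:
                    break                  # stop: validation error rising
    return best_step, best                 # roll back to the best snapshot

# Toy usage: a scripted validation-error curve that bottoms out, then rises.
errs = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.9] + [1.0] * 100)
print(train_with_early_stopping(lambda: None, lambda: next(errs),
                                max_steps=2000, patience=3))
```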
MCMC Sampling for Optimization
Markov Chain: a stochastic process in which future states are independent of past states given the
present state.
◦ A Markov chain will typically converge to a stable distribution.
Markov Chain Monte Carlo: sampling using 'local' information
◦ Devise a Markov chain whose stationary distribution is the target.
◦ Ergodic MC must be aperiodic, irreducible, and positive recurrent.
◦ Monte Carlo Integration to get quantities of interest.
Metropolis-Hastings method: sampling from a target distribution
◦ Create a Markov chain whose transition matrix does not depend on the normalization term.
◦ Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio).
◦ After a sufficient number of iterations, the chain converges to the stationary distribution.
Gibbs sampling is a special case of M-H Sampling.
◦ The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distribution.
Hybrid Monte Carlo: gradient sub step for each Markov chain.
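A minimal Metropolis-Hastings sketch with a symmetric random-walk proposal; the unnormalized target density is a toy choice. Note the acceptance ratio uses only the unnormalized density, so the normalization term never appears.

```python
import numpy as np

# Metropolis-Hastings with a Gaussian random-walk proposal; only an
# unnormalized target density is needed.
def target_unnorm(x):
    return np.exp(-0.5 * x**2) * (1 + 0.5 * np.sin(3 * x))**2

def metropolis_hastings(n=10000, step=1.0, rng=np.random.default_rng(0)):
    x, samples = 0.0, []
    for _ in range(n):
        x_prop = x + step * rng.normal()        # symmetric proposal
        # accept ratio: the normalization term Z cancels out
        if rng.random() < target_unnorm(x_prop) / target_unnorm(x):
            x = x_prop
        samples.append(x)                       # keep state even on rejection
    return np.array(samples)

s = metropolis_hastings()
print(s.mean(), s.std())   # Monte Carlo estimates of target moments
```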
Mean Field for Optimization
Variational approximation modifies the optimization problem to be tractable, at the price of an
approximate solution;
Mean Field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (note: F is a
disconnected graph);
◦ The density becomes a factorized product distribution in this sub-family.
◦ Objective: K-L divergence.
Mean field is a structured variational approximation approach:
◦ Coordinate ascent (deterministic), as sketched below;
Compared with stochastic approximation (sampling):
◦ Faster, but maybe not exact.
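To make the coordinate-ascent updates concrete, here is a minimal mean-field sketch for a fully factorized q on a toy Ising-style model; the couplings, fields, and sweep count are illustrative.

```python
import numpy as np

# Mean field for q(x) = prod_i q_i(x_i) over binary x_i in {-1, +1},
# coordinate ascent on the variational bound. J: couplings, h: fields (toy).
rng = np.random.default_rng(0)
n = 6
J = rng.normal(scale=0.3, size=(n, n))
J = (J + J.T) / 2; np.fill_diagonal(J, 0)     # symmetric, no self-coupling
h = rng.normal(size=n)

m = np.zeros(n)                               # m_i = E_q[x_i]
for _ in range(50):                           # deterministic coordinate sweeps
    for i in range(n):
        m[i] = np.tanh(h[i] + J[i] @ m)       # fixed-point update for q_i
print(m)
```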
Contrastive Divergence for RBMs
Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a
quicker way to learn RBMs;
◦ Contrastive divergence as the new objective;
◦ Taking gradients and ignoring a term which is usually very small.
Steps:
◦ Start with a training vector on the visible units.
◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs
sampling);
CD learning is biased: the update it follows is not the gradient of any objective function;
Improvement: Persistent CD explores more modes in the distribution
◦ Rather than restarting from data samples, continue sampling from the chain states left by the last gradient
update.
◦ Still suffers from divergence of the likelihood when modes are missed.
Score matching: the score function does not depend on the normalization factor, so match the score of the
model with that of the empirical density.
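A minimal CD-1 sketch for a small binary RBM, following the steps above: one alternation of hidden/visible updates starting from a training vector, then the difference of correlations as the parameter update. The sizes, learning rate, and data are toy values.

```python
import numpy as np

# CD-1 for a small binary RBM: one Gibbs alternation from the data,
# then the difference of correlations as the gradient estimate.
rng = np.random.default_rng(0)
nv, nh, lr = 8, 4, 0.05
W = rng.normal(scale=0.1, size=(nv, nh)); a = np.zeros(nv); b = np.zeros(nh)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
data = (rng.random((100, nv)) < 0.3).astype(float)   # toy training vectors

for _ in range(50):
    for v0 in data:
        ph0 = sigmoid(v0 @ W + b)                    # positive phase: hidden
        h0 = (rng.random(nh) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + a)                  # reconstruct visibles
        v1 = (rng.random(nv) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b)                    # negative-phase statistics
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)
```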
“Wake-Sleep” Algorithm for DBN
Pre-trained DBN is a generative model;
Do a stochastic bottom-up pass (wake phase)
◦ Get samples from factorial distribution (visible first, then generate hidden);
◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
Do a few iterations of sampling in the top level RBM
◦ Adjust the weights in the top-level RBM.
Do a stochastic top-down pass (sleep phase)
◦ Get visible and hidden samples generated by the generative model, using data coming from nowhere!
◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
◦ Any guarantee of improvement? No!
The "Wake-Sleep" algorithm tries to make the representation economical, in the sense of Shannon's coding
theory.
Greedy Layer-Wise Training
Deep networks tend to have more local minima problems than shallow networks during
supervised training
Train the first layer using unlabeled data
◦ Unsupervised or semi-supervised: can use more unlabeled data.
Freeze the first layer parameters and train the second layer
Repeat this for as many layers as desired
◦ Build more robust features
Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)
Fine-tune the full network with a supervised approach (see the sketch after this list);
This avoids the problems of training a deep net from scratch in a purely supervised fashion:
◦ Each layer gets full learning
◦ Help with ineffective early layer learning
◦ Help with deep network local minima
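A minimal sketch of this recipe using stacked autoencoders in place of RBMs (an assumption for brevity); PyTorch is used, and the layer sizes, data, and training schedules are illustrative.

```python
import torch, torch.nn as nn

# Greedy layer-wise pretraining with stacked autoencoders, then fine-tuning.
X = torch.randn(512, 64)                       # unlabeled data (toy)
sizes = [64, 32, 16]
encoders, H = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.SGD([*enc.parameters(), *dec.parameters()], lr=0.1)
    for _ in range(200):                       # train this layer only
        opt.zero_grad()
        loss = ((dec(torch.sigmoid(enc(H))) - H) ** 2).mean()
        loss.backward(); opt.step()
    encoders.append(enc)
    H = torch.sigmoid(enc(H)).detach()         # freeze: features for next layer

# Supervised head on top, then fine-tune the whole network.
y = torch.randint(0, 3, (512,))                # toy labels
head = nn.Linear(sizes[-1], 3)
model = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders], head)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward(); opt.step()
```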
Why Greedy Layer-Wise Training Works?
Take advantage of the unlabeled data;
Regularization Hypothesis
◦ Pre-training is “constraining” parameters in a region relevant to unsupervised
dataset;
◦ Better generalization (representations that better describe unlabeled data are
more discriminative for labeled data);
Optimization Hypothesis
◦ Unsupervised training initializes the lower-level parameters near basins of better
minima than random initialization does.
Only fine tuning is needed in the supervised learning stage.
Two-Stage Pre-training in DBMs
Pre-training in one stage
◦ Positive phase: clamp observed, sample hidden, using variational approximation (mean-field)
◦ Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation:
MCMC)
Pre-training in two stages
◦ Approximate a posterior distribution over the states of hidden units (with a simpler directed deep model such as
a DBN or stacked DAE);
◦ Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding
posterior of hidden units;
◦ Options (CAST, contrastive divergence, stochastic approximation, …).
Passive stereo vision with deep learning

  • 1. Passive Stereo Vision: From Traditional to Deep Learning-based Methods YU HUANG SUNNYVALE, CALIFORNIA YU.HUANG07@GMAIL.COM
  • 2. Outline • Modeling from multiple views • Stereo matching • constraints in stereo vision • difficulties in stereo vision • pipeline of stereo matching • state of art methods • Quality metric of stereo matching • census transform and hamming distance • guided filter in cost aggregation (volume) • semi-global matching • ELAS: efficient large scale stereo • stereo matching as energy minimization • dynamic programming/graph cut/belief propagation • phase matching for stereo vision • disparity refinement • Multiple cameras/views • Learning sparse represent. of depth maps • Stereopsis via deep learning • Deep learning of depth (and motion) • Stereo matching by CNN • Constant Highway Networks and Reflective Confidence Learning; • Efficient Deep Learning for Stereo Matching; • E2e Learning of Geometry and Context; • Appendix A: Depth from an image by learning • Appendix B: Learning and optimization
  • 3. Modeling from Multiple Views in Computer Vision time # cameras photograph binocular stereo trinocular stereo multi-baseline stereo camcorder human vision camera dome two frames ... ...
  • 4. Binocular Stereo • Given a calibrated binocular stereo pair, fuse it to produce a depth image image 1 image 2 Dense depth map Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923
  • 5. Basic Stereo Matching Algorithm • For each pixel in the first image Find corresponding epipolar line in the right image Examine all pixels on the epipolar line and pick the best match Triangulate the matches to get depth information • Simplest case: epipolar lines are corresponding scanlines; • If necessary, rectify the two stereo images to transform epipolar lines into scanlines
  • 6. Depth from Disparity f x x’ Baseline B z O O’ X f z fB xxdisparity   Disparity is inversely proportional to depth! • Stereo camera calibration (focal length and baseline are known) • Image planes of cameras parallel to each other and to the baseline • Camera centers are at same height • Focal lengths are the same • Then, epipolar lines fall along the horizontal scan lines of the images
  • 7. Calibration • Find the intrinsic and extrinsic parameters of a camera ◦ Extrinsic parameters: the camera’s location and orientation in the world. ◦ Intrinsic parameters: the relationships between pixel coordinates and camera coordinates. • Work of Roger Tsai and work of Zhengyou Zhang are influential: 3-D node setting and 2-d plane • Basic idea: ◦ Given a set of world points Pi and their image coordinates (ui,vi) ◦ find the projection matrix M ◦ And then find intrinsic and extrinsic parameters. • Calibration Techniques ◦ Calibration using 3D calibration object ◦ Calibration using 2D planer pattern ◦ Calibration using 1D object (line-based calibration) ◦ Self Calibration: no calibration objects ◦ Vanishing points from for orthogonal direction
  • 8. Calibration • Calibration using 3D calibration object: ◦ Calibration is performed by observing a calibration object whose geometry in 3D space is known with very good precision. ◦ Calibration object usually consists of two or three planes orthogonal to each other, e.g. calibration cube ◦ Calibration can also be done with a plane undergoing a precisely known translation (Tsai approach) ◦ (+) most accurate calibration, simple theory ◦ (-) more expensive, more elaborate setup • 2D plane-based calibration (Zhang approach) ◦ Require observation of a planar pattern at few different orientations ◦ No need to know the plane motion ◦ Set up is easy, most popular approach ◦ Seems to be a good compromise. • 1D line-based calibration: ◦ Relatively new technique. ◦ Calibration object is a set of collinear points, e.g., two points with known distance, three collinear points with known distances, four or more… ◦ Camera can be calibrated by observing a moving line around a fixed point, e.g. a string of balls hanging from the ceiling! ◦ Can be used to calibrate multiple cameras at once. Good for network of cameras.
  • 9. Fundamental Matrix Let p be a point in left image, p’ in right image Epipolar relation ◦ p maps to epipolar line l’ ◦ p’ maps to epipolar line l Epipolar mapping described by a 3x3 matrix F It follows that l’l p p’ This matrix F is called • the “Essential Matrix” E – when image intrinsic parameters are known • the “Fundamental Matrix” – more generally (uncalibrated case) Can solve for F from point correspondences • Each (p, p’) pair gives one linear equation in entries of F • 8/5 points give enough to solve for F/E (8/5-point algo)
  • 10. Planar Rectification Bring two views to standard stereo setup (moves epipole to ) (not possible when in/close to image) ~ image size (calibrated) Distortion minimization (uncalibrated)
  • 11.
  • 12. Polar re-parameterization around epipoles Requires only (oriented) epipolar geometry Preserve length of epipolar lines Choose  so that no pixels are compressed original image rectified image Polar Rectification Works for all relative motions Guarantees minimal image size Determine the common region from the extremal epipolar lines and the location of epiole: e’F=0 Select half epipolar lines moving around the epipole Construct rectified image line by line
  • 13.
  • 14. Matching cost disparity Left Right scanline Correspondence Search • Slide a window along the right scanline and compare contents of that window with the reference window in the left image • Matching cost: SSD or normalized correlation
  • 15. Constraints in Stereo Vision • Color constancy • Lambertian surface assumption; •Epipolar geometry • Scanline as epipolar line for rectifed pair; • Uniqueness • For any point in one image, there should be at most one matching point in the other image; • Ordering • Corresponding points should be in the same order in both views; • Smoothness • Disparities to change slowly (the most part). Epipolar plane Epipolar line for pEpipolar line for p’ Uniqueness Ordering
  • 16. Difficulties in Stereo Vision • Photometric distortions and noise; • Foreshortening; • Perspective distortions; • Uniform/ambiguous regions; • Repetitive/ambiguous patterns; • Transparent objects; • Occlusions and discontinuities.
  • 17. Pipeline of Stereo Matching Methods • Pre-processing: compensate for photometric distortion; • LoG, Census transform, phase only(DCT or WT), histogram equalization/matching, isotropic diffusion, … • Cost computation: • Absolute difference, squared difference, weighted difference, SAD, SSD, SWD, ZMNCC, … • Cost aggregation: • Bilateral filter, guided filter, non local, segment tree,... • Disparity computation/optimization • Integral image, box filtering, … • Local (fast), global (slow), semi-global, … • Disparity refinement • Sub pixel interpolation, median filter, cross check (left-right consistency check) and occlusion filling.
  • 18. State-of-Art Stereo Matching Methods • Local method • Look at one image patch at at time • Solve many small problems independently • Faster, less accurate, usually works for high texture • Needs enough texture in a patch to disambiguate • Global method • Look at the whole image • Solve one large problem • Slower, more accurate, works up to medium texture • Propagates estimates from textured to untextured regions • Sparse point-based method • Still works for low textured regions, hard to handle ambiguous regions • Semi-global method • SGM (semi-global-matching), 2-d search to 1-d search along 8/16 directions.
  • 19. Quality Metrics in Stereo Matching (Passive) • General objective approaches: • Compute error statistics w.r.t. some ground truth data; • RMS (root-mean-squared) error (in disparity units) btw. computed disparity dC (x, y) and ground truth dT (x, y); • Percentage of bad matching pixels; • Select the following areas support the analysis of matching results • textureless regions; • occluded regions; • depth discontinuity regions. • Evaluate synthetic image by warping the reference with disparity map; • Forward warp the reference image by the computed disparity map; • Inverse warp a new view by the computed disparity map. • Subjective evaluation
  • 20. Census Transform and Hamming Distance • Census transform converts relative intensity difference to 0 or 1 and deforms 1 dimensional vector as much as window size of census transform; • Census transform makes data of (image size * vector size). • Modified CTW: compared with the mean rather than the central pixel; • Hamming distance of CT vectors with correlation windows used to find matched patches; • Advantage: robustness to radiometric distortion, vignetting, lighting, boundaries and noise. 210159998639 198170326747 45677810298 304033115109 393126130121 11111 11000 00X11 00011 00011 111111100000110001100011 Census transform window (CTW) Height Width Height Width (Square size of CTW)-1
  • 22. Guided Filter in Cost Aggregation for Stereo Matching • Idea: stereo match as labeling, a spatially smooth labeling with label transitions aligned with color edges; • Edge preserving filter: WLS, Anisotropic diffusion, bilateral filter, total variation filter, guided filter, ... • Guided filter works better than bilateral filter; • • Cost volume filtering with guided filter works like segmentation implicitly; Wi,j : The filter weights depend on the guidance image IC’ : the filtered cost volume
  • 23.
  • 24. PatchMatch Stereo • Idea: First a random initialization of disparities and plane para.s for each pix. and update the estimates by propagating info. from the neighboring pix.s; • Spatial propagation: Check for each pix. the disparities and plane para.s for left and upper neighbors and replace the current estimates if matching costs are smaller; • View propagation: Warp the point in the other view and check the corresponding etimates in the other image. Replace if the matching costs are lower; • Temporal propagation: Propagate the information analogously by considering the etimates for the same pixel at the preceding and consecutive video frame; • Plane refinement: Disparity and plane para.s for each pix. refined by generat. random samples within an interval and updat. estimates if matching costs reduced; • Post-processing: Remove outliers with left/right consistency checking and weighted median filter; Gaps are filled by propagating information from the neighborhood.
  • 26. Semi-Global Matching for Stereo Computation • Semi-global matching approximates a global optimization by combining several local optimization steps; • Minimizing E(D) in a two-dimensional manner would be very costly, while SGM simplifies it by traversing one-dimensional paths and ensures the constraints with respect to these explicit directions; • At least 8 paths (16 suggested), like horizontal, vertical and diagonal orientations; • For instance, cost aggregation along a horizontal path as • Pixel-based cost computation by mutual information as • Left-right consistency check for occlusion detection and disparity propagation for hole filling. • To accelerate the process, down-sampled image pairs are used for disparity estimation. a small penalty P1 a large penalty P2 for large disparity changes
  • 27.
  • 28. ELAS: Efficient Large Scale Stereo Matching
  • 29. ELAS: Efficient Large Scale Stereo Matching • Prior model: • Likelihood model: • Posterior can be factorized by the Bayes rule as • Likelihood calculated along the epipolar line as • Disparity estimation as MAP: • To minimize an energy function A mean function linking the support points and the observations
  • 30.
  • 31. Stereo as Energy Minimization • Find disparities d that minimize an energy function • Simple pixel / window matching = SSD distance between windows I(x, y) and J(x, y + d(x,y)) I(x, y) J(x, y) y = 141 C(x, y, d); the disparity space image (DSI) x d • Choose the minimum of each column in the DSI independently:
  • 32. Dynamic Programming (DP) in Stereo Matching • Can minimize E(d) independently per scanline using dynamic programming (DP); leftS rightS Left occlusion t q Right occlusion s p occlC occlC corrC Three cases: • Sequential – cost of match • Left occluded – cost of no match • Right occluded – cost of no match Left image Right image I I • DP yields the optimal path through grid, the best set of matches for the ordering constraint in scan-line stereo.
  • 33. d1 d2 d3 • Graph Cut • Delete enough edges so that • each pixel is connected to exactly one label node • Cost of a cut: sum of deleted edge weights • Finding min cost cut equivalent to finding global minimum of energy function Energy Minimization via Graph Cuts Labels (disparities) edge weight edge weight • What defines a good stereo correspondence? • 1. Match quality • Want each pixel to find a good match in the other image • 2. Smoothness • If two pixels are adjacent, they should (usually) move about the same amount { { match cost smoothness cost “Potts model” L1 distance Graph Cut: convert multi-way cut into a seq. of binary cut
  • 34.
  • 35. Model Stereo Vision by MRF and Solution by Belief Propagation • Allows rich probabilistic models for images. • But built in a local, modular way. Learn local relationships, get global effects out. disparity images Disparity-disparity compatibility function neighboring disparity nodes local observationsImages-disparity compatibility function  FY i ii ji ji yxxx Z yxP ),(),(1 ),( , BELIEFS: Approximate posterior marginal distributions neighborhood of node i MESSAGES: Approximate sufficient statistics I. Belief Update (Message Product) II. Message Propagation (Convolution)
  • 36. Hierarchical Belief Propagation (HBP) and Constant Space HBP • HBP works in a coarse-to-fine manner; • (a) initialize the messages at the coarsest level to all zeros; • (b) apply BP at the coarsest level to iteratively refine the messages; • (c) use refined messages from the coarser level to initialize the messages for the next level. • Constant space HBP relies on that, only a small number of disparity levels and the corresponding message values are needed at each pixel to losslessly reconstruct the BP messages; • Apply the coarse-to-fine (CTF) scheme to both spatial and depth domain, i.e. gradually reduce the number of disparity levels as the messages propagate in CTF; • Re-computes the data term at each level (not each iter.); • Slower 9/8, but memory does not grow with max disp; • Energy computed only once at the finest level; • Gradually reduce the disparity levels in CTF. • The closer the messages are to the fixed points, the fewer the required disparity levels; Then, CSBP refines the messages hierarchically to approach the fixed points.
  • 37.
  • 38. Phase Matching in Frequency or WT Domain • Phase reflects the structure information of the signal and inhibit the HF noise effect; • Phase singularity is a problem; • Local phase information as the primitive; • Wavelet transform builds a hierarchical framework for multi-level coarse-to-fine processing; • Stereo matching (disparity) with phase separation and instantaneous frequency of signals: • Dynamic programming (DP) used for global optimization (occlusion handling) in stereo matching ; • Phase is not uniformly stable; • Smoothness constraints; • Discontinuities detection; • Multiple resolution solution: • 1. top level: control points with feature matching, apply DP; • 2. middle level: interpolation, apply DP; • 3. bottom level: sub-pixel precision. Local phaseDisparity Left/Right images
  • 39. Original Phase matching Phase matching with DP
  • 40. Disparity/Depth Refinement • Sub-pixel refinement: real valued disparities may be obtained by approximating the cost function locally using a parabola; • Left-Right Consistency Check: outlier detection by difference; • By computing a disparity for every pixel of the left image (left to right); • by computing a disparity for every pixel of the right image (right to left); • Segmentation can be used for outlier identification. • Occlusion filling: • Occlusion detection; • Background expansion; • Inpainting. • Discontinuities smoothing: • Bilateral filtering.
  • 41. Multiple Cameras Multi-baseline stereo use the third view to verify depth estimates
  • 42. Spatial Temporal Video Disparity Estimation • The important problem of extending to video is flickering; • Typical methods: • Spatial temporal consistency: smoothing in the space-time volume; • Post-processing of disparity maps by applying a median filter along the flow fields; • Spatial-temporal cost aggregation and solved by local/global optimization methods; • Joint disparity and flow estimation; • SGM-based, as an instance; • Modeled with MRF and solved by global optimization. • Scene flow: 2D motion field along with 1D disparity change field. • Dense method is very computationally expensive; • Sparse method relies on heavily initial sparse correspondence success.
  • 43. Sparse Coding Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection). Objective: Given a set of input data vectors learn a dictionary of bases such that: Each data vector is represented as a sparse linear combination of bases. Sparse: mostly zeros
  • 44. Predictive Sparse Coding Recall the objective function for sparse coding: Modify by adding a penalty for prediction error: ◦ Approximate the sparse code with an encoder PSD for hierarchical feature training ◦ Phase 1: train the first layer; ◦ Phase 2: use encoder + absolute value as 1st feature extractor ◦ Phase 3: train the second layer; ◦ Phase 4: use encoder + absolute value as 1st feature extractor ◦ Phase 5: train a supervised classifier on top layer; ◦ Phase 6: optionally train the whole network with supervised BP.
  • 45. Methods of Solving Sparse Coding Greedy methods: projecting the residual on some atom; ◦ Matching pursuit, orthogonal matching pursuit; ◦ The residual is updated iteratively in the direction of the atom; L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO); Gradient-based methods finding new search directions ◦ Projected gradient descent ◦ Coordinate descent Homotopy: a set of solutions indexed by a parameter (regularization) ◦ LARS (Least Angle Regression) First-order/proximal methods: generalized gradient descent ◦ solving the proximal operator efficiently ◦ soft-thresholding for the L1-norm ◦ Accelerated by the Nesterov optimal first-order method Iterative reweighting schemes ◦ L2-norm: Chartrand and Yin (2008) ◦ L1-norm: Candès et al. (2008)
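As a concrete instance of the proximal family above, a minimal ISTA sketch (the step size and iteration count are illustrative choices):

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the L1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, x, lam, n_iter=100):
    # Minimize 0.5 * ||x - D a||^2 + lam * ||a||_1 by proximal gradient descent.
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)         # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```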
  • 46. Strategy of Dictionary Selection • What D to use? • A fixed overcomplete set of bases: no adaptivity. • Steerable wavelets; • Bandlets, curvelets, contourlets; • DCT basis; • Gabor functions; • …. • Data-adaptive dictionary – learn from data; • K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). • An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary. • Other methods of dictionary learning: • non-negative matrix decompositions; • sparse PCA (sparse dictionaries); • fused-lasso regularizations (piecewise-constant dictionaries). • Extending the models: Sparsity + Self-similarity = Group Sparsity
  • 47. Learning Sparse Representation in Depth Maps • Sparse representations are learned from the Middlebury database disparity maps; • They are then exploited in a two-layer graphical model for inferring depth from stereo, by including a sparsity prior on the learned features; ◦ The first layer is solved using an existing MRF-based stereo matching algorithm; ◦ The second layer is solved using the non-stationary sparse coding algorithm.
  • 48. Learning Sparse Representation in Depth Maps (Figure: (c) graph cut; (d) GC + sparse coding.)
  • 49. Deep Learning Representation learning attempts to automatically learn good features or representations; Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features); Become effective via unsupervised pre-training + supervised fine tuning; ◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks. Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer); Semi-supervised: structure of manifold assumption; ◦ labeled data is scarce and unlabeled data is abundant.
  • 50. Why Deep Learning? Supervised training of deep models (e.g. many-layered nets) is too hard (an optimization problem); ◦ Learn the prior from unlabeled data; Shallow models are not suited for learning high-level abstractions; ◦ Ensembles or forests do not learn features first; ◦ Graphical models could be deep nets, but mostly are not. Unsupervised learning could be “local learning”; ◦ Resembles boosting, with each layer being like a weak learner Learning is weak in directed graphical models with many hidden variables; ◦ Sparsity and regularizers. Traditional unsupervised learning methods cannot easily learn multiple levels of representation. ◦ Layer-wise unsupervised learning is the solution. Multi-task learning (transfer learning and self-taught learning); Other issues: scalability & parallelism with the burden of big data.
  • 51. Multi-Layer Neural Network A neural network = running several logistic regressions at the same time; ◦ Neuron = logistic regression or… Calculate error derivatives (gradients) to refine: back-propagate the error derivative through the model (the chain rule) ◦ Online learning: stochastic/incremental gradient descent ◦ Batch learning: conjugate gradient descent
  • 52. Problems in MLPs Multi-Layer Perceptrons (MLPs), a type of feed-forward neural network, were popular for decades. The gradient progressively gets more scattered ◦ Below the top few layers, the correction signal is minimal Gets stuck in local minima ◦ Especially when starting out far from ‘good’ regions (i.e., random initialization) In usual settings, only labeled data is used ◦ Almost all data is unlabeled! ◦ The human brain, by contrast, can learn from unlabeled data.
  • 53. Convolutional Neural Networks A CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input; ◦ local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling; ◦ Related to generative MRF/discriminative CRF: ◦ CNN = Field-of-Experts MRF = ML inference in CRF; ◦ Generates ‘patterns of patterns’ for pattern recognition. Each layer combines (merges, smooths) patches from previous layers ◦ Pooling/sampling (e.g., max or average) filter: compresses and smooths the data. ◦ Convolution filters: (translation invariance) unsupervised; ◦ Local contrast normalization: increases sparsity, improves optimization/invariance. C layers: convolutions; S layers: pooling/sampling
  • 54. Convolutional Neural Networks Convolutional networks are trainable multistage architectures composed of multiple stages; The input and output of each stage are sets of arrays called feature maps; At the output, each feature map represents a particular feature extracted at all locations of the input; Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer (see the sketch below); A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module; ◦ A fully connected layer: softmax transfer function for the posterior distribution. Filter: a trainable filter (kernel) in the filter bank connects an input feature map to an output feature map; Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function; ◦ In the rectified function, gi is a trainable gain parameter, possibly followed by a contrast normalization N; Feature pooling: treats each feature map separately -> a reduced-resolution output feature map; Supervised training is performed using a form of SGD to minimize the prediction error; ◦ Gradients are computed with the back-propagation method. Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning. (* is the discrete convolution operator)
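A minimal sketch of one such 3-layer stage (filter bank → nonlinearity → pooling), written in PyTorch for illustration; the channel counts and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

# One ConvNet stage: filter bank layer -> pointwise nonlinearity -> feature pooling.
stage = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5),  # filter bank
    nn.Tanh(),                                                 # nonlinearity
    nn.MaxPool2d(kernel_size=2, stride=2),                     # feature pooling
)

x = torch.randn(1, 1, 32, 32)   # a batch with one 32x32 grayscale image
y = stage(x)                    # feature maps of shape (1, 16, 14, 14)
```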
  • 56. LeNet (LeNet-5) A layered model composed of convolution and subsampling operations, followed by a holistic representation and ultimately a classifier for handwritten digits; Local receptive fields (5x5) with local connections; Output via an RBF function, one for each class, with 84 inputs each; Learning by Graph Transformer Networks (GTN);
  • 57. AlexNet A layered model composed of convolution and subsampling, followed by a holistic representation and, all in all, a landmark classifier; Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax; Fully-connected layers: linear classifiers/matrix multiplications; ReLUs are rectified-linear nonlinearities on layer outputs; they allow training several times faster; A local (contrast) normalization scheme aids generalization; Overlapping pooling is slightly less prone to overfitting; Data augmentation: artificially enlarge the dataset using label-preserving transformations; Dropout: set the output of each hidden neuron to zero with prob. 0.5; Trained by SGD with batch size 128, momentum 0.9, weight decay 0.0005.
  • 58. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is 253,440 – 186,624 – 64,896 – 64,896 – 43,264 – 4,096 – 4,096 – 1,000.
  • 59. MattNet Matthew Zeiler from the startup company “Clarifai”, winner of the ImageNet Classification task in 2013; Preprocessing: subtracting a per-pixel mean; Data augmentation: images are downsampled to 256 pixels and a random 224-pixel crop is taken out of the image and randomly flipped horizontally to provide more views of each example; SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting; 65M parameters trained for 12 days on a single Nvidia GPU; Visualization by layered DeconvNets: project the feature activations back to the input pixel space; ◦ Reveal the input stimuli exciting individual feature maps at any layer; ◦ Observe the evolution of features during training; ◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important; A DeconvNet is attached to each ConvNet layer; unpooling uses the locations of maxima to preserve structure; Multiple such models were averaged together to further boost performance; Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
  • 60. Architecture of an eight-layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). Layers 1–5: convolutional (e.g., layer 1: 96 filters, 7x7, stride of 2 in both x and y). Feature maps: (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized → 55x55 feature maps. Layers 6–7: fully connected, input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C = number of classes.
  • 61. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet reconstructs an approximate version of the convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
  • 62. Oxford VGG Net: Very Deep CNN Networks of increasing depth using an architecture with very small (3×3) convolution filters; ◦ Spatial pooling is carried out by 5 max-pooling layers; ◦ A stack of convolutional layers is followed by three fully-connected (FC) layers; ◦ All hidden layers are equipped with the ReLU rectification non-linearity; ◦ No Local Response Normalisation! Trained by optimising the multinomial logistic regression objective using SGD; Regularised by weight decay and by dropout regularisation for the first two fully-connected layers; The learning rate was initially set to 10−2, and then decreased by a factor of 10; For random initialisation, weights are sampled from a normal distribution; Derived from the publicly available C++ Caffe toolbox, allowing training and evaluation on multiple GPUs installed in a single system, and on full-size (uncropped) images at multiple scales; Combine the outputs of several models by averaging their soft-max class posteriors.
  • 63. The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv<receptive field size> - <number of channels>”. The ReLU activation function is not shown for brevity.
  • 64. GoogleNet Questions: ◦ Vanishing gradients? ◦ Exploding gradients? ◦ Tricky weight initialization? A deep convolutional neural network architecture codenamed Inception; ◦ Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components; ◦ Judiciously applying dimension reduction and projections wherever the computational requirements would otherwise increase too much; Increasing the depth and width of the network while keeping the computational budget constant; ◦ Drawbacks: bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, and a dramatically increased use of computational resources; ◦ Solution: move from fully connected to sparsely connected architectures; analyze the correlation statistics of the activations of the last layer and cluster neurons with highly correlated outputs. ◦ Based on the well-known Hebbian principle: neurons that fire together, wire together; Trained using DistBelief, a distributed machine learning system.
  • 65. Inception module (with dimension reductions)
  • 66. (Figure: GoogLeNet — a network in a network in a network, with 9 Inception modules; legend: convolution, pooling, softmax, other.) Problems with training deep architectures?
  • 67. PReLU Networks at MSR A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit; ◦ PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk; ◦ Allows negative activations on the ReLU function with a control parameter a learned adaptively; ◦ Resolves the diminishing gradient problem for very deep neural networks (> 13 layers); Derive a robust initialization method better than “Xavier” (normalization) initialization; Also use a Spatial Pyramid Pooling (SPP) layer just before the fully connected layers; Can train extremely deep rectified models and investigate deeper or wider network architectures; ReLU vs. PReLU. Note: μ is momentum, ϵ is the learning rate.
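The PReLU activation itself is one line; a sketch (a is the learned slope on the negative side, with a = 0 recovering ReLU):

```python
import numpy as np

def prelu(y, a):
    # Identity for positive inputs; learned slope a for negative inputs.
    return np.where(y > 0, y, a * y)
```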
  • 68. PReLU Networks at MSR Performance: 4.94% top-5 test error on the ImageNet 2012 classification dataset; ◦ ILSVRC 2014 winner (GoogLeNet): 6.66%; Adopt the momentum method in BP training; Mostly initialized with random weights from a Gaussian distribution; Investigate the variance of the forward-pass responses in each layer; Consider a sufficient condition in BP: ◦ the gradient is not exponentially large/small.
  • 69. (Table: architectures of large PReLU network models.)
  • 70. Batch Normalization at Google Normalizing layer inputs for each mini-batch to handle saturating nonlinearities and covariate shift; ◦ Internal Covariate Shift (ICS): the change in the distribution of network activations due to the change in network parameters during training; ◦ Whitening to reduce ICS: a linear transform to obtain zero means, unit variances, and decorrelation; ◦ Fix the means and variances of layer inputs (instead of jointly whitening the features of inputs and outputs); ◦ The batch-normalizing transform is applied to activations over a mini-batch (see the sketch below); ◦ The BN transform is a differentiable transform introducing normalized activations into the network; Batch-normalized networks ◦ Unbiased variance estimate; ◦ Moving average; Batch-normalized ConvNets ◦ Effective mini-batch size; ◦ Per feature map, not per activation.
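A sketch of the batch-normalizing transform for one layer of fully connected activations (gamma and beta are the learned scale and shift; eps is for numerical stability):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the mini-batch,
    # then apply the learned scale and shift: y = gamma * x_hat + beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```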
  • 71. Batch Normalization at Google Reduces the dependence of gradients on the scale of the parameters or of their initial values; ◦ Prevents small changes from amplifying into larger, suboptimal changes in activations and gradients; ◦ Stabilizes the parameter growth and makes gradient propagation better behaved in BN training; In some cases, eliminates the need for dropout as a regularizer; ◦ In ImageNet Classification, remove local response normalization and reduce photometric distortions; ◦ Reaches 4.9% top-five validation error and 4.8% test error (human raters: 5.1%). Accelerating a BN network: ◦ Enable a larger learning rate and less care about initialization, which accelerates the training; ◦ Reduce L2 weight regularization; ◦ Accelerate the learning rate decay.
  • 72. Batch Normalization at Google (Figure: the batch-normalized Inception architecture.)
  • 73. Neural Turing Machines A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank; ◦ During each update cycle, the controller network receives inputs from an external environment and emits outputs in response; ◦ It also reads from and writes to a memory matrix via a set of parallel read and write heads. The weightings arise by combining two addressing mechanisms with complementary facilities; ◦ “content-based addressing”: focuses attention on locations based on the similarity between their current values and values emitted by the controller (see the formula below); ◦ “location-based addressing”: the content of a variable is arbitrary, but the variable still needs a recognizable name or address — by location, not by content; Controller network: feed-forward or recurrent.
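For reference, the content-based weighting in the NTM paper is a softmax over similarities between the emitted key $\mathbf{k}_t$ and each memory row $\mathbf{M}_t(i)$, sharpened by a strength $\beta_t$, with $K[\cdot,\cdot]$ the cosine similarity:

$$w_t^c(i)=\frac{\exp\big(\beta_t\,K[\mathbf{k}_t,\mathbf{M}_t(i)]\big)}{\sum_j \exp\big(\beta_t\,K[\mathbf{k}_t,\mathbf{M}_t(j)]\big)}$$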
  • 74. Neural Turing Machines Neural Turing Machine Architecture. Flow Diagram of the Addressing Mechanism.
  • 75. Highway Networks: Information Highway Ease gradient-based training of very deep networks; Allow unimpeded information flow across several layers on “information highways”; Use gating units to learn to regulate the flow of information through the network; A highway network consists of multiple blocks such that the ith block computes a block state Hi(x) and a transform gate output Ti(x); the carry gate is C = 1 - T (see the sketch below); Highway networks with hundreds of layers can be trained directly using SGD and with a variety of activation functions.
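A sketch of one highway block in PyTorch (the layer sizes and the negative gate-bias initialization, which biases the block toward carrying its input, are illustrative choices):

```python
import torch
import torch.nn as nn

class HighwayBlock(nn.Module):
    # y = H(x) * T(x) + x * C(x), with the carry gate C = 1 - T.
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)          # block state
        self.T = nn.Linear(dim, dim)          # transform gate
        nn.init.constant_(self.T.bias, -2.0)  # start close to "carry"

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return torch.relu(self.H(x)) * t + x * (1.0 - t)
```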
  • 76. Deep Residual Learning for Image Recognition Reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions; ◦ Denote the desired underlying mapping as H(x), then let the stacked nonlinear layers fit another mapping F(x) = H(x) - x; ◦ The formulation F(x)+x can be realized by a feed-forward NN with “shortcut connections” (as in “Highway Networks” and “Inception”); These residual networks are easier to optimize, and can gain accuracy from considerably increased depth; An ensemble of 152-layer residual nets achieves 3.57% error on the ImageNet test set; ◦ 224x224 crop, per-pixel mean subtracted, color augmentation, batch normalization; ◦ SGD with a mini-batch size of 256; the learning rate starts from 0.1 and is divided by 10 when the error plateaus; ◦ Weight decay of 0.0001 and a momentum of 0.9, no dropout;
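A sketch of a basic residual block (channel counts assumed; the stacked layers fit F(x), and the identity shortcut adds x back):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Fit F(x) = H(x) - x; the block outputs F(x) + x via an identity shortcut.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = torch.relu(self.bn1(self.conv1(x)))
        f = self.bn2(self.conv2(f))
        return torch.relu(f + x)   # shortcut connection
```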
  • 77. Rethink the Inception Architecture for Computer Vision Scale up networks in ways that aim at utilizing the added computation efficiently, by factorized convolutions and aggressive regularization; Design principles in Inception: ◦ Avoid representational bottlenecks, especially early in the network; ◦ Higher-dimensional representations are easier to process locally within a network; ◦ Spatial aggregation over lower-dimensional embeddings without loss in representational power; ◦ Balance the width and depth of the network. Factorizing convolutions with large filter size: asymmetric convolutions; Auxiliary classifiers: act as a regularizer, esp. when batch normalized or with dropout; Grid size reduction: two parallel stride-2 blocks (pooling and activation); Model regularization via label smoothing: a marginalized effect of dropout; Trained with TensorFlow: SGD with 50 replicas, batch size 32, for 100 epochs, learning rate of 0.045, exponential decay rate of 0.94, and a decay of 0.9.
  • 78. Rethink the Inception Architecture for Computer Vision Inception modules after the factorization of the nxn convolutions. In the proposed architecture, n = 7 is chosen for the 17x17 grid. Inception modules with expanded filter bank outputs. Inception modules where each 5x5 convolution is replaced by two 3x3 convolutions.
  • 79. Rethink the Inception Architecture for Computer Vision Auxiliary classifier on top of the last 17x17 layer. Inception module that reduces the grid size while expanding the filter banks: it is both cheap and avoids the representational bottleneck. The outline of the proposed network architecture.
  • 80. Belief Nets A belief net is a directed acyclic graph composed of stochastic variables. We can observe some of the variables and solve two problems: ◦ inference: infer the states of the unobserved variables; ◦ learning: adjust the interactions between variables to make the network more likely to generate the observed data. (Figure: stochastic hidden causes with visible effects.) Use nets composed of layers of stochastic variables with weighted connections.
  • 81. Boltzmann Machines An energy-based model associates an energy with each configuration of the stochastic variables of interest (for example, MRF, nearest neighbor); ◦ Learning means adjusting the shape properties of the (low) energy function; A Boltzmann machine is a stochastic recurrent model with hidden variables; ◦ Markov Chain Monte Carlo, i.e. MCMC sampling (appendix); A restricted Boltzmann machine is a special case: ◦ Only one layer of hidden units; ◦ factorization within each layer’s neurons/units (no connections in the same layer); Contrastive divergence: approximation of the gradient (appendix). (Figure: probability, energy function, and learning rule.)
  • 82. Deep Belief Networks A hybrid model: can be trained as a generative or discriminative model; Deep architecture: multiple layers (learn features layer by layer); ◦ Multi-layer learning is difficult in sigmoid belief networks. ◦ The top two layers have undirected connections, an RBM; ◦ Lower layers get top-down directed connections from the layers above; Unsupervised or self-taught pre-learning provides a good initialization; ◦ Greedy layer-wise unsupervised training of RBMs Supervised fine-tuning ◦ Generative: wake-sleep algorithm (up-down) ◦ Discriminative: back propagation (bottom-up)
  • 83. Deep Boltzmann Machine Learning internal representations that become increasingly complex; High-level representations are built from a large supply of unlabeled inputs; Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (an undirected graph); Generative fine-tuning: different from the DBN ◦ Positive and negative phases (appendix) Discriminative fine-tuning: the same as for the DBN ◦ Back propagation.
  • 84. Denoising Auto-Encoder A multilayer NN with target output = input; Reconstruction = decoder(encoder(input)) (see the sketch below); ◦ Perturbs the input x to a corrupted version; ◦ Randomly sets some of the coordinates of the input to zeros. ◦ Recovers x from the encoded perturbed data. Learns a vector field towards higher-probability regions; Pre-trained with a DBN, or regularized with perturbed training data; Minimizes a variational lower bound on a generative model; ◦ corresponds to regularized score matching on an RBM; PCA = linear manifold = linear auto-encoder; An auto-encoder learns the salient variations like a nonlinear PCA.
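A sketch of a one-layer denoising auto-encoder with masking noise (layer sizes and corruption rate are illustrative):

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    # Reconstruction = decoder(encoder(corrupted x)); the target is the clean x.
    def __init__(self, dim_in=784, dim_hidden=256, p_corrupt=0.3):
        super().__init__()
        self.p = p_corrupt
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, dim_in), nn.Sigmoid())

    def forward(self, x):
        mask = (torch.rand_like(x) > self.p).float()  # zero random coordinates
        return self.decoder(self.encoder(x * mask))

# Training step sketch: minimize reconstruction error against the *clean* input:
#   loss = nn.functional.mse_loss(model(x), x)
```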
  • 85. Stacked Denoising Auto-Encoder Stack many (possibly sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning ◦ Drop the decode layer each time ◦ Performs better than stacking RBMs; Supervised training on the last layer using the final features; (optionally) Supervised training on the entire network to fine-tune all weights of the neural net; Empirically not quite as accurate as DBNs.
  • 86. Stereopsis via Deep Learning • Learn a binocular cross-correlation model: use two quadrature pairs to detect disparity; ◦ The various filters correspond to phases, positions and frequencies; • Disparity as a latent variable: a pattern of matching filter responses; ◦ A joint probabilistic model over patch pairs and disparity is defined as a Boltzmann machine. ◦ Training amounts to finding the parameters that maximize the log probability of the pairs; ◦ An RBM is used for this case; ◦ During inference, each latent variable receives activity from exactly two products of matched filter responses (followed by pooling).
  • 87. Stereopsis via Deep Learning Example training data: rows 1–3 show rendered image planes for the left/right camera, where in row 3 the right camera has been rotated by 45° around the z axis. Images are rendered from the depth maps shown in row 4 and a randomly selected texture map from the Berkeley Segmentation Database. Example pairs from the NORB-cluttered dataset. Learned binocular filter pairs.
  • 88. Unsupervised Learning of Depth (and Motion) • Learning about the interrelations between images from multiple cameras, multiple frames of a video, or the combination of both; • Depth and motion in a feature-learning architecture based on the energy model; • A single-layer autoencoder model uses multiplicative interactions to detect synchrony, and a pooling layer, independently trained on the hidden responses, achieves content invariance; • Depth as a latent variable in learning: • Reconstruction error: • Contraction as regularization: • Complete objective function: • Note: there is no need for rectification, since the model can learn any transformation between the frames, not just horizontal shift
  • 89. Unsupervised Learning of Depth (and Motion) • Extension to stereo sequences: both depth and motion; ◦ Encoding depth; ◦ Encoding motion; ◦ Multiview disparity. (Figure: representations of depth, motion, and disparity, each computed from products of frame responses.)
  • 90. Unsupervised Learning of Depth (and Motion) Filters learned on stereo patch pairs from KITTI dataset. Example of a filter pair learned on sequences by the SAE-D model from the Hollywood3D dataset.
  • 91. Stereo Matching by CNN • Train a convolutional neural network on pairs of small image patches; • The network output is used to initialize the matching cost between a pair of patches; • Eight layers, L1 through L8, with a 9x9 gray patch as input and the matching cost as output; • The 1st layer is convolutional only; the other layers are fully connected. • Rectified linear units follow each layer except L8, but NO pooling! • Trained with SGD (batch size 128) on 194 image pairs, 45 million extracted examples. • Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation; • Smoothness constraints are enforced by semi-global matching (SGM), and a left-right consistency check is used to detect and eliminate errors in occluded regions; • Sub-pixel enhancement and a median filter + bilateral filter -> final disparity map; • Achieves an error rate of 2.61% on the KITTI stereo database (previous best: 2.83%).
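A simplified sketch in the spirit of this patch network (not the exact published architecture: here the two 9x9 patches are simply stacked as two input channels, and the layer widths are assumptions):

```python
import torch
import torch.nn as nn

class MatchingCostNet(nn.Module):
    # Input: a pair of 9x9 grayscale patches; output: a scalar matching cost.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(2, 32, 5), nn.ReLU())  # conv layer only
        self.fc = nn.Sequential(                                   # fully connected
            nn.Linear(32 * 5 * 5, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 1),     # no ReLU after the last layer
        )

    def forward(self, left_patch, right_patch):   # each: (N, 1, 9, 9)
        x = torch.cat([left_patch, right_patch], dim=1)
        return self.fc(self.conv(x).flatten(1))
```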
  • 92. Stereo Matching by CNN (Figure: cross-based cost aggregation support region.)
  • 93. A Deep Embedding Model for Stereo Matching Costs This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding image patches, and maps intensity values into an embedding feature space to measure pixel dissimilarities; Features are extracted from a pair of patches at different scales, followed by an inner product to obtain the matching scores; the scores from different scales are then merged into an ensemble. (Figure: the deployed network architecture of the testing model for deep embedding. Features are extracted in the two images only once each; the sliding-window style inner product can be grouped into a matrix operation.)
  • 94. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning A 3-step pipeline for the stereo matching problem and a highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multilevel comparison of image patches. A post-processing step employs a second deep convolutional neural network for pooling global information from multiple disparities. It outputs both the image disparity map, which replaces the conventional “winner takes all” strategy, and a confidence in the prediction. The confidence score is achieved by training the network with a reflective loss. The learned confidence is employed to better detect outliers in the refinement.
  • 95. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning The λ-ResMatch architecture of the matching cost network
  • 96. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning The Global disparity network model for representing disparity patches
  • 97. Efficient Deep Learning for Stereo Matching ◦ A matching network able to produce very accurate results in less than a second of GPU computation. A product layer simply computes the inner product between the two representations of a Siamese architecture (see the sketch below). ◦ Treat disparity estimation as multi-class classification, where the classes are all possible disparities. A Siamese network extracts marginal distributions over all possible disparities for each pixel. Four-layer Siamese network architecture. The code and data are online at: http://www.cs.toronto.edu/deepLowLevelVision.
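A sketch of the product-layer idea: shared (Siamese) features compared by an inner product at every candidate disparity (the feature widths are assumptions, and the wrap-around of torch.roll at the image border is ignored here):

```python
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 64, 3, padding=1))   # shared Siamese branch

def match_scores(left, right, max_disp):
    fl, fr = feat(left), feat(right)                    # (N, C, H, W) each
    scores = []
    for d in range(max_disp):
        fr_shift = torch.roll(fr, shifts=d, dims=3)     # align right features
        scores.append((fl * fr_shift).sum(dim=1))       # inner product over C
    return torch.stack(scores, dim=1)                   # (N, max_disp, H, W)
```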
  • 98. End-to-End Learning of Geometry and Context for Deep Stereo Regression A deep learning architecture for regressing disparity from a rectified pair of stereo images. Leverages knowledge of the problem’s geometry to form a cost volume using deep feature representations. Learns to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a differentiable soft argmin operation (below), which allows training end-to-end to sub-pixel accuracy without any additional post-processing or regularization.
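The soft argmin converts the predicted costs $c_d$ at each pixel into a differentiable, sub-pixel disparity estimate, with $\sigma(\cdot)$ the softmax taken over the disparity dimension:

$$\hat{d}=\sum_{d=0}^{D_{\max}} d\times\sigma(-c_d)$$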
  • 99. End-to-End Learning of Geometry and Context for Deep Stereo Regression End-to-end deep stereo regression architecture, GC-Net (Geometry and Context Network)
  • 100. End-to-End Training of Hybrid CNN-CRF Models for Stereo Convolutional neural networks (CNNs) + optimization-based approaches for stereo estimation; ◦ The optimization, posed as a conditional random field (CRF), takes local matching costs and consistency-enforcing (smoothness) costs as inputs, both estimated by CNN blocks. ◦ Inference in the CRF is based on a linear programming relaxation with a fixed number of iterations. Training end-to-end: in the discriminative formulation (structured SVM), the training is practically feasible. ◦ The optimization part efficiently replaces post-processing steps by a trainable, well-understood model. A CNN, called the Unary-CNN, computes features of the two images for each pixel. The features are compared using a correlation layer. The resulting matching cost volume becomes the unary cost of the CRF. The pairwise costs of the CRF are parametrized by edge weights, which can either follow a usual contrast-sensitive model or be estimated by the Pairwise-CNN.
  • 101. End-to-End Training of Hybrid CNN-CRF Models for Stereo (Figure: the cross-correlation of features φ0 and φ1; the CRF model optimizes the cost of disparity labelings from the matching costs and pairwise terms.)
  • 102. Appendix A: Depth from Single Image by Learning
  • 103. Learning-based Depth from Image Initial over-segmentation (super-pixels); A Markov Random Field (MRF) to infer each patch’s orientation and location from image features (texture, color and gradient); ◦ Connected, co-planar or colinear as priors; ◦ Occlusion boundary/fold indication; ◦ Multi-conditional learning, solved by a linear program; (Figure: MRF overlaid on “super-pixels”; occlusion/fold; coplanarity and colinearity.)
  • 105. Single Image Depth Estimation From Predicted Semantic Labels Semantic segmentation guides the 3D reconstruction; Works like holistic scene understanding: ◦ 1. Multi-class image labeling MRF for scene segmentation; ◦ 2. Depth estimation for each semantic class by learning (logistic regression); ◦ 3. Scene depth estimation by an MRF (pixel or super-pixel) with potentials (learned boosted decision tree classifiers) and geometric priors (horizon prediction, vertical objects), pixel smoothness, super-pixel soft connectivity, co-planarity and orientation. (Figure: semantically derived geometric constraints; smoothed per-pixel log-depth prior for each semantic class, with the horizon rotated to the center of the image.)
  • 106. (Figure: image; semantic overlay; ground truth; depth measurements.)
  • 107. Learning Depth from Examples Two similar images are likely to have similar 3D structure (depth). Nearest-neighbor (kNN) search: find the k image+depth pairs that are most similar to the query (histograms of oriented gradients as features); Depth fusion: median filtering of the k depth fields; Joint-bilateral depth filtering: smoothing of the median-fused depth.
  • 108. (Figure: K-NN query → depth fusion and smoothing → depth output. Note: depth (disparity) warping via SIFT flow to align with the query is omitted.)
  • 109. Depth Transfer for Monocular Video K-NN search for candidates of the query frames; Depth changes are gradual frame-to-frame; Moving objects are usually on the ground; Warped with SIFT flow and regularized with smoothness and prior terms; Is the computational cost worth it?
  • 112. Scalable Exemplar-Based Depth Transfer Form a basis (dictionary) over the RGB and depth spaces, and represent depth maps by a sparse linear combination of weights. A prediction function is estimated between weight vectors in the RGB and depth spaces to recover depth maps from query images. A final super-pixel post-processor aligns depth maps with occlusion boundaries, creating physically plausible results.
  • 113. Scalable Exemplar-Based Depth Transfer (Figure: images with similar global depth profiles are clustered together in 2D using RGB pairwise features (left) and sparse positive descriptors on depth (right), effective in grouping images with similar depth profiles together; a transformation T is estimated that maps points from one space to the other.)
  • 114. Learning to be a Depth Camera (Active Near-IR) • Use hybrid classification-regression forests to learn how to map from near-infrared intensity images to absolute, metric depth in real time; • Simplify the problem by dividing it into sub-problems in the first layer, and then apply models trained for these sub-problems in the second layer to solve the main problem efficiently; • Restrict the depths of the object to a certain range for significant simplification; • The first layer learns to infer a coarsely quantized depth range for each pixel, and optionally pools these predictions across all pixels to obtain a more reliable distribution over these depth ranges; • The second layer then applies one or more expert regressors trained specifically on the inferred depth ranges. • Note: the forests do not need to explicitly model scene illumination, surface geometry and reflectance, or complex inter-reflections, as required by traditional SFS methods.
  • 115. Learning to be a Depth Camera (Active Near-IR) • Comparable to high-quality consumer depth cameras with a reduced cost, power consumption, and form-factor.
  • 116. Learning to be a Depth Camera (Active Near-IR) • Applied to specific hand and face objects.
  • 117. Appendix B: Machine Learning and Optimization
  • 118. Graphical Models • Graphical Models: Powerful framework for representing dependency structure between random variables. • The joint probability distribution over a set of random variables. • The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables. • The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables. • Two type of graphical models: • Directed (Bayesian networks) • Undirected (Markov random fields, Boltzmann machines) • Hybrid graphical models that combine directed and undirected models, such as Deep Belief Networks, Hierarchical-Deep Models.
  • 119. Generative Model: MRF Random field: F={F1,F2,…,FM}, a family of random variables on a set S in which each Fi takes a value fi in a label set L. Markov random field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. ◦ Generative model for the joint probability p(x) ◦ potentials allow no direct probabilistic interpretation ◦ define potential functions Ψ on maximal cliques A ◦ map a joint assignment to a non-negative real number ◦ requires normalization An MRF is an undirected graphical model
  • 120. A flow network G(V, E) is defined as a fully connected directed graph where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0; The max-flow problem is to find the flow of maximum value on a flow network G; An s-t cut (or simply cut) of a flow network G is a partition of V into S and T = V-S, such that s is in S and t is in T; A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network; Methods for max-flow or min-cut: ◦ the Ford-Fulkerson method; ◦ the "Push-Relabel" method.
  • 121. Labeling is mostly solved as an energy minimization problem (see the energy below); Two common energy models: ◦ the Potts interaction energy model; ◦ the linear interaction energy model. The graph G contains two kinds of vertices: p-vertices and i-vertices; ◦ all the edges in the neighborhood N are called n-links; ◦ edges between the p-vertices and the i-vertices are called t-links. In the multiple-labeling case, the multi-way cut should leave each p-vertex connected to exactly one i-vertex; The minimum-cost multi-way cut will minimize the energy function, where the severed n-links correspond to the boundaries of the labeled vertices; Approximation algorithms to find this multi-way cut: ◦ the "alpha-expansion" algorithm; ◦ the "alpha-beta swap" algorithm.
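For reference, the standard labeling energy that these graph-cut methods minimize, with the Potts model as one choice of interaction term (notation assumed: data term $D_p$, neighborhood system $N$):

$$E(f)=\sum_{p} D_p(f_p)+\sum_{(p,q)\in N} V_{pq}(f_p,f_q),\qquad V^{\text{Potts}}_{pq}(f_p,f_q)=\lambda\,[f_p\neq f_q]$$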
  • 122. ◦ A simplified Bayes net: it propagates information throughout a graphical model via a series of messages passed between neighboring nodes, iteratively; it is likely to converge to a consensus that determines the marginal probabilities of all the variables; ◦ messages estimate the cost (or energy) of a configuration of a clique given all other cliques; the messages are then combined to compute a belief (marginal or maximum probability); Two types of BP methods: ◦ max-product; ◦ sum-product. BP provides the exact solution when there are no loops in the graph! It is equivalent to dynamic programming/Viterbi in these cases; Loopy belief propagation still provides an approximate (but often good) solution;
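A reference form of the max-product message update on a pairwise MRF (the sum-product variant replaces the max with a sum), with $\psi_{ij}$ the pairwise compatibility and $\phi_i$ the local evidence:

$$m_{i\to j}(x_j)\leftarrow\max_{x_i}\;\psi_{ij}(x_i,x_j)\,\phi_i(x_i)\prod_{k\in N(i)\setminus\{j\}} m_{k\to i}(x_i)$$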
  • 123. Generalized BP for pairwise MRFs ◦ Hidden variables xi and xj are connected through a compatibility function; ◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function; The joint probability of {x} is given by the formula below. To improve inference, take into account higher-order interactions among the variables; ◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes; ◦ This is the intuition behind Generalized Belief Propagation (GBP).
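Reconstructed from the definitions above (the standard pairwise-MRF form, with $Z$ the normalization constant):

$$P(\{x\},\{y\})=\frac{1}{Z}\prod_{(i,j)}\psi_{ij}(x_i,x_j)\prod_i\phi_i(x_i,y_i)$$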
  • 124. Stochastic Gradient Descent (SGD) • The general class of estimators that arise as minimizers of sums are called M-estimators; • Where are the stationary points of the likelihood function (the zeros of its derivative, the score function)? • Online gradient descent samples a subset of the summand functions at every step; • The true gradient is approximated by the gradient at a single example; • The training set is shuffled at each pass. • There is a compromise between the two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples (see the sketch below). • SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
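A minimal mini-batch SGD sketch (the gradient callback and hyper-parameters are placeholders; note the per-epoch shuffle):

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.01, epochs=10, batch_size=32, seed=0):
    # grad_fn(w, batch) returns the mini-batch gradient; data is an array of examples.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # shuffle the training set
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            w -= lr * grad_fn(w, batch)          # step along the mini-batch gradient
    return w
```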
  • 125. Back Propagation E(f(x0,w), y0) = −log f(x0,w)[y0], the negative log-likelihood of the target class y0.
  • 126. Variable Learning Rate Too large a learning rate ◦ causes oscillation in searching for the minimal point Too small a learning rate ◦ too slow convergence to the minimal point Adaptive learning rate ◦ At the beginning, the learning rate can be large while the current point is far from the optimal point; ◦ Gradually, the learning rate will decay as time goes by. Should not be too large or too small: ◦ annealing rate α(t) = α(0)/(1 + t/T) ◦ α(t) will eventually go to zero, but at the beginning it is almost a constant.
  • 129. Dropout and Maxout for Overfitting Dropout: set the output of each hidden neuron to zero w.p. 0.5 (see the sketch below). ◦ Motivation: combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging. ◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation. ◦ So every time an input is presented, the NN samples a different architecture, but all these architectures share weights. ◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units. ◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units. ◦ Without dropout, the network exhibits substantial overfitting. ◦ Dropout roughly doubles the number of iterations required to converge. Maxout takes the maximum across multiple feature maps;
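A sketch of the dropout mask, using the common “inverted” variant that rescales at training time so that no change is needed at test time (the original formulation instead halves the outgoing weights at test time):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    # Training: zero each unit with probability p, rescale survivors by 1/(1-p).
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)
```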
  • 130. Weight Decay for Overfitting Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification; ◦ Weight decay works by rescaling the weights in the learning rule, while bias learning stays the same (see the update below); ◦ Prefers to learn small weights; large weights are allowed only if they improve the original cost function; ◦ A way of compromising between finding small weights and minimizing the original cost function; In a linear model, weight decay is equivalent to ridge (Tikhonov) regression; L1 regularization: the weights that are not really useful shrink by a constant amount toward zero; ◦ Acts like a form of feature selection; ◦ Makes the input filters cleaner and easier to interpret; L2 regularization penalizes large values strongly, while L1 shrinks all weights by the same constant amount and drives many exactly to zero; Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distribution is the posterior distribution over weights & hyper-parameters; Hybrid Monte Carlo: gradient and sampling.
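The rescaling interpretation mentioned above, written out ($\eta$ the learning rate, $\lambda$ the decay coefficient, $E_0$ the unregularized cost):

$$w\;\leftarrow\;(1-\eta\lambda)\,w\;-\;\eta\,\frac{\partial E_0}{\partial w}$$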
  • 131. Early Stopping for Overfitting Steps in early stopping: ◦ Divide the available data into training and validation sets. ◦ Use a large number of hidden units. ◦ Use very small random initial values. ◦ Use a slow learning rate. ◦ Compute the validation error rate periodically during training. ◦ Stop training when the validation error rate "starts to go up". Early stopping has several advantages: ◦ It is fast. ◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size. ◦ It requires only one major decision by the user: what proportion of validation cases to use. Practical issues in early stopping: ◦ How many cases do you assign to the training and validation sets? ◦ Do you split the data into training and validation sets randomly or by some systematic algorithm? ◦ How do you tell when the validation error rate "starts to go up"?
  • 132. MCMC Sampling for Optimization Markov chain: a stochastic process in which future states are independent of past states given the present state. ◦ A Markov chain will typically converge to a stable distribution. Markov Chain Monte Carlo: sampling using ‘local’ information ◦ Devise a Markov chain whose stationary distribution is the target. ◦ An ergodic MC must be aperiodic, irreducible, and positive recurrent. ◦ Monte Carlo integration to get the quantities of interest. Metropolis-Hastings method: sampling from a target distribution ◦ Create a Markov chain whose transition matrix does not depend on the normalization term. ◦ Make sure the chain has a stationary distribution and that it equals the target distribution (acceptance ratio); see the sketch below. ◦ After a sufficient number of iterations, the chain will converge to the stationary distribution. Gibbs sampling is a special case of M-H sampling. ◦ The Hammersley-Clifford theorem: get the joint distribution from the complete conditional distributions. Hybrid Monte Carlo: a gradient sub-step for each Markov chain step.
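A minimal random-walk Metropolis sketch for a 1-D target (the proposal scale and the log-density interface are illustrative; the normalization constant cancels in the acceptance ratio):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    # Propose x' ~ N(x, step^2); accept with probability min(1, p(x')/p(x)).
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        x_new = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_target(x_new) - log_target(x):
            x = x_new                 # accept the proposal
        samples.append(x)             # otherwise keep the current state
    return np.array(samples)
```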
  • 133. Mean Field for Optimization Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution; Mean field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (note: F is the disconnected graph); ◦ The density becomes a factorized product distribution in this sub-family. ◦ Objective: K-L divergence. Mean field is a structured variational approximation approach: ◦ Coordinate ascent (deterministic); Compared with stochastic approximation (sampling): ◦ Faster, but maybe not exact.
  • 134. Contrastive Divergence for RBMs Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a quicker way to learn RBMs; ◦ Contrastive divergence as the new objective; ◦ Take gradients and ignore a term which is usually very small. Steps (see the sketch below): ◦ Start with a training vector on the visible units. ◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling); CD learning is biased: it does not work as true gradient descent. Improved: persistent CD explores more modes of the distribution ◦ Rather than starting from data samples, begin sampling from the mode samples obtained from the last gradient update. ◦ Still suffers from divergence of the likelihood due to missing modes. Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
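A sketch of one CD-1 update for a binary RBM (biases are omitted for brevity; following common practice, probabilities rather than samples are used for the statistics where possible):

```python
import numpy as np

def cd1_update(W, v0, lr=0.1, rng=np.random.default_rng()):
    # W: (n_visible, n_hidden) weights; v0: (batch, n_visible) training vectors.
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    h0_prob = sigmoid(v0 @ W)                    # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T)                  # one Gibbs step: reconstruct
    h1_prob = sigmoid(v1_prob @ W)               # ... and re-infer hiddens
    # approximate gradient: data statistics minus reconstruction statistics
    return W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
```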
  • 135. “Wake-Sleep” Algorithm for DBN A pre-trained DBN is a generative model; Do a stochastic bottom-up pass (wake phase) ◦ Get samples from the factorial distribution (visible first, then generate hidden); ◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below. Do a few iterations of sampling in the top-level RBM ◦ Adjust the weights in the top-level RBM. Do a stochastic top-down pass (sleep phase) ◦ Get visible and hidden samples generated by the generative model using data coming from nowhere! ◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. ◦ Any guarantee of improvement? No! The “wake-sleep” algorithm tries to make the representation economical (in the spirit of Shannon’s coding theory).
  • 136. Greedy Layer-Wise Training Deep networks tend to have more local-minima problems than shallow networks during supervised training Train the first layer using unlabeled data ◦ Supervised or semi-supervised: use more unlabeled data. Freeze the first layer parameters and train the second layer Repeat this for as many layers as desired ◦ Build more robust features Use the outputs of the final layer to train the last supervised layer (leaving early weights frozen) Fine-tune the full network with a supervised approach; Avoids problems of training a deep net in a supervised fashion. ◦ Each layer gets full learning ◦ Helps with ineffective early-layer learning ◦ Helps with deep network local minima
  • 137. Why Greedy Layer-Wise Training Works? Take advantage of the unlabeled data; Regularization Hypothesis ◦ Pre-training is “constraining” parameters in a region relevant to unsupervised dataset; ◦ Better generalization (representations that better describe unlabeled data are more discriminative for labeled data) ; Optimization Hypothesis ◦ Unsupervised training initializes lower level parameters near localities of better minima than random initialization can. Only need fine tuning in the supervised learning stage.
  • 138. Two-Stage Pre-training in DBMs Pre-training in one stage ◦ Positive phase: clamp the observed units, sample the hidden units, using a variational approximation (mean-field) ◦ Negative phase: sample both observed and hidden units, using persistent sampling (stochastic approximation: MCMC) Pre-training in two stages ◦ Approximate a posterior distribution over the states of the hidden units (with a simpler directed deep model such as a DBN or stacked DAE); ◦ Train an RBM by updating the parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of the hidden units. ◦ Options (CAST, contrastive divergence, stochastic approximation…).