Passive Stereo Vision: From Traditional
to Deep Learning-based Methods
YU HUANG
SUNNYVALE, CALIFORNIA
YU.HUANG07@GMAIL.COM
Outline
• Modeling from multiple views
• Stereo matching
• constraints in stereo vision
• difficulties in stereo vision
• pipeline of stereo matching
• state-of-the-art methods
• Quality metric of stereo matching
• census transform and hamming distance
• guided filter in cost aggregation (volume)
• semi-global matching
• ELAS: efficient large scale stereo
• stereo matching as energy minimization
• dynamic programming/graph cut/belief propagation
• phase matching for stereo vision
• disparity refinement
• Multiple cameras/views
• Learning sparse representations of depth maps
• Stereopsis via deep learning
• Deep learning of depth (and motion)
• Stereo matching by CNN
• Constant Highway Networks and Reflective
Confidence Learning;
• Efficient Deep Learning for Stereo Matching;
• E2e Learning of Geometry and Context;
• Appendix A: Depth from an image by learning
• Appendix B: Learning and optimization
Modeling from Multiple Views in Computer Vision
[Figure: taxonomy of multi-view modeling along two axes, time (one frame, two frames, ...) and number of cameras — photograph, binocular stereo, trinocular stereo, multi-baseline stereo, camcorder, human vision, camera dome.]
Binocular Stereo
• Given a calibrated binocular stereo pair, fuse it to produce a depth image
[Figure: a stereo pair (image 1, image 2) fused into a dense depth map.]
Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923
Basic Stereo Matching Algorithm
• For each pixel in the first image
Find corresponding epipolar line in the right image
Examine all pixels on the epipolar line and pick the best match
Triangulate the matches to get depth information
• Simplest case: epipolar lines are corresponding
scanlines;
• If necessary, rectify the two stereo images to transform
epipolar lines into scanlines
Depth from Disparity
[Figure: similar triangles for a rectified stereo pair — camera centers O and O' separated by baseline B, focal length f, and a scene point X at depth z imaged at x and x'.]

By similar triangles, the disparity is

$$d = x - x' = \frac{fB}{z}$$
Disparity is inversely proportional to depth!
• Stereo camera calibration (focal length and baseline are known)
• Image planes of cameras parallel to each other and to the baseline
• Camera centers are at same height
• Focal lengths are the same
• Then, epipolar lines fall along the horizontal scan lines of the images
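As a concrete illustration of the relation above, a minimal sketch converting a disparity map into metric depth (the focal length f and baseline B below are placeholder calibration values, not from any particular rig):

```python
import numpy as np

def disparity_to_depth(disparity, f=700.0, B=0.12):
    """z = f * B / d; zero/negative disparities are treated as invalid."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = f * B / disparity[valid]   # depth inversely proportional to disparity
    return depth
```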
Calibration
• Find the intrinsic and extrinsic parameters of a camera
◦ Extrinsic parameters: the camera’s location and orientation in the world.
◦ Intrinsic parameters: the relationships between pixel coordinates and camera coordinates.
• The work of Roger Tsai (3D setup) and of Zhengyou Zhang (2D plane) is influential
• Basic idea:
◦ Given a set of world points Pi and their image coordinates (ui,vi)
◦ find the projection matrix M
◦ And then find intrinsic and extrinsic parameters.
• Calibration Techniques
◦ Calibration using 3D calibration object
◦ Calibration using 2D planar pattern
◦ Calibration using 1D object (line-based calibration)
◦ Self Calibration: no calibration objects
◦ Vanishing points for orthogonal directions
Calibration
• Calibration using 3D calibration object:
◦ Calibration is performed by observing a calibration object whose geometry in 3D space is known with very good precision.
◦ Calibration object usually consists of two or three planes orthogonal to each other, e.g. calibration cube
◦ Calibration can also be done with a plane undergoing a precisely known translation (Tsai approach)
◦ (+) most accurate calibration, simple theory
◦ (-) more expensive, more elaborate setup
• 2D plane-based calibration (Zhang approach)
◦ Requires observations of a planar pattern at a few different orientations
◦ No need to know the plane motion
◦ Setup is easy; the most popular approach
◦ Seems to be a good compromise (see the code sketch after this list).
• 1D line-based calibration:
◦ Relatively new technique.
◦ Calibration object is a set of collinear points, e.g., two points with known distance, three collinear points with known
distances, four or more…
◦ Camera can be calibrated by observing a moving line around a fixed point, e.g. a string of balls hanging from the ceiling!
◦ Can be used to calibrate multiple cameras at once. Good for network of cameras.
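A hedged sketch of Zhang-style plane-based calibration with OpenCV (the 9x6 checkerboard size and the calib/*.png image folder are illustrative assumptions):

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                   # inner corners per row/column (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calib/*.png"):             # hypothetical image folder
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns RMS reprojection error, intrinsics K, distortion coefficients,
# and per-view extrinsics (rotation and translation vectors).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```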
Fundamental Matrix
Let p be a point in left image, p’ in right image
Epipolar relation
◦ p maps to epipolar line l’
◦ p’ maps to epipolar line l
Epipolar mapping described by a 3x3 matrix F
It follows that $p'^{\top} F p = 0$
[Figure: epipolar lines $l$ and $l'$ through corresponding points $p$ and $p'$.]
This matrix F is called
• the “Essential Matrix” E
– when image intrinsic parameters are known
• the “Fundamental Matrix”
– more generally (uncalibrated case)
Can solve for F from point correspondences
• Each (p, p’) pair gives one linear equation in entries of F
• 8 points suffice to solve for F (the 8-point algorithm); 5 points suffice for E (the 5-point algorithm)
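A sketch of estimating F from correspondences with OpenCV's RANSAC-wrapped 8-point algorithm; the synthetic matches below stand in for real feature matches:

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
pts1 = rng.uniform(0, 640, (50, 2)).astype(np.float32)          # stand-in matches
disp = rng.uniform(5, 40, 50).astype(np.float32)                # random disparities
pts2 = pts1 + np.column_stack([disp, np.zeros(50, np.float32)]) # rectified-style shift

F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0)

# Sanity check of the epipolar constraint: p'^T F p should be ~0 for inliers.
p = np.hstack([pts1, np.ones((len(pts1), 1), np.float32)])
p_ = np.hstack([pts2, np.ones((len(pts2), 1), np.float32)])
residuals = np.abs(np.einsum('ij,jk,ik->i', p_, F, p))
```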
Planar Rectification
Bring two views
to standard stereo setup
(moves the epipoles to infinity)
(not possible when an epipole lies in or close to the image)
Homographies are chosen to keep roughly the original image size (calibrated case)
or to minimize distortion (uncalibrated case)
Polar re-parameterization around epipoles
Requires only (oriented) epipolar geometry
Preserves the length of epipolar lines
Chooses the angular step between epipolar lines so that no pixels are compressed
[Figure: original image vs. rectified image.]
Polar Rectification
Works for all relative motions
Guarantees minimal image size
Determine the common region from the extremal
epipolar lines and the location of the epipole: $e'^{\top} F = 0$
Select half epipolar lines moving around the epipole
Construct rectified image line by line
[Figure: matching cost as a function of disparity for a window on corresponding left/right scanlines.]
Correspondence Search
• Slide a window along the right scanline and compare contents of that window with the
reference window in the left image
• Matching cost: SSD or normalized correlation
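A minimal sketch of this window-based search (unvectorized NumPy for clarity; winner-takes-all over SSD costs, rectified grayscale inputs assumed):

```python
import numpy as np

def block_match(left, right, max_disp=64, win=5):
    """Brute-force SSD block matching along rectified scanlines."""
    h, w = left.shape
    r = win // 2
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    disp = np.zeros((h, w), np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.sum((ref - right[y - r:y + r + 1, x - d - r:x - d + r + 1]) ** 2)
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))       # winner-takes-all
    return disp
```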
Constraints in Stereo Vision
• Color constancy
• Lambertian surface assumption;
• Epipolar geometry
• Scanline as epipolar line for a rectified pair;
• Uniqueness
• For any point in one image, there should be at
most one matching point in the other image;
• Ordering
• Corresponding points should be in the same
order in both views;
• Smoothness
• Disparities change slowly (for the most part).
[Figures: the epipolar plane and the epipolar lines for p and p'; illustrations of the uniqueness and ordering constraints.]
Difficulties in Stereo Vision
• Photometric distortions and noise;
• Foreshortening;
• Perspective distortions;
• Uniform/ambiguous regions;
• Repetitive/ambiguous patterns;
• Transparent objects;
• Occlusions and discontinuities.
Pipeline of Stereo Matching Methods
• Pre-processing: compensate for photometric distortion;
• LoG, Census transform, phase-only (DCT or WT), histogram equalization/matching, isotropic diffusion, …
• Cost computation:
• Absolute difference, squared difference, weighted difference, SAD, SSD, SWD, ZMNCC, …
• Cost aggregation:
• Bilateral filter, guided filter, non local, segment tree,...
• Disparity computation/optimization
• Integral image, box filtering, …
• Local (fast), global (slow), semi-global, …
• Disparity refinement
• Sub-pixel interpolation, median filter, cross check (left-right consistency check) and occlusion filling.
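In practice the whole classical pipeline is available off-the-shelf; a hedged sketch with OpenCV's semi-global block matcher (file names and parameters are typical placeholders, not tuned values):

```python
import cv2

left = cv2.imread("left.png")        # hypothetical rectified pair
right = cv2.imread("right.png")

matcher = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    P1=8 * 3 * 5 ** 2, P2=32 * 3 * 5 ** 2,        # small/large smoothness penalties
    uniquenessRatio=10, speckleWindowSize=100, speckleRange=2)

disp = matcher.compute(left, right).astype('float32') / 16.0   # fixed-point -> pixels
disp = cv2.medianBlur(disp, 5)                    # simple disparity refinement
```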
State-of-the-Art Stereo Matching Methods
• Local method
• Look at one image patch at a time
• Solve many small problems independently
• Faster, less accurate, usually works for high texture
• Needs enough texture in a patch to disambiguate
• Global method
• Look at the whole image
• Solve one large problem
• Slower, more accurate, works up to medium texture
• Propagates estimates from textured to untextured regions
• Sparse point-based method
• Still works for low textured regions, hard to handle ambiguous regions
• Semi-global method
• SGM (semi-global matching): reduces the 2-D search to 1-D searches along 8/16 directions.
Quality Metrics in Stereo Matching (Passive)
• General objective approaches:
• Compute error statistics w.r.t. some ground truth data;
• RMS (root-mean-squared) error (in disparity units) btw. computed disparity dC (x, y) and ground truth dT (x, y);
• Percentage of bad matching pixels;
• Select the following areas to support the analysis of matching results:
• textureless regions;
• occluded regions;
• depth discontinuity regions.
• Evaluate a synthetic image obtained by warping the reference with the disparity map;
• Forward warp the reference image by the computed disparity map;
• Inverse warp a new view by the computed disparity map.
• Subjective evaluation
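For reference, the two statistics above are usually written as follows (standard definitions from the stereo evaluation literature; $N$ is the pixel count and $\delta_d$ the bad-pixel threshold):

$$R = \left( \frac{1}{N} \sum_{(x,y)} \bigl| d_C(x,y) - d_T(x,y) \bigr|^2 \right)^{1/2}, \qquad B = \frac{1}{N} \sum_{(x,y)} \mathbf{1}\bigl[\, |d_C(x,y) - d_T(x,y)| > \delta_d \,\bigr]$$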
Census Transform and Hamming Distance
• The census transform converts each relative intensity difference in a window to 0 or 1, forming a binary vector whose
length matches the census window size;
• The census transform thus produces data of size (image size × vector size).
• Modified CTW: compared with the mean rather than the central pixel;
• Hamming distance of CT vectors with correlation windows used to find matched patches;
• Advantage: robustness to radiometric distortion, vignetting, lighting, boundaries and noise.
[Figure: worked example — a 5×5 intensity window is census-transformed by comparing each neighbor with the center pixel (X), yielding a bit string of length (square size of CTW) − 1, here 111111100000110001100011; the transform can be applied to intensity and gradient images respectively.]
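A sketch of a 5x5 census transform and the Hamming-distance matching cost (NumPy; borders wrap via np.roll, which a production version would mask out):

```python
import numpy as np

def census(img, r=2):
    """Encode each pixel as a (2r+1)^2 - 1 bit string of neighbor comparisons."""
    codes = np.zeros(img.shape, np.uint32)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            codes = (codes << 1) | (neighbor < img)   # one comparison convention
    return codes

def hamming_cost(census_left, census_right, d):
    """Per-pixel Hamming distance at disparity d (popcount of the XOR)."""
    xor = census_left ^ np.roll(census_right, d, axis=1)
    bits = np.unpackbits(xor.view(np.uint8), axis=-1)
    return bits.reshape(*xor.shape, 32).sum(-1)
```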
Guided Filter in Cost Aggregation for Stereo Matching
• Idea: stereo match as labeling, a spatially smooth labeling with label transitions aligned with color edges;
• Edge preserving filter: WLS, Anisotropic diffusion, bilateral filter, total variation filter, guided filter, ...
• Guided filter works better than bilateral filter;
• The filtered cost volume is $C'_i = \sum_j W_{i,j}(I)\, C_j$, where the filter weights $W_{i,j}$ depend only on the guidance image $I$;
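A hedged sketch of this cost-volume filtering (requires the opencv-contrib package for cv2.ximgproc; the cost volume and the float [0,1] left guidance image are assumed to be computed elsewhere):

```python
import cv2
import numpy as np

def aggregate(cost_volume, guide, radius=9, eps=1e-4):
    """Filter each disparity slice of cost_volume (D, H, W), steered by guide."""
    out = np.empty_like(cost_volume)
    for d in range(cost_volume.shape[0]):
        out[d] = cv2.ximgproc.guidedFilter(guide, cost_volume[d], radius, eps)
    return out

# Winner-takes-all after edge-aware aggregation:
# disparity = aggregate(cost_volume, left_guide).argmin(axis=0)
```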
PatchMatch Stereo
• Idea: first randomly initialize disparities and plane parameters for each pixel, then
update the estimates by propagating information from neighboring pixels;
• Spatial propagation: check, for each pixel, the disparities and plane parameters of its left and
upper neighbors and replace the current estimates if the matching costs are smaller;
• View propagation: warp the point into the other view and check the corresponding
estimates in the other image; replace if the matching costs are lower;
• Temporal propagation: propagate the information analogously by considering the
estimates for the same pixel in the preceding and following video frames;
• Plane refinement: disparity and plane parameters for each pixel are refined by generating random
samples within an interval and updating the estimates if the matching costs are reduced;
• Post-processing: remove outliers with left/right consistency checking and a weighted
median filter; gaps are filled by propagating information from the neighborhood.
Semi-Global Matching for Stereo Computation
• Semi-global matching approximates a global optimization by combining several local optimization steps;
• Minimizing E(D) over the full two-dimensional image would be very costly, so SGM simplifies it by traversing
one-dimensional paths and enforcing the smoothness constraints along these explicit directions;
• At least 8 paths (16 suggested): horizontal, vertical and diagonal orientations;
• For instance, cost aggregation along a path in direction r is

$$L_r(\mathbf{p}, d) = C(\mathbf{p}, d) + \min\Bigl( L_r(\mathbf{p}-\mathbf{r}, d),\; L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\; L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\; \min_i L_r(\mathbf{p}-\mathbf{r}, i) + P_2 \Bigr) - \min_k L_r(\mathbf{p}-\mathbf{r}, k)$$

with a small penalty $P_1$ for disparity changes of one level and a large penalty $P_2$ for larger disparity changes;
• Pixel-based matching costs are computed with mutual information;
• Left-right consistency check for occlusion detection and disparity propagation for hole filling;
• To accelerate the process, down-sampled image pairs are used for disparity estimation.
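A NumPy sketch of the path aggregation above for the single left-to-right direction; the full method sums $L_r$ over 8 (or 16) directions before the winner-takes-all step:

```python
import numpy as np

def aggregate_lr(cost, P1=10.0, P2=120.0):
    """cost: (H, W, D) matching-cost volume -> path costs for r = left-to-right."""
    H, W, D = cost.shape
    L = np.empty_like(cost)
    L[:, 0] = cost[:, 0]
    for x in range(1, W):
        prev = L[:, x - 1]                               # L_r(p - r, .) for all rows
        prev_min = prev.min(axis=1, keepdims=True)
        d_minus = np.pad(prev, ((0, 0), (1, 0)), 'edge')[:, :-1] + P1   # d - 1 term
        d_plus = np.pad(prev, ((0, 0), (0, 1)), 'edge')[:, 1:] + P1     # d + 1 term
        best = np.minimum(np.minimum(prev, d_minus),
                          np.minimum(d_plus, prev_min + P2))
        L[:, x] = cost[:, x] + best - prev_min           # subtract min_k L_r(p-r, k)
    return L
```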
ELAS: Efficient Large Scale Stereo Matching
• Prior model;
• Likelihood model;
• The posterior can be factorized by Bayes' rule;
• The likelihood is calculated along the epipolar line;
• Disparity estimation as MAP inference;
• Equivalent to minimizing an energy function;
(A mean function links the support points and the observations.)
Stereo as Energy Minimization
• Find disparities d that minimize an energy function
• Simple pixel/window matching: $C(x, y, d)$ = SSD distance between windows $I(x, y)$ and $J(x, y + d(x, y))$
[Figure: the disparity space image (DSI) $C(x, y, d)$ for one scanline (y = 141), with x and d as axes.]
• Choose the minimum of each column in the DSI independently.
Dynamic Programming (DP) in Stereo Matching
• Can minimize E(d) independently per scanline using dynamic programming (DP);
[Figure: DP matching grid between the left and right scanlines, with match cost $C_{corr}$ for sequential moves and occlusion cost $C_{occl}$ for left- and right-occluded moves.]
Three cases:
• Sequential – cost of match
• Left occluded – cost of no match
• Right occluded – cost of no match
• DP yields the optimal path through grid,
the best set of matches for the ordering
constraint in scan-line stereo.
• Graph Cut
• Delete enough edges so that
• each pixel is connected to exactly one label node
• Cost of a cut: sum of deleted edge weights
• Finding min cost cut equivalent to finding global minimum of energy
function
Energy Minimization via Graph Cuts
[Figure: graph with pixel nodes and label nodes for disparities d1, d2, d3; edge weights on n-links and t-links encode the costs.]
• What defines a good stereo correspondence?
• 1. Match quality
• Want each pixel to find a good match in the other image
• 2. Smoothness
• If two pixels are adjacent, they should (usually) move
about the same amount
$$E(d) = \underbrace{\sum_p C(p, d_p)}_{\text{match cost}} + \underbrace{\sum_{(p,q) \in \mathcal{N}} V(d_p, d_q)}_{\text{smoothness cost}}$$

where $V$ is, e.g., the Potts model $V(d_p, d_q) = \mathbf{1}[d_p \neq d_q]$ or an L1 distance $V(d_p, d_q) = |d_p - d_q|$.
Graph cut: convert the multi-way cut into a sequence of binary cuts.
Model Stereo Vision by MRF and Solution by Belief Propagation
• Allows rich probabilistic models for images.
• But built in a local, modular way. Learn local relationships, get global effects out.
[Figure: MRF for stereo — a grid of hidden disparity nodes $x_i$, linked to neighboring disparity nodes by a disparity-disparity compatibility function $\Psi$ and to the local image observations $y_i$ by an image-disparity compatibility function $\Phi$.]

$$P(\{x\}, \{y\}) = \frac{1}{Z} \prod_{(i,j)} \Psi(x_i, x_j) \prod_i \Phi(x_i, y_i)$$
BELIEFS: approximate posterior marginal distributions over the neighborhood of node i;
MESSAGES: approximate sufficient statistics;
I. Belief update (message product);
II. Message propagation (convolution).
Hierarchical Belief Propagation (HBP) and Constant Space HBP
• HBP works in a coarse-to-fine manner;
• (a) initialize the messages at the coarsest level to all zeros;
• (b) apply BP at the coarsest level to iteratively refine the messages;
• (c) use refined messages from the coarser level to initialize the messages for the next level.
• Constant-space HBP relies on the fact that only a small number of disparity levels and the corresponding
message values are needed at each pixel to losslessly reconstruct the BP messages;
• Apply the coarse-to-fine (CTF) scheme to both spatial and depth domain, i.e. gradually reduce the number of
disparity levels as the messages propagate in CTF;
• Re-computes the data term at each level (not at each iteration);
• About 9/8 times slower, but memory does not grow with the maximum disparity;
• Energy computed only once at the finest level;
• Gradually reduce the disparity levels in CTF.
• The closer the messages are to the fixed points, the fewer the required disparity levels; Then, CSBP refines the
messages hierarchically to approach the fixed points.
Phase Matching in Frequency or WT Domain
• Phase reflects the structure information of the signal and inhibit the HF noise effect;
• Phase singularity is a problem;
• Local phase information as the primitive;
• Wavelet transform builds a hierarchical framework for multi-level coarse-to-fine processing;
• Stereo matching (disparity) with phase separation and instantaneous frequency of signals:
• Dynamic programming (DP) is used for global optimization (occlusion handling) in stereo matching;
• Phase is not uniformly stable;
• Smoothness constraints;
• Discontinuities detection;
• Multiple resolution solution:
• 1. top level: control points with feature matching, apply DP;
• 2. middle level: interpolation, apply DP;
• 3. bottom level: sub-pixel precision.
[Figures: left/right images, local phase, and the resulting disparity; results for the original image, phase matching, and phase matching with DP.]
Disparity/Depth Refinement
• Sub-pixel refinement: real valued disparities may be obtained by approximating the cost
function locally using a parabola;
• Left-Right Consistency Check: outlier detection by difference;
• By computing a disparity for every pixel of the left image (left to right);
• by computing a disparity for every pixel of the right image (right to left);
• Segmentation can be used for outlier identification.
• Occlusion filling:
• Occlusion detection;
• Background expansion;
• Inpainting.
• Discontinuities smoothing:
• Bilateral filtering.
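Two of these refinement steps are easy to make concrete; a hedged NumPy sketch of parabola sub-pixel interpolation and the left-right consistency check (1-pixel threshold assumed):

```python
import numpy as np

def subpixel(costs, d):
    """Fit a parabola through the cost minimum and its two neighbors."""
    if d == 0 or d == len(costs) - 1:
        return float(d)
    c0, c1, c2 = costs[d - 1], costs[d], costs[d + 1]
    denom = c0 - 2.0 * c1 + c2
    return d + 0.5 * (c0 - c2) / denom if denom > 0 else float(d)

def lr_check(disp_l, disp_r, thresh=1.0):
    """Mark pixels whose left->right and right->left disparities disagree."""
    h, w = disp_l.shape
    xs = np.arange(w)
    ok = np.zeros_like(disp_l, dtype=bool)
    for y in range(h):
        xr = np.clip(xs - disp_l[y].astype(int), 0, w - 1)
        ok[y] = np.abs(disp_l[y] - disp_r[y, xr]) <= thresh
    return ok   # False marks occlusions/outliers for later filling
```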
Multiple Cameras
Multi-baseline stereo
use the third view to verify depth estimates
Spatio-Temporal Video Disparity Estimation
• A key problem in extending stereo to video is flickering;
• Typical methods:
• Spatial temporal consistency: smoothing in the space-time volume;
• Post-processing of disparity maps by applying a median filter along the flow fields;
• Spatial-temporal cost aggregation and solved by local/global optimization methods;
• Joint disparity and flow estimation;
• SGM-based, as an instance;
• Modeled with MRF and solved by global optimization.
• Scene flow: 2D motion field along with 1D disparity change field.
• Dense methods are computationally very expensive;
• Sparse methods rely heavily on the success of the initial sparse correspondences.
Sparse Coding
Sparse coding (Olshausen & Field, 1996).
Originally developed to explain early visual processing in the brain (edge detection).
Objective: given a set of input data vectors $\{x_i\}$, learn a dictionary of
bases $\{d_k\}$ such that:
Each data vector is represented as a sparse linear combination of bases.
Sparse: mostly zeros
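The objective on the original slide is not reproduced here; a standard L1-penalized form of it is:

$$\min_{D,\, \{\alpha_i\}} \ \sum_i \left\| x_i - D \alpha_i \right\|_2^2 + \lambda \sum_i \left\| \alpha_i \right\|_1 \quad \text{s.t.} \quad \|d_k\|_2 \le 1 \ \ \forall k$$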
Predictive Sparse Coding
Recall the objective function for sparse coding:
Modify by adding a penalty for prediction error:
◦ Approximate the sparse code with an encoder
PSD for hierarchical feature training
◦ Phase 1: train the first layer;
◦ Phase 2: use encoder + absolute value as 1st feature extractor
◦ Phase 3: train the second layer;
◦ Phase 4: use encoder + absolute value as 2nd feature extractor
◦ Phase 5: train a supervised classifier on top layer;
◦ Phase 6: optionally train the whole network with supervised BP.
Methods of Solving Sparse Coding
Greedy methods: projecting the residual on some atom;
◦ Matching pursuit, orthogonal matching pursuit;
L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);
◦ The residual is updated iteratively in the direction of the atom;
Gradient-based finding new search directions
◦ Projected Gradient Descent
◦ Coordinate Descent
Homotopy: a set of solutions indexed by a parameter (regularization)
◦ LARS (Least Angle Regression)
First order/proximal methods: Generalized gradient descent
◦ solving efficiently the proximal operator
◦ soft-thresholding for L1-norm
◦ Accelerated by the Nesterov optimal first-order method
Iterative reweighting schemes
◦ L2-norm: Chartrand and Yin (2008)
◦ L1-norm: Candès et al. (2008)
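As a concrete instance of the proximal approach above, a minimal ISTA sketch: a gradient step on the data term followed by soft-thresholding, the proximal operator of the L1 norm:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam=0.1, n_iter=100):
    """Solve min_a ||x - D a||^2 + lam * ||a||_1 by iterative shrinkage."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)             # gradient of the data term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```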
Strategy of Dictionary Selection
• What D to use?
• A fixed overcomplete set of basis: no adaptivity.
• Steerable wavelet;
• Bandlet, curvelet, contourlet;
• DCT Basis;
• Gabor function;
• ….
• Data adaptive dictionary – learn from data;
• K-SVD: a generalized K-means clustering process for Vector Quantization (VQ).
• An iterative algorithm to effectively optimize the sparse approximation of signals in a learned
dictionary.
• Other methods of dictionary learning:
• non-negative matrix decompositions.
• sparse PCA (sparse dictionaries).
• fused-lasso regularizations (piecewise constant dictionaries)
• Extending the models: Sparsity + Self-similarity=Group Sparsity
Learning Sparse Representation in Depth Maps
• Sparse representations learned from
Middlebury database disparity maps;
• Then they are exploited in a two-layer
graphical model for inferring depth from
stereo, by including a sparsity prior on
the learned features;
◦ The first layer is solved using an existing MRF-based stereo matching algorithm;
◦ The second layer is solved using the non-stationary sparse coding algorithm.
Learning Sparse Representation in Depth Maps
[Figure: disparity results — (c) graph cut vs. (d) graph cut + sparse coding.]
Deep Learning
Representation learning attempts to automatically learn good features or representations;
Deep learning algorithms attempt to learn multiple levels of representation of increasing
complexity/abstraction (intermediate and high level features);
Become effective via unsupervised pre-training + supervised fine tuning;
◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than
shallow networks.
Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);
Semi-supervised: structure of manifold assumption;
◦ labeled data is scarce and unlabeled data is abundant.
Why Deep Learning?
Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization
problem);
◦ Learn prior from unlabeled data;
Shallow models are not for learning high-level abstractions;
◦ Ensembles or forests do not learn features first;
◦ Graphical models could be deep net, but mostly not.
Unsupervised learning could be “local-learning”;
◦ Resemble boosting with each layer being like a weak learner
Learning is weak in directed graphical models with many hidden variables;
◦ Sparsity and regularizer.
Traditional unsupervised learning methods cannot easily learn multiple levels of
representation;
◦ Layer-wise unsupervised learning is the solution.
Multi-task learning (transfer learning and self taught learning);
Other issues: scalability & parallelism with the burden from big data.
Multi Layer Neural Network
A neural network = running several logistic regressions at the same time;
◦ Neuron=logistic regression or…
Calculate error derivatives (gradients) to refine: back propagate the error derivative through model
(the chain rule)
◦ Online learning: stochastic/incremental gradient descent
◦ Batch learning: conjugate gradient descent
Problems in MLPs
Multi-Layer Perceptrons (MLPs), a classic feed-forward neural network, were popular for decades.
Gradient is progressively getting more scattered
◦ Below the top few layers, the correction signal is minimal
Gets stuck in local minima
◦ Especially start out far from ‘good’ regions (i.e., random initialization)
In usual settings, use only labeled data
◦ Almost all data is unlabeled!
◦ Instead the human brain can learn from unlabeled data.
Convolutional Neural Networks
CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized
neural input;
◦ local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often, spatial
or temporal sub-sampling;
◦ Related to generative MRF/discriminative CRF:
◦ CNN=Field of Experts MRF=ML inference in CRF;
◦ Generate ‘patterns of patterns’ for pattern recognition.
Each layer combines (merge, smooth) patches from previous layers
◦ Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.
◦ Convolution filters: (translation invariance) unsupervised;
◦ Local contrast normalization: increase sparsity, improve optimization/invariance.
C layers convolutions,
S layers pool/sample
Convolutional Neural Networks
Convolutional Networks are trainable multistage architectures composed of multiple stages;
Input and output of each stage are sets of arrays called feature maps;
At output, each feature map represents a particular feature extracted at all locations on input;
Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;
A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
◦ A fully connected layer: softmax transfer function for posterior distribution.
Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;
Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;
◦ In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N;
Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;
Supervised training is performed using a form of SGD to minimize the prediction error;
◦ Gradients are computed with the back-propagation method.
Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.
* is discrete convolution operator
LeNet (LeNet-5)
A layered model composed of convolution and subsampling operations followed by a holistic representation
and ultimately a classifier for handwritten digits;
Local receptive fields (5x5) with local connections;
Output via an RBF function, one for each class, with 84 inputs each;
Learning by Graph Transformer Networks (GTN);
AlexNet
A layered model composed of convol., subsample., followed by a holistic
representation and all-in-all a landmark classifier;
Consists of 5 convolutional layers, some of which followed by max-pooling
layers, 3 fully-connected layers with a final 1000-way softmax;
Fully-connected layers: linear classifiers/matrix multiplications;
ReLU are rectified-linear nonlinearities on layer output, can be trained
several times faster;
Local (contrast) normalization scheme aids generalization;
Overlapping pooling slightly less prone to overfitting;
Data augmentation: artificially enlarge the dataset using label-preserving
transformations;
Dropout: setting the output of each hidden neuron to zero with prob. 0.5;
Trained by SGD with batch # 128, momentum 0.9, weight decay 0.0005.
The network’s input is 150,528-dimensional, and the number of neurons in the network’s
remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.
MattNet
Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;
Preprocessing: subtracting a per-pixel mean;
Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the image and
randomly flipped horizontally to provide more views of each example;
SGD with mini-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting;
65M parameters trained for 12 days on a single Nvidia GPU;
Visualization by layered DeconvNets: project the feature activations back to the input pixel space;
◦ Reveal input stimuli exciting individual feature maps at any layer;
◦ Observe evolution of features during training;
◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important;
DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve structure;
Multiple such models were averaged together to further boost performance;
Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). # 1-5
layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i) via a rectified linear function, (ii)
3x3 max pooled (stride 2), (iii) contrast normalized 55x55 feature maps. # 6-7 layers: fully connected, input in
vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C - number of classes.
Top: A deconvnet layer (left) attached to
a convnet layer (right). The deconvnet
will reconstruct approximate version of
convnet features from the layer beneath.
Bottom: Unpooling operation in the
deconvnet, using switches which record
the location of the local max in each
pooling region (colored zones) during
pooling in the convnet.
Oxford VGG Net: Very Deep CNN
Networks of increasing depth using an architecture with very small (3×3) convolution filters;
◦ Spatial pooling is carried out by 5 max-pooling layers;
◦ A stack of convolutional layers followed by three Fully-Connected (FC) layers;
◦ All hidden layers are equipped with the rectification ReLU non-linearity;
◦ No Local Response Normalisation!
Trained by optimising the multinomial logistic regression objective using SGD;
Regularised by weight decay and dropout regularisation for the first two fully-connected layers;
The learning rate was initially set to $10^{-2}$, and then decreased by a factor of 10;
For random initialisation, sample the weights from a normal distribution;
Derived from the publicly available C++ Caffe toolbox, allow training and evaluation on multiple GPUs
installed in a single system, and on full-size (uncropped) images at multiple scales;
Combine the outputs of several models by averaging their soft-max class posteriors.
The depth of the configurations increases from the left (A) to the
right (E), as more layers are added (the added layers are shown in
bold). The convolutional layer parameters are denoted as
“conv<receptive field size> - <number of channels>”. The ReLU
activation function is not shown for brevity.
GoogleNet
Questions:
◦ Vanishing gradient?
◦ Exploding gradient?
◦ Tricky weight initialization?
Deep convolutional neural network architecture codenamed Inception;
◦ Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and
covered by readily available dense components;
◦ Judiciously applying dimension reduction and projections wherever the computational requirements would increase
too much otherwise;
Increasing the depth and width of the network but keeping the computational budget constant;
◦ Drawbacks: Bigger size typically means a larger number of parameters, which makes the enlarged network more
prone to overfitting and the dramatically increased use of computational resources;
◦ Solution: From fully connected to sparsely connected architectures, analyze the correlation statistics of the
activations of the last layer and clustering neurons with highly correlated outputs.
◦ Based on the well known Hebbian principle: neurons that fire together, wire together;
Trained using the DistBelief: distributed machine learning system.
Problems with training deep architectures?
[Figure: the Inception module (with dimension reductions) — convolution, pooling, softmax and other blocks; the full network stacks 9 Inception modules, a "network in a network in a network".]
PReLU Networks at MSR
A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit;
◦ PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk;
◦ Allows negative activations on the ReLU function with a control parameter a learned adaptively;
◦ Resolve diminishing gradient problem for very deep neural networks (> 13 layers) ;
Derive a robust initialization method better than “Xavier” (normalization) initialization;
Also use Spatial Pyramid Pooling (SPP) layer just before the fully connected layers;
Can train extremely deep rectified models and investigate deeper or wider network architectures;
[Figure: ReLU vs. PReLU.] Note: μ is momentum, ϵ is the learning rate.
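The activation itself is one line; a NumPy sketch (0.25 is a common initialization for the learned slope a):

```python
import numpy as np

def prelu(y, a=0.25):
    """f(y) = y for y > 0, a * y otherwise; a is learned per channel."""
    return np.where(y > 0, y, a * y)
```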
PReLU Networks at MSR
Performance: 4.94% top-5 test error on the ImageNet 2012 Classification dataset;
◦ ILSVRC 2014 winner (GoogLeNet, 6.66%);
Adopt the momentum method in BP training;
Mostly initialized by random weights from Gaussian distr.;
Investigate the variance of the FP responses in each layer;
Consider a sufficient condition in BP:
◦ The gradient is not exponentially large/small.
[Figure: architectures of the large PReLU network models.]
Batch Normalization at Google
Normalizing layer inputs for each mini-batch to handle saturating
nonlinearities and covariate shift;
◦ Internal Covariate Shift (ICS): the change in the distribution of network activations
due to the change in network parameters during training;
◦ Whitening to reduce ICS: linear transform to have zero means and unit variances, and
decorrelated;
◦ Fix the means and variance of layer inputs (instead of whitening jointly the features in
both I/O);
◦ Batch normalizing transform applied for activation over a mini-batch;
◦ BN transform is differentiable transform introducing normalized activations into the
network;
Batch normalized networks
◦ Unbiased variance estimate;
◦ Moving average;
Batch normalized ConvNets
◦ Effective mini-batch size;
◦ Per feature, not per activation.
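A minimal sketch of the batch-normalizing transform for one mini-batch (x of shape (batch, features); gamma and beta are the learned scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift
```

At inference time, running (moving) averages of mu and var replace the mini-batch statistics.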
Batch Normalization at Google
Reduce the dependence of gradients on the scale of the parameters or of the initial values;
◦ Prevent small changes from amplifying into larger and suboptimal changes in activation in gradients;
◦ Stabilize the parameter growth and make gradient propagation better behaved in BN training;
In some cases, eliminate the need of dropout as a regularizer;
◦ In ImageNet Classification, remove local response normalization and reduce photometric distortions;
◦ Reach 4.9% in top-five validation error and 4.8% test error (human raters only 5.1%).
Accelerating BN network:
◦ Enable larger learning rate and less care about initialization, which accelerates the training;
◦ Reduce L2 weight regularization;
◦ Accelerate the learning rate decay.
Batch Normalization at Google
Inception architecture
Neural Turing Machines
A Neural Turing Machine (NTM) architecture contains two basic components: a neural
network controller and a memory bank;
◦ During each update cycle, the controller network receives inputs from an external
environment and emits outputs in response;
◦ It also reads from and writes to a memory matrix via a set of parallel read and write heads.
These weightings arise by combining two addressing mechanisms with complementary
facilities;
◦ “content-based addressing”: focuses attention on locations based on the similarity between their
current values and values emitted by the controller;
◦ “location-based addressing”: the content of a variable is arbitrary, but the variable still needs a
recognizable name or addresses, by location, not by content;
Controller network: feed forward or recurrent.
Neural Turing Machines
Neural Turing Machine Architecture.
Flow Diagram of the Addressing Mechanism.
Highway Networks: Information Highway
Ease gradient-based training of very deep networks;
Allow unimpeded info. flow across several layers on information highways;
Use gating units to learn regulating the flow of info. through a network;
A highway network consists of multiple blocks such that the ith block computes a block
state Hi(x) and transform gate output Ti(x);
Highway networks with hundreds of layers can be trained directly using SGD and with a
variety of activation functions.
$$y = H(x) \cdot T(x) + x \cdot C(x), \qquad C = 1 - T$$
where $T$ is the transform gate and $C$ the carry gate.
Deep Residual Learning for Image Recognition
Reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning
unreferenced functions;
◦ Denote the desired underlying mapping as H(x); the stacked nonlinear layers then fit another mapping F(x) = H(x) - x;
◦ The formulation of F(x)+x can be realized by feed forward NN with “shortcut connections” (such as “Highway
Network” and “Inception”);
These residual networks are easier to optimize, and can gain accuracy from considerably increased depth;
An ensemble of 152-layer residual nets achieves 3.57% error on the ImageNet test set;
◦ 224x224 crops, per-pixel mean subtracted, color augmentation, batch normalization;
◦ SGD with a mini-batch size of 256, learning rate starting at 0.1 and divided by 10 at plateaus;
◦ Weight decay of 0.0001 and a momentum of 0.9, no dropout;
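A hedged PyTorch sketch of a basic residual block realizing F(x) + x (channel counts and layer choices are illustrative):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)           # identity shortcut: F(x) + x
```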
Rethink Inception Architecture for Computer Vision
Scale up networks in ways that aim at utilizing the added computation efficiently by factorized convolutions
and aggressive regularization;
Design principles in Inception:
◦ Avoid representational bottlenecks, especially early in the network;
◦ Higher dimensional representations are easier to process locally within a network;
◦ Spatial aggregation over lower dim embeddings w/o loss in representational power;
◦ Balance the width and depth of the network.
Factorizing convolutions with large filter size: asymmetric convolutions;
Auxiliary classifiers: act as regularizer, esp. batch normalized or dropout;
Grid size reduction: two parallel stride 2 blocks (pooling and activation) ;
Model regularization via label smoothing: marginalized effect of dropout;
Trained with TensorFlow: SGD with 50 replicas, batch size 32 for 100 epochs, a learning rate of 0.045
with an exponential decay rate of 0.94, and a decay of 0.9.
Rethink Inception Architecture for Computer Vision
Inception modules after the factorization of the nxn
convolutions. In the proposed architecture, it chooses
n = 7 for the 17x17 grid.
Inception modules with expanded
the filter bank outputs.
Inception modules where
each 5x5 convolution is
replaced by two 3x3
convolutions.
Rethink Inception Architecture for Computer Vision
Auxiliary classifier on top
of the last 17x17 layer Inception module that reduces the grid-size while
expands the filter banks. It is both cheap and avoids
the representational bottleneck.
The outline of the proposed
network architecture
Belief Nets
Belief net is a directed acyclic graph composed of stochastic variables.
Can observe some of the variables and solve two problems:
◦ inference: Infer the states of the unobserved variables.
◦ learning: Adjust the interactions between variables to more likely generate the observed data.
[Figure: a belief net — stochastic hidden causes with directed connections to visible effects; such nets are composed of layers of stochastic variables with weighted connections.]
Boltzmann Machines
Energy-based models associate an energy to each configuration of the stochastic variables of interest (for
example, MRF, nearest neighbor);
◦ Learning means adjustment of the low energy function’s shape properties;
Boltzmann machine is a stochastic recurrent model with hidden variables;
◦ Monte Carlo Markov Chain, i.e. MCMC sampling (appendix);
Restricted Boltzmann machine is a special case:
◦ Only one layer of hidden units;
◦ factorization of each layer’s neurons/units (no connections in the same layer);
Contrastive divergence: approximation of gradient (appendix).
For an RBM: probability $p(v,h) = e^{-E(v,h)}/Z$; energy function $E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j$; learning rule $\Delta W_{ij} \propto \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}$.
Deep Belief Networks
A hybrid model: can be trained as generative or
discriminative model;
Deep architecture: multiple layers (learn features
layer by layer);
◦ Multi-layer learning is difficult in sigmoid belief networks.
◦ Top two layers are undirected connections, RBM;
◦ Lower layers get top down directed connections
from layers above;
Unsupervised or self-taught pre-learning provides
a good initialization;
◦ Greedy layer-wise unsupervised training for
RBM
Supervised fine-tuning
◦ Generative: wake-sleep algorithm (Up-down)
◦ Discriminative: back propagation (bottom-up)
Deep Boltzmann Machine
Learning internal representations that become increasingly complex;
High-level representations built from a large supply of unlabeled inputs;
Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann
machine (undirected graph);
Generative fine-tuning: different from DBN
◦ Positive and negative phase (appendix)
Discriminative fine-tuning: the same to DBN
◦ Back propagation.
Denoising Auto-Encoder
Multilayer NNs with target output=input;
Reconstruction=decoder(encoder(input));
◦ Perturbs the input x to a corrupted version;
◦ Randomly sets some of the coordinates of input to zeros.
◦ Recover x from encoded perturbed data.
Learns a vector field towards higher probability regions;
Pre-trained with DBN or regularizer with perturbed training data;
Minimizes variational lower bound on a generative model;
◦ corresponds to regularized score matching on an RBM;
PCA=linear manifold=linear Auto Encoder;
Auto-encoder learns the salient variation like a nonlinear PCA.
Stacked Denoising Auto-Encoder
Stack many (may be sparse) auto-encoders in succession and train them using greedy layer-wise
unsupervised learning
◦ Drop the decode layer each time
◦ Performs better than stacking RBMs;
Supervised training on the last layer using final features;
(option) Supervised training on the entire network to fine-tune all weights of the neural net;
Empirically not quite as accurate as DBNs.
Stereopsis via Deep Learning
• Learn a binocular cross-correlation model: use two quadrature pairs to detect disparity;
◦ Various filters correspond to phases, positions and frequencies;
• Disparity as a latent variable: a pattern of matching filter responses;
◦ A joint probabilistic model over patch pairs and disparity, defined as a Boltzmann machine;
◦ Training amounts to finding the parameters that maximize the log-probability over pairs;
◦ An RBM is used in this case;
◦ During inference, each latent variable receives activity
from exactly two products of matched filter responses.
Stereopsis via Deep Learning
[Figure: example training data — rows 1–3 show rendered image planes for the left/right cameras (in row 3 the right camera is rotated by 45° around the z axis); images are rendered from the depth maps in row 4 and a randomly selected texture map from the Berkeley Segmentation Database.]
Example pairs from NORB-cluttered dataset. Learned binocular filter pairs.
Unsupervised Learning of Depth (and Motion)
• Learning about the interrelations between images from multiple cameras, multiple
frames in a video, or the combination of both;
• Depth and motion in a feature learning architecture based on the energy model;
• A single-layer autoencoder model uses multiplicative interactions to detect synchrony,
with a pooling layer independently trained on the hidden responses to achieve content
invariance;
• Depth as a latent variable in learning:
• Reconstruction error;
• Contraction as regularization;
• Complete objective function: reconstruction error plus the contraction penalty;
Note: there is no need for rectification, since
the model can learn any transformation
between the frames, not just horizontal shifts.
Unsupervised Learning of Depth (and Motion)
• Extension to stereo sequences: both depth and motion;
 Encoding depth:
 Encoding motion:
 Multiview disparity:
[Figure: model architectures — representations of depth, motion and disparity, each built by pooling over products of frame responses.]
Unsupervised Learning of Depth (and Motion)
Filters learned on stereo patch pairs from KITTI dataset.
Example of a filter pair learned on sequences by the
SAE-D model from the Hollywood3D dataset.
Stereo Matching by CNN
• Train a convolutional neural network on pairs of small image patches;
• The network output is used to initialize the matching cost btw a pair of patches;
• Eight layers, L1 through L8 with input as 9x9 gray patch and matching cost as output;
• 1st layer as convolutional only and other layers are fully connected.
• Rectified linear units follow each layer, except L8, but NO pooling!
• Trained with SGD (batch size 128), on 194 image pairs, with 45 million extracted examples.
• Matching costs are combined between neighboring pixels with similar image
intensities using cross-based cost aggregation;
• Smoothness constraints are enforced by semi-global matching (SGM) and a left-right
consistency check is used to detect and eliminate errors in occluded regions;
• sub-pixel enhancement and median filter + bilateral filter -> final disparity map;
• Achieves an error rate of 2.61% on the KITTI stereo benchmark (vs. 2.83% for the previous best).
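A hedged PyTorch sketch of such a patch-matching network (the layer widths below are illustrative, not the paper's exact ones; input is a stacked 9x9 left/right patch pair):

```python
import torch
import torch.nn as nn

class MatchNet(nn.Module):
    def __init__(self, n_hidden=200):
        super().__init__()
        self.conv = nn.Conv2d(2, 32, kernel_size=5)     # L1: convolutional layer
        self.fc = nn.Sequential(                        # remaining fully connected layers
            nn.Linear(32 * 5 * 5, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 1))                     # matching cost; no ReLU, no pooling

    def forward(self, patch_pair):                      # (N, 2, 9, 9)
        x = torch.relu(self.conv(patch_pair)).flatten(1)
        return self.fc(x)
```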
Stereo Matching by CNN
[Figure: the support region used by cross-based cost aggregation.]
A Deep Embedding Model for Stereo Matching Costs
This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding
image patches, and maps intensity values into an embedding feature space to measure pixel dissimilarities;
Features are extracted in a pair of patches at different scales, followed
by an inner product to obtain the matching scores, then the scores
from different scales are then merged for an ensemble.
The deployed network architecture of the testing model for deep embedding.
Features are extracted from the two images only once; the sliding-window-style
inner products can be grouped into a single matrix operation.
Improved Stereo Matching with Constant Highway
Networks and Reflective Confidence Learning
A 3-step pipeline for the stereo matching problem and a highway network architecture for computing the
matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a
hybrid loss that supports multilevel comparison of image patches.
A post-processing step employs a second deep convolutional neural network for pooling global
information from multiple disparities.
It outputs both the image disparity map, which replaces the conventional “winner takes all” strategy, and a
confidence in the prediction.
The confidence score is achieved by training the network with a reflective loss.
The learned confidence is employed to better detect outliers in the refinement.
Improved Stereo Matching with Constant Highway
Networks and Reflective Confidence Learning
The λ-ResMatch architecture of the matching cost network
Improved Stereo Matching with Constant Highway
Networks and Reflective Confidence Learning
The Global disparity network model for representing disparity patches
Efficient Deep Learning for Stereo Matching
• A matching network which is able to
produce very accurate results in less than a
second of GPU computation;
• A product layer which simply computes the
inner product between the two representations of
a Siamese architecture;
• Treats the disparity estimation problem as
multi-class classification, where the classes are
all possible disparities.
A Siamese network extracts marginal distributions
over all possible disparities for each pixel.
four-layer Siamese network
architecture
The code and data are available online at:
http://www.cs.toronto.edu/deepLowLevelVision.
End-to-End Learning of Geometry and
Context for Deep Stereo Regression
A deep learning architecture for regressing disparity from a rectified pair
of stereo images.
Leverage knowledge of the problem’s geometry to form a cost volume
using deep feature representations.
Learn to incorporate contextual information using 3-D convolutions over
this volume.
Disparity values are regressed from the cost volume using a differentiable soft
argmin operation, which allows training end-to-end to sub-pixel accuracy
without any additional post-processing or regularization.
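The soft argmin is compact enough to state directly; a hedged PyTorch sketch, assuming a cost volume of shape (N, D, H, W):

```python
import torch

def soft_argmin(cost_volume):
    """Softmax over negated costs, then the expected disparity (differentiable)."""
    prob = torch.softmax(-cost_volume, dim=1)           # per-pixel disparity distribution
    d = torch.arange(cost_volume.shape[1],
                     dtype=prob.dtype, device=prob.device).view(1, -1, 1, 1)
    return (prob * d).sum(dim=1)                        # sub-pixel disparity map (N, H, W)
```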
End-to-End Learning of Geometry and
Context for Deep Stereo Regression
End-to-end deep stereo regression architecture, GC-Net (Geometry and Context Network)
End-to-End Training of Hybrid
CNN-CRF Models for Stereo
• Convolutional neural networks (CNNs) + optimization-based approaches for stereo estimation;
• The optimization, posed as a conditional random field (CRF), takes local matching costs and consistency-
enforcing (smoothness) costs as inputs, both estimated by CNN blocks;
• Inference in the CRF is based on a linear programming relaxation with a fixed number of iterations;
• Training end-to-end: in the discriminative formulation (structured SVM), training is practically feasible;
• The optimization part efficiently replaces post-processing steps with a trainable, well-understood model.
A CNN, called Unary-CNN, computes features of the two
images for each pixel. The features are compared using a
Correlation layer. The resulting matching cost volume becomes
the unary cost of the CRF. The pairwise costs of the CRF are
parametrized by edge weights, which can either follow a usual
contrast sensitive model or estimated by the Pairwise-CNN.
End-to-End Training of Hybrid
CNN-CRF Models for Stereo
[Equations/figure: the cross-correlation of features φ0 and φ1; the CRF model optimizes a labeling cost composed of matching costs and pairwise terms.]
Appendix A:
Depth from Single Image by Learning
Learning-based Depth from Image
Initial over-segmentation (super pixels);
Markov Random Field (MRF) to infer patch’s orientation and location from image features (texture, color and
gradient);
◦ Connected, co-planar or colinear as prior;
◦ Occlusion boundaries /folds indication;
◦ Multi-conditional learning; solved by linear program;
MRF overlaid on “super pixels”
Occlusion/fold
Coplanarity and Colinearity
Single Image Depth Estimation From Predicted Semantic Labels
Semantic segmentation to guide the 3D reconstruction;
Works like holistic scene understanding:
◦ 1. Multi-class image labeling MRF for scene segmentation;
◦ 2. Depth estimation for each semantic class by learning (logistic regression);
◦ 3. Scene depth estimation by MRF (pixel or super-pixel) with potentials (learned boosted decision tree classifiers) and priors on
geometry (horizon prediction, vertical objects), pixel smoothness, super-pixel soft connectivity, co-planarity and
orientation.
semantically derived geometric constraints
Smoothed per-pixel log-depth prior for each semantic class with horizon rotated to center of image
Image semantic overlay ground truth depth measurements
Learning Depth from Examples
Two similar images are likely to have similar 3D structure (depth).
Nearest-neighbor (kNN) search: finding k image+depth pairs that are most similar to the query (histograms of
oriented gradients as feature);
Depth fusion: median filtering of the k depth fields;
Joint-bilateral depth filtering: smoothing of the median-fused depth.
[Figure: pipeline — k-NN search on the query, depth fusion and smoothing, depth output.]
Note: depth (disparity) warping via SIFT-flow in aligning with the query is omitted.
Depth Transfer for Monocular Video
k-NN search for candidate frames matching the query;
Depth changes are gradual frame-to-frame;
Moving objects are usually on the ground;
Candidates are warped with SIFT flow and regularized with smoothness and prior terms;
Is the computational cost worth it?
Depth Inference with MRF
To form a basis (dictionary) over the RGB and depth spaces, and represent depth maps by
a sparse linear combination of weights.
A prediction function is estimated between weight vectors in RGB to depth space to
recover depth maps from query images.
A final super-pixel post processor aligns depth maps with occlusion boundaries, creating
physically plausible results.
Scalable Exemplar Based Depth Transfer
Images with similar global depth profiles are clustered together in 2D using RGB pairwise
features (left); sparse positive descriptors on depth (right) are effective in grouping
images with similar depth profiles together.
Estimate a transformation T that maps points from one space to the other.
Learning to be a Depth Camera (Active Near-IR)
• Use hybrid classification-regression forests to learn how to map from near infrared
intensity images to absolute, metric depth in real-time;
• Simplify the problem by dividing it into sub-problems in the first layer, and then applies models
trained for these sub-problems in the second layer to solve the main problem efficiently;
• Restrict the depths of the object to a certain range for significant simplification;
• The first layer learns to infer a coarsely quantized depth range for each pixel, and optionally
pools these predictions across all pixels to obtain a more reliable distribution over these depth
ranges;
• The second layer then applies one or more expert regressors trained specifically on the inferred
depth ranges.
• Note: the forests do not need to explicitly model scene illumination, surface geometry and
reflectance, or complex inter-reflections, required by traditional SFS methods.
Learning to be a Depth Camera (Active Near-IR)
• Comparable to high-quality consumer depth cameras with a reduced cost, power
consumption, and form-factor.
Learning to be a Depth Camera (Active Near-IR)
• Applied for specific hand and face objects.
Appendix B:
Machine Learning and Optimization
Graphical Models
• Graphical Models: Powerful framework for representing dependency
structure between random variables.
• The joint probability distribution over a set of random variables.
• The graph contains a set of nodes (vertices) that represent random variables, and a set
of links (edges) that represent dependencies between those random variables.
• The joint distribution over all random variables decomposes into a product of
factors, where each factor depends on a subset of the variables.
• Two type of graphical models:
• Directed (Bayesian networks)
• Undirected (Markov random fields, Boltzmann machines)
• Hybrid graphical models that combine directed and undirected models, such as Deep
Belief Networks, Hierarchical-Deep Models.
Generative Model: MRF
Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi takes
value fi in a label set L.
Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it
satisfies Markov property.
◦ Generative model for joint probability p(x)
◦ allows no direct probabilistic interpretation
◦ define potential functions Ψ on maximal cliques A
◦ map joint assignment to non-negative real number
◦ requires normalization
MRFs are undirected graphical models.
Graph Cuts for Optimization
A flow network G(V, E) is defined as a fully connected directed graph
where each edge (u,v) in E has a positive capacity c(u,v) >= 0;
The max-flow problem is to find the flow of maximum value on a
flow network G;
A s-t cut or simply cut of a flow network G is a partition of V into S
and T = V-S, such that s in S and t in T;
A minimum cut of a flow network is a cut whose capacity is the
least over all the s-t cuts of the network;
Methods of max flow or mini-cut:
◦ Ford Fulkerson method;
◦ "Push-Relabel" method.
Mostly labeling is solved as an energy minimization problem;
Two common energy models:
◦ Potts Interaction Energy Model;
◦ Linear Interaction Energy Model.
Graph G contain two kinds of vertices: p-vertices and i-vertices;
◦ all the edges in the neighborhood N, called n-links;
◦ edges between the p-vertices and the i-vertices called t-links.
In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex;
The minimum cost multi-way cut will minimize the energy function where the severed n-links would
correspond to the boundaries of the labeled vertices;
The approximation algorithms to find this multi-way cut:
◦ "alpha-expansion" algorithm;
◦ "alpha-beta swap" algorithm.
Belief Propagation (BP)
• A simplified Bayes net: it propagates information throughout a graphical model via a series
of messages between neighboring nodes iteratively; it is likely to converge to a consensus that
determines the marginal probabilities of all the variables;
 messages estimate the cost (or energy) of a configuration of a clique given all other cliques;
then the messages are combined to compute a belief (marginal or maximum probability);
Two types of BP methods:
◦ max-product;
◦ sum-product.
BP provides exact solution when there are no loops in graph!
Equivalent to dynamic programming/Viterbi in these cases;
Loopy Belief Propagation: still provides approximate (but often good) solution;
Generalized BP for pairwise MRFs
◦ Hidden variables xi and xj are connected through a compatibility function;
◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function;
The joint probability is given by $P(\{x\},\{y\}) = \frac{1}{Z} \prod_{(i,j)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i, y_i)$
To improve inference by taking into account higher-order interactions among the
variables;
◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes;
◦ This is the intuition in Generalized Belief Propagation (GBP).
Stochastic Gradient Descent (SGD)
• The general class of estimators that arise as minimizers of sums are called M-
estimators;
• Their minimizers are stationary points of the likelihood function (zeroes of its derivative, the score
function);
• Online gradient descent samples a subset of summand functions at every step;
• The true gradient is approximated by a gradient at a single example;
• Shuffling of training set at each pass.
• There is a compromise between two forms, often called "mini-batches", where the
true gradient is approximated by a sum over a small number of training examples.
• SGD converges almost surely to a global minimum when the objective function
is convex or pseudo-convex, and otherwise converges almost surely to a local
minimum.
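A minimal mini-batch SGD loop for linear least squares, with the per-pass shuffling described above:

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=10, batch=32):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))              # shuffle each pass
        for s in range(0, len(X), batch):
            b = idx[s:s + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # mini-batch gradient
            w -= lr * grad
    return w
```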
Back Propagation
Minimize a loss such as the negative log-likelihood, e.g. $E(f(x_0, w), y_0) = -\log p(y_0 \mid f(x_0, w))$.
Variable Learning Rate
Too large learning rate
◦ cause oscillation in searching for the minimal point
Too slow learning rate
◦ too slow convergence to the minimal point
Adaptive learning rate
◦ At the beginning, the learning rate can be large when the current point is far from the
optimal point;
◦ Gradually, the learning rate will decay as time goes by.
Should not be too large or too small:
◦ annealing rate $\alpha(t) = \alpha(0) / (1 + t/T)$
◦ $\alpha(t)$ will eventually go to zero, but at the beginning it is almost a constant.
Variable Momentum
AdaGrad/AdaDelta
Dropout and Maxout for Overfitting
Dropout: set the output of each hidden neuron to zero w.p. 0.5.
◦ Motivation: Combining many different models that share parameters succeeds in reducing test
errors by approximately averaging together the predictions, which resembles the bagging.
◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not
participate in back propagation.
◦ So every time an input is presented, the NN samples a different architecture, but all these
architectures share weights.
◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence
of particular other units.
◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many
different random subsets of the other units.
◦ Without dropout, the network exhibits substantial overfitting.
◦ Dropout roughly doubles the number of iterations required to converge.
Maxout takes the maximum across multiple feature maps;
Weight Decay for Overfitting
Weight decay or L2 regularization adds a penalty term to the error function, a term called the
regularization term (the negative log prior in the Bayesian justification): E'(w) = E(w) + (λ/2)‖w‖²;
◦ Weight decay works as rescaling the weights in the learning rule, while bias learning stays the same;
◦ Prefers to learn small weights; large weights are allowed only if they improve the original cost function;
◦ A way of compromising between finding small weights and minimizing the original cost function;
In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
L1 regularization: weights that are not really useful shrink by a constant amount toward zero;
◦ Acts like a form of feature selection;
◦ Makes the input filters cleaner and easier to interpret;
L2 regularization penalizes large values strongly, while L1 regularization penalizes all values at the
same rate, tolerating a few large weights while driving small ones exactly to zero (both updates are sketched below);
Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distr. is the
posterior distribution for weights & hyper-parameters;
Hybrid Monte Carlo: gradient and sampling.
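A minimal sketch of how the two penalties enter the SGD update rule; the function names are illustrative.

```python
import numpy as np

# L2 weight decay in the SGD rule: the penalty's gradient lam*w rescales
# the weights by (1 - lr*lam) each step; bias learning stays the same.
def sgd_step_l2(w, b, grad_w, grad_b, lr=0.1, lam=1e-4):
    w = (1 - lr * lam) * w - lr * grad_w     # shrink weights, then descend
    b = b - lr * grad_b                      # no decay on biases
    return w, b

# L1 regularization instead shrinks every weight by a constant amount,
# pushing not-really-useful weights exactly to zero (feature selection).
def sgd_step_l1(w, grad_w, lr=0.1, lam=1e-4):
    return w - lr * grad_w - lr * lam * np.sign(w)
```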
Early Stopping for Overfitting
Steps in early stopping:
◦ Divide the available data into training and validation sets.
◦ Use a large number of hidden units.
◦ Use very small random initial values.
◦ Use a slow learning rate.
◦ Compute the validation error rate periodically during training.
◦ Stop training when the validation error rate "starts to go up".
Early stopping has several advantages:
◦ It is fast.
◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size.
◦ It requires only one major decision by the user: what proportion of validation cases to use.
Practical issues in early stopping:
◦ How many cases do you assign to the training and validation sets?
◦ Do you split the data into training and validation sets randomly or by some systematic algorithm?
◦ How do you tell when the validation error rate "starts to go up"?
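One common answer to the last question is a "patience" rule: stop once the validation error has failed to improve for several consecutive checks. A minimal sketch, where the rule and its constants are design choices rather than anything from the slides:

```python
import numpy as np

# Early-stopping skeleton with a "patience" rule for deciding when the
# validation error "starts to go up".
def train_with_early_stopping(step_fn, val_error_fn, max_steps=10000,
                              check_every=100, patience=5):
    best, best_step, bad_checks = np.inf, 0, 0
    for t in range(max_steps):
        step_fn()                          # one SGD update on training data
        if t % check_every == 0:
            err = val_error_fn()           # periodic validation error
            if err < best:
                best, best_step, bad_checks = err, t, 0
            else:
                bad_checks += 1            # error went up at this check
                if bad_checks >= patience:
                    break                  # stop: validation error rising
    return best_step, best                 # roll back to the best snapshot

# Toy usage: a scripted validation-error curve that bottoms out, then rises.
errs = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.9] + [1.0] * 100)
print(train_with_early_stopping(lambda: None, lambda: next(errs),
                                max_steps=2000, patience=3))
```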
MCMC Sampling for Optimization
Markov Chain: a stochastic process in which future states are independent of past states given the
present state.
◦ A Markov chain will typically converge to a stable distribution.
Markov Chain Monte Carlo: sampling using 'local' information
◦ Devise a Markov chain whose stationary distribution is the target.
◦ Ergodic MC must be aperiodic, irreducible, and positive recurrent.
◦ Monte Carlo Integration to get quantities of interest.
Metropolis-Hastings method: sampling from a target distribution
◦ Create a Markov chain whose transition matrix does not depend on the normalization term.
◦ Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio).
◦ After a sufficient number of iterations, the chain converges to the stationary distribution.
Gibbs sampling is a special case of M-H Sampling.
◦ The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distribution.
Hybrid Monte Carlo: gradient sub step for each Markov chain.
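A minimal Metropolis-Hastings sketch with a symmetric random-walk proposal; the unnormalized target density is a toy choice. Note the acceptance ratio uses only the unnormalized density, so the normalization term never appears.

```python
import numpy as np

# Metropolis-Hastings with a Gaussian random-walk proposal; only an
# unnormalized target density is needed.
def target_unnorm(x):
    return np.exp(-0.5 * x**2) * (1 + 0.5 * np.sin(3 * x))**2

def metropolis_hastings(n=10000, step=1.0, rng=np.random.default_rng(0)):
    x, samples = 0.0, []
    for _ in range(n):
        x_prop = x + step * rng.normal()        # symmetric proposal
        # accept ratio: the normalization term Z cancels out
        if rng.random() < target_unnorm(x_prop) / target_unnorm(x):
            x = x_prop
        samples.append(x)                       # keep state even on rejection
    return np.array(samples)

s = metropolis_hastings()
print(s.mean(), s.std())   # Monte Carlo estimates of target moments
```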
Mean Field for Optimization
Variational approximation modifies the optimization problem to be tractable, at the price of an
approximate solution;
Mean Field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (note: F is a
disconnected graph);
◦ The density becomes a factorized product distribution in this sub-family.
◦ Objective: K-L divergence.
Mean field is a structured variational approximation approach:
◦ Coordinate ascent (deterministic), as sketched below;
Compared with stochastic approximation (sampling):
◦ Faster, but maybe not exact.
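To make the coordinate-ascent updates concrete, here is a minimal mean-field sketch for a fully factorized q on a toy Ising-style model; the couplings, fields, and sweep count are illustrative.

```python
import numpy as np

# Mean field for q(x) = prod_i q_i(x_i) over binary x_i in {-1, +1},
# coordinate ascent on the variational bound. J: couplings, h: fields (toy).
rng = np.random.default_rng(0)
n = 6
J = rng.normal(scale=0.3, size=(n, n))
J = (J + J.T) / 2; np.fill_diagonal(J, 0)     # symmetric, no self-coupling
h = rng.normal(size=n)

m = np.zeros(n)                               # m_i = E_q[x_i]
for _ in range(50):                           # deterministic coordinate sweeps
    for i in range(n):
        m[i] = np.tanh(h[i] + J[i] @ m)       # fixed-point update for q_i
print(m)
```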
Contrastive Divergence for RBMs
Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a
quicker way to learn RBMs;
◦ Contrastive divergence as the new objective;
◦ Taking gradients and ignoring a term which is usually very small.
Steps:
◦ Start with a training vector on the visible units.
◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs
sampling);
CD learning is biased: the update it follows is not the gradient of any objective function;
Improvement: Persistent CD explores more modes in the distribution
◦ Rather than restarting from data samples, continue sampling from the chain states left by the last gradient
update.
◦ Still suffers from divergence of the likelihood when modes are missed.
Score matching: the score function does not depend on the normalization factor, so match the score of the
model with that of the empirical density.
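A minimal CD-1 sketch for a small binary RBM, following the steps above: one alternation of hidden/visible updates starting from a training vector, then the difference of correlations as the parameter update. The sizes, learning rate, and data are toy values.

```python
import numpy as np

# CD-1 for a small binary RBM: one Gibbs alternation from the data,
# then the difference of correlations as the gradient estimate.
rng = np.random.default_rng(0)
nv, nh, lr = 8, 4, 0.05
W = rng.normal(scale=0.1, size=(nv, nh)); a = np.zeros(nv); b = np.zeros(nh)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
data = (rng.random((100, nv)) < 0.3).astype(float)   # toy training vectors

for _ in range(50):
    for v0 in data:
        ph0 = sigmoid(v0 @ W + b)                    # positive phase: hidden
        h0 = (rng.random(nh) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + a)                  # reconstruct visibles
        v1 = (rng.random(nv) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b)                    # negative-phase statistics
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)
```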
“Wake-Sleep” Algorithm for DBN
Pre-trained DBN is a generative model;
Do a stochastic bottom-up pass (wake phase)
◦ Get samples from factorial distribution (visible first, then generate hidden);
◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
Do a few iterations of sampling in the top level RBM
◦ Adjust the weights in the top-level RBM.
Do a stochastic top-down pass (sleep phase)
◦ Get visible and hidden samples generated by the generative model, using data coming from nowhere!
◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
◦ Any guarantee of improvement? No!
The "Wake-Sleep" algorithm tries to make the representation economical, in the sense of Shannon's coding
theory.
Greedy Layer-Wise Training
Deep networks tend to have more local minima problems than shallow networks during
supervised training
Train the first layer using unlabeled data
◦ Unsupervised or semi-supervised: can use more unlabeled data.
Freeze the first layer parameters and train the second layer
Repeat this for as many layers as desired
◦ Build more robust features
Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)
Fine-tune the full network with a supervised approach (see the sketch after this list);
This avoids the problems of training a deep net from scratch in a purely supervised fashion:
◦ Each layer gets full learning
◦ Help with ineffective early layer learning
◦ Help with deep network local minima
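A minimal sketch of this recipe using stacked autoencoders in place of RBMs (an assumption for brevity); PyTorch is used, and the layer sizes, data, and training schedules are illustrative.

```python
import torch, torch.nn as nn

# Greedy layer-wise pretraining with stacked autoencoders, then fine-tuning.
X = torch.randn(512, 64)                       # unlabeled data (toy)
sizes = [64, 32, 16]
encoders, H = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.SGD([*enc.parameters(), *dec.parameters()], lr=0.1)
    for _ in range(200):                       # train this layer only
        opt.zero_grad()
        loss = ((dec(torch.sigmoid(enc(H))) - H) ** 2).mean()
        loss.backward(); opt.step()
    encoders.append(enc)
    H = torch.sigmoid(enc(H)).detach()         # freeze: features for next layer

# Supervised head on top, then fine-tune the whole network.
y = torch.randint(0, 3, (512,))                # toy labels
head = nn.Linear(sizes[-1], 3)
model = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders], head)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward(); opt.step()
```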
Why Greedy Layer-Wise Training Works?
Take advantage of the unlabeled data;
Regularization Hypothesis
◦ Pre-training is “constraining” parameters in a region relevant to unsupervised
dataset;
◦ Better generalization (representations that better describe unlabeled data are
more discriminative for labeled data);
Optimization Hypothesis
◦ Unsupervised training initializes the lower-level parameters near basins of better
minima than random initialization does.
Only fine tuning is needed in the supervised learning stage.
Two-Stage Pre-training in DBMs
Pre-training in one stage
◦ Positive phase: clamp observed, sample hidden, using variational approximation (mean-field)
◦ Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation:
MCMC)
Pre-training in two stages
◦ Approximate a posterior distribution over the states of hidden units (with a simpler directed deep model such as
a DBN or stacked DAE);
◦ Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding
posterior of hidden units;
◦ Options (CAST, contrastive divergence, stochastic approximation, …).
Passive stereo vision with deep learning

  • 1. Passive Stereo Vision: From Traditional to Deep Learning-based Methods YU HUANG SUNNYVALE, CALIFORNIA YU.HUANG07@GMAIL.COM
  • 2. Outline • Modeling from multiple views • Stereo matching • constraints in stereo vision • difficulties in stereo vision • pipeline of stereo matching • state of art methods • Quality metric of stereo matching • census transform and hamming distance • guided filter in cost aggregation (volume) • semi-global matching • ELAS: efficient large scale stereo • stereo matching as energy minimization • dynamic programming/graph cut/belief propagation • phase matching for stereo vision • disparity refinement • Multiple cameras/views • Learning sparse represent. of depth maps • Stereopsis via deep learning • Deep learning of depth (and motion) • Stereo matching by CNN • Constant Highway Networks and Reflective Confidence Learning; • Efficient Deep Learning for Stereo Matching; • E2e Learning of Geometry and Context; • Appendix A: Depth from an image by learning • Appendix B: Learning and optimization
  • 3. Modeling from Multiple Views in Computer Vision time # cameras photograph binocular stereo trinocular stereo multi-baseline stereo camcorder human vision camera dome two frames ... ...
  • 4. Binocular Stereo • Given a calibrated binocular stereo pair, fuse it to produce a depth image image 1 image 2 Dense depth map Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923
  • 5. Basic Stereo Matching Algorithm • For each pixel in the first image Find corresponding epipolar line in the right image Examine all pixels on the epipolar line and pick the best match Triangulate the matches to get depth information • Simplest case: epipolar lines are corresponding scanlines; • If necessary, rectify the two stereo images to transform epipolar lines into scanlines
  • 6. Depth from Disparity f x x’ Baseline B z O O’ X f z fB xxdisparity   Disparity is inversely proportional to depth! • Stereo camera calibration (focal length and baseline are known) • Image planes of cameras parallel to each other and to the baseline • Camera centers are at same height • Focal lengths are the same • Then, epipolar lines fall along the horizontal scan lines of the images
  • 7. Calibration • Find the intrinsic and extrinsic parameters of a camera ◦ Extrinsic parameters: the camera’s location and orientation in the world. ◦ Intrinsic parameters: the relationships between pixel coordinates and camera coordinates. • Work of Roger Tsai and work of Zhengyou Zhang are influential: 3-D node setting and 2-d plane • Basic idea: ◦ Given a set of world points Pi and their image coordinates (ui,vi) ◦ find the projection matrix M ◦ And then find intrinsic and extrinsic parameters. • Calibration Techniques ◦ Calibration using 3D calibration object ◦ Calibration using 2D planer pattern ◦ Calibration using 1D object (line-based calibration) ◦ Self Calibration: no calibration objects ◦ Vanishing points from for orthogonal direction
  • 8. Calibration • Calibration using 3D calibration object: ◦ Calibration is performed by observing a calibration object whose geometry in 3D space is known with very good precision. ◦ Calibration object usually consists of two or three planes orthogonal to each other, e.g. calibration cube ◦ Calibration can also be done with a plane undergoing a precisely known translation (Tsai approach) ◦ (+) most accurate calibration, simple theory ◦ (-) more expensive, more elaborate setup • 2D plane-based calibration (Zhang approach) ◦ Require observation of a planar pattern at few different orientations ◦ No need to know the plane motion ◦ Set up is easy, most popular approach ◦ Seems to be a good compromise. • 1D line-based calibration: ◦ Relatively new technique. ◦ Calibration object is a set of collinear points, e.g., two points with known distance, three collinear points with known distances, four or more… ◦ Camera can be calibrated by observing a moving line around a fixed point, e.g. a string of balls hanging from the ceiling! ◦ Can be used to calibrate multiple cameras at once. Good for network of cameras.
  • 9. Fundamental Matrix Let p be a point in left image, p’ in right image Epipolar relation ◦ p maps to epipolar line l’ ◦ p’ maps to epipolar line l Epipolar mapping described by a 3x3 matrix F It follows that l’l p p’ This matrix F is called • the “Essential Matrix” E – when image intrinsic parameters are known • the “Fundamental Matrix” – more generally (uncalibrated case) Can solve for F from point correspondences • Each (p, p’) pair gives one linear equation in entries of F • 8/5 points give enough to solve for F/E (8/5-point algo)
  • 10. Planar Rectification Bring two views to standard stereo setup (moves epipole to ) (not possible when in/close to image) ~ image size (calibrated) Distortion minimization (uncalibrated)
  • 11.
  • 12. Polar re-parameterization around epipoles Requires only (oriented) epipolar geometry Preserve length of epipolar lines Choose  so that no pixels are compressed original image rectified image Polar Rectification Works for all relative motions Guarantees minimal image size Determine the common region from the extremal epipolar lines and the location of epiole: e’F=0 Select half epipolar lines moving around the epipole Construct rectified image line by line
  • 13.
  • 14. Matching cost disparity Left Right scanline Correspondence Search • Slide a window along the right scanline and compare contents of that window with the reference window in the left image • Matching cost: SSD or normalized correlation
  • 15. Constraints in Stereo Vision • Color constancy • Lambertian surface assumption; •Epipolar geometry • Scanline as epipolar line for rectifed pair; • Uniqueness • For any point in one image, there should be at most one matching point in the other image; • Ordering • Corresponding points should be in the same order in both views; • Smoothness • Disparities to change slowly (the most part). Epipolar plane Epipolar line for pEpipolar line for p’ Uniqueness Ordering
  • 16. Difficulties in Stereo Vision • Photometric distortions and noise; • Foreshortening; • Perspective distortions; • Uniform/ambiguous regions; • Repetitive/ambiguous patterns; • Transparent objects; • Occlusions and discontinuities.
  • 17. Pipeline of Stereo Matching Methods • Pre-processing: compensate for photometric distortion; • LoG, Census transform, phase only(DCT or WT), histogram equalization/matching, isotropic diffusion, … • Cost computation: • Absolute difference, squared difference, weighted difference, SAD, SSD, SWD, ZMNCC, … • Cost aggregation: • Bilateral filter, guided filter, non local, segment tree,... • Disparity computation/optimization • Integral image, box filtering, … • Local (fast), global (slow), semi-global, … • Disparity refinement • Sub pixel interpolation, median filter, cross check (left-right consistency check) and occlusion filling.
  • 18. State-of-Art Stereo Matching Methods • Local method • Look at one image patch at at time • Solve many small problems independently • Faster, less accurate, usually works for high texture • Needs enough texture in a patch to disambiguate • Global method • Look at the whole image • Solve one large problem • Slower, more accurate, works up to medium texture • Propagates estimates from textured to untextured regions • Sparse point-based method • Still works for low textured regions, hard to handle ambiguous regions • Semi-global method • SGM (semi-global-matching), 2-d search to 1-d search along 8/16 directions.
  • 19. Quality Metrics in Stereo Matching (Passive) • General objective approaches: • Compute error statistics w.r.t. some ground truth data; • RMS (root-mean-squared) error (in disparity units) btw. computed disparity dC (x, y) and ground truth dT (x, y); • Percentage of bad matching pixels; • Select the following areas support the analysis of matching results • textureless regions; • occluded regions; • depth discontinuity regions. • Evaluate synthetic image by warping the reference with disparity map; • Forward warp the reference image by the computed disparity map; • Inverse warp a new view by the computed disparity map. • Subjective evaluation
  • 20. Census Transform and Hamming Distance • Census transform converts relative intensity difference to 0 or 1 and deforms 1 dimensional vector as much as window size of census transform; • Census transform makes data of (image size * vector size). • Modified CTW: compared with the mean rather than the central pixel; • Hamming distance of CT vectors with correlation windows used to find matched patches; • Advantage: robustness to radiometric distortion, vignetting, lighting, boundaries and noise. 210159998639 198170326747 45677810298 304033115109 393126130121 11111 11000 00X11 00011 00011 111111100000110001100011 Census transform window (CTW) Height Width Height Width (Square size of CTW)-1
  • 22. Guided Filter in Cost Aggregation for Stereo Matching • Idea: stereo match as labeling, a spatially smooth labeling with label transitions aligned with color edges; • Edge preserving filter: WLS, Anisotropic diffusion, bilateral filter, total variation filter, guided filter, ... • Guided filter works better than bilateral filter; • • Cost volume filtering with guided filter works like segmentation implicitly; Wi,j : The filter weights depend on the guidance image IC’ : the filtered cost volume
  • 23.
  • 24. PatchMatch Stereo • Idea: First a random initialization of disparities and plane para.s for each pix. and update the estimates by propagating info. from the neighboring pix.s; • Spatial propagation: Check for each pix. the disparities and plane para.s for left and upper neighbors and replace the current estimates if matching costs are smaller; • View propagation: Warp the point in the other view and check the corresponding etimates in the other image. Replace if the matching costs are lower; • Temporal propagation: Propagate the information analogously by considering the etimates for the same pixel at the preceding and consecutive video frame; • Plane refinement: Disparity and plane para.s for each pix. refined by generat. random samples within an interval and updat. estimates if matching costs reduced; • Post-processing: Remove outliers with left/right consistency checking and weighted median filter; Gaps are filled by propagating information from the neighborhood.
  • 26. Semi-Global Matching for Stereo Computation • Semi-global matching approximates a global optimization by combining several local optimization steps; • Minimizing E(D) in a two-dimensional manner would be very costly, while SGM simplifies it by traversing one-dimensional paths and ensures the constraints with respect to these explicit directions; • At least 8 paths (16 suggested), like horizontal, vertical and diagonal orientations; • For instance, cost aggregation along a horizontal path as • Pixel-based cost computation by mutual information as • Left-right consistency check for occlusion detection and disparity propagation for hole filling. • To accelerate the process, down-sampled image pairs are used for disparity estimation. a small penalty P1 a large penalty P2 for large disparity changes
  • 27.
  • 28. ELAS: Efficient Large Scale Stereo Matching
  • 29. ELAS: Efficient Large Scale Stereo Matching • Prior model: • Likelihood model: • Posterior can be factorized by the Bayes rule as • Likelihood calculated along the epipolar line as • Disparity estimation as MAP: • To minimize an energy function A mean function linking the support points and the observations
  • 30.
  • 31. Stereo as Energy Minimization • Find disparities d that minimize an energy function • Simple pixel / window matching = SSD distance between windows I(x, y) and J(x, y + d(x,y)) I(x, y) J(x, y) y = 141 C(x, y, d); the disparity space image (DSI) x d • Choose the minimum of each column in the DSI independently:
  • 32. Dynamic Programming (DP) in Stereo Matching • Can minimize E(d) independently per scanline using dynamic programming (DP); leftS rightS Left occlusion t q Right occlusion s p occlC occlC corrC Three cases: • Sequential – cost of match • Left occluded – cost of no match • Right occluded – cost of no match Left image Right image I I • DP yields the optimal path through grid, the best set of matches for the ordering constraint in scan-line stereo.
  • 33. d1 d2 d3 • Graph Cut • Delete enough edges so that • each pixel is connected to exactly one label node • Cost of a cut: sum of deleted edge weights • Finding min cost cut equivalent to finding global minimum of energy function Energy Minimization via Graph Cuts Labels (disparities) edge weight edge weight • What defines a good stereo correspondence? • 1. Match quality • Want each pixel to find a good match in the other image • 2. Smoothness • If two pixels are adjacent, they should (usually) move about the same amount { { match cost smoothness cost “Potts model” L1 distance Graph Cut: convert multi-way cut into a seq. of binary cut
  • 34.
  • 35. Model Stereo Vision by MRF and Solution by Belief Propagation • Allows rich probabilistic models for images. • But built in a local, modular way. Learn local relationships, get global effects out. disparity images Disparity-disparity compatibility function neighboring disparity nodes local observationsImages-disparity compatibility function  FY i ii ji ji yxxx Z yxP ),(),(1 ),( , BELIEFS: Approximate posterior marginal distributions neighborhood of node i MESSAGES: Approximate sufficient statistics I. Belief Update (Message Product) II. Message Propagation (Convolution)
  • 36. Hierarchical Belief Propagation (HBP) and Constant Space HBP • HBP works in a coarse-to-fine manner; • (a) initialize the messages at the coarsest level to all zeros; • (b) apply BP at the coarsest level to iteratively refine the messages; • (c) use refined messages from the coarser level to initialize the messages for the next level. • Constant space HBP relies on that, only a small number of disparity levels and the corresponding message values are needed at each pixel to losslessly reconstruct the BP messages; • Apply the coarse-to-fine (CTF) scheme to both spatial and depth domain, i.e. gradually reduce the number of disparity levels as the messages propagate in CTF; • Re-computes the data term at each level (not each iter.); • Slower 9/8, but memory does not grow with max disp; • Energy computed only once at the finest level; • Gradually reduce the disparity levels in CTF. • The closer the messages are to the fixed points, the fewer the required disparity levels; Then, CSBP refines the messages hierarchically to approach the fixed points.
  • 37.
  • 38. Phase Matching in Frequency or WT Domain • Phase reflects the structure information of the signal and inhibit the HF noise effect; • Phase singularity is a problem; • Local phase information as the primitive; • Wavelet transform builds a hierarchical framework for multi-level coarse-to-fine processing; • Stereo matching (disparity) with phase separation and instantaneous frequency of signals: • Dynamic programming (DP) used for global optimization (occlusion handling) in stereo matching ; • Phase is not uniformly stable; • Smoothness constraints; • Discontinuities detection; • Multiple resolution solution: • 1. top level: control points with feature matching, apply DP; • 2. middle level: interpolation, apply DP; • 3. bottom level: sub-pixel precision. Local phaseDisparity Left/Right images
  • 39. Original Phase matching Phase matching with DP
  • 40. Disparity/Depth Refinement • Sub-pixel refinement: real valued disparities may be obtained by approximating the cost function locally using a parabola; • Left-Right Consistency Check: outlier detection by difference; • By computing a disparity for every pixel of the left image (left to right); • by computing a disparity for every pixel of the right image (right to left); • Segmentation can be used for outlier identification. • Occlusion filling: • Occlusion detection; • Background expansion; • Inpainting. • Discontinuities smoothing: • Bilateral filtering.
  • 41. Multiple Cameras Multi-baseline stereo use the third view to verify depth estimates
  • 42. Spatial Temporal Video Disparity Estimation • The important problem of extending to video is flickering; • Typical methods: • Spatial temporal consistency: smoothing in the space-time volume; • Post-processing of disparity maps by applying a median filter along the flow fields; • Spatial-temporal cost aggregation and solved by local/global optimization methods; • Joint disparity and flow estimation; • SGM-based, as an instance; • Modeled with MRF and solved by global optimization. • Scene flow: 2D motion field along with 1D disparity change field. • Dense method is very computationally expensive; • Sparse method relies on heavily initial sparse correspondence success.
  • 43. Sparse Coding Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection). Objective: Given a set of input data vectors learn a dictionary of bases such that: Each data vector is represented as a sparse linear combination of bases. Sparse: mostly zeros
  • 44. Predictive Sparse Coding Recall the objective function for sparse coding: Modify by adding a penalty for prediction error: ◦ Approximate the sparse code with an encoder PSD for hierarchical feature training ◦ Phase 1: train the first layer; ◦ Phase 2: use encoder + absolute value as 1st feature extractor ◦ Phase 3: train the second layer; ◦ Phase 4: use encoder + absolute value as 1st feature extractor ◦ Phase 5: train a supervised classifier on top layer; ◦ Phase 6: optionally train the whole network with supervised BP.
  • 45. Methods of Solving Sparse Coding Greedy methods: projecting the residual on some atom; ◦ Matching pursuit, orthogonal matching pursuit; ◦ The residual is updated iteratively in the direction of the atom; L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO); Gradient-based methods finding new search directions ◦ Projected gradient descent ◦ Coordinate descent Homotopy: a set of solutions indexed by a parameter (regularization) ◦ LARS (Least Angle Regression) First-order/proximal methods: generalized gradient descent ◦ solving the proximal operator efficiently ◦ soft-thresholding for the L1-norm ◦ Accelerated by the Nesterov optimal first-order method Iterative reweighting schemes ◦ L2-norm: Chartrand and Yin (2008) ◦ L1-norm: Candès et al. (2008)
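As a concrete instance of the proximal family above, a minimal ISTA sketch (the step size and iteration count are illustrative choices):

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the L1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, x, lam, n_iter=100):
    # Minimize 0.5 * ||x - D a||^2 + lam * ||a||_1 by proximal gradient descent.
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)         # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```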
  • 46. Strategy of Dictionary Selection • What D to use? • A fixed overcomplete set of bases: no adaptivity. • Steerable wavelets; • Bandlets, curvelets, contourlets; • DCT basis; • Gabor functions; • …. • Data-adaptive dictionary – learn from data; • K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). • An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary. • Other methods of dictionary learning: • non-negative matrix decompositions; • sparse PCA (sparse dictionaries); • fused-lasso regularizations (piecewise-constant dictionaries). • Extending the models: Sparsity + Self-similarity = Group Sparsity
  • 47. Learning Sparse Representation in Depth Maps • Sparse representations are learned from the Middlebury database disparity maps; • They are then exploited in a two-layer graphical model for inferring depth from stereo, by including a sparsity prior on the learned features; ◦ The first layer is solved using an existing MRF-based stereo matching algorithm; ◦ The second layer is solved using the non-stationary sparse coding algorithm.
  • 48. Learning Sparse Representation in Depth Maps (Figure: (c) graph cut; (d) GC + sparse coding.)
  • 49. Deep Learning Representation learning attempts to automatically learn good features or representations; Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features); Become effective via unsupervised pre-training + supervised fine tuning; ◦ Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks. Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer); Semi-supervised: structure of manifold assumption; ◦ labeled data is scarce and unlabeled data is abundant.
  • 50. Why Deep Learning? Supervised training of deep models (e.g. many-layered nets) is too hard (an optimization problem); ◦ Learn the prior from unlabeled data; Shallow models are not suited for learning high-level abstractions; ◦ Ensembles or forests do not learn features first; ◦ Graphical models could be deep nets, but mostly are not. Unsupervised learning could be “local learning”; ◦ Resembles boosting, with each layer being like a weak learner Learning is weak in directed graphical models with many hidden variables; ◦ Sparsity and regularizers. Traditional unsupervised learning methods cannot easily learn multiple levels of representation. ◦ Layer-wise unsupervised learning is the solution. Multi-task learning (transfer learning and self-taught learning); Other issues: scalability & parallelism with the burden of big data.
  • 51. Multi-Layer Neural Network A neural network = running several logistic regressions at the same time; ◦ Neuron = logistic regression or… Calculate error derivatives (gradients) to refine: back-propagate the error derivative through the model (the chain rule) ◦ Online learning: stochastic/incremental gradient descent ◦ Batch learning: conjugate gradient descent
  • 52. Problems in MLPs Multi-Layer Perceptrons (MLPs), a type of feed-forward neural network, were popular for decades. The gradient progressively gets more scattered ◦ Below the top few layers, the correction signal is minimal Gets stuck in local minima ◦ Especially when starting out far from ‘good’ regions (i.e., random initialization) In usual settings, only labeled data is used ◦ Almost all data is unlabeled! ◦ The human brain, by contrast, can learn from unlabeled data.
  • 53. Convolutional Neural Networks A CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input; ◦ local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling; ◦ Related to generative MRF/discriminative CRF: ◦ CNN = Field-of-Experts MRF = ML inference in CRF; ◦ Generates ‘patterns of patterns’ for pattern recognition. Each layer combines (merges, smooths) patches from previous layers ◦ Pooling/sampling (e.g., max or average) filter: compresses and smooths the data. ◦ Convolution filters: (translation invariance) unsupervised; ◦ Local contrast normalization: increases sparsity, improves optimization/invariance. C layers: convolutions; S layers: pooling/sampling
  • 54. Convolutional Neural Networks Convolutional networks are trainable multistage architectures composed of multiple stages; The input and output of each stage are sets of arrays called feature maps; At the output, each feature map represents a particular feature extracted at all locations of the input; Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer (see the sketch below); A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module; ◦ A fully connected layer: softmax transfer function for the posterior distribution. Filter: a trainable filter (kernel) in the filter bank connects an input feature map to an output feature map; Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function; ◦ In the rectified function, gi is a trainable gain parameter, possibly followed by a contrast normalization N; Feature pooling: treats each feature map separately -> a reduced-resolution output feature map; Supervised training is performed using a form of SGD to minimize the prediction error; ◦ Gradients are computed with the back-propagation method. Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning. (* is the discrete convolution operator)
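A minimal sketch of one such 3-layer stage (filter bank → nonlinearity → pooling), written in PyTorch for illustration; the channel counts and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

# One ConvNet stage: filter bank layer -> pointwise nonlinearity -> feature pooling.
stage = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5),  # filter bank
    nn.Tanh(),                                                 # nonlinearity
    nn.MaxPool2d(kernel_size=2, stride=2),                     # feature pooling
)

x = torch.randn(1, 1, 32, 32)   # a batch with one 32x32 grayscale image
y = stage(x)                    # feature maps of shape (1, 16, 14, 14)
```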
  • 56. LeNet (LeNet-5) A layered model composed of convolution and subsampling operations, followed by a holistic representation and ultimately a classifier for handwritten digits; Local receptive fields (5x5) with local connections; Output via an RBF function, one for each class, with 84 inputs each; Learning by Graph Transformer Networks (GTN);
  • 57. AlexNet A layered model composed of convolution and subsampling, followed by a holistic representation and, all in all, a landmark classifier; Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax; Fully-connected layers: linear classifiers/matrix multiplications; ReLUs are rectified-linear nonlinearities on layer outputs; they allow training several times faster; A local (contrast) normalization scheme aids generalization; Overlapping pooling is slightly less prone to overfitting; Data augmentation: artificially enlarge the dataset using label-preserving transformations; Dropout: set the output of each hidden neuron to zero with prob. 0.5; Trained by SGD with batch size 128, momentum 0.9, weight decay 0.0005.
  • 58. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is 253,440 – 186,624 – 64,896 – 64,896 – 43,264 – 4,096 – 4,096 – 1,000.
  • 59. MattNet Matthew Zeiler from the startup company “Clarifai”, winner of the ImageNet Classification task in 2013; Preprocessing: subtracting a per-pixel mean; Data augmentation: images are downsampled to 256 pixels and a random 224-pixel crop is taken out of the image and randomly flipped horizontally to provide more views of each example; SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting; 65M parameters trained for 12 days on a single Nvidia GPU; Visualization by layered DeconvNets: project the feature activations back to the input pixel space; ◦ Reveal the input stimuli exciting individual feature maps at any layer; ◦ Observe the evolution of features during training; ◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important; A DeconvNet is attached to each ConvNet layer; unpooling uses the locations of maxima to preserve structure; Multiple such models were averaged together to further boost performance; Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
  • 60. Architecture of an eight-layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). Layers 1–5: convolutional (e.g., layer 1: 96 filters, 7x7, stride of 2 in both x and y). Feature maps: (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized → 55x55 feature maps. Layers 6–7: fully connected, input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C = number of classes.
  • 61. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet reconstructs an approximate version of the convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
  • 62. Oxford VGG Net: Very Deep CNN Networks of increasing depth using an architecture with very small (3×3) convolution filters; ◦ Spatial pooling is carried out by 5 max-pooling layers; ◦ A stack of convolutional layers is followed by three fully-connected (FC) layers; ◦ All hidden layers are equipped with the ReLU rectification non-linearity; ◦ No Local Response Normalisation! Trained by optimising the multinomial logistic regression objective using SGD; Regularised by weight decay and by dropout regularisation for the first two fully-connected layers; The learning rate was initially set to 10−2, and then decreased by a factor of 10; For random initialisation, weights are sampled from a normal distribution; Derived from the publicly available C++ Caffe toolbox, allowing training and evaluation on multiple GPUs installed in a single system, and on full-size (uncropped) images at multiple scales; Combine the outputs of several models by averaging their soft-max class posteriors.
  • 63. The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv<receptive field size> - <number of channels>”. The ReLU activation function is not shown for brevity.
  • 64. GoogleNet Questions: ◦ Vanishing gradients? ◦ Exploding gradients? ◦ Tricky weight initialization? A deep convolutional neural network architecture codenamed Inception; ◦ Finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components; ◦ Judiciously applying dimension reduction and projections wherever the computational requirements would otherwise increase too much; Increasing the depth and width of the network while keeping the computational budget constant; ◦ Drawbacks: bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, and a dramatically increased use of computational resources; ◦ Solution: move from fully connected to sparsely connected architectures; analyze the correlation statistics of the activations of the last layer and cluster neurons with highly correlated outputs. ◦ Based on the well-known Hebbian principle: neurons that fire together, wire together; Trained using DistBelief, a distributed machine learning system.
  • 65. Inception module (with dimension reductions)
  • 66. (Figure: GoogLeNet — a network in a network in a network, with 9 Inception modules; legend: convolution, pooling, softmax, other.) Problems with training deep architectures?
  • 67. PReLU Networks at MSR A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit; ◦ PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk; ◦ Allows negative activations on the ReLU function with a control parameter a learned adaptively; ◦ Resolves the diminishing gradient problem for very deep neural networks (> 13 layers); Derive a robust initialization method better than “Xavier” (normalization) initialization; Also use a Spatial Pyramid Pooling (SPP) layer just before the fully connected layers; Can train extremely deep rectified models and investigate deeper or wider network architectures; ReLU vs. PReLU. Note: μ is momentum, ϵ is the learning rate.
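The PReLU activation itself is one line; a sketch (a is the learned slope on the negative side, with a = 0 recovering ReLU):

```python
import numpy as np

def prelu(y, a):
    # Identity for positive inputs; learned slope a for negative inputs.
    return np.where(y > 0, y, a * y)
```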
  • 68. PReLU Networks at MSR Performance: 4.94% top-5 test error on the ImageNet 2012 classification dataset; ◦ ILSVRC 2014 winner (GoogLeNet): 6.66%; Adopt the momentum method in BP training; Mostly initialized with random weights from a Gaussian distribution; Investigate the variance of the forward-pass responses in each layer; Consider a sufficient condition in BP: ◦ the gradient is not exponentially large/small.
  • 69. (Table: architectures of large PReLU network models.)
  • 70. Batch Normalization at Google Normalizing layer inputs for each mini-batch to handle saturating nonlinearities and covariate shift; ◦ Internal Covariate Shift (ICS): the change in the distribution of network activations due to the change in network parameters during training; ◦ Whitening to reduce ICS: a linear transform to obtain zero means, unit variances, and decorrelation; ◦ Fix the means and variances of layer inputs (instead of jointly whitening the features of inputs and outputs); ◦ The batch-normalizing transform is applied to activations over a mini-batch (see the sketch below); ◦ The BN transform is a differentiable transform introducing normalized activations into the network; Batch-normalized networks ◦ Unbiased variance estimate; ◦ Moving average; Batch-normalized ConvNets ◦ Effective mini-batch size; ◦ Per feature map, not per activation.
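A sketch of the batch-normalizing transform for one layer of fully connected activations (gamma and beta are the learned scale and shift; eps is for numerical stability):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the mini-batch,
    # then apply the learned scale and shift: y = gamma * x_hat + beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```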
  • 71. Batch Normalization at Google Reduces the dependence of gradients on the scale of the parameters or of their initial values; ◦ Prevents small changes from amplifying into larger, suboptimal changes in activations and gradients; ◦ Stabilizes the parameter growth and makes gradient propagation better behaved in BN training; In some cases, eliminates the need for dropout as a regularizer; ◦ In ImageNet Classification, remove local response normalization and reduce photometric distortions; ◦ Reaches 4.9% top-five validation error and 4.8% test error (human raters: 5.1%). Accelerating a BN network: ◦ Enable a larger learning rate and less care about initialization, which accelerates the training; ◦ Reduce L2 weight regularization; ◦ Accelerate the learning rate decay.
  • 72. Batch Normalization at Google (Figure: the batch-normalized Inception architecture.)
  • 73. Neural Turing Machines A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank; ◦ During each update cycle, the controller network receives inputs from an external environment and emits outputs in response; ◦ It also reads from and writes to a memory matrix via a set of parallel read and write heads. The weightings arise by combining two addressing mechanisms with complementary facilities; ◦ “content-based addressing”: focuses attention on locations based on the similarity between their current values and values emitted by the controller (see the formula below); ◦ “location-based addressing”: the content of a variable is arbitrary, but the variable still needs a recognizable name or address — by location, not by content; Controller network: feed-forward or recurrent.
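For reference, the content-based weighting in the NTM paper is a softmax over similarities between the emitted key $\mathbf{k}_t$ and each memory row $\mathbf{M}_t(i)$, sharpened by a strength $\beta_t$, with $K[\cdot,\cdot]$ the cosine similarity:

$$w_t^c(i)=\frac{\exp\big(\beta_t\,K[\mathbf{k}_t,\mathbf{M}_t(i)]\big)}{\sum_j \exp\big(\beta_t\,K[\mathbf{k}_t,\mathbf{M}_t(j)]\big)}$$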
  • 74. Neural Turing Machines Neural Turing Machine Architecture. Flow Diagram of the Addressing Mechanism.
  • 75. Highway Networks: Information Highway Ease gradient-based training of very deep networks; Allow unimpeded information flow across several layers on “information highways”; Use gating units to learn to regulate the flow of information through the network; A highway network consists of multiple blocks such that the ith block computes a block state Hi(x) and a transform gate output Ti(x); the carry gate is C = 1 - T (see the sketch below); Highway networks with hundreds of layers can be trained directly using SGD and with a variety of activation functions.
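A sketch of one highway block in PyTorch (the layer sizes and the negative gate-bias initialization, which biases the block toward carrying its input, are illustrative choices):

```python
import torch
import torch.nn as nn

class HighwayBlock(nn.Module):
    # y = H(x) * T(x) + x * C(x), with the carry gate C = 1 - T.
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)          # block state
        self.T = nn.Linear(dim, dim)          # transform gate
        nn.init.constant_(self.T.bias, -2.0)  # start close to "carry"

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return torch.relu(self.H(x)) * t + x * (1.0 - t)
```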
  • 76. Deep Residual Learning for Image Recognition Reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions; ◦ Denote the desired underlying mapping as H(x), then let the stacked nonlinear layers fit another mapping F(x) = H(x) - x; ◦ The formulation F(x)+x can be realized by a feed-forward NN with “shortcut connections” (as in “Highway Networks” and “Inception”); These residual networks are easier to optimize, and can gain accuracy from considerably increased depth; An ensemble of 152-layer residual nets achieves 3.57% error on the ImageNet test set; ◦ 224x224 crop, per-pixel mean subtracted, color augmentation, batch normalization; ◦ SGD with a mini-batch size of 256; the learning rate starts from 0.1 and is divided by 10 when the error plateaus; ◦ Weight decay of 0.0001 and a momentum of 0.9, no dropout;
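A sketch of a basic residual block (channel counts assumed; the stacked layers fit F(x), and the identity shortcut adds x back):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Fit F(x) = H(x) - x; the block outputs F(x) + x via an identity shortcut.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = torch.relu(self.bn1(self.conv1(x)))
        f = self.bn2(self.conv2(f))
        return torch.relu(f + x)   # shortcut connection
```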
  • 77. Rethink the Inception Architecture for Computer Vision Scale up networks in ways that aim at utilizing the added computation efficiently, by factorized convolutions and aggressive regularization; Design principles in Inception: ◦ Avoid representational bottlenecks, especially early in the network; ◦ Higher-dimensional representations are easier to process locally within a network; ◦ Spatial aggregation over lower-dimensional embeddings without loss in representational power; ◦ Balance the width and depth of the network. Factorizing convolutions with large filter size: asymmetric convolutions; Auxiliary classifiers: act as a regularizer, esp. when batch normalized or with dropout; Grid size reduction: two parallel stride-2 blocks (pooling and activation); Model regularization via label smoothing: a marginalized effect of dropout; Trained with TensorFlow: SGD with 50 replicas, batch size 32, for 100 epochs, learning rate of 0.045, exponential decay rate of 0.94, and a decay of 0.9.
  • 78. Rethink the Inception Architecture for Computer Vision Inception modules after the factorization of the nxn convolutions. In the proposed architecture, n = 7 is chosen for the 17x17 grid. Inception modules with expanded filter bank outputs. Inception modules where each 5x5 convolution is replaced by two 3x3 convolutions.
  • 79. Rethink the Inception Architecture for Computer Vision Auxiliary classifier on top of the last 17x17 layer. Inception module that reduces the grid size while expanding the filter banks: it is both cheap and avoids the representational bottleneck. The outline of the proposed network architecture.
  • 80. Belief Nets A belief net is a directed acyclic graph composed of stochastic variables. We can observe some of the variables and solve two problems: ◦ inference: infer the states of the unobserved variables; ◦ learning: adjust the interactions between variables to make the network more likely to generate the observed data. (Figure: stochastic hidden causes with visible effects.) Use nets composed of layers of stochastic variables with weighted connections.
  • 81. Boltzmann Machines An energy-based model associates an energy with each configuration of the stochastic variables of interest (for example, MRF, nearest neighbor); ◦ Learning means adjusting the shape properties of the (low) energy function; A Boltzmann machine is a stochastic recurrent model with hidden variables; ◦ Markov Chain Monte Carlo, i.e. MCMC sampling (appendix); A restricted Boltzmann machine is a special case: ◦ Only one layer of hidden units; ◦ factorization within each layer’s neurons/units (no connections in the same layer); Contrastive divergence: approximation of the gradient (appendix). (Figure: probability, energy function, and learning rule.)
  • 82. Deep Belief Networks A hybrid model: can be trained as a generative or discriminative model; Deep architecture: multiple layers (learn features layer by layer); ◦ Multi-layer learning is difficult in sigmoid belief networks. ◦ The top two layers have undirected connections, an RBM; ◦ Lower layers get top-down directed connections from the layers above; Unsupervised or self-taught pre-learning provides a good initialization; ◦ Greedy layer-wise unsupervised training of RBMs Supervised fine-tuning ◦ Generative: wake-sleep algorithm (up-down) ◦ Discriminative: back propagation (bottom-up)
  • 83. Deep Boltzmann Machine Learning internal representations that become increasingly complex; High-level representations are built from a large supply of unlabeled inputs; Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (an undirected graph); Generative fine-tuning: different from the DBN ◦ Positive and negative phases (appendix) Discriminative fine-tuning: the same as for the DBN ◦ Back propagation.
  • 84. Denoising Auto-Encoder A multilayer NN with target output = input; Reconstruction = decoder(encoder(input)) (see the sketch below); ◦ Perturbs the input x to a corrupted version; ◦ Randomly sets some of the coordinates of the input to zeros. ◦ Recovers x from the encoded perturbed data. Learns a vector field towards higher-probability regions; Pre-trained with a DBN, or regularized with perturbed training data; Minimizes a variational lower bound on a generative model; ◦ corresponds to regularized score matching on an RBM; PCA = linear manifold = linear auto-encoder; An auto-encoder learns the salient variations like a nonlinear PCA.
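A sketch of a one-layer denoising auto-encoder with masking noise (layer sizes and corruption rate are illustrative):

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    # Reconstruction = decoder(encoder(corrupted x)); the target is the clean x.
    def __init__(self, dim_in=784, dim_hidden=256, p_corrupt=0.3):
        super().__init__()
        self.p = p_corrupt
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, dim_in), nn.Sigmoid())

    def forward(self, x):
        mask = (torch.rand_like(x) > self.p).float()  # zero random coordinates
        return self.decoder(self.encoder(x * mask))

# Training step sketch: minimize reconstruction error against the *clean* input:
#   loss = nn.functional.mse_loss(model(x), x)
```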
  • 85. Stacked Denoising Auto-Encoder Stack many (possibly sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning ◦ Drop the decode layer each time ◦ Performs better than stacking RBMs; Supervised training on the last layer using the final features; (optionally) Supervised training on the entire network to fine-tune all weights of the neural net; Empirically not quite as accurate as DBNs.
  • 86. Stereopsis via Deep Learning • Learn a binocular cross-correlation model: use two quadrature pairs to detect disparity; ◦ The various filters correspond to phases, positions and frequencies; • Disparity as a latent variable: a pattern of matching filter responses; ◦ A joint probabilistic model over patch pairs and disparity is defined as a Boltzmann machine. ◦ Training amounts to finding the parameters that maximize the log probability of the pairs; ◦ An RBM is used for this case; ◦ During inference, each latent variable receives activity from exactly two products of matched filter responses (followed by pooling).
  • 87. Stereopsis via Deep Learning Example training data: rows 1–3 show rendered image planes for the left/right camera, where in row 3 the right camera has been rotated by 45° around the z axis. Images are rendered from the depth maps shown in row 4 and a randomly selected texture map from the Berkeley Segmentation Database. Example pairs from the NORB-cluttered dataset. Learned binocular filter pairs.
  • 88. Unsupervised Learning of Depth (and Motion) • Learning about the interrelations between images from multiple cameras, multiple frames of a video, or the combination of both; • Depth and motion in a feature-learning architecture based on the energy model; • A single-layer autoencoder model uses multiplicative interactions to detect synchrony, and a pooling layer, independently trained on the hidden responses, achieves content invariance; • Depth as a latent variable in learning: • Reconstruction error: • Contraction as regularization: • Complete objective function: • Note: there is no need for rectification, since the model can learn any transformation between the frames, not just horizontal shift
  • 89. Unsupervised Learning of Depth (and Motion) • Extension to stereo sequences: both depth and motion; ◦ Encoding depth; ◦ Encoding motion; ◦ Multiview disparity. (Figure: representations of depth, motion, and disparity, each computed from products of frame responses.)
  • 90. Unsupervised Learning of Depth (and Motion) Filters learned on stereo patch pairs from KITTI dataset. Example of a filter pair learned on sequences by the SAE-D model from the Hollywood3D dataset.
  • 91. Stereo Matching by CNN • Train a convolutional neural network on pairs of small image patches; • The network output is used to initialize the matching cost between a pair of patches; • Eight layers, L1 through L8, with a 9x9 gray patch as input and the matching cost as output; • The 1st layer is convolutional only; the other layers are fully connected. • Rectified linear units follow each layer except L8, but NO pooling! • Trained with SGD (batch size 128) on 194 image pairs, 45 million extracted examples. • Matching costs are combined between neighboring pixels with similar image intensities using cross-based cost aggregation; • Smoothness constraints are enforced by semi-global matching (SGM), and a left-right consistency check is used to detect and eliminate errors in occluded regions; • Sub-pixel enhancement and a median filter + bilateral filter -> final disparity map; • Achieves an error rate of 2.61% on the KITTI stereo database (previous best: 2.83%).
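A simplified sketch in the spirit of this patch network (not the exact published architecture: here the two 9x9 patches are simply stacked as two input channels, and the layer widths are assumptions):

```python
import torch
import torch.nn as nn

class MatchingCostNet(nn.Module):
    # Input: a pair of 9x9 grayscale patches; output: a scalar matching cost.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(2, 32, 5), nn.ReLU())  # conv layer only
        self.fc = nn.Sequential(                                   # fully connected
            nn.Linear(32 * 5 * 5, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 1),     # no ReLU after the last layer
        )

    def forward(self, left_patch, right_patch):   # each: (N, 1, 9, 9)
        x = torch.cat([left_patch, right_patch], dim=1)
        return self.fc(self.conv(x).flatten(1))
```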
  • 92. Stereo Matching by CNN (Figure: cross-based cost aggregation support region.)
  • 93. A Deep Embedding Model for Stereo Matching Costs This deep embedding model leverages appearance data to learn visual similarity relationships between corresponding image patches, and maps intensity values into an embedding feature space to measure pixel dissimilarities; Features are extracted from a pair of patches at different scales, followed by an inner product to obtain the matching scores; the scores from different scales are then merged into an ensemble. (Figure: the deployed network architecture of the testing model for deep embedding. Features are extracted in the two images only once each; the sliding-window style inner product can be grouped into a matrix operation.)
  • 94. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning A 3-step pipeline for the stereo matching problem and a highway network architecture for computing the matching cost at each possible disparity, based on multilevel weighted residual shortcuts, trained with a hybrid loss that supports multilevel comparison of image patches. A post-processing step employs a second deep convolutional neural network for pooling global information from multiple disparities. It outputs both the image disparity map, which replaces the conventional “winner takes all” strategy, and a confidence in the prediction. The confidence score is achieved by training the network with a reflective loss. The learned confidence is employed to better detect outliers in the refinement.
  • 95. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning The λ-ResMatch architecture of the matching cost network
  • 96. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning The Global disparity network model for representing disparity patches
  • 97. Efficient Deep Learning for Stereo Matching ◦ A matching network able to produce very accurate results in less than a second of GPU computation. A product layer simply computes the inner product between the two representations of a Siamese architecture (see the sketch below). ◦ Treat disparity estimation as multi-class classification, where the classes are all possible disparities. A Siamese network extracts marginal distributions over all possible disparities for each pixel. Four-layer Siamese network architecture. The code and data are online at: http://www.cs.toronto.edu/deepLowLevelVision.
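A sketch of the product-layer idea: shared (Siamese) features compared by an inner product at every candidate disparity (the feature widths are assumptions, and the wrap-around of torch.roll at the image border is ignored here):

```python
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 64, 3, padding=1))   # shared Siamese branch

def match_scores(left, right, max_disp):
    fl, fr = feat(left), feat(right)                    # (N, C, H, W) each
    scores = []
    for d in range(max_disp):
        fr_shift = torch.roll(fr, shifts=d, dims=3)     # align right features
        scores.append((fl * fr_shift).sum(dim=1))       # inner product over C
    return torch.stack(scores, dim=1)                   # (N, max_disp, H, W)
```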
  • 98. End-to-End Learning of Geometry and Context for Deep Stereo Regression A deep learning architecture for regressing disparity from a rectified pair of stereo images. Leverages knowledge of the problem’s geometry to form a cost volume using deep feature representations. Learns to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a differentiable soft argmin operation (below), which allows training end-to-end to sub-pixel accuracy without any additional post-processing or regularization.
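The soft argmin converts the predicted costs $c_d$ at each pixel into a differentiable, sub-pixel disparity estimate, with $\sigma(\cdot)$ the softmax taken over the disparity dimension:

$$\hat{d}=\sum_{d=0}^{D_{\max}} d\times\sigma(-c_d)$$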
  • 99. End-to-End Learning of Geometry and Context for Deep Stereo Regression End-to-end deep stereo regression architecture, GC-Net (Geometry and Context Network)
  • 100. End-to-End Training of Hybrid CNN-CRF Models for Stereo Convolutional neural networks (CNNs) + optimization-based approaches for stereo estimation; ◦ The optimization, posed as a conditional random field (CRF), takes local matching costs and consistency-enforcing (smoothness) costs as inputs, both estimated by CNN blocks. ◦ Inference in the CRF is based on a linear programming relaxation with a fixed number of iterations. Training end-to-end: in the discriminative formulation (structured SVM), the training is practically feasible. ◦ The optimization part efficiently replaces post-processing steps by a trainable, well-understood model. A CNN, called the Unary-CNN, computes features of the two images for each pixel. The features are compared using a correlation layer. The resulting matching cost volume becomes the unary cost of the CRF. The pairwise costs of the CRF are parametrized by edge weights, which can either follow a usual contrast-sensitive model or be estimated by the Pairwise-CNN.
  • 101. End-to-End Training of Hybrid CNN-CRF Models for Stereo (Figure: the cross-correlation of features φ0 and φ1; the CRF model optimizes the cost of disparity labelings from the matching costs and pairwise terms.)
  • 102. Appendix A: Depth from Single Image by Learning
  • 103. Learning-based Depth from Image Initial over-segmentation (super-pixels); A Markov Random Field (MRF) to infer each patch’s orientation and location from image features (texture, color and gradient); ◦ Connected, co-planar or colinear as priors; ◦ Occlusion boundary/fold indication; ◦ Multi-conditional learning, solved by a linear program; (Figure: MRF overlaid on “super-pixels”; occlusion/fold; coplanarity and colinearity.)
  • 105. Single Image Depth Estimation From Predicted Semantic Labels Semantic segmentation guides the 3D reconstruction; Works like holistic scene understanding: ◦ 1. Multi-class image labeling MRF for scene segmentation; ◦ 2. Depth estimation for each semantic class by learning (logistic regression); ◦ 3. Scene depth estimation by an MRF (pixel or super-pixel) with potentials (learned boosted decision tree classifiers) and geometric priors (horizon prediction, vertical objects), pixel smoothness, super-pixel soft connectivity, co-planarity and orientation. (Figure: semantically derived geometric constraints; smoothed per-pixel log-depth prior for each semantic class, with the horizon rotated to the center of the image.)
  • 106. (Figure: image; semantic overlay; ground truth; depth measurements.)
  • 107. Learning Depth from Examples Two similar images are likely to have similar 3D structure (depth). Nearest-neighbor (kNN) search: find the k image+depth pairs that are most similar to the query (histograms of oriented gradients as features); Depth fusion: median filtering of the k depth fields; Joint-bilateral depth filtering: smoothing of the median-fused depth.
  • 108. (Figure: K-NN query → depth fusion and smoothing → depth output. Note: depth (disparity) warping via SIFT flow to align with the query is omitted.)
  • 109. Depth Transfer for Monocular Video K-NN search for candidates of the query frames; Depth changes are gradual frame-to-frame; Moving objects are usually on the ground; Warped with SIFT flow and regularized with smoothness and prior terms; Is the computational cost worth it?
  • 112. Scalable Exemplar-Based Depth Transfer Form a basis (dictionary) over the RGB and depth spaces, and represent depth maps by a sparse linear combination of weights. A prediction function is estimated between weight vectors in the RGB and depth spaces to recover depth maps from query images. A final super-pixel post-processor aligns depth maps with occlusion boundaries, creating physically plausible results.
  • 113. Scalable Exemplar-Based Depth Transfer (Figure: images with similar global depth profiles are clustered together in 2D using RGB pairwise features (left) and sparse positive descriptors on depth (right), effective in grouping images with similar depth profiles together; a transformation T is estimated that maps points from one space to the other.)
  • 114. Learning to be a Depth Camera (Active Near-IR) • Use hybrid classification-regression forests to learn how to map from near-infrared intensity images to absolute, metric depth in real time; • Simplify the problem by dividing it into sub-problems in the first layer, and then apply models trained for these sub-problems in the second layer to solve the main problem efficiently; • Restrict the depths of the object to a certain range for significant simplification; • The first layer learns to infer a coarsely quantized depth range for each pixel, and optionally pools these predictions across all pixels to obtain a more reliable distribution over these depth ranges; • The second layer then applies one or more expert regressors trained specifically on the inferred depth ranges. • Note: the forests do not need to explicitly model scene illumination, surface geometry and reflectance, or complex inter-reflections, as required by traditional SFS methods.
  • 115. Learning to be a Depth Camera (Active Near-IR) • Comparable to high-quality consumer depth cameras with a reduced cost, power consumption, and form-factor.
  • 116. Learning to be a Depth Camera (Active Near-IR) • Applied to specific hand and face objects.
  • 117. Appendix B: Machine Learning and Optimization
  • 118. Graphical Models • Graphical Models: Powerful framework for representing dependency structure between random variables. • The joint probability distribution over a set of random variables. • The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables. • The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables. • Two type of graphical models: • Directed (Bayesian networks) • Undirected (Markov random fields, Boltzmann machines) • Hybrid graphical models that combine directed and undirected models, such as Deep Belief Networks, Hierarchical-Deep Models.
  • 119. Generative Model: MRF Random field: F={F1,F2,…,FM}, a family of random variables on a set S in which each Fi takes a value fi in a label set L. Markov random field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. ◦ Generative model for the joint probability p(x) ◦ potentials allow no direct probabilistic interpretation ◦ define potential functions Ψ on maximal cliques A ◦ map a joint assignment to a non-negative real number ◦ requires normalization An MRF is an undirected graphical model
  • 120. A flow network G(V, E) is defined as a fully connected directed graph where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0; The max-flow problem is to find the flow of maximum value on a flow network G; An s-t cut (or simply cut) of a flow network G is a partition of V into S and T = V-S, such that s is in S and t is in T; A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network; Methods for max-flow or min-cut: ◦ the Ford-Fulkerson method; ◦ the "Push-Relabel" method.
  • 121. Labeling is mostly solved as an energy minimization problem (see the energy below); Two common energy models: ◦ the Potts interaction energy model; ◦ the linear interaction energy model. The graph G contains two kinds of vertices: p-vertices and i-vertices; ◦ all the edges in the neighborhood N are called n-links; ◦ edges between the p-vertices and the i-vertices are called t-links. In the multiple-labeling case, the multi-way cut should leave each p-vertex connected to exactly one i-vertex; The minimum-cost multi-way cut will minimize the energy function, where the severed n-links correspond to the boundaries of the labeled vertices; Approximation algorithms to find this multi-way cut: ◦ the "alpha-expansion" algorithm; ◦ the "alpha-beta swap" algorithm.
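For reference, the standard labeling energy that these graph-cut methods minimize, with the Potts model as one choice of interaction term (notation assumed: data term $D_p$, neighborhood system $N$):

$$E(f)=\sum_{p} D_p(f_p)+\sum_{(p,q)\in N} V_{pq}(f_p,f_q),\qquad V^{\text{Potts}}_{pq}(f_p,f_q)=\lambda\,[f_p\neq f_q]$$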
  • 122. ◦ A simplified Bayes net: it propagates information throughout a graphical model via a series of messages passed between neighboring nodes, iteratively; it is likely to converge to a consensus that determines the marginal probabilities of all the variables; ◦ messages estimate the cost (or energy) of a configuration of a clique given all other cliques; the messages are then combined to compute a belief (marginal or maximum probability); Two types of BP methods: ◦ max-product; ◦ sum-product. BP provides the exact solution when there are no loops in the graph! It is equivalent to dynamic programming/Viterbi in these cases; Loopy belief propagation still provides an approximate (but often good) solution;
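A reference form of the max-product message update on a pairwise MRF (the sum-product variant replaces the max with a sum), with $\psi_{ij}$ the pairwise compatibility and $\phi_i$ the local evidence:

$$m_{i\to j}(x_j)\leftarrow\max_{x_i}\;\psi_{ij}(x_i,x_j)\,\phi_i(x_i)\prod_{k\in N(i)\setminus\{j\}} m_{k\to i}(x_i)$$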
  • 123. Generalized BP for pairwise MRFs ◦ Hidden variables xi and xj are connected through a compatibility function; ◦ Hidden variables xi are connected to observable variables yi by the local “evidence” function; The joint probability of {x} is given by the formula below. To improve inference, take into account higher-order interactions among the variables; ◦ An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes; ◦ This is the intuition behind Generalized Belief Propagation (GBP).
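Reconstructed from the definitions above (the standard pairwise-MRF form, with $Z$ the normalization constant):

$$P(\{x\},\{y\})=\frac{1}{Z}\prod_{(i,j)}\psi_{ij}(x_i,x_j)\prod_i\phi_i(x_i,y_i)$$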
  • 124. Stochastic Gradient Descent (SGD) • The general class of estimators that arise as minimizers of sums are called M-estimators; • Where are the stationary points of the likelihood function (the zeros of its derivative, the score function)? • Online gradient descent samples a subset of the summand functions at every step; • The true gradient is approximated by the gradient at a single example; • The training set is shuffled at each pass. • There is a compromise between the two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples (see the sketch below). • SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
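A minimal mini-batch SGD sketch (the gradient callback and hyper-parameters are placeholders; note the per-epoch shuffle):

```python
import numpy as np

def sgd(grad_fn, w0, data, lr=0.01, epochs=10, batch_size=32, seed=0):
    # grad_fn(w, batch) returns the mini-batch gradient; data is an array of examples.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # shuffle the training set
        for start in range(0, n, batch_size):
            batch = data[idx[start:start + batch_size]]
            w -= lr * grad_fn(w, batch)          # step along the mini-batch gradient
    return w
```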
  • 125. Back Propagation E(f(x0,w), y0) = −log f(x0,w)[y0], the negative log-likelihood of the target class y0.
  • 126. Variable Learning Rate Too large a learning rate ◦ causes oscillation in searching for the minimal point Too small a learning rate ◦ too slow convergence to the minimal point Adaptive learning rate ◦ At the beginning, the learning rate can be large while the current point is far from the optimal point; ◦ Gradually, the learning rate will decay as time goes by. Should not be too large or too small: ◦ annealing rate α(t) = α(0)/(1 + t/T) ◦ α(t) will eventually go to zero, but at the beginning it is almost a constant.
  • 129. Dropout and Maxout for Overfitting Dropout: set the output of each hidden neuron to zero w.p. 0.5 (see the sketch below). ◦ Motivation: combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging. ◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation. ◦ So every time an input is presented, the NN samples a different architecture, but all these architectures share weights. ◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units. ◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units. ◦ Without dropout, the network exhibits substantial overfitting. ◦ Dropout roughly doubles the number of iterations required to converge. Maxout takes the maximum across multiple feature maps;
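A sketch of the dropout mask, using the common “inverted” variant that rescales at training time so that no change is needed at test time (the original formulation instead halves the outgoing weights at test time):

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    # Training: zero each unit with probability p, rescale survivors by 1/(1-p).
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)
```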
  • 130. Weight Decay for Overfitting Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification; ◦ Weight decay works by rescaling the weights in the learning rule, while bias learning stays the same (see the update below); ◦ Prefers to learn small weights; large weights are allowed only if they improve the original cost function; ◦ A way of compromising between finding small weights and minimizing the original cost function; In a linear model, weight decay is equivalent to ridge (Tikhonov) regression; L1 regularization: the weights that are not really useful shrink by a constant amount toward zero; ◦ Acts like a form of feature selection; ◦ Makes the input filters cleaner and easier to interpret; L2 regularization penalizes large values strongly, while L1 shrinks all weights by the same constant amount and drives many exactly to zero; Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distribution is the posterior distribution over weights & hyper-parameters; Hybrid Monte Carlo: gradient and sampling.
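The rescaling interpretation mentioned above, written out ($\eta$ the learning rate, $\lambda$ the decay coefficient, $E_0$ the unregularized cost):

$$w\;\leftarrow\;(1-\eta\lambda)\,w\;-\;\eta\,\frac{\partial E_0}{\partial w}$$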
  • 131. Early Stopping for Overfitting Steps in early stopping: ◦ Divide the available data into training and validation sets. ◦ Use a large number of hidden units. ◦ Use very small random initial values. ◦ Use a slow learning rate. ◦ Compute the validation error rate periodically during training. ◦ Stop training when the validation error rate "starts to go up". Early stopping has several advantages: ◦ It is fast. ◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size. ◦ It requires only one major decision by the user: what proportion of validation cases to use. Practical issues in early stopping: ◦ How many cases do you assign to the training and validation sets? ◦ Do you split the data into training and validation sets randomly or by some systematic algorithm? ◦ How do you tell when the validation error rate "starts to go up"?
  • 132. MCMC Sampling for Optimization Markov chain: a stochastic process in which future states are independent of past states given the present state. ◦ A Markov chain will typically converge to a stable distribution. Markov Chain Monte Carlo: sampling using ‘local’ information ◦ Devise a Markov chain whose stationary distribution is the target. ◦ An ergodic MC must be aperiodic, irreducible, and positive recurrent. ◦ Monte Carlo integration to get the quantities of interest. Metropolis-Hastings method: sampling from a target distribution ◦ Create a Markov chain whose transition matrix does not depend on the normalization term. ◦ Make sure the chain has a stationary distribution and that it equals the target distribution (acceptance ratio); see the sketch below. ◦ After a sufficient number of iterations, the chain will converge to the stationary distribution. Gibbs sampling is a special case of M-H sampling. ◦ The Hammersley-Clifford theorem: get the joint distribution from the complete conditional distributions. Hybrid Monte Carlo: a gradient sub-step for each Markov chain step.
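A minimal random-walk Metropolis sketch for a 1-D target (the proposal scale and the log-density interface are illustrative; the normalization constant cancels in the acceptance ratio):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    # Propose x' ~ N(x, step^2); accept with probability min(1, p(x')/p(x)).
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        x_new = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_target(x_new) - log_target(x):
            x = x_new                 # accept the proposal
        samples.append(x)             # otherwise keep the current state
    return np.array(samples)
```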
  • 133. Mean Field for Optimization Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution; Mean field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (note: F is the disconnected graph); ◦ The density becomes a factorized product distribution in this sub-family. ◦ Objective: K-L divergence. Mean field is a structured variational approximation approach: ◦ Coordinate ascent (deterministic); Compared with stochastic approximation (sampling): ◦ Faster, but maybe not exact.
  • 134. Contrastive Divergence for RBMs Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a quicker way to learn RBMs; ◦ Contrastive divergence as the new objective; ◦ Take gradients and ignore a term which is usually very small. Steps (see the sketch below): ◦ Start with a training vector on the visible units. ◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling); CD learning is biased: it does not work as true gradient descent. Improved: persistent CD explores more modes of the distribution ◦ Rather than starting from data samples, begin sampling from the mode samples obtained from the last gradient update. ◦ Still suffers from divergence of the likelihood due to missing modes. Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
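A sketch of one CD-1 update for a binary RBM (biases are omitted for brevity; following common practice, probabilities rather than samples are used for the statistics where possible):

```python
import numpy as np

def cd1_update(W, v0, lr=0.1, rng=np.random.default_rng()):
    # W: (n_visible, n_hidden) weights; v0: (batch, n_visible) training vectors.
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    h0_prob = sigmoid(v0 @ W)                    # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1_prob = sigmoid(h0 @ W.T)                  # one Gibbs step: reconstruct
    h1_prob = sigmoid(v1_prob @ W)               # ... and re-infer hiddens
    # approximate gradient: data statistics minus reconstruction statistics
    return W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
```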
  • 135. “Wake-Sleep” Algorithm for DBN A pre-trained DBN is a generative model; Do a stochastic bottom-up pass (wake phase) ◦ Get samples from the factorial distribution (visible first, then generate hidden); ◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below. Do a few iterations of sampling in the top-level RBM ◦ Adjust the weights in the top-level RBM. Do a stochastic top-down pass (sleep phase) ◦ Get visible and hidden samples generated by the generative model using data coming from nowhere! ◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. ◦ Any guarantee of improvement? No! The “wake-sleep” algorithm tries to make the representation economical (in the spirit of Shannon’s coding theory).
  • 136. Greedy Layer-Wise Training Deep networks tend to have more local-minima problems than shallow networks during supervised training Train the first layer using unlabeled data ◦ Supervised or semi-supervised: use more unlabeled data. Freeze the first layer parameters and train the second layer Repeat this for as many layers as desired ◦ Build more robust features Use the outputs of the final layer to train the last supervised layer (leaving early weights frozen) Fine-tune the full network with a supervised approach; Avoids problems of training a deep net in a supervised fashion. ◦ Each layer gets full learning ◦ Helps with ineffective early-layer learning ◦ Helps with deep network local minima
  • 137. Why Greedy Layer-Wise Training Works? Take advantage of the unlabeled data; Regularization Hypothesis ◦ Pre-training is “constraining” parameters in a region relevant to unsupervised dataset; ◦ Better generalization (representations that better describe unlabeled data are more discriminative for labeled data) ; Optimization Hypothesis ◦ Unsupervised training initializes lower level parameters near localities of better minima than random initialization can. Only need fine tuning in the supervised learning stage.
  • 138. Two-Stage Pre-training in DBMs Pre-training in one stage ◦ Positive phase: clamp the observed units, sample the hidden units, using a variational approximation (mean-field) ◦ Negative phase: sample both observed and hidden units, using persistent sampling (stochastic approximation: MCMC) Pre-training in two stages ◦ Approximate a posterior distribution over the states of the hidden units (with a simpler directed deep model such as a DBN or stacked DAE); ◦ Train an RBM by updating the parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of the hidden units. ◦ Options (CAST, contrastive divergence, stochastic approximation…).