
Visual Detection, Recognition and Tracking with Deep Learning


computer vision, machine learning, object detection, object classification, face detection and recognition, object tracking, deep learning, pose estimation, image parsing, image captioning and description, deep residual learning, rethinking Inception, visual question answering.

Published in: Engineering


  1. 1. Yu Huang Sunnyvale, California Visual Detection, Recognition and Tracking with Deep Learning
  2. 2. Outline
  • Deep learning
    • Sparse coding
    • Deep models: CNN/NIN/RNN; DBN/DBM; stacked DAE
    • Model compression: dark knowledge/distilling the knowledge
  • Visual recognition
    • Sparse coding
    • Hierarchical feature learning
    • AlexNet/MattNet (ZFNet); VGG Net/GoogleNet
    • PReLU/batch normalization
    • Rethinking the Inception architecture
    • Deep residual learning
    • Dual Path Net/Squeeze-and-Excitation Net
  • Generic object detection
    • Deep MultiBox/OverFeat
    • R-CNN/Fast R-CNN/SPP Net/Faster R-CNN
    • DeepID-Net/YOLO/YOLO9000
    • DeepBox/R-FCN/LocNet/MS-CNN
    • FPN and RetinaNet
  • Pedestrian detection and pose estimation
  • Face detection, landmark detection, recognition
  • Text detection and recognition
  • Scene parsing/semantic segmentation
    • Multiscale feature learning
    • Simultaneous detection and segmentation
    • FCN/DeepLab/ParseNet/SegNet/Mask R-CNN
    • RefineNet/PSPNet/GCN
  • Visual object tracking
    • Learning representations
    • Fully convolutional networks
    • Deep regression networks
    • CNN-based object proposals
    • Recurrently target-attending tracking
  • Multiple object tracking (MOT)
    • Tracking the untrackable
    • Deep network flow
    • Recurrent neural networks
    • Quadruplet convolutional neural networks
  3. 3. Deep Learning
  • Representation learning attempts to automatically learn good features or representations;
  • Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (features);
  • Became effective via unsupervised pre-training + supervised fine-tuning;
    • Deep networks trained with back-propagation alone (without unsupervised pre-training) performed worse than shallow networks;
  • Deals with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised pre-training acts as a regularizer);
  • Semi-supervised setting: manifold-structure assumption; labeled data is scarce and unlabeled data is abundant.
  4. 4. Learning Feature Hierarchy with DL
  • Deep architectures can be more efficient in feature representation;
  • Natural derivation/abstraction from low-level structures to high-level structures;
  • Share the lower-level representations across multiple tasks (such as detection, recognition, segmentation).
  5. 5. Sparse Coding  Sparse coding (Olshausen & Field, 1996).  Originally developed to explain early visual processing in the brain (edge detection).  Objective: Given a set of input data vectors learn a dictionary of bases such that:  Each data vector is represented as a sparse linear combination of bases. Sparse: mostly zeros
  6. 6. Methods of Solving Sparse Coding
  • Greedy methods: project the residual onto some atom;
    • Matching pursuit, orthogonal matching pursuit;
  • L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);
    • The residual is updated iteratively in the direction of the atom;
  • Gradient-based methods, finding new search directions:
    • Projected gradient descent;
    • Coordinate descent;
  • Homotopy: a set of solutions indexed by a (regularization) parameter;
    • LARS (Least Angle Regression);
  • First-order/proximal methods: generalized gradient descent;
    • Solve the proximal operator efficiently: soft-thresholding for the L1-norm;
    • Accelerated by the Nesterov optimal first-order method;
  • Iterative reweighting schemes:
    • L2-norm: Chartrand and Yin (2008);
    • L1-norm: Candès et al. (2008).
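The proximal route above (generalized gradient descent with soft-thresholding for the L1-norm) can be sketched in a few lines. This is a minimal NumPy implementation of ISTA, my own illustration rather than code from the deck; the step size is taken from the dictionary's spectral norm:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the L1 norm: shrink each coordinate toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, y, lam, n_iter=200):
    # Iterative shrinkage-thresholding: minimize 0.5*||y - D a||^2 + lam*||a||_1.
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)           # gradient of the least-squares term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```

Nesterov acceleration (FISTA) adds a momentum step on top of exactly this update.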
  7. 7. Sparse Coding for Unsupervised Pre-training
  • SC learns the optimal dictionary that can be used to reconstruct a set of training samples under sparsity constraints on the feature vector;
  • Predictive Sparse Decomposition (PSD):
    • Train a simple regressor (or encoder) to approximate the sparse solution for all data in the training set;
    • Modify the objective by adding a penalty for prediction error: approximate the sparse coding with an encoder;
  • PSD for hierarchical feature training:
    • Phase 1: train the first layer;
    • Phase 2: use encoder + absolute value as the 1st feature extractor;
    • Phase 3: train the second layer;
    • Phase 4: use encoder + absolute value as the 2nd feature extractor;
    • Phase 5: train a supervised classifier on the top layer;
    • Phase 6: optionally train the whole network with supervised BP.
  8. 8. Convolutional Neural Networks
  • CNN is a special kind of multi-layer NN applied to 2-D arrays (usually images), based on spatially localized neural input;
  • Local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling;
  • Related to generative MRF/discriminative CRF: CNN = Field of Experts MRF = ML inference in CRF;
  • Generates 'patterns of patterns' for pattern recognition: each layer combines (merges, smooths) patches from previous layers;
  • Pooling/sampling (e.g., max or average) filters: compress and smooth the data;
  • Convolution filters (translation invariance): unsupervised;
  • Local contrast normalization: increases sparsity, improves optimization/invariance. (C layers: convolutions; S layers: pooling/sampling.)
  9. 9. Convolutional Neural Networks
  • Convolutional networks are trainable multistage architectures composed of multiple stages;
  • Input and output of each stage are sets of arrays called feature maps;
  • At the output, each feature map represents a particular feature extracted at all locations on the input;
  • Each stage is composed of a filter bank layer, a non-linearity layer, and a feature pooling layer;
  • A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
  • A fully connected layer: softmax transfer function for the posterior distribution;
  • Filter: a trainable filter (kernel) in the filter bank connects an input feature map to an output feature map (* is the discrete convolution operator);
  • Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function, where gi is a trainable gain parameter; it may be followed by a contrast normalization N;
  • Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;
  • Supervised training is performed using a form of SGD to minimize the prediction error; gradients are computed with the back-propagation method;
  • Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning.
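The filter-bank and pooling stages above reduce to two small operations; a minimal NumPy sketch (my own, not from the deck or any framework):

```python
import numpy as np

def conv2d_valid(x, k):
    # "Valid" 2-D convolution of image x with kernel k (cross-correlation form).
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    # Non-overlapping s x s max pooling: compress and smooth the feature map.
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))
```

A ConvNet stage is then just `max_pool(nonlinearity(conv2d_valid(x, k)))` repeated per filter in the bank.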
  10. 10. RNN: Recurrent Neural Network
  • A nonlinear dynamical system that maps sequences to sequences;
  • Parameterized with three weight matrices and three bias vectors;
  • RNNs are fundamentally difficult to train due to their nonlinear iterative nature;
    • The derivative of the loss function can be exponentially large with respect to the hidden activations;
    • The RNN also suffers from the vanishing gradient problem;
  • Back Propagation Through Time (BPTT):
    • "Unfold" the recurrent network in time by stacking identical copies of the RNN, redirecting connections within the network to obtain connections between subsequent copies;
    • Hard to use where online adaptation is required, as the entire time series must be processed;
  • Real-Time Recurrent Learning (RTRL) is a forward-pass-only algorithm that computes the derivatives of the RNN w.r.t. its parameters at each timestep;
    • Unlike BPTT, RTRL maintains the exact derivative of the loss so far at each timestep of the forward pass, without a backward pass or the need to store past hidden states;
    • However, RTRL's computational cost is prohibitive, and it needs more memory than BPTT as well;
  • Successful applications: speech recognition and handwriting recognition.
  11. 11. LSTM: Long Short-Term Memory
  • An RNN structure that elegantly addresses the vanishing gradient problem using "memory units";
  • These linear units have a self-connection of strength 1 and a pair of auxiliary "gating units" that control the flow of information to and from the unit;
  • Let N be the number of memory units of the LSTM; at each timestep t, the LSTM maintains a set of gate, cell and hidden-state vectors whose evolution is governed by the gating equations;
  • Since the forward pass of the LSTM is relatively intricate, the equations for the correct derivatives of the LSTM are highly complex, making them tedious to implement;
  • Note: Theano has an LSTM module.
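The equations referenced on this slide are not captured in the transcript; the standard LSTM formulation (notation may differ slightly from the original slide) is:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate update)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(memory cell)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The self-connection of strength 1 mentioned above is the $f_t \odot c_{t-1}$ term: when the forget gate saturates near 1, the cell carries its state (and gradient) forward unchanged.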
  12. 12. From Simple RNN (BPTT) to LSTM. Left: RNN with one fully connected hidden layer; Right: LSTM with memory blocks in the hidden layer.
  13. 13. Gated Recurrent Unit
  • GRU is a variation of the RNN that adaptively captures dependencies of different time scales with each recurrent unit;
  • GRU also uses gating units to modulate the flow of information inside the unit, but without separate memory cells;
  • GRU does not control the degree to which its state is exposed, but exposes the whole state each time;
  • Differences from LSTM:
    • GRU exposes its full content without control;
    • GRU controls the information flow from the previous activation when computing the new candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate);
  • Shared virtues with LSTM: the additive component of the update from t to t + 1;
    • Easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps;
    • Effectively creates shortcut paths that bypass multiple temporal steps, which allow the error to be back-propagated easily without vanishing too quickly.
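In standard notation (the deck shows these as figures not captured in the transcript), the GRU update is:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\!\big(W x_t + U (r_t \odot h_{t-1})\big) && \text{(candidate activation)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
```

The tied control mentioned above is visible in the last line: a single gate $z_t$ decides both how much of $h_{t-1}$ is kept and how much of $\tilde{h}_t$ is added, whereas the LSTM uses separate input and forget gates.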
  14. 14. Belief Nets  Belief net is directed acyclic graph composed of stochastic var.  Can observe some of the variables and solve two problems:  inference: Infer the states of the unobserved variables.  learning: Adjust the interactions between variables to more likely generate the observed data. stochastic hidden cause visible effect Use nets composed of layers of stochastic variables with weighted connections.
  15. 15. Boltzmann Machines
  • Energy-based models associate an energy to each configuration of the stochastic variables of interest (for example, MRF, nearest neighbor);
  • Learning means adjusting the shape of the energy function so that observed configurations get low energy;
  • A Boltzmann machine is a stochastic recurrent model with hidden variables;
    • Markov Chain Monte Carlo (MCMC) sampling for gradient estimates;
  • A restricted Boltzmann machine is a special case:
    • Only one layer of hidden units;
    • Factorization within each layer's neurons/units (no connections in the same layer);
    • Contrastive divergence: approximation of the gradient in RBMs.
  16. 16. Deep Belief Networks
  • A hybrid model: can be trained as a generative or discriminative model;
  • Deep architecture: multiple layers (learn features layer by layer);
    • Multi-layer learning is difficult in sigmoid belief networks;
  • The top two layers have undirected connections, an RBM;
  • Lower layers get top-down directed connections from the layers above;
  • Unsupervised or self-taught pre-training provides a good initialization;
    • Greedy layer-wise training of RBMs;
  • Supervised fine-tuning:
    • Generative: up-down wake-sleep algorithm;
    • Discriminative: bottom-up back propagation.
  17. 17. Deep Boltzmann Machine
  • Learns internal representations that become increasingly complex;
  • High-level representations are built from a large supply of unlabeled inputs;
  • Pre-training consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine (undirected graph);
  • Generative fine-tuning, different from the DBN, has two phases:
    • Positive: observed, sample hidden, using a variational approximation (mean-field);
    • Negative: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC);
  • Discriminative fine-tuning, the same as the DBN: back propagation.
  18. 18. Denoising Auto-Encoder  Multilayer NNs with target output=input;  Reconstruction=decoder(encoder(input));  Perturbs the input x to a corrupted version;  Randomly sets some of the coordinates of input to zeros.  Recover x from encoded perturbed data.  Learns a vector field towards higher probability regions;  Pre-trained with DBN or regularizer with perturbed training data;  Minimizes variational lower bound on a generative model;  Corresponds to regularized score matching on an RBM;  PCA=linear manifold=linear Auto Encoder;  Auto-encoder learns the salient variation like a nonlinear PCA.
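The corruption step above (randomly zeroing input coordinates) is simple to sketch; this NumPy snippet is my own illustration, with the noise fraction as a free choice:

```python
import numpy as np

def corrupt(x, p, rng):
    # Masking noise: randomly set a fraction p of the input coordinates to zero.
    keep = rng.random(x.shape) >= p
    return x * keep
```

The DAE is then trained to minimize a reconstruction loss such as ||x - decode(encode(corrupt(x, p, rng)))||^2, so the network must recover x from the perturbed version.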
  19. 19. Stacked Denoising Auto-Encoder  Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise learning  Drop the decode layer each time  Supervised training on the last layer using final features  Then supervised training on the entire network to fine- tune all weights  Performs better than stacking RBMs for unsupervised pre-training.  Empirically not quite as accurate as DBNs.
  20. 20. Model Compression
  • Compress the function learned by a complex model into a much smaller, faster model with comparable performance;
  • Given enough data, a NN can approximate any function to arbitrary precision;
  • Idea: instead of training the NN on the original (small) training set, use an ensemble to label a large unlabeled dataset and then train the NN on this much larger, ensemble-labeled data set, yielding a NN that predicts similarly to the ensemble and performs much better than a NN trained on the original training set;
  • Three methods to generate pseudo data:
    • RANDOM: generate data for each attribute independently from its marginal distribution;
    • NBE: estimate the joint density of the attributes using Naive Bayes and then generate samples from this joint distribution;
    • MUNGE: a new procedure that samples from a non-parametric estimate of the joint density.
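Of the three schemes, RANDOM is the simplest; a minimal NumPy sketch (my own, following the description above: each attribute is drawn independently from its empirical marginal, here by resampling each column with replacement):

```python
import numpy as np

def random_pseudo_data(X, n, rng):
    # RANDOM scheme: sample each attribute independently from its empirical
    # marginal distribution, ignoring correlations between attributes.
    n_rows, n_cols = X.shape
    cols = [rng.choice(X[:, j], size=n) for j in range(n_cols)]
    return np.stack(cols, axis=1)
```

The teacher ensemble then labels this pseudo data, and the student NN is trained on the (pseudo data, teacher label) pairs. NBE and MUNGE improve on this by modeling the joint distribution rather than the marginals.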
  21. 21. Do Deep Nets Really Need to be Deep?
  • Shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models;
  • In some cases the shallow neural nets can learn these deep functions using the same number of parameters as the original deep models;
  • The complexity of a learned model and the size of the representation best used to learn that model are different things;
  • Model compression works best when the unlabeled set is much larger than the train set, and when the unlabeled samples do not fall on the training points, where the teacher model is more likely to have overfit;
  • Train the student model to mimic a more accurate ensemble of deep NN models (the teacher).
  22. 22. Dark Knowledge
  • The ensemble implements a function from input to output;
    • Forget the models in the ensemble and the way they are parameterized, and focus on the function;
    • After learning the ensemble, we have our hands on the function;
    • Can we transfer the knowledge in the function into a single smaller model (distillation)?
  • Soft targets: a way to transfer the function;
    • If we have the ensemble, we can divide the averaged logits from the ensemble by a "temperature" to get a much softer distribution;
    • Softened outputs reveal the dark knowledge in the ensemble;
  • However, it works better to fit both the hard targets and the soft targets from the ensemble;
    • Down-weight the cross entropy with the hard targets;
  • Dropout is an efficient way to average many large NNs;
  • Organize the labels into a tree, and predict all of the labels on the path to the root, instead of just predicting a label.
  23. 23. Distilling the Knowledge in a Neural Network
  • Distill the knowledge in an ensemble of models into a single model;
    • Transfer from a cumbersome model to a small model more suitable for deployment;
  • An ensemble of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse;
    • These specialist models can be trained in parallel;
  • Use the class probabilities produced by the cumbersome model as "soft targets" for training the small model;
    • When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets;
  • When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and with a much higher learning rate;
  • Distillation: raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets.
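The temperature-softened targets and the combined loss can be sketched in NumPy. This is my own illustration (the T-squared factor on the soft term follows the distillation paper; all names and example values are illustrative):

```python
import numpy as np

def softened(logits, T):
    # Soft targets: divide the logits by temperature T before the softmax.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T, alpha):
    # Weighted sum of cross-entropy with the teacher's soft targets (at
    # temperature T, scaled by T^2) and down-weighted hard-target cross-entropy.
    p_teacher = softened(teacher_logits, T)
    p_student = softened(student_logits, T)
    soft_ce = -np.sum(p_teacher * np.log(p_student))
    hard_ce = -np.log(softened(student_logits, 1.0)[hard_label])
    return alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce
```

Raising T spreads probability mass over the wrong-but-plausible classes, which is exactly the "dark knowledge" the student learns from.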
  24. 24. FitNets by Thin Deep Nets: Extension of Distillation
  • Train a student that is deeper and thinner than the teacher, using the outputs and the intermediate representations learned by the teacher as hints to improve the training process and the final performance of the student; this compresses wide and shallower (but still deep) networks;
  • A hint is defined as the output of a teacher's hidden layer, used to guide the student's learning process;
  • Choose a hidden layer of the FitNet as the guided layer in the student, to learn from the teacher's hint layer;
  • Train the FitNet in a stage-wise fashion: hints as a form of regularization, the guided layer trained with a convolutional regressor, and a loss function based on the prediction error between the teacher's hint layer and the regressor over the guided layer.
  25. 25. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and Less Than 0.5MB Model Size
  • A small CNN architecture, SqueezeNet, achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters;
  • With model compression techniques, SqueezeNet can be compressed to less than 0.5MB (510x smaller than AlexNet);
  • 1. Replace 3x3 filters with 1x1 filters: given a budget of a certain number of convolution filters, make the majority of these filters 1x1;
  • 2. Decrease the number of input channels to 3x3 filters: squeeze layers;
  • 3. Downsample late in the network so that convolution layers have large activation maps: the height and width of these activation maps are controlled by (1) the input size and (2) the choice of layers in which to downsample (by setting stride > 1); the intuition is that large activation maps can lead to higher classification accuracy, all else held equal;
  • A Fire module is composed of a squeeze conv. layer (1x1 filters only) feeding into an expand layer that has a mix of 1x1 and 3x3 conv. filters.
  26. 26. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and Less Than 0.5MB Model Size. Macroarchitectural view of the SqueezeNet architecture. Left: SqueezeNet; Middle: SqueezeNet with simple bypass; Right: SqueezeNet with complex bypass.
  27. 27. SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and Less Than 0.5MB Model Size
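The parameter savings from strategies 1 and 2 can be checked with a back-of-envelope sketch (my own; the layer sizes below are illustrative and biases are ignored):

```python
def fire_params(c_in, s1x1, e1x1, e3x3):
    # Fire module: a 1x1 squeeze layer, then an expand layer that mixes
    # 1x1 and 3x3 filters.
    squeeze = c_in * s1x1                     # 1x1 squeeze filters
    expand = s1x1 * e1x1 + s1x1 * e3x3 * 9    # 1x1 and 3x3 expand filters
    return squeeze + expand

def plain_params(c_in, c_out):
    # A plain all-3x3 layer producing the same output width.
    return c_in * c_out * 9
```

For example, a Fire module with 96 input channels, 16 squeeze filters and 64+64 expand filters needs 11,776 weights, versus 110,592 for a plain 3x3 layer with the same 128 output channels, roughly a 9x reduction from one module.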
  28. 28. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
  • MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks;
  • Two global hyperparameters efficiently trade off between latency and accuracy:
    • Width multiplier: thinner models;
    • Resolution multiplier: reduced representation;
  • These hyperparameters allow the model builder to choose the right-sized model for their application based on the constraints of the problem;
  • MobileNet uses 3x3 depthwise separable convolutions, which use 8 to 9 times less computation than standard convolutions at only a small reduction in accuracy;
  • Applications and use cases include object detection, fine-grained classification, face attributes and large-scale geo-localization.
  29. 29. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications The convolutional filters in (a) are replaced by two layers: depthwise convolution in (b) and pointwise convolution in (c) to build a depthwise separable filter.
  30. 30. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications Left: Standard convolutional layer with batchnorm and ReLU. Right: Depthwise Separable convolutions with Depthwise and Pointwise layers followed by batchnorm and ReLU.
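The 8-9x figure follows directly from the cost ratio of the two layer types; a small sketch in the MobileNet paper's notation (D_K = kernel size, M = input channels, N = output channels, D_F = feature-map size), my own code:

```python
def conv_costs(dk, m, n, df):
    # Multiply-accumulate counts for one layer:
    # standard convolution vs. depthwise + pointwise (separable) convolution.
    standard = dk * dk * m * n * df * df
    separable = dk * dk * m * df * df + m * n * df * df
    return standard, separable
```

The ratio separable/standard equals 1/N + 1/(D_K^2), so for 3x3 kernels and a reasonably large N the separable form costs roughly 1/9 to 1/8 as much.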
  31. 31. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
  • The ShuffleNet architecture utilizes two operations, pointwise group convolution and channel shuffle, to reduce computation cost while maintaining accuracy.
  32. 32. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices ShuffleNet Units. a) bottleneck unit with depthwise convolution (DWConv); b) ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle; c) ShuffleNet unit with stride = 2.
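The channel shuffle operation itself is just a reshape-transpose-reshape; a NumPy sketch (my own, assuming NCHW layout):

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: (N, C, H, W). View the channels as (groups, C // groups), transpose
    # the two group axes, and flatten back, so that the next group convolution
    # sees channels drawn from every group.
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)
```

With 6 channels and 3 groups, channel order (0,1 | 2,3 | 4,5) becomes (0,2,4 | 1,3,5), which is what lets information flow between the groups of consecutive pointwise group convolutions.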
  33. 33. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
  34. 34. Sparse Coding for Visual Recognition
  • Descriptor layer: detect and locate features, extract corresponding descriptors (e.g. SIFT);
  • Code layer: code the descriptors;
    • Vector quantization (VQ): each code has only one non-zero element;
    • Soft-VQ: a small group of elements can be non-zero;
  • SPM layer: pool codes across subregions and average/normalize into a histogram. [Lazebnik et al., CVPR 2006; Yang et al., CVPR 2009]
  35. 35. Sparse Coding for Visual Recognition: Improving the Coding Step
  • Classifiers using these features need nonlinear kernels;
    • Lazebnik et al., CVPR 2006; Grauman & Darrell, JMLR 2007;
    • High computational complexity;
  • Idea: modify the coding step to produce feature representations that linear classifiers can use effectively;
    • Sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010];
    • Local coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010];
    • RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011];
    • Other feature learning algorithms.
  36. 36. Deep learning for visual recognition, detection, localization • Hand-crafted features: • Needs expert knowledge • Requires time-consuming hand-tuning • (Arguably) one limiting factor of computer vision systems • Key idea of feature learning: • Learn statistical structure or correlation from unlabeled data • The learned representations used as features in supervised and semi-supervised settings • Hierarchical feature learning • Deep architectures can be representationally efficient. • Natural progression from the low level to the high level structures. • Can share the lower-level representations for multiple tasks in computer vision.
  37. 37. Hierarchical Feature Learning
  38. 38. AlexNet
  • A layered model composed of convolution and subsampling operations, followed by a holistic representation; all in all, a landmark classifier;
  • Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax;
  • Fully-connected "FULL" layers: linear classifiers/matrix multiplications;
  • ReLUs are rectified-linear nonlinearities on layer outputs; they can be trained several times faster;
  • A local normalization scheme aids generalization;
  • Overlapping pooling is slightly less prone to overfitting;
  • Data augmentation: artificially enlarge the dataset using label-preserving transformations;
  • Dropout: set the output of each hidden neuron to zero with prob. 0.5;
  • Trained by SGD with batch size 128, momentum 0.9, weight decay 0.0005.
  39. 39. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.
  40. 40. MattNet
  • Matthew Zeiler, of the startup company Clarifai, winner of the ImageNet classification task in 2013;
  • Preprocessing: subtract a per-pixel mean;
  • Data augmentation: images are downsampled to 256 pixels, a random 224-pixel crop is taken and randomly flipped horizontally to provide more views of each example;
  • SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting;
  • 65M parameters trained for 12 days on a single Nvidia GPU;
  • Visualization by layered DeconvNets: project the feature activations back to the input pixel space;
    • Reveal the input stimuli that excite individual feature maps at any layer;
    • Observe the evolution of features during training;
  • Sensitivity analysis of the classifier output by occluding portions of the input to reveal which parts of the scene are important;
  • A DeconvNet is attached to each ConvNet layer; unpooling uses the locations of the maxima to preserve structure;
  • Multiple such models were averaged together to further boost performance;
  • Supervised pre-training as in AlexNet, then modified to get better performance (error rate 14.8%).
  41. 41. Architecture of an eight-layer ConvNet model. Input: a 224 by 224 crop of an image (with 3 color planes). Layers 1-5 are convolutional; layer 1 has 96 filters, 7x7, with a stride of 2 in both x and y. Feature maps are then: (i) passed through a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized, giving 55x55 feature maps. Layers 6-7: fully connected, taking the input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C being the number of classes.
  42. 42. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct approximate version of convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
  43. 43. Oxford VGG Net: Very Deep CNN
  • Networks of increasing depth using an architecture with very small (3x3) convolution filters;
  • Spatial pooling is carried out by 5 max-pooling layers;
  • A stack of convolutional layers is followed by three fully-connected (FC) layers;
  • All hidden layers are equipped with the ReLU rectification non-linearity;
  • No local response normalisation!
  • Trained by optimising the multinomial logistic regression objective using SGD;
  • Regularised by weight decay and by dropout for the first two fully-connected layers;
  • The learning rate was initially set to 10^-2, and then decreased by a factor of 10;
  • For random initialisation, the weights are sampled from a normal distribution;
  • Derived from the publicly available C++ Caffe toolbox, allowing training and evaluation on multiple GPUs installed in a single system, and on full-size (uncropped) images at multiple scales;
  • Combine the outputs of several models by averaging their soft-max class posteriors.
  44. 44. The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv<receptive field size> - <number of channels>”. The ReLU activation function is not shown for brevity.
  45. 45. GoogleNet
  • Questions: vanishing gradients? Exploding gradients? Tricky weight initialization?
  • A deep convolutional neural network architecture codenamed Inception;
  • Finds out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components;
  • Judiciously applies dimension reductions and projections wherever the computational requirements would otherwise increase too much;
  • Increases the depth and width of the network while keeping the computational budget constant;
  • Drawbacks of naively scaling up: a bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, and a dramatically increased use of computational resources;
  • Solution: move from fully connected to sparsely connected architectures; analyze the correlation statistics of the activations of the last layer and cluster neurons with highly correlated outputs;
    • Based on the well-known Hebbian principle: neurons that fire together, wire together;
  • Trained using DistBelief, a distributed machine learning system.
  46. 46. Inception module (with dimension reductions)
  47. 47. Problems with training deep architectures? A network in a network in a network: 9 Inception modules. (Figure legend: convolution, pooling, softmax, other.)
  48. 48. PReLU Networks at MSR
  • A Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit;
  • PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk;
  • Allows negative activations on the ReLU function, with a control parameter a learned adaptively;
  • Resolves the diminishing gradient problem for very deep neural networks (> 13 layers);
  • Derives a robust initialization method, better than the "Xavier" (normalization) initialization;
  • Also uses a Spatial Pyramid Pooling (SPP) layer just before the fully connected layers;
  • Can train extremely deep rectified models and investigate deeper or wider network architectures. (ReLU vs. PReLU. Note: μ is momentum, ϵ is learning rate.)
  49. 49. PReLU Networks at MSR  Performance: 4.94% top-5 test error on the ImageNet 2012 Classification dataset;  ILSVRC 2014 winner (GoogLeNet, 6.66%);  Adopt the momentum method in BP training;  Mostly initialized by random weights from Gaussian distr.;  Investigate the variance of the FP responses in each layer;  Consider a sufficient condition in BP:  The gradient is not exponentially large/small.
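A sketch of the activation and of the initialization standard deviation it implies. The Var[w] = 2 / ((1 + a^2) * fan_in) form is the rectifier-aware rule derived in the PReLU paper; the code itself is my own illustration:

```python
import numpy as np

def prelu(x, a):
    # PReLU: identity for positive inputs, learned slope a for negative inputs
    # (a = 0 recovers ReLU; a fixed and small recovers leaky ReLU).
    return np.where(x > 0, x, a * x)

def msra_std(fan_in, a=0.0):
    # Rectifier-aware ("MSRA"/He) initialization: Var[w] = 2 / ((1 + a^2) * fan_in),
    # chosen so the variance of forward responses stays stable layer by layer.
    return np.sqrt(2.0 / ((1.0 + a * a) * fan_in))
```

Weights are then drawn from a zero-mean Gaussian with this standard deviation, which is what allows the very deep (> 13-layer) rectified models above to start training at all.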
  50. 50. Architectures of large models PReLU Networks
  51. 51. Batch Normalization at Google  Normalizing layer inputs for each mini-batch to handle saturating nonlinearities and covariate shift;  Internal Covariate Shift (ICS): the change in the distribution of network activations due to the change in network parameters during training;  Whitening to reduce ICS: linear transform to have zero means and unit variances, and decorrelated;  Fix the means and variance of layer inputs (instead of whitening jointly the features in both I/O);  Batch normalizing transform applied for activation over a mini-batch;  BN transform is differentiable transform introducing normalized activations into the network;  Batch normalized networks  Unbiased variance estimate;  Moving average;  Batch normalized ConvNets  Effective mini-batch size;  Per feature, not per activation.
  52. 52. Batch Normalization at Google  Reduce the dependence of gradients on the scale of the parameters or of the initial values;  Prevent small changes from amplifying into larger and suboptimal changes in activation in gradients;  Stabilize the parameter growth and make gradient propagation better behaved in BN training;  In some cases, eliminate the need of dropout as a regularizer;  In ImageNet Classification, remove local response normalization and reduce photometric distortions;  Reach 4.9% in top-five validation error and 4.8% test error (human raters only 5.1%).  Accelerating BN network:  Enable larger learning rate and less care about initialization, which accelerates the training;  Reduce L2 weight regularization;  Accelerate the learning rate decay.
  53. 53. Batch Normalization at Google Inception architecture
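The BN transform for one mini-batch, sketched in NumPy (my own illustration; this is training-mode normalization, while inference uses the moving averages noted above):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # BN transform: normalize each feature over the mini-batch to zero mean
    # and unit variance, then scale and shift with learnable gamma, beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

For ConvNets, as the slide notes, the statistics are computed per feature map (over batch and spatial positions) rather than per activation.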
  54. 54. Neural Turing Machines  A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank;  During each update cycle, the controller network receives inputs from an external environment and emits outputs in response;  It also reads to and writes from a memory matrix via a set of parallel read and write heads.  These weightings arise by combining two addressing mechanisms with complementary facilities;  “content-based addressing”: focuses attention on locations based on the similarity between their current values and values emitted by the controller;  “location-based addressing”: the content of a variable is arbitrary, but the variable still needs a recognizable name or addresses, by location, not by content;  Controller network: feed forward or recurrent.
  55. 55. Neural Turing Machines Neural Turing Machine Architecture. Flow Diagram of the Addressing Mechanism.
56. 56. Highway Networks: Information Highway  Ease gradient-based training of very deep networks;  Allow unimpeded info. flow across several layers on information highways;  Use gating units to learn regulating the flow of info. through a network;  A highway network consists of multiple blocks such that the ith block computes a block state Hi(x) and transform gate output Ti(x);  Highway networks with hundreds of layers can be trained directly using SGD and with a variety of activation functions. The transform gate T and the carry gate C = 1 - T control how much of the input is transformed versus carried through.
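A minimal sketch of one highway block with the tied carry gate C = 1 - T; the weight matrices and biases here are hypothetical placeholders, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_block(x, W_h, b_h, W_t, b_t):
    """One highway block: y = H(x)*T(x) + x*(1 - T(x)).

    H is the block's nonlinear transform, T the transform gate,
    and the carry gate is tied as C = 1 - T.
    """
    H = np.tanh(x @ W_h + b_h)        # block state H(x)
    T = sigmoid(x @ W_t + b_t)        # transform gate T(x) in (0, 1)
    return H * T + x * (1.0 - T)      # gated mix of transform and carry

x = np.random.randn(2, 8)
# A strongly negative gate bias drives T ~ 0, so the block carries x through
# unchanged -- this is what lets gradients flow across many layers.
W = np.zeros((8, 8))
y = highway_block(x, W, 0.0, W, -20.0)
print(np.allclose(y, x, atol=1e-6))
```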
57. 57. Deep Residual Learning for Image Recognition  Reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions;  Denote the desired underlying mapping as H(x), then let the stacked nonlinear layers fit another mapping of F(x) = H(x) - x;  The formulation of F(x)+x can be realized by feed forward NN with “shortcut connections” (such as “Highway Network” and “Inception”);  These residual networks are easier to optimize, and can gain accuracy from considerably increased depth;  An ensemble of 152-layer residual nets achieves 3.57% error on the ImageNet test set;  224x224 crop, per-pixel mean subtracted, color augmentation, batch normalization;  SGD with a mini-batch size of 256, learning rate starting at 0.1, divided by 10 at error plateaus;  Weight decay of 0.0001 and a momentum of 0.9, no drop-out;
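The residual reformulation F(x) + x can be illustrated with a toy two-layer block; the dense weights here are placeholders standing in for the paper's convolutions:

```python
import numpy as np

def residual_block(x, W1, W2, relu=lambda z: np.maximum(z, 0.0)):
    """Basic residual block: y = relu(F(x) + x).

    The stacked layers fit the residual F(x) = H(x) - x; the identity
    shortcut adds x back. Shapes are kept equal so the shortcut is identity.
    """
    F = relu(x @ W1) @ W2          # residual function F(x): two weight layers
    return relu(F + x)             # shortcut connection adds the input

x = np.abs(np.random.randn(3, 4))   # keep inputs non-negative for the check
zeros = np.zeros((4, 4))
# With zero weights F(x) = 0, so the block reduces to the identity on x >= 0:
# driving a residual to zero is easier than fitting an identity from scratch.
print(np.allclose(residual_block(x, zeros, zeros), x))
```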
58. 58. Deep Residual Learning for Image Recognition Residual learning: a building block Example network architectures for ImageNet A deeper residual function F for ImageNet
59. 59. Rethink Inception Architecture for Computer Vision  Scale up networks in ways that aim at utilizing the added computation efficiently by factorized convolutions and aggressive regularization;  Design principles in Inception:  Avoid representational bottlenecks, especially early in the network;  Higher dimensional representations are easier to process locally within a network;  Spatial aggregation over lower dim embeddings w/o loss in representational power;  Balance the width and depth of the network.  Factorizing convolutions with large filter size: asymmetric convolutions;  Auxiliary classifiers: act as regularizer, esp. batch normalized or dropout;  Grid size reduction: two parallel stride 2 blocks (pooling and activation) ;  Model regularization via label smoothing: marginalized effect of dropout;  Trained with Tensorflow: SGD with 50 replicas, batch size 32 for 100 epochs, learning rate of 0.045, exponential rate of 0.94, a decay of 0.9.
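Of the regularizers listed above, label smoothing is the easiest to make concrete; a sketch of the smoothed target distribution q'(k) = (1 - eps) * one_hot(y) + eps / K:

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Label-smoothing regularization: mix the one-hot ground-truth
    distribution with a uniform prior over the K classes."""
    onehot = np.eye(num_classes)[y]           # (batch, K) one-hot targets
    return (1.0 - eps) * onehot + eps / num_classes

q = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
print(q.round(4).tolist())  # [[0.025, 0.025, 0.925, 0.025]]
```

The target class keeps most of the probability mass (0.9 + 0.025 here) but the model is no longer pushed toward infinitely confident logits, which is the regularizing effect.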
60. 60. Rethink Inception Architecture for Computer Vision Inception modules after the factorization of the nxn convolutions. In the proposed architecture, it chooses n = 7 for the 17x17 grid. Inception modules with expanded filter bank outputs. Inception modules where each 5x5 convolution is replaced by two 3x3 convolutions.
61. 61. Rethink Inception Architecture for Computer Vision Auxiliary classifier on top of the last 17x17 layer Inception module that reduces the grid-size while expanding the filter banks. It is both cheap and avoids the representational bottleneck. The outline of the proposed network architecture
62. 62. Densely Connected Convolutional Networks  Dense Convolutional Network (DenseNet), connects each layer to every other layer in a feed-forward fashion.  Whereas traditional convol. nets with L layers have L connections—one between each layer and its subsequent layer—DenseNet has L(L+1)/2 direct connections.  For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers.  DenseNets’ advantages: alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.  DenseNets obtain significant improvements over the SoA on testing cases, whilst requiring less memory and computation to achieve high performance.
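The L(L+1)/2 connectivity pattern can be sketched with toy layers; each "layer" below is a stand-in for a BN-ReLU-conv composite producing k new feature channels (the growth rate):

```python
import numpy as np

def dense_block(x, layer_fns):
    """DenseNet connectivity: each layer receives the concatenation of all
    preceding feature maps and contributes its own output to all later ones."""
    features = [x]
    for fn in layer_fns:
        out = fn(np.concatenate(features, axis=-1))  # input: all prior features
        features.append(out)
    return np.concatenate(features, axis=-1)

# Toy layers: each maps its concatenated input to k=2 new channels.
k = 2
layers = [lambda z: z[..., :k] * 0.0 + 1.0 for _ in range(3)]
out = dense_block(np.zeros((1, 4)), layers)
print(out.shape)  # 4 input channels + 3 layers * growth rate 2 -> (1, 10)
```

Channel count grows linearly with depth (input + L*k), which is why DenseNets stay parameter-efficient despite the quadratic number of connections.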
  63. 63. Densely Connected Convolutional Networks
  64. 64. Densely Connected Convolutional Networks
  65. 65. Memory-Efficient Implementation of DenseNets  The DenseNet architecture is highly computationally efficient as a result of feature reuse.  A naïve DenseNet implementation can require a significant amount of GPU memory: If not properly managed, pre-activation batch normalization and contiguous convolution operations can produce feature maps that grow quadratically with network depth.  Reduce the memory consumption of DenseNets during training, from quadratic to linear, then without GPU memory bottleneck, train extremely deep DenseNets.  Networks with 14M parameters can be trained on a single GPU, up from 4M.  A 264-layer DenseNet (73M parameters) can be trained on a single workstation with 8 NVIDIA Tesla M40 GPUs.
  66. 66. Memory-Efficient Implementation of DenseNets DenseNet layer forward pass. The efficient implementation stores the output of the concatenation, batch normalization, and ReLU layers in temporary storage buffers, whereas the original implementation allocates new memory.
67. 67. Dual Path Networks  A simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a topology of connection paths internally.  Within the HORNN (higher order recurrent neural network) framework, ResNet enables feature re-usage while DenseNet enables new features exploration which are both important for learning good representations.  Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures.  On the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt- 101(64 × 4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the SoA single model performance with more than 3 times faster training speed.
  68. 68. Dual Path Networks The topological relations of different types of neural networks. (a) and (b) show relations between residual networks and RNN; (c) and (d) show relations between densely connected networks and higher order recurrent neural network (HORNN). The symbol “z−1 ” denotes a time-delay unit; “⊕” denotes the element-wise summation; “I(·)” denotes an identity mapping function.
69. 69. Dual Path Networks  Dual path architecture shares feature extractions across all blocks to enjoy the benefits of reusing common features with low redundancy, while still retaining a densely connected path that gives the network more flexibility in learning new features. Figure labels: the densely connected path that enables exploring new features; the residual path that enables common feature re-usage; the dual path that integrates them and feeds them to the last transformation function; the final transformation function that generates the current state used for making the next mapping/prediction.
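A shape-level sketch of the dual path update, assuming hypothetical transformation functions f_res and f_dense: the residual path keeps a fixed width via addition (feature re-usage), while the densely connected path grows by +k per block (new feature exploration):

```python
import numpy as np

def dpn_block(res_state, dense_state, f_res, f_dense):
    """One dual-path update. Both paths share the same input features;
    f_res / f_dense are placeholder transformation functions."""
    shared = np.concatenate([res_state, dense_state], axis=-1)   # shared input
    res_state = res_state + f_res(shared)                        # residual path: add
    dense_state = np.concatenate([dense_state, f_dense(shared)], axis=-1)
    return res_state, dense_state                                # dense path: concat

r, d = np.zeros((1, 8)), np.zeros((1, 4))
f_res = lambda z: np.ones((1, 8))     # toy transform on the residual path
f_dense = lambda z: np.ones((1, 2))   # width increment (+k) on the dense path
for _ in range(3):
    r, d = dpn_block(r, d, f_res, f_dense)
print(r.shape, d.shape)  # residual width fixed; dense width grows by k per block
```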
  70. 70. Dual Path Networks (a) ResNet. (b) DenseNet, where each layer can access the outputs of all previous micro-blocks. Here, a 1 × 1 convolutional layer is added for consistency with the micro-block design in (a). (c) By sharing the first 1 × 1 connection of the same output across layers in (b), the DenseNet degenerates to a ResNet. (d) DPN. (e) An equivalent form of (d).
  71. 71. Dual Path Networks Table: Architecture and complexity comparison of DPNs and other baseline methods: DenseNet and ResNeXt. The symbol (+k) denotes the width increment on the densely connected path.
  72. 72. Squeeze-and-Excitation (SE) Networks  Explicitly model channel-interdependencies within modules;  Feature recalibration: Selectively enhance useful features and suppress less useful ones;
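The squeeze (global average pooling) and excitation (bottleneck gating, then per-channel rescaling) steps can be sketched as follows; the bottleneck weights are placeholders for the learned fully connected layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, W1, W2):
    """Squeeze-and-Excitation: squeeze spatial dims to a channel descriptor,
    excite via a two-layer bottleneck gate, then recalibrate each channel.

    x: (H, W, C) feature map; W1: (C, C//r), W2: (C//r, C) with reduction r.
    """
    z = x.mean(axis=(0, 1))                    # squeeze: (C,) channel descriptor
    s = sigmoid(np.maximum(z @ W1, 0.0) @ W2)  # excitation: per-channel gates in (0, 1)
    return x * s                               # recalibration: scale each channel

C, r = 8, 2
x = np.random.rand(4, 4, C)
y = se_block(x, np.zeros((C, C // r)), np.zeros((C // r, C)))
# Zero gating weights give s = sigmoid(0) = 0.5: every channel halved.
print(np.allclose(y, 0.5 * x))
```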
  73. 73. Squeeze-and-Excitation (SE) Networks
74. 74. Deep Learning for Generic Object Detection • Predicting the bounding box of multiple objects by DL-based regression (GoogleNet); • Deep Multibox method; • Overfeat: sliding window-based detection and localization by deep NN; • R-CNN: region-based proposal + CNN feature (cuda-convnet)-based detection by SVM; • SPP (spatial pyramid pooling) Net: extract window wise features from region of feature maps; • DeepID: selective search + R-CNN (Clarifai-fast); • Deformation constraint pooling layer.
75. 75. OverFeat: Integrated Framework with CNN ● Multi-scale and sliding window for classification, localization and detection with CNN; ● Classification: similar to AlexNet; ■ Use the same fixed input size approach as AlexNet, no contrast norm., non-overlapping pooling; ■ SGD with decreasing learning rate, momentum, weight decay and dropout; ■ A feature extractor named “OverFeat” with two models: a fast one and an accurate one; ■ multi-view (4 corners + 1 center views + flip = 10 views); ■ fast and low memory footprint important to train bigger models; ● Localization: regression predicting coordinates of boundary boxes; ■ inputs: 256x5x5 (right after last pooling); ■ top-left, bottom-right, center, height/width (center does not depend on scale); ■ fancier (similar to Yann’s face pose estimation); ● Detection: training with BG to avoid False Pos., trade-off between pos./neg. accuracy.
  76. 76. OverFeat: Integrated Framework with CNN (a): 20 pixel unpooled layer 5 feature map. (b): max pooling over non-overlapping 3 pixel groups, using offsets of Δ= {0, 1, 2} pixels (red, green, blue respectively). (c): The resulting 6 pixel pooled maps for different Δ. (d): 5 pixel classifier (layers 6,7) is applied in sliding window fashion to pooled maps, yielding 2 pixel by C maps for each Δ.(e): reshaped into 6 pixel by C output maps. Single (top)/Multiple(bottom) output in detection Application example of regression network
  77. 77. R-CNN: Regions with CNN Features  A framework for object detection with ConvNets;  One can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects;  Regions with CNN detection approach:  generates ~2000 category-independent regions for the input image,  extracts a fixed-length feature vector from each region using a CNN (built on cuda-convnet);  classifies each region with category-specific linear SVM.  R-CNN outperforms OverFeat, with a mAP = 31.4% vs 24.3%.  Training: train feature extraction CNN on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL);  Pre-training: Train ImageNet  Replace last layer with FC layer to N+1 outputs (N classes + 1 “background”; VOC N=20, ILSVRC N=200 )  When labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
  78. 78. • Region detection  2000 regions; • Region cropped and scaled to [227 x 227]  feature extraction with ImageNet: • 5 convolutional layers + 2FC  4096 features; • SVM for 200 classes; • Greedy non-maximum suppression for each class: rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold; R-CNN: Regions with CNN Features
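The greedy class-wise non-maximum suppression step described above can be sketched as:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS as used per class in R-CNN: reject a region whose IoU with
    a higher-scoring kept region exceeds the learned threshold.
    boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]        # candidates by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top box with the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i + 1]) + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thresh]  # drop overlapping lower-scored boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]: box 1 overlaps box 0
```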
  79. 79. DeepID-Net: deformable CNNs for Generic Object Detection  Bounding box proposal by selective search;  Bounding box rejection;  Pre-train a deep model: RCNN (Classification+Detection) with Clarifai-fast;  Pre-train on image-level annotation with 1000 classes;  Fine-tune on object-level annotation with 200 classes;  Gap: classification vs. detection, 1000 vs. 200;  A deformation constrained pooling layer: even for repeated patterns;  Modeling part detectors: different parts have different sizes;  Context modeling and model averaging;  Bounding box regression.
  80. 80. DeepID-Net: deformable CNNs for Generic Object Detection
81. 81. SPP-Net: Spatial Pyramid Pooling CNN  Introduce a spatial pyramid pooling layer to replace the pooling layer: on the convolutional layer;  Adaptively sized pooling on shared conv feature maps;  Outperform Bag of Words in keeping spatial information;  Generate a fixed-length representation regardless of image size/scale;  Use multi-level spatial bins, robust to object deformations. Training • Size augmentation: • Imagenet: 224x224  180x180 • Horizontal flipping • Color altering • Dropout with 2 last FC layers • Learning rate: • Initialize as 0.01; divide by 10 when error plateaus ImageNet Detection • Find 2000 candidate windows /~ R-CNN / • Extract the feature maps from the entire image only once (possibly at multiple scales) /~ Overfeat/. • Apply the spatial pyramid pooling on each candidate window of the feature, which maps each window to a fixed-length representation • Then 2 Fully Connected layers • SVM • ~170x faster than R-CNN
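A sketch of the spatial pyramid pooling layer itself, here with a 1x1/2x2/4x4 pyramid (the level set is illustrative; the paper uses its own bin configuration):

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Spatial pyramid pooling: max-pool a conv feature map into fixed grids
    and concatenate, giving a fixed-length vector regardless of input size.
    fmap: (H, W, C) feature map."""
    H, W, C = fmap.shape
    out = []
    for n in levels:
        # Split rows/cols into n roughly equal bins and max-pool each bin.
        for rows in np.array_split(np.arange(H), n):
            for cols in np.array_split(np.arange(W), n):
                out.append(fmap[np.ix_(rows, cols)].max(axis=(0, 1)))
    return np.concatenate(out)   # length (1 + 4 + 16) * C, independent of H, W

for size in (13, 17):  # different input sizes, same output length
    v = spatial_pyramid_pool(np.random.rand(size, size, 3))
    print(v.shape)     # (63,)
```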
  82. 82. A network structure with a spatial pyramid pooling layer Pooling features from arbitrary windows on feature maps SPP-Net: Spatial Pyramid Pooling CNN
  83. 83. Fast R-CNN for Object Detection • Simultaneously learns to classify object proposals and refine their spatial locations in a multi-task loss; • Pre-training: max pooling -> RoI; a final FCL as softmax -> two FCLs as softmax + bounding box regressor; input as images + RoIs; • Fine-tuning: • Image centric sampling; • Hierarchical min-batch sampling; • Joint optimize softmax + BB regressor; • Detection: • Truncated SVD in FCL weight matrix. An input image and multiple RoIs are input into a FCN. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by FCLs. The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
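The truncated-SVD speed-up of the fully connected layers at detection time can be sketched as follows; the rank t trades accuracy for speed, and the weight matrix here is synthetic:

```python
import numpy as np

def truncated_svd_fc(W, t):
    """Factor an FC weight matrix W (u x v) as W ~ U_t diag(s_t) Vt_t,
    i.e. replace one FC layer with two smaller ones holding
    t(u + v) parameters instead of u*v."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(s[:t]) @ Vt[:t]     # first layer weights: (t, v)
    W2 = U[:, :t]                    # second layer weights: (u, t)
    return W1, W2

W = np.random.randn(256, 64) @ np.random.randn(64, 512)  # rank <= 64 by construction
W1, W2 = truncated_svd_fc(W, t=64)
x = np.random.randn(512)
# Keeping t >= rank(W) makes the two-layer factorization exact (up to fp error);
# in practice t is chosen well below the rank for compression.
print(np.allclose(W2 @ (W1 @ x), W @ x))
```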
  84. 84. Faster R-CNN with RPN for Object Detection • Region Proposal Network (RPN): a FCN • Share the conv. layers with detection network; • Learning: image centric sampling with BP and SGD as Fast R-CNN, initialized with ImageNet pre- trained model, fine-tuned end-to-end for region proposal task; • Fast R-CNN: • Learn using the proposals from RPN, also initialized by a ImageNet pre-trained model; • Joint RPN + Fast-CNN: • Use the detector network to initialize RPN with fixed shared conv layers and fine tune RPN; • Finally, fine tune the FCL of Fast R-CNN. Region Proposal Network Encode conv. map positions and output classified objectness score + regressed bounds for k=9 region proposals; Reference boxes (k=9 anchors) with 3 scales/aspect ratios: translation invariant;
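The k = 9 reference anchors (3 scales x 3 aspect ratios) at one conv-map position can be generated as below; base_size and scales are illustrative values rather than a guaranteed match to the paper's exact configuration:

```python
import numpy as np

def make_anchors(center, base_size=16, scales=(8, 16, 32), ratios=(0.5, 1, 2)):
    """Generate the k = 3 scales x 3 aspect ratios = 9 reference anchors
    regressed by the RPN at one position. Returns (9, 4) [x1, y1, x2, y2]."""
    cx, cy = center
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)      # choose w, h with w*h = area
            h = w * ratio                  # and h/w = ratio
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

A = make_anchors((0, 0))
print(A.shape)  # (9, 4)
areas = (A[:, 2] - A[:, 0]) * (A[:, 3] - A[:, 1])
print(np.allclose(areas[:3], (16 * 8) ** 2))  # all ratios at scale 8 share one area
```

Because the same anchor set is replicated at every map position, the scheme is translation invariant, as the slide notes.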
  85. 85. Training Region-based Object Detectors with Online Hard Example Mining  Online hard example mining (OHEM) to train region-based ConvNet detectors.  Auto-selection of hard examples makes training more effective and efficient.  It eliminates several heuristics and hyperparameters in common use.  It yields consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012. Architecture of the Fast R-CNN approach
86. 86. Training Region-based Object Detectors with Online Hard Example Mining Given an image and RoIs, the network computes a feature map. (a): the read-only RoI network runs a forward pass and the hard RoI sampling module uses RoI losses to select B hard examples. (b): the hard examples are used by the RoI network for forward & backward passes.
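The hard-example selection step reduces to picking the top-B loss RoIs after the read-only forward pass; the NMS de-duplication over RoIs that the paper also applies is omitted in this sketch:

```python
import numpy as np

def ohem_select(roi_losses, B):
    """Online hard example mining (simplified): keep the B highest-loss RoIs
    for the backward pass, replacing heuristic fg/bg sampling ratios."""
    hard = np.argsort(roi_losses)[::-1][:B]   # indices of the B hardest RoIs
    return np.sort(hard)

losses = np.array([0.1, 2.3, 0.05, 1.7, 0.9, 0.2])
print(ohem_select(losses, B=3).tolist())  # [1, 3, 4] -- the three largest losses
```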
  87. 87. A Unified Multi-scale Deep CNN for Fast Object Detection  A unified deep neural network, denoted the multi-scale CNN (MS- CNN) for fast multi-scale object detection.  Consists of a proposal sub-network and a detection sub-network.  In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales.  These complementary scale-specific detectors are combined to produce a strong multi-scale object detector.  The network is learned end-to-end, by optimizing a multi-task loss.  Feature upsampling by deconvolution as an alternative to input upsampling, to reduce the memory and computation costs.
  88. 88. A Unified Multi-scale Deep CNN for Fast Object Detection The cubes - output tensors. h × w - filter size, c - # classes, b # bounding box coordinates.
  89. 89. A Unified Multi-scale Deep CNN for Fast Object Detection Different strategies for multi-scale detection (the template size)
  90. 90. A Unified Multi-scale Deep CNN for Fast Object Detection Object detection sub-network of the MS-CNN. “trunk CNN layers” are shared with proposal sub-network. The green (blue) cubes represent object (context) region pooling. “class probability” and “bounding box” are the outputs of the detection sub-network.
  91. 91. SSD: Single Shot MultiBox Detector  Discretize the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location;  At prediction time, generate scores for the presence of each object category in each default box and produce adjustments to the box to better match the object shape;  Combine predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes;  Eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network.
  92. 92. SSD: Single Shot MultiBox Detector (a) only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, evaluate a small set of default boxes of different aspect ratios at each location in several feature maps with different scales. For each default box, predict both the shape offsets and the confidences for all object categories. At training time, first match these default boxes to the ground truth boxes. The model loss is a weighted sum between localization loss (e.g. Smooth L1) and confidence loss (e.g. Softmax).
93. 93. You Only Look Once (YOLO) for Object Detection The YOLO Detection System The system models detection as a regression problem to a 7 x 7 x 24 tensor. This tensor encodes bounding boxes and class probabilities for all objects in the image. The network uses strided conv. layers to downsample the feature space instead of maxpooling layers. Pre-train the conv. layers on the ImageNet classification task and then double the resolution for detection. Note: More localization errors, relatively low recall.
94. 94. YOLO9000: Better, Faster, Stronger  Detect over 9000 object categories;  YOLOv2, 67 FPS, 76.8 mAP on VOC 2007; 40 FPS, 78.6 mAP;  Jointly train on object detection COCO and classification ImageNet;  Batch Normalization: 2% improvement in mAP;  High Resolution Classifier: full 448 × 448 resolution, almost 4% up in mAP;  Convolutional With Anchor Boxes: use anchor boxes to predict bound. boxes;  Dimension Clusters: k-means on the training set bounding boxes to automatically find good priors to adjust the boxes appropriately;  Direct location prediction: predict location relative to location of the grid cell;  Fine-Grained Features: 13 × 13 map, pass through layer from 26 × 26 res.  Multi-Scale Training: Every 10 batches randomly a new image dimension size.
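The "dimension clusters" step can be sketched as k-means over (width, height) pairs of training boxes using the IoU-based distance d = 1 - IoU, so that priors of all sizes are treated fairly (the toy data below is illustrative):

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between a (w, h) box and centroids, all anchored at the origin."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def dimension_clusters(boxes_wh, k, iters=20, seed=0):
    """k-means on (w, h) with distance 1 - IoU, yielding anchor priors."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU (lowest 1 - IoU).
        assign = np.array([np.argmax(iou_wh(b, centroids)) for b in boxes_wh])
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes_wh[assign == j].mean(axis=0)
    return centroids

boxes = np.array([[10, 10], [12, 11], [50, 100], [55, 95]], float)
priors = dimension_clusters(boxes, k=2)
print(np.sort(priors[:, 0]).tolist())  # one small-box prior, one large-box prior
```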
  95. 95. YOLO9000: Better, Faster, Stronger  Based on Googlenet architecture, faster than VGG-16;  Darknet-19: 19 convolutional layers and 5 maxpooling layers;  Training for classification: Darknet, data augmentation;  Training for detection: remove the last conv. layer, add on three 3 × 3 conv. layers with 1024 filters each followed by a final 1 × 1 conv. layer;  Hierarchical classification: WordNet, -> WordTree, a model of visual concepts;  Dataset combination with WordTree: combine labels from ImageNet & COCO;  Joint classification and detection: use the COCO detection dataset and the top 9000 classes from the full ImageNet release;  YOLO9000: WordTree with 9418 classes.
  96. 96. YOLO9000: Better, Faster, Stronger
  97. 97. DeepBox: Learning Objectness with CNN  DeepBox uses CNNs to rerank proposals from a bottom-up method;  A four-layer CNN architecture that is as good as much larger networks on the task of evaluating objectness while being much faster;
98. 98. DenseBox: Landmark Localization and Object Detection  Fully convolutional neural network (FCN);  Directly predicts bounding boxes and object class confidences through all locations and scales of an image;  Improve accuracy with landmark localization during multi-task learning. Pipeline: 1) The image pyramid is fed to the network. 2) After several layers of convolution and pooling, upsample the feature map and apply convolution layers to get the final output. 3) Convert the output feature map to bounding boxes, and apply non-maximum suppression to all bounding boxes over the threshold.
  99. 99. DenseBox: Landmark Localization and Object Detection DenseBox Densebox with landmark localization
  100. 100. R-FCN: Object Detection via Region-based Fully Convolutional Networks  Position-sensitive score maps to handle conflict of translation-invariance in image classification and translation-variance in object detection; Overall architecture of R-FCN. A RPN proposes candidate RoIs, applied on the score maps. All learnable weight layers are convolutional and are computed on the entire image; the per-RoI computational cost is negligible
101. 101. R-FCN: Object Detection via Region-based Fully Convolutional Networks Key idea of R-FCN for object detection. k × k = 3 × 3 position-sensitive score maps generated by a FCN. For each of the k × k bins in an RoI, pooling is only performed on one of the k² maps.
  102. 102. R-FCN: Object Detection via Region-based Fully Convolutional Networks
  103. 103. LocNet: Improving Localization Accuracy for Object Detection  Object localization aiming at boosting the localization accuracy.  The model, given a search region, aims at returning the bounding box of an object of interest inside this region.  Assign conditional prob. to each row and column of this region, where these prob. provide useful info. regarding loc. of boundaries of the object inside the search region and allow accurate inference of the object bounding box under a simple prob. framework.  A CNN architecture adapted for this task, called LocNet.
104. 104. LocNet: Improving Localization Accuracy for Object Detection Illustration of the basic work-flow of the localization module. Left column: given a candidate box B (yellow box), the model “looks” at a search region R (red box), which is obtained by enlarging box B by a constant factor, in order to localize the bounding box of an object of interest. Right column: To localize a bounding box the model assigns one or more probabilities on each row and independently on each column of region R. Those prob. can be either the prob. of an element (row or column) to be one of the four object borders (see top-right image), or the probability for being on the inside of an object’s bounding box (see bottom-right image). In either case the predicted bounding box is drawn with blue color.
  105. 105. LocNet: Improving Localization Accuracy for Object Detection The posterior prob. that the loc. model yields given a region R. Left Image: the in- out conditional prob. assigned on each row (py) and column (px) of R. Drawn with blues curves on the right and the bottom side. Right Image: the conditional prob. pl, pr, pt, and pb of each column or row to be the left (l), right (r), top (t) and bottom (b) border of an object’s bounding box. They are drawn with blue and red curves on the bottom and on the right side of the search region.
106. 106. LocNet: Improving Localization Accuracy for Object Detection Visualization of the LocNet network architecture The processing starts by forwarding the entire image I, through a seq. of convolutional layers that outputs the AI activation maps. Then, the region R is projected on AI and the activations that lay inside it are cropped and pooled with a spatially adaptive max-pooling layer. The resulting fixed size activation maps are forwarded through two convolutional layers, each followed by ReLU non-linearities, that yield the localization-aware activation maps AR of region R. The network is split into two branches, the X and Y, with each being dedicated to the predictions that correspond to the dimension (x or y respectively) that is assigned to it. The resulting activations Ax R and Ay R efficiently encode the object location only across the dimension that their branch handles. Finally, each of those aggregated features is fed into the final fully connected layer followed by sigmoid units in order to output the conditional prob. of its assigned dimension.
107. 107. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection  Region Proposal Network struggles in small-size object detection and precise localization (e.g., large Intersection-over-Union thresholds), mainly due to the coarseness of its feature maps.  HyperNet handles region proposal generation and object detection jointly.  An elaborately designed Hyper Feature which aggregates hierarchical feature maps first and then compresses them into a uniform space.  The features well incorporate deep but highly semantic, intermediate but really complementary, and shallow but naturally high-resolution features, thus enabling HyperNet to be constructed by sharing them both in generating proposals and detecting objects via an end-to-end joint training strategy.
108. 108. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection Top left: top 10 object proposals generated by the network. Top right: detection results with precision value. Bottom: object proposal generation and detection pipeline.
  109. 109. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection HyperNet object detection architecture. The system (1) takes an input image, (2) computes Hyper Feature representation, (3) generates 100 proposals and (4) classifies and makes adjustment for each region.
  110. 110. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection
111. 111. RON: Reverse Connection with Objectness Prior Networks for Object Detection  Under fully convolutional architecture, RON mainly focuses on two fundamental problems: (a) multi-scale object localization and (b) negative sample mining.  For (a), the reverse connection enables the network to detect objects on multi-levels of CNNs.  For (b), the objectness prior significantly reduces the search space of objects.  Optimize the reverse connection, objectness prior and object detector jointly by a multi-task loss function, thus RON can directly predict final detection results from all locations of various feature maps.
  112. 112. RON: Reverse Connection with Objectness Prior Networks for Object Detection RON object detection overview. Given an input image, the network firstly computes features of the backbone network. Then at each detection scale: (a) adds reverse connection; (b) generates objectness prior; (c) detects object on its corresponding CNN scales and locations. Finally, all detection results are fused and selected with non-maximum suppression.
  113. 113. RON: Reverse Connection with Objectness Prior Networks for Object Detection A reverse connection block One inception block. Object detection and bbox regression modules.
  114. 114. RON: Reverse Connection with Objectness Prior Networks for Object Detection Mapping the objectness prior with object detection. 1st binarize the objectness prior maps according to op, 2nd project the binary masks to the detection domain of the last conv. feature maps. The locations within the masks are collected for detecting objects. Objectness prior maps generated from images.
115. 115. Feature Pyramid Networks for Object Detection  Exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.  A top-down architecture with lateral connections is developed for building high- level semantic feature maps at all scales, Feature Pyramid Network (FPN). (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.
  116. 116. Feature Pyramid Networks for Object Detection Top: a top-down architecture with skip connections, where predictions are made on the finest level. Bottom: It leverages it as a feature pyramid, with predictions made independently at all levels. A building block illustrating the lateral connection and the top- down pathway, merged by addition.
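One top-down merge step of the building block (2x upsample of the coarser map, plus a 1x1-convolved lateral connection, combined by addition) can be sketched as follows, with a random matrix standing in for the learned 1x1 conv:

```python
import numpy as np

def fpn_merge(coarse, lateral_fine, W_lateral):
    """One FPN top-down step.

    coarse: (H, W, C) semantically strong map from the level above;
    lateral_fine: (2H, 2W, C_in) bottom-up map; W_lateral: (C_in, C).
    """
    upsampled = coarse.repeat(2, axis=0).repeat(2, axis=1)  # nearest 2x upsample
    lateral = lateral_fine @ W_lateral   # 1x1 conv = per-pixel channel projection
    return upsampled + lateral           # merge by element-wise addition

coarse = np.random.rand(4, 4, 8)
fine = np.random.rand(8, 8, 16)
merged = fpn_merge(coarse, fine, np.random.rand(16, 8))
print(merged.shape)  # (8, 8, 8): finer resolution, shared channel width
```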
117. 117. Focal Loss for Dense Object Detection  The highest accuracy object detectors are based on a two-stage approach: a classifier is applied to a sparse set of candidate object locations;  One-stage detectors, applied over a regular, dense sampling of possible object locations, are faster and simpler, but less accurate than two-stage methods.  The extreme FG-BG class imbalance encountered during training of dense detectors is the central cause.  Idea: Reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples.  A “Focal Loss” focuses training on a sparse set of hard examples and prevents the number of easy negatives from overwhelming the detector during training.  A simple dense detector RetinaNet: match the speed of one-stage detectors while surpassing the accuracy of all two-stage detectors.
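The focal loss itself is a one-line modification of cross entropy; a sketch for the binary case, including the alpha-balancing weight used in the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: FL(pt) = -alpha_t * (1 - pt)**gamma * log(pt),
    where pt = p if y == 1 else 1 - p. gamma = 0 recovers
    (alpha-weighted) cross entropy."""
    pt = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - pt) ** gamma * np.log(pt)

p = np.array([0.9, 0.6])        # a well-classified vs a harder positive
y = np.array([1, 1])
ce = focal_loss(p, y, gamma=0.0, alpha=1.0)      # plain cross entropy
fl = focal_loss(p, y, gamma=2.0, alpha=1.0)
print((fl / ce).round(4).tolist())  # [0.01, 0.16]: easy example down-weighted most
```

The ratio fl/ce is exactly (1 - pt)^gamma, which is how the huge population of easy negatives is prevented from dominating the gradient.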
118. 118. Focal Loss for Dense Object Detection Focal Loss adds a factor (1 − pt)^γ to the standard CE criterion. Setting γ > 0 reduces the relative loss for well-classified examples (pt > .5), putting more focus on hard, misclassified examples. Speed (ms) versus accuracy (AP) on COCO test- dev. Enabled by the focal loss, simple one-stage RetinaNet detector outperforms one-stage and two- stage detectors. Variants of RetinaNet with ResNet- 50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels).
  119. 119. Focal Loss for Dense Object Detection RetinaNet uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d).
  120. 120. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling
  121. 121. Joint Deep Learning for Pedestrian Detection Learned filtered at the second convolutional layer Part models
  122. 122. Joint Deep Learning for Pedestrian Detection • Visibility Reasoning with Deep Belief Net Deformation Layer
  123. 123. Multi-Stage Contextual Deep Learning for Pedestrian Detection multi-stage contextual deep model The proposed deep Learning Architecture. Apply different filters Fi on the same feature map f and obtain different score maps si.
  124. 124. Pedestrian Detection aided by Deep Learning Semantic Tasks Jointly optimizes pedestrian detection with semantic tasks, including pedestrian attributes ( ‘carrying backpack’) and scene attributes ( ‘road’, ‘tree’, and ‘horizontal’); A multi-task objective function is designed to coordinate tasks and reduce discrepancies among datasets and a deep model, task-assistant CNN (TA-CNN), is to learn high-level features from multiple tasks and multiple data sources.
  125. 125. Pedestrian Detection aided by Deep Learning Semantic Tasks
  126. 126. CNN-based Pose Classification One convolutional neural net is trained on semantic part patches for each poselet and then the top- level activations of all nets are concatenated to obtain a pose-normalized deep representation. The final attributes are predicted by linear SVM classifier using the pose-normalized representations.
  127. 127. Given an image, a human body detector is used to find the bounding box around the human. Next, a convolutional neural network (CNN) extracts shared features from the cropped image, and the shared features are the inputs to the joint point regression tasks and the body-part detection tasks. The CNN, regression, and detection tasks are learned simultaneously, resulting in a shared feature representation. Heterogeneous Multi-task Learning for Pose Estimation
  128. 128. A Multi-source Deep Model for Pose Estimation How to generate multiple candidate locations: A candidate is used as the input to a deep model to determine whether the candidate is correct and estimate body locations. A multi-source deep model for constructing the non-linear representation from three information sources: mixture type, appearance score and deformation.
129. 129. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation  A hybrid architecture that consists of a deep CNN and an MRF.  The architecture exploits structural domain constraints such as geometric relationships between body joint locations.  Joint training of these two model paradigms improves performance. Multi-Resolution Sliding-Window With Overlapping Receptive Fields
  130. 130. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation Efficient Sliding Window Model with Single Receptive Field
  131. 131. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation Efficient Sliding Window Model with Single Receptive Field Efficient Sliding Window Model with Overlapping Receptive Fields
  132. 132. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation Efficient Sliding Window Model with Single Receptive Field Approximated Efficient Sliding Window Model with Overlapping Receptive Fields
133. 133. Stacked Hourglass Networks for Human Pose Estimation  A convolutional network architecture for human pose estimation.  Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body.  Run repeated bottom-up, top-down processing in conjunction with intermediate supervision to improve the performance of the network.  A “stacked hourglass” network based on successive steps of pooling-and-upsampling. A network consists of multiple stacked hourglass modules which allow for repeated bottom-up, top-down inference
  134. 134.  The person’s orientation, arrangement of their limbs, and the relationships of adjacent joints are best recognized at different scales in the image.  The hourglass is a simple, minimal design with the capacity to capture all these features and bring them together to output pixel-wise predictions.  Use a pipeline with skip layers to preserve spatial info. at each resolution.  It reaches its lowest resolution at 4x4 pixels allowing smaller spatial filters to be applied that compare features across the entire space of the image.  Stacking multiple hourglasses end-to-end, feeding the output of one as input into the next, which provides a mechanism for repeated bottom-up, top-down inference allowing for reevaluation of initial estimates and features across the whole image;  The key: prediction of intermediate heatmaps upon which to apply a loss. Stacked Hourglass Networks for Human Pose Estimation
  135. 135. An illustration of a single “hourglass” module. Each box corresponds to a residual module. # of features is consistent across the whole hourglass. Residual Module in the network The intermediate supervision process Stacked Hourglass Networks for Human Pose Estimation
  136. 136. The change in predictions from an intermediate stage (second hourglass) (left) to final predictions (eighth hourglass) (right). Stacked Hourglass Networks for Human Pose Estimation
  137. 137. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation  Articulated pose estimation of multiple people in real world images.  Jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts btw people in close proximity of each other.  A partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors.  The formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints.
  138. 138. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation (a) initial detections (= part candidates) and pairwise terms (graph) between all detections that (b) are jointly clustered belonging to one person (one colored subgraph = one person) and each part is labeled corresponding to its part class (different colors and symbols correspond to different body parts); (c) shows the predicted pose sticks.
139. 139. Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields  Efficiently detect the 2D pose of multiple people.  A nonparametric representation, Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image.  The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving real-time performance.  The architecture jointly learns part locations and their association via two branches of the same sequential prediction process. Top: Multi-person pose estimation. Bottom left: Part Affinity Fields (PAFs) for the limb connecting right elbow and right wrist. Bottom right: zoom-in view of predicted PAFs.
  140. 140. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields Overall pipeline. The entire image as input for a two-branch CNN to jointly predict confidence maps for body part detection in (b), and part affinity fields for parts association in (c). A set of bipartite matchings to associate body parts candidates (d). Assemble them into full body poses for all people in the image (e).
141. 141. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields Architecture of the two-branch multi-stage CNN. Each stage in the first branch predicts confidence maps, and each stage in the second branch predicts PAFs. After each stage, the predictions from the two branches, along with the image features, are concatenated for the next stage. a set of detection confidence maps a set of part affinity fields
142. 142. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields Confidence maps of the right wrist (top) and PAFs (bottom) of the right forearm across stages. Though there is confusion between left and right body parts and limbs in early stages, estimates are increasingly refined through global inference in later stages, as shown in the highlighted areas. Part association strategies. (a) body part detection candidates (red and blue dots) for two body part types and all connection candidates (grey lines). (b) connections using the midpoint (yellow dots) representation: correct connections (black lines) and incorrect connections (green lines) that also satisfy the incidence constraint. (c) Results using PAFs (yellow arrows). By encoding position and orientation over the support of the limb, PAFs eliminate false associations.
  143. 143. Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields Graph matching. (a) image with part detections (b) K-partite graph (c) Tree (d) A set of bipartite graphs
  144. 144. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera  Real-time capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera.  It combines a CNN-based pose regressor with kinematic skeleton fitting.  Fully convolutional pose formulation regresses 2D and 3D joint positions jointly in real time without tightly cropped input frames.  A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton.  Monocular RGB used in real-time applications such as 3D character control;  The accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods.
  145. 145. It consists of two primary components. The first is a CNN to regress 2D and 3D joint positions under the ill-posed monocular capture conditions. It is trained on annotated 3D human pose datasets, additionally leveraging annotated 2D human pose datasets for improved in-the-wild performance. The second component combines the regressed joint positions with a kinematic skeleton fitting method to produce a temporally stable, camera-relative, full 3D skeletal pose. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
  146. 146. Schema of the fully-convolutional formulation for predicting root relative joint locations. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
  147. 147. Representative training frames from Human3.6m and MPI-INF-3DHP 3D pose datasets. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
148. 148. The structure is preceded by ResNet50/100 up to level 4. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
149. 149. • Combine 2D and 3D joint positions in a joint optimization framework, along with temporal filtering/smoothing, to obtain an accurate, temporally stable result. • First, the 2D predictions Kt are temporally filtered and used to obtain the 3D coordinates of each joint from the location-map predictions, giving us P^L_t. • To ensure skeletal stability, the bone lengths inherent to P^L_t are replaced by the bone lengths of the underlying skeleton in a simple retargeting step that preserves the bone directions of P^L_t. • The resulting 2D and 3D predictions are combined by minimizing an objective energy over the skeletal joint angles θ and the root joint’s location in camera space d. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
150. 150. Multi-view Face Detection using CNNs: Deep Dense Face Detector  No face pose/landmark annotation, segmentation, bounding box regression;  Detect faces from different view angles and with occlusion to some extent;  Can improve based on better sampling strategies and data augmentation;  Sliding window-based detector with tuned AlexNet and data augmentation;  Size 227x227, 50k iterations, batch size 128 (32 pos + 96 neg);  Convert the fully connected layer to a convolutional layer by reshaping the layer parameters;  Run CNN to get the heat map of face classifiers for refined localization;  Improve the face localization using a bounding box regression module.  Comparison with R-CNN: the best R-CNN classifier is inferior;  Loss of recall (miss of selective search);  loss of localization (bounding box regression).
  151. 151. Multi-view Face Detection using CNNs: Deep Dense Face Detector A set of faces with different out-of- plane rotations and occlusions. Heat map of face classifier
152. 152. Facial Landmark Detection by Deep Multi-task Learning • Face attributes help in detecting facial landmarks; • Task-wise early stopping in multi-task learning; • Training data: Multi-task Facial Landmark dataset; • Testing data: AFLW (Annotated Facial Landmarks in the Wild); • Points: extend 5 points to 68 points.
  153. 153. Face Alignment by Deep Regression • Apply a global layer and multi-stage local layers; • Sequential learning, joint learning. (a) The network takes face image as input and outputs shape estimation. The global layer estimates initial shape and the rest local layers refine the estimation iteratively. (b) Inner structure of the global layer; (c) Inner structure of the tth local layer.
154. 154. DeepFace for Face Verification at Facebook • Fiducial points are extracted by a Support Vector Regressor (SVR) trained to predict point configurations from an image descriptor; • 2D alignment: 2D similarity transformation; • 3D alignment: Based on a generic 3D shape model, register a 3D affine camera by back-projecting the frontal face plane of the 2D-aligned crop to the image plane of the 3D shape.
155. 155. Learn the Deep Face Representation: Face++ • Megvii Face Classification (MFC) database: 5 million labeled faces with 20000 people; • 10-layer CNN for four cropped-face-region-based feature extractors used in softmax-based training and PCA + L2 norm verification; • 99.50% accuracy on the LFW benchmark.
156. 156. Learn the Deep Face Representation: Face++ • Pyramid CNN adopts a greedy-filter-and-down-sample operation, which enables the training procedure to be very fast and computationally efficient. • Its structure can naturally incorporate feature sharing across multi-scale face representations, increasing the discriminative ability of the resulting representation.
157. 157. DeepID: Deep Learning Face Representation  Deep hidden identity features (DeepID) for face verification and identification;  Features are taken from the last hidden layer neuron activations of a deep CNN;  The proposed features are extracted from various face regions to form complementary and over-complete representations;  Integrated with Joint Bayesian for face verification: 97.45% accuracy on the LFW dataset. Structure of the neural network used for face verification ConvNet structure used for DeepID
158. 158. DeepID2: Deep Learning Face Representation  DeepID2: joint face verification and identification, where verification reduces intra-personal variations and identification enlarges inter-personal variations;  Deep CNN: input 55x47, 4 conv. + 3 max-pooling layers, ReLU, 160-D feature vector;  Learning by SGD;  99.15% verification rate on the LFW dataset.
  159. 159. DeepID2+: Deep Learning Face Representation DeepID2+ net and supervisory signals • DeepID2+: increasing the dimension of hidden representations and adding supervision to early convolutional layer; • Sparse neural activations, selective neurons in higher layers and robustness to occlusions; • Get larger with 128 feature maps in each conv. layer; • Supervisory signals are only added to one fully-connected layer from 3rd & 4th conv. layers; the lower conv. layers can only get supervision from higher layers; • 99.47% and 93.2% for verification rates on LFW and Youtube dataset.
  160. 160. DeepID3: Face Recognition with Very Deep Neural Network • Apply stacked convolution and inception layers proposed in VGG Net and GoogLeNet to make them suitable to face recognition; • An ensemble of proposed two architectures achieves LFW face verification accuracy 99.53% and LFW rank-1 face identification accuracy 96.0%, respectively.
161. 161. FaceNet: A Unified Embedding for Face Recognition and Clustering  Learn a Euclidean 128-D embedding per image using a deep CNN;  L2 distances -> face similarity;  Face verification: thresholding the distance btw. two embeddings;  Face recognition: k-NN classification;  Face clustering: k-means or agglomerative clustering;  Triplet-based loss function based on LMNN (Large Margin Nearest Neighbor);  Apply a new online negative exemplar mining strategy;  Apply a hard positive mining strategy in face clustering. Model Structure The Triplet Loss
162. 162. FaceNet: A Unified Embedding for Face Recognition and Clustering The Triplet Loss minimizes the distance btw an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity: L = Σ [ ‖f(a) − f(p)‖² − ‖f(a) − f(n)‖² + α ]+ over all triplets (anchor a, positive p, negative n, margin α). Instead of picking the hardest positive, use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. Selecting the hardest negatives
163. 163. FaceNet: A Unified Embedding for Face Recognition and Clustering  CNN: MattNet and GoogleNet (Inception);  Training: use SGD with standard BP and AdaGrad;  Performance: LFW (98.87%, 99.63% with alignment) and Youtube Faces (95.12%). MattNet GoogleNet Note: at GTC’15, Andrew Ng announced Baidu’s verification rate on the LFW dataset is 99.85%!
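A minimal pure-Python sketch of the triplet loss and the semi-hard negative selection described above (embeddings as tuples; `semi_hard_negative` is my own name for the paper's selection rule, not FaceNet code):

```python
def l2_sq(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + margin)."""
    return max(0.0, l2_sq(anchor, positive) - l2_sq(anchor, negative) + margin)

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Semi-hard mining: choose a negative that is farther from the anchor
    than the positive, but still within the margin, so the loss is positive
    without collapsing training onto pathological hardest negatives."""
    d_ap = l2_sq(anchor, positive)
    semi = [n for n in candidates if d_ap < l2_sq(anchor, n) < d_ap + margin]
    return min(semi, key=lambda n: l2_sq(anchor, n)) if semi else None
```

A triplet whose negative is already far enough away contributes zero loss; mining keeps the batch filled with informative triplets instead.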
164. 164. Text Detection and Recognition based on CNNs The end-to-end text spotting pipeline. a) A combination of region proposal methods extracts many word bounding box proposals. b) Proposals are filtered with a random forest classifier, reducing the number of false-positive detections. c) A CNN is used to perform bounding box regression for refining the proposals. d) A CNN performs text recognition on each of the refined proposals. e) Detections are merged based on proximity and recognition results and assigned a score. f) Thresholding the detections results in the final text spotting result.
165. 165. Text Detection and Recognition based on CNNs Schematics of the CNNs used, showing the dimensions of the feature maps at each stage for (a) dictionary encoding, (b) character sequence encoding, and (c) bag-of-N-gram encoding. The same five-layer base CNN architecture is used for all three models.
166. 166. Multi-digit Number Recognition from Street Views using Deep CNNs  Unified one of localization, segmentation, and recognition steps via the use of a deep CNN that operates directly on the image pixels;  Train a probabilistic model of sequences given images;  Extract a set of features H from the image X using a CNN with a fully connected final layer;  Six separate softmax classifiers are then connected to this feature vector H, i.e., each softmax classifier forms a response by making an affine transformation of H and normalizing this response with the softmax function.
167. 167. Scene Parsing with Multiscale Feature Learning • Compute a tree of segments from a graph of pixel dissimilarities. • A set of dense feature vectors encodes regions of multiple sizes centered on each pixel. • The feature extractor is a multi-scale trained convolutional network. • The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. • A subset of tree nodes that cover the image are then selected so as to maximize the average “purity” of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. • The convolutional network feature extractor is trained end-to-end from raw pixels, alleviating the need for engineered features. After training, the system is parameter free. • From a pool of segmentation components, retrieve an optimal set of components that best explain the scene, taken from a segmentation tree or any family of over-segmentations.
  168. 168. Scene Parsing with Multiscale Feature Learning (a) Method 1 (b) Method 2
  169. 169. Scene Parsing with Multiscale Feature Learning
170. 170. Simultaneous Detection and Segmentation (SDS) • Apply R-CNN to classify category-independent region proposals, and use category-specific top-down figure-ground predictions to refine the bottom-up proposals; • Proposal generation: category-independent bottom-up by MCG; • Feature extraction for each region by R-CNN; • Region classification by SVM, assigning a score for each category to each candidate. • Region refinement (non-maximum suppression) on the scored candidates.
  171. 171. Learning Rich Features from RGB-D Images for Object Detection and Segmentation • Geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. • Better than using depth images for learning feature representations with CNN. • A decision forest approach that classifies pixels as FG or BG using a family of unary and binary tests that query shape and geocentric pose features.
  172. 172. Fully Convolutional Networks for Semantic Segmentation • Build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning; • Adapt the classification nets into fully convolutional network and transfer their learned representations by fine tuning (after a supervised pre-training); • Combine semantic info. from a deep coarse layer with appearance info. from a shallow fine layer to produce accurate and detailed segmentation; Transforming fully connected layers into convolution layers. Fully convolutional networks can efficiently learn to make dense predictions for per pixel semantic segmentation.
173. 173. Fully Convolutional Networks for Semantic Segmentation • While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, called a deep filter or fully convolutional network; • Upsampling by deconvolution is more efficient and effective for yielding dense predictions; • Input shifting and output interlacing without interpolation (proposed by OverFeat); • Changing only the filters and layer strides of a convnet produces the same output as the “shift-and-stitch” trick. • Define an FCN combining layers of the feature hierarchy and refining the spatial precision of the output; • Add links combining the final prediction layer with lower layers with finer strides, turning the net into a DAG, not layer-wise, with edges skipping ahead from lower layers to higher ones; • Learn the skip net, which improves the performance. • Note: decreasing the stride of pooling layers yields finer prediction, but the correspondingly larger conv. kernel sizes raise the learning cost.
174. 174. Fully Convolutional Networks for Semantic Segmentation A DAG net learns to combine coarse, high-layer information with fine, low-layer information. Solid line (FCN-32s): A single-stream net upsamples stride-32 predictions back to pixels. Dashed line (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets the net predict finer details, while retaining high-level semantic information. Dotted line (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision. Deep Jet (nonlinear local feature hierarchy): makes local predictions while respecting global structure.
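FCN's learned deconvolution (upsampling) layers are initialized to plain bilinear interpolation; a numpy sketch of that commonly used initializer (function name is mine):

```python
import numpy as np

def bilinear_kernel(factor):
    """Weights of a transposed convolution that performs bilinear
    upsampling by `factor`; kernel size is 2*factor - factor % 2."""
    size = 2 * factor - factor % 2
    f = (size + 1) // 2
    center = f - 1 if size % 2 == 1 else f - 0.5
    og = np.ogrid[:size, :size]
    # Separable triangular weights: outer product of 1D bilinear weights.
    return ((1 - abs(og[0] - center) / f) *
            (1 - abs(og[1] - center) / f))

# For factor 2 this yields a 4x4 kernel with 1D weights (.25, .75, .75, .25);
# FCN starts its deconv layers from such weights and then fine-tunes them.
```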
175. 175. DeepLab: CNN + CRF for Semantic Segmentation • Learning DCNNs for semantic image segmentation from either (1) weakly annotated training data such as bounding boxes or image-level labels or (2) a combination of few strongly labeled and many weakly labeled images, sourced from one or multiple datasets; • Use DCNN to predict the label distribution per pixel, followed by a fully-connected (dense) CRF to smooth the predictions while preserving image edges; • Expectation-Maximization (EM) methods for semantic image segmentation model training under these weakly supervised and semi-supervised settings.
  176. 176. DeepLab: CNN + CRF for Semantic Segmentation DeepLab model training using image-level labels
  177. 177. DeepLab: CNN + CRF for Semantic Segmentation DeepLab model training from bounding boxes DeepLab model training on a union of full (strong labels) and image-level (weak labels) annotations
178. 178. ParseNet: Looking Wider to See Better • Global context added to CNN for semantic segmentation; • Using the average feature for a layer to augment the features at each location; • Learning normalization parameters for improvement. • Early fusion: unpool (replicate) the global feature to the same spatial size as the local feature map and then concatenate them, and use the combined feature to learn the classifier; • Late fusion: each feature is used to learn its own classifier, followed by merging the two predictions into a single classification score;
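Early fusion amounts to broadcasting the global average feature back over the spatial grid and concatenating, sketched below for a single image (the paper additionally L2-normalizes the features and learns per-channel scales, omitted here):

```python
import numpy as np

def early_fusion(feat):
    """feat: (C, H, W) local feature map. Returns (2C, H, W): local features
    concatenated with the unpooled (replicated) global-average feature."""
    c, h, w = feat.shape
    g = feat.mean(axis=(1, 2))                        # global context vector, (C,)
    g_map = np.broadcast_to(g[:, None, None], (c, h, w))
    return np.concatenate([feat, g_map], axis=0)
```

The classifier on the fused map then sees global scene context at every spatial location.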
179. 179. SegNet: Deep Conv. Encoder-Decoder Architecture A decoder upsamples its input using the transferred pool indices from its encoder to produce sparse feature maps; it then performs convolution with a trainable filter bank to densify the feature maps; the final decoder output feature maps are fed to a soft-max classifier for pixel-wise classification.
180. 180. SegNet: Deep Conv. Encoder-Decoder Architecture • SegNet uses the max pooling indices to upsample (without learning) the feature map(s) and convolves with a trainable decoder filter bank. • FCN upsamples by learning to deconvolve the input feature map and adds the corresponding encoder feature map to produce the decoder output; this feature map is the output of the max-pooling layer (includes sub-sampling) in the corresponding encoder. • Note that there are no trainable decoder filters in FCN.
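The pooling-index trick can be sketched for a single-channel map (numpy; a 2x2, stride-2 pool as in SegNet, with the decoder placing values back at the recorded argmax locations and leaving zeros for the following convolutions to densify; function names are mine):

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2, stride-2 max pooling that also records flat argmax indices
    (the SegNet encoder side)."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            block = x[2*i:2*i+2, 2*j:2*j+2]
            k = int(block.argmax())               # position of max within block
            out[i, j] = block.flat[k]
            idx[i, j] = (2*i + k // 2) * w + (2*j + k % 2)
    return out, idx

def max_unpool(pooled, idx, shape):
    """SegNet decoder side: scatter pooled values back to their recorded
    locations; everything else stays zero (a sparse map)."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out
```

Only the indices (not the full encoder feature maps) need to be stored, which is the memory advantage over FCN-style skip additions.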
181. 181. Mask R-CNN for Object Instance Segmentation  Detect objects in an image while generating a segmentation mask for each instance.  Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.  Simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.  Easy to generalize to other tasks, allowing pose estimation in the same framework.
  182. 182. Mask R-CNN for Object Instance Segmentation Head Architecture: Left/Right for the heads for the ResNet C4 and FPN backbones, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimension while deconv increases it). All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and use ReLU in hidden layers. Left: ‘res5’ denotes ResNet’s 5th stage, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1. Right: ‘×4’ denotes a stack of four consecutive convs.
183. 183. Mask R-CNN for Object Instance Segmentation  Extended to human pose estimation: Model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow).  Minimal domain knowledge for human pose is exploited by the system.  For each of the K keypoints of an instance, the training target is a one-hot m × m binary mask where only a single pixel is labeled as FG.  During training, for each visible ground-truth keypoint, minimize the cross-entropy loss over an m²-way softmax output.  As in instance segmentation, the K keypoints are treated independently.  The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56.  Train the models using image scales randomly sampled from [640, 800] pixels; inference is on a single scale of 800 pixels.  90k iterations, learning rate of 0.02 and reducing it by 10 at 60k and 80k iterations.  Bounding-box non-maximum suppression with a threshold of 0.5.
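The keypoint loss is simply softmax cross-entropy over the flattened m × m heatmap, with the ground-truth pixel as the single correct class (a sketch; the function name is mine):

```python
import math

def keypoint_ce_loss(logits, target_index):
    """Cross-entropy over an m*m-way softmax: `logits` is the flattened
    heatmap, `target_index` the one-hot ground-truth keypoint pixel.
    Loss = log(sum_j exp(z_j)) - z_target."""
    log_z = math.log(math.fsum(math.exp(z) for z in logits))
    return log_z - logits[target_index]
```

A uniform heatmap over m² positions gives loss log(m²); concentrating mass on the correct pixel drives the loss toward zero.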
  184. 184. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation  RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high- resolution prediction using long-range residual connections.  In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions.  The individual components of RefineNet employ residual connections following the identity mapping mindset, which allows for effective end-to-end training.  Further, chained residual pooling, which captures rich background context in an efficient manner.
185. 185. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation Comparison of fully convolutional approaches for dense classification. Standard multi-layer CNNs, such as ResNet (a), suffer from downscaling of the feature maps, thereby losing fine structures along the way. Dilated convolutions (b) remedy this shortcoming by introducing atrous filters, but are computationally expensive to train and quickly reach memory limits even on modern GPUs. RefineNet (c) exploits various levels of detail at different stages of convolutions and fuses them to obtain a high-resolution prediction without the need to maintain large intermediate feature maps.
  186. 186. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation The individual components of multi-path refinement network architecture RefineNet. Components in RefineNet employ residual connections with identity mappings. In this way, gradients can be directly propagated within RefineNet via local residual connections, and also directly propagate to the input paths via long-range residual connections, and thus achieve effective end-to-end training of the whole system.
  187. 187. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation Illustration of 3 variants of RefineNet network architecture: (a) single RefineNet, (b) 2-cascaded RefineNet and (c) 4-cascaded RefineNet with 2-scale ResNet. Note that RefineNet block can seamlessly handle different numbers of inputs of arbitrary resolutions and dimensions without any modification.
188. 188. Large Kernel Matters ——GCN  One trend is stacking small filters (1x1 or 3x3), because stacked small filters are more efficient than a large kernel, given the same computational complexity.  In semantic segmentation, however, the large kernel (effective receptive field) plays an important role in performing classification and localization simultaneously.  Hence the Global Convolutional Network (GCN) for semantic segmentation.
  189. 189. Pyramid Scene Parsing Network  Global context info. by different-region-based context aggregation through pyramid pooling module with the pyramid scene parsing network (PSPNet). Given an input image (a), use CNN to get the feature map of the last conv. layer (b), then a pyramid parsing module to harvest different sub-region representations, followed by upsampling and concatenation layers to form the final feature representation, which carries both local and global context info. in (c). Finally, the representation is fed into a conv. layer to get the final per-pixel prediction (d).
  190. 190. Pyramid Scene Parsing Network
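A single-channel sketch of the pyramid pooling module (numpy; the actual module also applies a 1x1 conv per level and uses bilinear rather than nearest upsampling, and this toy version assumes the map size is divisible by each bin count):

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool an (H, W) map into a bins x bins grid."""
    h, w = x.shape
    out = np.zeros((bins, bins))
    for i in range(bins):
        for j in range(bins):
            out[i, j] = x[i*h//bins:(i+1)*h//bins,
                          j*w//bins:(j+1)*w//bins].mean()
    return out

def pyramid_pooling(x, levels=(1, 2, 3, 6)):
    """PSPNet-style pyramid pooling: pool at each level, upsample (nearest,
    for simplicity) back to H x W, and concatenate with the input map."""
    h, w = x.shape
    maps = [x]
    for b in levels:
        p = adaptive_avg_pool(x, b)
        up = np.repeat(np.repeat(p, h // b, axis=0), w // b, axis=1)
        maps.append(up)
    return np.stack(maps)   # (1 + len(levels), H, W)
```

The 1x1 bin is exactly ParseNet-style global pooling; the finer bins add sub-region context at intermediate scales.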
  191. 191. Large Kernel Matters ——GCN
  192. 192. Large Kernel Matters ——GCN (A) Global Convolutional Network. (B) 1×1 convolution baseline. (C) k×k convolution. (D) stack of 3×3 convolutions. A: the bottleneck module in original ResNet. B: GCN in ResNet-GCN.
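GCN's point is cost: a dense k x k kernel needs O(k²) parameters, while the pair of separable branches (k x 1 then 1 x k, and 1 x k then k x 1) keeps the k x k receptive field at O(k). A quick parameter count (bias terms ignored; the mid-channel choice is my assumption, not from the paper):

```python
def params_kxk(c_in, c_out, k):
    """Weights in a plain dense k x k convolution."""
    return c_in * c_out * k * k

def params_gcn(c_in, c_out, k, mid=None):
    """Weights in GCN's two parallel separable branches,
    (k x 1 -> 1 x k) and (1 x k -> k x 1), with `mid` channels between
    the two convs of each branch."""
    mid = mid if mid is not None else c_out
    per_branch = c_in * mid * k + mid * c_out * k
    return 2 * per_branch

# e.g. c_in=2048 ResNet features, 21 classes, k=15:
#   plain k x k : 9,676,800 weights
#   GCN         : 1,303,470 weights
```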
193. 193. Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras  Object-class segmentation from multiple RGB-D views using deep learning.  Train a deep NN to predict object-class semantics that is consistent from several viewpoints in a semi-supervised way.  At test time, the semantic predictions from the deep NN can be fused more consistently in semantic keyframe maps than predictions of a network trained on individual views.  Start from single-view deep learning with RGB and depth fusion for semantic object-class segmentation and enhance it with multi-scale loss minimization.  Obtain the camera trajectory using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth annotated frames in order to enforce multi-view consistency during training.  At test time, predictions from multiple views are fused into key frames, enforcing multi-view consistency during training and testing.
194. 194. Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras Train the CNN to predict multi-view consistent semantic segmentations for RGB-D images. The key innovations are the multi-view consistency layers (MVCL), which warp semantic prediction or feature maps at multiple scales into a common reference view based on the SLAM trajectory. This improves performance for single-view segmentation and is specifically beneficial for multi-view fused segmentation at test time.
195. 195. Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras The network extracts features from depth images in a separate encoder whose features are fused with RGB features in a fused encoder network. The encoded features at the lowest resolution are successively refined through deconvolutions in a decoder. To guide the refinement, train the network in a deeply-supervised manner in which a segmentation loss is computed at all scales of the decoder.