Visual Object Detection, Recognition and Tracking

Keywords: computer vision, machine learning, object detection, object classification, face detection and recognition, object tracking, part-based model, patch-based features, bags of visual words, semi-supervised, online learning, deep learning, context model, fusion of multiple classifiers, manifold, segmentation as prior, branch-and-bound, imbalanced data, unknown category detection.



Visual Object Detection, Recognition and Tracking Presentation Transcript

  • 1. Visual Object Detection, Recognition and Tracking  Yu Huang, Sunnyvale, California
  • 2. Outline  Object Detection  Object Classification  State-of-the-Art Object Detection and Classification  Global/local features, part-based, bag of words, face detection/recognition…  Efficiency in Detection/Classification  Divide-and-conquer, branch-and-bound, coarse-to-fine, DP, selective search.  Evaluation Metrics  Typical Applications of Object Detection/Classification  Object Tracking  State-of-the-Art Methods of Object Tracking  Representation Scheme in Tracking  Search Mechanism in Tracking  Model Update in Tracking  Context in Tracker, Fusion of Trackers  Classical Scenarios of Object Tracking  Typical Applications of Object Tracking  Extension to Scene/Event Levels
  • 3. Object Detection  Given an image or a frame in a video, the goal of object detection is to determine whether any instances of the defined objects are present and, if so, return their locations and extents.  The object detector should know how to differentiate the specific object from everything else in the view.  Object detection is usually a binary classification problem; however, additional context from the background, such as co-occurrence of objects and geometric location priors, helps build a stronger detector.  Multiple views or 3-D (depth) information can be used if available.
  • 4. Object Classification  Given an image or a frame in a video, the goal of object classification is to identify specific objects within a certain object set.  The object classifier should tell what the difference is between object A and object B in the predefined object set.  Contextual information between objects is modeled (by learning) too;  Object detection can be solved as classification of object parts;  Segmentation can be co-trained with classification.  Objects' geometric context is useful too;  Multiple views or 3-D (depth) information can be used if available.  Object classification is complicated, and the difficulty rises as the number of objects in the set increases.
  • 5. State-of-the-Art Methods of Object Detection/Classification  1. Global/local representation;  Template matching (multi-scale) and eigen-object (subspace or manifold);  GIST, MSER, SIFT, SURF, Haar-like, Histogram of Oriented Gradients (HOG), Pyramid HOG, GLOH (Gradient Location and Orientation Histogram), Local Binary Pattern (LBP), Shape Context, etc.;  Local feature correspondence – Indexing local features for Approximate Nearest Neighbor (ANN) search: KD-tree, min-Hashing, spectral hashing, inverted index; – Fast matching for large datasets: product quantization, Hamming embedding, Fisher kernel, sparse coding, manifold learning; – Global spatial model: homography constraints.
  • 6. Visual Features
  • 7. LBP: Local Binary Pattern  LBP transforms an image into an array or image of integer labels describing the small-scale appearance of the image.  Texture is assumed to have two locally complementary aspects: a pattern and its strength.  Divide the examined window into cells (e.g. 16x16 pixels for each cell).  For each pixel in a cell, compare the pixel to each of its 8 neighbors. Follow the pixels along a circle, i.e. clockwise or counter-clockwise.  Where the center pixel's value is greater than the neighbor's value, write "1"; otherwise, write "0". This gives an 8-digit binary number (usually converted to decimal for convenience).  Compute the histogram, over the cell, of the frequency of each "number" occurring (i.e., each combination of which pixels are smaller and which are greater).  Optionally normalize the histogram.  Concatenate the (normalized) histograms of all cells. This gives the feature vector.
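The recipe above fits in a few lines of plain Python. This is an illustrative toy (the 3x3 image, the single-cell handling and the function names are invented for the sketch), not an optimized implementation:

```python
def lbp_code(img, r, c):
    """8-bit LBP code for pixel (r, c), visiting the 8 neighbors clockwise."""
    center = img[r][c]
    # clockwise circle of the 8 neighbors, starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in offsets:
        # write "1" where the center exceeds the neighbor, as on this slide
        code = (code << 1) | (1 if center > img[r + dr][c + dc] else 0)
    return code

def lbp_histogram(img):
    """Histogram of LBP codes over all interior pixels of one cell."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

img = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]
code = lbp_code(img, 1, 1)   # center 50 compared against its 8 neighbors
hist = lbp_histogram(img)
```

Note that the slide's convention (center greater than neighbor gives 1) is the complement of the convention some libraries use (neighbor greater than or equal to center gives 1); either way the histogram carries the same information.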
  • 8. HOG: Histograms of Oriented Gradients  Introduce invariance  Bias / gain / nonlinear transformations  bias: gradients / gain: local normalization  nonlinearity: clamping magnitude, orientations  Small deformations  spatial subsampling  local “bag” models  At each pixel  Gradient magnitude: m = || (Ix, Iy) ||  Gradient orientation: o = tan-1(Iy / Ix)  Quantize orientation: vote into bin (weighted)
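The per-pixel computation on this slide can be sketched for one cell in pure Python (the example image is invented; atan2 is used in place of tan-1(Iy/Ix) so that Ix = 0 is handled, and orientations are folded into the unsigned range [0, 180)):

```python
import math

def hog_cell_histogram(img, n_bins=9):
    """Unsigned-orientation HOG histogram for one cell: at each interior
    pixel, compute gradient magnitude and orientation by central
    differences, then vote the magnitude into a quantized orientation bin."""
    hist = [0.0] * n_bins
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            ix = img[r][c + 1] - img[r][c - 1]      # horizontal gradient
            iy = img[r + 1][c] - img[r - 1][c]      # vertical gradient
            mag = math.hypot(ix, iy)                # m = ||(Ix, Iy)||
            ang = math.degrees(math.atan2(iy, ix)) % 180.0
            hist[int(ang // (180.0 / n_bins)) % n_bins] += mag
    return hist

edge = [[0, 0, 10],
        [0, 0, 10],
        [0, 0, 10]]
hist = hog_cell_histogram(edge)   # a vertical edge: all magnitude votes land in bin 0
```

A full HOG descriptor would additionally normalize blocks of cells (the local gain normalization the slide mentions) and weight each vote between neighboring bins.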
  • 9. SIFT: Scale Invariant Feature Transform  Scale-space extrema detection:  Find the points whose surrounding patches (at some scale) are distinctive  An approximation to the scale-normalized Laplacian of Gaussian  Keypoint localization: eliminating edge points  Orientation assignment:  Assign an orientation to each keypoint; the descriptor is represented relative to this orientation, which achieves invariance to image rotation;  Magnitude & orientation computed on the Gaussian-smoothed images  A histogram is formed by quantizing the orientations into 36 bins;  Peaks in the histogram correspond to the orientations of the patch;  For the same scale & location, there can be multiple keypoints with different orientations (if another peak reaches 80% of the maximal peak);  Keypoint descriptor: 128-d  16x16 patch -> 4x4 subregions -> 8 bins for each subregion
  • 10. SIFT: Scale Invariant Feature Transform
  • 11. SURF: Speeded Up Robust Features  The feature vector of SURF is almost identical to that of SIFT. It creates a grid around the keypoint and divides each grid cell into sub-grids.  At each sub-grid cell, the gradient is calculated and is binned by angle into a histogram whose counts are increased by the magnitude of the gradient, all weighted by a Gaussian.  These grid histograms of gradients are concatenated into a 64-d vector.  SURF can also use a 36-vector of principal components of the 64-d vector (PCA is performed on a large set of training images) for a speedup.  SURF also improves on SIFT by using a box-filter approximation to the convolution kernel of the Gaussian derivative operator. This convolution is sped up further using integral images.
  • 12. BRIEF  BRIEF: Binary Robust Independent Elementary Features;  Binary test  BRIEF descriptor  For each S*S patch  Smooth it, then pick pixels using pre-defined binary tests  Pros:  Compact, easy-computed, highly discriminative  Fast matching using Hamming distance  Good recognition performance  Cons:  More sensitive to image distortions and transformations, in particular to in-plane rotation and scale change Page 12
  • 13. Binary Robust Invariant Scalable Keypoints  BRISK: a combination of SIFT-like scale-space keypoint detection and a BRIEF-like descriptor  Scale and rotation invariant  BRISK is a 512-bit binary descriptor that computes the weighted Gaussian average over a select pattern of points near the keypoint;  It compares the values of specific pairs of Gaussian windows, leading to either a 1 or a 0, depending on which window in the pair was greater.  The pairs to use are preselected in BRISK. This creates binary descriptors that work with Hamming distance instead of Euclidean.
  • 14. FREAK: Fast Retina Keypoint  FREAK computes a cascade of binary strings by efficiently comparing image intensities over a retinal sampling pattern;  FREAK improves upon the sampling pattern and the method of pair selection that BRISK uses.  FREAK evaluates 43 weighted Gaussians at locations around the keypoint, with the pattern formed by these Gaussians biologically inspired by the retinal pattern in the eye.  The pixels being averaged overlap and are much more concentrated near the keypoint, leading to a more accurate description of the keypoint.  The actual FREAK algorithm also uses a cascade for comparing these pairs, and puts the 64 most important bits in front to speed up the matching process.
  • 15. KD-tree  The kd-tree data structure is based on a recursive subdivision of space into disjoint hyper-rectangular regions called cells;  Each node of the tree is associated with a region, called a box, and is associated with a set of data points that lie within this box;  The root node of the tree is associated with a bounding box that contains all the data points;  Consider an arbitrary node in the tree: As long as the number of data points associated with this node is greater than a small quantity, called the bucket size, the box is split into two boxes by an axis-orthogonal hyper-plane that intersects this box.  There are a number of different splitting rules, which determine how this hyper-plane is selected (mean or median usually);  When the number of points that are associated with the current box falls below the bucket size, then the resulting node is declared a leaf node, and these points are stored with the node;  Limit the number of neighboring k-d tree bins to explore  ANN;  Reduce the boundary effects by randomization.
  • 16. KD-tree  Randomized kd-tree forest: a fast ANN search;  Split by picking the dimension with the highest variance first;  Multiple randomized trees increase the chances of finding nearby points;  Best-bin first search heuristic: priority queue;  A branch-and-bound technique for an estimate of the smallest distance from the query point to any of the data points down all of the open paths;  Priority search: visits cells in increasing order of distance from the query, and converge rapidly on the true NN (max/min heap data structure).
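The recursive subdivision and the branch-and-bound search on the two slides above can be sketched as a toy pure-Python kd-tree (median split on a cycling axis, bucket size 1; the sample points and dictionary-based node layout are invented for the sketch):

```python
import math

def build_kdtree(points, depth=0, bucket_size=1):
    """Recursive kd-tree: split on the axis cycling with depth, at the
    median point; leaves hold at most bucket_size points."""
    if len(points) <= bucket_size:
        return {"leaf": points}
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"axis": axis, "point": pts[mid],
            "left": build_kdtree(pts[:mid], depth + 1, bucket_size),
            "right": build_kdtree(pts[mid + 1:], depth + 1, bucket_size)}

def nearest(node, query, best=None):
    """Exact NN with branch-and-bound pruning on the splitting plane."""
    if node is None:
        return best
    def dist(p):
        return math.dist(p, query)
    if "leaf" in node:
        for p in node["leaf"]:
            if best is None or dist(p) < dist(best):
                best = p
        return best
    if best is None or dist(node["point"]) < dist(best):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], query, best)
    if abs(diff) < dist(best):   # the far box may still hold a closer point
        best = nearest(node[far], query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
nn = nearest(tree, (9, 2))
```

The pruning test `abs(diff) < dist(best)` is the branch-and-bound step: the far child is visited only if the splitting plane lies closer than the current best neighbor. Best-bin-first / priority search replaces this depth-first recursion with a priority queue over cells.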
  • 17. Locality Sensitive Hashing (LSH)  LSH is a randomized hashing technique using hash functions that map similar points to the same bin, with high probability;  Choose a random projection;  Project points;  Points close in the original space remain close under the projection  Use multiple quantized projections defining a high- dimensional “grid”;  Cell contents can be efficiently indexed using a hash table;  Repeat to avoid quantization errors near the cell boundaries;  Point that shares at least one cell = potential candidate;  Compute distance to all candidates;
  • 18. Min-Hash  Min-Hash can be seen as an instance of locality sensitive hashing (LSH);  The more similar two items are, the higher the chance that they share the same min-hash;  Similarity is measured by the Jaccard similarity: the number of elements two sets have in common divided by the total number of elements in both;  It is a weaker representation than a BoW since word frequency information is reduced to binary information (present or absent);  To estimate the word overlap of two items, multiple independent min-hash functions fi are used;  To efficiently retrieve items with high similarity, the values of the min-hash functions fi are grouped into s-tuples, called sketches;  The recall is increased by repeating the random selection of sketches k times; a pair of items is a potential match when at least one sketch collision is encountered;  The probability of a pair of items having at least one sketch out of k in common is a function of the word overlap.
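A toy min-hash estimator in Python (the random linear hash functions and the two word sets are invented for illustration; with enough independent hash functions, the fraction of colliding min-hashes approaches the Jaccard similarity):

```python
import random

def minhash_signature(word_set, hash_params, universe=10**9):
    """One min-hash per random linear hash h(x) = (a*x + b) mod universe."""
    return [min((a * hash(w) + b) % universe for w in word_set)
            for a, b in hash_params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of min-hash functions on which the two sets collide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
params = [(random.randrange(1, 10**9), random.randrange(10**9)) for _ in range(200)]
a = {"cat", "dog", "fish", "bird"}
b = {"cat", "dog", "fish", "mouse"}
est = estimated_jaccard(minhash_signature(a, params), minhash_signature(b, params))
true_j = len(a & b) / len(a | b)   # 3 shared / 5 total = 0.6
```

Grouping these 200 values into s-tuples (sketches) and indexing the tuples, as the slide describes, is what turns the estimator into a retrieval structure.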
  • 19. Inverted File  An inverted file index is just like an index in a book, where keywords are mapped to the numbers of the pages using them;  In the visual-word case, a table is built that points from the word number to the indices of the database images containing that word;  Retrieval via the inverted file is faster than searching every image, assuming that not all images contain every word (sparsity).  Index compression: Huffman compression  Note: the inverted list contains both the vector identifier and the encoded residual.
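The book-index analogy in code (a toy Python sketch; the image ids and visual-word ids are made up, and the compressed residuals mentioned on the slide are omitted):

```python
def build_inverted_index(image_words):
    """Map each visual-word id to the sorted list of database image ids
    containing it -- the analogue of a keyword index in a book."""
    index = {}
    for image_id, words in image_words.items():
        for w in set(words):
            index.setdefault(w, set()).add(image_id)
    return {w: sorted(ids) for w, ids in index.items()}

def candidate_images(index, query_words):
    """Images sharing at least one visual word with the query; only these
    need to be scored, so sparse vocabularies make retrieval fast."""
    hits = set()
    for w in set(query_words):
        hits.update(index.get(w, []))
    return sorted(hits)

db = {0: [3, 7, 7, 9], 1: [2, 3], 2: [5, 8]}
index = build_inverted_index(db)
cands = candidate_images(index, [7, 5])
```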
  • 20. State-of-the-Art Methods of Object Detection/Classification  2. Bag-of-words model (derived from natural language processing);  Keypoint localization (MSER, SIFT, SURF, Shape Context…);  Codebook generation: clustering or quantizing the feature space (k-means);  Sparse coding for efficient quantization.  Learning with the histogram of code words and its extensions; – Pyramid match kernel: map to multi-dimensional multi-resolution histograms; – Spatial pyramid match: partition the image into increasingly fine sub-regions; – Probabilistic Latent Semantic Analysis (pLSA): mixture decomposition from a latent model; – Latent Dirichlet Allocation (LDA): add a Dirichlet prior on the topic distribution.
  • 21. Bag of Visual Words  Pipeline (figure): 1. feature detection & representation; 2. codewords dictionary; 3. image representation; then category models (and/or) classifiers yield the category decision.
  • 22. Pyramid Match Kernel  Fast approximation of Earth Mover’s Distance;  Weighted sum of histogram intersections at multiple resolutions (linear in the number of features instead of cubic); Page 22
  • 23. Spatial Pyramid Matching  Based on the pyramid matching kernel;  Descriptor layer: detect and locate features, extract corresponding descriptors;  Code layer: code the descriptor by VQ, soft-VQ or even sparse coding;  SPM layer: pool codes across subregions and normalize into a histogram;  Classifiers use these features with nonlinear kernels.
  • 24. Vocabulary Trees  A vocabulary tree is defined using an offline unsupervised (k-means) training stage.  Hierarchical scoring based on term frequency-inverse document frequency (TF-IDF).  Number of descriptor vectors of each image with a path along node i (ni query, mi database)  Number of images in the database with at least one descriptor vector path through node i (Ni)  Defining the relevance score  Implementation of scoring  Every node is associated with an inverted file  Decrease the fraction of images in the database that have to be explicitly considered for a query  Hierarchical k-means  k-means tree of height h (levels)  Determining the path of a descriptor means performing k·h dot products. Nister & Stewenius, 2006
  • 25. Vocabulary Trees  (Figure: indexing the database and querying via the tree)
  • 26. Locally Aggregated Descriptors  VLAD: vector of locally aggregated descriptors;  Learning: a vector quantizer (k-means)  output: k centroids (visual words): c1,…,ci,…ck  centroid ci has dimension d  For a given image  assign each descriptor to the closest center ci  accumulate (sum) descriptors per cell vi := vi + (x - ci)  VLAD (dimension D = k x d): run PCA for reduction  The vector is L2-normalized;  VLAD is better than BoF for a given descriptor size  comparable to Fisher descriptors at these operating points  Choose a small D if the output dimension D' is small.
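The accumulation step vi := vi + (x - ci) can be written out directly (a pure-Python toy; the two centroids and three descriptors are assumed values, and the k-means learning and PCA reduction steps are omitted):

```python
def vlad(descriptors, centroids):
    """Assign each descriptor to its nearest centroid, accumulate the
    residual (x - ci) per cell, concatenate, then L2-normalize."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # nearest centroid by squared Euclidean distance
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        for t in range(d):
            v[i][t] += x[t] - centroids[i][t]
    flat = [val for cell in v for val in cell]      # dimension D = k x d
    norm = sum(val * val for val in flat) ** 0.5
    return [val / norm for val in flat] if norm else flat

cents = [(0.0, 0.0), (10.0, 10.0)]                  # k=2 visual words, d=2
descs = [(1.0, 0.0), (0.0, 1.0), (11.0, 10.0)]
vec = vlad(descs, cents)                            # length k*d = 4, unit L2 norm
```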
  • 27. Fisher Kernel  Given a likelihood function uλ with parameters λ, the score function of a given sample X is GX = ∇λ log uλ(X): a fixed-length vector whose dimensionality depends only on the number of parameters.  Intuition: the direction in which the parameters λ of the model should be modified to better fit the data;  Fisher information matrix (FIM) or negative Hessian:  Measure similarity between samples using the Fisher Kernel (FK):  FK can be rewritten as a dot product between Fisher Vectors (FV):  A Gaussian Mixture Model (GMM) trained on a large set of features X={xt, t=1,...,T} gives a probabilistic visual vocabulary (soft BoV);  The Fisher kernel transforms a variable-size set of independent samples into a fixed-size vector representation (average pooling).
  • 28. Hamming Embedding  Representation of a descriptor x with binary signatures  Vector-quantized to q(x) as in standard BoF  A short binary vector b(x) for additional localization within the Voronoi cell  Define HE matching: two descriptors x and y match iff q(x)=q(y) and h(b(x), b(y)) <= ht, where h(a, b) is the Hamming distance.  Nearest neighbors for Hamming distance ≈ the ones for Euclidean distance  Efficiency:  Hamming distance = very few operations  Fewer random memory accesses: faster than BoF with the same dictionary size!  Off-line (given a quantizer)  Draw an orthogonal projection matrix P of size db × d (random matrix generation)  this defines db random projection directions (projection and assignment for each learning data point)  for each Voronoi cell and projection direction, compute the median value from a learning set;  On-line: compute the binary signature b(x) of a given descriptor  project x onto the projection directions as z(x) = (z1,…,zdb)  Signature: bi(x) = 1 if zi(x) is above the learned median value, otherwise 0
  • 29. Product Quantization  Main idea: compressed representation of the database vectors;  Vector split into m sub-vectors: y --> [y1|…|ym];  Sub-vectors are quantized separately by different quantizers q(y)=[q1(y1)|…|qm(ym)], where each qi is learned by k-means with a limited number of centroids;  The key: estimate the distances in the compressed domain, such that  Quantization is fast enough;  Quantization is precise, i.e., many different possible indexes (ex: 2^64).  Note: regular k-means is not appropriate for k=2^64 centroids;  The product quantization-based approach offers  Competitive search accuracy  Compact footprint: few bytes per indexed vector
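Encoding and asymmetric distance estimation can be sketched in a few lines (a toy Python version; the 4-d vectors, m=2 split and the tiny hand-picked codebooks are assumed values, whereas real codebooks come from k-means):

```python
def pq_encode(y, codebooks):
    """Split y into m sub-vectors; quantize each with its own codebook,
    storing only the centroid indices."""
    m = len(codebooks)
    d_sub = len(y) // m
    code = []
    for i in range(m):
        sub = y[i * d_sub:(i + 1) * d_sub]
        code.append(min(range(len(codebooks[i])),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(sub, codebooks[i][j]))))
    return code

def pq_asymmetric_dist2(query, code, codebooks):
    """Squared distance between an uncompressed query and a PQ code:
    sum of per-sub-vector distances to the stored centroids."""
    m = len(codebooks)
    d_sub = len(query) // m
    total = 0.0
    for i in range(m):
        sub = query[i * d_sub:(i + 1) * d_sub]
        cent = codebooks[i][code[i]]
        total += sum((a - b) ** 2 for a, b in zip(sub, cent))
    return total

# m=2 sub-quantizers with 2 centroids each (so 2x2 = 4 possible codes)
books = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)]]
code = pq_encode([0.9, 1.1, 0.1, 0.9], books)       # -> [1, 0]
d2 = pq_asymmetric_dist2([1.0, 1.0, 0.0, 1.0], code, books)
```

In a real index the per-sub-vector distances are precomputed into lookup tables once per query, which is what makes distance estimation in the compressed domain fast.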
  • 30. State-of-the-Art Methods of Object Detection/Classification  3. Part-based methods (structure);  Constellation Model: find the mean of the appearance density, mean location & uncertainty in part location, output the probability of that part being present;  Pictorial Structures: modeled with unary templates and pair-wise springs, then joint estimation of part locations;  Implicit Shape Model: consistent configurations of the observed parts (visual words) with spatial distribution and final probabilistic voting for object segmentation;  Deformable model: learn the latent SVM model structure, filters, deformation costs.
  • 31. Constellation Model  Originally for unsupervised learning of object categories;  It represents objects by estimating a joint appearance and shape distribution of their parts;  The model is very flexible and can even be applied to objects that are only characterized by their texture;  Representation  Joint model of part locations  Ability to deal with background clutter and occlusions  Multiple mixture components for different viewpoints  Learning  Manual construction of part detectors  Estimate parameters of shape density  Now semi-unsupervised  Automatic construction and selection of part detectors  Estimation of parameters using EM •Recognition  Run part detectors over image  Try combinations of features in model  Use efficient search techniques to make fast
  • 32. Pictorial Structure  Objects are modeled by a collection of parts in a deformable configuration;  Statistical framework  The prior distribution is defined as a tree-structured Markov random field where no preference is given to the absolute location of each part;  Model parts based on the response of Gaussian derivative filters of different orders, orientations and scales;  Connections between parts are modeled by springs: Gaussian distribution. The best match is found by minimizing a function that measures both individual match costs and connection costs;  i.e., how well each part matches at its location and how well the locations agree with the deformable model;  Matching a pictorial structure does not involve making hard decisions about the location of individual parts;  Each part is solved independently, which implies that any kind of part model can be used as long as the maximum likelihood can be computed for an individual part.
  • 33. Implicit Shape Model  Representation: object shape is only defined implicitly; Not model semantically meaningful parts, instead as a collection of a large number of prototypical features for a dense cover of the object area;  Each feature has a defined appearance and a spatial probability distribution for the locations relative to the object center;  Learning:  Build a visual vocabulary from local features overlapped with training objects;  Learns a spatial occurrence distribution for each visual word (a list of all positions and scales relative to the object center, and a reference figure-ground mask for each occurrence entry, used for inferring a top-down segmentation);  Recognition: by the Generalized Hough Transform  In the test image, only a small fraction of the learned features will typically occur, which consistent configuration still provides strong evidence for an object’s presence;  Each activated visual word then casts votes for possible positions of the object center according to its learned spatial distribution;  Consistent hypotheses are searched as local maxima in the voting space.
  • 34. Implicit Shape Model (Training + Recognition)
  • 35. Deformable Model  Latent SVM Model training: discriminative model with latent variable  The learned positions of object-parts and the position of the whole object are the Latent Variables;  Training data consists of images with labeled bounding boxes;  Need to learn the model structure, filters and deformation costs;  Detection: 8x8 blocks, HOG feature at different resolution;  Root filter: rectangular templates defining weights for features  Learn root filter by standard SVM;  Part filter: Multi-scale model captures features;  Deformation model: matching with pictorial structures.
  • 36. State-of-the-Art Methods of Object Detection/Classification  4. Generative models vs. discriminative models; – Naïve Bayes classifier, MRF, PCA, mixture of Gaussians, pLSA, LDA, …; – SVM, AdaBoost, random forest, CRF, logistic regression, KNN, …; – Deep learning: convolutional NN, DBN, DBM, auto-encoder, ....
  • 37. Page 37 Neural Network (MLP)  A multilayer perceptron (MLP) is a feedforward artificial NN model that maps sets of input data onto a set of appropriate outputs;  A MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one; Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function;  MLP utilizes a supervised learning technique called back propagation for training the network;  MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable.
  • 38. Page 38 Decision Trees  Classification trees and regression trees predict responses to data.  To predict a response, follow the decisions in the tree from the root node down to a leaf node. The leaf node contains the response.  Classification trees give responses that are nominal, as 'true‘ or 'false'.  Regression trees give numeric responses. A Decision Tree consists of three types of nodes: 1. Decision nodes; 2. Chance nodes; 3. End nodes .
  • 39. Naïve Bayes Classifier  The Naive Bayes classifier is designed for features that are independent of one another within each class, but it appears to work well even when that independence assumption is not valid. It classifies data in two steps:  Training step: using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.  Prediction step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class, then classifies the test sample according to the largest posterior probability.  The class-conditional independence assumption greatly simplifies the training step, since you can estimate the one-dimensional class-conditional density for each feature individually;  While class-conditional independence between features is not true in general, research shows that this optimistic assumption works well in practice;  This assumption allows the Naive Bayes classifier to estimate the parameters for accurate classification while using less training data than many other classifiers;  This makes it particularly effective for datasets containing many predictors or features.
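The two steps above, with one-dimensional Gaussians as the per-feature class-conditional densities, can be sketched in pure Python (the 2-d toy dataset and function names are invented for the example):

```python
import math

def fit_gaussian_nb(X, y):
    """Training step: per class, estimate the prior plus a per-feature
    mean/variance, treating features as conditionally independent."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        vars_ = [max(sum((v - m) ** 2 for v in col) / len(rows), 1e-9)
                 for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(y), means, vars_)
    return model

def predict_nb(model, x):
    """Prediction step: class with the largest posterior,
    i.e. log prior + sum of per-feature Gaussian log-likelihoods."""
    def log_post(c):
        prior, means, vars_ = model[c]
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return lp
    return max(model, key=log_post)

X = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (5.0, 5.2), (4.8, 5.1), (5.2, 4.9)]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
pred = predict_nb(model, (5.0, 5.0))
```

Note how each feature's density is estimated individually, exactly the simplification the independence assumption buys.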
  • 40. Page 40 Support Vector Machines (SVM)  Separable Data  An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class.  “Margin” means the maximal width of the slab parallel to the hyperplane that has no interior data points.  The support vectors are the data points that are closest to the separating hyperplane.  Non-separable Data  Your data might not allow for a separating hyperplane. In that case, SVM can use a soft margin, meaning a hyperplane that separates many, but not all data points.  Kernel trick: Polynomials, Radial basis or Sigmoid function
  • 41. Page 41 Generative Model: MRF  Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi takes value fi in a label set L.  Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it satisfies Markov property.  Generative model for joint probability p(x)  allows no direct probabilistic interpretation  define potential functions Ψ on maximal cliques A  map joint assignment to non-negative real number  requires normalization  MRF is undirected graphical models
  • 42. Hidden Markov Model  A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;  In an HMM, the state is not visible, but the output, dependent on the state, is visible.  Each state has a probability distribution over the possible output tokens;  The sequence of tokens generated by an HMM gives some information about the sequence of states.  Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;  An HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;  Inference: the probability of an observed sequence by the Forward-Backward algorithm, and the most likely state trajectory by the Viterbi algorithm (DP);  Learning: optimize state transition and output probabilities by the Baum-Welch algorithm (a special case of EM).
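The Viterbi inference step can be made concrete with a toy HMM (the weather/activity states and all the probabilities below are illustrative values, not from the slides):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state trajectory for an observation sequence,
    by dynamic programming over (best probability, best path) per state."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            # extend the best predecessor path into state s
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

states = ("Rainy", "Sunny")
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path = viterbi(["walk", "shop", "clean"], states, start, trans, emit)
```

Replacing the max with a sum over predecessors gives the forward pass of the Forward-Backward algorithm; long sequences would use log-probabilities to avoid underflow.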
  • 43. Page 43 Discriminative Model: CRF  Conditional , not joint, probabilistic sequential models p(y|x)  Allow arbitrary, non-independent features on the observation seq X  Specify the probability of possible label seq given an observation seq  Prob. of a transition between labels depend on past/future observ.  Relax strong independence assumptions, no p(x) required  CRF is MRF plus “external” variables, where “internal” variables Y of MRF are un-observables and “external” variables X are observables  Linear chain CRF: transition score depends on current observation  Inference by DP like HMM, learning by forward-backward as HMM  Optimization for learning CRF: discriminative model  Conjugate gradient, stochastic gradient,…
  • 44. AdaBoost  Boosting: at each step, the training data are re-weighted so that incorrectly classified objects get larger weights in a new, modified training set; this effectively maximizes the margins between objects;  Classifiers are constructed on weighted versions of the training set, which depend on previous classification results;  Boosting learning originated from the Probably Approximately Correct (PAC) learning theory;  AdaBoost was the first algorithm that could adapt to the weak learners;  Variants of AdaBoost (adaptive boosting): originally Discrete AdaBoost  LogitBoost:  GentleBoost: the update is fm(x) = P(y=1|x) – P(y=0|x) instead of RealBoost's
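The re-weighting loop can be sketched with 1-d threshold stumps as the weak learners (a toy version of discrete AdaBoost; the data and the stump form are invented for the example):

```python
import math

def adaboost_stumps(xs, ys, rounds=5):
    """Discrete AdaBoost with 1-d threshold stumps: each round picks the
    stump with the lowest weighted error, then re-weights the training set
    so misclassified points get larger weight."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []                                   # (alpha, threshold, sign)
    for _ in range(rounds):
        best = None
        for thr in sorted(set(xs)):
            for sign in (1, -1):                    # stump: sign if x > thr else -sign
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if (sign if x > thr else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # clip to keep alpha finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # up-weight mistakes, down-weight correct points, then renormalize
        w = [wi * math.exp(-alpha * y * (sign if x > thr else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x > thr else -s) for a, thr, s in ensemble)
    return 1 if score > 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost_stumps(xs, ys)
```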
  • 45. Page 45 Deep Learning  Representation learning attempts to automatically learn good features or representations;  Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (features);  Become effective via unsupervised pre-training + supervised fine tuning;  Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks.  Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);  Semi-supervised: structure of manifold assumption;  labeled data is scarce and unlabeled data is abundant.
  • 46. Learning Feature Hierarchy with DL • Deep architectures can be more efficient in feature representation; • Natural derivation/abstraction from low-level structures to high-level structures; • Share the lower-level representations for multiple tasks (such as detection, recognition, segmentation).
  • 47. Page 47 Convolutional Neural Networks  CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized neural input;  Local receptive fields(shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling;  Related to generative MRF/discriminative CRF:  CNN=Field of Experts MRF=ML inference in CRF;  Generate ‘patterns of patterns’ for pattern recognition.  Each layer combines (merge, smooth) patches from previous layers  Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.  Convolution filters: (translation invariance) unsupervised;  Local contrast normalization: increase sparsity, improve optimization/invariance. C layers convolutions, S layers pool/sample
  • 48. Belief Nets  A belief net is a directed acyclic graph composed of stochastic variables.  We can observe some of the variables, and we want to solve two problems:  inference: infer the states of the unobserved variables;  learning: adjust the interactions between variables to make the network more likely to generate the observed data.  Use nets composed of layers of stochastic variables with weighted connections. (Figure: stochastic hidden causes with visible effects.)
  • 49. Boltzmann Machines  Energy-based models associate an energy to each configuration of the stochastic variables of interest (for example, MRF, Nearest Neighbor);  Learning means adjusting the shape properties of the low-energy function;  A Boltzmann machine is a stochastic recurrent model with hidden variables;  Markov chain Monte Carlo (MCMC) sampling for gradient estimation;  The restricted Boltzmann machine is a special case:  Only one layer of hidden units;  factorization of each layer's neurons/units (no connections within the same layer);  Contrastive divergence: approximation of the gradient in RBMs. (Equations on slide: probability, energy function, learning rule.)
  • 50. Deep Belief Networks  A hybrid model: can be trained as generative or discriminative model;  Deep architecture: multiple layers (learn features layer by layer);  Multi layer learning is difficult in sigmoid belief networks.  Top two layers are undirected connections, RBM;  Lower layers get top down directed connections from layers above;  Unsupervised or self-taught pre-learning provides a good initialization;  Greedy layer-wise training for RBM  Supervised fine-tuning  Generative: Up-down wake-sleep algorithm  Discriminative: bottom-up back propagation
  • 51. Deep Boltzmann Machine  Learning internal representations that become increasingly complex;  High-level representations built from a large supply of unlabeled inputs;  Pre-training consists of learning a stack of modified RBMs, which are then composed to create a deep Boltzmann machine (undirected graph);  Generative fine-tuning: different from DBN, two phases  Positive phase: observed, sample hidden, using a variational approximation (mean-field);  Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC).  Discriminative fine-tuning: the same as DBN  Back propagation.
  • 52. Denoising Auto-Encoder  Multilayer NNs with target output=input;  Reconstruction=decoder(encoder(input));  Perturbs the input x to a corrupted version;  Randomly sets some of the coordinates of input to zeros.  Recover x from encoded perturbed data.  Learns a vector field towards higher probability regions;  Pre-trained with DBN or regularizer with perturbed training data;  Minimizes variational lower bound on a generative model;  Corresponds to regularized score matching on an RBM;  PCA=linear manifold=linear Auto Encoder;  Auto-encoder learns the salient variation like a nonlinear PCA.
  • 53. Stacked Denoising Auto-Encoder  Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise learning  Drop the decode layer each time  Supervised training on the last layer using final features  Then supervised training on the entire network to fine-tune all weights  Performs better than stacking RBMs for unsupervised pre-training.  Empirically not quite as accurate as DBNs.
  • 54. MCMC Sampling for Optimization  Markov chain: a stochastic process in which future states are independent of past states given the present state.  A Markov chain will typically converge to a stationary distribution.  Markov chain Monte Carlo: sampling using ‘local’ information  Devise a Markov chain whose stationary distribution is the target.  An ergodic MC must be aperiodic, irreducible, and positive recurrent.  Monte Carlo integration to get quantities of interest.  Metropolis-Hastings method: sampling from a target distribution  Create a Markov chain whose transition rule does not depend on the normalization term.  Make sure the chain has a stationary distribution and that it equals the target distribution (acceptance ratio).  After a sufficient number of iterations, the chain will converge to the stationary distribution.  Gibbs sampling is a special case of M-H sampling.  The Hammersley-Clifford theorem: get the joint distribution from the complete conditional distributions.  Hybrid Monte Carlo: a gradient sub-step for each Markov chain step.
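As an illustrative aside (not part of the original slides), the Metropolis-Hastings recipe above can be sketched in a few lines of Python. Note that the acceptance ratio never touches the normalization term, only the unnormalized target density:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: the Gaussian proposal is symmetric, so the
    acceptance ratio needs only the (unnormalized) target density."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)               # propose a local move
        log_alpha = log_target(x_new) - log_target(x)  # log acceptance ratio
        if rng.random() < math.exp(min(0.0, log_alpha)):  # accept w.p. min(1, ratio)
            x = x_new
        samples.append(x)                              # keep current state either way
    return samples

# Unnormalized log-density of a standard normal; the normalization constant
# cancels in the ratio, exactly as the slide notes.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=5.0, n_samples=20000)
burned = chain[2000:]          # discard burn-in before the chain reaches stationarity
mean = sum(burned) / len(burned)
```

With enough iterations the chain's sample mean and variance approach those of the target (0 and 1 here), despite starting far away at x0 = 5.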
  • 55. Mean Field for Optimization  Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution;  Mean field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (Note: F is a disconnected graph);  The density becomes a factorized product distribution in this sub-family.  Objective: K-L divergence.  Mean field is a structured variational approximation approach:  Coordinate ascent (deterministic);  Compared with stochastic approximation (sampling):  Faster, but maybe not exact.
  • 56. Page 56 Contrastive Divergence  Contrastive divergence (CD) is a quicker way to learn RBMs;  Contrastive divergence as the new objective;  Taking gradients and ignoring a term which is usually very small.  Steps:  Start with a training vector on the visible units.  Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.  Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling);  CD learning is biased: it does not follow the true likelihood gradient;  Improved: persistent CD explores more modes in the distribution  Rather than starting from data samples, begin sampling from the model samples obtained at the last gradient update.  Still suffers from divergence of the likelihood due to missing modes.  Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
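The CD steps above can be sketched for a toy binary RBM (illustrative only; biases and mini-batches are omitted, and the 3x2 layer sizes are arbitrary choices, not from the slides):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cd1_step(W, v0, rng, lr=0.1):
    """One CD-1 update for a tiny binary RBM (biases omitted for brevity).
    W[i][j] connects visible unit i to hidden unit j."""
    n_v, n_h = len(W), len(W[0])
    # Positive phase: hidden probabilities given the training vector.
    h0 = [sigmoid(sum(v0[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    h0_s = [1.0 if rng.random() < p else 0.0 for p in h0]   # sample hidden states
    # Negative phase: one alternating Gibbs step (hidden -> visible -> hidden).
    v1 = [sigmoid(sum(h0_s[j] * W[i][j] for j in range(n_h))) for i in range(n_v)]
    h1 = [sigmoid(sum(v1[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    # Approximate gradient: <v h>_data - <v h>_reconstruction.
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])

rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(3)]  # 3 visible, 2 hidden
for _ in range(200):
    cd1_step(W, [1.0, 0.0, 1.0], rng)   # repeatedly present one toy pattern
```

After a few hundred updates on the single pattern [1, 0, 1], the deterministic reconstruction assigns higher probability to visible units 0 and 2 than to unit 1.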
  • 57. Page 57 “Wake-Sleep” Algorithm for DBN  The pre-trained DBN is a generative model;  Do a stochastic bottom-up pass (wake phase)  Get samples from the factorial distribution (visible first, then generate hidden);  Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.  Do a few iterations of sampling in the top-level RBM  Adjust the weights in the top-level RBM.  Do a stochastic top-down pass (sleep phase)  Get visible and hidden samples generated by the generative model using data coming from nowhere!  Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.  Any guarantee for improvement? No!  The “wake-sleep” algorithm tries to make the representation economical (Shannon’s coding theory).
  • 58. Page 58 Greedy Layer-Wise Training  Deep networks tend to have more local-minima problems than shallow networks during supervised training  Train the first layer using unlabeled data  Supervised or semi-supervised: use more unlabeled data.  Freeze the first-layer parameters and train the second layer  Repeat this for as many layers as desired  Build more robust features  Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)  Fine-tune the full network with a supervised approach;  Avoids the problems of training a deep net purely in a supervised fashion:  Each layer gets full learning  Helps with ineffective early-layer learning  Helps with deep-network local minima
  • 59. Sparse Coding  Sparse coding (Olshausen & Field, 1996).  Originally developed to explain early visual processing in the brain (edge detection).  Objective: Given a set of input data vectors learn a dictionary of bases such that:  Each data vector is represented as a sparse linear combination of bases. Sparse: mostly zeros
  • 60. Methods of Solving Sparse Coding  Greedy methods: projecting the residual onto some atom;  Matching pursuit, orthogonal matching pursuit;  L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);  The residual is updated iteratively in the direction of the atom;  Gradient-based methods find new search directions  Projected gradient descent  Coordinate descent  Homotopy: a set of solutions indexed by a parameter (regularization)  LARS (Least Angle Regression)  First-order/proximal methods: generalized gradient descent  Solve the proximal operator efficiently  Soft-thresholding for the L1 norm  Accelerated by the Nesterov optimal first-order method  Iterative reweighting schemes  L2-norm: Chartrand and Yin (2008)  L1-norm: Candès et al. (2008)
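A minimal sketch of the first-order/proximal route mentioned above: ISTA alternates a gradient step on the quadratic term with the soft-thresholding proximal operator of the L1 norm (toy pure-Python version, not an efficient solver):

```python
def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(A, y, lam, n_iter=200, step=None):
    """ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1 (A given as rows).
    Each iteration: gradient step on the quadratic term, then soft-thresholding."""
    m, n = len(A), len(A[0])
    if step is None:
        # Crude safe step size: 1 / (an upper bound on the top eigenvalue of A^T A).
        step = 1.0 / sum(A[i][j] ** 2 for i in range(m) for j in range(n))
    x = [0.0] * n
    for _ in range(n_iter):
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]  # residual Ax - y
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]         # gradient A^T r
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x

# With A = I, the LASSO solution is exactly soft_threshold(y, lam).
x_hat = ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=1.0)
```

Swapping the plain gradient step for Nesterov's accelerated variant turns this into FISTA, the accelerated scheme the slide refers to.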
  • 61. Page 61 State-of-Art Methods of Object Detection/Classification  5. Deep learning for visual recognition: • Hand-crafted features: • Needs expert knowledge • Requires time-consuming hand-tuning • (Arguably) one limiting factor of computer vision systems • Key idea of feature learning: • Learn statistical structure or correlation from unlabeled data • The learned representations used as features in supervised and semi-supervised settings • Hierarchical feature learning • Deep architectures can be representationally efficient. • Natural progression from the low level to the high level structures. • Can share the lower-level representations for multiple tasks in computer vision.
  • 62. Hierarchical Feature Learning
  • 63. Feature Learning Architectures  Pixels/features → filtering with a dictionary (patch/tiled/convolutional) + non-linearity → normalization between feature responses (local contrast normalization, subtractive & divisive; (group) sparsity; max/softmax) → spatial/feature pooling (sum or max).  Not an exact separation.
  • 64. ‘Filtering’  Patch  Image treated as a set of patches, each filtered against a dictionary of filters.
  • 65. ‘Filtering’  Convolutional  Translation equivariance  Tied filter weights (same at each position, hence few parameters).
  • 66. ‘Filtering’  Tiled  Filters repeat every n positions  More filters than convolution for a given # of features.
  • 67. ‘Normalization’  Contrast normalization (across feature maps)  Local mean = 0, local std. = 1; “local” = 7x7 Gaussian window  Equalizes the feature maps.
  • 68. ‘Normalization’  Sparsity  Constrain the L0 or L1 norm of the features  Iterate with the filtering operation (ISTA sparse coding).
  • 69. ‘Normalization’  Induces local competition between features to explain the input  “Explaining away” in graphical models  Just like top-down models  But a more local mechanism  Filtering alone cannot do this!  Example: convolutional sparse coding, from Zeiler et al. [CVPR’10/ICCV’11].
  • 70. ‘Pooling’  Spatial pooling  Non-overlapping / overlapping regions  Sum or max  Boureau et al. ICML’10 for theoretical analysis.
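The non-overlapping sum/max pooling described above can be sketched as (illustrative helper, not from the slides):

```python
def pool2d(fmap, size, mode="max"):
    """Non-overlapping spatial pooling over size x size regions
    of a 2-D feature map (list of lists); mode is 'max' or 'sum'."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(0, h - size + 1, size):
        row = []
        for c in range(0, w - size + 1, size):
            region = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(region) if mode == "max" else sum(region))
        out.append(row)
    return out

fmap = [[1, 2, 5, 6],
        [3, 4, 7, 8],
        [0, 1, 1, 0],
        [2, 3, 0, 1]]
pooled = pool2d(fmap, 2, "max")   # 4x4 map -> 2x2 map
```

Max pooling keeps the strongest response per region, giving the small-shift invariance the next slide discusses; sum (or average) pooling keeps the aggregate response instead.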
  • 71. ‘Pooling’  Spatial pooling  Invariance to small transformations  Larger receptive fields (see more of input) Visualization technique from [Le et al. NIPS’10]: Zeiler, Taylor, Fergus [ICCV 2011]
  • 72. ‘Pooling’ Chen, Zhu, Lin, Yuille, Zhang [NIPS 2007] • Pooling across feature groups • Additional form of inter-feature competition • Gives AND/OR type behavior via (sum / max) • Compositional models of Zhu, Yuille [Zeiler et al., ‘11]
  • 73. Sparse Coding for Visual Recognition • Descriptor layer: detect and locate features, extract corresponding descriptors (e.g. SIFT); • Code layer: code the descriptors • Vector quantization (VQ): each code has only one non-zero element; • Soft-VQ: a small group of elements can be non-zero; • SPM layer: pool codes across subregions and average/normalize into a histogram. [Lazebnik et al., CVPR 2006; Yang et al., CVPR 2009]
  • 74. Sparse Coding for Visual Recognition • Classifiers using these features need nonlinear kernels • Lazebnik et al., CVPR 2006; Grauman, Darrell, JMLR 2007; • High computational complexity • Idea: modify the coding step to produce feature representations that linear classifiers can use effectively • Sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010] • Local coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010] • RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011] • Other feature learning algorithms Improving the coding step
  • 75. Page 75 State-of-Art Methods of Object Detection/Classification  6. Open Questions:  Viewpoint variations;  Scale and appearance variation (illumination and shape distortion);  Cluttered background, partial occlusion, lack of contextual information;  Multiple poses (out of the plane or in the plane) or articulation;  Intra-class (category) variations.
  • 76. Page 76 State-of-Art Methods of Object Detection/Classification  7. Efficiency in Detection/Classification:  Efficient features:  Vector quantization;  Integral channel features;  Divide-and-conquer:  Cascaded classifiers.  Coarse-to-fine:  Feature scaling;  Branch-and-bound;  Efficient subwindow search.  Generic object proposal:  Selective search;  BING for objectness;  Dynamic programming:  Avoid repeated computation.  Parallelize (hardware):  GPU;  Multi-core;  Cloud?
  • 77. Integral Channel Features (ICF)  Multiple registered image channels are computed using linear/non-linear image transformations, called integral channel features;  Features such as local sums, histograms, Haar features and their various generalizations are computed using integral images;  ICF naturally integrate heterogeneous sources of information, have few parameters, and result in fast, accurate detectors. Page 77
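The integral-image trick that makes these local-sum features cheap: after one pass to build a summed-area table, any rectangular (box) sum costs four lookups. A minimal sketch, with a hypothetical `box_sum` helper:

```python
def integral_image(img):
    """Summed-area table: ii[r][c] = sum of img[0..r-1][0..c-1], with a
    zero row/column of padding so box sums need no bounds checks."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum over rows r0..r1-1 and cols c0..c1-1 in O(1): four lookups."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]

img = [[1, 2],
       [3, 4]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 2, 2)   # -> 10, the sum of the whole image
```

Haar-like features then become differences of a handful of such box sums, regardless of rectangle size, which is what makes ICF (and the Viola-Jones cascade later in the deck) fast.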
  • 78. Integral Channel Features (ICF)  Very efficient for human/pedestrian detection;  6 quantized orientations, 1 gradient magnitude, 3 LUV color channels.  Multi-scale without image scaling.  Boosted classifiers: soft cascades  Two level decision trees. Page 78
  • 79. Segmentation as Selective Search  Challenges  Objects extremely diverse  Various shapes, sizes and appearances;  Within object variation  Multiple materials and textures with strong interior boundaries;  Many objects in an image  Selective search by hierarchical grouping: encouraging diversity. Page 79
  • 80. BING: Binarized Normed Gradients  What is an object?  Standalone, unique, with an appearance different from its surroundings and a well-closed boundary;  The objectness metric measures how likely a window covers an object of any category;  Reduces the search space;  Allows strong classifiers.  Normed gradients (NG) + linear SVMs  BING feature: illustration  Use a single atomic variable (INT64 & BYTE) to represent a BING feature and its last row. Page 80. Proc. of CVPR 2014!
  • 81. Branch-and-Bound for Sub-window Search  It is a general algorithm for finding optimal solutions, especially in discrete and combinatorial optimization;  A branch-and-bound algorithm consists of a systematic enumeration of all candidate solutions, where large subsets of fruitless candidates are discarded en masse by using upper and lower estimated bounds of the quantity being optimized;  Splitting: given a set of candidates, S1 ∪ S2 ∪ … = S; “branching” yields a search tree whose nodes are S1, S2, …  Bounding: upper/lower bounds for the objective function within Si;  The idea: prune by maintaining a global variable recording the minimum upper bound, so any node whose lower bound is greater than it can be discarded;  Used for efficient sliding sub-window search. Page 81
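As a hedged illustration of this idea, here is a 1-D analogue of efficient subwindow search (a simplified sketch, not the exact 2-D algorithm): candidate intervals [i, j] are kept as endpoint ranges, the upper bound adds every positive value the largest interval could contain plus every negative value the smallest interval must contain, and a priority queue always expands the most promising set:

```python
import heapq

def bb_best_interval(x):
    """Branch-and-bound over all intervals [i, j] of a non-empty list x,
    maximizing sum(x[i:j+1]). A state is a set of intervals: i in [il, ih],
    j in [jl, jh]; its bound is exact once both ranges are singletons."""
    n = len(x)
    pos, neg = [0.0], [0.0]
    for v in x:                          # prefix sums of positive/negative parts
        pos.append(pos[-1] + max(v, 0.0))
        neg.append(neg[-1] + min(v, 0.0))

    def ubound(il, ih, jl, jh):
        ub = pos[jh + 1] - pos[il]       # positives in the largest interval [il, jh]
        if ih <= jl:                     # negatives in the smallest interval [ih, jl]
            ub += neg[jl + 1] - neg[ih]
        return ub

    heap = [(-ubound(0, n - 1, 0, n - 1), 0, n - 1, 0, n - 1)]
    while heap:
        nub, il, ih, jl, jh = heapq.heappop(heap)
        if il == ih and jl == jh:        # singleton state: bound equals true score
            return -nub, (il, jl)
        # Branch: split the wider endpoint range in half.
        if ih - il >= jh - jl:
            mid = (il + ih) // 2
            children = [(il, mid, jl, jh), (mid + 1, ih, jl, jh)]
        else:
            mid = (jl + jh) // 2
            children = [(il, ih, jl, mid), (il, ih, mid + 1, jh)]
        for cil, cih, cjl, cjh in children:
            if cil <= cjh:               # some valid interval with i <= j remains
                heapq.heappush(heap, (-ubound(cil, cih, cjl, cjh),
                                      cil, cih, cjl, cjh))

score, (i, j) = bb_best_interval([1, -2, 3, 4, -1, 2, -5])
```

Because the first singleton popped has the largest bound on the heap, its exact score dominates every remaining candidate, so fruitless endpoint ranges are discarded en masse without ever enumerating their intervals.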
  • 82. Page 82 State-of-Art Methods of Object Detection/Classification  8. “Open Set” problem: how to handle unknown or unfamiliar classes;  Label as one of known classes or as unknown;  Zero shot learning/unseen class detection;  Novelty detection with null space methods;  One class SVM;  Multiple classes:  Artificial super class from all given classes;  Combine several one class classifiers learned separately;  K-nearest neighbors;
  • 83. State-of-Art Methods of Object Detection/Classification  9. “Data unbalancing” problem:  Resampling methods for balancing the data set.  Over-sampling, under-sampling, importance sampling;  Modification of existing learning algorithms.  Cost-sensitive learning;  One class classification;  Classifier ensemble (bagging, boosting, random forest…)  Measuring the classifier performance in imbalanced domains.  ROC, F-measure,…  Relationship between class imbalance and other data complexity characteristics.
  • 84. State-of-Art Methods of Object Detection/Classification  10. Face detection/recognition is a special area: large variations.  Face detection: Viola-Jones’s cascaded Adaboost;  Face verification/authentication: validate a claimed identity based on the image, and either accept or reject the identity claim (one-to-one matching);  Face identification: identify a person based on the image, compared with all the registered persons (one-to-many matching). Page 84
  • 85. Face Detection and Pose Alignment  Paul Viola’s method is real time:  Integral image;  Cascaded Adaboost;  View-based detectors;  Pose alignment + pose specific detectors;  Yang, Kriegman & Ahuja’s survey paper in 2002;  Facial feature (2d-landmark) detection for alignment: holistic or local  ASM, AAM: generative;  Project-out inverse compositional algorithm for model fitting;  Elastic graph matching;  Local feature detectors: discriminative  Constrained local model (CLM): global shape constraints;  Decouple shape from appearance.  Congealing: reduce the entropy by transform Page 85
  • 86. 2-D Face Verification and Identification  How to represent the face?  Features are hand-crafted or learned automatically;  Hierarchically? (manual first, then learn from it…)  Dimensionality reduction? (high dimensional is good)  Matching metric learning: weight optimization  Siamese network (two identical convolutional network sharing weights)  DeepFace: a paper from Taigman, Yang, Ranzato and Wolf at CVPR 2014  Deep learning with multiple layers ( including convolutional networks) Page 86
  • 87. 3D Model-based 2D Face Alignment Page 87
  • 88. 3D Model-based 2D Face Recognition Page 88
  • 89. 3-D Face Analysis and Recognition  Preprocessing: surface smoothing, noise removal and hole filling;  3-d facial landmark detection and face registration;  Curvature-based (spin image is too costly): landmark extraction;  3D statistical facial feature model (SFAM): both global and local;  Procrustes Analysis and Iterated Closest Point (ICP) for registration;  Depth image: RGB-D;  Deformable face model.  Feature extraction/learning;  Curvature-based;  Part-based feature;  Learning again?  Feature matching.  Region-based ICP;  Shape and texture. Page 89
  • 90. Page 90 Average Face Model Average Regional Model
  • 91. State-of-Art Methods of Object Detection/Classification  11. Text detection and recognition: text is an object too.  Text detection: connected-component-based or texture (template)-based;  Maximally Stable Extremal Regions (MSERs) first;  The HOG feature works;  Text recognition: OCR (optical character recognition), license plate recognition  Pictorial structure model;  A lexicon helps. “Video text detection and recognition: dataset and benchmark”, by P. Nguyen, K. Wang, S. Belongie, ICCV 2013.
  • 92. Multi-digit Number Recognition from Street Views using Deep CNNs  Unified one of localization, segmentation, and recognition steps via the use of a deep CNN that operates directly on the image pixels;  Train a probabilistic model of sequences given images;  Extract a set of features H from the image X using a CNN with a fully connected final layer;  Six separate softmax classifiers are then connected to this feature vector H, i.e., each softmax classifier forms a response by making an affine transformation of H and normalizing this response with the softmax function. Page 92
  • 93. Dataset for Object Classification  Caltech 101 and 256;  LabelMe;  ImageNet;  VOC: Pascal;  …… Page 93
  • 94. Dataset for Object Classification  Face Recognition:  NIST FERET;  Yale Face;  CMU PIE Face;  …... Page 94
  • 95. Evaluation Metric  Detection/classification rate  True positives, false positives, true negatives and false negatives (reflected in the confusion or contingency matrix);  Accuracy, precision, recall;  ROC, i.e. receiver operating characteristic;  Plotting the fraction of true positives (TPR = true positive rate) vs. the fraction of false positives (FPR = false positive rate) at various threshold settings;  Complexity/time;  Real-time? Page 95
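These quantities follow directly from the four confusion-matrix counts; a small sketch (the helper name and the example counts are made up for illustration):

```python
def classification_metrics(tp, fp, tn, fn):
    """Basic detection/classification metrics from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0   # = TPR, one ROC axis
    fpr       = fp / (fp + tn) if fp + tn else 0.0   # the other ROC axis
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fpr": fpr, "f1": f1}

# A detector that finds 80 of 100 true objects with 20 false alarms
# over 1000 windows total.
m = classification_metrics(tp=80, fp=20, tn=880, fn=20)
```

Sweeping the detector's threshold and re-computing (recall, fpr) at each setting traces out the ROC curve described above.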
  • 96. Page 96 Typical Applications of Object Detection/Classification  Human computer interaction;  Faces for conversations and teleconferences.  Intelligent Driver Assistance;  Traffic lighting and signs, cars, bicyclist & pedestrian classification etc.  Content-based Enhancement;  Automatic camera focusing, color balancing and zooming on the specific object;  Object-based compression (ROI).  Video retrieval, filtering, annotation;  Automatically labeling and categorization.  Surveillance;  Target acquisition and tracking in the visual surveillance system (Initializing the tracker with detection/classification).
  • 97. Object Tracking Page 97  Definition: object tracking is generally posed as a recursive estimation problem, i.e., estimate the current location (and size) given the estimate at the previous time instant as well as the current observation (or measurement).  Object tracking is to find an object based on a short-time (applying a Markov chain in modeling) and narrow-viewed observer (a dynamic model for searching);  However, object tracking can collaborate with different types of object detectors/classifiers to enhance performance.  Object tracking can be performed on different feature levels: points, regions, contours or blobs;  Tracked objects can be of various types: articulated, deformable, fluid, multiple objects and so on.
  • 98. Page 98 State-of-Art Methods of Object Tracking Y Wu, J Lim, M-H Yang, “Online Object Tracking: A Benchmark,” CVPR 2013. model update L: local, H: holistic, T: template, IH: intensity histogram, BP: binary pattern, SR: sparse representation, DM: discriminative model, GM: generative model. PF: particle filter, MCMC: Markov Chain Monte Carlo, LOS: local optimum search, DS: dense sampling search.
  • 99. Representation Scheme in Tracking  Holistic templates;  Subspace-based;  Sparse representation;  Feature-based:  Color histograms, HOG, covariance region descriptor, Haar-like features, LBP etc.;  Discriminative model with a binary classifier:  SVM, structured output SVM, ranking SVM, boosting, semi-boosting and multi-instance boosting;  Parts-based;  Bags of features (patch-based);  3-d information (depth) or multiple cameras-based? Page 99
  • 100. PCA, AP & Spectral Clustering  Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.  This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components.  PCA is sensitive to the relative scaling of the original variables.  Closely related to the Karhunen–Loève transform (KLT), Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc.;  Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm;  Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions.  The similarity matrix consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
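A minimal sketch of PCA's defining property: the first principal component is the top eigenvector of the data covariance matrix, found here by plain power iteration (illustrative, no linear-algebra library; the toy data is made up):

```python
def top_principal_component(X, n_iter=200):
    """First PCA direction via power iteration on the d x d covariance matrix."""
    n, d = len(X), len(X[0])
    mean = [sum(row[k] for row in X) / n for k in range(d)]
    Xc = [[row[k] - mean[k] for k in range(d)] for row in X]   # center the data
    C = [[sum(r[a] * r[b] for r in Xc) / n for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]          # repeated C*v converges to top eigenvector
    return v

# Points spread along the (1, 1) diagonal: the top component recovers it.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
v = top_principal_component(X)
```

The returned unit vector is the direction of largest variance; subsequent components would be found the same way after deflating (subtracting) this one, which is why PCA is equivalent to an eigen/singular value decomposition of the covariance.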
  • 101. Blind Source Separation & ICA  Independent component analysis (ICA) separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and all statistically independent from each other.  ICA is a special case of blind source separation.  Assumptions: the source signals are independent of each other; the distribution of values in each source signal is non-Gaussian.  Three effects of mixing signals:  Independence: the signal mixtures may not be independent;  Normality: mixtures are closer to Gaussian than any of the original variables;  Complexity: the complexity of a mixture is greater than that of its constituent source signals.  Preprocessing: centering, whitening and dimension reduction;  ICA finds the independent components (latent variables) by maximizing the statistical independence of the estimated components;  Definitions of independence for ICA:  Minimization of mutual information (KL divergence or entropy);  Maximization of non-Gaussianity (kurtosis and negative entropy).
  • 102. NMF & pLSA  Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, that all three matrices have no negative elements.  The different types arise from using different cost functions for measuring the divergence between V and W*H and possibly by regularization of the W and/or H matrices;  squared error, Kullback-Leibler divergence or total variation (TV);  NMF is an instance of a more general probabilistic model called "multinomial PCA“, as pLSA (probabilistic latent semantic analysis);  pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions;  Their parameters are learned using EM algorithm;  pLSA is based on a mixture decomposition derived from a latent class model, not as downsizing the occurrence tables by SVD in LSA.  Note: an extended model, LDA (Latent Dirichlet allocation) , adds a Dirichlet prior on the per-document topic distribution.
  • 103. ISOMAP  General idea:  Approximate the geodesic distances by shortest graph distances.  MDS (multi-dimensional scaling) using geodesic distances  Algorithm:  Construct a neighborhood graph  Construct a distance matrix  Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new distance matrix such that Dij is the length of the shortest path between i and j.  Apply MDS to the matrix to find low-dimensional coordinates
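The shortest-path step of the recipe above can be sketched with Floyd-Warshall; a toy example on a 3-node chain graph (illustrative only, the MDS step is omitted):

```python
INF = float("inf")

def floyd_warshall(W):
    """All-pairs shortest paths on a neighborhood graph; ISOMAP uses these
    graph distances as its approximation of geodesic distances before MDS."""
    n = len(W)
    D = [row[:] for row in W]
    for k in range(n):                 # allow node k as an intermediate hop
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# Chain graph 0 - 1 - 2: the "geodesic" from 0 to 2 must pass through 1.
W = [[0, 1, INF],
     [1, 0, 1],
     [INF, 1, 0]]
D = floyd_warshall(W)
```

On a densely sampled manifold, these hop-by-hop distances follow the surface rather than cutting through the ambient space, which is exactly the approximation ISOMAP relies on.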
  • 104. LLE (Locally Linear Embedding)  General idea: represent each point on the local linear subspace of the manifold as a linear combination of its neighbors to characterize the local neighborhood relations; then use the same linear coefficients for embedding, to preserve the neighborhood relations in the low-dimensional space;  Compute the coefficients w for each data point by solving a constrained least-squares problem;  Algorithm:  1. Find the weight matrix W of linear coefficients  2. Find the low-dimensional embedding Y that minimizes the reconstruction error Σᵢ ‖Yᵢ − Σⱼ Wᵢⱼ Yⱼ‖²  3. Solution: eigen-decomposition of M = (I − W)ᵀ(I − W)
  • 105. Local Tangent Space Alignment (LTSA)  Every smooth manifold can be constructed locally by its tangent plane;  Stages: 1) a local parameterization is established for each data point; 2) then a global alignment is computed.  Taylor series expansion of the embedding function f(•) in the local neighborhood;  We are given samples from the embedded manifold with noise; therefore, for an arbitrary point xi and its local neighbors, and in the absence of noise (εi = 0), we can write a first-order (tangent-plane) approximation.  Solve the alignment problem, where si is the i-th membership vector;  The optimal alignment Li is obtained using least squares;  Substitute Li into the objective, where S=[s1,…,sn] and W=diag(W1,…,Wn);  Solve using an EVD.
  • 106. Local Tangent Space Alignment (LTSA)
  • 107. Laplacian Eigenmaps  General idea: minimize the norm of the Laplace-Beltrami operator on the manifold, which measures how far apart f maps nearby points.  Avoid the trivial solution of f = const.  The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph with appropriate weights.  Construct the Laplacian matrix L = D − W.  The operator can be approximated by its discrete equivalent.  Algorithm:  Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors).  Construct an adjacency matrix W with appropriate (e.g., heat-kernel) weights.  Minimize yᵀLy subject to the scale constraint yᵀDy = 1.  The generalized eigen-decomposition of the graph Laplacian is Lf = λDf.  Spectral embedding of the Laplacian manifold:  • The first eigenvector is trivial (the all-ones vector).
  • 108. Search Mechanism in Tracking  Tracking is posed within an optimization framework, gradient descent methods used to locate the target;  Iterative Image Registration;  Mean shift;  Mean field;  Distribution Fields.  Dense sampling methods at the expense of high computational load;  Online boosting;  Online Multiple Instance Learning;  Struck: Structured Output Tracking with Kernels (SVM).  Stochastic search algorithms insensitive to local minima and computationally efficient :  Particle filters;  Incremental learning within Particle filter;  Sparse coding within particle filter;  Mean shift within particle filter. Page 108
  • 109. Particle Filter  Monte Carlo characterization of a pdf:  Represent the posterior density by a set of random i.i.d. samples (particles) from the pdf p(x0:t|z1:t)  For a larger number N of particles, equivalent to a functional description of the pdf  As N → ∞, approaches the optimal Bayesian estimate  Regions of high density  Many particles  Large weight of particles  Uneven partitioning  Discrete approximation of a continuous pdf:  P_N(x_{0:t} | z_{1:t}) = Σ_{i=1}^{N} w_t^{(i)} δ(x_{0:t} − x_{0:t}^{(i)})  Draw N samples x_{0:t}^{(i)} from the importance sampling distribution π(x_{0:t} | z_{1:t})  Importance weight and its update:  w(x_{0:t}) = p(x_{0:t} | z_{1:t}) / π(x_{0:t} | z_{1:t}),  w_t^{(i)} ∝ w_{t−1}^{(i)} · p(z_t | x_t^{(i)}) p(x_t^{(i)} | x_{t−1}^{(i)}) / π(x_t^{(i)} | x_{t−1}^{(i)}, z_t)
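A bootstrap particle filter as an illustrative sketch: when the dynamic model itself is used as the importance distribution, the importance-weight update reduces to the observation likelihood (toy 1-D random-walk model; the noise levels and observations below are made-up values):

```python
import math
import random

def particle_filter_1d(observations, n_particles=500, proc_std=0.5,
                       obs_std=1.0, seed=0):
    """Bootstrap particle filter for a 1-D random-walk state observed
    in Gaussian noise."""
    rng = random.Random(seed)
    parts = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]   # diffuse prior
    estimates = []
    for z in observations:
        # Predict: propagate every particle through the dynamic model.
        parts = [p + rng.gauss(0.0, proc_std) for p in parts]
        # Update: weight by the (unnormalized) Gaussian observation likelihood.
        ws = [math.exp(-0.5 * ((z - p) / obs_std) ** 2) for p in parts]
        total = sum(ws)
        ws = [w / total for w in ws]
        estimates.append(sum(w * p for w, p in zip(ws, parts)))  # posterior mean
        # Resample: multinomial resampling concentrates particles in
        # high-density regions (many particles <-> large weight).
        parts = rng.choices(parts, weights=ws, k=n_particles)
    return estimates

obs = [3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 3.0, 2.9]   # noisy looks at a target near 3
est = particle_filter_1d(obs)
```

In a tracker, the state would be the target's location (and size), the likelihood would come from the appearance model, and the same predict/weight/resample loop would run per frame.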
  • 110. Mean Shift  A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N;  Non-parametric density estimation (Parzen window):  P(x) = (1/n) Σ_{i=1}^{n} K(x − x_i), with kernel K(x − x_i) = c · k(‖(x − x_i)/h‖²)  MS is kernel density gradient estimation: with g = −k′, the density gradient points along the mean shift vector  m(x) = [Σ_{i=1}^{n} x_i g(‖(x − x_i)/h‖²)] / [Σ_{i=1}^{n} g(‖(x − x_i)/h‖²)] − x  Translate the kernel window by the mean shift vector m(x):  Go to the maxima of density.  Used for visual tracking.
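The mean-shift iteration, sketched in 1-D with a Gaussian kernel: each step translates x by m(x), the difference between the kernel-weighted mean of the samples and x (the sample points and bandwidth below are illustrative values, not from the slides):

```python
import math

def mean_shift_1d(points, x0, h=1.0, n_iter=50):
    """Mean-shift mode seeking with a Gaussian kernel: repeatedly move x
    to the kernel-weighted mean of the samples until it settles at a
    local density mode."""
    x = x0
    for _ in range(n_iter):
        ws = [math.exp(-0.5 * ((x - p) / h) ** 2) for p in points]
        x_new = sum(w * p for w, p in zip(ws, points)) / sum(ws)
        if abs(x_new - x) < 1e-8:       # converged: the shift vanishes at a mode
            break
        x = x_new
    return x

# Samples drawn around two modes (near 0 and near 5); starting at 4.0
# the iteration climbs to the mode near 5.
pts = [-0.3, -0.1, 0.0, 0.2, 0.4, 4.6, 4.8, 5.0, 5.1, 5.3]
mode = mean_shift_1d(pts, x0=4.0)
```

In mean-shift tracking the "samples" are pixels weighted by how well they match the target's color histogram, and the same hill-climb moves the window to the best-matching location each frame.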
  • 111. Model Update in Tracking  Template update for the KLT algorithm;  Online mixture model;  Incremental subspace update;  Online boosting;  More robust online-trained classifier:  Semi-supervised;  Multiple instance learning;  Co-tracking with different features (co-training);  Multiple classifier boosting (object part-based or orientation-based);  Background separation for its multimodal distribution as well;  Object Structural Constraints by Positive-Negative learning (self learning);  Struck: Structured Output Tracking with Kernels (SVM). Page 111
  • 112. Online Multiple Negative Modality Learning Boosting for Tracking Page 112
  • 113. Online Boosting  Online learning: a learning algorithm is presented with one example at a time;  Since we don’t know a priori how difficult/good a sample is, the online boosting algorithm turns to a new strategy to compute the weight distribution;  Oza proposed an on-line boosting framework in which the importance of a sample can be estimated by propagating it through the set of weak classifiers.  However, Oza’s algorithm has no way of choosing the most discriminative feature because the entire training set is not available at one time.  Grabner and Bischof proposed a modified method which performs feature selection (defined as selectors) by maintaining a pool of M > N candidate weak classifiers:  The number of weak classifiers N is fixed at the beginning;  One sample is used to update all weak classifiers and the corresponding voting weights;  For each classifier, the most discriminative feature for the entire training set is selected from a given feature pool;  It is suggested the worst feature is replaced by a new one chosen randomly from the feature pool.  Stochastic gradient descent is more suited for online learning, more specifically for online boosting. Page 113
  • 114. Multiple Instance Learning (MIL)  The basic idea is that during training, examples are presented in sets or bags and labels are provided for the bags rather than individual instances;  If the bag is labeled positive, it is assumed to contain at least one positive instance, otherwise the bag is negative;  The ambiguity is passed on to the learning algorithm, which now has to figure out which instance in each positive bag is the most “correct”.  The MIL problem can be solved by a gradient boosting framework (proposed by Friedman) to maximize the log likelihood of bags;  A Noisy-OR (NOR) model is used to define the bag probability;  The instance-is-positive probability is modeled as the logistic function;  The weight on each sample is given as the derivative of the loss function with respect to a change in the score of the sample.  People have been adapting classical classification methods, such as SVMs or boosting, to work within the context of multiple-instance learning. Page 114
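The Noisy-OR bag model mentioned above is a one-liner: a bag is positive unless every instance fails to be positive, assuming independent instances (the helper name and instance probabilities below are made up for illustration):

```python
def noisy_or_bag_probability(instance_probs):
    """Noisy-OR bag probability used in MIL boosting: the bag is positive
    if at least one instance is positive, assuming independence."""
    prod = 1.0
    for p in instance_probs:
        prod *= (1.0 - p)          # probability that every instance is negative
    return 1.0 - prod

# One confident instance is enough to make the whole bag probably positive.
p_bag = noisy_or_bag_probability([0.05, 0.1, 0.9])
```

Because the bag probability is driven by its strongest instance, the gradient of the bag-level log-likelihood naturally concentrates on the "most correct" instance in each positive bag, which is how the ambiguity is passed to the learner.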
  • 115. Online Incremental SVM Learning for Tracking Page 115
  • 116. Incremental/Decremental SVM  Incremental on-line learning to construct the solution recursively one point at a time;  Retain the Kuhn-Tucker (KT) conditions on all previously seen data, while “adiabatically” adding a new data point to the solution;  Support vectors (SV) in soft margin classification: margin SVs, error SVs and ignored SVs;  Leave-one-out (LOO) for predicting the generalization power of a trained classifier, implemented by decremental unlearning (adiabatic reversal of incremental learning) on each of the training data from the fully trained solution. Page 116
  • 117. Context in Tracker, Fusion of Trackers  Context information is also important;  Mining auxiliary objects or local visual information surrounding the target to assist tracking;  The context information is helpful when the target is fully occluded or leaves the image;  “Learning where the object might be”;  “Exploring Supporters and Distracters in Unconstrained Environments”.  Combines static, moderately adaptive and highly adaptive trackers to account for appearance changes;  Multiple trackers are maintained in a MAP framework;  Multiple feature sets selected in a Bayesian way. Page 117
  • 118. Page 118 Multiple Object Tracking  The MOT tracker must not only associate the objects with the observation data correctly (with in-between interaction), but also discriminate between objects with similar appearance;  Persistence in motion;  Mutual exclusion;  Challenging situations:  Joint state space: variable number of objects?  Interaction (split/merge): occlusion;  Birth/death (enter/leave): appear/disappear.  Similar cases:  Articulated object tracking: body parts, hand fingers etc.  Deformable object tracking: facial expression with multiple action units.
  • 119. Page 119 Multiple Object Tracking  Typical methods:  Multiple independent object trackers (MIOT): assume no interaction and run separate trackers in parallel.  Multiple hypothesis tracker (MHT): k-best hypotheses, gating/pruning.  Joint probabilistic data association filter (JPDAF): expectation over all association hypotheses.  Probability hypothesis density (PHD): propagate the 1st moment of the posterior.  Particle filter  Apply the probabilistic exclusion principle;  Bayesian multiple-blob tracker: background subtraction;  Subordination links between particles;  Mixture model: particle clustering.  MCMC: Markov chain in sampling.  Markov random field-based: energy minimization of interaction and data terms;  Belief propagation: inference approximation.  Mean field: variational approximation.
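As a concrete baseline for the data-association step that all of the methods above refine, here is a hedged sketch of greedy one-to-one IoU matching between tracks and detections (a stand-in for the Hungarian/JPDA machinery; it enforces mutual exclusion since each detection feeds at most one track; all names are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, min_iou=0.3):
    """Greedy one-to-one assignment by descending IoU.
    Returns (matches, unmatched_track_ids, unmatched_detection_ids);
    unmatched detections would spawn births, unmatched tracks deaths."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < min_iou or ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d
```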
  • 120. 3-D Model-based Tracking  3-D model: CAD geometric model, planar parts or rough ellipsoid;  Pose (camera or object) estimation can considerably simplify the task;  Camera calibration or not?  Non-rigid object tracking with 3-d model: 3D Morphable Models;  Articulated object tracking with 3-d model: 3D kinematic chain structure;  Factorization-based: separation of motion from structure.
  • 121. 3-D Model-based Tracking  How to eliminate the drifting error?  Key frames for registration (pose refinement);  Bundle adjustment (used in the point-based method).  Edge-based method  Look for strong gradients: fast and general;  Extract image contours and then fit them to the model outlines;  Pixel-based method  Optic flow: add a feedback loop to suppress the drift effect;  Template matching: Lucas-Kanade;  Interest point-based: Lucas-Kanade-Tomasi;  Note: tracking without 3-d models -> visual SLAM vs. SfM (structure from motion).  SLAM (simultaneous localization and mapping): to provide in real-time an estimation of the relative motion of a camera and a 3D map of the surrounding environment;  Extensible tracking: attempts to add previously unknown scene elements to its initial map, and these then provide registration even when the original map is out of sensing range.
  • 122. Deformable Object Tracking  2-D methods;  Active contour (snake);  Level set;  Exemplar-based shape prior: ASM, AAM as a generative model;  Project-out inverse compositional algorithm for model fitting;  Constrained local model:  Use point distribution model (PDM), but only model the appearance of local patches around landmarks;  3-D methods: model the variability of a certain class of objects;  Deformable model:  Shape, represented as curve or surface, is deformed in order to match a specific example;  Model texture variations & imaging factors i.e. perspective projection and illumination effects;  3-D pose estimation from 2-D view;  Rigid pose.
  • 123. Articulated Object Tracking  2-D methods;  Model-free methods;  Exemplar-based;  2-D model-based: cardboard-like;  3-D methods;  Without estimating parts’ joint angles;  With fixed basis;  Use top part’s tip locations;  3-D model-based: kinematic chain structure;  Quantized feature space;  Model refinement: factorization, key-frame-based;  Motion filter: Kalman filter, particle filter or multiple hypothesis tracking (MHT);  Data-driven dimensionality reduction by learning the configuration space;  3-D pose estimation from 2-D view;  Classification-based or direct mapping-based.
  • 124. Page 124 PTAM, DTAM and Semi-Dense TAM  PTAM: tracking and mapping separated, mapping based on key frames BA (not update frame by frame), while tracking based on camera pose estimation with patch-based search;  DTAM: still key frames-based, a pixel-based rather than feature-based method for tracking and mapping with a dense 3D surface textured model;  Dense inverse depth map estimate by regularization (TV): primal dual;  Occlusion/blur handling;  Tracking by image registration with synthesized realistic novel view:  Nonlinear LS: like iterative LK style (rotation first, then full pose).  Semi-Dense TAM: semi-dense depth of pixels with apparent image gradients;  Probabilistic depth map + uncertainty from both geometric and photometric disparity errors;  Reduction of image regions for estimating semi-dense inverse depth map for the current frame;  Tracking is still dense in image alignment.  Initial depth map is from stereo vision by five-point-pose algorithm (Multiple view stereo).
  • 125. Feature Tracking  The popular KLT (Kanade-Lucas-Tomasi) method;  Good features for tracking;  Eigenvalues (Shi);  Affine alignment;  FAST, SIFT, SURF, HOG, …;  Gradient, orientation, histogram;  Random forest, FERNS: tracking-by-detection  The classifiers are applied for keypoint matching;  Multiclassifier based on randomized trees;  Random selection of features;  Random selection of patches.  For classification.  Note: Optic flow is dense tracking. Extract features independently and then match by comparing descriptors. (Figure: a fern maps the patch through binary tests to one of the possible outputs, which indexes the posterior distributions P(F_k | C = c_i) stored as look-up tables.)
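The "good features for tracking" criterion (Shi) mentioned above scores a window by the smaller eigenvalue of the 2x2 gradient structure tensor; a minimal sketch, with the window passed as flat gradient lists (names are illustrative):

```python
import math

def min_eigenvalue(gx, gy):
    """Shi-Tomasi 'good features to track' score for one window: the
    smaller eigenvalue of the structure tensor
    G = [[sum gx^2, sum gx*gy], [sum gx*gy, sum gy^2]],
    computed in closed form for a symmetric 2x2 matrix.
    gx, gy: image gradients sampled inside the window."""
    a = sum(g * g for g in gx)
    c = sum(g * g for g in gy)
    b = sum(p * q for p, q in zip(gx, gy))
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
```

A corner has strong gradients in two directions (both eigenvalues large); an edge has gradient in only one direction, so the smaller eigenvalue collapses to zero and the window is rejected.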
  • 126. Feature Tracking  After the ferns are trained, do patch classification: class_label(example) = argmax over c_i of the product over k = 1..M of P(F_k | C = c_i). (Figure: each of Fern 1, Fern 2, Fern 3 maps the example patch to a discrete output, e.g. 2, 6, 1, which indexes its posterior distribution look-up table.)
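The fern classification rule above can be sketched as a look-up-table product, accumulated in log space for numerical stability (the table layout `posteriors[k][f][c]` is an illustrative assumption, not the original implementation):

```python
import math

def fern_classify(posteriors, fern_outputs):
    """Keypoint classification with ferns (sketch): each fern k maps the
    patch to a discrete output F_k; return the class maximizing the
    product of the look-up-table posteriors P(F_k | C = c_i).
    posteriors[k][f][c] = P(F_k = f | C = c)."""
    n_classes = len(posteriors[0][0])
    best_c, best_score = None, -float("inf")
    for c in range(n_classes):
        score = sum(math.log(posteriors[k][f][c] + 1e-12)
                    for k, f in enumerate(fern_outputs))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```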
  • 127. Page 127 PTAM: Parallel Tracking and Mapping  Use a large number of (1000) points in the map;  Don’t update the map every frame: keyframes;  Split the tracking and mapping into two threads;  Make pyramid levels for tracking and mapping;  Detect FAST corners;  Use motion model to update camera pose;  Key frames/new map points added conditionally;  At least sufficient baseline, good epipolar search;  Map optimization by batch method: BA.  Stereo initialization densely: two clicks for key frames  Use five-point-pose algorithm;  Map maintained with data association, outlier retest. (Figure: per frame, thread #1 (tracking) finds features, updates only the camera pose and draws graphics, keeping it simple and easy; thread #2 (mapping) updates the map.)
  • 128. Page 128 PTAM: Parallel Tracking and Mapping Tracking thread: • Responsible for estimation of the camera pose and rendering augmented graphics • Must run at 30 Hz • Make as robust and accurate as possible Mapping thread: • Responsible for providing the map • Can take lots of time per key frame • Make as rich and accurate as possible (Flowchart: mapping loop: stereo initialization, wait for new key frame, add new map points, optimize map, map maintenance; tracking per frame: pre-process frame, then a coarse stage and a fine stage of project points / measure points / update camera pose, then draw graphics.)
  • 129. DTAM: Dense Tracking and Mapping Inputs: • 3D texture model of the scene • Pose at the previous frame Tracking as a registration problem • First, inter-frame rotation estimation: the previous frame is aligned to the current image to estimate a coarse inter-frame rotation; • The estimated pose is used to project the 3D model into a 2.5D image; • The 2.5D image is registered with the current frame to find the current pose; • Two template matching problems for minimizing SSD: direct and inverse compositional. Principle: • S depth hypotheses are considered for each pixel of the reference image Ir • Each corresponding 3D point is projected onto a bundle of images Im • Keep the depth hypothesis that best respects the color consistency from the reference to the bundle of images Formulation: • : pixel position and depth hypothesis • : number of valid reprojections of the pixel in the bundle • : photometric error between reference and current image
  • 130. DTAM: Dense Tracking and Mapping Formulation of a variational approach: • First term as regularizer with Huber norm, second term as photometric error; • Problem: optimizing this equation directly requires linearizing the cost volume, which is expensive, and the cost volume has many local minima. Approximation: • Introduce an auxiliary variable, which can be optimized with a heuristic search • The second term brings the original and auxiliary variables together Problem (depth estimation): uniform regions in the reference image do not give a discriminative enough photometric error; Solution (primal dual): assume depth is smooth on uniform regions, and use a total variation approach where the depth map is the functional to optimize: the photometric error defines the data term and the smoothness constraint defines the regularization. Reformulation of the regularization with the primal-dual method • A dual variable p is introduced to compute the TV norm: Problem in SSD minimization: align template T(x) with input I(x). Formulation: find the transform that best maps the pixels of the template into the ones of the current image, minimize:
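As a toy stand-in for the TV regularization described above, here is a primal-dual (Chambolle-Pock style) solver for the 1-D ROF model, denoising a signal instead of an inverse depth map; assumptions: unit-spacing forward differences and steps tau = sigma = 0.25 so the primal-dual step-size condition holds:

```python
def tv_denoise_1d(f, lam=8.0, iters=300, tau=0.25, sigma=0.25):
    """Chambolle-Pock primal-dual iterations for the 1-D ROF model
    min_x  lam/2 * sum_i (x[i] - f[i])^2 + sum_i |x[i+1] - x[i]|
    (a toy analogue of the TV-regularized depth map optimization)."""
    n = len(f)
    x = list(f)
    xb = list(f)                      # over-relaxed primal variable
    p = [0.0] * (n - 1)               # dual variable, one per forward-difference edge
    for _ in range(iters):
        # dual ascent on p, then projection onto |p| <= 1 (the TV dual ball)
        for i in range(n - 1):
            p[i] = max(-1.0, min(1.0, p[i] + sigma * (xb[i + 1] - xb[i])))
        x_old = list(x)
        # primal step: gradient of the TV term enters via the divergence of p,
        # followed by the proximal map of the quadratic data term
        for i in range(n):
            div = (p[i] if i < n - 1 else 0.0) - (p[i - 1] if i > 0 else 0.0)
            x[i] = (x[i] + tau * div + tau * lam * f[i]) / (1.0 + tau * lam)
        xb = [2.0 * xi - xo for xi, xo in zip(x, x_old)]
    return x
```

The same structure (dual ascent, projection, primal proximal step, over-relaxation) carries over to the 2-D Huber-TV energy used on DTAM's inverse depth maps.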
  • 131. Semi-Dense TAM  Continuously estimate a semi-dense inverse depth map for the current frame, which in turn is used to track the motion of the camera using dense image alignment.  Estimate the depth of all pixels which have a non-negligible image gradient.  Use the oldest frame where the disparity range and the view angle are within a certain threshold;  If a disparity search is unsuccessful, the pixel’s “age” is increased, such that subsequent disparity searches use newer frames where the pixel is likely to be still visible.  Each estimate is represented as a Gaussian probability distribution over the inverse depth;  Minimize photometric and geometric disparity errors with respect to the pixel-to-inverse-depth ratio.  Propagate this information over time, and update it with new measurements as new images arrive.  Based on the inverse depth estimate for a pixel, the corresponding 3D point is calculated and projected into the new frame, providing an inverse depth estimate in the new frame;  The hypothesis is assigned to the closest integer pixel position; to eliminate discretization errors, the sub-pixel image location of the projected point is kept and re-used for the next propagation step;  Assign the inverse depth value the average of the surrounding inverse depths, weighted by respective inverse variance;  Keep track of the validity of each inverse depth hypothesis to handle outliers (removed if needed).
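The per-pixel measurement update on a Gaussian inverse-depth hypothesis reduces to a product of two Gaussians, i.e. an inverse-variance-weighted average; a one-function sketch (names are illustrative):

```python
def fuse_inverse_depth(mu1, var1, mu2, var2):
    """Fuse two Gaussian inverse-depth hypotheses (prior and new
    measurement): the product of two 1-D Gaussians is again Gaussian,
    with inverse-variance-weighted mean and harmonically combined
    variance. Returns (fused_mean, fused_variance)."""
    var = (var1 * var2) / (var1 + var2)
    mu = (mu1 * var2 + mu2 * var1) / (var1 + var2)
    return mu, var
```

The fused variance is always smaller than either input, which is why repeated observations sharpen the semi-dense depth map over time.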
  • 132. Semi-Dense TAM
  • 133. Depth Sensor-based Tracking  Features:  RGBD HOG feature;  Point cloud feature;  3D iterative closest point (ICP);  Occlusion detection and recovery from RGBD; “Tracking revisited using RGBD camera: unified benchmark and Baselines”, by S Song, J Xiao, proc. of ICCV13.
  • 134. Page 134 KinectFusion: Depth Surface Mapping and Tracking  Use data from Kinect’s depth sensor to build real-time 3D models of medium sized scenes;  It employs a Simultaneous Localization and Mapping (SLAM) technique, automatic tracking of its pose plus synchronous real-time construction of the finished 3D model;  1. register the incoming depth maps from the Kinect by Iterative Closest Point (ICP), which tracks the pose of the sensor from one frame to the next and allows each incoming frame to be placed in a unified global coordinate space;  2. employ Volumetric Integration which computes a truncated signed distance function (SDF) from each input depth map frame and accumulates the data into a discretized volumetric representation of the global coordinate space;  Pre-ICP registration: bilateral filter and subsampling, build depth map pyramid -> vertex/normal pyramids;  ICP registration: estimate rigid transform, with a point-to-plane error metric for faster convergence;  A coarse to fine iterations through 3-level pyramids and check the stability/validity;  Volumetric integration: 3-d grids of voxels swept through by SDF with depth maps (weighted averaging);  SDF gets the relative distance from a given 3D point to the surface (+mu/0/-mu) for surface reconstruction;  Post-volumetric integration: to visualize the surface by ray-casting and also used the synthetic depth map for ICP at next time, i.e. aligning the live depth map with the raycasted view of the model;  Find zero-crossings of the truncated SDF in ray-traversing and get its normal by interpolating over neighbor voxels;  Apply Marching Cubes for the extraction of mesh-based surface representation from volumetric data.
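The volumetric-integration step above reduces, per voxel, to a truncated and weight-capped running average of signed distance measurements; a sketch with illustrative parameter names (mu is the truncation band, max_weight caps the accumulated confidence so the model can still adapt):

```python
def tsdf_update(tsdf, weight, sdf_meas, w_meas=1.0, max_weight=64.0, mu=0.1):
    """One voxel's weighted-average TSDF update (KinectFusion-style
    sketch): truncate the measured signed distance to [-mu, mu], then
    blend it into the stored value. Returns (new_tsdf, new_weight)."""
    d = max(-mu, min(mu, sdf_meas))                      # truncation to +/- mu
    new_tsdf = (tsdf * weight + d * w_meas) / (weight + w_meas)
    new_weight = min(weight + w_meas, max_weight)
    return new_tsdf, new_weight
```

Surface extraction then looks for the zero-crossings of this averaged field, by ray-casting for visualization and ICP, or by Marching Cubes for a mesh.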
  • 135. Page 135 KinectFusion: Depth Surface Mapping and Tracking
  • 136. 3-D Model-based Body Tracking with Kinect RGB-D Data • Predict the 3D position of each body joint from a single depth image • No temporal information • Uses an object recognition approach • Single per-pixel classification • Large and highly varied training set • The classifications can produce hypotheses of 3D body joint positions by skeletal tracking; • Random forest-based classifiers with depth-invariant features. • The method has been designed to be robust, in two ways: • The system is trained with a vast, highly varied training set of synthetic images ensuring coverage of: • Ages, body shapes and sizes, clothing, hairstyles; • The recognition does not rely on any temporal information: • The system can initialize from arbitrary poses; • Prevents catastrophic loss of track.
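The depth-invariant feature can be sketched per pixel: two offsets are scaled by the depth at the probe pixel, so the same body part gives a similar response whatever its distance from the camera (the image layout and background constant below are assumptions for illustration):

```python
def depth_pair_feature(depth, x, u, v, big=1e6):
    """Depth-normalized offset-pair feature (sketch): at probe pixel x,
    read the depth at two offsets u and v scaled by 1/depth(x), and
    return their difference. Reads outside the image or at invalid
    pixels return a large constant, standing in for background.
    depth: 2-D list indexed as depth[row][col]; x, u, v: (row, col)."""
    def d(p):
        r, c = p
        if 0 <= r < len(depth) and 0 <= c < len(depth[0]) and depth[r][c] > 0:
            return depth[r][c]
        return big
    dx = d(x)
    p1 = (x[0] + int(round(u[0] / dx)), x[1] + int(round(u[1] / dx)))
    p2 = (x[0] + int(round(v[0] / dx)), x[1] + int(round(v[1] / dx)))
    return d(p1) - d(p2)
```

Each node of a random forest thresholds one such feature; the leaves store body-part distributions learned from the synthetic training set.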
  • 137. 3-D Model-based Face Tracking with Kinect RGB-D Data  Use a 3-D deformable face model;  Handle noisy depth data by maximum likelihood;  L1 regularization in ICP-based tracking framework;  Feature tracking in RGB data helps.
  • 138. Page 138 Classical Scenarios of Object Tracking Region tracking  case: histogram + gradient; Point tracking  case: trajectory, continuity; Contour tracking  case: snake, level set; Articulated object tracking  case: articulation constraints. Deformable object tracking  case: subspace modeling; Multiple objects tracking  case: detection + mixture models.
  • 139. Typical Applications of Object Tracking  Surveillance & Monitoring  Trajectory estimation of cars, people and animals etc.  Abnormal behavior analysis.  Motion-based Recognition  Facial expression recognition, gait recognition  Video Indexing  Camera motion, object annotation, scene classification, event detection  Video Compression  Model-based coding, ROI-based coding.  Human Computer Interaction  Gesture understanding, body pose identification, audio-visual speech recognition.  Vision-based Navigation  Obstacle avoidance, path planning.  Mixed Reality  Dynamic registration  Medical Therapy  Left ventricular motion estimation
  • 140. Application 1: Object Filtering – Pornography Blocking in Adult Content Page 140  Porn Detection System  Camera motion in obscene video: frequent tilting and zooming;  Erotic sound detection;  Skin color detection;  Filtered by face detection  Text classification?  Video type classifiers.
  • 141. Application 2: Object Search – Video Retrieval Page 141 1. Collect all visual words within the query region (sampling SIFT in MSERs; vocabulary tree or kd-tree); 2. Inverted file index to find relevant frames (tf-idf weighting; stop word removal); 3. Compare word counts (bag-of-words histogram as the object descriptor; query expansion as blind relevance feedback); 4. Spatial verification (spatial consistency checking by homography; spatial re-ranking). (Figure: query region and retrieved frames.)
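Steps 2 and 3 above can be sketched as tf-idf scoring over an inverted file (the data layout, word -> {frame_id: term count}, is an illustrative assumption):

```python
import math
from collections import Counter

def tfidf_scores(query_words, inverted_index, n_docs):
    """Score frames against the query's visual words with tf-idf over an
    inverted file (sketch). Only postings lists of words present in the
    query are touched, which is the point of the inverted index.
    inverted_index: word -> {doc_id: term count}. Returns (doc_id, score)
    pairs sorted by descending score."""
    scores = Counter()
    for word, q_tf in Counter(query_words).items():
        postings = inverted_index.get(word, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        # idf appears twice: both query and document vectors are idf-weighted
        for doc, tf in postings.items():
            scores[doc] += q_tf * tf * idf * idf
    return scores.most_common()
```

Spatial verification (step 4) would then re-rank only the top of this list, since homography fitting is far more expensive than index lookups.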
  • 142. Application 3: Contextual Video Ads – Object-based Ads Matching (AdImages) Page 142  Specify ad context by the appearance of characteristic images, AdImages;  AdImages are those related to a company (e.g., Puma logo), a product (e.g., a sneaker), or a location (e.g., Golden Gate Bridge), etc.;  Indexing and matching utilizing spatial contextual cues;  Spatial verification with an unsupervised learning method. AdImage (objects) -> matching frame
  • 143. Application 4: Object Recognition-based Recommendation System Page 143  A mobile cooking recipe recommendation system employing object recognition for food ingredients such as vegetables and meats;  Object recognition on food ingredients runs in real time on an Android-based smartphone;  Bag-of-features with SURF and a color histogram extracted from multiple images as image features;  Linear SVM with the one-vs-rest strategy as a classifier;  Achieves an 83.93% recognition rate.  The user can obtain a recipe list instantly as a result.
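The color-histogram half of the feature can be sketched as a quantized, L1-normalized RGB histogram (the bin count and layout below are assumptions for illustration, not the app's actual parameters):

```python
def color_histogram(pixels, bins=4):
    """Quantized RGB color histogram (sketch): each channel is split
    into `bins` ranges, giving a bins**3-dimensional vector, then
    L1-normalized so images of different sizes are comparable.
    pixels: iterable of (r, g, b) tuples with values in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1.0
    n = float(len(pixels)) or 1.0   # guard against an empty pixel list
    return [h / n for h in hist]
```

Concatenated with a SURF bag-of-features vector, such a histogram would feed the one-vs-rest linear SVMs described above.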
  • 144. Application 5: Interactive Video Page 144  Ooyala: Identify and associate metadata with specific people, objects and shapes within the video by using a framing tool in the ingest UI;  Track these tagged objects as they move through the content, preserving their identities and associated metadata;  These objects "broadcast" their metadata information to the surrounding application using Ooyala's player. Enable click-to-buy and targeted advertising
  • 145. Application 6: Object-based Video Navigation/Browsing Features in the video are automatically tracked and grouped off-line; Tracks the motion of image points across the video and segments those tracks into coherently moving groups; Enabling re-animation of video sequences using the mouse as input. Page 145
  • 146. Application 7: Object-based Video Annotation  Top-down approach: generating language through combinations of object detections and language models;  Bottom-up method: propagation of keyword tags from training images to test images through probabilistic or nearest-neighbor techniques. Page 146
  • 147. Application 8: Object-based Video Compression Page 147 (Figure: input video -> object localization -> object-aware encoding, guided by the object info., -> output bitstream; a macroblock cost distribution shown from min to max; original pictures compared under conventional H.264 coding and ROI-based H.264 coding.)