Machine learning based contour boundary detection from images



machine learning, scene understanding, static segmentation, Gestalt cues, superpixel, logistic regression, MRF, CRF, manifold learning, ensemble learning, k-means, SVM, Naive Bayes, sparse coding, K-SVD, orthogonal matching pursuit, deep learning, RBM, DBM, DBN, SAE.

Published in: Technology, Education


  1. 1. Yu Huang Sunnyvale, California
  2. 2.  Edges, Contours and Boundaries  Finding Meaningful Contours  Static Segmentation (Regions)  Classical Gestalt Cues  Berkeley Segmentation Data Set  Learning for Scene Segmentation  Learn a Local Boundary Model  Image Figure/Ground Assignment  Learning Edges and Boundaries  Sparse Models for Edge Detection  Boundary Detection and Grouping  Sparse Coding for Contour Detection  Sketch Tokens for Contour Detection  Deep learning shape prior for segmentation  Deep neural prediction network for visual boundary  References  Appendix
  3. 3. • Edges: significant local changes in the image; they occur on the boundary between two different regions. • Contour: a representation of linked edges for a region boundary. ◦ Closed: corresponds to a region boundary; a filling algorithm determines the pixels in the region. ◦ Open: part of a region boundary; gaps form because of a high edge-detection threshold or weak contrast. Open contours also occur when line fragments are linked together, as in drawing or handwriting. • Contour representation: ◦ Ordered list of edges (chain codes) ◦ Curve model for a contour (piecewise line segments or cubic splines)
  4. 4.  Local edge detection ◦ Problems - false targets, misses  One solution: use other cues (image segmentation) ◦ Texture: Sharp changes in orientation, scale of textures ◦ Motion: >=2 Frames ◦ Disparity: Stereo Left eye Right eye Frame 1 Frame 2
  5. 5.  Regional Approaches (split-merge, watershed, mean shift, ...)  Use regional info, optimize labelling of regional tokens, e.g. clustering  Depending on uniformity in object region  Active Contour Models (snakes) ◦ Use regional (external) & boundary (internal) info, optimize deformation of model ◦ Sensitivity to initialization, too smooth  Level Set (implicit active contour)  handle topological changes naturally  not robust to boundary gaps  Contour Grouping  Use boundary info (& regional info), optimize grouping of contour fragments  Learning-based: Boundary Detection.
  6. 6. How is grouping done in human vision?  Proximity  Similarity ◦ Brightness ◦ Contrast  Good continuation ◦ Parallelism ◦ Co-circularity
  7. 7.  Two-class classification model  Over segmentation as preprocessing  Use classical Gestalt cues ◦ Contour, texture, brightness and continuation  A linear classifier is used for training (logistic regression) Superpixel map K=200 Reconstruction of human segmentation from Superpixels •Local •Coherent •Preserve structure •Contour •texture
  8. 8. Image → Boundary Cues (Brightness, Color, Texture) → Cue Combination → Model. Challenges: texture cue, cue combination. Goal: learn the posterior probability of a boundary Pb(x,y,θ) from local information only.
  9. 9.  Human subjects label ground truth figure/ground assignments in natural images.  “Shapemes” encode high-level knowledge in a generic way, capturing local figure/ground cues.  A conditional random field (CRF) incorporates junction cues and enforces global consistency.
  10. 10. Shapemes (clusters of local shapes) Pb edge maps human-marked boundaries Color image contour/junction
  11. 11.  Boosted Edge Learning (BEL): Probabilistic Boosting Tree (PBT) classification;  Features: gradient+Haarlet, over a large image patch.  Learn to detect edges from images with labeled ground truth;
  12. 12. PBT Training:
  13. 13.  Sparseland model and dictionary learning by k-SVD;  Edge detection as the pixelwise classification problem; ◦ “patches centered on edge pixel or not”;  Contour training: class specific edge classifier  Shape training: shape-based object classifier  Classification: edge classifier then shape classifier ◦ Bike, Motorbike, Person or Car? Person?
  14. 14. • Learning-based boundary detection: SIFT-based features, dimension reduction by PCA, boosting (AdaBoost, GentleBoost and MadaBoost); • Boundary grouping: use a normalized saliency criterion and fractional-linear programming to find graph cycles with minimum cost.
  15. 15.  Sparse Code Gradients (SCG): by sparse coding (k-SVD);  Gradient, color, plus depth & surface normal(option);  Linear classifier (SVM) with contrast features (SCG);  Globalization by computing a spectral gradient (like gPb) optionally;
  16. 16.  Definition: straight lines, t-junctions, y-junctions, corners, curves, parallel lines;  Learned (k-means clustering) from patches of human generated contours: a number of classes in hundreds (150 in the paper), Daisy (MSR) descriptors used for shift invariance;  Low-level image features: gradient, color, orientation, etc.;  Classifier: Random decision forest for sketch token labeling from image patches. Sketch Tokens Like “Shapeme”?
  17. 17.  Use deep Boltzmann machine to learn the hierarchical architecture of shape priors: low level local feature and high level global feature;  Apply the learned architecture to model shape variations of global and local structures;  A data-driven variational method to perform object extraction based on shape probabilistic representation.
  18. 18. original result learned shape result by sparse learned shape
  19. 19.  Integration from multiple scales and semantic levels via multi- streams of interlinked, layered, non-linear “deep” processing; ◦ Deep belief net with a variant of the mean-and-covariance RBM;  Unsupervised feature learning; ◦ Supervised boundary prediction by feed forward NN.
  20. 20. CNN structure: explicitly visualizing the dimensions of each network layer. • Contour detection accuracy can be improved by using the deep features learned by CNNs instead. • The training strategy is customized by partitioning the contour (positive) data into subclasses and fitting each subclass with different model parameters. • A new loss function, named positive-sharing loss, lets each subclass share the loss for the whole positive class when learning the parameters. • It introduces an extra regularizer that emphasizes the losses for the positive and negative classes, which helps to explore more discriminative features.
  21. 21. • Run the Canny edge detector to get candidate contour points. • Around each candidate point, extract patches at four different scales and simultaneously run them through the five convolutional layers of the KNet. • Connect these convolutional layers to two separately-trained network branches. • The first branch is trained for classification, the second is trained as a regressor. • Outputs from these two sub-networks are averaged to produce the final score.
  22. 22. An input patch, centered around the candidate point, goes through the five conv. layers of the KNet. To extract high-level features, at each conv. layer a small sub-volume of the feature map around the center point is extracted, and max, average, and center pooling are performed on this sub-volume. The pooled values feed a bifurcated sub-network. The scalar outputs computed by the two branches of the bifurcated sub-network are averaged to produce the final contour prediction.
  23. 23. • X. Ren and J. Malik, “Learning a Classification Model for Segmentation”, ICCV 2003. • D. Martin, C. Fowlkes, and J. Malik, “Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues”, IEEE T-PAMI, 2004. • P. Dollár, Z. Tu, and S. Belongie, “Supervised Learning of Edges and Object Boundaries”, CVPR 2005. • X. Ren, C. Fowlkes, and J. Malik, “Figure/Ground Assignment in Natural Images”, ECCV 2006. • J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce, “Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation”, ECCV 2008. • I. Kokkinos, “Highly Accurate Boundary Detection and Grouping”, CVPR 2010. • X. Ren and L. Bo, “Discriminatively Trained Sparse Code Gradients for Contour Detection”, NIPS 2012. • J. Lim, C. L. Zitnick, and P. Dollár, “Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection”, CVPR 2013. • Chen, Yu, Hu, and Zeng, “Deep Learning Shape Priors for Object Segmentation”, CVPR 2013. • Kivinen, Williams, and Heess, “Visual Boundary Prediction: A Deep Neural Prediction Network and Quality Dissection”, AISTATS 2014.
  24. 24. • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” ◦ Supervised/unsupervised model: labeled/unlabeled data; ◦ Semi-supervised model: both labeled and unlabeled data; ◦ Online learning: incremental update; ◦ Ensemble classifiers: bagging, stacking, boosting, random forest, … ◦ Reinforcement learning: learn by interacting with an environment. • Types of ML algorithms ◦ Prediction: predicting a variable from data ◦ Classification: assigning records to predefined groups ◦ Clustering: splitting records into groups based on similarity ◦ Association learning: seeing what often appears together with what • Relationship with other fields ◦ Artificial intelligence: emulating with programs how the brain works; ML is a branch of AI ◦ Data mining: building models in order to detect patterns; ◦ Statistical analysis: probabilistic models on which to infer from data; ◦ Information retrieval: retrieval of information from a collection of data.
  25. 25. • Unsupervised learning tries to find hidden structure in unlabeled data; since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution; • It is closely related to the problem of density estimation in statistics, but it also encompasses many other techniques that seek to summarize and explain key features of the data; • Approaches to unsupervised learning include: ◦ Clustering; ◦ Hidden Markov models; ◦ Blind signal separation (PCA, ICA, NMF, SVD…); • Unsupervised methods in NN: ◦ Self-Organizing Map: a topographic organization in which nearby locations in the map represent inputs with similar properties; ◦ Adaptive Resonance Theory: allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same cluster by means of the vigilance parameter.
  26. 26. • Supervised learning is the task of inferring a function from labeled training data. The training data consist of a set of training examples. • Each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). • A supervised learning algorithm analyzes the training data and produces an inferred function, which is used for mapping new examples. • There are four major issues to consider in supervised learning: ◦ tradeoff between bias and variance; ◦ amount of training data relative to the complexity of the "true" function; ◦ dimensionality of the input space: curse of dimensionality; ◦ degree of noise in the desired output values: over-fitting. • The standard setting can be generalized in several ways: ◦ Semi-supervised learning: the desired output values are provided only for a subset of the training data; the remaining data are unlabeled. ◦ Active learning: instead of assuming that all of the training examples are given at the start, new examples are collected interactively, typically by making queries to a human user.
  27. 27. • Training/testing data (70%/30%) • Unbalanced data (one class has more data than the others) ◦ Sampling, learning algorithm modification (cost-sensitive), ensemble, … • Feature extraction ◦ Sparse coding, vector quantization, … • Curse of dimensionality: sensitivity to “noise” ◦ Dimension reduction, manifold learning/distance metric learning • Linear or non-linear model ◦ Local/global minimum (convex/concave obj. function): learning rate ◦ Regularization: L-1/L-2 norm ◦ Kernel trick: mapping a nonlinear feature space to a high-dim. linear space • Discriminative or generative model ◦ Bottom up (conditional distribution)/top down (joint distribution) • Over-fitting: learning the “noise” ◦ Cross validation with grid search • Performance evaluation ◦ Precision/recall, confusion matrix, ROC (receiver operating characteristic)
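
As a small illustration of the performance-evaluation bullet above, the following sketch derives precision and recall from the four confusion-matrix counts of a binary classifier. The function names and the convention that label 1 is the positive (e.g. boundary) class are assumptions made for this example only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 4)
print(precision_recall(y_true, y_pred))   # both equal 2/3
```

With unbalanced data, as in boundary detection where most pixels are non-boundary, precision and recall are usually more informative than plain accuracy, since the true-negative count does not enter either of them.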
  28. 28. • Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. • This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. • PCA is sensitive to the relative scaling of the original variables. • Also known as the Karhunen–Loève transform (KLT) or Hotelling transform; it is computed by singular value decomposition (SVD) of the data matrix or eigenvalue decomposition (EVD, spectral decomposition) of its covariance matrix, and is closely related to factor analysis; • Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm; • Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions.
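
To make the PCA definition above concrete, here is a minimal sketch that finds only the first principal component. It uses power iteration on the sample covariance matrix rather than a full SVD or EVD, which is a simplification chosen for this example; the function name and the iteration count are assumptions, and the data must have non-zero variance.

```python
import math


def first_principal_component(rows, iters=200):
    """Return (unit direction, variance along it) for a list of samples."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance


# Points lying exactly on the line y = 2x: the first component points along (1, 2).
data = [(float(t), 2.0 * t) for t in range(-2, 3)]
direction, variance = first_principal_component(data)
print(direction, variance)
```

Rescaling one of the two coordinates changes the recovered direction, which is the sensitivity to relative scaling noted above.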
  29. 29. • Independent component analysis (ICA) separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and statistically independent of each other. ◦ ICA is a special case of blind source separation. • Assumptions: the source signals are independent of each other; the distribution of the values in each source signal is non-Gaussian. • Three effects of mixing signals: ◦ Independence: the source signals are independent, but their mixtures are not; ◦ Normality: a mixture is closer to Gaussian than any of the original source signals; ◦ Complexity: a mixture is more complex than any of its constituent source signals. • Preprocessing: centering, whitening and dimension reduction; • ICA finds the independent components (latent variables) by maximizing the statistical independence of the estimated components; • Definitions of independence for ICA: ◦ Minimization of mutual information (KL divergence or entropy); ◦ Maximization of non-Gaussianity (kurtosis and negentropy).
  30. 30. [Figure panels: initial signal, mixed signal, whitening, ICA]
  31. 31. • A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population; • Mixture models are used to make statistical inferences about the properties of the subpopulations given only observations on the pooled population; • A Gaussian mixture model can be Bayesian or non-Bayesian; • Parameter estimation usually relies on the maximum likelihood estimate (MLE), typically computed by expectation maximization (EM), or on maximum a posteriori (MAP) estimation; • EM determines the parameters of a mixture with an a priori given number of components (variants can adapt that number during the iterations); ◦ Expectation step: the "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point; ◦ Maximization step: the mixing coefficients and component model parameters are re-computed as plug-in estimates of the distribution parameters; ◦ Each successive EM iteration will not decrease the likelihood. • Alternatives to EM for mixture models: ◦ Mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC); ◦ Spectral methods based on SVD; ◦ Graphical model: MRF or CRF.
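
The E-step and M-step described above can be sketched for a one-dimensional Gaussian mixture as follows. This is a minimal illustration under stated assumptions: the function names, the fixed iteration count, and the small variance floor (added to avoid degenerate components) are choices made for this example, not part of the slides.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, iters=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns the fitted parameters and the
    log-likelihood recorded at every E-step."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(xs)
    log_likelihoods = []
    for _ in range(iters):
        # E-step: partial membership (responsibility) of each point in each component.
        resp = []
        log_lik = 0.0
        for x in xs:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        log_likelihoods.append(log_lik)
        # M-step: re-estimate mixing coefficients, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor,
            )
    return means, variances, weights, log_likelihoods


xs = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
means, variances, weights, history = em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

The recorded `history` list is non-decreasing on this data, which matches the remark that an EM iteration never decreases the likelihood.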
  32. 32. • Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. • The different types arise from using different cost functions for measuring the divergence between V and W*H, and possibly from regularization of the W and/or H matrices; ◦ squared error, Kullback-Leibler divergence or total variation (TV); • NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis); • pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions; ◦ Their parameters are learned using the EM algorithm; • pLSA is based on a mixture decomposition derived from a latent class model, rather than on downsizing the occurrence tables by SVD as in LSA. • Note: an extended model, LDA (latent Dirichlet allocation), adds a Dirichlet prior on the per-document topic distribution.
  33. 33. Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)
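
For the NMF factorization V ≈ W*H with the squared-error cost mentioned above, one common solver is the multiplicative update rule of Lee and Seung. The sketch below assumes that solver; the helper names, the random initialization and the small `eps` guard against division by zero are choices made for this example.

```python
import random


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix v (list of rows) as w @ h."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [squared_error(v, w, h)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)] for i in range(n)]
        errors.append(squared_error(v, w, h))
    return w, h, errors


v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # rank-1, non-negative
w, h, errors = nmf(v, rank=1, iters=50)
print(errors[0], errors[-1])
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout without any explicit projection step.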
  34. 34.  A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;  In HMM, state is not visible, but output, dependent on state, is visible. ◦ Each state has a probability distribution over the possible output tokens; ◦ Sequence of tokens generated by an HMM gives some information about the sequence of states.  Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;  A HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;  Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP);  Learning: optimize state transition and output probabilities by Baum-Welch algorithm (special case of EM).
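
The dynamic-programming search for the most likely state trajectory mentioned above (the Viterbi algorithm) can be sketched as follows. The toy model with states "Healthy"/"Fever" and all of its probabilities are illustrative assumptions, not taken from the slides.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state sequence, its joint probability)."""
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev, score = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
path, prob = viterbi(("normal", "cold", "dizzy"), states, start_p, trans_p, emit_p)
print(path, prob)   # ['Healthy', 'Healthy', 'Fever'], probability 0.01512
```

For long sequences the products underflow, so practical implementations usually work with sums of log-probabilities instead.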
  35. 35. • Logistic regression is a probabilistic statistical classification model; • The probabilities of the possible outcomes of a single trial are modeled as a function of explanatory variables by a logistic function; • Training: maximizes the conditional likelihood P(y|x) directly;
  36. 36. • Convex optimization (logistic function): w* = argmax_w P(Y|X, w); ◦ A regularization term is added to counter over-fitting; ◦ Iterative solution: a gradient descent method. • In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, usually as Gaussian distributions; ◦ Apply the Metropolis–Hastings algorithm (a more general MCMC method than Gibbs sampling), which is based on a proposal distribution or jumping distribution: the proposal distribution Q proposes the next point that the random walk might move to.
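
The iterative solution above can be sketched as batch gradient ascent on the L2-regularized conditional log-likelihood. The learning rate, the regularization weight, the iteration count and the function names are assumptions chosen for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, l2=0.01, iters=2000):
    """Maximize mean log P(y|x, w) - (l2 / 2) * ||w||^2 by gradient ascent."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            # Gradient of the log-likelihood of one sample is (y - p) * x.
            err = y - sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj + lr * (grad_w[j] / n - l2 * wj) for j, wj in enumerate(w)]
        b += lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


xs = [[-1.5], [-0.5], [0.5], [1.5]]
ys = [0, 0, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b, predict_proba(w, b, [1.0]))
```

Without the `l2` term the weights would grow without bound on linearly separable data such as this toy set, which is one practical reason for adding the regularizer.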
  37. 37. • The Naive Bayes classifier is designed for the case where features are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps: ◦ Training step: using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class. ◦ Prediction step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test sample according to the largest posterior probability. • The class-conditional independence assumption greatly simplifies the training step, since the one-dimensional class-conditional density can be estimated for each feature individually; ◦ While class-conditional independence between features is not true in general, research shows that this optimistic assumption works well in practice; ◦ This assumption allows the Naive Bayes classifier to estimate the parameters required for accurate classification while using less training data than many other classifiers; ◦ This makes it particularly effective for datasets containing many predictors or features.
  38. 38. • Supported distributions in Naive Bayes classification ◦ Naive Bayes is based on estimating P(x|y), the probability or probability density of features x given class y. ◦ Support for normal (Gaussian), kernel, multinomial, and multivariate multinomial distributions. • Normal (Gaussian) distribution: features have normal distributions in each class; • Kernel: computes a separate kernel density estimate for each class based on the training data for that class; • Multinomial distribution ("bag of words" model): each feature is the count of one word; classification is based on the relative frequencies of the words. • Multivariate multinomial distribution: for categorical features; the feature categories can differ from the class levels of the response variable.
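
The two steps of Naive Bayes with the normal (Gaussian) distribution can be sketched as below: training estimates a prior plus a per-feature mean and variance for every class, and prediction picks the class with the largest log-posterior. The function names and the small variance floor are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(xs, ys, var_floor=1e-9):
    """Training step: per class, estimate the prior and 1-D Gaussians per feature."""
    by_class = defaultdict(list)
    for x, y in zip(xs, ys):
        by_class[y].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(xs), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Prediction step: return the class with the largest (unnormalized) log-posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2.0 * math.pi * s2) - (v - m) ** 2 / (2.0 * s2)
        return total
    return max(model, key=log_posterior)


xs = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.8), (5.0, 5.2), (4.8, 5.0), (5.2, 4.8)]
ys = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(xs, ys)
print(predict_gaussian_nb(model, (1.1, 0.9)), predict_gaussian_nb(model, (4.9, 5.1)))
```

Each feature contributes an independent one-dimensional term to the sum, which is exactly where the class-conditional independence assumption enters.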
  39. 39.  Separable Data ◦ An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. ◦ “Margin” means the maximal width of the slab parallel to the hyperplane that has no interior data points. ◦ The support vectors are the data points that are closest to the separating hyperplane.
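  40. 40. • Mathematical formulation (primal): $\min_{f,\,\xi} \; \tfrac{1}{2}\|f\|_K^2 + C\sum_{i=1}^{l}\xi_i$ subject to $y_i f(x_i) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$. The variables $\xi_i$ are slack variables measuring the error made at point $(x_i, y_i)$. • Mathematical formulation (dual): $\min_{\alpha} \; \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{l}\alpha_i$ subject to $0 \le \alpha_i \le C$ for all $i$ and $\sum_{i=1}^{l}\alpha_i y_i = 0$.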
  41. 41.  Non-separable Data ◦ Your data might not allow for a separating hyperplane. In that case, SVM can use a soft margin, meaning a hyperplane that separates many, but not all data points.  Nonlinear Transformation with Kernels ◦ Some binary classification problems do not have a simple hyperplane as a useful separating criterion; ◦ Theory of reproducing kernels: Polynomials, Radial basis or sigmoid function; ◦ Nonlinear kernels can use identical calculations and solution algorithms, and obtain classifiers that are nonlinear.
  42. 42. • Random field: F={F1,F2,…FM} is a family of random variables on a set S in which each Fi takes a value fi in a label set L. • Markov random field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. ◦ Generative model for the joint probability p(x) ◦ Potentials allow no direct probabilistic interpretation ◦ Define potential functions Ψ on maximal cliques A: each maps a joint assignment to a non-negative real number and requires normalization • An MRF is an undirected graphical model
  43. 43. • Conditional, not joint, probabilistic sequential models p(y|x) • Allow arbitrary, non-independent features on the observation sequence X • Specify the probability of possible label sequences given an observation sequence • The probability of a transition between labels depends on past/future observations • Relax strong independence assumptions; no p(x) required • A CRF is an MRF plus “external” variables, where the “internal” variables Y of the MRF are unobserved and the “external” variables X are observed • Linear-chain CRF: the transition score depends on the current observation ◦ Inference by DP as in an HMM, learning by forward-backward as in an HMM • Optimization for learning a CRF: discriminative model ◦ Conjugate gradient, stochastic gradient, …
  44. 44. • Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models; • Ensembles combine many weak learners to produce a strong learner; ◦ The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner; • Ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would; in practice, however, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data; • Empirically, ensembles tend to yield better results when there is significant diversity among the models; • Popular types: ◦ Bagging, boosting, stacking, stochastic discrimination, random subspace, … ◦ Random forest, derived from the random subspace method, constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.
  45. 45. • Bagging samples the training set to generate random, independent bootstrap replicates, constructs a classifier on each, and aggregates the classifiers by a majority vote in the final decision rule (hence “bootstrap aggregating”); • Bootstrapping is based on random sampling with replacement; • A bootstrap replicate (random selection with replacement) of the training set therefore sometimes contains fewer misleading training objects, or none; • Consequently, a classifier constructed on such a training set may perform better.
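
The bagging procedure above can be sketched with a very simple base learner. The one-dimensional threshold "stump", the number of models and the fixed random seed are all assumptions made to keep the example short; any classifier could take the stump's place.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Best 1-D threshold rule on 0/1 labels: predict `right` if x > t else `left`."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((right if x > t else left) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: right if x > t else left


def bagging(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap replicate and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap replicate: n draws with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]

    return predict


xs = list(range(10))
ys = [0 if x < 5 else 1 for x in xs]
predict = bagging(xs, ys)
print(predict(1), predict(8))
```

Each stump sees a slightly different training set, so individual thresholds vary, while the vote smooths out that variation.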
  46. 46. • At each step, the training data are re-weighted so that incorrectly classified objects get larger weights in a new, modified training set, which in effect maximizes the margins between objects; • Classifiers are constructed on weighted versions of the training set, which depend on the previous classification results; • Boosting originated from the Probably Approximately Correct (PAC) learning theory; • AdaBoost was the first boosting algorithm that could adapt to the weak learners; • Variants of AdaBoost (adaptive boosting): ◦ LogitBoost: ◦ GentleBoost: the update is fm(x) = P(y=1|x) – P(y=0|x) instead of
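
The re-weighting scheme above can be sketched as discrete AdaBoost with labels in {-1, +1}. The weighted threshold stump used as weak learner, the clipping of the weighted error and the toy data are assumptions made for this example.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign); the stump predicts
    `sign` if x > threshold and `-sign` otherwise."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(xs, ys, rounds=10):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, sign = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)    # vote of this weak learner
        ensemble.append((alpha, t, sign))
        # Misclassified objects (y * h(x) = -1) get larger weights.
        weights = [w * math.exp(-alpha * y * (sign if x > t else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        score = sum(a * (s if x > t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1

    return predict


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]   # no single stump separates this
predict = adaboost(xs, ys, rounds=3)
print([predict(x) for x in xs])
```

On this toy set no single stump does better than 30% training error, yet the weighted vote of three stumps classifies every training point correctly.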
  47. 47. • In SVM, one performs global optimization in order to maximize the minimal margin, while in Boosting one maximizes the margin locally for each training object; • SVM uses the L-2 norm for both the hypothesis and the weight vector, while Boosting uses the L-∞ norm for the hypothesis vector and the L-1 norm for the weight vector; • It has been shown that if the number of relevant weak hypotheses k is a small fraction of the total number of weak hypotheses, then the margin associated with Boosting will be much larger than the one associated with SVM; • SVM corresponds to quadratic programming, while Boosting corresponds only to linear programming; • Through the method of kernels, SVM performs low-dimensional calculations that are mathematically equivalent to inner products in a high-dimensional ‘virtual’ space; Boosting instead employs greedy search: the re-weighting of the examples changes the distribution with respect to which the correlation is measured, thus guiding the weak learner to find different correlated coordinates.
  48. 48. • With discrete stochastic processes, arbitrary numbers of very weak models are generated and combined to separate the points in multi-dimensional spaces. ◦ Can be regarded as a method of dimensionality reduction; ◦ “Uniformity”: two points from the same class are equally likely to be captured by a weak model of a given size; ◦ “Enrichment”: weak models do not have the same chance of capturing points from different classes. • Stochastic discrimination (SD) has the property that the very complex and accurate classifiers produced in this way retain the ability, characteristic of their weak component pieces, to generalize to new data; • It is in combining these weak models that the discriminative power is developed. • SD simply transforms the multi-dimensional feature vectors to points coming from two univariate normal distributions; • These two univariate normal distributions are separated further as the number of weak models increases, which intuitively is similar to how people acquire knowledge.
  49. 49. • Classifiers are constructed in random subspaces of the data feature space and are usually combined by simple majority voting in the final decision rule; • The method also relies on a stochastic process that randomly selects a number of components of the given feature vector in constructing each classifier; • Geometrically, this is equivalent to projecting all the points onto the selected subspace; • The random subspace method effectively takes advantage of high dimensionality.
  50. 50. • Defined as a set {D, X, Y} such that DX = Y (D: dictionary, X: sparse codes, Y: signals)
  51. 51. • Given a D and yi, how to find xi ? • Constraint : xi is sufficiently sparse; • Finding exact solution is difficult; • Approximate a solution good enough?
  52. 52. • Greedy methods: projecting the residual on some atom; ◦ Matching pursuit, orthogonal matching pursuit; • L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO); ◦ The residual is updated iteratively in the direction of the atom; • Gradient-based methods finding new search directions ◦ Projected gradient descent ◦ Coordinate descent • Homotopy: a set of solutions indexed by a parameter (regularization) ◦ LARS (Least Angle Regression) • First-order/proximal methods: generalized gradient descent ◦ solving the proximal operator efficiently ◦ soft-thresholding for the L1-norm ◦ Accelerated by the Nesterov optimal first-order method • Iterative reweighting schemes ◦ L2-norm: Chartrand and Yin (2008) ◦ L1-norm: Candès et al. (2008)
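
As an illustration of the proximal route listed above, the sketch below implements soft-thresholding (the proximal operator of the L1-norm) and a plain, non-accelerated proximal gradient loop, often called ISTA, for the LASSO objective 0.5*||y - Dx||^2 + lam*||x||_1. The function names are assumptions, and the step size is assumed to be at most 1/L, where L is the largest eigenvalue of D^T D.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, y, lam, step, iters=500):
    """Minimize 0.5 * ||y - D x||^2 + lam * ||x||_1; D is a list of rows."""
    m, k = len(D), len(D[0])
    x = [0.0] * k
    for _ in range(iters):
        # Gradient of the smooth part: D^T (D x - y).
        r = [sum(D[i][j] * x[j] for j in range(k)) - y[i] for i in range(m)]
        grad = [sum(D[i][j] * r[i] for i in range(m)) for j in range(k)]
        # Gradient step followed by the proximal (soft-thresholding) step.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(k)]
    return x


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(ista(identity, [3.0, 0.5, -2.0], lam=1.0, step=1.0))   # [2.0, 0.0, -1.0]
```

With an identity dictionary the solution is simply the soft-thresholded signal, which shows how the L1 penalty sets small coefficients exactly to zero.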
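  53. 53. Greedy pursuit loop (input: D, y; output: x): (1) select the atom dk with the maximum projection on the residue; (2) solve xk = arg min ||y − Dk xk||; (3) update the residue r = y − Dk xk; (4) check the terminating condition and repeat from (1) if it is not met.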
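
The loop above (select an atom, re-fit by least squares on the selected atoms, update the residue, check termination) can be sketched as orthogonal matching pursuit. In this example the atoms are assumed to have unit norm, the least-squares step is solved through the normal equations with a small Gaussian-elimination helper, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def omp(atoms, y, max_atoms, tol=1e-10):
    """atoms: list of unit-norm atoms; returns {atom index: coefficient}."""
    max_atoms = min(max_atoms, len(atoms))
    residual = list(y)
    support, coeffs = [], []
    while len(support) < max_atoms and dot(residual, residual) > tol:
        # (1) Select the unused atom with the largest |projection| on the residue.
        k = max((j for j in range(len(atoms)) if j not in support),
                key=lambda j: abs(dot(atoms[j], residual)))
        support.append(k)
        # (2) Least squares on the selected atoms via the normal equations.
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        coeffs = solve(gram, [dot(atoms[i], y) for i in support])
        # (3) Update the residue r = y - D_k x_k.
        residual = [y[t] - sum(c * atoms[i][t] for c, i in zip(coeffs, support))
                    for t in range(len(y))]
    return dict(zip(support, coeffs))


s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [s, s, 0.0]]
y = [2.0 * s, 2.0 * s, 3.0]            # 3 * atoms[2] + 2 * atoms[3]
print(omp(atoms, y, max_atoms=2))
```

Re-solving the least-squares problem over the whole support at every step is what distinguishes this from plain matching pursuit, which only updates the coefficient of the newly selected atom.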
  54. 54.  What D to use?  A fixed overcomplete set of basis: no adaptivity.  Steerable wavelet;  Bandlet, curvelet, contourlet;  DCT Basis;  Gabor function;  ….  Data adaptive dictionary – learn from data;  K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). ◦ An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary.  Other methods of dictionary learning: ◦ non-negative matrix decompositions. ◦ sparse PCA (sparse dictionaries). ◦ fused-lasso regularizations (piecewise constant dictionaries)  Extending the models: Sparsity + Self-similarity=Group Sparsity
  55. 55. • Select atoms from input; • Atoms can be image patches; • Patches are overlapping. Initialize Dictionary Sparse Coding (OMP) Update Dictionary One atom at a time • Use OMP or any pursuit method; • Output sparse code for all signals; • Minimize representation error.
  56. 56.  Representation learning attempts to automatically learn good features or representations;  Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features);  Become effective via unsupervised pre-training + supervised fine tuning; ◦ Deep networks trained with back propagation (without unsupervised pre- training) perform worse than shallow networks.  Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);  Semi-supervised: structure of manifold assumption; ◦ labeled data is scarce and unlabeled data is abundant.
  57. 57. • Supervised training of deep models (e.g. many-layered nets) is too hard (optimization problem); ◦ Learn a prior from unlabeled data; • Shallow models are not suited to learning high-level abstractions; ◦ Ensembles or forests do not learn features first; ◦ Graphical models could be deep nets, but mostly are not. • Unsupervised learning could be “local learning”; ◦ Resembles boosting, with each layer being like a weak learner • Learning is weak in directed graphical models with many hidden variables; ◦ Sparsity and regularizer. • Traditional unsupervised learning methods do not easily learn multiple levels of representation. ◦ Layer-wise unsupervised learning is the solution. • Multi-task learning (transfer learning and self-taught learning); • Other issues: scalability & parallelism with the burden from big data.
  58. 58.  A neural network = running several logistic regressions at the same time; ◦ Neuron=logistic regression or…  Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule) ◦ Online learning: stochastic/incremental gradient descent; ◦ Batch learning: conjugate gradient descent.
  59. 59.  CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized neural input; ◦ local receptive fields(shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling; ◦ Related to generative MRF/discriminative CRF:  CNN=Field of Experts MRF=ML inference in CRF; ◦ Generate ‘patterns of patterns’ for pattern recognition.  Each layer combines (merge, smooth) patches from previous layers ◦ Pooling /Sampling (e.g., max or average) filter: compress and smooth the data. ◦ Convolution filters: (translation invariance) unsupervised; ◦ Local contrast normalization: increase sparsity, improve optimization/invariance. C layers convolutions, S layers pool/sample
  60. 60.  Convolutional Networks are trainable multistage architectures composed of multiple stages;  Input and output of each stage are sets of arrays called feature maps;  At output, each feature map represents a particular feature extracted at all locations on input;  Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;  A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module; ◦ A fully connected layer: softmax transfer function for posterior distribution.  Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map;  Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function; ◦ In the rectified function, gi is a trainable gain parameter; it might be followed by a contrast normalization N;  Feature pooling: treats each feature map separately -> a reduced-resolution output feature map;  Supervised training is performed using a form of SGD to minimize the prediction error; ◦ Gradients are computed with the back-propagation method.  Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-tuning. * is discrete convolution operator
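One stage of the kind described above (filter bank, rectified tanh, pooling) can be sketched as follows. This is an illustrative single-filter version under stated assumptions: the 3x3 kernel, the fixed gain of 1.0 (standing in for the trainable g_i), the toy image and the choice of average pooling are all hypothetical, and contrast normalization is omitted.

```python
import math

def conv2d_valid(image, kernel):
    """Discrete 2-D convolution (kernel flipped), 'valid' region only."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]

def rectified_tanh(fmap, gain=1.0):
    """Pointwise abs(gain * tanh(x)); `gain` stands in for the trainable g_i."""
    return [[abs(gain * math.tanh(v)) for v in row] for row in fmap]

def average_pool(fmap, size=2):
    """Non-overlapping average pooling giving a reduced-resolution map."""
    return [[sum(fmap[r + i][c + j] for i in range(size) for j in range(size)) / size ** 2
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Hypothetical 6x6 input and a 3x3 vertical-edge filter.
image = [[float((r + c) % 3) for c in range(6)] for r in range(6)]
kernel = [[1.0, 0.0, -1.0]] * 3
stage_output = average_pool(rectified_tanh(conv2d_valid(image, kernel)))
```

A real filter bank applies many such kernels and sums over input feature maps; the sketch keeps one input map and one kernel so the three layers of a stage stay visible.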
  61. 61.  A layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits;  Local receptive fields (5x5) with local connections;  Output via an RBF function, one for each class, with 84 inputs each;  Learning by Graph Transformer Networks (GTN).
  62. 62.  A layered model composed of convolution and subsampling, followed by a holistic representation and all-in-all a landmark classifier;  Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax;  Fully-connected “FULL” layers: linear classifiers/matrix multiplications;  ReLUs are rectified-linear nonlinearities on layer output; networks using them can be trained several times faster;  Local normalization scheme aids generalization;  Overlapping pooling is slightly less prone to overfitting;  Data augmentation: artificially enlarge the dataset using label-preserving transformations;  Dropout: setting to zero the output of each hidden neuron with prob. 0.5;  Trained by SGD with batch size 128, momentum 0.9, weight decay 0.0005.
  63. 63. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.
  64. 64.  Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;  Preprocessing: subtracting a per-pixel mean;  Data augmentation: the image is downsampled to 256 pixels, then a random 224 pixel crop is taken and randomly flipped horizontally to provide more views of each example;  SGD with mini-batch size 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting;  65M parameters trained for 12 days on a single Nvidia GPU;  Visualization by layered DeconvNets: project the feature activations back to the input pixel space; ◦ Reveal input stimuli exciting individual feature maps at any layer; ◦ Observe evolution of features during training; ◦ Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important;  A DeconvNet is attached to each ConvNet layer; unpooling uses locations of maxima to preserve structure;  Multiple such models were averaged together to further boost performance;  Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
  65. 65. Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i) via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized 55x55 feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C - number of classes.
  66. 66. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct approximate version of convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
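The switch-based unpooling described in the caption can be sketched as below. This is a simplified illustration, assuming non-overlapping 2x2 pooling on a single feature map (the network on the previous slide uses 3x3 pooling with stride 2); the function names and the toy map are hypothetical.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    pooled = [[0.0] * cols for _ in range(rows)]
    switches = [[(0, 0)] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cells = [(fmap[r * size + i][c * size + j], (r * size + i, c * size + j))
                     for i in range(size) for j in range(size)]
            pooled[r][c], switches[r][c] = max(cells, key=lambda cell: cell[0])
    return pooled, switches

def unpool(pooled, switches, shape):
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    out = [[0.0] * shape[1] for _ in range(shape[0])]
    for r, row in enumerate(pooled):
        for c, value in enumerate(row):
            i, j = switches[r][c]
            out[i][j] = value
    return out

fmap = [[1.0, 3.0, 0.0, 2.0],
        [4.0, 2.0, 1.0, 0.0],
        [0.0, 1.0, 5.0, 6.0],
        [2.0, 0.0, 7.0, 1.0]]
pooled, switches = max_pool_with_switches(fmap)
restored = unpool(pooled, switches, (4, 4))
```

The restored map is sparse: only the positions that won the pooling competition are non-zero, which is how the structure of the stimulus is preserved.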
  67. 67. • A hybrid model: can be trained as generative or discriminative model; • Deep architecture: multiple layers (learn features layer by layer); • Multi layer learning is difficult in sigmoid belief networks. • Top two layers are undirected connections, Restricted Boltzmann Machine (RBM); • Lower layers get top down directed connections from layers above; • Unsupervised or self-taught pre-learning provides a good initialization; • Greedy layer-wise unsupervised training for RBM; • Supervised fine-tuning • Generative: wake-sleep algorithm (up-down); • Discriminative: back propagation (bottom-up); Belief net is a directed acyclic graph composed of stochastic variables.
  68. 68. • Boltzmann machine is a stochastic recurrent model, and RBM is its special case (one hidden layer); • Learning internal representations that become increasingly complex; • High-level representations built from a large supply of unlabeled inputs; • Pre-training: learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph); • Generative fine-tuning: different from DBN • Positive and negative phase • Discriminative fine-tuning: the same as DBN • Back propagation.
  69. 69. • Denoising Auto-Encoder: Multilayer NNs with target output=input; • Auto-encoder learns the salient variation like a nonlinear PCA; • Stack many (may be sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning • Drop the decode layer each time • Performs better than stacking RBMs; • Supervised training on the last layer using final features; • (option) Supervised training on the entire network to fine-tune all weights of the neural net; • Empirically not quite as accurate as DBNs.
  70. 70. Stochastic Gradient Descent (SGD) • The general class of estimators that arise as minimizers of sums is called M-estimators; • Where are stationary points of the likelihood function (or zeroes of its derivative, the score function)? • Online gradient descent samples a subset of summand functions at every step; • The true gradient of the objective is approximated by the gradient at a single example; • Shuffling of training set at each pass. • There is a compromise between the two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples. • SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum, provided the learning rate decreases at an appropriate rate.
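The mini-batch compromise can be illustrated on a toy problem. The sketch below fits a line by least squares; the data, the function name `sgd_linear`, the fixed learning rate and the batch size are illustrative assumptions, and the batch gradient is averaged rather than summed so the step size does not depend on the batch size.

```python
import random

def sgd_linear(data, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # reshuffle the training set at each pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # The mini-batch average approximates the true (full-data) gradient.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free samples of the hypothetical target y = 3x - 1.
points = [(x / 10.0, 3.0 * (x / 10.0) - 1.0) for x in range(-10, 11)]
w, b = sgd_linear(points)
```

Setting `batch_size=1` gives the purely online form, and `batch_size=len(points)` gives batch gradient descent.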
  71. 71. Back Propagation • Back propagation is a multi-layer network training method • We find parameters W to minimize an error E • For this we do iterative gradient descent: $w(t) = w(t-1) - \lambda \, \frac{\partial E}{\partial w}(t)$ • Error propagation • Forward propagation of a training pattern's input through the multilayer network to generate the output activations; • Backward propagation of the output activations (logistic or soft-max) through the multilayer network using the pattern target to generate deltas of all output and hidden units (the chain rule); • Weight update • Multiply its output delta and input activation to get the weight gradient; • Subtract a ratio (i.e. the learning rate) of the gradient from the weight. $\frac{\partial E}{\partial y_{l-1}} = \frac{\partial E}{\partial y_l} \times \frac{\partial y_l(w, y_{l-1})}{\partial y_{l-1}}$, $\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial y_l} \times \frac{\partial y_l(w, y_{l-1})}{\partial w_l}$, $E(f(x_0, w), y_0) = -\log(f(x_0, w) - y_0)$.
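The forward pass, delta propagation and weight update can be checked on the smallest possible network. The sketch below assumes one sigmoid hidden unit, one linear output unit and a squared-error loss (a swapped-in choice for illustration, not the loss written on the slide); all values are hypothetical. A finite-difference estimate is computed alongside to confirm the chain-rule gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Two-layer chain: hidden h = sigmoid(w1*x), output y = w2*h."""
    h = sigmoid(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, t):
    """Squared error E = 0.5*(y - t)^2 (an assumption for this sketch)."""
    return 0.5 * (forward(w1, w2, x)[1] - t) ** 2

def backprop(w1, w2, x, t):
    """Return (dE/dw1, dE/dw2) via the chain rule."""
    h, y = forward(w1, w2, x)
    dE_dy = y - t                        # delta at the output unit
    dE_dw2 = dE_dy * h                   # output delta times input activation
    dE_dh = dE_dy * w2                   # propagate the delta to the hidden layer
    dE_dw1 = dE_dh * h * (1.0 - h) * x   # sigmoid derivative is h*(1-h)
    return dE_dw1, dE_dw2

w1, w2, x, t, lr = 0.3, -0.7, 1.5, 0.4, 0.1
g1, g2 = backprop(w1, w2, x, t)
eps = 1e-6
num_g1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
num_g2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
new_w1, new_w2 = w1 - lr * g1, w2 - lr * g2   # w(t) = w(t-1) - lr * dE/dw
```

The same pattern extends layer by layer: each layer receives the error derivative with respect to its output and passes back the derivative with respect to its input.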
  72. 72.  Euclidean loss is used for regressing to real-valued labels in (-inf, inf);  Sigmoid cross-entropy loss is used for predicting K independent probability values in [0,1];  Softmax (normalized exponential) loss is used for predicting a single class out of K mutually exclusive classes; ◦ Generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to a K-dimensional vector of real values σ(z) in the range (0, 1). ◦ The predicted probability for the j'th class given a sample vector x is $P(y = j \mid x) = \sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}$, where z is the vector of class scores computed from x.  Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset.
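The three losses can be written compactly as below. This is a minimal sketch with hypothetical function names; the half-squared-distance convention for the Euclidean loss and the max-subtraction used for numerical stability are choices made here, not taken from the slide.

```python
import math

def softmax(z):
    """Normalized exponential; subtracting max(z) avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(z, target):
    """Negative log-probability of the true class index `target`."""
    return -math.log(softmax(z)[target])

def sigmoid_cross_entropy(z, targets):
    """Sum of K independent binary cross-entropies, targets in [0, 1]."""
    probs = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return -sum(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
                for p, t in zip(probs, targets))

def euclidean_loss(pred, target):
    """Half the squared distance between real-valued vectors."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))

probs = softmax([2.0, 1.0, 0.1])
```

Softmax outputs always sum to one, so they suit mutually exclusive classes; the sigmoid form treats each of the K outputs as a separate yes/no question.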
  73. 73.  Too large a learning rate ◦ causes oscillation in searching for the minimal point  Too small a learning rate ◦ too slow convergence to the minimal point  Adaptive learning rate ◦ At the beginning, the learning rate can be large when the current point is far from the optimal point; ◦ Gradually, the learning rate will decay as time goes by.  Should not be too large or too small: ◦ annealing rate $\alpha(t) = \alpha(0) / (1 + t/T)$ ◦ $\alpha(t)$ will eventually go to zero, but at the beginning it is almost a constant.
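The annealing schedule is easy to tabulate. The values of alpha(0) and T below are hypothetical; the sketch only shows that the rate stays close to alpha(0) while t is much smaller than T, halves at t = T, and then decays toward zero.

```python
def annealed_rate(alpha0, t, T):
    """Learning rate alpha(t) = alpha(0) / (1 + t/T)."""
    return alpha0 / (1.0 + t / T)

# Hypothetical settings: alpha(0) = 0.1 and T = 1000 iterations.
schedule = [annealed_rate(0.1, t, T=1000) for t in (0, 10, 1000, 100000)]
```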
  74. 74.  Classical Momentum (CM) is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations: given the objective function $f(\theta)$, $V_{t+1} = \mu V_t - \varepsilon \nabla f(\theta_t)$, $\theta_{t+1} = \theta_t + V_{t+1}$, with $\varepsilon > 0$ as learning rate, $\mu \in [0,1]$ as momentum coefficient and $\nabla f(\theta_t)$ as gradient at $\theta_t$;  Nesterov’s Accelerated Gradient (NAG) is also a 1st order optimization method, with a better convergence rate guarantee than gradient descent: $V_{t+1} = \mu V_t - \varepsilon \nabla f(\theta_t + \mu V_t)$, $\theta_{t+1} = \theta_t + V_{t+1}$;  For convex objectives, momentum-based methods outperform SGD in the early or transient stages of optimization, but are only equally effective in the final stage;  Hessian-free (HF) methods and truncated Newton methods work by optimizing a local quadratic model of the objective via the linear conjugate gradient (CG) algorithm; ◦ If CG is terminated after just one step, HF becomes equivalent to NAG.
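The only difference between the two updates is where the gradient is evaluated. The sketch below runs both on a hypothetical one-dimensional objective f(θ) = 0.5·θ²; the learning rate, momentum coefficient and starting point are illustrative assumptions.

```python
def grad(theta):
    """Gradient of the toy objective f(theta) = 0.5 * theta**2."""
    return theta

def classical_momentum(theta, v, lr, mu):
    v = mu * v - lr * grad(theta)            # gradient at the current point
    return theta + v, v

def nesterov(theta, v, lr, mu):
    v = mu * v - lr * grad(theta + mu * v)   # gradient at the look-ahead point
    return theta + v, v

def run(step, steps=200, theta=5.0, lr=0.1, mu=0.9):
    """Apply `step` repeatedly from a zero initial velocity."""
    v = 0.0
    for _ in range(steps):
        theta, v = step(theta, v, lr, mu)
    return theta

theta_cm = run(classical_momentum)
theta_nag = run(nesterov)
```

On this quadratic both methods converge to the minimum at zero; NAG's look-ahead gradient damps the oscillation that the accumulated velocity would otherwise cause.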
  75. 75.  AdaGrad: asymptotically sublinear regret, adapts the learning rate for each weight based on historical info.: $\Delta W_{ij}(t+1) = -\frac{\gamma}{\sqrt{\sum_{\tau=1}^{t+1} \left( \frac{\partial E}{\partial w_{ij}}(\tau) \right)^2}} \, \frac{\partial E}{\partial w_{ij}}(t+1)$ ◦ Normalizes each coordinate of the gradient by the historical (previous iterations) magnitude of that coordinate; ◦ Frequently occurring features in the gradients get small learning rates and infrequent features get higher ones; ◦ Sensitive to initial conditions, continual decay of learning rate.  AdaDelta: accumulate the denominator over the last k gradients (a sliding window): $\alpha(t+1) = \sum_{\tau=t-k+1}^{t+1} \left( \frac{\partial E}{\partial w}(\tau) \right)^2$, $\Delta W(t+1) = -\frac{\gamma}{\sqrt{\alpha(t+1)}} \, \frac{\partial E}{\partial w}(t+1)$. ◦ This requires keeping the last k gradients; instead it uses a simpler formula: $\beta(t+1) = \rho \, \beta(t) + (1-\rho) \left( \frac{\partial E}{\partial w}(t+1) \right)^2$, $\Delta W(t+1) = -\frac{\gamma}{\sqrt{\beta(t+1) + \epsilon}} \, \frac{\partial E}{\partial w}(t+1)$. ◦ Avoids AdaGrad’s weakness.
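The two per-weight scalings can be compared on a toy objective. The sketch below assumes the square-root form of the denominators and a hypothetical objective E(w) = w²; the small `eps` added to the AdaGrad denominator (to avoid division by zero), the learning rates and the step count are assumptions for illustration.

```python
import math

def adagrad_step(w, g, hist, lr=0.5, eps=1e-8):
    """Scale the step by the root of the sum of all past squared gradients."""
    hist += g * g
    return w - lr * g / (math.sqrt(hist) + eps), hist

def running_average_step(w, g, beta, lr=0.05, rho=0.9, eps=1e-8):
    """Replace the full sum with an exponentially decaying average (beta)."""
    beta = rho * beta + (1.0 - rho) * g * g
    return w - lr * g / math.sqrt(beta + eps), beta

def minimize(step, w=3.0, steps=500):
    """Run `step` on the toy objective E(w) = w**2, whose gradient is 2w."""
    state = 0.0
    for _ in range(steps):
        g = 2.0 * w
        w, state = step(w, g, state)
    return w

w_adagrad = minimize(adagrad_step)
w_running = minimize(running_average_step)
```

With the full sum, the effective learning rate can only shrink over time; with the decaying average, old gradients are forgotten, so the step size can recover when gradients become small.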
  76. 76.  The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations;  Perturbing an image I by transformations that leave the underlying class unchanged (e.g. cropping and flipping) in order to generate additional examples of the class;  Two distinct forms of data augmentation: ◦ image translations and horizontal reflections ◦ changing RGB intensities
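The first form of augmentation (translation via random crops, plus horizontal reflection) can be sketched on a toy image. The 8x8 single-channel "image", the crop size and the function name are hypothetical; a real pipeline would apply the same idea to each color channel of a full-size image.

```python
import random

def random_crop_flip(image, crop, rng):
    """Take a random crop x crop window and mirror it left-right half the time."""
    top = rng.randrange(len(image) - crop + 1)
    left = rng.randrange(len(image[0]) - crop + 1)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]   # horizontal reflection
    return patch

# Toy 8x8 image whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
rng = random.Random(0)
views = [random_crop_flip(image, 6, rng) for _ in range(5)]
```

Every view keeps the label of the source image, so the training set grows without any new annotation.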
  77. 77.  Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification, ◦ Weight decay acts as a rescaling of the weights in the learning rule, while bias learning stays the same; ◦ Prefers to learn small weights; large weights are allowed if they improve the original cost function; ◦ A way of compromising between finding small weights and minimizing the original cost function;  In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;  L1 regularization: the weights that are not really useful shrink by a constant amount toward zero; ◦ Acts like a form of feature selection; ◦ Makes the input filters cleaner and easier to interpret;  L2 regularization penalizes large values strongly, while L1 regularization penalizes all weights at a constant rate and so drives small weights to zero;  Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights & hyper-parameters;  Hybrid Monte Carlo: gradient and sampling.
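The "rescaling" view of weight decay follows directly from the gradient of the penalty. The sketch below assumes a penalty of the form 0.5·decay·w² added to the error, so its gradient is decay·w; the function name and numbers are hypothetical.

```python
def sgd_step_with_weight_decay(w, b, grad_w, grad_b, lr=0.1, decay=0.0005):
    """One SGD step with an L2 penalty 0.5*decay*w^2 on the weights only.

    The penalty adds decay*w to the weight gradient, which is the same as
    rescaling w by (1 - lr*decay) before the usual step; the bias update
    is left unchanged.
    """
    new_w = [(1.0 - lr * decay) * wi - lr * gi for wi, gi in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b

# With zero data gradient, the weights simply shrink toward zero.
w, b = sgd_step_with_weight_decay([2.0, -4.0], 1.0, [0.0, 0.0], 0.0,
                                  lr=0.1, decay=0.5)
```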
  78. 78.  Steps in early stopping: ◦ Divide the available data into training and validation sets. ◦ Use a large number of hidden units. ◦ Use very small random initial values. ◦ Use a slow learning rate. ◦ Compute the validation error rate periodically during training. ◦ Stop training when the validation error rate "starts to go up".  Early stopping has several advantages: ◦ It is fast. ◦ It can be applied successfully to networks in which the number of weights far exceeds the sample size. ◦ It requires only one major decision by the user: what proportion of validation cases to use.  Practical issues in early stopping: ◦ How many cases do you assign to the training and validation sets? ◦ Do you split the data into training and validation sets randomly or by some systematic algorithm? ◦ How do you tell when the validation error rate "starts to go up"?
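One common answer to the question of when the validation error "starts to go up" is a patience rule: stop once the error has failed to improve for a fixed number of consecutive checks. The sketch below is one such rule under that assumption; the `patience` parameter and the validation curve are hypothetical.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_step, best_error), stopping once the validation error has
    failed to improve for `patience` consecutive checks."""
    best_step, best_error, bad_checks = 0, float("inf"), 0
    for step, err in enumerate(val_errors):
        if err < best_error:
            best_step, best_error, bad_checks = step, err, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break   # the remaining checks are never evaluated
    return best_step, best_error

# Hypothetical validation errors measured periodically during training.
curve = [0.90, 0.70, 0.55, 0.50, 0.52, 0.49, 0.51, 0.53, 0.56, 0.40]
best_step, best_error = early_stopping(curve)
```

In this example training stops after the ninth check, so the lower error at the last position is never reached: a small patience saves time but can miss later improvements.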
  79. 79.  Dropout: set the output of each hidden neuron to zero with probability 0.5. ◦ Motivation: Combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging. ◦ The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation. ◦ So every time an input is presented, the NN samples a different architecture, but all these architectures share weights. ◦ This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units. ◦ It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units. ◦ Without dropout, the network exhibits substantial overfitting. ◦ Dropout roughly doubles the number of iterations required to converge.  Maxout takes the maximum across multiple feature maps.
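A dropout layer can be sketched in a few lines. The training branch follows the description above; the test-time branch, which keeps all units and scales outputs by the keep probability, is the usual companion rule and is included here as an assumption since the slide does not state it. Names and sizes are hypothetical.

```python
import random

def dropout(activations, rng, p_drop=0.5, train=True):
    """Zero each hidden output with probability p_drop during training.

    At test time all units are kept and outputs are scaled by (1 - p_drop),
    which approximates averaging the sampled sub-networks.
    """
    if not train:
        return [a * (1.0 - p_drop) for a in activations]
    return [0.0 if rng.random() < p_drop else a for a in activations]

rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
test_time = dropout(hidden, rng, train=False)
```

Because a fresh mask is drawn for every input, each presentation trains a different thinned network that shares weights with all the others.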
  80. 80.  Markov Chain: a stochastic process in which future states are independent of past states given the present state. ◦ A Markov chain will typically converge to a stable distribution.  Markov Chain Monte Carlo: sampling using ‘local’ information ◦ Devise a Markov chain whose stationary distribution is the target.  Ergodic MC must be aperiodic, irreducible, and positive recurrent. ◦ Monte Carlo Integration to get quantities of interest.  Metropolis-Hastings method: sampling from a target distribution ◦ Create a Markov chain whose transition matrix does not depend on the normalization term. ◦ Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio). ◦ After a sufficient number of iterations, the chain will converge to the stationary distribution.  Gibbs sampling is a special case of M-H Sampling. ◦ The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distribution.  Hybrid Monte Carlo: gradient sub step for each Markov chain.
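The Metropolis-Hastings recipe can be sketched with a random-walk proposal. Because the proposal is symmetric, the accept ratio reduces to a ratio of unnormalized target densities, so the normalization term never appears. The target (a Gaussian with mean 2), the step size, the burn-in length and the function name are illustrative assumptions.

```python
import math
import random

def metropolis_hastings(log_unnorm, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis sampler using only unnormalized log-densities."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)                # symmetric proposal
        log_ratio = log_unnorm(proposal) - log_unnorm(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):   # accept w.p. min(1, ratio)
            x = proposal
        samples.append(x)                                  # rejected moves repeat x
    return samples

# Unnormalized Gaussian target with mean 2 and unit variance.
samples = metropolis_hastings(lambda x: -0.5 * (x - 2.0) ** 2, 20000)
kept = samples[2000:]                      # discard burn-in
mean_estimate = sum(kept) / len(kept)      # Monte Carlo integration
```

The sample average after burn-in is the Monte Carlo estimate of the target mean; other quantities of interest are estimated the same way.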
  81. 81.  Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution;  Mean Field replaces M with a (simple) subset M(F), on which A* (μ) is a closed form (Note: F is a disconnected graph); ◦ Density becomes a factorized product distribution in this sub-family. ◦ Objective: K-L divergence.  Mean field is a structured variational approximation approach: ◦ Coordinate ascent (deterministic);  Compared with stochastic approximation (sampling): ◦ Faster, but maybe not exact.
  82. 82.  Contrastive divergence (CD) was first proposed for training PoE, and is also a quicker way to learn RBMs; ◦ Contrastive divergence as the new objective; ◦ Taking gradients and ignoring a term which is usually very small.  Steps: ◦ Start with a training vector on the visible units. ◦ Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.  Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling);  CD learning is biased: it does not behave as exact gradient descent  Improved: Persistent CD explores more modes in the distribution ◦ Rather than from data samples, begin sampling from the model samples obtained at the last gradient update. ◦ Still suffers from divergence of likelihood due to missing the modes.  Score matching: the score function does not depend on its normalization factor, so match it between the model and the empirical density.
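The CD steps can be sketched for a tiny binary RBM. This is a simplified CD-1 illustration under stated assumptions: biases are omitted, hidden probabilities (rather than samples) are used in the weight statistics, and the layer sizes, learning rate and training vectors are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(v, W, rng):
    """Sample all hidden units in parallel given the visible vector."""
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(W[0]))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]

def sample_visible(h, W, rng):
    """Sample all visible units in parallel given the hidden vector."""
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(W))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_update(v0, W, rng, lr=0.1):
    """One CD-1 step: data statistics minus one-step reconstruction statistics."""
    p_h0, h0 = sample_hidden(v0, W, rng)   # start from a training vector
    _, v1 = sample_visible(h0, W, rng)     # reconstruct the visible units
    p_h1, _ = sample_hidden(v1, W, rng)    # hidden probabilities again
    return [[W[i][j] + lr * (v0[i] * p_h0[j] - v1[i] * p_h1[j])
             for j in range(len(W[0]))] for i in range(len(W))]

rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3 visible units, 2 hidden units
for _ in range(200):
    for v in ([1.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
        W = cd1_update(v, W, rng)
```

Persistent CD would differ only in the starting point of the negative phase: instead of reconstructing from `v0`, it would continue a chain kept from the previous update.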
  83. 83.  Pre-trained DBN is a generative model;  Do a stochastic bottom-up pass (wake phase) ◦ Get samples from factorial distribution (visible first, then generate hidden); ◦ Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.  Do a few iterations of sampling in the top level RBM ◦ Adjust the weights in the top-level RBM.  Do a stochastic top-down pass (sleep phase) ◦ Get visible and hidden samples generated by generative model using data coming from nowhere! ◦ Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. ◦ Any guarantee for improvement? No!  The “Wake-Sleep” algorithm tries to describe the representation economically (Shannon’s coding theory).
  84. 84.  Deep networks tend to have more local minima problems than shallow networks during supervised training  Train first layer using unlabeled data ◦ Supervised or semi-supervised: use more unlabeled data.  Freeze the first layer parameters and train the second layer  Repeat this for as many layers as desired ◦ Build more robust features  Use the outputs of the final layer to train the last supervised layer (leave early weights frozen)  Fine tune the full network with a supervised approach;  Avoids the problems of training a deep net in a supervised fashion. ◦ Each layer gets full learning ◦ Helps with ineffective early layer learning ◦ Helps with deep network local minima
  85. 85.  Take advantage of the unlabeled data;  Regularization Hypothesis ◦ Pre-training is “constraining” parameters in a region relevant to unsupervised dataset; ◦ Better generalization (representations that better describe unlabeled data are more discriminative for labeled data) ;  Optimization Hypothesis ◦ Unsupervised training initializes lower level parameters near localities of better minima than random initialization can.  Only need fine tuning in the supervised learning stage.
  86. 86.  Pre-training in one stage ◦ Positive phase: clamp observed, sample hidden, using variational approximation (mean-field) ◦ Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC)  Pre-training in two stages ◦ Approximating a posterior distribution over the states of hidden units (a simpler directed deep model such as DBNs or stacked DAE); ◦ Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of hidden units.  Options (CAST, contrastive divergence, stochastic approximation…).