Machine learning based contour boundary detection from images

Keywords: machine learning, scene understanding, static segmentation, Gestalt cues, superpixels, logistic regression, MRF, CRF, manifold learning, ensemble learning, k-means, SVM, Naive Bayes, sparse coding, K-SVD, orthogonal matching pursuit, deep learning, RBM, DBM, DBN, SAE.


Transcript

  • 1. Yu Huang Sunnyvale, California Yu.huang07@gmail.com
  • 2.  Edges, Contours and Boundaries  Finding Meaningful Contours  Static Segmentation (Regions)  Classical Gestalt Cues  Berkeley Segmentation Data Set  Learning for Scene Segmentation  Learn a Local Boundary Model  Image Figure/Ground Assignment  Learning Edges and Boundaries  Sparse Models for Edge Detection  Boundary Detection and Grouping  Sparse Coding for Contour Detection  Sketch Tokens for Contour Detection  Deep learning shape prior for segmentation  Deep neural prediction network for visual boundary  References  Appendix
  • 3.  Edges: significant local changes in the image; they occur on the boundary between two different regions.  Contour: a representation of linked edges for a region boundary. ◦ Closed: corresponds to a region boundary; a filling algorithm determines the pixels inside the region. ◦ Open: part of a region boundary; gaps form due to a high edge-detection threshold or weak contrast. Open contours also occur when line fragments are linked together, as in drawing or handwriting.  Contour representation: ◦ Ordered list of edges (chain codes) ◦ Curve model for a contour (piecewise line segments or cubic splines)
  • 4.  Local edge detection ◦ Problems: false targets, misses  One solution: use other cues (image segmentation) ◦ Texture: sharp changes in the orientation or scale of textures ◦ Motion: two or more frames ◦ Disparity: stereo (Figure: left-eye/right-eye stereo pair; frames 1 and 2 of a motion sequence.)
  • 5.  Regional approaches (split-merge, watershed, mean shift, ...)  Use regional info, optimize the labelling of regional tokens, e.g. clustering  Depend on uniformity within the object region  Active contour models (snakes) ◦ Use regional (external) & boundary (internal) info, optimize the deformation of the model ◦ Sensitive to initialization, tends to over-smooth  Level set (implicit active contour)  Handles topological changes naturally  Not robust to boundary gaps  Contour grouping  Use boundary info (& regional info), optimize the grouping of contour fragments  Learning-based: boundary detection.
  • 6. How is grouping done in human vision?  Proximity  Similarity ◦ Brightness ◦ Contrast  Good continuation ◦ Parallelism ◦ Co-circularity
  • 7.  Two-class classification model  Over-segmentation as preprocessing  Use classical Gestalt cues ◦ Contour, texture, brightness and continuation  A linear classifier (logistic regression) is used for training (Figure: superpixel map with K=200 and a reconstruction of a human segmentation from superpixels; superpixels are local, coherent, and preserve structure such as contours and texture.)
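As a rough illustration of the classifier on this slide, the hedged sketch below trains a logistic regression on per-superpixel cue features; the feature matrix, labels, and the four-cue layout are synthetic placeholders, not the paper's actual features.

```python
# Hedged sketch: logistic-regression boundary/segment classifier over
# Gestalt cue features (contour, texture, brightness, continuation).
# X and y are hypothetical placeholders for features computed elsewhere.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))       # 4 cue features per superpixel pair (placeholder)
y = rng.integers(0, 2, size=1000)    # 1 = same segment, 0 = boundary (placeholder)

clf = LogisticRegression(max_iter=1000).fit(X, y)
p_same = clf.predict_proba(X[:5])[:, 1]   # posterior that a pair lies in the same segment
```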
  • 8. Goal: learn the posterior probability of a boundary, Pb(x,y,θ), from local information only. Boundary cues: brightness, color, texture. Challenges: the texture cue and cue combination. (Figure: image → boundary cues (brightness, color, texture) → cue combination → model.)
  • 9.  Human subjects label ground truth figure/ground assignments in natural images.  “Shapemes” encode high-level knowledge in a generic way, capturing local figure/ground cues.  A conditional random field (CRF) incorporates junction cues and enforces global consistency.
  • 10. (Figure: shapemes, i.e. clusters of local shapes; Pb edge maps; human-marked boundaries; color image with contour/junction labels.)
  • 11.  Boosted Edge Learning (BEL): Probabilistic Boosting Tree (PBT) classification;  Features: gradients plus Haar-like responses over a large image patch;  Learns to detect edges from images with labeled ground truth.
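The Probabilistic Boosting Tree is not available in common libraries, so the hedged sketch below uses a generic boosted classifier from scikit-learn as a simplified stand-in for BEL-style patch classification; the patch descriptors and labels are synthetic placeholders.

```python
# Hedged sketch: BEL-style edge classification on image patches, with a
# plain boosted classifier standing in for the paper's PBT. Haar-like and
# gradient patch features are assumed pre-computed into `patch_feats`.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(2000, 50))   # placeholder patch descriptors
is_edge = rng.integers(0, 2, size=2000)     # placeholder ground-truth edge labels

bel = GradientBoostingClassifier(n_estimators=100).fit(patch_feats, is_edge)
edge_prob = bel.predict_proba(patch_feats[:3])[:, 1]   # per-patch edge probability
```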
  • 12. PBT training (figure: the probabilistic boosting tree training procedure).
  • 13.  Sparseland model and dictionary learning by k-SVD;  Edge detection as the pixelwise classification problem; ◦ “patches centered on edge pixel or not”;  Contour training: class specific edge classifier  Shape training: shape-based object classifier  Classification: edge classifier then shape classifier ◦ Bike, Motorbike, Person or Car? Person?
  • 14.  Learning-based boundary detection: SIFT-based features, dimensionality reduction by PCA, boosting (AdaBoost, GentleBoost and MadaBoost);  Boundary grouping: use a normalized saliency criterion and fractional-linear programming to find graph cycles with minimum cost.
  • 15.  Sparse Code Gradients (SCG): by sparse coding (K-SVD);  Gradient and color, plus (optionally) depth & surface normals;  Linear classifier (SVM) with contrast features (SCG);  Optional globalization by computing a spectral gradient (as in gPb).
  • 16.  Definition: straight lines, T-junctions, Y-junctions, corners, curves, parallel lines;  Learned (by k-means clustering) from patches of human-generated contours: on the order of hundreds of classes (150 in the paper); Daisy (MSR) descriptors are used for shift invariance;  Low-level image features: gradient, color, orientation, etc.;  Classifier: a random decision forest labels sketch tokens from image patches. (Are sketch tokens like "shapemes"?)
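A hedged sketch of the two stages described above: k-means clustering of human-drawn contour patches into token classes, followed by a random forest that predicts token labels from low-level patch features. All arrays are synthetic placeholders and the patch size is an arbitrary illustrative choice.

```python
# Hedged Sketch Tokens-style pipeline: cluster contour patches into token
# classes, then train a random forest to predict the token label from
# low-level image features (all data below is random placeholder data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
contour_patches = rng.random((5000, 35 * 35))   # flattened contour patches (placeholder)
tokens = KMeans(n_clusters=150, n_init=10).fit_predict(contour_patches)  # token classes

img_feats = rng.normal(size=(5000, 64))          # gradient/color/orientation features (placeholder)
forest = RandomForestClassifier(n_estimators=25).fit(img_feats, tokens)
token_pred = forest.predict(img_feats[:3])
```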
  • 17.  Use deep Boltzmann machine to learn the hierarchical architecture of shape priors: low level local feature and high level global feature;  Apply the learned architecture to model shape variations of global and local structures;  A data-driven variational method to perform object extraction based on shape probabilistic representation.
  • 18. (Figure panels: original image; result; learned shape; result by sparse learned shape.)
  • 19.  Integration across multiple scales and semantic levels via multiple streams of interlinked, layered, non-linear "deep" processing; ◦ Deep belief net with a variant of the mean-and-covariance RBM;  Unsupervised feature learning; ◦ Supervised boundary prediction by a feed-forward NN.
  • 20.  X. Ren and J. Malik, "Learning a Classification Model for Segmentation", ICCV, 2003.  D. Martin, C. Fowlkes, and J. Malik, "Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues", IEEE T-PAMI, 2004.  P. Dollár, Z. Tu, and S. Belongie, "Supervised Learning of Edges and Object Boundaries", CVPR, 2005.  X. Ren, C. Fowlkes, and J. Malik, "Figure/Ground Assignment in Natural Images", ECCV, 2006.  J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce, "Discriminative Sparse Image Models for Class-Specific Edge Detection and Image Interpretation", ECCV, 2008.  I. Kokkinos, "Highly Accurate Boundary Detection and Grouping", CVPR, 2010.  X. Ren and L. Bo, "Discriminatively Trained Sparse Code Gradients for Contour Detection", NIPS, 2012.  J. Lim, C. L. Zitnick, and P. Dollár, "Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection", CVPR, 2013.  Chen, Yu, Hu, and Zeng, "Deep Learning Shape Priors for Object Segmentation", CVPR, 2013.  Kivinen, Williams, and Heess, "Visual Boundary Prediction: A Deep Neural Prediction Network and Quality Dissection", AISTATS, 2014.
  • 21.  "Machine learning is programming computers to optimize a performance criterion using example data or past experience." ◦ Supervised/unsupervised models: labeled/unlabeled data; ◦ Semi-supervised models: both labeled and unlabeled data; ◦ Online learning: incremental updates; ◦ Ensemble classifiers: bagging, stacking, boosting, random forest, ... ◦ Reinforcement learning: learn by interacting with an environment.  Types of ML algorithms ◦ Prediction: predicting a variable from data ◦ Classification: assigning records to predefined groups ◦ Clustering: splitting records into groups based on similarity ◦ Association learning: seeing what often appears together with what  Relationship with other fields ◦ Artificial intelligence: emulating how the brain works with programs; ML is a branch of AI ◦ Data mining: building models in order to detect patterns; ◦ Statistical analysis: probabilistic models, on which to draw inferences from data; ◦ Information retrieval: retrieval of information from a collection of data.
  • 22.  Unsupervised learning tries to find hidden structure in unlabeled data; since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution;  It is closely related to the problem of density estimation in statistics, but it also encompasses many other techniques that seek to summarize and explain key features of the data.  Approaches to unsupervised learning include: ◦ Clustering; ◦ Hidden Markov models; ◦ Blind signal separation (PCA, ICA, NMF, SVD, ...);  Unsupervised methods in NN: ◦ Self-Organizing Map: a topographic organization in which nearby locations in the map represent inputs with similar properties; ◦ Adaptive Resonance Theory: allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same cluster by means of the vigilance parameter.
  • 23.  Supervised learning is the task of inferring a function from labeled training data. The training data consist of a set of training examples.  Each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).  A supervised learning algorithm analyzes the training data and produces an inferred function, used for mapping new examples.  There are four major issues to consider in supervised learning: ◦ the tradeoff between bias and variance; ◦ the amount of training data relative to the complexity of the "true" function; ◦ the dimensionality of the input space: curse of dimensionality; ◦ the degree of noise in the desired output values: over-fitting.  The standard setting can be generalized in several ways: ◦ Semi-supervised learning: the desired output values are provided only for a subset of the training data; the remaining data are unlabeled. ◦ Active learning: instead of assuming that all of the training examples are given at the start, interactively collect new examples, typically by making queries to a human user.
  • 24.  Training/testing data (e.g. a 70%/30% split)  Unbalanced data (one class has more data than the others) ◦ Sampling, learning-algorithm modification (cost-sensitive), ensembles, ...  Feature extraction ◦ Sparse coding, vector quantization, ...  Curse of dimensionality: sensitivity to "noise" ◦ Dimension reduction, manifold learning/distance metric learning  Linear or non-linear model ◦ Local/global minimum (convex/concave objective function): learning rate ◦ Regularization: L-1/L-2 norm ◦ Kernel trick: mapping a nonlinear feature space to a high-dimensional linear space  Discriminative or generative model ◦ Bottom-up (conditional distribution) / top-down (joint distribution)  Over-fitting: learning the "noise" ◦ Cross validation with grid search  Performance evaluation ◦ Precision/recall, confusion matrix, ROC (receiver operating characteristic) curve
  • 25.  Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.  This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components.  PCA is sensitive to the relative scaling of the original variables.  Also known as the Karhunen-Loève transform (KLT) or Hotelling transform; closely related to singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc.;  Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm;  Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions.
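A minimal PCA sketch with scikit-learn, showing the projection onto the leading components and the variance they capture; the data are synthetic.

```python
# Minimal PCA sketch: project correlated variables onto the top
# principal components (synthetic correlated data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated variables

pca = PCA(n_components=3).fit(X)
X_low = pca.transform(X)                 # 500 x 3 component scores
print(pca.explained_variance_ratio_)     # variance captured by each component
```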
  • 26.  Independent component analysis (ICA) separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals that are statistically independent from each other. ◦ ICA is a special case of blind source separation.  Assumptions: the source signals are independent of each other; the distribution of values in each source signal is non-Gaussian.  Three effects of mixing signals: ◦ Independence: the source signals are independent, but their mixtures may not be; ◦ Normality: the mixtures are closer to Gaussian than any of the original variables; ◦ Complexity: the complexity of a mixture is greater than that of its constituent source signals.  Preprocessing: centering, whitening and dimension reduction;  ICA finds the independent components (latent variables) by maximizing the statistical independence of the estimated components;  Definitions of independence for ICA: ◦ Minimization of mutual information (KL divergence or entropy); ◦ Maximization of non-Gaussianity (kurtosis and negentropy).
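A hedged ICA sketch using scikit-learn's FastICA to unmix two synthetic non-Gaussian sources, mirroring the demonstration on the next slide.

```python
# Hedged ICA sketch: recover independent sources from a linear mixture
# with FastICA (two synthetic non-Gaussian sources).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # non-Gaussian source signals
X = S @ rng.normal(size=(2, 2)).T                   # observed mixtures

S_est = FastICA(n_components=2, random_state=0).fit_transform(X)  # recovered sources
```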
  • 27. (Figure: original source signals → mixed signals → after whitening → signals recovered by ICA.)
  • 28.  A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population;  Mixture models are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population;  A Gaussian mixture model can be Bayesian or non-Bayesian;  A variety of approaches focus on the maximum likelihood estimate (MLE), via expectation maximization (EM), or the maximum a posteriori (MAP) estimate;  EM determines the parameters of a mixture with an a priori given number of components (variants can adapt the number of components during the iterations); ◦ Expectation step: the "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point; ◦ Maximization step: plug-in estimates, mixing coefficients and component model parameters are re-computed for the distribution parameters; ◦ Each successive EM iteration will not decrease the likelihood.  Alternatives to EM for mixture models: ◦ mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC); ◦ spectral methods based on SVD; ◦ graphical models: MRF or CRF.
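A minimal EM sketch: scikit-learn's GaussianMixture fits a two-component mixture and exposes the E-step responsibilities ("partial memberships") described above; the data are synthetic.

```python
# Hedged sketch: fit a two-component Gaussian mixture by EM and read off
# each point's soft membership (synthetic 1-D data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.r_[rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)].reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
resp = gmm.predict_proba(x)                 # E-step responsibilities at convergence
print(gmm.means_.ravel(), gmm.weights_)     # fitted component means and mixing weights
```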
  • 29.  Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H such that all three matrices have no negative elements.  The different variants arise from using different cost functions for measuring the divergence between V and W*H, and possibly from regularization of the W and/or H matrices; ◦ squared error, Kullback-Leibler divergence or total variation (TV);  NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis);  pLSA is a statistical technique for two-mode analysis (extended naturally to higher modes), modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions; ◦ its parameters are learned using the EM algorithm;  pLSA is based on a mixture decomposition derived from a latent class model, not on downsizing the occurrence tables by SVD as in LSA.  Note: an extended model, LDA (Latent Dirichlet Allocation), adds a Dirichlet prior on the per-document topic distribution.
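A minimal NMF sketch with scikit-learn, factorizing a random non-negative matrix into W and H under the squared-error cost mentioned above; V here is random data, not a real co-occurrence table.

```python
# Minimal NMF sketch: factorize a non-negative matrix V into W * H.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((100, 40))                 # non-negative data matrix (placeholder)

nmf = NMF(n_components=5, init="nndsvd", max_iter=500)
W = nmf.fit_transform(V)                  # non-negative activations
H = nmf.components_                       # non-negative basis (dictionary)
print(np.linalg.norm(V - W @ H))          # Frobenius reconstruction error
```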
  • 30. Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)
  • 31.  A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states;  In HMM, state is not visible, but output, dependent on state, is visible. ◦ Each state has a probability distribution over the possible output tokens; ◦ Sequence of tokens generated by an HMM gives some information about the sequence of states.  Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model;  A HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process;  Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP);  Learning: optimize state transition and output probabilities by Baum-Welch algorithm (special case of EM).
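A small sketch of HMM inference: the forward algorithm computing the probability of an observed sequence under a hand-specified two-state model (all numbers are illustrative, not from any real system).

```python
# Hedged sketch of the forward algorithm for a tiny, hand-specified HMM.
import numpy as np

A = np.array([[0.7, 0.3],      # state-transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # per-state output (emission) distributions
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial state distribution
obs = [0, 1, 1, 0]             # observed token indices

alpha = pi * B[:, obs[0]]                  # initialization
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]          # recursion: propagate, then emit
seq_likelihood = alpha.sum()               # P(observations | model)
```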
  • 32.  Logistic regression is a probabilistic statistical classification model;  The probabilities of the possible outcomes of a single trial are modeled as a function of the explanatory variables using a logistic function;  Training: maximizes the conditional likelihood P(y|x) directly.
  • 33.  Convex optimization (logistic function): w* = argmax_w P(Y|X, w); ◦ A regularization term is added as well to control overfitting; ◦ Iterative solution: a gradient descent/ascent method.  In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, usually as Gaussian distributions; ◦ Apply the Metropolis-Hastings algorithm (a more general MCMC method than Gibbs sampling), based on a proposal (or jumping) distribution: the proposal distribution Q proposes the next point that the random walk might move to.
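A hedged sketch of the training rule above: gradient ascent on the L2-regularized conditional log-likelihood of a logistic model, with synthetic data and an arbitrary fixed learning rate.

```python
# Hedged sketch: maximize the regularized conditional likelihood P(y|x, w)
# of a logistic model by gradient ascent (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w, lr, lam = np.zeros(3), 0.1, 1e-2
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ w))                 # predicted P(y=1|x, w)
    grad = X.T @ (y - p) / len(y) - lam * w      # gradient of regularized log-likelihood
    w += lr * grad                               # ascent step
```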
  • 34.  The Naive Bayes classifier assumes that features are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps: ◦ Training step: using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class. ◦ Prediction step: for any unseen test sample, the method computes the posterior probability of that sample belonging to each class, then classifies the test sample according to the largest posterior probability.  The class-conditional independence assumption greatly simplifies the training step, since you can estimate the one-dimensional class-conditional density for each feature individually; ◦ While class-conditional independence between features is not true in general, research shows that this optimistic assumption works well in practice; ◦ This assumption allows the Naive Bayes classifier to estimate the parameters required for accurate classification while using less training data than many other classifiers; ◦ This makes it particularly effective for datasets containing many predictors or features.
  • 35.  Supported distributions in Naive Bayes classification: ◦ Naive Bayes is based on estimating P(x|y), the probability or probability density of features x given class y. ◦ Common choices: normal (Gaussian), kernel, multinomial, and multivariate multinomial distributions.  Normal (Gaussian) distribution: features have normal distributions in each class;  Kernel: computes a separate kernel density estimate for each class based on the training data for that class;  Multinomial distribution ("bag of words" model): each feature represents the count of one word; classification is based on the relative frequencies of the words.  Multivariate multinomial distribution: features are categorical, with categories that differ from the class levels of the response variable.
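A minimal sketch of the Gaussian (normal-distribution) variant described above, using scikit-learn's GaussianNB on synthetic two-class data.

```python
# Hedged sketch: Naive Bayes with Gaussian class-conditional densities.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))]  # two synthetic classes
y = np.r_[np.zeros(200), np.ones(200)]

nb = GaussianNB().fit(X, y)                 # training step: per-class 1-D densities
posteriors = nb.predict_proba(X[:3])        # prediction step: per-class posteriors
```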
  • 36.  Separable Data ◦ An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. ◦ “Margin” means the maximal width of the slab parallel to the hyperplane that has no interior data points. ◦ The support vectors are the data points that are closest to the separating hyperplane.
  • 37.  Mathematical formulation (primal): $\min_{f,\,\xi}\ \|f\|_K^2 + C\sum_{i=1}^{l}\xi_i$ subject to $y_i f(x_i) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$, where the $\xi_i$ are slack variables measuring the error made at point $(x_i, y_i)$.  Mathematical formulation (dual): $\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{l}\alpha_i$ subject to $0 \le \alpha_i \le C$ for all $i$ and $\sum_{i=1}^{l}\alpha_i y_i = 0$.
  • 38.  Non-separable Data ◦ Your data might not allow for a separating hyperplane. In that case, SVM can use a soft margin, meaning a hyperplane that separates many, but not all data points.  Nonlinear Transformation with Kernels ◦ Some binary classification problems do not have a simple hyperplane as a useful separating criterion; ◦ Theory of reproducing kernels: Polynomials, Radial basis or sigmoid function; ◦ Nonlinear kernels can use identical calculations and solution algorithms, and obtain classifiers that are nonlinear.
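A hedged sketch of a soft-margin SVM with an RBF kernel, i.e. the C-parameterized dual above solved by scikit-learn on synthetic data that is not linearly separable.

```python
# Hedged sketch: soft-margin SVM with an RBF kernel on non-separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary, not linearly separable

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(len(svm.support_), "support vectors")          # points closest to the separating surface
```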
  • 39.  Random field: F={F1,F2,...,FM}, a family of random variables on a set S in which each Fi takes a value fi in a label set L.  Markov random field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. ◦ Generative model for the joint probability p(x) ◦ Allows no direct probabilistic interpretation of individual potentials ◦ Define potential functions Ψ on maximal cliques A:  they map a joint assignment to a non-negative real number  they require normalization (the partition function)  An MRF is an undirected graphical model.
  • 40.  Conditional, not joint, probabilistic sequential models p(y|x)  Allow arbitrary, non-independent features on the observation sequence X  Specify the probability of possible label sequences given an observation sequence  The probability of a transition between labels can depend on past and future observations  Relax strong independence assumptions; no model of p(x) is required  A CRF is an MRF plus "external" variables, where the "internal" variables Y of the MRF are unobservable and the "external" variables X are observable  Linear-chain CRF: the transition score depends on the current observation ◦ Inference by DP as in an HMM; learning by forward-backward as in an HMM  Optimization for learning a CRF: discriminative model ◦ Conjugate gradient, stochastic gradient, ...
  • 41.  Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models;  Ensembles combine many weak learners to produce a strong learner; ◦ The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner;  Ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would; but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data;  Empirically, ensembles tend to yield better results when there is significant diversity among the models;  Popular types: ◦ Bagging, boosting, stacking, stochastic discrimination, random subspace, ... ◦ Random forest, derived from the random subspace method, constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.
  • 42.  Sample the training set to generate random independent bootstrap replicates, construct a classifier on each, and aggregate them by a majority vote in the final decision rule (hence "bootstrap aggregating");  Bootstrapping is based on random sampling with replacement;  Therefore, taking a bootstrap replicate (random selection with replacement) of the training set sometimes avoids, or includes fewer, misleading training objects in the bootstrap training set;  Consequently, a classifier constructed on such a training set may have better performance.
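A minimal bagging sketch: scikit-learn's BaggingClassifier trains its default decision-tree base learner on bootstrap replicates and aggregates by voting; the dataset is synthetic.

```python
# Hedged sketch of bootstrap aggregating: many base learners trained on
# bootstrap replicates, combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))   # training accuracy of the voted ensemble
```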
  • 43.  At each step, the training data are re-weighted so that incorrectly classified objects get larger weights in a new, modified training set; this effectively maximizes the margins between objects;  Classifiers are constructed on weighted versions of the training set, which depend on previous classification results;  Boosting originated from the Probably Approximately Correct (PAC) learning theory;  AdaBoost was the first algorithm that could adapt to the weak learners;  Variants of AdaBoost (adaptive boosting): ◦ LogitBoost: fits an additive logistic regression model by stage-wise optimization of the logistic loss; ◦ GentleBoost: the update is fm(x) = P(y=1|x) - P(y=0|x) instead of the half log-ratio ½ log[P(y=1|x)/P(y=0|x)].
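A minimal AdaBoost sketch standing in for the re-weighting procedure above; the dataset is synthetic.

```python
# Hedged sketch: AdaBoost re-weights training points each round so that
# misclassified examples receive larger weights.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(ada.score(X, y))   # training accuracy of the boosted ensemble
```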
  • 44.  In SVM, one performs global optimization to maximize the minimal margin, while in Boosting one maximizes the margin locally for each training object;  SVM uses the L-2 norm for both the hypothesis and the weight vector, while Boosting uses the L-∞ norm for the hypothesis vector and the L-1 norm for the weight vector;  It has been shown that if the number of relevant weak hypotheses k is a small fraction of the total number of weak hypotheses, then the margin associated with Boosting will be much larger than the one associated with SVM;  SVM corresponds to quadratic programming, while Boosting corresponds only to linear programming;  Through the method of kernels, SVM allows low-dimensional calculations that are mathematically equivalent to inner products in a high-dimensional "virtual" space; instead, Boosting employs greedy search: the re-weighting of the examples changes the distribution with respect to which the correlation is measured, thus guiding the weak learner to find different correlated coordinates.
  • 45.  With discrete stochastic processes, arbitrary numbers of very weak models are generated and combined to separate the points in multi-dimensional spaces. ◦ Can be regarded as a method of dimensionality reduction; ◦ "Uniformity": two points from the same class are equally likely to be captured by a weak model of a given size; ◦ "Enrichment": weak models do not have the same chance of capturing points from different classes.  SD has the property that the very complex and accurate classifiers produced in this way retain the ability, characteristic of their weak component pieces, to generalize to new data;  It is in combining these weak models that the discriminative power is developed.  SD simply transforms the multi-dimensional feature vectors to points coming from two univariate normal distributions;  These two univariate normal distributions separate further as the number of weak models increases, which is intuitively similar to how people accumulate knowledge.
  • 46.  Classifiers are constructed in random subspaces of the data feature space, usually combined by simple majority voting in the final decision rule;  It also relies on a stochastic process that randomly selects a number of components of the given feature vector when constructing each classifier;  Geometrically, this is equivalent to projecting all the points onto the selected subspace;  The random subspace method effectively takes advantage of high dimensionality.
  • 47. • A dictionary model: a set {D, X, Y} such that DX = Y, where D is the (overcomplete) dictionary, the columns of X are sparse codes, and the columns of Y are the signals.
  • 48. • Given D and yi, how do we find xi? • Constraint: xi is sufficiently sparse; • Finding the exact solution is difficult; • Can we approximate a solution that is good enough?
  • 49.  Greedy methods: project the residual onto some atom; the residual is updated iteratively in the direction of the chosen atom; ◦ Matching pursuit, orthogonal matching pursuit;  L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);  Gradient-based methods for finding new search directions ◦ Projected gradient descent ◦ Coordinate descent  Homotopy: a set of solutions indexed by a (regularization) parameter ◦ LARS (Least Angle Regression)  First-order/proximal methods: generalized gradient descent ◦ solve the proximal operator efficiently ◦ soft-thresholding for the L1-norm ◦ accelerated by Nesterov's optimal first-order method  Iterative reweighting schemes ◦ reweighted L2-norm: Chartrand and Yin (2008) ◦ reweighted L1-norm: Candès et al. (2008)
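A hedged sketch of the first-order/proximal idea above: ISTA for the lasso, alternating a gradient step on the quadratic data term with soft-thresholding (the proximal operator of the L1 norm); D, y, and all constants are synthetic placeholders.

```python
# Hedged ISTA sketch: gradient step on ||y - D x||^2 followed by
# soft-thresholding (proximal operator of the L1 norm).
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 100))
y = D @ (rng.normal(size=100) * (rng.random(100) < 0.05))   # sparse ground-truth code

lam = 0.1
step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1/L, L = Lipschitz constant of the gradient
x = np.zeros(100)
for _ in range(300):
    g = D.T @ (D @ x - y)                      # gradient of the smooth data term
    z = x - step * g
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
```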
  • 50. OMP iteration (inputs: D, y; output: x): select the atom dk with the maximum projection onto the residual; solve xk = arg min ||y - Dk xk|| over the selected atoms; update the residual r = y - Dk xk; check the terminating condition.
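A hedged numpy sketch of the OMP loop on this slide: pick the atom most correlated with the residual, re-fit on the selected atoms by least squares, and update the residual; the dictionary and signal are synthetic.

```python
# Hedged OMP sketch (greedy pursuit with a least-squares re-fit per step).
import numpy as np

def omp(D, y, k):
    """Return a k-sparse x with y ~= D @ x (columns of D assumed unit-norm)."""
    residual, support = y.astype(float).copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))       # atom with max projection on residual
        support.append(j)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs             # update the residual
    x[support] = coeffs
    return x

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 100))
D /= np.linalg.norm(D, axis=0)                            # unit-norm atoms
y = D[:, [3, 40, 77]] @ np.array([1.0, -2.0, 0.5])        # 3-sparse ground truth
x_hat = omp(D, y, k=3)
```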
  • 51.  What D to use?  A fixed overcomplete set of bases has no adaptivity:  steerable wavelets;  bandlets, curvelets, contourlets;  DCT basis;  Gabor functions;  ...  A data-adaptive dictionary: learn it from the data;  K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). ◦ An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary.  Other methods of dictionary learning: ◦ non-negative matrix decompositions; ◦ sparse PCA (sparse dictionaries); ◦ fused-lasso regularizations (piecewise-constant dictionaries)  Extending the models: sparsity + self-similarity = group sparsity
  • 52. K-SVD alternation: initialize the dictionary (select atoms from the input; atoms can be overlapping image patches); sparse coding stage (use OMP or any pursuit method to output a sparse code for all signals); dictionary update stage (one atom at a time, minimizing the representation error); repeat until convergence.
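K-SVD itself is not in scikit-learn, so the hedged sketch below uses MiniBatchDictionaryLearning as a stand-in for the alternation above (OMP-based sparse coding plus a dictionary update); the patches are random placeholders rather than real image patches.

```python
# Hedged dictionary-learning sketch standing in for K-SVD.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
patches = rng.normal(size=(2000, 64))      # 8x8 patches, flattened (placeholder)

dl = MiniBatchDictionaryLearning(n_components=128, transform_algorithm="omp",
                                 transform_n_nonzero_coefs=5, random_state=0)
codes = dl.fit(patches).transform(patches)  # sparse codes; dl.components_ is the dictionary
```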
  • 53.  Representation learning attempts to automatically learn good features or representations;  Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features);  Become effective via unsupervised pre-training + supervised fine tuning; ◦ Deep networks trained with back propagation (without unsupervised pre- training) perform worse than shallow networks.  Deal with the curse of dimensionality (smoothing & sparsity) and over-fitting (unsupervised, regularizer);  Semi-supervised: structure of manifold assumption; ◦ labeled data is scarce and unlabeled data is abundant.
  • 54. • Supervised training of deep models (e.g. many-layered nets) is too hard (an optimization problem);  Learn a prior from unlabeled data; • Shallow models are not suited to learning high-level abstractions;  Ensembles or forests do not learn features first;  Graphical models could be deep nets, but mostly are not. • Unsupervised learning can be "local learning";  Resembles boosting, with each layer being like a weak learner. • Learning is weak in directed graphical models with many hidden variables;  Sparsity and regularizers help. • Traditional unsupervised learning methods do not easily learn multiple levels of representation;  Layer-wise unsupervised learning is the solution. • Multi-task learning (transfer learning and self-taught learning); • Other issues: scalability & parallelism under the burden of big data.
  • 55.  A neural network = running several logistic regressions at the same time; ◦ Neuron=logistic regression or…  Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule) ◦ Online learning: stochastic/incremental gradient descent; ◦ Batch learning: conjugate gradient descent.
  • 56.  A CNN is a special kind of multi-layer NN applied to 2-d arrays (usually images), based on spatially localized neural input; ◦ local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling; ◦ Related to the generative MRF / discriminative CRF:  CNN = Field of Experts MRF = ML inference in a CRF; ◦ Generates 'patterns of patterns' for pattern recognition.  Each layer combines (merges, smooths) patches from previous layers ◦ Pooling/sampling (e.g., max or average) filters: compress and smooth the data. ◦ Convolution filters: (translation invariance) unsupervised; ◦ Local contrast normalization: increases sparsity, improves optimization/invariance. (C layers: convolutions; S layers: pooling/sub-sampling.)
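As a hedged illustration only (not the architecture of any of the cited papers), the PyTorch sketch below alternates C (convolution) and S (pooling) layers and ends with a 1x1 convolution producing a per-pixel boundary score map; the layer sizes are arbitrary.

```python
# Hedged sketch: tiny CNN alternating convolution (C) and pooling (S) layers,
# producing a per-pixel boundary score map.
import torch
import torch.nn as nn

class TinyBoundaryNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),   # C layer
            nn.MaxPool2d(2),                                          # S layer
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),  # C layer
        )
        self.head = nn.Conv2d(32, 1, kernel_size=1)                   # per-pixel boundary logit

    def forward(self, x):
        return self.head(self.features(x))

scores = TinyBoundaryNet()(torch.randn(1, 3, 64, 64))   # 1 x 1 x 32 x 32 score map
```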
  • 57. • A hybrid model: can be trained as generative or discriminative model; • Deep architecture: multiple layers (learn features layer by layer); • Multi layer learning is difficult in sigmoid belief networks. • Top two layers are undirected connections, Restricted Boltzmann Machine (RBM); • Lower layers get top down directed connections from layers above; • Unsupervised or self-taught pre-learning provides a good initialization; • Greedy layer-wise unsupervised training for RBM; • Supervised fine-tuning • Generative: wake-sleep algorithm (Up- down); • Discriminative: back propagation (bottom-up); Belief net is directed acyclic graph composed of stochastic variables.
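A hedged sketch of the greedy layer-wise pre-training mentioned above: scikit-learn's BernoulliRBM is trained layer by layer, then a logistic regression is fit on the top-layer features (wake-sleep and full supervised fine-tuning of the stack are not shown). Digit images stand in for image patches.

```python
# Hedged sketch: stack two RBMs greedily, then a supervised classifier on top.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM

X, y = load_digits(return_X_y=True)
X = X / 16.0                                    # scale pixel values to [0, 1]

rbm1 = BernoulliRBM(n_components=100, n_iter=10, random_state=0).fit(X)
h1 = rbm1.transform(X)                          # first-layer hidden activations
rbm2 = BernoulliRBM(n_components=50, n_iter=10, random_state=0).fit(h1)
h2 = rbm2.transform(h1)                         # second-layer hidden activations

clf = LogisticRegression(max_iter=1000).fit(h2, y)   # supervised layer on learned features
```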
  • 58. • A Boltzmann machine is a stochastic recurrent model, and the RBM is its special case (one hidden layer); • Learns internal representations that become increasingly complex; • High-level representations are built from a large supply of unlabeled inputs; • Pre-training: learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph); • Generative fine-tuning: different from the DBN; uses a positive and a negative phase; • Discriminative fine-tuning: the same as for the DBN, i.e. back propagation.
  • 59. • Stack many (possibly sparse) auto-encoders in succession and train them; • Denoising auto-encoder: a multilayer NN with target output = input, trained on corrupted inputs; • The auto-encoder learns the salient variations, like a nonlinear PCA; • Use greedy layer-wise unsupervised learning, dropping the decode layer each time; • The denoising variant performs better than stacking RBMs; • Supervised training on the last layer using the final features; • (Option) supervised training on the entire network to fine-tune all weights of the neural net; • Empirically not quite as accurate as DBNs.