Deep learning for image denoising and super-resolution


Tags: deep learning, MLP, Convolutional Network, Deep Belief Nets, Deep Boltzmann Machine, Stacked Denoising Auto-Encoder, Image Denoising, Image Super-resolution


  1. 1. Deep Learning for Image Denoising and Super-resolution Yu Huang Sunnyvale, California
  2. 2. Outline • Deep learning • Why deep learning? • State-of-the-art deep learning • Parallel Deep Learning at Google • Sparse coding • Dictionary learning • Multiple Layer NN (MLP) • Convolutional Neural Network • Generative model: MRF • Deep Gated MRF • Stacked Denoising Auto-Encoder • Deep Belief Nets (DBN) • Deep Boltzmann Machines (DBM) • Generative Adversarial Networks (GAN) • Image Denoising • Image Denoising by BM3D • Image Denoising by K-SVD • Image Denoising by CNN • Image Denoising by MLPs • Image Denoising by DBMs • Image Denoising by Deep GMRF • Non-local Denoising with CNN • Image Restoration by CNN • Image Super-resolution • Example-based SR • Sparse Coding for SR • Frame Alignment-based SR • Image Super-resolution by DBMs • Image Super-resolution by DBNs • Image SR by Cascaded SAE • Image SR by Deep CNN • Image SR via Deep Recursive ResN • Deep Laplacian Pyramid Nets for SR • Image SR by Deeply-Recursive ConvNet • Appendix: Learning + Optimization
  3. 3. Gartner Emerging Tech Hype Cycle 2012
  4. 4. Deep Learning • Representation learning attempts to automatically learn good features or representations; • Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high-level features); • Became effective via unsupervised pre-training + supervised fine-tuning; • Deep networks trained with back propagation (without unsupervised pre-training) perform worse than shallow networks; • Deal with the curse of dimensionality (smoothing & sparsity) and overfitting (unsupervised, regularizer); • Semi-supervised: structure of manifold assumption; • Labeled data is scarce and unlabeled data is abundant.
  5. 5. Deep Net Architectures • Feed-Forward: multilayer neural nets, convolutional neural nets • Feed-Back: stacked sparse coding, deconvolutional nets • Bi-Directional: deep Boltzmann machines, stacked auto-encoders
  6. 6. Why Deep Learning? • Supervised training of deep models (e.g. many-layered nets) is too hard (optimization problem); • Learn a prior from unlabeled data; • Shallow models cannot learn high-level abstractions; • Ensembles or forests do not learn features first; • Graphical models could be deep nets, but mostly are not; • Unsupervised learning could be “local learning”; • Resembles boosting, with each layer acting like a weak learner; • Learning is weak in directed graphical models with many hidden variables; • Sparsity and regularizers; • Traditional unsupervised learning methods cannot easily learn multiple levels of representation; • Layer-wise unsupervised learning is the solution; • Multi-task learning (transfer learning and self-taught learning); • Other issues: scalability & parallelism with the burden of big data.
  7. 7. The Mammalian Visual Cortex is Hierarchical
  8. 8. State-of-the-Art Deep Learning R&D • Deep learning is the hottest topic in speech recognition • Performance records broken with deep learning methods • Microsoft, Google: DL-based speech recognition products • Deep learning is the hottest topic in computer vision • The record holders on ImageNet are convolutional nets • Deep learning is becoming hot in NLP • Deep learning/feature learning in applied mathematics • sparse coding • non-convex optimization • stochastic gradient algorithms • Transfer learning: inductive transfer, storing knowledge gained while solving one problem and applying it to a different but related problem • Transfer the classification knowledge, adapt the model, or annotate less data • Self-taught learning: generic unlabeled data improve the performance on a supervised learning task • Relax the assumptions about the unlabeled data; • Use unlabeled data to learn the best representation (dictionary) with sparse coding.
  9. 9. Convolutional Neural Network’s Progress • Data and GPU, also networks deeper and more non-linear. Convolutional Neural Net 2012 Convolutional Neural Net 1998 Convolutional Neural Net 1988
  10. 10. Convolutional Neural Network’s Progress • Fukushima 1980: designed network with same basic structure but did not train by back propagation. • LeCun from late 80s: figured out back propagation for CNN, popularized and deployed CNN for OCR applications and others. • Poggio from 1999: same basic structure but learning is restricted to top layer (k-means at second stage) • LeCun from 2006: unsupervised feature learning • DiCarlo from 2008: large scale experiments, normalization layer • LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised. • Mallat from 2011: provides a theory behind the architecture • Hinton 2012: use bigger nets, GPUs, more data
  11. 11. DL Winner in Object Recognition • Won the 2012 ImageNet LSVRC. 60 Million parameters, 832M MAC ops; • Convolutional Nets [Krizhevsky et al., 2012]
  12. 12. Parallel Deep Learning at Google • More features always improve performance unless data is scarce; • Deep learning methods have higher capacity and the potential to model data better; • However, big data needs deep learning to be scalable: lots of training samples (>10M), classes (>10K) and input dimensions (>10K). • Distributed deep nets (easy to distribute). Model parallelism Model parallelism + data parallelism
  13. 13. Scaling Across Multiple GPUs • Two variations: 1) Simulate the synchronous execution of SGD in one core; 2) Approximation of SGD, not perfectly simulating but working better; • Two parallelisms: 1) model parallelism: across the model dimension, where different workers train different parts of the model (amount of computation per neuron activity is high); 2) data parallelism: across the data dimension, where different workers train on different data examples (amount of computation per weight is high); • Observations: data parallelism for convolutional layers and model parallelism for fully-connected layers; • Convolutional layers cumulatively contain ~90-95% of the computation, ~5% of the parameters; • Fully-connected layers contain ~5-10% of the computation, ~95% of the parameters; • Forward pass: • Each of the K workers is given a different data batch of (let’s say) 128 examples; • Each of the K workers computes all of the convolutional layer activities on its batch; • To compute the fully-connected layer activities, the workers switch to model parallelism; • Parallelism: three schemes of parallelism.
  14. 14. Scaling Across Multiple GPUs • Scheme I: each worker sends its last-stage convolutional layer activities to each other worker; the workers then assemble a big batch of activities for 128K examples and compute the fully-connected activities on this batch as usual; • Scheme II: one of the workers sends its last-stage convolutional layer activities to all other workers; the workers then compute the fully-connected activities on this batch of 128 examples and then begin to back propagate the gradients for these 128 examples; in parallel with this computation, the next worker sends its last-stage convolutional layer activities to all other workers; then the workers compute the fully-connected activities on this second batch of 128 examples, and so on; • Scheme III: all of the workers send 128/K of their last-stage convolutional layer activities to all other workers. The workers then proceed as in scheme II; • Backward pass is similar: the workers compute the gradients in the fully-connected layers in the usual way, then the next step depends on the scheme used in the forward pass. • Weight synchronization in the convolutional layers after the backward pass; • Variable batch size (128K in the convolutional layers and 128 in the fully-connected layers);
  15. 15. Model Parallelism: Partition model across machines Data Parallelism: Asynchronous Distributed Stochastic Gradient Descent
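The synchronous data-parallelism idea above can be simulated on a single machine: each worker computes a gradient on its own data shard and the gradients are averaged before one parameter update. A minimal numpy sketch on a least-squares toy problem (the loss, sizes, and learning rate are my own illustrative choices, not Google's setup):

```python
import numpy as np

def data_parallel_sgd_step(w, X, y, n_workers, lr=0.1):
    """One synchronous data-parallel SGD step on a least-squares loss.

    Each worker gets a shard of the batch, computes its local gradient,
    and the gradients are averaged (an all-reduce) before a single
    parameter update -- simulating synchronous SGD on one core.
    """
    shards_X = np.array_split(X, n_workers)
    shards_y = np.array_split(y, n_workers)
    grads = []
    for Xs, ys in zip(shards_X, shards_y):
        residual = Xs @ w - ys
        grads.append(Xs.T @ residual / len(ys))  # local gradient on this shard
    g = np.mean(grads, axis=0)                   # all-reduce: average gradients
    return w - lr * g

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(64, 2))
y = X @ w_true
w = np.zeros(2)
for _ in range(200):
    w = data_parallel_sgd_step(w, X, y, n_workers=4)
```

With equal shard sizes, the average of the per-shard gradients equals the full-batch gradient exactly, so this converges to the true weights.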
  16. 16. Sparse Coding • Sparse coding (Olshausen & Field, 1996). • Originally developed to explain early visual processing in the brain (edge detection). • Objective: Given a set of input data vectors learn a dictionary of bases such that: • Each data vector is represented as a sparse linear combination of bases. Sparse: mostly zeros
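The objective described above (reconstruct each data vector as a sparse linear combination of dictionary bases) can be written down directly; a minimal numpy sketch, with a trivial dictionary chosen purely for illustration:

```python
import numpy as np

def sparse_coding_objective(D, X, A, lam=0.1):
    """Value of the sparse coding cost  0.5*||X - D A||_F^2 + lam * ||A||_1.

    D: dictionary of basis vectors (n_features x n_atoms),
    X: data vectors as columns, A: sparse codes as columns.
    """
    reconstruction = 0.5 * np.sum((X - D @ A) ** 2)  # fit term
    sparsity = lam * np.sum(np.abs(A))               # L1 penalty -> mostly zeros
    return reconstruction + sparsity

D = np.eye(4)                                # trivial dictionary for illustration
X = np.array([[1.0], [0.0], [0.0], [2.0]])
A = np.array([[1.0], [0.0], [0.0], [2.0]])   # exact, sparse code
cost = sparse_coding_objective(D, X, A, lam=0.1)  # zero fit error + L1 penalty
```

Here the reconstruction is exact, so the cost reduces to the L1 penalty 0.1 * (1 + 2) = 0.3.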
  17. 17. Predictive Sparse Coding • Recall the objective function for sparse coding: • Modify by adding a penalty for prediction error: • Approximate the sparse code with an encoder • PSD for hierarchical feature training • Phase 1: train the first layer; • Phase 2: use encoder + absolute value as 1st feature extractor • Phase 3: train the second layer; • Phase 4: use encoder + absolute value as 1st feature extractor • Phase 5: train a supervised classifier on top layer; • Phase 6: optionally train the whole network with supervised BP.
  18. 18. Methods of Solving Sparse Coding • Greedy methods: projecting the residual on some atom; • Matching pursuit, orthogonal matching pursuit; • L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO); • The residual is updated iteratively in the direction of the atom; • Gradient-based methods for finding new search directions • Projected gradient descent • Coordinate descent • Homotopy: a set of solutions indexed by a parameter (regularization) • LARS (Least Angle Regression) • First-order/proximal methods: generalized gradient descent • solving the proximal operator efficiently • soft-thresholding for the L1-norm • Accelerated by the Nesterov optimal first-order method • Iterative reweighting schemes • L2-norm: Chartrand and Yin (2008) • L1-norm: Candès et al. (2008)
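The proximal method named above is concrete enough to sketch: the proximal operator of the L1 norm is soft-thresholding, giving the ISTA iteration. A minimal numpy version (the toy dictionary and data are my own, for illustration):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam, n_iter=200):
    """Iterative Shrinkage-Thresholding for  min_a 0.5*||x - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the smooth fit term
        a = soft_threshold(a - grad / L, lam / L)  # generalized gradient step
    return a

D = np.eye(3)                              # trivial dictionary: a = soft-thresholded x
x = np.array([3.0, 0.05, -2.0])
a = ista(D, x, lam=0.1)
```

With an orthonormal dictionary the fixed point is simply the soft-thresholded input, so the small coefficient 0.05 is zeroed out (sparsity) and the large ones are shrunk by lam.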
  19. 19. Strategy of Dictionary Selection • What D to use? • A fixed overcomplete set of basis: no adaptivity. • Steerable wavelet; • Bandlet, curvelet, contourlet; • DCT Basis; • Gabor function; • …. • Data adaptive dictionary – learn from data; • K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). • An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary. • Other methods of dictionary learning: • non-negative matrix decompositions. • sparse PCA (sparse dictionaries). • fused-lasso regularizations (piecewise constant dictionaries) • Extending the models: Sparsity + Self-similarity=Group Sparsity
  20. 20. Multi Layer Neural Network • A neural network = running several logistic regressions at the same time; • Neuron=logistic regression or… • Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule) • Online learning: stochastic/incremental gradient descent • Batch learning: conjugate gradient descent
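The slide's view of an MLP as stacked logistic regressions trained by back-propagating error derivatives with the chain rule can be sketched end to end; a minimal numpy example on XOR (the hidden width, learning rate, and iteration count are my own choices):

```python
import numpy as np

def mlp_step(params, X, y, lr=0.5):
    """One full-batch gradient step for a one-hidden-layer sigmoid MLP.

    Forward pass runs two 'logistic regressions'; backward pass applies
    the chain rule to propagate the error derivative through the model.
    """
    W1, b1, W2, b2 = params
    h = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # hidden sigmoid layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # output sigmoid
    dp = (p - y) / len(y)                       # cross-entropy gradient at output
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = dp @ W2.T * h * (1 - h)                # chain rule through the sigmoid
    dW1, db1 = X.T @ dh, dh.sum(0)
    return (W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2)

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR: needs the hidden layer
params = (rng.normal(0, 1, (2, 8)), np.zeros(8),
          rng.normal(0, 1, (8, 1)), np.zeros(1))
for _ in range(5000):
    params = mlp_step(params, X, y)
```

XOR is the classic case a single logistic regression cannot fit, which is exactly why the hidden layer (and back propagation through it) is needed.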
  21. 21. Problems in MLPs • Multi Layer Perceptrons (MLPs), a classic feed-forward neural network, were popular for decades; • The gradient progressively becomes more scattered: below the top few layers, the correction signal is minimal; • Gets stuck in local minima, especially when starting far from ‘good’ regions (i.e., random initialization); • In usual settings, only labeled data is used, yet almost all data is unlabeled! • The human brain, in contrast, can learn from unlabeled data.
  22. 22. Convolutional Neural Networks • A CNN is a special kind of multi-layer NN applied to 2-D arrays (usually images), based on spatially localized neural input; • Local receptive fields (shifted windows), shared weights (weight averaging) across the hidden units, and often spatial or temporal sub-sampling; • Related to generative MRF/discriminative CRF: • CNN = Field of Experts MRF = ML inference in CRF; • Generates ‘patterns of patterns’ for pattern recognition; • Each layer combines (merges, smooths) patches from previous layers; • Pooling/sampling (e.g., max or average) filter: compresses and smooths the data; • Convolution filters: (translation invariance) unsupervised; • Local contrast normalization: increases sparsity, improves optimization/invariance. C layers: convolutions; S layers: pool/sample
  23. 23. Convolutional Neural Networks • Convolutional Networks are trainable multistage architectures composed of multiple stages; • Input and output of each stage are sets of arrays called feature maps; • At output, each feature map represents a particular feature extracted at all locations on input; • Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer; • A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module; • A fully connected layer: softmax transfer function for posterior distribution. • Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map; • Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function; • In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N; • Feature pooling: treats each feature map separately -> a reduced-resolution output feature map; • Supervised training is performed using a form of SGD to minimize the prediction error; • Gradients are computed with the back-propagation method. • Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine- tuning. * is discrete convolution operator
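One such stage (filter bank with the discrete convolution operator *, pointwise non-linearity, then feature pooling) can be sketched in a few lines of numpy; a minimal single-channel illustration with a hand-picked averaging kernel:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2-D convolution ('valid' region), the * operator in the text."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    flipped = kernel[::-1, ::-1]               # true convolution flips the kernel
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * flipped)
    return out

def max_pool(fmap, size=2):
    """Feature pooling: a reduced-resolution map of local maxima."""
    H, W = fmap.shape
    return fmap[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

image = np.arange(16.0).reshape(4, 4)
fmap = np.tanh(conv2d_valid(image, np.ones((2, 2)) / 4))  # filter bank + tanh()
pooled = max_pool(fmap, size=2)
```

A real ConvNet stage has many trainable filters per layer and often a contrast-normalization step; this shows only the data flow of one filter.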
  24. 24. LeNet (LeNet-5) • A layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits; • Local receptive fields (5x5) with local connections; • Output via a RBF function, one for each class, with 84 inputs each; • Learning by Graph Transformer Networks (GTN);
  25. 25. AlexNet • A layered model composed of convolution and subsampling operations, followed by a holistic representation and, all in all, a landmark classifier; • Consists of 5 convolutional layers, some of which are followed by max-pooling layers, and 3 fully-connected layers with a final 1000-way softmax; • Fully-connected “FULL” layers: linear classifiers/matrix multiplications; • ReLUs are rectified-linear nonlinearities on layer outputs, which can be trained several times faster; • A local normalization scheme aids generalization; • Overlapping pooling is slightly less prone to overfitting; • Data augmentation: artificially enlarge the dataset using label-preserving transformations; • Dropout: set the output of each hidden neuron to zero with prob. 0.5; • Trained by SGD with batch size 128, momentum 0.9, weight decay 0.0005.
  26. 26. Use the two GPUs: one GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom; the GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.
  27. 27. MattNet • Matthew Zeiler from the startup company “Clarifai”, winner of the ImageNet Classification task in 2013; • Preprocessing: subtracting a per-pixel mean; • Data augmentation: images are downsampled to 256 pixels, and a random 224-pixel crop is taken out of the image and randomly flipped horizontally to provide more views of each example; • SGD with mini-batch size 128, learning-rate annealing, momentum 0.9 and dropout to prevent overfitting; • 65M parameters trained for 12 days on a single Nvidia GPU; • Visualization by layered DeconvNets: project the feature activations back to the input pixel space; • Reveal the input stimuli that excite individual feature maps at any layer; • Observe the evolution of features during training; • Sensitivity analysis of the classifier output by occluding portions of the input to reveal which parts of scenes are important; • A DeconvNet is attached to each ConvNet layer; unpooling uses the locations of maxima to preserve structure; • Multiple such models were averaged together to further boost performance; • Supervised pre-training with AlexNet, then modified to get better performance (error rate 14.8%).
  28. 28. Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i) via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized 55x55 feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C - number of classes.
  29. 29. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct approximate version of convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
  30. 30. Oxford VGG Net: Very Deep CNN • Networks of increasing depth using an architecture with very small (3×3) convolution filters; • Spatial pooling is carried out by 5 max-pooling layers; • A stack of convolutional layers followed by three fully-connected (FC) layers; • All hidden layers are equipped with the ReLU rectification non-linearity; • No Local Response Normalisation! • Trained by optimising the multinomial logistic regression objective using SGD; • Regularised by weight decay and dropout regularisation for the first two fully-connected layers; • The learning rate was initially set to 10−2, and then decreased by a factor of 10; • For random initialisation, the weights are sampled from a normal distribution; • Derived from the publicly available C++ Caffe toolbox, allowing training and evaluation on multiple GPUs installed in a single system, and on full-size (uncropped) images at multiple scales; • Combine the outputs of several models by averaging their soft-max class posteriors.
  31. 31. The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv<receptive field size> - <number of channels>”. The ReLU activation function is not shown for brevity.
  32. 32. GoogLeNet • A deep convolutional neural network architecture codenamed Inception; • Increases the depth and width of the network while keeping the computational budget constant; • Drawbacks: bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, and a dramatically increased use of computational resources; • Solution: move from fully connected to sparsely connected architectures, and analyze the correlation statistics of the activations of the last layer, clustering neurons with highly correlated outputs. • Based on the well-known Hebbian principle: neurons that fire together wire together; • GoogLeNet: a framework with the Inception architecture • Finds out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components; • Judiciously applies dimension reduction and projections; • Increases the number of units at each stage significantly without a blow-up in computational complexity; • Aligns with the intuition that visual info. is processed at various scales and then aggregated, so that the next stage can abstract features from different scales simultaneously. • Trained using DistBelief: a distributed machine learning system (cloud).
  33. 33. Inception module (with dimension reductions)
  34. 34. GoogLeNet: a “network in a network in a network” with 9 Inception modules (figure legend: Convolution, Pooling, Softmax, Other). Problems with training deep architectures?
  35. 35. RNN: Recurrent Neural Network • A nonlinear dynamical system that maps sequences to sequences; • Parameterized with three weight matrices and three bias vectors; • RNNs are fundamentally difficult to train due to their nonlinear iterative nature; • The derivative of the loss function can be exponentially large with respect to the hidden activations; • RNNs also suffer from the vanishing gradient problem. • Back Propagation Through Time (BPTT): • “Unfold” the recurrent network in time by stacking identical copies of the RNN, and redirect connections within the network to obtain connections between subsequent copies; • Hard to use where online adaptation is required, as the entire time series must be available. • Real-Time Recurrent Learning (RTRL) is a forward-pass-only algorithm that computes the derivatives of the RNN w.r.t. its parameters at each timestep; • Unlike BPTT, RTRL maintains the exact derivative of the loss so far at each timestep of the forward pass, without a backward pass or the need to store past hidden states; • However, the computational cost of RTRL is prohibitive, and it needs more memory than BPTT. • Applications: speech recognition and handwriting recognition.
  36. 36. LSTM: Long Short-Term Memory • An RNN structure that elegantly addresses the vanishing gradient problem using “memory units”; • These linear units have a self-connection of strength 1 and a pair of auxiliary “gating units” that control the flow of information to and from the unit; • Let N be the number of memory units of the LSTM. At each time step t, the LSTM maintains a set of vectors whose evolution is governed by the gating equations (shown on the slide); • Since the forward pass of the LSTM is relatively intricate, the equations for the correct derivatives of the LSTM are highly complex, making them tedious to implement; • Note: Theano has an LSTM module.
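The gating equations referenced above follow a standard form; a minimal numpy sketch of one LSTM time step (the tiny sizes and random parameters are my own, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with input, forget, and output gates.

    The memory cell c has an additive, self-connected update, which is
    what keeps gradients from vanishing. W, U, b hold the stacked
    parameters of the 4 gates: shapes (4N x D), (4N x N), (4N,).
    """
    N = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:N])            # input gate
    f = sigmoid(z[N:2*N])          # forget gate
    o = sigmoid(z[2*N:3*N])        # output gate
    g = np.tanh(z[3*N:4*N])        # candidate cell value
    c = f * c_prev + i * g         # additive memory update
    h = o * np.tanh(c)             # gated output
    return h, c

rng = np.random.default_rng(0)
N, D = 3, 2
W = rng.normal(0, 0.1, (4*N, D))
U = rng.normal(0, 0.1, (4*N, N))
b = np.zeros(4*N)
h, c = np.zeros(N), np.zeros(N)
for t in range(5):                 # run a short input sequence through the cell
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```

The three weight matrices and bias vectors from the RNN slide reappear here, just stacked four-fold for the gates.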
  37. 37. Left: RNN with one fully connected hidden layer; Right: LSTM with memory blocks in hidden layer. From Simple RNN to BPTT
  38. 38. Gated Recurrent Unit • GRU is a variation of the RNN, adaptively capturing dependencies of different time scales with each recurrent unit; • GRU also uses gating units to modulate the flow of information inside the unit, but without a separate memory cell. • GRU doesn’t control the degree to which its state is exposed; it exposes the whole state each time; • Differences from LSTM: • GRU exposes its full content without control; • GRU controls the information flow from the previous activation when computing the new candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate). • Shared virtues with LSTM: the additive component of the update from t to t + 1; • Easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps; • Effectively creates shortcut paths that bypass multiple temporal steps, which allows the error to be back-propagated easily without vanishing too quickly.
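The tied update-gate control described above is visible directly in the GRU equations; a minimal numpy sketch of one step (sizes and random parameters are my own, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh, Uz, Ur, Uh):
    """One GRU time step: update gate z and reset gate r, no separate memory cell.

    The single update gate z ties together how much old state is kept and
    how much of the candidate activation is added -- the coupling noted above.
    """
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate activation
    return (1 - z) * h_prev + z * h_tilde          # additive, gated interpolation

rng = np.random.default_rng(0)
N, D = 3, 2
Wz, Wr, Wh = (rng.normal(0, 0.1, (N, D)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(0, 0.1, (N, N)) for _ in range(3))
h = np.zeros(N)
for t in range(5):                                 # short input sequence
    h = gru_step(rng.normal(size=D), h, Wz, Wr, Wh, Uz, Ur, Uh)
```

Compare with the LSTM: the whole state h is exposed (no output gate), and a single gate z replaces the separate input/forget pair.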
  39. 39. Generative Model: MRF • Random field: F={F1,F2,…,FM}, a family of random variables on a set S in which each Fi takes a value fi in a label set L. • Markov random field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. • A generative model for the joint probability p(x); • Defines potential functions Ψ on maximal cliques A, which allow no direct probabilistic interpretation; • Maps a joint assignment to a non-negative real number; • Requires normalization; • An MRF is an undirected graphical model.
  40. 40. Belief Nets • A belief net is a directed acyclic graph composed of stochastic variables. • Observe some of the variables and solve two problems: • inference: infer the states of the unobserved variables; • learning: adjust the interactions between variables to make the observed data more likely. stochastic hidden cause visible effect Use nets composed of layers of stochastic variables with weighted connections.
  41. 41. Boltzmann Machines • An energy-based model associates an energy with each configuration of the stochastic variables of interest (for example, MRF, nearest neighbor); • Learning means adjusting the shape properties of the energy function so that observed data has low energy; • A Boltzmann machine is a stochastic recurrent model with hidden variables; • Markov chain Monte Carlo, i.e. MCMC sampling (appendix); • A restricted Boltzmann machine is a special case: • Only one layer of hidden units; • Factorization within each layer’s neurons/units (no connections in the same layer); • Contrastive divergence: approximation of the gradient (appendix). probability Energy Function Learning rule
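The contrastive-divergence approximation mentioned above is short enough to sketch for a binary RBM; a minimal CD-1 update in numpy (sizes, learning rate, and the single training pattern are my own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, a, b, v0, rng, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    Because there are no connections within a layer, each layer
    factorizes and can be sampled in one shot given the other layer.
    """
    ph0 = sigmoid(v0 @ W + b)                       # P(h=1 | v0): positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)                     # reconstruction P(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Approximate gradient: positive phase minus one-step negative phase.
    dW = v0[:, None] * ph0[None, :] - v1[:, None] * ph1[None, :]
    return W + lr * dW, a + lr * (v0 - v1), b + lr * (ph0 - ph1)

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.01, (n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
v = np.array([1., 1., 1., 0., 0., 0.])              # a single training pattern
for _ in range(100):
    W, a, b = cd1_update(W, a, b, v, rng)
```

After training, the visible biases drift toward the training pattern: bias entries for "on" pixels grow positive and those for "off" pixels negative.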
  42. 42. Deep Belief Networks • A hybrid model: can be trained as a generative or discriminative model; • Deep architecture: multiple layers (learn features layer by layer); • Multi-layer learning is difficult in sigmoid belief networks. • The top two layers have undirected connections, an RBM; • Lower layers receive top-down directed connections from the layers above; • Unsupervised or self-taught pre-training provides a good initialization; • Greedy layer-wise unsupervised training of the RBMs • Supervised fine-tuning • Generative: wake-sleep algorithm (up-down) • Discriminative: back propagation (bottom-up)
  43. 43. Deep Boltzmann Machine • Learning internal representations that become increasingly complex; • High-level representations built from a large supply of unlabeled inputs; • Pre-training consists of learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph); • Generative fine-tuning, different from DBN: • Positive and negative phase (appendix) • Discriminative fine-tuning, the same as for DBN: • Back propagation.
  44. 44. Deep Gated MRF • Conditional distribution over the input: • P(x|h) = N(mean(h), D); • examples: PPCA, factor analysis, ICA, Gaussian RBM; • the model does not represent dependencies well, only mean intensity; • P(x|h) = N(0, Covariance(h)); • examples: PoT (product of Student’s t), covariance RBM; • the model does not represent mean intensity well, only dependencies; • P(x|h) = N(mean(h), Covariance(h)); • mean cRBM, mean PoT; • two sets of latent variables modulate the mean and covariance of the conditional distribution over the input; • Deep gated MRF: RBM layers + MRF with adaptive affinities (to gate the effective interactions and to decide mean intensities); • Learning: Gibbs sampling/HMC sampling, fast persistent CD.
  45. 45. Deep Gated MRF
  46. 46. Denoising Auto-Encoder • Multilayer NNs with target output=input; • Reconstruction=decoder(encoder(input)); • Perturbs the input x to a corrupted version; • Randomly sets some of the coordinates of input to zeros. • Recover x from encoded perturbed data. • Learns a vector field towards higher probability regions; • Pre-trained with DBN or regularizer with perturbed training data; • Minimizes variational lower bound on a generative model; • corresponds to regularized score matching on an RBM; • PCA=linear manifold=linear Auto Encoder; • Auto-encoder learns the salient variation like a nonlinear PCA.
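The corruption-and-reconstruction training described above can be sketched with a tiny tied-weight linear auto-encoder; a minimal numpy version (the linear encoder, sizes, corruption level, and learning rate are my own simplifications, for illustration):

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Masking noise: randomly set coordinates of the input to zero."""
    return x * (rng.random(x.shape) >= drop_prob)

def dae_step(W, x, rng, drop_prob=0.3, lr=0.1):
    """One gradient step of a tied-weight linear denoising auto-encoder.

    Target output = the *clean* input x; the encoder sees a corrupted
    copy, so the model must learn to recover x (a vector field toward
    higher-probability regions).
    """
    x_tilde = corrupt(x, drop_prob, rng)
    h = W @ x_tilde                    # encoder
    x_hat = W.T @ h                    # decoder (tied weights)
    err = x_hat - x                    # reconstruction error vs the CLEAN input
    dW = np.outer(h, err) + np.outer(W @ err, x_tilde)  # gradient of 0.5*||err||^2
    return W - lr * dW

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 0.0])
W = rng.normal(0, 0.1, (2, 4))         # 2 hidden units: undercomplete code
for _ in range(300):
    W = dae_step(W, x, rng)
x_hat = W.T @ (W @ x)                  # reconstruction of the clean input
```

After training, the clean input is reconstructed far better than by the zero initialization, even though the encoder only ever saw corrupted copies.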
  47. 47. Stacked Denoising Auto-Encoder • Stack many (possibly sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning; • Drop the decode layer each time; • Performs better than stacking RBMs; • Supervised training on the last layer using the final features; • (Optional) supervised training on the entire network to fine-tune all weights of the neural net; • Empirically not quite as accurate as DBNs.
  48. 48. Generative Modeling • Have training examples x ~ pdata(x) • Want a model that draws samples: x ~ pmodel(x) • Where pmodel ≈ pdata • Conditional generative models • Speech synthesis: Text ⇒ Speech • Machine Translation: French ⇒ English • French: Si mon tonton tond ton tonton, ton tonton sera tondu. • English: If my uncle shaves your uncle, your uncle will be shaved • Image ⇒ Image segmentation • Environment simulator • Reinforcement learning • Planning • Leverage unlabeled data x ~ pdata(x)
  49. 49. Adversarial Nets Framework • A game between two players: • 1. Discriminator D • 2. Generator G • D tries to discriminate between: • A sample from the data distribution. • And a sample from the generator G. • G tries to “trick” D by generating samples that are hard for D to distinguish from data.
  50. 50. GANs • A framework for estimating generative models via an adversarial process, training two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. • The training procedure for G is to maximize the probability of D making a mistake. • This framework corresponds to a minimax two-player game: • In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere; • In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with BP. • There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
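The minimax value of this two-player game, E[log D(x)] + E[log(1 - D(G(z)))], and the D = 1/2 equilibrium can be illustrated numerically (the discriminator outputs below are made-up toy numbers, not a trained model):

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Minimax value  E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: D's outputs on data samples; d_fake: D's outputs on G's samples.
    D maximizes this value, G minimizes it.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the unique equilibrium, G matches the data distribution and
# D outputs 1/2 everywhere, giving value log(1/4).
v_eq = gan_value(np.full(8, 0.5), np.full(8, 0.5))

# A discriminator that separates real from fake achieves a higher value,
# which is the signal G then trains against.
v_good_d = gan_value(np.full(8, 0.9), np.full(8, 0.1))
```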
  51. 51. GANs
  52. 52. GANs
  53. 53. GANs
  54. 54. GANs Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator).
  55. 55. Image Denoising • Noise reduction: various assumptions of content internal structures; • Learning-based • Field of experts (MRF), CRF, NN (MLP, CNN); • Sparse coding: K-SVD, LSSC,…. • Self-similarity • Gaussian, Median; • Bilateral filter, anisotropic diffusion; • Non-local means. • Sparsity prior • Wavelet shrinkage; • Use of both Redundancy and Sparsity • BM3D (block matching 3-d filter)-benchmark; • Can ‘Deep Learning’ compete with BM3D?
  56. 56. Block Matching 3-D for Denoising • For each patch, find similar patches; • Group the similar patches into a 3-D stack; • Perform a 3-D transform (2-D + 1-D) and coefficient thresholding (sparsity); • Apply the inverse 3-D transform (1-D + 2-D); • Combine multiple patch estimates collaboratively (aggregation); • Two stages: hard thresholding -> Wiener filtering (soft).
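The grouping-and-thresholding steps above can be sketched in a heavily simplified form. This is not BM3D: I substitute a plain FFT of the stacked patches for BM3D's separable 2-D + 1-D transform, skip the two-stage Wiener pass, and use toy data, purely to show the block-matching + sparsity mechanism:

```python
import numpy as np

def denoise_group(patches, ref_idx, k=4, thresh=1.0):
    """Heavily simplified BM3D-style step on a set of flattened patches.

    1. Block matching: find the k patches closest to the reference patch.
    2. Stack them and transform the whole group (here a plain FFT for brevity).
    3. Hard-threshold the coefficients (sparsity), inverse transform.
    4. Every patch in the group gets a collaborative estimate.
    """
    ref = patches[ref_idx]
    dists = np.sum((patches - ref) ** 2, axis=1)   # L2 block matching
    group_idx = np.argsort(dists)[:k]
    stack = patches[group_idx]                     # the 3-D group (k x patch_dim)
    coeffs = np.fft.fft2(stack)                    # transform of the group
    coeffs[np.abs(coeffs) < thresh] = 0            # hard thresholding
    estimates = np.real(np.fft.ifft2(coeffs))
    return group_idx, estimates

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 16), (8, 1))    # 8 identical "similar" patches
noisy = clean + rng.normal(0, 0.1, clean.shape)
idx, est = denoise_group(noisy, ref_idx=0, k=8, thresh=2.0)
mse_noisy = np.mean((noisy - clean) ** 2)
mse_est = np.mean((est - clean[0]) ** 2)
```

Because the patches are similar, the signal concentrates in a few large coefficients while the noise spreads thinly across all of them; thresholding keeps the former and removes most of the latter.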
  57. 57. BM3D Outline
  58. 58. Apply Sparse Coding for Denoising • A cost function for the model Y = Z + n; • Solve for Z with a sparsity prior term; • Break the problem into smaller problems; • Aim at minimization at the patch level (terms: proximity of the selected patch, sparsity of the representations, global proximity).
  59. 59. Image Data in K-SVD Denoising • Extract overlapping patches from a single image; • clean or corrupted, even a reference (multiple frames)? • for example, 100k block patches of size 8x8; • Apply K-SVD to train a dictionary; • size 64x256 (n=64, dictionary size k=256). • Lagrange multiplier lambda = 30/sigma of the noise; • The coefficients come from OMP; • the maximal iteration count is 180 and the noise gain C=1.15; • the number of nonzero elements L=6 (sigma=5). • Denoising by normalized weighted averaging:
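The final normalized weighted-averaging step (every pixel averages all the denoised patch estimates that cover it, optionally blended with the noisy image through the Lagrange weight) can be sketched directly; the tiny 3x4 example below is my own, for illustration:

```python
import numpy as np

def aggregate(shape, patch_estimates, positions, p=3, lam=0.0, y=None):
    """Normalized weighted averaging of overlapping denoised patches.

    Each pixel's estimate is the average of every patch estimate covering
    it; setting lam > 0 and passing the noisy image y blends it in, as in
    the K-SVD denoising objective.
    """
    acc = np.zeros(shape)
    weight = np.zeros(shape)
    for est, (i, j) in zip(patch_estimates, positions):
        acc[i:i+p, j:j+p] += est          # put the patch estimate back in place
        weight[i:i+p, j:j+p] += 1.0       # count overlaps per pixel
    if y is not None:
        acc += lam * y                    # optional pull toward the noisy image
        weight += lam
    return acc / np.maximum(weight, 1e-12)

# Two overlapping 3x3 patch estimates on a 3x4 image.
shape = (3, 4)
e1 = np.full((3, 3), 1.0)
e2 = np.full((3, 3), 3.0)
out = aggregate(shape, [e1, e2], [(0, 0), (0, 1)])
```

The leftmost column is covered only by the first patch (value 1), the rightmost only by the second (value 3), and the overlap averages to 2.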
  60. 60. Image Denoising by Conv. Nets • Image denoising as a learning problem: train a Conv. Net; • Parameter estimation to minimize the reconstruction error. • Online learning (rather than batch learning): stochastic gradient • Gradient updates from 6x6 patches sampled from 6 different training images • Run like greedy layer-wise training for each layer.
  61. 61. Image Denoising by MLP • Denoising as learning: map noisy patches to noise-free ones; • Patch size 17x17; • Training with different noise types and levels: • sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifacts; • Feed-forward NN: MLP; • input layer 289-d, four hidden layers (2047-d), output layer 289-d. • input layer 169-d, four hidden layers (511-d), output layer 169-d. • 40 million training images from LabelMe and the Berkeley segmentation dataset! • 1000 testing images: McGill, Pascal VOC 2007; • GPU: slower than BM3D, much faster than K-SVD. • Deep learning can help: unsupervised learning from unlabeled data.
62. 62. Image Denoising with Deep Nets • Combine sparse coding and a deep network pre-trained by DAE; • Reconstruct the clean image from the noisy image by training a DAE; • image denoising by choosing an appropriate η in different situations. • Deep network: stacked sparse DAE (denoising auto-encoder); • sparsity enforced on the hidden layer via a KL-divergence penalty. • Pre-training; • Fine-tuning by back propagation; • Patch-based.
63. 63. Image Denoising by DBMs • Combine Boltzmann machine and Denoising Auto-Encoder; • 100,000 image patches of sizes 4×4, 8×8 and 16×16 from the CIFAR-10 dataset to get 50,000 training samples; • Three sets of testing images from USC: textures, aerials and miscellaneous; • Gaussian BMs+DAEs: one, two and four hidden layers; • Deep network training: • A two-stage pre-training and PCD training for Gaussian DBMs; • Stochastic BP for DAE training; • Noise: Gaussian, salt-and-pepper; • Patch-based as well; • Comparison: when noise is heavy, DBM beats DAE; otherwise, vice versa.
64. 64. Image Denoising by Deep Gated MRF • Works by solving the following optimization problem, where F(x;θ) is the mPoT energy function • Adapt the generic prior learned by mPoT: • 1. Adapt the parameters to the denoised test image (mPoT+A), as in sparse coding; • 2. Add to the denoising loss an extra quadratic term pulling the estimate close to the denoising result of the non-local means algorithm (mPoT+A+NLM). • Example results: original noisy (22.1dB), mPoT (28.0dB), mPoT+A (29.2dB), mPoT+A+NLM (30.7dB).
  65. 65. Image Restoration by CNN • Collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network. • Given a noisy image x, predict a clean image y close to the clean image y* • the input kernels p1 = 16, the output kernel pL = 8. • 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1. • W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512. • This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of noise in natural images. • Train the weights Wl and biases bl by minimizing the mean squared error • Minimize with SGD • Regarded as: first patchifying the input, applying a fully-connected neural network to each patch, and averaging the resulting output patches.
  66. 66. Image Restoration by CNN • Comparison.
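The interpretation on the slide above — patchify the input, apply a fully-connected network to each patch, average the overlapping outputs — can be sketched as below. The per-patch function here is the identity (a stand-in for the learned network), which makes the averaging easy to verify:

```python
import numpy as np

def apply_patchwise(img, patch, f):
    """Apply f to every overlapping patch and average the overlapping
    outputs back onto the image grid (the patchify-and-average view)."""
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)
    weight = np.zeros_like(img, dtype=float)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            out[i:i+patch, j:j+patch] += f(img[i:i+patch, j:j+patch])
            weight[i:i+patch, j:j+patch] += 1.0
    return out / weight

rng = np.random.default_rng(2)
img = rng.standard_normal((16, 16))
# with f = identity, patch-averaging must reproduce the input exactly
recon = apply_patchwise(img, 8, lambda p: p)
print(np.max(np.abs(recon - img)))
```

In the real model, f would be the trained per-patch network; the averaging of overlapping predictions is what suppresses blocking artifacts at patch borders.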
67. 67. Image Deconvolution with Deep CNN • Establish the connection between traditional optimization-based schemes and a CNN architecture; • A separable structure is used as a reliable support for robust deconvolution against artifacts; • The deconvolution task can be approximated by a convolutional network by nature, based on the kernel separability theorem; • Kernel separability is achieved via SVD; • An inverse kernel of length 100 is enough for plausible deconvolution results; • Image deconvolution convolutional neural network (DCNN); • Two hidden layers: h1 has 38 large-scale 1-d kernels of size 121×1, and h2 applies 38 kernels of size 1×121 to each map in h1; the output is a 1×1×38 kernel; • Random-weight initialization or initialization from the separable kernel inversion; • Concatenation of the deconvolution CNN module with a denoising CNN; • called "Outlier-rejection Deconvolution CNN (ODCNN)"; • 2 million sharp patches together with their blurred versions in training.
  68. 68. Image Deconvolution with Deep CNN
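Kernel separability via SVD, as used above, amounts to writing a 2-D kernel as a sum of outer products of 1-D kernels. A small sketch with an assumed Gaussian kernel (exactly rank 1, so a single separable term suffices):

```python
import numpy as np

def separate_kernel(K):
    """Split a 2-D kernel into a sum of separable (outer-product) terms
    via SVD; a low-rank blur kernel needs only the first few terms."""
    U, s, Vt = np.linalg.svd(K)
    cols = [np.sqrt(s[i]) * U[:, i] for i in range(len(s))]
    rows = [np.sqrt(s[i]) * Vt[i, :] for i in range(len(s))]
    return cols, rows, s

# A Gaussian blur kernel is exactly rank 1: K = g g^T
g = np.exp(-0.5 * (np.arange(-3, 4) / 1.2) ** 2)
g /= g.sum()
K = np.outer(g, g)

cols, rows, s = separate_kernel(K)
rank1 = np.outer(cols[0], rows[0])      # first separable term
print(s[:3], np.max(np.abs(rank1 - K)))
```

Each separable term lets a 2-D convolution be replaced by a column convolution followed by a row convolution, which is exactly the structure the large 121×1 and 1×121 kernels in the DCNN exploit.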
69. 69. End-to-End Deep Learning for Deblur The feature extraction module transforms the image to a learned gradient-like representation suitable for kernel estimation. The kernel is estimated by division in Fourier space, and then the latent image is estimated similarly. Multiple stages, each consisting of these three operations, operate on both the blurry image and the latent image.
70. 70. Intermediary outputs of a single-stage NN with 8x Conv, Tanh, 8x8, Tanh, 8x4. End-to-End Deep Learning for Deblur
71. 71. Compression Artifacts Reduction by a Deep CNN Framework of the Artifacts Reduction Convolutional Neural Network (AR-CNN). The network consists of four convolutional layers, each responsible for a specific operation. It optimizes the four operations (i.e., feature extraction, feature enhancement, mapping and reconstruction) jointly in an end-to-end framework. Example feature maps shown at each step illustrate the functionality of each operation; they are normalized for better visualization. Reuse the features learned in a relatively easier task to initialize a deeper or harder network, called "easy-hard transfer": shallow to deep model, high to low quality, standard to real use case.
72. 72. DehazeNet: CNN for Dehazing DehazeNet conceptually consists of four sequential operations (feature extraction, multi-scale mapping, local extremum and non-linear regression), constructed from three convolution layers, a max-pooling layer, a Maxout unit and a BReLU activation function.
73. 73. Deep Joint Demosaicking and Denoising • A data-driven approach: train a deep neural network on a large corpus of images instead of using hand-tuned filters. • To create a training set, metrics to identify difficult patches and techniques for mining community photographs for such patches. The first layer packs 2 × 2 blocks in the Bayer image into a 4D vector to restore translation invariance and speed up the processing. Augment each vector with the noise parameter σ to form 5D vectors. Then, a series of convolutional layers filter the image to interpolate the missing color values. Finally, unpack the 12 color samples back to the original pixel grid and concatenate a masked copy of the input mosaic. A last group of convolutions at full resolution produces the final features; linearly combine them to produce the demosaicked output.
  74. 74. Non-local Color Image Denoising with Convolutional Neural Networks • Variational methods that exploit the inherent non-local self- similarity property of natural images. • Build on it and introduce deep networks that perform non-local processing and significantly benefit from discriminative learning. Convolutional implementation of the non-local operator Architecture of a single stage of the proposed non-local convolutional network.
  75. 75. Non-local Color Image Denoising with Convolutional Neural Networks Original Noisy Denoised CNLNet Denoised CBM3D
76. 76. Image Super-resolution • Super-resolution (SR): how to find missing details / high-frequency components? • Interpolation-based: • Edge-directed; • B-spline; • Sub-pixel alignment; • Reconstruction-based: • Gradient prior; • TV (Total Variation); • MRF (Markov Random Field). • Learning-based (hallucination). • Example-based: texture synthesis, LR-HR mapping; • Self learning: sparse coding, self similarity-based; • 'Deep Learning' competes with shallow learning in image SR.
  77. 77. • Estimate missing HR detail that isn’t present in the original LR image, and which we can’t make visible by simple sharpening; • Image database with HR/LR image pairs; • Algorithm uses a training set to learn the fine details of LR; • It then uses learned relationships (MRF) to predict fine details. What is Example Based SR?
  78. 78. SR from a Single Image • Multi-frame-based SR (alignment); • Example-based SR.
  79. 79. SR from a Single Image • Combination of Example-based and Multi-frame- based. same scale different scales FindNN Parent Copy
  80. 80. Example-based Edge Statistics Single Frame
81. 81. Sparse Coding for SR [Yang et al. 08] • HR patches have a sparse representation w.r.t. an over-complete dictionary of patches randomly sampled from similar images. • Sample 3 x 3 LR overlapping patches y on a regular grid. • The output HR patch x = Dh*a for some sparse coefficient vector a, with Dh the HR dictionary; • The input LR patch satisfies y = L*Dh*a: linear measurements of the sparse coefficient vector a, where Dl = L*Dh is the dictionary of low-resolution patches and L the downsampling/blurring operator; • If we can recover the sparse solution a to the underdetermined system of linear equations y = Dl*a (via convex relaxation), we can reconstruct x = Dh*a; • T, T': select overlap between patches; F: 1st and 2nd derivatives from the LR bicubic interpolation.
  82. 82. Sparse Coding for SR [Yang et al.08] Two training sets: Flower images – smooth area, sharp edge Animal images -- HF textures Randomly sample 100,000 HR-LR patch pairs from each set of training images.
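The convex (L1) relaxation on the previous slides can be solved with iterative soft thresholding (ISTA). A minimal sketch with an assumed random dictionary and regularization weight, not the trained dictionaries from the paper:

```python
import numpy as np

def ista(D, y, lam=0.05, n_iter=500):
    """Iterative soft thresholding for min ||y - D a||^2 + lam ||a||_1,
    the convex relaxation used to find the sparse code of an LR patch."""
    L = np.linalg.norm(D, 2) ** 2          # spectral norm^2: step-size scale
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)           # half the true gradient; folded into L
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)
    return a

rng = np.random.default_rng(3)
D = rng.standard_normal((30, 60))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
true = np.zeros(60)
true[[5, 42]] = [2.0, -1.5]
y = D @ true
a = ista(D, y)
print(np.linalg.norm(D @ a - y))
```

Unlike greedy OMP, the soft-thresholding iteration solves the L1-penalized objective directly; the recovered coefficients carry a small bias proportional to the penalty weight.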
83. 83. Comparison: sparse coding, MRF/BP [Freeman IJCV '00], bicubic, original.
84. 84. Joint Dictionary Learning for SR • Local sparse prior for detail recovery; • Global constraint for artifact avoidance (L = SH); • Joint dictionary learning; • An overlap term extracts the region of overlap with previously reconstructed patches; its weight controls the tradeoff between matching the LR input and finding a neighbor-compatible HR patch; • Solved by back-projection: a gradient descent method.
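The back-projection step enforcing the global constraint L = SH can be sketched as gradient descent on ||L − SH||², with an assumed average-pooling downsampler S and its transpose-like upsampler:

```python
import numpy as np

def downsample(x, s=2):
    """Average-pooling operator S: HR -> LR."""
    H, W = x.shape
    return x.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

def upsample(x, s=2):
    """Transpose-like operator: spread each LR pixel over its s*s block."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def back_project(hr0, lr, s=2, n_iter=50, step=1.0):
    """Gradient descent on ||L - S H||^2: repeatedly push the HR estimate
    so that its downsampled version matches the LR observation."""
    hr = hr0.copy()
    for _ in range(n_iter):
        err = lr - downsample(hr, s)      # residual in LR space
        hr += step * upsample(err, s)     # back-project it to HR space
    return hr

rng = np.random.default_rng(4)
hr_true = rng.standard_normal((8, 8))
lr = downsample(hr_true)
hr = back_project(np.zeros((8, 8)), lr)   # start from a blank estimate
print(np.max(np.abs(downsample(hr) - lr)))
```

The iteration only enforces consistency with the LR observation; in the full method, the sparse-coding prior supplies the high-frequency detail that this constraint alone cannot.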
85. 85. Comparison: input LR, bicubic, sparse coding, MRF/BP [Freeman IJCV '00].
  86. 86. Image SR by DBMs • Sparsity prior pre-learned into the dictionary [Yang’08]; • Learn the dictionary (size=1024), encoded in the RBM; • Trained by contrastive divergence. • Use interpolation to initialize HR from LR, to accelerate inference; • Training images: 10,000 HR/LR image patch (8x8) pairs. The image patches are elements of the dictionaries to be learned and collected from the normalized weights in RBM.
  87. 87. Results of images magnified by a factor of 2
88. 88. Super-resolution by DBNs • SR as an image completion problem with missing (HF) data; • Training: the HR image is divided into PxP patches, transformed to the DCT domain, and DBNs are trained by SGD, layer-by-layer; • Restoring: the LR image is interpolated first, divided into PxP patches as well, transformed to the DCT domain, and fed into the DBNs to infer the missing HF; then the transform is reversed. • Iteratively. • Experiment setting: P=16, scaling=2, learning rate 0.01, hidden units 400 (1st layer) + 200 (2nd layer).
  89. 89. Super-resolution by DBNs Connections among LF and HF Restoration of HF after training (Two hidden layers as example)
91. 91. Image Super-resolution by Learning Deep CNN • Learns an end-to-end mapping between low- and high-resolution images as a deep CNN that takes the low-resolution image as input and outputs the high-resolution one; • Traditional sparse-coding-based SR can also be viewed as a deep convolutional network, but it handles each component separately, whereas this approach jointly optimizes all layers.
92. 92. Image Super-resolution by Learning Deep CNN • 1. Convolution layer: W1 of size c x f1 x f1 x n1, B1 n1-dim; • 2. Rectified Linear Unit: ReLU, max(0, x); • 3. Convolution layers: W2 of size n1 x 1 x 1 x n2, B2 n2-dim; W3 of size n2 x f3 x f3 x c, B3 c-dim; • 4. Loss function: MSE. • Note: c=3, f1 = 9, f3 = 5, n1 = 64, n2 = 32.
  93. 93. Image Super-resolution by Learning Deep CNN • LR image upscaled to the desired size using bicubic interpolation as Y; Then recover from Y an image F(Y) similar to ground truth HR image X. • Learn a mapping F, consists of three operations: • 1. Patch extraction and representation; • 2. Non-linear mapping; • 3. Reconstruction. • Traditional Sparse coding method shown as
  94. 94. Image Super-resolution by Learning Deep CNN Results comparison.
  95. 95. Accelerating the Super-Resolution Convolutional Neural Network • A compact hourglass-shape CNN structure for faster and better Super-Resolution Convolutional Neural Network (SRCNN). • Introduce a deconvolution layer at the end of the network, then the mapping is learned directly from the original low-resolution image (without interpolation) to the high-resolution one. • Reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. • Adopt smaller filter sizes but more mapping layers.
  96. 96. Accelerating the Super-Resolution Convolutional Neural Network The FSRCNN consists of convolution layers and a deconvolution layer. The convolution layers can be shared for different upscaling factors. A specific deconvolution layer is trained for different upscaling factors.
97. 97. Image Super-Resolution by a Cascade of Stacked CLAs • In each layer of the cascade, non-local self-similarity search is first performed to enhance high-frequency texture details of the partitioned patches in the input image; • The enhanced image patches are then input into a collaborative local auto-encoder (CLA) to suppress noise and enforce compatibility of the overlapping patches; • By closing the loop on non-local self-similarity search and CLA in a cascade layer, the super-resolved image fed into the next layer is refined until the required image scale is reached.
98. 98. Image Super-Resolution by a Cascade of Stacked CLAs • Experimental results compared with others: Kim's sparse regression, exemplar-based, Yang's sparse coding, cascade of stacked CLAs.
99. 99. Accurate Image Super-Resolution Using Very Deep Convolutional Networks • Use a very deep convolutional network inspired by the VGG-net used for ImageNet classification. • The model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. • With very deep networks, apply a simple, effective training procedure: • Learn residuals only and use extremely high learning rates (10^4 times higher than SRCNN), enabled by adjustable gradient clipping. • Apply a single network for all scale factors (vs. SRCNN, whose ×3 model is used for all scales in the comparison).
100. 100. Accurate Image Super-Resolution Using Very Deep Convolutional Networks Network structure: cascade a pair of layers repeatedly. An interpolated low-resolution (ILR) image goes through the layers and transforms into a high-resolution (HR) image. The network predicts a residual image, and the addition of the ILR and the residual gives the desired output. Use 64 filters for each convolutional layer; some sample feature maps are drawn for visualization. Most features after applying rectified linear units (ReLU) are zero.
101. 101. Deeply-Recursive Convolutional Network for Image Super-Resolution • A deeply-recursive convolutional network (DRCN); • A very deep recursive layer (up to 16 recursions); • Training: recursive supervision, skip connections. It consists of three parts: embedding network, inference network and reconstruction network. The inference network has a recursive layer.
102. 102. Deeply-Recursive Convolutional Network for Image Super-Resolution Unfolding the inference network. Left: a recursive layer. Right: unfolded structure. The same filter W is applied to feature maps recursively. The unfolded model can utilize a very large context without adding new weight parameters.
103. 103. Deeply-Recursive Convolutional Network for Image Super-Resolution (a) Model with recursive supervision and skip connection. (b) Applying deep supervision. (c) Example of the expanded structure of (a) w/o parameter sharing (no recursion).
104. 104. Image/Video Super-Resolution Using an Efficient Sub-Pixel CNN • CNN capable of real-time SR of 1080p videos on a single K2 GPU. • A CNN architecture where feature maps are extracted in the LR space. • A sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. • Replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map. The sub-pixel convolutional neural network (ESPCN), with 2 convolution layers for feature map extraction, and a sub-pixel convolution layer that aggregates the feature maps from LR space and builds the SR image in a single step.
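The sub-pixel convolution layer is, at its core, a periodic rearrangement (depth-to-space) of r² LR feature maps into one HR map. A numpy sketch of that rearrangement:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel (depth-to-space) layer: rearrange (C*r^2, H, W) feature
    maps into a (C, H*r, W*r) image, so all SR work happens in LR space."""
    C_r2, H, W = x.shape
    C = C_r2 // (r * r)
    x = x.reshape(C, r, r, H, W)           # split channels into the r*r grid
    x = x.transpose(0, 3, 1, 4, 2)         # -> (C, H, r, W, r)
    return x.reshape(C, H * r, W * r)

x = np.arange(4 * 2 * 2).reshape(4, 2, 2)  # 4 maps of 2x2 -> one 4x4 image (r=2)
y = pixel_shuffle(x, 2)
print(y.shape)
print(y)
```

Each output pixel at (h*r+i, w*r+j) comes from feature map (i, j) at LR position (h, w); the learned convolutions before this layer produce those r² maps.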
  105. 105. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution • Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct sub-band residuals of HR images. • At each pyramid level, the model takes coarse-resolution feature maps as input, predicts the high-frequency residuals, and uses transposed convolutions for upsampling to the finer level. • Not require the bicubic interpolation as the pre-processing step. • Train the proposed LapSRN with deep supervision using a robust Charbonnier loss function and achieve high-quality reconstruction. • The network generates multi-scale predictions in one feed- forward pass through the progressive reconstruction, thereby facilitates resource-aware applications.
  106. 106. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution
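The Charbonnier loss used to train LapSRN is a smooth, robust approximation of L1. A minimal sketch (the ε value here is an assumption):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier penalty sqrt(d^2 + eps^2): behaves like |d| away from
    zero but is differentiable everywhere, unlike plain L1."""
    d = pred - target
    return np.mean(np.sqrt(d * d + eps * eps))

near0 = charbonnier(np.array([0.0]), 0.0)   # ~eps, not a kink like |d|
at1 = charbonnier(np.array([1.0]), 0.0)     # ~1.0, i.e. L1-like
print(near0, at1)
```

Compared with MSE, this penalty down-weights large residuals (outliers), which is why it tends to give sharper reconstructions than an L2 loss.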
107. 107. Image Super-Resolution via Deep Recursive Residual Network • A very deep CNN model (up to 52 convolutional layers) named Deep Recursive Residual Network (DRRN) that strives for deep yet concise networks. • Residual learning is adopted, both in global and local manners, to mitigate the difficulty of training very deep networks. • Recursive learning is used to control the model parameters while increasing the depth. • Code at /DRRN CVPR17.
108. 108. Image Super-Resolution via Deep Recursive Residual Network Structures of recursive blocks: U is the number of residual units in a recursive block. A close look at the u-th residual unit in DRRN, with residual function F.
109. 109. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network • SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). • Capable of inferring photo-realistic natural images for 4× upscaling factors. • A perceptual loss function which consists of an adversarial loss and a content loss. • The adversarial loss pushes the solution to the natural image manifold using a discriminator network trained to differentiate between super-resolved images and original photo-realistic images. • A content loss motivated by perceptual similarity instead of similarity in pixel space. • The deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks.
110. 110. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
111. 111. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  112. 112. Appendix
  113. 113. Outline • PCA, AP & Spectral Clustering • NMF & pLSA • ISOMAP • LLE • Laplacian Eigenmaps • Gaussian Mixture & EM • Hidden Markov Model (HMM) • Discriminative model: CRF • Product of Experts • Back propagation • Stochastic gradient descent • MCMC sampling for optimization approx. • Mean field for optimization approx. • Contrastive divergence for RBMs • “Wake-sleep” algorithm for DBNs • Two-stage pre-training for DBMs • Greedy layer-wise unsupervised pre-training
  114. 114. Graphical Models • Graphical Models: Powerful framework for representing dependency structure between random variables. • The joint probability distribution over a set of random variables. • The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables. • The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables. • Two type of graphical models: • Directed (Bayesian networks) • Undirected (Markov random fields, Boltzmann machines) • Hybrid graphical models that combine directed and undirected models, such as Deep Belief Networks, Hierarchical-Deep Models.
115. 115. PCA, AP & Spectral Clustering • Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. • This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. • PCA is sensitive to the relative scaling of the original variables. • Closely related to the Karhunen–Loève transform (KLT), Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc. • Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm. • Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions. • The similarity matrix consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
  116. 116. PCA, AP & Spectral Clustering
  117. 117. NMF & pLSA • Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, that all three matrices have no negative elements. • The different types arise from using different cost functions for measuring the divergence between V and W*H and possibly by regularization of the W and/or H matrices; • squared error, Kullback-Leibler divergence or total variation (TV); • NMF is an instance of a more general probabilistic model called "multinomial PCA“, as pLSA (probabilistic latent semantic analysis); • pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions; • Their parameters are learned using EM algorithm; • pLSA is based on a mixture decomposition derived from a latent class model, not as downsizing the occurrence tables by SVD in LSA. • Note: an extended model, LDA (Latent Dirichlet allocation) , adds a Dirichlet prior on the per-document topic distribution.
  118. 118. NMF & pLSA Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)
119. 119. ISOMAP • General idea: • Approximate the geodesic distances by shortest graph distances; • MDS (multi-dimensional scaling) using geodesic distances. • Algorithm: • Construct a neighborhood graph; • Construct a distance matrix; • Find the shortest path between every i and j (e.g. using Floyd–Warshall) and construct a new distance matrix such that Dij is the length of the shortest path between i and j; • Apply MDS to the matrix to find the coordinates.
120. 120. LLE (Locally Linear Embedding) • General idea: represent each point on the local linear subspace of the manifold as a linear combination of its neighbors to characterize the local neighborhood relations; then use the same linear coefficients for the embedding, to preserve the neighborhood relations in the low-dimensional space; • Compute the coefficients w for each data point by solving a constrained LS problem; • Algorithm: • 1. Find the weight matrix W of linear coefficients; • 2. Find the low-dimensional embedding Y that minimizes the reconstruction error sum_i || Y_i - sum_j W_ij Y_j ||^2; • 3. Solution: eigen-decomposition of M = (I-W)'(I-W).
121. 121. Laplacian Eigenmaps • General idea: minimize the norm of the Laplace-Beltrami operator on the manifold, which measures how far apart the map sends nearby points; • Avoid the trivial solution f = const. • The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph with appropriate weights; • Construct the Laplacian matrix L = D - W. • Algorithm: • Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors); • Construct an adjacency matrix with appropriate weights; • Minimize the weighted quadratic embedding cost; • The solution is the generalized eigen-decomposition of the graph Laplacian; • Spectral embedding of the Laplacian manifold; • The first eigenvector is trivial (the all-one vector).
  122. 122. Gaussian Mixture Model & EM • Mixture model is a probabilistic model for representing the presence of subpopulations within an overall population; • “Mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population; • A Gaussian mixture model can be Bayesian or non-Bayesian; • A variety of approaches focus on maximum likelihood estimate (MLE) as expectation maximization (EM) or maximum a posteriori (MAP); • EM is used to determine the parameters of a mixture with an a priori given number of components (a variation version can adapt it in the iteration); • Expectation step: "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point; • Maximization step: plug-in estimates, mixing coefficients and component model parameters, are re-computed for the distribution parameters; • Each successive EM iteration will not decrease the likelihood. • Alternatives of EM for mixture models: • mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC); • Spectral methods based on SVD; • Graphical model: MRF or CRF.
  123. 123. Gaussian Mixture Model & EM
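The E- and M-steps above, for a two-component 1-D Gaussian mixture, can be sketched in numpy (the initialization and component count are assumptions for the demo):

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (means, stds, weights)."""
    mu = np.array([x.min(), x.max()])      # crude initialization
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) \
               / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft memberships
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sd, pi

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
mu, sd, pi = em_gmm_1d(x)
print(np.sort(mu), sd, pi)
```

The "partial membership" of the E-step is exactly the responsibility matrix `resp`; the M-step plug-in estimates are weighted means, variances and mixing proportions.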
  124. 124. Hidden Markov Model • A hidden Markov model (HMM) is a statistical Markov model: the modeled system is a Markov process with unobserved (hidden) states; • In HMM, state is not visible, but output, dependent on state, is visible. • Each state has a probability distribution over the possible output tokens; • Sequence of tokens generated by an HMM gives some information about the sequence of states. • Note: the adjective 'hidden' refers to the state sequence through which the model passes, not to the parameters of the model; • A HMM can be considered a generalization of a mixture model where the hidden variables are related through a Markov process; • Inference: prob. of an observed sequence by Forward-Backward Algorithm and the most likely state trajectory by Viterbi algorithm (DP); • Learning: optimize state transition and output probabilities by Baum- Welch algorithm (special case of EM).
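The Forward algorithm mentioned above computes the probability of an observed sequence by dynamic programming. A sketch on an assumed 2-state toy HMM:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence) for an HMM with
    initial distribution pi, transition matrix A, emission matrix B."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate, then weight by emission
    return alpha.sum()

# A 2-state toy HMM with 2 output symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# sanity check: probabilities of all length-3 sequences must sum to 1
total = sum(forward(pi, A, B, [a, b, c])
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)
```

Replacing the sum in the recursion with a max (and tracking argmaxes) turns this into the Viterbi algorithm for the most likely state trajectory.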
125. 125. Max-Flow and Min-Cut • A flow network G(V, E) is a fully connected directed graph where each edge (u,v) in E has a positive capacity c(u,v) >= 0; • The max-flow problem is to find the flow of maximum value on a flow network G; • An s-t cut (or simply cut) of a flow network G is a partition of V into S and T = V-S, such that s is in S and t is in T; • A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network; • Methods for max-flow or min-cut: • Ford-Fulkerson method; • "Push-Relabel" method.
126. 126. Graph Cuts for Energy Minimization • Mostly, labeling is solved as an energy minimization problem; • Two common energy models: • Potts interaction energy model; • Linear interaction energy model. • The graph G contains two kinds of vertices: p-vertices and i-vertices; • all the edges in the neighborhood N are called n-links; • edges between the p-vertices and the i-vertices are called t-links. • In the multiple-labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex; • The minimum-cost multi-way cut will minimize the energy function, where the severed n-links correspond to the boundaries of the labeled vertices; • Approximation algorithms to find this multi-way cut: • "alpha-expansion" algorithm; • "alpha-beta swap" algorithm.
127. 127. Belief Propagation • A simplified Bayes net: it propagates information throughout a graphical model via a series of messages between neighboring nodes iteratively; it is likely to converge to a consensus that determines the marginal probabilities of all the variables; • Messages estimate the cost (or energy) of a configuration of a clique given all other cliques; the messages are then combined to compute a belief (marginal or maximum probability); • Two types of BP methods: • max-product; • sum-product. • BP provides the exact solution when there are no loops in the graph! • Equivalent to dynamic programming/Viterbi in these cases; • Loopy belief propagation still provides an approximate (but often good) solution.
  128. 128. • Generalized BP for pairwise MRFs • Hidden variables xi and xj are connected through a compatibility function; • Hidden variables xi are connected to observable variables yi by the local “evidence” function; • The joint probability of {x} is given by • To improve inference by taking into account higher-order interactions among the variables; • An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes; • This is the intuition in Generalized Belief Propagation (GBP).
  129. 129. Discriminative Model: CRF • Conditional , not joint, probabilistic sequential models p(y|x) • Allow arbitrary, non-independent features on the observation seq X • Specify the probability of possible label seq given an observation seq • Prob. of a transition between labels depend on past/future observ. • Relax strong independence assumptions, no p(x) required • CRF is MRF plus “external” variables, where “internal” variables Y of MRF are un-observables and “external” variables X are observables • Linear chain CRF: transition score depends on current observation • Inference by DP like HMM, learning by forward-backward as HMM • Optimization for learning CRF: discriminative model • Conjugate gradient, stochastic gradient,…
  130. 130. Product of Experts (PoE) • Model a probability distribution by combining the output from several simpler distributions. • Combine several probability distributions ("experts") by multiplying their density functions— similar to “AND" operation. • This allows each expert to make decisions on the basis of a few dimensions w.o. having to cover the full dimensionality. • Related to (but quite different from) a mixture model, combining several probability distributions via “OR" operation. • Learning by CD: run N samplers in parallel, one for each data-case in the (mini-)batch; • Boosting: focusing on training data with high reconstruct. errors; • Easy for inference, no suffer from “Explaining Away”.
131. 131. Stochastic Gradient Descent (SGD) • The general class of estimators that arise as minimizers of sums are called M-estimators; • The minimizers are stationary points of the likelihood function (zeroes of its derivative, the score function); • Online gradient descent samples a subset of summand functions at every step: • the true gradient is approximated by the gradient at a single example; • the training set is shuffled at each pass. • A compromise between the two forms, often called "mini-batches": the true gradient is approximated by a sum over a small number of training examples. • SGD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
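A minimal sketch of mini-batch SGD with per-pass shuffling, on least-squares linear regression (the problem and hyper-parameters are assumptions for the demo):

```python
import numpy as np

def sgd_linear(X, y, lr=0.1, batch=8, epochs=50, seed=0):
    """Mini-batch SGD for least squares: shuffle each pass, estimate the
    gradient on a small batch, step downhill."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))        # reshuffle the training set
        for start in range(0, len(X), batch):
            idx = order[start:start + batch]
            # gradient of the mean squared error on this batch only
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(200)
w = sgd_linear(X, y)
print(w)
```

With batch=1 this is pure online SGD; with batch=len(X) it degenerates to full-batch gradient descent, the two extremes the slide contrasts.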
132. 132. Back Propagation • Train by minimizing a loss E(f(x0,w), y0), e.g. the negative log-likelihood E = -log p(y0 | x0, w), propagating error gradients from the output layer backward through the network via the chain rule.
  133. 133. Loss function • Euclidean loss is used for regressing to real-valued lables [-inf,inf]; • Sigmoid cross-entropy loss is used for predicting K independent probability values in [0,1]; • Softmax (normalized exponential) loss is predicting a single class of K mutually exclusive classes; • Generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to a K-dimensional vector of real values σ(z) in the range (0, 1). • The predicted probability for the j'th class given a sample vector x is • Sigmoidal or Softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset.
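The softmax "squashing" above is usually implemented with a max-subtraction for numerical stability, since exponentiating large scores overflows:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting the max changes nothing
    mathematically but keeps the exponentials from overflowing."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([1000.0, 1001.0, 1002.0]))  # naive exp() would overflow
print(p, p.sum())
```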
134. 134. Variable Learning Rate • Too large a learning rate • causes oscillation in searching for the minimal point; • Too small a learning rate • gives too slow convergence to the minimal point; • Adaptive learning rate • At the beginning, the learning rate can be large when the current point is far from the optimal point; • Gradually, the learning rate will decay as time goes by. • Should not be too large or too small: • annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇); • 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost constant.
135. 135. Variable Momentum
136. 136. AdaGrad/AdaDelta
  137. 137. Data Augmentation for Overfitting • The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label- preserving transformations; • Perturbing an image I by transformations that leave the underlying class unchanged (e.g. cropping and flipping) in order to generate additional examples of the class; • Two distinct forms of data augmentation: • image translation • horizontal reflections • changing RGB intensities
  138. 138. Dropout and Maxout for Overfitting • Dropout: set the output of each hidden neuron to zero with probability 0.5. • Motivation: Combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging. • The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation. • So every time an input is presented, the NN samples a different architecture, but all these architectures share weights. • This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units. • It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units. • Without dropout, the network exhibits substantial overfitting. • Dropout roughly doubles the number of iterations required to converge. • Maxout takes the maximum across multiple feature maps.
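Dropout fits in a few lines; the sketch below uses the "inverted" variant (rescale at train time by 1/(1-p)) rather than the original test-time halving the slide implies, so no change is needed at test time:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training
    and rescale survivors so the expected activation is unchanged."""
    if not train:
        return h                      # identity at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p   # sample a different sub-network each call
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(100000)
d = dropout(h, p=0.5, rng=rng)
# about half the units are zeroed, yet the mean activation stays near 1
```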
  139. 139. Weight Decay for Overfitting • Weight decay or L2 regularization adds a penalty term to the error function, a term called the regularization term: the negative log prior in the Bayesian justification; • Weight decay works by rescaling the weights in the learning rule, while bias learning stays the same; • Prefers small weights, with large weights allowed only if they improve the original cost function; • A way of compromising between finding small weights and minimizing the original cost function; • In a linear model, weight decay is equivalent to ridge (Tikhonov) regression; • L1 regularization: the weights not really useful shrink by a constant amount toward zero; • Acts like a form of feature selection; • Makes the input filters cleaner and easier to interpret; • L2 regularization penalizes large values strongly, while L1 regularization drives many weights exactly to zero (sparsity); • Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium distr. is the posterior distribution for weights & hyper-parameters; • Hybrid Monte Carlo: gradient and sampling.
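The "rescaling" view of weight decay is easiest to see in the update rule itself; a one-function sketch with illustrative hyperparameter values:

```python
import numpy as np

def sgd_step_l2(w, grad, lr=0.1, wd=1e-3):
    """Gradient step on E(w) + (wd/2)*||w||^2: the L2 penalty rescales
    the weights by (1 - lr*wd) before the data gradient is applied."""
    return (1.0 - lr * wd) * w - lr * grad

# With a zero data gradient, the weights shrink geometrically
w = np.ones(3)
for _ in range(10):
    w = sgd_step_l2(w, np.zeros(3))
```

The bias terms would simply be excluded from the penalty, matching the slide's note that bias learning is unchanged.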
  140. 140. Early Stopping for Overfitting • Steps in early stopping: • Divide the available data into training and validation sets. • Use a large number of hidden units. • Use very small random initial values. • Use a slow learning rate. • Compute the validation error rate periodically during training. • Stop training when the validation error rate "starts to go up". • Early stopping has several advantages: • It is fast. • It can be applied successfully to networks in which the number of weights far exceeds the sample size. • It requires only one major decision by the user: what proportion of validation cases to use. • Practical issues in early stopping: • How many cases do you assign to the training and validation sets? • Do you split the data into training and validation sets randomly or by some systematic algorithm? • How do you tell when the validation error rate "starts to go up"?
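The stopping rule above ("stop when validation error starts to go up") is usually implemented with a patience counter; the callables and patience value here are illustrative:

```python
def train_with_early_stopping(step, val_error, max_epochs=1000, patience=10):
    """Run `step()` once per epoch; stop when `val_error()` has not
    improved for `patience` consecutive epochs. Returns the best error."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        step()
        err = val_error()              # periodic validation check
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                      # validation error "started to go up"
    return best

# Simulated validation curve: improves, then worsens
errs = iter([5, 4, 3, 4, 5, 6, 7, 8, 9, 10])
best = train_with_early_stopping(lambda: None, lambda: next(errs), patience=3)
```

A patience window answers the practical question on the slide of how to tell when the error has genuinely started rising rather than merely fluctuating.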
  141. 141. MCMC Sampling for Optimization • Markov Chain: a stochastic process in which future states are independent of past states given the present state. • A Markov chain will typically converge to a stable distribution. • Markov Chain Monte Carlo: sampling using ‘local’ information • Devise a Markov chain whose stationary distribution is the target. • An ergodic MC must be aperiodic, irreducible, and positive recurrent. • Monte Carlo Integration to get quantities of interest. • Metropolis-Hastings method: sampling from a target distribution • Create a Markov chain whose transition matrix does not depend on the normalization term. • Make sure the chain has a stationary distribution and it is equal to the target distribution (accept ratio). • After a sufficient number of iterations, the chain will converge to the stationary distribution. • Gibbs sampling is a special case of M-H Sampling. • The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distributions. • Hybrid Monte Carlo: gradient sub step for each Markov chain.
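A random-walk Metropolis sampler makes the "no normalization term needed" point concrete: the acceptance ratio only uses differences of the unnormalized log-density. The target, step scale, and chain length below are illustrative:

```python
import numpy as np

def metropolis(log_target, x0, steps, scale=1.0, seed=0):
    """Random-walk Metropolis: accept/reject uses only the unnormalized
    log-density, so the normalization constant cancels."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(steps):
        prop = x + scale * rng.normal()        # symmetric proposal
        # Hastings correction cancels for symmetric proposals
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        samples.append(x)
    return np.array(samples)

# Unnormalized standard normal: log p(x) = -x^2/2 + const
s = metropolis(lambda x: -0.5 * x * x, x0=0.0, steps=20000)
# after burn-in, sample mean ~ 0 and sample std ~ 1
```

Discarding the early burn-in samples reflects the slide's point that the chain needs a sufficient number of iterations to reach its stationary distribution.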
  142. 142. Mean Field for Optimization • Variational approximation modifies the optimization problem to be tractable, at the price of an approximate solution; • Mean Field replaces M with a (simple) subset M(F), on which A* (μ) is in closed form (Note: F is a disconnected graph); • The density becomes a factorized product distribution in this sub-family. • Objective: K-L divergence. • Mean field is a structured variational approximation approach: • Coordinate ascent (deterministic); • Compared with stochastic approximation (sampling): • Faster, but maybe not exact.
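The coordinate-ascent character of mean field is easiest to see on a small Ising model (a stand-in example, not from the slides): under a fully factorized q, each mean m_i is updated in turn from the current means of its neighbors:

```python
import numpy as np

def mean_field_ising(J, h, beta=1.0, iters=50):
    """Coordinate-ascent mean field for p(s) ∝ exp(beta*(s'Js/2 + h's)),
    s_i in {-1,+1}: iterate the fixed-point update for each mean m_i."""
    n = len(h)
    m = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # each coordinate update decreases the KL(q || p) objective
            m[i] = np.tanh(beta * (J[i] @ m + h[i]))
    return m

# Two spins with a positive coupling and a positive field
J = np.array([[0.0, 1.0], [1.0, 0.0]])
m = mean_field_ising(J, h=np.array([0.5, 0.5]))
# both means are pulled strongly toward +1
```

The updates are deterministic and fast, but the factorized q cannot capture correlations between the spins, illustrating the "faster, but maybe not exact" trade-off.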
  143. 143. Contrastive Divergence for RBMs • Contrastive divergence (CD) was first proposed for training PoE, and is also a quicker way to learn RBMs; • Contrastive divergence as the new objective; • Taking gradients and ignoring a term which is usually very small. • Steps: • Start with a training vector on the visible units. • Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. • Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling); • CD learning is biased: it does not work as gradient descent • Improved: Persistent CD explores more modes in the distribution • Rather than from data samples, begin sampling from the negative samples obtained at the last gradient update. • Still suffers from divergence of likelihood due to missing modes. • Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
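The CD-1 steps above can be sketched for a small binary RBM; biases are omitted for brevity and the energy, sizes, and learning rate are illustrative:

```python
import numpy as np

def cd1_update(W, v0, lr=0.1, rng=None):
    """One CD-1 step for a binary RBM with energy E(v, h) = -v'Wh.
    Positive phase at the data, negative phase one Gibbs step away."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    ph0 = sigmoid(v0 @ W)                        # p(h=1 | v0): all hidden in parallel
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample hidden units
    pv1 = sigmoid(h0 @ W.T)                      # p(v=1 | h0): all visible in parallel
    v1 = (rng.random(pv1.shape) < pv1) * 1.0     # one-step "reconstruction"
    ph1 = sigmoid(v1 @ W)
    # data statistics minus one-step model statistics
    return W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(6, 3))               # 6 visible, 3 hidden units
v = (rng.random(6) < 0.5) * 1.0                  # a toy training vector
W = cd1_update(W, v, rng=rng)
```

Persistent CD would keep `v1` as the starting point of the next update's negative phase instead of restarting from the data.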
  144. 144. “Wake-Sleep” Algorithm for DBN • A pre-trained DBN is a generative model; • Do a stochastic bottom-up pass (wake phase) • Get samples from the factorial distribution (visible first, then generate hidden); • Adjust the top-down weights to be good at reconstructing the feature activities in the layer below. • Do a few iterations of sampling in the top level RBM • Adjust the weights in the top-level RBM. • Do a stochastic top-down pass (sleep phase) • Get visible and hidden samples generated by the generative model, using data coming from nowhere! • Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. • Any guarantee for improvement? No! • The “Wake-Sleep” algorithm tries to make the representation economical (minimum description length, in the sense of Shannon’s coding theory).
  145. 145. Greedy Layer-Wise Training • Deep networks tend to have more local minima problems than shallow networks during supervised training • Train the first layer using unlabeled data • Supervised or semi-supervised: use more unlabeled data. • Freeze the first layer parameters and train the second layer • Repeat this for as many layers as desired • Build more robust features • Use the outputs of the final layer to train the last supervised layer (leave early weights frozen) • Fine tune the full network with a supervised approach; • Avoids the problems of training a deep net in a purely supervised fashion: • Each layer gets full learning • Helps with ineffective early layer learning • Helps with deep network local minima
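The train-freeze-stack loop above can be sketched with a linear autoencoder (PCA via SVD) standing in for each unsupervised layer, instead of an RBM or denoising autoencoder; all names and sizes are illustrative:

```python
import numpy as np

def train_layer(X, k):
    """Train one unsupervised layer: a linear autoencoder, i.e. PCA,
    as a simple stand-in for an RBM or denoising autoencoder."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]                                 # k x d encoder weights

def greedy_pretrain(X, layer_sizes):
    """Train each layer on the frozen outputs of the layer below it."""
    encoders, H = [], X
    for k in layer_sizes:
        W = train_layer(H, k)                     # train this layer only
        encoders.append(W)                        # then freeze it
        H = np.tanh((H - H.mean(axis=0)) @ W.T)   # feed activations upward
    return encoders, H

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
encoders, H = greedy_pretrain(X, [10, 5])
# H would feed the final supervised layer; fine-tune the whole stack after
```

Each call to `train_layer` sees only the representation produced by the frozen layers below it, which is the defining property of the greedy scheme.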
  146. 146. Why Greedy Layer-Wise Training Works • Takes advantage of the unlabeled data; • Regularization Hypothesis • Pre-training is “constraining” parameters in a region relevant to the unsupervised dataset; • Better generalization (representations that better describe unlabeled data are more discriminative for labeled data); • Optimization Hypothesis • Unsupervised training initializes lower level parameters near localities of better minima than random initialization can. • Only fine tuning is needed in the supervised learning stage.
  147. 147. Two-Stage Pre-training in DBMs • Pre-training in one stage • Positive phase: clamp observed, sample hidden, using variational approximation (mean-field) • Negative phase: sample both observed and hidden, using persistent sampling (stochastic approximation: MCMC) • Pre-training in two stages • Approximate a posterior distribution over the states of hidden units (a simpler directed deep model such as a DBN or stacked DAE); • Train an RBM by updating parameters to maximize the lower bound of the log-likelihood and the corresponding posterior of hidden units. • Options (CAST, contrastive divergence, stochastic approximation…).