Sparse coding for image/video denoising and superresolution
1.
Sparse Coding & Dictionary Learning for
Image Denoising & Super-Resolution
Yu Huang
Sunnyvale, California
yu.huang07@gmail.com
2.
Outline
• The sparse-land model
• What is sparse coding?
• Methods of solving sparse coding
• Orthogonal Matching Pursuit (OMP)
• Strategy of dictionary selection (dictionary learning)
• What is the K-SVD algorithm?
• Image denoising
– Apply Sparse Coding for Denoising
– Learned Simultaneous Sparse Coding
– Locally Learned Dictionaries
– Clustering-based Sparse Representation
• Image super-resolution
– Sparse coding for SR
– Joint dictionary learning for SR
– Self similarities & group sparsity for SR
– Adaptive sparse domain selection in SR
– Semi-coupled dictionary learning-based SR
• References
3.
Appendix
• K-nearest neighbor;
• PCA, AP and spectral clustering;
• NMF and pLSA;
• ISOMAP;
• Locally Linear Embedding;
• Laplacian eigenmap;
• Gaussian mixture and EM;
• Graphical model;
• Generative model: MRF;
• Discriminative model: CRF;
• Graph cut;
• Belief propagation.
4.
The Sparseland Model
• Defined as a set {D, X, Y} such that
DX = Y
(D: dictionary, X: sparse codes, Y: signals)
5.
What is Sparse Coding?
• Given a D and yi, how to find xi ?
• Constraint : xi is sufficiently sparse;
• Finding exact solution is difficult;
• Approximate a solution good enough?
6.
Methods of Solving Sparse Coding
• Greedy methods: projecting the residual on some atom;
– Matching pursuit, orthogonal matching pursuit;
• L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO);
– The residual is updated iteratively in the direction of the atom;
• Gradient-based finding new search directions
– Projected Gradient Descent
– Coordinate Descent
• Homotopy: a set of solutions indexed by a parameter (regularization)
– LARS (Least Angle Regression)
• First order/proximal methods: Generalized gradient descent
– solving efficiently the proximal operator
– soft-thresholding for L1-norm
– Accelerated by the Nesterov optimal first-order method
• Iterative reweighting schemes
– L2-norm: Chartrand and Yin (2008)
– L1-norm: Candès et al. (2008)
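The proximal route with soft-thresholding can be sketched as follows. This is a minimal, unaccelerated ISTA loop for the objective ½||y − Dx||² + λ||x||₁; the function names `soft_threshold` and `ista` and the fixed iteration count are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - D x||^2 + lam * ||x||_1 by proximal gradient steps."""
    # Lipschitz constant of the gradient of the smooth term: ||D||_2^2
    L = np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)            # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

With an orthonormal dictionary the loop reaches the closed-form solution `soft_threshold(D.T @ y, lam)` after one step; the Nesterov-accelerated variant mentioned above adds a momentum term and is not shown.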
7.
Orthogonal Matching Pursuit (OMP)
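• Input: D, y; output: x;
• Select dk with max projection on residue;
• Solve xk = arg min ||y - Dk xk|| over the selected atoms Dk;
• Update residue: r = y - Dk xk;
• Check terminating condition; if not met, go back to the selection step.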
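The loop above can be sketched in a few lines of NumPy. This is a minimal version assuming unit-norm dictionary columns; the function name `omp` and the stopping parameters `n_nonzero` and `tol` are illustrative choices.

```python
import numpy as np

def omp(D, y, n_nonzero, tol=1e-10):
    """Orthogonal Matching Pursuit: greedy sparse approximation of y over D."""
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support, coef = [], np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:      # terminating condition
            break
        # Select the atom with the largest projection on the residue
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Least-squares fit on all selected atoms (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef      # update residue
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coef
    return x
```

Re-solving the least-squares problem over the whole support at each step is what distinguishes OMP from plain matching pursuit: the residue stays orthogonal to every atom selected so far.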
8.
Features of OMP
• A greedy algorithm, better than MP;
– Able to find approximate solution;
• Full backward orthogonality of error;
• Close solution if T is really small;
• Simplistic in nature.
9.
Strategy of Dictionary Selection
• What D to use?
• A fixed overcomplete set of basis: no adaptivity.
• Steerable wavelet;
• Bandlet, curvelet, contourlet;
• DCT Basis;
• Gabor function;
• ….
• Data adaptive dictionary – learn from data;
• K-SVD: a generalized K-means clustering process for Vector
Quantization (VQ).
– An iterative algorithm to effectively optimize the sparse approximation of
signals in a learned dictionary.
• Other methods of dictionary learning:
– non-negative matrix decompositions.
– sparse PCA (sparse dictionaries).
– fused-lasso regularizations (piecewise constant dictionaries)
• Extending the models: Sparsity + Self-similarity=Group Sparsity
10.
What is the K-SVD Algorithm?
• Initialize dictionary:
– Select atoms from input;
– Atoms can be image patches;
– Patches are overlapping.
• Sparse coding (OMP):
– Use OMP or any pursuit method;
– Output sparse code for all signals;
– Minimize representation error.
• Update dictionary: one atom at a time;
• Iterate the sparse coding and dictionary update steps.
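A compact sketch of the two alternating stages is given below. It assumes the signals are the columns of `Y`, initializes the atoms from randomly chosen signals, and updates each atom with a rank-1 SVD of the residual restricted to the signals that use it. The names `ksvd` and `omp` and all parameter defaults are illustrative assumptions.

```python
import numpy as np

def omp(D, y, n_nonzero, tol=1e-10):
    """Greedy sparse coding of one signal (see the OMP slide)."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coef
    return x

def ksvd(Y, n_atoms, n_nonzero, n_iter=10, seed=0):
    """Alternate sparse coding (OMP) and atom-by-atom dictionary updates."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    # Initialize the dictionary with randomly selected, normalized signals
    idx = rng.choice(Y.shape[1], n_atoms, replace=False)
    D = Y[:, idx].copy()
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # Sparse coding stage: one code per signal
        X = np.column_stack([omp(D, Y[:, i], n_nonzero) for i in range(Y.shape[1])])
        # Dictionary update stage: one atom at a time
        for k in range(n_atoms):
            users = np.nonzero(X[k])[0]          # signals that use atom k
            if users.size == 0:
                continue
            # Representation error with atom k's contribution removed
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]                    # best rank-1 fit: new atom
            X[k, users] = s[0] * Vt[0]           # and its coefficients
    return D, X
```

Restricting the SVD to the signals that already use the atom keeps the sparsity pattern of `X` unchanged during the dictionary update.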
11.
Image Denoising
• Various assumptions of content internal structures;
• Learning-based
– Field of experts (MRF), NN, CRF,…;
– Sparse coding: K-SVD, LSSC,….
• Self-similarity
– Gaussian, Median;
– Bilateral filter, anisotropic diffusion;
– Non-local means.
• Sparsity prior
– Wavelet shrinkage;
• Use of both Redundancy and Sparsity
– BM3D (block matching 3-d filter): benchmark;
12.
Apply Sparse Coding for Denoising
• A cost function for the noise model Y = Z + n;
• Solve for the minimizer of a cost with three terms:
– Global proximity;
– Sparsity of the representations;
– Proximity of selected patch (the last two form the prior term);
• Break problem into smaller problems
• Aim at minimization at the patch level.
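The three labeled terms can be written out as below. This is the form used by Elad and Aharon (2006), restated with this slide's notation Y = Z + n; the symbols R_ij (operator extracting the patch at position (i,j)), alpha_ij (its sparse code), and the weights lambda and mu_ij come from that paper and are not shown on the slide.

```latex
\{\hat{\alpha}_{ij}\}, \hat{Z}
  = \arg\min_{\alpha_{ij},\, Z}\;
    \underbrace{\lambda \,\|Z - Y\|_2^2}_{\text{global proximity}}
  + \underbrace{\sum_{ij} \mu_{ij}\, \|\alpha_{ij}\|_0}_{\text{sparsity of the representations}}
  + \underbrace{\sum_{ij} \|D\alpha_{ij} - R_{ij} Z\|_2^2}_{\text{proximity of selected patch}}
```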
13.
Image Data
• Extract overlapping patches from a single image;
– clean or corrupted, even reference (multiple frames)?
– for example, 100k of size 8x8 block patches;
• Applied the K-SVD, training a dictionary;
– Size of 64x256 (n=64, dictionary size k).
– Lagrange multiplier lambda = 30/sigma, where sigma is the noise standard deviation;
• The coefficients from OMP;
– the maximal iteration is 180 and noise gain C=1.15;
– the number of nonzero elements L=6 (sigma=5).
• Denoising by normalized weighted averaging of the noisy image and the overlapping denoised patches.
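The averaging step can be sketched as follows, assuming the denoised patches are given in raster order (one per pixel position, stride 1) and that each pixel is divided by lambda plus the number of patches covering it. The function name `average_patches` and its argument layout are hypothetical.

```python
import numpy as np

def average_patches(noisy, patches, patch_size, lam):
    """Blend the noisy image (weight lam) with overlapping denoised patches."""
    H, W = noisy.shape
    p = patch_size
    acc = lam * noisy.astype(float)        # lam * Y
    weight = lam * np.ones((H, W))         # lam + number of covering patches
    idx = 0
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            acc[i:i + p, j:j + p] += patches[idx].reshape(p, p)
            weight[i:i + p, j:j + p] += 1.0
            idx += 1
    return acc / weight                    # normalized weighted average
```

Each `patches[idx]` would be the product of the dictionary and the sparse code of the corresponding noisy patch; a larger `lam` keeps the result closer to the noisy input.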
14.
Extended to Color Images
• Color correction in OMP:
– put more importance of the proximity of the
mean value of the color patches.
– Coefficient gamma = 5.25;
• channels R, G and B are concatenated in the
sparseland model.
15.
Block Matching 3-D for Denoising
• For each patch, find similar patches;
• Group the similar patches into a 3-d stack;
• Perform a 3-D transform (2-d + 1-d) and
coefficient thresholding (sparsity);
• Apply inverse 3-D transform (1-d + 2-d);
• Also combine multiple patches in a
collaborative way (aggregation);
• Two stages: hard -> wiener (soft).
18.
Locally Learned Dictionaries (K-LLD)
• Identify dictionary which best captures
underlying geometric structure;
• Similar structures will have similar dictionary,
similar weights;
• Cluster image based on geometric similarity (K-
Means on the SKR weights);
• Learn dictionary and order of regression for each
cluster;
• Performance is between K-SVD and BM3D.
20.
Learned Simultaneous Sparse Coding
• Idea: combine dictionary learning and grouping;
– Non-local Means: self-similarity;
– Dictionary learning: sparse coding.
• Different from BM3D, which uses:
– Classical fixed orthogonal dictionaries;
• Problem in Sparse Coding: unstable sparse decompositions may
cause reconstruction artifacts;
• LSSC model: A joint sparsity pattern imposed through a grouped-
sparsity regularizer
• Performance is slightly better than BM3D and K-SVD.
21.
Clustering-based Sparse Representation (CSR)
• Idea: Combination of local and global sparsity;
– Dictionary learning (K-SVD);
– Structural clustering (BM3D);
• CSR Model:
– PCA/k-means Sparse coding (alpha) + k-NN clustering (beta);
• Equivalence of sparse coding and Bayesian network:
– Clustering in CSR looks like a 2nd stage sparse coding.
• Performance: better than K-SVD, close to BM3D.
• Question: globally get but locally fit?
23.
What is Example Based SR?
• Estimate missing HR detail that isn’t present in the original LR
image, and which we can’t make visible by simple sharpening;
• Image database with HR/LR image pairs;
• Algorithm uses a training set to learn the fine details of LR;
• It then uses learned relationships (MRF) to predict fine details.
25.
SR from a Single Image
• Multi-frame-based SR (alignment);
• Example-based SR.
26.
SR from a Single Image
• Combination of Example-based and Multi-frame-based.
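– Similar patches are searched at the same scale and at different scales;
– Cross-scale lookup: $P_0(p) \xrightarrow{\text{FindNN}} P_{-l}(\tilde{p}) \xrightarrow{\text{Parent}} Q_0(s^l \cdot \tilde{p}) \xrightarrow{\text{Copy}} Q_l(s^l \cdot p)$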
28.
Sparse Coding for SR [Yang et al.08]
• HR patches have a sparse representation w.r.t. an over-complete
dictionary of patches randomly sampled from similar images.
• Sample 3 x 3 LR overlapping patches y on a regular grid.
• The output HR patch is $x \approx D_h \alpha$ for some $\alpha$ with few nonzero entries, where $D_h$ is the HR dictionary;
• The input LR patch satisfies $y \approx L x \approx L D_h \alpha = D_l \alpha$: linear measurements of the sparse coefficient vector $\alpha$;
– $D_l$: dictionary of low-resolution patches; $L$: downsampling/blurring operator;
• If we can recover the sparse solution $\alpha$ to the underdetermined
system of linear equations $y = D_l \alpha$, we can reconstruct $x$ as $x = D_h \alpha$;
• The L0 sparsity problem is solved through its convex (L1) relaxation;
– T, T’: select overlap between patches; F: 1st and 2nd derivatives from LR bicubic interpolation.
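The patch-level recipe can be sketched as below: code the LR patch over the LR dictionary, then reuse the same coefficients with the HR dictionary. For brevity the sketch swaps the L1 minimization for greedy OMP, and it ignores the feature operator F and the overlap constraints T, T'. The names `omp` and `super_resolve_patch` are assumptions for illustration.

```python
import numpy as np

def omp(D, y, n_nonzero, tol=1e-10):
    """Greedy stand-in for the L1 solver used in the original method."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coef
    return x

def super_resolve_patch(y, D_l, D_h, n_nonzero):
    """Sparse-code LR patch y over D_l, then synthesize the HR patch with D_h."""
    alpha = omp(D_l, np.asarray(y, dtype=float), n_nonzero)
    return D_h @ alpha, alpha
```

The two dictionaries must share atom ordering, so that column k of `D_l` is the low-resolution counterpart of column k of `D_h`.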
29.
Sparse Coding for SR [Yang et al.08]
Two training sets:
Flower images – smooth area, sharp edge
Animal images -- HF textures
Randomly sample 100,000 HR-LR patch pairs
from each set of training images.
30.
[Figure: visual comparison of sparse coding, MRF / BP [Freeman IJCV ‘00], bicubic, and the original]
31.
Joint Dictionary Learning for SR
• Local sparse prior for detail recovery;
• Global constraints for artifact avoiding (L=SH);
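• Joint dictionary learning;
• In the patch-wise objective, one operator extracts the overlap region and the previous reconstruction on the overlap serves as its target;
• A weight controls the tradeoff between matching the LR
input and finding a neighbor-compatible HR patch;
• The global constraint is solved by back-projection: a gradient descent method.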
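A minimal sketch of back-projection is shown below. It assumes, for illustration only, that the blur H is the identity and that S is block averaging by an integer factor `s`; the names `downsample` and `back_project` are hypothetical.

```python
import numpy as np

def downsample(x, s):
    """Assumed degradation: average over non-overlapping s x s blocks."""
    H, W = x.shape
    return x.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

def back_project(x0, y, s, n_iter=10, step=1.0):
    """Gradient descent on ||downsample(x) - y||^2 starting from x0."""
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        err = y - downsample(x, s)                 # LR reconstruction error
        # Spread the error back over each s x s block; step=1.0 corresponds
        # to a gradient step of size s**2 for the block-averaging operator.
        x += step * np.kron(err, np.ones((s, s)))
    return x
```

Starting from the sparse-coding estimate `x0`, the iteration returns an image that stays close to `x0` while reproducing the LR input when downsampled.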
32.
[Figure: visual comparison of the input LR image, bicubic, MRF / BP [Freeman IJCV ‘00], and sparse coding]
33.
Self Similarities & Group Sparsity
• Generate HR-LR patch pairs from image pyramid;
• Like LSSC, grouping patch pairs by k-means clustering (Note:
not only within the scale, but also cross scales);
[Figure: pipeline with ANN search, bicubic interpolation, and filling the uncovered area with the back projection method]
34.
Self Similarities & Group Sparsity
• Features extracted for clustering (1st and 2nd gradients);
• For each cluster, run sparse coding; then reconstruct HR patches.
35.
Adaptive Sparse Domain Selection
• Stability of sparse decomposition by domain selection, i.e. sub-
dictionary learning (via PCA) after clustering features (k-means);
• Adaptive selection of sub-dictionary (wavelet iterative shrinkage);
• Local structure encoding with the piecewise AR models;
• Non-local similarities constraints for regularization;
• Reweighted sparsity for regularization as well;
• 727,615 patches of size 7×7 randomly sampled from training images;
• 200 clusters initially, then merge;
• Computational cost: image 256x256, 100 iterations, 2~5 minutes.
• Terms of the objective: the fidelity term, the local AR model, the non-local similarity regularization, and the sparsity penalty.
38.
Semi-Coupled Dictionary Learning
• Dictionary pair Dh, Dl and a mapping function W will
be simultaneously learned;
– Not fully coupled learning;
– clustering in sparse and exploiting nonlocal similarities;
• Training: dictionary and mapping update;
• Synthesis: reconstruct
39.
[Figure: visual comparison of LR, ground truth, bicubic, sparse coding, and semi-coupled results]
40.
Fast Direct SR by Simple Functions
• Split the feature space into subspaces and collect exemplars to
learn priors for each one, to create effective mapping functions;
– Use simple features: LR-HR;
– Cluster LR training features by K-means;
– Learn simple regression functions;
– Exploit a large image dataset for training;
[Figure: a set of 4096 cluster centers learned from 2.2 million natural patches; (a) 512 clusters, (b) 4096 clusters, (c) difference map]
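The cluster-then-regress idea can be sketched as follows, assuming LR features and HR targets are stored as rows and each cluster gets its own affine least-squares map. The function names (`kmeans`, `assign`, `fit_direct_sr`, `predict_direct_sr`) and the plain Lloyd iterations are illustrative assumptions, far smaller in scale than the 4096 clusters reported above.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest cluster center for each row of X."""
    return ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; empty clusters keep their previous center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(0)
    return centers

def fit_direct_sr(lr, hr, k):
    """Cluster LR features, then learn one affine LR->HR map per cluster."""
    centers = kmeans(lr, k)
    labels = assign(lr, centers)
    maps = []
    for c in range(k):
        m = labels == c
        if not m.any():
            maps.append(np.zeros((lr.shape[1] + 1, hr.shape[1])))
            continue
        A = np.hstack([lr[m], np.ones((m.sum(), 1))])   # append bias column
        W, *_ = np.linalg.lstsq(A, hr[m], rcond=None)
        maps.append(W)
    return centers, maps

def predict_direct_sr(lr, centers, maps):
    """Route each LR feature to its cluster and apply that cluster's map."""
    labels = assign(lr, centers)
    out = np.zeros((len(lr), maps[0].shape[1]))
    for c, W in enumerate(maps):
        m = labels == c
        out[m] = np.hstack([lr[m], np.ones((m.sum(), 1))]) @ W
    return out
```

At test time only a nearest-center lookup and one matrix product per patch are needed, which is where the speed of this family of methods comes from.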
41.
SR Using Sub-band Self Similarity
• Self-similarity approach has the advantage of not requiring a
separate external training database, but some drawbacks as
– Internal dictionaries often inadequate for finding good matches for
patches containing complex structures such as textures;
– Sensitivity to patch size and the dimensions of the LR image.
• Use similarity principle independently on each of a sub-band
image set, obtained with orientation selective band-pass filters;
• Allow the directional frequency components of a patch to find
matches independently, possibly in different image locations;
• Decompose the local image structure into component patches
defined by different sub-bands;
– Sub-band image patches are simpler;
– The size of the dictionary defined by patches from the sub-band images is
exponential in the number of sub-bands used;
– With greater degree of invariance to parameters.
42.
SR Using Sub-band Self Similarity
• Allow the various sub-bands of a patch to find matches in different spatial locations;
• Combining sub-bands of different patches effectively allows for synthesizing new patches in the expanded dictionary.
43.
Single Image SR Using Deformable Patches
• A patch is not regarded as a fixed vector but a flexible deformation flow;
• Via deformable patches, the dictionary can cover more patterns that do
not appear, thus becoming more expressive;
– The energy function with slow, smooth and flexible prior for deformation
model;
– Deformation similarity based on the minimized energy function for basic
patch matching;
• Multiple deformed patches combined for the final reconstruction.
44.
Reference I: Denoising
• K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse
3-D transform-domain collaborative filtering. IEEE T-IP, 16(8):2080–2095,
2007.
• M. Elad and M. Aharon. Image denoising via sparse and redundant
representations over learned dictionaries. IEEE T-IP, 15(12):3736–3745,
2006.
• J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse
models for image restoration. ICCV’09, 2009 (LSSC).
• S. Roth and M. Black. Fields of experts. IJCV, 82(2):205–229, 2009.
• C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images.
ICCV, pages 839–846, 1998.
• H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: can plain
neural networks compete with bm3d? Proc. CVPR, 2012.
• A. Buades, B. Coll, and J. Morel. A non local algorithm for image denoising.
In CVPR, 2005.
• W Dong, X Li, L Zhang and G Shi, Sparsity-based Image Denoising via
Dictionary Learning and Structural Clustering, CVPR’11, 2011.
• P. Chatterjee and P. Milanfar, Clustering-based Denoising with Locally
Learned Dictionaries (K-LLD), IEEE T-IP, vol. 18, num. 7, July 2009.
45.
Reference II: SR
• W. Freeman, T. Jones, E. Pasztor, Example-based super-resolution, IEEE CGA, 2002;
• D. Glasner, S. Bagon, M. Irani, Super-resolution from a single image, CVPR, 2009;
• J. Yang, J. Wright, T. Huang, and Y. Ma, Image Super-Resolution as Sparse
Representation of Raw Image Patches. CVPR 2008;
• J. Yang, J. Wright, T. Huang, and Y. Ma, Image super-resolution via sparse
representation, IEEE T-IP, 19(11), pp2861–2873, 2010;
• C.-Y. Yang, J.-B. Huang, and M.-H. Yang, Exploiting self-similarities for single frame
super-resolution, ACCV, 2010;
• J. Sun, Xu, Shum. Image super-resolution using gradient profile prior. CVPR, 2008;
• Y. HaCohen, R. Fattal, and D. Lischinski, Image upsampling via texture
hallucination, IEEE ICCP, 2010.
• W Dong, D Zhang, G Shi, X Wu, Image Deblurring and Super-Resolution by Adaptive
Sparse Domain Selection and Adaptive Regularization, IEEE T-IP, 20(7), 2011;
• S. Wang, L Zhang, Y Liang, Q Pan, Semi-Coupled Dictionary Learning for Super-
Resolution, CVPR, Sept 2012.
• C Yang, M Yang, Fast Direct SR by Simple Functions, ICCV, 2013;
• A Singh, N Ahuja, SR using sub-band self similarity, ACCV, 2014;
• Y Zhu, Y Zhang, A Yuille, Single image SR using deformable patches, CVPR, 2014.
47.
K-Nearest Neighbors
• A non-parametric method for regression
and classification;
• Input: the k closest training examples in
the feature space.
• Output depends on application cases as
– Classification: a class membership by a
majority vote of its neighbors, with the
object being assigned to the class most
common among its k nearest neighbors;
– Regression: the property value for the
object; average of the values of
its k nearest neighbors.
• k-NN is an instance-based learning method, or lazy
learning, where the function is
approximated locally and computation is
deferred until classification;
• A shortcoming of k-NN: sensitive to the
local structure of the data.
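Both output modes can be sketched with an exhaustive search under the Euclidean distance. The function name `knn_predict` and its `mode` argument are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, mode="classify"):
    """Predict for query x from its k nearest training examples."""
    dist = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances
    nearest = np.argsort(dist)[:k]                # indices of the k closest
    if mode == "classify":
        # Majority vote among the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' values
    return float(np.mean(y_train[nearest]))
```

All computation happens at query time, which is the "lazy learning" property noted above.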
48.
K-Nearest Neighbors
• Pairwise Distance:
– Euclidean,
– Mahalanobis,
– city block,
– correlation,
– Minkowski,
– Chebyshev,
– Hamming,
– Jaccard,
– Spearman;
• kNN can be done by
exhaustive search or
approximate NN (kd-tree);
• Note: Condensed nearest
neighbor (CNN) is an
algorithm designed to
reduce the data set for
k-NN classification.
49.
PCA, AP & Spectral Clustering
• Principal Component Analysis (PCA) uses orthogonal transformation to
convert a set of observations of possibly correlated variables into a set of
linearly uncorrelated variables called principal components.
• This transformation is defined in such a way that the first principal component
has the largest possible variance and each succeeding component in turn has
the highest variance possible under the constraint that it be orthogonal to the
preceding components.
• PCA is sensitive to the relative scaling of the original variables.
• Also known as the Karhunen–Loève transform (KLT) or Hotelling transform, and
closely related to singular value decomposition (SVD), factor analysis, eigenvalue
decomposition (EVD), and spectral decomposition;
• Affinity Propagation (AP) is a clustering algorithm based on the concept of
"message passing" between data points. Unlike clustering algorithms such as
k-means or k-medoids, AP does not require the number of clusters to be
determined or estimated before running the algorithm;
• Spectral Clustering makes use of the spectrum (eigenvalues) of the data
similarity matrix to perform dimensionality reduction before clustering in
fewer dimensions.
– The similarity matrix consists of a quantitative assessment of the relative similarity
of each pair of points in the dataset.
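The PCA definition above can be sketched through the SVD of the centered data matrix. The function name `pca` and the convention that observations are rows are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    """PCA of row-wise observations via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # orthonormal principal axes
    explained_var = s[:n_components] ** 2 / (len(X) - 1)
    scores = Xc @ components.T                   # uncorrelated coordinates
    return scores, components, explained_var
```

Because the result depends on the scale of each variable, it is common to standardize the columns of `X` first when they are measured in different units.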
51.
NMF & pLSA
• Non-negative matrix factorization (NMF): a matrix V is factorized into
(usually) two matrices W and H, such that all three matrices have no
negative elements.
• The different types arise from using different cost functions for
measuring the divergence between V and W*H and possibly
by regularization of the W and/or H matrices;
– squared error, Kullback-Leibler divergence or total variation (TV);
• NMF is an instance of a more general probabilistic model called
"multinomial PCA", as is pLSA (probabilistic latent semantic analysis);
• pLSA is a statistical technique for two-mode (extended naturally to
higher modes) analysis, modeling the probability of each co-occurrence
as a mixture of conditionally independent multinomial distributions;
– Their parameters are learned using EM algorithm;
• pLSA is based on a mixture decomposition derived from a latent class
model, rather than on reducing the occurrence table by SVD as in LSA.
• Note: an extended model, LDA (Latent Dirichlet allocation) , adds
a Dirichlet prior on the per-document topic distribution.
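For the squared-error cost, NMF can be sketched with the multiplicative updates of Lee and Seung. The function name `nmf`, the random initialization, and the small constant `eps` guarding against division by zero are assumptions for illustration.

```python
import numpy as np

def nmf(V, r, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (m x n) as W (m x r) times H (r x n)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Other cost functions, such as the Kullback-Leibler divergence, lead to different update rules that are not shown here.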
52.
NMF & pLSA
Note: d is the document index variable, c is a word's topic drawn from the document's
topic distribution, P(c|d), and w is a word drawn from the word distribution of this
word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)
53.
ISOMAP
• General idea:
– Approximate the geodesic distances by shortest graph distance.
– MDS (multi-dimensional scaling) using geodesic distances
• Algorithm:
– Construct a neighborhood graph
– Construct a distance matrix
– Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new
distance matrix such that Dij is the length of the shortest path between i and j.
– Apply MDS to matrix to find coordinates
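The four algorithm steps can be sketched as below, assuming a symmetric k-nearest-neighbor graph that is connected, Floyd-Warshall for the shortest paths, and classical MDS for the final coordinates. The function name `isomap` and its return values are illustrative.

```python
import numpy as np

def isomap(X, n_neighbors, n_components):
    """ISOMAP: graph geodesics followed by classical MDS."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)   # Euclidean distances
    # Neighborhood graph: keep edges to the k nearest neighbors (symmetrized)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nn = np.argsort(d[i])[1:n_neighbors + 1]
        G[i, nn] = d[i, nn]
        G[nn, i] = d[i, nn]
    # Floyd-Warshall: shortest path length between every pair of nodes
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    # Classical MDS on the geodesic distance matrix
    J = np.eye(n) - 1.0 / n                              # centering matrix
    B = -0.5 * J @ (G ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    Y = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
    return Y, G
```

For points sampled along a curve, the geodesic distance between the endpoints approaches the arc length rather than the straight-line distance, so the one-dimensional embedding orders the points along the curve.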
54.
LLE (Locally Linear Embedding)
• General idea: represent each point on the local linear subspace of the manifold
as a linear combination of its neighbors to characterize the local neighborhood
relations; then use the same linear coefficient for embedding to preserve the
neighborhood relations in the low dimensional space;
• Compute the coefficients w for each data point by solving a constrained least-squares problem;
• Algorithm:
– 1. Find weight matrix W of linear coefficients
– 2. Find low dimensional embedding Y that minimizes the reconstruction error $\Phi(Y) = \sum_i \| Y_i - \sum_j W_{ij} Y_j \|^2$
– 3. Solution: Eigen-decomposition of M=(I-W)’(I-W)
Laplacian Eigenmaps
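• General idea: minimize the norm of the Laplace-Beltrami operator on the manifold
– It measures how far apart the map f sends nearby points.
– Avoid the trivial solution of f = const.
– The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph
with appropriate weights.
– Construct the Laplacian matrix L=D-W.
– $\int \|\nabla f\|^2$ can be approximated by its discrete equivalent $f^T L f = \frac{1}{2}\sum_{ij} W_{ij}(f_i - f_j)^2$.
• Algorithm:
– Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors).
– Construct an adjacency matrix with the following weights: $W_{ij} = \exp(-\|x_i - x_j\|^2 / t)$ (or simply 1) if i and j are neighbors, and 0 otherwise.
– Minimize $\sum_{ij} W_{ij}(f_i - f_j)^2$ subject to $f^T D f = 1$.
– The generalized eigen-decomposition of the graph Laplacian is $L f = \lambda D f$.
– Spectral embedding of the Laplacian manifold: $x_i \mapsto (f_1(i), \dots, f_m(i))$, using the eigenvectors with the smallest nonzero eigenvalues.
– The first eigenvector is trivial (the all-ones vector).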
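A minimal sketch of the algorithm, assuming a symmetric k-nearest-neighbor graph with 0/1 weights and solving the generalized eigenproblem L f = lambda D f through the symmetrically normalized Laplacian. The function name `laplacian_eigenmaps` and its return values are illustrative.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors, n_components):
    """Embed rows of X using generalized eigenvectors of the graph Laplacian."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    # Adjacency matrix of the symmetrized k-nearest-neighbor graph
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d[i])[1:n_neighbors + 1]
        W[i, nn] = 1.0
        W[nn, i] = 1.0
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                         # graph Laplacian L = D - W
    # L f = lam D f  <=>  (D^-1/2 L D^-1/2) v = lam v  with  f = D^-1/2 v
    s = 1.0 / np.sqrt(deg)
    w, V = np.linalg.eigh(s[:, None] * L * s[None, :])
    F = s[:, None] * V
    # Skip the first (trivial, constant) eigenvector
    return F[:, 1:n_components + 1], w[1:n_components + 1], W
```

Heat-kernel weights can be substituted for the 0/1 weights by filling `W` with the exponential of the negative scaled squared distances.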
56.
Gaussian Mixture Model & EM
• Mixture model is a probabilistic model for representing the presence
of subpopulations within an overall population;
• “Mixture models" are used to make statistical inferences about the properties
of the sub-populations given only observations on the pooled population;
• A Gaussian mixture model can be Bayesian or non-Bayesian;
• A variety of approaches focus on the maximum likelihood estimate (MLE),
such as expectation maximization (EM), or on maximum a posteriori (MAP) estimation;
• EM is used to determine the parameters of a mixture with an a priori given
number of components (a variant can adapt this number during the iterations);
– Expectation step: "partial membership" of each data point in each constituent
distribution is computed by calculating expectation values for the membership
variables of each data point;
– Maximization step: plug-in estimates, mixing coefficients and component model
parameters, are re-computed for the distribution parameters;
– Each successive EM iteration will not decrease the likelihood.
• Alternatives of EM for mixture models:
– mixture model parameters can be deduced using posterior sampling as indicated
by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC);
– Spectral methods based on SVD;
– Graphical model: MRF or CRF.
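The E-step and M-step can be sketched for a one-dimensional Gaussian mixture as follows. The function name `em_gmm_1d`, the quantile-based initialization, and the small variance floor are assumptions for illustration.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """EM for a 1-D Gaussian mixture with a fixed number of components k."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    log_lik = []
    for _ in range(n_iter):
        # E-step: "partial membership" of each point in each component
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        log_lik.append(float(np.log(total).sum()))
        resp = dens / total
        # M-step: re-estimate mixing coefficients and component parameters
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var, log_lik
```

The recorded log-likelihood values give a simple convergence check, since successive EM iterations should not decrease them beyond the effect of the variance floor.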
58.
Graphical Models
• Graphical Models: Powerful framework for representing
dependency structure between random variables.
• The joint probability distribution over a set of random variables.
• The graph contains a set of nodes (vertices) that represent random
variables, and a set of links (edges) that represent dependencies
between those random variables.
• The joint distribution over all random variables decomposes
into a product of factors, where each factor depends on a subset
of the variables.
• Two types of graphical models:
– Directed (Bayesian networks)
– Undirected (Markov random fields, Boltzmann machines)
• Hybrid graphical models that combine directed and undirected models,
such as Deep Belief Networks, Hierarchical-Deep Models.
59.
Generative Model: MRF
• Random Field: F={F1,F2,…FM} a family of random variables on set
S in which each Fi takes value fi in a label set L.
• Markov Random Field: F is said to be a MRF on S w.r.t. a
neighborhood N if and only if it satisfies Markov property.
– Generative model for joint probability p(x)
– allows no direct probabilistic interpretation
– define potential functions Ψ on maximal cliques A
• map joint assignment to non-negative real number
• requires normalization
• An MRF is an undirected graphical model
60.
Discriminative Model: CRF
• Conditional, not joint, probabilistic sequential models p(y|x)
• Allow arbitrary, non-independent features on the observation seq X
• Specify the probability of possible label seq given an observation seq
• Prob. of a transition between labels depends on past/future observations
• Relax strong independence assumptions, no p(x) required
• CRF is MRF plus “external” variables, where “internal” variables Y of
MRF are un-observables and “external” variables X are observables
• Linear chain CRF: transition score depends on current observation
– Inference by DP like HMM, learning by forward-backward as HMM
• Optimization for learning CRF: discriminative model
– Conjugate gradient, stochastic gradient,…
61.
Graph Cut
• A flow network G(V, E) is defined as a
connected directed graph where each edge
(u,v) in E has a non-negative capacity c(u,v) >= 0;
• The max-flow problem is to find the flow of
maximum value on a flow network G;
• A s-t cut or simply cut of a flow network G is
a partition of V into S and T = V-S, such that
s in S and t in T;
• A minimum cut of a flow network is a cut
whose capacity is the least over all the s-t
cuts of the network;
• Methods of max-flow or min-cut:
– Ford-Fulkerson method;
– "Push-Relabel" method.
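The Ford-Fulkerson idea can be sketched with breadth-first augmenting paths (the Edmonds-Karp variant). The capacity-matrix representation and the function name `max_flow` are assumptions for illustration; the returned set S is the source side of a minimum cut.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Max flow from s to t on a dense capacity matrix; also returns a min cut."""
    n = len(capacity)
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break                          # no augmenting path: flow is maximal
        # Bottleneck capacity along the path found
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: reduce forward residuals, increase reverse residuals
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    # Vertices still reachable from s form the source side S of a minimum cut
    S = {v for v in range(n) if parent[v] != -1}
    return flow, S
```

By the max-flow min-cut theorem, the total capacity of the edges leaving S equals the returned flow value.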
62.
Graph Cut
• Labeling is mostly solved as an energy minimization problem;
• Two common energy models:
– Potts Interaction Energy Model;
– Linear Interaction Energy Model.
• Graph G contains two kinds of vertices: p-vertices and i-vertices;
– all the edges in the neighborhood N, called n-links;
– edges between the p-vertices and the i-vertices called t-links.
• In the multiple labeling case, the multi-way cut should leave each p-vertex
connected to one i-vertex;
• The minimum cost multi-way cut will minimize the energy function where the
severed n-links would correspond to the boundaries of the labeled vertices;
• The approximation algorithms to find this multi-way cut:
– "alpha-expansion" algorithm;
– "alpha-beta swap" algorithm.
63.
Belief Propagation
• A simplified Bayes net: it propagates information throughout a graphical
model via a series of messages sent between neighboring nodes
iteratively; it is likely to converge to a consensus that determines the
marginal probabilities of all the variables;
• Messages estimate the cost (or energy) of a configuration of a
clique given all other cliques; then the messages are combined to
compute a belief (marginal or maximum probability);
• Two types of BP methods:
– max-product;
– sum-product.
• BP provides exact solution when there are no loops in graph!
• Equivalent to dynamic programming/Viterbi in these cases;
• Loopy Belief Propagation: still provides approximate (but often
good) solution;
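On a loop-free graph the sum-product variant is exact, which can be sketched for a chain with one shared pairwise potential. The function name `chain_marginals` and the array layout are assumptions for illustration.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product BP on a chain.

    unary:    (n, K) local evidence, unary[i, a] = phi_i(x_i = a)
    pairwise: (K, K) compatibility psi(x_i, x_{i+1}) shared by all edges
    """
    n, K = unary.shape
    fwd = np.ones((n, K))          # messages arriving from the left neighbor
    bwd = np.ones((n, K))          # messages arriving from the right neighbor
    for i in range(1, n):
        m = (fwd[i - 1] * unary[i - 1]) @ pairwise
        fwd[i] = m / m.sum()       # normalize for numerical stability
    for i in range(n - 2, -1, -1):
        m = pairwise @ (bwd[i + 1] * unary[i + 1])
        bwd[i] = m / m.sum()
    # Belief: local evidence times all incoming messages, normalized
    belief = unary * fwd * bwd
    return belief / belief.sum(axis=1, keepdims=True)
```

Replacing the sums inside the matrix products with maxima gives the max-product variant, which corresponds to Viterbi decoding on the chain.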
64.
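Belief Propagation
• Generalized BP for pairwise MRFs
– Hidden variables xi and xj are connected through a
compatibility function $\psi_{ij}(x_i, x_j)$;
– Hidden variables xi are connected to observable variables yi by
the local “evidence” function $\phi_i(x_i, y_i)$;
• The joint probability of {x} is given by $p(\{x\}) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i, y_i)$, where Z is a normalization constant;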
• To improve inference by taking into account higher-order
interactions among the variables;
– An intuitive way is to define messages that propagate between
groups of nodes rather than just single nodes;
– This is the intuition in Generalized Belief Propagation (GBP).