Sparse coding for image/video denoising and superresolution


Sparse coding, KSVD, OMP, image Denoising, image superresolution

Published in: Technology


  1. 1. Sparse Coding & Dictionary Learning for Image Denoising & Super-Resolution Yu Huang Sunnyvale, California
  2. 2. Outline • The sparse-land model • What is sparse coding? • Methods of solving sparse coding • Orthogonal Matching Pursuit (OMP) • Strategy of dictionary selection (dictionary learning) • What is the K-SVD algorithm? • Image denoising – Apply Sparse Coding for Denoising – Learned Simultaneous Sparse Coding – Locally Learned Dictionaries – Clustering-based Sparse Represent. • Image super-resolution – Sparse coding for SR – Joint dictionary learning for SR – Self similarities & group sparsity for SR – Adaptive sparse domain selection in SR – Semi-coupled dictionary learning-based SR • References
  3. 3. Appendix • K-nearest neighbor; • PCA, AP and spectral clustering; • NMF and pLSA; • ISOMAP; • Locally Linear Embedding; • Laplacian eigenmap; • Gaussian mixture and EM; • Graphical model; • Generative model: MRF; • Discriminative model: CRF; • Graph cut; • Belief propagation.
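  4. 4. The Sparseland Model • Defined as a set {D, X, Y} such that DX = Y, where D is the dictionary, each column yi of Y is a signal, and each column xi of X is its sparse code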
  5. 5. What is Sparse Coding? • Given a D and yi, how to find xi ? • Constraint : xi is sufficiently sparse; • Finding exact solution is difficult; • Approximate a solution good enough?
  6. 6. Methods of Solving Sparse Coding • Greedy methods: projecting the residual on some atom; – Matching pursuit, orthogonal matching pursuit; • L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO); – The residual is updated iteratively in the direction of the atom; • Gradient-based: finding new search directions – Projected Gradient Descent – Coordinate Descent • Homotopy: a set of solutions indexed by a parameter (regularization) – LARS (Least Angle Regression) • First order/proximal methods: Generalized gradient descent – solving efficiently the proximal operator – soft-thresholding for L1-norm – Accelerated by the Nesterov optimal first-order method • Iterative reweighting schemes – L2-norm: Chartrand and Yin (2008) – L1-norm: Candès et al. (2008)
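The proximal route listed above (soft-thresholding for the L1-norm) can be sketched as plain ISTA. This is a minimal sketch, not the deck's implementation: the function names, the fixed step size 1/||D||², and the iteration count are assumptions made for illustration, and the Nesterov acceleration is left out.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - D x||^2 + lam * ||x||_1 by proximal gradient steps."""
    # Assumed step size: 1 / Lipschitz constant of the smooth term's gradient.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)                       # gradient of the data term
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

With an orthonormal dictionary the minimizer is just the soft-thresholded projection of y, which makes a convenient sanity check.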
  7. 7. Orthogonal Matching Pursuit (OMP) • Input: D, y; output: x • Select the atom dk with max projection on the residue • Solve xk = arg min ||y-Dkxk|| over the atoms selected so far • Update residue r = y - Dkxk • Check terminating condition
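The OMP loop on this slide can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: atoms are the unit-norm columns of D, and the terminating condition is a sparsity budget `n_nonzero` plus a residual tolerance `tol`. Both names are hypothetical and do not come from the deck.

```python
import numpy as np

def omp(D, y, n_nonzero, tol=1e-10):
    """Greedy sparse coding of y over D (columns are assumed unit-norm atoms)."""
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []                       # indices of the selected atoms
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break                      # terminating condition: residual is negligible
        k = int(np.argmax(np.abs(D.T @ residual)))   # atom with max projection
        if k in support:
            break                      # no new atom can reduce the residual
        support.append(k)
        # Least squares over all selected atoms (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coef
    return x
```

Re-solving the least-squares problem over the whole support at each step is what keeps the residual orthogonal to every selected atom, the property that separates OMP from plain matching pursuit.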
  8. 8. Features of OMP • A greedy algorithm, better than MP; – Able to find approximate solution; • Full backward orthogonality of error; • Close solution if T is really small; • Simplistic in nature.
  9. 9. Strategy of Dictionary Selection • What D to use? • A fixed overcomplete set of basis: no adaptivity. • Steerable wavelet; • Bandlet, curvelet, contourlet; • DCT Basis; • Gabor function; • …. • Data adaptive dictionary – learn from data; • K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). – An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary. • Other methods of dictionary learning: – non-negative matrix decompositions. – sparse PCA (sparse dictionaries). – fused-lasso regularizations (piecewise constant dictionaries) • Extending the models: Sparsity + Self-similarity=Group Sparsity
  10. 10. What is the K-SVD Algorithm? • Initialize dictionary: – Select atoms from input; – Atoms can be image patches; – Patches are overlapping. • Sparse coding (OMP): – Use OMP or any pursuit method; – Output sparse code for all signals; – Minimize representation error. • Update dictionary: – One atom at a time.
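The "one atom at a time" dictionary update can be sketched as a rank-1 SVD of the residual restricted to the signals that use the atom. This is a sketch of the standard K-SVD update step, not code from the deck. The function name and the convention that columns of Y are signals and columns of X are their codes are assumptions.

```python
import numpy as np

def ksvd_update_atom(D, X, Y, k):
    """Update atom k of D and row k of X in place, keeping the support of X fixed."""
    users = np.nonzero(X[k, :])[0]        # signals whose code uses atom k
    if users.size == 0:
        return                            # unused atom: leave it unchanged
    X[k, users] = 0.0
    # Residual of those signals once atom k's contribution is removed.
    E = Y[:, users] - D @ X[:, users]
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                     # best rank-1 fit gives the new unit-norm atom
    X[k, users] = s[0] * Vt[0, :]         # and the matching coefficients
```

A full K-SVD iteration would alternate a pursuit step (for example OMP) over all signals with this update applied to each atom in turn.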
  11. 11. Image Denoising • Various assumptions of content internal structures; • Learning-based – Field of experts (MRF), NN, CRF,…; – Sparse coding: K-SVD, LSSC,…. • Self-similarity – Gaussian, Median; – Bilateral filter, anisotropic diffusion; – Non-local means. • Sparsity prior – Wavelet shrinkage; • Use of both Redundancy and Sparsity – BM3D (block matching 3-d filter): benchmark;
  12. 12. Apply Sparse Coding for Denoising • A cost function for the noise model Y = Z + n, including a prior term; • Terms of the cost: global proximity, sparsity of the representations, proximity of selected patch; • Break problem into smaller problems; • Aim at minimization at the patch level.
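The equation itself did not survive in the transcript. A plausible reconstruction, following the formulation of Elad and Aharon (2006) cited in the reference list, is given here with the slide's three labels attached. The symbols λ, μ_ij, α_ij and the patch-extraction operator R_ij are taken from that paper and are assumptions relative to the slide. Z is the unknown clean image and Y the noisy one.

```latex
\{\hat{\alpha}_{ij}, \hat{Z}\} = \arg\min_{\alpha_{ij},\, Z}\;
\underbrace{\lambda \lVert Z - Y \rVert_2^2}_{\text{global proximity}}
+ \underbrace{\sum_{ij} \mu_{ij} \lVert \alpha_{ij} \rVert_0}_{\text{sparsity of the representations}}
+ \underbrace{\sum_{ij} \lVert D\alpha_{ij} - R_{ij} Z \rVert_2^2}_{\text{proximity of selected patch}}
```

With Z fixed, the problem decouples into one sparse-coding problem per patch, which is the "break into smaller problems" step.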
  13. 13. Image Data • Extract overlapping patches from a single image; – clean or corrupted, even reference (multiple frames)? – for example, 100k of size 8x8 block patches; • Apply the K-SVD to train a dictionary; – Size of 64x256 (n=64, dictionary size k=256). – Lagrange multiplier lambda = 30/sigma of noise; • The coefficients from OMP; – the maximal iteration is 180 and noise gain C=1.15; – the number of nonzero elements L=6 (sigma=5). • Denoising by normalized weighted averaging.
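The final averaging step can be sketched as follows: each denoised patch is added back at its location, the noisy image enters with weight lambda, and every pixel is divided by its total weight. This is a minimal sketch. The function name and argument layout are hypothetical, and `lam` is assumed strictly positive so that uncovered pixels fall back to the noisy value.

```python
import numpy as np

def aggregate_patches(noisy, patches, positions, lam):
    """Normalized weighted averaging of overlapping denoised patches.

    noisy     : 2-D noisy image Y
    patches   : list of denoised 2-D blocks (each one D @ alpha reshaped to a block)
    positions : top-left (row, col) of every patch
    lam       : weight given to the noisy image (assumed > 0)
    """
    num = lam * noisy.astype(float)                   # lam * Y
    den = lam * np.ones_like(noisy, dtype=float)      # lam * I
    for patch, (r, c) in zip(patches, positions):
        p, q = patch.shape
        num[r:r + p, c:c + q] += patch                # accumulate patch estimates
        den[r:r + p, c:c + q] += 1.0                  # count how many patches cover a pixel
    return num / den
```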
  14. 14. Extended to Color Images • Color correction in OMP: – put more importance on the proximity of the mean value of the color patches. – Coefficient gamma = 5.25; • channels R, G and B are concatenated in the sparseland model.
  15. 15. Block Matching 3-D for Denoising • For each patch, find similar patches; • Group the similar patches into a 3-d stack; • Perform a 3-D transform (2-d + 1-d) and coefficient thresholding (sparsity); • Apply inverse 3-D transform (1-d + 2-d); • Also combine multiple patches in a collaborative way (aggregation); • Two stages: hard -> wiener (soft).
  16. 16. BM3D Outline
  17. 17. Comparison figure (panels): Noisy, K-SVD, BM3D, NL Means
  18. 18. Locally Learned Dictionaries (K-LLD) • Identify dictionary which best captures underlying geometric structure; • Similar structures will have similar dictionary, similar weights; • Cluster image based on geometric similarity (K-means on the SKR weights); • Learn dictionary and order of regression for each cluster; • Performance is between K-SVD and BM3D.
  19. 19. K-LLD Outline (diagram labels): Noisy Image; Kernel Regression; Calculate weights; Clustering; Learn dictionaries; Iterate; Denoised Image
  20. 20. Learned Simultaneous Sparse Coding • Idea: combine dictionary learning and grouping; – Non-local Means: self-similarity; – Dictionary learning: sparse coding. • Different from BM3D: – BM3D uses classical fixed orthogonal dictionaries; • Problem in Sparse Coding: unstable sparse decompositions may cause reconstruction artifacts; • LSSC model: a joint sparsity pattern imposed through a grouped-sparsity regularizer; • Performance is a little better than BM3D and K-SVD.
  21. 21. Clustering-based Sparse Represent. • Idea: Combination of local and global sparsity; – Dictionary learning (K-SVD); – Structural clustering (BM3D); • CSR Model: – PCA/k-means Sparse coding (alpha) + k-NN clustering (beta); • Equivalence of sparse coding and Bayesian network: – Clustering in CSR looks like a 2nd stage sparse coding. • Performance: better than K-SVD, close to BM3D. • Question: globally get but locally fit?
  22. 22. Image Super-Resolution (SR) • SR: how to find missing details/HF comp? • Interpolation-based: – Edge-directed; – B-spline; – Sub-pixel alignment; • Reconstruction-based; – Gradient prior; – TV (Total Variation); – MRF (Markov Random Field). • Learning-based (hallucination). – Example-based: texture synthesis, LR-HR mapping; – Self learning: sparse coding, self similarity-based;
  23. 23. What is Example Based SR? • Estimate missing HR detail that isn’t present in the original LR image, and which we can’t make visible by simple sharpening; • Image database with HR/LR image pairs; • Algorithm uses a training set to learn the fine details of LR; • It then uses learned relationships (MRF) to predict fine details.
  24. 24. One pass Algorithm
  25. 25. SR from a Single Image • Multi-frame-based SR (alignment); • Example-based SR.
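  26. 26. SR from a Single Image • Combination of Example-based (different scales) and Multi-frame-based (same scale). • Patch transfer across scales: $P_0(p) \xrightarrow{\text{FindNN}} P_{-l}(\tilde{p}) \xrightarrow{\text{Parent}} Q_0(s^l \cdot \tilde{p}) \xrightarrow{\text{Copy}} Q_l(s^l \cdot p)$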
  27. 27. Comparison figure (panels): Example-based, Edge Statistics, Single Frame
  28. 28. Sparse Coding for SR [Yang et al.08] • HR patches have a sparse representation w.r.t. an over-complete dictionary of patches randomly sampled from similar images. • Sample 3 x 3 LR overlapping patches y on a regular grid. • The output HR patch is $x = D_h \alpha$ for some sparse coefficient vector $\alpha$, where $D_h$ is the HR dictionary. • The input LR patch satisfies $y = L x = L D_h \alpha = D_l \alpha$: linear measurements of the sparse coefficient vector $\alpha$, where $D_l$ is the dictionary of low-resolution patches and $L$ is the downsampling/blurring operator. • If we can recover the sparse solution $\alpha$ to the underdetermined system of linear equations $y = D_l \alpha$, we can reconstruct $x$ as $x = D_h \alpha$. • The sparse recovery is posed as a convex relaxation (L1-norm), with: – T, T’: select overlap between patches; – F: 1st and 2nd derivatives from LR bicubic interpolation.
  29. 29. Sparse Coding for SR [Yang et al.08] • Two training sets: – Flower images: smooth area, sharp edge; – Animal images: HF textures. • Randomly sample 100,000 HR-LR patch pairs from each set of training images.
  30. 30. Comparison figure (panels): Sparse coding, MRF / BP [Freeman IJCV ‘00], Bicubic, Original
  31. 31. Joint Dictionary Learning for SR • Local sparse prior for detail recovery; – one operator extracts the overlap region, paired with the previous reconstruction on the overlap; – a weight controls the tradeoff between matching the LR input and finding a neighbor-compatible HR patch. • Global constraints for artifact avoiding (L=SH); – Solved by back-projection: a gradient descent method. • Joint dictionary learning.
  32. 32. Comparison figure (panels): Bicubic, Sparse coding, MRF / BP [Freeman IJCV ‘00], Input LR
  33. 33. Self Similarities & Group Sparsity • Generate HR-LR patch pairs from image pyramid; • Like LSSC, grouping patch pairs by k-means clustering (Note: not only within the scale, but also across scales); • Diagram labels: ANN search; bicubic interpolation; fill the uncovered area with the back projection method.
  34. 34. Self Similarities & Group Sparsity • Features extracted for clustering (1st and 2nd gradients); • For each cluster, run sparse coding; then reconstruct HR patches.
  35. 35. Adaptive Sparse Domain Selection • Stability of sparse decomposition by domain selection, i.e. sub-dictionary learning (via PCA) after clustering features (k-means); • Adaptive selection of sub-dictionary (wavelet iterative shrinkage); • Local structure encoding with the piecewise AR models; • Non-local similarities constraints for regularization; • Reweighted sparsity for regularization as well; • 727,615 patches of size 7×7 randomly from training images; • 200 clusters initially, then merge; • Computational cost: image 256x256, 100 iterations, 2 to 5 minutes. • Terms of the objective: the fidelity term, local AR model, non-local similarity regularization, sparsity penalty.
  36. 36. Adaptive Sparse Domain Selection
  37. 37. Comparison figure (panels): LR, Sparse coding, Sparse Domain Selection
  38. 38. Semi-Coupled Dictionary Learning • Dictionary pair Dh, Dl and a mapping function W will be simultaneously learned; – Not fully coupled learning; – clustering in sparse and exploiting nonlocal similarities; • Training: dictionary and mapping update; • Synthesis: reconstruct
  39. 39. Comparison figure (panels): LR, ground truth, Bicubic, Sparse coding, Semi-coupled
  40. 40. Fast Direct SR by Simple Functions • Split the feature space into subspaces and collect exemplars to learn priors for each one, to create effective mapping functions; – Use simple features: LR-HR; – Cluster LR training features by K-means; – Learn simple regression functions; – Exploit a large image dataset for training; A set of 4096 cluster centers learned from 2.2 million natural patches. (a) 512 clusters (b) 4096 clusters (c) Difference map
  41. 41. SR Using Sub-band Self Similarity • Self-similarity approach has the advantage of not requiring a separate external training database, but has some drawbacks: – Internal dictionaries are often inadequate for finding good matches for patches containing complex structures such as textures; – Sensitivity to patch size and the dimensions of the LR image. • Use similarity principle independently on each of a sub-band image set, obtained with orientation selective band-pass filters; • Allow the directional frequency components of a patch to find matches independently, possibly in different image locations; • Decompose the local image structure into component patches defined by different sub-bands; – Sub-band image patches are simpler; – The size of the dictionary defined by patches from the sub-band images is exponential in the number of sub-bands used; – With greater degree of invariance to parameters.
  42. 42. SR Using Sub-band Self Similarity • Allow the various sub-bands of a patch to find matches in different spatial locations; • Combining sub-bands of different patches effectively allows for synthesizing new patches in the expanded dictionary.
  43. 43. Single Image SR Using Deformable Patches • A patch is not regarded as a fixed vector but a flexible deformation flow; • Via deformable patches, the dictionary can cover more patterns that do not appear, thus becoming more expressive; – The energy function with slow, smooth and flexible prior for deformation model; – Deformation similarity based on the minimized energy function for basic patch matching; • Multiple deformed patches combined for the final reconstruction.
  44. 44. Reference I: Denoising • K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE T-IP, 16(8):2080–2095, 2007. • M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE T-IP, 15(12):3736–3745, 2006. • J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. ICCV’09, 2009 (LSSC). • S. Roth and M. Black. Fields of experts. IJCV, 82(2):205–229, 2009. • C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. ICCV, pages 839–846, 1998. • H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: can plain neural networks compete with BM3D? Proc. CVPR, 2012. • A. Buades, B. Coll, and J. Morel. A non local algorithm for image denoising. In CVPR, 2005. • W Dong, X Li, L Zhang and G Shi, Sparsity-based Image Denoising via Dictionary Learning and Structural Clustering, CVPR’11, 2011. • P. Chatterjee and P. Milanfar, Clustering-based Denoising with Locally Learned Dictionaries (K-LLD), IEEE T-IP, vol. 18, num. 7, July 2009.
  45. 45. Reference II: SR • W. Freeman, T. Jones, E. Pasztor, Example-based super-resolution, IEEE CGA, 2002; • D. Glasner, S. Bagon, M. Irani, Super-resolution from a single image, CVPR, 2009; • J. Yang, J. Wright, T. Huang, and Y. Ma, Image Super-Resolution as Sparse Representation of Raw Image Patches. CVPR 2008; • J. Yang, J. Wright, T. Huang, and Y. Ma, Image super-resolution via sparse representation, IEEE T-IP, 19(11), pp2861–2873, 2010; • C.-Y. Yang, J.-B. Huang, and M.-H. Yang, Exploiting self-similarities for single frame super-resolution, ACCV, 2010; • J. Sun, Xu, Shum. Image super-resolution using gradient profile prior. CVPR, 2008; • Y. HaCohen, R. Fattal, and D. Lischinski, Image upsampling via texture hallucination, IEEE ICCP, 2010. • W Dong, D Zhang, G Shi, X Wu, Image Deblurring and Super-Resolution by Adaptive Sparse Domain Selection and Adaptive Regularization, IEEE T-IP, 20(7), 2011; • S. Wang, L Zhang, Y Liang, Q Pan, Semi-Coupled Dictionary Learning for Super-Resolution, CVPR, Sept 2012. • C Yang, M Yang, Fast Direct SR by Simple Functions, ICCV, 2013; • A Singh, N Ahuja, SR using sub-band self similarity, ACCV, 2014; • Y Zhu, Y Zhang, A Yuille, Single image SR using deformable patches, CVPR, 2014.
  46. 46. Appendix
  47. 47. K-Nearest Neighbors • A non-parametric method for regression and classification; • Input: the k closest training examples in the feature space. • Output depends on the application: – Classification: a class membership by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors; – Regression: the property value for the object, computed as the average of the values of its k nearest neighbors. • k-NN is an instance-based learning method, or lazy learning, where the function is approximated locally and computation is deferred until classification; • A shortcoming of k-NN: sensitive to the local structure of the data.
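Both outputs described above, majority vote and neighbor average, fit in a few lines of standard-library Python. This is a minimal exhaustive-search sketch. The function names are hypothetical, distances are Euclidean, and k is assumed to be no larger than the training set.

```python
import math
from collections import Counter

def _nearest(train_points, query, k):
    """Indices of the k training points closest to query (exhaustive search)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    return order[:k]

def knn_classify(train_points, train_labels, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(train_labels[i] for i in _nearest(train_points, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k=3):
    """Average of the values of the k nearest neighbors."""
    idx = _nearest(train_points, query, k)
    return sum(train_values[i] for i in idx) / len(idx)
```

No model is fitted ahead of time. All the work happens at query time, which is the "lazy learning" property noted on the slide.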
  48. 48. K-Nearest Neighbors • Pairwise Distance: – Euclidean, – Mahalanobis, – city block, – correlation, – Minkowski, – Chebyshev, – Hamming, – Jaccard, – Spearman; • kNN can be done by exhaustive search or approximate NN (kd-tree); • Note: Condensed nearest neighbor (CNN) is an algorithm designed to reduce the data set for k-NN classification.
  49. 49. PCA, AP & Spectral Clustering • Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. • This transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. • PCA is sensitive to the relative scaling of the original variables. • Also known as, or closely related to, the Karhunen–Loève transform (KLT), Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc.; • Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm; • Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions. – The similarity matrix consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
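The PCA definition above maps directly onto an SVD of the centered data matrix. This is a minimal sketch, assuming rows of X are observations. The function name and return layout are illustrative choices.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD. Rows of X are observations, columns are variables.

    Returns (components, explained_variance, scores): components are orthonormal
    rows ordered by decreasing variance; scores are the projected observations.
    """
    Xc = X - X.mean(axis=0)                              # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    scores = Xc @ components.T
    return components, explained_variance, scores
```

Because PCA is sensitive to the relative scaling of the variables, standardizing the columns of X beforehand is a common (optional) preprocessing step.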
  50. 50. PCA, AP & Spectral Clustering
  51. 51. NMF & pLSA • Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. • The different types arise from using different cost functions for measuring the divergence between V and W*H and possibly by regularization of the W and/or H matrices; – squared error, Kullback-Leibler divergence or total variation (TV); • NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis); • pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions; – Their parameters are learned using the EM algorithm; • pLSA is based on a mixture decomposition derived from a latent class model, rather than on downsizing the occurrence tables by SVD as in LSA. • Note: an extended model, LDA (Latent Dirichlet allocation), adds a Dirichlet prior on the per-document topic distribution.
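For the squared-error cost, NMF is often computed with the Lee and Seung multiplicative updates, sketched below. This is one standard choice and not necessarily the variant the deck has in mind. The function name, random initialization, iteration count and the small `eps` guard are assumptions.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative V (m x n) as W (m x rank) @ H (rank x n), all non-negative."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative and do not
        # increase the squared error ||V - W H||^2.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```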
  52. 52. NMF & pLSA Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)
  53. 53. ISOMAP • General idea: – Approximate the geodesic distances by shortest graph distance. – MDS (multi-dimensional scaling) using geodesic distances • Algorithm: – Construct a neighborhood graph – Construct a distance matrix – Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new distance matrix such that Dij is the length of the shortest path between i and j. – Apply MDS to matrix to find coordinates
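The four algorithm steps can be sketched compactly in NumPy. This is a minimal sketch under stated assumptions: a symmetric k-nearest-neighbor graph, a connected graph (otherwise infinite distances would break the MDS step), no duplicate points, and classical MDS for the final embedding. The function name and signature are hypothetical.

```python
import numpy as np

def isomap(X, n_neighbors, n_components):
    """Embed the rows of X using graph-geodesic distances followed by classical MDS."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # 1-2. Neighborhood graph as a distance matrix (inf = no edge).
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:n_neighbors + 1]   # skip the point itself
        graph[i, nbrs] = dist[i, nbrs]
        graph[nbrs, i] = dist[i, nbrs]                  # keep the graph symmetric
    # 3. Floyd-Warshall: all-pairs shortest paths approximate geodesic distances.
    for k in range(n):
        graph = np.minimum(graph, graph[:, [k]] + graph[[k], :])
    # 4. Classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    B = -0.5 * J @ (graph ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]         # largest eigenvalues first
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

Floyd-Warshall costs O(n³), so for large point sets running Dijkstra from every node on the sparse graph is the usual substitute.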
  54. 54. LLE (Locally Linear Embedding) • General idea: represent each point on the local linear subspace of the manifold as a linear combination of its neighbors to characterize the local neighborhood relations; then use the same linear coefficients for embedding to preserve the neighborhood relations in the low dimensional space; • Compute the coefficients w for each data point by solving a constrained LS problem; • Algorithm: – 1. Find weight matrix W of linear coefficients – 2. Find low dimensional embedding Y that minimizes the reconstruction error $\sum_i \lVert Y_i - \sum_j W_{ij} Y_j \rVert^2$ – 3. Solution: Eigen-decomposition of $M = (I-W)^T (I-W)$
  55. 55. Laplacian Eigenmaps • General idea: minimize the norm of the Laplace-Beltrami operator on the manifold – it measures how far apart the map f sends nearby points. – Avoid the trivial solution of f = const. – The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph with appropriate weights. – Construct the Laplacian matrix L=D-W. – The continuous objective can be approximated by its discrete equivalent. • Algorithm: – Construct a neighborhood graph (e.g., epsilon-ball, k-nearest neighbors). – Construct an adjacency matrix W with edge weights. – Minimize the graph-Laplacian objective. – Solve the generalized eigen-decomposition of the graph Laplacian. – Spectral embedding of the manifold from the resulting eigenvectors. • The first eigenvector is trivial (the all-ones vector).
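The formulas for the algorithm steps are missing from the transcript. In the standard formulation of Laplacian eigenmaps (Belkin and Niyogi) they read as below. The heat-kernel width t and the embedding dimension m are symbols assumed here, not taken from the slide.

```latex
% Edge weights (heat kernel) on the neighborhood graph, degree matrix, Laplacian
W_{ij} = \begin{cases} \exp\!\left(-\lVert x_i - x_j \rVert^2 / t\right) & \text{if } i, j \text{ are neighbors} \\ 0 & \text{otherwise} \end{cases}
\qquad D_{ii} = \sum_j W_{ij}, \qquad L = D - W

% Discrete objective and its constraint (excludes the trivial constant solution)
\min_f \; \tfrac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2 = f^{T} L f
\quad \text{s.t.} \quad f^{T} D f = 1, \;\; f^{T} D \mathbf{1} = 0

% Generalized eigenproblem and the m-dimensional embedding
L f = \lambda D f, \qquad 0 = \lambda_0 \le \lambda_1 \le \dots \le \lambda_m,
\qquad x_i \mapsto \big(f_1(i), \dots, f_m(i)\big)
```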
  56. 56. Gaussian Mixture Model & EM • Mixture model is a probabilistic model for representing the presence of subpopulations within an overall population; • "Mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population; • A Gaussian mixture model can be Bayesian or non-Bayesian; • A variety of approaches focus on maximum likelihood estimate (MLE), such as expectation maximization (EM), or on maximum a posteriori (MAP); • EM is used to determine the parameters of a mixture with an a priori given number of components (a variant can adapt the number during the iterations); – Expectation step: "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point; – Maximization step: plug-in estimates, mixing coefficients and component model parameters, are re-computed for the distribution parameters; – Each successive EM iteration will not decrease the likelihood. • Alternatives to EM for mixture models: – mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC); – Spectral methods based on SVD; – Graphical model: MRF or CRF.
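The E-step and M-step described above can be sketched for a one-dimensional Gaussian mixture. This is a minimal sketch. The function name, the quantile-based initialization of the means, and the small variance floor are assumptions chosen to keep the example deterministic and stable.

```python
import numpy as np

def gmm_em_1d(x, n_components, n_iter=100):
    """EM for a 1-D Gaussian mixture with a fixed number of components."""
    x = np.asarray(x, dtype=float)
    # Assumed initialization: means at evenly spaced quantiles of the data.
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: partial membership (responsibility) of each point in each component.
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2.0 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing coefficients, means and variances.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-9
    return weights, means, variances
```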
  57. 57. Gaussian Mixture Model & EM
  58. 58. Graphical Models • Graphical Models: Powerful framework for representing dependency structure between random variables. • They describe the joint probability distribution over a set of random variables. • The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables. • The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables. • Two types of graphical models: – Directed (Bayesian networks); – Undirected (Markov random fields, Boltzmann machines). • Hybrid graphical models that combine directed and undirected models, such as Deep Belief Networks, Hierarchical-Deep Models.
  59. 59. Generative Model: MRF • Random Field: F={F1,F2,…FM} a family of random variables on set S in which each Fi takes value fi in a label set L. • Markov Random Field: F is said to be an MRF on S w.r.t. a neighborhood N if and only if it satisfies the Markov property. – Generative model for joint probability p(x) – potentials allow no direct probabilistic interpretation – define potential functions Ψ on maximal cliques A • map joint assignment to non-negative real number • requires normalization • An MRF is an undirected graphical model
  60. 60. Discriminative Model: CRF • Conditional, not joint, probabilistic sequential models p(y|x) • Allow arbitrary, non-independent features on the observation seq X • Specify the probability of possible label seq given an observation seq • Prob. of a transition between labels depends on past/future observ. • Relax strong independence assumptions, no p(x) required • CRF is MRF plus “external” variables, where “internal” variables Y of MRF are un-observables and “external” variables X are observables • Linear chain CRF: transition score depends on current observation – Inference by DP like HMM, learning by forward-backward as HMM • Optimization for learning CRF: discriminative model – Conjugate gradient, stochastic gradient,…
  61. 61. Graph Cut • A flow network G(V, E) is defined as a connected directed graph, with a source s and a sink t, where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0; • The max-flow problem is to find the flow of maximum value on a flow network G; • An s-t cut or simply cut of a flow network G is a partition of V into S and T = V-S, such that s in S and t in T; • A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network; • Methods of max flow or min-cut: – Ford-Fulkerson method; – "Push-Relabel" method.
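The Ford-Fulkerson method can be sketched with breadth-first augmenting paths (the Edmonds-Karp variant). By the max-flow min-cut theorem the returned value also equals the capacity of a minimum s-t cut. This is a minimal sketch. The nested-dict graph format and the function name are assumptions, and source and sink are assumed to be distinct.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow. capacity is {u: {v: c(u, v)}} with non-negative c."""
    # Residual graph: forward capacities plus zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            fwd = residual.setdefault(u, {})
            fwd[v] = fwd.get(v, 0) + c
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow                      # no augmenting path left: flow is maximal
        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
```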
  62. 62. Graph Cut • Labeling is mostly solved as an energy minimization problem; • Two common energy models: – Potts Interaction Energy Model; – Linear Interaction Energy Model. • Graph G contains two kinds of vertices: p-vertices and i-vertices; – all the edges in the neighborhood N, called n-links; – edges between the p-vertices and the i-vertices called t-links. • In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex; • The minimum cost multi-way cut will minimize the energy function where the severed n-links would correspond to the boundaries of the labeled vertices; • The approximation algorithms to find this multi-way cut: – "alpha-expansion" algorithm; – "alpha-beta swap" algorithm.
  63. 63. Belief Propagation • A simplified Bayes Net: it propagates info. throughout a graphical model via a series of messages sent between neighboring nodes iteratively; likely to converge to a consensus that determines the marginal probabilities of all the variables; • Messages estimate the cost (or energy) of a configuration of a clique given all other cliques; then the messages are combined to compute a belief (marginal or maximum probability); • Two types of BP methods: – max-product; – sum-product. • BP provides exact solution when there are no loops in graph! • Equivalent to dynamic programming/Viterbi in these cases; • Loopy Belief Propagation: still provides approximate (but often good) solution;
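On a loop-free graph the sum-product variant gives exact marginals. The sketch below runs it on the simplest such graph, a chain. It is a minimal sketch: the function name, the single pairwise table shared by every edge, and the per-message normalization (for numerical stability only) are assumptions.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain MRF.

    unary    : (n, s) array, local evidence phi_i(x_i) for each of n nodes
    pairwise : (s, s) array, compatibility psi(x_i, x_{i+1}) shared by all edges
    Returns the (n, s) array of exact marginals.
    """
    n, s = unary.shape
    fwd = np.ones((n, s))    # message arriving at node i from its left neighbor
    bwd = np.ones((n, s))    # message arriving at node i from its right neighbor
    for i in range(1, n):
        m = (fwd[i - 1] * unary[i - 1]) @ pairwise     # sum over x_{i-1}
        fwd[i] = m / m.sum()
    for i in range(n - 2, -1, -1):
        m = pairwise @ (bwd[i + 1] * unary[i + 1])     # sum over x_{i+1}
        bwd[i] = m / m.sum()
    belief = fwd * unary * bwd                         # combine incoming messages
    return belief / belief.sum(axis=1, keepdims=True)
```

Replacing the sums with maxima gives the max-product variant, which on a chain coincides with the Viterbi algorithm.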
  64. 64. Belief Propagation • Generalized BP for pairwise MRFs – Hidden variables xi and xj are connected through a compatibility function; – Hidden variables xi are connected to observable variables yi by the local “evidence” function; • The joint probability of {x} is given by the normalized product of all compatibility and evidence functions; • To improve inference by taking into account higher-order interactions among the variables; – An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes; – This is the intuition in Generalized Belief Propagation (GBP).
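The joint probability formula is missing from the transcript. In the usual pairwise-MRF notation it can be written as below, followed by the standard BP message and belief updates. The symbols ψ_ij, φ_i, m_ij, b_i and the neighbor set N(i) are assumed names for the compatibility function, the evidence function, the messages and the beliefs.

```latex
% Joint probability of the hidden variables given the observations
p(\{x\} \mid \{y\}) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i, y_i)

% Standard (sum-product) message update from node i to node j
m_{ij}(x_j) \leftarrow \sum_{x_i} \phi_i(x_i, y_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)

% Belief at node i
b_i(x_i) \propto \phi_i(x_i, y_i) \prod_{k \in N(i)} m_{ki}(x_i)
```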
  65. 65. THANKS!