
Spectral convnets

Talk at SMU on spectral convnets by Xavier Bresson, 19 October 2017


  1. 1. Convolutional Neural Networks on Graphs Xavier Bresson Xavier Bresson 2 School of Information Systems Singapore Management University (SMU) Oct 19th 2017 School of Computer Science and Engineering Nanyang Technological University Joint work with M. Defferrard (EPFL), P. Vandergheynst (EPFL), M. Bronstein (USI), F. Monti (USI), R. Levie (TAU)
  2. 2. Outline Ø Deep Learning - A brief history - High-dimensional learning Ø Convolutional Neural Networks - Architecture - Non-Euclidean Data Ø Spectral Graph Theory - Graphs and operators - Spectral decomposition - Fourier analysis - Convolution - Coarsening Ø Spectral ConvNets - SplineNets - ChebNets* [NIPS’16] - GraphConvNets - CayleyNets* Ø Spectral ConvNets on Multiple Graphs - Recommender Systems* [NIPS’17] Ø Conclusion Xavier Bresson 3
  3. 3. Outline Ø Deep Learning - A brief history Xavier Bresson 4
  4. 4. A brief history of artificial neural networks Xavier Bresson 5 1958 Perceptron Rosenblatt 1982 Backprop Werbos 1959 Visual primary cortex Hubel-Wiesel 1987 Neocognitron Fukushima SVM/Kernel techniques Vapnik 1995 1997 1998 1999 2006 2012 2014 2015 Data scientist 1st Job in US Facebook Center OpenAI Center Deep Learning Breakthrough Hinton GAN, Goodfellow ResNet, He Deep RL, Mnih First NVIDIA GPU First Amazon Cloud Center RNN/LSTM Schmidhuber CNN LeCun AI Winter [1960’s-2012] AI Birth AI Renaissance Hardware GPU speed 2x/year Kernel techniques Graphical models Wavelets, numerical PDEs Handcrafted Features 4th industrial revolution? Digital intelligence revolution or new AI bubble? Big Data Volume 2x/1.5 year Google AI TensorFlow Facebook AI Torch 1962 Birth of Data Science Split from Statistics Tukey First NIPS 1989 First KDD 2010 Kaggle Platform
  5. 5. Breakthrough in image recognition Xavier Bresson 6 [Figure: handcrafted features (SIFT) vs. learned features (end-to-end systems); 4.1% human accuracy.]
  6. 6. Convolutional neural networks: LeNet-5[1] 2 convolutional + 2 fully connected layers tanh non-linearity 1M parameters Training set: MNIST 70K images Trained on CPU Xavier Bresson 7 [1] LeCun et al. 1998 1998
  7. 7. Convolutional neural networks: AlexNet[2] Xavier Bresson 8 5 convolutional + 3 fully connected layers ReLU non-linearity Dropout regularization 60M parameters Training set: ImageNet 1.5M images Trained on GPU [2] Krizhevsky, Sutskever, Hinton 2012 2012 More learning capacity More data More computational power
  8. 8. Outline Ø Deep Learning - High-dimensional learning Xavier Bresson 9
  9. 9. High-dimensional learning Xavier Bresson 10 NNs are powerful at solving high-dimensional learning problems. Prototypical learning problem: evaluate a function f(x) given n samples {x_i, f(x_i)}. How? Suppose f is regular and interpolate between close points. [Figure: 1-dimensional learning, interpolating f(x) from samples f(x_i).]
  10. 10. Curse of dimensionality Xavier Bresson 11 What about high dimensions? No interpolation is possible because we can never have enough close points. With N samples per dimension, i.e. ε = 1/N, covering the 2D plane requires ε^-2 points, and a d-dim cube [0,1]^d requires ε^-d points. dim(image) = 512 x 512 ≈ 10^6, so for N = 10 samples/dim this means 10^1,000,000 points! Curse of dimensionality: the number of samples needed to evaluate f is exponential in the dimension, and all samples are far from each other in high dimension[3]. Optimization min_f L(f) = E[ℓ(f(x_i), y_i)] also needs ε^-d points to find the solution. [3] Beyer 1998
  11. 11. Blessing of structure Xavier Bresson 12 In our universe: Data have structures, they concentrate in (non-)smooth low-dim spaces. These structures can be expressed as mathematical properties (statistics, geometry, groups, etc). Low-dim space High-dim space Data
  12. 12. Learning exponential functions Neural networks can learn parametric exponential functions with a linear number of parameters. Xavier Bresson 13 A very large number of parameters can be learned by backpropagation, on the order of billions. Huge capacity to learn complex data[4]. Example: n_in = 2, n_out = 3, |W| = 2·3 = 6 parameters, #configurations = 2^n_out = 2^3 = 8. The number of parameters grows linearly, but the function capacity grows exponentially. [4] Zhang, Bengio, Hardt, Recht, Vinyals, 2016
  13. 13. Universal function approximation[5] Xavier Bresson 14 Universal Approximation Theorem: Let ξ be a non-constant, bounded, and monotonically-increasing continuous activation function, y : [0,1]^p → R a continuous function, and ε > 0. Then there exist n and parameters a, b ∈ R^n, W ∈ R^(n×p) such that |Σ_{i=1}^n a_i ξ(w_i^T f + b_i) − y(f)| < ε for all f ∈ [0,1]^p. [5] Cybenko 1989, Hornik 1991
  14. 14. Shallow neural nets as universal approximators? Any continuous function can be approximated arbitrarily well by a neural network with a single hidden layer. Then: How many neurons? How long is the optimization? Does it generalize well or overfit? Xavier Bresson 15 Shallow NNs: data such as images require an exponential number of neurons in a single layer, optimize badly, and do not generalize. Deep NNs: multiple layers and data structure can alleviate these issues[6]. [6] Montufar, Pascanu, Cho, Bengio 2014
  15. 15. Fully connected networks Xavier Bresson 16 A deep neural network consists of multiple layers. f_l = l-th image feature (R,G,B channels), dim(f_l) = n × 1; g_l^(k) = l-th feature map, dim(g_l^(k)) = n_l^(k) × 1. Linear layer: g_l^(k) = ξ( Σ_{l'=1}^{q_{k-1}} W_{l,l'}^(k) ξ( Σ_{l'=1}^{q_{k-2}} W_{l,l'}^(k-1) ξ( ··· f_{l'} ))). Activation, e.g. ξ(x) = max{x, 0}, the rectified linear unit (ReLU).
  16. 16. Fully connected networks Xavier Bresson 17 FC networks capture non-linear factorizations of data. They overfit: Too many parameters to learn. NN architectures require additional data structures to successfully analyze image, sound, document, etc.
  17. 17. Outline Ø Convolutional Neural Networks - Architecture Xavier Bresson 18
  18. 18. Convolutional neural networks[1] Data (image, video, sound) are compositional, i.e. they are formed of patterns that are: Ø Local Ø Stationary Ø Multi-scale (/hierarchical) Xavier Bresson 19 ConvNets leverage the compositionality structure: They extract compositional features and feed them to classifier, recommender, etc (end-to-end). Computer Vision NLP Drug discovery Games [1] LeCun et al. 1998
  19. 19. Key properties Locality: Property inspired by visual cortex neurons[7]. Xavier Bresson 20 Local receptive fields activate in the presence of (learned) local features. Neocognitron[8] [7] Cybenko 1989, Hornik 1991, [8] Fukushima 1980
  20. 20. Key properties Stationarity Translation invariance Xavier Bresson 21 Local stationarity Similar patches shared across the data domain.
  21. 21. Key properties Multi-scale: Simple structures combine to compose slightly more abstract structures, and so on, in a hierarchical way. Xavier Bresson 22 Also inspired by brain visual primary cortex (V1 and V2 neurons). Features learned by a ConvNet become increasingly complex at deeper layers.[9] [9] Zeiler, Fergus 2013
  22. 22. Implementation Locality: compact support kernels → O(1) parameters per filter. Xavier Bresson 23 Stationarity: convolutional operators → O(n log n) in general (FFT) and O(n) for compact kernels. Multi-scale: downsampling + pooling → O(n).
  23. 23. Compositional features Xavier Bresson 24 f_l = l-th image feature (R,G,B channels), dim(f_l) = n × 1; g_l^(k) = l-th feature map, dim(g_l^(k)) = n_l^(k) × 1. Compositional features consist of multiple convolutional + pooling layers. Convolutional layer: g_l^(k) = ξ( Σ_{l'=1}^{q_{k-1}} W_{l,l'}^(k) ⋆ ξ( Σ_{l'=1}^{q_{k-2}} W_{l,l'}^(k-1) ⋆ ξ( ··· f_{l'} ))). Activation, e.g. ξ(x) = max{x, 0} (ReLU). Pooling: g_l^(k)(x) = ||g_l^(k-1)(x') : x' ∈ N(x)||_p, with p = 1, 2, or ∞.
  24. 24. ConvNets[1] Xavier Bresson 25 Filters localized in space (Locality) Convolutional filters (Stationarity) Multiple layers (Multi-scale) O(1) parameters per filter (independent of input image size n) O(n) complexity per layer (filtering done in the spatial domain) [1] LeCun et al. 1998
  25. 25. ConvNets in computer vision Xavier Bresson 26 Image/object recognition [Krizhevsky-Sutskever-Hinton’12] Image inpainting [Pathak-Efros-etal’16] Image captioning [Karpathy-Li’15] Self-driving cars Highly successful - ConvNets are the Swiss Army knife of Computer Vision!
  26. 26. Outline Ø Convolutional Neural Networks - Non-Euclidean Data Xavier Bresson 27
  27. 27. Data domain for ConvNets Xavier Bresson 28 Image, volume, video: 2D, 3D, 2D+1 Euclidean domains. Sentence, word, character, sound: 1D Euclidean domain. These domains have nice regular spatial structures: all CNN operations are mathematically well defined and fast (convolution, pooling). [Figure: 2D grids, 1D grid.]
  28. 28. Non-Euclidean data Xavier Bresson 29 Non-Euclidean data = graphs/networks. Also NLP, physics, social science, communication networks, etc.
  29. 29. Challenges of graph deep learning Extend neural network techniques to graph-structured data. Assumption: Non-Euclidean data are locally stationary and manifest hierarchical structures. How to define compositionality on graphs? (convolution and pooling on graphs) How to make them fast? (linear complexity) Xavier Bresson 30
  30. 30. Outline Ø Spectral Graph Theory - Graphs and operators Xavier Bresson 31
  31. 31. Graphs[10] Xavier Bresson 32 Graph G = (V, E) with vertices V = {1, ..., n} and edges E ⊆ V × V. Vertex weights b_i > 0 for i ∈ V; edge weights a_ij ≥ 0 for (i,j) ∈ E. Vertex fields f = (f_1, ..., f_n), represented as the Hilbert space L²(V) = {f : V → R^h} with inner product ⟨f, g⟩_{L²(V)} = Σ_{i∈V} b_i f_i g_i. [10] Chung 1994
  32. 32. Calculus on graphs Xavier Bresson 33 Gradient operator ∇ : L²(V) → L²(E), (∇f)_ij = √a_ij (f_i − f_j). Divergence operator div : L²(E) → L²(V), adjoint to the gradient: ⟨F, ∇f⟩_{L²(E)} = ⟨∇*F, f⟩_{L²(V)} = ⟨−div F, f⟩_{L²(V)}, with (div F)_i = (1/b_i) Σ_{j:(i,j)∈E} √a_ij (F_ij − F_ji).
  33. 33. Graph Laplacian Xavier Bresson 34 Laplacian operator Δ : L²(V) → L²(V), (Δf)_i = (1/b_i) Σ_{j:(i,j)∈E} a_ij (f_i − f_j): the difference between f and its local average (2nd derivative on graphs). Core operator in spectral graph theory. Represented as a positive semi-definite n × n matrix: unnormalized Laplacian Δ = D − A; normalized Laplacian Δ = I − D^(-1/2) A D^(-1/2); random walk Laplacian Δ = I − D^(-1) A; where A = (a_ij) and D = diag(Σ_{j≠i} a_ij).
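  [Aside, not from the slides: a minimal numpy sketch of the matrix forms above, building the unnormalized and normalized Laplacians from a weighted adjacency matrix; the function name and the dense-matrix assumption are illustrative.]

    import numpy as np

    def graph_laplacians(A):
        """Unnormalized and normalized Laplacians from a symmetric, non-negative
        weighted adjacency matrix A (n x n). Minimal dense sketch."""
        d = A.sum(axis=1)                    # degrees d_i = sum_j a_ij
        D = np.diag(d)
        L = D - A                            # unnormalized: Δ = D − A
        D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L_norm = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt  # Δ = I − D^(-1/2) A D^(-1/2)
        return L, L_norm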
  34. 34. Outline Ø Spectral Graph Theory - Spectral decomposition Xavier Bresson 35
  35. 35. Spectral decomposition Xavier Bresson 36 A Laplacian of a graph of n vertices admits n eigenvectors: Δφ_k = λ_k φ_k, k = 1, 2, ..., n. Eigenvectors are real and orthonormal (due to self-adjointness): ⟨φ_k, φ_k'⟩_{L²(V)} = δ_kk'. Eigenvalues are non-negative (due to positive semi-definiteness): 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n. Eigendecomposition of a graph Laplacian: Δ = Φ Λ Φ^T, where Φ = (φ_1, ..., φ_n) and Λ = diag(λ_1, ..., λ_n).
  36. 36. Interpretation Xavier Bresson 37 Find the smoothest orthonormal basis on a graph: min_{φ_k} E_Dir(φ_k) s.t. ||φ_k|| = 1, k = 2, 3, ..., n, φ_k ⊥ span{φ_1, ..., φ_{k-1}}, where E_Dir(f) = f^T Δ f is the Dirichlet energy = a measure of smoothness of a function. Equivalently, min_{Φ ∈ R^{n×n}} trace(Φ^T Δ Φ) (the Dirichlet norm ||Φ||_G) s.t. Φ^T Φ = I. Solution: the first n Laplacian eigenvectors Φ = (φ_1, ..., φ_n).
  37. 37. Laplacian eigenvectors Xavier Bresson 38 Euclidean domain vs. graph domain: Laplacian eigenvectors are related to the graph geometry (such as communities, hubs, etc), cf. spectral clustering[10]. [10] Von Luxburg 2007
  38. 38. Outline Ø Spectral Graph Theory - Fourier analysis Xavier Bresson 39
  39. 39. Fourier analysis on Euclidean spaces Xavier Bresson 40 A function f : [−π, π] → R can be written as a Fourier series: f(x) = Σ_{k≥0} ( (1/2π) ∫_{−π}^{π} f(x') e^{−ikx'} dx' ) e^{ikx}, where the bracketed term is f̂_k = ⟨f, e^{ikx}⟩_{L²([−π,π])}. Fourier basis = Laplace-Beltrami eigenfunctions: −Δ e^{ikx} = k² e^{ikx}; e^{ikx} = Fourier mode, k = frequency of the Fourier mode.
  40. 40. Fourier analysis on graphs[11] Xavier Bresson 41 [11] Hammond, Vandergheynst, Gribonval, 2011 A function f : V → R can be written as a Fourier series: f_i = Σ_{k=1}^{n} ⟨f, φ_k⟩_{L²(V)} φ_{k,i}, where f̂_k = ⟨f, φ_k⟩_{L²(V)} is the k-th graph Fourier coefficient. In matrix-vector notation, with the n × n Fourier matrix Φ = [φ_1, ..., φ_n]: Fourier transform f̂ = Φ^T f and inverse Fourier transform f = Φ f̂. Graph Fourier basis = Laplacian eigenvectors: Δφ_k = λ_k φ_k; φ_k = graph Fourier mode, λ_k = (square) frequency.
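  [Aside, not from the slides: the graph Fourier transform is just a change of basis by the Laplacian eigenvectors; a minimal numpy sketch, assuming a symmetric Laplacian L.]

    import numpy as np

    def graph_fourier_basis(L):
        """Eigendecomposition Δ = Φ Λ Φ^T of a symmetric graph Laplacian.
        eigh returns eigenvalues in ascending order and eigenvectors as columns."""
        lam, Phi = np.linalg.eigh(L)
        return lam, Phi

    def gft(f, Phi):
        return Phi.T @ f        # graph Fourier transform: f_hat = Φ^T f

    def igft(f_hat, Phi):
        return Phi @ f_hat      # inverse transform: f = Φ f_hat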
  41. 41. Outline Ø Spectral Graph Theory - Convolution Xavier Bresson 42
  42. 42. Convolution on Euclidean spaces Xavier Bresson 43 Given two functions f, g : [−π, π] → R, their convolution is the function (f ⋆ g)(x) = ∫_{−π}^{π} f(x') g(x − x') dx'. Shift-invariance: f(x − x₀) ⋆ g(x) = (f ⋆ g)(x − x₀). The convolution operator commutes with the Laplacian: Δ(f ⋆ g) = (Δf) ⋆ g. Convolution theorem: convolution can be computed in the Fourier domain as (f ⋆ g)^ = f̂ · ĝ. Efficient computation using the FFT: O(n log n).
  43. 43. Convolution on discrete Euclidean spaces Xavier Bresson 44 Convolution of two vectors f = (f_1, ..., f_n)^T and g = (g_1, ..., g_n)^T: (f ⋆ g)_i = Σ_m g_{(i−m) mod n} · f_m, i.e. f ⋆ g = C(g) f where C(g) is the circulant (Toeplitz) matrix whose columns are the cyclic shifts of g, diagonalised by the Fourier basis: f ⋆ g = Φ diag(ĝ_1, ..., ĝ_n) Φ^T f = Φ ((Φ^T g) ⊙ (Φ^T f)).
  44. 44. Convolution on graphs[11] Xavier Bresson 45 Spectral convolution of f, g ∈ L²(V) can be defined by analogy: (f ⋆ g)_i = Σ_{k≥1} ⟨f, φ_k⟩_{L²(V)} ⟨g, φ_k⟩_{L²(V)} φ_{k,i} (product in the Fourier domain, then inverse Fourier transform). In matrix-vector notation: f ⋆ g = Φ ((Φ^T g) ⊙ (Φ^T f)) = Φ diag(ĝ_1, ..., ĝ_n) Φ^T f = ĝ(Φ Λ Φ^T) f = ĝ(Δ) f. Not shift-invariant! (G = Φ diag(ĝ) Φ^T has no circulant structure.) Commutes with the Laplacian: Δ G f = G Δ f. Filter coefficients depend on the basis φ_1, ..., φ_n. Expensive computation (no FFT on graphs): O(n²). [11] Hammond, Vandergheynst, Gribonval, 2011
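  [Aside, not from the slides: plugging the Fourier sketch above into the definition gives the O(n²) spectral convolution; a minimal numpy sketch, not the talk's implementation.]

    import numpy as np

    def spectral_graph_conv(f, g, Phi):
        """f ⋆ g = Φ ((Φ^T g) ⊙ (Φ^T f)): pointwise product of the two graph
        Fourier transforms, then inverse transform. Dense, O(n^2) per signal."""
        return Phi @ ((Phi.T @ g) * (Phi.T @ f))

    def apply_spectral_filter(f, w_hat, Phi):
        """Apply a filter given directly by its frequency response
        w_hat(λ_1), ..., w_hat(λ_n): Φ diag(w_hat) Φ^T f."""
        return Phi @ (w_hat * (Phi.T @ f))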
  45. 45. No shift invariance on graphs Xavier Bresson 46 A signal f on a graph can be translated to vertex i by convolution with a delta: T_i f = f ⋆ δ_i. [Figure: translated signals T_s f, T_s' f, T_s'' f on a Euclidean domain vs. T_i f, T_i' f, T_i'' f on a graph domain[12].] [12] Shuman et al. 2016
  46. 46. Outline Ø Spectral Graph Theory - Coarsening Xavier Bresson 47
  47. 47. Graph coarsening for graph pooling Goals: Pool similar local features (max pooling). Series of pooling layers create invariance to global geometric deformations. Challenges: Design a multi-scale coarsening algorithm that preserves non-linear graph structures. How to make graph pooling fast? Xavier Bresson 48
  48. 48. Graph coarsening = graph partitioning Xavier Bresson 49 Graph partitioning decomposes G into smaller meaningful clusters. The problem is NP-hard → approximation algorithms.
  49. 49. Balanced cuts Xavier Bresson 50 Powerful combinatorial graph partitioning models: Normalized cut[13] min_{C_1,...,C_K} Σ_{k=1}^{K} Cut(C_k, C_k^c) / Vol(C_k) (partitioning by min edge cuts) and Normalized association max_{C_1,...,C_K} Σ_{k=1}^{K} Assoc(C_k) / Vol(C_k) (partitioning by max vertex matching). Both models are equivalent, but lead to different algorithms. Here Cut(A, B) := Σ_{i∈A, j∈B} a_ij, Assoc(A) := Σ_{i,j∈A} a_ij, Vol(A) := Σ_{i∈A} d_i, and d_i := Σ_{j∈V} a_ij. [13] Shi, Malik, 2000
  50. 50. Balanced cuts Xavier Bresson 51 Balanced cuts are NP-hard; the most popular approximation techniques focus on linear spectral relaxation (an eigenproblem with a global solution). Graph geometry is generally not linear. The Graclus[14] algorithm computes non-linear clusters that locally maximize the Normalized Association, and offers control of the coarsening ratio (≈ 2, like an image grid) using heavy-edge matching[15]. [14] Dhillon, Guan, Kulis 2007 [15] Karypis, Kumar 1995
  51. 51. Heavy-Edge Matching (HEM) Xavier Bresson 52 HEM proceeds by two successive steps, vertex matching and graph coarsening (which guarantees a local solution of the Normalized Association): (1) Vertex matching: {i, j} = argmax_j (a^l_ii + 2 a^l_ij + a^l_jj) / (d^l_i + d^l_j). (2) Graph coarsening: G_{l+1} with a^{l+1}_ij = Cut(C^l_i, C^l_j) and a^{l+1}_ii = Assoc(C^l_i). Matched vertices {i, j} are merged into a super-vertex at the next coarsening level. [Figure: sequence of coarsened graphs G_{l−1}, G_l, G_{l+1}, G_{l+2}.] A minimal greedy sketch of one matching level follows.
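  [Aside, not the Graclus implementation: a greedy one-level matching sketch; the low-degree-first visiting order and the dense adjacency are assumptions.]

    import numpy as np

    def heavy_edge_matching(A):
        """One level of greedy heavy-edge matching (sketch of the Graclus step).
        A is a dense weighted adjacency matrix (self-loop weights a_ii allowed).
        Returns a cluster index per vertex; unmatched vertices form singletons."""
        n = A.shape[0]
        d = A.sum(axis=1)                       # degrees
        order = np.argsort(d)                   # visit low-degree vertices first (heuristic)
        cluster = -np.ones(n, dtype=int)
        next_id = 0
        for i in order:
            if cluster[i] >= 0:
                continue
            best_j, best_score = -1, -np.inf
            for j in np.nonzero(A[i])[0]:       # candidate unmatched neighbours
                if j == i or cluster[j] >= 0:
                    continue
                score = (A[i, i] + 2 * A[i, j] + A[j, j]) / (d[i] + d[j])  # HEM criterion
                if score > best_score:
                    best_j, best_score = j, score
            cluster[i] = next_id
            if best_j >= 0:
                cluster[best_j] = next_id       # merge {i, j} into a super-vertex
            next_id += 1
        return cluster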
  52. 52. Unstructured pooling Sequence of coarsened graphs produced by HEM: Xavier Bresson 53 Requires storing a table of indices for the graph and all its coarsened versions → computationally inefficient.
  53. 53. Fast graph pooling[18] Structured pooling: Arrangement of the node indexing such that adjacent nodes are hierarchically merged at the next coarser level. Xavier Bresson 54 As efficient as 1D-Euclidean grid pooling. [18] Defferrard, Bresson, Vandergheynst 2016
  54. 54. Outline Ø Spectral ConvNets - SplineNets Xavier Bresson 55
  55. 55. Vanilla spectral graph ConvNets[16] Graph convolutional layer Xavier Bresson 56 [16] Bruna, Zaremba, Szlam, LeCun 2014; Henaff, Bruna, LeCun 2015 f_l = l-th data feature on graphs, dim(f_l) = n × 1; g_l = l-th feature map, dim(g_l) = n × 1. Conv. layer: g_l = ξ( Σ_{l'=1}^{p} W_{l,l'} ⋆ f_{l'} ). Activation, e.g. ξ(x) = max{x, 0} (ReLU).
  56. 56. Spectral graph convolution Xavier Bresson 57 The convolutional layer in the spatial domain, g_l = ξ( Σ_{l'=1}^{p} W_{l,l'} ⋆ f_{l'} ), where W_{l,l'} is the graph spatial filter, can also be expressed in the spectral domain (using g ⋆ f = Φ ĝ(Λ) Φ^T f): g_l = ξ( Σ_{l'=1}^{p} Φ Ŵ_{l,l'} Φ^T f_{l'} ), where Ŵ_{l,l'} is an n × n diagonal matrix of graph spectral filter coefficients. We will denote the spectral filter without the hat symbol, i.e. W_{l,l'}.
  57. 57. Vanilla spectral graph ConvNets[16] Xavier Bresson 58 Series of spectral convolutional layers with spectral coefficients to be learned at each layer: g_l^(k) = ξ( Σ_{l'=1}^{q^(k-1)} Φ W_{l,l'}^(k) Φ^T g_{l'}^(k-1) ). First Graph CNN architecture. No guarantee of spatial localization of filters. O(n) parameters per layer. O(n²) computation of forward and inverse Fourier transforms Φ, Φ^T (no FFT on graphs). Filters are basis-dependent → does not generalize across graphs. [16] Bruna, Zaremba, Szlam, LeCun 2014; Henaff, Bruna, LeCun 2015
  58. 58. Spatial localization and spectral smoothness Xavier Bresson 59 In the Euclidean setting (by Parseval's identity): ∫ |x|^{2k} |w(x)|² dx = ∫ |∂^k ŵ(λ) / ∂λ^k|² dλ. Localization in space = smoothness in the frequency domain. Smooth spectral filter function ŵ(λ) ⇔ spatial localization of w(x).
  59. 59. Smooth parametric spectral filter Xavier Bresson 60 [17] (Litman, Bronstein, 2014); Henaff, Bruna, LeCun 2015 Parametrize the smooth spectral filter function w(λ) with a linear combination of smooth kernel functions β_1(λ), ..., β_r(λ), e.g. splines[17]: w_α(λ) = Σ_{j=1}^{r} α_j β_j(λ), so that w_α(λ_i) = Σ_{j=1}^{r} α_j β_j(λ_i) = (Bα)_i ⇒ W = Diag(Bα), where α = (α_1, ..., α_r)^T is the vector of filter parameters.
  60. 60. Smooth parametric spectral filter Xavier Bresson 61 Graph convolutional layer: apply the spectral filter w to a feature signal f: w(Δ) f = Φ w(Λ) Φ^T f = Φ diag(w(λ_1), ..., w(λ_n)) Φ^T f. With learnable parameters α: w_α(Δ) f = Φ diag(w_α(λ_1), ..., w_α(λ_n)) Φ^T f.
  61. 61. SplineNets[17] Xavier Bresson 62 Series of spectral convolutional layers with smooth spectral parametric coefficients to be learned at each layer: g_l^(k) = ξ( Σ_{l'=1}^{q^(k-1)} Φ W_{l,l'}^(k) Φ^T g_{l'}^(k-1) ). Fast-decaying filters in space. O(n) parameters per layer. O(n²) computation of forward and inverse Fourier transforms Φ, Φ^T (no FFT on graphs). Filters are basis-dependent → does not generalize across graphs. [17] Henaff, Bruna, LeCun 2015
  62. 62. Outline Ø Spectral ConvNets - ChebNets* [NIPS’16] Xavier Bresson 63
  63. 63. Spectral polynomial filters[18] Xavier Bresson 64 Represent smooth spectral functions with polynomials of the Laplacian eigenvalues: w_α(λ) = Σ_{j=0}^{r} α_j λ^j, where α = (α_0, ..., α_r)^T is the vector of filter parameters. Convolutional layer: apply the spectral filter to a feature signal f: w_α(Δ) f = Σ_{j=0}^{r} α_j Δ^j f. Key observation: each Laplacian operation increases the support of a function by 1 hop → exact control of the size of Laplacian-based filters. [Figure: heat source s, 1-hop, 2-hop, ..., j-hop neighborhoods.] [18] Defferrard, Bresson, Vandergheynst 2016
  64. 64. Linear complexity Xavier Bresson 65 Application of the filter to a feature signal f: w_α(Δ) f = Σ_{j=0}^{r} α_j Δ^j f. Denote X_0 = f and define the sequence X_j = Δ X_{j-1}; then w_α(Δ) f = Σ_{j=0}^{r} α_j X_j. Two important observations: 1. No need to compute the eigendecomposition of the Laplacian (Φ, Λ). 2. The {X_j} are generated by multiplications of a sparse matrix and a vector → complexity is O(Er) = O(n) for sparse graphs. Graph convolutional layers are GPU friendly.
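  [Aside, illustrative sketch of that recursion; it assumes L behaves like a (possibly scipy.sparse) matrix supporting the @ operator, so each term costs one sparse matrix-vector product.]

    def poly_filter(L, f, alpha):
        """Apply w_α(Δ) f = Σ_j α_j Δ^j f via the recursion X_j = Δ X_{j-1}.
        L can be a scipy.sparse Laplacian; no eigendecomposition is needed."""
        X = f
        out = alpha[0] * X
        for a in alpha[1:]:
            X = L @ X                 # X_j = Δ X_{j-1}: one sparse mat-vec
            out = out + a * X
        return out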
  65. 65. Spectral graph ConvNets with polynomial filters Xavier Bresson 66 Series of spectral convolutional layers with spectral polynomial coefficients to be learned at each layer: g_l^(k) = ξ( Σ_{l'=1}^{q^(k-1)} W_{l,l'}^(k)(Δ) g_{l'}^(k-1) ). Filters are exactly localized in an r-hop support. O(1) parameters per layer. No computation of Φ, Φ^T. O(n) computational complexity (assuming sparsely-connected graphs). Unstable under coefficient perturbation (hard to optimize). Filters are basis-dependent → does not generalize across graphs. [18] Defferrard, Bresson, Vandergheynst 2016
  66. 66. Chebyshev polynomials Xavier Bresson 67 Graph convolution with the (non-orthogonal) monomial basis 1, x, x², x³, ···: w_α(Δ) f = Σ_{j=0}^{r} α_j Δ^j f. Graph convolution with (orthogonal) Chebyshev polynomials T_0, T_1, T_2, T_3, ···: w_α(Δ̃) f = Σ_{j=0}^{r} α_j T_j(Δ̃) f. Chebyshev polynomials are orthonormal on L²([−1, +1]) w.r.t. ⟨f, g⟩ = ∫_{−1}^{+1} f(λ̃) g(λ̃) dλ̃ / √(1 − λ̃²) → stable under perturbation of the coefficients.
  67. 67. ChebNets[18] Xavier Bresson 68 [18] Defferrard, Bresson, Vandergheynst 2016 Application of the filter w_α(Δ̃) f = Σ_{j=0}^{r} α_j T_j(Δ̃) f = Σ_{j=0}^{r} α_j X^(j), with the scaled Laplacian Δ̃ = 2 λ_n^{-1} Δ − I and the recurrence X^(j) = T_j(Δ̃) f = 2 Δ̃ X^(j−1) − X^(j−2), X^(0) = f, X^(1) = Δ̃ f. Filters are exactly localized in an r-hop support. O(1) parameters per layer. No computation of Φ, Φ^T. O(n) computational complexity (assuming sparsely-connected graphs). Stable under coefficient perturbation. Filters are basis-dependent → does not generalize across graphs.
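  [Aside: a compact sketch of the ChebNet filtering step, assuming a scipy sparse Laplacian and a known λ_max; not the NIPS'16 reference code.]

    import scipy.sparse as sp

    def chebnet_filter(L, f, alpha, lmax):
        """ChebNet-style filter Σ_j α_j T_j(Δ̃) f with Δ̃ = 2Δ/λ_max − I,
        using the recurrence X^(j) = 2 Δ̃ X^(j−1) − X^(j−2)."""
        n = L.shape[0]
        L_tilde = (2.0 / lmax) * L - sp.identity(n, format='csr')
        X_prev, X = f, L_tilde @ f                  # X^(0) = f, X^(1) = Δ̃ f
        out = alpha[0] * X_prev
        if len(alpha) > 1:
            out = out + alpha[1] * X
        for a in alpha[2:]:
            X_next = 2.0 * (L_tilde @ X) - X_prev   # Chebyshev recurrence
            out = out + a * X_next
            X_prev, X = X, X_next
        return out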
  68. 68. Numerical experiments Xavier Bresson 69 [Figures: running time, accuracy, and optimization curves. Graph: an 8-NN graph of the Euclidean grid.]
  69. 69. Outline Ø Spectral ConvNets - GraphConvNets Xavier Bresson 70
  70. 70. Graph convolutional nets[19]: simplified ChebNets Xavier Bresson 71 [19] Kipf, Welling 2016 Use a Chebyshev expansion with r = 2 terms (α_0, α_1) and assume λ_n ≈ 2: w_α(Δ) f = α_0 f + α_1 (Δ − I) f = α_0 f − α_1 D^(-1/2) W D^(-1/2) f. Further constrain α = α_0 = −α_1 to obtain a single-parameter filter: w_α(Δ) f = α (I + D^(-1/2) W D^(-1/2)) f. Caveat: the eigenvalues of I + D^(-1/2) W D^(-1/2) are now in [0, 2] → repeated application of the filter results in numerical instability. Fix: apply a renormalization, w_α(Δ) f = α D̃^(-1/2) W̃ D̃^(-1/2) f with W̃ = W + I and D̃ = diag(Σ_j w̃_ij).
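  [Aside, a minimal dense numpy sketch of one GCN propagation layer with the renormalization trick; the function name and the ReLU choice are illustrative assumptions, not the authors' code.]

    import numpy as np

    def gcn_layer(W_adj, H, Theta):
        """Single GCN propagation step: ξ( D̃^(-1/2) W̃ D̃^(-1/2) H Θ ) with
        W̃ = W + I (self-loops added) and D̃ the degree matrix of W̃."""
        n = W_adj.shape[0]
        W_tilde = W_adj + np.eye(n)                   # add self-loops
        d_tilde = W_tilde.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
        A_hat = D_inv_sqrt @ W_tilde @ D_inv_sqrt     # renormalized operator
        return np.maximum(A_hat @ H @ Theta, 0.0)     # ReLU activation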
  71. 71. Example: citation networks Xavier Bresson 72
  72. 72. Outline Ø Spectral ConvNets - CayleyNets* Xavier Bresson 73
  73. 73. Community graphs Xavier Bresson 74 [Figures: synthetic graph with 15 communities; normalized Laplacian eigenvalues; Chebyshev filters learned on the 15-communities graph.] ChebNet struggles to produce filters concentrated on the frequency bands of interest.
  74. 74. Spectral graph ConvNets with Cayley polynomials[20] Xavier Bresson 75 Cayley polynomials are a family of real-valued rational functions (with complex coefficients): p(λ) = c_0 + 2 Re{ Σ_{j=1}^{r} c_j C(λ)^j }, where C(λ) = (λ − i) / (λ + i) is the Cayley transform, a smooth bijection from R to e^{iR} \ {1}. Expressivity: since z^{-1} = z̄ for z ∈ e^{iR} and 2 Re{z} = z + z̄, writing C(λ) = e^{iω(λ)}, p(λ) can be rewritten as a real trigonometric polynomial: p(λ) = c_0 + Σ_{j=1}^{r} ( c_j C^j(λ) + c̄_j C^{-j}(λ) ) = c_0 + 2 Σ_{j=1}^{r} ( Re{c_j} cos(jω(λ)) − Im{c_j} sin(jω(λ)) ). [20] (Cayley 1846); Levie, Monti, Bresson, Bronstein 2017
  75. 75. Spectral zoom[20] Xavier Bresson 76 Applying the Cayley transform to the scaled Laplacian, C(hΔ) = (hΔ − iI)(hΔ + iI)^{-1}, results in a non-linear transformation of the eigenvalues (spectral zoom) controlled by h. [Figure: Cayley transform of the 15-communities graph Laplacian spectrum for h = 0.1, 1, 10.] [20] Levie, Monti, Bresson, Bronstein 2017
  76. 76. Fast inversion Xavier Bresson 77 Application of the Cayley filter w(hΔ) f requires the solution of the recursive linear systems y_0 = f, (hΔ + iI) y_j = (hΔ − iI) y_{j−1}, j = 1, ..., r. Exact (stable) solution: O(n³) complexity. Approximate solution using K Jacobi iterations: ỹ_j^(k+1) = J ỹ_j^(k) + Diag^{-1}(hΔ + iI)(hΔ − iI) ỹ_{j−1}, with ỹ_j^(0) = Diag^{-1}(hΔ + iI)(hΔ − iI) ỹ_{j−1} and J = −Diag^{-1}(hΔ + iI) Off(hΔ + iI). Application of the approximate Cayley filter: Σ_{j=0}^{r} c_j ỹ_j ≈ τ(hΔ) f. Complexity: O(ErK) = O(n).
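  [Aside: a dense numpy sketch of this approximate filter, illustrative only; the output form c_0 f + 2 Re{Σ_j c_j y_j} follows the Cayley polynomial definition two slides above, and the CayleyNet paper/code should be consulted for the real implementation.]

    import numpy as np

    def cayley_filter(L, f, c, h, K=10):
        """Approximate Cayley filter c_0 f + 2 Re{ Σ_j c_j C(hΔ)^j f }, where
        C(hΔ) = (hΔ − iI)(hΔ + iI)^{-1}. Each power is obtained by solving
        (hΔ + iI) y_j = (hΔ − iI) y_{j−1} with K Jacobi iterations."""
        n = L.shape[0]
        M_plus = h * L + 1j * np.eye(n)            # hΔ + iI
        M_minus = h * L - 1j * np.eye(n)           # hΔ − iI
        diag_inv = 1.0 / np.diag(M_plus)           # Jacobi: inverse of the diagonal
        Off = M_plus - np.diag(np.diag(M_plus))    # off-diagonal part of hΔ + iI
        y = f.astype(complex)
        out = c[0] * y
        for j in range(1, len(c)):
            b = M_minus @ y                        # right-hand side (hΔ − iI) y_{j−1}
            y_new = diag_inv * b                   # initial guess ỹ^(0)
            for _ in range(K):
                y_new = diag_inv * (b - Off @ y_new)   # ỹ^(k+1) = J ỹ^(k) + D^{-1} b
            y = y_new
            out = out + 2 * c[j] * y
        return out.real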
  77. 77. CayleyNets[20] Xavier Bresson 78 [20] Levie, Monti, Bresson, Bronstein 2017 Represent the spectral transfer function as a Cayley polynomial of order r: w_{c,h}(λ) = c_0 + 2 Re{ Σ_{j=1}^{r} c_j (hλ − i)^j (hλ + i)^{-j} }, where the filter parameters are the vector of complex coefficients c = (c_0, ..., c_r)^T and the spectral zoom h. Filters have guaranteed exponential spatial decay. O(1) parameters per layer. O(n) computational complexity with Jacobi approximate inversion (assuming a sparsely-connected graph). Spectral zoom property allowing better localization in frequency. Richer class of filters than Chebyshev for the same order. Filters are basis-dependent → does not generalize across graphs.
  78. 78. Chebyshev vs Cayley Xavier Bresson 79
  79. 79. Chebyshev vs Cayley: community graphs example Xavier Bresson 80
  80. 80. Chebyshev vs Cayley: Cora example Xavier Bresson 81 Accuracy of ChebNet (blue) and CayleyNet (orange) on Cora dataset
  81. 81. Approximate inversion Xavier Bresson 82 Accuracy of approximate inversion Speed of approximate inversion
  82. 82. Outline Ø Spectral ConvNets on Multiple Graphs - Recommender Systems* [NIPS’17] Xavier Bresson 83
  83. 83. Matrix completion - Netflix challenge Xavier Bresson 84 min_{X ∈ R^{m×n}} rank(X) s.t. x_ij = a_ij ∀ ij ∈ Ω. Convex relaxation[21]: min_{X ∈ R^{m×n}} ||X||_* s.t. x_ij = a_ij ∀ ij ∈ Ω, or min_{X ∈ R^{m×n}} ||X||_* + μ ||Ω ∘ (X − A)||²_F. [21] Candes et al. 2008
  84. 84. Matrix completion on single graph[22] Xavier Bresson 85 min_{X ∈ R^{m×n}} ||X||_* + μ ||Ω ∘ (X − A)||²_F + μ_c tr(X Δ_c X^T), where tr(X Δ_c X^T) = ||X||²_{G_c}. [22] Kalofolias, Bresson, Bronstein, Vandergheynst, 2014
  85. 85. Matrix completion on multiple graphs[22] Xavier Bresson 86 [22] Kalofolias, Bresson, Bronstein, Vandergheynst, 2014 min_{X ∈ R^{m×n}} ||X||_* + μ ||Ω ∘ (X − A)||²_F + μ_c tr(X Δ_c X^T) + μ_r tr(X^T Δ_r X), where tr(X Δ_c X^T) = ||X||²_{G_c} and tr(X^T Δ_r X) = ||X||²_{G_r}.
  86. 86. Factorized matrix completion models[23] Xavier Bresson 87 min_{W ∈ R^{m×s}, H ∈ R^{n×s}} μ ||Ω ∘ (WH^T − A)||²_F + μ_c ||W||²_F + μ_r ||H||²_F. O(n+m) variables instead of O(nm). rank(X) = rank(WH^T) ≤ s by construction. [23] Srebro, Rennie, Jaakkola 2004
  87. 87. Factorized matrix completion on graphs[24] Xavier Bresson 88 min_{W ∈ R^{m×s}, H ∈ R^{n×s}} μ ||Ω ∘ (WH^T − A)||²_F + μ_c tr(H^T Δ_c H) + μ_r tr(W^T Δ_r W). O(n+m) variables instead of O(nm). rank(X) = rank(WH^T) ≤ s by construction. [24] Rao et al. 2015
  88. 88. 1D Fourier transform Xavier Bresson 89 Column-wise transform
  89. 89. 2D Fourier transform Xavier Bresson 90 Column-wise transform + Row-wise transform = 2D transform From 1D signal processing to 2D image processing
  90. 90. Multi-graph Fourier transform and convolution[25] Xavier Bresson 91 Multi-graph Fourier transform: X̂ = Φ_r^T X Φ_c, where Φ_c, Φ_r are the eigenvectors of the column- and row-graph Laplacians Δ_c, Δ_r, respectively. Multi-graph spectral convolution: X ⋆ Y = Φ_r ((Φ_r^T X Φ_c) ∘ (Φ_r^T Y Φ_c)) Φ_c^T = Φ_r (X̂ ∘ Ŷ) Φ_c^T. [25] Monti, Bresson, Bronstein 2017
  91. 91. Multi-graph spectral filters Xavier Bresson 92 Parametrize the multi-graph filter as a smooth function of the column and row frequencies, e.g. with a bivariate Chebyshev polynomial: w_Θ(λ̃_c, λ̃_r) = Σ_{j,j'=0}^{r} θ_{jj'} T_j(λ̃_c) T_{j'}(λ̃_r), where Θ = (θ_{jj'}) is the matrix of coefficients and −1 ≤ λ̃_c, λ̃_r ≤ 1. Multi-graph convolutional layer: apply the filter to a matrix X: Y = Σ_{j,j'=0}^{r} θ_{jj'} T_j(Δ̃_r) X T_{j'}(Δ̃_c), with the scaled Laplacians Δ̃_c = 2 λ_{c,n}^{-1} Δ_c − I and Δ̃_r = 2 λ_{r,m}^{-1} Δ_r − I.
  92. 92. Multi-graph ConvNets[25] Xavier Bresson 93 [25] Monti, Bresson, Bronstein 2017 Multi-graph spectral filters applied to p input channels (m × n matrices Y_1^(k-1), ..., Y_p^(k-1)) and producing q output channels Y_1^(k), ..., Y_q^(k): Y_l^(k) = ξ( Σ_{l'=1}^{p} Σ_{j,j'=0}^{r} θ_{jj'll'} T_j(Δ̃_r) Y_{l'}^(k-1) T_{j'}(Δ̃_c) ), l = 1, ..., q, l' = 1, ..., p. Filters have a guaranteed r-hop support on both graphs. O(1) parameters per layer. O(nm) computational complexity (assuming sparse graphs).
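  [Aside: a dense numpy sketch of the bivariate Chebyshev filter for a single channel; the function name, dense matrices, and default λ_max values are illustrative assumptions.]

    import numpy as np

    def multigraph_cheb_filter(X, L_row, L_col, Theta, lmax_row=2.0, lmax_col=2.0):
        """Σ_{j,j'} θ_{jj'} T_j(Δ̃_r) X T_{j'}(Δ̃_c) for an m x n matrix X, with
        row-graph Laplacian L_row (m x m), column-graph Laplacian L_col (n x n),
        and coefficient matrix Theta of shape (r+1) x (r+1)."""
        def cheb_basis(L, V, lmax, order):
            # [T_0(Δ̃)V, T_1(Δ̃)V, ...] via T_j = 2 Δ̃ T_{j-1} − T_{j-2}
            L_t = (2.0 / lmax) * L - np.eye(L.shape[0])
            T = [V, L_t @ V]
            for _ in range(2, order):
                T.append(2.0 * (L_t @ T[-1]) - T[-2])
            return T[:order]
        out = np.zeros_like(X, dtype=float)
        T_row = cheb_basis(L_row, X, lmax_row, Theta.shape[0])      # T_j(Δ̃_r) X
        for j in range(Theta.shape[0]):
            T_col = cheb_basis(L_col, T_row[j].T, lmax_col, Theta.shape[1])
            for jp in range(Theta.shape[1]):
                out += Theta[j, jp] * T_col[jp].T                   # ... T_j'(Δ̃_c)
        return out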
  93. 93. Factorized multi-graph ConvNets Xavier Bresson 94 Two spectral convolutional layers applied to each of the factors W, H: u_l = ξ( Σ_{l'=1}^{p} Σ_{j=0}^{r} θ_{r,jll'} T_j(Δ̃_r) w_{l'} ) and v_l = ξ( Σ_{l'=1}^{p} Σ_{j'=0}^{r} θ_{c,j'll'} T_{j'}(Δ̃_c) h_{l'} ), l = 1, ..., q, l' = 1, ..., p. Filters have a guaranteed r-hop support on both graphs. O(1) parameters per layer. O(n+m) computational complexity (assuming sparse graphs). May lose some coupling information.
  94. 94. Matrix completion with Recurrent Multi-Graph ConvNets[25] Xavier Bresson 95 Complexity: O(nm) Extract compositional features from graphs of users and items [25] Monti, Bresson, Bronstein 2017
  95. 95. Factorized version Xavier Bresson 96 Complexity: O(n+m) [25] Monti, Bresson, Bronstein 2017
  96. 96. Incremental updates with RNNs The RNN/LSTM[26] casts the completion problem as a non-linear diffusion process: X(t+1) = X(t) + dX(t), where the update dX(t) is predicted by the RNN from the current state X(t). Replacing multi-layer CNNs by a simple RNN + diffusion gives: far fewer parameters to learn (reduced overfitting); more favorable behavior for highly sparse entries (better information diffusion); a number of parameters fixed independently of entry sparsity; and the ability to learn temporal dynamics of entries if such information is available. Xavier Bresson 97 [26] Hochreiter, Schmidhuber 1997
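  [Aside: a minimal sketch of the diffusion loop only; `predict_dX` is a hypothetical stand-in for the learned RNN/LSTM that maps the current state (e.g. multi-graph ConvNet features of X(t)) to the increment dX(t).]

    def diffuse_completion(X0, predict_dX, T=10):
        """Matrix completion as a diffusion process: starting from the observed
        matrix X(0), repeatedly apply incremental updates X(t+1) = X(t) + dX(t)."""
        X = X0.copy()
        for _ in range(T):
            dX = predict_dX(X)      # learned incremental update (hypothetical model)
            X = X + dX
        return X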
  97. 97. Matrix completion methods comparison: synthetic Xavier Bresson 98
  98. 98. Matrix completion methods comparison: real Xavier Bresson 99
  99. 99. Outline Ø Conclusion Xavier Bresson 100
  100. 100. Related works Xavier Bresson 101 Classes of graph ConvNets: (1) Spectral approach (2) Spatial approach Spatial approaches Ø Geodesic CNN: Masci, Boscaini, Bronstein, Vandergheynst 2015 Ø Anisotropic CNN: Boscaini, Masci, Rodola, Bronstein 2016 Ø MoNet: Monti, Boscaini, Masci, Rodola, Svoboda, Bronstein 2017 Graph Neural Networks: Ø Gori, Monfardini, Scarselli 2005 Ø Li, Tarlow, Brockschmidt, Zemel 2015 Ø Sukhbaatar, Szlam, Fergus 2016 Several other works
  101. 101. Conclusion Contributions: Generalization of ConvNets to data on (fixed) graphs Exact/tight localized filters on graphs Linear complexity for sparse graphs GPU implementation Rich expressivity Multiple graphs Several potential applications in network sciences. Xavier Bresson 102
  102. 102. Papers & code Papers Convolutional neural networks on graphs with fast localized spectral filtering, M Defferrard, X Bresson, P Vandergheynst, NIPS, 2016, arXiv:1606.09375 Geometric matrix completion with recurrent multi-graph neural networks, F Monti, MM Bronstein, X Bresson, NIPS, 2017, arXiv:1704.06803 CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters, R Levie, F Monti, X Bresson, MM Bronstein, 2017, arXiv:1705.07664 Xavier Bresson 103 Code https://github.com/xbresson/graph_convnets_pytorch
  103. 103. Xavier Bresson 104 Tutorials Computer Vision and Pattern Recognition (CVPR) Geometric deep learning on graphs and manifolds M. Bronstein​​, J. Bruna​, A. Szlam, X. Bresson, Y. LeCun July 21, 2017 Slides: https://www.dropbox.com/s/appafg1fumb6u7f/tutorial_CVPR17_DL_graphs.pdf?dl=0 Video will be posted on YouTube/ComputerVisionFoundation Videos Neural Information Processing Systems (NIPS) Geometric deep learning on graphs and manifolds M. Bronstein​​, J. Bruna​, A. Szlam, X. Bresson, Y. LeCun Dec 4, 2017
  104. 104. Xavier Bresson 105 Workshop New Deep Learning Techniques Feb 5-9, 2018 Organizers Xavier Bresson, NTU Michael Bronstein, USI/Intel Joan Bruna, NYU/Berkeley Yann LeCun, NYU/Facebook Stanley Osher, UCLA Arthur Szlam, Facebook http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques
  105. 105. Thank you Xavier Bresson 106 http://www.ntu.edu.sg/home/xbresson https://github.com/xbresson https://twitter.com/xbresson
