Talk at SMU on spectral convnets by Xavier Bresson, 19 October 2017

Published in:
Science


- 1. Convolutional Neural Networks on Graphs. Xavier Bresson, School of Computer Science and Engineering, Nanyang Technological University. Talk given at the School of Information Systems, Singapore Management University (SMU), Oct 19th 2017. Joint work with M. Defferrard (EPFL), P. Vandergheynst (EPFL), M. Bronstein (USI), F. Monti (USI), R. Levie (TAU).
- 2. Outline. Deep Learning: a brief history; high-dimensional learning. Convolutional Neural Networks: architecture; non-Euclidean data. Spectral Graph Theory: graphs and operators; spectral decomposition; Fourier analysis; convolution; coarsening. Spectral ConvNets: SplineNets; ChebNets* [NIPS’16]; GraphConvNets; CayleyNets*. Spectral ConvNets on Multiple Graphs: Recommender Systems* [NIPS’17]. Conclusion.
- 3. Outline. Deep Learning: a brief history.
- 4. A brief history of artificial neural networks. [Timeline figure; eras marked: AI birth, AI winter [1960’s-2012], AI renaissance.] Dated milestones: 1958 Perceptron (Rosenblatt); 1959 visual primary cortex (Hubel-Wiesel); 1962 birth of data science, split from statistics (Tukey); 1982 backprop (Werbos); 1987 Neocognitron (Fukushima); 1989 first KDD; 2010 Kaggle platform. Further milestones, placed on the axis at 1995, 1997, 1998, 1999, 2006, 2012, 2014, 2015: SVM/kernel techniques (Vapnik); RNN/LSTM (Schmidhuber); CNN (LeCun); first NVIDIA GPU; first Amazon cloud center; deep learning breakthrough (Hinton); GAN (Goodfellow); ResNet (He); deep RL (Mnih); data scientist as 1st job in US; Facebook Center; OpenAI Center; Google AI TensorFlow; Facebook AI Torch; first NIPS. Before the renaissance: kernel techniques, graphical models, wavelets, numerical PDEs, handcrafted features. Drivers: hardware (GPU speed 2x/year) and big data (volume 2x/1.5 year). Open question: 4th industrial revolution? Digital intelligence revolution or new AI bubble?
- 5. Breakthrough in image recognition. [Figure: handcrafted features (SIFT) versus learned features (end-to-end systems); marked level: human accuracy, 4.1%.]
- 6. Convolutional neural networks: LeNet-5[1] (1998). 2 convolutional + 2 fully connected layers; tanh non-linearity; 1M parameters; training set: MNIST, 70K images; trained on CPU. [1] LeCun et al. 1998
- 7. Convolutional neural networks: AlexNet[2] (2012). 5 convolutional + 3 fully connected layers; ReLU non-linearity; dropout regularization; 60M parameters; training set: ImageNet, 1.5M images; trained on GPU. Compared with LeNet-5: more learning capacity, more data, more computational power. [2] Krizhevsky, Sutskever, Hinton 2012
- 8. Outline. Deep Learning: high-dimensional learning.
- 9. High-dimensional learning. NNs are powerful tools for solving high-dimensional learning problems. Prototypical learning problem: evaluate a function $f(x)$ given $n$ samples $\{x_i, f(x_i)\}$. How? Suppose $f$ is regular and interpolate from close points. [Figure: 1-dimensional learning, $f(x)$ interpolated from samples $f(x_i)$ at points $x_i$.]
- 10. Curse of dimensionality. What about high dimensions? No interpolation is possible because we can never have enough close points to make a good interpolation. [Figure: unit square $[0,1]^2$ sampled on a regular grid with $N$ samples per dimension, spacing $\varepsilon = N^{-1}$.] How many points are needed to cover the 2D plane? $\varepsilon^{-2}$. A $d$-dimensional cube $[0,1]^d$ requires $\varepsilon^{-d}$ points. For an image, $\dim(\text{image}) = 512 \times 512 \approx 10^6$; with $N = 10$ samples per dimension this gives $10^{1,000,000}$ points! Curse of dimensionality: the number of samples $x$ of $f$ required is exponential in the dimension. All samples are far from each other in high dimensions[3]. Optimization: $\varepsilon^{-d}$ points are also needed to find the solution of $\min_f L(f) = \mathbb{E}[\ell(f(x_i), y_i)]$. [3] Beyer 1998
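
The $\varepsilon^{-d}$ count above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the helper name `log10_points_needed` is hypothetical. Note that with the exact pixel count $512 \times 512 = 262{,}144$ the exponent is $262{,}144$, whereas the slide rounds the dimension up to $10^6$.

```python
import math

def log10_points_needed(samples_per_dim: int, dim: int) -> float:
    """log10 of the grid size N**d needed to cover [0, 1]**d with spacing 1/N."""
    return dim * math.log10(samples_per_dim)

# Work with exponents only: N**d itself is far too large to be useful.
for d in (1, 2, 3, 10, 512 * 512):
    exponent = log10_points_needed(10, d)
    print(f"d = {d:>6}: about 10^{exponent:.0f} grid points")
```

Working with the base-10 exponent keeps the output readable even when the point count has hundreds of thousands of digits.
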
- 11. Blessing of structure. In our universe, data have structure: they concentrate in (non-)smooth low-dimensional spaces. These structures can be expressed as mathematical properties (statistics, geometry, groups, etc.). [Figure: data concentrated on a low-dimensional space inside a high-dimensional space.]
- 12. Learning exponential functions. Neural networks can learn parametric exponential functions with a linear number of parameters. A very large number of parameters, on the order of billions, can be learned by backpropagation, which gives a huge capacity to learn complex data[4]. Example: $n_{in} = 2$, $n_{out} = 3$, $|W| = 2 \cdot 3$ parameters, #configurations $= 2^{n_{out}} = 2^3 = 8$. The number of parameters grows linearly, but function capacity grows exponentially. [4] Zhang, Bengio, Hardt, Recht, Vinyals, 2016
- 13. Universal function approximation[5]. Universal Approximation Theorem: let $\xi$ be a non-constant, bounded, and monotonically increasing continuous activation function, $y : [0,1]^p \to \mathbb{R}$ a continuous function, and $\epsilon > 0$. Then there exist $n$ and parameters $a, b \in \mathbb{R}^n$, $W \in \mathbb{R}^{n \times p}$ such that $\left| \sum_{i=1}^{n} a_i\, \xi(w_i^\top f + b_i) - y(f) \right| < \epsilon$ for all $f \in [0,1]^p$. [5] Cybenko 1989, Hornik 1991
- 14. Shallow neural nets as universal approximators? Any continuous function can be approximated arbitrarily well by a neural network with a single hidden layer. But then: how many neurons? How long is the optimization? Does it generalize well or overfit? Shallow NNs: data such as images require a single-layer NN with an exponential number of neurons, which optimizes badly and does not generalize. Deep NNs: multiple layers and data structure can alleviate these issues[6]. [6] Montufar, Pascanu, Cho, Bengio 2014
- 15. Fully connected networks. A deep neural network consists of multiple layers. $f_l$ = $l$-th image feature (R, G, B channels), $\dim(f_l) = n \times 1$; $g_l^{(k)}$ = $l$-th feature map, $\dim(g_l^{(k)}) = n_l^{(k)} \times 1$. Stacked linear layers and activations: $g_l^{(k)} = \xi\left( \sum_{l'=1}^{q_{k-1}} W_{l,l'}^{(k)}\, \xi\left( \sum_{l'=1}^{q_{k-2}} W_{l,l'}^{(k-1)}\, \xi\left( \cdots f_{l'} \right) \right) \right)$. Activation, e.g. $\xi(x) = \max\{x, 0\}$, the rectified linear unit (ReLU).
- 16. Fully connected networks. FC networks capture non-linear factorizations of data. They overfit: there are too many parameters to learn. NN architectures need to exploit additional data structure to successfully analyze images, sound, documents, etc.
- 17. Outline. Convolutional Neural Networks: architecture.
- 18. Convolutional neural networks[1]. Data (image, video, sound) are compositional, i.e. they are formed of patterns that are local, stationary, and multi-scale (hierarchical). ConvNets leverage this compositional structure: they extract compositional features and feed them to a classifier, recommender, etc. (end-to-end). Application areas: computer vision, NLP, drug discovery, games. [1] LeCun et al. 1998
- 19. Key properties. Locality: property inspired by visual cortex neurons[7]. Local receptive fields activate in the presence of (learned) local features. [Figure: Neocognitron[8].] [7] Cybenko 1989, Hornik 1991, [8] Fukushima 1980
- 20. Key properties. Stationarity: translation invariance. Local stationarity: similar patches are shared across the data domain.
- 21. Key properties. Multi-scale: simple structures combine to compose slightly more abstract structures, and so on, in a hierarchical way. Also inspired by the brain's primary visual cortex (V1 and V2 neurons). Features learned by a ConvNet become increasingly complex at deeper layers[9]. [9] Zeiler, Fergus 2013
- 22. Implementation. Locality: compact support kernels, $O(1)$ parameters per filter. Stationarity: convolutional operators, $O(n \log n)$ in general (FFT) and $O(n)$ for compact kernels. Multi-scale: downsampling + pooling, $O(n)$.
- 23. Compositional features. $f_l$ = $l$-th image feature (R, G, B channels), $\dim(f_l) = n \times 1$; $g_l^{(k)}$ = $l$-th feature map, $\dim(g_l^{(k)}) = n_l^{(k)} \times 1$. Compositional features consist of multiple convolutional + pooling layers: $g_l^{(k)} = \xi\left( \sum_{l'=1}^{q_{k-1}} W_{l,l'}^{(k)} \star \xi\left( \sum_{l'=1}^{q_{k-2}} W_{l,l'}^{(k-1)} \star \xi\left( \cdots f_{l'} \right) \right) \right)$. Activation, e.g. $\xi(x) = \max\{x, 0\}$, the rectified linear unit (ReLU). Pooling: $g_l^{(k)}(x) = \left\| g_l^{(k-1)}(x') : x' \in \mathcal{N}(x) \right\|_p$ with $p = 1, 2$, or $\infty$.
- 24. ConvNets[1]. Filters localized in space (locality); convolutional filters (stationarity); multiple layers (multi-scale). $O(1)$ parameters per filter (independent of the input image size $n$). $O(n)$ complexity per layer (filtering done in the spatial domain). [1] LeCun et al. 1998
- 25. ConvNets in computer vision. Image/object recognition [Krizhevsky-Sutskever-Hinton’12]; image inpainting [Pathak-Efros-etal’16]; image captioning [Karpathy-Li’15]; self-driving cars. Highly successful: ConvNets are the Swiss Army knife of computer vision!
- 26. Outline. Convolutional Neural Networks: non-Euclidean data.
- 27. Data domain for ConvNets. Image, volume, video: 2D, 3D, 2D+1 Euclidean domains (2D grids). Sentence, word, character, sound: 1D Euclidean domain (1D grid). These domains have nice regular spatial structures. All CNN operations (convolution, pooling) are mathematically well defined and fast.
- 28. Non-Euclidean data = graphs/networks. [Figure: examples of graph-structured data.] Also NLP, physics, social science, communication networks, etc.
- 29. Challenges of graph deep learning. Goal: extend neural network techniques to graph-structured data. Assumption: non-Euclidean data are locally stationary and manifest hierarchical structures. How to define compositionality on graphs (convolution and pooling on graphs)? How to make them fast (linear complexity)?
- 30. Outline. Spectral Graph Theory: graphs and operators.
- 31. Graphs[10]. Graph $G = (V, E)$; vertices $V = \{1, \dots, n\}$; edges $E \subseteq V \times V$; vertex weights $b_i > 0$ for $i \in V$; edge weights $a_{ij} \ge 0$ for $(i, j) \in E$; vertex fields $L^2(V) = \{f : V \to \mathbb{R}^h\}$, represented as $f = (f_1, \dots, f_n)$; Hilbert space with inner product $\langle f, g \rangle_{L^2(V)} = \sum_{i \in V} b_i f_i g_i$. [10] Chung 1994
- 32. Calculus on graphs. Gradient operator $\nabla : L^2(V) \to L^2(E)$, $(\nabla f)_{ij} = \sqrt{a_{ij}}\,(f_i - f_j)$. Divergence operator $\mathrm{div} : L^2(E) \to L^2(V)$, $(\mathrm{div} F)_i = \frac{1}{b_i} \sum_{j : (i,j) \in E} \sqrt{a_{ij}}\,(F_{ij} - F_{ji})$, adjoint to the gradient operator: $\langle F, \nabla f \rangle_{L^2(E)} = \langle \nabla^* F, f \rangle_{L^2(V)} = \langle -\mathrm{div} F, f \rangle_{L^2(V)}$.
- 33. Graph Laplacian. Laplacian operator $\Delta : L^2(V) \to L^2(V)$, $(\Delta f)_i = \frac{1}{b_i} \sum_{j : (i,j) \in E} a_{ij}\,(f_i - f_j)$: the difference between $f$ and its local average (2nd derivative on graphs). Core operator in spectral graph theory. Represented as a positive semi-definite $n \times n$ matrix. Unnormalized Laplacian: $\Delta = D - A$. Normalized Laplacian: $\Delta = I - D^{-1/2} A D^{-1/2}$. Random walk Laplacian: $\Delta = I - D^{-1} A$. Here $A = (a_{ij})$ and $D = \mathrm{diag}\left(\sum_{j \ne i} a_{ij}\right)$.
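
The three Laplacians above can be built directly from an adjacency matrix. The sketch below uses a small hypothetical 4-vertex graph (the matrix `A` and signal `f` are illustrative, not from the talk) and checks the "difference from neighbours" reading of the unnormalized operator.

```python
import numpy as np

# Hypothetical 4-vertex weighted graph (symmetric adjacency, no self-loops).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 2.],
              [0., 0., 2., 0.]])
n = len(A)
d = A.sum(axis=1)                 # vertex degrees d_i = sum_j a_ij
D = np.diag(d)
I = np.eye(n)

L_unnorm = D - A                                          # D - A
L_rw = I - np.diag(1.0 / d) @ A                           # I - D^{-1} A
L_sym = I - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)   # I - D^{-1/2} A D^{-1/2}

# (L f)_i = sum_j a_ij (f_i - f_j), written out entry by entry.
f = np.array([1., 2., 0., -1.])
manual = np.array([sum(A[i, j] * (f[i] - f[j]) for j in range(n))
                   for i in range(n)])
print(np.allclose(L_unnorm @ f, manual))
```

Dense matrices are used only for readability; for large sparse graphs a sparse matrix format would be the natural choice.
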
- 34. Outline. Spectral Graph Theory: spectral decomposition.
- 35. Spectral decomposition. A Laplacian of a graph of $n$ vertices admits $n$ eigenvectors: $\Delta \phi_k = \lambda_k \phi_k$, $k = 1, 2, \dots$ Eigenvectors are real and orthonormal (due to self-adjointness): $\langle \phi_k, \phi_{k'} \rangle_{L^2(V)} = \delta_{kk'}$. Eigenvalues are non-negative (due to positive semi-definiteness): $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Eigendecomposition of a graph Laplacian: $\Delta = \Phi \Lambda \Phi^\top$, where $\Phi = (\phi_1, \dots, \phi_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
- 36. Interpretation. Find the smoothest orthonormal basis on a graph: $\min_{\phi_k} E_{Dir}(\phi_k)$ s.t. $\|\phi_k\| = 1$, $k = 2, 3, \dots, n$, and $\phi_k \perp \mathrm{span}\{\phi_1, \dots, \phi_{k-1}\}$, where $E_{Dir}(f) = f^\top \Delta f$ is the Dirichlet energy, a measure of smoothness of a function. In matrix form: $\min_{\Phi \in \mathbb{R}^{n \times n}} \mathrm{trace}(\Phi^\top \Delta \Phi)$ (the Dirichlet norm $\|\Phi\|_G$) s.t. $\Phi^\top \Phi = I$. Solution: the first $n$ Laplacian eigenvectors, $\Phi = (\phi_1, \dots, \phi_n)$.
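
The eigendecomposition and the Dirichlet-energy interpretation can be verified numerically. This is a minimal sketch on a hypothetical path graph with unit weights; `np.linalg.eigh` returns eigenvalues in ascending order with orthonormal eigenvectors as columns.

```python
import numpy as np

# Hypothetical example graph: a path on 6 vertices with unit edge weights.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A          # unnormalized Laplacian D - A

lam, Phi = np.linalg.eigh(L)            # L = Phi diag(lam) Phi^T

def dirichlet_energy(f):
    """E_Dir(f) = f^T L f, small for smooth signals on the graph."""
    return f @ L @ f

# The energy of the k-th eigenvector equals the k-th eigenvalue.
energies = np.array([dirichlet_energy(Phi[:, k]) for k in range(n)])
print(np.round(lam, 4))
print(np.round(energies, 4))
```

The first eigenvector is constant (energy 0), and each subsequent one oscillates more, which is the sense in which the basis is ordered from smoothest to roughest.
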
- 37. Laplacian eigenvectors. [Figure: first Laplacian eigenvectors in the Euclidean domain and in the graph domain.] Laplacian eigenvectors are related to graph geometry (such as communities, hubs, etc.), as used in spectral clustering[10]. [10] Von Luxburg 2007
- 38. Outline. Spectral Graph Theory: Fourier analysis.
- 39. Fourier analysis on Euclidean spaces. A function $f : [-\pi, \pi] \to \mathbb{R}$ can be written as a Fourier series $f(x) = \sum_{k \ge 0} \hat{f}_k\, e^{ikx}$ with $\hat{f}_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x')\, e^{-ikx'}\, dx' = \langle f, e^{ikx} \rangle_{L^2([-\pi, \pi])}$. Fourier basis = Laplace-Beltrami eigenfunctions: $\Delta \phi_k = k^2 \phi_k$, where $\phi_k = e^{ikx}$ is the Fourier mode and $k$ is the frequency of the Fourier mode (taking $\Delta = -d^2/dx^2$, the positive semi-definite sign convention).
- 40. Fourier analysis on graphs[11]. A function $f : V \to \mathbb{R}$ can be written as a Fourier series $f_i = \sum_{k=1}^{n} \hat{f}_k\, \phi_{k,i}$, where $\hat{f}_k = \langle f, \phi_k \rangle_{L^2(V)}$ is the $k$-th graph Fourier coefficient. In matrix-vector notation, with the $n \times n$ Fourier matrix $\Phi = [\phi_1, \dots, \phi_n]$: Fourier transform $\hat{f} = \Phi^\top f$ and inverse Fourier transform $f = \Phi \hat{f}$. Graph Fourier basis = Laplacian eigenvectors: $\Delta \phi_k = \lambda_k \phi_k$, where $\phi_k$ is the graph Fourier mode and $\lambda_k$ its (squared) frequency. [11] Hammond, Vandergheynst, Gribonval, 2011
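
The graph Fourier transform and its inverse are plain matrix products with the eigenvector matrix. The sketch below assumes a hypothetical ring graph and the unnormalized Laplacian; the function names `gft` and `igft` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: ring graph on 8 vertices.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, Phi = np.linalg.eigh(L)            # columns of Phi = graph Fourier modes

def gft(f):
    """Graph Fourier transform: f_hat = Phi^T f."""
    return Phi.T @ f

def igft(f_hat):
    """Inverse graph Fourier transform: f = Phi f_hat."""
    return Phi @ f_hat

f = rng.standard_normal(n)
f_hat = gft(f)
f_rec = igft(f_hat)

# A constant signal has all of its energy at the zero frequency.
const_hat = gft(np.ones(n))
print(np.round(const_hat, 6))
```

Because $\Phi$ is orthonormal, the transform preserves the Euclidean norm of the signal, mirroring Parseval's identity.
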
- 41. Outline. Spectral Graph Theory: convolution.
- 42. Convolution on Euclidean spaces. Given two functions $f, g : [-\pi, \pi] \to \mathbb{R}$, their convolution is the function $(f \star g)(x) = \int_{-\pi}^{\pi} f(x')\, g(x - x')\, dx'$. Shift-invariance: $f(x - x_0) \star g(x) = (f \star g)(x - x_0)$. The convolution operator commutes with the Laplacian: $(\Delta f) \star g = \Delta (f \star g)$. Convolution theorem: convolution can be computed in the Fourier domain as $\widehat{(f \star g)} = \hat{f} \cdot \hat{g}$. Efficient computation using the FFT: $O(n \log n)$.
- 43. Convolution on discrete Euclidean spaces. Convolution of two vectors $f = (f_1, \dots, f_n)^\top$ and $g = (g_1, \dots, g_n)^\top$: $(f \star g)_i = \sum_m g_{(i-m) \bmod n} \cdot f_m$. In matrix form, $f \star g = \begin{bmatrix} g_1 & g_2 & \cdots & \cdots & g_n \\ g_n & g_1 & g_2 & \cdots & g_{n-1} \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ g_3 & g_4 & \cdots & g_1 & g_2 \\ g_2 & g_3 & \cdots & \cdots & g_1 \end{bmatrix} \begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix} = \Phi \begin{bmatrix} \hat{g}_1 & & \\ & \ddots & \\ & & \hat{g}_n \end{bmatrix} \Phi^\top f = \Phi\left( \Phi^\top g \circ \Phi^\top f \right)$. The circulant (Toeplitz) matrix is diagonalised by the Fourier basis.
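
The claim that a circulant matrix is diagonalised by the Fourier basis can be checked with NumPy's FFT. This sketch uses 0-based indices and the complex DFT (rather than a real orthonormal basis), which is an assumption made for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# Circulant matrix with entries C[i, m] = g[(i - m) mod n] (0-based indices).
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
C = g[idx]

conv_direct = C @ f                                          # O(n^2)
conv_fft = np.fft.ifft(np.fft.fft(g) * np.fft.fft(f)).real   # O(n log n)

# The DFT matrix F diagonalises C: F C F^{-1} = diag(fft(g)).
F = np.fft.fft(np.eye(n))
C_hat = F @ C @ np.linalg.inv(F)

# Shift-invariance: shifting the input shifts the output.
shifted = C @ np.roll(f, 1)
print(np.allclose(conv_direct, conv_fft))
```
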
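- 44. Convolution on graphs[11]. The spectral convolution of $f, g \in L^2(V)$ can be defined by analogy: $(f \star g)_i = \sum_{k \ge 1} \langle f, \phi_k \rangle_{L^2(V)} \langle g, \phi_k \rangle_{L^2(V)}\, \phi_{k,i}$, i.e. a product in the Fourier domain followed by the inverse Fourier transform. In matrix-vector notation: $f \star g = \Phi\left( \Phi^\top g \circ \Phi^\top f \right) = G f$ with $G = \Phi\, \mathrm{diag}(\hat{g}_1, \dots, \hat{g}_n)\, \Phi^\top$, so that $f \star g = \Phi\, \hat{g}(\Lambda)\, \Phi^\top f = \hat{g}(\Phi \Lambda \Phi^\top) f = \hat{g}(\Delta) f$. Not shift-invariant! ($G$ has no circulant structure.) Commutes with the Laplacian: $G \Delta f = \Delta G f$. Filter coefficients depend on the basis $\phi_1, \dots, \phi_n$. Expensive computation (no FFT): $O(n^2)$. [11] Hammond, Vandergheynst, Gribonval, 2011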
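
A direct transcription of the spectral convolution, on a hypothetical random weighted graph, shows both the matrix form $Gf$ and the fact that $G$ commutes with the Laplacian. The graph and signals are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: random symmetric weighted graph on 10 vertices.
n = 10
U = np.triu(rng.random((n, n)), k=1)
A = U + U.T
L = np.diag(A.sum(axis=1)) - A
lam, Phi = np.linalg.eigh(L)

f = rng.standard_normal(n)
g = rng.standard_normal(n)

# f * g = Phi ((Phi^T g) o (Phi^T f)) = G f, with G = Phi diag(g_hat) Phi^T.
g_hat = Phi.T @ g
conv = Phi @ (g_hat * (Phi.T @ f))
G = Phi @ np.diag(g_hat) @ Phi.T

print(np.allclose(conv, G @ f))        # matrix form agrees
print(np.allclose(G @ L, L @ G))       # G commutes with the Laplacian
```

Computing `Phi` costs a full eigendecomposition and each product with `Phi` costs $O(n^2)$, which is the bottleneck the polynomial filters later in the talk avoid.
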
- 45. No shift invariance on graphs. A signal $f$ on a graph can be translated to vertex $i$: $T_i f = f \star \delta_i$. [Figure: translations $T_s f$, $T_{s'} f$, $T_{s''} f$ in the Euclidean domain, panels (a)-(c), versus $T_i f$, $T_{i'} f$, $T_{i''} f$ in the graph domain[12], panels (d)-(f).] [12] Shuman et al. 2016
- 46. Outline. Spectral Graph Theory: coarsening.
- 47. Graph coarsening for graph pooling. Goals: pool similar local features (max pooling); a series of pooling layers creates invariance to global geometric deformations. Challenges: design a multi-scale coarsening algorithm that preserves non-linear graph structures. How to make graph pooling fast?
- 48. Graph coarsening = graph partitioning. Graph partitioning decomposes $G$ into smaller meaningful clusters. The problem is NP-hard, so approximations are used.
- 49. Balanced cuts. Powerful combinatorial graph partitioning models: normalized cut[13] (partitioning by min edge cuts), $\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{\mathrm{Cut}(C_k, C_k^c)}{\mathrm{Vol}(C_k)}$, and normalized association (partitioning by max vertex matching), $\max_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{\mathrm{Assoc}(C_k)}{\mathrm{Vol}(C_k)}$, where $\mathrm{Cut}(A, B) := \sum_{i \in A, j \in B} a_{ij}$, $\mathrm{Assoc}(A) := \sum_{i \in A, j \in A} a_{ij}$, $\mathrm{Vol}(A) := \sum_{i \in A} d_i$, and $d_i := \sum_{j \in V} a_{ij}$. Both models are equivalent, but lead to different algorithms. [13] Shi, Malik, 2000
- 50. Balanced cuts. Balanced cuts are NP-hard; the most popular approximation techniques focus on linear spectral relaxation (an eigenproblem with a global solution). Graph geometry is generally not linear. The Graclus[14] algorithm computes non-linear clusters that locally maximize the normalized association. Graclus offers control of the coarsening ratio of ≈ 2 (as on an image grid) using heavy-edge matching[15]. [14] Dhillon, Guan, Kulis 2007 [15] Karypis, Kumar 1995
- 51. Heavy-Edge Matching (HEM). HEM proceeds by two successive steps, vertex matching and graph coarsening (which guarantees a local solution of the normalized association). (1) Vertex matching: $\{i, j\} = \arg\max_j \frac{a_{ii}^l + 2 a_{ij}^l + a_{jj}^l}{d_i^l + d_j^l}$. Matched vertices $\{i, j\}$ are merged into a super-vertex at the next coarsening level. (2) Graph coarsening: $G^{l+1}$ has weights $a_{ij}^{l+1} = \mathrm{Cut}(C_i^l, C_j^l)$ and $a_{ii}^{l+1} = \mathrm{Assoc}(C_i^l)$. [Figure: sequence of graphs $G^l$, $G^{l+1}$, $G^{l+2}$ produced by graph coarsening/clustering.]
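
The two HEM steps can be sketched as one greedy pass over the vertices. This is an illustrative simplification, not the Graclus implementation: the function name `hem_coarsen`, the explicit visiting `order` argument, and the dense-matrix representation are all assumptions made for the example.

```python
import numpy as np

def hem_coarsen(A, order=None):
    """One level of greedy matching + coarsening on a symmetric adjacency A.

    Each unmatched vertex i is paired with the unmatched neighbour j that
    maximises (a_ii + 2 a_ij + a_jj) / (d_i + d_j); vertices left without
    an unmatched neighbour become singleton clusters.
    """
    n = A.shape[0]
    d = A.sum(axis=1)
    cluster = -np.ones(n, dtype=int)
    n_clusters = 0
    for i in (range(n) if order is None else order):
        if cluster[i] >= 0:
            continue                        # already matched
        best_j, best_score = -1, -np.inf
        for j in np.nonzero(A[i])[0]:
            if j == i or cluster[j] >= 0:
                continue
            score = (A[i, i] + 2 * A[i, j] + A[j, j]) / (d[i] + d[j])
            if score > best_score:
                best_j, best_score = j, score
        cluster[i] = n_clusters
        if best_j >= 0:
            cluster[best_j] = n_clusters
        n_clusters += 1
    # Coarse weights: a_IJ = Cut(C_I, C_J) off the diagonal, a_II = Assoc(C_I).
    S = np.zeros((n, n_clusters))
    S[np.arange(n), cluster] = 1.0
    return S.T @ A @ S, cluster

# Hypothetical weighted path 0 - 1 = 2 - 3 with one heavy edge (1, 2).
A = np.array([[0., 1., 0., 0.],
              [1., 0., 5., 0.],
              [0., 5., 0., 1.],
              [0., 0., 1., 0.]])
A_coarse, cluster = hem_coarsen(A, order=[1, 0, 2, 3])
print(cluster)
print(A_coarse)
```

The result depends on the visiting order, which is why the example passes it explicitly; starting from vertex 1 merges the heavy edge first and leaves vertices 0 and 3 as singletons.
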
- 52. Unstructured pooling. [Figure: sequence of coarsened graphs produced by HEM.] Unstructured pooling stores a table of indices for the graph and all its coarsened versions, which is computationally inefficient.
- 53. Fast graph pooling[18]. Structured pooling: arrange the node indexing such that adjacent nodes are hierarchically merged at the next coarser level. As efficient as 1D Euclidean grid pooling. [18] Defferrard, Bresson, Vandergheynst 2016
- 54. Outline. Spectral ConvNets: SplineNets.
- 55. Vanilla spectral graph ConvNets[16]. Graph convolutional layer: $f_l$ = $l$-th data feature on graphs, $\dim(f_l) = n \times 1$; $g_l$ = $l$-th feature map, $\dim(g_l) = n \times 1$. Convolutional layer: $g_l = \xi\left( \sum_{l'=1}^{p} W_{l,l'} \star f_{l'} \right)$. Activation, e.g. $\xi(x) = \max\{x, 0\}$, the rectified linear unit (ReLU). [16] Bruna, Zaremba, Szlam, LeCun 2014; Henaff, Bruna, LeCun 2015
- 56. Spectral graph convolution. The convolutional layer in the spatial domain, $g_l = \xi\left( \sum_{l'=1}^{p} W_{l,l'} \star f_{l'} \right)$, where $W_{l,l'}$ is the matrix of the graph spatial filter, can also be expressed in the spectral domain (using $g \star f = \Phi\, \hat{g}(\Lambda)\, \Phi^\top f$): $g_l = \xi\left( \sum_{l'=1}^{p} \Phi\, \hat{W}_{l,l'}\, \Phi^\top f_{l'} \right)$, where $\hat{W}_{l,l'}$ is the $n \times n$ diagonal matrix of the graph spectral filter. We will denote the spectral filter without the hat symbol, i.e. $W_{l,l'}$.
- 57. Vanilla spectral graph ConvNets[16]. Series of spectral convolutional layers, $g_l^{(k)} = \xi\left( \sum_{l'=1}^{q^{(k-1)}} \Phi\, W_{l,l'}^{(k)}\, \Phi^\top g_{l'}^{(k-1)} \right)$, with spectral coefficients $W_{l,l'}^{(k)}$ to be learned at each layer. First graph CNN architecture. No guarantee of spatial localization of filters. $O(n)$ parameters per layer. $O(n^2)$ computation of forward and inverse Fourier transforms $\Phi, \Phi^\top$ (no FFT on graphs). Filters are basis-dependent, so they do not generalize across graphs. [16] Bruna, Zaremba, Szlam, LeCun 2014; Henaff, Bruna, LeCun 2015
- 58. Spatial localization and spectral smoothness. In the Euclidean setting (by Parseval's identity): $\int_{-\infty}^{+\infty} |x|^{2k}\, |w(x)|^2\, dx = \int_{-\infty}^{+\infty} \left| \frac{\partial^k \hat{w}(\lambda)}{\partial \lambda^k} \right|^2 d\lambda$. Localization in space = smoothness in the frequency domain: a smooth spectral filter function $\hat{w}(\lambda)$ is equivalent to spatial localization of $w(x)$.
- 59. Smooth parametric spectral filter. Parametrize the smooth spectral filter function $w(\lambda)$ with a linear combination of smooth kernel functions $\beta_1(\lambda), \dots, \beta_r(\lambda)$, e.g. splines[17]: $w_\alpha(\lambda) = \sum_{j=1}^{r} \alpha_j \beta_j(\lambda)$, where $\alpha = (\alpha_1, \dots, \alpha_r)^\top$ is the vector of filter parameters. Evaluated at the eigenvalues: $w_\alpha(\lambda_i) = \sum_{j=1}^{r} \alpha_j \beta_j(\lambda_i) = (B\alpha)_i$, hence $W = \mathrm{Diag}(B\alpha)$. [17] (Litman, Bronstein, 2014); Henaff, Bruna, LeCun 2015
- 60. Smooth parametric spectral filter. Graph convolutional layer: apply the spectral filter $w$ to a feature signal $f$: $w(\Delta) f = \Phi\, w(\Lambda)\, \Phi^\top f = \Phi\, \mathrm{diag}\left( w(\lambda_1), \dots, w(\lambda_n) \right) \Phi^\top f$. Application of the spectral filter with learnable parameters $\alpha$ to a feature signal $f$: $w_\alpha(\Delta) f = \Phi\, \mathrm{diag}\left( w_\alpha(\lambda_1), \dots, w_\alpha(\lambda_n) \right) \Phi^\top f$.
- 61. SplineNets[17]. Series of spectral convolutional layers, $g_l^{(k)} = \xi\left( \sum_{l'=1}^{q^{(k-1)}} \Phi\, W_{l,l'}^{(k)}\, \Phi^\top g_{l'}^{(k-1)} \right)$, with smooth spectral parametric coefficients $W_{l,l'}^{(k)}$ to be learned at each layer. Fast-decaying filters in space. $O(n)$ parameters per layer. $O(n^2)$ computation of forward and inverse Fourier transforms $\Phi, \Phi^\top$ (no FFT on graphs). Filters are basis-dependent, so they do not generalize across graphs. [17] Henaff, Bruna, LeCun 2015
- 62. Outline. Spectral ConvNets: ChebNets* [NIPS’16].
- 63. Spectral polynomial filters[18]. Represent smooth spectral functions with polynomials of Laplacian eigenvalues: $w_\alpha(\lambda) = \sum_{j=0}^{r} \alpha_j \lambda^j$, where $\alpha = (\alpha_0, \dots, \alpha_r)^\top$ is the vector of filter parameters. Convolutional layer: apply the spectral filter to a feature signal $f$: $w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j \Delta^j f$. Key observation: each Laplacian operation increases the support of a function by 1 hop, which gives exact control of the size of Laplacian-based filters. [Figure: a heat source $s$ and its 1-hop, 2-hop, ..., $j$-hop neighborhoods reached by $\Delta^j$.] [18] Defferrard, Bresson, Vandergheynst 2016
- 64. Linear complexity. Application of the filter to a feature signal $f$: $w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j \Delta^j f$. Denote $X_0 = f$ and define the sequence $X_j = \Delta X_{j-1}$ (so $X_1 = \Delta X_0 = \Delta f$); then $w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j X_j$. Two important observations: 1. There is *no* need to compute the eigendecomposition of the Laplacian $(\Phi, \Lambda)$. 2. The $\{X_j\}$ are generated by multiplication of a sparse matrix and a vector, so the complexity is $O(|E| r) = O(n)$ for sparse graphs. Graph convolutional layers are GPU friendly.
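
The recursion $X_j = \Delta X_{j-1}$ and the $r$-hop localization can both be seen on a small example. The sketch below assumes a hypothetical path graph and arbitrary coefficients; the eigendecomposition appears only to cross-check the result, not in the filter itself.

```python
import numpy as np

# Hypothetical example: path graph on 9 vertices, unnormalized Laplacian.
n = 9
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

def poly_filter(L, f, alpha):
    """Return sum_j alpha[j] * L^j f using only matrix-vector products."""
    x = f.copy()                  # X_0 = f
    out = alpha[0] * x
    for a in alpha[1:]:
        x = L @ x                 # X_j = L X_{j-1}
        out = out + a * x
    return out

delta = np.zeros(n)
delta[4] = 1.0                            # impulse at the middle vertex
alpha = np.array([0.5, -1.0, 0.25])       # arbitrary coefficients, degree r = 2
y = poly_filter(L, delta, alpha)
print(np.round(y, 3))                     # non-zero only within 2 hops of vertex 4

# Cross-check against the spectral definition Phi w(Lambda) Phi^T f.
lam, Phi = np.linalg.eigh(L)
y_spec = Phi @ (np.polyval(alpha[::-1], lam) * (Phi.T @ delta))
```

With a sparse matrix type in place of the dense `L`, each product costs $O(|E|)$, giving the $O(|E| r)$ total stated on the slide.
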
- 65. Spectral graph ConvNets with polynomial filters. Series of spectral convolutional layers, $g_l^{(k)} = \xi\left( \sum_{l'=1}^{q^{(k-1)}} \Phi\, W_{l,l'}^{(k)}\, \Phi^\top g_{l'}^{(k-1)} \right)$, with spectral polynomial coefficients $W_{l,l'}^{(k)}$ to be learned at each layer. Filters are exactly localized in $r$-hops support. $O(1)$ parameters per layer. No computation of $\Phi, \Phi^\top$. $O(n)$ computational complexity (assuming sparsely-connected graphs). Unstable under perturbation of coefficients (hard to optimize). Filters are basis-dependent, so they do not generalize across graphs. [18] Defferrard, Bresson, Vandergheynst 2016
- 66. Chebyshev polynomials. Graph convolution with the (non-orthogonal) monomial basis $1, x, x^2, x^3, \dots$: $w_\alpha(\Delta) f = \sum_{j=0}^{r} \alpha_j \Delta^j f$. Graph convolution with (orthogonal) Chebyshev polynomials $T_0, T_1, T_2, T_3, \dots$: $w_\alpha(\tilde{\Delta}) f = \sum_{j=0}^{r} \alpha_j T_j(\tilde{\Delta}) f$. The Chebyshev polynomials are orthonormal on $L^2([-1, +1])$ w.r.t. $\langle f, g \rangle = \int_{-1}^{+1} f(\tilde{\lambda})\, g(\tilde{\lambda})\, \frac{d\tilde{\lambda}}{\sqrt{1 - \tilde{\lambda}^2}}$ and stable under perturbation of coefficients.
- 67. ChebNets[18]. Application of the filter with the scaled Laplacian $\tilde{\Delta} = 2 \lambda_n^{-1} \Delta - I$: $w_\alpha(\tilde{\Delta}) f = \sum_{j=0}^{r} \alpha_j T_j(\tilde{\Delta}) f = \sum_{j=0}^{r} \alpha_j X^{(j)}$ with $X^{(j)} = T_j(\tilde{\Delta}) f = 2 \tilde{\Delta} X^{(j-1)} - X^{(j-2)}$, $X^{(0)} = f$, $X^{(1)} = \tilde{\Delta} f$. Filters are exactly localized in $r$-hops support. $O(1)$ parameters per layer. No computation of $\Phi, \Phi^\top$. $O(n)$ computational complexity (assuming sparsely-connected graphs). Stable under perturbation of coefficients. Filters are basis-dependent, so they do not generalize across graphs. [18] Defferrard, Bresson, Vandergheynst 2016
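
The Chebyshev recursion translates almost line by line into code. This is a minimal sketch under stated assumptions: a hypothetical random weighted graph, the normalized Laplacian, and $\lambda_n$ computed exactly with a dense eigensolver (a practical implementation would use sparse products and an estimate or bound for $\lambda_n$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: random weighted graph, normalized Laplacian.
n = 12
U = np.triu(rng.random((n, n)), k=1)
A = U + U.T
d = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))      # I - D^{-1/2} A D^{-1/2}
lam_max = np.linalg.eigvalsh(L)[-1]
L_tilde = 2.0 * L / lam_max - np.eye(n)          # spectrum rescaled to [-1, 1]

def cheb_filter(L_tilde, f, alpha):
    """sum_j alpha[j] T_j(L_tilde) f via X_j = 2 L_tilde X_{j-1} - X_{j-2}."""
    x_prev, x = f, L_tilde @ f                   # X_0 = f, X_1 = L_tilde f
    out = alpha[0] * x_prev
    if len(alpha) > 1:
        out = out + alpha[1] * x
    for a in alpha[2:]:
        x_prev, x = x, 2.0 * (L_tilde @ x) - x_prev
        out = out + a * x
    return out

f = rng.standard_normal(n)
alpha = rng.standard_normal(5)                   # degree r = 4
y = cheb_filter(L_tilde, f, alpha)

# Cross-check against the spectral definition (not needed in practice).
lam_t, Phi = np.linalg.eigh(L_tilde)
y_ref = Phi @ (np.polynomial.chebyshev.chebval(lam_t, alpha) * (Phi.T @ f))
print(np.allclose(y, y_ref))
```
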
- 68. Numerical experiments. [Figure panels: running time, accuracy, optimization.] Graph: an 8-NN graph of the Euclidean grid.
- 69. Outline. Spectral ConvNets: GraphConvNets.
- 70. Graph convolutional nets[19]: simplified ChebNets. Use Chebyshev polynomials with $r = 2$ terms (coefficients $\alpha_0, \alpha_1$) and assume $\lambda_n \approx 2$: $w_\alpha(\Delta) f = \alpha_0 f + \alpha_1 (\Delta - I) f = \alpha_0 f - \alpha_1 D^{-1/2} W D^{-1/2} f$. Further constrain $\alpha = \alpha_0 = -\alpha_1$ to obtain a single-parameter filter: $w_\alpha(\Delta) f = \alpha \left( I + D^{-1/2} W D^{-1/2} \right) f$. Caveat: the eigenvalues of $I + D^{-1/2} W D^{-1/2}$ are now in $[0, 2]$, so repeated application of the filter results in numerical instability. Fix: apply a renormalization $w_\alpha(\Delta) f = \alpha\, \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} f$ with $\tilde{W} = W + I$ and $\tilde{D} = \mathrm{diag}\left( \sum_{j \ne i} \tilde{w}_{ij} \right)$. [19] Kipf, Welling 2016
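
The effect of the renormalization on the spectrum can be observed numerically. The sketch below uses a hypothetical random weighted graph; as an assumption, the degrees $\tilde{D}$ here include the added self-loop, following the formulation in Kipf and Welling's paper, whereas the slide writes the sum over $j \ne i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
U = np.triu(rng.random((n, n)), k=1)
W = U + U.T                                   # hypothetical weighted adjacency

def sym_norm(M, deg):
    """D^{-1/2} M D^{-1/2} for a degree vector deg."""
    return M / np.sqrt(np.outer(deg, deg))

d = W.sum(axis=1)
P = np.eye(n) + sym_norm(W, d)                # I + D^{-1/2} W D^{-1/2}

W_t = W + np.eye(n)                           # add self-loops
d_t = W_t.sum(axis=1)                         # degrees including the self-loop (assumption)
P_t = sym_norm(W_t, d_t)                      # renormalized propagation matrix

eig_P = np.linalg.eigvalsh(P)
eig_Pt = np.linalg.eigvalsh(P_t)
print("before:", eig_P.min().round(3), eig_P.max().round(3))
print("after: ", eig_Pt.min().round(3), eig_Pt.max().round(3))
```

Before the fix the largest eigenvalue is 2, so stacking $k$ layers can amplify some components by up to $2^k$; after the fix all eigenvalues have magnitude at most 1.
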
- 71. Example: citation networks. [Figure.]
- 72. Outline. Spectral ConvNets: CayleyNets*.
- 73. Community graphs. [Figure panels: synthetic graph with 15 communities; normalized Laplacian eigenvalues; Chebyshev filters learned on the 15-communities graph.] ChebNet is unstable when producing filters with frequency bands of interest.
- 74. Spectral graph ConvNets with Cayley polynomials[20]. Cayley polynomials are a family of real-valued rational functions (with complex coefficients): $p(\lambda) = c_0 + 2\,\mathrm{Re}\left\{ \sum_{j=1}^{r} c_j \left( \frac{\lambda - i}{\lambda + i} \right)^j \right\}$. The Cayley transform $C(\lambda) = \frac{\lambda - i}{\lambda + i}$ is a smooth bijection from $\mathbb{R}$ to $e^{i\mathbb{R}} \setminus \{1\}$. Expressivity: since $z^{-1} = \bar{z}$ for $z \in e^{i\mathbb{R}}$ and $2\,\mathrm{Re}\{z\} = z + \bar{z}$, we can rewrite $p(\lambda)$ as a real trigonometric polynomial w.r.t. $C(\lambda) = e^{i\omega}$: $p(\lambda) = c_0 + \sum_{j=1}^{r} \left( c_j C^j(\lambda) + \bar{c}_j C^{-j}(\lambda) \right) = c_0 + 2 \sum_{j=1}^{r} \left( \mathrm{Re}\{c_j\} \cos(j\omega) - \mathrm{Im}\{c_j\} \sin(j\omega) \right)$, using $C^j(\lambda) = e^{ij\omega}$ and $C^{-j}(\lambda) = e^{-ij\omega}$. [20] (Cayley 1846); Levie, Monti, Bresson, Bronstein 2017
- 75. Spectral zoom[20]. Applying the Cayley transform to the scaled Laplacian $h\Delta$ results in a non-linear transformation of the eigenvalues (spectral zoom): $C(h\Delta) = (h\Delta - iI)(h\Delta + iI)^{-1}$. [Figure: Laplacian spectrum of the 15-communities graph and its Cayley transform $C(h\Delta)$ for $h = 0.1$, $h = 1$, $h = 10$.] [20] Levie, Monti, Bresson, Bronstein 2017
- 76. Fast inversion. Application of the Cayley filter $w(h\Delta) f$ requires the solution of a recursive linear system: $y_0 = f$, $(h\Delta + iI)\, y_j = (h\Delta - iI)\, y_{j-1}$, $j = 1, \dots, r$. Exact (stable) solution: $O(n^3)$ complexity. Approximate solution using $K$ Jacobi iterations: $\tilde{y}_j^{(k+1)} = J \tilde{y}_j^{(k)} + \mathrm{Diag}^{-1}(h\Delta + iI)\,(h\Delta - iI)\, \tilde{y}_{j-1}$, initialized with $\tilde{y}_j^{(0)} = \mathrm{Diag}^{-1}(h\Delta + iI)\,(h\Delta - iI)\, \tilde{y}_{j-1}$, where $J = -\mathrm{Diag}^{-1}(h\Delta + iI)\, \mathrm{Off}(h\Delta + iI)$. Application of the approximate Cayley filter: $\sum_{j=0}^{r} c_j \tilde{y}_j \approx w(h\Delta) f$. Complexity: $O(|E| r K) = O(n)$.
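
The exact and Jacobi-approximated Cayley filters can be compared on a small graph. This is a sketch under stated assumptions: a hypothetical random weighted graph, the normalized Laplacian, dense linear algebra, and the filter evaluated as $c_0 f + 2\,\mathrm{Re}\{\sum_{j \ge 1} c_j y_j\}$ following the definition of $p(\lambda)$; the function name `cayley_filter` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
U = np.triu(rng.random((n, n)), k=1)
A = U + U.T
d = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))      # normalized Laplacian (assumed)

def cayley_filter(L, f, c, h, K=None):
    """c[0] f + 2 Re(sum_{j>=1} c[j] C(hL)^j f), C(hL) = (hL - iI)(hL + iI)^{-1}.

    K=None solves each linear system exactly; otherwise K Jacobi iterations
    are used per system.
    """
    n = L.shape[0]
    M_plus = h * L + 1j * np.eye(n)
    M_minus = h * L - 1j * np.eye(n)
    diag = np.diag(M_plus)                       # Diag(hL + iI) as a vector
    off = M_plus - np.diag(diag)                 # Off(hL + iI)
    y = f.astype(complex)                        # y_0 = f
    out = c[0] * y
    for cj in c[1:]:
        b = M_minus @ y
        if K is None:
            y = np.linalg.solve(M_plus, b)
        else:
            y0 = b / diag
            y = y0.copy()
            for _ in range(K):
                y = -(off @ y) / diag + y0       # y <- J y + Diag^{-1} b
        out = out + 2.0 * cj * y
    return out.real

f = rng.standard_normal(n)
c = np.array([0.3, 0.5 - 0.2j, -0.1 + 0.4j])     # arbitrary coefficients, r = 2
h = 0.5
exact = cayley_filter(L, f, c, h)
approx = cayley_filter(L, f, c, h, K=40)
print(np.max(np.abs(exact - approx)))
```

For the normalized Laplacian the Jacobi iteration matrix has norm $h / \sqrt{h^2 + 1} < 1$, so the iteration converges for any $h > 0$, more slowly as $h$ grows.
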
- 77. CayleyNets[20]. Represent the spectral transfer function as a Cayley polynomial of order $r$: $w_{c,h}(\lambda) = c_0 + 2\,\mathrm{Re}\left\{ \sum_{j=1}^{r} c_j (h\lambda - i)^j (h\lambda + i)^{-j} \right\}$, where the filter parameters are the vector of complex coefficients $c = (c_0, \dots, c_r)^\top$ and the spectral zoom $h$. Filters have guaranteed exponential spatial decay. $O(1)$ parameters per layer. $O(n)$ computational complexity with Jacobi approximate inversion (assuming a sparsely-connected graph). The spectral zoom property allows better localization in frequency. Richer class of filters than Chebyshev for the same order. Filters are basis-dependent, so they do not generalize across graphs. [20] Levie, Monti, Bresson, Bronstein 2017
- 78. Chebyshev vs Cayley. [Figure.]
- 79. Chebyshev vs Cayley: community graphs example. [Figure.]
- 80. Chebyshev vs Cayley: Cora example. [Figure: accuracy of ChebNet (blue) and CayleyNet (orange) on the Cora dataset.]
- 81. Approximate inversion. [Figure panels: accuracy of approximate inversion; speed of approximate inversion.]
- 82. Outline. Spectral ConvNets on Multiple Graphs: Recommender Systems* [NIPS’17].
- 83. Matrix completion: Netflix challenge. $\min_{X \in \mathbb{R}^{m \times n}} \mathrm{rank}(X)$ s.t. $x_{ij} = a_{ij}\ \forall ij \in \Omega$. Convex relaxation[21]: $\min_{X \in \mathbb{R}^{m \times n}} \|X\|_*$ s.t. $x_{ij} = a_{ij}\ \forall ij \in \Omega$, or in penalized form $\min_{X \in \mathbb{R}^{m \times n}} \|X\|_* + \mu \|\Omega \circ (X - A)\|_F^2$. [21] Candes et al. 2008
- 84. Matrix completion on single graph[22]. $\min_{X \in \mathbb{R}^{m \times n}} \|X\|_* + \mu \|\Omega \circ (X - A)\|_F^2 + \mu_c\, \mathrm{tr}(X \Delta_c X^\top)$, where $\mathrm{tr}(X \Delta_c X^\top) = \|X\|_{G_c}^2$. [22] Kalofolias, Bresson, Bronstein, Vandergheynst, 2014
- 85. Matrix completion on multiple graphs[22]. $\min_{X \in \mathbb{R}^{m \times n}} \|X\|_* + \mu \|\Omega \circ (X - A)\|_F^2 + \mu_c\, \mathrm{tr}(X \Delta_c X^\top) + \mu_r\, \mathrm{tr}(X^\top \Delta_r X)$, where $\mathrm{tr}(X \Delta_c X^\top) = \|X\|_{G_c}^2$ and $\mathrm{tr}(X^\top \Delta_r X) = \|X\|_{G_r}^2$. [22] Kalofolias, Bresson, Bronstein, Vandergheynst, 2014
- 86. Factorized matrix completion models[23]. $\min_{W \in \mathbb{R}^{m \times s},\, H \in \mathbb{R}^{n \times s}} \mu \|\Omega \circ (X - A)\|_F^2 + \mu_c \|W\|_F^2 + \mu_r \|H\|_F^2$ with $X = W H^\top$. $O(n + m)$ variables instead of $O(nm)$. $\mathrm{rank}(X) = \mathrm{rank}(W H^\top) \le s$ by construction. [23] Srebro, Rennie, Jaakkola 2004
- 87. Factorized matrix completion on graphs[24]. $\min_{W \in \mathbb{R}^{m \times s},\, H \in \mathbb{R}^{n \times s}} \mu \|\Omega \circ (X - A)\|_F^2 + \mu_c\, \mathrm{tr}(H^\top \Delta_c H) + \mu_r\, \mathrm{tr}(W^\top \Delta_r W)$ with $X = W H^\top$. $O(n + m)$ variables instead of $O(nm)$. $\mathrm{rank}(X) = \mathrm{rank}(W H^\top) \le s$ by construction. [24] Rao et al. 2015
- 88. 1D Fourier transform. [Figure: column-wise transform.]
- 89. 2D Fourier transform. Column-wise transform + row-wise transform = 2D transform. From 1D signal processing to 2D image processing.
- 90. Multi-graph Fourier transform and convolution[25]. Multi-graph Fourier transform: $\hat{X} = \Phi_r^\top X \Phi_c$, where $\Phi_c, \Phi_r$ are the eigenvectors of the column- and row-graph Laplacians $\Delta_c, \Delta_r$, respectively. Multi-graph spectral convolution: $X \star Y = \Phi_r \left( (\Phi_r^\top X \Phi_c) \circ (\Phi_r^\top Y \Phi_c) \right) \Phi_c^\top = \Phi_r (\hat{X} \circ \hat{Y}) \Phi_c^\top$. [25] Monti, Bresson, Bronstein 2017
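
The multi-graph transform applies one graph Fourier basis to the rows and another to the columns. The sketch below assumes two hypothetical random graphs (5 row vertices, 7 column vertices); the names `mgft`, `imgft`, and `mg_conv` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian_eigvecs(n, rng):
    """Eigenvectors of the unnormalized Laplacian of a random weighted graph."""
    U = np.triu(rng.random((n, n)), k=1)
    A = U + U.T
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigh(L)[1]

m, n = 5, 7                          # m rows (e.g. users), n columns (e.g. items)
Phi_r = laplacian_eigvecs(m, rng)    # row-graph eigenvectors, m x m
Phi_c = laplacian_eigvecs(n, rng)    # column-graph eigenvectors, n x n

def mgft(X):
    """Multi-graph Fourier transform: X_hat = Phi_r^T X Phi_c."""
    return Phi_r.T @ X @ Phi_c

def imgft(X_hat):
    """Inverse transform: X = Phi_r X_hat Phi_c^T."""
    return Phi_r @ X_hat @ Phi_c.T

def mg_conv(X, Y):
    """Multi-graph spectral convolution: Phi_r (X_hat o Y_hat) Phi_c^T."""
    return imgft(mgft(X) * mgft(Y))

X = rng.standard_normal((m, n))
Y = rng.standard_normal((m, n))
Z = mg_conv(X, Y)
print(Z.shape)
```

On regular grids this reduces to the familiar 2D Fourier transform, where the row and column bases are both classical Fourier bases.
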
- 91. Multi-graph spectral filters. Parametrize the multi-graph filter as a smooth function of column and row frequencies, e.g. with a bivariate Chebyshev polynomial $w_\Theta(\tilde{\lambda}_c, \tilde{\lambda}_r) = \sum_{j,j'=0}^{r} \theta_{jj'}\, T_j(\tilde{\lambda}_c)\, T_{j'}(\tilde{\lambda}_r)$, where $\Theta = (\theta_{jj'})$ is the matrix of coefficients and $-1 \le \tilde{\lambda}_c, \tilde{\lambda}_r \le 1$. Multi-graph convolutional layer: apply the filter to the $m \times n$ matrix $X$: $Y = \sum_{j,j'=0}^{r} \theta_{jj'}\, T_j(\tilde{\Delta}_r)\, X\, T_{j'}(\tilde{\Delta}_c)$, where the scaled Laplacians are $\tilde{\Delta}_c = 2 \lambda_{c,n}^{-1} \Delta_c - I$ and $\tilde{\Delta}_r = 2 \lambda_{r,m}^{-1} \Delta_r - I$.
- 92. Multi-graph ConvNets[25]. Multi-graph spectral filters applied to $p$ input channels ($m \times n$ matrices $Y_1^{(k-1)}, \dots, Y_p^{(k-1)}$) and producing $q$ output channels $Y_1^{(k)}, \dots, Y_q^{(k)}$: $Y_l^{(k)} = \xi\left( \sum_{l'=1}^{p} \sum_{j,j'=0}^{r} \theta_{jj'll'}\, T_j(\tilde{\Delta}_r)\, Y_{l'}^{(k-1)}\, T_{j'}(\tilde{\Delta}_c) \right)$, $l = 1, \dots, q$, $l' = 1, \dots, p$. Filters have guaranteed $r$-hops support on both graphs. $O(1)$ parameters per layer. $O(nm)$ computational complexity (assuming sparse graphs). [25] Monti, Bresson, Bronstein 2017
- 93. Factorized multi-graph ConvNets. Two spectral convolutional layers applied to each of the factors $W, H$: $u_l = \xi\left( \sum_{l'=1}^{p} \sum_{j=0}^{r} \theta_{r,jll'}\, T_j(\tilde{\Delta}_r)\, w_{l'} \right)$ and $v_l = \xi\left( \sum_{l'=1}^{p} \sum_{j'=0}^{r} \theta_{c,j'll'}\, T_{j'}(\tilde{\Delta}_c)\, h_{l'} \right)$, $l = 1, \dots, q$, $l' = 1, \dots, p$. Filters have guaranteed $r$-hops support on both graphs. $O(1)$ parameters per layer. $O(n + m)$ computational complexity (assuming sparse graphs). May lose some coupling information.
- 94. Matrix completion with Recurrent Multi-Graph ConvNets[25]. Extract compositional features from graphs of users and items. Complexity: $O(nm)$. [25] Monti, Bresson, Bronstein 2017
- 95. Factorized version. Complexity: $O(n + m)$. [25] Monti, Bresson, Bronstein 2017
- 96. Incremental updates with RNNs. The RNN/LSTM[26] casts the completion problem as a non-linear diffusion process: the RNN maps $X(t)$ to an update $dX(t)$, and $X(t + 1) = X(t) + dX(t)$. Replacing multi-layer CNNs by a simple RNN + diffusion gives far fewer parameters to learn (reduces overfitting), is more favorable for highly sparse entries (better information diffusion), fixes the number of model parameters independently of entry sparsity, and allows learning the temporal dynamics of entries if such information is available. [26] Hochreiter, Schmidhuber 1997
- 97. Matrix completion methods comparison: synthetic. [Figure.]
- 98. Matrix completion methods comparison: real. [Figure.]
- 99. Outline. Conclusion.
- 100. Related works. Classes of graph ConvNets: (1) spectral approach, (2) spatial approach. Spatial approaches: Geodesic CNN: Masci, Boscaini, Bronstein, Vandergheynst 2015; Anisotropic CNN: Boscaini, Masci, Rodola, Bronstein 2016; MoNet: Monti, Boscaini, Masci, Rodola, Svoboda, Bronstein 2017. Graph Neural Networks: Gori, Monfardini, Scarselli 2005; Li, Tarlow, Brockschmidt, Zemel 2015; Sukhbaatar, Szlam, Fergus 2016. Several other works.
- 101. Conclusion. Contributions: generalization of ConvNets to data on (fixed) graphs; exact/tight localized filters on graphs; linear complexity for sparse graphs; GPU implementation; rich expressivity; multiple graphs. Several potential applications in network sciences.
- 102. Papers & code. Papers: Convolutional neural networks on graphs with fast localized spectral filtering, M Defferrard, X Bresson, P Vandergheynst, NIPS, 2016, arXiv:1606.09375. Geometric matrix completion with recurrent multi-graph neural networks, F Monti, MM Bronstein, X Bresson, NIPS, 2017, arXiv:1704.06803. CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters, R Levie, F Monti, X Bresson, MM Bronstein, 2017, arXiv:1705.07664. Code: https://github.com/xbresson/graph_convnets_pytorch
- 103. Tutorials. Computer Vision and Pattern Recognition (CVPR): Geometric deep learning on graphs and manifolds, M. Bronstein, J. Bruna, A. Szlam, X. Bresson, Y. LeCun, July 21, 2017. Slides: https://www.dropbox.com/s/appafg1fumb6u7f/tutorial_CVPR17_DL_graphs.pdf?dl=0 Video will be posted on YouTube/ComputerVisionFoundation Videos. Neural Information Processing Systems (NIPS): Geometric deep learning on graphs and manifolds, M. Bronstein, J. Bruna, A. Szlam, X. Bresson, Y. LeCun, Dec 4, 2017.
- 104. Workshop. New Deep Learning Techniques, Feb 5-9, 2018. Organizers: Xavier Bresson, NTU; Michael Bronstein, USI/Intel; Joan Bruna, NYU/Berkeley; Yann LeCun, NYU/Facebook; Stanley Osher, UCLA; Arthur Szlam, Facebook. http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques
- 105. Thank you. http://www.ntu.edu.sg/home/xbresson https://github.com/xbresson https://twitter.com/xbresson
