
# Representation learning in limited-data settings

A 4-hour course given at the Deep Learning 2019 summer school.

The topic is how to learn representations for machine learning when the amount of data is limited, for instance when the number of samples is not large compared to the dimensionality of the problem, or when there is a lot of noise, which makes learning difficult. This course bridges deep learning to more classic "shallow" learning techniques that work well in limited-data settings, with some theory and some practical recommendations.

1. Representations for machine learning: some learning theory results, some reflections on representations, and some simple models that extract representations.

2. Matrix factorizations: covering the wide spectrum from PCA to word2vec via dictionary learning and metric learning

3. Fisher kernels: building representations from likelihood models (slightly more academic)

Published in: Technology

### Representation learning in limited-data settings

1. 1. Representation learning in limited-data settings. Gaël Varoquaux
2. 2. Limited-data settings. n is to be compared to: a measure of the signal-to-noise ratio; the dimensionality of the data, p. Deep learning does not work well in small-sample regimes, but we can borrow ideas. This talk: no silver bullet, many simple (shallow) tricks. G Varoquaux 1
3. 3. Small-n problems are important. 83% of data scientists never have n > 1M (source: www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets). n is often small for applications such as medicine. Bigger is better (how to not use this talk): get more data (pool related datasets); find a related problem and try transfer. This talk: data that differs from common sources.
4. 4. Perils of deep learning with small n. Selecting architecture, learning rate... A deep architecture is validated by its measured accuracy ⇒ overfitting the validation & test set. Sampling noise for $n_{\text{test}} = 1000$: [figure: binomial distribution of the error on test accuracy, x-axis from -10% to +10%, with markers at -2% and +2%]. Optimizing test accuracy will explore the tails (cf. online challenges). Need for guiding principles. [Varoquaux 2018]
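The sampling noise on a measured test accuracy can be sketched with a binomial model. The snippet below is a minimal illustration, assuming a hypothetical true accuracy of 0.9 and 100 candidate models (the slide only fixes n_test = 1000); it computes the 95% interval of the accuracy error and shows how keeping the best of many equally good models explores the upper tail.

```python
import numpy as np
from scipy import stats

n_test = 1000
true_accuracy = 0.9  # hypothetical value, not taken from the slides

# 95% interval of the number of correct predictions under a binomial model
low, high = stats.binom.interval(0.95, n_test, true_accuracy)
low_err = low / n_test - true_accuracy
high_err = high / n_test - true_accuracy
print(f"95% interval of the accuracy error: {low_err:+.1%} to {high_err:+.1%}")

# Selecting the best of many equally good models explores the upper tail
rng = np.random.default_rng(0)
measured = rng.binomial(n_test, true_accuracy, size=100) / n_test
best_measured = measured.max()
print(f"best measured accuracy among 100 models: {best_measured:.3f}")
```

The selected model looks better than it is, although all candidates share the same true accuracy by construction.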
5. 5. Outline. 1 Representations for machine learning: non-asymptotic supervised learning; learning with representations; supervised learning of representations. 2 Matrix factorization and its variants: for signals; for discrete objects. 3 Fisher kernels: kernels feature maps; from likelihoods to kernels.
6. 6. 1 Representations for machine learning Defining the notion of representations Their use for supervised learning
7. 7. 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations
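8. 8. Settings: supervised learning. Given n pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ drawn i.i.d., find a function $f : \mathcal{X} \to \mathcal{Y}$ such that $f(x) \approx y$. Notation: $\hat{y} \overset{\text{def}}{=} f(x)$. Empirical risk minimization: loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$; estimation of f: $f^\star = \arg\min_{f \in \mathcal{F}} \mathbb{E}[l(\hat{y}, y)]$. This course: how to choose good function classes $\mathcal{F}$.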
14. 14. Example: finite-sample estimation of f. Data generated with a 9th-order polynomial + noise. Fit polynomials of various degrees [figure: fits of degree 1, 2, 5 and 9, and the truth]. Model too simple: underfit. Model too complex: overfit.
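The underfit / overfit behaviour of this example can be reproduced in a few lines. This is a sketch under assumptions: the ground-truth coefficients, the noise level (0.3) and the training size (15) are hypothetical choices, since the slides do not give them.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
truth = Polynomial(rng.normal(size=10))  # hypothetical 9th-order polynomial

x_train = rng.uniform(-1, 1, size=15)
x_test = rng.uniform(-1, 1, size=1000)
noise = 0.3  # hypothetical noise level
y_train = truth(x_train) + noise * rng.normal(size=x_train.size)
y_test = truth(x_test) + noise * rng.normal(size=x_test.size)

train_error, test_error = {}, {}
for degree in (1, 2, 5, 9):
    # least-squares fit of a polynomial of the given degree
    fit = Polynomial.fit(x_train, y_train, degree)
    train_error[degree] = np.mean((fit(x_train) - y_train) ** 2)
    test_error[degree] = np.mean((fit(x_test) - y_test) ** 2)
    print(degree, train_error[degree], test_error[degree])
```

The training error can only decrease with the degree, because the function classes are nested; the test error is the quantity that reveals overfit.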
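15. 15. Theory: the generalization error. Generalization error of a prediction function f, notation: $\mathcal{E}(f) \overset{\text{def}}{=} \mathbb{E}[l(y, f(x))]$. Finite-sample regime. Ideally: $f^\star = \arg\min_{f \in \mathcal{F}} \mathbb{E}[l(f(x), y)]$. In practice: $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} l(f(x_i), y_i)$. Hence $\mathcal{E}(\hat{f}) \ge \mathcal{E}(f^\star)$, with $\hat{f} \ne f^\star$.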
19. 19. Theory: decomposing the generalization error. Assuming $y = g(x) + e$, with e random and $\mathbb{E}[e] = 0$, the generalization error of $\hat{f}$ is: $\mathcal{E}(\hat{f}) = \mathbb{E}[l(g(x) + e, \hat{f}(x))] = \mathcal{E}(g) + [\mathcal{E}(f^\star) - \mathcal{E}(g)] + [\mathcal{E}(\hat{f}) - \mathcal{E}(f^\star)]$. Bayes rate, $\mathcal{E}(g) = \mathbb{E}[l(g(x) + e, g(x))]$: the best possible prediction; due to the noise e; cannot be avoided. Approximation error, $\mathcal{E}(f^\star) - \mathcal{E}(g)$: $g \notin \mathcal{F}$, our model is wrong; decreases for larger $\mathcal{F}$; empirical upper bound: train error. Estimation error, $\mathcal{E}(\hat{f}) - \mathcal{E}(f^\star)$: sampling noise on train data, $\hat{f} \ne f^\star$; a finite-sample problem; decreases as n grows; increases for larger $\mathcal{F}$; guesstimate: difference between train and test error.
21. 21. Example: polynomial regression degree, with $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_i l(f(x_i), y_i)$. Degree 9, small n: no approximation error, large estimation error; function class $\mathcal{F}$ not restrictive enough. Degree 1, large n: small estimation error, large approximation error; function class $\mathcal{F}$ too restrictive. [Figure: $\hat{f}$, $f^\star$ and g in each case.]
24. 24. Gauging overfit vs underfit: learning curves (`sklearn.model_selection.learning_curve`). [Figure: generalization error and training error as a function of the number of samples (100 to 1000), for polynomials of degree 9 and degree 1; annotations: overfit region; underfit, or Bayes rate?] Estimation error ∼ gap between train and test error. Simpler models reach the asymptotic regime faster (smaller “sample complexity”), but can underfit.
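A learning curve of this kind can be computed with `sklearn.model_selection.learning_curve`. The sketch below assumes a hypothetical 1D regression problem (a sine plus noise) and compares polynomial models of degree 1 and 9; the data, training sizes and scoring choice are illustrative, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=1000)  # hypothetical data

curves = {}
for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=[20, 50, 100, 400, 800], cv=5,
        scoring="neg_mean_squared_error")
    # scores are negated MSE: flip the sign and average over folds
    curves[degree] = (sizes, -train_scores.mean(axis=1),
                      -test_scores.mean(axis=1))

for degree, (sizes, train_err, test_err) in curves.items():
    print(degree, np.round(train_err, 3), np.round(test_err, 3))
```

The gap between the two error curves gives a rough reading of the estimation error at each sample size.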
25. 25. Gauging overfit vs underfit: validation curves (`sklearn.model_selection.validation_curve`). [Figure: generalization error and training error as a function of the polynomial degree (5 to 15).] Reveals underfits.
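A validation curve varies a model hyperparameter instead of the sample size. The following sketch uses `sklearn.model_selection.validation_curve` on hypothetical data (a noisy sine, 60 samples), with the polynomial degree as the varied parameter; the parameter name follows the naming convention of `make_pipeline`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=60)  # hypothetical data

model = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = [1, 2, 5, 9, 15]
train_scores, test_scores = validation_curve(
    model, X, y, param_name="polynomialfeatures__degree",
    param_range=degrees, cv=5, scoring="neg_mean_squared_error")

train_error = -train_scores.mean(axis=1)
test_error = -test_scores.mean(axis=1)
for degree, tr, te in zip(degrees, train_error, test_error):
    print(degree, round(tr, 3), round(te, 3))
```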
26. 26. Linear models for limited-data settings. In high-dimensional limited-data settings, linear models are often the best choice. For p-dimensional data, $x \in \mathbb{R}^p$, they have p parameters. Example with n ∼ 200 000, inpatient mortality, AUROC (95% CI): deep learning, 0.95 (0.94-0.96) at Hospital A and 0.93 (0.92-0.94) at Hospital B; baseline (logistic regression), 0.93 (0.92-0.95) at Hospital A and 0.91 (0.89-0.92) at Hospital B.
28. 28. Theory: approximating with linear predictors. Linear predictor (predictor, not model: we do not assume it is a data-generating process): $\hat{y} = x^T w$, $w \in \mathbb{R}^p$. Data model: $y = x^T w^\star + \delta(x) + e$, $\mathbb{E}[e] = 0$, where $x^T w^\star$ is the best linear predictor. Ridge estimator: $\hat{w} = \arg\min_w \|y_{\text{train}} - X_{\text{train}} w\|_{\text{Fro}}^2 + \lambda \|w\|_2^2$. Error compared to the best linear predictor: $\mathbb{E}[\|y - x^T \hat{w}\|_2^2] = \mathbb{E}[\|y - x^T w^\star\|_2^2] + o(\sigma^2 p / n_{\text{train}})$ [Hsu... 2014, sec 2.5]. First term, approximation error: data not linearly generated ⇒ craft more features. Second term, estimation error: curse of dimensionality ⇒ limit the number of features. Random design analysis can characterize the generalization error without assuming a correct data-generating model (misspecified model) [Hsu... 2014, Rosset and Tibshirani 2018].
29. 29. Example: extrapolating sea level (tides). Predict sea level as a function of time. Test outside of the observed range (technically, this is not in our theory: the test set is not distributed as the train set).
36. 36. Example: extrapolating sea level (tides). [Figure: fits with polynomial regression covariates and with a sines-and-cosines basis, each for dim=10, dim=100 and dim=1000.] Choice of covariates / basis / signal representation ⇒ huge difference on approximation error ⇒ huge difference on generalization error.
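The effect of the basis on extrapolation can be sketched on synthetic data. Everything below is an assumption made for illustration: the tide-like signal (two periodic components), the candidate periods of the Fourier basis, and the polynomial degree; the slides use real sea-level data and larger bases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sea_level(t):
    # hypothetical tide-like signal with two periods (in hours)
    return np.sin(2 * np.pi * t / 12.42) + 0.5 * np.cos(2 * np.pi * t / 24.0)

t_train = np.sort(rng.uniform(0, 100, size=200))
t_test = np.linspace(100, 150, 200)  # outside of the observed range
y_train = sea_level(t_train) + 0.1 * rng.normal(size=t_train.size)
y_test = sea_level(t_test)

def polynomial_basis(t, degree=10):
    return PolynomialFeatures(degree).fit_transform((t / 100.0)[:, None])

def fourier_basis(t, periods=(6.0, 12.42, 24.0)):
    # sines and cosines at a few candidate periods (assumed known)
    columns = [f(2 * np.pi * t / p) for p in periods for f in (np.sin, np.cos)]
    return np.column_stack(columns)

errors = {}
for name, basis in [("polynomial", polynomial_basis), ("fourier", fourier_basis)]:
    model = LinearRegression().fit(basis(t_train), y_train)
    prediction = model.predict(basis(t_test))
    errors[name] = np.mean((prediction - y_test) ** 2)
print(errors)
```

Both models are linear in their covariates; only the representation of time differs.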
37. 37. Summary. $\hat{y} = f(x)$, with f chosen in $\mathcal{F}$ to minimize the observed error $\sum_{i \in \text{train}} l(f(x_i), y_i)$. Generalization error: approximation error ⇒ $\mathcal{F}$ adapted to the data; estimation error ⇒ $\mathcal{F}$ small. Limited-data settings: linear models are the best option when n is not large compared to p; a good choice of covariates is crucial.
38. 38. 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations
43. 43. Representations to build $\mathcal{F}$. Settings: $z = r(x)$, a representation of the data, $z \in \mathbb{R}^k$; predictor $f : x \mapsto \hat{y} = h_w(r(x))$; function composition: “depth”. Benefits. For expressiveness: composition vs basis expansion. Composing L rectifying functions on intermediate representations of dimension k gives $O\big((k/p)^{p(L-1)} k^p\big)$ linear regions; basis expansion + linear predictor gives $O(k)$: exponential in depth, linear with dimension [Montufar... 2014]. For multi-tasks: sharing representations across tasks, y multidimensional. For limited data: $h_w(z) = w^T z$, a linear predictor; a good choice of z can decrease sample complexity. Transfer: r is learned on large data; a simple h is used.
44. 44. Background: information theory. Entropy = amount of information in x: $H(x) = \mathbb{E}_p[-\log p(x)]$. Equi-probable distribution = high entropy; uneven distribution = low entropy [figure: two histograms of P over x = 0, ..., 5]. Mutual information between x and y: $I(x; y) = H(x) + H(y) - H(x, y)$. $x \perp\!\!\!\perp y$ (independent) ⇔ $I(x; y) = 0$. Indeed, independence ⇔ $p(x, y) = p(x)\,p(y)$, so that $H(x, y) = -\mathbb{E}_{(x,y)}[\log p(x, y)] = -\mathbb{E}_{(x,y)}[\log p(x) + \log p(y)] = -\mathbb{E}_x[\log p(x)] - \mathbb{E}_y[\log p(y)] = H(x) + H(y)$.
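For discrete distributions these quantities are one-liners. The sketch below uses the standard definition $I(x; y) = H(x) + H(y) - H(x, y)$ with natural logarithms; the two example distributions over six values are hypothetical.

```python
import numpy as np

def entropy(p):
    """Entropy (in nats) of a discrete distribution given as an array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def mutual_information(joint):
    """I(x; y) = H(x) + H(y) - H(x, y) for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint))

uniform = np.full(6, 1 / 6)                            # equi-probable
uneven = np.array([0.7, 0.1, 0.1, 0.05, 0.03, 0.02])   # hypothetical
independent = np.outer(uniform, uneven)                # p(x, y) = p(x) p(y)
dependent = np.diag(uniform)                           # y is equal to x

print(entropy(uniform), entropy(uneven))
print(mutual_information(independent), mutual_information(dependent))
```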
45. 45. Theory: information in representations. A representation z of x is sufficient for y if $y \perp\!\!\!\perp x \mid z$, or equivalently if $I(z; y) = I(x; y)$. x, z, y form a Markov chain $x \to z \to y$ if $p(y \mid x, z) = p(y \mid z)$. Data processing inequality: $I(x; y) \le I(x; z)$. A sufficient representation z is minimal when $I(x; z)$ is smallest among sufficient representations. [Achille and Soatto 2018]
46. 46. Nuisances and invariances. A nuisance n: $I(x; n) \ge 0$, but $I(y; n) = 0$. A representation z is invariant to the nuisance n if $z \perp\!\!\!\perp n$, or $I(z; n) = 0$ ⇒ we want $I(z; n)$ low. In a Markov chain $x \to z_1 \to z_2 \to \dots \to z_L \to y$: if z is a sufficient representation for y, $I(z; n) \le I(z; x) - I(x; y)$. Communication bottleneck: $I(z_1; z_2) < I(z_1; x)$ ⇒ $I(z_2; n) \le I(z_1; z_2) - I(x; y)$. Stacking increases invariance. [Achille and Soatto 2018]
47. 47. Invariant representations on a continuous space ($s_t$). Shift-invariant representation = Fourier basis. Fourier transform: $\mathcal{F}(s)_f = \sum_t e^{-i f t} s_t$ (complex i). Shifting the signal: $s_t \to s'_t = s_{t+k}$. Then $\mathcal{F}(s')_f = \sum_t e^{-i f t} s_{t+k} = \sum_t e^{-i f (t-k)} s_t = e^{i k f} \sum_t e^{-i f t} s_t = e^{i k f} \mathcal{F}(s)_f$ → change in phase. An orthonormal basis of shift-invariant vectors.
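The phase-shift property can be checked numerically with the discrete Fourier transform. This sketch assumes a circular shift on a finite random signal (length 64, shift 5, both arbitrary), so that the identity holds exactly; the modulus of the transform is then shift invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 5
s = rng.normal(size=n)
s_shifted = np.roll(s, -k)  # s_shifted[t] = s[(t + k) % n]

F = np.fft.fft(s)
F_shifted = np.fft.fft(s_shifted)

# Angular frequencies of the DFT and the expected phase factor e^{i k f}
freqs = 2 * np.pi * np.arange(n) / n
phase = np.exp(1j * k * freqs)

print(np.allclose(F_shifted, phase * F))           # change in phase only
print(np.allclose(np.abs(F_shifted), np.abs(F)))   # modulus is invariant
```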
48. 48. Invariant representations on a continuous space ($s_t$). Shift invariance = Fourier basis. Local deformations = wavelets: locally equivalent to the Fourier basis, but without the global extent. Decimated wavelets: isometric transform of the signal; higher scales lose shift invariance. Redundant wavelets: increase the dimensionality; good shift invariance.
49. 49. Representations invariant to rich deformations: scaling, rotations, deformations. Ingredients: modulus of wavelet / Fourier transform ⇒ non-linearity & filter banks (convolutions), + stacking (repeating simple invariants). Scattering transform: derived from first principles; building first-order invariants. Convolutional networks: learned from data; pooling across pixels (eg max). [Mallat 2016]
50. 50. Summary. Intermediate representations give expressiveness to predictive models. Good representations keep predictive information and lose nuisance information. Bottleneck and regularization to lose information. Limited-data settings: given known invariants of the problem, reusing existing representations helps, eg headless conv-net, wavelets... [Oyallon... 2017]
51. 51. 1 Representations for machine learning Non-asymptotic supervised learning Learning with representations Supervised learning of representations
52. 52. The need for supervision. Maximizing $I(z; y)$ ($\le I(x; y)$), sufficient representations ⇒ supervised learning; while minimizing $I(z; n)$, nuisance ⇒ sampling nuisance / invariants: data augmentation. Challenge: amount of labeled data. Pretext tasks: other targets y that capture useful information; finding them needs domain knowledge.
53. 53. Deep architectures. $\hat{y} = f^d_{W_d} \circ \dots \circ f^1_{W_1}(x)$. Typically $f^k_{W_k}(x) = g_k(W_k^T x)$, with $g_k$ an element-wise non-linearity. Thus $\hat{y} = g_d(W_d^T \dots g_1(W_1^T x))$. Stacked representations: $W_k$. $\{W_k\}$ optimized to minimize a prediction error.
54. 54. Shallow architectures for limited data. Keep one latent layer. Without non-linearity: $\hat{y} = x^T W_1 W_2$, $y \in \mathbb{R}^k$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$: a factored / reduced-rank linear model. Multi-task / multi-output literature ⇒ structured loss (multiple soft-max’s). Overparametrization is sometimes useful: d > k, can be achieved with dropout. [Bzdok... 2015, Mensch... 2018]
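55. 55. Simple case: square loss = reduced rank regression. $\hat{Y} = X W_1 W_2$, $Y \in \mathbb{R}^{n \times k}$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$; $\hat{W}_1, \hat{W}_2 = \arg\min_{W_1, W_2} \|\hat{Y} - Y_{\text{train}}\|_{\text{Fro}}^2$. For the squared loss, the problem has a closed-form solution (no need for pesky SGDs). Full-rank solution (X and Y on the train set), with $\hat{\Sigma}_X = X^T X$: $\hat{W} = \hat{\Sigma}_X^{-1} X^T Y$, $\hat{Y} = X \hat{W} = X \hat{\Sigma}_X^{-1} X^T Y$. Rank-d solution [Izenman 1975, Rahim... 2017b]: $\hat{R} \overset{\text{def}}{=} Y^T \hat{Y} \in \mathbb{R}^{k \times k}$, truncated SVD → $\hat{R}_d = \hat{U}_d \hat{s}_d \hat{V}_d$, $\hat{U}_d \in \mathbb{R}^{k \times d}$; then $\hat{W}_1 = \hat{\Sigma}_X^{-1} X^T Y \hat{U}_d$ and $\hat{W}_2 = \hat{U}_d^T$, i.e. the full-rank solution followed by a low-rank projector (the projector captures the variance explained on the multiple outputs).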
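The closed-form reduced rank regression can be written directly with NumPy. This is a sketch of the recipe on the slide (full-rank least squares, then projection on the leading singular vectors of $Y^T \hat{Y}$); the function name and the synthetic data are hypothetical.

```python
import numpy as np

def reduced_rank_regression(X, Y, d):
    """Rank-d least-squares fit Y ~ X @ W1 @ W2 (closed form)."""
    W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)  # full-rank solution
    Y_hat = X @ W_full
    R = Y.T @ Y_hat                                  # k x k matrix
    U, s, Vt = np.linalg.svd(R)
    U_d = U[:, :d]                                   # leading d directions
    return W_full @ U_d, U_d.T

rng = np.random.default_rng(0)
n, p, k, d = 100, 8, 5, 2
X = rng.normal(size=(n, p))
W_true = rng.normal(size=(p, d)) @ rng.normal(size=(d, k))  # rank-d truth
Y = X @ W_true + 0.1 * rng.normal(size=(n, k))

W1, W2 = reduced_rank_regression(X, Y, d)
W1_full, W2_full = reduced_rank_regression(X, Y, k)
W_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.linalg.matrix_rank(W1 @ W2), np.linalg.norm(W1 @ W2 - W_true))
```

With d = k the projector is the identity and the ordinary least-squares solution is recovered.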
57. 57. Model stacking: $x \xrightarrow{f_1} z \xrightarrow{f_2} y$. Learn $f_1$ separately. Directly supervising z: $z = \hat{y}$ for a (simple) predictive model. Trick: “cross-fit” during training, obtain $\hat{y}$ by splitting the training data (in sklearn: `cross_val_predict`) [figure: train set / test set splits of the full data]. Application: tackling dimensionality [Rahim... 2017a]. Some features are a high-dimensional signal, eg medical images. $f_1$: linear, to reduce the signal features; $f_2$: non-linear (eg trees) on all features.
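The cross-fit trick can be sketched with `cross_val_predict`. The data below are synthetic and hypothetical: a block of 50 features stands in for the high-dimensional signal, and one extra feature acts non-linearly on the target. Ridge plays the role of $f_1$ and a random forest the role of $f_2$; these model choices are assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_train, n_test, p_signal = 150, 100, 50
n = n_train + n_test
X_signal = rng.normal(size=(n, p_signal))   # stands in for a signal block
x_other = rng.uniform(-2, 2, size=(n, 1))   # a low-dimensional feature
w = rng.normal(size=p_signal) / np.sqrt(p_signal)
y = X_signal @ w + np.sin(2 * x_other[:, 0]) + 0.1 * rng.normal(size=n)
train, test = slice(0, n_train), slice(n_train, n)

# f1: linear model on the signal block; z is its cross-fitted prediction
f1 = Ridge(alpha=1.0)
z_train = cross_val_predict(f1, X_signal[train], y[train], cv=5)
z_test = f1.fit(X_signal[train], y[train]).predict(X_signal[test])

# f2: non-linear model on the reduced signal plus the other features
f2 = RandomForestRegressor(n_estimators=50, random_state=0)
f2.fit(np.column_stack([z_train, x_other[train]]), y[train])
y_pred = f2.predict(np.column_stack([z_test, x_other[test]]))
score = r2_score(y[test], y_pred)
print(score)
```

Cross-fitting matters because feeding $f_2$ with in-sample predictions of $f_1$ would make z look more reliable on the train set than it is on new data.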
58. 58. Model stacking to encode discrete items. Predict salary from (sex, date hired, employee position): (M, 09/12/1988, Master Police Officer) → 69222.18; (F, 06/26/2006, Social Worker III) → 97392.47; (M, 07/16/2007, Police Officer III) → 104717.28. Difficulty: number of different positions; what invariants? [Figure: employee salary y (40000 to 140000) per position: Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II.] Target encoding [Micci-Barreca 2001], to inject categories in $\mathbb{R}$ before a second level that combines all columns: position → $\mathbb{E}_{\text{train}}[\text{salary} \mid \text{position}]$.
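Target encoding amounts to a group-by on the training set. The sketch below uses a tiny hypothetical table (the salaries are made up for the arithmetic to be easy to follow) and falls back to the global training mean for positions never seen in training, which is one common choice among several.

```python
import pandas as pd

# Hypothetical training data, not the figures from the slide
train = pd.DataFrame({
    "position": ["Police Officer III", "Police Officer III",
                 "Social Worker III", "Library Aide"],
    "salary": [100000.0, 110000.0, 90000.0, 40000.0],
})
test = pd.DataFrame({"position": ["Police Officer III", "Crossing Guard"]})

# position -> mean salary on the train set
position_means = train.groupby("position")["salary"].mean()
global_mean = train["salary"].mean()

# unseen categories get the global train mean
encoded = test["position"].map(position_means).fillna(global_mean)
print(encoded.tolist())
```

In a stacking pipeline, the encoding of the training rows themselves would be cross-fitted so that a row's own salary does not leak into its encoded value.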
59. 59. Summary. Supervision helps selecting the relevant part of the signal. In limited-sample settings, simple models can create representations: simple latent-factor models; multi-output models; stacking: fit a first-level model.
60. 60. Summary of first section. For generalization: a small family of functions $f_w$ that approximate the signal well. Generalization of a linear predictor: approximation error + $o(p / n_{\text{train}})$. Predictors by composition: $\hat{y} = f_2(z)$, $z = f_1(x)$, $x \xrightarrow{f_1} z \xrightarrow{f_2} y$; ideally, $f_1$ makes z invariant to nuisances. Reuse representations with the right invariances: wavelets, fasttext, pretrained headless neural nets. Simple supervised models can create representations: stacking, multi-output, pretext tasks.
61. 61. 2 Matrix factorization and its variants Simple unsupervised representation learning More unlabeled data than labeled data Learn representations and transfer them Here: Focus on simple models for limited n or low SNR settings Particularly interesting regime: p large and n large.
62. 62. 2 Matrix factorization and its variants For signals For discrete objects
64. 64. Principal Component Analysis: find the directions of largest variance. Computation: $X \in \mathbb{R}^{n \times p}$, $\Sigma_X = X^T X \in \mathbb{R}^{p \times p}$. PCA projector $P_{\text{PCA}} \in \mathbb{R}^{p \times k}$: $\text{SVD}_k(X)$ or $\text{EVD}_k(\Sigma_X)$. Reduced X: $X P_{\text{PCA}} \in \mathbb{R}^{n \times k}$. Model: low-rank Gaussian latent factors, $X \approx U V + E$, $E \sim \mathcal{N}(0, I_p)$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$; $\hat{U}, \hat{V} = \arg\min_{U, V} \|X - U V\|_{\text{Fro}}^2$. Rotationally invariant: $U' = U O$, $V' = O^T V$ is also a solution for any O s.t. $O^T O = I$.
65. 65. Principal Component Analysis: find the directions of largest variance. In a learning pipeline: useful for dimensionality reduction (eg p is large); eases statistics and computations. Generalization error of PCA + OLS within a factor of 4 of ridge [Dhillon... 2013].
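The PCA projector can be obtained from a truncated SVD and compared with scikit-learn's `PCA`. The sketch below centers the data first, as scikit-learn does (the slide writes $\Sigma_X = X^T X$, which implicitly assumes centered data); the random data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, k = 100, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated columns

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
P_pca = Vt[:k].T                 # projector, shape (p, k)
X_reduced = X_centered @ P_pca   # reduced data, shape (n, k)

X_reduced_sklearn = PCA(n_components=k).fit_transform(X)

# Components are defined up to a sign, hence the absolute values
same = np.allclose(np.abs(X_reduced), np.abs(X_reduced_sklearn), atol=1e-6)
print(same)
```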
68. 68. Beyond variance: Independent Component Analysis. Separate out signals U observed mixed (classic ICA has no noise model: it does not do dimension reduction) [figure: true sources (signals U); observations (mixed signal); ICA-recovered signals]. Model: $X = U V$, $V \in \mathbb{R}^{p \times p}$, $V^T V = I_p$. If the sources U are Gaussian, the model is not identifiable. Seek low mutual information across $\{u_j\}$ ⇒ maximally non-Gaussian marginals [Cardoso 2003]. Computation: FastICA [Hyvärinen and Oja 2000]: power iterations on V; each time, apply a smooth increasing non-linearity on $\{u_j\}$, then decorrelate. Preprocessing: whiten the data, eg with PCA.
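Source separation with FastICA can be tried with scikit-learn's `FastICA`. The two sources (a sine and a square wave), the mixing matrix and the noise level below are hypothetical; recovered components come back in arbitrary order, sign and scale, so the check relies on absolute correlations.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 100, 5000)
# Two non-Gaussian sources: a sine and a square wave (hypothetical)
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
sources += 0.05 * rng.normal(size=sources.shape)

mixing = np.array([[1.0, 0.5], [0.5, 1.0]])   # hypothetical mixing matrix
observed = sources @ mixing.T                 # mixed signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)

# Absolute correlation between each true source and each recovered one
corr = np.abs(np.corrcoef(sources.T, recovered.T)[:2, 2:])
print(np.round(corr, 2))
```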
69. 69. ICA to learn representations. Across patches of natural images: Gabor-like filters, similar to wavelets and the first layer of convnets. [Hyvärinen and Oja 2000]
70. 70. Dictionary learning. Find vectors V that represent the signal well with sparse combinations U. Model: $X = U V$ s.t. U is sparse, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$; k can be > p (overcomplete dictionary). Estimation: $\hat{U}, \hat{V} = \arg\min_{U, V \text{ s.t. } \|v_i\|_2^2 \le 1} \|X - U V\|_{\text{Fro}}^2 + \lambda \|U\|_1$. Combining the squared loss and the $\ell_1$ penalty creates sparsity. The constraint on $\|v_i\|_2^2$ is required to avoid cancelling out the penalty with $V \to \infty$ and $U \to 0$.
71. 71. Dictionary learning. Find vectors V that represent the signal well with sparse combinations U. Model: $X = U V$ s.t. U is sparse, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$; k can be > p (overcomplete dictionary). Estimation: $\hat{U}, \hat{V} = \arg\min_{U, V \text{ s.t. } V \in C} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$. The constraint set and the penalty can be varied (fast when C and Ω lead to simple projections and penalized regressions). Typically: $\ell_2$, $\ell_1$, and positivity on U or V (positivity recovers a form of NMF, non-negative matrix factorization).
72. 72. Sparse dictionary learning to learn representations. Across patches of natural images: also learns Gabor-like filters (as do ICA, K-Means, etc. on image patches). Good for sparse models, eg for denoising. [Mairal... 2014]
73. 73. Large n large p: brain imaging. Brain activity at rest: 1000 subjects with ∼ 100–10 000 samples; images of dimensionality > 100 000. Dense matrix, large both ways. [Figure: X (voxels × time) = U · V + E.]
74. 74. Large n large p: recommender systems. Product ratings: millions of entries; hundreds of thousands of products and users. Large sparse matrix. [Figure: sparse ratings matrix X (users × products) = U · V + E.]
75. 75. Online estimation: stochastic optimization. $\min_w \sum_i l(x_i^T w)$; with many samples: $\min_w \mathbb{E}[l(y, x^T w)]$. Gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \nabla_w l$. Stochastic gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \mathbb{E}[\nabla_w l]$, using a cheap estimate of $\mathbb{E}[\nabla_w l]$ (e.g. subsampling). $\alpha_t$ must decrease “suitably” with t. Those pesky learning rates.
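A minimal stochastic gradient descent for least squares illustrates the update and the role of the step size. The data, the squared loss and the step-size schedule $\alpha_t = 0.05 / (1 + t/1000)$ are hypothetical choices made for this sketch; each step uses the gradient on a single sample as the cheap estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(p)
t = 0
for epoch in range(5):
    for i in rng.permutation(n):
        alpha = 0.05 / (1 + t / 1000)     # decreasing step (hypothetical)
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 (x_i^T w - y_i)^2
        w -= alpha * grad                 # descent step: note the minus sign
        t += 1

error = np.linalg.norm(w - w_true)
print(error)
```

A step size that is too large makes the iterates diverge, and one that decreases too fast stalls them, which is the tuning burden the slide alludes to.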
79. 79. Online estimation for matrix factorization. Large matrices = terabytes of data: $\arg\min_{U, V} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$. Rewrite as an expectation: $\arg\min_V \sum_i \min_u \|X_i - V u\|_{\text{Fro}}^2 + \lambda \Omega(u)$, i.e. $\arg\min_V \mathbb{E}[f(V)]$ ⇒ optimize on approximations (sub-samples). [Figure: data access, code computation and dictionary update on the data matrix, for three schemes. Alternating minimization: the whole data matrix. Online matrix factorization: stream columns (seen at t, seen at t+1, unseen at t) [Mairal... 2010]. Subsampled & online: stream columns and subsample rows [Mensch... 2017].]
83. 83. Online matrix factorization algorithm [Mairal... 2010]. Stream samples $x_t$: 1. Compute the code $u_t = \arg\min_{u \in \mathbb{R}^k} \|x_t - V_{t-1} u\|_2^2 + \lambda \Omega(u)$. 2. Update the surrogate function $g_t(V) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - V u_i\|_2^2$, which equals, up to a factor 2 and an additive constant, $\operatorname{tr}\big(\tfrac{1}{2} V^T V A_t - V^T B_t\big)$, with $A_t \overset{\text{def}}{=} (1 - \tfrac{1}{t}) A_{t-1} + \tfrac{1}{t} u_t u_t^T$ and $B_t \overset{\text{def}}{=} (1 - \tfrac{1}{t}) B_{t-1} + \tfrac{1}{t} x_t u_t^T$. $A_t$ and $B_t$ are sufficient statistics of the loss accumulated over the data. $g_t(V)$ is a surrogate of $\sum_x l(x, V)$: $u_i$ is used, and not $u^\star$. 3. Minimize the surrogate: $V_t = \arg\min_{V \in C} g_t(V)$, with $\nabla g_t = V A_t - B_t$.
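The three steps can be sketched in NumPy. To keep the code short, this illustration swaps the sparse penalty for a ridge ($\ell_2$) penalty on the codes, so that step 1 has a closed form; it is therefore not the sparse algorithm of [Mairal... 2010], only the same online scheme. V is stored as a p × k matrix ($x \approx V u$), the constraint set is the unit ball for each column, and the synthetic low-rank data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, lam = 500, 20, 3, 0.1

V_true = rng.normal(size=(p, k))
V_true /= np.linalg.norm(V_true, axis=0)
X = rng.normal(size=(n, k)) @ V_true.T + 0.01 * rng.normal(size=(n, p))

def codes(X, V):
    # ridge-penalized codes (assumption: l2 instead of a sparse penalty)
    return np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ X.T).T

def reconstruction_error(X, V):
    return np.mean((X - codes(X, V) @ V.T) ** 2)

V = rng.normal(size=(p, k))
V /= np.linalg.norm(V, axis=0)
error_init = reconstruction_error(X, V)

A = np.zeros((k, k))
B = np.zeros((p, k))
for t, x in enumerate(X, start=1):
    u = codes(x[None, :], V)[0]                  # 1. compute the code
    A = (1 - 1 / t) * A + np.outer(u, u) / t     # 2. sufficient statistics
    B = (1 - 1 / t) * B + np.outer(x, u) / t
    for j in range(k):                           # 3. minimize the surrogate
        V[:, j] += (B[:, j] - V @ A[:, j]) / (A[j, j] + 1e-12)
        V[:, j] /= max(np.linalg.norm(V[:, j]), 1.0)  # project on unit ball

error_final = reconstruction_error(X, V)
print(error_init, error_final)
```

Each sample is visited once and no learning rate appears: the update of V is a block coordinate descent on the surrogate defined by A and B.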
84. 84. Stochastic Majorization-Minimization [Mairal 2013]. $V = \arg\min_{V \in C} \sum_x l(x, V)$, where $l(x, V) = \min_u f(x, V, u)$. Algorithm: $g_t(V)$ is a majorant of $\sum_x l(x, V)$, since $u_i$ is used and not $u^\star$ ⇒ majorization-minimization scheme (SOMF uses an approximate majorant and minimization [Mensch... 2017]). Surrogate computation. SMM: full minimization; 2nd order information; no learning rate.
85. 85. Experimental convergence: large images. [Figure: test objective value as a function of time, for OMF (r = 1), SOMF (r = 4, 6, 8, 12, 24) and best step-size SGD, on four problems: ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB; Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB.] SOMF = Subsampled Online Matrix Factorization.
86. 86. Experimental convergence: recommender system. SOMF = Subsampled Online Matrix Factorization.
87. 87. Summary. Versatile matrix-factorization formulation (a 1-layer linear autoencoder): $\arg\min_{U \in \mathbb{R}^{n \times k}, V \in C} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$. Estimation: stochastic majorization-minimization ⇒ an online alternated optimization (common-case algorithm readily usable in scikit-learn: `MiniBatchDictionaryLearning`). Example use of learned representations: biomarkers of autism on brain images, p ∼ 100 000, n ∼ 1 000 [Abraham... 2017].
88. 88. 2 Matrix factorization and its variants For signals For discrete objects
90. 90. Gamma-Poisson for factorizing counts [Canny 2004]. When X is a matrix of counts: recommender systems [Gopalan... 2014]; database string entries [Cerda and Varoquaux 2019] ⇒ Poisson loss, instead of squared loss: $p(x_j \mid u, V) = \text{Poisson}\big((u V)_j\big) = \frac{1}{x_j!} (u V)_j^{x_j} e^{-(u V)_j}$. u are loadings, modeled as random with a Gamma prior (because it is the conjugate prior of the Poisson, it imposes soft sparsity, and it lifts the rotational invariance): $p(u_i) = \frac{u_i^{\alpha_i - 1} e^{-u_i / \beta_i}}{\beta_i^{\alpha_i} \Gamma(\alpha_i)}$. Maximum a posteriori estimation: $\hat{U}, \hat{V} = \arg\min_{U, V} -\big(\sum_j \log p(x_j \mid u, V) + \sum_i \log p(u_i)\big)$.
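91. 91. Gamma-Poisson estimation. Full log-likelihood expression: $\log L = \sum_{j=1}^{p} \big[x_j \log((u V)_j) - (u V)_j - \log(x_j!)\big] + \sum_{i=1}^{k} \big[(\alpha_i - 1) \log(u_i) - \frac{u_i}{\beta_i} - \alpha_i \log \beta_i - \log \Gamma(\alpha_i)\big]$. Gradients: $\frac{\partial}{\partial V_{ij}} \log L = \frac{x_j}{(u V)_j} u_i - u_i$; $\frac{\partial}{\partial u_i} \log L = \sum_{j=1}^{p} \big[\frac{x_j}{(u V)_j} V_{ij} - V_{ij}\big] + \frac{\alpha_i - 1}{u_i} - \frac{1}{\beta_i}$.
92. 92. Gamma-Poisson estimation. The gradients $\frac{\partial}{\partial V_{ij}} \log L = \frac{x_j}{(u V)_j} u_i - u_i$ and $\frac{\partial}{\partial u_i} \log L = \sum_{j=1}^{p} \big[\frac{x_j}{(u V)_j} V_{ij} - V_{ij}\big] + \frac{\alpha_i - 1}{u_i} - \frac{1}{\beta_i}$ are equivalent to some NMF formulation: multiplicative updates (efficient implementation with sparse matrices: the summations can be done only on the non-zero entries of X). $V_{ij} \leftarrow V_{ij} \Big(\sum_{\ell=1}^{n} \frac{x_{\ell j}}{(U V)_{\ell j}} u_{\ell i}\Big) \Big(\sum_{\ell=1}^{n} u_{\ell i}\Big)^{-1}$; $u_{\ell i} \leftarrow u_{\ell i} \Big(\sum_{j=1}^{p} \frac{x_{\ell j}}{(U V)_{\ell j}} V_{ij} + \frac{\alpha_i - 1}{u_{\ell i}}\Big) \Big(\sum_{j=1}^{p} V_{ij} + \beta_i^{-1}\Big)^{-1}$.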
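93. 93. Adapt the majorization-minimization algorithm (".*" and "./" denote element-wise product and division). While $\|V^{(t)} - V^{(t-1)}\|_F > \eta$ do: draw $x_t$ from the training set; while $\|u_t - u_t^{\text{old}}\|_2 > \epsilon$ do: $u_t \leftarrow \big[u_t \mathbin{.*} \big(\big(\tfrac{x_t}{u_t V^{(t)}}\big) V^{(t)T} + \tfrac{a - 1}{u_t}\big)\big] \mathbin{./} \big[\mathbf{1} V^{(t)T} + b^{-1}\big]$; then $\tilde{A}_t \leftarrow V^{(t)} \mathbin{.*} \big(u_t^T \tfrac{x_t}{u_t V^{(t)}}\big)$; $\tilde{B}_t \leftarrow u_t^T \mathbf{1}$; $A^{(t)} \leftarrow \rho A^{(t-1)} + \tilde{A}_t$; $B^{(t)} \leftarrow \rho B^{(t-1)} + \tilde{B}_t$; $V^{(t)} \leftarrow A^{(t)} \mathbin{./} B^{(t)}$; $t \leftarrow t + 1$. [Lefevre... 2011, Cerda and Varoquaux 2019]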
94. 94. Application: sub-string representation. Problem: representing non-normalized categories. Drug Name: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol. Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III. [Cerda and Varoquaux 2019]
96. 96. Application: sub-string representation. Gamma-Poisson factorization on sub-string counts: models strings as a linear combination of substrings. [Figure: binary matrix of 3-gram occurrences, with entries (police officer, pol off, polis, policeman, policier) and 3-grams (pol, lic, ice, ce_, _of, off, fic, cer, er_), factorized into an entries × latent-categories matrix and a latent-categories × 3-grams matrix.] The factors give: what substrings are in a latent category; what latent categories are in an entry. [Cerda and Varoquaux 2019]
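A rough stand-in for this factorization can be assembled from scikit-learn parts: character 3-gram counts with `CountVectorizer`, then `NMF` with a Kullback-Leibler loss, which corresponds to a Poisson likelihood without the Gamma prior. It is therefore not the Gamma-Poisson estimator of [Cerda and Varoquaux 2019], only a related sketch. The last two entries of the list are hypothetical additions to the strings shown on the slide.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = ["police officer", "pol off", "polis", "policeman", "policier",
           "social worker", "soc worker"]  # last two are hypothetical

# entries x 3-grams count matrix
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# KL loss with multiplicative updates (Poisson likelihood, no prior)
nmf = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
          init="random", random_state=0, max_iter=500)
U = nmf.fit_transform(counts)   # what latent categories are in an entry
V = nmf.components_             # what substrings are in a latent category

print(U.round(2))
```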
98. 98. Application: sub-string representation. Representations that extract latent categories; inferring plausible feature names. [Figure: activations of the latent categories for the entries Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant. Inferred feature names, truncated in the figure: “ntant, assistant, library”; “ator, equipment, operator”; “dministration, specialist”; “, craftsworker, warehouse”; “rossing, program, manager”; “cian, mechanic, community”; “efighter, rescuer, rescue”; “onal, correction, officer”.] [Cerda and Varoquaux 2019]
99. 99. Natural language processing: topic-modeling history. Topic modeling: embedding documents (typically for information retrieval purposes, aka search engines). [Figure: documents × terms count matrix (terms: the, Python, performance, profiling, module, is, code, can, a) factorized into a documents × topics matrix and a topics × terms matrix.] The factors give: what terms are in a topic; what documents are in a topic. LSA (Latent Semantic Analysis) [Landauer... 1998]: SVD of the terms × documents matrix. Later: refinements for more complex losses: LDA (Latent Dirichlet Allocation) [Blei... 2003] and Gamma-Poisson [Canny 2004].
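LSA reduces to a truncated SVD of a count matrix. The sketch below uses a hypothetical four-document corpus built from the terms of the figure; note that scikit-learn produces a documents × terms matrix, the transpose of the terms × documents convention of the slide.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

documents = [  # hypothetical toy corpus
    "the Python profiling module",
    "profiling the performance of Python code",
    "a module is code",
    "the performance can be measured",
]

counts = CountVectorizer().fit_transform(documents)  # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(counts)   # what topics are in a document
topic_terms = svd.components_            # what terms are in a topic

print(doc_topics.round(2))
```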
100. 100. Word embeddings Distributional semantics: meaning of words. “You shall know a word by the company it keeps” (Firth, 1957). Example: “A glass of red ___, please”. Could be wine, maybe juice? Wine and juice have related meanings. Factorization of the word × context matrix. What choice of context? What loss? word2vec [Mikolov... 2013a], GloVe [Pennington... 2014] G Varoquaux 59
101. 101. Word2vec: skip-gram sampling [Mikolov... 2013b] $\{\hat{u}_w, \hat{v}_c\} = \operatorname{argmax}_{\{u_w, v_c\}} \sum_{\text{pairs of words } (w, c) \text{ in the same window}^1} \log \operatorname{softmax}(V u_w^T)_c$, with $\operatorname{softmax}(z)_i = \frac{\exp z_i}{\sum_j \exp z_j}$. $u_w \in \mathbb{R}^k$: embedding of word $w$. $V \in \mathbb{R}^{\operatorname{card}(\text{voc}) \times k}$: $[v_c, c \in \text{voc}]$, all context words. Big sum on contexts ⇒ solved by SGD². [Figure: word embedding $U$ for the center word and context embedding $V$ for the context word, over the words salad, meat, juice, wine, glass, red, green.] Other view: language models, prediction of words. ¹ These windows are called skip-grams. ² Efficient: never build the matrix, stream directly from text. G Varoquaux 60
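The skip-gram objective above can be written out on a toy sentence. This is a sketch, not the word2vec implementation: the window size, the embedding dimension and the names `skipgram_pairs` and `objective` are assumptions, and the embeddings are random rather than trained by SGD.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    # All (center word, context word) pairs within the same window
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

def log_softmax(z):
    z = z - z.max()                      # for numerical stability
    return z - np.log(np.exp(z).sum())

tokens = "a glass of red wine please".split()
voc = sorted(set(tokens))
idx = {w: i for i, w in enumerate(voc)}
pairs = skipgram_pairs(tokens)

k = 4
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((len(voc), k))   # word embeddings u_w
V = 0.1 * rng.standard_normal((len(voc), k))   # context embeddings v_c

def objective(U, V, pairs):
    # Sum over pairs (w, c) of log softmax(V u_w)_c, to be maximized
    return sum(log_softmax(V @ U[idx[w]])[idx[c]] for w, c in pairs)

print(len(pairs), objective(U, V, pairs))
```

The pairs are generated by streaming over the text, so the word × context matrix is never built explicitly.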
102. 102. Word2vec: negative sampling [Mikolov... 2013a] Costly loss: $\log \operatorname{softmax}(z)_i = \log \frac{\exp z_i}{\sum_j \exp z_j}$. Approximate¹: huge sum in the softmax (all vocabulary); downsample it by drawing the positive (numerator) and a few negative examples (denominator). Negative sampling loss² [Goldberg and Levy 2014]: $\log \sigma(v_c u_w^T) + \sum_{n_\text{neg} \text{ words } w' \text{ not in window}} \log \sigma(-v_c u_{w'}^T)$, where $\sigma$ is the sigmoid ($\log \sigma(z) = -\log(1 + \exp(-z))$). ¹ Related to noise-contrastive estimation, which avoids computing costly normalizations in likelihoods [Gutmann and Hyvärinen 2010]. ² Related to a matrix factorization of mutual information in word occurrence [Levy and Goldberg 2014]. G Varoquaux 61
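A short sketch of the negative-sampling objective for one (word, context) pair, assuming the negatives have already been drawn. The function names are hypothetical; `np.logaddexp` is used to evaluate $\log \sigma(z) = -\log(1 + \exp(-z))$ stably.

```python
import numpy as np

def log_sigmoid(z):
    # log sigma(z) = -log(1 + exp(-z)), computed without overflow
    return -np.logaddexp(0.0, -z)

def negative_sampling_objective(u_w, v_c, U_neg):
    """One positive pair (w, c) plus sampled negatives.

    u_w: embedding of the word w in the window, shape (k,)
    v_c: embedding of the context c, shape (k,)
    U_neg: embeddings of the sampled words not in the window, shape (n_neg, k)
    """
    positive = log_sigmoid(v_c @ u_w)
    negatives = log_sigmoid(-(U_neg @ v_c)).sum()
    return positive + negatives

rng = np.random.default_rng(0)
k, n_neg = 4, 5
value = negative_sampling_objective(
    rng.standard_normal(k), rng.standard_normal(k), rng.standard_normal((n_neg, k))
)
```

The cost of one term is proportional to `n_neg`, not to the vocabulary size, which is the point of the approximation.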
103. 103. Beyond natural language: metric learning Triplet loss: for $a$ the “anchor”, $b$ close to $a$, $c$ far from $a$: $\log \sigma(v_a^T u_b) - \log \sigma(v_a^T u_c)$. Quadruplet loss [Chen... 2017]: for $a$ and $b$ close by, $c$ and $d$ far apart: $\log \sigma(v_a^T u_b) - \log \sigma(v_c^T u_d)$. In practice: draw¹ randomly $(a, b, c)$ or $(a, b, c, d)$. Metric learning [Bellet... 2013]: learning embeddings with weak supervision. ¹ Many strategies exist, eg “hard negative mining”; choosing among them requires a good test set and metric, as with SGD hyperparameters. G Varoquaux 62
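The triplet objective can be evaluated directly from its formula. The vectors below are hand-picked for illustration, not learned, and the function name is hypothetical.

```python
import numpy as np

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)

def triplet_objective(v_a, u_b, u_c):
    # log sigma(v_a^T u_b) - log sigma(v_a^T u_c): large when b is
    # similar to the anchor a and c is dissimilar
    return log_sigmoid(v_a @ u_b) - log_sigmoid(v_a @ u_c)

v_a = np.array([1.0, 0.0, 1.0])     # anchor
u_b = np.array([0.9, 0.1, 1.1])     # close to the anchor
u_c = np.array([-1.0, 0.5, -1.0])   # far from the anchor

good = triplet_objective(v_a, u_b, u_c)   # well-ordered triplet
bad = triplet_objective(v_a, u_c, u_b)    # roles of b and c swapped
```

Training would draw triplets at random and follow the gradient of this quantity with respect to the embeddings.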
104. 104. Embedding entities in knowledge graphs Structured (graph) representation of human knowledge, eg dbpedia, Yago G Varoquaux 63
105. 105. Embedding entities in knowledge graphs Structured (graph) representation of human knowledge, eg dbpedia, Yago. Learning embeddings of entities $\{e_i\}$ and relations $\{r_j\}$: $e_a \sim e_b + r_c$, a model of the relation. Then triplet / quadruplet loss. Reuse existing: conceptnet.io G Varoquaux 63 [Bordes... 2013, Wang... 2017]
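A sketch of the translation model $e_a \sim e_b + r_c$ in the spirit of [Bordes... 2013]. The entity names, the embeddings (set by hand so that one fact holds exactly) and the margin-based loss are assumptions for illustration; in practice the embeddings are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8

# Hypothetical embeddings; "France" is constructed as "Paris" + "capital_of"
relations = {"capital_of": rng.standard_normal(k)}
entities = {"Paris": rng.standard_normal(k), "Berlin": rng.standard_normal(k)}
entities["France"] = entities["Paris"] + relations["capital_of"]

def score(a, b, c):
    # Minus the distance between e_a and e_b + r_c: 0 when the fact fits exactly
    return -np.linalg.norm(entities[b] + relations[c] - entities[a])

def margin_loss(positive, negative, margin=1.0):
    # Push true facts to score higher than corrupted ones by at least `margin`
    return max(0.0, margin - score(*positive) + score(*negative))

true_fact = ("France", "Paris", "capital_of")
corrupted = ("France", "Berlin", "capital_of")
loss = margin_loss(true_fact, corrupted)
```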
106. 106. The value of simple models Risk of invisible overfit during search for hyperparameters and models. Complex models call for a clear utility measure with low measurement error: many reliable labels G Varoquaux 64
107. 107. The value of simple models Risk of invisible overfit during search for hyperparameters and models. Complex models call for a clear utility measure with low measurement error: many reliable labels. Matrix factorization models¹ have 2 hyper-parameters: dimensionality $k$ and regularization $\lambda$. Set them to optimize representations for supervised problems. ¹ Using majorization-minimization approaches to avoid setting a learning rate. G Varoquaux 64
108. 108. Summary Discrete entities lead to counting occurrences ⇒ Poisson and logistic loss (ugly logs in equations). Word & entity embeddings: factorization of co-occurrences in a notion of context; more generally: metric learning. Limited-data settings: avoid negative-sampling models (hyper-parameters); try to reuse representations (fastText, conceptnet.io) G Varoquaux 65
109. 109. 3 Fisher kernels What if the objects studied do not naturally live in a vector space? eg graphs of varying number of nodes
110. 110. 3 Fisher kernels Kernels feature maps From likelihoods to Kernels
111. 111. Learning with Kernels [Scholkopf and Smola 2001] Kernels: a kernel $K$ is a function $\mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$, positive and symmetric. It captures similarity between observations. Building functions with kernels on the training data: $K_i \overset{\text{def}}{=} K(x_i, \cdot)$, $i \in \text{train}$; prediction function²: $f(x) = \sum_{i \in \text{train}} w_i K_i(x)$. ² Benefits of this formulation: i) non-linear predictor trained with a linear problem; ii) expressiveness that increases with the amount of training data. G Varoquaux 68
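The prediction function $f(x) = \sum_{i \in \text{train}} w_i K_i(x)$ can be illustrated with kernel ridge regression, where the weights solve a linear system. The RBF kernel, its bandwidth `gamma`, the ridge penalty `lam` and the toy data are assumptions chosen for the sketch.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), a positive symmetric kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train[:, 0])            # non-linear target

# Linear problem in w: (K + lam * I) w = y
lam = 1e-3
K = rbf_kernel(X_train, X_train)
w = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

def f(X_new):
    # f(x) = sum_i w_i K(x_i, x): one term per training point
    return rbf_kernel(X_new, X_train) @ w

prediction = f(np.array([[0.5]]))
```

The predictor is non-linear in `x` but linear in `w`, and it gains one basis function per training sample, at an O(n²) cost for the Gram matrix.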
112. 112. Feature maps [Scholkopf and Smola 2001] Drawbacks of kernels: compute cost $O(n^2)$; representations not explicit: $f(x) = \sum_{i \in \text{train}} w_i K_i(x)$. As $K$ is symmetric positive¹, $\exists\, \phi: \mathcal{X} \to \mathbb{R}^d$ such that $\forall x, x'$: $K(x, x') = \phi(x)^T \phi(x')$. $\phi$ is a “feature map”; $f(x)$ is a linear function of $\phi(x)$, but $d$ can be $\infty$ ⇒ approximate $\phi$. ¹ Think of it as a generalization of the Cholesky decomposition. G Varoquaux 69
113. 113. Nyström approximate feature maps [Drineas and Mahoney 2005] On a random subset of the training data: $G \overset{\text{def}}{=} \begin{pmatrix} K(x_1, x_1) & \dots & K(x_1, x_m) \\ \vdots & \ddots & \vdots \\ K(x_m, x_1) & \dots & K(x_m, x_m) \end{pmatrix} \in \mathbb{R}^{m \times m}$. Let $L \in \mathbb{R}^{k \times m}$ be a rank-$k$ approximation: $L^T L \overset{\text{rank-}k}{\approx} G^{-1}$. Feature map¹: $\phi_\text{Nystrom}(x) = \left[K(x_1, x), \dots, K(x_m, x)\right] L^T$. sklearn.kernel_approximation.Nystroem ¹ Exercise: check that $\phi_\text{Nystrom}(x)^T \phi_\text{Nystrom}(x) \approx \phi(x)^T \phi(x)$ for $x$ in our subset. G Varoquaux 70
114. 114. Nyström approximate feature maps [Drineas and Mahoney 2005] On a random subset of the training data: $G \overset{\text{def}}{=} \begin{pmatrix} K(x_1, x_1) & \dots & K(x_1, x_m) \\ \vdots & \ddots & \vdots \\ K(x_m, x_1) & \dots & K(x_m, x_m) \end{pmatrix} \in \mathbb{R}^{m \times m}$. Let $L \in \mathbb{R}^{k \times m}$ be a rank-$k$ approximation: $L^T L \overset{\text{rank-}k}{\approx} G^{-1}$. Feature map: $\phi_\text{Nystrom}(x) = \left[K(x_1, x), \dots, K(x_m, x)\right] L^T$. sklearn.kernel_approximation.Nystroem See also: random features [Rahimi and Recht 2008], sklearn.kernel_approximation.RBFSampler G Varoquaux 70
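A NumPy sketch of the Nyström construction, built from an eigendecomposition of $G$ so that $L^T L$ is the rank-$k$ pseudo-inverse of $G$. The RBF kernel, the data and the function names are assumptions; scikit-learn's `Nystroem` packages the same idea with its own conventions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fit_nystroem(X_subset, k):
    # G = Q diag(lambda) Q^T; L = diag(lambda_k)^(-1/2) Q_k^T, shape (k, m)
    G = rbf_kernel(X_subset, X_subset)
    eigvals, eigvecs = np.linalg.eigh(G)
    top = np.argsort(eigvals)[::-1][:k]      # k largest eigenvalues
    return eigvecs[:, top].T / np.sqrt(eigvals[top])[:, None]

def nystroem_features(X, X_subset, L):
    # phi(x) = [K(x_1, x), ..., K(x_m, x)] L^T, one row per sample
    return rbf_kernel(X, X_subset) @ L.T

rng = np.random.default_rng(0)
X = 2 * rng.standard_normal((200, 2))
X_subset = X[:10]                            # random subset, m = 10

L = fit_nystroem(X_subset, k=10)
Phi = nystroem_features(X, X_subset, L)      # explicit features, shape (200, 10)

# On the subset and with k = m, inner products of features recover the kernel
Phi_subset = nystroem_features(X_subset, X_subset, L)
G = rbf_kernel(X_subset, X_subset)
```

A linear model fitted on `Phi` then approximates the kernel predictor without ever forming the full n × n Gram matrix.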
115. 115. 3 Fisher kernels Kernels feature maps From likelihoods to Kernels
116. 116. Parametric generative model Consider a model of $x$ parametrized by $w \in \mathbb{R}^k$: $\mathbb{P}(x) = P_w(x)$; log-likelihood $L_P \overset{\text{def}}{=} \log P_w$. Maximum-likelihood estimate: $\hat{w} = \operatorname{argmax}_w L_P(x)$. Kullback-Leibler divergence: natural distance¹ to another distribution, $\mathrm{KL}(P \| Q) = \mathbb{E}_P[L_P - L_Q]$. Goal: benefit from our model to build a representation. All models are wrong but some are useful. ¹ Not a distance, technically, as it is not symmetric. G Varoquaux 72
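117. 117. Local behavior of parametric models Fisher information matrix: minus the expectation of the Hessian of $L$ at $w$: $I(w) \overset{\text{def}}{=} -\mathbb{E}\left[\frac{\partial^2}{\partial w^2} L(x|w)\right] \in \mathbb{R}^{k \times k}$. Order-2 approximation of the KL divergence: $\mathrm{KL}(P_w \| P_{w+\varepsilon}) \approx \frac{1}{2} \varepsilon^T I_w\, \varepsilon$. $I_w$ also scales the covariance of the estimation error on maximum-likelihood estimates of $w$ (Cramér-Rao bounds) G Varoquaux 73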
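A numeric check of the order-2 approximation on the simplest model, a Bernoulli distribution with parameter $\theta$. The sketch takes the Fisher information as minus the expected Hessian of the log-likelihood and includes the usual factor 1/2 of the second-order Taylor expansion; the values of `theta` and `eps` are arbitrary.

```python
import math

def bernoulli_kl(theta, theta2):
    # KL(P_theta || P_theta2) for Bernoulli distributions
    return (theta * math.log(theta / theta2)
            + (1 - theta) * math.log((1 - theta) / (1 - theta2)))

def fisher_information(theta):
    # Hessian of the log-likelihood x log(theta) + (1 - x) log(1 - theta)
    hessian_if_1 = -1 / theta ** 2
    hessian_if_0 = -1 / (1 - theta) ** 2
    expected_hessian = theta * hessian_if_1 + (1 - theta) * hessian_if_0
    return -expected_hessian            # equals 1 / (theta * (1 - theta))

theta, eps = 0.3, 1e-3
kl = bernoulli_kl(theta, theta + eps)
quadratic = 0.5 * eps ** 2 * fisher_information(theta)
print(kl, quadratic)
```

The Fisher information grows as `theta` approaches 0 or 1: the same step `eps` changes the distribution more there, which is the sense in which the local geometry is not constant.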
118. 118. Fisher-Rao manifold (information geometry) Order-2 approximation: $\mathrm{KL}(P_w \| P_{w+\varepsilon}) \approx \frac{1}{2} \varepsilon^T I_w\, \varepsilon$. [Figure: KL close to $w_1$; KL close to $w_2$.] G Varoquaux 74
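119. 119. Fisher-Rao manifold (information geometry) Order-2 approximation: $\mathrm{KL}(P_w \| P_{w+\varepsilon}) \approx \frac{1}{2} \varepsilon^T I_w\, \varepsilon$. [Figure: KL close to $w_1$; KL close to $w_2$.] Non-constant across the family of distributions $\{P_w, w \in \mathbb{R}^k\}$ G Varoquaux 74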
120. 120. Fisher-Rao manifold (information geometry) Order-2 approximation: $\mathrm{KL}(P_w \| P_{w+\varepsilon}) \approx \frac{1}{2} \varepsilon^T I_w\, \varepsilon$. $\{P_w, w \in \mathbb{R}^k\}$ form a Riemannian manifold, with $I$ as the metric tensor [Rao 1945] G Varoquaux 74
121. 121. Riemannian manifolds Continuous geometry on curved spaces (eg the Earth). Locally, but not globally, Euclidean. A Riemannian manifold $\mathcal{M}$ is a differentiable space endowed with a metric $d$ that is locally equivalent to a Euclidean vector space: for $M$ and $M' \in \mathcal{M}$, if $d(M, M') \to 0$, $M$ and $M'$ can be mapped to elements $m$, $m'$ of a vector space such that $d(M, M') \sim m^T m'$. [Figure: tangent space $T_M\mathcal{M}$ at $M$, with the maps $\mathrm{Log}_M$ and $\mathrm{Exp}_M$ between the manifold and the tangent space.] Global structure: geodesic distance G Varoquaux 75
122. 122. Fisher Kernel [Jaakkola and Haussler 1999] A Kernel locally equivalent to the KL divergence Build upon the Fisher matrix Create a feature map Vector space where Euclidean distance ≈ KL ⇒ G Varoquaux 76
123. 123. Fisher Kernel [Jaakkola and Haussler 1999] A kernel locally equivalent to the KL divergence. Builds upon the Fisher matrix. Creates a feature map: a vector space where Euclidean distance ≈ KL. In practice: 1. Fit model $P_w$ on train data: $\hat{w} \leftarrow \operatorname{argmax}_w \sum_{i \in \text{train}} L(x_i, w)$. 2. Compute the gradient with respect to $w$ of the log-likelihood at $\hat{w}$: $z_\text{Fisher}(x) = \nabla_w L(x, \hat{w}) \in \mathbb{R}^k$ G Varoquaux 76
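The two steps can be sketched for a univariate Gaussian model with parameters $w = (\mu, \sigma)$, chosen here only because its gradient has a closed form. The data are simulated, the function name is hypothetical, and the features are left unwhitened, following the slide, rather than rescaled by the inverse Fisher matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(loc=2.0, scale=1.5, size=500)

# 1. Fit the model by maximum likelihood
mu_hat = x_train.mean()
sigma_hat = x_train.std()          # ddof=0 is the maximum-likelihood estimate

# 2. Gradient of the log-likelihood with respect to (mu, sigma), at w_hat
def fisher_features(x, mu, sigma):
    x = np.asarray(x, dtype=float)
    d_mu = (x - mu) / sigma ** 2
    d_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    return np.column_stack([d_mu, d_sigma])

Z_train = fisher_features(x_train, mu_hat, sigma_hat)   # shape (500, 2)
z_new = fisher_features([0.0, 5.0], mu_hat, sigma_hat)  # features for new samples
```

Each sample, whatever its original form, is mapped to a vector in $\mathbb{R}^k$ with one coordinate per model parameter. At the maximum-likelihood estimate the features average to zero over the training set.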
124. 124. Fisher Kernel applications Text: TF-IDF [Elkan 2005] Multinomial model of word appearance Genomics [Jaakkola and Haussler 1999] Hidden Markov model of DNA sequences (variable-length sequences ⇒ encoding difficult) Tree-structured data [Nicotra... 2004] A transition model on the tree Brain connectivity [Varoquaux... 2010] Multivariate Gaussian model (covariances) G Varoquaux 77
125. 125. Summary Kernels build prediction functions on similarities. Feature maps / kernel approximations capture the corresponding representation. Fisher kernels can go from a likelihood to a vector space: very useful for non-numeric objects G Varoquaux 78
126. 126. Limited-data settings Reminder: your validation measure is intrinsically unreliable (sampling noise). Get more data, for instance by acquiring data on a related task to learn representations. Use simple models. Do not spend too much time tweaking. [Figure: distribution of errors under a binomial law, by number of available samples: 1000: −2% to +2%; 300: −4% to +4%; 200: −5% to +5%; 100: −7% to +7%; 30: −15% to +12%.] G Varoquaux 79 [Varoquaux 2018]
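The sampling noise on a measured accuracy can be computed from the binomial law with the standard library alone. The true accuracy of 0.75 and the 90% coverage are assumptions made for this sketch; the slide does not state which values its figure uses.

```python
import math

def binom_quantile(n, p, q):
    """Smallest count k such that P(Binomial(n, p) <= k) >= q."""
    cdf = 0.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        cdf += math.exp(log_pmf)
        if cdf >= q:
            return k
    return n

def accuracy_interval(n_test, true_acc=0.75, coverage=0.90):
    # Interval of (measured accuracy - true accuracy) due to sampling noise only
    tail = (1 - coverage) / 2
    lo = binom_quantile(n_test, true_acc, tail) / n_test - true_acc
    hi = binom_quantile(n_test, true_acc, 1 - tail) / n_test - true_acc
    return lo, hi

for n in (1000, 300, 200, 100, 30):
    lo, hi = accuracy_interval(n)
    print(f"n_test={n:5d}: {lo:+.1%} to {hi:+.1%}")
```

Differences between models that are smaller than this interval cannot be distinguished from noise, which is the argument for not spending too much time tweaking.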
127. 127. References I A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras, B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage, 147:736, 2017. A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018. A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003. G Varoquaux 80
128. 128. References II A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013. D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux. Semi-supervised factored logistic regression for high-dimensional neuroimaging data. In Advances in Neural Information Processing Systems, page 3348, 2015. J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122, 2004. J.-F. Cardoso. Dependence, correlation and gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177, 2003. P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. arXiv:1907.01860, 2019. G Varoquaux 81
129. 129. References III W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, page 403, 2017. P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14:1505, 2013. P. Drineas and M. W. Mahoney. On the Nyström method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153, 2005. C. Elkan. Deriving tf-idf as a fisher kernel. In International Symposium on String Processing and Information Retrieval, page 295, 2005. G Varoquaux 82
130. 130. References IV Y. Goldberg and O. Levy. word2vec explained: Deriving mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722, 2014. P. K. Gopalan, L. Charlin, and D. Blei. Content-based recommendations with poisson factorization. In Advances in Neural Information Processing Systems, page 3176, 2014. M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, page 297, 2010. D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14, 2014. A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411, 2000. G Varoquaux 83
131. 131. References V A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of multivariate analysis, 5:248, 1975. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in neural information processing systems, pages 487–493, 1999. T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse processes, 25:259, 1998. A. Lefevre, F. Bach, and C. Févotte. Online algorithms for nonnegative matrix factorization with the itakura-saito divergence. In Applications of Signal Processing to Audio and Acoustics (WASPAA), page 313. IEEE, 2011. O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, page 2177, 2014. G Varoquaux 84
132. 132. References VI J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010. J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014. S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A, 374:20150203, 2016. A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66:113, 2017. G Varoquaux 85
133. 133. References VII A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting universal representations of cognition across brain-imaging studies. arXiv preprint arXiv:1809.06035, 2018. D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3:27, 2001. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR Workshop Papers. 2013a. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, page 3111, 2013b. G Varoquaux 86
134. 134. References VIII G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, page 2924, 2014. L. Nicotra, A. Micheli, and A. Starita. Fisher kernel for tree structured data. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), volume 3, pages 1917–1922. IEEE, 2004. E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering transform: Deep hybrid networks. In Proceedings of the IEEE international conference on computer vision, page 5618, 2017. J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), page 1532, 2014. G Varoquaux 87
135. 135. References IX M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint prediction of multiple scores captures better individual traits from brain images. Neuroimage, 158:145–154, 2017a. M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions from neuroimaging: assessing reduced-rank linear models. In 2017 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2017b. A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008. C. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bull Calcutta. Math. Soc., 37:81, 1945. G Varoquaux 88
136. 136. References X S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, pages 1–14, 2018. B. Scholkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001. G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage, 180:68–77, 2018. G. Varoquaux, F. Baronnet, A. Kleinschmidt, P. Fillard, and B. Thirion. Detection of brain functional-connectivity difference in post-stroke patients using group-level covariance modeling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 200–208. Springer, 2010. G Varoquaux 89
137. 137. References XI Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017. G Varoquaux 90