Representation learning in limited-data settings

Gaël Varoquaux

Limited-data settings
n to be compared to:
A measure of the signal-to-noise ratio
The dimensionality of the data p
Deep learning is hard in small-sample regimes
But we can borrow ideas
This talk: No silver bullet,
many simple (shallow) tricks
G Varoquaux 1

Small-n problems are important
83% of data scientists¹ never have n > 1M
n is often small for applications such as medicine
Bigger is better (how to not use this talk)
Get more data (pool related datasets)
Find a related problem and try transfer
This talk: data that differs from common sources
¹ www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets

Small-n problems need guiding principles
Selecting architecture, learning rate...
A deep architecture is validated by its measured accuracy
Less data ⇒ poorer validation
(more in the last part of this talk)
Need for guiding principles
This talk: connecting deep learning to Good Old-Fashioned Machine Learning

Outline
1 Representations for machine learning
Finite-sample supervised learning
Learning with representations
Supervised learning of representations
Over-parametrized representation learning
2 Matrix factorization and its variants
For signals
For discrete objects
3 Method evaluation with limited data
Variance in model evaluation
Reliable experimental procedures
From benchmarks to conclusion

1 Representations for machine learning
Defining the notion of representations
Their use for supervised learning

Settings: supervised learning
Given n pairs (x, y) ∈ X × Y drawn i.i.d.,
find a function f : X → Y such that f(x) ≈ y
Notation: $\hat{y} \overset{\text{def}}{=} f(x)$
Empirical risk minimization
Loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
Estimation of f: $f^\star = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}\big[ l(\hat{y}, y) \big]$
This course: how to choose good function classes $\mathcal{F}$

Example: finite-sample estimation of f
Data generated with a 9th-order polynomial + noise
Fit polynomials of various degrees
[Figure: noisy data with fitted polynomials of degree 1, 2, 5 and 9, and the true function]
Model too simple: underfit
Model too complex: overfit
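The underfit/overfit behaviour above can be reproduced with a small simulation. This is a minimal sketch, not the code behind the slide: the ground-truth coefficients, the noise level (0.1) and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
# Hypothetical 9th-order ground truth (random coefficients, an assumption)
true_poly = Polynomial(rng.normal(size=10))

def sample(n):
    """Draw n noisy observations of the ground-truth polynomial."""
    x = rng.uniform(-1, 1, size=n)
    return x, true_poly(x) + 0.1 * rng.normal(size=n)

x_train, y_train = sample(15)    # small training set
x_test, y_test = sample(1000)    # large test set, proxy for generalization error

errors = {}
for degree in (1, 2, 5, 9):
    fit = Polynomial.fit(x_train, y_train, deg=degree)  # least-squares fit
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    test_mse = np.mean((fit(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Because the polynomial classes are nested, the train error can only decrease with the degree; the test error is what reveals whether the extra flexibility helps.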

Theory: the generalization error
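Generalization error of a prediction function f:
Notation: $\mathcal{E}(f) \overset{\text{def}}{=} \mathbb{E}\big[ l(y, f(x)) \big]$
Finite-sample regime
Ideally: $f^\star = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}\big[ l(f(x), y) \big]$
In practice: $\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} l(f(x_i), y_i)$
$\mathcal{E}(\hat{f}) \geq \mathcal{E}(f^\star)$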

Theory: decomposing the generalization error
Assuming y = g(x) + e, e random with $\mathbb{E}[e] = 0$,
the generalization error of $\hat{f}$ is:
$$\mathcal{E}(\hat{f}) = \mathbb{E}\big[ l(g(x) + e, \hat{f}(x)) \big] = \mathcal{E}(g) + \big[\mathcal{E}(f^\star) - \mathcal{E}(g)\big] + \big[\mathcal{E}(\hat{f}) - \mathcal{E}(f^\star)\big]$$
Bayes rate, $\mathcal{E}(g) = \mathbb{E}\big[ l(g(x) + e, g(x)) \big]$: best possible prediction
- Due to the noise e
- Cannot be avoided
Approximation error, $\mathcal{E}(f^\star) - \mathcal{E}(g)$: $g \notin \mathcal{F}$, our model is wrong
- Decreases for larger $\mathcal{F}$
- Empirical lower bound of $\mathcal{E}(f^\star)$: train error
Estimation error, $\mathcal{E}(\hat{f}) - \mathcal{E}(f^\star)$: sampling noise on train data, $\hat{f} \neq f^\star$
- Finite-sample problem
- Decreases as n grows
- Increases for larger $\mathcal{F}$
- Guesstimate: difference between train and test error

Example: polynomial regression degree
$\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \sum_i l(f(x_i), y_i)$
Degree 9, small n
- no approximation error
- large estimation error
- Function class F not restrictive enough
Degree 1, large n
- small estimation error
- large approximation error
- Function class F too restrictive
[Figure: position of $\hat{f}$, $f^\star$ and g relative to the function class in both cases]

Gauging overfit vs underfit: learning curves
[Figure: generalization error and training error as a function of the number of samples (100 to 1000), for polynomials of degree 9 and 1; computed with sklearn.model_selection.learning_curve]
Overfit region: estimation error ∼ gap between train and test error
Underfit? Or Bayes rate?
Simpler models reach the asymptotic regime faster
(smaller “sample complexity”)
But can underfit

Gauging overfit vs underfit: validation curves
[Figure: generalization error and training error as a function of the polynomial degree (5 to 15); computed with sklearn.model_selection.validation_curve]
Reveals underfits
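A possible way to compute learning curves and validation curves with scikit-learn is sketched below. The toy data (a noisy sine), the pipeline and the grid of degrees are assumptions for illustration; only `learning_curve` and `validation_curve` come from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))                      # toy 1D inputs
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=300)      # hypothetical target

model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())

# Learning curve: error as a function of the number of training samples
sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=[0.1, 0.3, 1.0], cv=5,
    scoring="neg_mean_squared_error")

# Validation curve: error as a function of a hyper-parameter (the degree)
degrees = [1, 3, 5, 9]
val_train, val_test = validation_curve(
    model, X, y, param_name="polynomialfeatures__degree",
    param_range=degrees, cv=5, scoring="neg_mean_squared_error")

# Scores are negated MSE, one column per cross-validation fold
print(-train_scores.mean(axis=1), -test_scores.mean(axis=1))
print(-val_train.mean(axis=1), -val_test.mean(axis=1))
```

The gap between the train and test rows gives the guesstimate of the estimation error mentioned earlier.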

Linear models for limited-data settings
In high-dimensional limited-data settings,
linear models are often the best choice
For p-dimensional data, $x \in \mathbb{R}^p$, they have p parameters
n ∼ 200 000
Inpatient mortality, AUROC (95% CI)   Hospital A          Hospital B
Deep learning                         0.95 (0.94-0.96)    0.93 (0.92-0.94)
Baseline (logistic regression)        0.93 (0.92-0.95)    0.91 (0.89-0.92)

Theory: Approximating with linear predictors
Linear predictor¹: $\hat{y} = x^\top w$, $w \in \mathbb{R}^p$
Data model: $y = x^\top w^\star + \delta(x) + e$, $\mathbb{E}[e] = 0$
$x^\top w^\star$: best linear predictor
Ridge estimator:
$$\hat{w} = \operatorname{argmin}_w \|y_{\text{train}} - X_{\text{train}}^\top w\|_{\text{Fro}}^2 + \lambda \|w\|_2^2$$
Error compared to best linear predictor:
$$\mathbb{E}\big[\|y - x^\top \hat{w}\|_2^2\big] = \mathbb{E}\big[\|y - x^\top w^\star\|_2^2\big] + \mathcal{O}\big(\sigma^2 p / n_{\text{train}}\big)$$
[Hsu... 2014, sec 2.5]
Approximation error: data not linearly generated ⇒ craft more features
Estimation error: curse of dimensionality ⇒ limit number of features
Random design analysis can characterize the generalization
error without assuming a correct data-generating model
(misspecified model) [Hsu... 2014, Rosset and Tibshirani 2018]
¹ Predictor, not model: we do not assume it is a data-generating process.

Example: extrapolating sea level (tides)
Predict sea level as a function of time
Test outside of observed range¹
[Figure: polynomial regression, covariates of dim=10, dim=100, dim=1000]
[Figure: sines and cosines basis, covariates of dim=10, dim=100, dim=1000]
Choice of covariates / basis / signal representation
⇒ huge difference on approximation error
⇒ huge difference on generalization error
¹ Technically, this is not in our theory: test set ≠ train set.

Summary – minimizing a generalization error
$\hat{y} = f(x)$, f chosen in $\mathcal{F}$ to minimize the observed error $\sum_{i \in \text{train}} l(f(x_i), y_i)$
generalization error:
- approximation error ⇒ F adapted to the data
- estimation error ⇒ F small
Linear models best option when p is large compared to n
A good choice of covariates is crucial

Representations to build F
Settings
z = r(x): representation of the data, $z \in \mathbb{R}^k$
Predictor $f : x \to \hat{y} = h_w(r(x))$
Function composition: “depth”
Benefits
For expressiveness: composition vs. basis expansion
- Composing L rectifying functions on intermediate representations of dimension k gives $O\big((k/p)^{p(L-1)} k^p\big)$ linear regions.
- Basis expansion + linear predictor gives O(k)
- Exponential in depth, linear with dimension [Montufar... 2014]
For multi-tasks: sharing representations across tasks
- y multidimensional
For limited data: $h_w(z) = w^\top z$, a linear predictor
- A good choice of z can decrease sample complexity
- Transfer: r is learned on large data; a simple h used.

Representations to keep only the “useful information”
Formalize
How a representation z should:
keep information on the output y
lose non-useful information

Background: Information theory
Entropy = amount of information in x
$H(x) = -\mathbb{E}_p[\log p(x)]$
Equi-probable distribution = high entropy
Uneven distribution = low entropy
[Figure: bar plots of P over x = 0, ..., 5 for an equi-probable and an uneven distribution]
Mutual information between x and y
$I(x; y) = H(x) + H(y) - H(x, y)$
x ⊥⊥ y (independent) ⇔ I(x; y) = 0
Independence ⇔ p(x, y) = p(x) p(y), in which case
$$H(x, y) = -\mathbb{E}_{(x,y)}\big[\log p(x, y)\big] = -\mathbb{E}_{(x,y)}\big[\log p(x) + \log p(y)\big] = -\mathbb{E}_x\big[\log p(x)\big] - \mathbb{E}_y\big[\log p(y)\big] = H(x) + H(y)$$
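These quantities are easy to check numerically on small discrete distributions. The sketch below uses the standard sign convention $H = -\mathbb{E}[\log p]$; the two example distributions are made up for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(x; y) = H(x) + H(y) - H(x, y) from a joint probability table."""
    p_xy = np.asarray(p_xy, dtype=float)
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy)

uniform = np.full(6, 1 / 6)                                # equi-probable
uneven = np.array([0.9, 0.02, 0.02, 0.02, 0.02, 0.02])     # concentrated

independent = np.outer(uniform, uneven)   # p(x, y) = p(x) p(y)
dependent = np.diag(uniform)              # y is a copy of x

print(entropy(uniform), entropy(uneven))
print(mutual_information(independent), mutual_information(dependent))
```

The independent table gives a mutual information of zero (up to rounding), while the copy case gives I(x; y) = H(x).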

Theory: information in representations
A representation z of x is sufficient for y if y ⊥⊥ x | z,
or equivalently if I(z; y) = I(x; y)
x, z, y form a Markov chain if $\mathbb{P}(y|x, z) = \mathbb{P}(y|z)$.
x → z → y
Data processing inequality: I(x; y) ≤ I(x; z)
A sufficient representation z is minimal when I(x; z) is smallest among sufficient representations
[Achille and Soatto 2018]

Nuisances and invariances
A nuisance n: I(x; n) ≥ 0, but I(y; n) = 0
Representation z is invariant to the nuisance n
if z ⊥⊥ n, or I(z; n) = 0 ⇒ We want I(z; n) low
In a Markov chain $x \to z_1 \to z_2 \cdots \to z_L \to y$
If z is a sufficient representation for y,
I(z; n) ≤ I(z; x) − I(x; y)
Communication bottleneck: $I(z_1; z_2) < I(z_1; x)$
⇒ $I(z_2; n) \leq I(z_1; z_2) - I(x; y)$
Stacking increases invariance
[Achille and Soatto 2018]

Examples of invariant representations
Illustrate
Ingredients of well-known representations
their links to invariances

Invariant representations on a continuous space
Shift invariance representation = Fourier basis
Fourier transform: $F(s)_f = \sum_t e^{-i f t} s_t$ (complex i)
Shifting the signal: $s_t \to s'_t = s_{t+k}$
$$F(s')_f = \sum_t e^{-i f t} s_{t+k} = \sum_t e^{-i f (t-k)} s_t = e^{i k f} \sum_t e^{-i f t} s_t = e^{i k f} F(s)_f$$
→ change in phase
An orthonormal basis of shift-invariant vectors
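The phase-shift identity can be verified with a discrete Fourier transform. This sketch assumes a circular shift on a finite signal of length 64 (so that the sum over t is well defined), with frequencies $f = 2\pi j / 64$ as in NumPy's FFT convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 5
s = rng.normal(size=n)

# Circular shift: s_shifted[t] = s[(t + k) mod n]
s_shifted = np.roll(s, -k)

F = np.fft.fft(s)                 # F(s)_f = sum_t exp(-i f t) s_t
F_shifted = np.fft.fft(s_shifted)
freqs = 2 * np.pi * np.arange(n) / n

# Shifting the signal only multiplies each coefficient by a phase exp(i k f)
phase = np.exp(1j * k * freqs)
print(np.allclose(F_shifted, phase * F))
# Hence the modulus of the Fourier coefficients is shift-invariant
print(np.allclose(np.abs(F_shifted), np.abs(F)))
```

Taking the modulus discards the phase, which is the non-linearity used later to build invariant representations.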

Invariant representations on a continuous space
Shift invariance = Fourier basis
Local deformations = Wavelets
Locally equivalent to Fourier basis
But without the global extent
Decimated wavelets
Isometric transform of the signal
Higher scales lose shift invariance
Redundant wavelets
Increase the dimensionality
Good shift invariance

Representations invariant to rich deformations
Scaling
Rotations
Deformations
Ingredients
Modulus of wavelet / Fourier transform
⇒ non linearity filter banks (convolutions)
+ stacking (repeating simple invariants)
Scattering transform
Derived from first principles
Building first-order invariants
Convolutional networks
Learned from data
Pooling across pixels (eg max)
[Mallat 2016]

Summary – representations to help learning
Intermediate representations give expressiveness to predictive models
Good representations keep predictive information and lose nuisance information
Bottleneck and regularization to lose information
Given known invariants of the problem, reusing existing representations helps
eg Headless conv-net, wavelets... [Oyallon... 2017]

The need for supervision
Maximizing I(z; y) (≤ I(x; y)): sufficient representations
⇒ supervised learning
while minimizing I(z; n): nuisance
⇒ sampling nuisance / invariants: data augmentation
Challenge: amount of labeled data
Pretext tasks
Other targets y′ that capture useful information
Finding them needs domain knowledge

Deep architectures
$\hat{y} = f^d_{W_d} \circ \dots \circ f^1_{W_1}(x)$
Typically $f^k_{W_k}(x) = g_k(W_k^\top x)$, with $g_k$ an element-wise non-linearity
Thus $\hat{y} = g_d\big(W_d^\top \dots g_1(W_1^\top x)\big)$
Stacked representations: $W_k$
$\{W_k\}$ optimized to minimize a prediction error

Shallow architectures for limited data
Keep one latent layer
Without non-linearity:
$\hat{y} = x^\top W_1 W_2$, $y \in \mathbb{R}^k$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$:
factored / reduced-rank linear model
Multi-task / multi-output
structured loss can help (multiple soft-max’s)
Overparametrization sometimes useful: d > k
can be achieved with dropout
[Bzdok... 2015, Mensch... 2018]

Examples of simple models that extract representations

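Simple case: square loss = reduced rank regression
$\hat{Y} = X W_1 W_2$, $Y \in \mathbb{R}^{n \times k}$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$
$$\hat{W}_1, \hat{W}_2 = \operatorname{argmin}_{W_1, W_2} \|\hat{Y} - Y_{\text{train}}\|_{\text{Fro}}^2$$
For squared loss the problem is convex
Full-rank solution¹ (X and Y on train set):
$$\hat{W} = \hat{\Sigma}_X^{-1} X^\top Y \qquad \hat{Y} = X \hat{W} = X \hat{\Sigma}_X^{-1} X^\top Y$$
Rank d solution: [Izenman 1975, Rahim... 2017b]
$$\hat{R}_d \overset{\text{def}}{=} Y^\top \hat{Y} \in \mathbb{R}^{k \times k} \;\xrightarrow{\text{SVD}}\; \hat{U}_d\, \hat{s}_d\, \hat{V}_d, \qquad \hat{U}_d \in \mathbb{R}^{k \times d}$$
then $\hat{W}_1 = \hat{\Sigma}_X^{-1} X^\top Y\, \hat{U}_d$ (full-rank solution times rank-d projector²) and $\hat{W}_2 = \hat{U}_d^\top$
¹ No need for pesky SGDs
² The projector captures the variance explained on the multiple outputs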
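The closed-form rank-d solution above can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions: the function name `reduced_rank_regression` and the toy data are hypothetical, and the eigendecomposition of the symmetric matrix $Y^\top \hat{Y}$ is used in place of its SVD.

```python
import numpy as np

def reduced_rank_regression(X, Y, d):
    """Rank-d least-squares regression; returns W1 (p x d), W2 (d x k), W_full."""
    # Full-rank least-squares solution, W = (X^T X)^{-1} X^T Y
    W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ W_full
    # R = Y^T Y_hat is symmetric positive semi-definite (k x k)
    R = Y.T @ Y_hat
    eigvals, eigvecs = np.linalg.eigh((R + R.T) / 2)
    U_d = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # top-d directions
    return W_full @ U_d, U_d.T, W_full

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Y = X @ rng.normal(size=(8, 5)) + 0.5 * rng.normal(size=(100, 5))
W1, W2, W_full = reduced_rank_regression(X, Y, d=2)

print((X @ W1).shape)    # the d-dimensional representation z = x^T W1
```

The product `X @ W1 @ W2` is the best rank-d approximation of the full-rank fitted values, and `X @ W1` can be reused as a learned low-dimensional representation.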

Model stacking
$x \xrightarrow{f_1} z \xrightarrow{f_2} y$
Learn f1 separately
Train a first model, feed its output to a second model
Directly supervising z:
z = ŷ for a (simple) predictive model
First model f1 must underfit output:
Model chosen from a simple function class (linear models)
Trick: “cross-fit” during training
obtain ŷ by splitting the training data
[Figure: full data split into train set and test set]
(in sklearn: cross_val_predict)
Application: tackling dimensionality [Rahim... 2017a]
Some features are a high-dimensional signal, eg medical images
f1: linear to reduce signal features
f2: non-linear (eg trees¹) on all features
¹ Tree-based models are great for mixed-type data with categorical features
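The cross-fit trick can be sketched with scikit-learn as follows. The synthetic data, the choice of `Ridge` for f1 and of a random forest for f2, and all hyper-parameters are assumptions for illustration; only `cross_val_predict` is named on the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=(n, 200))    # high-dimensional block (eg an image)
tabular = rng.normal(size=(n, 3))     # a few other features
y = signal[:, :5].sum(axis=1) + np.abs(tabular[:, 0]) + 0.1 * rng.normal(size=n)

idx_train, idx_test = train_test_split(np.arange(n), random_state=0)

# First level f1: a simple linear model on the high-dimensional signal
f1 = Ridge(alpha=10.0)
# Cross-fit: each training z is predicted by a model that did not see that sample
z_train = cross_val_predict(f1, signal[idx_train], y[idx_train], cv=5)
# For the test set, f1 is fit on the whole training set
f1.fit(signal[idx_train], y[idx_train])
z_test = f1.predict(signal[idx_test])

# Second level f2: non-linear model on z plus the remaining features
f2 = RandomForestRegressor(n_estimators=100, random_state=0)
f2.fit(np.column_stack([z_train, tabular[idx_train]]), y[idx_train])
score = f2.score(np.column_stack([z_test, tabular[idx_test]]), y[idx_test])
print(f"R2 on the test set: {score:.2f}")
```

Without the cross-fit, the training z would be overly optimistic predictions and f2 would learn to trust them too much.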

Model stacking to encode discrete items
Sex   Date Hired    Employee Position          Salary (to predict)
M     09/12/1988    Master Police Officer      69222.18
F     06/26/2006    Social Worker III          97392.47
M     07/16/2007    Police Officer III         104717.28
Difficulty: number of different positions
what invariants?
[Figure: y, employee salary (40000 to 140000), for positions Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
Target encoding¹ [Micci-Barreca 2001]
position → $\mathbb{E}_{\text{train}}[\text{salary} \mid \text{position}]$
¹ To inject categories in $\mathbb{R}$, before a second level that combines all columns
Python package: dirty-cat.github.io
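Target encoding itself is only a per-category mean of the target on the training set. A minimal sketch in plain Python follows; the helper names, the toy salaries and the fallback to the global mean for unseen categories are assumptions, not the dirty-cat implementation.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Mean of the target for each category, computed on training data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def transform_target_encoding(categories, means, global_mean):
    """Replace each category by its mean; unseen ones get the global mean (assumption)."""
    return [means.get(c, global_mean) for c in categories]

# Hypothetical toy training data
positions = ["Police Officer III", "Police Officer III", "Social Worker III"]
salaries = [100.0, 110.0, 90.0]

means, global_mean = fit_target_encoding(positions, salaries)
encoded = transform_target_encoding(
    ["Police Officer III", "Crossing Guard"], means, global_mean)
print(encoded)
```

In practice the encoding is best computed with the cross-fit trick of model stacking, so that a sample's own target does not leak into its feature.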

Summary – supervised extraction of representations
Supervision helps selecting
the relevant part of the signal
In limited-sample settings, simple
models can create representations
Simple latent-factor models
Multi-output models
Stacking: fit a first-level model

Revisiting the bias-variance tradeoff
Flexible models can achieve less bias but come with more variance [Geman... 1992]
[Figure: polynomial fits of degree 1, 2, 5 and 9, and the truth]
Strong theoretical arguments come from a worst-case analysis¹
Average case can be very different
Achieve more flexibility without variance increase
¹ eg minimax rates of non-parametric regression [Györfi... 2002]

Example: random forest
1 tree: much bias
300 trees: less bias, no variance increase
[Figure: predictions of 1 tree and of 300 trees]
Ensemble models
Prediction: $\hat{y} = \frac{1}{m}\big(\hat{y}_1 + \hat{y}_2 + \dots + \hat{y}_m\big)$
If the errors of each model, $\hat{y}_i = y + \varepsilon_i$, are independent (and zero-mean), they average out:
$$\mathbb{E}\|\hat{y} - y\|^2 = \frac{1}{m^2}\, \mathbb{E}\|\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_m\|^2 = \frac{1}{m} \operatorname{var} \varepsilon$$
Increase in model flexibility without variance
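The 1/m reduction is easy to check by simulation. The sketch below assumes idealized errors: independent, zero-mean, unit-variance Gaussian errors for each of the m models, which real trees in a forest only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_trials = 300, 5000

# errors[i, j]: error of model j on trial i (independent, zero-mean, variance 1)
errors = rng.normal(size=(n_trials, m))

single_model_mse = np.mean(errors[:, 0] ** 2)       # about var(eps) = 1
ensemble_mse = np.mean(errors.mean(axis=1) ** 2)    # about var(eps) / m

print(f"single model: {single_model_mse:.3f}")
print(f"average of {m} models: {ensemble_mse:.5f} (1/m = {1 / m:.5f})")
```

When the errors are correlated, as they are for trees grown on the same data, the reduction is smaller, which is why forests inject randomness to decorrelate the trees.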

Overparametrized neural networks
For suitable random initialization¹, the error on ŷ does not increase with network width.
Overparametrization can even decrease sample complexity [Kaplan... 2020]
¹ Initialization must be diverse enough, and more concentrated for wide networks [Chizat and Bach 2018, Chizat... 2019].
[Neal... 2018, Nakkiran... 2020]

Overparametrized neural networks
Overparametrize to set train error to zero
In error decomposition: approximation error to zero
$\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \sum_i l(f(x_i), y_i)$
Another error decomposition:
Error can be due to
1. optimizing on noisy training data
2. initialization
(1) plateaus with wide networks, while (2) decreases.
Optimum on train set is degenerate
[Neal... 2018, Nakkiran... 2020]

Randomization as a regularization
Toy example: ridge
OLS: $\hat{w} = \operatorname{argmin}_w \|y - X^\top w\|_2^2$
Inject noise: $X' = X + E$, $E \sim \mathcal{N}(0, \sigma^2)$
$$\hat{w}' = \operatorname{argmin}_w \mathbb{E}_E\|y - (X + E)^\top w\|_2^2 = \operatorname{argmin}_w \|y - X^\top w\|_2^2 + \mathbb{E}_E\|E^\top w\|_2^2 = \operatorname{argmin}_w \|y - X^\top w\|_2^2 + \lambda \|w\|_2^2, \quad \lambda \propto \sigma^2$$
(the cross term vanishes in expectation because $\mathbb{E}[E] = 0$)
Dropout as an implicit regularization [Mianjy... 2018]
Random kernel expansions regularize [Rahimi and Recht 2008]
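The equivalence between noise injection and ridge can be checked numerically. In this sketch the rows of X are samples (the transpose of the slide's convention), so that for i.i.d. noise of variance $\sigma^2$ the equivalent penalty is $\lambda = n \sigma^2$; the data and the number of noisy copies are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, n_copies = 100, 3, 0.5, 2000
X = rng.normal(size=(n, p))                       # rows are samples here
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# OLS on many noisy copies of X approximates the expectation over the noise E
X_noisy = np.concatenate(
    [X + sigma * rng.normal(size=(n, p)) for _ in range(n_copies)])
y_repeated = np.tile(y, n_copies)
w_noisy, *_ = np.linalg.lstsq(X_noisy, y_repeated, rcond=None)

# Ridge with the matching penalty, in closed form
lam = n * sigma ** 2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Plain OLS, for comparison: it is not shrunk
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_noisy, w_ridge, w_ols)
```

The noisy-data estimate matches ridge up to Monte-Carlo error, and both are shrunk compared to OLS.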

Fine-tuning to reuse complex representations
Overparametrized architectures might not have low-dimension representations
Fine tune the full architecture¹
Lower learning rate to the input layers
to avoid catastrophic forgetting [Sun... 2019]
Feature extraction from the full architecture
Pooling linear combinations of input layers
[Peters... 2019]
Fine tuning best on complex architectures
¹ Thanks to Lihu Chen for help with this slide

Summary – overparametrized representations
Diversity (randomness) regularizes
Randomization can create interesting
inductive biases
Random CNNs work surprisingly well
[He... 2016, Ustyuzhaninov... 2016]
Fine-tuning overparametrized
representations to reuse them

Summary of first section
For generalization: small family of functions $f_w$ that approximate the signal well
Generalization of a linear predictor: approximation error + $\mathcal{O}(p / n_{\text{train}})$
Predictors by composition: $\hat{y} = f_2(z)$, $z = f_1(x)$
$x \xrightarrow{f_1} z \xrightarrow{f_2} y$: ideally, f1 makes z invariant to nuisances
Reuse representations with the right invariances:
wavelets, fasttext, pretrained headless neural nets
Simple supervised models can create representations:
stacking, multi-output, pretext tasks

2 Matrix factorization and its variants
Simple unsupervised representation learning
More unlabeled data than labeled data
Learn representations and transfer them
Here: Focus on simple models for limited n or low SNR settings
Particularly interesting regime: p large and n large.
Matrix factorization is a simplified version of deep learning
This section: building the framework from simple to complex

Matrix factorization for representations
Reduce the dimensionality
while keeping the signal
“disentangle”
give features that are useful in themselves

Principal Component Analysis¹
Find the directions of largest variance
Computation: $X \in \mathbb{R}^{n \times p}$, $\Sigma_X = X^\top X \in \mathbb{R}^{p \times p}$
PCA projector: $P_{\text{PCA}} \in \mathbb{R}^{p \times k}$, from $\text{SVD}_k(X)$ or $\text{EVD}_k(\Sigma_X)$
Reduced X: $X P_{\text{PCA}} \in \mathbb{R}^{n \times k}$
Model: low-rank Gaussian latent factors
$X \approx U V + E$, $E \sim \mathcal{N}(0, I_p)$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
$$\hat{U}, \hat{V} = \operatorname{argmin}_{U, V} \|X - U V\|_{\text{Fro}}^2$$
Rotationally invariant: $U' = U O$, $V' = O^\top V$ also solution for O s.t. $O^\top O = I$
PCA = 1-hidden layer autoencoder with squared loss²
$$\min_W \|X - W W^\top X\|_{\text{Fro}}^2, \text{ with suitable constraint on } W$$
In a learning pipeline
Useful for dimensionality reduction (eg p is large)
Eases statistics and computations
Generalization error of PCA + OLS within a factor of 4 of ridge [Dhillon... 2013]
¹ Mother of all representations (simplest)
² Both find the same subspace
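The two computations of the PCA projector can be compared directly. This is a minimal NumPy sketch on synthetic data; centering the columns of X is an assumption (the usual PCA preprocessing, not stated on the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 10, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features
X = X - X.mean(axis=0)                                  # centering (assumed)

# Route 1: truncated SVD of X, keep the top-k right singular vectors
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_svd = Vt[:k].T                                        # p x k

# Route 2: eigendecomposition of Sigma_X = X^T X, keep the top-k eigenvectors
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
P_evd = eigvecs[:, np.argsort(eigvals)[::-1][:k]]       # p x k

X_reduced = X @ P_svd                                   # n x k representation
print(X_reduced.shape)
```

The columns of the two projectors may differ by a sign, but they span the same subspace, so `P_svd @ P_svd.T` and `P_evd @ P_evd.T` coincide.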

Beyond variance: Independent Component Analysis
Separate out signals U observed mixed¹
[Figure: true sources (signals U), observations (mixed signal), ICA recovered signals]
Disentangles: lifts the rotational invariance
Model: $X = U V$, $V \in \mathbb{R}^{p \times p}$, $V^\top V = I_p$
If U is Gaussian, the model is not identifiable
Seek low mutual information across $\{u_j\}$
⇒ Maximally non-Gaussian marginals [Cardoso 2003]
[Figure: latent signals and observed data U V]
Computation: FastICA [Hyvärinen and Oja 2000]
Power iterations on V
Each time:
- apply a smooth increasing non-linearity on $\{u_j\}$
- decorrelate
Preprocessing: whiten the data eg with PCA
¹ Classic ICA has no noise model: it does not do dimension reduction

ICA to learn representations
Across patches of natural images:
Gabor-like filters
Similar to wavelets
and first layer of convnets
[Hyvärinen and Oja 2000]

ICA to learn representations
ICA
Disentangles
Can only learn rotations
No dimension reduction

Dictionary learning
Find vectors V that represent the signal well with sparse combinations U
Model: X = U V s.t. U is sparse, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
k can be larger than p (overcomplete dictionary)
Estimation:
$$\hat{U}, \hat{V} = \operatorname{argmin}_{U, V \text{ s.t. } \|v_i\|_2^2 \leq 1} \|X - U V\|_{\text{Fro}}^2 + \lambda \|U\|_1$$
Data fit without need for reduction
Combining squared loss and $\ell_1$ penalty creates sparsity
Constraint on $\|v_i\|_2^2$ required to avoid cancelling out penalty with V → ∞ and U → 0
More general form:
$$\hat{U}, \hat{V} = \operatorname{argmin}_{U, V \text{ s.t. } V \in \mathcal{C}} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$$
Constraint set and penalty can be varied¹
Typically, $\ell_2$, $\ell_1$, and positivity² on U or V.
¹ Fast when $\mathcal{C}$ and Ω lead to simple projections and penalized regression.
² Recovers a form of NMF (non-negative matrix factorization)

Sparse dictionary learning to learn representations
Also learns Gabor-like filters¹
Good for sparse models, eg for denoising
Also performs dimensionality reduction
¹ as ICA, K-Means, etc. on image patches
[Mairal... 2014]

Large n large p: brain imaging
Brain activity at rest
1000 subjects with ∼ 100–10 000 samples
Images of dimensionality 100 000
Dense matrix, large both ways
[Figure: data matrix X (voxels × time) = U · V + E]

Estimation algorithms
For dictionary learning

Large n large p: recommender systems
Product ratings
Millions of entries
Hundreds of thousands of products and users
Large sparse matrix
[Figure: sparse matrix of ratings X (users × products) = U · V + E]

Online estimation: stochastic optimization
$\min_w \sum_i l(x_i w)$
Many samples: $\min_w \mathbb{E}[l(y, x w)]$
Gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \nabla_w l$
Stochastic gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t\, \mathbb{E}[\nabla_w l]$
Use a cheap estimate of $\mathbb{E}[\nabla_w l]$ (e.g. subsampling)
$\alpha_t$ must decrease “suitably” with t.
Those pesky learning rates
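A minimal sketch of stochastic gradient descent on a least-squares problem illustrates the update. The squared loss, the mini-batch size and the step-size schedule $\alpha_t = 0.1 / (1 + 0.01 t)$ are assumptions picked for this toy problem, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(p)
batch_size = 10
for t in range(1, 3001):
    # Cheap estimate of the expected gradient: a random mini-batch
    idx = rng.integers(0, n, size=batch_size)
    residual = X[idx] @ w - y[idx]
    grad = 2 * X[idx].T @ residual / batch_size
    # Decreasing step size (hypothetical schedule)
    alpha_t = 0.1 / (1 + 0.01 * t)
    w = w - alpha_t * grad            # descent: move against the gradient

print(np.linalg.norm(w - w_true))
```

With a step size that is too large the iterates diverge, and with one that decreases too fast they stall, which is the tuning burden the slide complains about.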

Online estimation for matrix factorization
Large matrices = terabytes of data
$$\operatorname{argmin}_{U, V} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$$
Rewrite as an expectation:
$$\operatorname{argmin}_V \sum_i \Big( \min_u \|X_i - V u\|_{\text{Fro}}^2 + \lambda \Omega(u) \Big) \;\to\; \operatorname{argmin}_V \mathbb{E}\big[ f(V) \big]$$
⇒ Optimize on approximations (sub-samples)
[Figure: data access, code computation and dictionary update on the data matrix (columns seen at t, seen at t+1, unseen at t) for three schemes: alternating minimization; online matrix factorization, which streams columns [Mairal... 2010]; subsampled online matrix factorization, which also subsamples rows [Mensch... 2017]]

Online matrix factorization algorithm [Mairal... 2010]
Stream samples $x_t$:
1. Compute code
$$u_t = \operatorname{argmin}_{u \in \mathbb{R}^k} \|x_t - V_{t-1} u\|_2^2 + \lambda \Omega(u)$$
2. Update the surrogate function
$$g_t(V) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - V u_i\|_2^2 = 2 \operatorname{tr}\Big( \tfrac{1}{2} V^\top V A_t - V^\top B_t \Big) + \text{const}$$
$g_t(V)$ is a surrogate of $\sum_x l(x, V)$: $u_i$ is used, and not $u^\star$
$$A_t \overset{\text{def}}{=} \Big(1 - \frac{1}{t}\Big) A_{t-1} + \frac{1}{t} u_t u_t^\top \qquad B_t \overset{\text{def}}{=} \Big(1 - \frac{1}{t}\Big) B_{t-1} + \frac{1}{t} x_t u_t^\top$$
$A_t$ and $B_t$ are sufficient statistics of the loss accumulated over the data
3. Minimize surrogate
$$V_t = \operatorname{argmin}_{V \in \mathcal{C}} g_t(V), \qquad \nabla g_t \propto V A_t - B_t$$
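The three steps can be sketched in NumPy as below. Several choices are assumptions made to keep the example short: a ridge penalty $\Omega(u) = \|u\|_2^2$ so that the code has a closed form (the slides focus on sparse penalties), a unit-ball constraint on the columns of V, a single pass of projected block coordinate descent for step 3, and synthetic low-rank data.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n, lam = 20, 5, 500, 0.1
# Synthetic low-rank data, one sample x_t per row
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) + 0.01 * rng.normal(size=(n, p))

V = rng.normal(size=(p, k))
V /= np.linalg.norm(V, axis=0)            # start inside the constraint set
A = np.zeros((k, k))
B = np.zeros((p, k))
codes = []

for t, x in enumerate(X, start=1):
    # 1. Compute the code with the previous dictionary (ridge closed form)
    u = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ x)
    codes.append(u)
    # 2. Update the sufficient statistics of the surrogate
    A = (1 - 1 / t) * A + np.outer(u, u) / t
    B = (1 - 1 / t) * B + np.outer(x, u) / t
    # 3. Decrease the surrogate: one pass of block coordinate descent
    #    on the columns of V, each projected on the unit ball
    for j in range(k):
        if A[j, j] > 1e-12:
            v = V[:, j] - (V @ A[:, j] - B[:, j]) / A[j, j]
            V[:, j] = v / max(1.0, np.linalg.norm(v))

U = np.array(codes)
surrogate = np.mean(np.sum((X - U @ V.T) ** 2, axis=1))
print(f"surrogate value at the final dictionary: {surrogate:.3f}")
```

The data are touched once, and only A (k × k) and B (p × k) are kept in memory, which is what makes the approach usable on matrices that do not fit in RAM.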

Stochastic Majorization-Minimization [Mairal 2013]
$$V = \operatorname{argmin}_{V \in \mathcal{C}} \sum_x l(x, V) \quad \text{where} \quad l(x, V) = \min_u f(x, V, u)$$
Algorithm:
$g_t(V)$ is a majorant: $g_t(V) \geq \sum_x l(x, V)$, because $u_i$ is used, and not $u^\star$
⇒ Majorization-Minimization scheme¹
[Figure: steps of SMM, surrogate computation then full minimization]
2nd order information
No learning rate
¹ SOMF uses an approximate majorant and minimization [Mensch... 2017]

Experimental convergence: large images
[Figure: test objective value as a function of time for OMF (r = 1), SOMF (r = 4, 6, 8, 12, 24) and best step-size SGD, on four problems: ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB; Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB]
SOMF = Subsampled Online Matrix Factorization

Experimental convergence: recommender system
SOMF = Subsampled Online Matrix Factorization

Summary – matrix factorization of signals
Versatile matrix-factorization formulation¹
$$\operatorname{argmin}_{U \in \mathbb{R}^{n \times k},\, V \in \mathcal{C}} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$$
Estimation
Stochastic majorization-minimization²
⇒ an online alternated optimization
Example use of learned representations
Biomarkers of autism on brain images:
p ∼ 100 000, n ∼ 1 000 [Abraham... 2017]
¹ 1-layer linear autoencoder
² Common case algorithm readily usable in scikit-learn: MiniBatchDictionaryLearning

Embedding discrete objects
Embedding discrete objects (words, entities, user ids) is crucial
It endows them with a metric,
enables building predictive functions
that extrapolate between objects
Original p is not small compared to n
[Figure: embedded job titles, eg Construction Representative III, Fire/Rescue Captain, Resource Conservationist, Security Officer II, Security Officer III (Sergeant)]

Natural language processing: topic-modeling history
Topic modeling: embedding documents³
Start from a vectorization of each document by counting word occurrences:
The term-document matrix
[Figure: term-document count matrix (documents × terms such as "the", "Python", "performance", "profiling", "module", "is", "code", "can", "a") factorized into a documents × topics matrix (what documents are in a topic) and a topics × terms matrix (what terms are in a topic)]
LSA (Latent Semantic Analysis) [Landauer... 1998]
SVD of the terms×documents matrix
³ Typically for information retrieval purpose, aka search engines

Gamma-Poisson for factorizing counts [Canny 2004]
When X is a matrix of counts
- Topic modeling
- Recommender systems [Gopalan... 2014]
- Database string entries [Cerda and Varoquaux 2020]
⇒ Poisson loss, instead of squared loss
$$\mathbb{P}(x_j | w_j) = \text{Poisson}(w_j) = \frac{1}{x_j!}\, w_j^{x_j}\, e^{-w_j}$$
[Figure: Gaussian(.5), Poisson(3), Poisson(1) and Poisson(0) distributions]
Counts are not well approximated by a Gaussian
$$\mathbb{P}(x_j | u, V) = \text{Poisson}\big((u V)_j\big) = \frac{1}{x_j!}\, (u V)_j^{x_j}\, e^{-(u V)_j}$$
u are loadings, modeled as random with a Gamma prior¹
$$\mathbb{P}(u_i) = \frac{u_i^{\alpha_i - 1}\, e^{-u_i / \beta_i}}{\beta_i^{\alpha_i}\, \Gamma(\alpha_i)}$$
Maximum a posteriori estimation:
$$\hat{U}, \hat{V} = \operatorname{argmin}_{U, V} -\Big( \sum_j \log \mathbb{P}(x_j | u, V) + \sum_i \log \mathbb{P}(u_i) \Big)$$
¹ Because it is the conjugate prior of the Poisson, and because it imposes soft sparsity and lifts rotational invariance

Gamma-Poisson estimation
Full log-likelihood expression:
$$\log L = \sum_{j=1}^{p} \Big( x_j \log\big((u V)_j\big) - (u V)_j - \log(x_j!) \Big) + \sum_{i=1}^{k} \Big( (\alpha_i - 1) \log(u_i) - \frac{u_i}{\beta_i} - \alpha_i \log \beta_i - \log \Gamma(\alpha_i) \Big)$$
Gradients:
$$\frac{\partial}{\partial V_{ij}} \log L = \frac{x_j}{(u V)_j}\, u_i - u_i$$
$$\frac{\partial}{\partial u_i} \log L = \sum_{j=1}^{p} \Big( \frac{x_j}{(u V)_j}\, V_{ij} - V_{ij} \Big) + \frac{\alpha_i - 1}{u_i} - \frac{1}{\beta_i}$$
Equivalent to some NMF formulation: multiplicative updates¹
$$V_{ij} \leftarrow V_{ij} \Big( \sum_{\ell=1}^{n} \frac{x_{\ell j}}{(U V)_{\ell j}}\, u_{\ell i} \Big) \Big( \sum_{\ell=1}^{n} u_{\ell i} \Big)^{-1}$$
$$u_{\ell i} \leftarrow u_{\ell i} \Big( \sum_{j=1}^{p} \frac{x_{\ell j}}{(U V)_{\ell j}}\, V_{ij} + \frac{\alpha_i - 1}{u_{\ell i}} \Big) \Big( \sum_{j=1}^{p} V_{ij} + \beta_i^{-1} \Big)^{-1}$$
Adapt the majorization minimization algorithm
[Lefevre... 2011, Cerda and Varoquaux 2020]
¹ Efficient implementation with sparse matrices: the summations can be done only on non-zero entries of X.
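The multiplicative updates translate almost line by line into NumPy. The sketch below is a dense, batch version under assumptions: scalar hyper-parameters α = 1.1 and β = 1 shared by all components, synthetic Poisson data, and a small `eps` added for numerical safety; it is not the online, sparse implementation of the cited papers.

```python
import numpy as np

eps = 1e-12   # numerical safety for divisions and logs (assumption)

def log_posterior(X, U, V, alpha, beta):
    """Log-posterior up to terms constant in (U, V): log x! and Gamma normalizers."""
    UV = U @ V + eps
    return (np.sum(X * np.log(UV) - UV)
            + np.sum((alpha - 1) * np.log(U + eps) - U / beta))

def multiplicative_step(X, U, V, alpha, beta):
    """One update of V, then one update of U, following the formulas above."""
    ratio = X / (U @ V + eps)                                # x_lj / (UV)_lj
    V = V * (U.T @ ratio) / (U.sum(axis=0)[:, None] + eps)
    ratio = X / (U @ V + eps)
    U = (U * (ratio @ V.T) + (alpha - 1)) / (V.sum(axis=1) + 1 / beta)
    return U, V

rng = np.random.default_rng(0)
n, p, k = 50, 30, 4
rates = rng.gamma(2.0, 1.0, size=(n, k)) @ rng.gamma(1.0, 1.0, size=(k, p))
X = rng.poisson(rates).astype(float)                         # matrix of counts

U = rng.gamma(2.0, 1.0, size=(n, k))                         # positive init
V = rng.gamma(2.0, 1.0, size=(k, p))
alpha, beta = 1.1, 1.0

history = [log_posterior(X, U, V, alpha, beta)]
for _ in range(50):
    U, V = multiplicative_step(X, U, V, alpha, beta)
    history.append(log_posterior(X, U, V, alpha, beta))

print(history[0], history[-1])
```

The updates keep U and V positive without any projection or learning rate, and for α ≥ 1 each step is a majorization-minimization step, so the log-posterior should not decrease.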

Application: embedding via string form
Problem: representing non-normalized categories
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
Code: dirty-cat.github.io [Cerda and Varoquaux 2020]

Gamma-Poisson
factorization
on sub-strings counts
|{z}
3-gram1
P
|{z}
3-gram2
ol
|{z}
3-gram3
ic...
Models strings as a linear combination of substrings
11111000000000
00000011111111
10000001100000
11100000000000
11111100000000
11111000000000
police
officer
pol off
polis
policeman
policier
e
r
_
c
e
r
f
i
c
o
f
f
_
o
f
c
e
_
i
c
e
l
i
c
p
o
l
G Varoquaux 71

Gamma-Poisson factorization on sub-string counts
Each string is split into 3-grams (3-gram 1, 3-gram 2, 3-gram 3, ...)
Models strings as a linear combination of substrings
[Figure: binary matrix of 3-gram occurrences (columns: pol, lic, ice, ce_, _of, off, fic, cer, er_) for the entries police, officer, pol off, polis, policeman, policier, factorized into a documents × topics matrix and a topics × substrings matrix]
Topics × substrings: what substrings are in a latent category
Documents × topics: what latent categories are in an entry
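The count matrix that is factorized can be built with the standard library alone. The sketch below assumes overlapping character 3-grams on lower-cased strings, without padding; the helper names `ngram_counts` and `count_matrix` are hypothetical.

```python
from collections import Counter

def ngram_counts(string, n=3):
    """Count the overlapping character n-grams of a lower-cased string."""
    s = string.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def count_matrix(strings, n=3):
    """Return the entries x n-grams count matrix (list of rows) and its vocabulary."""
    counts = [ngram_counts(s, n) for s in strings]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[gram] for gram in vocabulary] for c in counts]
    return rows, vocabulary

entries = ["police", "officer", "pol off", "polis", "policeman", "policier"]
rows, vocabulary = count_matrix(entries)
```

Each row of `rows` is the vector of sub-string counts of one entry; similar strings share many non-zero columns, which is what the factorization exploits.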

Representations that extract latent categories
[Figure: activations of latent categories (columns, labels truncated in the figure: library, perator, cialist, rehouse, manager, mmunity, rescue, officer) for each category (rows)]
Categories:
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Financial Programs Manager
Capital Projects Manager
Police Sergeant

Inferring plausible feature names
[Figure: activations of latent categories for each category (rows), with columns labeled by inferred feature names]
Inferred feature names (truncated in the figure):
"ntant, assistant, library"
"ator, equipment, operator"
"dministration, specialist"
", craftsworker, warehouse"
"rossing, program, manager"
"cian, mechanic, community"
"efighter, rescuer, rescue"
"onal, correction, officer"
Categories:
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Financial Programs Manager
Capital Projects Manager
Police Sergeant
[Cerda and Varoquaux 2020]

So far:
Matrix factorization of counts (e.g. co-occurrences)
Embeds discrete objects
Better with a suitable loss
Next:
Implicit matrix factorization and losses

Word embeddings
Distributional semantics: meaning of words
“You shall know a word by the company it keeps”
Firth, 1957
Example: A glass of red ___, please
Could be wine, maybe juice?
wine and juice have related meanings
Factorization of the word × context matrix
What choice of context?
What loss?
word2vec [Mikolov... 2013a] glove [Pennington... 2014]

Word2vec: skip-gram sampling [Mikolov... 2013b]
$\{\hat{u}_w, \hat{v}_c\} = \operatorname{argmax}_{\{u_w, v_c\}} \sum_{(w, c)} \log \operatorname{softmax}(V u_w^T)_c$
where the sum runs over pairs of words (w, c) in the same window¹
$\operatorname{softmax}(z)_i = \frac{\exp z_i}{\sum_j \exp z_j}$
$u_w \in \mathbb{R}^k$: embedding of word w
$V \in \mathbb{R}^{\mathrm{card}(voc) \times k}$: $[v_c, c \in voc]$, all context words
Big sum on contexts
⇒ solved by SGD²
[Figure: center words (salad, meat, juice, wine, glass, green, red) mapped by U, the word embedding, and context words (salad, meat, juice, wine, glass, red, green) mapped by V, the context embedding]
Other view:
Language models
Prediction of words
¹ These windows are called skip-grams.
² Efficient: never build the matrix, stream directly from text.
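The skip-gram objective above can be written out directly. The sketch below is illustrative only: the function names are hypothetical, words are assumed to be already mapped to integer indices, and the objective is evaluated in full rather than optimized by SGD.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs

def log_softmax(z):
    """log softmax(z), shifted by max(z) for numerical stability."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def skipgram_log_likelihood(U, V, pairs):
    """Sum over pairs (w, c) of log softmax(V u_w)_c.

    U: (vocabulary, k) word embeddings; V: (vocabulary, k) context embeddings.
    """
    return sum(log_softmax(V @ U[w])[c] for w, c in pairs)
```

Each term requires a product with the full matrix V, hence a cost proportional to the vocabulary size; this is the motivation for cheaper approximations of the softmax.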

Word2vec: negative sampling [Mikolov... 2013b]
Costly loss: $\log \operatorname{softmax}(z)_i = \log \frac{\exp z_i}{\sum_j \exp z_j}$
Approximate¹ the huge sum in the softmax (over all the vocabulary)
Downsample it by drawing the positive (numerator)
and a few negative examples (denominator)
Negative sampling loss² [Goldberg and Levy 2014]:
$\log \sigma(v_c u_w^T) + \sum_{w'} \log \sigma(-v_c u_{w'}^T)$, summing over $n_{neg}$ words $w'$ not in the window
σ: sigmoid ($\log \sigma(z) = -\log(1 + \exp(-z))$)
¹ Related to noise-contrastive estimation, which avoids computing costly
normalizations in likelihoods [Gutmann and Hyvärinen 2010]
² Related to a matrix factorization of the mutual information in word occurrences
[Levy and Goldberg 2014]
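As a sketch, the negative-sampling objective for one (word, context) pair can be computed as follows. The function names and the convention of stacking the embeddings of the negative words in a matrix `U_neg` are assumptions made for this illustration.

```python
import numpy as np

def log_sigmoid(z):
    """log sigma(z) = -log(1 + exp(-z)), computed without overflow."""
    return -np.logaddexp(0.0, -z)

def negative_sampling_objective(u_w, v_c, U_neg):
    """log sigma(v_c . u_w) + sum over negative words w' of log sigma(-v_c . u_w').

    u_w: (k,) embedding of the center word; v_c: (k,) embedding of the context;
    U_neg: (n_neg, k) embeddings of words drawn outside the window.
    """
    positive = log_sigmoid(v_c @ u_w)
    negative = log_sigmoid(-(U_neg @ v_c)).sum()
    return positive + negative
```

The cost per pair now scales with the number of negative samples instead of the vocabulary size, at the price of extra hyper-parameters (how many negatives, and how to draw them).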

Beyond natural language: metric learning
Triplet loss
For an “anchor” a, b close to a, c far from a:
$\log \sigma(v_a^T u_b) - \log \sigma(v_a^T u_c)$
Quadruplet loss [Chen... 2017]
For a and b close by, c and d far apart:
$\log \sigma(v_a^T u_b) - \log \sigma(v_c^T u_d)$
In practice: draw¹ randomly (a, b, c) or (a, b, c, d)
Metric learning: [Bellet... 2013]
Learning embeddings with weak supervision
¹ Many strategies, e.g. “hard negative mining”. Choosing among them requires a good test set and
metric, as with SGD hyper-parameters.
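The two objectives above differ only in which pair serves as the "far" reference. A minimal sketch, with hypothetical function names and the same two embedding families (v and u) as in the formulas:

```python
import numpy as np

def log_sigmoid(z):
    """log sigma(z) = -log(1 + exp(-z)), computed without overflow."""
    return -np.logaddexp(0.0, -z)

def triplet_objective(v_a, u_b, u_c):
    """Large when b is scored close to the anchor a and c is scored far from it."""
    return log_sigmoid(v_a @ u_b) - log_sigmoid(v_a @ u_c)

def quadruplet_objective(v_a, u_b, v_c, u_d):
    """Large when (a, b) is scored as a close pair and (c, d) as a distant pair."""
    return log_sigmoid(v_a @ u_b) - log_sigmoid(v_c @ u_d)
```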

Embedding entities in knowledge graphs
Structured (graph) representation of human knowledge
e.g. DBpedia, YAGO
Challenge: relations of multiple nature
Learning embeddings of entities $\{e_i\}$ and relations $\{r_j\}$:
$e_a \sim e_b + r_c$
a model of the relation¹ [Bordes... 2013, Wang... 2017]
Then triplet / quadruplet loss
Reuse existing: conceptnet.io
¹ Richer, better models [Wang... 2014]
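The relation model $e_a \sim e_b + r_c$ can be turned into a plausibility score for a candidate fact. The sketch below uses the negative squared Euclidean distance, which is an assumption of this illustration (other distances are possible); the function name is hypothetical.

```python
import numpy as np

def translation_score(e_a, e_b, r_c):
    """Negative squared distance between e_a and e_b + r_c (higher = more plausible)."""
    return -np.sum((e_a - (e_b + r_c)) ** 2)
```

Such scores can then be plugged into a triplet or quadruplet loss, contrasting observed facts with randomly corrupted ones.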

The value of simple models
Risk of invisible overfit during search for hyperparameters and models
Complex models call for a clear utility measure with low measurement error
Many reliable labels
Matrix factorization models¹: 2 hyper-parameters:
Dimensionality k, Regularization λ
Set them to optimize representations for supervised problems
¹ Using majorization-minimization approaches to avoid tuning a learning rate

Summary – embedding discrete objects
Discrete entities lead to counting occurrences
⇒ Poisson and logistic loss (ugly logs in equations)
Word and entity embeddings
Factorization of co-occurrences in a notion of context
more generally: metric learning
Limited-data settings:
Avoid negative-sampling models (hyper-parameters)
Try to reuse representations (fastText, conceptnet.io)

Summary – matrix factorization
Builds linear representations of input
At the root of many more complex variants
Majorization-minimization solvers:
scalable and “fire and forget”

3 Method evaluation with
limited data
Less data ⇒ more difficult evaluation
Section inspired by [Bouthillier... 2021]

Evaluation of the generalization error
Focus on representation to facilitate prediction
⇒ evaluate prediction
Leaving aside representation for interpretability
Danger of reading tea leaves
Interpretation = ill defined, requires expert knowledge,
subject to confirmation bias [Lipton 2018]
Ill-conditioned problem
⇒ strong dependence on prior
⇒ self-fulfilling prophecies

3 Method evaluation with limited data
Variance in model evaluation
Reliable experimental procedures
From benchmarks to conclusion

Model evaluation
New data is required to assess generalization performance: $\mathbb{E}[l(f(X), y)]$
Split data in train and test set, typically 10%
trade-off: better learning vs better estimation
[Figure: full data split into train set and test set]
Make choices on the model:
split train, validation, and test
[Figure: full data split into train set, validation set (make model choices), and test set (evaluate model)]
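A three-way split can be sketched as follows. The function name and the default 10% fractions for the validation and test sets are assumptions of this illustration.

```python
import numpy as np

def train_val_test_split(n_samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Return disjoint index arrays (train, validation, test) from a random shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    n_val = int(round(val_fraction * n_samples))
    test = order[:n_test]                  # only used for the final evaluation
    val = order[n_test:n_test + n_val]     # used for model choices
    train = order[n_test + n_val:]         # used to fit the model
    return train, val, test
```

The test indices should be touched only once, after all model choices have been made on the validation set.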

Evaluation error: Sampling noise on test set
Sampling noise¹ for ntest = 1000:
[Figure: binomial distribution of error on test accuracy, x-axis from -10% to +10%, with the -2% to +2% range marked]
Confidence intervals:
ntest = 1 000: interval 5.7%
ntest = 10 000: interval 1.8%
ntest = 100 000: interval 0.6%
Optimizing test accuracy will explore the tails
Selecting architecture, learning rate...
overfitting the validation / test set
¹ The data at hand (e.g. the test set) is just a small sample of the full
population “in the wild”, and sampling other data will lead to other results.
[Varoquaux 2018]
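The order of magnitude of these intervals follows from the binomial law. The sketch below uses the normal approximation of a 95% interval; the accuracy of 0.7 is an assumption chosen for illustration (the slide does not state it), and the function name is hypothetical.

```python
import math

def binomial_interval_width(n_test, accuracy=0.7, z=1.96):
    """Full width of the normal-approximation 95% interval on a measured accuracy."""
    return 2 * z * math.sqrt(accuracy * (1 - accuracy) / n_test)

for n_test in (1_000, 10_000, 100_000):
    print(n_test, f"{100 * binomial_interval_width(n_test):.1f}%")
```

The width shrinks as $1 / \sqrt{n_{test}}$: a ten-fold larger test set narrows the interval only by a factor of about 3.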

Evaluation error: Sampling noise on test set
[Figure: standard deviation of accuracy (%) as a function of test set size (10^2 to 10^6)]
In theory, from a binomial: Binom(n', 0.66), Binom(n', 0.95), Binom(n', 0.91)
In practice, random splits: Glue-RTE BERT (n'=277), Glue-SST2 BERT (n'=872), CIFAR10 VGG11 (n'=10000)
[Bouthillier... 2021]

Evaluation is a bottleneck – in publications
[Figure: accuracy of published results by year (2012 to 2020) on cifar10 and sst2, showing non-'SOTA' results and whether improvements are significant or non-significant]
NLP: Glue sentiment-analysis benchmark (ntest = 1.8k)
Vision: object-recognition benchmark (ntest = 10k)
Published improvements compared to benchmark variance

Evaluation is a bottleneck – in Kaggle competitions
[Figure, one panel per competition: observed improvement in score, comparing the improvement of the top model on the 10% best with the evaluation noise between public and private sets]
Lung cancer classification (test size: max 1K): smaller improvements than noise, diminishing returns
Schizophrenia classification (test size: 120): diminishing returns
Lung tumor segmentation (test size: max 6k): poorer score on private set, overfit
Nerve segmentation (test size: 5.5K): actual improvement
[Varoquaux and Cheplygina 2021]

The full benchmarking pipeline
New data to assess generalization performance: $\mathbb{E}[l(f(X), y)]$
Split out test set
Split out validation set
Choose hyper-parameters on validation set
Measure performance on test set
[Figure: full data split into train set, validation set (make model choices), and test set (evaluate model)]
Rampant overfit of validation set [Makarova... 2021]

Sources of variance in a machine-learning benchmark
[Figure: contribution of each source of variation, per case study (bert-rte, bert-sst2, bio- (label truncated), bio-task2, segmentation, vgg, average)]
Learning algorithm: numerical noise, dropout, weights init, data order, data augment, data (bootstrap)
Hyperparameter optimization (HOpt): noisy grid search, random search, Bayes opt
Model-evaluation results are most affected by:
1. Arbitrary split into train and test
2. Random (arbitrary) parameters
3. Uncertainty in optimized hyper-parameters

Summary – variance in benchmarks
Evaluating generalization is limited by ntest
ntest = 10 000 ⇒ ±.9%, ntest = 100 000 ⇒ ±.3%
Benchmark hyper-parameter choice
Careful not to overfit hyper-parameters
Variance in machine-learning benchmarks
1. Data splits
2. Random seeds
3. Hyper-parameter choice
...

Settings: what are we benchmarking
prediction rule: $f : \mathcal{X} \to \mathcal{Y}$
training procedure: given data $(X, y) \in (\mathcal{X} \times \mathcal{Y})^n$, outputs a prediction rule
hyper-parameters: parameters not set by the procedure
full training pipeline: hyper-parameter choice + training procedure
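These definitions can be made concrete with a toy example. The sketch below uses ridge regression as the training procedure and a single validation split for the hyper-parameter choice; both are assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def train_ridge(X, y, alpha):
    """Training procedure: data and a hyper-parameter in, prediction rule out."""
    n_features = X.shape[1]
    w = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)
    return lambda X_new: X_new @ w  # the prediction rule f

def full_pipeline(X, y, alphas=(0.01, 1.0, 100.0), val_fraction=0.25, seed=0):
    """Full training pipeline: hyper-parameter choice + training procedure."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, train = order[:n_val], order[n_val:]

    def validation_error(alpha):
        f = train_ridge(X[train], y[train], alpha)
        return np.mean((f(X[val]) - y[val]) ** 2)

    best_alpha = min(alphas, key=validation_error)
    # Refit on all the available data with the selected hyper-parameter
    return train_ridge(X, y, best_alpha), best_alpha
```

Benchmarking the prediction rule means scoring the returned function on a test set; benchmarking the pipeline means rerunning `full_pipeline` on new data and seeds.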

Benchmarking a prediction rule vs a training pipeline
Benchmarking a prediction rule
Before putting in production
Fixed training set; evaluation limited by test set size
Benchmarking a training pipeline
To conclude on good training procedures
Useless to tune random seeds
(for weights init, dropout, data augmentation):
they will not carry over to new training data

Benchmarking a training pipeline
[Figure: sources of variation per case study (bert-rte, bert-sst2, bio-task2, segmentation, vgg, average), for the learning algorithm (numerical noise, dropout, weights init, data order, data augment, data bootstrap) and for hyperparameter optimization (noisy grid search, random search, Bayes opt)]
Reduce error and gauge variance
Data sampling: multiple train-test splits (cross-validation)
Arbitrary choices (seeds): randomize them all
Hyper-parameters: hyper-parameter optimization, too expensive to randomize
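Randomizing data splits and seeds together can be sketched with a small loop. The function `benchmark` and the toy `threshold_classifier` are hypothetical; any function that trains on the given indices with the given seed and returns a test score could be plugged in.

```python
import numpy as np

def benchmark(fit_and_score, n_samples, n_repeats=20, test_fraction=0.2, seed=0):
    """Mean and standard deviation of the score over random splits and model seeds."""
    rng = np.random.default_rng(seed)
    n_test = int(test_fraction * n_samples)
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(n_samples)         # new random train-test split
        model_seed = int(rng.integers(2**31 - 1))  # new seed for init, dropout, ...
        scores.append(fit_and_score(order[n_test:], order[:n_test], model_seed))
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1)

# Toy data and model, for illustration only
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(int)

def threshold_classifier(train, test, model_seed):
    # This toy model is deterministic, so model_seed is unused here
    threshold = x[train].mean()
    return np.mean((x[test] > threshold).astype(int) == y[test])

mean_score, std_score = benchmark(threshold_classifier, n_samples=200)
```

The standard deviation returned here is the quantity to compare against a claimed improvement between two pipelines.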

Hyper-parameter optimization procedures
Random search [Bergstra and Bengio 2012]
(preferable to grid search for more than 2 hyper-parameters)
[Figure: grid search vs randomized search over hyperparameter 1 (important) and hyperparameter 2 (unimportant), with the region of good hyperparameters]
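A random search takes only a few lines. In this sketch the hyper-parameters are drawn log-uniformly within given bounds, which is an assumption that suits scale-like parameters; the function name and the example objective are hypothetical.

```python
import numpy as np

def random_search(objective, bounds, n_trials=20, seed=0):
    """Draw hyper-parameters log-uniformly within bounds and keep the best draw."""
    rng = np.random.default_rng(seed)
    best_params, best_value = None, np.inf
    for _ in range(n_trials):
        params = {name: float(np.exp(rng.uniform(np.log(low), np.log(high))))
                  for name, (low, high) in bounds.items()}
        value = objective(params)  # e.g. the error on the validation set
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value
```

Unlike a grid, every trial explores a new value of each hyper-parameter, so no trial is wasted on repeating values of the important one while varying an unimportant one.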

Bayesian optimization
Sub-optimal hyper-parameters on models routinely
lead to invalid conclusions
See refs in [Bouthillier... 2021]

Benchmarking with hyper-parameters
Difficulty: measure suboptimality and variance
due to hyper-parameters
Ideal strategy: multiple hyper-parameter
optimizations with different seeds (costly)
In practice: set hyper-parameters once, then
randomize model seeds and data splits
Counterintuitive: more randomization decorrelates
sources of error, and thus improves benchmarks

Summary – better measures
Benchmarking a prediction rule
≠ benchmarking a training procedure
For training procedures: randomize everything
Data splits, all random procedures
Hyper-parameter optimization outside randomization
is suboptimal, but randomization after helps

Statistical tests for ML benchmarks
Null hypothesis testing – p-value: the probability of
observing results at least as extreme if the null hypothesis were true
Typical null: model comparison
models p1 and p2 give the same expected error

Statistical tests: single test set
(comparing prediction rules)
[Figure: full data split into train set and test set]
Simple distribution of metrics,
e.g. accuracy: binomial
Safer to use permutations,
for correlated errors across prediction rules
[Bandos... 2005]
Sample null distribution by randomly switching
predictions from p1 and p2.
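Switching the predictions of p1 and p2 on a test sample flips the sign of the difference in their per-sample scores, which gives a simple permutation test. The sketch below is an illustration for a metric that averages over samples (such as accuracy), not the procedure of the cited paper; the function name is hypothetical.

```python
import numpy as np

def paired_permutation_test(correct_1, correct_2, n_permutations=2_000, seed=0):
    """Two-sided p-value for the difference in mean score of two prediction rules.

    correct_1, correct_2: per-sample scores (e.g. 1 if correct, 0 otherwise)
    of the two rules on the same test set.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(correct_1, dtype=float) - np.asarray(correct_2, dtype=float)
    observed = abs(diff.mean())
    # Randomly switching the two predictions on a sample flips the sign of its difference
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (1 + np.sum(null >= observed)) / (1 + n_permutations)
```

Because the two rules are compared sample by sample, errors that both rules make on the same samples cancel out, which a test on two independent binomials would ignore.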

Statistical tests: cross-validation
(comparing training pipelines)
[Figure: full data repeatedly split into train set and test set]
Challenge: folds are not independent¹ [Dietterich 1998]
t-test / Wilcoxon across folds are not valid
Correct for dependence across folds²
5x2cv: repeat 5 times randomized 2-fold
Use a t-test with 5 degrees of freedom [Dietterich 1998]
Corrected resampled t-test statistic
Formula for fold correlation [Nadeau and Bengio 2003]
¹ Train sets overlap, and often test sets also do.
² Does not account for sources of variance other than data sampling, e.g.
random seeds, hyper-parameters.
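The corrected resampled t-statistic inflates the variance of the naive t-test by a term that depends on the ratio of test to train sizes. A sketch of the statistic as commonly stated from [Nadeau and Bengio 2003], with a hypothetical function name:

```python
import math
import statistics

def corrected_resampled_t(differences, n_train, n_test):
    """Corrected resampled t-statistic on per-split score differences.

    differences: score of pipeline 1 minus score of pipeline 2, one per split.
    The naive statistic would use 1 / J only; the n_test / n_train term
    accounts for the overlap between training sets.
    """
    J = len(differences)
    mean = statistics.fmean(differences)
    variance = statistics.variance(differences)  # sample variance (ddof = 1)
    return mean / math.sqrt((1 / J + n_test / n_train) * variance)
```

The statistic is then compared to a t distribution with J - 1 degrees of freedom. It is always smaller in magnitude than the naive one, so fewer differences are declared significant.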

Statistical tests: across datasets
(more general claims on training pipelines)
Challenge:
metrics not comparable across datasets
⇒ Tests based on rank statistics
Wilcoxon signed rank test
Tests how often p1 outperforms p2 across datasets
[Demšar 2006]

Statistical tests: multiple pipelines across datasets
(compare multiple training pipelines)
Challenge: multiple comparisons¹
The Wilcoxon-Holm approach
Pairwise comparisons
+ Bonferroni-Holm correction
The Friedman-Nemenyi approach²
1. Friedman test across all pipelines (omnibus test)
2. Nemenyi test gives a critical difference
Critical difference diagrams
[Figure: critical difference diagram on accuracy (rank), axis from 1 to 5, with mean ranks clf1: 4.2000, clf2: 3.7667, clf4: 3.5000, clf5: 2.0000, clf3: 1.5333]
¹ If we do many tests, some will show large differences by chance.
² The Holm approach can be more interesting when considering only
comparisons to one reference classifier.
[Demšar 2006]
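The Holm correction used in the Wilcoxon-Holm approach is a step-down procedure on the sorted p-values. A minimal sketch in pure Python, with a hypothetical function name:

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: for each p-value, is its null hypothesis rejected?"""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared to alpha / m, the next to alpha / (m - 1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject
```

The p-values would typically come from pairwise Wilcoxon signed rank tests between pipelines across datasets.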

Statistical tests: multiple pipelines across datasets
(compare multiple training pipelines)
Challenge: multiple comparisons1
Replicability analysis
Perform dataset-level pairwise tests
Combine by testing2:
“Does p1 perform better than p2 on at least u
datasets?”
More powerful than [Demšar 2006]
for a small number of datasets
1If we do many tests, some will show large differences by chance.
2Using a partial conjunction multiple-testing procedure, as described in
[Dror... 2017]
[Dror... 2017]

Statistical tests: beyond null-hypothesis testing
Sample size is a problem
Across datasets:
significance typically requires 15 datasets
In a dataset (repeating folds, seeds...):
many repetitions make any difference significant¹
Underpowered experiments are no evidence
Shortcomings of null-hypothesis testing
Significance decreases with more comparisons²
Statistical significance ≠ practical significance
¹ Though as the total test-set size is limited, they do not bring more evidence
for generalization.
² FDR (False Discovery Rate) attempts to solve this.
[Demšar 2008]

Statistical tests: accounting for effect sizes
Neyman-Pearson view of hypothesis testing
Two hypotheses, H0 and H1
H1: p1 outperforms p2 by a margin¹
Which is most likely?
[Figure: distributions under H0 and H1]
Requires the choice of the margin
Related to superiority testing in clinical trials
[Lesaffre 2008]
¹ Related to the rejection region in the Neyman-Pearson lemma.

Pragmatic compromises
Test on P(p1 > p2) > δ
δ > .5: Neyman-Pearson view
Evaluate P(p1 > p2) by resampling
Randomize everything: data splits, seeds,...
Gaussian approximation: amounts to comparing
differences to standard deviations
Not an inference on the expected difference in performance¹
¹ Unlike the standard error, the standard deviation does not shrink to zero with the
number of resamplings.
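Given the scores of two pipelines on the same randomized runs (same split and seed for both), P(p1 > p2) can be estimated by simple counting. A sketch under that pairing assumption, with hypothetical function names:

```python
import numpy as np

def probability_of_outperforming(scores_1, scores_2):
    """Fraction of paired runs on which pipeline 1 scores higher than pipeline 2."""
    scores_1, scores_2 = np.asarray(scores_1), np.asarray(scores_2)
    return float(np.mean(scores_1 > scores_2))

def standardized_difference(scores_1, scores_2):
    """Mean difference in units of the standard deviation of the differences."""
    diff = np.asarray(scores_1) - np.asarray(scores_2)
    return float(diff.mean() / diff.std(ddof=1))
```

The first quantity is compared to the chosen threshold δ; the second is the Gaussian-approximation counterpart, which divides by a standard deviation rather than a standard error and therefore does not grow with the number of runs.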

Summary – concluding from benchmarks
Account for variance
Null-hypothesis testing:
no t-test on cross-validation!
Don’t misinterpret p-values:
- Not significant: more data could change that
- Significant: difference may be trivial
Detect practical differences:
difference in performance vs standard deviation

Better experimental procedures
Crack the black box open
A prediction score is seldom insightful
Ablation studies: remove/change atomic elements
Learning curves
Better benchmarking in these
Tune hyper-parameters to the same quality
Randomize everything
Account for variance in conclusions

Summary – Benchmarking with limited data
Reminder: your validation measure is intrinsically unreliable
(sampling noise)
An arbitrary choice (random seed) may give
seemingly-good results that do not generalize
Sample many choices
Account for resulting variance in conclusions
Distribution of errors under a binomial law, by number of available samples:
1000 samples: -2% to +2%
300 samples: -4% to +4%
200 samples: -5% to +5%
100 samples: -7% to +7%
30 samples: -15% to +12%

Representation learning in limited-data settings
Good representations help learning
Enable the use of simpler models
better approximation representation, less estimation error
Simple supervised learning of representations
pretext tasks, stacking, factorizing multi-output
Matrix factorizations
Extract representations without labels
MM solvers are “fire and forget”
Careful benchmarking is crucial
Optimistic flukes will not generalize
@GaelVaroquaux

References I
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras,
B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers
from multi-site resting-state data: an autism-based example.
NeuroImage, 147:736, 2017.
A. Achille and S. Soatto. Emergence of invariance and
disentanglement in deep representations. The Journal of Machine
Learning Research, 19(1):1947–1980, 2018.
A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive
to differences in areas for comparing ROC curves from a paired
design. Statistics in medicine, 24:2873, 2005.
A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for
feature vectors and structured data. arXiv preprint
arXiv:1306.6709, 2013.
J. Bergstra and Y. Bengio. Random search for hyper-parameter
optimization. Journal of Machine Learning Research, 13:281, 2012.

References II
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko.
Translating embeddings for modeling multi-relational data. In
Advances in Neural Information Processing Systems, pages
2787–2795, 2013.
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk,
J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ...
Accounting for variance in machine learning benchmarks.
Proceedings of Machine Learning and Systems, 3, 2021.
D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux.
Semi-supervised factored logistic regression for
high-dimensional neuroimaging data. In Advances in Neural
Information Processing Systems, page 3348, 2015.
J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122,
2004.

References III
J.-F. Cardoso. Dependence, correlation and gaussianity in
independent component analysis. Journal of Machine Learning
Research, 4:1177, 2003.
P. Cerda and G. Varoquaux. Encoding high-cardinality string
categorical variables. IEEE Transactions on Knowledge and Data
Engineering, 2020.
W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep
quadruplet network for person re-identification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, page 403, 2017.
L. Chizat and F. Bach. On the global convergence of gradient descent
for over-parameterized models using optimal transport. Advances
in Neural Information Processing Systems, 31:3036–3046, 2018.

References IV
L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable
programming. Advances in Neural Information Processing
Systems, 2019.
J. Demšar. Statistical comparisons of classifiers over multiple data
sets. The Journal of Machine Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine
learning. In Workshop on Evaluation Methods for Machine
Learning in conjunction with ICML, page 65. Citeseer, 2008.
P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk
comparison of ordinary least squares vs ridge regression. The
Journal of Machine Learning Research, 14:1505, 2013.
T. G. Dietterich. Approximate statistical tests for comparing
supervised classification learning algorithms. Neural
computation, 10(7):1895–1923, 1998.

References V
R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis
for natural language processing: Testing significance with
multiple datasets. Transactions of the Association for
Computational Linguistics, 2017.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the
bias/variance dilemma. Neural computation, 4(1):1–58, 1992.
Y. Goldberg and O. Levy. word2vec explained: Deriving Mikolov et
al.’s negative-sampling word-embedding method. arXiv:1402.3722,
2014.
P. K. Gopalan, L. Charlin, and D. Blei. Content-based
recommendations with poisson factorization. In Advances in
Neural Information Processing Systems, page 3176, 2014.

References VI
M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In
Proceedings of the International Conference on Artificial
Intelligence and Statistics, page 297, 2010.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free
theory of nonparametric regression, volume 1. Springer, 2002.
K. He, Y. Wang, and J. Hopcroft. A powerful generative model using
random weights for the deep image representation. Advances in
Neural Information Processing Systems, 29:631–639, 2016.
D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge
regression. Foundations of Computational Mathematics, 14, 2014.
A. Hyvärinen and E. Oja. Independent component analysis:
algorithms and applications. Neural networks, 13(4):411, 2000.
A. J. Izenman. Reduced-rank regression for the multivariate linear
model. Journal of multivariate analysis, 5:248, 1975.

References VII
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child,
S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural
language models. arXiv preprint arXiv:2001.08361, 2020.
T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent
semantic analysis. Discourse processes, 25:259, 1998.
A. Lefevre, F. Bach, and C. Févotte. Online algorithms for
nonnegative matrix factorization with the Itakura-Saito
divergence. In Applications of Signal Processing to Audio and
Acoustics (WASPAA), page 313. IEEE, 2011.
E. Lesaffre. Superiority, equivalence, and non-inferiority trials.
Bulletin of the NYU hospital for joint diseases, 66(2), 2008.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix
factorization. In Advances in neural information processing
systems, page 2177, 2014.

References VIII
Z. C. Lipton. The mythos of model interpretability: In machine
learning, the concept of interpretability is both important and
slippery. Queue, 2018.
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix
factorization and sparse coding. Journal of Machine Learning
Research, 11:19–60, 2010.
J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision
processing. Foundations and Trends® in Computer Graphics and
Vision, 8(2-3):85–283, 2014.

References IX
A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause,
M. Seeger, and C. Archambeau. Overfitting in Bayesian
optimization: an empirical study and early-stopping solution.
arXiv preprint arXiv:2104.08166, 2021.
S. Mallat. Understanding deep convolutional networks.
Philosophical Transactions of the Royal Society A, 374:20150203,
2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic
subsampling for factorizing huge matrices. IEEE Transactions on
Signal Processing, 66:113, 2017.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting
universal representations of cognition across brain-imaging
studies. arXiv preprint arXiv:1809.06035, 2018.

References X
P. Mianjy, R. Arora, and R. Vidal. On the implicit bias of dropout. In
International Conference on Machine Learning, pages 3540–3548.
PMLR, 2018.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter, 3:27, 2001.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of
word representations in vector space. In ICLR Workshop Papers.
2013a.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems, page 3111, 2013b.

References XI
G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of
linear regions of deep neural networks. In Advances in neural
information processing systems, page 2924, 2014.
C. Nadeau and Y. Bengio. Inference for the generalization error.
Machine learning, 52(3):239–281, 2003.
P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever.
Deep double descent: Where bigger models and more data hurt.
ICLR, 2020.
B. Neal, S. Mittal, A. Baratin, V. Tantia, M. Scicluna, S. Lacoste-Julien,
and I. Mitliagkas. A modern take on the bias-variance tradeoff in
neural networks. arXiv preprint arXiv:1810.08591, 2018.
E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering
transform: Deep hybrid networks. In Proceedings of the IEEE
international conference on computer vision, page 5618, 2017.

References XII
J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for
word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP), page
1532, 2014.
M. E. Peters, S. Ruder, and N. A. Smith. To tune or not to tune?
adapting pretrained representations to diverse tasks.
Proceedings of the 4th Workshop on Representation Learning for
NLP (RepL4NLP), 2019.
M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint
prediction of multiple scores captures better individual traits
from brain images. Neuroimage, 158:145–154, 2017a.
M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions
from neuroimaging: assessing reduced-rank linear models. In
2017 International Workshop on Pattern Recognition in
Neuroimaging (PRNI), pages 1–4. IEEE, 2017b.

References XIII
A. Rahimi and B. Recht. Weighted sums of random kitchen sinks:
replacing minimization with randomization in learning. In NIPS,
pages 1313–1320. Citeseer, 2008.
S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression:
Bias-variance decompositions, covariance penalties, and
prediction error estimation. Journal of the American Statistical
Association, pages 1–14, 2018.
C. Sun, X. Qiu, Y. Xu, and X. Huang. How to fine-tune BERT for text
classification? China National Conference on Chinese
Computational Linguistics, 2019.
I. Ustyuzhaninov, W. Brendel, L. A. Gatys, and M. Bethge. Texture
synthesis using shallow convolutional networks with random
filters. arXiv preprint arXiv:1606.00021, 2016.
G. Varoquaux. Cross-validation failure: small sample sizes lead to
large error bars. Neuroimage, 180:68–77, 2018.

References XIV
G. Varoquaux and V. Cheplygina. How I failed machine learning in
medical imaging–shortcomings and recommendations. arXiv
preprint arXiv:2103.10292, 2021.
Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding:
A survey of approaches and applications. IEEE Transactions on
Knowledge and Data Engineering, 29(12):2724–2743, 2017.
Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding
by translating on hyperplanes. AAAI Conference on Artificial
Intelligence, 2014.
