Representation learning
in limited-data settings
Gaël Varoquaux
Limited-data settings
n is to be compared to:
a measure of the signal-to-noise ratio
the dimensionality p of the data
Deep learning is hard in small-sample regimes
But we can borrow ideas
This talk: No silver bullet,
many simple (shallow) tricks
G Varoquaux 1
Small-n problems are important
83% of data scientists1 never have n > 1M
n is often small for applications such as medicine
Bigger is better (how to not use this talk)
Get more data (pool related datasets)
Find a related problem and try transfer
This talk: data that differs from common sources
1www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets
G Varoquaux 2
Small-n problems need guiding principles
Selecting architecture, learning rate...
A deep architecture is validated by its measured accuracy
" less data =⇒ poorer validation
more in last part of this talk
Need for guiding principles
This talk: connecting deep learning to Good
Old-Fashioned Machine Learning
G Varoquaux 3
Outline
1 Representations for machine learning
Finite-sample supervised learning
Learning with representations
Supervised learning of representations
Over-parametrized representation learning
2 Matrix factorization and its variants
For signals
For discrete objects
3 Method evaluation with limited data
Variance in model evaluation
Reliable experimental procedures
From benchmarks to conclusion
G Varoquaux 4
1 Representations for machine
learning
Defining the notion of representations
Their use for supervised learning
1 Representations for machine learning
Finite-sample supervised learning
Learning with representations
Supervised learning of representations
Over-parametrized representation learning
Settings: supervised learning
Given n pairs (x, y) ∈ X × Y drawn i.i.d.
find a function f : X → Y such that f (x) ≈ y
Notation: $\hat y \overset{\text{def}}{=} f(x)$
Empirical risk minimization
Loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
Estimation of f: $f^\star = \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}\big[\ell(\hat y, y)\big]$
This course: how to choose good function classes $\mathcal{F}$
G Varoquaux 7
Example: finite-sample estimation of f
Data generated with a 9th-order polynomial + noise
Fit polynomials of various degrees
[Figure: fits of degree 1, 2, 5, 9 overlaid on the truth]
Model too simple: underfit
Model too complex: overfit
G Varoquaux 8
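A minimal scikit-learn sketch of this example, on hypothetical data drawn from a 9th-order polynomial: the train fit always improves with the degree, while the cross-validated score degrades once the model overfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data: 9th-order polynomial + noise (stand-in for the slide's figure)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = np.polyval(rng.normal(size=10), x[:, 0]) + 0.3 * rng.normal(size=50)

for degree in (1, 2, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(x, y).score(x, y)                # always improves with degree
    test_r2 = cross_val_score(model, x, y, cv=5).mean()   # degrades when overfitting
    print(degree, train_r2, test_r2)
```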
Theory: the generalization error
Generalization error of a prediction function f:
Notation: $\mathcal{E}(f) \overset{\text{def}}{=} \mathbb{E}\big[\ell(y, f(x))\big]$
Finite-sample regime
Ideally: $f^\star = \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}\big[\ell(f(x), y)\big]$
In practice: $\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(f(x_i), y_i)$
$\mathcal{E}(\hat f) \ge \mathcal{E}(f^\star)$
G Varoquaux 9
Theory: decomposing the generalization error
Assuming $y = g(x) + e$, with $e$ random and $\mathbb{E}[e] = 0$,
the generalization error of $\hat f$ is:
$\mathcal{E}(\hat f) = \mathbb{E}\big[\ell(g(x) + e, \hat f(x))\big] = \mathcal{E}(g) + \big(\mathcal{E}(f^\star) - \mathcal{E}(g)\big) + \big(\mathcal{E}(\hat f) - \mathcal{E}(f^\star)\big)$
Bayes rate $\mathcal{E}(g) = \mathbb{E}\big[\ell(g(x) + e, g(x))\big]$: the best possible prediction
Due to the noise e; cannot be avoided
Approximation error $\mathcal{E}(f^\star) - \mathcal{E}(g)$: $g \notin \mathcal{F}$, our model is wrong
Decreases for larger $\mathcal{F}$
Empirical lower bound of $\mathcal{E}(f^\star)$: train error
Estimation error $\mathcal{E}(\hat f) - \mathcal{E}(f^\star)$: sampling noise on the train data, $\hat f \neq f^\star$
A finite-sample problem
Decreases as n grows; increases for larger $\mathcal{F}$
Guesstimate: difference between train and test error
G Varoquaux 10
Example: polynomial regression degree
$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i \ell(f(x_i), y_i)$
Degree 9, small n: no approximation error, large estimation error
Function class $\mathcal{F}$ not restrictive enough
Degree 1, large n: small estimation error, large approximation error
Function class $\mathcal{F}$ too restrictive
G Varoquaux 11
Gauging overfit vs underfit: learning curves
[Figure: generalization and training error vs number of samples (100 to 1000), for polynomial degrees 9 and 1; overfit region at small n, underfit (or Bayes rate?) plateau at large n]
Estimation error ∼ gap between train and test error
Simpler models reach the asymptotic regime faster (smaller “sample complexity”)
But can underfit
sklearn.model_selection.learning_curve
G Varoquaux 12
Gauging overfit vs underfit: validation curves
[Figure: generalization and training error vs polynomial degree (1 to 15)]
sklearn.model_selection.validation_curve
Reveals underfits
G Varoquaux 13
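Both curves can be computed directly with scikit-learn; a minimal sketch on the same kind of toy polynomial data (the data-generation step is an assumption, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.polyval(rng.normal(size=10), X[:, 0]) + 0.3 * rng.normal(size=300)

model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))

# Learning curve: train/test score vs number of samples
sizes, train_sc, test_sc = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Validation curve: train/test score vs model complexity (polynomial degree)
train_sc, test_sc = validation_curve(
    model, X, y, param_name="polynomialfeatures__degree",
    param_range=np.arange(1, 16), cv=5)
```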
Linear models for limited-data settings
In high-dimensional limited-data settings,
linear models are often the best choice
For p-dimensional data, x ∈ Òp,
they have p parameters
n ∼ 200 000
Inpatient Mortality, AUROC (95% CI)    Hospital A          Hospital B
Deep learning                          0.95 (0.94–0.96)    0.93 (0.92–0.94)
Baseline (logistic regression)         0.93 (0.92–0.95)    0.91 (0.89–0.92)
G Varoquaux 14
Theory: Approximating with linear predictors
Linear predictor¹: $\hat y = x^T w$, $w \in \mathbb{R}^p$
Data model: $y = x^T w^\star + \delta(x) + e$, with $\mathbb{E}[e] = 0$; $x^T w^\star$: best linear predictor
Ridge estimator:
$\hat w = \operatorname*{argmin}_w \|y_{\text{train}} - X_{\text{train}}^T w\|_2^2 + \lambda \|w\|_2^2$
Error compared to the best linear predictor: [Hsu... 2014, sec 2.5]
$\mathbb{E}\big[\|y - x^T \hat w\|_2^2\big] = \mathbb{E}\big[\|y - x^T w^\star\|_2^2\big] + o\big(\sigma^2 p / n_{\text{train}}\big)$
Random design analysis can characterize the generalization error without assuming a correct data-generating model (mis-specified model) [Hsu... 2014, Rosset and Tibshirani 2018]
Approximation error: data not linearly generated ⇒ craft more features
Estimation error: curse of dimensionality ⇒ limit the number of features
¹Predictor, not model: we do not assume it is a data-generating process.
G Varoquaux 15
Example: extrapolating sea level (tides)
Predict sea level as a function of time
Test outside of the observed range¹
Covariates: polynomial regression, dim = 10, 100, 1000
Covariates: sines and cosines basis, dim = 10, 100, 1000
Choice of covariates / basis / signal representation
⇒ huge difference on approximation error
⇒ huge difference on generalization error
¹Technically, this is not in our theory: test set ≠ train set.
G Varoquaux 16
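A sketch of the point above: the same linear estimator, but two different covariate bases, on a toy periodic signal (the signal and frequencies are assumptions, chosen so that the sine/cosine basis contains the truth).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

def signal(t):
    return np.sin(3 * t[:, 0]) + 0.5 * np.sin(7 * t[:, 0])

rng = np.random.default_rng(0)
t_train = rng.uniform(0, 10, size=(200, 1))
t_test = np.linspace(10, 12, 100)[:, None]          # outside the observed range
y_train = signal(t_train) + 0.1 * rng.normal(size=200)

def sines_cosines(t, n_freq=10):
    # Sines and cosines at a range of frequencies: a basis adapted to periodic signals
    freqs = np.arange(1, n_freq + 1)
    return np.hstack([np.sin(freqs * t), np.cos(freqs * t)])

models = {
    "polynomial, dim=11": make_pipeline(PolynomialFeatures(10), Ridge(1e-3)),
    "sines+cosines, dim=20": make_pipeline(FunctionTransformer(sines_cosines), Ridge(1e-3)),
}
for name, model in models.items():
    model.fit(t_train, y_train)
    mse = np.mean((model.predict(t_test) - signal(t_test)) ** 2)
    print(name, mse)    # the periodic basis extrapolates, the polynomial does not
```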
Summary – minimizing a generalization error
$\hat y = f(x)$, f chosen in $\mathcal{F}$
to minimize the observed error $\sum_{i \in \text{train}} \ell(f(x_i), y_i)$
Generalization error:
- approximation error ⇒ $\mathcal{F}$ adapted to the data
- estimation error ⇒ $\mathcal{F}$ small
Limited-data settings
Linear models are often the best option when p is large compared to n
A good choice of covariates is crucial
G Varoquaux 17
1 Representations for machine learning
Finite-sample supervised learning
Learning with representations
Supervised learning of representations
Over-parametrized representation learning
Representations to build F
Settings
z = r(x): representation of the data, $z \in \mathbb{R}^k$
Predictor $f : x \to \hat y = h_w(r(x))$
Function composition: “depth”
Benefits
For expressiveness: composition ≫ basis expansion
Composing L rectifying functions on intermediate representations of dimension k gives $O\big((k/p)^{p(L-1)} k^p\big)$ linear regions;
basis expansion + a linear predictor gives O(k)
Exponential in depth, linear in dimension [Montufar... 2014]
For multi-task: sharing representations across tasks (y multidimensional)
For limited data: $h_w(z) = w^T z$, a linear predictor
A good choice of z can decrease sample complexity
Transfer: r is learned on large data; a simple h is used.
G Varoquaux 19
Representations to keep only the “useful information”
Formalize
How a representation z should:
keep information on the output y
lose non-useful information
G Varoquaux 20
Background: Information theory
Entropy = amount of information in x
$H(x) = -\mathbb{E}_p[\log p(x)]$
Equi-probable distribution = high entropy
Peaked (uneven) distribution = low entropy
Mutual information between x and y
$I(x; y) = H(x) + H(y) - H(x, y)$
$x \perp\!\!\!\perp y$ (independent) ⇔ $I(x; y) = 0$
independence ⇔ $p(x, y) = p(x)\,p(y)$
$H(x, y) = -\mathbb{E}_{(x,y)}\big[\log p(x, y)\big] = -\mathbb{E}_{(x,y)}\big[\log p(x) + \log p(y)\big] = -\mathbb{E}_x\big[\log p(x)\big] - \mathbb{E}_y\big[\log p(y)\big] = H(x) + H(y)$
G Varoquaux 21
Theory: information in representations
A representation z of x is sufficient for y if $y \perp\!\!\!\perp x \mid z$,
or equivalently if I(z; y) = I(x; y)
x, z, y form a Markov chain if $P(y \mid x, z) = P(y \mid z)$:
x → z → y
Data processing inequality: I(x; y) ≤ I(x; z)
A sufficient representation z is minimal when
I(x; z) is smallest among sufficient
representations
G Varoquaux 22
[Achille and Soatto 2018]
Nuisances and invariances
A nuisance n: I(x; n) ≥ 0, but I(y; n) = 0
Representation z is invariant to the nuisance n
if $z \perp\!\!\!\perp n$, i.e. I(z; n) = 0  ⇒ we want I(z; n) low
In a Markov chain x → z1 → z2 · · · → zL → y
If z is a sufficient representation for y,
I(z; n) ≤ I(z; x) − I(x; y)
Communication bottleneck: I(z1; z2) < I(z1; x)
⇒ I(z2; n) ≤ I(z1; z2) − I(x; y)
Stacking increases invariance
G Varoquaux 23
[Achille and Soatto 2018]
Examples of invariances & representations
Illustrate
Ingredients of well-known representations
& their links to invariances
G Varoquaux 24
Invariant representations on a continuous space
Signal $s_t$
Shift-invariant representation = Fourier basis
Fourier transform: $\mathcal{F}(s)_f = \sum_t e^{-i f t} s_t$ (complex i)
Shifting the signal: $s_t \to s'_t = s_{t+k}$
$\mathcal{F}(s')_f = \sum_t e^{-i f t} s_{t+k} = \sum_t e^{-i f (t-k)} s_t = e^{i k f} \sum_t e^{-i f t} s_t = e^{i k f} \mathcal{F}(s)_f$ → change in phase
An orthonormal basis of shift-invariant vectors
Local deformations = Wavelets
Locally equivalent to the Fourier basis, but without the global extent
Decimated wavelets: isometric transform of the signal; higher scales lose shift invariance
Redundant wavelets: increase the dimensionality; good shift invariance
G Varoquaux 25
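A quick numerical check of the derivation above with NumPy: a circular shift of the signal multiplies each Fourier coefficient by a phase $e^{ikf}$, so the modulus of the Fourier transform is a shift-invariant representation.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=128)
k = 5
s_shifted = np.roll(s, -k)                      # s'_t = s_{t+k} (circular shift)

F = np.fft.fft(s)
F_shifted = np.fft.fft(s_shifted)
f = 2 * np.pi * np.fft.fftfreq(len(s))          # angular frequencies

assert np.allclose(F_shifted, np.exp(1j * k * f) * F)   # only the phase changes
assert np.allclose(np.abs(F_shifted), np.abs(F))        # the modulus is invariant
```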
Representations invariant to rich deformations
Scaling
Rotations
Deformations
Ingredients
Modulus of wavelet / Fourier transform
⇒ non-linearity & filter banks (convolutions)
+ stacking (repeating simple invariants)
Scattering transform
Derived from first principles
Building first-order invariants
Convolutional networks
Learned from data
Pooling across pixels (eg max)
G Varoquaux 26
[Mallat 2016]
Summary – representations to help learning
Intermediate representations give
expressiveness to predictive models
Good representations keep predictive information
and lose nuisance information
Bottleneck and regularization to lose information
Limited-data settings
Given known invariants of the problem,
reusing existing representations helps
eg headless conv-net, wavelets... [Oyallon... 2017]
G Varoquaux 27
1 Representations for machine learning
Finite-sample supervised learning
Learning with representations
Supervised learning of representations
Over-parametrized representation learning
The need for supervision
Maximizing I(z; y) (≤ I(x; y)): sufficient representations
⇒ supervised learning
while minimizing I(z; n): nuisance
⇒ sampling nuisances / invariants
data augmentation
Challenge: amount of labeled data
Pretext tasks
Other targets y′ that capture useful information
Finding them needs domain knowledge
G Varoquaux 29
Deep architectures
$\hat y = f^d_{W_d} \circ \dots \circ f^1_{W_1}(x)$
Typically $f^k_{W_k}(x) = g^k(W_k^T x)$, with $g^k$ an element-wise non-linearity
Thus $\hat y = g^d\big(W_d^T \dots g^1(W_1^T x)\big)$
Stacked representations: $W_k$
$\{W_k\}$ optimized to minimize a prediction error
G Varoquaux 30
Shallow architectures for limited data
Keep one latent layer
Without non-linearity:
$\hat y = x^T W_1 W_2$, $y \in \mathbb{R}^k$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$:
a factored / reduced-rank linear model
Multi-task / multi-output: a structured loss can help (multiple soft-max’s)
Overparametrization is sometimes useful: d > k
can be achieved with dropout
G Varoquaux 31
[Bzdok... 2015, Mensch... 2018]
Examples of simple models that extract representations
G Varoquaux 32
Simple case: square loss = reduced-rank regression
$\hat Y = X\, W_1 W_2$, $Y \in \mathbb{R}^{n \times k}$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$
$\hat W_1, \hat W_2 = \operatorname*{argmin}_{W_1, W_2} \|\hat Y - Y_{\text{train}}\|_{\text{Fro}}^2$   (for the squared loss the problem is convex)
Full-rank solution¹ (X and Y on the train set):
$\hat W = \hat\Sigma_X^{-1} X^T Y$,   $\hat Y = X \hat W = X \hat\Sigma_X^{-1} X^T Y$
Rank-d solution: [Izenman 1975, Rahim... 2017b]
$\hat R_d \overset{\text{def}}{=} Y^T \hat Y \in \mathbb{R}^{k \times k}$, SVD → $\hat U_d \hat s_d \hat V_d$, with $\hat U_d \in \mathbb{R}^{k \times d}$
then $\hat W_1 = \hat\Sigma_X^{-1} X^T Y\, \hat U_d$ (full-rank solution), $\hat W_2 = \hat U_d^T$ (rank-d projector²)
¹No need for pesky SGDs
²The projector captures the variance explained on the multiple outputs
G Varoquaux 33
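A NumPy sketch of this closed-form recipe (the small ridge term for numerical stability is an assumption, not on the slide):

```python
import numpy as np

def reduced_rank_regression(X, Y, d, ridge=1e-8):
    """Rank-d multi-output linear model Y ~ X @ W1 @ W2, following the slide's
    recipe: full-rank solution, then a rank-d projector from the SVD of Y^T Yhat."""
    n, p = X.shape
    W_full = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ Y)
    Y_hat = X @ W_full
    U, _, _ = np.linalg.svd(Y.T @ Y_hat)      # k x k
    U_d = U[:, :d]
    return W_full @ U_d, U_d.T                # W1 (p x d), W2 (d x k)

# Toy multi-output data with a rank-3 structure
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
Y = X @ rng.normal(size=(30, 3)) @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
W1, W2 = reduced_rank_regression(X, Y, d=3)
```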
Model stacking
$x \xrightarrow{f_1} z \xrightarrow{f_2} y$
Learn f1 separately
Train a first model, feed its output to a second model
Directly supervising z: z = ŷ for a (simple) predictive model
First model f1 must underfit the output:
model chosen from a simple function class (linear models)
Trick: “cross-fit” during training
obtain ŷ by splitting the training data
(in sklearn: cross_val_predict)
Application: tackling dimensionality [Rahim... 2017a]
Some features are a high-dimensional signal, eg medical images
f1: linear, to reduce the signal features
f2: non-linear (eg treesᵃ) on all features
ᵃTree-based models are great for mixed-type data with categorical features
G Varoquaux 34
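A minimal scikit-learn sketch of this stacking scheme; the split of the columns into "signal" and other features is hypothetical, standing in for eg image features next to tabular ones.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=300, n_features=520, noise=10, random_state=0)
X_signal, X_other = X[:, :500], X[:, 500:]     # hypothetical high-dim signal + tabular part

# f1: linear model on the signal features, "cross-fit" on the training data
z = cross_val_predict(RidgeCV(), X_signal, y, cv=5)

# f2: non-linear model (trees) on the reduced signal plus the other features
f2 = GradientBoostingRegressor().fit(np.column_stack([z, X_other]), y)
```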
Model stacking to encode discrete items
Sex   Date Hired   Employee Position        predict →   Salary
M     09/12/1988   Master Police Officer                69222.18
F     06/26/2006   Social Worker III                    97392.47
M     07/16/2007   Police Officer III                   104717.28
Difficulty: the number of different positions; what invariants?
[Figure: distribution of employee salaries per position, from Crossing Guard and Library Aide to Manager III]
Target encoding¹ [Micci-Barreca 2001]
$\text{position} \to \mathbb{E}_{\text{train}}[\text{salary} \mid \text{position}]$
¹To inject categories in $\mathbb{R}$, before a second level that combines all columns
Python package: dirty-cat.github.io
G Varoquaux 35
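A minimal pandas sketch of the target encoding above, on toy rows with hypothetical salary values; dirty-cat (dirty-cat.github.io) ships encoders for such columns that additionally smooth the per-category means and handle unseen categories.

```python
import pandas as pd

train = pd.DataFrame({
    "position": ["Police Officer III", "Social Worker III",
                 "Police Officer III", "Library Aide"],
    "salary": [104717.28, 97392.47, 98000.00, 43000.00],   # toy values
})
# position -> E_train[salary | position]
encoding = train.groupby("position")["salary"].mean()
train["position_encoded"] = train["position"].map(encoding)
```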
Summary – supervised extraction of representations
Supervision helps select
the relevant part of the signal
In limited-sample settings, simple
models can create representations
Simple latent-factor models
Multi-output models
Stacking: fit a first-level model
G Varoquaux 36
1 Representations for machine learning
Finite-sample supervised learning
Learning with representations
Supervised learning of representations
Over-parametrized representation learning
Revisiting the bias-variance tradeoff
Flexible models can achieve less bias
but come with more variance [Geman... 1992]
[Figure: polynomial fits of degree 1, 2, 5, 9 vs the truth]
Strong theoretical arguments come from a worst-case analysis¹
The average case can be very different
Achieve more flexibility without a variance increase
¹eg minimax rates of non-parametric regression [Györfi... 2002]
G Varoquaux 38
Example: random forest
1 tree: much bias
300 trees: less bias, no variance increase
Ensemble models
Prediction: $\hat y = \frac{1}{m}(\hat y_1 + \hat y_2 + \dots + \hat y_m)$
If the errors of each model, $\hat y_i = y + \varepsilon_i$, are independent, they average out:
$\mathbb{E}\big[\|\hat y - y\|^2\big] = \mathbb{E}\big[\|\tfrac{1}{m}(\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_m)\|^2\big] = \tfrac{1}{m}\operatorname{var}\varepsilon$
Increase in model flexibility without variance
G Varoquaux 39
Overparametrized neural networks
For a suitable random initialization¹, the error of ŷ does
not increase with network width.
Overparametrization
can even decrease
sample complexity
[Kaplan... 2020]
1Initialization must be diverse enough, and more concentrated for wide
networks [Chizat and Bach 2018, Chizat... 2019].
G Varoquaux 40
[Neal... 2018, Nakkiran... 2020]
Overparametrized neural networks
Overparametrize to set train error to zero
In error decomposition: approximation error to zero
$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i \ell(f(x_i), y_i)$
Another error decomposition:
Error can be due to
1 optimizing on noisy training data
2 initialization
1 plateaus with wide networks, while 2 decreases.
Optimum on train set is degenerate
G Varoquaux 41
[Neal... 2018, Nakkiran... 2020]
Randomization as a regularization
Toy example: ridge
OLS: $\hat w = \operatorname*{argmin}_w \|y - X^T w\|_2^2$
Inject noise: $X' = X + E$, $E \sim \mathcal{N}(0, \sigma)$
$\hat w' = \operatorname*{argmin}_w \|y - (X + E)^T w\|_2^2$
    $= \operatorname*{argmin}_w \|y - X^T w\|_2^2 + \|E^T w\|_2^2$   (in expectation over E)
    $= \operatorname*{argmin}_w \|y - X^T w\|_2^2 + \sigma \|w\|_2^2$
Dropout as an implicit regularization [Mianjy... 2018]
Random kernel expansions regularize [Rahimi and Recht 2008]
G Varoquaux 42
Fine-tuning to reuse complex representations
Overparametrized architectures might not have
low-dimension representations
Fine tune the full architecture1
Lower learning rate to the input layers
to avoid catastrophic forgetting [Sun... 2019]
Feature extraction from the full architecture
Pooling & linear combinations of input layers
[Peters... 2019]
Fine tuning best on complex architectures
1Thanks to Lihu Chen for help with this slide
G Varoquaux 43
Summary – overparametrized representations
Diversity (randomness) regularizes
Randomization can create interesting
inductive biases
Random CNNs work surprisingly well
[He... 2016, Ustyuzhaninov... 2016]
Fine-tuning overparametrized
representations to reuse them
G Varoquaux 44
Summary of first section
For generalization: small family of functions fw that
approximate the signal well
Generalization of a linear predictor:
approximation error + $o(p/n_{\text{train}})$
Predictors by composition: $\hat y = f_2(z)$, $z = f_1(x)$
$x \xrightarrow{f_1} z \xrightarrow{f_2} y$; ideally, f1 makes z invariant to nuisances
Reuse representations with the right invariances:
wavelets, fasttext, pretrained headless neural nets
Simple supervised models
can create representations
stacking multioutput pretext tasks
G Varoquaux 45
2 Matrix factorization and its
variants
Simple unsupervised representation learning
More unlabeled data than labeled data
Learn representations and transfer them
Here: Focus on simple models for limited n or low SNR settings
Particularly interesting regime: p large and n large.
Matrix factorization is a simplified version of deep learning
This section: building the framework from simple to complex
2 Matrix factorization and its variants
For signals
For discrete objects
Matrix factorization for representations
Reduce the dimensionality
while keeping the signal
“disentangle”
give features that are useful in themselves
G Varoquaux 48
Principal Component Analysis¹
Find the directions of largest variance
Computation: $X \in \mathbb{R}^{n \times p}$, $\Sigma_X = X^T X \in \mathbb{R}^{p \times p}$
PCA projector: $P_{\text{PCA}} \in \mathbb{R}^{p \times k}$, from $\text{SVD}_k(X)$ or $\text{EVD}_k(\Sigma_X)$
Reduced X: $X P_{\text{PCA}} \in \mathbb{R}^{n \times k}$
Model: low-rank Gaussian latent factors
$X \approx U V + E$, $E \sim \mathcal{N}(0, I_p)$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
$\hat U, \hat V = \operatorname*{argmin}_{U, V} \|X - U V\|_{\text{Fro}}^2$
Rotationally invariant: $U' = U O$, $O^T V$ is also a solution for any O s.t. $O^T O = I$
PCA = 1-hidden-layer autoencoder with squared lossᵃ
$\min_W \|X - W W^T X\|_{\text{Fro}}^2$, with a suitable constraint on W
ᵃBoth find the same subspace
In a learning pipeline
Useful for dimensionality reduction (eg p is large)
Eases statistics and computations
Generalization error of PCA + OLS within a factor of 4 of ridge [Dhillon... 2013]
¹Mother of all representations (simplest)
G Varoquaux 49
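A scikit-learn sketch of PCA as the first step of a learning pipeline on high-dimensional toy data, compared with ridge on the raw features:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# High-dimensional, limited-n toy data
X, y = make_regression(n_samples=100, n_features=1000, n_informative=20,
                       noise=5, random_state=0)

pca_ols = make_pipeline(PCA(n_components=30), LinearRegression())
print("PCA + OLS:", cross_val_score(pca_ols, X, y, cv=5).mean())
print("Ridge:    ", cross_val_score(RidgeCV(), X, y, cv=5).mean())
```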
Beyond variance: Independent Component Analysis
Separate out signals U observed mixed¹
[Figure: true source signals, mixed observations, ICA-recovered signals]
Disentangles: lifts the rotational invariance
Model: $X = U V$, $V \in \mathbb{R}^{p \times p}$, $V^T V = I_p$
If the latent signals are Gaussian, the model is not identifiable
Seek low mutual information across $\{u_j\}$
⇒ maximally non-Gaussian marginals [Cardoso 2003]
Computation: FastICA [Hyvärinen and Oja 2000]
Power iterations on V; each time:
- apply a smooth increasing non-linearity on $\{u_j\}$
- decorrelate
Preprocessing: whiten the data, eg with PCA
¹Classic ICA has no noise model: it does not do dimension reduction
G Varoquaux 50
ICA to learn representations
Across patches of natural images:
Gabor-like filters
Similar to wavelets
and first layer of convnets
G Varoquaux 51
[Hyvärinen and Oja 2000]
ICA to learn representations
Across patches of natural images:
ICA
Disentangles
Can only learn rotations
No dimension reduction
G Varoquaux 52
Dictionary learning
Find vectors V that represent the signal well
with sparse combinations U
Model: $X = U V$ s.t. U is sparse, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
k can be > p (overcomplete dictionary)
Estimation: $\hat U, \hat V = \operatorname*{argmin}_{U, V \,\text{s.t.}\, \|v_i\|_2^2 \le 1} \|X - U V\|_{\text{Fro}}^2 + \lambda \|U\|_1$
Data fit without need for reduction
Combining a squared loss and an $\ell_1$ penalty creates sparsity
The constraint on $\|v_i\|_2^2$ is required to avoid cancelling out the penalty with $V \to \infty$ and $U \to 0$
More generally: $\hat U, \hat V = \operatorname*{argmin}_{U, V \,\text{s.t.}\, V \in \mathcal{C}} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$
The constraint set and penalty can be varied¹: typically $\ell_2$, $\ell_1$, and positivity² on U or V.
¹Fast when $\mathcal{C}$ and $\Omega$ lead to simple projections and penalized regression.
²Recovers a form of NMF (non-negative matrix factorization)
G Varoquaux 53
Sparse dictionary learning to learn representations
Across patches of natural images:
Also learns Gabor-like filters1
Good for sparse models,
eg for denoising
Also performs dimensionality reduction
¹as ICA, K-Means, etc. on image patches
G Varoquaux 54
[Mairal... 2014]
Large n, large p: brain imaging
Brain activity at rest
1000 subjects with ∼ 100–10 000 samples
Images of dimensionality > 100 000
Dense matrix, large both ways
[Figure: voxels × time matrix X ≈ U · V + E]
G Varoquaux 55
Estimation algorithms
For dictionary learning
G Varoquaux 56
Large n, large p: recommender systems
Product ratings
Millions of entries
Hundreds of thousands of products and users
Large sparse matrix
[Figure: sparse users × products ratings matrix X ≈ U · V + E]
G Varoquaux 57
Online estimation: stochastic optimization
$\min_w \sum_i \ell(x_i, w)$   → many samples: $\min_w \mathbb{E}[\ell(y, x\, w)]$
Gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \nabla_w \ell$
Stochastic gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \widehat{\nabla_w \ell}$,
using a cheap estimate of $\mathbb{E}[\nabla_w \ell]$ (e.g. subsampling)
$\alpha_t$ must decrease “suitably” with t.   Those pesky learning rates
G Varoquaux 58
Online estimation for matrix factorization
Large matrices = terabytes of data
$\operatorname*{argmin}_{U, V} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$
Rewrite as an expectation:
$\operatorname*{argmin}_V \sum_i \Big( \min_u \|X_i - V u\|_2^2 + \lambda \Omega(u) \Big) = \operatorname*{argmin}_V \mathbb{E}\big[f(V)\big]$
⇒ optimize on approximations (sub-samples)
Online matrix factorization [Mairal... 2010]: alternating minimization, streaming the columns of the data matrix
(data access → code computation → dictionary update)
Subsampled & online [Mensch... 2017]: additionally subsample the rows
G Varoquaux 59
Online matrix factorization algorithm [Mairal... 2010]
Stream samples $x_t$:
1. Compute the code
$u_t = \operatorname*{argmin}_{u \in \mathbb{R}^k} \|x_t - V_{t-1} u\|_2^2 + \lambda \Omega(u)$
2. Update the surrogate function
$g_t(V) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - V u_i\|_2^2 = \operatorname{tr}\big(\tfrac{1}{2} V^T V A_t - V^T B_t\big)$
$A_t \overset{\text{def}}{=} (1 - \tfrac{1}{t}) A_{t-1} + \tfrac{1}{t} u_t u_t^T$,   $B_t \overset{\text{def}}{=} (1 - \tfrac{1}{t}) B_{t-1} + \tfrac{1}{t} x_t u_t^T$
$A_t$ and $B_t$ are sufficient statistics of the loss accumulated over the data
$g_t$ is a surrogate of $\sum_x l(x, V)$: $u_i$ is used, and not $u^\star$
3. Minimize the surrogate
$V_t = \operatorname*{argmin}_{V \in \mathcal{C}} g_t(V)$,   $\nabla g_t = V A_t - B_t$
G Varoquaux 60
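A compact NumPy sketch of the three steps above (sparse coding, sufficient-statistics update, block coordinate descent on the dictionary); a single pass over a toy stream, not the full algorithm of [Mairal... 2010].

```python
import numpy as np
from sklearn.linear_model import Lasso

def online_matrix_factorization(stream, p, k, lam=0.1, seed=0):
    """One pass of online dictionary learning: x ~ V u, with V in R^{p x k}."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(p, k))
    A, B = np.zeros((k, k)), np.zeros((p, k))
    for t, x in enumerate(stream, start=1):
        # 1. Compute the code (sparse coding; alpha matches ||x - Vu||^2 + lam*||u||_1)
        u = Lasso(alpha=lam / (2 * p), fit_intercept=False).fit(V, x).coef_
        # 2. Update the sufficient statistics of the surrogate g_t
        A = (1 - 1 / t) * A + np.outer(u, u) / t
        B = (1 - 1 / t) * B + np.outer(x, u) / t
        # 3. Minimize the surrogate: block coordinate descent on the columns of V
        for j in range(k):
            V[:, j] -= (V @ A[:, j] - B[:, j]) / (A[j, j] + 1e-12)
            V[:, j] /= max(1.0, np.linalg.norm(V[:, j]))   # project onto ||v_j|| <= 1
    return V

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))                 # p=50 features, 500 streamed samples
V = online_matrix_factorization((X[:, i] for i in range(500)), p=50, k=10)
```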
Stochastic Majorization-Minimization [Mairal 2013]
$V = \operatorname*{argmin}_{V \in \mathcal{C}} \sum_x l(x, V)$  where  $l(x, V) = \min_u f(x, V, u)$
Algorithm: $g_t(V)$ is a majorant of $\sum_x l(x, V)$: $u_i$ is used, and not $u^\star$
⇒ Majorization-Minimization scheme¹
Surrogate computation (SMM): full minimization, 2nd-order information, no learning rate
¹SOMF uses an approximate majorant and minimization [Mensch... 2017]
G Varoquaux 61
Experimental convergence: large images
[Figure: test objective value vs time for OMF, SOMF (subsampling ratios r = 4, 6, 8, 12, 24) and best step-size SGD, on ADHD (sparse dictionary, 2 GB), Aviris (NMF and dictionary learning, 103 GB), and HCP (sparse dictionary, 2 TB)]
SOMF = Subsampled Online Matrix Factorization
G Varoquaux 62
Experimental convergence: recommender system
[Figure: convergence on recommender-system data]
SOMF = Subsampled Online Matrix Factorization
G Varoquaux 63
Summary – matrix factorization of signals
Versatile matrix-factorization formulation¹
$\operatorname*{argmin}_{U \in \mathbb{R}^{n \times k},\, V \in \mathcal{C}} \|X - U V\|_{\text{Fro}}^2 + \lambda \Omega(U)$
Estimation
Stochastic majorization-minimization²
⇒ an online alternated optimization
Example use of learned representations
Biomarkers of autism on brain images: p ∼ 100 000, n ∼ 1 000 [Abraham... 2017]
¹1-layer linear autoencoder
²Common-case algorithm readily usable in scikit-learn: MiniBatchDictionaryLearning
G Varoquaux 64
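The common case is directly available in scikit-learn; a minimal usage sketch on toy signals:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                # toy signals, one per row

dico = MiniBatchDictionaryLearning(n_components=15, alpha=1.0,
                                   batch_size=32, random_state=0)
U = dico.fit_transform(X)                      # sparse codes, shape (1000, 15)
V = dico.components_                           # dictionary, shape (15, 64)
```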
2 Matrix factorization and its variants
For signals
For discrete objects
Embedding discrete objects
Embedding discrete objects (words, entities, user ids) is crucial
It endows them with a metric,
enables building predictive functions
that extrapolate between objects
The original p is not small in front of n
[Figure: embedding of job titles, eg “Construction Representative III”, “Fire/Rescue Captain”, “Resource Conservationist”, “Security Officer II”, “Security Officer III (Sergeant)”]
G Varoquaux 66
Natural language processing: topic-modeling history
Topic modeling: embedding documents³
Start from a vectorization of each document by counting word occurrences:
the term–document matrix
[Figure: term–document count matrix factorized into (documents × topics) × (topics × terms):
what terms are in a topic, what documents are in a topic]
LSA (Latent Semantic Analysis) [Landauer... 1998]
SVD of the terms × documents matrix
³Typically for information-retrieval purposes, aka search engines
G Varoquaux 67
Gamma-Poisson for factorizing counts [Canny 2004]
When X is a matrix of counts
- Topic modeling
- Recommender systems [Gopalan... 2014]
- Database string entries [Cerda and Varoquaux 2020]
⇒ Poisson loss, instead of squared loss
$P(x_j \mid u, V) = \text{Poisson}\big((u V)_j\big) = \frac{1}{x_j!} (u V)_j^{x_j}\, e^{-(u V)_j}$
[Figure: Poisson(0), Poisson(1), Poisson(3) vs a Gaussian(.5)]
Counts are not well approximated by a Gaussian
u are loadings, modeled as random with a Gamma prior¹
$P(u_i) = \frac{u_i^{\alpha_i - 1}\, e^{-u_i / \beta_i}}{\beta_i^{\alpha_i}\, \Gamma(\alpha_i)}$
Maximum a posteriori estimation:
$\hat U, \hat V = \operatorname*{argmin}_{U, V} -\Big( \sum_j \log P(x_j \mid u, V) + \sum_i \log P(u_i) \Big)$
¹Because it is the conjugate prior of the Poisson, and because it imposes soft sparsity and lifts the rotational invariance
G Varoquaux 68
Gamma-Poisson estimation
Full log-likelihood expression:
$\log\mathcal{L} = \sum_{j=1}^{p} \big[ x_j \log((u V)_j) - (u V)_j - \log(x_j!) \big] + \sum_{i=1}^{k} \big[ (\alpha_i - 1)\log(u_i) - \tfrac{u_i}{\beta_i} - \alpha_i \log\beta_i - \log\Gamma(\alpha_i) \big]$
Gradients:
$\frac{\partial}{\partial V_{ij}} \log\mathcal{L} = \frac{x_j}{(u V)_j} u_i - u_i$
$\frac{\partial}{\partial u_i} \log\mathcal{L} = \sum_{j=1}^{p} \Big[ \frac{x_j}{(u V)_j} V_{ij} - V_{ij} \Big] + \frac{\alpha_i - 1}{u_i} - \frac{1}{\beta_i}$
Equivalent to an NMF formulation: multiplicative updates¹
$V_{ij} \leftarrow V_{ij} \left( \sum_{\ell=1}^{n} \frac{x_{\ell j}}{(U V)_{\ell j}} u_{\ell i} \right) \left( \sum_{\ell=1}^{n} u_{\ell i} \right)^{-1}$
$u_{\ell i} \leftarrow u_{\ell i} \left( \sum_{j=1}^{p} \frac{x_{\ell j}}{(U V)_{\ell j}} V_{ij} + \frac{\alpha_i - 1}{u_{\ell i}} \right) \left( \sum_{j=1}^{p} V_{ij} + \beta_i^{-1} \right)^{-1}$
Adapt the majorization-minimization algorithm [Lefevre... 2011, Cerda and Varoquaux 2020]
¹Efficient implementation with sparse matrices: the summations can be done only on non-zero entries of X.
G Varoquaux 69
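A dense NumPy sketch of these multiplicative updates on a toy count matrix; real implementations (eg in dirty-cat) exploit sparsity and stream the data.

```python
import numpy as np

def gamma_poisson_mf(X, k, alpha=1.1, beta=1.0, n_iter=200, eps=1e-10, seed=0):
    """Gamma-Poisson factorization X ~ U V via the multiplicative updates above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    U = rng.gamma(alpha, beta, size=(n, k))
    V = rng.gamma(1.0, 1.0, size=(k, p))
    for _ in range(n_iter):
        # Update the topics V (NMF-style multiplicative update)
        V *= (U.T @ (X / (U @ V + eps))) / (U.sum(axis=0)[:, None] + eps)
        # Update the loadings U, with the Gamma-prior terms (alpha, beta)
        num = (X / (U @ V + eps)) @ V.T + (alpha - 1) / (U + eps)
        den = V.sum(axis=1)[None, :] + 1 / beta
        U *= num / den
    return U, V

# Toy count data
X = np.random.default_rng(0).poisson(3.0, size=(100, 50)).astype(float)
U, V = gamma_poisson_mf(X, k=5)
```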
Application: embedding via string form
Problem: representing non-normalized categories
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
G Varoquaux 70
Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
Application: embedding via string form
Gamma-Poisson factorization on sub-string (3-gram) counts
Models strings as a linear combination of substrings
[Figure: 3-gram count matrix for entries such as “police”, “officer”, “pol off”, “polis”, “policeman”, “policier”, factorized into (entries × latent categories) × (latent categories × substrings):
what substrings are in a latent category, what latent categories are in an entry]
G Varoquaux 71
Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
Application: embedding via string form
Representations that extract latent categories
[Figure: activations of the latent categories for job titles such as “Legislative Analyst II”, “Bus Operator”, “Senior Architect”, “Mechanic Technician II”, “Master Police Officer”, “Police Sergeant”]
G Varoquaux 72
Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
Application: embedding via string form
Inferring plausible feature names
[Figure: inferred feature names (lists of salient substrings per latent category) for the same job-title categories]
G Varoquaux 72
[Cerda and Varoquaux 2020]
So far:
Matrix factorization of counts (eg co-occurrences)
Embeds discrete objects
Better with a suitable loss
Next:
Implicit matrix factorization and losses
G Varoquaux 73
Word embeddings
Distributional semantics: meaning of words
“You shall know a word by the company it keeps”
Firth, 1957
Example: “A glass of red ____, please”
Could be wine, maybe juice?
wine and juice have related meanings
Factorization of the word×context matrix
What choice of context?
What loss?
word2vec [Mikolov... 2013a] glove [Pennington... 2014]
G Varoquaux 74
Word2vec: skip-gram sampling [Mikolov... 2013b]
$\{\hat u_w, \hat v_c\} = \operatorname*{argmax}_{\{u_w, v_c\}} \sum_{\substack{\text{pairs of words } (w, c) \\ \text{in the same window}^1}} \log \operatorname{softmax}(V u_w^T)_c$
$\operatorname{softmax}(z)_i = \frac{\exp z_i}{\sum_j \exp z_j}$
$u_w \in \mathbb{R}^k$: embedding of word w
$V \in \mathbb{R}^{\text{card(voc)} \times k}$: $[v_c,\, c \in \text{voc}]$, all context words
Big sum on contexts ⇒ solved by SGD²
[Figure: center-word embeddings U and context-word embeddings V for words such as salad, meat, juice, wine, glass, red, green]
Other view: language models, prediction of words
¹Efficient: never build the matrix, stream directly from text.
²These windows are called skip-grams
G Varoquaux 75
Word2vec: negative sampling [Mikolov... 2013a]
Costly loss: $\log \operatorname{softmax}(z)_i = \log \frac{\exp z_i}{\sum_j \exp z_j}$
Approximate¹: the huge sum in the softmax (over all the vocabulary) is
downsampled by drawing the positive (numerator) and a few negative examples (denominator)
Negative-sampling loss² [Goldberg and Levy 2014]:
$\log \sigma(v_c u_w^T) + \sum_{\substack{n_{\text{neg}} \text{ words } w' \\ \text{not in the window}}} \log \sigma(-v_c u_{w'}^T)$
σ: sigmoid, $\sigma(z) = 1 / (1 + \exp(-z))$
¹Related to noise-contrastive estimation, which avoids computing costly normalizations in likelihoods [Gutmann and Hyvärinen 2010]
²Related to a matrix factorization of the mutual information in word co-occurrence [Levy and Goldberg 2014]
G Varoquaux 76
Beyond natural language: metric learning
Triplet loss
For an “anchor” a, b close to a, c far from a:
$\log \sigma(v_a^T u_b) - \log \sigma(v_a^T u_c)$
Quadruplet loss [Chen... 2017]
For a and b close by, c and d far apart:
$\log \sigma(v_a^T u_b) - \log \sigma(v_c^T u_d)$
In practice: draw¹ (a, b, c) or (a, b, c, d) randomly
Metric learning: [Bellet... 2013]
Learning embeddings with weak supervision
¹Many strategies, eg “hard negative mining”; requires a good test set and metric to tune, as with SGD hyperparameters.
G Varoquaux 77
Embedding entities in knowledge graphs
Structured (graph) representation of human knowledge, eg dbpedia, Yago
Challenge: relations of multiple natures
Learning embeddings of entities $\{e_i\}$ and relations $\{r_j\}$:
$e_a \sim e_b + r_c$, a model of the relation¹
Then triplet / quadruplet loss
Reuse existing: conceptnet.io
¹Richer, better models [Wang... 2014]
G Varoquaux 78
[Bordes... 2013, Wang... 2017]
The value of simple models
Risk of invisible overfit during the search for hyperparameters and models
Complex models call for a clear utility measure with low measurement error
Many reliable labels
Matrix factorization models¹: 2 hyper-parameters:
dimensionality k, regularization λ
Set them to optimize representations for supervised problems
¹Using majorization-minimization approaches to avoid learning rates
G Varoquaux 79
Summary – embedding discrete objects
Discrete entities lead to counting occurrences
⇒ Poisson and logistic losses (ugly logs in equations)
Word & entity embeddings
Factorization of co-occurrences in a notion of context
more generally: metric learning
Limited-data settings:
Avoid negative-sampling models (hyper-parameters)
Try to reuse representations (fastText, conceptnet.io)
G Varoquaux 80
Summary – matrix factorization
Builds linear representations of the input
At the root of many more complex variants
Majorization-minimization solvers:
scalable and “fire and forget”
G Varoquaux 81
3 Method evaluation with
limited data
Less data =⇒ more difficult evaluation
Section inspired by [Bouthillier... 2021]
Evaluation of the generalization error
Focus on representation to facilitate prediction
=⇒ evaluate prediction
Leaving aside representation for interpretability
Danger of reading tea leaves
Interpretation = ill defined, requires expert knowledge,
subject to confirmation bias [Lipton 2018]
Ill-conditioned problem
=⇒ strong dependence on prior
=⇒ self-fulfilling prophecies
G Varoquaux 83
3 Method evaluation with limited data
Variance in model evaluation
Reliable experimental procedures
From benchmarks to conclusion
Model evaluation
New data is required to assess
generalization performance $\mathbb{E}\big[\ell(f(X), y)\big]$
Split the data into train and test sets
typically 10% test
trade-off: better learning vs better estimation
Make choices on the model:
split into train, validation, and test sets
Make model choices on the validation set; evaluate the model on the test set
G Varoquaux 85
Evaluation error: Sampling noise on the test set
Sampling noise¹ for ntest = 1000:
[Figure: binomial distribution of the error on test accuracy, roughly ±2% for ntest = 1000]
Confidence intervals: ntest = 1 000 → interval: 5.7%; ntest = 10 000 → 1.8%; ntest = 100 000 → 0.6%
Optimizing test accuracy will explore the tails
Selecting architecture, learning rate...
overfitting the validation & test set
¹The data at hand (eg the test set) is just a small sample of the full population “in the wild”, and sampling other data will lead to other results.
G Varoquaux 86
[Varoquaux 2018]
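The intervals above follow directly from the binomial distribution of the number of correct test predictions; a minimal sketch (the 0.66 accuracy is just an example value):

```python
from scipy import stats

def accuracy_interval_width(acc, n_test, confidence=0.95):
    """Width of the interval on measured accuracy due to test-set sampling alone."""
    lo, hi = stats.binom.interval(confidence, n_test, acc)
    return (hi - lo) / n_test

for n_test in (1_000, 10_000, 100_000):
    print(n_test, accuracy_interval_width(0.66, n_test))
```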
Evaluation error: Sampling noise on the test set, “in the wild”
[Figure: standard deviation of test accuracy vs test-set size; in theory, from a binomial; in practice, from random splits: Glue-RTE BERT (n'=277), Glue-SST2 BERT (n'=872), CIFAR10 VGG11 (n'=10000)]
G Varoquaux 87
[Bouthillier... 2021]
Evaluation is a bottleneck – in publications
[Figure: published accuracies over the years on cifar10 and sst2, marking significant vs non-significant improvements over non-“SOTA” results]
NLP: Glue sentiment-analysis benchmark (ntest = 1.8k)
Vision: object-recognition benchmark (ntest = 10k)
Published improvements compared to benchmark variance
G Varoquaux 88
[Bouthillier... 2021]
Evaluation is a bottleneck – in Kaggle competitions
Lung cancer classification (test size: max 1K): smaller improvements than noise ⇒ diminishing returns
Schizophrenia classification (test size: 120): improvement of the top model on the 10% best vs evaluation noise between public and private sets ⇒ diminishing returns
Lung tumor segmentation (test size: max 6k): poorer score on the private set ⇒ overfit
Nerve segmentation (test size: 5.5K): actual improvement
G Varoquaux 89
[Varoquaux and Cheplygina 2021]
The full benchmarking pipeline
New data to assess generalization performance $\mathbb{E}\big[\ell(f(X), y)\big]$
Split out a test set
Split out a validation set
Choose hyper-parameters on the validation set
Measure performance on the test set
Rampant overfit of the validation set [Makarova... 2021]
G Varoquaux 90
Sources of variance in a machine-learning benchmark
[Figure: contribution of each source of variation (numerical noise, dropout, weights init, data order, data augmentation, data bootstrap, noisy grid search, random search, Bayesian hyper-parameter optimization) across case studies: bert-rte, bert-sst2, bio-task2, segmentation, vgg, and their average]
Model-evaluation results are most affected by:
1. The arbitrary split into train and test
2. Random (arbitrary) parameters
3. Uncertainty in optimized hyper-parameters
G Varoquaux 91
[Bouthillier... 2021]
Summary – variance in benchmarks
Evaluating generalization is limited by ntest
ntest = 10 000 =⇒ ±.9% ntest = 100 000 =⇒ ±.3%
Benchmark hyper parameter choice
Careful not to overfit hyper-parameters
Variance in machine-learning benchmarks
1. Data splits
2. Random seeds
3. Hyper-parameter choice
...
G Varoquaux 92
3 Method evaluation with limited data
Variance in model evaluation
Reliable experimental procedures
From benchmarks to conclusion
Settings: what are we benchmarking
prediction rule: f : X → Y
training procedure: given data (X, y) ∈ (X × Y)n
outputs a prediction rule
hyper parameters: parameters not set by the
procedure
full training pipeline: hyper-parameter choice +
training procedure
G Varoquaux 94
Benchmarking a prediction rule vs a training pipeline
Benchmarking a prediction rule
Before putting in production
Fixed training set; evaluation limited by test-set size
Benchmarking a training pipeline
To conclude on good training procedures
Useless to tune random seeds
(for weights init, dropout, data augmentation)
will not carry over to new training data
G Varoquaux 95
Benchmarking a training pipeline
[Figure: sources of variation (as before) across the case studies]
Reduce error and gauge variance:
data sampling → multiple train-test splits (cross-validation)
arbitrary choices (seeds) → randomize them all
hyper-parameters → hyper-parameter optimization (too expensive to randomize)
G Varoquaux 96
[Bouthillier... 2021]
Hyper-parameter optimization procedures
Random search [Bergstra and Bengio 2012]
(prefer to grid search for more than 2 parameters)
[Figure: grid search vs randomized search over an important and an unimportant hyperparameter; random search covers the region of good hyperparameters better]
Bayesian optimization
Sub-optimal hyper-parameters on models routinely
lead to invalid conclusions
See refs in [Bouthillier... 2021]
G Varoquaux 97
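A minimal scikit-learn sketch of random search, here over the regularization of a logistic regression on toy data (the search space is an assumption):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},   # sample C log-uniformly
    n_iter=20, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```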
Benchmarking with hyper-parameters
Difficulty: measure suboptimality and variance
due to hyper-parameters
Ideal strategy: multiple hyper-parameter
optimizations with different seeds Costly
In practice: set hyper parameters once, then
randomize model seeds and data splits
Counterintuitive: more randomization decorrelates
sources of error, and thus improves benchmarks
G Varoquaux 98
[Bouthillier... 2021]
Summary – better measures
Benchmarking a prediction rule ≠ benchmarking a training procedure
For training procedures: randomize everything
Data splits, all random procedures
Hyper-parameter optimization outside the randomization is suboptimal, but randomizing afterwards helps
G Varoquaux 99
3 Method evaluation with limited data
Variance in model evaluation
Reliable experimental procedures
From benchmarks to conclusion
Statistical tests & ML benchmarks
Null hypothesis testing – p-value: the chance to
observe the results if a null hypothesis were true
Typical null: model comparison
models p1 and p2 give the same expected error
G Varoquaux 101
Statistical tests: single test set
(comparing prediction rules)
[Figure: a single train/test split of the full data]
Simple distribution of metrics, eg accuracy: binomial
Safer to use permutations,
for correlated errors across prediction rules [Bandos... 2005]
Sample the null distribution by randomly switching
predictions from p1 and p2.
G Varoquaux 102
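A sketch of such a permutation test for paired accuracies on a single test set: randomly swap the two rules' predictions sample by sample to build the null distribution.

```python
import numpy as np

def paired_permutation_test(correct_1, correct_2, n_perm=10_000, seed=0):
    """Two-sided p-value for the accuracy difference of two prediction rules
    evaluated on the same test set (boolean arrays of per-sample correctness)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(correct_1, float) - np.asarray(correct_2, float)
    observed = diffs.mean()
    null = np.array([(rng.choice([-1.0, 1.0], size=diffs.size) * diffs).mean()
                     for _ in range(n_perm)])
    return (np.abs(null) >= np.abs(observed)).mean()
```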
Statistical tests: cross-validation
(comparing training pipelines)
Challenge: folds are not independent¹ [Dietterich 1998]
t-test / Wilcoxon across folds are not valid
Correct for the dependence across folds²
5x2cv: repeat 5 times a randomized 2-fold split;
use a t-test with 5 degrees of freedom [Dietterich 1998]
Corrected resampled t-test statistic:
a formula for the fold correlation [Nadeau and Bengio 2003]
¹Train sets overlap, and often test sets also do.
²Does not account for sources of variance other than data sampling, eg random seeds, hyper-parameters.
G Varoquaux 103
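A sketch of the corrected resampled t-test: the per-split variance is inflated by a term that accounts for the overlap between training sets [Nadeau and Bengio 2003].

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    """diffs: score differences between two pipelines over k random splits
    that all use the same train/test sizes."""
    diffs = np.asarray(diffs, float)
    k = diffs.size
    # Variance correction for the correlation induced by overlapping train sets
    t_stat = diffs.mean() / np.sqrt((1 / k + n_test / n_train) * diffs.var(ddof=1))
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value
```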
Statistical tests: across datasets
(more general claims on training pipelines)
Challenge:
metrics not comparable across datasets
=⇒ Tests based on rank statistics
Wilcoxon signed rank test
Tests how often p1 outperforms p2 across datasets
G Varoquaux 104
[Demšar 2006]
Statistical tests: multiple pipelines across datasets
(compare multiple training pipelines)
Challenge: multiple comparisons¹
The Wilcoxon-Holm approach
Pairwise comparisons + Bonferroni-Holm correction
The Friedman-Nemenyi approach²
1. Friedman test across all pipelines (omnibus test)
2. The Nemenyi test gives a critical difference
Critical difference diagrams
[Figure: critical-difference diagram ranking classifiers clf1–clf5 by accuracy rank]
¹If we do many tests, some will show large differences by chance.
²The Holm approach can be more interesting when considering only comparisons to one referent classifier.
G Varoquaux 105
[Demšar 2006]
Statistical tests: multiple pipelines across datasets
(compare multiple training pipelines)
Challenge: multiple comparisons1
Replicability analysis
Perform dataset-level pairwise tests
Combine by testing2:
“Does p1 perform better than p2 on at least u
datasets?”
More powerful than [Demšar 2006]
for a small number of datasets
1If we do many tests, some will show large differences by chance.
2Using a partial conjunction multiple-testing procedure, as described in
[Dror... 2017]
G Varoquaux 106
[Dror... 2017]
Statistical tests: beyond null-hypothesis testing
Sample size is a problem
Across datasets:
significance typically requires > 15 datasets
Within a dataset (repeating folds, seeds...):
many repetitions make any difference significant¹
Underpowered experiments are no evidence
Shortcomings of null-hypothesis testing
Significance decreases with more comparisons²
Statistical significance ≠ practical significance
¹Though as the total test-set size is limited, they do not bring more evidence for generalization.
²FDR (False Discovery Rate) attempts to solve this.
G Varoquaux 107
[Demšar 2008]
Statistical tests: accounting for effect sizes
Neyman-Pearson view of hypothesis testing
Two hypotheses, H0 and H1
H1: p1 outperforms p2 by a margin¹
Which is most likely, H0 or H1?
Requires the choice of the margin
Related to superiority testing in clinical trials [Lesaffre 2008]
¹Related to the rejection region in the Neyman-Pearson lemma.
G Varoquaux 108
Pragmatic compromises
Test whether $P(p_1 > p_2) > \delta$
$\delta > 0.5$: Neyman-Pearson view
Evaluate $P(p_1 > p_2)$ by resampling
Randomize everything: data splits, seeds,...
Gaussian approximation: amounts to comparing differences to standard deviations
Not an inference on the expected difference in performance¹
¹Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings.
G Varoquaux 109
[Bouthillier... 2021]
Summary – concluding from benchmarks
Account for variance
Null-hypothesis testing:
no t-test on cross-validation!
Don’t mis-interpret p-value:
- Not significant: more data could change that
- Significant: difference may be trivial
Detect practical differences:
difference in performance vs standard deviation
G Varoquaux 110
Better experimental procedures
Crack the black box open
A prediction score is seldom insightful
Ablation studies: remove/change atomic elements
Learning curves
Better benchmarking in these
Tune hyper-parameters to the same quality
Randomize everything
Account for variance in conclusions
G Varoquaux 111
Summary – Benchmarking with limited data
Reminder: your validation measure is intrinsically unreliable (sampling noise)
An arbitrary choice (random seed) may give seemingly-good results that do not generalize
Sample many choices
Account for the resulting variance in conclusions
[Figure: distribution of errors under a binomial law vs number of available samples; intervals grow from ±2% at n=1000 to ±15% at n=30]
G Varoquaux 112
Representation learning in limited-data settings
Good representations help learning
Enable the use of simpler models
better approximation via the representation, less estimation error
Simple supervised learning of representations
pretext tasks, stacking, factorizing multi-output
Matrix factorizations
Extract representations without labels
MM solvers are “fire and forget”
Careful benchmarking is crucial
Optimistic flukes will not generalize
G Varoquaux 113
@GaelVaroquaux
References I
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras,
B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers
from multi-site resting-state data: an autism-based example.
NeuroImage, 147:736, 2017.
A. Achille and S. Soatto. Emergence of invariance and
disentanglement in deep representations. The Journal of Machine
Learning Research, 19(1):1947–1980, 2018.
A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive
to differences in areas for comparing roc curves from a paired
design. Statistics in medicine, 24:2873, 2005.
A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for
feature vectors and structured data. arXiv preprint
arXiv:1306.6709, 2013.
J. Bergstra and Y. Bengio. Random search for hyper-parameter
optimization. Journal of Machine Learning Research, 13:281, 2012.
G Varoquaux 114
References II
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko.
Translating embeddings for modeling multi-relational data. In
Advances in Neural Information Processing Systems, pages
2787–2795, 2013.
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk,
J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ...
Accounting for variance in machine learning benchmarks.
Proceedings of Machine Learning and Systems, 3, 2021.
D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux.
Semi-supervised factored logistic regression for
high-dimensional neuroimaging data. In Advances in Neural
Information Processing Systems, page 3348, 2015.
J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122,
2004.
G Varoquaux 115
References III
J.-F. Cardoso. Dependence, correlation and gaussianity in
independent component analysis. Journal of Machine Learning
Research, 4:1177, 2003.
P. Cerda and G. Varoquaux. Encoding high-cardinality string
categorical variables. IEEE Transactions on Knowledge and Data
Engineering, 2020.
W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep
quadruplet network for person re-identification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, page 403, 2017.
L. Chizat and F. Bach. On the global convergence of gradient descent
for over-parameterized models using optimal transport. Advances
in Neural Information Processing Systems, 31:3036–3046, 2018.
G Varoquaux 116
References IV
L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable
programming. Advances in Neural Information Processing
Systems, 2019.
J. Demšar. Statistical comparisons of classifiers over multiple data
sets. The Journal of Machine Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine
learning. In Workshop on Evaluation Methods for Machine
Learning in conjunction with ICML, page 65. Citeseer, 2008.
P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk
comparison of ordinary least squares vs ridge regression. The
Journal of Machine Learning Research, 14:1505, 2013.
T. G. Dietterich. Approximate statistical tests for comparing
supervised classification learning algorithms. Neural
computation, 10(7):1895–1923, 1998.
G Varoquaux 117
References V
R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis
for natural language processing: Testing significance with
multiple datasets. Transactions of the Association for
Computational Linguistics, 2017.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the
bias/variance dilemma. Neural computation, 4(1):1–58, 1992.
Y. Goldberg and O. Levy. word2vec explained: Deriving Mikolov et
al.’s negative-sampling word-embedding method. arXiv:1402.3722,
2014.
P. K. Gopalan, L. Charlin, and D. Blei. Content-based
recommendations with Poisson factorization. In Advances in
Neural Information Processing Systems, page 3176, 2014.
G Varoquaux 118
References VI
M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In
Proceedings of the International Conference on Artificial
Intelligence and Statistics, page 297, 2010.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free
theory of nonparametric regression, volume 1. Springer, 2002.
K. He, Y. Wang, and J. Hopcroft. A powerful generative model using
random weights for the deep image representation. Advances in
Neural Information Processing Systems, 29:631–639, 2016.
D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge
regression. Foundations of Computational Mathematics, 14, 2014.
A. Hyvärinen and E. Oja. Independent component analysis:
algorithms and applications. Neural networks, 13(4):411, 2000.
A. J. Izenman. Reduced-rank regression for the multivariate linear
model. Journal of multivariate analysis, 5:248, 1975.
G Varoquaux 119
References VII
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child,
S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural
language models. arXiv preprint arXiv:2001.08361, 2020.
T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent
semantic analysis. Discourse processes, 25:259, 1998.
A. Lefevre, F. Bach, and C. Févotte. Online algorithms for
nonnegative matrix factorization with the Itakura-Saito
divergence. In Applications of Signal Processing to Audio and
Acoustics (WASPAA), page 313. IEEE, 2011.
E. Lesaffre. Superiority, equivalence, and non-inferiority trials.
Bulletin of the NYU hospital for joint diseases, 66(2), 2008.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix
factorization. In Advances in neural information processing
systems, page 2177, 2014.
G Varoquaux 120
References VIII
Z. C. Lipton. The mythos of model interpretability: In machine
learning, the concept of interpretability is both important and
slippery. Queue, 2018.
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix
factorization and sparse coding. Journal of Machine Learning
Research, 11:19–60, 2010.
J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision
processing. Foundations and Trends® in Computer Graphics and
Vision, 8(2-3):85–283, 2014.
G Varoquaux 121
References IX
A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause,
M. Seeger, and C. Archambeau. Overfitting in Bayesian
optimization: an empirical study and early-stopping solution.
arXiv preprint arXiv:2104.08166, 2021.
S. Mallat. Understanding deep convolutional networks.
Philosophical Transactions of the Royal Society A, 374:20150203,
2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic
subsampling for factorizing huge matrices. IEEE Transactions on
Signal Processing, 66:113, 2017.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting
universal representations of cognition across brain-imaging
studies. arXiv preprint arXiv:1809.06035, 2018.
G Varoquaux 122
References X
P. Mianjy, R. Arora, and R. Vidal. On the implicit bias of dropout. In
International Conference on Machine Learning, pages 3540–3548.
PMLR, 2018.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter, 3:27, 2001.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of
word representations in vector space. In ICLR Workshop Papers.
2013a.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems, page 3111, 2013b.
G Varoquaux 123
References XI
G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of
linear regions of deep neural networks. In Advances in neural
information processing systems, page 2924, 2014.
C. Nadeau and Y. Bengio. Inference for the generalization error.
Machine learning, 52(3):239–281, 2003.
P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever.
Deep double descent: Where bigger models and more data hurt.
ICLR, 2020.
B. Neal, S. Mittal, A. Baratin, V. Tantia, M. Scicluna, S. Lacoste-Julien,
and I. Mitliagkas. A modern take on the bias-variance tradeoff in
neural networks. arXiv preprint arXiv:1810.08591, 2018.
E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering
transform: Deep hybrid networks. In Proceedings of the IEEE
international conference on computer vision, page 5618, 2017.
G Varoquaux 124
References XII
J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for
word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP), page
1532, 2014.
M. E. Peters, S. Ruder, and N. A. Smith. To tune or not to tune?
Adapting pretrained representations to diverse tasks.
Proceedings of the 4th Workshop on Representation Learning for
NLP (RepL4NLP), 2019.
M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint
prediction of multiple scores captures better individual traits
from brain images. Neuroimage, 158:145–154, 2017a.
M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions
from neuroimaging: assessing reduced-rank linear models. In
2017 International Workshop on Pattern Recognition in
Neuroimaging (PRNI), pages 1–4. IEEE, 2017b.
G Varoquaux 125
References XIII
A. Rahimi and B. Recht. Weighted sums of random kitchen sinks:
replacing minimization with randomization in learning. In NIPS,
pages 1313–1320. Citeseer, 2008.
S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression:
Bias-variance decompositions, covariance penalties, and
prediction error estimation. Journal of the American Statistical
Association, pages 1–14, 2018.
C. Sun, X. Qiu, Y. Xu, and X. Huang. How to fine-tune BERT for text
classification? China National Conference on Chinese
Computational Linguistics, 2019.
I. Ustyuzhaninov, W. Brendel, L. A. Gatys, and M. Bethge. Texture
synthesis using shallow convolutional networks with random
filters. arXiv preprint arXiv:1606.00021, 2016.
G. Varoquaux. Cross-validation failure: small sample sizes lead to
large error bars. Neuroimage, 180:68–77, 2018.
G Varoquaux 126
References XIV
G. Varoquaux and V. Cheplygina. How I failed machine learning in
medical imaging–shortcomings and recommendations. arXiv
preprint arXiv:2103.10292, 2021.
Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding:
A survey of approaches and applications. IEEE Transactions on
Knowledge and Data Engineering, 29(12):2724–2743, 2017.
Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding
by translating on hyperplanes. AAAI Conference on Artificial
Intelligence, 2014.
G Varoquaux 127
  • 1. Representation learning in limited-data settings Gaël Varoquaux
  • 2. Limited-data settings n to be compared to: A measure of the signal-to-noise ratio The dimensional of the data p Deep learning is hard in small-sample regimes But we can borrow ideas This talk: No silver bullet, many simple (shallow) tricks G Varoquaux 1
  • 3. Small-n problems are important 83% of data scientists1 never have n > 1M n is often small for applications such as medicine Bigger is better (how to not use this talk) Get more data (pool related datasets) Find a related problem and try transfer This talk: data that differs from common sources 1www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets G Varoquaux 2
  • 4. Small-n problems need guiding principles Selecting architecture, learning rate... A deep architecture is validated by its measured accuracy " less data =⇒ poorer validation more in last part of this talk Need for guiding principles This talk: connecting deep learning to Good Old-Fashioned Machine Learning G Varoquaux 3
  • 5. Outline 1 Representations for machine learning Finite-sample supervised learning Learning with representations Supervised learning of representations Over-parametrized representation learning 2 Matrix factorization and its variants For signals For discrete objects 3 Method evaluation with limited data Variance in model evaluation Reliable experimental procedures From benchmarks to conclusion G Varoquaux 4
  • 6. 1 Representations for machine learning Defining the notion of representations Their use for supervised learning
  • 7. 1 Representations for machine learning Finite-sample supervised learning Learning with representations Supervised learning of representations Over-parametrized representation learning
  • 8. Settings: supervised learning Given n pairs (x, y) ∈ X × Y drawn i.i.d. find a function f : X → Y such that f (x) ≈ y Notation: ŷ def = f (x) Empirical risk minimization Loss function l : Y × Y → Ò Estimation of f: f? = argmin f∈F Å l(ŷ, y) This course: how to choose good function classes F G Varoquaux 7
  • 9. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise G Varoquaux 8
  • 10. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 G Varoquaux 8
  • 11. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 G Varoquaux 8
  • 12. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 Degree 5 G Varoquaux 8
  • 13. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 Degree 5 Degree 9 G Varoquaux 8
  • 14. Example: finite-sample estimation of f Data generated with 9th order polynomial + noise Fit polynomials of various degrees Degree 1 Degree 2 Degree 5 Degree 9 Truth Model too simple: underfit Model too complex: overfit G Varoquaux 8
  • 15. Theory: the generalization error Generalization error of a prediction function f: Notation : E(f) def = Å l(y, f (x)) Finite-sample regime Ideally: f? = argmin f∈F Å l f (x), y In practice: f̂ = argmin f∈F n Õ i=1 l f (xi), yi E(f̂) ≥ E(f?) f f G Varoquaux 9
  • 16. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with Å[e] = 0, the generalization error of f̂ is: E(f̂) = Å l(g(x) + e, f̂ (x)) = E(g) + E(f?) − E(g) + E(f̂) − E(f?) Bayes rate Best possible pre- diction Å l(g(x)+e, g(x)) Approximation error: g F Our model is wrong Estimation Sampling noise on train data f̂ , f? G Varoquaux 10
  • 17. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with Å[e] = 0, the generalization error of f̂ is: E(f̂) = Å l(g(x) + e, f̂ (x)) = E(g) + E(f?) − E(g) + E(f̂) − E(f?) Bayes rate Best possible pre- diction Å l(g(x)+e, g(x)) Due to the noise e Cannot be avoided G Varoquaux 10
  • 18. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with Å[e] = 0, the generalization error of f̂ is: E(f̂) = Å l(g(x) + e, f̂ (x)) = E(g) + E(f?) − E(g) + E(f̂) − E(f?) Approximation error: g F Our model is wrong Decreases for larger F Empirical lower bound of E(f?): train error G Varoquaux 10
  • 19. Theory: decomposing the generalization error Assuming y = g(x) + e, e random with Å[e] = 0, the generalization error of f̂ is: E(f̂) = Å l(g(x) + e, f̂ (x)) = E(g) + E(f?) − E(g) + E(f̂) − E(f?) Estimation Sampling noise on train data f̂ , f? Finite-sample problem Decreases as n grows Increases for larger F Guesstimate: difference be- tween train and test error G Varoquaux 10
  • 20. Example: polynomial regression degree f f Degree 9, small n no approximation error large estimation error f f g Degree 1, large n small estimation error large approximation error G Varoquaux 11
  • 21. Example: polynomial regression degree f f Degree 9, small n no approximation error large estimation error f̂ = argminf∈F Í i l f (xi), yi f f g Degree 1, large n small estimation error large approximation error Function class F not restrictive enough Function class F too restrictive G Varoquaux 11
  • 22. Gauging overfit vs underfit: learning curves 100 1000 Number of samples Error sklearn.model selection.learning curve G Varoquaux 12 Overfit region Underfit? Or Bayes rate?
  • 23. Gauging overfit vs underfit: learning curves 100 1000 Number of samples Error Generalization error Training error sklearn.model selection.learning curve G Varoquaux 12 Estimation error ∼ gap be- tween train and test error
  • 24. Gauging overfit vs underfit: learning curves 100 1000 Number of samples Error Generalization error Training error Degree of polynomial 9 1 Simpler models reach the assymptotic regime faster (smaller “sample complexity”) But can underfit G Varoquaux 12
  • 25. Gauging overfit vs underfit: validation curves 5 10 15 Polynomial degree Error Generalization error Training error sklearn.model selection.validation curve Reveals underfits G Varoquaux 13
  • 26. Linear models for limited-data settings In high-dimensional limited-data settings, linear models are often the best choice For p-dimensional data, x ∈ Òp, they have p parameters n ∼ 200 000 Inpatient Mortality, AUROC (95% CI) Hospital A Hospital B Deep learning 0.95(0.94-0.96) 0.93(0.92-0.94) Baseline (logistic regression) 0.93(0.92-0.95) 0.91(0.89-0.92) G Varoquaux 14
  • 27. Theory: Approximating with linear predictors Linear predictor1: ŷ = xTw, w ∈ Òp Data model: y = xTw? + δ(x) + e Å[e] = 0 xTw?: best linear predictor Ridge estimator: ŵ = argmin w kytrain − XT train wk2 Fro + λkwk2 2 Error compared to best linear predictor: Å ky − xT ŵk2 2 = Å ky − xTw?k2 2 + o σ2p/ntrain [Hsu... 2014, sec 2.5] Random design analysis can characterize the generalization error without assuming a correct data-generating model (miss-specified model) [Hsu... 2014, Rosset and Tibshirani 2018] 1Predictor, not model: we do not assume it is a data-generating process. G Varoquaux 15
  • 28. Theory: Approximating with linear predictors Linear predictor1: ŷ = xTw, w ∈ Òp Data model: y = xTw? + δ(x) + e Å[e] = 0 xTw?: best linear predictor Ridge estimator: ŵ = argmin w kytrain − XT train wk2 Fro + λkwk2 2 Error compared to best linear predictor: Å ky − xT ŵk2 2 = Å ky − xTw?k2 2 + o σ2p/ntrain Approximation error Data not linearly generated ⇒ craft more features Estimation error Curse of dimensionality ⇒ limit number of features 1Predictor, not model: we do not assume it is a data-generating process. G Varoquaux 15
  • 29. Example: extrapolating sea level (tides) Predict sea level as a function of time Test outside of observed range1 1Technically, this is not in our theory: test set , train set. G Varoquaux 16
  • 30. Example: extrapolating sea level (tides) Polynomial regression dim=10 Covariates G Varoquaux 16
  • 31. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 Covariates G Varoquaux 16
  • 32. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates G Varoquaux 16
  • 33. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 G Varoquaux 16
  • 34. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 dim=100 G Varoquaux 16
  • 35. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 dim=100 dim=1000 G Varoquaux 16
  • 36. Example: extrapolating sea level (tides) Polynomial regression dim=10 dim=100 dim=1000 Covariates Sines and cosines basis dim=10 dim=100 dim=1000 Choice of covariates / basis / signal representation ⇒ huge difference on approximation error ⇒ huge difference on generalization error G Varoquaux 16
  • 37. Summary – minimizing a generalization error ŷ = f (x), f chosen in F to minimize the observed error Õ i∈train l f (xi), y generalization error: - approximation error ⇒ F adapted to the data - estimation error ⇒ F small Limited-data settings Linear models best option when p n A good choice of covariates is crucial G Varoquaux 17
  • 38. 1 Representations for machine learning Finite-sample supervised learning Learning with representations Supervised learning of representations Over-parametrized representation learning
  • 39. Representations to build F Settings z = r(x): representation of the data, z ∈ Òk Predictor f : x → ŷ = hw r(x) Function composition: “depth” G Varoquaux 19
  • 40. Representations to build F Settings z = r(x): representation of the data, z ∈ Òk Predictor f : x → ŷ = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion Composing L rectifying functions on intermediate representa- tions of dimension k gives O k p p(L−1) kp linear regions. Basis expansion + linear predictor gives O(k) Exponential in depth, linear with dimension [Montufar... 2014] G Varoquaux 19
  • 41. Representations to build F Settings z = r(x): representation of the data, z ∈ Òk Predictor f : x → ŷ = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion For multi-tasks sharing representations across tasks y multidimensional G Varoquaux 19
  • 42. Representations to build F Settings z = r(x): representation of the data, z ∈ Òk Predictor f : x → ŷ = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion For multi-tasks sharing representations across tasks For limited data hw(z) = wTz, a linear predictor A good choice of z can decrease sample complexity G Varoquaux 19
  • 43. Representations to build F Settings z = r(x): representation of the data, z ∈ Òk Predictor f : x → ŷ = hw r(x) Function composition: “depth” Benefits For expressiveness composition basis expansion For multi-tasks sharing representations across tasks For limited data hw(z) = wTz, a linear predictor Transfer: r is learned on large data; a simple h used. G Varoquaux 19
  • 44. Representations to keep only the “useful information” Formalize How a representation z should: keep information on the output y loose non-useful information G Varoquaux 20
  • 45. Background: Information theory Entropy = amount of information in x H (x) = Åp[log p(x)] Equi-probable distribution = low entropy x=0 x=1 x=2 x=3 x=4 x=5 P Uneven distribution = high entropy x=0 x=1 x=2 x=3 x=4 x=5 P Mutual information between x and y I(x; y) = H (x, y) − H (x) − H (y) x ⊥ ⊥ y (independent) ⇔ I(x; y) = 0 independence ⇔ p(x; y) = p(x)p(y) H (x; y) = Å(x;y) log p(x; y) = Å(x;y) log p(x) + log p(y) x y = Åx log p(x) + Åy log p(y) = H (x) + H (y) G Varoquaux 21
  • 46. Theory: information in representations A representation z of x is sufficient for y if y ⊥ ⊥ x|z, or equivalently if I(z; y) = I(x; y) x, z, y form a Markov chain if Ð(y|x, z) = Ð(y|z). x → z → y Data processing inequality: I(x; y) ≤ I(x; z) A sufficient representation z is minimal when I(x; z) is smallest among sufficient representations G Varoquaux 22 [Achille and Soatto 2018]
  • 47. Nuisances and invariances A nuisance n: I(x, n) ≥ 0, but I(y, n) = 0 Representation z is invariant to the nuisance n if z ⊥ ⊥ n, or I(z; n) = 0 ⇒ We want I(z; n) low In a Markov chain x → z1 → z2 · · · → zL → y If z is a sufficient representation for y, I(z; n) ≤ I(z; x) − I(x; y) Communication bottleneck: I(z1; z2) I(z1; x) ⇒ I(z2; n) ≤ I(z1; z2) − I(x; y) Stacking increases invariance G Varoquaux 23 [Achille and Soatto 2018]
  • 48. Examples of invariances representations Illustrate Ingredients of well-known representations their links to invariances G Varoquaux 24
  • 49. Invariant representations on a continous space st Shift invariance representation = Fourier basis Fourier transform: F(s)f = Õ t e−i f t st complex i Shifting the signal: st → s0 t = st+k F(s0 )f = Õ t e−i f t st+k = Õ t e−i f (t−k) st = ei k f Õ t e−i f t st = ei k f F(s)f → change in phase An orthonormal basis of shift-invariant vectors G Varoquaux 25
  • 50. Invariant representations on a continous space st Shift invariance = Fourier basis Local deformations = Wavelets Locally equivalent to Fourier basis But without the global extent Decimated wavelets Isometric transform of the signal Higher scales lose shift invariance Redundant wavelets Increase the dimensionality Good shift invariance G Varoquaux 25
  • 51. Representations invariant to rich deformations Scaling Rotations Deformations Ingredients Modulus of wavelet / Fourier transform ⇒ non linearity filter banks (convolutions) + stacking (repeating simple invariants) Scattering transform Derived from first principles Building first-order invariants Convolutional networks Learned from data Pooling across pixels (eg max) G Varoquaux 26 [Mallat 2016]
  • 52. Summary – representions to help learning Intermediate representations give expressiveness to predictive models Good representations keep predictive information and loose nuisance information Bottleneck and regularization to loose information Limited-data settings Given know invariants of the problem, reusing existing representations helps eg Headless conv-net, wavelets... [Oyallon... 2017] G Varoquaux 27
  • 53. 1 Representations for machine learning Finite-sample supervised learning Learning with representations Supervised learning of representations Over-parametrized representation learning
  • 54. The need for supervision Maximizing I(z; y) (≤ I(x; y)) sufficient representations ⇒ supervised learning while minimizing I(z; n) nuisance ⇒ sampling nuisances / invariants data augmentation Challenge: amount of labeled data Pretext tasks Other targets y′ that capture useful information Finding them needs domain knowledge G Varoquaux 29
  • 55. Deep architectures ŷ = fWd ◦ ... ◦ fW1 (x) Typically fWk (x) = gk(WkT x), with gk an element-wise non-linearity Thus ŷ = gd(WdT ... g1(W1T x)) Stacked representations: {Wk} optimized to minimize a prediction error G Varoquaux 30
  • 56. Shallow architectures for limited data Keep one latent layer Without non-linearity: ŷ = xT W1 W2, y ∈ Òk W1 ∈ Òp×d W2 ∈ Òd×k , factored / reduced-rank linear model Multi-task / multi-output structured loss can help (multiple soft-max’s) Overparametrization sometimes useful: d > k can be achieved with dropout G Varoquaux 31 [Bzdok... 2015, Mensch... 2018]
  • 57. Examples of simple models that extract representations G Varoquaux 32
  • 58. Simple case: square loss = reduced rank regression Ŷ = X W1 W2, Y ∈ Òn×k W1 ∈ Òp×d , W2 ∈ Òd×k Ŵ1, Ŵ2 = argminW1,W2 kX W1 W2 − Ytraink2Fro For squared loss the problem is convex Full-rank solution1 (X and Y on the train set): Ŵ = Σ̂X−1 XT Y Ŷ = X Ŵ = X Σ̂X−1 XT Y Rank-d solution: [Izenman 1975, Rahim... 2017b] R̂d def= YT Ŷ ∈ Òk×k, SVD → Ûd ŝd V̂d, Ûd ∈ Òk×d then Ŵ1 = Σ̂X−1 XT Y Ûd (full-rank solution times Ûd) Ŵ2 = ÛdT (rank-d projector2) 1No need for pesky SGDs 2The projector captures the variance explained on the multiple outputs G Varoquaux 33
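A minimal NumPy sketch of the closed-form recipe above (the synthetic data, variable names and the tiny ridge term added for conditioning are mine, not from the slides):

```python
# Reduced-rank regression via OLS + SVD of Y^T Y_hat, as on the slide.
import numpy as np

rng = np.random.default_rng(0)
n, p, k, d = 500, 50, 10, 3
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, d)) @ rng.standard_normal((d, k)) \
    + rng.standard_normal((n, k))

# Full-rank (multi-output OLS) solution: W = Sigma_X^{-1} X^T Y
cov = X.T @ X + 1e-6 * np.eye(p)      # small ridge for numerical stability
W_full = np.linalg.solve(cov, X.T @ Y)
Y_hat = X @ W_full

# Rank-d projector from the SVD of R = Y^T Y_hat (k x k)
U_d, _, _ = np.linalg.svd(Y.T @ Y_hat)
U_d = U_d[:, :d]

W1 = W_full @ U_d                     # p x d: reduces the inputs
W2 = U_d.T                            # d x k: maps back to the outputs
Y_rrr = X @ W1 @ W2                   # rank-d multi-output prediction
print(np.linalg.norm(Y - Y_rrr) / np.linalg.norm(Y - Y_hat))
```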
  • 59. Model stacking x f1 → z f2 → y Learn f1 separately Train a first model, feed its output to a second model Directly supervising z: z = ŷ for a (simple) predictive model First model f1 must underfit the output: model chosen from a simple function class (linear models) Trick: “cross-fit” during training obtain ŷ by splitting the training data Test set Train set Full data (in sklearn: cross_val_predict) G Varoquaux 34
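A sketch of the cross-fitting trick with scikit-learn (toy data; the choice of first- and second-level models here is illustrative, not prescribed by the slides):

```python
# Stacking: a simple linear first level, cross-fitted with
# cross_val_predict so the second level never sees in-sample predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_regression(n_samples=500, n_features=100, noise=10,
                       random_state=0)

# Out-of-fold predictions of the first-level (underfitting) model
z = cross_val_predict(RidgeCV(), X, y, cv=5)

# Second level: a flexible model on the original features + z
X_stacked = np.column_stack([X, z])
second_level = HistGradientBoostingRegressor(random_state=0)
print(cross_val_score(second_level, X_stacked, y, cv=5).mean())
```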
  • 60. Model stacking x f1 → z f2 → y Learn f1 separately Train a first model, feed its output to a second model Directly supervising z: z = ŷ for a (simple) predictive model Application: tackling dimensionality [Rahim... 2017a] Some features are a high-dimensional signal eg medical images f1: linear, to reduce the signal features f2: non-linear (eg trees1) on all features 1Tree-based models are great for mixed-type data with categorical features G Varoquaux 34
  • 61. Model stacking to encode discrete items Sex Date Hired Employee Position M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III predict → Salary 69222.18 97392.47 104717.28 Difficulty: large number of different positions what invariants? [figure: distribution of employee salary per position, from Crossing Guard to Manager II] Target encoding1 [Micci-Barreca 2001] position → Åtrain[salary|position] 1To inject categories in Ò, before a second level that combines all columns Python package: dirty-cat.github.io G Varoquaux 35
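A hand-rolled sketch of the target encoding described on the slide (toy data; in practice the dirty-cat package linked above provides a smoothed implementation):

```python
# Replace each category by the mean of the target on the *training* data.
import pandas as pd

train = pd.DataFrame({
    "position": ["Police Officer III", "Social Worker III",
                 "Police Officer III", "Master Police Officer"],
    "salary": [104000., 97000., 101000., 69000.],
})
test = pd.DataFrame({"position": ["Social Worker III", "Library Aide"]})

# E_train[salary | position], with the global mean as fallback
means = train.groupby("position")["salary"].mean()
global_mean = train["salary"].mean()
test["position_encoded"] = test["position"].map(means).fillna(global_mean)
print(test)
```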
  • 62. Summary – supervised extraction of representations Supervision helps selecting the relevant part of the signal In limited-sample settings, simple models can create representations Simple latent-factor models Multi-output models Stacking: fit a first-level model G Varoquaux 36
  • 63. 1 Representations for machine learning Finite-sample supervised learning Learning with representations Supervised learning of representations Over-parametrized representation learning
  • 64. Revisiting the bias-variance tradeoff Flexible models can achieve less bias but come with more variance [Geman... 1992] Degree 1 Degree 2 Degree 5 Degree 9 Truth G Varoquaux 38
  • 65. Revisiting the bias-variance tradeoff Flexible models can achieve less bias but come with more variance [Geman... 1992] Degree 1 Degree 2 Degree 5 Degree 9 Truth Strong theoretical arguments come from a worst-case analysis1 Average case can be very different Achieve more flexibility without variance increase 1eg minimax rates of non-parametric regression [Györfi... 2002] G Varoquaux 38
  • 66. Example: random forest 1 tree: much bias 1 tree G Varoquaux 39
  • 67. Example: random forest 1 tree: high bias 300 trees: less bias, no variance increase 1 tree 300 trees Ensemble models Prediction: ŷ = 1/m (ŷ1 + ŷ2 + · · · + ŷm) If the errors of each model, ŷi = y + εi, are independent, they average out: Å kŷ − yk2 = 1/m2 Å kε1 + ε2 + · · · + εmk2 = 1/m var ε Increase in model flexibility without variance increase G Varoquaux 39
  • 68. Overparametrized neural networks For suitable random initialization1, the error of ŷ does not increase with network width. Overparametrization can even decrease sample complexity [Kaplan... 2020] 1Initialization must be diverse enough, and more concentrated for wide networks [Chizat and Bach 2018, Chizat... 2019]. G Varoquaux 40 [Neal... 2018, Nakkiran... 2020]
  • 69. Overparametrized neural networks Overparametrize to set the train error to zero In the error decomposition: approximation error goes to zero f̂ = argminf∈F Σi l(f (xi), yi) Another error decomposition: error can be due to 1 optimizing on noisy training data 2 initialization 1 plateaus with wide networks, while 2 decreases. The optimum on the train set is degenerate G Varoquaux 41 [Neal... 2018, Nakkiran... 2020]
  • 70. Randomization as a regularization Toy example: ridge OLS: ŵ = argminw ky − XTwk22 Inject noise: X′ = X + E, E ∼ N(0, σ2) ŵ′ = argminw ky − (X + E)Twk22 = argminw ky − XTwk22 + kETwk22 (in expectation, the cross-term vanishes) = argminw ky − XTwk22 + n σ2kwk22 G Varoquaux 42
  • 71. Randomization as a regularization Toy example: ridge OLS: ŵ = argminw ky − XTwk22 Inject noise: X′ = X + E, E ∼ N(0, σ2) ŵ′ = argminw ky − (X + E)Twk22 = argminw ky − XTwk22 + kETwk22 (in expectation, the cross-term vanishes) = argminw ky − XTwk22 + n σ2kwk22 Dropout as an implicit regularization [Mianjy... 2018] Random kernel expansions regularize [Rahimi and Recht 2008] G Varoquaux 42
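A toy numerical check of the claim above: fitting OLS on data augmented with many noisy copies of X behaves like ridge with penalty n σ². This is a sketch under my own choices of sizes and noise scale, and the equivalence only holds on average:

```python
# Noise injection vs explicit ridge on a small synthetic problem.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 10, 0.5
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Ridge with penalty n * sigma^2 (the expected noise energy)
w_ridge = np.linalg.solve(X.T @ X + n * sigma**2 * np.eye(p), X.T @ y)

# OLS on 200 noise-injected copies of the data
n_copies = 200
Xs = np.vstack([X + sigma * rng.standard_normal(X.shape)
                for _ in range(n_copies)])
ys = np.tile(y, n_copies)
w_noisy = np.linalg.lstsq(Xs, ys, rcond=None)[0]

print(np.corrcoef(w_ridge, w_noisy)[0, 1])   # close to 1
```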
  • 72. Fine-tuning to reuse complex representations Overparametrized architectures might not have low-dimension representations Fine tune the full architecture1 Lower learning rate to the input layers to avoid catastrophic forgetting [Sun... 2019] Feature extraction from the full architecture Pooling linear combinations of input layers [Peters... 2019] Fine tuning best on complex architectures 1Thanks to Lihu Chen for help with this slide G Varoquaux 43
  • 73. Summary – overparametrized representations Diversity (randomness) regularizes Randomization can create interesting inductive biases Random CNNs work surprisingly well [He... 2016, Ustyuzhaninov... 2016] Fine-tuning overparametrized representations to reuse them G Varoquaux 44
  • 74. Summary of first section For generalization: small family of functions fw that approximate the signal well Generalization of a linear predictor: approximation error + o(p/ntrain ) Predictors by composition: ŷ = f2(z), z = f1(x) x f1 → z f2 → y ideally, f1 makes z invariant to nuisances Reuse representations with the right invariances: wavelets, fasttext, pretrained headless neural nets Simple supervised models can create representations stacking multioutput pretext tasks G Varoquaux 45
  • 75. 2 Matrix factorization and its variants Simple unsupervised representation learning More unlabeled data than labeled data Learn representations and transfer them Here: Focus on simple models for limited n or low SNR settings Particularly interesting regime: p large and n large. Matrix factorization is a simplified version of deep learning This section: building the framework from simple to complex
  • 76. 2 Matrix factorization and its variants For signals For discrete objects
  • 77. Matrix factorization for representations Reduce the dimensionality while keeping the signal “disentangle” give features that are useful in themselves G Varoquaux 48
  • 78. Principal Component Analysis1 Find the directions of largest variance Computation X ∈ Òn×p ΣX = XTX ∈ Òp×p PCA projector: PPCA ∈ Òp×k SVDk(X) or EVDk(ΣX) Reduced X: X PPCA ∈ Òn×k 1Mother of all representations (simplest) G Varoquaux 49
  • 79. Principal Component Analysis Find the directions of largest variance Computation X ∈ Òn×p ΣX = XTX ∈ Òp×p PCA projector: PPCA ∈ Òp×k SVDk(X) or EVDk(ΣX) Reduced X: X PPCA ∈ Òn×k Model: low-rank Gaussian latent factors X ≈ U V + E, E ∼ N (0, Ip), U ∈ Òn×k, V ∈ Òk×p Û, V̂ = argmin U,V kX − U Vk2 Fro Rotationally invariant: U0 = U O, OT V also solution for O s.t. OTO = I G Varoquaux 49
  • 80. Principal Component Analysis Find the directions of largest variance Computation X ∈ Òn×p ΣX = XTX ∈ Òp×p PCA projector: PPCA ∈ Òp×k SVDk(X) or EVDk(ΣX) Reduced X: X PPCA ∈ Òn×k Model: low-rank Gaussian latent factors X ≈ U V + E, E ∼ N (0, Ip), U ∈ Òn×k, V ∈ Òk×p Û, V̂ = argmin U,V kX − U Vk2 Fro Rotationally invariant: U0 = U O, OT V also solution for O s.t. OTO = I PCA = 1-hidden layer autoencoder with squared lossa min W kX − W WT Xk2 Fro, with suitable constraint on W aBoth find the same subspace G Varoquaux 49
  • 81. Principal Component Analysis Find the directions of largest variance In a learning pipeline Useful for dimensionality reduction (eg p is large) Eases statistics and computations Generalization error of PCA + OLS within a factor of 4 of ridge [Dhillon... 2013] G Varoquaux 49
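A minimal sketch of PCA used as a dimensionality-reduction step in front of a linear model, as suggested above (synthetic data; the number of components is a choice to validate in practice):

```python
# PCA + ridge in a single scikit-learn pipeline.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=200, n_features=1000, n_informative=20,
                       noise=5, random_state=0)
model = make_pipeline(PCA(n_components=50), RidgeCV())
print(cross_val_score(model, X, y, cv=5).mean())
```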
  • 82. Beyond variance: Independent Component Analysis Separate out signals U observed mixed1 [figure: true sources (signals U), observations (mixed signals), ICA-recovered signals] Disentangles: lifts the rotational invariance 1Classic ICA has no noise model: it does not do dimension reduction G Varoquaux 50
  • 83. Beyond variance: Independent Component Analysis Separate out signals U observed mixed1 Model: X = U V V ∈ Òp×p, VTV = Ip If V is Gaussian, the model is not identifiable Seek low mutual information across {uj} ⇒ Maximally non-Gaussian marginals [Cardoso 2003] Latent signals V Observed data U V 1Classic ICA has no noise model: it does not do dimension reduction G Varoquaux 50
  • 84. Beyond variance: Independent Component Analysis Separate out signals U observed mixed1 Model: X = U V V ∈ Òp×p, VTV = Ip If V is Gaussian, the model is not identifiable Seek low mutual information across {uj} ⇒ Maximally non-Gaussian marginals [Cardoso 2003] Computation: FastICA [Hyvärinen and Oja 2000] Power iterations on V Each time: - apply a smooth increasing non-linearity on {uj} - decorrelate Preprocessing: whiten the data eg with PCA 1Classic ICA has no noise model: it does not do dimension reduction G Varoquaux 50
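A classic source-separation sketch with scikit-learn's FastICA on synthetic mixed signals (toy sources and mixing matrix of my choosing; whitening is handled internally):

```python
# Unmixing three synthetic sources with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),            # sinusoid
                     np.sign(np.sin(3 * t)),   # square wave
                     2 * (t % 1) - 1])         # sawtooth
A = rng.standard_normal((3, 3))                # mixing matrix
X = S @ A.T                                    # observed mixtures

ica = FastICA(n_components=3, random_state=0)
S_hat = ica.fit_transform(X)   # recovered sources, up to sign and order
```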
  • 85. ICA to learn representations Across patches of natural images: Gabor-like filters Similar to wavelets and first layer of convnets G Varoquaux 51 [Hyvärinen and Oja 2000]
  • 86. ICA to learn representations Across patches of natural images: ICA Disentangles Can only learn rotations No dimension reduction G Varoquaux 52
  • 87. Dictionary learning Find vectors V that represents well the signal with sparse combinations U Model: X = U V s.t. U is sparse U ∈ Òn×k, V ∈ Òk×p k can be p (overcomplete dictionary) Estimation: Û, V̂ = argmin U,V, s.t. kvik2 2≤1 kX − U Vk2 Fro + λkUk1 Data fit without need for reduction Combining squared loss and `1 penalty creates sparsity Constraint on kvik2 2 required to avoid cancelling out penalty with V → ∞ and U → 0 x2 x1 G Varoquaux 53
  • 88. Dictionary learning Find vectors V that represents well the signal with sparse combinations U Model: X = U V s.t. U is sparse U ∈ Òn×k, V ∈ Òk×p k can be p (overcomplete dictionary) Estimation: Û, V̂ = argmin U,V, s.t. V∈C kX − U Vk2 Fro + λΩ(U) Constraint set and penalty can be varied1 Typically, `2, `1, and positivity2 on U or V. 1Fast when C and Ω lead to simple projections and penalized regression. 2Recovers a form of NMF (non-negative matrix factorization) G Varoquaux 53
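A small sketch of sparse dictionary learning on image patches with scikit-learn's online solver (the one mentioned later in the deck); the dataset and settings are toy choices so it runs quickly:

```python
# Learn a 36-atom dictionary on 4x4 patches of the digits images.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

digits = load_digits()
patches = np.vstack([
    extract_patches_2d(img, (4, 4)).reshape(-1, 16)
    for img in digits.images[:200]
])
patches = patches - patches.mean(axis=1, keepdims=True)  # center each patch

dico = MiniBatchDictionaryLearning(n_components=36, alpha=1.0,
                                   batch_size=256, random_state=0)
U = dico.fit_transform(patches)   # sparse codes (n_patches x 36)
V = dico.components_              # dictionary atoms (36 x 16)
```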
  • 89. Sparse dictionary learning to learn representations Across patches of natural images: Also learns Gabor-like filters1 Good for sparse models, eg for denoising Also performs dimensionality reduction 1as ICA, K-Means, etc on images patches G Varoquaux 54 [Mairal... 2014]
  • 90. Large n large p: brain imaging Brain activity at rest 1000 subjects with ∼ 100–10 000 samples Images of dimensionality 100 000 Dense matrix, large both ways [diagram: X ≈ U · V + E, over voxels × time] G Varoquaux 55
  • 91. Estimation algorithms For dictionary learning G Varoquaux 56
  • 92. Large n large p: recommender systems Product ratings Millions of entries Hundreds of thousands of products and users Large sparse matrix [diagram: X ≈ U · V + E, over users × products] G Varoquaux 57
  • 93. Online estimation: stochastic optimization minw Σi l(xi, w) Many samples → minw Å[l(y, x, w)] Gradient descent: wt+1 ← wt − αt ∇w l Stochastic gradient descent: wt+1 ← wt − αt ∇̂w l Use a cheap estimate of Å[∇w l] (e.g. subsampling) αt must decrease “suitably” with t. Those pesky learning rates G Varoquaux 58
  • 94. Online estimation for matrix factorization - Data access - Dictionary update Stream columns - Code com- putation Alternating minimization Data matrix Large matrices = terabytes of data argmin U,V kX−U Vk2 Fro + λΩ(U) G Varoquaux 59 [Mairal... 2010]
  • 95. Online estimation for matrix factorization Large matrices = terabytes of data argminU,V kX − U Vk2Fro + λΩ(U) Rewrite as an expectation: argminV Σi minu kXi − V uk22 + λΩ(u) = argminV Å[f (V)] ⇒ Optimize on approximations (sub-samples) G Varoquaux 59 [Mairal... 2010]
  • 96. Online estimation for matrix factorization - Data access - Dictionary update Stream columns - Code com- putation Online matrix factorization Alternating minimization Seen at t Seen at t+1 Unseen at t Data matrix G Varoquaux 59 [Mairal... 2010]
  • 97. Online estimation for matrix factorization - Data access - Dictionary update Stream columns - Code com- putation Subsample rows Online matrix factorization Subsampled online Alternating minimization Seen at t Seen at t+1 Unseen at t Data matrix G Varoquaux 59 [Mensch... 2017]
  • 98. Online matrix factorization algorithm [Mairal... 2010] Stream samples xt: 1. Compute code ut = argmin u∈Òk kxt − Vt−1uk2 2 + λΩ(u) G Varoquaux 60
  • 99. Online matrix factorization algorithm [Mairal... 2010] Stream samples xt: 1. Compute code ut = argmin u∈Òk kxt − Vt−1uk2 2 + λΩ(u) 2. Update the surrogate function gt(V) = 1 t t Õ i=1 kxi − V uik2 2 gt(V) surrogate = Õ x l(x, V) ui is used, and not u? G Varoquaux 60
  • 100. Online matrix factorization algorithm [Mairal... 2010] Stream samples xt: 1. Compute code ut = argmin u∈Òk kxt − Vt−1uk2 2 + λΩ(u) 2. Update the surrogate function gt(V) = 1 t t Õ i=1 kxi − V uik2 2 = tr 1 2 V VAt − V Bt At def = (1 − 1 t )At−1 + 1 t utu t Bt def = (1 − 1 t )Bt−1 + 1 t xtu t At and Bt are sufficient statistics of the loss accumulated over the data G Varoquaux 60
  • 101. Online matrix factorization algorithm [Mairal... 2010] Stream samples xt: 1. Compute code ut = argminu∈Òk kxt − Vt−1 uk22 + λΩ(u) 2. Update the surrogate function gt(V) = 1/t Σi=1..t kxi − V uik22 = tr(1/2 VT V At − VT Bt) + const with At def= (1 − 1/t) At−1 + 1/t ut utT Bt def= (1 − 1/t) Bt−1 + 1/t xt utT 3. Minimize the surrogate Vt = argminV∈C gt(V), ∇gt = V At − Bt G Varoquaux 60
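A bare-bones NumPy sketch of the streaming loop above, under simplifying assumptions of my own: squared loss with a ridge penalty on the code (instead of the slide's generic Ω), no constraint set C on the dictionary, and no subsampling:

```python
# Online matrix factorization with running sufficient statistics A_t, B_t.
import numpy as np

rng = np.random.default_rng(0)
p, k, lam = 100, 10, 0.1
V = rng.standard_normal((p, k))          # dictionary, p x k
A = np.zeros((k, k))                     # running average of u u^T
B = np.zeros((p, k))                     # running average of x u^T

for t in range(1, 5001):
    x = rng.standard_normal(p)           # stream one sample x_t
    # 1. code step: ridge regression of x on the current dictionary
    u = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ x)
    # 2. update the surrogate's sufficient statistics
    A = (1 - 1 / t) * A + (1 / t) * np.outer(u, u)
    B = (1 - 1 / t) * B + (1 / t) * np.outer(x, u)
    # 3. minimize the surrogate: grad = V A - B = 0  =>  V = B A^{-1}
    V = np.linalg.solve(A + 1e-8 * np.eye(k), B.T).T
```

The real algorithm additionally projects the atoms onto the constraint set (eg unit norm) and uses block coordinate descent on V rather than a full solve.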
  • 102. Stochastic Majorization-Minimization [Mairal 2013] V = argminV∈C Σx l(x, V) where l(x, V) = minu f (x, V, u) Algorithm: gt(V) is a majorant of Σx l(x, V) because ui is used, and not u⋆ ⇒ Majorization-Minimization scheme1 Surrogate computation SMM Full minimization 2nd-order information No learning rate 1SOMF uses an approximate majorant and minimization [Mensch... 2017] G Varoquaux 61
  • 103. Experimental convergence: large images [figure: test objective value vs time for ADHD sparse dictionary (2 GB), Aviris NMF (103 GB), Aviris dictionary learning (103 GB), and HCP sparse dictionary (2 TB); comparing OMF, SOMF with subsampling ratios r = 4 to 24, and best step-size SGD] SOMF = Subsampled Online Matrix Factorization G Varoquaux 62
  • 104. Experimental convergence: recommender system SOMF = Subsampled Online Matrix Factorization G Varoquaux 63
  • 105. Summary – matrix factorization of signals Versatile matrix-factorization formulation1 argminU∈Òn×k,V∈C kX − U Vk2Fro + λΩ(U) Estimation Stochastic majorization-minimization2 ⇒ an online alternated optimization Example use of learned representations Biomarkers of autism on brain images: p ∼ 100 000, n ∼ 1 000 [Abraham... 2017] 1A 1-layer linear autoencoder 2Common-case algorithm readily usable in scikit-learn: MiniBatchDictionaryLearning G Varoquaux 64
  • 106. 2 Matrix factorization and its variants For signals For discrete objects
  • 107. Embedding discrete objects Embedding discrete objects (words, entities, user ids) is crucial It endows them with a metric, enables building predictive functions that extrapolate between objects The original p is not small compared to n Construction Representative III Fire/Rescue Captain Resource Conservationist Security Officer II Security Officer III (Sergeant) G Varoquaux 66
  • 108. Natural language processing: topic-modeling history Topic modeling: embedding documents3 [illustration: term-document count matrix, documents × terms such as “the”, “Python”, “performance”, “profiling”, “module”, “is”, “code”, “can”, “a”] Start from a vectorization of each document by counting word occurrences: the term-document matrix 3Typically for information-retrieval purposes, aka search engines G Varoquaux 67
  • 109. Natural language processing: topic-modeling history Topic modeling: embedding documents3 [illustration: the term-document count matrix factorizes into two matrices: what terms are in a topic × what topics are in a document] LSA (Latent Semantic Analysis) [Landauer... 1998] SVD of the terms×documents matrix 3Typically for information-retrieval purposes, aka search engines G Varoquaux 67
  • 110. Gamma-Poisson for factorizing counts [Canny 2004] When X is a matrix of counts - Topic modeling - Recommender systems [Gopalan... 2014] - Database string entries [Cerda and Varoquaux 2020] =⇒ Poisson loss, instead of squared loss Ð(xj|wj) = Poisson(wj) = 1/xj! wj^xj e−wj [figure: Gaussian(.5) vs Poisson(3), Poisson(1), Poisson(0) distributions] Counts are not well approximated by a Gaussian G Varoquaux 68
  • 111. Gamma-Poisson for factorizing counts [Canny 2004] When X is a matrix of counts - Topic modeling - Recommender systems [Gopalan... 2014] - Database string entries [Cerda and Varoquaux 2020] =⇒ Poisson loss, instead of squared loss Ð(xj|u, V) = Poisson((u V)j) = 1/xj! (u V)j^xj e−(u V)j u are loadings, modeled as random with a Gamma prior1 Ð(ui) = ui^(αi−1) e−ui/βi / (βi^αi Γ(αi)) Maximum a posteriori estimation: Û, V̂ = argminU,V − Σj log Ð(xj|u, V) − Σi log Ð(ui) 1Because it is the conjugate prior of the Poisson, and because it imposes soft sparsity and lifts the rotational invariance G Varoquaux 68
  • 112. Gamma-Poisson estimation Full log-likelihood expression: log L = Σj=1..p [xj log((u V)j) − (u V)j − log(xj!)] + Σi=1..k [(αi − 1) log(ui) − ui/βi − αi log βi − log Γ(αi)] Gradients: ∂/∂Vij log L = xj/(u V)j ui − ui ∂/∂ui log L = Σj=1..p [xj/(u V)j Vij − Vij] + (αi − 1)/ui − 1/βi G Varoquaux 69
  • 113. Gamma-Poisson estimation Gradients: ∂/∂Vij log L = xj/(u V)j ui − ui ∂/∂ui log L = Σj=1..p [xj/(u V)j Vij − Vij] + (αi − 1)/ui − 1/βi Equivalent to an NMF formulation: multiplicative updates1 Vij ← Vij (Σℓ=1..n xℓj/(UV)ℓj uℓi) (Σℓ=1..n uℓi)−1 uℓi ← uℓi (Σj=1..p xℓj/(UV)ℓj Vij + (αi − 1)/uℓi) (Σj=1..p Vij + βi−1)−1 1Efficient implementation with sparse matrices: the summations can be done only on the non-zero entries of X. G Varoquaux 69
  • 114. Gamma-Poisson estimation Gradients: ∂/∂Vij log L = xj/(u V)j ui − ui ∂/∂ui log L = Σj=1..p [xj/(u V)j Vij − Vij] + (αi − 1)/ui − 1/βi Equivalent to an NMF formulation: multiplicative updates1 Vij ← Vij (Σℓ=1..n xℓj/(UV)ℓj uℓi) (Σℓ=1..n uℓi)−1 uℓi ← uℓi (Σj=1..p xℓj/(UV)ℓj Vij + (αi − 1)/uℓi) (Σj=1..p Vij + βi−1)−1 Adapt the majorization-minimization algorithm [Lefevre... 2011, Cerda and Varoquaux 2020] 1Efficient implementation with sparse matrices: the summations can be done only on the non-zero entries of X. G Varoquaux 69
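A NumPy sketch of the multiplicative updates above; the toy count matrix, the values of α and β, and the fixed number of iterations are my own choices:

```python
# Gamma-Poisson factorization of a count matrix via multiplicative updates.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5
X = rng.poisson(rng.gamma(2., 1., (n, k)) @ rng.dirichlet(np.ones(p), k))

alpha, beta, eps = 1.1, 1.0, 1e-10
U = rng.gamma(1., 1., (n, k))
V = rng.gamma(1., 1., (k, p))

for _ in range(200):
    R = X / (U @ V + eps)                              # x_ij / (UV)_ij
    V *= (U.T @ R) / (U.sum(axis=0)[:, None] + eps)    # update the atoms
    R = X / (U @ V + eps)
    U *= (R @ V.T + (alpha - 1) / (U + eps)) \
         / (V.sum(axis=1)[None, :] + 1 / beta)         # update the loadings
```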
  • 115. Application: embedding via string form Problem: representing non-normalized categories Drug Name alcohol ethyl alcohol isopropyl alcohol polyvinyl alcohol isopropyl alcohol swab 62% ethyl alcohol alcohol 68% alcohol denat benzyl alcohol dehydrated alcohol Employee Position Title Police Aide Master Police Officer Mechanic Technician II Police Officer III Senior Architect Senior Engineer Technician Social Worker III G Varoquaux 70 Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
  • 116. Application: embedding via string form Gamma-Poisson factorization on sub-string (3-gram) counts Models strings as a linear combination of substrings [illustration: binary 3-gram count matrix for the entries “police officer”, “pol off”, “polis”, “policeman”, “policier”, over 3-grams such as “pol”, “lic”, “ice”, “ce_”, “_of”, “off”, “fic”, “cer”, “er_”] G Varoquaux 71 Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
  • 117. Application: embedding via string form Gamma-Poisson factorization on sub-string (3-gram) counts Models strings as a linear combination of substrings [illustration: the 3-gram count matrix factorizes into two matrices: what substrings are in a latent category × what latent categories are in an entry] G Varoquaux 71 Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
  • 118. Application: embedding via string form Representations that extract latent categories [figure: activations of job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) on latent categories named after substrings such as library, operator, specialist, warehouse, manager, community, rescue, officer] G Varoquaux 72 Code: dirty-cat.github.io [Cerda and Varoquaux 2020]
  • 119. Application: embedding via string form Inferring plausible feature names [figure: inferred feature names for the latent categories (eg “assistant, library”; “equipment, operator”; “administration, specialist”; “craftsworker, warehouse”; “crossing, program, manager”; “mechanic, community”; “firefighter, rescuer, rescue”; “correction, officer”), against job titles from Legislative Analyst II to Police Sergeant] G Varoquaux 72 [Cerda and Varoquaux 2020]
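A sketch using the dirty-cat package pointed to on these slides; the GapEncoder name and its parameters are how the package exposed this Gamma-Poisson model at the time of writing, and may differ across versions:

```python
# Embed dirty job titles via Gamma-Poisson factorization of 3-gram counts.
import pandas as pd
from dirty_cat import GapEncoder

positions = pd.DataFrame({"position": [
    "Master Police Officer", "Police Officer III", "Police Aide",
    "Social Worker III", "Senior Architect", "Mechanic Technician II",
]})

enc = GapEncoder(n_components=3, random_state=0)
activations = enc.fit_transform(positions[["position"]])
# Each row: loadings of a job title on 3 latent categories built from
# its character n-grams.
print(activations.shape)
```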
  • 120. So far: Matrix factorization of counts (eg co-occurrences) Embeds discrete objects Better with a suitable loss Next: Implicit matrix factorization and losses G Varoquaux 73
  • 121. Word embeddings Distributional semantics: meaning of words “You shall know a word by the company it keeps” Firth, 1957 Example: A glass of red , please Could be wine maybe juice? wine and juice have related meanings Factorization of the word×context matrix What choice of context? What loss? word2vec [Mikolov... 2013a] glove [Pennington... 2014] G Varoquaux 74
  • 122. Word2vec: skip-gram sampling [Mikolov... 2013b] {ûw, v̂c} = argmax{uw,vc} Σ over pairs of words (w, c) in the same window1 of log softmax(V uwT)c with softmax(z)i = exp zi / Σj exp zj uw ∈ Òk: embedding of word w V ∈ Òcard(voc)×k: [vc, c ∈ voc] all context words Big sum on contexts ⇒ solved by SGD2 [figure: U word embeddings and V context embeddings of salad, meat, juice, wine, glass, red, green] Other view: language models Prediction of words 1Efficient: never build the matrix, stream directly from text. 2These windows are called skip grams G Varoquaux 75
  • 123. Word2vec: negative sampling [Mikolov... 2013a] Costly loss: log softmax(z)i = log (exp zi / Σj exp zj) Approximate1 Huge sum in the softmax (all vocabulary) Downsample it by drawing the positive (numerator) and a few negative examples (denominator) Negative sampling loss2: [Goldberg and Levy 2014] log σ(vc uwT) + Σ over nneg words w′ not in the window of log σ(−vc uw′T) σ: sigmoid, σ(z) = 1/(1 + e−z) 1Related to noise contrastive estimation, which avoids computing costly normalizations in likelihoods [Gutmann and Hyvärinen 2010] 2Related to a matrix factorization of the mutual information in word co-occurrence [Levy and Goldberg 2014] G Varoquaux 76
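A toy evaluation of the negative-sampling objective above for a single (word, context) pair with a handful of sampled negatives; the embeddings are random here, whereas in practice U and V are learned by SGD on this loss:

```python
# Skip-gram negative-sampling loss for one positive pair + 5 negatives.
import numpy as np

rng = np.random.default_rng(0)
vocab, k, n_neg = 1000, 50, 5
U = 0.1 * rng.standard_normal((vocab, k))   # word embeddings
V = 0.1 * rng.standard_normal((vocab, k))   # context embeddings

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)           # numerically stable log sigma(z)

w, c = 3, 17                                 # center word, observed context
negatives = rng.integers(0, vocab, n_neg)    # sampled "noise" words

loss = -(log_sigmoid(V[c] @ U[w])
         + log_sigmoid(-(V[negatives] @ U[w])).sum())
print(loss)
```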
  • 124. Beyond natural language: metric learning Triplet loss For a an “anchor”, b close to a, c far from a: log σ(vaT ub) − log σ(vaT uc) Quadruplet loss [Chen... 2017] For a and b close by, c and d far apart: log σ(vaT ub) − log σ(vcT ud) In practice: draw1 randomly (a, b, c) or (a, b, c, d) Metric learning: [Bellet... 2013] Learning embeddings with weak supervision 1Many strategies, eg “hard negative mining”, which require a good test set and metric to be set, as with SGD hyperparameters. G Varoquaux 77
  • 125. Embedding entities in knowledge graphs Structured (graph) representation of human knowledge eg dbpedia, Yago Challenge: relations of multiple nature G Varoquaux 78
  • 126. Embedding entities in knowledge graphs Structured (graph) representation of human knowledge eg dbpedia, Yago Learning embeddings of entities {ei} and relations {rj}: ea ∼ eb + rc a model of the relation1 Then triplet / quadruplet loss Reuse existing: conceptnet.io 1Richer, better models [Wang... 2014] G Varoquaux 78 [Bordes... 2013, Wang... 2017]
  • 127. The value of simple models Risk of invisible overfit during the search for hyperparameters and models Complex models call for a clear utility measure with low measurement error Many reliable labels G Varoquaux 79
  • 128. The value of simple models Risk of invisible overfit during the search for hyperparameters and models Complex models call for a clear utility measure with low measurement error Many reliable labels Matrix factorization models1: 2 hyper-parameters: Dimensionality k Regularization λ Set them to optimize representations for supervised problems 1Using majorization-minimization approaches to avoid learning rates G Varoquaux 79
  • 129. Summary – embedding discrete objects Discrete entities lead to counting occurrences ⇒ Poisson and logistic losses (ugly logs in equations) Word and entity embeddings Factorization of co-occurrences in a notion of context more generally: metric learning Limited-data settings: Avoid negative-sampling models (hyper-parameters) Try to reuse representations (fasttext, conceptnet.io) G Varoquaux 80
  • 130. Summary – matrix factorization Builds linear representations of the input At the root of many more complex variants Majorization-Minimization solvers: scalable and “fire and forget” G Varoquaux 81
  • 131. 3 Method evaluation with limited data Less data =⇒ more difficult evaluation Section inspired by [Bouthillier... 2021]
  • 132. Evaluation of the generalization error Focus on representation to facilitate prediction =⇒ evaluate prediction Leaving aside representation for interpretability Danger of reading tea leaves Interpretation = ill defined, requires expert knowledge, subject to confirmation bias [Lipton 2018] Ill-conditioned problem =⇒ strong dependence on prior =⇒ self-fulfilling prophecies G Varoquaux 83
  • 133. 3 Method evaluation with limited data Variance in model evaluation Reliable experimental procedures From benchmarks to conclusion
  • 134. Model evaluation New data is required to assess generalization performance Å l f (X), y Split data in train and test set typically 10% trade off better learning vs better estimation Test set Train set Full data Make choices on the model split train, validation, and test Test set Full data Validation set Train set Make model choices Evaluate model G Varoquaux 85
  • 135. Evaluation error: Sampling noise on the test set Sampling noise1 for ntest = 1000: [figure: binomial distribution of the error on test accuracy, roughly ±2% around the mean] Confidence intervals ntest = 1 000 interval: 5.7% ntest = 10 000 interval: 1.8% ntest = 100 000 interval: 0.6% Optimizing test accuracy will explore the tails Selecting architecture, learning rate... overfits the validation set 1The data at hand (eg the test set) is just a small sample of the full population “in the wild”, and sampling other data will lead to other results. G Varoquaux 86 [Varoquaux 2018]
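A quick way to see the binomial sampling noise for yourself; the accuracy value 0.85 is an arbitrary choice of mine, and the interval width shrinks only slowly with ntest:

```python
# Width of a 95% binomial interval on test accuracy at various n_test.
from scipy.stats import binom

acc = 0.85
for n_test in (1_000, 10_000, 100_000):
    lo, hi = binom.interval(0.95, n_test, acc)
    print(n_test, f"±{(hi - lo) / (2 * n_test):.1%}")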
  • 136. Evaluation error: Sampling noise on the test set “in the wild” [figure: standard deviation of test accuracy (% acc) as a function of test-set size (102 to 106): in theory, from a binomial; in practice, from random splits; shown for Glue-RTE BERT (n′=277), Glue-SST2 BERT (n′=872), CIFAR10 VGG11 (n′=10000)] G Varoquaux 87 [Bouthillier... 2021]
  • 137. Evaluation is a bottleneck – in publications [figure: accuracy of published results over the years (2012–2020) on cifar10 and sst2, marking significant vs non-significant improvements and non-'SOTA' results] NLP: Glue sentiment-analysis benchmark (ntest = 1.8k) Vision: object-recognition benchmark (ntest = 10k) Published improvements compared to benchmark variance G Varoquaux 88 [Bouthillier... 2021]
  • 138. Evaluation is a bottleneck – in Kaggle competitions [figure: improvement of the top model over the 10% best vs evaluation noise between public and private sets, for four competitions] Lung cancer classification (test size: max 1K): smaller improvements than noise, diminishing returns Schizophrenia classification (test size: 120): diminishing returns Lung tumor segmentation (test size: max 6k): poorer score on the private set, overfit Nerve segmentation (test size: 5.5K): actual improvement G Varoquaux 89 [Varoquaux and Cheplygina 2021]
  • 139. The full benchmarking pipeline New data to assess generalization performance Å l f (X), y Split out test set Split out validation set Choose hyper-parameters on validation set Test set Full data Validation set Train set Make model choices Evaluate model Measure performance on test set Rampant overfit of validation set [Makarova... 2021] G Varoquaux 90
  • 140. Sources of variance in a machine-learning benchmark [figure: contribution of each source of variation – numerical noise, dropout, weights init, data order, data augmentation, data (bootstrap), noisy grid search, random search, Bayesian optimization (hyper-parameter optimization and learning algorithm) – across case studies: bert-rte, bert-sst2, bio-task2, segmentation, vgg, and on average] Model-evaluation results are most affected by: 1. Arbitrary split into train and test 2. Random (arbitrary) parameters 3. Uncertainty in optimized hyper-parameters G Varoquaux 91 [Bouthillier... 2021]
  • 141. Summary – variance in benchmarks Evaluating generalization is limited by ntest ntest = 10 000 =⇒ ±.9% ntest = 100 000 =⇒ ±.3% Benchmark hyper parameter choice Careful not to overfit hyper-parameters Variance in machine-learning benchmarks 1. Data splits 2. Random seeds 3. Hyper-parameter choice ... G Varoquaux 92
  • 142. 3 Method evaluation with limited data Variance in model evaluation Reliable experimental procedures From benchmarks to conclusion
  • 143. Settings: what are we benchmarking prediction rule: f : X → Y training procedure: given data (X, y) ∈ (X × Y)n outputs a prediction rule hyper parameters: parameters not set by the procedure full training pipeline: hyper-parameter choice + training procedure G Varoquaux 94
  • 144. Benchmarking a prediction rule vs a training pipeline Benchmarking a prediction rule Before putting in production Fixed training set evaluation limited by test set size Benchmarking a training pipeline To conclude on good training procedures Useless to tune random seeds (for weights init, dropout, data augmentation) will not carry over to new training data G Varoquaux 95
  • 145. Benchmarking a training pipeline [figure: sources of variation (numerical noise, dropout, weights init, data order, data augmentation, data bootstrap, noisy grid search, random search, Bayesian optimization) across the bert-rte, bert-sst2, bio-task2, segmentation and vgg case studies] Reduce error and gauge variance data sampling: multiple train-test splits, cross-validation arbitrary choices (seeds): randomize them all hyper-parameters: hyper-parameter optimization, too expensive to randomize G Varoquaux 96 [Bouthillier... 2021]
  • 146. Hyper-parameter optimization procedures Random search [Bergstra and Bengio 2012] (prefer it to grid search for more than 2 params) [figure: grid search vs randomized search over two hyperparameters, one important and one unimportant, with the region of good hyperparameters highlighted] G Varoquaux 97
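A minimal random-search setup with scikit-learn; the model, distributions and budget below are illustrative choices, not recommendations from the slides:

```python
# Random search over continuous hyper-parameter distributions.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-3, 1e3),
                         "gamma": loguniform(1e-4, 1e1)},
    n_iter=30, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```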
  • 147. Hyper-parameter optimization procedures Random search [Bergstra and Bengio 2012] (prefer to grid-search for more than 2 params) Bayesian optimization G Varoquaux 97
  • 148. Hyper-parameter optimization procedures Random search [Bergstra and Bengio 2012] (prefer to grid-search for more than 2 params) Bayesian optimization Sub-optimal hyper-parameters on models routinely lead to invalid conclusions See refs in [Bouthillier... 2021] G Varoquaux 97
  • 149. Benchmarking with hyper-parameters Difficulty: measure suboptimality and variance due to hyper-parameters Ideal strategy: multiple hyper-parameter optimizations with different seeds Costly In practice: set hyper parameters once, then randomize model seeds and data splits Counterintuitive: more randomization decorrelates sources of error, and thus improves benchmarks G Varoquaux 98 [Bouthillier... 2021]
  • 150. Summary – better measures Benchmarking a prediction rule ≠ benchmarking a training procedure For training procedures: randomize everything Data splits, all random procedures Hyper-parameter optimization outside the randomization is suboptimal, but randomization after it helps G Varoquaux 99
  • 151. 3 Method evaluation with limited data Variance in model evaluation Reliable experimental procedures From benchmarks to conclusion
  • 152. Statistical tests ML benchmarks Null hypothesis testing – p-value: the chance to observe the results if a null hypothesis were true Typical null: model comparison model p1 and p2 give same expected error G Varoquaux 101
  • 153. Statistical tests: single test set (comparing prediction rules) Test set Train set Full data Simple distribution of metrics, eg accuracy: binomial Safer to use permutations, for correlated errors across prediction rules [Bandos... 2005] Sample null distribution by randomly switching predictions from p1 and p2. G Varoquaux 102
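A sketch of the permutation test described above, comparing two prediction rules on the same test set; the per-sample correctness arrays are toy data here (in a real comparison they come from the two models' predictions, and their errors are typically correlated):

```python
# Permutation test: randomly swap the two rules' per-sample correctness
# to sample the null distribution of the accuracy difference.
import numpy as np

rng = np.random.default_rng(0)
correct_1 = rng.random(1000) < 0.87   # rule p1 correct on each test sample
correct_2 = rng.random(1000) < 0.85   # rule p2 correct on each test sample

observed = correct_1.mean() - correct_2.mean()
null = []
for _ in range(10_000):
    swap = rng.random(1000) < 0.5     # swap the two predictions per sample
    a = np.where(swap, correct_2, correct_1)
    b = np.where(swap, correct_1, correct_2)
    null.append(a.mean() - b.mean())
p_value = np.mean(np.abs(null) >= abs(observed))
print(p_value)
```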
  • 154. Statistical tests: cross-validation (comparing training pipelines) Test set Train set Full data Challenge: folds are not independent1 [Dietterich 1998] t-test/Wilcoxon across folds are not valid 1Train sets overlap, and often test sets also do. G Varoquaux 103
  • 155. Statistical tests: cross-validation (comparing training pipelines) Test set Train set Full data Challenge: folds are not independent1 [Dietterich 1998] t-test/Wilcoxon across folds are not valid Correct for dependence across folds2 5x2cv: repeat 5 times randomized 2-fold Use a t-test with 5 degrees of freedom [Dietterich 1998] Corrected resampled t-test statistic Formula for fold correlation [Nadeau and Bengio 2003] 1Train sets overlap, and often test sets also do. 2Does not account for sources of variance other than data sampling, eg random seeds, hyper parameters. G Varoquaux 103
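A sketch of the corrected resampled t-test [Nadeau and Bengio 2003]: because the folds are correlated, the variance term is inflated by (1/r + ntest/ntrain) instead of 1/r. The per-split differences below are made-up numbers for illustration:

```python
# Corrected resampled t-test on r repeated 90/10 splits.
import numpy as np
from scipy import stats

diffs = np.array([0.021, 0.034, 0.018, 0.040, 0.012,
                  0.027, 0.031, 0.016, 0.025, 0.029])  # accuracy gaps per split
r = len(diffs)
n_train_frac, n_test_frac = 0.9, 0.1

mean, var = diffs.mean(), diffs.var(ddof=1)
t_stat = mean / np.sqrt(var * (1 / r + n_test_frac / n_train_frac))
p_value = 2 * stats.t.sf(abs(t_stat), df=r - 1)
print(t_stat, p_value)
```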
  • 156. Statistical tests: across datasets (more general claims on training pipelines) Challenge: metrics not comparable across datasets =⇒ Tests based on rank statistics Wilcoxon signed rank test Tests how often p1 outperforms p2 across datasets G Varoquaux 104 [Demšar 2006]
  • 157. Statistical tests: multiple pipelines across datasets (compare multiple training pipelines) Challenge: multiple comparisons1 The Wilcoxon-Holm approach Pairwise comparisons + Bonferroni-Holm correction The Friedman-Nemenyi approach2 1. Friedman test across all pipelines (omnibus test) 2. Nemenyi test gives a critical difference Critical difference diagrams [figure: critical-difference diagram ranking clf1 to clf5 by accuracy (rank)] 1If we do many tests, some will show large differences by chance. 2The Holm approach can be more interesting when considering only comparisons to one referent classifier. G Varoquaux 105 [Demšar 2006]
  • 158. Statistical tests: multiple pipelines across datasets (compare multiple training pipelines) Challenge: multiple comparisons1 Replicability analysis Perform dataset-level pairwise tests Combine by testing2: “Does p1 perform better than p2 on at least u datasets?” More powerful than [Demšar 2006] for a small number of datasets 1If we do many tests, some will show large differences by chance. 2Using a partial conjunction multiple-testing procedure, as described in [Dror... 2017] G Varoquaux 106 [Dror... 2017]
  • 159. Statistical tests: beyond null-hypothesis testing Sample size is a problem Across datasets: significance typically requires 15 datasets In a dataset (repeating folds, seeds...): many repetitions makes any difference significant1 Underpowered experiments are no evidence 1Though as the total test-set size is limited, they do not bring more evidence for generalization. G Varoquaux 107 [Demšar 2008]
  • 160. Statistical tests: beyond null-hypothesis testing Sample size is a problem Across datasets: significance typically requires 15 datasets In a dataset (repeating folds, seeds...): many repetitions make any difference significant1 Underpowered experiments are no evidence Shortcomings of null-hypothesis testing Significance decreases with more comparisons2 Statistical significance ≠ practical significance 1Though as the total test-set size is limited, they do not bring more evidence for generalization. 2FDR (False Discovery Rate) attempts to solve this. G Varoquaux 107 [Demšar 2008]
  • 161. Statistical tests: accounting for effect sizes Neyman-Pearson view of hypothesis testing Two hypotheses, H0 and H1 H1: p1 outperforms p2 by a margin1 Which is most likely? [figure: overlapping H0 and H1 distributions] Requires the choice of the margin Related to superiority testing in clinical trials [Lesaffre 2008] 1Related to the rejection region in the Neyman-Pearson lemma. G Varoquaux 108
  • 162. Pragmatic compromises Test on P(p1 > p2) > δ, δ > .5: Neyman-Pearson view Evaluate P(p1 > p2) by resampling Randomize everything: data splits, seeds,... Gaussian approximation: amounts to comparing differences to standard deviations Not an inference on the expected difference in performance1 1Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings. G Varoquaux 109 [Bouthillier... 2021]
  • 163. Summary – concluding from benchmarks Account for variance Null-hypothesis testing: no t-test on cross-validation! Don’t mis-interpret p-value: - Not significant: more data could change that - Significant: difference may be trivial Detect practical differences: difference in performance vs standard deviation G Varoquaux 110
  • 164. Better experimental procedures Crack the black box open A prediction score is seldom insightful Ablation studies: remove/change atomic elements Learning curves Better benchmarking in these Tune hyper-parameters to the same quality Randomize everything Account for variance in conclusions G Varoquaux 111
  • 165. Summary – Benchmarking with limited data Reminder: Your validation measure is intrinsically unreliable (sampling noise) An arbitrary choice (random seed) may give seemingly-good results that do not generalize Sample many choices Account for the resulting variance in conclusions [figure: distribution of errors under a binomial law as a function of the number of available samples: 1000 → ±2%, 300 → ±4%, 200 → ±5%, 100 → ±7%, 30 → −15%/+12%] G Varoquaux 112
  • 166. Representation learning in limited-data settings Good representations help learning Enable the use of simpler models better approximation representation, less estimation error Simple supervised learning of representations pretext tasks, stacking, factorizing multi-output Matrix factorizations Extract representations without labels MM solvers are “fire and forget” Careful benchmarking is crucial Optimistic flukes will not generalize G Varoquaux 113 @GaelVaroquaux
  • 167. References I A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras, B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage, 147:736, 2017. A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018. A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive to differences in areas for comparing roc curves from a paired design. Statistics in medicine, 24:2873, 2005. A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013. J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281, 2012. G Varoquaux 114
  • 168. References II A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013. X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ... Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021. D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux. Semi-supervised factored logistic regression for high-dimensional neuroimaging data. In Advances in Neural Information Processing Systems, page 3348, 2015. J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122, 2004. G Varoquaux 115
  • 169. References III J.-F. Cardoso. Dependence, correlation and gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177, 2003. P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020. W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, page 403, 2017. L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31:3036–3046, 2018. G Varoquaux 116
  • 170. References IV L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 2019. J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006. J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, page 65. Citeseer, 2008. P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14:1505, 2013. T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998. G Varoquaux 117
  • 171. References V R. Dror, B. G., Bogomolov, M., and R. Reichart. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 2017. S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural computation, 4(1):1–58, 1992. Y. Goldberg and O. Levy. word2vec explained: Deriving mikolov et al.’s negative-sampling word-embedding method. arXiv:1402.3722, 2014. P. K. Gopalan, L. Charlin, and D. Blei. Content-based recommendations with poisson factorization. In Advances in Neural Information Processing Systems, page 3176, 2014. G Varoquaux 118
  • 172. References VI M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, page 297, 2010. L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression, volume 1. Springer, 2002. K. He, Y. Wang, and J. Hopcroft. A powerful generative model using random weights for the deep image representation. Advances in Neural Information Processing Systems, 29:631–639, 2016. D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14, 2014. A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411, 2000. A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of multivariate analysis, 5:248, 1975. G Varoquaux 119
  • 173. References VII J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse processes, 25:259, 1998. A. Lefevre, F. Bach, and C. Févotte. Online algorithms for nonnegative matrix factorization with the itakura-saito divergence. In Applications of Signal Processing to Audio and Acoustics (WASPAA), page 313. IEEE, 2011. E. Lesaffre. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU hospital for joint diseases, 66(2), 2008. O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, page 2177, 2014. G Varoquaux 120
  • 174. References VIII Z. C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 2018. J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010. J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014. G Varoquaux 121
  • 175. References IX A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and C. Archambeau. Overfitting in bayesian optimization: an empirical study and early-stopping solution. arXiv preprint arXiv:2104.08166, 2021. S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A, 374:20150203, 2016. A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66:113, 2017. A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting universal representations of cognition across brain-imaging studies. arXiv preprint arXiv:1809.06035, 2018. G Varoquaux 122
  • 176. References X P. Mianjy, R. Arora, and R. Vidal. On the implicit bias of dropout. In International Conference on Machine Learning, pages 3540–3548. PMLR, 2018. D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3:27, 2001. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR Workshop Papers. 2013a. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, page 3111, 2013b. G Varoquaux 123
  • 177. References XI G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, page 2924, 2014. C. Nadeau and Y. Bengio. Inference for the generalization error. Machine learning, 52(3):239–281, 2003. P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. ICLR, 2020. B. Neal, S. Mittal, A. Baratin, V. Tantia, M. Scicluna, S. Lacoste-Julien, and I. Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018. E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering transform: Deep hybrid networks. In Proceedings of the IEEE international conference on computer vision, page 5618, 2017. G Varoquaux 124
  • 178. References XII J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), page 1532, 2014. M. E. Peters, S. Ruder, and N. A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks. Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP), 2019. M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint prediction of multiple scores captures better individual traits from brain images. Neuroimage, 158:145–154, 2017a. M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions from neuroimaging: assessing reduced-rank linear models. In 2017 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2017b. G Varoquaux 125
  • 179. References XIII A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Nips, pages 1313–1320. Citeseer, 2008. S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, pages 1–14, 2018. C. Sun, X. Qiu, Y. Xu, and X. Huang. How to fine-tune bert for text classification? China National Conference on Chinese Computational Linguistics, 2019. I. Ustyuzhaninov, W. Brendel, L. A. Gatys, and M. Bethge. Texture synthesis using shallow convolutional networks with random filters. arXiv preprint arXiv:1606.00021, 2016. G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage, 180:68–77, 2018. G Varoquaux 126
  • 180. References XIV G. Varoquaux and V. Cheplygina. How i failed machine learning in medical imaging–shortcomings and recommendations. arXiv preprint arXiv:2103.10292, 2021. Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017. Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. AAAI Conference on Artificial Intelligence, 2014. G Varoquaux 127