Brain reading:
 Compressive sensing, fMRI,
 and statistical learning in Python
                Gaël Varoquaux

                                      INRIA/Parietal
1 Brain reading: predictive models
   2 Sparse recovery with correlated designs
   3 Having an impact: software




1 Brain reading: predictive models
   Functional brain imaging:
     Study of human cognition




1 Brain imaging
  fMRI data: > 50 000 voxels                    stimuli

                   Standard analysis
              Detect voxels that correlate with
              the stimuli
1 Brain reading

  Predicting the object category viewed
   [Haxby 2001, Distributed and Overlapping
   Representations of Faces and Objects in Ventral
   Temporal Cortex]
          Find combinations of voxels that
          predict the stimuli
          Supervised learning task

                Multivariate statistics
1 Linear model for fMRI

                sign(X w + e) = y

      Design matrix X  ×  coefficients w  =  target y

                   Problem size:
                 p > 50 000 voxels
                 n ∼ 100 samples per category
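As a toy illustration of this forward model (not the analysis itself), the sketch below generates synthetic data following y = sign(X w + e); the shapes and the location of the non-zero coefficients are made up for the example:

    # Synthetic illustration of y = sign(X w + e); sizes are illustrative.
    import numpy as np

    rng = np.random.RandomState(0)
    n, p = 100, 50000            # ~100 samples, > 50 000 voxels
    X = rng.randn(n, p)          # design matrix: one row per acquired volume
    w = np.zeros(p)              # true coefficients: a small active "region"
    w[1000:1050] = 1.0
    e = rng.randn(n)             # acquisition noise
    y = np.sign(X @ w + e)       # binary stimulus labels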
1 Estimation: statistical learning

 Inverse problem       Minimize an error term:

                            ŵ = argmin_w  l(y − X w)

                       Ill-posed: X is not full rank

                       Inject a prior: regularize

                            ŵ = argmin_w  l(y − X w) + p(w)

          Example: Lasso = sparse regression

                            ŵ = argmin_w  ‖y − X w‖₂² + ℓ₁(w),
                            with ℓ₁(w) = Σᵢ |wᵢ|
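A minimal sketch of such a sparse fit with scikit-learn, on synthetic data (the sizes and the regularization value are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    n, p = 100, 2000
    X = rng.randn(n, p)
    w_true = np.zeros(p)
    w_true[:10] = 1.0                       # sparse ground truth
    y = X @ w_true + 0.1 * rng.randn(n)

    lasso = Lasso(alpha=0.1)                # alpha weights the l1 penalty
    lasso.fit(X, y)
    print("non-zero coefficients:", np.sum(lasso.coef_ != 0))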
1 TV penalization to promote regions

                                                [Haxby 2001]
    Neuroscientists think in
    terms of brain regions

                           Total-variation penalization
                           Impose sparsity on the gradient
                           of the image:

                                  p(w) = ℓ₁(∇ w)

                                     [Michel TMI 2011]
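To make the penalty concrete, a minimal sketch of ℓ₁(∇ w) on a 3D weight map, using numpy's finite differences (central differences here, a simplification of the implementation in [Michel TMI 2011]):

    import numpy as np

    def tv_penalty(w_map):
        """l1 norm of the spatial gradient of a 3D coefficient map."""
        gradients = np.gradient(w_map)            # one array per spatial axis
        return sum(np.abs(g).sum() for g in gradients)

    w_map = np.zeros((10, 10, 10))
    w_map[3:6, 3:6, 3:6] = 1.0                    # a compact "region"
    print(tv_penalty(w_map))                      # small value: sparse gradient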
1 Prediction with logistic regression - TV
              ŵ = argmin_w  l(y − X w) + p(w)
     l: least-squares or logistic regression   p: TV

   Optimization: proximal gradient (FISTA)
     - Gradient descent on l (smooth term)
     - Proximal step for TV

   Prediction performance (explained variance):
     Feature screening + SVC   0.77
     Sparse regression         0.78
     Total variation           0.84
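A bare-bones sketch of the proximal-gradient (FISTA) scheme, using soft-thresholding (the ℓ₁ prox) in place of the TV prox to keep the example short; this illustrates the optimization scheme only, not the solver of [Michel TMI 2011]:

    import numpy as np

    def fista(X, y, alpha, n_iter=200):
        """Minimize 0.5 * ||y - X w||^2 + alpha * ||w||_1 with FISTA."""
        p = X.shape[1]
        lipschitz = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
        w, z, t = np.zeros(p), np.zeros(p), 1.0
        for _ in range(n_iter):
            grad = X.T @ (X @ z - y)               # gradient of the smooth term
            w_new = z - grad / lipschitz
            # proximal step: soft-thresholding (l1 prox)
            w_new = np.sign(w_new) * np.maximum(np.abs(w_new) - alpha / lipschitz, 0)
            t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
            z = w_new + (t - 1) / t_new * (w_new - w)
            w, t = w_new, t_new
        return w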
1 Standard analysis or predictive modeling?

  Predicting the object category viewed
   [Haxby 2001, Distributed and Overlapping
   Representations of Faces and Objects in Ventral
   Temporal Cortex ]
                                     Take home message:
                                 brain regions, not prediction




1 Standard analysis or predictive modeling?




         Recovery rather than prediction




1 Good prediction = good recovery?
  Simulations                            Ground truth

  Lasso
    Prediction: 0.78
    Recovery: 0.429

  SVM
    Prediction: 0.71
    Recovery: 0.486

              Need a method suited for recovery
1 Brain mapping: a statistical perspective

              Small-sample linear model estimation
              Random correlated design

                         Problem size:
                       p > 50 000
                       n ∼ 100 per category
1 Brain mapping: a statistical perspective

               Small-sample linear model estimation
               Random correlated design

    Estimation strategy
     Standard approach: univariate statistics
     Multiple comparisons problem
                        ⇒ statistical power ∝ 1/p

      We want sub-linear sample complexity
         ⇒ non-rotationally-invariant estimators
                             e.g. ℓ₁ penalization
              [Ng 2004, Feature selection, ℓ₁ vs. ℓ₂ regularization,
                                         and rotational invariance]
1 Brain mapping as a sparse recovery task
   Recovering brain regions




1 Brain mapping as a sparse recovery task
   Recovering k non-zero coefficients
     nmin ∼ 2 k log p
     Restricted-isometry-like property:
      The design matrix is well-conditioned       [Candes 2006]
      on sub-matrices of size > k                  [Tropp 2004]
                                               [Wainwright 2009]
     Mutual incoherence:
     Relevant features S and irrelevant
     ones S̄ are not too correlated

   Violated by spatial
   correlations in our design

                                          (lasso: 23 non-zeros)
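As an order of magnitude for the simulations presented later (p = 2048, k = 64), 2 k log p is about 980 samples with a natural logarithm and about 1400 with a base-2 logarithm, in either case far more than the n = 256 samples that will be available.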
1 Randomized sparsity
                  [Meinshausen and Bühlmann 2010, Bach 2008]
    Perturb the design matrix:
     Subsample the data
     Randomly rescale features
    + Run a sparse estimator
    Keep features that are often selected
    ⇒ Good recovery without mutual incoherence
       But the RIP-like condition is still required

   Cannot recover large correlated groups:
   for m correlated features, the
   selection frequency is divided by m
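A minimal sketch of this randomization scheme (subsample, rescale, fit a sparse model, count selections); the estimator, the penalty and the scaling factor are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    def stability_scores(X, y, alpha=0.1, n_runs=50, scaling=0.5, seed=0):
        """Selection frequency of each feature under randomized lasso fits."""
        rng = np.random.RandomState(seed)
        n, p = X.shape
        counts = np.zeros(p)
        for _ in range(n_runs):
            rows = rng.choice(n, n // 2, replace=False)    # subsample the data
            weights = 1 - scaling * rng.randint(0, 2, p)   # randomly rescale features
            lasso = Lasso(alpha=alpha).fit(X[rows] * weights, y[rows])
            counts += lasso.coef_ != 0                     # accumulate selections
        return counts / n_runs                             # selection frequencies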
2 Sparse recovery with
     correlated designs
     Not enough samples: nmin ∼ 2 k log p
     Spatial correlations




2 Sparse recovery with
     correlated designs
                    Combining

                      Clustering




                      Sparsity
                      [Varoquaux ICML 2012]

2 Brain parcellations
    Spatially-connected hierarchical clustering
    ⇒ reduces the number of voxels [Michel Pat Rec 2011]

    Replace features by the corresponding cluster averages
    + Use a supervised learner on the reduced problem

       The cluster choice is sub-optimal for regression
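A minimal sketch of such a parcellation with scikit-learn's FeatureAgglomeration (Ward clustering constrained by a spatial connectivity graph); the grid shape and number of clusters are illustrative:

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    from sklearn.feature_extraction.image import grid_to_graph

    shape = (20, 20, 20)                          # toy voxel grid
    connectivity = grid_to_graph(*shape)          # only merge adjacent voxels
    X = np.random.randn(100, np.prod(shape))      # n samples x p voxels

    agglo = FeatureAgglomeration(n_clusters=200, connectivity=connectivity)
    X_reduced = agglo.fit_transform(X)            # cluster averages: (100, 200)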
2 Brain parcellations + sparsity
   Hypothesis: clustering compatible with support(w)




   Benefits of clustering
    Reduced k and p
     ⇒ n > nmin : good side of the “sharp threshold”
    Cluster together correlated features
     ⇒ Improves RIP-like conditions

         Recovery possible on reduced features
2 Randomized parcellations + sparsity

                                 Randomization
                                 + Stability scores


                                  Marginalize the
                                  cluster choice

                                  Relaxes mutual
                                  incoherence
                                  requirement



2 Algorithm
 1 set the number of clusters and the sparsity by cross-validation

 2 loop: randomly perturb the data

 3     cluster to form reduced features

 4     fit a sparse linear model on the reduced features

 5     accumulate the non-zero features

 6 threshold the map of selection counts
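A condensed sketch of steps 2-6, reusing scikit-learn's FeatureAgglomeration and Lasso; the number of clusters, the penalty and the number of runs are illustrative, not the cross-validated values of [Varoquaux ICML 2012]:

    import numpy as np
    from sklearn.cluster import FeatureAgglomeration
    from sklearn.linear_model import Lasso

    def randomized_clustered_sparsity(X, y, connectivity, n_clusters=200,
                                      alpha=0.1, n_runs=50, seed=0):
        """Selection scores per voxel from randomized clustering + sparsity."""
        rng = np.random.RandomState(seed)
        n, p = X.shape
        counts = np.zeros(p)
        for _ in range(n_runs):
            rows = rng.choice(n, n // 2, replace=False)       # 2: perturb the data
            agglo = FeatureAgglomeration(n_clusters=n_clusters,
                                         connectivity=connectivity)
            X_red = agglo.fit_transform(X[rows])              # 3: reduced features
            lasso = Lasso(alpha=alpha).fit(X_red, y[rows])    # 4: sparse model
            selected = (lasso.coef_ != 0)[agglo.labels_]      # clusters -> voxels
            counts += selected                                # 5: accumulate
        return counts / n_runs                                # 6: threshold these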
2 Simulations
     p = 2048, k = 64, n = 256          (nmin > 1000)
     Weights w: patches of varying size
     Design matrix: 2D Gaussian random images of
                    varying smoothness
   Estimators
    Randomized lasso              Our approach
    Elastic Net                   Univariate F test
              Parameters set by cross-validation

   Performance metric
    Recovery seen as a 2-class problem
     ⇒ Report AUC of the precision-recall curve

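The recovery metric can be computed as sketched below: treat the true support as the positive class, rank features by their scores, and take the area under the precision-recall curve (the values here are synthetic):

    import numpy as np
    from sklearn.metrics import precision_recall_curve, auc

    rng = np.random.RandomState(0)
    w_true = np.zeros(2048)
    w_true[:64] = 1.0                                # ground-truth support, k = 64
    scores = np.abs(w_true + 0.5 * rng.randn(2048))  # e.g. stability scores

    precision, recall, _ = precision_recall_curve(w_true != 0, scores)
    print("PR-AUC:", auc(recall, precision))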
2 When can we recover patches?




   Smoothness helps (reduces noise degrees of freedom)
   Small patches are hard to recover

2 What is the best method for patch recovery?




   For small patches: elastic net
   For large patches: randomized-clustered sparsity
   Large patches and very smooth images: F-test
2 Randomizing clusters matters!

                Degenerate family of cluster
                assignments
   Non-random (Ward) clustering is inefficient
   Fully-random clustering performs quite well
   Randomized Ward gives an extra gain
2 fMRI: face vs house discrimination           [Haxby 2001]
   [Brain maps at y=-31, x=17, z=-17 comparing the selected voxels:
    F-scores, ℓ₁ logistic regression, randomized ℓ₁ logistic regression,
    and randomized clustered ℓ₁ logistic regression]
2 Predictive model on selected features
   Object recognition [Haxby 2001]




    Using recovered features improves prediction
Small-sample brain mapping
              Sparse recovery of patches on
              spatially-correlated designs

   Ingredients: clustering + randomization
    ⇒ Reduced feature set compatible with recovery:
        matches the sparsity pattern + recovery conditions

           Compressive sensing questions
     Can we recover k > n in the case of large patches?
     When do we lose sub-linear sample complexity?
3 Having an impact: software
              How do we reach our target
              audience (neuroscientists)?

              How do we disseminate our ideas?

              How do we facilitate new ideas?
3 Python as a scientific environment

     General purpose

     Easy, readable syntax

     Interactive (ipython)

     Great scientific libraries (numpy, scipy, matplotlib...)




3 Growing a software stack

     Code lines are costly
    ⇒ Open source + community driven

      Need quality and impact
       ⇒ Focus on the general-purpose libraries first

       scikit-learn: machine learning in Python
               http://scikit-learn.org
3 scikit-learn: machine learning in Python
Technical choices
  Prefer Python or Cython; focus on readability
  Documentation and examples are paramount
  Little object-oriented design; opt for simplicity
  Prefer algorithms to frameworks
  Code quality: consistency and testing
3 scikit-learn: machine learning in Python
API
  Inputs are numpy arrays
  Learn a model from the data:
   estimator.fit(X_train, y_train)
  Predict using the learned model:
   estimator.predict(X_test)
  Test the goodness of fit:
   estimator.score(X_test, y_test)
  Apply a change of representation:
   estimator.transform(X)
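Putting the API together on a toy dataset (the dataset and estimator here are illustrative, not from the talk):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    estimator = LogisticRegression(max_iter=1000)
    estimator.fit(X_train, y_train)            # learn a model from the data
    print(estimator.predict(X_test)[:5])       # predict with the learned model
    print(estimator.score(X_test, y_test))     # goodness of fit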
3 scikit-learn: machine learning in Python
Computational performance
          scikit-learn mlpy pybrain pymvpa mdp shogun
SVM           5.2       9.47 17.5    11.52 40.48 5.63
LARS         1.17      105.3   -     37.35    -    -
Elastic Net 0.52        73.7   -      1.44    -    -
kNN           0.57      1.41   -     0.56 0.58 1.36
PCA          0.18         -    -      8.93  0.47 0.33
k-Means       1.34      0.79  ∞         -  35.75 0.68
  Algorithms rather than low-level optimization:
  convex optimization + machine learning

  Avoid memory copies
3 scikit-learn: machine learning in Python
Community
  163 contributors since 2008, 397 GitHub forks
  25 contributors in the latest release (3-month span)


Why this success?
  Trendy topic?
  Low barrier to entry
  Friendly and very skilled mailing list
  Credit given to people
3 Research code = software library
              Factor 10 in time investment

     Corner cases in algorithms (numerical stability)

     Multiple platforms and library versions (BLAS, ...)

     Documentation

     Making it simpler (and getting less-educated users)

     User and developer support (∼ 100 mails/day)

                                        Exhausting,
              but has an impact on science and society
3 Research code = software library
              Factor 10 in time investment

     Technical + scientific tradeoffs

     Ease of install / ease of use rather than speed

     Focus on "old science"

     Nice publications and theorems are not
     a recipe for useful code

                                        Exhausting,
              but has an impact on science and society
Statistical learning to study brain function

   Spatial regularization for
   predictive models
   Total variation


    Compressive-sensing approach
    Sparsity + randomized
    clustering for correlated designs


    Machine learning in Python
    Huge impact
                  Post-doc positions available
Bibliography
[Michel TMI 2011] V. Michel, et al., Total variation regularization for
fMRI-based prediction of behaviour, IEEE Transactions on Medical
Imaging (2011)
http://hal.inria.fr/inria-00563468/en
[Varoquaux ICML 2012] G. Varoquaux, A. Gramfort, B. Thirion,
Small-sample brain mapping: sparse recovery on spatially correlated
designs with randomization and clustering, ICML (2012)
http://hal.inria.fr/hal-00705192/en
[Michel Pat Rec 2011] V. Michel, et al., A supervised clustering approach
for fMRI-based inference of brain states, Pattern Recognition (2011)
http://hal.inria.fr/inria-00589201/en
[Pedregosa JMLR 2011] F. Pedregosa, et al., Scikit-learn: machine
learning in Python, JMLR (2011)
http://hal.inria.fr/hal-00650905/en

