Pitfalls of multivariate pattern analysis (MVPA), fMRI

The impact of study design on pattern estimation
for single-trial multivariate pattern analysis
Mumford, Jeanette A., Tyler Davis, and Russell A. Poldrack, 2014, NeuroImage
CLMN Journal Club

2019.03.13

Emily Yunha Shin
A primer on pattern-based approaches to fMRI:
principles, pitfalls, and perspectives
Haynes, John-Dylan, 2015, Neuron

A Primer on Pattern-Based
Approaches to fMRI:
Principles, Pitfalls, and Perspectives
John-Dylan Haynes, 2015, Neuron


The Aim of This Paper
• To provide a concise introduction to the key concepts of
pattern-based analysis
• To present an overview of challenges and limitations in the
interpretation of decoding results, especially with respect to underlying
neural population signals

Analyzing Pattern fMRI Signals
• Eight voxels in visual cortex while a participant is viewing a
picture of a cat (A, red) or dog (B, green)

• In case (C), the two response distributions are separable

• In many cases, the marginal distributions for both conditions are
highly overlapping (D)

• A multivariate solution (E): LDA, SVM

• The weights of the classifier for each voxel can be plotted as a
weight map (H)

• In certain cases, response distributions cannot be sufficiently
partitioned using a single linear decision boundary (F)

• Nonlinear approaches: kNN classifier, nonlinear SVMs

• Often one might be interested in predicting a continuous
variable (G)

• Multivariate regression approaches

• Train the classifier

• Split the data into training and test sets (H)

• Assess whether the classifier can correctly assign the labels
in the test data = classification accuracy

• Repeat using a different partitioning of the data into
training and test sets = cross-validation

• It is absolutely vital that the training and test data are
independent and stationary in order to avoid overfitting
and circular inference.
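The train / test / cross-validation steps on this slide can be sketched in a few lines of Python. This is only a toy illustration, not the analysis from either paper: the nearest-centroid classifier, the two simulated "voxels", and all function names are assumptions made for the example. The simulated trials are independent by construction, so interleaved folds are harmless here; with real fMRI time series the folds would have to respect temporal dependence.

```python
import random


def nearest_centroid_fit(X, y):
    """Return {label: centroid}, computed from the training patterns only."""
    sums, counts = {}, {}
    for pattern, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(pattern))
        for j, value in enumerate(pattern):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(centroids, pattern):
    """Assign the label whose centroid is closest in Euclidean distance."""
    def sq_dist(centroid):
        return sum((p - q) ** 2 for p, q in zip(pattern, centroid))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def cross_validated_accuracy(X, y, n_folds=5):
    """k-fold cross-validation: every trial is tested exactly once."""
    indices = list(range(len(X)))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        centroids = nearest_centroid_fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        correct += sum(
            nearest_centroid_predict(centroids, X[i]) == y[i] for i in test_idx
        )
    return correct / len(X)


# Hypothetical toy data: two voxels whose mean response differs by condition.
rng = random.Random(0)
X, y = [], []
for k in range(100):
    label = "cat" if k % 2 == 0 else "dog"
    shift = 1.0 if label == "cat" else -1.0
    X.append([shift + rng.gauss(0, 1), -shift + rng.gauss(0, 1)])
    y.append(label)

accuracy = cross_validated_accuracy(X, y)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

The key design point is that the centroids are re-estimated inside the loop from the training indices only, so no test trial contributes to the classifier that labels it.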

Interpreting Accuracies
• Underestimating Information
• An absence of information at the level of fMRI does not mean that the local neural
populations do not contain information

• If neurons with different tuning properties were mixed randomly in a salt-and-pepper
fashion, then no macroscopic information would be expected at the voxel level

• A single neuron might contain substantial information that is drowned out by other
neurons only contributing noise
• Overestimating Information
• There are several ways in which an observed accuracy with fMRI might overestimate
the information

• A voxel might sample a large blood vessel that drains a large population of neurons
• = Aggregation of information that is not computationally used at the neural level.

• The low sampling rate of fMRI signals and the sluggishness of the hemodynamic
response

• might temporally integrate information beyond the relevant timescales of neural signal processing
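The salt-and-pepper argument under "Underestimating Information" can be illustrated with a small simulation. It is a deliberately simplified sketch: each hypothetical neuron is assumed to prefer one of two conditions with equal probability, and a voxel is assumed to report the plain average of its neurons. Neither assumption comes from the paper.

```python
import random


def voxel_condition_difference(n_neurons, rng):
    """Absolute cat-minus-dog difference of a voxel that averages its neurons.

    Each neuron responds +1 to its preferred condition and -1 to the other,
    so its own cat-minus-dog difference is +2 or -2.
    """
    preferences = [rng.choice((-1.0, 1.0)) for _ in range(n_neurons)]
    return abs(sum(2.0 * p for p in preferences) / n_neurons)


rng = random.Random(1)
single_neuron = voxel_condition_difference(1, rng)
pooled_voxel = voxel_condition_difference(100_000, rng)
print(single_neuron, pooled_voxel)
```

Every neuron carries a full-sized condition difference, yet the pooled signal shrinks toward zero as more randomly mixed neurons are averaged, which is why a null decoding result at the voxel level says little about the underlying population.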

Interpreting Accuracies
• Comparing Different Brain Regions
• Several factors limit the comparison of accuracies between different brain regions

• The size of the regions

• The sensitivity of fMRI to neural activity (the local hemodynamic response
efficiency)

• The signal-to-noise levels also generally differ between regions

• Other Limitations
• The obtained accuracy depends on the partitioning of data into training and test.

• Less training data generally yields lower accuracies

• Experimental design efficiency

• Block based, trial based, or others?

• The level of temporal aggregation

• ISI?

• Smoothing

Circularity And Overfitting
• Any dependencies between training and test data are likely to cause false-positive
classification of the test data even in the absence of information

• Double dipping: leakage of information between training and test data

• Overfitting phenomenon

• Overfitting can occur if a too-complex classifier is fit to the training dataset that
works well in the training data, but then fails to generalize to the test data.

• Testing the generalizability of a classifier on independent test data thus protects
against overfitting.

• When different classifiers are tried out on the same data …

• Overfitting can only be revealed by testing the accuracy on a further
independent test dataset.

• Nested cross-validation: split the data into training / validation / test sets

• Another solution: decrease the number of free parameters
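A nested cross-validation split can be sketched as follows. The function names and fold counts are assumptions chosen for illustration. The inner validation folds are where different classifiers or parameters would be compared, and the outer test fold is touched only once the choice has been made.

```python
def k_fold_indices(items, k):
    """Partition a list of indices into k interleaved folds."""
    return [items[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=4):
    """Yield (train, validation, test) index lists.

    The outer loop holds out a test fold; the inner loop splits what remains
    into training and validation sets for model selection.
    """
    all_indices = list(range(n_samples))
    for test in k_fold_indices(all_indices, outer_k):
        held_out = set(test)
        remaining = [i for i in all_indices if i not in held_out]
        for validation in k_fold_indices(remaining, inner_k):
            in_validation = set(validation)
            train = [i for i in remaining if i not in in_validation]
            yield train, validation, test


splits = list(nested_cv_splits(20))
train, validation, test = splits[0]
print(len(splits), train, validation, test)
```

Because the three index sets never overlap, trying many classifiers on the validation folds cannot leak into the accuracy reported on the test fold.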

Interpreting Classification Maps
• Weight map
• In a linear classifier such as LDA or SVM, the weight at each voxel directly reflects the
contribution of that voxel to the classification result

• But it does not permit a conclusion as to whether an individual voxel contributed
significantly to the result

• Test whether it makes a significant difference if the voxel is included in the classifier

• A voxel might have a significant weight despite not having label-related information.

• Searchlight analysis
• Depicts the centers of informative voxel clusters, but not the informative voxels themselves.
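The point that a voxel can receive a large weight without carrying label information can be checked with a two-voxel example. The numbers are hypothetical: voxel 1 carries the condition difference plus noise, voxel 2 carries only the same shared noise, and the weights are computed with the standard linear-discriminant formula w = Σ⁻¹(μ_A − μ_B).

```python
def lda_weights_2d(cov, mean_diff):
    """Linear-discriminant weights w = inv(cov) @ mean_diff for two voxels."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inverse = [[d / det, -b / det], [-c / det, a / det]]
    return [
        inverse[0][0] * mean_diff[0] + inverse[0][1] * mean_diff[1],
        inverse[1][0] * mean_diff[0] + inverse[1][1] * mean_diff[1],
    ]


shared_noise_var = 1.0    # noise source seen by both voxels (assumed)
private_noise_var = 0.1   # small independent noise per voxel (assumed)
cov = [
    [shared_noise_var + private_noise_var, shared_noise_var],
    [shared_noise_var, shared_noise_var + private_noise_var],
]
mean_diff = [1.0, 0.0]    # only voxel 1 differs between conditions

weights = lda_weights_2d(cov, mean_diff)
print(weights)
```

Voxel 2 gets a large negative weight because subtracting it cancels the shared noise in voxel 1, not because it distinguishes the conditions, so a weight map alone cannot show where the information is.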

Controlling For Nuisance Variables
• A more detailed control for confounding factors is also necessary.
• Classifiers can extract information even if the sign of an effect
randomly varies across subjects

• Thus, more elaborate controls are needed to ensure that decoding
results do not merely reflect nuisance variables, such as difficulty or
attention

• Solutions

• Regress out the nuisance variable

• Directly compare decoding for nuisance variables and for the
cognitive factor of interest
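Regressing out a nuisance variable amounts to keeping the residuals of a least-squares fit. A minimal single-voxel sketch follows; the difficulty ratings and voxel values are made up for illustration. In a cross-validated analysis the slope would arguably have to be estimated on the training data only, which this sketch does not show.

```python
def residualize(signal, nuisance):
    """Remove the linear effect of one nuisance variable from one voxel."""
    n = len(signal)
    mean_signal = sum(signal) / n
    mean_nuisance = sum(nuisance) / n
    covariance = sum(
        (s - mean_signal) * (z - mean_nuisance) for s, z in zip(signal, nuisance)
    )
    variance = sum((z - mean_nuisance) ** 2 for z in nuisance)
    slope = covariance / variance
    return [
        (s - mean_signal) - slope * (z - mean_nuisance)
        for s, z in zip(signal, nuisance)
    ]


# Hypothetical per-trial difficulty ratings and a voxel that tracks them.
difficulty = [1, 2, 3, 4, 5, 6]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15]
voxel = [0.5 + 2.0 * d + e for d, e in zip(difficulty, noise)]

cleaned = residualize(voxel, difficulty)
print(cleaned)
```

After this step the cleaned signal has no remaining linear relationship with difficulty, so a classifier trained on it cannot exploit that particular confound.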

Extra. Information-based Approach
• What's in a pattern? Examining the type of signal
multivariate analysis uncovers at the group level
• Roee Gilron et al., 2017, NeuroImage

• Second-level multivariate analysis is often “information-based”,
whereas univariate analysis is “activation-based”.

• Information-based: the sign of each individual subject's
effect is discarded and a non-directional summary
statistic is used

• Activation-based: both signal magnitude and sign are
taken into account

• Implicit paradigm shift in signal definition in univariate vs.
multivariate analysis.

• This paper…

• shows that directional and non-directional group-level
MVPA approaches uncover distinct brain regions
with only partial overlap.

• offers a resolution by proposing a multivariate
activation-based statistic.
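The difference between a directional and a non-directional group statistic can be seen with invented per-subject effects whose sign flips across subjects. The values are hypothetical, and the mean absolute effect is only a stand-in for a non-directional summary such as decoding accuracy.

```python
# Hypothetical per-subject effects (e.g. condition A minus B in one voxel).
subject_effects = [0.8, -0.7, 0.9, -0.8, 0.75, -0.85]

# Activation-based (directional): sign is kept, so opposite effects cancel.
directional = sum(subject_effects) / len(subject_effects)

# Information-based (non-directional): sign is discarded before averaging.
non_directional = sum(abs(e) for e in subject_effects) / len(subject_effects)

print(f"directional mean: {directional:.3f}")
print(f"non-directional mean: {non_directional:.3f}")
```

The directional statistic is close to zero while the non-directional one is large, which is the sense in which the two group-level approaches answer different questions.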

Summary
• The approach has to be applied carefully in order to avoid overfitting
of the large parameter spaces involved.

• Caution is also required when interpreting the results of classification
studies in terms of the information encoded in neural populations or in
the tuning of single neurons.

• (Use in the form of encoding models, or in combination with RSA)

The impact of study design on
pattern estimation for single-trial
multivariate pattern analysis
Jeanette A. Mumford et al., 2014, NeuroImage

Highlights
• Assessment of Type I error in pattern similarity and classification
analyses.

• Type I errors of similarity analyses are notably affected by study design.

• Classification analyses are more robust to study design choice.

• The optimal design for pattern similarity is to use between-run-based
patterns.

• The optimal analysis strategy for classification is between-run
cross-validation.

Least Squares All Model
• All trials are estimated simultaneously in a single
model, using a separate regressor consisting of an
impulse (or boxcar) function convolved with a
double gamma hemodynamic response function
(HRF)

• = beta-series regression

• Pitfall: when trials have a short interstimulus
interval (ISI), e.g., less than 3 s between the end
of one stimulus and the onset of the next, the
regressors become highly correlated, or
collinear, which inflates the variance of the
resulting parameter estimates.
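The collinearity pitfall can be made concrete by correlating two single-trial regressors at different onset spacings. The sketch assumes impulse events, so each regressor is just a shifted HRF, and a double-gamma HRF with commonly used default shapes (peak 6, undershoot 16, ratio 1/6). The slide does not give HRF parameters, so these values and the 3 s / 15 s spacings are illustrative choices.

```python
import math


def double_gamma_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1 / 6):
    """Double-gamma HRF evaluated at time t (seconds); zero before onset."""
    if t <= 0:
        return 0.0
    peak = t ** (peak_shape - 1) * math.exp(-t) / math.gamma(peak_shape)
    undershoot = (
        t ** (undershoot_shape - 1) * math.exp(-t) / math.gamma(undershoot_shape)
    )
    return peak - ratio * undershoot


def trial_regressor(onset, duration=60.0, dt=0.1):
    """Regressor for one impulse trial: the HRF shifted to the trial onset."""
    n_samples = int(round(duration / dt))
    return [double_gamma_hrf(i * dt - onset) for i in range(n_samples)]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


first_trial = trial_regressor(0.0)
corr_short = pearson(first_trial, trial_regressor(3.0))   # onsets 3 s apart
corr_long = pearson(first_trial, trial_regressor(15.0))   # onsets 15 s apart
print(f"3 s apart: {corr_short:.2f}, 15 s apart: {corr_long:.2f}")
```

Regressors only a few seconds apart overlap heavily and are strongly positively correlated, which is what inflates the variance of the LSA estimates.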

Least Squares Single Model
• The LSS model reduces collinearity by using
a separate model for each trial, in which the
first regressor models the trial of interest and
the other two regressors model the remaining
trials according to trial type

• Only the first parameter estimate is retained in
each model and estimates the activation for
that individual trial

• LSS has been shown to produce higher
classification accuracies than LSA for short
ISIs (3–5 s)
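The construction of one LSS model can be sketched as below. The function name and the dictionary layout are assumptions for illustration. Each input regressor is taken to be an already convolved single-trial regressor, and identity-like toy vectors stand in for real ones so the result is easy to read.

```python
def lss_design(trial_regressors, trial_types, trial_of_interest):
    """Build the columns of the LSS model for one trial.

    Column "target" models the trial of interest on its own; the remaining
    trials are summed into one column per trial type.
    """
    n_timepoints = len(trial_regressors[0])
    columns = {"target": list(trial_regressors[trial_of_interest])}
    for index, (regressor, trial_type) in enumerate(
        zip(trial_regressors, trial_types)
    ):
        if index == trial_of_interest:
            continue
        column = columns.setdefault("other_" + trial_type, [0.0] * n_timepoints)
        for t, value in enumerate(regressor):
            column[t] += value
    return columns


# Toy example: four trials, each "regressor" active at a single time point.
trial_regressors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
trial_types = ["t1", "t2", "t1", "t2"]

design = lss_design(trial_regressors, trial_types, trial_of_interest=0)
print(design)
```

Fitting this three-column model and keeping only the "target" estimate, then repeating with each trial in turn as the trial of interest, gives the full set of LSS estimates.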

Overview: Pattern Similarity (Discussion)
• In the within-run setting,
• Even long ISIs (15 s) do not guarantee independence between the
pattern estimates, and this can drive false-positive differences.

• This holds regardless of which of the three pattern estimators (LSS, LSA, Add6)
is used

• The only way to preserve the Type I error rate within run:

• randomly order the trials, with a different randomization for each
subject

Overview: Pattern Classification (Discussion)
• Within-run CV
• would be susceptible to a peeking bias

• This is especially the case for blocked and alternating trials when the ISI
was only 3 s long

• Between-run CV
• is stable regardless of trial order and is the recommended approach.

• Shorter-ISI studies
• LSS model & between-run CV is overall more advantageous

• without any detriment to the Type I or II error rates!

Derivations
• BOLD time series Y
$$Y = X_{LSA}\beta + \epsilon_Y, \quad \epsilon_Y \sim N(0, V_y) \tag{1}$$

• Trial-specific activations β
$$\beta = \mu + \epsilon_\beta, \quad \epsilon_\beta \sim N(0, V_\beta) \tag{2}$$

• Vβ = true covariance between the trials

• = true representational similarity covariance matrix

• = the pattern similarity correlations can be derived

• Combining (1) and (2),
$$Y = X_{LSA}\mu + X_{LSA}\epsilon_\beta + \epsilon_Y \tag{3}$$

• The variance of Y
$$\mathrm{Var}(Y) = X_{LSA}V_\beta X'_{LSA} + V_y \tag{4}$$
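For readers who want the intermediate step, Eq. (4) follows from Eq. (3) under two assumptions that are implicit in the model rather than spelled out on the slide: μ is fixed, and ε_β and ε_Y are independent.

```latex
\begin{aligned}
\mathrm{Var}(Y)
  &= \mathrm{Var}\left(X_{LSA}\mu + X_{LSA}\epsilon_\beta + \epsilon_Y\right) \\
  &= X_{LSA}\,\mathrm{Var}(\epsilon_\beta)\,X'_{LSA} + \mathrm{Var}(\epsilon_Y) \\
  &= X_{LSA}V_\beta X'_{LSA} + V_y
\end{aligned}
```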

Derivations
• Pattern distribution: LSA
• The trial-specific parameter estimates
$$\hat{\beta}_{LSA} = (X'_{LSA}X_{LSA})^{-1}X'_{LSA}Y \tag{5}$$

• The true similarity between the estimated patterns of all pairs of trials, derived from Eqs.
(4), (5)
$$\mathrm{Var}(\hat{\beta}_{LSA}) = (X'_{LSA}X_{LSA})^{-1}X'_{LSA}\,\mathrm{Var}(Y)\,X_{LSA}(X'_{LSA}X_{LSA})^{-1} = V_\beta + (X'_{LSA}X_{LSA})^{-1}X'_{LSA}V_yX_{LSA}(X'_{LSA}X_{LSA})^{-1} \tag{6}$$

• In the special case where the BOLD time series are uncorrelated, $V_y = \sigma_y^2 I$, where $\sigma_y^2$
is the variance and $I$ is an $N_{tpts} \times N_{tpts}$ identity matrix, this estimated variance reduces to
$$\mathrm{Var}(\hat{\beta}_{LSA}) = V_\beta + \sigma_y^2 (X'_{LSA}X_{LSA})^{-1} \tag{7}$$
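The simplification inside Eq. (6), and the reason positively correlated regressors produce negatively correlated estimates, can be written out explicitly. The two-regressor case is a hypothetical illustration that assumes unit-norm regressors with inner product r and uncorrelated noise; it is not taken from the paper.

```latex
% Substituting Var(Y) from Eq. (4), writing X for X_{LSA}:
(X'X)^{-1}X'\left(X V_\beta X' + V_y\right)X(X'X)^{-1}
  = V_\beta + (X'X)^{-1}X'V_yX(X'X)^{-1},
\qquad \text{because } (X'X)^{-1}X'X = I.

% Two unit-norm regressors with inner product r, and V_y = \sigma_y^2 I:
X'X = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}
\;\Longrightarrow\;
\sigma_y^2 (X'X)^{-1}
  = \frac{\sigma_y^2}{1 - r^2}
    \begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix}
```

For r > 0 the off-diagonal term is negative, so the two estimates are pushed in opposite directions even when Vβ is the identity.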

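Derivations
• Pattern distribution: LSS
• The estimate for trial i (the first parameter of the i-th LSS model)
$$\hat{\beta}_{LSS_i,1} = c\,(X'_{LSS_i}X_{LSS_i})^{-1}X'_{LSS_i}Y \tag{8}$$
• where c is the row vector [1, 0, 0] (one entry per regressor)

• All LSS-based trial estimates can simultaneously be estimated using
$$\hat{\beta}_{LSS} = X_{LSS}Y \tag{9}$$

• where
$$X_{LSS} = \begin{bmatrix} c\,(X'_{LSS_1}X_{LSS_1})^{-1}X'_{LSS_1} \\ c\,(X'_{LSS_2}X_{LSS_2})^{-1}X'_{LSS_2} \\ \vdots \\ c\,(X'_{LSS_{N_{trials}}}X_{LSS_{N_{trials}}})^{-1}X'_{LSS_{N_{trials}}} \end{bmatrix} \tag{10}$$

• Combining this with the variance of Y given in (4) yields
$$\mathrm{Var}(\hat{\beta}_{LSS}) = X_{LSS}\,\mathrm{Var}(Y)\,X'_{LSS} = X_{LSS}X_{LSA}V_\beta X'_{LSA}X'_{LSS} + X_{LSS}V_yX'_{LSS} \tag{11}$$

• In the special case where $V_y = \sigma_y^2 I$,
$$\mathrm{Var}(\hat{\beta}_{LSS}) = X_{LSS}X_{LSA}V_\beta X'_{LSA}X'_{LSS} + \sigma_y^2 X_{LSS}X'_{LSS} \tag{12}$$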

Methods: Pattern Similarity (within)
• Experiment design
• With / without temporal autocorrelation: Eqs. (6), (11) or Eqs. (7), (12)

• 2 trial types: t1, t2

• Number of trials: 22 or 42

• Different lengths of ISI: mean 3 s (2–5 s), 7 s (6–9 s), (+15 s)

• σy² = 1

• 225 time points (TR = 2 s)

• Trial orderings: blocked, alternating (t1, t2, t1, t2, …), random order

• Hypothesis
• Whether within-trial-type similarities (wt1, wt2) differ from each other or from
between-trial-type similarities (bt1t2)

• Paired t-test: wt1-wt2, wt1-bt1t2, wt2-bt1t2
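The three similarity summaries and the paired test can be sketched as follows. The correlation matrix, the labels, and the values passed to the t statistic are invented for illustration; in the study each subject contributes one wt1, wt2 and bt1t2 value, and the paired test is run across subjects.

```python
import math


def similarity_summaries(corr, labels):
    """Average the trial-by-trial correlations into wt1, wt2 and bt1t2."""
    pairs = {"wt1": [], "wt2": [], "bt1t2": []}
    n_trials = len(labels)
    for i in range(n_trials):
        for j in range(i + 1, n_trials):
            if labels[i] != labels[j]:
                key = "bt1t2"
            elif labels[i] == "t1":
                key = "wt1"
            else:
                key = "wt2"
            pairs[key].append(corr[i][j])
    return {key: sum(values) / len(values) for key, values in pairs.items()}


def paired_t_statistic(a, b):
    """t statistic of the paired differences a - b (one value per subject)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)


# Hypothetical 4-trial similarity matrix for one subject.
labels = ["t1", "t1", "t2", "t2"]
corr = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.25],
    [0.0, 0.0, 0.25, 1.0],
]
summary = similarity_summaries(corr, labels)

# Hypothetical per-subject values, only to exercise the test statistic.
t_value = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
print(summary, round(t_value, 3))
```

Under the null used in the simulations (Vβ equal to the identity) all three summaries should be equal on average, so any systematic difference picked up by the paired test is a Type I error.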

Methods: Pattern Similarity (within)
• Simulated data parameters
• Temporal covariance estimates Vy are based on real data with 225 time points
(TR = 2s)

• estimated from 198 resting state data sets for the same ROI, a randomly
chosen 7 × 7 × 7 voxel cube in standard MNI space (Right Putamen)

• For each simulated subject, the true similarity Vβ was assumed to be the identity matrix

• A design matrix was randomly generated

• Eqs. (7) and (12) were used to compute the similarity matrices

• 10,000 data sets of 30 subjects were randomly generated,

• using a different set of ISIs and randomly ordered trials, when applicable, for
each subject.

• An additional simulation: pseudorandom order

Methods: Pattern Similarity (between)
• The Type I error rates when similarities were computed between-run
• Eq. (2) was used to simulate activation magnitudes for 500 voxels

• The values for β were used to simulate time series following Eq. (1)

• The true similarity covariance, Vβ, was set to the identity matrix

• The mean trial activation, µ, was set to a vector of zeros

• Temporal covariances (Vy) were derived from resting-state data

Methods: Classification (within & between)
• Simulated data parameters
• Mean ISI: 3s, 7s

• Trial orders: blocked, alternating, random

• Data for 1000 subjects were generated

• Additional pseudorandom ordering test in within-run CV

• based on a data set of 30 subjects

• Classification options
• SVM classifier (cost = 1)

• 2-fold CV

• (within, random split; between, grouped)

• WR: within-run

• BR(Same): between-run & same ISIs and stimulus order

• BR(Diff): between-run & different ISIs and stimulus order
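The "between, grouped" option corresponds to splitting by run, so that no run contributes trials to both the training and the test set. A minimal sketch with two hypothetical runs follows; the function name and run labels are assumptions.

```python
def between_run_folds(run_labels):
    """Yield (train, test) index lists with one whole run held out per fold."""
    for held_out_run in sorted(set(run_labels)):
        test = [i for i, run in enumerate(run_labels) if run == held_out_run]
        train = [i for i, run in enumerate(run_labels) if run != held_out_run]
        yield train, test


# Six trials: the first three from run 1, the last three from run 2.
run_labels = [1, 1, 1, 2, 2, 2]
folds = list(between_run_folds(run_labels))
print(folds)
```

With two runs this is exactly a 2-fold cross-validation in which training and test trials never share a time series, in contrast to a random within-run split.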

Results: Pattern Similarity (within)
• Fig 2. Impact of collinearity in LSS and LSA models on similarity
estimates
• Blocked design

• Patterns of LSA
• At lag 1, one parameter estimate will be elevated and, to preserve the model fit, the
collinear counterparts' parameter estimates will be pushed in the opposite direction

• At lag 2, the two trials will be pushed in the same direction by their common collinear
neighbor, causing a positive correlation.

• Patterns of LSS
• Two blocks along the diagonal:

• a weak collinearity occurs if the neighbors of a trial of interest are exemplars of the
same category

• With blocked trial order, almost all trials of t1 have t1 neighbors,

• which results in negatively biased similarity estimates between trials of the same category
• Strong positive correlations for early lags:

• A result of each trial's estimate coming from an independent model
• This weak effect will be shown to have a smaller, but opposite, impact on pattern
similarities.

• Fig 3. Distributions of paired similarity differences (30 subjects)
• Panels: Eqs. (7), (12) (without temporal autocorrelation) and Eqs. (6), (11) (with temporal autocorrelation)

Results: Pattern Similarity (within & between)
• Table 1. Type I error rates across simulations
• Table 2. Pseudorandom

Results: Classification
• Fig 4. Classification accuracy distributions

Results: Classification
• Classification accuracy distributions for the pseudorandom ordering

Discussion
• No observed benefit with increased ISI
• Even with long ISIs there is an impact of trial order on pattern similarity
estimates.

• Effects driven by temporal autocorrelation occur because time points that are
closer tend to be more highly correlated

• In the case of the blocked trials…

• most between-trial-type pairs are very far apart in time, whereas the within-trial-type pairs
are at smaller lags

• This is why the wt1–bt1t2 and wt2–bt1t2 distributions tend to be significantly larger than 0

• Generally, at small ISIs the different pattern estimators suffer from collinearity,
driven by positively correlated regressors in the models

• Increasing the ISI alleviates collinearity, yet there will always be a
slight negative correlation between regressors…

• a weak effect, but enough to drive biases

Discussion
• Benefits of between-run similarity analyses
• The inflation of Type I error rates that arose in the within-run similarity
analyses was driven by correlations,

• either between covariates in the model used to estimate the patterns, or
the temporal covariance

• The time series from two different runs are completely independent from each
other

• The simulations used temporal covariance estimates from different subjects

• in place of temporal covariance estimates from two runs of the same
subject

• Since temporal covariance for small differences in time seems to have the
largest impact, it seems that two runs, which would typically have a couple
of minutes between them, would not be problematic…

Discussion
• Why and when within-run CV fails
• Within-run cross validations are especially problematic for the blocked
trial order and somewhat for alternating trial orders

• Randomly ordered trials seem to perform fine in the within-run CV
setting

• The results vary with study design because of
different levels of peeking bias

• In the alternating case, the trials of the same class are always
separated by at least 1 other trial, so the relationship is weaker

Discussion
• Add6 model
• Intuitively, with the Add6 approach, one might think that there will not
be a model-based effect, since a model is not necessary to extract the
patterns

• However, the results of the Add6 model are very similar to LSA at a long
ISI of 15 s

Discussion
• Impact on future study designs
• It seems that using multiple, shorter runs would be more advantageous
than using longer runs (with between-run analysis)

• When one wants to capture different levels of learning,

• This would likely require both shorter runs and tasks where
learning occurs slowly enough that ceiling is not immediately
reached
