Sparse statistical modelling
Tom Bartlett
Introduction
‘A sparse statistical model is one having only a small number of
nonzero parameters or weights.’[1]
The number of features or variables measured on a person or object
can be very large (e.g., expression levels of ∼ 30000 genes)
These measurements are often highly correlated, i.e., contain much
redundant information
This scenario is particularly relevant in the age of ‘big-data’
[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.
Outline
Sparse linear models
Sparse PCA
Sparse SVD
Sparse CCA
Sparse LDA
Sparse clustering
Sparse linear models
A linear model can be written as
$$y_i = \alpha + \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i = \alpha + x_i^\top\beta + \epsilon_i, \quad i = 1, \dots, n$$
Hence, the model can be fit by minimising the objective function
$$\underset{\alpha,\,\beta}{\text{minimise}} \left\{ \sum_{i=1}^{N} \left(y_i - \alpha - x_i^\top\beta\right)^2 \right\}$$
Adding a penalisation term to the objective function makes the
solution more sparse:
$$\underset{\alpha,\,\beta}{\text{minimise}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \alpha - x_i^\top\beta\right)^2 + \lambda\|\beta\|_q^q \right\}, \quad \text{where } q = 1 \text{ or } 2.$$
Sparse linear models
The penalty term $\lambda\|\beta\|_q^q$ means that only the bare minimum is used of all the information available in the $p$ predictor variables $x_{ij}$, $j = 1, \dots, p$.
$$\underset{\alpha,\,\beta}{\text{minimise}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \alpha - x_i^\top\beta\right)^2 + \lambda\|\beta\|_q^q \right\}$$
$q$ is typically chosen as $q = 1$ or $q = 2$, because these produce convex optimisation problems and hence are computationally much nicer!
q = 1 is called the ‘lasso’; it tends to set as many elements of β as
possible to zero
q = 2 is called ‘ridge regression’, and it tends to minimise the size of
all the elements of β
Penalisation is equally applicable to other types of linear models:
logistic regression, generalised linear models etc
Sparse linear models - simple example
[Figure: coefficient paths for the lasso (left) and ridge regression (right), showing the five coefficients (hs, college, college4, not-hs, funding) plotted against the shrinkage factors $\|\hat\beta\|_1/\|\tilde\beta\|_1$ and $\|\hat\beta\|_2/\|\tilde\beta\|_2$ respectively.]
Crime-rate modelled according to 5 predictors: annual police funding in
dollars per resident (funding), percent of people 25 years and older with
four years of high school (hs), percent of 16- to 19-year olds not in high
school and not high school graduates (not-hs), percent of 18- to 24-year
olds in college (college), and percent of people 25 years and older with at
least four years of college (college4).
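A minimal sketch of how the two panels above could be produced with glmnet; the crime data frame below is a random stand-in, so the variable names are hypothetical:

```r
library(glmnet)

# Hypothetical stand-in for the crime data described above
set.seed(1)
crime <- data.frame(crime.rate = rnorm(50), funding = rnorm(50), hs = rnorm(50),
                    not.hs = rnorm(50), college = rnorm(50), college4 = rnorm(50))

x <- as.matrix(crime[, c("funding", "hs", "not.hs", "college", "college4")])
y <- crime$crime.rate

fit_lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso (L1) penalty
fit_ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge (L2) penalty

par(mfrow = c(1, 2))
plot(fit_lasso, xvar = "norm", label = TRUE)   # coefficient paths, as in the figure
plot(fit_ridge, xvar = "norm", label = TRUE)
```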
Sparse linear models - genomics example
Gene expression data, for $p = 17280$ genes, for $n_c = 530$ cancer samples + $n_h = 61$ healthy tissue samples
Fit logistic (i.e., 2 class, cancer/healthy) lasso model using the R
package glmnet, selecting λ by cross-validation
Out of 17280 possible genes for prediction, lasso chooses just these
25 (shown with their fitted model coefficients)
ADAMTS5 -0.0666 HPD -0.00679 NUP210 0.00582
ADH4 -0.165 HS3ST4 -0.0863 PAFAH1B3 0.297
CA4 -0.151 IGSF10 -0.356 TACC3 0.128
CCDC36 -0.335 LRRTM2 -0.0711 TESC -0.0568
CDH12 -0.253 LRRC3B -0.211 TRPM3 -1.24
CES1 -0.302 MEG3 -0.022 TSLP -0.0841
COL10A1 0.747 MMP11 0.22 WDR51A 0.0722
DPP6 -0.107 NUAK2 0.0354 WISP1 0.14
HHATL -0.0665
Caveat: these are not necessarily the only ‘predictive’ genes. If we
removed these genes from the data-set and fitted the model again,
lasso would choose an entirely new set of genes which might be
almost as good at predicting!
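A sketch of how such a fit might be set up with glmnet; the expression matrix and labels below are random stand-ins (with p reduced for brevity), not the real data:

```r
library(glmnet)

# Hypothetical stand-ins: n = 530 + 61 = 591 samples
set.seed(1)
expr   <- matrix(rnorm(591 * 1000), nrow = 591)
status <- factor(rep(c("cancer", "healthy"), times = c(530, 61)))

# Logistic lasso, with lambda selected by cross-validation
cv_fit <- cv.glmnet(expr, status, family = "binomial", alpha = 1)

# The nonzero coefficients at the selected lambda are the chosen genes
beta_hat <- as.matrix(coef(cv_fit, s = "lambda.min"))
rownames(beta_hat)[beta_hat != 0]
```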
Sparse PCA
Ordinary PCA finds $v$ by carrying out the optimisation:
$$\underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top \frac{X^\top X}{n} v \right\},$$
with $X \in \mathbb{R}^{n\times p}$ (i.e., $n$ samples and $p$ variables).
With $p \gg n$, the eigenvectors of the sample covariance matrix $X^\top X/n$ are not necessarily close to those of the population covariance matrix [2].
Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of $v$ are encouraged to be zero, by finding $v$ by carrying out the optimisation:
$$\underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top X^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t.$$
In effect this discards some variables such that p is closer to n.
[2] Iain M Johnstone. "On the distribution of the largest eigenvalue in principal components analysis". In: Annals of Statistics (2001), pp. 295–327.
Sparse SVD
The SVD of a matrix $X \in \mathbb{R}^{n\times p}$, with $n \ge p$, can be expressed as $X = UDV^\top$, where $U \in \mathbb{R}^{n\times p}$ and $V \in \mathbb{R}^{p\times p}$ are orthogonal and $D \in \mathbb{R}^{p\times p}$ is diagonal. The SVD can hence be found by carrying out the optimisation:
$$\underset{U \in \mathbb{R}^{n\times p},\, V \in \mathbb{R}^{p\times p},\, D \in \mathbb{R}^{p\times p}}{\text{minimise}}\ \|X - UDV^\top\|^2.$$
Hence, a sparse SVD with rank $r$ can be obtained by carrying out the optimisation:
$$\underset{U \in \mathbb{R}^{n\times r},\, V \in \mathbb{R}^{p\times r},\, D \in \mathbb{R}^{r\times r}}{\text{minimise}} \left\{ \|X - UDV^\top\|^2 + \lambda_1\|U\|_1 + \lambda_2\|V\|_1 \right\}.$$
This allows SVD to be applied to the $p \gg n$ scenario.
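One implementation of this penalised decomposition is the PMD function in the R package PMA (used again for sparse PCA later in these slides); a sketch of its use, with illustrative penalty values:

```r
library(PMA)

set.seed(1)
x <- matrix(rnorm(50 * 200), nrow = 50)   # stand-in data: n = 50, p = 200

# Rank-2 sparse SVD; sumabsu and sumabsv are the L1 bounds on each column of U and V
out <- PMD(x, type = "standard", K = 2, sumabsu = 4, sumabsv = 4)

out$d                  # the r = 2 singular values
colSums(out$v != 0)    # how many variables each sparse right-singular vector uses
```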
Sparse PCA and SVD - an algorithm
SVD is a generalisation of PCA. Hence, algorithms to solve the SVD
problem can be applied to the PCA problem
The sparse PCA problem can thus be re-formulated as:
$$\underset{\|u\|_2 = \|v\|_2 = 1}{\text{maximise}} \left\{ u^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t,$$
which is biconvex in $u$ and $v$ and can be solved by alternating between the updates:
$$u \leftarrow \frac{Xv}{\|Xv\|_2}, \quad \text{and} \quad v \leftarrow \frac{S_\lambda(X^\top u)}{\|S_\lambda(X^\top u)\|_2}, \qquad (1)$$
where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)(|x| - \lambda)_+$.
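A minimal base-R sketch of these alternating updates for the first sparse factor, assuming $\lambda$ is small enough that the thresholded vector is nonzero:

```r
# Soft-thresholding operator S_lambda(x) = sign(x)(|x| - lambda)_+
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

sparse_rank1 <- function(X, lambda, niter = 50) {
  v <- rnorm(ncol(X)); v <- v / sqrt(sum(v^2))   # random unit-norm start
  for (iter in 1:niter) {
    u <- X %*% v
    u <- u / sqrt(sum(u^2))                      # u <- Xv / ||Xv||_2
    v <- soft_threshold(t(X) %*% u, lambda)      # v <- S_lambda(X'u) ...
    v <- v / sqrt(sum(v^2))                      # ... normalised
  }
  list(u = as.vector(u), v = as.vector(v))
}

# Example: X <- matrix(rnorm(50 * 200), nrow = 50); sparse_rank1(X, lambda = 2)
```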
Sparse PCA - simulation study
Define Σ as a p × p block-diagonal
matrix, with p = 200 and 10 blocks
of 1s of size 20 × 20.
Hence, we would expect there to be 10
independent components of variation
in the corresponding distribution.
Generate n samples x ∼ Normal(0, Σ)
Estimate $\hat\Sigma = \sum_i (x_i - \bar{x})(x_i - \bar{x})^\top / n$
Correlate eigenvectors of $\Sigma$ with eigenvectors of $\hat\Sigma$
Repeat 100 times for each
different value of n
[Figure: mean eigenvector correlation for the top 10 PCs, plotted against $n/p$.]
The plot shows the means of
these correlations over the
100 repetitions for different
values of n.
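A sketch of one repetition of this simulation (the full study repeats this 100 times for each value of n):

```r
library(MASS)

p <- 200
Sigma  <- kronecker(diag(10), matrix(1, 20, 20))   # 10 blocks of 1s, each 20 x 20
ev_pop <- eigen(Sigma)$vectors[, 1:10]             # top 10 population eigenvectors

n <- 100                                           # e.g. n/p = 0.5
x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
Sigma_hat <- crossprod(scale(x, center = TRUE, scale = FALSE)) / n
ev_sam <- eigen(Sigma_hat)$vectors[, 1:10]         # top 10 sample eigenvectors

# For each population eigenvector, take the best-matching sample eigenvector
# (eigenvectors are only identified up to sign and ordering), then average
mean(apply(abs(cor(ev_pop, ev_sam)), 1, max))
```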
Sparse PCA - simulation study
An implementation of sparse PCA is available in the R package PMA as the function SPC. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].
I applied this function to the same simulation as described in the previous slide.
I applied this function to the same
simulation as described in the
previous slide.
The scale of the penalisation is in terms of $\|u\|_1$, with $\|u\|_1 = \sqrt{p}$ giving the minimum level of penalisation (no sparsity) and $\|u\|_1 = 1$ the maximum.
[Figure: mean eigenvector correlation for the top 10 PCs, plotted against $n/p$.]
The plot shows the result with $\|u\|_1 = \sqrt{p}$.
[3] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". In: Biostatistics (2009), kxp008.
Sparse PCA - simulation study
[Figure: mean eigenvector correlation for the top 10 PCs, plotted against $n/p$.]
The plot shows the result with $\|u\|_1 = \sqrt{p}/2$.
[Figure: mean eigenvector correlation for the top 10 PCs, plotted against $n/p$.]
The plot shows the result with $\|u\|_1 = \sqrt{p}/3$.
Sparse PCA - real data example
I carried out PCA on expression levels
of 10138 genes in individual cells
from developing brains
There are many different cell types in
the data - some mature, some
immature, and some in between
Different cell-types are characterised by
different gene expression profiles
We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions
The plot shows the cells
plotted in terms of the top
three (standard) PCA
components.
Sparse PCA - real data example
The plot shows the cells in terms of the top three sparse PCA components, with $\|u\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation).
The plot shows the cells in terms of the top three sparse PCA components, with $\|u\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).
Sparse CCA
In CCA, the aim is to find coefficient vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ which project the data-matrices $X \in \mathbb{R}^{n\times p}$ and $Y \in \mathbb{R}^{n\times q}$ so as to maximise the correlations between these projections.
Whereas PCA aims to find the 'direction' of maximum variance in a single data-matrix, CCA aims to find the 'directions' in the two data-matrices in which the variances best explain each other.
The CCA problem can be solved by carrying out the optimisation:
$$\underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}}\ \mathrm{Cor}(Xu, Yv)$$
This problem is not well posed for $n \ll \max(p, q)$, in which case $u$ and $v$ can be found which trivially give $\mathrm{Cor}(Xu, Yv) = 1$.
Sparse CCA solves this problem by carrying out the optimisation:
$$\underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}}\ \mathrm{Cor}(Xu, Yv), \quad \text{subject to } \|u\|_1 \le t_1 \text{ and } \|v\|_1 \le t_2.$$
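The CCA function in the PMA package implements this penalised form; a sketch with illustrative penalties and random stand-in data:

```r
library(PMA)

set.seed(1)
X <- matrix(rnorm(100 * 500), nrow = 100)   # stand-in: n = 100, p = 500
Y <- matrix(rnorm(100 * 50), nrow = 100)    # stand-in: q = 50

out <- CCA(X, Y, typex = "standard", typez = "standard",
           penaltyx = 0.3, penaltyz = 0.3,  # in (0, 1]: smaller means sparser u, v
           K = 1)                           # first pair of canonical vectors

cor(X %*% out$u, Y %*% out$v)               # the achieved canonical correlation
```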
Sparse CCA - real data example
‘Cell cycle’ is a biological process
involved in the replication of cells
Cell-cycle can be thought of as a latent
process which is not directly
observable in genomics data
It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred
It has an effect on the expression of very
many genes: hence it can also tend
to act as a confounding factor when
modelling many other biological
processes
Used CCA here as an
exploratory tool, with Y the
data for the cell cycle genes,
and X the data for all the
other genes.
Sparse LDA
LDA assigns item $i$ to a group $G$ based on a corresponding data-vector $x_i$, according to the posterior probability:
$$P(G = k \mid x_i) = \frac{\pi_k f_k(x_i)}{\sum_{l=1}^{K} \pi_l f_l(x_i)}, \quad \text{with}$$
$$f_k(x_i) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\},$$
with prior $\pi_k$ and mean $\mu_k$ for group $k$, and covariance $\Sigma$.
This assignment takes place by constructing ‘decision boundaries’
between classes k and l:
$$\log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log\frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1}(\mu_k - \mu_l) - \frac{1}{2}(\mu_k + \mu_l)^\top \Sigma^{-1}(\mu_k - \mu_l)$$
Because this boundary is linear in $x_i$, we get the name LDA.
Sparse LDA
The decision boundary
$$\log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log\frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1}(\mu_k - \mu_l) - \frac{1}{2}(\mu_k + \mu_l)^\top \Sigma^{-1}(\mu_k - \mu_l)$$
then naturally leads to the decision rule:
$$G(x_i) = \underset{k}{\text{argmax}} \left\{ \log \pi_k + x_i^\top \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k \right\}.$$
By assuming Σ is diagonal, i.e., there is no covariance between the p
dimensions, this decision rule can be reduced to the nearest centroids
classifier:
$$G(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - \log \pi_k \right\}.$$
Typically, $\Sigma$ (or the $\sigma_j$) is estimated from the data as $\hat\Sigma$ (or $\hat\sigma_j$), and the $\mu_k$ are estimated as $\hat\mu_k$ whilst training the classifier.
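A base-R sketch of this nearest centroids rule (using, for simplicity, overall rather than pooled within-class variances):

```r
nearest_centroid <- function(x, y, x_new) {
  classes <- levels(y)
  pi_k    <- table(y) / length(y)               # class priors
  mu      <- t(sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])))
  sigma2  <- apply(x, 2, var)                   # per-feature variances
  scores  <- sapply(classes, function(k)
    sum((x_new - mu[k, ])^2 / sigma2) - log(pi_k[[k]]))
  classes[which.min(scores)]                    # the argmin over groups k
}

# Example: x an n x p matrix, y a factor of labels, x_new a length-p vector
```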
Sparse LDA
The nearest centroids classifier
$$\hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - \log \pi_k \right\}$$
will typically use all p variables. This is often unnecessary and can
lead to overfitting in high-dimensional contexts. The nearest shrunken
centroids classifier deals with this issue.
Define $\hat\mu_k = \bar{x} + \alpha_k$, where $\bar{x}$ is the data-mean across all classes, and $\alpha_k$ is the class-specific deviation of the mean from $\bar{x}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
$$\underset{\alpha_k \in \mathbb{R}^p,\, k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \frac{\sqrt{n_k}}{\hat\sigma_j^2} |\alpha_{jk}| \right\},$$
where $C_k$ and $n_k$ are the set and number of samples in group $k$.
Sparse LDA
Hence, the $\hat\alpha_k$ estimated from the optimisation
$$\underset{\alpha_k \in \mathbb{R}^p,\, k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \frac{\sqrt{n_k}}{\hat\sigma_j^2} |\alpha_{jk}| \right\}$$
can be used to estimate the shrunken centroids $\hat\mu_k = \bar{x} + \hat\alpha_k$, thus training the classifier:
$$\hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - \log \pi_k \right\}.$$
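The optimisation separates over the coordinates $(j, k)$, and each $\hat\alpha_{jk}$ is obtained by soft-thresholding the observed class deviation; a sketch of that closed-form solution (under the objective as written above, the $\hat\sigma_j^2$ factors cancel in the coordinate-wise threshold):

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

shrunken_centroids <- function(x, y, lambda) {
  n    <- nrow(x)
  xbar <- colMeans(x)                                    # overall mean
  t(sapply(levels(y), function(k) {
    nk    <- sum(y == k)
    delta <- colMeans(x[y == k, , drop = FALSE]) - xbar  # class deviation from xbar
    xbar + soft_threshold(delta, lambda * n / sqrt(nk))  # shrunken centroid mu_k
  }))
}

# Example: x an n x p matrix, y a factor of labels; returns a K x p matrix of centroids
```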
Sparse LDA - real data example
I applied nearest (shrunken) centroids to
expression data for 14349 genes, for
347 cells of different types:
leukocytes (54); lymphoblastic cells
(88); fetal brain cells (16wk, 26;
21wk, 24); fibroblasts (37); ductal
carcinoma (22); keratinocytes (40);
B lymphoblasts (17); iPS cells (24);
neural progenitors (15).
Used R packages MASS and pamr [4].
Carried out 100 repetitions of 3-fold
CV. Plots show normalised mutual
information (NMI), adjusted Rand
index (ARI) and prediction accuracy.
[Figure: NMI, ARI and prediction accuracy plotted against the sparsity threshold, shown as quantiles (100%, 75%, 50%, 25%, 0%) over the 300 predictions, for sparse LDA and for regular LDA.]
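A sketch of the pamr workflow used here, on random stand-in data (note pamr expects x as a p x n matrix, with genes in rows):

```r
library(pamr)

set.seed(1)
expr     <- matrix(rnorm(1000 * 347), nrow = 1000)   # stand-in: 1000 genes x 347 cells
celltype <- factor(sample(paste0("type", 1:9), 347, replace = TRUE))

data  <- list(x = expr, y = celltype)
fit   <- pamr.train(data)        # fits over a grid of shrinkage thresholds
cvfit <- pamr.cv(fit, data)      # cross-validated error at each threshold
pamr.plotcv(cvfit)

yhat <- pamr.predict(fit, newx = expr, threshold = 2)   # predict at a chosen threshold
```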
[4] Robert Tibshirani et al. "Class prediction by nearest shrunken centroids, with applications to DNA microarrays". In: Statistical Science (2003), pp. 104–117.
Sparse clustering
Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples $i$ and $i'$.
One popular choice of dissimilarity measure is the Euclidean distance.
In high dimensions, it is often unnecessary to use information from all of the $p$ dimensions.
A weighted dissimilarity measure $\tilde{D}_{i,i'} = \sum_{j=1}^{p} w_j d_{i,i',j}$ can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:
$$\underset{u \in \mathbb{R}^{n^2},\, w \in \mathbb{R}^p}{\text{maximise}}\ u^\top \Delta w, \quad \text{subject to } \|u\|_2 \le 1,\ \|w\|_2 \le 1,\ \|w\|_1 \le t,\ \text{and } w_j \ge 0,\ j \in \{1,\dots,p\},$$
where $w$ is the vector of the weights $w_j$, $j \in \{1,\dots,p\}$, and $\Delta \in \mathbb{R}^{n^2 \times p}$ holds the dissimilarity components, arranged such that each row of $\Delta$ corresponds to the $d_{i,i',j}$, $j \in \{1,\dots,p\}$, for a pair of samples $i, i'$.
This weighted dissimilarity measure can then be used for sparse
clustering, such as sparse hierarchical clustering.
Sparse clustering
Some clustering methods, such as K-means, need a slightly modified
approach.
K-means seeks to minimise the within-cluster sum of squares
$$\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|_2^2 = \sum_{k=1}^{K} \frac{1}{2n_k} \sum_{i,i' \in C_k} \|x_i - x_{i'}\|_2^2,$$
where $C_k$ is the set of samples in cluster $k$, $n_k = |C_k|$, and $\bar{x}_k$ is the corresponding centroid.
Hence, a weighted K-means could proceed according to the
optimisation:
$$\underset{w \in \mathbb{R}^p}{\text{minimise}} \left\{ \sum_{j=1}^{p} w_j \left( \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right) \right\},$$
where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster $k$.
Sparse clustering
However, for the optimisation
$$\underset{w \in \mathbb{R}^p}{\text{minimise}} \left\{ \sum_{j=1}^{p} w_j \left( \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right) \right\},$$
it is not possible to choose a set of constraints which guarantee a
non-pathological solution as well as convexity.
Instead, the between-cluster sum of squares can be maximised:
$$\underset{w \in \mathbb{R}^p}{\text{maximise}} \left\{ \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right) \right\}$$
subject to $\|w\|_2 \le 1$, $\|w\|_1 \le t$, and $w_j \ge 0$, $j \in \{1,\dots,p\}$.
Sparse clustering - real data examples
Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl [5] for
the sparse clustering. Plots
show normalised mutual
information (NMI) and
adjusted Rand index (ARI)
comparing sparse with
standard clustering.
[Figure: NMI and ARI plotted against the L1 bound, comparing sparse hierarchical clustering with standard hierarchical clustering.]
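A sketch of how sparse hierarchical clustering runs in sparcl, with an illustrative L1 bound (the package also provides HierarchicalSparseCluster.permute for choosing it):

```r
library(sparcl)

set.seed(1)
x <- matrix(rnorm(347 * 1000), nrow = 347)   # stand-in: n = 347 cells, p = 1000 genes

out <- HierarchicalSparseCluster(x = x, wbound = 10,   # wbound = L1 bound on w
                                 method = "complete",
                                 dissimilarity = "squared.distance")

clusters <- cutree(out$hc, k = 9)   # out$hc is a standard hclust object
sum(out$ws != 0)                    # number of genes given nonzero weight
```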
[5] Daniela M Witten and Robert Tibshirani. "A framework for feature selection in clustering". In: Journal of the American Statistical Association (2012).
Sparse clustering - real data examples
Applied (sparse) k-means to
the same benchmark
expression data-set (14349
genes, for 347 cells of
different types).
Used R package sparcl for the
sparse clustering. Plots
show normalised mutual
information (NMI) and
adjusted Rand index (ARI)
comparing sparse with
standard clustering.
[Figure: NMI and ARI plotted against the L1 bound, comparing sparse k-means with standard k-means.]
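A sketch of sparse k-means in sparcl, one fit per candidate L1 bound (KMeansSparseCluster.permute can be used to select among them):

```r
library(sparcl)

set.seed(1)
x <- matrix(rnorm(347 * 1000), nrow = 347)   # stand-in data as before

# One result per value in wbounds: cluster labels Cs and feature weights ws
out <- KMeansSparseCluster(x, K = 9, wbounds = c(2, 10, 50))

table(out[[3]]$Cs)      # cluster sizes with the largest L1 bound
sum(out[[3]]$ws != 0)   # genes given nonzero weight
```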
Sparse clustering - real data examples
Spectral clustering essentially uses k-means clustering (or similar) in dimensionally-reduced (e.g., PCA) space.
Applied standard k-means in
sparse-PCA space to the
same benchmark expression
data-set (14349 genes, for
347 cells of different types).
Offers computational
advantages, running in 9
seconds on a 2.8GHz
Macbook, compared with
19 seconds for standard
k-means, and 35 seconds
for sparse k-means.
[Figure: NMI and ARI plotted against the L1 bound divided by $\sqrt{n}$, comparing sparse spectral k-means with standard k-means.]
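A sketch of this sparse-spectral approach, using SPC from the PMA package for the sparse PCA step and then standard k-means on the projections (the L1 bound is illustrative):

```r
library(PMA)

set.seed(1)
x <- matrix(rnorm(347 * 1000), nrow = 347)   # stand-in data as before

out    <- SPC(x, sumabsv = 0.5 * sqrt(ncol(x)), K = 3)   # 3 sparse components
scores <- out$u %*% diag(out$d)              # cells projected into sparse-PCA space

clusters <- kmeans(scores, centers = 9, nstart = 20)$cluster
```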