2. Introduction
‘A sparse statistical model is one having only a small number of
nonzero parameters or weights.’[1]
The number of features or variables measured on a person or object
can be very large (e.g., expression levels of ∼ 30000 genes)
These measurements are often highly correlated, i.e., contain much
redundant information
This scenario is particularly relevant in the age of ‘big-data’
[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
4. Sparse linear models
A linear model can be written as
\[
y_i = \alpha + \sum_{j=1}^{p} x_{ij}\,\beta_j + \epsilon_i = \alpha + x_i^\top \beta + \epsilon_i, \qquad i = 1, \dots, N
\]
Hence, the model can be fit by minimising the objective function
\[
\underset{\alpha,\,\beta}{\text{minimise}} \;\left\{ \sum_{i=1}^{N} \left( y_i - \alpha - x_i^\top \beta \right)^2 \right\}
\]
Adding a penalisation term to the objective function makes the
solution more sparse:
\[
\underset{\alpha,\,\beta}{\text{minimise}} \;\left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \alpha - x_i^\top \beta \right)^2 + \lambda \|\beta\|_q^q \right\}, \qquad \text{where } q = 1 \text{ or } 2
\]
5. Sparse linear models
The penalty term $\lambda\|\beta\|_q^q$ means that only the bare minimum of all the information available in the p predictor variables $x_{ij}$, $j = 1, \dots, p$, is used.
\[
\underset{\alpha,\,\beta}{\text{minimise}} \;\left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \alpha - x_i^\top \beta \right)^2 + \lambda \|\beta\|_q^q \right\}
\]
q is typically chosen as q = 1 or q = 2, because these choices make the optimisation problem convex and hence computationally much nicer!
q = 1 is called the ‘lasso’; it tends to set as many elements of β as
possible to zero
q = 2 is called ‘ridge regression’, and it tends to minimise the size of
all the elements of β
Penalisation is equally applicable to other types of linear models:
logistic regression, generalised linear models etc
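As an illustration, here is a minimal sketch in R using the glmnet package on simulated data (glmnet's elastic-net mixing parameter alpha selects the penalty: alpha = 1 gives the lasso, alpha = 0 gives ridge):

library(glmnet)

# Simulated data: N = 100 samples, p = 20 predictors, only 3 true nonzero coefficients
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)
beta <- c(2, -3, 1.5, rep(0, 17))
y <- as.vector(x %*% beta + rnorm(100))

fit_lasso <- glmnet(x, y, alpha = 1)  # q = 1: lasso penalty
fit_ridge <- glmnet(x, y, alpha = 0)  # q = 2: ridge penalty

coef(fit_lasso, s = 0.1)  # at lambda = 0.1: many coefficients exactly zero
coef(fit_ridge, s = 0.1)  # all coefficients shrunk towards zero, but nonzero

Coefficient paths like those on the next slide can be produced with plot(fit_lasso) and plot(fit_ridge).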
6. Sparse linear models - simple example
[Figure: coefficient paths for the lasso (left) and ridge regression (right) for the five predictors hs, college, college4, not-hs and funding, plotted against the normalised penalty scales $\|\hat\beta\|_1/\|\tilde\beta\|_1$ and $\|\hat\beta\|_2/\|\tilde\beta\|_2$ respectively.]
Crime-rate modelled according to 5 predictors: annual police funding in
dollars per resident (funding), percent of people 25 years and older with
four years of high school (hs), percent of 16- to 19-year olds not in high
school and not high school graduates (not-hs), percent of 18- to 24-year
olds in college (college), and percent of people 25 years and older with at
least four years of college (college4).
7. Sparse linear models - genomics example
Gene expression data, for p = 17280 genes, for $n_c = 530$ cancer samples + $n_h = 61$ healthy tissue samples
Fit a logistic (i.e., two-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation
Out of 17280 possible genes for prediction, lasso chooses just these
25 (shown with their fitted model coefficients)
ADAMTS5 -0.0666 HPD -0.00679 NUP210 0.00582
ADH4 -0.165 HS3ST4 -0.0863 PAFAH1B3 0.297
CA4 -0.151 IGSF10 -0.356 TACC3 0.128
CCDC36 -0.335 LRRTM2 -0.0711 TESC -0.0568
CDH12 -0.253 LRRC3B -0.211 TRPM3 -1.24
CES1 -0.302 MEG3 -0.022 TSLP -0.0841
COL10A1 0.747 MMP11 0.22 WDR51A 0.0722
DPP6 -0.107 NUAK2 0.0354 WISP1 0.14
HHATL -0.0665
Caveat: these are not necessarily the only ‘predictive’ genes. If we
removed these genes from the data-set and fitted the model again,
lasso would choose an entirely new set of genes which might be
almost as good at predicting!
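A sketch of how such a fit might look with glmnet (the object names expr and status are hypothetical placeholders for the data described above):

library(glmnet)

# expr: 591 x 17280 expression matrix; status: factor with levels "healthy"/"cancer"
cvfit <- cv.glmnet(expr, status, family = "binomial", alpha = 1)

# Coefficients at the cross-validated choice of lambda; most are exactly zero
b <- coef(cvfit, s = "lambda.min")
nz <- which(as.vector(b) != 0)
data.frame(gene = rownames(b)[nz], coefficient = b[nz])  # includes the intercept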
8. Sparse PCA
Ordinary PCA finds v by carrying out the optimisation:
\[
\underset{\|v\|_2 = 1}{\text{maximise}} \;\left\{ v^\top \, \frac{X^\top X}{n} \, v \right\},
\]
with $X \in \mathbb{R}^{n \times p}$ (i.e., n samples and p variables).
With $p \gg n$, the eigenvectors of the sample covariance matrix $X^\top X / n$ are not necessarily close to those of the population covariance matrix [2].
Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v via the optimisation:
\[
\underset{\|v\|_2 = 1}{\text{maximise}} \;\left\{ v^\top X^\top X \, v \right\}, \qquad \text{subject to: } \|v\|_1 \leq t.
\]
In effect this discards some variables such that p is closer to n.
[2] Iain M. Johnstone. 'On the distribution of the largest eigenvalue in principal components analysis'. In: Annals of Statistics (2001), pp. 295–327.
9. Sparse SVD
The SVD of a matrix $X \in \mathbb{R}^{n \times p}$, with $n \geq p$, can be expressed as $X = UDV^\top$, where $U \in \mathbb{R}^{n \times p}$ has orthonormal columns, $V \in \mathbb{R}^{p \times p}$ is orthogonal, and $D \in \mathbb{R}^{p \times p}$ is diagonal. The SVD can hence be found by carrying out the optimisation:
\[
\underset{U \in \mathbb{R}^{n \times p},\; V \in \mathbb{R}^{p \times p},\; D \in \mathbb{R}^{p \times p}}{\text{minimise}} \;\; \|X - UDV^\top\|^2.
\]
Hence, a sparse SVD of rank r can be obtained by carrying out the optimisation:
\[
\underset{U \in \mathbb{R}^{n \times r},\; V \in \mathbb{R}^{p \times r},\; D \in \mathbb{R}^{r \times r}}{\text{minimise}} \;\left\{ \|X - UDV^\top\|^2 + \lambda_1 \|U\|_1 + \lambda_2 \|V\|_1 \right\}.
\]
This allows the SVD to be applied in the $p \gg n$ scenario.
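This penalised decomposition is implemented as the function PMD in the R package PMA (the penalised matrix decomposition of Witten, Tibshirani and Hastie); a minimal sketch, assuming a data matrix X:

library(PMA)

# X: n x p data matrix. sumabs sets the L1 bounds on the columns of U and V
# as a fraction of their maximum permissible values (smaller = sparser);
# K is the rank r of the decomposition.
fit <- PMD(X, type = "standard", sumabs = 0.6, K = 2)
fit$u  # n x 2 sparse left factors
fit$v  # p x 2 sparse right factors
fit$d  # the 2 singular values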
10. Sparse PCA and SVD - an algorithm
SVD is a generalisation of PCA. Hence, algorithms to solve the SVD
problem can be applied to the PCA problem
The sparse PCA problem can thus be re-formulated as:
\[
\underset{\|u\|_2 = \|v\|_2 = 1}{\text{maximise}} \;\left\{ u^\top X v \right\}, \qquad \text{subject to: } \|v\|_1 \leq t,
\]
which is biconvex in u and v and can be solved by alternating
between the updates:
\[
u \leftarrow \frac{Xv}{\|Xv\|_2}, \quad \text{and} \quad v \leftarrow \frac{S_\lambda\left(X^\top u\right)}{\left\|S_\lambda\left(X^\top u\right)\right\|_2}, \tag{1}
\]
where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)\left(|x| - \lambda\right)_+$.
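A minimal sketch of these updates in R (rank-1 only, with a fixed λ; a full implementation would instead choose λ at each step so that the constraint $\|v\|_1 \leq t$ holds):

# Soft-thresholding operator S_lambda(x) = sign(x) * (|x| - lambda)_+
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# Rank-1 sparse PCA via the alternating updates in (1)
sparse_pca_rank1 <- function(X, lambda, n_iter = 100) {
  v <- rnorm(ncol(X))
  v <- v / sqrt(sum(v^2))                    # random unit-norm start
  for (it in 1:n_iter) {
    u <- as.vector(X %*% v)
    u <- u / sqrt(sum(u^2))                  # u <- Xv / ||Xv||_2
    v <- soft_threshold(as.vector(t(X) %*% u), lambda)
    if (all(v == 0)) stop("lambda too large: v was thresholded to zero")
    v <- v / sqrt(sum(v^2))                  # v <- S(X'u) / ||S(X'u)||_2
  }
  list(u = u, v = v)
}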
11. Sparse PCA - simulation study
Define Σ as a p × p block-diagonal
matrix, with p = 200 and 10 blocks
of 1s of size 20 × 20.
Hence, we would expect there to be 10
independent components of variation
in the corresponding distribution.
Generate n samples x ∼ Normal(0, Σ)
Estimate $\hat\Sigma = \frac{1}{n}\sum_i (x_i - \bar{x})(x_i - \bar{x})^\top$
Correlate eigenvectors of $\Sigma$ with eigenvectors of $\hat\Sigma$
Repeat 100 times for each
different value of n
[Figure: 'Top 10 PCs'. Eigenvector correlation plotted against n/p.]
The plot shows the means of
these correlations over the
100 repetitions for different
values of n.
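A sketch of this simulation in R for a single value of n (one simple way to compare the eigenvectors is by absolute correlation of index-matched pairs, since eigenvector signs, and here the ordering within the tied leading eigenvalues, are arbitrary):

library(MASS)

p <- 200
Sigma <- kronecker(diag(10), matrix(1, 20, 20))  # 10 diagonal blocks of 1s
eig_true <- eigen(Sigma)$vectors[, 1:10]         # 10 leading population eigenvectors

n <- 100
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
Sigma_hat <- crossprod(sweep(X, 2, colMeans(X))) / n  # sum (x - xbar)(x - xbar)' / n
eig_hat <- eigen(Sigma_hat)$vectors[, 1:10]

mean(abs(diag(cor(eig_true, eig_hat))))  # mean eigenvector correlation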
12. Sparse PCA - simulation study
An implementation of sparse PCA is available in the R package PMA as the function SPC. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].
I applied this function to the same
simulation as described in the
previous slide.
The scale of the penalisation is in terms of $\|u\|_1$: the bound $\|u\|_1 \leq \sqrt{p}$ gives the minimum, and $\|u\|_1 \leq 1$ the maximum, permissible level of penalisation.
[Figure: 'Top 10 PCs'. Eigenvector correlation plotted against n/p.]
The plot shows the result with $\|u\|_1 = \sqrt{p}$.
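A sketch of the corresponding call (SPC's sumabsv argument is the L1 bound; note that in PMA's notation the bound is placed on v rather than u):

library(PMA)

# X: n x p data matrix; extract K = 10 sparse components with the weakest
# permissible penalisation, i.e. an L1 bound of sqrt(p)
fit <- SPC(X, sumabsv = sqrt(ncol(X)), K = 10)
fit$v  # p x 10 matrix of sparse loadings

# e.g. compare with the population eigenvectors as before:
# mean(abs(diag(cor(eig_true, fit$v))))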
[3] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. 'A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis'. In: Biostatistics (2009), kxp008.
13. Sparse PCA - simulation study
[Figure: 'Top 10 PCs'. Eigenvector correlation plotted against n/p.]
The plot shows the result with $\|u\|_1 = \sqrt{p}/2$.
[Figure: 'Top 10 PCs'. Eigenvector correlation plotted against n/p.]
The plot shows the result with $\|u\|_1 = \sqrt{p}/3$.
14. Sparse PCA - real data example
I carried out PCA on expression levels
of 10138 genes in individual cells
from developing brains
There are many different cell types in
the data - some mature, some
immature, and some in between
Different cell-types are characterised by
different gene expression profiles
We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions
The plot shows the cells in terms of the top three (standard) PCA components.
15. Sparse PCA - real data example
The plot shows the cells in terms of the top three sparse PCA components, with $\|u\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation).
The plot shows the cells in terms of the top three sparse PCA components, with $\|u\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).
16. Sparse CCA
In CCA, the aim is to find coefficient vectors u ∈ Rp and v ∈ Rq
which project the data-matrices X ∈ Rn×p and Y ∈ Rn×q so as to
maximise the correlations between these projections.
Whereas PCA aims to find the ‘direction’ of maximum variance in a
single data-matrix, CCA aims to find the ‘directions’ in the two
data-matrices in which the variances best explain each other.
The CCA problem can be solved by carrying out the optimisation:
\[
\underset{u \in \mathbb{R}^p,\; v \in \mathbb{R}^q}{\text{maximise}} \;\; \mathrm{Cor}(Xu, Yv)
\]
This problem is not well posed for $n \ll \max(p, q)$, in which case u and v can be found which trivially give $\mathrm{Cor}(Xu, Yv) = 1$.
Sparse CCA solves this problem by carrying out the optimisation:
\[
\underset{u \in \mathbb{R}^p,\; v \in \mathbb{R}^q}{\text{maximise}} \;\; \mathrm{Cor}(Xu, Yv), \qquad \text{subject to } \|u\|_1 \leq t_1 \text{ and } \|v\|_1 \leq t_2.
\]
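Sparse CCA in this form is implemented in the PMA package; a minimal sketch (penaltyx and penaltyz scale the L1 bounds, taking values between 0 and 1):

library(PMA)

# X: n x p and Y: n x q data matrices on the same n samples
fit <- CCA(X, Y, typex = "standard", typez = "standard",
           penaltyx = 0.3, penaltyz = 0.3, K = 1)
fit$u     # sparse p-vector of coefficients for X
fit$v     # sparse q-vector of coefficients for Y
fit$cors  # the canonical correlation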
17. Sparse CCA - real data example
‘Cell cycle’ is a biological process
involved in the replication of cells
Cell-cycle can be thought of as a latent
process which is not directly
observable in genomics data
It is driven by a small set of genes
(particularly cyclins and cyclin-
dependent kinases) from which it
may be inferred
It has an effect on the expression of very
many genes: hence it can also tend
to act as a confounding factor when
modelling many other biological
processes
Used CCA here as an
exploratory tool, with Y the
data for the cell cycle genes,
and X the data for all the
other genes.
18. Sparse LDA
LDA assigns item i to a group G based on a corresponding data-vector $x_i$, according to the posterior probability:
\[
P(G = k \mid x_i) = \frac{\pi_k f_k(x_i)}{\sum_{l=1}^{K} \pi_l f_l(x_i)}, \quad \text{with}
\]
\[
f_k(x_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right),
\]
with prior $\pi_k$ and mean $\mu_k$ for group k, and covariance Σ.
This assignment takes place by constructing ‘decision boundaries’
between classes k and l:
\[
\log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l)
\]
Because this boundary is linear in xi , we get the name LDA.
19. Sparse LDA
The decision boundary
\[
\log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l)
\]
then naturally leads to the decision rule:
\[
G(x_i) = \underset{k}{\mathrm{argmax}} \;\left\{ \log \pi_k + x_i^\top \Sigma^{-1} \mu_k - \frac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k \right\}.
\]
By assuming Σ is diagonal, i.e., that there is no covariance between the p dimensions, this decision rule can be reduced to the nearest-centroids classifier:
\[
G(x_i) = \underset{k}{\mathrm{argmin}} \;\left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2\log \pi_k \right\}.
\]
Typically, Σ (or the $\sigma_j$) are estimated from the data as $\hat\Sigma$ (or $\hat\sigma_j$), and the $\mu_k$ are estimated as $\hat\mu_k$ whilst training the classifier.
20. Sparse LDA
The nearest-centroids classifier
\[
\hat{G}(x_i) = \underset{k}{\mathrm{argmin}} \;\left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - 2\log \pi_k \right\}
\]
will typically use all p variables. This is often unnecessary and can
lead to overfitting in high-dimensional contexts. The nearest shrunken
centroids classifier deals with this issue.
Define $\hat\mu_k = \bar{x} + \alpha_k$, where $\bar{x}$ is the data-mean across all classes, and $\alpha_k$ is the class-specific deviation of the mean from $\bar{x}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
\[
\underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1,\dots,K\}}{\text{minimise}} \;\left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \frac{\sqrt{n_k}}{\hat\sigma_j^2}\, |\alpha_{jk}| \right\},
\]
where $C_k$ and $n_k$ are the set and number of samples in group k.
21. Sparse LDA
Hence, the $\alpha_k$ estimated from the optimisation
\[
\underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1,\dots,K\}}{\text{minimise}} \;\left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \frac{\sqrt{n_k}}{\hat\sigma_j^2}\, |\alpha_{jk}| \right\}
\]
can be used to estimate the shrunken centroids $\hat\mu_k = \bar{x} + \hat\alpha_k$, thus training the classifier:
\[
\hat{G}(x_i) = \underset{k}{\mathrm{argmin}} \;\left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - 2\log \pi_k \right\}.
\]
22. Sparse LDA - real data example
I applied nearest (shrunken) centroids to
expression data for 14349 genes, for
347 cells of different types:
leukocytes (54); lymphoblastic cells
(88); fetal brain cells (16wk, 26;
21wk, 24); fibroblasts (37); ductal
carcinoma (22); keratinocytes (40);
B lymphoblasts (17); iPS cells (24);
neural progenitors (15).
Used the R packages MASS and pamr [4].
Carried out 100 repetitions of 3-fold
CV. Plots show normalised mutual
information (NMI), adjusted Rand
index (ARI) and prediction accuracy.
[Figure: three panels plotting NMI, ARI and accuracy against the sparsity threshold (0 to 30), with quantile bands (100%, 75%, 50%, 25%, 0% over the 300 predictions) for sparse LDA and for regular LDA.]
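A sketch of the nearest shrunken centroids fit with pamr (expr and celltype are hypothetical placeholder names; note that pamr expects variables in rows and samples in columns):

library(pamr)

# expr: p x n expression matrix (genes in rows); celltype: factor of length n
mydata <- list(x = expr, y = celltype)
fit <- pamr.train(mydata)
cv <- pamr.cv(fit, mydata)                         # CV error across shrinkage thresholds
best <- cv$threshold[which.min(cv$error)]
pred <- pamr.predict(fit, expr, threshold = best)  # class predictions
table(pred, celltype)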
[4] Robert Tibshirani et al. 'Class prediction by nearest shrunken centroids, with applications to DNA microarrays'. In: Statistical Science (2003), pp. 104–117.
23. Sparse clustering
Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples i and i'. One popular choice of dissimilarity measure is the Euclidean distance.
In high-dimensions, it is often unnecessary to use information from all
of the p dimensions.
A weighted dissimilarity measure $\tilde{D}_{i,i'} = \sum_{j=1}^{p} w_j\, d_{i,i',j}$ can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:
\[
\underset{u \in \mathbb{R}^{n^2},\; w \in \mathbb{R}^{p}}{\text{maximise}} \;\; u^\top \Delta w, \qquad \text{subject to } \|u\|_2 \leq 1,\; \|w\|_2 \leq 1,\; \|w\|_1 \leq t, \text{ and } w_j \geq 0,\; j \in \{1,\dots,p\},
\]
where w is the vector of the weights $w_j$, $j \in \{1,\dots,p\}$, and $\Delta \in \mathbb{R}^{n^2 \times p}$ contains the dissimilarity components, arranged such that each row of Δ corresponds to the $d_{i,i',j}$, $j \in \{1,\dots,p\}$, for a pair of samples $i, i'$.
This weighted dissimilarity measure can then be used for sparse
clustering, such as sparse hierarchical clustering.
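A sketch of sparse hierarchical clustering with the sparcl package (wbound is the L1 bound t on the weights; the permute function chooses it by a permutation-based gap statistic):

library(sparcl)

# X: n x p data matrix
perm <- HierarchicalSparseCluster.permute(X, wbounds = c(2, 5, 10))
fit <- HierarchicalSparseCluster(X, wbound = perm$bestw, method = "complete")
fit$ws        # feature weights w; many exactly zero
plot(fit$hc)  # dendrogram from the weighted dissimilarity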
24. Sparse clustering
Some clustering methods, such as K-means, need a slightly modified
approach.
K-means seeks to minimise the within-cluster sum of squares
\[
\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|_2^2 \;=\; \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} \|x_i - x_{i'}\|_2^2
\]
where $C_k$ is the set of samples in cluster k, $n_k$ its size, and $\bar{x}_k$ the corresponding centroid.
Hence, a weighted K-means could proceed according to the
optimisation:
\[
\underset{w \in \mathbb{R}^p}{\text{minimise}} \;\left\{ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right\},
\]
where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster k.
25. Sparse clustering
However, for the optimisation
\[
\underset{w \in \mathbb{R}^p}{\text{minimise}} \;\left\{ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right\},
\]
it is not possible to choose a set of constraints which guarantee a
non-pathological solution as well as convexity.
Instead, the between-cluster sum of squares can be maximised:
\[
\underset{w \in \mathbb{R}^p}{\text{maximise}} \;\left\{ \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right) \right\}
\]
subject to $\|w\|_2 \leq 1$, $\|w\|_1 \leq t$, and $w_j \geq 0$, $j \in \{1,\dots,p\}$.
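A sketch of sparse K-means with the sparcl package (again wbounds gives candidate L1 bounds t, chosen by permutation; K = 10 here is an illustrative cluster number):

library(sparcl)

# X: n x p data matrix; K clusters
perm <- KMeansSparseCluster.permute(X, K = 10, wbounds = c(2, 5, 10))
fit <- KMeansSparseCluster(X, K = 10, wbounds = perm$bestw)
fit[[1]]$ws  # feature weights; zero weight means the feature is unused
fit[[1]]$Cs  # cluster assignments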
26. Sparse clustering - real data examples
Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl [5] for
the sparse clustering. Plots
show normalised mutual
information (NMI) and
adjusted Rand index (ARI)
comparing sparse with
standard clustering.
[Figure: NMI and ARI plotted against the L1 bound (2 to 1000, log scale), comparing sparse hierarchical clustering with standard hierarchical clustering.]
[5] Daniela M. Witten and Robert Tibshirani. 'A framework for feature selection in clustering'. In: Journal of the American Statistical Association (2012).
27. Sparse clustering - real data examples
Applied (sparse) k-means to
the same benchmark
expression data-set (14349
genes, for 347 cells of
different types).
Used R package sparcl for the
sparse clustering. Plots
show normalised mutual
information (NMI) and
adjusted Rand index (ARI)
comparing sparse with
standard clustering.
[Figure: NMI and ARI plotted against the L1 bound (2 to 1000, log scale), comparing sparse k-means with standard k-means.]
28. Sparse clustering - real data examples
Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
Applied standard k-means in
sparse-PCA space to the
same benchmark expression
data-set (14349 genes, for
347 cells of different types).
Offers computational
advantages, running in 9
seconds on a 2.8GHz
Macbook, compared with
19 seconds for standard
k-means, and 35 seconds
for sparse k-means.
[Figure: NMI and ARI plotted against the L1 bound divided by √n (0.1 to 1.0), comparing sparse spectral k-means with standard k-means.]
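A sketch of this approach, using SPC for the dimensionality reduction and then standard k-means on the resulting scores (component count and cluster number are illustrative assumptions, following the earlier examples):

library(PMA)

# X: n x p expression matrix
sp <- SPC(X, sumabsv = 0.5 * sqrt(ncol(X)), K = 3)  # 3 sparse components
scores <- X %*% sp$v                             # project cells onto the sparse loadings
cl <- kmeans(scores, centers = 10, nstart = 20)  # cluster in the reduced space
cl$cluster                                       # cluster assignments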