This document provides an overview of sparse statistical modelling techniques. It begins by defining sparse statistical models as those with only a small number of nonzero parameters. It then outlines several sparse modelling methods, including sparse linear models, sparse PCA, sparse SVD, sparse CCA, sparse LDA, and sparse clustering. For each method, it provides a brief mathematical formulation and discusses how sparsity is introduced through penalisation terms. It also includes examples applying several of these techniques to both simulated and real-world genomic data.
Sparse statistical modelling - Tom Bartlett
2. Introduction

- 'A sparse statistical model is one having only a small number of nonzero parameters or weights.' [1]
- The number of features or variables measured on a person or object can be very large (e.g., expression levels of ~30000 genes)
- These measurements are often highly correlated, i.e., contain much redundant information
- This scenario is particularly relevant in the age of 'big data'

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
4. Sparse linear models

A linear model can be written as
$$y_i = \alpha + \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i = \alpha + \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad i = 1, \dots, n$$
Hence, the model can be fit by minimising the objective function
$$\underset{\alpha,\,\boldsymbol{\beta}}{\text{minimise}} \left\{ \sum_{i=1}^{N} \left( y_i - \alpha - \mathbf{x}_i^\top \boldsymbol{\beta} \right)^2 \right\}$$
Adding a penalisation term to the objective function makes the solution more sparse:
$$\underset{\alpha,\,\boldsymbol{\beta}}{\text{minimise}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \alpha - \mathbf{x}_i^\top \boldsymbol{\beta} \right)^2 + \lambda \|\boldsymbol{\beta}\|_q^q \right\}, \quad \text{where } q = 1 \text{ or } 2
5. Sparse linear models

The penalty term $\lambda\|\boldsymbol{\beta}\|_q^q$ means that only the bare minimum is used of all the information available in the p predictor variables $x_{ij}$, $j = 1, \dots, p$:
$$\underset{\alpha,\,\boldsymbol{\beta}}{\text{minimise}} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \alpha - \mathbf{x}_i^\top \boldsymbol{\beta} \right)^2 + \lambda \|\boldsymbol{\beta}\|_q^q \right\}$$

- q is typically chosen as q = 1 or q = 2, because these produce convex problems and hence are computationally much nicer!
- q = 1 is called the 'lasso'; it tends to set as many elements of $\boldsymbol{\beta}$ as possible to zero
- q = 2 is called 'ridge regression'; it tends to minimise the size of all the elements of $\boldsymbol{\beta}$
- Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.
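As a concrete illustration, here is a minimal R sketch of the two penalties using the glmnet package, on toy simulated data (not from the slides); glmnet's alpha argument switches between the lasso and ridge penalties:

```r
## Minimal sketch: lasso (q = 1) vs ridge (q = 2) with glmnet, on toy
## data where only 3 of the 5 true coefficients are nonzero.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(x %*% c(3, -2, 0, 0, 1)) + rnorm(100)

fit_lasso <- glmnet(x, y, alpha = 1)  # alpha = 1: L1 penalty (lasso)
fit_ridge <- glmnet(x, y, alpha = 0)  # alpha = 0: L2 penalty (ridge)

par(mfrow = c(1, 2))
plot(fit_lasso, xvar = "norm")  # some coefficient paths hit exactly zero
plot(fit_ridge, xvar = "norm")  # paths shrink but stay nonzero
```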
6. Sparse linear models - simple example

[Figure: lasso and ridge regression coefficient paths for the five predictors (hs, college, college4, not-hs, funding), plotted against the scaled coefficient norms $\|\hat\beta\|_1 / \|\tilde\beta\|_1$ and $\|\hat\beta\|_2 / \|\tilde\beta\|_2$ respectively.]

Crime-rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).
7. Sparse linear models - genomics example

- Gene expression data, for p = 17280 genes, for nc = 530 cancer samples + nh = 61 healthy tissue samples
- Fit logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation
- Out of 17280 possible genes for prediction, lasso chooses just these 25 (shown with their fitted model coefficients):
ADAMTS5 -0.0666 HPD -0.00679 NUP210 0.00582
ADH4 -0.165 HS3ST4 -0.0863 PAFAH1B3 0.297
CA4 -0.151 IGSF10 -0.356 TACC3 0.128
CCDC36 -0.335 LRRTM2 -0.0711 TESC -0.0568
CDH12 -0.253 LRRC3B -0.211 TRPM3 -1.24
CES1 -0.302 MEG3 -0.022 TSLP -0.0841
COL10A1 0.747 MMP11 0.22 WDR51A 0.0722
DPP6 -0.107 NUAK2 0.0354 WISP1 0.14
HHATL -0.0665
Caveat: these are not necessarily the only 'predictive' genes. If we removed these genes from the data-set and fitted the model again, lasso would choose an entirely new set of genes which might be almost as good at predicting!
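A hedged sketch of this fit; expr (a 591 x 17280 expression matrix with gene names as column names) and status (a cancer/healthy factor) are placeholder names, since the data themselves are not shown:

```r
## Logistic lasso with lambda chosen by cross-validation, as described
## above. 'expr' and 'status' are placeholder objects.
library(glmnet)

cvfit <- cv.glmnet(expr, status, family = "binomial", alpha = 1)

## Coefficients at the cross-validated lambda: almost all exactly zero.
b <- as.matrix(coef(cvfit, s = "lambda.min"))
setdiff(rownames(b)[b != 0], "(Intercept)")  # the genes lasso retains
```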
8. Sparse PCA

Ordinary PCA finds v by carrying out the optimisation:
$$\underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top \frac{X^\top X}{n} v \right\},$$
with $X \in \mathbb{R}^{n \times p}$ (i.e., n samples and p variables).
With $p \gg n$, the eigenvectors of the sample covariance matrix $X^\top X / n$ are not necessarily close to those of the population covariance matrix [2].
Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v by carrying out the optimisation:
$$\underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top X^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t.$$
In effect this discards some variables such that p is closer to n.

[2] Iain M Johnstone. "On the distribution of the largest eigenvalue in principal components analysis". In: Annals of Statistics (2001), pp. 295-327.
9. Sparse SVD

The SVD of a matrix $X \in \mathbb{R}^{n \times p}$, with $n \ge p$, can be expressed as $X = UDV^\top$, where $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{p \times p}$ are orthogonal and $D \in \mathbb{R}^{p \times p}$ is diagonal. The SVD can hence be found by carrying out the optimisation:
$$\underset{U \in \mathbb{R}^{n \times p},\; V \in \mathbb{R}^{p \times p},\; D \in \mathbb{R}^{p \times p}}{\text{minimise}} \|X - UDV^\top\|^2.$$
Hence, a sparse SVD with rank r can be obtained by carrying out the optimisation:
$$\underset{U \in \mathbb{R}^{n \times r},\; V \in \mathbb{R}^{p \times r},\; D \in \mathbb{R}^{r \times r}}{\text{minimise}} \left\{ \|X - UDV^\top\|^2 + \lambda_1 \|U\|_1 + \lambda_2 \|V\|_1 \right\}.$$
This allows SVD to be applied to the $p \gg n$ scenario.
10. Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem.
The sparse PCA can thus be re-formulated as:
$$\underset{\|u\|_2 = \|v\|_2 = 1}{\text{maximise}} \left\{ u^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t,$$
which is biconvex in u and v and can be solved by alternating between the updates:
$$u \leftarrow \frac{Xv}{\|Xv\|_2}, \quad \text{and} \quad v \leftarrow \frac{S_\lambda(X^\top u)}{\|S_\lambda(X^\top u)\|_2}, \tag{1}$$
where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \text{sign}(x)\,(|x| - \lambda)_+$.
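The following is a minimal R sketch of these alternating updates, transcribed from (1); it is illustrative only (fixed iteration count, no convergence check), and a tested implementation is available in the PMA package discussed on the next slides:

```r
## Rank-1 sparse power iteration: alternate u <- Xv/||Xv||_2 and
## v <- S_lambda(X'u)/||S_lambda(X'u)||_2, as in update (1).
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

sparse_rank1 <- function(X, lambda, n_iter = 50) {
  v <- rnorm(ncol(X))
  v <- v / sqrt(sum(v^2))                 # random unit-norm start
  for (it in seq_len(n_iter)) {
    u <- drop(X %*% v)
    u <- u / sqrt(sum(u^2))               # u-update
    v <- soft_threshold(drop(crossprod(X, u)), lambda)
    if (all(v == 0)) stop("lambda too large: v thresholded to zero")
    v <- v / sqrt(sum(v^2))               # v-update, renormalised
  }
  list(u = u, v = v, d = drop(t(u) %*% X %*% v))
}
```

The penalty level lambda here corresponds to the constraint bound t via Lagrangian duality: larger lambda (smaller t) gives a sparser v.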
11. Sparse PCA - simulation study

- Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20.
- Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
- Generate n samples $x \sim \text{Normal}(0, \Sigma)$
- Estimate $\hat\Sigma = \frac{1}{n}\sum_i (x_i - \bar{x})(x_i - \bar{x})^\top$
- Correlate eigenvectors of $\Sigma$ with eigenvectors of $\hat\Sigma$
- Repeat 100 times for each different value of n

[Figure: 'Top 10 PCs' - eigenvector correlation versus n/p; the plot shows the means of these correlations over the 100 repetitions for different values of n.]
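Below is a hedged R sketch of one repetition of this simulation at a single n; it uses the fact that, for this Σ, a draw from Normal(0, Σ) is one latent standard normal per block repeated 20 times, and it matches true to estimated eigenvectors by best absolute correlation (one plausible reading of the correlation step):

```r
## Block-diagonal Sigma: p = 200, with 10 blocks of 1s of size 20 x 20.
p <- 200; n_blocks <- 10; block_size <- 20
B <- kronecker(diag(n_blocks), rep(1, block_size))  # 200 x 10 indicator
true_vecs <- qr.Q(qr(B))             # leading eigenvectors of Sigma

n <- 100                             # vary n to trace out the curve
Z <- matrix(rnorm(n * n_blocks), n, n_blocks)  # one latent factor per block
X <- Z %*% t(B)                      # rows are draws from Normal(0, Sigma)

Xc <- scale(X, center = TRUE, scale = FALSE)
Sigma_hat <- crossprod(Xc) / n       # sum of (x - xbar)(x - xbar)' / n
est_vecs <- eigen(Sigma_hat, symmetric = TRUE)$vectors[, 1:n_blocks]

## Mean best-match absolute correlation between true and estimated PCs
mean(apply(abs(cor(true_vecs, est_vecs)), 1, max))
```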
12. Sparse PCA - simulation study

An implementation of sparse PCA is available in the R package PMA as the function spca. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].
I applied this function to the same simulation as described in the previous slide.
The scale of the penalisation is in terms of $\|u\|_1$: the bound $\|u\|_1 = \sqrt{p}$ gives the minimum penalisation, and $\|u\|_1 = 1$ the maximum.

[Figure: 'Top 10 PCs' - eigenvector correlation versus n/p, for sparse PCA with $\|u\|_1 = \sqrt{p}$.]

[3] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis". In: Biostatistics (2009), kxp008.
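For reference, a hedged sketch of the corresponding call; the slide refers to the function as spca, while in the PMA releases I am aware of the sparse-PCA routine is exposed as SPC, with the L1 bound on the p-dimensional loading passed via sumabsv:

```r
## Sparse PCA on the simulated X from the previous sketch.
## sumabsv ranges from 1 (maximum penalisation) to sqrt(p) (minimum).
library(PMA)

out <- SPC(X, sumabsv = sqrt(p), K = 10)  # 10 sparse components
est_vecs_sparse <- out$v                  # p x 10 sparse loadings
```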
13. Sparse PCA - simulation study

[Figure: 'Top 10 PCs' - eigenvector correlation versus n/p, for sparse PCA with $\|u\|_1 = \sqrt{p}/2$.]

[Figure: 'Top 10 PCs' - eigenvector correlation versus n/p, for sparse PCA with $\|u\|_1 = \sqrt{p}/3$.]
14. Sparse PCA - real data example

- I carried out PCA on expression levels of 10138 genes in individual cells from developing brains
- There are many different cell types in the data - some mature, some immature, and some in between
- Different cell-types are characterised by different gene expression profiles
- We would therefore expect to be able to visualise some separation of the cell-types by dimensionality reduction to three dimensions

[Figure: the cells plotted in terms of the top three (standard) PCA components.]
15. Sparse PCA - real data example

[Figure: the cells plotted in terms of the top three sparse PCA components, with $\|u\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation).]

[Figure: the cells plotted in terms of the top three sparse PCA components, with $\|u\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).]
16. Sparse CCA

In CCA, the aim is to find coefficient vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ which project the data-matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ so as to maximise the correlations between these projections.
Whereas PCA aims to find the 'direction' of maximum variance in a single data-matrix, CCA aims to find the 'directions' in the two data-matrices in which the variances best explain each other.
The CCA problem can be solved by carrying out the optimisation:
$$\underset{u \in \mathbb{R}^p,\; v \in \mathbb{R}^q}{\text{maximise}} \; \text{Cor}(Xu, Yv)$$
This problem is not well posed for $n \ll \max(p, q)$, in which case u and v can be found which trivially give $\text{Cor}(Xu, Yv) = 1$.
Sparse CCA solves this problem by carrying out the optimisation:
$$\underset{u \in \mathbb{R}^p,\; v \in \mathbb{R}^q}{\text{maximise}} \; \text{Cor}(Xu, Yv), \quad \text{subject to } \|u\|_1 \le t_1 \text{ and } \|v\|_1 \le t_2.$$
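A hedged sketch using PMA's CCA function, which implements this penalised formulation; X and Y are placeholder data matrices, and penaltyx/penaltyz (values in (0, 1]) scale the L1 bounds t1 and t2:

```r
## Sparse CCA: L1-penalised u and v maximising Cor(Xu, Yv).
library(PMA)

out <- CCA(X, Y, typex = "standard", typez = "standard",
           penaltyx = 0.3, penaltyz = 0.3, K = 1)
out$u  # sparse coefficients for the columns of X (mostly zero)
out$v  # sparse coefficients for the columns of Y
```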
17. Sparse CCA - real data example

- 'Cell cycle' is a biological process involved in the replication of cells
- Cell-cycle can be thought of as a latent process which is not directly observable in genomics data
- It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred
- It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes

Used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.
18. Sparse LDA

LDA assigns item i to a group G based on a corresponding data-vector $x_i$, according to the posterior probability:
$$P(G = k \mid x_i) = \frac{\pi_k f_k(x_i)}{\sum_{l=1}^{K} \pi_l f_l(x_i)}, \quad \text{with}$$
$$f_k(x_i) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right),$$
with prior $\pi_k$ and mean $\mu_k$ for group k, and covariance $\Sigma$.
This assignment takes place by constructing 'decision boundaries' between classes k and l:
$$\log\frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log\frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1}(\mu_k - \mu_l) - \frac{1}{2}(\mu_k + \mu_l)^\top \Sigma^{-1}(\mu_k - \mu_l)$$
Because this boundary is linear in $x_i$, we get the name LDA.
19. Sparse LDA

The decision boundary
$$\log\frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log\frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1}(\mu_k - \mu_l) - \frac{1}{2}(\mu_k + \mu_l)^\top \Sigma^{-1}(\mu_k - \mu_l)$$
then naturally leads to the decision rule:
$$G(x_i) = \underset{k}{\text{argmax}} \left\{ \log \pi_k + x_i^\top \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k \right\}.$$
By assuming $\Sigma$ is diagonal, i.e., there is no covariance between the p dimensions, this decision rule can be reduced to the nearest centroids classifier:
$$G(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - \log \pi_k \right\}.$$
Typically, $\Sigma$ (or the $\sigma_j$) are estimated from the data as $\hat\Sigma$ (or $\hat\sigma_j$), and the $\mu_k$ are estimated as $\hat\mu_k$ whilst training the classifier.
20. Sparse LDA

The nearest centroids classifier
$$\hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - \log \pi_k \right\}$$
will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue.
Define $\hat\mu_k = \bar{x} + \alpha_k$, where $\bar{x}$ is the data-mean across all classes, and $\alpha_k$ is the class-specific deviation of the mean from $\bar{x}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
$$\underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \frac{\sqrt{n_k}}{\hat\sigma_j^2} |\alpha_{jk}| \right\},$$
where $C_k$ and $n_k$ are the set and number of samples in group k.
21. Sparse LDA

Hence, the $\alpha_k$ estimated from the optimisation
$$\underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \frac{\sqrt{n_k}}{\hat\sigma_j^2} |\alpha_{jk}| \right\}$$
can be used to estimate the shrunken centroids $\hat\mu_k = \bar{x} + \hat\alpha_k$, thus training the classifier:
$$\hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - \log \pi_k \right\}.$$
22. Sparse LDA - real data example

- I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16wk, 26; 21wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15)
- Used R packages MASS and pamr [4]
- Carried out 100 repetitions of 3-fold CV. Plots show normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy

[Figure: NMI, ARI and prediction accuracy versus sparsity threshold, shown as quantiles (100%, 75%, 50%, 25%, 0%) over the 300 predictions, for sparse LDA and regular LDA.]

[4] Robert Tibshirani et al. "Class prediction by nearest shrunken centroids, with applications to DNA microarrays". In: Statistical Science (2003), pp. 104-117.
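A hedged sketch of the pamr workflow used here; X (a cells x genes matrix) and cell_type (a factor of cell-type labels) are placeholder names, and the shrinkage threshold corresponds to the sparsity threshold on the x-axes above:

```r
## Nearest shrunken centroids with pamr (note: pamr wants features in
## rows, hence the transposes).
library(pamr)

d     <- list(x = t(X), y = cell_type)
fit   <- pamr.train(d)
cvfit <- pamr.cv(fit, d)   # CV error across a grid of thresholds
pamr.plotcv(cvfit)

## Predict at a chosen threshold (larger threshold = fewer genes used)
pred <- pamr.predict(fit, t(X), threshold = 5)
table(pred, cell_type)
```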
23. Sparse clustering

Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples i and i'. One popular choice of dissimilarity measure is the Euclidean distance.
In high dimensions, it is often unnecessary to use information from all of the p dimensions.
A weighted dissimilarity measure $\tilde{D}_{i,i'} = \sum_{j=1}^{p} w_j d_{i,i',j}$ can be a useful approach to this problem. The weights can be obtained by the sparse matrix decomposition:
$$\underset{u \in \mathbb{R}^{n^2},\; w \in \mathbb{R}^p}{\text{maximise}} \; u^\top \Delta w, \quad \text{subject to } \|u\|_2 \le 1,\ \|w\|_2 \le 1,\ \|w\|_1 \le t, \text{ and } w_j \ge 0,\ j \in \{1,\dots,p\},$$
where $w$ is the vector of weights $w_j$, $j \in \{1,\dots,p\}$, and $\Delta \in \mathbb{R}^{n^2 \times p}$ holds the dissimilarity components, arranged such that each row of $\Delta$ corresponds to the $d_{i,i',j}$, $j \in \{1,\dots,p\}$, for a pair of samples $i, i'$.
This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.
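A hedged sketch with sparcl's HierarchicalSparseCluster, which implements this weighted-dissimilarity approach; X is a placeholder data matrix, and wbound = 10 is an arbitrary choice for the L1 bound t:

```r
## Sparse hierarchical clustering: feature weights w from the penalised
## matrix decomposition, then clustering on the weighted dissimilarity.
library(sparcl)

out <- HierarchicalSparseCluster(x = as.matrix(X), wbound = 10,
                                 method = "complete")
out$ws        # the feature weights w (many exactly zero)
plot(out$hc)  # dendrogram built from the weighted dissimilarity
```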
24. Sparse clustering

Some clustering methods, such as K-means, need a slightly modified approach.
K-means seeks to minimise the within-cluster sum of squares
$$\sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \bar{x}_k\|_2^2 = \sum_{k=1}^{K} \frac{1}{2n_k} \sum_{i,i' \in C_k} \|x_i - x_{i'}\|_2^2,$$
where $C_k$ is the set of samples in cluster k and $\bar{x}_k$ is the corresponding centroid.
Hence, a weighted K-means could proceed according to the optimisation:
$$\underset{w \in \mathbb{R}^p}{\text{minimise}} \; \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j},$$
where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster k.
25. Sparse clustering

However, for the optimisation
$$\underset{w \in \mathbb{R}^p}{\text{minimise}} \; \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j},$$
it is not possible to choose a set of constraints which guarantee a non-pathological solution as well as convexity.
Instead, the between-cluster sum of squares can be maximised:
$$\underset{w \in \mathbb{R}^p}{\text{maximise}} \; \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right)$$
subject to $\|w\|_2 \le 1$, $\|w\|_1 \le t$, and $w_j \ge 0$, $j \in \{1,\dots,p\}$.
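A hedged sketch with sparcl's KMeansSparseCluster, which implements this between-cluster criterion; X is a placeholder data matrix, K = 9 is an assumption matching the nine cell types in the examples that follow, and wbounds sets the L1 bound t:

```r
## Sparse k-means: w maximises the weighted between-cluster sum of
## squares, subject to ||w||_2 <= 1, ||w||_1 <= t, w_j >= 0.
library(sparcl)

out <- KMeansSparseCluster(as.matrix(X), K = 9, wbounds = 10)[[1]]
out$ws  # feature weights (sparser for smaller wbounds)
out$Cs  # cluster assignments
```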
26. Sparse clustering - real data examples

Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl [5] for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering.

[Figure: NMI and ARI versus L1 bound, comparing sparse hierarchical clustering with standard hierarchical clustering.]

[5] Daniela M Witten and Robert Tibshirani. "A framework for feature selection in clustering". In: Journal of the American Statistical Association (2012).
27. Sparse clustering - real data examples

Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
Used R package sparcl for the sparse clustering. Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering.

[Figure: NMI and ARI versus L1 bound, comparing sparse k-means with standard k-means.]
28. Sparse clustering - real data examples

Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
This offers computational advantages, running in 9 seconds on a 2.8GHz Macbook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.

[Figure: NMI and ARI versus L1 bound / sqrt(n), comparing sparse spectral k-means with standard k-means.]
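A hedged sketch of this variant, with standard kmeans run on sparse-PCA scores obtained via PMA's SPC; the exact pipeline behind the timings above is not shown, so treat this as one plausible reconstruction (K = 9 again matches the listed cell types):

```r
## 'Sparse spectral' k-means: reduce with sparse PCA, then cluster.
library(PMA)

pc <- SPC(as.matrix(X), sumabsv = 0.5 * sqrt(ncol(X)), K = 10)
scores <- as.matrix(X) %*% pc$v   # cells in sparse-PCA space
km <- kmeans(scores, centers = 9, nstart = 20)
km$cluster                        # cluster assignments
```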