Introduction to Gaussian Processes for Dimensionality Reduction and Single-Cell Genomics
1. Introduction to Gaussian Processes Gaussian Process Latent Variable Models Applications in single-cell genomics References
Gaussian Process Latent Variable Models &
applications in single-cell genomics
Kieran Campbell
University of Oxford
November 19, 2015
Introduction to Gaussian Processes
Gaussian Process Latent Variable Models
Applications in single-cell genomics
References
Introduction
In (Bayesian) supervised learning, some (non-)linear function $f(x; w)$, parametrized by $w$, is assumed to generate the data $\{x_n, y_n\}$.
$f$ may take any parametric form, e.g. linear: $f(x) = w_0 + w_1 x$
Posterior inference can be performed on
$$p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)} \qquad (1)$$
Predictions for a new point $\{y_*, x_*\}$ can be made by marginalising over $w$:
$$p(y_* \mid y, X, x_*) = \int p(y_* \mid w, X, x_*)\, p(w \mid y, X)\, dw \qquad (2)$$
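For the linear example, equations (1) and (2) have closed forms. The following is a minimal sketch, assuming a Gaussian prior on $w$ and Gaussian observation noise with known variance; the function names, the prior variance and the simulated data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def linear_posterior(X, y, noise_var=0.1, prior_var=1.0):
    """Posterior N(mean, cov) over w = (w0, w1) for y = w0 + w1 * x + noise.

    Assumes w ~ N(0, prior_var * I) and noise ~ N(0, noise_var).
    """
    Phi = np.column_stack([np.ones_like(X), X])           # design matrix [1, x]
    precision = np.eye(2) / prior_var + Phi.T @ Phi / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def predictive(x_star, mean, cov, noise_var=0.1):
    """Mean and variance of p(y* | y, X, x*), i.e. w integrated out."""
    phi = np.array([1.0, x_star])
    return phi @ mean, phi @ cov @ phi + noise_var

# Simulated data with hypothetical true weights w0 = 0.5, w1 = 2.0.
rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 50)
y = 0.5 + 2.0 * X + rng.normal(scale=np.sqrt(0.1), size=X.size)

w_mean, w_cov = linear_posterior(X, y)
y_mean, y_var = predictive(0.3, w_mean, w_cov)
```

The predictive variance is the noise variance plus a term from the remaining uncertainty in $w$, which is what marginalising over $w$ adds compared with plugging in a point estimate.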
Gaussian Process Regression
Gaussian Processes place a non-parametric prior over the functions $f(x)$
$f$ is always indexed by an ‘input variable’ $x$
Any finite set of function values $\{f_i\}_{i=1}^N$, with $f_i = f(x_i)$, is jointly drawn from a multivariate Gaussian distribution with zero mean and covariance matrix $K$:
$$p(f_1, \ldots, f_N) = \mathcal{N}(0, K) \qquad (3)$$
In other words, the process is entirely defined by its second-order statistics $K$
Choice of Kernel
Behaviour of the GP is defined by the choice of kernel and its parameters
The kernel function $K(x, x')$ becomes the covariance matrix once a set of points is ‘realised’
A typical choice is the squared exponential (RBF) kernel
$$K(x, x') = \exp(-\lambda \lVert x - x' \rVert^2) \qquad (4)$$
Intuition: if $x$ and $x'$ are similar, the covariance will be larger, and so $f(x)$ and $f(x')$ will, on average, be closer together
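To make the ‘realised’ covariance matrix concrete, here is a small sketch that evaluates the kernel of equation (4) on a grid of 1-D inputs and draws functions from the zero-mean GP prior. The grid, the value of λ and the jitter term are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lam=1.0):
    """K(x, x') = exp(-lam * (x - x')^2) for 1-D input arrays xa and xb."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-lam * diff ** 2)

x = np.linspace(0.0, 1.0, 100)
K = sq_exp_kernel(x, x, lam=10.0)          # covariance matrix for the realised points

# A small diagonal "jitter" keeps the Cholesky factorisation numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(x.size))

rng = np.random.default_rng(1)
prior_draws = L @ rng.standard_normal((x.size, 3))   # three draws of f ~ N(0, K)
```

Nearby inputs have covariance close to 1 and distant inputs close to 0, so each column of `prior_draws` is a smooth curve.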
GPs with noisy observations
So far we have assumed observations of $f$ are noise free, so the GP becomes a function interpolator
Instead, suppose the observations $y(x)$ are corrupted by noise, so $y \sim \mathcal{N}(f(x), \sigma^2)$
Because everything is Gaussian, we can marginalise over the (latent) functions $f$ and find
$$p(y_1, \ldots, y_N) = \mathcal{N}(0, K + \sigma^2 I) \qquad (5)$$
Predictions with noisy observations
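To make predictions with GPs we only need the covariance between the ‘old’ inputs $X$ and the ‘new’ input $x_*$:
Let $k_* = K(X, x_*)$ and $k_{**} = K(x_*, x_*)$
Then
$$p(f_* \mid x_*, X, y) = \mathcal{N}(f_* \mid k_*^T K^{-1} y,\; k_{**} - k_*^T K^{-1} k_*) \qquad (6)$$
where, for noisy observations, $K$ stands for the noisy covariance $K + \sigma^2 I$ of equation (5)
This highlights the major disadvantage of GPs: to make predictions we need to invert an $n \times n$ matrix, which costs $O(n^3)$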
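Equation (6) can be sketched in a few lines. This version assumes 1-D inputs, the squared exponential kernel of equation (4), and noisy observations (so the matrix being inverted is $K + \sigma^2 I$); it uses a Cholesky factorisation rather than an explicit inverse, which is the usual numerically safer choice but still $O(n^3)$. The function names and toy data are illustrative.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lam=1.0):
    """K(x, x') = exp(-lam * (x - x')^2) for 1-D input arrays."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-lam * diff ** 2)

def gp_predict(X, y, x_star, lam=1.0, noise_var=0.01):
    """Posterior mean and covariance of f* at the new inputs x_star."""
    K_y = sq_exp_kernel(X, X, lam) + noise_var * np.eye(X.size)
    k_star = sq_exp_kernel(X, x_star, lam)                 # shape (n, m)
    k_ss = sq_exp_kernel(x_star, x_star, lam)              # shape (m, m)
    L = np.linalg.cholesky(K_y)                            # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K_y^{-1} y
    v = np.linalg.solve(L, k_star)                         # L^{-1} k_*
    mean = k_star.T @ alpha                                # k_*^T K_y^{-1} y
    cov = k_ss - v.T @ v                                   # k_** - k_*^T K_y^{-1} k_*
    return mean, cov

# Toy data: a sine wave observed on a grid of 20 inputs.
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X)
x_star = np.array([0.25, 0.5])
mean, cov = gp_predict(X, y, x_star, lam=20.0)
```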
Effect of RBF kernel parameters
Kernel
$$\kappa(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2 l^2}(x_p - x_q)^2\right) + \sigma_y^2 \delta_{pq}$$
Parameters
$l$ controls the horizontal length scale
$\sigma_f$ controls the vertical scale
$\sigma_y^2$ is the noise variance
In the figure, $(l, \sigma_f, \sigma_y)$ have values
(a) (1, 1, 0.1)
(b) (0.3, 1.08, 0.00005)
(c) (3.0, 1.16, 0.89)
Figure: Rasmussen and Williams 2006
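As a rough numerical companion to the figure, the sketch below evaluates this kernel for the three parameter settings and prints the correlation between two inputs one unit apart. The function name is illustrative, and testing `xp == xq` on input values stands in for the index-based Kronecker delta, which is a simplification.

```python
import math

def rbf_kernel(xp, xq, l, sigma_f, sigma_y):
    """sigma_f^2 * exp(-(xp - xq)^2 / (2 l^2)), plus sigma_y^2 when xp and xq coincide."""
    noise = sigma_y ** 2 if xp == xq else 0.0
    return sigma_f ** 2 * math.exp(-((xp - xq) ** 2) / (2.0 * l ** 2)) + noise

# (l, sigma_f, sigma_y) for panels (a), (b), (c) of the figure.
settings = {"a": (1.0, 1.0, 0.1), "b": (0.3, 1.08, 0.00005), "c": (3.0, 1.16, 0.89)}

for name, (l, sigma_f, sigma_y) in settings.items():
    # Correlation of the noise-free function at two inputs one unit apart.
    corr = rbf_kernel(0.0, 1.0, l, sigma_f, sigma_y) / sigma_f ** 2
    print(name, round(corr, 4))
```

A shorter length scale makes the correlation decay faster with distance, which is why panel (b) wiggles more than panel (c).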
Dimensionality reduction & unsupervised learning
Dimensionality reduction
We want to reduce some observed data $Y \in \mathbb{R}^{N \times D}$ to a set of latent variables $X \in \mathbb{R}^{N \times Q}$, where $Q \ll D$.
Methods
Linear: PCA, ICA
Non-linear: Laplacian eigenmaps, MDS, etc.
Probabilistic: PPCA, GPLVM
Probabilistic PCA (Tipping and Bishop, 1999)
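Recall $Y$ is the observed data matrix and $X$ the latent matrix. Then assume
$$y_n = W x_n + \eta_n$$
where
$W$ is the linear relationship between latent space and data space
$\eta_n$ is Gaussian noise with mean 0 and precision $\beta$
Then, with a standard Gaussian prior $x_n \sim \mathcal{N}(0, I)$, marginalise out $X$ to find
$$p(y_n \mid W, \beta) = \mathcal{N}(y_n \mid 0, W W^T + \beta^{-1} I)$$
An analytic solution exists in which $W$ spans the principal subspace of the data: this is probabilistic PCA.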
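The analytic solution can be written down from the eigendecomposition of the sample covariance, following Tipping and Bishop (1999): the noise variance $\beta^{-1}$ is the mean of the discarded eigenvalues, and the columns of $W$ are the leading eigenvectors scaled by $\sqrt{\lambda_i - \beta^{-1}}$ (up to an arbitrary rotation, taken as the identity here). The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def ppca_ml(Y, q):
    """Closed-form maximum-likelihood PPCA fit with q latent dimensions."""
    Yc = Y - Y.mean(axis=0)                        # centre the data
    S = Yc.T @ Yc / Y.shape[0]                     # sample covariance (D x D)
    evals, evecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]     # reorder to descending
    sigma2 = evals[q:].mean()                      # noise variance, i.e. 1 / beta
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)
    return W, sigma2

# Simulated data: 2 latent dimensions, 5 observed dimensions, noise sd 0.1.
rng = np.random.default_rng(2)
W_true = rng.standard_normal((5, 2))
Y = rng.standard_normal((2000, 2)) @ W_true.T + 0.1 * rng.standard_normal((2000, 5))

W_hat, sigma2_hat = ppca_ml(Y, q=2)
```

The fitted model covariance `W_hat @ W_hat.T + sigma2_hat * I` closely matches the sample covariance, and the column space of `W_hat` is the principal subspace.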
GPLVM (Lawrence 2005)
Alternative representation (dual probabilistic PCA)
Instead of marginalising the latent factors $X$, marginalise the mapping $W$. Let $p(W) = \prod_i \mathcal{N}(w_i \mid 0, I)$; then
$$p(y_{:,d} \mid X, \beta) = \mathcal{N}(y_{:,d} \mid 0, X X^T + \beta^{-1} I)$$
GPLVM
Lawrence’s breakthrough was to realise that the covariance matrix
$$K = X X^T + \beta^{-1} I$$
can be replaced by any similarity (kernel) matrix $S$, as in the GP framework.
The GP-LVM defines a mapping from the latent space to the observed space.
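The quantity a GPLVM optimises over the latent positions is the product of these per-dimension Gaussians. The sketch below evaluates that log-likelihood for a pluggable kernel, so the linear kernel recovers dual probabilistic PCA and an RBF kernel gives a non-linear GPLVM. The function names, the value of β and the random placeholder data are assumptions for illustration; a real fit would maximise this over `X` with a gradient-based optimiser.

```python
import numpy as np

def gplvm_log_likelihood(X, Y, kernel, beta=100.0):
    """log p(Y | X, beta) = sum_d log N(y_{:,d} | 0, K), with K = kernel(X) + I / beta."""
    N, D = Y.shape
    K = kernel(X) + np.eye(N) / beta
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    A = np.linalg.solve(L, Y)            # L^{-1} Y, so sum(A**2) = tr(Y^T K^{-1} Y)
    return -0.5 * (D * N * np.log(2 * np.pi) + D * log_det + np.sum(A ** 2))

def linear_kernel(X):
    """X X^T: the dual probabilistic PCA covariance."""
    return X @ X.T

def rbf_kernel(X, lam=1.0):
    """exp(-lam * ||x_i - x_j||^2): one possible non-linear replacement."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-lam * d2)

# Random placeholder data: 30 points, 2 latent and 4 observed dimensions.
rng = np.random.default_rng(3)
X = rng.standard_normal((30, 2))
Y = rng.standard_normal((30, 4))

ll_linear = gplvm_log_likelihood(X, Y, linear_kernel)
ll_rbf = gplvm_log_likelihood(X, Y, rbf_kernel)
```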
GPLVM example - oil flow data
Figure: PCA (left) and GPLVM (right) on multi-phase oil flow data (Lawrence 2006)
GPLVM shows better separation between oil flow classes (marker shape) than PCA
GPLVM gives uncertainty in the data space. Since this is shared across all features, it can be visualised in the latent space (as pixel intensity)
If we want true uncertainty in the latent space, we need a Bayesian approach to find p(latent|data)
Bayesian GPLVM
Ideally we want to know the uncertainty in the latent factors, p(latent|data). Approaches to inference:
Metropolis-Hastings: requires lots of tweaking but is ‘guaranteed’ for any model
HMC with Stan: fast, requires less tweaking but has less support for arbitrary priors
Variational inference [1]
[1] Titsias, M., & Lawrence, N. (2010). Bayesian Gaussian Process Latent Variable Model. Artificial Intelligence, 9, 844-851.
Buettner 2012
Introduces a ‘structure preserving’ GPLVM for clustering of single-cell qPCR data from zygote to blastocyst development
Includes a ‘prior’ that preserves local structure by modifying the likelihood (previously studied [2])
Finds that the modified GPLVM gives better separation between different developmental stages
[2] van der Maaten, L. (2009). Preserving Local Structure in Gaussian Process Latent Variable Models
Buettner 2015
Use (GP?)-LVM to construct a low-rank cell-to-cell covariance based on the expression of a specific gene pathway
Model
$$y_g \sim \mathcal{N}(\mu_g, X X^T + \sigma_\nu^2 C C^T + \nu_g^2 I)$$
where
$X$ is a hidden factor such as cell cycle
$C$ is an observed covariate
Can then assess gene-gene correlation controlling for hidden factors
Figure: Non-linear PCA of genes not annotated as cell-cycle. Left: before scLVM, right: after.
Bayesian Gaussian Process Latent Variable Models for pseudotime inference
Pseudotime: an artificial measure of a cell’s progression through some process (differentiation, apoptosis, cell cycle)
Cell ordering problem: order high-dimensional transcriptomes through the process
Current approaches
Monocle
ICA for dimensionality reduction, then the longest path through the minimum spanning tree to assign pseudotime
Uses cubic smoothing splines and a likelihood ratio test for differential expression analysis
Standard analysis is to examine differential expression across pseudotime
Questions: What is the uncertainty in pseudotime? How does this impact the false discovery rate of differential expression analysis?
Bayesian GPLVM for pseudotime inference
1. Reduce the dimensionality of the gene expression data (Laplacian eigenmaps (LE), t-SNE, PCA, or all at once!)
2. Fit a Bayesian GPLVM in the reduced space (this is essentially a probabilistic curve)
3. Summarise the posterior samples, quantify uncertainty, etc.
Model
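$$\begin{aligned}
\gamma &\sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta) \\
\lambda_j &\sim \mathrm{Exp}(\gamma) \\
\sigma_j &\sim \mathrm{InvGamma}(\alpha, \beta) \\
t_i &\sim \pi_t, \quad i = 1, \ldots, N, \\
\Sigma &= \mathrm{diag}(\sigma_1^2, \ldots, \sigma_P^2) \\
K^{(j)}(t, t') &= \exp(-\lambda_j (t - t')^2) \\
\mu_j &\sim \mathrm{GP}(0, K^{(j)}), \quad j = 1, \ldots, P, \\
x_i &\sim \mathcal{N}(\mu(t_i), \Sigma), \quad i = 1, \ldots, N.
\end{aligned} \qquad (7)$$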
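One way to read equation (7) is as a recipe for simulating data. The sketch below draws once from the generative model, assuming rate parametrisations for the Gamma and exponential distributions, a uniform $\pi_t$ on $[0, 1]$, and made-up hyperparameter values; none of these choices are confirmed by the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 50, 2                               # cells, reduced dimensions
gamma_alpha, gamma_beta = 1.0, 1.0         # hypothetical hyperparameter values
alpha, beta = 3.0, 1.0                     # hypothetical hyperparameter values

# NumPy uses shape/scale, so a rate r becomes scale = 1 / r.
gamma = rng.gamma(shape=gamma_alpha, scale=1.0 / gamma_beta)
lam = rng.exponential(scale=1.0 / gamma, size=P)                  # lambda_j ~ Exp(gamma)
sigma = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=P)    # sigma_j ~ InvGamma(alpha, beta)
t = rng.uniform(0.0, 1.0, size=N)                                 # assumption: pi_t is uniform

# Draw each mean curve mu_j ~ GP(0, K^(j)) at the sampled pseudotimes.
mu = np.empty((N, P))
for j in range(P):
    K_j = np.exp(-lam[j] * (t[:, None] - t[None, :]) ** 2)
    L = np.linalg.cholesky(K_j + 1e-6 * np.eye(N))                # jitter for stability
    mu[:, j] = L @ rng.standard_normal(N)

# x_i ~ N(mu(t_i), Sigma) with Sigma = diag(sigma_1^2, ..., sigma_P^2).
x = mu + rng.standard_normal((N, P)) * sigma
```

Plotting the columns of `x` against each other, coloured by `t`, gives a noisy curve through the reduced space, which is the ‘probabilistic curve’ the model fits.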
Prior issues
How do we define the prior on $t$, $\pi_t$?
Typically we want $t = (t_1, \ldots, t_N)$ to sit uniformly on $[0, 1]$
$t$ only appears in the likelihood via $\lambda_j (t - t')^2$
This means we can arbitrarily rescale $\lambda \to \lambda / \epsilon$ and $t \to \sqrt{\epsilon}\, t$ and get the same likelihood
$t$ is then equivalent on any subset of $[0, 1]$, so the problem is ill-defined
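The invariance can be checked in one line. As a sketch, write the rescaling factor as $\epsilon > 0$ (the symbol is an assumption; any positive constant works):

```latex
\frac{\lambda_j}{\epsilon}\left(\sqrt{\epsilon}\,t - \sqrt{\epsilon}\,t'\right)^2
  = \frac{\lambda_j}{\epsilon}\,\epsilon\,(t - t')^2
  = \lambda_j (t - t')^2
```

So the kernel, and hence the likelihood, cannot distinguish pseudotimes spread over $[0, 1]$ from pseudotimes compressed into a shorter interval with a correspondingly larger $\lambda_j$.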
Solutions
Corp prior
Want $t$ to ‘fill out’ over $[0, 1]$
Introduce a repulsive prior
$$\pi_t(t) \propto \prod_{i=1}^{N} \prod_{j=i+1}^{N} \sin(\pi |t_i - t_j|) \qquad (8)$$
Non-conjugate and difficult to evaluate the gradient, so Metropolis-Hastings is needed
Constrained random walk inference
If we constrain $t$ to be on $[0, 1]$ and use random walk sampling (MH, HMC), pseudotimes naturally ‘wander’ towards the boundary
Once there, the covariance structure settles them into a steady state
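Both ideas can be sketched together: the unnormalised log of the repulsive prior in equation (8), and a random-walk Metropolis-Hastings update that keeps pseudotimes in $[0, 1]$. This is an illustration under stated assumptions, not the sampler used in the talk: it targets the prior alone (no likelihood term), handles the constraint by reflecting proposals at the boundaries, and uses an arbitrary step size.

```python
import math
import random

def log_repulsive_prior(t):
    """Unnormalised log of equation (8): sum over pairs of log sin(pi * |t_i - t_j|)."""
    total = 0.0
    for i in range(len(t)):
        for j in range(i + 1, len(t)):
            s = math.sin(math.pi * abs(t[i] - t[j]))
            if s <= 0.0:
                return -math.inf          # coincident points have zero density
            total += math.log(s)
    return total

def reflect(u):
    """Reflect a proposed value back into [0, 1]."""
    u = abs(u) % 2.0
    return 2.0 - u if u > 1.0 else u

def mh_step(t, log_target, step=0.05, rng=random):
    """One random-walk Metropolis-Hastings update of a single, randomly chosen t_i."""
    i = rng.randrange(len(t))
    proposal = list(t)
    proposal[i] = reflect(t[i] + rng.gauss(0.0, step))
    new_value = log_target(proposal)
    if new_value == -math.inf:
        return t                          # reject zero-density proposals outright
    log_ratio = new_value - log_target(t)
    # The reflected Gaussian proposal is symmetric, so no Hastings correction is needed.
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return t

random.seed(0)
t = [(i + 0.5) / 10 for i in range(10)]   # start from evenly spaced pseudotimes
for _ in range(2000):
    t = mh_step(t, log_repulsive_prior)
```

To sample pseudotimes given data, `log_target` would be the sum of this log prior and the GP log-likelihood.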
Applications to biological datasets
Applied Bayesian GPLVM to three datasets:
1. Monocle: differentiating human myoblasts (time series), 155 cells once contamination removed
2. Ear: differentiating cells from mouse cochlear and utricular sensory epithelia. Pseudotime shows supporting cells (SC) differentiating into hair cells (HC)
3. Waterfall: adult neurogenesis (PCA captures pseudotime variation)
Sampling posterior curves
A: Monocle dataset, Laplacian eigenmaps representation
B: Ear dataset, Laplacian eigenmaps representation
C: Waterfall dataset, PCA representation
What does the posterior uncertainty look like? (I)
The 95% HPD credible interval typically spans ∼ 1/4 of pseudotime
What does the posterior uncertainty look like? (II)
Effect of hyperparameters (Monocle dataset)
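Recall
$$K(t, t') \propto \exp(-\lambda_j (t - t')^2)$$
$$\lambda_j \sim \mathrm{Exp}(\gamma)$$
$$\gamma \sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta)$$
$|\lambda|$ roughly corresponds to arc-length. So what are the effects of changing $\gamma_\alpha$, $\gamma_\beta$?
$$E[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta}, \quad \mathrm{Var}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta^2}$$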
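To get a feel for these hyperparameters, the sketch below computes the mean and variance of $\gamma$ from the formulas above and simulates the implied prior on $\lambda_j$. It assumes the rate parametrisation implied by $E[\gamma] = \gamma_\alpha / \gamma_\beta$; the function names and the three example settings are hypothetical.

```python
import random
import statistics

def gamma_moments(gamma_alpha, gamma_beta):
    """Mean and variance of Gamma(gamma_alpha, gamma_beta) with gamma_beta a rate."""
    return gamma_alpha / gamma_beta, gamma_alpha / gamma_beta ** 2

def sample_lambda(gamma_alpha, gamma_beta, n=10000, seed=0):
    """Draw lambda ~ Exp(gamma) with gamma ~ Gamma(gamma_alpha, gamma_beta)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        gamma = rng.gammavariate(gamma_alpha, 1.0 / gamma_beta)   # second argument is a scale
        draws.append(rng.expovariate(gamma))                      # argument is the rate
    return draws

for a, b in [(1.0, 1.0), (10.0, 10.0), (10.0, 1.0)]:
    mean, var = gamma_moments(a, b)
    lam = sample_lambda(a, b)
    print(a, b, mean, var, statistics.median(lam))
```

Increasing the mean of $\gamma$ pushes the prior on $\lambda_j$ towards smaller values, since $E[\lambda_j \mid \gamma] = 1/\gamma$.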
Approximate false discovery rate
How to approximate false discovery rate?
Refit differential expression for each gene across posterior samples of pseudotime
Compute p- and q-values for each sample for each gene
Statistic is the proportion significant at 5% FDR
A differentially expressed gene is a false positive if proportion significant < 0.95 and q-value < 0.05
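The procedure above can be sketched as follows. Two assumptions are made that the slide does not spell out: q-values are Benjamini-Hochberg adjusted p-values, and the ‘q-value < 0.05’ in the false-positive rule refers to the analysis on a single point-estimate pseudotime. The function names and toy p-values are illustrative only.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):                  # walk from largest to smallest p-value
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q

def approximate_fdr(point_pvals, sample_pvals, fdr=0.05, min_proportion=0.95):
    """point_pvals: one p-value per gene from the point-estimate pseudotime.
    sample_pvals: one list of per-gene p-values for each posterior sample."""
    n_genes = len(point_pvals)
    sig_counts = [0] * n_genes
    for pvals in sample_pvals:
        for g, q in enumerate(bh_qvalues(pvals)):
            if q < fdr:
                sig_counts[g] += 1
    proportion = [c / len(sample_pvals) for c in sig_counts]
    point_q = bh_qvalues(point_pvals)
    called = [g for g in range(n_genes) if point_q[g] < fdr]
    false_pos = [g for g in called if proportion[g] < min_proportion]
    rate = len(false_pos) / len(called) if called else 0.0
    return rate, proportion

# Toy example: 3 genes, 2 posterior samples of pseudotime.
point_pvals = [0.001, 0.002, 0.8]
sample_pvals = [[0.001, 0.002, 0.9], [0.001, 0.4, 0.7]]
rate, proportion = approximate_fdr(point_pvals, sample_pvals)
```

In the toy example the second gene is called significant on the point estimate but only in half of the posterior samples, so it counts as a false positive.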
Approximate false discovery rates
Approximate false discovery rate can be very high (∼ 3× larger
than it should be) but is also variable
Integrating multiple dimensionality reduction algorithms
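Can very easily integrate multiple sources of data from different dimensionality reduction algorithms:
$$p(t, \{X\}) \propto \pi_t(t)\, p(X_{\mathrm{LE}} \mid t)\, p(X_{\mathrm{PCA}} \mid t)\, p(X_{\mathrm{tSNE}} \mid t) \qquad (9)$$
Natural extension to integrate multiple heterogeneous sources of data, e.g.
$$p(t, \{X\}) \propto \pi_t(t)\, p(\mathrm{imaging} \mid t)\, p(\mathrm{ATAC} \mid t)\, p(\mathrm{transcriptomics} \mid t) \qquad (10)$$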
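On the log scale, the factorisation in equation (9) is a sum of one log-likelihood per representation plus the log prior. The sketch below assumes each column of each representation has a GP marginal likelihood with the squared exponential kernel, and shares one $\lambda$ and one noise variance across all of them for brevity. The function names, the flat prior and the random arrays standing in for real embeddings are all illustrative assumptions.

```python
import numpy as np

def gp_log_marginal(t, x, lam, sigma2):
    """log N(x | 0, K + sigma2 * I) with K(t, t') = exp(-lam * (t - t')^2)."""
    n = t.size
    K = np.exp(-lam * (t[:, None] - t[None, :]) ** 2) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, x)                      # a @ a = x^T K^{-1} x
    return -0.5 * (n * np.log(2 * np.pi) + 2 * np.sum(np.log(np.diag(L))) + a @ a)

def joint_log_posterior(t, representations, log_prior, lam=5.0, sigma2=0.1):
    """log pi_t(t) plus the sum over representations (and their columns) of log p(x | t)."""
    total = log_prior(t)
    for X in representations.values():
        for j in range(X.shape[1]):
            total += gp_log_marginal(t, X[:, j], lam, sigma2)
    return total

# Random placeholders standing in for three 2-D embeddings of the same 20 cells.
rng = np.random.default_rng(5)
t = np.sort(rng.uniform(size=20))
reps = {name: rng.standard_normal((20, 2)) for name in ("LE", "PCA", "tSNE")}

def flat_log_prior(t):
    """Assumption: a flat prior on t, used only to keep the example short."""
    return 0.0

value = joint_log_posterior(t, reps, flat_log_prior)
```

Adding another data source only adds another term to the sum, which is what makes the extension in equation (10) natural.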
Example: Monocle with LE, PCA & t-SNE
Learning curves for each representation separately:
Joint learning of all representations:
FDR from multiple representation learning
Some good references (I)
Gaussian Processes
Rasmussen, Carl Edward, and Christopher K. I. Williams. "Gaussian Processes for Machine Learning." MIT Press (2006).
GPLVM
Lawrence, Neil D. "Gaussian process latent variable models for visualisation of high dimensional data." Advances in neural information processing systems 16.3 (2004): 329-336.
Titsias, Michalis K., and Neil D. Lawrence. "Bayesian Gaussian process latent variable model." International Conference on Artificial Intelligence and Statistics. 2010.
van der Maaten, Laurens. "Preserving local structure in Gaussian process latent variable models." Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning. 2009.
Wang, Ye, and David B. Dunson. "Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process." arXiv preprint arXiv:1506.03768 (2015).
Some good references (II)
Latent variable models in single-cell genomics
Buettner, Florian, and Fabian J. Theis. "A novel approach for resolving differences in single-cell gene expression patterns from zygote to blastocyst." Bioinformatics 28.18 (2012): i626-i632.
Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160.
Pseudotime
Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology 32.4 (2014): 381-386.
Bendall, Sean C., et al. "Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development." Cell 157.3 (2014): 714-725.
Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.
Shin, Jaehoon, et al. "Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis." Cell stem cell 17.3 (2015): 360-372.
Leng, Ning, et al. "Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments." Nature methods 12.10 (2015): 947-950.