Basics of RNNs and their applications, with the following papers:
- Generating Sequences With Recurrent Neural Networks, 2013
- Show and Tell: A Neural Image Caption Generator, 2014
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015
- DenseCap: Fully Convolutional Localization Networks for Dense Captioning, 2015
- Deep Tracking- Seeing Beyond Seeing Using Recurrent Neural Networks, 2016
- Robust Modeling and Prediction in Dynamic Environments Using Recurrent Flow Networks, 2016
- Social LSTM- Human Trajectory Prediction in Crowded Spaces, 2016
- DESIRE- Distant Future Prediction in Dynamic Scenes with Interacting Agents, 2017
- Predictive State Recurrent Neural Networks, 2017
K-means clustering uses an iterative procedure that is highly sensitive to, and dependent upon, the initial centroids. The initial centroids in k-means clustering are chosen randomly, and hence the resulting clustering also changes with the initial centroids. This paper tries to overcome the problem of random centroid selection, and the consequent variation of the clusters, with a premeditated selection of the initial centroids. We use the iris, abalone and wine data sets to demonstrate that the proposed method of finding the initial centroids, and using those centroids in the k-means algorithm, improves the clustering performance. The clustering also remains the same in every run, because the initial centroids are not selected randomly but through the premeditated method.
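As an illustration of the reproducibility claim, the sketch below runs scikit-learn's k-means on the iris data with random versus fixed initial centroids; the fixed centroids here are an arbitrary illustrative choice, not the initialization method proposed in the paper.

```python
# Minimal sketch: k-means on the iris data with random vs. fixed (premeditated)
# initial centroids. The fixed centroids are only an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Random initialization: the solution can change from run to run.
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    print("random seed", seed, "inertia:", round(km.fit(X).inertia_, 2))

# Deterministic initialization: the same centroids give the same clustering
# in every run, which is the reproducibility property the paper targets.
init_centroids = X[[0, 50, 100]]  # one seed point per true class, for illustration
km_fixed = KMeans(n_clusters=3, init=init_centroids, n_init=1)
print("fixed centroids inertia:", round(km_fixed.fit(X).inertia_, 2))
```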
Current approaches to exploring materials and manufacturing (or processing) design spaces in pursuit of new or improved engineered structural materials continue to rely heavily on extensive experimentation, which typically demands inordinate investments of both time and effort. Although tremendous progress has been made in the development and validation of a wide range of simulation toolsets capturing the multiscale phenomena controlling the material properties and performance characteristics of interest to advanced technologies, their systematic insertion into materials innovation efforts has encountered several hurdles. The most common of these are related to (i) the lack of a generalized (applicable to a wide variety of materials classes and phenomena) mathematical framework that allows objective extraction and synergistic integration of high-value materials knowledge (defined from the perspective of producing reliable process-structure-property (PSP) linkages) from all available datasets (including a variety of multiscale experiments and simulations), while accounting for the inherent uncertainty associated with each dataset, (ii) the lack of formal approaches that identify objectively where to invest the next effort (which could be a new experiment or a new simulation) to maximize the likelihood of success (i.e., meeting or exceeding the designer-specified combinations of materials properties) at any step of the innovation effort, and (iii) the lack of experimental techniques that are specifically designed to provide the quality and quantity of information needed to calibrate the large number of material parameters present in most multiscale materials models. This talk will describe ongoing efforts in my research group aimed at addressing the gaps identified above.
https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Time-series forecasting of indoor temperature using pre-trained Deep Neural Networks (Francisco Zamora-Martinez)
Artificial neural networks have proved to be good at time-series forecasting problems and are widely studied in the literature. Traditionally, shallow architectures were used because of convergence problems when dealing with deep models. Recent research findings enable the training of deep architectures, opening an interesting new research area called deep learning. This paper presents a study of deep learning techniques applied to time-series forecasting on a real indoor temperature forecasting task, examining performance under different hyper-parameter configurations. When using deep models, better generalization performance on the test set and a reduction in over-fitting were observed.
These are the slides of my master's defense, 17 April 2003.
subject: "High capacity neural network optimization problems: study & solutions exploration"
In the recent machine learning community there is a trend of constructing nonlinear versions of linear algorithms through the 'kernel method', for example kernel principal component analysis, kernel Fisher discriminant analysis, support vector machines (SVMs), and the recent kernel clustering algorithms. Typically, in unsupervised clustering algorithms that use the kernel method, a nonlinear mapping is first applied to map the data into a much higher-dimensional feature space, and clustering is then performed there. A drawback of these kernel clustering algorithms is that the cluster prototypes reside in the high-dimensional feature space and therefore lack clear, intuitive descriptions unless an additional approximate projection from the feature space back to the data space is performed, as is done in the existing literature. This paper uses the 'kernel method' to derive a novel clustering algorithm, built on the conventional fuzzy c-means algorithm (FCM) and called the kernel fuzzy c-means algorithm (KFCM). KFCM adopts a kernel-induced metric in the data space to replace the original Euclidean norm, so the cluster prototypes still reside in the data space and the clustering results can be interpreted directly in the original space. This property is exploited for clustering incomplete data. Experiments on simulated data illustrate that KFCM achieves better and more robust clustering performance than other variants of FCM for clustering incomplete data.
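The sketch below illustrates a KFCM-style iteration with a Gaussian kernel-induced distance, keeping the prototypes in the original data space; it follows the commonly cited KFCM update equations and is only an approximation of what the paper describes (the handling of incomplete data is omitted).

```python
import numpy as np

def kfcm(X, c, m=2.0, sigma=1.0, n_iter=100, seed=0):
    """Illustrative kernel fuzzy c-means with a Gaussian kernel-induced metric.

    The distance between a point x and a prototype v is taken as
    1 - K(x, v), with K(x, v) = exp(-||x - v||^2 / sigma^2), so the
    prototypes stay in the original data space.
    """
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]  # prototypes in data space
    for _ in range(n_iter):
        # Kernel values K(x_k, v_i), shape (n, c)
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
        K = np.exp(-d2 / sigma**2)
        dist = np.clip(1.0 - K, 1e-12, None)          # kernel-induced distance
        # Membership update (standard FCM form with the kernel-induced metric)
        U = dist ** (-1.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
        # Prototype update: kernel-weighted mean, still in the data space
        W = (U ** m) * K
        V = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, V

# toy usage on two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
U, V = kfcm(X, c=2)
print(np.round(V, 2))
```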
Slides by Albert Jiménez about the paper:
Yang, Jianwei, Devi Parikh, and Dhruv Batra. "Joint Unsupervised Learning of Deep Representations and Image Clusters." CVPR 2016.
- POSTECH EECE695J, "Fundamentals of Deep Learning and Its Application to Steel Manufacturing Processes", 2017-11-10
- Contents: introduction to recurrent neural networks, LSTM, variants of RNN, implementation of RNN, case studies
- Video: https://youtu.be/pgqiEPb4pV8
A NOVEL ANT COLONY ALGORITHM FOR MULTICAST ROUTING IN WIRELESS AD HOC NETWORKS (cscpconf)
The Steiner tree is the underlying model for multicast communication. This paper presents a novel ant colony algorithm, guided by problem relaxation, for the unconstrained Steiner tree problem in static wireless ad hoc networks. The framework of the proposed algorithm is based on the ant colony system (ACS). In the first step, the ants probabilistically construct paths from the source to the terminal nodes. These paths are then merged together to generate a Steiner tree rooted at the source. The problem is relaxed to incorporate structural information into the heuristic value used for the selection of nodes. The effectiveness of the algorithm is tested on the benchmark problems of the OR-Library. Simulation results show that our algorithm can find optimal Steiner trees with a high success rate.
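The following toy sketch conveys the construct-and-merge idea only: each ant builds a probabilistic path from the source to a terminal, the paths are merged into a tree, and cheap trees are reinforced. The graph, pheromone rule, and 1/weight heuristic are placeholders, not the paper's ACS formulation or its relaxation-based heuristic.

```python
import random

def construct_path(graph, pheromone, source, terminal, beta=2.0):
    """One ant walks probabilistically from source to a terminal,
    biased by pheromone and a simple 1/weight heuristic."""
    path, node, visited = [source], source, {source}
    while node != terminal:
        nbrs = [n for n in graph[node] if n not in visited]
        if not nbrs:
            return None  # dead end; a real ACS implementation would backtrack
        weights = [pheromone[(node, n)] * (1.0 / graph[node][n]) ** beta for n in nbrs]
        node = random.choices(nbrs, weights=weights)[0]
        visited.add(node)
        path.append(node)
    return path

def build_multicast_tree(graph, source, terminals, n_ants=20):
    """Merge the per-terminal paths of each ant iteration into an edge set
    rooted at the source -- a crude stand-in for the Steiner tree."""
    pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}
    best_edges, best_cost = None, float("inf")
    for _ in range(n_ants):
        edges, ok = set(), True
        for t in terminals:
            p = construct_path(graph, pheromone, source, t)
            if p is None:
                ok = False
                break
            edges |= {tuple(sorted(e)) for e in zip(p, p[1:])}
        if ok:
            cost = sum(graph[u][v] for u, v in edges)
            if cost < best_cost:
                best_edges, best_cost = edges, cost
            for u, v in edges:  # simple global pheromone reinforcement
                pheromone[(u, v)] += 1.0 / cost
                pheromone[(v, u)] += 1.0 / cost
    return best_edges, best_cost

# toy undirected graph as adjacency dict {node: {neighbor: weight}}
g = {0: {1: 1, 2: 4}, 1: {0: 1, 2: 1, 3: 3}, 2: {0: 4, 1: 1, 3: 1}, 3: {1: 3, 2: 1}}
print(build_multicast_tree(g, source=0, terminals=[3]))
```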
Characterization of Subsurface Heterogeneity: Integration of Soft and Hard Information (Amro Elfeki)
Park, E., Elfeki, A. M. M., Dekking, F.M. (2003). Characterization of subsurface heterogeneity: Integration of soft and hard information using multi-dimensional Coupled Markov chain approach. Underground Injection Science and Technology Symposium, Lawrence Berkeley National Lab., October 22-25, 2003. p.49. Eds. Tsang, Chin.-Fu and Apps, John A.
http://www.lbl.gov/Conferences/UIST/index.html#topics
Segway and the Graphical Models Toolkit: a framework for probabilistic genomi... (Michael Hoffman)
Segway is a widely-used method for performing automated genome annotation. Researchers usually use Segway to discover recurring patterns across multiple epigenomic datasets. Then they use Segway to annotate the genome with labels for these patterns, often called chromatin states. For example, one might learn labels for promoters, enhancers, or quiescent genomic regions from histone modification and open chromatin data.
Segway is implemented using the Graphical Models Toolkit, a flexible system for hidden Markov model and dynamic Bayesian network inference. While the prevailing use of Segway remains unsupervised annotation from epigenome data, it can also perform training, posterior probability estimation, and Viterbi decoding for a wide range of probabilistic models with a recurrent structure on a genomic axis. It can accept a wide variety of models that relate hidden and observed random variables at each genomic position with each other and neighboring positions and perform necessary inference without much additional programming.
We have developed a more configurable interface to Segway to allow its use for more diverse classes of problems, models, and algorithms. We will describe several of the extensions over a simple hidden Markov model, such as semi-Markov state durations, a transcriptome model with a reversed copy of the model for stranded data, a graph-based regularization method for incorporating long-range chromatin interaction data, a semi-supervised training approach, locus-specific prior knowledge, and modeling observations with arbitrary mixtures of Gaussians. We will use some of these examples to describe how you might develop your own models.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects to see demand and the evolution of supply shift as institutional investment rotates out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Gaussian Process Latent Variable Models & applications in single-cell genomics
1. Gaussian Process Latent Variable Models & applications in single-cell genomics
Kieran Campbell
University of Oxford
November 19, 2015
2. Outline
- Introduction to Gaussian Processes
- Gaussian Process Latent Variable Models
- Applications in single-cell genomics
- References
3. Introduction
In (Bayesian) supervised learning some (non-)linear function f(x; w), parametrized by w, is assumed to generate the data \{x_n, y_n\}.
f may take any parametric form, e.g. linear f(x) = w_0 + w_1 x.
Posterior inference can be performed on
p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)} \qquad (1)
Predictions for a new point \{y_*, x_*\} can be made by marginalising over w:
p(y_* \mid y, X, x_*) = \int \mathrm{d}w \; p(y_* \mid w, X, x_*)\, p(w \mid y, X) \qquad (2)
4. Gaussian Process Regression
Gaussian Processes place a non-parametric prior over functions f(x).
f is always indexed by the 'input variable' x.
Any subset of function values \{f_i\}_{i=1}^{N} is jointly drawn from a multivariate Gaussian distribution with zero mean and covariance matrix K:
p(f_1, \ldots, f_N) = \mathcal{N}(0, K) \qquad (3)
In other words, the GP is entirely defined by its second-order statistics K.
5. Choice of Kernel
Behaviour of the GP is defined by the choice of kernel & its parameters.
The kernel function K(x, x') becomes the covariance matrix once a set of points is 'realised'.
A typical choice is the double exponential
K(x, x') = \exp(-\lambda \| x - x' \|^2) \qquad (4)
The intuition is that if x and x' are similar, the covariance will be larger, and so f and f' will, on average, be closer together.
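Not from the slides: a small sketch of this construction, drawing correlated function values from the GP prior by building the kernel matrix over a grid of inputs and sampling from the corresponding multivariate Gaussian.

```python
import numpy as np

def sq_exp_kernel(x1, x2, lam=1.0):
    """Double-exponential kernel K(x, x') = exp(-lam * |x - x'|^2)."""
    return np.exp(-lam * (x1[:, None] - x2[None, :]) ** 2)

x = np.linspace(0, 5, 100)
K = sq_exp_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

# Each sample is a whole function evaluated on the grid, drawn from N(0, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 100): three functions, 100 input points each
```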
6. GPs with noisy observations
So far we have assumed observations of f are noise free - the GP becomes a function interpolator.
Instead, observations y(x) are corrupted by noise, so y \sim \mathcal{N}(f(x), \sigma^2).
Because everything is Gaussian, we can marginalise over the (latent) functions f and find
p(y_1, \ldots, y_N) = \mathcal{N}(0, K + \sigma^2 I) \qquad (5)
7. Predictions with noisy observations
To make predictions with GPs we only need the covariance between the 'old' inputs X and the 'new' input x_*:
Let k_* = K(X, x_*) and k_{**} = K(x_*, x_*). Then
p(f_* \mid x_*, X, y) = \mathcal{N}\!\left(f_* \mid k_*^{\top} K^{-1} y,\; k_{**} - k_*^{\top} K^{-1} k_*\right) \qquad (6)
This highlights the major disadvantage of GPs - to make predictions we need to invert an n \times n matrix, which is O(n^3).
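A minimal numpy sketch of equation (6) for noisy observations (not from the slides; it uses K + σ²I in place of K, as on the previous slide, and Cholesky solves instead of an explicit inverse, though the cost is still cubic in n).

```python
import numpy as np

def gp_predict(x_train, y_train, x_test, lam=1.0, noise_var=0.1):
    """Posterior mean and variance of a GP with the double-exponential kernel."""
    k = lambda a, b: np.exp(-lam * (a[:, None] - b[None, :]) ** 2)
    K = k(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = k(x_train, x_test)           # K(X, x_*)
    k_ss = k(x_test, x_test)              # K(x_*, x_*)

    L = np.linalg.cholesky(K)             # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    v = np.linalg.solve(L, k_star)

    mean = k_star.T @ alpha               # k_*^T K^{-1} y
    cov = k_ss - v.T @ v                  # k_** - k_*^T K^{-1} k_*
    return mean, np.diag(cov)

x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.1 * np.random.randn(20)
mu, var = gp_predict(x, y, np.linspace(0, 5, 50))
print(mu.shape, var.shape)  # (50,) (50,)
```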
8. Effect of RBF kernel parameters
Kernel:
\kappa(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{1}{2 l^2}(x_p - x_q)^2\right) + \sigma_y^2 \delta_{pq}
Parameters:
- l controls the horizontal length scale
- \sigma_f controls the vertical length scale
- \sigma_y controls the noise variance
In the figure, (l, \sigma_f, \sigma_y) take the values
(a) (1, 1, 0.1)
(b) (0.3, 1.08, 0.00005)
(c) (3.0, 1.16, 0.89)
Figure: Rasmussen and Williams 2006.
9. Dimensionality reduction & unsupervised learning
Dimensionality reduction: we want to reduce some observed data Y \in \mathbb{R}^{N \times D} to a set of latent variables X \in \mathbb{R}^{N \times Q} where Q \ll D.
Methods:
- Linear: PCA, ICA
- Non-linear: Laplacian eigenmaps, MDS, etc.
- Probabilistic: PPCA, GPLVM
10. Probabilistic PCA (Tipping and Bishop, 1999)
Recall Y is the observed data matrix and X the latent matrix. Then assume
y_n = W x_n + \eta_n
where
- W is the linear relationship between latent space and data space
- \eta_n is Gaussian noise with mean 0 and precision \beta
Then marginalise out X to find
p(y_n \mid W, \beta) = \mathcal{N}(y_n \mid 0, W W^{\top} + \beta^{-1} I)
The analytic solution is obtained when W spans the principal subspace - probabilistic PCA.
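A tiny numerical check of this marginalisation (a sketch only, not from the slides): simulate y_n = W x_n + η_n and compare the sample covariance of y against W Wᵀ + β⁻¹ I.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Q, beta = 20000, 4, 2, 4.0
W = rng.normal(size=(D, Q))

X = rng.normal(size=(N, Q))                       # latent variables x_n ~ N(0, I)
noise = rng.normal(scale=np.sqrt(1.0 / beta), size=(N, D))
Y = X @ W.T + noise                               # y_n = W x_n + eta_n

# Marginal covariance of y_n after integrating out x_n: W W^T + beta^{-1} I
theoretical = W @ W.T + (1.0 / beta) * np.eye(D)
empirical = np.cov(Y, rowvar=False)
print(np.round(theoretical - empirical, 2))       # close to the zero matrix
```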
11. GPLVM (Lawrence 2005)
Alternative representation (dual probabilistic PCA):
Instead of marginalising the latent factors X, marginalise the mapping W. Let p(W) = \prod_i \mathcal{N}(w_i \mid 0, I); then
p(y_{:,d} \mid X, \beta) = \mathcal{N}(y_{:,d} \mid 0, X X^{\top} + \beta^{-1} I)
GPLVM:
Lawrence's breakthrough was to realise that the covariance matrix
K = X X^{\top} + \beta^{-1} I
can be replaced by any similarity (kernel) matrix S, as in the GP framework.
The GP-LVM defines a mapping from the latent space to the observed space.
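A rough sketch (not from the slides) of the quantity being optimised, assuming an RBF kernel over the latent points; fitting the GPLVM as in Lawrence 2005 would maximise this marginal likelihood with respect to X and the kernel parameters, which is not shown here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gplvm_log_likelihood(X_latent, Y, beta=10.0, lam=1.0):
    """Sum over output dimensions d of log N(y_:,d | 0, K_X + beta^{-1} I),
    where K_X is an RBF kernel over the latent coordinates X_latent."""
    d2 = ((X_latent[:, None, :] - X_latent[None, :, :]) ** 2).sum(-1)
    K = np.exp(-lam * d2) + (1.0 / beta) * np.eye(len(X_latent))
    return sum(
        multivariate_normal.logpdf(Y[:, d], mean=np.zeros(len(Y)), cov=K)
        for d in range(Y.shape[1])
    )

# toy usage: N=30 points, D=5 observed dims, Q=2 latent dims
rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 5))
X0 = rng.normal(size=(30, 2))     # latent coordinates to be optimised in a real fit
print(gplvm_log_likelihood(X0, Y))
```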
12. GPLVM example - oil flow data
Figure: PCA (left) and GPLVM (right) on multi-phase oil flow data (Lawrence 2006).
The GPLVM shows better separation between oil flow classes (shape) compared to PCA.
The GPLVM gives uncertainty in the data space. Since this is shared across all features, it can be visualised in the latent space (pixel intensity).
If we want true uncertainty in the latent space we need a Bayesian approach to find p(latent | data).
13. Bayesian GPLVM
Ideally we want to know the uncertainty in the latent factors, p(latent | data). Approaches to inference:
- Metropolis-Hastings: requires lots of tweaking but is 'guaranteed' for any model
- HMC with Stan: fast, requires less tweaking, but less support for arbitrary priors
- Variational inference [1]
[1] Titsias, M., & Lawrence, N. (2010). Bayesian Gaussian Process Latent Variable Model. Artificial Intelligence and Statistics, 9, 844-851.
14. Buettner 2012
Introduces a 'structure preserving' GPLVM for clustering of single-cell qPCR data from zygote to blastocyst development.
Includes a 'prior' that preserves local structure by modifying the likelihood (previously studied [2]).
Finds that the modified GPLVM gives better separation between the different developmental stages.
[2] van der Maaten, L. (2009). Preserving Local Structure in Gaussian Process Latent Variable Models.
15. Buettner 2015
Uses a (GP?)-LVM to construct a low-rank cell-to-cell covariance based on the expression of a specific gene pathway.
Model:
y_g \sim \mathcal{N}(\mu_g, \; X X^{\top} + \sigma_{\nu}^2 C C^{\top} + \nu_g^2 I)
where
- X is a hidden factor such as cell cycle
- C is an observed covariate
We can then assess gene-gene correlations while controlling for hidden factors.
Figure: non-linear PCA of genes not annotated as cell-cycle. Left: before scLVM, right: after.
16. Bayesian Gaussian Process Latent Variable Models for pseudotime inference
Pseudotime: an artificial measure of a cell's progression through some process (differentiation, apoptosis, cell cycle).
Cell ordering problem: order high-dimensional transcriptomes through the process.
17. Current approaches
Monocle:
- ICA for dimensionality reduction, longest path through a minimum spanning tree to assign pseudotime
- Uses cubic smoothing splines & a likelihood ratio test for differential expression analysis
The standard analysis is to examine differential expression across pseudotime.
Questions: What is the uncertainty in pseudotime? How does this impact the false discovery rate of differential expression analysis?
18. Bayesian GPLVM for pseudotime inference
1. Reduce dimensionality of the gene expression data (LE, t-SNE, PCA, or all at once!)
2. Fit a Bayesian GPLVM in the reduced space (this is essentially a probabilistic curve)
3. Quantify posterior samples, uncertainty, etc.
19. Model
\gamma \sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta)
\lambda_j \sim \mathrm{Exp}(\gamma)
\sigma_j \sim \mathrm{InvGamma}(\alpha, \beta)
t_i \sim \pi_t, \quad i = 1, \ldots, N
\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_P^2)
K^{(j)}(t, t') = \exp(-\lambda_j (t - t')^2)
\mu_j \sim \mathrm{GP}(0, K^{(j)}), \quad j = 1, \ldots, P
x_i \sim \mathcal{N}(\mu(t_i), \Sigma), \quad i = 1, \ldots, N \qquad (7)
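Not from the slides: a forward simulation of the generative model in (7), assuming a Uniform(0, 1) prior π_t on the pseudotimes and rate parameterisations for the Gamma and Exponential priors.

```python
import numpy as np

def simulate_pseudotime_model(N=40, P=3, alpha=2.0, beta=1.0,
                              gamma_a=2.0, gamma_b=1.0, seed=0):
    """Forward simulation of model (7); pi_t is assumed Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    gamma = rng.gamma(shape=gamma_a, scale=1.0 / gamma_b)       # gamma ~ Gamma(a, b), rate b
    lam = rng.exponential(scale=1.0 / gamma, size=P)            # lambda_j ~ Exp(gamma)
    sigma2 = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=P)  # sigma_j^2 ~ InvGamma
    t = rng.uniform(0.0, 1.0, size=N)                           # t_i ~ pi_t (assumed uniform)

    X = np.empty((N, P))
    for j in range(P):
        K = np.exp(-lam[j] * (t[:, None] - t[None, :]) ** 2) + 1e-8 * np.eye(N)
        mu_j = rng.multivariate_normal(np.zeros(N), K)          # mu_j ~ GP(0, K^(j))
        X[:, j] = mu_j + rng.normal(0.0, np.sqrt(sigma2[j]), size=N)  # x_i ~ N(mu(t_i), Sigma)
    return t, X

t, X = simulate_pseudotime_model()
print(t.shape, X.shape)  # (40,) (40, 3)
```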
20. Prior issues
How do we define the prior on t, \pi_t?
Typically we want t = (t_1, \ldots, t_n) to sit uniformly on [0, 1].
t only appears in the likelihood via \lambda_j (t - t')^2.
This means we can arbitrarily rescale, e.g. \lambda \to c\lambda and t \to t / \sqrt{c}, and get the same likelihood.
t is equivalent on any subset of [0, 1] - an ill-defined problem.
21. Solutions
Corp prior:
- Want t to 'fill out' over [0, 1]
- Introduce a repulsive prior
\pi_t(t) \propto \prod_{i=1}^{N} \prod_{j=i+1}^{N} \sin(\pi |t_i - t_j|) \qquad (8)
- Non-conjugate & difficult to evaluate the gradient - need Metropolis-Hastings
Constrained random walk inference:
- If we constrain t to lie on [0, 1] and use random walk sampling (MH, HMC), the pseudotimes naturally 'wander' towards the boundary
- Once there, the covariance structure settles them into a steady state
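A short sketch (not from the slides) of the log-density of the repulsive prior in (8), up to the normalising constant, which is all a Metropolis-Hastings sampler needs.

```python
import numpy as np

def log_repulsive_prior(t, eps=1e-12):
    """Unnormalised log pi_t(t) = sum_{i<j} log sin(pi * |t_i - t_j|),
    defined for pseudotimes t in (0, 1)."""
    diffs = np.abs(t[:, None] - t[None, :])
    iu = np.triu_indices(len(t), k=1)              # pairs with i < j
    return np.sum(np.log(np.sin(np.pi * diffs[iu]) + eps))

t_spread = np.array([0.1, 0.4, 0.6, 0.9])
t_clumped = np.array([0.48, 0.49, 0.50, 0.51])
# The prior favours pseudotimes that spread out over [0, 1]:
print(log_repulsive_prior(t_spread) > log_repulsive_prior(t_clumped))  # True
```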
22. Applications to biological datasets
We applied the Bayesian GPLVM to three datasets:
1. Monocle: differentiating human myoblasts (time series) - 155 cells once contamination is removed
2. Ear: differentiating cells from mouse cochlear & utricular sensory epithelia. Pseudotime shows supporting cells (SC) differentiating into hair cells (HC)
3. Waterfall: adult neurogenesis (PCA captures the pseudotime variation)
23. Sampling posterior curves
Figure panels: (A) Monocle dataset, Laplacian eigenmaps representation; (B) Ear dataset, Laplacian eigenmaps representation; (C) Waterfall dataset, PCA representation.
24. What does the posterior uncertainty look like? (I)
The 95% HPD credible interval typically spans ~1/4 of the pseudotime range.
25. What does the posterior uncertainty look like? (II)
26. Effect of hyperparameters (Monocle dataset)
Recall
K(t, t') \propto \exp(-\lambda_j (t - t')^2)
\lambda_j \sim \mathrm{Exp}(\gamma)
\gamma \sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta)
|\lambda| roughly corresponds to arc-length. So what are the effects of changing \gamma_\alpha, \gamma_\beta?
\mathbb{E}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta}, \qquad \mathrm{Var}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta^2}
27. Approximate false discovery rate
How to approximate the false discovery rate?
- Refit differential expression for each gene across posterior samples of pseudotime
- Compute p- and q-values for each sample for each gene
- The statistic is the proportion significant at 5% FDR
- A differential expression call is a false positive if the proportion significant is < 0.95 and the q-value is < 0.05
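A schematic of this bookkeeping (not from the slides): the differential-expression fit itself is stubbed out with a placeholder `de_q_values` function; in the real analysis this would be the spline/likelihood-ratio test, which is not shown.

```python
import numpy as np

def de_q_values(expression, pseudotime):
    """Placeholder for a differential-expression test along pseudotime.
    Returns one q-value per gene; here just random numbers for illustration."""
    rng = np.random.default_rng(abs(hash(pseudotime.tobytes())) % (2**32))
    return rng.uniform(size=expression.shape[1])

def approximate_false_positives(expression, pseudotime_samples, point_estimate,
                                q_thresh=0.05, prop_thresh=0.95):
    # q-values at the point-estimate pseudotime (the 'standard' analysis)
    q_point = de_q_values(expression, point_estimate)
    # refit across posterior pseudotime samples and record significance
    sig = np.array([de_q_values(expression, t) < q_thresh for t in pseudotime_samples])
    prop_significant = sig.mean(axis=0)            # per-gene proportion significant
    # a call is a false positive if it looks significant at the point estimate
    # but is not consistently significant across the posterior
    false_pos = (q_point < q_thresh) & (prop_significant < prop_thresh)
    return false_pos, prop_significant

expr = np.random.randn(100, 500)                   # 100 cells x 500 genes (toy data)
samples = [np.random.rand(100) for _ in range(50)] # 50 posterior pseudotime draws
fp, prop = approximate_false_positives(expr, samples, np.random.rand(100))
print("approximate false positives:", fp.sum())
```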
28. Approximate false discovery rates
The approximate false discovery rate can be very high (~3x larger than it should be) but is also variable.
29. Integrating multiple dimensionality reduction algorithms
We can very easily integrate multiple sources of data from different dimensionality reduction algorithms:
p(t, \{X\}) \propto \pi_t(t)\, p(X_{\mathrm{LE}} \mid t)\, p(X_{\mathrm{PCA}} \mid t)\, p(X_{\mathrm{tSNE}} \mid t) \qquad (9)
A natural extension is to integrate multiple heterogeneous sources of data, e.g.
p(t, \{X\}) \propto \pi_t(t)\, p(\mathrm{imaging} \mid t)\, p(\mathrm{ATAC} \mid t)\, p(\mathrm{transcriptomics} \mid t) \qquad (10)
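Not from the slides: a sketch of the joint unnormalised log-posterior in (9), with one GP-style likelihood term per low-dimensional representation; the kernel, noise level, and flat prior here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rep_log_likelihood(X_rep, t, lam=1.0, noise_var=0.1):
    """log p(X_rep | t): independent GP likelihood over pseudotime for each
    dimension of one low-dimensional representation (LE, PCA, t-SNE, ...)."""
    K = np.exp(-lam * (t[:, None] - t[None, :]) ** 2) + noise_var * np.eye(len(t))
    return sum(multivariate_normal.logpdf(X_rep[:, d], np.zeros(len(t)), K)
               for d in range(X_rep.shape[1]))

def joint_log_posterior(t, representations, log_prior):
    """Unnormalised log p(t, {X}) as in (9): prior plus one term per representation."""
    return log_prior(t) + sum(rep_log_likelihood(X, t) for X in representations)

# toy usage with three 2-D representations of 30 cells and a flat prior on [0, 1]
rng = np.random.default_rng(1)
reps = [rng.normal(size=(30, 2)) for _ in range(3)]
flat_prior = lambda t: 0.0 if np.all((t >= 0) & (t <= 1)) else -np.inf
print(joint_log_posterior(rng.uniform(size=30), reps, flat_prior))
```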
30. Example: Monocle with LE, PCA & t-SNE
Learning curves for each representation separately:
Joint learning of all representations:
31. FDR from multiple representation learning
32. Some good references (I)
Gaussian Processes
- Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).
GPLVM
- Lawrence, Neil D. "Gaussian process latent variable models for visualisation of high dimensional data." Advances in Neural Information Processing Systems 16.3 (2004): 329-336.
- Titsias, Michalis K., and Neil D. Lawrence. "Bayesian Gaussian process latent variable model." International Conference on Artificial Intelligence and Statistics. 2010.
- van der Maaten, Laurens. "Preserving local structure in Gaussian process latent variable models." Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning. 2009.
- Wang, Ye, and David B. Dunson. "Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process." arXiv preprint arXiv:1506.03768 (2015).
33. Some good references (II)
Latent variable models in single-cell genomics
- Buettner, Florian, and Fabian J. Theis. "A novel approach for resolving differences in single-cell gene expression patterns from zygote to blastocyst." Bioinformatics 28.18 (2012): i626-i632.
- Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature Biotechnology 33.2 (2015): 155-160.
Pseudotime
- Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature Biotechnology 32.4 (2014): 381-386.
- Bendall, Sean C., et al. "Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development." Cell 157.3 (2014): 714-725.
- Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.
- Shin, Jaehoon, et al. "Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis." Cell Stem Cell 17.3 (2015): 360-372.
- Leng, Ning, et al. "Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments." Nature Methods 12.10 (2015): 947-950.