Deep Learning Opening Workshop - Domain Adaptation Challenges in Genomics: a deep learning take on medical pathology - Bianca Dumitrascu, August 13, 2019
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates the shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt)
1. (Deep) learning from multi-view
data: beyond ImageNet and CelebA
Bianca Dumitrascu
SAMSI/Duke
Deep Learning Workshop
08/13/2019
2. Talk Goals
introduce a clinically relevant question
provide a temporary solution
(joint work with Greg Gundersen, Jordan Ash, Barbara E. Engelhardt)
suggest paths to (statistical) improvement
3. Modern genetics allows the collection of diverse types of data.
4. Each data type and data type combination comes with its own
set of computational challenges and opportunities (both
shallow and deep).
5. Simple additive models can relate SNPs and gene expression
with the goal of explaining variation in phenotypic differences.
Y = Xβ + ε,   Y ∈ R^(n×G),  X ∈ {0, 1, 2}^(n×p),  ε ∼ N(0, σ f(X))
Statistical concepts: linear regression, false discovery rate,
heteroskedasticity, Bayes factors
Statistical tests for detecting variance effects in quantitative trait studies. BD, G. Darnell, J. Ayroles, and B. E Engelhardt.
In Bioinformatics, 2018
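As a concrete (toy) illustration of this additive model, the sketch below fits the per-gene regression Y = Xβ + ε with ordinary least squares; the data, dimensions, and use of numpy are assumptions for the example, not part of the original analysis.

```python
import numpy as np

# Illustrative dimensions: n samples, p SNPs, G genes (hypothetical toy data)
rng = np.random.default_rng(0)
n, p, G = 200, 5, 3
X = rng.integers(0, 3, size=(n, p)).astype(float)   # genotypes coded in {0, 1, 2}
Y = rng.normal(size=(n, G))                          # gene expression matrix, n x G

# Add an intercept column and solve the least-squares problem jointly for all genes
X1 = np.hstack([np.ones((n, 1)), X])
beta_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)    # (p+1) x G coefficient matrix
residuals = Y - X1 @ beta_hat                        # inspect for heteroskedastic (variance) effects
print(beta_hat.shape, residuals.var(axis=0))
```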
6. Modelling gene expression can determine differences in cell
types (single cell) or can uncover information about
transcriptional programs and pathway dynamics.
Statistical concepts: embeddings, visualization, spectral
clustering, regularization, convex relaxation
netNMF-sc: Leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression
analysis. R. Elyanow, BD, B. E Engelhardt, B. J Raphael. bioRxiv, 2019
7. Statistical differences at macro levels: in medical health record
data, modeling covariate information can formalize
interventional challenges regarding medication and diagnostics.
Statistical concepts: Gaussian processes, hierarchical priors,
latent force models
Sparse Multi-Output Gaussian Processes for Medical Time Series Prediction. Li-Fang Cheng, Gregory Darnell, BD, Corey
Chivers, Michael E Draugelis, Kai Li, and Barbara E Engelhardt. arXiv preprint.
Causal Convolutional Gaussian Processes for Modeling Personalized Dynamics of Clinical Treatments. Li-Fang Cheng, BD,
and Barbara E Engelhardt. in preparation.
8. Focus: genomics and histology
Improvements in technology, data storage, and financial incentives
have led to a focus on medical applications involving imaging
tasks.
9. Example: Cancer prediction from histology slides
Google Brain: An augmented reality microscope with real-time
artificial intelligence integration for cancer diagnosis. Po-Hsuan
Cameron Chen et al., Nature Medicine 2019.
10. Example: Cancer prediction from histology slides
a straightforward, yet challenging, engineering approach
required a large number of samples manually annotated by
experts (prior human knowledge of features)
11. Mechanism: The scientist’s dream
A few notes:
performing well on prediction tasks (i.e. for cancer
classification & detection) is very important
detecting structures that are associated with cancer (by
experts) is very important
detecting what drives the formation of such structures is
difficult
is there a genetic basis to these structures?
12. Relating gene expression to morphology
Data: paired gene expression and histology slides images from
the GTEx project.
End-to-end Training of Deep Probabilistic CCA on Paired
Biomedical Observations. (UAI 2019) Gregory Gundersen,
Bianca Dumitrascu, Jordan T. Ash, Barbara E. Engelhardt
13. Relating gene expression to morphology
Data: paired gene expression and histology slides images from
the GTEx project.
GTEx Consortium. Genetic effects on gene expression across
human tissues. Nature, 2017.
14. Scientific goals
Collecting images is cheaper than collecting gene
expression: can we predict images from gene expression?
(reconstruction)
Understand the effect of genetic variation at both molecular
and morphological levels! (good performance on
downstream tasks)
Understand what group of genes might give rise to
observable morphology! (interpretability)
17. Canonical correlation analysis
Given X1 ∈ R^(n×d1) and X2 ∈ R^(n×d2), find Λ*_1 ∈ R^(d1) and Λ*_2 ∈ R^(d2), such
that:
Λ*_1, Λ*_2 = argmax_{Λ1, Λ2} corr(X1 Λ1, X2 Λ2)
H. Hotelling. Relations between two sets of variates.
Biometrika, 1936.
F.R. Bach and M.I. Jordan. A probabilistic interpretation of
canonical correlation analysis. 2005. (Bayesian, solved
through EM)
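For concreteness, here is a minimal CCA sketch (not from the talk) using scikit-learn on synthetic paired views; the data-generating process and dimensions below are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Synthetic paired views X1 (n x d1) and X2 (n x d2) sharing one latent signal
rng = np.random.default_rng(0)
n, d1, d2 = 500, 10, 8
z = rng.normal(size=(n, 1))                        # shared latent variable
X1 = z @ rng.normal(size=(1, d1)) + 0.1 * rng.normal(size=(n, d1))
X2 = z @ rng.normal(size=(1, d2)) + 0.1 * rng.normal(size=(n, d2))

# Find projections maximizing corr(X1 @ w1, X2 @ w2)
cca = CCA(n_components=1)
U, V = cca.fit_transform(X1, X2)
print(np.corrcoef(U[:, 0], V[:, 0])[0, 1])         # close to 1 for this toy example
```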
18. Two related solutions: Image CCA & DeepPCCA
ImageCCA: Ash, Darnell, Munro, Engelhardt, Joint analysis of gene expression levels and histological images
identifies genes associated with tissue morphology, biorxiv 2018
19. (deep)PCCA
Input: paired data (x1^i, x2^i), for i = 1, ..., n
Ingredients: Variational loss + Factor Analysis (Probabilistic CCA)+
Domain Knowledge (linear map for gene expression,
convolutional neural networks for image features);
z ∼ N(0; I)
z1, z2 ∼ N(0; I)
x1 ∼ N(zΛ1 + z1B1; Ψ1)
x2 ∼ N(fθ(zΛ2 + z2B2); Ψ2),
where fθ is a convolutional neural network (the same architecture
used for ImageNet classification).
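A minimal sampling sketch of the generative model above, assuming toy dimensions and a small stand-in network for fθ (in the actual model fθ is a convolutional decoder); names and noise scales here are illustrative.

```python
import torch

# Toy dimensions: shared latent k, view-specific latents k1, k2, observations d1, d2
n, k, k1, k2, d1, d2 = 4, 5, 3, 3, 20, 30
Lam1, B1 = torch.randn(k, d1), torch.randn(k1, d1)
Lam2, B2 = torch.randn(k, d2), torch.randn(k2, d2)
f_theta = torch.nn.Sequential(torch.nn.Linear(d2, d2), torch.nn.ReLU(),
                              torch.nn.Linear(d2, d2))   # stand-in for the CNN decoder

z = torch.randn(n, k)                                    # shared latent: z ~ N(0, I)
z1, z2 = torch.randn(n, k1), torch.randn(n, k2)          # view-specific latents
x1 = z @ Lam1 + z1 @ B1 + 0.1 * torch.randn(n, d1)           # x1 ~ N(z Λ1 + z1 B1, Ψ1)
x2 = f_theta(z @ Lam2 + z2 @ B2) + 0.1 * torch.randn(n, d2)  # x2 ~ N(fθ(z Λ2 + z2 B2), Ψ2)
```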
20. GTEx training
Deep probabilistic CCA fits PCCA embeddings of two
variational autoencoders, training the model end-to-end
with backpropagation
The embeddings are inputs to a PCCA module whose outputs
are latent embeddings z1, z2, zc ∼ N(0_k, I_k), with
y_j ∼ N(z_j Λ_j + z_c Λ_jc, Ψ_j)
Obtain Λ∗, Ψ∗ via EM parameter updates
Backprop through the ℓ1-penalized reconstruction loss
L = (1/n) Σ_{i=1}^n ||Dec(x1)_i − x1_i||_2^2 + (1/n) Σ_{i=1}^n ||Dec(x2)_i − x2_i||_2^2
+ γ1 ||Λ1||_1 + γ2 ||Λ2||_1 + γ3 ||θ_dec||_1
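A hedged PyTorch sketch of this penalized reconstruction loss; function and argument names (dec_params, x1_rec, etc.) are placeholders rather than the paper's code, and the γ values are arbitrary.

```python
import torch

def dpcca_loss(x1, x2, x1_rec, x2_rec, Lam1, Lam2, dec_params,
               gamma1=1e-3, gamma2=1e-3, gamma3=1e-3):
    """Mean squared reconstruction error for both views plus L1 penalties
    on the PCCA loadings and the decoder parameters (illustrative sketch)."""
    rec1 = ((x1_rec - x1) ** 2).sum(dim=1).mean()
    rec2 = ((x2_rec - x2) ** 2).sum(dim=1).mean()
    l1 = gamma1 * Lam1.abs().sum() + gamma2 * Lam2.abs().sum()
    l1 = l1 + gamma3 * sum(p.abs().sum() for p in dec_params)
    return rec1 + rec2 + l1
```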
21. GTEx training
z ∼ N(0; I)
z1, z2 ∼ N(0; I)
x1 ∼ N(zΛ1 + z1B1; Ψ1)
x2 ∼ N(fθ(zΛ2 + z2B2); Ψ2),
A. Klami, S. Virtanen, and S. Kaski. Bayesian canonical
correlation analysis. JMLR, 2013.
D.P. Kingma and M. Welling. Auto-encoding variational Bayes.
arXiv preprint, 2013.
22. Toy Data
Input: paired vectors (x1, x2), where x1 represents image features
from images of the digits 0, 1, and 2, and x2 is sampled from multivariate
normal distributions with different means (not pictured below).
24. GTEx: Latent Space Organization
Latent joint image embeddings using DPCCA (left) show structural
organization, whereas clustering of images based on image-specific
embeddings (right) does not.
31. Necessary Modelling Extensions: Interpretability!
introduce adversarial training (avoid information bottleneck)
introduce sparsity on gene expression (horseshoe, spike and
slab)
extract features that are most responsible for the embedding
[e.g., adapt Deep Learning for Case-Based Reasoning through
Prototypes: A Neural Network that Explains Its Predictions,
Oscar Li, Hao Liu, Chaofan Chen, Cynthia Rudin]
32. A parting problem
Scenario where samples are not aligned (matched):
from multi-view to domain adaptation.
Requires optimizing over a large discrete structure Π
(permutation matrix) or a different modelling formulation.
z ∼ N(0; I)
z1, z2 ∼ N(0; I)
x1 ∼ N(ΠzΛ1 + Πz1B1; Ψ1)
x2 ∼ N(fθ(zΛ2 + z2B2); Ψ2),
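One possible (illustrative, not from the talk) way to approach the alignment problem is to alternate model fitting with a discrete matching step: estimate Π by solving a linear assignment between the two views' latent embeddings. The sketch below uses scipy's Hungarian solver on made-up embeddings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical latent embeddings of the two (unaligned) views, n samples each
rng = np.random.default_rng(0)
z_view1 = rng.normal(size=(50, 5))
perm_true = rng.permutation(50)
z_view2 = z_view1[perm_true] + 0.05 * rng.normal(size=(50, 5))

# Estimate the permutation Π by minimizing total pairwise distance (Hungarian algorithm)
cost = cdist(z_view1, z_view2)                 # n x n cost matrix
row, col = linear_sum_assignment(cost)         # col[i] is the estimated match for sample i
print((col == np.argsort(perm_true)).mean())   # fraction of correctly recovered matches
```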