Deep Learning Opening Workshop - Domain Adaptation Challenges in Genomics: a deep learning take on medical pathology - Bianca Dumitrascu, August 13, 2019
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates the shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt)
1. (Deep) learning from multi-view
data: beyond ImageNet and CelebA
Bianca Dumitrascu
SAMSI/Duke
Deep Learning Workshop
08/13/2019
2. Talk Goals
introduce a clinically relevant question
provide a temporary solution
(joint work with Greg Gundersen, Jordan Ash, Barbara E. Engelhardt)
suggest paths to (statistical) improvement
3. Modern genetics allows the collection of diverse types of data.
4. Each data type and data type combination comes with its own
set of computational challenges and opportunities (both
shallow and deep).
5. Simple additive models can relate SNPs and gene expression
with the goal of explaining variation in phenotypic differences.
Y = Xβ + ε,   Y ∈ R^(n×G),  X ∈ {0, 1, 2}^(n×p),  ε ∼ N(0, σ f(X))
Statistical concepts: linear regression, false discovery rate,
heteroskedasticity, Bayes factors
Statistical tests for detecting variance effects in quantitative trait studies. BD, G. Darnell, J. Ayroles, and B. E Engelhardt.
In Bioinformatics, 2018
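As a concrete (toy) illustration of this additive model, the sketch below fits the per-gene regression Y = Xβ + ε with ordinary least squares; the data, dimensions, and use of numpy are assumptions for the example, not part of the original analysis.

```python
import numpy as np

# Illustrative dimensions: n samples, p SNPs, G genes (hypothetical toy data)
rng = np.random.default_rng(0)
n, p, G = 200, 5, 3
X = rng.integers(0, 3, size=(n, p)).astype(float)   # genotypes coded in {0, 1, 2}
Y = rng.normal(size=(n, G))                          # gene expression matrix, n x G

# Add an intercept column and solve the least-squares problem jointly for all genes
X1 = np.hstack([np.ones((n, 1)), X])
beta_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)    # (p+1) x G coefficient matrix
residuals = Y - X1 @ beta_hat                        # inspect for heteroskedastic (variance) effects
print(beta_hat.shape, residuals.var(axis=0))
```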
6. Modelling gene expression can determine differences in cell
types (single cell) or can uncover information about
transcriptional programs and pathway dynamics.
Statistical concepts: embeddings, visualization, spectral
clustering, regularization, convex relaxation
netNMF-sc: Leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression
analysis. R. Elyanow, BD, B. E Engelhardt, B. J Raphael. bioRxiv, 2019
7. Statistical differences at macro levels: in medical health record
data, modeling covariate information can formalize
interventional challenges regarding medication and diagnostics.
Statistical concepts: Gaussian processes, hierarchical priors,
latent force models
Sparse Multi-Output Gaussian Processes for Medical Time Series Prediction. Li-Fang Cheng, Gregory Darnell, BD, Corey
Chivers, Michael E Draugelis, Kai Li, and Barbara E Engelhardt. arXiv preprint.
Causal Convolutional Gaussian Processes for Modeling Personalized Dynamics of Clinical Treatments. Li-Fang Cheng, BD,
and Barbara E Engelhardt. in preparation.
8. Focus: genomics and histology
Improvements in technology, data storage, and financial incentives
have led to a focus on medical applications involving imaging
tasks.
9. Example: Cancer prediction from histology slides
Google Brain: An augmented reality microscope with real-time
artificial intelligence integration for cancer diagnosis. Po-Hsuan
Cameron Chen et al., Nature Medicine 2019.
10. Example: Cancer prediction from histology slides
a straightforward, yet challenging, engineering approach
required a large number of samples manually annotated by
experts (prior human knowledge of features)
11. Mechanism: The scientist’s dream
A few notes:
performing well on prediction tasks (i.e. for cancer
classification & detection) is very important
detecting structures that are associated with cancer (by
experts) is very important
detecting what drives the formation of such structures is
difficult
is there a genetic basis to these structures?
12. Relating gene expression to morphology
Data: paired gene expression and histology slides images from
the GTEx project.
End-to-end Training of Deep Probabilistic CCA on Paired
Biomedical Observations. (UAI 2019) Gregory Gundersen,
Bianca Dumitrascu, Jordan T. Ash, Barbara E. Engelhardt
13. Relating gene expression to morphology
Data: paired gene expression and histology slides images from
the GTEx project.
GTEx Consortium. Genetic effects on gene expression across
human tissues. Nature, 2017.
14. Scientific goals
Collecting images is cheaper than collecting gene
expression: can we predict images from gene expression?
(reconstruction)
Understand the effect of genetic variation at both molecular
and morphological levels! (good performance on
downstream tasks)
Understand what group of genes might give rise to
observable morphology! (interpretability)
17. Canonical correlation analysis
Given X1 ∈ R^(n×d1) and X2 ∈ R^(n×d2), find Λ*_1 ∈ R^(d1) and Λ*_2 ∈ R^(d2), such
that:
Λ*_1, Λ*_2 = argmax_{Λ1, Λ2} corr(X1 Λ1, X2 Λ2)
H. Hotelling. Relations between two sets of variates.
Biometrika, 1936.
F.R. Bach and M.I. Jordan. A probabilistic interpretation of
canonical correlation analysis. 2005. (Bayesian, solved
through EM)
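For concreteness, here is a minimal CCA sketch (not from the talk) using scikit-learn on synthetic paired views; the data-generating process and dimensions below are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Synthetic paired views X1 (n x d1) and X2 (n x d2) sharing one latent signal
rng = np.random.default_rng(0)
n, d1, d2 = 500, 10, 8
z = rng.normal(size=(n, 1))                        # shared latent variable
X1 = z @ rng.normal(size=(1, d1)) + 0.1 * rng.normal(size=(n, d1))
X2 = z @ rng.normal(size=(1, d2)) + 0.1 * rng.normal(size=(n, d2))

# Find projections maximizing corr(X1 @ w1, X2 @ w2)
cca = CCA(n_components=1)
U, V = cca.fit_transform(X1, X2)
print(np.corrcoef(U[:, 0], V[:, 0])[0, 1])         # close to 1 for this toy example
```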
18. Two related solutions: Image CCA & DeepPCCA
ImageCCA: Ash, Darnell, Munro, Engelhardt, Joint analysis of gene expression levels and histological images
identifies genes associated with tissue morphology, biorxiv 2018
19. (deep)PCCA
Input: paired data (x1^i, x2^i), for i = 1, ..., n
Ingredients: Variational loss + Factor Analysis (Probabilistic CCA)+
Domain Knowledge (linear map for gene expression,
convolutional neural networks for image features);
z ∼ N(0; I)
z1, z2 ∼ N(0; I)
x1 ∼ N(zΛ1 + z1B1; Ψ1)
x2 ∼ N(fθ(zΛ2 + z2B2); Ψ2),
where fθ is a convolutional neural network (the same architecture
used for ImageNet classification).
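A minimal sampling sketch of the generative model above, assuming toy dimensions and a small stand-in network for fθ (in the actual model fθ is a convolutional decoder); names and noise scales here are illustrative.

```python
import torch

# Toy dimensions: shared latent k, view-specific latents k1, k2, observations d1, d2
n, k, k1, k2, d1, d2 = 4, 5, 3, 3, 20, 30
Lam1, B1 = torch.randn(k, d1), torch.randn(k1, d1)
Lam2, B2 = torch.randn(k, d2), torch.randn(k2, d2)
f_theta = torch.nn.Sequential(torch.nn.Linear(d2, d2), torch.nn.ReLU(),
                              torch.nn.Linear(d2, d2))   # stand-in for the CNN decoder

z = torch.randn(n, k)                                    # shared latent: z ~ N(0, I)
z1, z2 = torch.randn(n, k1), torch.randn(n, k2)          # view-specific latents
x1 = z @ Lam1 + z1 @ B1 + 0.1 * torch.randn(n, d1)           # x1 ~ N(z Λ1 + z1 B1, Ψ1)
x2 = f_theta(z @ Lam2 + z2 @ B2) + 0.1 * torch.randn(n, d2)  # x2 ~ N(fθ(z Λ2 + z2 B2), Ψ2)
```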
20. GTEx training
Deep probabilistic CCA fits PCCA embeddings of two
variational autoencoders, training the model end-to-end
with backpropagation
The embeddings are inputs to a PCCA module whose outputs
are latent embeddings z1, z2, zc ∼ N(0_k, I_k), with
y_j ∼ N(z_j Λ_j + z_c Λ_jc, Ψ_j)
Obtain Λ∗, Ψ∗ via EM parameter updates
Backprop through the ℓ1-penalized reconstruction loss
L = (1/n) Σ_{i=1}^n ||Dec(x1)_i − x1_i||_2^2 + (1/n) Σ_{i=1}^n ||Dec(x2)_i − x2_i||_2^2
+ γ1 ||Λ1||_1 + γ2 ||Λ2||_1 + γ3 ||θ_dec||_1
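A hedged PyTorch sketch of this penalized reconstruction loss; function and argument names (dec_params, x1_rec, etc.) are placeholders rather than the paper's code, and the γ values are arbitrary.

```python
import torch

def dpcca_loss(x1, x2, x1_rec, x2_rec, Lam1, Lam2, dec_params,
               gamma1=1e-3, gamma2=1e-3, gamma3=1e-3):
    """Mean squared reconstruction error for both views plus L1 penalties
    on the PCCA loadings and the decoder parameters (illustrative sketch)."""
    rec1 = ((x1_rec - x1) ** 2).sum(dim=1).mean()
    rec2 = ((x2_rec - x2) ** 2).sum(dim=1).mean()
    l1 = gamma1 * Lam1.abs().sum() + gamma2 * Lam2.abs().sum()
    l1 = l1 + gamma3 * sum(p.abs().sum() for p in dec_params)
    return rec1 + rec2 + l1
```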
21. GTEx training
z ∼ N(0; I)
z1, z2 ∼ N(0; I)
x1 ∼ N(zΛ1 + z1B1; Ψ1)
x2 ∼ N(fθ(zΛ2 + z2B2); Ψ2),
A. Klami, S. Virtanen, and S. Kaski. Bayesian canonical
correlation analysis. JMLR, 2013.
D.P. Kingma and M. Welling. Auto-encoding variational Bayes.
arXiv preprint, 2013.
22. Toy Data
Input: paired vectors (x1, x2), where x1 represents image features
from images of the digits 0, 1, and 2, and x2 is sampled from multivariate
normal distributions with different means (not pictured below).
24. GTEx: Latent Space Organization
Latent joint image embeddings using DPCCA (left) show structural
organization, whereas clustering of images based on image-specific
embeddings (right) does not.
31. Necessary Modelling Extensions: Interpretability!
introduce adversarial training (avoid information bottleneck)
introduce sparsity on gene expression (horseshoe, spike and
slab)
extract features that are most responsible for the embedding
[e.g., adapt Deep Learning for Case-Based Reasoning through
Prototypes: A Neural Network that Explains Its Predictions,
Oscar Li, Hao Liu, Chaofan Chen, Cynthia Rudin]
32. A parting problem
Scenario where samples are not aligned (matched):
from multi-view to domain adaptation.
Requires optimizing over a large discrete structure Π
(permutation matrix) or a different modelling formulation.
z ∼ N(0; I)
z1, z2 ∼ N(0; I)
x1 ∼ N(ΠzΛ1 + Πz1B1; Ψ1)
x2 ∼ N(fθ(zΛ2 + z2B2); Ψ2),
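One possible (illustrative, not from the talk) way to approach the alignment problem is to alternate model fitting with a discrete matching step: estimate Π by solving a linear assignment between the two views' latent embeddings. The sketch below uses scipy's Hungarian solver on made-up embeddings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Hypothetical latent embeddings of the two (unaligned) views, n samples each
rng = np.random.default_rng(0)
z_view1 = rng.normal(size=(50, 5))
perm_true = rng.permutation(50)
z_view2 = z_view1[perm_true] + 0.05 * rng.normal(size=(50, 5))

# Estimate the permutation Π by minimizing total pairwise distance (Hungarian algorithm)
cost = cdist(z_view1, z_view2)                 # n x n cost matrix
row, col = linear_sum_assignment(cost)         # col[i] is the estimated match for sample i
print((col == np.argsort(perm_true)).mean())   # fraction of correctly recovered matches
```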