Abstract
Omics techniques (e.g., transcriptomics, genomics, and epigenomics) report quantitative measures of tens of thousands of biological features and provide a more comprehensive molecular perspective on diabetes mechanisms than traditional approaches. Identifying representative molecular signatures from this very large number of biological features becomes a central problem in utilizing the data for clinical decision-making. Exploring the complex causal relations between the identified molecular signatures and diabetes phenotypes can be an effective and efficient way to improve the understanding of diabetes and, using already collected data (e.g., from the TEDDY project), to assess the cause of diabetes in new patients. However, because of the unavoidable patient heterogeneity, statistical randomness, and experimental noise in the high-dimension, low-sample-size omics data of diabetic patients, utilizing the available data for clinical decision-making remains an ongoing challenge. To overcome these limitations, in this study we developed (1) a generative adversarial network (GAN)-based model to generate synthetic omics data for samples with few omics profiles available; (2) a deep learning-based fusion network model for phenotype prediction of type-1 diabetes; and (3) a long short-term memory (LSTM)-based model for predicting islet autoantibody outcomes and persistent positivity. The models are tested on the multi-omics data in the TEDDY project.
Presenter: Wei Zhang, Ph.D. Assistant Professor, Department of Computer Science & Genomics and Bioinformatics Cluster, University of Central Florida
Upcoming webinars schedule: https://dknet.org/about/webinar
dkNET Webinar: Multi-Omics Data Integration for Phenotype Prediction of Type-1 Diabetes 04/09/2021
1. Multi-omics Data Integration for Phenotype Prediction of Type-1 Diabetes
Wei Zhang, Ph.D.
Genomics and Bioinformatics Cluster
Department of Computer Science
University of Central Florida
Computational Biology Lab: https://server.cs.ucf.edu/compbio/
Email: wzhang.cs@ucf.edu
dkNET Webinar, 04/09/2021
2. TEDDY Study
• The Environmental Determinants of Diabetes in the Young (TEDDY)
  • Investigates the causes of T1DM in children
  • Aims to find the external triggers that cause some children to get diabetes while other high-risk children remain free of it
• Multi-omics Data
  • Microarray gene expression
    • Normalized microarray gene expression data (47,169 x 2013)
    • Time series data with 401 participants and 2013 total time steps
    • Non-uniform time steps among the participants
  • RNA sequencing
    • Raw RNA-seq data for 112 nested case-control samples
  • SNP
    • Illumina SNP array data (195,806 x 7012)
  • Clinical variables
    • Four types of islet autoantibody (MIAA, IA2A, GADA, ZnT8A)
    • Time series data with test results at multiple participant ages
3. Outline
Our main objectives are to predict diabetes phenotypes and identify biomarkers in diabetes using omics data. The work plan consists of two main steps.
1. Estimation of gene expression for phenotype prediction: Some participants have too few time steps with gene expression data for a reliable analysis. Estimating gene expression at the missing time steps of those participants should provide more information for predicting islet autoantibody (IA) test outcomes and persistent positivity (two consecutive positive results for any IA outcome).
2. Identification of mRNA truncation-derived biomarkers in type 1 diabetes: Identify genome-wide alternative polyadenylation events from RNA-seq data of the participants in the TEDDY study to better understand the role of post-transcriptional regulation in the progression of diabetes and to identify better biological signatures for clinical decision-making.
5. Estimation of Gene Expression
• Several studies published in the last five years showed that deep learning models can be applied to infer gene expression levels
  • Park et al. (PLoS Comp Bio 2020) used a GAN (generative adversarial network) to simulate gene expression data to predict the molecular progress of Alzheimer’s disease.
  • Bahrami et al. (Bioinformatics 2020) applied a GAN to generate the hidden structure from scRNA-seq data for cell type clustering.
  • Chen et al. (Bioinformatics 2016) inferred the expression of target genes from the expression of landmark genes for the NIH LINCS program.
• In a previous study, we found that GANs can reliably generate synthetic omics data with a predictive signature for disease outcome classification
  • Ahmed et al. (under review) generated synthetic data by integrating multi-omics profiles and a biological interaction network for disease outcome prediction.
6. Generative Adversarial Network (GAN)
Figure source: Cihan Ongun et al. (2019), "Paired 3D Model Generation with Conditional Generative Adversarial Networks"
7. Estimation of Gene Expression
Research Design: Some participants have too few time steps with gene expression data for a reliable analysis. Estimating gene expression at the missing time steps of those participants should provide more information for downstream analysis.
• Cluster participants using a k-means or hierarchical clustering algorithm
• Train a GAN on a cluster to generate synthetic gene expression
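The clustering step can be illustrated with a minimal k-means sketch in plain Python. The talk does not state which features, distance, or number of clusters were used, so the toy `profiles` (one summary vector per participant), `k=2`, and the deterministic farthest-point initialization are all assumptions made for illustration. GAN training itself is not shown.

```python
import math

def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest
    from all centroids chosen so far (an illustrative choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, n_iter=100):
    """Plain k-means on equal-length numeric vectors; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy input: one row per participant.
profiles = [
    [0.1, 0.2], [0.0, 0.1], [0.2, 0.0],  # group near the origin
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],  # group near (5, 5)
]
labels, centroids = kmeans(profiles, k=2)

# Participant indices per cluster; a GAN would then be trained per cluster.
clusters = {}
for idx, lab in enumerate(labels):
    clusters.setdefault(lab, []).append(idx)
```

In practice a library implementation would normally replace this hand-written loop; the sketch only shows what "clustering of participants" produces as input for the generative model.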
8. Estimation of Gene Expression
• The mean correlation coefficient between synthetic and real gene expression is 0.925
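A mean correlation of this kind can be computed as sketched below. The slide does not say which correlation coefficient was used or whether it was taken per sample or per gene; this sketch assumes Pearson correlation computed per paired sample (row) and then averaged. The function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def mean_correlation(real, synthetic):
    """Average Pearson r over paired rows (assumed: one row per sample)."""
    rs = [pearson(r, s) for r, s in zip(real, synthetic)]
    return sum(rs) / len(rs)

# Toy data: the first pair is perfectly correlated, the second perfectly
# anti-correlated, so the mean is zero.
real = [[1, 2, 3, 4], [2, 4, 6, 8]]
synthetic = [[2, 4, 6, 8], [8, 6, 4, 2]]
score = mean_correlation(real, synthetic)
```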
9. Prediction of Phenotype
• Research Design: The outcomes of islet autoantibody (IA) tests and persistent positivity (two consecutive positive results for any IA outcome) will then be predicted using the imputed gene expression. A larger number of available time steps should result in better training of the classifier.
  • Outcome of IA: the IA outcome will be predicted for each individual time step of a participant.
  • Persistent positive: the persistent positive outcome will be predicted for each sample using all available time steps and their time dependency in a long short-term memory (LSTM) network.
10. Prediction of IA Outcome
IA outcomes for individual time steps are predicted by treating all time steps as independent, under the two cases below, each evaluated with real data only and with real plus synthetic data (four settings in total). An SVM is used as the classifier, and all results are reported as AUC.
• Case 1: the last time step in each sample with real gene expression available was used as the test set, and all other time steps were used as the training set
• Case 2: all time steps were randomly split into training and test sets
        Case 1                            Case 2
        Real          Real + Synthetic    Real               Real + Synthetic
IA2A    0.736 (1612)  0.757 (2651)        0.850 (1515/379)   0.794 (2409/603)
MIAA    0.657 (1612)  0.654 (2648)        0.743 (1515/379)   0.723 (2407/602)
GADA    0.702 (1610)  0.691 (2649)        0.730 (1513/379)   0.680 (2407/603)
ZnT8A   0.855 (433)   0.901 (626)         0.908 (429/107)    0.861 (625/158)
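The Case 1 split and the AUC metric can be sketched as follows. This is only an illustration: the record fields (`participant`, `time`, `synthetic`) are hypothetical names, and the classifier itself is omitted, so the scores passed to `roc_auc` stand in for the decision values an SVM would produce.

```python
def split_last_real_step(records):
    """Case 1 split: each participant's last time step with real (non-synthetic)
    gene expression goes to the test set; every other record is training data."""
    last_real = {}
    for rec in records:
        if rec.get("synthetic", False):
            continue  # synthetic steps are never eligible as test points
        pid = rec["participant"]
        if pid not in last_real or rec["time"] > last_real[pid]:
            last_real[pid] = rec["time"]
    train, test = [], []
    for rec in records:
        is_test = (
            not rec.get("synthetic", False)
            and last_real.get(rec["participant"]) == rec["time"]
        )
        (test if is_test else train).append(rec)
    return train, test

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy records (feature vectors omitted for brevity).
records = [
    {"participant": "A", "time": 3, "synthetic": False},
    {"participant": "A", "time": 6, "synthetic": True},
    {"participant": "A", "time": 9, "synthetic": False},
    {"participant": "A", "time": 12, "synthetic": True},
    {"participant": "B", "time": 3, "synthetic": False},
]
train, test = split_last_real_step(records)

# Scores would come from the trained classifier; these are made-up values.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Note that a participant with a single real time step contributes only to the test set under this reading of Case 1.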
11. Prediction of Persistent Positive Samples
First, we generate persistent positive (PP) labels for all samples based on two consecutive positive results in any IA outcome. Only the time steps matched between gene expression and IA outcomes are considered for PP labels. 71 of 401 samples are labeled PP.
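The labeling rule can be sketched as below. It rests on two assumptions: a time step counts as positive if any of the four autoantibodies is positive at that step, and "consecutive" refers to adjacent matched time steps. The data layout (a dict keyed by time step) is hypothetical.

```python
IA_TYPES = ("MIAA", "IA2A", "GADA", "ZnT8A")

def persistent_positive(ia_results, expression_times):
    """Return True if two consecutive matched time steps are IA-positive.

    ia_results: dict mapping time step -> dict of IA type -> bool.
    expression_times: time steps at which gene expression is available.
    Only time steps present in both inputs are considered.
    """
    matched = sorted(set(ia_results) & set(expression_times))
    previous_positive = False
    for t in matched:
        positive = any(ia_results[t].get(ia, False) for ia in IA_TYPES)
        if positive and previous_positive:
            return True
        previous_positive = positive
    return False

# Hypothetical participant: positive at time steps 6 and 9.
ia = {
    3: {"GADA": False},
    6: {"GADA": True},
    9: {"IA2A": True},
    12: {"GADA": False},
}
label_all_steps = persistent_positive(ia, [3, 6, 9, 12])
label_missing_9 = persistent_positive(ia, [3, 6, 12])
```

The second call shows why the matching matters: without gene expression at time step 9, the two positive tests are no longer adjacent among the matched steps, so the sample is not labeled PP.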
• Case 1: we do not consider the time dependency and use a traditional classification model (random forest) to predict PP (AUC: 0.638)
• Case 2: we use an LSTM for time series analysis, using all available gene expression time steps of each sample for PP prediction (AUC: 0.686)
[Figure: LSTM architecture. Panels a and b show an LSTM cell and a chain of LSTM units; output labels: Phenotype 1, Phenotype 2. Legend: input vector, hidden state vector, cell state vector, forget gate vector, input gate vector, cell input vector, and output gate vector for time step t; sigmoid and tanh activations.]
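For reference, the quantities named in the LSTM figure legend correspond to the standard LSTM cell formulation, written out below. The figure's own symbols did not survive extraction, so the notation is an assumption: x_t is the input vector, h_t the hidden state, c_t the cell state, f_t, i_t, and o_t the forget, input, and output gates, and \tilde{c}_t the cell input; W, U, and b are weight matrices and bias vectors introduced here for illustration.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here \sigma is the sigmoid activation and \odot is element-wise multiplication, which matches the three multiplications, one addition, and two tanh operations shown in the cell diagram.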
13. APA and mTOR
• Alternative Polyadenylation (APA)
– UTR-APA potentially regulates the stability, cellular localization, and translation efficiency of target RNAs
– CR-APA can affect gene expression qualitatively by producing distinct protein isoforms
• Mammalian target of rapamycin (mTOR)
– Critical in regulating cell proliferation and growth
– Upregulation of the mTOR signaling pathway leads to transcriptome-wide mRNA truncation
– Dysregulation leads to several metabolic pathological conditions, including obesity and diabetes
[Figure panels: UTR-APA and CR-APA]
14. mTOR activation leads to genome-wide 3’UTR shortening
• RNA-Seq experiments using Tsc1 knockout MEFs (Tsc1-/-) and wild type MEFs (WT)
• RT-qPCR validation of RNA-Seq data
15. mTOR activation leads to genome-wide 3’UTR shortening
• Many more transcripts show 3’UTR shortening in Tsc1-/- compared with WT
• No strong correlation between differential expression and 3’UTR shortening
• Many of the KEGG pathways enriched among genes with 3’UTR shortening in Tsc1-/- are cancer- and metabolism-related
• 3’UTR shortening is shown to increase protein production
18. Future work
• Identify both 3’UTR-APA and CR-APA events across the 112 nested case-control samples with RNA-seq data in the TEDDY study
• Develop a deep learning-based method that integrates UTR-APA profiling, CR-APA profiling, and gene expression data to improve type 1 diabetes outcome prediction
19. Acknowledgements
Computational Biology Group
Khandakar Tanvir Ahmed
University of Minnesota
Jeongsik Yong, Ph.D.
University of South Florida
Michael Toth
Chris Shaffer
NIDDK Central Repository
Rose Woodruff
NIH NIDDK
Xujing Wang, Ph.D.
Beena Akolkar, Ph.D.
Corinne Silva, Ph.D.
UCSD dkNET Team:
Maryann Martone, Ph.D.
Jeffrey Grethe, Ph.D.
Ko-Wei Lin, Ph.D.
Neil McKenna, Ph.D.
Funding
dkNET New Investigator Pilot Program