dkNET Webinar: Multi-Omics Data Integration for Phenotype Prediction of Type-1 Diabetes 04/09/2021

Multi-omics Data Integration for Phenotype
Prediction of Type-1 Diabetes
Wei Zhang, Ph.D.
Genomics and Bioinformatics Cluster
Department of Computer Science
University of Central Florida
Computational Biology Lab: https://server.cs.ucf.edu/compbio/
Email: wzhang.cs@ucf.edu
dkNET Webinar, 04/09/2021

TEDDY Study
• The Environmental Determinants of Diabetes in the Young (TEDDY)
  • Investigates the causes of T1DM in children
  • Aims to identify the external triggers that cause some children to develop diabetes while other high-risk children remain free of it
• Multi-omics Data
  • Microarray gene expression
    • Normalized microarray gene expression data (47,169 x 2013)
    • Time series data with 401 participants and 2013 total time steps
    • Non-uniform time steps among the participants
  • RNA sequencing
    • Raw RNA-seq data for 112 nested case-control samples
  • SNP
    • Illumina SNP array data (195,806 x 7012)
  • Clinical Variables
    • Four types of islet autoantibody (MIAA, IA2A, GADA, ZnT8A)
    • Time series data with test results at multiple ages of participants

Outline
Our main objectives are to predict diabetes phenotype and identify
biomarkers in diabetes using omics data. The work plan consists of two
main steps.
1. Estimation of gene expression for phenotype prediction: Some participants
have too few time steps with gene expression data for a reliable analysis.
Estimating gene expression at the missing time steps of those participants
should provide more information for predicting islet autoantibody (IA) test
outcomes and persistent positivity (two consecutive positive results for any
IA).
2. Identification of mRNA truncation-derived biomarkers in type 1 diabetes:
Identify genome-wide alternative polyadenylation events from RNA-seq data
of the TEDDY participants to better understand the role of
post-transcriptional regulation in the progression of diabetes and to identify
better biological signatures for clinical decision-making.

Estimation of Gene Expression
• Several studies published in the last five years have shown that deep learning
models can be used to infer gene expression levels
  • Park et al. (PLoS Comp Bio 2020) used a GAN (generative adversarial network) to
  simulate gene expression data and predict the molecular progression of Alzheimer’s disease.
  • Bahrami et al. (Bioinformatics 2020) applied a GAN to generate the hidden structure
  from scRNA-seq data for cell type clustering.
  • Chen et al. (Bioinformatics 2016) inferred the expression of target genes from the
  expression of landmark genes for the NIH LINCS program.
• In a previous study, we found that GANs can reliably generate synthetic
omics data with a predictive signature for disease outcome classification
  • Ahmed et al. (under review) generated synthetic data by integrating multi-omics
  profiles and a biological interaction network for disease outcome prediction.

Generative Adversarial Network (GAN)
[Figure: GAN schematic, from Cihan Ongun et al. (2019): Paired 3D Model Generation with Conditional Generative Adversarial Networks]

Research Design: Some participants have too few time steps with gene
expression data for a reliable analysis. Estimating gene expression at the
missing time steps of those participants should provide more information for
downstream analysis.
• Cluster the participants using a k-means or hierarchical clustering algorithm
• Train a GAN on a cluster to generate synthetic gene expression
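
The clustering step could be sketched roughly as follows. This is a minimal pure-Python k-means, assuming each participant is summarized by one fixed-length feature vector (for example, mean expression over that participant's available time steps). The slides do not say which features, distance, or number of clusters were used, so `kmeans`, `profiles`, and the toy values are all hypothetical, and the GAN training itself is not shown.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(points, k, max_iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every participant.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Hypothetical per-participant summary profiles (two features each).
profiles = [
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
]
labels = kmeans(profiles, k=2)
print(labels)
```

A generative model would then be trained separately on the participants that share a label, so that synthetic expression is drawn from a group of similar participants.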

• The mean correlation coefficient between synthetic and real gene
expression is 0.925
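
A mean correlation of this kind can be computed as in the sketch below. It assumes Pearson correlation and one coefficient per paired real/synthetic profile, averaged over pairs; the slide does not state which correlation was used or whether it was taken per sample or per gene. The function names and toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def mean_correlation(real_profiles, synthetic_profiles):
    """Average the per-pair correlation over all paired profiles."""
    scores = [pearson(r, s) for r, s in zip(real_profiles, synthetic_profiles)]
    return sum(scores) / len(scores)

# Hypothetical toy data: each row is one expression profile.
real = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
synthetic = [[1.1, 1.9, 3.2, 3.9], [2.0, 1.0, 4.0, 3.0]]
print(round(mean_correlation(real, synthetic), 3))
```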

Prediction of Phenotype
• Research Design: The outcome of islet autoantibody (IA) tests and
persistent positivity (two consecutive positive results for any IA)
will then be predicted using the imputed gene expression. A larger
number of available time steps should result in better training of the
classifier.
  • Outcome of IA: the IA outcome will be predicted for each individual time step of a
  participant.
  • Persistent positive: the persistent positive outcome will be predicted for each
  sample using all available time steps and their time dependency in a long
  short-term memory (LSTM) network.

Prediction of IA Outcome
The IA outcome for individual time steps is predicted treating all time steps
as independent, under the following cases. An SVM is used as the classifier, and all
results are reported as AUC.
• Case 1: the last time step in each sample with real gene expression available was used as
the test set, and all other time steps were used as the training set
• Case 2: all time steps were randomly split into training and test sets
       Case 1                          Case 2
       Real          Real + Synthetic  Real              Real + Synthetic
IA2A   0.736 (1612)  0.757 (2651)      0.850 (1515/379)  0.794 (2409/603)
MIAA   0.657 (1612)  0.654 (2648)      0.743 (1515/379)  0.723 (2407/602)
GADA   0.702 (1610)  0.691 (2649)      0.730 (1513/379)  0.680 (2407/603)
ZnT8A  0.855 (433)   0.901 (626)       0.908 (429/107)   0.861 (625/158)
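
The two evaluation splits could look roughly like the sketch below. The record layout `(participant_id, time_step, is_real)` and both function names are hypothetical. The 20% test fraction is an assumption that merely agrees with the Case 2 counts in the table (for example, 1515/379 is roughly 80/20); the slides do not state the fraction.

```python
import random

# Hypothetical records: (participant_id, time_step, is_real).
# is_real is False for synthetic (imputed) gene expression.
records = [
    ("p1", 3, True), ("p1", 6, True), ("p1", 9, False),
    ("p2", 3, True), ("p2", 12, True),
    ("p3", 6, True),
]

def split_last_real_step(records):
    """Case 1: test on each participant's last time step with real data."""
    last_real = {}
    for pid, step, is_real in records:
        if is_real and step > last_real.get(pid, float("-inf")):
            last_real[pid] = step

    def is_test(record):
        pid, step, is_real = record
        return is_real and last_real[pid] == step

    train = [r for r in records if not is_test(r)]
    test = [r for r in records if is_test(r)]
    return train, test

def split_random(records, test_fraction=0.2, seed=0):
    """Case 2: shuffle all time steps and hold out a fixed fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train1, test1 = split_last_real_step(records)
train2, test2 = split_random(records)
print(len(train1), len(test1), len(train2), len(test2))
```

Note that Case 2 can place time steps from the same participant in both the training and test sets, which Case 1 avoids for the held-out step.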

Prediction of Persistent Positive Samples
First, we generate persistent positive (PP) labels for all samples based
on two consecutive positive results in any IA outcome. Only the time steps
matched between gene expression and IA outcomes are considered for PP
labels. 71 of the 401 samples are labeled PP.
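
The labeling rule can be sketched as below. It assumes that "two consecutive positive" means the same autoantibody is positive at two adjacent matched time steps; the slides could also be read as any autoantibody being positive at adjacent visits, so treat this as one interpretation. The data layout and the function name `persistent_positive` are hypothetical.

```python
def persistent_positive(ia_results, expression_steps):
    """Return True if any autoantibody is positive at two consecutive
    time steps, counting only steps that also have gene expression.

    ia_results: {autoantibody_name: {time_step: bool}}
    expression_steps: iterable of time steps with gene expression data
    """
    matched = set(expression_steps)
    for by_step in ia_results.values():
        steps = sorted(step for step in by_step if step in matched)
        flags = [by_step[step] for step in steps]
        if any(a and b for a, b in zip(flags, flags[1:])):
            return True
    return False

# Hypothetical participant: GADA is positive at steps 6 and 9.
ia_results = {
    "GADA": {3: False, 6: True, 9: True},
    "IA2A": {3: False, 6: False, 12: True},
}
print(persistent_positive(ia_results, [3, 6, 9, 12]))
print(persistent_positive(ia_results, [3, 6, 12]))
```

In the second call, step 9 has no gene expression, so the GADA result at that step is not matched and the participant is not labeled PP under this rule.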
• Case 1: we ignore time dependency and use a traditional
classification model (random forest) to predict PP (AUC: 0.638)
• Case 2: we use an LSTM for time series analysis, using all available gene
expression time steps of each sample for PP prediction (AUC: 0.686)
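
To make the LSTM case concrete, the sketch below runs a standard LSTM cell (forget gate, input gate, cell input, output gate) over sequences of different lengths and maps the final hidden state to a PP probability. It is an illustration only: the weights are random and untrained, the layer sizes and the logistic readout are assumptions, and all names are hypothetical rather than taken from the talk. In practice a deep learning framework would be used.

```python
import math
import random

GATES = ("f", "i", "g", "o")  # forget gate, input gate, cell input, output gate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    """Small random (untrained) weights: W acts on x_t, U on h_{t-1}, b is the bias."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (matrix(n_hidden, n_in), matrix(n_hidden, n_hidden), [0.0] * n_hidden)
            for g in GATES}

def pre_activation(gate_params, x, h_prev):
    """Compute W x_t + U h_{t-1} + b for one gate."""
    W, U, b = gate_params
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h_prev))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden and cell state vectors."""
    f = [sigmoid(z) for z in pre_activation(params["f"], x, h_prev)]
    i = [sigmoid(z) for z in pre_activation(params["i"], x, h_prev)]
    g = [math.tanh(z) for z in pre_activation(params["g"], x, h_prev)]
    o = [sigmoid(z) for z in pre_activation(params["o"], x, h_prev)]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def encode_sequence(sequence, params, n_hidden):
    """Run the cell over all time steps and return the final hidden state."""
    h = [0.0] * n_hidden
    c = [0.0] * n_hidden
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h

def pp_probability(h, readout_weights, readout_bias=0.0):
    """Logistic readout from the final hidden state (an assumed output layer)."""
    return sigmoid(sum(w * hj for w, hj in zip(readout_weights, h)) + readout_bias)

# Hypothetical participants with 2 and 5 time steps of 3 genes each.
n_genes, n_hidden = 3, 4
params = init_params(n_genes, n_hidden)
readout = [0.5] * n_hidden
short_series = [[0.2, 1.0, -0.5], [0.1, 0.8, -0.4]]
long_series = short_series + [[0.0, 0.9, -0.3]] * 3
for series in (short_series, long_series):
    h_final = encode_sequence(series, params, n_hidden)
    print(len(series), round(pp_probability(h_final, readout), 3))
```

Because the same cell is applied at every time step, participants with different numbers of time steps can be handled by one model, which is the property the non-uniform TEDDY time series require.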
[Figure: (a) LSTM cell and (b) LSTM network unrolled over time steps, with outputs Phenotype 1 and Phenotype 2. Legend: input vector, hidden state vector, and cell state vector for time step t; forget gate, input gate, cell input, and output gate vectors for time step t; sigmoid and tanh activations.]

APA and mTOR
• Alternative Polyadenylation (APA)
– UTR-APA potentially regulates the stability,
cellular localization, and translation efficiency
of target RNAs
– CR-APA can affect gene expression qualitatively
by producing distinct protein isoforms
• Mammalian target of rapamycin (mTOR)
– Critical in regulating cell proliferation and growth
– Upregulation of the mTOR signaling pathway leads to transcriptome-wide mRNA
truncation
– Dysregulation leads to several metabolic pathological conditions, including obesity
and diabetes
[Figure: schematic of UTR-APA and CR-APA]

mTOR activation leads to genome-wide
3’UTR shortening
• RNA-Seq experiments using Tsc1 knockout MEFs (Tsc1-/-) and wild-type MEFs (WT)
• RT-qPCR validation of RNA-Seq data

mTOR activation leads to genome-wide
3’UTR shortening
• Many more transcripts show 3’UTR shortening in Tsc1-/- compared to WT
• No strong correlation between differential expression and 3’UTR shortening
• Many KEGG pathways enriched among genes with 3’UTR shortening in Tsc1-/- are cancer- and
metabolism-related pathways
• 3’UTR shortening is shown to increase protein production
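
One simple way to flag 3’UTR shortening between two conditions is to compare the fraction of a gene's transcripts that use the distal poly(A) site (the long 3’UTR isoform). The sketch below is an illustration under that assumption, not the method used in the talk; the abundance values, the 0.2 threshold, the gene names, and the function names are all hypothetical.

```python
def distal_usage(long_abundance, short_abundance):
    """Fraction of transcripts carrying the long 3'UTR; None if unexpressed."""
    total = long_abundance + short_abundance
    if total == 0:
        return None
    return long_abundance / total

def call_shortening(wild_type, knockout, min_drop=0.2):
    """Genes whose distal usage drops by at least min_drop in the knockout.

    wild_type, knockout: {gene: (long_abundance, short_abundance)}
    """
    shortened = []
    for gene in wild_type.keys() & knockout.keys():
        wt_usage = distal_usage(*wild_type[gene])
        ko_usage = distal_usage(*knockout[gene])
        if wt_usage is None or ko_usage is None:
            continue  # not expressed in one condition
        if wt_usage - ko_usage >= min_drop:
            shortened.append(gene)
    return sorted(shortened)

# Hypothetical isoform abundances (long, short) per gene.
wild_type = {"geneA": (80, 20), "geneB": (50, 50), "geneC": (0, 0)}
knockout = {"geneA": (30, 70), "geneB": (45, 55), "geneC": (10, 5)}
print(call_shortening(wild_type, knockout))
```

Because the measure is a ratio within each gene, it can change even when the gene's total expression does not, which is consistent with the weak correlation between differential expression and 3’UTR shortening noted above.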

Future work
• Identify both 3’UTR-APA and CR-APA events among the 112 nested
case-control samples with RNA-seq data in the TEDDY study
• Develop a deep learning-based method to integrate UTR-APA
profiling, CR-APA profiling, and gene expression data to improve type
1 diabetes outcome prediction

Acknowledgements
Computational Biology Group
Khandakar Tanvir Ahmed
University of Minnesota
Jeongsik Yong, Ph.D.
University of South Florida
Michael Toth
Chris Shaffer
NIDDK Central Repository
Rose Woodruff
NIH NIDDK
Xujing Wang, Ph.D.
Beena Akolkar, Ph.D.
Corinne Silva, Ph.D.
UCSD dkNET Team:
Maryann Martone, Ph.D.
Jeffrey Grethe, Ph.D.
Ko-Wei Lin, Ph.D.
Neil McKenna, Ph.D.
Funding
dkNET New Investigator Pilot Program
