BOSE, Debasish - Research Plan

Signaling Pathways Reconstruction and Inference of
Gene Regulatory Networks In Cancer Phenotypes
Integrating Genomic and Proteomic Datasets Refined
By Epigenetic Prior and Knowledge Prior
Debasish Bose
Abstract Gene Regulatory Network (GRN) inference, together with an understanding
of cellular signaling pathways, is an important challenge in Systems Biology and
holds strong promise to positively disrupt the health sector, especially through
the development of targeted and personalized medicines. Though a plethora of
research has been done to reverse engineer the regulatory network from mRNA/TF
levels alone, many problems remain unaddressed, such as the integration of disparate
data sources like Proteomics and Metabolomics and the automatic incorporation of a
Knowledge Prior and an Epigenetic Prior. This document describes different approaches
to reconstructing signaling pathways (along with reverse engineering of GRNs), surveys
the literature, and proposes research around a Dynamic Bayesian Network (DBN)
model (with local Gaussian Mixture distributions to factor in hidden molecular
processes) capable of systematically including various -omics datasets and priors
(epigenetic and knowledge) while inferring the most probable network topology (G) of
some key cancer signaling pathways. The research is aimed at a Ph.D. degree in
bioinformatics.
1 Introduction
A gene regulatory network (GRN, or genetic network) is a collection of genes
in a cell that interact with each other indirectly through their products (i.e.,
RNAs and proteins); the regulatory relationships between gene activities are
mediated by proteins and metabolites. In other words, gene regulatory networks
are high-level descriptions of cellular biochemistry, in which only the
transcriptome is considered and all biochemical processes underlying gene–gene
interactions are implicitly present. Signaling pathways operate at a higher level
than the gene-gene space, describing the Protein-Protein Interaction (PPI) cascade
(proteins and protein complexes) that often terminates with Transcription Factors
(TFs), which in turn up- or down-regulate a set of genes. Reconstruction of signaling
pathways (along with the associated GRNs) is critical not only for elucidating the
underlying molecular processes, but also for designing highly targeted and
personalized drugs.
Debasish Bose
Affiliation not available

Fig. 1 Gene regulation at different functional "spaces" [2]
2 Literature Survey
Several machine learning and statistical methods have been proposed for the
problem [1]; [4]; [9]; [10]; [11], and Bayesian network (BN) models have gained
popularity for the task of inferring gene networks [5]; [6].
Because of the complexity and sparsity of regulatory pathways (sparse because
N ≪ D, where N denotes the number of observations and D denotes the dimensionality
of the dataset, typically the number of genes in the case of an mRNA/microarray
dataset) and the noisy nature of experimental data, pure machine learning and
statistical methods may lead to poor reconstruction accuracy for the underlying
molecular network. A promising approach in this direction has been proposed by [14].
The authors formulate the learning scheme in a Bayesian framework. This scheme
allows the systematic integration of gene expression data with biological knowledge
from other types of postgenomic data or the literature via a prior distribution
over network structures. The hyperparameters (β) of this distribution are inferred
together with the network structure in a maximum a posteriori (MAP) sense by
maximizing the joint posterior distribution with a heuristic greedy optimization
algorithm. Their method is based upon mRNA/TF data, whereas TFs are often
realized by protein complexes. Without integrating a PPI dataset it seems difficult
to reconstruct the complete signaling and regulatory pathway. For example, studies
comparing mRNA and protein expression profiles have indicated that mRNA
changes are unreliable predictors of protein abundance [3] [8].
In the field of GRN inference and the associated domain of signaling pathway
analysis, two questions are gaining relevance from the perspective of
reconstruction accuracy. The central proposal of this document takes the form of a
Bayesian framework for answering them:
• How to integrate disparate data sources (Proteomics, Metabolomics,
other -omics and experiments under varied conditions) into the already
promising Dynamic Bayesian Network? The necessity for integrated data
analysis across 'omics platforms is further driven by the desire to identify
fundamental properties of biological networks, such as redundancy, modularity,
robustness, feedback control and motifs. Such properties provide the underlying
structure of signaling networks, yet they are difficult to specify using a single
type of analytical measurement [18]. Similar work has been done in [16]; however,
this proposal differs in a number of ways. Firstly, they took a Bayesian approach
to data integration in which the weight of an edge connecting a pair of proteins is
the posterior probability of a functional relationship between the proteins given
all observed evidence for the pair. This is a pair-wise approach with Naive Bayes,
not a full-blown Bayesian Network or Markov Network (Probabilistic Graphical Model)
with automatic structure learning. Secondly, causality is important to pathway
analysis and to the entire reconstruction process. They reconstructed undirected
proteomic networks, whereas this document proposes building directed Bayesian
Networks (with causality automatically learned from data) that merge genomics and
proteomics data (Fig. 3). Lastly, they formally assessed the prior probabilities
from seven experts in the field of yeast molecular biology, whereas this document
proposes a Bayesian approach for the automatic incorporation of prior knowledge.
• How to incorporate our prior knowledge about pathways and regulatory
networks in a systematic way, without ignoring epigenetic variations
(Epigenetic Prior [17])? In fact, epigenetic variations (especially methylation and
histone modifications) and the associated variations in regulatory networks are
critical in designing and delivering personalized medicine with high efficacy.
3 Core Proposition
This document proposes the following algorithmic procedure, called DBNConsensus:
1. Learn a Dynamic Bayesian Network from microarray and transcriptomics
(DNA-TF) data, where causality is easier to derive
2. Learn a Markov Network from all other -omics data (Proteomics in particular),
where causality has to be deduced from the data
3. Entities (variables or nodes in the graph) in the networks (genes, proteins and
protein complexes) need not be identical across the various datasets
4. Use a semi-parametric method (e.g., Gaussian Mixture) for the local probability
distributions while learning the graph structures
5. Incorporate prior knowledge automatically while learning the structures
6. Merge all such Bayesian Networks to produce the final causal network, using
a consensus algorithm
4 Methodology
A Bayesian Network model is proposed for addressing the aforementioned questions.
4.1 Bayesian Networks (BNs)
BNs are directed graphical models for representing probabilistic independence
relations between multiple interacting entities. Formally, a BN is defined by a
graphical structure G, a family of (conditional) probability distributions F, and
their parameters θG, which together specify a joint distribution over a set of
random variables of interest. The graphical structure G of a BN consists of a set
of nodes and a set of directed edges. The nodes represent random variables, while
the edges indicate conditional dependence relations. When we have a directed edge
from node A to node B, then A is called the parent of B, and B is called the child
of A. The structure G of a BN has to be a directed acyclic graph (DAG), that is,
a network without any directed cycles. This structure defines a unique rule for
expanding the joint probability in terms of simpler conditional probabilities. Let
X1, X2, ..., XN be a set of random variables represented by the nodes
i ∈ {1, 2, ..., N} in the graph, and define Xπi to be the parents of node Xi in
graph G. Then
$$ P(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid X_{\pi_i}) \tag{1} $$
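As an illustration of the factorization in Eq. (1), the following minimal sketch evaluates the joint probability of a small network of binary variables. The three-node structure, the node names and all table values are hypothetical and chosen only for illustration; they are not taken from any dataset discussed in this document.

```python
# Hypothetical three-node DAG: A -> C <- B, all variables binary.
parents = {"A": (), "B": (), "C": ("A", "B")}

# Conditional probability tables (illustrative values only):
# cpt[node][parent_values] = P(node = 1 | parents = parent_values).
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9},
}


def joint_probability(assignment):
    """Evaluate Eq. (1): the product over nodes of P(X_i | parents of X_i)."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


# Sanity check: the factorized joint sums to one over all 2^3 assignments.
total = sum(
    joint_probability({"A": a, "B": b, "C": c})
    for a in (0, 1)
    for b in (0, 1)
    for c in (0, 1)
)
```

Because each factor only involves a node and its parents, the number of parameters grows with the size of the parent sets rather than with the full joint state space.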
When adopting a score-based approach (score and search), our objective is to
maximize the graph posterior probability P(G|D), which is obtained by applying
Bayes' rule over network structures
$$ P(G \mid D) \propto P(D \mid G)\, P(G) \tag{2} $$
where D is the dataset and P(G) is the prior over structures. P(D|G) is the marginal
likelihood and is obtained by integrating over all the parameters
$$ P(D \mid G) = \int P(D \mid G, \theta_G)\, P(\theta_G \mid G)\, d\theta_G \tag{3} $$
where P(D|G, θG) is the likelihood and P(θG|G) is the prior over parameters.
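To make the integral in Eq. (3) concrete, the sketch below treats the simplest possible case: a single binary node without parents, a Bernoulli likelihood and a Beta prior on its parameter. This toy setting is an assumption made purely for illustration and is not the model proposed here. In this case the marginal likelihood has a closed form, which the code compares against a direct numerical integration.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_likelihood_closed_form(ones, zeros, a=1.0, b=1.0):
    """P(D) = B(a + ones, b + zeros) / B(a, b) for a Beta(a, b) prior."""
    return math.exp(log_beta(a + ones, b + zeros) - log_beta(a, b))


def marginal_likelihood_numeric(ones, zeros, a=1.0, b=1.0, steps=20000):
    """Midpoint-rule approximation of the integral of likelihood * prior."""
    width = 1.0 / steps
    total = 0.0
    for s in range(steps):
        theta = (s + 0.5) * width
        likelihood = theta ** ones * (1.0 - theta) ** zeros
        log_prior = (
            (a - 1.0) * math.log(theta)
            + (b - 1.0) * math.log(1.0 - theta)
            - log_beta(a, b)
        )
        total += likelihood * math.exp(log_prior) * width
    return total
```

For three ones and two zeros under a uniform Beta(1, 1) prior, both functions give 1/60. For richer local models the integral is generally not available in closed form, which is one reason conjugate families are popular in score-based structure learning.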
4.2 Local Distribution Families
To compute P(D|G) we need to consider function families F for the local
distributions. Two common function families are
– Unrestricted Multinomial Distribution with Dirichlet prior (discrete)
– Linear Gaussian Distribution with normal-Wishart prior (continuous)
However, both of these are parametric families, whereas Genomics/Proteomics
datasets are often "multi-modal", manifesting different modalities under different
experimental conditions, epigenetic variations and disease pathogenesis. In fact,
the focus of the current proposal on integrative approaches stems from this
"multi-modal" nature of -omics data. To cope with the multi-modality of the data at
the local distribution level, a suitable semi-parametric distribution is proposed,
for example a Gaussian Mixture. Mixture models can describe potentially complex
distributions of gene expression across a wide range of conditions:
$$ P(X_1, X_2, \ldots, X_N \mid G) = \prod_{i=1}^{N} P(X_i \mid X_{\pi_i}) = \prod_{i=1}^{N} \frac{P(X_i, X_{\pi_i})}{P(X_{\pi_i})} \tag{4} $$

where
$$ P(X_i, X_{\pi_i}) = \sum_{k=1}^{K_i} \alpha_{ik}\, \mathcal{N}(X_i^{T} \mid \mu_{ik}, \Sigma_{ik}); \quad X_i^{T} = X_i \cup X_{\pi_i}; \quad \Lambda_i = |X_{\pi_i}| + 1 \tag{5} $$
P(Xi, Xπi) is a mixture of Ki components. Each component is a multivariate
normal (MVN) distribution with mean vector µik of dimension Λi and covariance
matrix Σik of dimension Λi × Λi. Gaussian Mixture models, when applied to the local
distributions of a BN, can loosely model the underlying latent subnetworks. The
mixture set {Ki}i=1..N can be deduced from k-means or hierarchical clustering.
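The ratio of joint to parent density used for the local conditional distribution can be sketched for the smallest non-trivial case: one node with a single parent (Λi = 2) and a two-component mixture. All mixture weights, means and covariances below are hypothetical illustration values. The sketch relies on the standard fact that the marginal of a Gaussian mixture over the parent is again a Gaussian mixture with the same weights.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bivariate_normal_pdf(x, p, mean, cov):
    """Bivariate normal density over (child x, parent p)."""
    (sxx, sxp), (_, spp) = cov
    det = sxx * spp - sxp * sxp
    dx, dp = x - mean[0], p - mean[1]
    quad = (spp * dx * dx - 2.0 * sxp * dx * dp + sxx * dp * dp) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


# Hypothetical mixture with K_i = 2 components: (weight, mean, covariance).
components = [
    (0.6, (0.0, 0.0), ((1.0, 0.8), (0.8, 1.0))),
    (0.4, (3.0, -1.0), ((0.5, -0.2), (-0.2, 0.7))),
]


def joint_density(x, p):
    """Mixture density P(X_i, X_pi_i) as in Eq. (5)."""
    return sum(a * bivariate_normal_pdf(x, p, m, c) for a, m, c in components)


def parent_density(p):
    """P(X_pi_i): each component contributes its marginal over the parent."""
    return sum(a * normal_pdf(p, m[1], c[1][1]) for a, m, c in components)


def conditional_density(x, p):
    """Local distribution P(X_i | X_pi_i) as the ratio joint / parent marginal."""
    return joint_density(x, p) / parent_density(p)
```

For a fixed parent value the conditional density integrates to one over the child, and it can be bimodal, which a single linear Gaussian local model cannot represent.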
4.3 Temporal Patterns
The static BN discussed so far has two major disadvantages:
• Feedback loops (which are common in GRNs and signaling pathways) are not
allowed.
• It cannot model the temporal patterns of regulatory mechanisms.
The Dynamic Bayesian Network (DBN) [murphy1999modelling] was devised to
address these concerns. Assuming that temporal transitions obey a first-order,
homogeneous Markov chain,
$$ P(X_1, X_2, \ldots, X_N \mid G) = \prod_{t=2}^{T} \prod_{i=1}^{N} P(X_i(t) \mid X_{\pi_i}(t-1)) \tag{6} $$
A DBN with local Gaussian Mixture distributions is proposed for computing the
likelihood term P(D|G).
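The double product over time points and nodes in Eq. (6) can be sketched as a log-likelihood computation. For brevity, the sketch swaps the proposed Gaussian Mixture local model for a plain linear Gaussian conditional; this substitution, together with the function name and all parameter values, is an assumption made only for illustration.

```python
import math


def dbn_log_likelihood(series, parents, weights, bias, variance):
    """Sum of log P(X_i(t) | parents of X_i at t-1) over t = 2..T and all nodes.

    series[t][i] is the value of node i at time index t. Each local model is
    assumed (for illustration) to be a linear Gaussian: the mean is
    bias[i] + sum of weights[i][j] * X_j(t-1) over the parents j of node i.
    """
    total = 0.0
    for t in range(1, len(series)):
        prev, curr = series[t - 1], series[t]
        for i, pa in parents.items():
            mean = bias[i] + sum(weights[i][j] * prev[j] for j in pa)
            resid = curr[i] - mean
            total += (
                -0.5 * math.log(2.0 * math.pi * variance[i])
                - 0.5 * resid * resid / variance[i]
            )
    return total


# Hypothetical two-node feedback loop: node 1 drives node 0 and vice versa.
# This is legal in a DBN because every edge points from time t-1 to time t.
parents = {0: (1,), 1: (0,)}
weights = {0: {1: 0.5}, 1: {0: 1.0}}
bias = {0: 0.0, 1: 0.0}
variance = {0: 1.0, 1: 1.0}
series = [[1.0, 2.0], [1.0, 1.0], [0.5, 1.0]]

log_lik = dbn_log_likelihood(series, parents, weights, bias, variance)
```

The example series follows the assumed dynamics exactly, so every residual is zero and the log-likelihood reduces to the sum of the normalizing terms.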
4.4 Prior - P(G)
Systematic incorporation of structural priors is one of the core proposals of this
document. Priors can come from disparate sources:
• Knowledge prior in various pathway databases like KEGG
• Knowledge prior in various PPI databases like MIPS, BioGrid
• Knowledge prior in various microarray databases like Stanford Microarray
Database
• Epigenetic prior of histone modification from [12] or [19]
We need to define a function that measures the agreement between a given
candidate network G and the biological prior knowledge that we have at our
disposal. We follow the approach proposed by [14] and call this measure the energy
E, borrowing the name from the statistical physics community. G is the candidate
network structure obtained while repeatedly evaluating the posterior P(G|D)
through a search algorithm like Greedy Hill Climber (GHC) or Metropolis-Hastings
(MH). The "energy of the network" is obtained as

$$ E(G) = \sum_{1 \le i \le N} \sum_{1 \le j \le N} \left| \mathrm{Prior}(i, j) - G(i, j) \right| \tag{7} $$
The corresponding prior over graphs is defined by the Gibbs distribution
$$ P(G \mid \beta) = \frac{e^{-\beta E(G)}}{Z(\beta)} \tag{8} $$
β is a hyperparameter that corresponds to an inverse temperature in statistical
physics, and the denominator is a normalizing constant that is usually referred to
as the partition function Z(β), a sum over the set Ω of candidate network structures
$$ Z(\beta) = \sum_{G \in \Omega} e^{-\beta E(G)} \tag{9} $$
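For very small networks the energy, the partition function and the resulting Gibbs prior can be computed by brute force, as in the sketch below. It assumes that Ω is the set of all DAGs on N nodes and that the entries of the prior matrix lie in [0, 1]; both choices, and the example matrix itself, are illustrative assumptions rather than part of the proposal.

```python
import itertools
import math


def energy(graph, prior):
    """Eq. (7): sum of absolute differences between prior and adjacency entries."""
    n = len(graph)
    return sum(abs(prior[i][j] - graph[i][j]) for i in range(n) for j in range(n))


def is_acyclic(graph):
    """Repeatedly strip nodes without incoming edges; fail if none can be stripped."""
    remaining = set(range(len(graph)))
    while remaining:
        sources = [j for j in remaining if not any(graph[i][j] for i in remaining)]
        if not sources:
            return False
        remaining -= set(sources)
    return True


def all_dags(n):
    """Enumerate every DAG on n labelled nodes (feasible only for tiny n)."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in itertools.product((0, 1), repeat=len(pairs)):
        graph = [[0] * n for _ in range(n)]
        for (i, j), bit in zip(pairs, bits):
            graph[i][j] = bit
        if is_acyclic(graph):
            yield graph


def gibbs_prior(graph, prior, beta, omega):
    """Eq. (8), with the partition function of Eq. (9) summed over omega."""
    z = sum(math.exp(-beta * energy(g, prior)) for g in omega)
    return math.exp(-beta * energy(graph, prior)) / z


# Hypothetical prior knowledge matrix for three nodes (illustrative values).
prior = [[0.0, 0.9, 0.5], [0.1, 0.0, 0.8], [0.5, 0.2, 0.0]]
omega = list(all_dags(3))
```

With β = 0 the prior is uniform over Ω, and increasing β concentrates probability on structures that agree with the prior matrix. The enumeration grows super-exponentially with N, so for realistic networks the partition function has to be approximated or avoided.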
Now that we have the necessary framework for the computation of P(G), we
need a systematic procedure to derive the Prior(i, j) matrices by integrating the
data available from the different sources already mentioned. We propose a Bayesian
Multinet approach [15] for such an integration.
Fig. 2 Integration of different priors through an MDAG model. BNP-i represents the i-th prior.
$$ P(X = x) = \sum_{k=1}^{L} \pi_k\, P(X = x \mid Z = k, G_k, \theta_{G_k}) \tag{10} $$
The advantage of Bayesian multinets over more traditional graphical models is
the ability to represent context-specific independencies, that is, situations in
which subsets of variables exhibit certain conditional independencies for some, but
not all, values of a conditioning variable. These context-specific independencies,
and the corresponding existence of a distinguished (latent) variable Z, model the
variance within prior knowledge.

4.5 Computation of Posterior - P(G|D, β)
$$ P(G \mid D, \beta) = \frac{P(D \mid G, \beta)\, P(G \mid \beta)}{\sum_{G'} P(D \mid G', \beta)\, P(G' \mid \beta)} \tag{11} $$
Because of the intractability of the denominator
$\sum_{G'} P(D \mid G', \beta)\, P(G' \mid \beta)$, this document proposes a Markov
Chain Monte Carlo (MCMC) scheme with the Metropolis-Hastings search algorithm to
sample both network structures (G) and hyperparameters (β) from the posterior
distribution. A restricted search space (parent node configurations as opposed to
network structures) is proposed, similar to [5].
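A minimal sketch of a Metropolis-Hastings sampler over network structures is given below. It simplifies the scheme described above in several ways, all of which are assumptions made for illustration: β is held fixed rather than sampled, the search space is the full set of DAGs rather than parent node configurations, the proposal flips a single edge, and the marginal likelihood is passed in as a user-supplied function (`log_marginal_likelihood` is a hypothetical name). Because the normalizing sum cancels in the acceptance ratio, the denominator of Eq. (11) never has to be computed.

```python
import math
import random


def energy(graph, prior):
    """Disagreement between a candidate graph and the prior matrix."""
    n = len(graph)
    return sum(abs(prior[i][j] - graph[i][j]) for i in range(n) for j in range(n))


def is_acyclic(graph):
    """Repeatedly strip nodes without incoming edges; fail if none can be stripped."""
    remaining = set(range(len(graph)))
    while remaining:
        sources = [j for j in remaining if not any(graph[i][j] for i in remaining)]
        if not sources:
            return False
        remaining -= set(sources)
    return True


def metropolis_hastings(prior, beta, log_marginal_likelihood, n_steps, seed=0):
    """Sample DAGs with probability proportional to P(D|G) * exp(-beta * E(G))."""
    rng = random.Random(seed)
    n = len(prior)
    graph = [[0] * n for _ in range(n)]  # start from the empty graph
    log_post = log_marginal_likelihood(graph) - beta * energy(graph, prior)
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: flip one uniformly chosen ordered pair (i, j).
        i, j = rng.sample(range(n), 2)
        proposal = [row[:] for row in graph]
        proposal[i][j] = 1 - proposal[i][j]
        # Cyclic proposals are rejected, so the chain never leaves DAG space.
        if is_acyclic(proposal):
            log_post_new = (
                log_marginal_likelihood(proposal) - beta * energy(proposal, prior)
            )
            accept_prob = math.exp(min(0.0, log_post_new - log_post))
            if rng.random() < accept_prob:
                graph, log_post = proposal, log_post_new
        samples.append([row[:] for row in graph])
    return samples
```

A call such as `metropolis_hastings(prior, 5.0, lambda g: 0.0, 3000)` with a flat likelihood samples from the Gibbs prior alone. Sampling β as well would additionally require the ratio of partition functions Z(β) in the acceptance step, which this sketch deliberately leaves out.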
4.6 Integration of Genomics and Proteomics Data
This document proposes an integrative approach towards the incorporation of
Genomics (microarray, RNA-Seq, etc.) and Proteomics (Protein-Protein Interaction)
datasets. Though high-throughput Proteomics (e.g., Mass Spectrometry) has
generated an enormous amount of data, the probability of false negatives is rather
high in such protein interactome datasets. To reliably reconstruct protein signaling
pathways, data from genomics (Gene Regulatory Network and Transcriptional
Regulatory Network) and prior biological knowledgebases must be incorporated
as well. In fact, this is one of the major objectives of Systems Biology. We
illustrate the idea behind such an integrative approach as follows:
Fig. 3 True network, undirected protein network and directed gene regulatory network
The network obtained from a proteomics dataset is inherently undirected (Markov
Network = G_PPI), as it captures the correlation or covariance of proteins (and
protein complexes) rather than true causality. To integrate the undirected proteomic
network with the directed gene regulatory network (Bayesian Network = G_RN), we
need to infer the causality from the data and merge the two to produce the final
causal network through a consensus algorithm.
5 Data Sources
The required datasets are divided into two broad categories: a) synthetic
(in silico) and b) public datasets. For synthetic data, GeneNetWeaver [13] is
proposed. For real datasets, the following data sources are considered:
– DNA sequence data (e.g., GenBank and EBI)
– RNA sequence data (e.g., NCBI and Rfam)
– GWAS data (e.g., dbSNP and HapMap)
– Protein sequence data (e.g., UniProt, PIR and RefSeq)
– Protein class and classification (e.g., Pfam, IntDom, and GO)
– Gene structural (e.g., ChEBI, KEGG ligand Database, and PDB)
– Genomics (e.g., SMD, Entrez Gene, KEGG, and MetaCyc)
– Signaling pathway (e.g., ChemProt and Reactome)
– Metabolomics (e.g., BioCyc, HMDB, and MMCD)
– Protein-protein interaction (e.g., MIPS, BioGrid, IntAct, DIP, MiMI)
Cancer datasets can be obtained from [7].
6 Timeline - 1st Year
• Further study of high-throughput approaches (and computational models) for
proteomics and metabolomics; network (pathway/structural) analysis, causality
inference and variations, for both normal and disease phenotypes - 1 month
• A coherent framework for incorporating prior knowledge (epigenetics,
knowledgebases, etc.) into the Bayesian Network - 2 months
• Integrative framework for heterogeneous (genomic and proteomic to start
with) data sources - 3 months
• Building of the Bayesian Network model - 2 months
• Model validation against synthetic data and the aforementioned public data
(normal genotype and phenotype) - 2 months
• Model validation against cancer (disease phenotype) pathways - 2 months
References
1. Akutsu, T., Miyano, S., Kuhara, S.: Inferring qualitative relations in genetic networks and
metabolic pathways. Bioinformatics 16(8), 727–734 (2000). DOI 10.1093/bioinformatics/16.8.727.
URL http://dx.doi.org/10.1093/bioinformatics/16.8.727
2. Brazhnik, P., de la Fuente, A., Mendes, P.: Gene networks: how to put the function in
genomics. TRENDS in Biotechnology 20(11), 467–472 (2002)
3. Chen, G.: Discordant Protein and mRNA Expression in Lung Adenocarcinomas. Molecular &
Cellular Proteomics 1(4), 304–313 (2002). DOI 10.1074/mcp.m200008-mcp200. URL
http://dx.doi.org/10.1074/mcp.m200008-mcp200

4. D’haeseleer, P., Liang, S., Somogyi, R.: Genetic network inference: from co-expression
clustering to reverse engineering. Bioinformatics 16(8), 707–726 (2000). DOI
10.1093/bioinformatics/16.8.707. URL http://dx.doi.org/10.1093/bioinformatics/16.8.707
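5. Friedman, N., Koller, D.: Being Bayesian about network structure. A Bayesian approach
to structure discovery in Bayesian networks. Machine Learning 50(1/2), 95–125 (2003).
DOI 10.1023/a:1020249912095. URL http://dx.doi.org/10.1023/a:1020249912095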
6. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian Networks to Analyze
Expression Data. Journal of Computational Biology 7(3-4), 601–620 (2000). DOI
10.1089/106652700750050961. URL http://dx.doi.org/10.1089/106652700750050961
7. Gadaleta, E., Cutts, R.J., Kelly, G.P., Crnogorac-Jurcevic, T., Kocher, H.M., Lemoine,
N.R., Chelala, C.: A global insight into a cancer transcriptional space using pancreatic
data: importance, findings and flaws. Nucleic acids research 39(18), 7900–7907 (2011)
8. Gygi, S.P., Rochon, Y., Franza, B.R., Aebersold, R.: Correlation between protein and
mRNA abundance in yeast. Molecular and cellular biology 19(3), 1720–1730 (1999)
9. Hecker, M., Lambeck, S., Toepfer, S., van Someren, E., Guthke, R.: Gene regulatory
network inference: Data integration in dynamic models—A review. Biosystems 96(1),
86–103 (2009). DOI 10.1016/j.biosystems.2008.12.004. URL
http://dx.doi.org/10.1016/j.biosystems.2008.12.004
10. Lezon, T.R., Banavar, J.R., Cieplak, M., Maritan, A., Fedoroff, N.V.: Using the principle of
entropy maximization to infer genetic interaction networks from gene expression patterns.
Proceedings of the National Academy of Sciences 103(50), 19033–19038 (2006). DOI
10.1073/pnas.0609152103. URL http://dx.doi.org/10.1073/pnas.0609152103
11. Myers, C.L., Robson, D., Wible, A., Hibbs, M.A., Chiriac, C., Theesfeld, C.L., Dolinski, K.,
Troyanskaya, O.G.: Discovery of biological networks from diverse functional genomic data.
Genome Biology 6(13), R114 (2005). DOI 10.1186/gb-2005-6-13-r114.
URL http://dx.doi.org/10.1186/gb-2005-6-13-r114
12. Pokholok, D.K., Harbison, C.T., Levine, S., Cole, M., Hannett, N.M., Lee, T.I., Bell,
G.W., Walker, K., Rolfe, P.A., Herbolsheimer, E., et al.: Genome-wide map of nucleosome
acetylation and methylation in yeast. Cell 122(4), 517–527 (2005)
13. Schaffter, T., Marbach, D., Floreano, D.: GeneNetWeaver: in silico benchmark generation
and performance profiling of network inference methods. Bioinformatics 27(16), 2263–2270
(2011)
14. Tamada, Y., Kim, S., Bannai, H., Imoto, S., Tashiro, K., Kuhara, S., Miyano, S.: Estimat-
ing gene networks from gene expression data by combining Bayesian network model with
promoter element detection. Bioinformatics 19(Suppl 2), ii227–ii236 (2003). DOI 10.1093/
bioinformatics/btg1082. URL http://dx.doi.org/10.1093/bioinformatics/btg1082
15. Thiesson, B., Meek, C., Chickering, D.M., Heckerman, D.: Learning mixtures of DAG mod-
els. In: Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence,
pp. 504–513. Morgan Kaufmann Publishers Inc. (1998)
16. Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., Botstein, D.: A Bayesian
framework for combining heterogeneous data sources for gene function prediction (in Sac-
charomyces cerevisiae). Proceedings of the National Academy of Sciences 100(14), 8348–
8353 (2003)
17. Tsang, D.P., Cheng, A.S.: Epigenetic regulation of signaling pathways in cancer: role of
the histone methyltransferase EZH2. Journal of gastroenterology and hepatology 26(1),
19–27 (2011)
18. Waters, K.M., Liu, T., Quesenberry, R.D., Willse, A.R., Bandyopadhyay, S., Kathmann,
L.E., Weber, T.J., Smith, R.D., Wiley, H.S., Thrall, B.D.: Network Analysis of Epidermal
Growth Factor Signaling Using Integrated Genomic, Proteomic and Phosphorylation Data.
PLoS ONE 7(3), e34515 (2012). DOI 10.1371/journal.pone.0034515. URL
http://dx.doi.org/10.1371/journal.pone.0034515
19. Zhang, Y., Lv, J., Liu, H., Zhu, J., Su, J., Wu, Q., Qi, Y., Wang, F., Li, X.: HHMD:
the human histone modification database. Nucleic acids research 38(suppl 1), D149–D154
(2010)
