Signaling Pathways Reconstruction and Inference of
Gene Regulatory Networks In Cancer Phenotypes
Integrating Genomic and Proteomic Datasets Refined
By Epigenetic Prior and Knowledge Prior
Debasish Bose
Abstract Gene Regulatory Network (GRN) inference, together with an understanding
of cellular signaling pathways, is an important challenge in Systems Biology and
holds strong promise to positively disrupt the health sector, especially through
the development of targeted and personalized medicines. Though a plethora of
research has been done to reverse engineer the regulatory network from mRNA/TF
levels alone, many problems remain unaddressed, such as the integration of disparate
data sources like Proteomics and Metabolomics and the automatic incorporation of
Knowledge Prior and Epigenetic Prior. This document describes different approaches to
reconstructing signaling pathways (along with reverse engineering of GRNs), surveys
the literature, and proposes research around a Dynamic Bayesian Network (DBN)
model (with local Gaussian Mixture distributions to factor in hidden molecular
processes) capable of systematically including various -omics datasets and priors
(epigenetic and knowledge) while inferring the most probable network topology (G) of
some key cancer signaling pathways. The proposal is aimed at a Ph.D. degree in bioinformatics.
1 Introduction
A gene regulatory network (GRN, or genetic network) is a collection of genes
in a cell that interact with each other indirectly through their products (i.e.,
RNAs and proteins); the regulatory relationships between gene activities are
mediated by proteins and metabolites. In other words, gene regulatory networks
are high-level descriptions of cellular biochemistry, in which only the
transcriptome is considered and all biochemical processes underlying gene–gene
interactions are implicitly present. Signaling pathways operate at a higher level than
the gene-gene space, describing the Protein-Protein Interaction (PPI) cascade (proteins
and protein complexes), which often terminates with Transcription Factors (TFs)
that in turn up- or down-regulate a set of genes. Reconstruction of signaling
pathways (along with the associated GRNs) is critical not only for elucidating the underlying
molecular processes, but also for designing highly targeted and personalized drugs.
Fig. 1 Gene regulation at different functional "spaces" [2]
2 Literature Survey
Several machine learning and statistical methods have been proposed for the
problem ([1]; [4]; [9]; [10]; [11]), and Bayesian network (BN) models have gained popularity
for the task of inferring gene networks ([5]; [6]).
Because of the complexity and sparsity of regulatory pathways (sparse because
N ≪ D, where N denotes the number of observations and D the dimensionality of the
dataset, typically the number of genes in an mRNA/microarray dataset) and the
noisy nature of experimental data, pure machine learning and statistical
methods may lead to poor reconstruction accuracy for the underlying molecular
network. A promising approach in this direction has been proposed by [14]. The
authors formulate the learning scheme in a Bayesian framework. This scheme
allows the systematic integration of gene expression data with biological knowledge
from other types of postgenomic data or the literature via a prior distribution
over network structures. The hyperparameters (β) of this distribution are inferred
together with the network structure in a maximum a posteriori (MAP) sense by
maximizing the joint posterior distribution with a heuristic greedy optimization
algorithm. Their method is based on mRNA/TF data, yet TFs are often
realized by protein complexes. Without integrating a PPI dataset it seems difficult
to reconstruct the complete signaling and regulatory pathway. For example,
studies comparing mRNA and protein expression profiles have indicated that mRNA
changes are unreliable predictors of protein abundance ([3]; [8]).
In the field of GRN inference and the associated domain of signaling pathway analysis,
two questions are gaining more and more relevance from the perspective of
reconstruction accuracy, and the central proposal of this document takes the form of a
Bayesian framework for answering them:
• How to integrate disparate data sources (Proteomics, Metabolomics,
other -omics and experiments under varied conditions) into the already
promising Dynamic Bayesian Network? The necessity for integrated data
analysis across 'omics platforms is further driven by the desire to identify
fundamental properties of biological networks, such as redundancy, modularity,
robustness, feedback control and motifs. Such properties provide the underlying structure
of signaling networks, yet they are difficult to specify using a single type of
analytical measurement [18]. Similar work has been done by [16]; however, this proposal
differs in a number of ways. Firstly, they took a Bayesian approach to data integration
in which the weight of an edge connecting a pair of proteins is the posterior probability of
a functional relationship between the proteins given all observed evidence for the
pair. This is a pair-wise approach with Naive Bayes, not a full-blown
Bayesian Network or Markov Network (Probabilistic Graphical Model) with
automatic structure learning. Secondly, causality is important to pathway analysis
and to the entire reconstruction process. They reconstructed undirected proteomic
networks, whereas this document proposes building directed Bayesian Networks (with
causality automatically learned from data) that merge genomics and proteomics
data (Fig. 3). Lastly, they formally assessed the prior probabilities from seven
experts in the field of yeast molecular biology, whereas this document proposes a Bayesian
approach for the automatic incorporation of prior knowledge.
• How to incorporate our prior knowledge about pathways and regulatory
networks in a systematic way, without ignoring epigenetic variations
(Epigenetic Prior [17])? In fact, epigenetic variations (especially methylation and
histone modifications) and the associated variations in regulatory networks are critical
in designing and delivering personalized medicine with high efficacy.
3 Core Proposition
This document proposes the following algorithmic procedure, called DBNConsensus:
1. Learning of a Dynamic Bayesian Network from microarray and transcriptomics
(DNA-TF) data, where causality is easier to derive
2. Learning of a Markov Network from all other -omics (Proteomics in particular)
data, where causality is deduced from data
3. Entities (variables or nodes in the graph) in the networks (genes, proteins and
protein complexes) need not be identical across the various datasets
4. Use a semi-parametric method (e.g., Gaussian Mixture) for local probability
distributions while learning graph structures
5. Incorporate prior knowledge automatically while learning the structures
6. Merge all such networks to produce the final causal network, using
a consensus algorithm
4 Methodology
A Bayesian Network model is proposed for addressing aforementioned questions.
4.1 Bayesian Networks (BNs)
BNs are directed graphical models for representing probabilistic independence
relations between multiple interacting entities. Formally, a BN is defined by a
graphical structure G, a family of (conditional) probability distributions F, and their
parameters θG, which together specify a joint distribution over a set of random
variables X of interest. The graphical structure G of a BN consists of a set of nodes
and a set of directed edges. The nodes represent random variables, while the edges
indicate conditional dependence relations. When we have a directed edge from
node A to node B, then A is called the parent of B, and B is called the child of
A. The structure G of a BN has to be a directed acyclic graph (DAG), that is,
a network without any directed cycles. This structure defines a unique rule for
expanding the joint probability in terms of simpler conditional probabilities. Let
X1, X2, ..., XN be a set of random variables represented by the nodes i ∈ {1, 2, ..., N}
in the graph, and define Xπi to be the parents of node Xi in graph G. Then
$$P(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid X_{\pi_i}) \tag{1}$$
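As a minimal sketch of the factorisation in Eq. (1), the following Python snippet multiplies local conditionals for a toy network of three binary variables. The network `A -> B`, `A -> C`, the variable names, and every probability value are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical three-gene network: A -> B, A -> C (all variables binary).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[node][parent_values] = P(node = 1 | parents); values are made up.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}

def joint_probability(assignment):
    """Multiply the local conditionals P(X_i | X_pi_i), as in Eq. (1)."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p_one = cpt[node][pa_values]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

# The factorised joint sums to one over all 2^3 assignments.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
```

Because each factor only involves a node and its parents, the number of parameters grows with the size of the parent sets rather than with the full joint table.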
When adopting a score-based approach (score and search), our objective is to
maximize the graph posterior probability P(G|D), which is obtained by applying
Bayes' rule over network structures
P(G|D) ∝ P(D|G)P(G) (2)
Where D is the dataset, P(G) is the prior over structure. P(D|G) is Marginal
Likelihood and is obtained by averaging over all the parameters
$$P(D \mid G) = \int P(D \mid G, \theta_G)\, P(\theta_G \mid G)\, d\theta_G \tag{3}$$
Where P(D|G, θG) is the likelihood and P(θG|G) is the prior over parameters.
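Since the normalising constant P(D) does not depend on G, candidate structures can be compared on the log scale without computing it. The short derivation below uses only Eq. (2) and Bayes' rule; G_1 and G_2 are names introduced here for any two candidate structures.

```latex
\begin{align*}
\log P(G \mid D) &= \log P(D \mid G) + \log P(G) - \log P(D), \\
\log \frac{P(G_1 \mid D)}{P(G_2 \mid D)}
  &= \bigl[\log P(D \mid G_1) + \log P(G_1)\bigr]
   - \bigl[\log P(D \mid G_2) + \log P(G_2)\bigr].
\end{align*}
```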
4.2 Local Distribution Families
To compute P(D|G) we need to consider function families F for local distributions.
Two common function families are
– Unrestricted Multinomial Distribution with Dirichlet prior (discrete)
– Linear Gaussian Distribution with normal-Wishart prior (continuous)
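For the discrete family, the integral in Eq. (3) has a closed form per node. The sketch below computes the log marginal likelihood of one discrete node under a multinomial likelihood with a Dirichlet prior. The function name, the uniform pseudo-count `alpha`, and the toy data are assumptions made for illustration, not choices taken from this proposal.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(child, parent_rows, n_states, alpha=1.0):
    """Log marginal likelihood of one discrete node given its parents.

    child: observed child states (ints in range(n_states))
    parent_rows: one tuple of parent states per observation
    alpha: Dirichlet pseudo-count per child state (assumed uniform prior)
    """
    counts = Counter(zip(parent_rows, child))
    total = 0.0
    for config in set(parent_rows):
        n_jk = [counts[(config, k)] for k in range(n_states)]
        a_j = alpha * n_states
        total += lgamma(a_j) - lgamma(a_j + sum(n_jk))
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in n_jk)
    return total

# Toy data in which the child copies its single parent.
x = [0, 0, 0, 0, 1, 1, 1, 1]
with_parent = log_marginal_likelihood(x, [(v,) for v in x], n_states=2)
without_parent = log_marginal_likelihood(x, [()] * len(x), n_states=2)
```

On this toy data the score with the parent is higher than the score without it, which is the signal a score-and-search procedure exploits when deciding whether to add an edge.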
However, both of these families are parametric, and Genomics/Proteomics
datasets are often "multi-modal", manifesting different modalities under different
experimental conditions, epigenetic variations and disease pathogenesis. In fact, the
focus of the current proposal on integrative approaches stems from this "multi-modal"
nature of -omics data. To cope with the multi-modality of data at the local
distribution level, a suitable semi-parametric distribution is proposed, for example a
Gaussian Mixture. Mixture models can describe potentially complex distributions
of gene expression across a wide range of conditions
$$P(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid X_{\pi_i}) = \prod_{i=1}^{N} \frac{P(X_i, X_{\pi_i})}{P(X_{\pi_i})} \tag{4}$$
Where,
$$P(X_i, X_{\pi_i}) = \sum_{k=1}^{K_i} \alpha_{ik}\, \mathcal{N}(X_i^T \mid \mu_{ik}, \Sigma_{ik}); \quad X_i^T = X_i \cup X_{\pi_i}; \quad \Lambda_i = |X_{\pi_i}| + 1 \tag{5}$$
P(Xi, Xπi) is a mixture of Ki components; each component is a multivariate
normal (MVN) distribution with mean vector µik of dimension Λi and
covariance matrix Σik of dimension Λi × Λi. Gaussian mixture models,
when applied to the local distributions of a BN, can loosely model the underlying
latent subnetworks; the numbers of components {Ki}, i = 1..N, can be deduced from k-means
or hierarchical clustering.
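To make the ratio in Eq. (4) concrete, here is a minimal sketch for a node with a single parent, so that each mixture component in Eq. (5) is a two-dimensional Gaussian. The two components, their weights, means and covariances are hypothetical; a real model would fit them to data and handle larger parent sets.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2.0 * pi * var)

def bivariate_normal_pdf(x, y, mean, cov):
    """Density of a 2-D Gaussian; cov = ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x - mean[0], y - mean[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return exp(-0.5 * quad) / (2.0 * pi * sqrt(det))

# Hypothetical mixture: (weight, mean of (X_i, parent), covariance).
components = [
    (0.6, (0.0, 0.0), ((1.0, 0.8), (0.8, 1.0))),
    (0.4, (3.0, 2.0), ((0.5, -0.2), (-0.2, 0.7))),
]

def joint_density(x_i, x_pa):
    """P(X_i, X_pi_i) as a Gaussian mixture, Eq. (5)."""
    return sum(w * bivariate_normal_pdf(x_i, x_pa, m, c) for w, m, c in components)

def parent_density(x_pa):
    """P(X_pi_i): each component marginalised onto the parent dimension."""
    return sum(w * normal_pdf(x_pa, m[1], c[1][1]) for w, m, c in components)

def conditional_density(x_i, x_pa):
    """P(X_i | X_pi_i) = P(X_i, X_pi_i) / P(X_pi_i), the factor used in Eq. (4)."""
    return joint_density(x_i, x_pa) / parent_density(x_pa)
```

The resulting conditional is itself a mixture whose weights depend on the parent value, which is how the local distribution can switch between regimes.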
4.3 Temporal Patterns
The static BN discussed so far has two major disadvantages:
• Feedback loops (which are common in GRNs and signaling pathways) are not
allowed.
• It cannot model the temporal patterns of regulatory mechanisms.
The Dynamic Bayesian Network (DBN - [?]murphy1999modelling) was devised to
address those concerns. Assuming temporal transitions obey a first-order,
homogeneous Markov chain,
$$P(X_1, X_2, \ldots, X_N) = \prod_{t=2}^{T} \prod_{i=1}^{N} P(X_i(t) \mid X_{\pi_i}(t-1)) \tag{6}$$
A DBN with local Gaussian Mixture distributions is proposed for computing the likelihood term P(D|G).
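The double product in Eq. (6) can be sketched as a sum of log transition densities over time points and nodes. For brevity this example uses a single linear Gaussian transition per node instead of the proposed Gaussian mixture; the two-gene feedback loop, the weights and the noise variance are all hypothetical.

```python
from math import log, pi

# Hypothetical feedback loop: B(t-1) -> A(t) and A(t-1) -> B(t).
# Cyclic as a static graph, but acyclic once unrolled in time.
parents = {"A": ["B"], "B": ["A"]}
weights = {"A": {"B": -0.5}, "B": {"A": 0.9}}
noise_var = 0.1

def gaussian_logpdf(x, mean, var):
    return -0.5 * (log(2.0 * pi * var) + (x - mean) ** 2 / var)

def dbn_log_likelihood(series):
    """Sum log P(X_i(t) | X_pi_i(t-1)) over t = 2..T and all nodes, Eq. (6).

    series: dict mapping node name to a list of T measurements.
    """
    n_steps = len(next(iter(series.values())))
    total = 0.0
    for t in range(1, n_steps):  # t = 2..T in the 1-based notation of Eq. (6)
        for node, pa in parents.items():
            mean = sum(weights[node][p] * series[p][t - 1] for p in pa)
            total += gaussian_logpdf(series[node][t], mean, noise_var)
    return total
```

The first time point contributes no term, matching the lower limit t = 2 in Eq. (6).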
4.4 Prior - P(G)
Systematic incorporation of structural priors is one of the core proposals of this
document. Priors can come from disparate sources:
• Knowledge prior in various pathway databases like KEGG
• Knowledge prior in various PPI databases like MIPS, BioGrid
• Knowledge prior in various microarray databases like Stanford Microarray
Database
• Epigenetic prior of histone modification from [12] or [19]
We need to define a function that measures the agreement between a given
candidate network G and the biological prior knowledge that we have at our
disposal. We follow the approach proposed by [14] and call this measure the energy
E, borrowing the name from the statistical physics community. G is a candidate
network structure obtained while repeatedly evaluating the posterior P(G|D)
through a search algorithm like Greedy Hill Climber (GHC) or Metropolis-Hastings
(MH). The "energy of the network" is obtained as
$$E(G) = \sum_{1 \le i \le N} \sum_{1 \le j \le N} \left| \mathrm{Prior}(i, j) - G(i, j) \right| \tag{7}$$
And corresponding prior over graph is defined by the Gibbs Distribution
$$P(G \mid \beta) = \frac{e^{-\beta E(G)}}{Z(\beta)} \tag{8}$$
β is a hyperparameter that corresponds to an inverse temperature in statistical
physics, and the denominator is a normalizing constant that is usually referred to
as the partition function Z(β)
$$Z(\beta) = \sum_{G \in \Omega} e^{-\beta E(G)} \tag{9}$$
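Equations (7) to (9) can be checked by brute force on a very small network. The sketch below assumes, for illustration only, that Ω is the set of all N × N binary adjacency matrices and that the prior matrix holds made-up confidences in [0, 1]; enumeration like this is only feasible for tiny N.

```python
from itertools import product
from math import exp

def energy(prior, graph):
    """Eq. (7): summed absolute disagreement between prior and adjacency matrix."""
    n = len(graph)
    return sum(abs(prior[i][j] - graph[i][j]) for i in range(n) for j in range(n))

def all_graphs(n):
    """Enumerate every n x n binary adjacency matrix (the assumed space Omega)."""
    for bits in product((0, 1), repeat=n * n):
        yield [list(bits[r * n:(r + 1) * n]) for r in range(n)]

def partition_function(prior, beta):
    """Eq. (9): brute-force sum over Omega."""
    return sum(exp(-beta * energy(prior, g)) for g in all_graphs(len(prior)))

def gibbs_prior(prior, graph, beta):
    """Eq. (8): Gibbs distribution over structures."""
    return exp(-beta * energy(prior, graph)) / partition_function(prior, beta)

prior = [[0.0, 0.9], [0.1, 0.0]]   # hypothetical prior confidence per edge
agrees = [[0, 1], [0, 0]]          # contains the edge the prior favours
disagrees = [[0, 0], [1, 0]]       # contains only the disfavoured edge
```

With β = 0 the prior is flat over Ω, and increasing β concentrates probability on structures that agree with the prior matrix.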
Now that we have the necessary framework for computing P(G), we
need a systematic procedure to derive the Prior(i, j) matrices by integrating the data
available from the different sources already mentioned. We propose a Bayesian Multinet
approach [15] for such an integration.
Fig. 2 Integration of different priors through an MDAG model. BNP-i represents the i-th prior.
$$P(X) = \sum_{k=1}^{L} \pi_k\, P(X = x \mid Z = k, G_k, \theta_{G_k}) \tag{10}$$
The advantage of Bayesian multinets over more traditional graphical models is
the ability to represent context-specific independencies: situations in which subsets
of variables exhibit certain conditional independencies for some, but not all, values
of a conditioning variable. These context-specific independencies, and the corresponding
existence of a distinguished (latent) variable Z, model the variance within
prior knowledge.
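A toy instance of the mixture in Eq. (10) shows what a context-specific independence looks like. In this sketch, which uses two binary variables and made-up probabilities, B depends on A in context Z = 1 but is independent of A in context Z = 2; the component functions and mixing weights are hypothetical.

```python
from itertools import product

mixing = {1: 0.7, 2: 0.3}   # pi_k: weights of the two contexts (hypothetical)

def component_1(a, b):
    """Context Z = 1: structure A -> B, so B depends on A."""
    p_a = 0.4 if a == 1 else 0.6
    p_b1 = 0.9 if a == 1 else 0.2
    return p_a * (p_b1 if b == 1 else 1.0 - p_b1)

def component_2(a, b):
    """Context Z = 2: empty graph, so A and B are independent."""
    p_a = 0.4 if a == 1 else 0.6
    return p_a * 0.5

components = {1: component_1, 2: component_2}

def multinet_probability(a, b):
    """Eq. (10): mixture over context-specific DAG models."""
    return sum(mixing[k] * components[k](a, b) for k in mixing)

total = sum(multinet_probability(a, b) for a, b in product((0, 1), repeat=2))
```

Each component carries its own structure G_k, which a single DAG over A and B could not express.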
4.5 Computation of Posterior - P(G|D, β)
$$P(G \mid D, \beta) = \frac{P(D \mid G, \beta)\, P(G \mid \beta)}{\sum_{G'} P(D \mid G', \beta)\, P(G' \mid \beta)} \tag{11}$$
Because the denominator Σ_{G′} P(D|G′, β) P(G′|β) is intractable, this
document proposes a Markov Chain Monte Carlo (MCMC) scheme with the
Metropolis-Hastings algorithm to sample both network structures (G) and
hyperparameters (β) from the posterior distribution. A restricted search space (parent
node configurations as opposed to network structures) is proposed, similar to [5].
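The following is a deliberately simplified sketch of Metropolis-Hastings sampling over structures, not the restricted-search-space scheme proposed above. It assumes β is held fixed (so Z(β) cancels in the acceptance ratio), proposes single-edge flips over unconstrained binary adjacency matrices (self-edges such as Xi(t−1) → Xi(t) included), and takes the log-likelihood as a user-supplied function. All names are illustrative.

```python
import random
from math import log

def energy(prior, graph):
    """Eq. (7): disagreement between the prior matrix and a candidate graph."""
    n = len(graph)
    return sum(abs(prior[i][j] - graph[i][j]) for i in range(n) for j in range(n))

def metropolis_hastings(prior, log_likelihood, beta, n_steps, seed=0):
    """Sample adjacency matrices from P(G | D, beta) with beta held fixed."""
    rng = random.Random(seed)
    n = len(prior)
    graph = [[0] * n for _ in range(n)]
    current = log_likelihood(graph) - beta * energy(prior, graph)
    samples = []
    for _ in range(n_steps):
        i, j = rng.randrange(n), rng.randrange(n)
        graph[i][j] ^= 1                      # propose: flip one edge (symmetric move)
        proposed = log_likelihood(graph) - beta * energy(prior, graph)
        if log(1.0 - rng.random()) < proposed - current:
            current = proposed                # accept the proposal
        else:
            graph[i][j] ^= 1                  # reject: undo the flip
        samples.append([row[:] for row in graph])
    return samples
```

Sampling β as well would require the ratio Z(β)/Z(β′) in the acceptance probability, which is why the sketch keeps β fixed.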
4.6 Integration of Genomics and Proteomics Data
This document proposes an integrative approach to incorporating Genomics
(microarray, RNA-Seq, etc.) and Proteomics (Protein-Protein Interaction)
datasets. Though high-throughput proteomics (e.g., mass spectrometry) has
generated an enormous amount of data, the probability of false negatives is rather high
in such protein interactome datasets. To reliably reconstruct protein signaling
pathways, data from genomics (Gene Regulatory Network and Transcriptional
Regulatory Network) and prior biological knowledge bases must be incorporated
as well. In fact, this is one of the major objectives of Systems Biology. We illustrate
the idea behind such an integrative approach as follows:
Fig. 3 True network, undirected protein network and directed gene regulatory network
The network obtained from a proteomics dataset is inherently undirected (Markov
Network = G_PPI), as it captures the correlation or covariance of proteins (and
protein complexes) rather than true causality. To integrate the undirected proteomic
network with the directed gene regulatory network (Bayesian Network = G_RN), we
need to infer the causality from the data and merge the two to produce the final causal
network through a consensus algorithm.
5 Data Sources
The required datasets are divided into two broad categories: a) synthetic (in-silico) and
b) public datasets. For synthetic data, GeneNetWeaver [13] is proposed. For real
datasets, the following data sources are considered:
– DNA-sequence data (e.g., GenBank and EBI)
– RNA sequence data (e.g., NCBI and Rfam)
– GWAS data (e.g., dbSNP and HapMap)
– Protein sequence data (e.g., UniProt, PIR and RefSeq)
– Protein class and classification (e.g., Pfam, IntDom, and GO)
– Gene structural (e.g., ChEBI, KEGG ligand Database, and PDB)
– Genomics (e.g., SMD, Entrez Gene, KEGG, and MetaCyc)
– Signaling pathway (e.g., ChemProt and Reactome)
– Metabolomics (e.g., BioCyc, HMDB, and MMCD)
– Protein-protein interaction (e.g., MIPS, BioGrid, IntAct, DIP, MiMI)
Cancer datasets can be obtained from [7].
6 Timeline - 1st Year
• Further study of high-throughput approaches (and computational models) to
proteomics and metabolomics; network (pathway/structural) analysis, causality
inference and variations, in both normal and disease phenotypes - 1 month
• A coherent framework for incorporating prior knowledge (epigenetics, knowledge bases, etc.)
into the Bayesian Network - 2 months
• An integrative framework for heterogeneous (genomic and proteomic to start
with) data sources - 3 months
• Building of the Bayesian Network model - 2 months
• Model validation against synthetic data and the aforementioned public data
(normal genotype and phenotype) - 2 months
• Model validation against cancer (disease phenotype) pathways - 2 months
References
1. Akutsu, T., Miyano, S., Kuhara, S.: Inferring qualitative relations in genetic networks and
metabolic pathways. Bioinformatics 16(8), 727–734 (2000). DOI 10.1093/bioinformatics/
16.8.727. URL http://dx.doi.org/10.1093/bioinformatics/16.8.727
2. Brazhnik, P., de la Fuente, A., Mendes, P.: Gene networks: how to put the function in
genomics. TRENDS in Biotechnology 20(11), 467–472 (2002)
3. Chen, G.: Discordant Protein and mRNA Expression in Lung Adenocarcinomas. Molecular
Cellular Proteomics 1(4), 304–313 (2002). DOI 10.1074/mcp.m200008-mcp200. URL
http://dx.doi.org/10.1074/mcp.m200008-mcp200
4. D’haeseleer, P., Liang, S., Somogyi, R.: Genetic network inference: from co-expression
clustering to reverse engineering. Bioinformatics 16(8), 707–726 (2000). DOI 10.1093/
bioinformatics/16.8.707. URL http://dx.doi.org/10.1093/bioinformatics/16.8.707
5. Friedman, N., Koller, D.: Machine Learning 50(1/2), 95–125 (2003). DOI 10.1023/a:
1020249912095. URL http://dx.doi.org/10.1023/a:1020249912095
6. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian Networks to Analyze
Expression Data. Journal of Computational Biology 7(3-4), 601–620 (2000). DOI 10.
1089/106652700750050961. URL http://dx.doi.org/10.1089/106652700750050961
7. Gadaleta, E., Cutts, R.J., Kelly, G.P., Crnogorac-Jurcevic, T., Kocher, H.M., Lemoine,
N.R., Chelala, C.: A global insight into a cancer transcriptional space using pancreatic
data: importance, findings and flaws. Nucleic acids research 39(18), 7900–7907 (2011)
8. Gygi, S.P., Rochon, Y., Franza, B.R., Aebersold, R.: Correlation between protein and
mRNA abundance in yeast. Molecular and cellular biology 19(3), 1720–1730 (1999)
9. Hecker, M., Lambeck, S., Toepfer, S., van Someren, E., Guthke, R.: Gene regulatory
network inference: Data integration in dynamic models—A review. Biosystems 96(1),
86–103 (2009). DOI 10.1016/j.biosystems.2008.12.004. URL http://dx.doi.org/10.1016/
j.biosystems.2008.12.004
10. Lezon, T.R., Banavar, J.R., Cieplak, M., Maritan, A., Fedoroff, N.V.: Using the principle of
entropy maximization to infer genetic interaction networks from gene expression patterns.
Proceedings of the National Academy of Sciences 103(50), 19,033–19,038 (2006). DOI
10.1073/pnas.0609152103. URL http://dx.doi.org/10.1073/pnas.0609152103
11. Myers, C.L., Robson, D., Wible, A., Hibbs, M.A., Chiriac, C., Theesfeld, C.L., Dolinski, K.,
Troyanskaya, O.G.: Genome Biology 6(13), R114 (2005). DOI 10.1186/gb-2005-6-13-r114.
URL http://dx.doi.org/10.1186/gb-2005-6-13-r114
12. Pokholok, D.K., Harbison, C.T., Levine, S., Cole, M., Hannett, N.M., Lee, T.I., Bell,
G.W., Walker, K., Rolfe, P.A., Herbolsheimer, E., et al.: Genome-wide map of nucleosome
acetylation and methylation in yeast. Cell 122(4), 517–527 (2005)
13. Schaffter, T., Marbach, D., Floreano, D.: GeneNetWeaver: in silico benchmark generation
and performance profiling of network inference methods. Bioinformatics 27(16), 2263–2270
(2011)
14. Tamada, Y., Kim, S., Bannai, H., Imoto, S., Tashiro, K., Kuhara, S., Miyano, S.: Estimat-
ing gene networks from gene expression data by combining Bayesian network model with
promoter element detection. Bioinformatics 19(Suppl 2), ii227–ii236 (2003). DOI 10.1093/
bioinformatics/btg1082. URL http://dx.doi.org/10.1093/bioinformatics/btg1082
15. Thiesson, B., Meek, C., Chickering, D.M., Heckerman, D.: Learning mixtures of DAG mod-
els. In: Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence,
pp. 504–513. Morgan Kaufmann Publishers Inc. (1998)
16. Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., Botstein, D.: A Bayesian
framework for combining heterogeneous data sources for gene function prediction (in Sac-
charomyces cerevisiae). Proceedings of the National Academy of Sciences 100(14), 8348–
8353 (2003)
17. Tsang, D.P., Cheng, A.S.: Epigenetic regulation of signaling pathways in cancer: role of
the histone methyltransferase EZH2. Journal of gastroenterology and hepatology 26(1),
19–27 (2011)
18. Waters, K.M., Liu, T., Quesenberry, R.D., Willse, A.R., Bandyopadhyay, S., Kathmann,
L.E., Weber, T.J., Smith, R.D., Wiley, H.S., Thrall, B.D.: Network Analysis of Epidermal
Growth Factor Signaling Using Integrated Genomic, Proteomic and Phosphorylation Data.
PLoS ONE 7(3), e34,515 (2012). DOI 10.1371/journal.pone.0034515. URL http://dx.
doi.org/10.1371/journal.pone.0034515
19. Zhang, Y., Lv, J., Liu, H., Zhu, J., Su, J., Wu, Q., Qi, Y., Wang, F., Li, X.: HHMD:
the human histone modification database. Nucleic acids research 38(suppl 1), D149–D154
(2010)

Enrico Glaab
 
Spatial_final
Spatial_finalSpatial_final
Spatial_final
Ian Johnston
 
Sample Work For Engineering Literature Review and Gap Identification
Sample Work For Engineering Literature Review and Gap IdentificationSample Work For Engineering Literature Review and Gap Identification
Sample Work For Engineering Literature Review and Gap Identification
PhD Assistance
 

Similar to BOSE, Debasish - Research Plan (20)

An Explainable Unsupervised Framework For Alignment-Free Protein Classificati...
An Explainable Unsupervised Framework For Alignment-Free Protein Classificati...An Explainable Unsupervised Framework For Alignment-Free Protein Classificati...
An Explainable Unsupervised Framework For Alignment-Free Protein Classificati...
 
Proteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data setsProteomics - Analysis and integration of large-scale data sets
Proteomics - Analysis and integration of large-scale data sets
 
A comparative study of covariance selection models for the inference of gene ...
A comparative study of covariance selection models for the inference of gene ...A comparative study of covariance selection models for the inference of gene ...
A comparative study of covariance selection models for the inference of gene ...
 
STRING - Modeling of pathways through cross-species integration of large-scal...
STRING - Modeling of pathways through cross-species integration of large-scal...STRING - Modeling of pathways through cross-species integration of large-scal...
STRING - Modeling of pathways through cross-species integration of large-scal...
 
presentation
presentationpresentation
presentation
 
Presage database
Presage databasePresage database
Presage database
 
Novel modelling of clustering for enhanced classification performance on gene...
Novel modelling of clustering for enhanced classification performance on gene...Novel modelling of clustering for enhanced classification performance on gene...
Novel modelling of clustering for enhanced classification performance on gene...
 
GLUER integrative analysis of single-cell omics and imaging data by deep neur...
GLUER integrative analysis of single-cell omics and imaging data by deep neur...GLUER integrative analysis of single-cell omics and imaging data by deep neur...
GLUER integrative analysis of single-cell omics and imaging data by deep neur...
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Network Biology Lent 2010 - lecture 1
Network Biology Lent 2010 - lecture 1Network Biology Lent 2010 - lecture 1
Network Biology Lent 2010 - lecture 1
 
Network embedding in biomedical data science
Network embedding in biomedical data scienceNetwork embedding in biomedical data science
Network embedding in biomedical data science
 
The bayesian revolution in genetics
The bayesian revolution in geneticsThe bayesian revolution in genetics
The bayesian revolution in genetics
 
International Journal of Computer Science, Engineering and Information Techno...
International Journal of Computer Science, Engineering and Information Techno...International Journal of Computer Science, Engineering and Information Techno...
International Journal of Computer Science, Engineering and Information Techno...
 
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...
 
Multiscale and integrative single-cell Hi-C analysis with Higashi.pdf
Multiscale and integrative single-cell Hi-C analysis with Higashi.pdfMultiscale and integrative single-cell Hi-C analysis with Higashi.pdf
Multiscale and integrative single-cell Hi-C analysis with Higashi.pdf
 
A family of global protein shape descriptors using gauss integrals, christian...
A family of global protein shape descriptors using gauss integrals, christian...A family of global protein shape descriptors using gauss integrals, christian...
A family of global protein shape descriptors using gauss integrals, christian...
 
STRING - Cross-species integration of known and predicted protein-protein int...
STRING - Cross-species integration of known and predicted protein-protein int...STRING - Cross-species integration of known and predicted protein-protein int...
STRING - Cross-species integration of known and predicted protein-protein int...
 
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
EnrichNet: Graph-based statistic and web-application for gene/protein set enr...
 
Spatial_final
Spatial_finalSpatial_final
Spatial_final
 
Sample Work For Engineering Literature Review and Gap Identification
Sample Work For Engineering Literature Review and Gap IdentificationSample Work For Engineering Literature Review and Gap Identification
Sample Work For Engineering Literature Review and Gap Identification
 
