Pathway Discovery in Cancer: the Bayesian Approach


Francesco Gadaleta
Developed and written at ESAT dept. of Electrical Engineering of the Faculty of Engineering
Katholieke Universiteit Leuven (Belgium)

Genes and Diseases
Biological Assumptions
• Cancer normally originates in a single cell

• A cell's life is regulated by many genes, activated at different steps

Types Of Genes
• Oncogene
• Tumor-suppressor
• DNA-repair

Genes and Diseases:
genetic predisposition

[Figure: Normal cell → First mutation → Second mutation → Third mutation → Malignant cell]

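Genes and Diseases:
Microarray Technology

[Figure: two-colour microarray workflow. RNA is isolated from cancer cells and from normal cells; the mRNA is converted to cDNA by reverse transcriptase; cancer cDNA is labelled with red fluorescent probes and normal cDNA with green fluorescent probes; the two labelled targets are then combined.]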

Genes and Diseases:
the goal of biologists and geneticists

• Prenatal diagnosis for recognized diseases, e.g. Down syndrome

• Carrier testing to help couples with a hereditary disease in the risky
decision of having children

• Patient-tailored diagnosis for genetic diseases

Goal of this Thesis

• Microarray analysis with more complex tools

• Integrate into a single model what is already known from other
experiments

• Identify those genes that form disease pathways

Type Of Data

• Normalization (fluorescent intensity)

• Filtering of microarray data (how to select subsets of genes)

• Data Discretization (Are bio reactions discrete events?)
➡ Interval discretization

➡ Quantile discretization

➡ Exporting

Interval Discretization

• Sort n observations

• Divide the observations into d levels (uniformly spaced intervals)

• The i-th observation is discretized to the j-th level iff:

$$x_0 + \frac{j\,(x_{n-1} - x_0)}{d} < x_i < x_0 + \frac{(j+1)\,(x_{n-1} - x_0)}{d}$$
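
The interval rule above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code: the function name `interval_discretize` is hypothetical, and the handling of values that fall exactly on an interval boundary (assigned to the upper level, with the maximum clipped into the last level) is an assumption, since the slide only gives strict inequalities.

```python
def interval_discretize(values, d):
    """Map each value to one of d uniformly spaced levels (0 .. d-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All observations are identical: a single level is enough.
        return [0 for _ in values]
    width = (hi - lo) / d  # width of one interval
    levels = []
    for x in values:
        j = int((x - lo) / width)      # index of the interval containing x
        levels.append(min(j, d - 1))   # assumption: the maximum joins the last level
    return levels


if __name__ == "__main__":
    # One row of expression values, discretized into three levels.
    print(interval_discretize([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))
```

Sorting is not needed in this sketch because only the minimum and maximum of the observations enter the rule.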

Quantile Discretization

• Sort n observations

• Divide all observations into d levels by placing an equal number of
observations in each bin: all levels are equally represented

• The i-th observation (in sorted order) belongs to the j-th level iff:

$$\frac{j\,n}{d} < i < \frac{(j+1)\,n}{d}$$
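
A minimal Python sketch of the quantile rule follows. The name `quantile_discretize` is hypothetical, ranks are counted from zero, the lower bound is taken as inclusive, and ties are broken by original position; all of these are assumptions made for the illustration.

```python
def quantile_discretize(values, d):
    """Assign levels 0 .. d-1 so that each level holds about n/d observations."""
    n = len(values)
    # Indices of the observations in ascending order of value (stable for ties).
    order = sorted(range(n), key=lambda idx: values[idx])
    levels = [0] * n
    for rank, idx in enumerate(order):
        # rank is the zero-based position i in the sorted list;
        # j = floor(i * d / n) satisfies j*n/d <= i < (j+1)*n/d.
        levels[idx] = rank * d // n
    return levels


if __name__ == "__main__":
    print(quantile_discretize([5.0, 1.0, 3.0, 2.0, 6.0, 4.0], 3))
```

Unlike interval discretization, the result depends only on the ordering of the observations, not on their magnitudes, so a single outlier does not empty the intermediate levels.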

Exporting Data
Gene Name   Sample_0   Sample_1   Sample_2   Sample_3   ...   Sample_n
rpoH        29.345     30.431     25.125     29.543     ...   29.987
mopA        42.746     40.375     41.740     29.345     ...   29.345
htpG        29.345     29.345     29.345     29.345     ...   29.345
...
araE        29.345     29.345     29.345     29.345     ...   29.345

Knowledge Base

• Many biological processes are still not known

• Reliability of data
➡ hybridization is still a handmade process

• Small sample size - Huge number of genes
➡ integration with heterogeneous data

What do we want to solve?
• Genetic cancer forecasting?

• Need for a model to handle uncertain knowledge

• A model that biologists and epidemiologists can understand

• A model that can be updated over time

Bayesian Networks:
features

• Can handle uncertain knowledge with probability

• Can handle subsequent changes (bio noise, multiple measurements)

• Intuitive model a biologist can understand: white box vs. black box
(neural networks)

Bayesian Networks:
definition

• Directed Acyclic Graph
(how variables interact with each other)

• Set of local probability distributions
($p(x_i = k \mid Pa(x_i) = j) = \theta_{ijk}$)

[Figure: example DAG over the nodes A, B, C, D, E, F, G]

Bayesian Networks:
definition

• Each node of the example graph carries a local probability distribution conditioned on its parents:

p(A)   p(B)
p(C|A,B)   p(F|B)
p(E|C)   p(D|C)
p(G|F)
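
The local distributions attached to the example graph multiply into one joint distribution over all seven variables. The product below is a worked form of that statement, assuming the edges implied by the conditional terms on the slides (A→C, B→C, B→F, C→E, C→D, F→G):

```latex
p(A, B, C, D, E, F, G) =
  p(A)\, p(B)\, p(C \mid A, B)\, p(F \mid B)\,
  p(E \mid C)\, p(D \mid C)\, p(G \mid F)
```

Each factor involves only a node and its parents, which is what keeps the number of parameters small compared with a full joint table.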

Bayesian Networks:
formal assumptions

• Structure Possibility: each of the n! structures is possible, $p(S_i \mid \xi) > 0$
• Complete Data: no missing data, so that $p(D, S \mid \xi)$ and $p(C \mid D, S, \xi)$, with C a new observation, can be computed in closed form
• Markov Condition: allows the joint distribution to be factorized as $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid Pa(x_i))$
• Observational Equivalence: [Figure: two network structures over X1, ..., X5 shown side by side]
• Scoring Function: a function to measure how well a structure fits the data

Bayesian Networks:
structure learning

• Constraint Satisfaction Problem (CSP) vs. Optimization Problem (OP)

• CSP tries to discover dependencies from the data with a statistical
hypothesis test

• OP searches the space of structures and tries to improve the score
assigned by a scoring function

Bayesian Networks:
K2 algorithm

• Goal: maximize the structure probability given the data
• An initial order is given (A, B, C, D, E, F, G)

[Quality measure of the net given the data, by Cooper and Herskovits]

Bayesian Networks:
K2 algorithm
• Let D be the dataset and N the number of examples,

• G the network structure, $pa_{ij}$ the j-th instantiation of $Pa(x_i)$,

• $r_i$ the number of values of $x_i$ and $q_i$ the number of instantiations of $Pa(x_i)$,

• $N_{ijk}$ the number of data where $x_i = k$ and $Pa(x_i) = j$, and

• $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$

$$P(G, D) = P(G)\,P(D \mid G)$$

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

[Quality measure of the net given the data, by Cooper and Herskovits]
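
Because the factorials overflow quickly, the Cooper and Herskovits measure is usually evaluated in log space. The sketch below computes the logarithm of one node's factor in P(D|G), using `math.lgamma` with the identity lgamma(m + 1) = log(m!). The function name `log_k2_node_score`, the row-tuple data layout and the `arity` dictionary are assumptions made for the illustration, not part of the original slides.

```python
from collections import defaultdict
from math import lgamma


def log_k2_node_score(data, child, parents, arity):
    """Log of the factor of P(D|G) contributed by one node.

    data    : list of tuples of discretized values (one tuple per example)
    child   : column index of the node x_i
    parents : list of column indices forming Pa(x_i)
    arity   : dict mapping a column index to its number of levels r_i
    """
    r = arity[child]
    # counts[j][k] = N_ijk, with j the observed parent configuration.
    counts = defaultdict(lambda: [0] * r)
    for row in data:
        j = tuple(row[p] for p in parents)
        counts[j][row[child]] += 1

    score = 0.0
    for n_ijk in counts.values():
        n_ij = sum(n_ijk)
        # log (r_i - 1)! - log (N_ij + r_i - 1)!
        score += lgamma(r) - lgamma(n_ij + r)
        # sum over k of log N_ijk!
        score += sum(lgamma(n + 1) for n in n_ijk)
    # Unobserved parent configurations have N_ij = 0 and contribute log 1 = 0.
    return score


if __name__ == "__main__":
    toy = [(0, 0), (0, 0), (1, 1), (1, 1)]
    levels = {0: 2, 1: 2}
    print(log_k2_node_score(toy, child=1, parents=[0], arity=levels))
    print(log_k2_node_score(toy, child=1, parents=[], arity=levels))
```

Summing this quantity over all nodes gives log P(D|G); adding log P(G) gives log P(G, D).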

Bayesian Networks:
K2 algorithm

• Possible actions

• edge addition

• edge deletion
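
The greedy search can be sketched as follows. This is a simplified illustration that covers edge addition only: each node may take parents only from the nodes that precede it in the given order, which keeps the graph acyclic without an explicit check. The names `k2_search`, `local_score` and `max_parents` are hypothetical, and `local_score` stands for any decomposable scoring function where higher is better.

```python
def k2_search(order, local_score, max_parents=2):
    """Greedy parent selection along a fixed node order.

    order       : list of node names; parents come only from earlier nodes
    local_score : callable (node, list_of_parents) -> float, higher is better
    max_parents : upper bound on the number of parents per node
    """
    parents = {node: [] for node in order}
    for position, node in enumerate(order):
        candidates = list(order[:position])
        best = local_score(node, parents[node])
        while candidates and len(parents[node]) < max_parents:
            # Score every single-edge addition and keep the best one.
            scored = [(local_score(node, parents[node] + [c]), c)
                      for c in candidates]
            top_score, top = max(scored, key=lambda pair: pair[0])
            if top_score <= best:
                break  # no addition improves the score
            parents[node].append(top)
            candidates.remove(top)
            best = top_score
    return parents


if __name__ == "__main__":
    # Toy score: penalize every difference from a known parent set.
    truth = {"A": set(), "B": set(), "C": {"A", "B"}}

    def toy_score(node, pa):
        return -len(truth[node] ^ set(pa))

    print(k2_search(["A", "B", "C"], toy_score))
```

Edge deletion, the second action listed above, is not covered by this sketch; a search that also deletes edges would need to re-score the affected node after each removal.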

Data Integration

• heterogeneous data integration

• binary gene-gene relations

• Bayesian network collective learning
(Partial Integration)

Data Integration

• binary gene-gene relations
(Partial Integration)

[Figure: literature extraction pipeline. Gene list (Gene1 ... GeneN) → abstract indexing with a fixed vocabulary → cosine measure → gene similarity matrix, used as priors for a network over the genes G1 ... G9.]
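
The cosine measure step can be illustrated with a short Python sketch. It assumes each gene has already been indexed as a vector of term weights over the fixed vocabulary; the names `cosine` and `similarity_matrix`, the gene labels and the toy weights are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a gene with no indexed terms is similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(profiles):
    """Pairwise cosine similarities between gene profiles."""
    genes = sorted(profiles)
    return {g: {h: cosine(profiles[g], profiles[h]) for h in genes}
            for g in genes}


if __name__ == "__main__":
    # Toy term weights over a three-word vocabulary.
    profiles = {"G1": [1, 0, 2], "G2": [2, 0, 4], "G3": [0, 3, 0]}
    for gene, row in similarity_matrix(profiles).items():
        print(gene, row)
```

Entries close to 1 mark gene pairs that are described with similar vocabulary in the literature; how such values are turned into edge priors is not specified on the slide and is left out here.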

Data Integration

[Figure: the two data sources being integrated, microarray data and clinical data]

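Experiments and results

SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms

[Figure: SynTReN produces a synthetic model and synthetic data; the structure learning framework turns the synthetic data into a learned model; a validator compares the learned model with the synthetic model.]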

Experiments and results

• Results (random net + bio net (without clinical data))

• Idea: clinical data may improve structure learning and give more complete
biological models (encouraging, since this is a type of data medical centers
are already equipped with)

Learned Structure Network

Microarray variables: 232
Clinical variables: 11
Patients (train): 78
Patients (test): 19
Structure learning computation time: 12h (*)

DEMO

(*) Matlab running on an Intel Core 2 Duo, 2 GHz

Conclusions

• Partial Integration of two data sources improves performance within
the Bayesian Network Framework

• A huge pure-microarray dataset is not helpful

• Data integration needs fewer variables from each source (pure
microarray data is expensive)
