1. Expression Networks for
Cancer Gene Markers
DIMITRIOS-APOSTOLOS CHALEPAKIS-NTELLIS
TECHNICAL UNIVERSITY OF CRETE
SCHOOL OF ELECTRONIC AND COMPUTER ENGINEERING
DIGITAL SIGNAL & IMAGE PROCESSING LAB
Supervisor: Prof. M.Zervakis
Assoc. Prof. A.Mania
Principal Inv. D.Kafetzopoulos
July 2015
2. Presentation Structure
1. Aim of Research
2. Background
3. Method
◦ Structure Learning
◦ Structure Analysis
◦ Finding Hubs
◦ Clustering
4. Results & Conclusion
3. 1. Aim of Research
◦ Create Bayesian networks with two kinds of variables (discrete and
continuous) based on a small number of genes (part of a breast cancer
gene signature)
◦ Find new interactions related to breast cancer, or confirm known ones
◦ Study the properties of Bayesian networks and how they compare
with biological and other networks
◦ Find significant genes, modules and pathways in these networks
4. 2. Background
Why Networks?
◦ Network biology is a multidisciplinary intersection of
mathematics, computer science and biology.
◦ Network biology provides valuable frameworks that:
◦ help analyze high-throughput data
◦ have significantly altered our understanding of biological systems
◦ are embedded in important applications in practical medicine
5. 2. Background
Bayesian Networks(1)
◦ A Bayesian network specifies a joint distribution in a structured form
◦ Represents dependence/independence via a directed graph
◦ Nodes = random variables (these variables can be the expression levels of
different genes)
◦ Edges = direct dependence
◦ Requires that the graph be acyclic (no directed cycles)
◦ Two components to a Bayesian network
◦ The graph structure
◦ The numerical probabilities (for each variable given its parents)
6. 2. Background
Bayesian Networks(2)
◦ Specifically, it encodes the Markov assumption, according to which each variable is
independent of its non-descendants, given its parents.
◦ Any joint distribution that satisfies the Markov assumption can be factorized into the product
form:
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid Pa_G(X_i)\big)$$
$X_i$ : random variables
$Pa_G(X_i)$ : set of parents of $X_i$ in the graph $G$
◦ To fully determine a joint distribution we need to determine each of the conditional
probabilities in the product form.
◦ The functional form of the conditional distribution can be:
1. Multinomial – for discrete variables
2. Linear Gaussian – for continuous variables
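As a small illustration of the product form for multinomial (discrete) variables, the sketch below multiplies one conditional probability per node. The three-node network A -> B, A -> C and all probability values are hypothetical and are not taken from the study.

```python
from itertools import product

# Hypothetical binary network: A -> B and A -> C (values are illustrative only).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# Conditional probability tables: cpt[var][parent_states] = [P(var=0), P(var=1)]
cpt = {
    "A": {(): [0.7, 0.3]},
    "B": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
    "C": {(0,): [0.6, 0.4], (1,): [0.5, 0.5]},
}

def joint_probability(assignment):
    """P(X1..Xn) as the product of P(Xi | parents of Xi)."""
    p = 1.0
    for var, pa in parents.items():
        parent_states = tuple(assignment[q] for q in pa)
        p *= cpt[var][parent_states][assignment[var]]
    return p

# Summing the joint over every assignment should give 1.
total = sum(
    joint_probability(dict(zip("ABC", states)))
    for states in product([0, 1], repeat=3)
)
example = joint_probability({"A": 1, "B": 1, "C": 0})  # 0.3 * 0.8 * 0.5
```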
7. 3. Methodology
Data
◦ High throughput data - Gene expression values of 4174 genes
◦ 529 samples
◦ 425 cancer and 104 control samples
◦ We used 82 of the 4174 genes for our research
◦ 77 genes are part of a gene signature (Nikos Chlis) + 5 control genes
◦ The interactions of these 82 genes compose our initial network
◦ Our initial network consists of 12 biologically verified interactions
◦ 15 genes participate in these 12 interactions
8. 3. Methodology
Structure Learning
◦ Structure learning is the process which induces Bayesian Networks from data
◦ We get different networks if we change the parameters:
◦ Variable type (Discrete, Continuous)
◦ Data (cancer and control samples)
◦ Discrete Variables – discretization based on 2 thresholds
◦ Continuous Variables – the values need to follow a Gaussian distribution
9. 3. Methodology
Finding Thresholds - Discretization(1)
◦ Select the 900 most differentially expressed genes out of 4174
◦ Compute the means of the expression values of these 900 genes, separately for cancer and control samples
◦ Create two classes – max class and min class
◦ Compare the mean values of the cancer and control samples and group them into these two classes
◦ Create the histograms of these two classes
◦ Find the thresholds from the joint Gaussian fit
10. 3. Methodology
Finding Thresholds - Discretization(2)
Thresholds                         Discrete Value
Expression value < 1.131           Underexpressed
1.131 < Expression value < 3.48    Normal
Expression value > 3.48            Overexpressed
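The two thresholds can be applied with a short function like the sketch below. The function name is hypothetical, and mapping values that fall exactly on a threshold to "Normal" is an assumption, since the table does not say how boundary values are handled.

```python
LOW_THRESHOLD = 1.131   # from the threshold table
HIGH_THRESHOLD = 3.48   # from the threshold table

def discretize(expression_value, low=LOW_THRESHOLD, high=HIGH_THRESHOLD):
    """Map a continuous expression value to one of three discrete states."""
    if expression_value < low:
        return "Underexpressed"
    if expression_value > high:
        return "Overexpressed"
    # Assumption: values exactly equal to a threshold are treated as Normal.
    return "Normal"

states = [discretize(v) for v in (0.5, 2.0, 4.0)]
```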
11. 3. Methodology
Structure Learning – Gaussian Variables
◦ Each variable takes the expression values of one gene
◦ The expression values of each gene are approximately normally (Gaussian) distributed, because a log base 2
transformation has been applied to them.
◦ We can check this by creating the histogram of the expression values for each gene
◦ We observe that our data are approximately normally distributed
12. 3. Methodology
Structure Learning
◦ We used the K2 algorithm, which is a score-based algorithm.
◦ It attempts to select the network structure that maximizes the network's posterior
probability given the experimental data.
◦ The K2 search
◦ starts by assuming that a node has no parents
◦ incrementally adds, from the nodes that precede it in a given ordering, the parent whose addition
increases the score of the resulting structure the most
◦ stops adding parents to the node when the score no longer increases
◦ The final score of the network is obtained by multiplying the individual scores of the
nodes.
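The greedy search described above can be sketched for discrete data as follows. This is a minimal illustration, not the implementation used in the study: it assumes the standard Cooper-Herskovits (K2) score computed in log space, integer-coded states, and a toy data set; the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import lgamma

def log_k2_score(data, child, parents, arity):
    """Log K2 score of one node given a candidate parent set."""
    r = arity[child]
    counts = Counter()   # N_ijk: parent configuration j, child state k
    totals = Counter()   # N_ij: parent configuration j
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[(cfg, row[child])] += 1
        totals[cfg] += 1
    score = 0.0
    for cfg, n_ij in totals.items():
        # log[(r-1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += lgamma(r) - lgamma(n_ij + r)
        score += sum(lgamma(counts[(cfg, k)] + 1) for k in range(r))
    return score

def k2_search(data, order, arity, max_parents):
    """Greedy K2 search: parents are chosen only among preceding nodes."""
    parents = {}
    for pos, node in enumerate(order):
        current = []
        best = log_k2_score(data, node, current, arity)
        candidates = list(order[:pos])
        while candidates and len(current) < max_parents:
            top_score, top = max(
                (log_k2_score(data, node, current + [c], arity), c)
                for c in candidates
            )
            if top_score <= best:
                break  # no candidate improves the score
            current.append(top)
            candidates.remove(top)
            best = top_score
        parents[node] = current
    return parents

# Toy data: column 1 copies column 0, column 2 is independent of both.
data = [(a, a, c) for a, c in product([0, 1], repeat=2)] * 10
learned = k2_search(data, order=[0, 1, 2], arity=[2, 2, 2], max_parents=2)
```

Because the logarithm turns the product of node scores into a sum, the network score is the sum of the per-node log scores.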
13. 3. Methodology
Structure Implementation
◦ The K2 algorithm needs to know:
◦ The maximum number of parents per node (we used 82, the total number of genes)
◦ The order of the nodes (variables), which reduces computational complexity
◦ We used two kinds of node orderings:
1. MWST. The order obtained by applying the MWST algorithm (Maximum Weight
Spanning Tree) followed by a topological sort
2. CUSTOM. The first 15 slots of the custom order (15 nodes participate in the initial 12 interactions)
are taken from the MWST result and the rest are in random order.
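One way to obtain an MWST-based ordering is sketched below with Prim's algorithm: nodes are listed in the order they are attached to the tree, so every node appears after its tree parent, which is a valid topological order of the rooted tree. The edge weights (for example, pairwise mutual information), the choice of root, and the gene-like labels are assumptions made for illustration.

```python
import heapq

def mwst_order(nodes, weight, root):
    """Grow a maximum weight spanning tree from `root`; return attachment order."""
    in_tree = {root}
    order = [root]
    # heapq is a min-heap, so weights are negated to pop the heaviest edge first.
    heap = [(-weight(root, v), root, v) for v in nodes if v != root]
    heapq.heapify(heap)
    while heap and len(order) < len(nodes):
        _, _, v = heapq.heappop(heap)
        if v in in_tree:
            continue
        in_tree.add(v)
        order.append(v)
        for w in nodes:
            if w not in in_tree:
                heapq.heappush(heap, (-weight(v, w), v, w))
    return order

# Hypothetical symmetric weights between four nodes.
weights = {
    frozenset(("A", "C")): 0.9, frozenset(("C", "D")): 0.8,
    frozenset(("D", "B")): 0.7, frozenset(("C", "B")): 0.3,
    frozenset(("A", "D")): 0.2, frozenset(("A", "B")): 0.1,
}
ordering = mwst_order(
    ["A", "B", "C", "D"],
    lambda u, v: weights[frozenset((u, v))],
    root="A",
)
```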
16. 3. Methodology
Structure Analysis
◦ Small-world and scale-free are two properties that real networks often have.
◦ A small-world network is a type of mathematical graph in which
◦ most nodes are not neighbors of one another, but most nodes can be reached from every other node by a
small number of hops or steps
◦ the typical distance L between two randomly chosen nodes (the number of steps required) grows
proportionally to the logarithm of the number of nodes N in the network
◦ Small-world networks are characterized by
◦ a small average path length
◦ a large clustering coefficient
◦ A scale-free network is a network whose degree distribution follows a power law
◦ The most notable characteristic in a scale-free network is the relative commonness of
vertices with a degree that greatly exceeds the average. The highest-degree nodes are often
called "hubs", and are thought to serve specific purposes in their networks, although this
depends greatly on the domain.
17. 3. Methodology
Structure Analysis - Examples
[Figure: Small-World Network Example (hubs are highlighted). Average Path Length = 1.8, Clustering Coefficient = 0.522]
[Figure: Random Network Example. Average Path Length = 2.1, Clustering Coefficient = 0.167]
[Figure: Scale-Free Network Example (hubs are highlighted)]
18. 3. Methodology
Structure Analysis – Small World
Network        C       Crand   l         log(n)   n
Cancer (CU)    0.152   0.126   2.37308   4.4067   82
Control (HU)   0.120   0.099   2.47681   4.4067   82
C = clustering coefficient of the current network
Crand = clustering coefficient of an equivalent randomized network
l = average path length of the current network
n = number of nodes
log(n) = natural logarithm of n
• To characterize a network as Small-World:
• C > Crand and l ≈ log(n)
• The clustering coefficients of our networks are only slightly higher than those of the
equivalent random networks.
• The average path lengths are much lower than the logarithm of n.
• So the networks are not Small-World
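The two quantities in the table can be computed as in the sketch below. It assumes the network is treated as undirected and stored as an adjacency dictionary of neighbor sets, and it averages path lengths only over pairs that are connected; the four-node graph is a toy example, not one of the study's networks.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(adj)
l = average_path_length(adj)
```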
19. 3. Methodology
Structure Analysis – Scale Free
[Figures: degree distributions of the Cancer Union (CU) Network and the Control Union (HU) Network]
• The degree distributions of our networks follow a power law.
• So the networks can be categorized as Scale-Free.
• This is in accordance with other studies.
20. 3. Methodology
Centralities
◦ Centralities are topological indices that produce rankings, which seek to
identify the most important nodes in a network model.
◦ Degree
◦ The degree of a node is the number of links (edges) incident on the node. If a network
is directed, meaning that edges point in one direction from one node to another node,
then nodes have two different degrees: the in-degree, which is the number of incoming
edges, and the out-degree, which is the number of outgoing edges.
◦ Betweenness centrality
◦ determines the relative importance of a node by measuring the amount of traffic flowing
through that node to other nodes in the network. This is done by measuring the fraction
of shortest paths between all pairs of nodes that pass through the node of interest.
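Both indices can be computed for a directed, unweighted network as in the sketch below. Betweenness is computed with Brandes' algorithm and left unnormalized; the function names and the three-node example are illustrative and are not taken from the study.

```python
from collections import deque

def degrees(nodes, edges):
    """In-degree and out-degree of every node in a directed graph."""
    indeg = dict.fromkeys(nodes, 0)
    outdeg = dict.fromkeys(nodes, 0)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def betweenness(nodes, edges):
    """Unnormalized betweenness centrality (Brandes' algorithm, directed)."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    bc = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = dict.fromkeys(nodes, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
indeg, outdeg = degrees(nodes, edges)
bc = betweenness(nodes, edges)
```

In the chain a -> b -> c, only node b lies on a shortest path between two other nodes, so it is the only node with non-zero betweenness.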
21. 3. Methodology
Finding Hubs
◦ We need to find the hubs because they play a significant role in networks.
◦ Hubs are the highest-degree nodes of a network and usually have great biological
significance.
◦ Degree is a local node metric
◦ Betweenness is a global node metric
◦ So, to find the significant nodes (central proteins), we used a combination of these metrics
◦ How to find hubs in a network:
◦ Draw histograms and cumulative distributions of the node degrees and betweenness values for the network
◦ In the cumulative distribution, find the point where the curve starts flattening
◦ We call this value the minimum hub node degree (or betweenness) value
◦ The nodes with the highest degrees and betweenness values are the most significant in our network
22. 3. Methodology
Finding Hubs
[Figures: cumulative distributions of In-Degree, Out-Degree and Betweenness for the Cancer Union (CU) Network and the Control Union (HU) Network]
• Cancer Union (CU) Network: 7 hubs found
• Control Union (HU) Network: 11 hubs found
23. 3. Methodology
Network Comparison
◦ We need to compare our networks to find their degree of similarity.
◦ We compared the networks with a method that estimates the ratio of
correctness of one network with respect to another. This measure ranges
between 0 and 1, where 0 is the lowest validity and 1 the highest.
◦ This method is based on distance levels between nodes (shortest paths).
24. 3. Methodology
Network Comparison
• We compare the CU and HU networks.
• We obtained four validity values V, one for each level.
• There are four levels because 4 is the longest shortest path in the networks.
• The value of V indicates how correct one network is with respect to the other.
• The best similarity is observed at levels 3 and 4, where we get roughly 45-51% similarity.
• Higher levels give higher similarity.

Validity   Network
           CU       HU
V1         0.0992   0.1339
V2         0.3215   0.4124
V3         0.4491   0.5088
V4         0.4591   0.5138
25. 3. Methodology
Conclusion
◦ Until now:
◦ The networks that we are working with are the Cancer Union (CU) and Control Union (HU)
◦ Our networks are Scale-Free
◦ There are hubs for each network
◦ What’s next?
◦ We want to find significant modules and molecular complexes in our networks
◦ We apply two clustering algorithms (MCODE and jActiveModules)
◦ We create a difference network and study some centralities of its clusters
26. 3. Methodology
Clustering with MCODE algorithm
◦ MCODE uses a clustering coefficient algorithm to identify molecular complexes in a large
protein interaction network derived from heterogeneous experimental sources.
◦ The idea of this algorithm is that highly interconnected, or dense, regions of the network may
represent complexes.
◦ The algorithmic stages are:
◦ Vertex weighting
◦ weights all of the nodes based on their local network density, using the highest k-core of the
vertex neighborhood
◦ Molecular complex prediction
◦ starting with the highest-weighted node, recursively moves outward, adding to the complex the nodes whose
weight is above a given threshold
27. 3. Methodology
Clustering with MCODE algorithm
[Figures: MCODE clusters in the Union Bayesian Network with Control Samples (HU) and in the Union Bayesian Network with Cancer Samples (CU)]
28. 3. Methodology
Clustering with jActiveModules algorithm
◦ A general method for searching the network to find active subnetworks
◦ This algorithm uses
◦ a statistical scoring system which captures the amount of gene expression change in a given
subnetwork
◦ a search algorithm for identifying the highest-scoring subnetworks
◦ The algorithmic stages are:
◦ Basic z-score calculation
◦ Transform the p-values (significance of differential expression for each gene) to z-scores
◦ Calibrating z against the background distribution
◦ Searching for high-scoring subnetworks via simulated annealing
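The first stage can be sketched as follows, assuming the usual formulation in which each p-value is mapped through the inverse standard normal CDF and the z-scores of the k genes in a subnetwork are summed and divided by the square root of k. The function names are hypothetical; the calibration and simulated annealing stages are not shown.

```python
from math import sqrt
from statistics import NormalDist

def z_from_p(p_value):
    """Convert a p-value to a z-score: z = Phi^-1(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p_value)

def aggregate_z(p_values):
    """Aggregate z-score of a subnetwork: sum of gene z-scores / sqrt(k)."""
    z_scores = [z_from_p(p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))

z_single = z_from_p(0.025)            # close to 1.96
z_neutral = aggregate_z([0.5, 0.5])   # p = 0.5 corresponds to z = 0
```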
29. 3. Methodology
Clustering with jActiveModules algorithm
[Figures: jActiveModules results in the Union Bayesian Network with Cancer Samples (CU) and in the Union Bayesian Network with Control Samples (HU)]
30. Biological Results
Evaluation of Interactions of Bayesian Networks
◦ Current knowledge: 12 experimentally verified interactions among the 82 genes
◦ Control Union interactions: FN1-CDKN2A, FN1-KRT16, FN1-COMP, ERBB2-NRG1 and FGFR3-FGF18
◦ Cancer Union interactions: ACTA1-CDKN2A, FN1-COMP and KRT16-IGHG1
◦ A number of known interactions are validated (Control 5/12, Cancer 3/12)
◦ New interactions are provided (Control 600, Cancer 840) that can be experimentally verified
◦ A more compact model is needed in order to examine the biological significance of the interactions
identified in the constructed networks:
◦ Construction of a Differentiating Network from the cancer and control Bayesian Networks
◦ Identification of enriched pathways within MCODE clusters
◦ Construction of Differentiating Sub-Networks (cancer and control)
32. 3. Methodology: Differentiating Network
Significant (p ≤ 0.05) pathways of the Differentiating Network that are associated with
breast cancer are provided. For each MCODE cluster, and for each pathway within an MCODE
cluster, the average betweenness centrality and the average degree centrality were computed.
Cancer and Control hubs are included in the differentiating network.

MCODE - CLUSTERS
Cluster             Average Betweenness Centrality   Average Degree Centrality
MCODE - Cluster 1   18                               12.5
MCODE - Cluster 2   23.78                            7.14
MCODE - Cluster 3   5.7                              6
MCODE - Cluster 4   2.3                              2.7

MCODE - PATHWAYS (KEGG & WikiPathways in Differentiating Network)
ID            Pathway                               Gene Symbol                    Average Betweenness Centrality   Average Degree Centrality
Pathway 1_1   Pathways in cancer                    ERBB2 FGFR3 CDKN2A FGF18 EGF   23.92                            14.2
Pathway 1_2   Focal adhesion                        ERBB2 COL11A1 COMP EGF         23.73                            13.75
Pathway 1_3   ErbB signaling pathway                NRG2 ERBB2 EGF                 27.52                            14.66
Pathway 1_4   Regulation of actin cytoskeleton      FGFR3 FGF18 EGF                21.79                            13.33
Pathway 1_5   MAPK signaling pathway                FGFR3 FGF18 EGF                21.79                            13.33
Pathway 1_6   EGF-EGFR Signaling Pathway            ERBB2 REPS2 EGF                27.38                            14.33
Pathway 1_7   ECM-receptor interaction              COL11A1 COMP                   20.33                            13
Pathway 1_8   Endocytosis                           FGFR3 EGF                      26.95                            14
Pathway 1_9   Endochondral Ossification             FGFR3 FGF18                    21.52                            13.5
Pathway 2_1   Protein digestion and absorption      COL17A1 CPA3                   39.45                            9
Pathway 2_2   Chemokine signaling pathway           CCL19 CCL18                    15.35                            6
Pathway 2_3   Metabolic pathways                    TAT ATP6V0A4 HSD17B2           33.48                            8.66
Pathway 2_4   Androgen receptor signaling pathway   BRCA1 PARK7                    19.47                            6
33. 3. Methodology
Differentiating Network
◦ Create the difference network of CU and HU
◦ Apply the MCODE algorithm on this network to observe the clusters
◦ Analyze the centralities of the clusters and their pathways
◦ The centralities of these pathways were analyzed by aggregating the centralities of all genes
enriched in one pathway
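The two operations in this step can be sketched as below. The slides do not define the difference network precisely, so the sketch assumes it contains the edges present in exactly one of the two networks, and it assumes that aggregating means taking the arithmetic mean over a pathway's genes. The edge directions and centrality values in the example are hypothetical.

```python
def difference_network(edges_a, edges_b):
    """Edges present in exactly one of the two networks (assumed definition)."""
    return set(edges_a) ^ set(edges_b)

def average_centrality(genes, centrality):
    """Mean centrality over the genes enriched in one pathway or cluster."""
    return sum(centrality[g] for g in genes) / len(genes)

# Illustrative edge sets; directions are chosen arbitrarily for the example.
cu_edges = {("FN1", "COMP"), ("ACTA1", "CDKN2A")}
hu_edges = {("FN1", "COMP"), ("ERBB2", "NRG1")}
diff_edges = difference_network(cu_edges, hu_edges)

# Hypothetical per-gene degree values, not taken from the study.
degree = {"ERBB2": 10.0, "EGF": 20.0}
pathway_degree = average_centrality(["ERBB2", "EGF"], degree)
```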
[Figures: bar charts of Degree and of Betweenness Centrality for Clusters 1-4 and Pathways 1_1 to 2_4]
34. Biological Results
Construction of Differentiating Sub-Networks by considering:
◦ Significant pathways of the Differentiating Network
◦ 'cancer' and 'healthy' hubs
◦ Adjacent edges of 'cancer' or 'healthy' hubs
[Figures: Differentiating 'Cancer' Subnetwork and Differentiating 'Healthy' Subnetwork]
35. 4. Conclusions
◦ Networks from Cancer and Control samples are Scale-Free.
◦ There are significant nodes in Cancer and Control Networks with biological
significance.
◦ There are significant pathways in network complexes.
36. 4. Conclusions
• The differentiating network involves all hub nodes, so hub genes can be considered
potential gene markers for breast cancer
• FN1, TTYH1 and OGN are the hubs common to the cancer and control differentiating sub-networks
• There are fewer interactions in the cancer differentiating sub-network than in the
control differentiating sub-network, even though the number of genes remains constant
• The constructed networks and sub-networks can give insight into the
molecular alterations taking place in different conditions (cancer and control)
• The differentiating sub-networks might be considered as local models to test
a biological hypothesis and are more convenient for experimental design