2. - Introduction to Computational & Systems Biology
- Concepts of Graph Theory
- Bio-interaction networks and visualization
- Data sources of P2P interactions
- Measures of topological importance
OUTLINE
4. What is Systems Biology?
"To understand biology at the system level, we must examine the structure and dynamics of cellular and organismal function, rather than the characteristics of isolated parts of a cell or organism. Properties of systems, such as robustness, emerge as central issues, and understanding these properties may have an impact on the future of medicine."
Hiroaki Kitano
10. Networks: the starting points
Texts typically trace the origin of graph theory back to the Königsberg Bridge Problem and its solution by Leonhard Euler (1736). His paper, a solution to a problem concerning the geometry of position, is regarded as the first paper in graph theory.
Problem of the Königsberg bridges: is it possible to walk a route that crosses each of the seven bridges exactly once and returns to the starting point?
11. What's a Graph?
It is a pair G = (V, E), where
V = V(G) = set of vertices
E = E(G) = set of edges
[Figure: example graph with vertices v1 to v5 and edges e1 to e6]
12. Definitions: Graph Type
Simple graph: a graph without loops or parallel edges
Weighted graph: a graph where each edge is assigned a numerical label or "weight"

Type                 Edges       Multiple edges allowed?  Loops allowed?
Simple graph         undirected  No                       No
Multigraph           undirected  Yes                      No
Pseudograph          undirected  Yes                      Yes
Directed graph       directed    No                       Yes
Directed multigraph  directed    Yes                      Yes
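The undirected rows of the table can be checked mechanically. The following is a minimal sketch, assuming a graph is given as a plain list of vertex pairs; the function name `classify_undirected` is hypothetical and only illustrates the definitions above.

```python
from collections import Counter

def classify_undirected(edges):
    """Name the most restrictive undirected graph type that admits `edges`."""
    # A loop is an edge whose two endpoints are the same vertex.
    has_loop = any(u == v for u, v in edges)
    # (u, v) and (v, u) denote the same undirected edge, so count them together.
    counts = Counter(frozenset((u, v)) for u, v in edges)
    has_multiple = any(c > 1 for c in counts.values())
    if has_loop:
        return "pseudograph"
    if has_multiple:
        return "multigraph"
    return "simple graph"

print(classify_undirected([("a", "b"), ("b", "c")]))   # no loops, no repeats
print(classify_undirected([("a", "b"), ("b", "a")]))   # same edge twice
print(classify_undirected([("a", "b"), ("c", "c")]))   # contains a loop
```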
13. Connected graphs
An undirected graph is connected if every pair of vertices is joined by a path.
Each maximal connected subgraph of a non-connected graph G is called a component of G.
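As an illustration, components can be found with a breadth-first search. This is a sketch on a made-up five-node graph stored as a dictionary of neighbour sets; networkx offers `nx.connected_components` for the same task.

```python
from collections import deque

def connected_components(adj):
    """Return the components of an undirected graph as a list of vertex sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

# Hypothetical example: a path a-b-c plus a separate edge d-e.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}}
components = connected_components(adj)
is_connected = len(components) == 1  # connected means exactly one component
print(components, is_connected)
```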
14. Representation
- Incidence matrix
- Adjacency list
- Adjacency matrix
Adjacency matrix:
- Rows and columns are labeled with the ordered vertices
- Write a 1 if there is an edge between the row vertex and the column vertex
- Write a 0 if no edge exists between them

   v  w  x  y
v  0  1  0  1
w  1  0  1  1
x  0  1  0  1
y  1  1  1  0
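The matrix above can be rebuilt from an edge list and converted to an adjacency list. The sketch below assumes the five edges read off the matrix (v-w, v-y, w-x, w-y, x-y); the helper names are hypothetical.

```python
vertices = ["v", "w", "x", "y"]
edges = [("v", "w"), ("v", "y"), ("w", "x"), ("w", "y"), ("x", "y")]

def adjacency_matrix(vertices, edges):
    """Build a symmetric 0/1 matrix with rows and columns in `vertices` order."""
    index = {name: i for i, name in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the entry
    return matrix

def adjacency_list(vertices, matrix):
    """For each vertex, list the column vertices whose entry is 1."""
    return {
        row_name: [col_name for col_name, bit in zip(vertices, row) if bit]
        for row_name, row in zip(vertices, matrix)
    }

A = adjacency_matrix(vertices, edges)
neighbours = adjacency_list(vertices, A)
for name, row in zip(vertices, A):
    print(name, row)
print(neighbours)
```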
16. - Install Miniconda for your OS
  - https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
- Set up the environment
  - conda create -n meetup python=3.7
  - conda activate meetup
  - conda install notebook pandas networkx matplotlib=2.2.3
- Launch Jupyter Notebook from the meetup code folder
  - jupyter notebook meetup1.ipynb
18. Types of biological networks
Network abstractions
- Node: biological object; edge: interaction between nodes
Regulatory networks
- Node: gene; edge: regulatory interaction
Metabolic networks
- Node: metabolite; edge: reaction
Protein networks
- Node: protein; edge: interaction
- Node: complex; edge: sharing a protein
- Node: residue; edge: folding neighbors
19. Is the organization of biological networks random?
The Scale-Free Model: Preferential Attachment
Preferential attachment means that the more connected a node is, the more likely it is to receive new links.
- Growth: new nodes, each bringing m links, are continually added to the network
- Preferential attachment: the probability that a new node connects to an existing node is proportional to that node's degree
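The two rules can be simulated in a few lines. This is a minimal sketch, not a reference implementation: the fully connected seed of m + 1 nodes and the fixed random seed are assumptions, and networkx provides `nx.barabasi_albert_graph` for real use.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow an n-node network in which each new node attaches to m existing nodes."""
    rng = random.Random(seed)
    # Assumed starting point: m + 1 nodes that are all linked to each other.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks a node with probability proportional to its degree.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # m distinct targets, degree-biased
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.append((new, target))
            stubs.extend((new, target))
    return edges

edges = preferential_attachment(n=200, m=2)
degree = Counter(node for edge in edges for node in edge)
print("edges:", len(edges))
print("five best connected nodes:", degree.most_common(5))
```

Early nodes tend to end up with the highest degree, which is the "rich get richer" effect the slide describes.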
21. Is the organization of biological networks random?
The Power-Law Distribution
$P(k) \sim k^{-c}$
Fat or heavy tail!
Leads to a "scale-free" network
Characterized by a small number of highly connected nodes, known as hubs
Hubs are crucial:
- They affect the error and attack tolerance of complex networks (Albert et al., Nature, 2000)
- Party hubs and date hubs
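The quantity P(k) is simply the fraction of nodes that have degree k. The sketch below computes it for a made-up degree sequence (the numbers are illustrative, not data from the talk); a heavy tail shows up as a few nodes with a degree far above the typical one.

```python
from collections import Counter

def degree_distribution(degrees):
    """Return {k: fraction of nodes with degree k}."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}

# Hypothetical degree sequence: many low-degree nodes and one hub.
degrees = [1, 1, 1, 1, 2, 2, 3, 8]
P = degree_distribution(degrees)
for k, p in P.items():
    print(k, p)
```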
23. Is the organization of biological networks random?
Power laws are seemingly everywhere
[Figure panels: Moby Dick; scientific papers 1981-1997; AOL users visiting sites '97; bestsellers 1895-1965; AT&T customers on 1 day; California 1910-1992]
Source: M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law"
27. Protein interaction
- Introduction on Graph theory
- Bio-interaction networks and visualization
- Data sources of P2P interactions
- Measures of topological importance
OUTLINE
39. Identifying sets of key players
AIMS:
- Optimally diffusing something through the network
  (KPP-Pos): the kp-set is maximally connected to all other nodes.
- Optimally disrupting or fragmenting the network by removing the key nodes
  (KPP-Neg): removing the kp-set results in a residual network with the least possible cohesion.
Co-stardom network
Six Degrees of Kevin Bacon, or Bacon's Law, rests on the assumption that anyone involved in the Hollywood film industry can be linked through their film roles to Bacon within six steps.
The AIDS dataset consists of an acquaintance network among 293 drug injectors on the streets of Hartford, CT. The data are described in Weeks et al. (2002). The network consists of one large main component (193 nodes) and many very small components. As shown in the figure (right), the main component has a very clear structure: it consists of two groups, one African-American (with a higher HIV+ proportion) and the other Puerto Rican (with a lower HIV+ proportion). The two groups are connected by only a few acquaintances, and this bottleneck helps maintain the lower HIV+ rate in the Puerto Rican part of the network.
The terrorist dataset, compiled by Krebs (2001), consists of a presumed acquaintance network among 74 suspected terrorists.
The first question we ask is which persons should be isolated in order to maximally disrupt the network. Let us assume that we can only isolate three people. A run of the KeyPlayer program selects the three nodes shown in red in Figure 7 (nodes A, B and C). Removing these nodes yields a fragmentation measure of 0.59 and breaks the graph into 7 components (including two large ones comprising the left and right halves of the graph).
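To make the KPP-Neg idea concrete, here is a small brute-force sketch. It is not the KeyPlayer program: it assumes the fragmentation measure is $F = 1 - \sum_k s_k(s_k - 1) / (n(n - 1))$, where $s_k$ are the component sizes and $n$ the number of nodes of the residual network, and it simply tries every candidate set on a made-up seven-node graph.

```python
from itertools import combinations

def component_sizes(adj, removed):
    """Sizes of the components that remain once `removed` nodes are deleted."""
    seen, sizes = set(removed), []
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return sizes

def fragmentation(adj, removed=frozenset()):
    """Assumed measure: share of node pairs in the residual graph left disconnected."""
    sizes = component_sizes(adj, removed)
    n = sum(sizes)
    if n < 2:
        return 0.0  # convention chosen here for the degenerate case
    return 1 - sum(s * (s - 1) for s in sizes) / (n * (n - 1))

def best_kp_set(adj, k):
    """Exhaustive search, feasible only for small graphs and small k."""
    return max(combinations(sorted(adj), k),
               key=lambda nodes: fragmentation(adj, frozenset(nodes)))

# Hypothetical network: two triangles (a, b, c) and (e, f, g) bridged by d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
kp = best_kp_set(adj, 1)
print(kp, fragmentation(adj, frozenset(kp)))
```

In this toy graph the bridge node is selected even though it does not have the highest degree, which is the kind of bottleneck effect described for the datasets above.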
Originally introduced by Barabási and Albert [13], scale-free graphs (Figure 2.3b) have been proposed as generic, yet universal models of network topologies that exhibit power-law distributions in the connectivity of network nodes. As a result of the apparent ubiquity of such distributions across many naturally occurring and man-made systems, scale-free graphs have been suggested as representative models of complex systems ranging from the social sciences to molecular biology. For scale-free networks, the vertex connectivity follows a power-law distribution, $P(k) \sim k^{-\gamma}$. By contrast, in a random network the connectivity follows a Poisson distribution that peaks strongly at K, the average number of links per node, implying that the probability of finding a highly connected node decays exponentially ($P(k) \sim e^{-k}$ for $k \gg K$).
Biological scale-free networks have proved to be extremely sensitive to the targeted removal of hubs; hubs are said to hold the network together. The presence of a hub-like network core yields a robust yet fragile connectivity structure that has become a hallmark of scale-free network models. Thus, apparently due to their hub-like core structure, scale-free networks are said to be simultaneously robust to the random loss of nodes (i.e. error tolerance), since random losses tend to miss hubs, and fragile to targeted worst-case attacks on hubs (i.e. attack vulnerability). This latter property has been termed the Achilles heel of scale-free networks.
Many networks show a small-world character (Figure 2.3a), meaning that any two nodes in the system can be connected by relatively short paths along existing links. That is, any node can be reached with just a few hops. In most networks the mean geodesic distance between vertex pairs is small compared to the size of the network as a whole. In a famous experiment conducted in the 1960s, the psychologist Stanley Milgram asked participants to get a message to a specified target person elsewhere in the country by passing it from one acquaintance to another, stepwise through the population. Milgram found that the typical message had passed through just six people on its journey between randomly chosen initial and final individuals. This finding has been immortalized in popular culture in the phrase "six degrees of separation". Since Milgram's experiment, the small-world effect has been confirmed experimentally in many other networks. In metabolic networks these paths correspond to the biochemical pathway connecting two substrates. Watts and Strogatz found that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. They called them small-world networks, by analogy with the small-world phenomenon.
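The mean geodesic distance mentioned above can be computed with one breadth-first search per node. The sketch below uses a made-up four-node path and shows how a single shortcut link shortens the average; unreachable pairs are simply skipped, which is an assumption of this sketch.

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic (shortest-path) distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def mean_geodesic_distance(adj):
    """Average distance over all ordered pairs of distinct, mutually reachable nodes."""
    total = pairs = 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs

# Hypothetical path a-b-c-d, then the same path with a shortcut a-d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
shortcut = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(mean_geodesic_distance(path), mean_geodesic_distance(shortcut))
```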
Who are the most central members of a network and who are the most peripheral? Which component has the most influence over others? Does the network break down into smaller groups and, if so, what are they? Which connections are most crucial to the functioning of a group?
Degree centrality. It is defined as the number of links incident upon a node (Figure 2.4c). It is usually interpreted as an index of a node's popularity or gregariousness, and of its immediate risk of catching whatever is flowing through the network. In mathematical terms, the degree $k_i$ of a vertex $i$ is $k_i = \sum_j A_{ij}$, where $A_{ij}$ is 1 if vertices $i$ and $j$ are linked and 0 otherwise.
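As a quick illustration, the degree of each vertex is the row sum of the adjacency matrix. The sketch reuses the four-vertex matrix from the Representation slide; networkx exposes a normalized variant as `nx.degree_centrality`.

```python
names = ["v", "w", "x", "y"]
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]

# The degree of a vertex is the number of 1s in its row.
degree = {name: sum(row) for name, row in zip(names, A)}
print(degree)
```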
Eigenvector centrality. In general, connections to people who are themselves influential lend a person more influence than connections to less influential people. Denoting the centrality of vertex $i$ by $x_i$, one can allow for this effect by making $x_i$ proportional to the sum of the centralities of $i$'s network neighbours: $x_i = \frac{1}{\lambda} \sum_j A_{ij} x_j$.
Defining the vector of centralities $x = (x_1, x_2, \ldots)$, one can rewrite this equation in matrix form as $\lambda x = A x$; hence $x$ is an eigenvector of the adjacency matrix $A$ with eigenvalue $\lambda$. The eigenvector centrality defined in this way accords each vertex a centrality that depends on both the number and the quality of its connections.
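One common way to find such an eigenvector is power iteration: repeatedly multiply a trial vector by the adjacency matrix and rescale. The sketch below assumes a connected, non-bipartite graph (so the iteration converges) and reuses the four-vertex matrix from the Representation slide; networkx offers `nx.eigenvector_centrality` for general use.

```python
def eigenvector_centrality(A, iterations=200):
    """Power iteration; returns (leading eigenvalue estimate, centrality vector)."""
    n = len(A)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        # y = A x: each vertex sums the current scores of its neighbours.
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # x is kept scaled so its largest entry is 1, hence max(y) estimates lambda.
        lam = max(y)
        x = [value / lam for value in y]
    return lam, x

names = ["v", "w", "x", "y"]
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]
lam, scores = eigenvector_centrality(A)
print(round(lam, 4), dict(zip(names, (round(s, 4) for s in scores))))
```

In this example w and y score highest: they have more links than v and x, and they are also linked to each other.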
Betweenness. Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not. For a graph G = (V, E) with n vertices (Figure 2.4a), the betweenness $C_B(v)$ for vertex $v$ is: $C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$
where $\sigma_{st}$ is the number of shortest geodesic paths from $s$ to $t$, and $\sigma_{st}(v)$ is the number of shortest geodesic paths from $s$ to $t$ that pass through the vertex $v$. So defined, betweenness centrality measures the fraction of the information that will flow through a node $v$ on its way to wherever it is going. In many contexts a vertex with high betweenness will exert substantial influence by virtue not of being in the middle of the network but of lying between other vertices in this way.
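The definition can be evaluated directly on a small graph. The sketch below counts shortest paths with breadth-first search and uses the fact that the number of shortest s-t paths through v is the product of the s-v and v-t counts whenever v lies on a geodesic. It is a brute-force illustration on the four-vertex graph from the Representation slide, each unordered pair is counted once (an assumed convention), and `nx.betweenness_centrality` is the practical choice for real networks.

```python
from collections import deque

def bfs_counts(adj, source):
    """Distances from `source` and the number of shortest paths to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    info = {s: bfs_counts(adj, s) for s in adj}
    cb = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for v in adj:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v is on a shortest s-t path iff the two legs add up exactly.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    cb[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    # Ordered pairs (s, t) and (t, s) were both visited; halve for undirected graphs.
    return {v: value / 2 for v, value in cb.items()}

adj = {"v": {"w", "y"}, "w": {"v", "x", "y"}, "x": {"w", "y"}, "y": {"v", "w", "x"}}
cb = betweenness(adj)
print(cb)
```

Only the pair (v, x) is not directly linked; its two shortest paths run through w and y, so each of those gets 0.5.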
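Closeness. Intuitively, a vertex is close to the rest of the network if it is only a few steps away from most other vertices. Vertices that tend to have short geodesic distances to other vertices within the graph have higher closeness (Figure 2.4b). Formally, it is based on the mean geodesic distance (i.e. the mean shortest-path length) between a vertex $v$ and all other vertices reachable from it, $\ell(v) = \frac{1}{|R_v|} \sum_{t \in R_v} d(v, t)$, where $R_v$ is the set of vertices reachable from $v$; closeness is commonly taken as the reciprocal $1 / \ell(v)$, so that shorter distances give higher values.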
Closeness can be regarded as a measure of how long it will take information to spread from a given vertex to others in the network.
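A short sketch of this measure on the four-vertex graph from the Representation slide follows. It assumes the reciprocal-of-mean-distance convention and assigns 0 to isolated vertices; `nx.closeness_centrality` is the networkx counterpart.

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic distance from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def closeness(adj, v):
    """Reciprocal of the mean geodesic distance from v to the vertices it can reach."""
    distances = [d for t, d in bfs_distances(adj, v).items() if t != v]
    if not distances:
        return 0.0  # isolated vertex: convention chosen for this sketch
    return len(distances) / sum(distances)

adj = {"v": {"w", "y"}, "w": {"v", "x", "y"}, "x": {"w", "y"}, "y": {"v", "w", "x"}}
scores = {name: closeness(adj, name) for name in adj}
print(scores)
```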
Clustering coefficient. The cohesiveness of the neighborhood of a node $i$ is usually quantified by the clustering coefficient $C_i$, defined as the ratio between the number of edges linking nodes adjacent to $i$ and the total possible number of edges among them [19]. In other words, it measures how close the local neighborhood of a node is to being a clique. High average clustering coefficient values have been detected in protein interaction and metabolic networks.
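For a node with $k_i$ neighbours there are $k_i(k_i - 1)/2$ possible edges among them, so the ratio can be computed by counting linked neighbour pairs. The sketch below uses the four-vertex graph from the Representation slide and assigns 0 to nodes with fewer than two neighbours, which is a convention assumed here; `nx.clustering` does the same job in networkx.

```python
from itertools import combinations

def clustering_coefficient(adj, i):
    """Fraction of pairs of i's neighbours that are themselves linked."""
    neighbours = sorted(adj[i])
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs exist; convention chosen for this sketch
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

adj = {"v": {"w", "y"}, "w": {"v", "x", "y"}, "x": {"w", "y"}, "y": {"v", "w", "x"}}
C = {name: clustering_coefficient(adj, name) for name in adj}
print(C)
```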
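Network density. It ranges from 0 to 1 and measures how densely a network is populated with edges. A network with no edges and solely isolated nodes has a density of 0; a network in which every pair of nodes is linked has a density of 1.
Diameter. The longest of the shortest paths between any two nodes in the network.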
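Both quantities are easy to compute for a small undirected simple graph. The sketch below takes density as the number of edges divided by $n(n - 1)/2$ and assumes the graph is connected when computing the diameter; it reuses the four-vertex graph from the Representation slide. networkx provides `nx.density` and `nx.diameter`.

```python
from collections import deque

def density(adj):
    """Edges present divided by the n(n-1)/2 edges a simple graph could have."""
    n = len(adj)
    m = sum(len(neighbours) for neighbours in adj.values()) // 2  # each edge seen twice
    return 2 * m / (n * (n - 1))

def eccentricity(adj, source):
    """Largest geodesic distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path; assumes the graph is connected."""
    return max(eccentricity(adj, node) for node in adj)

adj = {"v": {"w", "y"}, "w": {"v", "x", "y"}, "x": {"w", "y"}, "y": {"v", "w", "x"}}
print(density(adj), diameter(adj))
```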