
Minicourse on Network Science



Prepared as a conference tutorial, MIC-Electrical, Athens, Greece, 5th April 2014, updated and delivered again in Beijing, China, 27 January 2015 to students from Complex Systems Group, CSRC and Dept. of Engineering Physics, Tsinghua University

Published in: Science


  1. A Mini-Course on Network Science
     Pavel Loskot
     p.loskot@swan.ac.uk
  2. Course Outline (Pavel Loskot © 2014)
     1. Introduction
        • fundamentals of complex systems and graph theory
     2. Structure
        • sub-graphs, centrality measures, weighted networks, community
     3. Random Models
        • random, small world and scale free networks
     4. Robustness
        • some definitions and metrics
     5. Processes
        • epidemic spreading and information cascades
     6. Algorithms
        • max flow and min cut, routing, search and navigation
     7. Software
        • using Matlab and Python, available software, a few demos from YouTube
  3. Used Resources
     • Ernesto Estrada, The Structure of Complex Networks: Theory and Applications, Oxford University Press, 2011
     • Cecilia Mascolo, Social and Technological Network Analysis, course at University of Cambridge, UK
     • Jari Saramäki, Introduction to Complex Networks, Aalto University, Finland
     • Animesh Mukherjee, Complex Network Theory, IIT Kharagpur, India
  4. Used Resources
     • Robert Leese, An Introduction to Clustering, Industrial Mathematics Knowledge Transfer Network
     • Kevin Wayne, Max Flow, Min Cut, Princeton University, USA
     • James F. Kurose and Keith W. Ross, Computer Networking, A Top-Down Approach, Pearson Education, 2012
     • Wikipedia, various topics
  5. Networks: Introduction
  6. Complex Systems
     Emergence of complexity:
     • locally simple rules, and yet globally complex behavior
     • systems evolve, are dynamic and adapt to the environment
     Modeling:
     • infinitely many possibilities
     • normally data-driven, but what data to collect?
     Emergence of stochasticity:
     "God doesn't play dice with the world."
     • many entities, complex interactions
     • often useful to describe observations statistically (joint PDF, correlations)
     • human beings live at the edge of the stochastic and deterministic worlds
  7. Illustration of Complexity
     Simple idea:
     • send packets between two nodes
     Implementation:
     • how to distinguish end-nodes?
     • how to find the route?
     • how to share the network (resources) among billions of end-nodes?
     • how to deal with lost and delayed packets?
     • how to deal with mobility and with nodes leaving and arriving?
     Solution:
     • evolution: solve problems iteratively
     • separation: divide and conquer
     • new problems emerge as the network grows: scalability, stability, security
  8. Emergence of Order
     Differing (spatial-temporal) perspectives
     • insider: interacting with immediate neighbors (immediate, local)
     • outsider: system level perception (average, global)
  9. Description of Networks
     1. Complete: everybody connected with everybody else
     2. Random: connections selected arbitrarily at random
     3. Random tree: connections selected arbitrarily at random, no cycles allowed
     4. Real-world networks:
        • exponential degree distribution and strongly disassortative
        • small average path length and high clustering coefficient
        • several nodes with high (degree, closeness and betweenness) centrality
        • several main communities
        . . . and many other distinctive characteristics
     Challenge: how to synthesize real-world networks with all these properties?
  10. Formal Definitions
      Network
      • graph model of functional and/or structural relationships of a complex system
      Time-invariant network
      • graph $G = (V, E)$ with set of nodes $V = \{v_1, \ldots, v_N\}$ and set of edges (links) $E = \{e_1, \ldots, e_L\} \subset V \otimes V$, i.e., every edge $e_l \in E$ is associated with one pair $(v_i, v_j) \in V \otimes V$; in other words, $E$ is a set of (un)ordered pairs from $V$
      • let's not allow self-edges ($[v_i, v_i] \notin E$) and duplicate edges ($E$ has unique elements)
      • nodes and edges are objects, but for analysis and evaluation purposes we need numbers, i.e., we assign numbers (called weights) to nodes and edges:
        $v_n \to W_v(n)$, $e_l \to W_e(l)$, $e_l = [v_i, v_j] \to W_v(i, j)$
      Dynamic networks
      • graphs (nodes, edges) as well as weights can vary over time
  11. Formal Definitions (cont.)
      Graph edges (in structural models)
      • exist only if two nodes communicate; this communication can be implemented in many different ways (radiation, material transport flows, . . .)
      • communicating nodes interact, i.e., influence each other's behavior
      • communications are, first of all, information flows
      Two nodes communicate
      • if enough information is delivered (just sent is not enough) over a given time window, i.e., communication is an integral (average) quantity
      • delivered information may be ignored, not recognized, or misinterpreted
  12. Fundamentals (of Graph Theory)
      (Un)directed graphs
      • for directed graphs, $E$ is a set of ordered pairs $[u, v] \in V \otimes V$
      Neighbors, degrees
      • $u$ is a neighbor of $v$ if $(u, v) \in E$; then $u$ and $v$ are said to be adjacent nodes
      • $(u, w)$, $(v, w)$, $(y, w)$ and $(x, w)$ are adjacent edges which are incident at $w$
      • in-degree $k_{in}$, out-degree $k_{out}$ and degree $k = k_{in} + k_{out}$; the degree distributions are important statistics (this assumes all edges are counted with unit weights)
  13. Fundamentals
      Isomorphic graphs
      • $G_1$ and $G_2$ are isomorphic if there is a one-to-one mapping of their vertices that preserves the (possibly directed) edges (i.e., they are different visualizations of the same graph)
      Edge (or connection) density
        $\rho = \frac{|E|}{\binom{|V|}{2}} = \frac{2|E|}{|V|(|V| - 1)}$
      • $\binom{|V|}{2} = \frac{|V|(|V| - 1)}{2}$ is the maximum possible number of edges
      • $\rho = 1$ if fully connected; for real-world networks $\rho \ll 1$ (i.e., sparse)
      • graph is sparse if $|E| \approx |V|$, graph is dense if $|E| \approx |V|^2$
      Clique
      • $\tilde{G} = (\tilde{V}, \tilde{E})$ is a subgraph of $G = (V, E)$ if $\tilde{V} \subseteq V$ and $\tilde{E} \subseteq E$
      • clique is a maximal, completely connected subgraph of the graph
      • N-clique is a fully connected subgraph with $N$ vertices
      • clique number is the size of the largest clique in the graph
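The density formula above can be checked with a minimal Python sketch. The function name `edge_density` and the edge-list representation are illustrative assumptions, not part of the slides; the graph is assumed simple and undirected, as in the formal definitions.

```python
def edge_density(num_vertices, edges):
    """Density rho = |E| / C(|V|, 2) of a simple undirected graph (hypothetical helper)."""
    if num_vertices < 2:
        raise ValueError("density needs at least two vertices")
    # Treat (u, v) and (v, u) as the same edge and drop duplicates.
    unique_edges = {frozenset(e) for e in edges}
    max_edges = num_vertices * (num_vertices - 1) // 2
    return len(unique_edges) / max_edges


triangle = [(0, 1), (1, 2), (0, 2)]   # complete graph on 3 vertices
path = [(0, 1), (1, 2), (2, 3)]       # path on 4 vertices
print(edge_density(3, triangle))      # 1.0
print(edge_density(4, path))          # 0.5
```

The triangle is fully connected, so its density is 1, while the 4-vertex path uses 3 of the 6 possible edges.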
  14. Fundamentals
      Path, walk, trail
      • path from $v_1$ to $v_L$ is an ordered sequence of edges between an ordered list of vertices such that no vertex is visited twice
      • length of a path is the number of its edges (i.e., assuming edges of unit length)
      • if there is no path between two vertices, their path length is infinite
      • distance of two vertices is the length of their shortest path
      • walk of length $L$ from $v_1$ to $v_{L+1}$ is a sequence $[v_1, \ldots, v_{L+1}]$ where two subsequent (only those) vertices are required to be different
      • trail is a walk with no repeated edge
      • cycle is a path that starts and ends at the same vertex
      Diameter ($d$) and radius ($r$) of a graph
      • the diameter is the longest shortest path; the radius is the smallest eccentricity, i.e., the smallest over all vertices $u$ of the longest shortest path starting at $u$:
        $d = \max_{u,v \in V} \mathrm{distance}(u, v)$, $r = \min_{u \in V} \max_{v \in V} \mathrm{distance}(u, v)$
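As a rough illustration of distances, diameter and radius, the following Python sketch uses breadth-first search on an unweighted, undirected graph. The adjacency-dictionary format and the function names are assumptions made for this example, and the radius is taken here as the smallest eccentricity (the standard graph-theoretic definition).

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter_and_radius(adj):
    """Return (diameter, radius); both are infinite for a disconnected graph."""
    eccentricities = []
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) < len(adj):          # some vertex is unreachable from u
            return float("inf"), float("inf")
        eccentricities.append(max(dist.values()))
    return max(eccentricities), min(eccentricities)


path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter_and_radius(path_graph))  # (3, 2)
```

On the 4-vertex path, the end vertices have eccentricity 3 and the inner vertices have eccentricity 2, which gives diameter 3 and radius 2.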
  15. Fundamentals
      Average path length
      • it is the average shortest path between all pairs of vertices:
        $\bar{d} = \frac{1}{2\binom{|V|}{2}} \sum_{u,v \in V} \mathrm{distance}(u, v)$
      • if some vertices $u$ and $v$ are disconnected (i.e., there is no path connecting $u$ and $v$), the average path length is defined as the harmonic mean instead (the reciprocal of the average reciprocal distance):
        $\bar{d} = \left( \frac{1}{2\binom{|V|}{2}} \sum_{u,v \in V} \frac{1}{\mathrm{distance}(u, v)} \right)^{-1}$
      Graph coloring
      • assign labels to vertices, so that no adjacent vertices get the same label
      • chromatic number is the minimum number of colors needed to solve the coloring
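The two averages can be compared with a short Python sketch. It assumes an unweighted, undirected graph stored as an adjacency dictionary and sums over ordered pairs of distinct vertices, so the normalization is $|V|(|V| - 1) = 2\binom{|V|}{2}$; the function names are illustrative only. Unreachable pairs contribute $1/\infty = 0$ to the harmonic version.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """BFS distances from source; unreachable vertices are simply absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Arithmetic mean of distances over ordered pairs; infinite if disconnected."""
    n = len(adj)
    total = 0
    for u in adj:
        dist = shortest_path_lengths(adj, u)
        if len(dist) < n:
            return float("inf")
        total += sum(dist.values())
    return total / (n * (n - 1))


def harmonic_average_path_length(adj):
    """Reciprocal of the mean reciprocal distance; finite even if disconnected."""
    n = len(adj)
    inverse_total = 0.0
    for u in adj:
        dist = shortest_path_lengths(adj, u)
        inverse_total += sum(1.0 / d for v, d in dist.items() if v != u)
    if inverse_total == 0:
        return float("inf")               # graph has no edges at all
    return n * (n - 1) / inverse_total


connected = {0: [1], 1: [0, 2], 2: [1]}   # path 0-1-2
split = {0: [1], 1: [0], 2: []}           # vertex 2 is isolated
print(average_path_length(connected), harmonic_average_path_length(connected))
print(average_path_length(split), harmonic_average_path_length(split))
```

For the disconnected example, the arithmetic mean is infinite while the harmonic mean stays finite, which is the reason for using it.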
  16. Fundamentals
      Connectivity
      • connected component: a subgraph in which there is a path between every pair of vertices; for directed graphs, the directions can be ignored
      • connected graph: there is a path between every pair of its vertices; in other words, the graph contains a single connected component
      • (sub)graphs that are not connected are disconnected
      • node connectivity: the smallest number of vertices whose removal disconnects the graph
      • edge connectivity: the smallest number of edges whose removal disconnects the graph
      • strongly connected component: every vertex is reachable from every other vertex of the component (i.e., edge directions matter here)
      Cutsets
      • vertex cutset: a set of vertices whose removal disconnects the graph (i.e., increases the number of graph components); such vertices are also known as articulation points or brokers (in social networks)
      • edge cutset: a set of edges whose removal disconnects the graph
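A brute-force Python sketch of these definitions is given below: it finds connected components by depth-first search and then tests each vertex by removing it and recounting the components. The function names and the adjacency-dictionary format are assumptions for illustration; linear-time algorithms for articulation points exist, but this version follows the definition directly.

```python
def connected_components(adj, removed=frozenset()):
    """List of vertex sets, one per connected component, ignoring removed vertices."""
    seen = set(removed)
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    stack.append(w)
        components.append(component)
    return components


def articulation_points(adj):
    """Vertices whose removal increases the number of components (brute force)."""
    base = len(connected_components(adj))
    return {
        v for v in adj
        if len(connected_components(adj, frozenset([v]))) > base
    }


# Two triangles sharing vertex 2: removing 2 splits the graph.
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(len(connected_components(bowtie)))  # 1
print(articulation_points(bowtie))        # {2}
```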
  17. Fundamentals
      • vertices G and H in the pictured graph are cutset vertices
      • bridges: edges that, if removed, increase the number of graph components
      [Figure: basic graphs]
  18. Fundamentals
      Tree
      • connected graph with no cycles (adding only one link creates a cycle)
      • becomes disconnected by removing any single link
      • any pair of nodes is connected by exactly one path
      • spanning tree is a subgraph of a network that includes all its nodes and is a tree
      R-regular graph
      • all vertices have degree $R$ and there are $|E| = R|V|/2$ edges
      Planar graph
      • can be drawn in a 2D plane such that no two edges intersect
      • among all complete graphs $C_n$, only $C_1$, $C_2$, $C_3$ and $C_4$ are planar
      [Figure: example embeddings of $C_4$]
  19. Basic Graphs
      Bipartite networks
      • two sets of vertices; only edges between vertices in different sets are allowed
      • graph is bipartite if it does not contain any odd cycles
      • generalization to more than two sets of vertices is possible
      Graph matching
      • given graph $G = (V, E)$, a matching $M \subseteq E$ in $G$ is a set of edges not sharing any common vertex
      • maximal matching: any edge added to $M$ will violate the matching property
      • maximum matching: contains the largest possible number of edges
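The odd-cycle criterion and the difference between maximal and maximum matchings can be sketched in Python as follows. The two-coloring test by breadth-first search and the greedy matching are standard textbook techniques; the function names and data formats are assumptions chosen for this example.

```python
from collections import deque


def is_bipartite(adj):
    """Two-color the graph by BFS; fails exactly when an odd cycle exists."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    return False
    return True


def greedy_maximal_matching(edges):
    """Scan edges in order, keeping each edge whose end-points are both unmatched."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}   # even cycle
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}            # odd cycle
print(is_bipartite(square), is_bipartite(triangle))     # True False

# Path 0-1-2-3: picking the middle edge first gives a maximal matching of
# size 1, although the maximum matching [(0, 1), (2, 3)] has size 2.
print(greedy_maximal_matching([(1, 2), (0, 1), (2, 3)]))  # [(1, 2)]
print(greedy_maximal_matching([(0, 1), (1, 2), (2, 3)]))  # [(0, 1), (2, 3)]
```

The greedy result depends on the edge order, which shows that a maximal matching need not be a maximum matching.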
  20. Adjacency Matrix
      • (binary) adjacency matrix: $[A]_{ij} = \begin{cases} 1 & \text{if } [v_i, v_j] \in E \\ 0 & \text{otherwise} \end{cases}$
      • for undirected graphs, $A$ is symmetric (i.e., $A = A^T$), and the degree distribution is
        $k = (1^T A)^T = A 1$
      • for directed graphs, $A$ is asymmetric:
        $k_{in} = (1^T A)^T$ (in-degree distribution), $k_{out} = A 1$ (out-degree distribution)
      • average degree of a graph:
        $\bar{k} = \frac{2|E|}{|V|} = \frac{1^T k}{|V|} = \frac{1^T A 1}{|V|} = \frac{\sum_{u \in V} k(u)}{|V|}$
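The row and column sums above can be illustrated with plain Python lists standing in for matrices. The helper names are hypothetical; `A[i][j] = 1` is assumed to mean an edge from vertex i to vertex j, matching the definition of the adjacency matrix.

```python
def out_degrees(A):
    """Row sums, i.e. A times the all-ones vector."""
    return [sum(row) for row in A]


def in_degrees(A):
    """Column sums, i.e. the transpose of (ones^T A)."""
    return [sum(col) for col in zip(*A)]


def average_degree(A):
    """ones^T A ones / |V|; equals 2|E|/|V| for an undirected graph."""
    return sum(out_degrees(A)) / len(A)


# Directed graph with edges 0->1, 0->2, 1->2.
directed = [[0, 1, 1],
            [0, 0, 1],
            [0, 0, 0]]
print(out_degrees(directed))  # [2, 1, 0]
print(in_degrees(directed))   # [0, 1, 2]

# Undirected path 0-1-2 (symmetric matrix, 2 edges, 3 vertices).
undirected = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
print(out_degrees(undirected) == in_degrees(undirected))  # True
print(average_degree(undirected))                         # 4/3, i.e. 2|E|/|V|
```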
  21. Adjacency Matrix
      • for undirected bipartite graphs with vertices $V = V_1 \cup V_2$, $|V_1| = n_1$, $|V_2| = n_2$ (listing the vertices of $V_2$ first):
        $A = \begin{bmatrix} 0_{n_2 \times n_2} & R^T \\ R & 0_{n_1 \times n_1} \end{bmatrix}$, where $R$ is an $n_1 \times n_2$ matrix
      • generally, $A^n$, $n = 1, 2, \ldots$ counts the walks of length $n$ in the graph, i.e., $[A^n]_{ij}$ is the number of distinct n-hop walks between vertices $i$ and $j$
      • $[A^T A]_{ij}$ and $[A A^T]_{ij}$ are the numbers of vertices connected to and from, respectively, both vertices $v_i$ and $v_j$ at the same time
      • $\mathrm{tr}(A^3)/6$ is the number of triangles in the graph
      • open and closed triangles
      • a closed triangle represents 6 closed triplets (starting at each of 3 vertices in 2 directions)
      Incidence matrix
      • $|V| \times |E|$ matrix: $[B]_{ij} = \begin{cases} 1 & \text{if } v_i \in e_j \\ 0 & \text{otherwise} \end{cases}$
      • degree matrix is a diagonal matrix $D = \mathrm{diag}(k_1, \ldots, k_{|V|})$
      • adjacency matrix can also be expressed as $A = B B^T - D$
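The walk-counting property of matrix powers and the triangle count $\mathrm{tr}(A^3)/6$ can be verified with a small pure-Python sketch. The helpers `matmul`, `matrix_power` and `count_triangles` are illustrative names, and the example uses the complete graph on 4 vertices.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matrix_power(A, n):
    """A multiplied by itself n times (identity for n = 0)."""
    size = len(A)
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, A)
    return result


def count_triangles(A):
    """tr(A^3) / 6 for an undirected simple graph."""
    A3 = matrix_power(A, 3)
    return sum(A3[i][i] for i in range(len(A))) // 6


# Complete graph on 4 vertices.
K4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
A2 = matrix_power(K4, 2)
print(A2[0][1])             # 2 walks of length 2 from 0 to 1 (via 2 or via 3)
print(A2[0][0])             # 3 closed walks of length 2 at vertex 0 (its degree)
print(count_triangles(K4))  # 4
```

The diagonal entry of $A^2$ counts walks that leave a vertex and return immediately, which is why $A^n$ counts walks rather than paths with distinct vertices.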
  22. Adjacency Matrix
      Graph spectrum
      • recall that for an undirected graph, the adjacency matrix is symmetric, so its eigenvalues are real-valued and referred to as the graph spectrum
      • eigenvalue $\lambda$ and eigenvector $v$ satisfy $A v = \lambda v$, i.e., $(\lambda I - A) v = 0$
      • the roots of the characteristic polynomial $p_A(t) = \det(tI - A) = \prod_i (t - \lambda_i)$ of matrix $A$ are the eigenvalues of $A$
      • Laplacian matrix of graph $G$ is $L = D - A$ (degree minus adjacency matrix):
        $[L]_{ij} = \begin{cases} [D]_{ii} = k(i) & i = j \\ -1 & i \neq j \text{ and } [v_i, v_j] \in E \\ 0 & \text{otherwise} \end{cases}$
        and the spectrum of graph $G$ is then given by the eigenvalues of $L$ (rather than of $A$)
      Properties of Laplacian
      • multiplicity of $\lambda_0 = 0$ of $L$ is the number of connected components of $G$
      • eigenvalue $\lambda_0 = 0$ corresponds to eigenvector $v_0 = [1, \ldots, 1]^T$, i.e., $L v_0 = 0$
      • $L = B B^T$ where $B$ is the oriented $|V| \times |E|$ incidence matrix of graph $G = (V, E)$ (each column has one $+1$ and one $-1$ entry, for the two end-points of the edge)
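A minimal Python sketch can confirm two of the Laplacian properties on a small example: the all-ones vector lies in the null space of $L$, and $L = B B^T$ when $B$ is an oriented incidence matrix (one $+1$ and one $-1$ per column, with the orientation chosen arbitrarily). The helper names are assumptions for illustration.

```python
def laplacian(A):
    """L = D - A for a symmetric 0/1 adjacency matrix given as a list of rows."""
    n = len(A)
    return [[sum(A[i]) if i == j else -A[i][j] for j in range(n)] for i in range(n)]


def oriented_incidence(num_vertices, edges):
    """|V| x |E| matrix with +1 at one end-point and -1 at the other."""
    B = [[0] * len(edges) for _ in range(num_vertices)]
    for col, (u, v) in enumerate(edges):
        B[u][col] = 1
        B[v][col] = -1
    return B


def times_transpose(B):
    """Compute B B^T."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in B] for row_i in B]


edges = [(0, 1), (1, 2)]        # path 0-1-2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = laplacian(A)
print(L)                                                   # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
print([sum(row) for row in L])                             # [0, 0, 0], i.e. L times ones = 0
print(times_transpose(oriented_incidence(3, edges)) == L)  # True
```

With the unsigned 0/1 incidence matrix, the same product gives $D + A$ instead of $D - A$.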
  23. Power-Law Distribution
      • long tail (right) with many low-connected vertices (left) (80-20 rule)
      • many real-world networks exhibit this degree distribution, so they have a star-like topology
      • also known as the scale-free distribution of scale-free networks; these networks are self-similar at different (spatial-temporal) scales:
        $p(k) = A k^{-\gamma} \;\Rightarrow\; p(c \cdot k) = c^{-\gamma} \cdot p(k)$
      • cumulative degree distribution (CDD), i.e., the probability that the degree is at least $k$:
        $P(k) = \sum_{k'=k}^{\infty} p(k') \approx k^{-(\gamma - 1)}$
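Both relations on this slide follow in one or two steps from $p(k) = A k^{-\gamma}$. The derivation below is a sketch that approximates the sum by an integral, which assumes $\gamma > 1$ and large $k$.

```latex
% Scale invariance: rescaling the degree only rescales the probability.
p(c \cdot k) = A (c k)^{-\gamma} = c^{-\gamma} A k^{-\gamma} = c^{-\gamma} \, p(k)

% Cumulative degree distribution: approximate the sum by an integral.
P(k) = \sum_{k'=k}^{\infty} A k'^{-\gamma}
     \approx \int_{k}^{\infty} A x^{-\gamma} \, dx
     = \frac{A}{\gamma - 1} \, k^{-(\gamma - 1)}
```

The exponent of the cumulative distribution is therefore smaller by one than the exponent of $p(k)$.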
  24. Analyzing Degree Distributions
      Degree-degree correlations
      • assortativity coefficient or Pearson correlation coefficient ($r$)
      Assortative mixing ($r > 0$)
      • bias towards connections between nodes with similar characteristics (hubs tend to connect to each other)
      • useful, e.g., to understand the spread of diseases and their treatment
      Disassortative mixing ($r < 0$)
      • dissimilar nodes tend to connect to each other (hubs avoid each other)
      Neutral mixing ($r = 0$)
      • connections follow some probability distribution
  25. Analyzing Degree Distributions
      Mathematically:
      • let $p_{ij}$ be the probability of an edge having degrees $k_i$ and $k_j$ at its two ends:
        $\sum_{ij} p_{ij} = 1$, $\sum_i p_{ij} = q_j = \frac{k_j p_j}{\sum_j k_j p_j}$
      • perfectly assortative networks have $p_{ij} = q_i \delta_{ij}$ (only nodes of the same degree connect)
      • if degrees are independent, then $p_{ij} = q_i q_j$
      • Pearson coefficient, $-1 \le r \le 1$:
        $r = \frac{E[(k_i - E[k_i])(k_j - E[k_j])]}{\sqrt{\sigma^2(k_i)\, \sigma^2(k_j)}} = \frac{\sum_{i,j} k_i k_j (p_{ij} - q_i q_j)}{\max \sum_{i,j} k_i k_j (p_{ij} - q_i q_j)} = \frac{\sum_{i,j} k_i k_j (p_{ij} - q_i q_j)}{\sum_{i,j} k_i k_j (\delta_{ij} - q_i) q_j}$
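For an empirical network, the Pearson coefficient can be estimated directly from an edge list, as in the following Python sketch. It assumes an undirected simple graph and counts every edge in both directions, so that both end-point samples have the same mean and variance; the function name is hypothetical.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Each undirected edge contributes the ordered pairs (u, v) and (v, u).
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n                      # same mean for xs and ys by symmetry
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0:
        raise ValueError("all edge ends have the same degree; r is undefined")
    return covariance / variance


star = [(0, 1), (0, 2), (0, 3), (0, 4)]     # hub connected to four leaves
path = [(0, 1), (1, 2), (2, 3)]
print(degree_assortativity(star))           # -1.0 (perfectly disassortative)
print(degree_assortativity(path))           # approximately -0.5
```

In the star, every edge joins the degree-4 hub to a degree-1 leaf, so the degrees at the two ends are perfectly anti-correlated.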
  26. Degree Distributions
      Degree-degree correlation
      • directed graphs (networks)
      Summary
      • for large graphs, edges (topology) can be considered statistically
      • degree distribution is a partial statistical description (of topology)
      • degree-degree correlation is more informative, but is still incomplete information
  27. Take-Home Messages
      Complex Systems
      • consist of a large number of interacting components
      • graphs are very good mathematical models of these systems; they are very generic objects with many specific instances (trees, lists, tables etc.)
      • availability of observations (measurement data) is a strong driving force
      • a common systematic framework to study these systems: Network Science
      History of modern science (W. Weaver, 1948)
      • Problems of simplicity (1600-1800): understanding the influence of one variable over another
      • Problems of disorganized complexity (1900-1950): the number of variables is very large, but the system as a whole has well-defined average behavior
      • Problems of organized complexity (1950-today): simultaneously dealing with a number of factors forming the whole system
  28. Networks: Structure
  29. Similarity of Networks
      • Nature is built up of complex networks
      • there is a need for a common framework for systematically describing, analyzing and eventually synthesizing networks to mimic Nature
  30. Comparing Networks
      Similarity of (static) networks
      1. calculate and compare (a vector of) metrics for each network; N.B. we can only compare scalar values (e.g. Euclidean distances between vectors)
      OR
      2. identify distinctive subgraphs at certain granularity, and compare those
      Graphlets [Pržulj, 2004]
      • pictured on the slide: 30 subgraphs of 2-5 nodes of 73 possible types
      • generalizes the vector of node degrees to graphlet degrees; it is a vector of 73 components of the number of nodes of a given type in the network
      Fragments
      • quantitative analysis relies on correlations between fragment statistics in the network and the network properties
  31. 31. Comparing Networks Motifs [Milo, 2002] • subgraphs that occur significantly more often than they would if the network were created completely at random • network randomization: 1. select two links at random, 2. exchange their end-points, 3. repeat • a motif occurs much more often in the real network than (on average) in an ensemble of random networks having the same degree distribution • we require that the probability of the motif appearing in the random ensemble at least as many times as in the real network is small • this is quantified by the Z-score ($N$ denotes the number of occurrences): $Z = \frac{N_{\mathrm{real}} - \mathrm{E}[N_{\mathrm{random}}]}{\sqrt{\mathrm{E}\left[(N_{\mathrm{random}} - \mathrm{E}[N_{\mathrm{random}}])^2\right]}}$ • motifs are network specific, although families of networks can share the same motifs • importance of motifs can be evaluated as the significance profile (SP) vector $\mathrm{SP} = \left( \frac{Z_1}{\sqrt{\sum_i Z_i^2}}, \frac{Z_2}{\sqrt{\sum_i Z_i^2}}, \cdots \right)$
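The Z-score and the significance profile on this slide can be sketched in a few lines. This is a minimal illustration, not the procedure of the cited paper: the function names and the motif counts are hypothetical, and the counts from the randomized ensemble are assumed to have been obtained elsewhere (e.g. by the link-swapping randomization).

```python
import math
import statistics

def z_score(n_real, random_counts):
    """Z = (N_real - E[N_random]) / std(N_random) for one subgraph type."""
    mean = statistics.fmean(random_counts)
    std = statistics.pstdev(random_counts)  # population std, i.e. sqrt(E[(N - E[N])^2])
    if std == 0:
        raise ValueError("random ensemble has zero variance")
    return (n_real - mean) / std

def significance_profile(z_scores):
    """Normalize a vector of Z-scores to unit Euclidean length."""
    norm = math.sqrt(sum(z * z for z in z_scores))
    return [z / norm for z in z_scores]

# Hypothetical counts: 10 occurrences in the real network,
# and 2, 4, 6 occurrences in three randomized networks.
z = z_score(10, [2, 4, 6])
sp = significance_profile([3.0, 4.0])
```

A large positive `z` marks the subgraph as a motif candidate; the normalized profile allows networks of different sizes to be compared.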
  32. 32. Comparing Networks Motif examples Relative abundance of fragments • assume that the ensemble of random networks has the same node degrees as the real-world network: $\alpha = \frac{N_{\mathrm{real}} - \mathrm{E}[N_{\mathrm{random}}]}{N_{\mathrm{real}} + \mathrm{E}[N_{\mathrm{random}}]}$
  33. 33. Pavel Loskot c 2014 5/32 Comparing Networks Relative abundance examples • ratios of the number of fragments occurrences are also useful to characterize the network structure as shown next
  34. 34. Transitivity Measures Clustering coefficient • recall that every triangle represents three connected (open or closed) triples • let $|\tilde T_3|$ be the number of triangles and $|\tilde P_2|$ the number of 2-paths (connected triples with 2 or 3 ties); the clustering coefficient (a.k.a. network transitivity) is $C_3 = \frac{3|\tilde T_3|}{|\tilde P_2|}$ where $|\tilde T_3| = \mathrm{tr}\{A^3\}/6$ and $|\tilde P_2| = \frac{1}{2}\left( \sum_{ij} [A^2]_{ij} - \mathrm{tr}\{A^2\} \right)$ • a network can be highly clustered locally, but not globally (i.e., considering the average of local clusterings across all nodes is not sufficient) • clustering tends to be much larger for real-world than for random networks Example • A’s friends: B, C, D and E • all possible edges among A’s friends: B-C, B-D, B-E, C-D, C-E, D-E, i.e., 6 in total, out of which only 1 (C-D) exists • thus, the clustering coefficient of A is 1/6 Generalization • for any subgraph, the ratio of the actual to the maximum possible number of its occurrences: $C_n = \frac{n|\tilde T_n|}{|\tilde P_n|}$
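The worked example on this slide can be checked with a short sketch. The adjacency-set representation and the function names are assumptions made for illustration; the graph is the one described in the example (A linked to B, C, D, E, plus the single link C-D).

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves linked."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def transitivity(adj):
    """Global clustering C3 = 3 * (#triangles) / (#2-paths)."""
    closed = 0      # closed triples; each triangle is counted once per corner
    two_paths = 0   # connected triples centered at each node
    for v, nbrs in adj.items():
        k = len(nbrs)
        two_paths += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / two_paths if two_paths else 0.0

edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("C", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

c_a = local_clustering(adj, "A")   # 1 existing link out of 6 possible
c_global = transitivity(adj)       # 1 triangle, 8 two-paths
```

The local value for A and the global transitivity differ, which illustrates why local and global clustering should not be confused.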
  35. 35. Centrality Measures Aim • quantify importance of nodes in a network (so-called positional advantage), i.e. how nodes contribute to the overall structural properties of the network • e.g. important nodes disseminate information faster, can stop spreading epidemics, can protect network from breaking and so on Degree centrality • hubs are likely to have the largest influence (e.g. number of friends to help) • a transitivity measure since it is a ratio of single (neighboring) node fragments • for a network of $N$ nodes, the $i$-th node of degree $k_i$ has degree centrality $C_1(i) = \frac{|\tilde T_1|}{|\tilde P_1|} = C_D(i) = \frac{k_i}{N-1}$ Network centralization (centrality) ($\sigma^2_{C_1}$): $\sigma^2_{C_1} = \frac{1}{N-1} \sum_{i=1}^{N} \left( C_1(i) - \bar C_1 \right)^2$ where $\bar C_1 = \frac{1}{N} \sum_{i=1}^{N} C_1(i)$ • star topology has the maximum while line topology has the minimum centralization
  36. 36. Centrality Measures Freeman’s degree centrality • quantify variations in node degree centrality in the whole network: $\bar C_D = \frac{\sum_{i=1}^{N} (k^* - k_i)}{(N-1)(N-2)}$ where $k^* = \max_i k_i$, and $\max \sum_{i=1}^{N} (k^* - k_i) = (N-1)(N-2)$ for a star network Betweenness centrality (beyond nearest neighbors) • quantify node importance in communications between pairs of other nodes • ability to broker between groups, likelihood of intercepting information etc. • thus, it is the likelihood of node $w$ being involved in communications: $C_B(w) = \frac{1}{\binom{n-1}{2}} \sum_{u \neq w \neq v} \frac{\rho(u,w,v)}{\rho(u,v)}$ (normalization optional) • $\rho(u,w,v)$: number of shortest paths between $u$ and $v$ via $w$; $\rho(u,v)$: number of all shortest paths between $u$ and $v$ OR • $\rho(u,w,v)$: maximum flow from $u$ to $v$ through $w$; $\rho(u,v)$: total maximum flow from $u$ to $v$
  37. 37. Centrality Measures Example (betweenness centrality) • A and E are not in-between any pairs, B and D are in-between 3 pairs, and C is in-between 4 pairs Closeness centrality • measure of how much the node is in the “middle of things” • let $d(u,v)$ be the shortest path length between nodes $u$ and $v$: $C_C(u) = \left( \frac{1}{N-1} \sum_{v \neq u} d(u,v) \right)^{-1}$ (normalization optional) Example (closeness centrality) $C_C(A) = \left( \frac{1+2+3+4}{4} \right)^{-1} = 0.4$
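Both examples on this slide can be reproduced with breadth-first search. The sketch below assumes the example graph is the five-node path A-B-C-D-E (the figure is not part of the transcript, but this graph matches the stated counts), assumes a connected unweighted graph, and uses the unnormalized shortest-path form of betweenness. Function names are illustrative.

```python
from collections import deque

def bfs(adj, s):
    """Return (distances, numbers of shortest paths) from s to all reachable nodes."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def closeness(adj, u):
    """Inverse of the average shortest-path length from u to all other nodes."""
    dist, _ = bfs(adj, u)
    return (len(adj) - 1) / sum(d for v, d in dist.items() if v != u)

def betweenness(adj, w):
    """Sum over pairs {u, v} of rho(u, w, v) / rho(u, v), without normalization."""
    info = {s: bfs(adj, s) for s in adj}
    others = [v for v in adj if v != w]
    total = 0.0
    for i, u in enumerate(others):
        for v in others[i + 1:]:
            dist_u, sigma_u = info[u]
            dist_w, sigma_w = info[w]
            if dist_u[w] + dist_w[v] == dist_u[v]:   # w lies on a shortest u-v path
                total += sigma_u[w] * sigma_w[v] / sigma_u[v]
    return total

path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
cc_a = closeness(path, "A")
cb = {v: betweenness(path, v) for v in path}
```

Dividing each betweenness value by $\binom{n-1}{2} = 6$ gives the normalized form of the formula.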
  38. 38. Centrality Measures Information centrality $C_{IC}(i) = \left( \frac{1}{N} \sum_j \frac{1}{I_{ij}} \right)^{-1}$ Eigenvector centrality ($x_u$) • account for connections that are (or are not) isolated; important nodes are likely connected to other important nodes • let $B(u)$ be the neighbors of node $u$: $x_u = \frac{1}{\lambda} \sum_{v \in B(u)} x_v = \frac{1}{\lambda} \sum_{v \in V} [A]_{u,v}\, x_v \Rightarrow A x = \lambda x$ • algorithm: initialize $x_u = 1\ \forall u$, re-calculate $x_u\ \forall u$, set $\lambda = \max_u x_u$ and normalize by it, repeat Katz centrality • instead of counting shortest paths (as in closeness centrality), count all paths • let $\alpha > \lambda_1$ (the largest eigenvalue of $A$), so that the series below converges: $C_K(i) = [Z \cdot \mathbf{1}]_i$ where $Z = \sum_{k=1}^{\infty} \alpha^{-k} A^k = \left( I - \frac{1}{\alpha} A \right)^{-1} - I$ so the values of $C_K(i)$ depend on the choice of $\alpha$
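The iterative algorithm for eigenvector centrality can be sketched as a plain power iteration. The example graph (a triangle 0-1-2 with a pendant node 3 attached to node 0) and the fixed iteration count are assumptions chosen for illustration; on bipartite graphs such as a star or a path this simple iteration may oscillate instead of converging.

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration: x <- A x, rescaled by the largest entry each round."""
    x = {u: 1.0 for u in adj}
    lam = 1.0
    for _ in range(iterations):
        new = {u: sum(x[v] for v in adj[u]) for u in adj}
        lam = max(new.values())              # estimate of the largest eigenvalue
        x = {u: val / lam for u, val in new.items()}
    return x, lam

# Triangle 0-1-2 plus a pendant node 3 attached to node 0 (hypothetical example).
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
x, lam = eigenvector_centrality(graph)
```

Node 0 receives the largest score, and the pendant node 3 scores below nodes 1 and 2 even though all three are neighbors of node 0, because its only connection is to a single node.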
  39. 39. Centrality Measures PageRank centrality • reflects the probabilities that a random walk through the network arrives at any particular node • intuitively, if there are many links out of node $v$, one of these links to node $u$ represents an average recommendation of $u$ by $v$; if the number of links out of $v$ is reduced, the recommendation of $u$ by $v$ increases • define the modified adjacency matrix $[H]_{ij} = 1/k_{\mathrm{out}}(i)$ if $[v_i, v_j] \in E$, and $0$ otherwise • the PageRank vector $C_{PR} = [C_{PR}(1), \ldots, C_{PR}(N)]^T$ at step $k$ is updated as $C_{PR}^{k+1} := C_{PR}^{k} \cdot H$ • note that node 4 (in the pictured example) traps a random walker, and also, the search is often randomly reset (with probability $1 - \alpha$), so this modified $H$ should be used instead: $H' = \alpha H + \frac{\alpha}{N} (a \mathbf{1}^T) + \frac{1-\alpha}{N} \mathbf{1}\mathbf{1}^T$ where $[a]_i = 1$ if $k_{\mathrm{out}}(i) = 0$, and $0$ otherwise
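The update with the modified matrix $H'$ can be sketched without forming any matrix explicitly. This is a minimal illustration under stated assumptions: the four-node directed graph is hypothetical (node 3 has no outgoing links and would trap the walker), and $\alpha = 0.85$ and the iteration count are conventional choices rather than values from the slides.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Iterate rank <- rank * H' with dangling-node and random-reset corrections."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly (the a 1^T term).
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (alpha * dangling + (1.0 - alpha)) / n
        new = {v: base for v in nodes}
        for v in nodes:
            for u in out_links[v]:
                new[u] += alpha * rank[v] / len(out_links[v])   # the alpha * H term
        rank = new
    return rank

# Hypothetical directed graph; node 3 is a dangling node.
web = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
ranks = pagerank(web)
```

Each iteration preserves the total rank, so the result can be read as a probability distribution over nodes.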
  40. 40. Centrality Measures Reciprocity ($r$) • in directed networks, a link from $u$ to $v$ can be reciprocated by a link from $v$ to $u$; these are called co-links: $r = \frac{\sum_{ij} [A]_{ij} [A]_{ji}}{|E|}$ (fraction of reciprocated links) Rich-Club coefficient of degree $k$ ($R(k)$) • hubs tend to be densely interconnected, which is quantified by $R(k)$ • let the subgraph $(V'(k), E') \subseteq (V, E)$ where $V'(k) \subseteq V$ is the subset of nodes with degree at least $k$, and $E' \subseteq E$ are the corresponding edges among $V'(k)$: $R(k) = \frac{|E'|}{\binom{|V'(k)|}{2}}$ Matching index ($\mu_{ij}$) • quantify similarity of connectivity of the two end-vertices of an edge • a small value of $\mu_{ij}$ indicates that the edge between $v_i \in V$ and $v_j \in V$ is a bridge between two dissimilar regions of the network: $\mu_{ij} = \frac{\sum_{k \neq i,j} A_{ik} A_{kj}}{\sum_{k \neq i} A_{ik} + \sum_{k \neq j} A_{jk}}$
  41. 41. Weighted Networks Terminology (graph / network / system): vertex / node / component; edge / link / interaction Weights mapping • weights can be assigned to vertices as well as to (more often) edges; we assume the mapping $W : (V, E) \to (V, W)$, so the weighted adjacency matrix and the original adjacency matrix are, respectively, $[W]_{ij} = w_{ij} \in \mathbb{R}$ and $[A]_{ij} = 1$ if $|w_{ij}| \geq \Delta$, $0$ if $|w_{ij}| < \Delta$ Vertex strength • the degree distribution is generalized to the strength distribution, again having power-law-like tails in many real-world networks: $s(i) = \sum_j w_{ij}$ • it was observed that node strength and node degree are related as $\mathrm{E}[s|k] \propto k^{\beta}$, $\beta > 0$; for $\beta > 1$, high-degree vertices (hubs) tend to be high-strength vertices
  42. 42. Weighted Networks Other generalizations • the edge contributions can be normalized as $w_{ij} / \sum_j w_{ij} = w_{ij}/s(i)$, e.g. the average nearest (first order) neighbor degree $k_{NN}(i) = \sum_j \frac{w_{ij}}{s(i)} [A]_{ij}\, k(j)$ • importantly, there are no generally agreed definitions of quantities (metrics) for weighted networks, e.g. the clustering coefficient ($[A]_{ij} \equiv a_{ij}$): • [Barrat, Barthélemy, Vespignani] $C_3(i) = \frac{1}{s(i)(k(i)-1)} \sum_{j,k} \frac{w_{ij} + w_{ik}}{2}\, a_{ij} a_{jk} a_{ik}$ • [Onnela, Saramäki, Kaski, Kertész] $C_3(i) = \frac{1}{k(i)(k(i)-1)} \sum_{j,k} (w_{ij} w_{ik} w_{jk})^{1/3}$ • [Zhang] $C_3(i) = \frac{\sum_{j,k} w_{ij} w_{jk} w_{ik}}{\left( \sum_k w_{ik} \right)^2 - \sum_k w_{ik}^2}$ • [Holme] $C_3(i) = \frac{\sum_{j,k} w_{ij} w_{jk} w_{ki}}{(\max_{ij} w_{ij}) \sum_{j,k} w_{ij} w_{ik}}$ where for an unweighted network we assume $w_{ij} = 1$ if $[v_i, v_j] \in E$, and $0$ otherwise
  43. 43. Weighted Networks Time-series as graphs 1. pre-processing: reduce measurement noise, reduce amount of data 2. calculate magnitude of correlations (possibly with thresholding): $0 \leq [W]_{ij} = \frac{\left| \mathrm{E}[d_i d_j] - \mathrm{E}[d_i]\, \mathrm{E}[d_j] \right|}{\sqrt{\left( \mathrm{E}[d_i^2] - \mathrm{E}[d_i]^2 \right) \left( \mathrm{E}[d_j^2] - \mathrm{E}[d_j]^2 \right)}} \leq 1$ 3. construct a weighted graph assuming the weight matrix $W$
  44. 44. Pavel Loskot c 2014 16/32 Weighted Networks Spanning tree of a graph • a tree topology containing all nodes of the graph • possibly additional requirement to maximize or minimize the sum of edge weights • it can be used to emphasize clusters in the graph, but . . . a lot of information is discarded and is also sensitive to noise and thresholding Example (NYSE stocks)
  45. 45. Community Structure Network communities • so far, we considered local and global structure and properties; here, we look at the spatial scale in-between individual nodes and the whole network, i.e. clusters • clusters are obtained by network partitioning or clustering • our objective is to partition the network using only its topology Why clustering • manage complex systems by creating hierarchy, for example, Big Data analysis and classification such as large databases, customer recommendations, website ranking, genomics, market evaluations etc. • identify bridges and weak ties in networks Formally • find $P$ disjoint subsets $V_i$, so that $\bigcup_{i=1}^{P} V_i = V$ and $V_i \cap V_j = \emptyset$ for $i \neq j$
  46. 46. Community Structure Balanced partitioning • given $P$, the sizes of the partitions are approximately equal, i.e. $|V_i| \approx |V|/P$ • possibly also, the cut (the links between subsets) size can be minimized Community • there is a path through the community between every pair of nodes • internal connection density significantly larger than density of external connections Cut size • assume a weighted network, and the partition $\tilde V \subset V$ • the internal and external weights of a node $v_i \in \tilde V$ in the partition: $W_{\mathrm{int}}(i) = \sum_{v_j \in \tilde V} w_{ij}$, $W_{\mathrm{ext}}(i) = \sum_{v_j \notin \tilde V} w_{ij}$ • the cut size between $\tilde V$ and the rest of the nodes $V \setminus \tilde V$: $C_{\mathrm{cut}}(\tilde V) = \frac{1}{2} \sum_{v_i \in V} W_{\mathrm{ext}}(i)$
  47. 47. Community Structure Reducing cut size • moving node $v_i$ in or out of the partition $\tilde V$ will change the cut size by $g(i) = W_{\mathrm{ext}}(i) - W_{\mathrm{int}}(i)$, so the cut size is reduced if $W_{\mathrm{ext}}(i) > W_{\mathrm{int}}(i)$ • for partitions already balanced, consider replacing one node in the partition (i.e., move one node out and another node in); the cut size is changed by $g(i,j) = g(i) + g(j) - 2 w_{ij}$ if $v_i$ and $v_j$ are connected, and $g(i,j) = g(i) + g(j)$ otherwise Centrality based partitioning • links connecting nodes in different communities are likely to have large edge betweenness centrality (defined analogously to node betweenness) Algorithm [Girvan, Newman, 2002] 1. calculate edge betweenness for all links, remove the link with the highest such value 2. recalculate edge betweenness for the remaining links 3. repeat until all links have been removed
  48. 48. Community Structure Modularity • need to compare different partitions to decide which one is the best • intuitively, cohesion or link density within the community is likely to be significantly larger than if the community is formed at random • for the partitioning $\bigcup_{i=1}^{P} V_i = V$, with edges $E_i$ within the partition $V_i$, the modularity indicator ($c_i$ is the community assignment of vertex $v_i$) is $Q = \sum_{i=1}^{P} \left[ \frac{|E_i|}{|E|} - \left( \frac{\sum_{v_j \in V_i} k(v_j)}{2|E|} \right)^2 \right] = \ldots = \frac{1}{2|E|} \sum_{i,j} \left( [A]_{ij} - \frac{k(i) k(j)}{2|E|} \right) \delta(c_i, c_j)$ so it is the actual number of edges minus the expected number of edges inside the community for a random subgraph with the same node degree distribution • $Q \geq -1$, and $\max Q = 1$ for strong community structure • can be used as a stopping criterion ($Q \gg 0$) in the Girvan-Newman algorithm Modularity optimization • find the partitioning with maximum modularity (the exact solution is NP complete): complexity $O\binom{|V|}{|V|/2} \sim \frac{2^{|V|}}{\sqrt{\pi |V|/2}}$ for large $|V|$
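The first form of the modularity formula translates directly into code. The sketch below assumes an undirected, unweighted graph given as an edge list and a dictionary mapping each node to its community label; the example graph (two triangles joined by one bridge) is hypothetical.

```python
def modularity(edges, community):
    """Q = sum over communities of |E_i|/|E| - (degree sum of V_i / (2|E|))^2."""
    m = len(edges)
    internal = {}     # |E_i|: edges with both end-points in community i
    degree_sum = {}   # total degree of the nodes in community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles (0,1,2) and (3,4,5) joined by the bridge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {v: "all" for v in range(6)}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
```

Placing every node in one community gives zero modularity, while the natural split into the two triangles gives a clearly positive value, which is the kind of comparison the Girvan-Newman stopping criterion relies on.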
  49. 49. Pavel Loskot c 2014 21/32 Community Structure Resolution problem • modularity based clustering may fail to identify obvious small clusters close to a large cluster • modularity is deficient if clusters are circularly connect (pictured right) • other similarity measures also affected (minimum cuts, . . .) Possible solution • use multiple similarity metrics • then choose the best partition by consensus (e.g. majority vote)
  50. 50. Pavel Loskot c 2014 22/32 Community Structure Hierarchical clustering • complexity O((|E| + |V|)|V|), many networks are sparse (|V| ≈ |E|) Algorithm 1. Initialize: |V| communities of 1 vertex each 2. Calculate modularity ∆Q for all pairs of existing communities 3. Merge the community pair having the largest increase ∆Q 4. Build the dendrogram and repeat steps 2 and 3 until only one community remains Clustering based on Euclidean distance
  51. 51. Pavel Loskot c 2014 23/32 Community Structure Merging clusters • similarity between clusters can be measured as single linkage: minimum between all pairs of nodes in two clusters • Complete linkage: maximum between all pairs of nodes in two clusters • Average linkage: average between all pairs of nodes in two clusters Limitations of modularity • appears to be strongly dependent on the density of links in the network • thus, not good measure to determine communities in sparse networks Clustering techniques 1. Agglomerative (bottom-up) techniques: edges are added among nodes to create communities (e.g. dendrogram) 2. Divisive (top-down) techniques: edges are removed from graph to create separate communities 3. Spectral techniques: graph splitting based on eigen-analysis Similarity measures • quantify (dis)similarity between nodes to decide on communities in all clustering algorithms • selection strongly application dependent (modularity, cosine similarity, Jaccard’s coefficient, . . .)
  52. 52. Pavel Loskot c 2014 24/32 Community Structure Louvain method (based on modularity optimization) • more accurate and more efficient (much faster) than hierarchical clustering • number of communities decreases quickly in only few iterations Algorithm 1. Initialize: every node is in its own community 2. For each node i, consider all its neighbors j, and check if moving i into j’s community increases ∆Q 3. Move i into community for which ∆Q is maximum 4. Repeat steps 2 and 3 until no further improvement possible (i.e. ∆Q = 0) 5. collapse the communities into single nodes (merging multiple edges between these new nodes), and go back to step 2
  53. 53. Community Structure K-means clustering • number of clusters $K$ predefined • minimize e.g. Euclidean distances: $\{V_i\}_{i=1,\ldots,K} = \arg\min \sum_{i=1}^{K} \sum_{v \in V_i} \| v - \bar v_i \|^2$, $\bar v_i = \frac{1}{|V_i|} \sum_{v \in V_i} v$ Algorithm 1. Initialize: select $K$ vertices at random as initial clusters and assign the remaining vertices to the nearest clusters 2. Calculate new centroids $\bar v_i$ for each cluster 3. Re-assign all vertices to the nearest clusters 4. Go to step 2 until some stopping criterion is met (pictured example: 89% of data correctly classified)
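The four steps of the algorithm can be sketched in pure Python. The optional `init` argument (indices of the points used as initial centroids) is an addition made here for reproducibility and is not part of the slide; the stopping criterion is assumed to be "no label changes", and the six two-dimensional points are hypothetical.

```python
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Lloyd-style K-means with squared Euclidean distance."""
    rng = random.Random(seed)
    chosen = init if init is not None else rng.sample(range(len(points)), k)
    centroids = [points[i] for i in chosen]           # step 1: initial clusters
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]                                             # steps 1 and 3: nearest centroid
        if new_labels == labels:                      # step 4: stop when stable
            break
        labels = new_labels
        for c in range(k):                            # step 2: recompute centroids
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(pts, k=2, init=[0, 3])
```

With random initialization (`init=None`) different seeds can lead to different partitions on less clearly separated data.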
  54. 54. Pavel Loskot c 2014 26/32 Community Structure Limitations of K-means clustering • sensitivity to initial conditions and outliers • sensitivity to non-homogeneous structure, i.e. clusters differ significantly in size, connection density, non-spherical shape (for Euclidean distance metric)
  55. 55. Community Structure Gaussian mixture models • assume there are $K$ clusters, vertex $v_i$ has location $x_i$ • locations of vertices in cluster $V_i$ are normally distributed $\sim \mathcal{N}(x|\mu_i, \lambda_i)$ • let $z_i$, $i = 1, \ldots, K$, be latent indicator variables such that $z_i = 1$ if the vertex belongs to cluster $i$ with distribution $\mathcal{N}(x|\mu_i, \lambda_i)$ and $z_i = 0$ otherwise, so $\Pr(z) = \prod_{i=1}^{K} \pi_i^{z_i}$ • if $z_i$ are known, the data are labeled (parameters of their distribution are known), otherwise the data are un-labeled (unsupervised learning) • $\pi_i = \Pr(z_i = 1)$ are mixing probabilities (weights), so that $\sum_{i=1}^{K} \pi_i = 1$ and the distribution of the location $x$ of a vertex is $p(x) = \sum_z p(x|z)\, p(z) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x|\mu_i, \lambda_i)$, where $p(x|z) = \prod_{i=1}^{K} \mathcal{N}(x|\mu_i, \lambda_i)^{z_i}$ • unknown parameters: mixing coefficients ($\pi_i$), means ($\mu_i$), covariances ($\lambda_i$) • using Bayes’ theorem, we can find the posterior probabilities (responsibilities) $\Pr(z_i|x)$ that the $i$-th Gaussian component has in explaining the observed data Algorithm [Expectation Maximization (EM)] 1. E-step: evaluate $\Pr(z_i|x)$ given the current parameters 2. M-step: re-estimate the parameters using the current $\Pr(z_i|x)$
  56. 56. Pavel Loskot c 2014 28/32 Community Structure Overlapping communities • nodes may belong to more than one community (i.e. subsets Vi not disjoint) Clique percolation method [Palla 2005] • K-clique are K fully connected nodes • K-cliques adjacent if share K − 1 nodes • K-clique community is a set of nodes connected through adjacent K-cliques Algorithm 1. identify maximal cliques in the network (complex problem, but fortunately many real-world networks are relatively sparse) 2. consider cliques as single nodes; interconnect cliques if they share at least K − 1 nodes 3. identify connected components in graph created in step 2
  57. 57. Community Structure Spectral clustering • K-means, Gaussian mixtures and hierarchical methods are good for compact clusters • spectral clustering transforms the data into a new basis where standard algorithms work well Algorithm 1. Construct the similarity matrix, $[S]_{ij} = \exp\left( -\| x_i - x_j \|^2 / 2\sigma^2 \right)$ 2. Construct the Laplacian $L = D - S$ where $D$ is the diagonal matrix of weights, $[D]_{ii} = \sum_j [S]_{ij}$ 3. Construct the matrix $U$ of $k$ eigenvectors corresponding to the $k$ smallest eigenvalues of $L$ (for this unnormalized Laplacian; the largest eigenvalues are used when working with the normalized similarity matrix instead) 4. Perform clustering on the transformed data $x' = U^T x$
  58. 58. Pavel Loskot c 2014 30/32 Community Structure Real-time clustering • (dynamic) re-clustering for every new data arrival is expensive • (dynamically) varying the number of clusters is confusing Hierarchical Agglomerate Clustering [HAC, 2004] 1. Initialization: hierarchical clustering (e.g. using dendrogram) 2. new data either assigned to one of existing cluster, OR 3. new data form new cluster, and two existing clusters are merged
  59. 59. Pavel Loskot c 2014 31/32 Community Structure Community analysis • distribution of community sizes • intra-community edge densities • number of intra- and inter-community links • average number of communities per node • . . . Community network • communities → nodes • edges weighted by number of links between communities
  60. 60. Pavel Loskot c 2014 32/32 Take-Home Messages Network structure analysis • structure of un-weighted static networks, i.e. knowing only their topology • subgraphs, graphlets, fragments and motifs are building blocks of large networks; the statistics of their occurrence is useful to compare network topology beyond their degree distribution • network partitioning or clustering to identify (overlapping) communities Measures of network structure • centrality (degree, betweenness, closeness, eigenvector, Katz, PageRank, . . .) • clustering coefficient, Rich-Club coefficient • modifications of measures for weighted networks
  61. 61. Networks: Random Models
  62. 62. Pavel Loskot c 2014 1/23 Statistical Modeling Objectives • account for models or parameters uncertainty, measurement noises etc. • make (short-to-medium term) predictions from the models • generate artificial data for verifying models and predictions • decide how much randomness influence properties; here we compare structural and functional properties of random and real-world networks Milgram’s experiment [1967] • famous “six-degree separation” • 300 people at random to send letter to a person in Boston • repeated in 2003: 18 targets, 60k senders, communications via emails • new findings in 2003: median 5 − 7 steps, network structure is not everything, high impact of incentives • Facebook: 92% of users only 5 hops away, 99% at 6 hops away
  63. 63. Random Network Models Erdős-Rényi (ER) random graph [1959] • graph $G_{ER}(n, p)$ with $n$ vertices and edges chosen independently with probability $p$, “a zero-order approximation” of real-world networks • thus, the vertex degree is random (for large $n$, the binomial distribution is approximated by the Poisson distribution): $\Pr(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k} \approx \frac{e^{-\bar k}\, \bar k^k}{k!}$, $\bar k = \mathrm{E}[k] = (n-1)p$ • most vertices have a degree close to the average $\bar k$ • the diameter ($d$) and the average distance ($\bar l$) between two vertices are relatively small compared to the size of the graph: $d = \frac{\ln(n)}{\ln(p(n-1))} = \frac{\ln(n)}{\ln(\bar k)} \approx \bar l$ • average number of edges $\mathrm{E}[|E|] = p \binom{n}{2} = \frac{n \bar k}{2}$; the latter, since $\bar k = \frac{2 \mathrm{E}[|E|]}{n}$ Connectivity of ER random graph • average degree $\bar k < 1$: graph disconnected; $\bar k > 1$: a giant component appears; $\bar k \geq \ln(n)$: graph (almost) completely connected
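The connectivity regimes of the ER graph can be explored numerically. The sketch below is a simple $O(n^2)$ generator together with a search for the largest connected component; the network size, seed and the three average degrees are arbitrary choices made for illustration, and individual random realizations will vary.

```python
import math
import random
from itertools import combinations

def er_graph(n, p, seed=None):
    """G(n, p): each of the n(n-1)/2 possible edges is present with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def largest_component(adj):
    """Size of the largest connected component (iterative depth-first search)."""
    seen, best = set(), 0
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack, size = [s], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

n = 500
for k_avg in (0.5, 1.5, math.log(n)):    # subcritical, supercritical, connected regime
    g = er_graph(n, k_avg / (n - 1), seed=1)
    mean_degree = sum(len(nb) for nb in g.values()) / n
    print(f"target k = {k_avg:.2f}, measured k = {mean_degree:.2f}, "
          f"largest component = {largest_component(g)}")
```

Plotting the largest-component size against $\bar k$ over many realizations would show the transition around $\bar k = 1$.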
  64. 64. Random Network Models Average path length and diameter of ER (random) graph • let $l(i,j)$ be the shortest path between vertices $v_i$ and $v_j$ • the shortest paths can be combined into a single metric as $\bar l = \frac{1}{\binom{n}{2}} \sum_{i,j;\, i \neq j} \frac{l(i,j)}{2}$ (average shortest path) and $d = \max_{i,j} l(i,j)$ (maximum shortest path, i.e. diameter) • if $n$ is large, the average path length $\bar l \propto \ln n$ is relatively small and grows slowly with the network size (this is typical for many large networks); for comparison, 1D lattice (chain): $\bar l \propto n$, 2D lattice: $\bar l \propto n^{1/2}$
  65. 65. Random Network Models Clustering coefficient of ER graph • ratio of neighbors being friends to all possible friendships among neighbors • the probability that two neighbors are connected is $p$, so the clustering coefficient is $C_{ER} = p = \frac{\bar k}{n-1} \approx \frac{\bar k}{n}$ which is much smaller than for real-world networks with the same density • for large networks $\lim_{n \to \infty} C_{ER} = 0$, so large random networks resemble a tree (i.e. they have no clustering) Components in ER graphs • if $p$ (and thus, also $\bar k$) is small, there are several disjoint components • if $p$ is increased, there is one giant component (of size $\approx n$) with the rest of the nodes being in isolated small components • the giant component appears when $p \approx 1/n$, e.g. for $n = 10^3$ (see figure)
  66. 66. Random Network Models Percolation transition (when increasing $p$) 1. Subcritical: $\bar k < 1$, many small simple components of size at most $\ln n$ 2. Critical: $\bar k \approx 1$, the size of the largest component is $\sim n^{2/3}$, the giant component appears and starts growing 3. Supercritical: $\bar k > 1$, there is one giant component of size almost $n$, the second largest component has size about $\ln n$ Summary of ER graph • degree distribution is Poisson (most nodes have degree close to average) with no correlations of node degrees • average path length is small and $\propto \ln n$ • connectivity depends on $\bar k$ with percolation transition
  67. 67. Random Network Models Random geometric models [Penrose, 2003] • main motivation: some networks can grow subject to geometric constraints • e.g., place $n$ nodes randomly in (2D) space; two nodes $i$ and $j$ are connected only if their distance $\| x_i - x_j \| \leq r$ • there exists a critical radius $r_c$ to form a connected giant component (if $r > r_c$): $r_c = \sqrt{\frac{\ln n + O(1)}{\pi n}}$ Random distance models [Avin, 2008] • $n$ nodes placed randomly in (2D) space • links created randomly with the probability $\propto f(d_{ij})$ where $d_{ij} = \| x_i - x_j \|$
  68. 68. Pavel Loskot c 2014 7/23 Small World Networks More on Milgram’s experiment • how accurate 6 degree separation, how likely the chain to be completed Findings from real-world social networks • sub-optimal choice in choosing next link in chain is made 1/2 of time • Facebook measurements: average distance is 4.74 • Twitter measurements: average distance is 4.67 (50% are at 4 steps, nearly everyone in 5 steps)
  69. 69. Pavel Loskot c 2014 8/23 Small World Networks Main features high clustering: Creal−world ≫ Crandom average path length: ¯lreal−world ≈ ¯lrandom Watts-Strogatz (WS) small world network model [Nature, 1998] • launched the interest into complex networks (over 3.5k citations) • single control parameter to generate regular to purely random networks • the model: 1. generate regular graph, 2. rewire links with probability p • in all network generators: self-loops and duplicated links not allowed
  70. 70. Pavel Loskot c 2014 9/23 Small World Networks WS original model • select fraction of p edges and rewire one of their end-points WS model alternation • add fraction p of edges to initial regular lattice
  71. 71. Small World Networks Properties of WS model: Degree distribution $\Pr(k) = \sum_{i=0}^{\min(k-K,K)} \binom{K}{i} (1-p)^i p^{K-i} \frac{(pK)^{k-K-i}}{(k-K-i)!} e^{-pK}$ $\sim$ Poisson-like distribution Clustering coefficient • if node $i$ has $K$ neighbors, $C = \frac{\#\text{edges among } K \text{ neighbors}}{\binom{K}{2}}$ • the probability that a connected triple is still connected after rewiring is $(1-p)^3$ • $C(p=0) = \frac{3k-3}{4k-2} \approx \frac{3}{4}$ (for large $k$), $C(p=1) \approx \frac{2k}{n}$ • $C(p) \approx C(p=0) \cdot (1-p)^3$, i.e., $C(p)/C(0) \approx (1-p)^3$ Average path length: $\bar l \approx \frac{(n-1)(n+2k-1)}{4kn}$
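The clustering value at $p = 0$ can be checked on a small ring lattice. The sketch assumes that $k$ in $C(p=0) = (3k-3)/(4k-2)$ counts the neighbors on each side of a node (so every node has $2k$ neighbors) and that $n > 2k$; the rewiring routine is one simple variant that keeps the number of edges fixed and forbids self-loops and duplicated links, not necessarily the exact procedure of the original paper.

```python
import random
from itertools import combinations

def ring_lattice(n, k):
    """Each node is linked to its k nearest neighbors on each side (requires n > 2k)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for d in range(1, k + 1):
            u = (v + d) % n
            adj[v].add(u)
            adj[u].add(v)
    return adj

def rewire(adj, p, seed=None):
    """With probability p, move one end-point of each edge to a random new node."""
    rng = random.Random(seed)
    n = len(adj)
    new = {v: set(nb) for v, nb in adj.items()}
    original_edges = [(u, v) for u in adj for v in adj[u] if u < v]
    for u, v in original_edges:
        if rng.random() < p:
            candidates = [w for w in range(n) if w != u and w not in new[u]]
            if candidates:
                w = rng.choice(candidates)
                new[u].discard(v)
                new[v].discard(u)
                new[u].add(w)
                new[w].add(u)
    return new

def average_clustering(adj):
    """Mean of the local clustering coefficients (0 for nodes of degree < 2)."""
    total = 0.0
    for v, nbrs in adj.items():
        deg = len(nbrs)
        if deg >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += links / (deg * (deg - 1) / 2)
    return total / len(adj)

lattice = ring_lattice(20, 2)
c0 = average_clustering(lattice)          # compare with (3k-3)/(4k-2) for k = 2
small_world = rewire(lattice, 0.3, seed=1)
c_rewired = average_clustering(small_world)
```

Sweeping `p` from 0 to 1 and plotting `average_clustering` next to the average path length would reproduce the characteristic small-world regime qualitatively.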
  72. 72. Small World Networks Kleinberg’s geographical small world model [Nature, 2000] • connectivity derived from geographical distances • the model: 1. link nearest neighbors 2. add links with the probability $\Pr(\text{link between } u \text{ and } v) \sim \mathrm{const} \times \| u - v \|^{-r}$ where $r$ is the navigability exponent (e.g., links are purely random for $r = 0$) Hierarchical small world model [Science, 2001] • hierarchically nested groups, link probability $p_{ij} \sim \exp(-\alpha x_{ij})$ Other strategies for generative models of small world networks • add/rewire links based on chosen properties of current links and edges • add/rewire links to optimize a particular property of the network
  73. 73. Pavel Loskot c 2014 12/23 Small World Networks Topology trade-offs (a) commuter rail network (b) star network (c) minimum spanning tree network
  74. 74. Pavel Loskot c 2014 13/23 Small World Networks Bridges • in social networks, close friends know what you know, and they also know others who know what you know • bridge between A and B (if removed, these two nodes become disconnected) • local bridge between A and B (if removed, distance A-B increased to > 2)
  75. 75. Small World Networks Strong and weak ties [Granovetter, 1974] • in social networks, links are strong or weak ties (friends vs. acquaintances) • strong triadic closure: if A-B and A-C are strong ties, then at least a weak tie between B-C exists • if there are enough strong ties in the network, local bridges must be weak ties Almost local bridges • neighbor overlap of nodes A and B: $\frac{\#\text{neighbors of both A and B}}{\#\text{neighbors of at least A or B}} = \frac{N(i,j)}{(k(i)-1) + (k(j)-1) - N(i,j)}$ • almost local bridges are links whose end-nodes have no common neighbors (i.e., the overlap of their neighbors is 0)
  76. 76. Small World Networks Removing ties from social networks (percolation analysis) • removing weak ties breaks down the network • removing strong ties degrades the network more smoothly • however, this is specific only to social networks Removing ties from other networks • e.g. removing an important road (strong tie) is more damaging • central veins are more important than peripheral veins
  77. 77. Pavel Loskot c 2014 16/23 Small World Networks Illustration of weak vs. strong ties removal (a) original network (b) 80% of strongest links removed, 20% of weak ties remain (c) 80% of weakest links removed, 20% of strong ties remain • no evidence of degradation for (b), network clearly fragmented in case (c) • strong links are within dense neighborhoods (triangles, cliques etc.) • weak links (and bridges) interconnect these dense neighborhoods
  78. 78. Scale Free Networks Power-law distribution $\Pr(k) \sim \mathrm{const} \times k^{-\gamma}$, typically $2 < \gamma < 3$ • i.e., a straight line in the log-log domain: $\log \Pr(k) \sim -\gamma \log k + \log \mathrm{const}$ • always a few highly connected hubs
  79. 79. Scale Free Networks Power-law distribution • some other distributions look like power-laws • estimating $\gamma$ may not be so easy • 1st and 2nd moments: $\mathrm{E}[k] \propto \int_{k_0}^{\infty} k \times k^{-\gamma}\, dk = \lim_{k \to \infty} \frac{1}{2-\gamma} \left( \frac{1}{k^{\gamma-2}} - \frac{1}{k_0^{\gamma-2}} \right)$, which is $= \infty$ if $\gamma \leq 2$ and $< \infty$ if $\gamma > 2$; $\mathrm{E}[k^2] \propto \int_{k_0}^{\infty} k^2 \times k^{-\gamma}\, dk$, which is $= \infty$ if $\gamma \leq 3$ and $< \infty$ if $\gamma > 3$ Preferential attachment • now, the concern is how to generate scale-free networks • we use the rich-get-richer effect and add new nodes sequentially: 1. with probability $p$, choose any existing node and link to it 2. with probability $1 - p$, link to an existing node with probability proportional to its current degree
  80. 80. Scale Free Networks Barabási-Albert (BA) scale-free model 1. Growth: start from a seed network of $m_0$ isolated nodes 2. Preferential attachment: add a new node with $m \leq m_0$ edges to existing nodes that are chosen with the probability $\Pi(i) = k(i) / \sum_j k(j)$ 3. after $t$ steps, the network has $n = m_0 + t$ nodes and $mt$ edges, and $\Pi(i) = \frac{k(i)}{\sum_j k(j)} = \frac{k(i)}{2mt - m} \approx \frac{k(i)}{2mt}$ • this procedure generates the degree distribution $\Pr(k) = \frac{2m(m+1)}{k(k+1)(k+2)} \approx \frac{2m(m+1)}{k^3} \propto k^{-3}$ so $\gamma = 3$ and the average degree is $\bar k = 2m$ • average shortest path length: $\bar l \propto \frac{\ln n}{\ln(\ln n)}$ • clustering coefficient: $C \propto \frac{(\ln n)^2}{n}$ ( . . . too small for real-world networks)
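The growth and preferential attachment steps can be sketched as follows. Because the seed nodes are isolated, all degrees are zero at the first step and $\Pi(i)$ is undefined; the sketch assumes a uniform choice among the seed nodes in that case, which is an implementation decision made here and not something stated on the slide. Sampling from a list in which node $i$ appears $k(i)$ times realizes the probability $k(i)/\sum_j k(j)$.

```python
import random

def barabasi_albert(m0, m, t, seed=None):
    """Grow a network for t steps; return (list of degrees, list of edges)."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    degree = [0] * m0      # seed network of m0 isolated nodes
    edges = []
    pool = []              # node i appears k(i) times in this list
    for step in range(t):
        new = m0 + step
        targets = set()
        while len(targets) < m:                      # m distinct existing nodes
            if pool:
                targets.add(rng.choice(pool))        # probability proportional to degree
            else:
                targets.add(rng.randrange(new))      # assumption: uniform while all k = 0
        degree.append(0)
        for i in targets:
            edges.append((new, i))
            degree[i] += 1
            degree[new] += 1
            pool.extend([i, new])
    return degree, edges

degree, edges = barabasi_albert(m0=3, m=2, t=2000, seed=7)
mean_degree = sum(degree) / len(degree)   # approaches 2m for large t
max_degree = max(degree)                  # hubs: much larger than the mean
```

A log-log histogram of `degree` over a large run is the usual way to compare the outcome with the $k^{-3}$ prediction, keeping in mind that seed nodes never chosen at the first step stay isolated in this variant.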
  81. 81. Scale Free Networks Question • In scale-free networks, how predictable is “popularity”? Answer • if we restart the process, different popular nodes will emerge Other scale-free network models • motivation: improve the clustering coefficient and allow changing the exponent $\gamma$ [Holme, Kim] • after the preferential attachment step, with probability $p$, add one more edge to a randomly selected neighbor • resulting clustering coefficient $C \propto \frac{1}{k}$ ( . . . much more realistic) [Vazquez et al.] • random walk instead of preferential attachment (“get to know important people through people you already know”) [Kleinberg, Kumar] • copy a vertex and rewire its edges with certain probability
  82. 82. Scale Free Networks Network generator with given $\gamma$ 1. Initialize: seed network with $m_0$ (isolated) nodes 2. Add one node and $m$ links (not necessarily stemming from the new node) at each time; after $t$ time-steps, the link is added to node $i$ with the probability $\Pi(i) = \alpha \frac{k(i)}{\sum_j k(j)} + (1 - \alpha) \frac{1}{t + m_0}$ where $\sum_j k(j) = 2mt$ 3. thus, $\alpha = 1$ leads to preferential attachment, $\alpha = 0$ to uniform attachment, and the degree distribution is $\Pr(k) \propto k^{-(1 + 1/\alpha)}$ Models with non-power-law distribution • power-law distribution is a good fit for large networks (averaging effect) • on smaller scales, in more “specialized” sub-networks, power-law may not be such a good fit • log-normal distribution has been observed in such cases
  83. 83. Scale Free Networks Configuration network model • degrees are pre-assigned to $n$ nodes assuming a degree distribution $\Pr(k)$ • edges are added by randomly selecting pairs of these $n$ nodes • a family of graphs generated this way will have the same degree distribution • excess degree is the number of possible outward links of a node which has been arrived at during a walk (i.e., one less than the node degree in undirected graphs): $\Pr(k_{\mathrm{excess}} = k) = \frac{(k+1) \Pr(k+1)}{\sum_{k'} k' \Pr(k')}$ Other models • many other stochastic models of networks can be devised and then analyzed • hence, it is important to define the quality of such models, e.g. (generally): flexibility (design for specific parameter settings); mathematical tractability; accuracy (to fit experimental data, make predictions)
  84. 84. Take-Home Messages Random networks • Erdős & Rényi studied a simple model in 1959 • it has a Poisson degree distribution with small average path length, but clustering goes to zero with network size Small world networks • the world is small, 6 degree separation (Milgram’s experiment) • short average path length, but clustering still smaller than in real networks • real-world networks contain weak and strong ties • Watts & Strogatz proposed a simple model of small world networks (in 1998) Scale free networks • main focus is to produce a power-law distribution • Barabási & Albert proposed a model based on preferential attachment (in 1999); many modifications of this model can be (and were) devised Network models • mostly stochastic; the main motivation is to emulate real-world networks • find structural properties to explain specific (global) properties of networks • useful to define quality of these models
  85. Networks: Robustness
  86. 86. Pavel Loskot c 2014 1/11 Robustness Percolation • monitor network metrics while nodes or edges are being removed Dual problem • monitor network metrics while nodes or edges are being added 1. What strategy to remove/add nodes or edges? • no knowledge: nodes and edges removed (uniformly) at random • knowledge of structure: removing nodes and edges with high centrality • adding nodes and edges: cf. random network generators 2. Which metrics most relevant and should be monitored? • rate of decay/growth of: network diameter, average degree, average distance, size of giant component etc. 3. Which (class of) networks to consider? • any network, networks with specific degree distribution etc. 4. Why to consider robustness? • in general, networks resilience to attacks is a growing concern • want to design networks that are robust to damage
  87. 87. Pavel Loskot c 2014 2/11 Robustness Pragmatic definition • the network is robust if it can withstand accidental damage, random topology changes as well as intentional attacks and remain operational • this accounts for the remaining nodes and links to be able to carry flows and perform other tasks without excessive congestion, dead-locks etc. • observing average decay (e.g. size of giant component) may not be that useful (e.g. it cannot identify local congestion further impairing the network) • note also that we are still considering only networks with static topology Example • 50 nodes, removing 40 out of 116 edges decreases ¯k from 4.6 to 3.0
  88. 88. Pavel Loskot c 2014 3/11 Robustness as Stability Global stability • system is stable if it returns to equilibrium after any perturbation Resistance • ability of a community to resist change in face of potentially perturbing force Resilience • ability of a community to recover to normal functioning after disturbance Variability • variations in community density over time (measured e.g. as changes in mean/variance) due to external disturbances
  89. 89. Pavel Loskot c 2014 4/11 Robustness Percolation threshold • if ¯k decreases by removing edges, network suddenly becomes disconnected • if ¯k increases by adding edges, giant component suddenly emerges Examples p probability of filling squares, at p critical, giant connected component appears
  90. Robustness
Experiment [Barabási et al., 2000]
strategy: random failures versus targeted attacks removing nodes
metrics: average or maximum (network diameter) shortest path
networks: exponential versus scale-free (the same |V| and |E|)
  91. 91. Pavel Loskot c 2014 6/11 Robustness Experiment (cont.) effect on size of giant component s and its average s
  92. 92. Pavel Loskot c 2014 7/11 Robustness Experiment (cont.) (Internet and WWW) effect on size of giant component s and its average s
  93. 93. Pavel Loskot c 2014 8/11 Robustness of Scale-Free Networks Random failures vs targeted attack (a) original network of 574 nodes (b) removing 20% (115) of nodes randomly leaves 427 nodes in giant component (c) removing only 2.8% (22) most connected hubs leaves 301 nodes in giant component Bottom line • scale-free networks are robust against random failures • they are very vulnerable against targeted attacks
  94. 94. Pavel Loskot c 2014 9/11 Robustness of Scale-Free Networks Impact of power-law exponent on robustness (∼ k−γ ) • γ = 2.5: graceful degradation • γ = 3.5: giant component disappears at about f = 40% • assume e.g. case of γ = 2.7 (square markers) • kmax is maximum degree among remaining nodes • removing only 1% of nodes discards giant component (top figure) • kmax has to be very small to destroy giant component (bottom figure)
  95. Robustness
Percolation threshold for random failures
• in general, the minimum fraction of nodes required (i.e., that cannot be randomly removed) for the giant component to exist is
   $f_c = \frac{E[k]}{E[k^2] - E[k]}$
• specialized for random networks: $f_c = \frac{1}{E[k]}$; thus, if $\bar{k} = E[k]$ is large, a random network can withstand large losses; e.g. if $\bar{k} = 4$, then 1/4 of the nodes is enough for the giant component to exist (i.e., 3/4 of the nodes have to be removed to destroy the giant component)
• specialized for scale-free networks: $f_c \to 0$ as $E[k^2]$ tends to be very large (even infinite), which makes these networks very robust against random failures (though not against targeted attacks on hubs)
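The general threshold formula can be evaluated for any empirical degree sequence. The sketch below assumes the degrees are given as a plain list; the function name is hypothetical.

```python
def critical_fraction(degrees):
    """Minimum fraction of nodes that must remain for a giant component: E[k] / (E[k^2] - E[k])."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    denominator = mean_k2 - mean_k
    if denominator <= 0:
        raise ValueError("E[k^2] - E[k] must be positive")
    return mean_k / denominator


regular = [3] * 10          # every node has degree 3
hubby = [1] * 99 + [99]     # one hub connected to 99 leaves
print(critical_fraction(regular))
print(critical_fraction(hubby))
```

The hub-dominated sequence has a much larger second moment, so its threshold is far smaller, which mirrors the scale-free case on the slide.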
  96. Take-Home Messages
Scale-free networks
• very robust against random failures (some suggest that this is the reason why these networks are found so often in the real world)
• but very vulnerable to attacks on highly connected hubs, since hubs are also responsible (and effective) for spreading messages, diseases etc. through the network
• in social networks, it is not hubs but rather weak ties and bridges that make these networks vulnerable
Small world networks
• have not been considered here
• one extreme is a ring with one-hop neighboring connections and no shortcut links; such a network is not robust at all
• another extreme is a fully connected network, which is unbreakable
• small world networks lie between these two extremes; their robustness is likely derived from the density of shortcut links
Making networks more robust
• an obvious strategy is to guarantee some minimum degree for every node (i.e., to achieve connection redundancy)
  97. Networks: Processes
  98. 98. Pavel Loskot c 2014 1/13 Epidemic Spreading Network processes • strongly influenced by network structure; e.g. shortcuts significantly speed up spreading (of information, diseases) and synchronization of processes • hence, understanding of such network(ed) (distributed) processes requires understanding of the underlying network structures • e.g. neurons integrate signals from neighbors, if above threshold, the excitation fires and then fades away; this leads to oscillating cascades • here, we consider diffusion of diseases characterized by contagion (lack of choice), unlike information spreading where nodes make decisions to maximize their pay-offs
  99. 99. Pavel Loskot c 2014 2/13 Epidemic Spreading Simple spreading models • ring topology with shortcuts • all nodes susceptible • nodes infected with probability p • spreading of disease, computer viruses, . . . • tree topology of spreading in waves of k nodes, all nodes susceptible • nodes infected with probability p a) p is large, disease spreads out b) p is small, disease dies out Reproductive number: R0 = kp a) if R0 < 1, disease dies out in finite number of waves b) if R0 > 1, disease very likely infects at least 1 person in each wave
  100. Epidemic Spreading
Limitations of simple models
• small changes in k and p can move R0 above or below the threshold (R0 ≷ 1)
• network topology is not realistic (e.g. no triangles)
• nodes get infected only once and never recover
More realistic: SI model
• two classes of nodes: S (susceptible) and I (infected)
• once infected, a node cannot recover
   $|V| = |S| + |I|$: total number of nodes ($V = S \cup I$)
   $\beta = \lambda \bar{k}$: infection rate per node ($0 \le \lambda \le 1$)
   $\beta |S| / |V|$: susceptible contacts per unit of time
   $\frac{d|I|}{dt} = \beta |S| |I| / |V|$: overall rate of infection
• let $i = |I| / |V|$ be the fraction of nodes infected, then
   $\frac{di}{dt} = \beta\, i\, (1 - i)$
  which yields a logistic curve:
   $i(t) = \frac{i(0)\, e^{\beta t}}{1 - (1 - e^{\beta t})\, i(0)}$
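One way to check the logistic solution of the SI model is to compare it with a direct numerical integration of di/dt = β i (1 − i). The sketch below uses a simple forward Euler scheme; the function names, the step size and the parameter values are illustrative assumptions.

```python
import math


def si_closed_form(i0, beta, t):
    """Logistic solution i(t) of the SI model for initial infected fraction i0."""
    growth = math.exp(beta * t)
    return i0 * growth / (1.0 - (1.0 - growth) * i0)


def si_euler(i0, beta, t_end, dt=1e-3):
    """Forward Euler integration of di/dt = beta * i * (1 - i)."""
    i = i0
    for _ in range(int(round(t_end / dt))):
        i += dt * beta * i * (1.0 - i)
    return i


print(si_closed_form(0.01, 0.5, 10.0))
print(si_euler(0.01, 0.5, 10.0))
```

Both values should agree closely, and the closed form approaches 1 for large t because nodes never recover in this model.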
  101. Epidemic Spreading
More realistic: SIR model
• improve the SI model by assuming infected nodes recover at rate υ, i.e., nodes stay infected only for (average) time τ = 1/υ
• a recovered node becomes resistant (i.e. cannot be infected again)
• define fractions s = |S|/|V|, i = |I|/|V|, r = |R|/|V|, so s + i + r = 1; the rates of change of these fractions over time are
   $\frac{ds}{dt} = -\beta s i, \quad \frac{di}{dt} = \beta s i - \upsilon i, \quad \frac{dr}{dt} = \upsilon i$
  the solution again requires initial conditions s(0), i(0) and r(0)
Possible outcomes
a) disease may die out
b) disease may spread to the whole network
c) disease becomes endemic (neither spreads nor dies out)
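The SIR equations have no simple closed-form solution, so they are usually integrated numerically. The following is a minimal forward Euler sketch; the function name, the step size and the parameter values are assumptions chosen for illustration.

```python
def simulate_sir(s0, i0, r0, beta, upsilon, t_end, dt=0.01):
    """Integrate ds/dt = -beta*s*i, di/dt = beta*s*i - upsilon*i, dr/dt = upsilon*i."""
    s, i, r = s0, i0, r0
    history = [(0.0, s, i, r)]
    for n in range(1, int(round(t_end / dt)) + 1):
        new_infections = beta * s * i * dt
        recoveries = upsilon * i * dt
        s -= new_infections
        i += new_infections - recoveries
        r += recoveries
        history.append((n * dt, s, i, r))
    return history


outbreak = simulate_sir(0.99, 0.01, 0.0, beta=0.5, upsilon=0.1, t_end=100.0)
t, s, i, r = outbreak[-1]
print(round(s, 3), round(i, 3), round(r, 3))
```

With β well above υ, most of the population ends up in the recovered class; with β below υ, the infected fraction only shrinks.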
  102. Epidemic Spreading
More realistic: SIS model
• no (permanent) recovery, but an infected node may again become susceptible
• infected-to-susceptible rate υ:
a) if β > υ, logistic growth (as in SI model), but never infects the whole population
b) if β → υ, then i → 0 (infection will slowly die out)
c) if β < υ, then infection dies out exponentially
• mathematical model (assumes r = 0):
   $\frac{ds}{dt} = \upsilon i - \beta s i, \quad \frac{di}{dt} = \beta s i - \upsilon i, \quad s + i = 1$
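The three cases can be read off from the equilibrium of the SIS equations. As a short worked step, substituting s = 1 − i into the equation for di/dt gives:

```latex
\begin{align*}
\frac{di}{dt} &= \beta (1 - i)\, i - \upsilon i = i \left( \beta - \upsilon - \beta i \right), \\
\frac{di}{dt} = 0 \;&\Rightarrow\; i^* = 0 \quad \text{or} \quad i^* = 1 - \frac{\upsilon}{\beta}.
\end{align*}
```

The non-zero equilibrium is positive only if β > υ and is always below 1, which is why the infection never reaches the whole population in case (a). As β → υ, the equilibrium i* tends to 0, which is case (b). For β < υ and small i, di/dt ≈ (β − υ) i < 0, so the infection decays exponentially as in case (c).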
  103. Epidemic Spreading
Prognosis of epidemic
• reproductive number: R0 = β/υ
a) if R0 > 1, infection survives
b) if R0 < 1, infection dies out
• in SI model, υ → 0, so R0 ≫ 1
Figure panels: (a) SI model, (b) SIR model, (c) SIS model
Extensions of SIR model
• rather than assuming recovery after τ time units, let recovery be possible at each time with some (fixed) probability
• infected state further subdivided (e.g. early, middle and final disease stages)
• non-homogeneous mixing: restrictions on how the nodes meet (e.g. travel to geographical locations, quarantining, . . . )
• other random network models (note that the Erdős-Rényi model with homogeneous mixing was implicitly assumed in SI, SIR and SIS models)
  104. 104. Pavel Loskot c 2014 7/13 Epidemic Spreading SIS model in scale-free networks • experimentally observed that computer viruses survive significantly longer than predicted from SIS model over random networks • it was found that there is no epidemic threshold in scale-free networks, so infection proliferate independently of spreading rate • however, there is critical fraction of shortcuts in scale-free networks; if enough shortcuts, disease suddenly becomes epidemic • critical fraction of shortcuts is a function of rates β and υ
  105. 105. Pavel Loskot c 2014 8/13 Epidemic Spreading Network immunization • random networks: uniformly random immunization is helpful • scale-free networks: targeted degree-based immunization required as random immunization does not help • targeted local immunization: immunize one immediate neighbor for every node in a randomly selected group (i.e. nodes with higher degree are more likely to be immunized) red-circles: random immunization of scale-free network red-squares: targeted immunization of scale-free network black-squares: random & targeted immunization of random network
  106. 106. Pavel Loskot c 2014 9/13 Take-Home Messages Epidemic spreading • practical modeling requires to extract model parameters from real data • knowledge of nodes mobility is key to accurate modeling of spreading • epidemic spreading strongly influenced by information diffusion (i.e. knowing what is happening and what to do) • predictive modeling (if epidemic spreading on-going, it is desirable to be in real-time) is routinely used in practice as prevention SARS prediction and comparison with real outbreak data
  107. 107. Pavel Loskot c 2014 10/13 Network Dynamics Spatial-temporal scales (a) short: link activation and deactivation – topology is a snapshot – connected components must respect time sequences of links (b) longer: topology change from one structure to another – communities formation, merging, splitting – large communities persist in time if there is exchange of their members – small communities persist if their core is highly connected with strong ties (c) long: network evolution (birth, growth and decline) – in scale-free networks, in spite of changes (nodes and links appear and disappear), degree, weight and strength distributions remain stationary
  108. Information Cascades
Aims
• understand how behaviors, ideas, technology usage etc. are adopted, influenced and spread through networks
Diffusion model
• two nodes v and w
• two behaviors A and B
• two pay-offs: a > 0 if both v and w adopt A, and b > 0 if both adopt B (i.e., the larger, the better)
Network implications
• let 0 ≤ p ≤ 1, and let v have d neighbors
• pd neighbors of v choose A
• (1 − p)d neighbors of v choose B
• A is the better strategy for v if:
   $p d \cdot a \ge (1 - p) d \cdot b \;\Rightarrow\; p \ge \frac{b}{a + b}$
  109. Information Cascades
Example diffusion in a network
• let a = 3, b = 2, so b/(a + b) = 2/5
• A: dark circles, B: light circles
(b) only v and w adopt A
(c) nodes r and t switch to A (i.e. 2/3 of their neighbors chose A); u does not switch (only 1/3 of its neighbors chose A); note also that 1/3 < 2/5 < 2/3
(d) nodes s and u also switch to A
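The threshold rule p ≥ b/(a + b) can be simulated on any small graph. The sketch below is only an illustration: the path graph is a hypothetical example (not the network in the slide figure), updates are applied in synchronous rounds, and nodes that adopt A are assumed never to switch back.

```python
def run_cascade(neighbors, initial_adopters, a, b):
    """Spread behavior A from the initial adopters using the threshold b / (a + b)."""
    adopters = set(initial_adopters)
    while True:
        switching = set()
        for node, nbrs in neighbors.items():
            if node in adopters or not nbrs:
                continue
            using_a = sum(1 for n in nbrs if n in adopters)
            # p >= b / (a + b) rewritten without division: using_a * (a + b) >= b * d
            if using_a * (a + b) >= b * len(nbrs):
                switching.add(node)
        if not switching:
            return adopters
        adopters |= switching


# Hypothetical path graph 0 - 1 - 2 - 3 with initial adopters 0 and 1.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(run_cascade(path, {0, 1}, a=3, b=2)))  # threshold 2/5
print(sorted(run_cascade(path, {0, 1}, a=2, b=3)))  # threshold 3/5
```

Swapping the pay-offs raises the threshold from 2/5 to 3/5, which is enough to stop the cascade on this graph.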
  110. 110. Pavel Loskot c 2014 13/13 Take-Home Messages Cascades • initial adoption by few nodes may generate complete cascade • it is dependent on network structure • it is also crucially dependent on threshold b/(a+b), so changing pay-offs can make big difference (e.g. making the product more attractive) • OR, directly influence key nodes (initial adopters) • densely inter-connected clusters are difficult to penetrate • key parameters: clusters connection density and pay-off threshold Role of weak ties • very useful in spreading information • poor in transferring behaviors that are risky and/or costly Influencing nodes • in networks with many clusters, users are more easily influenced • reinforcement is very important in influencing users • node centrality is crucial for (information, behavior) diffusion
  111. Networks: Algorithms
  112. 112. Pavel Loskot c 2014 1/24 Max Flow and Min Cut Scenario • single source node s and single sink node t (for simplicity) • directed edges between nodes represent flows (information, material, . . . ) • every edge assigned a weight representing max possible flow ≡ capacity
  113. 113. Pavel Loskot c 2014 2/24 Max Flow and Min Cut Dual problems (of combinatorial optimization) 1. find minimum cut of a graph G = (V, E) where V is set of nodes and E are weighted edges (max flows) 2. find maximum possible total flow from s ∈ V to t ∈ V over E while flows at every other node are equalized (in-flow = out-flow) Cut (S, T) • node partitioning V = S ∪ T such that S ∩ T = ∅ and s ∈ S and t ∈ T Capacity of cut (S, T) • sum of weights (capacities) leaving set S and entering set T
  114. Max Flow and Min Cut
Minimum cut problem
• find the cut with the minimum capacity
Maximum flow problem
• assign flows to edges, none larger than the edge capacity, so that the total flow from s to t is maximized and flows in all other nodes (V \ {s, t}) are equalized
  115. 115. Pavel Loskot c 2014 4/24 Max Flow and Min Cut Observation 1 • flow from S to T is equal to the total flow reaching sink t
  116. 116. Pavel Loskot c 2014 5/24 Max Flow and Min Cut Observation 2 • flow from S to T is at most equal to capacity of the cut • if flow from S to T is equal to capacity of the cut, then we have maximum possible flow from S to T and (S, T) is minimum cut
  117. 117. Pavel Loskot c 2014 6/24 Max Flow and Min Cut Greedy algorithm 1. select a path from s to t and set its flow to be equal to the minimum capacity among its edges (≡ bottleneck) 2. for every edge, obtain residual capacity ≡ capacity - flow (“undo” flow sent): i.e. add edge (w, v) to every edge (v, w) with positive residual capacity 3. augment path with strictly positive residual capacities
  118. 118. Pavel Loskot c 2014 7/24 Max Flow and Min Cut Ford-Fulkerson algorithm • greedy algorithm to find a maximum flow • find augmenting path with strictly positive residual capacities • if path can no longer be augmented, the flow is maximum Max-flow min-cut theorem • The value of maximum flow is equal to the capacity of the minimum cut. Complexity of Ford-Fulkerson algorithm • assume capacities are integers 1, . . . , U • Theorem 1: the algorithm terminates in at most |V| · U iterations. • Theorem 2: if all edge capacities are integers, then the maximum flow has integer values of flows on every edge.
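The augmenting-path procedure can be sketched in a few lines. The version below picks each augmenting path by breadth-first search (the Edmonds-Karp variant of Ford-Fulkerson), which is one possible choice rather than the only one; the function name and the example capacities are hypothetical.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """capacity: dict {u: {v: c}} of directed edge capacities; returns the max-flow value."""
    if source == sink:
        raise ValueError("source and sink must differ")
    # Residual graph: forward capacities plus reverse ("undo") edges starting at 0.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a path with strictly positive residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left, so the flow is maximum

        # Bottleneck: minimum residual capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck flow and update residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


capacity = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
print(max_flow(capacity, "s", "t"))
```

In this example the cut ({s}, rest) has capacity 3 + 2 = 5, which matches the flow value returned, as the max-flow min-cut theorem predicts.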
  119. 119. Pavel Loskot c 2014 8/24 Max Flow and Min Cut Choosing initial augmenting path • some choices lead to exponential time algorithm, clever choices lead to polynomial time algorithm (number of iterations): 1. choose path with fewest edges (shortest path, breadth first search) 2. choose path with maximum bottleneck capacity (fastest path, priority or depth first search) Application: Bipartite matching • find maximum matching of a bipartite graph G • solve max-flow problem for extended graph G′ • by integer theorem (see above), there exists a maximum flow with 0/1 values
  120. Take-Home Messages
Applications of max-flow and min-cut theorem
• Network connectivity
• Bipartite matching
• Data mining
• Open-pit mining
• Airline scheduling
• Image processing
• Project selection
• Baseball elimination
• Network reliability
• Security of statistical data
• Distributed computing
• Egalitarian stable matching
• . . .
There are many efficient algorithms for solving the max-flow min-cut problem.
  121. 121. Pavel Loskot c 2014 10/24 Network Routing Routing algorithms • find the least cost path between any two nodes in the (telecommunication) network • link cost: e.g. capacity, inverse of delay, or more simply, all links have a unit cost • path cost: sum of link costs along the path 1. Link state routing algorithms • assume every node has knowledge about network topology and all link costs • thus, all nodes have the same (global) knowledge (how?) • so-called centralized or link state algorithms 2. Distance vector routing algorithms • only local knowledge of link costs to all neighbors • iterative computations in collaboration with neighbors • so-called decentralized or distance vector algorithms
  122. Network Routing
Link state routing: Dijkstra algorithm
• every node computes the least cost path to all other nodes in the network
• the computed paths are stored in a so-called forwarding table
• after K iterations, the least cost paths are known for K destination nodes
Algorithm notation:
   c(x, y): link cost between neighbors x and y (= ∞ if not neighbors)
   D(v): current cost of path from source to destination node v
   p(v): predecessor node along path from source to node v
   V′: set of nodes whose least cost paths are already known
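Using the notation c(x, y), D(v), p(v) and V′, the algorithm can be sketched as follows. This is a heap-based variant written for illustration, and the six-node topology is a hypothetical example rather than the network in the slide figures.

```python
import heapq


def dijkstra(cost, source):
    """cost: dict {x: {y: c(x, y)}} for neighboring nodes; returns (D, p)."""
    D = {source: 0}      # current least cost from the source
    p = {source: None}   # predecessor along the least cost path
    done = set()         # V' in the slide notation
    heap = [(0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if x in done:
            continue     # stale heap entry
        done.add(x)
        for y, c in cost[x].items():
            if y not in done and d + c < D.get(y, float("inf")):
                D[y] = d + c
                p[y] = x
                heapq.heappush(heap, (D[y], y))
    return D, p


def path_to(p, target):
    """Reconstruct the path by tracking predecessors back to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = p[target]
    return path[::-1]


# Hypothetical undirected topology given as (node, node, link cost).
links = [("u", "v", 2), ("u", "x", 1), ("u", "w", 5), ("v", "x", 2), ("v", "w", 3),
         ("x", "w", 3), ("x", "y", 1), ("w", "y", 1), ("w", "z", 5), ("y", "z", 2)]
cost = {}
for a, b, c in links:
    cost.setdefault(a, {})[b] = c
    cost.setdefault(b, {})[a] = c

D, p = dijkstra(cost, "u")
print(D["z"], path_to(p, "z"))
```

The dictionary p plays the role of the forwarding information: following predecessors from any destination back to the source recovers the least cost path.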
  123. 123. Pavel Loskot c 2014 12/24 Network Routing Dijkstra algorithm example • the shortest path constructed by tracking predecessors • if ties encountered, they can be broken arbitrarily
  124. Network Routing
Dijkstra algorithm example
Complexity of Dijkstra algorithm
• at each iteration, need to check all nodes not in V′, i.e., N(N + 1)/2 comparisons in total ∼ O(N^2)
• more efficient implementations have been devised ∼ O(N log N)
  125. Network Routing
Distance vector algorithm
• fully distributed generation of forwarding tables
• based on Bellman-Ford equation (dynamic programming)
   $d_x(y) = \min_{v \in N(x)} \left( c(x, v) + d_v(y) \right)$
   $v^* = \arg\min_{v \in N(x)} \left( c(x, v) + d_v(y) \right)$
   N(x): neighbors of node x
   c(x, v): link cost from x to v
   d_v(y): cost from neighbor v to destination y
   v∗: next hop in least cost path from x to y
Example
   $d_v(z) = 5, \quad d_x(z) = 3, \quad d_w(z) = 3$
   $d_u(z) = \min\{ c(u, v) + d_v(z),\; c(u, x) + d_x(z),\; c(u, w) + d_w(z) \} = \min\{ 2 + 5,\; 1 + 3,\; 5 + 3 \} = 4$
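A single Bellman-Ford update at one node is only a minimization over its neighbors. The sketch below reproduces the example for node u and destination z; the function name and the dictionary layout are assumptions made for illustration.

```python
def bellman_ford_update(link_costs, neighbor_estimates):
    """link_costs: {v: c(x, v)}; neighbor_estimates: {v: d_v(y)}; returns (d_x(y), v*)."""
    best_hop = min(link_costs, key=lambda v: link_costs[v] + neighbor_estimates[v])
    return link_costs[best_hop] + neighbor_estimates[best_hop], best_hop


# Values from the example: c(u, v) = 2, c(u, x) = 1, c(u, w) = 5.
cost, next_hop = bellman_ford_update({"v": 2, "x": 1, "w": 5}, {"v": 5, "x": 3, "w": 3})
print(cost, next_hop)  # 4 x
```

The returned next hop is what a node would store in its forwarding table for destination z.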
  126. Network Routing
Distance vector algorithm
• Dx(y) is the estimated least cost from x to y, computed iteratively
• every node x maintains distance vectors for itself and all its neighbors; recall that V is the set of all nodes and N(x) is the set of neighbors of x
   $D_x = \{ D_x(y) : y \in V \}$
   $D_v = \{ D_v(y) : y \in V \}, \quad v \in N(x)$
  and x also knows the costs c(x, v) to all its neighbors v ∈ N(x)
• key idea is to periodically exchange distance vectors Dx among neighbors; the vectors are then updated using the B-F equation:
   $D_x(y) \leftarrow \min_{v \in N(x)} \left( c(x, v) + D_v(y) \right), \quad \forall y \in V$
  so (under some minor conditions) the estimate Dx(y) → true value dx(y)
Distance vector updates (at each node)
1. asynchronous: triggered by a change of local link cost, or by an update message from a neighbor
2. synchronous: notify all neighbors if own distance vector changes
  127. 127. Pavel Loskot c 2014 16/24 Network Routing Example updates
  128. Network Routing
“Good news travels fast”, “Bad news travels slowly”
Comparison of link state and distance vector
• Messages: link state sends O(|V| · |E|) messages; distance vector exchanges messages locally only
• Convergence: link state takes O(|V|^2) and may have oscillations; distance vector convergence time varies, possibly with loops and the count-to-infinity problem
• Robustness: link state may advertise an incorrect link cost, and each node computes its own table; distance vector may advertise an incorrect path cost, and each node’s table is used by others (errors propagate)
  129. Search on Networks
• the aim is to find some source-destination path in a reasonable amount of time
• the path cost is not an issue, unlike in routing
Surprising observations (from real-world networks)
1. short paths exist between pairs of nodes (six degrees of separation)
2. these short paths can be discovered (and used)
Remarks
• both observations are closely interrelated
• it is not so clear how to discover (or even create) these short paths
• typically, nodes have only local rather than global information; flooding to discover the destination is known to be very inefficient
Decentralized search
• Kleinberg’s small world network model: n × n grid of nodes with local connections, plus every node v has a random long-range link to node w with
   $\Pr(v \text{ links to } w) \propto d(v, w)^{-\alpha}, \quad \alpha \ge 0$
  where the distance d(v, w) is the number of grid steps
• the value of α controls how random the long-range connections are
  130. Search on Networks
Comparing search strategies
• efficiency of a search strategy is the expected delivery time (over random long-range contacts, i.e. topology, and random source-destination pairs)
• delivery time ∼ number of hops in the graph (unit-weight links)
Trading-off value of α
• α = 0: long-range links are uniformly distributed (∼ WS model), difficult to navigate having only local knowledge (and knowing location of destination)
• for α = 0, the actual chosen path to destination is likely to be significantly longer than the corresponding shortest path
• α > 0: higher clustering, long-range links less random, more realistic scenario
• lower bounds on expected delivery time [Kleinberg 2000]
   $\bar{T}_D \ge \begin{cases} \mathrm{const} \times n^{(2 - \alpha)/3} & 0 \le \alpha < 2 \\ \mathrm{const} \times (\log n)^2 & \alpha = 2 \\ \mathrm{const} \times n^{(\alpha - 2)} & 2 < \alpha < 3 \end{cases}$
  thus, the case α = 2 is polynomial in log n, while the other cases are polynomial in n
  131. 131. Pavel Loskot c 2014 20/24 Search on Networks Web search • information retrieval since 60’s using “textual analysis” • more recently, information ranked by its score (e.g., #links to it) Scoring a webpage • #webpages pointing to it (unit-weight links) • sum of the scores of neighboring webpages pointing to it
  132. Search on Networks
Authorities
• nodes pointed to by highly ranked nodes
• they offer prominent, highly endorsed answers to queries
Hubs
• nodes that point to highly ranked nodes
Assessing authorities and hubs
• compute weights h(i) (for hubs) and b(i) (for authorities)
   $h(i) = \sum_j [A]_{ij}\, b(j), \qquad b(i) = \sum_j [A]_{ji}\, h(j)$
• the weights are computed iteratively as (in matrix form)
   $h_{t+1} = (A A^T)\, h_t, \qquad b_{t+1} = (A^T A)\, b_t$
• main drawback: it requires global knowledge (of A), so it is query-dependent
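The hub and authority iteration can be sketched with plain lists. The code below assumes that [A]ij = 1 when page i links to page j, and it rescales the weights to sum to one after every round; this normalization is an added assumption to keep the numbers bounded and is not part of the slide.

```python
def hubs_and_authorities(A, iterations=50):
    """A[i][j] = 1 if page i links to page j; returns (hub weights h, authority weights b)."""
    n = len(A)
    h = [1.0] * n
    b = [1.0] * n
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to i (A^T h).
        b = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
        # Hub weight: sum of authority weights of the pages that i points to (A b).
        h = [sum(A[i][j] * b[j] for j in range(n)) for i in range(n)]
        b_total = sum(b) or 1.0
        h_total = sum(h) or 1.0
        b = [x / b_total for x in b]
        h = [x / h_total for x in h]
    return h, b


# Hypothetical example: pages 0 and 1 both link to page 2.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
h, b = hubs_and_authorities(A)
print(h, b)
```

Page 2 collects all the authority weight, while pages 0 and 1 share the hub weight equally.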
  133. Search on Networks
PageRank (named after Google co-founder Larry Page)
• ranking pages independently of queries
• main idea: a page is important if it is linked to by other important pages
• every page is assigned a weight
   $w(j) = \sum_i [A]_{ij}\, w(i) \cdot \frac{1}{d_{\mathrm{out}}(i)}$
   w(i): weights of in-bound neighboring pages
   d_out(i): out-degree of node i, which dilutes its importance if it links to many other nodes
• the weights w(i) are the probabilities that, from any starting page, page i is reached via a random walk
• however, if some page does not have out-bound links, the random walker gets trapped; so with probability s follow the random walk, and with probability (1 − s) jump randomly to any other node
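The random walk with random jumps can be approximated by repeated application of the weight update. The sketch below is one possible reading of the slide: the value s = 0.85, the uniform jump over all n pages and the uniform redistribution from pages without out-bound links are assumptions, as is the function name.

```python
def pagerank(A, s=0.85, iterations=100):
    """A[i][j] = 1 if page i links to page j; returns a list of page weights summing to 1."""
    n = len(A)
    out_degree = [sum(row) for row in A]
    w = [1.0 / n] * n
    for _ in range(iterations):
        # With probability (1 - s) the walker jumps to a page chosen uniformly at random.
        new_w = [(1.0 - s) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Assumption: a page without out-bound links jumps uniformly at random.
                for j in range(n):
                    new_w[j] += s * w[i] / n
            else:
                for j in range(n):
                    if A[i][j]:
                        new_w[j] += s * w[i] * A[i][j] / out_degree[i]
        w = new_w
    return w


# Hypothetical example: pages 0 and 1 link to page 2, and page 2 links back to page 0.
A = [[0, 0, 1],
     [0, 0, 1],
     [1, 0, 0]]
print(pagerank(A))
```

Page 2 receives the largest weight because two pages link to it, and page 1 the smallest because nothing links to it.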
  134. 134. Pavel Loskot c 2014 23/24 Search on Networks Strategies • many strategies may be devised, some are more efficient than others • decentralized search is a practical requirement in large networks • in social networks, weak (social) ties and hierarchy play significant role • visiting the same nodes while searching is inefficient, yet there is tendency to visit hubs often To aid the search • nodes as sources of information are scored (e.g. by level of trust) • exploiting network structure of (distributed) information helps significantly • challenge: real-time updates of contents • ranking (i.e. scoring) algorithms are kept secret and changed (updated) continuously
  135. Take-Home Messages
Routing
• the task is not only to find a source-destination path, but the one having the least cost
• it is implicitly assumed that each node has an address (identification)
• routing in the Internet evolved over time (i.e., it has not been designed from the beginning)
• it is still unclear why Internet routing works so well at such large scales
• main issues with Internet routing are robustness, security and congestion
Search on small world and scale free networks
• small world networks have a short average path length and a high clustering coefficient; however, the Watts-Strogatz (WS) model does not capture the navigability of real-world networks
• search is fast and scales well in scale-free networks
  136. Networks: Software
  137. Software Requirements for Graph Data
Tasks
• input data in common format (e.g. Excel, CSV, . . . )
• convert (output) data into the desired format (GraphML, Pajek, . . . )
• Social Network Analysis (SNA) of data
• dynamic (temporal) analysis
• data visualization
Requirements
• gentle learning curve (easy to grasp)
• flexibility (use different formats for input and output)
• scalability (Big Data, application dependent)
• speed (if Big Data or real-time)
• parallel and distributed computing capability (MapReduce)
• functionality as modules or add-ins
• . . .
  138. Networks in Matlab
  139. Networks in Matlab
  140. Networks with Python
  141. Networks in C, R, Python
  142. Networks Visualization and Analysis
  143. Networks Community Analysis
  144. Social Network Analysis
  145. Popular in Bioinformatics
  146. Networks Online Demos
  147. Networks Data
