An introductory-to-mid level to presentation to complex network analysis: network metrics, analysis of online social networks, approximated algorithms, memorization issues, storage.
3. A social network is a structure composed by actors and their relationships
Actor: person, organization, role ...
Relationship: friendship, knowledge...
A social networking system is system allowing users to:
ā¢ construct a profile which represents them in the system;
ā¢ create a list of users with whom they share a connection
ā¢ navigate their list of connections and that of their friends
(Boyd, 2008)
So, what is a complex network?
A complex network is a network with non-trivial topological featuresā
features that do not occur in simple networks such as lattices or random
graphs but often occur in real graphs. (Wikipedia). Foggy.
4. COMPLEX
NETWORKS
A complex network is a network with non-trivial topological featuresāfeatures
that do not occur in simple networks such as lattices or random graphs but often
occur in real graphs.
ā¢ Non-trivial topological features (what are topological features?)
ā¢ Simple networks: lattices, regular or random graphs
ā¢ Real graphs
ā¢ Are social networks complex networks?
8. Path p = v0 ,ā¦,vk (viā1 ,vi ) āE
Path Length: length( p) Set of paths from i to j: Paths(i, j)
j
i
Shortest path length: (
L(i, j) = min {length( p) p āPaths(i, j)} )
Shortest/Geodesic path:
9.
10. Average geodesic distance ļ¬(i) = (n ā 1)ā1 ā L (i, j)
kā ļ{i}
V
Average shortest path length ļ¬ = n ā1 ā ļ¬(i)
iāV
Characteristic path length (
CPL = median {ļ¬(i) i āV } )
11. ā” A k ā¤ number of paths of length k from i to j
ā£ ā¦ij
12. ā” A k ā¤ number of paths of length k from i to j
ā£ ā¦ij
number of paths!
13. ā” A k ā¤ number of paths of length k from i to j
ā£ ā¦ij
number of paths!
degree
14. CLUSTERING
COEFFICIENT
()
ā1
ki
Local Clustering Coefficient Ci = 2 T (i)
T(i): # distinct triangles with i as vertex
Clustering Coefficient
1
C = ā Ci
n iāV
ā¢ Measure of transitivity x
ā¢ High CC ā āresilientā network
ā¢ Counting triangles
1
( )
Ī(G) = ā N(i) = trace A 3
6
iāV
15. CLUSTERING
COEFFICIENT
()
ā1
ki
Local Clustering Coefficient Ci = 2 T (i)
T(i): # distinct triangles with i as vertex
Clustering Coefficient
1
C = ā Ci
kx = 5
n iāV ā ā 1
kx
= 10
2
ā¢ Measure of transitivity
ā¢ High CC ā āresilientā network
ā¢ Counting triangles
1
( )
Ī(G) = ā N(i) = trace A 3
6
iāV
16. CLUSTERING
COEFFICIENT
()
ā1
ki
Local Clustering Coefficient Ci = 2 T (i)
T(i): # distinct triangles with i as vertex
Clustering Coefficient
1
C = ā Ci T (x) = 3
n iāV
ā¢ Measure of transitivity
ā¢ High CC ā āresilientā network
ā¢ Counting triangles
1
( )
Ī(G) = ā N(i) = trace A 3
6
iāV
17. CLUSTERING
COEFFICIENT
()
ā1
ki
Local Clustering Coefficient Ci = 2 T (i)
T(i): # distinct triangles with i as vertex
Clustering Coefficient
1
C = ā Ci
n iāV x
ā¢ Measure of transitivity
ā1
High CC ā āresilientā network ā kx ā
ā¢
T (x) = 3 = 10
ā¢ Counting triangles ā2 ā
1
( )
Ī(G) = ā N(i) = trace A 3 kx = 5 Cx = 0.3
iāV 6
18. ā”A k ā¤
ā£ ā¦ij number of paths of length k from i to j
3 All paths of length 3 starting
[A ]ii
and ending in i ā triangles
19. DEGREE
DISTRIBUTION
ā¢ Every ānode-wiseā property can be studies as an average, but it is
most interesting to study the whole distribution.
ā¢ One of the most interesting is the ādegree distributionā
px = # {i ki = x }
1
n
20. Many networks have
power-law degree distribution. pk ā k āĪ³
Ī³ >1
ā¢ Citation networks
ā¢ Biological networks k r
=?
ā¢ WWW graph
ā¢ Internet graph
ā¢ Social Networks
Power-Law: ! gamma=3
1000000
log-log plot
100000
10000
1000
100
10
1
0.1
1 10 100 1000
21. Generated with random generator
80-20 Law
Few nodes account for the vast majority of links
Most nodes have very few links
This points towards the idea that we have a core with a fringe of
nodes with few connections.
... and it it proved that implies super-short diameter
22. HIGH LEVEL
STRUCTURE
Isle Isolated community
Core
23. CONNECTED
COMPONENTS
Most features are computed on the core
Undirected: strongly connected = weakly connected
Directed/Undirected
Directed: strongly connected != weakly connected
The adjacency matrix is primitive
iff the network is connected
25. Facebook Hugs Degree Distribution
10000000
1000000
100000
10000
1000
100
10
1
1 10 100 1000
For small k power-laws do not hold
26. Facebook Hugs Degree Distribution
10000000
1000000
100000
10000
1000
For large k we have
100
statistical fluctuations
10
1
1 10 100 1000
For small k power-laws do not hold
27. Facebook Hugs Degree Distribution
10000000 Nodes: 1322631 Edges: 1555597
m/n: 1.17 CPL: 11.74
1000000
Clustering Coefficient: 0.0527
Number of Components: 18987
100000
Isles: 0
10000
Largest Component Size: 1169456
1000
For large k we have
100
statistical fluctuations
10
1
1 10 100 1000
For small k power-laws do not hold
28. Facebook Hugs Degree Distribution
10000000 Nodes: 1322631 Edges: 1555597
m/n: 1.17 CPL: 11.74
1000000
Clustering Coefficient: 0.0527
Number of Components: 18987
100000
Isles: 0
10000
Largest Component Size: 1169456
1000
For large k we have
100
statistical fluctuations
10
1
1 10 100 1000
For small k power-laws do not hold
Moreover, many distributions are wrongly identified as PLs
29. Online Social Networks
OSN Refs. Users Links <k> C CPL d Ī³ r
Club Nexus Adamic et al 2.5 K 10 K 8.2 0.2 4 13 n.a. n.a.
Cyworld Ahn et al 12 M 191 M 31.6 0.2 3.2 16 -0.13
Cyworld T Ahn et al 92 K 0.7 M 15.3 0.3 7.2 n.a. n.a. 0.43
LiveJournal Mislove et al 5 M 77 M 17 0.3 5.9 20 0.18
Flickr Mislove et al 1.8 M 22 M 12.2 0.3 5.7 27 0.20
Twitter Kwak et al 41 M 1700 M n.a. n.a. 4 4.1 n.a.
Orkut Mislove et al 3 M 223 M 106 0.2 4.3 9 1.5 0.07
Orkut Ahn et al 100 K 1.5 M 30.2 0.3 3.8 n.a. 3.7 0.31
Youtube Mislove et al 1.1 M 5 M 4.29 0.1 5.1 21 -0.03
Facebook Gjoka et al 1 M n.a. n.a. 0.2 n.a. n.a. 0.23
FB H Nazir et al 51 K 116 K n.a. 0.4 n.a. 29 n.a.
FB GL Nazir et al 277 K 600 K n.a. 0.3 n.a. 45 n.a.
BrightKite Scellato et al 54 K 213 K 7.88 0.2 4.7 n.a. n.a.
FourSquare Scellato et al 58 K 351 K 12 0.3 4.6 n.a. n.a.
LiveJournal Scellato et al 993 K 29.6 M 29.9 0.2 4.9 n.a. n.a.
Twitter Java et al 87 K 829 K 18.9 0.1 n.a. 6 0.59
Twitter Scellato et al 409 K 183 M 447 0.2 2.8 n.a. n.a.
32. p
Watts-Strogatz Model
In the modified model, we only add the edges.
3(Īŗ ā 2)
ki = Īŗ + si Cā
4(Īŗ ā 1) + 8Īŗ p + 4Īŗ p 2
Edges in
the lattice # added log(npĪŗ )
ļ¬ā
shortcuts Īŗ p
2
33. Strogatz-Watts Model - 10000 nodes k = 4
1
CPL(p)/CPL(0)
C(p)/C(0)
0.8
CPL(p)/CPL(0)
0.6
C(p)/C(0)
0.4
0.2
0
0 0.2 0.4 p 0.6 0.8 1
Short CPL
Large Clustering Coefficient
Threshold
Threshold
34. BarabƔsi-Albert Model Connectedness log n
Threshold log log n
BARABASI-ALBERT-MODEL(G,M0,STEPS)
FOR K FROM 1 TO STEPS
N0 ā NEW-NODE(G)
ADD-NODE(G,N0)
A ā MAKE-ARRAY()
FOR N IN NODES(G)
PUSH(A, N) pk ā x ā3
FOR J IN DEGREE(N)
PUSH(A, N) log n
FOR J FROM 1 TO M
ļ¬ā
log log n
N ā RANDOM-CHOICE(A)
ADD-LINK (N0, N) C ā n ā3/4
Scale-free entails
short CPL
Transitivity disappears
with network size No analytical proof available
35. ANALYSIS
ā¢ There are many network features we can study
ā¢ Letās discuss some algorithms for the ones we studied so-far
ā¢ Also consider the size of the networks ( > 1M nodes ), so algorithmic
costs can become an issue
39. Computational Complexity of ASPL:
All pairs shortest path matrix based (parallelizable): ( )
Ī n 3
All pairs shortest path Dijkstra w. Fibonacci Heaps: O ( n log n + nm )
2
Computing the CPL
q#S elements are ā¤ than x Āµ(a) = M 1 (a)
x = M q (S) and (1-q)#S are > than x
2
q#S(1-Ī“) elements are ā¤ than x M 1 (a)
x āLqĪ“ (S) 3
or (1-q)#S(1-Ī“) are > than x
Huber Method L 1 , 1 (a)
2 5
2 2 (1 ā Ī“ )
2
Let R a random sample of S such that #R=s, then
s = 2 ln
q ļ² Ī“2 Mq(R) ā LqĪ“(S) with probability p = 1-Īµ.
48. How about social networking applications?
proļ¬le posts
name, email, ... post
connections post
post
post
When a user opens his page, his contacts profiles are read to get their posts.
A user that has many friends (high degree) has his profile read more often.
The degree distribution becomes the profile accesses distribution.
Very good for caching!
And what about the clustering coefficient? Friends of friends tend to be friends...
49. X
Minimize: (G) = (loc(v) loc(u)) Auv
v,u2V 2
unfortunately NP complete ālocation of v profileā
We can however use community detection to improve locality.
We define a cluster to be a subgraph with some cohesion.
Different cluster definition exist, with different trade-offs.
But profiles in a cluster tend to be accessed together, so that
actually we can store the information ācloseā (disk-layout, sharding)
We can also study the correlations between geography
and clustering and hopefully use that info.
In general, for SNS knowing the network structure
gives insight on how to optimize stuff.
50. NETWORK
PROCESSES
ā¢ Study of processes that occur on real networks
ā¢ āNetwork destructionā: models malfunctions in the network
ā¢ āIdea/Diseaseā diffusion over networks
ā¢ Link prediction
57. Massachussets
Boston
Nebraska
Omaha
Wichita 6 Degrees
Kansas
Milgramās Experiment
ā¢ Random people from Omaha & Wichita were asked to send a postcard to a
person in Boston:
ā¢ Write the name on the postcard
ā¢ Forward the message only to people personally known that was more likely
to know the target
58. Massachussets 1st run: 64/296 arrived, most
Boston delivered to him by 2 men
Nebraska
2nd run: 24/160 arrived, 2/3
delivered by āMr. Jacobsā
Omaha
2 ā¤ hops ā¤ 10; Āµ=5.x
Wichita 6 Degrees
CPL, hubs, ...
Kansas ... and Kleinbergās Intuition
Milgramās Experiment
ā¢ Random people from Omaha & Wichita were asked to send a postcard to a
person in Boston:
ā¢ Write the name on the postcard
ā¢ Forward the message only to people personally known that was more likely
to know the target