
# Complex and Social Network Analysis in Python

An introductory-to-intermediate presentation on complex network analysis: network metrics, analysis of online social networks, approximate algorithms, memory issues, and storage.



1. COMPLEX & SOCIAL NETWORK ANALYSIS. Enrico Franchi (efranchi@ce.unipr.it)
2. https://www.dropbox.com/s/43f7c84iolxfvg2/csnap.pdf
3. A social network is a structure composed of actors and their relationships. Actor: person, organization, role... Relationship: friendship, knowledge... A social networking system is a system allowing users to: construct a profile which represents them in the system; create a list of users with whom they share a connection; navigate their list of connections and that of their friends (Boyd, 2008). So, what is a complex network? A complex network is a network with non-trivial topological features, features that do not occur in simple networks such as lattices or random graphs but often occur in real graphs (Wikipedia). Foggy.
4. COMPLEX NETWORKS. A complex network is a network with non-trivial topological features, features that do not occur in simple networks such as lattices or random graphs but often occur in real graphs. • Non-trivial topological features (what are topological features?) • Simple networks: lattices, regular or random graphs • Real graphs • Are social networks complex networks?
5. TOOLS • Matplotlib • IPython • NetworkX
6-7. BASIC NOTATION. Network: $G = (V, E)$, $E \subseteq V^2$, $\{(x, x) \mid x \in V\} \cap E = \emptyset$ (no self-loops). Adjacency matrix: $A_{ij} = 1$ if $(i, j) \in E$, $0$ otherwise. Directed network: $k_i^{in} = \sum_j A_{ji}$, $k_i^{out} = \sum_j A_{ij}$, $k_i = k_i^{in} + k_i^{out}$ (in the figure, $k_x^{in} = 4$, $k_x^{out} = 3$). Undirected network: $A$ symmetric, $k_i = \sum_j A_{ji} = \sum_j A_{ij}$. Assume the network is connected!
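The adjacency-matrix definitions above translate directly into code; a minimal sketch on a small made-up directed network (the matrix values are illustrative, not taken from the slides):

```python
# Adjacency matrix of a small hypothetical directed network:
# A[i][j] = 1 iff there is an edge from i to j (no self-loops).
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

def out_degree(A, i):
    """k_i^out = sum over j of A[i][j] (edges leaving i)."""
    return sum(A[i])

def in_degree(A, i):
    """k_i^in = sum over j of A[j][i] (edges entering i)."""
    return sum(row[i] for row in A)

for i in range(len(A)):
    print(i, in_degree(A, i), out_degree(A, i))
```

For an undirected network the matrix is symmetric and the two functions return the same value.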
8. Path: $p = v_0, \ldots, v_k$ with $(v_{i-1}, v_i) \in E$. Path length: $\mathrm{length}(p)$. Set of paths from i to j: $\mathrm{Paths}(i, j)$. Shortest path length: $L(i, j) = \min\{\mathrm{length}(p) \mid p \in \mathrm{Paths}(i, j)\}$. A shortest (geodesic) path is a path attaining this minimum.
9. Average geodesic distance: $\ell(i) = (n-1)^{-1} \sum_{j \in V \setminus \{i\}} L(i, j)$. Average shortest path length: $\ell = n^{-1} \sum_{i \in V} \ell(i)$. Characteristic path length: $\mathrm{CPL} = \mathrm{median}\{\ell(i) \mid i \in V\}$.
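These definitions can be checked with a breadth-first search; a sketch computing the per-node average distance and the CPL on a small hypothetical undirected graph (graph and helper names are mine, not from the slides):

```python
from collections import deque
from statistics import median

# Hypothetical undirected connected graph as an adjacency list.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

def bfs_lengths(graph, source):
    """Shortest path lengths L(source, j) for every reachable j."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def avg_geodesic(graph, i):
    """ell(i) = (n-1)^-1 * sum_j L(i, j), assuming a connected graph."""
    dist = bfs_lengths(graph, i)
    return sum(dist.values()) / (len(graph) - 1)

ells = [avg_geodesic(graph, i) for i in graph]
cpl = median(ells)   # CPL = median of the ell(i)
print(cpl)           # -> 1.75
```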
10-12. $[A^k]_{ij}$ = number of paths of length k from i to j. Note: this counts paths, not shortest paths; in particular $[A^2]_{ii}$ equals the degree of i.
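The path-counting identity can be verified with a naive matrix multiplication; a minimal sketch on a hypothetical 4-node graph (a triangle plus a pendant node):

```python
def matmul(X, Y):
    """Naive matrix product (enough for tiny examples)."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical undirected graph: triangle 0-1-2 plus pendant node 3 on 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

A2 = matmul(A, A)     # [A^2]_ij = number of paths of length 2 from i to j
A3 = matmul(A2, A)    # [A^3]_ij = number of paths of length 3 from i to j
print(A2[0][0])       # -> 2, the degree of node 0
print(A3[0][0])       # -> 2, the triangle through 0 walked in both directions
```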
13-16. CLUSTERING COEFFICIENT. Local clustering coefficient: $C_i = \binom{k_i}{2}^{-1} T(i)$, where T(i) is the number of distinct triangles with i as a vertex. Clustering coefficient: $C = \frac{1}{n} \sum_{i \in V} C_i$. • Measure of transitivity • High CC → "resilient" network • Counting triangles: $\Delta(G) = \frac{1}{6} \sum_{i \in V} T(i) = \frac{1}{6} \mathrm{trace}(A^3)$. Worked example from the figure: $k_x = 5$, so $\binom{k_x}{2} = 10$; with $T(x) = 3$, $C_x = 3/10 = 0.3$.
17. $[A^k]_{ij}$ = number of paths of length k from i to j. $[A^3]_{ii}$: all paths of length 3 starting and ending in i → triangles.
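Triangle counting and the local clustering coefficient are easy to sketch directly on an adjacency list; a toy example (graph and helper names are mine):

```python
from itertools import combinations

# Hypothetical undirected graph: a triangle 0-1-2 with a pendant node 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

def triangles_at(graph, i):
    """T(i): number of distinct triangles having i as a vertex."""
    return sum(1 for u, v in combinations(graph[i], 2) if v in graph[u])

def local_clustering(graph, i):
    """C_i = T(i) / binom(k_i, 2); taken as 0 when k_i < 2."""
    k = len(graph[i])
    if k < 2:
        return 0.0
    return triangles_at(graph, i) / (k * (k - 1) / 2)

# Network clustering coefficient: average of the local coefficients.
C = sum(local_clustering(graph, i) for i in graph) / len(graph)
print(C)
```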
18. DEGREE DISTRIBUTION • Every node-wise property can be studied as an average, but it is most interesting to study the whole distribution. • One of the most interesting is the degree distribution: $p_x = \frac{1}{n} \#\{i \mid k_i = x\}$.
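The empirical degree distribution is essentially a normalized histogram; a sketch with `collections.Counter` (the degree sequence is made up):

```python
from collections import Counter

# Hypothetical degree sequence of a small network.
degrees = [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]
n = len(degrees)

# p_x = (1/n) * #{i : k_i = x}
counts = Counter(degrees)
p = {x: c / n for x, c in sorted(counts.items())}
print(p)   # -> {1: 0.3, 2: 0.2, 3: 0.4, 5: 0.1}
```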
19. Many networks have a power-law degree distribution: $p_k \propto k^{-\gamma}$, $\gamma > 1$. • Citation networks • Biological networks • WWW graph • Internet graph • Social networks. [Figure: log-log plot of a power law with $\gamma = 3$; on log-log axes a power law appears as a straight line.]
20. Generated with a random generator. The 80-20 law: few nodes account for the vast majority of links; most nodes have very few links. This points towards the idea that we have a core with a fringe of nodes with few connections... and it is proved that this implies a super-short diameter.
21. HIGH-LEVEL STRUCTURE. [Figure: the network decomposes into a core, isolated communities, and isles.]
22. CONNECTED COMPONENTS. Most features are computed on the core. Undirected: strongly connected = weakly connected. Directed: strongly connected != weakly connected. The adjacency matrix is primitive iff the network is connected.
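Extracting connected components of an undirected graph is one BFS per unvisited node; a minimal sketch (the example graph is hypothetical):

```python
from collections import deque

# Two small components plus one isolated node.
graph = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}

def connected_components(graph):
    """Return the connected components of an undirected adjacency list."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        comp = {start}
        queue = deque([start])
        while queue:           # plain BFS restricted to this component
            v = queue.popleft()
            for w in graph[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        components.append(comp)
    return components

comps = connected_components(graph)
print(sorted(len(c) for c in comps))   # the largest component is the "core"
```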
23-27. Facebook Hugs Degree Distribution. [Figure: log-log plot of the degree distribution.] Nodes: 1322631; Edges: 1555597; m/n: 1.17; CPL: 11.74; Clustering coefficient: 0.0527; Number of components: 18987; Isles: 0; Largest component size: 1169456. For small k, power laws do not hold; for large k we have statistical fluctuations. Moreover, many distributions are wrongly identified as PLs.
28. Online Social Networks

| OSN | Refs. | Users | Links | ⟨k⟩ | C | CPL | d | γ | r |
|---|---|---|---|---|---|---|---|---|---|
| Club Nexus | Adamic et al. | 2.5 K | 10 K | 8.2 | 0.2 | 4 | 13 | n.a. | n.a. |
| Cyworld | Ahn et al. | 12 M | 191 M | 31.6 | 0.2 | 3.2 | 16 | n.a. | -0.13 |
| Cyworld T | Ahn et al. | 92 K | 0.7 M | 15.3 | 0.3 | 7.2 | n.a. | n.a. | 0.43 |
| LiveJournal | Mislove et al. | 5 M | 77 M | 17 | 0.3 | 5.9 | 20 | n.a. | 0.18 |
| Flickr | Mislove et al. | 1.8 M | 22 M | 12.2 | 0.3 | 5.7 | 27 | n.a. | 0.20 |
| Twitter | Kwak et al. | 41 M | 1700 M | n.a. | n.a. | 4 | 4.1 | n.a. | n.a. |
| Orkut | Mislove et al. | 3 M | 223 M | 106 | 0.2 | 4.3 | 9 | 1.5 | 0.07 |
| Orkut | Ahn et al. | 100 K | 1.5 M | 30.2 | 0.3 | 3.8 | n.a. | 3.7 | 0.31 |
| Youtube | Mislove et al. | 1.1 M | 5 M | 4.29 | 0.1 | 5.1 | 21 | n.a. | -0.03 |
| Facebook | Gjoka et al. | 1 M | n.a. | n.a. | 0.2 | n.a. | n.a. | n.a. | 0.23 |
| FB H | Nazir et al. | 51 K | 116 K | n.a. | 0.4 | n.a. | 29 | n.a. | n.a. |
| FB GL | Nazir et al. | 277 K | 600 K | n.a. | 0.3 | n.a. | 45 | n.a. | n.a. |
| BrightKite | Scellato et al. | 54 K | 213 K | 7.88 | 0.2 | 4.7 | n.a. | n.a. | n.a. |
| FourSquare | Scellato et al. | 58 K | 351 K | 12 | 0.3 | 4.6 | n.a. | n.a. | n.a. |
| LiveJournal | Scellato et al. | 993 K | 29.6 M | 29.9 | 0.2 | 4.9 | n.a. | n.a. | n.a. |
| Twitter | Java et al. | 87 K | 829 K | 18.9 | 0.1 | n.a. | 6 | n.a. | 0.59 |
| Twitter | Scellato et al. | 409 K | 183 M | 447 | 0.2 | 2.8 | n.a. | n.a. | n.a. |
29-30. Erdős–Rényi Random Graphs. Two ensembles: $G(n, p)$, where each possible edge appears independently with probability p, so $\Pr(A_{ij} = 1) = p$; and $G(n, m)$, uniform over graphs with n nodes and m edges. Connectedness threshold: $p \sim \log n / n$. These are ensembles of graphs: when we describe values of properties, we actually mean the expected value of the property, $\langle d \rangle = \sum_G \Pr(G)\, d(G)$ with $\Pr(G) = p^m (1-p)^{\binom{n}{2} - m}$. Expected values: $d \propto \log n / \log \langle k \rangle$; $C = \langle k \rangle (n-1)^{-1} = p$; $\langle m \rangle = \binom{n}{2} p$; $p_k = \binom{n-1}{k} p^k (1-p)^{n-1-k} \xrightarrow{n \to \infty} e^{-\langle k \rangle} \frac{\langle k \rangle^k}{k!}$.
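Sampling $G(n, p)$ and comparing the empirical mean degree with the expected $\langle k \rangle = p(n-1)$ takes only the standard library; a sketch (function name and parameters are mine):

```python
import random

def erdos_renyi(n, p, seed=42):
    """Sample G(n, p): each of the binom(n, 2) possible edges appears
    independently with probability p."""
    rng = random.Random(seed)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                graph[i].add(j)
                graph[j].add(i)
    return graph

n, p = 1000, 0.01
g = erdos_renyi(n, p)
mean_k = sum(len(neigh) for neigh in g.values()) / n
print(mean_k)   # close to the expected p * (n - 1) = 9.99
```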
31. Watts–Strogatz Model. In the modified model, we only add the edges (no rewiring): $k_i = \kappa + s_i$, where $\kappa$ counts the edges in the lattice and $s_i$ the added shortcuts. Then $C \to \frac{3(\kappa - 2)}{4(\kappa - 1) + 8\kappa p + 4\kappa p^2}$ and $\ell \approx \frac{\log(n p \kappa)}{\kappa^2 p}$.
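The "only add edges" variant described above is easy to sketch: build a ring lattice where each node links to its κ nearest neighbours, then make one shortcut attempt per lattice edge with probability p. This is a sketch under those assumptions; the function name and seeding are mine:

```python
import random

def watts_strogatz_add(n, kappa, p, seed=1):
    """Ring lattice (kappa even) plus random shortcuts added with
    probability p per lattice edge; no rewiring."""
    rng = random.Random(seed)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, kappa // 2 + 1):   # kappa nearest neighbours
            j = (i + d) % n
            graph[i].add(j)
            graph[j].add(i)
    lattice_edges = sum(len(s) for s in graph.values()) // 2
    for _ in range(lattice_edges):           # one shortcut attempt per edge
        if rng.random() < p:
            u, v = rng.sample(range(n), 2)
            graph[u].add(v)
            graph[v].add(u)
    return graph

g = watts_strogatz_add(100, 4, 0.1)
print(sum(len(s) for s in g.values()) // 2)   # a bit more than 200 edges
```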
32. Watts–Strogatz model, 10000 nodes, $\kappa = 4$. [Figure: CPL(p)/CPL(0) and C(p)/C(0) plotted against p. The short-CPL threshold sits at a much smaller p than the large-clustering threshold, so there is a wide range of p with both short CPL and large clustering coefficient.]
33. Barabási–Albert Model. Connectedness threshold: $\log n / \log \log n$.

    BARABASI-ALBERT-MODEL(G, M, STEPS)
      FOR K FROM 1 TO STEPS
        N0 ← NEW-NODE(G)
        ADD-NODE(G, N0)
        A ← MAKE-ARRAY()
        FOR N IN NODES(G)
          PUSH(A, N)
          FOR J IN DEGREE(N)
            PUSH(A, N)
        FOR J FROM 1 TO M
          N ← RANDOM-CHOICE(A)
          ADD-LINK(N0, N)

$p_k \propto k^{-3}$; $\ell \approx \log n / \log \log n$; $C \approx n^{-3/4}$. Scale-free entails short CPL. Transitivity disappears with network size. No analytical proof available.
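The pseudocode above maps almost line for line onto Python; a sketch assuming each new node attaches m preferential links (the array-with-repetitions trick makes a uniform draw roughly proportional to degree; the initial-clique choice is mine):

```python
import random

def barabasi_albert(m0, steps, m, seed=7):
    """Grow a BA-style network: start from a clique on m0 nodes, then at
    each step add one node with m preferential-attachment links."""
    rng = random.Random(seed)
    graph = {i: set() for i in range(m0)}
    for i in range(m0):                      # small initial clique
        for j in range(i + 1, m0):
            graph[i].add(j)
            graph[j].add(i)
    for step in range(steps):
        new = m0 + step
        # Each node appears once plus once per degree unit, so a uniform
        # choice from this pool is biased towards high-degree nodes.
        pool = [n for n in graph for _ in range(1 + len(graph[n]))]
        graph[new] = set()
        while len(graph[new]) < m:           # m distinct targets
            target = rng.choice(pool)
            graph[new].add(target)
            graph[target].add(new)
    return graph

g = barabasi_albert(m0=3, steps=200, m=2)
print(max(len(s) for s in g.values()))   # hubs emerge
```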
34. ANALYSIS • There are many network features we can study. • Let's discuss some algorithms for the ones we have studied so far. • Also consider the size of the networks (> 1 M nodes), so algorithmic costs can become an issue.
35-37. Dijkstra's Algorithm (single-source shortest path)

```python
from heapq import heappush, heappop

# based on Python Cookbook recipe 119466
def dijkstra_shortest_path(graph, source):
    distances = {}
    predecessors = {}
    seen = {source: 0}
    priority_queue = [(0, source)]
    while priority_queue:
        v_dist, v = heappop(priority_queue)
        if v in distances:
            continue            # stale heap entry: v already finalized
        distances[v] = v_dist
        for w in graph[v]:
            vw_dist = distances[v] + 1   # unit edge weights
            if w not in seen or vw_dist < seen[w]:
                seen[w] = vw_dist
                heappush(priority_queue, (vw_dist, w))
                predecessors[w] = v
    return distances, predecessors
```

Cost: O(m · push + n · extract-min) = O(m log n + n log n).
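Since the graphs here are unweighted (every edge contributes 1), a plain breadth-first search computes the same distances in O(n + m), with no priority queue at all; a minimal self-contained sketch:

```python
from collections import deque

def bfs_shortest_path(graph, source):
    """Single-source shortest paths on an unweighted graph."""
    distances = {source: 0}
    predecessors = {}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in distances:       # first visit is the shortest
                distances[w] = distances[v] + 1
                predecessors[w] = v
                queue.append(w)
    return distances, predecessors

graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
dist, pred = bfs_shortest_path(graph, 0)
print(dist)   # -> {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```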
38-39. Computational complexity of the ASPL. All-pairs shortest paths, matrix-based (parallelizable): $\Theta(n^3)$. All-pairs shortest paths, Dijkstra with Fibonacci heaps: $O(n^2 \log n + nm)$. Computing the CPL: the CPL is the median, $M_{1/2}$. Definitions: (1) $x = M_q(S)$ iff $q \cdot \#S$ elements are $\leq x$ and $(1-q) \cdot \#S$ are $> x$; (2) $x \in L_q^\delta(S)$ iff $q \cdot \#S (1-\delta)$ elements are $\leq x$ or $(1-q) \cdot \#S (1-\delta)$ are $> x$. Huber method (e.g. $L_{1/2}^{1/5}$): let R be a random sample of S such that $\#R = s$, with $s = \frac{2}{q^2 (1-\delta)^2 \delta^2} \ln \frac{2}{\varepsilon}$; then $M_q(R) \in L_q^\delta(S)$ with probability $p = 1 - \varepsilon$.
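The sampling bound above says a quantile of a random sample is, with high probability, an approximate quantile of the whole set. A toy sketch of the idea (the "population" of path lengths is synthetic, and `sample_size` is a stand-in for the s given by the bound):

```python
import random
from statistics import median

rng = random.Random(0)

# Synthetic population of shortest-path lengths (stand-in for S).
population = [rng.randint(1, 20) for _ in range(100000)]

sample_size = 2000          # stand-in for the s from the Huber bound
sample = rng.sample(population, sample_size)

# The sample median approximates the population median (the CPL).
print(median(population), median(sample))
```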
40. HOW ABOUT THE MEMORY? • Different representations • Different trade-offs (space/time) • How easily metadata can be manipulated • Disk vs. RAM
41. POPULAR REPRESENTATIONS• Adjacency List• Incidence List• Adjacency Matrix (using sparse matrices)• Incidence Matrix (using sparse matrices)
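In Python the adjacency list is commonly a dict of sets, and a sparse adjacency matrix can be held as a dict keyed by index pairs (the dictionary-of-keys layout); a minimal sketch contrasting the two on toy data:

```python
# Adjacency list: O(n + m) space, fast neighbour iteration.
adj_list = {0: {1, 2}, 1: {0}, 2: {0}}

# Sparse adjacency matrix as a dict of index pairs (DOK layout):
# only nonzero entries are stored, and edge lookup is a dict lookup.
adj_matrix = {}
for i, neighbours in adj_list.items():
    for j in neighbours:
        adj_matrix[(i, j)] = 1

print((0, 1) in adj_matrix, (1, 2) in adj_matrix)   # -> True False
```

The trade-off: the list representation makes "iterate over neighbours" cheap, while the matrix representation makes "is (i, j) an edge?" a single lookup.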