This document provides an overview of network sampling techniques, community detection algorithms, and network models. It discusses random sampling, stratified sampling, and cluster sampling for network data. For community detection, it describes algorithms based on cliques, k-cliques, k-clubs, quasi-cliques, modularity maximization, and spectral clustering. It also introduces models for small-world networks including Watts-Strogatz model and discusses how real-world networks exhibit small-world properties.
11. The social network of friendships
within a 34-person karate club
When people are inuenced by the behaviors
their neighbors in the network (a kind of
informational or social contagion.)
The spread of an epidemic disease
(such as the tuberculosis)
14. Eric D. Kolaczyk
• Statistical Analysis of Network Data
• Lecture 1 { Network Mapping &
Characterization}
• Dept of Mathematics and Statistics, Boston
University
• kolaczyk@bu.edu
17. O quão pequeno é o mundo?
• Quantos contatos separam uma pessoa de outra, em
qualquer lugar do mundo?
• 1990 – John Guare e a peça “Six Degrees of
Separation”.
• Karinthy acreditava que o mundo estava “encolhendo”.
• 1960 – The Small World Experiment por Stanley
Milgram e Jeffrey Travers.
18. Modelo de Crescimento Exponencial
• Você : 1
• Seus amigos : 100
• Os amigos dos seus amigos : 10000
19. Modelo de Crescimento Exponencial
• Em apenas 5 passos, 1005 pessoas seriam alcançadas, o que
é maior que a população mundial.
• Esse modelo despreza a probabilidade de cada pessoa ter
amigos em comum.
• Não serve para representar os relacionamentos sociais no
mundo real.
• Mundo real – muitos triângulos.
• Como é possível então, que as redes sociais possuem
muitos triângulos em escopo local e mesmo assim
constituírem um mundo pequeno?
20. Modelo de Watts-Strogatz
• Criado por Duncan J. Watts e Steven
Strogatz, em 1998.
• Introduz o conceito de coeficiente de
clusterização.
• Consegue simular tanto redes aleatórias
como as redes sociais do mundo real.
22. Construção do grafo
• O modelo de Watts-Strogatz usa os seguintes parâmetros:
• N número de nós;
• K grau de cada nó;
• (0 <= P <= 1) -> coeficiente que modela o grafo, quanto maior
P, maior a desordem;
• C coeficiente de clusterização;
• L distância média entre dois nós da rede.
24. Estudo do grafo
p = 0 (ordenado) p = 1 (aleatório)
• L (n / k) L (ln n /ln k)
• C 3 / 4 C (k / n) 0
• O que acontece com valores de P intermediários?
• P pequeno - diminui consideravelmente L, mantém C praticamente constante.
• P pequeno caracteriza redes de mundo pequeno.
• Condições necessárias: alguma fonte de ordenação; um pouco de
aleatoriedade.
25. Por que o mundo real é pequeno?
• Eu conheço o Governador! - E eu a Xuxa, agora
pergunta se ele conhece a minha pessoa!
• Ordenação redes de amigos locais altamente
clusterizadas.
• Aleatoriedade algum contato que foge da rede local.
Número de Bacon
http://oracleofbacon.org/
Facebook
• Experiência com 5,8
milhões de usuários.
• Distância média: 5,73.
• Distância máxima: 12.
Twitter
• Estudo da relação de um
usuário seguir outro.
• 5,2 bilhões de relações.
• Distância média: 4,67.
26. Class Size Paradox
• Simple example: each student takes one
course; suppose there is one course with 100
students, fifty courses with 2 students.
– Dean calculates: (100+50*2)/51 = 3.92
– Students calculate: (100*100+100*2)/200 = 51
• Class Size Paradox in Networks
– Average number of friends of a person’s friends is
greater than average number of friends of a
person!!
28. Community Detection Algorithms
Metric called Modularity.
Community: It is formed by individuals
such that those within a group interact
with each other more frequently than with
those outside the group
a.k.a. group, cluster, cohesive subgroup,
module in different contexts
Community detection: discovering groups
in a network where individuals’ group
memberships are not explicitly given
29. Complete Mutuality: Cliques
• Clique: a maximum complete subgraph in which all nodes
are adjacent to each other
• NP-hard to find the maximum clique in a network
• Straightforward implementation to find cliques is very
expensive in time complexity
Nodes 5, 6, 7 and 8 form a clique
29
30. Finding the Maximum Clique
• In a clique of size k, each node maintains degree >= k-1
– Nodes with degree < k-1 will not be included in the maximum clique
• Recursively apply the following pruning procedure
– Sample a sub-network from the given network, and find a clique in the
sub-network, say, by a greedy approach
– Suppose the clique above is size k, in order to find out a larger clique,
all nodes with degree <= k-1 should be removed.
• Repeat until the network is small enough
• Many nodes will be pruned as social media networks follow a
power law distribution for node degrees
30
31. Maximum Clique Example
• Suppose we sample a sub-network with nodes {1-9} and find a
clique {1, 2, 3} of size 3
• In order to find a clique >3, remove all nodes with degree <=3-
1=2
– Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
31
32. Clique Percolation Method (CPM)
• Clique is a very strict definition, unstable
• Normally use cliques as a core or a seed to find larger
communities
• CPM is such a method to find overlapping communities
– Input
• A parameter k, and a network
– Procedure
• Find out all cliques of size k in a given network
• Construct a clique graph. Two cliques are adjacent if they share k-1
nodes
• Each connected component in the clique graph forms a
community
32
34. Reachability : k-clique, k-club
• Any node in a group should be reachable in k hops
• k-clique: a maximal subgraph in which the largest geodesic
distance between any two nodes <= k
• k-club: a substructure of diameter <= k
• A k-clique might have diameter larger than k in the subgraph
– E.g. {1, 2, 3, 4, 5}
• Commonly used in traditional SNA
• Often involves combinatorial optimization
Cliques: {1, 2, 3}
2-cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}
2-clubs: {1,2,3,4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}
34
35. Group-Centric Community Detection:
Density-Based Groups
• The group-centric criterion requires the whole group to satisfy
a certain condition
– E.g., the group density >= a given threshold
• A subgraph is a quasi-clique if
where the denominator is the maximum number of degrees.
• A similar strategy to that of cliques can be used
– Sample a subgraph, and find a maximal quasi-clique
(say, of size )
– Remove nodes with degree less than the average degree
35
,
<
36. Network-Centric Community Detection
• Network-centric criterion needs to consider the
connections within a network globally
• Goal: partition nodes of a network into disjoint sets
• Approaches:
– (1) Clustering based on vertex similarity
– (2) Latent space models (multi-dimensional scaling )
– (3) Block model approximation
– (4) Spectral clustering
– (5) Modularity maximization
36
37. Clustering based on Vertex Similarity
• Apply k-means or similarity-based clustering to nodes
• Vertex similarity is defined in terms of the similarity of their
neighborhood
• Structural equivalence: two nodes are structurally equivalent
iff they are connecting to the same set of actors
• Structural equivalence is too strict for practical use.
Nodes 1 and 3 are
structurally equivalent;
So are nodes 5 and 6.
37
(1) Clustering based on vertex similarity
39. Cut
• Most interactions are within group whereas interactions
between groups are few
• community detection minimum cut problem
• Cut: A partition of vertices of a graph into two disjoint sets
• Minimum cut problem: find a graph partition such that the
number of edges between the two sets is minimized
42
(4) Spectral clustering
40. Ratio Cut & Normalized Cut
• Minimum cut often returns an imbalanced partition, with one
set being a singleton, e.g. node 9
• Change the objective function to consider community size
Ci,:a community
|Ci|: number of nodes in Ci
vol(Ci): sum of degrees in Ci
43
(4) Spectral clustering
41. Ratio Cut & Normalized Cut Example
For partition in red:
For partition in green:
Both ratio cut and normalized cut prefer a balanced partition
44
(4) Spectral clustering
43. Bickel-Chen on Community Detection
Inconsistency result for Newman-Girvan.
http://www.stat.berkeley.edu/~bickel/Bickel%20Chen%2021068.full.pdf
44. Agglomerative Hierarchical Clustering
• Initialize each node as a community
• Merge communities successively into larger
communities following a certain criterion
– E.g., based on modularity increase
48
Dendrogram according to Agglomerative Clustering based on Modularity
45. Sampling
• Types of Sampling
– Simple Random Sampling
– Stratified Sampling
– Cluster Sampling
• Sample Size
• Incorporating Sample Design
• Design vs. Model-Based Sampling
– Statisticians tend to favor the design-based point of
view because it makes no assumptions about the
mechanism that generates the data in the survey, as
we explain below
46. Respondent Driven Sampling (RDS)
• sampling scheme for hard-to-reach populations, based
on link-tracing across a social network with coupon
incentives
• becoming extremely-widely used all over the world;
hundreds of studies done or ongoing, e.g., CDC
National HIV Behavioral Surveillance (NHBS) studies of
injection drug users
• RDS as sampling vs. RDS estimation
47.
48. To Model or Not To Model;
Design-based vs. model-based
• Model the underlying network?
• What about unknown nodes?
• The recruitment process?
• Coupon refusal?
• The outcome variables (such as HIV status)?
49. Referências
• Modelo de Mundo Pequeno - Arthur Freitas Ramos & Hugo Neiva de Melo (000 - mundo-pequeno.pptx)
• Collective dynamics of ‘small-world’ networks Duncan J. Watts* & Steven H. Strogatz (001 - w_s_NATURE_0.pdf)
• Statistical Analysis of Network Data Lecture 1 { Network Mapping & Characterization Eric D. Kolaczyk (003 -
kolaczyk_yes13_1.pdf)
• Tutorial: Statistical Analysis of Network Data Eric D. Kolaczyk (004 - Kolaczyk-CN.pdf)
• The Watts–Strogatz network model developed by including degree distribution: theory and computer simulation Y W
Chen1,2, L F Zhang1 and J P Huang1 (005 - JPA-1.pdf)
• Where Online Friends Meet: Social Communities in Location-based Networks - Chlo¨e Brown Vincenzo Nicosia Salvatore
Scellato, Anastasios Noulas Cecilia Mascolo (006 - icwsm12_brown.pdf)
• Mining of Massive Datasets Anand Rajaraman Jure Leskovec (001 – book.pdf)
• Machine Learning for Hackers - Drew Conway and John Myles White (002 - Machine_Learning_for_Hackers.pdf)
• pandas: powerful Python data analysis toolkit Release 0.13.1 Wes McKinney & PyData Development Team(003 –
pandas.pdf)
• Provenance Data in Social Media Geoffrey , Barbier Zhuo Feng , Pritam Gundecha, Huan Liu - SYNTHESIS LECTURES ON
DATA MINING AND KNOWLEDGE DISCOVERY #7 &MC Morgan &cLaypool publishers(004 - Provenance-Data-in-Social-
Media(2).pdf)
• Python for Data Analysis Wes McKinney (005 - Python_for_Data_Analysis.pdf)
• Dominique Haughton l Jonathan Haughton - Living Standards Analytics (006 - bok%3A978-1-4614-0385-2.pdf)
• A nonparametric view of network models and Newman–Girvan and other modularities Peter J. Bickela,1 and Aiyou
Chenb (007 - Bickel%20Chen%2021068.full 2.pdf)
• Collective dynamics of ‘small-world’ networks Duncan J. Watts* & Steven H. Strogatz (008 - 393440a0.pdf)
• Communities in Networks - Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha(009 - 0902.3788v2 a.pdf)
• Managing and Mining Graph Data. Haixun Wang, Charu C. Aggarwal. Beijing, China. Springer. (007 - bok%3A978-1-4419-
6045-0.pdf)
• Social Network. Data Analytics. Charu C. Aggarwal. Springer. (008 - bok%3A978-1-4419-8462-3.pdf)
• RDS Analysis Tools 7.2 RDSAT 7.1 User Manual. Springer. (009 - RDSAT_7.1-Manual_2012-11-25.pdf)
• Eric D. Kolaczyk. Statistical Analysis of Network Data Methods and Models. Springer. (010 - bok_978-0-387-88146-1)
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World. David Easley & Jon Kleinberg (011 -
networks-book.pdf)
Editor's Notes
A typical goal is to partition a network into disjoint sets. But that can be extended to soft clustering as well
widetilde{P} is introduced for notational convenience. It can be constructed based on the proximity matrix P.