1. Model-based Overlapping Seed ExpanSion
(MOSES)
Aaron McDaid and Neil Hurley. This research was supported by
Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.
Clique: Graph & Network Analysis Cluster
School of Computer Science & Informatics
University College Dublin, Ireland
4. Facebook
Traud et al. Community Structure In Online Collegiate Social
Networks
M. Salter-Townshend and T.B. Murphy. Variational Bayesian
Inference for the Latent Position Cluster Model
Marlow et al. Maintained relationships on Facebook
August 7, 2010 4
5. Communities
Some nodes assigned to multiple communities.
Most edges assigned to just one community.
Multiple researchers have found that Facebook members typically
belong to 6 or 7 communities.
6. Communities
A partition will break some of the communities in that simple
example.
Graclus breaks synthetic communities with low levels of
overlap. (A. Lancichinetti and S. Fortunato, Benchmarks for
testing community detection algorithms on directed and
weighted graphs with overlapping communities.)
Graclus breaks communities found by MOSES in Facebook
networks. (Traud et al, Community Structure in Online
Collegiate Social Networks)
Modularity has known problems, but we need to go further
and move on from partitioning.
7. Facebook
Traud et al’s five university networks.
Average of 7 communities per node.
8. Community finding
A general-purpose community finding algorithm must allow:
Each node to be assigned to any number of communities.
Pervasive overlap. Ahn et al. Link communities reveal
multiscale complexity in networks. (Nature).
The intersection (number of shared nodes) between a pair of
communities can vary. It can be small, even when the number
of communities-per-node is high.
9. MOSES
MOSES deals only with undirected, unweighted networks:
no attributes or weights are associated with nodes or edges.
10. The MOSES model
Model that:
Every pair of nodes has some probability of having an edge.
Edges are independent across pairs, given the communities, but
the probability is higher for pairs that share one or more communities.
(This is an OSBM, an overlapping stochastic block model -
Latouche et al., Annals of Applied Statistics,
http://www.imstat.org/aoas/next_issue.html.)
11. MOSES model
Ignoring the observed edges
for now. Just consider the
nodes and a (proposed) set of
communities
12. MOSES model
These communities create
probabilities for the edges.
P(v1 ∼ v2) = pout where the
two vertices do NOT share a
community.
P(v1 ∼ v2) = 1−(1−pout)(1−
pin) where the two vertices do
share 1 community.
13. MOSES model
These communities create probabilities for the edges; equivalently,
in terms of the probability of NO edge, writing qout = 1 − pout and
qin = 1 − pin:
P(v1 ≁ v2) = qout where the two vertices do NOT share a community.
P(v1 ≁ v2) = qout · qin where the two vertices do share 1 community.
P(v1 ≁ v2) = qout · qin^s(v1,v2) in general, where s(v1, v2) is the
number of communities shared by v1 and v2.
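The edge probability implied by these formulas can be sketched directly (pin, pout and s as above; an illustration of the model, not MOSES code):

```python
def edge_prob(s, p_in, p_out):
    """P(v1 ~ v2) for two vertices sharing s communities: the pair is
    connected unless the background process (prob. p_out) and each of
    the s shared communities (prob. p_in each) all fail to link them."""
    return 1.0 - (1.0 - p_out) * (1.0 - p_in) ** s

# s = 0 recovers p_out; each extra shared community raises the probability.
for s in range(3):
    print(s, edge_prob(s, p_in=0.4, p_out=0.005))
```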
15. MOSES model
We now have a model that, for a given set of communities,
assigns probabilities for edges.
P(g|z, pin, pout)
g is the observed graph of nodes and edges. z is the proposed
set of communities.
How do we match that with the observed edges to get a good
estimate of the set of communities?
Naive approach: find (z, pin, pout) that maximizes
P(g|z, pin, pout).
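Because edges are independent given the communities, the likelihood factorises over vertex pairs, so P(g|z, pin, pout) can be sketched as a naive O(N²) loop (an illustration; the real MOSES objective is computed incrementally):

```python
from itertools import combinations
from math import log

def log_likelihood(nodes, edges, communities, p_in, p_out):
    """log P(g | z, p_in, p_out) for an undirected, unweighted graph.

    `edges` is a set of frozensets {u, v}; `communities` is a list of
    node sets (the proposed z). Each pair contributes log P(edge) or
    log P(no edge), depending on whether the edge was observed."""
    ll = 0.0
    for u, v in combinations(nodes, 2):
        s = sum(1 for c in communities if u in c and v in c)
        p_edge = 1.0 - (1.0 - p_out) * (1.0 - p_in) ** s
        ll += log(p_edge) if frozenset((u, v)) in edges else log(1.0 - p_edge)
    return ll

# Toy graph: a triangle {0, 1, 2} plus a loosely attached node 3.
nodes = [0, 1, 2, 3]
edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2)]}
z = [{0, 1, 2}]  # propose one community around the triangle
print(log_likelihood(nodes, edges, z, p_in=0.9, p_out=0.01))
```

Proposing the triangle as a community gives a much higher likelihood than proposing no communities at all.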
16. MOSES model
P(g|z, pin, pout) is maximized when pin = 1, pout = 0, and
when z is defined as exactly one community around each edge.
i.e. we don’t want to maximize P(g|z, pin, pout).
19. MOSES model
Apply Bayes’ Theorem:
P(z, pin, pout|g) ∝ P(g|z, pin, pout) P(z) P(pin, pout)
P(z) ∝ k! ∏_{i=1}^{k} [ 1/(N+1) ] · (1/N)^{n_i}
where k is the number of communities, N is the number of nodes,
and n_i is the number of nodes in community i.
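This prior can be transcribed directly (under our reading of the slide: each of the k communities picks a size n_i uniformly from {0..N}, then its members with probability (1/N)^{n_i}, with a k! ordering factor; the function name is illustrative):

```python
from math import lgamma, log

def log_prior(community_sizes, N):
    """log P(z): k! times, per community, a 1/(N+1) size factor
    and a (1/N)**n_i membership factor."""
    k = len(community_sizes)
    lp = lgamma(k + 1)  # log k!
    for n_i in community_sizes:
        lp += -log(N + 1) - n_i * log(N)
    return lp

# Larger or more numerous communities are penalized, which is what
# stops the degenerate one-community-per-edge solution.
print(log_prior([20, 20], 2000))
```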
20. MOSES model
We can correctly integrate out the number of communities, k,
and search across the resulting varying-dimensional space.
No need for model selection, e.g. BIC.
21. MOSES Algorithm
For the MOSES algorithm, we chose to look at the joint
distribution over (z, pin, pout) and aim to maximize it.
The algorithm is a heuristic approximation, and we do
not claim that it finds the maximum.
23. MOSES Algorithm
Choose an edge at random to form a seed, and expand.
Accept expanded seeds that contribute positively to the
objective; reject the others.
Update pin, pout based on the graph and the current set of
communities.
Delete communities that don’t make a positive contribution to
the objective.
Final fine-tuning that moves nodes one at a time.
It is neither a Markov chain nor an EM algorithm, so we can
offer no such convergence guarantees.
The algorithm reaches a local maximum of the objective, and may
well have strong biases.
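The phases above can be sketched structurally; `expand`, `objective`, and `estimate_rates` are assumed helper callables standing in for the real seed-expansion, objective, and rate-estimation code (an illustration of the control flow, not the published implementation):

```python
import random

def moses_sketch(graph, n_seeds, objective, expand, estimate_rates):
    """Structural sketch of the MOSES phases (not the real code).

    `graph` maps each node to its neighbour set; `objective` scores a
    list of communities; `expand` grows a seed edge into a candidate
    community; `estimate_rates` re-fits (p_in, p_out)."""
    edges = [(u, v) for u in graph for v in graph[u] if u < v]
    communities, p_in, p_out = [], 0.5, 0.01
    for _ in range(n_seeds):
        seed = random.choice(edges)          # random edge as a seed
        candidate = expand(seed, graph, p_in, p_out)
        # Accept the expanded seed only if it improves the objective.
        if objective(communities + [candidate]) > objective(communities):
            communities.append(candidate)
        p_in, p_out = estimate_rates(graph, communities)
    # Delete communities that no longer make a positive contribution.
    for c in list(communities):
        rest = [d for d in communities if d is not c]
        if objective(rest) >= objective(communities):
            communities = rest
    return communities  # a node-by-node fine-tuning pass would follow
```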
24. Evaluation
Synthetic benchmarks
Networks created randomly by software.
Ground truth communities are built into these networks.
Check if the algorithms can discover the correct communities
when fed the network.
To measure the similarity between the found communities and
the ground truth communities, overlapping NMI is used.
(Lancichinetti et al. Detecting the overlapping and
hierarchical community structure in complex networks)
25. Evaluation
2000 nodes
Define hundreds of communities.
Each community contains 20 nodes chosen at random from
the 2000 nodes.
Some nodes may be assigned to many communities. Some
may not be assigned to a community.
pin = 0.4. About 40% of the pairs of nodes that share a
community are then joined.
pout = 0.005. Finally, a small amount of background noise is
added.
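The generator described above is easy to reproduce (a sketch following the slide's parameters; `n_comms=200` is an assumed value for "hundreds", and this is not the original benchmark code):

```python
import random

def synthetic_network(n_nodes=2000, n_comms=200, comm_size=20,
                      p_in=0.4, p_out=0.005, seed=42):
    """Generate a benchmark graph with planted overlapping communities."""
    rng = random.Random(seed)
    # Each community is comm_size nodes chosen at random; nodes may land
    # in many communities, or in none.
    communities = [set(rng.sample(range(n_nodes), comm_size))
                   for _ in range(n_comms)]
    edges = set()
    # Intra-community edges: each pair sharing a community is joined
    # with probability p_in, independently per shared community.
    for c in communities:
        members = sorted(c)
        for i, u in enumerate(members):
            for v in members[i + 1:]:
                if rng.random() < p_in:
                    edges.add((u, v))
    # Background noise: every pair is joined with probability p_out.
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            if rng.random() < p_out:
                edges.add((u, v))
    return communities, edges

comms, edges = synthetic_network()
print(len(comms), len(edges))
```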
38. Scalability
In general (in my interpretation), community finding means
overlapping community finding.
Partitioning breaks communities.
So, partitioning is scalable, but partitioning doesn’t help with
community finding.
Challenge: a very scalable algorithm that can credibly claim to
be a community-finding algorithm.
43. Other/future research
Markov Chain Monte Carlo
Working with Prof. Brendan Murphy on an MCMC method.
Very different algorithm, which allows us to investigate the
model directly.
The MOSES algorithm may have many biases that we'll never
fully grasp.
Different model (still an OSBM) where each community has
its own internal-connection probability.
MOSES breaks down on synthetic data if the communities are
not equally dense (pin).
Draw from this distribution: P(z, pout, p1, p2, p3, ...|g)
Multiple MCMC chains, where chains propose splits/merges to
each other.
(Modern) statisticians are innovative about scalability, e.g.
Hybrid Monte Carlo.
46. Take home messages
Community finding should be about discovering structure, not
forcing it: overlap, hierarchy, et cetera.
MOSES is a proof-of-concept: We show that quality results,
overlapping communities, and scalability, are not incompatible.
Very scalable community finding algorithms don't exist yet. This
is an interesting challenge.
47. Acknowledgments
This research was supported by Science Foundation Ireland (SFI)
Grant No. 08/SRC/I1407.
http://clique.ucd.ie/software
http://www.aaronmcdaid.com
aaronmcdaid@gmail.com, neil.hurley@ucd.ie