1. Model-based Overlapping Seed ExpanSion
(MOSES)
Aaron McDaid and Neil Hurley. This research was supported by
Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.
Clique: Graph & Network Analysis Cluster
School of Computer Science & Informatics
University College Dublin, Ireland
4. Facebook
Traud et al. Community Structure In Online Collegiate Social
Networks
M. Salter-Townshend and T.B. Murphy. Variational Bayesian
Inference for the Latent Position Cluster Model
Marlow et al. Maintained relationships on Facebook
August 7, 2010 4
5. Communities
Some nodes assigned to multiple communities.
Most edges assigned to just one community.
Multiple researchers have found that Facebook members typically
belong to 6 or 7 communities.
6. Communities
A partition will break some of the communities in that simple
example.
Graclus breaks synthetic communities with low levels of
overlap. (A. Lancichinetti and S. Fortunato, Benchmarks for
testing community detection algorithms on directed and
weighted graphs with overlapping communities.)
Graclus breaks communities found by MOSES in Facebook
networks. (Traud et al, Community Structure in Online
Collegiate Social Networks)
Modularity has known problems, but we need to go further
and move on from partitioning.
7. Facebook
Traud et al’s five university networks.
Average of 7 communities per node.
8. Community finding
A general-purpose community finding algorithm must allow:
Each node to be assigned to any number of communities.
Pervasive overlap. Ahn et al. Link communities reveal
multiscale complexity in networks. (Nature).
The intersection (number of shared nodes) between a pair of
communities can vary. It can be small, even when the number
of communities-per-node is high.
9. MOSES
MOSES deals only with undirected, unweighted networks:
no attributes or weights are associated with nodes or edges.
10. The MOSES model
Model that:
Every pair of nodes has some probability of having an edge.
Edges are independent across pairs, given the communities, but
the probability is higher for pairs that share one or more communities.
(This is an OSBM, an overlapping stochastic block model -
Latouche et al., Annals of Applied Statistics,
http://www.imstat.org/aoas/next_issue.html.)
11. MOSES model
Ignoring the observed edges
for now. Just consider the
nodes and a (proposed) set of
communities
12. MOSES model
These communities create
probabilities for the edges.
P(v1 ∼ v2) = pout where the
two vertices do NOT share a
community.
P(v1 ∼ v2) = 1−(1−pout)(1−
pin) where the two vertices do
share 1 community.
13. MOSES model
These communities create probabilities for the edges; equivalently,
in terms of the probability of NO edge, writing qout = 1 − pout and
qin = 1 − pin:
P(v1 ≁ v2) = qout where the two vertices do NOT share a community.
P(v1 ≁ v2) = qout · qin where the two vertices do share 1 community.
P(v1 ≁ v2) = qout · qin^s(v1,v2) in general, where s(v1, v2) is the
number of communities shared by v1 and v2.
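The edge probability implied by these formulas can be sketched directly (pin, pout and s as above; an illustration of the model, not MOSES code):

```python
def edge_prob(s, p_in, p_out):
    """P(v1 ~ v2) for two vertices sharing s communities: the pair is
    connected unless the background process (prob. p_out) and each of
    the s shared communities (prob. p_in each) all fail to link them."""
    return 1.0 - (1.0 - p_out) * (1.0 - p_in) ** s

# s = 0 recovers p_out; each extra shared community raises the probability.
for s in range(3):
    print(s, edge_prob(s, p_in=0.4, p_out=0.005))
```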
15. MOSES model
We now have a model that, for a given set of communities,
assigns probabilities for edges.
P(g|z, pin, pout)
g is the observed graph of nodes and edges. z is the proposed
set of communities.
How do we match that with the observed edges to get a good
estimate of the set of communities?
Naive approach: find (z, pin, pout) that maximizes
P(g|z, pin, pout).
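Because edges are independent given the communities, the likelihood factorises over vertex pairs, so P(g|z, pin, pout) can be sketched as a naive O(N²) loop (an illustration; the real MOSES objective is computed incrementally):

```python
from itertools import combinations
from math import log

def log_likelihood(nodes, edges, communities, p_in, p_out):
    """log P(g | z, p_in, p_out) for an undirected, unweighted graph.

    `edges` is a set of frozensets {u, v}; `communities` is a list of
    node sets (the proposed z). Each pair contributes log P(edge) or
    log P(no edge), depending on whether the edge was observed."""
    ll = 0.0
    for u, v in combinations(nodes, 2):
        s = sum(1 for c in communities if u in c and v in c)
        p_edge = 1.0 - (1.0 - p_out) * (1.0 - p_in) ** s
        ll += log(p_edge) if frozenset((u, v)) in edges else log(1.0 - p_edge)
    return ll

# Toy graph: a triangle {0, 1, 2} plus a loosely attached node 3.
nodes = [0, 1, 2, 3]
edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2)]}
z = [{0, 1, 2}]  # propose one community around the triangle
print(log_likelihood(nodes, edges, z, p_in=0.9, p_out=0.01))
```

Proposing the triangle as a community gives a much higher likelihood than proposing no communities at all.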
16. MOSES model
P(g|z, pin, pout) is maximized when pin = 1, pout = 0, and
when z is defined as exactly one community around each edge.
i.e. we don’t want to maximize P(g|z, pin, pout).
19. MOSES model
Apply Bayes’ Theorem:
P(z, pin, pout|g) ∝ P(g|z, pin, pout) P(z) P(pin, pout)
P(z) ∝ k! ∏_{i=1}^{k} [ 1/(N+1) ] · (1/N)^{n_i}
where k is the number of communities, N is the number of nodes,
and n_i is the number of nodes in community i.
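This prior can be transcribed directly (under our reading of the slide: each of the k communities picks a size n_i uniformly from {0..N}, then its members with probability (1/N)^{n_i}, with a k! ordering factor; the function name is illustrative):

```python
from math import lgamma, log

def log_prior(community_sizes, N):
    """log P(z): k! times, per community, a 1/(N+1) size factor
    and a (1/N)**n_i membership factor."""
    k = len(community_sizes)
    lp = lgamma(k + 1)  # log k!
    for n_i in community_sizes:
        lp += -log(N + 1) - n_i * log(N)
    return lp

# Larger or more numerous communities are penalized, which is what
# stops the degenerate one-community-per-edge solution.
print(log_prior([20, 20], 2000))
```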
20. MOSES model
We can correctly integrate out the number of communities, k,
and search across the resulting varying-dimensional space.
No need for model selection, e.g. BIC.
21. MOSES Algorithm
For the MOSES algorithm, we chose to look at the joint
distribution over (z, pin, pout) and aim to maximize it.
The algorithm is a heuristic approximation, and we do
not claim that it finds the maximum.
23. MOSES Algorithm
Choose an edge at random to form a seed, and expand.
Accept expanded seeds that contribute positively to the
objective; reject the others.
Update pin, pout based on the graph and the current set of
communities.
Delete communities that don’t make a positive contribution to
the objective.
Final fine-tuning that moves nodes one at a time.
It is neither a Markov chain nor an EM algorithm, so we can
offer no such convergence guarantees.
The algorithm reaches a local maximum of the objective, and may
well have strong biases.
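The phases above can be sketched structurally; `expand`, `objective`, and `estimate_rates` are assumed helper callables standing in for the real seed-expansion, objective, and rate-estimation code (an illustration of the control flow, not the published implementation):

```python
import random

def moses_sketch(graph, n_seeds, objective, expand, estimate_rates):
    """Structural sketch of the MOSES phases (not the real code).

    `graph` maps each node to its neighbour set; `objective` scores a
    list of communities; `expand` grows a seed edge into a candidate
    community; `estimate_rates` re-fits (p_in, p_out)."""
    edges = [(u, v) for u in graph for v in graph[u] if u < v]
    communities, p_in, p_out = [], 0.5, 0.01
    for _ in range(n_seeds):
        seed = random.choice(edges)          # random edge as a seed
        candidate = expand(seed, graph, p_in, p_out)
        # Accept the expanded seed only if it improves the objective.
        if objective(communities + [candidate]) > objective(communities):
            communities.append(candidate)
        p_in, p_out = estimate_rates(graph, communities)
    # Delete communities that no longer make a positive contribution.
    for c in list(communities):
        rest = [d for d in communities if d is not c]
        if objective(rest) >= objective(communities):
            communities = rest
    return communities  # a node-by-node fine-tuning pass would follow
```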
24. Evaluation
Synthetic benchmarks
Networks created randomly by software.
Ground truth communities are built into these networks.
Check if the algorithms can discover the correct communities
when fed the network.
To measure the similarity between the found communities and
the ground truth communities, overlapping NMI is used.
(Lancichinetti et al. Detecting the overlapping and
hierarchical community structure in complex networks)
25. Evaluation
2000 nodes
Define hundreds of communities.
Each community contains 20 nodes chosen at random from
the 2000 nodes.
Some nodes may be assigned to many communities. Some
may not be assigned to a community.
pin = 0.4. About 40% of the pairs of nodes that share a
community are then joined.
pout = 0.005. Finally, a small amount of background noise is
added.
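The generator described above is easy to reproduce (a sketch following the slide's parameters; `n_comms=200` is an assumed value for "hundreds", and this is not the original benchmark code):

```python
import random

def synthetic_network(n_nodes=2000, n_comms=200, comm_size=20,
                      p_in=0.4, p_out=0.005, seed=42):
    """Generate a benchmark graph with planted overlapping communities."""
    rng = random.Random(seed)
    # Each community is comm_size nodes chosen at random; nodes may land
    # in many communities, or in none.
    communities = [set(rng.sample(range(n_nodes), comm_size))
                   for _ in range(n_comms)]
    edges = set()
    # Intra-community edges: each pair sharing a community is joined
    # with probability p_in, independently per shared community.
    for c in communities:
        members = sorted(c)
        for i, u in enumerate(members):
            for v in members[i + 1:]:
                if rng.random() < p_in:
                    edges.add((u, v))
    # Background noise: every pair is joined with probability p_out.
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            if rng.random() < p_out:
                edges.add((u, v))
    return communities, edges

comms, edges = synthetic_network()
print(len(comms), len(edges))
```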
38. Scalability
In general (in my interpretation), community finding means
overlapping community finding.
Partitioning breaks communities.
So, partitioning is scalable, but partitioning doesn’t help with
community finding.
Challenge: a very scalable algorithm that can credibly claim to
be a community-finding algorithm.
43. Other/future research
Markov Chain Monte Carlo
Working with Prof. Brendan Murphy on an MCMC method.
Very different algorithm, which allows us to investigate the
model directly.
The MOSES algorithm may have many biases that we'll never
fully grasp.
Different model (still an OSBM) where each community has
its own internal-connection probability.
MOSES breaks down on synthetic data if the communities are
not equally dense (pin).
Draw from this distribution: P(z, pout, p1, p2, p3, ...|g)
Multiple MCMC chains, where chains propose splits/merges to
each other.
(Modern) statisticians are innovative about scalability, e.g.
Hybrid Monte Carlo.
46. Take home messages
Community finding should be about discovering structure, not
forcing it: overlap, hierarchy, et cetera.
MOSES is a proof-of-concept: We show that quality results,
overlapping communities, and scalability, are not incompatible.
Very scalable community finding algorithms don't exist yet. This
is an interesting challenge.
47. Acknowledgments
This research was supported by Science Foundation Ireland (SFI)
Grant No. 08/SRC/I1407.
http://clique.ucd.ie/software
http://www.aaronmcdaid.com
aaronmcdaid@gmail.com, neil.hurley@ucd.ie