MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Presented at ASONAM 2010 by Aaron McDaid, describing a new model and algorithm for overlapping community finding.

Location: University of Sour

  1. Model-based Overlapping Seed ExpanSion (MOSES). Aaron McDaid and Neil Hurley, Clique: Graph & Network Analysis Cluster, School of Computer Science & Informatics, University College Dublin, Ireland. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.
  2. Overview: community finding; the MOSES model; the MOSES algorithm; evaluation; scalability; other/future work. Slides dated August 7, 2010.
  3. Communities.
  4. Facebook. References: Traud et al., Community Structure in Online Collegiate Social Networks; M. Salter-Townshend and T.B. Murphy, Variational Bayesian Inference for the Latent Position Cluster Model; Marlow et al., Maintained relationships on Facebook.
  5. Communities: Some nodes are assigned to multiple communities. Most edges are assigned to just one community. Multiple researchers have found Facebook members to be in 6 or 7 communities.
  6. Communities: A partition will break some of the communities in that simple example. Graclus breaks synthetic communities with low levels of overlap (A. Lancichinetti and S. Fortunato, Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities). Graclus breaks communities found by MOSES in Facebook networks (Traud et al., Community Structure in Online Collegiate Social Networks). Modularity has known problems, but we need to go further and move on from partitioning.
  7. Facebook: Traud et al.'s five university networks. Average of 7 communities per node.
  8. Community finding: A general-purpose community finding algorithm must allow (a) each node to be assigned to any number of communities, i.e. pervasive overlap (Ahn et al., Link communities reveal multiscale complexity in networks, Nature); and (b) the intersection (number of shared nodes) between a pair of communities to vary. The intersection can be small even when the number of communities per node is high.
  9. MOSES: MOSES deals only with undirected, unweighted networks. No attributes or weights are associated with nodes or edges.
  10. The MOSES model: Every pair of nodes has a chance of having an edge. Edges are independent for each pair of nodes, given the communities, but the probability is higher for pairs that share one or more communities. (This is an overlapping stochastic block model, OSBM: Latouche et al., Annals of Applied Statistics, http://www.imstat.org/aoas/next_issue.html.)
  11. MOSES model: Ignore the observed edges for now. Just consider the nodes and a (proposed) set of communities.
  12. MOSES model: These communities create probabilities for the edges. $P(v_1 \sim v_2) = p_{out}$ where the two vertices do NOT share a community. $P(v_1 \sim v_2) = 1 - (1 - p_{out})(1 - p_{in})$ where the two vertices share 1 community.
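  13. MOSES model: These communities create probabilities for the edges. Equivalently, in terms of the probability that an edge is absent, with $q_{out} = 1 - p_{out}$ and $q_{in} = 1 - p_{in}$: $P(v_1 \nsim v_2) = q_{out}$ where the two vertices do NOT share a community; $P(v_1 \nsim v_2) = q_{out} q_{in}$ where they share 1 community; and in general $P(v_1 \nsim v_2) = q_{out} \, q_{in}^{s(v_1, v_2)}$, where $s(v_1, v_2)$ is the number of communities shared by $v_1$ and $v_2$.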
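The edge probabilities on the two slides above can be written as one small function. This is a minimal sketch for illustration only; the function name `edge_probability` and its argument names are assumptions, not part of the MOSES software.

```python
def edge_probability(shared, p_in, p_out):
    """Probability that two nodes are joined, given how many communities they share.

    shared: number of communities containing both nodes, s(v1, v2)
    p_in:   per-community connection probability
    p_out:  background connection probability
    """
    q_out = 1.0 - p_out  # chance that the background does not create the edge
    q_in = 1.0 - p_in    # chance that one shared community does not create the edge
    # The edge is absent only if the background and every shared community all fail.
    return 1.0 - q_out * q_in ** shared


if __name__ == "__main__":
    for s in range(4):
        print(s, edge_probability(s, p_in=0.4, p_out=0.005))
```

With no shared community the function reduces to the background probability, and each additional shared community raises the chance of an edge.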
  15. MOSES model: We now have a model that, for a given set of communities, assigns probabilities for edges: $P(g \mid z, p_{in}, p_{out})$, where g is the observed graph of nodes and edges and z is the proposed set of communities. How do we match that with the observed edges to get a good estimate of the set of communities? Naive approach: find $(z, p_{in}, p_{out})$ that maximizes $P(g \mid z, p_{in}, p_{out})$.
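  16. MOSES model: $P(g \mid z, p_{in}, p_{out})$ is maximized when $p_{in} = 1$, $p_{out} = 0$ (i.e. $q_{out} = 1$), and z is defined as exactly one community around each edge: every observed edge then has probability 1 and every non-edge has probability $1 - p_{out} = 1$. This solution is degenerate, i.e. we don't want to maximize $P(g \mid z, p_{in}, p_{out})$.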
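To see why the naive maximum-likelihood approach fails, the likelihood can be evaluated on a toy graph. The sketch below assumes the edge model from the earlier slides (an edge is absent with probability $q_{out} q_{in}^{s}$); the function name `log_likelihood` and the toy path graph are hypothetical. With one community per edge, a per-community connection probability of 1 and no background edges, every pair is predicted perfectly and the log-likelihood reaches its upper bound of 0.

```python
import math
from itertools import combinations


def log_likelihood(nodes, edges, communities, p_in, p_out):
    """log P(g | z, p_in, p_out) for an undirected, unweighted graph."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in combinations(nodes, 2):
        shared = sum(1 for c in communities if u in c and v in c)
        p_absent = (1.0 - p_out) * (1.0 - p_in) ** shared
        p = 1.0 - p_absent if frozenset((u, v)) in edge_set else p_absent
        if p == 0.0:
            return -math.inf  # the model rules out what was observed
        total += math.log(p)
    return total


if __name__ == "__main__":
    nodes = range(4)
    edges = [(0, 1), (1, 2), (2, 3)]  # a path on four nodes
    one_per_edge = [set(e) for e in edges]
    print(log_likelihood(nodes, edges, one_per_edge, p_in=1.0, p_out=0.0))
    print(log_likelihood(nodes, edges, [set(nodes)], p_in=0.5, p_out=0.01))
```

The first call prints a log-likelihood of zero, which no more informative set of communities can beat; this is the motivation for adding priors rather than maximizing the likelihood alone.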
  17. MOSES model: Instead, consider the posterior $P(z, p_{in}, p_{out} \mid g)$.
  19. MOSES model: Apply Bayes' Theorem: $P(z, p_{in}, p_{out} \mid g) \propto P(g \mid z, p_{in}, p_{out}) \, P(z) \, P(p_{in}, p_{out})$, with the prior $P(z) \sim k! \prod_{1 \le i \le k} \frac{1}{N+1} \frac{1}{\binom{N}{n_i}}$, where k is the number of communities and $n_i$ is the number of nodes in community i.
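The prior over community sets is easiest to handle in log space. The following sketch makes two assumptions that the slide text does not state explicitly: that N is the number of nodes in the graph, and that the stacked "N over n_i" term is the binomial coefficient $\binom{N}{n_i}$. The function names are hypothetical.

```python
import math


def log_binomial(n, k):
    """log of the binomial coefficient C(n, k), via log-gamma to avoid overflow."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_prior(community_sizes, num_nodes):
    """log of k! * prod_i [ 1/(N+1) * 1/C(N, n_i) ], up to an additive constant."""
    k = len(community_sizes)
    total = math.lgamma(k + 1)  # log k!
    for n_i in community_sizes:
        total -= math.log(num_nodes + 1)         # size n_i taken as uniform on 0..N
        total -= log_binomial(num_nodes, n_i)    # members uniform among sets of that size
    return total


if __name__ == "__main__":
    print(log_prior([20, 20, 35], num_nodes=2000))
```

Each extra community costs at least log(N + 1), so the prior penalizes the one-community-per-edge solution that the likelihood alone prefers.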
  20. MOSES model: We can correctly integrate out the number of communities, k, and search across the resulting varying-dimensional space. No need for model selection, e.g. BIC.
  21. MOSES algorithm: For the MOSES algorithm, we chose to look at the joint distribution over $(z, p_{in}, p_{out})$ and aim to maximize it. The algorithm is a heuristic, approximate algorithm, and we do not claim that it finds the maximum.
  23. MOSES algorithm: (1) Choose an edge at random to form a seed, and expand. (2) Accept/reject those expanded seeds that contribute positively to the objective. (3) Update $p_{in}$, $p_{out}$ based on the graph and the current set of communities. (4) Delete communities that don't make a positive contribution to the objective. (5) Final fine-tuning that moves nodes one at a time. It is not a Markov chain, nor an EM algorithm, so we can make none of the guarantees those methods offer. The algorithm will reach a local maximum, and may well have strong biases.
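The seed-and-expand step can be illustrated with a simplified greedy sketch. This is not the authors' implementation: the function names, the greedy "best neighbour first" rule, and the gain formula are all assumptions made for illustration. The gain approximates the change in log-posterior from adding one community, assuming its pairs share no other community, ignoring the k! factor of the prior, and requiring $0 < p_{in} < 1$ and $0 < p_{out} < 1$.

```python
import math


def log_binomial(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def community_gain(adj, members, p_in, p_out):
    """Approximate change in log-posterior from adding `members` as a community."""
    n_nodes = len(adj)
    q_out, q_in = 1.0 - p_out, 1.0 - p_in
    edge_term = math.log((1.0 - q_out * q_in) / p_out)  # reward for an internal edge
    non_edge_term = math.log(q_in)                      # penalty for a missing internal edge
    ordered = sorted(members)
    gain = -math.log(n_nodes + 1) - log_binomial(n_nodes, len(ordered))  # prior cost
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            gain += edge_term if v in adj[u] else non_edge_term
    return gain


def expand_seed(adj, seed_edge, p_in, p_out):
    """Grow a community from one edge while the gain keeps improving."""
    members = set(seed_edge)
    best = community_gain(adj, members, p_in, p_out)
    while True:
        frontier = sorted({n for m in members for n in adj[m]} - members)
        if not frontier:
            break
        candidate = max(frontier, key=lambda n: community_gain(adj, members | {n}, p_in, p_out))
        gain = community_gain(adj, members | {candidate}, p_in, p_out)
        if gain <= best:
            break
        members.add(candidate)
        best = gain
    return members, best  # the caller accepts the community only if best > 0


def two_cliques():
    """Toy graph: two 5-cliques joined by the single edge (4, 5)."""
    adj = {n: set() for n in range(10)}
    for group in (range(0, 5), range(5, 10)):
        for u in group:
            for v in group:
                if u != v:
                    adj[u].add(v)
    adj[4].add(5)
    adj[5].add(4)
    return adj


if __name__ == "__main__":
    print(expand_seed(two_cliques(), (0, 1), p_in=0.9, p_out=0.05))
    print(expand_seed(two_cliques(), (4, 5), p_in=0.9, p_out=0.05))
```

On the toy graph, a seed inside a clique grows to cover that clique with a positive gain, while the bridge edge fails to expand and ends with a negative gain, so it would be rejected.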
  24. Evaluation: Synthetic benchmarks are networks created randomly by software. Ground truth communities are built in to these networks. Check whether the algorithms can discover the correct communities when fed the network. To measure the similarity between the found communities and the ground truth communities, overlapping NMI is used (Lancichinetti et al., Detecting the overlapping and hierarchical community structure in complex networks).
  25. Evaluation: 2000 nodes. Define hundreds of communities. Each community contains 20 nodes chosen at random from the 2000 nodes. Some nodes may be assigned to many communities; some may not be assigned to any community. $p_{in} = 0.4$: about 40% of the pairs of nodes that share a community are then joined. $p_{out} = 0.005$: finally, a small amount of background noise is added.
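The benchmark construction described above can be sketched as follows. The function name, the default of 200 communities (standing in for "hundreds") and the fixed random seed are assumptions for illustration; this is not the software used for the reported experiments.

```python
import random
from itertools import combinations


def generate_benchmark(num_nodes=2000, num_communities=200, community_size=20,
                       p_in=0.4, p_out=0.005, seed=0):
    """Return (communities, edges) for a random overlapping benchmark graph."""
    rng = random.Random(seed)
    nodes = range(num_nodes)
    # Each community is a uniformly random set of nodes, so overlap arises by chance.
    communities = [set(rng.sample(nodes, community_size))
                   for _ in range(num_communities)]
    edges = set()
    # Join each within-community pair with probability p_in, once per shared community.
    for community in communities:
        for u, v in combinations(sorted(community), 2):
            if rng.random() < p_in:
                edges.add((u, v))
    # Background noise: join any pair with probability p_out.
    for u, v in combinations(nodes, 2):
        if rng.random() < p_out:
            edges.add((u, v))
    return communities, edges


if __name__ == "__main__":
    communities, edges = generate_benchmark()
    memberships = sum(len(c) for c in communities)
    print(len(edges), "edges;", memberships / 2000, "communities per node on average")
```

Raising `num_communities` while keeping the other parameters fixed increases the average number of communities per node, which is the quantity varied on the horizontal axis of the evaluation plots.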
  26. Evaluation. Figure: NMI (0 to 1) against average overlap (1 to 15) for 20-node communities, $p_{in} = 0.4$, $p_{out} = 0.005$. Algorithms compared: MOSES, LFM (default), LFM (last collection), GCE, Louvain method, COPRA, 5-clique percolation, 4-clique percolation (dashed), Iterative Scan (dashed).
  27. Evaluation, LFR benchmarks. Figure: NMI (0 to 1) against communities per node (1 to 10), for degree = 15, 15 ≤ c ≤ 60. Algorithms compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, SCP-3, Louvain method, COPRA, SCP-4.
  28. Evaluation, LFR benchmarks. Figure: NMI (0 to 1) against communities per node (1 to 10), for degree ∼ 15, maxdegree = 45, 15 ≤ c ≤ 60. Algorithms compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Louvain method, COPRA, SCP-4.
  29. Facebook. Figure: density (0 to 0.4) against degree (1 to 500).
  30. Facebook. Figure: density (0 to 0.5) against communities per person (1 to 100).
  31. Facebook. Figure: density (0 to 0.6) against size of community (1 to 500), for Oklahoma, Princeton, UNC, Georgetown and Caltech.
  32. Facebook. Figure: communities per node (0 to 70) against degree (0 to 1200), with a counts scale from 1 to 1142.
  33. Facebook. Table: Summary of Traud et al.'s five university Facebook datasets, and of MOSES's output.

      |                   | Caltech | Princeton | Georgetown | UNC    | Oklahoma |
      |-------------------|---------|-----------|------------|--------|----------|
      | Edges             | 16656   | 293320    | 425638     | 766800 | 892528   |
      | Nodes             | 769     | 6596      | 9414       | 18163  | 17425    |
      | Average degree    | 43.3    | 88.9      | 90.4       | 84.4   | 102.4    |
      | Communities found | 62      | 832       | 1284       | 2725   | 3073     |
      | Average overlap   | 3.29    | 6.28      | 6.67       | 6.96   | 7.46     |
      | MOSES runtime (s) | 41      | 553       | 839        | 1585   | 2233     |
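The "Average degree" row follows from the edge and node counts, since each undirected edge contributes to the degree of two nodes. The short check below recomputes it from the table; the dictionary layout and function name are assumptions for illustration.

```python
# (edges, nodes) for each university network, copied from the table above.
DATASETS = {
    "Caltech": (16656, 769),
    "Princeton": (293320, 6596),
    "Georgetown": (425638, 9414),
    "UNC": (766800, 18163),
    "Oklahoma": (892528, 17425),
}


def average_degree(num_edges, num_nodes):
    """Mean degree of an undirected graph: every edge has two endpoints."""
    return 2 * num_edges / num_nodes


if __name__ == "__main__":
    for name, (num_edges, num_nodes) in DATASETS.items():
        print(name, round(average_degree(num_edges, num_nodes), 1))
```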
  34. Scalability. Figure: running time in seconds (1e−02 to 1e+02) against communities per node (1 to 10), for degree ∼ 15, maxdegree = 45, 15 ≤ c ≤ 60. Algorithms compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Blondel, COPRA, SCP-4.
  38. Scalability: In general, community finding means overlapping community finding (in my interpretation). Partitioning breaks communities. So partitioning is scalable, but partitioning doesn't help with community finding. Challenge: a very scalable algorithm that can credibly claim to be a community-finding algorithm.
  43. Other/future research: Markov Chain Monte Carlo. Working with Prof. Brendan Murphy on an MCMC method. It is a very different algorithm, which allows us to investigate the model directly; the MOSES algorithm may have many biases we'll never fully grasp. Different model (still an OSBM) where each community has its own internal-connection probability: MOSES breaks down on synthetic data if the communities are not equally dense ($p_{in}$). Draw from this distribution: $P(z, p_{out}, p_1, p_2, p_3, \ldots \mid g)$. Multiple MCMC chains, where chains propose splits/merges to each other. (Modern) statisticians are innovative about scalability, e.g. Hybrid Monte Carlo.
  46. Take home messages: Community finding should be about discovering structure, not forcing the structure: overlap, hierarchy, et cetera. MOSES is a proof of concept: we show that quality results, overlapping communities, and scalability are not incompatible. Very scalable community finding algorithms don't exist. This is an interesting challenge.
  47. Acknowledgments: This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407. http://clique.ucd.ie/software http://www.aaronmcdaid.com aaronmcdaid@gmail.com, neil.hurley@ucd.ie
