License: CC Attribution License


- 1. Model-based Overlapping Seed ExpanSion (MOSES). Aaron McDaid and Neil Hurley. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407. Clique: Graph & Network Analysis Cluster, School of Computer Science & Informatics, University College Dublin, Ireland.
- 2. Overview: Community finding; the MOSES model; the MOSES algorithm; evaluation; scalability; other/future work. August 7, 2010 2
- 3. Communities
- 4. Facebook. Traud et al., Community Structure in Online Collegiate Social Networks. M. Salter-Townshend and T.B. Murphy, Variational Bayesian Inference for the Latent Position Cluster Model. Marlow et al., Maintained relationships on Facebook.
- 5. Communities. Some nodes are assigned to multiple communities, while most edges are assigned to just one community. Multiple researchers have found Facebook members belonging to 6 or 7 communities.
- 6. Communities. A partition will break some of the communities in that simple example. Graclus breaks synthetic communities with low levels of overlap (A. Lancichinetti and S. Fortunato, Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities). Graclus breaks communities found by MOSES in Facebook networks (Traud et al., Community Structure in Online Collegiate Social Networks). Modularity has known problems, but we need to go further and move on from partitioning.
- 7. Facebook. Traud et al.'s five university networks. Average of 7 communities per node.
- 8. Community finding. A general-purpose community finding algorithm must allow each node to be assigned to any number of communities, and must allow pervasive overlap (Ahn et al., Link communities reveal multiscale complexity in networks, Nature). The intersection (number of shared nodes) between a pair of communities can vary; it can be small even when the number of communities per node is high.
- 9. MOSES. MOSES deals only with undirected, unweighted networks: no attributes or weights associated with nodes or edges.
- 10. The MOSES model. Model: every pair of nodes has a chance of having an edge. Edges are independent for each pair of nodes, given the communities, but the probability is higher for pairs that share a community (or communities). (This is an OSBM; Latouche et al., Annals of Applied Statistics, http://www.imstat.org/aoas/next_issue.html.)
- 11. MOSES model. Ignoring the observed edges for now, just consider the nodes and a (proposed) set of communities.
- 12. MOSES model. These communities create probabilities for the edges: P(v1 ∼ v2) = pout where the two vertices do NOT share a community; P(v1 ∼ v2) = 1 − (1 − pout)(1 − pin) where the two vertices share 1 community.
- 13. MOSES model. Equivalently, in terms of non-edge probabilities, with qout = 1 − pout and qin = 1 − pin: P(v1 ≁ v2) = qout where the two vertices do NOT share a community; P(v1 ≁ v2) = qout qin where the two vertices share 1 community; and in general P(v1 ≁ v2) = qout qin^s(v1,v2), where s(v1, v2) is the number of communities shared by v1 and v2.
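The two cases above combine into one formula: the background and each shared community independently attempt to place an edge, and an edge is absent only if every attempt fails. A minimal sketch of that probability, assuming my own function and parameter names (this is not the authors' code):

```python
def edge_probability(shared, p_in, p_out):
    """P(v1 ~ v2) for a pair of nodes sharing `shared` communities.

    The background places an edge with probability p_out, and each
    shared community places one with probability p_in; the pair is
    connected unless every one of those attempts fails.
    """
    q_out = 1.0 - p_out      # background fails to place an edge
    q_in = 1.0 - p_in        # one shared community fails to place an edge
    return 1.0 - q_out * q_in ** shared
```

With no shared community this reduces to pout, and with one shared community to 1 − (1 − pout)(1 − pin), matching the slide above.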
- 15. MOSES model. We now have a model that, for a given set of communities, assigns probabilities for edges: P(g | z, pin, pout), where g is the observed graph of nodes and edges and z is the proposed set of communities. How do we match that with the observed edges to get a good estimate of the set of communities? Naive approach: find (z, pin, pout) that maximizes P(g | z, pin, pout).
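Evaluating P(g | z, pin, pout) on a small graph is straightforward: every node pair is an independent Bernoulli trial whose edge probability depends only on how many communities the pair shares. A sketch in log space, with all names my own assumptions rather than the authors' implementation:

```python
import math
from itertools import combinations

def log_likelihood(nodes, edges, communities, p_in, p_out):
    """log P(g | z, p_in, p_out) for a graph given as a node list and
    an edge list, with z given as a list of node sets."""
    edge_set = {frozenset(e) for e in edges}
    ll = 0.0
    for u, v in combinations(nodes, 2):
        s = sum(1 for c in communities if u in c and v in c)
        # edge probability for a pair sharing s communities
        p = 1.0 - (1.0 - p_out) * (1.0 - p_in) ** s
        ll += math.log(p) if frozenset((u, v)) in edge_set else math.log1p(-p)
    return ll
```

For a triangle on {0, 1, 2} plus an isolated node, proposing {0, 1, 2} as a community scores far higher than proposing no communities, since the three observed edges are no longer charged the tiny background rate.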
- 16. MOSES model. P(g | z, pin, pout) is maximized when pin = 1, pout = 0, and z is defined as exactly one community around each edge; i.e. we don't want to maximize P(g | z, pin, pout).
- 17. MOSES model. Instead, consider the posterior: P(z, pin, pout | g).
- 19. MOSES model. Apply Bayes' Theorem: P(z, pin, pout | g) ∝ P(g | z, pin, pout) P(z) P(pin, pout). The prior on the communities is P(z) ∼ k! ∏_{1≤i≤k} [1/(N + 1)] [1/C(N, n_i)], where k is the number of communities, N is the number of nodes, and n_i is the number of nodes in community i.
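Reading the community prior as k! ∏_i (1/(N+1)) (1/C(N, n_i)), i.e. each community's size n_i uniform on 0..N and its membership uniform over the C(N, n_i) possible subsets, it can be evaluated in log space with the log-gamma function to avoid overflow. This reading of the slide's formula, and all names below, are my assumptions:

```python
import math

def log_binom(n, k):
    # log of the binomial coefficient C(n, k), computed via log-gamma
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_prior(community_sizes, n_nodes):
    """log P(z) for a proposed grouping, given only the community sizes."""
    k = len(community_sizes)
    lp = math.lgamma(k + 1)          # k! term: communities are unordered
    for n_i in community_sizes:
        lp -= math.log(n_nodes + 1) + log_binom(n_nodes, n_i)
    return lp
```

Each additional community pays a substantial log-probability cost, which is what lets the search explore different values of k without a separate model-selection criterion.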
- 20. MOSES model. We can correctly integrate out the number of communities, k, and search across the resulting varying-dimensional space. No need for model selection, e.g. BIC.
- 21. MOSES Algorithm. For the MOSES algorithm, we chose to look at the joint distribution over (z, pin, pout) and aim to maximize it. The algorithm is a heuristic approximate algorithm, and we do not claim that it finds the maximum.
- 23. MOSES Algorithm. Choose an edge at random to form a seed, and expand it. Accept expanded seeds that contribute positively to the objective; reject the rest. Update pin and pout based on the graph and the current set of communities. Delete communities that don't make a positive contribution to the objective. Finally, fine-tune by moving nodes one at a time. It is not a Markov chain, nor an EM algorithm, so we can make no such guarantees: the algorithm will be reaching a local maximum, and may well have strong biases.
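The seed-expansion step above can be sketched as a greedy loop: grow a seed (a randomly chosen edge) by adding neighbouring nodes while an objective improves. This is a heavily simplified schematic, not the MOSES implementation; the real objective is the posterior described earlier, and the real acceptance rules are more involved. Here `objective` is a stand-in supplied by the caller:

```python
def expand_seed(graph, seed, objective):
    """Greedily grow `seed` (an iterable of nodes) while `objective` improves.

    `graph` is a dict mapping each node to its set of neighbours.
    """
    community = set(seed)
    improved = True
    while improved:
        improved = False
        # nodes adjacent to the current community but not yet in it
        frontier = {n for v in community for n in graph[v]} - community
        for cand in frontier:
            if objective(community | {cand}) > objective(community):
                community.add(cand)
                improved = True
    return community
```

With a toy density objective (internal edges minus a penalty per possible pair), a seed edge inside a triangle grows to the full triangle and stops before absorbing a pendant node.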
- 24. Evaluation. Synthetic benchmarks: networks created randomly by software, with ground-truth communities built in. Check whether the algorithms can discover the correct communities when fed the network. To measure the similarity between the found communities and the ground-truth communities, overlapping NMI is used (Lancichinetti et al., Detecting the overlapping and hierarchical community structure in complex networks).
- 25. Evaluation. 2000 nodes. Define hundreds of communities, each containing 20 nodes chosen at random from the 2000; some nodes may be assigned to many communities, and some may not be assigned to any. pin = 0.4: about 40% of the pairs of nodes that share a community are then joined. pout = 0.005: finally, a small amount of background noise is added.
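The benchmark described above can be sketched as follows: fixed-size communities sampled uniformly at random, edges placed with probability pin inside each community, then pout background noise over all pairs. Parameter values follow the slide; the number of communities controls the average overlap per node, and the authors' exact generator may differ from this sketch:

```python
import random
from itertools import combinations

def make_benchmark(n_nodes=2000, n_comms=200, comm_size=20,
                   p_in=0.4, p_out=0.005, seed=0):
    """Return (communities, edges) for a random overlapping-community graph.

    n_comms is varied to control the average communities per node.
    """
    rng = random.Random(seed)
    comms = [set(rng.sample(range(n_nodes), comm_size))
             for _ in range(n_comms)]
    edges = set()
    for c in comms:                                   # intra-community edges
        for u, v in combinations(sorted(c), 2):
            if rng.random() < p_in:
                edges.add((u, v))
    for u, v in combinations(range(n_nodes), 2):      # background noise
        if rng.random() < p_out:
            edges.add((u, v))
    return comms, edges
```

Edges are stored once per pair as (u, v) with u < v, matching the undirected, unweighted setting MOSES assumes.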
- 26. Evaluation. [Plot: average overlapping NMI (0 to 1) vs. average overlap (1 to 15), for 20-node communities with pin = 0.4, pout = 0.005. Methods: MOSES, LFM (default), LFM (last collection), GCE, Louvain method, COPRA, 5-clique percolation, 4-clique percolation, Iterative Scan.]
- 27. Evaluation, LFR benchmarks. [Plot: NMI (0 to 1) vs. communities per node (1 to 10); degree = 15, community sizes 15 ≤ c ≤ 60. Methods: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, SCP-3, Louvain method, COPRA, SCP-4.]
- 28. Evaluation, LFR benchmarks. [Plot: NMI (0 to 1) vs. communities per node (1 to 10); degree ∼ 15, max degree = 45, community sizes 15 ≤ c ≤ 60. Methods: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Louvain method, COPRA, SCP-4.]
- 29. Facebook. [Plot: density of node degree, log scale.]
- 30. Facebook. [Plot: density of communities per person, log scale.]
- 31. Facebook. [Plot: density of community size, log scale, for Oklahoma, Princeton, UNC, Georgetown, Caltech.]
- 32. Facebook. [Heatmap: degree (0 to 1200) vs. communities per node (0 to 70), coloured by counts.]
- 33. Facebook. Table: Summary of Traud et al.'s five university Facebook datasets, and of MOSES's output.

                       Caltech  Princeton  Georgetown     UNC  Oklahoma
    Edges                16656     293320      425638  766800    892528
    Nodes                  769       6596        9414   18163     17425
    Average degree        43.3       88.9        90.4    84.4     102.4
    Communities found       62        832        1284    2725      3073
    Average overlap       3.29       6.28        6.67    6.96      7.46
    MOSES runtime (s)       41        553         839    1585      2233
- 34. Scalability. [Plot: runtime in seconds (log scale) vs. communities per node (1 to 10); degree ∼ 15, max degree = 45, community sizes 15 ≤ c ≤ 60. Methods: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, blondel, COPRA, SCP-4.]
- 38. Scalability. In general, community finding means overlapping community finding (in my interpretation). Partitioning breaks communities. So, partitioning is scalable, but partitioning doesn't help with community finding. Challenge: a very scalable algorithm that can credibly claim to be a community-finding algorithm.
- 43. Other/future research: Markov Chain Monte Carlo. Working with Prof. Brendan Murphy on an MCMC method: a very different algorithm, which allows us to investigate the model directly; the MOSES algorithm may have many biases we'll never fully grasp. A different model (still an OSBM) where each community has its own internal-connection probability: MOSES breaks down on synthetic data if the communities are not equally dense (pin). Draw from this distribution: P(z, pout, p1, p2, p3, ... | g). Multiple MCMC chains, where chains propose splits/merges to each other. (Modern) statisticians are innovative about scalability, e.g. Hybrid Monte Carlo.
- 46. Take home messages. Community finding should be about discovering structure, not forcing it: overlap, hierarchy, et cetera. MOSES is a proof of concept: we show that quality results, overlapping communities, and scalability are not incompatible. Very scalable community finding algorithms don't exist yet; this is an interesting challenge.
- 47. Acknowledgments. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407. http://clique.ucd.ie/software http://www.aaronmcdaid.com aaronmcdaid@gmail.com, neil.hurley@ucd.ie
