Using content and interactions for discovering communities in


Published in: Technology

  1. Using Content and Interactions for Discovering Communities in Social Networks (IBM Research India)
  2. Abstract
     - Problem: discovering meaningful communities from a social network.
     - We propose generative models that can discover communities based on the discussed topics, interaction types, and the social connections among people.
     - Person -> multiple communities -> multiple topics.
     - We discover both community interests and user interests based on the information and linked associations.
  3. Introduction
     - Background: rich data -> academia & business; discover relationships -> discover community.
     - A community is a collection of users grouped such that there is high relatedness among people within the group.
     - One common approach is to treat communities as groups of nodes in a social network that are more densely connected among themselves than with the rest of the network.
     - This is a graph clustering problem.
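  4. We consider communities as “groups of users (nodes) who are interconnected and communicate on shared topics”.
     - Bayesian models are used to extract the latent communities. Model assumption: community membership depends on the topics users are interested in and on the links between them.
     - This approach helps uncover a user's interests and role in the network, as well as the topics that are popular within a community.
     - Given a topic or interest, the related communities can therefore be found.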
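  5. We also utilize the “type” of interactions between users (e.g., conversation vs. broadcast) to emphasize their interest in topics, and thus community membership.
     - Two kinds of social networks: (1) a user's posts are broadcast to his or her neighbors; (2) a user can only send posts directly to other people (for example, email networks). The paper therefore proposes two different methods, one for each network structure.
     - Assumption: a post discusses a single topic, which reduces model training time. This assumption does not hold when posts are long, so the paper also gives another model for that case.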
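  6. PRIOR WORK
     - First type: considers only the links between users. It ignores other node properties and user interactions, and does not allow a user to belong to multiple communities.
     - Second type: Bayesian probabilistic models. These allow a user to belong to multiple communities, but still rely too heavily on link structure to discover communities.
     - Third type: uses semantic content to discover communities. Communities are modeled as random mixtures over users, who in turn have a topical distribution (interest) associated with them. Link information is not used.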
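  7. CUT (Community-User-Topic):
     - Assumes that members who form a community by discussing particular topics are connected to one another.
     - Graph structure and interactions between users.
     - Two models, CUT1 and CUT2:
       - CUT1 considers only the relation between communities and members. The communities it finds emphasize how tightly members are connected, and the results are very similar to those of graph-theoretic community discovery algorithms.
       - CUT2 considers only the relation between communities and topics. The communities it finds emphasize how closely related the topics followed by members are.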
  8. CART (Community-Author-Recipient-Topic)
     - Combines content with link relationships.
     - Suited to extracting communities in email networks; not suited to broadcast networks such as Twitter.
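  9. COMMUNITY DISCOVERY MODELS
     - Two classes of networks:
       1. Twitter, Facebook: a post is about the user's own topics of interest; the recipients' interests are not considered.
       2. Email: the topic of a post reflects an interest shared by both the sender and the recipient.
     - Notation: U, Ri, Pij, Pi, P, Np, Wp, Xp, c, z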
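  10. Topic User Community Model
     - Assumes a user can belong to multiple communities and can be interested in multiple topics.
     - The model uses interaction types to improve community discovery; the interaction type reflects the strength of the tie between two users and their interest in a topic.
     - Each user has his or her own interaction space.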
  11. Parameter estimation:
     - Gibbs sampling: random assignment -> update equations.
     - Update equations:
  12. The procedure for Gibbs inference:
     - Worst-case time complexity: O(IPCXZ + IW)
  13. Topics can be computed using the approximation
     - P(u|c), P(z|c)
  14. Topic User Recipient Community Model 1
  15. Topic User Recipient Community Model 2
  16. Full TURCM
     - Generates a topic for each word in a post (instead of generating a topic per post).
  17. Parameter estimation: update equations
  18. EXPERIMENTS
     - Datasets:
       - Twitter over a period of six months in 2009
       - Enron Email corpus
     - We set the number of communities C to 10 and the number of topics Z to 20.
     - We ran 1000 iterations to burn in and took 250 samples (every fourth sample) in the next 1000 iterations.
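The burn-in and thinning schedule on this slide can be made concrete with a small sketch. It only enumerates which Gibbs iterations would be kept as samples; the function name and the choice of keeping the fourth, eighth, ... post-burn-in iteration (rather than the first, fifth, ...) are assumptions for illustration, not details given in the slides.

```python
def thinned_sample_iterations(burn_in=1000, sampling_iters=1000, thin=4):
    """Return the 1-based iteration numbers at which a sample is kept.

    Assumption: iterations 1..burn_in are discarded, and afterwards every
    `thin`-th iteration is kept, starting with iteration burn_in + thin.
    """
    first_kept = burn_in + thin
    last_iter = burn_in + sampling_iters
    return list(range(first_kept, last_iter + 1, thin))


kept = thinned_sample_iterations()
print(len(kept), kept[0], kept[-1])  # 250 1004 2000
```

With the slide's numbers (1000 burn-in iterations, then every fourth of the next 1000), this yields the 250 samples mentioned above.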
  19. Qualitative Analysis
  20. Community Analysis
  21. Perplexity Analysis
     - It measures the log likelihood of generating unseen data after learning from a fraction of the data.
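As a rough illustration of how a perplexity score relates to held-out log likelihood, here is a minimal sketch using the standard definition: the exponential of the negative average log probability per held-out word. The slides do not give the exact formula used in the experiments, so the function name and the per-word input format are assumptions.

```python
import math


def perplexity(log_probs):
    """Standard perplexity from per-word natural-log probabilities.

    `log_probs` is a hypothetical input: one log p(word) value per
    held-out word, as assigned by a trained model. Lower is better.
    """
    if not log_probs:
        raise ValueError("need at least one held-out word")
    avg_log_likelihood = sum(log_probs) / len(log_probs)
    return math.exp(-avg_log_likelihood)


# A model that spreads probability uniformly over 4 words has perplexity 4.
uniform = [math.log(0.25)] * 10
print(round(perplexity(uniform), 6))  # 4.0
```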
  22. Runtime Analysis
  23. CONCLUSION
     - We proposed probabilistic schemes that incorporate topics, social relationships, and the nature of posts for more effective community discovery.
     - Interaction types are important.
  24. Community Detection in Incomplete Information Networks
  25. Abstract
     - Detecting communities in incomplete information networks with missing edges.
       1. Learn a distance metric to reproduce the link-based distance between nodes from the observed edges in the local information regions.
       2. Use the learned distance metric to estimate the distance between any pair of nodes in the network.
     - A hierarchical clustering approach.
  26. INTRODUCTION
     - A community is defined as a group of nodes which are densely connected inside the group and loosely connected with the nodes outside the group.
     - The local regions with complete linkage information are called local information regions.
       - Terrorist-attack
       - Food Web network
  27. Contributions
     - We identify and define the problem of community detection in incomplete information networks with local information regions.
     - A metric is then learned that can be used to measure the distance between any pair of nodes.
     - Based on the learned metric, we devise a distance-based modularity function to evaluate the quality of the communities.
     - We propose a distance-based algorithm, DSHRINK, which can discover hierarchical and overlapping communities.
  28. RELATED WORK
     1. Methods focused on the topological structures.
     2. Some graph clustering methods which are based on attributes.
     3. Some clustering methods based on both links and attributes were also proposed.
  29. PROBLEM DEFINITION
  30. OPTIMIZATION FRAMEWORK
  31. Forms of the matrix M
     - Diagonal form of M
     - Full matrix form of M
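To show what the diagonal and full forms of M change in practice, the following sketch computes a Mahalanobis-style distance sqrt((x - y)^T M (x - y)) for both cases. The deck elsewhere calls M a Mahalanobis matrix, but how M is learned is not shown here, so the matrices below are made-up values and the helper names are hypothetical. M is assumed to be symmetric positive semidefinite so the square root is defined.

```python
import math


def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a square matrix M (list of rows)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    Md = [sum(M[i][j] * d[j] for j in range(n)) for i in range(n)]
    quad = sum(d[i] * Md[i] for i in range(n))
    return math.sqrt(quad)


def diagonal_matrix(weights):
    """Diagonal form: one non-negative weight per attribute, zeros elsewhere."""
    n = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


x, y = [1.0, 0.0], [0.0, 1.0]
M_diag = diagonal_matrix([4.0, 1.0])   # illustrative weights only
M_full = [[2.0, 1.0], [1.0, 2.0]]      # illustrative full matrix only
print(mahalanobis(x, y, M_diag))       # sqrt(4 + 1)
print(mahalanobis(x, y, M_full))       # sqrt(2)
```

The diagonal form only rescales each attribute independently, while the full form also accounts for interactions between attributes, at the cost of many more parameters to learn.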
  32. DISTANCE-BASED CLUSTERING
     - Distance-based Modularity
  33. Clustering Algorithm
  34. Speeding up the Clustering Process with Approximation
  35. EXPERIMENTS
     - Data Sets
       - DBLP-A Dataset: DBLP-A is the data set extracted from the DBLP database, which provides bibliographic information on computer science journals and proceedings.
       - DBLP-B Dataset:
  36. Incomplete Information Network Generation
     - Snowball sampling
     - Parameter p, called the sample ratio
     - Parameter q, called the local information region size
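A minimal sketch of how snowball sampling could carve local information regions out of a complete network is given below. It grows each region breadth-first from a random seed until it holds q nodes, then keeps only the edges inside a region as observed. The slide does not say how the sample ratio p maps to the number of regions, so the sketch takes a hypothetical `num_regions` argument instead; all function names are illustrative and this is not necessarily the authors' exact procedure.

```python
import random
from collections import deque


def snowball_region(adj, seed, q):
    """Grow a region breadth-first from `seed` until it holds q nodes
    (or the seed's connected component is exhausted)."""
    region = {seed}
    queue = deque([seed])
    while queue and len(region) < q:
        node = queue.popleft()
        for nb in sorted(adj[node]):
            if nb not in region:
                region.add(nb)
                queue.append(nb)
                if len(region) == q:
                    break
    return region


def observed_edges(adj, regions):
    """Keep an edge only if both endpoints lie in the same region."""
    edges = set()
    for region in regions:
        for u in region:
            for v in adj[u]:
                if v in region and u < v:
                    edges.add((u, v))
    return edges


def incomplete_network(adj, num_regions, q, rng):
    """Assumption: seeds are drawn uniformly at random without replacement."""
    seeds = rng.sample(sorted(adj), num_regions)
    regions = [snowball_region(adj, s, q) for s in seeds]
    return regions, observed_edges(adj, regions)


# Path graph 0-1-2-3-4-5 as a toy complete network.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
regions, edges = incomplete_network(adj, num_regions=2, q=3, rng=random.Random(0))
print(regions, sorted(edges))
```

Edges that fall between regions, or outside all regions, are the missing edges of the resulting incomplete information network.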
  37. Evaluation Measures
     - The definition of purity is as follows: each cluster is first assigned the most frequent class in the cluster, and then purity is measured by counting, over all clusters, the instances whose class matches the label assigned to their cluster.
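The purity definition above can be sketched in a few lines. This version follows the common convention of dividing the matching count by the total number of instances so the score lies in (0, 1]; the slide only mentions the count, so the normalization and the function name are assumptions.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of instances whose class equals the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    per_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cid, Counter())[label] += 1
    # Each cluster contributes the size of its most frequent class.
    matched = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return matched / len(class_labels)


clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "a"]
print(purity(clusters, classes))  # 4 of 6 instances match their cluster's majority class
```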
  38. Compared Methods
     - Kmeans:
     - Md + DSHRINK: We learn a diagonal Mahalanobis matrix Md and use it as the input M for DSHRINK.
     - Mf + DSHRINK: We learn a full Mahalanobis matrix Mf and use it as the input M for DSHRINK.
  39. Effectiveness Results
  40. Efficiency Results
  41. CONCLUSION
     - A global metric
     - A distance-based modularity function
     - A distance-based clustering algorithm, DSHRINK
     - Approximation strategies
