[HCII2011] Mining Social Relationships in Micro-blogging systems

QIN GAO, QU QU, XUHUI ZHANG
INSTITUTE OF HUMAN FACTORS & ERGONOMICS
DEPT. OF INDUSTRIAL ENGINEERING
TSINGHUA UNIVERSITY, Beijing, China

MINING SOCIAL RELATIONSHIPS IN
MICRO-BLOGGING SYSTEMS

HCI International 2011
9-14 July, Orlando, USA

CONTENT

• Motivation
• A graph-based approach to social relationship
mining in micro-blogging systems
• Preliminary validation
• Future work

WHY MINING SOCIAL RELATIONSHIPS IN
MICRO-BLOGGING SYSTEMS?

Potential
• High popularity of micro-blogging systems
• Explicit indication of information dissemination directions by “following” relationships
• Networks in micro-blogging systems overlap heavily with social networks in real life (Java, et al., 2007)

Challenge
• Most available methods emphasize structural analysis of the network
• Many do not take information flow directions into account
• Existing methods are often limited in their ability to analyze very large data sets


RELATED WORK

• Analysis of online social networks
• Most influential method: SNA
• Useful measures: centrality, betweenness
• Used for structural analysis of blog and email networks
• Useful for structural analysis of the network
• Difficult to evaluate information dissemination between users
• Time consuming
• Other methods: Matsumura, 2003; Kazienko & Musial, 2008
• Graph theory
• Useful for modeling complex networks
• E.g., Protein structure by Sadumrala, 1998
• Many methods for mining frequent subgraph patterns
• Use of graph theory in social network analysis (e.g., Cai, 2005)


A GENERAL
INFORMATION DIFFUSION MODEL


1. USER GROUPING BY INFORMATION
DISSEMINATION RELATIONSHIPS
• Definition: A user group is a set of nodes within which any two
nodes can transfer information bi-directionally, and any user in
a group cannot transfer information bi-directionally with any
other user outside of the group
• Developed the definition based on maximum strongly
connected components
• Given a graph G = (V, E), where V(G) is a finite set of nodes and E(G) is a
finite set of edges (each edge has its endpoints in V(G))
• For ∀a∈V, ∀b∈V, if there is at least one path from a to b, and at
least one path from b to a, then G is a bi-directionally strongly
connected component
• G is a maximum bi-directionally strongly connected component
(MBSCC) if G would no longer be a bi-directionally strongly connected
component when any node or edge were added to G
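
The grouping step can be illustrated with a standard strongly connected components search. The sketch below uses Kosaraju's two-pass algorithm as a stand-in; the slides do not say which algorithm the authors used. The function name `strongly_connected_components` and the input format (a dict mapping each user to the users who can receive information from them) are assumptions made for illustration.

```python
def strongly_connected_components(graph):
    """Return a list of node sets, one per strongly connected component.

    graph: dict mapping a node to an iterable of nodes it can pass
    information to (assumed edge direction, not taken from the slides).
    """
    # Collect every node, including those that only appear as targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: iterative depth-first search, recording finish order.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, targets = stack[-1]
            for nxt in targets:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {n: [] for n in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)

    # Pass 2: search the reversed graph in decreasing finish order.
    groups, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            group.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        groups.append(group)
    return groups
```

Each returned set corresponds to one user group in the sense of the definition above: every member can reach every other member, and no node outside the set can be added without breaking that property.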


2. GROUP RANKING BY CONTRIBUTIONS IN
INFORMATION DISSEMINATION

• Each group (MBSCC) is denoted as a node
• The network of a micro-blogging system is then
condensed into a directed acyclic graph G’
• Each node of G’ is a MBSCC
• Topological sorting algorithm
• The node without any information outflow is deleted from
G’ and put at the end of the ranking list.
• This step is repeated till all nodes are deleted.


2. GROUP RANKING BY CONTRIBUTIONS IN
INFORMATION DISSEMINATION

• Sorting algorithm
    P<Set<Node>>   Empty list that will contain sets of
                   nodes in sequence
    N              Set of nodes with no outgoing link

    insert all nodes which have no outgoing link into N
    while N is non-empty do
        insert N at the front of P
        for each node n in N do
            for each node m with a link e from m to n do
                remove e
            remove n
        N ← all remaining nodes which have no outgoing link

• In the final ranking list P, groups are listed in a
descending order with regard to their contribution
to information dissemination in the network
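
The ranking step can be sketched in a few lines of Python. This is a minimal illustration, assuming the condensed graph G’ is given as a dict mapping each group to the set of groups it passes information to; the function name `rank_groups` and the input format are hypothetical and not taken from the slides.

```python
def rank_groups(dag):
    """Rank the nodes of a directed acyclic graph in layers.

    dag: dict mapping a group to the set of groups it sends information to.
    Returns a list of sets; earlier sets contribute more to dissemination.
    """
    # Work on a copy and make sure every group has an entry.
    outgoing = {node: set(targets) for node, targets in dag.items()}
    for targets in dag.values():
        for target in targets:
            outgoing.setdefault(target, set())

    ranking = []
    while outgoing:
        # Groups with no information outflow in the remaining graph.
        sinks = {node for node, targets in outgoing.items() if not targets}
        if not sinks:
            raise ValueError("graph contains a cycle; condense it first")
        # Groups removed earlier end up later in the ranking.
        ranking.insert(0, sinks)
        for node in sinks:
            del outgoing[node]
        for targets in outgoing.values():
            targets -= sinks
    return ranking
```

Groups that are removed in the same round are kept together in one set, since the procedure gives no basis for ordering them relative to each other.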


3. USER INFLUENCE EVALUATION BY THE
PROBABILITY OF INFORMATION DISSEMINATION

• Term definition
• Path distance: the number of nodes from the source node a to the target
node b along a path
• Distance between nodes: smallest path distance between the source node
a and the target b
• Width between nodes: the number of different paths connecting the
source node a and the target node b
• Assuming the probability that any user retweets a given piece of received
information is P, the probability that the target user receives this
information from the source user is:
p = \sum_{i \in N} P^{d_i}
• N: the set of different paths from the source to the target
• d_i: the distance of path i
• The shorter the distance and the greater the width of the paths, the more
likely the information is to be transmitted.
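
As a worked illustration of the formula above, the following sketch sums P^{d_i} over a list of path distances. The function name `receive_probability` and the value P = 0.3 are assumptions chosen for the example; the slides only state the formula.

```python
def receive_probability(path_distances, retweet_probability):
    """Sum P ** d_i over all paths from the source to the target."""
    return sum(retweet_probability ** d for d in path_distances)


# Hypothetical values: P = 0.3 is an illustrative retweet probability.
one_direct_path = receive_probability([1], 0.3)
two_two_step_paths = receive_probability([2, 2], 0.3)
three_two_step_paths = receive_probability([2, 2, 2], 0.3)
```

With these numbers a single direct path outweighs two two-step paths, while each additional path of the same length raises the total, which matches the observation that shorter and wider connections transmit information more readily.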



• The shortest path from the information source to the
target makes the greatest contribution.
• According to observation, it is reasonable to
assume P < .5
• To simplify the problem, we can set a threshold T
• If d_i > T, p_i (the probability that information transmits via path
i) → 0
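
The effect of the threshold can be checked numerically. The sketch below compares the full sum with a sum that drops every path longer than T; the function name `truncated_probability` and the sample values (P = 0.3, T = 5, path distances 1, 2 and 6) are assumptions for illustration only.

```python
def truncated_probability(path_distances, retweet_probability, threshold):
    """Sum P ** d_i, ignoring paths whose distance exceeds the threshold T."""
    return sum(
        retweet_probability ** d
        for d in path_distances
        if d <= threshold
    )


distances = [1, 2, 6]  # hypothetical path distances
full = sum(0.3 ** d for d in distances)
truncated = truncated_probability(distances, 0.3, threshold=5)
error = full - truncated  # contribution of the dropped six-step path
```

Here the dropped path contributes 0.3 ** 6, which is below 0.001, so ignoring it changes the estimate very little when P is well under .5.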



• QIndex algorithm (inspired by Dijkstra's algorithm)
• For a graph G = (V, E), the information source node is labeled v_s (v_s ∈
V); the current node is denoted n_c; the distance value and width
value of a node are denoted d and w.
1. Initialization: d_s = 0, w_s = 1; d = infinity and w = 0 for all other
nodes; mark all nodes unvisited; set v_s as the current node (n_c)
2. Each unvisited node linked to n_c is denoted n’; the
distance between n’ and the source node via n_c is d_c + 1
• If d_c + 1 < d’ and d_c + 1 < T, then d’ = d_c + 1 and w’ = w_c
• If d_c + 1 ≥ d’ and d_c + 1 < T, then w’ = w_c + 1
3. The current node n_c is marked as visited once all
unvisited nodes directly linked to it have been processed
4. Set the unvisited node with the smallest distance value
as n_c, and repeat from step 2



• QIndex algorithm
• When no unvisited node remains within a distance less than T, the QIndex of all
visited nodes is calculated as
QIndex = d/w
• The smaller the QIndex, the more likely the target
node is to receive information from the source node
• Importance of the setting of T
• Worst case: the running time of the QIndex algorithm is O(|V|^2 +
|E|); as T approaches 0, the time cost of QIndex is close to O(|V| +
|E|)
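
The steps of the QIndex algorithm can be sketched as follows. This is a minimal reading of the slides, not the authors' implementation: the function name `qindex`, the dict-based graph format, and the choice to leave the source node out of the result are assumptions. The width update follows the rule exactly as stated in step 2.

```python
import math


def qindex(graph, source, threshold):
    """Return {node: (distance, width, QIndex)} for nodes reached from source.

    graph: dict mapping a node to an iterable of nodes it links to.
    threshold: the cut-off T; only distances strictly below T are recorded.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Step 1: initialise distance and width values.
    dist = {n: math.inf for n in nodes}
    width = {n: 0 for n in nodes}
    dist[source], width[source] = 0, 1
    unvisited = set(nodes)

    while unvisited:
        # Step 4: pick the unvisited node with the smallest distance value.
        current = min(unvisited, key=lambda n: dist[n])
        if dist[current] >= threshold:
            break  # no unvisited node within a distance less than T
        # Step 2: update each unvisited node linked to the current node.
        for nxt in graph.get(current, ()):
            if nxt not in unvisited:
                continue
            candidate = dist[current] + 1
            if candidate < threshold:
                if candidate < dist[nxt]:
                    dist[nxt], width[nxt] = candidate, width[current]
                else:
                    width[nxt] = width[current] + 1
        # Step 3: mark the current node as visited.
        unvisited.remove(current)

    # QIndex = d / w for every visited node other than the source.
    return {
        n: (dist[n], width[n], dist[n] / width[n])
        for n in nodes
        if n not in unvisited and n != source
    }
```

The linear scan for the smallest distance in each round is what gives the O(|V|^2 + |E|) worst case; the early exit once every remaining distance is at least T is why a small T shortens the run.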


VALIDATION

• Source: digu.com
• A Chinese micro-blogging
system since 2009
• More than 2 million users
• Data collection
• Snowball sampling
• 20 users randomly chosen as
“seeds”
• Lasted for 2 weeks
• 332,122 users and 11,160,822
following relationships


VALIDATION

• Data collection example:
Item                      Value
ID                        11528569
User name                 ququjoy
Nick name                 Qu
Location                  Beijing
Gender                    1 (1-male, 2-female, 3-private)
Self-introduction         From Chongqing
Address                   http://pic.minicloud.com.cn/file/default/SIGN_24x24.png
Homepage                  http://digu.com/ququjoy
Information Privacy       false (false-information disclosure, true-information protection)
The Number of Followees   2
The Number of Followers   2
The Number of Updates     7
Followee                  digu, robot
Follower                  xabcdefg, flyinglin456


VALIDATION

• A sub-sample of 2,556
users with 35,510 following
relationships was used in
validation
• Using MBSCC to find
groups, the largest group
contains 1,426 users
• Network pattern of the
biggest group is highly
similar to the whole
network pattern


VALIDATION

• Users most influenced by a chosen user
yoohee1221_ (T = 5)
Users Distance Width QIndex
classyuan 1 1 1
gambol 1 1 1
liuxinwu 2 2 1
xujun99663 3 2 1.5
dan123 4 2 2
chervun 4 2 2
tuniu 4 2 2
harliger 4 2 2
zxb888 4 2 2
topidea 4 2 2
yuanjuan 4 2 2
WDM123 4 2 2
shaun 4 2 2

• Note that the influence on liuxinwu is as strong as
that on the users directly connected to yoohee1221_


CONCLUSION

• Pros of the proposed method
• Incorporating direction information into network analysis
• Evaluating groups/users by their contribution to information
dissemination
• Capable of handling large amounts of data in a
time-efficient manner
• Limitation of the proposed method
• Useful for studying characteristics of the whole network, but
not good for splitting the whole network into sub-networks
• Vulnerable to spam following relationships in grouping
• Future work: revise the grouping algorithm


THANKS, AND QUESTIONS?

