Iiwas19 yamazaki slide

Fast RankClus Algorithm via Dynamic Rank Score Tracking
on Bi-type Information Networks
iiWAS 2019
Kotaro Yamazaki✝, Shohei Matsugu✝
Hiroaki Shiokawa✝✝, Hiroyuki Kitagawa✝✝
✝: Graduate School of SIE, University of Tsukuba
✝✝: Center for Computational Sciences, University of Tsukuba

Background:
Ubiquitous Information Networks
• Information network:
Each node represents an entity, and each link (edge) represents a relationship between entities
• Homogeneous vs. heterogeneous networks
- Homogeneous network
A single type of object
E.g., co-author networks, Web pages, friendship networks
Most current studies focus on homogeneous networks
- Heterogeneous network
Objects belong to several types
E.g., conference-author networks, medical networks
Most real systems can be modeled as heterogeneous networks

Background:
RankClus
[Sun et al., EDBT’09]
A ranking-based clustering algorithm for heterogeneous networks
The Methodology
• Use rankings as the features of clusters
• Assign each node to the cluster in which it has the highest rank score
• Repeat so that clustering and ranking mutually improve each other
[Figure: heterogeneous network → RankClus framework → clusters with rank scores]

Background:
Bottleneck of RankClus
The ranking process incurs a high computational cost
Why?
• It generates as many subgraphs as there are clusters
• It iteratively updates rank scores for all nodes
[Figure: RankClus loop: Initialization → Ranking → Clustering → Repeat]

Background:
Related Work
• Pruning-RankClus [Yamazaki et al., iiWAS’18]
A fast RankClus algorithm that prunes nodes
- Approach
Identify nodes that do not significantly affect the clustering
result and prune them
- Bottleneck
It is difficult to prune while maintaining accuracy
It requires many user-specified parameters

Background:
Our Goal
Reduce the computational cost of RankClus
Approach: local update
Focus on the dynamic graph property of RankClus and
efficiently compute only the evolving nodes and their neighbors

Background:
Our Contributions
1. Efficient
Our proposed method runs faster than both RankClus and the
state-of-the-art algorithm (Pruning-RankClus)
2. Highly Accurate
Although our proposed algorithm does not process the
entire graph, its clustering results are more accurate than
those of the state-of-the-art algorithm
3. Easy to deploy
Our proposed method requires fewer user-specified
parameters than the state-of-the-art algorithm

Preliminary:
Data Model: Bi-type Information Network
• A graph consisting of two types of nodes, X and Y
E.g., a conference-author network
- Links can exist between
    - Conference (X) and author (Y)
    - Author (Y) and author (Y)
• Target type (X): the target of clustering
• Attribute type (Y): supporting information for clustering
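As a rough illustration of this data model, the following Python sketch stores the two link kinds (X-Y and Y-Y) separately. The class name `BiTypeNetwork`, its methods, the use of link weights, and the toy conference and author names are all assumptions made for this example; they do not come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class BiTypeNetwork:
    """Hypothetical container: target nodes X, attribute nodes Y."""
    w_xy: dict = field(default_factory=dict)  # (x, y) -> link weight
    w_yy: dict = field(default_factory=dict)  # (y1, y2) with y1 <= y2 -> link weight

    def add_xy(self, x, y, w=1.0):
        # Link between a target node (e.g. conference) and an attribute node (e.g. author).
        self.w_xy[(x, y)] = self.w_xy.get((x, y), 0.0) + w

    def add_yy(self, y1, y2, w=1.0):
        # Link between two attribute nodes; stored once under a sorted key.
        key = tuple(sorted((y1, y2)))
        self.w_yy[key] = self.w_yy.get(key, 0.0) + w

    @property
    def targets(self):
        return sorted({x for x, _ in self.w_xy})

    @property
    def attributes(self):
        ys = {y for _, y in self.w_xy}
        for a, b in self.w_yy:
            ys.update((a, b))
        return sorted(ys)

# Toy data (made up for illustration).
net = BiTypeNetwork()
net.add_xy("EDBT", "alice", 2)
net.add_xy("KDD", "bob", 1)
net.add_yy("alice", "bob", 1)
```

Keeping the X-Y and Y-Y links in separate maps mirrors the two link kinds listed on the slide; no X-X links are representable, which matches the model.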

Preliminary:
Algorithm Framework - Overview
Input
• Bi-type information network: 𝔾
• Cluster number: K
• Ranking function: 𝑓
Steps
Step 0: Initialization
Step 1: Ranking
Step 2: Clustering
Repeat Steps 1 and 2
Output
Clusters of target-type nodes, each with rank scores
[Figure: example with 𝐾 = 2]

Preliminary:
Step 0: Partition and Construct Subgraphs
1. Partition target-type nodes into K clusters
2. Construct subgraphs based on their clusters
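A minimal sketch of Step 0, assuming a random initial partition (the slides do not say how the initial partition is chosen) and assuming X-Y links are stored as a dictionary keyed by `(x, y)`. The function names and toy data are hypothetical.

```python
import random

def partition_targets(targets, k, seed=0):
    """Randomly split target nodes into k clusters (assumes len(targets) >= k)."""
    rng = random.Random(seed)
    nodes = sorted(targets)          # sort first so the result is reproducible
    rng.shuffle(nodes)
    clusters = [[] for _ in range(k)]
    for idx, node in enumerate(nodes):
        clusters[idx % k].append(node)   # round-robin keeps every cluster non-empty
    return clusters

def induced_subgraph(w_xy, cluster):
    """Keep only the X-Y links whose target endpoint lies in the cluster."""
    members = set(cluster)
    return {(x, y): w for (x, y), w in w_xy.items() if x in members}

# Toy X-Y links (made up for illustration).
w_xy = {("EDBT", "alice"): 2, ("KDD", "bob"): 1,
        ("VLDB", "alice"): 1, ("ICDM", "bob"): 3}
clusters = partition_targets({x for x, _ in w_xy}, k=2)
subgraphs = [induced_subgraph(w_xy, c) for c in clusters]
```

One subgraph is built per cluster, which is the source of the cost noted earlier: the ranking step then has to run once per subgraph.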

Preliminary:
Step 1: Ranking for Each Subgraph
Compute rank scores for each node type in each subgraph
using the ranking function
[Figure: ranking is computed for all nodes]

Preliminary:
Step 2: Clustering for Each Target Node
• Treats the rank scores of attribute-type nodes as cluster features
• Estimates the posterior probability that each target node
belongs to each cluster
• Re-assigns each target node to a new cluster
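The posterior estimate can be illustrated with a single application of Bayes' rule. This is a deliberate simplification and not the authors' estimation procedure: the cluster priors are assumed to be given, and the likelihood of a target node under a cluster is approximated by the link-weighted sum of attribute rank scores in that cluster. The function name and all numbers are hypothetical.

```python
def cluster_posteriors(links, attr_rank, prior):
    """Return p(k | x) for one target node x via Bayes' rule.

    links:     {attribute node y: link weight between x and y}
    attr_rank: list over clusters of {y: rank score of y in that cluster}
    prior:     list of prior cluster probabilities p(k)
    """
    # Likelihood proxy (an assumption): link-weighted sum of attribute rank scores.
    like = [sum(w * rank.get(y, 0.0) for y, w in links.items()) for rank in attr_rank]
    joint = [p * l for p, l in zip(prior, like)]
    total = sum(joint)
    if total == 0.0:                 # no evidence for x: fall back to the prior
        return list(prior)
    return [j / total for j in joint]

# Toy rank scores for two clusters (made up for illustration).
attr_rank = [{"alice": 0.7, "bob": 0.3}, {"alice": 0.1, "bob": 0.9}]
post = cluster_posteriors({"alice": 2, "bob": 1}, attr_rank, prior=[0.5, 0.5])
best = max(range(len(post)), key=post.__getitem__)   # index of the most probable cluster
```

Here the target node links more strongly to the attribute node that ranks high in the first cluster, so the first cluster receives the larger posterior.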

Preliminary:
Key Observations
For each subgraph 𝔾𝑖, every iteration:
1. Inserts new nodes and edges into 𝔾𝑖
2. Removes several nodes and edges from 𝔾𝑖
Therefore, each subgraph can be regarded as a dynamic graph
[Figure: a subgraph at iterations 𝒕, 𝒕+1, 𝒕+2]
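To make the dynamic-graph view concrete, the sketch below compares two consecutive snapshots of one subgraph and reports which edges were inserted or removed and which nodes are touched by the change. Representing a snapshot as a set of edge tuples is an assumption of this example, as are the function names and toy data.

```python
def subgraph_delta(prev_edges, curr_edges):
    """Return (inserted, removed) edge sets between two snapshots of a subgraph."""
    prev_edges, curr_edges = set(prev_edges), set(curr_edges)
    return curr_edges - prev_edges, prev_edges - curr_edges

def touched_nodes(inserted, removed):
    """Endpoints of every changed edge: the only nodes whose scores change directly."""
    return {node for edge in inserted | removed for node in edge}

# Snapshots at iterations t and t+1 (made up for illustration).
g_t = {("EDBT", "alice"), ("KDD", "bob")}
g_t1 = {("EDBT", "alice"), ("VLDB", "carol")}
inserted, removed = subgraph_delta(g_t, g_t1)
changed = touched_nodes(inserted, removed)
```

When few target nodes change cluster between iterations, `changed` stays small relative to the subgraph, which is what makes a local update attractive.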

Proposed Method:
Problem Setting
• Given: 𝔾0 at the initial time t = 0
Problem at time t = 0:
1. Initialization
2. Ranking (ranking function: Personalized PageRank)
3. Clustering
• Given at time 𝑡: nodes and edges inserted into or removed from each subgraph
Problem at time 𝑡:
1. Compute approximate rank scores
2. Clustering
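For reference, Personalized PageRank can be computed from scratch by power iteration on r = (1 - α) b + α P r. The sketch below is a generic baseline, not the authors' implementation; the adjacency-list layout, the function name, the default α = 0.85, and the toy triangle graph are assumptions.

```python
def ppr_power_iteration(out_edges, b, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate r <- (1 - alpha) * b + alpha * P r until the L1 change is below tol.

    out_edges[u] lists the nodes u links to; u passes r[u] / len(out_edges[u]) to each.
    b is the preference vector as a dict over the same nodes.
    """
    nodes = list(out_edges)
    r = dict(b)
    for _ in range(max_iter):
        nxt = {u: (1.0 - alpha) * b[u] for u in nodes}
        for u in nodes:
            if out_edges[u]:
                share = alpha * r[u] / len(out_edges[u])
                for v in out_edges[u]:
                    nxt[v] += share
        diff = sum(abs(nxt[u] - r[u]) for u in nodes)
        r = nxt
        if diff < tol:
            break
    return r

# Toy undirected triangle, personalized on node "a".
edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
r = ppr_power_iteration(edges, {"a": 1.0, "b": 0.0, "c": 0.0})
```

Every iteration touches every node and edge, so repeating this for each subgraph at each time step is the full recomputation that an approximate, incremental update avoids.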

Proposed Method:
Overview
Adopt a dynamic PPR (Personalized PageRank) computation
based on the Gauss-Southwell method [Ohsaka et al., KDD’15]
Main Idea
1. Each subgraph can be regarded as a dynamic graph
2. The previous rank scores are a good initial estimate of the new rank scores
3. We need to improve the approximate rank scores locally
[Figure labels: 𝑒𝑟𝑟𝑜𝑟 < 𝜖, 𝑒𝑟𝑟𝑜𝑟 ≥ 𝜖]
[14] Naoto Ohsaka, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015.
Efficient PageRank Tracking in Evolving Networks. In KDD ’15, 875–884.

Proposed Method:
Gauss-Southwell Method [Southwell ’40, ’46]
• $r_{PPR}^{(v)}$: the $v$-th rank score vector of 𝔾𝑖
• Corresponding residual $d^{(v)}$: $d^{(v)} = (1-\alpha)\,b - (I - \alpha P)\,r_{PPR}^{(v)}$
Goal: residual $d^{(v)} \to 0$
At a node $i$:
・Propagate $\alpha\, d_i / \mathrm{deg}$ to each neighbor
・Update r and d
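The propagate and update steps can be sketched as a small Gauss-Southwell loop: repeatedly pick the node with the largest residual, absorb that residual into its rank score, and push α·d_i/deg to its out-neighbors. This is a minimal sketch under assumptions: `deg` is taken to be the out-degree of node i, the graph is an adjacency-list dict, and the linear scan for the largest residual is used for brevity where a real implementation would likely use a queue or heap.

```python
def gauss_southwell(out_edges, r, d, alpha=0.85, eps=1e-9):
    """Refine r in place until every |residual| < eps; return the number of updates.

    out_edges[i]: nodes that i links to (column i of P holds 1/deg(i) at those rows).
    r, d: dicts of rank scores and residuals, both modified in place.
    """
    steps = 0
    while True:
        i = max(d, key=lambda u: abs(d[u]))   # node with the largest residual
        if abs(d[i]) < eps:
            return steps
        di, d[i] = d[i], 0.0
        r[i] += di                            # absorb the residual into the score
        nbrs = out_edges.get(i, [])
        for j in nbrs:                        # propagate alpha * d_i / deg
            d[j] += alpha * di / len(nbrs)
        steps += 1

# Toy undirected triangle, personalized on node "a": start from r = 0, d = (1 - alpha) * b.
alpha = 0.85
edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
r = {u: 0.0 for u in edges}
d = {"a": 1 - alpha, "b": 0.0, "c": 0.0}
steps = gauss_southwell(edges, r, d, alpha)
```

Each update removes (1 - α)·|d_i| from the total residual, so the loop terminates, and it only ever touches the selected node and its neighbors.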

Proposed Method:
Dynamic Rank Score Tracking
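• Iteration number: $t$
At time $t$ the clusters are updated (e.g., a node is added), so start from the previous iteration:
$r_{PPR}^{(t)}(v=0) = r_{PPR}^{(t-1)}$
$d^{(t)}(v=0) = d^{(t-1)} + \alpha\,(P^{(t)} - P^{(t-1)})\,r_{PPR}^{(t-1)}$
Then compute the approximate rank scores with the Gauss-Southwell algorithm
[Figure: subgraph with an added node]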
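The initial residual at time t follows directly from the residual definition. A short derivation, assuming (for this sketch) that the preference vector b and α stay fixed between iterations and only the transition matrix changes from P^(t-1) to P^(t):

```latex
\begin{align*}
d^{(t-1)} &= (1-\alpha)\,b - \bigl(I - \alpha P^{(t-1)}\bigr)\, r_{PPR}^{(t-1)} \\
d^{(t)}(v=0) &= (1-\alpha)\,b - \bigl(I - \alpha P^{(t)}\bigr)\, r_{PPR}^{(t-1)} \\
&= d^{(t-1)} + \alpha\,\bigl(P^{(t)} - P^{(t-1)}\bigr)\, r_{PPR}^{(t-1)}
\end{align*}
```

The correction term is nonzero only in rows that receive mass from a changed column of P, which is why only the evolving nodes and their neighbors need to be revisited.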

Proposed Method:
$v = 0$: Compute the initial solution
$r_{PPR}^{(t)}(v=0) = r_{PPR}^{(t-1)}$
$d^{(t)}(v=0) = d^{(t-1)} + \alpha\,(P^{(t)} - P^{(t-1)})\,r_{PPR}^{(t-1)}$
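The initial solution and the subsequent local refinement can be sketched end to end. This is an illustrative sketch, not the authors' code: it assumes the preference vector b is unchanged, that every node appears as a key in the adjacency dict, and that new nodes start with rank score 0. The names `column` and `warm_start` and the toy graphs are hypothetical.

```python
def column(out_edges, u):
    """Column u of P as {row: prob}: u spreads its score evenly over its out-links."""
    nbrs = out_edges.get(u, [])
    return {v: 1.0 / len(nbrs) for v in nbrs}

def warm_start(old_edges, new_edges, r_prev, d_prev, alpha=0.85):
    """v = 0: r = r_prev and d = d_prev + alpha * (P_new - P_old) r_prev."""
    nodes = set(old_edges) | set(new_edges)
    r = {u: r_prev.get(u, 0.0) for u in nodes}
    d = {u: d_prev.get(u, 0.0) for u in nodes}
    for u in nodes:
        old_col, new_col = column(old_edges, u), column(new_edges, u)
        if old_col == new_col:
            continue                          # unchanged columns contribute nothing
        for v, p in new_col.items():
            d[v] += alpha * p * r[u]
        for v, p in old_col.items():
            d[v] -= alpha * p * r[u]
    return r, d

def gauss_southwell(out_edges, r, d, alpha=0.85, eps=1e-9):
    """Push the largest residual until all |residuals| < eps; return the update count."""
    steps = 0
    while True:
        i = max(d, key=lambda u: abs(d[u]))
        if abs(d[i]) < eps:
            return steps
        di, d[i] = d[i], 0.0
        r[i] += di
        nbrs = out_edges.get(i, [])
        for j in nbrs:
            d[j] += alpha * di / len(nbrs)
        steps += 1

alpha = 0.85
old = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
new = {"a": ["b", "c", "e"], "b": ["a", "c"], "c": ["a", "b"], "e": ["a"]}

# Time t-1: solve from scratch (r = 0, d = (1 - alpha) * b, personalized on "a").
r_old = {u: 0.0 for u in old}
d_old = {"a": 1 - alpha, "b": 0.0, "c": 0.0}
gauss_southwell(old, r_old, d_old, alpha)

# Time t: node "e" is added; warm-start from the previous solution, then refine.
r_new, d_new = warm_start(old, new, r_old, d_old, alpha)
warm_steps = gauss_southwell(new, r_new, d_new, alpha)

# Reference: solve the new graph from scratch.
r_ref = {u: 0.0 for u in new}
d_ref = {"a": 1 - alpha, "b": 0.0, "c": 0.0, "e": 0.0}
scratch_steps = gauss_southwell(new, r_ref, d_ref, alpha)
```

Comparing `warm_steps` with `scratch_steps` on a given graph gives a rough feel for how much work the warm start saves; the sketch makes no claim about the size of that saving.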

Proposed Method:
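Step $v = 1$ (node $i$)
・Propagate
・Update r and d
Initial solution for cluster $k$ (here $x_k$ and $r_k$ play the roles of the rank score vector and the residual vector, respectively):
$x_k(t)^{(0)} = x_k(t-1)$
$r_k(t)^{(0)} = r_k(t-1) + \alpha\,(P_k(t) - P_k(t-1))\,x_k(t-1)$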

Proposed Method:
Step $v = 2$ (node $i$)
$x_k(t)^{(0)} = x_k(t-1)$
$r_k(t)^{(0)} = r_k(t-1) + \alpha\,(P_k(t) - P_k(t-1))\,x_k(t-1)$

Evaluation Experiments:
Setup
• Datasets
Dataset name   |X|   |Y|       # of edges
DBLP           20    5,693     95,516
Yahoo-msg      57    100,001   6,359,436
• Algorithms
- Proposed method: reduces the cost by updating locally
- RankClus: the original algorithm
- Pruning: reduces the cost by pruning
[Yamazaki et al., iiWAS’18] [Yamazaki et al., JDI’19]
• Evaluation Criterion
- NMI (Normalized Mutual Information)
NMI takes a value between 0 (no mutual information) and 1
(perfect correlation)
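For readers unfamiliar with NMI, a small self-contained computation is sketched below. Several normalizations of mutual information are in use; this sketch divides by the geometric mean of the two entropies, which is an assumption, since the slides do not say which variant was used in the experiments.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_a, labels_b):
    """Normalized mutual information, normalized by sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label counts.
    mi = sum(c / n * log(c * n / (ca[a] * cb[b])) for (a, b), c in joint.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:        # a labeling with a single cluster
        return 1.0 if ha == hb else 0.0
    return mi / sqrt(ha * hb)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, labels swapped
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

NMI compares partitions rather than label names, so relabeling the clusters does not change the score.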

Running Time Analysis
Running times of each algorithm
[Figure: two panels, ($K = 4$, $\epsilon = 10^{-9}$) and ($K = 10$, $\epsilon = 10^{-9}$)]

Parameter Analysis (1/2)
Running times by varying 𝜖
[Figure: panels for Yahoo-msg and DBLP]

Parameter Analysis (2/2)
Running times by varying the number of clusters 𝐾
[Figure: Yahoo-msg]

Clustering Accuracy Analysis
NMI score of the proposed method by varying 𝜖,
compared with the original RankClus result

Conclusion
We propose an efficient RankClus algorithm
for large-scale bi-type information networks
• Main Approach
- Focus on the dynamic graph property of RankClus and
compute only the evolving nodes and their neighbors
• Evaluation Result
- Our proposed method obtains clusters almost twice as
fast as the competing methods while maintaining
clustering accuracy on two real-world datasets
