Iiwas19 yamazaki slide
1. Fast RankClus Algorithm via Dynamic Rank Score Tracking
on Bi-type Information Networks
iiWAS 2019
Kotaro Yamazaki✝, Shohei Matsugu✝
Hiroaki Shiokawa✝✝, Hiroyuki Kitagawa✝✝
✝: Graduate School of SIE, University of Tsukuba
✝✝: Center for Computational Sciences, University of Tsukuba
3. Background:
Ubiquitous Information Networks
• Information network:
Each node represents an entity and each link (edge) a relationship between
entities
• Homogeneous vs. Heterogeneous networks
- Homogeneous Network
Single type Object
E.g., Co-author network, Web pages, Friendship network
Most current studies focus on homogeneous networks
- Heterogeneous Network
Objects belong to several types
E.g., Conference-author network, Medical networks
Most real systems can be modeled as heterogeneous networks
4. Background:
RankClus
[Sun et al., EDBT’09]
Ranking-based Clustering algorithm for Heterogeneous Network
The Methodology
• Ranking as the features of clusters
• Clustering so that each node has the highest rank score.
• Repeat and improve the quality of clustering and ranking mutually
[Figure: Heterogeneous network → RankClus framework → clusters with rank scores]
5. Background:
Bottleneck of RankClus
The ranking process consumes most of the computational cost
Why?
• Generates as many subgraphs as the number of clusters
• Iteratively updates rank scores for all nodes
[Figure: RankClus loop: Initialization, then Ranking and Clustering repeated]
6. Background:
Related Work
• Pruning-RankClus [Yamazaki et al., iiWAS’18]
A fast RankClus algorithm that prunes nodes
- Approach
Identify nodes that do not significantly affect the clustering
result and prune them
- Bottleneck
Difficult to prune while maintaining accuracy
Requires many user-specified parameters
7. Background:
Our Goal
Reduce the computational cost of RankClus
Approach: Local update
Focus on the dynamic graph property of RankClus and
efficiently compute only the evolving nodes and their neighbors
8. Background:
Our Contributions
1. Efficient
Our proposed method outperforms RankClus and the
state-of-the-art algorithm (Pruning-RankClus)
2. Highly Accurate
Although our proposed algorithm does not compute the
entire graph, its clustering results are more accurate than
those of the state-of-the-art algorithm
3. Easy to deploy
Our proposed method requires fewer user-specified
parameters than the state-of-the-art algorithm
10. Preliminary:
Data Model: Bi-type Information Network
• A graph consisting of two kinds of nodes, X and Y
E.g., Conference-author network
- Links can exist between
Conference (X) and author (Y)
Author (Y) and author (Y)
• X: Target type
Target of clustering
• Y: Attribute type
Supporting information for clustering
11. Preliminary:
Algorithm Framework - Overview
Input
• Bi-type information network: 𝔾
• Cluster number: K (e.g., K = 2)
• Ranking function: 𝑓
Output
• Clusters of target-type nodes with their rank scores
Steps
Step 0: Initialization
Step 1: Ranking
Step 2: Clustering
Repeat Steps 1 and 2
12. Preliminary:
Step 0: Partition and Construct Subgraphs
1. Partition target-type nodes into K clusters
2. Construct subgraphs based on their clusters
13. Preliminary:
Step 1: Ranking for Each Subgraph
• Compute rank scores for each type
by a ranking function
• Ranking is performed for all nodes
14. Preliminary:
Step 2: Clustering for Each Target Node
• Considers the rank scores of the attribute
type as cluster features
• Estimates the posterior probability that
each target node belongs to a cluster
• Re-assigns each target node to a new cluster
15. Preliminary:
Key Observations
In each iteration, the subgraph 𝔾𝑖 changes:
1. New nodes and edges are inserted into 𝔾𝑖
2. Several nodes and edges are removed from 𝔾𝑖
Each subgraph can be regarded as a dynamic graph
[Figure: a subgraph at times 𝒕, 𝒕+1, and 𝒕+2]
17. Proposed Method:
Problem Setting
• Given: 𝔾0 at initial time t = 0
Problem at time t = 0:
1. Initialization
2. Ranking (ranking function: Personalized PageRank)
3. Clustering
Given at time 𝑡:
Nodes and edges are inserted into or removed from each subgraph
Problem at time 𝑡:
1. Compute approximate rank scores
2. Clustering
18. Proposed Method:
Overview
Adopt a dynamic PPR computation
based on the Gauss-Southwell method [Ohsaka et al., KDD’15]
Main Idea
1. Each subgraph can be regarded as a dynamic graph
2. The previous rank score is a GOOD initial rank score
3. We need to improve the approximate rank score locally
[Figure: nodes labeled 𝑒𝑟𝑟𝑜𝑟 < 𝜖 or 𝑒𝑟𝑟𝑜𝑟 ≥ 𝜖]
[14] Naoto Ohsaka, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015.
Efficient PageRank Tracking in Evolving Networks (KDD ’15). 875–884.
25. Evaluation Experiments:
Setup
• Datasets
• Algorithms
- Proposed method: reduces the cost by updating locally
- RankClus: the original algorithm
- Pruning: reduces the cost by pruning
[Yamazaki et al., iiWAS’18] [Yamazaki et al., JDI’19]
• Evaluation Criterion
- NMI (Normalized Mutual Information)
NMI takes a value between 0 (no mutual information) and 1
(perfect correlation)
Datasets:
Dataset name   |X|   |Y|       # of edges
DBLP           20    5,693     95,516
Yahoo-msg      57    100,001   6,359,436
31. Conclusion
We proposed an efficient RankClus algorithm
for large-scale bi-type information networks
• Main Approach
- Focus on the dynamic graph property of RankClus and
compute only the evolving nodes and their neighbors
• Evaluation Result
- Confirmed that our proposed method obtains clusters almost
twice as fast as the competing methods while maintaining
clustering accuracy on two real-world datasets
Editor's Notes
Hello, everyone. I’m Kotaro Yamazaki from the University of Tsukuba in Japan.
I’d like to talk about our paper, “Fast RankClus Algorithm via Dynamic Rank Score Tracking on Bi-type Information Networks”.
First, I will introduce the background.
Information networks, which consist of nodes and edges, are ubiquitous.
There are two types: homogeneous networks and heterogeneous networks.
A homogeneous network consists of a single type of object and link, such as a collaboration network or web pages.
Many current studies focus on homogeneous networks.
In contrast, a heterogeneous network contains different types of objects, such as a conference-author network or a medical network.
Most real systems contain multi-typed interacting components, and we can model them as heterogeneous information networks.
So, it is important to analyze heterogeneous networks.
RankClus is a novel graph mining framework for heterogeneous networks.
It performs clustering based on ranking results so that each node has the highest rank score.
Compared with general graph clustering algorithms, it focuses on the importance of the nodes in each cluster.
However, in terms of efficiency, the bottleneck of RankClus is the computational cost of the ranking procedure.
The reason is that RankClus needs to generate as many subgraphs as the number of clusters, and it needs to iteratively perform the ranking procedure for all nodes in all subgraphs.
Pruning-RankClus is an efficient RankClus algorithm.
Its approach is to identify nodes that do not significantly affect the clustering result and prune them.
However, Pruning-RankClus has two bottlenecks.
First, it is difficult to prune while maintaining accuracy.
Second, it requires many user-specified parameters.
To overcome the performance limitation of RankClus, we present an efficient algorithm for speeding up RankClus on large-scale heterogeneous networks.
We focus on the dynamic graph property of RankClus and efficiently compute only the evolving nodes and their neighbors.
Our contributions are as follows.
First, it is efficient.
Our proposed method outperforms RankClus and the state-of-the-art algorithm.
Second, it is highly accurate.
Although our proposed algorithm does not compute the entire graph, its clustering results are more accurate than those of the state-of-the-art algorithm.
Third, it is easy to deploy.
Our proposed method requires fewer user-specified parameters than the state-of-the-art algorithm.
Next, I’d like to talk about the preliminaries.
First, I will briefly introduce the data model of RankClus.
RankClus takes a bi-type information network as input.
It is one kind of heterogeneous network.
A bi-type information network has two kinds of nodes, X and Y.
X are defined as target-type nodes and Y are defined as attribute-type nodes.
RankClus performs clustering only for the target-type nodes, and the attribute-type nodes are used as supporting information for the clustering.
This is an overview of the RankClus algorithm framework.
RankClus is an iterative method.
As input, it takes a bi-type information network G, a cluster number K, and a ranking function f.
As output, it returns clusters with the rank scores of each type of node in each subgraph.
Next, I will explain the details of each step.
In the initialization step, RankClus partitions the target-type nodes into K clusters.
Then, K subgraphs are constructed, one for each cluster.
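A minimal sketch of this initialization step is shown here. It assumes a random initial partition and an edge list of (target, attribute) pairs; the function names `partition_targets` and `build_subgraphs` are hypothetical, and links between two attribute-type nodes are left out for brevity.

```python
import random


def partition_targets(targets, k, seed=0):
    """Assign each target-type node to one of k clusters at random (assumed strategy)."""
    rng = random.Random(seed)
    return {x: rng.randrange(k) for x in targets}


def build_subgraphs(edges, assignment, k):
    """Build one edge list per cluster.

    edges: iterable of (x, y) pairs linking a target node x to an attribute node y.
    assignment: dict mapping each target node to its cluster index.
    """
    subgraphs = [[] for _ in range(k)]
    for x, y in edges:
        # An edge goes to the subgraph of the cluster that owns its target node.
        subgraphs[assignment[x]].append((x, y))
    return subgraphs


edges = [("conf1", "alice"), ("conf1", "bob"), ("conf2", "bob"), ("conf3", "carol")]
assignment = {"conf1": 0, "conf2": 0, "conf3": 1}
subgraphs = build_subgraphs(edges, assignment, k=2)
```

In this toy conference-author example, the first subgraph holds the three edges of `conf1` and `conf2`, and the second holds the single edge of `conf3`.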
In the ranking step, RankClus computes rank scores for each type using a ranking function.
Note that obtaining the rank scores is computationally expensive because they need to be computed for all nodes in each subgraph.
The next step is the clustering step.
First, RankClus considers the rank scores of the attribute-type nodes as cluster features.
Finally, each target-type node is re-assigned to the nearest cluster.
Our key observation about RankClus is that each subgraph can be regarded as a dynamic graph.
When we focus on one subgraph, it behaves like a dynamic graph from one iteration to the next.
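To make this observation concrete, here is a small sketch that compares the cluster assignments of two consecutive iterations and reports which target nodes enter or leave a given subgraph. The function name `subgraph_delta` and the dict-based assignment format are assumptions made for illustration.

```python
def subgraph_delta(prev_assignment, new_assignment, cluster):
    """Return (inserted, removed) target nodes for one cluster's subgraph."""
    before = {x for x, c in prev_assignment.items() if c == cluster}
    after = {x for x, c in new_assignment.items() if c == cluster}
    return after - before, before - after


prev_assignment = {"conf1": 0, "conf2": 0, "conf3": 1}
new_assignment = {"conf1": 0, "conf2": 1, "conf3": 0}
inserted, removed = subgraph_delta(prev_assignment, new_assignment, cluster=0)
```

Here `conf3` is inserted into the subgraph of cluster 0 and `conf2` is removed from it, while `conf1` stays put, so only part of the subgraph changes.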
In the next part, I’d like to talk about our proposed method.
Let’s begin with our problem setting.
At the beginning, we are given initial subgraphs G(0), one for each of the K clusters.
For the problem at time 0,
we perform ranking and clustering in the same way as the original RankClus algorithm, with Personalized PageRank as the ranking function.
Then, for each time t,
because the clusters are updated, nodes and edges are inserted into or removed from each subgraph.
At time t, we exploit the dynamic graph property to keep the computational cost down
and compute approximate rank scores.
This is an overview of our proposed method for reducing the computational cost of RankClus.
In this approach, we adopt a dynamic PPR computation based on the Gauss-Southwell method.
The main idea is very simple.
1. Each subgraph can be regarded as a dynamic graph.
2. The previous rank score is considered a good initial rank score for obtaining the current rank score.
3. We need to improve the approximate solution locally.
This is because, if the change to the graph is small, we need to update only a few nodes, not all of them.
I will first introduce the Gauss-Southwell method.
The Gauss-Southwell method is an iterative method for solving Personalized PageRank.
This method maintains two vectors.
The first is the approximate rank score r, and
the second is the residual d corresponding to the rank score.
The goal of this method is to make d nearly 0.
At each iteration, the method picks the node that has the largest residual.
If the residual is greater than epsilon, it updates the approximate rank score r and the residual d locally.
An example of the updating process is shown in the figure.
For example, if it picks node “I”,
the residual propagates from I to its out-neighbors.
It repeats this process until the largest residual is less than epsilon.
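The push procedure just described can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulation of the cited work: `alpha` is taken to be the probability of following a link, the graph is a dict of out-neighbor lists, the residual mass of a node without out-neighbors is simply dropped, and the largest residual is found by a linear scan (a priority queue would be the usual choice in practice). The function name `gauss_southwell_ppr` is hypothetical.

```python
def gauss_southwell_ppr(out_neighbors, preference, alpha=0.85, eps=1e-6):
    """Approximate Personalized PageRank by repeatedly pushing the largest residual.

    out_neighbors: dict mapping each node to a list of its out-neighbors.
    preference: dict mapping nodes to preference weights (expected to sum to 1).
    Returns (score, residual), the vectors called r and d in the talk.
    """
    score = {}
    # With every score at zero, the residual is the scaled preference vector.
    residual = {v: (1 - alpha) * w for v, w in preference.items()}
    while residual:
        node = max(residual, key=lambda v: abs(residual[v]))
        mass = residual[node]
        if abs(mass) < eps:
            break  # the largest residual is below epsilon, so we stop
        # Move the residual of the picked node into its score ...
        score[node] = score.get(node, 0.0) + mass
        residual[node] = 0.0
        # ... and propagate a damped share of it to each out-neighbor.
        targets = out_neighbors.get(node, [])
        for nbr in targets:
            residual[nbr] = residual.get(nbr, 0.0) + alpha * mass / len(targets)
    return score, residual


graph = {"a": ["b"], "b": ["a"]}
score, residual = gauss_southwell_ppr(graph, {"a": 1.0}, alpha=0.5, eps=1e-9)
```

On this two-node cycle with `alpha=0.5`, the scores approach 2/3 for `a` and 1/3 for `b`, which is the exact solution of the corresponding linear system.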
Now I will describe our proposed method.
This is an example of our proposed method.
We take one of the subgraphs, to which one node is added at time t.
First, based on our idea, we use the previous rank scores as the initial rank scores at time t.
The red nodes are those whose residuals are larger than epsilon.
Then, we apply the Gauss-Southwell method.
It picks the node that has the largest residual.
The residual is propagated to the out-neighbors, and r and d are updated at the same time.
The computation is repeated until the largest residual is less than epsilon.
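The warm start can be illustrated with a short sketch that keeps the previous scores and evaluates the residual on the changed graph. It assumes the same conventions as a standard Personalized PageRank linear system (`alpha` as the probability of following a link) and recomputes the residual for every node for clarity, whereas a tracking method would touch only the nodes near the change. The name `warm_start_residual` is hypothetical.

```python
def warm_start_residual(out_neighbors, score, preference, alpha=0.85):
    """Residual of the previous scores with respect to the updated graph.

    residual[v] = (1 - alpha) * preference[v] - score[v]
                  + alpha * sum(score[u] / outdeg(u) for each edge u -> v)
    """
    nodes = set(out_neighbors) | set(score) | set(preference)
    for targets in out_neighbors.values():
        nodes.update(targets)
    residual = {
        v: (1 - alpha) * preference.get(v, 0.0) - score.get(v, 0.0) for v in nodes
    }
    for u, targets in out_neighbors.items():
        for v in targets:
            residual[v] += alpha * score.get(u, 0.0) / len(targets)
    return residual


# Scores that are exact for the old graph a <-> b with alpha = 0.5.
old_score = {"a": 2 / 3, "b": 1 / 3}
# At time t, a new node c is added with an edge b -> c.
new_graph = {"a": ["b"], "b": ["a", "c"], "c": []}
residual = warm_start_residual(new_graph, old_score, {"a": 1.0}, alpha=0.5)
```

In this toy case the residual of `b` stays at zero, and only `a` and the new node `c` carry a nonzero residual (magnitude 1/12 each), so the subsequent pushes start from just those nodes.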
That is the flow of our proposed method.
In the next part, I’d like to talk about the evaluation experiments.
We evaluated our proposed method on two real datasets, “DBLP” and “Yahoo-msg”.
The performance of our proposed method was compared with the original RankClus algorithm and the state-of-the-art method that reduces the cost by pruning.
To evaluate the accuracy of the clustering results, we employed NMI, which is an information-theoretic measure.
NMI measures clustering accuracy by comparing two clustering results. It takes a value between 0 and 1.
It returns 1, if the two clusters are completely s
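For reference, NMI between two label assignments can be computed as in the sketch below. The normalization by the geometric mean of the two entropies is an assumption, since several normalizations are in common use and the talk does not say which one was applied; the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a == 0 or entropy_b == 0:
        # Degenerate case: at least one clustering puts everything in one cluster.
        return 1.0 if entropy_a == entropy_b else 0.0
    return mutual / math.sqrt(entropy_a * entropy_b)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two clusterings that differ only in how the clusters are named and yields a value of about 1, while the second compares two independent clusterings and yields 0.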
These are the running times of each algorithm.
As we can see from the result, our proposed algorithm outperforms the other algorithms.
Specifically, our proposed algorithm is up to twice as fast as the other algorithms.
This suggests that our dynamic rank score tracking method can cut the computational cost.
Next, we assessed the effect of the user-specified parameter ε of our proposed method.
We compared running times while varying ε.
These figures show the running times on DBLP and Yahoo-msg, respectively.
We varied ε from 10^-9 to 10^-2.
As we can see from these figures,
our proposed method gradually reduces the running time as the value of ε increases.
This is because our dynamic rank score tracking method only needs to update the residuals and scores until the largest residual falls below ε.
We also assessed the impact of the number of clusters K on the running time.
This figure shows the running times when K was varied on the Yahoo-msg dataset.
As we can see from this result, the speed-up ratio of our proposed algorithm increases as we increase K.
That’s because each subgraph does not change drastically if we set a large K value.
In particular, if a subgraph has no updates, our proposed algorithm can skip the ranking process for that subgraph, while the other algorithms still need to perform it.
Thus, we can further improve the efficiency for larger K settings.
Finally, we assessed the accuracy of the clustering results produced by the proposed algorithm.
In this evaluation, we measured the NMI scores between the clusters extracted by our proposed method and those extracted by RankClus.
Here we varied ε from 10^-9 to 10^-2.
The results show that our proposed method achieves high NMI scores for all ε settings,
even though it drastically reduces the running time compared to RankClus.
Finally, let me conclude my talk.
In this study, we proposed an efficient RankClus algorithm for large-scale bi-type information networks.
The main approach is to reduce the cost by focusing on the dynamic graph property and computing only the evolving nodes and their neighbors.
The evaluation experiments showed that our proposed method obtains clusters almost twice as fast as the competing methods while maintaining clustering accuracy.