Ghost

  1. 1. GHOST: An Effective Graph-based Framework for Name Distinction. Authors: Xiaoming Fan, Jianyong Wang, Bing Lv, Lizhu Zhou, Wei Hu. Publication: CIKM ’08. Presenter: Jhih-Ming Chen
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>The GHOST framework </li></ul><ul><ul><li>Graphical View of the Database </li></ul></ul><ul><ul><li>Selection of Valid Paths </li></ul></ul><ul><ul><li>Computing Similarity </li></ul></ul><ul><ul><li>Clustering Strategy </li></ul></ul><ul><ul><li>User Feedback </li></ul></ul><ul><li>Experimental Results </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction <ul><li>This paper focuses on a problem in digital libraries: distinguishing publications written by different authors with identical names. </li></ul><ul><ul><li>GHOST (abbr. GrapH-based framewOrk for name diStincTion) </li></ul></ul><ul><li>Example: </li></ul><ul><ul><li>Query term: “Lei Wang” </li></ul></ul><ul><ul><li>About 151 publications are retrieved in DBLP </li></ul></ul><ul><ul><li>There are no fewer than 39 different persons with this name </li></ul></ul>
  4. 4. Introduction <ul><li>Objective of name distinction: </li></ul><ul><ul><li>Group publications written by authors with an identical name into clusters so that all publications in each cluster belong to the same author. </li></ul></ul>
  5. 5.
  6. 6. Graphical View of the Database <ul><li>The database of publications D is represented by a graph G = { V, E } </li></ul><ul><ul><li>Each node v ∈ V represents an author. </li></ul></ul><ul><ul><ul><li>Note that authors with an ambiguous name are treated as different nodes. </li></ul></ul></ul><ul><ul><li>An undirected edge represents a coauthorship. </li></ul></ul><ul><ul><li>The edge between node a and node b has a label S(a, b), which denotes the complete set of publications coauthored by both a and b. </li></ul></ul>
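The graph described on this slide can be sketched as a mapping from node pairs to publication sets. This is only an illustration, not the authors' code: the function name `build_coauthor_graph`, the `name#pub_id` node naming, and the sample data are assumptions made here to show how each occurrence of the ambiguous name becomes its own node.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(publications, ambiguous_name):
    """Return {(a, b): S(a, b)} for every pair of coauthor nodes.

    publications: dict mapping a publication id to its list of author names.
    Each occurrence of ambiguous_name becomes a separate node "name#pub_id"
    (a hypothetical naming scheme chosen for this sketch).
    """
    edges = defaultdict(set)
    for pub_id, authors in publications.items():
        nodes = [f"{name}#{pub_id}" if name == ambiguous_name else name
                 for name in authors]
        # Undirected edge: store each pair once, in sorted order.
        for a, b in combinations(sorted(set(nodes)), 2):
            edges[(a, b)].add(pub_id)  # label S(a, b) collects shared papers
    return dict(edges)


# Made-up sample data for illustration only.
publications = {"p1": ["Lei Wang", "Alice"], "p2": ["Lei Wang", "Alice", "Bob"]}
graph = build_coauthor_graph(publications, "Lei Wang")
```

Here the two "Lei Wang" occurrences stay separate nodes, which is what lets a later clustering step decide whether they are the same person.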
  7. 7. Selection of Valid Paths <ul><li>What GHOST needs to do is discover each triangle-like basic unit and check whether its longer sub-path is invalid. </li></ul><ul><ul><li>If an invalid sub-path is found, the longer path is eliminated during the search. </li></ul></ul><ul><li>An invalid sub-path emerges if and only if two of the three publication sets each consist of only one, identical publication. </li></ul>
  8. 8. Computing Similarity <ul><li>All valid paths are used to compute the similarity of two nodes, based on the following heuristics: </li></ul><ul><ul><li>The shortest valid path is the most indicative one. </li></ul></ul><ul><ul><li>The more paths there are between two nodes, the more “similar” the two nodes may be. </li></ul></ul><ul><ul><li>The notion of “six degrees of separation”. </li></ul></ul><ul><li>Suppose there are m(i, j) valid paths linking two nodes a_i and a_j, and the length of the n-th path is l_n </li></ul>
  9. 9. Clustering Strategy <ul><li>This paper adopts a powerful new clustering algorithm called Affinity Propagation (AP for short) for GHOST. </li></ul>
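One possible reading of the triangle check, as a minimal sketch: for a triangle a, b, c, the two-hop path a-b-c is treated as invalid when both of its edges are labelled with the same single publication, since all three nodes then co-wrote that one paper and the direct edge already carries the information. Both this interpretation and the function name `is_invalid_subpath` are assumptions, not taken from the paper.

```python
def is_invalid_subpath(s_ab, s_bc):
    """Return True if the two-hop path a-b-c is invalid under the assumed rule.

    s_ab and s_bc are the publication sets labelling edges (a, b) and (b, c).
    Assumed rule: both sets contain exactly one publication, and it is the
    same publication on both edges.
    """
    return len(s_ab) == 1 and s_ab == s_bc


same_single_paper = is_invalid_subpath({"p1"}, {"p1"})
different_papers = is_invalid_subpath({"p1"}, {"p2"})
several_shared_papers = is_invalid_subpath({"p1", "p2"}, {"p1", "p2"})
```

Only the first of the three calls flags the path as invalid; the other two paths connect the end nodes through independent evidence and are kept.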
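The quantities m(i, j) and l_n can be collected with a depth-limited search over simple paths. The sketch below does only that. It does not reproduce the slide's similarity formula, and it omits the valid-path filter. The function name `path_lengths` and the default `max_depth=6` (a nod to the "six degrees" heuristic) are assumptions.

```python
def path_lengths(adjacency, source, target, max_depth=6):
    """Return the sorted lengths l_n of all simple paths from source to target.

    adjacency: dict mapping each node to an iterable of its neighbours.
    Paths longer than max_depth edges are not explored.
    """
    lengths = []

    def dfs(node, visited, depth):
        if node == target:
            lengths.append(depth)  # one more path found, of length `depth`
            return
        if depth == max_depth:
            return
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                dfs(neighbour, visited | {neighbour}, depth + 1)

    dfs(source, {source}, 0)
    return sorted(lengths)


# A triangle a-b-c: one direct path and one two-hop path between a and c.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
lengths = path_lengths(triangle, "a", "c")
m = len(lengths)  # number of paths, m(i, j) in the slide's notation
```

A similarity score would then be some function of `lengths` that rewards short paths and many paths, in line with the heuristics listed on the slide.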
  10. 10. Affinity Propagation <ul><li>Take each data point as a node in the network. </li></ul><ul><li>Consider all data points as potential cluster centers. </li></ul><ul><li>Start the clustering from the similarities between pairs of data points. </li></ul><ul><li>Exchange messages between data points until good cluster centers (exemplars) are found. </li></ul><ul><ul><li>Responsibility </li></ul></ul><ul><ul><li>Availability </li></ul></ul>
  11. 11. How Affinity Propagation works [flowchart: start → Construct Similarity Matrix → Update r(i,k) → Update a(i,k) → Decide exemplar → Change on Decision? (Y/N) → end]
  12. 12. Responsibility r(i, k) <ul><li>Responsibility: data point -> candidate exemplar </li></ul><ul><ul><li>How well suited the candidate exemplar k is for data point i, compared with all other candidate exemplars. </li></ul></ul><ul><ul><li>The preference s(k, k), which drives the self-responsibility r(k, k): prior likelihood for k to be chosen as exemplar. </li></ul></ul><ul><ul><ul><li>Defined by user, determines number of clusters. </li></ul></ul></ul><ul><ul><ul><li>Good choice: </li></ul></ul></ul>
  13. 13. Availability a(i, k) <ul><li>Availability: candidate exemplar -> data point </li></ul><ul><ul><li>How appropriate the candidate k is as exemplar for data point i, taking support from other data points into account. </li></ul></ul>
  14. 14. Affinity Propagation Update Rules <ul><li>r(i, j) = 0; a(i, j) = 0; ∀ i, j </li></ul><ul><li>for i := 1 to num_iterations </li></ul><ul><li>end; </li></ul><ul><li>for all x_k with ( r(k, k) + a(k, k) > 0 ) </li></ul><ul><ul><li>x_k is exemplar </li></ul></ul><ul><ul><li>Assign non-exemplars x_i to closest exemplar under similarity measure s(i, k) </li></ul></ul><ul><li>end; </li></ul>
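The loop on this slide can be sketched in plain Python using the message updates published for Affinity Propagation by Frey and Dueck. The slide itself does not show the loop body, so treat the following as an illustration under that assumption. The damping factor, the function name `affinity_propagation`, the toy points, and the median preference are also choices made here, not taken from the slides.

```python
from statistics import median


def affinity_propagation(s, num_iterations=200, damping=0.5):
    """Cluster n points from an n x n similarity matrix s (list of lists).

    s[k][k] holds the preference of point k. Returns (exemplars, labels),
    where labels[i] is the index of the exemplar chosen for point i.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(num_iterations):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            for k in range(n):
                rival = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) <- sum over i' != k of max(0, r(i',k))
        for k in range(n):
            support = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(support) - support[k]  # positive support from i' != k
            for i in range(n):
                if i == k:
                    new_a = total
                else:
                    new_a = min(0.0, r[k][k] + total - support[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new_a
    # Decision rule from the slide: x_k is an exemplar if r(k,k) + a(k,k) > 0.
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    # Non-exemplars go to the most similar exemplar under s(i, k).
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy data (made up): two well-separated groups of points on a line.
points = [0, 1, 2, 10, 11, 12]
sim = [[-float((p - q) ** 2) for q in points] for p in points]
off_diagonal = [sim[i][j] for i in range(6) for j in range(6) if i != j]
for k in range(6):
    sim[k][k] = median(off_diagonal)  # shared preference for every point
exemplars, labels = affinity_propagation(sim)
```

Lowering the preference on the diagonal generally yields fewer exemplars and raising it yields more, which is the knob the User Feedback slide refers to when it mentions adjusting "preferences".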
  15. 15.
  16. 16. How Affinity Propagation works
  17. 17. User Feedback <ul><li>When resolving a name, if at least one direct coauthor is shared by two distinct authors with that name, any author with the name is referred to as a “dense author”. </li></ul><ul><ul><li>For example, two different authors named “Wei Wang” have both coauthored with “Jiawei Han”. </li></ul></ul><ul><li>To deal with the low performance caused by dense authors, user feedback is adopted to improve the results. </li></ul><ul><ul><li>Decrease the number of valid paths. </li></ul></ul><ul><ul><li>Increase the depth while searching for valid paths. </li></ul></ul><ul><ul><li>Adjust the value of “preferences” in the AP clustering process. </li></ul></ul>
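  18. 18. Experimental Results Evaluated on the real DBLP dataset. The last three columns give the result evaluation.

      | Identical Name | Publications | Actual Authors | Estimated Clusters | Precision | Recall | F-score |
      |----------------|--------------|----------------|--------------------|-----------|--------|---------|
      | Cheng Chang    | 14           | 4              | 4                  | 1.00      | 1.00   | 1.00    |
      | Hui Fang       | 20           | 3              | 3                  | 1.00      | 1.00   | 1.00    |
      | Yi Li          | 40           | 22             | 22                 | 0.78      | 0.93   | 0.85    |
      | Jim Smith      | 21           | 3              | 5                  | 1.00      | 0.85   | 0.92    |
      | Michael Wagner | 37           | 11             | 13                 | 1.00      | 0.55   | 0.71    |
      | Jianyong Wang  | 45           | 1              | 2                  | 1.00      | 0.87   | 0.93    |
      | Lei Wang       | 95           | 39             | 39                 | 0.99      | 1.00   | 0.99    |
      | Wei Wang       | 141          | 14             | 14                 | 0.88      | 0.92   | 0.90    |
      | Bin Yu         | 58           | 12             | 17                 | 0.97      | 0.69   | 0.81    |
      | Jing Zhang     | 60           | 25             | 23                 | 0.94      | 1.00   | 0.97    |
      | (Average)      | -            | -              | -                  | 0.96      | 0.88   | 0.91    |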
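The per-name f-score values reported on this slide are consistent with the usual harmonic mean of precision and recall. A small check, assuming that definition (the slides do not state the formula, and the function name `f_score` is chosen here):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the results slide: (precision, recall) per name.
jim_smith = round(f_score(1.00, 0.85), 2)
michael_wagner = round(f_score(1.00, 0.55), 2)
bin_yu = round(f_score(0.97, 0.69), 2)
```

The (Average) row appears to be a per-column mean over the ten names rather than an f-score recomputed from the averaged precision and recall.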
  19. 19. Experimental Results <ul><li>GHOST vs. DISTINCT </li></ul><ul><ul><li>DISTINCT is one of the state-of-the-art name distinction algorithms. </li></ul></ul>
  20. 20. Conclusion <ul><li>In this paper, we have explored the problem of name distinction and developed an effective five-step framework, GHOST, which employs only one type of relationship, namely co-authorship. </li></ul><ul><li>Experimental results show that GHOST achieves both high precision and high recall, and outperforms the state-of-the-art approach, DISTINCT. </li></ul>
