Ghost
Ghost Presentation Transcript

  • GHOST: An Effective Graph-based Framework for Name Distinction
    • Authors: Xiaoming Fan, Jianyong Wang, Bing Lv, Lizhu Zhou, Wei Hu
    • Publication: CIKM ’08
    • Presenter: Jhih-Ming Chen
  • Outline
    • Introduction
    • The GHOST framework
      • Graphical View of the Database
      • Selection of Valid Paths
      • Computing Similarity
      • Clustering Strategy
      • User Feedback
    • Experimental Results
    • Conclusion
  • Introduction
    • This paper focuses on a problem in digital libraries: distinguishing publications written by different authors who share an identical name.
      • GHOST (abbr. GrapH-based framewOrk for name diStincTion)
    • Example:
      • Query term: “Lei Wang”
      • About 151 publications are retrieved in DBLP
      • There are no fewer than 39 different persons with this name
  • Introduction
    • Objective of name distinction :
      • group publications coauthored by those with an identical name into clusters so that the elements in each cluster belong to the same author.
  • Graphical View of the Database
    • The database of publications D is represented by a graph G = { V, E }
      • Each node v ∈ V represents an author.
        • Note that authors with an ambiguous name are treated as different nodes.
      • An undirected edge represents a coauthorship.
      • The edge between node a and node b has a label S(a, b), which denotes the complete set of publications coauthored by both a and b.
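The graph described on this slide could be built roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `build_coauthor_graph`, the toy publication data, and the choice to give every occurrence of the ambiguous name its own node (tagged with the publication id) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(publications, ambiguous_name):
    """Return {frozenset({a, b}): set of publication ids} for an undirected graph.

    Assumption: each occurrence of `ambiguous_name` becomes its own node,
    tagged with the publication id, so different people sharing the name
    are not merged before clustering.
    """
    edges = defaultdict(set)
    for pub_id, authors in publications.items():
        nodes = [
            f"{name}#{pub_id}" if name == ambiguous_name else name
            for name in authors
        ]
        for a, b in combinations(nodes, 2):
            edges[frozenset((a, b))].add(pub_id)  # edge label S(a, b)
    return dict(edges)


# Hypothetical toy data: publication id -> author list.
pubs = {
    "p1": ["Lei Wang", "A. Chen"],
    "p2": ["Lei Wang", "A. Chen", "B. Liu"],
    "p3": ["A. Chen", "B. Liu"],
}
graph = build_coauthor_graph(pubs, "Lei Wang")
```

Here the edge between "A. Chen" and "B. Liu" carries the label {p2, p3}, while the two "Lei Wang" occurrences stay separate nodes until the clustering step decides whether they are the same person.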
  • Selection of Valid Paths
    • What GHOST needs to do is detect each triangle-like basic unit and check whether the longer sub-path in it is invalid.
      • If an invalid path is found, the longer one is eliminated during the search.
    • An invalid sub-path emerges if and only if two of the three publication sets in the triangle consist of only one identical publication.
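One possible reading of the triangle rule can be sketched as below. The slide does not spell out the exact test, so this is an interpretation rather than the paper's procedure: the two-hop sub-path a-b-c is treated as invalid when a, b and c form a triangle and both hops are labeled with the same single publication. The helper name `is_invalid_subpath` and the toy edges are hypothetical.

```python
def is_invalid_subpath(edges, a, b, c):
    """True if the two-hop sub-path a-b-c is redundant inside triangle a, b, c.

    `edges` maps frozenset({u, v}) to the publication set S(u, v).
    """
    s_ab = edges.get(frozenset((a, b)), set())
    s_bc = edges.get(frozenset((b, c)), set())
    s_ac = edges.get(frozenset((a, c)), set())
    if not (s_ab and s_bc and s_ac):
        return False  # no triangle, so nothing to prune
    # Both hops are supported by one and the same publication.
    return len(s_ab) == 1 and s_ab == s_bc


# Hypothetical toy graph: x, y, z all appear together on publication p1.
single_paper = {
    frozenset(("x", "y")): {"p1"},
    frozenset(("y", "z")): {"p1"},
    frozenset(("x", "z")): {"p1"},
}
# Same triangle, but y and z also share a second publication p2.
extra_paper = dict(single_paper)
extra_paper[frozenset(("y", "z"))] = {"p1", "p2"}

print(is_invalid_subpath(single_paper, "x", "y", "z"))
print(is_invalid_subpath(extra_paper, "x", "y", "z"))
```

In the first graph the detour x-y-z adds no evidence beyond the direct edge x-z, so it is pruned; in the second, the extra shared publication keeps the path valid.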
  • Computing Similarity
    • All valid paths will be used to compute the similarity of the two nodes based on the following heuristics:
      • The shortest valid path is the most indicative one.
      • The more paths there exist between two nodes, the more “similar” the two nodes may be.
      • The notion of “six degrees of separation”.
    • Suppose there are m(i, j) valid paths linking two nodes a_i and a_j, and the length of the n-th path is l_n.
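The similarity formula itself did not survive in this transcript. The snippet below is therefore only an illustrative stand-in that respects the three heuristics listed above, and it should not be read as the formula from the GHOST paper: each valid path contributes 1 / l_n, and paths longer than six hops are ignored. The function name `path_score` and the `max_length` parameter are assumptions.

```python
def path_score(path_lengths, max_length=6):
    """Illustrative stand-in score, NOT the formula from the GHOST paper.

    Shorter paths contribute more, additional paths raise the score, and
    paths longer than `max_length` hops are ignored (assumed cutoff,
    motivated by "six degrees of separation").
    """
    return sum(1.0 / length for length in path_lengths if 1 <= length <= max_length)


print(path_score([2]))      # one short path
print(path_score([2, 2]))   # two short paths score higher
print(path_score([4]))      # one longer path scores lower
print(path_score([7]))      # beyond the cutoff, contributes nothing
```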
  • Clustering Strategy
    • This paper adopts a powerful clustering algorithm called Affinity Propagation (AP for short) for GHOST.
  • Affinity Propagation
    • Take each data point as a node in the network.
    • Consider all data points as potential cluster centers.
    • Start the clustering with a similarity between pairs of data points.
    • Exchange messages between data points until the good cluster centers are found.
      • Responsibility
      • Availability
  • How Affinity Propagation works (flowchart)
    • start → Construct Similarity Matrix → Update r(i,k) → Update a(i,k) → Decide exemplar → Change on Decision? (Y: update again; N: end)
  • Responsibility r(i, k)
    • Responsibility: data point -> candidate exemplar
      • How well suited the candidate exemplar is for the data point, compared to all other possible exemplars.
      • self-responsibility r(k, k) : prior likelihood for k to be chosen as exemplar.
        • Defined by user, determines number of clusters.
        • Good choice:
  • Availability a(i, k)
    • Availability: candidate exemplar -> data point
      • How appropriate the candidate is as an exemplar for the data point, taking support from other data points into account.
  • Affinity Propagation Update Rules
    • r(i, j) = 0; a(i, j) = 0; ∀ i, j
    • for i := 1 to num_iterations
    • end;
    • for all x_k with ( r(k, k) + a(k, k) > 0 )
      • x_k is exemplar
      • Assign non-exemplars x_i to closest exemplar under similarity measure s(i, k)
    • end;
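The body of the update loop is missing from this transcript. The sketch below fills it with the standard Affinity Propagation message updates as published by Frey and Dueck, which may differ in detail from what the slide showed. The damping factor, the function name `affinity_propagation`, and the toy 1-D data are assumptions; the diagonal s(k, k) holds the user-chosen "preference" of point k.

```python
def affinity_propagation(s, num_iterations=100, damping=0.5):
    """Cluster n >= 2 points given an n x n similarity matrix s (list of lists).

    Returns a list mapping each point to the index of its exemplar, or a
    list of None if no exemplar emerged. Damping is an assumed addition to
    stabilise the updates; the slide does not mention it.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(num_iterations):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            for k in range(n):
                best_other = max(a[i][kk] + s[i][kk] for kk in range(n) if kk != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) <- sum over i' != k of max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, r[ii][k]) for ii in range(n)]
            total = sum(pos) - pos[k]  # positive support from all points except k
            for i in range(n):
                new = total if i == k else min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new

    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [None] * n
    return [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
            for i in range(n)]


# Hypothetical toy data: two well-separated groups of 1-D points.
points = [0, 1, 2, 10, 11, 12]
preference = -10.0  # assumed value for every s(k, k)
sim = [[preference if i == k else -float((x - y) ** 2)
        for k, y in enumerate(points)]
       for i, x in enumerate(points)]
labels = affinity_propagation(sim)
print(labels)
```

On this toy input the points 0, 1, 2 should share one exemplar and the points 10, 11, 12 another. A lower (more negative) preference tends to produce fewer clusters, which is the knob the User Feedback slide refers to.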
  • User Feedback
    • When we resolve a name and find that at least one direct coauthor is shared by two distinct authors with that name, any author with the name is referred to as a “dense author”.
      • For example, two different authors named “Wei Wang” have coauthored with “Jiawei Han”.
    • To deal with the low performance caused by dense authors, user feedback is adopted to achieve enhancement.
      • Decrease the number of valid paths.
      • Increase the depth while searching for valid paths.
      • Adjust the value of “preferences” in the AP clustering process.
  • Experimental Results
    • Evaluated on the real DBLP dataset

      Identical Name   Publications  Actual Authors  Estimated Clusters  Precision  Recall  F-score
      Cheng Chang      14            4               4                   1.00       1.00    1.00
      Hui Fang         20            3               3                   1.00       1.00    1.00
      Yi Li            40            22              22                  0.78       0.93    0.85
      Jim Smith        21            3               5                   1.00       0.85    0.92
      Michael Wagner   37            11              13                  1.00       0.55    0.71
      Jianyong Wang    45            1               2                   1.00       0.87    0.93
      Lei Wang         95            39              39                  0.99       1.00    0.99
      Wei Wang         141           14              14                  0.88       0.92    0.90
      Bin Yu           58            12              17                  0.97       0.69    0.81
      Jing Zhang       60            25              23                  0.94       1.00    0.97
      (Average)        -             -               -                   0.96       0.88    0.91
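As a quick sanity check on the table above, the f-score column appears consistent with the usual harmonic mean of precision and recall. The slide does not state the definition, so that is an assumption; the helper name `f_score` is hypothetical and the rows are copied from the table.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-score) for a few rows of the table.
rows = {
    "Yi Li": (0.78, 0.93, 0.85),
    "Jim Smith": (1.00, 0.85, 0.92),
    "Michael Wagner": (1.00, 0.55, 0.71),
    "Bin Yu": (0.97, 0.69, 0.81),
}
for name, (p, r, reported) in rows.items():
    # Allow for rounding to two decimals in the reported values.
    assert abs(f_score(p, r) - reported) < 0.01, name
```

The (Average) row seems to be the mean of each column rather than an f-score recomputed from the averaged precision and recall.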
  • Experimental Results
    • GHOST vs. DISTINCT
      • DISTINCT is one of the state-of-the-art name distinction algorithms.
  • Conclusion
    • In this paper, we have explored the problem of name distinction, and developed an effective five-step framework, GHOST, which employs only one type of relationship, namely, co-authorship.
    • Experimental results show that GHOST achieves both high precision and high recall, and outperforms the state-of-the-art approach, DISTINCT.