A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries

This paper addresses the challenging problem of name disambiguation in the context of digital libraries that administer bibliographic citations. The problem arises when multiple authors share a common name or when multiple name variations of an author appear in citation records. Name disambiguation is not trivial to solve, and most digital libraries do not provide an efficient way to accurately identify the citation records of an author. Furthermore, the lack of complete metadata in digital libraries hinders the development of a generic algorithm that is applicable to any dataset. We propose a heuristic-based, unsupervised and adaptive method that also incorporates user interaction, so that users' feedback counts in the disambiguation process. Moreover, the method exploits important features associated with an author and citation records, such as co-authors, affiliation, publication title and venue, and builds a multilayer hierarchical clustering algorithm, which tunes itself according to the available information and forms clusters of unambiguous records. Our experiments on a set of researchers considered to be highly ambiguous produced high precision and recall, and affirm the viability of our algorithm.

Published in: Data & Analytics

  1. 1. A Real-time Heuristic-based Name Disambiguation Method for Digital Libraries • Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese
  2. 2. Outline • Name Disambiguation problem • Mixed and Split Citations • Related work • Our approach • Experiments & results • Conclusion
  3. 3. Name Disambiguation • Multiple authors share the same name: Author-1, Author-2, Author-3 and Author-4 are all named Muhammad Imran • One author with multiple name variations: Muhammad Imran (name variation-1), M. Imran (name variation-2), Imran Muhammad (name variation-3)
  4. 4. Name Disambiguation Types • Mixed citations • [Figure: the DL returns mixed citation records for M. Imran, Muhammad Imran, Malik Imran and Mehar Imran]
  5. 5. Name Disambiguation Types • Split citations • [Figure: the citation records of Muhammad Imran are split in the DL across Author-1, Author-2 and Author-3]
  6. 6. Related Work • Supervised approaches • Generative (naïve Bayes) • Discriminative (support vector machines) • Labor-intensive, high training cost • Unsupervised approaches • Mostly failed to tackle the name variation issue • No user intervention
  7. 7. Our Contributions • An end-to-end system • Retrieval -> pre-processing -> disambiguation • A generic disambiguation approach • Unsupervised • Heuristics based • Involves Users’ feedback
  8. 8. Our Approach • [Figure: citation records (CR) containing both mixed and split citations pass through four layers, labeled Layer-1 to Layer-4] • Discipline based clustering, followed by the user's cluster selection • Co-author based split & building the candidate principal authors' list • Affiliation & candidate authors based merge • Title & homepage based merge, using a title based vector • Principal cluster selection: the user selected principal cluster and its principal author
  9. 9. Hierarchical Clustering & Feature Representation • Approaches • Agglomerative • Divisive • Feature matrix X (N x D) • N (rows) = No. of citation records • D (columns) = No. of features • Xi,j = jth feature of ith citation record
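
The feature matrix and the agglomerative approach named on this slide can be illustrated with a minimal sketch. It assumes binary features, a Jaccard distance, single linkage and a stopping threshold of 0.5; none of these choices come from the slides, and the matrix values are made up for illustration.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Distance between two binary feature rows (1 - Jaccard similarity)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if either == 0 else 1.0 - both / either

def agglomerative(X, threshold):
    """Single-linkage agglomerative clustering over the rows of X."""
    clusters = [[i] for i in range(len(X))]  # start with one cluster per record
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters (closest pair of members).
        for p, q in combinations(range(len(clusters)), 2):
            d = min(jaccard_distance(X[i], X[j])
                    for i in clusters[p] for j in clusters[q])
            if best is None or d < best[0]:
                best = (d, p, q)
        d, p, q = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]  # q > p, so index p is unaffected
    return clusters

# Hypothetical data: N = 4 citation records (rows), D = 5 features (columns).
X = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
clusters = agglomerative(X, threshold=0.5)
print(clusters)
```

A divisive variant would start from one cluster holding all records and split it instead.
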
  10. 10. Features: co-authorship • Joint authors of a book, article … • Available across DLs • We use it as: • Principal author • Co-authors • [Figure: a citation record {author-1, author-2, author-3, author-4, author-5} divided into a principal author and co-authors]
  11. 11. Features: co-authorship • Heuristics “If a co-author appears in two different publications with the same principal author name, then most likely both publications belong to the same principal author” • IF author-2 of citation record-1 {author-1, author-2, ...} = author-2 of citation record-2 {author-1, author-2, ...} THEN principal author-1 of citation record-1 = principal author-1 of citation record-2
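
The co-authorship heuristic can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: records are modeled as `(principal_author, co_authors)` pairs, grouping is transitive via a small union-find, and all names in the sample data are hypothetical.

```python
def find(parent, i):
    """Return the representative of i's group, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def group_by_shared_coauthor(records):
    """Group records with the same principal author name and a shared co-author.

    records: list of (principal_author_name, set_of_co_author_names).
    Returns a sorted list of groups, each a list of record indices.
    """
    parent = list(range(len(records)))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            name_i, co_i = records[i]
            name_j, co_j = records[j]
            if name_i == name_j and co_i & co_j:
                parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())

# Hypothetical citation records for one ambiguous name.
records = [
    ("M. Imran", {"A. Khan", "B. Rossi"}),
    ("M. Imran", {"B. Rossi"}),
    ("M. Imran", {"C. Li"}),
    ("M. Imran", {"C. Li", "D. Park"}),
]
groups = group_by_shared_coauthor(records)
print(groups)
```

Records 0 and 1 share the co-author "B. Rossi", and records 2 and 3 share "C. Li", so the four records fall into two groups.
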
  12. 12. Features: Conference Venue • Venue represents an event name, e.g., a conference, workshop or journal name. • Available across DLs. • Heuristics “The venue information of two researchers with the same name can differentiate one from the other by examining the disciplines and sub-disciplines of each researcher's interests.”
  13. 13. Features: Author’s Affiliation • Author’s affiliation with an institute, university, organization etc. • Available across DLs. • Heuristics “If two publications with the same principal author name also share the same affiliation information, then both publications are considered to belong to the same author.”
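
A minimal sketch of an affiliation based merge is given below. It assumes each cluster carries its record ids and a set of affiliation strings, and that affiliations match after simple lower-casing and whitespace normalization; the dictionary layout, the `normalize` helper and the sample affiliations are hypothetical.

```python
def normalize(affiliation):
    """Lower-case an affiliation and collapse commas and repeated spaces."""
    return " ".join(affiliation.lower().replace(",", " ").split())

def merge_by_affiliation(clusters):
    """Merge clusters of one principal author name that share an affiliation.

    clusters: list of {"records": [ids], "affiliations": {strings}}.
    """
    merged = []
    for cluster in clusters:
        affs = {normalize(a) for a in cluster["affiliations"]}
        recs = list(cluster["records"])
        keep = []
        # Absorb every already-merged cluster that shares an affiliation.
        for other in merged:
            if affs & other["affiliations"]:
                affs |= other["affiliations"]
                recs = other["records"] + recs
            else:
                keep.append(other)
        keep.append({"records": sorted(recs), "affiliations": affs})
        merged = keep
    return merged

# Hypothetical clusters, all with the same principal author name.
clusters = [
    {"records": [0, 1], "affiliations": {"Example University"}},
    {"records": [2], "affiliations": {"Sample Research Institute"}},
    {"records": [3], "affiliations": {"example  university"}},
]
merged = merge_by_affiliation(clusters)
print(sorted(c["records"] for c in merged))
```

Real affiliation strings vary far more than this normalization handles, so a production system would likely need fuzzier matching.
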
  14. 14. Features: Authors’ Names • An author’s name can have multiple name variations. • For example: Muhammad Imran • M. Imran • Imran Muhammad • Muhammad I.
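
One way to test whether two strings could be variations of the same name is sketched below. The matching rule is an assumption made for illustration (same number of name parts, in any order, where a single-letter initial matches any part starting with that letter); the slides do not specify how variations are detected.

```python
from itertools import permutations

def tokens(name):
    """Lower-case name parts with dots and commas removed."""
    parts = (t.strip(".,").lower() for t in name.replace(",", " ").split())
    return [p for p in parts if p]

def part_matches(a, b):
    """Two parts match if equal, or if one is a one-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)

def is_name_variation(name_a, name_b):
    """True if the parts of both names can be paired up in some order."""
    a, b = tokens(name_a), tokens(name_b)
    if len(a) != len(b):
        return False
    # Names have few parts, so trying every ordering is cheap.
    return any(all(part_matches(x, y) for x, y in zip(a, perm))
               for perm in permutations(b))

full_name = "Muhammad Imran"
for candidate in ["M. Imran", "Imran Muhammad", "Muhammad I.", "Malik Imran"]:
    print(candidate, is_name_variation(full_name, candidate))
```

Under this rule "M. Imran" also matches "Malik Imran", which is exactly why name matching alone cannot disambiguate and the other features are needed.
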
  15. 15. Features: Publication titles • Title as a string literal • We maintain a vector of important keywords • Represents the author’s interests • A similarity measure between a given citation record and the vector can be useful
  16. 16. Features: Principal Author’s Homepage • The URL of the principal author's homepage.
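
The keyword vector and similarity measure can be sketched as follows. Cosine similarity over raw keyword counts, the small stop-word list and the sample titles are assumptions chosen for illustration; the slides do not say which similarity measure or keyword selection is used.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "with"}

def keywords(title):
    """Count the non-stop-word terms of a title."""
    words = (w.strip(".,:;()").lower() for w in title.split())
    return Counter(w for w in words if w and w not in STOPWORDS)

def build_interest_vector(titles):
    """Accumulate keyword counts over an author's known titles."""
    vector = Counter()
    for title in titles:
        vector.update(keywords(title))
    return vector

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse keyword-count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

interest = build_interest_vector([
    "Name Disambiguation in Digital Libraries",
    "Clustering Citation Records in Digital Libraries",
])
related = cosine_similarity(
    interest, keywords("Author Name Disambiguation for Citation Records"))
unrelated = cosine_similarity(interest, keywords("Protein Folding Simulation"))
print(related > unrelated)
```

A record whose title scores high against the vector is a candidate for merging into the author's cluster; the cut-off value would have to be tuned.
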
  17. 17. Disambiguation System in Action • Inter-related disciplines based formation of clusters • Co-authors based split • Affiliation based agglomerative merge • Pursuit of the remaining bits (title & homepage based merge)
  18. 18. Inter-related disciplines based formation of clusters • Exploits venue/discipline information • Forms relatively big clusters • Involves users and considers their selection among clusters
  19. 19. Inter-related disciplines based formation of clusters
  20. 20. Co-author Based Split • Using k-means clustering
  21. 21. Experiment & Evaluation Dataset • 50 most ambiguous researchers • Manually annotated a gold-standard dataset • Used DBLP as a data source • Used ADANA as a baseline approach • Used Precision, Recall and F1 as performance measures
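
For reference, the three performance measures can be computed per author as in the sketch below. It assumes the evaluation compares the set of record ids assigned to an author against the manually annotated set; the slides do not state whether the measures are computed per author, per cluster or pairwise, and the ids are made up.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one author's citation records."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical record ids: 3 of the 4 predicted records are correct,
# and the gold annotation contains 5 records.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(precision, recall, round(f1, 3))
```
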
  22. 22. Experiment & Evaluation
  23. 23. Thank you! Muhammad Imran mimran@qf.org.qa
