Author disambiguation for enhanced science 2.0 services

  • 1,244 views
Uploaded on

The disambiguation of authors with homonymous names has been a growing concern in bibliometrics for several years. The identification of authors is a necessary precondition for accurate bibliometric …

The disambiguation of authors with homonymous names has been a growing concern in bibliometrics for several years. The identification of authors is a necessary precondition for accurate bibliometric calculations. The basic technique of matching authors' names based on ratios of shared letters is susceptible to error in the case of very common names or in broad research fields. A more advanced approach to the problem relies on the contextual metadata relating to an author. Using the co-author and the co-citation relationships that pertain to a given author, a map of the intellectual landscape of that author can be constructed. Such patterns of collaboration and citation are typically unique to each individual and serve as fingerprints for the identification of those who happen to share the same name. The automated analysis of contextual metadata for each author-name-instance can be used to enhance the management of knowledge within science-focused social media. This paper presents a minimalistic solution to the author disambiguation problem that leverages bibliographic metadata to generate clusters of articles belonging to distinct individuals who all share the same name. Implications for enhanced Science 2.0 services are discussed.

More in: Education , Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
1,244
On Slideshare
0
From Embeds
0
Number of Embeds
12

Actions

Shares
Downloads
7
Comments
0
Likes
1

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Author Disambiguation forenhanced science 2.0 servicesGenerating groups based on “thin” dataJeffrey DemaineInstitute for Research Information and Quality Assurance (iFQ)Bonn, Germanywww.research-information.de
  • 2. Outline• Describe author-disambiguation efforts at iFQ – Good results based on limited metadata. – Simple approach can achieve remarkable results.• Adapt the findings to the context of an online service.• Most author-disambiguation articles achieve remarkable results by starting with clean (manually curated) data. – What if you dont have clean, curated data? – i.e.: the community-generated content of Science 2.0 Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 3. Challenge• Web of Science database does not group author names by identity: each name-instance gets a different ID number.• Example: There are 100 articles by Smith, J – 100 different authorID numbers – 100 different researchers? – 1 researcher published them all? – Something in-between?• How can iFQ measure a researcher´s productivity if we dont know who that person is? Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 4. ChallengeGoal:• To reduce uncertainty about identity• Identify and group homonyms – same spelling = same thingConstraints:• WoS only uses the first initial. – Few records have a first name• Affiliation data is suspect – Corresponding author only Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 5. Ignoring synonyms• Spelling mistakes, anglicisation of characters (changing ö into oe or o), name changes due to marriage, etc. are beyond the scope of this project.• While J. Smith can be disambiguated, we will never know if Jane Smith is a mis-spelling of June Smith. (Maybe it’s really Janet Smith, or Jane Smyth, or…)• Speculation about what could be cannot be resolved. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 6. Matching on metadata Thiel, C Thiel, CM; Bentley, P; Dolan, RJ (2002) Effects of cholinergic enhancement on conditioning-related? responses in human auditory cortex. EUR J NEUROSCI Thiel, C Thiel, CM; Henson, R; Morris, JS; Friston, KJ; Dolan, RJ (2001) Pharmacological modulation of behavioral and neuronal correlates of repetition priming. J NEUROSCI These articles have on co-author in common: we can be pretty sure they were both written by the same Christiane Thiel Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 7. Matching on metadata Thiel, C Smith, JAuthor-instances are matchedbased on Bibliographic Coupling Thiel, C Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 8. Types of matching tested4 types of matching were tested:• > 1 co-author• > 1 co-author and/or > 1 reference• > 1 co-author and/or > 2 references• > 1 co-author and/or > 1 reference [Jaccard Similarity Coeff.]Evaluation of matching:• Fewer groups = simplification of identity. RATIO• How accurate are the groups? PRECISION• How efficient are the groups? RECALL Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 9. Types of matching tested• Compare each group against the Emmy Noether data, which we know represents a single person.• How accurate is the group with the most matches?That is:• We know that a J Smith wrote 15 articles.• Which group contains the most of those?• Which method of matching creates the best groups? Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 10. Goal #1: Reducing uncertainty14 names from Emmy Noether dataset(all records since 1992) Name tested WoS Simplification of identityB Borasoy 59 Co-author matching 95 90%R Huber 945 Number of Co-authors & reference 65 93%A Blaukat 44 groups by: Co-authors & 2 references 76 92%G Behre 98 Co-authors & reference Jaccard 154 84%M Albrecht 652L Ackermann 60R Everaers 43 (945 x 3) - 1 x (944 x 3) - 1 > 8 millionP Neumann 317K Niebuhr 13 R Huber - manual matchingC OchsenfeldJW Pan 33 192 Co-author 70,839,681 Co-author comparisonsT Pietschmann 37 matching 1180661 147583 minutes (1 per second) 8-hour daysB Regenberg 30K Wiegand 28 took 5m 27sec 29517 5-day work weeks 615 years (4 weeks vacation) Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 11. Types of matching tested Which method of matching creates the best groups? Precision (%)Matching by co-authors & 2 references 75% PrecisionMatching by co-authors & 2 references & author starting year 79% Precision Precision = The ratio of correct matches to all matches (% correct) Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 12. Goal #2: Split known names• Use the co-author & 2-references matching technique.• We know a priori the identities of the persons, so we know how many groups there should be.• This allows us to calculate Recall.• Also allows us to calculate correlation coefficients. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 13. Who is E. Cardellach?Esteve Cardellach Estel CardellachBellaterra, Spain Bellaterra, SpainUniversitat Autònoma de Barcelona Insitut de Ciècies de l’ESPAI• Environmental Engineering • Computer Science, A.I.• Environmental Sciences • Electrical & Electronic Engineering• Geochemistry & Geophysics • Imaging Science & Photo. Tech.• Geology • Multidisciplinary Geosciences• Mineralogy • Remote Sensing• Civil Engineering • Telecommunications Web of Science identifies both as simply “Cardellach, E” Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 14. Using the program• There are 41 distinct records published since 2002 that match the name “Cardellach, E”.• The program grouped the names according to the co- authors and/or references (minimum 2) they share.• Involved the comparison of 9202 co-authors and of 19777 references, which took only 21 seconds (!).• The RESULT was 4 groups containing all 41 records.• Googled article titles to verify author name & address. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 15. Precision & Recall Condition (reality) Condition (reality) TRUE FALSE Test (prediction) Precision True positive False positive TRUE TP ÷ (TP + FP) Test (prediction) False negative True negative FALSE Recall TP ÷ (TP + FN)Precision: % of results that are correct (accuracy).Recall: % of all relevant matches that were made (efficiency).Must know the total number of correct answers there are.The Matthews correlation coefficientgives a statistical measure of the results: Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 16. Disambiguation of Estel and Esteve In reality: In reality: Esteve Estel Precision Predicted: Esteve 25 1 96% Predicted: Estel 3 12 Recall Matthews c.c. 89% 0.79The one false positive occurred when Estel published apaper in the field of Geochemistry & Geophysics. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 17. Disadvantages• Not the most advanced approach. – Does not use NLP on title, abstract.• High variablility of precision: groups can contain errors.• Smaller (“orphan“) groups require manual matching.• Very common names produce a large number of groups (naturally). Requires some post-processing. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 18. Advantages• It is highly automated, requiring only the input of a name and a year with which to limit the search.• It runs very quickly, with hundreds of names being disambiguated in a few minutes.• Only have to verify one name-instance per group.• Works with the real-world data we have (FIZ databases), not some hand-curated test set. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 19. What about science 2.0?• Most author-disambiguation articles achieve remarkable results by starting with a very clean data-set.• In a science 2.0 context, content is user-generated and nobody cleans the data.• But that does not mean one cannot extract interesting patterns from the metadata.• Example: author-disambiguation based on co-authors, co-citation, and shared keywords. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 20. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 21. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 22. “Loren M. Frank” found 7 times in Karlmarx20’s library• Emery N. Brown, David P. Nguyen, Loren M. Frank, Matthew A. Wilson, Victor Solo. “An analysis of neural receptive field plasticity by point process adaptive filtering[Quick Edit]” Proceedings of the National Academy of Sciences of the United States of America, Vol. 98, No. 21. (9 October 2001), pp. 12261-12266.• Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, Matthew A. Wilson . “A Statistical Paradigm for Neural Spike Train Decoding Applied to Position Prediction from Ensemble Firing Patterns of Rat Hippocampal Place Cells”[Quick Edit] J. Neurosci., Vol. 18, No. 18. (15 September 1998), pp. 7411-7425.• Emery N. Brown, Riccardo Barbieri, Valérie Ventura, Robert E. Kass, Loren M. Frank . “The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis”[Quick Edit]. Neural Computation, Vol. 14, No. 2. (19 December 2010), pp. 325-346.• Riccardo Barbieri, Loren M. Frank, Michael C. Quirk, Matthew A. Wilson, Emery N. Brown. “Diagnostic methods for statistical models of place cell spiking activity[Quick Edit]” Neurocomputing, Vol. 38-40 (June 2001), pp. 1087- 1093.• Emery N. Brown, Riccardo Barbieri, Valerie Ventura, Robert E. Kass, Loren M. Frank. “The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis” [Quick Edit]Neural Comp., Vol. 14, No. 2. (February 2002), pp. 325-346.• Riccardo Barbieri, Michael C. Quirk, Loren M. Frank, Matthew A. Wilson, Emery N. Brown. “Construction and analysis of non-Poisson stimulus-response models of neural spiking activity”. [Quick Edit]Journal of Neuroscience Methods, Vol. 105, No. 1. (January 2001), pp. 25-37.• Uri T. Eden, Loren M. Frank, Riccardo Barbieri, Victor Solo, Emery N. Brown. “Dynamic Analysis of Neural Encoding by Point Process Adaptive Filtering” [Quick Edit]Neural Comp., Vol. 16, No. 5. (May 2004), pp. 971-998. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 23. “Loren M. Frank” found 7 times in Karlmarx20’s library• Emery N. Brown, David P. Nguyen, Loren M. Frank, Matthew A. Wilson, Victor Solo. “An analysis of neural receptive field plasticity by point process adaptive filtering” Proceedings of the National Academy of Sciences of the United States of America, Vol. 98, No. 21. (9 October 2001), pp. 12261-12266.• Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, Matthew A. Wilson . “A Statistical Paradigm for Neural Spike Train Decoding Applied to Position Prediction from Ensemble Firing Patterns of Rat Hippocampal Place Cells” J. Neurosci., Vol. 18, No. 18. (15 September 1998), pp. 7411-7425.• Emery N. Brown, Riccardo Barbieri, Valérie Ventura, Robert E. Kass, Loren M. Frank . “The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis”. Neural Computation, Vol. 14, No. 2. (19 December 2010), pp. 325-346.• Riccardo Barbieri, Loren M. Frank, Michael C. Quirk, Matthew A. Wilson, Emery N. Brown. “Diagnostic methods for statistical models of place cell spiking activity” Neurocomputing, Vol. 38-40 (June 2001), pp. 1087-1093.• Emery N. Brown, Riccardo Barbieri, Valerie Ventura, Robert E. Kass, Loren M. Frank. “The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis” Neural Comp., Vol. 14, No. 2. (February 2002), pp. 325-346.• Riccardo Barbieri, Michael C. Quirk, Loren M. Frank, Matthew A. Wilson, Emery N. Brown. “Construction and analysis of non-Poisson stimulus-response models of neural spiking activity”. Journal of Neuroscience Methods, Vol. 105, No. 1. (January 2001), pp. 25-37.• Uri T. Eden, Loren M. Frank, Riccardo Barbieri, Victor Solo, Emery N. Brown. “Dynamic Analysis of Neural Encoding by Point Process Adaptive Filtering” Neural Comp., Vol. 16, No. 5. (May 2004), pp. 971-998. Disambiguation based on shared co-authors. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 24. What about reference matches?• CiteULike does not have citation data!• The references collected in a users library (‟bookmarks”) are a form of citation. – Indicate common academic interest.• So: leverage the personal collections generated by users to disambiguate authors based on ‟co-collection” of articles. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 25. Matching on metadata Thiel, CAuthor-instances are matchedbased on Co-citation Thiel, C Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 26. “Wentai Liu” found 10 times in Karlmarx20’s library• Zhi Yang, Qi Zhao, Wentai Liu. “Improving spike separation using waveform derivatives”. Journal of Neural Engineering, Vol. 6, No. 4. (2009)• Zhi Yang, Qi Zhao, Wentai Liu. “Neural signal classification using a simplified feature set with nonparametric clustering“. Neurocomputing, Vol. 73, No. 1-3. (2009), pp. 412-422.• Zhi Yang, Qi Zhao, Wentai Liu. “Improving spike separation using waveform derivatives.” Journal of Neural Engineering, Vol. 6, No. 4. (2009)• Zhi Yang, Qi Zhao, Wentai Liu. “Energy based evolving mean shift algorithm for neural spike classification“. In 2009 31st Annual International Conference of the {IEEE} Engineering in Medicine and Biology Society. {EMBC} 2009, 3-6 Sept. 2009 (2009), pp. 966-9.• Linh Hoang, Zhi Yang, Wentai Liu. "VLSI architecture of NEO spike detection with noise shaping filter and feature extraction using informative samples“ In 2009 31st Annual International Conference of the {IEEE} Engineering in Medicine and Biology Society. {EMBC} 2009, 3-6 Sept. 2009 (2009), pp. 978-81.• Linh Hoang, Zhi Yang, Wentai Liu. "VLSI architecture of NEO spike detection with noise shaping filter and feature extraction using informative samples“ Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2009), pp. 978-981.• Zhi Yang, Qi Zhao, Wentai Liu. “Energy based evolving mean shift algorithm for neural spike classification“. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2009), pp. 966-969.• Linh Hoang, Zhi Yang, Wentai Liu. "VLSI architecture of NEO spike detection with noise shaping filter and feature extraction using informative samples.” Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2009), pp. 978-981.• Tung-Chien Chen, Wentai Liu, Liang-Gee Chen. “VLSI architecture of leading eigenvector generation for on-chip principal component analysis spike sorting system“. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2008), pp. 3192-5.• Zhi Yang, Qi Zhao, Wentai Liu. “Improving spike separation using waveform derivatives“. Journal of Neural Engineering, Vol. 6, No. 4. (July 2009). Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 27. Matching based on collection• Some precedence for this in the Science 2.0 literature: – Haustein, S. et al. equate bookmarking on CiteULike as a proxy for citations in measuring journal impact. – Usage ratio, diffusion, intensity based on counting user profiles.• Matching based on membership in a collection: data that is extrinsic to the article itself. – Match on user-generated keywordsHaustein, S., Golov, E., Luckanus, K., Reher, S. & Terliesner, J. (2010). “Journal evaluation and science 2.0:Using social bookmarks to analyze reader perception”. Proceedings of the 11th International Conference onScience and Technology Indicators, Leiden, 117-119. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 28. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 29. Practicalities• The accuracy of the disambiguation is not mission- critical – its just an online web-service.• Could be easily implemented to run on-the-fly using Java Server Page technology. – When a user loads a web-page, the server matches the references based on their metadata and profile membership, creating a web-page with groups.• An example of an automated value-added service in a Science 2.0 context. Institute for Research Information and Quality Assurance Jeffrey Demaine 08/2011
  • 30. Thank you very much for your attention! demaine@forschungsinfo.deInstitute for Research Information and Quality Assurance Jeffrey Demaine 08/2011