Leveraging Semantic Fingerprinting for Building Author Networks

Presented at the 10th annual Data Harmony Users Group meeting on Wednesday, February 12, 2014 by Bob Kasenchak of Access Innovations, Inc. With the rise of ORCID and other universal databases of researchers and institutions, it is increasingly crucial for publishers to sort out their own data containing named entities. This talk details Access Innovations' approach to author disambiguation, which includes a taxonomy-based solution in addition to algorithmic processes. The presentation includes a case study.


  1. LEVERAGING SEMANTIC FINGERPRINTS FOR BUILDING AUTHOR NETWORKS
     Bob Kasenchak, Production Coordinator
     DHUG 2014
  2. NAMED ENTITY DISAMBIGUATION
     Most publishers (and many other organizations) need to disambiguate lists of:
     • Persons: authors, editors, members, employees
     • Institutions: colleges, companies, laboratories, organizations
     Copyright 2014 Access Innovations, Inc.
  3. BUT WHY DISAMBIGUATE?
     • Facilitate content discovery
       • Browse by author or institution name
     • Resolve member, author, and marketing lists
     • Link out to other organizations (e.g., ORCID)
     • Demonstrate value to stakeholders
       • e.g., college libraries are less apt to cancel subscriptions if they are shown how many of their professors are published in your content
     • Market research and analysis
  4. TWO DISAMBIGUATION PROCESSES
     • Matching algorithms
       • String matching
       • Fuzzy matching
     • Leveraging other data associated with each entity to increase matching probability and reduce false matches, such as:
       • Country
       • Date
       • Co-authors
  5. TWO-PHASE WORKFLOW
     • Initial set of raw data is used to create an authority file
     • Questionable names are subject to human review
     • Authority file is subject to constant review and cleanup
     • Entities are extracted from new content and compared to the authority file
     • Anomalies are reviewed and matched to existing records or added as new entities
  6. INSTITUTION DISAMBIGUATION
     • Having a clean institution authority file allows for better processing of persons
     • The work is easier and more clear-cut
     • Develop standards and practices, but be prepared to change or add to them as new data comes to light
     • Forcing data into a bad paradigm isn't helpful
     • The data should inform your standards and practices
  7. INSTITUTION DISAMBIGUATION FLOW
  8. QUALITY OF RAW DATA MATTERS
     • Well-formed source data? Structured or unstructured?
     • Legacy content?
       • Often not as well structured
       • Or auto-tagged, so can be unreliable
     • Parsed using punctuation etc. as delimiters
     • Common abbreviations and stopwords
     • Also, leverage country information if available
  9. INSTITUTIONS: RAW DATA
     • Ohio Aerosp. Inst., Cleveland, OH 44142
     • Ohio Aerospace Institute (OAI)
     • Ohio Dominican University
     • Ohio Institute of Technology
     • Ohio Northern University
     • Ohio State
     • Ohio State Univ., Columbus, OH
     • Ohio State Univ., Columbus, OH 43210
     • Ohio State Univ., Columbus, OH 43210-1298
     • Ohio State Univ., Dept. of Linguist.
     • Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu
     • Ohio University
  10. INSTITUTION DISAMBIGUATION FLOW
  11. HUMAN EDITORIAL REVIEW
      Two kinds of human intervention are used:
      • QC of automated matches for accuracy
        • Culls out errors
        • Gathers data to iteratively adjust matching algorithms
      • Reviewing non-matched entities
        • Match by hand to existing authority file
        • Create new listings for new entities
  12. EDITORIAL REVIEW INTERFACE
      • Institutions to be reviewed
      • Authority File lookup
      • Search results
  13. AUTHORS (AND OTHER PERSONS)
      Persons are trickier than institutions!
      • Variants
      • Nicknames
      • Middle name, initial, or nothing
      • Initials
      • Suffixes and prefixes
      • Similar last names
      • Name changes
      • Transliterations
  14. NAMES: RAW DATA
      • Carlson, N.
      • Carlson, Neil N.
      • Carlson, P.
      • Carlson, R. L.
      • Carlson, R. M. K.
      • Carlson, R. W.
      • Carlson, Roy
      • Carlson, Roy F.
      • Carlson, T. A.
      • Carlson, Thomas
      • Carlson, Thomas A.
      • Carlson, Thomas J.
      • Carlson, W. G.
      • Carlson, William
      • Carlson, William V.
      Which, if any, are the same person?
  15. PERSON NAME DISAMBIGUATION FLOW
  16. RESOLVER; SEMANTIC FINGERPRINTS
  17. RESOLVER; SEMANTIC FINGERPRINTS
  18. AUTHOR NAME AUTHORITY FILE
      Each author record is linked to other associated data:
      • Every DOI (or other document #)
      • Every co-author
      • Every institution
      • Dates of publication
      • Subject terms from the thesaurus used to index content associated with each person
      Each of these is used in the disambiguation algorithm to weight the potential matches of similar names
  19. LEVERAGING THESAURUS TERMS
      • The indexing from every paper by each known author comprises a weighted subject "fingerprint"
      • Potential matching names from incoming content are associated with the indexing from each paper
      • Subject terms are compared to potential matches to increase certainty weighting
  20. PERSON NAME DISAMBIGUATION FLOW
  21. EDITORIAL REVIEW INTERFACE
      • Authors to be reviewed
      • Authority File lookup
      • Search results
  22. ITERATIVE PROCESSES
      • Every batch of new content adds more data for the matching algorithms to use
      • The authority files should be reviewed by editors for QC to keep the files clean
      • Editors can suggest tweaks to the algorithm based on the results that are being sent to them for review and QC of the authority files
        • Too many obvious matches being kicked out; or
        • Bad automatic matches being added to authority files
  23. CONTENT-AWARE PROCESSES
      • Every dataset is different, so the named entity disambiguation processes and algorithms should be modified to suit
      • More "adjustable" than "one-size-fits-all"
      • Basic processes can be customized to suit different datasets and client needs
      • Leveraging thesaurus/subject terms from indexing is a huge addition to the disambiguation algorithms
  24. NAMED ENTITY DISAMBIGUATION PROCESSES AND PROCEDURES
      Bob Kasenchak, Project Coordinator
      November 20, 2013
      Thank You – Any Questions?
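
Slides 8 and 9 describe parsing raw institution strings on punctuation and handling common abbreviations and stopwords. The slides do not show the actual rules, so the following is only a minimal Python sketch under stated assumptions: the abbreviation table, the stopword list, the choice to keep only the text before the first comma, and the function names `institution_key` and `group_by_key` are all hypothetical, not Access Innovations' implementation.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables; real ones would be derived from the data itself.
ABBREVIATIONS = {"univ": "university", "inst": "institute", "aerosp": "aerospace"}
STOPWORDS = {"of", "the"}

EMAIL_RE = re.compile(r"\S+@\S+")
PAREN_RE = re.compile(r"\([^)]*\)")


def institution_key(raw: str) -> str:
    """Reduce a raw affiliation string to a crude matching key."""
    text = EMAIL_RE.sub(" ", raw)   # drop e-mail addresses
    text = PAREN_RE.sub(" ", text)  # drop parenthesised acronyms such as "(OAI)"
    # Assumption: the institution name is whatever precedes the first comma.
    head = text.split(",")[0]
    tokens = re.findall(r"[a-z]+", head.lower())
    words = [ABBREVIATIONS.get(t, t) for t in tokens if t not in STOPWORDS]
    return " ".join(words)


def group_by_key(raw_names):
    """Group raw strings that normalize to the same key."""
    groups = defaultdict(list)
    for raw in raw_names:
        groups[institution_key(raw)].append(raw)
    return dict(groups)


# A few of the raw strings from slide 9.
samples = [
    "Ohio Aerosp. Inst., Cleveland, OH 44142",
    "Ohio Aerospace Institute (OAI)",
    "Ohio State",
    "Ohio State Univ., Columbus, OH",
    "Ohio State Univ., Dept. of Mech. Eng., Columbus, OH 43210, mechprof@osu.edu",
    "Ohio University",
]
groups = group_by_key(samples)
for key, variants in groups.items():
    print(key, "<-", variants)
```

In this sketch "Ohio State" does not collapse into "Ohio State Univ." because the keys differ; that is the kind of questionable entry the slides route to human editorial review rather than matching automatically.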

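Slides 14, 18 and 19 describe weighting candidate name matches with a subject "fingerprint" built from thesaurus terms. The deck does not give the weighting formula, so the sketch below makes its own assumptions: term counts stand in for the weights, cosine similarity stands in for the comparison, and the authority records, subject terms, and function names are invented for illustration only.

```python
import math
from collections import Counter


def parse_name(raw):
    """Split 'Carlson, R. W.' into ('carlson', ['r', 'w'])."""
    surname, _, given = raw.partition(",")
    parts = [p.strip(".").lower() for p in given.split()]
    return surname.strip().lower(), [p for p in parts if p]


def names_compatible(a, b):
    """True if two name strings could plausibly refer to the same person."""
    surname_a, given_a = parse_name(a)
    surname_b, given_b = parse_name(b)
    if surname_a != surname_b:
        return False
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            # An initial matches any given name with the same first letter.
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


def fingerprint(indexed_papers):
    """Count thesaurus terms across a list of per-paper term lists."""
    counts = Counter()
    for terms in indexed_papers:
        counts.update(terms)
    return counts


def cosine(fp_a, fp_b):
    """Cosine similarity between two term-count fingerprints."""
    dot = sum(w * fp_b.get(t, 0) for t, w in fp_a.items())
    norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_score(name, paper_terms, record):
    """Score an incoming (name, indexing) pair against one authority record."""
    if not names_compatible(name, record["name"]):
        return 0.0
    return cosine(Counter(paper_terms), record["fingerprint"])


# Invented authority records; the subject terms are illustrative only.
authority = [
    {"name": "Carlson, R. L.",
     "fingerprint": fingerprint([["Acoustics", "Sonar"],
                                 ["Acoustics", "Signal processing"]])},
    {"name": "Carlson, R. W.",
     "fingerprint": fingerprint([["Geochemistry", "Isotopes"]])},
]

# "Carlson, Roy" is name-compatible with both records, so only the
# subject terms of the incoming paper separate the two candidates.
incoming_name = "Carlson, Roy"
incoming_terms = ["Acoustics", "Sonar"]
scores = {rec["name"]: match_score(incoming_name, incoming_terms, rec)
          for rec in authority}
for candidate, score in scores.items():
    print(candidate, round(score, 3))
```

A production workflow of the kind the slides outline would presumably combine such a score with co-author, institution, and date evidence, and send low or tied scores to editorial review; any thresholds would need tuning per dataset.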