Seed and Expand

Seed and Expand Presentation Transcript

  • 1. Seed+Expand: aggregating the scientific output of the Netherlands, 2000-2010
    Linda Reijnhoudt, Rodrigo Costas, Ed Noyons, Katy Börner, Andrea Scharnhorst
    1 linda.reijnhoudt@dans.knaw.nl, andrea.scharnhorst@dans.knaw.nl; DANS, Royal Netherlands Academy of Arts and Sciences (KNAW), the Hague, the Netherlands
    2 rcostas@cwts.leidenuniv.nl, noyons@cwts.leidenuniv.nl; Center for Science and Technology Studies (CWTS), Leiden University, Leiden, the Netherlands
    3 katy@indiana.edu; Cyberinfrastructure for Network Science Center, School of Library and Information Science, Indiana University, Bloomington, Indiana, United States of America
  • 2. goal: to study the dynamics of the output of Dutch professors (2001-2011); but: lack of data on the output of full professors!
  • 3. the problem: given a Dutch professor in the NARCIS system, find all his/her publications. How to connect bibliographic data from CWTS with the NARCIS system?
  • 4. CWTS Bibliometric publications database:
    ● author
    ● author-order
    ● email (sometimes)
    ● affiliation (sometimes)
    ● journal context
    DANS NARCIS Dutch scholars:
    ● name, initials
    ● DAI
    ● affiliations
    ● organisation
    ● email
    = ?
  • 5. non trivial I
    ● misspelled names
      ○ Van Knienberg instead of Van Knippenberg
    ● different initials / first name
      ○ Johannes and Hans
    ● different formats in the data across sources
      ○ Prefixes separated in the NARCIS system
        ■ P.M.P. | van | Bergen en Henegouwen
      ○ Reduced to initials or concatenated in WoS
        ■ Henegouwen, PMPVE (Henegouwen, Paul M. P. van Bergen En)
  • 6. non trivial II
    ● multiple scholars have the same author name (homonymy)
    ● the same scholar has multiple author names (synonymy)
      ○ changes over time, e.g., due to marriage
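The name-format problems on the two "non trivial" slides can be illustrated with a small Python sketch. It is not the authors' matching procedure: the helper names (`narcis_key`, `wos_key`, `candidate_match`, `similar_surname`), the rule "same last surname token plus matching leading initials", and the 0.85 similarity threshold are all assumptions made for illustration. Such a loose rule only produces candidates; it does nothing about homonymy.

```python
import re
from difflib import SequenceMatcher


def _letters(text):
    """Keep only the letters of an initials string, upper-cased (hypothetical helper)."""
    return re.sub(r"[^A-Za-z]", "", text).upper()


def narcis_key(initials, prefix, surname):
    """Build a (last surname token, initials) key from NARCIS-style fields.

    The prefix (e.g. "van") is accepted but ignored here, because WoS may
    fold it into the initials or drop it altogether.
    """
    return surname.split()[-1].lower(), _letters(initials)


def wos_key(author):
    """Split a WoS-style 'Surname, INITIALS' string into the same key shape."""
    surname, _, initials = author.partition(",")
    return surname.strip().split()[-1].lower(), _letters(initials)


def candidate_match(narcis, wos):
    """Loose test: same last surname token, and WoS initials start with the NARCIS ones."""
    return narcis[0] == wos[0] and wos[1].startswith(narcis[1])


def similar_surname(a, b, threshold=0.85):
    """Flag likely misspellings using difflib's similarity ratio (threshold is assumed)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


narcis = narcis_key("P.M.P.", "van", "Bergen en Henegouwen")
wos = wos_key("Henegouwen, PMPVE")
print(candidate_match(narcis, wos))
print(similar_surname("Van Knienberg", "Van Knippenberg"))
```

Both example pairs from the slides pass these loose tests, which is why a stricter seed (email, address, DAI) is still needed to decide who the author really is.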
  • 7. the raw data
    NARCIS database (DANS)
      ○ 8378 Dutch full professors
        ■ affiliation to Dutch organizations
        ■ name, initials
        ■ email
        ■ DAI
    CWTS bibliometric data system
      ○ close to 23 million publications in more than 12,000 journals
      ○ no unique author identifier for all authors
  • 8. the Gold Standard
    we already know the complete oeuvre of 1400 Dutch full professors, thanks to publication lists manually verified by CWTS (2001-2010)
    USEFUL TO VALIDATE OUR METHODOLOGY
    the 1400 of the 8376 (17%) full professors who already appear in this list form the Gold Standard
  • 9. the sources & main overview
  • 10. Seed+Expand main concept
    ● seed creation, precision
      ○ given a full professor, {initials, name, email, affiliations}
      ○ find one or more publications that are most likely authored by this professor
    ● seed expansion, recall
      ○ given these 'seed' publications,
      ○ find publications by the same author
        1. publication-based classifications
        2. Scopus Author Identifier
  • 11. seed creation
    1. Email seed (EM)
    2. Author Address approaches (*)
       a. Reprint Author (RP)
       b. Direct linkage author-addresses (DL)
       c. Approximate linkage author addresses (AL)
    3. Digital Author Identifier seed (DAI)
    (*) For these seeds, very common names have been excluded
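As an illustration of the simplest seed above, the email seed (EM), here is a minimal Python sketch. The `Professor` and `Publication` record types, their field names, and the example addresses are hypothetical; the slides do not describe the actual CWTS or NARCIS schemas, so this only shows the idea of an exact, case-insensitive email match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Professor:
    # Hypothetical NARCIS-side record: only the fields this sketch needs.
    name: str
    email: str


@dataclass(frozen=True)
class Publication:
    # Hypothetical CWTS-side record: emails are only sometimes present.
    pub_id: str
    author_emails: tuple = ()


def email_seed(professor, publications):
    """Return ids of publications that list the professor's email address."""
    target = professor.email.strip().lower()
    if not target:
        return []  # no email known: this seed cannot be created
    return [
        pub.pub_id
        for pub in publications
        if target in {e.strip().lower() for e in pub.author_emails}
    ]


prof = Professor(name="A. Example", email="A.Example@uni.example")
pubs = [
    Publication("p1", ("a.example@uni.example",)),
    Publication("p2", ("someone.else@uni.example",)),
    Publication("p3"),  # no email recorded for this publication
]
print(email_seed(prof, pubs))
```

An exact email match favours precision over recall, which fits the role the slides give to seed creation.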
  • 12. seed expansion
    1. CWTS Paper-Based Classification (2001-2011)
       ○ based on citation relationships of publications
       ○ 672 meso, over 20K micro disciplines
       ○ micro: +23% unique papers over seed
       ○ meso: +34% unique papers over seed
    2. Scopus Author Identifier (1996-2011)
       ○ +69% unique papers over seed
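The classification-based expansion can be sketched as follows. The slides do not spell out the exact expansion rule, so this Python example assumes one: add every publication that sits in the same (micro or meso) cluster as a seed publication and carries the same normalised author name key. The dictionary layout, the `expand_seed` function, and the cluster labels are hypothetical.

```python
def expand_seed(seed_ids, publications, author_key):
    """Expand a seed within the clusters its publications belong to.

    publications maps pub_id -> {"cluster": str, "authors": set of name keys}.
    Assumed rule: same cluster as any seed publication AND same author key.
    """
    seed_clusters = {publications[pub_id]["cluster"] for pub_id in seed_ids}
    expanded = set(seed_ids)
    for pub_id, record in publications.items():
        if record["cluster"] in seed_clusters and author_key in record["authors"]:
            expanded.add(pub_id)
    return expanded


publications = {
    "p1": {"cluster": "micro-17", "authors": {"jansen_j"}},
    "p2": {"cluster": "micro-17", "authors": {"jansen_j", "bakker_p"}},
    "p3": {"cluster": "micro-99", "authors": {"jansen_j"}},  # homonym in another field
    "p4": {"cluster": "micro-17", "authors": {"bakker_p"}},
}
print(sorted(expand_seed({"p1"}, publications, "jansen_j")))
```

Restricting the expansion to the seed's clusters is what keeps a homonym from an unrelated field (here `p3`) out of the result; broader clusters would trade some of that precision for recall.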
  • 13. evaluation: Gold standard, 2001-2010
  • 14. results
    ● 80% of Dutch professors detected
    ● Micro-disciplines: highest precision (88.5)
    ● Scopus Author id & micro disciplines: same recall (95.9)
    ● This methodology can be applied to other sets and author identity schemes (ORCID, VIVO, etc.)
    ● Further research on disciplinary differences and improvements
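Precision and recall against a gold standard, as reported above, follow the usual set-based definitions. The Python sketch below shows those definitions for a single professor; the function name and the toy publication ids are illustrative, and how the authors aggregated the scores over all gold-standard professors is not stated on the slides.

```python
def precision_recall(retrieved, gold):
    """Set-based precision and recall of retrieved publications against a gold list."""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: 3 of the 4 retrieved papers are correct, 3 of the 4 gold papers are found.
retrieved = {"p1", "p2", "p3", "p9"}
gold = {"p1", "p2", "p3", "p4"}
precision, recall = precision_recall(retrieved, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```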
  • 15. general discussion
    ● a growing number of bibliographic data sources, but still a lack of author-disambiguated data!!
    ● lack of research on how to connect databases
      ○ repositories
      ○ bibliographic databases (WoS, Scopus, etc.)
      ○ altmetrics
    ● e-mail data and DAI/ORCID-like identifiers are powerful linking elements across systems
  • 16. the end ... thank you very much for your attention! questions? comments?
  • 17. five seeds combined: 6753 of 8376 full professors found
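The coverage figures quoted in the slides can be checked with a few lines of Python. The counts are taken from the slides (1400 gold-standard professors and 6753 professors found, each out of 8376); the `share` helper is just an illustrative name.

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to one decimal."""
    return round(100 * part / total, 1)


gold_share = share(1400, 8376)   # gold-standard professors among all full professors
found_share = share(6753, 8376)  # professors found by the five seeds combined
print(gold_share, found_share)
```

These values round to the 17% gold-standard share and the 80% detection rate stated on the earlier slides.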