Seed and Expand

414 views

Published on

Published in: Education, Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
414
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Seed and Expand

  1. 1. Seed+Expand aggregating the scientific output of the Netherlands, 2000-2010 Linda Reijnhoudt, Rodrigo Costas, Ed Noyons, Katy Börner, Andrea Scharnhorst 1 linda.reijnhoudt@dans.knaw.nl, andrea.scharnhorst@dans.knaw.nl DANS, Royal Netherlands Academy of Arts and Sciences (KNAW), the Hague, the Netherlands 2 rcostas@cwts.leidenuniv.nl, noyons@cwts.leidenuniv.nl Center for Science and Technology Studies (CWTS)-Leiden University, Leiden, the Netherlands 3 katy@indiana.edu Cyberinfrastructure for Network Science Center, School of Library and Information Science, Indiana University, Bloomington, Indiana, United States of America
  2. 2. to study the dynamics on the output of Dutch professors (2001-2011) but, lack of data on the output of full professors! goal
  3. 3. the problem given a Dutch professor in the NARCIS system find all his/her publications how to connect bibliographic data from CWTS with the NARCIS system?
  4. 4. CWTS Bibliometric publications database: ● author ● author-order ● email (sometimes) ● affiliation (sometimes) ● journal context DANS NARCIS dutch scholars: ● name, initials ● DAI ● affiliations ● organisation ● email = ?
  5. 5. non trivial I ● misspelled names ○ Van Knienberg instead Van Knippenberg ● different initials / first name ○ Johannes and Hans ● different formats in the data across sources ○ Prefixes separated in the NARCIS system ■ P.M.P. | van | Bergen en Henegouwen ○ Made initials or concatenated in WoS ■ Henegouwen, PMPVE (Henegouwen, Paul M. P. van Bergen En)
  6. 6. non trivial II ● multiple scholars have the same author name (homonymy) ● the same scholar with multiple author names (synonymy) ○ changes over time, e.g., due to marriage
  7. 7. the raw data NARCIS database (DANS) ○ 8378 Dutch full professors ■ affiliation to dutch organizations ■ name, initials ■ email ■ DAI CWTS bibiometric data system ○ close to 23 million publications in more than 12,000 journals ○ no unique author identifier for all authors
  8. 8. the Gold Standard we already know the complete oeuvre of 1400 Dutch full professors, due to manually verified publication lists by CWTS (2001- 2010) USEFUL TO VALIDATE OUR METHODOLOGY the 1400 of the 8376 (17%) full professors who already appear in this list: the Gold Standard
  9. 9. the sources & main overview
  10. 10. Seed+Expand main concept ● seed creation, precision ○ given a full professor, {initials, name, email, affiliations} ○ find one or more publications that are most likely authored by this professor ● seed expansion, recall ○ given these 'seed' publications, ○ find publications by the same author 1. publication-based classifications 2. Scopus Author Identifier
  11. 11. seed creation 1. Email seed (EM) 2. Author Address approaches (*) a. Reprint Author (RP) b. Direct linkage author-addresses (DL) c. Approximate linkage author addresses (AL) 3. Digital Author Identifier seed (DAI) (*) For these seeds, very common names have been excluded
  12. 12. seed expansion 1. CWTS Paper-Based Classification (2001-2011) ○ based on citation relationships of publications ○ 672 meso, over 20K micro disciplines ○ micro: +23% unique papers over seed ○ meso: +34% unique papers over seed 2. Scopus Author Identifier (1996-2011) ○ +69% unique papers over seed
  13. 13. evaluation Gold standard:2001-2010
  14. 14. results ● 80% of Dutch professors detected ● Micro-disciplines: highest precision (88.5) ● Scopus Author id & micro disciplines: same recall (95.9) ● This methodology can be applied to other sets and author identity schemes (ORCID, VIVO, etc.) ● Further research on disciplinary differences and improvements
  15. 15. general discussion ● increasing bibliographic data sources but still lacking author disambiguated data!! ● lack of research on how to connect databases ○ repositories ○ bibliographic databases (WoS, Scopus, etc.) ○ altmetrics ● e-mail data and DAI/ORCID-like identifiers are powerful linking elements across systems
  16. 16. the end ... thank you very much for your attention! questions? comments?
  17. 17. five seeds combined: 6753 of 8376 full professors found

×