An Analysis of the Microsoft Academic Graph

Slides for a presentation at the Fifth International Workshop on Mining Scientific Publications @ JCDL 2016
Paper: http://mirror.dlib.org/dlib/september16/herrmannova/09herrmannova.html

Published in: Technology

  1. An Analysis of the Microsoft Academic Graph
     Drahomira Herrmannova (@robodasha) & Petr Knoth (@petrknoth), KMi, The Open University
  2. Introduction
     • To understand the strengths and limitations of the Microsoft Academic Graph (MAG) when applying it to scholarly communication tasks
     • We study the characteristics of the dataset and perform a correlation analysis with other similar datasets
  3. Questions
     • How complete/sparse are the data?
     • How many of the graph entities have all associated metadata fields populated, and how reliable are those fields?
     • How well are the data conflated/disambiguated?
  4. Dataset
     • Heterogeneous graph comprising more than 120 million publications and the related authors, venues, organizations, and fields of study
     • The largest publicly available dataset of scholarly publications
     • The largest dataset of open citation data
  5. Dataset size
     Papers                  126,909,021
     Authors                 114,698,044
     Institutions                 19,843
     Journals                     23,404
     Conferences                   1,283
     Conference instances         50,202
     Fields of study              50,266
  6. External datasets used
     • CORE (Connecting Repositories)
     • Mendeley
     • Webometrics Ranking of World Universities
     • Scimago Journal and Country Rank
  7. Publication age
  8. Publication age
     • Publication dates from MAG compared with CORE and Mendeley data
     • Intersection found using DOI
     Unique DOIs in the MAG               35,569,305
     Unique DOIs in CORE                   2,673,592
     Intersection MAG/CORE                 1,690,668
     Intersection MAG/CORE/Mendeley        1,314,854
     Intersection without missing data     1,258,611
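
The DOI-based intersection on slide 8 could be computed along the following lines. This is a minimal sketch, not the authors' code: the function names, the DOI-to-year dictionary layout, and the prefix-stripping rules are all assumptions made for illustration.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    prefixes = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")
    for prefix in prefixes:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def common_dois(datasets, require_year=False):
    """Return the DOIs found in every dataset.

    Each dataset is a dict mapping DOI -> publication year (None if missing).
    With require_year=True, records whose year is missing are ignored, which
    corresponds to an "intersection without missing data".
    """
    common = None
    for records in datasets:
        keys = {normalize_doi(doi) for doi, year in records.items()
                if year is not None or not require_year}
        common = keys if common is None else common & keys
    return common or set()


# Hypothetical toy records, not taken from MAG, CORE, or Mendeley.
mag = {"10.1000/A": 2001, "10.1000/b": 1999, "10.1000/c": None}
core = {"doi:10.1000/a": 2001, "10.1000/B": 2000, "10.1000/c": 2010}
mendeley = {"https://doi.org/10.1000/a": 2002, "10.1000/c": 2010}

print(len(common_dois([mag, core])))
print(len(common_dois([mag, core, mendeley])))
print(len(common_dois([mag, core, mendeley], require_year=True)))
```

Normalizing before intersecting matters because the same DOI may be stored with different casing or with a resolver prefix in different datasets.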
  9. Publication age
     • Compared using two methods
       – Spearman's rho correlation coefficient
       – Cumulative distribution function of the difference between the publication years in the different datasets
  10. Publication age
      Spearman’s rho
                   MAG       CORE      Mendeley
      MAG          -         0.9555    0.9656
      CORE         0.9555    -         0.9743
      Mendeley     0.9656    0.9743    -
  11. Publication age
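
The two comparison methods named on slide 9 can be illustrated with the standard library alone. This is a sketch under stated assumptions: the slides do not say whether the year difference is signed or absolute (absolute is assumed here), ties are given average ranks, and the year lists are invented toy data.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def year_difference_cdf(years_a, years_b):
    """Map each absolute difference d to the share of pairs with difference <= d."""
    counts = Counter(abs(a - b) for a, b in zip(years_a, years_b))
    total = sum(counts.values())
    cdf, running = {}, 0
    for diff in sorted(counts):
        running += counts[diff]
        cdf[diff] = running / total
    return cdf


# Hypothetical publication years for four papers matched by DOI.
mag_years = [2001, 1999, 2010, 2005]
core_years = [2001, 2000, 2010, 2003]

print(spearman_rho(mag_years, core_years))
print(year_difference_cdf(mag_years, core_years))
```

A rank correlation only shows that the orderings agree. The cumulative distribution complements it by showing how far apart the recorded years actually are.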
  12. Authors and affiliations
      • Publications are linked to author and affiliation entities
      • All publications are linked to one or more authors; however, 105,980,107 (~83%) publications are not linked to any affiliation
  13. Authors and affiliations
      Mean number of authors per paper               2.66
      Max authors per paper                         6,530
      Mean number of papers per author               2.94
      Max number of papers per author             153,915
      Mean number of collaborators                 116.93
      Max number of collaborators               3,661,912
      Number of papers with affiliation        20,928,914
      Mean number of affiliations per paper          0.23
      Max number of affiliations per paper            181
  14. Authors and affiliations
      • Paper with most authors: "Sunday, 26 August 2012"
      • Author with most papers: "united vertical media gmbh"
  15. Journals and conferences
      • Papers are linked to publication venues
      • Of all papers in MAG (over 126 million), more than 51 million (~40%) are linked to a journal and 1.7 million to a conference entity
  16. Fields of study
      • FoS in MAG are organised hierarchically into four levels (0-3)
        – 47,989 at level 3
        – 1,966 at level 2
        – 293 at level 1
        – 18 at level 0
      • Over 41 million papers are linked to one or more fields of study (~33%)
  17. Fields of study
  18. Fields of study – Mendeley
  19. Citation network
      • We study the network by
        – looking at the citation distribution, to see whether it is consistent with previous studies
        – comparing the citations received by two types of entities in the graph with citations from external datasets
      • Why?
        – To understand the quality of the citation data (not to rank universities or journals)
  20. Citation network
      • 528,682,289 internal citations
      • A significant portion of papers is disconnected from the graph
      Total number of papers                       126,909,021
      Papers with zero references                   96,850,699
      Papers with zero citations                    89,647,949
      Papers with zero references and citations     80,166,717
      Mean citation per paper                             4.17
      Mean citation per "connected" paper                11.31
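
The two means on slide 20 follow from the counts on the same slide. The short check below assumes that a "connected" paper is one with at least one reference or citation, which the slide does not state explicitly; the variable names are illustrative.

```python
# Figures copied from slide 20.
total_papers = 126_909_021
internal_citations = 528_682_289
isolated_papers = 80_166_717  # zero references and zero citations

# Assumption: every paper that is not isolated counts as "connected".
connected_papers = total_papers - isolated_papers

mean_per_paper = internal_citations / total_papers
mean_per_connected_paper = internal_citations / connected_papers

print(connected_papers)
print(round(mean_per_paper, 2))
print(round(mean_per_connected_paper, 2))
```

Under that assumption, roughly 46.7 million papers take part in the citation network, which is why the mean over connected papers is so much higher than the mean over all papers.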
  21. Citation network
      • Comparison of university and journal citation data found in MAG with the Ranking Web of Universities (RWoU) and the Scimago Journal & Country Rank (SJCR) citation data
      • Two comparison methods
        – Size of overlap of the top university/journal lists
        – Pearson’s and Spearman’s correlation (calculated on matching items)
  22. Citation network
      • Matched 1,255 universities between MAG and RWoU (2,105 in total), and 13,050 journals between MAG and SJCR (22,878 in total)
      • 4 journals in common among the top 10
      • 54 among the top 100
      • 677 among the top 1,000 and 1,407 among the top 2,000
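
The "size of overlap" comparison can be sketched as follows. The function name, the dictionary layout (entity name to citation count), the alphabetical tie-breaking, and the toy numbers are assumptions for illustration. Only entities matched in both datasets are ranked, in line with the note that the comparisons use matching items.

```python
def top_k_overlap(citations_a, citations_b, k):
    """Count entities that appear in the top k of both datasets by citations."""
    matched = citations_a.keys() & citations_b.keys()

    def top(citations):
        # Highest citation count first; ties broken alphabetically (an assumption).
        ranked = sorted(matched, key=lambda name: (-citations[name], name))
        return set(ranked[:k])

    return len(top(citations_a) & top(citations_b))


# Hypothetical citation counts, not real MAG or Scimago figures.
mag_citations = {"A": 100, "B": 90, "C": 80, "D": 10, "E": 5}
external_citations = {"A": 50, "C": 45, "D": 40, "B": 1, "F": 99}

for k in (2, 3, 4):
    print(k, top_k_overlap(mag_citations, external_citations, k))
```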
  23. Citation network – top 10 universities
  24. Citation network – top 10 journals
  25. Citation network
      • To quantify how much the lists differ, we created histograms of the differences between the ranks in the MAG and in the external lists
      • To produce the histograms
        – Sorted the data by number of citations found in the external dataset
        – For the top 100/1000 universities/journals, created a histogram of the absolute difference between the rank in MAG and in the external dataset
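
The histogram procedure on slide 25 might look like this in code. It is a sketch rather than the authors' implementation: the bin width, the tie-breaking rule, the restriction to matched entities, and the toy citation counts are all assumptions.

```python
from collections import Counter


def rank_by_citations(citations):
    """Assign rank 1 to the most cited entity; ties broken alphabetically."""
    ordered = sorted(citations, key=lambda name: (-citations[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_difference_histogram(mag, external, top_n, bin_width):
    """Histogram of |rank in MAG - rank in external| for the external top_n.

    Keys are the lower edges of the bins, values are entity counts.
    """
    matched = mag.keys() & external.keys()
    mag_rank = rank_by_citations({name: mag[name] for name in matched})
    ext_rank = rank_by_citations({name: external[name] for name in matched})

    # Sort by the external dataset, as on the slide, and keep the top_n.
    top = sorted(matched, key=ext_rank.get)[:top_n]
    bins = Counter(
        (abs(mag_rank[name] - ext_rank[name]) // bin_width) * bin_width
        for name in top
    )
    return dict(sorted(bins.items()))


# Hypothetical citation counts for four matched entities.
mag_counts = {"A": 100, "B": 90, "C": 80, "D": 10}
external_counts = {"A": 50, "C": 45, "D": 40, "B": 1}

print(rank_difference_histogram(mag_counts, external_counts, top_n=3, bin_width=1))
```

Selecting the top entities by the external ranking, rather than by MAG, means the histogram answers the question "where do the externally top-ranked entities land in MAG?".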
  26. Rank difference – top 100 universities
  27. Rank difference – top 100 universities
      • University citation rank in the MAG differs by more than 200 positions for about 20% of universities in the top 100 of the Ranking Web of Universities list
      • The university citation rank differs by less than 25 positions for less than 40% of universities across these two datasets
  28. Rank difference – top 1000 universities
  29. Rank difference – top 100 journals
  30. Rank difference – top 1000 journals
  31. Citation network
      • Ranks of top universities differ on average by 163, with a standard deviation of 185
      • Ranks of top journals differ on average by 1,203, with a standard deviation of 1,211
      • Correlations calculated on matching items
                         Universities        Journals
      Pearson’s r        0.8773, p -> 0.0    0.8246, p -> 0.0
      Spearman’s rho     0.8266, p -> 0.0    0.8973, p -> 0.0
  32. Conclusions
      • MAG data correlate well with external datasets
      • We have identified certain limitations as to the completeness of links from publications to other entities
      • Existing university and journal rankings (proprietary data) produce substantially different results
        – MAG is open and transparent at the level of individual citations, so it is possible to verify and better interpret the citation data
      • Currently the most comprehensive publicly available dataset of its kind
