1/32
An Analysis of the Microsoft Academic Graph
Drahomira Herrmannova (@robodasha)
&
Petr Knoth (@petrknoth)
KMi, The Open University
2/32
Introduction
• To understand the strengths and limitations of
the Microsoft Academic Graph (MAG) for
applying it to scholarly communication tasks
• We study the characteristics of the dataset
and perform a correlation analysis with other
similar datasets
3/32
Questions
• How complete/sparse are the data?
• How many of the graph entities have all
associated metadata fields populated, and how
reliable are they?
• How well are the data
conflated/disambiguated?
4/32
Dataset
• Heterogeneous graph comprising more than
120 million publications and the related
authors, venues, organizations, and fields of
study
• The largest publicly available dataset of
scholarly publications
• The largest dataset of open citation data
5/32
Dataset size
Papers 126,909,021
Authors 114,698,044
Institutions 19,843
Journals 23,404
Conferences 1,283
Conference instances 50,202
Fields of study 50,266
6/32
External datasets used
• CORE (Connecting Repositories)
• Mendeley
• Webometrics Ranking of World Universities
• Scimago Journal and Country Rank
7/32
Publication age
8/32
Publication age
• Publication dates from MAG compared with
CORE and Mendeley data
• Intersection found using DOI
Unique DOIs in the MAG 35,569,305
Unique DOIs in CORE 2,673,592
Intersection MAG/CORE 1,690,668
Intersection MAG/CORE/Mendeley 1,314,854
Intersection without missing data 1,258,611
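The DOI-based intersection can be sketched as follows. This is a minimal illustration, not the authors' code: the `normalize_doi` helper, its normalisation rules (lowercasing, stripping URL prefixes), and the sample DOIs are all assumptions.

```python
def normalize_doi(doi):
    """Hypothetical normalisation: trim, lowercase, strip common prefixes."""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def doi_intersection(*datasets):
    """Return the set of normalised DOIs present in every dataset."""
    # Records without a DOI (None or empty string) are skipped.
    normalised = [{normalize_doi(d) for d in ds if d} for ds in datasets]
    return set.intersection(*normalised) if normalised else set()


# Made-up DOIs standing in for the three sources.
mag = ["10.1000/A1", "10.1000/b2", "10.1000/c3"]
core = ["https://doi.org/10.1000/a1", "10.1000/c3", "10.1000/d4"]
mendeley = ["doi:10.1000/C3", "10.1000/a1", None]

common = doi_intersection(mag, core, mendeley)
print(sorted(common))
```

DOIs are case-insensitive, so comparing them without some normalisation would likely undercount the overlap.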
9/32
Publication age
• Compared using two methods
– Spearman's rho correlation coefficient
– Cumulative distribution function of the difference
between the publication years in the different
datasets
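Both comparison methods can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the sample years are made up, and Spearman's rho is computed as the Pearson correlation of average ranks.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def year_difference_cdf(years_a, years_b):
    """Empirical CDF of |year_a - year_b|: share of pairs with difference <= d."""
    diffs = Counter(abs(a - b) for a, b in zip(years_a, years_b))
    total = sum(diffs.values())
    cdf, running = {}, 0
    for d in sorted(diffs):
        running += diffs[d]
        cdf[d] = running / total
    return cdf


# Made-up publication years for the same five papers in two datasets.
mag_years = [2001, 2005, 2010, 2012, 2015]
core_years = [2001, 2006, 2010, 2013, 2012]

rho = spearman_rho(mag_years, core_years)
cdf = year_difference_cdf(mag_years, core_years)
print(rho, cdf)
```

The correlation summarises agreement in ordering, while the CDF shows how large the disagreements in year actually are, so the two views complement each other.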
10/32
Publication age
Spearman’s rho MAG CORE Mendeley
MAG - 0.9555 0.9656
CORE 0.9555 - 0.9743
Mendeley 0.9656 0.9743 -
11/32
Publication age
12/32
Authors and affiliations
• Publications linked to author and affiliation
entities
• All publications are linked to one or more authors;
however, 105,980,107 (~83%) publications are not
linked to any affiliation
13/32
Authors and affiliations
Mean number of authors per paper 2.66
Max authors per paper 6,530
Mean number of papers per author 2.94
Max number of papers per author 153,915
Mean number of collaborators 116.93
Max number of collaborators 3,661,912
Number of papers with affiliation 20,928,914
Mean number of affiliations per paper 0.23
Max number of affiliations per paper 181
14/32
Authors and affiliations
• Paper with most authors: "Sunday, 26 August
2012"
• Author with most papers: "united vertical
media gmbh"
15/32
Journals and conferences
• Papers linked to publication venues
• Of all papers in MAG (over 126 million), more
than 51 million (~40%) are linked to a journal
and 1.7 million to a conference entity
16/32
Fields of study
• FoS in MAG organised hierarchically into four
levels (0-3)
– 47,989 at level 3
– 1,966 at level 2
– 293 at level 1
– 18 at level 0
• Over 41 million papers are linked to one or
more fields of study (~33%)
17/32
Fields of study
18/32
Fields of study – Mendeley
19/32
Citation network
• We study the network by
– looking at the citation distribution, to see whether
it is consistent with previous studies
– comparing the citations received by two types of
entities in the graph with citations from external
datasets
• Why?
– To understand the quality of the citation data (not
to rank universities or journals)
20/32
Citation network
• 528,682,289 internal citations
• Significant portion of papers disconnected
from the graph
Total number of papers 126,909,021
Papers with zero references 96,850,699
Papers with zero citations 89,647,949
Papers with zero references and citations 80,166,717
Mean citation per paper 4.17
Mean citation per "connected" paper 11.31
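The two means can be reproduced from the counts on this slide. The sketch below assumes that a "connected" paper is one with at least one reference or at least one citation, that is, any paper not counted in the "zero references and citations" row; the variable names are illustrative.

```python
# Counts taken from the slide.
total_papers = 126_909_021
internal_citations = 528_682_289
isolated_papers = 80_166_717  # zero references and zero citations

# Assumption: "connected" means not isolated.
connected_papers = total_papers - isolated_papers

mean_per_paper = internal_citations / total_papers
mean_per_connected = internal_citations / connected_papers
isolated_share = isolated_papers / total_papers

print(f"connected papers: {connected_papers:,}")
print(f"mean citations per paper: {mean_per_paper:.2f}")
print(f"mean citations per connected paper: {mean_per_connected:.2f}")
print(f"share of isolated papers: {isolated_share:.1%}")
```

Under that assumption the figures round to the 4.17 and 11.31 reported above, and roughly 63% of papers are isolated from the citation graph.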
21/32
Citation network
• Comparison of university and journal citation
data found in MAG with the Ranking Web of
Universities (RWoU) and the Scimago Journal
& Country Rank (SJCR) citation data
• Two comparison methods
– Size of overlap of the top university/journal lists
– Pearson’s and Spearman’s correlation (calculated
on matching items)
22/32
Citation network
• Matched 1,255 universities between MAG and
RWoU (2,105 in total), and 13,050 journals
between MAG and SJCR (22,878 in total)
• 4 common journals among the top 10
• 54 among the top 100
• 677 among the top 1000 and 1407 among the
top 2000
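The overlap of top-k lists can be sketched as below. This is a minimal illustration, not the authors' code: the `top_k_overlap` function, the tie-breaking by name, and the citation counts are assumptions.

```python
def top_k_overlap(citations_a, citations_b, k):
    """Count items that appear in the top k of both sources by citations."""

    def top_k(citations):
        # Sort by descending citation count; break ties by name for determinism.
        ranked = sorted(citations, key=lambda name: (-citations[name], name))
        return set(ranked[:k])

    return len(top_k(citations_a) & top_k(citations_b))


# Made-up citation counts for five matched journals.
mag = {"J1": 900, "J2": 800, "J3": 700, "J4": 100, "J5": 50}
external = {"J1": 500, "J2": 90, "J3": 450, "J4": 480, "J5": 10}

overlap_top3 = top_k_overlap(mag, external, 3)
overlap_top5 = top_k_overlap(mag, external, 5)
print(overlap_top3, overlap_top5)
```

The overlap count ignores order within the top k, so it says whether the same items reach the top but not whether they are ranked alike.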
23/32
Citation network – top 10 universities
24/32
Citation network – top 10 journals
25/32
Citation network
• To quantify how much the lists differ, we
created histograms of the differences between
the ranks in the MAG and in the external lists
• To produce the histograms
– Sorted the data by number of citations found in
the external dataset
– For top 100/1000 universities/journals created a
histogram of absolute difference between rank in
MAG and in external dataset
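The histogram procedure can be sketched as follows. This is an illustrative sketch only: the function names, the bin width, and the citation counts are made up, and computing ranks over matched items alone is an assumption rather than something the slides state.

```python
def ranks_by_citations(citations):
    """Map each item to its 1-based rank by descending citation count."""
    ordered = sorted(citations, key=lambda name: (-citations[name], name))
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def rank_difference_histogram(mag, external, top_n, bin_width):
    """Histogram of |MAG rank - external rank| for the external top_n items."""
    # Assumption: ranks are computed over items matched in both datasets.
    matched = mag.keys() & external.keys()
    mag_rank = ranks_by_citations({k: mag[k] for k in matched})
    ext_rank = ranks_by_citations({k: external[k] for k in matched})

    # Sort by the external dataset and keep its top_n items.
    top = sorted(matched, key=lambda k: ext_rank[k])[:top_n]

    histogram = {}
    for name in top:
        diff = abs(mag_rank[name] - ext_rank[name])
        bin_start = (diff // bin_width) * bin_width
        histogram[bin_start] = histogram.get(bin_start, 0) + 1
    return dict(sorted(histogram.items()))


# Made-up citation counts for six matched universities.
mag = {"U1": 900, "U2": 100, "U3": 700, "U4": 800, "U5": 50, "U6": 60}
external = {"U1": 500, "U2": 480, "U3": 450, "U4": 90, "U5": 80, "U6": 10}

hist = rank_difference_histogram(mag, external, top_n=4, bin_width=2)
print(hist)
```

Each key is the lower edge of a bin of absolute rank differences, and each value is the number of top items that fall into that bin.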
26/32
Rank difference – top 100 universities
27/32
Rank difference – top 100 universities
• University citation rank in the MAG differs by
more than 200 positions for about 20% of
universities in the top 100 of the Ranking Web
of Universities list
• The university citation rank differs by fewer than
25 positions for less than 40% of universities
across these two datasets
28/32
Rank difference – top 1000 universities
29/32
Rank difference – top 100 journals
30/32
Rank difference – top 1000 journals
31/32
Citation network
• Ranks of top universities differ on average by
163, with standard deviation of 185
• Ranks of top journals differ on average by
1,203 with standard deviation of 1,211
• Correlations calculated on matching items
                 Universities         Journals
Pearson’s r      0.8773, p -> 0.0     0.8246, p -> 0.0
Spearman’s rho   0.8266, p -> 0.0     0.8973, p -> 0.0
32/32
Conclusions
• MAG data correlate well with external datasets
• We have identified certain limitations as to the
completeness of links from publications to other
entities
• Existing university and journal rankings
(proprietary data) produce substantially different
results
– MAG is open and transparent at the level of individual
citations, so it is possible to verify and better interpret
the citation data
• Currently the most comprehensive publicly
available dataset of its kind

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

An Analysis of the Microsoft Academic Graph

  • 1. 1/32 An Analysis of the Microsoft Academic Graph Drahomira Herrmannova (@robodasha) & Petr Knoth (@petrknoth) KMi, The Open University
  • 2. 2/32 Introduction • To understand the strengths and limitations of the Microsoft Academic Graph (MAG) for applying it to scholarly communication tasks • We study the characteristics of the dataset and perform a correlation analysis with other similar datasets
  • 3. 3/32 Questions • How complete/sparse are the data? • How many of the graph entities have all associated metadata fields populated, and how reliable are they? • How well are the data conflated/disambiguated?
  • 4. 4/32 Dataset • Heterogeneous graph comprised of more than 120 million publications and the related authors, venues, organizations, and fields of study • The largest publicly available dataset of scholarly publications • The largest dataset of open citation data
  • 5. 5/32 Dataset size • Papers: 126,909,021 • Authors: 114,698,044 • Institutions: 19,843 • Journals: 23,404 • Conferences: 1,283 • Conference instances: 50,202 • Fields of study: 50,266
  • 6. 6/32 External datasets used • CORE (Connecting Repositories) • Mendeley • Webometrics Ranking of World Universities • Scimago Journal and Country Rank
  • 8. 8/32 Publication age • Publication dates from MAG compared with CORE and Mendeley data • Intersection found using DOI • Unique DOIs in the MAG: 35,569,305 • Unique DOIs in CORE: 2,673,592 • Intersection MAG/CORE: 1,690,668 • Intersection MAG/CORE/Mendeley: 1,314,854 • Intersection without missing data: 1,258,611
  • 9. 9/32 Publication age • Compared using two methods – Spearman's rho correlation coefficient – Cumulative distribution function of the difference between the publication years in the different datasets
  • 10. 10/32 Publication age • Spearman’s rho (pairwise) • MAG vs CORE: 0.9555 • MAG vs Mendeley: 0.9656 • CORE vs Mendeley: 0.9743
  • 12. 12/32 Authors and affiliations • Publications linked to author and affiliation entities • All publications are linked to one or more authors; however, 105,980,107 (~83%) publications are not linked to any affiliation
  • 13. 13/32 Authors and affiliations • Mean number of authors per paper: 2.66 • Max authors per paper: 6,530 • Mean number of papers per author: 2.94 • Max number of papers per author: 153,915 • Mean number of collaborators: 116.93 • Max number of collaborators: 3,661,912 • Number of papers with affiliation: 20,928,914 • Mean number of affiliations per paper: 0.23 • Max number of affiliations per paper: 181
  • 14. 14/32 Authors and affiliations • Paper with most authors: "Sunday, 26 August 2012" • Author with most papers: "united vertical media gmbh"
  • 15. 15/32 Journals and conferences • Papers linked to publication venues • Of all papers in MAG (over 126 million), more than 51 million (~40%) are linked to a journal and 1.7 million to a conference entity
  • 16. 16/32 Fields of study • FoS in MAG organised hierarchically into four levels (0-3) – 47,989 at level 3 – 1,966 at level 2 – 293 at level 1 – 18 at level 0 • Over 41 million papers are linked to one or more fields of study (~33%)
  • 18. 18/32 Fields of study – Mendeley
  • 19. 19/32 Citation network • We study the network by – looking at the citation distribution, to see whether it is consistent with previous studies – comparing the citations received by two types of entities in the graph with citations from external datasets • Why? – To understand the quality of the citation data (not to rank universities or journals)
  • 20. 20/32 Citation network • 528,682,289 internal citations • Significant portion of papers disconnected from the graph • Total number of papers: 126,909,021 • Papers with zero references: 96,850,699 • Papers with zero citations: 89,647,949 • Papers with zero references and citations: 80,166,717 • Mean citation per paper: 4.17 • Mean citation per "connected" paper: 11.31
  • 21. 21/32 Citation network • Comparison of university and journal citation data found in MAG with the Ranking Web of Universities (RWoU) and the Scimago Journal & Country Rank (SJCR) citation data • Two comparison methods – Size of overlap of the top university/journal lists – Pearson’s and Spearman’s correlation (calculated on matching items)
  • 22. 22/32 Citation network • Matched 1,255 universities between MAG and RWoU (2,105 in total), and 13,050 journals between MAG and SJCR (22,878 in total) • 4 common journals among the top 10 • 54 among the top 100 • 677 among the top 1,000 and 1,407 among the top 2,000
  • 23. 23/32 Citation network – top 10 universities
  • 24. 24/32 Citation network – top 10 journals
  • 25. 25/32 Citation network • To quantify how much the lists differ, we created histograms of the differences between the ranks in the MAG and in the external lists • To produce the histograms – Sorted the data by number of citations found in the external dataset – For the top 100/1,000 universities/journals, created a histogram of the absolute difference between the rank in MAG and in the external dataset
  • 26. 26/32 Rank difference – top 100 universities
  • 27. 27/32 Rank difference – top 100 universities • University citation rank in the MAG differs by more than 200 positions for about 20% of universities in the top 100 of the Ranking Web of Universities list • The university citation rank differs by less than 25 positions for less than 40% of universities across these two datasets
  • 28. 28/32 Rank difference – top 1000 universities
  • 29. 29/32 Rank difference – top 100 journals
  • 30. 30/32 Rank difference – top 1000 journals
  • 31. 31/32 Citation network • Ranks of top universities differ on average by 163, with standard deviation of 185 • Ranks of top journals differ on average by 1,203, with standard deviation of 1,211 • Correlations calculated on matching items • Universities: Pearson’s r 0.8773 (p -> 0.0), Spearman’s rho 0.8266 (p -> 0.0) • Journals: Pearson’s r 0.8246 (p -> 0.0), Spearman’s rho 0.8973 (p -> 0.0)
  • 32. 32/32 Conclusions • MAG data correlate well with external datasets • We have identified certain limitations as to the completeness of links from publications to other entities • Existing university and journal rankings (proprietary data) produce substantially different results – MAG is open and transparent at the level of individual citations, so it is possible to verify and better interpret the citation data • Currently the most comprehensive publicly available dataset of its kind
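
The rank comparison described on slides 25 and 31 (absolute rank differences for the top items of the external list, plus Spearman's rho on matching items) could be sketched roughly as below. This is an illustration only, not the authors' code: the function names are hypothetical, the inputs are assumed to be plain name-to-citation-count dictionaries, and the tie-breaking rule is an assumption because the slides do not say how ties were handled.

```python
from collections import Counter
from statistics import mean


def ranks_by_citations(citations):
    """Map each name to its 1-based rank, most cited first.

    Assumption: ties are broken alphabetically by name so that the
    result is deterministic. The slides do not specify tie handling.
    """
    ordered = sorted(citations, key=lambda name: (-citations[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_differences(mag, external, top_n):
    """Absolute rank difference for the top_n items of the external list.

    Only items present in both datasets ("matching items") are ranked.
    """
    common = set(mag) & set(external)
    mag_rank = ranks_by_citations({k: mag[k] for k in common})
    ext_rank = ranks_by_citations({k: external[k] for k in common})
    top = sorted(common, key=lambda k: ext_rank[k])[:top_n]
    return {k: abs(mag_rank[k] - ext_rank[k]) for k in top}


def histogram(differences, bin_width):
    """Count differences per bin; key b covers the range [b, b + bin_width)."""
    counts = Counter((d // bin_width) * bin_width for d in differences)
    return dict(sorted(counts.items()))


def spearman_rho(mag, external):
    """Spearman's rho computed as Pearson's r on the ranks.

    No tie correction is applied because ranks_by_citations never
    assigns equal ranks.
    """
    common = sorted(set(mag) & set(external))
    x_rank = ranks_by_citations({k: mag[k] for k in common})
    y_rank = ranks_by_citations({k: external[k] for k in common})
    xs = [x_rank[k] for k in common]
    ys = [y_rank[k] for k in common]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Made-up citation counts, for illustration only.
    mag_counts = {"A": 900, "B": 500, "C": 700, "D": 100, "E": 300}
    ext_counts = {"A": 950, "B": 800, "C": 600, "D": 400, "F": 50}
    diffs = rank_differences(mag_counts, ext_counts, top_n=3)
    print(diffs)
    print(histogram(diffs.values(), bin_width=1))
    print(spearman_rho(mag_counts, ext_counts))
```

Ranking only the matched items keeps the two rank scales comparable. Ranking each full list first and then intersecting would give different numbers, and the slides do not state which variant was used.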