This document summarizes a presentation about Google Scholar's citation graph. It discusses both the strengths and weaknesses of data available from Google Scholar. On the strengths side, it notes Google Scholar's large size and coverage of documents not found in other indexes. However, it also outlines weaknesses like the lack of support for data reuse and incomplete/erroneous metadata. The document describes personal experiences extracting and analyzing Google Scholar data to build new tools and data products, showing what could be possible with open access to the data.
Google Scholar's citation graph: comprehensive, global... and inaccessible
1. Google Scholar’s citation graph:
comprehensive, global… and inaccessible
Alberto Martín-Martín, Emilio Delgado López-Cózar
Open Citations seminar, May 23rd, 2018, Uppsala, Sweden
2. TEAM
EMILIO DELGADO LÓPEZ-CÓZAR, PROFESSOR @ UNIVERSIDAD DE GRANADA
ENRIQUE ORDUÑA-MALEA, ASSISTANT PROFESSOR @ UNIVERSIDAD POLITÉCNICA DE VALENCIA
ALBERTO MARTÍN-MARTÍN, PHD STUDENT @ UNIVERSIDAD DE GRANADA
3. STRUCTURE OF THE TALK
CONTEXT: How we got to this point
STRENGTHS of data available in Google Scholar
WEAKNESSES of data available in Google Scholar
PERSONAL EXPERIENCES getting data from Google Scholar to generate data products
6. CITATION INDEXES
• Selective coverage based on source selection
• Commercial (subscription-based). The license to access data in bulk is separate from the license to the web application
• Inclusive coverage based on parsing webpages
• Non-commercial service offered by Google. Free to access. Doesn’t offer options to access data in bulk (agreements with publishers preclude it)
7. THE NEED FOR OPEN CITATIONS
“The citation graph is one of humankind's most important intellectual achievements” (DARIO TARABORELLI, FOUNDER OF I4OC)
“In this open-access age, it is a scandal that reference lists from journal articles […] are not readily and freely available for use by all scholars.” (DAVID SHOTTON, FOUNDER OF OCC)
“[I]n order to guarantee full transparency and reproducibility of scientometric analyses, these analyses need to be based on open data sources” (ISSI)
9. OVERALL SIZE
• Khabsa & Giles (2014): around 100 million documents (only in English)
• Orduna-Malea et al. (2015): 130-180 million documents (no language restrictions)
• Roughly 2-3 times the size of Web of Science and Scopus. There are disciplinary differences as we’ll see later on
• Khabsa, M., & Giles, C. L. (2014). The number of scholarly documents on the public web. PLoS ONE, 9(5), e93949. https://doi.org/10.1371/journal.pone.0093949
• Orduna-Malea, E., Ayllón, J. M., Martín-Martín, A., & Delgado López-Cózar, E. (2015). Methods for estimating the size of Google Scholar. Scientometrics, 104(3), 931-949. https://doi.org/10.1007/s11192-015-1614-6
10. DOCUMENT COVERAGE (1)
• For a sample of 64,000 highly-cited documents according to Google Scholar, 49% of these documents were not covered by Web of Science
• Martin-Martin, A., Orduna-Malea, E., Harzing, A. W., & Delgado López-Cózar, E. (2017). Can we use Google Scholar to identify highly-cited documents?. Journal of Informetrics, 11(1), 152-163. https://doi.org/10.1016/j.joi.2016.11.008
• Martín-Martín, A., Orduna-Malea, E., Ayllón, J. M., & Delgado López-Cózar, E. (2016). A two-sided academic landscape: snapshot of highly-cited documents in Google Scholar (1950-2013). Revista Española de Documentación Científica, 39(4), 1. https://doi.org/10.3989/redc.2016.4.1405
11. DOCUMENT COVERAGE (2)
• “Classic Papers”: Highly cited documents published in 2006 according to Google Scholar (released in June 2017)
• 252 unique subcategories, 8 broad categories covering all areas of knowledge
• 10 most cited documents in each subcategory. At least 20 citations per paper. Total number of articles: 2,515 (one category had only 5 documents)

| Category | Number of documents | Not found in WoS (%) | Not found in Scopus (%) |
|---|---|---|---|
| Humanities, Literature & Arts | 245 | 28.2 | 17.1 |
| Social Sciences | 510 | 17.5 | 8.6 |
| Engineering & Computer Science | 570 | 11.6 | 2.5 |
| Business, Economics & Management | 150 | 6.0 | 2.7 |
| Health & Medical Sciences | 680 | 2.8 | 0.3 |
| Physics & Mathematics | 230 | 2.2 | 1.7 |
| Life Sciences & Earth Sciences | 380 | 0.5 | 0.5 |
| Chemical & Material Sciences | 170 | 0 | 0 |

Martín-Martín, A., Orduna-Malea, E., & López-Cózar, E. D. (2018, April 23). Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison. http://doi.org/10.17605/OSF.IO/HCX27
12. CITATIONS FOUND (1)
[Figure: average log-transformed citation counts of highly-cited documents published in 2006 according to Google Scholar, based on data from Google Scholar, Web of Science, and Scopus, by broad subject category]
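As a rough illustration of the indicator named in this caption, the sketch below averages log-transformed citation counts. The slide does not say which transform was used; ln(1 + c) is assumed here so that uncited documents can be included, and the counts are made up.

```python
import math

def mean_log_citations(counts):
    """Average of ln(1 + c) over a list of citation counts (assumed transform)."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(math.log1p(c) for c in counts) / len(counts)

# Hypothetical citation counts for the same three documents in two indexes.
gs_counts = [120, 45, 0]
wos_counts = [80, 30, 0]

print(round(mean_log_citations(gs_counts), 3))
print(round(mean_log_citations(wos_counts), 3))
```

Averaging on the log scale keeps a handful of extremely cited documents from dominating the comparison between sources.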
13. CITATIONS FOUND (2)
• 2.30 M: total number of citations to 2,299 highly-cited documents from 2006 covered by GS, WoS, and Scopus
• 95% of all citations found by WoS (1.27 M) are also found by Google Scholar
• 91% of all citations found by Scopus (1.47 M) are also found by Google Scholar
We extracted the list of documents that cite these highly-cited documents from GS (custom script), WoS (web export), and Scopus (web export)
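The overlap percentages on this slide can be thought of as set intersections. The sketch below is only an illustration: the slides do not describe how citing documents were matched across sources, so it assumes each citing document has already been reduced to a shared identifier, and the identifiers are invented.

```python
def share_also_found(reference_index, other_index):
    """Fraction of citing documents in reference_index that also appear in other_index."""
    reference = set(reference_index)
    if not reference:
        raise ValueError("reference_index must not be empty")
    return len(reference & set(other_index)) / len(reference)

# Hypothetical identifiers of documents citing one highly-cited paper.
wos_citing = {"d1", "d2", "d3", "d4"}
gs_citing = {"d1", "d2", "d3", "d5", "d6"}

share = share_also_found(wos_citing, gs_citing)
print(f"{share:.0%} of the WoS citations are also found in GS")
```

The measure is asymmetric: swapping the arguments gives the share of GS citations that WoS also finds, which is a different question.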
14. CITATIONS FOUND (3)
What sources / document types does GS cover that WoS and Scopus do not?
[Figure panels: Humanities, Literature & Arts; Chemical & Material Sciences]
15. CITATIONS FOUND (4)
• Analysis of articles or reviews with a DOI published in 2009 covered by Web of Science and Google Scholar (~1 million documents)

| Citation index | N | Spearman correlation | p-value | Proportion cited in GS | Proportion cited in WoS | Ratio of GS to WoS citations (avg) |
|---|---|---|---|---|---|---|
| Sciences | 863,801 | 0.94 | 0.00 | 0.97 | 0.95 | 1.68 |
| Social Sciences | 109,232 | 0.90 | 0.00 | 0.97 | 0.94 | 2.58 |
| Arts & Humanities | 13,487 | 0.83 | 0.00 | 0.84 | 0.69 | 2.52 |

[Figure panels: Sciences; Social Sciences; Arts & Humanities]
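For readers unfamiliar with the Spearman correlation reported here, a minimal pure-Python sketch follows. It ranks both lists (ties get the mean of their positions) and then computes the Pearson correlation of the ranks. The citation counts are made up, and this is not the script used for the analysis.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical citation counts for five documents in each source.
gs_cit = [10, 25, 3, 40, 7]
wos_cit = [6, 12, 1, 30, 8]
print(round(spearman(gs_cit, wos_cit), 2))
```

A high value means the two sources order documents similarly even when, as the ratio column shows, Google Scholar finds more citations in absolute terms.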
17. LIMITATIONS (1)
CONSEQUENCE OF TRYING TO USE A TOOL FOR A PURPOSE IT WASN’T DESIGNED FOR
• No support for data reuse: no API available. All data extraction has to be done through web scraping
• Agreements with publishers preclude Google from releasing the data
• Tight security measures to prevent massive data collection (CAPTCHAs)
• Persistent identifiers (DOIs, ORCIDs…) are not available to the public (although Google Scholar uses DOIs internally)
• Only 1,000 results can be displayed for any given query
• Inability to fix individual errors. Very small team of people working on GS. Everything is automated.
• Incorrect assignment of documents to researcher profiles (GSC)
18. LIMITATIONS (2)
CONSEQUENCE OF TRYING TO USE A TOOL FOR A PURPOSE IT WASN’T DESIGNED FOR
• Incomplete or erroneous basic metadata, with little or no support for
categorical variables at the document level:
• incomplete lists of authors!!
• truncated journal names!!
• no document types
• no author affiliations (only at author-level in GSC)
• no subject classifications
• Forget about funding acknowledgements…
• Undetected duplicates
• Open to manipulation
• Delgado López-Cózar, E., Robinson-García, N., & Torres-Salinas, D. (2014). The Google Scholar experiment: How to index false papers and manipulate bibliometric indicators. Journal of the Association for Information Science and Technology, 65(3), 446-454. https://doi.org/10.1002/asi.23056
• Cited references of documents are not available
19. LIMITATIONS (3)
IS THERE A WAY TO OVERCOME THEM… that doesn’t involve paying mind-numbingly high amounts of money to some multinational?
• Complementing Google Scholar metadata with data from other freely accessible sources:
  • Going to the source: metadata in publisher websites or repositories
  • CrossRef Metadata API (for everything with a DOI):
    • Complete basic metadata.
    • Cited references available for over 51% of their records (so far Springer, Wiley, and some smaller publishers have agreed to make them public). https://i4oc.org/
    • Author affiliations available from some publishers
• Digging deeper into Google Scholar: they have more metadata, but it’s expensive to extract it (time-wise).
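To make the CrossRef route more concrete, here is a minimal sketch of building a request URL for a DOI and pulling basic metadata out of the JSON response. The endpoint and the field names (`message`, `title`, `container-title`, `author`, `reference`) follow the public Crossref REST API as I understand it; the sample payload is invented and heavily trimmed, and the helper names are hypothetical. No network call is made.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"

def crossref_url(doi):
    """URL of the Crossref REST API record for a DOI."""
    return CROSSREF_WORKS + quote(doi, safe="/")

def summarize_work(payload):
    """Pick out basic metadata from a Crossref 'works' JSON response."""
    msg = payload["message"]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in msg.get("author", [])
    ]
    return {
        "doi": msg.get("DOI"),
        "title": (msg.get("title") or [None])[0],
        "journal": (msg.get("container-title") or [None])[0],
        "authors": authors,
        # The reference list is only present when the publisher has opened it.
        "n_references": len(msg.get("reference", [])),
    }

# Hypothetical response, trimmed to the fields used above.
sample = {
    "status": "ok",
    "message": {
        "DOI": "10.1234/example",
        "title": ["An example article"],
        "container-title": ["Journal of Examples"],
        "author": [
            {"given": "Ada", "family": "Example"},
            {"given": "Lin", "family": "Sample"},
        ],
        "reference": [{"key": "ref1"}, {"key": "ref2"}],
    },
}

print(crossref_url("10.1002/asi.23056"))
print(summarize_work(sample))
```

In practice the payload would come from an HTTP GET on the URL; a record without open references simply yields `n_references` of 0.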
21. SOFTWARE THAT HANDLES GOOGLE SCHOLAR DATA
• Publish or Perish, by Anne-Wil Harzing: https://harzing.com/resources/publish-or-perish
• Scholarometer, from School of Informatics and Computing, Indiana University-Bloomington (currently doesn’t work): http://scholarometer.indiana.edu
• Scholar Plot (generates plots to visualize an author's academic career, using GS data): http://scholarplot.com/help.html
• R Package: scholar: Analyse Citation Data from Google Scholar: https://cran.r-project.org/web/packages/scholar/index.html
• R Package: scholarnetwork: Extract and Visualize Google Scholar Collaboration Networks: https://github.com/pablobarbera/scholarnetwork
• R Package: cv (builds a list of publications by an author by extracting data from GS): https://github.com/bomeara/cv
• Perl module: Web::Scraper::Citations (scrapes data from GS author profiles): https://github.com/JJ/net-citations-scraper
• Tutorial: Put Google Scholar citations on your personal website with R, scholar, ggplot2 and cron: https://www.r-bloggers.com/put-google-scholar-citations-on-your-personal-website-with-r-scholar-ggplot2-and-cron/
• Tutorial: Google scholar scraping with rvest package: https://datascienceplus.com/google-scholar-scraping-with-rvest/
• Tutorial: Scraping Google Scholar to write your PhD literature chapter: https://mystudentvoices.com/scraping-google-scholar-to-write-your-phd-literature-chapter-2ea35f8f4fa1
22. MY WORKFLOW (1)
• No magic recipe. I have to deal with CAPTCHAs and scrape the raw HTML just like everyone else.
• Query embedded in URL: https://scholar.google.com/scholar_lookup?doi=10.1002/asi.23056
[Figure: structure of a Google Scholar record]
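Since the query is embedded in the URL, a list of DOIs can be turned into a list of lookup URLs before any browsing starts. A minimal sketch, assuming the `scholar_lookup?doi=` pattern shown on this slide; the function name is hypothetical.

```python
from urllib.parse import urlencode

LOOKUP_BASE = "https://scholar.google.com/scholar_lookup?"

def scholar_lookup_url(doi):
    """Build the lookup URL for one DOI, keeping the slash in the DOI readable."""
    return LOOKUP_BASE + urlencode({"doi": doi.strip()}, safe="/")

# DOIs taken from the references cited in this talk.
dois = ["10.1002/asi.23056", "10.1016/j.joi.2016.11.008"]
for url in (scholar_lookup_url(d) for d in dois):
    print(url)
```

Generating the full list up front makes it easy to split the work across several machines, as described in the next slide.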
23. MY WORKFLOW (2)
DATA EXTRACTION PROCESS
• Two-step process (+ cleaning):
1. Getting raw HTML:
• Python script that reads a list of queries
• Selenium WebDriver (a headless browser doesn’t work because CAPTCHAs have to be solved)
• Pagination is taken into account (Google Scholar displays a maximum of 10 records per page, and a maximum of 1,000 records per query)
• When a CAPTCHA appears, the script pauses. A human has to solve the CAPTCHA, then the script can resume.
• Each page is saved as an HTML file.
• More computers: faster.
2. Parsing HTML:
• Once all raw files have been downloaded.
• Another Python script, using the Scrapy library, reads these files.
• Using XPath, relevant data is identified within the HTML.
• Data is saved to a CSV file.
3. Cleaning: getting more complete metadata from source website and/or CrossRef…
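The pagination and parsing steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the scripts used for this work: the `q` and `start` URL parameters are assumed from how results URLs usually look, the markup and class names in `SAVED_PAGE` are invented, and ElementTree's limited XPath stands in for Scrapy. Real saved pages are not well-formed XML and would need an HTML-tolerant parser such as the Scrapy selectors mentioned above.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

PAGE_SIZE = 10      # records per results page, as stated in the slide
MAX_RECORDS = 1000  # cap per query, as stated in the slide

def page_urls(query, expected_hits):
    """URLs for every results page of one query, respecting the 1,000-record cap."""
    n = min(expected_hits, MAX_RECORDS)
    return [
        "https://scholar.google.com/scholar?" + urlencode({"q": query, "start": start})
        for start in range(0, n, PAGE_SIZE)
    ]

# Hypothetical, simplified stand-in for one saved results page.
SAVED_PAGE = """
<html><body>
  <div class="record"><h3><a href="https://example.org/a">First title</a></h3>
    <div class="cited">Cited by 12</div></div>
  <div class="record"><h3><a href="https://example.org/b">Second title</a></h3>
    <div class="cited">Cited by 3</div></div>
</body></html>
""".strip()

def parse_page(markup):
    """Extract title, link, and citation count from each record using XPath."""
    root = ET.fromstring(markup)
    rows = []
    for record in root.findall(".//div[@class='record']"):
        link = record.find("./h3/a")
        cited = record.find("./div[@class='cited']")
        rows.append({
            "title": link.text,
            "url": link.get("href"),
            "citations": int(cited.text.rsplit(" ", 1)[-1]),
        })
    return rows

def to_csv(rows):
    """Serialize parsed rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url", "citations"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(len(page_urls("citation analysis", 5000)), "pages to download")
print(to_csv(parse_page(SAVED_PAGE)))
```

Keeping download and parsing separate, as the slide describes, means the saved HTML can be re-parsed whenever the extraction rules change, without facing the CAPTCHAs again.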
24. DATA PRODUCTS (1)
BUILDING NEW TOOLS ON TOP OF GOOGLE SCHOLAR DATA
JOURNAL SCHOLAR METRICS
Presents bibliometric indicators for 9,196 journals in the Arts, Humanities, and Social Sciences, by discipline.
These areas have traditionally presented more difficulties in terms of bibliometric assessment, mainly because of the lack of international, geographically and linguistically unbiased tools.
http://www.journal-scholar-metrics.infoec3.es
25. DATA PRODUCTS (2)
BUILDING NEW TOOLS ON TOP OF GOOGLE SCHOLAR DATA
SCHOLAR MIRRORS
Bibliometric and altmetric indicators for authors, documents, journals, and publishers in the field of Bibliometrics, Scientometrics, Informetrics, Webometrics, and Altmetrics, as reflected in Google Scholar Citations, ResearcherID, ResearchGate, Mendeley, and Twitter.
Data was extracted from Google Scholar, ResearcherID, ResearchGate, Mendeley, and Twitter.
http://www.scholar-mirrors.infoec3.es
26. DATA PRODUCTS (3)
BUILDING NEW TOOLS ON TOP OF GOOGLE SCHOLAR DATA
WORK IN PROGRESS
Google Scholar-powered scientific information system that displays data about all researchers working in Spain.
Dataset (approximate figures): 44,500 profiles in Google Scholar Citations (profile service), 2 million documents, 30 million citations.
Necessary: generating a document-level classification for this collection of GS data. I’m using Ludo Waltman’s and Nees Jan van Eck’s smart local moving algorithm.
27. ONE LAST THOUGHT
• If you make data available, people will build on top of it
• We want to provide a glimpse of what would be possible with Google Scholar data if it stopped being a black box