A presentation delivered online to the Mountain Plains Management Conference at Cedar City, UT on Oct. 18, 2013.
Presented by: Jon Ritterbush of the Calvin T. Ryan Library at the University of Nebraska-Kearney.
This presentation gives an overview of referencing as an academic skill - what it is, why it's important, when you reference, and how and what you need to reference. It was followed by a hands-on demo of Zotero. This presentation is suitable for all university students, regardless of subject or level.
Presentation delivered at the European Summer School for Scientometrics 2014. Vienna, July 10, 2014. http://www.scientometrics-school.eu/programme.html
Since its emergence in 2004, Google Scholar has attracted huge interest in the scientific community. More recently, it has also drawn attention not only as an information source but also as a tool for evaluation purposes.
The launch of products such as Google Scholar Citations and Google Scholar Metrics, and the recent agreement with Thomson Reuters' Web of Science, show that Google is already a major player in the scientific information market.
Its price (free), its huge coverage, and its better attention to the Social Sciences and Humanities (compared to commercial databases) make Google Scholar a potentially valid source for bibliometrics in these areas. Nevertheless, Google Scholar (and its tools) presents many shortcomings that one needs to know in order to perform reliable analyses.
In this session, we will briefly review the pros and cons of Google Scholar and examine the usefulness of tools such as Citations and Metrics.
CAS, a division of the American Chemical Society, organizes, analyzes and shares information that sparks discoveries that improve the lives of people everywhere. We are a global team of scientists and technologists who offer broad-based solutions that drive discovery and provide deep insights for the scientific enterprise. These breakthroughs lead to innovations that range from product improvements to revelations that solve some of the world’s biggest problems in areas such as the treatment of disease, sustainable energy, and the world’s food supply. Together, we will do great things.
Emerging Sources Citation Index – A new edition of Web of Science (State Of Innovation)
Web of Science is a single destination for the world's largest collection of research data, books, journals, proceedings, publications and patents covering the sciences, social sciences, and arts & humanities.
Establishing an Online Access Panel for Interactive Information Retrieval Res... (GESIS)
We propose an online access panel to support the evaluation of Interactive Information Retrieval (IIR) systems. By maintaining an online access panel of IIR system users, we expect that the recurring effort to recruit participants for web-based as well as lab studies can be minimized. We aim to use the online access panel not only for our own development processes but also to open it to other interested researchers in the field of IIR. In this paper we present the concept of the online access panel as well as first implementation details.
PEP-TF: Social Media Monitoring of the Campaigns for the 2013 German Bundesta... (GESIS)
As more and more people use social media to communicate their view and perception of elections, researchers have increasingly been collecting and analyzing data from social media platforms. Our research focuses on social media communication related to the 2013 election of the German parliament [translation: Bundestagswahl 2013]. We constructed several social media datasets using data from Facebook and Twitter. First, we identified the most relevant candidates (n=2,346) and checked whether they maintained social media accounts. The Facebook data was collected in November 2013 for the period of January 2009 to October 2013. On Facebook we identified 1,408 Facebook walls containing approximately 469,000 posts. Twitter data was collected between June and December 2013, finishing with the constitution of the government. On Twitter we identified 1,009 candidates and 76 other agents, for example, journalists. We estimated the number of relevant tweets to exceed eight million for the period from July 27 to September 27 alone. In this document we summarize past research in the literature, discuss possibilities for research with our data set, explain the data collection procedures, and provide a description of the data and a discussion of issues for archiving and dissemination of social media data.
Are topic-specific search term, journal name and author name recommendations ... (GESIS)
In this paper we describe a case study in which researchers in the social sciences (n=19) assess the topical relevance of controlled search terms, journal names and author names that have been compiled automatically by bibliometric-enhanced information retrieval (IR) services. We call these bibliometric-enhanced IR services Search Term Recommender (STR), Journal Name Recommender (JNR) and Author Name Recommender (ANR) in this paper. The researchers in our study (practitioners, PhD students and postdocs) were asked to assess the top n pre-processed recommendations from each recommender for specific research topics they had named in an interview before the experiment. Our results show clearly that the presented search term, journal name and author name recommendations are highly relevant to the researchers' topics and can easily be integrated for search in Digital Libraries. The average precision for top-ranked recommendations is 0.75 for author names, 0.74 for search terms and 0.73 for journal names. The relevance distribution varies considerably across topics and researcher types. Practitioners seem to favor author name recommendations, while postdocs rated author name recommendations the lowest. In the experiment, the small postdoc group (n=3) favored journal name recommendations.
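The precision figures reported above can be computed in a few lines. A minimal sketch of precision@n over binary relevance assessments; the judgement lists below are invented for illustration and are not the study's data:

```python
def precision_at_n(assessments, n):
    """Precision@n: fraction of the top-n recommendations judged relevant.

    `assessments` is a ranked list of binary judgements (1 = relevant)."""
    top = assessments[:n]
    return sum(top) / len(top) if top else 0.0

# Hypothetical judgements for the top 10 recommendations of two recommenders.
judgements = {
    "author names": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "search terms": [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
}
for rec_type, rel in judgements.items():
    print(rec_type, precision_at_n(rel, 10))
```

Averaging these per-topic values across all topics gives the average precision per recommender type reported in the abstract.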
Opening Scholarly Communication in the Social Sciences (GESIS)
2016 Annual EA Conference: “Innovating the Gutenberg Galaxis. The role of peer review and open access in university knowledge dissemination and evaluation”
Opening Scholarly Communication in Social Sciences (OSCOSS) (GESIS)
Our system will initially provide readers, authors and reviewers with an alternative, thus having the potential to gain wider acceptance and gradually replace the old, incoherent publication process of our journals and of others in related fields. It will make journals that are already open access more "open" (in terms of reusability), and it has the potential to serve as an incentive for turning "closed" journals into open access ones.
OSCOSS is funded by the DFG in the Open Access Transformation programme.
Research impact metrics for librarians: calculation & context (Library_Connect)
Slides from the May 19, 2016, Library Connect webinar "Research impact metrics for librarians: calculation & context" with Jenny Delasalle and Andrew Plume.
Watch the webinar at: https://libraryconnect.elsevier.com/library-connect-webinars?commid=199783
In 2014, EC3metrics was present for the first time at the European Summer School for Scientometrics, the international summer school that offers specialized bibliometric training each year to 50 students from all over the world. Our colleague Álvaro Cabezas took part in this forum with a talk on the advantages and limitations of Google Scholar, in a session dedicated to the evaluation of the Social Sciences and Humanities alongside Henk Moed, Philip Purnell, and Juan Gorráiz. In his talk, Álvaro reviewed Google Scholar's various bibliometric products, showing their strengths and weaknesses. He encouraged attendees to experiment with these products, while remaining aware of the precautions to take when using them for evaluative purposes.
Methodology Project:
This project will be completed in steps with several due dates throughout the semester in order to facilitate understanding of the process involved in a research project. For this project you will be responsible for writing an annotated bibliography, creating hypotheses, operationalizing variables, creating survey questions, and creating an interview guide for your chosen topic.
All steps of the project must abide by the following guidelines:
· Project must have a cover sheet with: title, name, date of submission.
· Pages must be numbered.
· Written in Times New Roman 12-point font, double spaced, with one-inch margins on all sides (NOTE: the default in Word is 1.25).
· Spell-check and grammar-check the document prior to submission.
· Proof-read the document prior to submission.
· Cite sources using the APA format.
The entire project is worth a maximum of 200 points or 50% of your final grade!
Step One ~ Annotated Bibliography:
When searching for sources, you must find relevant academic journal/periodical articles. This means you cannot use popular magazines, newspaper articles, or other non-academic sources! You also cannot use books for this assignment.
Scholarly journal articles vs. non-scholarly sources:
· Content: original research or a comprehensive review of existing research vs. general information, typically current events or a broad overview of the topic.
· Format: a structured article with abstract, literature review, methodology, conclusion, and bibliography vs. no structured format.
· Audience: professionals/students in a particular field of study vs. the general public.
· Authors: scholars or experts in the field (articles are signed and credentials are provided) vs. hired journalists or professional writers.
· Evidence: a thorough bibliography or "cited references" provided vs. no bibliography (research/reports may be mentioned in the article).
· Purpose: to inform of scholarly/scientific research vs. to entertain or inform the general public.
· Examples: Criminology; Criminology & Public Policy; Social Problems; Criminal Justice Review vs. Time; Newsweek; Sports Illustrated; Rolling Stone; National Geographic.
It will be useful for you to search for articles using a computerized search program such as EbscoHost or Sociofile, both of which can be accessed through the MSU library's database section using the instructions provided below. When in doubt, the library reference section personnel can usually be of assistance. Be careful about relying on your favorite search engine (such as Google) to find academic sources, unless you are using a search engine oriented toward scholarly work (such as http://scholar.google.com/).
How to Access the MSU Databases to Find Scholarly Articles
(1) Go to the MSU homepage (www.montclair.edu) and under “Menu” click on “Library.”
(2) Click on “databases” on the right.
(3) On the right click on “Academic Search Complete.”
(4) You will be prompted to enter your username and password.
(5) You will now see the EbscoHost search interface.
Recent advances in the project EXCITE – Extraction of Citations from PDF Docu... (GESIS)
Workshop on Open Citations
SEPTEMBER 3-5, 2018 | BOLOGNA, ITALY
Presentation of the EXCITE project
Demo system: https://excite.informatik.uni-stuttgart.de/excite
Contextualised Browsing in a Digital Library’s Living Lab (GESIS)
Contextualisation has proven to be effective in tailoring search results towards the user's information need. While this is true for a basic query search, the use of contextual session information during exploratory search, especially at the level of browsing, has so far been underexposed in research. In this paper, we present two approaches that contextualise browsing at the level of structured metadata in a Digital Library (DL): (1) one variant is based on document similarity, and (2) one variant utilises implicit session information, such as queries and the different document metadata encountered during a user's session. We evaluate our approaches in a living lab environment using a DL in the social sciences and compare our contextualisation approaches against a non-contextualised approach. For a period of more than three months, we analysed 47,444 unique retrieval sessions that contain search activities at the level of browsing. Our results show that a contextualisation of browsing significantly outperforms our baseline in terms of the position of the first clicked item in the result set.
The mean rank of the first clicked document (measured as mean first relevant, MFR) was 4.52 using a non-contextualised ranking, compared to 3.04 when re-ranking the result lists based on similarity to the previously viewed document. Furthermore, we observed that both contextual approaches show a noticeably higher clickthrough rate. A contextualisation based on document similarity leads to almost twice as many document views compared to the non-contextualised ranking.
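The MFR metric quoted above is straightforward to compute: average, over sessions, the rank of the first clicked result. A minimal sketch with invented session data (the study's 47,444 real sessions are obviously not reproduced here):

```python
def mean_first_relevant(sessions):
    """MFR: mean 1-based rank of the first clicked item per session.

    `sessions` is a list of per-session click positions; sessions
    without any click are skipped."""
    first_ranks = [min(clicks) for clicks in sessions if clicks]
    return sum(first_ranks) / len(first_ranks)

# Hypothetical sessions: each entry lists the 1-based result positions clicked.
baseline   = [[5], [4, 7], [6], [3]]
contextual = [[2], [3, 5], [4], [3]]
print(mean_first_relevant(baseline))    # lower MFR = better ranking
print(mean_first_relevant(contextual))
```

With these made-up sessions the baseline MFR is 4.5 and the contextualised MFR is 3.0, mirroring the direction (not the data) of the 4.52 vs. 3.04 result.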
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ... (GESIS)
The objective of the OSCOSS research project on "Opening Scholarly Communication in the Social Sciences" is to build a coherent collaboration environment that facilitates scholarly communication workflows of social scientists in the roles of authors, reviewers, editors and readers. This paper presents the implementation of the core of this environment: the integration of the Fidus Writer academic word processor with the Open Journal Systems (OJS) submission and review management system.
Using co-authorship networks for author name disambiguation (GESIS)
With the increasing size of digital libraries (DLs) it has become a challenge to identify author names correctly and assign publications to them. The situation becomes more critical when different persons share the same name (homonym problem) or when the names of authors are presented in several different ways (synonym problem). This paper focuses on homonym names in the computer science bibliography DBLP. The goal of this study is to implement and evaluate a method which uses co-authorship networks in order to disambiguate homonym names, especially common names. The results show that the implemented method has a good performance and can be used for author name disambiguation of sparse bibliographic records.
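One minimal way to realize the co-authorship heuristic described above is to merge the publication records of an ambiguous name whenever they share a co-author, i.e. to compute connected components over the co-authorship graph. This is a simplified sketch, not the paper's exact method, and the record data is invented:

```python
from collections import defaultdict

def disambiguate(records):
    """Cluster records of one ambiguous author name into probable persons.

    Heuristic: records sharing at least one co-author belong to the same
    person (connected components, via union-find over record indices)."""
    by_coauthor = defaultdict(list)
    for i, coauthors in enumerate(records):
        for c in coauthors:
            by_coauthor[c].append(i)

    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for idxs in by_coauthor.values():       # merge records sharing a co-author
        for i in idxs[1:]:
            union(idxs[0], i)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical records for one ambiguous name: each is a set of co-authors.
records = [
    {"A. Smith", "B. Jones"},   # 0
    {"B. Jones"},               # 1  shares B. Jones with 0 -> same person
    {"C. Miller"},              # 2  no overlap -> different person
]
print(disambiguate(records))    # [[0, 1], [2]]
```

A real system would add weights, transitive evidence and name-string similarity; the component structure above is only the skeleton of such a method.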
Industrie 4.0 is a future-oriented project within the German federal government's high-tech strategy, primarily intended to advance the computerization of manufacturing technology.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is growing interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach to LLM context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help democratize access to company data assets and boost the performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Towards a Semantic Citation Index for the German Social Sciences
1. Towards a Semantic Citation Index for the German Social Sciences
William Dinkel, Philipp Mayr, Frank Sawitzky, Andreas Strotmann*
GESIS – Leibniz-Institut für Sozialwissenschaften, Köln
*alphabetic ordering of names
2. The Problem
● German sociology / political science research output / impact coverage in SSCI
– SOLIS: ~ 1/3 each of books, journal articles, chapters
● Cover ~ 50% of German researchers' “relevant” output*
– ~1/3 of core journals covered in SSCI**
– So, ~10% of literature indexed there
– Very low percentage of cited literature indexed in SSCI***
● * Research rating exercise Sociology, Wissenschaftsrat
● ** compared to SOLIS “class A” journals
● *** Chi (IfQ) study of core German political science journals
3. The Problem (ctd.)
● Citation culture in the social sciences
– Citations are important
● Perhaps even more so than in the natural sciences
– Some authors are extremely highly cited (Weber, Marx...)
● Suspect very high(!!) Gini coefficient in distribution
● But: it is their books (not articles) that are highly cited!
– Significant fraction of citations are contrastive
– Datasets (survey results) highly mentioned, not cited
– Multilingual citation environment
4. The Need
● German social scientists & SSCI
– They consider their field inadequately represented in “the” citation index
– But use it quite heavily anyway
● e.g. for research, evaluation
● Survey of sociologists and political scientists, GESIS
5. The Need (ctd.)
● We need a citation index for the (German) social sciences
– Existing citation indexes frankly inadequate
● No reasonable effort in sight to resolve this
– Hence, we need to build our own
● If we want to do serious bibliometrics on SocSci
● If we want to provide a decent social science citation index in, e.g., sociology or political science
6. The Need (ctd.)
● We need an open semantic citation index for the (German) social sciences
– Incorporate referential semantics into search engine
● e.g., reliable hyperlinks to referenced articles
● e.g., equivalence or hierarchy relations for translations, aggregations
– Publish referential semantics as linked open data
● Allow other institutions to discover references to their holdings in our database(s)
● Invite them to offer the same service to us, too
– Bibliometrics requires cleaned/disambiguated data!
7. The Long-Term Goal
A globally distributed open semantic citation index
● Based on digital full-text collections (cooperate with publishers)
– Semi-automatic / Computer-aided
– Algorithms + professional indexers (authority files) + crowd sourcing +...
● Reference extraction (with contexts)
– Enables sentiment analysis (important in social sciences)
● Reference matching
– Enables referential semantics
● Open reference semantics information exchange
– „<this> paper indexed in our collection cites <that> paper indexed in yours“
8. Sowiport – German Social Sciences Research Information
● GESIS' Sowiport portal: Single access point to 18 databases, including
– 6 Cambridge Scientific Abstracts databases on social sciences
– GESIS' own SOLIS (literature) and SOFIS (projects) RISs
– SSOAR (Social Science Open Access Repository) @ GESIS
● Goal: Extend to social science citation index
– CSA comes with cited refs for some docs
– SSOAR – extract refs from OA full text and index in Sowiport
– Extract links to data sets / surveys used but not cited from full texts
– Crawl Google Scholar for citations to “our” docs
– Link to/from RepEc (and other) data ...
9. First Steps: National CSA Social Sciences Citation Index
● Cambridge Scientific Abstracts – Social Sciences
– 6 CSA databases offered & run by GESIS
● National research licence for Germany
– Include >8 mio references
● A good starting point
● Recently activated in Sowiport
● ~25-30% refs found to link to other records
– Using simple matching algorithm
– Biased towards accuracy (>90%), not recall
10.
11. First Steps: CSA Reference Matching
Reference matching is much(!) harder in social sciences
● Social science publication culture
– Books & chapters, and articles
● Published in roughly equal numbers, books cited most
– Multilingual publishing
● English is not the only language
● Publications may be cited in translation, different editions
– Broad referencing behaviour
● Large proportion of references to non-source items
=> A first-try high-precision match rate of ~25-30% is an excellent result
● Close to expected rate of references to journal articles
12. CSA References in GESIS' Sowiport Database
● Each full record contains „references“ and „cited-by“ information
– Some with actionable links to full records
● Combines WoS/Scopus and Google Scholar approaches to citation index construction
13. First Steps: Citation Extraction
● SSOAR full texts
– First successful experiments to extract references from full text
● Based on RepEc's ParsCit
● Extended to German citation styles
– First successful experiments to identify acknowledgments of large surveys in text
14. Next Steps: “Haus der Sozialwissenschaften”
● Goal: Digital Special Collection for German Social Scientists
– Digital access to full literature in one place
● Large parts unfortunately only accessible in-house
● Collect existing digital versions from “all” sources
● Digitize “important” literature where necessary
● Full text of literature, survey data, project descriptions...
● Joint DFG application with Sondersammelgebiet Sozialwissenschaften, Univ.- & Stadt-Bibl. Köln
15. Next Step: “GESIS Application Laboratory Web 3.0”
● Full text collection and processing results available in toto to visiting researchers
– Social scientists
– Computer scientists
– Computer linguists
– Bibliometricians: You are invited!!!
● Upgrade database
– e.g. disambiguation of authors, institutions, titles
– e.g. incorporation of external authority files / semantic web
16. Experiment: E-Traces
● Goal: Tracking ideas through the sociology literature (“text re-use”)
– Experiment (ongoing): attempt to categorize citation contexts as positive/neutral/negative (sentiment analysis)
– BMBF funded project with U Leipzig, U Göttingen
● Long term use: identify negative citations and contrastive co-citations for social science citation index
17. Summary
● For GESIS' core covered social sciences (German sociology, political science), traditional citation indexes are inadequate
● and Google Scholar only provides “cited by” info
● Yet, GESIS' core audience uses them
● and complains about their inadequacies
● Bibliometrics requires an adequate citation index for reliable results (given typical distributions)
● but no improvements in sight for classic indexes
● Therefore, we need to build our own
● and we have the expertise at GESIS to succeed where others have failed
● and we have taken the first few steps in this direction
18. Summary (ctd.)
● In the long run, we would like
– A citation index that is
● Semantic (with explicit referential semantics)
● Distributed (each institution builds their own)
● Open (each institution shares semantics as LOD)
● Global (implemented world wide)
● Cooperative (indexers+researchers contribute)
● Computer-aided (software to get started, people to improve)
– Based on best practices we hope to develop
20. Two Models of Citation Graphs
Bipartite (Classic IR) Model: Citing and Cited Partitions
• Citing nodes: full bibliographic records
• Cited nodes: „keys“, e.g.
– First author name & initials + year of publication + journal key + volume + number + page
Uniform Model: Interconnected Documents
• All nodes: bibliographic records
– Citing nodes have full records
– Cited nodes have mostly simplified records
– „Matched“ cited nodes have full records
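The „key“ of the bipartite model can be sketched as a normalized concatenation of the cited-reference fields listed on this slide. The dictionary field names below are assumptions for illustration, not a real database schema:

```python
def match_key(ref):
    """Build a bipartite-model 'key' for a cited reference:
    first author + year + journal key + volume + page, lowercased
    and joined with '|' so equal references collide on the same key."""
    fields = ("first_author", "year", "journal", "volume", "page")
    return "|".join(str(ref.get(f, "")).lower() for f in fields)

# Hypothetical parsed reference (assumed field names, for illustration).
ref = {"first_author": "Weber M", "year": 1922, "journal": "ASS",
       "volume": 47, "page": 1}
print(match_key(ref))   # weber m|1922|ass|47|1
```

In the uniform model the same key would instead be used to decide whether a simplified cited node can be merged with a full bibliographic record.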
21. Citation Matching
• Goal: Citation network
–Unique nodes for documents
• Sub tasks:
–Match cited references to each other
–Match cited references to full records
–Match full records across databases
22. Matching Citations to Full Records
„Internal“ matching
● Direct access to full database(s)
● Options: match-key based or algorithmic matching
„External“ matching
● Access only via search engine
● Options: matching against same or different database
23. Scopus Citations
• Cited reference info contains
–Up to 8 author names (family+inits)
• Including last author
• Frequently as cited (not standardized or corrected)
–Publication year, title, journal name/vol./nr./p.
• Frequently as cited
–Reasonably well parsable, not normalized
24. Matching Scopus Citations to Scopus Full Records
External matching: Scopus search engine
● „Algorithm“: parse Scopus reference into subfields, construct complex search queries for Scopus engine, download resulting full records, choose best fit
● High precision searches: complex searches allowed, many searchable fields
– Improve recall by successively vaguer queries
● Small number of downloads allowed, so many queries needed to construct a sizable citation index
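The "successively vaguer queries" step can be sketched as a ladder of queries that drops one field per round, starting from the most restrictive combination. Field names and query syntax here are purely illustrative, not the actual Scopus search API:

```python
def query_ladder(ref):
    """Build successively vaguer queries for one parsed cited reference.

    Starts with all available fields ANDed together (high precision),
    then drops the least vital field each round to improve recall."""
    fields = [("TITLE", ref.get("title")),
              ("AUTH", ref.get("first_author")),
              ("PUBYEAR", ref.get("year")),
              ("VOLUME", ref.get("volume"))]
    fields = [(k, v) for k, v in fields if v]      # keep only parsed fields
    ladder = []
    for n in range(len(fields), 0, -1):
        ladder.append(" AND ".join(f"{k}({v})" for k, v in fields[:n]))
    return ladder

# Hypothetical reference; a matcher would try each query until one hits.
for q in query_ladder({"title": "No More Abuse", "first_author": "Voice UK",
                       "year": 2000}):
    print(q)
```

A matcher would stop at the first query returning a plausible candidate, then pick the best fit among the downloaded records.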
25. Matching Scopus Citations to PubMed Full Records
Cross-DB External Match: Scopus/Medline
● „Algorithm“: parse Scopus reference, construct PubMed batch citation matcher queries, download matched PubMed(!) records
– Only for biomedical fields
– Result is a citation network of PubMed records, not Scopus
– Requires matching of Scopus citing records as well
● Either direction (Scopus <-> PubMed)
● Both include PubMed IDs
26. Matching Web of Science References to WoS Full Records
WoS cited reference info contains
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● More and more frequently a DOI
No title included!
27. Matching WoS Cited References to WoS Records
External matching via WoS web search
● Only small queries supported
– Many downloads necessary
● Crucial search fields not supported (vol., num.)
– Therefore highly ambiguous results to be expected
● Requires translation of source title from code to full title
● Requires algorithmic filtering of the correct hit from a long result list
28. Matching WoS References to WoS
● Internal matching
● Kompetenzzentrum Bibliometrie has full local copy of WoS data
● Experiment: good „match key“ to support this?
– Dinkel (2011), ISSI
– Results in error estimates for references
29. Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
– To be extended with additional sources of cited refs info
● Nationwide licensing scheme for Germany administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' „Sowiport“ social sciences portal
– Now including ~8.5 mio cited references
● No matchings to full records provided by ProQuest
● Early experimental results available on portal
– Focus on precision, not recall
30. Citation Matching in CSA
„Algorithm“:
● Internal matching
– However, across multiple CSA databases
● Parse references; construct search queries (Solr)
– exact title and year
– or fuzzy title and year and ISSN;
– choose first match
● Favors precision over recall
– Fuzzy match only for journal literature, for example
● Research to be continued!
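The two-stage strategy on this slide (exact title and year first, then a fuzzy title match restricted by year and ISSN, taking the first hit) can be sketched in memory as a stand-in for the actual Solr queries; the record schema below is assumed:

```python
import difflib

def match_reference(ref, records):
    """Two-stage reference match favoring precision over recall.

    Stage 1: exact title + year.  Stage 2: fuzzy title, accepted only
    when year and ISSN also agree.  Returns the first hit or None."""
    # Stage 1: exact title and year
    for rec in records:
        if rec["title"].lower() == ref["title"].lower() and rec["year"] == ref["year"]:
            return rec
    # Stage 2: fuzzy title, constrained by year and ISSN (journal literature only)
    for rec in records:
        if rec["year"] == ref["year"] and rec.get("issn") == ref.get("issn"):
            ratio = difflib.SequenceMatcher(
                None, rec["title"].lower(), ref["title"].lower()).ratio()
            if ratio > 0.9:
                return rec
    return None   # favor precision: no match rather than a doubtful one

records = [{"title": "Social Capital.", "year": 2000, "issn": "1234-5678"}]
ref = {"title": "Social Capital", "year": 2000, "issn": "1234-5678"}
print(match_reference(ref, records)["title"])   # Social Capital.
```

Restricting the fuzzy stage to records that share year and ISSN is one way to keep the false-positive rate near the ~1% reported for CSA on slide 32.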
31. Experiments - Datasets
Caveat
● Scopus/PubMed and WoS experiments run on stem cell research field (biomedical area)
– < 100k citing docs, ~1mio references
– >95% refs are to journal articles
● CSA experiment run on social sciences databases
– ~1mio full records, ~10mio references
● Only recent records contain refs
● Many(!!) refs to non-journal articles
32. Some Rough Numbers
● Scopus ↔ PubMed full record matching
– >95% match rate
● Scopus references → Scopus/PubMed full record
– ~90% match rate „exact“ + ~5% fuzzy match
– ~1% false positives needed to be filtered out
● WoS references → WoS full record
– ~90% match rate
– >>50% false positives needed to be filtered out
● CSA references → CSA full record
– ~30% match rate
– ~1% false positives
33. CSA reference information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num/vol/p., ISSN
– Format changes, though
● Mostly automatically parsed, as fields frequently mis-assigned
● Example (book):
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
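Given the tagged format in the example above, a CSA reference string can be split into its subfields with a small regular expression. A sketch based only on the tags visible on this slide:

```python
import re

def parse_csa_reference(raw):
    """Extract tagged subfields (<CI>, <CA>, <CT>, <CY>, <CZ>, ...) from a
    CSA cited-reference string into a dict keyed by tag name."""
    # \1 backreference ensures each opening tag is closed by the same tag
    return dict(re.findall(r"<(C[A-Z])>(.*?)</\1>", raw))

# The book example from this slide
raw = ("<CI>200601317</CI><CA>Voice UK</CA>"
       "<CT>No More Abuse.</CT><CY>2000</CY>"
       "<CZ>Derby: Voice UK</CZ>")
fields = parse_csa_reference(raw)
print(fields["CT"], fields["CY"])   # No More Abuse. 2000
```

Because fields are frequently mis-assigned in the source data, a production parser would validate each extracted value (e.g. that <CY> is a year) rather than trust the tags alone.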
34. Discussion
● Plenty of research opportunities to improve matching of non-journal literature references to source records
– e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
– e.g. by crawling Google Scholar for reference links
– You are invited to try your hands at this, too!
● See below: GESIS Application Laboratory