A presentation delivered online to the Mountain Plains Management Conference at Cedar City, UT on Oct. 18, 2013.
Presented by: Jon Ritterbush of the Calvin T. Ryan Library at the University of Nebraska-Kearney.
This presentation gives an overview of referencing as an academic skill - what it is, why it's important, when you reference, and how and what you need to reference. It was followed by a hands-on demo of Zotero. This presentation is suitable for all university students, regardless of subject or level.
Presentation delivered at the European Summer School for Scientometrics 2014. Vienna, July 10, 2014. http://www.scientometrics-school.eu/programme.html
Since its emergence in 2004, Google Scholar has attracted huge interest in the scientific community. More recently, it has also drawn attention not only as an information source but also as a tool for evaluation purposes.
The launch of products such as Google Scholar Citations and Google Scholar Metrics, and the recent agreement with Thomson Reuters' Web of Science, show that Google is already a major player in the scientific information market.
Its price (free), its huge coverage, and its better attention to the Social Sciences and Humanities (compared to commercial databases) make Google Scholar a potentially valid source for bibliometrics in these areas. Nevertheless, Google Scholar (and its tools) presents many shortcomings that one needs to know in order to perform reliable analyses.
In this session, we will briefly review the pros and cons of Google Scholar and examine the usefulness of tools such as Citations and Metrics.
CAS, a division of the American Chemical Society, organizes, analyzes and shares information that sparks discoveries that improve the lives of people everywhere. We are a global team of scientists and technologists who offer broad-based solutions that drive discovery and provide deep insights for the scientific enterprise. These breakthroughs lead to innovations that range from product improvements to revelations that solve some of the world’s biggest problems in areas such as the treatment of disease, sustainable energy, and the world’s food supply. Together, we will do great things.
Emerging Sources Citation Index – A new edition of Web of Science (State Of Innovation)
Web of Science is a single destination for the world's largest collection of research data, books, journals, proceedings, publications and patents covering the sciences, social sciences, and arts & humanities.
Establishing an Online Access Panel for Interactive Information Retrieval Res... (GESIS)
We propose an online access panel to support the evaluation of Interactive Information Retrieval (IIR) systems. By maintaining an online access panel of IIR system users, we expect that the recurring effort to recruit participants for web-based as well as lab studies can be minimized. We aim to use the online access panel not only for our own development processes but also to open it to other interested researchers in the field of IIR. In this paper we present the concept of the online access panel as well as first implementation details.
PEP-TF: Social Media Monitoring of the Campaigns for the 2013 German Bundesta... (GESIS)
As more and more people use social media to communicate their view and perception of elections, researchers have increasingly been collecting and analyzing data from social media platforms. Our research focuses on social media communication related to the 2013 election of the German parliament [translation: Bundestagswahl 2013]. We constructed several social media datasets using data from Facebook and Twitter. First, we identified the most relevant candidates (n=2,346) and checked whether they maintained social media accounts. The Facebook data was collected in November 2013 for the period of January 2009 to October 2013. On Facebook we identified 1,408 Facebook walls containing approximately 469,000 posts. Twitter data was collected between June and December 2013, finishing with the constitution of the government. On Twitter we identified 1,009 candidates and 76 other agents, for example, journalists. We estimated the number of relevant tweets to exceed eight million for the period from July 27 to September 27 alone. In this document we summarize past research in the literature, discuss possibilities for research with our data set, explain the data collection procedures, and provide a description of the data and a discussion of issues for archiving and dissemination of social media data.
Are topic-specific search term, journal name and author name recommendations ... (GESIS)
In this paper we describe a case study in which researchers in the social sciences (n=19) assess the topical relevance of controlled search terms, journal names and author names that have been compiled automatically by bibliometric-enhanced information retrieval (IR) services. We call these bibliometric-enhanced IR services Search Term Recommender (STR), Journal Name Recommender (JNR) and Author Name Recommender (ANR) in this paper. The researchers in our study (practitioners, PhD students and postdocs) were asked to assess the top n pre-processed recommendations from each recommender for specific research topics they had named in an interview before the experiment. Our results show clearly that the presented search term, journal name and author name recommendations are highly relevant to the researchers' topics and can easily be integrated for search in Digital Libraries. The average precision for top-ranked recommendations is 0.75 for author names, 0.74 for search terms and 0.73 for journal names. The relevance distribution varies considerably across topics and researcher types. Practitioners seem to favor author name recommendations, while postdocs rated author name recommendations the lowest. In the experiment, the small postdoc group (n=3) favored journal name recommendations.
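The precision figures reported above can be computed in a few lines. A minimal sketch of precision@n over binary relevance assessments; the judgement lists below are invented for illustration and are not the study's data:

```python
def precision_at_n(assessments, n):
    """Precision@n: fraction of the top-n recommendations judged relevant.

    `assessments` is a ranked list of binary judgements (1 = relevant)."""
    top = assessments[:n]
    return sum(top) / len(top) if top else 0.0

# Hypothetical judgements for the top 10 recommendations of two recommenders.
judgements = {
    "author names": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "search terms": [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
}
for rec_type, rel in judgements.items():
    print(rec_type, precision_at_n(rel, 10))
```

Averaging these per-topic values across all topics gives the average precision per recommender type reported in the abstract.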
Opening Scholarly Communication in the Social Sciences (GESIS)
2016 Annual EA Conference: “Innovating the Gutenberg Galaxis. The role of peer review and open access in university knowledge dissemination and evaluation”
Opening Scholarly Communication in Social Sciences (OSCOSS) (GESIS)
Our system will initially provide readers, authors and reviewers with an alternative, thus having the potential to gain wider acceptance and gradually replace the old, incoherent publication process of our journals and of others in related fields. It will make journals that are already open access more "open" (in terms of reusability), and it has the potential to serve as an incentive for turning "closed" journals into open access ones.
OSCOSS is funded by the DFG in the Open Access Transformation programme.
Research impact metrics for librarians: calculation & context (Library_Connect)
Slides from the May 19, 2016, Library Connect webinar "Research impact metrics for librarians: calculation & context" with Jenny Delasalle and Andrew Plume.
Watch the webinar at: https://libraryconnect.elsevier.com/library-connect-webinars?commid=199783
In 2014, EC3metrics was present for the first time at the European Summer School for Scientometrics, the international summer school that offers specialized bibliometric training each year to 50 students from all over the world. Our colleague Álvaro Cabezas took part in this forum with a talk on the advantages and limitations of Google Scholar, in a session dedicated to the evaluation of the Social Sciences and Humanities alongside Henk Moed, Philip Purnell, and Juan Gorráiz. In his talk, Álvaro reviewed Google Scholar's various bibliometric products, showing their strengths and weaknesses. He encouraged attendees to experiment with these products, while remaining aware of the precautions to take when using them for evaluative purposes.
Methodology Project:
This project will be completed in steps with several due dates throughout the semester in order to facilitate understanding of the process involved in a research project. For this project you will be responsible for writing an annotated bibliography, creating hypotheses, operationalizing variables, creating survey questions, and creating an interview guide for your chosen topic.
All steps of the project must abide by the following guidelines:
· Project must have a cover sheet with: title, name, date of submission.
· Pages must be numbered.
· Written in Times New Roman 12-point font, double spaced, with one-inch margins on all sides (NOTE: the default in Word is 1.25).
· Spell-check and grammar-check the document prior to submission.
· Proof-read the document prior to submission.
· Cite sources using the APA format.
The entire project is worth a maximum of 200 points or 50% of your final grade!
Step One ~ Annotated Bibliography:
When searching for sources, you must find relevant academic journal/periodical articles. This means you cannot use popular magazines, newspaper articles, or other non-academic sources! You also cannot use books for this assignment.
Scholarly journal articles vs. non-scholarly sources:
· Content: original research or a comprehensive review of existing research vs. general information, typically current events or a broad overview of the topic.
· Format: a structured article with abstract, literature review, methodology, conclusion, and bibliography vs. no structured format.
· Audience: professionals/students in a particular field of study vs. the general public.
· Authors: scholars or experts in the field (articles are signed and credentials are provided) vs. hired journalists or professional writers.
· Evidence: a thorough bibliography or "cited references" provided vs. no bibliography (research/reports may be mentioned in the article).
· Purpose: to inform of scholarly/scientific research vs. to entertain or inform the general public.
· Examples: Criminology; Criminology & Public Policy; Social Problems; Criminal Justice Review vs. Time; Newsweek; Sports Illustrated; Rolling Stone; National Geographic.
It will be useful for you to search for articles using a computerized search program such as EbscoHost or Sociofile, both of which can be accessed through the MSU library's database section using the instructions provided below. When in doubt, the library reference section personnel can usually be of assistance. Be careful about relying on your favorite search engine (such as Google) to find academic sources, unless you are using a search engine oriented toward scholarly work (such as http://scholar.google.com/).
How to Access the MSU Databases to Find Scholarly Articles
(1) Go to the MSU homepage (www.montclair.edu) and under “Menu” click on “Library.”
(2) Click on “databases” on the right.
(3) On the right click on “Academic Search Complete.”
(4) You will be prompted to enter your username and password.
(5) You will now see the EbscoHost search interface.
Recent advances in the project EXCITE – Extraction of Citations from PDF Docu... (GESIS)
Workshop on Open Citations
SEPTEMBER 3-5, 2018 | BOLOGNA, ITALY
Presentation of the EXCITE project
Demo system: https://excite.informatik.uni-stuttgart.de/excite
Contextualised Browsing in a Digital Library’s Living Lab (GESIS)
Contextualisation has proven to be effective in tailoring search results towards the user's information need. While this is true for a basic query search, the use of contextual session information during exploratory search, especially at the level of browsing, has so far been underexposed in research. In this paper, we present two approaches that contextualise browsing at the level of structured metadata in a Digital Library (DL): (1) one variant is based on document similarity, and (2) one variant utilises implicit session information, such as queries and the different document metadata encountered during a user's session. We evaluate our approaches in a living lab environment using a DL in the social sciences and compare our contextualisation approaches against a non-contextualised approach. For a period of more than three months, we analysed 47,444 unique retrieval sessions that contain search activities at the level of browsing. Our results show that a contextualisation of browsing significantly outperforms our baseline in terms of the position of the first clicked item in the result set.
The mean rank of the first clicked document (measured as mean first relevant, MFR) was 4.52 using a non-contextualised ranking, compared to 3.04 when re-ranking the result lists based on similarity to the previously viewed document. Furthermore, we observed that both contextual approaches show a noticeably higher clickthrough rate. A contextualisation based on document similarity leads to almost twice as many document views compared to the non-contextualised ranking.
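The MFR metric quoted above is straightforward to compute: average, over sessions, the rank of the first clicked result. A minimal sketch with invented session data (the study's 47,444 real sessions are obviously not reproduced here):

```python
def mean_first_relevant(sessions):
    """MFR: mean 1-based rank of the first clicked item per session.

    `sessions` is a list of per-session click positions; sessions
    without any click are skipped."""
    first_ranks = [min(clicks) for clicks in sessions if clicks]
    return sum(first_ranks) / len(first_ranks)

# Hypothetical sessions: each entry lists the 1-based result positions clicked.
baseline   = [[5], [4, 7], [6], [3]]
contextual = [[2], [3, 5], [4], [3]]
print(mean_first_relevant(baseline))    # lower MFR = better ranking
print(mean_first_relevant(contextual))
```

With these made-up sessions the baseline MFR is 4.5 and the contextualised MFR is 3.0, mirroring the direction (not the data) of the 4.52 vs. 3.04 result.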
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ... (GESIS)
The objective of the OSCOSS research project on "Opening Scholarly Communication in the Social Sciences" is to build a coherent collaboration environment that facilitates scholarly communication workflows of social scientists in the roles of authors, reviewers, editors and readers. This paper presents the implementation of the core of this environment: the integration of the Fidus Writer academic word processor with the Open Journal Systems (OJS) submission and review management system.
Using co-authorship networks for author name disambiguation (GESIS)
With the increasing size of digital libraries (DLs) it has become a challenge to identify author names correctly and assign publications to them. The situation becomes more critical when different persons share the same name (homonym problem) or when the names of authors are presented in several different ways (synonym problem). This paper focuses on homonym names in the computer science bibliography DBLP. The goal of this study is to implement and evaluate a method which uses co-authorship networks in order to disambiguate homonym names, especially common names. The results show that the implemented method has a good performance and can be used for author name disambiguation of sparse bibliographic records.
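One minimal way to realize the co-authorship heuristic described above is to merge the publication records of an ambiguous name whenever they share a co-author, i.e. to compute connected components over the co-authorship graph. This is a simplified sketch, not the paper's exact method, and the record data is invented:

```python
from collections import defaultdict

def disambiguate(records):
    """Cluster records of one ambiguous author name into probable persons.

    Heuristic: records sharing at least one co-author belong to the same
    person (connected components, via union-find over record indices)."""
    by_coauthor = defaultdict(list)
    for i, coauthors in enumerate(records):
        for c in coauthors:
            by_coauthor[c].append(i)

    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    for idxs in by_coauthor.values():       # merge records sharing a co-author
        for i in idxs[1:]:
            union(idxs[0], i)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical records for one ambiguous name: each is a set of co-authors.
records = [
    {"A. Smith", "B. Jones"},   # 0
    {"B. Jones"},               # 1  shares B. Jones with 0 -> same person
    {"C. Miller"},              # 2  no overlap -> different person
]
print(disambiguate(records))    # [[0, 1], [2]]
```

A real system would add weights, transitive evidence and name-string similarity; the component structure above is only the skeleton of such a method.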
Industrie 4.0 is a future-oriented project within the German federal government's high-tech strategy, primarily intended to advance the computerization of manufacturing technology.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is growing interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach to LLM context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help democratize access to company data assets and boost the performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Towards a Semantic Citation Index for the German Social Sciences
1. Towards a Semantic Citation Index for the German Social Sciences
William Dinkel, Philipp Mayr, Frank Sawitzky, Andreas Strotmann*
GESIS – Leibniz-Institut für Sozialwissenschaften, Köln
*alphabetic ordering of names
2. The Problem
● German sociology / political science research output / impact coverage in SSCI
– SOLIS: ~ 1/3 each of books, journal articles, chapters
● Cover ~ 50% of German researchers' “relevant” output*
– ~1/3 of core journals covered in SSCI**
– So, ~10% of literature indexed there
– Very low percentage of cited literature indexed in SSCI***
● * Research rating exercise Sociology, Wissenschaftsrat
● ** compared to SOLIS “class A” journals
● *** Chi (IfQ) study of core German political science journals
3. The Problem (ctd.)
● Citation culture in the social sciences
– Citations are important
● Perhaps even more so than in the natural sciences
– Some authors are extremely highly cited (Weber, Marx...)
● Suspect very high(!!) Gini coefficient in distribution
● But: it is their books (not articles) that are highly cited!
– Significant fraction of citations are contrastive
– Datasets (survey results) highly mentioned, not cited
– Multilingual citation environment
4. The Need
● German social scientists & SSCI
– They consider their field inadequately represented in “the” citation index
– But use it quite heavily anyway
● e.g. for research, evaluation
● Survey of sociologists and political scientists, GESIS
5. The Need (ctd.)
● We need a citation index for the (German) social sciences
– Existing citation indexes frankly inadequate
● No reasonable effort in sight to resolve this
– Hence, we need to build our own
● If we want to do serious bibliometrics on SocSci
● If we want to provide a decent social science citation index in, e.g., sociology or political science
6. The Need (ctd.)
● We need an open semantic citation index for the (German) social sciences
– Incorporate referential semantics into search engine
● e.g., reliable hyperlinks to referenced articles
● e.g., equivalence or hierarchy relations for translations, aggregations
– Publish referential semantics as linked open data
● Allow other institutions to discover references to their holdings in our database(s)
● Invite them to offer the same service to us, too
– Bibliometrics requires cleaned/disambiguated data!
7. The Long-Term Goal
A globally distributed open semantic citation index
● Based on digital full-text collections (cooperate with publishers)
– Semi-automatic / Computer-aided
– Algorithms + professional indexers (authority files) + crowd sourcing +...
● Reference extraction (with contexts)
– Enables sentiment analysis (important in social sciences)
● Reference matching
– Enables referential semantics
● Open reference semantics information exchange
– „<this> paper indexed in our collection cites <that> paper indexed in yours“
8. Sowiport – German Social Sciences Research Information
● GESIS' Sowiport portal: Single access point to 18 databases, including
– 6 Cambridge Scientific Abstracts databases on social sciences
– GESIS' own SOLIS (literature) and SOFIS (projects) RISs
– SSOAR (Social Science Open Access Repository) @ GESIS
● Goal: Extend to social science citation index
– CSA comes with cited refs for some docs
– SSOAR – extract refs from OA full text and index in Sowiport
– Extract links to data sets / surveys used but not cited from full texts
– Crawl Google Scholar for citations to “our” docs
– Link to/from RepEc (and other) data ...
9. First Steps: National CSA Social Sciences Citation Index
● Cambridge Scientific Abstracts – Social Sciences
– 6 CSA databases offered & run by GESIS
● National research licence for Germany
– Include >8 mio references
● A good starting point
● Recently activated in Sowiport
● ~25-30% refs found to link to other records
– Using simple matching algorithm
– Biased towards accuracy (>90%), not recall
10.
11. First Steps: CSA Reference Matching
Reference matching is much(!) harder in social sciences
● Social science publication culture
– Books & chapters, and articles
● Published in roughly equal numbers, books cited most
– Multilingual publishing
● English is not the only language
● Publications may be cited in translation, different editions
– Broad referencing behaviour
● Large proportion of references to non-source items
=> A first-try high-precision match rate of ~25-30% is an excellent result
● Close to expected rate of references to journal articles
12. CSA References in GESIS' Sowiport Database
● Each full record contains „references“ and „cited-by“ information
– Some with actionable links to full records
● Combines WoS/Scopus and Google Scholar approaches to citation index construction
13. First Steps: Citation Extraction
● SSOAR full texts
– First successful experiments to extract references from full text
● Based on RepEc's ParsCit
● Extended to German citation styles
– First successful experiments to identify acknowledgments of large surveys in text
14. Next Steps: “Haus der Sozialwissenschaften”
● Goal: Digital Special Collection for German Social Scientists
– Digital access to full literature in one place
● Large parts unfortunately only accessible in-house
● Collect existing digital versions from “all” sources
● Digitize “important” literature where necessary
● Full text of literature, survey data, project descriptions...
● Joint DFG application with Sondersammelgebiet Sozialwissenschaften, Univ.- & Stadt-Bibl. Köln
15. Next Step: “GESIS Application Laboratory Web 3.0”
● Full text collection and processing results available in toto to visiting researchers
– Social scientists
– Computer scientists
– Computer linguists
– Bibliometricians: You are invited!!!
● Upgrade database
– e.g. disambiguation of authors, institutions, titles
– e.g. incorporation of external authority files / semantic web
16. Experiment: E-Traces
● Goal: Tracking ideas through the sociology literature (“text re-use”)
– Experiment (ongoing): attempt to categorize citation contexts as positive/neutral/negative (sentiment analysis)
– BMBF funded project with U Leipzig, U Göttingen
● Long term use: identify negative citations and contrastive co-citations for social science citation index
17. Summary
● For GESIS' core covered social sciences (German sociology, political science), traditional citation indexes are inadequate
● and Google Scholar only provides “cited by” info
● Yet, GESIS' core audience uses them
● and complains about their inadequacies
● Bibliometrics requires an adequate citation index for reliable results (given typical distributions)
● but no improvements in sight for classic indexes
● Therefore, we need to build our own
● and we have the expertise at GESIS to succeed where others have failed
● and we have taken the first few steps in this direction
18. Summary (ctd.)
● In the long run, we would like
– A citation index that is
● Semantic (with explicit referential semantics)
● Distributed (each institution builds their own)
● Open (each institution shares semantics as LOD)
● Global (implemented world wide)
● Cooperative (indexers+researchers contribute)
● Computer-aided (software to get started, people to improve)
– Based on best practices we hope to develop
20. Two Models of Citation Graphs
Bipartite (Classic IR) Model: Citing and Cited Partitions
• Citing nodes: full bibliographic records
• Cited nodes: „keys“, e.g.
– First author name & initials + year of publication + journal key + volume + number + page
Uniform Model: Interconnected Documents
• All nodes: bibliographic records
– Citing nodes have full records
– Cited nodes have mostly simplified records
– „Matched“ cited nodes have full records
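The „key“ of the bipartite model can be sketched as a normalized concatenation of the cited-reference fields listed on this slide. The dictionary field names below are assumptions for illustration, not a real database schema:

```python
def match_key(ref):
    """Build a bipartite-model 'key' for a cited reference:
    first author + year + journal key + volume + page, lowercased
    and joined with '|' so equal references collide on the same key."""
    fields = ("first_author", "year", "journal", "volume", "page")
    return "|".join(str(ref.get(f, "")).lower() for f in fields)

# Hypothetical parsed reference (assumed field names, for illustration).
ref = {"first_author": "Weber M", "year": 1922, "journal": "ASS",
       "volume": 47, "page": 1}
print(match_key(ref))   # weber m|1922|ass|47|1
```

In the uniform model the same key would instead be used to decide whether a simplified cited node can be merged with a full bibliographic record.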
21. Citation Matching
• Goal: Citation network
–Unique nodes for documents
• Sub tasks:
–Match cited references to each other
–Match cited references to full records
–Match full records across databases
22. Matching Citations to Full Records
„Internal“ matching
● Direct access to full database(s)
● Options: match-key based or algorithmic matching
„External“ matching
● Access only via search engine
● Options: matching against same or different database
23. Scopus Citations
• Cited reference info contains
–Up to 8 author names (family+inits)
• Including last author
• Frequently as cited (not standardized or corrected)
–Publication year, title, journal name/vol./nr./p.
• Frequently as cited
–Reasonably well parsable, not normalized
24. Matching Scopus Citations to Scopus Full Records
External matching: Scopus search engine
● „Algorithm“: parse Scopus reference into subfields, construct complex search queries for Scopus engine, download resulting full records, choose best fit
● High precision searches: complex searches allowed, many searchable fields
– Improve recall by successively vaguer queries
● Small number of downloads allowed, so many queries needed to construct a sizable citation index
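The "successively vaguer queries" step can be sketched as a ladder of queries that drops one field per round, starting from the most restrictive combination. Field names and query syntax here are purely illustrative, not the actual Scopus search API:

```python
def query_ladder(ref):
    """Build successively vaguer queries for one parsed cited reference.

    Starts with all available fields ANDed together (high precision),
    then drops the least vital field each round to improve recall."""
    fields = [("TITLE", ref.get("title")),
              ("AUTH", ref.get("first_author")),
              ("PUBYEAR", ref.get("year")),
              ("VOLUME", ref.get("volume"))]
    fields = [(k, v) for k, v in fields if v]      # keep only parsed fields
    ladder = []
    for n in range(len(fields), 0, -1):
        ladder.append(" AND ".join(f"{k}({v})" for k, v in fields[:n]))
    return ladder

# Hypothetical reference; a matcher would try each query until one hits.
for q in query_ladder({"title": "No More Abuse", "first_author": "Voice UK",
                       "year": 2000}):
    print(q)
```

A matcher would stop at the first query returning a plausible candidate, then pick the best fit among the downloaded records.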
25. Matching Scopus Citations to PubMed Full Records
Cross-DB External Match: Scopus/Medline
● „Algorithm“: parse Scopus reference, construct PubMed batch citation matcher queries, download matched PubMed(!) records
– Only for biomedical fields
– Result is a citation network of PubMed records, not Scopus
– Requires matching of Scopus citing records as well
● Either direction (Scopus <-> PubMed)
● Both include PubMed IDs
26. Matching Web of Science References to WoS Full Records
WoS cited reference info contains
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● More and more frequently a DOI
No title included!
27. Matching WoS Cited References to WoS Records
External matching via WoS web search
● Only small queries supported
– Many downloads necessary
● Crucial search fields not supported (vol., num.)
– Therefore highly ambiguous results to be expected
● Requires translation of source title from code to full title
● Requires algorithmic filtering of the correct hit from a long result list
28. Matching WoS References to WoS
● Internal matching
● Kompetenzzentrum Bibliometrie has full local copy of WoS data
● Experiment: good „match key“ to support this?
– Dinkel (2011), ISSI
– Results in error estimates for references
29. Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
– To be extended with additional sources of cited refs info
● Nationwide licensing scheme for Germany administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' „Sowiport“ social sciences portal
– Now including ~8.5 mio cited references
● No matchings to full records provided by ProQuest
● Early experimental results available on portal
– Focus on precision, not recall
30. Citation Matching in CSA
„Algorithm“:
● Internal matching
– However, across multiple CSA databases
● Parse references; construct search queries (Solr)
– exact title and year
– or fuzzy title and year and ISSN;
– choose first match
● Favors precision over recall
– Fuzzy match only for journal literature, for example
● Research to be continued!
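The two-stage strategy on this slide (exact title and year first, then a fuzzy title match restricted by year and ISSN, taking the first hit) can be sketched in memory as a stand-in for the actual Solr queries; the record schema below is assumed:

```python
import difflib

def match_reference(ref, records):
    """Two-stage reference match favoring precision over recall.

    Stage 1: exact title + year.  Stage 2: fuzzy title, accepted only
    when year and ISSN also agree.  Returns the first hit or None."""
    # Stage 1: exact title and year
    for rec in records:
        if rec["title"].lower() == ref["title"].lower() and rec["year"] == ref["year"]:
            return rec
    # Stage 2: fuzzy title, constrained by year and ISSN (journal literature only)
    for rec in records:
        if rec["year"] == ref["year"] and rec.get("issn") == ref.get("issn"):
            ratio = difflib.SequenceMatcher(
                None, rec["title"].lower(), ref["title"].lower()).ratio()
            if ratio > 0.9:
                return rec
    return None   # favor precision: no match rather than a doubtful one

records = [{"title": "Social Capital.", "year": 2000, "issn": "1234-5678"}]
ref = {"title": "Social Capital", "year": 2000, "issn": "1234-5678"}
print(match_reference(ref, records)["title"])   # Social Capital.
```

Restricting the fuzzy stage to records that share year and ISSN is one way to keep the false-positive rate near the ~1% reported for CSA on slide 32.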
31. Experiments - Datasets
Caveat
● Scopus/PubMed and WoS experiments run on stem cell research field (biomedical area)
– < 100k citing docs, ~1mio references
– >95% refs are to journal articles
● CSA experiment run on social sciences databases
– ~1mio full records, ~10mio references
● Only recent records contain refs
● Many(!!) refs to non-journal articles
32. Some Rough Numbers
● Scopus ↔ PubMed full record matching
– >95% match rate
● Scopus references → Scopus/PubMed full record
– ~90% match rate „exact“ + ~5% fuzzy match
– ~1% false positives needed to be filtered out
● WoS references → WoS full record
– ~90% match rate
– >>50% false positives needed to be filtered out
● CSA references → CSA full record
– ~30% match rate
– ~1% false positives
33. CSA reference information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num/vol/p., ISSN
– Format changes, though
● Mostly automatically parsed, as fields frequently mis-assigned
● Example (book):
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
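Given the tagged format in the example above, a CSA reference string can be split into its subfields with a small regular expression. A sketch based only on the tags visible on this slide:

```python
import re

def parse_csa_reference(raw):
    """Extract tagged subfields (<CI>, <CA>, <CT>, <CY>, <CZ>, ...) from a
    CSA cited-reference string into a dict keyed by tag name."""
    # \1 backreference ensures each opening tag is closed by the same tag
    return dict(re.findall(r"<(C[A-Z])>(.*?)</\1>", raw))

# The book example from this slide
raw = ("<CI>200601317</CI><CA>Voice UK</CA>"
       "<CT>No More Abuse.</CT><CY>2000</CY>"
       "<CZ>Derby: Voice UK</CZ>")
fields = parse_csa_reference(raw)
print(fields["CT"], fields["CY"])   # No More Abuse. 2000
```

Because fields are frequently mis-assigned in the source data, a production parser would validate each extracted value (e.g. that <CY> is a year) rather than trust the tags alone.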
34. Discussion
● Plenty of research opportunities to improve matching of non-journal literature references to source records
– e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
– e.g. by crawling Google Scholar for reference links
– You are invited to try your hands at this, too!
● See below: GESIS Application Laboratory